# Extracting the Data from FOMC Transcripts

First, I import the necessary packages:

- `pymupdf` allows me to get text from PDF files, page by page. I started off trying to use `PyPDF2` but found this to be less reliable. Specifically, it would add spaces into extracted text, sometimes in the middle of words. This has the potential to throw off much of the tokenization process.
- `os` provides tools to interact with the filesystem, including listing and reading files.
- `re` is Python's implementation of regular expressions, which are useful for extracting information from text.
- `pandas` is used at the end of this file to export the speaker-content data to a CSV file.

In [2]:
import fitz as pymupdf
import os
import re
import pandas as pd

Here I set the folder in which to look for the PDF files.

In [3]:
folder = 'pdfs'

The PDFs created by the Federal Reserve contain non-ASCII Unicode characters. Since these complicate some of the RegEx operations I want to do later on, I replace them at this stage. Some of these I replace with a rough equivalent, while others I replace with a "landmark" so I can still find them and use them in my regular expressions before removing them from the string later on. I define RegEx patterns to find these characters and a replacement for each of them. I use this mapping in the `extract_page` function.

In [24]:
unicode_map = {
    '\\\\N\{EN DASH\}': '-', 
    '\\\\N\{SUPERSCRIPT [A-Z]*\}': '<SUPERSCRIPT>', 
    '\\\\N\{EM DASH\}': '--', 
    '\\\\N\{(?:RIGHT|LEFT) SINGLE QUOTATION MARK\}': '\'',
    '\\\\N\{(?:LEFT|RIGHT) DOUBLE QUOTATION MARK\}': '"',
    '_': ' ',
    '\\\\N\{VULGAR FRACTION THREE QUARTERS\}': '3/4',
    '\\\\N\{VULGAR FRACTION ONE QUARTER\}': '1/4',
    '\\\\N\{VULGAR FRACTION ONE HALF\}': '1/2',
    '\\\\N\{VULGAR FRACTION ONE THIRD\}': '1/3',
    '\\\\N\{VULGAR FRACTION TWO THIRDS\}': '2/3',
    '\\\\N\{VULGAR FRACTION FIVE EIGHTHS\}': '5/8',
    '\\\\N\{VULGAR FRACTION THREE EIGHTHS\}': '3/8',
    '\\\\N\{VULGAR FRACTION SEVEN EIGHTHS\}': '7/8',
    '\\\\N\{VULGAR FRACTION ONE EIGHTH\}': '1/8',
    '\\\\N\{SOFT HYPHEN\}': '',
    '\\\\N\{EURO SIGN\}': ' euros',
    '\\\\N\{YEN SIGN\}': 'yen',
    '\\\\N\{(?:DOUBLE )?PRIME\}': '',
    '\\\\N\{POUND SIGN\}': 'pounds',
    '\\\\N\{MATHEMATICAL ITALIC SMALL PI\}': 'pi',
    # [^}]* stops at the first closing brace; a greedy .* would swallow
    # everything up to the last closing brace on the same line
    '\\\\N\{LATIN SMALL LETTER A WITH [^}]*\}': 'a',
    '\\\\N\{LATIN SMALL LETTER C WITH [^}]*\}': 'c',
    '\\\\N\{LATIN SMALL LETTER E WITH [^}]*\}': 'e',
    '\\\\N\{LATIN SMALL LETTER I WITH [^}]*\}': 'i',
    '\\\\N\{LATIN (?:SMALL|CAPITAL) LETTER O WITH [^}]*\}': 'o',
    '\\\\N\{LATIN SMALL LETTER U WITH [^}]*\}': 'u',
    '\\\\N\{LATIN (?:SMALL|CAPITAL) LETTER S WITH [^}]*\}': 's',
    '\\\\N\{LATIN SMALL LETTER N WITH [^}]*\}': 'n',
    # Note: this maps every Greek small letter to 'pi', not only pi itself
    '\\\\N\{GREEK SMALL LETTER [^}]*\}': 'pi',
    '\\\\N\{HORIZONTAL ELLIPSIS\}': '...',
    '\\\\N\{DAGGER\}': '',
    '\\\\N\{MINUS SIGN\}': '-'
}
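
To make the two-step replacement concrete, here is a minimal sketch of how a map like this can be applied. The two-entry `sample_map`, the helper `replace_named`, and the sample sentence are hypothetical and exist only for illustration. The raw-string patterns are equivalent to the doubled backslashes used in the full map.

```python
import re

# Hypothetical two-entry map in the same style as the full map
sample_map = {
    r'\\N\{EN DASH\}': '-',
    r'\\N\{SUPERSCRIPT [A-Z]*\}': '<SUPERSCRIPT>',
}

def replace_named(text, mapping):
    # Step 1: non-ASCII characters become \N{NAME} escapes in the ASCII output
    ascii_text = text.encode('ascii', 'namereplace').decode('ascii')
    # Step 2: each escape is swapped for its replacement string
    for pattern, replacement in mapping.items():
        ascii_text = re.sub(pattern, replacement, ascii_text)
    return ascii_text

# Made-up sentence containing an en dash and a superscript two
sample = 'rates of 2\u20133 percent\u00b2'
cleaned = replace_named(sample, sample_map)
print(cleaned)
```

The superscript is kept as a `<SUPERSCRIPT>` landmark rather than dropped, which is what lets later steps locate footnote markers.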

This function processes raw text, splitting a long string on indications of which person is talking. For example, it takes a string like this
```
'CHAIRMAN GREENSPAN. Good morning everyone. SEVERAL. Good morning.'
```
and turns it into the following array
```
['CHAIRMAN GREENSPAN.', 'Good morning everyone.', 'SEVERAL.', 'Good morning.']
```
In addition, this function removes most page numbers, date headers, and footnotes. Some of these slip through the cracks; I identify and remove most of those in later steps of the extraction process. In the end, this function returns an array of strings, which I process further in later functions, especially `extract_data_from_pdf`.
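
The split itself relies on a detail of `re.split`: when the pattern is wrapped in a capturing group, the matched delimiters are kept in the result. A minimal sketch on the example string, using a shortened version of the speaker pattern (the name `speaker_patt` is only for illustration):

```python
import re

# Shortened speaker pattern; the capturing group keeps the speaker labels
speaker_patt = r'((?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) [A-Z]{2,}\.|SEVERAL\.)'

text = 'CHAIRMAN GREENSPAN. Good morning everyone. SEVERAL. Good morning.'

# Split, then strip whitespace and drop the empty strings left by the split
parts = [part.strip() for part in re.split(speaker_patt, text) if part.strip()]
print(parts)
```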

In [18]:
def extract_page(raw_text):
    # Re-encode the raw text as an ASCII string, replacing any unicode characters with a string like \\N{SYMBOL NAME GOES HERE}
    raw_text = raw_text.encode('ascii', 'namereplace').decode('ascii')
    # according to the map defined above, replace these unicode characters
    for matchstr, sub in unicode_map.items():
        raw_text = re.sub(matchstr, sub, raw_text)

    # Split the text on the "speaker" indicators as described in the comment above, so I can later label text with the speaker who said it
    split = re.split(r'((?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) [A-Z]{2,}\.|SEVERAL(?:\(\?\))?\.|SPEAKER ?\(\?\)\.|\n|PARTICIPANTS?\.|END OF MEETING)', raw_text)

    final = []
    # Separate page numbers into their own strings so they can be more easily removed
    for item in split:
        final.extend(re.split('([0-9]{1,3} of [0-9]{1,3})', item))

    # remove empty strings following the split
    final = [item.strip() for item in final if item.strip() != '']

    trimmed = []
    for i, item in enumerate(final):
        if i < 3:
            # Remove the page number and date usually found at the beginning of a page
            if re.match('[0-9]{1,3} of [0-9]{1,3}', item):
                continue
            elif re.match('[A-Z][a-z]{,8} [1-3]?[0-9](?:-(?:[A-Z][a-z]{,8} )?[1-3]?[0-9])?,? [0-9]{4}', item):
                continue
        # Remove footnote prefixes (digits followed by two spaces)
        footnote = re.match('[0-9]*  (.*)', item)
        if footnote is not None:
            trimmed.append(footnote.group(1))
        else:
            trimmed.append(item)

    return trimmed
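
As a quick check of the two header patterns used in `extract_page`, the sketch below tests them against a few made-up strings. The helper `is_header` and the sample strings are hypothetical.

```python
import re

# The page-number and date patterns from extract_page
page_patt = r'[0-9]{1,3} of [0-9]{1,3}'
date_patt = r'[A-Z][a-z]{,8} [1-3]?[0-9](?:-(?:[A-Z][a-z]{,8} )?[1-3]?[0-9])?,? [0-9]{4}'

def is_header(item):
    # re.match anchors at the start of the string only
    return bool(re.match(page_patt, item) or re.match(date_patt, item))

checks = {
    '12 of 250': is_header('12 of 250'),
    'December 11-12, 2012': is_header('December 11-12, 2012'),
    'January 31-February 1, 2012': is_header('January 31-February 1, 2012'),
    'Good morning.': is_header('Good morning.'),
}
print(checks)
```

Because `re.match` only anchors at the start, spoken content that happens to begin with text like `2 of 3` would also be treated as a page number, which is one reason the check is limited to the first three items on a page.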

This helper drops any item that starts with a numeric date such as `12/11`. I still need to figure out whether it is needed or whether it should be removed.

In [7]:
def drop_dates(array):
    # array = [item.strip() for item in array if item.strip() != '']
    array = [item for item in array if re.match('[0-1]?[0-9]/[0-3]?[0-9]', item) is None]
    return array
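
One thing worth checking when deciding whether to keep `drop_dates`: the pattern is not specific to dates. Since the Unicode map turns fraction characters into text like `3/4`, an item that begins with a fraction also matches. A small sketch with made-up items:

```python
import re

def drop_dates(array):
    # Same pattern as above: one or two digits, a slash, one or two digits
    return [item for item in array if re.match('[0-1]?[0-9]/[0-3]?[0-9]', item) is None]

# Made-up items: a date header, a speaker label, and content starting with a fraction
sample = ['12/11', 'MR. FISHER.', '3/4 of a percentage point.', 'Thank you.']
kept = drop_dates(sample)
print(kept)
```

Here the item beginning with `3/4` is dropped along with the date, so some spoken content may be lost if this helper stays in the pipeline as written.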

FOMC transcripts are prefixed with some information that is not important for my use case, such as who was in attendance and each person's affiliations. I want to skip over this information. So I identify some "landmarks" for different document formats and only extract the information after these "landmarks."

In [8]:
def extract_data_from_pdf(fname):
    with pymupdf.open(os.path.join(folder, fname)) as pages:
        include = False
        main_arr = []
        for page in pages:
            extracted = extract_page(page.get_text())
            if len(extracted) == 0:
                continue
            starts_transcript = re.match('Transcript of (?:the )?Federal', extracted[0].strip()) is not None
            mentions_fomc = len(extracted) >= 2 and 'the Federal Open Market Committee' in extracted[1].strip()
            if starts_transcript or mentions_fomc:
                include = True
            # Dealing with inconsistent formatting on 12/12/2012 (No "Transcript of the FOMC" at the start)
            if extracted[0] == 'December 11 Session':
                include = True
            if not include:
                print('skipping')
                continue
            main_arr.extend(drop_dates(extracted))
        return main_arr
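
The landmark logic can be illustrated without any PDFs. In this sketch each "page" is already a list of strings, as `extract_page` would return; the function name `keep_after_landmark`, the speaker names, and the page contents are all made up.

```python
import re

def keep_after_landmark(pages):
    # Skip every page until one starts with the landmark, then keep everything
    include = False
    kept = []
    for items in pages:
        if not items:
            continue
        if re.match(r'Transcript of (?:the )?Federal', items[0]):
            include = True
        if include:
            kept.extend(items)
    return kept

pages = [
    ['PRESENT:', 'Mr. Example, Chairman'],  # attendance page, skipped
    [],                                     # empty page, skipped
    ['Transcript of the Federal Open Market Committee Meeting',
     'CHAIRMAN EXAMPLE.', 'Good morning.'],
    ['MR. SAMPLE.', 'Thank you.'],
]
kept = keep_after_landmark(pages)
print(kept)
```

Once the flag is set it never resets, so every page after the landmark is kept, including the landmark line itself.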

In [9]:
def get_utternaces(sep_array):
    utterances = []
    speaker_patt = r'((?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) [A-Z]{2,}\.|SEVERAL(?:\(\?\))?\.|SPEAKER ?\(\?\)\.|PARTICIPANTS?\.)'
    # Tag each item as either a speaker label or spoken content.
    # Note: the final item of sep_array is skipped, as in the original loop bound.
    tagged = []
    for item in sep_array[:-1]:
        is_speaker = re.match(speaker_patt, item) is not None
        tagged.append({'is_speaker': is_speaker, 'content': item})

    ind = 0
    while ind < len(tagged) - 1:
        j = ind + 1
        if tagged[ind]['is_speaker']:
            while j < len(tagged) and not tagged[j]['is_speaker']:
                j += 1
            combined_content = [item['content'] for item in tagged[ind + 1:j]]
            speaker = tagged[ind]['content']
            processed_content = [item for item in combined_content if item != '']
            utterance = {'speaker': speaker, 'content': processed_content}
            utterances.append(utterance)
        ind = j
    return utterances
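
The function above pairs each speaker label with the content items that follow it, up to the next speaker label. A simplified sketch of the same grouping idea is shown below. It is not a drop-in replacement: unlike the function above it keeps the final item, and `group_by_speaker` and the shortened `SPEAKER` pattern are hypothetical names.

```python
import re

# Shortened speaker pattern for illustration
SPEAKER = re.compile(r'(?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) [A-Z]{2,}\.|SEVERAL\.')

def group_by_speaker(items):
    utterances = []
    current = None
    for item in items:
        if SPEAKER.match(item):
            # A speaker label starts a new utterance
            current = {'speaker': item, 'content': []}
            utterances.append(current)
        elif current is not None:
            # Content before the first speaker label is ignored
            current['content'].append(item)
    return utterances

items = ['stray header', 'CHAIRMAN GREENSPAN.', 'Good morning everyone.',
         'Let us begin.', 'SEVERAL.', 'Good morning.']
grouped = group_by_speaker(items)
print(grouped)
```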

In [10]:
def add_index(utterance, i):
    utterance['index'] = i
    return utterance

In [11]:
def condense_content(content_arr):
    final_arr = []
    include = True
    for item in content_arr:
        if '(appendix ' in item:
            continue
        if 'END OF MEETING' in item:
            include = False
        if include:
            final_arr.append(item)
    concatenated = ' '.join(final_arr)

    date_page_patt = '[A-Z][a-z]{2,8} [1-3]?[0-9],? [0-9]{4} [0-9]{1,3} of [0-9]{1,3}'
    return re.sub(date_page_patt, ' ', concatenated)
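
The final substitution in `condense_content` removes date-and-page headers that ended up in the middle of an utterance when it ran across a page break. A short sketch with a made-up sentence:

```python
import re

# Same pattern as in condense_content
date_page_patt = r'[A-Z][a-z]{2,8} [1-3]?[0-9],? [0-9]{4} [0-9]{1,3} of [0-9]{1,3}'

joined = 'the outlook is improving. June 25, 2003 14 of 211 That said, risks remain.'
cleaned = re.sub(date_page_patt, ' ', joined)
print(cleaned)

# A header with a day range is not matched by this pattern
range_match = re.search(date_page_patt, 'December 11-12, 2012 14 of 211')
print(range_match)
```

Unlike the date pattern in `extract_page`, this one has no branch for day ranges, so headers from two-day meetings would pass through this step if they were not already removed earlier.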


In [12]:
def process_pdf(fname):
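    # The slices assume a 4-character prefix followed by YYYYMMDD and the
    # transcript type, e.g. a name shaped like FOMC20121212meeting.pdf
    year = fname[4:8]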
    month = fname[8:10]
    mday = fname[10:12]
    transcript_type = fname[12:].split('.')[0]
    datestr = f'{year}-{month}-{mday}'
    raw_extract = extract_data_from_pdf(fname)
    utterances = get_utternaces(raw_extract)
    utterances = [add_index(utterance, i) for i, utterance in enumerate(utterances)]
    for utterance in utterances:
        utterance['condensed'] = condense_content(utterance['content'])
        utterance['date'] = datestr
        utterance['type'] = transcript_type
    return utterances
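
The slicing at the top of `process_pdf` only works if every file name has the same shape. A minimal sketch of that step on its own, assuming a name like `FOMC20121212meeting.pdf` (the helper `parse_fname` and this file name are illustrative assumptions, not taken from the data):

```python
def parse_fname(fname):
    # Characters 4-11 hold YYYYMMDD; the rest, up to the first dot, is the type
    year, month, mday = fname[4:8], fname[8:10], fname[10:12]
    transcript_type = fname[12:].split('.')[0]
    return {'date': f'{year}-{month}-{mday}', 'type': transcript_type}

parsed = parse_fname('FOMC20121212meeting.pdf')
print(parsed)
```

A file with a different prefix length would produce a wrong date without raising an error, so it may be worth validating names before processing.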

In [None]:
records = []

# Process every file in the configured folder
for file in os.listdir(folder):
    print(file)
    records.extend(process_pdf(file))

df = pd.DataFrame.from_records(records)

df.to_csv('transcripts.csv')


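Identify the Unicode symbols that remain after the replacements above, and how often they appear. Because `re.search` returns only the first match, each row contributes at most one symbol to the counts.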

In [27]:
symbols = {}

for i, row in df.iterrows():
    # Only the first leftover symbol in each row is counted
    result = re.search('\\\\N{(?:[A-Z]| )*}', row['condensed'])
    if result is not None:
        symbols[result.group()] = symbols.get(result.group(), 0) + 1

symbols

{'\\N{HORIZONTAL BAR}': 3,
 '\\N{MODIFIER LETTER PRIME}': 4,
 '\\N{FIGURE DASH}': 7,
 '\\N{MATHEMATICAL ITALIC SMALL R}': 1,
 '\\N{ACUTE ACCENT}': 2,
 '\\N{MICRO SIGN}': 1,
 '\\N{REGISTERED SIGN}': 1,
 '\\N{BULLET}': 1}
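
If every occurrence should be counted rather than one per row, `re.findall` with a `Counter` is one option. This is a sketch on made-up strings, not a rerun of the cell above, so the counts shown there would not necessarily carry over. The names `symbol_patt`, `count_symbols`, and `sample_texts` are hypothetical.

```python
import re
from collections import Counter

# Matches a literal \N{NAME} escape made of capital letters and spaces
symbol_patt = r'\\N\{[A-Z ]*\}'

def count_symbols(texts):
    counts = Counter()
    for text in texts:
        # findall returns every match in the string, not just the first
        counts.update(re.findall(symbol_patt, text))
    return counts

sample_texts = ['a \\N{BULLET} b \\N{FIGURE DASH} c \\N{BULLET}', 'no symbols here']
counts = count_symbols(sample_texts)
print(counts)
```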

Assess the overall prevalence of the remaining Unicode symbols

In [28]:
total = sum(symbols.values())
total

20