# Extracting the Data from FOMC Transcripts

First, I import the necessary packages:

- `pymupdf` (imported here under its legacy module name, `fitz`) allows me to get text from PDF files, page by page. I started off trying to use `PyPDF2` but found it less reliable: it would add spaces into the extracted text, sometimes in the middle of words, which has the potential to throw off much of the tokenization process.
- `os` provides tools to interact with the filesystem, such as listing the files in a folder and building file paths.
- `re` is Python's implementation of regular expressions, which are useful for extracting information from text.
- `pandas` is used at the end of this file to export the speaker-content data to a CSV file.
- `math` and `collections` are used in `process_pdf` to skip utterances whose content is missing.
- `sent_tokenize` from `nltk` splits each utterance into sentences in `process_pdf_sents`.

In [3]:
import fitz as pymupdf
import os
import re
import pandas as pd
import math
import collections

from nltk import sent_tokenize

Here I configure the folder in which to look for the PDF files.

In [4]:
folder = 'pdfs'

The PDFs created by the Federal Reserve contain unicode characters. Since these complicate some of the RegEx operations I want to do later on, I replace them at this stage. Some I replace with a rough equivalent, while others I replace with a "landmark" so that I can still find them and use them in my regular expressions before removing them from the string later on. I define RegEx patterns to find these characters and a strategic replacement for each of them in the `unicode_map` dictionary, and I apply this mapping in the `unicode_nse_process` function.
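
To make the "landmark" idea concrete, here is a minimal sketch. It assumes a two-entry map (`demo_map`, a hypothetical stand-in for the full mapping) and a made-up input string. Encoding with the `namereplace` error handler turns each non-ASCII character into a `\N{...}` escape, which ordinary RegEx substitutions can then find.

```python
import re

# Hypothetical two-entry map for illustration only
demo_map = {
    r'\\N\{EM DASH\}': '--',
    r'\\N\{VULGAR FRACTION ONE HALF\}': '1/2',
}

# Made-up input containing a one-half fraction and an em dash
raw = 'rates rose \u00bd point \u2014 roughly'

# Non-ASCII characters become \N{UNICODE NAME} "landmarks"
landmarked = raw.encode('ascii', 'namereplace').decode('ascii')

# Replace each landmark with its plain-ASCII equivalent
cleaned = landmarked
for pattern, replacement in demo_map.items():
    cleaned = re.sub(pattern, replacement, cleaned)

print(landmarked)
print(cleaned)
```

The intermediate `landmarked` string is pure ASCII, so later patterns never have to deal with raw unicode characters.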

The `extract_page` function processes raw text, splitting a long string wherever there is an indication of which person is talking. For example, it takes a string like this
```
'CHAIRMAN GREENSPAN. Good morning everyone. SEVERAL. Good morning.'
```
and turns it into the following array
```
['CHAIRMAN GREENSPAN.', 'Good morning everyone.', 'SEVERAL.', 'Good morning.']
```
In addition, this function removes (most) page numbers, date headers, and footnotes. Some of these slip through the cracks; I am able to identify and remove most of them in later steps of the extraction process. In the end, the function returns an array of strings, which I process further in later functions, especially `extract_data_from_pdf`.
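
The splitting step can be sketched on its own as follows. This is a simplified illustration, not the full function: `SPEAKER_PATT` is a shortened, hypothetical version of the speaker pattern, and the header and footnote clean-up is left out. The key detail is the capturing group, which makes `re.split` keep the speaker labels in the output instead of discarding them.

```python
import re

# Simplified speaker pattern; the outer capturing group makes re.split keep the labels
SPEAKER_PATT = r'((?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) [A-Z]{2,}\.|SEVERAL\.)'

raw = 'CHAIRMAN GREENSPAN. Good morning everyone. SEVERAL. Good morning.'

pieces = re.split(SPEAKER_PATT, raw)
# Drop the empty strings and surrounding whitespace left behind by the split
pieces = [piece.strip() for piece in pieces if piece.strip() != '']

print(pieces)
```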

In [5]:
endings = []

In [28]:
def date_page_remove(arr):
    # Headers and footers only appear at the edges of a page, so only the
    # first three and last two items are checked against these patterns
    header_footer_patts = [
        r'[0-9]{1,2}/[0-9]{1,2}(?:-[0-9]{1,2})?/[0-9]{2}$',         # numeric dates, optionally with a day range
        r'[A-Z][a-z]{2,8} [0-9]{1,2}(?:-[0-9]{1,2})?,? [0-9]{4}$',  # written-out dates (month name, day, year)
        r'[0-9]{1,3} of [0-9]{1,3}$',                               # "page X of Y" counters
        r'-?[0-9]{1,3}-?$',                                         # bare page numbers, optionally wrapped in dashes
    ]
    result = []
    for i, item in enumerate(arr):
        near_edge = i < 3 or i > len(arr) - 3
        if near_edge and any(re.match(patt, item) for patt in header_footer_patts):
            continue
        result.append(item)
    return result

In [40]:
def extract_page(raw_text):
    # Re-encode the raw text as an ASCII string, replacing any unicode characters with a string like \N{SYMBOL NAME GOES HERE}
    raw_text = unicode_nse_process(raw_text.encode('ascii', 'namereplace').decode('ascii'))

    # Split the text on the "speaker" indicators (and on newlines) so I can later label text with the speaker who said it.
    # The capturing group keeps the speaker labels in the resulting list.
    split = re.split(
        r'((?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) [A-Z]{2,}\.|MR\. D\. LINDSEY\.|SEVERAL(?:\(\?\))?\.|SPEAKER ?\(\?\)\.|\n|PARTICIPANTS?\.|END OF MEETING)', raw_text)

    # Remove empty strings following the split, then drop date headers and page numbers
    final = [item.strip() for item in split if item.strip() != '']
    final = date_page_remove(final)

    # Drop a leading run of digits followed by two spaces (likely footnote or line numbers)
    trimmed = []
    for item in final:
        numbered = re.match(r'[0-9]*  (.*)', item)
        trimmed.append(numbered.group(1) if numbered is not None else item)

    return trimmed

FOMC transcripts are prefixed with some information that is not important for my use case, such as who was in attendance and each person's affiliations. I want to skip over this information. So I identify some "landmarks" for different document formats and only extract the information after these "landmarks."
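
The "landmark" logic can be sketched without any PDF handling. The example below is a simplified illustration under stated assumptions: `pages_after_landmark` is a hypothetical helper, the sample pages are made up, and only one landmark pattern is checked. Pages are dropped until the landmark appears, and everything from that page onward is kept.

```python
import re

def pages_after_landmark(pages):
    """Keep items starting from the first page whose first item is a landmark.

    `pages` is a list of pages, each one a list of already-split strings.
    """
    include = False
    kept = []
    for page in pages:
        if not page:
            continue
        if re.match(r'Transcript of (?:the )?Federal', page[0]):
            include = True
        if include:
            kept.extend(page)
    return kept

# Made-up sample: one attendance page followed by two transcript pages
sample_pages = [
    ['PRESENT:', 'Attendance list goes here'],
    ['Transcript of the Federal Open Market Committee Meeting',
     'CHAIRMAN GREENSPAN.', 'Good morning everyone.'],
    ['SEVERAL.', 'Good morning.'],
]

kept = pages_after_landmark(sample_pages)
print(kept)
```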

In [9]:
def extract_data_from_pdf(fname):
    with pymupdf.open(os.path.join(folder, fname)) as pages:
        include = False
        main_arr = []
        for page in pages:
            extracted = extract_page(page.get_text())
            if len(extracted) == 0:
                continue
            # Start including pages once a title-page landmark is found
            if re.match(r'Transcript of (?:the )?Federal', extracted[0].strip()) is not None or (len(extracted) >= 2 and 'the Federal Open Market Committee' in extracted[1].strip()):
                include = True
            # Dealing with inconsistent formatting on 12/12/2012 (No "Transcript of the FOMC" at the start)
            if 'December 11 Session' == extracted[0]:
                include = True
            if not include:
                print('skipping')
                continue
            main_arr.extend(extracted)
        return main_arr

After the `extract_page` function creates the array format described above, I want to use this format to create a series of records that looks more like this:
```
[
    {'speaker': 'CHAIRMAN GREENSPAN.', 'content': ['Good morning everyone.']},
    {'speaker': 'SEVERAL.', 'content': ['Good morning.']}
]
```
This is the "utterance" format described in Shapiro and Wilson (2021). The `get_utternaces` function serves this purpose. It first tags each item in the array as a speaker or as content (using a RegEx pattern) and then "walks" through the array: whenever it finds an item tagged as a speaker, it collects all of the entries before the next speaker item and assigns them as the content of that speaker. This may seem overly complex, since a speaker is often followed by a single entry, but the flexibility matters because an utterance can consist of more than one item in the array, such as when it spans multiple pages. The content list is condensed into a single string in a later step.
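
For comparison, the same grouping can be sketched as a single pass that attaches each content item to the most recent speaker. This is an alternative illustration, not the tag-then-walk approach used in this notebook; `group_by_speaker` and the shortened `SPEAKER_PATT` are hypothetical names, and the input mimics an utterance that was split across two items.

```python
import re

# Simplified speaker pattern, anchored so that only whole items count as speaker labels
SPEAKER_PATT = r'(?:(?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) [A-Z]{2,}\.|SEVERAL\.)$'

def group_by_speaker(items):
    """Attach every non-speaker item to the most recent speaker item."""
    utterances = []
    for item in items:
        if re.match(SPEAKER_PATT, item):
            utterances.append({'speaker': item, 'content': []})
        elif utterances:
            # Items that appear before the first speaker are ignored
            utterances[-1]['content'].append(item)
    return utterances

items = ['CHAIRMAN GREENSPAN.', 'Good morning', 'everyone.', 'SEVERAL.', 'Good morning.']
grouped = group_by_speaker(items)
print(grouped)
```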

In [42]:
def get_utternaces(sep_array):
    speaker_patt = r'((?:MR\.|MRS\.|MS\.|(?:VICE )?CHAIR(?:MAN)?) (?:[A-Z]\. )?[A-Z]{2,}\.|SEVERAL(?:\(\?\))?\.|SPEAKER ?\(\?\)\.|PARTICIPANTS?\.)'

    # Tag every item, including the last one, as either a speaker label or content
    tagged = [{'is_speaker': re.match(speaker_patt, item) is not None, 'content': item}
              for item in sep_array]

    utterances = []
    ind = 0
    while ind < len(tagged):
        j = ind + 1
        if tagged[ind]['is_speaker']:
            # Advance j to the next speaker label; everything in between belongs to this speaker
            while j < len(tagged) and not tagged[j]['is_speaker']:
                j += 1
            combined_content = [item['content'] for item in tagged[ind + 1:j]]
            speaker = tagged[ind]['content']
            processed_content = [
                item for item in combined_content if item != '']
            utterance = {'speaker': speaker, 'content': processed_content}
            utterances.append(utterance)
        ind = j
    return utterances

Here I define a function that adds an index key to each utterance dictionary, which I use later on. The idea is that I might eventually want to take advantage of the ordering of statements, and this index preserves it.

In [11]:
def add_index(utterance, i):
    utterance['index'] = i
    return utterance

Because an utterance's content can contain multiple text items, I need to condense it into one string to be able to tokenize it later. That is what I do here, with a few more clean-up steps to remove non-informational content of the transcripts, such as appendix references and end-of-meeting indicators. (A pattern for stray date and page-number headers is left commented out in the code.)

In [36]:
def condense_content(content_arr):
    final_arr = []
    include = True
    for item in content_arr:
        if '(appendix ' in item:
            continue
        if 'END OF MEETING' in item:
            include = False
        if include:
            final_arr.append(item)
    concatenated = ' '.join(final_arr)

    # date_page_patt = '[A-Z][a-z]{2,8} [1-3]?[0-9],? [0-9]{4} [0-9]{1,3} of [0-9]{1,3}'
    # return re.sub(date_page_patt, ' ', concatenated)
    return unicode_nse_process(concatenated)

Earlier on, in the `extract_page` function, I chose to try to keep unicode characters in case they contained meaningful content, mostly to be flexible and to allow myself or others to use them in the future. Here I end up replacing or removing most of them so they don't show up as tokens later on.

I also address what I call "non-speech events" (NSEs) here. For example, the transcripts sometimes contain strings like "[Laughter]" or "[Applause]." These do not contribute to the topic (they may contribute to the sentiment, but I think that is beyond the scope of what I am doing here), so I choose to remove them, at least for now.
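
As a small illustration of NSE removal, the sketch below uses a hypothetical shortlist of events and `re.IGNORECASE` in place of spelling out both cases. The helper name `strip_nse` and the whitespace collapsing are assumptions added for the example. Bracketed text that is not on the list, such as an expanded abbreviation, is left alone.

```python
import re

# Hypothetical shortlist of non-speech events; IGNORECASE covers [Laughter] and [laughter] alike
NSE_PATT = re.compile(r'\[(?:laughter|applause|pause|recess)\.?\]', re.IGNORECASE)

def strip_nse(text):
    """Remove bracketed non-speech events and collapse the leftover whitespace."""
    without_nse = NSE_PATT.sub('', text)
    return re.sub(r'\s{2,}', ' ', without_nse).strip()

print(strip_nse('Thank you. [Laughter] Let us move on. [applause]'))
print(strip_nse('the ECB [European Central Bank] met'))
```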

In [13]:
unicode_map = {
    '\\\\N\{EN DASH\}': '-',
    '\\\\N\{SUPERSCRIPT [A-Z]*\}': '<SUPERSCRIPT>',
    '\\\\N\{EM DASH\}': '--',
    '\\\\N\{(?:RIGHT|LEFT) SINGLE QUOTATION MARK\}': '\'',
    '\\\\N\{(?:LEFT|RIGHT) DOUBLE QUOTATION MARK\}': '"',
    '_': ' ',
    '\\\\N\{VULGAR FRACTION THREE QUARTERS\}': '3/4',
    '\\\\N\{VULGAR FRACTION ONE QUARTER\}': '1/4',
    '\\\\N\{VULGAR FRACTION ONE HALF\}': '1/2',
    '\\\\N\{VULGAR FRACTION ONE THIRD\}': '1/3',
    '\\\\N\{VULGAR FRACTION TWO THIRDS\}': '2/3',
    '\\\\N\{VULGAR FRACTION FIVE EIGHTHS\}': '5/8',
    '\\\\N\{VULGAR FRACTION THREE EIGHTHS\}': '3/8',
    '\\\\N\{VULGAR FRACTION SEVEN EIGHTHS\}': '7/8',
    '\\\\N\{VULGAR FRACTION ONE EIGHTH\}': '1/8',
    '\\\\N\{SOFT HYPHEN\}': '',
    '\\\\N\{EURO SIGN\}': ' euros',
    '\\\\N\{YEN SIGN\}': 'yen',
    '\\\\N\{(?:DOUBLE )?PRIME\}': '',
    '\\\\N\{POUND SIGN\}': 'pounds',
    '\\\\N\{MATHEMATICAL ITALIC SMALL PI\}': 'pi',
    '\\\\N\{LATIN SMALL LETTER A WITH [^}]*\}': 'a',
    '\\\\N\{LATIN SMALL LETTER C WITH [^}]*\}': 'c',
    '\\\\N\{LATIN SMALL LETTER E WITH [^}]*\}': 'e',
    '\\\\N\{LATIN SMALL LETTER I WITH [^}]*\}': 'i',
    '\\\\N\{LATIN (?:SMALL|CAPITAL) LETTER O WITH [^}]*\}': 'o',
    '\\\\N\{LATIN SMALL LETTER U WITH [^}]*\}': 'u',
    '\\\\N\{LATIN (?:SMALL|CAPITAL) LETTER S WITH [^}]*\}': 's',
    '\\\\N\{LATIN SMALL LETTER N WITH [^}]*\}': 'n',
    '\\\\N\{GREEK SMALL LETTER [^}]*\}': 'pi',
    '\\\\N\{HORIZONTAL ELLIPSIS\}': '...',
    '\\\\N\{DAGGER\}': '',
    '\\\\N\{MINUS SIGN\}': '-',
    '\\\\N\{FIGURE DASH\}': '-',
    '\\\\N\{MODIFIER LETTER PRIME}': '\'',
    '\\\\N\{NO-BREAK SPACE\}': ' ',
    '\\\\N\{HORIZONTAL BAR\}': '-',
    '\\\\N\{ACUTE ACCENT\}': '\'',
    '\\\\N\{MICRO SIGN\}': 'mu',
    '\\\\N\{REGISTERED SIGN\}': '',
    '\\\\N\{BULLET\}': '',
    '\\\\N\{MATHEMATICAL ITALIC SMALL [A-Z]*\}': '',
    '\\\\N\{ASTERISK OPERATOR\}': ''
}

In [35]:
non_speech_pats = [
    r'\[(L|l)aught?er\.?\]',
    r'\[(R|r)ecess\]',
    r'\[(P|p)ause\]',
    r'\[(A|a)pplause\]',
    r'\[(L|l)unch (B|b)reak\]',
    r'\[(L|l)unch (R|r)ecess\]',
    r'\[(C|c)offee (B|b)reak\]',
    r'\[(N|n)o (R|r)esponse\]',
    r'\[(S|s)how (O|o)f (H|h)ands\]',
    r'\[No response\.?\]',
    r'\[Meeting recess(ed)?(?: for lunch)?\]',
    r'\[Simultaneous conversation\]',
    r'\[Chorus of ayes\]',
    r'\[Dessert break\]',
    r'\[(L|l)aughter (A|a)nd (A|a)pplause\]',
    r'\[Extended applause\]',
    r'\[Break\]',
    r'\[Show of hands\]',
    r'\[Short break\]',
    r'\[Silence\]',
    r'\[Laughter, followed by applause\.?\]',
    r'\[Laughter and extended applause\.?\]',
    r'\[(U|u)nintelligible\]',
    r'\[(S|s)tatements?--(S|s)ee (A|a)ppendix\.?\]'
]

In [15]:
non_speech = ['[Extended applause]', 
              '[Meeting break]', 
              '[No audible response]',
              '[No comment indicated]', 
              '[Aside to Governor Fischer]', 
              '[Singing and applause]', 
              '[Chorus of agreement]', 
              '[Brief recess]', 
              '[Luncheon recess]', 
              '[pointing]', 
              '[Laughter and applause]', 
              '[Simultaneous speakers]', 
              '[Show of hands]', 
              '[gesture]', 
              '[Singing]', 
              '[Chorus of ayes]', 
              '[No response]']

In [32]:
def unicode_nse_process(input_string):
        result = input_string
        for matchstr, sub in unicode_map.items():
            result = re.sub(matchstr, sub, result)
        for matchstr in non_speech_pats:
            result = re.sub(matchstr, '', result)
        for string in non_speech:
            result = result.replace(string, '')
        return result

In [21]:
def process_pdf(fname):
    year = fname[4:8]
    month = fname[8:10]
    mday = fname[10:12]
    transcript_type = fname[12:].split('.')[0]
    datestr = f'{year}-{month}-{mday}'
    raw_extract = extract_data_from_pdf(fname)
    utterances = get_utternaces(raw_extract)
    utterances = [add_index(utterance, i)
                  for i, utterance in enumerate(utterances)]
    for utterance in utterances:
        # Skip any utterance whose content is missing (NaN) rather than a list of strings
        if not isinstance(utterance['content'], collections.abc.Sequence) and math.isnan(utterance['content']):
            continue
        # Join the content items into one string; condense_content also applies the unicode and non-speech replacements
        content = condense_content(utterance['content'])
        utterance['content'] = content
        utterance['date'] = datestr
        utterance['type'] = transcript_type
    return utterances
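
The slicing in `process_pdf` assumes file names shaped like `FOMC19980929meeting.pdf`. As a hedged alternative, the sketch below parses the same pieces with a regular expression, so that a stray file in the folder is reported as `None` rather than silently producing a malformed date. `parse_fname` and `FNAME_PATT` are hypothetical names, not part of the notebook.

```python
import re

# "FOMC" + 8-digit date + lowercase transcript type + ".pdf"
FNAME_PATT = re.compile(r'FOMC(\d{4})(\d{2})(\d{2})([a-z]+)\.pdf$')

def parse_fname(fname):
    """Return (ISO date string, transcript type), or None if the name does not fit."""
    match = FNAME_PATT.match(fname)
    if match is None:
        return None
    year, month, mday, transcript_type = match.groups()
    return f'{year}-{month}-{mday}', transcript_type

print(parse_fname('FOMC19980929meeting.pdf'))
print(parse_fname('FOMC19891016confcall.pdf'))
print(parse_fname('.DS_Store'))
```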

In [43]:
records = []

# Process every transcript in the configured folder
for file in os.listdir(folder):
    print(file)
    records.extend(process_pdf(file))

df = pd.DataFrame.from_records(records)

df.to_csv('transcripts.csv')

FOMC19980929meeting.pdf
FOMC19930817meeting.pdf
FOMC20150318meeting.pdf
skipping
skipping
skipping
FOMC19890208meeting.pdf
FOMC19891016confcall.pdf
skipping
FOMC19961113meeting.pdf
FOMC20110315meeting.pdf
FOMC20060510meeting.pdf
FOMC20020924meeting.pdf
skipping
skipping
FOMC19970325meeting.pdf
FOMC20140319meeting.pdf
skipping
skipping
FOMC19991005meeting.pdf
FOMC20060808meeting.pdf
FOMC19880329meeting.pdf
FOMC20040504meeting.pdf
FOMC19960326meeting.pdf
FOMC20041110meeting.pdf
FOMC19910514meeting.pdf
FOMC19910703meeting.pdf
FOMC20090603confcall.pdf
skipping
skipping
FOMC19911105meeting.pdf
FOMC20170920meeting.pdf
FOMC20031209meeting.pdf
FOMC20131218meeting.pdf
skipping
skipping
FOMC19981015confcall.pdf
skipping
FOMC20041214meeting.pdf
FOMC19921117meeting.pdf
FOMC20011002meeting.pdf
FOMC19880105confcall.pdf
skipping
FOMC20070918meeting.pdf
FOMC19980921confcall.pdf
skipping
skipping
FOMC20160615meeting.pdf
FOMC20120425meeting.pdf
skipping
skipping
FOMC20150617meeting.pdf
skipping
skipping

In [152]:
def process_pdf_sents(fname):
    year = fname[4:8]
    month = fname[8:10]
    mday = fname[10:12]
    transcript_type = fname[12:].split('.')[0]
    datestr = f'{year}-{month}-{mday}'
    raw_extract = extract_data_from_pdf(fname)
    utterances = get_utternaces(raw_extract)
    sentences = []
    doc_metadata = {'date': datestr, 'type': transcript_type}
    for utterance in utterances:
        utter_sents = sent_tokenize(condense_content(utterance['content']))
        sentences.extend([{'content': unicode_nse_process(sent), 'speaker': utterance['speaker'],
                         **doc_metadata} for sent in utter_sents])
    sentences = [add_index(sentence, i)
                 for i, sentence in enumerate(sentences)]

    return sentences

In [None]:
records = []

# Process every transcript in the configured folder, one record per sentence
for file in os.listdir(folder):
    print(file)
    records.extend(process_pdf_sents(file))

df_sents = pd.DataFrame.from_records(records)

df_sents.to_csv('transcripts_sents.csv')

Identify unicode symbols and their frequencies

In [50]:
symbols = {}

for i, row in df.iterrows():
    # Look for a \N{...} escape that survived the clean-up of the utterance text
    result = re.search(r'\\N\{[^}]*\}', row['content'])
    if result is not None:
        if result.group() in symbols:
            symbols[result.group()] += 1
        else:
            symbols[result.group()] = 1

symbols

{}
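
The check above looks at only the first escape in each utterance. A variant that counts every occurrence can be sketched with `collections.Counter` and `re.findall`. The helper name `count_escapes` and the sample strings are made up for illustration. In this notebook it could presumably be called as `count_escapes(df['content'])`.

```python
import re
from collections import Counter

def count_escapes(texts):
    """Count every remaining named-character escape across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r'\\N\{[^}]*\}', text))
    return counts

# Made-up sample strings containing leftover escapes
sample = [
    'up 25 basis points \\N{EM DASH} or so \\N{EM DASH} this year',
    'a caf\\N{LATIN SMALL LETTER E WITH ACUTE} remark',
    'no escapes here',
]

counts = count_escapes(sample)
print(counts)
print(sum(counts.values()))
```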

Assess prevalence of unicode symbols

In [51]:
total = 0
for symbol, freq in symbols.items():
    total += freq
total

0

In [118]:
non_speech_events = {}

for i, row in df.iterrows():
    # Find the first bracketed phrase in each utterance to spot candidate non-speech events
    result = re.search(r'\[(\w| )+\]', row['content'])
    if result is not None:
        if result.group() in non_speech_events:
            non_speech_events[result.group()] += 1
        else:
            non_speech_events[result.group()] = 1
non_speech_events

{'[w]': 1,
 '[a]': 1,
 '[earlier and gradual]': 1,
 '[T]': 1,
 '[System Open Market Account]': 1,
 '[Research and Statistics]': 2,
 '[personal consumption expenditures]': 5,
 '[European Central Bank]': 5,
 '[non accelerating inflation rate of unemployment]': 1,
 '[Congressional Budget Office]': 1,
 '[Japanese government bond]': 1,
 '[Institute for Supply Management]': 2,
 '[MFP]': 1,
 '[West Texas intermediate oil]': 1,
 '[industrial production]': 2,
 '[American Economic Association]': 2,
 '[information technology]': 4,
 '[equipment and software]': 3,
 '[alternative minimum tax]': 1,
 '[probability density function]': 1,
 '[NIPA]': 1,
 '[metropolitan statistical areas]': 1,
 '[Bureau of Labor Statistics]': 3,
 '[Employment Cost Index]': 4,
 '[Survey of Professional Forecasters]': 1,
 '[West Texas Intermediate]': 1,
 '[the]': 1,
 '[and]': 1,
 '[of a return to 2 percent]': 1,
 '[domestic]': 1,
 '[English]': 1,
 '[our]': 1,
 '[European Monetary Union]': 1,
 '[Bureau of Economic Analysis]'