# Fetching the Transcript PDFs from the Federal Reserve Website

In this notebook, I download the FOMC meeting transcripts (PDF files) from the Federal Reserve website so that I can use them in the rest of the project.

In [2]:
import requests
from bs4 import BeautifulSoup
import bs4
import random
import time
import pandas as pd
import re
import numpy as np
import os

In [3]:
base_url = 'https://www.federalreserve.gov'

First, I scrape the historical-materials page for each year and collect the link to every transcript listed there. Because the search matches any link text containing "Transcript", it also picks up conference call transcripts, which I do not use later.

In [4]:
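def get_minute_links(year):
    """Return the transcript PDF URLs listed on the historical-materials page for `year`."""
    url = f'{base_url}/monetarypolicy/fomchistorical{year}.htm'
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    # name the parser explicitly so results do not depend on which parsers are installed
    soup = BeautifulSoup(response.text, 'html.parser')
    panels = soup.find_all('div', class_='panel')
    links = []
    for panel in panels:
        # text nodes in this panel that mention "Transcript"
        transcript_element = panel.find_all(string=re.compile(r'Transcript'))
        if transcript_element:
            # the matching text sits directly inside the <a> tag that links to the PDF
            anchor = transcript_element[0].parent
            links.append(f'{base_url}{anchor["href"]}')
    print(len(links))
    return links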
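
The function above builds each URL by prefixing `base_url` to the `href`, which works as long as every link on the page is site-relative. As a minimal sketch of a more defensive alternative, `urllib.parse.urljoin` from the standard library handles both relative and absolute links. The helper name `to_absolute` and the example file path are hypothetical and only for illustration; they are not taken from the Federal Reserve site.

```python
from urllib.parse import urljoin

base_url = 'https://www.federalreserve.gov'

def to_absolute(href, base=base_url):
    """Hypothetical helper: resolve an href against the site root."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(base, href)

# Illustrative placeholder paths, not real file names from the site
relative = '/monetarypolicy/files/example_transcript.pdf'
absolute = 'https://www.federalreserve.gov/monetarypolicy/files/example_transcript.pdf'

print(to_absolute(relative))
print(to_absolute(absolute))
```

With this approach, an `href` that already contains the full address would not end up with the domain prepended twice.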

In [7]:
all_links = []

for year in range(1986, 2019):
    # pause 3 to 5 seconds before each request to go easy on the server
    time.sleep(3 + random.random() * 2)
    print(year)
    all_links += get_minute_links(year)

1986
8
1987
10
1988
12
1989
14
1990
12
1991
19
1992
12
1993
16
1994
13
1995
11
1996
8
1997
8
1998
10
1999
8
2000
8
2001
13
2002
8
2003
13
2004
8
2005
8
2006
8
2007
11
2008
14
2009
11
2010
10
2011
10
2012
8
2013
9
2014
9
2015
8
2016
8
2017
8
2018
8


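After collecting all the links to the PDFs, I download them one by one and save them in a folder called "pdfs". A random pause of 5 to 10 seconds after each download keeps the requests from overwhelming the server, and files that already exist in the folder are skipped, so the cell can be rerun safely after an interruption.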

In [8]:
os.makedirs('pdfs', exist_ok=True)

for i, item in enumerate(all_links):
    if i % 10 == 0:
        print(f'{i}/{len(all_links)} documents completed.')
    fname = item.split('/')[-1]
    # skip any documents that were previously downloaded
    if os.path.exists(os.path.join('pdfs', fname)):
        continue
    response = requests.get(item, timeout=60)
    response.raise_for_status()  # do not save an error page under a .pdf name
    with open(os.path.join('pdfs', fname), 'wb') as f:
        f.write(response.content)
    time.sleep(5 + 5 * random.random())

0/341 documents completed.
10/341 documents completed.
20/341 documents completed.
30/341 documents completed.
40/341 documents completed.
50/341 documents completed.
60/341 documents completed.
70/341 documents completed.
80/341 documents completed.
90/341 documents completed.
100/341 documents completed.
110/341 documents completed.
120/341 documents completed.
130/341 documents completed.
140/341 documents completed.
150/341 documents completed.
160/341 documents completed.
170/341 documents completed.
180/341 documents completed.
190/341 documents completed.
200/341 documents completed.
210/341 documents completed.
220/341 documents completed.
230/341 documents completed.
240/341 documents completed.
250/341 documents completed.
260/341 documents completed.
270/341 documents completed.
280/341 documents completed.
290/341 documents completed.
300/341 documents completed.
310/341 documents completed.
320/341 documents completed.
330/341 documents completed.
340/341 documents completed.
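
The download loop saves the response body without checking that it is actually a PDF, and because existing files are skipped on later runs, a bad file would never be replaced. One possible safeguard, sketched here under the assumption that every transcript is a well-formed PDF starting with the standard `%PDF-` header, is to check the first bytes before writing and to write through a temporary file. The helper `save_if_pdf` and the `.part` suffix are hypothetical names, not part of the original notebook.

```python
import os
import tempfile

def save_if_pdf(content: bytes, path: str) -> bool:
    """Hypothetical helper: write `content` to `path` only if it looks like a PDF."""
    # PDF files begin with the marker "%PDF-"; anything else is rejected
    if not content.startswith(b'%PDF-'):
        return False
    # Write to a temporary name first, then rename, so an interrupted
    # run does not leave a truncated file at the final path
    tmp_path = path + '.part'
    with open(tmp_path, 'wb') as f:
        f.write(content)
    os.replace(tmp_path, path)
    return True

# Small offline demonstration with made-up byte strings
with tempfile.TemporaryDirectory() as folder:
    good = save_if_pdf(b'%PDF-1.4 fake body', os.path.join(folder, 'good.pdf'))
    bad = save_if_pdf(b'<html>error</html>', os.path.join(folder, 'bad.pdf'))
    good_exists = os.path.exists(os.path.join(folder, 'good.pdf'))
    bad_exists = os.path.exists(os.path.join(folder, 'bad.pdf'))

print(good, bad, good_exists, bad_exists)
```

In the loop, a call such as `save_if_pdf(response.content, os.path.join('pdfs', fname))` could stand in for the `open`/`write` pair, with a `False` return value logged so the link can be retried later.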