In [104]:
# imports
import pandas as pd 

## Data Preprocessing - FRED

https://fred.stlouisfed.org/

https://stackoverflow.com/questions/63702640/monthly-to-daily-values

Short for Federal Reserve Economic Data, FRED is an online database consisting of hundreds of thousands of economic data time series from scores of national, international, public, and private sources. We will use this data to obtain historical information for key US economic indicators, including unemployment, GDP, real GDP, the federal funds rate, and interest rates. All of this data can be downloaded as .csv files.

The code below cleans this data so it can be loaded into the relational database.

In [105]:
# import data to dataframe

gdp = pd.read_csv('raw_data/GDP_raw.csv')
real_gdp = pd.read_csv('raw_data/GDPC1_raw.csv')
unemployment = pd.read_csv('raw_data/UNRATE_raw.csv')
funds_rate = pd.read_csv('raw_data/DFF_raw.csv')
interest_rate = pd.read_csv('raw_data/REAINTRATREARAT10Y_raw.csv')

All the data is now imported into dataframes. Next, we will go through them one by one and resample each dataframe to the weekly level. We will also remove any data from before the year 2000 to reduce the row count for this exercise.

### GDP

In [106]:
gdp.head()

Unnamed: 0,DATE,GDP
0,1/1/1947,243.164
1,4/1/1947,245.968
2,7/1/1947,249.585
3,10/1/1947,259.745
4,1/1/1948,265.742


In [107]:
gdp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    307 non-null    object 
 1   GDP     307 non-null    float64
dtypes: float64(1), object(1)
memory usage: 4.9+ KB


GDP is reported quarterly, in billions of dollars. The data starts 1/1/1947 and ends on 7/1/2023. First, we will convert the date to datetime. Then, we will drop rows where data is earlier than 1/1/2000. Lastly, we will disaggregate to the week level. 

In [108]:
# convert to datetime
gdp['DATE'] = pd.to_datetime(gdp['DATE'])

# drop rows older than 1/1/2000. 
gdp = gdp[~(gdp['DATE'] < '2000-01-01')]

# disaggregate to week level
gdp_cleansed = gdp.set_index('DATE').resample('W').ffill()

gdp_cleansed.head()

Unnamed: 0_level_0,GDP
DATE,Unnamed: 1_level_1
2000-01-02,10002.179
2000-01-09,10002.179
2000-01-16,10002.179
2000-01-23,10002.179
2000-01-30,10002.179
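
The same three steps (parse the date, drop rows before 2000, forward-fill to weekly rows) are repeated for the other series below. As a rough sketch only, they could be wrapped in a helper. The function name `to_weekly`, its `start` parameter, and the sample frame are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

def to_weekly(df, start="2000-01-01"):
    """Parse DATE, drop rows before `start`, and forward-fill to weekly rows.

    Hypothetical helper: mirrors the GDP cell above, not an official API.
    """
    out = df.copy()
    out["DATE"] = pd.to_datetime(out["DATE"])
    out = out[out["DATE"] >= start]
    # 'W' produces week-ending (Sunday) labels; ffill repeats the latest known value
    return out.set_index("DATE").resample("W").ffill()

# Made-up quarterly values, used only to illustrate the helper
sample = pd.DataFrame({
    "DATE": ["10/1/1999", "1/1/2000", "4/1/2000"],
    "VALUE": [1.0, 2.0, 3.0],
})
weekly = to_weekly(sample)
print(weekly.head())
```

Because each quarterly value is simply repeated until the next observation, the weekly rows carry no information beyond the original quarterly figures.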


### Real GDP

In [109]:
real_gdp.head()

Unnamed: 0,DATE,GDPC1
0,1947-01-01,2182.681
1,1947-04-01,2176.892
2,1947-07-01,2172.432
3,1947-10-01,2206.452
4,1948-01-01,2239.682


In [110]:
real_gdp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    307 non-null    object 
 1   GDPC1   307 non-null    float64
dtypes: float64(1), object(1)
memory usage: 4.9+ KB


Real GDP is reported quarterly, in billions of dollars. The data starts 1/1/1947 and ends on 7/1/2023. First, we will convert the date to datetime. Then, we will drop rows where data is earlier than 1/1/2000. Lastly, we will disaggregate to the week level. 

In [111]:
# convert to datetime
real_gdp['DATE'] = pd.to_datetime(real_gdp['DATE'])

# drop rows older than 1/1/2000. 
real_gdp = real_gdp[~(real_gdp['DATE'] < '2000-01-01')]

# disaggregate to week level
real_gdp_cleansed = real_gdp.set_index('DATE').resample('W').ffill()

real_gdp_cleansed.head()

Unnamed: 0_level_0,GDPC1
DATE,Unnamed: 1_level_1
2000-01-02,13878.147
2000-01-09,13878.147
2000-01-16,13878.147
2000-01-23,13878.147
2000-01-30,13878.147


### Unemployment Rate

In [112]:
unemployment.head()

Unnamed: 0,DATE,UNRATE
0,1/1/1948,3.4
1,2/1/1948,3.8
2,3/1/1948,4.0
3,4/1/1948,3.9
4,5/1/1948,3.5


In [113]:
unemployment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 910 entries, 0 to 909
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    910 non-null    object 
 1   UNRATE  910 non-null    float64
dtypes: float64(1), object(1)
memory usage: 14.3+ KB


Unemployment is reported monthly, as a percent. The data starts 1/1/1948 and ends on 10/1/2023. First, we will convert the date to datetime. Then, we will drop rows where data is earlier than 1/1/2000. Lastly, we will disaggregate to the week level. 

In [114]:
# convert to datetime
unemployment['DATE'] = pd.to_datetime(unemployment['DATE'])

# drop rows older than 1/1/2000. 
unemployment = unemployment[~(unemployment['DATE'] < '2000-01-01')]

# disaggregate to week level
unemployment_cleansed = unemployment.set_index('DATE').resample('W').ffill()

unemployment_cleansed.head()

Unnamed: 0_level_0,UNRATE
DATE,Unnamed: 1_level_1
2000-01-02,4.0
2000-01-09,4.0
2000-01-16,4.0
2000-01-23,4.0
2000-01-30,4.0


### Federal Funds Rate

In [115]:
funds_rate.head()

Unnamed: 0,DATE,DFF
0,2018-11-16,2.2
1,2018-11-17,2.2
2,2018-11-18,2.2
3,2018-11-19,2.2
4,2018-11-20,2.2


In [116]:
funds_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1827 entries, 0 to 1826
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    1827 non-null   object 
 1   DFF     1827 non-null   float64
dtypes: float64(1), object(1)
memory usage: 28.7+ KB


Federal funds rate is reported daily, as a percent. The data starts 11/16/2018 and ends on 11/12/2023. First, we will convert the date to datetime. Then, we will reduce it to the week level by keeping the value observed on each week-ending Sunday.

In [117]:
# convert to datetime
funds_rate['DATE'] = pd.to_datetime(funds_rate['DATE'])

# aggregate to week level
funds_rate_cleansed = funds_rate.set_index('DATE').resample('W').ffill()

funds_rate_cleansed.head()

Unnamed: 0_level_0,DFF
DATE,Unnamed: 1_level_1
2018-11-18,2.2
2018-11-25,2.2
2018-12-02,2.2
2018-12-09,2.19
2018-12-16,2.19
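
For a daily series, `resample('W').ffill()` keeps the value in effect on each week-ending Sunday and does not average the week. If a weekly average were preferred, `.mean()` is one alternative. The sketch below compares the two on made-up numbers (not FRED data); choosing the mean would be a change from what this notebook does.

```python
import pandas as pd

# Made-up daily series covering two full Monday-to-Sunday weeks
days = pd.date_range("2018-11-19", "2018-12-02", freq="D")
daily = pd.DataFrame(
    {"DFF": [2.0] * 6 + [2.2] + [2.2] * 6 + [2.4]},
    index=days,
)

sunday_value = daily.resample("W").ffill()  # value observed on each Sunday
weekly_mean = daily.resample("W").mean()    # average of the seven daily values

print(sunday_value)
print(weekly_mean)
```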


### Interest Rate 

In [118]:
interest_rate.head()

Unnamed: 0,DATE,REAINTRATREARAT10Y
0,1982-01-01,7.623742
1,1982-02-01,7.656648
2,1982-03-01,7.128993
3,1982-04-01,7.408347
4,1982-05-01,7.320041


In [119]:
interest_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   DATE                503 non-null    object 
 1   REAINTRATREARAT10Y  503 non-null    float64
dtypes: float64(1), object(1)
memory usage: 8.0+ KB


Interest rate is reported monthly, as a percent. The data starts 1/1/1982 and ends on 11/1/2023. First, we will convert the date to datetime. Then, remove anything before the year 2000. Lastly, we will disaggregate to the week level.

In [120]:
# convert to datetime
interest_rate['DATE'] = pd.to_datetime(interest_rate['DATE'])

# drop rows older than 1/1/2000. 
interest_rate = interest_rate[~(interest_rate['DATE'] < '2000-01-01')]

# disaggregate to week level
interest_rate_cleansed = interest_rate.set_index('DATE').resample('W').ffill()

interest_rate_cleansed.head()

Unnamed: 0_level_0,REAINTRATREARAT10Y
DATE,Unnamed: 1_level_1
2000-01-02,3.411051
2000-01-09,3.411051
2000-01-16,3.411051
2000-01-23,3.411051
2000-01-30,3.411051


### Save cleansed data to CSV

Since these are all economic indicators, we can merge them into one table for our final data product. There is just one problem: not every economic indicator has the same start and end date. All sources start on 2000-01-02 except for the funds rate, which starts on 2018-11-18. The data sources also end on different dates, with the earliest end date being GDP at 2023-06-04.

First, we will remove any rows after 2023-06-04:

In [121]:
# reset index
gdp_cleansed.reset_index(inplace=True)
real_gdp_cleansed.reset_index(inplace=True)
unemployment_cleansed.reset_index(inplace=True)
funds_rate_cleansed.reset_index(inplace=True)
interest_rate_cleansed.reset_index(inplace=True)

# drop rows newer than 2023-06-04. 
gdp_cleansed = gdp_cleansed[~(gdp_cleansed['DATE'] > '2023-06-04')]
real_gdp_cleansed = real_gdp_cleansed[~(real_gdp_cleansed['DATE'] > '2023-06-04')]
unemployment_cleansed = unemployment_cleansed[~(unemployment_cleansed['DATE'] > '2023-06-04')]
funds_rate_cleansed = funds_rate_cleansed[~(funds_rate_cleansed['DATE'] > '2023-06-04')]
interest_rate_cleansed = interest_rate_cleansed[~(interest_rate_cleansed['DATE'] > '2023-06-04')]
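
The cutoff date is hard-coded in the cell above. As a sketch under stated assumptions, it could instead be derived as the earliest of the last dates across the frames. The frames `a` and `b` below are hypothetical stand-ins for the `*_cleansed` dataframes.

```python
import pandas as pd

# Hypothetical weekly frames with different end dates
a = pd.DataFrame({"DATE": pd.date_range("2023-05-07", periods=5, freq="W"), "A": range(5)})
b = pd.DataFrame({"DATE": pd.date_range("2023-05-07", periods=8, freq="W"), "B": range(8)})

frames = [a, b]

# The common end date is the earliest "last date" among all frames
cutoff = min(f["DATE"].max() for f in frames)

# Keep only rows on or before the common end date
trimmed = [f[f["DATE"] <= cutoff] for f in frames]
print(cutoff)
```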

Now we can join the data together by date. The funds rate will be NaN for all the weeks before 2018-11-18, since that series starts later than the others.

In [128]:
# Merge GDP with real GDP
gdp_merged_df = pd.merge(gdp_cleansed, real_gdp_cleansed, on="DATE", how="inner")

# Add the unemployment rate
gdp__unemployment_merged_df = pd.merge(gdp_merged_df, unemployment_cleansed, on="DATE", how="inner")

# Add the interest rate
gdp__unemployment__interest_merged_df = pd.merge(gdp__unemployment_merged_df, interest_rate_cleansed, on="DATE", how="inner")


# Add the funds rate; the outer join keeps the weeks before 2018-11-18, where DFF is NaN
economic_indicator_df = pd.merge(gdp__unemployment__interest_merged_df, funds_rate_cleansed, on="DATE", how="outer")

# Show the resulting dataframe
print(economic_indicator_df)

           DATE        GDP      GDPC1  UNRATE  REAINTRATREARAT10Y   DFF
0    2000-01-02  10002.179  13878.147     4.0            3.411051   NaN
1    2000-01-09  10002.179  13878.147     4.0            3.411051   NaN
2    2000-01-16  10002.179  13878.147     4.0            3.411051   NaN
3    2000-01-23  10002.179  13878.147     4.0            3.411051   NaN
4    2000-01-30  10002.179  13878.147     4.0            3.411051   NaN
...         ...        ...        ...     ...                 ...   ...
1218 2023-05-07  27063.012  22225.350     3.7            1.536904  5.08
1219 2023-05-14  27063.012  22225.350     3.7            1.536904  5.08
1220 2023-05-21  27063.012  22225.350     3.7            1.536904  5.08
1221 2023-05-28  27063.012  22225.350     3.7            1.536904  5.08
1222 2023-06-04  27063.012  22225.350     3.6            1.060631  5.08

[1223 rows x 6 columns]
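
The chain of merges above could also be written more compactly. The following is only a sketch on made-up frames: `functools.reduce` handles the inner joins of the full-history series, and a final outer join adds the shorter series. The names `long_a`, `long_b`, and `short_c` are hypothetical.

```python
from functools import reduce

import pandas as pd

dates = pd.date_range("2000-01-02", periods=4, freq="W")

# Two series covering all four weeks, and one that starts two weeks later
long_a = pd.DataFrame({"DATE": dates, "A": [1.0, 2.0, 3.0, 4.0]})
long_b = pd.DataFrame({"DATE": dates, "B": [10.0, 20.0, 30.0, 40.0]})
short_c = pd.DataFrame({"DATE": dates[2:], "C": [0.5, 0.6]})

# Inner-join every full-history frame on DATE
full_history = reduce(
    lambda left, right: pd.merge(left, right, on="DATE", how="inner"),
    [long_a, long_b],
)

# Outer-join the shorter series so earlier weeks are kept with NaN in column C
combined = pd.merge(full_history, short_c, on="DATE", how="outer")
print(combined)
```

An inner join on the last step would instead drop every week where the shorter series has no value, which is why the outer join is used for the funds rate.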


Now we have 1,223 rows and six columns. Before we save to a csv, let's rename the columns and make sure the data types are correct.

In [130]:
economic_indicator_df.rename(columns={"DATE":"date", "GDP":"gdp","GDPC1":"real_gdp", "UNRATE":"unemployment_rate","REAINTRATREARAT10Y":"interest_rate","DFF":"fund_rate"}, inplace=True)

economic_indicator_df.head()

Unnamed: 0,date,gdp,real_gdp,unemployment_rate,interest_rate,fund_rate
0,2000-01-02,10002.179,13878.147,4.0,3.411051,
1,2000-01-09,10002.179,13878.147,4.0,3.411051,
2,2000-01-16,10002.179,13878.147,4.0,3.411051,
3,2000-01-23,10002.179,13878.147,4.0,3.411051,
4,2000-01-30,10002.179,13878.147,4.0,3.411051,


Now, we check the datatypes. 

In [131]:
economic_indicator_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1223 entries, 0 to 1222
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               1223 non-null   datetime64[ns]
 1   gdp                1223 non-null   float64       
 2   real_gdp           1223 non-null   float64       
 3   unemployment_rate  1223 non-null   float64       
 4   interest_rate      1223 non-null   float64       
 5   fund_rate          238 non-null    float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 57.5 KB


The data types look correct. Lastly, save to a csv that can be inserted into the database.

In [134]:
# write with a .csv extension so the file is recognized as a csv
economic_indicator_df.to_csv('economic_indicators.csv', sep=',')
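
Before loading the file into the database, it can help to read it back and confirm that the dates and missing values survive the round trip. This is a minimal sketch on a made-up frame written to a temporary path. Passing `index=False` is an assumption here: it leaves the row index out of the file, which may or may not be what the database load expects.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with the same kinds of columns as the final table
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-02", "2000-01-09"]),
    "gdp": [10002.179, 10002.179],
    "fund_rate": [float("nan"), float("nan")],
})

# Write to a temporary location so the example does not touch project files
path = os.path.join(tempfile.mkdtemp(), "economic_indicators_example.csv")
df.to_csv(path, index=False)

# Read back, asking pandas to parse the date column again
restored = pd.read_csv(path, parse_dates=["date"])
print(restored.dtypes)
```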