# 1. Define Business Requirement

## Problem Statement
Title: Stock Price Prediction Based on Macroeconomic Factors

## Challenge:
How do changes in macroeconomic factors, such as interest rates and inflation, affect stock market development, and how can we predict these changes using historical data?

## Importance:
Macroeconomic changes have a direct impact on companies' borrowing costs and earnings, which in turn affect their stock prices. Being able to predict these changes can help investors make informed decisions and reduce market risks.

## Expected Solution:
This study will help us determine whether macroeconomic factors influence stock prices, and whether investors should invest or not based on those factors.

We will work towards developing a machine learning model that predicts stock price changes based on historical macroeconomic factors. In this study we mainly focus on interest rates and inflation.



# 2. Data Collection

We will first retrieve the data we will be working with. Before we can begin, we have to import the necessary libraries.

In [87]:
# importing libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [88]:
# retrieving data

df_gold = pd.read_csv('https://raw.githubusercontent.com/badranyoussef/bi-exam-project-stock/refs/heads/main/datasets/cleaned_gold_data.csv')
df_interest_inflation = pd.read_csv('https://raw.githubusercontent.com/badranyoussef/bi-exam-project-stock/main/datasets/fed_interest_rate_inflation.csv')
# ------- START YOUSSEF --------
df_interest_2017_to_now = pd.read_excel('/Users/youssefbadran/Documents/GitHub/bi-exam-project-stock/datasets/interest_rate_2017_now_cleaned.xlsx')
df_sp500 = pd.read_csv('/Users/youssefbadran/Documents/datamatiker/4. semester/BI/sp500_data.csv')
# ------- END YOUSSEF --------
# ------- START LASSE ----------
#df_interest_2017_to_now = pd.read_excel('/Users/lassekh/Documents/Datamatiker/4-semester/BI - Business Intelligence/bi-exam-project-stock/datasets/interest_rate_2017_now_cleaned.xlsx')
# Reading from split csv files and define arr with csv file names from directory
#dir = '/Users/lassekh/Documents/Datamatiker/4-semester/BI - Business Intelligence/bi-exam-project-stock/datasets/not in use/sp500/'
#csv_files = [f'{dir}part_{i}.csv' for i in range(1, 26)]
# load all csv-files into a data frame
#dfs = [pd.read_csv(file) for file in csv_files]
# combine all DFs in one
#df_sp500 = pd.read_csv('/Users/lassekh/Documents/Datamatiker/4-semester/BI - Business Intelligence/shortxprice/datasets/sp500_data.csv')
# ------ END LASSE ---------
russell2000_df = pd.read_csv('https://raw.githubusercontent.com/badranyoussef/bi-exam-project-stock/refs/heads/main/datasets/russell_2000.csv')
oil_df = pd.read_csv('https://raw.githubusercontent.com/badranyoussef/bi-exam-project-stock/refs/heads/main/datasets/BrentOilPrices.csv')
cpi = pd.read_csv('https://raw.githubusercontent.com/badranyoussef/bi-exam-project-stock/refs/heads/main/datasets/cpi_data.csv')


# 3. Cleaning data
Now that we have all the data we need, we will look through it to ensure that there are no missing values. Where values are missing, we will fill them in depending on the variable/feature.

In the stock datasets we will drop the columns we do not need (such as `Ticker` and `Adj Close`) and keep the date and price columns.

As sp500 contains several stocks listed one after another, it has over four million rows. Therefore we will work with the mean close across all stocks for each date.
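Before filling anything in, it helps to see how much is actually missing per column. The following is a minimal sketch of such a check; the helper name `missing_report` and the toy dataframe are assumptions made for illustration and are not part of the project code.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(1),
    })
    # Columns with the most gaps come first
    return report.sort_values("missing", ascending=False)

# Toy frame standing in for one of the datasets (hypothetical values)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "Close": [10.0, None, 12.0, 13.0],
    "Volume": [100.0, None, None, 400.0],
})
report = missing_report(toy)
print(report)
```

The same helper could be called on any of the dataframes loaded above, for example `missing_report(df_gold)`.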

### SP500

In [89]:

df_sp500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4237173 entries, 0 to 4237172
Data columns (total 8 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Date       object 
 1   Open       float64
 2   High       float64
 3   Low        float64
 4   Close      float64
 5   Adj Close  float64
 6   Volume     float64
 7   Ticker     object 
dtypes: float64(6), object(2)
memory usage: 258.6+ MB


In [90]:
# Converting Date column to DateTime format
df_sp500['Date'] = pd.to_datetime(df_sp500['Date'])

# dropping unnecessary columns
df_sp500 = df_sp500.drop(columns=['Ticker', 'Adj Close'])

# CHANGED TO BELOW -- df_sp500.rename(columns={'Close':'Close SP500'}, inplace=True)
# Appending ' SP500' to every column name except 'Date'
df_sp500.columns = [col + ' SP500' if col != 'Date' else col for col in df_sp500.columns]

# Group by Date and calculate the sum of the close prices
df_sp500_sum_of_date = df_sp500.groupby('Date').agg({
    'Close SP500': 'sum',
}).reset_index()

# Group by Date and calculate the mean for each column
df_sp500_mean_of_date = df_sp500.groupby('Date').agg({
    'Close SP500': 'mean',
    'Open SP500': 'mean',
    'High SP500': 'mean',
    'Low SP500': 'mean',
    'Volume SP500': 'mean'
}).reset_index()
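To make the aggregation step concrete, here is a small sketch on toy data, assuming a long format with one row per ticker per day. The tickers and prices are invented. Note that a plain mean gives every stock the same weight, so the result is an equal-weighted average of constituent prices and not the official S&P 500 index level, which is weighted by market capitalisation.

```python
import pandas as pd

# Hypothetical long-format data: one row per ticker per day
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
    "Ticker": ["AAA", "BBB", "AAA", "BBB"],
    "Close": [10.0, 30.0, 12.0, 36.0],
})

# One row per date, holding the average close across all tickers that day
daily_mean = (
    toy.groupby("Date", as_index=False)["Close"]
       .mean()
       .rename(columns={"Close": "Close SP500"})
)
print(daily_mean)
```

Because the number of tickers with data differs between dates, the average for early dates may be based on far fewer stocks than the average for recent dates.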

In [91]:
df_sp500_mean_of_date   

Unnamed: 0,Date,Close SP500,Open SP500,High SP500,Low SP500,Volume SP500
0,1962-01-02,5.360604,0.585293,5.440671,5.351006,4.587460e+05
1,1962-01-03,5.347035,0.579997,5.396574,5.323280,7.200981e+05
2,1962-01-04,5.310444,0.584156,5.382963,5.304750,5.661492e+05
3,1962-01-05,5.218028,0.578683,5.314311,5.188456,5.300520e+05
4,1962-01-08,5.192786,0.568500,5.237573,5.117559,6.138960e+05
...,...,...,...,...,...,...
15788,2024-09-23,222.570479,221.895458,224.154341,220.092535,4.937914e+06
15789,2024-09-24,222.688543,222.558403,224.678268,220.081112,5.219927e+06
15790,2024-09-25,221.535270,222.857106,224.080609,220.104647,4.958486e+06
15791,2024-09-26,223.315329,223.190490,225.688436,220.722217,5.519059e+06


In [92]:
#df_sp500_mean_of_ticker.to_csv('df_sp500_cleaned.csv')

### Now looking into Gold

In [93]:
df_gold.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11604 entries, 0 to 11603
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      11604 non-null  object 
 1   Price     11604 non-null  float64
 2   Open      11604 non-null  float64
 3   High      11604 non-null  float64
 4   Low       11604 non-null  float64
 5   Change %  11604 non-null  float64
dtypes: float64(5), object(1)
memory usage: 544.1+ KB


In [94]:
df_gold['Date'] = pd.to_datetime(df_gold['Date'])

# CHANGED DELETED -- df_gold = df_gold.drop(columns=(['Open', 'High', 'Low', 'Change %']))
#df_gold.columns = [col + ' GOLD' if col != ['Date', 'Change %'] else col for col in df_sp500.columns]

df_gold.rename(columns={'Price':'Close Gold', 'Open':'Open Gold', 'High':'High Gold', 'Low':'Low Gold', 'Change %':'Change % Gold'}, inplace=True)

In [95]:
df_gold

Unnamed: 0,Date,Close Gold,Open Gold,High Gold,Low Gold,Change % Gold
0,2024-09-26,2675.57,2656.52,2685.61,2655.14,0.71
1,2024-09-25,2656.82,2655.90,2670.60,2649.84,0.00
2,2024-09-24,2656.70,2628.92,2664.47,2622.58,1.08
3,2024-09-23,2628.40,2621.81,2635.54,2613.60,0.25
4,2024-09-20,2621.96,2587.50,2625.79,2584.81,1.37
...,...,...,...,...,...,...
11599,1980-01-03,634.25,634.25,634.25,634.25,13.31
11600,1980-01-02,559.75,559.75,559.75,559.75,9.33
11601,1980-01-01,512.00,512.00,512.00,512.00,0.00
11602,1979-12-28,512.00,512.00,512.00,512.00,-0.68


### Now interest and inflation rates

In [96]:
df_interest_inflation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 904 entries, 0 to 903
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          904 non-null    int64  
 1   Month                         904 non-null    int64  
 2   Day                           904 non-null    int64  
 3   Federal Funds Target Rate     462 non-null    float64
 4   Federal Funds Upper Target    103 non-null    float64
 5   Federal Funds Lower Target    103 non-null    float64
 6   Effective Federal Funds Rate  752 non-null    float64
 7   Real GDP (Percent Change)     250 non-null    float64
 8   Unemployment Rate             752 non-null    float64
 9   Inflation Rate                710 non-null    float64
dtypes: float64(7), int64(3)
memory usage: 70.8 KB


In [97]:
# drop all columns we don't need
df_interest_inflation_dropped = df_interest_inflation.drop(columns=['Federal Funds Target Rate', 'Federal Funds Upper Target', 'Federal Funds Lower Target', 'Real GDP (Percent Change)'])

# Combine the columns Year, Month, Day into one DateTime column
df_interest_inflation_dropped['Date'] = pd.to_datetime(df_interest_inflation_dropped[['Year', 'Month', 'Day']])

# Insert the new column at the beginning
df_interest_inflation_dropped.insert(0, 'Date', df_interest_inflation_dropped.pop('Date'))

# Drop Year, Month and Day
df_interest_inflation_dropped = df_interest_inflation_dropped.drop(columns=['Year', 'Month', 'Day'])

# renaming column
df_interest_inflation_dropped.rename(columns={'Effective Federal Funds Rate':'Interest Rate'}, inplace=True)

# filling in the missing values
df_interest_inflation_dropped.ffill(inplace=True)
df_interest_inflation_dropped.bfill(inplace=True)

df_interest_inflation_dropped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 904 entries, 0 to 903
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               904 non-null    datetime64[ns]
 1   Interest Rate      904 non-null    float64       
 2   Unemployment Rate  904 non-null    float64       
 3   Inflation Rate     904 non-null    float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 28.4 KB
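The two key steps in the cell above, assembling a date from separate columns and filling gaps, can be sketched on toy data as follows. The values are invented for illustration. The sketch shows that `ffill` carries the last known value forward, while the subsequent `bfill` only affects gaps at the very start of the series.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "Month": [1, 2, 3, 4],
    "Day": [1, 1, 1, 1],
    "Inflation Rate": [None, 2.3, None, 2.1],
})

# pandas can assemble a datetime from year/month/day columns
toy["Date"] = pd.to_datetime(toy[["Year", "Month", "Day"]])
toy = toy.drop(columns=["Year", "Month", "Day"])

# Forward-fill first, then back-fill whatever is still missing at the start
filled = toy.ffill().bfill()
print(filled)
```

In this toy case the January value is copied backwards from February, which means that row contains information from the future. Whether that is acceptable depends on how the data is used later.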


In [98]:
df_interest_2017_to_now

Unnamed: 0,Effective Date,Rate (%)
0,09/16/2024,5.33
1,09/13/2024,5.33
2,09/12/2024,5.33
3,09/11/2024,5.33
4,09/10/2024,5.33
...,...,...
1932,01/09/2017,0.66
1933,01/06/2017,0.66
1934,01/05/2017,0.66
1935,01/04/2017,0.66


In [99]:
# removing all columns except the date and rate
# CHANGED UNNECESSARY -- df_interest_2017_to_now1 = df_interest_2017_to_now.filter(items=['Effective Date', 'Rate (%)'])

# Convert the current column with date to a column with datetime data type and drop the 'Effective Date'
df_interest_2017_to_now['Date'] = pd.to_datetime(df_interest_2017_to_now['Effective Date'])
df_interest_2017_to_now = df_interest_2017_to_now.drop(columns=['Effective Date'])

# renaming column
df_interest_2017_to_now.rename(columns={'Rate (%)':'Interest Rate'}, inplace=True)

df_interest_2017_to_now

Unnamed: 0,Interest Rate,Date
0,5.33,2024-09-16
1,5.33,2024-09-13
2,5.33,2024-09-12
3,5.33,2024-09-11
4,5.33,2024-09-10
...,...,...
1932,0.66,2017-01-09
1933,0.66,2017-01-06
1934,0.66,2017-01-05
1935,0.66,2017-01-04
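When dates arrive as text, passing an explicit format removes any ambiguity between day-first and month-first parsing. The sketch below assumes the `Effective Date` values are strings in `MM/DD/YYYY` form, as the output above suggests; the two rows are copied from that output.

```python
import pandas as pd

toy = pd.DataFrame({
    "Effective Date": ["09/16/2024", "01/04/2017"],
    "Rate (%)": [5.33, 0.66],
})

# An explicit month/day/year format avoids ambiguous parsing
toy["Date"] = pd.to_datetime(toy["Effective Date"], format="%m/%d/%Y")
toy = toy.drop(columns=["Effective Date"]).rename(columns={"Rate (%)": "Interest Rate"})
print(toy)
```

If the Excel reader has already converted the column to datetimes, the `format` argument is not needed.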


### Combining the two interest datasets
As we have two datasets with interest rate values, we will combine them so we can work with a single dataset.

In [100]:
# combining the dataframes with 'Interest Rate'
df_interest_combined = pd.concat([df_interest_inflation_dropped, df_interest_2017_to_now])

# Sorting by 'Date'
df_interest_combined = df_interest_combined.sort_values(by='Date').reset_index(drop=True)

# Removing duplicate dates where the datasets overlap
df_interest_combined = df_interest_combined.drop_duplicates(subset='Date')

Just to be sure, we will check whether any duplicate dates remain. We write a function for this, as we might need it again later.

In [101]:
def check_for_duplicate_dates(df, category):
  duplicate_dates = df[df.duplicated(subset=category)]
  print(duplicate_dates)

In [102]:
check_for_duplicate_dates(df_interest_combined, 'Date')


Empty DataFrame
Columns: [Date, Interest Rate, Unemployment Rate, Inflation Rate]
Index: []
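When two sources overlap on a date, it matters which row survives the de-duplication. The sketch below uses invented values and makes that choice explicit. It swaps in a stable sort (`kind="mergesort"`) so that rows from the first dataframe stay ahead of rows from the second for equal dates. This is an assumption added for illustration, since the default sort in pandas does not guarantee the order of ties.

```python
import pandas as pd

older = pd.DataFrame({
    "Date": pd.to_datetime(["2016-12-01", "2017-01-04"]),
    "Interest Rate": [0.54, 0.65],
    "Inflation Rate": [2.2, 2.3],
})
newer = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-04", "2017-01-05"]),
    "Interest Rate": [0.66, 0.66],
})

combined = (
    pd.concat([older, newer], ignore_index=True)
      .sort_values("Date", kind="mergesort")          # stable: ties keep concat order
      .drop_duplicates(subset="Date", keep="first")   # keep the row from `older`
      .reset_index(drop=True)
)
print(combined)
```

Using `keep="last"` instead would prefer the row from the newer dataset, at the cost of losing the inflation value for the overlapping date.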


In [103]:
#df_interest_combined.to_csv('df_interest_inflation.csv')

### Now let's have a look at Russell, Oil and CPI
As before, we will ensure the date has the datetime type, remove unnecessary columns and fill in any missing values.

In [104]:
# Dropping columns
russell2000_df = russell2000_df.drop(columns=['Adj Close'])

# converting date to datetime
russell2000_df['Date'] = pd.to_datetime(russell2000_df['Date'])
oil_df['Date'] = pd.to_datetime(oil_df['Date'])
cpi['Date'] = pd.to_datetime(cpi['Date'])

# renaming columns
# CHANGED TO BELOW -- russell2000_df.rename(columns={'Close':'Close Russell'}, inplace=True)
russell2000_df.columns = [col + ' RUSSELL2000' if col != 'Date' else col for col in russell2000_df.columns]
oil_df.rename(columns={'Price':'Close Oil'}, inplace=True)
cpi.rename(columns={'DATE':'Date'}, inplace=True)






In [105]:
cpi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 751 entries, 0 to 750
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      751 non-null    datetime64[ns]
 1   CPIAUCSL  751 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 11.9 KB


## Now that we have all the data needed, we will combine all data sets into one dataframe

In [106]:
# merging sp500 into interest and inflation df
df_combined = pd.merge(df_sp500_mean_of_date, df_interest_combined, on='Date', how='outer')
# cpi into combined
df_combined = pd.merge(df_combined, cpi, on='Date', how='left')
# russell into combined
df_combined = pd.merge(df_combined, russell2000_df, on='Date', how='outer')
# oil into combined
df_combined = pd.merge(df_combined, oil_df, on='Date', how='left')
# gold into combined
df_combined = pd.merge(df_combined, df_gold, on='Date', how='left')
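The choice between `how='outer'` and `how='left'` decides which dates end up in the combined frame. A minimal sketch on invented data, using the `indicator=True` option of `pd.merge` to show where each row came from:

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "Close SP500": [100.0, 101.0, 102.0],
})
monthly = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "Interest Rate": [1.55, 1.58],
})

# Outer keeps every date from both sides; the _merge column records the origin
outer = pd.merge(daily, monthly, on="Date", how="outer", indicator=True)
# Left keeps only the dates already present in `daily`
left = pd.merge(daily, monthly, on="Date", how="left")

print(outer["_merge"].value_counts())
print(left)
```

An outer merge adds rows for dates that exist only in the right-hand frame, which is one reason the combined dataframe can have more rows than the stock data alone. A left merge silently drops such dates instead.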

In [107]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16134 entries, 0 to 16133
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                16134 non-null  datetime64[ns]
 1   Close SP500         15793 non-null  float64       
 2   Open SP500          15793 non-null  float64       
 3   High SP500          15793 non-null  float64       
 4   Low SP500           15793 non-null  float64       
 5   Volume SP500        15793 non-null  float64       
 6   Interest Rate       2838 non-null   float64       
 7   Unemployment Rate   902 non-null    float64       
 8   Inflation Rate      902 non-null    float64       
 9   CPIAUCSL            722 non-null    float64       
 10  Open RUSSELL2000    8521 non-null   float64       
 11  High RUSSELL2000    8521 non-null   float64       
 12  Low RUSSELL2000     8521 non-null   float64       
 13  Close RUSSELL2000   8521 non-null   float64   

In [108]:
#df_combined.sort_values(by='Date')
df_combined.tail(20)

Unnamed: 0,Date,Close SP500,Open SP500,High SP500,Low SP500,Volume SP500,Interest Rate,Unemployment Rate,Inflation Rate,CPIAUCSL,...,High RUSSELL2000,Low RUSSELL2000,Close RUSSELL2000,Volume RUSSELL2000,Close Oil,Close Gold,Open Gold,High Gold,Low Gold,Change % Gold
16114,2024-08-30,219.069104,218.272765,220.220251,215.921716,5600102.0,5.33,,,,...,,,,,,2503.03,2520.89,2527.07,2494.36,-0.72
16115,2024-09-03,215.306547,218.145812,220.114214,213.635176,5874368.0,5.33,,,,...,,,,,,2492.76,2500.5,2506.44,2473.25,-0.26
16116,2024-09-04,215.590506,214.911333,217.541538,213.060707,5042043.0,5.33,,,,...,,,,,,2494.84,2492.94,2500.2,2471.95,0.08
16117,2024-09-05,213.950186,215.371075,216.580107,211.931461,4896872.0,5.33,,,,...,,,,,,2516.32,2495.5,2523.54,2493.76,0.86
16118,2024-09-06,211.66182,214.234007,216.127916,210.513362,5750606.0,5.33,,,,...,,,,,,2497.03,2516.9,2529.3,2485.15,-0.77
16119,2024-09-09,214.151627,212.933602,216.087077,211.360836,5409236.0,5.33,,,,...,,,,,,2505.25,2497.32,2507.42,2485.6,0.33
16120,2024-09-10,214.65175,214.670111,216.275041,211.822576,5237500.0,5.33,,,,...,,,,,,2516.12,2506.84,2518.57,2500.16,0.43
16121,2024-09-11,215.295092,214.218904,216.260468,209.775909,5743174.0,5.33,,,,...,,,,,,2511.44,2515.7,2529.4,2501.01,-0.19
16122,2024-09-12,216.604987,215.284442,217.767929,213.2422,5237635.0,5.33,,,,...,,,,,,2558.75,2512.02,2560.21,2511.02,1.88
16123,2024-09-13,218.516916,217.251977,220.027614,216.110638,4457100.0,5.33,,,,...,,,,,,2576.5,2556.52,2586.18,2556.52,0.69


In [109]:
#df_combined.tail(20)
#df_combined.to_csv('all data.csv')

#### As we will be filling in values repeatedly, we have created a function that does the job instead of repeating ourselves again and again

In [110]:
def fill_missing_values(df, exclude_column):
  # Work on a copy so the input dataframe is left untouched
  df_filled = df.copy()
  # Fill every column except the excluded one
  cols_to_fill = df.columns.difference([exclude_column])
  df_filled[cols_to_fill] = df[cols_to_fill].ffill().bfill()
  return df_filled

In [111]:
#df_filled = df_combined[['Interest Rate','Inflation Rate', 'Close SP500', 'CPIAUCSL', 'Close Russell', 'Close Oil', 'Close Gold']].ffill().bfill()
#df_filled['Date'] = df_combined['Date']

def fill_missing_values(df, exclude_column):
  # Saving the column that should not be filled
  excluded_data = df[exclude_column]

  # Removing the column that should not be filled
  df = df.drop(columns=[exclude_column])
    
  # Filling the missing values
  df = df.ffill().bfill()

  # Adding the saved column from first step
  df[exclude_column] = excluded_data
  return df
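To see what the helper does, here is a self-contained sketch that repeats the function in compact form and applies it to a toy dataframe. The values are invented. It illustrates how a sparsely observed series such as the interest rate gets stretched across every daily row.

```python
import pandas as pd

def fill_missing_values(df, exclude_column):
    """Forward-fill then back-fill every column except `exclude_column`."""
    excluded_data = df[exclude_column]
    filled = df.drop(columns=[exclude_column]).ffill().bfill()
    filled[exclude_column] = excluded_data
    return filled

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"]),
    "Close SP500": [None, 101.0, None, 103.0],
    "Interest Rate": [1.55, None, None, None],
})

toy_filled = fill_missing_values(toy, "Date")
print(toy_filled)
```

The excluded column ends up last in the result, which matches the column order seen in the filled dataframes. The input dataframe itself is not modified, because `drop` returns a new object.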


In [112]:
df_filled = fill_missing_values(df_combined, 'Date')

In [113]:
df_filled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16134 entries, 0 to 16133
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Close SP500         16134 non-null  float64       
 1   Open SP500          16134 non-null  float64       
 2   High SP500          16134 non-null  float64       
 3   Low SP500           16134 non-null  float64       
 4   Volume SP500        16134 non-null  float64       
 5   Interest Rate       16134 non-null  float64       
 6   Unemployment Rate   16134 non-null  float64       
 7   Inflation Rate      16134 non-null  float64       
 8   CPIAUCSL            16134 non-null  float64       
 9   Open RUSSELL2000    16134 non-null  float64       
 10  High RUSSELL2000    16134 non-null  float64       
 11  Low RUSSELL2000     16134 non-null  float64       
 12  Close RUSSELL2000   16134 non-null  float64       
 13  Volume RUSSELL2000  16134 non-null  float64   

In [114]:
df_filtered = df_combined[(df_combined['Date'] >= '1987-10-01') & (df_combined['Date'] < '2017-01-01')]

In [115]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7500 entries, 6675 to 14174
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                7500 non-null   datetime64[ns]
 1   Close SP500         7374 non-null   float64       
 2   Open SP500          7374 non-null   float64       
 3   High SP500          7374 non-null   float64       
 4   Low SP500           7374 non-null   float64       
 5   Volume SP500        7374 non-null   float64       
 6   Interest Rate       452 non-null    float64       
 7   Unemployment Rate   452 non-null    float64       
 8   Inflation Rate      452 non-null    float64       
 9   CPIAUCSL            351 non-null    float64       
 10  Open RUSSELL2000    7374 non-null   float64       
 11  High RUSSELL2000    7374 non-null   float64       
 12  Low RUSSELL2000     7374 non-null   float64       
 13  Close RUSSELL2000   7374 non-null   float64      

##### Values before filling in the data

Index: 7724 entries, 6699 to 14422
Data columns (total 8 columns):
     Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Date            7724 non-null   datetime64[ns]
 1   Interest Rate   452 non-null    float64       
 2   Inflation Rate  452 non-null    float64       
 3   Close SP500     7374 non-null   float64       
 4   CPIAUCSL        351 non-null    float64       
 5   Close Russell   7374 non-null   float64       
 6   Close Oil       7421 non-null   float64       
 7   Close Gold      7611 non-null   float64       

In [116]:
df_filtered_filled = fill_missing_values(df_filtered, 'Date')

In [117]:
df_filtered_filled

Unnamed: 0,Close SP500,Open SP500,High SP500,Low SP500,Volume SP500,Interest Rate,Unemployment Rate,Inflation Rate,CPIAUCSL,Open RUSSELL2000,...,Low RUSSELL2000,Close RUSSELL2000,Volume RUSSELL2000,Close Oil,Close Gold,Open Gold,High Gold,Low Gold,Change % Gold,Date
6675,9.135417,8.054483,9.184320,8.954092,3.787139e+06,7.29,6.0,4.3,115.000,170.820007,...,170.250000,171.399994,1.932000e+08,18.50,453.46,455.67,453.46,453.46,-1.20,1987-10-01
6676,9.175296,8.163558,9.257685,9.065771,3.981096e+06,7.29,6.0,4.3,115.000,171.399994,...,171.399994,172.080002,1.891000e+08,18.65,454.87,455.42,454.87,454.87,0.31,1987-10-02
6677,9.175669,8.191843,9.256174,9.065872,3.932984e+06,7.29,6.0,4.3,115.000,172.089996,...,172.089996,172.539993,1.597000e+08,18.78,456.83,457.02,456.83,456.83,0.43,1987-10-05
6678,8.993629,8.170532,9.203396,8.942656,4.409449e+06,7.29,6.0,4.3,115.000,172.550003,...,170.130005,170.210007,1.756000e+08,18.60,457.02,458.55,457.02,457.02,0.04,1987-10-06
6679,8.971139,8.015025,9.062113,8.852412,4.329170e+06,7.29,6.0,4.3,115.000,170.210007,...,168.479996,168.869995,1.863000e+08,18.58,457.63,459.41,457.63,457.63,0.13,1987-10-07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14170,83.971982,83.753608,84.327578,83.351554,3.852519e+06,0.54,4.7,2.2,242.637,1362.630005,...,1362.560059,1371.510010,2.020550e+09,53.93,1133.49,1128.89,1136.12,1128.18,0.43,2016-12-23
14171,84.235925,84.128478,84.744996,83.783470,5.111105e+06,0.54,4.7,2.2,242.637,1371.589966,...,1371.589966,1377.709961,1.987080e+09,53.93,1139.35,1134.06,1149.48,1131.85,0.48,2016-12-27
14172,83.446216,84.321632,84.563865,83.224388,7.919421e+06,0.54,4.7,2.2,242.637,1377.780029,...,1359.219971,1360.829956,2.392360e+09,54.95,1142.45,1139.64,1144.84,1136.54,0.27,2016-12-28
14173,83.529571,83.488670,84.079080,83.044043,7.475287e+06,0.54,4.7,2.2,242.637,1360.920044,...,1357.390015,1363.180054,2.336370e+09,54.97,1158.32,1142.39,1159.58,1140.81,1.39,2016-12-29


In [118]:
df_filtered_after1987 = df_combined[(df_combined['Date'] >= '1987-10-01')]

In [119]:
df_filtered_after1987_filled = fill_missing_values(df_filtered_after1987, 'Date')

df_filtered_after1987_filled.tail(20)

Unnamed: 0,Close SP500,Open SP500,High SP500,Low SP500,Volume SP500,Interest Rate,Unemployment Rate,Inflation Rate,CPIAUCSL,Open RUSSELL2000,...,Low RUSSELL2000,Close RUSSELL2000,Volume RUSSELL2000,Close Oil,Close Gold,Open Gold,High Gold,Low Gold,Change % Gold,Date
16114,219.069104,218.272765,220.220251,215.921716,5600102.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2503.03,2520.89,2527.07,2494.36,-0.72,2024-08-30
16115,215.306547,218.145812,220.114214,213.635176,5874368.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2492.76,2500.5,2506.44,2473.25,-0.26,2024-09-03
16116,215.590506,214.911333,217.541538,213.060707,5042043.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2494.84,2492.94,2500.2,2471.95,0.08,2024-09-04
16117,213.950186,215.371075,216.580107,211.931461,4896872.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2516.32,2495.5,2523.54,2493.76,0.86,2024-09-05
16118,211.66182,214.234007,216.127916,210.513362,5750606.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2497.03,2516.9,2529.3,2485.15,-0.77,2024-09-06
16119,214.151627,212.933602,216.087077,211.360836,5409236.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2505.25,2497.32,2507.42,2485.6,0.33,2024-09-09
16120,214.65175,214.670111,216.275041,211.822576,5237500.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2516.12,2506.84,2518.57,2500.16,0.43,2024-09-10
16121,215.295092,214.218904,216.260468,209.775909,5743174.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2511.44,2515.7,2529.4,2501.01,-0.19,2024-09-11
16122,216.604987,215.284442,217.767929,213.2422,5237635.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2558.75,2512.02,2560.21,2511.02,1.88,2024-09-12
16123,218.516916,217.251977,220.027614,216.110638,4457100.0,5.33,4.7,2.2,314.121,2312.570068,...,2312.570068,2329.360107,3077580000.0,93.59,2576.5,2556.52,2586.18,2556.52,0.69,2024-09-13


### We have 3 different dataframes
- one with all rows, with missing values filled in
- one filtered to dates between 1987 and 2017, with missing values filled in
- one filtered to dates after 1987, with missing values filled in

We did this because the original data had too many missing values, so we want to check which dataset works best for machine learning later on.
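One possible way to compare the candidate date ranges is to measure, before any filling, what share of each column was actually observed in each range. The sketch below is an assumption about how such a comparison could look; the helper name `observed_share` and the toy data are hypothetical.

```python
import pandas as pd

def observed_share(df, date_column, start=None, end=None):
    """Share of non-missing values per column for dates in [start, end)."""
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= df[date_column] >= start
    if end is not None:
        mask &= df[date_column] < end
    subset = df.loc[mask].drop(columns=[date_column])
    return subset.notna().mean().round(2)

# Toy data standing in for the combined, unfilled dataframe
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1986-01-01", "1990-01-01", "2000-01-01", "2020-01-01"]),
    "Close SP500": [1.0, 2.0, 3.0, 4.0],
    "Close Oil": [None, 20.0, 30.0, None],
})

share_all = observed_share(toy, "Date")
share_window = observed_share(toy, "Date", start="1987-10-01", end="2017-01-01")
print(share_all)
print(share_window)
```

A range in which most columns have a high observed share relies less on filled-in values, which may make it a better candidate for training.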

In [138]:
# saving all cleaned data to csv's
df_filtered_filled.to_csv('data between 1987 and 2017.csv', index=False)
df_filled.to_csv('data between 1952 and 2024.csv', index=False)
df_filtered_after1987_filled.to_csv('data after 1987.csv', index=False)


# 4. Data Exploration & Analysis

So far we have retrieved the data and cleaned it. Now we move on to the next step in the process: we will explore the data and analyse it.

I will first see all the values in a 