# Exploration and Analysis

We have now created three CSV files. Because a larger portion of the data is missing from before 1987 and after 2024, we have split our data into three files:
1. Containing data from before 1987
2. Containing data from 1987 to 2017
3. Containing data from 1987 to 2024
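
As a rough illustration of how such a split can be made, the sketch below filters a small toy data frame by date. The toy values, the variable names, and the exact boundary dates (inclusive 1987-01-01 to 2017-12-31 and 2024-12-31) are assumptions made for the example, not the actual preprocessing used for our files.

```python
import pandas as pd

# Toy data standing in for the combined data set (values are made up)
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["1985-06-03", "1987-01-02", "2010-05-17", "2017-12-29", "2020-03-16"]
    ),
    "Close SP500": [189.3, 246.5, 1136.9, 2673.6, 2386.1],
})

# Boolean masks select the rows that fall in each period (boundaries are assumed)
before_1987 = df[df["Date"] < "1987-01-01"]
from_1987_to_2017 = df[(df["Date"] >= "1987-01-01") & (df["Date"] <= "2017-12-31")]
from_1987_to_2024 = df[(df["Date"] >= "1987-01-01") & (df["Date"] <= "2024-12-31")]

print(len(before_1987), len(from_1987_to_2017), len(from_1987_to_2024))
```

Each subset could then be written out with `to_csv(..., index=False)`.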

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Load the new combined CSVs with data

df_before_1987 = pd.read_csv('') #insert
df_1987_2017 = pd.read_csv('') #insert PRIMARY
df_1987_2024 = pd.read_csv('') #insert

In [None]:
df_1987_2017.info()

## Data overview
Let's see how the data looks in a line plot for the data frame df_1987_2017.

In [None]:
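# Note: every column name needs its own comma; adjacent string literals would be silently joined into one name
df_1987_2017.plot.line(y=['Interest Rate', 'Inflation Rate', 'Unemployment Rate', 'Volume SP500', 'Close SP500', 'Close Gold', 'Volume RUSSELL2000', 'Close RUSSELL2000', 'Close Oil', 'CPIAUCSL'], x='Date')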

There are big differences in scale between our series, and the high values of some columns make the others unreadable in this plot. That suggests we might have to normalize our data at some point to make it more comparable.

## Normalization
We want to normalize our data so that the columns are easier to compare, and then look at the line plot again.

In [None]:
# Initialize MinMaxScaler
scaler = MinMaxScaler()
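date_column = df_1987_2017['Date']
# All columns except Date are scaled (this assumes they are all numeric)
feature_columns = df_1987_2017.columns.drop('Date')
# Use scaler on df
df_1987_2017_scaled = scaler.fit_transform(df_1987_2017[feature_columns])

# Convert back to a data frame, keeping the original index so the concat below aligns row by row
df_1987_2017_scaled = pd.DataFrame(df_1987_2017_scaled, columns=feature_columns, index=df_1987_2017.index)

# Add the date column back by concatenating
df_1987_2017_scaled = pd.concat([date_column, df_1987_2017_scaled], axis=1)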

df_1987_2017_scaled.head()
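
To make clear what min-max scaling does, here is a minimal sketch that applies the formula (x - min) / (max - min) by hand to a toy data frame. The toy values are made up; with its default settings, `MinMaxScaler` computes the same per-column rescaling to the range 0 to 1.

```python
import pandas as pd

# Toy data with two columns on very different scales (values are made up)
toy = pd.DataFrame({
    "Interest Rate": [1.0, 2.0, 5.0],
    "Volume SP500": [1_000_000, 3_000_000, 2_000_000],
})

# Min-max scaling per column: subtract the column minimum, divide by the column range
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

After scaling, every column has a minimum of 0 and a maximum of 1, so all series fit on the same axis in a line plot.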

In [None]:
# Line plot of the scaled data set
df_1987_2017_scaled.plot.line(y=['Interest Rate', 'Inflation Rate', 'Unemployment Rate', 'Volume SP500', 'Close SP500', 'Close Gold', 'Volume RUSSELL2000', 'Close RUSSELL2000', 'Close Oil', 'CPIAUCSL'], x='Date')

## Data distribution
Now let's have a look at histograms for our data.

In [None]:
# Columns to check (all column names except Date)
columns_to_check = df_1987_2017.drop(columns=['Date']).columns

# Plot a histogram for each column
for col in columns_to_check:
    plt.figure(figsize=(6, 4))
    sns.histplot(df_1987_2017[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

Comment something about the histograms
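
As a possible numerical complement to reading the histograms by eye, the sketch below computes the skewness of each column with `DataFrame.skew()`. The toy columns and their values are made up for illustration and this check is a suggestion, not part of the analysis above.

```python
import pandas as pd

# Toy data: one symmetric column and one with a long right tail (values are made up)
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 10],
})

# Skewness near 0 suggests a symmetric distribution;
# a positive value suggests a longer tail to the right
skewness = toy.skew()

print(skewness)
```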

### Outliers
Now let's have a look at outliers in our data.

In [None]:
for col in columns_to_check:
    plt.figure(figsize=(6, 4))
    df_1987_2017.boxplot(column=[col])
    plt.title(f'Box plot for {col}')
    plt.show()

Comment something about the outliers
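
The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 x IQR rule. The sketch below applies that rule by hand to a toy series to show which values would be flagged. The values are made up, and the names `lower` and `upper` are just illustrative.

```python
import pandas as pd

# Toy series with one obviously extreme value (values are made up)
values = pd.Series([10, 12, 11, 13, 12, 11, 40], name="toy")

# Interquartile range: distance between the 25th and 75th percentiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values outside these bounds are drawn as individual points in a default box plot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```

Applying the same bounds per column would give a count of flagged values to go with each box plot.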

## Data correlation
Let's take a look at the initial correlation of our data.

In [None]:
plt.figure(figsize=(10, 8))
corrmatt_1987_2017 = df_1987_2017[['Interest Rate', 'Inflation Rate', 'Unemployment Rate', 'Volume SP500', 'Close SP500', 'Close Gold', 'Volume RUSSELL2000', 'Close RUSSELL2000', 'Close Oil', 'CPIAUCSL']].corr()
sns.heatmap(corrmatt_1987_2017, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation (1987-2017)')
plt.show()

Comment on the correlation
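
A heatmap with many columns can be hard to scan, so one option is to list the column pairs ranked by the absolute value of their correlation. The sketch below does this on a toy data frame. The column names `a`, `b`, `c` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to fall as a rises
})

corr = toy.corr()

# Keep only the upper triangle (without the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper_mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(pairs)
```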

## Feature engineering - adding columns with change values
We want to add columns with the changes between the Open and Close prices, and between the Low and High prices, for our indexes.

In [None]:
# Calculate the relative change between the Open and Close value on the same day (stored as a fraction of Open; multiply by 100 for percent)
df_1987_2017['OPEN_CLOSE_CHANGE_%_SP500'] = (df_1987_2017['Close SP500'] - df_1987_2017['Open SP500']) / df_1987_2017['Open SP500']
df_1987_2017['OPEN_CLOSE_CHANGE_%_RUSSELL2000'] = (df_1987_2017['Close RUSSELL2000'] - df_1987_2017['Open RUSSELL2000']) / df_1987_2017['Open RUSSELL2000']
df_1987_2017['OPEN_CLOSE_CHANGE_%_Gold'] = (df_1987_2017['Close Gold'] - df_1987_2017['Open Gold']) / df_1987_2017['Open Gold']
# Calculate the relative change between the Low and High value on the same day (stored as a fraction of Low; multiply by 100 for percent)
df_1987_2017['LOW_HIGH_CHANGE_%_SP500'] = (df_1987_2017['High SP500'] - df_1987_2017['Low SP500']) / df_1987_2017['Low SP500']
df_1987_2017['LOW_HIGH_CHANGE_%_RUSSELL2000'] = (df_1987_2017['High RUSSELL2000'] - df_1987_2017['Low RUSSELL2000']) / df_1987_2017['Low RUSSELL2000']
df_1987_2017['LOW_HIGH_CHANGE_%_Gold'] = (df_1987_2017['High Gold'] - df_1987_2017['Low Gold']) / df_1987_2017['Low Gold']

# Calculate other changes as absolute differences from the previous row (open question: percent or absolute?)
df_1987_2017['Interest_Rate_Change'] = df_1987_2017['Interest Rate'].diff()
df_1987_2017['Inflation_Rate_Change'] = df_1987_2017['Inflation Rate'].diff()
df_1987_2017['Unemployment_Rate_Change'] = df_1987_2017['Unemployment Rate'].diff()
df_1987_2017['CPI_Change'] = df_1987_2017['CPIAUCSL'].diff()  # the CPI column is named 'CPIAUCSL' in the cells above

# Volume changes
df_1987_2017['VOLUME_CHANGE_%_RUSSELL2000'] = df_1987_2017['Volume RUSSELL2000'].pct_change()
df_1987_2017['VOLUME_CHANGE_%_SP500'] = df_1987_2017['Volume SP500'].pct_change()
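
To help decide between absolute and percent changes, the sketch below shows what `diff()` and `pct_change()` return for the same toy series. The values are made up. Note that both leave the first row as `NaN`, because there is no previous row to compare with.

```python
import pandas as pd

# Toy series standing in for a rate column (values are made up)
rate = pd.Series([2.0, 2.5, 2.0], name="Interest Rate")

# Absolute change from the previous row, in the same unit as the series
abs_change = rate.diff()

# Relative change from the previous row, as a fraction (0.25 means +25 %)
rel_change = rate.pct_change()

print(pd.DataFrame({"rate": rate, "abs_change": abs_change, "rel_change": rel_change}))
```

For columns that are already expressed in percent, such as rates, the absolute difference reads as a change in percentage points, while `pct_change()` gives the change relative to the previous level.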


