# Exploratory Data Analysis

In [None]:
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
from matplotlib import pyplot as plt

import os

In [None]:
os.chdir("../")
os.getcwd()

In [None]:
df = pd.read_csv("data/dataset_final_scenario_4.csv", index_col="Date")
df.head()

There are 1504 columns, so let's plot only the first 20:

In [None]:
df.iloc[:, :20].plot(figsize=(20,10))

The correlation matrix may turn out to be useful for handling missing values, since a highly correlated series can stand in for one with gaps:

In [None]:
corr_matrix = df.corr()

In [None]:
corr_matrix
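As a rough illustration of how correlations could help with missing values, the sketch below picks the column most correlated with a gappy one and uses a simple linear fit on it to fill the gap. The toy frame, the `best_proxy` helper and the linear-fit choice are all assumptions made for illustration; nothing here is taken from the actual dataset or from `DataManager`.

```python
import numpy as np
import pandas as pd


def best_proxy(frame: pd.DataFrame, target: str) -> str:
    """Return the column most correlated (in absolute value) with `target`.

    Hypothetical helper: `DataFrame.corr` uses pairwise-complete rows,
    so gaps in `target` do not prevent the computation.
    """
    corr = frame.corr()[target].drop(target)
    return corr.abs().idxmax()


# Synthetic stand-in for `df`: "b" tracks "a" closely, "c" is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200).cumsum()
toy = pd.DataFrame({
    "a": a.copy(),
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
toy.loc[toy.index[50:60], "a"] = np.nan  # simulate a gap in "a"

proxy = best_proxy(toy, "a")

# Fit a ~ intercept + slope * proxy on the rows where "a" is observed,
# then use the fitted line to fill the missing rows.
observed = toy["a"].notna()
slope, intercept = np.polyfit(toy.loc[observed, proxy], toy.loc[observed, "a"], 1)
filled = toy["a"].fillna(intercept + slope * toy[proxy])
```

Whether a linear fit on a single proxy is adequate for the real securities is an open question; it is only meant to show the mechanics.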

Looking at just the first 5 securities (which should all be bonds; to be checked), we already see very high correlations, possibly because they all belong to the same asset class. A similar plot over all available columns is not feasible, since it would need a 1504 × 1504 grid of panels.

In [None]:
scatter_matrix(df.iloc[:, :5], figsize=(12,8))
plt.show()
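When a full scatter matrix is out of reach, one possible alternative is to summarise the correlation matrix numerically, for instance by listing the most strongly correlated pairs. The sketch below does this on a toy frame; `top_correlated_pairs` is a hypothetical helper, not part of the project.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the `n` column pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data: "y" is a noisy copy of "x", "z" is unrelated noise.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
toy = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.05, size=300),
    "z": rng.normal(size=300),
})

top = top_correlated_pairs(toy, n=3)
```

On the real data the same call would be `top_correlated_pairs(df)`, assuming all columns of `df` are numeric.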

## A compressed version of the data

The idea is to group by asset class and get a smaller dataset to explore.  
We will use our custom `DataManager` for that:

In [None]:
from src.data_manager import DataManager

In [None]:
dm = DataManager()

In [None]:
dm.types

In [None]:
# Average over the securities of each asset class, date by date
class_means = {
    t: dm[t].data.drop(["Type", "mapping_id", "has_na"], axis=1).mean()
    for t in dm.types
}

In [None]:
# One column per asset class, one row per date
avg_df = pd.DataFrame(class_means)

In [None]:
avg_df.index = pd.to_datetime(avg_df.index)
avg_df.sort_index(inplace=True)

In [None]:
avg_df.head()
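For readers without access to `DataManager`, the same compression can be sketched with plain pandas. The frame `prices` and the `asset_class` mapping below are made up for illustration and assume a wide layout with dates as rows and securities as columns.

```python
import pandas as pd

# Hypothetical wide frame: rows are dates, columns are securities.
dates = pd.date_range("2020-01-01", periods=4, freq="D")
prices = pd.DataFrame(
    {
        "B1": [1.0, 2.0, 3.0, 4.0],
        "B2": [3.0, 4.0, 5.0, 6.0],
        "E1": [10.0, 11.0, 12.0, 13.0],
    },
    index=dates,
)

# Hypothetical mapping from security to asset class.
asset_class = pd.Series({"B1": "Bond", "B2": "Bond", "E1": "Equity"})

# Transpose so securities are rows, average within each class,
# then transpose back to get dates x asset classes.
class_avg = prices.T.groupby(asset_class).mean().T
```

The result has one column per asset class, which is the shape `avg_df` is expected to have.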

Now that we have `avg_df`, let's explore the relations across asset classes:

In [None]:
avg_df.plot(figsize=(20,10))

In [None]:
scatter_matrix(avg_df, figsize=(12,8))
plt.show()