# Jane Street Market Prediction - EDA
Here we take a first look at the data for the Jane Street Market Prediction competition. The aim of this notebook is not to draw conclusions but to present an overview of the available data, which may be helpful later on.

## 1. Import Modules and Data
Let's start by importing the modules and the data for this competition.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

# install datatable
!pip install datatable > /dev/null
import datatable as dt

We follow the steps by [Carl McBride Ellis](https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance) and use datatable to load the data faster.

In [None]:
# Load in the data using datatable for speed
train_data_datatable = dt.fread('../input/jane-street-market-prediction/train.csv')

# Convert the datatable into a pandas dataframe
train_data = train_data_datatable.to_pandas()
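If datatable cannot be installed in a given environment, the same loading step can fall back to pandas. The following is a minimal sketch, not part of the original notebook: the helper name `load_csv` and the tiny demo file are hypothetical, and only `dt.fread(...).to_pandas()` and `pd.read_csv` are taken from the respective libraries.

```python
import os
import tempfile

import pandas as pd


def load_csv(path):
    """Load a CSV into a pandas DataFrame, preferring datatable when available."""
    try:
        import datatable as dt  # optional dependency, used only for speed
    except ImportError:
        # Fallback: plain pandas, typically slower on large files
        return pd.read_csv(path)
    return dt.fread(path).to_pandas()


# Demo on a tiny made-up file (the values are illustrative, not competition data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w') as f:
        f.write('weight,resp\n0.0,0.01\n1.5,-0.02\n')
    demo = load_csv(demo_path)

print(demo.shape)
```

In the notebook this would be called as `train_data = load_csv('../input/jane-street-market-prediction/train.csv')`.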

We now have the training data. Let's take a look at the different features.

In [None]:
print(train_data.columns)
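A long list of column names can be easier to read when grouped by prefix. The sketch below is one possible way to do that; the helper `group_columns` and the example column names are illustrative assumptions, not the actual column list of the training data.

```python
import re
from collections import defaultdict


def group_columns(columns):
    """Group column names by prefix, ignoring a trailing '_<number>' suffix."""
    groups = defaultdict(list)
    for name in columns:
        prefix = re.sub(r'_\d+$', '', name)
        groups[prefix].append(name)
    return dict(groups)


# Hypothetical column names, used only to demonstrate the helper
example_columns = ['weight', 'resp', 'resp_1', 'resp_2', 'feature_0', 'feature_1']
column_groups = group_columns(example_columns)

for prefix, names in column_groups.items():
    print(prefix, len(names))
```

On the real data, `group_columns(train_data.columns)` would give the number of columns in each group.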

## Weight Feature

In [None]:
# Find what percentage of weights are zero (the mean of a boolean mask is the fraction of True values)
print(100 * (train_data['weight'] == 0).mean())

In [None]:
weights = train_data['weight'].values
non_zero_weights = weights[weights > 0.0]

# Top row: zoom on weights in [0, 2]; bottom row: full range with a log-scaled y-axis
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs = axs.flatten()
axs[0].set_title('All Weights')
axs[0].hist(weights, bins=300, range=(0, 2))
axs[2].hist(weights, bins=300, range=(0, np.max(weights)))

axs[1].set_title('Non Zero Weights')
axs[1].hist(non_zero_weights, bins=300, range=(0, 2))
axs[3].hist(non_zero_weights, bins=300, range=(0, np.max(weights)))

axs[2].set_yscale('log')
axs[3].set_yscale('log')

## Resp Features

In [None]:
fig, axs = plt.subplots(3,2, figsize=(15,10))
axs = axs.flatten()
axs[0].hist(train_data['resp'], bins=300, range=(-0.1, 0.1), label='resp')
for i in range(1,5):
    axs[i].hist(train_data['resp_'+str(i)], bins=300, range=(-0.1, 0.1), label='resp_'+str(i))
# Remove the unused sixth panel and add legends only to the panels that hold data
fig.delaxes(axs[-1])
for ax in axs[:-1]:
    ax.legend()

How are these resp features related to each other?

In [None]:
sns.pairplot(train_data[['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']])
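If the pair plot is slow on the full training set, one option is to plot a random subsample instead. The helper below is a sketch under that assumption; `sample_for_pairplot`, its default sample size, and the synthetic data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def sample_for_pairplot(df, columns, n=10_000, seed=0):
    """Return at most n randomly chosen rows of the given columns."""
    subset = df[columns]
    if len(subset) <= n:
        return subset
    return subset.sample(n=n, random_state=seed)


# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50_000, 3)), columns=['resp', 'resp_1', 'resp_2'])
small = sample_for_pairplot(toy, ['resp', 'resp_1'], n=1_000)
print(small.shape)
```

The result can be passed straight to seaborn, for example `sns.pairplot(sample_for_pairplot(train_data, ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']))`. A fixed seed keeps the plot reproducible between runs.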


The feature resp appears to be most strongly correlated with resp_4. Since resp measures the return over a given time horizon, resp_4 likely covers the horizon most similar to that of resp.
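The visual impression from the pair plot can be checked numerically with a Pearson correlation. The sketch below uses a hypothetical helper, `resp_correlations`, and synthetic data built so that one column tracks the target closely; it illustrates the check and does not reproduce results from the competition data.

```python
import numpy as np
import pandas as pd


def resp_correlations(df, target='resp'):
    """Pearson correlation of every other column with the target, highest first."""
    corr = df.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Synthetic stand-in: 'resp_4' is built to follow 'resp' more closely than 'resp_1'
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
toy = pd.DataFrame({
    'resp': base,
    'resp_1': base + rng.normal(scale=2.0, size=1000),
    'resp_4': base + rng.normal(scale=0.2, size=1000),
})
ranking = resp_correlations(toy)
print(ranking)
```

On the real data this would be `resp_correlations(train_data[['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']])`, which ranks the other resp columns by how closely they follow resp.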