# Getting Started: Market Research
This Jupyter notebook is a quick demonstration of how to get started with the market research section.

## 1) Download Data
Please download the train and test data and place the files in the ./research/data path. If they are in the right place, the following cell should run without errors:

In [None]:
import pandas as pd

train_data = pd.read_csv('./data/train.csv')
new_cols_train = pd.read_csv('./data/train_new.csv')
train_data = pd.concat([train_data, new_cols_train], axis=1)

test_data = pd.read_csv('./data/test.csv')
new_cols_test = pd.read_csv('./data/test_new.csv')
test_data = pd.concat([test_data, new_cols_test], axis=1)


X_cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P']
y_cols = ['Y1', 'Y2']

print(train_data.head())
print(test_data.head())
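
If the loading cell fails, a quick way to see which file is missing is to check for each one explicitly. The sketch below is only an illustration: the helper name `missing_data_files` is hypothetical, and the file names are simply the four read by the cell above.

```python
from pathlib import Path

# File names taken from the loading cell above
EXPECTED_FILES = ["train.csv", "train_new.csv", "test.csv", "test_new.csv"]


def missing_data_files(data_dir="./data"):
    """Return the expected file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


missing = missing_data_files("./data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All data files found.")
```

An empty list means all four files are where the notebook expects them.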

## 2) Investigate the Dataset
In the datasets, you're given a time column and columns A through N, each of which represents some real-life market quantity. Note that the cell above also concatenates extra columns from train_new.csv and test_new.csv, and X_cols includes O and P. In the train dataset, you're also given Y1 and Y2, real-life market quantities you'd like to predict from time and the other columns. Y1 and Y2 are not included in the test set, because they are what you're being asked to predict.

Let's explore the relationships between A through N and Y1. We start with histograms of Y1 and Y2 and per-column summary statistics; the relationship between C and Y1 is examined further below.

In [None]:
import matplotlib.pyplot as plt

# Histograms of the two targets, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
_ = ax1.hist(train_data['Y1'], bins=100)
_ = ax2.hist(train_data['Y2'], bins=100)

In [None]:
# Restrict to the feature columns so the statistics line up with X_cols
m = train_data[X_cols].mean(axis=0)
std = train_data[X_cols].std(axis=0)
print("col, mean, std")
for col in X_cols:
    print(col, m[col], std[col])

## Emi's notes on this
* The biggest problem with this data is that the distributions of Y1 and Y2 are not normal. The goal is to maximize $R^2$, and the corresponding loss function is $\ell_2$ (squared error), which is the maximum-likelihood loss when the errors between $\hat y$ and $y$ are normally distributed. Ideally, Y1 and Y2 would be closer to normal, but they are not. Maybe resampling them will help?

* You should probably standardize each column (zero mean, unit standard deviation) because, on inspection, column A is much larger in magnitude than the others. (See the Linear regression cell below.)

* There are NaN values in O and P; they are replaced with 0 before fitting.
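
One way to put a number on "not normal" is to compute skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below is a minimal illustration on synthetic samples, not on the competition data; the helper name `skew_and_excess_kurtosis` is hypothetical and it uses population moments.

```python
import math
import random
import statistics


def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis); both are about 0 for normal data."""
    xs = [float(v) for v in values]
    xs = [x for x in xs if not math.isnan(x)]  # ignore missing values
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    z = [(x - mean) / std for x in xs]
    skew = statistics.fmean([v ** 3 for v in z])
    excess_kurt = statistics.fmean([v ** 4 for v in z]) - 3.0
    return skew, excess_kurt


# Synthetic samples for illustration only (assumption: not the real Y1/Y2)
random.seed(0)
normal_sample = [random.gauss(0.0, 1.0) for _ in range(20000)]
skewed_sample = [random.expovariate(1.0) for _ in range(20000)]

normal_skew, normal_kurt = skew_and_excess_kurtosis(normal_sample)
skewed_skew, skewed_kurt = skew_and_excess_kurtosis(skewed_sample)
print("normal sample:", normal_skew, normal_kurt)
print("skewed sample:", skewed_skew, skewed_kurt)
```

The same function can be applied to a column, for example `skew_and_excess_kurtosis(train_data['Y1'])`. Values far from 0 would support the observation in the first note.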


## Linear regression

In [None]:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Replace missing values with 0
train_data.fillna(0, inplace=True)

Xtrain = train_data[X_cols]
ytrain = train_data[y_cols]

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(Xtrain)
print(scaler.mean_)

Xtrain = scaler.transform(Xtrain)

# Fit both targets at once; score() returns R^2 on the given data
model = LinearRegression().fit(Xtrain, ytrain)
print("Train:", model.score(Xtrain, ytrain))

# coef_ has shape (n_targets, n_features): row 0 is Y1, row 1 is Y2
print("col, Y1, Y2")
for c1, c2, col in zip(model.coef_[0, :], model.coef_[1, :], X_cols):
    print(col, c1, c2)

# Left commented out: the test set has no Y1/Y2, so it cannot be scored locally
#Xtest = test_data[X_cols]
#ytest = test_data[y_cols]

#print("Test:", model.score(Xtest, ytest))
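
Because the test set has no Y1 or Y2, the only local estimate of out-of-sample $R^2$ comes from holding out part of the train set. The sketch below shows one simple option: keep the last rows as a validation block. It assumes the rows are already sorted by time, which the source does not state, and the helper name `holdout_split_indices` is hypothetical.

```python
def holdout_split_indices(n_rows, holdout_fraction=0.2):
    """Split row positions 0..n_rows-1 into (train, validation) lists.

    The validation block is the final holdout_fraction of the rows, so
    when rows are time-ordered the model is validated on later data only.
    """
    if n_rows < 2:
        raise ValueError("need at least two rows to split")
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    n_holdout = max(1, int(round(n_rows * holdout_fraction)))
    n_holdout = min(n_holdout, n_rows - 1)  # keep at least one training row
    cut = n_rows - n_holdout
    return list(range(cut)), list(range(cut, n_rows))


train_idx, val_idx = holdout_split_indices(10, holdout_fraction=0.2)
print(train_idx)
print(val_idx)
```

With pandas, the positions can be used as `train_data.iloc[train_idx]` and `train_data.iloc[val_idx]`. Fitting the scaler and the model on the first part only keeps the validation score honest.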

In [None]:
# Calculate correlation between C and Y1
correlation = train_data['C'].corr(train_data['Y1'])
print(f"Correlation between C and Y1: {correlation:.4f}")

Clearly there's a strong relationship between C and Y1. You should definitely use C to predict Y1!
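
The same check can be repeated for every feature to see how C compares with the rest. The sketch below is a plain-Python illustration on toy data: the column names and values are made up, and `pearson` and `rank_by_abs_correlation` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_abs_correlation(columns, target):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data for illustration only (assumption: not the real columns)
toy_target = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_columns = {
    "strong": [2.0, 4.1, 5.9, 8.2, 9.8],
    "weak": [3.0, 1.0, 4.0, 1.0, 5.0],
    "inverse": [5.0, 4.0, 3.0, 2.0, 1.0],
}

ranking = rank_by_abs_correlation(toy_columns, toy_target)
for name, corr in ranking:
    print(f"{name}: {corr:.4f}")
```

On the real data, `train_data[X_cols].corrwith(train_data['Y1'])` gives the per-column correlations in one call. Correlation only captures linear relationships, so a low value does not rule out a useful nonlinear one.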

## 3) Submit Predictions
To submit predictions, we need to make a CSV file with three columns: id, Y1, and Y2. In the example below, we predict Y1 and Y2 with their respective means in the train set.

In [None]:
# Copy so that adding columns below does not trigger SettingWithCopyWarning
preds = test_data[['id']].copy()
preds['Y1'] = train_data['Y1'].mean()
preds['Y2'] = train_data['Y2'].mean()
preds

In [None]:
# save preds to csv
preds.to_csv('preds.csv', index=False)
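
Before uploading, it can help to confirm that the file has the three columns described above and no missing or non-finite predictions. The sketch below is a minimal check, not the upload site's own validation. The helper name `check_submission` is hypothetical, and requiring the exact column order id, Y1, Y2 is an assumption.

```python
import csv
import math

# Assumption: the header must be exactly these columns, in this order
REQUIRED_COLUMNS = ["id", "Y1", "Y2"]


def check_submission(path):
    """Return a list of problems found in the submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != REQUIRED_COLUMNS:
            return [f"expected columns {REQUIRED_COLUMNS}, found {reader.fieldnames}"]
        # Line 1 is the header, so data rows start at line 2
        for line_no, row in enumerate(reader, start=2):
            for col in ("Y1", "Y2"):
                try:
                    value = float(row[col])
                except (TypeError, ValueError):
                    problems.append(f"line {line_no}: {col} is missing or not a number")
                    continue
                if not math.isfinite(value):
                    problems.append(f"line {line_no}: {col} is not finite")
    return problems
```

Calling `check_submission('preds.csv')` after the save cell should return an empty list for the mean-prediction example.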

You should now be able to submit preds.csv to [https://quantchallenge.org/dashboard/data/upload-predictions](https://quantchallenge.org/dashboard/data/upload-predictions)! Note that you should receive a public $R^2$ score of $-0.042456$ with this set of predictions. Try to get the highest possible $R^2$ score over the next few days. Be careful of overfitting to the public score, which is calculated on only a subset of the test data; the final score that counts is the private $R^2$ score!
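
A slightly negative score for the mean baseline follows from the definition $R^2 = 1 - SS_{res}/SS_{tot}$. Predicting the train mean scores exactly 0 on the train set, but on other data whose mean differs it scores below 0. The sketch below illustrates this with made-up numbers; it is a plausible explanation for the baseline score, not a confirmed one, and `r_squared` is a hypothetical helper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets for illustration only (assumption: not the real data)
train_y = [1.0, 2.0, 3.0, 6.0]  # mean is 3.0
test_y = [2.0, 3.0, 4.0, 7.0]   # mean is 4.0, different from the train mean

train_mean = sum(train_y) / len(train_y)
baseline_train = [train_mean] * len(train_y)
baseline_test = [train_mean] * len(test_y)

r2_train = r_squared(train_y, baseline_train)
r2_test = r_squared(test_y, baseline_test)
print("train R^2:", r2_train)
print("test R^2:", r2_test)
```

Any model worth submitting should beat this baseline, that is, score above 0 on held-out data.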