# Report

### Problem Overview

The problem set consists of 299 weekly time series, which we need to filter in order to predict the values of two target series.
I have decided to break this into two univariate (single-target) analyses, on the idea that we should ascertain the significance of our features to each of the two target series individually.

### Data and modelling approach

**The data:**

Firstly, we investigate the data by looking at summary statistics for the dataframe. We then split the data into a matrix of features (X) and a dataframe containing the two target variable vectors (y), before visualising the time series in appropriately sized tranches. As there are 299 features we visualise in tranches of 50.  
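
The split and the tranching described above can be sketched roughly as follows. The column names (`feature_1` ... `feature_299`, `target_1`, `target_2`), the helper names and the synthetic data are all assumptions for illustration, not taken from the actual dataset or code.

```python
import numpy as np
import pandas as pd

def split_features_targets(df, target_cols):
    """Return (X, y): the feature matrix and the dataframe of targets."""
    target_cols = list(target_cols)
    return df.drop(columns=target_cols), df[target_cols]

def column_tranches(columns, size=50):
    """Yield successive groups of at most `size` column names."""
    columns = list(columns)
    for start in range(0, len(columns), size):
        yield columns[start:start + size]

# Synthetic stand-in for the real data: 299 features plus two targets
# (hypothetical names and random values).
rng = np.random.default_rng(0)
names = [f"feature_{i}" for i in range(1, 300)] + ["target_1", "target_2"]
df = pd.DataFrame(rng.normal(size=(20, 301)), columns=names)

X, y = split_features_targets(df, ["target_1", "target_2"])
tranches = list(column_tranches(X.columns, size=50))
tranche_sizes = [len(t) for t in tranches]
```

Each tranche can then be passed to a plotting call such as `X[tranche].plot()`, which requires matplotlib.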

The plots reveal numerous outliers. Further inspection shows that certain series have means away from 0.
Given that this project is for a hedge fund, and that most of the series have means around 0, it is plausible that these are weekly returns data.
We do not rely on this assumption, however, and instead take a naive approach to the dataset.

There are multiple series with missing values. These series do not seem to be unclean, but rather it seems the series start at different points in time. There do not seem to be missing values between the start and end of non-null values in any given series.
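
One way to check the claim that missing values only occur before a series starts (or after it ends) is sketched below. The helper name `has_interior_gaps` and the two example series are hypothetical.

```python
import numpy as np
import pandas as pd

def has_interior_gaps(series):
    """True if a NaN lies between the first and last non-null values."""
    first = series.first_valid_index()
    last = series.last_valid_index()
    if first is None:  # entirely empty series: there is no interior to inspect
        return False
    return bool(series.loc[first:last].isna().any())

late_start = pd.Series([np.nan, np.nan, 0.1, -0.2, 0.3])  # starts late, no gaps
gappy = pd.Series([0.1, np.nan, 0.3, 0.4, np.nan])        # NaN inside the series

late_start_has_gaps = has_interior_gaps(late_start)
gappy_has_gaps = has_interior_gaps(gappy)
```

Applied column by column (for example `X.apply(has_interior_gaps)`), this flags any series that would need imputation.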

Certain series exhibit seasonality, while others do not. This was assessed by plotting the series on a monthly, quarterly and yearly basis.
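
As a rough illustration of a seasonality check, a weekly series can be grouped by calendar month, quarter and year, and the group means compared. The dates and the sinusoidal series below are invented for the sketch; the report does not state the actual date range.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly index and a synthetic series with a yearly cycle.
dates = pd.date_range("2015-01-04", periods=520, freq="W")
values = np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
series = pd.Series(values, index=dates)

# Average value per calendar month, quarter and year.
monthly_profile = series.groupby(series.index.month).mean()
quarterly_profile = series.groupby(series.index.quarter).mean()
yearly_profile = series.groupby(series.index.year).mean()

# A large spread across months suggests a seasonal pattern.
monthly_spread = monthly_profile.max() - monthly_profile.min()
```

A series without seasonality would show a roughly flat monthly profile, whereas this synthetic one swings between clearly positive and clearly negative months.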

**The Model:**

In this analysis we opt to use XGBoost. This is for multiple reasons: 

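1. XGBoost is insensitive to differently scaled features so long as we use a decision tree model as its basis, because tree splits depend only on the ordering of values. Its predictions are likewise robust to multicollinearity, although correlated features may share importance between them.

2. XGBoost provides built-in cross-validation.

3. XGBoost handles NaN values natively: at each split it learns a default direction for missing values, so series that start late need no imputation. Series with too few values to offer a useful split are simply never selected.

4. XGBoost is able to capture seasonal and trend patterns that are present in the features, although as a tree model it cannot extrapolate a trend beyond the range seen in training.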

**Cleaning:**

As there are no missing values between the first and last observations of any series, we assume that no imputation is needed on the raw data.

We remove rows with NaN values in the target series, before splitting the dataset into the feature matrix and a dataframe containing the two target variable vectors.

We remove columns with fewer non-null values than the 20% of rows required for a full test set in a classic 80/20 train/test split.
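
A minimal sketch of these two cleaning steps, assuming the 20% refers to 20% of the rows that remain after dropping NaN targets. The function name `clean`, the column names and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def clean(df, target_col, test_fraction=0.2):
    """Drop rows with a NaN target, then drop columns with too few values."""
    rows = df.dropna(subset=[target_col])
    # Minimum non-null count: enough values to fill a full test set.
    min_count = int(np.ceil(test_fraction * len(rows)))
    # `thresh` keeps only the columns with at least `min_count` non-null values.
    return rows.dropna(axis=1, thresh=min_count)

nan = np.nan
df = pd.DataFrame({
    "target": [nan, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "a": np.arange(10, dtype=float),   # complete series
    "b": [1.0] + [nan] * 8 + [2.0],    # one usable value once row 0 is dropped
    "c": [nan] * 8 + [1.0, 2.0],       # two usable values, just enough
})

cleaned = clean(df, "target")
kept_columns = list(cleaned.columns)
```

Here nine rows survive, so the minimum count is two: column `b` is dropped and column `c` is kept.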

**Feature Selection:**

We opt for a Pearson correlation comparison as an easy way to distinguish between useful and less useful features. This is due to its simplicity, and the fact that XGBoost will be able to handle series of differing lengths.

Relevant features are selected on the criterion that their Pearson coefficient with the target is either above 0.3 or below -0.3, because coefficients between these thresholds are negligible. We keep low-scoring features (between 0.3 and 0.5, and between -0.3 and -0.5) so as to maximise information. It turns out XGBoost deals with these features effectively.
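
The threshold rule can be sketched with pandas as below. The function name `select_by_pearson` and the toy data are assumptions; `DataFrame.corrwith` computes each coefficient on the rows where both the feature and the target are present, which is what lets series of differing lengths be compared.

```python
import numpy as np
import pandas as pd

def select_by_pearson(X, y, threshold=0.3):
    """Return the Pearson coefficients whose absolute value exceeds `threshold`."""
    corr = X.corrwith(y)  # Pearson by default, NaN pairs are dropped
    return corr[corr.abs() > threshold]

y = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
X = pd.DataFrame({
    "pos": [-2.0, -1.0, 0.0, 1.0, 3.0],      # strongly positive
    "neg": [2.0, 1.0, 0.0, -1.0, -2.0],      # perfectly negative
    "weak": [1.0, 0.0, -2.0, 0.0, 1.0],      # uncorrelated with y
    "short": [np.nan, np.nan, 0.0, 1.0, 2.0] # starts late
})

selected = select_by_pearson(X, y)
selected_names = list(selected.index)
```

Running this once per target gives the two candidate feature sets.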

### Results

Initial regression yielded satisfactory performance. The R-squared scores and RMSE values were as follows:

Target 1: R^2 - 0.984323; RMSE - 0.00859

Target 2: R^2 - 0.949335; RMSE - 0.03925

However, multiple features in each feature set had an importance score of zero. This meant they were unused, and should be jettisoned.

It was also possible that shorter series used in the model were unnecessary to its performance, so they were also considered for removal.

**Tuning:**

After training and testing the model on these first sets of selected features, we removed the features with lower counts of non-null values. We settled on an arbitrary bound of 510 non-null values to cut out shorter series, but kept series that fell short of this bound by fewer than 10 values. This was done to maximise the information passed to the model, in case series with, say, 508 or 514 values were important to performance.

The model was then refined by removing the features with importance scores of 0.
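
Both tuning filters can be sketched as follows. The helper names, the toy columns and the importance values are hypothetical; in practice the importances would come from the fitted model, and the tolerance reading (keep a series if it is short of 510 by fewer than 10 values) is an interpretation of the rule described above.

```python
import numpy as np
import pandas as pd

def drop_short_series(X, bound=510, tolerance=10):
    """Keep columns that fall short of `bound` by fewer than `tolerance` values."""
    counts = X.notna().sum()
    return X.loc[:, counts > bound - tolerance]

def drop_zero_importance(X, importances):
    """Keep only the columns with a non-zero importance score."""
    used = importances[importances > 0].index
    return X[[col for col in X.columns if col in used]]

n_rows = 520

def late_starter(count):
    """A series of length n_rows whose last `count` values are non-null."""
    values = np.full(n_rows, np.nan)
    values[n_rows - count:] = 1.0
    return values

X = pd.DataFrame({
    "full": late_starter(520),
    "near_bound": late_starter(508),  # within 10 of the bound, so kept
    "short": late_starter(400),       # well below the bound, so dropped
})
X_long = drop_short_series(X)

# Made-up importance scores standing in for the model's output.
importances = pd.Series({"full": 0.9, "near_bound": 0.0})
X_final = drop_zero_importance(X_long, importances)
```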

**Results of Tuning:**

Removing the features with importance scores of 0 left R^2 unchanged. The RMSE was of course slightly different due to the randomness of the tree model.

Removing the shorter series yielded the following results:

Target 1: R^2 - 0.984428; RMSE - 0.00859

Target 2: R^2 - 0.949335; RMSE - 0.03923

The R^2 scores were essentially unchanged. As R^2 can be inflated by throwing as many features as possible at the model, the fact that it is unchanged when using fewer features is gratifying.

RMSE is again variable, but it is included for completeness. It was also about the same.
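
For reference, the two metrics reported in this section can be computed from predictions as in the sketch below. It uses the standard definitions; the function names and the four data points are illustrative only, not the project's predictions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]  # one prediction is off by 1

score = r_squared(y_true, y_pred)
error = rmse(y_true, y_pred)
```

Because RMSE is expressed in the units of the target, the values for Target 1 and Target 2 are only comparable if the two targets are on similar scales.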

### Conclusions

Solely based on correlating individual features with the two target sets, we see the following:

Most Significant to Target 1:
- Feature 251
- Feature 267
- Feature 138

Significant to Target 1:
- Feature 266
- Feature 297

Most Significant to Target 2:
- Feature 260
- Feature 261
- Feature 264

Significant to Target 2:
- Feature 265
- Feature 299
- Feature 297
- Feature 298
- Feature 261
- Feature 262
- Feature 294
- Feature 175

After tuning XGBoost, its importance score shows us the most significant factors are as follows:

Target 1:
- Feature 251 by far (0.94)
- Feature 267 (0.02)
- Feature 266 (0.007)

Target 2:
- Feature 260 by far (0.92)
- Feature 267 (0.022)
- Feature 255 (0.014)
- Feature 254 (0.009)
- Feature 257 (0.008)

### Improvements

- Making the code more Pythonic by defining classes/functions to avoid duplicated code
- Improving individual feature selection by running simple OLS regressions instead of simply finding the Pearson coefficient
- Further tuning potentially by reducing features further
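
As a rough sketch of the OLS idea in the list above, a single-feature regression can be fitted with NumPy alone. The function name `simple_ols` and the data are hypothetical. Note that for a single regressor the R^2 of the fit equals the squared Pearson coefficient, so what OLS would add to the selection step is the slope and a t-statistic for its significance.

```python
import numpy as np

def simple_ols(x, y):
    """Fit y = intercept + slope * x; return (slope, intercept, t-statistic of slope)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mask = ~(np.isnan(x) | np.isnan(y))  # use only rows where both are present
    x, y = x[mask], y[mask]
    n = len(x)
    x_centered = x - x.mean()
    sxx = np.sum(x_centered ** 2)
    slope = np.sum(x_centered * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Standard error of the slope with n - 2 degrees of freedom.
    slope_se = np.sqrt(np.sum(residuals ** 2) / (n - 2) / sxx)
    return slope, intercept, slope / slope_se

x = [0.0, 1.0, 2.0, 3.0, 4.0, np.nan]  # last pair is ignored
y = [0.1, 0.9, 2.1, 2.9, 4.0, 5.0]

slope, intercept, t_stat = simple_ols(x, y)
```

Features could then be ranked or filtered by the absolute t-statistic rather than by a fixed correlation threshold.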