## Short-term market timing strategy based on boosting ML algos

This is the course project for Machine Learning for Finance at PHBS, 2019-2020 Module 3.

* Yifan Hu/Evan        1901212691  [eiahb3838ya](https://github.com/eiahb3838ya) 
* Yuting Fang/Trista   1901212576  [ytfang222](https://github.com/ytfang222) 
* Zhihao Chen/Alfred   1901212567  [AlfredChenZH](https://github.com/AlfredChenZH) 
* Zilei Wang/Lorelei   1901212645  [LoreleiWong](https://github.com/LoreleiWong) 

### PART1 Introduction

#### 1.1 Motivation

As the global financial market generates massive amounts of data of different types every day, it is becoming more crucial and more **difficult to effectively extract and use these data to predict stock trends**. A short-term timing strategy faces the following difficulties:

1. Market sentiments strongly influence the short-term market trend;
2. How to extract effective factors;
3. How to build nonlinear factors;
4. How to solve collinearity among factors.

#### 1.2 Our project goal

In this project, we treat the **direction of the next price move (up or down)** as a **classification problem** and apply several **machine learning algorithms** to predict whether the **WindA Index (Y)** ([881001.csv](00%20data/881001.csv)), an index tracking Chinese A-share stocks, will rise or fall, in order to build a **short-term timing strategy**.

#### 1.3 Brief Summary of Dataset

X consists of **macroeconomic data for China** ([cleanedFactor.pkl](00%20data/cleanedFactor.pkl)) plus **US index indicators** such as [DJI.GI, NQ.CME](00%20data/AddNewData). We also use the OHLC prices of WindA to **build some features (alphas)**.  
Y is a **0/1 label for WindA** on the next trading day (1 if the index goes up, 0 otherwise).  
The total number of features is 60.  
Time period: 2008-04-01 to 2020-03-06.  
The data can be acquired from Wind Database directly. All factors are based on daily frequency data.
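
As a rough illustration of how a next-day label and a `_pctChange5`-style feature could be derived from a daily close series, here is a minimal sketch. The prices are made up, the variable names are hypothetical, and the project's own preprocessing may differ (for example in how the last day is labelled).

```python
import pandas as pd

# Hypothetical closing prices for seven trading days (not real WindA data).
close = pd.Series(
    [100.0, 99.0, 99.5, 100.5, 98.0, 98.5, 101.0],
    index=pd.date_range("2020-01-01", periods=7, freq="B"),
    name="close",
)

# Return realized on each day relative to the previous close.
daily_return = close.pct_change()

# Label for day t: 1.0 if the return on day t+1 is positive, else 0.0.
tomorrow_up = (daily_return.shift(-1) > 0).astype(float)
# Assumption of this sketch: the last day has no "tomorrow", so mark it as missing.
tomorrow_up.iloc[-1] = float("nan")

# Five-day percentage change, analogous to the *_pctChange5 columns.
pct_change_5 = close.pct_change(5)

features = pd.DataFrame(
    {"return": daily_return, "pctChange5": pct_change_5, "tomorrowUp": tomorrow_up}
)
print(features)
```

Shifting the return by one day keeps the label strictly in the future relative to the features on the same row, which is what a next-day classification target requires.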

#### 1.4 Dataset sample

![images](picture/features.png)

#### 1.5 Workflow

![images](picture/workFlow.png)

We implement feature selection functions and build Myclassifiers classes using logistic regression, naive Bayes, KNN, perceptron, decision tree, SVM, XGBoost and a Sequential neural network model in Keras to fit the data and then predict whether the WindA Index goes up or down on the next day. 

#### 1.6 Rolling Prediction 

Because financial data are time series, we implement an **expanding window** training and prediction procedure: 
1. We use at least 1800 days of data as the training set and tune the hyperparameters with k-fold cross validation to find the best model, so the first signal is available on day 1801.
2. The signal is the predicted direction (up or down) of the WindA Index on the next day. If the signal is 1, we buy the WindA Index at the close of the day. If it is 0, we short WindA or do nothing at the close of the day.
3. We use the best model from Step 1 for 20 consecutive trading days, then add those 20 days of data to the training set and return to Step 1.

![images](picture/rollingprediction.png)
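
The expanding-window procedure above can be sketched as an index generator. This is only an illustration under stated assumptions: `expanding_windows` is a hypothetical helper (not part of the project code), the defaults of 1800 training days and 20-day steps come from the description above, and hyperparameter tuning and model fitting inside each window are left out.

```python
def expanding_windows(n_obs, min_train=1800, step=20):
    """Yield (train_slice, test_slice) pairs for an expanding-window scheme."""
    train_end = min_train
    while train_end < n_obs:
        test_end = min(train_end + step, n_obs)
        # Train on everything seen so far, predict the next `step` days.
        yield slice(0, train_end), slice(train_end, test_end)
        # The predicted block joins the training set for the next round.
        train_end = test_end


# 2903 is the number of observations reported after cleaning.
windows = list(expanding_windows(n_obs=2903))
first_train, first_test = windows[0]
last_train, last_test = windows[-1]
print(len(windows), first_test, last_test)
```

Each pair can be used as `X.iloc[train_slice]` and `X.iloc[test_slice]`, so the model never sees observations later than the block it is asked to predict.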

### PART2 Data Preprocessing & Feature Selection

We downloaded the raw data from the Wind database in separate categories, so it took some time to concatenate the data and handle code issues. This part is tedious, so we skipped it in the presentation.  
Many thanks to Evan for doing this patiently and carefully :) 

#### 2.1 Handling NaN 

We then count the NaN values in each factor, as shown in the image below. After dropping all NaN values, including non-trading-day rows and other missing data, we get a dataframe of 2,903 observations.

In [20]:
import pandas as pd
import numpy as np
import plotly 
import os,sys
import matplotlib.pyplot as plt
ROOT = '../'
FACTOR_PATH = os.path.join(ROOT, '02 data process')
outputDir = os.path.join(ROOT, '02 data process')

X_df = pd.read_pickle(os.path.join(FACTOR_PATH, 'factor.pkl'))
X_df.head()

| date | IBO001 | R007 | B0 | IBO001_pctChange5 | R007_pctChange5 | B0_pctChange5 | SHIBORO/N | SHIBOR1W | SHIBOR2W | SHIBOR1M | ... | ETFVolatility120 | ETFVolatility60_pctChange5 | ETFVolatility120_pctChange5 | mktVolume | mktVolume_pctChange5 | mktClose_pctChange5 | ETFReturn | ETFTomorrowUp | windAReturn | windATomorrowUp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2007-09-03 | 1.8289 | 2.4612 | 2.4713 | NaN | NaN | NaN | 1.8197 | 2.4963 | 2.7768 | 2.9081 | ... | NaN | NaN | NaN | 19342430000.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 |
| 2007-09-04 | 1.8828 | 2.178 | 2.1805 | NaN | NaN | NaN | 1.886 | 2.2348 | 2.774 | 2.9625 | ... | NaN | NaN | NaN | 18337700000.0 | NaN | NaN | -0.016827 | 0.0 | -0.009198 | 1.0 |
| 2007-09-05 | 1.8201 | 2.3618 | 2.3678 | NaN | NaN | NaN | 1.8122 | 2.3683 | 2.9631 | 3.0903 | ... | NaN | NaN | NaN | 14945670000.0 | NaN | NaN | -0.002445 | 1.0 | 0.002539 | 1.0 |
| 2007-09-06 | 1.8173 | 2.4748 | 2.4389 | NaN | NaN | NaN | 1.8198 | 2.4385 | 3.2259 | 3.2956 | ... | NaN | NaN | NaN | 16264600000.0 | NaN | NaN | 0.012255 | 0.0 | 0.010107 | 0.0 |
| 2007-09-07 | 2.016 | 2.8528 | 2.8629 | NaN | NaN | NaN | 2.0446 | 2.8066 | 3.5218 | 3.4738 | ... | NaN | NaN | NaN | 18255420000.0 | NaN | NaN | -0.016949 | 1.0 | -0.021509 | 1.0 |


In [21]:
import matplotlib as mpl
mpl.rcParams['font.family'] = 'Microsoft YaHei'
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # update the font settings (Microsoft YaHei can display the Chinese column names)
mpl.rcParams['font.size'] = 9 
nas_df = X_df.isna()
# print(X_df.isna().sum())

# plt.figure(figsize = (15, 6))
# plt.title('NaN count in data')
# plt.xticks(rotation='vertical')
# plt.bar(nas_df.sum().index, nas_df.sum().values)
# plt.show()

![images](picture/NanNumber.png)

In [22]:
XDroped_df = X_df.loc['2008-04':].dropna(axis = 0, thresh=35)
print(XDroped_df.isna().sum()[XDroped_df.isna().sum()>0])
nas_df = XDroped_df.isna()

国债到期收益率:6个月           29
国债到期收益率:1年             2
国债到期收益率:2年             1
CRB现货指数:综合            97
期货收盘价(连续):COMEX黄金    159
期货结算价(连续):布伦特原油       22
COMEX黄金/WTI原油        159
dtype: int64


![images](picture/NanNumber2.png)

In [23]:
# forward-fill the remaining gaps with the last observed value
XFilled_df = XDroped_df.ffill()
# The lines below only inspect the rows that were NaN before filling;
# XFilled_df is the frame used in the following steps.
# for date in XDroped_df.loc[XDroped_df['期货收盘价(连续):COMEX黄金'].isna()].index: print(date)
XDroped_df = XDroped_df[XDroped_df['期货收盘价(连续):COMEX黄金'].isna()]
XDroped_df = XDroped_df[XDroped_df['国债到期收益率:6个月_pctChange5'].isna()]
XDroped_df = XDroped_df.iloc[:,25:]

In [24]:
XFilled_df.head(5)
print(np.isfinite(XFilled_df).all().head(5))

IBO001               True
R007                 True
B0                   True
IBO001_pctChange5    True
R007_pctChange5      True
dtype: bool


In [25]:
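# largest value of each column, ignoring +inf (computed once instead of once per column)
finiteMax_s = XFilled_df[~np.isposinf(XFilled_df)].max()
for aColumn in XFilled_df.columns:
    print(aColumn, finiteMax_s[aColumn])
    # replace infinite entries with that column maximum
    XFilled_df.loc[np.isinf(XFilled_df[aColumn]), aColumn] = finiteMax_s[aColumn]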

IBO001 13.8284
R007 11.6217
B0 11.6493
IBO001_pctChange5 2.5731317968672056
R007_pctChange5 1.963432953826692
B0_pctChange5 1.9480121121554133
SHIBORO/N 13.444
SHIBOR1W 11.004
SHIBOR2W 9.0642
SHIBOR1M 9.698
SHIBOR3M 6.4611
SHIBOR6M 5.5242
SHIBORO/N_pctChange5 2.6073306294126484
SHIBOR1W_pctChange5 1.875638306229869
SHIBOR2W_pctChange5 1.5312022414671422
SHIBOR1M_pctChange5 0.9572557847345828
SHIBOR3M_pctChange5 0.3585700344136833
SHIBOR6M_pctChange5 0.08224215246636768
国债到期收益率:6个月 4.5621
国债到期收益率:1年 4.2109
国债到期收益率:2年 4.4507
国债到期收益率:6个月_pctChange5 2.1164144353899887
国债到期收益率:1年_pctChange5 0.9207119741100322
国债到期收益率:2年_pctChange5 0.39655504234026195
南华综合指数 1676.88
CRB现货指数:综合 580.32
期货收盘价(连续):COMEX黄金 1873.7
期货结算价(连续):布伦特原油 146.08
COMEX黄金/WTI原油 39.80631276901004
南华综合指数_pctChange5 0.11500998794842587
CRB现货指数:综合_pctChange5 0.054300397556482194
期货收盘价(连续):COMEX黄金_pctChange5 0.20361137313030597
期货结算价(连续):布伦特原油_pctChange5 0.29319781078967955
COMEX黄金/WTI原油_pctChange5 0.2570562801310421
标普500 23707.

#### 2.2 Handling extreme values 

We use the MAD method to limit feature values to the range [median - n\*MAD, median + n\*MAD]. We also standardize the data before training our models.

Since we roll through the data in the classifier models that follow, the median, mean and variance must be computed for the training and testing data of each rolling window, so we encapsulate the cutExtreme function to give a standard input and output for cutting extreme values.

In [26]:
def cutExtreme(XFilled_df, n = 3.5):
    # DataFrame.mad() was removed in pandas 2.0; it returned the mean absolute
    # deviation around the mean, which is reproduced here explicitly
    MAD_s = (XFilled_df - XFilled_df.mean()).abs().mean()
    median_s = XFilled_df.median()
    upper_s = median_s+n*MAD_s
    lower_s = median_s-n*MAD_s

    # clip every column to [median - n*MAD, median + n*MAD];
    # clip returns a new frame, so the input is left unchanged
    XNoExtreme_df = XFilled_df.clip(lower = lower_s, upper = upper_s, axis = 1)
    return(XNoExtreme_df)

XNoExtreme_df = cutExtreme(XFilled_df, n = 3.5)

XNoExtreme_df.to_csv(os.path.join(outputDir, 'cleanedFactor.csv'))
XNoExtreme_df.to_pickle(os.path.join(outputDir, 'cleanedFactor.pkl'))
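
The per-window idea described in this section (compute the statistics on the training window, then apply them to the test window) can be sketched as follows. This is a hedged illustration, not the project's code: `fit_clip_bounds` and `apply_clip_bounds` are hypothetical helpers, the data are random, and the sketch uses the median absolute deviation, which may differ from the deviation measure used in cutExtreme.

```python
import numpy as np
import pandas as pd


def fit_clip_bounds(train_df, n=3.5):
    """Compute per-column clipping bounds from the training window only."""
    median_s = train_df.median()
    # Median absolute deviation around the median (one common reading of "MAD").
    mad_s = (train_df - median_s).abs().median()
    return median_s - n * mad_s, median_s + n * mad_s


def apply_clip_bounds(df, lower_s, upper_s):
    """Clip each column of df to the bounds fitted on the training window."""
    return df.clip(lower=lower_s, upper=upper_s, axis=1)


rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(120, 3)), columns=["f1", "f2", "f3"])
data.iloc[5, 0] = 50.0      # extreme value inside the training window
data.iloc[110, 1] = -80.0   # extreme value inside the test window
train_df, test_df = data.iloc[:100], data.iloc[100:]

lower_s, upper_s = fit_clip_bounds(train_df)
train_clipped = apply_clip_bounds(train_df, lower_s, upper_s)
test_clipped = apply_clip_bounds(test_df, lower_s, upper_s)

# Standardize with training-window statistics as well.
mean_s, std_s = train_clipped.mean(), train_clipped.std()
train_scaled = (train_clipped - mean_s) / std_s
test_scaled = (test_clipped - mean_s) / std_s
print(test_scaled.describe())
```

Fitting the bounds and the scaling on the training window only keeps information from the test period out of the preprocessing step.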

#### 2.3 Correlation 

In [27]:
# pandas_profiling must be installed first, which may cause environment problems.
# To skip this step, see the generated report in 07 report/inputDataReport.html

# import pandas_profiling 
# X_df = pd.read_pickle('cleanedfactor.pkl')
# profile = pandas_profiling.ProfileReport(X_df)
# profile.to_file(outputfile="report.html")

![images](picture/pearson.png)

![images](picture/spearman.png)
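
For readers who prefer not to install a profiling package, Pearson and Spearman correlation matrices like the ones pictured can also be computed directly with pandas. The sketch below uses synthetic factors with hypothetical names and an arbitrary 0.9 threshold, so it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
# Two nearly identical synthetic factors plus one independent factor.
factors = pd.DataFrame({
    "rate_1w": base + 0.05 * rng.normal(size=500),
    "rate_2w": base + 0.05 * rng.normal(size=500),
    "volume": rng.normal(size=500),
})

pearson = factors.corr(method="pearson")
spearman = factors.corr(method="spearman")

# Keep the upper triangle only so each pair appears once,
# then list the pairs whose absolute Pearson correlation exceeds 0.9.
mask = np.triu(np.ones(pearson.shape, dtype=bool), k=1)
pairs = pearson.where(mask).stack()
high_pairs = pairs[pairs.abs() > 0.9]
print(high_pairs)
```

Listing the highly correlated pairs makes it easier to see which factors carry overlapping information before a selection method is applied.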

#### 2.4 Feature selection

The correlations among these factors are relatively high, which is expected since many factors are closely related series (for example, SHIBOR at different tenors). To address this, we adopt several feature selection functions, described in the following part.

Here we build five models to select features:
* naiveSelection.py
* pcaSelection.py
* SVCL1Selection.py
* treeSelection.py
* varianceThresholdSelection.py

To reduce the correlation among features as much as possible, we can use an L1 (LASSO-style) penalty in the SVC model. To find the most important features, we can use PCA. XGBoost also performs feature selection internally. Moreover, to make the feature selection models easy to call, we encapsulate them as standard functions.
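
As an illustration of L1-penalized SVC selection, here is a minimal sketch built on scikit-learn's `LinearSVC` and `SelectFromModel`. It is not the contents of SVCL1Selection.py: the data are synthetic and the value `C=0.05` is an arbitrary assumption chosen only to make the penalty visible.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10))
# Only the first two columns drive the synthetic label.
y_train = (X_train[:, 0] - X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(50, 10))

# Fit the scaler and the selector on the training data only.
scaler = StandardScaler().fit(X_train)
svc = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=5000)
selector = SelectFromModel(svc).fit(scaler.transform(X_train), y_train)

# Features whose L1-penalized coefficient is (near) zero are dropped.
X_train_sel = selector.transform(scaler.transform(X_train))
X_test_sel = selector.transform(scaler.transform(X_test))
print(selector.get_support())
```

A smaller `C` strengthens the penalty and drops more features, so in practice it would be tuned inside each training window rather than fixed.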

#### Sample feature selection function [pcaSelection.py]

In [28]:
import pandas as pd
import os
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')
from sklearn.decomposition import PCA

def pcaSelection(X_train, y_train, X_test, y_test, verbal = None, returnCoef = False):
    '''
    Feature selection method: 'pca'.
    Fit the feature selection model on X_train, y_train only,
    then transform X_train and X_test with the fitted model.
    X_test is never used to build the feature selection model.
    
    Return the transformed X_train, X_test.
    Print info about the selector if verbal is True.
    Also return the coef or score of each feature if returnCoef is True.
    '''
    # standardize with a StandardScaler fitted on the training data
    features = X_train.columns.tolist()
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train))
    X_test = pd.DataFrame(scaler.transform(X_test))
    X_train.columns = features
    X_test.columns = features
    
    pca = PCA(n_components = 40)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    
    # placeholder: no per-feature coefficient is computed here
    coef = pd.Series(dtype = float)
    # featureName = None
    
    if verbal:
        print('The total feature number is '+ str(X_train.shape[1]))
       # print('The selected feature name is '+ str(featureName))
       
    if not returnCoef:
        return(X_train, X_test)
    else:
        return(X_train, X_test, coef)