# <center><font color="blue", size="6">S&P 500 Market Prediction</font></center>

# I. Introduction

### Goals

One of the more exciting areas in which to apply machine learning and AI is financial markets ([Economist, Machine Learning for Finance](https://www.economist.com/news/finance-and-economics/21722685-fields-trading-credit-assessment-fraud-prevention-machine-learning)). Our overarching goal is to apply algorithms and machine learning to real-world problems through software; therefore, all the results developed herein are available on GitHub (**link**).  

In this article we focus our efforts on the S&P 500. Investors are keenly interested in monitoring and anticipating market ups and downs in order to gain financial benefit. Additionally, they are interested in anticipating when the market will turn down from "Bull" to "Bear" or, conversely, turn up from "Bear" to "Bull." This information allows the investor to guard against financial loss during down markets and to capture gains during up markets. In a later article we will dive into specific stocks. 


### S&P Trade Strategy

For any problem we study as data scientists, we first need to understand the business problem. In this case, the business problem is to create a predictive model that predicts the movement of the S&P 500 so that the investor can take appropriate action. When it comes to the stock market, the key is to pick a trade strategy, as discussed in this link: https://www.dailyfx.com/forex/education/trading_tips/daily_trading_lesson/2012/07/10/How_to_build_a_trading_strategy.html. Below, we summarize the strategy assumed in this article. 

In concept, our strategy is simple: predict the price movement of the S&P 500 index, up or down, and execute a trade in order to optimize profitability. In addition to supporting a trading strategy, predicting the S&P up and down movements helps us understand the overall stock market. The strategy is summarized in the diagram below. Each day after the market closes, and before the next day's market opens, a predictive model receives inputs (feature variables) and outputs the predictor signal p_1. The feature variables are derived from the stock's historical information and act as independent variables, as in linear regression, while the output signal, p_1, is the dependent variable. If the signal is +1, this indicates buy: if you own the stock then hold, and if not, buy. However, if the signal is -1, this indicates sell: if you own the stock, sell; if you don't own the stock, take no action and sit on cash. 

![Trade strategy diagram](./TradeStrategy.png)
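
The buy, hold, sell and stay-in-cash rules described above can be sketched as a small decision function. This is only an illustration: `trade_action` is a hypothetical helper, not part of algosciquant, and the signal sequence is made up.

```python
def trade_action(signal: int, holding: bool) -> str:
    """Map a +1/-1 predictor signal and the current position to an action.

    Hypothetical helper for illustration; not an algosciquant function.
    """
    if signal == 1:
        return "hold" if holding else "buy"
    if signal == -1:
        return "sell" if holding else "stay in cash"
    raise ValueError(f"unexpected signal: {signal!r}")


# Walk through a few days of made-up signals, starting in cash.
holding = False
actions = []
for signal in [1, 1, -1, -1, 1]:
    action = trade_action(signal, holding)
    actions.append(action)
    holding = action in ("buy", "hold")

print(actions)
```

Keeping the position as explicit state makes it clear that the same signal leads to different actions depending on whether the index is currently owned.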

Several different predictive model types are employed, including supervised learning models (a Decision Tree classification model and a Random Forest classification model), a heuristic model based on S&P market cycles, and hybrid models based on a combination of supervised learning models and market cycle parameters. Each of these is backtested in order to evaluate its effectiveness during historical stock market periods.

### Scope and Perspective
* Discussion is focused on results, with a light introduction to the software. The author is happy to converse offline (email) to answer questions and to receive clarifications and constructive suggestions. 
* Intimate understanding of the software is not necessary to understand this article. However, if you are interested in the software, this overview should provide the introduction and examples to get you started. 
* The article is not written from the perspective of a financial theorist, but instead from a machine learning perspective, focused on a real-world application that demonstrates financial value. 
* This article is written in a Jupyter (Python) notebook, where each of the results and graphs is generated by actual Python functions that exemplify how to apply Python and the algosciquant functions. If you are unfamiliar with notebooks, just think of them as a working environment where Python code is executed within code blocks (as exemplified in the article).

### Overview

We follow a typical data science process attuned to our financial prediction problem.
  * Import data
  * Study the data and get some intuition 
  * Feature engineering
  * Model generation and prediction 
  * Backtest

In the remainder of the article, these topics are discussed within the following sections.   

&nbsp;&nbsp; **Section II**. &nbsp;  Notebook setup  
&nbsp;&nbsp; **Section III**.&nbsp; Market cycles  
&nbsp;&nbsp; **Section IV**. &nbsp; Features and Class Labels    
&nbsp;&nbsp; **Section V**. &nbsp;   Model Training and Prediction    
&nbsp;&nbsp; **Section VI**. &nbsp;  Back test    


In [8]:
%%html
<!-- table style --> 
<style>
  table {margin-left: 0 !important;}
</style>

In [7]:
# II. Notebook Setup
import pandas as pd
import numpy as np
import datetime as dt
from io import StringIO
import matplotlib.pyplot as plt
from matplotlib import rcParams
from sklearn.tree import DecisionTreeClassifier, export_graphviz
%matplotlib inline
%run algosciquant

# Notebook Parameters
test_s = dt.datetime(2016,1,1)      # 2016 for daily predictions, 2000, 1970, 1960
dataStartDate=dt.datetime(2014,1,1) # 2014 for daily updates, 1950 for long term updates
train_s = dt.datetime(1950,1,1)
today = dt.datetime.today()
#test_e = dt.datetime(today.year,today.month,today.day)
test_e = dt.datetime(2017,4,28)
print("dataStartDate = ",dataStartDate,'\ntest_s = ',test_s,'\ntest_e = ',test_e)

# Read in S&P 500 Data
dfsp = pd.read_csv('./stock_data/sp500.csv',index_col=0,parse_dates=True)
dfsp = dfsp[dataStartDate:]

# Plot the S&P Price

plot=0
if plot == 1: 
    plot_stock(dfsp,test_s,test_e,plot_variables=['close_price'],labels=['close_price'], figsize=[12,4])
    

dataStartDate =  2014-01-01 00:00:00 
test_s =  2016-01-01 00:00:00 
test_e =  2017-04-28 00:00:00


# III. Market Cycles

### Computing Up and Down Market Cycles with *marketCycle()*

Understanding S&P market cycles is key to developing intuition about the market. The algosciquant *marketCycle()* function takes as input S&P historical prices and generates the up and down market cycles. The input and output parameters are described in detail within algosciquant.py. Here, we provide a brief summary. *marketCycle()* returns two DataFrames, *dfmc* and *dfmcsummary*. 

*marketCycle() inputs*

  * *mucdown* (input variable) - the percentage decline, measured from the previous high, at which the market is declared to be a down market. As in the definition of a Bear Market, the official start of the down cycle is the previous high. 
  * *mdcup* (input variable) - the percentage rise, measured from the previous low, at which the market is declared to be an up market. As with a Bull Market, the cycle start point is the previous low.  
  
*marketCycle() outputs*

* *dfmc* is a DataFrame indexed by date (one entry for each market day) including several columns ("parameters") that characterize the market on the given day. Two parameters of importance to this analysis are *mcnr* and *mcupm*. 
* *mcnr* - market cycle normalized return, normalized to zero at the start of each market up or down cycle.
* *mcupm* - market cycle up marker, a derived variable that marks the day on which the market is detected to switch from up to down (1 to 0) or vice versa, and holds that state until the next switch. Below, we employ this variable as a heuristically derived trade signal, "buy" or "sell."
* *dfmcsummary* includes one row summarizing each up and each down market cycle. The summary includes the start day, end day and normalized return (up or down) from the start of the market cycle. 
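
To make the threshold logic concrete, here is a minimal sketch of how up and down cycles could be detected from a price series. It is not the algosciquant implementation: the function name `market_cycles`, the +1/-1 state encoding and the assumption that the series starts in an up market are all ours, and the prices are made up.

```python
def market_cycles(prices, down_pct=20.0, up_pct=21.0):
    """Detect up (+1) and down (-1) market states from a list of prices.

    Returns (states, turning_points). states[i] is the state detected on
    day i; turning_points holds (index, new_state) pairs dated at the
    previous high or low, where each new cycle is taken to start.
    """
    state = 1          # assumption: the series begins in an up market
    extreme_i = 0      # index of the running high (up) or running low (down)
    states = []
    turning_points = []
    for i, price in enumerate(prices):
        extreme = prices[extreme_i]
        if state == 1:
            if price > extreme:
                extreme_i = i                           # new running high
            elif price <= extreme * (1 - down_pct / 100):
                turning_points.append((extreme_i, -1))  # down cycle starts at the high
                state, extreme_i = -1, i
        else:
            if price < extreme:
                extreme_i = i                           # new running low
            elif price >= extreme * (1 + up_pct / 100):
                turning_points.append((extreme_i, 1))   # up cycle starts at the low
                state, extreme_i = 1, i
        states.append(state)
    return states, turning_points


prices = [100, 110, 120, 100, 95, 90, 100, 110, 125]
states, turning_points = market_cycles(prices)
print(states)
print(turning_points)
```

Note the two different dates involved: the state only flips on the day the threshold is crossed (which is what a trade signal can act on), while the cycle itself is dated back to the preceding high or low.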

### Market Cycles (Bull and Bear)

Below we plot the Bull (up 21% from the market cycle low) and Bear (down 20% from the market high) market cycles of the S&P 500 from 1950 to 2017-04-28.  

![S&P 500 Bull and Bear market cycles](./mc_2021.png)



### Market Cycles Summary
For reference we list the *dfmcsummary* data frame from 1950 to 04/28/2017, resulting from the inputs *mucdown* = 20 (down 20% from the market high) and *mdcup* = 21 (up 21% from the market low). The market cycles in our *dfmcsummary* (listed below) are a good match with the Bull and Bear market cycles reported in http://www.gold-eagle.com/article/history-us-bear-bull-markets-1929.

Table: 

|mkt|startTime|endTime|startPrice|endPrice|mcnr|
|---|---------|-------|----------|--------|----|
|1.0|1950-01-03|1956-08-02|16.660000|49.639999|1.979592|
|-1.0|1956-08-02|1957-10-22|49.639999|38.980000|-0.214746|
|1.0|1957-10-22|1961-12-12|38.980000|72.639999|0.863520|
|-1.0|1961-12-12|1962-06-26|72.639999|52.320000|-0.279736|
|1.0|1962-06-26|1966-02-09|52.320000|94.059998|0.797783|
|-1.0|1966-02-09|1966-10-07|94.059998|73.199997|-0.221773|
|1.0|1966-10-07|1968-11-29|73.199997|108.370003|0.480465|
|-1.0|1968-11-29|1970-05-26|108.370003|69.290001|-0.360616|
|1.0|1970-05-26|1973-01-11|69.290001|120.239998|0.735315|
|-1.0|1973-01-11|1974-10-03|120.239998|62.279999|-0.482036|
|1.0|1974-10-03|1980-11-28|62.279999|140.520004|1.256262|
|-1.0|1980-11-28|1982-08-12|140.520004|102.419998|-0.271136|
|1.0|1982-08-12|1987-08-25|102.419998|336.769989|2.288127|
|-1.0|1987-08-25|1987-12-04|336.769989|223.919998|-0.335095|
|1.0|1987-12-04|2000-03-24|223.919998|1527.459961|5.821454|
|-1.0|2000-03-24|2001-09-21|1527.459961|965.799988|-0.367708|
|1.0|2001-09-21|2002-01-04|965.799988|1172.510010|0.214030|
|-1.0|2002-01-04|2002-10-09|1172.510010|776.760010|-0.337524|
|1.0|2002-10-09|2007-10-09|776.760010|1565.150024|1.014972|
|-1.0|2007-10-09|2008-11-20|1565.150024|752.440002|-0.519254|
|1.0|2008-11-20|2009-01-06|752.440002|934.700012|0.242225|
|-1.0|2009-01-06|2009-03-09|934.700012|676.530029|-0.276206|
|1.0|2009-03-09|2017-06-26|676.530029|2447.639893|2.617932|
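
The *mcnr* column in the table appears to be the simple return over the cycle, endPrice / startPrice - 1. The short check below recomputes it for the first three rows; the helper name `normalized_return` is ours.

```python
# (mkt, startPrice, endPrice, mcnr as listed in the table above)
cycles = [
    (1.0, 16.660000, 49.639999, 1.979592),
    (-1.0, 49.639999, 38.980000, -0.214746),
    (1.0, 38.980000, 72.639999, 0.863520),
]


def normalized_return(start_price, end_price):
    """Return over a cycle, measured from the cycle's start price."""
    return end_price / start_price - 1.0


for mkt, start, end, listed in cycles:
    computed = normalized_return(start, end)
    print(f"mkt={mkt:+.0f}  computed={computed:.6f}  listed={listed:.6f}")
```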


In [8]:
# III. Market Cycles

# The code block gives a choice of computing the S&P 500 market cycles from the close_price market 
# data loaded in the previous step ("dfsp") or loading a DataFrame ("dfmc") of pre-computed ("saved") market cycles. 

%run algosciquant
compute=0   # if compute is 1 then compute new market cycles, else load from saved file
print("dataStartDate = ",dataStartDate,'\ntest_s = ',test_s,'\ntest_e = ',test_e)

# Market cycle label, defined up front so it is also available when compute == 1
mcvariable='2025' # 2021, 2022, 2023, 2024, 2025

# Compute Market Cycles
if compute==1:
    dfmc,dfmcsummary=compute_market_cycle(dfsp,dataStartDate,test_e,mcdown_p=20,mcup_p=25)

# Load Market Cycle files
if compute == 0:
    print('mcvariable =',mcvariable)
    mc_filename='./data_jupyter_notebook/sp500_dfmc'+mcvariable+'_1950_2017-4-28.csv'
    mcs_filename='./data_jupyter_notebook/sp500_dfmcs'+mcvariable+'_1950_2017-4-28.csv'
    dfmc = pd.read_csv(mc_filename,index_col=0,parse_dates=True)
    dfmcsummary = pd.read_csv(mcs_filename,index_col=0,parse_dates=True)
    
# Plot S&P 500 Market Cycle
plot = 0
if plot==1:
    basic_plot(dfmc,dataStartDate,test_e,plot_variables=['mcnr','mcupm'],labels=['mcnr','mcupm'],
                figsize=[12,3],loc='upper right',save_fig='mc_'+mcvariable+'.png')
    
summary=0
if summary ==1:
    print('dfmcsummary:\n',dfmcsummary[['mkt','startTime','endTime','startPrice','endPrice','mcnr']])


dataStartDate =  2014-01-01 00:00:00 
test_s =  2016-01-01 00:00:00 
test_e =  2017-04-28 00:00:00
mcvariable = 2025


# IV. Features and Class Labels

### ML Features

In this section we generate the machine learning features (input variables) and class labels necessary for training a predictive model, such as a Decision Tree or Random Forest, under a supervised learning paradigm. The feature variables are listed as part of the code block output. The table below gives a brief description of the machine learning feature variables, grouped by type.

Table: ML Features

|Feature Type and Description|  Feature Variables  |
|:----------------------------|:------------|
|Basic market variables | close_price, volume|
|High and low relative to open |high_price_ropen, low_price_ropen|
|Relative price change, 1 and 2 day. Relative price change, close_pricer = close_price[n-1]/close_price[n] -1| close_pricer_h1, close_pricer_h2 |
|Relative volume change, 1 and 2 day. volumer = volume[n-1]/volume[n] -1|volumer_h1, volumer_h2|
|Price trailing moving averages based on relative price change| close_pricer_ma5, close_pricer_ma10, close_pricer_ma20, close_pricer_ma30, close_pricer_ma60, close_pricer_ma90, close_pricer_ma120 |
|Volume trailing moving averages based on relative volume change |volumer_ma5, volumer_ma10, volumer_ma20, volumer_ma30, volumer_ma60, volumer_ma90, volumer_ma120|
|Yearly volatility measured over n trailing days | vol_y_10, vol_y_50, vol_y_120 |
|Market cycle variables | mc2025, mcupm, mcnr, mucdown, mdcup |
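
As a rough illustration of how features of this kind can be derived with pandas, the sketch below computes a relative change, trailing moving averages and an annualized volatility. It is not the algosciquant `mlSpFeatures` code. The function `sketch_features` is hypothetical, the data are synthetic, and two assumptions are made: rows are sorted by ascending date with the relative change taken as current over previous value (the index convention in the table may differ), and volatility is annualized with 252 trading days.

```python
import numpy as np
import pandas as pd


def sketch_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a few table-style features from close_price and volume columns."""
    out = pd.DataFrame(index=df.index)
    # Relative day-over-day change (assumes rows sorted by ascending date).
    out["close_pricer"] = df["close_price"].pct_change()
    out["volumer"] = df["volume"].pct_change()
    # Trailing moving averages of the relative changes.
    for window in (5, 10):
        out[f"close_pricer_ma{window}"] = out["close_pricer"].rolling(window).mean()
        out[f"volumer_ma{window}"] = out["volumer"].rolling(window).mean()
    # Annualized volatility over a trailing window (assumes 252 trading days per year).
    out["vol_y_10"] = out["close_pricer"].rolling(10).std() * np.sqrt(252)
    return out


# Synthetic prices and volumes, for illustration only.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2017-01-02", periods=30)
market = pd.DataFrame(
    {
        "close_price": 100 * np.cumprod(1 + rng.normal(0, 0.01, 30)),
        "volume": rng.integers(1_000, 2_000, 30).astype(float),
    },
    index=dates,
)
features = sketch_features(market)
print(features.dropna().tail(3))
```

Every trailing window leaves null values at the start of the series, which is why the notebook counts null rows before training.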


### Class labels N day future
We employ an N-day forward-looking prediction as described by some bright Stanford students in their machine learning class project (http://cs229.stanford.edu/proj2013/DaiZhang-MachineLearningInStockPriceTrendForecasting.pdf). The class labeling method works as follows. The class labels are derived from the stock market data: for each day, the class label is either +1 or -1, according to whether the closing price on the n-th day in the future is above or below the closing price on the current day. Suppose n = 3. If the current day is Monday, the closing price on Monday is \$1 and the Thursday closing price is \$1.10, then the class label for Monday is +1. Furthermore, suppose that the Friday closing price is \$0.90; then the class label for Tuesday, whose third future trading day is Friday, is -1, provided Tuesday's close is above \$0.90.
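
The labeling rule can be sketched in a few lines. The function `nday_labels` below is a hypothetical stand-in, not the algosciquant `ndayTruth` function. Only the Monday, Thursday and Friday closes come from the example above; the Tuesday and Wednesday closes are made up, and labeling ties as -1 is an assumption.

```python
def nday_labels(close, n):
    """Return +1/-1 labels comparing the close n days ahead with today's close.

    The last n days get None because their future close is not yet known.
    Ties are labeled -1 (assumption).
    """
    labels = []
    for i, price in enumerate(close):
        if i + n >= len(close):
            labels.append(None)
        else:
            labels.append(1 if close[i + n] > price else -1)
    return labels


# Monday..Friday closes; Tuesday and Wednesday values are invented.
closes = [1.00, 1.05, 1.08, 1.10, 0.90]
labels = nday_labels(closes, n=3)
print(labels)
```

The trailing `None` entries are the reason the most recent *n* days can be used for prediction but not yet for training.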

The graph below illustrates the accuracy (correct predictions divided by total predictions) on the y-axis achieved by Decision Tree (*dt*) and Random Forest (*rf*) prediction models versus the *nday* future prediction horizon on the x-axis. The predictions occurred over the dates 2014-01-01 to 2017-04-28 with the feature variables listed in the table above. We discuss training and prediction in more detail in the following section. The key point here is that the prediction accuracy improves as *nday* increases from 1 to a larger number of days into the future. In fact, 1 day into the future the accuracy is very poor. We find that *nday* set to 43 leads to good prediction results.

![Prediction accuracy versus nday for Decision Tree and Random Forest](./mc_tr_summary.png)


In [9]:
# IV. Features and Class Labels
%run algosciquant

# ML Features
dfML=mlSpFeatures(dfsp,dfmc,mcvariable,dataStartDate,test_e)

print("\nML features")
print(dfML.columns)

# Class Labels
nday=43
dfT= ndayTruth(dfsp.loc[dataStartDate:, ['close_price']], nday,tvariable='close_price')
print('Truth t_n')
print(dfT.columns)

# Null Rows
s=dt.datetime(1952,6,1)
e=dt.datetime(2017,1,1)
nullrows=int(dfML.loc[s:e].isnull().any(axis=1).sum())  # rows with at least one null feature
nrows=len(dfML.loc[s:e])
print("nrows = ",nrows,"null_rows = ",nullrows)

print('data start date =',dataStartDate, ', start date =',test_s,', end date =',test_e) 

plot=0
if plot ==1:
    fig = plt.figure()
    # close_price
    subplot_stock(fig,dfsp,311,test_s,test_e,plot_variables=['close_price'],
                  labels=['close_price'],figsize=[12,6],loc='lower right',save_fig='')
    #  moving average
    subplot_stock(fig,dfML,312,test_s,test_e,plot_variables=['close_pricer_ma120','close_pricer_ma60'],
                  labels=['close_pricer_ma120','close_pricer_ma60'],figsize=[12,6],loc='lower right')
    #  volatility    
    f3=fig.add_subplot(313)
    subplot_stock(fig,dfML,313,test_s,test_e,plot_variables=['vol_y_10','vol_y_50','vol_y_120'],
                  labels=['vlty_w10','vlty_w50','vlty_w120'],figsize=[12,6],loc='upper right',ncol=3,
                  save_fig='sp_vlty_'+str(test_s.year)+'_'+str(test_e.year)+str(test_e.month)+str(test_e.day)+'.png')



ML features
Index(['close_pricer', 'volumer', 'close_price', 'volume', 'high_price_ropen',
       'low_price_ropen', 'close_pricer_h1', 'close_pricer_h2',
       'close_pricer_ma5', 'close_pricer_ma10', 'close_pricer_ma20',
       'close_pricer_ma30', 'close_pricer_ma60', 'close_pricer_ma90',
       'close_pricer_ma120', 'volumer_h1', 'volumer_h2', 'volumer_ma5',
       'volumer_ma10', 'volumer_ma20', 'volumer_ma30', 'volumer_ma60',
       'volumer_ma90', 'volumer_ma120', 'vol_y_10', 'vol_y_50', 'vol_y_120',
       'mc2025', 'mcupm', 'mcnr', 'mucdown', 'mdcup'],
      dtype='object')
Truth t_n
Index(['close_price', 't_n'], dtype='object')
nrows =  756 null_rows =  120
data start date = 2014-01-01 00:00:00 , start date = 2016-01-01 00:00:00 , end date = 2017-04-28 00:00:00


# V. Model Training and Prediction


### Training and Prediction Method

The training and prediction follow a batch learning paradigm, whereby the predictive model is trained offline and then offers a prediction. The model is trained over a training window of *train_days* days in the past. In the case of the Decision Tree model, *train_days* is set to 400 days. Each day the model is trained and predicts whether the market will be up or down *ndays* days into the future. The next day, the training window slides forward by 1 market day, the model is trained again and a prediction is made *ndays* trading days into the future. This cycle is summarized in the diagram below.

<div markdown style="float:left; margin-left: 0 !important;">  
![Training and prediction strategy](./sp_ml_tp_strategy.png)
</div>

The diagram below describes this process in a bit more detail. In order to predict for day *k* + *ndays*, prior to market open, the prediction is made on day *k*, after market close, with features from day *k*. The model is trained with a training set including features from *k* - *train_days* to *k* and class labels from day *k* - *ndays* - *train_days* - 1 to *k* + *ndays* - 1. The model prediction performance and trade performance can be tested by saving the predictions as the training and prediction are performed over a historical market period. These saved predictions are then compared to the actual market behavior. 

<div markdown style="float:left">  
![Training and prediction days](./tr_pred_days.png)
</div>

### Training Set Example
Let's take a more specific example. Suppose *ndays* = 1 and *train_days* = 2 and we want a buy or sell prediction for Tuesday, January 10, 2017 at market open. Also, suppose the training features consist of only 1 item, close price, where the close prices are  

  * Thursday, January 5, close_price = $1.00  
    
  * Friday, January 6, close_price =  $1.10  
    
  * Monday, January 9, close_price =  $1.00  

We train the model with features from Thursday and Friday (2 days), and class labels derived at Friday and Monday, market close. The table below lists the example training set with features X, labels Y and some description of these variables. 

<center>Table: Example training set: features X and labels Y, *ndays* = 1, *train_days* = 2 for prediction Tuesday, Jan 10, 2017 at market open.</center>

|X (close price)|Y (label)|features description|label description (market up or down, 1 day in the future)|
|---|---|:---|:---|
|\$1.10| -1 |close price, Friday, Jan 6, at market close | sign (close price, Monday, Jan 9, at market close - close price Friday, Jan 6, market close) |
|\$1.00| 1 |close price, Thursday, Jan 5, at market close | sign (close price, Friday, Jan 6, at market close - close price Thursday, Jan 5, market close) |
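
The sliding training window and the example table above can be reproduced with a short sketch. The helpers `training_window` and `label` are hypothetical, not algosciquant functions, and the sketch assumes that a feature row can be used for training only once its label (which depends on the close *ndays* later) is known.

```python
def training_window(k, ndays, train_days):
    """Indices of feature rows usable for a prediction made after the close of day k.

    Row j is usable only if the close of day j + ndays is known by day k.
    """
    last = k - ndays
    first = last - train_days + 1
    if first < 0:
        raise ValueError("not enough history for this window")
    return list(range(first, last + 1))


def label(close, j, ndays):
    """+1 if the close ndays after day j is higher than the close of day j, else -1."""
    return 1 if close[j + ndays] > close[j] else -1


days = ["Thu Jan 5", "Fri Jan 6", "Mon Jan 9"]
close = [1.00, 1.10, 1.00]
k = 2  # the prediction is made after Monday's close, for Tuesday's open

rows = [
    (days[j], close[j], label(close, j, ndays=1))
    for j in training_window(k, ndays=1, train_days=2)
]
for row in rows:
    print(row)
```

Calling `training_window(3, ndays=1, train_days=2)` for the following day yields rows 1 and 2, which is the window sliding forward by one market day.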

### Model

The supervised learning model is specified in the *model* parameter. Several models are preconfigured, including: Decision Tree (*model*='DT'), Random Forest (*model*='RF'), Support Vector Machine (*model*='SVM'), Logistic Regression (*model*='LR'), K Nearest Neighbor (*model*='KNN'), Naive Bayes (*model*='NBB') and XG Boost (*model*='XG'). Two models were found to be very useful for this study. The Decision Tree trains quickly and gives good performance, comparable to the best models, and thus is quite useful, especially for initial evaluation and iteration. The Random Forest consistently provides the best overall performance, though training and prediction (for backtesting) are much slower, especially when executed over decades. Below is a summary of the 

Decision Tree - There are approximately 252 trading days in a year so this corresponds to approximately 2.5 years.

Random Forest -  There are approximately 252 trading days in a year so this corresponds to approximately 2.5 years.


### Predictor signals

A software object that can estimate some parameters based on a dataset is called an estimator. Similarly, an estimator that predicts behavior into the future based on historical data is a predictor. The machine-learned model (code block below) outputs several predictors. In the next section (Section VI. Backtest) each of the predictors is described in more detail. In order to understand our prediction model's predictors, below we describe three of the predictors generated by the model: p, mc2025 and mc2025v. 

A few columns of the prediction output DataFrame (dfTR) are listed below to aid the discussion. The last prediction date is *test_e* (April 28, 2017), which gives a prediction for the next market day. Predictions for prior days can be compared to actual market behavior in order to test the predictive performance. The prediction corresponding to a given day is contained in the 'p' column and this is compared to the 't' column. For example, on 2017-04-27, *p* = 1 is compared to *t* = 1. Other predictors, such as *mc2025* and *mc2025v*, are also compared to the *t* column in order to assess their predictive performance. 

|date| close_price |t_n |p_n |t |p |t_1|p_1 |
|----|---|---|---|---|---|---|---|
|2017-04-24 |2374.15 | 1  | 0 |t |1.0 |1 | 1.0 | 
|2017-04-25 |2388.61   | 1  | 0 |1| 1.0 |-1|-1.0 |
|2017-04-26 |2387.45  |-1  | 0 | -1 | -1.0|1| 1.0 | 
|2017-04-27|2388.77  |-1  | 0 | 1 |1.0|1 |1.0  | 
|2017-04-28| 2384.20  | 1  | 0 | 1 |1.0| NA |1.0  |
  
*p_1* - is the prediction of the market behavior, derived from the supervised learning predictive model (e.g., Decision Tree, as set by the *model* parameter below), and can be used as a trading signal (buy or sell) for the next trading day. For example, at the end of the trading day on Monday, April 24, 2017, *p_1* = +1 because the *close_price* on Tuesday, April 25, 2017 is predicted to be greater than Monday's close. On Tuesday, April 25, 2017, *p_1* = -1 since the *close_price* at Wednesday's close is predicted to be less than or equal to Tuesday's close. The predictor *p_1* is the same as *p_n* (n days into the future) but shifted (pandas *shift*) to the DataFrame row for "today." *p* is the same as *p_1* shifted forward by 1 day so as to align the prediction with the day to which it corresponds. For trading purposes, the *p_1* predictor can be utilized as a predictor for the next day.
<br><br> 
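
The relationship between *p_1* and *p* (and likewise *t_1* and *t*) can be illustrated with pandas `shift`. The values below are copied from the table above; how algosciquant applies the shift internally is an assumption on our part.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2017-04-24", "2017-04-25", "2017-04-26", "2017-04-27", "2017-04-28"]
)
df = pd.DataFrame(
    {
        "t_1": [1.0, -1.0, 1.0, 1.0, float("nan")],  # next-day truth, unknown on the last day
        "p_1": [1.0, -1.0, 1.0, 1.0, 1.0],            # next-day prediction
    },
    index=dates,
)

# Move each next-day value onto the row of the day it refers to.
df["t"] = df["t_1"].shift(1)
df["p"] = df["p_1"].shift(1)

# Score only the rows where both the truth and the prediction are available.
scored = df.dropna(subset=["t", "p"])
accuracy = (scored["t"] == scored["p"]).mean()
print(df)
print("accuracy on aligned rows:", accuracy)
```

After the shift, the *p* and *t* values for April 25 through April 28 agree with the corresponding columns of the table.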

There are several model variations including hybrid supervised learning and heuristic models. These are discussed in more detail in Section VI. "Backtest."

### Model Performance (Confusion Matrix)

The classification model prediction ("classification") performance is summarized in a confusion matrix. The actual total market up and down days are listed in the Totals column. Each row lists the prediction results as a fraction of that row's total. The results here are for a Decision Tree classifier over the period January 1, 2000 to April 28, 2017. If we conser

|Actuals| Totals | Predicted MktDown | Predicted MktUp |
|:-----|-----------|-----------|---------|
|Market Down days|1660|0.749398|0.250000|
|Market Up days|2655|0.202637|0.797363|
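
A row-normalized confusion matrix of this form can be computed directly from the truth and prediction columns. The sketch below uses a hypothetical `confusion_rates` helper (not the algosciquant `mktPredConfusionMatrix` function) and made-up labels.

```python
from collections import Counter


def confusion_rates(truth, predicted):
    """Row-normalized 2x2 confusion matrix for +1/-1 labels.

    Returns {actual: (total, fraction_predicted_down, fraction_predicted_up)}.
    """
    counts = Counter(zip(truth, predicted))
    table = {}
    for actual in (-1, 1):
        down = counts[(actual, -1)]
        up = counts[(actual, 1)]
        total = down + up
        table[actual] = (total, down / total, up / total) if total else (0, 0.0, 0.0)
    return table


# Made-up labels for illustration.
truth = [1, 1, 1, -1, -1, 1, -1, 1]
predicted = [1, 1, -1, -1, 1, 1, -1, 1]

table = confusion_rates(truth, predicted)
error_rate = sum(t != p for t, p in zip(truth, predicted)) / len(truth)
print(table)
print("error_rate =", error_rate)
```

Normalizing by row answers the question "of the days the market actually went down (or up), what fraction did the model call correctly?", which is more informative than raw counts when up days far outnumber down days.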



In [11]:
#  V. Training and Prediction
#   Features          X = dfML.loc[train_s:]
#   Labels            Y = dfT.loc[train_s:test_e]
#   Train/Predict     mktClfTrainTest() trains ("fits") the classifier and predicts from test_s to test_e.
#                      Make one prediction per date, looking ahead by nday, and save it in order to compare later to labels.
#                      Results are returned in the dfTR DataFrame.
#   Volatility & MA   Compute volatility and moving averages heuristic model.
#   Save              Save DataFrame dfTR (training and prediction results) to CSV
#   Confusion Matrix  Compute and print confusion matrix
#   Plot              Training summary vs. nday

%run algosciquant

# Model Training and Prediction
model='DT' #
X = dfML.loc[train_s:,dfML.columns]
Y = dfT.loc[train_s:test_e]
print("...")
dfTR,clf = mktClfTrainTest(X,Y,nday,train_s,test_s,test_e,model,v=1)

# Volatility and MovingAverage Predictors  
mc_mcvariable='mc'+mcvariable
vltyw='120'; maw='60'
dfTR=volatilityPriceSP(dfTR,vltyw,maw,mcvariable=mc_mcvariable)
print("vltyw =",vltyw,", maw =",maw)

# Save Predictions Data Frame (dfTR)
tick='sp'
str_test_e=str(test_e.year)+str(test_e.month)+str(test_e.day)
str_test_syr=str(test_s.year)
save_dtr_filename='dfclfm_'+tick+'_nd'+str(nday)+'_'+str_test_syr+'_'+str_test_e+'_'+model+'.csv'
print('output filename =',save_dtr_filename)
dfTR.to_csv(save_dtr_filename)

# Print training results
print('model =',model, '\ntest start date:',test_s,'\ntest end date:',test_e)
print('\nPrice and Market Variables',dfTR[['close_price','vol_y_'+vltyw,'mucdown','mdcup','close_pricer_ma'+maw]].tail(5))
print('\nPredictor signals\n',dfTR[['t_n','p_n','p','t_1','p_1','v',mc_mcvariable,mc_mcvariable+'p',mc_mcvariable+'v',mc_mcvariable+'pv']].tail(5))

# Plot Train Summary
plot_train_summary=0
if plot_train_summary==1:
    rcParams['figure.figsize'] = 12, 3
    dftrainsummary = pd.read_csv('./data_jupyter_notebook/df_sp_trainsummary.csv',index_col=0,parse_dates=True)
    dftrainsummary[['S&P_rf_accuracy','S&P_dt_accuracy']].plot(use_index=True,grid=True)
    plt.xlim(1,60)
    plt.ylim(0.3,1)
    plt.savefig('mc_tr_summary.png')

# Confusion Matrix "p"

(samplesize, errors, correct, er, dfCMA, dfCMR)=mktPredConfusionMatrix(dfTR,"t",'p')

print('\nerror_rate =',er)
dfCMR[['Totals','Predicted MktDown','Predicted MktUp']]


...
2016-01-04
2017-01-03

vltyw = 120 , maw = 60
output filename = dfclfm_sp_nd43_2016_2017428_DT.csv
model = DT 
test start date: 2016-01-01 00:00:00 
test end date: 2017-04-28 00:00:00

Price and Market Variables             close_price  vol_y_120   mucdown  mdcup  close_pricer_ma60
2017-04-24  2374.149902   0.077694  0.009103    0.0           0.000563
2017-04-25  2388.610107   0.078036  0.003068    0.0           0.000679
2017-04-26  2387.449951   0.077238  0.003552    0.0           0.000771
2017-04-27  2388.770020   0.076448  0.003001    0.0           0.000795
2017-04-28  2384.199951   0.076150  0.004908    0.0           0.000758

Predictor signals
             t_n  p_n    p  t_1  p_1    v  mc2025  mc2025p  mc2025v  mc2025pv
2017-04-24  1.0  0.0  1.0    1  1.0  1.0     1.0      1.0      1.0       1.0
2017-04-25  1.0  0.0  1.0    1 -1.0  1.0     1.0      1.0      1.0       1.0
2017-04-26  1.0  0.0 -1.0    1  1.0  1.0     1.0      1.0      1.0       1.0
2017-04-27  1.0  0.0  1.0    1

Unnamed: 0,Totals,Predicted MktDown,Predicted MktUp
actual MktDown,57,0.684211,0.298246
actual MktUp,233,0.283262,0.716738


# VI. Backtest

### Strategies

We now test the trade performance of our models generated in the previous section (Section V. Model Training and Prediction). The predictors are listed below along with a brief description. 

* **p_dt** - Decision Tree predictive model. Market up prediction (buy) *p_dt* = +1, or market down prediction (sell) *p_dt* = -1 signal.  
<br>
* **p_rf** - Random Forest predictive model. Market up prediction (buy) *p_rf* = +1, or market down prediction (sell) *p_rf* = -1 signal.  
<br>
* **pv_dt** - This model is a combination of the Decision Tree predictive model and heuristics based on a yearly volatility measure and a moving average. If the volatility, derived from the trailing 120 days, is greater than 20 percent and the trailing 60-day price moving average is negative, then *pv_dt* = -1; otherwise *pv_dt* = *p_dt*. In summary, if the volatility is high with a downward trend then sell, otherwise trade based on the Decision Tree predictive model (*p_dt*).  
<br>
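* **pv_rf** - The same combination as *pv_dt*, with the Random Forest predictive model (*p_rf*) in place of the Decision Tree.  
<br>
* **mc2025** - Heuristic model based on the S&P market cycles of Section III, computed with a 20% market cycle down threshold and a 25% market cycle up threshold.  
<br>
* **mc2025v** - Combination of the *mc2025* market cycle signal and the volatility and moving average heuristic.  
<br>
* **mc2025pv_dt** - Hybrid of the *mc2025* market cycle signal, the Decision Tree prediction and the volatility and moving average heuristic.  
<br>
* **mc2025pv_rf** - Hybrid of the *mc2025* market cycle signal, the Random Forest prediction and the volatility and moving average heuristic.  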
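
The *pv_dt* rule stated above (sell when volatility is high and the trend is down, otherwise follow the model) can be sketched as follows. The function `pv_signal` is hypothetical, and we assume volatility is expressed as a fraction, so that 20 percent is 0.20.

```python
def pv_signal(p, volatility, moving_average, vol_threshold=0.20):
    """Override the model signal with a sell when volatility is high and the trend is down.

    p              model signal, +1 (buy) or -1 (sell)
    volatility     yearly volatility over the trailing window, as a fraction (assumption)
    moving_average trailing moving average of the relative price change
    """
    if volatility > vol_threshold and moving_average < 0:
        return -1
    return p


print(pv_signal(p=1, volatility=0.25, moving_average=-0.001))   # high volatility, downtrend
print(pv_signal(p=1, volatility=0.25, moving_average=0.002))    # high volatility, uptrend
print(pv_signal(p=-1, volatility=0.08, moving_average=0.0008))  # calm market, model says sell
```

The heuristic can only turn a buy into a sell; it never overrides a sell signal from the model.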



### Backtesting Results

Backtesting starts January 1, 2000; the close_price on December 31, 1999 is \$1469.25. The strategies are described above, and the results are summarized in the table below.


This period (Jan 1, 2000 to Apr 28, 2017) includes two Bear markets and serves as a good test for the strategies. The heuristic strategy mc2025v performs very close to the top performing model, mc2025pv_rf. In the next section, we graph the performance.

Table: Backtest summary, from January 1, 2000 to April 28, 2017.

|Strategy| End Value |Total Return| Annualized Return |
|--------|-----------|------------|------------------|
|S&P 500 | \$2384.2 | 63.84% | 2.89% |
|p_dt | \$2441.8 | 66.2% | 2.97% |
|p_rf | \$3417.87 | 132.6% | 5% |
|pv_dt | \$3385.06 | 130.39% | 4.93% |
|pv_rf | \$3627.18 | 146.8% | 5.35% |
|mc2025 | \$4208.40 | 186.4% | 6.26% |
|mc2025v | \$4674 | 218.1% | 6.96% |
|mc2025pv_dt | \$4377.14 | 197.92% | 6.5% |
|mc2025pv_rf | \$4690.2 | 219.2% | 6.92% |
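
The annualized figures appear consistent with the usual compounding relation, annualized return = (1 + total return)^(1 / years) - 1. The sketch below applies it to the *Rc*, *Rc_strat* and *nyear* values printed by the backtest code cell further down; the function name `annualized_return` is ours, and we have not confirmed that algosciquant computes the column this way.

```python
def annualized_return(total_return, years):
    """Constant yearly rate that compounds to total_return over the given number of years."""
    return (1.0 + total_return) ** (1.0 / years) - 1.0


nyear = 17.334247                               # nyear from the backtest output
ra_sp = annualized_return(0.638378, nyear)      # Rc (S&P 500)
ra_strat = annualized_return(2.211884, nyear)   # Rc_strat (mc2025v)
print(f"S&P 500:  {ra_sp:.4%}")
print(f"mc2025v:  {ra_strat:.4%}")
```

Both results closely match the *Ra* and *Ra_strat* values in that output (0.028891 and 0.069633).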


### Backtest Performance Graphs

Backtest from 1970 

![Backtest performance from 1970 to 2017-04-28](./mc_1970_20170428.png)

Backtest from 2000

![Backtest performance from 2000 to 2017-04-28](./mc_2000_20170428.png)

Backtest 1980 - 1990

![Backtest performance from 1980 to 1989-12-31](./mc_1980_19891231.png)
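
For readers who want to see the mechanics behind such curves, here is a minimal backtest sketch. It is not the algosciquant `backTestSummary` function. The `backtest` helper is hypothetical, the prices and signals are made up, and it assumes that trades execute at the closing price with no transaction costs.

```python
def backtest(close, signal, start_value=1.0):
    """Grow start_value by the index return on the days the strategy is invested.

    signal[i] is the +1/-1 signal available after the close of day i; it
    decides whether the strategy is invested during day i + 1.
    """
    value = start_value
    curve = [value]
    for i in range(1, len(close)):
        if signal[i - 1] == 1:
            value *= close[i] / close[i - 1]  # invested: follow the index
        curve.append(value)                   # in cash: value unchanged
    return curve


# Made-up prices and signals for illustration.
close = [100.0, 110.0, 99.0, 90.0, 99.0]
signal = [1, -1, -1, 1, 1]

curve = backtest(close, signal, start_value=100.0)
buy_and_hold = 100.0 * close[-1] / close[0]
print("strategy end value:    ", round(curve[-1], 2))
print("buy-and-hold end value:", round(buy_and_hold, 2))
```

In this toy series the strategy sits in cash through the two down days and re-enters for the recovery, which is the downside protection the hybrid models aim for.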


# Conclusions

### Summary

### Results
* We learn that basic machine learning prediction supplemented with heuristics provides improved financial return vs. the S&P 500. Based on the data analysis, we gain many insights about the market, which, as expected, are consistent with basic market theory.
* The models studied here provide downside protection and benefit from market up trends.
* There are many improvements that can be made to the approach discussed here, such as feature selection, class labels and the associated trading strategy; however, such improvements go beyond the scope of this exercise. 

### Next Steps

* Code optimization
* Other articles - stock prediction
* Application to other indexes

In [205]:
# BackTest Code
%run algosciquant

# Strategy Trade
readfile=1
predictor='mc2025v' # 'p', 'v', 'pv', mc_mcvariable+'v', mc_mcvariable+'pv', mc_mcvariable+'p'
                    # 'mc2025', 'mc2025v', 'mc2025pv'
bt_summary_graphs=0

# Read File
price_variable='close_price'
if readfile==1:
    bt_model='RF' # RF or DT
    bt_startyear='2000' # 1952 DT, 1970 DT and RF, 1980 DT and RF, 1990 DT and RF, 2000 DT and RF
    bt_test_s=dt.datetime(int(bt_startyear),1,1)
    bt_test_e=dt.datetime(2017,4,28)
    str_bt_test_e=str(bt_test_e.year)+str(bt_test_e.month)+str(bt_test_e.day)
    dfTR_fn='./data_jupyter_notebook/dfclfm_sp_nd43_'+bt_startyear+'_'+str_bt_test_e+'_'+bt_model+'.csv'
    dfTR2 = pd.read_csv(dfTR_fn,index_col=0,parse_dates=True)
    bt_nday=43;bt_vltyw=120; bt_maw=60;
else:
    bt_test_s=test_s
    bt_test_e=test_e
    bt_model=model;bt_nday=nday; bt_vltyw=vltyw; bt_maw=maw; bt_startyear=str(test_s.year)
    dfTR2=dfTR

print('...')

# Backtest
(dftsummary,dfreturns)=backTestSummary(dfTR2,dfsp,price_variable,predictor,bt_test_s,bt_test_e)

# Save backtest trade summary "dft" dataframe
ticker='sp'
str_test_syr=str(test_s.year)
str_=str(test_e.year)+str(test_e.month)+str(test_e.day)
save_dft_filename='dft_'+ticker+'_nd'+str(bt_nday)+'_'+str(predictor)+'_'+bt_startyear+'_'+str_+'_'+bt_model+'.csv'
dft.to_csv(save_dft_filename)


# Print context variables
print('strategy trade variable = ',predictor,',bt_startyear',bt_startyear,'\nend date,',test_e)
print('model = ',bt_model,', nday = ',bt_nday,', ma = ',bt_maw,', vltyw = ',bt_vltyw)
print('dft filename =',save_dft_filename)


# Annualized Returns
print("\nAnnualized Returns\n",dfreturns[['nyear','Rc','Rc_strat','Ra','Ra_strat']])

bt_summary_print=1
if bt_summary_print==1:
    print('\nYearly Trade Summary\n',dftsummary[['start_price','end_close_price','end_close_price_SP','return','return_SP']])

# Plot
bt_summary_graphs=0
if bt_summary_graphs==1:
    files=[
     './data_jupyter_notebook/dft_sp_nd43_p_'+bt_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_mc2025pv_'+bt_startyear+'_2017428_RF.csv',
     './data_jupyter_notebook/dft_sp_nd43_mc2025_'+bt_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_pv_'+bt_startyear+'_2017428_RF.csv',
     './data_jupyter_notebook/dft_sp_nd43_p_'+bt_startyear+'_2017428_RF.csv',  
     './data_jupyter_notebook/dft_sp_nd43_p_'+bt_startyear+'_2017428_DT.csv'
    ]
    lnames=['close_price','mc2025pv_rf','mc2025','pv_rf','p_rf','p_dt']
    # Set the default figure size before creating the figure so that it takes effect
    rcParams['figure.figsize'] = [12,4]
    fig = plt.figure()
    normalized_plot_from_files(fig,bt_startyear,bt_test_e,files,lnames,plot_variable='close_price',nsubplot=111,ncol=1) 



...
strategy trade variable =  mc2025v ,bt_startyear 2000 
end date, 2017-04-28 00:00:00
model =  RF , nday =  43 , ma =  60 , vltyw =  120
dft filename = dft_sp_nd43_mc2025v_2000_2017428_RF.csv

Annualized Returns
                 nyear        Rc  Rc_strat        Ra  Ra_strat
2017-04-28  17.334247  0.638378  2.211884  0.028891  0.069633

Yearly Trade Summary
       start_price  end_close_price  end_close_price_SP    return  return_SP
2000  1455.219971      1283.270020         1235.734139 -0.118161  -0.150827
2001  1283.270020      1148.079956         1294.886003 -0.105348   0.047868
2002  1154.670044       879.820007         1294.886003 -0.238033   0.000000
2003   909.030029      1111.920044         1481.956549  0.223194   0.144469
2004  1108.479980      1211.920044         1615.235606  0.093317   0.089935
2005  1202.079956      1268.800049         1691.044740  0.055504   0.046934
2006  1268.800049      1416.599976         1888.031089  0.116488   0.116488
2007  1416.599976      1468.3
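
The columns printed above can be cross-checked by hand. The sketch below assumes that `return` is a simple start-to-end price return and that `Ra` and `Ra_strat` are obtained from the cumulative returns `Rc` and `Rc_strat` by geometric compounding over `nyear` years. The function names `simple_return` and `annualized_return` are hypothetical and are not taken from `algosciquant`; the input numbers are copied from the output above.

```python
def simple_return(start_price: float, end_price: float) -> float:
    """Fractional price change over one period."""
    return (end_price - start_price) / start_price


def annualized_return(cumulative_return: float, n_years: float) -> float:
    """Constant yearly rate that compounds to the cumulative return over n_years."""
    return (1.0 + cumulative_return) ** (1.0 / n_years) - 1.0


# Values copied from the "Annualized Returns" row printed above
n_years = 17.334247
ra = annualized_return(0.638378, n_years)        # from Rc
ra_strat = annualized_return(2.211884, n_years)  # from Rc_strat

# Values copied from the year-2000 row of the "Yearly Trade Summary"
r_2000 = simple_return(1455.219971, 1283.270020)

print(f"Ra = {ra:.4f}, Ra_strat = {ra_strat:.4f}, return(2000) = {r_2000:.4f}")
```

Under these assumptions the computed values agree with the reported `Ra`, `Ra_strat`, and year-2000 `return` figures to about four decimal places, which suggests (but does not prove) that the backtest uses the same definitions.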

In [203]:
# BackTest Summary Graphs
%run algosciquant

graph = 0 
if graph == 1:
    g_startyear='1970' # 1970, 1980, 1990, 2000
    s=dt.datetime(int(g_startyear),1,1)
    e=dt.datetime(2017,4,28)
    rcParams['figure.figsize'] = 12, 2.5
    # Plot 1
    files1=[
     './data_jupyter_notebook/dft_sp_nd43_p_'+g_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_mc2025pv_'+g_startyear+'_2017428_RF.csv',
     './data_jupyter_notebook/dft_sp_nd43_mc2025_'+g_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_pv_'+g_startyear+'_2017428_RF.csv',
     './data_jupyter_notebook/dft_sp_nd43_p_'+g_startyear+'_2017428_RF.csv',  
     './data_jupyter_notebook/dft_sp_nd43_v_'+g_startyear+'_2017428_DT.csv'
    ]

    lnames1=['close_price','mc2025pv_rf','mc2025' ,'pv_rf','p_rf','v']
    price_variable='close_price'
    fig1 = plt.figure()
    normalized_plot_from_files(fig1,s,e,files1,lnames1,price_variable,nsubplot=111,sfig=1,sfigname='mc_1970_20170428.png') 

    # plot 2
    fig2 = plt.figure()
    s=dt.datetime(2000,1,1)
    e=dt.datetime(2017,4,28)
    normalized_plot_from_files(fig2,s,e,files1,lnames1,price_variable,nsubplot=111,sfig=1,sfigname='mc_2000_20170428.png') 

    # plot 3
    s=dt.datetime(1980,1,1)
    e=dt.datetime(1989,12,31)
    fig3 = plt.figure()
    files3=[
     './data_jupyter_notebook/dft_sp_nd43_p_'+g_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_mc2025_'+g_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_pv_'+g_startyear+'_2017428_RF.csv' 
    ]
    lnames3=['close_price','mc2025','pv_rf']
    normalized_plot_from_files(fig3,s,e,files3,lnames3,price_variable,nsubplot=111,sfig=1,sfigname='mc_1980_19891231.png') 

    # plot 4
    fig4 = plt.figure()
    s = dt.datetime(1970,1,1)
    e = dt.datetime(2017,4,28)
    files4=[
     './data_jupyter_notebook/dft_sp_nd43_p_'+g_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_pv_'+g_startyear+'_2017428_DT.csv',
     './data_jupyter_notebook/dft_sp_nd43_pv_'+g_startyear+'_2017428_RF.csv',  
     './data_jupyter_notebook/dft_sp_nd43_p_'+g_startyear+'_2017428_RF.csv',
     './data_jupyter_notebook/dft_sp_nd43_p_'+g_startyear+'_2017428_DT.csv'
    ]

    lnames4=['close_price','pv_dt','pv_rf','p_rf','p_dt']
    normalized_plot_from_files(fig4,s,e,files4,lnames4,price_variable,nsubplot=111,sfig=1,sfigname='mc_1970_20170428_p.png') 
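
The helper `normalized_plot_from_files` comes from `algosciquant`, and its implementation is not shown in this notebook. One plausible reading of "normalized" is that each curve is rescaled to start at 1.0, so that series with different price levels can be compared on one axis. The sketch below illustrates that idea only. The function `normalize_to_start` is hypothetical, and the sample values are the first four rows of the yearly summary printed earlier.

```python
import pandas as pd


def normalize_to_start(series: pd.Series) -> pd.Series:
    """Rescale a series so that its first value equals 1.0."""
    return series / series.iloc[0]


# Sample values copied from the "Yearly Trade Summary" output (years 2000-2003)
curves = pd.DataFrame(
    {
        "end_close_price": [1283.270020, 1148.079956, 879.820007, 1111.920044],
        "end_close_price_SP": [1235.734139, 1294.886003, 1294.886003, 1481.956549],
    },
    index=[2000, 2001, 2002, 2003],
)

# Apply the rescaling column by column; every curve now starts at 1.0
normalized = curves.apply(normalize_to_start)
print(normalized.round(4))
```

Calling `normalized.plot()` would then draw both curves from a common starting point, which is the kind of comparison the figures in this cell appear to target.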
