# Machine Learning to predict Dow Jones Industrial Average

This simple Machine Learning example shows how to predict [^DJI value](https://finance.yahoo.com/quote/%5EDJI?p=^DJI&.tsrc=fin-srch) based on the past calculated averages.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Read-data-into-a-SFrame" data-toc-modified-id="Read-data-into-a-SFrame-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read data into a SFrame</a></span><ul class="toc-item"><li><span><a href="#TODO:-Value-should-be-original-value" data-toc-modified-id="TODO:-Value-should-be-original-value-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>TODO: Value should be original value</a></span></li></ul></li><li><span><a href="#Select-the-data-to-train-and-test" data-toc-modified-id="Select-the-data-to-train-and-test-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Select the data to train and test</a></span><ul class="toc-item"><li><span><a href="#TODO:-Let's-NOT-take-last-few-days" data-toc-modified-id="TODO:-Let's-NOT-take-last-few-days-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>TODO: Let's NOT take last few days</a></span></li></ul></li><li><span><a href="#Create-the-model" data-toc-modified-id="Create-the-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Create the model</a></span><ul class="toc-item"><li><span><a href="#Print-example-predictions" data-toc-modified-id="Print-example-predictions-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Print example predictions</a></span></li></ul></li><li><span><a href="#&quot;Be-Less-Wrong&quot;" data-toc-modified-id="&quot;Be-Less-Wrong&quot;-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>"Be Less Wrong"</a></span><ul class="toc-item"><li><span><a href="#Previous-results:" data-toc-modified-id="Previous-results:-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Previous results:</a></span><ul class="toc-item"><li><span><a href="#^DJI-averages-only" data-toc-modified-id="^DJI-averages-only-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>^DJI averages only</a></span></li><li><span><a 
href="#added-ISM-Manufacturing-Employment" data-toc-modified-id="added-ISM-Manufacturing-Employment-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>added ISM Manufacturing Employment</a></span></li><li><span><a href="#added-US-Housing-Starts-m/m" data-toc-modified-id="added-US-Housing-Starts-m/m-5.1.3"><span class="toc-item-num">5.1.3&nbsp;&nbsp;</span>added US Housing Starts m/m</a></span></li><li><span><a href="#added-Manufacturing-PMI" data-toc-modified-id="added-Manufacturing-PMI-5.1.4"><span class="toc-item-num">5.1.4&nbsp;&nbsp;</span>added Manufacturing PMI</a></span></li></ul></li><li><span><a href="#TODO:-find-the-best-model" data-toc-modified-id="TODO:-find-the-best-model-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>TODO: find the best model</a></span></li></ul></li><li><span><a href="#Save-the-model" data-toc-modified-id="Save-the-model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Save the model</a></span></li></ul></div>

## Setup

In [1]:
column_to_predict = "DJIA_Original"

# Location of the CSV (comma-separated values) file with all the data, which I prepared in a separate notebook.
data_path="./DATA/processed/uber.csv"

# Install TuriCreate. Last updated November 4, 2020

# !pip install --upgrade pip
# !pip install Turicreate

import turicreate as tc

## Read data into a SFrame

In [2]:
# Load the data
data =  tc.SFrame(data_path)
data[364:370] # show data sample

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


Day,Date,DJIA_Value,DJIA_Original,DJIA_Avg005,DJIA_Avg030,DJIA_Avg090,DJIA_Avg180,DJIA_Avg365
736339,2017-01-10,38.0,19957.119141,38.6,38.07,30.84,27.9,21.57
736340,2017-01-11,38.0,19973.419922,38.4,38.1,31.02,27.97,21.65
736341,2017-01-12,38.0,19929.289063,38.2,38.1,31.19,28.03,21.73
736342,2017-01-13,38.0,19952.029297,38.0,38.1,31.36,28.1,21.82
736343,2017-01-14,38.0,19952.029297,38.0,38.1,31.52,28.17,21.9
736344,2017-01-15,38.0,19952.029297,38.0,38.1,31.7,28.23,21.99

ISM_MFC_EMP_Value,ISM_MFC_EMP_Original,ISM_MFC_EMP_Avg090,ISM_MFC_EMP_Avg180,ISM_MFC_EMP_Avg365,HOUSE_SRT_MM_Value
71.0,53.1,62.64,51.82,42.85,-75.0
71.0,53.1,62.93,51.94,42.96,-75.0
71.0,53.1,63.22,52.06,43.06,-75.0
71.0,53.1,63.51,52.17,43.17,-75.0
71.0,53.1,63.8,52.29,43.28,-75.0
71.0,53.1,64.09,52.41,43.38,-75.0

HOUSE_SRT_MM_Original,HOUSE_SRT_MM_Avg090,HOUSE_SRT_MM_Avg180,HOUSE_SRT_MM_Avg365,MFC_MPI_Value,MFC_MPI_Original
-18.7,8.2,11.12,9.8,68.0,54.7
-18.7,7.54,10.66,9.6,68.0,54.7
-18.7,6.89,10.19,9.4,68.0,54.7
-18.7,6.23,9.72,9.19,68.0,54.7
-18.7,5.58,9.26,8.99,68.0,54.7
-18.7,4.92,8.66,8.79,68.0,54.7

MFC_MPI_Avg090,MFC_MPI_Avg180,MFC_MPI_Avg365
48.99,44.26,36.85
49.3,44.33,37.01
49.61,44.4,37.16
49.92,44.47,37.32
50.23,44.54,37.47
50.54,44.62,37.63


### TODO: Value should be original value

Please note that "High" is normalized to Int8,
but for prediction purposes it should be the original "real" value.

## Select the data to train and test

In [3]:
row_count = len(data)
# Skip the first year of data because its averages are not yet complete
data = data[365:row_count] 
# Make a train-test split
train_data, test_data = data.random_split(0.8)

### TODO: Let's NOT take last few days

I need to save the last few days to see if I can really predict upcoming values.
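One way to set the last few days aside is to slice them off before the train-test split. The sketch below is only an illustration: `split_off_last_days` and `holdout_days` are hypothetical names, a plain list stands in for the SFrame, and it assumes the rows are sorted by day in ascending order.

```python
def split_off_last_days(rows, holdout_days):
    """Return (earlier_rows, last_rows), keeping the final `holdout_days` rows apart.

    Works on anything sliceable by position, such as a list; the notebook
    already slices its SFrame the same way (data[365:row_count]).
    """
    if not 0 < holdout_days < len(rows):
        raise ValueError("holdout_days must be between 1 and len(rows) - 1")
    cut = len(rows) - holdout_days
    return rows[0:cut], rows[cut:len(rows)]


# Stand-in for the chronologically ordered rows (one entry per day).
days = list(range(736339, 736349))
earlier, last_few = split_off_last_days(days, holdout_days=3)
print(len(earlier), last_few)
```

The `earlier` part would then go through `random_split(0.8)` as before, while `last_few` stays untouched until the model is trained.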

## Create the model

- https://apple.github.io/turicreate/docs/api/generated/turicreate.regression.create.html
- Automatically picks the right model based on your data.
- target: is the number to be predicted.
- features: are the values that we use to try to find a pattern leading to the prediction.

In [4]:
model = tc.regression.create(
    train_data, 
    target = column_to_predict,
    features = [
         'DJIA_Avg005'
        ,'DJIA_Avg030'
        ,'DJIA_Avg090'
        ,'DJIA_Avg180'
        ,'DJIA_Avg365'
        
        ,'ISM_MFC_EMP_Avg090'
        ,'ISM_MFC_EMP_Avg180'
        ,'ISM_MFC_EMP_Avg365'
        
        ,'HOUSE_SRT_MM_Value'
        ,'HOUSE_SRT_MM_Avg090'
        ,'HOUSE_SRT_MM_Avg180'
        ,'HOUSE_SRT_MM_Avg365'
        
        ,'MFC_MPI_Value'
        ,'MFC_MPI_Avg090'
        ,'MFC_MPI_Avg180'
        ,'MFC_MPI_Avg365'
    ],
    validation_set='auto', 
    verbose=True
)

# Predict values on data that was NOT used in training

In [5]:
#test_data.explore()
#test_data

In [6]:
## Save predictions to an SArray
predictions = model.predict(test_data)
#predictions

### Print example predictions

In [7]:
start = 0
end = len(predictions)
step = 30

print(column_to_predict)

for i in range(start, end, step):
    a = round(predictions[i], 2)          # predicted value
    b = test_data[i][column_to_predict]   # actual value
    print("predicted ", round(a, 0), "\t, but actual value was \t", round(b, 0), "\t difference is \t", round(b - a, 2))

DJIA_Original
predicted  19954.0 	, but actual value was 	 19883.0 	 difference is 	 -71.16
predicted  21267.0 	, but actual value was 	 21305.0 	 difference is 	 38.12
predicted  23459.0 	, but actual value was 	 23429.0 	 difference is 	 -30.15
predicted  24408.0 	, but actual value was 	 24108.0 	 difference is 	 -299.59
predicted  26096.0 	, but actual value was 	 26192.0 	 difference is 	 95.67
predicted  24913.0 	, but actual value was 	 24860.0 	 difference is 	 -52.94
predicted  27209.0 	, but actual value was 	 27282.0 	 difference is 	 72.5
predicted  28116.0 	, but actual value was 	 28291.0 	 difference is 	 174.25
predicted  23691.0 	, but actual value was 	 23730.0 	 difference is 	 39.35


## "Be Less Wrong"

Evaluate how good the model is.

The prediction results vary from run to run, so it is worth rerunning the training until you find the model with the smallest error,

or **as Elon Musk says "Be less wrong"**.
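The two error numbers reported in the results are `max_error` and `rmse`. As a rough illustration of what they measure, here is a plain-Python sketch that assumes the standard definitions (largest absolute difference, and root of the mean squared difference). The helper functions and the sample numbers are made up for the example and are not TuriCreate code.

```python
import math


def max_error(actual, predicted):
    """Largest absolute difference between an actual and a predicted value."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values, only to show the arithmetic.
actual = [100.0, 200.0, 300.0]
predicted = [103.0, 196.0, 300.0]
print(max_error(actual, predicted), rmse(actual, predicted))
```

Because `rmse` averages over all rows while `max_error` looks only at the single worst row, one run can win on one metric and lose on the other.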

### Previous results:

#### ^DJI averages only 

- {'max_error': 1749.5078773959249, 'rmse': 124.58897796835019}
- {'max_error': 1621.9227669335778, 'rmse': 106.39104997423203}
- {'max_error': 1297.117071650111, 'rmse': 101.14871945325757} - BEST RMSE
- {'max_error': 1122.2711616305896, 'rmse': 183.129076342891}


#### added ISM Manufacturing Employment
- {'max_error': 1708.487399827758, 'rmse': 235.1824022060072}
- {'max_error': 1093.9698394310544, 'rmse': 188.48468898003293} - BEST Max Error

#### added US Housing Starts m/m
- {'max_error': 1715.473257380545, 'rmse': 255.97228904279297}
- {'max_error': 1255.8992158671826, 'rmse': 217.41264660030447}
- {'max_error': 1102.1036788719757, 'rmse': 226.13843475657265}
- {'max_error': 1397.611965987675, 'rmse': 236.1374235983197}

#### added Manufacturing PMI
- {'max_error': 1522.3008456494535, 'rmse': 236.8009135100335}
- {'max_error': 1629.302702719142, 'rmse': 229.67414881442798}

### TODO: find the best model

Create a "for" loop to find the best model
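A possible shape for that loop is sketched below. The names `find_best_model`, `train_and_evaluate` and `fake_run` are hypothetical, and the training step is replaced by a stand-in that replays the "^DJI averages only" results listed earlier, so the sketch runs without TuriCreate.

```python
def find_best_model(train_and_evaluate, runs=10, metric="rmse"):
    """Call train_and_evaluate() `runs` times and keep the run with the lowest metric.

    train_and_evaluate is a hypothetical callable returning (model, results),
    where results is a dict such as {'max_error': ..., 'rmse': ...}.
    """
    best_model, best_results = None, None
    for _ in range(runs):
        model, results = train_and_evaluate()
        if best_results is None or results[metric] < best_results[metric]:
            best_model, best_results = model, results
    return best_model, best_results


# Stand-in for real training: replays results recorded in this notebook.
previous_results = iter([
    {"max_error": 1749.5078773959249, "rmse": 124.58897796835019},
    {"max_error": 1621.9227669335778, "rmse": 106.39104997423203},
    {"max_error": 1297.117071650111, "rmse": 101.14871945325757},
    {"max_error": 1122.2711616305896, "rmse": 183.129076342891},
])


def fake_run():
    # A real run would return the trained model instead of a placeholder.
    return "model placeholder", next(previous_results)


best_model, best_results = find_best_model(fake_run, runs=4)
print(best_results)
```

In this notebook, `train_and_evaluate` could repeat `data.random_split(0.8)`, call `tc.regression.create(...)` with the same arguments as above, and return the model together with `model.evaluate(test_data)`. Passing `metric="max_error"` would select on the worst-case error instead.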

In [8]:
#TODO: write this in a loop to select the best model
# Evaluate the model and save the results into a dictionary
results = model.evaluate( test_data ) #test_data[0:2531]
results

{'max_error': 1537.3488101040712, 'rmse': 223.1978172408259}

## Save the model

Save the model for future use in macOS, iOS, and other applications.

In [9]:
# Export to Core ML
model.export_coreml('./DATA/models/^DJI.mlmodel')