# Machine Learning to Predict the Dow Jones Industrial Average

This simple machine learning example shows how to predict the [^DJI value](https://finance.yahoo.com/quote/%5EDJI?p=^DJI&.tsrc=fin-srch) from previously calculated averages.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1">Setup</a></span></li><li><span><a href="#Read-data-into-a-SFrame" data-toc-modified-id="Read-data-into-a-SFrame-2">Read data into a SFrame</a></span><ul class="toc-item"><li><span><a href="#TODO:-Value-should-be-original-value" data-toc-modified-id="TODO:-Value-should-be-original-value-2.1">TODO: Value should be original value</a></span></li></ul></li><li><span><a href="#Select-the-data-to-train-and-test" data-toc-modified-id="Select-the-data-to-train-and-test-3">Select the data to train and test</a></span><ul class="toc-item"><li><span><a href="#TODO:-Let's-NOT-take-last-few-days" data-toc-modified-id="TODO:-Let's-NOT-take-last-few-days-3.1">TODO: Let's NOT take last few days</a></span></li></ul></li><li><span><a href="#Create-the-model" data-toc-modified-id="Create-the-model-4">Create the model</a></span><ul class="toc-item"><li><span><a href="#Print-example-predictions" data-toc-modified-id="Print-example-predictions-4.1">Print example predictions</a></span></li></ul></li><li><span><a href="#&quot;Be-Less-Wrong&quot;" data-toc-modified-id="&quot;Be-Less-Wrong&quot;-5">"Be Less Wrong"</a></span><ul class="toc-item"><li><span><a href="#TODO:-find-the-best-model" data-toc-modified-id="TODO:-find-the-best-model-5.1">TODO: find the best model</a></span></li></ul></li><li><span><a href="#Save-the-model" data-toc-modified-id="Save-the-model-6">Save the model</a></span></li></ul></div>

## Setup

In [1]:
# Install TuriCreate. Last updated November 4, 2020

# !pip install --upgrade pip
# !pip install Turicreate

In [2]:
import turicreate as tc

In [3]:
# Location of the CSV (comma-separated values) file with ^DJI data that I prepared in a separate notebook.
data_path = "./DATA/processed/^DJI.csv"

## Read data into a SFrame

In [4]:
# Load the data
data = tc.SFrame(data_path)
data[363:370] # show data sample

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,float,float,float,float,float,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


Day,Date,Value,Original,Avg005,Avg030,Avg090,Avg180,Avg365
725033,1986-01-27,-125.0,1548.170044,-125.2,-125.1,-125.66,-126.33,0.0
725034,1986-01-28,-125.0,1561.349976,-125.0,-125.1,-125.63,-126.32,-126.98
725035,1986-01-29,-125.0,1578.099976,-125.0,-125.1,-125.61,-126.31,-126.97
725036,1986-01-30,-125.0,1572.589966,-125.0,-125.1,-125.59,-126.29,-126.96
725037,1986-01-31,-125.0,1582.910034,-125.0,-125.1,-125.57,-126.28,-126.96
725038,1986-02-01,-125.0,1582.910034,-125.0,-125.1,-125.54,-126.27,-126.95
725039,1986-02-02,-125.0,1582.910034,-125.0,-125.1,-125.52,-126.26,-126.94


### TODO: Value should be original value

Please note that the "High" value is normalized to Int8,
but for prediction purposes it should be the original "real" value.

## Select the data to train and test

In [5]:
row_count = len(data)
# Skip the first year of data, because the long-window averages are not complete yet
data = data[365:row_count]
# Make a train-test split
train_data, test_data = data.random_split(0.8)
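
The averages themselves were prepared in a separate notebook, so the following is only a minimal sketch of why the first year is dropped. It assumes a trailing mean over the last `window` values (including the current one) with `0.0` as a placeholder until the window is full; the function name `trailing_average` is hypothetical. Under that assumption a 365-value window first becomes complete at row index 364, which agrees with the sample above, where `Avg365` is `0.0` at row 363 and filled in from row 364.

```python
from typing import List


def trailing_average(values: List[float], window: int) -> List[float]:
    """Mean of the last `window` values up to and including each position.

    Assumption: positions that have fewer than `window` values so far
    get 0.0 as a placeholder instead of a partial average.
    """
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the window
            running_sum -= values[i - window]
        if i + 1 >= window:
            averages.append(round(running_sum / window, 2))
        else:
            averages.append(0.0)
    return averages


values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(trailing_average(values, window=3))  # [0.0, 0.0, 2.0, 3.0, 4.0]
```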

### TODO: Let's NOT take last few days

I need to save the last few days to see if I can really predict upcoming values.
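
One way this TODO could be approached is to cut off the most recent rows before the random split, so that they never reach training. The sketch below uses plain Python dictionaries instead of an SFrame to keep it self-contained; the helper name `hold_out_last_days` and the choice of 3 days are assumptions, and only the `Day` and `Original` columns come from the data above.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def hold_out_last_days(rows: List[Row], n_days: int) -> Tuple[List[Row], List[Row]]:
    """Split rows chronologically: everything before the last n_days, and the last n_days."""
    ordered = sorted(rows, key=lambda row: row["Day"])
    if n_days <= 0:
        return ordered, []
    return ordered[:-n_days], ordered[-n_days:]


# A few rows taken from the data sample shown earlier
rows = [
    {"Day": 725033, "Original": 1548.170044},
    {"Day": 725034, "Original": 1561.349976},
    {"Day": 725035, "Original": 1578.099976},
    {"Day": 725036, "Original": 1572.589966},
    {"Day": 725037, "Original": 1582.910034},
]

history, upcoming = hold_out_last_days(rows, n_days=3)
print(len(history), "rows for training/testing,", len(upcoming), "rows held out")
```

The `history` part would then go through the train-test split, while `upcoming` is only used at the very end to check predictions against days the model has never seen.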

## Create the model

- https://apple.github.io/turicreate/docs/api/generated/turicreate.regression.create.html
- Automatically picks the right model based on your data.
- target: the number to be predicted.
- features: the values that we use to find a pattern leading to the prediction.

In [6]:
model = tc.regression.create(
    train_data, 
    target='Original',
    features = [
        #'Value', # Training against the quantized original value is overfitting
        'Avg005',
        'Avg030',
        'Avg090',
        'Avg180',
        'Avg365'
    ],
    validation_set='auto', 
    verbose=True
)

### Predict values on data that was NOT used in training

In [7]:
#test_data.explore()
test_data

Day,Date,Value,Original,Avg005,Avg030,Avg090,Avg180,Avg365
725047,1986-02-10,-125.0,1633.140015,-125.0,-125.1,-125.37,-126.17,-126.87
725049,1986-02-12,-124.0,1640.47998,-124.8,-125.03,-125.33,-126.14,-126.85
725059,1986-02-22,-124.0,1702.75,-124.0,-124.63,-125.11,-125.98,-126.75
725064,1986-02-27,-124.0,1728.900024,-124.0,-124.47,-125.0,-125.89,-126.69
725078,1986-03-13,-123.0,1768.800049,-123.4,-123.9,-124.67,-125.64,-126.54
725087,1986-03-22,-123.0,1821.23999,-123.0,-123.6,-124.47,-125.44,-126.42
725088,1986-03-23,-123.0,1821.23999,-123.0,-123.57,-124.44,-125.42,-126.4
725089,1986-03-24,-123.0,1796.219971,-123.0,-123.53,-124.42,-125.4,-126.39
725093,1986-03-28,-123.0,1849.73999,-123.0,-123.4,-124.33,-125.31,-126.33
725094,1986-03-29,-123.0,1849.73999,-123.0,-123.37,-124.31,-125.29,-126.32


In [8]:
# Save predictions to an SArray
predictions = model.predict(test_data)
#predictions

### Print example predictions

In [9]:
start = 0
end = len(predictions)
step = 50

# Compare every 50th prediction with the actual value
for i in range(start, end, step):
    predicted = round(predictions[i], 2)
    actual = test_data[i]["Original"]
    print("predicted ", round(predicted, 0), "\t, but actual value was \t", round(actual, 0), "\t difference is \t", round(actual - predicted, 2))

predicted  1586.0 	, but actual value was 	 1633.0 	 difference is 	 47.41
predicted  1809.0 	, but actual value was 	 1836.0 	 difference is 	 27.19
predicted  2366.0 	, but actual value was 	 2330.0 	 difference is 	 -36.11
predicted  1941.0 	, but actual value was 	 1982.0 	 difference is 	 41.01
predicted  2140.0 	, but actual value was 	 2151.0 	 difference is 	 10.72
predicted  2700.0 	, but actual value was 	 2719.0 	 difference is 	 18.06
predicted  2575.0 	, but actual value was 	 2624.0 	 difference is 	 48.64
predicted  2474.0 	, but actual value was 	 2502.0 	 difference is 	 27.91
predicted  3030.0 	, but actual value was 	 3042.0 	 difference is 	 12.27
predicted  3252.0 	, but actual value was 	 3268.0 	 difference is 	 16.58
predicted  3367.0 	, but actual value was 	 3329.0 	 difference is 	 -38.41
predicted  3584.0 	, but actual value was 	 3577.0 	 difference is 	 -6.99
predicted  3681.0 	, but actual value was 	 3651.0 	 difference is 	 -30.69
predicted  3712.0 	, b

## "Be Less Wrong"

Evaluate how good the model is.

The prediction results appear to vary from run to run (the train-test split is random), so it is worth re-running until you find the model with the minimum error,

or **as Elon Musk says, "Be less wrong"**.

Previous results:

- {'max_error': 1749.5078773959249, 'rmse': 124.58897796835019}
- {'max_error': 1621.9227669335778, 'rmse': 106.39104997423203}

TODO: write this in a loop to select the best model

### TODO: find the best model

Create a "for" loop to find the best model
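
A possible shape for that loop is sketched below. It is not tied to Turi Create: `find_best_run` and `replay_recorded_run` are hypothetical names, and the stand-in run simply replays the three results recorded in this notebook so that the example runs on its own. In a real run, the callable would do the split, `tc.regression.create(...)` and `model.evaluate(test_data)` for each seed.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

RunResult = Tuple[Any, Dict[str, float]]


def find_best_run(
    run_once: Callable[[int], RunResult],
    seeds: Iterable[int],
    metric: str = "rmse",
) -> Tuple[int, Any, Dict[str, float]]:
    """Call run_once(seed) for every seed and keep the run with the lowest metric."""
    best = None
    for seed in seeds:
        model, results = run_once(seed)
        if best is None or results[metric] < best[2][metric]:
            best = (seed, model, results)
    if best is None:
        raise ValueError("at least one seed is required")
    return best


# Results recorded in this notebook, used here in place of real training runs
RECORDED = [
    {"max_error": 1749.5078773959249, "rmse": 124.58897796835019},
    {"max_error": 1621.9227669335778, "rmse": 106.39104997423203},
    {"max_error": 1673.1557625073183, "rmse": 117.22709863611067},
]


def replay_recorded_run(seed: int) -> RunResult:
    """Hypothetical stand-in: returns a placeholder model name and a recorded result."""
    return f"model-{seed}", RECORDED[seed]


best_seed, best_model, best_results = find_best_run(replay_recorded_run, range(len(RECORDED)))
print("best run:", best_seed, best_model, best_results)
```

With the recorded results this picks the second run (rmse of about 106.39). Note that choosing a run by its test-set error makes that error look better than it really is, which is one more reason to keep the last few days aside for a final check.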

In [10]:
# Evaluate the model and save the results into a dictionary
results = model.evaluate( test_data ) #test_data[0:2531]
results

{'max_error': 1673.1557625073183, 'rmse': 117.22709863611067}
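
To make the two numbers easier to read, here is a minimal sketch of the usual definitions: `max_error` as the largest absolute difference between actual and predicted values, and `rmse` as the square root of the mean squared difference. The helper name `regression_errors` is hypothetical, and the inputs are the first three (rounded) rows of the example predictions printed above, not the full test set.

```python
import math
from typing import Dict, Sequence


def regression_errors(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute max absolute error and root mean squared error for paired values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    return {
        "max_error": max(abs(e) for e in errors),
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
    }


# First three rows of the example predictions (rounded values)
predicted = [1586.0, 1809.0, 2366.0]
actual = [1633.0, 1836.0, 2330.0]

metrics = regression_errors(actual, predicted)
print(metrics)
```

Because the squared differences are averaged, a single large miss (such as the `max_error` of over 1600 points above) weighs heavily on `rmse`, while most predictions can still be much closer.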

## Save the model

Save the model for future use in macOS, iOS, and other applications.

In [11]:
# Export to Core ML
model.export_coreml('./DATA/models/^DJI.mlmodel')