# Machine Learning to predict Dow Jones Industrial Average

This simple example shows how to predict the ^DJI `High` value from its past averages (`Avg005` through `Avg365`).

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1">Setup</a></span></li><li><span><a href="#Read-data-into-a-SFrame" data-toc-modified-id="Read-data-into-a-SFrame-2">Read data into a SFrame</a></span><ul class="toc-item"><li><span><a href="#TODO:-Value-should-be-original-value" data-toc-modified-id="TODO:-Value-should-be-original-value-2.1">TODO: Value should be original value</a></span></li></ul></li><li><span><a href="#Select-the-data-to-train-and-test" data-toc-modified-id="Select-the-data-to-train-and-test-3">Select the data to train and test</a></span><ul class="toc-item"><li><span><a href="#TODO:-Let's-NOT-take-last-few-days" data-toc-modified-id="TODO:-Let's-NOT-take-last-few-days-3.1">TODO: Let's NOT take last few days</a></span></li></ul></li><li><span><a href="#Create-the-model" data-toc-modified-id="Create-the-model-4">Create the model</a></span><ul class="toc-item"><li><span><a href="#Print-example-predictions" data-toc-modified-id="Print-example-predictions-4.1">Print example predictions</a></span></li></ul></li><li><span><a href="#&quot;Be-Less-Wrong&quot;" data-toc-modified-id="&quot;Be-Less-Wrong&quot;-5">"Be Less Wrong"</a></span><ul class="toc-item"><li><span><a href="#TODO:-find-the-best-model" data-toc-modified-id="TODO:-find-the-best-model-5.1">TODO: find the best model</a></span></li></ul></li><li><span><a href="#Save-the-model" data-toc-modified-id="Save-the-model-6">Save the model</a></span></li></ul></div>

## Setup

In [1]:
# Update TuriCreate. Last updated November 4, 2020

# !pip install --upgrade pip
# !pip install Turicreate

In [2]:
import turicreate as tc

In [3]:
data_path="./DATA/processed/^DJI.csv"

## Read data into a SFrame

In [4]:
# Load the data
data = tc.SFrame(data_path)
data[363:373]

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,float,float,float,float,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


Day,Date,High,Avg005,Avg030,Avg090,Avg180,Avg365
725033,1986-01-27,-125.0,-125.2,-125.1,-125.66,-126.33,0.0
725034,1986-01-28,-125.0,-125.0,-125.1,-125.63,-126.32,-126.98
725035,1986-01-29,-125.0,-125.0,-125.1,-125.61,-126.31,-126.97
725036,1986-01-30,-125.0,-125.0,-125.1,-125.59,-126.29,-126.96
725037,1986-01-31,-125.0,-125.0,-125.1,-125.57,-126.28,-126.96
725038,1986-02-01,-125.0,-125.0,-125.1,-125.54,-126.27,-126.95
725039,1986-02-02,-125.0,-125.0,-125.1,-125.52,-126.26,-126.94
725040,1986-02-03,-125.0,-125.0,-125.1,-125.5,-126.25,-126.93
725041,1986-02-04,-125.0,-125.0,-125.1,-125.48,-126.24,-126.92
725042,1986-02-05,-125.0,-125.0,-125.1,-125.46,-126.23,-126.92


### TODO: Value should be original value

Please note that `High` is normalized to the Int8 range,
but for prediction purposes it should be the original "real" value.
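
The normalization step itself is not shown in this notebook. As a minimal sketch, assuming `High` was produced by a linear min-max mapping onto the Int8 range, the original value could be recovered roughly as below. The function names and the `lo`/`hi` bounds are hypothetical, and the real preprocessing may differ.

```python
INT8_MIN, INT8_MAX = -128, 127


def to_int8_scale(value, lo, hi):
    """Map value from [lo, hi] linearly onto the Int8 range (assumed scheme)."""
    return round(INT8_MIN + (value - lo) * (INT8_MAX - INT8_MIN) / (hi - lo))


def from_int8_scale(scaled, lo, hi):
    """Invert to_int8_scale; exact only up to one Int8 step, (hi - lo) / 255."""
    return lo + (scaled - INT8_MIN) * (hi - lo) / (INT8_MAX - INT8_MIN)


# Hypothetical bounds, chosen only so that one Int8 step equals 100 points.
lo, hi = 0.0, 25500.0
scaled = to_int8_scale(10000.0, lo, hi)
restored = from_int8_scale(scaled, lo, hi)
print(scaled, restored)
```

Because the forward mapping rounds, the restored value is only accurate to within one Int8 step, which is one reason to train on the original values instead.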

## Select the data to train and test

In [5]:
# Do not take initial year data as averages are not complete
data = data[365:13063] 
# Make a train-test split
train_data, test_data = data.random_split(0.8)
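
The reason for skipping the first year can be illustrated with a small sketch. It assumes that each `AvgNNN` column is a trailing mean of `High` over NNN rows and that the preprocessing writes 0.0 until a full window is available (which would match the `Avg365` value of 0.0 in the sample above). The preprocessing code is not part of this notebook, so `trailing_average` is a hypothetical stand-in.

```python
def trailing_average(values, window):
    """Trailing mean over `window` items; 0.0 while the window is still incomplete."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        if i + 1 >= window:
            averages.append(round(running_sum / window, 2))
        else:
            averages.append(0.0)
    return averages


# Illustrative values only, not rows from the dataset.
highs = [-126.0, -126.0, -125.0, -125.0, -125.0, -125.0, -125.0]
print(trailing_average(highs, 5))
```

With a 365-row window, the first 364 rows would carry the placeholder 0.0, so training on them would teach the model a pattern that does not exist.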

### TODO: Let's NOT take last few days

I need to save the last few days to see if I can really predict upcoming values.
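
One way to do this is to cut off the tail of the date-ordered data before the random split. The sketch below works on any sequence that supports `len()` and slicing. The helper name and the holdout size of 3 are hypothetical.

```python
def split_off_last_days(rows, holdout_days):
    """Split a date-ordered sequence into (history, holdout), keeping the order."""
    if not 0 < holdout_days < len(rows):
        raise ValueError("holdout_days must be between 1 and len(rows) - 1")
    cut = len(rows) - holdout_days
    return rows[:cut], rows[cut:]


# Demonstration on plain day numbers instead of the real SFrame.
days = list(range(725033, 725043))
history, holdout = split_off_last_days(days, 3)
print(len(history), holdout)
```

In the notebook, this would presumably be applied to `data` before `random_split`, so that the held-out days never reach either `train_data` or `test_data`.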

## Create the model

- https://apple.github.io/turicreate/docs/api/generated/turicreate.regression.create.html
- Automatically picks the right model based on your data.
- target: the number to be predicted.
- features: the values that we use to find patterns leading to the prediction.

In [6]:
model = tc.regression.create(
    train_data, 
    target='High',
    features = [
        'Avg005',
        'Avg030',
        'Avg090',
        'Avg180',
        'Avg365'
    ],
    validation_set='auto', 
    verbose=True
)

# Predict values on data that was NOT used in training

In [7]:
#test_data.explore()
test_data

Day,Date,High,Avg005,Avg030,Avg090,Avg180,Avg365
725039,1986-02-02,-125.0,-125.0,-125.1,-125.52,-126.26,-126.94
725041,1986-02-04,-125.0,-125.0,-125.1,-125.48,-126.24,-126.92
725044,1986-02-07,-125.0,-125.0,-125.1,-125.41,-126.21,-126.9
725047,1986-02-10,-125.0,-125.0,-125.1,-125.37,-126.17,-126.87
725050,1986-02-13,-124.0,-124.6,-125.0,-125.31,-126.13,-126.84
725052,1986-02-15,-124.0,-124.2,-124.93,-125.27,-126.09,-126.82
725061,1986-02-24,-124.0,-124.0,-124.57,-125.07,-125.94,-126.73
725064,1986-02-27,-124.0,-124.0,-124.47,-125.0,-125.89,-126.69
725077,1986-03-12,-123.0,-123.6,-123.97,-124.69,-125.67,-126.55
725079,1986-03-14,-123.0,-123.2,-123.87,-124.64,-125.62,-126.53


In [8]:
## Save predictions to an SArray
predictions = model.predict(test_data)
#predictions

### Print example predictions

In [9]:
start = 0
end = len(predictions)
step = 100

# Compare every 100th prediction with the actual value
for i in range(start, end, step):
    predicted = round(predictions[i], 2)
    actual = test_data[i]["High"]  # each SFrame row is a dict
    print("predicted ", predicted, "\t, but actual value was ", actual, "\t difference is ", round(actual - predicted, 2))

predicted  -124.99 	, but actual value was  -125.0 	 difference is  -0.01
predicted  -118.01 	, but actual value was  -118.0 	 difference is  0.01
predicted  -121.05 	, but actual value was  -121.0 	 difference is  0.05
predicted  -115.04 	, but actual value was  -115.0 	 difference is  0.04
predicted  -112.42 	, but actual value was  -113.0 	 difference is  -0.58
predicted  -108.98 	, but actual value was  -109.0 	 difference is  -0.02
predicted  -106.05 	, but actual value was  -106.0 	 difference is  0.05
predicted  -95.12 	, but actual value was  -95.0 	 difference is  0.12
predicted  -75.3 	, but actual value was  -74.0 	 difference is  1.3
predicted  -61.7 	, but actual value was  -61.0 	 difference is  0.7
predicted  -36.34 	, but actual value was  -36.0 	 difference is  0.34
predicted  -49.33 	, but actual value was  -49.0 	 difference is  0.33
predicted  -60.89 	, but actual value was  -60.0 	 difference is  0.89
predicted  -48.86 	, but actual value was  -49.0 	 difference is

## "Be Less Wrong"

Evaluate how good the model is.

The prediction results vary from run to run (`random_split` is called without a seed, so every run trains and tests on a different split), so it is worth rerunning until you find the model with the minimum error,

or **as Elon Musk says "Be less wrong"**.

Previous results:

- {'max_error': 16.05437802584514, 'rmse': 0.9679348693484652}
- {'max_error': 15.462971948998508, 'rmse': 1.0355459041929513}
- {'max_error': 12.347353678085256, 'rmse': 0.9804142119131803}
- {'max_error': 9.234282740244765, 'rmse': 0.9490266831513133}
- {'max_error': 8.58596529862438, 'rmse': 0.8988876901151138} - best result

TODO: write this in a loop to select the best model

### TODO: find the best model

Create a "for" loop to find the best model
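
A minimal sketch of such a loop is shown below. It is written against a generic `train_and_evaluate(seed)` callable so that it runs without Turi Create. The names `pick_best_run` and `replay_run` are hypothetical, and the stand-in simply replays the RMSE values listed under "Previous results".

```python
def pick_best_run(train_and_evaluate, seeds):
    """Call train_and_evaluate(seed) for each seed; return the run with the lowest RMSE.

    train_and_evaluate must return (model, metrics), where metrics has an 'rmse' key.
    """
    best = None
    for seed in seeds:
        model, metrics = train_and_evaluate(seed)
        if best is None or metrics["rmse"] < best[2]["rmse"]:
            best = (seed, model, metrics)
    return best


# Stand-in for real training. In the notebook this would instead split the data
# with data.random_split(0.8, seed=seed), call tc.regression.create(...) on the
# training part, and return (model, model.evaluate(test_part)).
previous_rmse = [0.9679348693484652, 1.0355459041929513, 0.9804142119131803,
                 0.9490266831513133, 0.8988876901151138]


def replay_run(seed):
    return "model-%d" % seed, {"rmse": previous_rmse[seed]}


best_seed, best_model, best_metrics = pick_best_run(replay_run, range(len(previous_rmse)))
print(best_seed, best_metrics)
```

Note that choosing the run by its test-set RMSE makes that RMSE optimistic. An honest estimate needs data that took no part in the selection.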

In [10]:
# Evaluate the model and save the results into a dictionary
results = model.evaluate( test_data ) #test_data[0:2531]
results

{'max_error': 15.166793321285851, 'rmse': 1.0168646986777539}
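
For reference, the two numbers above follow the usual definitions: `max_error` is the largest absolute residual and `rmse` is the square root of the mean squared residual. A minimal pure-Python sketch (the function name and sample values are illustrative, not taken from the dataset):

```python
import math


def regression_errors(actual, predicted):
    """Return max absolute error and root-mean-square error for paired values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return {
        "max_error": max(abs(r) for r in residuals),
        "rmse": math.sqrt(sum(r * r for r in residuals) / len(residuals)),
    }


errors = regression_errors([-125.0, -118.0, -121.0], [-124.0, -118.0, -123.0])
print(errors)
```

A low `rmse` next to a much larger `max_error`, as in the result above, suggests that most predictions are close but a few rows are far off.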

## Save the model

Save the model for future use in macOS, iOS, and other applications.

In [11]:
# Export to Core ML
model.export_coreml('./DATA/models/^DJI.mlmodel')