In [6]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.preprocessing import FunctionTransformer
import numpy as np
from pandas import DataFrame
plt.style.use('fivethirtyeight')

In [7]:
cane = pd.read_csv('../data/CANE.csv', index_col=0)

***Exploratory Data Analysis (EDA)***

This phase of the project shows how much cleanup the dataset needs.  Stock data is quite clean compared to most other datasets, but it is still worth looking at what one is working with, because not all machine learning model types handle data the same way.  For example, a plain linear regression cannot fit a sine wave, since no straight line can follow a curve that keeps oscillating.
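
As a rough illustration of the sine wave point, the sketch below fits a linear regression and a shallow decision tree to the same curve. The data is synthetic and the variable names are made up for this example; it is not part of the CANE analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): two full periods of a sine wave.
x = np.linspace(0, 4 * np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel()

linear = LinearRegression().fit(x, y)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(x, y)

# score() returns R^2 on the data passed in (here, the training data itself).
linear_r2 = linear.score(x, y)
tree_r2 = tree.score(x, y)
print(f"linear R^2: {linear_r2:.3f}")
print(f"tree R^2:   {tree_r2:.3f}")
```

The straight line explains only a small share of the variance, while the tree's piecewise-constant steps track the curve closely.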

***Graph of The Closing Price***

This shows the closing price over the full time series.

In [None]:
# matplotlib code here
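
The plotting cell was left as a placeholder. One possible way to draw the graph is sketched below; it is an assumption about what was intended, not the original code. It uses a synthetic random walk named `close` so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for cane.Close (an assumption): a random walk on business days.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=250)
close = pd.Series(10 + rng.normal(0, 0.1, size=250).cumsum(), index=dates, name="Close")

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(close.index, close.to_numpy())
ax.set_title("Closing price")
ax.set_xlabel("Date")
ax.set_ylabel("Close")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

With the real data, `cane.Close` would take the place of `close`, assuming the index of the CSV holds the trading dates.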

**Basic Aggregates**

These are typically the maximum, minimum, mean, and standard deviation.  `describe()` also reports the count and the quartiles.

In [13]:
cane.Close.describe()

count    2942.000000
mean       11.338899
std         4.566188
min         4.920000
25%         7.825000
50%         9.700000
75%        14.065000
max        26.309999
Name: Close, dtype: float64

***Transforming the Data***

It is best to create a pipeline for transforming the data.  Each step does one small thing and passes its result along to the next step.  It reminds me of shell scripting with pipes.

In [10]:
def create_pipeline(column_name):
    def filter_out_columns(dataframe):
        return dataframe.filter([column_name])

    return Pipeline([
        ('filtering', FunctionTransformer(filter_out_columns)),
        ('globbing_data', FunctionTransformer(globbing_prices))
    ])

The first step keeps only the closing price of each market day; the second step then transforms that data with the `globbing_prices` function.
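
To see how steps are chained, here is a minimal sketch on a toy frame. The helper names `keep_close` and `to_array` and the sample values are hypothetical; only `Pipeline`, `FunctionTransformer`, and `DataFrame.filter` come from the notebook.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy frame with made-up values, standing in for the stock data.
toy = pd.DataFrame({
    "Open": [1.0, 2.0, 3.0],
    "Close": [1.5, 2.5, 3.5],
    "Volume": [100, 200, 300],
})

def keep_close(frame):
    # DataFrame.filter keeps the listed columns and drops the rest.
    return frame.filter(["Close"])

def to_array(frame):
    # Second step: turn the one-column frame into a NumPy array.
    return frame.to_numpy()

toy_pipeline = Pipeline([
    ("filtering", FunctionTransformer(keep_close)),
    ("to_array", FunctionTransformer(to_array)),
])

result = toy_pipeline.fit_transform(toy)
print(result.shape)
```

Each `FunctionTransformer` wraps an ordinary function, so the output of one step becomes the input of the next, much like a shell pipe.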

In [9]:
def globbing_prices(data: DataFrame, history=1, num_outputs=1):
    x = []
    y = []
    for i in range(history, len(data) - num_outputs):
        x.append(data[i - history:i].values)
        ends = data[i: (i + num_outputs)].values
        y.append(ends[0])

    x_np = np.array(x)
    y_np = np.array(y)
    return x_np.reshape((x_np.shape[0], x_np.shape[1])), y_np.reshape((y_np.shape[0], y_np.shape[1]))

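`globbing_prices` takes the closing prices and splits them into two arrays.  The first array, named `x`, holds the previous `history` closing prices for each sample, which with the default `history=1` is just yesterday's closing price.  The second array, named `y`, holds the closing price of the day that follows.  I played with the number of past days to present to the algorithm via the `history` parameter and found that one day is optimal.  I also added a `num_outputs` parameter, although the function currently uses only the first of those outputs (`ends[0]`) as the target.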
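
A small worked example may make the windowing easier to picture. The function `make_windows` below is a hypothetical, simplified stand-in for `globbing_prices` with `num_outputs=1`; it keeps the same loop bounds, so the last row is never used as a target. The toy prices are made up.

```python
import numpy as np
import pandas as pd

def make_windows(closes: pd.DataFrame, history: int = 1):
    """Hypothetical stand-in: lagged closes as inputs, the following close as target."""
    values = closes.to_numpy().ravel()
    stop = len(values) - 1  # same upper bound as the notebook's loop when num_outputs=1
    x = np.array([values[i - history:i] for i in range(history, stop)])
    y = np.array([[values[i]] for i in range(history, stop)])
    return x, y

toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0]})
x, y = make_windows(toy, history=2)
print(x)          # each row holds two consecutive closes
print(y.ravel())  # the close that follows each row of x
```

With `history=2`, the first sample pairs the closes 10 and 11 with the target 12, and the second pairs 11 and 12 with the target 13.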

In [11]:
def evaluate_models(x, y, model):
    # Hyperparameter grid; GridSearchCV tries every combination.
    params = {
        "max_depth": list(range(7, 15)),
        "criterion": ["squared_error", "friedman_mse", "absolute_error", "poisson"],
        "splitter": ["best", "random"],
        "max_features": ["sqrt", "log2"],
        "min_samples_split": list(range(2, 5)),
    }

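    # cv=3 means 3-fold cross-validation; scoring defaults to the model's own score (R^2 for regressors).
    gs = GridSearchCV(model, params, cv=3)

    gs.fit(x, y)
    # Mean cross-validated score of the best combination, then the combination itself.
    print(gs.best_score_, gs.best_params_)
    return gs.best_estimator_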
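
With `cv=3`, `GridSearchCV` uses an unshuffled 3-fold split for a regressor, so some folds validate on rows that come before the rows used for training. The sketch below swaps in scikit-learn's `TimeSeriesSplit`, which is a different technique from the one used in this notebook, to show what an order-preserving search could look like. The random walk and the tiny grid are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic random walk standing in for closing prices (an assumption).
rng = np.random.default_rng(0)
prices = 10 + rng.normal(0, 0.1, size=300).cumsum()
x = prices[:-1].reshape(-1, 1)  # previous close
y = prices[1:]                  # next close

splitter = TimeSeriesSplit(n_splits=3)
# Check that every fold trains on earlier rows and validates on later ones.
ordered = all(train.max() < test.min() for train, test in splitter.split(x))

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 8]},  # deliberately tiny grid for the sketch
    cv=splitter,
)
search.fit(x, y)
print(ordered, search.best_params_)
```

Passing the splitter through the `cv` argument is the only change needed; the rest of `evaluate_models` could stay as it is.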

In [12]:
cane = pd.read_csv('../data/CANE.csv', index_col=0)
pipeline = create_pipeline("Close")
x, y = pipeline.fit_transform(cane)

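# Chronological split (shuffle=False): the last 20% of rows are held out for final verification.
x_training_data, x_verify, y_training_data, y_verify = train_test_split(x, y, train_size=0.8, shuffle=False)
# The remaining 80% is split again: the first 90% feeds the grid search, the last 10% is a test set.
x_train, x_test, y_train, y_test = train_test_split(x_training_data, y_training_data, train_size=0.9, shuffle=False)

best_model = evaluate_models(x_train, y_train, DecisionTreeRegressor())
# score() returns R^2 for a regressor.
print(best_model.score(x_test, y_test))
print(best_model.score(x_verify, y_verify))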

0.3177104356007454 {'criterion': 'absolute_error', 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_split': 4, 'splitter': 'best'}
-0.04549445437917021
0.9874427336526271
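
The three numbers printed above are all R² scores, since `score` on a regressor and the default scoring of `GridSearchCV` both report R². As a reading aid, the sketch below uses made-up values to show the scale: 1 is a perfect fit, 0 matches always predicting the mean, and negative values are worse than that.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = y_true.copy()                             # exact predictions
mean_guess = np.full_like(y_true, y_true.mean())    # always predict the mean
bad_guess = y_true[::-1]                            # predictions in reverse order

r2_perfect = r2_score(y_true, perfect)
r2_mean = r2_score(y_true, mean_guess)
r2_bad = r2_score(y_true, bad_guess)
print(r2_perfect, r2_mean, r2_bad)
```

Read this way, a score slightly below zero on `x_test` means the model did marginally worse on that slice than a constant prediction of the slice's mean would have.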
