# Stock Prediction with scikit-learn

Stock prediction is one of the fields where machine learning seems to work well, and there is a lot of ongoing research into predicting stock prices more accurately so that smarter investments can increase profits.

We will be using daily data for the Dow Jones Industrial Average (DJI) in this exercise and try to predict the prices for future days. So, let's start and get familiar with our data.

In [1]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

In [3]:
data_raw = pd.read_csv('19880101_20161231.csv', index_col='Date')
data_raw.head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 1988-01-04 | 1952.589966 | 2030.01001 | 1950.76001 | 2015.25 | 2015.25 | 20880000 |
| 1988-01-05 | 2056.370117 | 2075.27002 | 2021.390015 | 2031.5 | 2031.5 | 27200000 |
| 1988-01-06 | 2036.469971 | 2058.189941 | 2012.77002 | 2037.800049 | 2037.800049 | 18800000 |
| 1988-01-07 | 2019.890015 | 2061.51001 | 2004.640015 | 2051.889893 | 2051.889893 | 21370000 |
| 1988-01-08 | 2046.579956 | 2058.689941 | 1898.040039 | 1911.310059 | 1911.310059 | 27440000 |


# Understanding the stock market terms

As the output of the cell above shows, the column names are terms that are routinely recorded in stock market analysis but may not be familiar to everyone.

So, let's first find out what they mean and how they can help us predict prices.

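1. Open: Price at the start of the trading day.
2. High: Highest price of the day.
3. Low: Lowest price of the day.
4. Close: Price at the end of the trading day.
5. Adj Close: Closing price adjusted for corporate actions such as splits and dividends.
6. Volume: Number of shares traded during the day.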


This is the data that is given to us but it is really not enough to get good predictions.

# Feature Generation

So, as it turns out, our data doesn't have enough predictors to train a satisfactory model.

One general way to approach this type of problem is to ask how we would use the data in real life to gain insights. Thinking along these lines, the first thing people look at is trends in the data: the average price over the last week, the last month or the last year can give us some idea of how the next day's price will turn out.

This suggests computing common statistics, such as averages and standard deviations, over different periods of time and using them as features. So, let us try to do that. Next, we will define a function that takes any dataframe with the original columns and returns a dataframe with all of these features.

In [1]:
def generate_features(df):
    
    df_new = pd.DataFrame()
    # 6 original features
    df_new['open'] = df['Open']
    df_new['open_1'] = df['Open'].shift(1)
    df_new['close_1'] = df['Close'].shift(1)
    df_new['high_1'] = df['High'].shift(1)
    df_new['low_1'] = df['Low'].shift(1)
    df_new['volume_1'] = df['Volume'].shift(1)
    # 31 generated features
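    # average price over the last 5, 21 and 252 trading days
    # (roughly a week, a month and a year, hence the _5, _30 and _365 suffixes)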
    df_new['avg_price_5'] = df['Close'].rolling(5).mean().shift(1)
    df_new['avg_price_30'] = df['Close'].rolling(21).mean().shift(1)
    df_new['avg_price_365'] = df['Close'].rolling(252).mean().shift(1)
    df_new['ratio_avg_price_5_30'] = df_new['avg_price_5'] / df_new['avg_price_30']
    df_new['ratio_avg_price_5_365'] = df_new['avg_price_5'] / df_new['avg_price_365']
    df_new['ratio_avg_price_30_365'] = df_new['avg_price_30'] / df_new['avg_price_365']
    # average volume
    df_new['avg_volume_5'] = df['Volume'].rolling(5).mean().shift(1)
    df_new['avg_volume_30'] = df['Volume'].rolling(21).mean().shift(1)
    df_new['avg_volume_365'] = df['Volume'].rolling(252).mean().shift(1)
    df_new['ratio_avg_volume_5_30'] = df_new['avg_volume_5'] / df_new['avg_volume_30']
    df_new['ratio_avg_volume_5_365'] = df_new['avg_volume_5'] / df_new['avg_volume_365']
    df_new['ratio_avg_volume_30_365'] = df_new['avg_volume_30'] / df_new['avg_volume_365']
    # standard deviation of prices
    df_new['std_price_5'] = df['Close'].rolling(5).std().shift(1)
    df_new['std_price_30'] = df['Close'].rolling(21).std().shift(1)
    df_new['std_price_365'] = df['Close'].rolling(252).std().shift(1)
    df_new['ratio_std_price_5_30'] = df_new['std_price_5'] / df_new['std_price_30']
    df_new['ratio_std_price_5_365'] = df_new['std_price_5'] / df_new['std_price_365']
    df_new['ratio_std_price_30_365'] = df_new['std_price_30'] / df_new['std_price_365']
    # standard deviation of volumes
    df_new['std_volume_5'] = df['Volume'].rolling(5).std().shift(1)
    df_new['std_volume_30'] = df['Volume'].rolling(21).std().shift(1)
    df_new['std_volume_365'] = df['Volume'].rolling(252).std().shift(1)
    df_new['ratio_std_volume_5_30'] = df_new['std_volume_5'] / df_new['std_volume_30']
    df_new['ratio_std_volume_5_365'] = df_new['std_volume_5'] / df_new['std_volume_365']
    df_new['ratio_std_volume_30_365'] = df_new['std_volume_30'] / df_new['std_volume_365']
    # return
    df_new['return_1'] = ((df['Close'] - df['Close'].shift(1)) / df['Close'].shift(1)).shift(1)
    df_new['return_5'] = ((df['Close'] - df['Close'].shift(5)) / df['Close'].shift(5)).shift(1)
    df_new['return_30'] = ((df['Close'] - df['Close'].shift(21)) / df['Close'].shift(21)).shift(1)
    df_new['return_365'] = ((df['Close'] - df['Close'].shift(252)) / df['Close'].shift(252)).shift(1)
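    # moving averages of the one-day return; return_1 is already lagged by one day,
    # so the extra shift(1) below lags these features by one more day
    df_new['moving_avg_5'] = df_new['return_1'].rolling(5).mean().shift(1)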
    df_new['moving_avg_30'] = df_new['return_1'].rolling(21).mean().shift(1)
    df_new['moving_avg_365'] = df_new['return_1'].rolling(252).mean().shift(1)
    # the target
    df_new['close'] = df['Close']
    df_new = df_new.dropna(axis=0)
    return df_new
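
The lagging pattern used in `generate_features` can be checked on a small example. The sketch below uses a made-up six-day price series (not the DJI data) and a window of 3 rows chosen only for readability. It compares a rolling mean that includes the current day with the same mean shifted by one row, which is the form the function above relies on so that each feature only sees earlier days.

```python
import pandas as pd

# Toy closing prices (made-up numbers, not DJI data)
close = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
    name="Close",
)

toy = pd.DataFrame({"close": close})
# Rolling mean over 3 rows that includes the current day (would leak the target)
toy["avg_3_leaky"] = close.rolling(3).mean()
# Same rolling mean shifted by one row, so it only uses previous days
toy["avg_3_lagged"] = close.rolling(3).mean().shift(1)
# One-day return, also lagged by one row, written as in generate_features
toy["return_1"] = ((close - close.shift(1)) / close.shift(1)).shift(1)

print(toy)
```

The first rows of the lagged columns are `NaN` because there is not enough history yet, which is why the function ends with `dropna`.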

In [None]:
data_raw = pd.read_csv('19880101_20161231.csv', index_col='Date')
data = generate_features(data_raw)

In [None]:
start_train = '1988-01-01'
end_train = '2015-12-31'

start_test = '2016-01-01'
end_test = '2016-12-31'

In [None]:
data_train = data.loc[start_train:end_train]
X_train = data_train.drop('close', axis=1).values
y_train = data_train['close'].values

print(X_train.shape)
print(y_train.shape)

In [None]:
data_test = data.loc[start_test:end_test]
X_test = data_test.drop('close', axis=1).values
y_test = data_test['close'].values

print(X_test.shape)
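
As a quick illustration of the date-based split, the sketch below slices a made-up four-row frame by label. It assumes the index has been parsed into dates with `pd.to_datetime`, which the notebook itself does not do; the frame and the names `toy_split`, `toy_train` and `toy_test` are hypothetical. Label slices include both endpoints, and the slice bounds do not have to be present in the index.

```python
import pandas as pd

# Toy frame indexed by date (made-up values, not DJI data)
toy_split = pd.DataFrame(
    {"feature": [1.0, 2.0, 3.0, 4.0], "close": [10.0, 11.0, 12.0, 13.0]},
    index=pd.to_datetime(["2015-12-30", "2015-12-31", "2016-01-04", "2016-01-05"]),
)

# Label-based slices include both endpoints
toy_train = toy_split.loc["1988-01-01":"2015-12-31"]
toy_test = toy_split.loc["2016-01-01":"2016-12-31"]

print(len(toy_train), len(toy_test))
# No row should appear in both sets
print(toy_train.index.intersection(toy_test.index).empty)
```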

# Fitting a model with Linear Regression

Now, we have all our new features and we have split our dataset into training and test sets using our date index.

It is time now to try out different algorithms, starting with linear regression fitted by stochastic gradient descent (`SGDRegressor`).

In [None]:
scaler = StandardScaler()

X_scaled_train = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)

param_grid = {
    "alpha": [1e-5, 3e-5, 1e-4],
    "eta0": [0.01, 0.03, 0.1],
}


from sklearn.linear_model import SGDRegressor
# n_iter was removed from SGDRegressor; max_iter is its replacement
lr = SGDRegressor(penalty='l2', max_iter=1000)
grid_search = GridSearchCV(lr, param_grid, cv=5, scoring='r2')
grid_search.fit(X_scaled_train, y_train)

print(grid_search.best_params_)

lr_best = grid_search.best_estimator_

predictions_lr = lr_best.predict(X_scaled_test)

print('MSE: {0:.3f}'.format(mean_squared_error(y_test, predictions_lr)))
print('MAE: {0:.3f}'.format(mean_absolute_error(y_test, predictions_lr)))
print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_lr)))
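
The cell above scales `X_train` once before the grid search, so each cross-validation fold is validated on rows that already influenced the scaler. One possible alternative, sketched here on synthetic data rather than the DJI features, wraps the scaler and the regressor in a `Pipeline` so that scaling is refit inside every fold. The names `X_demo`, `y_demo`, `pipe`, `param_grid_demo` and `search` are hypothetical, and the grid is cut down to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train (not DJI data)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDRegressor(penalty="l2", max_iter=1000, random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_demo = {
    "sgd__alpha": [1e-5, 1e-4],
    "sgd__eta0": [0.01, 0.03],
}

search = GridSearchCV(pipe, param_grid_demo, cv=3, scoring="r2")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With this layout the fitted pipeline can be given unscaled test data directly, for example `search.best_estimator_.predict(X_test)`.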

# Random Forest Regressor

Now that we have the metrics for our linear regression, let us try to do a bit better using a random forest regressor.

In [None]:
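param_grid = {
    'max_depth': [50, 70, 80],
    'min_samples_split': [5, 10],
    # 1.0 uses all features, which is what 'auto' meant for regressors
    # before that option was removed from scikit-learn
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [3, 5]
}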


from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
rf_best = grid_search.best_estimator_

predictions_rf = rf_best.predict(X_test)
print('MSE: {0:.3f}'.format(mean_squared_error(y_test, predictions_rf)))
print('MAE: {0:.3f}'.format(mean_absolute_error(y_test, predictions_rf)))
print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_rf)))
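
The grid searches in this notebook pass an integer to `cv`, which for a regressor splits the rows into consecutive blocks, so some folds are validated on days that come before part of their training data. A different scheme, not the one used above, is scikit-learn's `TimeSeriesSplit`, where every validation block follows its training block. The sketch below only prints the splits for twelve placeholder rows; `X_demo`, `tscv` and `splits` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered placeholder rows (not DJI data)
X_demo = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_demo))

for train_idx, val_idx in splits:
    # Each validation block starts after the last training row
    print("train:", train_idx, "validate:", val_idx)
```

A splitter like this can be passed to a grid search in place of the integer, for example `GridSearchCV(rf, param_grid, cv=tscv, scoring='r2')`.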

# Support Vector Regressor

In [None]:
param_grid = [
    {'kernel': ['linear'], 'C': [100, 300, 500], 'epsilon': [0.00003, 0.0001]},
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [10, 100, 1000], 'epsilon': [0.00003, 0.0001]}
]


from sklearn.svm import SVR

svr = SVR()
grid_search = GridSearchCV(svr, param_grid, cv=2, scoring='r2')
grid_search.fit(X_scaled_train, y_train)

print(grid_search.best_params_)

svr_best = grid_search.best_estimator_

predictions_svr = svr_best.predict(X_scaled_test)

print('MSE: {0:.3f}'.format(mean_squared_error(y_test, predictions_svr)))
print('MAE: {0:.3f}'.format(mean_absolute_error(y_test, predictions_svr)))
print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_svr)))

# Neural Network

In [None]:
param_grid = {
    'hidden_layer_sizes': [(50, 10), (30, 30)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'learning_rate_init': [0.0001, 0.0003, 0.001, 0.01],
    'alpha': [0.00003, 0.0001, 0.0003],
    'batch_size': [30, 50]
}


from sklearn.neural_network import MLPRegressor

nn = MLPRegressor(random_state=42, max_iter=2000)
grid_search = GridSearchCV(nn, param_grid, cv=2, scoring='r2', n_jobs=-1)
grid_search.fit(X_scaled_train, y_train)


print(grid_search.best_params_)

nn_best = grid_search.best_estimator_

predictions_nn = nn_best.predict(X_scaled_test)

print('MSE: {0:.3f}'.format(mean_squared_error(y_test, predictions_nn)))
print('MAE: {0:.3f}'.format(mean_absolute_error(y_test, predictions_nn)))
print('R^2: {0:.3f}'.format(r2_score(y_test, predictions_nn)))
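
The same three metrics are printed for every model, so it can be convenient to collect them in one table. The helper below, `summarize_models`, is a hypothetical function and not part of scikit-learn. It is shown on made-up numbers: a perfect prediction and a prediction that always returns the mean of the targets.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


def summarize_models(y_true, predictions_by_model):
    """Return one row of MSE, MAE and R^2 per model (hypothetical helper)."""
    rows = {}
    for name, y_pred in predictions_by_model.items():
        rows[name] = {
            "MSE": mean_squared_error(y_true, y_pred),
            "MAE": mean_absolute_error(y_true, y_pred),
            "R^2": r2_score(y_true, y_pred),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up targets and predictions for illustration
y_demo = [1.0, 2.0, 3.0, 4.0]
summary = summarize_models(y_demo, {
    "perfect": [1.0, 2.0, 3.0, 4.0],
    "mean_only": [2.5, 2.5, 2.5, 2.5],
})
print(summary)
```

In this notebook the call could look like `summarize_models(y_test, {'SGD': predictions_lr, 'RF': predictions_rf, 'SVR': predictions_svr, 'NN': predictions_nn})`.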