# Nairobi Securities Exchange (NSE) Price Prediction using Linear Regression
This notebook describes the process of predicting the share price of Diamond Trust Bank Kenya (DTK) using linear regression.

The data is scraped from the official [NSE Kenya Website](https://www.nse.co.ke/market-statistics.html) by a simple Python script that runs as a worker on Heroku. The scraper stores the data in a Google Sheets document, and this notebook reads the data from that spreadsheet for the analysis.

First, import the necessary modules:

In [33]:
import csv
import math
import pandas as pd
import numpy as np

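try:
    from sklearn import model_selection as cross_validation # scikit-learn 0.18 and later
except ImportError:
    from sklearn import cross_validation # older scikit-learn releases
from sklearn import preprocessing, svm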
from sklearn.linear_model import LinearRegression
from spreadsheet import GoogleSpreadSheets

### Preparing the data

In [182]:
sheets = GoogleSpreadSheets("NSE Stocks", "DTK")

In [183]:
dtk_data = sheets.get_all_records(head=2)

Let's convert the data into CSV format for easier handling:

In [184]:
keys = dtk_data[0].keys()
with open('dtk_stocks.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(dtk_data)
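The cell above opens the file in `'wb'` mode, which suits Python 2. As a rough sketch of the same step on Python 3, the file would be opened in text mode instead. The helper name `records_to_csv` and the sample records are illustrative assumptions (they only mimic the list of dicts that `get_all_records` appears to return), and an in-memory buffer stands in for the real file:

```python
import csv
import io


def records_to_csv(records, output_file):
    """Write a list of dicts (one per spreadsheet row) to an open text file."""
    fieldnames = list(records[0].keys())
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)


# Hypothetical records; the values are taken from the first two rows shown in this notebook
sample_records = [
    {"company": "Diamond Trust Bank Kenya Ltd Ord 4.00", "volume": 1500, "last_traded_price_ksh": 190},
    {"company": "Diamond Trust Bank Kenya Ltd Ord 4.00", "volume": 24300, "last_traded_price_ksh": 187},
]

# io.StringIO stands in for open('dtk_stocks.csv', 'w', newline='')
buffer = io.StringIO()
records_to_csv(sample_records, buffer)
csv_text = buffer.getvalue()
```

On Python 3, passing `newline=''` to `open` lets the `csv` module control line endings itself.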

In [185]:
dtk_stocks = pd.read_csv('dtk_stocks.csv')
print dtk_stocks.shape
dtk_stocks.head()

(20, 8)


| | company | volume | last_traded_price_ksh | percentage_change | date | prev_price_ksh | low_ksh | high_ksh |
|---|---|---|---|---|---|---|---|---|
| 0 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 1500.0 | 190 | -0.52 | Thu Nov 16 22:58:11 2017 | 191 | 190 | 190 |
| 1 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 24300.0 | 187 | -1.58 | Fri Nov 17 22:38:53 2017 | 190 | 186 | 190 |
| 2 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 32100.0 | 190 | | Mon Nov 20 11:01:00 2017 | 190 | 190 | 190 |
| 3 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 32100.0 | 190 | 1.6 | Tue Nov 21 15:01:15 2017 | 187 | 186 | 191 |
| 4 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 400.0 | 175 | | Mon Oct 23 11:01:00 2017 | 175 | 175 | 175 |


Derive features that may help predict the last traded (closing) price:

In [186]:
# Percent volatility ((high - close)/close) * 100
dtk_stocks['HIGH_LOW_PERCENT'] = ((dtk_stocks['high_ksh'] - dtk_stocks['last_traded_price_ksh'])/dtk_stocks['last_traded_price_ksh'] * 100)

# Daily percent change / Daily move ((new - old)/old)*100
dtk_stocks['DAILY_PERCENT_CHANGE'] = ((dtk_stocks['last_traded_price_ksh'] - dtk_stocks['prev_price_ksh'])/dtk_stocks['prev_price_ksh'] * 100)
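To make the two formulas concrete, here is a small worked example in plain Python using the figures for 27 Oct 2017 from the data (high 179, close 178, previous close 175). The function names are illustrative and do not appear in the notebook:

```python
def percent_above_close(high, close):
    """How far the day's high sits above the close, as a percentage of the close."""
    return (high - close) / close * 100.0


def daily_percent_change(close, prev_close):
    """Move from the previous close to today's close, as a percentage."""
    return (close - prev_close) / prev_close * 100.0


# 27 Oct 2017: high 179, close 178, previous close 175
high_low_percent = percent_above_close(179.0, 178.0)   # about 0.5618
daily_change = daily_percent_change(178.0, 175.0)      # about 1.7143
```

Both results agree with the `HIGH_LOW_PERCENT` and `DAILY_PERCENT_CHANGE` values that pandas computes for that row.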

Order the stock data by date:

In [187]:
dtk_stocks['date'] = pd.to_datetime(dtk_stocks['date'])
dtk_stocks = dtk_stocks.sort_values('date')
dtk_stocks.head()

| | company | volume | last_traded_price_ksh | percentage_change | date | prev_price_ksh | low_ksh | high_ksh | HIGH_LOW_PERCENT | DAILY_PERCENT_CHANGE |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 400.0 | 175 | | 2017-10-23 11:01:00 | 175 | 175 | 175 | 0.0 | 0.0 |
| 5 | Diamond Trust Bank Kenya Ltd Ord 4.00 | | 175 | | 2017-10-24 11:01:00 | 175 | 175 | 175 | 0.0 | 0.0 |
| 6 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 7000.0 | 178 | | 2017-10-27 11:01:00 | 175 | 175 | 179 | 0.561798 | 1.714286 |
| 7 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 11200.0 | 178 | | 2017-10-30 11:01:00 | 178 | 175 | 178 | 0.0 | 0.0 |
| 8 | Diamond Trust Bank Kenya Ltd Ord 4.00 | 2600.0 | 179 | | 2017-10-31 11:01:00 | 178 | 178 | 180 | 0.558659 | 0.561798 |
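The raw `date` strings only sort chronologically once they are parsed into datetime values, which is why the column is converted before sorting. A minimal standard-library sketch of the same idea follows; the format string is an assumption inferred from the sample values, and `pd.to_datetime` in the notebook infers the format on its own:

```python
from datetime import datetime

# Assumed layout of the scraped timestamps, e.g. "Thu Nov 16 22:58:11 2017"
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"

raw_dates = [
    "Thu Nov 16 22:58:11 2017",
    "Fri Nov 17 22:38:53 2017",
    "Mon Oct 23 11:01:00 2017",
]

# Parse the strings, then sort the datetime objects chronologically
parsed = [datetime.strptime(text, DATE_FORMAT) for text in raw_dates]
ordered = sorted(parsed)
```

Sorting the unparsed strings would instead place "Fri" before "Mon" before "Thu", regardless of the actual dates.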


Keep only the columns that will be used as features:

In [188]:
features = ['volume', 'last_traded_price_ksh', 'HIGH_LOW_PERCENT', 'DAILY_PERCENT_CHANGE']
dtk_stocks = dtk_stocks[features]
dtk_stocks.head()

| | volume | last_traded_price_ksh | HIGH_LOW_PERCENT | DAILY_PERCENT_CHANGE |
|---|---|---|---|---|
| 4 | 400.0 | 175 | 0.0 | 0.0 |
| 5 | | 175 | 0.0 | 0.0 |
| 6 | 7000.0 | 178 | 0.561798 | 1.714286 |
| 7 | 11200.0 | 178 | 0.0 | 0.0 |
| 8 | 2600.0 | 179 | 0.558659 | 0.561798 |


In [189]:
forecast_feature = 'last_traded_price_ksh' #  The column to be predicted by the model

# Replace missing values with a sentinel so the rows are kept; the model treats them as outliers
dtk_stocks.fillna(-99999, inplace=True) 

# Forecast horizon: 1 percent of the rows, rounded up (1 day for 20 rows).
# Each row's features are used to predict the price forecast_out days later
forecast_out = int(math.ceil(0.01*len(dtk_stocks))) 
print forecast_out

# The label is the last traded price forecast_out days into the future,
# obtained by shifting the price column up by forecast_out rows
dtk_stocks['label'] = dtk_stocks[forecast_feature].shift(-forecast_out) 
dtk_stocks.dropna(inplace=True)
dtk_stocks.head()

1


| | volume | last_traded_price_ksh | HIGH_LOW_PERCENT | DAILY_PERCENT_CHANGE | label |
|---|---|---|---|---|---|
| 4 | 400.0 | 175 | 0.0 | 0.0 | 175.0 |
| 5 | -99999.0 | 175 | 0.0 | 0.0 | 178.0 |
| 6 | 7000.0 | 178 | 0.561798 | 1.714286 | 178.0 |
| 7 | 11200.0 | 178 | 0.0 | 0.0 | 179.0 |
| 8 | 2600.0 | 179 | 0.558659 | 0.561798 | 180.0 |
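The label column is simply the price column moved up by `forecast_out` rows, so each row is paired with the price that follows it. A plain-Python sketch of that shift, using the first closing prices from the table (the sixth value, 180, is implied by the last label shown), with `shift_labels` as an illustrative name:

```python
def shift_labels(prices, forecast_out):
    """Return, for each position, the price forecast_out steps later (None past the end)."""
    future = prices[forecast_out:]
    return future + [None] * forecast_out


closes = [175, 175, 178, 178, 179, 180]
labels = shift_labels(closes, 1)  # [175, 178, 178, 179, 180, None]

# Pair each day's close with its label and drop days that have no future price,
# which mirrors the dropna() call in the cell above
training_pairs = [(close, label) for close, label in zip(closes, labels) if label is not None]
```

The final day has no future price to learn from, so it falls out of the training data.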


Features will be defined as uppercase X and labels as lowercase y

In [190]:
X = np.array(dtk_stocks.drop(['label'], axis=1)) # Drop the 'label' feature. Returns a new dataframe
y = np.array(dtk_stocks['label'])

Now, scale the features so that each column has zero mean and unit variance:

In [192]:
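X = preprocessing.scale(X)
# dropna() above already removed the rows without a label, so X and y have the
# same length; slicing X again here would leave it one row shorter than y
X_lately = X[-forecast_out:] # Most recent rows, kept 2-D so they can be passed to predict()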

y = np.array(dtk_stocks['label'])
print len(X), len(y) # Check to see if the lengths are equal

print X, y

19 19
[[-0.57732445 -0.24029318 -0.62309248 -0.1102887 ]
 [-1.92893441 -0.24029318 -0.62309248 -0.1102887 ]
 [-0.48847271 -0.08639755  0.51090588 -0.03695477]
 [-0.43193069 -0.08639755 -0.62309248 -0.1102887 ]
 [-0.5477072  -0.035099    0.5045707  -0.08625604]
 [-0.54636096  0.01619954  0.4983059  -0.1102887 ]
 [-0.50597381  0.06749808 -0.62309248 -0.08652307]
 [-0.24480355  0.22139371 -0.62309248 -0.03938572]
 [ 0.93988631  0.27269226  0.46799784 -0.08703971]
 [ 0.38927477  0.3239908  -1.70831673 -0.08716538]
 [-0.13710447  0.37528935  0.45632842 -0.0872897 ]
 [-0.58136316  0.42658789 -0.62309248 -0.08741269]
 [ 1.9966835   0.47788643 -0.62309248 -0.08753437]
 [ 1.9616813  -4.08768395  1.39542461 -2.12470836]
 [-0.29999932  0.58048352  1.49053798  3.78252072]
 [ 1.97110497  0.58048352 -0.62309248 -0.1102887 ]
 [-0.56251582  0.52918498 -0.62309248 -0.13268562]
 [-0.25557345  0.37528935  2.61517024 -0.1778331 ]
 [-0.15056685  0.52918498 -0.62309248 -0.1102887 ]] [ 175.  178.  178.  179.
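As a sketch of what `preprocessing.scale` does to each column: it subtracts the column mean and divides by the column's standard deviation, so the result has zero mean and unit variance rather than a fixed range. The example below uses only the standard library and assumes the population standard deviation (which is my understanding of scikit-learn's default) and a non-constant column. `standardize` is an illustrative name:

```python
import statistics


def standardize(column):
    """Centre a column on zero and scale it to unit (population) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # would be zero for a constant column
    return [(value - mean) / std for value in column]


# First five closing prices from the table above
closes = [175.0, 175.0, 178.0, 178.0, 179.0]
scaled = standardize(closes)
```

Here the scaled values run from roughly -1.2 to 1.2, which shows that standardized values can fall outside the interval from -1 to 1.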

### Training the Model
The data is split into training and testing sets with `train_test_split`, and a linear regression model is fit on the training portion.

In [193]:
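# Hold out 10 percent of the data as testing data
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.1)

# Fit the model. LinearRegression is a regressor, so score() returns the
# coefficient of determination (R^2) on the test set rather than an accuracy
classifier = LinearRegression()
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)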

print accuracy

-172.841353324
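The value printed above is the R^2 score that `LinearRegression.score` reports, computed as 1 - SS_res / SS_tot. It equals 1 for perfect predictions and 0 for a model that always predicts the mean, and it becomes negative when the predictions are worse than that mean baseline. With only about two rows in the test set, SS_tot is tiny, so even moderate errors produce a large negative score. A minimal sketch with hypothetical test values (not the notebook's actual split):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [178.0, 180.0]  # hypothetical two-row test set

perfect = r_squared(y_true, [178.0, 180.0])   # 1.0: every prediction is exact
baseline = r_squared(y_true, [179.0, 179.0])  # 0.0: always predicting the mean
poor = r_squared(y_true, [170.0, 190.0])      # -81.0: far worse than the mean
```

A score such as -172.84 therefore says more about how small the test set is than about the quality of the features.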
