# Part 1 - Regression Intro and Data

* Regression uses continuous data to find the equation that best fits the data, which can then be used to forecast a specific value
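As a minimal sketch of what "finding the equation that best fits the data" means, the snippet below fits a straight line to a few points with NumPy's `polyfit` and then uses the fitted equation to forecast a new value. The data points are made up for illustration and are not taken from the stock dataset.

```python
import numpy as np

# Hypothetical continuous data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Fit a degree-1 polynomial (a straight line); returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted equation to forecast the value at an unseen x
forecast = slope * 5.0 + intercept
print(slope, intercept, forecast)
```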

In [1]:
import pandas as pd
import quandl

In [2]:
df = quandl.get("WIKI/GOOGL")

In [3]:
df.head()

Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
2004-08-19,100.01,104.06,95.96,100.335,44659000.0,0.0,1.0,50.159839,52.191109,48.128568,50.322842,44659000.0
2004-08-20,101.01,109.08,100.5,108.31,22834300.0,0.0,1.0,50.661387,54.708881,50.405597,54.322689,22834300.0
2004-08-23,110.76,113.48,109.05,109.4,18256100.0,0.0,1.0,55.551482,56.915693,54.693835,54.869377,18256100.0
2004-08-24,111.24,111.6,103.57,104.87,15247300.0,0.0,1.0,55.792225,55.972783,51.94535,52.597363,15247300.0
2004-08-25,104.76,108.0,103.88,106.0,9188600.0,0.0,1.0,52.542193,54.167209,52.10083,53.164113,9188600.0


In [5]:
#reduce the number of columns
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

## Improving the quality of the data
* This calculates the high-low spread as a percentage of the low price, which is a crude measure of daily volatility

In [6]:
#transforming the data further
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] * 100.0

The daily percentage change from open to close is calculated here

In [8]:
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
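To check what the two derived columns mean, here is a small sketch that applies the same formulas to a toy dataframe. The two rows are made-up values, not real quotes, and the name `toy` is only used for this illustration.

```python
import pandas as pd

# Two made-up trading days (illustrative values, not real quotes)
toy = pd.DataFrame({
    'Adj. Open':  [100.0, 50.0],
    'Adj. High':  [110.0, 55.0],
    'Adj. Low':   [100.0, 50.0],
    'Adj. Close': [105.0, 49.0],
})

# High-low spread as a percentage of the low
toy['HL_PCT'] = (toy['Adj. High'] - toy['Adj. Low']) / toy['Adj. Low'] * 100.0
# Open-to-close move as a percentage of the open
toy['PCT_change'] = (toy['Adj. Close'] - toy['Adj. Open']) / toy['Adj. Open'] * 100.0

print(toy[['HL_PCT', 'PCT_change']])
```

Both days have a 10% high-low spread, while the open-to-close change is positive on the first day and negative on the second.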

Redefine the dataframe to keep only the columns we will work with

In [11]:
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]

# Part 2 - Regression - Features and Labels

In [12]:
import math
import numpy as np
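#note: sklearn.cross_validation was removed in scikit-learn 0.20; newer versions provide train_test_split in sklearn.model_selection
from sklearn import preprocessing, cross_validation, svm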
from sklearn.linear_model import LinearRegression



* Features are the descriptive attributes and the label is what you are trying to predict/forecast

In [13]:
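#the column we want to forecast; the label is derived from it
forecast_col = 'Adj. Close'
#replace missing values with a sentinel that is treated as an outlier
df.fillna(value=-99999, inplace=True)
#forecast horizon: 1% of the number of rows, rounded up
forecast_out = int(math.ceil(0.01 * len(df)))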

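* The code above sets `forecast_col`, the column whose future values we want to predict
* Any NaN values in the dataframe are then filled in, since many machine learning algorithms cannot work with missing data
* A popular option in machine learning is to replace missing values with -99,999, which is then treated as an outlier
* Rows with missing values can also be dropped, but this may leave a lot of data out
* `forecast_out` is the number of rows (trading days) to forecast ahead, here 1% of the length of the dataframe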
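The two steps above can be sketched on a toy dataframe. The 250 rows and the position of the missing value are made up for illustration; only `fillna` and the `forecast_out` formula come from the cell above.

```python
import math
import numpy as np
import pandas as pd

# Toy frame with 250 rows and one missing value (made-up data)
toy = pd.DataFrame({'Adj. Close': np.arange(250, dtype=float)})
toy.loc[3, 'Adj. Close'] = np.nan

# Replace the missing value with a sentinel instead of dropping the row
toy.fillna(value=-99999, inplace=True)

# Forecast horizon: 1% of the number of rows, rounded up to a whole number
forecast_out = int(math.ceil(0.01 * len(toy)))
print(toy.loc[3, 'Adj. Close'], forecast_out)
```

With 250 rows, 1% is 2.5, which `math.ceil` rounds up to a horizon of 3 rows.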

In [14]:
df['label'] = df[forecast_col].shift(-forecast_out)

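The features are the current values in each row, and the label is the price `forecast_out` days in the future. Assume all current columns are features.

The new `label` column is created with the pandas `shift` operation, which moves `Adj. Close` up by `forecast_out` rows so that each row is paired with its future price.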
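A small sketch of how a negative shift builds the label. The five prices and the horizon of 2 are made up so the effect is easy to see.

```python
import pandas as pd

# Five made-up closing prices and a forecast horizon of 2 rows
toy = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast_out = 2

# shift(-2) moves values up two rows: row i receives the price from row i + 2
toy['label'] = toy['Adj. Close'].shift(-forecast_out)
print(toy)

# The last two rows have no future price, so their label is NaN
n_missing = int(toy['label'].isna().sum())
print(n_missing)
```

The trailing NaN labels are the reason rows are dropped from the dataframe before training.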

# Part 3 - Regression Training and Testing

In [15]:
#drop the rows whose label is NaN after the shift (the last forecast_out rows)
df.dropna(inplace=True)

* We now need to split the data into features and target
* Generally this is achieved by defining the features as X and the label as y

In [17]:
#everything in the df except for the label column
X = np.array(df.drop(['label'], axis=1))
#Just the label column from the dataframe
y = np.array(df['label'])

## Applying some normalization to the data

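* `preprocessing.scale` standardizes each feature to zero mean and unit variance (values are not strictly limited to the range -1 to 1)
* This can speed up training
* It can improve accuracy for scale-sensitive models such as SVR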

In [18]:
X = preprocessing.scale(X)
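To see what `preprocessing.scale` does, the sketch below applies it to a tiny made-up feature matrix and checks the mean and standard deviation of each column afterwards. The matrix `X_toy` is hypothetical.

```python
import numpy as np
from sklearn import preprocessing

# Made-up feature matrix: two columns on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])

# scale() standardizes each column independently
X_scaled = preprocessing.scale(X_toy)

print(X_scaled)
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns hold the same values even though they started on different scales, and the largest value is above 1.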

In [21]:
len(X)

3389

## Splitting into train and test sets

In [23]:
#create test and train sets with features and label
#80:20 split
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size = 0.2)
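In scikit-learn 0.20 and later, `train_test_split` lives in `sklearn.model_selection`. The sketch below shows the same 80:20 split on made-up data. The `random_state=0` argument is an illustrative addition that makes the shuffle repeatable; without it the split, and therefore the scores further down, change on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 80:20 split; random_state is an illustrative choice for repeatability
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```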

In [24]:
#Defining and training a model
#Support Vector Regression
clf = svm.SVR()

#training our model with .fit()
#this essentially fits our training features to our labels
clf.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [25]:
#the model is now trained and can be tested; for a regressor, score() returns R^2 on the test set
confidence = clf.score(X_test, y_test)

In [26]:
print(confidence)

0.7933811475702399


In [27]:
clf = LinearRegression()
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)

In [28]:
print(confidence)

0.9790737437050373
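The value stored in `confidence` is the coefficient of determination (R^2), which is what `score()` returns for scikit-learn regressors. As a rough sketch on made-up data, R^2 can be computed by hand and compared with `score()`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y is roughly 3x + 2 with small fixed deviations
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([2.1, 4.9, 8.2, 10.8, 14.1, 17.0])

model = LinearRegression().fit(X_toy, y_toy)
y_pred = model.predict(X_toy)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_toy - y_pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_toy, y_toy))
```

A score of 1.0 means the predictions match the labels exactly, and a model that always predicts the mean of the labels scores 0.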


## Threading the learning process
* The `n_jobs` parameter sets how many jobs (CPU cores) an algorithm may use in parallel
* `svm.SVR()` does not have it
* `LinearRegression()` does
* `n_jobs=-1` will use all available cores

In [30]:
clf = LinearRegression(n_jobs=-1)

There is a parameter to svm.SVR, for example, called kernel. What in the heck is that? Think of a kernel as a transformation applied to your data. It lets the model compare data points as if they had been mapped into a different, often higher-dimensional, feature space, without ever computing that mapping explicitly. In the case of svm.SVR, the default is rbf, which is a type of kernel. You have a few other choices though. Check the documentation: you have 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. Again, just like the suggestion to try the various ML algorithms that can do what you want, try the kernels. Let's do a few:

In [31]:
for k in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    confidence = clf.score(X_test, y_test)
    print(k, confidence)

linear 0.978391830164321
poly 0.5920516084635513
rbf 0.7933811475702399
sigmoid 0.8958433774784313
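One way to make the kernel idea concrete is to compute one by hand. The sketch below evaluates the RBF kernel, exp(-gamma * squared distance), for two made-up points. The helper name `rbf_kernel_value` and the value `gamma=0.5` are illustrative choices, not something taken from the cells above.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Made-up points; gamma=0.5 is an arbitrary illustrative value
p = [0.0, 0.0]
q = [1.0, 1.0]

same = rbf_kernel_value(p, p, gamma=0.5)   # identical points
apart = rbf_kernel_value(p, q, gamma=0.5)  # squared distance is 2
print(same, apart)
```

Identical points get the maximum similarity of 1, and the value decays towards 0 as the points move apart.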
