DAY 22 -- Mar 18, 2017

In this notebook, we will use a support vector machine (SVM) and cross-validation (CV) to predict salary level on the Kaggle HR dataset, which has been loaded into a PostgreSQL database (see: http://csiu.github.io/blog///update/2017/03/13/day17.html).

### Load data

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

# Load libs
import psycopg2
import pandas.io.sql as pdsql

# Specify our database
dbname="hr"
name_of_table = "survey"

# Connect to database
conn = psycopg2.connect(dbname=dbname)

In [2]:
# What are my columns
query = "SELECT column_name FROM information_schema.columns WHERE table_name='%s';" % name_of_table
colnames = pdsql.read_sql_query(query, conn)

features = colnames.column_name[1:9]
features

1       satisfaction_level
2          last_evaluation
3           number_project
4     average_montly_hours
5       time_spend_company
6            work_accident
7           left_workplace
8    promotion_last_5years
Name: column_name, dtype: object

In [3]:
# Define Inputs
column_name = ', '.join(list(features))
query = "SELECT %s FROM %s;" % (column_name, name_of_table)

X = pdsql.read_sql_query(query, conn)
X.head()

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | work_accident | left_workplace | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | False | True | False |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | False | True | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | False | True | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | False | True | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | False | True | False |


In [4]:
# Define Outputs
query = "SELECT %s FROM %s;" % ("salary", name_of_table)
y = pdsql.read_sql_query(query, conn)
y.head()

| | salary |
|---|---|
| 0 | low |
| 1 | medium |
| 2 | medium |
| 3 | low |
| 4 | low |


### Randomly splitting the data into testing and training sets
Scikit-learn has a function [`train_test_split()`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to randomly split the data into test and training sets.

In [5]:
from sklearn.model_selection import train_test_split

# Use a subset for the demo; training on the full set takes a while
X = X[:250]
y = y[:250]

print("Total set:\t", X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print("Training set:\t", X_train.shape, y_train.shape)
print("Test set:\t", X_test.shape, y_test.shape)

Total set:	 (250, 8) (250, 1)
Training set:	 (150, 8) (150, 1)
Test set:	 (100, 8) (100, 1)
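
To make the roles of `test_size` and `random_state` concrete, here is a minimal sketch on made-up toy data (the arrays `X_toy` and `y_toy` are assumptions for illustration, not the HR data). It shows that `test_size=0.4` holds out 40% of the rows and that repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.4 holds out 40% of the rows; random_state fixes the shuffle
split_a = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print("Training set:", X_tr.shape, "Test set:", X_te.shape)

# Same seed, same data -> identical train/test arrays
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Same split with the same seed:", same_split)
```

Changing or omitting `random_state` would generally give a different partition on each run, which is why the notebook pins it.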


### Using a Support Vector Machine (SVM)
SVMs are another supervised machine learning method. An SVM uses a hyperplane to separate the classes in the data.

As a simple example: imagine a line. Anything on one side of the line is class A and anything on the other side is class B.
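
The line picture can be sketched on a tiny two-dimensional example. The points and the labels "A" and "B" below are made up for illustration; the sketch fits a linear `SVC`, reads the fitted line from `coef_` and `intercept_`, and uses the sign of `decision_function` to tell which side of the line a new point falls on.

```python
import numpy as np
from sklearn import svm

# Two well-separated toy clusters (made-up points, not the HR data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

clf_toy = svm.SVC(kernel="linear", C=1)
clf_toy.fit(X_toy, y_toy)

# For a linear kernel the separating line is w . x + b = 0
w = clf_toy.coef_[0]
b = clf_toy.intercept_[0]
print("w =", w, "b =", b)

# Negative decision values fall on the "A" side, positive on the "B" side
new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
decision = clf_toy.decision_function(new_points)
predicted_toy = clf_toy.predict(new_points)
print(decision)
print(predicted_toy)
```

With more than two features the same idea holds, except that the "line" becomes a hyperplane.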

In [6]:
from sklearn import svm

clf = svm.SVC(kernel='linear', C=1)
clf.fit(X_train, y_train.values.ravel())

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

`y_train.values.ravel()` was used because passing `y_train` directly produced this warning:

> A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
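
The shape difference behind that warning can be seen in a small sketch. The DataFrame `y_demo` is a hypothetical stand-in for `y`: a single-column DataFrame has shape `(n_samples, 1)`, while `ravel()` flattens it to `(n_samples,)`.

```python
import pandas as pd

# Hypothetical single-column DataFrame standing in for `y`
y_demo = pd.DataFrame({"salary": ["low", "medium", "medium", "low", "low"]})

print(y_demo.shape)                  # (5, 1): a column vector
print(y_demo.values.ravel().shape)   # (5,): the 1-D shape the warning asks for
print(y_demo["salary"].shape)        # (5,): selecting the column is also 1-D
```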

In [7]:
# Accuracy
clf.score(X_test, y_test)

0.76000000000000001

In [8]:
from sklearn import metrics
metrics.accuracy_score(y_test, clf.predict(X_test))

0.76000000000000001

### Computing CV metrics

[`cross_val_score()`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) evaluates a score by CV. There are different [scoring/evaluation metrics](http://scikit-learn.org/stable/modules/model_evaluation.html) to choose from.

In [33]:
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)

# 3-fold CV
scores = cross_val_score(clf, X, y.values.ravel(), cv=3)

print("The scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

The scores: [ 0.78571429  0.78571429  0.79268293]
Accuracy: 0.79 (+/- 0.01)
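
When no metric is given, `cross_val_score()` falls back to the estimator's own `score()` method, which for `SVC` is accuracy. As a minimal sketch of switching metrics, the example below passes `scoring="f1_macro"`. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the HR data), so the numbers it prints say nothing about the results above.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for the HR subset (not the real data)
X_syn, y_syn = make_classification(n_samples=150, n_features=8,
                                   n_informative=4, n_classes=3,
                                   random_state=0)

clf_syn = svm.SVC(kernel="linear", C=1)

# Default scoring: the classifier's own score(), i.e. accuracy
acc_scores = cross_val_score(clf_syn, X_syn, y_syn, cv=3)

# Same folds, but scored with the macro-averaged F1 instead
f1_scores = cross_val_score(clf_syn, X_syn, y_syn, cv=3, scoring="f1_macro")

print("Accuracy per fold:", acc_scores)
print("Macro F1 per fold:", f1_scores)
```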


In [34]:
# Obtaining predictions by CV
from sklearn.model_selection import cross_val_predict

predicted = cross_val_predict(clf, X, y.values.ravel(), cv=3)
metrics.accuracy_score(y, predicted)

0.78800000000000003

### k-fold

In [35]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=3)
for train_index, test_index in kf.split(X):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X.iloc[train_index], y.iloc[train_index].values.ravel())
    score = clf.score(X.iloc[test_index], y.iloc[test_index].values.ravel())
    print("The score: %0.2f" % score)

The score: 0.77
The score: 0.78
The score: 0.80
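
These per-fold scores differ slightly from the `cross_val_score()` results. One likely reason is that, for a classifier and an integer `cv`, `cross_val_score()` uses `StratifiedKFold`, whereas `KFold` without shuffling simply cuts the rows into consecutive blocks. The sketch below contrasts the two splitters on made-up labels that are sorted by class (the arrays are assumptions for illustration only).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class (made up): six of class 0, then three of class 1
X_toy = np.zeros((9, 1))
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

splitters = [("KFold", KFold(n_splits=3)),
             ("StratifiedKFold", StratifiedKFold(n_splits=3))]

test_labels = {}
for name, splitter in splitters:
    print(name)
    test_labels[name] = []
    for train_index, test_index in splitter.split(X_toy, y_toy):
        # Record which labels end up in each held-out fold
        test_labels[name].append(y_toy[test_index].tolist())
        print("  test rows:", test_index, "test labels:", y_toy[test_index])
```

Plain `KFold` puts all of class 1 into the last test fold, while `StratifiedKFold` keeps the 2:1 class ratio in every fold. Passing `shuffle=True` to `KFold` is another way to avoid folds that depend on row order.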
