## scikit-learn Overview
Outline:
* Datasets
* Splitting data into test/train/validation sets
* Learning and predicting
* Parameter tuning
* Model persistence

### Loading built-in datasets

In [27]:
# Provides toy datasets
from sklearn import datasets
# Load the iris dataset for classification
iris = datasets.load_iris()
# Load the digits dataset for classification
digits = datasets.load_digits()
# load_boston was removed in scikit-learn 1.2; the California housing
# data (downloaded on first use) is one regression alternative
housing = datasets.fetch_california_housing()
# Load the diabetes dataset for regression
diabetes = datasets.load_diabetes()

### Understanding the dataset
Type the dataset object's name followed by a dot (for example iris.) and press Tab to list its members.

In [11]:
print("iris feature names: {}".format(iris.feature_names))
print("data type: {}".format(type(iris.data)))
print(iris.data[:10])
print(iris.target[:10])

iris feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
data type: <class 'numpy.ndarray'>
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]
[0 0 0 0 0 0 0 0 0 0]
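
The first ten targets printed above are all 0 because the iris samples are stored grouped by class. A minimal sketch of a few more members worth inspecting is shown here; the variable name class_counts is introduced only for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# data is (n_samples, n_features); target holds one integer label per sample
print("data shape:", iris.data.shape)
print("target shape:", iris.target.shape)

# target_names maps the integer labels to species names
print("classes:", list(iris.target_names))

# Number of samples for each class label (illustrative helper variable)
class_counts = np.bincount(iris.target)
print("samples per class:", class_counts)
```

Because the rows are ordered by class, any split of this data should shuffle first, which train_test_split does by default.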


### train_test_split
Split the data into training, validation, and test sets.

In [12]:
from sklearn.model_selection import train_test_split
# Keep only the first two features (sepal length and sepal width)
X = iris.data[:, :2]
y = iris.target
# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
# An integer test_size is an absolute sample count: move 100 of the
# remaining training samples into a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=100, random_state=4)
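
It can help to check how many samples end up in each set after the two splits. The sketch below repeats the same calls and prints the sizes; a float test_size is read as a fraction of the input, while an integer is read as a number of samples.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# First split: 20% of the 150 samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)
# Second split: 100 of the remaining samples go to the validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=100, random_state=4)

print("train:", len(X_train), "valid:", len(X_valid), "test:", len(X_test))
```

With these settings only 20 samples remain for training, so the accuracies in this notebook come from a very small training set.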

### Train an SVM classifier

In [13]:
# Import the classifier
from sklearn import svm
# C is the regularization hyperparameter (larger C means weaker regularization)
clf = svm.SVC(C=10)
# Training a classifier
clf.fit(X_train, y_train)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
# Predict on the validation set to see accuracy
import numpy as np
predictions = clf.predict(X_valid)
print('validation accuracy = {}'.format(np.sum(predictions == y_valid)/len(y_valid)))

validation accuracy = 0.72


In [16]:
predictions = clf.predict(X_test)
print('test accuracy = {}'.format(np.sum(predictions == y_test)/len(y_test)))

test accuracy = 0.6666666666666666
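
The accuracy above is computed by hand as the fraction of matching predictions. As a sketch, the same number can also be obtained from accuracy_score in sklearn.metrics or from the classifier's own score method; the variable names manual, via_metric and via_score are illustrative only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

clf = svm.SVC(C=10)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Three equivalent ways to compute classification accuracy
manual = np.mean(predictions == y_test)            # fraction of correct predictions
via_metric = accuracy_score(y_test, predictions)   # metrics helper
via_score = clf.score(X_test, y_test)              # predicts and scores in one call
print(manual, via_metric, via_score)
```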


### Model Parameter tuning

In [24]:
# How do we select the best value of C?
# Try a range of values and keep the one with the best accuracy on the validation data
best_acc = 0.0
best_C = 0.0
step_size = 5.0
C = 0.1
while C < 100.0:
    clf = svm.SVC(C=C)
    clf.fit(X_train, y_train)
    accuracy = np.sum(clf.predict(X_valid) == y_valid)/len(y_valid)
    print('Accuracy at C = ' + str(C) + ' is ' + str(accuracy))
    if accuracy > best_acc:
        best_acc = accuracy
        best_C = C
    C += step_size
print('Best C = ' + str(best_C) + '. It has an accuracy of ' + str(best_acc))

clf = svm.SVC(C=best_C)
# After tuning, retrain on the training and validation data combined,
# so the final model uses all the non-test data available
X_train_valid = np.concatenate((X_train, X_valid))
y_train_valid = np.concatenate((y_train, y_valid))
clf.fit(X_train_valid, y_train_valid)
predictions = clf.predict(X_test)
print('final test accuracy = {}'.format(np.sum(predictions == y_test)/len(y_test)))

Accuracy at C = 0.1 is 0.51
Accuracy at C = 5.1 is 0.69
Accuracy at C = 10.1 is 0.72
Accuracy at C = 15.1 is 0.65
Accuracy at C = 20.1 is 0.64
Accuracy at C = 25.1 is 0.64
Accuracy at C = 30.1 is 0.65
Accuracy at C = 35.1 is 0.65
Accuracy at C = 40.1 is 0.65
Accuracy at C = 45.1 is 0.64
Accuracy at C = 50.1 is 0.64
Accuracy at C = 55.1 is 0.64
Accuracy at C = 60.1 is 0.65
Accuracy at C = 65.1 is 0.65
Accuracy at C = 70.1 is 0.66
Accuracy at C = 75.1 is 0.66
Accuracy at C = 80.1 is 0.67
Accuracy at C = 85.1 is 0.67
Accuracy at C = 90.1 is 0.67
Accuracy at C = 95.1 is 0.67
Best C = 10.1. It has an accuracy of 0.72
final test accuracy = 0.9
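
The manual loop above tunes C against a single validation set. scikit-learn also provides GridSearchCV, which swaps in a different technique: it scores each candidate with cross-validation on the training data and then refits the best one. The sketch below is one possible setup, not the method used in this notebook; the grid of C values and cv=5 are assumptions chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Candidate values for C (an illustrative grid, not taken from the loop above)
param_grid = {"C": [0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training data replaces the separate validation set
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best C:", search.best_params_["C"])
print("mean cross-validated accuracy:", search.best_score_)
# By default the best model is refit on all of X_train, so it can be scored directly
print("test accuracy:", search.score(X_test, y_test))
```

Cross-validation uses every training sample for both fitting and validation, which matters when the data set is as small as iris.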


### Model persistence

In [25]:
# A scikit-learn model can be saved with pickle, Python’s built-in persistence module
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)  

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [26]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
pred = clf2.predict(X[0:1])
print(pred)

[0]


For scikit-learn models, joblib’s replacements for pickle (joblib.dump and joblib.load) can be the better choice, because they are more efficient on objects that carry large NumPy arrays.

In [None]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
joblib.dump(clf, 'filename.pkl')  # .pkl marks a pickle file

In [None]:
clf = joblib.load('filename.pkl') 
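
To confirm that a saved model behaves like the original, the dump and load steps can be run as a round trip. This is a minimal sketch assuming the standalone joblib package is installed; the file name svc_iris.joblib and the temporary directory are hypothetical choices made so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC().fit(X, y)

# Write the fitted model to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svc_iris.joblib")  # hypothetical file name
    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print("round trip preserves predictions:", same_predictions)
```

Files written by pickle or joblib can run arbitrary code when loaded, so only load model files from sources you trust, and load them with the same scikit-learn version that saved them where possible.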

Other types of models, such as regressors and clustering algorithms, will be discussed later. This module was only meant to give a brief overview of scikit-learn's capabilities.