# Introduction to scikit-learn and Jupyter notebooks

Check your installed packages

In [1]:
import IPython
print('IPython:', IPython.__version__)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

IPython: 6.0.0
numpy: 1.11.1
scipy: 0.19.0


ImportError: No module named 'matplotlib'

Load the Iris dataset and look inside

In [2]:
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.keys())

['target_names', 'data', 'target', 'DESCR', 'feature_names']


**data** and **target** are the most interesting from the machine learning perspective as they contain the data for the predictors and the responses, respectively.  But from the *human* perspective, **feature_names** and **target_names** might be *more* interesting since they give us the *names* of the features and the responses.

In [None]:
print(iris.target_names)
print(iris.feature_names)

**setosa**, **versicolor**, and **virginica** are the possible responses of the dataset, which we call the *classes*.  In other words, given some data, we want to predict what *kind* or *class* of iris it might be.  The *features* or *independent variables* we use to predict are **sepal length**, **sepal width**, **petal length**, and **petal width**.  That is, given these characteristics of an unknown Iris flower, what type of Iris is it likely to be?

If you don't know *anything* about the dataset, it is advisable to read the **DESCR** to get a high-level understanding of the dataset.

In [None]:
print(iris.DESCR)

So now we have explored the different fields available in a typical dataset in scikit-learn.  Let's do some machine learning!  Remember that the *features* or *predictors* are contained in the **data** field and the *responses* or *labels* are contained in the **target** field. Let's explore the target field just a bit:

In [None]:
print(iris.target)

What?!?  What are these 0's, 1's and 2's?  Machine learning algorithms typically don't understand **setosa**, **versicolor**, and **virginica**, so we have to encode each *class* with an integer id (in the case of classification problems).
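As a small sketch of this encoding, the integer ids can be mapped back to species names through **target_names**. The variable names `id_to_name` and `species` are ours, chosen for illustration only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# target holds integer class ids; target_names maps each id to a species name
id_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
print(id_to_name)

# NumPy fancy indexing turns the whole id array into species names
species = iris.target_names[iris.target]
print(species[0], species[50], species[100])  # setosa versicolor virginica
```

Indexing **target_names** with the **target** array works because the ids are simply positions in the list of class names.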

## **Hotdog!** and **Not Hotdog!**

If you watch the show *Silicon Valley*, you may have seen the episode where Jian Yang develops the app *SeeFood*, which is *supposed* to classify food dishes for recipes, but it turns out that *SeeFood* is only a binary classifier! *Silicon Valley* has many references to machine learning. Another example is when Richard develops a neural network to help improve the Pied Piper compression algorithm.

In this case, we have three classes, so we use 0, 1, and 2 to represent the three different classes of Iris species.

Let's eyeball the data/features:

In [4]:
print(iris.data)

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 [ 5.   3.5
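Printing the whole array gets long quickly. A lighter way to eyeball the features, sketched here, is to check the array's shape and print only the first few rows next to the feature names:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# One row per flower, one column per feature
print(iris.data.shape)  # (150, 4)

# Show the column names, then only the first five rows
print(iris.feature_names)
print(iris.data[:5])
```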

Let's eyeball the responses:

In [5]:
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


Do we see a problem here?
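One way to read the question: the responses are sorted by class, so cutting the data in order would put only one class in the test set. The sketch below assumes a naive split at row 112 (the same training size that the default split produces) and counts the classes on each side. The names `naive_train` and `naive_test` are ours.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Naive split without shuffling: the last 38 rows become the test set
naive_train, naive_test = iris.target[:112], iris.target[112:]

# Count how many samples of each class (0, 1, 2) land in each part
train_counts = np.bincount(naive_train, minlength=3)
test_counts = np.bincount(naive_test, minlength=3)
print(train_counts)  # [50 50 12]
print(test_counts)   # [ 0  0 38]
```

A model evaluated on such a test set would only ever be scored on class 2, which is why the data should be shuffled before it is split.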


In [7]:
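# train_test_split lives in sklearn.model_selection (scikit-learn 0.18 and later);
# the old sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(iris.data, iris.target, random_state=5)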
print(Xtrain.shape, Xtest.shape)

((112, 4), (38, 4))
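To check that the shuffled split really contains all three classes, we can count the labels in each part. The sketch below also passes `stratify=iris.target`, an option the cell above does not use, to keep the class proportions roughly equal in both parts. It imports from `sklearn.model_selection`, which assumes scikit-learn 0.18 or later.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify keeps the class proportions roughly equal in both parts
Xtrain, Xtest, ytrain, ytest = train_test_split(
    iris.data, iris.target, random_state=5, stratify=iris.target
)

print(Xtrain.shape, Xtest.shape)
print(np.bincount(ytrain))  # samples per class in the training set
print(np.bincount(ytest))   # samples per class in the test set
```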


Let's fit a logistic regression model.

In [11]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(Xtrain, ytrain)
ypred = lr.predict(Xtest)

In [12]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)

0.92105263157894735
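Accuracy is just the fraction of test samples that were predicted correctly. As a sketch, here is the same computation by hand. The `accuracy` helper and the toy labels are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true == y_pred)

# Toy labels (made up for illustration): 3 of 4 predictions are correct
print(accuracy([0, 1, 2, 2], [0, 1, 2, 1]))  # 0.75

# 35 correct out of 38 test samples reproduces the score above
print(35 / 38)
```

With only 38 test samples, each misclassified flower moves the score by about 2.6 percentage points.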

In [15]:
from sklearn.svm import SVC

# create the model
mySVC = SVC()


In [20]:
# fit the model to data
mySVC.fit(Xtrain, ytrain)
ypred = mySVC.predict(Xtest)

accuracy_score(ytest, ypred)

0.97368421052631582
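The two scores differ by only two test samples (35 versus 37 correct out of 38), so a single split is a noisy basis for comparing models. One possible follow-up, sketched here and not part of the cells above, is 5-fold cross-validation with `cross_val_score`. Both `max_iter=1000` (set to avoid convergence warnings in recent scikit-learn versions) and `cv=5` are our assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()

# The two models compared above; max_iter is raised as an assumption
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}

mean_scores = {}
for name, model in models.items():
    # Each of the 5 folds serves once as the test set
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[name] = scores.mean()
    print(name, scores.mean())
```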