# Introduction to scikit-learn and Jupyter notebooks

Check your installed packages

In [None]:
import IPython
print('IPython:', IPython.__version__)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

Load the Iris dataset and look inside

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.keys())

**data** and **target** are the most interesting from the machine learning perspective as they contain the data for the predictors and the responses, respectively.  But from the *human* perspective, **feature_names** and **target_names** might be *more* interesting since they give us the *names* of the features and the responses.

In [None]:
print(iris.target_names)
print(iris.feature_names)

**setosa**, **versicolor**, and **virginica** are the possible responses of the dataset, which we call the *classes*.  In other words, given some data, we want to predict what *kind* or *class* of iris it might be.  The *features* or *independent variables* we are using to predict are **sepal length**, **sepal width**, **petal length**, and **petal width**.  So, given these characteristics of an unknown Iris flower, what type of Iris is it likely to be?
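To make the idea of *features* concrete, here is a minimal sketch that pairs each feature name with the measurement for the first flower in the dataset. The variable name `first_sample` is just an illustration and is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Pair each feature name with the corresponding measurement of the first flower
first_sample = dict(zip(iris.feature_names, iris.data[0]))

for name, value in first_sample.items():
    print(f"{name}: {value}")
```

Each row of **data** can be read this way: one flower, four measurements.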

If you don't know *anything* about the dataset, it is advisable to read the **DESCR** to get a high-level understanding of the dataset.

In [None]:
print(iris.DESCR)

So now we have explored the different fields available in a typical dataset in scikit-learn.  Let's do some machine learning!  Remember that the *features* or *predictors* are contained in the **data** field and the *responses* or *labels* are contained in the **target** field. Let's explore the target field just a bit:

In [None]:
print(iris.target)

What?!?  What are these 0's, 1's and 2's?  Machine learning algorithms typically don't understand **setosa**, **versicolor**, and **virginica**, so we have to encode each *class* with an integer id (in the case of classification problems).
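If you want the names back, a small sketch like the following can translate the integer ids into species names. It relies on **target_names** being a NumPy array, so it can be indexed with the whole **target** array at once; the variable name `species` is only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Indexing the array of names with the integer labels translates
# every id back to its species name in one step
species = iris.target_names[iris.target]

print(species[:3])    # names for the first three flowers
print(species[-3:])   # names for the last three flowers

# Show which integer id stands for which class name
for class_id, name in enumerate(iris.target_names):
    print(class_id, "->", name)
```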

## **Hotdog!** and **Not Hotdog!**

If you watch the show *Silicon Valley*, you may have seen the episode where Jian Yang develops the app *SeeFood*, which is *supposed* to classify food dishes for recipes, but it turns out that *SeeFood* is only a binary classifier! *Silicon Valley* has many references to machine learning. Another example is when Richard develops a neural network to help improve the Pied Piper compression algorithm.

In this case, we have three classes, so we use 0, 1, and 2 to represent the three different classes of Iris species.
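As a quick check, the sketch below counts how many samples carry each integer label. `np.bincount` is a NumPy function that counts occurrences of each non-negative integer; the variable name `counts` is just for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many flowers belong to each class id (0, 1, 2)
counts = np.bincount(iris.target)

for name, count in zip(iris.target_names, counts):
    print(name, count)
```

The Iris dataset is balanced: each of the three classes has 50 samples.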

Let's eyeball the data/features:

In [None]:
print(iris.data)
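The full printout is long, so it can help to look at the array's shape and a few rows instead. A minimal sketch:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Rows are flowers (samples), columns are features
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

# Peek at the first five rows only
print(iris.data[:5])
```

scikit-learn estimators expect the features in this layout: a 2D array with one row per sample and one column per feature.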

Let's eyeball the responses:

In [None]:
print(iris.target)

Do we see a problem here?  Let's play with logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()


Before we fit a model, let's split our data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(iris.data, iris.target, random_state=5)
print(Xtrain.shape, Xtest.shape)
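By default, `train_test_split` shuffles the rows and holds out 25% of them. The sketch below repeats the split and then shows one optional variation, passing `stratify=iris.target` so that the class proportions are preserved in both parts. The stratified call is an addition for illustration and is not used in the rest of this notebook; the names ending in `_s` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Default behaviour: shuffle, then hold out 25% of the samples
Xtrain, Xtest, ytrain, ytest = train_test_split(
    iris.data, iris.target, random_state=5
)
print(Xtrain.shape, Xtest.shape)  # (112, 4) (38, 4)

# Optional variation: keep the class balance the same in both parts
Xtrain_s, Xtest_s, ytrain_s, ytest_s = train_test_split(
    iris.data, iris.target, random_state=5, stratify=iris.target
)

# Compare the per-class counts in the two held-out sets
print("plain split:     ", np.bincount(ytest))
print("stratified split:", np.bincount(ytest_s))
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same split.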

Now we'll fit the logistic regression model.

In [None]:

model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

To determine the performance of the model, we'll use **accuracy_score**.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)
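Accuracy is simply the fraction of test samples whose predicted label matches the true label. The sketch below computes it by hand and compares it with `accuracy_score`. It refits a model so that it can run on its own; `max_iter=1000` is an assumption added here to give the solver room to converge, not a setting from the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    iris.data, iris.target, random_state=5
)

# max_iter=1000 is an assumption for this sketch (the default is 100)
model = LogisticRegression(max_iter=1000)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

# Fraction of predictions that equal the true labels
manual_accuracy = np.mean(ypred == ytest)
library_accuracy = accuracy_score(ytest, ypred)

print(manual_accuracy, library_accuracy)
```

Both numbers should agree, since they measure the same thing.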

Let's play with another machine learning algorithm, a support vector machine classifier (SVC).

In [None]:
from sklearn.svm import SVC

# create the model
model = SVC()


In [None]:
# fit the model to data
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

accuracy_score(ytest, ypred)
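Because every scikit-learn classifier exposes the same `fit` and `predict` methods, the two models can be compared in a single loop. This is a rough sketch: the `models` and `scores` dictionaries are illustrative names, and `max_iter=1000` for logistic regression is an assumption added here rather than a setting used earlier.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    iris.data, iris.target, random_state=5
)

# Candidate models; both share the same fit/predict interface
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),  # assumed setting
    "SVC": SVC(),
}

scores = {}
for name, candidate in models.items():
    candidate.fit(Xtrain, ytrain)                  # train on the training set
    predictions = candidate.predict(Xtest)         # predict on held-out data
    scores[name] = accuracy_score(ytest, predictions)
    print(f"{name}: {scores[name]:.3f}")
```

With only 38 test samples, a difference of one or two correct predictions moves the accuracy noticeably, so small gaps between models should not be over-interpreted.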