## Lab 3:  Supervised Learning using Sklearn


#### CSC 180 Intelligent Systems

#### California State University, Sacramento





![scikit-learn logo](images/sklearn_algorithms.png)

## Motivation problem:  Iris flower dataset

#### 150 **observations**, where each observation is one of three types of iris (Setosa, Versicolour, and Virginica)
#### 4 **features** (sepal length, sepal width, petal length, petal width)
#### **Label** variable is the iris species

![Iris](images/sklearn_iris.png)

![Iris](images/sklearn_iris_data.png)

## Machine learning terminology

- Each row is an **observation** (also known as: object, sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input)

- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)



### This is a **classification** problem since the label is discrete!

## Suppose we want to use K-nearest neighbors (KNN)

#### 1. Pick a value for K.
#### 2. For any ***unknown iris***, find the K records in our data that are "nearest" to that iris.
#### 3. Majority vote.  Use the ***most popular species*** among those K records as the predicted species for the unknown iris.


![knn](https://miro.medium.com/max/1182/0*sYMSaIon56Qng2hF.png)
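
The three steps above can be sketched from scratch in a few lines of NumPy. This is only an illustration, not how scikit-learn implements KNN: the function name `knn_predict`, the toy data, and the choice of Euclidean distance as the meaning of "nearest" are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of one unknown sample by majority vote of its k nearest records."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 2: Euclidean distance from the unknown sample to every training record
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the labels of those k records
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data (made up for illustration): two features, two classes
X_toy = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # the three nearest records are all class 0
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))  # the three nearest records are all class 1
```

Step 1 (picking K) is the `k` argument. In this sketch a tied vote is resolved by whichever label was counted first, which is an arbitrary choice.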

## Prepare the training data in terms of X (feature matrix) and y (label/target vector)
X = input;  y = output

### Scikit-learn comes with a few datasets

In [1]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

iris = load_iris()

In [2]:
# store all the input features in a matrix "X"
X = iris.data
# print the shape of X
print(X.shape)
print(type(X))

(150, 4)
<class 'numpy.ndarray'>
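
To see which feature each of the 4 columns holds, the loaded dataset also carries a `feature_names` list. A small sketch, assuming the same `load_iris()` object as above:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# column names of X, in the same order as the columns of iris.data
print(iris.feature_names)

# pair each feature name with its value in the first observation (row 0)
for name, value in zip(iris.feature_names, iris.data[0]):
    print(f"{name}: {value}")
```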


In [3]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       ...
(output truncated; the full array has 150 rows)

In [4]:
# store response/label vector in "y"
y = iris.target
# print the shape of y
print(y.shape)

(150,)


In [5]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### y is already label-encoded:  0 for 'setosa', 1 for 'versicolor', 2 for 'virginica'

In [7]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
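
Since `target_names` is a NumPy array, the integer codes in `y` can be used directly as indices to recover the species names. A brief sketch (the variable names `species`, `codes`, and `counts` are chosen here for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
y = iris.target

# NumPy fancy indexing: each integer code selects the matching entry of target_names
species = iris.target_names[y]
print(species[:3])
print(species[-3:])

# count how many observations belong to each class
codes, counts = np.unique(y, return_counts=True)
for code, count in zip(codes, counts):
    print(code, iris.target_names[code], count)
```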

## Scikit-learn 4-step modeling pattern (ICFP: Import, Create, Fit, Predict)

## **Step 1:** Import the ***machine learning model*** you plan to use.

#### Check official API documentation here

* https://scikit-learn.org/stable/modules/classes.html
* https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [8]:
from sklearn.neighbors import KNeighborsClassifier

## **Step 2:** Create an instance of the model with the parameters specified

In [9]:
knn = KNeighborsClassifier(n_neighbors=3)

# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

- Name of the object does not matter
- All parameters not specified are set to their defaults
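
One way to check which defaults were filled in is to ask the estimator for its parameters with `get_params()`. A minimal sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

# get_params() returns a dict of every constructor parameter and its current value
params = knn.get_params()
for name in sorted(params):
    print(name, "=", params[name])
```

Only `n_neighbors` was passed explicitly here; every other entry in the printout is a default.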

## **Step 3:** Fit the model with data (aka "model training")

- The model learns the relationship between ***X (input matrix)*** and ***y (target vector)***

### Note that X must be a 2-D matrix and y must be a 1-D array

In [10]:
knn.fit(X, y)     # X must be a 2-D matrix, y must be a 1-D array
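
The shape requirement can be checked before fitting, and the fitted object can be inspected afterwards. The sketch below assumes a reasonably recent scikit-learn release, since `n_features_in_` is not available in old versions:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# shapes that fit() expects: 2-D features, 1-D labels, one label per row
assert X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]

knn = KNeighborsClassifier(n_neighbors=3)
fitted = knn.fit(X, y)

print(fitted is knn)        # fit() returns the estimator itself
print(knn.classes_)         # class labels seen during fitting
print(knn.n_features_in_)   # number of feature columns seen during fitting
```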

## **Step 4:** Predict the target for a new record

- predict() takes a matrix (2D) as input, one row per observation

In [11]:
knn.predict([[3, 5, 4, 2]])            # feature order: (sepal length, sepal width, petal length, petal width)
                                       # class codes 0, 1, 2 = ('setosa', 'versicolor', 'virginica')

array([1])

### Class 1 means it is versicolor!

- Returns a NumPy array
- Can predict for multiple observations at once
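
Because `predict()` expects a 2-D input, a single observation stored as a 1-D array has to be turned into a one-row matrix first. A short sketch of one way to do that with `reshape(1, -1)`; the names `sample` and `sample_2d` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

sample = np.array([3, 5, 4, 2])      # shape (4,): one observation as a 1-D array
sample_2d = sample.reshape(1, -1)    # shape (1, 4): one row, four feature columns

pred = knn.predict(sample_2d)
print(sample.shape, sample_2d.shape, pred.shape)
```

The result is still an array, with one entry per input row, even when only one row is passed.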

In [12]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

In [13]:
knn.predict(X_new)

array([1, 1])
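
The integer predictions can be mapped back to species names, and `predict_proba()` shows how the vote was split. With the default uniform weights, each probability should correspond to the fraction of the K neighbors that belong to that class. A sketch under those assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

pred = knn.predict(X_new)
print(iris.target_names[pred])      # species name for each row of X_new

proba = knn.predict_proba(X_new)    # one row per observation, one column per class
print(proba)
```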

## Using a different value for K

In [14]:
# instantiate the model (using the value K=10)
knn = KNeighborsClassifier(n_neighbors=10)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

array([1, 1])
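
To compare several values of K in one go, the same create/fit/predict steps can be wrapped in a loop. The particular K values below are arbitrary picks for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

results = {}
for k in (1, 3, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k)   # create
    knn.fit(X, y)                               # fit
    results[k] = knn.predict(X_new)             # predict
    print(k, iris.target_names[results[k]])
```

The predicted species for the same observation can change with K, because a different set of neighbors takes part in the vote.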

## Using a different classification model

In [15]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model
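# note: the multi_class argument is deprecated as of scikit-learn 1.5,
# so this line may warn or fail on newer releases
logreg = LogisticRegression(solver='lbfgs', multi_class='ovr')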

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)

array([0, 0])
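
Because every scikit-learn classifier exposes the same `fit()` / `predict()` interface, swapping models only changes the import and create steps. The sketch below loops over two models. Note that its logistic regression uses the library defaults with a raised `max_iter` (an assumption made here so the solver has room to converge) rather than the `multi_class='ovr'` setting above, so its predictions may differ from the output shown.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

models = {
    "knn (K=3)": KNeighborsClassifier(n_neighbors=3),
    # defaults apart from max_iter, which is raised for this illustration
    "logistic regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                             # same call for every model
    predictions[name] = model.predict(X_new)    # same call for every model
    print(name, iris.target_names[predictions[name]])
```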