# Getting started in scikit-learn with the famous iris dataset

## Agenda

- What is the famous iris dataset, and how does it relate to machine learning?
- How do we load the iris dataset into scikit-learn?
- How do we describe a dataset using machine learning terminology?
- What are scikit-learn's four key requirements for working with data?

## Introducing the iris dataset

![Iris](images/03_iris.png)

- 50 samples from each of 3 iris species (150 samples total)
- Measurements: sepal length, sepal width, petal length, petal width

## Machine learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- Famous dataset for machine learning because prediction is **easy**
- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)
    
## Loading the iris dataset into scikit-learn

In [14]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [13]:
# save "bunch" object containing iris dataset and its attributes


In [14]:
# print the iris data

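The loading steps outlined above could look roughly like the following. This is a minimal sketch, not necessarily the code originally intended for these cells; the variable name `iris` is an assumption.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch: a dictionary-like object with attribute access
iris = load_iris()
print(type(iris))

# iris.data holds the measurements: one row per flower, one column per measurement
# (only the first five rows are printed to keep the output short)
print(iris.data[:5])
```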

## Machine learning terminology

- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)

In [15]:
# print the names of the four features


In [16]:
# print integers representing the species of each observation


In [17]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica

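One way to inspect the features and the response described above, sketched under the assumption that the dataset is stored in a variable named `iris`:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# names of the four feature columns
print(iris.feature_names)

# integer species code for each observation
print(iris.target)

# species name for each code: the position in this array is the integer code
print(iris.target_names)
```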

- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- **Classification** is supervised learning in which the response is categorical
- **Regression** is supervised learning in which the response is ordered and continuous

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**

In [18]:
# check the types of the features and response


In [19]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)


In [20]:
# check the shape of the response (single dimension matching the number of observations)


In [21]:
# store feature matrix in "X"


# store response vector in "y"

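The four requirements can be checked directly on the iris data. The sketch below is one possible way to do it; the names `X` and `y` follow the comments above, while `iris` is an assumed variable name.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# features and response are separate objects, and both are NumPy arrays
print(type(iris.data))
print(type(iris.target))

# features: 2-D array (observations x features)
print(iris.data.shape)

# response: 1-D array with one entry per observation
print(iris.target.shape)

# conventional names: capital X for the feature matrix, lowercase y for the response vector
X = iris.data
y = iris.target
```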

## Agenda

- What is the **K-nearest neighbors** classification model?
- What are the four steps for **model training and prediction** in scikit-learn?
- How can I apply this pattern to **other machine learning models**?

## K-nearest neighbors (KNN) classification

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.

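To make the three steps concrete, here is a small from-scratch sketch using NumPy. It is for illustration only and is not how scikit-learn implements KNN. The helper `knn_predict` is hypothetical, Euclidean distance is an assumed choice of "nearest", and the measurements in `unknown` are made up.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of one observation by majority vote of its k nearest neighbors."""
    # Step 2: Euclidean distance from x_new to every training observation (assumed metric)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k training observations with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: most common response value among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


iris = load_iris()
unknown = np.array([3, 5, 4, 2])  # made-up measurements for an unknown iris

# Step 1: pick a value for K
prediction = knn_predict(iris.data, iris.target, unknown, k=5)
print(iris.target_names[prediction])
```
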
### Example training data

![Training data](images/04_knn_dataset.png)

### KNN classification map (K=5)

![5NN classification map](images/04_5nn_map.png)

## scikit-learn 4-step modeling pattern

**Step 1:** Import the class you plan to use

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in-place

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

- Returns a NumPy array
- Can predict for multiple observations at once

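Applied to KNN, the four steps might look like the sketch below. The choice of `n_neighbors=1`, the object name `knn`, and the measurements in `X_new` are assumptions made for illustration.

```python
# Step 1: import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Step 2: instantiate the estimator; unspecified parameters keep their defaults
knn = KNeighborsClassifier(n_neighbors=1)

# Step 3: fit the model with data (the knn object itself is updated)
knn.fit(X, y)

# Step 4: predict the response for new observations
# the input is 2-D: one inner list per observation
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
predictions = knn.predict(X_new)
print(predictions)
```
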
## Using a different value for K

In [22]:
# instantiate the model (using the value K=5)


# fit the model with data


# predict the response for new observations

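A possible version of this cell is sketched below. Only the `n_neighbors` argument changes relative to a K=1 model; the new observations in `X_new` are made-up measurements.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
predictions = knn.predict(X_new)
print(iris.target_names[predictions])
```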

## Using a different classification model

In [23]:
# import the class


# instantiate the model (using the default parameters)


# fit the model with data


# predict the response for new observations

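The same four-step pattern carries over to other classifiers. The sketch below uses `LogisticRegression` as one example of a different model; that choice is an assumption, since the cell above does not name a class. It also departs from the pure defaults by raising `max_iter`, which is intended only to avoid a possible convergence warning on this data.

```python
# import the class
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# instantiate the model (defaults, except for a larger iteration limit)
logreg = LogisticRegression(max_iter=1000)

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations (made-up measurements)
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
predictions = logreg.predict(X_new)
print(iris.target_names[predictions])
```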