# Getting started in scikit-learn with the famous iris dataset
*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*

## Agenda

- What is the famous iris dataset, and how does it relate to machine learning?
- How do we load the iris dataset into scikit-learn?
- How do we describe a dataset using machine learning terminology?
- What are scikit-learn's four key requirements for working with data?

## Introducing the iris dataset

![Iris](images/03_iris.png)

- 50 samples of 3 different species of iris (150 samples total)
- Measurements: sepal length, sepal width, petal length, petal width

In [1]:
from IPython.display import IFrame
IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=200)

## Machine learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- Famous dataset for machine learning because prediction is **easy**
- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

## Loading the iris dataset into scikit-learn

In [4]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

**Note:** The convention in scikit-learn is to import individual modules, classes or functions rather than importing scikit-learn as a whole

In [None]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)

**Note:** A bunch is a special scikit-learn object type for storing datasets and their attributes
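As a minimal sketch of what "storing datasets and their attributes" means in practice, the bunch can be treated like a dictionary: its contents are reachable either by key or by attribute. The variable name `same_object` is only for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# a bunch behaves like a dictionary, so its keys can be listed
print(sorted(iris.keys()))

# the same array is reachable by key or by attribute
same_object = iris['data'] is iris.data
print(same_object)
```

Attribute access (`iris.data`) is the style used in the rest of this notebook.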

In [None]:
# print the iris data
print(iris.data)

**Note:** Each row represents one flower and the four columns represent the four measurements
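To make the row and column layout concrete, here is a short sketch that slices the array both ways. The names `first_rows` and `first_column` are illustrative, not part of scikit-learn.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# first five rows: five flowers, four measurements each
first_rows = iris.data[:5]
print(first_rows)

# first column: one measurement for every flower in the dataset
first_column = iris.data[:, 0]
print(first_column.shape)
```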

## Machine learning terminology

- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)

In [None]:
# print the names of the four features
print(iris.feature_names)

You can think of these like column headers for the data
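One way to see the "column header" idea is to pair each feature name with the corresponding value from a single row. This is a small sketch; the dictionary `first_flower` is an illustrative name.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# pair each feature name with the first flower's measurement
first_flower = dict(zip(iris.feature_names, iris.data[0]))

for name, value in first_flower.items():
    print(f"{name}: {value}")
```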

In [None]:
# print integers representing the species of each observation
print(iris.target)

The target represents what we are going to predict

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
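Because the integer codes are positions in `iris.target_names`, they can be used directly as indices to recover the species names. The sketch below also counts the observations per code; `species` and `counts` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# use the integer codes as indices into the array of species names
species = iris.target_names[iris.target]
print(species[:3])

# count how many observations carry each integer code
counts = np.bincount(iris.target)
print(counts)
```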

- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- **Classification** is supervised learning in which the response is categorical
    - Meaning that its values are in a finite unordered set
    - Predicting the species of an iris, or whether an email is "ham" or "spam", is a classification problem
- **Regression** is supervised learning in which the response is ordered and continuous
    - Such as the price of a house or the height of a person
- Looking at the iris response values alone (the 0's, 1's and 2's), you cannot tell whether this is a Classification or a Regression problem. As a Machine Learning practitioner, you have to understand how your data is encoded and decide whether your response variable is best suited for Regression or Classification. In this case, we know that the numbers 0, 1 and 2 represent unordered categories, so we know to use Classification, and not Regression, techniques to solve this problem
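A quick inspection of the response can support this reasoning, although it cannot replace knowing how the data was encoded. The sketch below lists the distinct response values and checks that they are integers; `codes` and `is_integer` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# distinct values that appear in the response
codes = np.unique(iris.target)
print(codes)

# integer codes with few distinct values hint at categories,
# but only knowledge of the encoding confirms they are unordered
is_integer = np.issubdtype(iris.target.dtype, np.integer)
print(is_integer)
```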

## Requirements for working with data in scikit-learn

1. Features and response should be passed into the ML model as **separate objects**
    - iris.data and iris.target fulfill this requirement as they are stored separately
2. Features and response objects should only be **numeric**
    - This is why iris.target is stored as 0's, 1's and 2's instead of the strings setosa, versicolor and virginica
    - This condition must always be met for both Regression and Classification problems
3. Features and response should be stored as **NumPy arrays**
    - NumPy is a library for scientific computing that implements a homogeneous multi-dimensional array, known as an ndarray, that has been optimised for fast computation
    - See below, iris.target and iris.data are already ndarrays
4. Features and response should have **specific shapes**
    - Specifically, the feature object should have two dimensions in which the first dimension, represented by rows, is the number of observations and the second dimension, represented by columns, is the number of features
    - All NumPy arrays have a shape attribute, so we can verify that the shape of iris.data is 150 x 4
    - The response object is expected to have a single dimension, and that dimension should have the same magnitude as the first dimension of the feature object, i.e. there should be one response corresponding to each observation
    - We can see below that the shape of iris.target is indeed 150
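If a response arrives as strings rather than numbers, one possible way to satisfy the numeric requirement is scikit-learn's `LabelEncoder`. This is only a sketch: the `labels` array is made up for illustration, and the source does not say how iris.target itself was encoded.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# hypothetical string labels, made up for illustration
labels = np.array(['setosa', 'virginica', 'versicolor', 'setosa'])

# learn the distinct classes and replace each label with an integer code
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoded)
print(encoder.classes_)
```

`LabelEncoder` assigns codes in sorted order of the class names, which for these three species happens to match the 0, 1, 2 scheme shown earlier.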

In [None]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))

In [None]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

In [None]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)

In [None]:
# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target
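The requirements from this section can be collected into a single check. The helper `meets_requirements` below is hypothetical (it is not a scikit-learn function) and is only a sketch of how the checks could be expressed for a feature matrix and response vector passed as two separate arguments.

```python
import numpy as np
from sklearn.datasets import load_iris


def meets_requirements(X, y):
    """Hypothetical helper: check the type, dtype and shape requirements."""
    return (
        # both objects are NumPy arrays
        isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
        # both objects hold numeric values
        and np.issubdtype(X.dtype, np.number)
        and np.issubdtype(y.dtype, np.number)
        # features are two-dimensional, response is one-dimensional
        and X.ndim == 2
        and y.ndim == 1
        # one response per observation
        and X.shape[0] == y.shape[0]
    )


iris = load_iris()
X = iris.data
y = iris.target

ok = meets_requirements(X, y)
print(ok)
```

Passing a response with the wrong length, for example `meets_requirements(X, y[:10])`, makes the check return `False`.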

## Resources

- scikit-learn documentation: [Dataset loading utilities](http://scikit-learn.org/stable/datasets/)
- Jake VanderPlas: Fast Numerical Computing with NumPy ([slides](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015), [video](https://www.youtube.com/watch?v=EEUXKG97YRw))
- Scott Shell: [An Introduction to NumPy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)

## Comments or Questions?

- Email: <kevin@dataschool.io>
- Website: http://dataschool.io
- Twitter: [@justmarkham](https://twitter.com/justmarkham)

In [None]:
from IPython.display import HTML

def css_styling():
    # read the custom stylesheet and close the file when done
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)

css_styling()