# Getting started in scikit-learn with the famous iris dataset 

Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).

**Note:** This notebook uses Python 3.9.1 and scikit-learn 0.23.2. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16.

## Agenda

- What is the famous iris dataset, and how does it relate to Machine Learning?
- How do we load the iris dataset into scikit-learn?
- How do we describe a dataset using Machine Learning terminology?
- What are scikit-learn's four key requirements for working with data?

## Introducing the iris dataset

- 50 samples of 3 different species of iris (150 samples total)
- Measurements: sepal length, sepal width, petal length, petal width

In [None]:
#from IPython.display import IFrame
#IFrame('https://www.dataschool.io/files/iris.txt', width=300, height=200)

## Machine Learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- Famous dataset for Machine Learning because prediction is **easy**
- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

## Loading the iris dataset into scikit-learn

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [None]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)

A `Bunch` is a container object for datasets: it behaves like a Python dictionary but also lets you access its keys as attributes. scikit-learn's dataset loaders return `Bunch` objects that hold the data array, the target array, and additional information such as a description of the dataset:
- **Attribute access:** Values can be read as attributes (e.g., `bunch.key`) as well as with the standard dictionary syntax (e.g., `bunch['key']`).
- **Dictionary-like behavior:** It supports the dictionary methods, including `.keys()`, `.values()`, and `.items()`.
- **Used for datasets:** By default, scikit-learn's load and fetch functions return a `Bunch`. For example, `sklearn.datasets.load_iris()` returns a `Bunch` containing the features, the targets, and a description of the dataset.
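
The two access styles described above can be compared directly. The following is a minimal sketch, not part of the original notebook; the variable names `same_object`, `keys`, and `description_start` are illustrative only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Attribute access and key access return the very same array object
same_object = iris.data is iris["data"]
print(same_object)

# Dictionary-style methods are available as well
keys = list(iris.keys())
print("data" in keys, "target" in keys)

# DESCR holds a plain-text description of the dataset
description_start = iris.DESCR[:60]
print(description_start)
```

Attribute access is convenient for interactive work, while key access is handy when the key name is stored in a variable.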

In [None]:
iris['data']

In [11]:
# print the keys of the Bunch object
print(iris.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


## Machine Learning terminology

- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)

In [12]:
# print the names of the four features
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
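
To make the row/column terminology concrete, the sketch below pairs each feature name with the matching value from the first observation. It is an illustration rather than part of the original notebook, and `first_observation` is a name introduced here.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# One row of iris.data is one observation; each of its four values
# belongs to the feature (column) with the same position in feature_names
first_observation = dict(zip(iris.feature_names, iris.data[0]))

for name, value in first_observation.items():
    print(f"{name}: {value}")
```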


In [13]:
# print integers representing the species of each observation
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [14]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

['setosa' 'versicolor' 'virginica']
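
Because `iris.target_names` is a NumPy array, the integer codes can be used as indices to decode every observation at once. The sketch below is an illustration (the names `species`, `codes`, and `counts` are introduced here); it also counts the observations per species.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Use the integer codes as indices into target_names to decode them
species = iris.target_names[iris.target]
print(species[:3])

# Count how many observations belong to each species
codes, counts = np.unique(iris.target, return_counts=True)
for code, count in zip(codes, counts):
    print(code, iris.target_names[code], count)
```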


- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features should always be **numeric**, and response should be **numeric** for regression problems (for classification problems, the response may be numeric or string class labels)
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**

In [15]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [16]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

(150, 4)


In [17]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)

(150,)


In [None]:
# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target
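
The four requirements can also be checked in one place. The sketch below is an illustration, not part of the original notebook: the `checks` dictionary is a name introduced here, and `check_X_y` is scikit-learn's own input validation helper, which raises a `ValueError` if, for example, `X` is not 2D or `X` and `y` have inconsistent lengths.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils.validation import check_X_y

iris = load_iris()
X, y = iris.data, iris.target

# Manual checks mirroring the four requirements listed earlier
checks = {
    "separate objects": X is not y,
    "numeric features": np.issubdtype(X.dtype, np.number),
    "NumPy arrays": isinstance(X, np.ndarray) and isinstance(y, np.ndarray),
    "expected shapes": X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0],
}
for requirement, passed in checks.items():
    print(requirement, passed)

# scikit-learn's validator returns the (possibly converted) arrays on success
X_checked, y_checked = check_X_y(X, y)
print(X_checked.shape, y_checked.shape)
```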

## Resources

- scikit-learn documentation: [Dataset loading utilities](https://scikit-learn.org/stable/datasets.html)
- Jake VanderPlas: Fast Numerical Computing with NumPy ([slides](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015), [video](https://www.youtube.com/watch?v=EEUXKG97YRw))
- Scott Shell: [An Introduction to NumPy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)

© 2021 [Data School](https://www.dataschool.io). All rights reserved.