# Inspection of the iris dataset
In this notebook the iris dataset is loaded with `scikit-learn`, and basic visualizations and statistics are computed.

---
tags: iris dataset, visualization, statistics, scikit-learn

# Imports

In [1]:
import itertools
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
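# `load_data` is not part of the public API; its module path and name vary across scikit-learn versions.
from sklearn.datasets.base import load_data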
import matplotlib.pyplot as plt
%matplotlib inline

ModuleNotFoundError: No module named 'sklearn'

# Load and inspect data

From the source code, the function [`load_iris`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) reads the file `iris.csv` from `dirname(__file__)/'data'/'iris.csv'`, where `dirname(__file__)` is the directory containing the module `sklearn.datasets`. With the parameter `return_X_y` set to `True`, it returns only the data and the target:

In [None]:
X, y = load_iris(return_X_y=True)

The function `load_iris()` returns the data in the form of `numpy` arrays:

In [None]:
X.shape, y.shape

showing that there are `150` samples with `4` features each.  
All values are numerical, including the target values, which encode the classes of irises:

In [None]:
n_samples = 6
np.hstack([X[:n_samples, :], y[:n_samples,None]])

We verify that there are `3` numerical classes:

In [None]:
np.unique(y)

**Questions.**
- What are the class names?
- What are the feature names?

## Data directory
Look at the directory where the dataset is stored. In my case, the `scikit-learn` datasets are located under:

In [None]:
datasets_path = Path.home()/'.pyenv/versions/pyml/lib/python3.7/site-packages/sklearn/datasets'
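
The path above is specific to one machine. As a more portable alternative, the directory of an installed package can be looked up with the standard library. The following is a minimal sketch: the helper name `package_dir` is hypothetical, and `json` is used only as a stand-in for a package that is certainly installed. With `scikit-learn` available, passing `'sklearn.datasets'` should yield the equivalent of `datasets_path`.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory of an importable package or module, or None if it is missing.

    `package_dir` is a hypothetical helper, not part of scikit-learn.
    """
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. 'sklearn') is not installed.
        return None
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py (or the module file).
    return Path(spec.origin).parent


# 'json' is a standard-library package, used here as a stand-in.
demo_dir = package_dir('json')
print(demo_dir)

# With scikit-learn installed, this would replace the hard-coded path:
# datasets_path = package_dir('sklearn.datasets')
```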

The **data** itself is saved under

In [None]:
path = datasets_path/'data'
# print('\n'.join([item.name for item in path.glob("*")]))

while the **description** of the dataset is saved under

In [None]:
desc_path = datasets_path/'descr'
# print('\n'.join([item.name for item in desc_path.glob("*")]))

Inspect the data directory `path` in a bit more detail.
This directory contains all the small datasets bundled with `scikit-learn`:

In [None]:
sum(1 for item in path.rglob("*")), set(item.suffix for item in path.rglob("*.*"))

In [None]:
print('\n'.join([str(item.name) for item in path.rglob("*.csv")]))

# The description file `iris.rst`
From the description file `iris.rst` we learn that there are:
- 4 attributes:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
- 3 classes:
    - Iris-Setosa
    - Iris-Versicolour
    - Iris-Virginica

The names of the features are also hard-coded in the `load_iris()` function:

In [None]:
feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']

The metadata can also be recovered from `load_iris` by leaving `return_X_y` set to `False`, in which case the function returns a `Bunch` with the data and the metadata, in particular the class names and the feature names.  This is explained in the documentation of the `load_iris()` function.

In [None]:
# ??load_iris
# help(load_iris)
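
As a sketch of the `Bunch` route described above: the attributes `feature_names` and `target_names` are documented for `load_iris`. The wrapper `iris_metadata` and the import guard are additions for illustration, so that the cell degrades gracefully where `scikit-learn` is missing.

```python
try:
    from sklearn.datasets import load_iris
except ImportError:
    load_iris = None  # scikit-learn is not installed in this environment


def iris_metadata():
    """Return feature and class names from the Bunch, or None without scikit-learn.

    `iris_metadata` is a hypothetical helper written for this example.
    """
    if load_iris is None:
        return None
    iris = load_iris()  # return_X_y defaults to False, so a Bunch is returned
    return {
        'feature_names': list(iris.feature_names),
        'target_names': [str(name) for name in iris.target_names],
        'data_shape': iris.data.shape,
    }


metadata = iris_metadata()
print(metadata)
```

With this route neither the feature names nor the class names need to be typed by hand.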

# Using the `load_data()` function

Alternatively, by inspecting the source code of `load_iris()`, one can recover the target names with the `load_data()` function:

In [None]:
X, y, target_names = load_data(datasets_path, 'iris.csv')
classes = {i: name for i, name in enumerate(target_names)}
print(X.shape)
print(y.shape)
print('\n'.join([f"class {i}: {name}" for i, name in classes.items()]))

The feature names must nevertheless be hard-coded as above.
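
To illustrate what `load_data()` recovers, the following sketch parses a small in-memory sample in the layout that the bundled `iris.csv` is assumed to follow: a first line holding the sample count, the feature count and the class names, followed by one row per sample with the integer class label in the last column. The helper `parse_header_csv` and the three sample rows are illustrative; this is not scikit-learn's own implementation.

```python
import csv
import io

# Illustrative sample in the assumed layout (three rows, not the full dataset).
sample = """3,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.3,3.3,6.0,2.5,2
"""


def parse_header_csv(text):
    """Split a header-prefixed CSV into features, labels and class names."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_samples, n_features = int(header[0]), int(header[1])
    target_names = header[2:]
    features, labels = [], []
    for row in reader:
        features.append([float(value) for value in row[:n_features]])
        labels.append(int(row[n_features]))
    if len(features) != n_samples:
        raise ValueError("row count does not match the header")
    return features, labels, target_names


features, labels, target_names_demo = parse_header_csv(sample)
classes_demo = dict(enumerate(target_names_demo))
print(classes_demo)
```

The header carries the class names but no feature names, which is consistent with the feature names having to be supplied separately.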

# Visualizing the data
There are `4` features, hence `4!/(2! 2!) = 6` pairs of features, and we produce one (two-dimensional) scatter plot for each pair:

In [None]:
fig, axs = plt.subplots(3, 2, figsize=(12, 3*4))
axs = axs.ravel()
for ax, (i, j) in zip(axs, list(itertools.combinations(range(4), 2))):
    scatter = ax.scatter(X[:,i], X[:,j], marker='.', c=y)
    ax.legend(*scatter.legend_elements(prop='colors'), loc='upper right')
    ax.set_xlabel(feature_names[i], fontsize=14)
    ax.set_ylabel(feature_names[j], fontsize=14)
    ax.set_title(f"{feature_names[i]} vs {feature_names[j]}", fontsize=16)
    
plt.subplots_adjust(hspace=.5)
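
The number of panels used above can be checked directly. This short sketch enumerates the unordered feature pairs with `itertools.combinations` and compares their number with the binomial coefficient computed from factorials; the variable names are illustrative.

```python
import itertools
from math import factorial

n_features = 4
pairs = list(itertools.combinations(range(n_features), 2))

# Binomial coefficient C(4, 2) = 4! / (2! * 2!)
n_pairs = factorial(n_features) // (factorial(2) * factorial(n_features - 2))

print(pairs)
print(len(pairs), n_pairs)
```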

## Visualization via `pandas` dataframes

In [None]:
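# The first line of iris.csv holds metadata rather than column names,
# so it is consumed as the header row and the columns are renamed right after.
df = pd.read_csv(path/'iris.csv', header=0)
df.columns=['sepal length (cm)', 'sepal width (cm)',
            'petal length (cm)', 'petal width (cm)',
            'target']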

In [None]:
fig, axs = plt.subplots(3, 2, figsize=(16, 3*6))
axs = axs.ravel()
for ax, (i, j) in zip(axs, list(itertools.combinations(range(4), 2))):
    df.plot.scatter(x=feature_names[i], y=feature_names[j], c='target', cmap='viridis', marker='+', ax=ax)
    ax.set_title(f"{feature_names[i]} vs {feature_names[j]}", fontsize=16)

plt.subplots_adjust(hspace=.5)

# Statistics

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(12, 8), sharey=True)
axs = axs.ravel()
for i, ax in enumerate(axs):
    df.groupby('target')[feature_names[i]].plot.hist(alpha=0.4, ax=ax);
    ax.set_xlabel(feature_names[i], fontsize=14)
    ax.legend();

In [None]:
df.head()

In [None]:
idx = df[df['target']==1].index.values[0]

In [None]:
type(idx)

In [None]:
idxs = slice(idx, idx+2)

In [None]:
idxs

In [None]:
print(df[idxs])

In [None]:
df.loc[50:52]

In [None]:
pd.__version__

In [None]:
df.index.values[-1]

In [None]:
dk = df[df['target']==1].copy()

In [None]:
idx = dk.index.values[0]
idx, type(idx)

In [None]:
dk.head()

In [None]:
dk.index

In [None]:
idx

In [None]:
dk.loc[idx]

In [None]:
idxs = slice(idx, idx+2)

In [None]:
dk.loc[idxs]
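
The cells above mix two slicing styles. As a minimal sketch on a hypothetical six-row frame (not the iris data): with a default integer index, `df[...]` with an integer slice selects rows by position and excludes the end, whereas `.loc[...]` selects by index label and includes the end. This reflects my understanding of pandas' behaviour and is worth checking against the installed version.

```python
import pandas as pd

# Hypothetical stand-in for the iris frame: six rows, two per class.
demo = pd.DataFrame({
    'value': [10, 11, 12, 13, 14, 15],
    'target': [0, 0, 1, 1, 2, 2],
})

sub = demo[demo['target'] == 1]    # filtered copy keeps the original labels 2 and 3
first = sub.index.values[0]        # label of the first row of class 1
window = slice(first, first + 2)   # slice(2, 4)

by_position = demo[window]         # positional: rows at positions 2 and 3
by_label = demo.loc[window]        # label-based: labels 2, 3 and 4 (end included)
sub_by_label = sub.loc[window]     # only labels 2 and 3 exist in the subset

print(list(by_position.index))
print(list(by_label.index))
print(list(sub_by_label.index))
```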