# Iris Dataset Project

The Iris dataset is a classic, small dataset commonly used to demonstrate data analysis techniques and machine learning algorithms.

This Notebook will be both your lesson & project for next session.

## Constraints
* Fill all the # TODO cells.
* When asked to do visualization, use plotting from **matplotlib**, not pandas or seaborn, except when explicitly authorized.
* Send your work to my inbox laure.daumal@ext.devinci.fr
* **Deadline**: Wednesday, 11th of March, 23h42. You have a bit more than one week and a half.

Don't forget to:
1. Clear all outputs before saving & sending it to me (**Cell >> All Output >> Clear**).
1. Restart the kernel and run all cells to check no errors are found (**Cell >> Run All**).

# Features Exploration

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

from sklearn.decomposition import PCA

import seaborn as sns  #  <-- A new library!
from sklearn.preprocessing import StandardScaler  # <-- A new method!

In [None]:
iris_dataset = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

df = pd.read_csv(iris_dataset, names=['SepalLength (cm)', 
                                      'SepalWidth (cm)',
                                      'PetalLength (cm)',
                                      'PetalWidth (cm)',
                                      'Species'])
df

This dataset has 150 samples, defined by 4 features. 

The `Species` column does not count as a feature because it is usually used as the **target** for classification.

Classification consists of building a model that takes a sample's 4 features (given in centimeters) and returns the corresponding target, here an Iris species.
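
To make the idea concrete, here is a minimal sketch of a classifier, assuming a made-up toy table of measurements (the numbers and the `small` / `large` labels are hypothetical, not taken from the Iris dataset). It uses a 1-nearest-neighbour rule, which is only one of many possible models: it returns the label of the closest known sample.

```python
import numpy as np

# Hypothetical toy measurements in cm: [petal length, petal width].
# These values are invented for illustration only.
X_train = np.array([[1.4, 0.2], [1.5, 0.3], [4.5, 1.5], [4.7, 1.4]])
y_train = np.array(["small", "small", "large", "large"])


def predict_1nn(sample, X, y):
    """Return the target of the training sample closest to `sample`."""
    distances = np.linalg.norm(X - sample, axis=1)  # Euclidean distance to each row
    return y[np.argmin(distances)]


prediction = predict_1nn(np.array([1.6, 0.25]), X_train, y_train)
print(prediction)
```

The model here is nothing more than "features in, target out", which is the contract every classifier in this course follows.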

In [None]:
df["Species"].value_counts()

In this Notebook we will see some cool usage of the **`seaborn`** package.

For example, we can easily plot a bar count from a column:

In [None]:
sns.countplot(x='Species', data=df)

## Sepal Exploration

### Swarm plot

`swarmplot` from `seaborn` is a kind of plot we haven't seen before: it looks like a swarm of bees buzzing about their hive.



In [None]:
sns.swarmplot(x="Species", y="SepalLength (cm)", data=df)

In [None]:
sns.swarmplot(x="Species", y="SepalWidth (cm)", data=df)

### Scatter

We can try to visualize each Iris feature in 2D.

Your task: Display a 2D visualization of each sample based on their SepalLength and SepalWidth.

In [None]:
# TODO: Print a matplotlib 2D visualization of each sample based on their SepalLength and SepalWidth. 
# Hint: Use a scatter plot.

In [None]:
# TODO: Use your previous visualization and change the color according to each sample's Species.

The equivalent can also be obtained with `seaborn` using `scatterplot` and the `hue` parameter.

In [None]:
sns.scatterplot(x="SepalLength (cm)", y="SepalWidth (cm)", hue="Species", data=df)

Below, we use the `lmplot` function from `seaborn`: it fits a linear regression model for each group across a `FacetGrid` to show the relationship between the two variables.

In [None]:
sns.lmplot(x="SepalLength (cm)", y="SepalWidth (cm)", hue="Species", data=df)
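
For intuition, the straight line drawn for each group is the kind of line a least-squares fit gives. The sketch below assumes hypothetical toy points (`xs` and `ys` are illustrative names) and uses `np.polyfit` with degree 1. It is not a description of seaborn's internals, only the same kind of fit done by hand.

```python
import numpy as np

# Hypothetical toy points lying exactly on the line y = 2x + 1.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0

# A degree-1 polynomial fit is an ordinary least-squares line.
slope, intercept = np.polyfit(xs, ys, deg=1)
print(slope, intercept)
```

With real measurements the points do not sit exactly on a line, so the fit returns the slope and intercept that minimize the squared vertical distances.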

## Petal Exploration

In [None]:
# TODO: Reuse the previous matplotlib scatter visualization, but now show the relationship between PetalLength and PetalWidth
# (with the same species colors)

Again, the equivalent from Seaborn with Linear regression.

In [None]:
sns.lmplot(x="PetalLength (cm)", y="PetalWidth (cm)", hue="Species", data=df)

## All Features Exploration


### Histograms
For each of the 4 features, show its distribution in a Histogram chart.

In [None]:
# TODO: For each of the 4 features, show its distribution in a matplotlib Histogram chart. 
# (You should get 4 different histograms)

In [None]:
# TODO: Use the previous histogram visualization, but show color according to each Species
# (Use the same colors as before)

Based on the histograms you just plotted, which feature(s) would probably be the best to differentiate the 3 Iris species? Why?

In [None]:
# TODO: Write your observations.

"""
You can use a multi-line comment to write big text in Python, and answer the question.
"""

This time, we use Pandas to generate our histograms. It can be useful to have a visualisation of the distribution of each feature.

In [None]:
df.hist()

### Pairplots

With seaborn, you can use `pairplot` with `kind="reg"` to show linear relationship between different variables.

Below, we use the same feature on the Y axis and the 3 remaining features on the X axis.

In [None]:
sns.pairplot(df, 
             x_vars=["PetalLength (cm)", "PetalWidth (cm)", "SepalWidth (cm)"], 
             y_vars=["SepalLength (cm)"],
             hue="Species",
             height=5, aspect=.8, kind="reg");

You can also pass your whole dataset, let Seaborn determine what to show, and see what happens. (Not recommended if you have a lot of features, e.g. the Pokemon Go dataset.)

In [None]:
sns.pairplot(df, hue="Species");

### Correlations

The correlation of each feature to another plays an important role. 

If many of the features are highly correlated with each other, they carry redundant information, and training an algorithm with all the features may reduce the accuracy of the resulting model.

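You can get the correlation matrix of the numerical features of a pandas.DataFrame using the `.corr()` method. The `Species` column contains strings, so we drop it first (recent versions of pandas raise an error otherwise).

In [None]:
df.drop(columns="Species").corr()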
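
If you wonder what these numbers are: by default `.corr()` computes the Pearson coefficient. Here is a minimal sketch on hypothetical toy arrays (`a`, `b`, `c` and the helper `pearson` are made up for illustration) showing the computation by hand.

```python
import numpy as np

# Hypothetical toy columns, invented for illustration.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.1, 5.9, 8.2, 9.8])  # roughly 2 * a
c = np.array([5.0, 3.0, 4.0, 1.0, 2.0])  # tends to decrease when a increases


def pearson(u, v):
    """Pearson correlation: covariance divided by the product of the spreads."""
    u_centered = u - u.mean()
    v_centered = v - v.mean()
    return (u_centered @ v_centered) / np.sqrt(
        (u_centered @ u_centered) * (v_centered @ v_centered)
    )


r_ab = pearson(a, b)  # strong positive correlation, close to 1
r_ac = pearson(a, c)  # negative correlation
print(r_ab, r_ac)
```

A coefficient near 1 or -1 means the two columns move together (in the same or in opposite directions); a coefficient near 0 means there is little linear relationship.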

### Heatmap with Matplotlib

Let's visualize our correlation matrix with a heatmap!

In [None]:
features = df.columns[:-1]
values = df[features].corr().values

fig, ax = plt.subplots(figsize=(16, 8))

im = ax.imshow(values, 
               cmap="cubehelix_r")

# Show X and Y ticks & label them
ax.set_xticks(np.arange(len(features)))
ax.set_yticks(np.arange(len(features)))

ax.set_xticklabels(features)
ax.set_yticklabels(features)

# Create text annotations.
for i in range(len(features)):
    for j in range(len(features)):
        color = "w" if values[i, j] > 0.5 else "black"
        text = ax.text(j, i, round(values[i, j], 2), 
                       ha="center", 
                       va="center", 
                       color=color)

ax.set_title("Correlation Heatmap")
plt.show()

### Heatmap with Seaborn

Spoiler: it's much easier.

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

sns.heatmap(df[features].corr(), 
            ax=ax, 
            annot=True, 
            cmap='cubehelix_r')

plt.show()


A coefficient close to 0 means a weak linear correlation; a coefficient close to 1 or -1 means a strong one. Observations:


**Weakly correlated** (negative coefficients, closer to 0):
* SepalWidth and PetalWidth
* SepalWidth and PetalLength

**Highly correlated**:
* PetalWidth and PetalLength
* SepalLength with both PetalWidth and PetalLength

## 2D Principal Component Analysis

* Normalize
* Apply PCA

In [None]:
# Normalize features using StandardScaler.

scaler = StandardScaler()

matrix = scaler.fit_transform(df[features])
matrix
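
As a sanity check on what `StandardScaler` does, the sketch below standardizes a hypothetical toy array by hand: subtract each column's mean, then divide by each column's standard deviation (the population standard deviation, `ddof=0`, which is what `StandardScaler` uses). The array `toy` is made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Column-wise standardization: zero mean and unit standard deviation.
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)  # np.std uses ddof=0 by default

print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

After this step every feature has the same scale, so no feature dominates the PCA just because it is measured with larger numbers.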

In [None]:
# TODO: Use PCA with `n_components=2` to transform your 4D dataset `matrix` into a 2D dataset

In [None]:
# TODO: Use your 2D dataset to plot a 2D scatter visualization.

## 3D PCA

In [None]:
# TODO: Use PCA with `n_components=3` to transform your 4D dataset `matrix` into a 3D dataset

In [None]:
# TODO: Use your 3D dataset to plot a 3D scatter visualization.

# Linear Discriminant Analysis (LDA)

We learnt about Principal Component Analysis (PCA) just before.

Both LDA and PCA are **linear transformation** techniques used for **dimensionality reduction**. 

But PCA is said to be an **unsupervised** algorithm, since it does not need a **target** (class value) to work, contrary to LDA, which is **supervised**. PCA computes the vectors (axes) that maximize the *variance* of the data, whereas LDA looks for the axes that maximize the *distance* between the class centers relative to the spread inside each class (the best way to separate classes).
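
The difference can be seen on a small synthetic example. The sketch below is only an illustration under stated assumptions: the two classes are hypothetical random clouds, and the LDA direction uses the two-class special case (Fisher's direction, $S_{W}^{-1}(m_{b} - m_{a})$) rather than the general procedure described in this Notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical classes: large spread along axis 0, class offset along axis 1.
class_a = rng.normal(loc=[0.0, 0.0], scale=[5.0, 0.5], size=(200, 2))
class_b = rng.normal(loc=[0.0, 3.0], scale=[5.0, 0.5], size=(200, 2))
data = np.vstack([class_a, class_b])

# PCA direction: eigenvector of the covariance matrix with the largest eigenvalue.
cov = np.cov(data, rowvar=False)
cov_vals, cov_vecs = np.linalg.eigh(cov)
pca_dir = cov_vecs[:, np.argmax(cov_vals)]

# Fisher direction for two classes: S_W^{-1} (m_b - m_a), then normalized.
m_a, m_b = class_a.mean(axis=0), class_b.mean(axis=0)
s_w = (class_a - m_a).T @ (class_a - m_a) + (class_b - m_b).T @ (class_b - m_b)
lda_dir = np.linalg.solve(s_w, m_b - m_a)
lda_dir /= np.linalg.norm(lda_dir)

print("PCA direction:", pca_dir)
print("LDA direction:", lda_dir)
```

PCA ignores the labels and follows the axis with the most spread (axis 0), while the LDA direction points along axis 1, where the two classes are actually separated.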

LDA follows steps similar to PCA, with a few differences:

1. Get all the `d` dimensions (the features we want to reduce), target excluded
1. Get the *mean vectors*, the mean of each feature **for each class**
1. Compute **two scatter matrices: between-class and within-class.**
1. Compute the eigenvectors $(e_{1}, e_{2},...,e_{d})$ and corresponding eigenvalues $(λ_{1},λ_{2},...,λ_{d})$ of the matrix $S_{W}^{-1} S_{B}$ (the inverse of the within-class scatter matrix times the between-class scatter matrix)
> Each eigenvalue tells us the *magnitude* associated with its eigenvector, that is, how much class-separating information lies along that direction. In theory it is a value >= 0.  
> If all the eigenvalues have a similar magnitude, it means our dataset is already well represented by those features.  
> If some eigenvalues are really high and some are close to zero, the latter are less informative. We might consider dropping the corresponding eigenvectors and keeping only those with the highest eigenvalues to construct our `k`-dimensional subspace. 
1. Visualize the eigenvectors on a 3D plot
1. Sort the eigenvectors by decreasing eigenvalues and choose the `k` eigenvectors with the largest eigenvalues to form a `d`×`k` dimensional matrix $W$
1. Project our `d`-dimensional dataset onto the new `k`-dimensional subspace, created from the eigenvectors with the highest eigenvalues.
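
The "sort and keep the `k` largest" step can be sketched on a small example. Everything below is illustrative: `toy_matrix` is a hypothetical stand-in for the matrix being decomposed, and `n_keep` plays the role of `k`.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for the matrix we decompose.
toy_matrix = np.array([[4.0, 1.0, 0.0],
                       [1.0, 3.0, 0.0],
                       [0.0, 0.0, 0.1]])

vals, vecs = np.linalg.eig(toy_matrix)  # eigenvectors are the COLUMNS of vecs

# Reorder eigenvalues (and their eigenvectors) from largest to smallest.
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Share of each eigenvalue in the total: small shares are less informative.
ratio = vals / vals.sum()

n_keep = 2
w = vecs[:, :n_keep]  # (d, n_keep) matrix made of the leading eigenvectors

print(ratio)
print(w.shape)
```

Note that `np.linalg.eig` returns the eigenvectors as columns, so we select columns (`vecs[:, order]`), not rows.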

## Transform String class values to Numerical

Remember when I told you we could transform String or Boolean values into numerical values?

The simplest solution is to do it by hand by creating a label dictionary.

In [None]:
label_dict = dict()

for i, species in enumerate(df['Species'].unique()):
    label_dict.update({species : i})
    
label_dict

And use `apply` to encode each String class value to a numerical one.

In [None]:
def str2num(species: str):
    return label_dict[species]

df['Species'] = df['Species'].apply(str2num)
df
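
As an aside, pandas can build the same kind of encoding for you. The sketch below runs on a hypothetical toy Series (`toy_species` and its values are made up) and uses `pd.factorize`, which numbers the values in order of first appearance; the `map` variant mirrors the dictionary approach used above.

```python
import pandas as pd

# Hypothetical toy column, invented for illustration.
toy_species = pd.Series(["setosa", "versicolor", "setosa", "virginica"])

# factorize returns one integer code per row and the list of unique values.
codes, uniques = pd.factorize(toy_species)
print(codes)
print(list(uniques))

# Same result with an explicit dictionary and `map`.
mapping = {name: i for i, name in enumerate(uniques)}
mapped = toy_species.map(mapping)
print(mapped.tolist())
```

Doing it by hand with a dictionary keeps the mapping visible, which is handy later when we need the original names for plot legends.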

In [None]:
x = df[features].values  # features
y = df['Species'].values  # target
y, x

## Compute mean vectors for each feature and each class

Since we have 4 features and 3 classes, our `mean_vectors` matrix will be of shape (3, 4).

In [None]:
mean_vectors = []

# TODO: Iterate over the 3 classes and append the 4 features means to mean_vectors. You should get a shape of (3, 4).

mean_vectors = np.array(mean_vectors)
print(mean_vectors.shape)
print(mean_vectors)

## Compute the Scatter Matrices

We have 4 features, so we will compute two (4, 4)-dimensional matrices: the **in-between-class** and the **within-class** scatter matrices.

## Within-Class Scatter Matrix

The within-class scatter matrix $ S_{W} $ is computed as:

$$ S_{W} = {\displaystyle{\sum_{i=1}^{c} S_{i}}} $$

Where

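$$ S_{i} = {\displaystyle{\sum_{j=1}^{N_{i}} \vec {z_{j}} \, \vec {z_{j}}^{T} }} $$

Where $ \vec{z_{j}} = \vec{x_{j}} - \vec{m_{i}} $ is the centered column vector of the sample $j$ of class $i$ (its 4 feature values $ \vec{x_{j}} $ minus the mean vector $ \vec{m_{i}} $ of its class), and $N_{i}$ is the number of samples in class $i$. The product $ \vec{z_{j}} \, \vec{z_{j}}^{T} $ is an outer product, so each term of the sum is a (4, 4) matrix.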



In [None]:
d = len(features)

# Within-Class
wc_scatter_matrix = np.zeros((d,d)) 
wc_scatter_matrix

# For each class
for c, mean in zip(label_dict.values(), mean_vectors):
    column_mean = mean.reshape(4, 1)
    class_sc_matrix = np.zeros((d, d))
    # For each sample
    for sample in df[df['Species'] == c][features].values:
        column_vec = sample.reshape(4, 1)
        normalized_vec = column_vec - column_mean
        class_sc_matrix += normalized_vec.dot(normalized_vec.T)
        
    wc_scatter_matrix += class_sc_matrix

wc_scatter_matrix
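
The inner loop over samples can also be written as a single matrix product. The sketch below checks this on hypothetical random data (`samples` stands for the samples of one class; all names are illustrative): the sum of the outer products of the centered samples equals `centered.T @ centered`.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(size=(10, 4))          # hypothetical class: 10 samples, 4 features
centered = samples - samples.mean(axis=0)   # subtract the class mean from every sample

# Loop form: sum of the outer products, one per sample.
loop_scatter = np.zeros((4, 4))
for row in centered:
    col = row.reshape(4, 1)
    loop_scatter += col @ col.T

# Vectorized form: one matrix product.
fast_scatter = centered.T @ centered

print(np.allclose(loop_scatter, fast_scatter))
```

The loop version stays closer to the formula, which is why it is used in this Notebook; the vectorized version is shorter and faster on large datasets.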

## In-Between Class Scatter Matrix

The between-class scatter matrix $ S_{B} $ is the sum, over the classes, of the outer product of each centered class mean with itself, each term being multiplied by the size of the class.

$$ S_{B} = {\displaystyle{\sum_{i=1}^{c} N_{i} \, (\vec {m_{i}} - \vec {m}) (\vec {m_{i}} - \vec {m})^{T} }} $$

Where 
* $N_{i}$ is the number of samples in class $i$
* $ \vec{m_{i}} $ is the mean vector of class $i$
* $ \vec{m} $ is the overall mean vector, computed over all samples


In [None]:
# Between-Class
bc_scatter_matrix = np.zeros((d,d)) 

overall_mean = df[features].mean().values.reshape((d, 1))

# For each class mean
for c, mean in zip(label_dict.values(), mean_vectors):
    column_mean = mean.reshape(d, 1)
    ni = df[df['Species'] == c].shape[0]  # number of samples in class c
    normalized_vec = column_mean - overall_mean
    bc_scatter_matrix += ni * normalized_vec.dot(normalized_vec.T)

bc_scatter_matrix
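
A handy way to check both scatter matrices is the identity $S_{T} = S_{W} + S_{B}$, where $S_{T}$ is the total scatter matrix computed around the overall mean. The sketch below verifies it on hypothetical random groups (all names and sizes are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 3 classes of different sizes, 4 features each.
groups = [rng.normal(loc=mu, size=(n, 4)) for mu, n in [(0.0, 5), (2.0, 8), (5.0, 6)]]
all_points = np.vstack(groups)
grand_mean = all_points.mean(axis=0)

s_w = np.zeros((4, 4))  # within-class scatter
s_b = np.zeros((4, 4))  # between-class scatter
for g in groups:
    g_mean = g.mean(axis=0)
    s_w += (g - g_mean).T @ (g - g_mean)
    diff = (g_mean - grand_mean).reshape(4, 1)
    s_b += len(g) * (diff @ diff.T)

# Total scatter around the overall mean.
s_t = (all_points - grand_mean).T @ (all_points - grand_mean)

print(np.allclose(s_t, s_w + s_b))
```

If this check fails on your own matrices, the usual suspects are a wrong class size or a mean computed over the wrong axis.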

## Compute eigenvalues and eigenvectors

Task: Use `np.linalg.eig` to get the eigenvalues and eigenvectors of $ S_{W}^{-1} \cdot S_{B} $ (the dot product of the inverse of the Within-Class scatter matrix and the Between-Class scatter matrix). You can compute the inverse with `np.linalg.inv`.

In [None]:
# TODO: Compute the eigenvalues and eigenvectors of the dot product of the inverse of the Within-Class matrix and the Between-Class matrix

eig_vals, eig_vecs = 0, 0

## Select the k best features

Exactly as we did for the PCA, select the `k = 2` eigenvectors with the largest eigenvalues.

Use `np.hstack` to create the matrix $ W $, `matrix_w`.

In [None]:
k = 2

# TODO: Create the matrix_w.
matrix_w = None

## Project to the new subspace

Transform the samples onto the new subspace via the equation $y = W^{T} x$ for a single sample $x$, or equivalently $Y = X W$ for the whole matrix of samples.

In [None]:
# TODO: Compute the x_lda value using `x` and `matrix_w`.
x_lda = None  # replace None with your computation
x_lda.shape

In [None]:
def plot_lda(x_lda):
    fig, ax = plt.subplots(figsize=(16, 8))

    for label, color in zip(label_dict.keys(), ('blue', 'red', 'green')):
        plt.scatter(x=x_lda[df['Species'] == label_dict[label]][:, 0],
                    y=x_lda[df['Species'] == label_dict[label]][:, 1],
                    marker="*",
                    color=color,
                    alpha=0.5,
                    label=label)

    plt.title('LDA Projection')
    plt.legend()
    plt.show()
    
plot_lda(x_lda)

## Compare with LDA from Scikit-learn

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
x_lda_sklearn = lda.fit_transform(x, y)

plot_lda(x_lda_sklearn)

Tasks: Plot your LDA-transformed data using Seaborn as:
* A scatterplot
* A linear regression plot 

In [None]:
# TODO: Convert `x_lda` to a DataFrame and plot it with the Seaborn scatterplot.

In [None]:
# TODO: Use the same DataFrame to plot it using `lmplot`.