# IM939 Lab 4 - Part 1 - Iris

## Data

Many datasets have a high number of dimensions. We are going to explore dimension reduction (principal component analysis) and clustering techniques.

The simple Iris dataset is great for introducing these methods.

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

The iris dataset is in an odd format: `load_iris` returns a scikit-learn `Bunch` object rather than a dataframe.

In [None]:
type(iris)

Following [this stackoverflow answer](https://stackoverflow.com/questions/38105539/how-to-convert-a-scikit-learn-dataset-to-a-pandas-dataset) we can convert it into the pandas dataframe format we know and love.

In [None]:
import pandas as pd

iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df.head()

We will scale the data between 0 and 1 to be on the safe side. All we are doing is placing every variable on the same scale, which is often called normalisation (see [this blog entry](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)).

Standardisation usually means centring each variable on its mean and scaling it to unit standard deviation.

Some algorithms are sensitive to the size of variables. For example, if the sepal widths were in meters and the other variables in cm then an algorithm may underweight sepal widths. Normalising the data puts all the data on a single scale.

If you cannot choose between them then try it both ways. You could compare the result with your raw data, the normalised data and the standardised data.
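
As a rough sketch of the difference (this is not part of the lab's workflow, and the names `X_norm` and `X_std` are just picked for illustration), we can apply both scalers to the raw iris measurements and check what each one guarantees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = load_iris().data  # raw measurements in cm

# Normalisation: rescale each column to the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# Standardisation: centre each column on 0 with unit standard deviation
X_std = StandardScaler().fit_transform(X)

print("normalised min:", X_norm.min(axis=0))
print("normalised max:", X_norm.max(axis=0))
print("standardised mean:", X_std.mean(axis=0).round(2))
print("standardised std:", X_std.std(axis=0).round(2))
```

After normalisation every column runs from 0 to 1. After standardisation every column has mean 0 and standard deviation 1, but no fixed minimum or maximum.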

In [None]:
from sklearn.preprocessing import MinMaxScaler
col_names = iris_df.columns
iris_df =  pd.DataFrame(MinMaxScaler().fit_transform(iris_df))

In [None]:
iris_df.columns = col_names
iris_df

In [None]:
iris_df.shape

Great.

Our dataset shows us the length and width of both the sepals (leaves) and petals of 150 plants. The dataset is quite famous and you can find a [wikipedia page](https://en.wikipedia.org/wiki/Iris_flower_data_set) with details of the dataset.

## Questions

To motivate our exploration of the data, consider the sorts of questions we can ask:

* Are all our plants from the same species?
* Do some plants have similar leaf and petal sizes?
* Can we differentiate between the plants using all 4 variables (dimensions)?
* Do we need to include both length and width, or can we reduce these dimensions and simplify our analysis?

## Initial exploration

We can explore a dataset with few variables using plots. 

In [None]:
import seaborn as sns

# some plots require a long dataframe structure
iris_df_long = iris_df.melt()
iris_df_long

In [None]:
sns.violinplot(data = iris_df_long, x = 'variable', y = 'value')

The below plots use the wide data structure.

In [None]:
iris_df

In [None]:
sns.scatterplot(data = iris_df, x = 'sepal length (cm)', y = 'sepal width (cm)')

In [None]:
sns.scatterplot(data = iris_df, x = 'sepal length (cm)', y = 'petal length (cm)')

Interesting. There seem to be two groupings in the data.

It might be easier to look at all the variables at once.

In [None]:
sns.pairplot(iris_df)

There seem to be some groupings in the data, though we cannot easily identify which point corresponds to which row.

## Clustering

A cluster is simply a group based on similarity. There are several clustering methods and we will use a relatively simple one called K-means clustering.

In K-means clustering an algorithm tries to group our items (plants in the iris dataset) based on similarity. We decide how many groups we want and the algorithm does the best it can (an accessible introduction to k-means clustering is [here](https://www.analyticsvidhya.com/blog/2020/10/a-simple-explanation-of-k-means-clustering/)).
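
To give a feel for what the algorithm does, here is a toy version written with numpy. It is only a sketch: the function `simple_kmeans` and the made-up points are illustrations of mine, and sklearn's `KMeans` (which we use below) is more careful about starting points and stopping.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=20, seed=0):
    """Toy k-means; an illustration, not a replacement for sklearn's KMeans."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centre to the mean of its points
        # (keep the old centre if a cluster ends up empty)
        centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
    return labels, centres

# Two well separated blobs of made-up points
toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
labels, centres = simple_kmeans(toy, k=2)
print(labels)
print(centres)
```

The two steps inside the loop are the whole idea: assign every point to its nearest centre, then move each centre to the mean of the points assigned to it.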

To start, we import the KMeans class from the sklearn cluster module and turn our data into a matrix.

In [None]:
from sklearn.cluster import KMeans

# note: this reuses the name iris, replacing the object returned by load_iris()
iris = iris_df.values
iris

Specify our number of clusters.

In [None]:
k_means = KMeans(n_clusters = 3, init = 'random')

Fit our kmeans model to the data

In [None]:
k_means.fit(iris)

The algorithm has assigned a label to each row.

In [None]:
k_means.labels_

The labels are the integers 0, 1 and 2, one per row.

To tidy things up we should put everything into a dataframe.

In [None]:
iris_df['Three clusters'] = pd.Series(k_means.predict(iris_df.values), index = iris_df.index)

In [None]:
iris_df

In [None]:
sns.pairplot(iris_df, hue = 'Three clusters')

That seems quite nice. We can also do individual plots if preferred.

In [None]:
sns.scatterplot(data = iris_df, x = 'sepal length (cm)', y = 'petal width (cm)', hue = 'Three clusters')

K-means works by clustering the data around central points (often called centroids, means or cluster centers). We can extract the cluster centres from the kmeans object.

In [None]:
k_means.cluster_centers_

It is tricky to plot these using seaborn but we can use a normal matplotlib scatter plot.

Let us grab the groups.

In [None]:
group1 = iris_df[iris_df['Three clusters'] == 0]
group2 = iris_df[iris_df['Three clusters'] == 1]
group3 = iris_df[iris_df['Three clusters'] == 2]

Grab the centroids

In [None]:
import pandas as pd

centres = k_means.cluster_centers_

# column 0 is sepal length and column 3 is petal width,
# matching the x and y axes of the plot below
data = {'x': centres[:, 0],
        'y': centres[:, 3]}

df = pd.DataFrame(data, columns = ['x', 'y'])

Create the plot

In [None]:
import matplotlib.pyplot as plt

# Plot each group individually
plt.scatter(
    x = group1['sepal length (cm)'], 
    y = group1['petal width (cm)'], 
    alpha = 0.1, color = 'blue'
)

plt.scatter(
    x = group2['sepal length (cm)'], 
    y = group2['petal width (cm)'], 
    alpha = 0.1, color = 'orange'
)

plt.scatter(
    x = group3['sepal length (cm)'], 
    y = group3['petal width (cm)'], 
    alpha = 0.1, color = 'red'
)

# Plot cluster centres
plt.scatter(
    x = df['x'], 
    y = df['y'], 
    alpha = 1, color = 'black'
)


## Number of clusters

What happens if we change the number of clusters?

Two groups

In [None]:
k_means_2 = KMeans(n_clusters = 2, init = 'random')
k_means_2.fit(iris)
iris_df['Two clusters'] = pd.Series(k_means_2.predict(iris_df.iloc[:,0:4].values), index = iris_df.index)

Note that I have added a new column to the iris dataframe called 'Two clusters' and passed only our original 4 columns to the predict function (hence the .iloc[:,0:4]).

How do our groupings look now (without plotting the cluster column)?

In [None]:
sns.pairplot(iris_df.loc[:, iris_df.columns != 'Three clusters'], hue = 'Two clusters')

Hmm, does the data have more than two groups in it?

Perhaps we should try 5 clusters instead.

In [None]:
k_means_5 = KMeans(n_clusters = 5, init = 'random')
k_means_5.fit(iris)
iris_df['Five clusters'] = pd.Series(k_means_5.predict(iris_df.iloc[:,0:4].values), index = iris_df.index)

Plot without the columns called 'Three clusters' and 'Two clusters'.

In [None]:
sns.pairplot(iris_df.loc[:, (iris_df.columns != 'Three clusters') & (iris_df.columns != 'Two clusters')], hue = 'Five clusters')

In [None]:
iris_df

Which did best?

In [None]:
k_means.inertia_

In [None]:
k_means_2.inertia_

In [None]:
k_means_5.inertia_

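It looks like our k = 5 model captures the data well, as it has the lowest inertia. Inertia is defined in [the sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) as the _sum of squared distances of samples to their closest cluster center_. Bear in mind that inertia tends to fall as we add clusters, so a lower value alone does not show that five is the right number of groups.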
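
As a sanity check on that definition, the sketch below recomputes inertia by hand and compares it across the three cluster counts we tried. It refits fresh models on the raw iris measurements with a fixed `random_state` and `n_init` (assumptions made so that the numbers repeat), so the values will differ from those above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

inertias = {}
for k in (2, 3, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Squared distance from every sample to the centre of its own cluster
    own_centres = model.cluster_centers_[model.labels_]
    manual = ((X - own_centres) ** 2).sum()
    inertias[k] = (manual, model.inertia_)
    print(k, round(manual, 3), round(model.inertia_, 3))
```

The hand-computed and reported values agree, and inertia falls as k grows. That is why a lower inertia on its own does not tell us that more clusters are better.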

If you want to dive further into this then Real Python's [practical guide to K-Means Clustering](https://realpython.com/k-means-clustering-python/) is quite good.

## Principal component analysis (PCA)



PCA reduces the dimension of our data. The method derives new variables (components) in an n-dimensional space from our data, and these components are uncorrelated with one another.
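
One way to see where the components come from, and that they really are uncorrelated, is to build them by hand from the covariance matrix. This is a sketch on the raw iris measurements and not how you would normally run a PCA; sklearn's `PCA` class, used in the rest of this section, does the work for us. The variable names are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_centred = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix give the principal directions
cov = np.cov(X_centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the first two directions
scores = X_centred @ eigvecs[:, :2]

# The two projected columns are uncorrelated (off-diagonal close to 0)
corr = np.corrcoef(scores, rowvar=False)
print(corr.round(6))

# sklearn reports the same variances for its first two components
sk_var = PCA(n_components=2).fit(X).explained_variance_
print(eigvals[:2], sk_var)
```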

Here we carry out a PCA on our Iris dataset and keep only two dimensions.

In [None]:
from sklearn.decomposition import PCA

n_components = 2

pca = PCA(n_components=n_components)
iris_pca = pca.fit(iris_df.iloc[:,0:4])

We can look at the components.

In [None]:
iris_pca.components_

These components are interesting. You may want to look at a [PennState article on interpreting PCA components](https://online.stat.psu.edu/stat505/lesson/11/11.4).

Our second column, 'sepal width (cm)', is positively correlated with our second principal component whereas the first column, 'sepal length (cm)', is positively correlated with both.

You may want to consider:

* Do we need more than two components?
* Is it useful to keep sepal length (cm) in the dataset?
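
A labelled table can make the components easier to read. The sketch below refits a two-component PCA on the min-max scaled measurements and puts the component weights into a dataframe; the names `pca_sketch` and `loadings` are only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

data = load_iris()
X = MinMaxScaler().fit_transform(data.data)
pca_sketch = PCA(n_components=2).fit(X)

# One row per component, one column per original variable
loadings = pd.DataFrame(
    pca_sketch.components_,
    columns=data.feature_names,
    index=['c1', 'c2'],
)
print(loadings.round(2))
```

Each row shows how much every original variable contributes to that component, and the sign shows the direction of the relationship.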

We can also examine the explained variance of each principal component.

In [None]:
iris_pca.explained_variance_

A nice worked example showing the link between the explained variance and the component is [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html).

Our first principal component explains a lot more of the variance of the data than the second.
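
If you would rather see proportions than raw variances, sklearn also provides `explained_variance_ratio_`. The sketch below keeps all four components so the proportions sum to one; `pca_full` is a name chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)
pca_full = PCA().fit(X)  # keep all four components

ratios = pca_full.explained_variance_ratio_
print(ratios.round(3))           # share of the variance per component
print(ratios.cumsum().round(3))  # running total
```

The running total is a handy way to judge how many components are worth keeping.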

### Dimension reduction

For our purposes, we are interested in using PCA to reduce the number of dimensions in our data whilst preserving as much of the data variance as possible.

We can extract the projected components from the model.

In [None]:
iris_pca_vals = pca.fit_transform(iris_df.iloc[:,0:4])

The numpy array contains the projected values.

In [None]:
type(iris_pca_vals)

In [None]:
iris_pca_vals

Each row corresponds to a row in our data.

In [None]:
iris_pca_vals.shape

In [None]:
iris_df.shape

We can add the components to our dataset. I prefer to keep everything in one table, but it is not at all required. You can assign the values to whichever variables you prefer.

In [None]:
iris_df['c1'] = iris_pca_vals[:, 0]
iris_df['c2'] = iris_pca_vals[:, 1]

In [None]:
iris_df

Plotting out our data on our new two component space.

In [None]:
sns.scatterplot(data = iris_df, x = 'c1', y = 'c2')

We have reduced our four dimensions to two.

We can also colour by our clusters. What does this show us and is it useful?

In [None]:
sns.scatterplot(data = iris_df, x = 'c1', y = 'c2', hue = 'Three clusters')

In [None]:
iris_df

## PCA to Clusters

We have reduced our 4D dataset to 2D whilst keeping most of the data variance. Reducing the data to fewer dimensions can help with the 'curse of dimensionality', reduce the chance of overfitting a machine learning model (see [here](https://en.wikipedia.org/wiki/Dimensionality_reduction)) and reduce the computational complexity of a model fit.

Putting our new dimensions into a kMeans model

In [None]:
k_means_pca = KMeans(n_clusters = 3, init = 'random')
iris_pca_kmeans = k_means_pca.fit(iris_df.iloc[:,-2:])

In [None]:
iris_df.iloc[:,-2:]

In [None]:
iris_df['PCA 3 clusters'] = pd.Series(k_means_pca.predict(iris_df.iloc[:,-2:].values), index = iris_df.index)
iris_df

As we only have two dimensions we can easily plot this on a single scatterplot.

In [None]:
# a different seaborn theme
# see https://python-graph-gallery.com/104-seaborn-themes/
sns.set_style("darkgrid")
sns.scatterplot(data = iris_df, x = 'c1', y = 'c2', hue = 'PCA 3 clusters')

I suspect having two clusters would work better. We should try a few different models.

Copying the code from [here](https://medium.com/@dmitriy.kavyazin/principal-component-analysis-and-k-means-clustering-to-visualize-a-high-dimensional-dataset-577b2a7a5fe2) we can fit multiple numbers of clusters.

In [None]:
ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
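    # Fit model to the two PCA components, selected by name because
    # the last two columns are now 'c2' and 'PCA 3 clusters'
    model.fit(iris_df[['c1', 'c2']])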
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

Three seems ok. We clearly want no more than three.

These types of plots show a point about model complexity. More free parameters in the model (here the number of clusters) will improve how well the model captures the data, often with diminishing returns. However, a model which fits the data too closely will not be able to fit new data well, which is referred to as overfitting. Randomish internet blogs introduce the topic pretty well, see [here](https://elitedatascience.com/overfitting-in-machine-learning), and also wikipedia, see [here](https://en.wikipedia.org/wiki/Overfitting).



## Missing values

Finally, how we deal with missing values can impact the results of PCA and kMeans clustering.

Let us load in the iris dataset again and randomly remove about 20% of the data (see code from [here](https://stackoverflow.com/questions/42091018/randomly-insert-nas-values-in-a-pandas-dataframe-with-no-rows-completely-miss)).

In [None]:
import numpy as np

x = load_iris()

In [None]:
iris_df = pd.DataFrame(x.data, columns = x.feature_names)

mask = np.random.choice([True, False], size = iris_df.shape, p = [0.2, 0.8])
mask[mask.all(1),-1] = 0

df = iris_df.mask(mask)

df.isna().sum()

In [None]:
df

About 20% of the values, chosen at random, are now NaN.

### Zeroing

We can replace them with 0 and fit our models.

In [None]:
df_1 = df.copy()
df_1 = df_1.fillna(0)

In [None]:
df_1

In [None]:
k_means_zero = KMeans(n_clusters = 4, init = 'random')
k_means_zero.fit(df_1)
df_1['Four clusters'] = pd.Series(k_means_zero.predict(df_1.iloc[:,0:4].values), index = df_1.index)
sns.pairplot(df_1, hue = 'Four clusters')

What impact has zeroing the values had on our results?

Now, onto PCA.

In [None]:
# PCA analysis
n_components = 2

pca = PCA(n_components=n_components)
df_1_pca = pca.fit(df_1.iloc[:,0:4])

# Extract projected values
df_1_pca_vals = df_1_pca.transform(df_1.iloc[:,0:4])
df_1['c1'] = [item[0] for item in df_1_pca_vals]
df_1['c2'] = [item[1] for item in df_1_pca_vals]

sns.scatterplot(data = df_1, x = 'c1', y = 'c2')

In [None]:
df_1_pca.explained_variance_

In [None]:
df_1_pca.components_

### Replacing with the average

In [None]:
df_2 = df.copy()
for i in range(4):
    df_2.iloc[:,i] = df_2.iloc[:,i].fillna(df_2.iloc[:,i].mean())

In [None]:
df_2

In [None]:
# named k_means_mean because the missing values were replaced with column means
k_means_mean = KMeans(n_clusters = 4, init = 'random')
k_means_mean.fit(df_2)
df_2['Four clusters'] = pd.Series(k_means_mean.predict(df_2.iloc[:,0:4].values), index = df_2.index)
sns.pairplot(df_2, hue = 'Four clusters')

In [None]:
# PCA analysis
n_components = 2

pca = PCA(n_components=n_components)
df_2_pca = pca.fit(df_2.iloc[:,0:4])

# Extract projected values
df_2_pca_vals = df_2_pca.transform(df_2.iloc[:,0:4])
df_2['c1'] = [item[0] for item in df_2_pca_vals]
df_2['c2'] = [item[1] for item in df_2_pca_vals]

sns.scatterplot(data = df_2, x = 'c1', y = 'c2')

In [None]:
df_2_pca.explained_variance_

In [None]:
df_2_pca.components_
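
To compare the two replacement strategies side by side, a quick sketch is to look at how each one changes the column means and standard deviations. It uses a fixed seed and a simpler mask than the one above (both are assumptions made for a repeatable example), so the exact numbers will not match your run.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
full = pd.DataFrame(data.data, columns=data.feature_names)

# Hide roughly 20% of the values, with a fixed seed so the result repeats
rng = np.random.default_rng(0)
holes = full.mask(rng.random(full.shape) < 0.2)

zero_filled = holes.fillna(0)
mean_filled = holes.fillna(holes.mean())

summary = pd.DataFrame({
    'original mean': full.mean(),
    'zero-filled mean': zero_filled.mean(),
    'mean-filled mean': mean_filled.mean(),
    'zero-filled std': zero_filled.std(),
    'mean-filled std': mean_filled.std(),
})
print(summary.round(2))
```

Zeroing drags every column mean down, because none of the real measurements are zero, whereas mean replacement keeps the mean of the observed values but shrinks the spread.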

## Useful resources

The scikit-learn User Guide is very good. Both approaches here are often referred to as unsupervised learning methods and you can find the scikit-learn section on these [here](https://scikit-learn.org/stable/unsupervised_learning.html).

If you have issues with the documentation then also look at the scikit-learn [examples](https://scikit-learn.org/stable/auto_examples/index.html).

Also, in no particular order:

* The [In-Depth sections of the Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html). More for machine learning but interesting all the same.
* [Python for Data Analysis](https://www.amazon.co.uk/Python-Data-Analysis-Wes-Mckinney/dp/1491957662/ref=sr_1_3?dchild=1&keywords=Python+for+Data+Analysis%3A+Data+Wrangling&qid=1603809746&sr=8-3) (ebook is available via [Warwick library](https://encore.lib.warwick.ac.uk/iii/encore/search/C__Spython%20for%20data%20analysis__Orightresult__U;jsessionid=5A7D1DE9BAC479EE36B491F8FAC8F1FD?lang=eng))

In case you are bored:

* [Stack abuse](https://stackabuse.com/tag/python/) - Some fun blog entries to look at
* [Towards data science](https://towardsdatascience.com/) - a blog that contains a mix of intro, intermediate and advanced topics. Nice to skim through to try and understand something new.

Please do try out some of the techniques detailed in the lecture material. The simple examples found in the scikit-learn documentation are rather good. Generally, I find it much easier to try to understand a method using a simple dataset.