## Week 6 exercise

All of the functions and code that you need for this exercise will be covered in the notebook from Week 9 called *lecture-1-exploratory-data-analysis.ipynb*.

Parts 1, 4 and 5 of this notebook will be important for your coursework.

You are given a dataset of materials. This data contains both intermetallic materials and ionic materials. The data also has a series of features that describe each material. Your task this week is to explore the relationships between the descriptors, look for any suspicious data, decide whether any descriptors are very highly correlated, reduce the dimensionality, and look at some initial clustering of the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## 1 Load and look at the data

* Use pandas to read `training-data-week-1.pickle`.
```
df = pd.read_pickle('training-data-week-1.pickle')
```
* Take a look at the top few entries of this data frame. See part 2 of *lecture-1-exploratory-data-analysis.ipynb*.
```
df.head()
```
* Explore some non-graphical summary statistics of the different columns using pandas. See part 2 of *lecture-1-exploratory-data-analysis.ipynb*.
```
df.describe()
```
* Get information about the types of data in each column using pandas. See part 2 of *lecture-1-exploratory-data-analysis.ipynb*.
```
df.info()
```

## 2 Graphical examination

Use seaborn to draw scatter plots of the different variables against each other.

```
import seaborn as sns
sns.pairplot(df)
```

## 3 Inspect the individual distributions

Calculate the skew and the kurtosis for each of the columns. Which column has the highest skew, and which has the highest kurtosis?
You can get a list of columns using:

```
column_names = list(df.columns)
print(column_names)
```


Use the code for `skew` and `kurt` from lecture 1 [in the ebook](https://keeeto.github.io/ebook-data-analysis/lecture-1-exploratory-data-analysis.html).
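
As a cross-check on the lecture code, the same statistics can be sketched with pandas' built-in `DataFrame.skew` and `DataFrame.kurt` methods. This is a minimal sketch on toy data; the column names `symmetric` and `skewed` are made up for illustration, and the pandas methods may use slightly different conventions from the lecture's functions (pandas reports excess kurtosis, so a normal distribution scores close to 0).

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataframe (hypothetical column names)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=500),
    "skewed": rng.exponential(size=500),
})

# Sample skewness and excess kurtosis for every numeric column
summary = pd.DataFrame({
    "skew": toy.skew(numeric_only=True),
    "kurtosis": toy.kurt(numeric_only=True),
})
print(summary)

# Name of the column with the largest value of each statistic
print(summary.idxmax())
```

Replacing `toy` with `df` should give one row per numeric column of the exercise data.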


## 4 Using boxplots 

Inspect each column and look for outliers.
* Make box plots for each of the columns.
    * If you find a very serious outlier (say more than 1000 away from the mean value), drop it from the data - you will see the box plot disappear.
    * If you find a bad outlier, remove that data.
* Save the new clean dataframe to `week1-cleaned-data.pickle`.
    

### Plot the box plots

Get the number of features: `print(len(list(df.columns)))`

In this case we have 9 features - so we can do a 3x3 plot. Use this code:
```
column_names = list(df.columns)
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
for i in range(3):
    for j in range(3):
        data = df[df.columns[3*i + j]].values
        ax[i, j].boxplot(data)
        ax[i, j].set_title(df.columns[3*i + j])
plt.tight_layout()
```

### Identify outliers

Are there any box plots where there are points more than 1000 away from the mean?
If there are any, drop these rows from the data.

To drop the data, you can locate it using something like:
```
df.drop(df[df['column name'] >= 1000].index, inplace = True)
```
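
The line above filters on the raw value. If you want to apply the "more than 1000 away from the mean" rule literally, a sketch along the following lines may help. The toy column `feature` and its values are invented for illustration. Note that a very large outlier also shifts the mean itself, so it is worth checking which rows are flagged before dropping them.

```python
import pandas as pd

# Toy column with one extreme value (hypothetical data)
toy = pd.DataFrame(
    {"feature": [1.0, 2.0, 3.0, 2.5, 1.5, 2.0, 3.0, 2.2, 5000.0]}
)

# Absolute distance of every row from the column mean
distance = (toy["feature"] - toy["feature"].mean()).abs()

# Boolean mask of rows more than 1000 away from the mean
far_away = distance > 1000
print(toy[far_away])

# Drop the flagged rows and keep the rest
cleaned = toy.drop(toy[far_away].index)
print(cleaned)
```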

When you have dropped the outliers, save the dataset:
```
df.to_pickle('week1-cleaned-data.pickle')
```

Do the 3x3 boxplots again and make sure outliers are gone.

## 5 Correlations

Obtain the Pearson correlations between the different columns and inspect them.
Which columns seem to be most closely related? Are there any possibly redundant columns that you might remove?

The code for plotting Pearson correlations in a heatmap is in lecture 1 [in the ebook](https://keeeto.github.io/ebook-data-analysis/lecture-1-exploratory-data-analysis.html), in the section *Explore correlations in the data*.
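
Alongside the heatmap, the most strongly correlated pairs can also be listed directly. The sketch below is one possible way to do this with `DataFrame.corr`; the toy columns and the 0.95 cut-off are assumptions chosen for illustration, not values from the lecture.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy data: "a_scaled" is almost a copy of "a", "b" is unrelated (hypothetical)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

# Pearson correlation matrix between all pairs of columns
corr = toy.corr(method="pearson")

threshold = 0.95  # arbitrary cut-off, adjust to taste

# Visit each pair of columns once and keep the strongly correlated ones
high = [
    (c1, c2, corr.loc[c1, c2])
    for c1, c2 in combinations(corr.columns, 2)
    if abs(corr.loc[c1, c2]) > threshold
]
print(high)
```

For each pair that is reported, one of the two columns is a candidate for removal.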

Drop correlated data.
```
df.drop([<list of columns to drop>], inplace=True, axis=1)
```

Plot the heatmap again.

## 6 Look at data reduction and clustering (not needed for the coursework, but useful)

We will see if we can cluster the data to separate the oxides with band gaps from the intermetallics with no band gaps.
First try it on the raw data: we will look, for example, at the columns `Eneg avg_dev` and `Radius avg_dev`, colouring the points by the true labels, and see how this shows up.

**Hint** The code you need for this can be found in the notebook from Week 9 *lecture-2-pca.ipynb* and lecture 2 [in the ebook](https://keeeto.github.io/ebook-data-analysis/lecture-2-clustering-kmeans-GMM.html).

* Plot `Eneg avg_dev` and `Radius avg_dev`, colour the points by the true labels (called 'labels' in this data). Use the following syntax:
```
plt.scatter(df['<variable 1>'].values, df['<variable 2>'].values, c=df['<variable for colour>'].values)
```
* Do a PCA with the same number of principal components as columns
* Use this to work out how many components can account for about 99% of the variance
* Do a scatter plot in 2D of the first two components - colour the plot by the band_gap values - do you see a trend?
* Try to cluster this data using k-means
    * Use the code from *lecture-2-clustering-kmeans-GMM.ipynb*
        * Section - `Clustering using k-means`


# NB

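Before doing the PCA etc., convert the dataframe to an array and make sure you drop the label column. The code below assumes that the label is the tenth column (index 9); if you dropped columns in section 5, use the position of the label column in your dataframe instead: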

    X = df.values
    labels = X[:, 9]
    X = X[:, :9]
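
If you would rather not count columns, the split can also be sketched by column name. This is a minimal sketch on toy data; it assumes only that the label column is called `labels`, as stated above, and the feature names are invented.

```python
import pandas as pd

# Toy dataframe with two features and a label column (hypothetical values)
toy = pd.DataFrame({
    "feature_1": [0.1, 0.4, 0.3],
    "feature_2": [1.2, 0.7, 0.9],
    "labels": [0.0, 2.1, 0.0],
})

# Pull out the labels by name, then drop that column to get the features
labels = toy["labels"].values
X = toy.drop(columns="labels").values

print(X.shape, labels.shape)
```

This keeps working however many columns were dropped earlier.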

In [None]:
# Plot the two columns mentioned above
# Note that c sets the colour; here we colour by the 'labels' column
plt.scatter(df['Eneg avg_dev'].values, df['Radius avg_dev'].values, c=df['labels'].values)

In [None]:
X = df.values
labels = X[:, 7]
X = X[:, :7]

In [None]:
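from sklearn.decomposition import PCA

# Fit a PCA with as many components as there are columns in X
pca = PCA(n_components=X.shape[1]).fit(X)
# Fraction of the total variance explained by each component
plt.plot(pca.explained_variance_ratio_)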
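
To read off how many components are needed for about 99% of the variance, it can help to look at the cumulative sum of the ratios rather than the individual values. The following is a sketch on toy data; the array `X_toy` and its column scales are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 5 columns with very different spreads (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5)) * np.array([10.0, 5.0, 0.5, 0.1, 0.01])

# Fit a PCA with as many components as columns
pca_toy = PCA(n_components=X_toy.shape[1]).fit(X_toy)

# Running total of the explained variance ratio
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest number of components whose running total reaches 0.99
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)
print(cumulative)
print(n_needed)
```

Plotting `cumulative` instead of the raw ratios gives a curve that flattens out once the extra components stop adding much.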

In [None]:
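# Project the data onto the first two principal components
pca_out = PCA(n_components=2).fit_transform(X)
# Binarise the labels: any value greater than 0 becomes 1
labels[labels > 0] = 1
plt.scatter(pca_out[:, 0], pca_out[:, 1], c=labels)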

In [None]:
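from sklearn.cluster import KMeans

# Cluster the data into two groups with k-means
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# Show the cluster assignments in the space of the first two principal components
plt.scatter(pca_out[:, 0], pca_out[:, 1], c=y_kmeans)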