# 1. Classification with Naive Bayes

We start by classifying the data with Naive Bayes, and we will use the result later on. This code is not needed for the course work, so it is provided here for you. If you finish early, feel free to come back and look at this section in more detail.

**Note** you still need to run the code cells in this section, because the next part depends on their output.



In the next part of the exercise, `Linear Regression`, the code is not provided; you will need to know that section for your course work.

* Load up the data that you cleaned up last week `week1-cleaned-data.pickle`
* From this data build a matrix `X` containing the independent variables (all columns except labels) and a vector `y` containing the labels
* On this data perform a PCA so that the data is reduced to 2D
* Train a Naive Bayes classifier on the resulting 2D data and the labels y
* Calculate the binary cross entropy between the predicted values on the training set and the true values on the training set. **Hint** binary cross entropy is called `log_loss` in scikit-learn

## 1.1 Load up the test set

* Load up the test data from `test-data-week-2.pickle`
* As above, build a matrix `x_test` that has all the columns from this data except labels, and a vector `y_test` that contains the labels
* Transform this data using *the same* PCA as above. *Note* do **not** refit the PCA on this data
* Apply the Naive Bayes classifier that we just trained to the test data and see how well it classifies

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
df_train = pd.read_pickle('../Week 10/week1-cleaned-data.pickle')
X = df_train.values[:, :9]
y = df_train.values[:, 9]

### Transform with a PCA

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_out = pca.fit_transform(X)
plt.scatter(pca_out[:, 0], pca_out[:, 1], c=y)

### Naive Bayes classification

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(pca_out, y)

ynew = model.predict(pca_out)

In [None]:
plt.scatter(pca_out[:, 0], pca_out[:, 1], c=ynew)

In [None]:
from sklearn.metrics import log_loss

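# log_loss expects (y_true, y_pred), with y_pred given as class probabilities
log_loss(y, model.predict_proba(pca_out))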

In [None]:
df_test = pd.read_pickle('test-data-week-2.pickle')
x_test = df_test.values[:, :9]
y_test = df_test.values[:, 9]
x_test_pca = pca.transform(x_test)

In [None]:
plt.scatter(x_test_pca[:, 0], x_test_pca[:, 1], c=y_test)

In [None]:
ynew_test = model.predict(x_test_pca)

In [None]:
plt.scatter(x_test_pca[:, 0], x_test_pca[:, 1], c=ynew_test)

# 2. Linear regression

This part of the exercise will be important for the course work.


### Set up the data

* Using your classifier's predictions, separate out all of the data classified as ionic materials (label 0) from the training set

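```
# np.where returns a tuple of arrays; take the first entry to get the row positions
ionic_indices = np.where(ynew == 0)[0]
# copy so that later in-place edits do not act on a view of df_train
df = df_train.iloc[ionic_indices].copy()
```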

### Drop unnecessary columns

* Use pandas to drop the `labels` column from this dataset - [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)

```
df.drop('labels', axis=1, inplace=True)
```

### Look at correlations

We examine the data to see if any features are highly correlated.

Look at a map of the Pearson correlations (see code from last week's exercise). On the basis of this, **remove two of the columns** that are highly correlated with other columns.

**NB** do not remove `band_gap`; that is the column that you want to fit the model to.

**NB** generally we will not remove columns unless correlation is > 0.95, but in this case you can if necessary.

The code for the correlation matrix is in lecture 1 [of the ebook](https://keeeto.github.io/ebook-data-analysis/lecture-1-exploratory-data-analysis.html). Don't forget to `import seaborn as sns`.

Recall that the code to remove columns is:
`df_reducded = df.drop([<list of columns to drop>], axis=1)`
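As a rough illustration of how a correlation matrix can guide which columns to drop, here is a minimal sketch on synthetic data. The DataFrame `demo` and the column names `feat_a`, `feat_b`, `feat_c` are made up for this example and are not the columns in your dataset. It uses only pandas and NumPy; calling `sns.heatmap(corr, annot=True)` on the same matrix would draw it as a map.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ionic-materials data (hypothetical column names).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "feat_a": a,
    "feat_b": a + 0.01 * rng.normal(size=200),  # nearly a copy of feat_a
    "feat_c": rng.normal(size=200),
    "band_gap": 2.0 * a + rng.normal(size=200),  # the target column
})

# Absolute Pearson correlation between every pair of columns.
corr = demo.corr(method="pearson").abs()

# Keep only the upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Candidates to drop: strongly correlated with an earlier column, never the target.
threshold = 0.95
to_drop = [c for c in upper.columns
           if c != "band_gap" and (upper[c] > threshold).any()]

demo_reduced = demo.drop(to_drop, axis=1)
print(to_drop)
```

In the exercise itself you should still look at the map and choose the two columns by eye; the threshold here only mirrors the 0.95 rule of thumb mentioned above.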

### Set up x and y and save the data

* `x`, the features for your model, is the first 6 columns (0-5)
* `y`, the target of your model, is the final column (6)

The code to set this up and save the data is:
```
x = df_reducded.values[:, :6]
y = df_reducded.values[:, 6]
df_reducded.to_pickle('week2-regression-train.pickle')
```

### Scale the data 

We then standardise the data using `StandardScaler`. **Note** `StandardScaler` expects a 2D array, so data with only one column, like the target `y`, must be reshaped with `reshape(-1, 1)` before standardising. The code to do this is:
```

from sklearn.preprocessing import StandardScaler

scaler_x = StandardScaler()
x = scaler_x.fit_transform(x)
scaler_y = StandardScaler()
y = scaler_y.fit_transform(y.reshape(-1, 1))
```


### Split into train and test sets

Use the `train_test_split` function from `scikit-learn` to make an 80:20 training:test split.
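A minimal sketch of an 80:20 split, shown on random stand-in arrays rather than your data. The arrays `x_demo` and `y_demo` and the choice `random_state=42` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 samples, 6 features, one target column.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 6))
y_demo = rng.normal(size=(100, 1))

# test_size=0.2 keeps 20% of the rows for testing;
# random_state makes the shuffle reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```

Note that this standalone sketch reuses the names `x_test` and `y_test`; in the notebook those names already hold the classification test set from section 1.1, so you may prefer different names there.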

### Set up a linear regression and fit the model

Use `LinearRegression` from `scikit-learn` to fit the data.

Look at the parameters (the coefficients and the intercept) of the final model.

**Question** Which of the features seems to have the greatest influence on the band gap?
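One way to approach this question is sketched below on synthetic data, where the "true" coefficients are invented so that the answer is known in advance. The name `regr` matches the one used later in this exercise; `true_coef` and the data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data in which the second feature dominates by construction.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 3))
true_coef = np.array([0.5, -3.0, 0.1])  # hypothetical ground truth
y_train = x_train @ true_coef + 0.05 * rng.normal(size=200)

regr = LinearRegression()
regr.fit(x_train, y_train)

# coef_ holds one weight per feature; intercept_ is the constant term.
print(regr.coef_, regr.intercept_)

# The feature with the largest absolute weight has the strongest influence.
most_influential = int(np.argmax(np.abs(regr.coef_)))
print(most_influential)
```

Comparing coefficient magnitudes like this is only meaningful when the features are on the same scale, which is why the data were standardised first. If `y` is passed as a 2D column (as it is after `reshape(-1, 1)`), `coef_` has shape `(1, n_features)` rather than `(n_features,)`.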

### Analyse the performance of the model

First we look at how well it predicts the training set.
Use `predictions = regr.predict(x_train)` to make predictions.

Use a scatter plot from matplotlib to plot `predictions` versus `y_train`. Don't forget to label your axes.

Next do the same plot for the test set
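A predicted-versus-true plot could look something like the following sketch. It fits a model to synthetic data so that it runs on its own; the data and the axis label text are assumptions, and the dashed diagonal is an optional extra that marks where perfect predictions would fall.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the standardised training data.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 2))
y_train = x_train @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=100)

regr = LinearRegression().fit(x_train, y_train)
predictions = regr.predict(x_train)

fig, ax = plt.subplots()
ax.scatter(y_train, predictions, alpha=0.6)

# Points on this diagonal are predicted exactly right.
lims = [y_train.min(), y_train.max()]
ax.plot(lims, lims, color="k", linestyle="--")

ax.set_xlabel("True value (standardised)")
ax.set_ylabel("Predicted value (standardised)")
```

For the test set, the same plotting code applies with `x_test` and `y_test` in place of the training arrays; the model is not refitted.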

### Summary statistics

Use metrics from `scikit-learn` to look at the performance of the model.
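The exercise does not say which metrics to use; as one possible choice, the sketch below computes the mean absolute error, the (root) mean squared error and the R² score on a small hand-made example. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small hand-made example: true values and slightly wrong predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as y
r2 = r2_score(y_true, y_pred)              # 1.0 would be a perfect fit

print(mae, mse, rmse, r2)
```

Computing the same metrics on both the training and the test predictions makes it easier to see whether the model performs noticeably worse on unseen data.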

