# Dimension Reduction Methods

## 1. Overview

Today we learned about dimensionality reduction, and principal component analysis (PCA) in particular. In this notebook, we will use PCA to revisit the breast cancer data from our last session. Reusing a familiar dataset saves time on exploratory data analysis, leaving more time to think about how dimension reduction can help in our analytic workflow.

If necessary, refresh your memory on the dataset here: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data.



## 2. Import and data prep
### 2.1. Importing Dependencies

As usual, we start by importing the libraries we will use later on. Throughout the notebook, if any function is unclear, look up its documentation to familiarize yourself with its inputs and outputs.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.graph_objs as go

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn import decomposition
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.metrics import accuracy_score,confusion_matrix

### 2.2. Load data

**TASK:** Load the file ```breast-cancer-wisconsin.csv``` located in the same directory as this notebook using ```read_csv()``` from pandas.

In [None]:
data = ...

### 2.3. Data summary

As usual, we start by having a peek at the data.

**TASK:** Use ```head()```, ```describe()``` and ```info()``` on the dataframe to get a first idea.

### 2.4. Data preparation

#### 2.4.1. Data clean-up

**TASK:** We don't need the first and last column ('id', 'Unnamed: 32') - drop them. 

In [None]:
# Cleaning and modifying the data
data = data.drop(...)
data = data.drop(...)

ndat, nvar = data.shape
ndat, nvar

#### 2.4.2. Label encoding

Next, we map the diagnosis labels to binary values: malignant (```M```) becomes 1 and benign (```B```) becomes 0.

In [None]:
data['diagnosis'] = data['diagnosis'].map({'M':1,'B':0})
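
A small sanity check can be useful after this kind of mapping. The sketch below uses a toy series (not the real data, and the label ```'X'``` is a made-up stand-in for an unexpected value) to show that ```map()``` with a dictionary turns any label missing from the dictionary into ```NaN```.

```python
import pandas as pd

# Toy labels, not the real dataset; 'X' is a hypothetical unexpected value
toy_labels = pd.Series(['M', 'B', 'B', 'X'])
toy_encoded = toy_labels.map({'M': 1, 'B': 0})

# Labels that are not keys of the dictionary are mapped to NaN
n_unmapped = int(toy_encoded.isna().sum())
print(toy_encoded.tolist(), n_unmapped)
```

On the real dataframe, checking ```data['diagnosis'].isna().sum()``` after the mapping would reveal any such unexpected labels.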

**TASK:** Assign the data to ```X```, excluding ```diagnosis```, and assign the latter to ```y```.

In [None]:
X = data.loc[:, ...]

y = data.loc[:, 'diagnosis']

#### 2.4.3. Data scaling

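We learned about the importance of scaling data whenever working with distance-based methods. The same applies to PCA: it looks for directions of maximal variance, so variables measured on larger scales would otherwise dominate the components.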

**TASK:** Use the function ```StandardScaler()``` to scale the data. Use ```describe()``` to see how the data has changed. 

In [None]:
sc = ...
scaled_X = ...
pd.DataFrame(scaled_X, columns=X.columns).describe()

**TASK:** Compare the scaled data to the original unscaled data - what did the scaling do?
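
To build intuition before looking at the real data, here is a minimal sketch of standardization on a toy array with made-up values. The names ```toy_X``` and ```toy_scaled``` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two columns on very different scales
toy_X = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 6000.0]])

toy_scaled = StandardScaler().fit_transform(toy_X)

# After scaling, each column has mean 0 and population standard deviation 1
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means, col_stds)
```

Note that ```describe()``` reports the sample standard deviation (dividing by n-1), so its ```std``` row will be slightly above 1 on scaled data.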

## 3. Exploratory data analysis

**TASK:** Check out last session's notebook on clustering to refresh your memory on the dataset. Perhaps you can think of aspects to explore that you didn't check last time. 

### 3.1. For the mathematically curious

Using the ```linalg``` library from numpy, compute the eigenvalues of the covariance matrix of the scaled data. Create two bar plots, one of the raw eigenvalues and one of the normalized eigenvalues (i.e. divided by the sum of all eigenvalues). We can check later whether they coincide with the variances explained by the PCs.

In [None]:
from numpy import linalg

# compute the covariance matrix of the transpose of scaled_X
cov_matrix = ...

# compute eigenvalues of the covariance matrix using linalg.eig()
eigenvalues, eigenvectors = ...

df1 = pd.DataFrame({'Eigenvalues Covariance matrix': eigenvalues,
                    'Eigenvalue number': [str(x) for x in range(1,nvar)]})

plt.figure(figsize = (25,6))
sns.barplot(x = 'Eigenvalue number',y = 'Eigenvalues Covariance matrix',data = df1)
plt.show()

In [None]:
from numpy import linalg

# plot as above but divide the eigenvalues by their sum
df1 = pd.DataFrame({'Normalized eigenvalues': ...,
                    'Eigenvalue number': [str(x) for x in range(1,nvar)]})

plt.figure(figsize = (25,6))
sns.barplot(x = 'Eigenvalue number',y = 'Normalized eigenvalues',data = df1)
plt.show()
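
One detail worth knowing: ```linalg.eig()``` does not guarantee any particular ordering of the eigenvalues, so sorting them before plotting makes the bar plots easier to compare with a scree plot. The sketch below illustrates this on synthetic data (a stand-in for ```scaled_X```, with hypothetical names throughout).

```python
import numpy as np
from numpy import linalg

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 4))      # synthetic stand-in for scaled_X
toy[:, 1] += 2.0 * toy[:, 0]         # make two columns correlated

# np.cov expects variables in rows, hence the transpose
toy_cov = np.cov(toy.T)

raw_vals, raw_vecs = linalg.eig(toy_cov)

# Sort in descending order; np.real guards against a complex dtype
toy_eigvals = np.sort(np.real(raw_vals))[::-1]

# Normalized eigenvalues sum to 1
toy_eigvals_norm = toy_eigvals / toy_eigvals.sum()
print(toy_eigvals, toy_eigvals_norm)
```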

## 4. Principal Component Analysis (PCA)
Now we will use PCA to find a linear projection of the variable space. First, we will look at the full PCA space to then decide how to reduce it. 

### 4.1. Full PCA

**TASK:** Create a PCA object called ```pca``` using ```decomposition.PCA()```, then transform the data into the PCA space by using ```fit_transform``` on ```scaled_X```. Check out the transformed data.

In [None]:
pca = ...
tdata = ...
tdata

**Reflect:** Have the dimensions been reduced? What is the new transformed data?
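
As a hint for the reflection, the following sketch runs a default PCA on synthetic data (not the breast cancer data) and compares shapes before and after. With default arguments, ```decomposition.PCA()``` keeps min(n_samples, n_features) components.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 6))   # synthetic data: 100 rows, 6 variables

toy_pca = decomposition.PCA()     # no n_components given
toy_t = toy_pca.fit_transform(toy)

# Compare the shape of the input and of the transformed data
print(toy.shape, toy_t.shape, toy_pca.n_components_)
```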

### 4.2. Analysis of the PCA space

**TASK:** Check out the ```explained_variance_``` and ```explained_variance_ratio_``` attributes of the pca object. Create bar plots of them and compare them to the bar plots of the eigenvalues from **§3.1**.

**TASK:** Make a scree plot based on the ```explained_variance_ratio_```. Is there a clear scree point?

In [None]:
df1 = pd.DataFrame({'% variation explained': ...,
                    'PCs':['PC' + str(x) for x in range(1,nvar)]})

plt.figure(figsize = (25,6))
sns.barplot(x = 'PCs',y = '% variation explained',data = df1)
plt.show()

**TASK:** Now plot the cumulative explained variance ratios using ```np.cumsum()```. Do you find this or the scree plot more informative?

In [None]:
df1 = pd.DataFrame({'% variation explained': ...,
                    'Number of PCs': [str(x) for x in range(1,nvar)]})

plt.figure(figsize = (25,6))
sns.barplot(x = 'Number of PCs',y = '% variation explained',data = df1)
plt.show()

**Reflect:** Look at the scree and the cumulative scree plots. How many components are required to explain at least 80% of the variance in the data?
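
Reading the answer off a plot works, but it can also be computed. Below is a minimal sketch on synthetic data with made-up column scales. The variable names and the 0.80 threshold variable are illustrative.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(1)
# Synthetic data whose columns have decreasing spread (hypothetical scales)
toy = rng.normal(size=(300, 8)) * np.array([5, 4, 3, 2, 1, 1, 0.5, 0.5])

toy_pca = decomposition.PCA().fit(toy)
cum_ratio = np.cumsum(toy_pca.explained_variance_ratio_)

threshold = 0.80
# argmax returns the first index where the condition is True; +1 turns it into a count
n_needed = int(np.argmax(cum_ratio >= threshold)) + 1
print(cum_ratio, n_needed)
```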

**Optional TASK:** If you checked out the eigenvalues of the covariance matrix in the optional task in **§3.1**, now check out the ```explained_variance_``` attribute of our pca object. Make a plot and compare both this and the scree plot to those in **§3.1**. How do they relate to each other?

In [None]:
df1 = pd.DataFrame({'Variation explained':...,
                    'PCs':['PC' + str(x) for x in range(1,nvar)]})

plt.figure(figsize = (25,6))
sns.barplot(x = 'PCs',y = 'Variation explained',data = df1)
plt.show()
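
The relationship can also be checked numerically. The sketch below, on synthetic data, compares the sorted eigenvalues of the covariance matrix with ```explained_variance_```. It uses ```linalg.eigvalsh()``` instead of ```linalg.eig()```, which is a swap on my part: it is intended for symmetric matrices and returns real eigenvalues.

```python
import numpy as np
from numpy import linalg
from sklearn import decomposition

rng = np.random.default_rng(2)
toy = rng.normal(size=(150, 5))   # synthetic stand-in for scaled_X
toy[:, 2] += toy[:, 0]            # introduce some correlation

# Eigenvalues of the sample covariance matrix, sorted in descending order
toy_eigvals = np.sort(linalg.eigvalsh(np.cov(toy.T)))[::-1]

toy_pca = decomposition.PCA().fit(toy)

# Both use the n-1 denominator, so they should agree up to rounding
same_variances = np.allclose(toy_eigvals, toy_pca.explained_variance_)
same_ratios = np.allclose(toy_eigvals / toy_eigvals.sum(),
                          toy_pca.explained_variance_ratio_)
print(same_variances, same_ratios)
```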

### 4.3. Reduced PCA


Above, we learned that 5 components are required to explain at least 80% of the variance in our data. Let's now do dimension reduction by only looking at the first 5 components. 

**TASK:** Again using ```decomposition.PCA()``` now compute a PCA projection of the data setting ```n_components=5```. Then transform ```scaled_X``` into the reduced PCA space and store in ```pca_5var```.

In [None]:
# PCA with only 5 components, which together explain at least 80% of the variance
pca_red = ...
pca_5var = ...

**TASK:** Check out ```explained_variance_ratio_``` and its sum to make sure we really explained at least 80% of our data variance.

**CHECK:** We should see that the first 5 components capture over 80% of the variance.

**TASK:** Create a dataframe ```new_X``` by calling ```pd.DataFrame()``` on ```pca_5var``` with ```columns=['PC1','PC2','PC3','PC4','PC5']```. Check it out using ```head()```.

In [None]:
new_X = ...


**TASK:** In the lecture, we learned that PCA makes a rotated projection of the data in which correlations have been removed. Check if that's true.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(...)


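We can see that there is no correlation among the principal components. This matters because each component then carries information the others do not, which avoids multicollinearity in the regression that follows.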
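
The same property can be verified without a plot. This sketch uses synthetic, deliberately correlated data and hypothetical names. It checks that the correlation matrix of the PCA scores is the identity up to rounding error.

```python
import numpy as np
import pandas as pd
from sklearn import decomposition

rng = np.random.default_rng(3)
toy = rng.normal(size=(200, 4))
toy[:, 1] += 1.5 * toy[:, 0]      # correlate columns 0 and 1
toy[:, 3] -= toy[:, 2]            # correlate columns 2 and 3

toy_scores = decomposition.PCA(n_components=3).fit_transform(toy)
toy_pcs = pd.DataFrame(toy_scores, columns=['PC1', 'PC2', 'PC3'])

pc_corr = toy_pcs.corr()

# Largest deviation from the identity matrix; should be numerically zero
max_dev = np.abs(pc_corr.values - np.eye(3)).max()
print(pc_corr.round(3))
print(max_dev)
```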

## 5. Classification of the breast cancer data

### 5.1. On the reduced PCA space

We will now train LogisticRegression on the data projected onto the 5 principal components.

**Reflect:** Why would we choose logistic regression on this data? 

Logistic regression is a classification method for binary outcome data.

**TASK:** Use ```train_test_split()``` on ```new_X``` and ```y``` to create a test-train data split with ```test_size=0.2``` and setting ```random_state``` to any number of your choice.

In [None]:
new_X_train, new_X_test, y_train, y_test = ...

**TASK:** Create a model called ```logreg``` using ```LogisticRegression()``` and setting ```solver='lbfgs'```. Fit the model on the training data using ```fit()``` and then make predictions on the test set using ```predict()``` on the logreg model. 

In [None]:
# create logreg model
logreg = ...

# fit on training data


# predict on test data
y_pred_test = ...

**TASK:** Print the confusion matrix and the accuracy of the logreg model using ```confusion_matrix()``` and ```accuracy_score()``` on the test data.

In [None]:
print(...)
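
For reference, here is a sketch of the full split, fit, predict and evaluate cycle on synthetic data generated with ```make_classification()```. This is not the breast cancer data, and all names and parameter values are illustrative. In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (hypothetical settings)
toy_X, toy_y = make_classification(n_samples=300, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

toy_model = LogisticRegression(solver='lbfgs').fit(toy_X_train, toy_y_train)
toy_pred = toy_model.predict(toy_X_test)

# Rows: true labels, columns: predicted labels
toy_cm = confusion_matrix(toy_y_test, toy_pred, labels=[0, 1])
toy_acc = accuracy_score(toy_y_test, toy_pred)
print(toy_cm)
print(toy_acc)
```

Accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test samples.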

### 5.2. On the original data space

Now let's see how the regression on the reduced PCA space compares to regression on the original data space. 

**TASK:** Use ```train_test_split()``` on ```X``` and ```y``` to create a test-train data split with ```test_size=0.2```. Use the same ```random_state``` as in **§5.1** so that both models are evaluated on the same test rows.

In [None]:
# let's compare with logistic regression without PCA, using all of the original features
X_train, X_test, y_train, y_test = ...

**TASK:** Create a model called ```logreg2``` using ```LogisticRegression()``` and setting ```solver='lbfgs'```. Fit the model on the training data using ```fit()``` and then make predictions on the test set using ```predict()``` on the ```logreg2``` model. 

In [None]:
logreg2 = ...

# fit on training data

# predict on test data
y_pred_test2 = ...

**TASK:** Print the confusion matrix and the accuracy of the logreg2 model using ```confusion_matrix()``` and ```accuracy_score()``` on the test data.

**TASK:** Compare the accuracy we got for classification when using the original data and when it's been projected onto the reduced PCA space. 

**Optional TASK:** Can we get even better classification results by using more principal components?
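
One possible way to approach this is to loop over several numbers of components and record the test accuracy for each. The sketch below does so on synthetic data with hypothetical names and settings. Unlike the notebook, it fits the scaler and the PCA on the training split only. That is my own design choice to keep test data out of the fitting steps, not something the tasks above require.

```python
from sklearn import decomposition
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical settings)
toy_X, toy_y = make_classification(n_samples=400, n_features=10,
                                   n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y,
                                          test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
toy_scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = toy_scaler.transform(X_tr), toy_scaler.transform(X_te)

acc_by_k = {}
for k in (1, 2, 5, 10):
    pca_k = decomposition.PCA(n_components=k).fit(X_tr_s)
    model_k = LogisticRegression(solver='lbfgs').fit(pca_k.transform(X_tr_s), y_tr)
    acc_by_k[k] = accuracy_score(y_te, model_k.predict(pca_k.transform(X_te_s)))

print(acc_by_k)
```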

### 5.3. Visualizing the classification results

Let's visualize the classification results on the whole data set.

**TASK:** First make predictions from the logreg models fitted on the reduced PCA and the original space, i.e. ```logreg``` and ```logreg2```, this time on the whole dataset, i.e. on ```new_X``` and ```X```, respectively.

In [None]:
y_pred_all = ...
y_pred_all2 = ...

**TASK:** Make scatter plots of the data projected onto the first two components, i.e. plotting ```new_X.PC1``` vs ```new_X.PC2```, and colour (```c=...```) the points by 
1. Predictions from classification on the reduced PCA space.
2. Predictions from classification on the original data space.
3. True diagnoses.

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True)

ax1.scatter(..., cmap = "jet", edgecolor = "None", alpha=0.35)
 
ax2.scatter(..., cmap = "jet", edgecolor = "None", alpha=0.35)

ax3.scatter(..., cmap = "jet", edgecolor = "None", alpha=0.35)

# don't forget to annotate: title and axis labels

plt.show()
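
As a layout reference, the sketch below draws three annotated panels with a loop. It uses randomly generated coordinates and made-up label vectors in place of ```new_X```, the predictions and the true diagnoses, so the names and panel titles are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
toy_pc1 = rng.normal(size=100)                      # stand-in for new_X.PC1
toy_pc2 = rng.normal(size=100)                      # stand-in for new_X.PC2
toy_true = (toy_pc1 > 0).astype(int)                # made-up "true" labels
toy_pred_a = (toy_pc1 + 0.3 * toy_pc2 > 0).astype(int)  # made-up predictions
toy_pred_b = (toy_pc1 > 0.2).astype(int)                # made-up predictions

f, axes = plt.subplots(1, 3, sharey=True, figsize=(12, 4))
panels = [(toy_pred_a, 'Predictions A'),
          (toy_pred_b, 'Predictions B'),
          (toy_true, 'True labels')]

for ax, (colours, title) in zip(axes, panels):
    ax.scatter(toy_pc1, toy_pc2, c=colours, cmap='jet',
               edgecolor='None', alpha=0.35)
    ax.set_title(title)
    ax.set_xlabel('PC1')

axes[0].set_ylabel('PC2')   # shared y axis, so one label is enough
# with %matplotlib inline the figure renders at the end of the cell
```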

Since accuracy is quite high for both classifiers, the predictions of both are very close to the true diagnoses. We know, however, that we made some accuracy gains. The next time a classifier achieves low accuracy, we can keep in mind that PCA, or dimension reduction more generally, may help us improve predictions.

This notebook has been adapted from https://www.kaggle.com/code/jyotiprasadpal/dimension-reduction-methods/notebook.