# Principal Component Analysis

In this lab, we'll see in practice how to use Principal Component Analysis (PCA) to reduce the dimensionality of a dataset, and how to interpret the principal components. To this end, we'll use the "Wisconsin Breast Cancer dataset", which contains various characteristics of cell nuclei from breast tumor samples, along with whether each patient's tumor is malignant (M) or benign (B). In the second part of the lab, we'll see how PCA can be used to compress an image.

**Load the necessary libraries**

In [None]:
import sklearn 
import pandas as pd 
from sklearn.decomposition import PCA
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import cv2

**1) Read the dataset, drop the columns 'id' and 'Unnamed: 32', and check whether there are missing values. If there are any, drop the corresponding rows entirely.**
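As a minimal sketch of this step, the snippet below reads a tiny in-memory CSV instead of the lab's data file, so that it runs on its own. The miniature table and its column names are assumptions made for illustration; with the real file you would pass its path to `pd.read_csv` and use the column names that actually appear in it.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the lab's data file.
# The trailing comma on each row produces an empty last column.
csv_text = """id,diagnosis,radius_mean,texture_mean,Unnamed: 32
1,M,17.99,10.38,
2,B,13.54,,
3,B,12.45,15.70,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop the identifier column and the empty trailing column.
df = df.drop(columns=["id", "Unnamed: 32"])

# Count the missing values per column before deciding what to drop.
print(df.isna().sum())

# Drop every row that still contains at least one missing value.
df = df.dropna(axis=0).reset_index(drop=True)
```

Dropping the empty column first matters: if it were still present, `dropna` would remove every row.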

**2) Select X as all columns in the dataframe, with the exception of the target variable 'diagnosis', and y as the variable 'diagnosis'. Center the matrix X column-wise.**
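One possible way to do this is sketched below on a small hypothetical dataframe (the values are made up for illustration). Column-wise centering means subtracting each column's mean from that column, so every centered column has mean zero.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataframe standing in for the cleaned dataset.
df = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "B", "M"],
        "radius_mean": [18.0, 13.5, 12.5, 20.0],
        "texture_mean": [10.4, 14.4, 15.7, 17.8],
    }
)

# Features: every column except the target.
X = df.drop(columns=["diagnosis"])
# Target: the diagnosis column.
y = df["diagnosis"]

# Subtract the mean of each column (axis=0 averages over the rows).
X_centered = X - X.mean(axis=0)

print(X_centered.mean(axis=0))  # each entry should be numerically zero
```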

**3) Compute the total variance of the variable X.**
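The total variance can be read as the sum of the variances of the individual columns, which equals the trace of the covariance matrix. The sketch below checks this on synthetic data (an assumption, so that the snippet is self-contained) and uses the sample variance with `ddof=1`, which is also the convention behind scikit-learn's `explained_variance_`.

```python
import numpy as np

# Synthetic data standing in for X: 100 samples, 4 variables of different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([1.0, 2.0, 0.5, 3.0])
X_centered = X - X.mean(axis=0)

# Sum of the per-column sample variances.
total_var = X_centered.var(axis=0, ddof=1).sum()

# Same quantity computed as the trace of the sample covariance matrix.
trace_cov = np.trace(np.cov(X_centered, rowvar=False))

print(total_var, trace_cov)
```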

**4) Apply PCA to the dataset X by computing all principal components. Make sure you can access the principal components, the variance explained by each component, and the ratio of variance explained by each component.**
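A sketch of this step with scikit-learn follows. So that it runs without the lab's CSV, it assumes the copy of the Wisconsin breast cancer data bundled with scikit-learn (`load_breast_cancer`) as a stand-in; its column names and target encoding may differ from the file used in the lab.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

# Stand-in for the lab's dataset (assumption: the bundled scikit-learn copy).
X = load_breast_cancer().data
X_centered = X - X.mean(axis=0)

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
pca = PCA()
scores = pca.fit_transform(X_centered)

components = pca.components_                 # one principal direction per row
explained_variance = pca.explained_variance_ # variance along each component
explained_ratio = pca.explained_variance_ratio_  # share of the total variance

print(components.shape, explained_ratio[:3])
```

The entries of `explained_variance_` should add up to the total variance computed in step 3, which is a convenient sanity check.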

**5) Create a bar plot of the proportion of total variance explained by the first 5 components. What do you observe?**

**Also generate a bar plot of the cumulative ratio of the total variance explained by the first 5 components.**

**Finally, what is the proportion of the total variance explained by all the components?**
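The two bar plots could be produced along these lines. This is a sketch that again assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the lab's file; the `Agg` backend line is only there so the snippet runs without a display and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X = load_breast_cancer().data  # stand-in dataset (assumption)
pca = PCA().fit(X - X.mean(axis=0))

# Share of variance for the first five components and its running sum.
ratio = pca.explained_variance_ratio_[:5]
cumulative = np.cumsum(ratio)
names = [f"PC{i + 1}" for i in range(5)]

fig, (ax_ratio, ax_cum) = plt.subplots(1, 2, figsize=(10, 4))
ax_ratio.bar(names, ratio)
ax_ratio.set_ylabel("Proportion of variance explained")
ax_cum.bar(names, cumulative)
ax_cum.set_ylabel("Cumulative proportion of variance explained")
fig.tight_layout()

# Summing the ratio over all components answers the last question.
total_ratio = pca.explained_variance_ratio_.sum()
print(total_ratio)
```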

**6) Generate a scatter plot of the component scores in the space spanned by the first two components. Color the points according to their target label ('M' or 'B'). Do you notice anything?**

**7) Generate a table of the loadings for the first two principal components, i.e. a pandas DataFrame whose columns are the principal components and whose rows are the loadings for each variable. Set the DataFrame's index to the original variable names. How do you interpret it?**
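A possible construction of the loadings table is sketched below, assuming the bundled `load_breast_cancer` data as a stand-in (its variable names may differ from those in the lab's CSV). Here "loadings" is taken to mean the coefficients of each unit-length principal direction; some authors additionally scale them by the square root of the explained variance.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

# Stand-in dataset loaded as a DataFrame so the variable names are available.
X = load_breast_cancer(as_frame=True).data
X_centered = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(X_centered)

# components_ has one component per row, so transpose to get one variable per row.
loadings = pd.DataFrame(
    pca.components_.T,
    columns=["PC1", "PC2"],
    index=X.columns,
)

# Variables with the largest absolute loading dominate the component.
print(loadings.reindex(loadings["PC1"].abs().sort_values(ascending=False).index).head())
```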

**8) Use the function below to generate a loading plot. How do you interpret it?**

In [1]:
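def myplot(score, coeff, labels=None, colors=None):
    """Plot the scores on PC1/PC2 and overlay one loading arrow per variable.

    score:  array of shape (n_samples, >= 2) holding the PCA scores.
    coeff:  array of shape (n_features, >= 2) holding the loadings.
    labels: optional variable names used to annotate the arrows.
    colors: optional per-sample numeric codes or colors,
            e.g. (y == 'M').astype(int).
    """
    fig, ax = plt.subplots()
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # Rescale the scores to a unit range so they share the axes with the loadings.
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    ax.scatter(xs * scalex, ys * scaley, c=colors)
    for i in range(n):
        # One arrow per original variable, from the origin to its loadings.
        ax.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        name = "Var" + str(i + 1) if labels is None else labels[i]
        ax.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, name, color='g', ha='center', va='center')
    ax.set_xlim(-0.5, 1.1)
    ax.set_ylim(-1, 1.1)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    fig.set_size_inches(18.5, 10.5)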


**9) Split X (the centered data) and y into a training and a test set following an 80/20 partition. Fit a logistic regression model to the training data using all original variables, and evaluate its accuracy on the test set. Then do the same with a pipeline that adds a PCA pre-processing step to the data X, so that the model is fitted on only the first two principal components. Is the difference in accuracy between the two approaches significant?**
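The comparison could be set up roughly as follows. This sketch assumes the bundled `load_breast_cancer` data as a stand-in for the lab's file; the `random_state`, `stratify` and `max_iter` settings are illustrative choices rather than requirements of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Stand-in dataset (assumption), centered column-wise as in step 2.
X, y = load_breast_cancer(return_X_y=True)
X = X - X.mean(axis=0)

# 80/20 split; the fixed seed only makes the sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline: logistic regression on all original variables.
full_model = LogisticRegression(max_iter=10000)
full_model.fit(X_train, y_train)
acc_full = accuracy_score(y_test, full_model.predict(X_test))

# Pipeline: project onto the first two principal components, then classify.
pca_model = Pipeline(
    [
        ("pca", PCA(n_components=2)),
        ("clf", LogisticRegression(max_iter=10000)),
    ]
)
pca_model.fit(X_train, y_train)
acc_pca = accuracy_score(y_test, pca_model.predict(X_test))

print(f"all variables: {acc_full:.3f}, first two PCs: {acc_pca:.3f}")
```

Putting the PCA inside the pipeline means the components are estimated from the training set only and then applied unchanged to the test set.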

**10) We'll now see how PCA can be employed to compress an image. First, load the 'doggo.jpeg' picture using the library matplotlib.**

**11) Split the image into its red, green and blue channels (using the method cv2.split()). Then, on each channel, apply a PCA transformation with 1 component. For each channel, also compute the inverse PCA transform (using the pca.inverse_transform() method).**

**Stack the three inverse transforms (one per channel) back together to form the compressed image, and display it. Try increasing the number of principal components until you reach a satisfactory quality.**
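The per-channel compression can be sketched as below. To stay self-contained, the snippet builds a small synthetic RGB array instead of loading 'doggo.jpeg', and it splits the channels with NumPy slicing rather than `cv2.split()`; both are assumptions made for illustration, and the helper names `compress_channel` and `compress_image` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 64x48 RGB image with values in [0, 1], standing in for the photo.
rng = np.random.default_rng(0)
h, w = 64, 48
yy, xx = np.mgrid[0:h, 0:w]
img = np.stack([xx / w, yy / h, (xx + yy) / (h + w)], axis=-1)
img = np.clip(img + 0.05 * rng.random((h, w, 3)), 0.0, 1.0)


def compress_channel(channel, n_components):
    """Project one channel onto n_components PCs and map it back."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(channel)  # image rows are treated as samples
    return pca.inverse_transform(reduced)


def compress_image(image, n_components):
    """Apply compress_channel to each color channel and restack them."""
    restored = [compress_channel(image[:, :, c], n_components) for c in range(3)]
    return np.stack(restored, axis=-1)


# Mean squared reconstruction error for two compression levels.
err_1 = np.mean((img - compress_image(img, 1)) ** 2)
err_10 = np.mean((img - compress_image(img, 10)) ** 2)
print(err_1, err_10)
```

Before displaying a reconstructed image with `plt.imshow`, it may help to clip it to the valid range (for example with `np.clip(result, 0, 1)` for float images), since the inverse transform can slightly overshoot.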