# CODE COURSE - Tutorial Principal component analysis.
---

This tutorial explains, in an interactive way, the tools you need to perform PCA. 

## Contents of the tutorial
1. Introduction
    * What is PCA?
    * Where can you apply PCA?
    * What is a principal component?
2. Step by step PCA
    * Import the data
    * Standardization
    * Calculation of the covariance matrix 
    * Calculation of the eigenvalues and eigenvectors
    * Determine the most important PC
    * Explained variance
    * Cumulative plot of the principal components
    * Reducing the dimensions of the data set / final plot
3. PCA with Python
    * Import the data
    * Standardization
    * Calculation of the covariance matrix 
    * Calculation of the eigenvalues and eigenvectors
    * Determine the most important PC
    * Explained variance
    * Cumulative plot of the principal components
    * Reducing the dimensions of the data set / final plot

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

### 1. Introduction 

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

#### 1.1 What is PCA?

https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python

.......

#### 1.2 Where can you apply PCA?

* Data visualization
* Speeding up machine learning algorithms
https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python

#### 1.3 What is a principal component?
.......
https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python

#### 1.4 Videos

https://www.youtube.com/watch?v=HMOI_lkzW08 <br>
https://www.youtube.com/watch?v=n7npKX5zIWI

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

### 2. Step by step PCA 

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

Step by step PCA:<br>
Step 1: Standardization of the data. <br>
Standardization rescales the data so that every variable has a mean of zero and a standard deviation of one, which puts all variables on a comparable scale. <br>
Z = (variable value - mean) / standard deviation <br> <br>
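
The formula above can be sketched with NumPy on a small, hypothetical data set (the array `X` and its values are invented for illustration and are not part of the breast cancer data). This is a minimal sketch that assumes the population standard deviation is used, which is NumPy's default.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 variables on very different scales.
X = np.array([
    [1.0, 200.0],
    [2.0, 240.0],
    [3.0, 310.0],
    [4.0, 380.0],
    [5.0, 420.0],
])

# Z = (variable value - mean) / standard deviation, computed per column.
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
Z = (X - mean) / std

print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

After this step both columns contribute on the same scale, even though the raw values of the second column are about a hundred times larger.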

Step 2: Computing the covariance matrix <br>
A covariance matrix expresses how strongly each pair of variables in the data set varies together. It is important to identify heavily dependent variables because they carry redundant information, which can reduce the overall performance of a model. <br> <br>
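
As an illustration, the sketch below computes a covariance matrix for a hypothetical, already centred toy array `Z` (invented for this example). The second column is exactly twice the first, so these two variables are fully redundant, which shows up as a correlation close to 1.

```python
import numpy as np

# Hypothetical centred toy data: rows are samples, columns are variables.
# Every column has mean 0; column 1 is exactly 2 * column 0.
Z = np.array([
    [ 1.0,  2.0,  0.0],
    [-1.0, -2.0,  1.0],
    [ 2.0,  4.0, -1.0],
    [-2.0, -4.0,  0.0],
])

# Covariance by hand: Z^T Z / (n - 1) for centred data.
n_samples = Z.shape[0]
cov_manual = Z.T @ Z / (n_samples - 1)

# The same result with NumPy; rowvar=False means columns are variables.
cov_numpy = np.cov(Z, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
print(np.corrcoef(Z, rowvar=False)[0, 1])  # close to 1: redundant variables
```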

Step 3: Calculating the eigenvectors and eigenvalues <br>
Eigenvectors and eigenvalues are computed from the covariance matrix in order to determine the principal components (PCs) of the data set. PCs are a new set of variables obtained from the initial set of variables. They compress most of the useful information that was scattered among the initial variables. <br>
An eigenvector is a vector whose direction does not change when the linear transformation is applied to it; it is only scaled. <br>
The eigenvalue is the scalar by which its eigenvector is scaled. <br>
The eigenvector with the largest eigenvalue points in the direction of maximum variance and defines PC1. <br> <br>
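
The defining property of an eigenvector can be checked numerically. The sketch below uses a small, hypothetical symmetric matrix `C` as a stand-in for a covariance matrix. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns the eigenvalues in ascending order; this is a choice made for the example, and `np.linalg.eig` can be used as well.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of eig_vecs are the eigenvectors; eig_vals[i] belongs to eig_vecs[:, i].
eig_vals, eig_vecs = np.linalg.eigh(C)

for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    # Applying C to v only rescales v by its eigenvalue; the direction stays the same.
    print(eig_vals[i], np.allclose(C @ v, eig_vals[i] * v))
```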

Step 4: Computing the principal components <br>
Once the eigenvectors and eigenvalues have been computed, order them by descending eigenvalue. The eigenvector with the highest eigenvalue is the most significant and forms the first PC. PC1 captures the largest possible variance; PC2 is the second most significant PC and captures the largest share of the remaining variance, and so on. <br> <br>
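
One possible way to do this ordering is with `np.argsort`. The eigenvalues and eigenvectors below are hypothetical values chosen so the result is easy to read; the important detail is that the eigenvector columns are reordered together with their eigenvalues.

```python
import numpy as np

# Hypothetical eigenvalues and eigenvectors (one eigenvector per column).
eig_vals = np.array([0.5, 2.0, 1.0])
eig_vecs = np.eye(3)

# Indices that sort the eigenvalues from highest to lowest.
order = np.argsort(eig_vals)[::-1]

sorted_vals = eig_vals[order]
sorted_vecs = eig_vecs[:, order]  # reorder the columns in the same way

print(sorted_vals)        # eigenvalues in descending order
print(sorted_vecs[:, 0])  # eigenvector that defines PC1
```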

Step 5: Reducing the dimensions of the data set<br>
The last step in performing PCA is to project the standardized data onto the selected principal components, which retain the most significant information of the data set. 
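
The projection can be sketched end to end on a hypothetical centred toy array (the names `Z`, `W`, `Y` and `k` are illustrative assumptions, not names required by the exercises). The example keeps the two strongest components and checks that the variance along the first projected axis equals the largest eigenvalue.

```python
import numpy as np

# Hypothetical centred toy data: 4 samples, 3 variables (each column has mean 0).
Z = np.array([
    [ 1.0,  2.0,  0.0],
    [-1.0, -2.0,  1.0],
    [ 2.0,  4.0, -1.0],
    [-2.0, -4.0,  0.0],
])

# Covariance matrix and its eigendecomposition (eigh: for symmetric matrices).
cov = np.cov(Z, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Keep the k eigenvectors with the largest eigenvalues as projection matrix W.
order = np.argsort(eig_vals)[::-1]
k = 2
W = eig_vecs[:, order[:k]]  # shape (3, 2)

# Project the data onto the selected principal components.
Y = Z @ W                   # shape (4, 2)

print(Y.shape)
# The variance along PC1 matches the largest eigenvalue.
print(np.var(Y[:, 0], ddof=1), eig_vals[order[0]])
```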


<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-bottom:-20px"></div>

### 3. PCA with Python 

<div class="alert alert-info" role="alert" style="height:10px;padding:0px;margin-top:5px;"></div>

#### 3.1 Import the dataset. 

In this first tutorial you will work with a dataset about breast cancer.  (Explanation of the dataset). You can import the dataset with the following code:

In [17]:
from sklearn.datasets import load_breast_cancer
load_breast_cancer

<function sklearn.datasets.base.load_breast_cancer>

*load_breast_cancer* will give you the data and the labels (malignant or benign). With *.data* you will get the data, with *.target* you will get the labels and with *.feature_names* you will get the names of the features in the breast cancer dataset.

In [18]:
breast_data = load_breast_cancer().data

In [19]:
breast_labels = load_breast_cancer().target
#np.count_nonzero(breast_labels == 1)

In [20]:
features = load_breast_cancer().feature_names
features

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='<U23')

The next step is to import numpy, since you will be reshaping *breast_labels* in order to concatenate it with *breast_data*. After that you can create a *DataFrame* with pandas that holds both the data and the labels.

In [21]:
import numpy as np
import pandas as pd

In [22]:
labels = np.reshape(breast_labels,(569,1))

In [23]:
final_breast_data = np.concatenate([breast_data,labels], axis=1)
final_breast_data

array([[  1.79900000e+01,   1.03800000e+01,   1.22800000e+02, ...,
          4.60100000e-01,   1.18900000e-01,   0.00000000e+00],
       [  2.05700000e+01,   1.77700000e+01,   1.32900000e+02, ...,
          2.75000000e-01,   8.90200000e-02,   0.00000000e+00],
       [  1.96900000e+01,   2.12500000e+01,   1.30000000e+02, ...,
          3.61300000e-01,   8.75800000e-02,   0.00000000e+00],
       ..., 
       [  1.66000000e+01,   2.80800000e+01,   1.08300000e+02, ...,
          2.21800000e-01,   7.82000000e-02,   0.00000000e+00],
       [  2.06000000e+01,   2.93300000e+01,   1.40100000e+02, ...,
          4.08700000e-01,   1.24000000e-01,   0.00000000e+00],
       [  7.76000000e+00,   2.45400000e+01,   4.79200000e+01, ...,
          2.87100000e-01,   7.03900000e-02,   1.00000000e+00]])

Now you have to convert the array to a DataFrame with pandas.

In [24]:
breast_dataset = pd.DataFrame(final_breast_data)
breast_dataset.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0.0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0.0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0.0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0.0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0.0 |


As you can see, the feature names are missing from this DataFrame. In the next steps you are going to add them as column names. 
Note that the label field is missing from the feature array. You will have to add it manually, since this array will be assigned to the column names of your *breast_dataset* dataframe.

In [25]:
features_labels = np.append(features,'label')

In [26]:
breast_dataset.columns = features_labels

In [27]:
breast_dataset.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0.0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0.0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0.0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0.0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0.0 |


Since the original labels are in 0/1 format, you will change them to class names using the *.replace* function. In scikit-learn's encoding of this dataset, 0 stands for malignant and 1 for benign. Assigning the result back to the 'label' column updates the dataframe *breast_dataset*.

In [28]:
# In scikit-learn's breast cancer dataset, 0 = malignant and 1 = benign.
breast_dataset['label'] = breast_dataset['label'].replace({0: 'Malignant', 1: 'Benign'})
breast_dataset.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | Malignant |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | Malignant |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | Malignant |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | Malignant |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | Malignant |



Now the dataset is ready to get started with PCA.

#### 3.2 Standardization 

The first step of PCA is always to standardize the data, because the output of PCA depends on the scale of the features. 
To apply standardization, import the *StandardScaler* class from the sklearn library and select only the features from the *breast_dataset* dataframe you created in section 3.1 (use *.loc* and *.values* to get only the values of the selected features). In other words, select all the data except the column 'label'. Once you have the features, apply the scaling by calling *fit_transform* on the feature data. 

1) Use *StandardScaler* and *fit_transform* to write code that standardizes the whole dataset, except the column 'label'. 


In [29]:
#code.....

2) Check whether the standardized data has a mean of zero and a standard deviation of one.

In [30]:
#code.....

3) Write code below that shows the new data table with the standardized values. 

In [31]:
#code.....

#### 3.3 Calculation of the covariance matrix

4) Calculate the Covariance-matrix using the command *cov*

In [32]:
#code.....

#### 3.4 Calculation of the eigenvalues and eigenvectors

5) Calculate the eigenvalues and eigenvectors.

In [33]:
#code.....


#### 3.5 Determine the most important PC

In order to decide which eigenvector(s) can be dropped without losing too much information for the construction 
of a lower-dimensional subspace, we need to inspect the corresponding eigenvalues: 
the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data; 
those are the ones that can be dropped.
To do so, the common approach is to rank the eigenvalues from highest to lowest and then choose the top k eigenvectors.

6) Make a list eig_pairs of (eigenvalue, eigenvector) tuples and print the eigenvalues in descending order.



In [34]:
#code.....

#### 3.6 Explained variance
After sorting the eigenpairs, the next question is 
“how many principal components are we going to choose for our new feature subspace?” 
A useful measure is the so-called “explained variance,” which can be calculated from the eigenvalues. 
The explained variance tells us how much information (variance) can be attributed to each of the principal components.
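
As a small worked example, assume four hypothetical eigenvalues that are already sorted in descending order (the numbers are invented so the percentages are easy to verify by hand). Each explained variance is the eigenvalue divided by the sum of all eigenvalues, and the cumulative explained variance is the running total.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted from highest to lowest.
eig_vals = np.array([4.0, 3.0, 2.0, 1.0])

total = eig_vals.sum()                # total variance: 10.0
var_exp = eig_vals / total * 100      # 40%, 30%, 20%, 10%
cum_var_exp = np.cumsum(var_exp)      # 40%, 70%, 90%, 100%

print(var_exp)
print(cum_var_exp)
```

In this toy case the first two components together explain 70% of the variance, and three components explain 90%.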

7) Calculate the total variance of the eigenvalues, the explained variance (var_exp) and the cumulative explained variance (cum_var_exp).

In [35]:
#code.....

#### 3.7 Cumulative plot of the principal components

8) Make in one plot a bar graph of the individual explained variance and a step graph of the cumulative explained variance.

In [36]:
#code.....

#### 3.8 Reducing the dimensions of the data set

9) Build matrix_w from eig_pairs, using only the eigenvectors of the two highest eigenvalues. *hint: use hstack*

In [40]:
#code.....

10) Calculate the matrix product (dot product) of X and matrix_w and name it **Y**.

In [38]:
#code.....

11) Make a scatter plot that displays the benign data in red and the malignant data in green, with the plot title *Principal Component Analysis of Breast Cancer Dataset*,<br> x-axis label Principal Component 1, y-axis label Principal Component 2, and a legend.

In [39]:
#code.....