# **Principal Component Analysis (PCA)**

In this simple tutorial, we will learn how to implement a dimensionality reduction technique called Principal Component Analysis (PCA), which reduces the number of independent variables in a problem by identifying principal components. We will take a step-by-step approach to PCA.

## **Dataset**



The dataset can be downloaded from the following [link](https://archive.ics.uci.edu/ml/datasets/wine). The dataset contains the results of a chemical analysis of wines grown from three different cultivars. It has 178 rows, each with a class label and 13 numerical features.

Let’s get started. Import all the libraries required for this project.

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels scikit-image --user -q

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## **Loading the dataset**

In [None]:
#2. Import the dataset
dataset = pd.read_csv('Wine.csv', header=None)


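# The file has no header row, so name the columns explicitly.
# The first column ('name') holds the class label.
dataset.columns = [
    'name',
    'alcohol',
    'malicAcid',
    'ash',
    'ashalcalinity',
    'magnesium',
    'totalPhenols',
    'flavanoids',
    'nonFlavanoidPhenols',
    'proanthocyanins',
    'colorIntensity',
    'hue',
    'od280_od315',
    'proline',
]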
dataset.head()

We need to store the independent variables (the 13 features) in `X` and the dependent variable (the class label in the first column) in `y`, using the `iloc` indexer.

In [None]:
X = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values
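
To see what this slicing does, here is a minimal sketch on a tiny, made-up DataFrame. The names `df_demo`, `X_demo`, `y_demo` and the column names are hypothetical and are not part of the wine data.

```python
import pandas as pd

# Hypothetical 3-row frame: first column is the label, the rest are features.
df_demo = pd.DataFrame({
    "label": [1, 2, 3],
    "feat_a": [0.1, 0.2, 0.3],
    "feat_b": [10, 20, 30],
})

# All rows, every column from position 1 onward -> feature matrix.
X_demo = df_demo.iloc[:, 1:].values
# All rows, column at position 0 -> label vector.
y_demo = df_demo.iloc[:, 0].values

print(X_demo.shape)  # (3, 2)
print(y_demo)        # [1 2 3]
```

`iloc` selects by integer position, so the same two lines work regardless of what the columns are called.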

Split the data into training and test sets in an 80:20 ratio.

In [None]:
#3. Split the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## **PCA Standardization**

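PCA can only be applied to numerical data, so it is important to convert all the data into a numerical format. We also need to standardize the data so that features measured in different units end up on a comparable scale (zero mean and unit variance). Otherwise, features with large numeric ranges would dominate the principal components.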

In [None]:
#4. Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
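
As a rough illustration of what the scaler does, the sketch below standardizes a small, made-up array with plain NumPy. The toy values and the names `X_toy_train`, `X_toy_test`, `mu`, and `sigma` are assumptions for illustration only. The mean and standard deviation are learned from the training data and then reused on the test data, which mirrors `fit_transform` followed by `transform`.

```python
import numpy as np

# Toy data: two features on very different scales (illustrative values).
X_toy_train = np.array([[1.0, 100.0],
                        [2.0, 300.0],
                        [3.0, 500.0],
                        [4.0, 700.0],
                        [5.0, 900.0]])
X_toy_test = np.array([[3.0, 500.0],
                       [6.0, 1100.0]])

# Learn the statistics from the training data only.
mu = X_toy_train.mean(axis=0)
sigma = X_toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same training statistics to both sets.
Z_train = (X_toy_train - mu) / sigma
Z_test = (X_toy_test - mu) / sigma

print(Z_train.mean(axis=0))  # approximately 0 for each feature
print(Z_train.std(axis=0))   # approximately 1 for each feature
print(Z_test)
```

Reusing the training statistics on the test set keeps information from the test data out of the preprocessing step.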

## **Covariance Matrix**

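Based on the standardized data, we will build the covariance matrix. It gives the covariance between each pair of features in our original dataset, with each feature's variance on the diagonal. A negative value in the result below indicates that the two features are inversely related: as one increases, the other tends to decrease.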

In [None]:
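# Covariance matrix of the training set
mean_vec = np.mean(X_train, axis=0)
cov_mat = (X_train - mean_vec).T.dot(X_train - mean_vec) / (X_train.shape[0] - 1)
# Covariance matrix of the test set, centered on its own mean
mean_vect = np.mean(X_test, axis=0)
cov_matt = (X_test - mean_vect).T.dot(X_test - mean_vect) / (X_test.shape[0] - 1)
print(cov_mat)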
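
The manual formula above can be checked against `np.cov`. The following sketch uses synthetic data (the names `X_demo`, `Z`, `cov_manual`, and `cov_numpy` are hypothetical) in which one feature is built to move opposite to another, so the corresponding covariance entry comes out negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
# Make feature 1 move in the opposite direction to feature 0, plus a little noise.
X_demo[:, 1] = -X_demo[:, 0] + 0.1 * rng.normal(size=50)

# Standardize each feature (zero mean, unit variance).
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Manual covariance: Z is already centered, so no mean subtraction is needed.
n = Z.shape[0]
cov_manual = Z.T @ Z / (n - 1)

# NumPy's version; rowvar=False means each column is a feature.
cov_numpy = np.cov(Z, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
print(cov_manual[0, 1])  # negative: features 0 and 1 are inversely related
```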

## **Eigen Decomposition on Covariance Matrix**

Each eigenvector of the covariance matrix has a corresponding eigenvalue, and the sum of the eigenvalues equals the total variance in the dataset. The eigenvector with the largest eigenvalue gives the direction of maximum variance. The eigenvectors with the lowest eigenvalues capture the least variation in the dataset, so these are the components that can be dropped.

In [None]:
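# Eigendecomposition of the training-set covariance matrix
cov_mat = np.cov(X_train.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# Test-set covariance matrix (not decomposed; PCA is fitted on the training set)
cov_matt = np.cov(X_test.T)
print(eig_vals)
print(eig_vecs)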
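
To make the "drop the smallest eigenvalues" step concrete, here is a minimal sketch on synthetic data. It assumes centered (not necessarily standardized) input and swaps in `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. The names `X_demo`, `order`, `W`, and `k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 4 features with decreasing spread.
X_demo = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X_centered = X_demo - X_demo.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)
vals, vecs = np.linalg.eigh(cov)  # ascending eigenvalues, eigenvectors as columns

# Sort from largest to smallest eigenvalue.
order = np.argsort(vals)[::-1]
vals = vals[order]
vecs = vecs[:, order]

# Share of the total variance carried by each component.
explained_ratio = vals / vals.sum()

# Keep the top k eigenvectors and project the data onto them.
k = 2
W = vecs[:, :k]
X_projected = X_centered @ W

print(explained_ratio)
print(X_projected.shape)  # (100, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the components by how much variation they keep.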

We need to specify how many components we want to keep. Here we keep two, which reduces the dimension from 13 features to 2. The first and second principal components capture the most variance in the original dataset.

In [None]:
#5. Apply PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
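
One possible way to decide how many components to keep is to look at the cumulative explained variance. The sketch below uses synthetic data, and the 95% threshold is an arbitrary assumption chosen for illustration; `pca_full`, `cumulative`, `threshold`, and `n_keep` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 5 features with very different spreads.
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Fit PCA with all components to inspect the full variance profile.
pca_full = PCA()
pca_full.fit(X_demo)

# Running total of the variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # assumed cut-off, not a rule
n_keep = int(np.argmax(cumulative >= threshold) + 1)

print(cumulative)
print(n_keep)
```

For a 2-D plot, as in this tutorial, two components are needed regardless of the threshold; this check simply shows how much variance those two components retain.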

In [None]:
pca.components_

In this array, each column corresponds to one of the original features, and each row corresponds to a principal component.
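
The layout of `components_` can be checked on a small synthetic example. In this sketch, `X_demo`, `feature_names`, and `pca_demo` are hypothetical names; each printed row lists the weight a component places on every original feature.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 4))
feature_names = ["f0", "f1", "f2", "f3"]  # made-up feature names

pca_demo = PCA(n_components=2).fit(X_demo)
components = pca_demo.components_

# One row per principal component, one column per original feature.
print(components.shape)  # (2, 4)

for row_idx, row in enumerate(components):
    weights = ", ".join(f"{name}={w:+.2f}" for name, w in zip(feature_names, row))
    print(f"PC{row_idx + 1}: {weights}")

# The rows are unit-length and mutually orthogonal.
gram = components @ components.T
print(np.allclose(gram, np.eye(2)))
```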

## **Fitting Logistic Regression To the training set**

As we are solving a classification problem, we can use logistic regression as the prediction model.

In [None]:
#6. Fit the Logistic Regression to the Training set

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
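
The scaling, PCA, and classifier steps can also be chained into a single estimator. The following is a sketch under some assumptions rather than the notebook's own method: it loads the wine data through scikit-learn's bundled `load_wine` instead of `Wine.csv` (so the class labels are 0, 1, 2 rather than 1, 2, 3), and the names `X_w`, `y_w`, and `model` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the wine data: 13 features, 3 classes.
X_w, y_w = load_wine(return_X_y=True)
X_w_train, X_w_test, y_w_train, y_w_test = train_test_split(
    X_w, y_w, test_size=0.2, random_state=0
)

# Scale -> reduce to 2 components -> classify, fitted in one call.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(random_state=0),
)
model.fit(X_w_train, y_w_train)

# Mean accuracy on the held-out test set.
test_accuracy = model.score(X_w_test, y_w_test)
print(test_accuracy)
```

With a pipeline, the scaler and PCA are fitted on the training data only and then applied to the test data automatically when `score` or `predict` is called.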

## **Predict the test Result**

In [None]:
#7. Predict the Test set results

y_pred = classifier.predict(X_test)

## **Evaluating the Algorithm**

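For classification tasks, we can use a confusion matrix to check the accuracy of our machine learning model. The diagonal entries count correct predictions, and the off-diagonal entries count misclassifications.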

In [None]:

#8. Make the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
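
Accuracy and per-class metrics can be read off a confusion matrix directly. The matrix below is made up for illustration and is not the output of the cell above; it follows scikit-learn's convention of true classes in rows and predicted classes in columns. The names `cm_demo`, `accuracy`, `recall`, and `precision` are hypothetical.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
cm_demo = np.array([[14, 0, 0],
                    [1, 15, 0],
                    [0, 0, 6]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm_demo) / cm_demo.sum()

# Recall per class: correct predictions over the true count of that class (row sums).
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)

# Precision per class: correct predictions over the predicted count (column sums).
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)

print(accuracy)   # 35 / 36
print(recall)
print(precision)
```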

## **Plot the training set**

In [None]:
#9. Visualize the Training set results

from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    # Use color= (not c=) so the RGBA tuple is treated as one color for all points
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

## **Plot the Test Set**

In [None]:
#10.Visualize the Test set results

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    # Use color= (not c=) so the RGBA tuple is treated as one color for all points
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()



## **Important things to note:**

> * PCA will take all the original training set variables and decompose them in a manner to make a new set of variables with high explained variance.
> * Principal component analysis involves extracting linear composites of observed variables.
> * PCA shows how much of the variability in the independent variables each principal component explains. It does not use the dependent variable, and it cannot by itself be used to see which independent variables are more important for prediction.

## **Related Articles:**

> * [PCA in Python](https://analyticsindiamag.com/principal-component-analysis-in-python/)
> * [Comparing PCA, LDA and PCA-kernel](https://analyticsindiamag.com/practical-approach-to-dimensionality-reduction-using-pca-lda-and-kernel-pca/)
> * [Mathematical Practical Approach to PCA](https://analyticsindiamag.com/principal-component-analysis-on-matrix-using-python/)