### Implementing PCA in Python with Scikit-Learn 


https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/


With high-performance CPUs and GPUs widely available, machine learning and deep learning models can tackle most regression, classification, clustering, and related problems. However, several factors still cause performance bottlenecks when developing such models. A large number of features in the dataset is one of them: it affects both the training time and the accuracy of machine learning models. There are several ways to deal with a large number of features in a dataset:

1- Train the models on the original set of features, which can take days or weeks if the number of features is very high.

2- Reduce the number of variables by merging correlated variables.

3- Extract the most important features from the dataset, those responsible for the maximum variance in the data. Different statistical techniques are used for this purpose, e.g. linear discriminant analysis, factor analysis, and principal component analysis.

In this article, we will see how principal component analysis can be implemented using Python's Scikit-Learn library.

**Principal Component Analysis**
Principal component analysis, or PCA, is a statistical technique that converts high-dimensional data to low-dimensional data by constructing a small number of new features, called principal components, that capture as much information about the dataset as possible. Each principal component is a linear combination of the original features, and the components are ranked by the variance in the data that they explain. The direction that explains the highest variance is the first principal component, the direction that explains the second-highest variance is the second principal component, and so on. It is important to mention that principal components are uncorrelated with each other.
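
To make the idea concrete, here is a minimal NumPy sketch of what PCA computes. It assumes a small synthetic dataset; the data and all variable names are illustrative and do not come from the article. It eigen-decomposes the covariance matrix, sorts the directions by variance, and checks that the projected data are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, two of them strongly correlated
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

# Centre the data, then eigen-decompose the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them by descending variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the principal directions
scores = X_centered @ eigvecs
explained_ratio = eigvals / eigvals.sum()

print(explained_ratio)
# The covariance matrix of the projected data is (numerically) diagonal
print(np.round(np.cov(scores, rowvar=False), 6))
```

Scikit-Learn's PCA class uses a singular value decomposition rather than this explicit eigen-decomposition, but the resulting directions describe the same thing.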

**Advantages of PCA**
There are two main advantages of dimensionality reduction with PCA.

1- The training time of the algorithms drops significantly with fewer features.

2- It is not always possible to analyze data in high dimensions. For instance, if a dataset has 100 features, the total number of pairwise scatter plots required to visualize the data would be 100(100-1)/2 = 4950. In practice, it is not possible to analyze data this way.
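
The scatter plot count is simply the number of unordered feature pairs. A quick check, using only the standard library:

```python
from math import comb

n_features = 100

# Number of distinct feature pairs, i.e. n * (n - 1) / 2
n_pairs = comb(n_features, 2)
print(n_pairs)  # 4950
```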

**Normalization of Features**
It is imperative to mention that a feature set must be normalized before applying PCA. For instance, if a feature set mixes data expressed in units such as kilograms, light years, or millions, the variances of the features differ by orders of magnitude. If PCA is applied to such a feature set, the resulting loadings for the high-variance features will also be large. Hence, the principal components will be biased towards features with high variance, leading to misleading results.
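
The effect is easy to reproduce. The sketch below assumes two made-up, independent features on very different scales (weight in grams and height in metres) and compares the explained variance ratios with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical features: weight in grams (large scale), height in metres (small scale)
weight_g = rng.normal(70_000, 10_000, size=300)
height_m = rng.normal(1.7, 0.1, size=300)
X = np.column_stack([weight_g, height_m])

# PCA on the raw data versus PCA on the standardized data
raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio)     # the first component absorbs almost all of the variance
print(scaled_ratio)  # much closer to an even split between the two components
```

On the raw data the first component is essentially the weight column, purely because of its units.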

Finally, the last point to remember before we start coding is that PCA is a statistical technique and can only be applied to numeric data. Therefore, categorical features are required to be converted into numerical features before PCA can be applied.

**Implementing PCA with Scikit-Learn**
In this section we will implement PCA with the help of Python's Scikit-Learn library. We will follow the classic machine learning pipeline: first import the libraries and the dataset, perform exploratory data analysis and preprocessing, and finally train our models, make predictions, and evaluate accuracy. The only additional step is to perform PCA to find the optimal number of components before we train our models. These steps are implemented as follows:

In [0]:
#Importing Libraries
import numpy as np  
import pandas as pd  

In [0]:
## Execute the following script to download the dataset using pandas:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"  
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']  
dataset = pd.read_csv(url, names=names)  

In [0]:
dataset

In [0]:
dataset.head() 

| | sepal-length | sepal-width | petal-length | petal-width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [0]:
# Preprocessing
# The first preprocessing step is to divide the dataset into a feature set and corresponding labels. The following script performs this task:
# DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
#   labels: single label or list-like; index or column labels to drop.
#   axis: {0 or 'index', 1 or 'columns'}, default 0; whether to drop labels from the index or from the columns.

X = dataset.drop('Class', axis=1)  # feature set: every column except 'Class' (axis must be passed by keyword in recent pandas)
y = dataset['Class']               # labels

In [0]:
# Check features

X

In [0]:
# check labels
y

In [0]:
# The next preprocessing step is to divide data into training and test sets. Execute the following script to do so:

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  

# Feature Scaling

For more information, see the link below:

https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e

## Why Scaling
Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.


If left alone, these algorithms only take in the magnitude of the features and ignore the units. The results would vary greatly between different units, for example 5 kg and 5000 g. Features with high magnitudes will weigh far more in the distance calculations than features with low magnitudes.

![feature scaling](https://cdn-images-1.medium.com/max/1600/1*EyPd0sQxEXtTDSJgu72JNQ.jpeg)


To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
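
As a small illustration of the unit problem, the sketch below computes the Euclidean distance between two made-up data points twice: once with weight in kilograms and once with the same weight in grams. The points and the helper function are hypothetical.

```python
import numpy as np

# Two hypothetical people described by (weight, height in metres)
a = np.array([70.0, 1.60])
b = np.array([72.0, 1.90])

def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

dist_kg = euclidean(a, b)                       # weight expressed in kilograms
to_grams = np.array([1000.0, 1.0])              # convert only the weight column
dist_g = euclidean(a * to_grams, b * to_grams)  # same people, weight in grams

print(dist_kg)
print(dist_g)
```

With weight in grams, the height difference contributes almost nothing to the distance, even though the two people have not changed.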


## How to Scale Features

There are four common methods to perform Feature Scaling.

**1- Standardisation**:

**The values are not bounded to a fixed range; for roughly normally distributed data, about 68% of them will lie between -1 and 1.**

Standardization is one of the most popular methods for scaling features. It basically replaces the values with their Z scores.

This method redistributes the features with their mean = 0 and standard deviation = 1.

Scikit-Learn provides a preprocessing module that contains different preprocessing methods including standardization.

Here is a simple example to demonstrate that.

from sklearn.preprocessing import scale

# Let's assume that X is a NumPy array with some values
# and that we want to scale the values of the array
sc = scale(X)

Standardisation replaces the values by their Z scores.

![alt text](https://cdn-images-1.medium.com/max/1600/1*LysCPCvg0AzQenGoarL_hQ.png)

Standard Deviation:

#### Please follow the link below for more details
https://www.mathsisfun.com/data/standard-deviation-formulas.html

This is the formula for Standard Deviation:
![alt text](https://www.mathsisfun.com/data/images/standard-deviation-formula.gif)

StandardScaler makes the mean of the distribution 0 and its standard deviation 1. For roughly normally distributed data, about 68% of the values will lie between -1 and 1.
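
The Z score can be computed by hand and compared with Scikit-Learn's result. This is a minimal sketch on a made-up array; note that Scikit-Learn uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical feature matrix: 4 samples, 2 features on different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# Z score: subtract the column mean, divide by the column standard deviation
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = scale(X)

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```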


**Scikit-learn: preprocessing.scale() vs preprocessing.StandardScaler()**

Both perform the same computation, but:

**preprocessing.scale(x) is just a function, which transforms some data**

**preprocessing.StandardScaler() is a class supporting the Transformer API**

I would always use the latter, even when I do not need inverse_transform and the other methods that StandardScaler() supports.


Excerpt from the docs:

**The function scale provides a quick and easy way to perform this operation on a single array-like dataset**

**The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline**
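
The practical difference shows up as soon as there is a separate test set. The sketch below uses tiny made-up arrays: StandardScaler learns the mean and standard deviation from the training data and reuses them on the test data, whereas calling scale() on the test set alone would standardize it with its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Hypothetical single-feature training and test sets
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0], [6.0]])

scaler = StandardScaler().fit(X_train)     # learns mean and std from the training set only
X_test_scaled = scaler.transform(X_test)   # reuses the training statistics

# scale() has no memory: it standardizes the test set with the test set's own statistics
X_test_rescaled = scale(X_test)

# The fitted transformer can also undo the transformation
X_back = scaler.inverse_transform(X_test_scaled)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
print(X_test_rescaled.ravel())
```

Only the first variant keeps the test values comparable with the training values, which is why the fit/transform split matters.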


**This redistributes the features with their mean μ = 0 and standard deviation σ = 1. sklearn.preprocessing.scale helps us implement standardisation in Python.**
<hr>

**2-  Mean Normalisation:**


This distribution will have values between -1 and 1 with μ = 0.

![alt text](https://cdn-images-1.medium.com/max/2400/1*fyK4gMQrfJKV5pmbXSrNbg.png)

**Standardisation and Mean Normalisation can be used for algorithms that assume zero-centred data, such as Principal Component Analysis (PCA).**
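
Mean normalisation is a one-liner in NumPy. The sketch below assumes the usual definition, (x - mean) / (max - min) per column, and a made-up array.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Mean normalisation: (x - mean) / (max - min), computed per column
X_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_norm)
print(X_norm.mean(axis=0))  # each column is centred on zero
```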


<hr>


**3- Min-Max Scaling:**

sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

![alt text](https://cdn-images-1.medium.com/max/1600/1*19hq_t_NFQ6YVxMxsT0Cqg.png)

This scaling brings the values between 0 and 1.
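
A short sketch of MinMaxScaler on a made-up array, checked against the formula (x - min) / (max - min) applied per column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: 3 samples, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

# The same result computed by hand
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```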


<hr>


**4- Unit Vector:**


![alt text](https://cdn-images-1.medium.com/max/1600/1*u2Up0eaer56dpmaElU3Zxw.png)

Scaling is done by treating the whole feature vector as having unit length.

**Min-Max Scaling produces values in the range [0,1], and so does the Unit Vector technique when all feature values are non-negative. This is quite useful when dealing with features with hard boundaries. For example, when dealing with image data, the colors can range only from 0 to 255.**
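
A minimal sketch of unit vector scaling, assuming the L2 (Euclidean) norm and a made-up array. Scikit-Learn's normalize function rescales each row (sample) by default, which matches dividing every row by its own length.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical feature vectors, one per row
X = np.array([[3.0, 4.0], [1.0, 0.0], [-5.0, 12.0]])

# Divide every row by its Euclidean length
X_manual = X / np.linalg.norm(X, axis=1, keepdims=True)
X_unit = normalize(X, norm="l2")

print(X_unit)
print(np.linalg.norm(X_unit, axis=1))  # every row now has length 1
```

The last row shows that negative inputs stay negative, so the [0,1] range only holds for non-negative features.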




## When to Scale

Examples of algorithms where feature scaling matters:
1. K-Means uses the Euclidean distance measure, so feature scaling matters.
2. K-Nearest-Neighbours also requires feature scaling.
3. Principal Component Analysis (PCA) tries to find the directions of maximum variance, so feature scaling is required here too.
4. Gradient Descent: training speeds up because theta converges faster after feature scaling.

**Note: Naive Bayes, Linear Discriminant Analysis, and Tree-Based models are largely unaffected by feature scaling.**
**In short, algorithms that are not distance based are generally less affected by feature scaling, with gradient-descent-based training being a notable exception.**


The rule of thumb I follow here: if an algorithm computes distances or assumes normality, scale your features!

Some examples of algorithms where feature scaling matters are:

k-nearest neighbors with a Euclidean distance measure is sensitive to magnitudes, so all features should be scaled to weigh in equally.

Scaling is critical when performing Principal Component Analysis (PCA). PCA tries to find the directions of maximum variance, and the variance is high for high-magnitude features. This skews the PCA towards high-magnitude features.

We can speed up gradient descent by scaling. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

Tree-based models are not distance-based models and can handle varying ranges of features. Hence, scaling is not required when modelling trees.

Algorithms like Linear Discriminant Analysis (LDA) and Naive Bayes are by design equipped to handle this and give weights to the features accordingly. Performing feature scaling for these algorithms may not have much effect.

# Distance between a centroid and a data point can be calculated using these methods:

1- Euclidean Distance: It is the square root of the sum of the squared differences between the coordinates (feature values, e.g. Age, Salary, BHK Apartment) of the data point and the centroid of each class. This formula follows from the Pythagorean theorem.

![alt text](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/mink-dis.jpg)

where x is the data point value, y is the centroid value, and k is the number of feature values. Example: the given data set has k = 3.


2- Manhattan Distance: It is calculated as the sum of the absolute differences between the coordinates (feature values) of the data point and the centroid of each class.

![alt text](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/manh-dis.jpg)

3- Minkowski Distance: It is a generalization of the two methods above. As shown in the figure, different values of r can be used: r = 1 gives the Manhattan distance and r = 2 gives the Euclidean distance.

![alt text](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/mink-dis.jpg)
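
The three distances can be sketched with a single helper, since the Manhattan and Euclidean distances are the Minkowski distance with r = 1 and r = 2. The function name and the sample points below are illustrative.

```python
import numpy as np

def minkowski_distance(x, y, r):
    """Minkowski distance of order r between two points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

# Hypothetical data point and centroid with k = 3 feature values
point = [1.0, 2.0, 3.0]
centroid = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(point, centroid, r=1)  # 3 + 4 + 0 = 7
euclidean = minkowski_distance(point, centroid, r=2)  # sqrt(9 + 16 + 0) = 5

print(manhattan, euclidean)
```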

# Handling Missing Data

Handling missing data is a very common task in machine learning. Missing data means that your dataset contains no information for a certain feature in a specific row. Almost all datasets come with some missing values.
Machine learning algorithms are essentially mathematical equations, so we cannot feed these empty (missing) values into them.

**There are two very commonly used methods for Handling Missing Data.**
1. Removing Data
2. Imputation


**1 - Removing Data**
In many cases, the solution is simply to remove the specific row. With pandas this is very simple: we just need to use the pandas method dropna().
Let’s say we have a pandas DataFrame df that contains some missing values. We can then write a single line of code to delete those rows.

new_df = df.dropna(axis=0)

But this approach is not always the best solution, because it also throws away the valid values in those rows. A better solution to this problem is often imputation.
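
The axis argument of dropna() decides what gets removed, which is easy to get backwards. A small sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values in two columns
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "salary": [50000, 60000, np.nan],
    "city": ["A", "B", "C"],
})

rows_dropped = df.dropna()        # axis=0 (default): drop rows containing a missing value
cols_dropped = df.dropna(axis=1)  # axis=1: drop columns containing a missing value

print(rows_dropped.shape)  # (1, 3)
print(cols_dropped.shape)  # (3, 1)
```

Both variants lose a lot of data here, which is the motivation for imputation.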

**2- Imputation**
Imputation is another very popular method for handling missing values. In imputation, instead of deleting the rows, we fill the missing entries with some other values. The imputed value is not the true value, but it is a reasonable estimate of it.

Scikit-Learn provides classes for imputation. We can use the SimpleImputer class. Here is the code for that:


import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
new_data = imp.fit_transform(data)


SimpleImputer supports different strategies for imputation. In the above example, we used the mean strategy, which replaces the missing values with the mean value of that column. We can also use the constant strategy, which replaces all the missing values with a constant value.
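
Both strategies can be compared side by side. The array below is made up, and the constant 0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
data = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])

mean_imp = SimpleImputer(missing_values=np.nan, strategy="mean")
const_imp = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)

filled_mean = mean_imp.fit_transform(data)    # column means: 2.0 and 15.0
filled_const = const_imp.fit_transform(data)  # missing entries become 0

print(filled_mean)
print(filled_const)
```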

# Categorical Data

**Data Encoding**
We know that machine learning algorithms require data in numerical form. But datasets often contain some features in another form, so we need to convert these values into numerical form.
Scikit-Learn provides different methods for data encoding.


**Label Encoding**

To understand Label Encoding, first, let’s assume a dataset contains three columns age, salary, and gender. Now in this dataset, the gender column is not in numerical form. That means we need to convert it into some type of numerical form.


To achieve that, we can use Label Encoding. We know that the gender column contains two unique values, Male and Female. If we apply Label Encoding to this column, it replaces the values with 0 and 1. Scikit-Learn's LabelEncoder assigns the integers in sorted order of the labels, so Female = 0 and Male = 1.

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# x is assumed to be the 1-D gender column; fit_transform already returns a NumPy array of integers
x = labelencoder.fit_transform(x)
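
A self-contained sketch with a made-up gender column shows which integer each label receives. LabelEncoder sorts the labels, so the mapping can be read from its classes_ attribute.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical gender column
gender = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

print(list(encoder.classes_))  # ['Female', 'Male'], so Female -> 0 and Male -> 1
print(codes)                   # [1 0 0 1]

# The encoding can be reversed
decoded = encoder.inverse_transform(codes)
```

Scikit-Learn's documentation describes LabelEncoder as intended for target labels; OrdinalEncoder is the analogous class for feature columns.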

**One Hot Encoding**

Label Encoding can cause problems: the machine learning algorithm may wrongly assume that the encoded values have some kind of hierarchical order. To avoid this, we can use One Hot Encoding.
One hot encoding takes a column of categorical data and splits it into multiple columns, one per category. The values are replaced by 1s and 0s, depending on which column matches the category. In recent versions of Scikit-Learn the column does not need to be label encoded first.

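from sklearn.preprocessing import OneHotEncoder
# The categorical_features argument was removed from recent Scikit-Learn versions;
# x is assumed to be a 2-D array containing only the categorical column(s)
onehotencoder = OneHotEncoder()
x = onehotencoder.fit_transform(x).toarray()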
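
Here is a self-contained sketch of one hot encoding on a made-up gender column. The input must be two-dimensional (one column here), and the result is converted from a sparse matrix to a dense array for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, shaped as (n_samples, 1)
gender = np.array([["Male"], ["Female"], ["Female"]])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(gender).toarray()

print(encoder.categories_)  # the categories, in sorted order: Female, Male
print(onehot)               # one column per category
```

For a table that mixes categorical and numeric columns, Scikit-Learn's ColumnTransformer is one way to apply the encoder to selected columns only.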




In [0]:
# As mentioned earlier, PCA performs best with a normalized feature set. We will use StandardScaler to standardize our feature set. To do this, execute the following code:

# standardization technique
# https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test) 

In [0]:
X_test

In [0]:
## Applying PCA
# It is only a matter of three lines of code to perform PCA using Python's Scikit-Learn library. The PCA class is used for this purpose. PCA depends only upon the feature set and not the label data. Therefore, PCA can be considered an unsupervised machine learning technique.
# Performing PCA using Scikit-Learn is a two-step process:
# 1. Initialize the PCA class by passing the number of components to the constructor.
# 2. Call the fit and then transform methods by passing the feature set to these methods. The transform method returns the specified number of principal components.
# Take a look at the following code:

from sklearn.decomposition import PCA
pca = PCA()  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

In [0]:
X_test

In [0]:
# In the code above, we create a PCA object named pca. We did not specify the number of components in the constructor. Hence, all four principal components will be returned for both the training and test sets.
# The PCA class contains the explained_variance_ratio_ attribute, which holds the fraction of the total variance explained by each of the principal components. Execute the following line of code to find the "explained variance ratio".

explained_variance = pca.explained_variance_ratio_
explained_variance

array([0.72226528, 0.23974795, 0.03338117, 0.0046056 ])
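
One common way to read these ratios is to accumulate them and keep the smallest number of components that reaches a chosen share of the variance. The sketch below copies the ratios from the output above; the 95% threshold is an assumed cut-off, not a value from the article.

```python
import numpy as np

# Ratios copied from the output above
explained_variance = np.array([0.72226528, 0.23974795, 0.03338117, 0.0046056])

# Running total of the variance explained by the first 1, 2, 3, 4 components
cumulative = np.cumsum(explained_variance)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # assumed cut-off, adjust as needed
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_components)  # 2
```

The first two components together explain about 96% of the variance, so little information is lost by dropping the other two.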

In [0]:
# Let's first try to use 1 principal component to train our algorithm. To do so, execute the following code:
from sklearn.decomposition import PCA

pca = PCA(n_components=1)  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

In [0]:
# Training and Making Predictions
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=2, random_state=0)  
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)  



In [0]:
# Performance Evaluation
from sklearn.metrics import confusion_matrix  
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)  
print(cm)  
print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))  

[[11  0  0]
 [ 0 12  1]
 [ 0  1  5]]
Accuracy: 0.9333333333333333


In [0]:
# Results with 2 and 3 Principal Components
# Now let's try to evaluate classification performance of the random forest algorithm with 2 principal components. Update this piece of code:

from sklearn.decomposition import PCA

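# Note: X_train and X_test must hold the standardized features here. If they were
# overwritten by the 1-component PCA above, re-run the split and scaling cells first.
pca = PCA(n_components=2)  
X_train_1 = pca.fit_transform(X_train)  
X_test_1 = pca.transform(X_test)  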
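
To compare several numbers of components without overwriting variables between cells, the whole workflow can be wrapped in a pipeline and repeated in a loop. This is a sketch under two assumptions: it uses Scikit-Learn's bundled copy of the Iris data (load_iris) instead of the downloaded CSV, and it reuses the split and classifier settings shown earlier. The accuracies it prints may differ from the figures above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris dataset stands in for the CSV downloaded earlier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for n in (1, 2, 3, 4):
    # Each pipeline scales, projects, and classifies, always starting from the raw features
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        RandomForestClassifier(max_depth=2, random_state=0),
    )
    model.fit(X_train, y_train)
    results[n] = accuracy_score(y_test, model.predict(X_test))

for n, acc in results.items():
    print(f"{n} component(s): accuracy = {acc:.3f}")
```

Because the scaler and the PCA are fitted inside the pipeline, the test set is only ever transformed with statistics learned from the training set.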

          
