# References:

https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

# Principal Component Analysis

- To understand what PCA is, we first need to understand dimensionality reduction. A quick way to frame it: how do I take all the variables I’ve collected and focus on only a few of them?
- The aim is to reduce the number of variables to a small set of meaningful, explanatory ones.

### Two ways of Dimensionality Reduction:
#### Feature Elimination 
- We reduce the feature space by eliminating features. The disadvantage is that you lose whatever information the dropped variables carried.

#### Feature Extraction
- Suppose we have ten independent variables. In feature extraction we create ten “new” independent variables, where each “new” variable is a combination of all ten “old” ones. We create these new variables in a specific way and order them by how much of the variation in the data they capture (PCA itself does not look at the dependent variable). Because each new variable mixes all of the old ones, we keep the most valuable parts of our old variables even when we drop one or more of the “new” variables.

- Principal Component Analysis is a technique based on the feature extraction concept. As an added benefit, the “new” variables produced by PCA are uncorrelated with one another. This helps because linear models work best when the independent variables are not strongly correlated with each other. The most common use of PCA is to speed up machine learning algorithms. The second important use is data visualization.
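The point about the new variables not being correlated with one another can be checked on a small example. The sketch below uses made-up synthetic data (the array `X` and its three columns are assumptions for illustration, not part of this notebook's datasets) and compares the correlation matrix before and after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: two strongly correlated features plus one unrelated feature
base = rng.normal(size=500)
X = np.column_stack([
    base + 0.1 * rng.normal(size=500),
    2.0 * base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

# Keep all three components so nothing is dropped, only re-expressed
scores = PCA(n_components=3).fit_transform(X)

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

print(np.round(corr_before, 2))  # large off-diagonal entry for the first two features
print(np.round(corr_after, 2))   # off-diagonal entries are numerically zero
```

Note that this shows the components are uncorrelated, which is a weaker property than full statistical independence.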

## Let's try to use PCA for visualization: 
- Data visualization is an important technique, and when your data has more than 3 dimensions it becomes difficult to view it and understand which features are important. We can use PCA for data visualization: it reduces the data from any number of dimensions to any required smaller number. In our example we will reduce 4-dimensional data to 2 dimensions while keeping as much of the important variation as possible.


In [1]:
#Load the Iris Data
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])

In [2]:
#Let's see what our data looks like
df.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,target
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Importance of Feature Scaling
- Before we move ahead with visualization, we need to understand the concept of standardization.
- Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. Standardization rescales each feature so that it has a mean of zero and a standard deviation of one.
- While many algorithms (such as SVM, K-nearest neighbors, and logistic regression) benefit from normalized features, Principal Component Analysis (PCA) is a prime example of when normalization is important. In PCA we are interested in the components that maximize the variance, so a feature measured on a larger scale would dominate the components simply because of its units.

- If you want to read more about feature scaling, follow this link:

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

- PCA is affected by scale, so you need to scale the features in your data before applying PCA. Use StandardScaler to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.
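To make the effect of scale concrete, here is a minimal sketch on synthetic data (the three columns and their scales are assumptions chosen for illustration). It first checks that StandardScaler matches the z-score formula `(x - mean) / std`, then compares how PCA distributes variance with and without scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three independent features on very different scales
X = rng.normal(size=(200, 3)) * np.array([1000.0, 1.0, 0.01])

# StandardScaler applies z = (x - mean) / std per column (population std, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_scaled = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_scaled))

raw_ratio = PCA(n_components=3).fit(X).explained_variance_ratio_
scaled_ratio = PCA(n_components=3).fit(X_scaled).explained_variance_ratio_

print(np.round(raw_ratio, 4))     # first component is dominated by the large-scale column
print(np.round(scaled_ratio, 4))  # variance is spread far more evenly
```

Without scaling, the first component mostly reflects the units of the largest column rather than any structure in the data.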

In [3]:
from sklearn.preprocessing import StandardScaler     #importing the scaler class
features = ['sepal length', 'sepal width', 'petal length', 'petal width'] #collecting the four feature names in one list


In [4]:
#Now let's separate the features from our dependent variable (target column) and standardize the features

x = df.loc[:, features].values

y = df.loc[:,['target']].values

# Standardizing the features
x = StandardScaler().fit_transform(x)


- Now let's use PCA to reduce dimensions from 4 to 2. Before that, we need to make a note that after dimensionality reduction, there usually isn’t a particular meaning assigned to each principal component. The new components are just the two main dimensions of variation.
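To see what "the two main dimensions of variation" means, the sketch below computes them by hand from the covariance matrix and compares the result with scikit-learn. It is only an illustration under one assumption: it uses scikit-learn's bundled `load_iris` copy of the data rather than the UCI URL above, so the numbers may differ slightly from this notebook's output.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the standardized features (4 x 4)
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort to get the largest first
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]

# Project onto the two directions of largest variance (X already has zero mean)
manual = X @ top2
sk = PCA(n_components=2).fit_transform(X)

# Each component is only defined up to its sign, so compare absolute values
print(np.allclose(np.abs(manual), np.abs(sk)))
```

Each principal component is therefore a weighted combination of all four measurements, which is why it has no single physical meaning.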

In [5]:
from sklearn.decomposition import PCA    #import library for PCA

pca = PCA(n_components=2) #create an instance of PCA; n_components is the number of target dimensions

principalComponents = pca.fit_transform(x) #fit on our standardized features and project them

principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

# In the line above we create a new DataFrame with two features, using the transformed data extracted from the 4 original features
principalDf

Unnamed: 0,principal component 1,principal component 2
0,-2.264542,0.505704
1,-2.086426,-0.655405
2,-2.367950,-0.318477
3,-2.304197,-0.575368
4,-2.388777,0.674767
5,-2.070537,1.518549
6,-2.445711,0.074563
7,-2.233842,0.247614
8,-2.341958,-1.095146
9,-2.188676,-0.448629
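A natural follow-up question is how much information the two components keep. One way to check, sketched here with scikit-learn's bundled `load_iris` copy of the data (an assumption made so the example is self-contained; it may differ slightly from the UCI file loaded above), is the `explained_variance_ratio_` attribute of a fitted PCA.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca_2d = PCA(n_components=2).fit(X)

ratio = pca_2d.explained_variance_ratio_
print(ratio)        # share of the total variance captured by each component
print(ratio.sum())  # total share kept in the 2-dimensional view

# Each row holds the weights of the four original features in one component
print(pca_2d.components_)
```

If the two ratios add up to most of the variance, the 2-dimensional plot is a reasonable summary of the 4-dimensional data.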


In [6]:
#Now we need to concatenate our Dependent column (target) to this new dataset

finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
finalDf

Unnamed: 0,principal component 1,principal component 2,target
0,-2.264542,0.505704,Iris-setosa
1,-2.086426,-0.655405,Iris-setosa
2,-2.367950,-0.318477,Iris-setosa
3,-2.304197,-0.575368,Iris-setosa
4,-2.388777,0.674767,Iris-setosa
5,-2.070537,1.518549,Iris-setosa
6,-2.445711,0.074563,Iris-setosa
7,-2.233842,0.247614,Iris-setosa
8,-2.341958,-1.095146,Iris-setosa
9,-2.188676,-0.448629,Iris-setosa


In [7]:
#Finally we are done with all the processing and are ready to visualize the 2 dimensional data.
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1']
               , finalDf.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

plt.show()

<Figure size 800x800 with 1 Axes>

# PCA to Speed-up Machine Learning Algorithms

- We can use PCA in combination with other machine learning algorithms to speed up the analysis, and it can sometimes give better results. In this section we combine PCA with logistic regression to do classification.

- For this analysis the Iris dataset is too small to show a speed-up, because it has only 4 features and 150 observations. Instead, we will use the MNIST database of handwritten digits, which has 784 features (784 dimensions), a training set of 60,000 examples, and a test set of 10,000 examples.

In [8]:
#Let's load the data

#Older tutorials use fetch_mldata('MNIST original'); that function has been removed from scikit-learn, so we use fetch_openml instead

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
mnist

{'data': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 'target': array(['5', '0', '4', ..., '4', '5', '6'], dtype=object),
 'feature_names': ['pixel1', 'pixel2', 'pixel3', ..., 'pixel51', ...
(output truncated)

In [9]:
#Now we must split our data into training and test sets. A common split is 80-20, but here we choose 1/7th
#for the test set and 6/7th for the training set.
#We use test_size to specify what proportion of the original data goes into the test set
from sklearn.model_selection import train_test_split

train_img, test_img, train_lbl, test_lbl = train_test_split(mnist.data, mnist.target, test_size=1/7.0, random_state=0)
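The effect of `test_size=1/7.0` can be verified without downloading MNIST. The sketch below uses a small made-up array as a stand-in (700 rows instead of 70,000; the arrays `X` and `y` are purely illustrative) and also shows that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for MNIST: 700 rows instead of 70,000
X = np.arange(700 * 4).reshape(700, 4)
y = np.arange(700) % 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0
)
print(X_train.shape, X_test.shape)  # 6/7 of the rows for training, 1/7 for testing

# The same random_state gives the same split on every run
_, X_test_again, _, _ = train_test_split(X, y, test_size=1/7.0, random_state=0)
print(np.array_equal(X_test, X_test_again))
```

On the full 70,000-row MNIST data the same proportion yields the 60,000 / 10,000 split mentioned earlier.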

In [10]:
#As we did in PCA for data visualization, we standardize the data using StandardScaler

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

#We will do the fitting only on training set

scaler.fit(train_img)

#We will do transformation on both training and testing set

train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
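Why fit on the training set only? The scaler should learn its mean and standard deviation without ever seeing the test data, and then apply those same statistics to both sets. A minimal sketch on synthetic data (the `train` and `test` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# The statistics come from the training set only
scaler = StandardScaler().fit(train)
print(np.allclose(scaler.mean_, train.mean(axis=0)))

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0).round(3))  # zero up to rounding
print(test_scaled.mean(axis=0).round(3))   # close to zero, but not exactly
```

The test set ends up only approximately standardized, which is expected: it is being treated like unseen data.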

### Feature reduction using PCA
- Now that our data is ready, we can use PCA for feature reduction. First we need to decide how much variance should be retained. Here we retain 95%, so we pass .95 as the n_components parameter (recall that in the visualization example we passed PCA(n_components=2), where the value was a number of components rather than a fraction of variance). Observe the results when we fit it on the training set.


In [11]:
from sklearn.decomposition import PCA
#Make an instance of model using 95% as parameter

pca = PCA(.95)

#Here we have created an instance of PCA that will retain 95% of the variance after feature reduction. Now we will fit it on our training set.
#Note: We fit PCA on the TRAINING set only, not on the testing set.

In [12]:
pca.fit(train_img)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [13]:
#We can find how many components PCA chose after fitting. Here the model kept 327 components out of the original 784 dimensions.
pca.n_components_

327
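How does a fraction such as .95 turn into a component count? As far as the scikit-learn documentation describes it, PCA keeps the smallest number of components whose cumulative explained variance exceeds the requested fraction. The sketch below reproduces that by hand; it assumes scikit-learn's small bundled `load_digits` dataset (64 features) as a stand-in for MNIST so that it runs quickly and offline.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Fit with all components and accumulate the explained variance
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%
k = int(np.argmax(cumulative > 0.95)) + 1

pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
print(k, pca_95.n_components_)
```

Plotting `cumulative` against the component index is a common way to choose the threshold visually.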

In [14]:
#Now, let's apply the mapping (transform) on both training and testing set and make our data ready for Logistic regression

train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

### Apply Logistic Regression to the Transformed Data

In [15]:
#import model to be used

from sklearn.linear_model import LogisticRegression

In [16]:
#Instantiation of model:

logisticRegr = LogisticRegression(solver = 'lbfgs')

# In older scikit-learn versions the default solver (liblinear) is very slow on this data, hence we use 'lbfgs' (which became the default in version 0.22)

In [17]:
#Now next step is training the model on the data and storing the information learned from the data
# Here, model is learning the relationship between digits and labels
logisticRegr.fit(train_img, train_lbl)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
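The scale, reduce, and classify steps above can also be chained into a single scikit-learn Pipeline, which fits each step on the training data only and then applies the same transformations at prediction time. This is an alternative arrangement, not what the cells above do; the sketch assumes the bundled `load_digits` dataset as a small stand-in for MNIST, and `max_iter=1000` is an arbitrary choice to help convergence.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0
)

# Scaler and PCA are fitted on the training data only when fit() is called
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
n_kept = int(model.named_steps["pca"].n_components_)
print(n_kept, accuracy)
```

A pipeline makes it harder to accidentally fit the scaler or PCA on test data.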

In [18]:
#It's time to use the trained model on test data for prediction. We will first try it on 1 observation:

logisticRegr.predict(test_img[0].reshape(1,-1))

array(['0'], dtype=object)

In [19]:
logisticRegr.score(test_img, test_lbl)

#score() reports the mean accuracy over the whole test set (not one observation): approximately 91%

0.9116
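For a classifier, `score` is the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand; it assumes the bundled `load_digits` dataset as a small stand-in for MNIST and skips scaling and PCA to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0
)

clf = LogisticRegression(solver="lbfgs", max_iter=5000).fit(X_train, y_train)

# predict() expects a 2-D array, hence the reshape for a single observation
single = clf.predict(X_test[0].reshape(1, -1))

# Accuracy is the share of predictions that equal the true labels
manual_accuracy = np.mean(clf.predict(X_test) == y_test)
reported = clf.score(X_test, y_test)
print(single, manual_accuracy, reported)
```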

In [20]:
#Let's try to use model on complete test data and find accuracy.

logisticRegr.predict(test_img)

array(['0', '4', '1', ..., '1', '3', '0'], dtype=object)

In [21]:
logisticRegr.score(test_img, test_lbl)

0.9116

- The accuracy score is the same as before, because score() was evaluated on the complete test set both times.

- This is how we can use PCA along with other machine learning algorithms. We can repeat the example above while retaining a different amount of variance each time, calculate the accuracy score for each run, and finalize the model with the best score.
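A rough sketch of that experiment is shown here. It rests on several assumptions: the bundled `load_digits` dataset stands in for MNIST so the loop runs in seconds, the four variance levels are arbitrary, and the held-out split is treated as a validation set (choosing the threshold on the final test set would make the reported score optimistic).

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1/7.0, random_state=0
)

scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

results = []
for variance in (0.80, 0.90, 0.95, 0.99):
    pca = PCA(n_components=variance).fit(X_train)
    clf = LogisticRegression(solver="lbfgs", max_iter=1000)

    # Time only the classifier fit, which is the step PCA is meant to speed up
    start = time.perf_counter()
    clf.fit(pca.transform(X_train), y_train)
    fit_seconds = time.perf_counter() - start

    accuracy = clf.score(pca.transform(X_val), y_val)
    results.append((variance, int(pca.n_components_), fit_seconds, accuracy))

for variance, n_components, fit_seconds, accuracy in results:
    print(f"{variance:.2f}  components={n_components:3d}  "
          f"fit={fit_seconds:.3f}s  accuracy={accuracy:.4f}")
```

Timings on such a small dataset are noisy; the trade-off between component count, fit time, and accuracy becomes clearer on the full MNIST data.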