# Finding the optimal number of Components to retain in PCA
In this notebook, we will see how to choose the number of principal components to keep in PCA without losing too much information, i.e. how to find the optimal value of K.

## Step 1 : Importing Libraries

In [7]:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

## Step 2 : Loading and Preprocessing Dataset
This step loads the data, applies feature scaling, and splits the data into training and testing sets.

In [3]:
data = datasets.load_breast_cancer() # This has 30 features
X = data.data
Y = data.target

In [4]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [6]:
x_train, x_test, y_train, y_test = train_test_split(X_scaled, Y, random_state=9)

## Step 3 : Finding the optimal value of K

In [9]:
# Initially, we will keep all the components of the PCA
pca = PCA()
pca.fit_transform(x_train)

array([[-2.08467219e+00, -9.24675424e-01,  1.41849431e+00, ...,
        -2.25353802e-03, -1.56893301e-03, -2.05526077e-02],
       [ 8.92532980e-02,  7.52189385e+00,  3.94804094e+00, ...,
         3.92700451e-02,  1.41076541e-03,  7.12134605e-04],
       [-1.87199556e+00, -1.10817015e-01, -1.42147276e+00, ...,
        -2.70524052e-02, -1.41171125e-02,  1.39094363e-03],
       ...,
       [ 2.83199163e+00,  3.24248783e+00, -9.78314203e-01, ...,
        -3.13454599e-02, -2.64013363e-02,  1.38056879e-02],
       [-3.32846408e+00,  5.33594385e-01, -7.94371117e-01, ...,
        -8.32189353e-03, -1.70456409e-02, -3.34587636e-03],
       [-1.55282811e+00,  1.19841019e+00,  1.39552978e+00, ...,
        -8.29891885e-03,  1.26197668e-02, -7.47087923e-03]])

In [10]:
# Let's look at the eigenvalues (explained variance) of the 30 principal components
pca.explained_variance_

array([1.37192780e+01, 5.77668054e+00, 2.92761796e+00, 1.92606196e+00,
       1.73657729e+00, 1.19344964e+00, 7.12883924e-01, 5.05245524e-01,
       4.60835887e-01, 3.59810245e-01, 2.94645208e-01, 2.65992633e-01,
       2.49913311e-01, 1.57876879e-01, 8.70749459e-02, 7.55464948e-02,
       6.11117397e-02, 5.20142460e-02, 4.14321560e-02, 3.06926112e-02,
       2.98658892e-02, 2.67500564e-02, 2.49618718e-02, 1.67398210e-02,
       1.51640481e-02, 8.90776608e-03, 6.78591530e-03, 1.45887029e-03,
       6.17079691e-04, 1.28719607e-04])

In [11]:
# Let's find the total variance, i.e. the sum of the eigenvalues
total_variance = pca.explained_variance_.sum()

# Initialise k = 0 and the current variance sum = 0.
# We keep increasing k and updating the current variance sum until it reaches
# 99% of the total variance (the 99% threshold is my choice; it depends on how
# much information you are willing to compromise on).
k = 0
current_variance = 0

while current_variance/total_variance < 0.99 :
    current_variance += pca.explained_variance_[k]
    k += 1

k

17
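
The same count can be cross-checked without an explicit loop. The sketch below is only an illustration: it copies the eigenvalues printed above into an array (in the notebook you would use `pca.explained_variance_` directly), and the names `explained_variance`, `cumulative_ratio`, `threshold` and `k_vectorized` are chosen for this example, not taken from the original code.

```python
import numpy as np

# Eigenvalues copied from the pca.explained_variance_ output above
explained_variance = np.array([
    1.37192780e+01, 5.77668054e+00, 2.92761796e+00, 1.92606196e+00,
    1.73657729e+00, 1.19344964e+00, 7.12883924e-01, 5.05245524e-01,
    4.60835887e-01, 3.59810245e-01, 2.94645208e-01, 2.65992633e-01,
    2.49913311e-01, 1.57876879e-01, 8.70749459e-02, 7.55464948e-02,
    6.11117397e-02, 5.20142460e-02, 4.14321560e-02, 3.06926112e-02,
    2.98658892e-02, 2.67500564e-02, 2.49618718e-02, 1.67398210e-02,
    1.51640481e-02, 8.90776608e-03, 6.78591530e-03, 1.45887029e-03,
    6.17079691e-04, 1.28719607e-04,
])

# Fraction of the total variance captured by the first 1, 2, ... components
cumulative_ratio = np.cumsum(explained_variance) / explained_variance.sum()

# Position of the first entry that reaches the threshold, plus one to turn
# the zero-based index into a number of components
threshold = 0.99  # same threshold as the loop above
k_vectorized = int(np.argmax(cumulative_ratio >= threshold)) + 1

print(k_vectorized)
```

Looking at `cumulative_ratio` as a whole also shows how quickly the variance accumulates, which helps when deciding whether a lower threshold such as 95% would be acceptable.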

It seems that if I keep 17 principal components instead of the 30 original features of the breast cancer dataset, I still retain 99% of the variance in the scaled training data.
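
As a possible shortcut, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and then keeps just enough components to explain at least that fraction of the variance. The sketch below repeats the preprocessing from Step 2 so that it runs on its own; the names `pca_99`, `x_train_reduced` and `x_test_reduced` are illustrative and do not appear in the original notebook.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same preprocessing as in Step 2
data = datasets.load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
x_train, x_test, y_train, y_test = train_test_split(
    X_scaled, data.target, random_state=9
)

# A float in (0, 1) asks PCA to keep enough components to explain
# that fraction of the variance; this requires the "full" SVD solver
pca_99 = PCA(n_components=0.99, svd_solver="full")
x_train_reduced = pca_99.fit_transform(x_train)

# Reuse the projection learned on the training data for the test data
x_test_reduced = pca_99.transform(x_test)

print(pca_99.n_components_)
print(x_train_reduced.shape, x_test_reduced.shape)
```

The fitted `pca_99.n_components_` attribute reports how many components were kept, so it can be compared with the value of `k` found by the loop.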