### Objectives
* Introduction & Concepts
* Properties of PCA
* Applications of PCA

<hr>

### Introduction
* The main idea is to reduce the dimensionality of the data
* The dimensions of a dataset are its columns (features)
* Feature selection means keeping only the most important of the original features
* Dimensionality reduction instead derives new features (m) from the original features (n)
* m < n
* The reduction should cost as little accuracy as possible

* This is achieved by transforming the variables (columns) into a new set of variables, known as principal components
* Each principal component has an associated explained variance, which tells how important it is in representing the data
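
The transformation described above can be sketched by hand with NumPy. This is only an illustration under assumptions: the toy array `X` and the variable names are made up for the example, and library implementations such as scikit-learn typically use an SVD rather than an explicit eigen-decomposition.

```python
import numpy as np

# Toy data: two perfectly correlated columns (illustrative values)
X = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], dtype=float)

# 1. Center each column
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the columns (n - 1 in the denominator)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
explained_variance = eigenvalues[order]   # importance of each component
components = eigenvectors[:, order].T     # one principal direction per row

# 4. Keep m = 1 component and project the centered data onto it
m = 1
X_reduced = X_centered @ components[:m].T

print(explained_variance)
print(X_reduced.shape)
```

The sign of each principal direction is arbitrary, so the projected values may come out negated compared with another implementation.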

In [19]:
import pandas as pd
from sklearn.decomposition import PCA

In [20]:
df = pd.DataFrame({'A':[1,2,3,4,5],'B':[1,2,3,4,5]})
df

Unnamed: 0,A,B
0,1,1
1,2,2
2,3,3
3,4,4
4,5,5


In [21]:
pca = PCA(n_components=1)
pca.fit_transform(df)

array([[ 2.82842712],
       [ 1.41421356],
       [-0.        ],
       [-1.41421356],
       [-2.82842712]])
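
Because column `B` is an exact copy of `A`, one component should be enough to describe this toy data without loss. A minimal sketch to check that, assuming the same toy frame and using `inverse_transform` to map the reduced data back to the original columns:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

pca = PCA(n_components=1)
reduced = pca.fit_transform(df)            # shape (5, 1)
restored = pca.inverse_transform(reduced)  # back to shape (5, 2)

# Largest absolute difference between the original and restored values
max_error = np.abs(restored - df.to_numpy()).max()
print(max_error)
```

On real data the columns are rarely perfectly redundant, so the reconstruction error is usually greater than zero.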

In [22]:
pca.components_.shape

(1, 2)
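
The shape `(1, 2)` means one component with one weight per original column. The following sketch, which assumes the same toy frame, looks at those weights and checks that the projection is just the centered data multiplied by them:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})
pca = PCA(n_components=1).fit(df)

direction = pca.components_[0]       # weights for columns A and B
length = np.linalg.norm(direction)   # principal directions have unit length

# Manual projection: subtract the column means, then apply the weights
manual = (df.to_numpy() - pca.mean_) @ pca.components_.T

print(direction, length)
```

Both columns get the same absolute weight here because they contribute equally to the variance.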

In [23]:
df['C'] = 10
df

Unnamed: 0,A,B,C
0,1,1,10
1,2,2,10
2,3,3,10
3,4,4,10
4,5,5,10


In [25]:
pca = PCA(n_components=1)
pca.fit_transform(df)

array([[ 2.82842712],
       [ 1.41421356],
       [-0.        ],
       [-1.41421356],
       [-2.82842712]])

In [38]:
# Refit with all components kept to see the variance of each one
PCA().fit(df).explained_variance_

array([5.00000000e+00, 4.29747739e-32, 0.00000000e+00])
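
The raw variances are easier to read as fractions of the total. A small sketch, assuming the same three-column toy frame, using the `explained_variance_ratio_` attribute:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [1, 2, 3, 4, 5],
                   'C': [10] * 5})      # constant column, zero variance

pca = PCA().fit(df)                      # keep every component
ratio = pca.explained_variance_ratio_    # share of total variance per component

print(ratio)
```

The constant column `C` adds no variance, so the first component still accounts for essentially all of it.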

In [75]:
import pandas as pd
data = pd.read_csv('breast-cancer.csv')
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [76]:
data.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [67]:
from sklearn.decomposition import PCA

In [79]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
data = ss.fit_transform(data.drop('diagnosis', axis = 1))

In [80]:
data.shape

(569, 31)

In [81]:
pca = PCA(n_components=10)
res = pca.fit_transform(data)
res

array([[ 9.18319983,  1.97127137, -1.17162472, ...,  2.18052754,
        -0.2315924 , -0.09040462],
       [ 2.38329766, -3.75345877, -0.58022866, ...,  0.04493656,
         0.4268971 , -0.65993274],
       [ 5.74247239, -1.08035048, -0.53308788, ..., -0.71520883,
        -0.01070768, -0.08230334],
       ...,
       [ 1.2518901 , -1.89397674,  0.53446685, ..., -0.17866797,
         0.26211508,  0.47597832],
       [10.36503528,  1.69639755, -1.90741786, ...,  0.27520069,
        -0.07275619, -0.51818609],
       [-5.47826365, -0.67278804,  1.47716504, ...,  1.63559123,
         0.96946424,  0.67773018]])

In [82]:
pca.explained_variance_

array([13.31145188,  5.70683496,  2.84038694,  1.98484548,  1.65171815,
        1.23684643,  0.97999555,  0.67293563,  0.46160368,  0.40384284])
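
One common way to use these values is to accumulate the explained-variance ratios and keep enough components to pass a chosen threshold. The sketch below is an assumption-laden illustration: it loads scikit-learn's bundled copy of the dataset (30 features, no `id` column) instead of the CSV, and the 95% threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)                                # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)   # running share of variance

threshold = 0.95                                         # illustrative choice
# Index of the first component where the running share reaches the threshold
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(n_needed)
```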

### Notes
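* Features should be scaled before PCA; otherwise the features with the largest numeric ranges dominate the components
* `explained_variance_` gives the variance captured by each derived feature, which indicates how important it is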
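
The effect of scaling can be seen on synthetic data. In this sketch the two features are made up for illustration: they are strongly correlated, but one is measured on a scale about 1000 times larger than the other.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1000 * (x + 0.5 * rng.normal(size=200))   # correlated, much larger scale
X = np.column_stack([x, y])

# Absolute weights of the first component, without and with scaling
raw_weights = np.abs(PCA(n_components=1).fit(X).components_[0])
X_scaled = StandardScaler().fit_transform(X)
scaled_weights = np.abs(PCA(n_components=1).fit(X_scaled).components_[0])

print(raw_weights)      # almost all weight on the large-scale feature
print(scaled_weights)   # equal weight on both features
```

Without scaling, the first component mostly reproduces the large-scale feature; after scaling, both features contribute equally.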

In [83]:
pca.components_.shape

(10, 31)

In [84]:
# `data` now holds the scaled NumPy array, so the labels are re-read from the CSV
data = pd.DataFrame(res[:, :2], columns=["PCA1", "PCA2"])
data['class'] = pd.read_csv('breast-cancer.csv')['diagnosis']
data.head()

## PCA using breast cancer data

In [60]:
from sklearn.datasets import load_breast_cancer
breast_cancer_data = load_breast_cancer()
data = breast_cancer_data.data
target = breast_cancer_data.target

In [35]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss

In [48]:
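from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression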

In [40]:
trainX, testX, trainY, testY = train_test_split(data, target)

In [37]:
pipeline = make_pipeline(StandardScaler(),PCA(n_components=10),DecisionTreeClassifier())

In [38]:
pipeline = make_pipeline(StandardScaler(),PCA(n_components=10),LogisticRegression())

In [39]:
pipeline

In [41]:
gs = GridSearchCV(pipeline, param_grid={'pca__n_components':[10,11,12,13,14]},cv=5)

In [42]:
gs.fit(trainX,trainY)

In [43]:
gs.best_params_

{'pca__n_components': 13}

In [44]:
gs.best_score_

0.9812038303693571

In [45]:
gs.score(testX,testY)

0.9790209790209791
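
To see how close the candidate values of `pca__n_components` were, the cross-validation scores can be read from `cv_results_`. This is a self-contained sketch: the `random_state=0` split is an assumption added for repeatability, so the scores will not match the outputs above exactly.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
trainX, testX, trainY, testY = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(StandardScaler(), PCA(), LogisticRegression())
gs = GridSearchCV(pipeline,
                  param_grid={'pca__n_components': [10, 11, 12, 13, 14]},
                  cv=5)
gs.fit(trainX, trainY)

# One row per candidate value, with its mean and spread across the 5 folds
results = pd.DataFrame(gs.cv_results_)[
    ['param_pca__n_components', 'mean_test_score', 'std_test_score']]
print(results)
```

If the mean scores differ by less than their standard deviations, the smaller number of components may be the more sensible choice.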

In [49]:
pipeline = make_pipeline(PCA(n_components=10),DecisionTreeClassifier())

In [50]:
pipeline = make_pipeline(DecisionTreeClassifier())

In [51]:
# GridSearchCV(pipeline, cv=5) fails because param_grid is a required argument.
# There is nothing to tune here, so the pipeline is fitted directly in the next cell.

In [52]:
pipeline.fit(trainX,trainY)

In [53]:
# `gs` is still the earlier LogisticRegression grid search, not this tree pipeline
gs.best_score_

0.9812038303693571

In [54]:
pipeline.score(testX,testY)

0.8951048951048951

In [55]:
data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [56]:
data.shape

(569, 30)

In [57]:
lr = LogisticRegression()

In [58]:
lr.fit(trainX,trainY)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [59]:
lr.score(testX,testY)

0.8741258741258742

### Advantages of using PCA
* Reduces the dimension of the data, which shortens training time
* Because the principal components are uncorrelated and ordered by variance, simple (linear) models might work better on them
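
A rough way to check both points is to fit the same linear model on the raw features and on scaled principal components. This sketch makes several assumptions: the `random_state=0` split, `n_components=10`, and `max_iter=10000` for the unscaled model are illustrative choices, and the scores depend on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
trainX, testX, trainY, testY = train_test_split(X, y, random_state=0)

# Logistic regression on the 30 raw features (needs many iterations unscaled)
raw_model = LogisticRegression(max_iter=10000).fit(trainX, trainY)

# Logistic regression on 10 principal components of the scaled features
pca_model = make_pipeline(StandardScaler(),
                          PCA(n_components=10),
                          LogisticRegression()).fit(trainX, trainY)

raw_score = raw_model.score(testX, testY)
pca_score = pca_model.score(testX, testY)

print(raw_model.coef_.shape, raw_score)       # 30 inputs to the model
print(pca_model[-1].coef_.shape, pca_score)   # 10 inputs to the model
```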