**PCA on Breast Cancer Dataset**  
We will first run Logistic Regression on the breast cancer dataset and record its accuracy and training time.  
Then we will apply PCA to reduce the dimensionality, run Logistic Regression again, and compare the accuracy and training time of the two cases.  

In [11]:
from sklearn import decomposition, ensemble, datasets, linear_model
import numpy as np
import time
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [2]:
breast_cancer = datasets.load_breast_cancer()

In [3]:
X = breast_cancer.data
X.shape

(569, 30)

In [4]:
sc = StandardScaler()
X_std = sc.fit_transform(X)

In [5]:
x_train, x_test, y_train, y_test = train_test_split(X_std, breast_cancer.target, random_state = 0)

We will fit PCA only on the training data.  
We will then use the same fitted PCA object to transform the testing data as well.  
We should not fit PCA on the test data, because test data must play no part in training.  
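One way to enforce the "fit on training data only" rule is to wrap the steps in a scikit-learn `Pipeline`. The sketch below is an illustration, not the approach used in the cells of this notebook: it assumes we split the raw data first, so the scaler is also fitted on the training rows only. The names `x_tr`, `x_te`, `y_tr`, `y_te` and `model` are made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split the raw data first so the test rows never influence any fitted step.
x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit() fits every step on the training rows only; score() applies the
# already-fitted scaler and PCA to the test rows before predicting.
model = make_pipeline(StandardScaler(), PCA(n_components=15), LogisticRegression())
model.fit(x_tr, y_tr)

test_accuracy = model.score(x_te, y_te)
print('Score: ', test_accuracy)
```

With a pipeline there is no separate `fit_transform` / `transform` bookkeeping to get wrong, which is the main reason to consider it.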


In [7]:
pca = decomposition.PCA(n_components = 15)
## We have 30 features, so let's try reducing them to 15.

In [8]:
x_train_pca = pca.fit_transform(x_train)  ## fit PCA on the training data, then transform it

In [10]:
x_test_pca = pca.transform(x_test)


**For the test data we call only `transform`, which reuses exactly the same components.
The first call (`fit_transform`) does a lot more work because it has to find those components;
`transform` just projects the testing data onto them.**  
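To make this concrete, here is a small sketch of what `transform` does with the fitted object. It rebuilds the arrays from the earlier cells so it can run on its own, and it relies on the default `whiten=False`. The variable `manual` is introduced only for this check.

```python
import numpy as np
from sklearn import datasets, decomposition
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the arrays from the earlier cells so this snippet runs on its own.
breast_cancer = datasets.load_breast_cancer()
X_std = StandardScaler().fit_transform(breast_cancer.data)
x_train, x_test, y_train, y_test = train_test_split(
    X_std, breast_cancer.target, random_state=0)

pca = decomposition.PCA(n_components=15)
x_train_pca = pca.fit_transform(x_train)  # learns mean_ and components_
x_test_pca = pca.transform(x_test)        # reuses them, learns nothing new

# transform() amounts to centering with the training mean and projecting
# onto the stored components (this holds because whiten=False, the default).
manual = (x_test - pca.mean_) @ pca.components_.T

print(np.allclose(manual, x_test_pca))
print(x_test_pca.shape)
```

Both `pca.mean_` and `pca.components_` come from the training data, so the test rows are only projected and never influence the components.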

In [12]:
lr = linear_model.LogisticRegression()

Below is the model without PCA.

In [13]:
lr.fit(x_train, y_train)
print(lr.score(x_test, y_test))
## We also want to see how much time the fit takes.

0.965034965034965


In [18]:
start = time.time()
lr.fit(x_train, y_train)
ending = time.time()
print('Time: ' ,ending - start)
print('Score: ', lr.score(x_test, y_test))

Time:  0.016799211502075195
Score:  0.965034965034965


**This time we will do the same thing but with the PCA**

In [27]:
start = time.time()
lr.fit(x_train_pca, y_train)
ending = time.time()
print('Time: ' ,ending - start)
print('Score: ', lr.score(x_test_pca, y_test))


## Different runs will give different timings.
## The result also depends on what else is running on the CPU.

## If we run it 10-15 times and average the results, the PCA version should generally come out faster.



Time:  0.014199972152709961
Score:  0.958041958041958
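Since a single timing is noisy, here is a minimal sketch of the averaging idea mentioned in the comments. The helper `mean_fit_time` and its `repeats` parameter are hypothetical names for this example. It uses `time.perf_counter`, which is better suited to short intervals than `time.time`. The numbers will differ between machines and runs, so no expected output is shown.

```python
import time
from sklearn import datasets, decomposition, linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the arrays from the earlier cells so this snippet runs on its own.
breast_cancer = datasets.load_breast_cancer()
X_std = StandardScaler().fit_transform(breast_cancer.data)
x_train, x_test, y_train, y_test = train_test_split(
    X_std, breast_cancer.target, random_state=0)
x_train_pca = decomposition.PCA(n_components=15).fit_transform(x_train)


def mean_fit_time(features, labels, repeats=15):
    """Average wall-clock seconds needed to fit a fresh LogisticRegression."""
    total = 0.0
    for _ in range(repeats):
        model = linear_model.LogisticRegression()
        start = time.perf_counter()
        model.fit(features, labels)
        total += time.perf_counter() - start
    return total / repeats


time_full = mean_fit_time(x_train, y_train)
time_pca = mean_fit_time(x_train_pca, y_train)
print('Mean time without PCA: ', time_full)
print('Mean time with PCA:    ', time_pca)
```

On a dataset this small the difference may be within the noise, so it is worth checking the averages rather than a single run before drawing conclusions.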


**This dataset has only 30 features, reduced here to 15, so the saving is small. But if we move to larger, higher-dimensional datasets,  
the speed-up from PCA becomes much more significant.**

In [28]:
pca.explained_variance_

## The output below means the first principal component explains a variance of 13.02746837,
## the second explains 5.81556555, and so on; the values keep going down.
## So most of the information comes from the first few components.
## These are the eigenvalues (of the covariance matrix of the training data), and the components_ that we saw above are the corresponding eigenvectors.

## The eigenvalues are important because they tell us how much each eigenvector matters.
## Using these values we can decide how many components to keep.

array([13.02746837,  5.81556555,  2.85848795,  1.91901713,  1.70021491,
        1.20663908,  0.65333715,  0.42673847,  0.42645054,  0.34558986,
        0.30805491,  0.25605447,  0.228152  ,  0.14326274,  0.0926283 ])
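The eigenvalue interpretation can be checked directly, and the same values can guide the choice of how many components to keep. The sketch below rebuilds the training data so it runs on its own. The 95% target and the names `eigenvalues`, `cumulative` and `k_95` are assumptions made for this illustration, not something fixed by the notebook.

```python
import numpy as np
from sklearn import datasets, decomposition
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the training data from the earlier cells so this snippet runs on its own.
breast_cancer = datasets.load_breast_cancer()
X_std = StandardScaler().fit_transform(breast_cancer.data)
x_train, x_test, y_train, y_test = train_test_split(
    X_std, breast_cancer.target, random_state=0)

pca = decomposition.PCA(n_components=15).fit(x_train)

# Eigenvalues of the training covariance matrix, sorted largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(x_train, rowvar=False))[::-1]
print(np.allclose(eigenvalues[:15], pca.explained_variance_))

# Fraction of the total variance retained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the assumed 95% target.
k_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print('Components needed for 95% of the variance: ', k_95)
```

`explained_variance_ratio_` divides each eigenvalue by the total variance of all 30 features, so the cumulative sum shows how much information a given number of components retains.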