# Compressing data via Dimension Reduction
This page explores using **Principal Component Analysis** (PCA) in scikit-learn for dimension reduction.

Sources:

* Python Machine Learning by Sebastian Raschka & Vahid Mirjalili
 -- Chapter 5 (PCA via scikit-learn)
* https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch05/ch05.ipynb





In [0]:
import pandas as pd
import scipy.io
import scipy.stats as stats
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pickle

## Importing the dataset (Julie's dataset)
It's best to save the file 'CORRECT-PROBABILITIES.csv' locally, then upload it to Colab with:

->files->upload->path to 'CORRECT-PROBABILITIES.csv' 

You can find the file in drive/data_sets



In [0]:
df = pd.read_csv('CORRECT-PROBABILITIES.csv')

## Selecting input / output
Most of the non-zero probabilities appear in column index 20, so that is the output column I use. Again, the goal here is to reduce dimensions and to understand the tools scikit-learn provides for doing so. We can reconfigure the data more accurately later.
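
One way to check which column holds the most non-zero values is to count them per column. The sketch below uses a small made-up DataFrame (`demo` and its column names are hypothetical stand-ins, not the real CSV), so it only illustrates the idea.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: three candidate output columns.
demo = pd.DataFrame({
    "out_a": [0.0, 0.0, 0.2, 0.0],
    "out_b": [0.5, 0.1, 0.0, 0.9],
    "out_c": [0.0, 0.0, 0.0, 0.3],
})

# Count the non-zero entries in each column.
nonzero_counts = (demo != 0).sum()

# Name of the column with the most non-zero entries.
best_col = nonzero_counts.idxmax()

print(nonzero_counts)
print("Column with most non-zero values:", best_col)
```

On the real data, the same count could be run on `df` to confirm the choice of output column.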

In [0]:
from sklearn.model_selection import train_test_split

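# Select the input features (column indices 1-4) and the output column (index 20)
X = df.iloc[:, 1:5].values
y = df.iloc[:, 20].values
# Cast to int so the values can serve as class labels
# (note: astype('int') truncates any fractional part)
y = y.astype('int')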

# Split the training / test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=0)

## Import StandardScaler to ...

In [0]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit AND transform the train set (the scaler does not use the labels)
X_train_std = sc.fit_transform(X_train)

# ONLY transform the test set
X_test_std = sc.transform(X_test)
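
To see what the fit / transform split does, the following sketch standardizes synthetic data (the `X_demo_*` arrays are made up for illustration, not taken from the dataset). It checks that the training set ends up with zero mean and unit variance per feature, and that the test set is scaled with the training set's statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 4 features, not centered and not unit variance.
rng = np.random.default_rng(0)
X_demo_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_demo_test = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

scaler = StandardScaler()
X_demo_train_std = scaler.fit_transform(X_demo_train)  # learns mean/std, then scales
X_demo_test_std = scaler.transform(X_demo_test)        # reuses the training mean/std

# After scaling, each training feature has mean ~0 and standard deviation ~1.
train_mean_ok = np.allclose(X_demo_train_std.mean(axis=0), 0.0)
train_std_ok = np.allclose(X_demo_train_std.std(axis=0), 1.0)

# The test set is scaled with the *training* statistics.
manual_test_std = (X_demo_test - X_demo_train.mean(axis=0)) / X_demo_train.std(axis=0)
test_matches_manual = np.allclose(manual_test_std, X_demo_test_std)

print(train_mean_ok, train_std_ok, test_matches_manual)
```

Fitting the scaler only on the training set keeps information about the test set out of the preprocessing step.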


## Now we can use PCA (Principal Component Analysis)

Documentation link(s): 
* https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html



In [0]:
from sklearn.decomposition import PCA

pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)

# Output the explained variance ratio of each principal component
pca.explained_variance_ratio_


array([0.3667326 , 0.25335367, 0.24870434, 0.13120939])
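
A cumulative sum of these ratios shows how much variance the first k components retain, which can help when choosing `n_components`. The sketch below reuses the ratios printed above; the 0.85 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

# Explained variance ratios copied from the output above.
ratios = np.array([0.3667326, 0.25335367, 0.24870434, 0.13120939])

# Variance retained by the first 1, 2, 3, and 4 components.
cumulative = np.cumsum(ratios)
for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.3f}")

# Smallest number of components that reaches a hypothetical threshold.
threshold = 0.85
k_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed for", threshold, "->", k_needed)
```

With these ratios, the first two components retain roughly 62% of the variance, so reducing to two components discards about 38%.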

### Fit and train the model


In [0]:
# Fit PCA on the training set, then apply the same projection to the test set
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Train the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=0, solver='lbfgs')
lr = lr.fit(X_train_pca, y_train)
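
The scale, project, and classify steps can also be chained in a single scikit-learn pipeline and scored on held-out data. This is a minimal sketch on synthetic data from `make_classification` (the `X_demo` / `y_demo` data and the `pipe` name are assumptions for illustration); it is not the notebook's dataset and the accuracy it prints says nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 features, binary labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

# Chain the same three steps used in this notebook.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(random_state=0, solver='lbfgs'))

# fit() fits the scaler and PCA on the training data only, then the classifier.
pipe.fit(X_demo_train, y_demo_train)

# score() applies the fitted scaler and PCA to the test data before predicting.
test_accuracy = pipe.score(X_demo_test, y_demo_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

A pipeline keeps the fit-on-train, transform-on-test discipline automatic, at the cost of hiding the intermediate arrays such as `X_train_pca`.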

### Output some things to see what's going on

In [0]:
print('X_train_std.shape: ', X_train_std.shape)
print('X_train_pca.shape: ', X_train_pca.shape)
print('X_train_std.size: ', X_train_std.size)

# The number of features / dimensions should now match the n_components
# value specified in PCA(n_components) above

X_train_std.shape:  (18029, 4)
X_train_pca.shape:  (18029, 2)
X_train_std.size:  72116


### By now we've reduced the features / dimensions