
In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

In [2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
mnist.data.shape

(70000, 784)

In [3]:
# With the default test_size, 75% of the rows go to training and 25% to testing
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, random_state=123)

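I want to retain 95% of the variance, so we can pass 0.95 as the `n_components` argument when constructing PCA. Given a fraction between 0 and 1, PCA keeps just enough components to explain that share of the variance.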

In [4]:
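# Standardize each pixel column before PCA
scaler = StandardScaler()
# A float in (0, 1) is read as the fraction of variance to retain
pca = PCA(.95)
processor = make_pipeline(scaler, pca)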
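
To make the role of the 0.95 argument concrete, the sketch below fits a full PCA on synthetic data, counts how many components are needed before the cumulative explained variance ratio passes 0.95, and compares that count with what `PCA(0.95)` selects. The data `X_demo` and the names `k_manual` and `k_auto` are illustrative assumptions and not part of this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 200 samples, 10 features with decaying spread
X_demo = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

# Fit all components and accumulate the share of variance each one explains
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 0.95
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let PCA pick the count itself from the same threshold
k_auto = int(PCA(0.95).fit(X_demo).n_components_)

print("Manual count:", k_manual)
print("PCA(0.95) count:", k_auto)
```

Both counts should agree, which is a quick way to check what the fractional argument means before applying it to the full MNIST data.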

In [5]:
processor.fit_transform(X_train).shape

(52500, 324)

Notice that PCA has cut the dimensionality from 784 features to 324, fewer than half, while retaining 95% of the variance.
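
The reduction can also be read off the fitted pipeline rather than from the output shape. The following is a minimal sketch that assumes scikit-learn's small bundled digits dataset (8x8 images, 64 features) as a stand-in for MNIST so that it runs without a download; `demo_processor`, `pca_step`, and `retained` are hypothetical names.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled digits dataset stands in for MNIST here
X_digits = load_digits().data

demo_processor = make_pipeline(StandardScaler(), PCA(0.95))
X_reduced = demo_processor.fit_transform(X_digits)

# make_pipeline names each step after its lowercased class name
pca_step = demo_processor.named_steps["pca"]
retained = float(pca_step.explained_variance_ratio_.sum())

print("Original features:", X_digits.shape[1])
print("Components kept:", pca_step.n_components_)
print("Variance retained:", round(retained, 4))
```

The same attributes (`n_components_` and `explained_variance_ratio_`) are available on the `pca` step of `processor` once it has been fitted on `X_train`.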

In [6]:
lr = LogisticRegression(solver='lbfgs', max_iter=10000)
lr_pipeline = make_pipeline(processor, lr)

In [7]:
lr_pipeline.fit(X_train, y_train)

Pipeline(steps=[('pipeline',
                 Pipeline(steps=[('standardscaler', StandardScaler()),
                                 ('pca', PCA(n_components=0.95))])),
                ('logisticregression', LogisticRegression(max_iter=10000))])
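
The repr above shows one pipeline nested inside another. As a design note, the sketch below suggests that a single flat pipeline with the same three steps should behave the same way; it again assumes the bundled digits dataset as a small stand-in for MNIST, and the names `nested`, `flat`, and `same_predictions` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: digits replaces MNIST to keep the example fast and offline
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_digits, y_digits, random_state=123)

# Nested form, mirroring the structure used in this notebook
nested = make_pipeline(
    make_pipeline(StandardScaler(), PCA(0.95)),
    LogisticRegression(max_iter=10000),
)
# Flat form: the same three steps in one pipeline
flat = make_pipeline(StandardScaler(), PCA(0.95), LogisticRegression(max_iter=10000))

nested.fit(X_tr, y_tr)
flat.fit(X_tr, y_tr)

# Compare the two forms on the held-out rows
same_predictions = np.array_equal(nested.predict(X_te), flat.predict(X_te))
print("Identical test predictions:", same_predictions)
```

Nesting is convenient when the preprocessing block is reused on its own, as `processor` is here; the flat form gives shorter step names when inspecting the fitted model.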

In [8]:
print('Training accuracy:', lr_pipeline.score(X_train, y_train))
print('Testing accuracy:', lr_pipeline.score(X_test, y_test))

Training accuracy: 0.9372380952380952
Testing accuracy: 0.9209714285714286
