# Logistic Regression + PCA

This notebook tests whether using PCA alongside a Logistic Regression brings any advantages, in accuracy or in running time.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd

Importing the dataset and showing its shape and first rows.

In [2]:
data = pd.read_csv('mnist.csv')

print(data.shape)
data.head()

(42000, 785)


Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Data Preprocessing

Splitting the data between labels and features.

In [3]:
labels = data[['label']]

data.drop(['label'], axis=1, inplace=True)

Showing the info on the labels and features data.

In [4]:
print(labels.shape)
labels.head()

(42000, 1)


Unnamed: 0,label
0,1
1,0
2,1
3,4
4,0


In [5]:
print(data.shape)
data.head()

(42000, 784)


Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Splitting all the data (labels and features) into training and testing sets and showing their info.

In [6]:
train_set, test_set, train_label, test_label = train_test_split(data, labels, test_size=0.3)

In [7]:
print(train_set.shape)
print(train_label.shape)

(29400, 784)
(29400, 1)


In [8]:
print(test_set.shape)
print(test_label.shape)

(12600, 784)
(12600, 1)
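The split above is random on every run, so the scores further down will vary slightly between executions. A minimal sketch of a reproducible, class-balanced split follows, using hypothetical toy data rather than `mnist.csv`; the `random_state` and `stratify` arguments are additions that the original cell does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 4 features, two classes in an 80/20 ratio.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```

With a fixed seed, differences between the two regressions can be attributed to PCA rather than to a different shuffle.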


Scaling the data using StandardScaler.

In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(train_set)

train_set = scaler.transform(train_set)
test_set = scaler.transform(test_set)
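As a small illustration of what the scaling step does, the sketch below fits a `StandardScaler` on hypothetical toy training data and checks that `transform` applies the training mean and scale to the test data. The constant second column mimics a pixel that is always zero, which is an assumption about how some MNIST border pixels behave.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the second column is constant, like an always-black pixel.
X_train_demo = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
X_test_demo = np.array([[7.0, 0.0]])

# The statistics come from the training data only.
demo_scaler = StandardScaler().fit(X_train_demo)

# transform() computes (x - mean_) / scale_ using those training statistics.
manual = (X_test_demo - demo_scaler.mean_) / demo_scaler.scale_
scaled = demo_scaler.transform(X_test_demo)

print(scaled)
print(demo_scaler.scale_)  # the constant column gets scale 1.0, avoiding division by zero
```

Fitting on the training set alone, as the cell above does, keeps information from the test set out of the preprocessing.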



## PCA

In [10]:
from sklearn.decomposition import PCA

Configuring PCA to retain at least 85% of the initial data variance.

Therefore, out of the 784 possible components, it will keep the smallest number whose cumulative explained variance satisfies this condition.

In [11]:
pca = PCA(0.85, svd_solver = 'full')
# pca = PCA(n_components = 2, svd_solver = 'auto')

Fitting PCA to the data and showing the number of components retained.

In [12]:
pca.fit(train_set)

print(pca.n_components_)

177
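To make the selection rule concrete, the sketch below reproduces it by hand on hypothetical toy data (not the MNIST set): it fits PCA with all components, accumulates `explained_variance_ratio_`, and finds the first count whose cumulative ratio exceeds 0.85. The variable names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples, 10 features with decreasing variance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

# Fit all components, then find the first count whose cumulative ratio exceeds 85%.
full_pca = PCA(svd_solver='full').fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.85, side='right') + 1)

# PCA(0.85) should settle on the same count.
auto_pca = PCA(0.85, svd_solver='full').fit(X_demo)
print(n_needed, auto_pca.n_components_)
```

Plotting `cumulative` against the component index is a common way to judge whether a threshold such as 0.85 is a reasonable choice.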


Transforming the data using PCA and saving the results in new arrays to compare later.

In [13]:
train_set_pca = pca.transform(train_set)
test_set_pca = pca.transform(test_set)

## Comparing Regressions

In [14]:
from sklearn.linear_model import LogisticRegression
from time import time

Running the whole Logistic Regression process (fit, score and timing) on the data both without and with PCA.

Using the 'lbfgs' solver because it was the fastest and most accurate of the ones available.

In [15]:
start = time()

logistic_regression = LogisticRegression(solver='lbfgs', multi_class='auto')

logistic_regression.fit(train_set, train_label.label.values)

score = logistic_regression.score(test_set, test_label.label.values)
print(score)

print('Finished in:', time() - start, 'seconds')

0.9032539682539683
Finished in: 13.795281171798706 seconds




In [16]:
start = time()

logistic_regression_pca = LogisticRegression(solver='lbfgs', multi_class='auto')

logistic_regression_pca.fit(train_set_pca, train_label.label.values)

score_pca = logistic_regression_pca.score(test_set_pca, test_label.label.values)
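# Report the PCA model's accuracy. The recorded run printed `score` (the
# non-PCA accuracy) here, so the value shown below is not the PCA result.
print(score_pca)

print('Finished in:', time() - start, 'seconds')

0.9032539682539683
Finished in: 5.987042188644409 seconds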
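The comparison can also be written so that each model is timed and scored by the same helper, which avoids mixing up the two score variables. This is a minimal sketch under stated assumptions, not the notebook's own method: it uses the small 8x8 digits set bundled with scikit-learn instead of `mnist.csv`, a hypothetical helper `fit_and_score`, and `max_iter=1000`, which the original cells do not set. Unlike the cells above, the PCA timing here includes fitting PCA itself.

```python
from time import perf_counter

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled 8x8 digits data stands in for mnist.csv.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)


def fit_and_score(model):
    """Fit on the training split and return (test accuracy, elapsed seconds)."""
    start = perf_counter()
    model.fit(X_tr, y_tr)
    accuracy = model.score(X_te, y_te)
    return accuracy, perf_counter() - start


# Each pipeline scales the data first; the second one adds PCA before the classifier.
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(
    StandardScaler(), PCA(0.85, svd_solver='full'), LogisticRegression(max_iter=1000)
)

results = {name: fit_and_score(model) for name, model in [('plain', plain), ('pca', with_pca)]}
for name, (accuracy, seconds) in results.items():
    print(f'{name}: accuracy={accuracy:.4f}, time={seconds:.2f}s')
```

Because each pipeline returns its own accuracy, the two results cannot be confused when printing. The numbers on this small dataset are not comparable with the MNIST timings above.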




## Conclusion

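With PCA, fitting and scoring the Logistic Regression took about 6.0 seconds instead of 13.8, less than half of the initial time. These timings do not include fitting and applying PCA itself. The run does not show whether the accuracy was retained: the second cell printed `score`, the accuracy without PCA (0.9033), rather than `score_pca`.

Of course this was a limited test, but it shows the potential of a technique such as PCA to speed up some algorithms.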