## Goal

Reduce the number of variables by merging correlated variables and extracting the most important features from the dataset (e.g., the directions that account for the most variance in the data).
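As a toy illustration of what merging correlated variables can look like, the sketch below builds two strongly correlated synthetic features and checks how much of their variance a single principal component retains. The data is made up for illustration (the names `a`, `b`, `toy` and the noise level are assumptions, not part of the digits example).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two hypothetical features: the second is roughly twice the first plus small noise
a = rng.normal(size=500)
b = 2 * a + 0.1 * rng.normal(size=500)
toy = np.column_stack([a, b])

# Project the two correlated columns onto a single principal component
toy_pca = PCA(n_components=1)
toy_1d = toy_pca.fit_transform(toy)

# Fraction of the total variance kept by that one component
toy_ratio = toy_pca.explained_variance_ratio_[0]
print(toy_1d.shape, toy_ratio)
```

Because the two columns carry nearly the same information, one component should keep almost all of the variance.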

In [1]:
from sklearn.datasets import load_digits

data = load_digits()

X = data.data
y = data.target

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train.shape

(1437, 64)

In [3]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_predict = logreg.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
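The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both is shown here, assuming a `StandardScaler` step and `max_iter=1000` are acceptable choices for this dataset. The names `scaled_logreg` and `scaled_accuracy` are hypothetical, and the resulting score may differ from the one reported in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Assumption: standardizing the pixel values and allowing more iterations
# gives the default solver a better chance to converge on this data
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, scaled_logreg.predict(X_test))
print(scaled_accuracy)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only and then reused on the test split.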


In [4]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
the_accuracy = accuracy_score(y_test, y_predict)
the_accuracy

0.9694444444444444

## Can we improve on that by reducing the number of variables? Let's see...

In [5]:
from sklearn.decomposition import PCA

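# A float between 0 and 1 asks PCA to keep as many components as needed
# to explain that fraction of the variance (here, 95%)
pca_model = PCA(n_components=0.95)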

pca_model.fit(X_train)

PCA(n_components=0.95)
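To see what `n_components=0.95` selected, the fitted model can be inspected through its `n_components_` and `explained_variance_ratio_` attributes. The sketch below refits PCA on the same split so that it runs on its own; `check_pca` and `cumulative` are hypothetical names used only here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

check_pca = PCA(n_components=0.95).fit(X_train)

# Number of components PCA decided to keep
print(check_pca.n_components_)

# Running total of the variance fraction explained by the kept components
cumulative = np.cumsum(check_pca.explained_variance_ratio_)
print(cumulative[-1])
```

The last cumulative value should sit at or just above the requested 0.95 threshold.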

In [6]:
X_train_pca = pca_model.transform(X_train)
X_train_pca.shape

(1437, 28)

In [7]:
X_test_pca = pca_model.transform(X_test)
X_test_pca.shape

(360, 28)

In [8]:
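# Note: this classifier uses an L1 penalty and the liblinear solver, unlike the
# default LogisticRegression() above, so the two scores differ in more than PCA
logreg4pca = LogisticRegression(C=1, penalty='l1', solver='liblinear')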
logreg4pca.fit(X_train_pca, y_train)

y_pred_pca = logreg4pca.predict(X_test_pca)

In [9]:
from sklearn.metrics import accuracy_score

pca_accuracy = accuracy_score(y_test, y_pred_pca)
pca_accuracy

0.9805555555555555
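The two accuracies above differ by four test images out of 360, come from a single train/test split, and were produced by classifiers with different settings. One hedged way to isolate the effect of PCA is to cross-validate two pipelines that differ only in the PCA step. The sketch below assumes a scaled, default-solver logistic regression for both, which is a swapped-in choice and not the configuration used in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Same classifier settings in both pipelines; only the PCA step differs
without_pca = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=1000)
)

# 5-fold cross-validated accuracy for each pipeline
scores_without = cross_val_score(without_pca, X, y, cv=5)
scores_with = cross_val_score(with_pca, X, y, cv=5)

print(scores_without.mean(), scores_with.mean())
```

If the two means are close relative to the spread across folds, the fairer reading is that PCA preserved accuracy with fewer variables, not that it improved it.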