
# Dimensionality Reduction using sklearn

## Imports

### Installation

In [1]:
!pip install scikit-learn==0.24

import sklearn
print(f'\nVersion: {sklearn.__version__}')


Version: 0.24.0


### Importing necessary classes and modules

In [2]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import Isomap, LocallyLinearEmbedding

## Sample Data

In [3]:
X, y = make_classification(n_samples = 1000, n_features = 20, n_informative = 10, n_redundant = 10, random_state = 7)
X.shape, y.shape

((1000, 20), (1000,))

## Cross-validation utility function

In [4]:
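def train_model(model, X, y):
    """Print the mean and standard deviation of the cross-validated accuracy of `model`."""
    # 10 stratified folds repeated 3 times yields 30 accuracy scores.
    cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 1)
    n_scores = cross_val_score(model, X, y, scoring = 'accuracy', cv = cv, n_jobs = -1)
    print(f'Accuracy:\nMean: {np.mean(n_scores):.3}\nStd Deviation {np.std(n_scores):.3}\n')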
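
The cross-validation scheme above evaluates each model on 10 folds, repeated 3 times, so every reported mean and standard deviation is computed over 30 accuracy scores. The sketch below checks this on a smaller dataset; `X_demo` and `y_demo` are illustrative names that do not appear in the notebook, and the reduced sample count is an assumption made only to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical small dataset, used only to keep the demonstration fast.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_redundant=10, random_state=7
)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, scoring='accuracy', cv=cv
)

# One score per fold: 10 splits x 3 repeats.
print(cv.get_n_splits(), scores.shape)
```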

## Models

### Baseline model using Logistic Regression

In [5]:
model = LogisticRegression()
model

LogisticRegression()

In [6]:
%%time
train_model(model, X, y)

Accuracy:
Mean: 0.824
Std Deviation 0.0338

CPU times: user 149 ms, sys: 44.1 ms, total: 193 ms
Wall time: 1.94 s


### Principal Component Analysis

In [7]:
model = Pipeline([('pca', PCA(n_components = 10)), ('logreg', LogisticRegression())])
model

Pipeline(steps=[('pca', PCA(n_components=10)),
                ('logreg', LogisticRegression())])

In [8]:
%%time
train_model(model, X, y)

Accuracy:
Mean: 0.824
Std Deviation 0.0338

CPU times: user 108 ms, sys: 6.05 ms, total: 114 ms
Wall time: 413 ms
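
A likely reason the PCA pipeline matches the baseline accuracy is that the 10 redundant features produced by `make_classification` are linear combinations of the 10 informative ones, so the data should have rank 10 and a 10-component PCA should discard almost nothing. The following sketch checks that assumption on the same sample data; the tolerance passed to `matrix_rank` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Same call as in the Sample Data section.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7
)

pca = PCA(n_components=10).fit(X)

# Fraction of the total variance kept by the 10 components.
retained = pca.explained_variance_ratio_.sum()

# Numerical rank of the 20-column feature matrix (tolerance chosen by hand).
rank = np.linalg.matrix_rank(X, tol=1e-8)

print(f'Variance retained: {retained:.6f}, rank: {rank}')
```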


### Truncated Singular Value Decomposition

In [9]:
model = Pipeline([('svd', TruncatedSVD(n_components = 10)), ('logreg', LogisticRegression())])
model

Pipeline(steps=[('svd', TruncatedSVD(n_components=10)),
                ('logreg', LogisticRegression())])

In [10]:
%%time
train_model(model, X, y)

Accuracy:
Mean: 0.824
Std Deviation 0.0338

CPU times: user 105 ms, sys: 9.64 ms, total: 115 ms
Wall time: 424 ms
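
`TruncatedSVD` differs from `PCA` mainly in that it does not center the data before decomposing it. As a rough check of that relationship, the sketch below fits `TruncatedSVD` on a manually centered copy of the data and compares its singular values with those from `PCA`. The solver choices (`svd_solver='full'`, `algorithm='arpack'`) and the tolerance are assumptions made here to get a deterministic, precise comparison; they are not the settings used in the pipeline above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD

X, _ = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7
)

# PCA centers the data internally; TruncatedSVD does not, so center by hand.
X_centered = X - X.mean(axis=0)

pca = PCA(n_components=10, svd_solver='full').fit(X)
svd = TruncatedSVD(n_components=10, algorithm='arpack', random_state=0).fit(X_centered)

# After centering, both decompositions should report the same singular values.
same_spectrum = np.allclose(pca.singular_values_, svd.singular_values_, rtol=1e-5)
print(same_spectrum)
```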


### Linear Discriminant Analysis

In [11]:
model = Pipeline([('lda', LinearDiscriminantAnalysis(n_components = 1)), ('logreg', LogisticRegression())])
model

Pipeline(steps=[('lda', LinearDiscriminantAnalysis(n_components=1)),
                ('logreg', LogisticRegression())])

In [12]:
%%time
train_model(model, X, y)

Accuracy:
Mean: 0.825
Std Deviation 0.0341

CPU times: user 107 ms, sys: 8.19 ms, total: 115 ms
Wall time: 327 ms
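
The LDA step uses `n_components = 1` because LDA can produce at most `min(n_features, n_classes - 1)` components, which is 1 for this binary problem. The sketch below illustrates the limit; the `ValueError` for a larger value is the behavior of recent scikit-learn releases, including the 0.24 installed above, and older releases may behave differently.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7
)

# With two classes, the projection has a single discriminant axis.
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)

# Asking for more components than n_classes - 1 is rejected at fit time.
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X, y)
    two_components_allowed = True
except ValueError as err:
    two_components_allowed = False
    print(f'ValueError: {err}')
```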


### Isomap Embedding

In [13]:
model = Pipeline([('iso', Isomap(n_components = 10)), ('logreg', LogisticRegression())])
model

Pipeline(steps=[('iso', Isomap(n_components=10)),
                ('logreg', LogisticRegression())])

In [14]:
%%time
train_model(model, X, y)

Accuracy:
Mean: 0.888
Std Deviation 0.0287

CPU times: user 246 ms, sys: 13.1 ms, total: 260 ms
Wall time: 10.9 s
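
Inside the pipeline, `Isomap` is fitted on the training folds only and then has to embed unseen points from the held-out fold through `transform`. The sketch below reproduces that pattern on a single split. The 80/20 split and the names `X_train`, `X_test`, `Z_train`, and `Z_test` are illustrative assumptions, and `n_neighbors` is left at its default of 5, as in the pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7
)

# Illustrative single split; the notebook itself uses repeated 10-fold CV.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1, stratify=y)

iso = Isomap(n_components=10)  # n_neighbors defaults to 5

# Learn the embedding from the training points only...
Z_train = iso.fit_transform(X_train)

# ...then map held-out points into the same 10-dimensional space.
Z_test = iso.transform(X_test)

print(Z_train.shape, Z_test.shape)
```

Fitting requires a neighbor graph and shortest-path distances between all training points, which is consistent with the longer wall time compared with the linear methods.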


### Locally Linear Embedding

In [15]:
model = Pipeline([('lle', LocallyLinearEmbedding(n_components = 10)), ('logreg', LogisticRegression())])
model

Pipeline(steps=[('lle', LocallyLinearEmbedding(n_components=10)),
                ('logreg', LogisticRegression())])

In [16]:
%%time
train_model(model, X, y)

Accuracy:
Mean: 0.886
Std Deviation 0.0284

CPU times: user 210 ms, sys: 10.9 ms, total: 221 ms
Wall time: 6.51 s


### Modified Locally Linear Embedding

In [17]:
model = Pipeline([('lle', LocallyLinearEmbedding(n_components = 5, method = 'modified', n_neighbors = 10)), ('logreg', LogisticRegression())])
model

Pipeline(steps=[('lle',
                 LocallyLinearEmbedding(method='modified', n_components=5,
                                        n_neighbors=10)),
                ('logreg', LogisticRegression())])

In [18]:
%%time
train_model(model, X, y)

Accuracy:
Mean: 0.848
Std Deviation 0.0367

CPU times: user 226 ms, sys: 21.6 ms, total: 248 ms
Wall time: 10.9 s
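
One possible reason this cell sets `n_neighbors = 10` and `n_components = 5` explicitly is that modified LLE requires `n_neighbors >= n_components`, so the default of 5 neighbors cannot be combined with 10 components. The sketch below illustrates the constraint on a subsample. The subsample size, the name `X_small`, and `eigen_solver='dense'` are assumptions made here for a quick, deterministic run; they are not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7
)
X_small = X[:300]  # hypothetical subsample to keep the example fast

# Default n_neighbors is 5, which is fewer than the 10 requested components.
try:
    LocallyLinearEmbedding(
        n_components=10, method='modified', eigen_solver='dense'
    ).fit(X_small)
    default_neighbors_ok = True
except ValueError as err:
    default_neighbors_ok = False
    print(f'ValueError: {err}')

# The notebook's setting satisfies n_neighbors >= n_components.
mlle = LocallyLinearEmbedding(
    n_components=5, method='modified', n_neighbors=10, eigen_solver='dense'
)
Z = mlle.fit_transform(X_small)
print(Z.shape)
```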
