### Dimensionality Reduction

- Dimensionality reduction means reducing the number of features in the data
- Two common techniques are:
       a. Principal Component Analysis (PCA)
       b. Linear Discriminant Analysis (LDA)
- Working: derive a new set of m features from the n original features in the data
- m < n
- The focus, again, is to maintain accuracy while using fewer features
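
To make the "derive m features from n" step concrete, here is a minimal sketch of what PCA does under the hood, assuming plain NumPy and a synthetic array in place of the real data. The helper name `pca_project` and the array `X_demo` are hypothetical and only used for illustration; scikit-learn's `PCA` is what the rest of this notebook uses.

```python
import numpy as np

def pca_project(X, m):
    """Project an (n_samples, n) array onto its first m principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)          # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:m].T             # shape (n_samples, m)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))           # synthetic stand-in with n = 4 features
Z = pca_project(X_demo, m=2)                 # m = 2 derived features
print(Z.shape)  # (150, 2)
```

Each new column is a linear combination of all the original columns, so no single original feature is simply dropped.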

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = sns.load_dataset('iris')
data.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## Separate X and y

In [3]:
X = data.drop('species', axis = 1)
y = data['species']

## Split the data using train_test_split

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

## Encoding the species column of the data

In [5]:
dic = {'setosa' : 0, 'virginica' : 1, 'versicolor' : 2}
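## Note: y_train and y_test were created in In [4], before this step, so they still hold the species names
y = y.replace(dic)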
y

0      0
1      0
2      0
3      0
4      0
      ..
145    1
146    1
147    1
148    1
149    1
Name: species, Length: 150, dtype: int64
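
As a small illustration of the encoding step, the sketch below uses `Series.map`, which is an alternative to `replace` for a one-to-one label mapping. The four-row `species` Series is a hypothetical mini-sample, not the real column; the dictionary is the same one used above.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the 'species' column
species = pd.Series(['setosa', 'virginica', 'versicolor', 'setosa'], name='species')

# Same mapping as in the cell above
dic = {'setosa': 0, 'virginica': 1, 'versicolor': 2}

encoded = species.map(dic)                    # labels missing from dic would become NaN
inverse = {code: name for name, code in dic.items()}
decoded = encoded.map(inverse)                # recover the original names

print(encoded.tolist())                       # [0, 1, 2, 0]
print(decoded.tolist() == species.tolist())   # True
```

Keeping the inverse dictionary around makes it easy to turn numeric predictions back into species names later.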

## Apply Logistic Regression on the data

In [6]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr

In [7]:
lr.fit(X_train, y_train)

In [8]:
y_pred = lr.predict(X_test)

In [9]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9736842105263158

## Apply Principal Component Analysis to reduce the number of features in the data

In [10]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)    ## Derive 2 new columns from the 4 columns present in the data
pca

In [11]:
res = pca.fit_transform(X)    ## fit PCA on all 150 rows and project them onto the 2 components
res

array([[-2.68412563,  0.31939725],
       [-2.71414169, -0.17700123],
       [-2.88899057, -0.14494943],
       [-2.74534286, -0.31829898],
       [-2.72871654,  0.32675451],
       [-2.28085963,  0.74133045],
       [-2.82053775, -0.08946138],
       [-2.62614497,  0.16338496],
       [-2.88638273, -0.57831175],
       [-2.6727558 , -0.11377425],
       [-2.50694709,  0.6450689 ],
       [-2.61275523,  0.01472994],
       [-2.78610927, -0.235112  ],
       [-3.22380374, -0.51139459],
       [-2.64475039,  1.17876464],
       [-2.38603903,  1.33806233],
       [-2.62352788,  0.81067951],
       [-2.64829671,  0.31184914],
       [-2.19982032,  0.87283904],
       [-2.5879864 ,  0.51356031],
       [-2.31025622,  0.39134594],
       [-2.54370523,  0.43299606],
       [-3.21593942,  0.13346807],
       [-2.30273318,  0.09870885],
       [-2.35575405, -0.03728186],
       [-2.50666891, -0.14601688],
       [-2.46882007,  0.13095149],
       [-2.56231991,  0.36771886],
       ...
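
One way to judge whether 2 components are enough is to look at how much of the total variance they keep. The sketch below is a hedged example, assuming scikit-learn's bundled `load_iris()` as an offline stand-in for `sns.load_dataset('iris')` (the two copies are assumed to be equivalent for this purpose); `X_iris` and `pca_check` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: load_iris() holds the same four measurements as the seaborn copy
X_iris = load_iris().data                      # shape (150, 4)

pca_check = PCA(n_components=2).fit(X_iris)
ratio = pca_check.explained_variance_ratio_    # one value per component

print(ratio)         # share of total variance captured by PC1 and PC2
print(ratio.sum())   # combined share retained by the two components
```

A combined share close to 1 suggests the two derived columns carry most of the information in the original four.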

## Convert PCA output to a DataFrame

In [12]:
pca_data = pd.DataFrame(res, columns = ['PC1', 'PC2'])
pca_data['Species'] = y
pca_data

|   | PC1 | PC2 | Species |
|---|---|---|---|
| 0 | -2.684126 | 0.319397 | 0 |
| 1 | -2.714142 | -0.177001 | 0 |
| 2 | -2.888991 | -0.144949 | 0 |
| 3 | -2.745343 | -0.318299 | 0 |
| 4 | -2.728717 | 0.326755 | 0 |
| ... | ... | ... | ... |
| 145 | 1.944110 | 0.187532 | 1 |
| 146 | 1.527167 | -0.375317 | 1 |
| 147 | 1.764346 | 0.078859 | 1 |
| 148 | 1.900942 | 0.116628 | 1 |


## Separate X and y

In [13]:
X = pca_data.drop('Species', axis = 1)
y = pca_data['Species']

## Split the data using train_test_split

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

## Apply Logistic Regression on the data

In [15]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr

In [16]:
lr.fit(X_train, y_train)

In [17]:
y_pred = lr.predict(X_test)
y_pred

array([1, 2, 0, 1, 0, 1, 0, 2, 2, 2, 1, 2, 2, 2, 2, 0, 2, 2, 0, 0, 1, 2,
       0, 0, 1, 0, 0, 2, 2, 0, 1, 2, 0, 1, 1, 2, 0, 1], dtype=int64)

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9736842105263158

## Conclusion

- Before PCA, with all 4 features in the data, the accuracy of the model on the test set was 97.37%
- After PCA, with 2 newly derived features from the original data, the accuracy of the model is still 97.37%
- We conclude that, for this dataset and this split, reducing the data from 4 columns to 2 did not reduce accuracy
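
One caveat on the comparison above: in In [11] PCA was fit on all 150 rows before the split, so the test rows influenced the components. A stricter variant fits PCA on the training rows only. The sketch below shows one way to do that with a scikit-learn pipeline, assuming `load_iris()` as an offline stand-in for the seaborn dataset and `max_iter=1000` as an illustrative setting; its accuracy may differ slightly from the figure reported above, so no expected value is given.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumption: load_iris() is used in place of sns.load_dataset('iris')
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=0
)

# The pipeline fits PCA on the training rows only, then trains the classifier
pipe = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# predict() applies the already-fitted PCA to the test rows before classifying
acc = accuracy_score(y_te, pipe.predict(X_te))
print(acc)
```

Wrapping both steps in one pipeline also means the same object can be passed to cross-validation utilities without refitting PCA by hand.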