Kernel PCA transforms the dataset into a new set of features, with the aim of making the data much easier to classify.

In [1]:
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

from sklearn.decomposition import KernelPCA
from sklearn.decomposition import IncrementalPCA

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Reading dataset

In [2]:
df_heart = pd.read_csv("../ml_pro/data/heart.csv")

In [3]:
df_heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


Target and features split

In [4]:
df_features = df_heart.drop(['target'], axis=1)
df_target = df_heart['target']

Feature normalization

In [5]:
df_features = StandardScaler().fit_transform(df_features)
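
Note that the cell above fits the scaler on the full dataset, so the test rows contribute to the mean and standard deviation used for scaling. A minimal sketch of the alternative, fitting the scaler on the training rows only, is shown here. The random array is an assumed stand-in for the 13 heart features, not the real data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 feature columns (assumption, not heart.csv).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 13))
y = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_tr)    # mean and std come from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test rows reuse the training statistics
```

Scores obtained this way may differ slightly from the ones reported in this notebook, since the scaling statistics change.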

Train/test split. random_state is fixed so that the same split is produced on every run.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size=0.3, random_state=42)
print(X_train.shape)
print(y_train.shape)

(717, 13)
(717,)
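
The effect of random_state can be checked with a small sketch: two calls with the same seed should return identical partitions. The toy arrays below are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 columns (illustrative only).
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

# True when both calls selected exactly the same rows in the same order.
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
```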


Implementing Kernel PCA. The number of components is set to 4 for testing purposes only. Kernel: polynomial

In [7]:
kpca = KernelPCA(n_components=4, kernel='poly')
kpca.fit(X_train)

KernelPCA(kernel='poly', n_components=4)
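
The cell above relies on scikit-learn's default polynomial kernel settings. As a sketch, the defaults can be written out explicitly (degree=3, coef0=1, and gamma equal to 1 / n_features when left as None) and compared with the implicit version. The random input is an assumed placeholder for the scaled features.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Placeholder input with 13 columns (assumption, not the heart dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 13))

# Implicit defaults, as used in the notebook cell.
default_kpca = KernelPCA(n_components=4, kernel='poly')

# The same kernel with its parameters spelled out.
explicit_kpca = KernelPCA(
    n_components=4, kernel='poly', degree=3, gamma=1.0 / X.shape[1], coef0=1
)

Z_default = default_kpca.fit_transform(X)
Z_explicit = explicit_kpca.fit_transform(X)

# Compare absolute values, since the sign of each component is arbitrary.
same_projection = np.allclose(np.abs(Z_default), np.abs(Z_explicit))
```

Writing the parameters out makes it easier to tune them later, for example by trying a different degree.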

Applying logistic regression and getting the accuracy score

In [8]:
df_train = kpca.transform(X_train)
df_test = kpca.transform(X_test)

logistic = LogisticRegression(solver='lbfgs')

logistic.fit(df_train, y_train)
print("SCORE Kernel PCA: ", logistic.score(df_test, y_test))

SCORE Kernel PCA:  0.7987012987012987


Conclusion: we reduced the problem from 13 features to 4 and still obtained a test accuracy of 79.9%. Not bad.

We now use Kernel PCA with a linear kernel

In [9]:
kpca = KernelPCA(n_components=4, kernel='linear')
kpca.fit(X_train)

df_train = kpca.transform(X_train)
df_test = kpca.transform(X_test)

logistic = LogisticRegression(solver='lbfgs')

logistic.fit(df_train, y_train)
print("SCORE Kernel PCA Linear: ", logistic.score(df_test, y_test))

SCORE Kernel PCA Linear:  0.8214285714285714
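
With a linear kernel, Kernel PCA should produce the same projection as ordinary PCA, up to the sign of each component and numerical precision. The sketch below checks this on random data, which is an assumed stand-in for the scaled features.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

# Placeholder input with 13 columns (assumption, not the heart dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))

Z_kpca = KernelPCA(n_components=4, kernel='linear').fit_transform(X)
Z_pca = PCA(n_components=4).fit_transform(X)

# Component signs are arbitrary, so compare absolute values.
matches = np.allclose(np.abs(Z_kpca), np.abs(Z_pca), atol=1e-6)
print(matches)
```

In other words, the linear run acts as a plain PCA baseline against which the non-linear kernels can be judged.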


We now use Kernel PCA with a Gaussian (RBF) kernel

In [11]:
kpca = KernelPCA(n_components=4, kernel='rbf')
kpca.fit(X_train)

df_train = kpca.transform(X_train)
df_test = kpca.transform(X_test)

logistic = LogisticRegression(solver='lbfgs')

logistic.fit(df_train, y_train)
print("SCORE Kernel PCA Gaussian: ", logistic.score(df_test, y_test))

SCORE Kernel PCA Gaussian:  0.8181818181818182
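
The three experiments above repeat the same steps with a different kernel each time. One possible way to express that as a single loop is sketched below, using a scikit-learn pipeline so that scaling, Kernel PCA, and logistic regression are all fit on the training rows only. The data comes from make_classification as an assumed stand-in for heart.csv, so the printed accuracies are not comparable with the scores reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 13 features (assumption).
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scores = {}
for kernel in ('poly', 'linear', 'rbf'):
    # Scale, project to 4 components, then classify.
    model = make_pipeline(
        StandardScaler(),
        KernelPCA(n_components=4, kernel=kernel),
        LogisticRegression(solver='lbfgs'),
    )
    model.fit(X_tr, y_tr)
    scores[kernel] = model.score(X_te, y_te)  # accuracy on the held-out rows

for kernel, acc in scores.items():
    print(f"{kernel}: {acc:.3f}")
```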
