# Enhancing SVM Classification with Data Cleaning Techniques

In this project, we explore the impact of data cleaning techniques on the performance of SVM classifiers. By artificially introducing label noise into a synthetic dataset and then applying various data cleaning methods (KNN, K-Means, GMM), we aim to investigate how these preprocessing steps affect SVM classification accuracy. This study provides insights into the robustness of SVM and the effectiveness of different data cleaning strategies.


In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
import random
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

test_size = 1000

## Data Generation and Initial Setup

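We start by generating synthetic two-dimensional data points from Gaussian distributions; label noise is introduced later, inside each experiment loop. This setup simulates a real-world scenario in which labels may be noisy or inaccurate. Our objective is to clean the labels and improve the classification accuracy of an SVM model. Note that, as written, both classes are drawn from the same distribution (mean 1, standard deviation 1), so the features alone cannot separate them.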


In [16]:
co_p = np.random.normal(1,1,size=(1000+test_size,2))
co_n = np.random.normal(1,1,size=(1000+test_size,2))

X_p = pd.DataFrame(co_p,columns=['x0','x1'])
X_n = pd.DataFrame(co_n,columns=['x0','x1'])
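Because the draws above are unseeded, the data changes from run to run. As a minimal sketch, assuming one wants reproducible draws and the option to move the class centers apart, the generation step could be wrapped as follows. The helper `make_classes` and its parameters `mean_p`, `mean_n`, `std`, and `seed` are hypothetical names, not part of the original notebook; the defaults reproduce the notebook's distribution.

```python
import numpy as np
import pandas as pd

def make_classes(n_per_class, mean_p=1.0, mean_n=1.0, std=1.0, seed=0):
    """Draw two 2D Gaussian classes with a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    co_p = rng.normal(mean_p, std, size=(n_per_class, 2))
    co_n = rng.normal(mean_n, std, size=(n_per_class, 2))
    X_p = pd.DataFrame(co_p, columns=['x0', 'x1'])
    X_n = pd.DataFrame(co_n, columns=['x0', 'x1'])
    return X_p, X_n

# Assumption for illustration only: centers at +1 and -1 instead of the shared mean 1
X_p_demo, X_n_demo = make_classes(2000, mean_p=1.0, mean_n=-1.0)
print(X_p_demo.mean().round(2).to_dict(), X_n_demo.mean().round(2).to_dict())
```

Calling `make_classes` twice with the same `seed` returns identical frames, which makes the downstream scores comparable across cells.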

## Exploratory Data Analysis

Before applying any cleaning techniques, we take a quick look at the synthetic dataset. The first few rows of each class show the two feature columns, `x0` and `x1`, and give an initial sense of the range of values.


In [17]:
X_p.head(3)

         x0        x1
0  0.310211  0.062671
1  1.341702  1.586315
2  0.033272  1.867829


In [18]:
X_n.head(3)

         x0        x1
0  1.292501  0.904944
1  1.551903  1.655517
2  1.212995  1.573792
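Three rows per class say little about how much the classes overlap. A minimal sketch of a numeric check, assuming the same column layout as `X_p` and `X_n`: it reports the per-class mean and standard deviation and the distance between the two class centroids. The helper `class_summary` and the demo frames are hypothetical; the demo data is regenerated with a fixed seed so the block runs on its own.

```python
import numpy as np
import pandas as pd

def class_summary(X_p, X_n):
    """Return per-class feature mean/std and the distance between class centroids."""
    summary = pd.concat({
        'positive': X_p.agg(['mean', 'std']),
        'negative': X_n.agg(['mean', 'std']),
    })
    centroid_gap = float(np.linalg.norm(X_p.mean().to_numpy() - X_n.mean().to_numpy()))
    return summary, centroid_gap

# Demo data drawn the same way as in the notebook (both classes from N(1, 1)), but seeded
rng = np.random.default_rng(0)
X_p_demo = pd.DataFrame(rng.normal(1, 1, size=(2000, 2)), columns=['x0', 'x1'])
X_n_demo = pd.DataFrame(rng.normal(1, 1, size=(2000, 2)), columns=['x0', 'x1'])

summary, centroid_gap = class_summary(X_p_demo, X_n_demo)
print(summary.round(3))
print("Centroid gap: {:.3f}".format(centroid_gap))
```

A centroid gap that is small relative to the standard deviations indicates that the two classes occupy essentially the same region of feature space.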


## SVM Classification with Noisy Labels

Initially, we train an SVM classifier on the noisy dataset to establish a baseline for classification accuracy. This step will allow us to quantify the improvement brought by subsequent data cleaning methods.
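One variation on this baseline, offered as a sketch and not as the notebook's exact procedure, flips labels in a vectorized way, shuffles before splitting, and scores the classifier against both the noisy and the clean test labels. The helper `flip_labels`, the class centers at +2 and -2, the flip probabilities 0.35 and 0.20 (approximating the `randint` thresholds used below), and the seed are all assumptions made for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

def flip_labels(y, flip_prob, rng):
    """Return a copy of y in which each label is negated with probability flip_prob."""
    flip = rng.random(len(y)) < flip_prob
    return np.where(flip, -y, y)

rng = np.random.default_rng(0)
n = 2000

# Assumption: separable class centers, unlike the notebook's shared mean of 1
X = np.vstack([rng.normal(2, 1, size=(n, 2)), rng.normal(-2, 1, size=(n, 2))])
y_true = np.concatenate([np.full(n, 1), np.full(n, -1)])
y_noisy = np.concatenate([
    flip_labels(y_true[:n], 0.35, rng),   # positives flipped to -1
    flip_labels(y_true[n:], 0.20, rng),   # negatives flipped to +1
])

# Shuffled, stratified split so both classes appear in training and test sets
X_tr, X_te, yn_tr, yn_te, yt_tr, yt_te = train_test_split(
    X, y_noisy, y_true, test_size=0.5, random_state=0, stratify=y_true
)

# Same hyperparameters as the notebook's SVM, trained on noisy labels only
clf = svm.SVC(kernel='rbf', C=1, gamma=0.01)
clf.fit(X_tr, yn_tr)

acc_noisy = clf.score(X_te, yn_te)   # agreement with noisy test labels
acc_clean = clf.score(X_te, yt_te)   # agreement with the true labels
print("Noisy-label accuracy: {:.3f}".format(acc_noisy))
print("Clean-label accuracy: {:.3f}".format(acc_clean))
```

Keeping `y_true` alongside `y_noisy` makes it possible to separate how well the model fits the noise from how well it recovers the underlying classes.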


In [19]:
mean = 0
for i in range(0,20):

    # Flipping Labels
    # randint(0,100) is inclusive on both ends (101 outcomes):
    # a positive keeps +1 for 65 of them, a negative keeps -1 for 80 of them
    y_p = np.full(1000+test_size,1)
    y_n = np.full(1000+test_size,-1)
    for j in range(0,len(y_p)):
        y_p[j] = 1 if (random.randint(0,100)>35) else -1
    for j in range(0,len(y_n)):
        y_n[j] = -1 if (random.randint(0,100)>20) else 1

    # Constructing Dataframe (positive rows stacked on top of negative rows)
    df_p = pd.concat([X_p, pd.DataFrame(y_p, columns=['y'])], axis=1)
    df_n = pd.concat([X_n, pd.DataFrame(y_n, columns=['y'])], axis=1)
    df = pd.concat([df_p, df_n], axis=0)

    # Running SVM
    # The frame is not shuffled: rows 0..999 are all positive-class samples,
    # rows 1000 onward hold the remaining positives and every negative
    clf_a = svm.SVC(kernel='rbf', C=1, gamma=0.01)
    clf_a.fit(df[['x0','x1']][:1000],df['y'][:1000])
    score = clf_a.score(df[['x0','x1']][1000:],df['y'][1000:])

    # Report Scores
    mean +=score

mean = mean/20
print("Mean: {:.5f}".format(mean))


Mean: 0.35270
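One plausible reading of this figure, stated here as an assumption the notebook does not verify: the stacked frame is sliced without shuffling, so the SVM trains only on positive-class rows, where about 65% of the noisy labels are +1. Since both classes share one feature distribution, the classifier may simply predict +1 everywhere. The arithmetic below computes the accuracy that would follow in that case; all variable names are illustrative.

```python
# Layout of the stacked frame: 2000 positive rows followed by 2000 negative rows
n_per_class = 2000
n_train = 1000                       # rows [:1000], all from the positive class
n_pos_test = n_per_class - n_train   # remaining positive rows in [1000:]
n_neg_test = n_per_class             # every negative row is in [1000:]

# random.randint(0, 100) has 101 equally likely outcomes
p_plus_given_pos = 65 / 101          # positive row keeps label +1 (outcomes 36..100)
p_plus_given_neg = 21 / 101          # negative row is flipped to +1 (outcomes 0..20)

# Accuracy of a classifier that predicts +1 for every test row
expected = (n_pos_test * p_plus_given_pos + n_neg_test * p_plus_given_neg) / (
    n_pos_test + n_neg_test
)
print("Expected accuracy under the all-+1 assumption: {:.4f}".format(expected))
```

The result is about 0.353, close to the reported mean, which suggests the baseline score reflects the split and the label proportions more than any learned decision boundary.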


## Data Cleaning with KNN

We apply the K-Nearest Neighbors algorithm to clean the data. The idea is to see if using the labels of nearest neighbors can reduce label noise and improve SVM performance.
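The relabeling idea can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the notebook's experiment: the helper `knn_relabel`, the class centers at +2 and -2, the symmetric 20% flip rate, `n_neighbors=25`, and the seed are all hypothetical choices. It measures agreement with the true labels before and after cleaning.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_relabel(X, y_noisy, n_neighbors=25):
    """Replace each label with the majority vote of its nearest neighbours."""
    cleaner = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y_noisy)
    # Predicting on the training points: each point counts among its own neighbours
    return cleaner.predict(X)

rng = np.random.default_rng(0)
n = 500

# Assumption: two separable Gaussian classes
X = np.vstack([rng.normal(2, 1, size=(n, 2)), rng.normal(-2, 1, size=(n, 2))])
y_true = np.concatenate([np.full(n, 1), np.full(n, -1)])

# Flip roughly 20% of all labels
flip = rng.random(2 * n) < 0.2
y_noisy = np.where(flip, -y_true, y_true)

y_clean = knn_relabel(X, y_noisy)
agree_before = np.mean(y_noisy == y_true)
agree_after = np.mean(y_clean == y_true)
print("Agreement with true labels before: {:.3f}".format(agree_before))
print("Agreement with true labels after:  {:.3f}".format(agree_after))
```

Neighbour voting can only help when nearby points tend to share a true label, so the benefit depends on how well the classes are separated in feature space.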


In [20]:
mean = 0
for i in range(0,20):

    # Flipping Labels
    y_p = np.full(1000+test_size,1)
    y_n = np.full(1000+test_size,-1)
    for j in range(0,len(y_p)):
        y_p[j] = 1 if (random.randint(0,100)>35) else -1
    for j in range(0,len(y_n)):
        y_n[j] = -1 if (random.randint(0,100)>20) else 1

    # Constructing Dataframe (axis is keyword-only in pandas 2.0 and later)
    df_p = pd.concat([X_p,pd.DataFrame(y_p,columns=['y'])],axis=1)
    df_n = pd.concat([X_n,pd.DataFrame(y_n,columns=['y'])],axis=1)
    df  = pd.concat([df_p,df_n],axis=0)

    # Cleaning: replace each label with the majority vote of its 100 nearest neighbours
    cleaner = KNeighborsClassifier(n_neighbors=100).fit(df[['x0','x1']],df['y'])
    df['y'] = cleaner.predict(df[['x0','x1']])

    # Running SVM (trained and scored on the cleaned labels)
    clf_a = svm.SVC(kernel='rbf', C=1, gamma=0.01)
    clf_a.fit(df[['x0','x1']][:1000],df['y'][:1000])
    score = clf_a.score(df[['x0','x1']][1000:],df['y'][1000:])

    # Report Scores
    print(score)
    mean +=score

mean = mean/20
mean

## Data Cleaning with K-Means

Next, we replace the noisy labels with K-Means cluster assignments. K-Means ignores the labels entirely and groups points by position, so the SVM below is trained and scored on cluster indices (0 and 1) instead of the original -1/+1 labels.


In [21]:
mean = 0
for i in range(0,20):

    # Flipping Labels
    y_p = np.full(1000+test_size,1)
    y_n = np.full(1000+test_size,-1)
    for j in range(0,len(y_p)):
        y_p[j] = 1 if (random.randint(0,100)>35) else -1
    for j in range(0,len(y_n)):
        y_n[j] = -1 if (random.randint(0,100)>20) else 1

    # Constructing Dataframe (axis is keyword-only in pandas 2.0 and later)
    df_p = pd.concat([X_p,pd.DataFrame(y_p,columns=['y'])],axis=1)
    df_n = pd.concat([X_n,pd.DataFrame(y_n,columns=['y'])],axis=1)
    df  = pd.concat([df_p,df_n],axis=0)

    # Cleaning: overwrite the noisy labels with K-Means cluster indices (0 or 1)
    df['y'] = KMeans(n_clusters=2, random_state=0).fit_predict(df[['x0','x1']])

    # Running SVM (trained and scored on the cluster indices)
    clf_a = svm.SVC(kernel='rbf', C=1, gamma=0.01)
    clf_a.fit(df[['x0','x1']][:1000],df['y'][:1000])
    score = clf_a.score(df[['x0','x1']][1000:],df['y'][1000:])

    # Report Scores
    print(score)
    mean +=score

mean = mean/20
mean
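K-Means returns arbitrary cluster indices (0 and 1) and not the -1/+1 labels used elsewhere in this notebook, so the indices need to be mapped back to class labels before they can be compared with the originals. A minimal sketch, assuming each cluster should take the majority noisy label of its members; the helper `clusters_to_labels`, the separable class centers, and the seed are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

def clusters_to_labels(cluster_ids, y_noisy):
    """Map each cluster id to the majority noisy label (-1 or +1) of its members."""
    mapped = np.empty_like(y_noisy)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        # Labels are -1/+1, so the sign of the sum gives the majority (ties go to +1)
        mapped[members] = 1 if y_noisy[members].sum() >= 0 else -1
    return mapped

rng = np.random.default_rng(0)
n = 500

# Assumption: two separable Gaussian classes
X = np.vstack([rng.normal(2, 1, size=(n, 2)), rng.normal(-2, 1, size=(n, 2))])
y_true = np.concatenate([np.full(n, 1), np.full(n, -1)])

# Flip about 35% of positives and 20% of negatives, as in the notebook's noise model
flip_prob = np.concatenate([np.full(n, 0.35), np.full(n, 0.20)])
y_noisy = np.where(rng.random(2 * n) < flip_prob, -y_true, y_true)

cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
y_clean = clusters_to_labels(cluster_ids, y_noisy)
print("Agreement with true labels: {:.3f}".format(np.mean(y_clean == y_true)))
```

The majority vote stays reliable only while fewer than half of the labels in each cluster are flipped.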

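## Data Cleaning with GMM

Finally, we replace the noisy labels with the component assignments of a two-component Gaussian Mixture Model. As with K-Means, the mixture is fitted on the features alone, so the SVM below is trained and scored on component indices (0 and 1) instead of the original -1/+1 labels.


In [22]:
mean = 0
for i in range(0,20):

    # Flipping Labels
    y_p = np.full(1000+test_size,1)
    y_n = np.full(1000+test_size,-1)
    for j in range(0,len(y_p)):
        y_p[j] = 1 if (random.randint(0,100)>35) else -1
    for j in range(0,len(y_n)):
        y_n[j] = -1 if (random.randint(0,100)>20) else 1

    # Constructing Dataframe (axis is keyword-only in pandas 2.0 and later)
    df_p = pd.concat([X_p,pd.DataFrame(y_p,columns=['y'])],axis=1)
    df_n = pd.concat([X_n,pd.DataFrame(y_n,columns=['y'])],axis=1)
    df  = pd.concat([df_p,df_n],axis=0)

    # Cleaning: overwrite the noisy labels with GMM component indices (0 or 1)
    df['y'] = GaussianMixture(n_components=2, covariance_type='full').fit_predict(df[['x0','x1']])

    # Running SVM (trained and scored on the component indices)
    clf_a = svm.SVC(kernel='rbf', C=1, gamma=0.01)
    clf_a.fit(df[['x0','x1']][:1000],df['y'][:1000])
    score = clf_a.score(df[['x0','x1']][1000:],df['y'][1000:])

    # Report Scores
    print(score)
    mean +=score

mean = mean/20
mean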
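A Gaussian mixture also provides per-point membership probabilities, which allows a more cautious variant of cleaning: relabel only the points the mixture assigns with high confidence and leave the rest untouched. The sketch below is a swapped-in variation, not the notebook's method. The helper `gmm_relabel`, the `min_confidence` threshold of 0.9, the separable class centers, and the seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_relabel(X, y_noisy, min_confidence=0.9, seed=0):
    """Relabel confidently assigned points with their component's majority noisy label."""
    gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=seed).fit(X)
    proba = gmm.predict_proba(X)
    component = proba.argmax(axis=1)
    confident = proba.max(axis=1) >= min_confidence

    cleaned = y_noisy.copy()
    for c in range(gmm.n_components):
        members = component == c
        if not members.any():
            continue
        # Labels are -1/+1, so the sign of the sum gives the majority (ties go to +1)
        majority = 1 if y_noisy[members].sum() >= 0 else -1
        # Points below the confidence threshold keep their original noisy label
        cleaned[members & confident] = majority
    return cleaned

rng = np.random.default_rng(0)
n = 500

# Assumption: two separable Gaussian classes
X = np.vstack([rng.normal(2, 1, size=(n, 2)), rng.normal(-2, 1, size=(n, 2))])
y_true = np.concatenate([np.full(n, 1), np.full(n, -1)])

# Flip about 35% of positives and 20% of negatives, as in the notebook's noise model
flip_prob = np.concatenate([np.full(n, 0.35), np.full(n, 0.20)])
y_noisy = np.where(rng.random(2 * n) < flip_prob, -y_true, y_true)

y_clean = gmm_relabel(X, y_noisy)
print("Agreement before: {:.3f}".format(np.mean(y_noisy == y_true)))
print("Agreement after:  {:.3f}".format(np.mean(y_clean == y_true)))
```

Raising `min_confidence` relabels fewer points near the boundary between components, trading coverage for caution.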