# Programming Machine Learning Lab
# Exercise 11

**General Instructions:**

1. You need to submit the PDF as well as the filled notebook file.
1. Name your submissions by prefixing your matriculation number to the filename. For example, if your matriculation number is 12345, rename the files to **"12345_Exercise_11.xxx"**.
1. Complete all your tasks and then do a clean run before generating the final PDF (the _Clear All Outputs_ and _Run All_ commands in Jupyter Notebook).

**Exercise-specific instructions:**

1. You are allowed to use only NumPy and Pandas (unless stated otherwise). You can use any library for visualizations.

### Part 1

In this part, you will use the credit card fraud detection dataset from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud to train and test a Support Vector Machine (SVM) classifier. Your task is to:

1. Download the data and split the dataset into training and testing sets (80-20 split) in a stratified manner to account for the class imbalance. You need to code the stratified splitting function from scratch. *sklearn is not allowed for this part.*
1. Implement the basic Pegasos algorithm from the paper https://home.ttic.edu/~nati/Publications/PegasosMPB.pdf. See Fig. 1 on page 5.
1. Implement the mini-batch Pegasos algorithm from the same paper, https://home.ttic.edu/~nati/Publications/PegasosMPB.pdf. Do not forget the projection step. See Fig. 2 on page 6.
1. Implement the dual coordinate descent method for SVMs from the paper https://icml.cc/Conferences/2008/papers/166.pdf. This is Algorithm 1 in the paper.
1. Report the final accuracy on the test set for all 3 approaches.
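
For orientation, the update rules behind tasks 2 to 4 can be summarized as below. This is a condensed restatement in the papers' notation and should be checked against the cited figures; here $A_t$ is the mini-batch of size $k$ drawn at iteration $t$ (with $k = 1$ for the basic algorithm) and $\lambda$ is the regularization parameter.

$$
A_t^{+} = \{\, i \in A_t : y_i \langle w_t, x_i \rangle < 1 \,\}, \qquad \eta_t = \frac{1}{\lambda t}
$$

$$
w_{t+\frac{1}{2}} = (1 - \eta_t \lambda)\, w_t + \frac{\eta_t}{k} \sum_{i \in A_t^{+}} y_i x_i, \qquad
w_{t+1} = \min\left\{ 1, \frac{1/\sqrt{\lambda}}{\lVert w_{t+\frac{1}{2}} \rVert} \right\} w_{t+\frac{1}{2}}
$$

For dual coordinate descent, each coordinate $i$ is updated using the gradient $G$ of the dual objective, with $D_{ii} = 0,\ U = C$ for the L1-loss SVM and $D_{ii} = 1/(2C),\ U = \infty$ for the L2-loss SVM:

$$
G = y_i\, w^{\top} x_i - 1 + D_{ii}\, \alpha_i, \qquad
\bar{Q}_{ii} = x_i^{\top} x_i + D_{ii}
$$

$$
\alpha_i^{\text{new}} = \min\left( \max\left( \alpha_i - \frac{G}{\bar{Q}_{ii}},\, 0 \right),\, U \right), \qquad
w \leftarrow w + (\alpha_i^{\text{new}} - \alpha_i)\, y_i x_i
$$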

In [1]:
### Write your code here
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score,roc_auc_score

def StratifiedSplit(X, y, test_size=0.3):
    unique_classes = np.unique(y)
    
    train_indices, test_indices = [], []
    
    for class_label in unique_classes:
        # Find indices of samples with the current class label
        class_indices = np.where(y == class_label)[0]
        np.random.shuffle(class_indices)
        
        # Calculate the number of samples for the test set
        num_test_samples = int(len(class_indices) * test_size)
        
        # Split indices into train and test sets
        train_indices.extend(class_indices[num_test_samples:])
        test_indices.extend(class_indices[:num_test_samples])
    
    # Shuffle the indices to randomize the order
    np.random.shuffle(train_indices)
    np.random.shuffle(test_indices)
    
    # Create the train and test sets based on the indices
    X_train, X_test = X[train_indices,:], X[test_indices,:]
    y_train, y_test = y[train_indices], y[test_indices]

    return X_train, X_test, y_train, y_test

# Read the dataset
df = pd.read_csv('creditcard.csv')
# Separate the features (all columns but the last) from the labels (last column)
X = df.iloc[:, :-1]
y = df.iloc[:, -1].values
# Map the labels from {0, 1} to {-1, 1}, as the SVM formulations expect
y = np.where(y > 0, 1, -1)
X_train, X_test, y_train, y_test = StratifiedSplit(X.values, y, test_size=0.2)

In [2]:
class Pegasos:
    def __init__(self, lamda, k, projection):
        self.lamda = lamda  # regularization parameter lambda
        self.k = k  # mini-batch size (k=1 gives the basic algorithm)
        self.projection = projection  # whether to apply the projection step

    def gradient(self, p):
        # Indicator of margin violations: 1 where y_i * <w, x_i> < 1, else 0
        return np.where(p < 1, 1, 0)

    def fit(self, x, y, n_iters=1500):
        m, n = x.shape
        self.W = np.zeros(n)
        for t in range(1, n_iters + 1):
            # Sample a mini-batch of k indices without replacement
            idx = np.random.choice(m, self.k, replace=False)
            lr = 1 / (self.lamda * t)  # step size eta_t = 1 / (lambda * t)
            x_i = x[idx]
            y_i = y[idx]
            prod = y_i * (x_i @ self.W)  # margins y_i * <w, x_i>
            # Sub-gradient step over the margin-violating samples in the batch
            violators = self.gradient(prod).reshape(-1, 1)
            self.W = (1 - lr * self.lamda) * self.W + \
                (lr / self.k) * np.sum(y_i.reshape(-1, 1) * x_i * violators, axis=0)
            if self.projection:
                # Project onto the ball of radius 1/sqrt(lambda)
                norm = np.linalg.norm(self.W)
                if norm > 0:
                    self.W *= min(1.0, 1 / (np.sqrt(self.lamda) * norm))

    def predict(self, x):
        # The sign of the decision function <w, x> gives the predicted class
        return np.sign(x @ self.W)

In [3]:
class SVMDC:
    def __init__ (self, C, mode = "L1", tol=1e-3):
        self.C = C #C value
        self.mode = mode
        self.tol = tol #tolerance value to break out of the loop
    
    def partial_gradient(self, G, a, U):
        # Projected gradient: accounts for the box constraint 0 <= a <= U
        if a == 0:
            return min(G, 0)
        elif a == U:
            return max(G, 0)
        else:
            return G
    
    def fit(self, X, y, iters=100):
        m, n = X.shape

        # Dual CD supports the L1-loss and L2-loss SVM formulations
        if self.mode == "L1":
            Dii = 0
            U = self.C
        else:
            Dii = 1 / (2 * self.C)
            U = np.inf

        alpha = np.zeros(m)  # Lagrange multipliers (dual variables)
        self.w = np.zeros(n)  # primal weight vector
        Qii = np.sum(X**2, axis=1) + Dii  # diagonal of Q_bar: x_i^T x_i + D_ii
        for t in range(iters):  # outer iterations
            err = 0  # largest |projected gradient| seen in this pass
            for i in range(m):  # iterate over each instance
                Qhat = Qii[i]
                # Gradient of the dual objective with respect to alpha_i
                G = y[i] * np.dot(self.w, X[i, :]) - 1 + Dii * alpha[i]
                # Projected gradient under the constraint 0 <= alpha_i <= U
                PG = self.partial_gradient(G, alpha[i], U)
                err = max(err, np.abs(PG))

                # Update alpha_i only if it is not already optimal (PG != 0)
                if np.abs(PG) > 0:
                    alpha_new = min(max(alpha[i] - G / Qhat, 0), U)
                    self.w = self.w + (alpha_new - alpha[i]) * y[i] * X[i, :]
                    alpha[i] = alpha_new

            # Stop once the largest projected gradient falls below the tolerance
            if err < self.tol:
                break
        
    def predict(self, x):
        # Compute the decision values <w, x> using the weight vector
        p = x@self.w.reshape(-1,1)
        return np.sign(p) # the sign gives the class each sample belongs to

In [4]:
# Pegasos Basic
peg_basic = Pegasos(0.01,1,projection=False)
peg_basic.fit(X_train,y_train)

# Test the model
print(f'Accuracy of Basic Pegasos is {accuracy_score(y_test,peg_basic.predict(X_test))} and ROC AUC Score is {roc_auc_score(y_test,peg_basic.predict(X_test))}')

Accuracy of Basic Pegasos is 0.9982795245869981 and ROC AUC Score is 0.5


In [5]:
# Pegasos Batch
peg_batch = Pegasos(0.01,10,projection=False)
peg_batch.fit(X_train,y_train)
# Test the model
print(f'Accuracy of Batch Pegasos is {accuracy_score(y_test,peg_batch.predict(X_test))} and ROC AUC Score is {roc_auc_score(y_test,peg_batch.predict(X_test))}')

Accuracy of Batch Pegasos is 0.9982795245869981 and ROC AUC Score is 0.5
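
A high accuracy together with an ROC AUC of 0.5, as in the two runs above, is consistent with a classifier that assigns every sample to the majority class. One way to check this is to look at the recall of each class separately. The sketch below uses only NumPy; the helper name `per_class_recall` and the toy arrays are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def per_class_recall(y_true, y_pred, labels=(-1, 1)):
    """Fraction of samples of each class that were predicted correctly."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return {lab: float(np.mean(y_pred[y_true == lab] == lab)) for lab in labels}

# Toy example (assumed data): 998 negatives, 2 positives, and a model
# that predicts -1 for every sample.
y_true_toy = np.array([-1] * 998 + [1] * 2)
y_pred_toy = -np.ones(1000)

recalls = per_class_recall(y_true_toy, y_pred_toy)
accuracy = float(np.mean(y_true_toy == y_pred_toy))
print(recalls)   # {-1: 1.0, 1: 0.0}
print(accuracy)  # 0.998
```

On the real data, the same helper could be called as `per_class_recall(y_test, peg_batch.predict(X_test))`.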


In [6]:
# Dual Coordinate Descent
svm = SVMDC(0.1)
svm.fit(X_train,y_train)
# Test the model
print(f'Accuracy of Dual Coordinate Descent SVM is {accuracy_score(y_test,svm.predict(X_test))} and ROC AUC Score is {roc_auc_score(y_test,svm.predict(X_test))}')

**NOTE:** Because the dataset is heavily imbalanced, a better approach would be to apply a resampling technique such as oversampling, undersampling, or SMOTE as a preprocessing step and then train the model on the resampled dataset. Additionally, Pegasos and CD use a stochastic optimization approach, so the results obtained with these techniques may differ considerably between runs and between choices of random state.
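
As a minimal sketch of the undersampling idea, assuming labels in {-1, 1} with -1 as the majority class, the majority class can be randomly thinned with NumPy alone. The function name `random_undersample`, its `ratio` parameter, and the toy data are hypothetical; a fixed seed is used so that the result is repeatable. Resampling should be applied to the training set only, leaving the test set untouched.

```python
import numpy as np

def random_undersample(X, y, majority_label=-1, ratio=1.0, seed=0):
    """Keep all minority samples and a random subset of the majority class.

    ratio is the desired number of majority samples per minority sample.
    """
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # Never ask for more majority samples than are available
    n_keep = min(len(maj_idx), int(ratio * len(min_idx)))
    keep_maj = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.concatenate([min_idx, keep_maj])
    rng.shuffle(idx)  # mix the two classes
    return X[idx], y[idx]

# Toy data (assumed): 5 positives followed by 95 negatives
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.where(np.arange(100) < 5, 1, -1)

X_bal, y_bal = random_undersample(X_toy, y_toy, ratio=1.0, seed=42)
print(X_bal.shape, np.sum(y_bal == 1), np.sum(y_bal == -1))  # (10, 2) 5 5
```

In the notebook, this could be used as `X_res, y_res = random_undersample(X_train, y_train)` before calling `fit`.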