# Project 2: Classification

This project asks you to perform various experiments with classification. The dataset we are using is a toy dataset for credit card fraud detection:

https://www.kaggle.com/datasets/shubhamjoshi2130of/abstract-data-set-for-credit-card-fraud-detection

You will write code and discussion texts into code and text cells in this notebook. 

If a block starts with TODO:, this means that you need to write something there. 

Some code has been written for you to guide the project. Don't change the already written code.

## Grading
The points add up to 40, that is 30 + 10 bonus points. While there is no difference between the regular and the bonus points, I recommend that you solve the problems labeled "BONUS" after you have finished the other ones.


In [622]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk

## Setup for the project

Here we load the dataset, and create the training and test datasets as numpy arrays.

In [623]:
df = pd.read_csv("creditcard.csv",  true_values=["Y"], false_values=["N"])
print(f"Number of rows {len(df.index)}")
print(f"The columns of the database {df.columns}")
df.value_counts("isFradulent")



Number of rows 3075
The columns of the database Index(['Merchant_id', 'Transaction date', 'Average Amount/transaction/day',
       'Transaction_amount', 'Is declined', 'Total Number of declines/day',
       'isForeignTransaction', 'isHighRiskCountry', 'Daily_chargeback_avg_amt',
       '6_month_avg_chbk_amt', '6-month_chbk_freq', 'isFradulent'],
      dtype='object')


isFradulent
False    2627
True      448
dtype: int64

In [624]:
xfields = [
    'Average Amount/transaction/day',
       'Transaction_amount', 'Is declined', 'Total Number of declines/day',
       'isForeignTransaction', 'isHighRiskCountry', 'Daily_chargeback_avg_amt',
       '6_month_avg_chbk_amt', '6-month_chbk_freq']

df = df.replace({'Y': 1, 'N': -1})       # map any remaining 'Y'/'N' strings to 1/-1 (columns holding only Y/N were already parsed as booleans by read_csv above)

df_shuffled = df.sample(frac=1) # shuffle the rows
x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled["isFradulent"].to_numpy(dtype=np.float64)
# the training data is the first 2000 rows, after shuffled
training_data_x = x[:2000]
training_data_y = y[:2000]
# the test data is the remaining
test_data_x = x[2000:]
test_data_y = y[2000:]
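
The shuffle above calls `df.sample(frac=1)` without a seed, so the accuracies printed later in this notebook change from run to run. As an alternative sketch (not the method the notebook uses), scikit-learn's `train_test_split` can produce a repeatable split that also keeps the fraud ratio similar in both parts. The arrays `x_demo` and `y_demo` are synthetic stand-ins introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's x and y (3075 rows, 9 features, ~15% positives).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(3075, 9))
y_demo = (rng.random(3075) < 0.15).astype(np.float64)

# train_size=2000 mirrors the notebook's first-2000-rows split;
# stratify keeps the positive ratio similar in both parts, and
# random_state makes the shuffle repeatable across runs.
train_x, test_x, train_y, test_y = train_test_split(
    x_demo, y_demo, train_size=2000, stratify=y_demo, random_state=0
)

print(train_x.shape, test_x.shape)
print(train_y.mean(), test_y.mean())
```

With the real data, `x` and `y` from the cell above would take the place of the synthetic arrays.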


In [625]:
print("Run this to help you with what number goes with what field:")
for i, field in enumerate(xfields):  # `field`, so the array x is not overwritten
    print(f"{i} = {field}")

Run this to help you with what number goes with what field:
0 = Average Amount/transaction/day
1 = Transaction_amount
2 = Is declined
3 = Total Number of declines/day
4 = isForeignTransaction
5 = isHighRiskCountry
6 = Daily_chargeback_avg_amt
7 = 6_month_avg_chbk_amt
8 = 6-month_chbk_freq


## P1: Create an accuracy metric (7 pts)
Create a simple accuracy metric function which, for a pair of ground truth values $y$ and estimates $\hat{y}$ (both of them arrays), calculates the accuracy of the estimate $\hat{y}$. For instance, if you pass y = [1, 0, 1] and 
yhat = [1, 1, 0], the accuracy function should return 0.33 (one of the three predictions is correct).

In [626]:
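def accuracy(y, yhat):
    # fraction of positions where the 0/1 labels in y and yhat agree
    total = 0
    for i in range(len(y)):
        # XOR is 1 on a mismatch, so total accumulates the error rate
        total += (int(y[i]) ^ int(yhat[i]))/(len(y))
    total = 1 - total
    return total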


In [627]:
# test your function here
acc = accuracy([1, 0, 1], [1, 1, 0])
print(f"Accuracy is {acc}") # should print 0.33...

Accuracy is 0.33333333333333337
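
The same metric can also be written without an explicit loop. The sketch below is a hypothetical vectorized variant (the name `accuracy_np` is not part of the assignment) that compares the two arrays element-wise with NumPy and averages the matches.

```python
import numpy as np

def accuracy_np(y, yhat):
    # Hypothetical vectorized variant: fraction of positions where the labels match.
    y = np.asarray(y)
    yhat = np.asarray(yhat)
    return float(np.mean(y == yhat))

print(accuracy_np([1, 0, 1], [1, 1, 0]))  # one match out of three
```

Because it compares for equality instead of using XOR, this form also works when the labels are floats or booleans.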


## P2: Implement a majority classifier (7 pts)
This classifier will always return the most likely value. Training the classifier means determining what the most likely value is (regardless of what value you pass to it). For instance, if more than half of the transactions are fraudulent, then you just return fraudulent always. 

In [628]:
def classify_majority(x, theta):
    # whatever the value of x, we return the theta
    return theta

# TODO: implement the train majority function
def train_majority(training_x, training_y):
    # this function will have to determine which is more likely to 
    # be the value of y, one (true) or zero (false)
    # theta is 1 if at least half of the training labels are 1, otherwise 0
    if np.mean(training_y) >= .5: return 1
    else: return 0

In [629]:
# TODO: use the train_majority function to find the theta value for the training dataset
theta = 0

theta = train_majority(training_data_x, training_data_y)

# TODO: now use the theta value to create the test_data_yhat array which contains the classification for each test value 

test_data_yhat = [theta] * (len(test_data_y))

# TODO: now calculate the accuracy of the classifier using the function implemented in P1, and print it out

print(accuracy(test_data_y, test_data_yhat))

0.8651162790697672


TODO: Discuss here the performance of the majority classifier. Would this beat a classifier that just returns random values? 

In [630]:
print("This would beat a classifier that just returns random values in most cases, because the expected accuracy of a random classifier is 50 percent,\nwhile the majority classifier scores at least 50 percent on the data it was trained on. There may be instances where a random classifier outperforms the majority\nclassifier, but that possibility diminishes with scale.")

This would beat a classifier that just returns random values in most cases, because the expected accuracy of a random classifier is 50 percent,
while the majority classifier scores at least 50 percent on the data it was trained on. There may be instances where a random classifier outperforms the majority
classifier, but that possibility diminishes with scale.
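
The comparison above can be checked with a small simulation. The sketch below assumes labels with the class counts reported at the top of the notebook (2627 non-fraudulent, 448 fraudulent) and a "random" classifier that flips a fair coin per transaction; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels with the class counts reported above: 2627 non-fraudulent, 448 fraudulent.
labels = np.concatenate([np.zeros(2627, dtype=int), np.ones(448, dtype=int)])

majority_pred = np.zeros_like(labels)               # always predict the majority class
random_pred = rng.integers(0, 2, size=labels.size)  # fair coin flip per transaction

majority_acc = np.mean(majority_pred == labels)
random_acc = np.mean(random_pred == labels)

print(f"majority: {majority_acc:.3f}, random: {random_acc:.3f}")
```

The majority accuracy equals the share of the larger class (2627/3075), while the coin-flip accuracy stays close to 0.5 regardless of the class balance.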


## P3: Implement a hand engineered classifier (8 pts)

Engineer by hand a classifier function that predicts whether a transaction is fraudulent or not. Your function should have a $\theta$ parameter which allows you to tweak it. 
The problem requires you to design a function that performs this classification, tweak its parameters, and measure its accuracy for the best parametrization you found. You should aim for a function that, at minimum, performs better than the majority classifier. 

In [631]:
# TODO: implement here your hand-engineered classifier
# The example below is just a very bad example, but it gives you an idea of how you can reason about the classification problem.
# In your implementation, you should try to actually find some kind of clever algorithm. You can also use more complex parametrizations

class Classifiers:
        transaction_amount = 0.0
        high_risk = 1
        declines = 0.0
        chargebacks = 0.0

def predict_handwritten(x,theta):
    y_pred = [0] * len(x)
    for i in range(len(x)):
        if(x[i,1] > theta.transaction_amount):
            y_pred[i] = 1
            continue
        elif(x[i,5]):
            y_pred[i] = 1
            continue
        elif(x[i,3] > theta.declines):
            y_pred[i] = 1
            continue
        elif(x[i,8] > theta.chargebacks):
            y_pred[i] = 1
            continue
        else:
            y_pred[i] = 0

    return y_pred
    
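def transaction_classifier(x,y):
    # cutoff for Transaction_amount (column 1), computed from the non-fraudulent rows (y == 0):
    # start at the largest value and step down by 1 until the cutoff lies less than
    # 75% of the (highest - lowest) range above the mean
    k = 0
    lowest = 999999999
    highest = 0
    amount = 0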
    
    for i in range(len(x)):
        if 0 == y[i]:
            if x[i,1] < lowest:
                lowest = x[i,1]
            if x[i,1] > highest:
                highest = x[i,1]
            amount+= x[i,1]
            k+=1
       
    valid_mean = amount/k

    k = 0
    
    for k in range (int(highest)): 
        top = ((highest - k)-valid_mean)/(highest - lowest)
        if top < .75: break

    return ((highest-k))

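def decline_classifier(x,y):
    # cutoff for Total Number of declines/day (column 3), computed from the non-fraudulent rows (y == 0):
    # start at the largest value and step down by 1 until the cutoff lies less than
    # 75% of the (highest - lowest) range above the mean
    k = 0
    lowest = 999999999
    highest = 0
    amount = 0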
    
    for i in range(len(x)):
        if 0 == y[i]:
            if x[i,3] < lowest:
                lowest = x[i,3]
            if x[i,3] > highest:
                highest = x[i,3]
            amount+= x[i,3]
            k+=1
       
    valid_mean = amount/k

    k = 0
    
    for k in range (int(highest)): 
        top = ((highest - k)-valid_mean)/(highest - lowest)
        if top < .75: break

    return ((highest-k))

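def chargeback_classifier(x,y):
    # cutoff for 6-month_chbk_freq (column 8), computed from the non-fraudulent rows (y == 0):
    # start at the largest value and step down by 1 until the cutoff lies less than
    # 60% of the (highest - lowest) range above the mean
    k = 0
    lowest = 999999999
    highest = 0
    amount = 0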
    
    for i in range(len(x)):
        if 0 == y[i]:
            if x[i,8] < lowest:
                lowest = x[i,8]
            if x[i,8] > highest:
                highest = x[i,8]
            amount+= x[i,8]
            k+=1
       
    valid_mean = amount/k
    
    k = 0
    
    for k in range (int(highest)): 
        top = ((highest - k)-valid_mean)/(highest - lowest)
        if top < .6: break

    return ((highest-k))

    

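def classify_handwritten(x, y, depth):
    # recursively halve the data `depth` times, fit thresholds on each leaf,
    # and keep the thresholds of whichever half scores higher on its own data

    classifiers = Classifiers()

    if depth <= 0 or len(x) <= 1: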
        classifiers.transaction_amount = transaction_classifier(x,y)
        classifiers.declines = decline_classifier(x,y)
        classifiers.chargebacks = chargeback_classifier(x,y)
    
        return classifiers

    
    sub_left  = classify_handwritten(x[:len(x)//2], y[:len(x)//2], depth - 1)
    sub_right = classify_handwritten(x[len(x)//2:], y[len(x)//2:], depth - 1)

    y_left  = predict_handwritten(x[:len(x)//2], sub_left)
    y_right = predict_handwritten(x[len(x)//2:], sub_right)

    left_accuracy = accuracy(y[:len(x)//2], y_left)
    right_accuracy = accuracy(y[len(x)//2:], y_right)

    if right_accuracy > left_accuracy:
        return sub_right

    return sub_left
    

In [632]:
# TODO: Now, run some experiments with your function. Experiment with different values of the parameter theta. 
theta = Classifiers()

depth = 0       # for my classify_handwritten function I'm using depth as the tweakable parameter instead of theta
                # depth can be any integer from 0 to 9; the best accuracy was at depths 0 and 1

theta = classify_handwritten(training_data_x, training_data_y, depth)

y_pred = predict_handwritten(test_data_x, theta)

print(accuracy(test_data_y, y_pred))

0.9776744186046512


In [633]:
# TODO: calculate the accuracy of the classifier on the test data with the best
# theta found above and print it.

theta = classify_handwritten(training_data_x, training_data_y, 0)

y_pred = predict_handwritten(test_data_x, theta)

print(accuracy(test_data_y, y_pred))

0.9776744186046512


TODO: Describe in one paragraph your experiments and evaluation. Discuss the overall accuracy of your classifier. Did you manage to beat the "majority" classifier? Comment on how easy or hard it was to do this. 

In [634]:
print("I spent a lot of time looking through the data to figure out the threshold for fraudulent transactions for each variable.")
print("I then found a way to approximate those thresholds from the upper end of the range of the non-fraudulent transactions.")
print("Setting all transactions over the thresholds to fraudulent resulted in an accuracy of about .978, beating the majority classifier (.865).")
print("Writing the functions that compute the threshold for each variable wasn't that hard, but I had to reference the textbook \nto figure out how to add the depth parameter. Unfortunately I couldn't fold the three threshold functions into just one function.")

I spent a lot of time looking through the data to figure out the threshold for fraudulent transactions for each variable.
I then found a way to approximate those thresholds from the upper end of the range of the non-fraudulent transactions.
Setting all transactions over the thresholds to fraudulent resulted in an accuracy of about .978, beating the majority classifier (.865).
Writing the functions that compute the threshold for each variable wasn't that hard, but I had to reference the textbook 
to figure out how to add the depth parameter. Unfortunately I couldn't fold the three threshold functions into just one function.
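
The three threshold functions in P3 differ only in the column index and the cutoff fraction, so they could be folded into a single parameterized function. The sketch below is one hypothetical way to do that; the name `threshold_for_column` and the parameter `fraction` are assumptions made for this illustration. It uses `min()` and `max()` instead of the sentinel starting values, so it agrees with the originals only when the non-fraudulent values are non-negative, below 999999999, and have a positive maximum.

```python
import numpy as np

def threshold_for_column(x, y, column, fraction):
    # Hypothetical single replacement for the three per-column functions.
    values = x[y == 0, column]                 # non-fraudulent rows only
    lowest, highest, mean = values.min(), values.max(), values.mean()
    k = 0
    # Step the cutoff down from the maximum until it lies less than
    # `fraction` of the value range above the mean.
    for k in range(int(highest)):
        if ((highest - k) - mean) / (highest - lowest) < fraction:
            break
    return highest - k

# Tiny worked example: non-fraudulent values are 0, 10, 20 (mean 10, range 20).
x_demo = np.array([[0.0], [10.0], [20.0], [100.0]])
y_demo = np.array([0, 0, 0, 1])
print(threshold_for_column(x_demo, y_demo, column=0, fraction=0.25))
```

In the notebook's terms, the calls would be `threshold_for_column(x, y, 1, .75)`, `threshold_for_column(x, y, 3, .75)` and `threshold_for_column(x, y, 8, .6)`.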


## P4: Implement a logistic regression classifier using sklearn (8 pts)
Implement a logistic regression function using the sklearn library. 
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [635]:
# data for sklearn functions: binarize columns 1, 3 and 8 with the hand-tuned thresholds, then keep columns 1-5 and 8

theta = classify_handwritten(training_data_x, training_data_y, 0)

i = 0

for i in range(len(training_data_x)):
    if (float(training_data_x[i][1]) > theta.transaction_amount): training_data_x[i][1] = 1
    else: training_data_x[i][1] = 0

    if (training_data_x[i][8] > theta.chargebacks): training_data_x[i][8] = 1
    else: training_data_x[i][8] = 0

    if (training_data_x[i][3] > theta.declines): training_data_x[i][3] = 1
    else: training_data_x[i][3] = 0

i = 0

for i in range(len(test_data_x)):
    if (float(test_data_x[i][1]) > theta.transaction_amount): test_data_x[i][1] = 1
    else: test_data_x[i][1] = 0

    if (test_data_x[i][8] > theta.chargebacks): test_data_x[i][8] = 1
    else: test_data_x[i][8] = 0

    if (test_data_x[i][3] > theta.declines): test_data_x[i][3] = 1
    else: test_data_x[i][3] = 0

print(np.shape(training_data_x[:,8]), np.shape(training_data_x[:,1:6]))

training_data_x = np.append(training_data_x[:,1:6], np.array(training_data_x[:,8]).reshape(-1,1), axis = 1)
test_data_x = np.append(test_data_x[:,1:6], np.array(test_data_x[:,8]).reshape(-1,1), axis = 1)

print(training_data_x[:][1])

(2000,) (2000, 5)
[0. 0. 0. 1. 0. 0.]


In [636]:
# TODO: implement the logistic regression here in a function 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log = LogisticRegression()

log.fit(training_data_x,training_data_y)

y_pred = log.predict(test_data_x)

print(accuracy_score(test_data_y, y_pred))

0.9767441860465116


In [637]:
# TODO: now, run some experiments with it, and measure the accuracy with various parametrizations. In particular, you should run it with and without regularization. 
# In the last line, print the accuracy with the best parameters.

log = LogisticRegression(max_iter = 100, class_weight = {0:1, 1:1}, random_state=1100000)  # class_weight keys are class labels, so only 0 and 1 apply

log.fit(training_data_x,training_data_y)

y_pred = log.predict(test_data_x)

print(accuracy_score(test_data_y, y_pred))

0.9767441860465116
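
The TODO in the cell above asks for runs with and without regularization. In scikit-learn's `LogisticRegression`, the parameter `C` is the inverse regularization strength, so one hedged way to compare is to sweep `C` from small (strong L2 penalty) to very large (penalty effectively switched off). The sketch below uses synthetic data from `make_classification` as a stand-in; with the notebook's arrays, `training_data_x`, `training_data_y`, `test_data_x` and `test_data_y` would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
# C is the inverse regularization strength: small C = strong L2 penalty,
# very large C = penalty so weak that it is effectively switched off.
for C in [0.01, 1.0, 1e6]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    results[C] = accuracy_score(y_test, model.predict(X_test))

for C, acc in results.items():
    print(f"C={C:g}: accuracy={acc:.3f}")
```

A very large `C` is used here rather than disabling the penalty outright, because the spelling of the "no penalty" option has changed between scikit-learn versions.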



TODO: Describe in one paragraph your experiments and evaluation of the Logistic Regression classifier. Consider things such as accuracy, training time, ease of tweaking of the parameters. Compare it with the accuracy of the hand-engineered classifier.

In [638]:
print("When experimenting I found that the only class_weight entries that affected the accuracy were the ones at keys 0 and 1 (the two class labels). If a weight was set to 0 the classifier would default to the majority classifier,\nand if it was set too large it would bring the accuracy down to about 14 percent. With equal weights it scored about the same as my classifier.")

When experimenting I found that the only class_weight entries that affected the accuracy were the ones at keys 0 and 1 (the two class labels). If a weight was set to 0 the classifier would default to the majority classifier,
and if it was set too large it would bring the accuracy down to about 14 percent. With equal weights it scored about the same as my classifier.
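
The behavior described above matches how `class_weight` works in scikit-learn: the dictionary keys are class labels (here 0 and 1), not feature indices, and each value scales how much errors on that class count during training. The sketch below illustrates this on synthetic stand-in data by varying only the weight of class 1 and recording how often the model predicts 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85], random_state=0)

positive_rate = {}
for w1 in [0.0, 1.0, 1000.0]:
    # Keys are the class labels found in y (0 and 1), not feature indices.
    model = LogisticRegression(class_weight={0: 1.0, 1: w1}, max_iter=1000)
    model.fit(X, y)
    positive_rate[w1] = float(model.predict(X).mean())

for w1, rate in positive_rate.items():
    print(f"weight for class 1 = {w1:g}: fraction predicted as 1 = {rate:.3f}")
```

With a weight of 0 the positive class is ignored during training, so the model tends to predict the majority class everywhere; with a very large weight it tends to predict 1 for most rows, which on imbalanced data pushes accuracy toward the share of the minority class.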


## P5 Bonus: Implement a random forest classifier using sklearn (5 pts)
Implement a random forest classifier using sklearn 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [639]:
# TODO: Implement the random forest classifier here
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth = 100, class_weight = {0:1, 1:1}, random_state=1100000)  # class_weight keys are class labels, so only 0 and 1 apply

rf.fit(training_data_x,training_data_y)

y_pred = rf.predict(test_data_x)

print(accuracy_score(test_data_y, y_pred))


0.9776744186046512


In [640]:
# TODO: Perform some experiments here with different parameters of the random forest classifier. In the last line, print the accuracy with the best parameters.
rf = RandomForestClassifier(max_depth = 100, class_weight = {0:1, 1:1}, random_state=1100000)  # class_weight keys are class labels, so only 0 and 1 apply

rf.fit(training_data_x,training_data_y)

y_pred = rf.predict(test_data_x)

print(accuracy_score(test_data_y, y_pred))

0.9776744186046512


TODO: Describe in one paragraph your experiments and evaluation of the random forest classifier. Consider things such as accuracy, training time, ease of tweaking of the parameters. 

In [641]:
print("This classifier seems to be affected by the same settings as the logistic regression model, and the accuracy is about the same. \nI did not time either model, but I imagine the training times differ, since logistic regression fits \na single linear model while the random forest fits many decision trees.\nIt shares some parameters with logistic regression, such as class_weight and random_state.")

This classifier seems to be affected by the same settings as the logistic regression model, and the accuracy is about the same. 
I did not time either model, but I imagine the training times differ, since logistic regression fits 
a single linear model while the random forest fits many decision trees.
It shares some parameters with logistic regression, such as class_weight and random_state.
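
Training time is easier to discuss with a measurement than with a guess. The sketch below times `fit` for both model types with `time.perf_counter` on synthetic stand-in data; the dataset size and the settings are assumptions for illustration, and the absolute numbers depend on the machine.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with roughly the notebook's training-set size.
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)

timings = {}
for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    start = time.perf_counter()
    model.fit(X, y)
    timings[name] = time.perf_counter() - start  # wall-clock seconds for one fit

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Repeating each fit several times and averaging would give a steadier comparison than a single run.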


## P6 Bonus: Implement an AdaBoost classifier using sklearn (5 pts)

Implement an AdaBoost classifier using sklearn https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [642]:
# TODO: Implement the adaboost classifier here

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(random_state=0)

ada.fit(training_data_x, training_data_y)

y_pred = ada.predict(test_data_x)

print(accuracy_score(test_data_y, y_pred))

0.9720930232558139


In [643]:
# TODO: Perform some experiments here with different parametrizations of the adaboost classifier. In the last line, print the accuracy with the best parameters.

ada = AdaBoostClassifier(learning_rate = 100, n_estimators = 100, random_state=1000)

ada.fit(training_data_x, training_data_y)

y_pred = ada.predict(test_data_x)

print(accuracy_score(test_data_y, y_pred))

0.8651162790697674




TODO: Describe in one paragraph your experiments and evaluation of the AdaBoost classifier. Consider things such as accuracy, training time, ease of tweaking of the parameters. 

In [644]:
print("The accuracy of AdaBoost has been lower than the accuracy of the other two classifiers in the runs I've observed.\nThe parameters of AdaBoost are also different from those of the other two classifiers. learning_rate doesn't seem to affect the speed of the algorithm; \nhowever, a large number of estimators almost killed my PC.")

The accuracy of AdaBoost has been lower than the accuracy of the other two classifiers in the runs I've observed.
The parameters of AdaBoost are also different from those of the other two classifiers. learning_rate doesn't seem to affect the speed of the algorithm; 
however, a large number of estimators almost killed my PC.
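
The run above with `learning_rate = 100` scored 0.865, the same as the majority classifier, which suggests that value is far too large. A more systematic experiment would sweep moderate values of `learning_rate` and `n_estimators` and keep the best pair. The sketch below does this on synthetic stand-in data; the grid values are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for learning_rate in [0.1, 0.5, 1.0]:
    for n_estimators in [10, 50]:
        model = AdaBoostClassifier(
            learning_rate=learning_rate, n_estimators=n_estimators, random_state=0
        )
        model.fit(X_train, y_train)
        scores[(learning_rate, n_estimators)] = accuracy_score(y_test, model.predict(X_test))

# Pick the parameter pair with the highest test accuracy.
best = max(scores, key=scores.get)
print(f"best (learning_rate, n_estimators) = {best}, accuracy = {scores[best]:.3f}")
```

Training cost grows with `n_estimators`, since each estimator is fitted in sequence, which is consistent with the slowdown noted above.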
