
# Classification Hackathon

For the specifications for today's Hackathon, use the slides linked below. Note a couple of things before you start:
* Use your full name and ```_EDSA``` as your Zindi username.
* The dataset for this challenge is very large and will take a long time to process. To use your time wisely, work out your approach on a small subset of the data first and, once you're happy with it, train your model on the entire dataset.
* This Zindi challenge is tough. This will be taken into account when the supervisors mark your work. Do not worry too much about your placement on the leaderboard. In the Regression Hackathon ```laura_the_explorer``` was in first place but is outside the top 100 in this challenge.
* To submit your Hackathon to Athena, zip your notebook and your submission csv file, and upload that here. Note that your report card will say you have 100% once you submit your file.
* Please attach the *Honour Code* cell (below) to your notebook.

Further instructions can be found on these slides: https://docs.google.com/presentation/d/1AbVndI5aOd27Jm0E1qNoYzRtWiZ6-DE3BDE0djGxzIk/edit?usp=sharing

**Good luck!**

## Honour Code
I, Chad de Bruyn, confirm, by submitting this notebook, that the solutions in it are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

# Xente Fraud Detection Challenge

## Imports

In [0]:
# import the necessary packages
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestCentroid

The competition training and test data sets are imported and then combined into one larger data set so that, when dummy variables are created later, both sets end up with the same dummy columns. This prevents column mismatches when models trained on the training set are used to predict on the test set.

In [0]:
# import the competition data sets
train = pd.read_csv('training.csv')
test = pd.read_csv('test.csv')
df = pd.concat([train, test], sort=False)
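
To see why the two sets are combined before encoding, consider a minimal sketch with made-up data (the frames `toy_train` and `toy_test` and their values are illustrative, not taken from the competition files). Encoding the sets separately yields mismatched columns whenever a category appears in only one of them; encoding the combined frame does not.

```python
import pandas as pd

# Hypothetical toy frames: 'sms' only occurs in the test set
toy_train = pd.DataFrame({'ProductCategory': ['airtime', 'utility_bill']})
toy_test = pd.DataFrame({'ProductCategory': ['airtime', 'sms']})

# Encoding separately gives different dummy columns per set
separate_train_cols = list(pd.get_dummies(toy_train).columns)
separate_test_cols = list(pd.get_dummies(toy_test).columns)

# Encoding the combined frame gives one shared set of columns
combined = pd.get_dummies(pd.concat([toy_train, toy_test], sort=False))
combined_train = combined[:len(toy_train)]
combined_test = combined[len(toy_train):]

print(separate_train_cols)
print(separate_test_cols)
print(list(combined.columns))
```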

## Data Cleaning

This is what the data set looks like...

In [3]:
df.head()

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15T02:18:49Z,2,0.0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15T02:19:08Z,2,0.0
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15T02:44:21Z,2,0.0
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15T03:32:55Z,2,0.0
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15T03:34:21Z,2,0.0


And the size of the data set...

In [4]:
df.shape

(140681, 16)

It is important to establish which variable uniquely identifies each row. Also, if a categorical variable has too many levels, creating dummy variables for it can be computationally expensive and may crash the notebook. Below, each variable is shown with its number of unique levels.

In [5]:
for column in df.columns:
  print(column, ': ', len(np.unique(df[column].values)))

TransactionId :  140681
BatchId :  139493
AccountId :  4841
SubscriptionId :  4836
CustomerId :  7479
CurrencyCode :  1
CountryCode :  1
ProviderId :  6
ProductId :  27
ProductCategory :  10
ChannelId :  5
Amount :  2099
Value :  1880
TransactionStartTime :  138574
PricingStrategy :  4
FraudResult :  45021


'TransactionId' uniquely identifies each row, as its number of unique levels equals the number of rows in the data set. Great! Also, take a look at 'BatchId', 'AccountId', 'SubscriptionId' and 'CustomerId'. These categorical variables have thousands of unique levels each, so trying to 'dummify' them would create thousands of extra columns and might even crash the notebook! As a result, these features will be removed. Note that the count for 'FraudResult' is inflated: the 45019 test rows have no label, and each of those missing values was counted as a separate level, so the label really has only two levels.

In [0]:
# remove features with too many unique levels
df = df.drop(['BatchId', 'AccountId', 'SubscriptionId', 'CustomerId'], axis=1)
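
As an aside on the level counts above: pandas' `Series.nunique()` produces the same kind of count while ignoring missing values by default. A small sketch on made-up data (the frame `toy` is illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the label is missing for the last two rows,
# as it is for the test rows of the combined data set
toy = pd.DataFrame({
    'TransactionId': ['t1', 't2', 't3', 't4'],
    'FraudResult': [0.0, 1.0, np.nan, np.nan],
})

# nunique() skips NaN by default, so the label has two levels
level_counts = {column: toy[column].nunique() for column in toy.columns}
print(level_counts)

# A column identifies rows uniquely if it has one level per row
is_row_id = toy['TransactionId'].nunique() == len(toy)
print(is_row_id)
```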

Have a look at 'CurrencyCode' and 'CountryCode' above. There is only one level of each used throughout the data set. This is not meaningful for modeling purposes, so these features will also be removed below.

In [0]:
df = df.drop(['CurrencyCode', 'CountryCode'], axis=1)

Several variables have now been removed. Before also discarding 'TransactionStartTime', it is worth investigating whether it can be kept in a simpler form. If only the first 10 characters of this variable are used, i.e. the date in the form yyyy-mm-dd, then there are only 120 unique dates. This is shown below.

In [8]:
df['TransactionStartTime'].str[:10].nunique()

120

As a result, each value of 'TransactionStartTime' will be truncated to the format mentioned above. This will become a new column, 'TransactionDate', and the original 'TransactionStartTime' will be removed.

In [0]:
# change the date format and create the new column
df['TransactionDate'] = df['TransactionStartTime'].str[:10]
# remove the old date column
df = df.drop('TransactionStartTime', axis=1)
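
The date extraction can be illustrated on a few made-up timestamps in the same layout as 'TransactionStartTime'. This is only a sketch: the series `stamps` is hypothetical, and it uses the vectorised `.str` accessor to slice every value at once.

```python
import pandas as pd

# Hypothetical timestamps in the same layout as 'TransactionStartTime'
stamps = pd.Series([
    '2018-11-15T02:18:49Z',
    '2018-11-15T03:34:21Z',
    '2018-11-16T10:00:00Z',
])

# Keep the first 10 characters, i.e. the yyyy-mm-dd part
dates = stamps.str[:10]
print(dates.tolist())
print(dates.nunique())
```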

Having a look at the data types of all variables in the data set, observe that 'PricingStrategy' is stored as an integer, although it is a categorical code and should be of object type so that it is dummy-encoded. This change will be made below.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140681 entries, 0 to 45018
Data columns (total 10 columns):
TransactionId      140681 non-null object
ProviderId         140681 non-null object
ProductId          140681 non-null object
ProductCategory    140681 non-null object
ChannelId          140681 non-null object
Amount             140681 non-null float64
Value              140681 non-null int64
PricingStrategy    140681 non-null int64
FraudResult        95662 non-null float64
TransactionDate    140681 non-null object
dtypes: float64(2), int64(2), object(6)
memory usage: 11.8+ MB


In [0]:
# change 'PricingStrategy' to object-type
# (plain column assignment replaces the column, so the new dtype is kept)
df['PricingStrategy'] = df['PricingStrategy'].astype('object')

Now all that is left to do is to create those dummy variables.

In [0]:
# save the transaction ids to insert into the data frame after creating dummy variables
df_ids = df['TransactionId']
# create dummy variables
df = pd.get_dummies(df.iloc[:, 1:], drop_first=True)
# add the transaction ids back as the last column
df['TransactionId'] = df_ids
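
A small sketch of what `pd.get_dummies` does with `drop_first=True`, using made-up values (the frame `toy` is illustrative). By default only object, string or category columns are encoded, which is why 'PricingStrategy' was converted above; numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical rows with one numeric and two categorical features
toy = pd.DataFrame({
    'Amount': [1000.0, -20.0, 500.0],
    'PricingStrategy': [2, 4, 2],
    'ChannelId': ['ChannelId_3', 'ChannelId_2', 'ChannelId_3'],
})

# While 'PricingStrategy' is an integer column it is left as-is
as_int = pd.get_dummies(toy, drop_first=True)

# As an object column it is treated as categorical and encoded
toy['PricingStrategy'] = toy['PricingStrategy'].astype('object')
as_object = pd.get_dummies(toy, drop_first=True)

print(list(as_int.columns))
print(list(as_object.columns))
```

With `drop_first=True` the first level of each encoded column is dropped, so a feature with k levels yields k - 1 dummy columns.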

## Pre-Processing

The data set will now be split back into training and test sets.

In [0]:
train = df[:len(train)]
test = df[len(train):]
test = test.drop('FraudResult', axis=1)

Have a look at the imbalance of this data set with regard to the 'FraudResult' label.

In [14]:
print('The number of non-fraudulent cases: ', len(train[train['FraudResult']==0.0]))
print('The number of fraudulent cases: ', len(train[train['FraudResult']==1.0]))

The number of non-fraudulent cases:  95469
The number of fraudulent cases:  193


The data set is evidently very imbalanced. One possible way around this is to take a sample of the training set in which the fraudulent and non-fraudulent cases are equally represented. This is done below.

In [0]:
# non-fraudulent subset
train_0 = train.loc[train['FraudResult']==0.0, :]
# fraudulent subset
train_1 = train.loc[train['FraudResult']==1.0, :]
# create a new data set with equal representation of the two subsets
train = pd.concat([train_0.sample(len(train_1)), train_1])
# create the training set of labels
y_train = train['FraudResult']
# create a training set of only features
X_train = train.drop(['FraudResult', 'TransactionId'], axis=1)
# create a test set of only features
X_test = test.drop('TransactionId', axis=1)
# keep the test ids for later use
test_ids = test['TransactionId']
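
The undersampling step can be sketched on made-up data as follows. The frame `toy` and the `random_state` value are assumptions added for illustration; fixing `random_state` makes the sample reproducible, which the cell above does not do.

```python
import pandas as pd

# Hypothetical imbalanced labels: 8 non-fraudulent, 2 fraudulent
toy = pd.DataFrame({
    'Amount': range(10),
    'FraudResult': [0.0] * 8 + [1.0] * 2,
})

majority = toy[toy['FraudResult'] == 0.0]
minority = toy[toy['FraudResult'] == 1.0]

# Draw as many majority rows as there are minority rows
balanced = pd.concat([
    majority.sample(len(minority), random_state=0),
    minority,
])
print(balanced['FraudResult'].value_counts().to_dict())
```

Undersampling discards most of the majority class. In this notebook it keeps 193 of the 95469 non-fraudulent rows, so the models see only a small part of the available data.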

## Models

### Logistic Regression
First, a logistic regression model is fitted on the training set, and its accuracy on that same training set is printed.

In [16]:
lr = LogisticRegression(C=0.1, solver='lbfgs', multi_class='multinomial')
lr.fit(X_train, y_train)
print('Logistic regression accuracy: ', accuracy_score(y_train, lr.predict(X_train)))

Logistic regression accuracy:  0.5


This model, unfortunately, is very inaccurate: it correctly predicts only half of the training cases, which on a balanced sample is no better than guessing.

### Nearest Centroid
Second, a nearest centroid model is fitted on the training set, and its accuracy on that same training set is printed.

In [17]:
nc = NearestCentroid()
nc.fit(X_train, y_train)
print('Nearest centroid accuracy: ', accuracy_score(y_train, nc.predict(X_train)))

Nearest centroid accuracy:  0.7072538860103627


That's much better! This model will now be used to predict the test set's labels, and the predicted labels and their associated transaction ids will be saved to CSV.

In [0]:
# predict the test labels and create a dataframe with these labels and their associated transaction ids
predictions = pd.DataFrame({'TransactionId': test_ids.values, 'FraudResult': nc.predict(X_test)})
# save this dataframe to CSV
predictions.to_csv('submission.csv', index=False)
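
Both accuracy scores above are computed on the rows the models were trained on. A hedged sketch of measuring accuracy on held-out rows instead is given below. It uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; the split size and `random_state` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

# Synthetic stand-in for the balanced training sample
X, y = make_classification(n_samples=386, n_features=10, random_state=0)

# Hold back a quarter of the rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = NearestCentroid()
model.fit(X_fit, y_fit)

# Compare accuracy on seen rows with accuracy on unseen rows
train_accuracy = accuracy_score(y_fit, model.predict(X_fit))
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print('Training accuracy: ', train_accuracy)
print('Validation accuracy: ', val_accuracy)
```

The validation score is usually the more honest estimate of how a model will behave on the competition test set.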