# Machine Learning - Project

In [1]:
%pip install pandas
%pip install scikit-learn

import pandas as pd

Collecting pandas
  Downloading pandas-2.0.1-cp311-cp311-macosx_11_0_arm64.whl (10.7 MB)
Collecting pytz>=2020.1
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting numpy>=1.21.0
  Downloading numpy-1.24.3-cp311-cp311-macosx_11_0_arm64.whl (13.8 MB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-1.24.3 pandas-2.0.1 pytz-2023.3 tzdata-2023.3

# Column Definitions

Here is the meaning of each column, as explained in `Xente_Variable_Definitions.csv`.

 * **TransactionId**: Unique transaction identifier on platform
 * **BatchId**: Unique number assigned to a batch of transactions for processing
 * **AccountId**: Unique number identifying the customer on platform
 * **SubscriptionId**: Unique number identifying the customer subscription
 * **CustomerId**: Unique identifier attached to Account
 * **CurrencyCode**: Country currency
 * **CountryCode**: Numerical geographical code of country
 * **ProviderId**: Source provider of the item bought
 * **ProductId**: Name of the item being bought
 * **ProductCategory**: Broader product category that groups ProductIds
 * **ChannelId**: Identifies whether the customer used web, Android, iOS, pay later or checkout
 * **Amount**: Value of the transaction. Positive for debits from the customer account and negative for credits into the customer account
 * **Value**: Absolute value of the amount
 * **TransactionStartTime**: Transaction start time
 * **PricingStrategy**: Category of Xente's pricing structure for merchants
 * **FraudResult**: Fraud status of the transaction: 1 = yes, 0 = no


# Load training data

The first step is to load the training data from `training.csv`.

In [2]:
training_data_path = './data/training.csv'

# Read data from file
training_data = pd.read_csv(training_data_path)

training_data.head(5)

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15T02:18:49Z,2,0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15T02:19:08Z,2,0
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15T02:44:21Z,2,0
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15T03:32:55Z,2,0
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15T03:34:21Z,2,0
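
The column definitions describe `Value` as the absolute value of `Amount`. A quick way to check that claim is sketched below on a small frame that copies only the `Amount` and `Value` cells of the five rows printed above. The names `sample` and `mismatch` are illustrative and not part of the notebook.

```python
import pandas as pd

# Amount and Value copied from the five rows shown by training_data.head(5)
sample = pd.DataFrame({
    "Amount": [1000.0, -20.0, 500.0, 20000.0, -644.0],
    "Value": [1000, 20, 500, 21800, 644],
})

# Rows where Value differs from the absolute value of Amount
mismatch = sample[sample["Value"] != sample["Amount"].abs()]
print(mismatch)
```

On these five rows, one (index 3, with `Amount` 20000.0 and `Value` 21800) does not match, so the definition is probably worth checking on the full `training_data` before treating one column as a copy of the other.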


# Data analysis

Now that our data is loaded, we will analyse the different columns in order to get the best possible model.

# Mutual Information

First things first, we will look at the relationship between each column and the target value.

In [29]:
from sklearn.feature_selection import mutual_info_classif

def make_mi_scores(X, y, discrete_features):
    # FraudResult is a 0/1 class label, so the classification estimator applies
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
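
To get a feel for what these scores mean, here is a minimal sketch on synthetic data (the frame `toy` and its column names are made up for illustration). It uses `mutual_info_classif`, the scikit-learn estimator intended for a discrete target such as a 0/1 label, and treats every feature as discrete. A feature that copies the target should score about ln 2 ≈ 0.693 nats, while a feature that is independent of it should score about 0.

```python
import math

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Balanced binary target: 0, 1, 0, 1, ...
target = pd.Series([0, 1] * 50, name="target")

toy = pd.DataFrame({
    "copy_of_target": target,          # fully determines the target
    "independent": [0, 0, 1, 1] * 25,  # every (feature, target) pair is equally frequent
})

scores = pd.Series(
    mutual_info_classif(toy, target, discrete_features=True),
    index=toy.columns,
    name="MI Scores",
)
print(scores)
print("ln 2 =", math.log(2))
```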

In [38]:
# Define the datasets
X = training_data.copy()
y = X.pop('FraudResult')

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

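# Flag integer columns as discrete features; checking the dtype kind works for
# every integer width, whereas comparing with `int` only matches the platform default
discrete_features = X.dtypes.map(pd.api.types.is_integer_dtype).to_numpy(dtype=bool)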
print(X.head())
mi_scores = make_mi_scores(X.head(10), y.head(10), discrete_features)
# mi_scores[::3]  # show a few features with their MI scores


0    0
1    1
2    2
3    3
4    4
Name: BatchId, dtype: int64


IndexError: boolean index did not match indexed array along dimension 1; dimension is 1 but corresponding boolean dimension is 15
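
The `IndexError` above reports a boolean mask with 15 entries applied to an array with a single column, and the printed output is a single `BatchId` series rather than a frame. This suggests the mask and the feature matrix came from different objects. One way to rule that out is to derive the mask inside the helper from the exact frame being scored. The sketch below is a variant under stated assumptions, not the notebook's code: `make_mi_scores_aligned` and the toy data are hypothetical, and it uses `mutual_info_classif` because the target is a class label.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype
from sklearn.feature_selection import mutual_info_classif


def make_mi_scores_aligned(X, y, random_state=0):
    """Score each column of X against y, building the discrete mask from X itself."""
    # One boolean per column of X, so the mask length always equals X.shape[1]
    discrete_mask = [is_integer_dtype(dtype) for dtype in X.dtypes]
    scores = mutual_info_classif(
        X, y, discrete_features=discrete_mask, random_state=random_state
    )
    return pd.Series(scores, name="MI Scores", index=X.columns).sort_values(ascending=False)


# Hypothetical toy data: one integer (discrete) and one float (continuous) feature
toy_X = pd.DataFrame({
    "channel": [0, 1, 2, 3] * 10,
    "amount": [float(i) for i in range(40)],
})
toy_y = pd.Series([0, 1] * 20, name="FraudResult")

aligned_scores = make_mi_scores_aligned(toy_X, toy_y)
print(aligned_scores)
```

Because the mask is computed from `X` inside the function, passing a subset of columns or rows cannot leave a stale mask behind.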

## Transaction country

We start with the `CurrencyCode` and `CountryCode` columns. Both describe the same thing: the country in which the transaction took place. Keeping both is therefore redundant for the model.

We will first look at the occurrences of each country and then check their relationship with the target value.

In [20]:
# Load CurrencyCode and CountryCode
X_CurrencyCode = training_data['CurrencyCode']
X_CountryCode = training_data['CountryCode']

# Count occurrences of each distinct value
CurrencyCode_Occurences = X_CurrencyCode.value_counts()
CountryCode_Occurences = X_CountryCode.value_counts()

# Show information
print(CurrencyCode_Occurences)
print('\n---\n')
print(CountryCode_Occurences)

CurrencyCode
UGX    95662
Name: count, dtype: int64

---

CountryCode
256    95662
Name: count, dtype: int64


As we can see, `CurrencyCode` and `CountryCode` each have only one distinct value across the whole dataset (`UGX` and `256`, 95662 rows each). Given that, these columns will not give any additional information to our model.
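
The same check can be run for every column at once. The sketch below uses a tiny hypothetical frame (`demo` is not part of the notebook) to find columns with a single distinct value and drop them. On the real data the same two lines would be applied to `training_data`.

```python
import pandas as pd

# Hypothetical miniature of the training data
demo = pd.DataFrame({
    "CurrencyCode": ["UGX", "UGX", "UGX"],
    "CountryCode": [256, 256, 256],
    "Amount": [1000.0, -20.0, 500.0],
})

# Columns with at most one distinct value carry no information for a model
constant_cols = [col for col in demo.columns if demo[col].nunique(dropna=False) <= 1]
reduced = demo.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```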

## Categorise the data

Now we are going to categorise our discrete values. We will apply a One-Hot Encoder to `ProductCategory` and `ChannelId`.

As a reminder, a One-Hot Encoder creates new binary columns, each indicating the presence of one possible value from the original data.

![One-Hot Encoding](./pictures/one-hot-encoding.png)
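
As a small illustration of the idea, the sketch below one-hot encodes a toy column with `pandas.get_dummies`, a pandas shortcut used here only for demonstration (the notebook itself uses scikit-learn's `OneHotEncoder`). The frame `toy` and its values are assumptions chosen to mirror a few `ProductCategory` labels.

```python
import pandas as pd

toy = pd.DataFrame(
    {"ProductCategory": ["airtime", "financial_services", "airtime", "utility_bill"]}
)

# One binary column per distinct category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["ProductCategory"], dtype=int)
print(encoded)
```

Each row contains exactly one 1, in the column that matches its original category.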

One-Hot Encoding works well when the categorical variable takes on a small number of values (up to about 15, according to [Kaggle](https://www.kaggle.com/code/dansbecker/using-categorical-data-with-one-hot-encoding)).

We will therefore check the number of unique categories in each of our columns, and apply One-Hot Encoding if the number of distinct values is less than `15`.


In [27]:
# Define the columns to categorise
CategoriseCols = ['ProductCategory', 'ChannelId']

# Extract our categorical variable
CategoricalVariable = training_data.loc[:, CategoriseCols]

# Get number of unique values in each column
CategoricalVariable.nunique()

ProductCategory    9
ChannelId          4
dtype: int64

As we can see, `ProductCategory` has 9 distinct values and `ChannelId` has 4. Both are below the threshold, so we can apply One-Hot Encoding to each column.
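
The threshold check can also be automated rather than read off by eye. A minimal sketch, assuming a hypothetical helper `low_cardinality_columns` and a toy frame; the limit of 15 is the rule of thumb used in this section.

```python
import pandas as pd


def low_cardinality_columns(df, candidates, max_unique=15):
    """Return the candidate columns that have fewer than max_unique distinct values."""
    counts = df[candidates].nunique()
    return counts[counts < max_unique].index.tolist()


# Toy frame: one column with few categories, one identifier-like column
toy = pd.DataFrame({
    "ChannelId": ["ChannelId_2", "ChannelId_3"] * 10,
    "TransactionId": [f"TransactionId_{i}" for i in range(20)],
})

selected = low_cardinality_columns(toy, ["ChannelId", "TransactionId"])
print(selected)
```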

In [24]:
from sklearn.preprocessing import OneHotEncoder

# Define the columns to categorise
CategoriseCols = ['ProductCategory', 'ChannelId']

# Define One-Hot Encoder
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply the encoder to the ProductCategory and ChannelId columns
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(training_data[CategoriseCols]))

OH_cols_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


# Define X and Y

In [3]:
training_data_copy = training_data.copy()

y_training = training_data_copy.pop('FraudResult')
x_training = training_data_copy.copy()

# Categorise ProviderId and ProductCategory

Apply **One-Hot** Encoding so that the model does not read an order or a weight into the category labels.

In [4]:
from sklearn.preprocessing import OneHotEncoder

cols = ['ProviderId', 'ProductCategory']

# Define One-Hot Encoder
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply the encoder to the ProviderId and ProductCategory columns
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(x_training[cols]))

# One-Hot Encoding drops the index, so put it back
OH_cols_train.index = x_training.index

# Remove the categorical column
num_x_train = x_training.drop(cols, axis=1)

# Add One-Hot column to dataset
OH_x_train = pd.concat([num_x_train, OH_cols_train], axis=1)

OH_x_train.head()



Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProductId,ChannelId,Amount,...,5,6,7,8,9,10,11,12,13,14
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProductId_10,ChannelId_3,1000.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProductId_6,ChannelId_2,-20.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProductId_1,ChannelId_3,500.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProductId_21,ChannelId_3,20000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProductId_6,ChannelId_2,-644.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
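
After this concatenation the frame mixes string column names (the original columns) with integer ones (the encoded columns). Recent scikit-learn versions refuse to fit on a frame whose column names are not all strings, so it may be worth casting them. A minimal sketch on hypothetical toy frames that stand in for `num_x_train` and `OH_cols_train`:

```python
import pandas as pd

# Stand-ins for num_x_train (string names) and OH_cols_train (integer names)
numeric_part = pd.DataFrame({"Amount": [1000.0, -20.0], "Value": [1000, 20]})
encoded_part = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]])

combined = pd.concat([numeric_part, encoded_part], axis=1)

# Cast every column name to str so that all names share one type
combined.columns = combined.columns.astype(str)
print(combined.columns.tolist())
```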


# Compute Mutual Information (MI) Score

In [43]:
from sklearn.feature_selection import mutual_info_classif

# Flag integer columns as discrete features
discrete_features = x_training.dtypes.map(pd.api.types.is_integer_dtype).to_numpy(dtype=bool)

def make_mi_scores(X, y, discrete_features):
    # FraudResult is a 0/1 class label, so the classification estimator applies
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

# x_training still holds string columns, which must be encoded before scoring
# mi_scores = make_mi_scores(x_training, y_training, discrete_features)
# mi_scores[::3]

x_training.head()

Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15T02:18:49Z,2
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15T02:19:08Z,2
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15T02:44:21Z,2
3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15T03:32:55Z,2
4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15T03:34:21Z,2


# Define the target

In [None]:
training_keys = ['']

# Target
y_training = training_data.FraudResult