# AlvinApp Competition

This is a starter notebook for the AlvinApp Competition on Zindi.

This notebook covers:

*   Loading the data
*   Simple Exploratory Data Analysis and an example of feature engineering
*   Data preprocessing and data wrangling
*   Creating a simple model
*   Making a submission
*   Some tips for improving your score







## Importing libraries

In [76]:
# Dataframe and Plotting libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning libraries
from catboost import Pool, CatBoostClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.model_selection import train_test_split


pd.set_option('display.max_columns', None)
# from google.colab import files
import warnings
warnings.filterwarnings('ignore')

## 1. Load the dataset

In [2]:
# Testing path
# path = '/content/drive/MyDrive/Colab Notebooks/alvinapp/'
path = ""

In [40]:
# Load the files into a Pandas Dataframe
train = pd.read_csv(path+'Train.csv')
test = pd.read_csv(path+'Test.csv')
sub = pd.read_csv(path+'SampleSubmission.csv')

In [41]:
# Let’s observe the shape of our datasets.
print('Train data shape :', train.shape)
print('Test data shape :', test.shape)

Train data shape : (373, 12)
Test data shape : (558, 11)

## 3. Data preprocessing and data wrangling


In [42]:
# Fill missing gender values with an explicit "Unknown" category
train["USER_GENDER"] = train["USER_GENDER"].fillna("Unknown")
test["USER_GENDER"] = test["USER_GENDER"].fillna("Unknown")

In [43]:
# Pass axis by keyword: the positional form was removed in pandas 2.0
train.drop(['USER_AGE'], axis = 1, inplace = True)
test.drop(['USER_AGE'], axis = 1, inplace = True)

In [44]:
train["train"] = 1
test["train"] = 0

In [45]:
all_data = pd.concat([train, test])

In [46]:
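# all_data = pd.get_dummies(all_data, prefix_sep="_", columns=['MERCHANT_NAME'])
# Frequency-encode high-cardinality columns: replace each value with the
# fraction of rows (train and test combined) in which it appears
freq_coll = ['MERCHANT_NAME', 'USER_ID']
for col in freq_coll:
    all_data[col] = all_data[col].map(all_data.groupby(col).size() / len(all_data))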
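
The loop above is a frequency encoding: each category value is replaced by the share of rows in which it appears. A minimal sketch on made-up data follows; the merchant names and the `MERCHANT_NAME_FREQ` column are invented for illustration. It uses `value_counts(normalize=True)`, which gives the same result as the `groupby` version when the column has no missing values.

```python
import pandas as pd

# Toy data; these merchant names are invented for illustration.
toy = pd.DataFrame({"MERCHANT_NAME": ["A", "B", "A", "C", "A", "B"]})

# Share of rows per category: A -> 3/6, B -> 2/6, C -> 1/6
freq = toy["MERCHANT_NAME"].value_counts(normalize=True)

# Replace each value with its relative frequency.
toy["MERCHANT_NAME_FREQ"] = toy["MERCHANT_NAME"].map(freq)
print(toy)
```

Rare merchants end up with small values and common ones with large values, so a tree model can split on popularity without one column per merchant.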

In [47]:
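# Label-encode the remaining categorical columns on the combined data so that
# train and test share one mapping. USER_ID was already frequency-encoded above,
# so this step only re-maps its frequencies to integer ranks.
le = LabelEncoder()
LE_cols = ['USER_ID', 'IS_PURCHASE_PAID_VIA_MPESA_SEND_MONEY', 'USER_HOUSEHOLD', 'USER_GENDER']
for le_col in LE_cols:
    all_data[le_col] = le.fit_transform(all_data[le_col])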
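
Reusing one `LabelEncoder` across columns works for encoding, but each `fit_transform` call overwrites the previous mapping, so the codes can no longer be inverted. A small sketch of an alternative that keeps one encoder per column; the values and the `encoders` dict are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented values for illustration only.
combined = pd.DataFrame({
    "USER_GENDER": ["Male", "Female", "Unknown", "Female"],
    "USER_HOUSEHOLD": ["2", "1", "3", "1"],
})

# One encoder per column, kept in a dict so each mapping can be inverted later.
encoders = {}
for col in combined.columns:
    encoders[col] = LabelEncoder()
    combined[col] = encoders[col].fit_transform(combined[col])

print(combined)
# LabelEncoder assigns codes in sorted order of the original values.
print(encoders["USER_GENDER"].inverse_transform([0, 1, 2]))
```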

Now split the combined data back into train and test and check how the steps above affected the shapes.

In [48]:
train = all_data[all_data["train"] == 1]
test = all_data[all_data["train"] == 0]

In [49]:
print("Train: ", train.shape)
print("Test: ", test.shape)

Train:  (373, 12)
Test:  (558, 12)


### Drop unnecessary columns

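We drop the two timestamp columns, since we are not currently using when a purchase was made or when it was categorized, along with the helper train flag. Transaction_ID is dropped from the training data only; the test set keeps it until prediction time.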

In [50]:
train = train.drop(['MERCHANT_CATEGORIZED_AT','PURCHASED_AT', 'Transaction_ID', "train"], axis=1)
test = test.drop(['MERCHANT_CATEGORIZED_AT','PURCHASED_AT', "train", "MERCHANT_CATEGORIZED_AS"], axis=1)

In [51]:
# Separate the features from the target in the training data
X = train.drop(["MERCHANT_CATEGORIZED_AS"], axis=1)
y = train["MERCHANT_CATEGORIZED_AS"]

In [52]:
X.shape

(373, 7)

## 4. Creating a simple model

In [108]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=101, stratify = y)

In [109]:
y_train.nunique(), y_val.nunique()

(13, 11)

In [110]:
# Baseline: random forest scored with multiclass log loss.
# The 10% validation split contains only 11 of the 13 classes, so log_loss
# needs the full class list via labels; without it, it raises
# "ValueError: y_true and y_pred contain different number of classes 11, 13".
clf = RandomForestClassifier(random_state = 1)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_val)
print("LOG LOSS OF THE MODEL: ", log_loss(y_val, y_pred, labels = clf.classes_))
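
`log_loss` can only score a validation split that lacks some classes if it is told the full label set. A minimal sketch with invented labels and probabilities shows the `labels` argument; in the notebook, the equivalent would presumably be the fitted model's `classes_`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Three classes were seen in training (sorted, as in model.classes_) ...
classes = ["Groceries", "Health", "Shopping"]
# ... but the validation labels happen to contain only two of them.
y_val_toy = ["Groceries", "Shopping", "Groceries"]

# One probability column per training class; each row sums to 1.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
])

# Without labels=, log_loss raises a ValueError about mismatched class counts.
score = log_loss(y_val_toy, proba, labels=classes)
print(score)
```

The score is the mean negative log of the probability assigned to each true class, so only the columns for "Groceries" and "Shopping" contribute here.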

In [100]:
cb_model_ = CatBoostClassifier(l2_leaf_reg = 9.441413522475084, depth = 5, bootstrap_type = 'MVS', learning_rate = 0.01712339213540557, n_estimators = 2950,
                               leaf_estimation_iterations = 1, random_strength = 0.18095032711212016, loss_function = 'MultiClass', verbose = 0, random_state = 1)
cb_model_.fit(X_train, y_train)
y_pred = cb_model_.predict(X_val)
print("ACCURACY OF THE MODEL: ", accuracy_score(y_val, y_pred))

ACCURACY OF THE MODEL:  0.5526315789473685


In [81]:
# Second CatBoost configuration, scored with multiclass log loss.
# log_loss needs class probabilities, not predicted labels: passing the output of
# predict() raises "ValueError: could not convert string to float: 'Data & WiFi'".
cb_model = CatBoostClassifier(l2_leaf_reg = 9.441413522475084, depth = 7, bootstrap_type = 'Bayesian', learning_rate = 0.01772339213540557, n_estimators = 3167,
                              leaf_estimation_iterations = 1, random_strength = 0.17095032711212016, loss_function = 'MultiClass', verbose = 0, random_state = 101)
cb_model.fit(X_train, y_train)
y_pred = cb_model.predict_proba(X_val)
print("LOG LOSS OF THE MODEL: ", log_loss(y_val, y_pred, labels = cb_model.classes_))

## 5. Making the first submission

Let’s see how the model performs on the competition test data set provided and how we rank on the competition leaderboard.

First we make predictions on the competition test data set.

In [86]:
# Get predicted class probabilities for the test data from the fitted CatBoost model
predictions = cb_model_.predict_proba(test.drop("Transaction_ID", axis=1))

In [93]:
def predict_and_submit(test_, filename):
    # Column order must match the model's sorted class labels (one probability column per class)
    classes = ['Bills & Fees', 'Data & WiFi', 'Education', 'Emergency fund', 'Family & Friends', 'Going out', 'Groceries',
               'Health', 'Loan Repayment', 'Miscellaneous', 'Rent / Mortgage', 'Shopping', 'Transport & Fuel']
    df_ = pd.DataFrame(test_, columns = classes)
    # Assumes the rows of SampleSubmission.csv are in the same order as Test.csv
    df_.insert(0, "Transaction_ID", sub["Transaction_ID"].values)
    df_.to_csv(f'{filename}.csv', index = False)
    return df_.shape

In [94]:
predict_and_submit(predictions, 'Manager_')

(558, 14)
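
The hard-coded column order in `predict_and_submit` only works if it matches the order of the model's probability columns. Fitted scikit-learn and CatBoost classifiers expose that order as `classes_`, so a more defensive version could take the class list as an argument. The sketch below uses a hypothetical helper, `build_submission`, with invented IDs and probabilities.

```python
import numpy as np
import pandas as pd

def build_submission(ids, proba, classes):
    """Return a submission frame: an ID column followed by one column per class."""
    out = pd.DataFrame(proba, columns=list(classes))
    out.insert(0, "Transaction_ID", list(ids))
    return out

# Invented IDs, classes, and probabilities for illustration.
toy_ids = ["ID_1", "ID_2"]
toy_classes = ["Groceries", "Health", "Shopping"]
toy_proba = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])

toy_sub = build_submission(toy_ids, toy_proba, toy_classes)
print(toy_sub)
```

In the notebook this might be called with the test set's Transaction_ID column, the output of `predict_proba`, and the model's `classes_`, followed by `to_csv(..., index=False)`.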