<a href="https://colab.research.google.com/github/abakamousa/demo_kmerai/blob/main/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **MOBILE MONEY FRAUD DETECTION**

*  **What is mobile money?** Mobile money is a digital payment platform in its own right. The mobile money account acts as an electronic wallet associated with the SIM card on a user’s cellphone. The user can send and receive funds or pay for services from their cellphone without the need for a traditional bank account. They can also use registered agents to deposit cash (cash-in) or transfer funds to other accounts and receive cash in exchange (cash-out).

*   With over **$2 billion** of funds transferred every day, it’s easy to see why financial service companies such as Stripe are investing in mobile money markets. They recognize the potential growth in regions such as **sub-Saharan** Africa where access to formal banking systems may be limited. Offering fast transactions, convenient access and secure payments, mobile money gives users instant control of their finances.

*  As this industry grows, it faces greater risks relating to mobile money fraud. In 2020, nearly $4 billion was lost to fraudulent mobile money activity and scams, a figure that’s expected to grow over time as fraudsters adopt increasingly sophisticated methods.

*   The most common types of mobile money fraud involve **gaining control over a user’s cellphone** by **phishing** via voice calls (vishing) or SMS messages (smishing). Once scammers have access to the device, they may carry out SIM-swap fraud by instructing the phone service provider to transfer the number to one of their own SIM cards.



 **Install libraries**

In [1]:
!pip install opendatasets
#!pip install dython

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


# Import libraries

In [2]:
import numpy             as np 
import pandas            as pd 
import opendatasets      as od
import seaborn           as sns
import matplotlib.pyplot as plt


from sklearn.preprocessing     import LabelEncoder
from sklearn.ensemble          import RandomForestClassifier
from sklearn.linear_model      import SGDClassifier
from sklearn.linear_model      import LogisticRegression
from xgboost                   import XGBClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing     import RobustScaler
from imblearn.over_sampling    import SMOTE 
from sklearn.model_selection   import train_test_split, GridSearchCV
from collections               import Counter
from sklearn.pipeline          import Pipeline
#from dython.nominal        import associations #for correlation analysis between categorical and continuous values

# Load dataset from Kaggle

In [3]:
url="https://www.kaggle.com/datasets/ealaxi/paysim1"
od.download(url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: abakamousa
Your Kaggle Key: ··········
Downloading paysim1.zip to ./paysim1


100%|██████████| 178M/178M [00:01<00:00, 119MB/s]





# Functions

In [4]:
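def encode_df(df):
    """Return a copy of df with every object (string) column label-encoded."""
    # Work on a copy so that a slice of another DataFrame is not modified in place
    df = df.copy()
    cat_cols = df.select_dtypes(include='object').columns
    # Fit a fresh LabelEncoder on each categorical column
    for col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col])

    return df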

# Exploratory data analysis

In [None]:
df = pd.read_csv("/content/paysim1/PS_20174392719_1491204439457_log.csv")

In [None]:
df.head()



*    step - a unit of time in the real world. Here 1 step is 1 hour, and there are 744 steps in total (described as a 30-day simulation, although 744 hours is 31 days).

*    type - CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER.

*    amount - amount of the transaction in local currency.

*    nameOrig - customer who started the transaction.

*    oldbalanceOrg - balance of the originator before the transaction.

*    newbalanceOrig - balance of the originator after the transaction.

*    nameDest - customer who is the recipient of the transaction.

*    oldbalanceDest - balance of the recipient before the transaction. Note that there is no information for recipients whose name starts with M (merchants).

*    newbalanceDest - balance of the recipient after the transaction. Note that there is no information for recipients whose name starts with M (merchants).

*    isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying them, first by transferring the funds to another account and then by cashing out of the system.

*    isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
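
The flagging rule described for isFlaggedFraud can be sketched on a few toy rows. This is only an illustration: the rows are made up, `FLAG_THRESHOLD` and `flag_rule` are hypothetical names, and the sketch assumes the rule applies to TRANSFER rows only. The real column in the dataset may not follow this rule exactly.

```python
import pandas as pd

# Toy transactions (made-up values, not taken from the PaySim file)
toy = pd.DataFrame({
    "type":   ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250_000.0, 150_000.0, 300_000.0, 500_000.0],
})

FLAG_THRESHOLD = 200_000  # threshold quoted in the column description

# Assumption: only TRANSFER operations above the threshold are flagged
toy["flag_rule"] = (
    (toy["type"] == "TRANSFER") & (toy["amount"] > FLAG_THRESHOLD)
).astype(int)

print(toy)
```

Only the first row satisfies both conditions, so it is the only one flagged.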


In [None]:
df.describe(include='all')

In [None]:
df.isnull().sum()

In [None]:
df['isFraud'].value_counts(normalize=True)

In [None]:
df['isFlaggedFraud'].value_counts(normalize=True)

**Countplot of each type of transactions**

In [None]:
plt.figure(figsize=(10,5))
ax=sns.countplot(x = "type", hue="isFraud", data = df)
plt.title('Countplot of different types of transaction (nonFraud and Fraud)')
for p in ax.patches:
  ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))
        

**Note:** there are no fraudulent transactions among the PAYMENT, CASH_IN and DEBIT transactions.

**Proportion of different transactions**

In [None]:
# Use a name that does not shadow the built-in `type`
type_counts = df['type'].value_counts()
transaction = type_counts.index
count = type_counts.values

plt.figure(figsize=(8,8))
plt.pie(count, labels=transaction, autopct='%1.0f%%')
plt.legend(loc='lower left')
plt.show()

**Dataset splitting**

Since fraud can still occur on PAYMENT, CASH_IN and DEBIT operations, even though none is labelled as such in this dataset, we split our dataset into two parts:

*   A first set, made of the PAYMENT, CASH_IN and DEBIT transactions, intended for unsupervised learning to detect anomalies.
*   A second set, made of the CASH_OUT and TRANSFER transactions, intended for supervised learning.



In [None]:
df_unsupervised = df.loc[(df["type"] == "PAYMENT") | (df["type"] == "CASH_IN") | (df["type"] == "DEBIT")]
df_supervised   = df.loc[(df["type"] == "TRANSFER") | (df["type"] == "CASH_OUT")]
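
The unsupervised branch is not developed further in this notebook. As a rough sketch of what anomaly detection on `df_unsupervised` could look like, the example below fits an isolation forest on synthetic data. The choice of `IsolationForest`, the synthetic features and the `contamination` value are all assumptions made for illustration, not the method used here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for two numeric features (e.g. amount, oldbalanceOrg)
normal = rng.normal(loc=100.0, scale=10.0, size=(500, 2))
unusual = np.array([[1000.0, 1000.0], [-500.0, 900.0]])
X_demo = np.vstack([normal, unusual])

# contamination is a guess at the share of anomalies; it must be tuned on real data
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(X_demo)  # -1 = anomaly, 1 = normal

print("flagged rows:", np.where(labels == -1)[0])
```

On real data the flagged rows would still need manual review, since there are no fraud labels for these transaction types to validate against.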

# **Data preparation for supervised ML with df_supervised**

In [None]:
print("Number of duplicated rows: ", df_supervised.duplicated().sum())

In [None]:
#label encoding
df_supervised = encode_df(df_supervised)

**Correlation analysis**

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(df_supervised.corr(), annot = True, fmt='.1g')

**Note:** the maximum correlation between two distinct variables in our dataset is 0.8.

**Boxplot**

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(data=df_supervised, orient="h", palette="Set2")

**Class analysis**

In [None]:
df_supervised['isFraud'].value_counts(normalize=True)

**Note:**
* the features are not on the same scale
* outliers are present, so we take this into account when choosing the data normalization method

**Feature scaling**

In [None]:
scaler = RobustScaler()
#df_supervised_scaled = scaler.fit_transform(df_supervised)
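
To illustrate why outliers motivate `RobustScaler`, here is a small comparison with `StandardScaler` on made-up numbers. `RobustScaler` centres on the median and divides by the interquartile range, so a single extreme value barely affects how the other values are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value (illustrative numbers only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(x)      # (x - median) / IQR
standard = StandardScaler().fit_transform(x)  # (x - mean) / std

print("robust  :", robust.ravel())
print("standard:", standard.ravel())
```

With `StandardScaler`, the outlier inflates the mean and standard deviation, so the four ordinary values end up squeezed almost on top of each other. With `RobustScaler` they stay clearly separated.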

**Train Test split**

In [None]:
#feature = df_supervised_scaled[:,:-2]
#target  = df_supervised_scaled[:,-2]
feature = df_supervised.drop(['isFraud', 'isFlaggedFraud'], axis=1)
target  = df_supervised.isFraud


In [None]:
#feature selection
new_feature = SelectKBest(f_classif, k=7).fit_transform(feature, target)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_feature, target, test_size=0.2)
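
With a class as rare as fraud, a purely random split can leave the train and test sets with slightly different fraud rates. A possible variant, shown here on synthetic data, passes `stratify` so both sides keep the same class ratio, and `random_state` so the split is reproducible. The data and the seed are assumptions for the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(237)

# Synthetic stand-in: 10,000 rows with 1% positives (not the PaySim data)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=237
)

print("train fraud rate:", y_tr.mean())
print("test fraud rate: ", y_te.mean())
```

Applied to this notebook, the equivalent call would add `stratify=target` to the existing `train_test_split`.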

**Resampling**

In [None]:
sm = SMOTE(sampling_strategy='minority', random_state=237)

X_res, y_res = sm.fit_resample(X_train, y_train)

In [None]:
print('Resampled dataset shape %s' % Counter(y_res))
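
For intuition about what SMOTE does, here is a deliberately simplified sketch of its core idea: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. `smote_sketch` is a hypothetical helper written for this illustration. It is not the imblearn implementation, and it runs on toy data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority samples
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))  # toy minority class
X_synth = smote_sketch(X_minority, n_new=80)
print(X_synth.shape)
```

Because the synthetic points are built from training samples, resampling should only ever be applied to the training split, as is done above with `X_train` and `y_train`.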

# Prediction

In [None]:
#models
clf1 = SGDClassifier()
clf2 = RandomForestClassifier()
clf3 = LogisticRegression()
clf4 = XGBClassifier()


In [None]:
#pipeline

pipe_SGD  = Pipeline([('scaler', scaler), ('SGD', clf1)])
pipe_RF   = Pipeline(steps=[("scaler", scaler), ("RF", clf2)]) 
pipe_LR   = Pipeline(steps=[("scaler", scaler), ("LogisticRegression", clf3)])
pipe_XGB  = Pipeline(steps=[("scaler", scaler), ("XGB", clf4)])

In [None]:
#grid parameters


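hyper_params_SGD = [{
# 'log' was renamed 'log_loss' in scikit-learn 1.1; use the name your version accepts
'SGD__loss' : ['hinge', 'log', 'squared_hinge', 'modified_huber'],
# start at 0.01: alpha=0 is rejected with the default learning_rate='optimal'
'SGD__alpha' : np.arange(0.01, 0.1, 0.01),
'SGD__penalty' : ['l2', 'l1']
}]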

hyper_params_RF = [{
'RF__n_estimators' : [100, 200, 500, 1000],
# "auto" meant "sqrt" for classifiers and was removed in scikit-learn 1.3
'RF__max_features' : ["sqrt", "log2"],
'RF__bootstrap': [True],
'RF__criterion': ['gini', 'entropy'],
'RF__oob_score': [True, False]
}]


hyper_params_LR = [{
'LogisticRegression__solver': ['newton-cg', 'sag', 'lbfgs'],
'LogisticRegression__multi_class': ['ovr', 'multinomial']
}]

hyper_params_XGB = [{
'XGB__n_jobs': [4], # scikit-learn wrapper name for nthread; with hyperthreading, xgboost may become slower
'XGB__objective': ['binary:logistic'],
'XGB__learning_rate': [0.05], # the so-called `eta` value
'XGB__max_depth': [6],
'XGB__min_child_weight': [11],
'XGB__verbosity': [0], # replaces the older `silent` flag
'XGB__subsample': [0.8],
'XGB__colsample_bytree': [0.7],
'XGB__n_estimators': [5], # number of trees, change it to 1000 for better results
'XGB__missing': [-999],
'XGB__random_state': [1337] # scikit-learn wrapper name for seed
}]




In [None]:
SGD_grid_search = GridSearchCV(estimator=pipe_SGD,
        param_grid=hyper_params_SGD,
        scoring='accuracy',
        n_jobs=-1,
        cv=3,
        verbose = 10)

RF_grid_search = GridSearchCV(estimator=pipe_RF,
        param_grid=hyper_params_RF,
        scoring='accuracy',
        n_jobs=-1,
        cv=3,
        verbose = 10)

LR_grid_search = GridSearchCV(estimator=pipe_LR,
        param_grid=hyper_params_LR,
        scoring='accuracy',
        n_jobs=-1,
        cv=3,
        verbose = 10)

XGB_grid_search = GridSearchCV(estimator=pipe_XGB,
        param_grid=hyper_params_XGB,
        scoring='accuracy',
        n_jobs=-1,
        cv=3,
        verbose = 10)

grids = [SGD_grid_search, RF_grid_search, XGB_grid_search, LR_grid_search]

In [None]:
#for param in XGB_grid_search.get_params().keys():
#    print(param)

In [None]:
# Each element of `grids` is a GridSearchCV object wrapping one pipeline
for grid in grids:
    grid.fit(X_res, y_res)

# Performance evaluation
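
The grid searches above are scored with accuracy, which is informative on the resampled (balanced) training data but can be misleading on an imbalanced test set. The toy example below, using made-up labels and a hypothetical model, shows why precision, recall and F1 are worth reporting alongside accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels: 990 legitimate and 10 fraudulent transactions (made-up numbers)
y_true = np.array([0] * 990 + [1] * 10)

# A useless model that never predicts fraud
y_never = np.zeros_like(y_true)

# A hypothetical model that catches 8 of 10 frauds and raises 5 false alarms
y_model = y_true.copy()
y_model[990:992] = 0   # two missed frauds
y_model[:5] = 1        # five false alarms

for name, y_pred in [("never-fraud", y_never), ("model", y_model)]:
    print(name,
          "accuracy=%.3f" % accuracy_score(y_true, y_pred),
          "precision=%.3f" % precision_score(y_true, y_pred, zero_division=0),
          "recall=%.3f" % recall_score(y_true, y_pred),
          "f1=%.3f" % f1_score(y_true, y_pred, zero_division=0))
```

The never-fraud model reaches 99% accuracy while detecting nothing, which only recall exposes. On this notebook's data, the same metrics could be computed from `best_estimator_.predict(X_test)` of each fitted grid search and compared with `y_test`.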

# Inference