# Term Project Group 1


The Dataset: https://www.kaggle.com/datasets/waqi786/global-black-money-transactions-dataset

### Explanation:

This dataset provides an overview of black money transactions across multiple countries, focusing on financial activity tied to illegal dealings. It includes fields such as transaction amounts and risk scores, which makes it useful for studying financial crime trends or prototyping anti-money laundering tools.


### Dataset:

Transaction ID: Unique identifier for each transaction. (e.g., TX0000001)

Country: Country where the transaction occurred. (e.g., USA, China)

Amount (USD): Transaction amount in US Dollars. (e.g., 150000.00)

Transaction Type: Type of transaction. (e.g., Offshore Transfer, Property Purchase)

Date of Transaction: The date and time of the transaction. (e.g., 2022-03-15 14:32:00)

Person Involved: Name or identifier of the person/entity involved. (e.g., Person_1234)

Industry: Industry associated with the transaction. (e.g., Real Estate, Finance)

Destination Country: Country where the money was sent. (e.g., Switzerland)

Reported by Authority: Whether the transaction was reported to authorities. (e.g., True/False)

Source of Money: Origin of the money. (e.g., Legal, Illegal)

Money Laundering Risk Score: Risk score indicating the likelihood of money laundering (1-10). (e.g., 8)

Shell Companies Involved: Number of shell companies used in the transaction. (e.g., 3)

Financial Institution: Bank or financial institution involved in the transaction. (e.g., Bank_567)

Tax Haven Country: Tax haven country to which the money was transferred. (e.g., Cayman Islands)

# Pre-process and clean the dataset as appropriate.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
# from lightgbm import LGBMClassifier  # uncomment to enable embedded_lgbm_selector (the 'lgbm' method)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, chi2, SelectKBest
from sklearn.linear_model import LogisticRegression


## Exploring the data

### Load Data

In [2]:
# Load data
df = pd.read_csv('Big_Black_Money_Dataset.csv')

In [None]:
# View the first few rows
df.head()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Get data types
df.dtypes

In [None]:
df.describe()

## Processing the data

### Handle missing values if applicable

In [7]:
# For numerical features
numerical_features = ['Amount (USD)', 'Money Laundering Risk Score', 'Shell Companies Involved']
imputer = SimpleImputer(strategy='median')
df[numerical_features] = imputer.fit_transform(df[numerical_features])

# For categorical features
categorical_features = ['Country', 'Transaction Type', 'Person Involved', 'Industry',
                        'Destination Country', 'Financial Institution', 'Tax Haven Country']
imputer_cat = SimpleImputer(strategy='most_frequent')
df[categorical_features] = imputer_cat.fit_transform(df[categorical_features])
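
To make the two imputation strategies above concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical values and one missing entry per column
toy = pd.DataFrame({
    "Amount (USD)": [100.0, np.nan, 300.0, 1000.0],
    "Country": pd.Series(["USA", "China", np.nan, "USA"], dtype=object),
})

# Median imputation for the numeric column: median of 100, 300, 1000 is 300
toy[["Amount (USD)"]] = SimpleImputer(strategy="median").fit_transform(toy[["Amount (USD)"]])

# Most-frequent imputation for the categorical column: "USA" appears twice
toy[["Country"]] = SimpleImputer(strategy="most_frequent").fit_transform(toy[["Country"]])

print(toy)
```

The median is less sensitive to extreme transaction amounts than the mean, which is presumably why it was chosen for the numeric columns.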

### Dropping Features and OHE

In [8]:
# Drop Irrelevant Features
df.drop('Transaction ID', axis=1, inplace=True) # Dropped because it is unique for each transaction
# df.drop('Person Involved', axis=1, inplace=True) # Frequency Encoding will be implemented
# df.drop('Financial Institution', axis=1, inplace=True) # Implement Frequency Encoding
df.drop('Date of Transaction', axis=1, inplace=True) # Date of transaction is not relevant

# Convert 'Reported by Authority' to integer
df['Reported by Authority'] = df['Reported by Authority'].astype(int)

# Frequency encoding for 'Financial Institution'
df['Financial Institution'] = df.groupby('Financial Institution')['Financial Institution'].transform('count')

# Frequency encoding for 'Person Involved'
df['Person Involved'] = df.groupby('Person Involved')['Person Involved'].transform('count')

# Encode target variable
le = LabelEncoder()
df['Source of Money'] = le.fit_transform(df['Source of Money'])

# One-Hot Encode nominal categorical features
nominal_features = ['Country', 'Transaction Type', 'Industry',
                    'Destination Country', 'Tax Haven Country']
df = pd.get_dummies(df, columns=nominal_features, drop_first=True)

dummy_columns = df.filter(like='_').columns
df[dummy_columns] = df[dummy_columns].astype(int)
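
The cell above mixes two encodings. As a small illustration on made-up rows (the values are hypothetical; the column names follow the dataset), frequency encoding replaces each category with its row count, while one-hot encoding with `drop_first=True` turns k categories into k-1 indicator columns.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Financial Institution": ["Bank_1", "Bank_2", "Bank_1", "Bank_3", "Bank_1"],
    "Industry": ["Finance", "Real Estate", "Finance", "Oil", "Oil"],
})

# Frequency encoding: each institution is replaced by how often it occurs
toy["Financial Institution"] = (
    toy.groupby("Financial Institution")["Financial Institution"].transform("count")
)

# One-hot encoding: the first category ("Finance") is dropped as the baseline
toy = pd.get_dummies(toy, columns=["Industry"], drop_first=True, dtype=int)

print(toy)
```

Frequency encoding keeps high-cardinality columns such as `Person Involved` as a single numeric feature instead of thousands of indicator columns.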



In [None]:
df.columns

In [None]:
df.head()

In [None]:
df['Source of Money'].value_counts()

In [12]:
features_to_modify = ['Amount (USD)', 'Money Laundering Risk Score', 'Shell Companies Involved']

def scale_features(df, features):
    df_S = df.copy()
    scaler = StandardScaler()
    df_S[features] = scaler.fit_transform(df[features])
    return df_S

def normalize_features(df, features):
    df_N = df.copy()
    scaler = MinMaxScaler()
    df_N[features] = scaler.fit_transform(df[features])
    return df_N
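
The difference between the two helpers can be seen on a single toy column. This is only an illustrative sketch with made-up amounts: `StandardScaler` centers to zero mean and unit variance, while `MinMaxScaler` maps the column onto [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical transaction amounts
amounts = pd.DataFrame({"Amount (USD)": [100.0, 200.0, 300.0, 400.0, 500.0]})

standardized = StandardScaler().fit_transform(amounts).ravel()
normalized = MinMaxScaler().fit_transform(amounts).ravel()

print(standardized.round(3))  # centered on 0 with unit variance
print(normalized)             # smallest value maps to 0, largest to 1
```

Note that both helpers fit the scaler on the whole frame. Fitting on the training split only and then transforming the test split would keep test-set statistics out of the model.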



### Biased data correction

In [None]:
def Undersampling(X,Y, test_size):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=0)
    rus = RandomUnderSampler(random_state=0)
    X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
    return X_resampled, y_resampled, X_test, y_test
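
Conceptually, random undersampling shrinks every class to the size of the smallest one. The pandas-only sketch below illustrates the idea; `undersample_to_minority` is a hypothetical helper written for this example and is not the imbalanced-learn implementation.

```python
import pandas as pd

def undersample_to_minority(X, y, random_state=0):
    """Hypothetical helper: randomly keep n_min rows of every class."""
    n_min = y.value_counts().min()
    parts = [
        y[y == label].sample(n=n_min, random_state=random_state)
        for label in y.unique()
    ]
    keep = pd.concat(parts).index
    return X.loc[keep], y.loc[keep]

# Toy data: 8 rows of class 0 and 2 rows of class 1
X_toy = pd.DataFrame({"x": range(10)})
y_toy = pd.Series([0] * 8 + [1] * 2)

X_bal, y_bal = undersample_to_minority(X_toy, y_toy)
print(y_bal.value_counts())
```

As in `Undersampling` above, resampling should be applied to the training split only, so that the test set keeps the original class balance.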

## Feature Selectors (Optional):

### Feature selector functions

In [13]:
def cor_selector(X, y, num_feats):
    # Rank features by absolute Pearson correlation with the target
    cor_list = []
    feature_name = X.columns.tolist()
    for i in feature_name:
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # Constant columns give a NaN correlation; treat them as uncorrelated
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # Keep the num_feats features with the largest absolute correlation
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature

def chi_squared_selector(X, y, num_feats):
    # chi2 requires non-negative inputs, so rescale every feature to [0, 1] first
    X_norm = MinMaxScaler().fit_transform(X)
    chi_selector = SelectKBest(chi2, k=num_feats)
    chi_selector.fit(X_norm, y)
    chi_support = chi_selector.get_support()
    chi_feature = X.loc[:, chi_support].columns.tolist()
    return chi_support, chi_feature

def rfe_selector(X, y, num_feats):
    # Recursive feature elimination: drop up to 10 features per iteration until num_feats remain
    rfe = RFE(estimator=LogisticRegression(random_state=42), n_features_to_select=num_feats, step=10, verbose=5)
    rfe.fit(X, y)
    rfe_support = rfe.support_
    rfe_feature = X.loc[:, rfe_support].columns.tolist()
    return rfe_support, rfe_feature

def embedded_log_reg_selector(X, y, num_feats):
    # Select features by logistic regression coefficient magnitude (at most num_feats)
    embedded_lr_selector = SelectFromModel(LogisticRegression(penalty="l2", random_state=42), max_features=num_feats)
    embedded_lr_selector.fit(X, y)
    embedded_lr_support = embedded_lr_selector.get_support()
    embedded_lr_feature = X.loc[:, embedded_lr_support].columns.tolist()
    return embedded_lr_support, embedded_lr_feature

def embedded_rf_selector(X, y, num_feats):
    # Select features by random forest importance (at most num_feats)
    embedded_rf_model = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), max_features=num_feats)
    embedded_rf_model.fit(X, y)
    embedded_rf_support = embedded_rf_model.get_support()
    embedded_rf_feature = X.loc[:, embedded_rf_support].columns.tolist()
    return embedded_rf_support, embedded_rf_feature

def embedded_lgbm_selector(X, y, num_feats):
    # Select features by LightGBM importance (at most num_feats)
    # Requires lightgbm; the LGBMClassifier import in the Imports cell is commented out
    lgbc = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2, reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)
    embedded_lgbm_model = SelectFromModel(lgbc, max_features=num_feats)
    embedded_lgbm_model.fit(X, y)
    embedded_lgbm_support = embedded_lgbm_model.get_support()
    embedded_lgbm_feature = X.loc[:, embedded_lgbm_support].columns.tolist()
    return embedded_lgbm_support, embedded_lgbm_feature

### Feature Selectors Combined:

In [14]:
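def autoFeatureSelector(X, y, num_feats, methods=None):
    # Avoid a mutable default argument; with no methods given, no selector runs
    methods = methods or []
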
    support_dict = {}

    feature_name = list(X.columns)
    support_dict['Feature'] = feature_name
    
    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y, num_feats)
        support_dict['Pearson'] = cor_support
    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y, num_feats)
        support_dict['Chi-2'] = chi_support
    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y, num_feats)
        support_dict['RFE'] = rfe_support
    if 'log-reg' in methods:
        embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
        support_dict['Logistics'] = embedded_lr_support
    if 'rf' in methods:
        embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
        support_dict['Random Forest'] = embedded_rf_support
    if 'lgbm' in methods:
        embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
        support_dict['LightGBM'] = embedded_lgbm_support 
    
    # Combine the selector votes and rank features by how many methods selected them
    
    print("Combining all methods")
    feature_selection_df = pd.DataFrame(support_dict)
    feature_selection_df['Total'] = feature_selection_df.apply(lambda row: np.sum(row[1:].astype(int)), axis=1)
    print("Sorting features")
    feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
    feature_selection_df.index = range(1, len(feature_selection_df)+1)
    print("Selecting best features")
    best_features = feature_selection_df['Feature'].tolist()[:num_feats]
    return best_features, feature_selection_df
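
The voting step at the end of `autoFeatureSelector` can be illustrated in isolation. The boolean masks below are made up and stand in for the support lists returned by two selectors; the ranking logic mirrors the function above.

```python
import pandas as pd

# Hypothetical support masks from two selectors
votes = pd.DataFrame({
    "Feature": ["Amount (USD)", "Shell Companies Involved", "Reported by Authority"],
    "Pearson": [True, False, True],
    "Chi-2": [True, True, False],
})

# Count how many methods selected each feature
votes["Total"] = votes[["Pearson", "Chi-2"]].sum(axis=1)

# Sort by vote count, then by feature name, both descending
votes = votes.sort_values(["Total", "Feature"], ascending=False)
best = votes["Feature"].tolist()[:2]
print(best)
```

Because ties are broken by descending feature name, which of two equally voted features makes the cut is arbitrary. It may be worth inspecting `feature_selection_df` rather than relying on `best_features` alone.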

# Models:

- Utilize GridSearchCV to tune the parameters of each of the models.
- Check if better results can be obtained for any of the models.
- Discuss your observations regarding model performance.
- Randomly remove some features (or based on a certain hypothesis) and re-evaluate the models.
- Document your observations concerning model performances.
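
The "randomly remove some features and re-evaluate" step could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the project dataset, and the choice of logistic regression, 5 folds, and 3 dropped features is an assumption made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Randomly pick 3 feature columns to drop
rng = np.random.default_rng(0)
dropped = rng.choice(X_demo.shape[1], size=3, replace=False)
kept = np.setdiff1d(np.arange(X_demo.shape[1]), dropped)

# Compare cross-validated accuracy with and without the dropped features
model = LogisticRegression(max_iter=1000)
full_score = cross_val_score(model, X_demo, y_demo, cv=5).mean()
reduced_score = cross_val_score(model, X_demo[:, kept], y_demo, cv=5).mean()

print(f"All features:     {full_score:.4f}")
print(f"Without {sorted(dropped.tolist())}: {reduced_score:.4f}")
```

Repeating this with several random seeds gives a rough sense of how much each model depends on any particular subset of features.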

## Logistic Regression: Saif, Dwip


### Data for LR

In [15]:
# Get the data for Logistic Regression
df_LR = normalize_features(df, features_to_modify)


### Simple LR Model

In [None]:
# Implement a logistic regression model
X = df_LR.drop('Source of Money', axis=1)
y = df_LR['Source of Money']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

# Fit the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict
y_pred = log_reg.predict(X_test)

# Evaluate
print("Logistic Regression")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

### GridSearchCV LR

In [None]:
# Apply Grid Search CV to find the best parameters
from sklearn.model_selection import GridSearchCV

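# Define the hyperparameters
# The default lbfgs solver does not support the l1 penalty, so use liblinear, which handles both
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'solver': ['liblinear']
}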

# Instantiate GridSearchCV
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters
print(grid_search.best_params_)
print(grid_search.best_score_)
print(grid_search.best_estimator_)
# Get the best model
best_model = grid_search.best_estimator_

# Predict
y_pred = best_model.predict(X_test)

# Evaluate
print("Logistic Regression")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
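
An alternative way to run the same search is to put the scaler and the classifier in one `Pipeline`, so that scaling is refit inside every cross-validation fold. The sketch below uses synthetic data and a shortened `C` grid; the step names `scale` and `clf` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the project data
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(solver="liblinear")),  # liblinear supports l1 and l2
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_demo = {"clf__penalty": ["l1", "l2"], "clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid_demo, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
test_accuracy = search.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```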


## Decision Tree: Nitish, Sehaj

### Data for DT Model

In [32]:
df_DT = df.copy()
# Drop person involved
df_DT.drop('Person Involved', axis=1, inplace=True)

### DT Model

In [None]:
# Splitting the data into features (X) and target (y)
X = df_DT.drop('Source of Money', axis=1)
y = df_DT['Source of Money']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training using DecisionTreeClassifier (depth capped at 10 to limit overfitting)
clf = DecisionTreeClassifier(random_state=42, max_depth=10)
clf.fit(X_train, y_train)

# Predictions and accuracy score
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

In [None]:
feature_importances = clf.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})

# Get top 10 features
top_features = importance_df.sort_values(by='importance', ascending=False).head(10)

# Plot the top 10 features by importance
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=top_features)
plt.title('Top 10 Feature Importances - Optimized Decision Tree')
plt.show()

In [None]:
# Fit a single decision tree to visualize
tree_clf = DecisionTreeClassifier(max_depth=4)  # Limit depth for clarity
tree_clf.fit(X_train, y_train)

# Plot the decision tree
plt.figure(figsize=(20, 10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(tree_clf, feature_names=list(X.columns), class_names=['Illegal', 'Legal'], filled=True)
plt.show()

## Random Forest: Egor, Ash

### Data for RF Model

In [23]:
# Get the data for RF model
df_RF = df.copy()

### RF Model (to be reviewed: X and y differ from the other models)

In [None]:
# Define features (X) and target (y) - binary classification on 'Money Laundering Risk Score'
X_new = df_RF.drop(columns=['Money Laundering Risk Score'])
y_new = (df_RF['Money Laundering Risk Score'] >= 5).astype(int)  # Binary target: 1 if score >= 5, else 0

# Split the dataset into training and testing sets
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.3, random_state=42)

# Initialize and train the RandomForest classifier
rf_clf_new = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf_new.fit(X_train_new, y_train_new)

# Make predictions on the test set
y_pred_new = rf_clf_new.predict(X_test_new)

# Generate the confusion matrix and classification report
conf_matrix_new = confusion_matrix(y_test_new, y_pred_new)
class_report_new = classification_report(y_test_new, y_pred_new)

# Display the results
print("Confusion Matrix:")
print(conf_matrix_new)
print("\nClassification Report:")
print(class_report_new)

In [None]:
# Split the dataset into training and testing sets
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.3, random_state=42)

# Define a simplified parameter grid for GridSearchCV
simple_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False]
}

# Initialize the RandomForest model
rf_clf_simplified = RandomForestClassifier(random_state=42)

# Set up the GridSearchCV
simple_grid_search = GridSearchCV(estimator=rf_clf_simplified,
                                  param_grid=simple_param_grid,
                                  cv=3,  # 3-fold cross-validation
                                  verbose=1,
                                  n_jobs=-1)

# Fit the simplified grid search model
simple_grid_search.fit(X_train_new, y_train_new)

# Best hyperparameters from the grid search
best_params_simplified = simple_grid_search.best_params_

# GridSearchCV refits the best estimator on the full training set by default (refit=True),
# so no further call to fit is needed
best_rf_model_simplified = simple_grid_search.best_estimator_

# Make predictions with the tuned model
y_pred_tuned_simplified = best_rf_model_simplified.predict(X_test_new)

# Generate confusion matrix and classification report for the tuned model
conf_matrix_tuned_simplified = confusion_matrix(y_test_new, y_pred_tuned_simplified)
class_report_tuned_simplified = classification_report(y_test_new, y_pred_tuned_simplified)

# Output best parameters, confusion matrix, and classification report
print("Best Hyperparameters:", best_params_simplified)
print("Confusion Matrix:\n", conf_matrix_tuned_simplified)
print("Classification Report:\n", class_report_tuned_simplified)

## SGD: Devanshi, James, Abraham

In [26]:
# Stochastic Gradient Descent

## SVM: Eric, Moosa

### Data for SVM

In [27]:
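# Standardize the numeric features; SVMs are sensitive to feature scale
df_svm = scale_features(df, features_to_modify)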

In [28]:
X = df_svm.drop(columns=['Source of Money'])
y = df_svm['Source of Money']

### Splitting the data for training and testing

In [29]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### SVM Model Training and Prediction

In [None]:
svc_model = SVC()
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(f"SVC Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred,  zero_division=1))

### GridSearchCV (Hyper Parameter tuning)

In [None]:
# Hyperparameter Tuning using GridSearchCV
# SVM GridSearchCV params
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf'],
    'gamma': [1,0.1,0.01,0.001,0.0001]
}
grid_svm = GridSearchCV(SVC(), param_grid_svm, cv=5)
grid_svm.fit(X_train, y_train)
print(f"Best SVM Parameters: {grid_svm.best_params_}")
print(f"Best SVM Accuracy: {grid_svm.best_score_}")

# Conclusion and comparison
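
One possible way to lay out the comparison is a single table of metrics computed on the same train/test split for every model. The sketch below uses synthetic data and only three of the models; the choice of accuracy and macro F1 as metrics is an assumption, not something fixed by the assignment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in for the project data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit every model on the same split and collect the metrics in one table
rows = []
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "macro_f1": f1_score(y_te, y_hat, average="macro"),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(4))
```

Macro F1 is included because accuracy alone can look strong on imbalanced classes even when the minority class is rarely predicted.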


Present your work, including approach and findings, in class on September 24th or 26th, 2024. Each group will have a maximum of 15 minutes to present their project. It is advised that your PowerPoint file be no longer than 15 slides.

Prepare a written technical report of no more than 15 pages that discusses the problem statement, the steps conducted, a summary of findings, and conclusions. Submit the report and the notebook file (with proper headings, explanatory comments, and code sections) by midnight on September 29th, 2024.