# Data Management and Mining - Group Assessment

### Assessment Instructions: Analysing, Processing, and Modeling Financial Transaction Data

Your team has been given access to a dataset containing financial transaction data. Your goal is to analyze, process, and model this data to identify patterns, trends, and potential fraud. Once completed, you will need to submit this notebook and an accompanying model file for assessment (see Assessment Brief Section 2 Part 1). After the initial submission, groups should focus on improving the model and the analysis, then resubmit the updated notebook and model file (see Assessment Brief Section 2 Part 2). The final submission should also include a short reflective summary outlining the changes made and the reasons for them (see Assessment Brief Section 2 Part 2).

### Assessment Dataset

The dataset provided for the assessment contains a sample of financial transactions made by customers. Its features, described below, mainly cover the transaction itself and the customer involved. The dataset also contains a binary target variable called 'Is.Fraudulent', which indicates whether the transaction is fraudulent. The goal of the assessment is to build a model that predicts whether a transaction is fraudulent, based on the features provided in the dataset and any additional features that you create.

| Variable           | Data Type | Description                                              |
|--------------------|-----------|----------------------------------------------------------|
| Transaction.Date   | object    | The date of the transaction.                             |
| Transaction.Amount | float64   | The amount of money involved in the transaction.         |
| Customer.Age       | int64     | The age of the customer.                                 |
| Account.Age.Days   | int64     | The number of days since the account was opened.         |
| Transaction.Hour   | int64     | The hour of the day at which the transaction took place. |
| source             | object    | The source of the transaction.                           |
| browser            | object    | The browser used for the transaction.                    |
| Payment.Method     | object    | The payment method used.                                 |
| Product.Category   | object    | The category of the product purchased.                   |
| Quantity           | int64     | The number of units purchased.                           |
| Device.Used        | object    | The device used to make the transaction.                 |
| Is.Fraudulent      | int64     | A flag indicating whether the transaction is fraudulent. |
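
Because 'Is.Fraudulent' is a binary target, it can be useful to check how the two classes are distributed before modelling. The snippet below is a minimal sketch of that check on a small hypothetical stand-in column (`toy` is not the assessment data); the same `value_counts(normalize=True)` call could be applied to the real column once the CSV is loaded.

```python
import pandas as pd

# Hypothetical stand-in for the 'Is.Fraudulent' column (1 fraud in 10 rows).
toy = pd.DataFrame({"Is.Fraudulent": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Share of each class; a small share for class 1 signals class imbalance.
class_share = toy["Is.Fraudulent"].value_counts(normalize=True).sort_index()
print(class_share)
```

A heavily skewed share suggests that accuracy alone would be a misleading metric, which is consistent with the assessment scoring on F1.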


### Assessment Template 

The template provided below is a guide to help you structure your analysis. You can add additional code and text cells to the template as needed. Some cells and code in the template must not be modified, because they are required for the assessment to be marked successfully. These cells and code are clearly labeled. Please ensure you save a copy of the template to your local machine before you start your analysis.

### Assessment Submission

In [55]:
# import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score

# load the data
df = pd.read_csv('student_dataset.csv')

# summarise column dtypes and non-null counts
# (uncomment df.head() to preview the first 5 rows)
# df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240000 entries, 0 to 239999
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Transaction.Date    240000 non-null  object 
 1   Transaction.Amount  240000 non-null  float64
 2   Customer.Age        240000 non-null  int64  
 3   Account.Age.Days    240000 non-null  int64  
 4   Transaction.Hour    240000 non-null  int64  
 5   source              240000 non-null  object 
 6   browser             240000 non-null  object 
 7   Payment.Method      240000 non-null  object 
 8   Product.Category    240000 non-null  object 
 9   Quantity            240000 non-null  int64  
 10  Device.Used         240000 non-null  object 
 11  Is.Fraudulent       240000 non-null  int64  
dtypes: float64(1), int64(5), object(6)
memory usage: 22.0+ MB


In [36]:
# continue to analyze the data

# Data Preprocessing Function
def preprocess_data(df_raw):
    """Preprocess the financial transaction data"""
    df = df_raw.copy()
    
    # Convert Transaction.Date to datetime with mixed format handling
    df['Transaction.Date'] = pd.to_datetime(df['Transaction.Date'], format='mixed', errors='coerce')
    df['Transaction_Day'] = df['Transaction.Date'].dt.day
    df['Transaction_Month'] = df['Transaction.Date'].dt.month
    df['Transaction_Year'] = df['Transaction.Date'].dt.year
    
    # Handle missing values in date-derived columns
    df['Transaction_Day'] = df['Transaction_Day'].fillna(df['Transaction_Day'].median())
    df['Transaction_Month'] = df['Transaction_Month'].fillna(df['Transaction_Month'].median())
    df['Transaction_Year'] = df['Transaction_Year'].fillna(df['Transaction_Year'].median())
    
    # Handle missing values in numeric columns
    numeric_columns = ['Transaction.Amount', 'Customer.Age', 'Account.Age.Days', 
                      'Transaction.Hour', 'Quantity']
    for col in numeric_columns:
        df[col] = df[col].fillna(df[col].median())
    
    # Encode categorical variables
    categorical_columns = ['source', 'browser', 'Payment.Method', 
                         'Product.Category', 'Device.Used']
    df_encoded = pd.get_dummies(df[categorical_columns], drop_first=True)
    
    # Combine features
    numeric_features = df[numeric_columns + ['Transaction_Day', 'Transaction_Month', 'Transaction_Year']]
    df_processed = pd.concat([numeric_features, df_encoded], axis=1)
    df_processed['Is.Fraudulent'] = df['Is.Fraudulent']
    
    return df_processed

# Preprocess the data
df_processed = preprocess_data(df)
print("\nProcessed Data Shape:", df_processed.shape)
print("\nProcessed Data Columns:", df_processed.columns.tolist())

# Split features and target
X = df_processed.drop('Is.Fraudulent', axis=1)
y = df_processed['Is.Fraudulent']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions and evaluate
y_pred = rf_model.predict(X_test_scaled)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")


Processed Data Shape: (240000, 24)

Processed Data Columns: ['Transaction.Amount', 'Customer.Age', 'Account.Age.Days', 'Transaction.Hour', 'Quantity', 'Transaction_Day', 'Transaction_Month', 'Transaction_Year', 'source_Direct', 'source_SEO', 'browser_FireFox', 'browser_IE', 'browser_Opera', 'browser_Safari', 'Payment.Method_bank transfer', 'Payment.Method_credit card', 'Payment.Method_debit card', 'Product.Category_electronics', 'Product.Category_health & beauty', 'Product.Category_home & garden', 'Product.Category_toys & games', 'Device.Used_mobile', 'Device.Used_tablet', 'Is.Fraudulent']

Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98     44544
           1       0.96      0.40      0.57      3456

    accuracy                           0.96     48000
   macro avg       0.96      0.70      0.77     48000
weighted avg       0.96      0.96      0.95     48000

F1 Score: 0.565
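
The report above shows a recall of 0.40 for class 1, so many fraudulent transactions are missed at the default decision threshold. Two common options worth trying are class weighting and a lower probability threshold. The sketch below illustrates both on synthetic data from `make_classification`; the data, the variable names, and the 0.3 threshold are illustrative assumptions, not part of the assessment template, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 7% positives); not the assessment dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
forest = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
forest.fit(X_tr, y_tr)

# Probability of the positive (fraud-like) class for each test row.
proba = forest.predict_proba(X_te)[:, 1]

# Lowering the threshold can only keep or add positive predictions, so recall cannot fall.
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba >= 0.3).astype(int))  # 0.3 is an illustrative value
print(f"recall @0.5: {recall_default:.3f}, recall @0.3: {recall_lowered:.3f}")
```

A lower threshold usually costs precision, so the threshold is best chosen on a validation split by comparing F1 scores rather than fixed in advance.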


Once you have completed your analysis and are ready to submit the assessment, you should export the trained model file (**only one model will be accepted**). The model file should be saved as a pickle file (.pkl) in the same directory as the notebook. Once you have saved the model file, upload both the notebook and the model file to the assessment submission portal. Please ensure you provide the model file name as a variable, as in the example below.

In [46]:
## Do not delete this cell ##

# export the model with pickle
import pickle

# save the model to disk

# define the filename, it should have a .pkl extension
filename = 'log_reg_model.pkl' # replace 'log_reg_model' with the name of your model variable

# save the model to the current directory
with open(filename, "wb") as f:
    pickle.dump(rf_model, f) # rf_model is the trained model variable from the cell above
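
Before uploading, it may be worth confirming that the pickle file restores a model that predicts identically to the one in memory. The following is a minimal sketch of such a round-trip check on a small synthetic model; `demo_model.pkl`, `X_demo`, and `demo_model` are hypothetical names used only for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem; stands in for the real training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write the model to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(demo_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(demo_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(demo_model.predict(X_demo), restored_model.predict(X_demo))
print("round trip ok:", round_trip_ok)
```

Note that a pickle file is generally only reliable when loaded with the same scikit-learn version that created it.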


### Assessment Evaluation

This is required for the assessment to be marked. Groups should specify any data processing steps that are required to run the model in the cell below. This may include the installation of additional libraries, loading of the data, and any additional processing steps required to run the model. The model should be saved to a file called 'model.pkl' in the same directory as the notebook. The model file should be loaded and tested in the cell below to ensure it runs correctly. The model should be loaded and tested using the following code:

In [61]:
## Do not delete this cell ##

# load the evaluation data
import pandas as pd

# load the raw data
df_eval_raw = pd.read_csv('evaluation_dataset.csv')


In [63]:
# groups should add the necessary preprocessing steps to prepare the data for evaluation below 



# the final dataset should be saved in a DataFrame called df_eval
df_eval = preprocess_data(df_eval_raw)
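
One thing to watch when reusing `preprocess_data` on a new file is that `pd.get_dummies` builds its columns from the categories present in that file, so the evaluation frame may not have the same columns, in the same order, as the training frame. A minimal sketch of aligning the two with `DataFrame.reindex` follows; the toy frames and names (`train_raw`, `eval_raw`) are hypothetical.

```python
import pandas as pd

# Toy frames: the evaluation data lacks 'IE' and 'Safari' and has an unseen 'Opera'.
train_raw = pd.DataFrame({"browser": ["Chrome", "IE", "Safari"]})
eval_raw = pd.DataFrame({"browser": ["Chrome", "Opera"]})

train_enc = pd.get_dummies(train_raw, drop_first=True)  # browser_IE, browser_Safari
eval_enc = pd.get_dummies(eval_raw, drop_first=True)    # browser_Opera

# Force the evaluation frame onto the training columns:
# missing columns are added as 0, unseen columns are dropped.
eval_aligned = eval_enc.reindex(columns=train_enc.columns, fill_value=0)
print(eval_aligned.columns.tolist())
```

In the notebook, the training column list would be `X.columns`, assuming the training cell has been run in the same session.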

In [65]:
## Do not delete this cell ##

# Load the model and evaluate it on the evaluation data 

# load the pickle model
with open(filename, "rb") as f:
    eval_model = pickle.load(f) # do not change the name of the model variable

# test the model on the evaluation data
y_eval = eval_model.predict(df_eval.drop('Is.Fraudulent', axis=1))

# calculate the f1 score
f1_eval = f1_score(df_eval['Is.Fraudulent'], y_eval)

# print the f1 score
print(f'F1 Score: {f1_eval:.3f}')

F1 Score: 0.081
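
The evaluation F1 (0.081) is far below the hold-out F1 (0.565). One plausible contributor is that `rf_model` was fit on `StandardScaler` output (`X_train_scaled`), whereas the evaluation cell passes the unscaled columns of `df_eval` to `predict`. A common way to avoid this mismatch is to bundle the scaler and the classifier in a single scikit-learn `Pipeline` and pickle that object, so scaling is applied automatically at prediction time. The sketch below demonstrates the idea on synthetic data; all names and values are illustrative assumptions, and whether a pipeline object is accepted as the submitted model should be checked against the assessment brief.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features far from zero mean, so skipping the scaler matters.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(500, 3))
y_demo = (X_demo[:, 0] > 100.0).astype(int)

# Scaler and model travel together; fit() and predict() both take raw features.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

# Manual equivalent: scale first, then predict with the inner forest.
scaler_step = pipe.named_steps["standardscaler"]
forest_step = pipe.named_steps["randomforestclassifier"]
manual_pred = forest_step.predict(scaler_step.transform(X_demo))

# What happens when the scaling step is skipped at prediction time.
unscaled_pred = forest_step.predict(X_demo)

acc_pipe = (pipe_pred == y_demo).mean()
acc_unscaled = (unscaled_pred == y_demo).mean()
print(f"with scaling: {acc_pipe:.3f}, without scaling: {acc_unscaled:.3f}")
```

With a pipeline, the evaluation cell's call to `eval_model.predict(...)` would receive unscaled features by design, so no separate scaling step would need to be remembered.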


