<h1 style="font-family: 'times new roman'; text-align: center;">Task 5: Online Payment Fraud Detection using Decision Tree Classification</h1>

<h2 style="font-family: 'times new roman';">1. Introduction (5 Marks)</h2>

<div style="font-family: 'times new roman';">
Online payment systems have become ubiquitous, facilitating transactions globally. However, this convenience comes with the significant risk of financial fraud. Detecting fraudulent transactions promptly is crucial for financial institutions and customers alike to mitigate losses and maintain trust. This project aims to address this challenge by developing a machine learning model capable of identifying potentially fraudulent online payment activities.
<br><br>
We will utilize the "Online Payment Fraud Detection" dataset, a synthetic dataset generated using the PaySim simulator, which mimics real-world mobile money transactions. The primary goal is to build and evaluate a classification model, specifically a Decision Tree, to distinguish between legitimate (non-fraudulent) and fraudulent transactions based on various transactional features. The success of this model will be measured by its ability to accurately predict the 'isFraud' label, thereby providing a tool to flag suspicious activities for further investigation.
</div>

<h2 style="font-family: 'times new roman';">2. Introduction to the Chosen Technique: Decision Tree (5 Marks)</h2>

<div style="font-family: 'times new roman';">
For this classification task, we have chosen the <b>Decision Tree</b> algorithm. A Decision Tree is a supervised machine learning technique used for both classification and regression problems, although it is more widely used for classification. It works by recursively partitioning the dataset into smaller subsets based on the values of the input features.
<br><br>
The structure resembles a flowchart, where:
<ul>
    <li>Each internal node represents a test on an attribute (e.g., 'Is transaction amount > $1000?').</li>
    <li>Each branch represents the outcome of the test (e.g., 'Yes' or 'No').</li>
    <li>Each leaf node represents a class label, the decision reached after applying every test along the path from the root. In our case the label is 'Fraudulent' (1) or 'Legitimate' (0).</li>
</ul>
The paths from the root node to the leaf nodes represent classification rules. The algorithm selects the best feature to split the data at each node using criteria like Gini impurity or information gain (entropy), aiming to create subsets that are as pure as possible in terms of the target class.
<br><br>
<b>Why Decision Tree?</b>
<ul>
    <li><b>Interpretability:</b> Decision Trees are often called 'white-box' models because their decision-making process is easy to visualize and understand. The generated rules can be explicitly stated and followed.</li>
    <li><b>Handling Data Types:</b> Trees split on feature thresholds, so numerical features need no normalization or scaling. Categorical features such as <code>type</code> must still be encoded numerically before training in most implementations.</li>
    <li><b>Non-Linearity:</b> They can capture non-linear relationships between features and the target variable.</li>
    <li><b>Feature Importance:</b> Decision trees inherently provide a measure of feature importance, helping to understand which transaction characteristics are most indicative of fraud.</li>
</ul>
While powerful, Decision Trees can be prone to overfitting, especially with deep trees. Techniques like pruning or setting constraints on tree depth will be considered during implementation to mitigate this risk.
</div>
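
<div style="font-family: 'times new roman';">
To make the split criteria mentioned above concrete, the following is a minimal sketch of Gini impurity, entropy, and the impurity decrease of a single threshold split. It is an illustration only, not the implementation used later in this project. The helper names (<code>gini</code>, <code>entropy</code>, <code>split_gain</code>, <code>best_threshold</code>) and the six toy transactions are assumptions invented for this example.
</div>

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(values, labels, threshold, impurity=gini):
    """Impurity decrease when splitting on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted


def best_threshold(values, labels, impurity=gini):
    """Try midpoints between sorted distinct values and keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: split_gain(values, labels, t, impurity))


# Hypothetical toy data: transaction amounts and fraud labels (1 = fraud).
amounts = [50, 120, 900, 1500, 2500, 4000]
is_fraud = [0, 0, 0, 0, 1, 1]

print(round(split_gain(amounts, is_fraud, 1000), 3))  # 0.222
print(round(split_gain(amounts, is_fraud, 2000), 3))  # 0.444
print(best_threshold(amounts, is_fraud))              # 2000.0
```

<div style="font-family: 'times new roman';">
In this toy example the split at 2000 leaves both child nodes pure, so it removes all of the parent's impurity and is preferred over the split at 1000. A full tree learner repeats this search over every feature at every node.
</div>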

<h2 style="font-family: 'times new roman';">3. Introduction of the Dataset (5 Marks)</h2>

<div style="font-family: 'times new roman';">
The dataset used in this project is the "Online Payment Fraud Detection" dataset, sourced from Kaggle. It contains synthetic transaction data generated by the PaySim simulator. PaySim is based on a sample of real mobile money transactions and is designed to replicate normal payment operations while also injecting fraudulent activities.
<br><br>
Key characteristics of the dataset include:
<ul>
    <li><b>Data Size:</b> The dataset contains a large volume of transactions (on the order of millions of rows), simulating real-world data scale.</li>
    <li><b>Features:</b> The dataset includes several attributes for each transaction:
        <ul>
            <li><code>step</code>: Represents a unit of time in the real world (1 step is 1 hour).</li>
            <li><code>type</code>: Type of transaction (<code>CASH_IN</code>, <code>CASH_OUT</code>, <code>DEBIT</code>, <code>PAYMENT</code>, or <code>TRANSFER</code>).</li>
            <li><code>amount</code>: The amount of the transaction.</li>
            <li><code>nameOrig</code>: Customer who initiated the transaction.</li>
            <li><code>oldbalanceOrg</code>: Balance before the transaction for the originator.</li>
            <li><code>newbalanceOrig</code>: Balance after the transaction for the originator.</li>
            <li><code>nameDest</code>: Recipient of the transaction.</li>
            <li><code>oldbalanceDest</code>: Balance before the transaction for the recipient.</li>
            <li><code>newbalanceDest</code>: Balance after the transaction for the recipient.</li>
            <li><code>isFlaggedFraud</code>: An indicator set by the simulation system for illegal attempts (transferring more than 200,000 in a single transaction). This is a rule-based flag from the simulator, not a perfect fraud label.</li>
        </ul>
    </li>
    <li><b>Target Variable:</b>
        <ul>
            <li><code>isFraud</code>: This is the ground truth label, indicating whether the transaction is actually fraudulent (1) or legitimate (0). This is the variable our model will predict.</li>
        </ul>
    </li>
    <li><b>Data Imbalance:</b> As is common in fraud detection, the dataset is highly imbalanced: fraudulent transactions make up a very small percentage of the total. Accuracy alone is therefore misleading, because a model that labels every transaction as legitimate would score high accuracy while catching no fraud. The imbalance requires careful handling, for example evaluation with Precision, Recall, and F1-score, or techniques such as resampling.</li>
</ul>
Understanding these features and the inherent imbalance is crucial for effective data preprocessing, model training, and interpretation of results.
</div>
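
<div style="font-family: 'times new roman';">
The effect of class imbalance on evaluation can be sketched with a small example. The label counts below are hypothetical (1,000 transactions with 5 frauds, chosen only for illustration and not measured from the dataset), and the helper names are assumptions. The sketch compares a baseline that always predicts 'Legitimate' with an imagined detector that catches 4 of the 5 frauds at the cost of 6 false alarms.
</div>

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the fraud class; 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 5 frauds followed by 995 legitimate transactions.
y_true = [1] * 5 + [0] * 995

# Baseline: predict 'Legitimate' for everything.
always_legit = [0] * 1000

# Imagined detector: catches 4 of 5 frauds, raises 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 989

print(accuracy(y_true, always_legit))             # 0.995
print(precision_recall_f1(y_true, always_legit))  # (0.0, 0.0, 0.0)
print(accuracy(y_true, detector))                 # 0.993
print(precision_recall_f1(y_true, detector))      # precision 0.4, recall 0.8
```

<div style="font-family: 'times new roman';">
The baseline has the higher accuracy even though it detects nothing, which is why recall and precision on the fraud class are the more informative measures here.
</div>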

<!-- <h2 style="font-family: 'times new roman';">4. Input Encoding / Input Representation (5 Marks)</h2>

*(Content for this section will involve loading the data, analyzing feature types, and describing/implementing encoding strategies like One-Hot Encoding for categorical features like 'type'.)*

<h2 style="font-family: 'times new roman';">5. Coding for the Implementation with Comments (10 Marks)</h2>

*(Code cells for data loading, preprocessing, splitting, Decision Tree model training, prediction, and evaluation will go here, accompanied by explanatory comments.)*

<h2 style="font-family: 'times new roman';">6. Analysis of Results and Comments (10 Marks)</h2>

*(This section will discuss the model's performance based on evaluation metrics, analyze the confusion matrix, potentially visualize the tree or feature importances, and comment on the findings and limitations.)* -->