# Assignment 1 - Binary Classification Evaluation Metrics

**Objective:**
The objective of this assignment is to assess your understanding of fundamental concepts in model evaluation for machine learning tasks. This assignment covers topics discussed in the first half of the course, including key evaluation metrics, confusion matrices, ROC curves, and Precision-Recall curves.

**Instructions:**

1. Theory Questions:
Answer the following theoretical questions:

    1. Explain the limitations of accuracy as an evaluation metric in imbalanced datasets. How does accuracy behave when classes are heavily skewed, and why might it provide misleading results?
    2. Describe the purpose and interpretation of a confusion matrix. How does it help in assessing a classification model's performance?
    3. Explain the concept of ROC curves. What does each point on an ROC curve represent? How is the area under the ROC curve (AUC-ROC) calculated?
    4. Compare and contrast the advantages and disadvantages of ROC curves and Precision-Recall curves. In what scenarios would you prefer to use one over the other, and why?

2. Practical Exercises:
* Implement Python code to calculate log loss for a given binary classification problem.
* Select the best evaluation metric for an applied scenario.

**Submission Guidelines:**
* Submit your responses to the theory questions in neatly organized Markdown cells.
* Include your Python code for the practical exercises.
* Submit your assignment as a single `.ipynb` file named `MY NAME Assignment 1 - Log Loss` via the course submission platform (Slack).

## Part 1: Theory Questions (20 points)
Provide your answers here:

    1.
    2.
    3.
    4.

## Practicing Log Loss (25 points)

**Objective:**
The objective of this assignment is to deepen your understanding of log loss, also known as logarithmic loss or cross-entropy loss, and its application in evaluating the performance of classification models.

**Instructions:**
In this assignment, you will be given a set of binary classification predictions along with their corresponding actual class labels. Your task is to calculate the log loss for each prediction and then analyze the overall log loss performance of the model.

**Dataset:**
You are provided with a dataset containing the following information:

* Predicted probabilities for the positive class (ranging from 0 to 1) for a set of instances.
* Actual binary class labels (0 or 1) indicating whether each instance belongs to the positive class.

**Assignment Tasks:**
1. Calculate the log loss for each instance in the dataset using the predicted probabilities and actual class labels.
2. Average the individual log losses to compute the model's overall log loss.
3. Interpret the overall log loss value and analyze the model's performance. Discuss any insights or observations derived from the log loss analysis.
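
As a reminder of the definition: for a true label `y` (0 or 1) and a predicted positive-class probability `p`, the per-instance log loss is `-(y * ln(p) + (1 - y) * ln(1 - p))`, and the overall log loss is the mean over all instances. The sketch below illustrates this on hypothetical values, not on the assignment dataset. The function name `binary_log_loss` and the clipping constant `eps` are illustrative assumptions, not part of the assignment.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Log loss (binary cross-entropy) for one prediction.

    p_pred is clipped to [eps, 1 - eps] so that log(0) is never evaluated.
    """
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Hypothetical labels and probabilities, chosen only for illustration
example_labels = [1, 0, 1]
example_probs = [0.8, 0.2, 0.4]

# Per-instance losses, then their mean as the overall log loss
losses = [binary_log_loss(y, p) for y, p in zip(example_labels, example_probs)]
mean_loss = sum(losses) / len(losses)

for y, p, loss in zip(example_labels, example_probs, losses):
    print(f"label={y} prob={p} loss={loss:.4f}")
print(f"mean log loss = {mean_loss:.4f}")
```

In this toy example the third prediction assigns only 0.4 to the true class, so it contributes the largest loss: log loss penalizes predictions more heavily the further the predicted probability is from the true label.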


**Dataset:**

| Instance | Predicted Probability | Actual Label |
|----------|------------------------|--------------|
|    1     |          0.9           |       1      |
|    2     |          0.3           |       0      |
|    3     |          0.6           |       1      |
|    4     |          0.8           |       0      |
|    5     |          0.1           |       1      |


**Grading Criteria:**

* Correctness of log loss calculations.
* Clarity and completeness of the analysis.
* Insights derived from the log loss interpretation.
* Overall presentation and adherence to submission guidelines.

In [None]:
import pandas as pd

# Create a DataFrame with the dataset
data = {
    'Instance': [1, 2, 3, 4, 5],
    'Predicted Probability': [0.9, 0.3, 0.6, 0.8, 0.1],
    'Actual Label': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

   Instance  Predicted Probability  Actual Label
0         1                    0.9             1
1         2                    0.3             0
2         3                    0.6             1
3         4                    0.8             0
4         5                    0.1             1


In [None]:
# YOUR CODE HERE

*Question: Interpret the log loss above. How would it change if the predicted probability for Instance 1 (row index 0) changed from 0.9 to 0.6? Why?*

*Your answer:*

*Question: Why might you select log loss over precision, recall, or accuracy (in the context of any problem, not this one specifically)?*

*Your answer:*

## Application Scenario: Select a Metric (55 points)

**Application Scenario: Fraud Detection System**

You are working as a data scientist for a financial institution that wants to develop a fraud detection system to identify potentially fraudulent transactions. The dataset contains information about various transactions, including transaction amount, merchant ID, and transaction type. Your task is to build a machine learning model to classify transactions as either fraudulent or non-fraudulent.

**Problem Description:**

* Dataset: The dataset consists of historical transaction data, with labels indicating whether each transaction was fraudulent or not.
* Class Distribution: The dataset is imbalanced: most transactions are legitimate, and only a small percentage are fraudulent.
* Objective: The objective is to develop a fraud detection model that minimizes false negatives (fraudulent transactions incorrectly classified as non-fraudulent) while maintaining a reasonable level of precision.

**Stakeholder Requirements:**
Minimizing false negatives (missed fraudulent transactions) is of utmost importance, so recall (sensitivity) should be prioritized to detect as many fraudulent transactions as possible. Precision still matters, however: false positives should be kept low to avoid unnecessary investigations of legitimate transactions.

**Task:**
Your task is to develop Python code to evaluate the performance of different machine learning models using various evaluation metrics, including accuracy, precision, recall, and F2 score. *Select the evaluation metric that best suits the problem and explain your choice*.
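
The metrics named above can all be derived from the four confusion-matrix counts. The following is a minimal sketch on hypothetical labels and predictions (not the assignment data); the helper names `confusion_counts` and `fbeta` are illustrative assumptions. The F-beta score is computed as `(1 + beta^2) * precision * recall / (beta^2 * precision + recall)`, so `beta=2` weights recall more heavily than precision.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def fbeta(precision, recall, beta):
    """F-beta score; returns 0.0 when precision and recall are both zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical ground truth and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = fbeta(precision, recall, beta=1)
f2 = fbeta(precision, recall, beta=2)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"F1={f1:.3f} F2={f2:.3f}")
```

In practice, scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `fbeta_score` (with `beta=2`) compute the same quantities; the hand-written version here is only meant to make the definitions explicit.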

**Additional Guidelines:**
* You should preprocess the dataset as needed and split it into training and testing sets.
* Implement machine learning models of your choice (e.g., logistic regression, random forest) and evaluate their performance.
* Use appropriate evaluation metrics for binary classification tasks.
* Discuss the rationale behind your choice of evaluation metric and how it aligns with the problem requirements.
* Present your findings and recommendations for selecting the best model based on the chosen evaluation metric.
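
One possible way to approach the preprocessing and splitting guideline is sketched below with pandas only. It assumes a small toy frame (`toy`, a hypothetical name) with the same column names as the dataset sample, one-hot encodes the categorical columns with `pd.get_dummies`, and draws a test set that preserves the class ratio using `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The split fraction and random seed are arbitrary choices for illustration.

```python
import pandas as pd

# Toy frame with the same columns as the dataset sample (illustration only)
toy = pd.DataFrame({
    "Transaction Amount": [1000, 500, 2000, 1500, 800, 3000, 1200, 700],
    "Merchant ID": ["M123", "M456", "M789", "M123", "M456", "M789", "M123", "M456"],
    "Transaction Type": [
        "Online Purchase", "ATM Withdrawal", "Online Purchase", "POS Transaction",
        "Online Purchase", "ATM Withdrawal", "Online Purchase", "ATM Withdrawal",
    ],
    "Fraudulent": [0, 0, 1, 0, 0, 1, 0, 0],
})

# One-hot encode the categorical columns so that models receive numeric input
encoded = pd.get_dummies(toy, columns=["Merchant ID", "Transaction Type"])

# Stratified split: sample the same fraction from each class for the test set
test = encoded.groupby("Fraudulent").sample(frac=0.5, random_state=0)
train = encoded.drop(test.index)

X_train, y_train = train.drop(columns="Fraudulent"), train["Fraudulent"]
X_test, y_test = test.drop(columns="Fraudulent"), test["Fraudulent"]

print(y_train.value_counts())
print(y_test.value_counts())
```

A common alternative is scikit-learn's `train_test_split` with the `stratify` argument, which achieves the same class-preserving split in one call. Stratifying matters for imbalanced data because a purely random split could leave the test set with very few or no fraudulent cases.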

**Dataset Sample:**

| Transaction ID | Transaction Amount | Merchant ID | Transaction Type | Fraudulent |
|----------------|--------------------|-------------|------------------|------------|
| 1              | 1000               | M123        | Online Purchase  | 0          |
| 2              | 500                | M456        | ATM Withdrawal   | 0          |
| 3              | 2000               | M789        | Online Purchase  | 1          |
| 4              | 1500               | M123        | POS Transaction  | 0          |
| 5              | 800                | M456        | Online Purchase  | 0          |
| 6              | 3000               | M789        | ATM Withdrawal   | 1          |

* Transaction ID: Unique identifier for each transaction.
* Transaction Amount: The amount of money involved in the transaction.
* Merchant ID: Identifier for the merchant involved in the transaction.
* Transaction Type: The type of transaction (e.g., online purchase, ATM withdrawal, POS transaction).
* Fraudulent: Binary indicator (0 or 1) specifying whether the transaction is fraudulent (1) or not (0).

In [None]:
import pandas as pd

# Creating the dataset
data = {
    'Transaction ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                       11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                       21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                       31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
    'Transaction Amount': [1000, 500, 2000, 1500, 800, 3000, 1200, 700, 1800, 1300,
                           900, 400, 2200, 1600, 850, 2800, 1100, 600, 1900, 1400,
                           950, 300, 2100, 1700, 820, 3200, 1250, 720, 1850, 1350,
                           880, 420, 2400, 1750, 830, 3100, 1150, 620, 1950, 1450],
    'Merchant ID': ['M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123',
                    'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456',
                    'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123',
                    'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456'],
    'Transaction Type': ['Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction', 'Online Purchase',
                         'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase',
                         'POS Transaction', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction', 'Online Purchase',
                         'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase',
                         'POS Transaction', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal'],
    'Fraudulent': [0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
                   0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
                   1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
                   0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
}

# Creating DataFrame
df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)


    Transaction ID  Transaction Amount Merchant ID Transaction Type  \
0                1                1000        M123  Online Purchase   
1                2                 500        M456   ATM Withdrawal   
2                3                2000        M789  Online Purchase   
3                4                1500        M123  POS Transaction   
4                5                 800        M456  Online Purchase   
5                6                3000        M789   ATM Withdrawal   
6                7                1200        M123  Online Purchase   
7                8                 700        M456   ATM Withdrawal   
8                9                1800        M789  Online Purchase   
9               10                1300        M123  POS Transaction   
10              11                 900        M456  Online Purchase   
11              12                 400        M789   ATM Withdrawal   
12              13                2200        M123  Online Purchase   
13    

In [None]:
# YOUR CODE HERE