## Importing libraries:
---

In [None]:
import os
import zipfile
import pandas as pd
import requests

DATA_DIR = "../raw_data/"
ZIP_PATH = os.path.join(DATA_DIR, "creditcardfraud.zip")
CSV_PATH = os.path.join(DATA_DIR, "creditcard.csv")

In [None]:
os.makedirs(DATA_DIR, exist_ok=True)

if not os.path.exists(ZIP_PATH):
    url = "https://www.kaggle.com/api/v1/datasets/download/mlg-ulb/creditcardfraud"
    # A timeout keeps the cell from hanging forever if the server never responds
    response = requests.get(url, stream=True, timeout=60)
    if response.status_code == 200:
        with open(ZIP_PATH, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print("Successfully downloaded.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")

if not os.path.exists(CSV_PATH) and os.path.exists(ZIP_PATH):
    with zipfile.ZipFile(ZIP_PATH, "r") as zip_ref:
        zip_ref.extractall(DATA_DIR)
    print("Data extracted successfully.")

if os.path.exists(CSV_PATH):
    df = pd.read_csv(CSV_PATH)
    print("DataFrame loaded successfully.")
else:
    print("CSV file not found.")

The following diagram helps choose an evaluation metric suited to the problem being solved, here imbalanced classification:

![metrics.png](https://machinelearningmastery.com/wp-content/uploads/2019/12/How-to-Choose-a-Metric-for-Imbalanced-Classification-latest.png)

Source: [MachineLearningMastery](https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/)

In [None]:
df.head() # Display the first few rows of the DataFrame

### Stratified splitting:
---

In [None]:
# Split the DataFrame into features and target variable
# 'Class' is the target variable indicating fraud (1) or not fraud (0)
# The rest of the columns are features used for prediction
X = df.drop('Class', axis=1)
y = df['Class']

In [None]:
from sklearn.model_selection import train_test_split

# stratify=y keeps the fraud / non-fraud ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
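
To see what `stratify=y` does, the sketch below checks that the positive-class rate is the same in the full data, the training set, and the test set. It runs on synthetic data rather than on `df`: the names `X_demo` and `y_demo` and the 1% positive rate are assumptions made for illustration only, not properties of the credit card dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: 1% positives, 3 random features)
rng = np.random.default_rng(42)
y_demo = np.array([1] * 100 + [0] * 9900)
X_demo = rng.normal(size=(y_demo.size, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=42
)

# With stratification the positive rate should match across all three sets
full_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"full: {full_rate:.4f}  train: {train_rate:.4f}  test: {test_rate:.4f}")
```

The same check can be run on the real split by comparing `y.mean()`, `y_train.mean()`, and `y_test.mean()`.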

### Oversample/undersample before or after splitting data?
---
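Main rule: **always** after. Resample only the training set, so that the test set keeps the original class distribution and contains no duplicated or synthetic samples derived from the training data.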
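
As a minimal sketch of this rule, the example below splits first and then randomly oversamples the minority class in the training set only, using `sklearn.utils.resample`. It uses synthetic data: the names `X_demo`, `y_demo`, `X_tr_bal`, and `y_tr_bal` and the 5% positive rate are illustrative assumptions, and random oversampling is just one possible resampling technique, not necessarily the one used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data (assumption: 5% positives, 4 random features)
rng = np.random.default_rng(0)
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = rng.normal(size=(y_demo.size, 4))

# 1) Split first, so the test set is never touched by resampling
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=42
)

# 2) Oversample the minority class within the training set only
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

print("train before:", np.bincount(y_tr))
print("train after: ", np.bincount(y_tr_bal))
print("test (unchanged):", np.bincount(y_te))
```

If resampling were done before the split, copies of the same minority row could land in both the training and the test set, which would make the test scores look better than they really are.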