**Project Overview**

Credit card fraud is a major financial challenge, costing banks and customers billions each year. Fraudulent transactions are rare compared to normal ones, making them difficult to detect, and fraudsters constantly change their strategies to avoid being caught. The goal of this project is to build a machine learning model that can identify potentially fraudulent transactions in real time.

Using the Kaggle Credit Card Fraud Detection dataset, which contains anonymized transaction features and labels for fraud vs. normal, the project will:
- Explore and clean the dataset through data wrangling.
- Analyze fraud patterns through exploratory data analysis (EDA).
- Build and evaluate machine learning models (e.g., Logistic Regression, Random Forest, XGBoost, Isolation Forest).
- Address key challenges such as class imbalance and model interpretability.
- Provide insights into which features most strongly indicate fraud and how to balance detecting fraud with minimizing false alarms.

The final deliverables will include a GitHub repository with code and documentation, a detailed project report, and a slide deck summarizing findings for both technical and business audiences.

**Purpose:**

In this notebook, we prepare the dataset for fraud detection modeling.
We’ll load the data, inspect its structure, clean it, and save a smaller version for efficient analysis.

Main Steps:

- Data Collection

- Data Organization

- Data Definition

- Data Cleaning

- Summary of Findings

In [1]:
import pandas as pd
import numpy as np

In this step, we load the Credit Card Fraud dataset from Kaggle.
We start by reading the CSV file and checking the shape and first few rows.

In [2]:
df = pd.read_csv("creditcard.csv")

In [3]:
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


**Data Dictionary**
| **Column Name** | **Type**                 | **Description**                                                                                                                                                                                                | **Notes / Range**                                                                         |
| --------------- | ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Time**        | Numeric                  | The number of seconds between this transaction and the first transaction in the dataset.                                                                                                                       | Ranges from 0 to about 172,000 seconds (around 2 days).                                   |
| **V1 – V28**    | Numeric (anonymized)     | Features created using **Principal Component Analysis (PCA)** to hide sensitive details. Each represents patterns or combinations of original transaction data such as user behavior, location, or card usage. | Can be positive or negative values. Exact meanings are unknown.                           |
| **Amount**      | Numeric (currency units) | The transaction amount in the original currency (e.g., Euros).                                                                                                                                                 | Ranges from very small to very large amounts. Often right-skewed (many small, few large). |
| **Class**       | Categorical (0 or 1)     | Target variable: **0 = normal transaction**, **1 = fraudulent transaction.**                                                                                                                                   | Highly imbalanced — frauds make up less than 1% of the data.                              |


- The dataset has 31 columns total: one for time, 28 PCA features, one for transaction amount, and one for class.

- Because the features were transformed using PCA, we can’t know exactly what each represents, but they still show patterns useful for detecting fraud.

- The class imbalance means we must be careful when training models so they don’t just predict “normal” all the time.

- Checking outliers and skewness in features like Amount helps improve model performance and stability.
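
As a minimal sketch of the skewness and outlier check mentioned in the last point, the snippet below uses a small synthetic, right-skewed series standing in for `Amount` (the values are generated for illustration and do not come from the dataset). It compares skewness before and after a `log1p` transform and flags outliers with the 1.5 × IQR rule. Both the transform and the rule are common choices assumed here, not steps this notebook has performed.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts (illustrative only, not real transactions).
rng = np.random.default_rng(42)
amount = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=5_000), name="Amount")

# Skewness well above 0 indicates a long right tail.
raw_skew = amount.skew()

# log1p compresses large values and is defined at 0.
log_amount = np.log1p(amount)
log_skew = log_amount.skew()

# Flag outliers with the 1.5 * IQR rule on the raw amounts.
q1, q3 = amount.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (amount < q1 - 1.5 * iqr) | (amount > q3 + 1.5 * iqr)

print(f"Skewness (raw):   {raw_skew:.2f}")
print(f"Skewness (log1p): {log_skew:.2f}")
print(f"Outliers flagged: {is_outlier.sum()} of {len(amount)}")
```

On the real data, the same calls could be applied to `df['Amount']`. Whether flagged rows should be transformed, capped, or kept as-is is a modeling decision, since unusually large amounts may themselves be informative for fraud.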

**Summary**

This dataset contains 284,807 real credit card transactions, nearly all of which are legitimate; only 492 are labeled as fraud. The goal is to detect those few fraudulent transactions by analyzing the PCA features, transaction amount, and timing patterns. Understanding this structure helps prepare the data for cleaning, visualization, and machine learning later in the project.

In [9]:
print("Full dataset shape:", df.shape)
print("Fraud cases:", df['Class'].sum())

Full dataset shape: (284807, 31)
Fraud cases: 492


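The dataset is large, with 284,807 rows. To make analysis faster, we keep all fraud cases and randomly sample 10% of the non-fraud cases.
We then shuffle the combined rows and save this smaller dataset. Note that this downsampling raises the share of fraud cases relative to the full dataset.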

In [10]:
fraud = df[df['Class'] == 1]
non_fraud = df[df['Class'] == 0].sample(frac=0.1, random_state=42)
df = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)

# Save smaller dataset
df.to_csv("creditcard_sample.csv", index=False)
print("Smaller dataset saved as 'creditcard_sample.csv'.")
print("New shape:", df.shape)

Smaller dataset saved as 'creditcard_sample.csv'.
New shape: (28924, 31)
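
To make the effect of downsampling concrete, the following sketch reworks the arithmetic using only the counts printed above. It is a plain calculation, not part of the original notebook, and the variable names are chosen for illustration.

```python
# Counts reported in the notebook output above.
total_rows = 284_807
fraud_rows = 492
non_fraud_rows = total_rows - fraud_rows      # 284,315

sample_rows = 28_924                          # shape after downsampling
sampled_non_fraud = sample_rows - fraud_rows  # all 492 fraud rows are kept

# Fraud share before and after downsampling, in percent.
full_rate = fraud_rows / total_rows * 100
sample_rate = fraud_rows / sample_rows * 100

print(f"Non-fraud rows kept: {sampled_non_fraud} of {non_fraud_rows}")
print(f"Fraud rate, full dataset: {full_rate:.2f}%")
print(f"Fraud rate, sample:       {sample_rate:.2f}%")
```

The fraud rate rises from about 0.17% in the full data to about 1.70% in the sample, roughly a tenfold increase. Rates or predicted probabilities estimated on the sample should therefore not be read as estimates for the full population without adjustment.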


Here, we review column names, data types, summary statistics, missing values, and unique counts to better understand the dataset.

In [11]:
# =========================
# 3. Data Definition
# =========================
print("\n--- Data Overview ---")
df.info()  # info() prints its report directly and returns None, so no print() is needed

print("\n--- Summary Statistics ---")
print(df.describe())

print("\n--- Missing Values ---")
print(df.isnull().sum())

print("\n--- Unique Values per Column ---")
print(df.nunique())


--- Data Overview ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28924 entries, 0 to 28923
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    28924 non-null  float64
 1   V1      28924 non-null  float64
 2   V2      28924 non-null  float64
 3   V3      28924 non-null  float64
 4   V4      28924 non-null  float64
 5   V5      28924 non-null  float64
 6   V6      28924 non-null  float64
 7   V7      28924 non-null  float64
 8   V8      28924 non-null  float64
 9   V9      28924 non-null  float64
 10  V10     28924 non-null  float64
 11  V11     28924 non-null  float64
 12  V12     28924 non-null  float64
 13  V13     28924 non-null  float64
 14  V14     28924 non-null  float64
 15  V15     28924 non-null  float64
 16  V16     28924 non-null  float64
 17  V17     28924 non-null  float64
 18  V18     28924 non-null  float64
 19  V19     28924 non-null  float64
 20  V20     28924 non-null  float64
 21  V21     2892

We check data quality by identifying and removing duplicate rows, verifying that there are no missing values, and checking the balance of the target variable.

In [12]:
# =========================
# 4. Data Cleaning
# =========================
print("\n--- Checking for Duplicates ---")
duplicates = df.duplicated().sum()
print("Duplicate rows:", duplicates)

# Drop duplicates if any
df = df.drop_duplicates()

# Re-check for missing values
print("\nAny missing values left?", df.isnull().values.any())

# Check fraud distribution
print("\n--- Class Distribution ---")
print(df['Class'].value_counts())
print("\nPercentage of Fraud Cases:")
print(df['Class'].value_counts(normalize=True) * 100)



--- Checking for Duplicates ---
Duplicate rows: 29

Any missing values left? False

--- Class Distribution ---
Class
0    28422
1      473
Name: count, dtype: int64

Percentage of Fraud Cases:
Class
0    98.363039
1     1.636961
Name: proportion, dtype: float64
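
The fraud count drops from 492 to 473 after deduplication, so 19 of the 29 removed rows were fraud cases. Because fraud rows are scarce, it can be worth seeing how duplicates split across classes before dropping them. The sketch below shows one way to do that on a small made-up frame (`toy` and its values are hypothetical, used only for illustration).

```python
import pandas as pd

# Toy frame standing in for the transaction data (values are made up).
toy = pd.DataFrame({
    "Amount": [10.0, 10.0, 25.5, 25.5, 25.5, 99.0],
    "Class":  [0,    0,    1,    1,    1,    0],
})

# duplicated() marks every repeat of an earlier identical row.
dup_mask = toy.duplicated()

# Count the repeats within each class before deciding to drop them.
dups_per_class = toy.loc[dup_mask, "Class"].value_counts().sort_index()
print(dups_per_class)

# Drop repeats and renumber the index so it has no gaps.
cleaned = toy.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

On the notebook's data, the same pattern would be `df.loc[df.duplicated(), 'Class'].value_counts()`. With anonymized features it is hard to tell whether identical rows are recording errors or genuine repeated transactions, so dropping them is a judgment call rather than a clear-cut fix.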


In [13]:


# =========================
# 5. Summary
# =========================
print("\n✅ Data Wrangling Complete!")
print("Final dataset shape:", df.shape)
print("Fraud cases in sample:", df['Class'].sum())


✅ Data Wrangling Complete!
Final dataset shape: (28895, 31)
Fraud cases in sample: 473


**Summary of Data Wrangling**

- Loaded the dataset successfully.

- Created a smaller, manageable dataset containing all fraud and 10% of non-fraud cases.

- Removed 29 duplicate rows and verified there are no missing values.

- Fraud cases represent roughly 0.17% of transactions in the full dataset (492 of 284,807) and about 1.64% of the cleaned sample (473 of 28,895), showing strong class imbalance in both.
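
To illustrate why this imbalance matters for evaluation, the sketch below scores a trivial baseline that labels every transaction as normal, using the class counts from the cleaned sample. It is a worked calculation under that assumption, not a model from this project.

```python
# Class counts from the cleaned sample reported above.
n_normal = 28_422
n_fraud = 473
n_total = n_normal + n_fraud

# A trivial "model" that labels every transaction as normal (Class 0).
true_negatives = n_normal   # every normal transaction is labeled correctly
false_negatives = n_fraud   # every fraud case is missed
true_positives = 0          # no transaction is ever flagged as fraud

accuracy = true_negatives / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.4f}")
print(f"Fraud recall: {recall:.4f}")
```

This baseline reaches about 98% accuracy while catching no fraud at all, which is why metrics such as recall, precision, or the precision-recall curve are usually more informative than accuracy for this kind of problem.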

**Next Steps:**

- Conduct Exploratory Data Analysis (EDA) to visualize patterns.

- Explore correlations and distributions of key variables.

- Begin preparing data for model training.