<a href="https://colab.research.google.com/github/chai1357/BigData_Creditcard_fraudDetection_1/blob/main/BigData__Creditcard_fraud_Detection_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Title**  
**Big Data Analysis on Credit Card Fraud Detection using PySpark**

---

# **Objective**
To detect fraudulent transactions in a credit card dataset using scalable big data tools like PySpark, and build a classification model to predict fraud effectively.

---

## **Tools Used**
- Python  
- PySpark (Big Data Framework)  
- Google Colab (for execution)  
- Logistic Regression (for classification)  

---

## **Dataset Description**
- Source: [Kaggle - Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)  
- Records: 284,807 transactions  
- Features: 28 anonymized PCA features (`V1` to `V28`) plus `Time`, `Amount`, and `Class` (0 = normal, 1 = fraud), for 31 columns in total

---

## **Steps Performed**  
1. **Data Loading:** Loaded the CSV into a Spark DataFrame with `spark.read.csv`, inferring the schema.  
2. **Data Cleaning:** Removed duplicate rows and checked every column for null values (none found).  
3. **Feature Engineering:** Used `VectorAssembler` to combine the input columns into a single `features` vector.  
4. **Model Building:** Trained a Logistic Regression classifier on the training part of an 80/20 random split (seed 42).  
5. **Evaluation:** Measured accuracy and AUC on the held-out test set.

---

## **Results**  
- **Accuracy:** `99.92%`  
- **AUC:** `0.9683`  
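- Fraud is rare in this dataset, so accuracy mostly reflects the normal class; the AUC is the stronger evidence that the model separates fraud from normal transactions.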
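
To put the accuracy figure in context, the arithmetic below computes the accuracy of a trivial classifier that labels every transaction as normal. It is a minimal sketch that assumes the class counts from the Kaggle dataset description (492 fraud cases among 284,807 transactions); the counts after deduplication in the notebook will differ slightly, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
def majority_baseline_accuracy(n_total: int, n_fraud: int) -> float:
    """Accuracy of a classifier that predicts the majority (normal) class for every row."""
    if n_total <= 0 or not 0 <= n_fraud <= n_total:
        raise ValueError("counts must satisfy n_total > 0 and 0 <= n_fraud <= n_total")
    return (n_total - n_fraud) / n_total


# Assumed counts, taken from the Kaggle dataset description (before deduplication).
N_TOTAL = 284_807
N_FRAUD = 492

baseline = majority_baseline_accuracy(N_TOTAL, N_FRAUD)
print(f"Fraud share: {N_FRAUD / N_TOTAL:.4%}")
print(f"Majority-class baseline accuracy: {baseline:.4%}")
```

Under these assumed counts the baseline is roughly 99.83%, which is why an accuracy of 99.92% on its own says little about how well fraud is detected.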

---

## **Insights & Outcome**  
- Logistic Regression separates fraud from normal transactions well on this dataset (test AUC of 0.9683).  
- PySpark loaded, cleaned, and modeled the full dataset in a single Colab session; the same code can run on a cluster for larger data.  
- This approach can serve as a baseline for fraud detection systems at large financial institutions.

---

## **Deliverable**
A complete notebook including:
- Data loading, preprocessing  
- ML modeling and evaluation  
- Final conclusions


In [None]:
!pip install pyspark



In [None]:
from google.colab import files
uploaded = files.upload()


Saving creditcard.csv to creditcard.csv


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CreditCardFraudDetection").getOrCreate()
df = spark.read.csv("creditcard.csv", header=True, inferSchema=True) # Load dataset
df.show(5) # Show the first 5 rows


+----+------------------+-------------------+----------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+--------------------+-------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------+-----+
|Time|                V1|                 V2|              V3|                V4|                 V5|                 V6|                 V7|                V8|                V9|                V10|               V11|               V12|               V13|               V14|               V15|               V16|               V17|                V18|               V19|                V20|                 V21|                V22|     

In [None]:

df.printSchema()
print(f"Total Rows: {df.count()}")
print(f"Total Columns: {len(df.columns)}")

# Summary statistics for numerical columns
df.describe().show()
# Check class distribution (0 = Non-Fraud, 1 = Fraud)
df.groupBy("Class").count().show()


root
 |-- Time: double (nullable = true)
 |-- V1: double (nullable = true)
 |-- V2: double (nullable = true)
 |-- V3: double (nullable = true)
 |-- V4: double (nullable = true)
 |-- V5: double (nullable = true)
 |-- V6: double (nullable = true)
 |-- V7: double (nullable = true)
 |-- V8: double (nullable = true)
 |-- V9: double (nullable = true)
 |-- V10: double (nullable = true)
 |-- V11: double (nullable = true)
 |-- V12: double (nullable = true)
 |-- V13: double (nullable = true)
 |-- V14: double (nullable = true)
 |-- V15: double (nullable = true)
 |-- V16: double (nullable = true)
 |-- V17: double (nullable = true)
 |-- V18: double (nullable = true)
 |-- V19: double (nullable = true)
 |-- V20: double (nullable = true)
 |-- V21: double (nullable = true)
 |-- V22: double (nullable = true)
 |-- V23: double (nullable = true)
 |-- V24: double (nullable = true)
 |-- V25: double (nullable = true)
 |-- V26: double (nullable = true)
 |-- V27: double (nullable = true)
 |-- V28: double (nullable = true)

In [None]:

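from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in sum

df = df.dropDuplicates()  # Remove exact duplicate rows (if any)

# Count null values per column; every count should be 0 for a clean dataset
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()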


+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|Time| V1| V2| V3| V4| V5| V6| V7| V8| V9|V10|V11|V12|V13|V14|V15|V16|V17|V18|V19|V20|V21|V22|V23|V24|V25|V26|V27|V28|Amount|Class|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+
|   0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|     0|    0|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+------+-----+



In [None]:
df.show(5)


+-----+------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-----------------+-----------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------+-----+
| Time|                V1|                V2|                V3|                 V4|                V5|                V6|                 V7|                V8|                V9|               V10|               V11|               V12|                V13|              V14|              V15|                V16|               V17|               V18|               V19|               V20|               V21|               V22|               
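
The cells that follow use `df_assembled`, but the cell that creates it is not shown. Below is a minimal sketch of that step, assuming every column except `Class` is used as an input feature (this is consistent with the `features` vectors displayed later, whose first entry looks like a `Time` value). `feature_columns` and `assemble_features` are hypothetical helper names, not the author's code.

```python
from typing import List


def feature_columns(columns: List[str], label_col: str = "Class") -> List[str]:
    """Return every column name except the label, preserving the original order."""
    return [c for c in columns if c != label_col]


def assemble_features(df, label_col: str = "Class", output_col: str = "features"):
    """Pack all non-label columns of a Spark DataFrame into one vector column."""
    # Imported lazily so feature_columns can be used without a Spark installation.
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label_col),
        outputCol=output_col,
    )
    # transform() keeps the existing columns and appends the vector column.
    return assembler.transform(df)


# In the notebook, with `df` already loaded and cleaned:
# df_assembled = assemble_features(df)
```

Keeping the column selection in a separate pure function makes it easy to check which inputs reach the model before running the assembler.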

In [None]:
# df_assembled is the VectorAssembler output: the input columns packed into a "features" vector
df_assembled.select("features", "Class").show(5)


+--------------------+-----+
|            features|Class|
+--------------------+-----+
|[26.0,-0.52991228...|    0|
|[158.0,-0.6008163...|    0|
|[190.0,-1.5057791...|    0|
|[265.0,-0.4910030...|    0|
|[459.0,-0.5282175...|    0|
+--------------------+-----+
only showing top 5 rows



In [None]:
# Split data into training and testing sets
train_data, test_data = df_assembled.randomSplit([0.8, 0.2], seed=42)

print(f"Training Data: {train_data.count()} rows")
print(f"Testing Data: {test_data.count()} rows")


Training Data: 226986 rows
Testing Data: 56740 rows


In [None]:
from pyspark.ml.classification import LogisticRegression

# Initialize logistic regression model
lr = LogisticRegression(labelCol="Class", featuresCol="features")

# Train the model
lr_model = lr.fit(train_data)


In [None]:

predictions = lr_model.transform(test_data) # Predict on test data

predictions.select("Class", "prediction", "probability").show(10) # Show sample predictions


+-----+----------+--------------------+
|Class|prediction|         probability|
+-----+----------+--------------------+
|    0|       0.0|[0.99965645047094...|
|    0|       0.0|[0.99965069427762...|
|    0|       0.0|[0.99847111350998...|
|    0|       0.0|[0.99998144180382...|
|    0|       0.0|[0.99939487578610...|
|    0|       0.0|[0.99991010999404...|
|    0|       0.0|[0.99722233104859...|
|    0|       0.0|[0.99958391454101...|
|    0|       0.0|[0.99995114859136...|
|    0|       0.0|[0.99986723782084...|
+-----+----------+--------------------+
only showing top 10 rows



In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Accuracy evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="Class", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.9992


In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# AUC evaluator
evaluator_auc = BinaryClassificationEvaluator(labelCol="Class", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator_auc.evaluate(predictions)

print(f"AUC: {auc:.4f}")


AUC: 0.9683
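
Accuracy and AUC do not show how many fraud cases are caught or how many alerts are false. A minimal sketch of how precision, recall, and F1 for the fraud class could be derived from the `predictions` DataFrame is given below; `precision_recall_f1` and `fraud_metrics` are hypothetical helper names, and the counts in the final line are illustrative only, not results from this notebook.

```python
from typing import Dict


def precision_recall_f1(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive (fraud) class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def fraud_metrics(predictions, label_col: str = "Class", pred_col: str = "prediction"):
    """Count TP/FP/FN on a Spark predictions DataFrame and derive the metrics."""
    # Imported lazily so precision_recall_f1 works without a Spark installation.
    from pyspark.sql import functions as F

    is_fraud = F.col(label_col) == 1
    flagged = F.col(pred_col) == 1.0
    row = predictions.agg(
        F.sum((is_fraud & flagged).cast("int")).alias("tp"),
        F.sum((~is_fraud & flagged).cast("int")).alias("fp"),
        F.sum((is_fraud & ~flagged).cast("int")).alias("fn"),
    ).first()
    # F.sum yields None on an empty DataFrame, so fall back to 0.
    return precision_recall_f1(row["tp"] or 0, row["fp"] or 0, row["fn"] or 0)


# Illustrative counts only (not results from this notebook):
print(precision_recall_f1(tp=8, fp=2, fn=4))
```

In the notebook this would be called as `fraud_metrics(predictions)` after the prediction cell has run.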
