## Problem Statement
A dataset records users' past purchase behavior. Based on this dataset, we have to build a model that predicts whether a person will make a purchase.

In [1]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [None]:
# Load data
df = pd.read_csv('/content/drive/MyDrive/edurekaai/_data/samples/social_network_ads_v2.csv')
df.sample(10)
df['Purchased'].value_counts()  # Two classes (Yes/No): a clear case of binary classification

Purchased,count
No,257
Yes,143


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    object
dtypes: int64(3), object(2)
memory usage: 15.8+ KB


In [None]:
df.head(5)

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,No
1,15810944,Male,35,20000,No
2,15668575,Female,26,43000,No
3,15603246,Female,27,57000,No
4,15804002,Male,19,76000,No


## Preprocessing

In [None]:
# DROP UNUSED COLUMNS
## Drop the "User ID" column: it is only an identifier and carries no information for the decision.
df.drop('User ID', axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Gender           400 non-null    object
 1   Age              400 non-null    int64 
 2   EstimatedSalary  400 non-null    int64 
 3   Purchased        400 non-null    object
dtypes: int64(2), object(2)
memory usage: 12.6+ KB


In [None]:
# CHECK FOR NULL OR MISSING VALUES
df.isnull().sum()
## No column contains null values, so no action is needed.

Unnamed: 0,0
User ID,0
Gender,0
Age,0
EstimatedSalary,0
Purchased,0


In [None]:
# No Outliers in the Dataset
## So, no action
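
The claim that the dataset has no outliers can be checked rather than assumed. The sketch below counts values outside the 1.5 × IQR fences, which is one common rule among several. The helper `iqr_outlier_counts` is a hypothetical name, and the small frame is only the five rows printed by `df.head(5)`, standing in for the full `df`.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (frame[col] < q1 - k * iqr) | (frame[col] > q3 + k * iqr)
        counts[col] = int(mask.sum())
    return counts

# Stand-in data: the five rows shown by df.head(5), not the full dataset.
sample = pd.DataFrame({
    'Age': [19, 35, 26, 27, 19],
    'EstimatedSalary': [19000, 20000, 43000, 57000, 76000],
})
outlier_counts = iqr_outlier_counts(sample, ['Age', 'EstimatedSalary'])
print(outlier_counts)
```

On the real data, the same call would be `iqr_outlier_counts(df, ['Age', 'EstimatedSalary'])`; a count of zero for both columns would support the "no outliers" statement.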

In [None]:
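# ENCODING
## Gender and Purchased are categorical columns, so convert them to numeric codes.
## LabelEncoder assigns codes in sorted label order: Female -> 0, Male -> 1; No -> 0, Yes -> 1.
## One encoder per column keeps each mapping available for inverse_transform later.
gender_le = LabelEncoder()
purchased_le = LabelEncoder()
df_encoded = df.copy()
df_encoded['Gender'] = gender_le.fit_transform(df_encoded['Gender'])
df_encoded['Purchased'] = purchased_le.fit_transform(df_encoded['Purchased'])
df_encoded.head(5)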

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,1,19,19000,0
1,15810944,1,35,20000,0
2,15668575,0,26,43000,0
3,15603246,0,27,57000,0
4,15804002,1,19,76000,0
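
To confirm which label each code stands for, the fitted encoder can be inspected directly. This is a small sketch on a toy list of labels, not the real column; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the Purchased column.
purchased_le = LabelEncoder()
codes = purchased_le.fit_transform(['No', 'Yes', 'No', 'Yes'])

# classes_ is sorted, and each label's code is its index in classes_.
mapping = {str(label): code for code, label in enumerate(purchased_le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = [str(label) for label in purchased_le.inverse_transform(codes)]
print(decoded)
```

Because codes follow sorted order, `No` maps to 0 and `Yes` to 1, which is why class 1 is treated as the "buyer" class in the rest of the notebook.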


In [None]:
# SCALING
## Age and EstimatedSalary are numeric but on very different scales, so standardize them (zero mean, unit variance).
sc = StandardScaler()
df_scaled = df_encoded.copy()
df_scaled[['Age', 'EstimatedSalary']] = sc.fit_transform(df_scaled[['Age', 'EstimatedSalary']])
df_scaled.sample(5)

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
391,15668521,0,0.988083,-1.020209,1
36,15764195,0,-1.877311,-0.755925,0
294,15795224,0,-0.158074,1.651993,1
185,15666675,0,0.797057,0.771048,0
86,15706185,0,-1.113206,-1.020209,0


In [None]:
# SEPARATING INDEPENDENT AND DEPENDENT VARIABLES
X = df_scaled.drop('Purchased', axis=1)
y = df_scaled['Purchased']


In [None]:
# SPLITTING DATA FOR TRAINING AND TESTING
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("Train share (%): ", X_train.shape[0] * 100 / df_scaled.shape[0])
print("Test share (%): ", X_test.shape[0] * 100 / df_scaled.shape[0])

Train share (%):  80.0
Test share (%):  20.0
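
The scaling cell fits `StandardScaler` on all 400 rows before the split, so test-set statistics feed into the training features. One common alternative, sketched here, is to split first and put the scaler inside a pipeline so that it is fit on the training rows only. The data below is synthetic: the column names mirror the dataset, but the values and the labelling rule are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded dataset (values are invented).
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'Gender': rng.integers(0, 2, n),
    'Age': rng.integers(18, 61, n),
    'EstimatedSalary': rng.integers(15000, 150001, n),
})
# Hypothetical label rule so the classifier has something to learn.
demo['Purchased'] = ((demo['Age'] > 40) | (demo['EstimatedSalary'] > 100000)).astype(int)

X_demo = demo.drop('Purchased', axis=1)
y_demo = demo['Purchased']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scale only the two numeric columns; Gender passes through unchanged.
pre = ColumnTransformer(
    [('scale', StandardScaler(), ['Age', 'EstimatedSalary'])],
    remainder='passthrough',
)
pipe = make_pipeline(pre, LogisticRegression())
pipe.fit(X_tr, y_tr)  # the scaler sees the training rows only

fitted_scaler = pipe.named_steps['columntransformer'].named_transformers_['scale']
test_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, fitted_scaler.mean_, round(test_accuracy, 3))
```

With only 400 rows the effect on the scores is likely small, but the pipeline form also makes it harder to apply a different scaling to new data at prediction time.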


In [None]:
# TRAIN the MODEL
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# PREDICT on KNOWN / TRAIN DATA
y_pred_train = model.predict(X_train)

# PREDICT on TEST DATA
y_pred_test = model.predict(X_test)

In [None]:
# EVALUATE THE MODEL
cm = confusion_matrix(y_test, y_pred_test)
print(cm)

print("Accuracy on Train Data: ", accuracy_score(y_train, y_pred_train))
print("Error Rate on Train Data: ", 1 - accuracy_score(y_train, y_pred_train))

print("Accuracy on Test Data: ", accuracy_score(y_test, y_pred_test))
print("Error Rate on Test Data: ", 1 - accuracy_score(y_test, y_pred_test))


[[49  4]
 [ 8 19]]
Accuracy on Train Data:  0.834375
Error Rate on Train Data:  0.16562500000000002
Accuracy on Test Data:  0.85
Error Rate on Test Data:  0.15000000000000002
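
The test-set metrics reported later in the classification report follow directly from this confusion matrix. As a worked check, with rows as actual classes and columns as predicted classes (the scikit-learn convention):

```python
import numpy as np

# Test-set confusion matrix printed above: rows = actual, columns = predicted.
cm = np.array([[49, 4],
               [8, 19]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 68 / 80
precision = tp / (tp + fp)        # 19 / 23
recall = tp / (tp + fn)           # 19 / 27
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The 8 false negatives are buyers the model labelled as non-buyers, and they are what holds class 1 recall at about 0.70.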


In [None]:
## Predicted Probability on Test Data (default threshold is 0.5)
y_pred_prob = model.predict_proba(X_test)
y_pred_prob

# Each row is [P(class 0 = No), P(class 1 = Yes)]. For example, a row [0.11230469, 0.88769531] means the probability of No is 0.11230469 and the probability of Yes is 0.88769531, so the model predicts Yes.

array([[0.80350862, 0.19649138],
       [0.17338475, 0.82661525],
       [0.70579627, 0.29420373],
       [0.56290657, 0.43709343],
       [0.68491508, 0.31508492],
       [0.9962239 , 0.0037761 ],
       [0.99478301, 0.00521699],
       [0.59671286, 0.40328714],
       [0.98628214, 0.01371786],
       [0.5226503 , 0.4773497 ],
       [0.90308645, 0.09691355],
       [0.45532823, 0.54467177],
       [0.98690799, 0.01309201],
       [0.72363945, 0.27636055],
       [0.98208142, 0.01791858],
       [0.99616613, 0.00383387],
       [0.01153897, 0.98846103],
       [0.87387785, 0.12612215],
       [0.60636133, 0.39363867],
       [0.69796133, 0.30203867],
       [0.57498045, 0.42501955],
       [0.5796091 , 0.4203909 ],
       [0.95269808, 0.04730192],
       [0.47110154, 0.52889846],
       [0.21411833, 0.78588167],
       [0.22978593, 0.77021407],
       [0.95646716, 0.04353284],
       [0.97626788, 0.02373212],
       [0.11957828, 0.88042172],
       [0.58724195, 0.41275805],
       [0.
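
The link between these probabilities and the predicted class can be made explicit. The sketch below takes the first four rows of the output above and applies the 0.5 threshold by hand; for a binary problem this gives the same result as picking the more probable column.

```python
import numpy as np

# First four rows of the predict_proba output shown above.
proba = np.array([
    [0.80350862, 0.19649138],
    [0.17338475, 0.82661525],
    [0.70579627, 0.29420373],
    [0.56290657, 0.43709343],
])

# Column 1 holds P(Yes). Predict class 1 when it exceeds 0.5.
pred_at_050 = (proba[:, 1] > 0.5).astype(int)

# Equivalent view: take the more probable class in each row.
pred_argmax = proba.argmax(axis=1)
print(pred_at_050, pred_argmax)
```

Only the second row crosses the threshold, so it is the only one of the four predicted as a buyer.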

In [None]:
# PRINT CLASSIFICATION REPORT ON TRAIN DATA
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.84      0.92      0.88       204
           1       0.83      0.68      0.75       116

    accuracy                           0.83       320
   macro avg       0.83      0.80      0.81       320
weighted avg       0.83      0.83      0.83       320



In [None]:
# PRINT CLASSIFICATION REPORT ON TEST DATA
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.86      0.92      0.89        53
           1       0.83      0.70      0.76        27

    accuracy                           0.85        80
   macro avg       0.84      0.81      0.83        80
weighted avg       0.85      0.85      0.85        80



# 📊 Binary Classification Model Performance Summary

## 🎯 Dataset & Label Overview
- **Binary classes** (as encoded by `LabelEncoder`, which assigns codes in sorted label order):
  - **Class 0**: Negative class (`No`: the customer will not purchase)
  - **Class 1**: Positive class (`Yes`: the customer will purchase)

---

## 📈 Classification Report

### 🔹 Train Set
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| **0** | 0.84      | 0.92   | 0.88     | 204     |
| **1** | 0.83      | 0.68   | 0.75     | 116     |
| **Accuracy** |       |        | **0.83** | **320** |
| **Macro Avg** | 0.83 | 0.80   | 0.81     |         |
| **Weighted Avg** | 0.83 | 0.83 | 0.83     |         |

---

### 🔹 Test Set
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| **0** | 0.86      | 0.92   | 0.89     | 53      |
| **1** | 0.83      | 0.70   | 0.76     | 27      |
| **Accuracy** |       |        | **0.85** | **80**  |
| **Macro Avg** | 0.84 | 0.81   | 0.83     |         |
| **Weighted Avg** | 0.85 | 0.85 | 0.85     |         |
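
The two averaged rows differ only in how the per-class values are combined: the macro average treats both classes equally, while the weighted average weights each class by its support. A short worked check for the test-set recall column, using the counts from the confusion matrix `[[49, 4], [8, 19]]`:

```python
# Per-class test-set recall and support, taken from the confusion matrix.
support = {0: 53, 1: 27}
recall = {0: 49 / 53, 1: 19 / 27}

total = sum(support.values())
macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(f"macro={macro_recall:.2f} weighted={weighted_recall:.2f}")
```

Weighted recall always equals overall accuracy, which is why both show 0.85 here, while the macro figure (0.81) is pulled down by the weaker class 1.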

---

## ✅ Strengths
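- **Accuracy** is 0.83 on the training set and 0.85 on the test set, well above the 0.64 obtained by always predicting the majority class (257 of 400 rows are `No`).
- **Class 0** is well classified, with **high recall (0.92)** and precision (0.84 train, 0.86 test).
- No sign of overfitting: test performance is in line with training, although the test set has only 80 rows.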

---

## ⚠️ Areas for Improvement
- **Class 1 recall is lower**:
  - Train: 0.68
  - Test: 0.70
  - ➤ The model misses about 30% of actual buyers: 8 of the 27 buyers in the test set are **false negatives**.
- Because class `1` is the buyer class the business wants to find, this recall **needs to be improved**.

---

## 🎯 Problem Statement
The model is designed to **predict whether a customer will make a purchase**.

### ➕ Class Definitions:
- **Class `0`** → Customer will **not purchase**
- **Class `1`** → Customer **will purchase**

---

## ✅ Important Class

> **Class `1` (Customer will purchase)** is the most important class.

### Why?
- It represents your **business goal**: identifying customers likely to **convert**.
- Helps optimize:
  - **Marketing spend**
  - **Sales efforts**
  - **Promotional strategies**
  - **Customer engagement campaigns**

---

## 📊 Class 1 Metrics (Your Model)

| Metric     | Train | Test | Interpretation |
|------------|-------|------|----------------|
| **Precision** | 0.83  | 0.83 | Most predicted buyers are actual buyers (19 of 23 on the test set). |
| **Recall**    | 0.68  | 0.70 | The model misses about 30% of actual buyers (8 of 27 on the test set). Needs improvement. |
| **F1-score**  | 0.75  | 0.76 | Reasonable balance, held back by recall. |

---

## 🧠 Which Metric to Focus On?

| Metric   | Why it matters for Class 1 |
|----------|----------------------------|
| **Recall** | You want to **catch as many true buyers** as possible. Missing them = lost revenue. |
| **Precision** | Reduces false positives (wasting resources on non-buyers). |
| **F1-score** | Balances both precision and recall. Useful summary metric. |

---

## 🔧 Recommendations

1. **Improve Recall for Class 1**:
   - Adjust decision threshold (e.g., lower from `0.5` to `0.4`).
   - Use `class_weight="balanced"` in your model.

2. **Try Sampling Techniques**:
   - Use **SMOTE** (Synthetic Minority Over-sampling Technique) for class 1.
   - Or undersample class 0.

3. **Use Precision-Recall Tradeoff Curve**:
   - Helps choose the best threshold based on your business goal.

4. **Run A/B Testing**:
   - Test impact of targeting predicted class 1 customers in real campaigns.
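
Recommendations 1 and 3 can be tried with a few lines of scikit-learn. The sketch below uses synthetic data from `make_classification` with roughly the same class balance as this dataset, so the numbers it prints say nothing about the real model; it only shows the mechanics. SMOTE is not shown because it lives in the separate `imbalanced-learn` package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same class balance (about 64% / 36%).
X_syn, y_syn = make_classification(n_samples=400, n_features=3, n_informative=2,
                                   n_redundant=0, weights=[0.64, 0.36],
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=0, stratify=y_syn)

clf = LogisticRegression().fit(X_tr, y_tr)
p_yes = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold can only keep or raise class 1 recall.
recall_050 = recall_score(y_te, (p_yes > 0.5).astype(int))
recall_040 = recall_score(y_te, (p_yes > 0.4).astype(int))

# class_weight='balanced' reweights classes inversely to their frequency.
balanced = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)
recall_balanced = recall_score(y_te, balanced.predict(X_te))

# Precision and recall at every candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_te, p_yes)

print(recall_050, recall_040, recall_balanced, len(thresholds))
```

A lower threshold raises recall at the cost of precision, so the final choice depends on how the cost of a missed buyer compares with the cost of contacting a non-buyer. Any threshold should be chosen on validation data rather than on the test set used for the final report.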

---

## ✅ Conclusion

> Focus on **class 1**, the "buyer" class.  
> Your model performs well overall, but **boosting recall for class 1** will help catch more actual buyers and **maximize business value**.


---
