# ⚙️ MODEL BUILDING WORKBOOK (Classification)
---

### Classification (Categorizing Outcomes) 🗂️
**"Which one?" or "Yes/No?"** ❓

Classification models are used when the target variable is **categorical**. The model analyzes input data to sort it into distinct, pre-defined buckets or classes.
* **Business Examples:** Predicting if a customer will Churn (Yes/No) 🚪, or flagging a transaction as Fraudulent vs. Legitimate 💳.
* **Key Note 💡:** While the final output is a category (e.g., "Churn"), these models actually calculate a **probability** (e.g., "85% risk of churn") behind the scenes and then convert it into the final decision.
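
To make the probability step concrete, here is a minimal sketch on made-up toy data (not the churn dataset). It assumes scikit-learn's `LogisticRegression`, where `predict_proba` exposes the probabilities and `predict` behaves like a 0.5 cutoff on them; the 0.8 cutoff is only an illustrative alternative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): one feature, binary label
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# predict_proba returns one column per class; column 1 is P(class == 1)
proba = clf.predict_proba(X_toy)[:, 1]

# predict() matches thresholding that probability at 0.5
labels_default = clf.predict(X_toy)
labels_manual = (proba >= 0.5).astype(int)

# A stricter (illustrative) threshold can only flag the same or fewer rows
labels_strict = (proba >= 0.8).astype(int)

print("Probabilities:", np.round(proba, 2))
print("Default labels:", labels_default)
print("Strict labels: ", labels_strict)
```

Raising the cutoff trades missed positives for fewer false alarms, which is often a business decision rather than a modelling one.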

![Classification](classification.jpg)

## 📂 Setup and Imports

In [None]:
# All imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler

## 1. 📈 Load

In [None]:
df = pd.read_csv("customer_churn.csv")
print("Initial rows:", len(df))

## 2. 🧹 Quick cleaning

In [None]:
# Convert numeric columns
df["Total_Charges"] = pd.to_numeric(df["Total_Charges"], errors="coerce")
df["Total_Charges"] = df["Total_Charges"].fillna(df["Total_Charges"].median())
# Convert target to 0/1
df["ChurnFlag"] = df["Churn"].map({"Yes":1, "No":0})

## 3. 🧾 Prepare features - pick columns

In [None]:
# TODO: Fill FEATURES list with appropriate columns from df (exclude Churn and ChurnFlag)
FEATURES = ______ # <- FILL
X = df[FEATURES]
y = df["ChurnFlag"]
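
One possible way to approach the TODO, shown on a small stand-in frame rather than the real file: keep every column except the target and its 0/1 copy. The frame `df_demo` and its values are hypothetical; only the column names `Gender`, `Contract_Type`, `Total_Charges`, `Churn` and `ChurnFlag` come from this workbook, and the real dataset may contain columns (such as an ID) that should also be excluded.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration
df_demo = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Contract_Type": ["Monthly", "Yearly", "Monthly"],
    "Total_Charges": [120.5, 980.0, 45.2],
    "Churn": ["Yes", "No", "No"],
    "ChurnFlag": [1, 0, 0],
})

# Keep every column except the target and its encoded copy
FEATURES_demo = [c for c in df_demo.columns if c not in ("Churn", "ChurnFlag")]
X_demo = df_demo[FEATURES_demo]
y_demo = df_demo["ChurnFlag"]

print(FEATURES_demo)
```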

## 4. 📠 Encode categorical features (example: Gender, Contract_Type)

In [None]:
# We will one-hot encode all object dtype columns in X
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()
if cat_cols:
    # sparse_output needs scikit-learn >= 1.2; older versions use sparse=False instead
    enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    enc_arr = enc.fit_transform(X[cat_cols])
    enc_cols = enc.get_feature_names_out(cat_cols)
    X = pd.concat([X.drop(columns=cat_cols), pd.DataFrame(enc_arr, columns=enc_cols, index=X.index)], axis=1)

## 5. ✂️ Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= ____, random_state=42)
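
Churn labels are often imbalanced, so it can help to keep the Yes/No ratio the same in both splits. This is a sketch on made-up labels using the `stratify` argument of `train_test_split`; the `test_size=0.25` value is only illustrative and is not meant as the answer to the blank above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 "No" (0) and 20 "Yes" (1)
y_imb = np.array([0] * 80 + [1] * 20)
X_imb = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.25, random_state=42, stratify=y_imb
)

# With stratify, both splits keep the original 20% positive rate
print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```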

## 6. 📏 Scaling numeric columns

In [None]:
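# The one-hot columns created in step 4 are numeric too, so they are scaled as well
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
# Work on copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])  # fit on the training split only
X_test[num_cols] = scaler.transform(X_test[num_cols])        # reuse the training mean and std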
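
The reason for calling `fit_transform` on the training split and only `transform` on the test split can be seen on a tiny made-up example: the scaler learns its mean and standard deviation from the training values and reuses them, so the test values are not centred on their own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration
train_vals = np.array([[10.0], [20.0], [30.0], [40.0]])
test_vals = np.array([[25.0], [50.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_vals)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test_vals)        # reuses those train statistics

print("Learned mean:", scaler_demo.mean_[0])
print("Train mean/std after scaling:", train_scaled.mean(), train_scaled.std())
print("Scaled test values:", test_scaled.ravel())
```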

## 7. 🥊 Train logistic regression

In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

## 8. 🔮 Predict & evaluate

In [None]:
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
print("Accuracy:", acc)
print(classification_report(y_test, pred))

## 9. ➗ Confusion matrix plot

### 📌 What the Decision Boundary Plot Means

- A scatter plot shows points from two categories:
  - Red = Class A
  - Blue = Class B

- The classifier draws a boundary line that separates them.

### 📌 Why We Draw a Decision Boundary

- It shows how the model decides between categories.

- Everything on one side is predicted as Class A, and everything on the other side as Class B.

### 📌 What Tight Clusters Mean

- If points of the same color are grouped together → easy classification.

- If colors are mixed → the model needs a more complex boundary.

### 📌 Why Some Points Are Misclassified

- Real-world data is noisy.

- Two similar customers can behave differently.

- The boundary shows the model’s best guess, not perfection.
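
The ideas above can be sketched with a small plot. This example assumes synthetic two-feature data from `make_blobs` (not the churn dataset, which has more than two features) so the boundary can be drawn in 2D; the colour map and all variable names are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with two features so it can be drawn
X_blob, y_blob = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
clf_blob = LogisticRegression().fit(X_blob, y_blob)

# Predict on a grid covering the data to colour each region by predicted class
xx, yy = np.meshgrid(
    np.linspace(X_blob[:, 0].min() - 1, X_blob[:, 0].max() + 1, 200),
    np.linspace(X_blob[:, 1].min() - 1, X_blob[:, 1].max() + 1, 200),
)
grid_pred = clf_blob.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(5, 4))
plt.contourf(xx, yy, grid_pred, alpha=0.2, cmap="coolwarm")  # shaded predicted regions
plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y_blob, cmap="coolwarm", edgecolor="k", s=20)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Logistic regression decision boundary (synthetic data)")
plt.tight_layout()
plt.show()

# Points that fall in the other class's region are the misclassified ones
n_wrong = int((clf_blob.predict(X_blob) != y_blob).sum())
print("Misclassified points:", n_wrong)
```

Logistic regression always produces a straight boundary in the feature space, so increasing `cluster_std` (more overlap) typically raises the misclassified count.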

### 📌 Confusion Matrix Visualization

Each cell of the matrix can be read as follows:

- True Positive: the model correctly predicts “Yes”.

- True Negative: the model correctly predicts “No”.

- False Positive: the model wrongly predicts “Yes”.

- False Negative: the model misses a “Yes” (predicts “No”).
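
These four counts can be pulled straight out of the matrix. The sketch below uses made-up labels and relies on scikit-learn's layout for binary labels 0/1, where `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP; precision and recall then follow from simple ratios.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = "Yes", 0 = "No")
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted "Yes", how many were right
recall = tp / (tp + fn)     # of the actual "Yes", how many were found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
```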

In [None]:
cm = confusion_matrix(y_test, pred)
plt.figure(figsize=(4,3))
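# Light-to-dark colour map, so the white-on-high / black-on-low labels below stay readable
plt.imshow(cm, interpolation='nearest', cmap="Blues")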
plt.title("Confusion Matrix")
plt.colorbar()
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.xticks([0,1], ["No","Yes"])
plt.yticks([0,1], ["No","Yes"])
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, cm[i, j], ha="center", va="center", color="white" if cm[i,j]>cm.max()/2 else "black")
plt.tight_layout()
plt.show()

print("Classification module complete.")

# CONGRATULATIONS!! 🎉🎉✨
## We have successfully learned Logistic Regression!
