![Python](https://img.shields.io/badge/python-3.9-blue)
![Status: Pending Migration](https://img.shields.io/badge/status-pending%20migration-orange)


<a id="table-of-contents"></a>
# 📖 Classification

[🧭 Objective](#objective)  
- [📌 What is Classification?](#what-is-classification)  
- [📦 Use Cases](#classification-use-cases)  

[📂 Data Setup](#data-setup)  
- [📥 Load Dataset](#load-dataset)  
- [🧹 Preprocessing](#preprocessing)

[🧪 Baseline Model](#baseline-model)  
- [📈 Dummy Classifier](#dummy-classifier)  
- [📊 Metrics to Benchmark](#baseline-metrics)

[🔍 Models](#models)
- [📊 Logistic Regression](#logistic-regression)
- [🧮 Naive Bayes](#naive-bayes)
- [🌳 Decision Tree](#decision-tree)
- [🌲 Random Forest](#random-forest)
- [🚀 XGBoost](#xgboost)
- [🎯 KNN (K-Nearest Neighbors)](#knn)
- [📈 SVM (Support Vector Machines)](#svm)
- [🧠 Neural Network](#neural-net)

[📊 Evaluation & Comparison](#evaluation)  
- [📉 Confusion Matrix](#confusion-matrix)  
- [📈 ROC Curve / AUC](#roc-auc)  
- [📏 Precision / Recall / F1](#prf-metrics)  
- [📋 Model Comparison Table](#comparison-table)

[Hyperparameter Tuning with Cross-Validation](#hyperparameter-tuning)  

[❓ FAQ / Notes](#faq)  
- [📏 Class Imbalance](#class-imbalance)  
- [🧪 When to Use What Model](#model-selection-guide)
<hr style="border: none; height: 1px; background-color: #ddd;" />


<a id="objective"></a>
# 🧭 Objective



<a id="what-is-classification"></a>
#### 📌 What is Classification?

<details><summary><strong>📖 Click to Expand</strong></summary>

Classification is a supervised machine learning task in which the goal is to predict a categorical label for an observation. Given a set of input features, the model assigns the observation to one of several predefined classes. Common examples of classification problems include:
- **Spam detection**: Classifying emails as spam or not.
- **Customer churn prediction**: Classifying customers as likely to leave (churn) or stay based on their activity.
- **Image recognition**: Classifying images into categories, like identifying animals, vehicles, etc.

In classification, the output is discrete (e.g., 'spam' vs 'not spam', 'churn' vs 'no churn'). This contrasts with regression, where the output is continuous (e.g., predicting a house price).

##### Key Points
- Supervised learning approach.
- Used for predicting categories.
- Output is discrete (binary or multiclass).
- Examples: email classification, disease diagnosis, fraud detection.

</details>





<a id="classification-use-cases"></a>
#### 📦 Use Cases



[Back to the top](#table-of-contents)
___

<a id="data-setup"></a>
# 📂 Data Setup



<a id="load-dataset"></a>
#### 📥 Load Dataset



In [1]:
# Data handling and manipulation
import pandas as pd
import numpy as np

# Machine Learning and Model Evaluation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, TimeSeriesSplit, KFold
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Statistical and Other Utilities
from scipy.stats import zscore
from termcolor import colored

# Visualization
import matplotlib.pyplot as plt

<details><summary><strong>📖 Click to Expand</strong></summary>

In this section, we prepare the dataset. For simplicity, we use a simulated classification dataset generated with the `make_classification` function from `sklearn.datasets`. This gives us a synthetic dataset suitable for practicing classification tasks.

We will simulate a dataset with the following properties:
- 1000 samples (observations)
- 10 features (predictors)
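- 2 informative features (the ones that carry signal for prediction; with the defaults of `make_classification`, 2 more are redundant linear combinations of these and the remaining 6 are noise)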
- 2 classes (binary classification problem)

Let's generate and take a look at the data.

</details>

In [2]:
# Simulating a classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2, n_classes=2, random_state=42)

# Converting to a DataFrame for easier handling
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(1, 11)])
df['Target'] = y

# Display the first few rows of the dataset
df.head()


| | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Feature_6 | Feature_7 | Feature_8 | Feature_9 | Feature_10 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.964799 | -0.066449 | 0.986768 | -0.358079 | 0.997266 | 1.18189 | -1.615679 | -1.210161 | -0.628077 | 1.227274 | 0 |
| 1 | -0.916511 | -0.566395 | -1.008614 | 0.831617 | -1.176962 | 1.820544 | 1.752375 | -0.984534 | 0.363896 | 0.20947 | 1 |
| 2 | -0.109484 | -0.432774 | -0.457649 | 0.793818 | -0.268646 | -1.83636 | 1.239086 | -0.246383 | -1.058145 | -0.297376 | 1 |
| 3 | 1.750412 | 2.023606 | 1.688159 | 0.0068 | -1.607661 | 0.184741 | -2.619427 | -0.357445 | -1.473127 | -0.190039 | 0 |
| 4 | -0.224726 | -0.711303 | -0.220778 | 0.117124 | 1.536061 | 0.597538 | 0.348645 | -0.939156 | 0.175915 | 0.236224 | 1 |


<a id="data-characteristics-dictionary"></a>

#### 📊 Data Characteristics Dictionary

<details><summary><strong>📖 Click to Expand Explanation</strong></summary>

This section initializes the **data characteristics dictionary**, which will store various metadata about the dataset, including details about the target variable, features, data size, and linear separability.

The dictionary contains the following key sections:

1. **🎯 Target Variable**:
   - **Type**: Specifies whether the target variable is **binary** or **multiclass**.
   - **Imbalance**: Indicates whether the target variable has **class imbalance**.
   - **Class Imbalance Severity**: Specifies the severity of the imbalance (e.g., **high**, **low**).

2. **🔧 Features**:
   - **Type**: Describes the type of features in the dataset (e.g., **categorical**, **continuous**, or **mixed**).
   - **Correlation**: Indicates the correlation between features (e.g., **low**, **medium**, **high**).
   - **Outliers**: Flag to indicate whether **outliers** are detected in the features.
   - **Missing Data**: Tracks the percentage of **missing data** or flags missing values.

3. **📈 Data Size**:
   - **Size**: Contains the **number of samples** (rows) and **number of features** (columns).

4. **🔍 Linear Separability**:
   - **Linear Separability**: States whether the classes are **linearly separable** (True or False).

This dictionary will be updated dynamically as we analyze the dataset in subsequent steps. It serves as a **summary of key dataset properties** to help guide further analysis and modeling decisions.

</details>


In [4]:
# Initialize the data characteristics dictionary
data_characteristics = {
    "target_variable": {
        "type": None,  # "binary", "multiclass"
        "imbalance": None,  # True if imbalanced, False otherwise
        "class_imbalance_severity": None  # e.g., "high", "low"
    },
    "features": {
        "type": None,  # "categorical", "continuous", "mixed"
        "correlation": None,  # "low", "medium", "high"
        "outliers": None,  # True if outliers detected, False otherwise
        "missing_data": None  # Percentage of missing data or boolean
    },
    "data_size": None,  # Size of dataset (samples, features)
    "linear_separability": None  # True if classes are linearly separable
}
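The dictionary above starts out empty. As a minimal sketch of how it might be filled in for the simulated dataset, the block below derives each field from the data. The cutoffs used here (a 40% minority share for imbalance, 10% for "high" severity, 0.3 and 0.7 for correlation, and |z| > 3 for outliers) are illustrative assumptions, not values taken from this notebook, and `linear_separability` is left unset because it needs a dedicated check.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Recreate the simulated dataset so this sketch runs on its own
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_classes=2, random_state=42)
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(1, 11)])
df["Target"] = y

features = df.drop(columns="Target")

# Target: number of classes and share of the rarest class
class_share = df["Target"].value_counts(normalize=True)
minority_share = float(class_share.min())
is_imbalanced = minority_share < 0.40  # assumed cutoff

# Features: largest absolute pairwise correlation, ignoring the diagonal
corr = features.corr().abs().to_numpy(copy=True)
np.fill_diagonal(corr, 0.0)
max_abs_corr = float(corr.max())

# Features: any value more than 3 standard deviations from its column mean
z = (features - features.mean()) / features.std(ddof=0)
has_outliers = bool((z.abs() > 3).any().any())

data_characteristics = {
    "target_variable": {
        "type": "binary" if class_share.size == 2 else "multiclass",
        "imbalance": is_imbalanced,
        # Severity is only meaningful when the target is imbalanced (assumed 10% cutoff)
        "class_imbalance_severity": (
            None if not is_imbalanced else "high" if minority_share < 0.10 else "low"
        ),
    },
    "features": {
        "type": "continuous",  # make_classification produces only float features
        # Assumed cutoffs of 0.3 and 0.7 on the largest absolute correlation
        "correlation": "low" if max_abs_corr < 0.3 else "medium" if max_abs_corr < 0.7 else "high",
        "outliers": has_outliers,
        "missing_data": float(features.isna().mean().mean() * 100),  # percent missing
    },
    "data_size": features.shape,  # (samples, features)
    "linear_separability": None,  # left unset in this sketch
}

print(data_characteristics)
```

Keeping the cutoffs in one place like this makes it easy to adjust them once the real dataset replaces the simulated one.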


<a id="preprocessing"></a>
#### 🧹 Preprocessing



[Back to the top](#table-of-contents)
___



<a id="baseline-model"></a>
# 🧪 Baseline Model

<details><summary><strong>📖 Click to Expand </strong></summary>

In this section, we define the **baseline model** for the classification task. A baseline is typically a **dummy model** that ignores the input features and serves as a reference point for more sophisticated models. Here, we use scikit-learn's **DummyClassifier**, which always predicts the most frequent class seen during training, to set a baseline performance.

The baseline model will help us assess if more advanced models (e.g., Random Forest, SVM) are making meaningful improvements over a simple strategy.

</details>



<a id="dummy-classifier"></a>
#### 📈 Dummy Classifier
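
A minimal sketch of the majority-class baseline described above, assuming an 80/20 stratified train/test split and `strategy="most_frequent"` (neither choice is fixed by this notebook). The dataset is regenerated so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Same simulated dataset as in the Data Setup section
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_classes=2, random_state=42)

# Assumed split: 80% train, 20% test, stratified to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The dummy model ignores X and always predicts the majority class of y_train
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")

# zero_division=0 avoids warnings for the class that is never predicted
print(classification_report(y_test, y_pred, zero_division=0))
```

Because the simulated classes are roughly balanced, this baseline should score close to 50% accuracy, so any real model needs to beat that figure clearly to be considered useful.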



<a id="baseline-metrics"></a>
#### 📊 Metrics to Benchmark



[Back to the top](#table-of-contents)
___



<a id="models"></a>
# 🔍 Models



<a id="logistic-regression"></a>
#### 📊 Logistic Regression



<a id="naive-bayes"></a>
#### 🧮 Naive Bayes



<a id="decision-tree"></a>
#### 🌳 Decision Tree

<a id="random-forest"></a>
#### 🌲 Random Forest

<a id="xgboost"></a>
#### 🚀 XGBoost



<a id="knn"></a>
#### 🎯 KNN (K-Nearest Neighbors)

<a id="svm"></a>
#### 📈 SVM (Support Vector Machines)

<a id="neural-net"></a>
#### 🧠 Neural Network

[Back to the top](#table-of-contents)
___



<a id="evaluation"></a>
# 📊 Evaluation & Comparison



<a id="confusion-matrix"></a>
#### 📉 Confusion Matrix



<a id="roc-auc"></a>
#### 📈 ROC Curve / AUC



<a id="prf-metrics"></a>
#### 📏 Precision / Recall / F1



<a id="comparison-table"></a>
#### 📋 Model Comparison Table



[Back to the top](#table-of-contents)
___



<a id="faq"></a>
# ❓ FAQ / Notes

<a id="class-imbalance"></a>
#### 📏 Class Imbalance

<a id="model-selection-guide"></a>
#### 🧪 When to Use What Model

[Back to the top](#table-of-contents)
___

