# **Classification Models in Supervised Learning**
Classification is a supervised learning task where the goal is to predict discrete labels (categories). Here are some of the most commonly used classification models:

---

## **1. Logistic Regression**
- A simple linear model used for binary classification.
- Uses the sigmoid function to map predictions to probabilities.
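- Works well when the classes are roughly linearly separable, i.e., the decision boundary is close to a straight line (or hyperplane) in feature space.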

🔹 **Best For:** Binary classification problems (e.g., spam detection, medical diagnosis).
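
To make the sigmoid step concrete, here is a minimal sketch in plain Python. The weights and bias are hypothetical values chosen only for illustration; a fitted model would learn them from data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in [0, 1]."""
    # Numerically stable form: never calls exp() on a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Probability of the positive class for a single sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical parameters (assumption, not taken from any fitted model).
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([1.0, 2.0], weights, bias)  # linear score = 0.8 - 1.0 + 0.1 = -0.1
label = int(p >= 0.5)                         # threshold the probability at 0.5
print(f"probability={p:.3f}, label={label}")
```

A score of zero maps to a probability of exactly 0.5, which is why the 0.5 threshold corresponds to the linear decision boundary.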

---

## **2. k-Nearest Neighbors (k-NN)**
- A non-parametric model that classifies data based on the majority vote of its k nearest neighbors.
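- Has no explicit training phase (a "lazy learner"): it stores the training set and computes distances (e.g., Euclidean, Manhattan) at prediction time, so features should be on comparable scales.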

🔹 **Best For:** Small datasets where decision boundaries are irregular.
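
The majority-vote idea can be sketched in a few lines of plain Python. The toy points and labels are made up for illustration, and this brute-force version scans every training point, which practical libraries avoid with spatial indexes.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points closest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # nearest neighbors are all in group "a"
print(knn_predict(points, labels, (6, 5)))  # nearest neighbors are all in group "b"
```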

---

## **3. Decision Trees**
- A tree-based model that splits data based on feature conditions.
- Prone to overfitting but interpretable.

🔹 **Best For:** When interpretability is important (e.g., customer segmentation).
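
As a rough illustration of how a tree picks a split, the sketch below scores candidate thresholds on a single feature by weighted Gini impurity. The ages and outcomes are invented for the example, and real implementations typically test midpoints between sorted values and search over all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    n = len(values)
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: age and a binary outcome.
ages = [5, 8, 12, 30, 40, 55]
outcome = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(ages, outcome)
print(threshold, impurity)
```

A tree repeats this search recursively on each side of the chosen split until a stopping rule such as `max_depth` is reached.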

---

## **4. Random Forest**
- An ensemble of multiple decision trees.
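- Reduces overfitting by training each tree on a bootstrap sample with random feature subsets, then combining the trees' predictions (majority vote or averaged class probabilities).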

🔹 **Best For:** High-dimensional data; some implementations can also handle missing values natively.
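
The following is a deliberately tiny sketch of the ensemble idea, assuming depth-1 trees ("stumps") in place of full decision trees and one random feature per tree. All names and the toy data are hypothetical; it is not how any particular library implements random forests.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """Depth-1 tree on one feature: choose the threshold with the fewest training errors."""
    best = None
    for t in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= t]
        right = [y for row, y in zip(rows, labels) if row[feature] > t]
        if not left or not right:
            continue  # not a real split
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # every value identical: always predict the majority class
        m = majority(labels)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, t, l_lab, r_lab = stump
    return l_lab if row[feature] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample rows with replacement
        feature = rng.randrange(n_features)          # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Combine the individual tree predictions by majority vote."""
    return majority([predict_stump(stump, row) for stump in forest])

# Toy data (hypothetical): two separated clusters.
rows = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (10, 10)))
```

An individual stump trained on an unlucky bootstrap sample can be wrong, but the vote across many trees smooths out those errors.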

---

## **5. Support Vector Machine (SVM)**
- Finds the hyperplane that separates the classes with the largest margin.
- Can handle non-linearly separable data with kernels (e.g., RBF kernel).

🔹 **Best For:** When the number of features is high relative to the number of samples.
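
To show why kernels help with non-linearly separable data, the sketch below evaluates a kernelized decision function on the XOR pattern, which no single straight line can separate. The support vectors, coefficients, and `gamma` are hand-picked assumptions for illustration, not the output of an actual SVM solver.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Hypothetical support vectors with labels in {-1, +1} (the XOR pattern).
support_vectors = [(0, 0), (1, 1), (0, 1), (1, 0)]
sv_labels = [-1, -1, +1, +1]
alphas = [1.0, 1.0, 1.0, 1.0]  # hand-set coefficients, not solved for
bias = 0.0

def decision_function(x):
    """Signed score: sum of alpha_i * y_i * K(x_i, x) + bias."""
    return sum(
        a * y * rbf_kernel(sv, x)
        for a, y, sv in zip(alphas, sv_labels, support_vectors)
    ) + bias

def predict(x):
    return 1 if decision_function(x) >= 0 else -1

for point in support_vectors:
    print(point, predict(point))
```

The kernel measures similarity to each support vector, so the decision boundary can curve around the points instead of being a single hyperplane in the original feature space.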

---

## **6. Naïve Bayes**
- A probabilistic model based on Bayes' theorem.
- Assumes the features are conditionally independent given the class (the "naïve" assumption).

🔹 **Best For:** Text classification, spam filtering.
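
A minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing is shown here. The four training messages are invented, and the whitespace tokenizer is a simplifying assumption.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        words = doc.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Pick the class with the highest log prior + sum of log word likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_prob = math.log(count / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            log_prob += math.log(likelihood)
        scores[label] = log_prob
    return max(scores, key=scores.get)

# Hypothetical training messages.
docs = ["win money now", "free prize money", "meeting at noon", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, "free money"))
print(predict_nb(model, "meeting notes"))
```

Multiplying per-word likelihoods (summing their logs) is exactly where the independence assumption enters: each word contributes to the score without regard to the others.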

---

## **7. Neural Networks (Deep Learning)**
- Uses layers of artificial neurons to learn complex patterns.
- Range from small multilayer perceptrons (MLPs) to deep architectures such as CNNs and RNNs.

🔹 **Best For:** Large datasets with high complexity (e.g., image and speech recognition).
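
As a small illustration of what "layers of neurons" compute, here is a forward pass through a two-layer network in plain Python. The weights are set by hand (an assumption for the example; a real network would learn them by gradient descent) so that the network reproduces XOR, a pattern a single linear layer cannot represent.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: weights[j] holds the input weights of neuron j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

# Hand-set parameters (not learned): 2 inputs -> 2 hidden neurons -> 1 output.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
b2 = [0.0]

def forward(x):
    hidden = relu(dense(x, W1, b1))   # non-linear hidden representation
    return dense(hidden, W2, b2)[0]   # single output score

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, forward(x))
```

The ReLU between the two layers is what makes the difference: without it, the two linear layers would collapse into one linear map.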

---

## **8. Gradient Boosting Models (GBM, XGBoost, LightGBM, CatBoost)**
- Ensemble methods that build models sequentially to correct previous errors.
- Highly effective for structured/tabular data.

🔹 **Best For:** Structured/tabular prediction tasks where accuracy is the priority (e.g., Kaggle competitions, predictive analytics).
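
The "correct previous errors" idea can be sketched as repeatedly fitting a small model to the current residuals. This simplified version assumes squared-error loss on 0/1 targets and depth-1 regression stumps on a single feature; libraries such as XGBoost use log-loss gradients and full trees for classification. The data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor: predicts the mean residual on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return (base, stumps, history), where history is the training MSE after each round."""
    base = sum(ys) / len(ys)          # start from the mean target
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [
            p + learning_rate * (left_mean if x <= t else right_mean)
            for x, p in zip(xs, preds)
        ]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

# Hypothetical 1-D data: the positive class sits in the middle of the range.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 1, 0, 0]
base, stumps, history = boost(xs, ys)
print(f"MSE after round 1: {history[0]:.4f}, after round 20: {history[-1]:.4f}")
```

No single stump can isolate the middle region, but the sum of many small corrections can, which is the core of boosting.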

---

## **Choosing the Right Model**
| Scenario | Suggested Model |
|----------|----------------|
| Binary Classification | Logistic Regression, SVM, Random Forest |
| Multiclass Classification | Decision Trees, Random Forest, XGBoost |
| Small Dataset | k-NN, Naïve Bayes |
| Large Dataset | Neural Networks, XGBoost |
| Text Classification | Naïve Bayes, Neural Networks |
| Image Recognition | CNNs (Deep Learning) |


# Best Model Selection

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = sns.load_dataset("titanic")
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]
y = df['survived']

# Convert categorical variable 'sex' to numerical using one-hot encoding
X = pd.get_dummies(X, columns=['sex']) 

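# Fill missing values in 'age' with the mean age
# (computed on the full dataset here; for a stricter evaluation, compute it on the training split only)
X['age'] = X['age'].fillna(df['age'].mean())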

# Import necessary libraries from sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize all features, including the one-hot 'sex' columns (scaler is fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

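# Define models
# Note: DecisionTreeClassifier and RandomForestClassifier are randomized; without a
# fixed random_state their scores can shift slightly from run to run.
# SVC's probability=True is only needed for predict_proba and slows down fitting.
models = [LogisticRegression(),
          KNeighborsClassifier(n_neighbors=5),
          DecisionTreeClassifier(),
          RandomForestClassifier(n_estimators=100),
          SVC(kernel='rbf', probability=True)]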

model_names = [
    "Logistic Regression",
    "k-Nearest Neighbors",
    "Decision Tree",
    "Random Forest",
    "Support Vector Machine"
]

# Train and evaluate models
model_scores = []

for name, model in zip(model_names, models):
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on test set
    accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy
    model_scores.append([name, accuracy])

# Sort models by accuracy
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=True)

# Print results, best accuracy first
print("\nModel Performance:")
for name, accuracy in sorted_models:
    print(f"{name}: {accuracy:.2f}")



Model Performance:
Support Vector Machine: 0.81
Logistic Regression: 0.80
Random Forest: 0.79
k-Nearest Neighbors: 0.78
Decision Tree: 0.77
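
The scaling step in the cell above fits the scaler on the training split and reuses the same statistics on the test split. A minimal plain-Python sketch of that idea follows; the fare values are made up rather than taken from the Titanic data, and the population standard deviation is used, which is what `StandardScaler` computes.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply z-score scaling with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical values for a single feature.
train_fares = [7.25, 8.05, 13.0, 26.55, 71.28]
test_fares = [7.9, 30.0]

mean, std = fit_scaler(train_fares)               # statistics come from training data only
train_scaled = transform(train_fares, mean, std)
test_scaled = transform(test_fares, mean, std)    # test data reuses the training statistics

print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

Fitting the scaler on the test split as well would leak information about the test distribution into the model's inputs.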


# Classification Report

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load dataset
df = sns.load_dataset("titanic")
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]
y = df['survived']

# Convert categorical variable 'sex' to numerical using one-hot encoding
X = pd.get_dummies(X, columns=['sex']) 

# Fill missing values in 'age' with mean age
X['age'] = X['age'].fillna(df['age'].mean())

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize all features (scaler is fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define models
models = [
    # List of (name, estimator) tuples
    ("Logistic Regression", LogisticRegression()),
    ("k-Nearest Neighbors", KNeighborsClassifier(n_neighbors=5)),
    ("Decision Tree", DecisionTreeClassifier()),
    ("Random Forest", RandomForestClassifier(n_estimators=100)),
    ("Support Vector Machine", SVC(kernel='rbf', probability=True))
]

# Train and evaluate models
for name, model in models:
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on test set
    print(f"\n{name} Classification Report:")
    print(classification_report(y_test, y_pred))



Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.86      0.83       105
           1       0.78      0.72      0.75        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179


k-Nearest Neighbors Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.82      0.82       105
           1       0.74      0.73      0.73        74

    accuracy                           0.78       179
   macro avg       0.78      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179


Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.78      0.78       105
           1       0.69      0.70      0.70        74

    accuracy                           0.75       179
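
To show how the report columns relate to each other, the sketch below recomputes precision, recall, F1, and accuracy from confusion-matrix counts. The counts are an assumption: they are inferred from the rounded Logistic Regression report above (support 105 and 74), not printed by the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Metrics for one class, given its true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)   # of the samples predicted as this class, how many were right
    recall = tp / (tp + fn)      # of the samples truly in this class, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Inferred counts (assumption) consistent with the rounded report.
tn, fp, fn, tp = 90, 15, 21, 53

p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)   # class 1 (survived)
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)   # class 0: the roles of the counts swap
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Accuracy alone hides the asymmetry visible here: the model finds non-survivors more reliably (higher recall for class 0) than survivors.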
  

# Hyperparameter Tuning for All Models

In [8]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load dataset
df = sns.load_dataset("titanic")
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]
y = df['survived']

# Convert categorical variable 'sex' to numerical using one-hot encoding
X = pd.get_dummies(X, columns=['sex']) 

# Fill missing values in 'age' with mean age
X['age'] = X['age'].fillna(df['age'].mean())

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize all features (scaler is fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define hyperparameter grids
param_grids = {
    "Logistic Regression": {
        "C": [0.01, 0.1, 1, 10, 100],
        "solver": ["liblinear"]
    },
    "k-Nearest Neighbors": {
        "n_neighbors": [3, 5, 7, 9, 11],
        "weights": ["uniform", "distance"]
    },
    "Decision Tree": {
        "max_depth": [3, 5, 10, None],
        "criterion": ["gini", "entropy"]
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 10, None]
    },
    "Support Vector Machine": {
        "C": [0.1, 1, 10],
        "kernel": ["linear", "rbf"]
    }
}

# Define models
models = {
    "Logistic Regression": LogisticRegression(),
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC(probability=True)
}

# Perform hyperparameter tuning
best_params = {}

for name, model in models.items():
    grid_search = GridSearchCV(model, param_grids[name], cv=5, scoring="accuracy", n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    best_params[name] = grid_search.best_params_

# Print best hyperparameters for each model
print("\nBest Hyperparameters for Each Model:")
for name, params in best_params.items():
    print(f"{name}: {params}")



Best Hyperparameters for Each Model:
Logistic Regression: {'C': 0.1, 'solver': 'liblinear'}
k-Nearest Neighbors: {'n_neighbors': 3, 'weights': 'uniform'}
Decision Tree: {'criterion': 'entropy', 'max_depth': 3}
Random Forest: {'max_depth': 5, 'n_estimators': 50}
Support Vector Machine: {'C': 1, 'kernel': 'rbf'}
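
Grid search is exhaustive: every combination of the listed values is evaluated on every cross-validation fold. The sketch below expands the grids from the cell above with `itertools.product` to show how many fits that implies. The helper name `expand_grid` is hypothetical; scikit-learn provides `ParameterGrid` for the same purpose.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per combination of the parameter values in grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

# Same grids as in the tuning cell.
param_grids = {
    "Logistic Regression": {"C": [0.01, 0.1, 1, 10, 100], "solver": ["liblinear"]},
    "k-Nearest Neighbors": {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"]},
    "Decision Tree": {"max_depth": [3, 5, 10, None], "criterion": ["gini", "entropy"]},
    "Random Forest": {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10, None]},
    "Support Vector Machine": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
}

cv_folds = 5
candidates = {name: expand_grid(grid) for name, grid in param_grids.items()}
total_fits = sum(len(combos) * cv_folds for combos in candidates.values())

for name, combos in candidates.items():
    print(f"{name}: {len(combos)} candidates, {len(combos) * cv_folds} fits")
print(f"Total cross-validation fits: {total_fits}")
```

The count grows multiplicatively with each added parameter, which is why larger search spaces are often explored with randomized search instead.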
