
# Supervised Learning Model Comparison

## Objective:
We will compare three different supervised learning models using the **Forest Cover Type** dataset.
The models will be evaluated based on **accuracy, training time, and confusion matrices**.

## Models Compared:
1. **Logistic Regression** - A simple, interpretable baseline model.
2. **Gradient Boosting Classifier** - A powerful ensemble model that captures complex relationships.
3. **Support Vector Machine (SVM)** - A margin-based classifier that works well in high-dimensional spaces; a linear kernel is used here.

## Evaluation Criteria:
- **Accuracy Score**: The fraction of test samples classified correctly.
- **Training Time**: The wall-clock time needed to fit the model on the training set.
- **Confusion Matrix**: Counts of actual versus predicted classes, showing which classes are confused with each other.
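
As a minimal sketch of how the first and third criteria are computed, the example below uses a handful of made-up labels (they are assumptions for illustration, not values from the Forest Cover Type data). The names `true_labels`, `pred_labels`, and `cm_demo` are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, made up for illustration only (not from the dataset)
true_labels = [1, 1, 2, 2, 3, 3]
pred_labels = [1, 2, 2, 2, 3, 1]

# Accuracy: fraction of predictions that match the true label
acc = accuracy_score(true_labels, pred_labels)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm_demo = confusion_matrix(true_labels, pred_labels, labels=[1, 2, 3])

print(f"Accuracy: {acc:.4f}")
print(cm_demo)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the number of samples equals the accuracy.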

---


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:

# Load dataset
df = pd.read_csv("forest_cover_type_.csv")

# Display basic information about the dataset
# (df.info() prints its summary directly and returns None, so no print() is needed)
df.info()
df.head()



## Data Preprocessing
- We will separate features (X) and the target variable (y).
- The dataset will be split into **80% training** and **20% testing**.
- Features will be standardized using **StandardScaler**, which helps scale-sensitive models such as Logistic Regression and SVM.
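
The sketch below illustrates why the scaler is fitted on the training split only and then reused on the test split. It runs on synthetic stand-in data, so every name in it (`X_demo`, `y_demo`, `demo_scaler`, and so on) is an assumption for illustration rather than part of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the Forest Cover Type features)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[10.0, 2000.0], scale=[2.0, 300.0], size=(500, 2))
y_demo = rng.integers(0, 3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse its statistics
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Train mean:", X_tr_scaled.mean(axis=0).round(3))
print("Train std: ", X_tr_scaled.std(axis=0).round(3))
print("Test mean: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with zero mean and unit variance by construction. The test columns are only approximately centered, because they are transformed with the training statistics and contribute nothing to them.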


In [None]:

# Define features and target
X = df.drop(columns=['Cover_Type'])
y = df['Cover_Type']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (zero mean, unit variance); fit on the training set only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



## Model Training and Evaluation
We will train three models and evaluate them on:
- **Accuracy Score**
- **Training Time**
- **Confusion Matrices**


In [None]:

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(max_iter=500, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    "Support Vector Machine": SVC(kernel='linear', random_state=42)
}

# Train and evaluate models
results = {}
for name, model in models.items():
    # perf_counter is a monotonic, high-resolution clock suited to timing code
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start_time
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    results[name] = {
        "Accuracy": accuracy,
        "Training Time (s)": train_time,
        "Confusion Matrix": confusion_matrix(y_test, y_pred)
    }
    
    print(f"{name} Accuracy: {accuracy:.4f}, Training Time: {train_time:.2f}s")
    print(classification_report(y_test, y_pred))



## Confusion Matrices
Each confusion matrix shows where a model misclassifies: rows are actual classes and columns are predicted classes, so off-diagonal cells are errors.
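
When classes are imbalanced, raw counts can be hard to compare across rows. One option, sketched here on made-up labels (an assumption for illustration, not model output), is to divide each row by its total so that the diagonal reads as per-class recall. The names `true_labels`, `pred_labels`, and `cm_norm` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration only (not model output)
true_labels = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
pred_labels = [1, 1, 1, 2, 2, 1, 3, 3, 3, 1]

cm_demo = confusion_matrix(true_labels, pred_labels, labels=[1, 2, 3])

# Divide each row by its total so rows sum to 1; the diagonal is per-class recall
cm_norm = cm_demo / cm_demo.sum(axis=1, keepdims=True)
per_class_recall = np.diag(cm_norm)

print(cm_norm)
print(per_class_recall)
```

A normalized matrix like `cm_norm` can be passed to `sns.heatmap` with `fmt='.2f'` in place of `fmt='d'`.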


In [None]:

# Plot confusion matrices
for name, metrics in results.items():
    plt.figure(figsize=(6,5))
    sns.heatmap(metrics['Confusion Matrix'], annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix for {name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()



## Results Summary

**Which method did you like the most?**  
Gradient Boosting achieved the highest accuracy of the three models while keeping training time reasonable.

**Which method did you like the least?**  
Logistic Regression was the weakest performer, likely because a linear decision boundary cannot capture the nonlinear relationships in this dataset.

**How did you score these supervised models?**  
Accuracy, training time, and confusion matrices were used to compare performance, with a per-class classification report printed for each model.

**Did the output align with your geologic understanding?**  
Yes, as expected, some features were more predictive than others.

**Did you hyperparameter tune? Why or why not?**  
No. The models were run with default or near-default parameters to establish a baseline.

**How did you split your data? Why does that make sense?**  
An 80/20 train/test split with a fixed random seed (`random_state=42`), which leaves most of the data for training while holding out an unseen set for evaluation.

**What did you want to learn more about?**  
More feature importance analysis and deeper hyperparameter tuning.

**Did you pre-process your data?**  
Yes, all features were standardized with StandardScaler, fitted on the training set only.

**Do all models require pre-processing?**  
No. Tree-based models (like Gradient Boosting) do not require feature scaling, whereas Logistic Regression and SVM are sensitive to feature scale.
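
The point about tree-based models can be checked with a small experiment. This is a sketch on synthetic data with deliberately mismatched feature scales, not a result from the Forest Cover Type dataset. The names (`X_syn`, `gb_raw`, `gb_scaled`, and so on) and the reduced `n_estimators=20` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (hypothetical)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_syn = X_syn * np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 10.0])

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

scaler_syn = StandardScaler().fit(X_tr)

# Same model settings, trained once on raw and once on standardized features
gb_raw = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
gb_scaled = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(
    scaler_syn.transform(X_tr), y_tr
)

pred_raw = gb_raw.predict(X_te)
pred_scaled = gb_scaled.predict(scaler_syn.transform(X_te))

# Fraction of test points where the two models give the same prediction
agreement = np.mean(pred_raw == pred_scaled)
print(f"Prediction agreement: {agreement:.3f}")
```

Because tree splits depend only on the ordering of values within each feature, standardizing should leave the predictions essentially unchanged. Running the same comparison with Logistic Regression or SVM would typically show a difference.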
