# Task-5 Decision Tree Implementation

## Question 3

a) Show the usage of your decision tree for the [automotive efficiency](https://archive.ics.uci.edu/ml/datasets/auto+mpg) problem.
    
b) Compare the performance of your model with the decision tree module from scikit-learn.

> You should be editing `auto-efficiency.py` for the code containing the above experiments.

### The complete code is also available in `auto-efficiency.py`

### Importing required libraries

In [1]:
import sys
import os

# Add the parent directory, which contains the `tree` package, to the import path
sys.path.append(os.path.abspath("../"))

import numpy as np
import pandas as pd
from tree.base import DecisionTree
from metrics import rmse, mae  # only these two metrics are used below
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)

### Extracting the auto-mpg data 

In [2]:
!pip install ucimlrepo --quiet
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
auto_mpg = fetch_ucirepo(id=9) 
  
# data (as pandas dataframes) 
X = auto_mpg.data.features 
y = auto_mpg.data.targets 
  
# metadata 
print(auto_mpg.metadata) 
  
# variable information 
print(auto_mpg.variables)

# join X and y to check for null values
data = pd.concat([X, y], axis=1)
print("Shape of extracted data: ", data.shape)



# Alternative (disabled): read the raw UCI file directly instead of using ucimlrepo.
# url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# data = pd.read_csv(url, sep=r'\s+', header=None,
#                  names=["mpg", "cylinders", "displacement", "horsepower", "weight",
#                         "acceleration", "model year", "origin", "car name"])


{'uci_id': 9, 'name': 'Auto MPG', 'repository_url': 'https://archive.ics.uci.edu/dataset/9/auto+mpg', 'data_url': 'https://archive.ics.uci.edu/static/public/9/data.csv', 'abstract': 'Revised from CMU StatLib library, data concerns city-cycle fuel consumption', 'area': 'Other', 'tasks': ['Regression'], 'characteristics': ['Multivariate'], 'num_instances': 398, 'num_features': 7, 'feature_types': ['Real', 'Categorical', 'Integer'], 'demographics': [], 'target_col': ['mpg'], 'index_col': ['car_name'], 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1993, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5859H', 'creators': ['R. Quinlan'], 'intro_paper': None, 'additional_info': {'summary': 'This dataset is a slightly modified version of the dataset provided in the StatLib library.  In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for th

### Data Cleaning/Preprocessing

In [3]:
# Treat '?' placeholders as missing values, then drop incomplete and duplicated rows.
# Note: this runs on the full dataset, before the train/test split.
data.replace('?', np.nan, inplace=True)
print("Number of NaN/Null values in data:", data.isnull().sum().sum())
if data.isnull().sum().sum() > 0:
    data.dropna(inplace=True)
print("Number of duplicated samples in data: ", data.duplicated().sum())
if data.duplicated().sum() > 0:
    data.drop_duplicates(inplace=True)

print("Shape of data after cleaning: ", data.shape)

Number of NaN/Null values in data: 6
Number of duplicated samples in data:  0
Shape of data after cleaning:  (392, 8)


In [4]:
# separate X and y
X = data.drop('mpg', axis=1)
y = data['mpg']
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (392, 7)
y shape: (392,)


### Splitting the data into training and testing sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("X_train size: ", X_train.shape)
print("y_train size: ", y_train.shape)
print("X_test size: ", X_test.shape)
print("y_test size: ", y_test.shape)

X_train size:  (274, 7)
y_train size:  (274,)
X_test size:  (118, 7)
y_test size:  (118,)


### Q3 a) Our custom decision tree implementation

In [6]:
my_dt = DecisionTree(criterion="information_gain", max_depth=5)
my_dt.fit(X_train, y_train)

y_train_pred = my_dt.predict(X_train)
y_test_pred = my_dt.predict(X_test)

train_rmse = rmse(y_train_pred, y_train)
train_mae = mae(y_train_pred, y_train)

print("Train Metrics (Custom):")
print(f"    Root Mean Squared Error: {train_rmse:.4f}")
print(f"    Mean Absolute Error: {train_mae:.4f}")

test_rmse = rmse(y_test_pred, y_test)
test_mae = mae(y_test_pred, y_test)

print("\nTest Metrics (Custom):")
print(f"    Root Mean Squared Error: {test_rmse:.4f}")
print(f"    Mean Absolute Error: {test_mae:.4f}")


Train Metrics (Custom):
    Root Mean Squared Error: 5.7899
    Mean Absolute Error: 4.0589

Test Metrics (Custom):
    Root Mean Squared Error: 7.1421
    Mean Absolute Error: 5.0053
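
The `rmse` and `mae` helpers come from the `metrics` module, whose source is not shown in this notebook. As a rough guide to what they compute, here is a minimal sketch that assumes the same `(y_hat, y)` argument order used in the calls above. The bodies are an assumption, not the module's actual code, and the sample values are toy numbers rather than auto-mpg data.

```python
import numpy as np


def rmse(y_hat, y):
    """Root mean squared error between predictions y_hat and targets y."""
    # Convert to plain arrays so values are compared by position,
    # not by pandas index label.
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    assert y_hat.size == y.size, "y_hat and y must have the same length"
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))


def mae(y_hat, y):
    """Mean absolute error between predictions y_hat and targets y."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    assert y_hat.size == y.size, "y_hat and y must have the same length"
    return float(np.mean(np.abs(y_hat - y)))


# Toy values for illustration only
y_true = [20.0, 30.0, 25.0]
y_pred = [22.0, 27.0, 25.0]
print(f"RMSE: {rmse(y_pred, y_true):.4f}")
print(f"MAE:  {mae(y_pred, y_true):.4f}")
```

Because RMSE squares each error before averaging, it penalizes large individual misses more than MAE does, which is why the RMSE values reported here are consistently higher than the corresponding MAE values.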


### Q3 b) Using scikit-learn's Decision Tree Regressor

In [7]:
sklearn_model = DecisionTreeRegressor(max_depth=5)
sklearn_model.fit(X_train, y_train)

y_train_pred_sklearn = sklearn_model.predict(X_train)
y_test_pred_sklearn = sklearn_model.predict(X_test)

train_rmse_sklearn = rmse(pd.Series(y_train_pred_sklearn), y_train)
train_mae_sklearn = mae(pd.Series(y_train_pred_sklearn), y_train)

print("Train Metrics (Sklearn):")
print(f"    Root Mean Squared Error: {train_rmse_sklearn:.4f}")
print(f"    Mean Absolute Error: {train_mae_sklearn:.4f}")

test_rmse_sklearn = rmse(pd.Series(y_test_pred_sklearn), y_test)
test_mae_sklearn = mae(pd.Series(y_test_pred_sklearn), y_test)

print("\nTest Metrics (Sklearn):")
print(f"    Root Mean Squared Error: {test_rmse_sklearn:.4f}")
print(f"    Mean Absolute Error: {test_mae_sklearn:.4f}")

Train Metrics (Sklearn):
    Root Mean Squared Error: 1.9938
    Mean Absolute Error: 1.4468

Test Metrics (Sklearn):
    Root Mean Squared Error: 3.2169
    Mean Absolute Error: 2.3546
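
One detail worth checking in the cell above: the predictions are wrapped in `pd.Series(...)`, which assigns a fresh `0..n-1` index, while `y_train` and `y_test` keep the shuffled row labels from `train_test_split`. Whether this matters depends on how `rmse` and `mae` are implemented in `metrics`, which is not shown here. The sketch below uses toy values (not the auto-mpg data) to illustrate the pitfall under the assumption that a metric subtracts two Series directly.

```python
import numpy as np
import pandas as pd

# Toy targets with a shuffled index, similar to what train_test_split returns
y_true = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])
# Perfect predictions by position, but wrapped with the default index 0, 1, 2
y_pred = pd.Series([10.0, 20.0, 30.0])

# Series arithmetic aligns on index labels, so the values get paired differently
label_aligned_mae = (y_pred - y_true).abs().mean()

# Converting to NumPy arrays compares first with first, second with second, ...
positional_mae = np.abs(y_pred.to_numpy() - y_true.to_numpy()).mean()

print("MAE with label alignment:", label_aligned_mae)
print("MAE with positional comparison:", positional_mae)
```

If the metric functions convert their inputs to arrays first, or reset the index, the comparison is positional and the wrapping is harmless.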


## Performance Comparison

In [8]:
print("\nPerformance Comparison:\n")
print(f"Our Decision Tree - Train RMSE: {train_rmse:.4f}")
print(f"Our Decision Tree - Test RMSE: {test_rmse:.4f}")
print(f"Scikit-Learn Decision Tree - Train RMSE: {train_rmse_sklearn:.4f}")
print(f"Scikit-Learn Decision Tree - Test RMSE: {test_rmse_sklearn:.4f}")


Performance Comparison:

Our Decision Tree - Train RMSE: 5.7899
Our Decision Tree - Test RMSE: 7.1421
Scikit-Learn Decision Tree - Train RMSE: 1.9938
Scikit-Learn Decision Tree - Test RMSE: 3.2169


**Comparison of Models:**

The performance comparison shows that the scikit-learn decision tree outperforms our custom decision tree implementation on both the training and the test set.

   - Our custom decision tree achieved a **Train RMSE of 5.7899**, while the scikit-learn decision tree achieved a significantly lower **Train RMSE of 1.9938**.
   - On the test set, our model resulted in a **Test RMSE of 7.1421**, compared to **3.2169** from the scikit-learn model.

At the same `max_depth=5`, the scikit-learn model fits the training data far more closely, likely because of a more optimized split search and better handling of edge cases. The higher test RMSE for our model indicates that it also generalizes to unseen test data less effectively than the scikit-learn decision tree.

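Our implementation, while educational, still provides a relatively reasonable fit. Because its error is high on the training set as well as on the test set, further refinement should aim first at improving accuracy on the training data (for example, through better split selection) and then at narrowing the train/test gap.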
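
To put the train/test gap in numbers, the short sketch below takes the four RMSE values reported in the outputs above and computes, for each model, the absolute increase from train to test and the test/train ratio. The dictionary layout and the helper name `generalization_gap` are illustrative choices, not part of `auto-efficiency.py`.

```python
# RMSE values copied from the outputs reported in this notebook
results = {
    "Our Decision Tree": {"train": 5.7899, "test": 7.1421},
    "Scikit-Learn Decision Tree": {"train": 1.9938, "test": 3.2169},
}


def generalization_gap(scores):
    """Return the absolute and relative train-to-test RMSE increase."""
    absolute = scores["test"] - scores["train"]
    relative = scores["test"] / scores["train"]
    return absolute, relative


for name, scores in results.items():
    absolute, relative = generalization_gap(scores)
    print(f"{name}: gap = {absolute:.4f}, test/train ratio = {relative:.2f}")
```

Read this way, the absolute gaps of the two models are of similar size, and most of the difference between them comes from how well each tree fits the training data in the first place.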