## SUV Data Multiple Target Prediction

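Given *data about SUVs*, let's try to predict **a variety of target variables** in the data: `Gender`, `Purchased`, `Age`, and `AnnualSalary`, each from the remaining columns.

We will use decision tree models to make our predictions: a classifier for the categorical targets and a regressor for the numeric ones.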

Data source: https://www.kaggle.com/datasets/gabrielsantello/cars-purchase-decision-dataset

### Getting Started

In [18]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

In [2]:
data = pd.read_csv("car_data.csv")
data

Unnamed: 0,User ID,Gender,Age,AnnualSalary,Purchased
0,385,Male,35,20000,0
1,681,Male,40,43500,0
2,353,Male,49,74000,0
3,895,Male,40,107500,1
4,661,Male,25,79000,0
...,...,...,...,...,...
995,863,Male,38,59000,0
996,800,Female,47,23500,0
997,407,Female,28,138500,1
998,299,Female,48,134000,1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   User ID       1000 non-null   int64 
 1   Gender        1000 non-null   object
 2   Age           1000 non-null   int64 
 3   AnnualSalary  1000 non-null   int64 
 4   Purchased     1000 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 39.2+ KB


### Model Pipeline

In [16]:
def predict_on_raw_data(df, target, task):
    """Train a decision tree to predict `target` and return its test-set score.

    The score is accuracy for task='classification' and R^2 for task='regression'.
    """
    df = df.copy()
    # Drop the ID column so it is not used as a feature
    df = df.drop('User ID', axis=1)
    # Split df into X and y
    y = df[target]
    X = df.drop(target, axis=1)
    # Train-test split (70% train, 30% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    # Build the preprocessor: encode Gender as 0/1, pass the other columns through unchanged
    binary_encoder = Pipeline(steps=[
        ('function', FunctionTransformer(lambda column: column.replace({'Female':0, 'Male': 1})))
    ])
    preprocessor = ColumnTransformer(transformers=[
        ('binary', binary_encoder, ['Gender'])
    ], remainder='passthrough')
    estimator = DecisionTreeRegressor() if task == 'regression' else DecisionTreeClassifier()
    if target == 'Gender':
        # Gender is the target here, so X has no Gender column left to encode
        model = estimator
    else:
        model = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('mod', estimator)
        ])
    # Fit the model
    model.fit(X_train, y_train)
    # Score on the test set: accuracy for classifiers, R^2 for regressors
    result = model.score(X_test, y_test)
    return result
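
To see what the preprocessing step hands to the tree, the sketch below runs a similar `ColumnTransformer` on three rows copied from the table above. It is an illustration rather than the notebook's exact code: the helper `encode_gender` and the constant `GENDER_CODES` are hypothetical names, and the helper uses `Series.map` in place of the `replace` call used in the pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical mapping, mirroring the {'Female': 0, 'Male': 1} dict in the pipeline
GENDER_CODES = {"Female": 0, "Male": 1}


def encode_gender(frame):
    # With a list selector such as ['Gender'], ColumnTransformer passes a
    # one-column DataFrame, so the mapping is applied column by column.
    return frame.apply(lambda col: col.map(GENDER_CODES))


# Three rows taken from the displayed data, with User ID and Purchased left out
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [35, 47, 28],
    "AnnualSalary": [20000, 23500, 138500],
})

preprocessor = ColumnTransformer(
    transformers=[("binary", FunctionTransformer(encode_gender), ["Gender"])],
    remainder="passthrough",  # Age and AnnualSalary pass through unchanged
)

# The result is a purely numeric array: encoded Gender first, then the rest
encoded = preprocessor.fit_transform(toy)
print(encoded)
```

Because the transformer is told to look up a `Gender` column, it cannot be applied when `Gender` is the target and has already been dropped from `X`, which is why the function skips the preprocessor in that case.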

### Results

In [20]:
# Classification
gender_acc = predict_on_raw_data(data, target='Gender', task='classification')
purchased_acc = predict_on_raw_data(data, target='Purchased', task='classification')

# Regression
age_r2 = predict_on_raw_data(data, target='Age', task='regression')
salary_r2 = predict_on_raw_data(data, target='AnnualSalary', task='regression')


In [22]:
print("Target: Gender (Accuracy): {:.2f}%".format(gender_acc * 100))
print("Target: Purchased (Accuracy): {:.2f}%".format(purchased_acc * 100))
print("Target: Age (R^2): {:.4f}".format(age_r2))
print("Target: AnnualSalary (R^2): {:.4f}".format(salary_r2))

Target: Gender (Accuracy): 54.00%
Target: Purchased (Accuracy): 88.67%
Target: Age (R^2): 0.1982
Target: AnnualSalary (R^2): -0.0052
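
These scores are easier to read against trivial baselines. An R^2 of 0 is what a model gets by always predicting the mean of the test targets, so a slightly negative value such as the one for `AnnualSalary` means the tree did marginally worse than that constant guess. For a classifier, the natural comparison is always predicting the most frequent class. The sketch below computes both baselines in plain Python; the function names are hypothetical, the salaries are four rows from the table above, and the gender labels are made up, since the class balance of the test split is not shown.

```python
from collections import Counter


def r_squared(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def majority_baseline_accuracy(labels):
    # Accuracy of a "model" that always predicts the most common label
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Four salaries from the displayed rows, used purely as an illustration
salaries = [20000, 43500, 74000, 107500]
mean_salary = sum(salaries) / len(salaries)

# Always predicting the mean gives R^2 = 0
print(r_squared(salaries, [mean_salary] * len(salaries)))

# Predictions worse than the mean push R^2 below zero
print(r_squared(salaries, salaries[::-1]))

# Made-up labels: the baseline accuracy equals the majority class share
print(majority_baseline_accuracy(["Male", "Male", "Female", "Female", "Female"]))
```

If the two genders are roughly balanced in the test split (an assumption, not something the output above confirms), the majority baseline would sit near 50%, which would put the 54% accuracy for `Gender` close to guessing.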
