# KaggleX Challenge Notebook

This notebook holds the code for the KaggleX Challenge, whose goal is to predict the price of used cars from the given features. The dataset comes from the Kaggle competition [here](https://www.kaggle.com/competitions/kagglex-cohort4/data).

What's done in this notebook:
- Extracted horsepower and cylinder count from the `engine` feature
- Scaled numerical features and one-hot encoded categorical features
- Split the training data into training and validation sets
- Trained a Random Forest and a Gradient Boosting model
- Evaluated both models using MAE and MSE
- Predicted the prices of the test data using the Gradient Boosting model

In [1]:
import pandas as pd

RANDOM_SEED = 6

# Load the training and test data
train_data_path = '../data/train.csv'
test_data_path = '../data/test.csv'

train_df = pd.read_csv(train_data_path)
test_df = pd.read_csv(test_data_path)

# Display the first row of the training data
train_df.head(1)


Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
0,0,Ford,F-150 Lariat,2018,74349,Gasoline,375.0HP 3.5L V6 Cylinder Engine Gasoline Fuel,10-Speed A/T,Blue,Gray,None reported,Yes,11000


In [2]:
import re

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Extract horsepower and cylinder count from the free-text engine description
def extract_engine_features(df):
    def first_int(text, pattern):
        # Return the first captured integer, or 0 when the value is missing
        # or the pattern does not match (0 acts as a "not found" sentinel)
        match = re.search(pattern, text) if pd.notnull(text) else None
        return int(match.group(1)) if match else 0

    # Horsepower appears as e.g. '375.0HP'
    df['horsepower'] = df['engine'].apply(lambda x: first_int(x, r'(\d+)\.0HP'))
    # Cylinder count appears as e.g. 'V6 Cylinder'
    df['cylinders'] = df['engine'].apply(lambda x: first_int(x, r'(\d+) Cylinder'))
    return df

# Apply the feature extraction
train_df = extract_engine_features(train_df)

# Drop the original engine column and 'id' column
train_df = train_df.drop(columns=['engine', 'id'])

# Preprocess the categorical columns using one-hot encoding and scale the numerical columns
categorical_features = ['brand', 'model', 'fuel_type', 'transmission', 'ext_col', 'int_col', 'accident', 'clean_title']
numerical_features = ['model_year', 'milage', 'horsepower', 'cylinders']

# Define the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Separate the features from the target, then fit and apply the preprocessing steps
X_train = train_df.drop(columns=['price'])
y_train = train_df['price']

X_train_processed = preprocessor.fit_transform(X_train)

# Display the shape of the processed training data and the first few rows of the target variable
X_train_processed.shape, y_train.head()


((54273, 2324),
 0    11000
 1     8250
 2    15000
 3    63500
 4     7850
 Name: price, dtype: int64)
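
The engine parsing above can be checked in isolation on the sample engine string from the first training row. The sketch below is only an illustration: `parse_engine` is a hypothetical helper, not part of the notebook, and it differs from the cell above in two assumed ways. It accepts fractional horsepower values and it returns `None` instead of `0` when a pattern does not match, which keeps "not found" distinguishable from a real value. The second input string is made up.

```python
import re

# Engine description copied from the first training row shown earlier
ENGINE = "375.0HP 3.5L V6 Cylinder Engine Gasoline Fuel"


def parse_engine(text):
    """Hypothetical helper: return (horsepower, cylinders), with None for missing parts."""
    hp = re.search(r"(\d+(?:\.\d+)?)HP", text)   # e.g. '375.0HP'
    cyl = re.search(r"(\d+) Cylinder", text)      # e.g. 'V6 Cylinder'
    return (
        float(hp.group(1)) if hp else None,
        int(cyl.group(1)) if cyl else None,
    )


parsed = parse_engine(ENGINE)               # (375.0, 6)
no_match = parse_engine("Electric Motor")   # made-up input: (None, None)
print(parsed, no_match)
```

If `None` were used in the notebook itself, the numerical columns would need an imputation step before scaling, since `StandardScaler` would otherwise pass the missing values through as NaN.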

In [3]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Split the training data into training and validation sets
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train_processed, y_train, test_size=0.2, random_state=RANDOM_SEED)
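
Note that `preprocessor` was fitted on the full training set before this split, so the scaler statistics and the one-hot categories also reflect the validation rows. One possible alternative, sketched here under assumptions, is to split the raw frame first and wrap the preprocessing and the model in a scikit-learn `Pipeline`, so that everything is fitted on the training split only. The `toy` frame and its values are made up and stand in for `train_df`; the smaller `n_estimators` is only there to keep the sketch fast.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame standing in for train_df (values are illustrative only)
toy = pd.DataFrame({
    "brand": ["Ford", "BMW", "Ford", "Audi", "BMW", "Ford", "Audi", "BMW"],
    "milage": [74349, 80000, 12000, 45000, 30000, 99000, 15000, 60000],
    "price": [11000, 8250, 15000, 63500, 7850, 9000, 40000, 12000],
})
X, y = toy.drop(columns=["price"]), toy["price"]

# Split the raw data first, so the validation rows never influence the fit
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=6)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["milage"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ])),
    ("model", GradientBoostingRegressor(n_estimators=10, random_state=6)),
])

# fit() learns the scaling and encoding from X_tr only; predict() reuses them
pipe.fit(X_tr, y_tr)
val_pred = pipe.predict(X_va)
print(len(val_pred))
```

Because `handle_unknown='ignore'` is set, a brand that appears only in the validation rows is encoded as all zeros rather than raising an error.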

In [None]:
# Train a Random Forest model
model_rf = RandomForestRegressor(n_estimators=100, random_state=RANDOM_SEED)
model_rf.fit(X_train_split, y_train_split)

# Train a Gradient Boosting model
model_gb = GradientBoostingRegressor(n_estimators=100, random_state=RANDOM_SEED)
model_gb.fit(X_train_split, y_train_split)

In [None]:
# Calculate metrics for choosing the best model
y_pred_rf = model_rf.predict(X_val_split)
y_pred_gb = model_gb.predict(X_val_split)
mse_rf = mean_squared_error(y_val_split, y_pred_rf)
mae_rf = mean_absolute_error(y_val_split, y_pred_rf)
mse_gb = mean_squared_error(y_val_split, y_pred_gb)
mae_gb = mean_absolute_error(y_val_split, y_pred_gb)
print(f'Random Forest - Validation MSE: {mse_rf:.2f}, MAE: {mae_rf:.2f}')
print(f'Gradient Boosting - Validation MSE: {mse_gb:.2f}, MAE: {mae_gb:.2f}')
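
The two metrics react differently to large misses, which matters when comparing models on car prices. The sketch below computes both by hand on made-up numbers (not taken from the dataset) where a single prediction is far off: MAE grows linearly with that error, while MSE squares it. The function names are hypothetical and simply mirror what `mean_absolute_error` and `mean_squared_error` compute.

```python
def mean_absolute_error_manual(y_true, y_pred):
    # Average of the absolute differences
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_manual(y_true, y_pred):
    # Average of the squared differences
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices; the last prediction misses by 20000
y_true = [10000, 20000, 30000, 40000]
y_pred = [11000, 19000, 31000, 60000]

mae = mean_absolute_error_manual(y_true, y_pred)   # 5750.0
mse = mean_squared_error_manual(y_true, y_pred)    # 100750000.0
rmse = mse ** 0.5                                  # same units as the price

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

Taking the square root of the MSE gives a value in the same units as the price, which is easier to read next to the MAE.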

In [None]:
# Retrain the Gradient Boosting model on the entire training data for the final predictions
model_gb = GradientBoostingRegressor(n_estimators=100, random_state=RANDOM_SEED)
model_gb.fit(X_train_processed, y_train)

In [5]:
# Save the model to disk, or load a previously saved model
import pickle

model_path = '../models/KaggleX3.pkl'

# Save
# with open(model_path, 'wb') as file:
#     pickle.dump(model_gb, file)

# Load
with open(model_path, 'rb') as file:
    model_gb = pickle.load(file)
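
A model loaded from disk expects its input columns in the same layout that the preprocessor produced when the model was trained. One option, sketched here as an assumption rather than what this notebook does, is to pickle the fitted preprocessor and the model together in one bundle. The dictionaries below are stand-ins for the real fitted objects, and an in-memory buffer replaces the file on disk.

```python
import io
import pickle

# Stand-ins for the fitted objects; in the notebook these would be
# `preprocessor` and `model_gb`
fitted_preprocessor = {"numerical": ["model_year", "milage", "horsepower", "cylinders"]}
fitted_model = {"name": "GradientBoostingRegressor", "n_estimators": 100}

# Keep both pieces in one object so they are always saved and loaded together
bundle = {"preprocessor": fitted_preprocessor, "model": fitted_model}

# An in-memory buffer is used so the sketch leaves no files behind
buffer = io.BytesIO()
pickle.dump(bundle, buffer)

buffer.seek(0)
restored = pickle.load(buffer)
print(restored["model"]["name"])
```

Unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.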

In [6]:
# Apply the same feature extraction and preprocessing to the test data, then predict
test_df = extract_engine_features(test_df)
test_df = test_df.drop(columns=['engine', 'id'])
X_test = preprocessor.transform(test_df)
y_test_pred = model_gb.predict(X_test)

In [8]:
# Generate the submission file
sub_template = pd.read_csv('../data/sample_submission.csv')
submission = sub_template.copy()
submission['price'] = y_test_pred
submission.to_csv('../submissions/submission3.csv', index=False)