### Author: Allan R. Jeeboo 
### Preferred Name: Vyncent S. A. van der Wolvenhuizen 
### Affiliation: Data Science student at TripleTen 
### Email: vanderwolvenhuizen.vyncent@proton.me
### Date Started: 2025-07-02 
### Last Updated: 2025-07-02 09:16 
--- 
--- 

# 1.0 Introduction 

This project works through the following scenario: 
Rusty Bargain, a used car sales service, is developing an app to attract new customers. The app lets users quickly find out the market value of their car. We have access to historical data (technical specifications, trim versions, and prices) and need to build a model that predicts that value. 

Rusty Bargain is interested in:

- The quality of the prediction
- The speed of the prediction
- The time required for training 
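
The three criteria above can each be measured with a small helper. The sketch below is illustrative only: `rmse`, `evaluate`, and the `MeanBaseline` stand-in model are hypothetical names that do not appear elsewhere in this notebook, and RMSE is assumed as the quality metric.

```python
import math
import time


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


class MeanBaseline:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def evaluate(model, X_train, y_train, X_test, y_test):
    """Return prediction quality, training time, and prediction time."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    return {
        "rmse": rmse(y_test, predictions),
        "fit_seconds": fit_seconds,
        "predict_seconds": predict_seconds,
    }


# Toy data, not taken from the car dataset.
result = evaluate(
    MeanBaseline(),
    X_train=[[1], [2], [3]], y_train=[100.0, 200.0, 300.0],
    X_test=[[4], [5]], y_test=[150.0, 250.0],
)
print(result)
```

Any object with `fit` and `predict` methods could be passed to `evaluate` in place of the stand-in, which keeps the comparison between models on equal footing.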

#### Project instructions:

- Download and look at the data.
- Train different models with various hyperparameters. (You should make at least two different models, but more is better. Remember, various implementations of gradient boosting don't count as different models.) The main point of this step is to compare gradient boosting methods with random forest, decision tree, and linear regression.
- Analyze the speed and quality of the models.

#### Notes:

- Use the RMSE metric to evaluate the models.
- Linear regression is not very good for hyperparameter tuning, but it is perfect for doing a sanity check of other methods. If gradient boosting performs worse than linear regression, something definitely went wrong.
- On your own, work with the LightGBM library and use its tools to build gradient boosting models.
- Ideally, your project should include linear regression for a sanity check, a tree-based algorithm with hyperparameter tuning (preferably random forest), LightGBM with hyperparameter tuning (try a couple of sets), and CatBoost and XGBoost with hyperparameter tuning (optional).
- Take note of the encoding of categorical features for simple algorithms. LightGBM and CatBoost have their own implementations, but XGBoost requires OHE. 
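
To make the encoding note concrete, here is a minimal sketch on toy data contrasting one-hot encoding with the pandas `category` dtype, which LightGBM can consume directly. The toy frame and the names `toy`, `ohe`, and `native` are assumptions made for illustration, not part of the project data.

```python
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "gearbox": ["manual", "auto", "manual"],
    "fuel_type": ["petrol", "gasoline", "petrol"],
    "power": [75, 163, 69],
})
categorical = ["gearbox", "fuel_type"]

# One-hot encoding: each category becomes its own indicator column.
# drop_first=True removes one level per feature to avoid redundant columns.
ohe = pd.get_dummies(toy, columns=categorical, drop_first=True)

# Native handling: keep a single column per feature, typed as `category`.
native = toy.copy()
for col in categorical:
    native[col] = native[col].astype("category")

print(ohe.columns.tolist())
print(native.dtypes)
```

One-hot encoding widens the frame by roughly one column per category level, which matters for high-cardinality features, while the `category` dtype keeps the frame's width unchanged.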

--- 
--- 

### 1.1 Data Import & Overview 
Let's begin by importing the necessary modules and loading the dataset. We'll then examine the first few rows and review the dataset's structure to gain an initial understanding of the data we'll be working with. 

In [1]:
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 

df = pd.read_csv('car_data.csv') 

In [2]:
display(df.head()) 
print(df.shape)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


(354369, 16)


In [3]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

---

### 1.2 Data Description 

#### Features:
- DateCrawled — date profile was downloaded from the database
- VehicleType — vehicle body type
- RegistrationYear — vehicle registration year
- Gearbox — gearbox type
- Power — power (hp)
- Model — vehicle model
- Mileage — mileage (measured in km due to dataset's regional specifics)
- RegistrationMonth — vehicle registration month
- FuelType — fuel type
- Brand — vehicle brand
- NotRepaired — vehicle repaired or not
- DateCreated — date of profile creation
- NumberOfPictures — number of vehicle pictures
- PostalCode — postal code of profile owner (user)
- LastSeen — date of the last activity of the user

#### Target:
- Price — price (Euro)

---

### 1.3 Chapter Summary 

In this chapter, we've imported the data and necessary modules, glanced at the data and its info, and established what each column represents. A few problems are already visible. First, the column names are written in CamelCase rather than the snake_case that Python's [PEP 8](https://peps.python.org/pep-0008/) style guide recommends for variable names. Second, five columns contain NaNs: VehicleType, Gearbox, Model, FuelType, and NotRepaired. Third, the three date columns (DateCrawled, DateCreated, and LastSeen) describe the listing rather than the car, so they appear to serve no purpose in our analysis. The same goes for NumberOfPictures and PostalCode.

# 2.0 Data Preprocessing 
---

### 2.1 Column Adjustments

In [4]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [5]:
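# Drop the columns that describe the listing rather than the car.
# `columns=` already implies the column axis, so `axis=1` is not needed.
df.drop(columns=['DateCrawled', 
                 'DateCreated', 
                 'NumberOfPictures', 
                 'PostalCode', 
                 'LastSeen'], 
        inplace=True)

# Lowercase every name, then add underscores to the multi-word names.
df.columns = df.columns.str.lower() 
df.rename(columns={'vehicletype': 'vehicle_type', 
                   'registrationyear': 'registration_year', 
                   'registrationmonth': 'registraition_month', 
                   'fueltype': 'fuel_type', 
                   'notrepaired': 'not_repaired'}, 
          inplace=True) 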

df.head() 

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,mileage,registraition_month,fuel_type,brand,not_repaired
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
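
The rename mapping above is written by hand. As an alternative sketch, the same kind of conversion could be derived mechanically from the original CamelCase names with a regular expression. The helper name `to_snake_case` is hypothetical and is not used elsewhere in this notebook.

```python
import re


def to_snake_case(name):
    """Insert an underscore before every capital letter except the first,
    then lowercase the result, e.g. 'VehicleType' -> 'vehicle_type'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Original column names as listed in the data description.
original_columns = ["Price", "VehicleType", "RegistrationYear", "Gearbox",
                    "Power", "Model", "Mileage", "RegistrationMonth",
                    "FuelType", "Brand", "NotRepaired"]

snake_columns = [to_snake_case(name) for name in original_columns]
print(snake_columns)
```

With a DataFrame, such a helper could be applied via `df.rename(columns=to_snake_case)`, since `rename` accepts a function as well as a dictionary. Generating the names this way also avoids hand-typed spelling slips.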


---

### 2.2 NaNs

In [6]:
df.isna().sum() 

price                      0
vehicle_type           37490
registration_year          0
gearbox                19833
power                      0
model                  19705
mileage                    0
registraition_month        0
fuel_type              32895
brand                      0
not_repaired           71154
dtype: int64

There are quite a few NaNs, so how shall we address them? The affected columns (vehicle_type, gearbox, model, fuel_type, and not_repaired) are all categorical, so the mean and median are undefined for them. The mode, however, is available: it is simply the most frequent value in a column, so we'll fill each column's NaNs with that column's mode.

In [7]:
nan_columns = df.columns[df.isna().any()].tolist()

# Fill each column's NaNs with that column's most frequent value.
# Assigning the result back avoids the chained `inplace=True` call,
# which pandas warns will stop working in pandas 3.0.
for column in nan_columns: 
    mode_value = df[column].mode()[0]
    df[column] = df[column].fillna(mode_value)


In [8]:
df.isna().sum()

price                  0
vehicle_type           0
registration_year      0
gearbox                0
power                  0
model                  0
mileage                0
registraition_month    0
fuel_type              0
brand                  0
not_repaired           0
dtype: int64
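
A global mode ignores the other columns, so a missing `model` can be filled with a value that belongs to a different brand. As a hedged alternative, not the approach used in this notebook, the sketch below fills each gap with the mode within the same brand. The toy data and the helper `group_mode` are assumptions for illustration.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "audi",
              "volkswagen", "volkswagen", "volkswagen", "volkswagen"],
    "model": ["a4", "a4", None, "golf", "golf", "golf", None],
})

# Global mode: the single most frequent model across all brands.
global_mode = toy["model"].mode()[0]
filled_global = toy["model"].fillna(global_mode)


def group_mode(series):
    """Most frequent non-null value in a group, or None if the group is empty."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else None


# Per-brand mode: the most frequent model within each brand,
# broadcast back to the original row positions.
per_brand_mode = toy.groupby("brand")["model"].transform(group_mode)
filled_grouped = toy["model"].fillna(per_brand_mode)

print(pd.DataFrame({"brand": toy["brand"],
                    "global": filled_global,
                    "per_brand": filled_grouped}))
```

In this toy example the global mode assigns a Volkswagen model to the Audi row, while the per-brand mode keeps the fill consistent with the brand. Another option worth considering is an explicit placeholder category such as `'unknown'`, which preserves the information that the value was missing.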

---

### 2.3 Duplicates

In [9]:
df.duplicated().sum()

np.int64(33266)

In [12]:
df.drop_duplicates(inplace=True) 
display(df.head()) 
print(f'Number of duplicates: {df.duplicated().sum()}')
print(f'Shape: {df.shape}')

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,mileage,registraition_month,fuel_type,brand,not_repaired
0,480,sedan,1993,manual,0,golf,150000,0,petrol,volkswagen,no
1,18300,coupe,2011,manual,190,golf,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,no
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no


Number of duplicates: 0
Shape: (321103, 11)
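
Note that `duplicated()` compares only the columns still present in the frame. Because the listing-specific columns were dropped earlier, two distinct listings with identical car attributes may now count as duplicates. The sketch below illustrates the effect on toy data. The frame `listings` and its values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy listings; the postal code is the only field that tells the
# first two rows apart.
listings = pd.DataFrame({
    "postal_code": [70435, 66954, 90480],
    "price": [1500, 1500, 9800],
    "brand": ["volkswagen", "volkswagen", "jeep"],
    "model": ["golf", "golf", "grand"],
})

# With every column present, no row repeats an earlier one.
dupes_before = int(listings.duplicated().sum())

# After dropping the distinguishing column, the second row
# becomes a duplicate of the first.
reduced = listings.drop(columns=["postal_code"])
dupes_after = int(reduced.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = reduced.drop_duplicates()

print(dupes_before, dupes_after, deduplicated.shape)
```

Whether such rows should be removed depends on the goal: for training a price model, dropping them reduces repeated examples, at the cost of discarding what may be genuinely separate listings.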
