# Assignment

Define and train a machine learning model for predicting the price of a laptop (buynow_price column in the dataset) based on its attributes. When testing and comparing your models, aim to minimize the RMSE measure.
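
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted prices, so it is expressed in the same currency units as buynow_price. Here is a minimal sketch in plain Python. The `rmse` helper and the sample prices are made up for illustration and are not taken from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices, for illustration only
actual = [3000.0, 4000.0, 2000.0]
predicted = [3300.0, 3600.0, 2000.0]
error = rmse(actual, predicted)
print(round(error, 2))
```

Because the errors are squared before averaging, one large miss weighs more than several small ones.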

# Load Data

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from sklearn.ensemble import VotingRegressor

In [2]:
train = pd.read_json("./train_dataset.json", orient="columns")
val = pd.read_json("./val_dataset.json", orient="columns")
test = pd.read_json("./test_dataset.json", orient="columns")

In [3]:
df = pd.concat([train, val, test], axis=0)
df.head()

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,new,1250.0,producer warranty,"17"" - 17.9""",4999.0
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,new,256.0,seller warranty,"15"" - 15.9""",2649.0
10303,,"[bluetooth, nfc (near field communication)]",1920 x 1080,2,8 gb,[windows 10 home],hdd,,[SD card reader],ddr4,1.6,intel core i7,new,1000.0,producer warranty,"15"" - 15.9""",3399.0
10423,,,,2,,,,,,,,,new,,producer warranty,,1599.0
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,new,256.0,producer warranty,"12"" - 12.9""",4499.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7853 entries, 7233 to 1371
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   graphic card type       7357 non-null   object 
 1   communications          7071 non-null   object 
 2   resolution (px)         7245 non-null   object 
 3   CPU cores               7853 non-null   object 
 4   RAM size                7403 non-null   object 
 5   operating system        7203 non-null   object 
 6   drive type              7418 non-null   object 
 7   input devices           7175 non-null   object 
 8   multimedia              7145 non-null   object 
 9   RAM type                6989 non-null   object 
 10  CPU clock speed (GHz)   6917 non-null   float64
 11  CPU model               7320 non-null   object 
 12  state                   7853 non-null   object 
 13  drive memory size (GB)  7372 non-null   float64
 14  warranty                7853 non-null   ob

# Data Transformation

In [5]:
list_cols = ('communications', 'operating system', 'input devices', 'multimedia')

In [6]:
for c in list_cols:
    # Lists are joined into one comma-separated string; missing entries (NaN) become an empty string.
    # Building the whole column in one assignment avoids chained indexing and the SettingWithCopyWarning it triggers.
    df[c] = [','.join(map(str, l)) if isinstance(l, list) else '' for l in df[c]]

In [7]:
cols = ('graphic card type', 'communications', 'resolution (px)', 'CPU cores', 'RAM size', 'operating system', 'drive type', 'input devices', 'multimedia', 'RAM type', 'CPU model', 'state', 'warranty', 'screen size')

In [8]:
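for c in cols:
    # Encode each categorical column as integer codes. The encoder is fit on the
    # combined train/val/test frame so that all three splits share one mapping.
    le = LabelEncoder()
    df[c] = le.fit_transform(list(df[c].values))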
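
To make the encoding step concrete, here is a rough sketch of the idea behind label encoding: each distinct value gets an integer code, assigned in sorted order of the values. The `label_encode` function is a hypothetical stand-in written for this illustration, not scikit-learn's implementation, and the sample values are made up.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes

# Hypothetical sample of a categorical column
drive_types = ["ssd", "hdd", "ssd + hdd", "ssd"]
codes, classes = label_encode(drive_types)
print(classes)
print(codes)
```

The codes carry an arbitrary ordering ("hdd" < "ssd" here). Tree-based models such as the ones used below can usually split around that, but it would be misleading for linear models.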

In [9]:
# Both 'CPU clock speed (GHz)' and 'drive memory size (GB)' have missing values; fill each with that column's most frequent value (its mode).
df['CPU clock speed (GHz)'] = df['CPU clock speed (GHz)'].fillna(df['CPU clock speed (GHz)'].mode()[0])
df['drive memory size (GB)'] = df['drive memory size (GB)'].fillna(df['drive memory size (GB)'].mode()[0])
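
As an illustration of the mode-based fill above, here is a small sketch in plain Python. The `fill_with_mode` helper is hypothetical and the clock speeds are made up. On ties it takes the smallest candidate, which is an assumption chosen to mirror `Series.mode()[0]`, since pandas returns the modes in sorted order.

```python
from statistics import multimode

def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    # On ties, take the smallest of the most frequent values
    fill = min(multimode(observed))
    return [fill if v is None else v for v in values]

# Hypothetical sample with two missing entries
clock_speeds = [2.6, 2.4, None, 1.6, 2.4, None]
filled = fill_with_mode(clock_speeds)
print(filled)
```

Note that the notebook computes the mode over the combined frame, so the validation and test rows also influence the fill value. Computing it from the training rows alone would be the stricter choice.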

In [10]:
df.isnull().sum()

graphic card type         0
communications            0
resolution (px)           0
CPU cores                 0
RAM size                  0
operating system          0
drive type                0
input devices             0
multimedia                0
RAM type                  0
CPU clock speed (GHz)     0
CPU model                 0
state                     0
drive memory size (GB)    0
warranty                  0
screen size               0
buynow_price              0
dtype: int64

# Split Data

In [11]:
ntrain = train.shape[0]
nval = val.shape[0]
ntest = test.shape[0]

In [12]:
# Recover the original partitions by position; pd.concat preserved the row order.
train = df.iloc[:ntrain]
val = df.iloc[ntrain:ntrain + nval]
test = df.iloc[ntrain + nval:]

In [13]:
X_train = train.drop('buynow_price', axis=1)
y_train = train['buynow_price']

In [14]:
X_val = val.drop('buynow_price', axis=1)
y_val = val['buynow_price']

In [15]:
X_test = test.drop('buynow_price', axis=1)
y_test = test['buynow_price']

# Build Model

In [16]:
def evaluation_metric(y, y_pred):
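    # Take the square root of the MSE explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases.
    rmse = np.sqrt(mean_squared_error(y, y_pred))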
    return rmse

## XGBoost

In [17]:
xgb_reg = XGBRegressor(eval_metric='rmse')

In [18]:
xgb_reg.fit(X_train, y_train)

In [19]:
evaluation_metric(y_val, xgb_reg.predict(X_val))

658.5794575976704

## LGBM

In [20]:
lgbm_reg = LGBMRegressor(objective='mean_squared_error', force_col_wise=True)

In [21]:
lgbm_reg.fit(X_train, y_train)

[LightGBM] [Info] Total Bins 248
[LightGBM] [Info] Number of data points in the train set: 4711, number of used features: 15
[LightGBM] [Info] Start training from score 3495.831195


In [22]:
evaluation_metric(y_val, lgbm_reg.predict(X_val))

708.315854369088

# Prediction

In [23]:
ensemble = VotingRegressor(
    [
        ("xgb", xgb_reg),
        ("lgbm", lgbm_reg)
    ]
)


In [24]:
ensemble.fit(X_train, y_train)

[LightGBM] [Info] Total Bins 248
[LightGBM] [Info] Number of data points in the train set: 4711, number of used features: 15
[LightGBM] [Info] Start training from score 3495.831195


In [25]:
evaluation_metric(y_test, ensemble.predict(X_test))

711.3544756089694
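
A VotingRegressor without weights predicts the plain average of its base estimators' predictions. The sketch below shows that averaging step in plain Python. The `rmse` and `average_predictions` helpers are hypothetical and all numbers are made up, so this does not reproduce the scores above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def average_predictions(*prediction_lists):
    """Element-wise mean across several models' predictions."""
    return [sum(row) / len(row) for row in zip(*prediction_lists)]

# Hypothetical targets and per-model predictions
actual = [3000.0, 4000.0, 2000.0]
model_a = [3300.0, 3600.0, 2000.0]
model_b = [2800.0, 4200.0, 2300.0]

blended = average_predictions(model_a, model_b)
print(rmse(actual, model_a), rmse(actual, model_b), rmse(actual, blended))
```

The RMSE of the averaged prediction never exceeds the mean of the individual RMSEs, but it can still be worse than the best single model. In this notebook the individual models are scored on the validation split and the ensemble on the test split, so the three numbers are not directly comparable. Scoring the ensemble on X_val as well would give a like-for-like comparison.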