# Laptop Price Prediction: Model Creation<br>
### This is the second part of the project, in which the machine learning model is created.

## Part 2: Machine Learning:<br>

### Model Building and Selection: <br>
-  Experiment with various regression algorithms.
-  Select the model with the best performance on evaluation metrics.
### Model Evaluation: <br>
- Employ appropriate metrics to assess model performance.
- R-squared (R²): Measures the proportion of variance explained by the model.
- Mean Absolute Error (MAE): Average absolute difference between actual and predicted prices.
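
As a rough illustration of the two metrics listed above, the sketch below computes MAE and R² by hand with NumPy. The actual prices are the first four rows of the dataset shown later in this notebook; the predicted values are made up for the example and are not model output.

```python
import numpy as np

# Actual prices: first four rows of the dataset; predictions are hypothetical.
y_true = np.array([62990.0, 37500.0, 49990.0, 33990.0])
y_pred = np.array([60000.0, 40000.0, 48000.0, 36000.0])

# MAE: mean of the absolute differences between actual and predicted values.
mae = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE:", mae)  # 2372.5
print("R2 :", round(r2, 4))
```

`sklearn.metrics.mean_absolute_error` and `r2_score`, which the notebook uses, compute the same quantities.
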
### Model Selection: <br>
- Choose the model with the optimal balance of accuracy and interpretability.
- Consider factors like model complexity, prediction accuracy, and practical implications.
### Final Model: <br>
- Refit the selected model (in this notebook, on the training split).
- Use the final model for future price predictions.

In [1]:
# Importing all the necessary libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

In [3]:
from sklearn.metrics import r2_score,mean_absolute_error

In [4]:
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor,ExtraTreesRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline

In [5]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
df = pd.read_csv(r"D:\ML_projects\Laptop_price\dataset\laptop_clean_data.csv")  # read the cleaned dataset; the raw string keeps the Windows backslashes literal

In [7]:
df.head()

| | name | os | processor | generation | ram | ssd | hdd | display | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | lenovo | windows 11 | intel core i5 | 11 | 16 | 512 | 0 | 15.6 | 62990 |
| 1 | lenovo | windows 11 | intel core i3 | 11 | 8 | 256 | 1 | 15.6 | 37500 |
| 2 | asus | windows 11 | intel core i5 | 10 | 8 | 512 | 0 | 15.6 | 49990 |
| 3 | asus | windows 11 | intel core i3 | 10 | 8 | 512 | 0 | 15.6 | 33990 |
| 4 | lenovo | other | amd other | 0 | 4 | 256 | 0 | 14.0 | 18990 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 747 entries, 0 to 746
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        747 non-null    object 
 1   os          747 non-null    object 
 2   processor   747 non-null    object 
 3   generation  747 non-null    int64  
 4   ram         747 non-null    int64  
 5   ssd         747 non-null    int64  
 6   hdd         747 non-null    int64  
 7   display     747 non-null    float64
 8   price       747 non-null    int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 52.7+ KB


In [9]:
df.isnull().sum() # checking null values in the dataset

name          0
os            0
processor     0
generation    0
ram           0
ssd           0
hdd           0
display       0
price         0
dtype: int64

In [10]:
df.duplicated().sum() # checking duplicate values

0

#### The data is clean: it contains no null or duplicate values.

## Model Building and Selection <br> 
### Data Preparation:<br>
- #### With the data cleaned, we're ready to proceed with model building.


### Train-Test Split: <br> 
- #### The dataset is divided into training and test sets for model training and evaluation, respectively.

In [11]:
# Separate the features (x) from the target (y); the price is log-transformed
x = df.drop(columns=['price'])
y = np.log(df['price'])

In [12]:
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.15,random_state=2)

In [13]:
X_train # training features

| | name | os | processor | generation | ram | ssd | hdd | display |
|---|---|---|---|---|---|---|---|---|
| 666 | dell | windows 11 | intel core i3 | 12 | 8 | 512 | 0 | 15.6 |
| 240 | msi | windows 10 | intel core i7 | 11 | 16 | 1024 | 0 | 15.6 |
| 452 | acer | windows 10 | intel core i5 | 8 | 8 | 512 | 0 | 15.6 |
| 458 | asus | windows 11 | amd ryzen 9 | 10 | 16 | 512 | 0 | 16.6 |
| 575 | asus | windows 11 | intel core i3 | 11 | 8 | 256 | 1 | 15.6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 534 | asus | windows 10 | intel core i3 | 10 | 4 | 0 | 1 | 14.1 |
| 584 | hp | windows 11 | other | 0 | 16 | 1024 | 0 | 16.1 |
| 493 | lenovo | windows 11 | intel core i7 | 11 | 32 | 1024 | 0 | 16.0 |
| 527 | acer | windows 11 | other | 0 | 8 | 512 | 0 | 15.6 |


In [14]:
y_train # training target (log of price)

666    11.055767
240    11.878263
452    11.517913
458    11.502774
575    10.736179
         ...    
534    10.330388
584    11.820410
493    12.170394
527    11.001933
168    11.245294
Name: price, Length: 634, dtype: float64
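
The two preparation steps above (log-transforming the target and holding out 15% of the rows) can be sketched on stand-in data. The arrays below are placeholders, not the real dataset; only the row count (747) and the split settings are taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The log transform is reversible: np.exp undoes np.log.
prices = np.array([62990.0, 37500.0, 49990.0, 33990.0, 18990.0])
log_prices = np.log(prices)
round_trip_ok = np.allclose(np.exp(log_prices), prices)
print(round_trip_ok)  # True

# Placeholder arrays with the same number of rows as the dataset (747).
X_demo = np.arange(747).reshape(-1, 1)
y_demo = np.zeros(747)

# Same split settings as the notebook: 15% test, fixed random_state.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=2
)
print(len(X_tr), len(X_te))  # 634 113
```

The training size of 634 matches the `Length: 634` reported for `y_train`; the remaining 113 rows form the test set.
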

### Model Evaluation: <br>
- #### Various models are evaluated using R-squared score and Mean Absolute Error (MAE) to determine their performance.

### Linear regression

In [15]:
# Step 1: one-hot encode the categorical columns (name, os, processor, at positions 0, 1, 2)
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

# Define regression model
step2 = LinearRegression() 

# Creating a pipeline
pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

#Fitting the pipeline to the training data 
pipe.fit(X_train,y_train)

# predictions on the test data
y_pred = pipe.predict(X_test)

# Evaluate model performance
print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.770803396783333
MAE 0.2248709401269258


### Ridge Regression

In [16]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = Ridge(alpha=10)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.7516917882891939
MAE 0.22476906093886512


### Lasso Regression

In [17]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = Lasso(alpha=0.001)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.7646587370364972
MAE 0.22050761897362012


### KNN

In [19]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = KNeighborsRegressor(n_neighbors=3)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.7084232522665916
MAE 0.2264565738384703


### Decision Tree Regressor

In [20]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = DecisionTreeRegressor(max_depth=8)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.7651047901619558
MAE 0.2135678933873926


### SVM

In [22]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = SVR(kernel='rbf',C=10000,epsilon=0.1)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.29437082142296744
MAE 0.2537603621856415


### Random Forest Regressor

In [23]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = RandomForestRegressor(n_estimators=100,
                              random_state=3,
                              max_samples=0.5,
                              max_features=0.75,
                              max_depth=15)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.8392024467055624
MAE 0.18655242913859904


### ExtraTrees Regressor

In [24]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = ExtraTreesRegressor(n_estimators=100,
                              random_state=3,
                              max_samples=0.5,
                              max_features=0.75,
                              max_depth=15,
                              bootstrap=True)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.8364093507275443
MAE 0.18725812090368044


### AdaBoost Regressor

In [25]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = AdaBoostRegressor(n_estimators=15,learning_rate=1.0)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.7848534789562376
MAE 0.2084170647585874


### Gradient Boosting Regressor

In [26]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = GradientBoostingRegressor(n_estimators=500)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.8366061248566274
MAE 0.18253265810439087


### XGBoost Regressor

In [27]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')

step2 = XGBRegressor(n_estimators=45,max_depth=5,learning_rate=0.5)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.8227712848745721
MAE 0.19409950203041332


### Using Ensemble techniques 

### Voting Regressor

In [28]:
from sklearn.ensemble import VotingRegressor,StackingRegressor

step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')


rf = RandomForestRegressor(n_estimators=350,random_state=3,max_samples=0.5,max_features=0.75,max_depth=15,bootstrap=True)
gbdt = GradientBoostingRegressor(n_estimators=100,max_features=0.5)
xgb = XGBRegressor(n_estimators=25,learning_rate=0.3,max_depth=5)
et = ExtraTreesRegressor(n_estimators=100,random_state=3,max_samples=0.5,max_features=0.75,max_depth=10,bootstrap=True)

step2 = VotingRegressor([('rf', rf), ('gbdt', gbdt), ('xgb',xgb), ('et',et)],weights=[5,1,1,1])

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.8478469010117716
MAE 0.17881489551351407
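
A `VotingRegressor` with `weights` returns the weighted average of its base models' predictions. The sketch below reproduces that arithmetic for a single sample with the weights used above; the four log-price predictions are made-up numbers, not output from the fitted models.

```python
import numpy as np

# Hypothetical log-price predictions for one laptop from rf, gbdt, xgb, et.
preds = np.array([11.00, 11.20, 11.10, 10.90])
weights = np.array([5, 1, 1, 1])  # same weights as in the VotingRegressor

# Weighted average: (5*11.00 + 11.20 + 11.10 + 10.90) / 8
weighted = np.average(preds, weights=weights)
rf_share = weights[0] / weights.sum()

print(round(weighted, 3))  # 11.025
print(rf_share)            # 0.625
```

With these weights the random forest contributes 5/8 of the final prediction, so the ensemble stays close to the random forest's output.
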


### Stacking Regressor

In [29]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')


estimators = [
    ('rf', RandomForestRegressor(n_estimators=350,random_state=3,max_samples=0.5,max_features=0.75,max_depth=15)),
    ('gbdt',GradientBoostingRegressor(n_estimators=100,max_features=0.5)),
    ('xgb', XGBRegressor(n_estimators=25,learning_rate=0.3,max_depth=5))
]

step2 = StackingRegressor(estimators=estimators, final_estimator=Ridge(alpha=100))

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',r2_score(y_test,y_pred))
print('MAE',mean_absolute_error(y_test,y_pred))

R2 score 0.8358685420229923
MAE 0.1805257436195798


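### Model Selection: 
- The Voting Regressor, an ensemble meta-estimator, achieved the best performance, with an R-squared score of 0.848 and an MAE of 0.1788 (both measured on the log-price scale).
- The Random Forest Regressor achieved the second-best R-squared score (0.839), with an MAE of 0.1866. The Stacking and Gradient Boosting regressors had slightly lower MAEs (0.1805 and 0.1825) but lower R-squared scores.

### Final Model:
- We selected the Voting Regressor as the final model because of its ensemble approach and because it has both the best R-squared score and the lowest MAE among the models tested.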
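
Because the target is the log of the price, the MAE values reported in this notebook are in log units. One rough way to read such a value, sketched below, is to exponentiate it into a multiplicative factor. This is an informal interpretation, not a confidence interval.

```python
import math

# Voting Regressor MAE on the log-price scale, taken from the output above.
mae_log = 0.17881489551351407

# An absolute error of `mae_log` in log space corresponds to the prediction
# being off by a factor of exp(mae_log) in the original price scale.
factor = math.exp(mae_log)
percent = 100 * (factor - 1)

print(f"typical multiplicative error: x{factor:.3f} (about {percent:.1f}%)")
```

The factor comes out at roughly 1.2, which suggests that a typical prediction is within about 20% of the actual price.
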

In [30]:
# Voting Regressor (final model)
from sklearn.ensemble import VotingRegressor

# Step 1: one-hot encode the categorical columns (name, os, processor, at positions 0, 1, 2)
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first',handle_unknown='ignore'),[0,1,2])
],remainder='passthrough')


rf = RandomForestRegressor(n_estimators=350,random_state=3,max_samples=0.5,max_features=0.75,max_depth=15,bootstrap=True)
gbdt = GradientBoostingRegressor(n_estimators=100,max_features=0.5)
xgb = XGBRegressor(n_estimators=25,learning_rate=0.3,max_depth=5)
et = ExtraTreesRegressor(n_estimators=100,random_state=3,max_samples=0.5,max_features=0.75,max_depth=10,bootstrap=True)

# Define regression model
step2 = VotingRegressor([('rf', rf), ('gbdt', gbdt), ('xgb',xgb), ('et',et)],weights=[5,1,1,1])

# Creating a pipeline
pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

#Fitting the pipeline to the training data
pipe.fit(X_train,y_train)

In [32]:
# Input values for prediction
# Use lower case for the text inputs, because all values in the dataset are lower case

new_data = pd.DataFrame({
    'name': ['hp'],
    'os': ['windows 11'],
    'processor': ['intel core i5'],
    'generation':[11],
    'ram': [8],
    'ssd': [512],
    'hdd': [0],
    'display': [15.6]
})

# Make predictions on the new data
new_data_predictions_log_scale = pipe.predict(new_data)

# Convert new data predictions back to the original scale
new_data_predictions_original_scale = np.exp(new_data_predictions_log_scale).round()


# Display the predictions for new data
print("Predictions for new data:") 
print(f"{new_data_predictions_original_scale[0]} INR")

Predictions for new data:
58038.0 INR


### Storing Model and Data: 
- #### To facilitate future predictions, the trained Voting Regressor pipeline and the dataset are each saved to their own pickle (.pkl) file for later use.


In [33]:
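import pickle

# Context managers make sure each file is closed after writing
with open('df.pkl','wb') as f:
    pickle.dump(df,f) # saving the data to a pkl file
with open('pipe.pkl','wb') as f:
    pickle.dump(pipe,f) # saving the final model to a pkl file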

The model is working properly, and its predictions are close to the original prices of the products.