# Table of Contents

1. [Introduction](#Introduction)
2. [Data Loading](#Data-Loading)
3. [Data Cleaning](#Data-Cleaning)
4. [Feature Engineering](#Feature-Engineering)
5. [Model Training and Evaluation](#Model-Training-and-Evaluation)
6. [Conclusion](#Conclusion)

# Introduction

In this notebook, we aim to predict laptop prices using various machine learning algorithms. By analyzing different features such as CPU, GPU, memory, and screen resolution, we will build and evaluate models to accurately estimate the price of a laptop. This process involves data cleaning, feature engineering, and model training to achieve the best possible predictions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns",100)

from sklearn.linear_model import LinearRegression,SGDRegressor,Ridge,Lasso,ElasticNet
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor,AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error
from sklearn.preprocessing import StandardScaler

def algo_test(x, y):
    # Models and parameters
    models = {
        'Ridge': Ridge(),
        'Lasso': Lasso(),
        'ElasticNet': ElasticNet(),
        'Gradient Boosting': GradientBoostingRegressor(),
        'KNeighborsRegressor': KNeighborsRegressor(),
        'Decision Tree': DecisionTreeRegressor(),
        'XGBRegressor': XGBRegressor(),
        'SVR': SVR(),
        'MLP Regressor': MLPRegressor(),
    }

    params = {
        'Ridge': {'alpha': [0.1, 1, 10]},
        'Lasso': {'alpha': [0.1, 1, 10]},
        'ElasticNet': {'alpha': [0.1, 1, 10], 'l1_ratio': [0.1, 0.5, 0.9]},
        'Gradient Boosting': {'n_estimators': [100, 200, 300], 'learning_rate': [0.01, 0.1, 0.2]},
        'KNeighborsRegressor': {'n_neighbors': [3, 5, 7]},
        'Decision Tree': {'max_depth': [None, 10, 20]},
        'XGBRegressor': {'n_estimators': [100, 200, 300], 'learning_rate': [0.01, 0.1, 0.2]},
        'SVR': {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
        'MLP Regressor': {'hidden_layer_sizes': [(50,), (100,), (50, 50)], 'alpha': [0.0001, 0.001, 0.01]}
    }

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

    best_results = []

    for name, model in models.items():
        random_search = RandomizedSearchCV(model, params[name], scoring='r2', cv=5, n_iter=10, random_state=42)
        random_search.fit(x_train, y_train)
        best_model = random_search.best_estimator_
        predictions = best_model.predict(x_test)

        best_results.append({
            'Model': name,
            'Best Params': random_search.best_params_,
            'R_Squared': r2_score(y_test, predictions),
            'RMSE': mean_squared_error(y_test, predictions) ** 0.5,
            'MAE': mean_absolute_error(y_test, predictions)
        })

    result_df = pd.DataFrame(best_results).sort_values('R_Squared', ascending=False).reset_index(drop=True)

    return result_df
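
`algo_test` fits every estimator on the raw feature values. Scale-sensitive models such as SVR, KNN and MLP are often wrapped in a pipeline that standardizes the features first. The sketch below is not part of the original function: it assumes synthetic data from `make_regression`, and the name `scaled_svr` is hypothetical. Inside a pipeline, search parameters need the step-name prefix (`svr__` here).

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the laptop features (assumption, not the real data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# The scaler is refitted inside each CV fold, so validation folds stay unseen
scaled_svr = make_pipeline(StandardScaler(), SVR())

# Pipeline parameters are addressed as "<step name>__<parameter>"
param_grid = {"svr__C": [0.1, 1, 10], "svr__gamma": ["scale", "auto"]}

search = RandomizedSearchCV(scaled_svr, param_grid, scoring="r2", cv=5,
                            n_iter=6, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))
```

The same wrapping could be applied to any entry of the `models` dictionary, as long as the matching keys in `params` receive the step prefix.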

# Data Loading

In [2]:
df = pd.read_csv("laptop_data.csv")
df.describe().T

,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,1303.0,651.0,376.28801,0.0,325.5,651.0,976.5,1302.0
Inches,1303.0,15.017191,1.426304,10.1,14.0,15.6,15.6,18.4
Price,1303.0,59870.04291,37243.201786,9270.72,31914.72,52054.56,79274.2464,324954.72


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1303 non-null   int64  
 1   Company           1303 non-null   object 
 2   TypeName          1303 non-null   object 
 3   Inches            1303 non-null   float64
 4   ScreenResolution  1303 non-null   object 
 5   Cpu               1303 non-null   object 
 6   Ram               1303 non-null   object 
 7   Memory            1303 non-null   object 
 8   Gpu               1303 non-null   object 
 9   OpSys             1303 non-null   object 
 10  Weight            1303 non-null   object 
 11  Price             1303 non-null   float64
dtypes: float64(2), int64(1), object(9)
memory usage: 122.3+ KB


# Data Cleaning

In [4]:
df.drop(["Unnamed: 0"],axis=1,inplace=True)

In [5]:
df.head()

,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808


In [6]:
df.Cpu.unique()

array(['Intel Core i5 2.3GHz', 'Intel Core i5 1.8GHz',
       'Intel Core i5 7200U 2.5GHz', 'Intel Core i7 2.7GHz',
       'Intel Core i5 3.1GHz', 'AMD A9-Series 9420 3GHz',
       'Intel Core i7 2.2GHz', 'Intel Core i7 8550U 1.8GHz',
       'Intel Core i5 8250U 1.6GHz', 'Intel Core i3 6006U 2GHz',
       'Intel Core i7 2.8GHz', 'Intel Core M m3 1.2GHz',
       'Intel Core i7 7500U 2.7GHz', 'Intel Core i7 2.9GHz',
       'Intel Core i3 7100U 2.4GHz', 'Intel Atom x5-Z8350 1.44GHz',
       'Intel Core i5 7300HQ 2.5GHz', 'AMD E-Series E2-9000e 1.5GHz',
       'Intel Core i5 1.6GHz', 'Intel Core i7 8650U 1.9GHz',
       'Intel Atom x5-Z8300 1.44GHz', 'AMD E-Series E2-6110 1.5GHz',
       'AMD A6-Series 9220 2.5GHz',
       'Intel Celeron Dual Core N3350 1.1GHz',
       'Intel Core i3 7130U 2.7GHz', 'Intel Core i7 7700HQ 2.8GHz',
       'Intel Core i5 2.0GHz', 'AMD Ryzen 1700 3GHz',
       'Intel Pentium Quad Core N4200 1.1GHz',
       'Intel Atom x5-Z8550 1.44GHz',
       'Intel Celeron Du

# Feature Engineering

In [7]:
import re
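def extract_ghz(processor_name):
    # NOTE: the pattern requires a decimal point, so integer clock speeds
    # such as "3GHz" or "2GHz" do not match and yield None (NaN in the column).
    match = re.search(r'(\d+\.\d+)GHz', processor_name)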
    if match:
        return float(match.group(1))
    return None  

df['GHz'] = df['Cpu'].apply(extract_ghz)
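
The pattern in `extract_ghz` only matches clock speeds written with a decimal point. A minimal sketch of a variant that also accepts integer values follows; the name `extract_ghz_any` is hypothetical and the function is not part of the original notebook. Using it would likely leave far fewer missing `GHz` values to fill later.

```python
import re

def extract_ghz_any(processor_name):
    """Return the clock speed in GHz as a float, or None if absent."""
    # The fractional part is optional, so "3GHz" and "2.5GHz" both match
    match = re.search(r'(\d+(?:\.\d+)?)GHz', processor_name)
    return float(match.group(1)) if match else None

print(extract_ghz_any('Intel Core i5 7200U 2.5GHz'))  # 2.5
print(extract_ghz_any('AMD A9-Series 9420 3GHz'))     # 3.0
print(extract_ghz_any('Intel Core i5'))               # None
```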


In [8]:
# CPU brand name extraction
df["Cpu Brand"] = df.Cpu.apply(lambda x: x.split()[0])


In [9]:
def get_gpu_brand(gpu_name):
    if 'Nvidia' in gpu_name:
        return 'Nvidia'
    elif 'AMD' in gpu_name:
        return 'AMD'
    elif 'Intel' in gpu_name:
        return 'Intel'
    else:
        return 'Other'

df['GPU_Brand'] = df['Gpu'].apply(get_gpu_brand)




In [10]:
df.Weight.value_counts() # convert to numeric
df.Weight = df.Weight.str.replace("kg","").astype(float)

In [11]:
df.OpSys.value_counts() 
# Windows, Linux, Mac, Chrome, No OS
def get_windows(os):
    if 'Windows 7' in os:
        return 1
    elif 'Windows 10 S' in os:
        return 2
    elif 'Windows 10' in os:
        return 3
    else:
        return 0
    
def get_linux(os):
    if 'Linux' in os:
        return 1
    else:
        return 0
    
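def get_mac(os):
    # NOTE: 'Mac OS' is a substring of 'Mac OS X', so the elif branch below is
    # never reached. The value 'macOS' (as in the rows shown by df.head())
    # matches neither check and is therefore encoded as 0.
    if 'Mac OS' in os:
        return 2
    elif 'Mac OS X' in os:
        return 1
    else:
        return 0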
    
def get_chrome(os):
    if 'Chrome OS' in os:
        return 1
    else:
        return 0
    
def get_no_os(os):
    if 'No OS' in os:
        return 1
    else:
        return 0
    
df['Windows'] = df['OpSys'].apply(get_windows)
df['Linux'] = df['OpSys'].apply(get_linux)
df['Mac'] = df['OpSys'].apply(get_mac)
df['Chrome'] = df['OpSys'].apply(get_chrome)
df['No_OS'] = df['OpSys'].apply(get_no_os)
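
Because the checks above rely on exact, case-sensitive substrings, spellings such as `macOS` fall through to 0. One possible alternative, sketched under the assumption that a single coarse label per laptop is enough, normalizes the case first. The name `os_family` and the `'Other'` fallback are hypothetical, and the inputs are illustrative.

```python
def os_family(os_name):
    """Map a raw OpSys string to a coarse family label."""
    name = os_name.lower()
    if 'windows' in name:
        return 'Windows'
    # Covers both spellings: "macOS" and "Mac OS X"
    if 'macos' in name or 'mac os' in name:
        return 'Mac'
    if 'linux' in name:
        return 'Linux'
    if 'chrome' in name:
        return 'Chrome'
    if 'no os' in name:
        return 'No OS'
    return 'Other'

for raw in ['macOS', 'Mac OS X', 'Windows 10 S', 'No OS', 'Android']:
    print(raw, '->', os_family(raw))
```

The resulting label column could then be one-hot encoded like the other categorical columns, instead of maintaining five separate helper functions.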


In [12]:
# df.Memory.value_counts()  # Flash Storage, Hybrid, SSD and HDD columns will be created as int. If a storage type is absent, the value is 0; otherwise it is the capacity in GB (TB values are converted to GB).

df['Flash Storage'] = 0
df['Hybrid'] = 0
df['SSD'] = 0
df['HDD'] = 0

# TB to GB conversion
def convert_to_gb(value):
    value = value.replace('Flash Storage', '').replace('Hybrid', '').replace('SSD', '').replace('HDD', '').strip()
    if 'TB' in value:
        return int(float(value.replace('TB', '').strip()) * 1024)
    elif 'GB' in value:
        return int(value.replace('GB', '').strip())
    return 0

for index, row in df.iterrows():
    memories = row['Memory'].split('+')
    for memory in memories:
        memory = memory.strip()
        if 'Flash Storage' in memory:
            df.at[index, 'Flash Storage'] = convert_to_gb(memory)
        elif 'Hybrid' in memory:
            df.at[index, 'Hybrid'] = convert_to_gb(memory)
        elif 'SSD' in memory:
            df.at[index, 'SSD'] = convert_to_gb(memory)
        elif 'HDD' in memory:
            df.at[index, 'HDD'] = convert_to_gb(memory)
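
The row-by-row loop above can also be expressed as a pure function that parses one `Memory` string at a time. The following is a sketch, not the notebook's method: `parse_memory` and `STORAGE_TYPES` are hypothetical names, and unlike the loop it sums capacities when the same storage type appears twice in one string.

```python
import re

STORAGE_TYPES = ('Flash Storage', 'Hybrid', 'SSD', 'HDD')

def parse_memory(memory):
    """Return a dict of capacity in GB per storage type for one Memory string."""
    capacities = {kind: 0 for kind in STORAGE_TYPES}
    for part in memory.split('+'):
        match = re.search(r'(\d+(?:\.\d+)?)\s*(GB|TB)', part)
        if not match:
            continue
        # 1 TB is treated as 1024 GB, as in convert_to_gb
        size = float(match.group(1)) * (1024 if match.group(2) == 'TB' else 1)
        for kind in STORAGE_TYPES:
            if kind in part:
                # Sum rather than overwrite, in case one type appears twice
                capacities[kind] += int(size)
                break
    return capacities

print(parse_memory('128GB SSD + 1TB HDD'))
# {'Flash Storage': 0, 'Hybrid': 0, 'SSD': 128, 'HDD': 1024}
print(parse_memory('128GB Flash Storage'))
# {'Flash Storage': 128, 'Hybrid': 0, 'SSD': 0, 'HDD': 0}
```

Applied to the frame, something like `pd.DataFrame(df['Memory'].map(parse_memory).tolist(), index=df.index)` would produce the four columns in one step.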

In [13]:
df.Ram.value_counts() 
# remove "GB" and convert to int
df.Ram = df.Ram.str.replace("GB","").astype(int)

In [14]:
df.ScreenResolution.value_counts() 
# 4K Ultra HD, IPS Panel, Touchscreen, Full HD, Quad HD, Quad HD+ and Retina Display will be created as new binary (0/1) columns.
df['4K Ultra HD'] = df['ScreenResolution'].apply(lambda x: 1 if '4K Ultra HD' in x else 0)
df['IPS Panel'] = df['ScreenResolution'].apply(lambda x: 1 if 'IPS Panel' in x else 0)
df['Touchscreen'] = df['ScreenResolution'].apply(lambda x: 1 if 'Touchscreen' in x else 0)
df['Full HD'] = df['ScreenResolution'].apply(lambda x: 1 if 'Full HD' in x else 0)
df['Quad HD'] = df['ScreenResolution'].apply(lambda x: 1 if 'Quad HD' in x else 0)
df['Quad HD+'] = df['ScreenResolution'].apply(lambda x: 1 if 'Quad HD+' in x else 0)
df['Retina Display'] = df['ScreenResolution'].apply(lambda x: 1 if 'Retina Display' in x else 0)
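
The flags above are plain substring checks, which means a `Quad HD+` screen also sets the `Quad HD` flag. The sketch below illustrates this with a made-up resolution string; `screen_flags` and `SCREEN_LABELS` are hypothetical names, and whether the two flags should be mutually exclusive is an assumption about intent.

```python
SCREEN_LABELS = ['4K Ultra HD', 'IPS Panel', 'Touchscreen', 'Full HD',
                 'Quad HD+', 'Quad HD', 'Retina Display']

def screen_flags(resolution):
    """Return a 0/1 flag per label using plain substring checks."""
    return {label: int(label in resolution) for label in SCREEN_LABELS}

example = 'IPS Panel Quad HD+ / Touchscreen 3200x1800'  # illustrative input
flags = screen_flags(example)
print(flags['Quad HD+'], flags['Quad HD'])  # 1 1: 'Quad HD' is a substring of 'Quad HD+'

# One way to make the plain 'Quad HD' flag exclude 'Quad HD+' screens
quad_hd_only = int('Quad HD' in example and 'Quad HD+' not in example)
print(quad_hd_only)  # 0
```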


In [15]:
df.Inches.value_counts() #everything seems fine

Inches
15.6    665
14.0    197
13.3    164
17.3    164
12.5     39
11.6     33
12.0      6
13.5      6
13.9      6
12.3      5
10.1      4
15.4      4
15.0      4
13.0      2
18.4      1
17.0      1
14.1      1
11.3      1
Name: count, dtype: int64

In [16]:
df.TypeName.value_counts()

TypeName
Notebook              727
Gaming                205
Ultrabook             196
2 in 1 Convertible    121
Workstation            29
Netbook                25
Name: count, dtype: int64

In [17]:
# companies with 10 or fewer laptops are grouped as "Other"
company_counts = df['Company'].value_counts()
frequent_companies = company_counts[company_counts > 10].index
df['Company'] = df['Company'].where(df['Company'].isin(frequent_companies), 'Other')
df.Company.value_counts() 

Company
Dell       297
Lenovo     297
HP         274
Asus       158
Acer       103
MSI         54
Other       51
Toshiba     48
Apple       21
Name: count, dtype: int64

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 30 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           1303 non-null   object 
 1   TypeName          1303 non-null   object 
 2   Inches            1303 non-null   float64
 3   ScreenResolution  1303 non-null   object 
 4   Cpu               1303 non-null   object 
 5   Ram               1303 non-null   int32  
 6   Memory            1303 non-null   object 
 7   Gpu               1303 non-null   object 
 8   OpSys             1303 non-null   object 
 9   Weight            1303 non-null   float64
 10  Price             1303 non-null   float64
 11  GHz               1217 non-null   float64
 12  Cpu Brand         1303 non-null   object 
 13  GPU_Brand         1303 non-null   object 
 14  Windows           1303 non-null   int64  
 15  Linux             1303 non-null   int64  
 16  Mac               1303 non-null   int64  


In [19]:
df.head()

,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,GHz,Cpu Brand,GPU_Brand,Windows,Linux,Mac,Chrome,No_OS,Flash Storage,Hybrid,SSD,HDD,4K Ultra HD,IPS Panel,Touchscreen,Full HD,Quad HD,Quad HD+,Retina Display
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,2.3,Intel,Intel,0,0,0,0,0,0,0,128,0,0,1,0,0,0,0,1
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232,1.8,Intel,Intel,0,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0,2.5,Intel,Intel,0,0,0,0,1,0,0,256,0,0,0,0,1,0,0,0
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.336,2.7,Intel,AMD,0,0,0,0,0,0,0,512,0,0,1,0,0,0,0,1
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,3.1,Intel,Intel,0,0,0,0,0,0,0,256,0,0,1,0,0,0,0,1


In [20]:
df_unnecessary = ["ScreenResolution","Gpu","OpSys","Memory","Cpu"]
df.drop(df_unnecessary,axis=1,inplace=True)

In [21]:
# object type columns will be converted to numeric
df = pd.get_dummies(df, columns=['Company', 'TypeName', 'Cpu Brand', 'GPU_Brand'], drop_first=True)
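
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up three-brand column; the `toy` frame is an assumption for illustration only. The first category in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Made-up data, only to show which dummy column disappears
toy = pd.DataFrame({'GPU_Brand': ['Intel', 'AMD', 'Nvidia', 'Intel']})

# drop_first=True removes the first category in sorted order ('AMD' here);
# a row with all remaining dummies equal to 0 then stands for that category
encoded = pd.get_dummies(toy, columns=['GPU_Brand'], drop_first=True)
print(list(encoded.columns))  # ['GPU_Brand_Intel', 'GPU_Brand_Nvidia']
```

Dropping one level per feature avoids perfectly collinear dummy columns, which mainly matters for the linear models in the comparison.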

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Inches                1303 non-null   float64
 1   Ram                   1303 non-null   int32  
 2   Weight                1303 non-null   float64
 3   Price                 1303 non-null   float64
 4   GHz                   1217 non-null   float64
 5   Windows               1303 non-null   int64  
 6   Linux                 1303 non-null   int64  
 7   Mac                   1303 non-null   int64  
 8   Chrome                1303 non-null   int64  
 9   No_OS                 1303 non-null   int64  
 10  Flash Storage         1303 non-null   int64  
 11  Hybrid                1303 non-null   int64  
 12  SSD                   1303 non-null   int64  
 13  HDD                   1303 non-null   int64  
 14  4K Ultra HD           1303 non-null   int64  
 15  IPS Panel            

In [23]:
# missing GHz values will be filled with k-nearest-neighbors imputation (KNNImputer)
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)  # 2 neighbors are selected for the imputation
df_filled = imputer.fit_transform(df)
df_filled = pd.DataFrame(df_filled, columns=df.columns)
df_filled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Inches                1303 non-null   float64
 1   Ram                   1303 non-null   float64
 2   Weight                1303 non-null   float64
 3   Price                 1303 non-null   float64
 4   GHz                   1303 non-null   float64
 5   Windows               1303 non-null   float64
 6   Linux                 1303 non-null   float64
 7   Mac                   1303 non-null   float64
 8   Chrome                1303 non-null   float64
 9   No_OS                 1303 non-null   float64
 10  Flash Storage         1303 non-null   float64
 11  Hybrid                1303 non-null   float64
 12  SSD                   1303 non-null   float64
 13  HDD                   1303 non-null   float64
 14  4K Ultra HD           1303 non-null   float64
 15  IPS Panel            

In [24]:
df_filled.isnull().sum().sum()

0
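
In the cells above, the imputer is fitted on the whole frame, which includes the `Price` target and rows that later end up in the test set. A common variation is to split first and fit the imputer on the training features only. The sketch below shows that ordering on a made-up toy frame; it is an alternative under stated assumptions, not the notebook's procedure.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Toy frame (made-up numbers) with a few missing GHz values
toy = pd.DataFrame({
    'Ram':   [4, 8, 8, 16, 16, 32, 8, 4, 16, 8],
    'GHz':   [1.6, 2.5, np.nan, 2.7, 2.8, np.nan, 2.3, 1.1, np.nan, 2.0],
    'Price': [20e3, 45e3, 50e3, 90e3, 95e3, 150e3, 48e3, 15e3, 88e3, 40e3],
})

X = toy.drop('Price', axis=1)   # the target is kept out of the imputer
y = toy['Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

imputer = KNNImputer(n_neighbors=2)
# Fit on the training rows only, then reuse the fitted imputer on the test rows
X_train_filled = pd.DataFrame(imputer.fit_transform(X_train),
                              columns=X.columns, index=X_train.index)
X_test_filled = pd.DataFrame(imputer.transform(X_test),
                             columns=X.columns, index=X_test.index)

print(X_train_filled.isnull().sum().sum(), X_test_filled.isnull().sum().sum())
```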

In [25]:
# separate features and target (the train/test split itself happens inside algo_test)
x = df_filled.drop("Price",axis=1)
y = df_filled["Price"]

# Model Training and Evaluation

In [26]:
algo_test(x,y)

,Model,Best Params,R_Squared,RMSE,MAE
0,Gradient Boosting,"{'n_estimators': 300, 'learning_rate': 0.1}",0.84806,14801.342231,9366.973199
1,XGBRegressor,"{'n_estimators': 200, 'learning_rate': 0.1}",0.846019,14900.394695,9274.36887
2,Lasso,{'alpha': 10},0.759564,18619.363478,13222.595617
3,Ridge,{'alpha': 1},0.758726,18651.758062,13233.159955
4,ElasticNet,"{'l1_ratio': 0.9, 'alpha': 0.1}",0.748442,19045.116508,13305.937873
5,Decision Tree,{'max_depth': 10},0.734005,19583.998981,12683.661737
6,KNeighborsRegressor,{'n_neighbors': 3},0.670039,21812.021278,12709.437057
7,MLP Regressor,"{'hidden_layer_sizes': (50, 50), 'alpha': 0.01}",0.555886,25305.330405,16910.768452
8,SVR,"{'gamma': 'scale', 'C': 10}",-0.016544,38284.918859,26204.925475
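
For reference, the three metrics in the table can be computed by hand. The sketch below uses made-up prices purely to show the arithmetic; `regression_metrics` is a hypothetical helper, not the scikit-learn implementation. It also shows why R-squared can be negative: a model that always predicts the mean of the test targets scores exactly 0, and anything worse drops below it.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R-squared, RMSE and MAE from two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        'R_Squared': 1 - ss_res / ss_tot,
        'RMSE': math.sqrt(ss_res / n),
        'MAE': sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# Made-up prices, only to show the arithmetic
actual = [30000, 50000, 70000, 90000]
predicted = [32000, 47000, 74000, 85000]
metrics = regression_metrics(actual, predicted)
print(metrics)

# Predicting the mean for every row gives R_Squared = 0
baseline = regression_metrics(actual, [60000] * 4)
print(baseline['R_Squared'])  # 0.0
```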


# Conclusion

In this notebook, we built and compared regression models that predict laptop prices from hardware and software features. The steps were:

1. **Data Loading and Initial Exploration**:
   - Loaded the dataset and performed initial exploration to understand the structure and content of the data.

2. **Data Cleaning and Preprocessing**:
   - Dropped the `Unnamed: 0` index column and, once features had been extracted, the raw text columns.
   - Extracted features from complex columns such as `Cpu`, `Gpu`, `Memory`, and `ScreenResolution`.
   - Converted categorical variables into numerical representations using one-hot encoding.
   - Numerical features were not standardized: `StandardScaler` is imported but never applied, which may contribute to the weak results of scale-sensitive models such as SVR and the MLP regressor.

3. **Feature Engineering**:
   - Created new features such as `GHz` from the `Cpu` column and several binary features from the `ScreenResolution` column.
   - Grouped companies with 10 or fewer laptops into an "Other" category to simplify the dataset.

4. **Handling Missing Values**:
   - Used `KNNImputer` with 2 neighbors to fill the 86 missing `GHz` values, so that no missing data remains.

5. **Model Training and Evaluation**:
   - Defined a function `algo_test` to train and evaluate multiple regression models using `RandomizedSearchCV` for hyperparameter tuning.
   - Split the data into training and testing sets (80/20).
   - Trained various models including Ridge, Lasso, ElasticNet, Gradient Boosting, KNeighborsRegressor, Decision Tree, XGBRegressor, SVR, and MLP Regressor.
   - Evaluated the models based on R-squared, RMSE, and MAE metrics.

6. **Results**:
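   - Compiled the best hyperparameters and test-set metrics for each model, sorted by R-squared.
   - Gradient Boosting performed best (R-squared 0.848, RMSE about 14,801, MAE about 9,367), closely followed by XGBRegressor (R-squared 0.846). The linear models reached about 0.75 to 0.76, and SVR scored slightly below zero (-0.017).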

This workflow gives a reusable baseline for predicting laptop prices and shows how much of the effort goes into data preprocessing and feature engineering. On this dataset, the boosted tree ensembles clearly outperformed the linear, neighbor-based, and neural models. The results could likely be improved by standardizing features for scale-sensitive models, exploring additional features, tuning hyperparameters more extensively, or trying other machine learning algorithms.