# Price Prediction using Machine Learning

<span style="color:#3847f5; font-weight:bold;font-size:20px;">We will work on a project to predict the prices of laptops. The problem statement is that if a user wants to purchase a laptop, our application should be capable of providing an estimated price for the laptop based on the user's configurations. While this might seem like a simple project or just model development, the dataset we have is noisy and requires extensive feature engineering and preprocessing, making this project an interesting challenge to develop.</span>


## Feature Engineering and Preprocessing for the Laptop Price Prediction Model





<span style="color:#3847f5; font-weight:bold;font-size:20px;">Feature Engineering is the process of transforming raw data into meaningful information. There are various techniques under feature engineering, such as transformation, categorical encoding, and more. Currently, the columns in our dataset are noisy, so we need to apply some feature engineering steps.
</span>


# 1- Import data  :

In [132]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
import re

In [133]:
df=pd.read_csv("laptop_data.csv")
df.head(5)

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808


<span style="color:#3847f5; font-weight:bold;font-size:20px;">The dataset contains 12 columns. We will analyze each column, examine its characteristics, and make necessary modifications to build a robust model for price prediction.</span>

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The First Column "Unnamed: 0".</span>

In [134]:
df["Unnamed: 0"].unique()

array([   0,    1,    2, ..., 1300, 1301, 1302], dtype=int64)

<span style="color:blue; font-weight:bold;font-size:16px;">The first column is just an index, so we will delete it.</span>

In [135]:
# "columns=" already selects the axis, so "axis=1" is not needed
df = df.drop(columns="Unnamed: 0")
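An alternative to dropping the column afterwards is to tell pandas at load time that the first CSV column is the row index. The sketch below uses a small in-memory CSV (the `sample_csv` text is made up for illustration and is not the real file) to show the effect of `index_col=0`.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the layout of laptop_data.csv
sample_csv = io.StringIO(
    ",Company,Price\n"
    "0,Apple,71378.6832\n"
    "1,HP,30636.0\n"
)

# index_col=0 uses the first column as the index instead of loading it
# as a regular column named "Unnamed: 0"
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

With this option there is no extra column to delete after loading.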

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The Second Column "Company".</span>

In [136]:
df["Company"].unique()

array(['Apple', 'HP', 'Acer', 'Asus', 'Dell', 'Lenovo', 'Chuwi', 'MSI',
       'Microsoft', 'Toshiba', 'Huawei', 'Xiaomi', 'Vero', 'Razer',
       'Mediacom', 'Samsung', 'Google', 'Fujitsu', 'LG'], dtype=object)

<span style="color:blue; font-weight:bold;font-size:16px;">It consists of brand names and appears to require no modifications.</span>

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The third column "TypeName".</span>


In [137]:
df["TypeName"].unique()

array(['Ultrabook', 'Notebook', 'Netbook', 'Gaming', '2 in 1 Convertible',
       'Workstation'], dtype=object)

<span style="color:blue; font-weight:bold;font-size:16px;">We will treat "Netbook" as a variant of "Notebook" and merge the two categories.</span>

In [138]:
df.TypeName=df.TypeName.replace('Netbook','Notebook')

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The fourth column "Inches"
It represents the device size in inches and appears to require no modifications.</span>

In [139]:
df["Inches"].unique()

array([13.3, 15.6, 15.4, 14. , 12. , 11.6, 17.3, 10.1, 13.5, 12.5, 13. ,
       18.4, 13.9, 12.3, 17. , 15. , 14.1, 11.3])

In [140]:
df.head(5)

Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808


<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The fifth column "ScreenResolution"
From the contents of the column, we can extract three important pieces of information:</span>  
 - The screen type, which falls into three categories: "IPS", "Touchscreen", and "IPS/Touchscreen".
 - The width, which is the first number in a value such as 1920x1080 (here 1920).
 - The height, which is the second number in that value (here 1080).

In [141]:
df["ScreenResolution"].unique()

array(['IPS Panel Retina Display 2560x1600', '1440x900',
       'Full HD 1920x1080', 'IPS Panel Retina Display 2880x1800',
       '1366x768', 'IPS Panel Full HD 1920x1080',
       'IPS Panel Retina Display 2304x1440',
       'IPS Panel Full HD / Touchscreen 1920x1080',
       'Full HD / Touchscreen 1920x1080',
       'Touchscreen / Quad HD+ 3200x1800',
       'IPS Panel Touchscreen 1920x1200', 'Touchscreen 2256x1504',
       'Quad HD+ / Touchscreen 3200x1800', 'IPS Panel 1366x768',
       'IPS Panel 4K Ultra HD / Touchscreen 3840x2160',
       'IPS Panel Full HD 2160x1440',
       '4K Ultra HD / Touchscreen 3840x2160', 'Touchscreen 2560x1440',
       '1600x900', 'IPS Panel 4K Ultra HD 3840x2160',
       '4K Ultra HD 3840x2160', 'Touchscreen 1366x768',
       'IPS Panel Full HD 1366x768', 'IPS Panel 2560x1440',
       'IPS Panel Full HD 2560x1440',
       'IPS Panel Retina Display 2736x1824', 'Touchscreen 2400x1600',
       '2560x1440', 'IPS Panel Quad HD+ 2560x1440',
       'IPS Panel 

In [142]:
# Function for extracting the screen type: "IPS", "Touchscreen" or "IPS/Touchscreen"
def extraction_ScreenResolution(text):
    if "IPS" in text and "Touchscreen" in text:
        return "IPS/Touchscreen"
    elif "IPS" in text:
        return "IPS"
    else:
        # Note: every value without "IPS" lands here, including plain resolutions such as "1440x900"
        return "Touchscreen"
    

In [143]:
#Create column "IPS/Touchscreen"
df["IPS/Touchscreen"]=df.ScreenResolution.apply(extraction_ScreenResolution)
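As written, `extraction_ScreenResolution` labels every string that lacks "IPS" as "Touchscreen", so a plain value such as '1440x900' is also reported as a touchscreen. If that is not intended, one possible variant adds an explicit fallback. The function name `screen_type` and the label "Standard" are assumptions introduced here for illustration and are not part of the original notebook.

```python
def screen_type(text):
    """Classify a ScreenResolution string by the panel features it mentions."""
    has_ips = "IPS" in text
    has_touch = "Touchscreen" in text
    if has_ips and has_touch:
        return "IPS/Touchscreen"
    if has_ips:
        return "IPS"
    if has_touch:
        return "Touchscreen"
    # Hypothetical fourth label for screens that mention neither feature
    return "Standard"


for value in ["IPS Panel Full HD / Touchscreen 1920x1080", "Touchscreen 2256x1504", "1440x900"]:
    print(value, "->", screen_type(value))
```

Using this variant would change the values in the "IPS/Touchscreen" column, so the outputs shown in this notebook would no longer match.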

In [144]:
# Function for extracting the resolution: width and height
def extraction_resolution(text):
    if not isinstance(text, str):
        return None, None
    pattern = r"\b(\d+)x(\d+)"
    match = re.search(pattern, text)
    if match:
        width = int(match.group(1))
        height = int(match.group(2))
        return width, height
    return None, None

In [145]:
df[['width','height']]=df.ScreenResolution.apply(extraction_resolution).apply(pd.Series)
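The same two columns can also be produced without a Python-level function by using the vectorized `Series.str.extract`. This is a minimal sketch on a hand-made sample series, assuming every value contains a `WIDTHxHEIGHT` pair (otherwise the result contains NaN and the integer conversion would fail).

```python
import pandas as pd

# Small sample taken from the unique values listed above
sample = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "1440x900",
    "Full HD 1920x1080",
])

# Named groups become the column names of the resulting DataFrame
resolution = sample.str.extract(r"(?P<width>\d+)x(?P<height>\d+)").astype(int)
print(resolution)
```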

In [147]:
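# The raw column is no longer needed once its parts are extracted (it is absent from the output below)
df = df.drop(columns="ScreenResolution")
ScreenResolution_df = df.head(5).style.set_properties(subset=["IPS/Touchscreen", "width", "height"], **{"background-color": "#bff3f5"})
ScreenResolution_df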

Unnamed: 0,Company,TypeName,Inches,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,IPS/Touchscreen,width,height
0,Apple,Ultrabook,13.3,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832,IPS,2560,1600
1,Apple,Ultrabook,13.3,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232,Touchscreen,1440,900
2,HP,Notebook,15.6,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0,Touchscreen,1920,1080
3,Apple,Ultrabook,15.4,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336,IPS,2880,1800
4,Apple,Ultrabook,13.3,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808,IPS,2560,1600


<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The sixth column "Cpu"
This column contains the processor type. The processors can be classified into 5 categories:
"Intel Core i5", "Intel Core i3", "Intel Core i7", "AMD", and "Intel". Anything else is labeled "Other".</span>

In [148]:
# Map each CPU description to one of the categories above.
# Order matters: the specific "Intel Core iX" patterns must come before the generic "Intel".
def extraction_cpu(text):
    patterns = ["Intel Core i5", "Intel Core i3", "Intel Core i7", "AMD ", "Intel"]

    for pattern in patterns:
        if pattern in text:
            return pattern.strip()

    return "Other"

In [149]:
df["Cpu"] = df.Cpu.apply(extraction_cpu)
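The classification relies on checking the patterns in order, from most specific to most generic. The standalone sketch below illustrates why the order matters. The helper name `cpu_category` and the sample strings are illustrative assumptions (only the first sample appears in the output shown earlier).

```python
CPU_PATTERNS = ["Intel Core i5", "Intel Core i3", "Intel Core i7", "AMD ", "Intel"]


def cpu_category(text, patterns=CPU_PATTERNS):
    """Return the first pattern found in text, or "Other" if none matches."""
    for pattern in patterns:
        if pattern in text:
            return pattern.strip()
    return "Other"


# Illustrative CPU strings
samples = [
    "Intel Core i5 7200U 2.5GHz",
    "Intel Celeron Dual Core N3350 1.1GHz",
    "AMD A9-Series 9420 3GHz",
    "Samsung Cortex A72&A53 2.0GHz",
]
for s in samples:
    print(s, "->", cpu_category(s))

# Putting the generic "Intel" first would swallow every Core model
wrong_order = ["Intel", "Intel Core i5"]
print(cpu_category("Intel Core i5 7200U 2.5GHz", wrong_order))
```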

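<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The seventh column "Ram"
It stores the memory size as text such as "8GB". We will keep only the numeric value.</span>

In [150]:
def extraction_ram(text):
    # Return a single value (not a tuple) because one column is produced
    if not isinstance(text, str):
        return None
    pattern = r"\b(\d+)"
    match = re.search(pattern, text)
    if match:
        return int(match.group(1))
    return None

In [151]:
df['Ram'] = df.Ram.apply(extraction_ram)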
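Because every "Ram" value shown above has the form `<number>GB`, the conversion can also be done with vectorized string methods. This is a minimal sketch on a hand-made sample, assuming the column contains no other suffix or missing values.

```python
import pandas as pd

ram = pd.Series(["8GB", "16GB", "4GB"])

# Remove the literal "GB" suffix, then convert the remaining digits to integers
ram_gb = ram.str.replace("GB", "", regex=False).astype(int)
print(ram_gb.tolist())
```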

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The eighth column "Memory"
It contains a lot of information, and the column can be split into 4 columns. The first column will show the primary storage capacity, the second column will show its type, the third column will display the secondary storage capacity, and the fourth column will show its type.  
Some computers have only one storage device, so we will fill the secondary storage capacity with 0 and its type with "Any".  
Additionally, some capacities are given in GB and others in TB, so the units must be standardized. The extraction function below converts everything to GB using:  
1TB = 1024GB</span>

In [152]:
# Split into two columns: 'Memory' (primary storage) and 'Memory_plus' (secondary storage)
df[['Memory','Memory_plus']]=df.Memory.str.split('+',expand=True)
# Function for extracting the capacity (in GB) and the type of one storage description
def extraction_memory_plus(text):
    if not isinstance(text, str):
        return None, None
    pattern = r"\b(\d+(?:\.\d+)?)(TB|GB)\s(\w+)"
    match = re.search(pattern, text)
    if match:
        capacity = float(match.group(1))
        unit = match.group(2)
        storage_type = match.group(3)
        if unit == "TB":
            capacity *= 1024  # convert TB to GB
        return capacity, storage_type
    return None, None
# Create the four columns
df[['Memory_plusCapacite','Mermory_plusType']]=df.Memory_plus.apply(extraction_memory_plus).apply(pd.Series)
df[['Memory','Mermory_Type']]=df.Memory.apply(extraction_memory_plus).apply(pd.Series)
# Drop the helper column "Memory_plus" and fill the empty secondary-storage cells
df=df.drop(columns="Memory_plus")
df["Memory_plusCapacite"]=df["Memory_plusCapacite"].fillna(0)
df["Mermory_plusType"]=df["Mermory_plusType"].fillna('Any')
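To make the splitting and unit conversion concrete, here is a pandas-free sketch that parses a single "Memory" string into the four values described above. The helper names `parse_storage` and `parse_memory` are hypothetical, and the first sample string is an illustrative two-device value rather than one shown in the output above.

```python
import re

MEMORY_RE = re.compile(r"(\d+(?:\.\d+)?)(TB|GB)\s+(\w+)")


def parse_storage(part):
    """Return (capacity_in_GB, type) for one storage description."""
    match = MEMORY_RE.search(part)
    if match is None:
        return None, None
    capacity = float(match.group(1))
    if match.group(2) == "TB":
        capacity *= 1024  # same convention as above: 1TB = 1024GB
    return capacity, match.group(3)


def parse_memory(text):
    """Split on '+' and return (primary_GB, primary_type, secondary_GB, secondary_type)."""
    primary, _, secondary = text.partition("+")
    cap1, type1 = parse_storage(primary)
    cap2, type2 = parse_storage(secondary)
    # A missing secondary device becomes capacity 0 with type "Any"
    return cap1, type1, cap2 or 0.0, type2 or "Any"


print(parse_memory("128GB SSD +  1TB HDD"))
print(parse_memory("128GB Flash Storage"))
```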

In [153]:
df.Gpu.unique()

array(['Intel Iris Plus Graphics 640', 'Intel HD Graphics 6000',
       'Intel HD Graphics 620', 'AMD Radeon Pro 455',
       'Intel Iris Plus Graphics 650', 'AMD Radeon R5',
       'Intel Iris Pro Graphics', 'Nvidia GeForce MX150',
       'Intel UHD Graphics 620', 'Intel HD Graphics 520',
       'AMD Radeon Pro 555', 'AMD Radeon R5 M430',
       'Intel HD Graphics 615', 'AMD Radeon Pro 560',
       'Nvidia GeForce 940MX', 'Intel HD Graphics 400',
       'Nvidia GeForce GTX 1050', 'AMD Radeon R2', 'AMD Radeon 530',
       'Nvidia GeForce 930MX', 'Intel HD Graphics',
       'Intel HD Graphics 500', 'Nvidia GeForce 930MX ',
       'Nvidia GeForce GTX 1060', 'Nvidia GeForce 150MX',
       'Intel Iris Graphics 540', 'AMD Radeon RX 580',
       'Nvidia GeForce 920MX', 'AMD Radeon R4 Graphics', 'AMD Radeon 520',
       'Nvidia GeForce GTX 1070', 'Nvidia GeForce GTX 1050 Ti',
       'Nvidia GeForce MX130', 'AMD R4 Graphics',
       'Nvidia GeForce GTX 940MX', 'AMD Radeon RX 560',
       'Nvid

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The ninth column "Gpu"
This column describes the graphics processor and contains a lot of detail.  
We only need the brand name and the product line, so we will keep just the first two words of the text.</span>

In [154]:
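def extraction_name_Gpu(text):
    # Return a single value (not a tuple) because one column is produced
    if not isinstance(text, str):
        return None
    # Two words separated by one whitespace character, e.g. "Intel Iris"
    pattern = r"\w+\s\w+"
    match = re.search(pattern, text)
    if match:
        return match.group()
    return None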

In [155]:
df.Gpu=df.Gpu.apply(extraction_name_Gpu)

In [156]:
df.Gpu.unique()

array(['Intel Iris', 'Intel HD', 'AMD Radeon', 'Nvidia GeForce',
       'Intel UHD', 'AMD R4', 'Nvidia GTX', 'AMD R17M', 'Nvidia Quadro',
       'AMD FirePro', 'Intel Graphics', 'ARM Mali'], dtype=object)
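Keeping the first two words can also be expressed with pandas string methods instead of a regular expression. The sketch below runs on three values copied from the list above. The variable names are illustrative.

```python
import pandas as pd

gpu = pd.Series([
    "Intel Iris Plus Graphics 640",
    "Nvidia GeForce GTX 1050",
    "AMD Radeon R5",
])

# Split on whitespace, keep the first two tokens, and join them back together
gpu_short = gpu.str.split().str[:2].str.join(" ")
print(gpu_short.tolist())

# The brand alone is simply the first token
gpu_brand = gpu.str.split().str[0]
print(gpu_brand.tolist())
```

Keeping only the brand gives fewer categories, at the cost of losing the product line.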

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The tenth column "OpSys" contains the operating system of each device.
For example, some devices run on Windows, with some using version 10 and others version 7.  
There are also devices that run on other systems such as macOS and Linux.  
The Windows versions will be reduced to "Windows", and macOS, Mac OS X and Chrome OS will be grouped under the single label "OS".</span>

In [157]:
df.OpSys.unique()

array(['macOS', 'No OS', 'Windows 10', 'Mac OS X', 'Linux', 'Android',
       'Windows 10 S', 'Chrome OS', 'Windows 7'], dtype=object)

In [158]:
replace={
    "macOS": "OS",
    "Mac OS X": "OS",
    "Chrome OS": "OS",
    "Windows 10": "Windows",
    "Windows 10 S": "Windows",
    "Windows 7": "Windows",
}

In [159]:
df.OpSys=df.OpSys.replace(replace)

In [160]:
df.OpSys.unique()

array(['OS', 'No OS', 'Windows', 'Linux', 'Android'], dtype=object)
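The mapping above places macOS, Mac OS X and Chrome OS in one "OS" group. If keeping them apart turns out to be useful for the model, a possible alternative grouping is sketched below. The function `os_family` is an assumption introduced here and is not what the notebook uses.

```python
def os_family(name):
    """Drop the version from an operating system name while keeping families distinct."""
    if name.startswith("Windows"):
        return "Windows"
    if name in ("macOS", "Mac OS X"):
        return "macOS"
    return name


# The unique values listed in the OpSys column before replacement
raw_values = ["macOS", "No OS", "Windows 10", "Mac OS X", "Linux",
              "Android", "Windows 10 S", "Chrome OS", "Windows 7"]
families = sorted({os_family(v) for v in raw_values})
print(families)
```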

<span style="color:#7b0d8e; font-weight:bold;font-size:16px;">The eleventh column "Weight" holds the device's weight in kilograms.
 We will strip the unit 'kg' and retain only the numeric value.</span>

In [161]:
# Extract the numeric value from the weight string
def extraction_Weight(text):
    # Return a single value (not a tuple) because one column is produced
    if not isinstance(text, str):
        return None
    pattern = r"\b(\d+(?:\.\d+)?)"
    match = re.search(pattern, text)
    if match:
        return float(match.group(1))
    return None

In [162]:
df['Weight'] = df.Weight.apply(extraction_Weight)
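A vectorized alternative is to strip the unit and let `pd.to_numeric` do the conversion. With `errors="coerce"`, any value that cannot be parsed becomes NaN instead of raising an error. The "?" entry in the sample below is a made-up malformed value used only to show that behavior.

```python
import pandas as pd

weight = pd.Series(["1.37kg", "1.86kg", "?"])

# Remove the literal "kg" suffix, then convert; unparseable entries become NaN
weight_kg = pd.to_numeric(weight.str.replace("kg", "", regex=False), errors="coerce")
print(weight_kg)
```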

Reorganizing and rearranging the columns to obtain the final format of the data.

In [163]:
df.columns

Index(['Company', 'TypeName', 'Inches', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys',
       'Weight', 'Price', 'IPS/Touchscreen', 'width', 'height',
       'Memory_plusCapacite', 'Mermory_plusType', 'Mermory_Type'],
      dtype='object')

In [164]:
column_order=['Company', 'TypeName', 'Inches', 'Cpu', 'Gpu','Ram', 'Memory','Mermory_Type', 'Memory_plusCapacite', 'Mermory_plusType','OpSys',
       'Weight', 'IPS/Touchscreen', 'width', 'height', 'Price'
       ]
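Defining `column_order` does not by itself change the DataFrame. The cells shown here never apply it, and the final output still has the original order. A minimal sketch of applying such a list, using a tiny stand-in DataFrame (`demo` and `demo_order` are illustrative names), could look like this:

```python
import pandas as pd

demo = pd.DataFrame({"Price": [30636.0], "Company": ["HP"], "Ram": [8]})
demo_order = ["Company", "Ram", "Price"]

# Guard against typos: the order list must name exactly the existing columns
assert set(demo_order) == set(demo.columns)

# Indexing with a list of names returns the columns in that order
demo = demo[demo_order]
print(demo.columns.tolist())
```

For the real data the equivalent step would be `df = df[column_order]` before saving.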

In [165]:
df.to_csv("data_laptop_clean",index=False)

In [166]:
df

Unnamed: 0,Company,TypeName,Inches,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,IPS/Touchscreen,width,height,Memory_plusCapacite,Mermory_plusType,Mermory_Type
0,Apple,Ultrabook,13.3,Intel Core i5,8,128.0,Intel Iris,OS,1.37,71378.6832,IPS,2560,1600,0.0,Any,SSD
1,Apple,Ultrabook,13.3,Intel Core i5,8,128.0,Intel HD,OS,1.34,47895.5232,Touchscreen,1440,900,0.0,Any,Flash
2,HP,Notebook,15.6,Intel Core i5,8,256.0,Intel HD,No OS,1.86,30636.0000,Touchscreen,1920,1080,0.0,Any,SSD
3,Apple,Ultrabook,15.4,Intel Core i7,16,512.0,AMD Radeon,OS,1.83,135195.3360,IPS,2880,1800,0.0,Any,SSD
4,Apple,Ultrabook,13.3,Intel Core i5,8,256.0,Intel Iris,OS,1.37,96095.8080,IPS,2560,1600,0.0,Any,SSD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14.0,Intel Core i7,4,128.0,Intel HD,Windows,1.80,33992.6400,IPS/Touchscreen,1920,1080,0.0,Any,SSD
1299,Lenovo,2 in 1 Convertible,13.3,Intel Core i7,16,512.0,Intel HD,Windows,1.30,79866.7200,IPS/Touchscreen,3200,1800,0.0,Any,SSD
1300,Lenovo,Notebook,14.0,Intel,2,64.0,Intel HD,Windows,1.50,12201.1200,Touchscreen,1366,768,0.0,Any,Flash
1301,HP,Notebook,15.6,Intel Core i7,6,1024.0,AMD Radeon,Windows,2.19,40705.9200,Touchscreen,1366,768,0.0,Any,HDD
