Questions to Explore:
•	Which features have the most significant impact on laptop prices?
•	Can the model accurately predict the prices of laptops from lesser-known brands?
•	Does the brand of the laptop significantly influence its price?
•	How well does the model perform on laptops with high-end specifications compared to budget laptops?
•	What are the limitations and challenges in predicting laptop prices accurately?
•	How does the model perform when predicting the prices of newly released laptops not present in the training dataset?



--Project Title:
Laptop Price Prediction for SmartTech Co.


--Project Overview:
SmartTech Co. has partnered with our data science team to develop a machine learning model that predicts laptop prices accurately. As the laptop market continues to expand across many brands and specifications, a precise pricing model becomes crucial for both consumers and manufacturers.


--Client's Objectives:
•	Accurate Pricing: Develop a model that can accurately predict laptop prices based on various features, helping SmartTech Co. stay competitive in the market.
•	Market Positioning: Understand how different features contribute to pricing, enabling SmartTech Co. to strategically position its laptops in the market.
•	Brand Influence: Assess the impact of brand reputation on pricing, providing insights into brand perception and market demand.


--Key Challenges:
•	Diverse Specifications: The dataset encompasses laptops with diverse specifications. Our challenge is to build a model that generalizes well across a wide range of features.
•	Real-time Prediction: The model should have the capability to predict prices for newly released laptops, reflecting the fast-paced nature of the tech industry.
•	Interpretability: It is crucial to make the model interpretable, allowing SmartTech Co. to understand the rationale behind pricing predictions.


--Project Phases:
•	Data Exploration and Understanding:
	Dive into the dataset to understand the landscape of laptop specifications.
	Visualize trends in laptop prices and identify potential influential features.
•	Data Preprocessing:
	Handle missing values, outliers, and encode categorical variables.
	Ensure the dataset is ready for model training.
•	Feature Engineering:
	Extract meaningful features to enhance model performance.
	Consider creating new features that capture the essence of laptop pricing.
•	Model Development:
	Employ machine learning algorithms such as Linear Regression, Random Forest, and Gradient Boosting to predict laptop prices.
	Evaluate and choose the model that aligns best with the project's objectives.
•	Hyperparameter Tuning:
	Fine-tune the selected model to achieve optimal performance.
•	Real-time Predictions:
	Implement a mechanism for the model to make predictions for new laptops entering the market.
•	Interpretability and Insights:
	Uncover insights into which features play a pivotal role in pricing decisions.
	Ensure that SmartTech Co. can interpret and trust the model's predictions.
•	Client Presentation:
	Present findings, model performance, and insights to SmartTech Co. stakeholders.
	Address any questions or concerns and gather feedback for potential model improvements.

In [6]:
!apt-get -qq install -y graphviz && pip install pydot
import pydot



This cell installs Graphviz and the pydot library, then imports pydot, a Python interface to Graphviz used to draw graphs and networks.


In [1]:
import pandas as pd
import numpy as np
import re

# Load the dataset and print its summary info
df = pd.read_csv("laptop.csv")
df.info()

# Drop irrelevant columns
df.drop(columns=["Unnamed: 0.1", "Unnamed: 0"], inplace=True, errors='ignore')

# Replace '?' with NaN and then drop rows with NaN
df = df.replace('?', np.nan)
df = df.dropna()

# Show the first 15 rows of the data after the above operations
df.head(15)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0.1      1303 non-null   int64  
 1   Unnamed: 0        1273 non-null   float64
 2   Company           1273 non-null   object 
 3   TypeName          1273 non-null   object 
 4   Inches            1273 non-null   object 
 5   ScreenResolution  1273 non-null   object 
 6   Cpu               1273 non-null   object 
 7   Ram               1273 non-null   object 
 8   Memory            1273 non-null   object 
 9   Gpu               1273 non-null   object 
 10  OpSys             1273 non-null   object 
 11  Weight            1273 non-null   object 
 12  Price             1273 non-null   float64
dtypes: float64(2), int64(1), object(10)
memory usage: 132.5+ KB


Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808
5,Acer,Notebook,15.6,1366x768,AMD A9-Series 9420 3GHz,4GB,500GB HDD,AMD Radeon R5,Windows 10,2.1kg,21312.0
6,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.2GHz,16GB,256GB Flash Storage,Intel Iris Pro Graphics,Mac OS X,2.04kg,114017.6016
7,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,256GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,61735.536
8,Asus,Ultrabook,14.0,Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,16GB,512GB SSD,Nvidia GeForce MX150,Windows 10,1.3kg,79653.6
9,Acer,Ultrabook,14.0,IPS Panel Full HD 1920x1080,Intel Core i5 8250U 1.6GHz,8GB,256GB SSD,Intel UHD Graphics 620,Windows 10,1.6kg,41025.6
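
Before calling dropna(), it can help to see how many missing cells each column has, so the cost of dropping rows is visible. The sketch below is only an illustration: the helper name missing_report is hypothetical, and the toy frame is made up rather than taken from laptop.csv.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, placeholder: str = "?") -> pd.Series:
    """Count missing cells per column, treating `placeholder` as missing."""
    cleaned = frame.replace(placeholder, np.nan)
    return cleaned.isna().sum()


# Toy data for illustration only; these are not rows from laptop.csv.
toy = pd.DataFrame({
    "Company": ["Apple", "HP", None],
    "Inches": ["13.3", "?", "15.6"],
    "Price": [71378.68, 30636.0, np.nan],
})

report = missing_report(toy)
rows_kept = len(toy.replace("?", np.nan).dropna())
print(report)
print("Rows kept after dropna():", rows_kept)
```

On the real data, the same call would be missing_report(df) placed just before the dropna() step.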


In [2]:
# MEMORY FEATURE ENGINEERING
# SEPARATING SSD, HDD, FLASH & HYBRID STORAGE:

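def extract_memory_features(memory):
    """Split a Memory string (e.g. '128GB SSD + 1TB HDD') into one size per storage type."""
    ssd = hdd = flash = hybrid = 0
    for part in memory.split('+'):
        part = part.strip()
        # Take the first run of digits as the size.
        # Note: the unit is ignored, so '1TB HDD' yields 1 while '500GB HDD' yields 500.
        match = re.search(r'\d+', part)
        value = int(match.group()) if match else 0
        if 'SSD' in part:
            ssd += value
        elif 'HDD' in part:
            hdd += value
        elif 'Flash' in part:
            flash += value
        elif 'Hybrid' in part:
            hybrid += value
    return pd.Series([ssd, hdd, flash, hybrid])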

df[['SSD', 'HDD', 'Flash_Storage', 'Hybrid']] = df['Memory'].apply(extract_memory_features)
df

Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,SSD,HDD,Flash_Storage,Hybrid
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832,128,0,0,0
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232,0,0,128,0
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000,256,0,0,0
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360,512,0,0,0
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080,256,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400,128,0,0,0
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200,512,0,0,0
1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200,0,0,64,0
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200,0,1,0,0
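
The last row above shows `1TB HDD` parsed as HDD = 1, so terabyte and gigabyte sizes end up on different scales in the same column. One possible way to normalize everything to GB is sketched below. The function name parse_storage_gb is hypothetical, and treating 1 TB as 1000 GB is an assumption (1024 is an equally defensible convention).

```python
import re

# Assumption: 1 TB is treated as 1000 GB; use 1024 if that convention is preferred.
GB_PER_TB = 1000

# Matches sizes such as '128GB', '1TB' or '1.0TB'.
_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(TB|GB)", re.IGNORECASE)


def parse_storage_gb(memory: str) -> dict:
    """Split a Memory string such as '128GB SSD + 1TB HDD' into GB per storage type."""
    totals = {"SSD": 0.0, "HDD": 0.0, "Flash_Storage": 0.0, "Hybrid": 0.0}
    for part in memory.split("+"):
        match = _SIZE.search(part)
        if match is None:
            continue  # no recognisable size in this part
        size = float(match.group(1))
        if match.group(2).upper() == "TB":
            size *= GB_PER_TB
        if "SSD" in part:
            totals["SSD"] += size
        elif "HDD" in part:
            totals["HDD"] += size
        elif "Flash" in part:
            totals["Flash_Storage"] += size
        elif "Hybrid" in part:
            totals["Hybrid"] += size
    return totals


print(parse_storage_gb("1TB HDD"))
print(parse_storage_gb("128GB SSD + 1TB HDD"))
```

To apply it to the DataFrame, something like df['Memory'].apply(lambda m: pd.Series(parse_storage_gb(m))) should work. Note that swapping this in changes the feature values, so the model metrics reported in this notebook would no longer apply as-is.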


In [3]:
# SEPARATING THE DATA INTO VARIOUS COLUMNS AS PER FEATURES:
# --------- CPU FEATURE ENGINEERING ---------
df['Cpu_Brand'] = df['Cpu'].apply(lambda x: x.split()[0])
df['Cpu_Series'] = df['Cpu'].apply(lambda x: x.split()[2] if 'Intel' in x or 'AMD' in x else x)

# --------- GPU FEATURE ENGINEERING ---------
df['Gpu_Brand'] = df['Gpu'].apply(lambda x: x.split()[0])

# --------- SCREEN FEATURE ENGINEERING ---------
df['Touchscreen'] = df['ScreenResolution'].apply(lambda x: 1 if 'Touchscreen' in x else 0)
df['IPS'] = df['ScreenResolution'].apply(lambda x: 1 if 'IPS' in x else 0)

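def extract_resolution(res):
    """Return (width, height) parsed from the last token, e.g. '1920x1080'; (0, 0) if unparseable."""
    try:
        res = res.split()[-1]
        width, height = res.split('x')
        return pd.Series([int(width), int(height)])
    except (AttributeError, IndexError, ValueError):
        return pd.Series([0, 0])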

df[['Resolution_Width', 'Resolution_Height']] = df['ScreenResolution'].apply(extract_resolution)

# Convert 'Inches', 'Ram' and 'Weight' to numeric (only 'Inches' coerces unparseable values to NaN)
df['Inches'] = pd.to_numeric(df['Inches'], errors='coerce')
df['Ram'] = df['Ram'].str.replace('GB', '', regex=False).astype(int)
df['Weight'] = df['Weight'].str.replace('kg', '', regex=False).astype(float)

# Drop rows where 'Inches' became NaN due to coercion
df.dropna(subset=['Inches'], inplace=True)

# Print dtypes after converting Inches, Ram, Weight
print("Data types after converting Inches, Ram, Weight:\n", df[['Inches', 'Ram', 'Weight']].dtypes)

# DROP ORIGINAL COLUMNS AFTER FEATURE ENGINEERING AND BEFORE ENCODING
df.drop(columns=['Memory', 'Cpu', 'Gpu', 'ScreenResolution'], inplace=True)

# Check for non-numeric columns before encoding
non_numeric_cols = df.select_dtypes(exclude=['number', 'bool']).columns
if len(non_numeric_cols) > 0:
    print("Non-numeric columns before encoding:", non_numeric_cols)
else:
    print("All columns are numeric or boolean before encoding.")

# ENCODE CATEGORICAL VARIABLES
categorical_cols = ['Company', 'TypeName', 'OpSys', 'Cpu_Brand', 'Cpu_Series', 'Gpu_Brand']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Print dtypes of the entire DataFrame after encoding
print("\nData types of the entire DataFrame after encoding:\n", df.dtypes)


# FINAL CHECK
print("\nProcessed dataset shape:", df.shape)
print(df.head())

Data types after converting Inches, Ram, Weight:
 Inches    float64
Ram         int64
Weight    float64
dtype: object
Non-numeric columns before encoding: Index(['Company', 'TypeName', 'OpSys', 'Cpu_Brand', 'Cpu_Series', 'Gpu_Brand'], dtype='object')

Data types of the entire DataFrame after encoding:
 Inches                 float64
Ram                      int64
Weight                 float64
Price                  float64
SSD                      int64
                        ...   
Cpu_Series_x5-Z8350       bool
Cpu_Series_x5-Z8550       bool
Gpu_Brand_ARM             bool
Gpu_Brand_Intel           bool
Gpu_Brand_Nvidia          bool
Length: 85, dtype: object

Processed dataset shape: (1270, 85)
   Inches  Ram  Weight        Price  SSD  HDD  Flash_Storage  Hybrid  \
0    13.3    8    1.37   71378.6832  128    0              0       0   
1    13.3    8    1.34   47895.5232    0    0            128       0   
2    15.6    8    1.86   30636.0000  256    0              0       0   
3   
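
Because pd.get_dummies builds its columns from whatever categories are present, a newly released laptop encoded on its own will not have the same 85-column layout the model was trained on. A minimal sketch of one way to align it, assuming the training column list is saved alongside the model, is shown below. The toy frames and the brand "Xiaomi" are made up for illustration.

```python
import pandas as pd

# Toy training data; column names mirror the notebook, values are made up.
train = pd.DataFrame({
    "Ram": [8, 16, 4],
    "Company": ["Apple", "HP", "Acer"],
})
train_encoded = pd.get_dummies(train, columns=["Company"], drop_first=True)
feature_columns = train_encoded.columns  # save this alongside the model

# A new laptop from a brand that never appeared in training.
new_laptop = pd.DataFrame({"Ram": [8], "Company": ["Xiaomi"]})
new_encoded = pd.get_dummies(new_laptop, columns=["Company"])

# Align to the training layout: unseen dummies are dropped, missing ones become 0.
new_aligned = new_encoded.reindex(columns=feature_columns, fill_value=0)
print(new_aligned)
```

With this approach an unseen brand is encoded exactly like the dropped baseline category (Acer in the toy data), which is one reason predictions for lesser-known brands should be treated with caution.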

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor as KNNRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# 1. Separate features and target
X = df.drop("Price", axis=1)
y = df["Price"]

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Define models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Knn": KNNRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Support Vector": SVR()
}

# 4. Train and evaluate
results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    results[name] = {
        "MAE": round(mae, 2),
        "RMSE": round(rmse, 2),
        "R2 Score": round(r2, 4)
    }

# Show results
results_df = pd.DataFrame(results).T
print(results_df)


                        MAE      RMSE  R2 Score
Linear Regression  11576.87  17017.13    0.7396
Random Forest       8033.81  13271.23    0.8416
Knn                11982.27  18958.57    0.6767
Decision Tree       9738.29  17494.83    0.7247
Support Vector     25629.71  33340.97    0.0003
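
The near-zero R² for Support Vector is consistent with SVR being fitted on unscaled inputs and an unscaled target: the features range from boolean dummies to resolutions in the thousands, and prices are in the tens of thousands. This explanation has not been verified on the laptop data. The sketch below uses synthetic data only, and shows one common remedy: standardizing the features in a pipeline and the target via TransformedTargetRegressor.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the laptop features: two columns on very different scales.
X = np.column_stack([rng.uniform(2, 32, 200), rng.uniform(0, 1000, 200)])
y = 3000 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 2000, 200)

# Baseline: default SVR on raw features and raw target.
raw_svr = SVR()

# Scaled variant: features standardized inside the pipeline, target standardized
# for fitting and mapped back to the original units for prediction.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)

raw_r2 = raw_svr.fit(X[:150], y[:150]).score(X[150:], y[150:])
scaled_r2 = scaled_svr.fit(X[:150], y[:150]).score(X[150:], y[150:])
print(f"raw SVR R2: {raw_r2:.3f}, scaled SVR R2: {scaled_r2:.3f}")
```

KNN is also distance-based, so the same feature scaling may help it. Tree-based models such as Random Forest are largely insensitive to feature scale.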


In [9]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

# Create base model
rf = RandomForestRegressor(random_state=42)

# Grid search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2, scoring='r2')

# Fit on training data
grid_search.fit(X_train, y_train)

# Best parameters and mean cross-validated R² (3-fold, computed on the training data only)
print("Best Parameters:", grid_search.best_params_)
print("Best R² Score:", grid_search.best_score_)

# Evaluate on test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

print("Test R² Score:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("Test MAE:", mean_absolute_error(y_test, y_pred))


Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best R² Score: 0.7530965663650427
Test R² Score: 0.8417602499210721
Test RMSE: 13264.537509555747
Test MAE: 8020.655953807509
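
To address the question of which features most affect price, a fitted random forest exposes impurity-based importances through feature_importances_. The sketch below runs on synthetic data where the price is driven mostly by Ram by construction. The column names echo the notebook but the values are made up, so the ranking says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in data; column names echo the notebook but values are made up.
X_demo = pd.DataFrame({
    "Ram": rng.choice([4, 8, 16, 32], size=300),
    "Weight": rng.uniform(1.0, 3.0, size=300),
    "Touchscreen": rng.integers(0, 2, size=300),
})
# Price here is driven mostly by Ram by construction.
y_demo = 5000 * X_demo["Ram"] + 1000 * X_demo["Touchscreen"] + rng.normal(0, 500, 300)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = (
    pd.Series(forest.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importances)
```

With the notebook's own objects, the equivalent would be pd.Series(best_rf.feature_importances_, index=X_train.columns). Impurity-based importances are split across one-hot columns and can favor high-cardinality features, so permutation importance on the test set is worth considering as a cross-check.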
