- Problem statement
- Approach to solve the problem
- Project description and architecture
- How to run this project
- Technologies used
- Sample test results
- Deployed version of this repo
- Docker Image
- Suppose someone is planning to buy a laptop for personal use and has already decided on the hardware configuration. The question is: what tentative budget should they consider?
- Suppose they want a laptop with an i3 processor and 32 GB RAM, but such a combination may not be available in the market.
In this project we build a system that predicts the expected price of a laptop based on its hardware configuration and also recommends laptops with a similar configuration available in the market.
There are some datasets available on the internet for this problem, but those datasets do not reflect the current price trend. Since the price of any commodity varies with market ups and downs, it is always recommended to collect the latest data for a commodity price prediction problem. That is why we scraped data on available laptops from an e-commerce website.
The raw data collected looks like this -
The data is cleaned first by extracting useful information (e.g., the maximum clock speed, screen size, resolution, etc.). In this dataset the target variable is the price of the laptop, so the relationships of different features with the target variable are also explored. After data cleaning, the categorical variables are encoded to numerical form using target-oriented feature encoding.
Please refer to the EDA notebook for more details.
After preprocessing the data looks like this -
We used the dataset to build a regression model that predicts price from the available features. We performed hyperparameter tuning on different models and saved the one with the best performance score, using the R2 score as the performance metric.
We have built a simple recommendation system that recommends the top 5 laptops most similar to the configuration selected by the user. It is built on Nearest Neighbours with cosine similarity.
The project is divided into several modules, each of which performs a predefined job. Below is a brief description of each module -
The module scrapes data from an e-commerce website in two steps -
- In step 1, all the product urls and ids are collected and saved in one file (productlinks.csv).
- In step 2, product-specific details (e.g., processor name, SSD_Capacity) are collected for each product url found in step 1. The data is saved as raw.csv.
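Step 1 can be sketched with the standard library's `html.parser`. The actual site's markup, CSS classes, and URL pattern are not shown in this repo, so everything below (the `/product/` pattern, the sample snippet) is an illustrative assumption, not the real scraper:

```python
from html.parser import HTMLParser

class ProductLinkParser(HTMLParser):
    """Collects hrefs from anchor tags -- step 1 of the scraper.

    The '/product/' URL pattern is an assumption; the real site's
    link structure may differ."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/product/" in href:  # assumed product-url pattern
                self.links.append(href)

# stand-in for one fetched listing page
html = '<a href="/product/123">Laptop A</a><a href="/about">About</a>'
parser = ProductLinkParser()
parser.feed(html)
# parser.links now holds the product urls to be saved to productlinks.csv
```

In the real module the fetched pages would come from HTTP requests and the collected links would be written out as `productlinks.csv`.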
The module has two purposes.
- validation_raw_data.py validates the data from the raw.csv file. It checks whether nan values are present in certain columns; if a nan value is found, the validation fails.
- validation_prediction.py checks the input data for prediction. It verifies the column names, the ordering of the columns, the data type of each column, and the presence of new categories in categorical features. If any of these checks fails, the validation fails.
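The prediction-time checks can be sketched in plain Python. The function and column names below are illustrative, not the module's actual API, and the unseen-category check is only noted in a comment:

```python
def validate_prediction_input(rows, schema):
    """Minimal sketch of the prediction-time checks: column names,
    column ordering, and per-column types.

    `schema` maps column name -> expected Python type, in the expected
    order. The real module additionally checks categorical features
    for categories never seen during training."""
    expected = list(schema)
    for row in rows:
        if list(row) != expected:  # names AND ordering must match
            return False
        if any(not isinstance(row[c], t) for c, t in schema.items()):
            return False
    return True

# illustrative schema -- actual column names may differ
schema = {"Processor_Name": str, "RAM_GB": int}
ok = validate_prediction_input([{"Processor_Name": "i5", "RAM_GB": 16}], schema)
bad = validate_prediction_input([{"RAM_GB": 16, "Processor_Name": "i5"}], schema)
```

Here `ok` passes while `bad` fails purely because the column ordering differs, matching the behaviour described above.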
In this module separate classes are defined for separate purposes. For example, the DropDuplicates class inside data_cleaning.py drops all duplicate columns in the dataset. Each class extends BaseEstimator and TransformerMixin, so each class has its own fit and transform methods. This is done to create customised pipelines.
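A minimal sketch of such a transformer, assuming scikit-learn and pandas. The class name and logic here are illustrative (it drops duplicate columns, as described above); the repo's actual implementation may differ:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropDuplicateColumns(BaseEstimator, TransformerMixin):
    """Drops columns that duplicate an earlier column (sketch)."""

    def fit(self, X, y=None):
        # decide which columns to keep, based on the training frame:
        # transposing turns duplicate columns into duplicate rows
        self.keep_ = X.T.drop_duplicates().index.tolist()
        return self

    def transform(self, X):
        return X[self.keep_]

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})
out = DropDuplicateColumns().fit_transform(df)  # column "b" is dropped
```

Because the class inherits from TransformerMixin it gets `fit_transform` for free, and inheriting BaseEstimator lets it slot into an sklearn `Pipeline` alongside the other custom steps.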
There are three pipelines defined inside preprocessor_main.py -
- DataCleaningPipeline: removes duplicate columns, drops columns that are not required for model building, extracts numerical values from string data (e.g., screen size, clock speed, screen resolution), and fills nan values in the Graphics_Memory and SSD_Capacity columns.
- EncodingPipeline: encodes categorical values (Processor Name, SSD, GPU and Touchscreen) to numerical form. The encoder model is saved as models/encoder.pkl because it is required during prediction.
- ImputerPipeline: some nan values remain in the Clock_Speed and Screen_Resolution columns; these are imputed using the KNN Imputer technique.
After running all the pipelines, the output data is stored as preprocessed.csv.
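The KNN imputation step can be sketched on toy numbers, assuming scikit-learn's `KNNImputer` (the real pipeline runs on the laptop dataframe, not this matrix):

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy stand-in for two numeric columns such as Clock_Speed and
# a numeric screen-resolution value; one Clock_Speed is missing
X = np.array([
    [2.4, 1920.0],
    [np.nan, 1920.0],
    [3.2, 1366.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# the nan is replaced by the average of its neighbours' values:
# (2.4 + 3.2) / 2 = 2.8
```

`KNNImputer` measures distances over the non-missing features, so rows with similar known specs supply the missing value, which is why it suits spec-style data like this better than a global mean.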
In this module separate classes are created to build separate models. In each class the workflow is as follows -
- The parameters are defined for that model.
- The preprocessed data is read, then split into train and test sets
- Hyperparameter tuning is done on the train data using grid search CV
- The performance metric used is R2 score
- The R2 score on test data is checked and the score is saved in application logs
- The best model is also saved inside /models
Inside the TrainBestModel class, four models (Linear Model, Decision Tree, Random Forest, XGBoost) are used for hyperparameter tuning and the model with the highest test score is chosen as the best one. The best model information is saved to the application logs and the best model itself is saved inside /models.
As per the latest training result, XGBoost was selected as the best model with a test R2 score of around 0.90.
The steps of building recommendation system are as follows:
- The dataset saved after the DataCleaningPipeline is used for building the recommendation system.
- Clock speed and screen size features are not used in recommendation system, so those columns are dropped.
- Every feature is converted to categorical and then one-hot encoded.
- The encoded data is then fit to a NearestNeighbours model, with cosine similarity as the distance metric.
- The Recommendation object is then saved inside /models.
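The steps above can be sketched with scikit-learn's `NearestNeighbors` on a toy one-hot matrix (the matrix and its columns are made up; the real one comes from the one-hot-encoded laptop configurations):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy one-hot encoded configurations (rows = laptops)
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
], dtype=float)

# cosine metric requires the brute-force algorithm in scikit-learn
nn = NearestNeighbors(n_neighbors=5, metric="cosine", algorithm="brute")
nn.fit(X)

# user-selected configuration -> top 5 most similar laptops
query = np.array([[1, 0, 1, 0]])
distances, indices = nn.kneighbors(query)
```

Rows identical to the query come back with cosine distance 0, so an exact match in the catalogue always heads the top-5 list. This fitted object is what gets pickled into /models as the Recommendation object.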
This module is used at runtime. Two separate classes are created inside it -
- PredictionPipeline: de-serialises the encoder and best-model objects; input data is first passed to the encoder and then to the best model for prediction.
- RecommendationPipeline: de-serialises the Recommendation object; input data is fed to the recommender and the top five results are returned.
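The de-serialise-then-predict flow can be sketched with toy stand-ins. The real artefacts are the pickled encoder and sklearn model in /models; the file name, "encoder", and "model" shapes below are purely illustrative:

```python
import os
import pickle
import tempfile

# --- training time: save stand-in artefacts, as /models would hold ---
encoder_map = {"i3": 0, "i5": 1, "i7": 2}            # toy "encoder"
model_coefs = {"intercept": 300.0, "slope": 150.0}   # toy "model"

path = os.path.join(tempfile.mkdtemp(), "artifacts.pkl")
with open(path, "wb") as f:
    pickle.dump({"encoder": encoder_map, "model": model_coefs}, f)

# --- runtime: the pipeline de-serialises, encodes, then predicts ---
with open(path, "rb") as f:
    artifacts = pickle.load(f)

code = artifacts["encoder"]["i5"]  # step 1: encode the raw input
price = artifacts["model"]["intercept"] + artifacts["model"]["slope"] * code
```

The order matters: the input must pass through the same encoder that was fitted at training time before the model ever sees it, which is why both objects are persisted together.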
The purpose of this module is to save logs to a database. Each log has a module name (e.g., training or preprocessing), a timestamp, a type (success or failure) and a log message. During development the logs were stored in a plain text file, but logging was migrated to MongoDB Atlas afterwards. The connection url is saved inside the .env file, which is not shared for security reasons.
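A log document in the shape described above might be built like this. The field names are assumptions based on the description, and the pymongo insert is shown only as a comment because it needs a live Atlas connection:

```python
from datetime import datetime, timezone

def make_log(module: str, status: str, message: str) -> dict:
    """Build one log document: module name, timestamp,
    type (success/failure) and message. Field names are illustrative."""
    return {
        "module": module,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": status,
        "message": message,
    }

log = make_log("training", "success", "best model saved to /models")

# With pymongo the document would be inserted roughly like this,
# reading the private connection details from the .env variables:
#   import os
#   from pymongo import MongoClient
#   client = MongoClient(os.environ["CONN_URL"])
#   client[os.environ["DB"]][os.environ["APP_LOGS"]].insert_one(log)
```

Keeping the document flat like this makes it easy to filter logs in Atlas by module or by type.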
There are two scripts in the project root folder; these are the two entrypoints.
retrain.py scrapes fresh data, preprocesses it and retrains the model. app.py runs the Streamlit web application: it de-serialises and initialises the PredictionPipeline and RecommendationPipeline, then starts the Streamlit app.
To run this project, perform the following steps -
- Clone the project to your system.
- Create a virtual environment.
- Run `pip install -r requirements.txt` to install the packages.
- For application logging, follow these steps:
  - Create a database in MongoDB Atlas manually. The name of the database should be defined inside the .env file as the variable `DB`.
  - Create a collection inside this database. The name of the collection should be defined inside the .env file as the variable `APP_LOGS`.
  - The database connection url should be defined inside the .env file as the variable `CONN_URL`.
- Run `streamlit run app.py` from the command prompt inside the root folder and the web application will start.
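A sample `.env` with placeholder values (the real database name, collection name and connection string are private and not part of the repository):

```
DB=<your-database-name>
APP_LOGS=<your-collection-name>
CONN_URL=mongodb+srv://<user>:<password>@<cluster-host>/
```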