- Problem statement
- Approach to solve the problem
- Project description and architecture
- How to run this project
- Technologies used
- Sample test results
- Deployed version of this repo
- Docker Image
- Suppose someone is planning to buy a laptop for personal use and has already decided on the hardware configuration. The question is: what tentative budget should they consider?
- Suppose they want a laptop with an i3 processor and 32 GB RAM, but such a combination may not be available in the market.
In this project we build a system that predicts the expected price of a laptop based on its hardware configuration and also recommends laptops with a similar configuration available in the market.
There are some datasets available on the internet for this problem, but those datasets do not reflect the current price trend. Since the price of any commodity varies with market ups and downs, it is always recommended to collect the latest data for a commodity price prediction problem. That is why we scraped data on available laptops from an e-commerce website.
The raw data collected looks like this -
The data is cleaned first by extracting useful information (e.g., the maximum clock speed, screen size, resolution, etc.). In this dataset the target variable is the price of the laptop, so the relationships of different features with the target variable are also explored. After data cleaning, the categorical variables are encoded to numerical form using target-oriented feature encoding.
Please refer to the EDA notebook for more details.
After preprocessing the data looks like this -
We used the dataset to build a regression model that predicts price from the available features. We performed hyperparameter tuning on different models and saved the one with the best performance score, using the R2 score as the performance metric.
We have built a simple recommendation system that recommends the top 5 laptops most similar to the configuration selected by the user. It is built on Nearest Neighbours with cosine similarity.
The project is divided into several modules, each of which performs a predefined job. Below is a brief description of each module -
The module scrapes data from an e-commerce website in two steps -
- In step 1, all the product urls and ids are collected and saved in one file (productlinks.csv).
- In step 2, product-specific details (e.g., processor name, SSD_Capacity) are collected for each product url found in step 1. The data is saved as raw.csv.
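Step 1 can be sketched with the standard library's `html.parser`. The actual site's markup, CSS classes, and URL pattern are not shown in this repo, so everything below (the `/product/` pattern, the sample snippet) is an illustrative assumption, not the real scraper:

```python
from html.parser import HTMLParser

class ProductLinkParser(HTMLParser):
    """Collects hrefs from anchor tags -- step 1 of the scraper.

    The '/product/' URL pattern is an assumption; the real site's
    link structure may differ."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/product/" in href:  # assumed product-url pattern
                self.links.append(href)

# stand-in for one fetched listing page
html = '<a href="/product/123">Laptop A</a><a href="/about">About</a>'
parser = ProductLinkParser()
parser.feed(html)
# parser.links now holds the product urls to be saved to productlinks.csv
```

In the real module the fetched pages would come from HTTP requests and the collected links would be written out as `productlinks.csv`.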
The module has two purposes.
- validation_raw_data.py validates the data from the raw.csv file. It checks whether nan values are present in certain columns; if a nan value is found, the validation fails.
- validation_prediction.py checks the input data for prediction. It verifies the column names, the ordering of the columns, the data type of each column, and the presence of new categories in categorical features. If any of these checks fails, the validation fails.
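The prediction-time checks can be sketched in plain Python. The function and column names below are illustrative, not the module's actual API, and the unseen-category check is only noted in a comment:

```python
def validate_prediction_input(rows, schema):
    """Minimal sketch of the prediction-time checks: column names,
    column ordering, and per-column types.

    `schema` maps column name -> expected Python type, in the expected
    order. The real module additionally checks categorical features
    for categories never seen during training."""
    expected = list(schema)
    for row in rows:
        if list(row) != expected:  # names AND ordering must match
            return False
        if any(not isinstance(row[c], t) for c, t in schema.items()):
            return False
    return True

# illustrative schema -- actual column names may differ
schema = {"Processor_Name": str, "RAM_GB": int}
ok = validate_prediction_input([{"Processor_Name": "i5", "RAM_GB": 16}], schema)
bad = validate_prediction_input([{"RAM_GB": 16, "Processor_Name": "i5"}], schema)
```

Here `ok` passes while `bad` fails purely because the column ordering differs, matching the behaviour described above.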
In this module separate classes are defined for separate purposes. For example, the DropDuplicates class inside data_cleaning.py drops all duplicate columns in the dataset. Each class extends BaseEstimator and TransformerMixin, so each class has its own fit and transform methods. This is done to create customised pipelines.
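A minimal sketch of such a transformer, assuming scikit-learn and pandas. The class name and logic here are illustrative (it drops duplicate columns, as described above); the repo's actual implementation may differ:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropDuplicateColumns(BaseEstimator, TransformerMixin):
    """Drops columns that duplicate an earlier column (sketch)."""

    def fit(self, X, y=None):
        # decide which columns to keep, based on the training frame:
        # transposing turns duplicate columns into duplicate rows
        self.keep_ = X.T.drop_duplicates().index.tolist()
        return self

    def transform(self, X):
        return X[self.keep_]

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})
out = DropDuplicateColumns().fit_transform(df)  # column "b" is dropped
```

Because the class inherits from TransformerMixin it gets `fit_transform` for free, and inheriting BaseEstimator lets it slot into an sklearn `Pipeline` alongside the other custom steps.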
There are three pipelines defined inside preprocessor_main.py -
- DataCleaningPipeline: removes duplicate columns, drops columns that are not required for model building, extracts numerical values from string data (e.g., screen size, clock speed, screen resolution), and fills nan values in the Graphics_Memory and SSD_Capacity columns.
- EncodingPipeline: encodes categorical values (Processor Name, SSD, GPU and Touchscreen) to numerical form. The encoder model is saved as models/encoder.pkl because it is required during prediction.
- ImputerPipeline: some nan values remain in the Clock_Speed and Screen_Resolution columns; these are imputed using the KNN Imputer technique.
After running all the pipelines, the output data is stored as preprocessed.csv.
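The KNN imputation step can be sketched on toy numbers, assuming scikit-learn's `KNNImputer` (the real pipeline runs on the laptop dataframe, not this matrix):

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy stand-in for two numeric columns such as Clock_Speed and
# a numeric screen-resolution value; one Clock_Speed is missing
X = np.array([
    [2.4, 1920.0],
    [np.nan, 1920.0],
    [3.2, 1366.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# the nan is replaced by the average of its neighbours' values:
# (2.4 + 3.2) / 2 = 2.8
```

`KNNImputer` measures distances over the non-missing features, so rows with similar known specs supply the missing value, which is why it suits spec-style data like this better than a global mean.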
In this module separate classes are created to build separate models. In each class the workflow is as follows -
- The parameters are defined for that model.
- The preprocessed data is read, then split into train and test sets
- Hyperparameter tuning is done on the train data using grid search CV
- The performance metric used is R2 score
- The R2 score on test data is checked and the score is saved in application logs
- The best model is also saved inside /models
Inside the TrainBestModel class, four models (Linear Model, Decision Tree, Random Forest, XGBoost) are used for hyperparameter tuning and the model with the highest test score is chosen as the best one. The best model information is saved to the application logs and the best model itself is saved inside /models.
As per the latest training result, XGBoost was selected as the best model with a test R2 score of around 0.90.
The steps of building recommendation system are as follows:
- The dataset saved after the DataCleaningPipeline is used for building the recommendation system.
- Clock speed and screen size features are not used in recommendation system, so those columns are dropped.
- Every feature is converted to categorical and then one-hot encoded.
- The encoded data is then fit to a NearestNeighbours model, with cosine similarity as the distance metric.
- The Recommendation object is then saved inside /models.
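The steps above can be sketched with scikit-learn's `NearestNeighbors` on a toy one-hot matrix (the matrix and its columns are made up; the real one comes from the one-hot-encoded laptop configurations):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy one-hot encoded configurations (rows = laptops)
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
], dtype=float)

# cosine metric requires the brute-force algorithm in scikit-learn
nn = NearestNeighbors(n_neighbors=5, metric="cosine", algorithm="brute")
nn.fit(X)

# user-selected configuration -> top 5 most similar laptops
query = np.array([[1, 0, 1, 0]])
distances, indices = nn.kneighbors(query)
```

Rows identical to the query come back with cosine distance 0, so an exact match in the catalogue always heads the top-5 list. This fitted object is what gets pickled into /models as the Recommendation object.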
This module is used at runtime. Two separate classes are created inside it -
- PredictionPipeline: de-serialises the encoder and best-model objects; input data is first passed to the encoder and then to the best model for prediction.
- RecommendationPipeline: de-serialises the Recommendation object; input data is fed to the recommender and the top five results are returned.
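The de-serialise-then-predict flow can be sketched with toy stand-ins. The real artefacts are the pickled encoder and sklearn model in /models; the file name, "encoder", and "model" shapes below are purely illustrative:

```python
import os
import pickle
import tempfile

# --- training time: save stand-in artefacts, as /models would hold ---
encoder_map = {"i3": 0, "i5": 1, "i7": 2}            # toy "encoder"
model_coefs = {"intercept": 300.0, "slope": 150.0}   # toy "model"

path = os.path.join(tempfile.mkdtemp(), "artifacts.pkl")
with open(path, "wb") as f:
    pickle.dump({"encoder": encoder_map, "model": model_coefs}, f)

# --- runtime: the pipeline de-serialises, encodes, then predicts ---
with open(path, "rb") as f:
    artifacts = pickle.load(f)

code = artifacts["encoder"]["i5"]  # step 1: encode the raw input
price = artifacts["model"]["intercept"] + artifacts["model"]["slope"] * code
```

The order matters: the input must pass through the same encoder that was fitted at training time before the model ever sees it, which is why both objects are persisted together.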
The purpose of this module is to save logs to a database. Each log has a module name (e.g., training or preprocessing), a timestamp, a type (success or failure) and a log message. During development the logs were stored in a plain text file, but logging was migrated to MongoDB Atlas afterwards. The connection url is saved inside the .env file, which is not shared for security reasons.
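A log document in the shape described above might be built like this. The field names are assumptions based on the description, and the pymongo insert is shown only as a comment because it needs a live Atlas connection:

```python
from datetime import datetime, timezone

def make_log(module: str, status: str, message: str) -> dict:
    """Build one log document: module name, timestamp,
    type (success/failure) and message. Field names are illustrative."""
    return {
        "module": module,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": status,
        "message": message,
    }

log = make_log("training", "success", "best model saved to /models")

# With pymongo the document would be inserted roughly like this,
# reading the private connection details from the .env variables:
#   import os
#   from pymongo import MongoClient
#   client = MongoClient(os.environ["CONN_URL"])
#   client[os.environ["DB"]][os.environ["APP_LOGS"]].insert_one(log)
```

Keeping the document flat like this makes it easy to filter logs in Atlas by module or by type.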
There are two scripts in the project root folder; these are the two entrypoints.
retrain.py scrapes fresh data, preprocesses it and retrains the model. app.py runs the Streamlit web application: it de-serialises and initialises the PredictionPipeline and RecommendationPipeline, then starts the Streamlit app.
To run this project, perform the following steps -
- Clone the project to your system.
- Create a virtual environment.
- Run `pip install -r requirements.txt` to install the packages.
- For application logging, follow these steps:
  - Create a database in MongoDB Atlas manually. The name of the database should be defined inside the .env file as the variable `DB`.
  - Create a collection inside this database. The name of the collection should be defined inside the .env file as the variable `APP_LOGS`.
  - The database connection url should be defined inside the .env file as the variable `CONN_URL`.
- Run `streamlit run app.py` from the command prompt inside the root folder and the web application will start.
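A sample `.env` with placeholder values (the real database name, collection name and connection string are private and not part of the repository):

```
DB=<your-database-name>
APP_LOGS=<your-collection-name>
CONN_URL=mongodb+srv://<user>:<password>@<cluster-host>/
```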