Salary Prediction Project

Project Overview

Created a tool that estimates salaries for different data science roles, to help with salary negotiation with a future employer.
Scraped about 1000 job postings from Glassdoor using Python and Selenium.
Engineered features from the job descriptions to flag postings that mention Python, R, AWS, Spark, SQL, Tableau, or Excel.
Optimised Multiple Linear, Support Vector, and Random Forest regressors using RandomizedSearchCV.
Built a client-facing API using Flask.

Resources

Python Version: 3.7.6

Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle

Web Framework Requirements: pip install -r requirements.txt

Project inspiration from Ken Jee's project walkthrough: https://www.youtube.com/playlist?list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t

Scraper used to scrape Glassdoor job postings: https://github.com/arapfaik/scraping-glassdoor-selenium

Client API using Flask: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2

Web Scraping for job postings from Glassdoor.com

Scraped 1000 job postings from glassdoor.com. The following fields were scraped from each posting:

Job title
Salary Estimate
Job Description
Rating
Company
Location
Company Headquarters
Company Size
Company Founded Date
Type of Ownership
Industry
Sector
Revenue
Competitors

Data Preprocessing

Parsed numeric data out of salary
Parsed rating out of company text
Made a new column for company state
Made columns for employer-provided salary and hourly wages
Removed job postings without salary
Added a column indicating whether the job is located at the company’s headquarters
Transformed founded date into age of company
Made columns indicating whether each of the following skills is listed in the job description:
- Python
- SQL
- Tableau
- R
- Excel
- AWS
- Spark
Added columns for simplified job title and seniority
Added a column for description length
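
A few of the preprocessing steps above can be sketched with the standard library alone. This is only an illustration, not the code in data_cleaning.py: the function names, the output keys, and the assumed salary text format (for example "$53K-$91K (Glassdoor est.)", with "Per Hour" and "Employer Provided Salary" markers, and a non-positive founded year meaning "unknown") are all assumptions.

```python
import re

# Skills flagged in the job description (taken from the list above).
SKILLS = ("python", "sql", "tableau", "r", "excel", "aws", "spark")


def parse_salary(text):
    """Parse an assumed Glassdoor-style salary string into numeric fields.

    Returns None when the text does not contain exactly two dollar amounts,
    which corresponds to dropping postings without a salary.
    """
    numbers = [int(n) for n in re.findall(r"\$(\d+)", text)]
    if len(numbers) != 2:
        return None
    low, high = numbers
    lowered = text.lower()
    return {
        "min_salary": low,
        "max_salary": high,
        "avg_salary": (low + high) / 2,
        "hourly": int("per hour" in lowered),
        "employer_provided": int("employer provided salary" in lowered),
    }


def skill_flags(description):
    """Return a 0/1 flag per skill, matching whole words only.

    Whole-word matching keeps "excellent" from counting as "excel", but a
    lone "r" can still produce false positives (for example in "R&D").
    """
    words = set(re.findall(r"[a-z]+", description.lower()))
    return {skill: int(skill in words) for skill in SKILLS}


def company_age(founded_year, reference_year):
    """Convert a founded year into an age; None if the year is unknown."""
    if founded_year <= 0:
        return None
    return reference_year - founded_year
```

In a pandas workflow, functions like these would typically be applied column-wise, for example with Series.apply.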

Exploratory Data Analysis

Below are some of the important findings from the exploratory data analysis:

Model Building and Evaluation

Categorical variables were transformed into dummy variables.
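
To make the dummy-variable step concrete, here is a minimal pure-Python sketch of one-hot encoding a single categorical column. In the project this was most likely done with a library call such as pandas.get_dummies; the helper name to_dummies and the column names in the usage example are hypothetical.

```python
def to_dummies(rows, column):
    """Replace one categorical column with a 0/1 column per category."""
    # Sort the categories so the output column order is deterministic.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical rows with one categorical and one numeric field.
rows = [
    {"job_simp": "analyst", "rating": 3.5},
    {"job_simp": "data scientist", "rating": 4.0},
]
encoded = to_dummies(rows, "job_simp")
```

Each category becomes its own column, so a dataset with many distinct categories ends up with a wide matrix that is mostly zeros.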

Three different models were built and tuned. They were evaluated using Mean Absolute Error (MAE).
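
For reference, MAE is the average of the absolute differences between actual and predicted values, so it is expressed in the same units as the salary target. A minimal sketch follows; the numbers in the usage example are made up for illustration and are not project results.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(actual - predicted) for actual, predicted in zip(y_true, y_pred))
    return total / len(y_true)


# Illustrative values only: absolute errors are 10, 5 and 15, so MAE is 10.0.
example_mae = mean_absolute_error([100, 80, 120], [90, 85, 135])
```

scikit-learn provides the same metric as sklearn.metrics.mean_absolute_error.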

Performance of the models is given below:

MAE for Multiple Linear Regression : 19.67078838952545
MAE for Support Vector Regression : 30.088913715302816
MAE for Random Forest : 14.452240720416757

Random Forest gives the lowest error of the three models, which suggests that it copes better with the sparse matrix produced by the dummy variables.

Productionization

In the final step, I built a Flask API endpoint, hosted locally, with the help of the article referenced above. Given a list of feature values as input, the endpoint returns the salary predicted by the model built earlier.
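
The request-handling logic behind such an endpoint can be sketched independently of Flask. This is a simplified illustration under stated assumptions, not the code in the FlaskAPI folder: the "input" and "response" JSON keys, the handle_predict function, and the StubModel class are all hypothetical. In the real application a function like this would be called from a Flask route, and the model would be the trained regressor loaded with pickle.

```python
import json


class StubModel:
    """Hypothetical stand-in for the trained regressor.

    It only mimics the scikit-learn convention of predict() taking a list
    of feature rows and returning one prediction per row.
    """

    def predict(self, rows):
        return [sum(row) for row in rows]


def handle_predict(body, model):
    """Turn a JSON request body into a JSON response with one prediction."""
    payload = json.loads(body)
    features = payload["input"]  # assumed key holding the feature list
    prediction = model.predict([features])[0]
    return json.dumps({"response": prediction})


# Usage with the stub model; a real model would return a salary estimate.
reply = handle_predict('{"input": [1, 2, 3]}', StubModel())
```

Keeping the prediction logic in a plain function like this makes it possible to test it without starting a web server.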
