Data Professions salaries - Analysing and Data Science

[![LinkedIn][linkedin-shield]][linkedin-url]

A short prologue:

My first approach to this project was to follow Ken Jee's senior-level work and take from it whatever was useful to me.

As I worked, however, I realized that my skills were not as far behind as I had thought, and as I added and improved things, the work became more clearly my own. I therefore decided to upload it as a project in its own right.

At first I was not going to share this and had set the repository to "private", but after finding myself fixing issues that other people were asking about here and on Stack Overflow, I decided to make it public.

This is where everything started:

Ken Jee's original DS analysis code: ds_salary_proj
Ken Jee's YouTube video explaining his project
Ken Jee's GitHub profile

Index

Table of Contents:

About The Project
- Project Overview
- Built With
Getting Started
- Prerequisites
- Installation
Stages overview

About The Project

Project résumé:

Salary comparison between the 2020 and 2021 datasets.
Created a tool that estimates salaries for data-related professions (MAE ~ $12K) to help data scientists negotiate their income when they get a job.
Scraped two datasets of over 1000 job descriptions each from Glassdoor using Python and Selenium (2020 and 2021).
Engineered features from the text of each job description to quantify the value companies put on keywords related to data professions in each year. Some keywords: python, excel, aws, spark, among others.
Analyzed many potential correlations for other possible analysis objectives.
Optimized Linear, Lasso, and Random Forest regressors using GridSearchCV to reach the best model.
Built a ~~not so practical~~ client-facing API using Flask (a great experience that I enjoyed).

Built With

Python Version: 3.8.8
Framework: Anaconda (Jupyter Notebook and Spyder) and Kaggle Kernel.
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle, statsmodels.
2020 dataset and the original idea:
- Ken Jee's code and dataset: ds_salary_proj
- Explanation video: Ken Jee's YouTube video explaining his project
Scraping: You can find the original code (from Ömer Sakarya) here:
- Github code.
- Explanation article.
Flask Productionization:
- Git Code

Many, MANY thanks to Ken, Ömer and GeekDataGuy.

Installation

Clone the repo

git clone https://github.com/echestare/001KenJeeFromScratch_DSSalary.git

Stages overview

Web Scraping

The web scraper tweaked by Ken Jee was re-tweaked and updated by me (2021/09/30). It was set up to scrape 1000+ job postings from glassdoor.com. The information extracted and stored as .csv is:


|  |  |  |
|---|---|---|
| Job title | Salary Estimate | Job Description |
| Rating | Company | Location |
| Company Headquarters | Company Size | Company Founded Date |
| Type of Ownership | Industry | Sector |
| Revenue | Competitors | - |

Data Cleaning

The scraped data needed to be cleaned up so that it was usable for the model.

The following changes were made:

Basic cleaning:
- Dropped duplicates and filled NaN values.
- Removed the first column (a false index) and rows without a salary.
- Reset the index.
Parsed numeric data out of the salary column as min, max and avg _salary:
- Took into account salaries given as hourly rates.
- Added columns for employer-provided salary and hourly wages.
Parsed the rating out of the company text and removed undesired characters.
Made a new column for the company state and cleaned it.
Transformed the founded date into the age of the company.
Made columns indicating whether each of the following skills was listed in the job description:


|  |  |  |  |  |  |
|---|---|---|---|---|---|
| Python | R | Spark | AWS | Excel | SQL |
| SAS | d3.js | Julia | Jupyter | Keras | MatLab |
| MatPlotLib | PyTorch | Scikit-Learn | Tensor Flow | Weka | Selenium |
| Hadoop | Tableau | Power BI | BigML | RapidMiner | Apache Flink |
| DataRobot | SAP Hana | Mongo DB | Trifacta | MiniTab | Kafka |
| MicroStrategy | Google Analytics | SPSS | - | - | - |

Column for the simplified job title.
Column for seniority:
- derived from the job title and from the job description.
Column for job description length:
- by number of characters and by number of words.
Cleaned other columns:
- Size
- Type of ownership
- Revenue
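
The salary parsing step listed above can be sketched roughly as follows. This is a minimal illustration and not the code in data_cleaning.py: the function name `parse_salary`, the sample strings, and the constant of 2,000 working hours per year used to annualize hourly wages are all assumptions made for the example.

```python
import re

HOURS_PER_YEAR = 2000  # assumption used to annualize hourly wages


def parse_salary(text):
    """Return min/max/avg salary in thousands of USD per year, or None."""
    lowered = text.lower()
    hourly = "per hour" in lowered
    employer_provided = "employer provided salary" in lowered

    # Drop anything in parentheses, e.g. "(Glassdoor est.)", then grab the numbers.
    cleaned = re.sub(r"\(.*?\)", "", lowered)
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", cleaned)]
    if not numbers:
        return None  # no usable salary in this string

    low, high = numbers[0], numbers[-1]
    if hourly:
        # Hourly dollars -> yearly thousands of dollars.
        low = low * HOURS_PER_YEAR / 1000
        high = high * HOURS_PER_YEAR / 1000

    return {
        "min_salary": low,
        "max_salary": high,
        "avg_salary": (low + high) / 2,
        "hourly": int(hourly),
        "employer_provided": int(employer_provided),
    }


print(parse_salary("$53K-$91K (Glassdoor est.)"))
print(parse_salary("Employer Provided Salary:$80K-$100K"))
print(parse_salary("$17-$24 Per Hour(Glassdoor est.)"))
```

Keeping the `hourly` and `employer_provided` flags as separate 0/1 columns means the information is not lost once the salary text itself is reduced to numbers.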

NOTE: During the EDA I added a new column with the number of keywords found in each job description.
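
As a rough sketch of how the skill columns and the keyword count could be derived from a job description: the helper `skill_flags`, the shortened `SKILLS` list and the sample text below are hypothetical, and the notebooks may match keywords differently.

```python
import re

# A handful of the skills from the table above (illustrative subset).
SKILLS = ["python", "excel", "aws", "spark", "sql"]


def skill_flags(description):
    """Return {skill: 0/1} for each skill plus a total keyword count."""
    text = description.lower()
    flags = {
        # \b keeps "excel" from matching inside "excellent".
        skill: int(re.search(rf"\b{re.escape(skill)}\b", text) is not None)
        for skill in SKILLS
    }
    flags["keyword_count"] = sum(flags.values())
    return flags


example = "We need strong Python and SQL skills; AWS experience is a plus."
print(skill_flags(example))
```

Whole-word matching is a deliberate choice here: a plain substring test would flag very short names such as "R" or "SAS" in unrelated words.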

Exploratory Data Analysis

The approach was to inspect each categorical variable and look for direct correlations with the salary distribution as well as among the variables themselves. The analysis was extensive and interesting, and it is explained in detail in the respective notebook.

Finally, I repeated each analysis with the new dataset and compared the main characteristics.

Model Building

Working with the first dataset (2020), I split the data into train (80%) and test (20%) sets and transformed the categorical variables into dummy variables.

First, three models were evaluated using Mean Absolute Error, because it is not very sensitive to atypical errors, and outliers are not particularly harmful for this kind of model. Models used (I think Ken Jee's approach is the correct one):

Multiple Linear Regression – Baseline for the model
Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
Random Forest – Again, with the sparsity associated with the data, I thought that this would be a good fit.
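
The workflow described above (dummy variables, an 80/20 split, and three regressors tuned with GridSearchCV and scored by MAE) might look roughly like the sketch below. It runs on a small synthetic dataset, so the column names, parameter grids and resulting numbers are assumptions for illustration and do not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "job_simp": rng.choice(["analyst", "data scientist", "data engineer"], n),
    "python": rng.integers(0, 2, n),
    "rating": rng.uniform(2.5, 5.0, n).round(1),
})
base = df["job_simp"].map({"analyst": 60, "data engineer": 95, "data scientist": 110})
df["avg_salary"] = base + 10 * df["python"] + rng.normal(0, 8, n)

# Categorical columns become dummy (one-hot) variables.
X = pd.get_dummies(df.drop(columns="avg_salary"))
y = df["avg_salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each model gets a small, illustrative parameter grid.
candidates = {
    "linear": (LinearRegression(), {"fit_intercept": [True, False]}),
    "lasso": (Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0]}),
    "random_forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": [50, 100], "max_features": [1.0, "sqrt"]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # scikit-learn maximizes scores, hence the negated MAE.
    search = GridSearchCV(estimator, grid, scoring="neg_mean_absolute_error", cv=3)
    search.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, search.predict(X_test))

for name, mae in results.items():
    print(f"{name}: test MAE = {mae:.2f}")
```

The test set is only touched once per model, after cross-validated tuning on the training split, so the reported MAE is not influenced by the hyperparameter search.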

Model performance

The Random Forest model far outperformed the other approaches on the test and validation sets.
* **Random Forest**: MAE = 11.22
* **Linear Regression**: MAE = 18.86
* **Lasso Regression**: MAE = 19.67
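
For reference, MAE is the average of the absolute differences between predicted and actual values. The sketch below uses made-up numbers, not project data, and compares MAE with RMSE to illustrate why a single atypical error moves MAE less.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, shown only for comparison."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical salaries in thousands; the last one is atypically high.
actual = [80, 95, 110, 120, 250]
predicted = [85, 90, 105, 125, 150]

print(mae(actual, predicted))   # 24.0
print(rmse(actual, predicted))  # noticeably larger, dominated by the one big miss
```

Four errors of 5 and one of 100 give an MAE of 24, while squaring the errors lets the single large miss dominate the RMSE.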

Productionization

I had a lot of fun at this stage. It was quite interesting, since I had not used Flask before. I must also say, though, that the final result is not polished enough to be considered a good finish. Still, I am leaving it here because of how much I enjoyed it.

The description of this stage is:

In this step, a Flask API endpoint was built and hosted on a local web server by following along with the TDS tutorial. The API endpoint takes in a request with a list of values from a job listing and returns an estimated salary.
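
A minimal sketch of such an endpoint is shown below. The route name `/predict`, the JSON keys `input` and `response`, and the `MeanModel` stand-in are assumptions for illustration; the actual app would load the trained regressor (for example from a pickle file) instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class MeanModel:
    """Hypothetical stand-in for the trained regressor."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


model = MeanModel()


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid body.
    payload = request.get_json(silent=True) or {}
    features = payload.get("input")
    if not isinstance(features, list) or not features:
        return jsonify({"error": "expected a JSON body with an 'input' list"}), 400
    estimate = model.predict([features])[0]
    return jsonify({"response": float(estimate)})


# Exercise the endpoint in-process with Flask's test client.
with app.test_client() as client:
    reply = client.post("/predict", json={"input": [1, 0, 4.2]})
    print(reply.status_code, reply.get_json())
```

The test client is used here so the example runs without starting a server; serving it locally would instead go through `flask run` or `app.run()`.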
