# Analysing Data Scientist Salaries

In this notebook, we analyze the [Kaggle Jobs Dataset from Glassdoor](https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor), which contains job postings from Glassdoor.com from 2017.

We aim to analyze the dataset considering the following research questions:

- What are the key factors that affect data science salaries? 
- Can we predict the salary of data science positions based on the job postings?  

## Set up the dataset on HDFS for big data analysis

HDFS lets us store and retrieve the data efficiently for our analysis. Because it distributes the data across nodes, the data can be read in parallel.

Before loading the data into HDFS, we added the missing header **ID** to the files `eda_data.csv` and `glassdoor_jobs.csv`.
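
The header fix can also be scripted instead of done by hand. The following is a minimal sketch using only the standard library; the function name `add_id_header` and the output path are hypothetical, and it assumes the only problem is an empty first header cell.

```python
import csv


def add_id_header(src_path, dst_path, header_name="ID"):
    """Copy a CSV file and name the first header cell if it is empty."""
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        if header and header[0].strip() == "":
            header[0] = header_name
        writer.writerow(header)
        # Remaining records are copied unchanged, including multi-line fields
        writer.writerows(reader)
```

A call such as `add_id_header('./dataset/eda_data.csv', './dataset/eda_data_with_id.csv')` would write a corrected copy and leave the original file untouched.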

In [None]:
# TODO: SET UP COMMANDS

## EDA

In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
import pyspark.sql.functions as F
from pyspark import SparkContext, SparkConf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()


conf = SparkConf().set('spark.ui.port', '4040')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.appName('Data Scientist Salaries').master('local[*]').getOrCreate()

24/10/17 10:20:47 WARN Utils: Your hostname, MacBook-Pro-Eric.local resolves to a loopback address: 127.0.0.1; using 213.112.119.200 instead (on interface en0)
24/10/17 10:20:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/17 10:20:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
spark

In [167]:
# read the dataset, escape quotes inside the job descriptions
df = spark.read.csv('./dataset/eda_data.csv', header=True, inferSchema=True, multiLine=True, quote='"', escape='"', mode='PERMISSIVE')
df.show(5)

+---+--------------------+--------------------+--------------------+------+--------------------+---------------+--------------+--------------------+-------+------------------+--------------------+--------------------+--------------------+--------------------+------+-----------------+----------+----------+----------+--------------------+---------+----------+---+---------+----+-----+---+-----+--------------+---------+--------+--------+
| ID|           Job Title|     Salary Estimate|     Job Description|Rating|        Company Name|       Location|  Headquarters|                Size|Founded| Type of ownership|            Industry|              Sector|             Revenue|         Competitors|hourly|employer_provided|min_salary|max_salary|avg_salary|         company_txt|job_state|same_state|age|python_yn|R_yn|spark|aws|excel|      job_simp|seniority|desc_len|num_comp|
+---+--------------------+--------------------+--------------------+------+--------------------+---------------+------------
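
The `quote='"'` and `escape='"'` options in the read call tell Spark that a literal quote inside a quoted field is written as two consecutive quotes, whereas Spark's default escape character is the backslash. The sketch below illustrates that convention with Python's standard `csv` module, which uses doubled quotes by default. The sample record is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Hypothetical record: the description contains a quoted word (written as
# doubled quotes) and a comma, both inside one quoted field.
raw = 'ID,Job Title,Job Description\r\n0,Data Scientist,"Join our ""AI"" team, remote"\r\n'

rows = list(csv.reader(io.StringIO(raw, newline="")))
header, record = rows
description = record[2]

print(len(record))   # number of fields in the record
print(description)   # description with the doubled quotes collapsed
```

The record keeps its three fields: the comma and the quotes inside the description are treated as field content rather than as delimiters.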

In [168]:
print('Number of datapoints:', df.count())
print('Number of columns:', len(df.columns))

Number of datapoints: 742
Number of columns: 33


**Show schema:**

In [169]:
df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Job Title: string (nullable = true)
 |-- Salary Estimate: string (nullable = true)
 |-- Job Description: string (nullable = true)
 |-- Rating: double (nullable = true)
 |-- Company Name: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Headquarters: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Founded: integer (nullable = true)
 |-- Type of ownership: string (nullable = true)
 |-- Industry: string (nullable = true)
 |-- Sector: string (nullable = true)
 |-- Revenue: string (nullable = true)
 |-- Competitors: string (nullable = true)
 |-- hourly: integer (nullable = true)
 |-- employer_provided: integer (nullable = true)
 |-- min_salary: integer (nullable = true)
 |-- max_salary: integer (nullable = true)
 |-- avg_salary: double (nullable = true)
 |-- company_txt: string (nullable = true)
 |-- job_state: string (nullable = true)
 |-- same_state: integer (nullable = true)
 |-- age: integer (nullable = 

### Data Cleaning

First, we clean the dataset: records with missing values are either filled in or dropped.

**Handling of irrelevant features:**

The dataset contains many columns that are interesting and could be used for the predictions. To reduce dimensionality, and with it the number of missing values that need to be filled, we drop the features that are not necessary for our analysis. Since the `Job Description` column contains a long string of text, we drop it and use `desc_len` as an indicator of how detailed each job description is. We also drop the columns `same_state` and `age`, because we are not interested in where an individual wants to work or in their personal attributes. `Company Name` is just an unclean version of `company_txt`, and the code below drops `Competitors` as well.

In [170]:
def drop_irrelevant_cols(df):
    df = df.drop(*['Job Description', 'age', 'same_state', 'Competitors', 'Company Name'])
    return df

df = drop_irrelevant_cols(df)

In [171]:
df.describe().show()


+-------+------------------+-----------------+--------------------+------------------+----------------+-------------------+-------+-----------------+-----------------+---------+----------------+--------------------+-------------------+--------------------+-----------------+------------------+------------------+--------------------+---------+-------------------+--------------------+-------------------+-------------------+-------------------+--------+---------+------------------+------------------+
|summary|                ID|        Job Title|     Salary Estimate|            Rating|        Location|       Headquarters|   Size|          Founded|Type of ownership| Industry|          Sector|             Revenue|             hourly|   employer_provided|       min_salary|        max_salary|        avg_salary|         company_txt|job_state|          python_yn|                R_yn|              spark|                aws|              excel|job_simp|seniority|          desc_len|          num_co

                                                                                

**Handling missing values:**



In [172]:
# while exploring the data, we found one record with many missing values (-1), so we drop it
bad_record = df.filter(df['Headquarters'] == '-1')
bad_record.show()

+---+--------------------+--------------------+------+-------------+------------+----+-------+-----------------+--------+------+-------+------+-----------------+----------+----------+----------+--------------------+---------+---------+----+-----+---+-----+--------+---------+--------+--------+
| ID|           Job Title|     Salary Estimate|Rating|     Location|Headquarters|Size|Founded|Type of ownership|Industry|Sector|Revenue|hourly|employer_provided|min_salary|max_salary|avg_salary|         company_txt|job_state|python_yn|R_yn|spark|aws|excel|job_simp|seniority|desc_len|num_comp|
+---+--------------------+--------------------+------+-------------+------------+----+-------+-----------------+--------+------+-------+------+-----------------+----------+----------+----------+--------------------+---------+---------+----+-----+---+-----+--------+---------+--------+--------+
|581|Scientist – Cance...|Employer Provided...|  -1.0|Cambridge, MA|          -1|  -1|     -1|               -1|      
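
Instead of spotting such a record by eye, one could count the `-1` placeholders per record and flag records above a threshold. The sketch below shows the idea in plain Python on dictionaries. The helper names, the threshold of 3, and the sample values are assumptions for illustration, not part of the original analysis.

```python
# Values that mark a missing entry; -1.0 also matches because -1.0 == -1
SENTINELS = (-1, "-1")


def count_missing(record):
    """Number of fields in a record that hold a missing-value placeholder."""
    return sum(1 for value in record.values() if value in SENTINELS)


def flag_sparse_records(records, max_missing=3, id_field="ID"):
    """IDs of records with more than max_missing placeholder values."""
    return [r[id_field] for r in records if count_missing(r) > max_missing]


# Hypothetical sample: one complete record and one that is mostly placeholders
records = [
    {"ID": 0, "Rating": 3.8, "Headquarters": "Goleta, CA",
     "Size": "501 to 1000 employees", "Founded": 1973, "Industry": "Example"},
    {"ID": 581, "Rating": -1.0, "Headquarters": "-1",
     "Size": "-1", "Founded": -1, "Industry": "-1"},
]

flagged = flag_sparse_records(records)
print(flagged)
```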

In [173]:
df = df.filter(df['ID'] != 581)
df.describe().show()

+-------+------------------+-----------------+--------------------+------------------+----------------+-------------------+-----------------+------------------+--------------------+---------+----------------+--------------------+--------------------+--------------------+------------------+------------------+------------------+--------------------+---------+------------------+--------------------+-------------------+-------------------+------------------+--------+---------+------------------+------------------+
|summary|                ID|        Job Title|     Salary Estimate|            Rating|        Location|       Headquarters|             Size|           Founded|   Type of ownership| Industry|          Sector|             Revenue|              hourly|   employer_provided|        min_salary|        max_salary|        avg_salary|         company_txt|job_state|         python_yn|                R_yn|              spark|                aws|             excel|job_simp|seniority|       

In [178]:
min_max_median = df.agg(
    F.min('Founded').alias('min_value'),
    F.max('Founded').alias('max_value'),
    F.expr('percentile_approx(Founded, 0.5)').alias('median_value')  # 0.5 for median
)

# Show the results
min_max_median.show()

+---------+---------+------------+
|min_value|max_value|median_value|
+---------+---------+------------+
|       -1|     2019|        1988|
+---------+---------+------------+
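
The minimum of `-1` shows that missing years are still included in these statistics, so the median is computed over the placeholders as well. A minimal sketch of computing the summary over known values only, in plain Python with a made-up sample (the function name `summarize_known` is hypothetical):

```python
from statistics import median

MISSING = -1  # placeholder used in the dataset for unknown values


def summarize_known(values, sentinel=MISSING):
    """Min, max and median of the values that are not the sentinel."""
    known = [v for v in values if v != sentinel]
    if not known:
        raise ValueError("no known values to summarize")
    return {
        "min": min(known),
        "max": max(known),
        "median": median(known),
        "n_missing": len(values) - len(known),
    }


# Hypothetical sample of founding years with two missing entries
founded = [1973, -1, 1984, 2010, -1, 1965, 1998]

with_sentinel = median(founded)      # placeholders pull the median down
summary = summarize_known(founded)   # placeholders excluded
print(with_sentinel, summary)
```

In this sample the median moves from 1973 to 1984 once the placeholders are excluded. In a dataset where only a few records are missing the effect is smaller.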




In [176]:
def fill_missing_values(df):
    # -1 (or 'na') marks a missing value in this dataset
    df = df.withColumn('Rating', F.when(df['Rating'] == -1, 3.0).otherwise(df['Rating']))
    df = df.withColumn('Size', F.when(df['Size'] == '-1', 'Unknown').otherwise(df['Size']))
    df = df.withColumn('Industry', F.when(df['Industry'] == '-1', 'Unknown').otherwise(df['Industry']))
    df = df.withColumn('Founded', F.when(df['Founded'] == -1, 1988).otherwise(df['Founded']))  # fill in median year founded
    df = df.withColumn('seniority', F.when(df['seniority'] == 'na', 'med').otherwise(df['seniority']))
    # not a clearly defined job position: research or scientist positions, devops/spark engineers, roles with many tasks, etc.
    df = df.withColumn('job_simp', F.when(df['job_simp'] == 'na', 'vague').otherwise(df['job_simp']))
    return df

df = fill_missing_values(df)

In [177]:
df.groupBy('job_simp').count().show()

+--------------+-----+
|      job_simp|count|
+--------------+-----+
|data scientist|  279|
|         vague|  183|
|      director|   14|
|       manager|   22|
| data engineer|  119|
|       analyst|  102|
|           mle|   22|
+--------------+-----+



In [181]:
# Count the records whose seniority is 'med' (after filling, 'na' became 'med')
med_seniority = df.filter(df['seniority'] == 'med')
med_seniority.count()

519

In [165]:
df_new = df.filter(df['job_simp'] == 'na')
df_new.select(['Job Title']).collect()

[Row(Job Title='Research Scientist'),
 Row(Job Title='Scientist I/II, Biology'),
 Row(Job Title='Scientist'),
 Row(Job Title='Spectral Scientist/Engineer'),
 Row(Job Title='R&D Data Analysis Scientist'),
 Row(Job Title='Analytics Consultant'),
 Row(Job Title='Scientist'),
 Row(Job Title='Research Scientist'),
 Row(Job Title='Data Management Specialist'),
 Row(Job Title='Sr. Scientist II'),
 Row(Job Title='Data Modeler'),
 Row(Job Title='Scientist'),
 Row(Job Title='Research Scientist'),
 Row(Job Title='Project Scientist'),
 Row(Job Title='Associate Scientist'),
 Row(Job Title='Scientist 2, QC Viral Vector'),
 Row(Job Title='Senior Research Scientist - Embedded System Development for DevOps'),
 Row(Job Title='Senior Spark Engineer (Data Science)'),
 Row(Job Title='Pricipal Scientist Molecular and cellular biologist'),
 Row(Job Title='Senior Scientist - Neuroscience'),
 Row(Job Title='Medical Lab Scientist'),
 Row(Job Title='Scientist, Analytical Development'),
 Row(Job Title='Sr. Scient

In [98]:
df.show(10)

+---+--------------------+--------------------+------+--------------------+---------------+--------------+--------------------+-------+------------------+--------------------+--------------------+--------------------+--------------------+------+-----------------+----------+----------+----------+--------------------+---------+---------+----+-----+---+-----+--------------+---------+--------+--------+
| ID|           Job Title|     Salary Estimate|Rating|        Company Name|       Location|  Headquarters|                Size|Founded| Type of ownership|            Industry|              Sector|             Revenue|         Competitors|hourly|employer_provided|min_salary|max_salary|avg_salary|         company_txt|job_state|python_yn|R_yn|spark|aws|excel|      job_simp|seniority|desc_len|num_comp|
+---+--------------------+--------------------+------+--------------------+---------------+--------------+--------------------+-------+------------------+--------------------+--------------------+

In [62]:
df.select(['Job Title']).tail(20)

24/10/17 11:42:15 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , Job Title, Salary Estimate, Job Description, Rating, Company Name, Location, Headquarters, Size, Founded, Type of ownership, Industry, Sector, Revenue, Competitors, hourly, employer_provided, min_salary, max_salary, avg_salary, company_txt, job_state, same_state, age, python_yn, R_yn, spark, aws, excel, job_simp, seniority, desc_len, num_comp
 Schema: _c0, Job Title, Salary Estimate, Job Description, Rating, Company Name, Location, Headquarters, Size, Founded, Type of ownership, Industry, Sector, Revenue, Competitors, hourly, employer_provided, min_salary, max_salary, avg_salary, company_txt, job_state, same_state, age, python_yn, R_yn, spark, aws, excel, job_simp, seniority, desc_len, num_comp
Expected: _c0 but found: 
CSV file: file:///Users/ericbanzuzi/uni/KTH/Data%20Intensive%20Computing/ID2221-project/dataset/eda_data.csv


[Row(Job Title=None),
 Row(Job Title=None),
 Row(Job Title=None),
 Row(Job Title=None),
 Row(Job Title=' drinks (three ways to brew your favorite cup of coffee)'),
 Row(Job Title=None),
 Row(Job Title=' dental and vision insurance plans'),
 Row(Job Title=None),
 Row(Job Title=None),
 Row(Job Title=None),
 Row(Job Title=None),
 Row(Job Title=' CA. It is not a remote role."'),
 Row(Job Title='Data Science Project Manager'),
 Row(Job Title='Data Engineer'),
 Row(Job Title='Principal, Data Science - Advanced Analytics'),
 Row(Job Title='Sr Scientist, Immuno-Oncology - Oncology'),
 Row(Job Title='Senior Data Engineer'),
 Row(Job Title='Project Scientist - Auton Lab, Robotics Institute'),
 Row(Job Title='Data Science Manager'),
 Row(Job Title='Research Scientist – Security and Privacy')]

In [35]:
df = pd.read_csv('./dataset/eda_data.csv')
df.head(5)

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,age,python_yn,R_yn,spark,aws,excel,job_simp,seniority,desc_len,num_comp
0,0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,...,47,1,0,0,0,1,data scientist,na,2536,0
1,1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,...,36,1,0,0,0,0,data scientist,na,4783,0
2,2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 to 1000 employees,2010,...,10,1,0,1,0,1,data scientist,na,3461,0
3,3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,...,55,1,0,0,0,0,data scientist,na,3883,3
4,4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,...,22,1,0,0,0,1,data scientist,na,2728,3


In [14]:
df.shape[0]

742

In [4]:
df.count()

                                                                                

25508
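
The gap between 742 records and 25508 rows is what one would expect if the second count came from a read that treats every physical line as a record, since the job descriptions contain line breaks inside quoted fields. The sketch below reproduces the effect on a small made-up sample with the standard library; it illustrates the mechanism and does not re-run the cells above.

```python
import csv
import io

# Hypothetical two-record sample; each description spans several lines.
raw = (
    "ID,Job Title,Job Description\n"
    '0,Data Scientist,"Data Scientist\nLocation: Albuquerque, NM\nSkills Required:"\n'
    '1,Data Engineer,"Build pipelines\nBenefits: health, dental"\n'
)

# A line-based reader sees every physical line as a separate record
physical_lines = raw.splitlines()

# A quote-aware reader keeps line breaks inside quoted fields
logical_records = list(csv.reader(io.StringIO(raw, newline="")))

n_line_based = len(physical_lines) - 1     # minus the header line
n_quote_aware = len(logical_records) - 1   # minus the header record
print(n_line_based, n_quote_aware)
```

Here the line-based count is 5 while the quote-aware count is 2, the same kind of inflation seen in the fragment rows such as `Location: Albuquerque`.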

In [45]:
### Check the schema
df.printSchema()

root
 |-- Unnamed: 0: string (nullable = true)
 |-- Job Title: string (nullable = true)
 |-- Salary Estimate: string (nullable = true)
 |-- Job Description: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Company Name: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Headquarters: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Founded: string (nullable = true)
 |-- Type of ownership: string (nullable = true)
 |-- Industry: string (nullable = true)
 |-- Sector: string (nullable = true)
 |-- Revenue: string (nullable = true)
 |-- Competitors: string (nullable = true)
 |-- hourly: string (nullable = true)
 |-- employer_provided: string (nullable = true)
 |-- min_salary: string (nullable = true)
 |-- max_salary: string (nullable = true)
 |-- avg_salary: string (nullable = true)
 |-- company_txt: string (nullable = true)
 |-- job_state: string (nullable = true)
 |-- same_state: string (nullable = true)
 |-- age: string (nullable = 

In [6]:
type(df)

pyspark.sql.dataframe.DataFrame

In [7]:
df.head(3)

[Row(Job Title='Data Scientist', Salary Estimate='$53K-$91K (Glassdoor est.)', Job Description='Data Scientist', Rating=None, Company Name=None, Location=None, Headquarters=None, Size=None, Founded=None, Type of ownership=None, Industry=None, Sector=None, Revenue=None, Competitors=None, hourly=None, employer_provided=None, min_salary=None, max_salary=None, avg_salary=None, company_txt=None, job_state=None, same_state=None, age=None, python_yn=None, R_yn=None, spark=None, aws=None, excel=None),
 Row(Job Title='Location: Albuquerque', Salary Estimate=' NM', Job Description=None, Rating=None, Company Name=None, Location=None, Headquarters=None, Size=None, Founded=None, Type of ownership=None, Industry=None, Sector=None, Revenue=None, Competitors=None, hourly=None, employer_provided=None, min_salary=None, max_salary=None, avg_salary=None, company_txt=None, job_state=None, same_state=None, age=None, python_yn=None, R_yn=None, spark=None, aws=None, excel=None),
 Row(Job Title='Education Requ

In [8]:
df.select(['Job Title','Rating']).show()

+--------------------+--------------------+
|           Job Title|              Rating|
+--------------------+--------------------+
|      Data Scientist|                NULL|
|Location: Albuque...|                NULL|
|Education Require...|            business|
|    Skills Required:|                NULL|
|Bachelor’s Degree...|       data analysis|
|Applicant should ...|              MATLAB|
|Excellent verbal ...|                NULL|
|Applicant must be...|                NULL|
|U.S. citizenship ...|                NULL|
|Responsibilities:...|              models|
|           Benefits:|                NULL|
|We offer competit...|                NULL|
|Comprehensive health| long and short t...|
|100% Company fund...|                NULL|
|   Generous vacation|                NULL|
|  Tuition assistance|                NULL|
|Benefits are prov...|                NULL|
|Tecolote Research...|                 3.8|
|                3.8"|501 to 1000 emplo...|
| , NM,0,47,1,0,0,0,1|          

In [9]:
df.dtypes

[('Job Title', 'string'),
 ('Salary Estimate', 'string'),
 ('Job Description', 'string'),
 ('Rating', 'string'),
 ('Company Name', 'string'),
 ('Location', 'string'),
 ('Headquarters', 'string'),
 ('Size', 'string'),
 ('Founded', 'string'),
 ('Type of ownership', 'string'),
 ('Industry', 'string'),
 ('Sector', 'string'),
 ('Revenue', 'string'),
 ('Competitors', 'string'),
 ('hourly', 'string'),
 ('employer_provided', 'string'),
 ('min_salary', 'string'),
 ('max_salary', 'string'),
 ('avg_salary', 'string'),
 ('company_txt', 'string'),
 ('job_state', 'string'),
 ('same_state', 'string'),
 ('age', 'string'),
 ('python_yn', 'string'),
 ('R_yn', 'string'),
 ('spark', 'string'),
 ('aws', 'string'),
 ('excel', 'string')]

In [10]:
df.describe().show()


+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+---------+------------------+--------------------+-------------------+-------------------+--------------------+--------------------+-------------------+
|summary|           Job Title|     Salary Estimate|     Job Description|              Rating|        Company Name|            Location|     Headquarters|              Size|             Founded|   Type of ownership|            Industry|              Sector|             Revenue|         Competitors|              hourly|   employer_provided|          min_salary|          max_salary|       avg_salary|         company_txt|job_s

                                                                                

In [36]:
### Adding Columns in data frame
df_pyspark=df_pyspark.withColumn('Experience After 2 year',df_pyspark['Experience']+2)

In [38]:
df_pyspark.show()

+---------+---+----------+-----------------------+
|     Name|age|Experience|Experience After 2 year|
+---------+---+----------+-----------------------+
|    Krish| 31|        10|                     12|
|Sudhanshu| 30|         8|                     10|
|    Sunny| 29|         4|                      6|
+---------+---+----------+-----------------------+



In [41]:
### Drop the columns
df_pyspark=df_pyspark.drop('Experience After 2 year')

In [42]:
df_pyspark.show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [44]:
### Rename the columns
df_pyspark.withColumnRenamed('Name','New Name').show()

+---------+---+----------+
| New Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+

