# Salary Predictions Based on Job Descriptions

## Part 1 - DEFINE

### ---- 1 Define the problem ----

Write the problem in your own words here

In [1]:
# Data Analysis libraries
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Allow multiple outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Info
__author__ = "Thiago do Couto"
__email__ = "thiago.coutoreis@gmail.com"

## Part 2 - DISCOVER

### ---- 2 Load the data ----

In [2]:
df1 = pd.read_csv('data/train_features.csv')
df2 = pd.read_csv('data/train_salaries.csv')

df = df1.join(df2.set_index('jobId'), on = 'jobId')
df.head(3)

| | jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis | salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | JOB1362684407687 | COMP37 | CFO | MASTERS | MATH | HEALTH | 10 | 83 | 130 |
| 1 | JOB1362684407688 | COMP19 | CEO | HIGH_SCHOOL | NONE | WEB | 3 | 73 | 101 |
| 2 | JOB1362684407689 | COMP52 | VICE_PRESIDENT | DOCTORAL | PHYSICS | HEALTH | 10 | 38 | 137 |
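
The join above assumes that every `jobId` appears exactly once in each file. A minimal sketch of the same step with `merge` and its `validate` argument, which raises an error if that assumption does not hold. The two small frames are made-up stand-ins for the CSV files, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for train_features.csv and train_salaries.csv
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "jobType": ["CFO", "CEO", "VICE_PRESIDENT"],
    "yearsExperience": [10, 3, 10],
})
salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "salary": [130, 101, 137],
})

# validate="one_to_one" raises pandas.errors.MergeError on duplicate keys;
# how="left" keeps every feature row even when no salary matches
merged = features.merge(salaries, on="jobId", how="left", validate="one_to_one")
print(merged.shape)
```

A left merge mirrors the behavior of `join`, so swapping it in should not change which rows are kept.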


### ---- 3 Clean the data ----

In [82]:
# Numeric columns to loop over in the checks below
intcolumns = ['yearsExperience','milesFromMetropolis','salary']

# Check for column types
print('1 - Check for column types:')
print(df.dtypes)

# Look for duplicates
print('2 - Is there any duplicated row? ', df.duplicated().any())

# Look for negative values in numeric data
print('3 - How many negative values exist in each numeric column?')
for check in intcolumns:
    print('Column', check, ':', (df[check] < 0).sum())

# Look for zeroes in numeric data
print('4 - How many zero values exist in each numeric column?')
for check in intcolumns:
    print('Column', check, ':', (df[check] == 0).sum())

# Look for NaNs
print('5 - Check for NaNs in each attribute:')
print(df.isna().any())


1 - Check for column types:
jobId                  object
companyId              object
jobType                object
degree                 object
major                  object
industry               object
yearsExperience         int64
milesFromMetropolis     int64
salary                  int64
dtype: object
2 - Is there any duplicated row?  False
3 - How many negative values exist in each numeric column?
Column yearsExperience : 0
Column milesFromMetropolis : 0
Column salary : 0
4 - How many zero values exist in each numeric column?
Column yearsExperience : 39839
Column milesFromMetropolis : 10123
Column salary : 5
5 - Check for NaNs in each attribute:
jobId                  False
companyId              False
jobType                False
degree                 False
major                  False
industry               False
yearsExperience        False
milesFromMetropolis    False
salary                 False
dtype: bool


In [59]:
for check in intcolumns:
    print('How many zeros exist in', check, 'column? ', (df[check] == 0).sum())
#print('How many zeros exist in Salary column? ', (df['salary'] == 0).sum())

How many zeros exist in yearsExperience column?  39839
How many zeros exist in milesFromMetropolis column?  10123
How many zeros exist in salary column?  5


In [7]:
df.dropna(subset = ['salary'], axis = 0, inplace = True)  # axis = 0 drops rows with a missing salary
df.shape

(1000000, 9)
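
The checks above found no NaNs but did report 5 rows with a salary of 0, and `dropna` leaves those rows in place. A minimal sketch of filtering them out, assuming a zero salary is a data-entry error rather than a real value (the source does not say either way). The toy frame is made up for illustration.

```python
import pandas as pd

# Made-up sample; the real notebook would use the df loaded earlier
toy = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3", "JOB4"],
    "salary": [130, 0, 137, 101],
})

# Keep only rows with a strictly positive salary (assumption: 0 is invalid)
cleaned = toy[toy["salary"] > 0].reset_index(drop=True)
print(len(toy) - len(cleaned), "row(s) removed")
```

Zeros in `yearsExperience` and `milesFromMetropolis` are plausible real values, so this sketch leaves those columns alone.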

### ---- 4 Explore the data (EDA) ----

In [None]:
#summarize each feature variable
#summarize the target variable
#look for correlation between each feature and the target
#look for correlation between features

#df.describe()
#df['jobType'].value_counts()
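
One possible way to work through the checklist above is sketched below: summarize the target, compare mean salary across a categorical feature, and compute correlations among the numeric columns. The sample values are made up; only the column names come from the training data.

```python
import pandas as pd

# Made-up sample with the same column names as the training data
toy = pd.DataFrame({
    "jobType": ["CEO", "CEO", "VICE_PRESIDENT", "VICE_PRESIDENT", "CFO", "CFO"],
    "yearsExperience": [10, 20, 1, 5, 8, 15],
    "milesFromMetropolis": [10, 5, 90, 60, 30, 20],
    "salary": [150, 190, 110, 120, 130, 160],
})

# Summarize the target variable
target_summary = toy["salary"].describe()

# Mean salary per category, highest first
salary_by_job = toy.groupby("jobType")["salary"].mean().sort_values(ascending=False)

# Pearson correlation among the numeric features and the target
corr = toy[["yearsExperience", "milesFromMetropolis", "salary"]].corr()

print(target_summary)
print(salary_by_job)
print(corr["salary"])
```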

In [None]:
sns.boxplot(x = df['yearsExperience'], y = df['salary'])

In [None]:
plt.scatter(x = df['yearsExperience'], y = df['salary'])

### ---- 5 Establish a baseline ----

In [None]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
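
The baseline described above could be sketched as follows: predict the mean salary of each industry and measure MSE over 5 folds. The folds are built by hand with NumPy to keep the example dependency-light, and the data is synthetic, so the printed MSE says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic sample; the real notebook would use df with its industry and salary columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({"industry": rng.choice(["HEALTH", "WEB"], size=100)})
base = toy["industry"].map({"HEALTH": 115.0, "WEB": 120.0})
toy["salary"] = base + rng.normal(0, 10, size=len(toy))

# Split shuffled row positions into 5 folds
folds = np.array_split(rng.permutation(len(toy)), 5)

fold_mse = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    train, test = toy.iloc[train_idx], toy.iloc[test_idx]
    # "Model": mean salary per industry, learned from the training folds only
    industry_mean = train.groupby("industry")["salary"].mean()
    # Fall back to the overall training mean for industries unseen in training
    pred = test["industry"].map(industry_mean).fillna(train["salary"].mean())
    fold_mse.append(float(((test["salary"] - pred) ** 2).mean()))

baseline_mse = float(np.mean(fold_mse))
print(f"Baseline 5-fold MSE: {baseline_mse:.1f}")
```

Computing the group means inside each fold keeps the held-out salaries out of the "model", which is what makes the cross-validated MSE an honest estimate.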

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc.

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model
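
A minimal sketch of getting the data ready for modeling, assuming one-hot encoding for the categorical columns. The interaction feature `exp_x_miles` is a hypothetical example of a new feature, not one the notebook prescribes, and the three rows are made up.

```python
import pandas as pd

# Made-up sample with a subset of the training columns
toy = pd.DataFrame({
    "jobType": ["CEO", "CFO", "CEO"],
    "degree": ["MASTERS", "DOCTORAL", "HIGH_SCHOOL"],
    "yearsExperience": [10, 3, 7],
    "milesFromMetropolis": [83, 73, 38],
    "salary": [130, 101, 137],
})

# Separate the target from the features
y = toy["salary"]
X = toy.drop(columns="salary")

# One-hot encode the categorical columns; numeric columns pass through unchanged
X = pd.get_dummies(X, columns=["jobType", "degree"], dtype=int)

# Hypothetical interaction feature, one of many that could be tried
X["exp_x_miles"] = X["yearsExperience"] * X["milesFromMetropolis"]
print(X.columns.tolist())
```

Identifier columns such as `jobId` would normally be dropped before encoding, since a unique ID carries no signal a model can generalize from.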

### ---- 8 Create models ----

In [None]:
#create and tune the models that you brainstormed during part 2

### ---- 9 Test models ----

In [None]:
#do 5-fold cross validation on models and measure MSE
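
The 5-fold evaluation could look roughly like the sketch below, which uses scikit-learn's `cross_val_score`. The two models are illustrative choices rather than the notebook's actual candidates, and the features and target are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered feature matrix and salary target
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(200, 2))
y = 80 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 5, size=200)

# Candidate models are illustrative, not a recommendation
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is better; flip the sign back
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = float(-scores.mean())

# The model with the lowest mean MSE is the natural candidate for step 10
best_name = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_name)
```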

### ---- 10 Select best model ----

In [None]:
#select the model with the lowest error as your "production" model

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
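
A rough sketch of such a script is shown below: fit on the full training set, persist the model, reload it, and score the test set. It assumes a plain `LinearRegression` and `pickle` for persistence. The file name `salary_model.pkl` is hypothetical and the arrays are synthetic.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the prepared training and test features
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 24, size=(100, 2))
y_train = 80 + 2.0 * X_train[:, 0] + rng.normal(0, 5, size=100)
X_test = rng.uniform(0, 24, size=(10, 2))

# Train on the entire training set
model = LinearRegression().fit(X_train, y_train)

# Save the fitted model to disk (a temporary directory here), then reload it
model_path = Path(tempfile.mkdtemp()) / "salary_model.pkl"
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)
with open(model_path, "rb") as fh:
    loaded = pickle.load(fh)

# Score the "test" set with the reloaded model
predictions = loaded.predict(X_test)
print(predictions[:3])
```

Reloading before scoring is a cheap check that the saved artifact, not just the in-memory object, produces the predictions.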

### ---- 12 Deploy solution ----

In [None]:
#save your predictions to a CSV file or optionally save them as a table in a SQL database
#additionally, you want to save a visualization and summary of your predictions and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders
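
Both storage options mentioned above can be sketched with pandas alone. SQLite stands in for "a SQL database" purely as an example. The table name `salary_predictions`, the file names, and the prediction values are all made up.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Made-up predictions; real ones would come from the trained model
preds = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "predicted_salary": [128.4, 99.7, 140.2],
})

out_dir = Path(tempfile.mkdtemp())

# Option 1: CSV file
csv_path = out_dir / "predictions.csv"
preds.to_csv(csv_path, index=False)

# Option 2: table in a SQL database (SQLite used here as an example)
conn = sqlite3.connect(str(out_dir / "predictions.db"))
preds.to_sql("salary_predictions", conn, if_exists="replace", index=False)
stored = pd.read_sql("SELECT * FROM salary_predictions", conn)
conn.close()

print(stored)
```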

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data