In [92]:
import pandas as pd
import numpy as np
import seaborn as sns
import re

# Factors that impact Salary

To predict salary, build either a classification model or a regression model.

- Framed as a regression problem, you estimate the listed salary amounts.
- Framed as a classification problem, you create labels from these salaries (high vs. low salary) according to a threshold (such as the median salary).
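
The two framings differ only in how the target is built. The sketch below uses made-up salary values (not the scraped data) and assumes a strict "above the median" rule for the high label; the names `y_reg` and `y_clf` are illustrative.

```python
import pandas as pd

# Toy salaries (made-up values, not taken from the scraped data)
salaries = pd.Series([60000, 85000, 120000, 95000, 150000], name='salary')

# Regression target: the salary amounts themselves
y_reg = salaries

# Classification target: 1 = high (strictly above the median), 0 = low
threshold = salaries.median()
y_clf = (salaries > threshold).astype(int)

print(threshold)
print(y_clf.tolist())
```

A salary exactly equal to the median is labelled low here; using `>=` instead would move it to the high class.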

Models that may be useful for this problem:

- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. 

## Data Cleaning

### Drop duplicates and NA values

In [70]:
# Load data scraped from au.indeed.com (details in the Scraping notebook)
jobs = pd.read_csv('jobs.csv')
# Drop the leftover index column from the saved CSV
jobs.drop('Unnamed: 0', axis=1, inplace=True)
print(jobs.shape)
# Remove exact duplicate postings
jobs.drop_duplicates(inplace=True)
print(jobs.shape)
# Remove rows with any missing value
jobs.dropna(inplace=True)
print(jobs.shape)
jobs.columns = ['city', 'title', 'company', 'location', 'description', 'salarytype', 'salary']

(1241, 7)
(1034, 7)
(1034, 7)


## Get dummies

In [72]:
# Take the last token of the location string as the state
jobs['state'] = jobs.location.map(lambda x: x.split()[-1])
jobs.drop('location', axis=1, inplace=True)

# One-hot encode state; drop 'Australia' so it acts as the reference level
state_dummies = pd.get_dummies(jobs.state)
jobs = pd.concat([jobs, state_dummies], axis=1)
jobs.drop(['state', 'Australia'], axis=1, inplace=True)

# One-hot encode city; drop 'Newcastle' as the reference level
city_dummies = pd.get_dummies(jobs.city)
jobs = pd.concat([jobs, city_dummies], axis=1)
jobs.drop(['city', 'Newcastle'], axis=1, inplace=True)

# One-hot encode salary type; drop 'month' as the reference level
type_dummies = pd.get_dummies(jobs.salarytype)
jobs = pd.concat([jobs, type_dummies], axis=1)
jobs.drop(['salarytype', 'month'], axis=1, inplace=True)
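
The three encoding steps above repeat the same pattern: build dummies, append them, then drop the original column together with one level. Dropping one level per variable avoids perfectly collinear dummy columns and makes the dropped level the reference category. A minimal sketch of that pattern as a reusable function follows; `add_dummies` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_dummies(df, column, baseline):
    """One-hot encode `column`, dropping the `baseline` level and the original column."""
    dummies = pd.get_dummies(df[column]).drop(baseline, axis=1)
    return pd.concat([df.drop(column, axis=1), dummies], axis=1)

# Toy frame with made-up rows
toy = pd.DataFrame({'city': ['Sydney', 'Newcastle', 'Melbourne'],
                    'salary': [100000, 80000, 90000]})
encoded = add_dummies(toy, 'city', baseline='Newcastle')
print(list(encoded.columns))
```

Passing the baseline explicitly keeps the choice of reference level visible, which matters later when interpreting coefficients relative to it.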

In [107]:
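def f(x):
    # Remove any parenthesised text, i.e. "(...)", from a job title
    x = re.sub(r'\([^)]*\)', '', x)
    return x

# Preview the cleaned titles (the result is not assigned back to jobs)
jobs.title.map(f)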

0                   Research Scientist - Machine Learning
1                                          Data Scientist
2                                          Data Scientist
3         Data Analyst/Jr. Data Scientist - Perm - Sydney
4                        Data Scientist Sydney, Australia
5       Data Scientist - Machine Learning - Validated ...
6                        Data Scientist - Top ASX company
7          Customer Data Scientist - Insights & Analytics
8              Research Scientist - Theoretical Physicist
9       Principal Consultant, Artificial Intelligence ...
10                  Data Scientist / Statistical Modeller
11                                  Senior Data Scientist
12                         Data Scientist - Perm - Sydney
13      Data Scientist/Biostatistician - Medicine Insi...
14                               Administrative Assistant
15           Business Intelligence Lead Sydney, Australia
16      BI, DW and Data IT Recruitment Consultant - Sy...
17            

# Factors that distinguish job category

Identify features in the data related to job postings that can distinguish job titles from each other. 

There are a variety of interesting ways you can frame the target variable, for example:

- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?
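
As one way to turn the junior vs. senior question into a target variable, the sketch below labels titles by keyword. The keyword lists and the `seniority` function are illustrative assumptions and would need tuning against the real titles.

```python
import re

# Keyword lists are illustrative assumptions; tune them against the real titles
SENIOR = re.compile(r'\b(senior|sr|lead|principal|head)\b', re.IGNORECASE)
JUNIOR = re.compile(r'\b(junior|jr|graduate|intern)\b', re.IGNORECASE)

def seniority(title):
    """Return 'senior', 'junior', or 'other' based on keywords in the title."""
    if SENIOR.search(title):
        return 'senior'
    if JUNIOR.search(title):
        return 'junior'
    return 'other'

titles = ['Senior Data Scientist',
          'Data Analyst/Jr. Data Scientist - Perm - Sydney',
          'Data Scientist']
labels = [seniority(t) for t in titles]
print(labels)
```

On a DataFrame, the same function could be applied with `jobs.title.map(seniority)`. Titles that match both lists are labelled senior because that check runs first.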

You may end up making multiple classification models to tackle different questions. 

Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. 

The type of classification model you choose is up to you. 

Be sure to interpret your results and evaluate your models' performance.

# ROC

Your boss would rather incorrectly tell a client that they would get a lower-salary job than incorrectly tell them that they would get a high-salary job. Adjust one of your models to ease his mind, and explain what the adjustment does and any tradeoffs. Plot the ROC curve.
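
Reading "high salary" as the positive class, the boss wants fewer false positives and accepts more false negatives in exchange. One common way to get that is to raise the probability threshold for predicting "high". The sketch below uses made-up labels and scores, and the `rates` helper is hypothetical; it computes the false positive and true positive rates by hand rather than with any particular library.

```python
def rates(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) for the 'high salary' class."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return fp / float(neg), tp / float(pos)

# Made-up labels (1 = high salary) and predicted probabilities of 'high'
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.6, 0.4, 0.55, 0.7, 0.9, 0.85]

default = rates(y_true, scores, 0.5)   # usual cut-off
strict = rates(y_true, scores, 0.8)    # stricter cut-off for 'high'

# Sweeping the threshold over every score gives the points of the ROC curve
roc_points = [rates(y_true, scores, t) for t in sorted(set(scores), reverse=True)]

print(default)
print(strict)
```

On this toy data the stricter threshold removes the false positive but also misses half of the truly high-salary jobs, which is the tradeoff to explain. Plotting `roc_points` with the false positive rate on the x-axis and the true positive rate on the y-axis traces the ROC curve.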

# Requirements

- Scrape and prepare your own data.

- Create and compare at least two models for each section. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression model of your choosing (e.g. Ridge, logistic regression, KNN, SVM).

    - Section 1: Job Salary Trends
    - Section 2: Job Category Factors

- Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists.

    - Make sure to clearly describe and label each section.
    - Comment on your code so that others could, in theory, replicate your work.

- Write a brief executive summary for a non-technical audience.

    - The writeup should be 500-1000 words, defining any technical terms and explaining your approach as well as any risks and limitations.

- Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs. low salary positions.

- Convert your executive summary into a public blog post of at least 500 words, in which you document your approach as a tutorial for other aspiring data scientists. Link to this in your notebook.
