# Predict Ads Agency Salary

Based on the [Real Agency Salaries](https://www.campaignlive.com/article/creative-ad-agency-salaries-laid-bare-public-spreadsheet/) dataset, your task is to create a model that accurately predicts the salary of a person working in an ad agency, based on the features given in the dataset.

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
from plotnine import *
import re

In [3]:
df = pd.read_csv('data/agency_table.csv')
df.shape

(4812, 17)

## Data Cleaning

### Target - `salary`

Clean the `salary` column so that it is numeric and holds correct values. Beware of outliers and unorthodox ways of entering numbers, such as 50k and 100-200k.

In [4]:
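# some examples of cleaning steps (not exhaustive)
# lowercase so that 'K' and 'k' are treated alike
df['salary'] = df.salary.map(lambda x: str(x).lower())
# remove thousands separators
df['salary'] = df.salary.map(lambda x: re.sub(r',', '', str(x)))
# drop anything after the first decimal point
df['salary'] = df.salary.map(lambda x: x.split('.')[0])
# expand a 'k' that directly follows a digit into three zeros
df['salary'] = df.salary.map(lambda x: re.sub(r'(?<=[0-9])k', '000', str(x)))
# strip every remaining non-digit (note: this concatenates ranges such as 100-200k)
df['salary'] = df.salary.map(lambda x: re.sub(r'[^0-9]', '', str(x)))
# empty strings become NaN
df['salary'] = pd.to_numeric(df.salary)
df.salary.describe()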

count    4.799000e+03
mean     4.499279e+13
std      2.894620e+15
min      0.000000e+00
25%      5.500000e+04
50%      7.500000e+04
75%      1.150000e+05
max      2.000002e+17
Name: salary, dtype: float64
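
The maximum of roughly 2e17 above is consistent with ranges such as 100-200k having their digits concatenated by the last substitution. A minimal sketch of an alternative parser follows, assuming a range should be replaced by its midpoint and that a `k` on the upper bound also applies to a bare lower bound. `parse_salary` is a hypothetical helper, not something shipped with the dataset.

```python
import re

# A number with optional decimals and an optional standalone "k" suffix
_NUM = r'(\d+(?:\.\d+)?)\s*(k\b)?'
_RANGE = re.compile(_NUM + r'\s*(?:-|to)\s*' + _NUM)
_SINGLE = re.compile(_NUM)


def _scale(number, suffix):
    """Convert the matched text to a float, multiplying by 1000 for a 'k' suffix."""
    return float(number) * (1000 if suffix else 1)


def parse_salary(raw):
    """Return the salary as a float, or None when no number can be recovered."""
    text = str(raw).lower().replace(',', '')
    match = _RANGE.search(text)
    if match:
        low_num, low_k, high_num, high_k = match.groups()
        # Assumption: in "100-200k" the suffix applies to both bounds
        low = _scale(low_num, low_k or high_k)
        high = _scale(high_num, high_k)
        # Assumption: a range is summarized by its midpoint
        return (low + high) / 2
    match = _SINGLE.search(text)
    if match:
        return _scale(*match.groups())
    return None
```

It could be applied to the raw column with `df.salary.map(parse_salary)` in place of the regex chain. Whether the midpoint is the right summary for a range is a modeling choice, and dropping such rows is an equally defensible option.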

In [5]:
# some miscleaned outliers: implausibly large values
df[df.salary>1e6].shape

(47, 17)

In [6]:
# some miscleaned outliers: implausibly small values
df[df.salary<1000].shape

(129, 17)
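
One possible way to deal with the rows counted above is to mark values outside a plausible band as missing rather than keep them as they are. The sketch below reuses the 1,000 and 1,000,000 thresholds from the two cells above. Whether those bounds suit every currency in the dataset is an assumption, and `mask_implausible` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def mask_implausible(salary, low=1_000, high=1_000_000):
    """Replace values outside [low, high] with NaN; the bounds are illustrative."""
    # between() is inclusive on both ends and is False for NaN
    return salary.where(salary.between(low, high))


# Toy values, not taken from the dataset
toy = pd.Series([0, 55_000, 75_000, 2.0e17, np.nan])
cleaned = mask_implausible(toy)
```

Masking keeps the rows available for feature exploration while excluding them from any statistic computed on `salary`. Inspecting a sample of the masked rows first would show whether some can be repaired instead.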

### Features

Also clean the features, which are mostly categorical:

* company
* department
* title
* currency
* gender
* race
* sexual_orientation
* country
* city
* inhouse
* happy_salary

There are also some numerical variables, which you can transform into categorical variables if they are too sparse (such as `nb_awards_*`):
* exp_years
* nb_jobs
* nb_awards_top
* nb_awards_inter
* nb_awards_local
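
Turning a sparse count into a categorical variable could look like the sketch below, which bins `nb_awards_top` with `pd.cut`. The bin edges, the labels, the `unknown` level for missing values, and the toy values are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy values, not taken from the dataset
toy = pd.DataFrame({'nb_awards_top': [0, 0, 1, 3, 12, None]})

# Illustrative bins: (-1, 0] -> none, (0, 2] -> 1-2, (2, inf] -> 3+
bins = [-1, 0, 2, float('inf')]
labels = ['none', '1-2', '3+']
binned = pd.cut(toy['nb_awards_top'], bins=bins, labels=labels)

# Keep missing answers as their own level instead of dropping them
toy['nb_awards_top_cat'] = binned.cat.add_categories('unknown').fillna('unknown')
```

The same pattern would apply to `nb_awards_inter` and `nb_awards_local`. Looking at `value_counts()` for each column first helps pick edges that leave enough rows in every bin.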

In [7]:
df.columns

Index(['company', 'department', 'title', 'salary', 'currency', 'gender',
       'race', 'sexual_orientation', 'exp_years', 'country', 'city', 'inhouse',
       'nb_jobs', 'happy_salary', 'nb_awards_top', 'nb_awards_inter',
       'nb_awards_local'],
      dtype='object')

## Exploration

Explore the cleaned dataset to see the distribution of each variable, as well as the pairwise relationships between interesting variables.
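
As a starting point, a sketch of two such summaries in plain pandas: the target's distribution within each level of a categorical feature, and the linear association between a numeric feature and the target. The column names come from the dataset, but the toy values and department names are made up.

```python
import pandas as pd

# Toy values, not taken from the dataset
toy = pd.DataFrame({
    'department': ['creative', 'creative', 'strategy', 'strategy', 'account'],
    'exp_years': [2, 10, 5, 8, 3],
    'salary': [45_000, 120_000, 70_000, 90_000, 50_000],
})

# Row count and median salary for each level of a categorical feature
by_department = (
    toy.groupby('department')['salary']
       .agg(['count', 'median'])
       .sort_values('median', ascending=False)
)

# Pearson correlation between a numeric feature and the target
exp_salary_corr = toy['exp_years'].corr(toy['salary'])
```

The median is used here because it is less sensitive than the mean to any outliers that survive cleaning. The same tables can then be turned into plots.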

## Modeling

Create a model that predicts `salary`, and evaluate it on a randomly held-out validation set containing 20% of the rows.
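
A minimal sketch of the hold-out procedure, using only pandas and toy data. The median predictor is a sanity-check baseline and not the model the exercise asks for. The fixed `random_state` and the choice of mean absolute error as the metric are assumptions.

```python
import pandas as pd

# Toy values, not taken from the dataset
toy = pd.DataFrame({
    'exp_years': list(range(1, 11)),
    'salary': [40_000 + 6_000 * years for years in range(1, 11)],
})

# Hold out a random 20% of the rows; the seed is fixed for reproducibility
valid = toy.sample(frac=0.2, random_state=0)
train = toy.drop(valid.index)

# Baseline: predict the training median for every validation row
baseline = train['salary'].median()
baseline_mae = (valid['salary'] - baseline).abs().mean()
```

Any fitted model should be scored on `valid` with the same metric and compared against `baseline_mae`. Cleaning choices that depend on the target, such as outlier thresholds, are best fixed before looking at validation scores.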