<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Baltimore Salaries

_Authors: Greg Baker (SYD)_

---

The City of Baltimore publishes data about all of its employees, including their salaries.
An employee's annual salary can differ from their gross pay: they may work overtime and earn
more than their official salary, or they may be employed for only part of the year
and earn less.

In this lab we'll estimate what a typical City of Baltimore employee's gross pay will be
based on their annual salary.

In [None]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt

## Read the dataset

The Baltimore salaries data set is in "datasets/Baltimore_City_Employee_Salaries_2011.csv". 
You can use column 0 as an index. Column 4 is a date.
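As a rough sketch of the two `read_csv` options mentioned above, the snippet below reads a tiny in-memory stand-in instead of the real file. The rows are invented, and every column name other than AnnualSalary and GrossPay is an assumption made for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real CSV. The rows are made up, and
# column names other than AnnualSalary and GrossPay are assumptions.
toy_csv = io.StringIO(
    "Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay\n"
    "Doe Jane,Clerk,A01,Finance,2005-03-14,$40000.00,$41250.50\n"
    "Roe Sam,Engineer,A02,Public Works,2010-07-01,$65000.00,$30100.00\n"
)

# index_col=0 uses the first column of the file as the index;
# parse_dates=[4] asks pandas to parse the fifth column of the file as dates.
toy = pd.read_csv(toy_csv, index_col=0, parse_dates=[4])
print(toy.dtypes)
```

For the lab itself, the same two keyword arguments should carry over when the path from the text replaces `toy_csv`.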

In [None]:
# A:

## Pre-processing to have numbers instead of strings 

The AnnualSalary and GrossPay columns are strings and start with a $. Strip this off and convert
these columns to floats.
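One possible way to do this conversion, shown on a small made-up frame rather than the real data, is to strip the leading dollar sign with the pandas string accessor and then cast. The values below are invented for illustration.

```python
import pandas as pd

# Made-up values in the same string format as the real columns
toy = pd.DataFrame({
    "AnnualSalary": ["$40000.00", "$65000.00"],
    "GrossPay": ["$41250.50", "$30100.00"],
})

for col in ["AnnualSalary", "GrossPay"]:
    # lstrip("$") removes only leading dollar signs; astype(float) then parses the number
    toy[col] = toy[col].str.lstrip("$").astype(float)

print(toy.dtypes)
```

If the real columns contain missing or malformed entries, `pd.to_numeric(..., errors="coerce")` may be a safer final step than `astype(float)`.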

In [None]:
# A:

## Exploratory Data Analysis

Create a scatter plot of annual salary against gross pay.

In [None]:
# A:

# Look for a linear relationship

It seems like there is a linear relationship in there, but it is obscured by a lot of noise.

Split the data into a test and training data set.
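A minimal sketch of the split, using synthetic numbers in place of the real columns. The variable names, the 25% test fraction and the `random_state` are all assumptions, not choices prescribed by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the AnnualSalary and GrossPay columns
rng = np.random.default_rng(0)
salary = rng.uniform(20_000, 100_000, size=200)
gross = 0.9 * salary + rng.normal(0, 5_000, size=200)

# scikit-learn expects a 2-D feature matrix, so reshape the single feature
X = salary.reshape(-1, 1)

# Hold out a quarter of the rows for testing; fixing random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, gross, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```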

In [None]:
# A:

## Ordinary Least Squares

The errors in the scatter plot above don't look evenly balanced around the trend, which doesn't bode well for ordinary least squares.

Let's see what it gives us: import `sklearn.linear_model`, create an ordinary least squares regressor
and train it on the training data.
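For reference, fitting an ordinary least squares model in scikit-learn looks roughly like this. The data here is synthetic, with a known slope of 3 and intercept of 2, so the fitted coefficients can be sanity-checked. In the lab you would fit on your own training split instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus a little Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

ols = LinearRegression()
ols.fit(X, y)

# coef_ holds one slope per feature; intercept_ is the constant term
print(ols.coef_, ols.intercept_)
```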

In [None]:
# A:

### Visualise

Plot the test data, and plot the predictions from the linear model over it. OLS
will generally predict a gross pay that is a little too high.

In [None]:
# A:

### Measure

Initially, let's look at three metrics to understand how well this line represents the data.

- Calculate the $R^2$ score for the predictions it made
- Calculate the median absolute error
- Calculate the mean absolute error

Remember that sklearn.metrics has functions for doing all of these.
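The three functions can be tried on a handful of made-up numbers before applying them to the model's predictions. The values below are arbitrary and chosen only so the results are easy to check by hand.

```python
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score

# Arbitrary illustrative values: absolute errors are 2, 2, 3 and 0
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]

r2 = r2_score(y_true, y_pred)                # 1 - 17/500 = 0.966
med = median_absolute_error(y_true, y_pred)  # median of [0, 2, 2, 3] = 2.0
mean = mean_absolute_error(y_true, y_pred)   # 7 / 4 = 1.75
print(r2, med, mean)
```

All three take the true values first and the predictions second, so in the lab the calls would take your test targets and the model's predictions on the test features.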

In [None]:
# A:

## Robust Regression

Perform the same analysis using Theil-Sen, RANSAC and Huber.
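All three robust regressors live in `sklearn.linear_model` and share the usual `fit`/`predict` interface. The sketch below fits them to synthetic data with a true slope of 2, where a fifth of the points have been pushed far below the line. The data, the outlier pattern and the `random_state` values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, RANSACRegressor, TheilSenRegressor

# Synthetic data: y = 2x + 1 with small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.3, size=200)

# Push a fifth of the points well below the line to act as outliers
outliers = rng.choice(200, size=40, replace=False)
y[outliers] -= 15.0

models = {
    "Theil-Sen": TheilSenRegressor(random_state=0),
    "RANSAC": RANSACRegressor(random_state=0),
    "Huber": HuberRegressor(),
}

slopes = {}
for name, model in models.items():
    model.fit(X, y)
    # RANSAC wraps a base estimator; its fitted line is stored in estimator_
    fitted = model.estimator_ if name == "RANSAC" else model
    slopes[name] = float(fitted.coef_[0])

print(slopes)
```

On the real salary data, Huber in particular may need more iterations or rescaled inputs to converge, since the values are much larger than in this toy example.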

### Theil-Sen

Train the Theil-Sen regressor, plot its predictions for the test data and calculate the 
three metrics above. You can copy and paste most of the code you wrote.

Expect a worse $R^2$, and perhaps worse values for the other metrics, but a better-looking fit.

In [None]:
# A:

### RANSAC

As above, using RANSAC.

In [None]:
# A:

### Huber

As above, using Huber. If you are running an old version of scikit-learn (earlier than 0.18) you might not have the
option to create a Huber regressor.

In [None]:
# A:

## Review

- Which model had the highest $R^2$ score? Why is this obvious?
- Which model had the lowest median absolute error?
- Which model had the lowest mean absolute error?

In [None]:
# A:

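# OLS always has the highest R^2 score on the training data, because minimising the
# sum of squared errors is the same as maximising R^2. It usually, though not
# necessarily, has the highest R^2 on the test data too.
# Huber usually wins on median absolute error and mean absolute error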

# Commercial Analysis

You are the hiring manager at the City of Baltimore. New employees regularly ask
how much they are actually likely to earn given the salary that they are about to
agree to.

You don't want to give an answer that is too high because then you might be putting
the city at risk of a lawsuit for misrepresenting the job. You don't want to give an
answer that is too low because then the candidate might pass up the job and work
elsewhere.

You decide that it will cost \\$0.05 in lawsuit danger for each dollar that you
over-represent, but only \\$0.01 for each dollar that you under-represent.

For example, if a candidate is actually likely to earn \\$100,000 and you say \\$120,000, this
is worth \\$1,000 in potential lawsuits for misrepresentation. If you say \\$80,000
then that will cost you \\$200 in potential recruiters' fees to find someone else.

## Evaluate existing models

You will need to choose between the four models that you have built. Choose
the one that would have cost the City the least money if you had
used it on all the employees in your test data set.

Write a scoring function that takes an estimator, an Xtest set and a Ytest set,
and returns that total dollar cost.
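One way such a scoring function might look is sketched below. The function name, the constant names and the use of `DummyRegressor` as a stand-in estimator are all assumptions for illustration. The per-dollar rates are the ones stated above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

OVER_COST = 0.05   # cost per dollar of over-represented pay
UNDER_COST = 0.01  # cost per dollar of under-represented pay

def misrepresentation_cost(estimator, X_test, y_test):
    """Total dollar cost of quoting the estimator's predictions for the test set."""
    predicted = np.asarray(estimator.predict(X_test), dtype=float)
    actual = np.asarray(y_test, dtype=float)
    diff = predicted - actual
    over = np.clip(diff, 0, None).sum()    # dollars quoted above actual gross pay
    under = np.clip(-diff, 0, None).sum()  # dollars quoted below actual gross pay
    return OVER_COST * over + UNDER_COST * under

# Stand-in estimator that always predicts 100,000, scored against two made-up employees
X_demo = np.array([[90_000.0], [110_000.0]])
y_demo = np.array([80_000.0, 120_000.0])
always_100k = DummyRegressor(strategy="constant", constant=100_000.0).fit(X_demo, y_demo)

total = misrepresentation_cost(always_100k, X_demo, y_demo)
print(f"{total:.2f}")  # 1200.00: 20,000 over at 0.05 plus 20,000 under at 0.01
```

Because the function only relies on `predict`, it should accept any of the fitted models from the earlier sections unchanged.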

In [None]:
# A:


### Score the four models using this function

- OLS
- RANSAC
- Theil-Sen
- Huber

In [None]:
# A:
