# Salary Increase Rates from 1967 - 2019

The budget information page at http://otcads.umd.edu/bfa/budgetinfo3.htm links to a historical salary increase (COLA) page (http://otcads.umd.edu/bfa/2019%20COLA%20history%20Revised.htm).  
This dataset contains the salary increase rates for every year from 1967 to 2019.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle
import numpy as np

## Scraping

In [2]:
url = 'http://otcads.umd.edu/bfa/2013%20COLA%20history_files/sheet001.htm'
r = requests.get(url)

In [3]:
soup = BeautifulSoup(r.text,'html.parser')

## Parsing

In [4]:
# Confirm that the page contains exactly one table, so soup.find('table') is unambiguous.
len(soup.find_all('table'))

1

In [5]:
table = soup.find('table')

In [6]:
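str_salary = []

for td in table.find_all('td'):
    p = td.find('p')
    # Skip cells with no paragraph, whitespace-only text, or bold text
    # (bold cells are presumably headers rather than data).
    if p is not None and not p.text.isspace() and p.find('b') is None:
        str_salary.append(p.text.strip())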

All of the data from the page is now in the str_salary list. The list was printed and inspected to find the right offsets: each year occupies four consecutive elements (year, state C.O.L.A., merit, total increase), and the last row starts at index 208.

In [7]:
columns = ['Year', 'State C.O.L.A. %', 'Merit %', 'Total Increase %']

# Each year occupies four consecutive elements; the last row starts at index 208.
# DataFrame.append was removed in pandas 2.0, so the rows are collected first.
rows = [str_salary[i:i + 4] for i in range(0, 209, 4)]
# print(rows[0])  # uncomment to inspect a single row

salary_increases = pd.DataFrame(rows, columns=columns)
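
The row-building step can also be written defensively. The sketch below is an illustration, not part of the original notebook: `chunk_rows` is a hypothetical helper and the sample values are made up. It splits a flat list into fixed-width rows and raises an error if the list length is not a multiple of the row width, which would suggest that the page layout has changed.

```python
def chunk_rows(cells, width=4):
    """Split a flat list of cell strings into rows of `width` items."""
    if len(cells) % width != 0:
        raise ValueError(
            f"expected a multiple of {width} cells, got {len(cells)}"
        )
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Made-up sample values, standing in for the scraped str_salary list.
sample = ['1967', '5.0%', '0.0%', '5.0%',
          '1968', '3.0%', '2.0%', '5.0%']
rows = chunk_rows(sample)
```

Passing the result to `pd.DataFrame(rows, columns=[...])` would then yield one row per year without relying on a hard-coded end index.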

## Finalizing & Fine-tuning

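The percentage columns are now cast to floats. Percent signs, dollar signs, commas, and the word 'bonus' are stripped first, and 'variable' entries become NaN. As a result, a flat-dollar bonus keeps its dollar amount in a percentage column (2012 shows 750.0 in the output below, presumably for this reason).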

In [8]:
for col in ['State C.O.L.A. %', 'Merit %', 'Total Increase %']:
    salary_increases[col] = salary_increases[col].apply(lambda x : x.replace('%', '').replace(',', '').replace('$', '').replace('bonus', ''))
    salary_increases[col] = salary_increases[col].replace('variable', np.nan)
    salary_increases[col] = salary_increases[col].astype(float)
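
The same cleaning rule can be expressed as a small standalone function, which makes it easy to check individual strings. This is a minimal sketch under assumptions: `parse_rate` is a hypothetical helper, the example inputs are invented rather than taken from the page, and unlike the cell above it strips whitespace before comparing against 'variable'.

```python
import math


def parse_rate(text):
    """Convert a scraped cell such as '2.5%' to a float; 'variable' becomes NaN."""
    cleaned = text
    # Remove the same tokens the notebook cell strips.
    for token in ('%', ',', '$', 'bonus'):
        cleaned = cleaned.replace(token, '')
    cleaned = cleaned.strip()
    if cleaned == 'variable':
        return math.nan
    return float(cleaned)


# Invented example inputs (assumptions, not values copied from the page).
percent_value = parse_rate('2.5%')
bonus_value = parse_rate('$1,750 bonus')
missing_value = parse_rate('variable')
```

Applying it with `salary_increases[col].apply(parse_rate)` would be one way to replace the three chained steps, at the cost of a Python-level loop.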

In [9]:
salary_increases = salary_increases.sort_values(by='Year', ascending=False)

In [10]:
salary_increases

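|    | Year | State C.O.L.A. % | Merit % | Total Increase % |
|---:|-----:|-----------------:|--------:|-----------------:|
| 52 | 2019 | 2.5 | 0.0 | 2.5 |
| 51 | 2018 | 0.0 | 0.0 | 0.0 |
| 50 | 2017 | 0.0 | 2.5 | 2.5 |
| 49 | 2016 | 0.0 | 0.0 | 0.0 |
| 48 | 2015 | 2.0 | 2.5 | 5.5 |
| 47 | 2014 | 3.0 | 2.5 | 5.5 |
| 46 | 2013 | 2.0 | 0.0 | 2.0 |
| 45 | 2012 | 750.0 | 0.0 | 750.0 |
| 44 | 2011 | 0.0 | 0.0 | 0.0 |
| 43 | 2010 | 0.0 | 0.0 | 0.0 |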


In [11]:
salary_increases.to_pickle('df/salary_increases')
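
`to_pickle` does not create missing directories, so the call above fails if the `df/` folder does not already exist. A minimal sketch of a save-and-reload round trip, assuming pandas is installed; a temporary directory stands in for the notebook's `df/` folder, and the two-row frame reuses values from the output above rather than the full scraped table:

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the output above, standing in for the full frame.
frame = pd.DataFrame({'Year': ['2019', '2018'], 'Total Increase %': [2.5, 0.0]})

out_dir = os.path.join(tempfile.mkdtemp(), 'df')  # stand-in for the notebook's df/ folder
os.makedirs(out_dir, exist_ok=True)               # create the folder before writing
path = os.path.join(out_dir, 'salary_increases')

frame.to_pickle(path)
restored = pd.read_pickle(path)                   # reload to confirm the round trip
```

Reloading with `pd.read_pickle` preserves column dtypes, which a CSV round trip would not guarantee for the string-typed Year column.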