# Wisconsin faculty salaries

In this demo, we'll analyze data on average faculty salaries in Wisconsin, as compiled by the [Chronicle of Higher Education](http://data.chronicle.com/category/state/Wisconsin/faculty-salaries/). The file `wisconsinfaculty.txt` contains data on faculty salaries at the 25 highest-paying institutions in Wisconsin in the 2013-2014 academic year, ordered by pay for faculty with the rank of Professor.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# To start, we look at the first 10 lines of the file, to understand its structure.

with open("wisconsinfaculty.txt") as wisc_file:
    # enumerate() numbers the lines for us, starting from 0
    for line_num, line in enumerate(wisc_file):
        if line_num == 10:
            break
        print("Line number:", line_num, line)
        

We see that after a header line, institution names are on odd lines and institution data is on even lines.
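
Because each record spans two consecutive lines, the file can be walked in pairs. The sketch below shows one way to do that on a made-up stand-in for the file (the sample text is invented, not taken from `wisconsinfaculty.txt`). It uses `zip` on the same iterator twice, which is an alternative to the `readline()` calls used in the rest of this demo.

```python
import io

# Made-up stand-in for the file: a header, then name/data line pairs.
sample = io.StringIO(
    "header line\n"
    "Example University\n"
    "data for Example University\n"
    "Sample College\n"
    "data for Sample College\n"
)

next(sample)  # skip the header line

# zip(sample, sample) pulls two consecutive lines per iteration,
# because both arguments are the same iterator.
pairs = [(name.strip(), data.strip()) for name, data in zip(sample, sample)]
print(pairs)
```

If the file ends with a name line that has no data line after it, `zip` silently drops that name.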

We want to build a data frame with the following headers: Institution, Is Public, Professors, Associate Professors, Assistant Professors, Instructors, Lecturers, and Unranked.

Let's start by making sure we can extract the appropriate information.  We'll test our code on the first few lines of the file, to make sure it works.
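
The fiddliest step is turning one salary field into a number. Here is a minimal sketch of that step in isolation, assuming a field looks something like `$101,200` and that a lone dash stands for a missing value. The helper name `clean_salary` and the sample strings are hypothetical.

```python
import numpy as np

def clean_salary(field):
    """Hypothetical helper: convert a field such as '$101,200' to an int, or NaN if empty."""
    # Remove dollar signs, thousands separators, newlines, and dashes.
    digits = field.translate(str.maketrans("", "", "$,\n-"))
    # An empty result means there was no salary to report.
    return int(digits) if digits else np.nan

# Made-up fields, not taken from the real file.
print(clean_salary("$101,200"))
print(clean_salary("-\n"))
```

`str.maketrans("", "", chars)` builds the same kind of deletion table as the dictionary comprehension used in the cells of this demo.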

In [None]:
%%time
with open("wisconsinfaculty.txt") as wisc_file:
    
    #skip the header line
    next(wisc_file)
    
    line_num=1
    while line_num < 10:
        # Each record spans two lines, so we call readline() twice
        odd_line = wisc_file.readline()
        even_line = wisc_file.readline()
        
        #break out of the loop if we hit the end of the file
        if not even_line:
            break
        
        inst = odd_line.strip()
        
        #split the even line on the pipe symbol to extract public vs. private
        pipe_list = even_line.split("|")
        is_public =  "public" in pipe_list[1]
        
        #split on \t to extract the salary amount
        salary_list = pipe_list[2].split("\t")[1:]
        
        #strip special characters from salaries
        salary_list = [s.translate({ord(i):None for i in "$,\n-"}) for s in salary_list]
        
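        #convert non-empty strings to integers; empty strings become NaN
        #(int() is safer than eval(), which would execute arbitrary text from the file)
        salary_list = [np.nan if s=="" else int(s) for s in salary_list]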
        
        print(inst, is_public, salary_list)
            
        line_num += 2

Now we have to insert our information in a data frame.  A naive way to do this is to create an empty data frame and append the information for each row as we compute it.

In [None]:
%%time 
with open("wisconsinfaculty.txt") as wisc_file:
    
    #skip the header line
    next(wisc_file)
    
    # create an empty data frame with the appropriate columns
    column_names = ('Institution', 'Is Public', 'Professors', 'Associate Professors', 'Assistant Professors', 'Instructors', 'Lecturers', 'Unranked')
    fac_df = pd.DataFrame(columns=column_names)
    
    line_num=1
    while line_num < 10:
        # Each record spans two lines, so we call readline() twice
        odd_line = wisc_file.readline()
        even_line = wisc_file.readline()
        
        #break out of the loop if we hit the end of the file
        if not even_line:
            break
        
        inst = odd_line.strip()
        
        #split the even line on the pipe symbol to extract public vs. private
        pipe_list = even_line.split("|")
        is_public =  "public" in pipe_list[1]
        
        #split on \t to extract the salary amount
        salary_list = pipe_list[2].split("\t")[1:]
        
        #strip special characters from salaries
        salary_list = [s.translate({ord(i):None for i in "$,\n-"}) for s in salary_list]
        
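        #convert non-empty strings to integers; empty strings become NaN
        #(int() is safer than eval(), which would execute arbitrary text from the file)
        salary_list = [np.nan if s=="" else int(s) for s in salary_list]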
        
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
        fac_df = pd.concat([fac_df, pd.DataFrame([[inst, is_public]+salary_list], columns=column_names)], ignore_index=True)
            
        line_num += 2

print(fac_df)

We see that appending the information for each row is significantly slower than merely computing the information itself. The problem is that each append builds a brand-new data frame and copies every existing row into it, so the total work grows quadratically with the number of rows.

We can make our code faster by storing the information we compute in a list of dictionaries, and converting to a data frame at the end.
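
In miniature, the pattern looks like the sketch below. The rows are made up for illustration, and only three of the eight columns are used to keep it short.

```python
import pandas as pd

# Made-up rows for illustration only.
column_names = ("Institution", "Is Public", "Professors")
rows = [
    ["Example University", True, 100000],
    ["Sample College", False, 90000],
]

# Collect one plain dictionary per row; appending to a list is cheap.
dict_list = [dict(zip(column_names, row)) for row in rows]

# Build the data frame once, after all rows are collected.
demo_df = pd.DataFrame(dict_list)
print(demo_df)
```

The data frame is constructed a single time, so no intermediate copies are made while the rows are being gathered.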

In [None]:
%%time
with open("wisconsinfaculty.txt") as wisc_file:
    
    #skip the header line
    next(wisc_file)
    
    dict_list = []
    column_names = ('Institution', 'Is Public', 'Professors', 'Associate Professors', 'Assistant Professors', 'Instructors', 'Lecturers', 'Unranked')
    
    line_num=1
    while line_num < 10:
        # Each record spans two lines, so we call readline() twice
        odd_line = wisc_file.readline()
        even_line = wisc_file.readline()
        
        #break out of the loop if we hit the end of the file
        if not even_line:
            break
        
        inst = odd_line.strip()
        
        #split the even line on the pipe symbol to extract public vs. private
        pipe_list = even_line.split("|")
        is_public =  "public" in pipe_list[1]
        
        #split on \t to extract the salary amount
        salary_list = pipe_list[2].split("\t")[1:]
        
        #strip special characters from salaries
        salary_list = [s.translate({ord(i):None for i in "$,\n-"}) for s in salary_list]
        
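        #convert non-empty strings to integers; empty strings become NaN
        #(int() is safer than eval(), which would execute arbitrary text from the file)
        salary_list = [np.nan if s=="" else int(s) for s in salary_list]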
        
        dict_list.append(dict(zip(column_names, [inst,is_public]+salary_list)))
            
        line_num += 2

fac_df = pd.DataFrame(dict_list)
print(fac_df)

We're ready to run our code on the full file!

In [None]:
with open("wisconsinfaculty.txt") as wisc_file:
    
    #skip the header line
    next(wisc_file)
    
    dict_list = []
    column_names = ('Institution', 'Is Public', 'Professors', 'Associate Professors', 'Assistant Professors', 'Instructors', 'Lecturers', 'Unranked')
    
    #We rely on our break command at the end of the file to exit the loop gracefully
    
    while True:
        # Each record spans two lines, so we call readline() twice
        odd_line = wisc_file.readline()
        even_line = wisc_file.readline()
        
        #break out of the loop if we hit the end of the file
        if not even_line:
            break
        
        inst = odd_line.strip()
        
        #split the even line on the pipe symbol to extract public vs. private
        pipe_list = even_line.split("|")
        is_public =  "public" in pipe_list[1]
        
        #split on \t to extract the salary amount
        salary_list = pipe_list[2].split("\t")[1:]
        
        #strip special characters from salaries
        salary_list = [s.translate({ord(i):None for i in "$,\n-"}) for s in salary_list]
        
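        #convert non-empty strings to integers; empty strings become NaN
        #(int() is safer than eval(), which would execute arbitrary text from the file)
        salary_list = [np.nan if s=="" else int(s) for s in salary_list]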
        
        dict_list.append(dict(zip(column_names, [inst,is_public]+salary_list)))

fac_df = pd.DataFrame(dict_list)
fac_df

Now that we have created the data frame, it's easy to extract summary statistics!

In [None]:
fac_df.describe()

In [None]:
fac_df.groupby('Is Public').describe()
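
Calling `describe()` on a grouped data frame produces a wide table. To compare a single rank between public and private institutions, it can be easier to select that column before aggregating. The sketch below uses made-up salaries rather than the real Wisconsin figures, and the name `demo_df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up salaries, not the real Wisconsin figures.
demo_df = pd.DataFrame({
    "Institution": ["A", "B", "C", "D"],
    "Is Public": [True, True, False, False],
    "Professors": [100000, 90000, 80000, np.nan],
})

# Select one column before aggregating; mean() skips NaN by default.
prof_means = demo_df.groupby("Is Public")["Professors"].mean()
print(prof_means)
```

The same call should work on `fac_df`, since it has the same `Is Public` and `Professors` columns.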