# Homework 4

In this HW, we will study the pay gap between men and women who have jobs in San Francisco.  We will use the following two csv files to accomplish this task.

- Salaries.csv : contains salaries for over 100K employees in SF from 2011 to 2014.
- Names.csv :  contains baby names from 1980 to 2014 along with counts of how many times the given baby name was used.

We would like to find the average salary of men and women over all jobs from 2011 to 2014.  The problem, however, is that Salaries.csv does not contain gender.  Further, many names are unisex.  Since Names.csv contains counts, we use a majority vote to label the gender of each name: a name is labeled with whichever gender it was recorded for more often.

You will be asked to write a series of functions to implement this task. 

**Note: Unlike previous homeworks, the problems in this homework are inter-dependent: you can only pass the tests for problem n if you have passed the tests for problem n-1, since problem n normally calls the function written in problem n-1.**


## Problem 1 Read the data

The following function **ReadData** reads the salary and names data into pandas dataframes and returns a list containing these two dataframes.

In [1]:
#Place your import here
import pandas as pd
import numpy as np

def ReadData():
    df_salaries = pd.read_csv("./Data/Salaries.csv")
    df_names = pd.read_csv("./Data/Names.csv")
    
    return [df_salaries, df_names]

In [2]:
[df_salaries, df_names] = ReadData()
assert df_names.shape == (24713, 5)

In [3]:
[df_salaries, df_names] = ReadData()
assert df_salaries.shape == (27386, 6)

## Problem 2 Get name counts

The following function **ParseNames** takes the names dataframe as input and outputs two dictionaries called male_names and female_names. The keys in each dictionary are the names (in all lowercase), and each value is the sum of the counts for that name when it applies to the given gender. Note that the same name may appear in both the male and the female dictionary.

For this function, USE ONLY ITERROWS(), NO GROUPING OR FILTERING YET! 

In [4]:
def ParseNames(df_names):
    """
    
    INPUT: the pandas dataframe read from Names.csv
    
    OUTPUT: two dictionaries: male_names, female_names.
    The keys in each of these dictionaries are names 
    (in all lowercase) and each value is the sum of the 
    counts for the given name when it applies to the given gender.
    
    USE ONLY ITERROWS(), NO GROUPING OR FILTERING YET! 
    This function may take a minute or two to run. 
    """
    
    #Initialize empty dictionaries for names
    male_names = {}
    female_names = {}
    
    for index, row in df_names.iterrows():
        name = row["Name"].lower()
        count = row["Count"]
        gender = row["Gender"]
        
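        # Accumulate counts so that repeated rows for the same name
        # (e.g. different years) are summed rather than overwritten.
        if gender == "F":
            female_names[name] = female_names.get(name, 0) + count
        else:
            male_names[name] = male_names.get(name, 0) + count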
    
    return male_names, female_names


In [5]:
[male_names, female_names] = ParseNames(df_names)
assert len(male_names) == 9482
assert len(female_names) == 15231
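
To make the "sum of counts" requirement concrete, here is a minimal sketch on a toy dataframe. The data and the helper name `sum_counts_by_gender` are made up for illustration; only the column names ("Name", "Gender", "Count") are taken from ParseNames above.

```python
import pandas as pd

# Toy stand-in for Names.csv (invented rows, same column names as above).
toy_names = pd.DataFrame({
    "Name": ["Jordan", "Jordan", "jordan", "Mary"],
    "Gender": ["M", "F", "M", "F"],
    "Count": [300, 120, 50, 900],
})

def sum_counts_by_gender(df):
    """Hypothetical helper: sum counts per lowercase name, split by gender."""
    male, female = {}, {}
    for _, row in df.iterrows():
        name = row["Name"].lower()
        target = female if row["Gender"] == "F" else male
        # dict.get supplies 0 the first time a name is seen.
        target[name] = target.get(name, 0) + int(row["Count"])
    return male, female

toy_male, toy_female = sum_counts_by_gender(toy_names)
print(toy_male)
print(toy_female)
```

The two male rows for "Jordan" and "jordan" collapse into a single lowercase key whose value is their combined count, while the female row for the same name is tracked separately.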

## Problem 3 Get First Name

The following function **GetFirstName** takes the name of a person (first and last names separated by spaces) and returns the person's first name in lowercase.


In [6]:
def GetFirstName(name):
    
    """
    Gets the first name from a name in the column
    EmployeeName in Salaries.csv.
    INPUT: name as string
    OUTPUT: first name in all lowercase
    """
    first_name = name.split(" ")[0].lower()
    
    return first_name

In [7]:
assert GetFirstName("Dennis Zhang") == "dennis"
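
One detail worth knowing: `split(" ")` returns an empty first element when the name starts with a space, whereas `split()` with no argument splits on any run of whitespace. The sketch below illustrates the difference; `first_name_lower` is a hypothetical variant, not part of the assignment, and whether Salaries.csv actually contains such names is not something this notebook establishes.

```python
def first_name_lower(full_name):
    """Hypothetical variant of GetFirstName that tolerates extra whitespace."""
    parts = full_name.split()  # no argument: split on any run of whitespace
    return parts[0].lower() if parts else ""

print(first_name_lower("Dennis Zhang"))
print(first_name_lower("  GARY   JIMENEZ"))

# With an explicit " " separator, leading spaces yield an empty first element.
leading_space_result = "  GARY   JIMENEZ".split(" ")[0]
print(repr(leading_space_result))
```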

## Problem 4 AddGender

This function takes in the male and female name dictionaries from **ParseNames**, and a first name. It returns "M" if the first name has a higher count in male_names than in female_names, "F" if its count in female_names is greater than or equal to its count in male_names, and "NaN" if the name appears in neither male_names nor female_names. A name that appears in only one dictionary is assigned that gender.

In [8]:
def AddGender(first_name, male_names, female_names):
    
    """
    Find the most likely gender associated with a first name.
    
    INPUT: first_name, plus male_names and female_names, which are the
    dictionaries returned from ParseNames().
    
    OUTPUT:
    "M" if male_names[name] > female_names[name]
    
    "F" if male_names[name] <= female_names[name]
    
    "NaN" if the name doesn't appear in either dictionary
    """

    if first_name not in female_names and first_name in male_names:
        return_gender = "M"
    elif first_name not in male_names and first_name in female_names:
        return_gender = "F"
    elif first_name not in female_names and first_name not in male_names:
        return_gender = "NaN"
    else: 
        if male_names[first_name] > female_names[first_name]:
            return_gender = "M"
        else: 
            return_gender = "F"
    
    return return_gender

In [9]:
[df_salaries, df_names] = ReadData()
assert AddGender("charles", male_names, female_names) == "M"
assert AddGender("jasmine", male_names, female_names) == "F"
assert AddGender("dennis", male_names, female_names) == "M"
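
The same majority vote can be written more compactly with `dict.get`, which treats a missing name as a count of zero. This is only a sketch on invented counts; `majority_gender` is a hypothetical name, and it assumes every stored count is positive so that a zero can stand for "not present".

```python
def majority_gender(first_name, male_counts, female_counts):
    """Hypothetical compact variant of the majority vote."""
    m = male_counts.get(first_name, 0)
    f = female_counts.get(first_name, 0)
    if m == 0 and f == 0:
        return "NaN"  # name is in neither dictionary
    # Ties go to "F", matching the <= rule in the docstring above.
    return "M" if m > f else "F"

# Invented counts for illustration only.
toy_male = {"jordan": 350, "dennis": 500, "casey": 80}
toy_female = {"jordan": 120, "mary": 900, "casey": 80}

for name in ["jordan", "mary", "casey", "zyx"]:
    print(name, majority_gender(name, toy_male, toy_female))
```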

## Problem 5 AddGenderToDF

This function takes the df_salaries dataframe and adds two new columns: "first_name" and "gender". The function assigns a gender to each row in df_salaries based on the first name of the person and the male_names and female_names dictionaries. You should use AddGender() and GetFirstName() to implement this function. The function then returns the dataframe with the new columns.

In [10]:
def AddGenderToDF(df_salaries, male_names, female_names):
    """
    This function will return a new dataframe with two new columns
    on top of the existing columns in df_salaries. 
    
    The first column is called "first_name" which contains the first
    name of the person.
    
    The second column is called "gender" which contains the gender
    information of the person from the AddGender() function.
    """
    
    df_salaries["first_name"] = df_salaries.EmployeeName.apply(GetFirstName)
    df_salaries["gender"] = df_salaries.first_name.apply(AddGender, args = (male_names, female_names))
    
    return df_salaries

In [11]:
[df_salaries, df_names] = ReadData()
df_salaries = AddGenderToDF(df_salaries, male_names, female_names)

assert df_salaries[df_salaries["EmployeeName"] == "GARY JIMENEZ"]["first_name"].tolist()[0] == "gary"
assert df_salaries[df_salaries["EmployeeName"] == "GARY JIMENEZ"]["gender"].tolist()[0] == "M"
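
For comparison, the two columns can also be built with pandas string methods and `Series.map` instead of `apply`. The sketch below uses invented names and counts, works on a copy so the input frame is left unchanged, and is an alternative formulation rather than the approach this homework asks for.

```python
import pandas as pd

# Invented data for illustration only.
toy_salaries = pd.DataFrame(
    {"EmployeeName": ["GARY JIMENEZ", "Mary Smith", "Zyx Unknown"]}
)
toy_male = {"gary": 1000, "jordan": 350}
toy_female = {"mary": 900, "jordan": 120}

# Precompute one gender label per known name (ties go to "F").
gender_lookup = {
    name: "M" if toy_male.get(name, 0) > toy_female.get(name, 0) else "F"
    for name in set(toy_male) | set(toy_female)
}

out = toy_salaries.copy()  # leave the input dataframe untouched
out["first_name"] = out["EmployeeName"].str.split().str[0].str.lower()
# Names missing from the lookup map to a missing value; label them "NaN".
out["gender"] = out["first_name"].map(gender_lookup).fillna("NaN")
print(out)
```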

## Problem 6 Compute the Average Salary

In this problem, you will implement a function called **ComputeAvgSalary**. This function takes in the df_salaries dataframe, which already has a column called "gender" indicating the gender of each employee. The function returns a list of two values, [male_avg_salary, female_avg_salary]: the average salary (the Total_Pay column) over all years from 2011 to 2014 for male and for female workers, respectively.

In [12]:
def ComputeAvgSalary(df_salaries):
    """
    This function takes the new salary dataframe with gender and
    first_name columns. It returns the average salary of male
    and female workers as [male_avg_salary, female_avg_salary]. 
    """

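    # Compute the group means once and select by label; positional
    # indexing ([0], [1]) silently depends on the sort order of the groups.
    avg_salary = df_salaries.groupby("gender").Total_Pay.mean()
    female_avg_salary = avg_salary["F"]
    male_avg_salary = avg_salary["M"]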
    
    return [male_avg_salary, female_avg_salary]

In [13]:
df_salaries = AddGenderToDF(df_salaries, male_names, female_names)
[male_avg_salary, female_avg_salary] = ComputeAvgSalary(df_salaries)
assert round(male_avg_salary, 2) == 97774.57
assert round(female_avg_salary, 2) == 83172.18
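
If a per-year breakdown is wanted, the same groupby idea extends to two keys. The sketch below runs on invented numbers and assumes the year is stored in a column named "Year"; that column name is an assumption, so check `df_salaries.columns` before adapting it.

```python
import pandas as pd

# Invented data; "Year" is an assumed column name.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012, 2012, 2012],
    "gender": ["M", "F", "NaN", "M", "F", "F"],
    "Total_Pay": [100.0, 80.0, 50.0, 120.0, 90.0, 110.0],
})

# Drop rows whose gender could not be determined.
labeled = toy[toy["gender"].isin(["M", "F"])]

# Mean pay for every (gender, year) pair, indexed by both keys.
avg_by_year = labeled.groupby(["gender", "Year"])["Total_Pay"].mean()

# Selecting one gender leaves a Series indexed by year.
salary_dict_male = avg_by_year.loc["M"].to_dict()
salary_dict_female = avg_by_year.loc["F"].to_dict()
print(salary_dict_male)
print(salary_dict_female)
```

Each resulting dictionary maps a year to the mean pay for that gender in that year.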