# Week 8: Lecture Supplement

This notebook contains all the code used to generate the `nyt_full_gender_signal.tsv` spreadsheet, which contains the added `first_name` and `gender_signal` columns in the NYT Best Seller List dataset that we are working with in the main Week 8 lecture. All steps and decisions in this supplement are explained in the main Week 8 Lecture Slides. This code is provided in case you think it might be useful for your Projects, but none of the advanced coding concepts in this notebook will be covered in the exam unless they are present in future weeks' main Lecture Notebooks.

In this notebook, we:
* Load a dataset containing gendered counts for baby names
* Create a dictionary (a Python data type not covered in the course to this point) in which the counts of particular names as male or female are stored
* Create another dictionary in which we assign one of three values for each name: if a name is assigned 90% or more of the time as either male or female, record it as `F` or `M`; if the ratio of assignment doesn't pass that threshold, assign it as `A` (ambiguous).
* Extract the first names of all authors in the NYT Best Seller list, store them in a new column `first_name`
* For each first name, assign a "gender signal" of `F`, `M`, or `A` based on the steps above, or assign `U` ("unknown") if the name doesn't appear in our list of names, or `I` if the name is an initial, as in J. K. Rowling. Store the predicted gender signal in a new column in the dataframe, `gender_signal`.
* Write the DataFrame with gender signal information to a new TSV file

# Step 1: Load the Datasets

Here we load the `nyt_full.tsv` dataset used last time, and also load the [UCI Gender By Name Data Set](https://archive.ics.uci.edu/ml/datasets/Gender+by+Name) (`name_gender_dataset.csv`). Both are loaded as Pandas DataFrames.

In [1]:
import pandas as pd

In [2]:
nyt_df = pd.read_csv('nyt_full.tsv', sep="\t")

In [3]:
n2g_df = pd.read_csv('name_gender_dataset.csv')

In [4]:
n2g_df

Unnamed: 0,Name,Gender,Count,Probability
0,James,M,5304407,1.451679e-02
1,John,M,5260831,1.439753e-02
2,Robert,M,4970386,1.360266e-02
3,Michael,M,4579950,1.253414e-02
4,William,M,4226608,1.156713e-02
...,...,...,...,...
147264,Zylenn,M,1,2.736740e-09
147265,Zymeon,M,1,2.736740e-09
147266,Zyndel,M,1,2.736740e-09
147267,Zyshan,M,1,2.736740e-09


As you can see, the dataset contains 147,269 names, and for each gives a binary gender (M/F) and a count for the number of times that name was given to a baby in the US, UK, Canadian, and Australian data (see main lecture slides or the link to the dataset page above for more details on the dataset and its sources). We will use the `Name`, `Gender`, and `Count` columns here. 

Below, you can see how many "male" and "female" names are in the dataset.

In [5]:
n2g_df['Gender'].value_counts()

F    89749
M    57520
Name: Gender, dtype: int64

Below, we see that many names appear *twice* in the dataset. This indicates that M and F counts are given in separate rows. There are not in fact 147,269 unique names; there are 133,910, with many appearing with both M and F counts.

In [6]:
n2g_df['Name'].value_counts()

James        2
Doni         2
Audley       2
Rhodes       2
Moran        2
            ..
Manard       1
Macksen      1
Lonas        1
Lethaniel    1
Zyton        1
Name: Name, Length: 133910, dtype: int64
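
The difference between the number of rows and the number of unique names can also be checked directly. The sketch below uses a small made-up DataFrame (`toy_df`, with invented counts, not the real dataset) that has the same `Name`, `Gender`, and `Count` columns; the same calls should work on `n2g_df`.

```python
import pandas as pd

# Toy stand-in for n2g_df (hypothetical values, not taken from the real dataset)
toy_df = pd.DataFrame({
    "Name": ["James", "James", "John", "Mary"],
    "Gender": ["M", "F", "M", "F"],
    "Count": [100, 2, 90, 80],
})

n_rows = len(toy_df)                   # total number of rows
n_unique = toy_df["Name"].nunique()    # number of distinct names

# Names that occur in more than one row, i.e. with both an M and an F entry
rows_per_name = toy_df["Name"].value_counts()
both = sorted(rows_per_name[rows_per_name > 1].index)

print(n_rows, n_unique, both)
```

Here only the toy "James" appears twice, so the unique-name count is one less than the row count.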

# Step 2: Organize Data into a Dictionary to Easily Extract M and F Counts for Each Name

Below, we go row-by-row through the `n2g_df` dataset to extract the counts for each name and store them in a new Python data type, a dictionary. 

In [7]:
n2g_df.head()

Unnamed: 0,Name,Gender,Count,Probability
0,James,M,5304407,0.014517
1,John,M,5260831,0.014398
2,Robert,M,4970386,0.013603
3,Michael,M,4579950,0.012534
4,William,M,4226608,0.011567


We will use the Pandas `.iterrows()` method [(documented here)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html) to *iterate through* the dataset row by row.

In [8]:
n2g_df.head().iterrows()

<generator object DataFrame.iterrows at 0x7fd9989986d0>

`.iterrows()` yields each row of the DataFrame as a pair: the row's index and the row's data as a Pandas Series. In the `for` loop below, we call the "index" (the row index 0-4 in the DataFrame head above) `i` and the row of data itself `row`.

In [9]:
for i, row in n2g_df.head().iterrows():
    print(i)
    print(row)

0
Name              James
Gender                M
Count           5304407
Probability    0.014517
Name: 0, dtype: object
1
Name               John
Gender                M
Count           5260831
Probability    0.014398
Name: 1, dtype: object
2
Name             Robert
Gender                M
Count           4970386
Probability    0.013603
Name: 2, dtype: object
3
Name            Michael
Gender                M
Count           4579950
Probability    0.012534
Name: 3, dtype: object
4
Name            William
Gender                M
Count           4226608
Probability    0.011567
Name: 4, dtype: object


`row` can be further subsetted as follows:
- `row['Name']` contains the Name value
- `row['Gender']` contains the Gender label
- `row['Count']` contains the Count value
- `row['Probability']` contains the Probability value, which we won't be using.
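
As a minimal sketch of this subsetting, the loop below pulls individual values out of each `row`. The DataFrame `toy_df` and its counts are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Toy stand-in for n2g_df (hypothetical values)
toy_df = pd.DataFrame({
    "Name": ["James", "James"],
    "Gender": ["M", "F"],
    "Count": [100, 2],
})

seen = []
for i, row in toy_df.iterrows():
    # i is the row index; row is a Series whose labels are the column names
    seen.append((i, row["Name"], row["Gender"], row["Count"]))

print(seen)
```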

Below, we create an empty Python Dictionary named `name2counts`. Dictionaries are a new data type, of which [Melanie Walsh offers a terrific overview here](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/11-Dictionaries.html).

We will be creating a nested dictionary. At the first level will be an individual name. At the second level, each name will have "M" and "F" keys, and the values will be the raw counts from `n2g_df`. 
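
To make the nesting concrete, here is a small hand-written dictionary with the same shape. The variable name `example` is ours, and the counts are the ones this notebook reports for those two names.

```python
# First level: the name. Second level: 'F' and 'M' keys holding raw counts.
example = {
    "Halsey": {"F": 274, "M": 408},
    "Ngaio": {"F": 2, "M": 0},
}

inner = example["Halsey"]          # the inner dictionary for one name
halsey_m = example["Halsey"]["M"]  # a single count inside that dictionary

print(inner)
print(halsey_m)
```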

In [10]:
name2counts = {}    # Creates a new empty dictionary
for i, row in n2g_df.iterrows():    # Iterates through the rows of the dataframe with gender data.
    nme = row['Name']    # The variable nme is assigned to the name of the current row of the gender data DF
    if nme not in name2counts:    # If we haven't yet encountered a particular name...
        name2counts[nme] = {'F':0, 'M': 0}    # ... we create an empty spot for that name in the name2counts dictionary
    name2counts[nme][row['Gender']] = row['Count']    # By this point we're sure there is an entry for the given name, so we can safely assign a value to whatever gender the current row of the gender data DF has info for. If this is the M James, it sticks the count in; if it's the F James, it puts that in.

The above leaves us with a dictionary called `name2counts` that contains every name in the Gender by Name dataset, and has M and F counts for each.
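
For comparison only, a similar nested dictionary can probably be built without an explicit loop by reshaping the DataFrame. This is a sketch on made-up data (`toy_df` and `toy_name2counts` are hypothetical names), not the approach used in this notebook. One difference to be aware of: if the same name and gender appeared in more than one row, this version would add the counts together, whereas the loop keeps the last one.

```python
import pandas as pd

# Toy stand-in for n2g_df (hypothetical values)
toy_df = pd.DataFrame({
    "Name": ["James", "James", "Mary"],
    "Gender": ["M", "F", "F"],
    "Count": [100, 2, 80],
})

# One row per name, one column per gender; missing combinations become 0
wide = toy_df.pivot_table(index="Name", columns="Gender", values="Count",
                          aggfunc="sum", fill_value=0)

# Convert to {name: {'F': count, 'M': count}}
toy_name2counts = wide.to_dict(orient="index")

print(toy_name2counts)
```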

In [11]:
type(name2counts)

dict

In [12]:
len(name2counts)

133910

We access the data in this dictionary not with index numbers or ranges (as in a list) but rather by the name itself.

In [19]:
name2counts['Ngaio']

{'F': 2, 'M': 0}

In [20]:
name2counts['Dr.']

{'F': 0, 'M': 2}

In [None]:
name2counts['George']

In [None]:
name2counts['Evelyn']

In [18]:
name2counts['Halsey']

{'F': 274, 'M': 408}

In [None]:
name2counts['Alex']

If we want to access the actual counts, a second level of subsetting needs to be done.

In [None]:
name2counts['Alex']['M']
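
One thing to watch for: asking a dictionary for a key it does not contain raises a `KeyError`. The sketch below, on a one-entry toy dictionary (`toy_name2counts`, with a made-up missing name), shows two standard ways to guard against that: the `in` operator and the `.get()` method.

```python
toy_name2counts = {"Halsey": {"F": 274, "M": 408}}

# Square brackets raise KeyError when the key is absent
try:
    toy_name2counts["Qwertyuiop"]
    missing_raised = False
except KeyError:
    missing_raised = True

# `in` tests membership; .get() returns a fallback value instead of raising
has_halsey = "Halsey" in toy_name2counts
fallback = toy_name2counts.get("Qwertyuiop", {"F": 0, "M": 0})

print(missing_raised, has_halsey, fallback)
```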

# Step 3: Set a Threshold for "Ambiguous" Names

Now that we have raw counts for each of our names, we can set a threshold within which the names in our NYT Best Seller List might send an ambiguous gender signal, that is, names likely to prompt readers to be uncertain of the author's binary gender. We will begin by setting this threshold at 90%: if 90% or more of the counts for a given name are `M` or `F`, we will consider that a strong gender signal and apply that label; otherwise, we will label it `A` or ambiguous.

Do you think this threshold is right? Should Alex be considered ambiguous? We will record it as `F` given our 90% threshold — but perhaps you believe that threshold is too generous, and it should be 97%?

In [None]:
name2counts['Alex']['F'] / (name2counts['Alex']['M'] + name2counts['Alex']['F'])

Below, we create another dictionary that evaluates the counts of a particular name in the `name2counts` dictionary created above, and assigns a value to each name of `F`, `M`, or `A`.

The `for` loop below uses the `.items()` method to iterate through all the items in the `name2counts` dictionary.

You can alter the thresholds by changing the code below.

In [None]:
name2genders = {}    # This creates a new name2genders dictionary in which we're able to apply a threshold...
for name, counts in name2counts.items():
    
    f_count = counts['F']    # Pulls out the F counts for each name
    m_count = counts['M']    # Pulls out the M counts for each name
    
    if m_count == 0 or f_count/(m_count+f_count) >= 0.9:    # If there are no M counts, or if the F count is 90% or more of the total count, label it as F
        name2genders[name] = 'F'
    elif f_count == 0 or m_count/(f_count+m_count) >= 0.9:    # As above, but reversed for M/F
        name2genders[name] = 'M'
    else:
        name2genders[name] = 'A'    # If the name doesn't meet either threshold, label the name as A
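
If you want to experiment with different thresholds, one option is to wrap the same logic in a function that takes the threshold as a parameter. This is a sketch only: `label_counts` is a hypothetical helper, not part of the notebook, and treating a name with no counts at all as `A` is our assumption.

```python
def label_counts(counts, threshold=0.9):
    """Return 'F', 'M', or 'A' for a dictionary like {'F': 274, 'M': 408}."""
    total = counts["F"] + counts["M"]
    if total == 0:
        return "A"    # assumption: no data at all is treated as ambiguous
    if counts["F"] / total >= threshold:
        return "F"
    if counts["M"] / total >= threshold:
        return "M"
    return "A"

# Halsey's counts as shown earlier: the M share is 408/682, roughly 0.60
halsey_label = label_counts({"F": 274, "M": 408})

# Ngaio's counts as shown earlier: every count is F
ngaio_label = label_counts({"F": 2, "M": 0})

# Made-up counts: a 95% F name passes at 0.9 but not at 0.97
loose = label_counts({"F": 95, "M": 5}, threshold=0.9)
strict = label_counts({"F": 95, "M": 5}, threshold=0.97)

print(halsey_label, ngaio_label, loose, strict)
```

Raising the threshold moves borderline names from `F` or `M` into `A`, which is the trade-off raised in the question above.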

# Step 4: Extract First Names from the Author Column of the NYT Best Seller List Dataset

We now have a dictionary that will allow us to approximate the gender signal for nearly 134,000 first names. In order to apply that to our dataset, we need to isolate the first names of all the authors in our dataset. To do this, we will use our old friend, the `.split()` method, split on spaces (`.split(" ")`), and look at the first item in the returned list.

In [None]:
nyt_df

In [None]:
sample_name = "John Doe"
sample_name.split(" ") # This is a method we know well! Splits a string into a list. 

In [None]:
sample_name.split(" ")[0] # The first item in the list is the first name

In [None]:
sample_name = "A. John Doe"
sample_name.split(" ")
print(sample_name.split(" "))
print(sample_name.split(" ")[0]) # ... or the first "whatever" in the Author field, rather. "A." is not a name but an initial.

In [None]:
sample_name = "Clive Cussler and Boyd Morrison"
sample_name.split(" ")
print(sample_name.split(" "))
print(sample_name.split(" ")[0]) # Our method also can't account for second authors, only those named first
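
If you wanted to capture co-authors in your own Project, one possible starting point is to split on the word "and" first. This is a hypothetical sketch, not something this notebook does: `get_all_first_names` is our own helper, and it assumes co-authors are always joined by `" and "`, which will not hold for every entry.

```python
def get_all_first_names(author_field):
    """Hypothetical helper: the first token of each author joined by ' and '."""
    authors = author_field.split(" and ")
    return [author.split(" ")[0] for author in authors]

two_authors = get_all_first_names("Clive Cussler and Boyd Morrison")
one_author = get_all_first_names("John Doe")

print(two_authors)
print(one_author)
```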

Our task seems like a simple one: use `.split(" ")` to extract all first names, then stick them in a new column of the dataframe. But it's not terribly straightforward to apply a method like `.split(" ")` to all the author name values in our Pandas DataFrame. At least, it isn't straightforward *yet*!

- Pandas will allow us to apply any **function** to any column of the dataset using its `.apply()` method.
- But `s.split(" ")` isn't a **function**; it's a string method.
- So we need to *create a new function* that applies the `s.split(" ")` method, and extracts the first item from the resulting list

Below, we create a function called `get_first_name` that does just what we want to do. Melanie Walsh has [a great overview of functions and how to create or *define* them](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/12-Functions.html).

In [None]:
def get_first_name(name):
    first_name = name.split(" ")[0]
    return first_name
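
As a quick check of what this function does when applied to a whole column, the sketch below runs it over a small hand-made Series (`authors`, built from the sample names used above). The last line shows Pandas's built-in `.str` accessor, which we are not using in this course, giving the same result.

```python
import pandas as pd

def get_first_name(name):
    first_name = name.split(" ")[0]
    return first_name

# A hand-made Series standing in for the 'author' column
authors = pd.Series(["John Doe", "A. John Doe", "Clive Cussler and Boyd Morrison"])

# .apply() calls get_first_name once for every value in the Series
first_names = authors.apply(get_first_name).tolist()

# Equivalent result using the .str accessor (shown for comparison only)
same = authors.str.split(" ").str[0].tolist()

print(first_names)
print(same)
```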

While we're at it, let's also get rid of those ugly upper-case titles, and create another function (`make_text_title_case()`) that applies the Python `s.title()` string method.

In [None]:
def make_text_title_case(text):
    title_case_text = text.title()
    return title_case_text

In [None]:
make_text_title_case("THE GOOD EARTH")

Below, we use the Pandas `.apply(function)` method to apply our newly-defined `make_text_title_case(text)` function to every value in the `'title'` column.

For more on `.apply()`, see [Melanie Walsh's discussion](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/03-Pandas-Basics-Part3.html#applying-functions).

In [None]:
nyt_df['title'].apply(make_text_title_case)

Now let's actually *use* the output above. The line below replaces the previous contents of the `'title'` column with the newly title-cased ones.

In [None]:
nyt_df['title'] = nyt_df['title'].apply(make_text_title_case)
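
The reason the assignment is needed is that `.apply()` returns a new Series and leaves the original column alone. A minimal sketch on a made-up two-row DataFrame (`toy_df`, not the real dataset):

```python
import pandas as pd

def make_text_title_case(text):
    return text.title()

# Toy stand-in for nyt_df (hypothetical rows)
toy_df = pd.DataFrame({"title": ["THE GOOD EARTH", "GONE WITH THE WIND"]})

result = toy_df["title"].apply(make_text_title_case)
before = toy_df["title"].tolist()    # still upper case: .apply() did not modify the column

toy_df["title"] = result             # the assignment is what updates the DataFrame
after = toy_df["title"].tolist()

print(before)
print(after)
```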

In [None]:
nyt_df.head()

Now let's create a new column, `'first_name'`, that contains all the first names extracted by our `get_first_name()` function.

In [None]:
nyt_df['first_name'] = nyt_df['author'].apply(get_first_name)

In [None]:
nyt_df

# Step 5: Store Gender Signal Approximations in a New Column in the DataFrame

We will now create another function, `get_gender_signal()`, that applies the gender label stored in the `name2genders` dictionary, or:
- if a particular name in the NYT Best Seller List is not in the Gender by Name data, apply `U` for "unknown"
- if a particular name is one character long, or one character followed by a period, apply `I` for "initials", which we will later interpret as a name with a masked gender signal

Once we've made this function, we'll apply it to the `'first_name'` column of `nyt_df`, and store the results in a new column, `gender_signal`.

In [None]:
def get_gender_signal(name):
    gender = 'U'
    if name in name2genders:
        gender = name2genders[name]
    if len(name) == 1 or name[1:2] == '.': # This is a separate if statement because, even if an initial happens to be in the Gender by Name dataset, we want to treat it differently ourselves. The slice name[1:2] behaves like name[1] but does not raise an error on an empty string.
        gender = 'I'
    return gender
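
To see how the different cases play out, the sketch below runs a variant of this function on a few names. So that the example can run on its own, it takes the lookup dictionary as a second argument and uses a tiny made-up dictionary (`toy_name2genders`); both are assumptions for illustration, and the entry for `"J."` is invented to show that the initials check wins.

```python
# Made-up lookup table standing in for name2genders
toy_name2genders = {"Ngaio": "F", "Halsey": "A", "J.": "M"}

def get_gender_signal(name, lookup):
    gender = "U"                                 # default: unknown
    if name in lookup:
        gender = lookup[name]                    # use the stored label if we have one
    if len(name) == 1 or name[1:2] == ".":
        gender = "I"                             # initials override any stored label
    return gender

names = ["Ngaio", "Halsey", "Xyzzy", "J.", "J", "J.K."]
signals = [get_gender_signal(n, toy_name2genders) for n in names]

print(signals)
```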

In [None]:
nyt_df['gender_signal'] = nyt_df['first_name'].apply(get_gender_signal)

In [None]:
nyt_df.head(10)

# Step 6: Write the DataFrame with Gender Signal Approximations to a TSV

Finally, let's write all this to a TSV file that we can open in our main lecture notebook, and begin the next steps of our investigation...

For this, we'll use Pandas's `.to_csv()` method and the `sep="\t"` delimiter.

In [None]:
nyt_df.to_csv("nyt_full_gendersignal.tsv", sep="\t", encoding='utf-8', index=False)
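
As a sanity check, a file written this way can be read back with `pd.read_csv()` and the same `sep="\t"`. The sketch below does the round trip in memory with `io.StringIO` rather than a real file, on a made-up one-row DataFrame (`toy_df`, with a hypothetical label), so it does not touch anything on disk.

```python
import io
import pandas as pd

# Toy stand-in for nyt_df (hypothetical row)
toy_df = pd.DataFrame({
    "author": ["John Doe"],
    "first_name": ["John"],
    "gender_signal": ["M"],
})

buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t", index=False)   # index=False leaves out the row numbers
buffer.seek(0)                                 # rewind before reading

reloaded = pd.read_csv(buffer, sep="\t")
columns = list(reloaded.columns)

print(columns)
print(reloaded.to_dict("list"))
```

Because of `index=False`, the reloaded DataFrame has only the three original columns and no extra unnamed index column.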