# Week 8: Analyzing Gender Signals in the NYT Best Seller List

Today we will focus on aggregating data using `.groupby()` and `.sum()`.

Along the way, we will need a few other Pandas methods to help organize the data.

## Gender Signal
We will use the Gender By Name dataset to automate assigning a gender to names in our NYT Best Seller data.
We use the first word in the `author` column to approximate a `gender_signal`, which can take one of five values:
- `F` **Female**: first names given to children of female sex in the Gender By Name dataset 90% or more of the time
- `M` **Male**: first names given to children of male sex in the Gender By Name dataset 90% or more of the time
- `A` **Ambiguous**: first names that do not meet either of the 90% thresholds listed above
- `U` **Unknown**: first names that do not appear in the Gender By Name dataset
- `I` **Initials**: authors whose gender is masked by names given only as initials
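
To make the five categories concrete, here is a minimal sketch of how such a rule could be written. It is not the code that produced our dataset: the `female_share` lookup, its made-up numbers, the `gender_signal()` function, and the rule for spotting initials are all assumptions for illustration.

```python
# Hypothetical lookup: first name -> share of children with that name recorded as female.
# These numbers are invented for illustration; they are NOT from the Gender By Name dataset.
female_share = {"Mary": 0.99, "John": 0.01, "Leslie": 0.65}

def gender_signal(author, threshold=0.90):
    """Return 'F', 'M', 'A', 'U', or 'I' based on the first word of an author string."""
    parts = author.split()
    if not parts:
        return "U"
    first = parts[0]
    # Assumption: one or two capital letters (with or without periods) count as initials.
    letters = first.replace(".", "")
    if letters.isupper() and len(letters) <= 2:
        return "I"
    if first not in female_share:
        return "U"  # name does not appear in the lookup
    share = female_share[first]
    if share >= threshold:
        return "F"
    if 1 - share >= threshold:
        return "M"
    return "A"  # neither 90% threshold is met

authors = ["Mary Doe", "John Doe", "Leslie Doe", "J. K. Doe", "Zadie Doe"]
signals = [gender_signal(a) for a in authors]
print(signals)
```

Notice that every decision in this sketch (the threshold, what counts as initials, using only the first word) shapes the categories we end up analyzing.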

In [None]:
import pandas as pd

nytg_df = pd.read_csv('nyt_full_gender_signal.tsv', sep="\t")
nytg_df

Now, using a technique introduced last class, let's get the value counts for each of these categories in the `gender_signal` column.

In [None]:
nytg_df['gender_signal'].value_counts()

Again, using a technique introduced last class, let's produce a pie chart that demonstrates the relative proportion of each gender signal category in the dataset.

In [None]:
plot = nytg_df['gender_signal'].value_counts().plot(kind="pie", figsize=(6, 6))
print(plot)

Let's look at some of this data. What names fall into the Ambiguous category?

We use a filter (a boolean Series) to extract the rows where `gender_signal` is `"A"`.

In [None]:
gender_filter = nytg_df['gender_signal']=='A'
type(gender_filter)

In [None]:
nytg_df[gender_filter]

Note that the above line of code is equivalent to the below (as discussed in last week's lecture notebook). We just think the above is a bit more legible.

In [None]:
nytg_df[nytg_df['gender_signal']=='A']
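
If it helps to see what the filter actually contains, here is a small sketch on a made-up DataFrame (`toy_df` and its rows are invented for illustration, not taken from our dataset). The filter is just a Series of `True`/`False` values, one per row, and the DataFrame keeps only the rows marked `True`.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_df = pd.DataFrame({
    "author": ["Mary A", "John B", "Leslie C", "Leslie D"],
    "gender_signal": ["F", "M", "A", "A"],
})

# The comparison produces one True/False value per row.
mask = toy_df["gender_signal"] == "A"
print(mask)

# Indexing with the mask keeps only the rows where the mask is True.
ambiguous = toy_df[mask]
print(ambiguous)
```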

Now let's select only the `"first_name"` column, and call the `.unique()` method to see all the unique names that our method has identified as `"A"`.

In [None]:
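# get the data frame with only "A" gender_signals
ambiguous_df = nytg_df[nytg_df['gender_signal']=='A']

# select the first_name column
ambiguous_names = ambiguous_df['first_name']

# call .unique() on the first names column to get an array of the unique names
ambiguous_names.unique()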



And now let's repeat that for the `"U"`, `"I"` — and much larger `"F"` and `"M"` — categories. How well is our gender system doing? What are its blind spots? How could it be improved?

In [None]:
gender_filter = nytg_df['gender_signal']=='I'
sorted(nytg_df[gender_filter]['first_name'].unique())

# Getting the Gender Signal Data into a Useful Form

Let's suppose we've decided our gender signal information is sufficient to proceed with our analysis. The immediate challenge, then, is to get the data into a useful form.

We're interested in knowing how many authors of each gender signal category appear for every year of the dataset.

What data type do we want to get our data into? What columns and rows do we want to see, and how do we want them organized? What will the actual values look like?

## `.groupby()`

Let's start our journey with the `.groupby()` method, which allows us to group the data by particular columns and perform calculations on it. For instance, let's try grouping our `nytg_df` DataFrame by the `gender_signal` column.

In [None]:
nytg_df.groupby("gender_signal")

The above command produces a "GroupBy" object.

In [None]:
type(nytg_df.groupby("gender_signal"))

We can perform a few methods on these GroupBy objects. Let's start with `.count()`.

In [None]:
nytg_df.groupby("gender_signal").count()

In [None]:
# Use size instead of count to get the number of rows
nytg_df.groupby("gender_signal").size()
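
Why prefer `.size()` here? A small sketch on made-up data (the `toy_df` below is invented for illustration) shows the difference: `.count()` counts the non-missing values in each column separately, while `.size()` counts the rows in each group, whether or not they have missing values.

```python
import pandas as pd

# Toy data, invented for illustration; one title is missing.
toy_df = pd.DataFrame({
    "gender_signal": ["F", "F", "M"],
    "title": ["Book 1", None, "Book 3"],
})

# .count() returns a DataFrame: non-missing values per column, per group.
counts = toy_df.groupby("gender_signal").count()
print(counts)

# .size() returns a Series: number of rows per group.
sizes = toy_df.groupby("gender_signal").size()
print(sizes)
```

In the toy data the `F` group has two rows but only one non-missing title, so the two methods disagree.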

If we group by the `"year"` column instead and use `.count()`, what do we get?

In [None]:
nytg_df.groupby("year").count()

In [None]:
nytg_df.groupby(["year", "gender_signal"]).size()

The only issue here is that the gender signal categories are "embedded within" each year — when what we need is just a DataFrame where the years are rows and the columns are the gender categories. 

In Pandas-speak, the gender signal values are "stacked" within the year — and we need to `.unstack()` them!

In [None]:
nytg_df.groupby(['year', 'gender_signal']).size().unstack()

This is *almost* exactly what we need. There is only one problem now: not all years have values for all categories, which will confuse our efforts to work with the data in the next stages. For instance, 2016 has `NaN` ("Not a Number," indicating missing data) for `U`.

Thankfully our friend `.unstack()` will take an argument that tells it what to do with any `NaN` situations. In this case, we want it to replace all those with `0`.

In [None]:
nytg_df.groupby(['year', 'gender_signal']).size().unstack(fill_value=0)
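
Here is the same pattern on a tiny made-up DataFrame (`toy_df` is invented for illustration), so you can see each step with only a handful of rows. As an aside, `pd.crosstab()` is another Pandas function that produces the same table of counts in one call; we include it only as a cross-check.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2016],
    "gender_signal": ["F", "U", "F", "M", "M"],
})

# Step 1: count rows for every (year, gender_signal) pair that occurs.
stacked = toy_df.groupby(["year", "gender_signal"]).size()

# Step 2: move gender_signal from the row index into the columns.
wide_nan = stacked.unstack()           # pairs that never occur become NaN
wide = stacked.unstack(fill_value=0)   # pairs that never occur become 0
print(wide)

# Cross-check: crosstab builds the same year-by-category table of counts.
crosstab = pd.crosstab(toy_df["year"], toy_df["gender_signal"])
```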

Hooray! That's what we need! Let's stick it in a variable.

This is one of those patterns that you want to store away for later use.  

In [None]:
year_counts = nytg_df.groupby(['year', 'gender_signal']).size().unstack(fill_value=0)
year_counts

# Calculate Proportion of Each Category for Each Year


To calculate **proportions**, we will divide the count for each category by the total number of values for that year. First we need the sum of the values across a row.

## `.sum()`


And indeed, the Pandas method we need to calculate the sum of the values in a row is... `.sum()`!

By default, it gives you the sum of *columns* (a "vertical" sum).

In [None]:
year_counts.sum()

We can specify that we want the sum of the values in a *row* (a "horizontal" sum) by setting the `axis` to `1`.

(Yes, this is a weird detail, but it comes from the fact that these operations have a default direction. The default for `.sum()` is to sum down each column, so we need a way to tell it to use a different direction. The designers chose the word `axis` to label the direction.)

In [None]:
year_counts.sum(axis=1)
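
A quick sketch on a made-up table (`toy_counts` is invented for illustration) makes the two directions easier to see:

```python
import pandas as pd

# Toy counts, invented for illustration: rows are years, columns are categories.
toy_counts = pd.DataFrame({"F": [1, 3], "M": [2, 4]}, index=[2015, 2016])

# Default: one sum per column (adds down the rows).
column_sums = toy_counts.sum()
print(column_sums)

# axis=1: one sum per row (adds across the columns).
row_sums = toy_counts.sum(axis=1)
print(row_sums)
```

If the number is hard to remember, `axis="columns"` means the same thing as `axis=1`.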

## Adding a Column to a DataFrame and Filling It with Values

Let's add a new column to our `year_counts` DataFrame that contains these sums. 

The syntax below does the job:
- `year_counts['total']` creates a new column and calls it `"total"`
- `year_counts.sum(axis=1)` stuffs that new column with the values created above

In [None]:
year_counts['total'] = year_counts.sum(axis=1)

In [None]:
year_counts
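
One thing to watch for in a notebook: once a `total` column exists, running a bare `.sum(axis=1)` again would add the old `total` into the new one. The sketch below, on a made-up table (`toy_counts` and the `categories` list are invented for illustration), shows one way to guard against that by naming the columns to add up.

```python
import pandas as pd

# Toy counts, invented for illustration.
toy_counts = pd.DataFrame({"F": [1, 3], "M": [2, 4]}, index=[2015, 2016])

# Name the category columns explicitly so the sum never includes 'total' itself.
categories = ["F", "M"]
toy_counts["total"] = toy_counts[categories].sum(axis=1)

# Running the same line a second time gives the same answer.
toy_counts["total"] = toy_counts[categories].sum(axis=1)
print(toy_counts)
```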

## Calculating Proportions and Adding Them to a New Column

Now let's calculate the proportions we discussed above. 

Once again, we'll create a new column and stuff it full of new values. Here we will calculate the percentage of `F` labels in every year: number of `F` counts divided by the total number of counts that year, multiplied by 100. Pandas knows we want to perform this calculation for every single row.

In [None]:
year_counts['prop-F'] = (year_counts['F']/year_counts['total']) * 100

In [None]:
year_counts

In [None]:
plot = year_counts['prop-F'].plot(
    kind='bar', 
    title='Percentage of authors with female first-name gender signals in each year', 
    figsize=(15,6))
print(plot)

Let's now create proportional columns for all the other gender signal categories.

In [None]:
year_counts['prop-A'] = (year_counts['A']/year_counts['total']) * 100
year_counts['prop-M'] = (year_counts['M']/year_counts['total']) * 100
year_counts['prop-U'] = (year_counts['U']/year_counts['total']) * 100
year_counts['prop-I'] = (year_counts['I']/year_counts['total']) * 100
year_counts
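
Since each of those lines follows the same recipe, a `for` loop can do the repetition for us. This is a sketch on a made-up table (`toy_counts` is invented for illustration, and it has only two categories), not a replacement for the cell above.

```python
import pandas as pd

# Toy counts, invented for illustration.
toy_counts = pd.DataFrame({"F": [1, 3], "M": [3, 1]}, index=[2015, 2016])
categories = ["F", "M"]
toy_counts["total"] = toy_counts[categories].sum(axis=1)

# Same calculation as above, repeated once per category.
for category in categories:
    toy_counts[f"prop-{category}"] = (toy_counts[category] / toy_counts["total"]) * 100

print(toy_counts)
```

A handy sanity check is that the `prop-` columns in each row should add up to 100.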

In [None]:
plot = year_counts['prop-M'].plot(kind='bar', title='Percentage of authors with male first-name gender signals in each year', figsize=(15,6))
print(plot)

We can plot the values of multiple categories side-by-side. Let's compare the percentage of `M` and `F` categories for each year.

In [None]:
plot = year_counts[['prop-F', 'prop-M']].plot(kind='bar', title='Percentage of authors with female vs. male first-name gender signals in each year', figsize=(20,8))
#print(year_counts.columns)

Here's some code that visualizes this same data as pretty line plots.

In [None]:
plot = year_counts[['prop-F', 'prop-M']].plot(figsize=(22,8), style='--', marker='x', title='Percentage of authors in NYT Hardcover Fiction Best Seller List \n with female vs. male first-name gender signals')
plot.set_xticks(year_counts.index);
plot.set_xticklabels(year_counts.index, rotation=90)
print(plot)

# Digging into Our Data

Let's use some techniques we've already learned to investigate some potentially significant areas of our dataset that our analysis is revealing...

In [None]:
year_counts['prop-F'].describe()

In [None]:
nytg_df[(nytg_df['year']==1942)]['author'].value_counts()

In [None]:
year_counts['prop-M'].describe()

In [None]:
nytg_df[nytg_df['year']==1975]['author'].value_counts()
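
`.describe()` reports the minimum and maximum values, but not which years they belong to. One way to find years worth a closer look is `.idxmin()` and `.idxmax()`, which return the index label (here, the year) of the smallest and largest value. The sketch below uses a made-up Series (`toy_props` is invented for illustration).

```python
import pandas as pd

# Toy percentages, invented for illustration; the index holds the years.
toy_props = pd.Series([25.0, 75.0, 50.0], index=[2015, 2016, 2017], name="prop-F")

lowest_year = toy_props.idxmin()    # year with the smallest value
highest_year = toy_props.idxmax()   # year with the largest value
print(lowest_year, highest_year)
```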