# Objectives
* Use string manipulation with Pandas objects
* Create new columns using elementwise arithmetic
* Identify the main difference between the Seaborn and Pandas wrappers for Matplotlib functionality

We will continue to use the Baby Names file to practice common data manipulation techniques using Pandas. In this exercise, we will explore a single question:

<p style="text-align:center"><b>Can we use the last letter of a name to predict the gender of the baby?</b></p>

### Load the Libraries
Run the following code cell to import each of the libraries that we will use in the lab.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**Q0.** Again, read the contents of the file into a DataFrame named `names`, use the Id column as the index, and inspect the first five rows.

To answer the question about last letters and gender, we can break the problem into the following steps:
1. Compute the last letter of each name
2. Group by the last letter and gender while aggregating (sum) the count
3. Plot the total number of baby names ending in each character for each sex

# 1. Working with Strings
Computing the last letter of each name requires string manipulation. With a Python string, this would be done easily by accessing the last element of the string. As noted, Pandas stores non-numeric data as a generic object type. However, the Series object has a `str` attribute so that we can use string methods. Explore the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods) to better understand it and find a good example of its usage.
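As a minimal sketch of how the `str` attribute mirrors plain Python string operations, the example below uses a small made-up Series of city names (not the Baby Names data). The variable names are illustrative only.

```python
import pandas as pd

# Toy data, used only to illustrate the accessor
cities = pd.Series(["Boston", "Denver", "Miami"])

# Plain Python: index or slice a single string directly
last_char = "Boston"[-1]

# Pandas: the same kinds of operations, applied elementwise through .str
first_three = cities.str[:3]   # slicing each string
upper = cities.str.upper()     # calling a string method on each value

print(first_three.tolist())
print(upper.tolist())
```

Indexing and slicing through `.str` use the same bracket syntax as a regular Python string, which is what Q1.1 relies on.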

**Q1.1.** Use the `str` attribute of the Name column to create a new column named `Last`. The values in this column should be the last character of the name.

# 2. Group by Last and Gender While Aggregating Count
**Q2.1.** Use the `groupby` method and an appropriate aggregation function to determine the total number of babies of each gender born with a given name ending in each letter. Assign the result to the variable `letters`. The first four values should look like:
```
Last  Gender
E     F              181
      M              256
a     F         59369213
      M          1992064
```
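For reference, grouping by two columns and summing a third produces a Series with a two-level index, like the output above. The following is a sketch on a made-up sales table; the column names `Region`, `Product`, and `Units` are hypothetical and unrelated to the Baby Names file.

```python
import pandas as pd

# Toy data: several rows share the same (Region, Product) pair
sales = pd.DataFrame({
    "Region": ["East", "East", "East", "West", "West"],
    "Product": ["A", "A", "B", "A", "B"],
    "Units": [3, 4, 5, 1, 2],
})

# Group by both columns, select the column to aggregate, then sum
totals = sales.groupby(["Region", "Product"])["Units"].sum()

print(totals)
```

The grouped columns become the levels of the index, and each value is the sum over the rows in that group.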

After inspecting the results, what may surprise you?

There are a lot of babies born with a given name that ends in the capital letter E. Does this make sense? Let's investigate it.

**Q2.2.** Output all observations where the name ends in a capital letter E.

In this case, Pandas is interpreting the name True as a Boolean value rather than a string (thanks a lot Kardashians). We can fix this behavior by specifying the `dtype` parameter when reading the CSV. For this problem, we will just convert all letters to lowercase.
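As a rough sketch of the `dtype` option mentioned above, the example below reads a tiny in-memory CSV and asks Pandas to treat the Name column as strings. The CSV text is made up for illustration and only assumes column names similar to those used in this lab.

```python
import io
import pandas as pd

# Toy CSV text standing in for the real file
csv_text = "Id,Name,Count\n1,True,5\n2,Mary,7\n"

# dtype maps a column name to the type it should be read as
toy = pd.read_csv(io.StringIO(csv_text), index_col="Id", dtype={"Name": str})

print(toy["Name"].tolist())
```

With the column type stated explicitly, every value in Name is kept as text, including the name True.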

**Q2.3.** Again, use the `str` attribute of the Name column to create a new column named `Last`. The values in this column should be the last character of the name. However, chain the `lower` method to convert all letters to lowercase. *Hint, you need to use the `str` attribute twice.*
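The hint about using `str` twice can be illustrated on a different task. In this sketch with made-up data, the first character of each word is extracted and then converted to uppercase; each step returns a new Series, so the accessor has to be applied again.

```python
import pandas as pd

# Toy data, used only to illustrate chaining
words = pd.Series(["apple", "kiwi", "pear"])

# .str[0] returns a Series, so a second .str is needed before .upper()
initials = words.str[0].str.upper()

print(initials.tolist())
```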

**Q2.4.** Again, apply the same grouping and aggregation as in **Q2.1.** and assign the result to `letters`. Verify there are no entries with a capital letter E.

# 3. Plot the Results
In the following two code cells, I plot this using Seaborn first and then Pandas second. Remember, `letters` is a Series object with a multi-index. Note the differences and explore the methods that I used to learn more. Try to create the bar plot yourself and reference my solution for help.

In [4]:
# # Pandas wrapper to create a horizontal barplot and change the figure size

# letters.plot(kind='barh', figsize=(10, 10))
# plt.show()

By default, Pandas will make the column(s) that were grouped the index. This is not helpful because it is not obvious which bars belong to which gender. Instead, we'd like to plot each gender in a different color (map gender to the `hue` parameter similar to how we mapped the index to the `y` parameter), so we need the last letter and gender to be their own columns.

**Q3.1.** Use the `reset_index` method to do this. This is a method for both Series and DataFrame objects. By default, it returns a new Series or DataFrame so you must assign the result to a variable. In this case, we no longer need this version of `letters` so we will assign the result to `letters`. *Note, it is good practice to output the result to make sure it is doing what you expect before assigning it to a variable.*
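To show what `reset_index` does to a Series with a multi-index, here is a small sketch on made-up data. The names `Region`, `Product`, and `Units` are hypothetical.

```python
import pandas as pd

# Toy Series with a two-level index, similar in shape to a grouped result
index = pd.MultiIndex.from_tuples(
    [("East", "A"), ("East", "B"), ("West", "A")],
    names=["Region", "Product"],
)
totals = pd.Series([7, 5, 1], index=index, name="Units")

# reset_index returns a new object; the original Series is left unchanged
flat = totals.reset_index()

print(flat)
```

Each index level becomes an ordinary column, and the Series values become a column named after the Series.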

**Q3.2.** Now, we are able to map the color of each bar to the Gender column. In Seaborn, this is done with the `hue` parameter. Create the same barplot as above, except that the y-axis will only have letters and each letter will have two different bars for gender. *Note, I set my figure size to (15, 15).* Try to create the bar plot yourself and reference my solution for help.

In [5]:

# plt.figure(figsize=(15, 15))
# sns.barplot(x="Count", y="Last", hue="Gender", data=letters)
# plt.show()

This is helpful because we can now see the disparity in the size of the bars. The larger the difference, the easier it is to predict the gender from that letter. However, it is still hard to decipher for some letters because we are displaying the magnitude and some letters are much more frequent than others. As a result, the very small bars are hard to tell apart.

# 4. Normalizing the Data
A better way to communicate this data is to show the proportions of males and females for each letter. We can use the `letters_pivot` DataFrame to calculate the proportion of males with names ending in a given letter by dividing the male count by the total number of males and females with names ending in that letter.
1. Create a Series named `total` by calculating the total babies for each letter
2. Create a new column named `Percent Males` by dividing `M` by `total`
3. Create a new column named `Percent Females` by dividing `F` by `total`
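The text does not show how `letters_pivot` is built. One plausible way, offered here only as an assumption, is to pivot the long table so that each letter is a row and each gender is a column. The counts below are made up, and `letters_pivot_sketch` is a hypothetical name.

```python
import pandas as pd

# Toy long-format data shaped like letters after reset_index (made-up counts)
long_df = pd.DataFrame({
    "Last": ["a", "a", "n", "n"],
    "Gender": ["F", "M", "F", "M"],
    "Count": [90, 10, 30, 70],
})

# Assumption: one row per letter, one column per gender
letters_pivot_sketch = long_df.pivot(index="Last", columns="Gender", values="Count")

print(letters_pivot_sketch)
```

In this shape, the `F` and `M` columns referenced in the steps above are available directly.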

**Q4.1.** Create a Series named `total`. This can be done by applying the `sum` method to the DataFrame. The method's `axis` parameter specifies the direction of the sum: `"columns"` adds across the columns within each row and returns a Series with one value per row, while `"index"` (the default) adds down the rows within each column and returns a Series with one value per column.
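The two `axis` options are easy to mix up, so here is a quick sketch on a made-up two-row table showing what each one returns.

```python
import pandas as pd

# Toy table: rows are letters, columns are genders (made-up counts)
df = pd.DataFrame({"F": [90, 30], "M": [10, 70]}, index=["a", "n"])

# axis="columns": add across the columns, giving one value per row
row_totals = df.sum(axis="columns")

# axis="index": add down the rows, giving one value per column
col_totals = df.sum(axis="index")

print(row_totals)
print(col_totals)
```

The result of `axis="columns"` is indexed by the row labels, which is the shape needed for a per-letter total.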

**Q4.2.** Next, create the `Percent Males` and `Percent Females` columns in `letters`.
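Creating a column from elementwise arithmetic looks like ordinary division between two Series that share an index. The sketch below uses made-up counts and a hypothetical column name, `Share M`, rather than the lab's own data.

```python
import pandas as pd

# Toy table (made-up counts)
df = pd.DataFrame({"F": [30, 10], "M": [10, 30]}, index=["a", "n"])

# Row totals, aligned with df by index label
total = df["F"] + df["M"]

# Division is applied row by row, matching values by index label
df["Share M"] = df["M"] / total

print(df)
```

Because Pandas aligns on index labels, the division only works as intended when both operands are indexed the same way.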

In [6]:
# reset index to match total


# calculate percent males and females


# merge both series (pct_male and pct_female) together and rename columns


# inspect



**Q4.3.** Finally, we can again create the same bar plot. This time, sort the DataFrame so that the disparity between genders is even more clear. You can use either the Pandas or Seaborn wrapper to Matplotlib, but Pandas is easiest in the current shape of our data. What would you need to do in order to use Seaborn like previously?

In [7]:
# uncomment to see plot
# letters_pct.loc[:,['Percent Males', 'Percent Females']].sort_values('Percent Males').plot(kind='barh',stacked=True, figsize=(10,10))
# plt.show()

In the Pandas wrapper for a bar plot, you can specify a Boolean argument for the `stacked` parameter. Instead of plotting the bars clustered next to one another, it will stack them on top of each other. In this case, that is helpful because you can quickly see which letters transition near 0.5.
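On the question in Q4.3 about using Seaborn, one possible approach (a sketch, not necessarily the intended answer) is to reshape the wide table into long format so that a single column can be mapped to `hue`. The values and the column names `Group` and `Percent` below are made up for illustration.

```python
import pandas as pd

# Toy wide table: one row per letter, one column per percentage
wide = pd.DataFrame(
    {"Percent Males": [0.25, 0.75], "Percent Females": [0.75, 0.25]},
    index=pd.Index(["a", "n"], name="Last"),
)

# Move the index into a column, then stack the two percentage columns
long = wide.reset_index().melt(
    id_vars="Last", var_name="Group", value_name="Percent"
)

print(long)
```

A table in this shape could be passed to `sns.barplot(x="Percent", y="Last", hue="Group", data=long)`, following the earlier Seaborn call, although that would draw clustered bars rather than stacked ones.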