## 1 Preliminaries                          

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gawron/python-for-social-science/blob/master/pandas/pandas_assignment.ipynb)

In [1]:
# The usual preamble
import pandas as pd
from matplotlib import pyplot as plt
# Make the graphs a bit prettier, and bigger
#pd.set_option('display.mpl_style', 'default') 
#pd.set_option('display.line_width', 5000) 
pd.set_option('display.max_columns', 60) 

#figsize(15, 5)

We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from [NYC Open Data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).

In [2]:
import pandas as pd
url = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/names/yob2000.txt'
names2000 = pd.read_csv(url,names=['name','sex','births'])

In [3]:
names2000

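|  | name | sex | births |
|---|---|---|---|
| 0 | Emily | F | 25949 |
| 1 | Hannah | F | 23066 |
| 2 | Madison | F | 19965 |
| 3 | Ashley | F | 17991 |
| 4 | Sarah | F | 17677 |
| ... | ... | ... | ... |
| 29753 | Zeph | M | 5 |
| 29754 | Zeven | M | 5 |
| 29755 | Ziggy | M | 5 |
| 29756 | Zo | M | 5 |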


## 2 Basic Pandas skills (Baby names data)

###  2.1 Selecting columns and rows

In the next cell, write an expression that returns a `pandas` `Series` with just the names in the `name` column.

In the next cell, write an expression that returns the first 25 rows of the `names2000` dataframe.

In the next cell, write an expression that returns the first 25 rows of the `name` column.

###  2.2 Selecting multiple columns

What if we just want to know the gender and the birth counts, but not the name? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want.  Write an expression that returns a data frame with just the `births` and `sex` columns of the `names2000` dataframe.

Now write an expression that returns just the first ten rows of the dataframe you returned in the cell above.
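
As a hint, here is a minimal sketch of list-based column selection on a made-up table. The frame `toy` and its values are invented for illustration and are not part of the assignment data.

```python
import pandas as pd

# Toy data standing in for names2000 (values are made up for illustration)
toy = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cal', 'Dee'],
    'sex': ['F', 'M', 'M', 'F'],
    'births': [40, 30, 20, 10],
})

# A single label returns a Series; a list of labels returns a DataFrame
one_col = toy['births']
two_cols = toy[['births', 'sex']]

# Row slicing works the same way on either result
first_two = two_cols[:2]
```

Note the double brackets: the outer pair is the indexing operation and the inner pair is the Python list of column names.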

### 2.3 Plotting

Write some lines of code that do a barplot of the first fifteen rows of the `names2000` dataframe.  Make sure the `x`-axis shows the name associated with each bar (and not just an arbitrary integer).

If you have trouble with this, don't spend a lot of time on it.  Move on to the later questions,
which are more important.
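
One possible approach, sketched on a made-up three-row table (the frame `toy` is hypothetical), is to tell the pandas plotting method which column supplies the tick labels:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'name': ['Ann', 'Bob', 'Cal'], 'births': [40, 30, 20]})

# x= names the column used for the tick labels; y= names the bar heights
ax = toy.plot.bar(x='name', y='births', legend=False)
ax.set_ylabel('births')
```

Without `x='name'`, pandas would label the bars with the integer index instead.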

## 3 Aggregation (Service requests data)

The following code loads the service requests data used in one of your pandas notebooks, and creates
a subtable consisting of the data for just three agencies.  It then adds a **new** column called `Count`,
which we're going to use for counting complaints.  Since each row represents exactly one complaint,
the value in the `Count` column is always 1.

You can learn more about this data set in the [pandas pivot and merge notebook.](https://github.com/gawron/python-for-social-science/blob/master/pandas/pandas_pivot_and_merge.ipynb)

In [14]:
round(100 * 40.708275)/100

40.71

In [24]:
complaints['Complaint Type'].value_counts(ascending=False)

HEATING                           14200
GENERAL CONSTRUCTION               7471
Street Light Condition             7117
DOF Literature Request             5797
PLUMBING                           5373
                                  ...  
Municipal Parking Facility            1
Tunnel Condition                      1
DHS Income Savings Requirement        1
Stalled Sites                         1
X-Ray Machine/Equipment               1
Name: Complaint Type, Length: 165, dtype: int64

In [15]:
import pandas as pd
fn = '311-service-requests.csv'
base_url = 'https://github.com/gawron/pandas-cookbook/master/data'
path = f'{base_url}/{fn}'
raw_path = path.replace('github.com','raw.githubusercontent.com')
complaints = pd.read_csv(raw_path)



In [None]:
three_agencies = ['DOT', "DOP", 'NYPD']
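
The subtable and `Count` steps described at the start of this section can be sketched as follows. This is an illustration on made-up rows, not the course's own code: the frame `toy`, its values, and the use of `Agency` as the column name are assumptions.

```python
import pandas as pd

# Toy rows standing in for the complaints table (made-up values)
toy = pd.DataFrame({
    'Agency': ['NYPD', 'DOT', 'HPD', 'NYPD', 'DOP'],
    'Complaint Type': ['Noise', 'Pothole', 'HEATING', 'Animal Abuse', 'Noise'],
})
agencies = ['DOT', 'DOP', 'NYPD']

# Keep only rows whose agency is in the list; copy so a column can be added safely
subtable = toy[toy['Agency'].isin(agencies)].copy()

# One row is one complaint, so every row contributes a count of 1
subtable['Count'] = 1
```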

#### Problems

1.  Create a DataFrame whose rows are the three agencies above and whose columns are the complaint types.  Each cell in the DataFrame should contain the total number of complaints of that complaint type for that agency.  For example, in the NYPD row, the Animal Abuse column should have the number 164, meaning that 164 animal abuse complaints were made to NYPD.

2. Create a DataFrame whose rows are the Manhattan zipcodes (look in the 'Borough' column) and whose column is the single complaint type 'GENERAL CONSTRUCTION'.  Each cell in the DataFrame should contain the total number of GENERAL CONSTRUCTION complaints for that zipcode.  For example, zipcode 100040 has 83 GENERAL CONSTRUCTION complaints.  Note there are some inconsistencies in how the data is entered in the `'Incident Zip'`
column, so when you refer to that column, you might want to do:

```
complaints['Incident Zip'].astype(int)
```
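
Both problems have the same general shape: sum a column of ones over two grouping variables. A minimal sketch of that pattern with a pivot table, on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Made-up complaints, one per row, each with a Count of 1
toy = pd.DataFrame({
    'Agency': ['NYPD', 'NYPD', 'DOT', 'DOT', 'DOT'],
    'Complaint Type': ['Noise', 'Noise', 'Pothole', 'Noise', 'Pothole'],
    'Count': [1, 1, 1, 1, 1],
})

# Rows are agencies, columns are complaint types, cells are summed counts
table = toy.pivot_table(index='Agency', columns='Complaint Type',
                        values='Count', aggfunc='sum', fill_value=0)
```

Here `fill_value=0` puts a zero, rather than `NaN`, in cells for combinations that never occur.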

## 4 Baby names

### 4.1

Aggregate the data for all years from the website, as in the next cell.
(This data was first loaded in the
Pandas notebook [bda_pandas_intro.ipynb](https://github.com/gawron/python-for-social-science/blob/master/pandas/bda_pandas_intro.ipynb).)

Note: the next cell takes a while to execute.

In [None]:
import pandas as pd
years = list(range(1880,2011))
pieces = []
columns = ['name','sex','births']

url = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/names/'
for year in years:
    path = f'{url}yob{year:d}.txt'
    frame = pd.read_csv(path,names=columns)
    frame['year'] = year
    pieces.append(frame)    
    
names = pd.concat(pieces, ignore_index=True)

Use matplotlib to plot male and female
births for the years 1946--1964 (the official dates of the **baby boom**).
Also plot male and female name diversity in those years (the number of distinct male
and female names).
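
The two aggregations can be sketched on a made-up table with the same four columns as `names`. The frame `toy` and its values are invented, and this is one possible approach rather than the required one.

```python
import pandas as pd

# Made-up rows with the same columns as the names table
toy = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Ann', 'Bea', 'Bob'],
    'sex': ['F', 'M', 'F', 'F', 'M'],
    'births': [10, 20, 30, 5, 25],
    'year': [1946, 1946, 1947, 1947, 1947],
})

# Restrict to the years of interest
boom = toy[(toy['year'] >= 1946) & (toy['year'] <= 1964)]

# Total births: one row per year, one column per sex
births = boom.pivot_table(index='year', columns='sex',
                          values='births', aggfunc='sum')

# Name diversity: number of distinct names per year and sex
diversity = boom.groupby(['year', 'sex'])['name'].nunique().unstack('sex')
```

Each result has years as its index and one column per sex, so calling `.plot()` on either one draws one line per sex.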

### 4.2

Another plot.  The x-axis is names; the y-axis is frequencies.  Aggregate the data for female names into
decades (10-year increments) as follows: First create a data frame whose index is female names and
whose columns are the decades in the data.  The cells should contain the mean popularity
of the name in the decade.  The way to do this is by creating a 'decade' column that correctly
assigns decades to each row, then creating a pivot table that uses that column.

Note: there are two "80s" decades in the data,
so it might be convenient to fill your column by rounding down to the nearest 10 (for
example, 1888->1880, 1988->1980, ...).
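
Rounding down to the nearest 10 can be done with integer division. A minimal sketch on a made-up `year` column:

```python
import pandas as pd

# Made-up years for illustration
toy = pd.DataFrame({'year': [1880, 1888, 1949, 1988, 1990]})

# Integer-divide by 10 to drop the last digit, then multiply back
toy['decade'] = toy['year'] // 10 * 10
# decade is now 1880, 1880, 1940, 1980, 1990
```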

Create yet another DataFrame containing
a subset of the decades: the 1880s, the 1940s, and the 1990s.
Select a subset of the names as well (criteria to be discussed shortly).

Produce a plot that contains an 1880s line, a 1940s line, and a 1990s line,
showing the frequency of your selected names for each of the three decades.
In other words, if "Mary" is one of your chosen names, the mean frequencies
of "Mary" in the 1880s, the 1940s, and the 1990s should be shown.

How should you select your names? The goal of your plot is to show the changes in
name popularity over time, so
find the 5 most popular names in each of the 3 decades.  That might give you
15 names, or there might be some overlap and you will have fewer than 15 names.
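
The selection step can be sketched on a made-up name-by-decade table. The frame `toy`, its values, and the choice of `top_n = 2` (instead of 5) are assumptions made to keep the example small.

```python
import pandas as pd

# Toy name-by-decade table (made-up mean frequencies)
toy = pd.DataFrame(
    {1880: [90, 50, 5, 1], 1940: [60, 10, 80, 2], 1990: [20, 1, 30, 95]},
    index=['Mary', 'Anna', 'Linda', 'Emily'],
)

top_n = 2  # the assignment asks for 5; 2 keeps the toy example small
chosen = set()
for decade in toy.columns:
    # nlargest returns the top_n rows of this column; keep their names
    chosen.update(toy[decade].nlargest(top_n).index)

# Rows are the chosen names; each column can then be drawn as one line
subset = toy.loc[sorted(chosen)]
```

Using a set means a name that is popular in more than one decade is kept only once, which is why the result can have fewer than `3 * top_n` rows.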

### 4.3

This is the most difficult of the plotting problems.  But it is useful to think this one through, if
you have time.

Find the **set** of all male names and the **set** of all female names for all the years in the data. For each letter find its frequency as a last letter in the set of male names and in the set of female names, using Python Counters (`from collections import Counter`).
Use matplotlib to draw a single plot that shows
the contrast between the last-letter frequencies
for male and female names; x-axis is letters;
y-axis is frequencies.
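
The counting step can be sketched with two small made-up name sets. The sets below are invented, and lowercasing the last letter is an assumption that happens to be harmless here.

```python
from collections import Counter

# Made-up name sets for illustration
male_names = {'John', 'Kevin', 'Ethan', 'Leo'}
female_names = {'Anna', 'Maria', 'Ellen', 'Zoe'}

# Count how often each letter occurs as the last letter of a name
male_last = Counter(name[-1].lower() for name in male_names)
female_last = Counter(name[-1].lower() for name in female_names)

# A Counter returns 0 for missing keys, so every letter gets a value for both sexes
letters = sorted(set(male_last) | set(female_last))
rows = [(letter, male_last[letter], female_last[letter]) for letter in letters]
```

The list `rows` holds one `(letter, male count, female count)` triple per letter, which is a convenient shape for a side-by-side bar plot.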

    

### 4.4 Extra Credit (You can wait until you learn about machine learning to do this)

Train a classifier that distinguishes male names
    from female names.  The features should be the last three
    letters in the names and the first three letters. If a name
    has fewer than six letters, it is okay for a letter to be represented
    both as a first letter and as a last letter.
    (This strategy guarantees that all names have the same length
    representation).  You will have to make a decision about what to do about
    ambiguous names (Lee, Sam, Pat), but don't simply exclude them.
    Note:  There are different interesting ways to deal with this issue, not
    just one answer. You should ask if you are unsure about your
    solution.
    
If a name has fewer than three letters ('Al'), pad it with spaces
    and use the spaces as part of your representation ('Al' => 'Al '), so that the first
    three letters are ['A', 'l', ' '] and the last three letters are
    ['A', 'l', ' '].  Separate your names into training and test
    names.  Extra credit: does it help to make the decade a feature?
    
Turn in your notebook file, showing the code you used to
    complete parts 