## 1 Preliminaries                          

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gawron/python-for-social-science/blob/master/pandas/pandas_assignment.ipynb)

In [None]:
# The usual preamble
import pandas as pd
from matplotlib import pyplot as plt
# Make the graphs a bit prettier, and bigger
#pd.set_option('display.mpl_style', 'default') 
#pd.set_option('display.line_width', 5000) 
pd.set_option('display.max_columns', 60) 

#figsize(15, 5)

We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from [NYC Open Data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).

In [None]:
import pandas as pd
url = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/names/yob2000.txt'
names2000 = pd.read_csv(url,names=['name','sex','births'])

In [None]:
names2000

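| | name | sex | births |
|---|---|---|---|
| 0 | Emily | F | 25949 |
| 1 | Hannah | F | 23066 |
| 2 | Madison | F | 19965 |
| 3 | Ashley | F | 17991 |
| 4 | Sarah | F | 17677 |
| ... | ... | ... | ... |
| 29753 | Zeph | M | 5 |
| 29754 | Zeven | M | 5 |
| 29755 | Ziggy | M | 5 |
| 29756 | Zo | M | 5 |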


## 2 Basic Pandas skills (Baby names data)

### 2.1 Selecting columns and rows

In the next cell, write an expression that returns a `pandas` `Series` with just the names in the `name` column.

In the next cell, write an expression that returns the first 25 rows of the `names2000` dataframe.

In the next cell, write an expression that returns the first 25 rows of the `name` column.

### 2.2 Selecting multiple columns

What if we just want to know the gender and the birth counts, but not the name? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want. Write an expression that returns a dataframe with just the `births` and `sex` columns of the `names2000` dataframe.

Now write an expression that returns just the first ten rows of the dataframe you returned in the cell above.
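
To make the list-indexing idea concrete, here is a minimal sketch on a small made-up frame. The `toy` dataframe and its `city`, `state`, and `population` columns are hypothetical and are not part of the assignment data; the same pattern should carry over to `names2000`.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'city': ['Ames', 'Bend', 'Cary', 'Dover'],
    'state': ['IA', 'OR', 'NC', 'DE'],
    'population': [66000, 99000, 175000, 39000],
})

# Indexing with a single column name returns a Series.
cities = toy['city']

# Indexing with a *list* of column names returns a DataFrame
# whose columns appear in the order given in the list.
two_cols = toy[['population', 'state']]

# Slicing the result selects rows by position.
first_two = two_cols[:2]
print(first_two)
```

Note the double brackets: the outer pair is the indexing operation, and the inner pair is the Python list of column names.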

### 2.3 Plotting

Write some lines of code that do a barplot of the first fifteen rows of the `names2000` dataframe.  Make sure the `x`-axis shows the name associated with each bar (and not just an arbitrary integer).

## 3 Aggregation (Service requests data)

The following code loads the service requests data used in one of your pandas notebooks
and adds a **new** column called `Count`, which we're going to use for counting complaints.
Since each row represents exactly one complaint, the value in the `Count` column is always 1.
A later cell creates a subtable consisting of the data for just three agencies.

You can learn more about this data set in the [pandas pivot and merge notebook.](https://github.com/gawron/python-for-social-science/blob/master/pandas/pandas_pivot_and_merge.ipynb)

In [None]:
import pandas as pd
fn = '311-service-requests.csv'
base_url = 'https://github.com/gawron/pandas-cookbook/master/data'
path = f'{base_url}/{fn}'
raw_path = path.replace('github.com','raw.githubusercontent.com')
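# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
complaints = pd.read_csv(raw_path, on_bad_lines='skip')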
complaints['Count'] = 1


Looking at the `Count` column:

In [None]:
complaints['Count'][:10]

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: Count, dtype: int64

The following code creates a subtable with just the complaints for the following three
agencies: `DOT`, `DOP`, and `NYPD`.

In [None]:
pt00 = complaints[complaints.Agency.isin(['DOT', "DOP", 'NYPD'])]

Your mission, should you decide to accept it, is to create a pivot table whose rows are the three agencies
and whose columns are the complaint types. Each cell in the table thus contains the total number of complaints
of that complaint type for that agency. For example, in the NYPD row, the Animal Abuse column should
have the number 164, meaning that 164 animal abuse complaints were made to NYPD.
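
The reason a column of 1s is useful for counting can be seen in a small sketch on made-up data. The `toy` frame and its `Store` and `Item` columns are hypothetical stand-ins, not the complaints data; `pivot_table` with `aggfunc='sum'` is one way to turn the 1s into totals, and there may be others.

```python
import pandas as pd

# Hypothetical toy data: each row is one sale of one item at one store.
toy = pd.DataFrame({
    'Store': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Item': ['pen', 'ink', 'pen', 'pen', 'ink', 'ink'],
})
toy['Count'] = 1  # every row counts as exactly one event

# Summing the 1s within each (Store, Item) group yields a count per cell.
# fill_value=0 replaces the missing value for combinations that never occur.
table = toy.pivot_table(index='Store', columns='Item',
                        values='Count', aggfunc='sum', fill_value=0)
print(table)
```

Store B sold two pens, so the cell in row `B`, column `pen` holds 2; store C sold no pens, so that cell holds 0.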


## 4 Baby names

### 4.1

Use the babynames data created in the [general pandas intro notebook](https://github.com/gawron/python-for-social-science/blob/master/pandas/general_pandas_intro.ipynb) and matplotlib to plot male and female births for the years covered in the data. Put this in a Jupyter notebook file. In the same notebook, use the babynames data to get a list of male and female names.
    

### 4.2

Aggregate the data for all years from the website (see the examples in the Pandas notebook [general_pandas_intro](https://github.com/gawron/python-for-social-science/blob/master/pandas/general_pandas_intro.ipynb)). Find the **set** of all male names and the **set** of all female names. For each letter, find its frequency as a last letter in male names and in female names, using Python Counters (`from collections import Counter`). Use matplotlib to draw a single plot that shows the contrast between the last-letter frequencies for male and female names; the x-axis is letters and the y-axis is frequencies.
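
As a rough sketch of the counting step only, here is how a `Counter` can tally last letters for two sets of names. The two sets below are made up for illustration, and `last_letter_counts` is a hypothetical helper name; building the real sets and drawing the plot are left to you.

```python
from collections import Counter

# Hypothetical name sets, invented for illustration.
names_a = {'Anna', 'Ella', 'Mary', 'Ruth'}
names_b = {'John', 'Evan', 'Leo', 'Gary'}

def last_letter_counts(names):
    """Count how often each letter occurs as the last letter of a name."""
    return Counter(name[-1].lower() for name in names)

counts_a = last_letter_counts(names_a)
counts_b = last_letter_counts(names_b)

# A Counter returns 0 for missing keys, so both groups can be read off
# over a shared, sorted list of letters (a natural x-axis).
letters = sorted(set(counts_a) | set(counts_b))
rows = [(letter, counts_a[letter], counts_b[letter]) for letter in letters]
print(rows)
```

Because the inputs are sets, each distinct name contributes once, regardless of how many births it accounts for.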

    

### 4.3

Another plot. The x-axis is names; the y-axis is frequencies. Aggregate the data for female names into decades (10-year increments). Produce a plot that contains an 1880s line, a 1940s line, and a 1990s line. There is a display issue here: there are too many names to just write them out on the x-axis. Try to devise an informative way of displaying the data that tells a story about how fashions in names change. Hint: Pick a value for n and consider the n most popular names in a decade. But which decade?
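
One possible way to bin years into decades is integer division, sketched below on a made-up frame. The `toy` data and its `year`, `name`, and `births` columns are assumptions for illustration; your aggregated babynames frame may use different column names.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({
    'year': [1880, 1885, 1889, 1941, 1947],
    'name': ['Mary', 'Mary', 'Anna', 'Mary', 'Linda'],
    'births': [10, 20, 5, 30, 40],
})

# Integer-divide by 10 and multiply back: 1885 -> 1880, 1947 -> 1940.
toy['decade'] = toy['year'] // 10 * 10

# Total births per (decade, name) pair.
by_decade = toy.groupby(['decade', 'name'])['births'].sum()
print(by_decade)
```

The result is a Series indexed by decade and name, so `by_decade.loc[1880]` gives the per-name totals for the 1880s.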

### 4.4 Extra Credit (You can wait until you learn about machine learning to do this)

Train a classifier that distinguishes male names from female names. The features should be the last three letters in the names and the first three letters. If a name has fewer than six letters, it is okay for a letter to be represented both as a first letter and as a last letter. (This strategy guarantees that all names have the same length representation.) You will have to make a decision about what to do about ambiguous names (Lee, Sam, Pat), but don't simply exclude them. Note: There are different interesting ways to deal with this issue, not just one answer. You should ask if you are unsure about your solution.
    
If a name has fewer than three letters ('Al'), pad it with spaces and use the spaces as part of your representation: 'Al' becomes 'Al ', so the first three letters are ['A', 'l', ' '] and the last three letters are also ['A', 'l', ' ']. Separate your names into training and test names. Extra credit: does it help to make the decade a feature?
    
Turn in your notebook file, showing the code you used to
    complete parts 