# Lab 2: Pandas Overview

**This assignment should be completed before Tuesday 1/30 at 1:00AM.**

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing data frames (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)
* Data Aggregation/Grouping dataframes

In this lab, you are going to use several pandas methods and accessors such as `drop()`, `loc[]`, and `groupby()`. You may press `shift+tab` with the cursor on a method's parameters to see the documentation for that method.

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a two-dimensional labeled data structure with columns of potentially different types.

**Method 1:** You can create a data frame by specifying the columns and values as shown below.

In [None]:
fruit_info = pd.DataFrame(
    data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']
          })
fruit_info

**Method 2:** You can also define a dataframe by specifying the rows as shown below.

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

You can obtain the dimensions of a dataframe by using its shape attribute, `dataframe.shape`.

In [None]:
(num_rows, num_columns) = fruit_info.shape
num_rows, num_columns

### Question 1

You can add a column with `dataframe['new column name'] = [data]`. Please add a column called `rank` to the `fruit_info` table which contains a 1, 2, 3, or 4 for each fruit, based on your personal preference ordering.


In [None]:
### BEGIN SOLUTION
fruit_info["rank"] = [2, 1, 4, 3]
### END SOLUTION

In [None]:
fruit_info

In [None]:
assert fruit_info["rank"].dtype == np.dtype('int64')
### BEGIN HIDDEN TESTS
assert len(fruit_info["rank"].dropna()) == 4
### END HIDDEN TESTS

### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `rank` column you created. (Make sure to use the `axis` parameter correctly)

In [None]:
fruit_info_original = ...
### BEGIN SOLUTION
fruit_info_original = fruit_info.drop("rank", axis = 1)
### END SOLUTION

In [None]:
fruit_info_original

In [None]:
assert fruit_info_original.shape[1] == 2
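
The role of the `axis` parameter mentioned in Question 2 can be illustrated with a minimal sketch. The `toy` table below is a hypothetical example and is not part of the lab data; it only shows that `axis=1` drops a column while `axis=0` drops a row by its index label.

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# axis=1 refers to columns: drop the column labeled "b".
no_b = toy.drop("b", axis=1)

# axis=0 refers to rows: drop the row whose index label is 0.
no_row0 = toy.drop(0, axis=0)

print(list(no_b.columns))    # ['a']
print(list(no_row0.index))   # [1, 2]
print(toy.shape)             # (3, 2): the original table is unchanged
```
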

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with a capital letter. Set the `inplace` parameter correctly to change the `fruit_info_original` dataframe. (hint: in Question 2, `drop` creates and returns a new dataframe instead of changing `fruit_info` because `inplace` by default is `False`)

In [None]:
### BEGIN SOLUTION
fruit_info_original.rename(columns = {"color":"Color", "fruit":"Fruit"}, inplace = True)
### END SOLUTION

In [None]:
fruit_info_original

In [None]:
assert fruit_info_original.columns[0] == 'Color'
### BEGIN HIDDEN TESTS
assert fruit_info_original.columns[1] == 'Fruit'
### END HIDDEN TESTS
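
The hint in Question 3 about `inplace` can be sketched on a small table. The `toy` dataframe here is a hypothetical example, not lab data; the sketch contrasts the default behavior (a new dataframe is returned) with `inplace=True` (the dataframe is modified and `None` is returned).

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default (inplace=False): a new dataframe is returned and `toy` is untouched.
renamed = toy.rename(columns={"a": "A"})
print(list(toy.columns))      # ['a', 'b']
print(list(renamed.columns))  # ['A', 'b']

# inplace=True: `toy` itself is modified and the call returns None.
returned = toy.rename(columns={"b": "B"}, inplace=True)
print(returned)               # None
print(list(toy.columns))      # ['a', 'B']
```
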

### Babyname datasets
Now that we have learned the basics, we will work with the babynames dataset. Let's clean and wrangle the following data frames for the remainder of the lab.

First, let's run the following cells to build the dataframe.
They download the data from the web; we will later extract the California records, of which there should be 367931 in total.

### `fetch_and_cache` Helper

The following function downloads and caches data in the `data/` directory and returns the `Path` to the downloaded file.

In [None]:
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the path to the file.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path object representing the file.
    """
    import requests
    from pathlib import Path
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time 
        birth_time = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded:", birth_time)
    return file_path

Below we use `fetch_and_cache` to download the `namesbystate.zip` file.

**This might take a little while! Consider stretching.**

In [None]:
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
namesbystate_path = fetch_and_cache(data_url, 'namesbystate.zip')

The following cell builds the final full `baby_names` DataFrame. 

In [None]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh: 
        return pd.read_csv(fh, header=None, names=field_names)

# List comprehension
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename) 
    if f.filename.endswith('.TXT')
]

baby_names = pd.concat(states).reset_index(drop=True)

In [None]:
baby_names.head()

In [None]:
len(baby_names)

## Slicing Data Frames - selecting rows and columns


### Selection Using Label

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [accessor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage looks like `frame.loc[rowname, colname]`. (Recall that the colon `:` means "everything".) For example, if we want the `color` column of the `ex` data frame, we would use `ex.loc[:, 'color']`.

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the `Name` column and all the columns after it.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` method, which takes on the form `frame['colname']`.
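
As a minimal sketch of the column selection options above, assume a small hypothetical `ex` dataframe (its contents are made up for illustration). It shows that `.loc` and `[]` return the same column, and that a label slice keeps the starting column and everything after it.

```python
import pandas as pd

# Hypothetical `ex` table; the values are invented for this example.
ex = pd.DataFrame({
    "fruit": ["apple", "banana"],
    "color": ["red", "yellow"],
    "price": [1.0, 0.5],
})

color_loc = ex.loc[:, "color"]       # all rows, the `color` column (a Series)
color_brackets = ex["color"]         # shorter equivalent for interactive use
from_color = ex.loc[:, "color":]     # `color` and every column after it

print(color_loc.equals(color_brackets))  # True
print(list(from_color.columns))          # ['color', 'price']
```
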

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` accessor. In this case, the "label" of each row refers to the index (i.e., primary key) of the dataframe.

In [None]:
#Example:
baby_names.loc[2:5, 'Name']

In [None]:
#Example:  Notice the difference between these two methods
baby_names.loc[2:5, ['Name']]

`.loc` actually uses the index labels rather than row positions to perform the selection. The previous example only resembles array slicing because the default index happens to be 0, 1, 2, ... (and, unlike array slicing, a `.loc` slice includes its end label).

We can always use [`.iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to slice the dataframe by row and column position instead.

See the following example:

In [None]:
#Example: We change the index from 0,1,2... to the Name column
df = baby_names[:5].set_index("Name") 
df

We can now lookup rows by name directly:

In [None]:
df.loc[['Mary', 'Anna'], :]

However, if we want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [None]:
#Example: 
#df.loc[2:5,"Year"] You can't do this
df.iloc[1:4,2:3]
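
The difference between label-based and position-based slicing can be sketched as follows. Both tables are hypothetical and used only for illustration; the point is that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as ordinary Python slicing does.

```python
import pandas as pd

# Hypothetical table with a default integer index 0..5.
nums = pd.DataFrame({"x": [10, 11, 12, 13, 14, 15]})
print(len(nums.loc[2:5]))    # 4 rows: labels 2, 3, 4 and 5
print(len(nums.iloc[2:5]))   # 3 rows: positions 2, 3 and 4

# Hypothetical table indexed by name instead of integers.
toy = pd.DataFrame(
    {"Year": [1910, 1911, 1912, 1913]},
    index=["Mary", "Helen", "Dorothy", "Margaret"],
)
by_label = toy.loc["Helen":"Dorothy", "Year"]   # both end labels included
by_position = toy.iloc[1:3, 0]                  # positions 1 and 2, column 0

print(list(by_label))      # [1911, 1912]
print(list(by_position))   # [1911, 1912]
```
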

### Question 4

Selecting multiple columns is easy. You just need to supply a list of column names. Select the `Name` and `Year` columns **in that order** from the `baby_names` table.

In [None]:
name_and_year = ...
### BEGIN SOLUTION
name_and_year = baby_names.loc[:, ['Name', 'Year']]
### END SOLUTION

In [None]:
name_and_year[:5]

In [None]:
assert name_and_year.shape == (5838786, 2)
### BEGIN HIDDEN TESTS
assert name_and_year.loc[0,"Name"] == "Mary"
assert name_and_year.loc[0,"Year"] == 1910
### END HIDDEN TESTS

As you may have noticed above, `.loc` is also a way to reorder the columns of a dataframe.

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point, whether to clear up cases with missing values, cull fishy outliers, or analyze subgroups of your data set. Note that compound expressions have to be grouped with parentheses. Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
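
As a minimal sketch of how the operators in the table combine, consider the hypothetical `toy` table below (the names and counts are invented). Each comparison yields a boolean Series, and each one is wrapped in parentheses before being combined with `&`, `|`, or `~`.

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Year": [2000, 2000, 2001, 2001],
    "Count": [10, 3, 7, 1],
})

# AND: rows from 2000 with a count above 5.
both = toy[(toy["Year"] == 2000) & (toy["Count"] > 5)]

# OR: rows from 2001, or rows with a count above 5.
either = toy[(toy["Year"] == 2001) | (toy["Count"] > 5)]

# NOT: rows that are not from 2000.
not_2000 = toy[~(toy["Year"] == 2000)]

print(list(both["Name"]))      # ['Ann']
print(list(either["Name"]))    # ['Ann', 'Cy', 'Di']
print(list(not_2000["Name"]))  # ['Cy', 'Di']
```
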

In the following cell, we construct the DataFrame containing only names registered in California.

In [None]:
ca = baby_names[baby_names['State'] == "CA"]

### Question 5a
Select the names in year 2000 (across all of `baby_names`) that have counts larger than 3000. What do you notice?

(If you use `p & q` to filter the dataframe, make sure to write `df[(p) & (q)]`.)

In [None]:
result = ...
### BEGIN SOLUTION
result = baby_names[(baby_names["Year"] == 2000) & (baby_names["Count"] > 3000)]
### END SOLUTION

In [None]:
result

In [None]:
assert len(result) == 11
assert result["Count"].sum() == 38988
### BEGIN HIDDEN TESTS
assert result["Count"].iloc[0] == 4339
### END HIDDEN TESTS

## Data Aggregation (Grouping Data Frames)

### Question 6a
To count the number of instances of a value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. Count the number of different names for each Year in `CA` (California).  (You may use the `ca` DataFrame created above.)

**Note:** *We are not computing the number of babies but instead the number of names (rows in the table) for each year.*
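
A minimal sketch of `value_counts()` on a hypothetical `toy` table (not the lab data) shows what the method returns: a Series indexed by the distinct values, holding the number of rows for each, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({"Year": [1910, 1910, 1911, 1910, 1911, 1912]})

counts = toy["Year"].value_counts()

print(counts.loc[1910])     # 3 rows have Year == 1910
print(list(counts.index))   # [1910, 1911, 1912], most frequent first
print(list(counts))         # [3, 2, 1]
```
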

In [None]:
num_of_names_per_year = ...
### BEGIN SOLUTION
num_of_names_per_year = ca["Year"].value_counts()
### END SOLUTION

In [None]:
num_of_names_per_year[:5]

In [None]:
assert num_of_names_per_year[2007] == 7247
assert num_of_names_per_year[:5].sum() == 35603
### BEGIN HIDDEN TESTS
assert num_of_names_per_year[1910] == 363
assert num_of_names_per_year[:15].sum() == 103411
### END HIDDEN TESTS

### Question 6b
Count the number of different names for each gender in `CA`. Does the result help explain the findings in Question 5a?

In [None]:
num_of_names_per_gender = ...
### BEGIN SOLUTION
num_of_names_per_gender = ca["Sex"].value_counts()
### END SOLUTION

In [None]:
num_of_names_per_gender

In [None]:
assert num_of_names_per_gender["F"] > 200000
### BEGIN HIDDEN TESTS
assert num_of_names_per_gender["F"] == 217309
assert num_of_names_per_gender["M"] == 150622
### END HIDDEN TESTS

### Question 7a

A more versatile way to aggregate data is to use the `.groupby()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). Find the sum of `Count` for each `Name` in the `ca` table. You should use `df.groupby("col_name").sum()`, applied to the appropriate column so that your result is a Pandas Series.

**Note:** *In this question we are now computing the number of registered babies with a given name.*
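
The grouping pattern can be sketched on a hypothetical `toy` table (the names and counts are invented). Selecting a single column after `groupby` and before `sum` is what makes the result a Series rather than a DataFrame.

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Bob", "Cy"],
    "Count": [5, 2, 7, 1, 4],
})

# Group rows by name, pick the Count column, then sum within each group.
totals = toy.groupby("Name")["Count"].sum()

print(type(totals).__name__)  # Series
print(totals.loc["Ann"])      # 12
print(list(totals.index))     # ['Ann', 'Bob', 'Cy'], sorted by group key
```
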

In [None]:
count_for_names = ...
### BEGIN SOLUTION
count_for_names = ca.groupby("Name")["Count"].sum()
### END SOLUTION

In [None]:
count_for_names.sort_values(ascending=False)[:5]

In [None]:
assert count_for_names["Michael"] == 428290
assert count_for_names[:100].sum() == 96149
### BEGIN HIDDEN TESTS
assert count_for_names["David"] == 370070
assert count_for_names[:1000].sum()
### END HIDDEN TESTS

### Question 7b

Find the sum of `Count` for each female name after year 1999 (`>1999`) in California.


In [None]:
female_name_count = ...
### BEGIN SOLUTION
female_name_count = ca[(ca["Year"]>1999) & (ca["Sex"] == "F")].groupby("Name")["Count"].sum()
### END SOLUTION

In [None]:
female_name_count.sort_values(ascending=False)[:5]

In [None]:
assert female_name_count["Emily"] == 46277
assert female_name_count[:100].sum() == 45883
### BEGIN HIDDEN TESTS
assert female_name_count["Isabella"] == 42875
assert female_name_count[:10000].sum() == 3718549
### END HIDDEN TESTS

#### You are done! Remember to validate and submit via JupyterHub