# Week 04: pandas

## Objectives

By the end of this tutorial, you should be able to explore a data set and propose possible relationships among variables. Specifically you will:

- understand the structure and properties of a `DataFrame` object
- specify variables for analysis
- filter data according to criteria
- sort data by specific variables
- summarize the domain of a data set
- define new variables computed from other ones
- summarize the values of a variable


## pandas

Using lists, dictionaries, and basic functions of core Python, you can construct data structures and functions for storing and processing data. Because data sets come in varying shapes and sizes, you would need to write custom code for each one you had to analyze. Wanting to automate this process and speed up analysis times, developers at investment firm [AQR Capital Management](https://en.wikipedia.org/wiki/AQR_Capital) refactored their code so it could adapt to most data sets, and thus the [pandas module](http://pandas.pydata.org/index.html) was born. 

The central feature of `pandas` is its `DataFrame` object, which represents a spreadsheet-like data structure. `pandas` provides functions for importing data into a `DataFrame`, and the `DataFrame` itself offers methods for viewing, processing, and analyzing the data.

*R users: The functionality of `pandas` is most similar to the `dplyr` package.*

## Importing a data set

We'll start by making `pandas` available, and then importing a data set into a new `DataFrame`. The `url` points to a GitHub repository containing data for testing `pandas`. The third line calls the `read_csv()` function, which does three things:

- looks up the webpage specified by `url`
- reads this page, which is a csv file, into memory
- creates a new `DataFrame` object containing the imported data

In [141]:
from pandas import *

url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'

# The star import puts read_csv directly in the namespace,
# so it is called without the "pandas." prefix
tips = read_csv(url)
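
The import above needs a network connection. As a minimal offline sketch, `read_csv()` can also read from a file-like object holding CSV text. The names `csv_text` and `sample` are hypothetical, the three rows are the first three records of the tips data, and `import pandas as pd` is a common alias used here in place of the star import. The sketch assumes Python 3.

```python
import io
import pandas as pd

# Three-row excerpt in the same layout as tips.csv
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
"""

# read_csv accepts a file-like object as well as a URL or a file path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # column types inferred during the import
```

Checking `shape` and `dtypes` right after an import is a quick way to confirm that every column arrived and that the numbers were read as numbers rather than text.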

### *Questions*

In [142]:
### Questions

# Q: What does the "*" in the first line mean?
# A: 
#
# Q: What is the name of the new DataFrame object containing the data?
# A: 

## Background on the data set

To understand a data set, we need to know its context. How were the data collected? What are the variables, and what do they represent? 

The [tips data set](https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv) represents [244 observations made by a waiter working at a restaurant for several months](https://books.google.com/books/about/Practical_Data_Analysis.html?id=5hILAAAACAAJ). He recorded information about each table he worked, including the following variables:

- tip in dollars,
- bill in dollars,
- sex of the bill payer,
- whether there were smokers in the party,
- day of the week,
- time of day,
- size of the party.

## Previewing a data set

Once you've imported a data set into a `DataFrame`, take a look at some values to make sure the import was successful. The `head()` and `tail()` methods allow you to "peek" at the start and end of the data set. `pandas` automatically formats the output to reflect the tabular structure.

In most of the examples below, we use a function that creates a new `DataFrame`. In most cases, we simply print it and then forget it, but any of these could be assigned to a new variable if you wanted.

In [143]:
print tips.head(4)  # shows the first 4 rows

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2


In [144]:
print tips.tail(3)  # shows the last 3 rows

     total_bill   tip     sex smoker   day    time  size
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2


## Reading and updating data from a data set

There are two equivalent ways to think of a `DataFrame`: as a *list of dictionaries*, or as a *dictionary of lists*. Either way you must specify both an index (for the row) and a key (for the column). To access any value, tack on `.loc` and enclose the index and key in square brackets [ ].
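
To make the two views concrete, here is a small sketch that builds the same two-row table both ways. The names `by_column` and `by_row` are hypothetical, and `import pandas as pd` is a common alias rather than the star import used in this tutorial.

```python
import pandas as pd

# Dictionary of lists: one key per column
by_column = pd.DataFrame({
    'total_bill': [16.99, 10.34],
    'tip': [1.01, 1.66],
})

# List of dictionaries: one dictionary per row
by_row = pd.DataFrame([
    {'total_bill': 16.99, 'tip': 1.01},
    {'total_bill': 10.34, 'tip': 1.66},
])

# Both constructions describe the same table
print(by_column.equals(by_row))

# Either way, a value is addressed by row index and column key
print(by_row.loc[1, 'total_bill'])
```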

In [145]:
# print out the bill in row 1, should be 10.34
print tips.loc[1, 'total_bill']

10.34


In [146]:
# Use slice notation to get rows 0 thru 4
# (unlike a list slice, a .loc slice includes the end label)
print tips.loc[0:4]

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [147]:
# Use a list of keys to specify multiple columns
print tips.loc[0:4, ['tip', 'day']]

    tip  day
0  1.01  Sun
1  1.66  Sun
2  3.50  Sun
3  3.31  Sun
4  3.61  Sun
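
The same `.loc` notation works on the left side of an assignment, which is how values get updated. The sketch below uses a small hand-built frame named `sample` (a hypothetical name), and the replacement tip values are made up purely for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
})

# Assign to a single cell: row label 0, column 'tip'
sample.loc[0, 'tip'] = 2.00

# Assign to several rows of one column at once (labels 1 and 2, inclusive)
sample.loc[1:2, 'tip'] = [2.50, 4.00]

print(sample)
```

The assignment changes `sample` in place, so it is worth printing the frame afterwards to confirm the update landed where you intended.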


### *Exercises*

In [148]:
# E: find the tip rate (tip as a percentage of bill) for row 3



In [149]:
# E: set the party size of row 0 to 12, then use .head() to check that it worked.


## Selecting columns

When you just want to work with entire columns, you can use regular dictionary notation, using the column names as the keys. If you need multiple columns, specify a *list* of keys.

In [150]:
# select just the tip column, using .head() to limit output

print tips['tip'].head()

0    1.01
1    1.66
2    3.50
3    3.31
4    3.61
Name: tip, dtype: float64


In [151]:
# select two columns

print tips[['total_bill','tip']].head()

   total_bill   tip
0       16.99  1.01
1       10.34  1.66
2       21.01  3.50
3       23.68  3.31
4       24.59  3.61
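
One detail worth noticing in the two outputs above: a single key gives back a one-dimensional `Series`, while a list of keys gives back a `DataFrame`, even when the list holds only one key. A brief sketch, using hypothetical names `sample`, `one_key`, and `key_list`:

```python
import pandas as pd

sample = pd.DataFrame({
    'total_bill': [16.99, 10.34],
    'tip': [1.01, 1.66],
})

one_key = sample['tip']       # single key: a one-dimensional Series
key_list = sample[['tip']]    # list with one key: a one-column DataFrame

print(type(one_key).__name__)
print(type(key_list).__name__)
```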


## Filtering rows

To select a subset of observations, you need to create a filter. A filter is a `Series` of Boolean values, one per row, that indicates which rows to include. You can generate a filter using a logical expression with a column.

In [152]:
# Make a filter that is True if the bill is over $20

f_over20 = tips['total_bill'] > 20
print f_over20.head(10)

0    False
1    False
2     True
3     True
4     True
5     True
6    False
7     True
8    False
9    False
Name: total_bill, dtype: bool


If you use this filter in the brackets, the resulting `DataFrame` will only show rows where the filter was True. In the example below, the only indices that appear were those that had a value of `True` in the filter.

In [153]:
# Use filter to select subset of rows

print tips[f_over20].head(10)

    total_bill   tip     sex smoker  day    time  size
2        21.01  3.50    Male     No  Sun  Dinner     3
3        23.68  3.31    Male     No  Sun  Dinner     2
4        24.59  3.61  Female     No  Sun  Dinner     4
5        25.29  4.71    Male     No  Sun  Dinner     4
7        26.88  3.12    Male     No  Sun  Dinner     4
11       35.26  5.00  Female     No  Sun  Dinner     4
15       21.58  3.92    Male     No  Sun  Dinner     2
19       20.65  3.35    Male     No  Sat  Dinner     3
21       20.29  2.75  Female     No  Sat  Dinner     2
23       39.42  7.58    Male     No  Sat  Dinner     4


One slick feature: you can "chain" sets of brackets together to apply multiple criteria.

In [154]:
# Use filter and column selection to show just the bill and tip columns, 
# for bills over 20

print tips[f_over20][['total_bill','tip']].head(10)

    total_bill   tip
2        21.01  3.50
3        23.68  3.31
4        24.59  3.61
5        25.29  4.71
7        26.88  3.12
11       35.26  5.00
15       21.58  3.92
19       20.65  3.35
21       20.29  2.75
23       39.42  7.58


In [155]:
# Even apply multiple filters
# don't even need .head() because only a few bills met all three criteria

f_female = tips['sex'] == "Female"   # female bill payers
f_smoker = tips['smoker'] == "Yes"   # party had a smoker in it

print tips[f_over20][f_female][f_smoker]   

     total_bill   tip     sex smoker   day    time  size
72        26.86  3.14  Female    Yes   Sat  Dinner     2
73        25.28  5.00  Female    Yes   Sat  Dinner     2
102       44.30  2.50  Female    Yes   Sat  Dinner     3
103       22.42  3.48  Female    Yes   Sat  Dinner     2
186       20.90  3.50  Female    Yes   Sun  Dinner     3
197       43.11  5.00  Female    Yes  Thur   Lunch     4
214       28.17  6.50  Female    Yes   Sat  Dinner     3
219       30.14  3.09  Female    Yes   Sat  Dinner     4
229       22.12  2.88  Female    Yes   Sat  Dinner     2
240       27.18  2.00  Female    Yes   Sat  Dinner     2
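
An alternative to chaining brackets is to combine the filters into a single Boolean `Series` first. The sketch below assumes a small hand-built frame named `sample` (hypothetical) with just the three columns the filters need. Recent `pandas` releases may emit a reindexing warning when filters are chained, which the combined form avoids.

```python
import pandas as pd

sample = pd.DataFrame({
    'total_bill': [26.86, 10.34, 44.30, 21.01],
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'smoker': ['Yes', 'No', 'Yes', 'No'],
})

f_over20 = sample['total_bill'] > 20
f_female = sample['sex'] == 'Female'
f_smoker = sample['smoker'] == 'Yes'

# & means "and", | means "or", ~ means "not"
combined = f_over20 & f_female & f_smoker

print(sample[combined])
```

When writing the comparisons inline, wrap each one in parentheses, as in `(sample['total_bill'] > 20) & (sample['sex'] == 'Female')`, because `&` binds more tightly than `>` and `==`.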


### *Exercises*

In [156]:
# E: Find lunchtime bills paid by male patrons who tipped less than $2


In [157]:
# E: Create two new dataframes, named males and females, that contain 
#    bills only from male and female patrons, respectively.

## Sorting rows

To rearrange the rows in a particular order, use the `sort_values()` method (named `sort()` in `pandas` releases before 0.17), specifying a column name or a list of column names to sort on. The default direction is ascending, which you can reverse with the `ascending` keyword.

In [158]:
# sort by party size, in descending order

print tips.sort_values('size', ascending=False).head()

     total_bill  tip     sex smoker   day    time  size
143       27.05  5.0  Female     No  Thur   Lunch     6
156       48.17  5.0    Male     No   Sun  Dinner     6
125       29.80  4.2  Female     No  Thur   Lunch     6
141       34.30  6.7    Male     No  Thur   Lunch     6
185       20.69  5.0    Male     No   Sun  Dinner     5


In [159]:
# sort by party size, then total_bill, in descending order

print tips.sort_values(['size', 'total_bill'], ascending=[False, False]).head()

     total_bill  tip     sex smoker   day    time  size
156       48.17  5.0    Male     No   Sun  Dinner     6
141       34.30  6.7    Male     No  Thur   Lunch     6
125       29.80  4.2  Female     No  Thur   Lunch     6
143       27.05  5.0  Female     No  Thur   Lunch     6
142       41.19  5.0    Male     No  Thur   Lunch     5


### *Exercises*

In [160]:
# E: sort by total_bill, then by tip, in descending order, print first 10 records


In [161]:
# E: same as above, but only show lunch-time bills
# (filter first, then sort, so the filter lines up with the rows it is applied to)
print tips[tips['time']=="Lunch"].sort_values(['total_bill','tip'], ascending=[False,False]).head(10)

     total_bill   tip     sex smoker   day   time  size
197       43.11  5.00  Female    Yes  Thur  Lunch     4
142       41.19  5.00    Male     No  Thur  Lunch     5
85        34.83  5.17  Female     No  Thur  Lunch     4
141       34.30  6.70    Male     No  Thur  Lunch     6
83        32.68  5.00    Male    Yes  Thur  Lunch     2
125       29.80  4.20  Female     No  Thur  Lunch     6
192       28.44  2.56    Male    Yes  Thur  Lunch     2
77        27.20  4.00    Male     No  Thur  Lunch     4
143       27.05  5.00  Female     No  Thur  Lunch     6
88        24.71  5.85    Male     No  Thur  Lunch     2

## Finding the domain of a variable

The **domain** of a variable is the set of unique values it takes in a given data set. If you want to categorize observations, you need to know what the categories are, but it may be difficult to find all the categories in a large data set just by inspection. Use the `.unique()` method on a column to find the unique values that appear in it. The result is a NumPy `array`, which is similar to a list.



In [162]:
# On what days is the restaurant open?

open_days = tips['day'].unique()
print open_days
print "Open %i days of the week" %len(open_days)

['Sun' 'Sat' 'Thur' 'Fri']
Open 4 days of the week
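
If only the number of distinct values matters, the `nunique()` method returns it directly, without going through `len()`. A minimal sketch on a hand-built `Series` named `days` (a hypothetical name, with made-up repeats):

```python
import pandas as pd

days = pd.Series(['Sun', 'Sat', 'Thur', 'Fri', 'Sat', 'Sun'])

print(days.unique())    # distinct values, in order of first appearance
print(days.nunique())   # how many distinct values there are
```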


### *Exercises*

In [163]:
# E: What party sizes could be observed in this data set?


In [164]:
# E: Among Saturday bills paid by a female patron, what party sizes were observed?


In [165]:
# E: Make a loop that prints out the days of the week that the restaurant serves Dinner

print "Dinner bills paid by men only occurred on the following days:"
print "  - some day"
print "  - another day"

Dinner bills paid by men only occurred on the following days:
  - some day
  - another day


## Summarizing column values 

A `DataFrame` offers several **summary** methods — such as `count()`, `sum()`, `mean()`, `min()`, `max()`, `std()`, `sem()` — which you can apply to a column. A handy method is `describe()`, which returns a `DataFrame` of various statistics (including percentiles) for each numeric column.

In [166]:
# Find the average bill

avg = tips['total_bill'].mean()
sem = tips['total_bill'].sem()
print avg, sem
print "The mean bill was $%3.1f ± %3.1f" %(avg,sem)

19.785942623 0.569918525289
The mean bill was $19.8 ± 0.6


In [167]:
# find the median bill using describe(), which creates a DataFrame
# The "index" of this DataFrame is actually text, like 'count', 'mean', etc.

results = tips.describe()
print results
print "The median bill was $%3.2f" %(results['total_bill']['50%'])

       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
The median bill was $17.80
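
The `50%` row of `describe()` is the median, which is also available directly through the `median()` method. The sketch below checks this on a hand-built `Series` named `bills` (hypothetical) holding the first five bills from the data set.

```python
import pandas as pd

bills = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59], name='total_bill')

summary = bills.describe()   # for a single column, describe() returns a Series
print(summary['50%'])        # the 50th percentile...
print(bills.median())        # ...is the median
```

Calling `median()` is the shorter route when one number is all you need; `describe()` is more convenient when you want several statistics at once.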


### *Exercises*

In [168]:
# E: Compare the average bill among male bill payers and female bill payers


## Defining new variables

Sometimes a data set doesn't contain a variable you want, but you can calculate it with other variables. For example, suppose we wanted to study [factors that affect a patron's tipping behavior](http://www.npr.org/sections/alltechconsidered/2014/03/05/283917108/technology-may-soon-get-you-to-be-a-bigger-tipper). Then the relevant quantity is tip *rate*, rather than the actual size of the tip. Tip rate isn't recorded, but we can calculate it from the tip amount and the total bill. 

Using a `DataFrame`'s `.assign()` method, you can specify both the calculation, and a new key for storing the result, which will appear as a new column in the `DataFrame`. There are many ways to specify the calculation, but for now the most straightforward way is to define a function separately, and give it the appropriate columns during the `assign()` step. (If you know how to define `lambda` functions, they offer certain advantages.)

In [179]:
# Goal: find the most atrocious tippers
# 0. define function to calculate tip rate (tip as percentage of bill)
# 1. assign tip rate function to new column
# 2. sort ascending

def tip_rate(tip, bill):
    return tip/bill*100

tips = tips.assign(rate = tip_rate(tips['tip'], tips['total_bill']))

print tips.sort_values(['rate'], ascending=[True]).head(10)


     total_bill   tip     sex smoker   day    time  size      rate
237       32.83  1.17    Male    Yes   Sat  Dinner     2  3.563814
102       44.30  2.50  Female    Yes   Sat  Dinner     3  5.643341
57        26.41  1.50  Female     No   Sat  Dinner     2  5.679667
0         16.99  1.01  Female     No   Sun  Dinner     2  5.944673
187       30.46  2.00    Male    Yes   Sun  Dinner     5  6.565988
210       30.06  2.00    Male    Yes   Sat  Dinner     3  6.653360
48        28.55  2.05    Male     No   Sun  Dinner     3  7.180385
146       18.64  1.36  Female     No  Thur   Lunch     3  7.296137
240       27.18  2.00  Female    Yes   Sat  Dinner     2  7.358352
184       40.55  3.00    Male    Yes   Sun  Dinner     2  7.398274
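
As a sketch of the `lambda` alternative: when `assign()` is given a function instead of a column of values, it calls that function with the `DataFrame` itself. The frame `sample` and the argument name `df` are hypothetical, and `sort_values()` is the sorting method in current `pandas` releases.

```python
import pandas as pd

sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
})

# The lambda receives the DataFrame that assign() was called on,
# so the calculation does not need to name `sample` again
sample = sample.assign(rate=lambda df: df['tip'] / df['total_bill'] * 100)

print(sample.sort_values('rate').head())
```

One advantage of this form is that it keeps working when `assign()` is called on an intermediate result, such as a filtered frame that was never stored in a variable.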


### *Exercises*

In [180]:
# E: Count how many times a bill payer tipped below 15%

In [181]:
# E: Make a new DataFrame named tips2, and:
#    Add a column, named 'per_person', calculating the average cost per person
#    Sort by per_person cost, descending


In [175]:
# E: To tips2, add another column named 'weekend', 
#    which is True if 'day' is Saturday or Sunday but False otherwise
def is_weekend(day):
    # Scalar version: takes one day label, for use with .map()
    return day in ("Sat", "Sun")

def new_weekend(day):
    # Vectorized version: takes a whole column, returns a Boolean Series
    return (day == "Sun") | (day == "Sat")

# Apply the scalar function to each value in the 'day' column
tips['weekend'] = tips['day'].map(is_weekend)

# Equivalent, using the vectorized function with assign():
# tips = tips.assign(weekend = new_weekend(tips['day']))
print tips.loc[50:100]

     total_bill   tip     sex smoker   day    time  size       rate weekend
50        12.54  2.50    Male     No   Sun  Dinner     2  19.936204    True
51        10.29  2.60  Female     No   Sun  Dinner     2  25.267250    True
52        34.81  5.20  Female     No   Sun  Dinner     4  14.938236    True
53         9.94  1.56    Male     No   Sun  Dinner     2  15.694165    True
54        25.56  4.34    Male     No   Sun  Dinner     4  16.979656    True
55        19.49  3.51    Male     No   Sun  Dinner     2  18.009236    True
56        38.01  3.00    Male    Yes   Sat  Dinner     4   7.892660    True
57        26.41  1.50  Female     No   Sat  Dinner     2   5.679667    True
58        11.24  1.76    Male    Yes   Sat  Dinner     2  15.658363    True
59        48.27  6.73    Male     No   Sat  Dinner     4  13.942407    True
60        20.29  3.21    Male    Yes   Sat  Dinner     2  15.820601    True
61        13.81  2.00    Male    Yes   Sat  Dinner     2  14.482259    True
62        11