# Day 2: Intro to Pandas

## Pandas
* high-level library to support data manipulation and analysis
* DataFrame is the primary object we’ll be dealing with
* similar to R’s dataframe
* maps onto tabular structure
* good for time series and econometric data

## Shift from "pythonic" to "pandorable"

* less looping over elements
* lots of built-in functionality
* a "paradigm shift"

# Data structures

We're all familiar with lists:

In [2]:
names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

Now let's say that we want to divide each of those scores by two and assign the results to another variable. There are lots of ways to do this, so go ahead and write one (without importing any additional Python packages) and assign the results to a variable called ```half```:

In [3]:
# insert your code here

If you followed the above instructions, the following cell block should print a list of floats that looks like ```[40.0, 47.5, 42.5, 35.0]```:


In [14]:
half

[40.0, 47.5, 42.5, 35.0]

We can put data into an array structure that allows us to apply more powerful
functions.  The data structure that we're interested in is called an ```ndarray``` and is from the ```numpy``` package:

In [4]:
import numpy as np
ascores = np.array(scores)

In [5]:
ascores 

array([80, 95, 85, 70])

In [6]:
ahalf = ascores / 2
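
To see what that last cell produced, here is a minimal sketch that repeats the steps above in one self-contained block. It only assumes that `numpy` is installed; the variable names are the ones used in the cells above.

```python
import numpy as np

scores = [80, 95, 85, 70]
ascores = np.array(scores)

# Arithmetic on an ndarray is applied to every element at once,
# so no explicit loop is needed.
ahalf = ascores / 2

print(ahalf.tolist())  # [40.0, 47.5, 42.5, 35.0]
print(ahalf.dtype)     # float64
```

Note that `/` turned an array of ints into an array of floats, just as dividing two Python ints does.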

NumPy arrays are powerful, but they have some limitations. For example, every element of an array must have the same type (e.g. int), and elements can only be looked up by position, not by a label. pandas provides two additional data structures that are built on NumPy ndarrays.
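
Before moving on, here is a small sketch of the single-type limitation. The values are made up for illustration; the point is only that NumPy silently converts everything to one common type.

```python
import numpy as np

# Mixing ints and a float: every element is upcast to float.
mixed_numbers = np.array([80, 95.5, 85])
print(mixed_numbers.dtype)      # float64

# Mixing a number and a string: every element becomes a string.
mixed_text = np.array([80, "Ingrid"])
print(mixed_text.dtype.kind)    # U (a unicode string type)
print(repr(mixed_text[0]))      # the 80 is now stored as text
```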

The first is the Series.  Let's create a simple pandas Series and examine it:

In [7]:
import pandas as pd

In [8]:
from pandas import Series

In [9]:
sscores = Series(scores,name='scores')

In [10]:
sscores

0    80
1    95
2    85
3    70
Name: scores, dtype: int64

So you see a couple of useful things: an index (0 to 3) and a data type (dtype), which in this case is int64.
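
Each of those pieces can also be inspected on its own. The following is a minimal sketch using standard Series attributes; it rebuilds the same Series so that it runs by itself.

```python
import pandas as pd

scores = [80, 95, 85, 70]
sscores = pd.Series(scores, name='scores')

print(sscores.index)       # RangeIndex(start=0, stop=4, step=1)
print(sscores.dtype)       # int64
print(sscores.name)        # scores
print(sscores.to_numpy())  # the underlying values as a NumPy array
```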

**A Series is a one-dimensional ndarray with axis labels**

In [11]:
data = dict(zip(names,scores))

In [12]:
data

{'Charlotte': 80, 'Eric': 70, 'Ian': 85, 'Ingrid': 95}

In [13]:
sData = Series(data=data,name='score')

In [14]:
sData

Charlotte    80
Eric         70
Ian          85
Ingrid       95
Name: score, dtype: int64
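
With names as the index, values can be looked up by label, and arithmetic keeps the labels attached. Here is a short sketch of that idea. One caveat: on recent Python and pandas versions a Series built from a dict keeps the dict's insertion order, so your output may not be in the alphabetical order shown above.

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]
sData = pd.Series(dict(zip(names, scores)), name='score')

print(sData['Ian'])         # 85, looked up by label rather than position
print(sData / 2)            # element-wise division, labels carried along
print(sData.sort_index())   # rows ordered alphabetically by label
```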

So Series are a bit friendlier than numpy arrays, but they're still only one-dimensional.  Keep in mind that our basic data abstraction is a table, which can
be thought of as a two-dimensional array.  Let's go ahead and create a simple DataFrame with just one column:

In [15]:
from pandas import DataFrame


In [16]:
sData.to_frame()

|  | score |
|---|---|
| Charlotte | 80 |
| Eric | 70 |
| Ian | 85 |
| Ingrid | 95 |
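
To make the difference between the two objects concrete, here is a small sketch. It builds the Series directly from a dict so that it is self-contained; the Series name becomes the column name of the resulting DataFrame.

```python
import pandas as pd

sData = pd.Series({'Charlotte': 80, 'Eric': 70, 'Ian': 85, 'Ingrid': 95},
                  name='score')
df = sData.to_frame()

print(type(sData).__name__)     # Series
print(type(df).__name__)        # DataFrame
print(list(df.columns))         # ['score']
print(df.loc['Ian', 'score'])   # 85, addressed by row label and column name
```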


Let's return to the code we ran last time and walk through it just to make sure we understand it.

In [18]:
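years = range(1880, 2015)  # 1880 through 2014 (the end of a range is exclusive)
pieces = []
for year in years:
    # one file per year, e.g. data/names/yob1880.csv
    path = 'data/names/yob%d.csv' % year
    frame = pd.read_csv(path)
    frame['year'] = year  # record which year each row came from
    pieces.append(frame)
# stack the yearly frames into one DataFrame with a fresh 0..n-1 index
df_names = pd.concat(pieces, ignore_index=True)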
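
If you don't have the data files handy, the same pattern can be tried on frames built in memory. This is only a sketch: `toy_files`, the single-letter names, and the counts are made-up stand-ins for the yearly files, not real data.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files; all values are invented.
toy_files = {
    1880: pd.DataFrame({'name': ['A', 'B'], 'births': [10, 20]}),
    1881: pd.DataFrame({'name': ['A', 'C'], 'births': [30, 40]}),
}

pieces = []
for year, frame in toy_files.items():
    frame['year'] = year   # a scalar is broadcast to every row
    pieces.append(frame)

# Without ignore_index=True the result would keep each piece's own
# labels (0, 1, 0, 1); with it the rows are renumbered 0..3.
df_toy = pd.concat(pieces, ignore_index=True)
print(df_toy)
```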

## Today's focus: filtering, slicing and dicing

We're going to use some data from the [Nutrition Facts for McDonald's Menu](https://www.kaggle.com/mcdonalds/nutrition-facts) dataset on [Kaggle](https://www.kaggle.com).

Go ahead and browse the file using JupyterLab.

Now let's load the file using ```read_csv```.

In [19]:
menu = pd.read_csv('data/menu.csv')

## Exercise 1 (1 point): How many rows and columns are in this dataset?  Include one cell block to determine the number and one markdown block that presents the answer as a complete sentence (i.e. "The McDonald's nutrition data set contains X rows and Y columns").

In [None]:
# insert your code here

In [None]:
Insert your answer here

## Extracting columns 

Getting column names is easy:

In [22]:
menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

Similarly, extracting a specific column is also easy:

In [30]:
menu['Category']

0               Breakfast
1               Breakfast
2               Breakfast
3               Breakfast
4               Breakfast
5               Breakfast
6               Breakfast
7               Breakfast
8               Breakfast
9               Breakfast
10              Breakfast
11              Breakfast
12              Breakfast
13              Breakfast
14              Breakfast
15              Breakfast
16              Breakfast
17              Breakfast
18              Breakfast
19              Breakfast
20              Breakfast
21              Breakfast
22              Breakfast
23              Breakfast
24              Breakfast
25              Breakfast
26              Breakfast
27              Breakfast
28              Breakfast
29              Breakfast
              ...        
230          Coffee & Tea
231          Coffee & Tea
232    Smoothies & Shakes
233    Smoothies & Shakes
234    Smoothies & Shakes
235    Smoothies & Shakes
236    Smoothies & Shakes
237    Smoot

And multiple columns can also be extracted by passing a list of column names:

In [31]:
menu[['Item','Calories']]

|  | Item | Calories |
|---|---|---|
| 0 | Egg McMuffin | 300 |
| 1 | Egg White Delight | 250 |
| 2 | Sausage McMuffin | 370 |
| 3 | Sausage McMuffin with Egg | 450 |
| 4 | Sausage McMuffin with Egg Whites | 400 |
| 5 | Steak & Egg McMuffin | 430 |
| 6 | Bacon, Egg & Cheese Biscuit (Regular Biscuit) | 460 |
| 7 | Bacon, Egg & Cheese Biscuit (Large Biscuit) | 520 |
| 8 | Bacon, Egg & Cheese Biscuit with Egg Whites (R... | 410 |
| 9 | Bacon, Egg & Cheese Biscuit with Egg Whites (L... | 470 |
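
One detail that is easy to miss: a single column name gives back a Series, while a list of names gives back a DataFrame, even if the list has only one entry. Here is a minimal sketch on a three-row frame; `toy_menu` is a hypothetical name, and its values are copied from the first rows of the output above.

```python
import pandas as pd

toy_menu = pd.DataFrame({
    'Item': ['Egg McMuffin', 'Egg White Delight', 'Sausage McMuffin'],
    'Calories': [300, 250, 370],
})

one_col = toy_menu['Calories']       # a string selects a Series
sub_frame = toy_menu[['Calories']]   # a list selects a DataFrame

print(type(one_col).__name__)     # Series
print(type(sub_frame).__name__)   # DataFrame
print(list(toy_menu[['Item', 'Calories']].columns))   # ['Item', 'Calories']
```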


## Extracting rows

In [34]:
menu.iloc[0]

Category                              Breakfast
Item                               Egg McMuffin
Serving Size                     4.8 oz (136 g)
Calories                                    300
Calories from Fat                           120
Total Fat                                    13
Total Fat (% Daily Value)                    20
Saturated Fat                                 5
Saturated Fat (% Daily Value)                25
Trans Fat                                     0
Cholesterol                                 260
Cholesterol (% Daily Value)                  87
Sodium                                      750
Sodium (% Daily Value)                       31
Carbohydrates                                31
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      17
Vitamin A (% Daily Value)               

You'll notice that the index is just a series of integers starting with 0.  Sometimes that's fine.  
Other times we want to use a more meaningful column as the index.  Note that the values in the index do not need to be unique.

In [44]:
menu_i = menu.set_index('Item')

In [45]:
menu_i.loc['Egg White Delight']

Category                              Breakfast
Serving Size                     4.8 oz (135 g)
Calories                                    250
Calories from Fat                            70
Total Fat                                     8
Total Fat (% Daily Value)                    12
Saturated Fat                                 3
Saturated Fat (% Daily Value)                15
Trans Fat                                     0
Cholesterol                                  25
Cholesterol (% Daily Value)                   8
Sodium                                      770
Sodium (% Daily Value)                       32
Carbohydrates                                30
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      18
Vitamin A (% Daily Value)                     6
Vitamin C (% Daily Value)               

In [46]:
menu_i.iloc[0]

Category                              Breakfast
Serving Size                     4.8 oz (136 g)
Calories                                    300
Calories from Fat                           120
Total Fat                                    13
Total Fat (% Daily Value)                    20
Saturated Fat                                 5
Saturated Fat (% Daily Value)                25
Trans Fat                                     0
Cholesterol                                 260
Cholesterol (% Daily Value)                  87
Sodium                                      750
Sodium (% Daily Value)                       31
Carbohydrates                                31
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      17
Vitamin A (% Daily Value)                    10
Vitamin C (% Daily Value)               
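
The three cells above differ only in how the row is addressed: `.loc` looks rows up by index label, while `.iloc` looks them up by integer position. The following sketch shows both on a three-row frame, plus what happens with a non-unique index. `toy_menu`, `by_item` and `by_category` are hypothetical names, and the values are copied from the outputs above.

```python
import pandas as pd

toy_menu = pd.DataFrame({
    'Category': ['Breakfast', 'Breakfast', 'Breakfast'],
    'Item': ['Egg McMuffin', 'Egg White Delight', 'Sausage McMuffin'],
    'Calories': [300, 250, 370],
})

by_item = toy_menu.set_index('Item')
print(by_item.loc['Egg White Delight', 'Calories'])   # 250, found by label
print(by_item.iloc[0]['Calories'])                    # 300, found by position

# A non-unique index is allowed; .loc then returns every matching row.
by_category = toy_menu.set_index('Category')
print(len(by_category.loc['Breakfast']))              # 3
```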

We can also extract a row and a slice of its columns:

In [47]:
menu_i.iloc[0,0:2]

Category             Breakfast
Serving Size    4.8 oz (136 g)
Name: Egg McMuffin, dtype: object

Or we can extract a slice of rows along with all of their columns:


In [48]:
menu_i.iloc[1:3,:]

| Item | Category | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | Cholesterol | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Egg White Delight | Breakfast | 4.8 oz (135 g) | 250 | 70 | 8.0 | 12 | 3.0 | 15 | 0.0 | 25 | ... | 30 | 10 | 4 | 17 | 3 | 18 | 6 | 0 | 25 | 8 |
| Sausage McMuffin | Breakfast | 3.9 oz (111 g) | 370 | 200 | 23.0 | 35 | 8.0 | 42 | 0.0 | 45 | ... | 29 | 10 | 4 | 17 | 2 | 14 | 8 | 0 | 25 | 10 |
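
Positional slices in `.iloc` follow the usual Python rule that the end of the slice is excluded, whereas label slices in `.loc` include both endpoints. Here is a small sketch of both, assuming a three-row stand-in for `menu_i` (the name `toy` is hypothetical; the values come from the outputs above).

```python
import pandas as pd

toy = pd.DataFrame(
    {'Category': ['Breakfast'] * 3,
     'Serving Size': ['4.8 oz (136 g)', '4.8 oz (135 g)', '3.9 oz (111 g)'],
     'Calories': [300, 250, 370]},
    index=pd.Index(['Egg McMuffin', 'Egg White Delight', 'Sausage McMuffin'],
                   name='Item'),
)

first_row_two_cols = toy.iloc[0, 0:2]   # one row, columns 0 and 1 -> Series
rows_1_and_2 = toy.iloc[1:3, :]         # rows 1 and 2, all columns -> DataFrame

print(list(first_row_two_cols.index))   # ['Category', 'Serving Size']
print(list(rows_1_and_2.index))         # ['Egg White Delight', 'Sausage McMuffin']

# Label slices with .loc include the end label.
print(len(toy.loc['Egg McMuffin':'Egg White Delight']))   # 2
```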


## Sorting
Sorting is supported using ```sort_index``` and ```sort_values```:


In [51]:
menu_sorted_by_cals = menu.sort_values('Calories',ascending=True)
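
Two things are worth noticing about sorting: the rows keep their original index labels, and the original DataFrame is left untouched. The sketch below shows both, and uses `sort_index` to put the rows back in their original order. `toy_menu` is a hypothetical three-row stand-in with values taken from the outputs above.

```python
import pandas as pd

toy_menu = pd.DataFrame({
    'Item': ['Egg McMuffin', 'Egg White Delight', 'Sausage McMuffin'],
    'Calories': [300, 250, 370],
})

by_cals = toy_menu.sort_values('Calories', ascending=True)
print(list(by_cals.index))          # [1, 0, 2], labels travel with their rows

restored = by_cals.sort_index()
print(list(restored.index))         # [0, 1, 2]

print(list(toy_menu['Calories']))   # [300, 250, 370], original is unchanged
```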

## Exercise 2 (2 points): Display the four menu items that have the most Saturated Fat (the absolute amount, not the % Daily Value).  

In [81]:
# insert your code here

## Filtering

More often than extracting a particular row, we want to extract one or more rows that match
some criteria.  For example, to find all the menu items that contain trans fat, we could use:


In [82]:
menu_trans_fats = menu[menu['Trans Fat'] > 0.0]

We're going to spend time in class explaining what just happened there.

In [83]:
menu['Trans Fat']

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      1.0
6      0.0
7      0.0
8      0.0
9      0.0
10     0.0
11     0.0
12     0.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     1.0
19     0.0
20     0.0
21     0.0
22     0.0
23     0.0
24     0.5
25     0.5
26     1.5
27     0.0
28     0.0
29     0.0
      ... 
230    1.0
231    1.5
232    0.0
233    0.0
234    0.0
235    0.0
236    0.0
237    0.0
238    0.0
239    0.0
240    0.0
241    1.0
242    1.0
243    1.0
244    1.0
245    1.0
246    1.0
247    1.0
248    1.0
249    1.0
250    1.0
251    1.0
252    0.5
253    1.0
254    0.0
255    0.5
256    1.0
257    0.0
258    1.0
259    0.0
Name: Trans Fat, Length: 260, dtype: float64

In [69]:
menu['Trans Fat'] > 0.0

0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18      True
19     False
20     False
21     False
22     False
23     False
24      True
25      True
26      True
27     False
28     False
29     False
       ...  
230     True
231     True
232    False
233    False
234    False
235    False
236    False
237    False
238    False
239    False
240    False
241     True
242     True
243     True
244     True
245     True
246     True
247     True
248     True
249     True
250     True
251     True
252     True
253     True
254    False
255     True
256     True
257    False
258     True
259    False
Name: Trans Fat, Length: 260, dtype: bool

In [73]:
menu[menu['Trans Fat'] > 0.0]

|  | Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Breakfast | Steak & Egg McMuffin | 6.5 oz (185 g) | 430 | 210 | 23.0 | 36 | 9.0 | 46 | 1.0 | ... | 31 | 10 | 4 | 18 | 3 | 26 | 15 | 2 | 30 | 20 |
| 18 | Breakfast | Steak & Egg Biscuit (Regular Biscuit) | 7.1 oz (201 g) | 540 | 290 | 32.0 | 49 | 16.0 | 78 | 1.0 | ... | 38 | 13 | 2 | 8 | 3 | 25 | 10 | 2 | 20 | 25 |
| 24 | Breakfast | Bacon, Egg & Cheese Bagel | 6.9 oz (197 g) | 620 | 280 | 31.0 | 48 | 11.0 | 56 | 0.5 | ... | 57 | 19 | 3 | 11 | 7 | 30 | 20 | 15 | 20 | 20 |
| 25 | Breakfast | Bacon, Egg & Cheese Bagel with Egg Whites | 7.1 oz (201 g) | 570 | 230 | 25.0 | 39 | 9.0 | 45 | 0.5 | ... | 55 | 18 | 3 | 12 | 8 | 30 | 10 | 15 | 20 | 15 |
| 26 | Breakfast | Steak, Egg & Cheese Bagel | 8.5 oz (241 g) | 670 | 310 | 35.0 | 53 | 13.0 | 63 | 1.5 | ... | 56 | 19 | 3 | 12 | 7 | 33 | 20 | 4 | 25 | 25 |
| 42 | Beef & Pork | Big Mac | 7.4 oz (211 g) | 530 | 240 | 27.0 | 42 | 10.0 | 48 | 1.0 | ... | 47 | 16 | 3 | 13 | 9 | 24 | 6 | 2 | 25 | 25 |
| 43 | Beef & Pork | Quarter Pounder with Cheese | 7.1 oz (202 g) | 520 | 240 | 26.0 | 41 | 12.0 | 61 | 1.5 | ... | 41 | 14 | 3 | 11 | 10 | 30 | 10 | 2 | 30 | 25 |
| 44 | Beef & Pork | Quarter Pounder with Bacon & Cheese | 8 oz (227 g) | 600 | 260 | 29.0 | 45 | 13.0 | 63 | 1.5 | ... | 48 | 16 | 3 | 12 | 12 | 37 | 6 | 15 | 25 | 30 |
| 45 | Beef & Pork | Quarter Pounder with Bacon Habanero Ranch | 8.3 oz (235 g) | 610 | 280 | 31.0 | 48 | 13.0 | 64 | 1.5 | ... | 46 | 15 | 3 | 14 | 10 | 37 | 8 | 20 | 25 | 30 |
| 46 | Beef & Pork | Quarter Pounder Deluxe | 8.6 oz (244 g) | 540 | 250 | 27.0 | 42 | 11.0 | 54 | 1.5 | ... | 45 | 15 | 3 | 13 | 9 | 29 | 10 | 8 | 25 | 30 |
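
The cells above break the filter into its two steps: the comparison builds a Series of True/False values (a "mask"), and indexing the DataFrame with that mask keeps only the rows where it is True. Here is the same idea as a self-contained sketch. `toy_menu` and `mask` are hypothetical names; the three rows are menu rows 3 to 5 with values copied from the outputs above.

```python
import pandas as pd

toy_menu = pd.DataFrame(
    {'Item': ['Sausage McMuffin with Egg',
              'Sausage McMuffin with Egg Whites',
              'Steak & Egg McMuffin'],
     'Trans Fat': [0.0, 0.0, 1.0]},
    index=[3, 4, 5],
)

mask = toy_menu['Trans Fat'] > 0.0   # step 1: one True/False per row
print(mask.tolist())                 # [False, False, True]
print(mask.sum())                    # 1, because True counts as 1

filtered = toy_menu[mask]            # step 2: keep rows where the mask is True
print(filtered['Item'].tolist())     # ['Steak & Egg McMuffin']
print(list(filtered.index))          # [5], the original label is preserved
```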


In [74]:
menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

## Exercise 3 (3 points): List the 3 breakfast items that have the most Dietary Fiber.

In [84]:
# insert your code here

## Exercise 4 (4 points): Show up to three of the best choices for someone who is following the "Atkins Diet" (Google it).  Justify your choices in a markdown block that follows your code.

In [85]:
# insert your code here

List and justify your choices.