# SI370 Day 2: (Re-) Introduction to Pandas

## Learning Objectives

* Pandas refresher (or introduction)
* explain how pandas operations differ from "traditional" Python
* be able to load a CSV file into a Pandas DataFrame
* explain how to extract columns from a DataFrame
* sort a DataFrame
* assign a column as the index of a DataFrame
* filter a DataFrame according to some criteria
* explain how boolean masks work in filtering DataFrames

IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [1]:
MY_UNIQNAME = '?'

In [2]:
import pandas as pd

We're going to use some data from the [Nutrition Facts for McDonald's Menu](https://www.kaggle.com/mcdonalds/nutrition-facts) dataset on [Kaggle](https://www.kaggle.com).

Now let's load the file using ```read_csv```.

In [3]:
menu = pd.read_csv('https://raw.githubusercontent.com/umsi-data-science/si370/fa2019/data/menu.csv')
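
To see what ```read_csv``` gives back without relying on a network connection, here is a minimal sketch that parses a tiny CSV from an in-memory string. The name `toy` and its three rows are made up for illustration; they are not values from the menu data.

```python
import io

import pandas as pd

# Hypothetical toy CSV text standing in for a file path or URL.
csv_text = """Item,Calories,Trans Fat
Toy Burger,500,1.0
Toy Salad,150,0.0
Toy Shake,600,0.5
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head())   # first rows of the resulting DataFrame
print(toy.dtypes)   # the type pandas inferred for each column
```

The first line of the CSV becomes the column names, and pandas adds a default integer index because no index column was specified.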

### Exercise 2 
(1 point): How many rows and columns are in this dataset?  Include one cell block to determine the numbers and one markdown block that presents the answer as a complete sentence (e.g. "The McDonald's nutrition data set contains X rows and Y columns").

In [4]:
# insert your code here
menu

,Category,Item,Serving Size,Calories,Calories from Fat,Total Fat,Total Fat (% Daily Value),Saturated Fat,Saturated Fat (% Daily Value),Trans Fat,Cholesterol,Cholesterol (% Daily Value),Sodium,Sodium (% Daily Value),Carbohydrates,Carbohydrates (% Daily Value),Dietary Fiber,Dietary Fiber (% Daily Value),Sugars,Protein,Vitamin A (% Daily Value),Vitamin C (% Daily Value),Calcium (% Daily Value),Iron (% Daily Value)
0,Breakfast,Egg McMuffin,4.8 oz (136 g),300,120,13.0,20,5.0,25,0.0,260,87,750,31,31,10,4,17,3,17,10,0,25,15
1,Breakfast,Egg White Delight,4.8 oz (135 g),250,70,8.0,12,3.0,15,0.0,25,8,770,32,30,10,4,17,3,18,6,0,25,8
2,Breakfast,Sausage McMuffin,3.9 oz (111 g),370,200,23.0,35,8.0,42,0.0,45,15,780,33,29,10,4,17,2,14,8,0,25,10
3,Breakfast,Sausage McMuffin with Egg,5.7 oz (161 g),450,250,28.0,43,10.0,52,0.0,285,95,860,36,30,10,4,17,2,21,15,0,30,15
4,Breakfast,Sausage McMuffin with Egg Whites,5.7 oz (161 g),400,210,23.0,35,8.0,42,0.0,50,16,880,37,30,10,4,17,2,21,6,0,25,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255,Smoothies & Shakes,McFlurry with Oreo Cookies (Small),10.1 oz (285 g),510,150,17.0,26,9.0,44,0.5,45,14,280,12,80,27,1,4,64,12,15,0,40,8
256,Smoothies & Shakes,McFlurry with Oreo Cookies (Medium),13.4 oz (381 g),690,200,23.0,35,12.0,58,1.0,55,19,380,16,106,35,1,5,85,15,20,0,50,10
257,Smoothies & Shakes,McFlurry with Oreo Cookies (Snack),6.7 oz (190 g),340,100,11.0,17,6.0,29,0.0,30,9,190,8,53,18,1,2,43,8,10,0,25,6
258,Smoothies & Shakes,McFlurry with Reese's Peanut Butter Cups (Medium),14.2 oz (403 g),810,290,32.0,50,15.0,76,1.0,60,20,400,17,114,38,2,9,103,21,20,0,60,6


In [5]:
menu.shape

(260, 24)

## Extracting columns 

Getting column names is easy:

In [6]:
menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

Similarly, extracting a specific column is also easy:

In [7]:
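menu['Category']  # bracket notation: works for any column name
menu.Category     # attribute notation: equivalent here; only the last expression in a cell is displayed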

0               Breakfast
1               Breakfast
2               Breakfast
3               Breakfast
4               Breakfast
              ...        
255    Smoothies & Shakes
256    Smoothies & Shakes
257    Smoothies & Shakes
258    Smoothies & Shakes
259    Smoothies & Shakes
Name: Category, Length: 260, dtype: object

Multiple columns can be extracted by passing a list of column names:

In [8]:
menu[['Item','Calories']]

,Item,Calories
0,Egg McMuffin,300
1,Egg White Delight,250
2,Sausage McMuffin,370
3,Sausage McMuffin with Egg,450
4,Sausage McMuffin with Egg Whites,400
...,...,...
255,McFlurry with Oreo Cookies (Small),510
256,McFlurry with Oreo Cookies (Medium),690
257,McFlurry with Oreo Cookies (Snack),340
258,McFlurry with Reese's Peanut Butter Cups (Medium),810
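
The difference between single and double brackets is easy to miss, so here is a hedged sketch on made-up data (the name `toy` and its values are illustrative assumptions): a single column name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Made-up toy data; not taken from the menu file.
toy = pd.DataFrame(
    {
        "Item": ["Toy Burger", "Toy Salad", "Toy Shake"],
        "Calories": [500, 150, 600],
        "Trans Fat": [1.0, 0.0, 0.5],
    }
)

one_column = toy["Calories"]             # a single name returns a Series
same_column = toy.Calories               # attribute access returns the same Series
two_columns = toy[["Item", "Calories"]]  # a list of names returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)

# Attribute access only works when the column name is a valid Python
# identifier, so a name with a space needs brackets.
trans_fat = toy["Trans Fat"]
```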


## Extracting rows

In [9]:
menu.iloc[0]

Category                              Breakfast
Item                               Egg McMuffin
Serving Size                     4.8 oz (136 g)
Calories                                    300
Calories from Fat                           120
Total Fat                                    13
Total Fat (% Daily Value)                    20
Saturated Fat                                 5
Saturated Fat (% Daily Value)                25
Trans Fat                                     0
Cholesterol                                 260
Cholesterol (% Daily Value)                  87
Sodium                                      750
Sodium (% Daily Value)                       31
Carbohydrates                                31
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      17
Vitamin A (% Daily Value)               

In [10]:
menu

,Category,Item,Serving Size,Calories,Calories from Fat,Total Fat,Total Fat (% Daily Value),Saturated Fat,Saturated Fat (% Daily Value),Trans Fat,Cholesterol,Cholesterol (% Daily Value),Sodium,Sodium (% Daily Value),Carbohydrates,Carbohydrates (% Daily Value),Dietary Fiber,Dietary Fiber (% Daily Value),Sugars,Protein,Vitamin A (% Daily Value),Vitamin C (% Daily Value),Calcium (% Daily Value),Iron (% Daily Value)
0,Breakfast,Egg McMuffin,4.8 oz (136 g),300,120,13.0,20,5.0,25,0.0,260,87,750,31,31,10,4,17,3,17,10,0,25,15
1,Breakfast,Egg White Delight,4.8 oz (135 g),250,70,8.0,12,3.0,15,0.0,25,8,770,32,30,10,4,17,3,18,6,0,25,8
2,Breakfast,Sausage McMuffin,3.9 oz (111 g),370,200,23.0,35,8.0,42,0.0,45,15,780,33,29,10,4,17,2,14,8,0,25,10
3,Breakfast,Sausage McMuffin with Egg,5.7 oz (161 g),450,250,28.0,43,10.0,52,0.0,285,95,860,36,30,10,4,17,2,21,15,0,30,15
4,Breakfast,Sausage McMuffin with Egg Whites,5.7 oz (161 g),400,210,23.0,35,8.0,42,0.0,50,16,880,37,30,10,4,17,2,21,6,0,25,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255,Smoothies & Shakes,McFlurry with Oreo Cookies (Small),10.1 oz (285 g),510,150,17.0,26,9.0,44,0.5,45,14,280,12,80,27,1,4,64,12,15,0,40,8
256,Smoothies & Shakes,McFlurry with Oreo Cookies (Medium),13.4 oz (381 g),690,200,23.0,35,12.0,58,1.0,55,19,380,16,106,35,1,5,85,15,20,0,50,10
257,Smoothies & Shakes,McFlurry with Oreo Cookies (Snack),6.7 oz (190 g),340,100,11.0,17,6.0,29,0.0,30,9,190,8,53,18,1,2,43,8,10,0,25,6
258,Smoothies & Shakes,McFlurry with Reese's Peanut Butter Cups (Medium),14.2 oz (403 g),810,290,32.0,50,15.0,76,1.0,60,20,400,17,114,38,2,9,103,21,20,0,60,6


You'll notice that the index column is just a series of integers starting with 0.  Sometimes that's fine.
Other times we want to assign a more useful column as the index.  Note that the values in the index do not need to be unique.

In [11]:
menu_i = menu.set_index('Item')
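
Two details of ```set_index``` are worth seeing on a small case: it returns a new DataFrame rather than changing the original, and the new index may contain repeated labels. The sketch below uses made-up data (the names `toy` and `by_category` are illustrative assumptions).

```python
import pandas as pd

# Made-up toy data; not taken from the menu file.
toy = pd.DataFrame(
    {
        "Category": ["Breakfast", "Breakfast", "Desserts"],
        "Item": ["Toy Muffin", "Toy Bagel", "Toy Cone"],
        "Calories": [300, 400, 200],
    }
)

# set_index returns a new DataFrame; toy keeps its integer index.
by_category = toy.set_index("Category")

print(by_category.index.is_unique)   # the labels repeat, so this is not unique
print(by_category.loc["Breakfast"])  # a repeated label selects every matching row
print(toy.index.tolist())            # the original frame is unchanged
```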

In [12]:
menu_i

Item,Category,Serving Size,Calories,Calories from Fat,Total Fat,Total Fat (% Daily Value),Saturated Fat,Saturated Fat (% Daily Value),Trans Fat,Cholesterol,Cholesterol (% Daily Value),Sodium,Sodium (% Daily Value),Carbohydrates,Carbohydrates (% Daily Value),Dietary Fiber,Dietary Fiber (% Daily Value),Sugars,Protein,Vitamin A (% Daily Value),Vitamin C (% Daily Value),Calcium (% Daily Value),Iron (% Daily Value)
Egg McMuffin,Breakfast,4.8 oz (136 g),300,120,13.0,20,5.0,25,0.0,260,87,750,31,31,10,4,17,3,17,10,0,25,15
Egg White Delight,Breakfast,4.8 oz (135 g),250,70,8.0,12,3.0,15,0.0,25,8,770,32,30,10,4,17,3,18,6,0,25,8
Sausage McMuffin,Breakfast,3.9 oz (111 g),370,200,23.0,35,8.0,42,0.0,45,15,780,33,29,10,4,17,2,14,8,0,25,10
Sausage McMuffin with Egg,Breakfast,5.7 oz (161 g),450,250,28.0,43,10.0,52,0.0,285,95,860,36,30,10,4,17,2,21,15,0,30,15
Sausage McMuffin with Egg Whites,Breakfast,5.7 oz (161 g),400,210,23.0,35,8.0,42,0.0,50,16,880,37,30,10,4,17,2,21,6,0,25,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
McFlurry with Oreo Cookies (Small),Smoothies & Shakes,10.1 oz (285 g),510,150,17.0,26,9.0,44,0.5,45,14,280,12,80,27,1,4,64,12,15,0,40,8
McFlurry with Oreo Cookies (Medium),Smoothies & Shakes,13.4 oz (381 g),690,200,23.0,35,12.0,58,1.0,55,19,380,16,106,35,1,5,85,15,20,0,50,10
McFlurry with Oreo Cookies (Snack),Smoothies & Shakes,6.7 oz (190 g),340,100,11.0,17,6.0,29,0.0,30,9,190,8,53,18,1,2,43,8,10,0,25,6
McFlurry with Reese's Peanut Butter Cups (Medium),Smoothies & Shakes,14.2 oz (403 g),810,290,32.0,50,15.0,76,1.0,60,20,400,17,114,38,2,9,103,21,20,0,60,6


In [15]:
try:
    menu_i.loc[0]  # intentional error: .loc looks up labels, and 0 is not an Item label
except Exception as e:
    print('Caught intentional error:')
    print(e)

Caught intentional error:
cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0] of <class 'int'>
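
The error above comes from the difference between ```loc``` (look up by index label) and ```iloc``` (look up by integer position). Here is a minimal sketch on made-up data; the name `toy` and its values are assumptions for illustration. The exact exception type raised by a missing label has varied across pandas versions, so the sketch catches both common ones.

```python
import pandas as pd

# Made-up toy data with item names as the index.
toy = pd.DataFrame(
    {"Calories": [300, 400, 200], "Sugars": [3, 5, 24]},
    index=["Toy Muffin", "Toy Bagel", "Toy Cone"],
)

first_by_position = toy.iloc[0]        # position 0, whatever its label is
first_by_label = toy.loc["Toy Muffin"]  # the row whose label is "Toy Muffin"

print(first_by_position.equals(first_by_label))

try:
    toy.loc[0]  # 0 is a position, not a label in this index
except (KeyError, TypeError) as err:
    print("Label lookup failed:", err)
```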


In [None]:
menu_i.iloc[0]

We can also extract a row and a slice of its columns:

In [None]:
menu_i.iloc[0,0:2]

Or we can extract a slice of rows along with all of their columns:


In [None]:
menu_i.iloc[1:3,:]
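
Position slices in ```iloc``` follow normal Python rules (the end position is excluded), while label slices in ```loc``` include the end label. A short sketch on made-up data (the name `toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Made-up toy data with item names as the index.
toy = pd.DataFrame(
    {
        "Category": ["Breakfast", "Breakfast", "Desserts"],
        "Calories": [300, 400, 200],
        "Sugars": [3, 5, 24],
    },
    index=["Toy Muffin", "Toy Bagel", "Toy Cone"],
)

# One row, columns at positions 0 and 1: the result is a Series.
one_row = toy.iloc[0, 0:2]

# Rows at positions 1 and 2, all columns: the result is a DataFrame.
two_rows = toy.iloc[1:3, :]

# Label slices with loc include both endpoints.
by_label = toy.loc["Toy Muffin":"Toy Bagel", "Calories"]

print(one_row)
print(two_rows)
print(by_label)
```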

## Sorting
Sorting is supported through ```sort_index``` and ```sort_values```:


In [None]:
menu_sorted_by_cals = menu.sort_values('Calories', ascending=True)
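
Both sorting methods return a new DataFrame and leave the original alone. The sketch below, on made-up data (the name `toy` and its values are assumptions), shows ascending and descending ```sort_values``` and how ```sort_index``` puts the rows back in index order.

```python
import pandas as pd

# Made-up toy data; not taken from the menu file.
toy = pd.DataFrame(
    {
        "Item": ["Toy Muffin", "Toy Bagel", "Toy Cone"],
        "Calories": [300, 400, 200],
    }
)

low_to_high = toy.sort_values("Calories")                   # ascending is the default
high_to_low = toy.sort_values("Calories", ascending=False)  # largest first

# Each row keeps its original index label, so sort_index undoes the sort.
restored = high_to_low.sort_index()

print(low_to_high)
print(high_to_low)
print(restored)
```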

### Exercise 3 
(2 points): Display the four menu items that have the most Saturated Fat (the absolute amount, not the % Daily Value).

In [None]:
# insert your code here
menu.

## Filtering

More often than extracting a particular row, we want to extract one or more rows that match
some criteria.  For example, to find all the menu items that contain Trans Fats, we could use:


In [None]:
menu['Trans Fat'] > 0

In [None]:
menu_trans_fats = menu[menu['Trans Fat'] > 0.0]

We're going to spend time in class explaining what just happened there.

In [None]:
menu['Trans Fat']

In [None]:
menu['Trans Fat'] > 0.0

In [None]:
menu[menu['Trans Fat'] > 0.0]
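
The three cells above can be read as two steps: build a boolean mask, then use it to pick rows. Here is a hedged sketch of the same idea on made-up data (the name `toy` and its values are illustrative assumptions), including the loop a "traditional" Python program would need and how to combine two conditions.

```python
import pandas as pd

# Made-up toy data; not taken from the menu file.
toy = pd.DataFrame(
    {
        "Item": ["Toy Burger", "Toy Salad", "Toy Shake"],
        "Calories": [500, 150, 600],
        "Trans Fat": [1.0, 0.0, 0.5],
    }
)

# Step 1: the comparison runs on the whole column at once and returns
# a boolean Series (the "mask") with one True/False per row.
mask = toy["Trans Fat"] > 0

# The "traditional" Python version loops over the values one by one.
loop_mask = [value > 0 for value in toy["Trans Fat"]]

# Step 2: indexing with the mask keeps only the rows marked True.
with_trans_fat = toy[mask]

# Masks combine with & (and), | (or), and ~ (not).
# Each comparison needs its own parentheses.
combined = toy[(toy["Trans Fat"] > 0) & (toy["Calories"] < 550)]

print(mask.tolist())
print(with_trans_fat)
print(combined)
```

The mask is matched to rows by index label, which is why the filtered result keeps the original index values instead of being renumbered.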

In [None]:
menu.columns

### Exercise 4 
(2 points): List the top 3 breakfast items that have the most Dietary Fiber.

In [None]:
# insert your code here
menu[menu['Category'] == 'Breakfast'].sort_values('Dietary Fiber', ascending=False).head(3)

### Exercise 5 
(3 points): Show up to three of the best choices for someone who is following the "Atkins Diet" (Google it).  Justify your choices in a markdown block that follows your code.

In [None]:
# insert your code here

List and justify your choices.

END OF NOTEBOOK

Remember to submit your work via Canvas.