# Day 2: Intro to Pandas

## Learning Objectives

* Pandas refresher (or introduction)
* explain how pandas operations differ from "traditional" python
* be able to load a CSV file into a Pandas DataFrame
* explain how to extract columns from a DataFrame
* sort a DataFrame
* assign a column as the index of a DataFrame
* filter a DataFrame according to some criteria
* explain how boolean masks work in filtering DataFrames

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [18]:
MY_UNIQNAME = '?'

## <font color="magenta">Exercise 1 (2 points): Consider Leek's Figure 2.1:</font>

![EDA](resources/leek2.1.png)

Do you agree with this layout of data analyses?  Can you think of ways to improve it?  Should Exploratory Data Analysis be constrained to where it is in the diagram?  Use a whiteboard to draw a better diagram.  Consider taking a picture of the whiteboard and inserting it in this notebook.

Insert your answer here.

## Pandas
* high-level library to support data manipulation and analysis
* DataFrame is the primary object we’ll be dealing with
* similar to R’s dataframe
* maps onto tabular structure
* good for time series and econometric data

## Shift from "pythonic" to "pandorable"

* less looping over elements
* lots of built-in functionality
* a "paradigm shift"
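To make that shift concrete, here is a small sketch that does the same calculation both ways. The temperature values and variable names are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical temperatures in Celsius (made-up values for illustration)
temps_c = [20.0, 25.0, 30.0]

# "Pythonic": visit each element explicitly
temps_f_loop = [c * 9 / 5 + 32 for c in temps_c]

# "Pandorable": apply the arithmetic to the whole Series at once
temps_f_series = pd.Series(temps_c) * 9 / 5 + 32

print(temps_f_loop)
print(temps_f_series.tolist())
```

Both versions produce the same numbers; the second one simply leaves the looping to pandas.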

# Data structures

We're all familiar with lists:

In [2]:
names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

Now let's say we want to divide each of those scores by two and assign the results to another variable. There are lots of ways to do this. Write one of them (without importing any additional Python packages) and assign the results to a variable called ```half```:

In [3]:
# insert your code here

If you followed the above instructions, the following cell block should print
a list of floats that looks like ```[40.0, 47.5, 42.5, 35.0]```


In [14]:
half

[40.0, 47.5, 42.5, 35.0]

We can put data into an array structure that allows us to apply more powerful
functions.  The data structure that we're interested in is called an ```ndarray``` and is from the ```numpy``` package:

In [4]:
import numpy as np
ascores = np.array(scores)

In [5]:
ascores 

array([80, 95, 85, 70])

In [6]:
ahalf = ascores / 2

NumPy arrays are powerful, but they have some limitations.  For example, every element of an array must have the same data type (e.g. int).  pandas provides two additional
data structures that are built on NumPy ndarrays.
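The "one type of data" limitation is easy to see for yourself. The sketch below (the variable names are just for illustration) shows that when you mix types, NumPy quietly converts everything to a single common type.

```python
import numpy as np

ints = np.array([80, 95, 85, 70])
mixed_numbers = np.array([80, 95.5])    # ints and floats -> everything becomes a float
mixed_types = np.array([80, "Ingrid"])  # numbers and strings -> everything becomes a string

print(ints.dtype)              # an integer dtype (the exact width can depend on your platform)
print(mixed_numbers.dtype)     # float64
print(mixed_types.dtype.kind)  # 'U', meaning unicode strings
print(mixed_types)
```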

The first is the Series.  Let's create a simple pandas Series and examine it:

In [7]:
import pandas as pd

In [8]:
from pandas import Series

In [9]:
sscores = Series(scores,name='scores')

In [10]:
sscores

0    80
1    95
2    85
3    70
Name: scores, dtype: int64

So you see a couple of useful things: an index (0 to 3) and a data type (dtype), which in this case is an int64.

**A Series is a one-dimensional ndarray with axis labels**
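Here is a short sketch that pulls those pieces apart, reusing the ```scores``` list from above. It shows the index, the dtype, and the underlying array, and that arithmetic still works on every element at once.

```python
import pandas as pd

scores = [80, 95, 85, 70]
sscores = pd.Series(scores, name='scores')

print(sscores.index)       # RangeIndex(start=0, stop=4, step=1)
print(sscores.dtype)       # int64
print(sscores.to_numpy())  # the underlying values as a numpy array
print(sscores / 2)         # arithmetic applies to every element, like an ndarray
```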

In [11]:
data = dict(zip(names,scores))

In [12]:
data

{'Charlotte': 80, 'Eric': 70, 'Ian': 85, 'Ingrid': 95}

In [13]:
sData = Series(data=data,name='score')

In [14]:
sData

Charlotte    80
Eric         70
Ian          85
Ingrid       95
Name: score, dtype: int64

So Series are a bit friendlier than numpy arrays, but they're still only one-dimensional.  Keep in mind that our basic data abstraction is a table, which can
be thought of as a two-dimensional array.  Let's go ahead and create a simple DataFrame with just one column:

In [15]:
from pandas import DataFrame


In [16]:
sData.to_frame()

| | score |
|---|---|
| Charlotte | 80 |
| Eric | 70 |
| Ian | 85 |
| Ingrid | 95 |
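A DataFrame usually has more than one column. As a minimal sketch, you can also build one directly from a dictionary of lists; the column names ```name```, ```score``` and ```half``` are chosen here only for illustration.

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

# Each dictionary key becomes a column name
df_scores = pd.DataFrame({'name': names, 'score': scores})

# 'half' is a hypothetical extra column, computed from an existing one
df_scores['half'] = df_scores['score'] / 2

print(df_scores)
```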


Let's return to the code we ran last time and walk through it just to make sure we understand it.

In [18]:
years = range(1880, 2015)
pieces = []
for year in years:
    path = 'data/names/yob%d.csv'%year
    frame = pd.read_csv(path)
    frame['year'] = year
    pieces.append(frame)
df_names = pd.concat(pieces, ignore_index=True)
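If you don't have the names files handy, the same read, tag, collect and concatenate pattern can be tried on tiny in-memory data. This is only a sketch: the CSV text, the column names and the ```fake_files``` dictionary below are invented stand-ins, not the real baby-names files.

```python
import io
import pandas as pd

# Two tiny made-up CSV "files", keyed by year (not the real data)
fake_files = {
    1880: "name,sex,births\nMary,F,10\nJohn,M,12\n",
    1881: "name,sex,births\nAnna,F,8\n",
}

pieces = []
for year, text in fake_files.items():
    frame = pd.read_csv(io.StringIO(text))  # one DataFrame per "file"
    frame['year'] = year                    # remember which file each row came from
    pieces.append(frame)

# Stack the pieces vertically; ignore_index=True renumbers the rows from 0
df_small = pd.concat(pieces, ignore_index=True)
print(df_small)
```

Without ```ignore_index=True```, each piece would keep its own row numbers, so the combined index would contain repeats.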

## Today's focus: filtering, slicing and dicing

We're going to use some data from the [Nutrition Facts for McDonald's Menu](https://www.kaggle.com/mcdonalds/nutrition-facts) dataset on [Kaggle](https://www.kaggle.com).

Go ahead and browse the file using JupyterLab.

Now let's load the file using ```read_csv```.

In [19]:
menu = pd.read_csv('data/menu.csv')

## <font color="magenta"> Exercise 2 (1 point): How many rows and columns are in this dataset?  Include one cell block to determine the number and one markdown block that presents the answer as a complete sentence (i.e. "The McDonald's nutrition data set contains X rows and Y columns"). </font>

In [None]:
# insert your code here

In [None]:
Insert your answer here

## Extracting columns 

Getting column names is easy:

In [22]:
menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

Similarly, extracting a specific column is also easy:

In [None]:
menu['Category']

Multiple columns can be extracted by passing a list of column names:

In [None]:
menu[['Item','Calories']]
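One thing to watch for is what type you get back. The sketch below uses a tiny made-up menu (hypothetical values, not the Kaggle data) to show that single brackets give you a Series, while a list of names gives you a DataFrame, even if the list has only one name in it.

```python
import pandas as pd

# A tiny made-up menu (hypothetical values, not the Kaggle data)
toy_menu = pd.DataFrame({
    'Item': ['Burger', 'Fries', 'Salad'],
    'Calories': [500, 300, 150],
    'Protein': [25, 4, 6],
})

one_column = toy_menu['Calories']             # single brackets -> Series
two_columns = toy_menu[['Item', 'Calories']]  # list of names -> DataFrame
also_a_frame = toy_menu[['Calories']]         # a one-item list still gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(also_a_frame).__name__)
```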

## Extracting rows

In [34]:
menu.iloc[0]

Category                              Breakfast
Item                               Egg McMuffin
Serving Size                     4.8 oz (136 g)
Calories                                    300
Calories from Fat                           120
Total Fat                                    13
Total Fat (% Daily Value)                    20
Saturated Fat                                 5
Saturated Fat (% Daily Value)                25
Trans Fat                                     0
Cholesterol                                 260
Cholesterol (% Daily Value)                  87
Sodium                                      750
Sodium (% Daily Value)                       31
Carbohydrates                                31
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      17
Vitamin A (% Daily Value)               

You'll notice that the index is just a series of integers starting with 0.  Sometimes that's fine.  
Other times we want to assign a more useful column as the index.  Note that the values in the index do not need to be unique.

In [44]:
menu_i = menu.set_index('Item')

In [45]:
menu_i.loc['Egg White Delight']

Category                              Breakfast
Serving Size                     4.8 oz (135 g)
Calories                                    250
Calories from Fat                            70
Total Fat                                     8
Total Fat (% Daily Value)                    12
Saturated Fat                                 3
Saturated Fat (% Daily Value)                15
Trans Fat                                     0
Cholesterol                                  25
Cholesterol (% Daily Value)                   8
Sodium                                      770
Sodium (% Daily Value)                       32
Carbohydrates                                30
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      18
Vitamin A (% Daily Value)                     6
Vitamin C (% Daily Value)               

In [46]:
menu_i.iloc[0]

Category                              Breakfast
Serving Size                     4.8 oz (136 g)
Calories                                    300
Calories from Fat                           120
Total Fat                                    13
Total Fat (% Daily Value)                    20
Saturated Fat                                 5
Saturated Fat (% Daily Value)                25
Trans Fat                                     0
Cholesterol                                 260
Cholesterol (% Daily Value)                  87
Sodium                                      750
Sodium (% Daily Value)                       31
Carbohydrates                                31
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      17
Vitamin A (% Daily Value)                    10
Vitamin C (% Daily Value)               
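To summarize the difference: ```loc``` looks rows up by label, and ```iloc``` looks them up by position. The sketch below uses made-up data with a repeated item name to show what happens when index values are not unique. The ```toy``` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Made-up data with a repeated item name (hypothetical, not the Kaggle file)
toy = pd.DataFrame({
    'Item': ['Coffee', 'Tea', 'Coffee'],
    'Size': ['Small', 'Small', 'Large'],
    'Calories': [0, 0, 5],
}).set_index('Item')

by_position = toy.iloc[0]           # first row, whatever its label is -> Series
unique_label = toy.loc['Tea']       # label occurs once -> Series
repeated_label = toy.loc['Coffee']  # label occurs twice -> DataFrame with two rows

print(by_position)
print(unique_label)
print(repeated_label)
```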

We can also extract a row and a slice of its columns:

In [47]:
menu_i.iloc[0,0:2]

Category             Breakfast
Serving Size    4.8 oz (136 g)
Name: Egg McMuffin, dtype: object

Or we can extract a slice of rows along with all of their columns:


In [48]:
menu_i.iloc[1:3,:]

| Item | Category | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | Cholesterol | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Egg White Delight | Breakfast | 4.8 oz (135 g) | 250 | 70 | 8.0 | 12 | 3.0 | 15 | 0.0 | 25 | ... | 30 | 10 | 4 | 17 | 3 | 18 | 6 | 0 | 25 | 8 |
| Sausage McMuffin | Breakfast | 3.9 oz (111 g) | 370 | 200 | 23.0 | 35 | 8.0 | 42 | 0.0 | 45 | ... | 29 | 10 | 4 | 17 | 2 | 14 | 8 | 0 | 25 | 10 |
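One detail about slices is worth knowing: position slices with ```iloc``` stop before the end position (just like Python lists), while label slices with ```loc``` include both endpoints. Here is a small sketch using a made-up DataFrame (the items and values are hypothetical):

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame(
    {'Calories': [500, 300, 150, 250], 'Protein': [25, 4, 6, 2]},
    index=['Burger', 'Fries', 'Salad', 'Shake'],
)

by_position = toy.iloc[1:3, :]       # positions 1 and 2; position 3 is excluded
by_label = toy.loc['Fries':'Shake']  # label slices include both endpoints

print(by_position)
print(by_label)
```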


## Sorting
Sorting is supported using ```sort_index``` and ```sort_values```:


In [51]:
menu_sorted_by_cals = menu.sort_values('Calories',ascending=True)
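As a quick sketch of how the two methods differ, here they are on a made-up DataFrame (hypothetical items and values). ```sort_values``` orders rows by a column, ```sort_index``` orders them by the index, and both return a new DataFrame rather than changing the original.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame(
    {'Calories': [300, 250, 150], 'Protein': [4, 2, 6]},
    index=['Fries', 'Apple Pie', 'Salad'],
)

by_calories_desc = toy.sort_values('Calories', ascending=False)  # largest first
by_label = toy.sort_index()                                      # alphabetical by index

print(by_calories_desc)
print(by_label)
print(toy)  # the original is unchanged
```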

## <font color="magenta"> Exercise 3 (2 points): Display the four menu items that have the most Saturated Fat (the absolute amount, not the % Daily Value).</font>

In [81]:
# insert your code here

## Filtering

More often than extracting a particular row, we want to extract one or more rows that match
some criteria.  For example, to find all the menu items that contain Trans Fats, we could use:


In [82]:
menu_trans_fats = menu[menu['Trans Fat'] > 0.0]

We're going to spend time in class explaining what just happened there.

In [None]:
menu['Trans Fat']

In [None]:
menu['Trans Fat'] > 0.0

In [None]:
menu[menu['Trans Fat'] > 0.0]
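The three cells above show the steps: pull out a column, compare it to a value to get a Series of True/False values (the boolean mask), and then use that mask to keep only the rows marked True. Here is the same idea as a self-contained sketch on made-up data (hypothetical items and values), including one way to combine two conditions.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    'Item': ['Burger', 'Fries', 'Salad', 'Shake'],
    'Calories': [500, 300, 150, 250],
    'Trans Fat': [1.0, 0.0, 0.0, 0.5],
})

mask = toy['Trans Fat'] > 0.0  # Series of True/False, one value per row
print(mask.tolist())

with_trans_fat = toy[mask]     # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
light_with_trans_fat = toy[(toy['Trans Fat'] > 0.0) & (toy['Calories'] < 400)]

print(with_trans_fat)
print(light_with_trans_fat)
```

Note that Python's ```and``` and ```or``` keywords do not work element by element on a Series, which is why the sketch uses ```&``` and ```|``` instead.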

In [74]:
menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

## <font color="magenta">Exercise 4 (2 points): List the top 3 breakfast items that have the most Dietary Fiber.</font>

In [1]:
# insert your code here

## <font color="magenta">Exercise 5 (3 points): Show up to three of the best choices for someone who is following the "Atkins Diet" (Google it).  Justify your choices in a markdown block that follows your code.</font>

In [85]:
# insert your code here

List and justify your choices.

# <font color="magenta">END OF NOTEBOOK</font>
## Remember to submit this file in HTML and/or IPYNB format via Canvas.