## Part 1 (50% of HW1): Data processing with pandas 


In this homework you will see examples of some commonly used data wrangling tools in Python. In particular, we aim to give you some familiarity with:

* Slicing data frames
* Filtering data
* Grouped counts
* Joining two tables
* NA/Null values



## Practice (20%)

This part of the homework is graded manually based on showing the correct outputs.

## Setup

You need to execute each step in order for the later ones to work. First, import necessary libraries:

In [1]:
import pandas as pd
import numpy as np

The code below produces the data frames used in the examples:

In [2]:
heroes = pd.DataFrame(
    data={'color': ['red', 'green', 'black', 
                    'blue', 'black', 'red'],
          'first_seen_on': ['a', 'a', 'f', 'a', 'a', 'f'],
          'first_season': [2, 1, 2, 3, 3, 1]},
    index=['flash', 'arrow', 'vibe', 
           'atom', 'canary', 'firestorm']
)

identities = pd.DataFrame(
    data={'ego': ['barry allen', 'oliver queen', 'cisco ramon',
                  'ray palmer', 'sara lance', 
                  'martin stein', 'ronnie raymond'],
          'alter-ego': ['flash', 'arrow', 'vibe', 'atom',
                        'canary', 'firestorm', 'firestorm']}
)

teams = pd.DataFrame(
    data={'team': ['flash', 'arrow', 'flash', 'legends', 
                   'flash', 'legends', 'arrow'],
          'hero': ['flash', 'arrow', 'vibe', 'atom', 
                   'killer frost', 'firestorm', 'speedy']})

## Pandas and Wrangling

For the examples that follow, we will be using a toy data set containing information about superheroes in the Arrowverse.  In the `first_seen_on` column, `a` stands for Archer and `f`, Flash.

In [8]:
heroes

Unnamed: 0,color,first_seen_on,first_season
flash,red,a,2
arrow,green,a,1
vibe,black,f,2
atom,blue,a,3
canary,black,a,3
firestorm,red,f,1


In [9]:
identities

Unnamed: 0,ego,alter-ego
0,barry allen,flash
1,oliver queen,arrow
2,cisco ramon,vibe
3,ray palmer,atom
4,sara lance,canary
5,martin stein,firestorm
6,ronnie raymond,firestorm


In [10]:
teams

Unnamed: 0,team,hero
0,flash,flash
1,arrow,arrow
2,flash,vibe
3,legends,atom
4,flash,killer frost
5,legends,firestorm
6,arrow,speedy


### Slice and Dice

#### Column selection by label
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` indexer. General usage looks like `frame.loc[rowname, colname]`. (Recall that a bare colon `:` means "everything".)  For example, if we want the `color` column of the `heroes` data frame, we would use:

In [11]:
heroes.loc[:, 'color']

flash          red
arrow        green
vibe         black
atom          blue
canary       black
firestorm      red
Name: color, dtype: object

Selecting multiple columns is easy. You just need to supply a list of column names. Here we select the `color` and `first_season` columns:

In [12]:
heroes.loc[:, ['color', 'first_season']]

Unnamed: 0,color,first_season
flash,red,2
arrow,green,1
vibe,black,2
atom,blue,3
canary,black,3
firestorm,red,1


While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `frame['colname']`.

In [16]:
heroes['first_seen_on']

flash        a
arrow        a
vibe         f
atom         a
canary       a
firestorm    f
Name: first_seen_on, dtype: object

#### Row Selection by Label

Similarly, if we want to select rows by their labels, we can use the same `.loc` indexer.

In [32]:
heroes.loc[['flash', 'vibe'], :]

Unnamed: 0,color,first_seen_on,first_season
flash,red,a,2
vibe,black,f,2


If we want all the columns returned, we can, for brevity, drop the colon without issue.

In [30]:
heroes.loc[['flash', 'vibe']]

Unnamed: 0,color,first_seen_on,first_season
flash,red,a,2
vibe,black,f,2


#### General Selection by Label

More generally you can slice across both rows and columns at the same time.  For example:

In [20]:
heroes.loc['flash':'atom', :'first_seen_on']

Unnamed: 0,color,first_seen_on
flash,red,a
arrow,green,a
vibe,black,f
atom,blue,a


#### Selection by Integer Index

If you want to select rows and columns by position, the `DataFrame` has an analogous `.iloc` indexer for integer indexing. Remember that Python indexing starts at 0.

In [34]:
heroes.iloc[:4,:2]

Unnamed: 0,color,first_seen_on
flash,red,a
arrow,green,a
vibe,black,f
atom,blue,a
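
One difference between the two indexers is easy to miss: a label slice with `.loc` includes its end label, while a positional slice with `.iloc` excludes its end position. The sketch below illustrates this on a small frame; `toy` is a hypothetical name, built here only so the example is self-contained.

```python
import pandas as pd

# Hypothetical toy frame laid out like the first rows of `heroes`
toy = pd.DataFrame(
    data={'color': ['red', 'green', 'black', 'blue'],
          'first_season': [2, 1, 2, 3]},
    index=['flash', 'arrow', 'vibe', 'atom']
)

# Label slice: both endpoints are included
by_label = toy.loc['flash':'vibe', :]

# Positional slice: the end position (2) is excluded
by_position = toy.iloc[0:2, :]

print(list(by_label.index))      # ['flash', 'arrow', 'vibe']
print(list(by_position.index))   # ['flash', 'arrow']
```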


### Filtering with boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, culling out fishy outliers, or analyzing subgroups of your data set.  For example, we may be interested in characters that debuted in season 3 of Archer.  Note that compound expressions have to be grouped with parentheses.

In [29]:
heroes[(heroes['first_season']==3) & (heroes['first_seen_on']=='a')]

Unnamed: 0,color,first_seen_on,first_season
atom,blue,a,3
canary,black,a,3


#### Problem Solving Strategy
We want to highlight the strategy for filtering to answer the question above:

* **Identify the variables of interest**
    * Interested in the debut: `first_season` and `first_seen_on`
* **Translate the question into statements with True/False answers**
    * Did the hero debut on Archer? $\rightarrow$ The hero has `first_seen_on` equal to `a`
    * Did the hero debut in season 3? $\rightarrow$ The hero has `first_season` equal to `3`
* **Translate the statements into boolean statements**
    * The hero has `first_seen_on` equal to `a` $\rightarrow$ `heroes['first_seen_on']=='a'`
    * The hero has `first_season` equal to `3` $\rightarrow$ `heroes['first_season']==3`
* **Use the boolean array to filter the data**

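Note that compound expressions have to be grouped with parentheses, because the element-wise operators `&` and `|` bind more tightly than comparison operators such as `==`.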
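
The strategy above can be sketched step by step. This is a minimal illustration on a hypothetical three-row frame (`toy`), with one named boolean Series per True/False statement; the names `on_a`, `in_season_3`, and `mask` are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame with the two variables of interest
toy = pd.DataFrame(
    data={'first_seen_on': ['a', 'f', 'a'],
          'first_season': [3, 2, 1]},
    index=['atom', 'vibe', 'arrow']
)

# Step 1: one boolean Series per True/False statement
on_a = toy['first_seen_on'] == 'a'
in_season_3 = toy['first_season'] == 3

# Step 2: combine them element-wise with &
mask = on_a & in_season_3

# Step 3: use the boolean Series to filter the rows
result = toy[mask]
print(result)

# Writing the filter on one line without parentheses,
#   toy['first_season'] == 3 & toy['first_seen_on'] == 'a'
# makes Python evaluate 3 & toy['first_seen_on'] first,
# which is not the intended comparison.
```

Naming the intermediate masks is optional, but it makes each condition easy to inspect on its own before combining them.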

For your reference, some commonly used comparison and logical operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
>    | a > b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)

An often-used operation missing from the above table is a test of membership.  The `Series.isin(values)` method returns a boolean array denoting whether each element of `Series` is in `values`.  We can then use the array to subset our data frame. For example, if we wanted to see which rows of `heroes` had `first_season` values in $\{1,3\}$, we would use:

In [303]:
heroes[heroes['first_season'].isin([1,3])]
#heroes['first_season'].isin([1,3])

Unnamed: 0,color,first_seen_on,first_season
arrow,green,a,1
atom,blue,a,3
canary,black,a,3
firestorm,red,f,1


Notice that in both examples above, the expression in the brackets evaluates to a boolean series.  The general strategy for filtering data frames, then, is to write an expression of the form `frame[logical statement]`.
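
Because the expression in the brackets is just a boolean Series, it can also be negated with `~` to keep the complement. A small sketch on a hypothetical frame (`toy`):

```python
import pandas as pd

# Hypothetical toy frame with a single column
toy = pd.DataFrame(
    data={'first_season': [2, 1, 2, 3]},
    index=['flash', 'arrow', 'vibe', 'atom']
)

in_set = toy['first_season'].isin([1, 3])   # True where the value is 1 or 3
kept = toy[in_set]                          # rows inside the set
dropped = toy[~in_set]                      # ~ flips every True/False

print(list(kept.index))      # ['arrow', 'atom']
print(list(dropped.index))   # ['flash', 'vibe']
```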

### Counting Rows

To count the number of instances of a value in a `Series`, we can use the `value_counts` method.  Below we count the number of instances of each color.

In [39]:
heroes['color'].value_counts()

red      2
black    2
blue     1
green    1
Name: color, dtype: int64

A more sophisticated analysis might involve counting the number of times a tuple appears.  Here we count (`color`, `first_season`) tuples.

In [55]:
heroes.groupby(['color', 'first_season']).size()

color  first_season
black  2               1
       3               1
blue   3               1
green  1               1
red    1               1
       2               1
dtype: int64

This returns a series that has been multi-indexed.  We'll eschew this topic for now.  To get a data frame back, we'll use the `reset_index` method, which also allows us to simultaneously name the new column.

In [56]:
heroes.groupby(['color', 'first_season']).size().reset_index(name='count')

Unnamed: 0,color,first_season,count
0,black,2,1
1,black,3,1
2,blue,3,1
3,green,1,1
4,red,1,1
5,red,2,1
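
As a sanity check, a grouped count over a single column should agree with `value_counts`. The sketch below compares the two on a hypothetical one-column frame (`toy`); note that `groupby` sorts the groups by key, while `value_counts` sorts by count.

```python
import pandas as pd

# Hypothetical toy frame holding only the colors
toy = pd.DataFrame({'color': ['red', 'green', 'black', 'blue', 'black', 'red']})

# Count rows per color in two ways
via_value_counts = toy['color'].value_counts()
via_groupby = toy.groupby('color').size().reset_index(name='count')

print(via_groupby)

# Look up each grouped count in the value_counts result
agree = all(via_value_counts[row['color']] == row['count']
            for _, row in via_groupby.iterrows())
print(agree)   # True
```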


### Joining Tables on One Column

Suppose we have another table that classifies superheroes into their respective teams.  Note that `canary` is not in this data set and that `killer frost` and `speedy` are additions that aren't in the original `heroes` set.

For simplicity of the example, we'll convert the index of the `heroes` data frame into an explicit column called `hero`.  A careful examination of the [documentation](http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.merge.html) will reveal that joining on a mixture of the index and columns is possible.

In [57]:
heroes['hero'] = heroes.index
heroes

Unnamed: 0,color,first_seen_on,first_season,hero
flash,red,a,2,flash
arrow,green,a,1,arrow
vibe,black,f,2,vibe
atom,blue,a,3,atom
canary,black,a,3,canary
firestorm,red,f,1,firestorm


#### Inner Join

The inner join below returns rows representing the heroes that appear in both data frames.

In [60]:
pd.merge(heroes, teams, how='inner', on='hero')

Unnamed: 0,color,first_seen_on,first_season,hero,team
0,red,a,2,flash,flash
1,green,a,1,arrow,arrow
2,black,f,2,vibe,flash
3,blue,a,3,atom,legends
4,red,f,1,firestorm,legends


#### Left and right join
The left join returns rows representing heroes in the `heroes` ("left") data frame, augmented by information found in the `teams` data frame.  Its counterpart, the right join, would return heroes in the `teams` data frame.  Note that the `team` for hero `canary` is a `NaN` value, representing missing data.

In [61]:
pd.merge(heroes, teams, how='left', on='hero')

Unnamed: 0,color,first_seen_on,first_season,hero,team
0,red,a,2,flash,flash
1,green,a,1,arrow,arrow
2,black,f,2,vibe,flash
3,blue,a,3,atom,legends
4,black,a,3,canary,
5,red,f,1,firestorm,legends


#### Outer join

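An outer join on `hero` will return all heroes found in either the left or the right data frame.  Any missing values are filled in with `NaN`.  Note that `first_season` is now displayed as a float: an integer column is upcast to `float64` so that it can hold `NaN`.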

In [62]:
pd.merge(heroes, teams, how='outer', on='hero')

Unnamed: 0,color,first_seen_on,first_season,hero,team
0,red,a,2.0,flash,flash
1,green,a,1.0,arrow,arrow
2,black,f,2.0,vibe,flash
3,blue,a,3.0,atom,legends
4,black,a,3.0,canary,
5,red,f,1.0,firestorm,legends
6,,,,killer frost,flash
7,,,,speedy,arrow
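
To see which side each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two hypothetical two-row frames (`left` and `right`) rather than the full tables.

```python
import pandas as pd

# Hypothetical miniature versions of `heroes` and `teams`
left = pd.DataFrame({'hero': ['flash', 'canary'], 'color': ['red', 'black']})
right = pd.DataFrame({'hero': ['flash', 'speedy'], 'team': ['flash', 'arrow']})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'
merged = pd.merge(left, right, how='outer', on='hero', indicator=True)
print(merged[['hero', '_merge']])

# Map each hero to the side(s) it was found on
source = dict(zip(merged['hero'], merged['_merge'].astype(str)))
```

Here `flash` is marked `both`, `canary` is `left_only`, and `speedy` is `right_only`. The row order of an outer join can differ between pandas versions, so it is safer to look rows up by key than by position.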


#### More than one match?

If the values in the columns to be matched don't uniquely identify a row, then a cartesian product is formed in the merge.  For example, notice that `firestorm` has two different egos, so information from `heroes` had to be duplicated in the merge, once for each ego.

In [63]:
pd.merge(heroes, identities, how='inner', 
         left_on='hero', right_on='alter-ego')

Unnamed: 0,color,first_seen_on,first_season,hero,ego,alter-ego
0,red,a,2,flash,barry allen,flash
1,green,a,1,arrow,oliver queen,arrow
2,black,f,2,vibe,cisco ramon,vibe
3,blue,a,3,atom,ray palmer,atom
4,black,a,3,canary,sara lance,canary
5,red,f,1,firestorm,martin stein,firestorm
6,red,f,1,firestorm,ronnie raymond,firestorm
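
If duplicated rows would be a surprise, it can help to check key uniqueness before merging. The sketch below shows two ways on hypothetical miniature frames: counting matches per key, and asking `pd.merge` to enforce the assumption through its `validate` parameter.

```python
import pandas as pd

# Hypothetical miniature versions of `heroes` and `identities`
left = pd.DataFrame({'hero': ['flash', 'firestorm']})
right = pd.DataFrame({'ego': ['barry allen', 'martin stein', 'ronnie raymond'],
                      'alter-ego': ['flash', 'firestorm', 'firestorm']})

# How many rows does each key have on the right-hand side?
matches_per_key = right['alter-ego'].value_counts()
print(matches_per_key['firestorm'])   # 2

# Ask pandas to check the assumption "keys are unique on both sides"
try:
    pd.merge(left, right, left_on='hero', right_on='alter-ego',
             validate='one_to_one')
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False

print(unique_keys)   # False
```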


### Missing Values

There are a multitude of reasons why a data set might have missing values.  The current implementation of Pandas uses the numpy NaN to represent these null values (older implementations even used `-inf` and `inf`).  Future versions of Pandas might implement a true `null` value; keep your eyes peeled for this in updates!  More information can be found at [http://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html](http://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

Because of the specialness of missing values, they merit their own set of tools.  Here, we will focus on detection.  For replacement, see the docs.

In [64]:
x = np.nan
y = pd.merge(heroes, teams, how='outer', on='hero')['first_season']
y

0    2.0
1    1.0
2    2.0
3    3.0
4    3.0
5    1.0
6    NaN
7    NaN
Name: first_season, dtype: float64

To check if a value is null, we use the `isnull()` method for series and data frames.  Alternatively, there is a `pd.isnull()` function as well.

In [65]:
x.isnull() # won't work since x is neither a series nor a data frame

AttributeError: 'float' object has no attribute 'isnull'

In [66]:
pd.isnull(x)

True

In [67]:
y.isnull()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
Name: first_season, dtype: bool

In [68]:
pd.isnull(y)

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
Name: first_season, dtype: bool

Since filtering out missing data is such a common operation, Pandas also has conveniently included the analogous `notnull()` methods and function for improved human readability.

In [69]:
y.notnull()

0     True
1     True
2     True
3     True
4     True
5     True
6    False
7    False
Name: first_season, dtype: bool

In [70]:
y[y.notnull()]

0    2.0
1    1.0
2    2.0
3    3.0
4    3.0
5    1.0
Name: first_season, dtype: float64
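
The reason dedicated tools are needed is that `NaN` never compares equal to anything, including itself, so an ordinary equality test cannot find missing entries. A short sketch on a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing entry
s = pd.Series([2.0, np.nan, 3.0])

# NaN is not equal to anything, including itself
print(np.nan == np.nan)             # False

# So an equality test cannot find the missing entry ...
found_by_equality = (s == np.nan)
# ... while isnull() can
found_by_isnull = s.isnull()

print(found_by_equality.tolist())   # [False, False, False]
print(found_by_isnull.tolist())     # [False, True, False]

# dropna() is a shorthand for s[s.notnull()]
print(s.dropna().tolist())          # [2.0, 3.0]
```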

## Questions (30%)

The practice problems below use the Department of Transportation's "On-Time" flight data for all flights originating from SFO or OAK in January 2016. Information about the airports and airlines is contained in the comma-delimited files `airports.dat` and `airlines.dat`, respectively.  Both were sourced from http://openflights.org/data.html.

Disclaimer: There is a more direct way of dealing with time data that is not presented in these problems.  This activity is merely an academic exercise.

In [74]:
flights = pd.read_csv("flights.dat", dtype={'sched_dep_time': 'f8', 'sched_arr_time': 'f8'})
# show the first few rows, by default 5
flights.head()

Unnamed: 0,year,month,day,date,carrier,tailnum,flight,origin,destination,sched_dep_time,actual_dep_time,sched_arr_time,actual_arr_time
0,2016,1,1,2016-01-01,AA,N3FLAA,208,SFO,MIA,630.0,628.0,1458.0,1431.0
1,2016,1,2,2016-01-02,AA,N3APAA,208,SFO,MIA,600.0,553.0,1428.0,1401.0
2,2016,1,3,2016-01-03,AA,N3DNAA,208,SFO,MIA,630.0,626.0,1458.0,1431.0
3,2016,1,4,2016-01-04,AA,N3FGAA,208,SFO,MIA,630.0,626.0,1458.0,1444.0
4,2016,1,5,2016-01-05,AA,N3KUAA,208,SFO,MIA,640.0,632.0,1458.0,1439.0


In [72]:
airports_cols = [
    'openflights_id',
    'name',
    'city',
    'country',
    'iata',
    'icao',
    'latitude',
    'longitude',
    'altitude',
    'tz',
    'dst',
    'tz_olson',
    'type',
    'airport_dsource'
]

airports = pd.read_csv("airports.dat", names=airports_cols)
airports.head(3)

Unnamed: 0,openflights_id,name,city,country,iata,icao,latitude,longitude,altitude,tz,dst,tz_olson,type,airport_dsource
0,1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10.0,U,Pacific/Port_Moresby,,
1,2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.7887,20,10.0,U,Pacific/Port_Moresby,,
2,3,Mount Hagen,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789,144.295861,5388,10.0,U,Pacific/Port_Moresby,,


### Question 1.1 (12% credit)
It looks like the departure and arrival times in `flights` were read in as floating-point numbers.  Write two functions, `extract_hour` and `extract_mins`, that convert military time to hours and minutes, respectively. Hint: You may want to use modular arithmetic and integer division. Keep in mind that the data has not been cleaned and you need to check whether the extracted values are valid. Replace all the invalid values with `NaN`. The documentation for `pandas.Series.where` provided [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html) should be helpful.

In [778]:
# 5% credit
def extract_hour(time):
    """
    Extracts hour information from military time.
    
    Args: 
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with hour information.  
          Should only take on integer values in 0-23
    """
    ret = []
    for i in range(len(time)):
        if (time[i] < 0) | (time[i] > 2359):
            #time[i] = np.nan
            ret.append(np.nan)
        elif time[i] == 0:
            ret.append(0)
        else:
            ret.append(((time[i] - (time[i] % 100)) / 100))
    return pd.Series(ret)

In [779]:
# 5% credit
def extract_mins(time):
    """
    Extracts minute information from military time
    
    Args: 
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with minute information.  
          Should only take on integer values in 0-59
    """
    ret2 = []
    for t in time:
        mins = t % 100
        # Invalid if the time lies outside 0-2359 or the minute part exceeds 59
        if (t < 0) or (t > 2359) or (mins > 59):
            ret2.append(np.nan)
        else:
            ret2.append(mins)
    return pd.Series(ret2)

In [780]:
# 2% credit
### write code to test your functions here and execute it
l = [-0.01, 1100.0, 1200.0, 1259.0, 1349.0, 3600.9, 1030.0, 1259.0]
extract_hour(l)

0     NaN
1    11.0
2    12.0
3    12.0
4    13.0
5     NaN
6    10.0
7    12.0
dtype: float64

In [781]:
extract_mins(l)

0     NaN
1     0.0
2     0.0
3    59.0
4    49.0
5     NaN
6    30.0
7    59.0
dtype: float64
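
A loop-free alternative is possible with `pandas.Series.where`, which the question hints at. The following is only a sketch under stated assumptions: the names `_split_time`, `extract_hour_vec`, and `extract_mins_vec` are hypothetical, and "valid" is taken to mean a whole number whose hour part is 0-23 and whose minute part is 0-59.

```python
import numpy as np
import pandas as pd

def _split_time(time):
    """Hypothetical helper: return (hour, mins, valid) for military times."""
    time = pd.Series(time, dtype='float64')
    hour = time // 100          # integer division gives the hour part
    mins = time % 100           # remainder gives the minute part
    valid = (time >= 0) & (hour <= 23) & (mins <= 59) & (time % 1 == 0)
    return hour, mins, valid

def extract_hour_vec(time):
    hour, _, valid = _split_time(time)
    return hour.where(valid)    # keep valid entries, NaN elsewhere

def extract_mins_vec(time):
    _, mins, valid = _split_time(time)
    return mins.where(valid)

sample = [-0.01, 1100.0, 1259.0, 3600.9, np.nan]
print(extract_hour_vec(sample).tolist())
print(extract_mins_vec(sample).tolist())
```

Comparisons against `NaN` evaluate to `False`, so missing inputs fall out as invalid without a separate check.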

### Question 1.2 (13% credit)

Using your two functions above, filter the `flights` data for flights that departed 15 or more minutes later than scheduled by comparing `sched_dep_time` and `actual_dep_time`.  You need not worry about flights that were delayed to the next day for this question.

In [783]:
# 5% credit
def convert_to_minofday(time):
    """
    Converts military time to minute of day
    
    Args:
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with minute of day
    
    Example: 1:03pm is converted to 783.0
    """
    hour = extract_hour(time) * 60
    mins = extract_mins(time)
    return hour + mins

In [784]:
def calc_time_diff(x, y):
    """
    Calculates delay times y - x
    
    Args:
        x (float64): array of scheduled time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
        y (float64): array of same dimensions giving actual time
    
    Returns:
        array (float64): array of input dimension with delay time
    """
    
    # Convert both times to minutes since midnight, then subtract
    scheduled = convert_to_minofday(x)
    actual = convert_to_minofday(y)
    
    return actual - scheduled

In [785]:
# 3% credit
# Series object showing the delay time (in minutes) of each flight
delay = calc_time_diff(flights['sched_dep_time'], flights['actual_dep_time'])
print(delay.head())

# Dataframe showing flights delayed by 15 minutes or more
delayed15 = flights[delay >= 15]
delayed15.head()



0   -2.0
1   -7.0
2   -4.0
3   -4.0
4   -8.0
dtype: float64


Unnamed: 0,year,month,day,date,carrier,tailnum,flight,origin,destination,sched_dep_time,actual_dep_time,sched_arr_time,actual_arr_time
15,2016,1,16,2016-01-16,AA,N3GAAA,208,SFO,MIA,640.0,723.0,1458.0,1534.0
19,2016,1,20,2016-01-20,AA,N3BBAA,208,SFO,MIA,640.0,726.0,1458.0,1532.0
22,2016,1,23,2016-01-23,AA,N3BBAA,208,SFO,MIA,640.0,901.0,1458.0,1749.0
32,2016,1,3,2016-01-03,AA,N3BXAA,209,SFO,LAX,1650.0,1706.0,1818.0,1835.0
35,2016,1,6,2016-01-06,AA,N3FPAA,209,SFO,LAX,2035.0,2105.0,2208.0,2257.0
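
The disclaimer at the start of this section mentions a more direct way of dealing with time data. One possible version is sketched below, assuming the times contain no missing or invalid values; the helper `to_clock` and the sample Series are hypothetical, with values copied from rows shown in the outputs above.

```python
import pandas as pd

# Scheduled and actual departure times taken from three rows shown above
sched = pd.Series([630.0, 640.0, 1650.0])
actual = pd.Series([628.0, 723.0, 1706.0])

def to_clock(time):
    # 630.0 -> '0630' -> a timestamp whose time of day is 06:30
    # (assumes every value is a valid, non-missing military time)
    as_text = time.astype(int).astype(str).str.zfill(4)
    return pd.to_datetime(as_text, format='%H%M')

# Subtracting timestamps gives timedeltas, converted here to minutes
delay_minutes = (to_clock(actual) - to_clock(sched)).dt.total_seconds() / 60
print(delay_minutes.tolist())   # [-2.0, 43.0, 16.0]
```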


### Question 1.3 (5% credit)

Using your answer from question 1.2, find the full name of every destination city with a flight from SFO or OAK that was delayed by 15 or more minutes.  The airport codes used in `flights` are IATA codes.  Sort the cities alphabetically. Make sure you remove duplicates. You may find `drop_duplicates` and `sort_values` helpful.

In [406]:
# 5% credit
# Keep only the delayed flights that originate from SFO or OAK
data = delayed15[delayed15['origin'].isin(['SFO', 'OAK'])]
data.head()
#[YOUR CODE HERE]
#delayed_airports = ... # Dataframe showing airports that satisfy above conditions
#delayed_destinations = ... # Unique and sorted destination cities

Unnamed: 0,year,month,day,date,carrier,tailnum,flight,origin,destination,sched_dep_time,actual_dep_time,sched_arr_time,actual_arr_time
15,2016,1,16,2016-01-16,AA,N3GAAA,208,SFO,MIA,640.0,723.0,1458.0,1534.0
19,2016,1,20,2016-01-20,AA,N3BBAA,208,SFO,MIA,640.0,726.0,1458.0,1532.0
22,2016,1,23,2016-01-23,AA,N3BBAA,208,SFO,MIA,640.0,901.0,1458.0,1749.0
32,2016,1,3,2016-01-03,AA,N3BXAA,209,SFO,LAX,1650.0,1706.0,1818.0,1835.0
35,2016,1,6,2016-01-06,AA,N3FPAA,209,SFO,LAX,2035.0,2105.0,2208.0,2257.0


In [409]:
airports.head()
airports[airports['iata'] == "SFO"]

Unnamed: 0,openflights_id,name,city,country,iata,icao,latitude,longitude,altitude,tz,dst,tz_olson,type,airport_dsource
3370,3469,San Francisco Intl,San Francisco,United States,SFO,KSFO,37.618972,-122.374889,13,-8.0,A,America/Los_Angeles,,
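
The remaining steps (look up each destination's city by IATA code, remove duplicates, sort) could be sketched as follows. This is an illustration on hypothetical stand-in frames, `toy_delayed` and `toy_airports`, not the graded answer on the real data.

```python
import pandas as pd

# Hypothetical stand-ins for `delayed15` and `airports`
toy_delayed = pd.DataFrame({'origin': ['SFO', 'SFO', 'OAK', 'SFO'],
                            'destination': ['MIA', 'LAX', 'LAX', 'MIA']})
toy_airports = pd.DataFrame({'iata': ['LAX', 'MIA', 'SFO'],
                             'city': ['Los Angeles', 'Miami', 'San Francisco']})

# Attach the city of each destination airport via its IATA code
with_city = pd.merge(toy_delayed, toy_airports, how='left',
                     left_on='destination', right_on='iata')

# One row per city, in alphabetical order
cities = (with_city[['city']]
          .drop_duplicates()
          .sort_values('city')
          .reset_index(drop=True))
print(cities['city'].tolist())   # ['Los Angeles', 'Miami']
```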


## Part 2 (50% of HW 1): Web scraping and data collection 

Here, you will practice collecting and processing data in Python. By the end of this exercise, you should hopefully be able to look at the wonderful world wide web without fear, comforted by the fact that anything you can see with your human eyes, a computer can see with its computer eyes. In particular, we aim to give you some familiarity with:

* Using HTTP to fetch the content of a website
* HTTP Requests (and lifecycle)
* RESTful APIs
    * Authentication (OAuth)
    * Pagination
    * Rate limiting
* JSON vs. HTML (and how to parse each)
* HTML traversal (CSS selectors)

Since everyone loves food (presumably), the ultimate end goal of this homework will be to acquire the data to answer some questions and hypotheses about the restaurant scene in Chicago (which we will get to later). We will download __both__ the metadata on restaurants in Chicago from the Yelp API and with this metadata, retrieve the comments/reviews and ratings from users on restaurants.


### Library Documentation

For solving this part, you need to look up online documentation for the Python packages you will use:

* Standard Library: 
    * [io](https://docs.python.org/3/library/io.html)
    * [time](https://docs.python.org/3/library/time.html)
    * [json](https://docs.python.org/3/library/json.html)

* Third Party
    * [requests](http://docs.python-requests.org/en/master/)
    * [Beautiful Soup (version 4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
    * [yelp-fusion](https://www.yelp.com/developers/documentation/v3/get_started)

**Note:** You may come across a `yelp-python` library online. The library is deprecated and incompatible with the current Yelp API, so do not use the library.

## Setup

First, import necessary libraries:

In [421]:
import io, time, json
import requests
from bs4 import BeautifulSoup

## Authentication and working with APIs

There are various authentication schemes that APIs use, listed here in relative order of complexity:

* No authentication
* [HTTP basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)
* Cookie based user login
* OAuth (v1.0 & v2.0, see this [post](http://stackoverflow.com/questions/4113934/how-is-oauth-2-different-from-oauth-1) explaining the differences)
* API keys
* Custom Authentication

For the NYT example below (**Q2.1**), we do not need to authenticate, since it is a publicly visible page. HTTP basic authentication isn't too common for consumer sites/applications that have the concept of user accounts (like Facebook, LinkedIn, Twitter, etc.), but it is simple to set up and you often encounter it on individual password-protected pages/sites.

Cookie based user login is what the majority of services use when you login with a browser (i.e. username and password). Once you sign in to a service like Facebook, the response stores a cookie in your browser to remember that you have logged in (HTTP is stateless). Each subsequent request to the same domain (i.e. any page on `facebook.com`) also sends the cookie that contains the authentication information to remind Facebook's servers that you have already logged in.

Many REST APIs, however, use OAuth (authentication using tokens), which can be thought of as a programmatic way to "login" _another_ user. Using tokens, a user (or application) only needs to send the login credentials once in the initial authentication and, in response, gets a special signed token from the server. This signed token is then sent in future requests to the server (in place of the user credentials).

A similar concept commonly used by many APIs is to assign API Keys to each client that needs access to server resources. The client must then pass the API Key along with _every_ request it makes to the API to authenticate. This is because the server is typically relatively stateless and does not maintain a session between subsequent calls from the same client. Most APIs (including Yelp) allow you to pass the API Key via a special HTTP Header: `Authorization: Bearer <API_KEY>`. Check out the [docs](https://www.yelp.com/developers/documentation/v3/authentication) for more information.
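
To make the header mechanics concrete, the sketch below builds a request with `requests` and inspects it without sending anything. The key and the URL are placeholders (neither is real), and the query parameters are made up for illustration.

```python
import requests

# Placeholder values: neither the key nor the URL is real
api_key = 'YOUR_API_KEY'
url = 'https://api.example.com/v1/items'

# The API Key travels in the Authorization header of every request
headers = {'Authorization': 'Bearer ' + api_key}
params = {'location': 'Chicago', 'limit': 5}

# Build the request without sending it, to inspect what would be sent
prepared = requests.Request('GET', url, headers=headers, params=params).prepare()

print(prepared.headers['Authorization'])   # Bearer YOUR_API_KEY
print(prepared.url)                        # URL with the encoded query string
```

To actually send such a request, the same `headers` and `params` dictionaries can be passed to `requests.get(url, headers=headers, params=params)`.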


### Question 2.1: Basic HTTP Requests w/o authentication (6%)

First, let's do the "hello world" of making web requests with Python to get a sense for how to programmatically access web pages: an (unauthenticated) HTTP GET to download a web page.

Fill in the function to use `requests` to download and return the raw HTML content of the URL passed in as an argument. As an example, try the following NYT article (on YouTube's algorithmic recommendation): [https://www.nytimes.com/2019/03/29/technology/youtube-online-extremism.html](https://www.nytimes.com/2019/03/29/technology/youtube-online-extremism.html)

Your function should return a tuple of: (`<status_code>`, `<text>`). (Hint: look at the **Library documentation** listed earlier to see how `requests` should work.) 

In [446]:
# 3% credit
def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.
    Args:
        url (string): 

    Returns:
        status_code (integer):
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    
    response = requests.get(url)
    return response.status_code, response.text

In [450]:
# 3% credit
youtube_article = retrieve_html('https://www.nytimes.com/2019/03/29/technology/youtube-online-extremism.html')
print(youtube_article)
# (200, '<!DOCTYPE html>\n<html lang="en" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <title data-rh="true">YouTube’s ...)

(200, '<!DOCTYPE html>\n<html lang="en" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <title data-rh="true">YouTube’s Product Chief on Online Radicalization and Algorithmic Rabbit Holes - The New York Times</title>\n    <meta data-rh="true" itemprop="inLanguage" content="en-US"/><meta data-rh="true" property="article:published" itemprop="datePublished dateCreated" content="2019-03-29T15:40:56.000Z"/><meta data-rh="true" property="article:modified" itemprop="dateModified" content="2019-04-01T02:48:25.343Z"/><meta data-rh="true" http-equiv="Content-Language" content="en"/><meta data-rh="true" name="robots" content="noarchive"/><meta data-rh="true" name="articleid" itemprop="identifier" content="100000006432623"/><meta data-rh="true" name="nyt_uri" itemprop="identifier" content="nyt://article/2ab5a5e9-efba-5bdd-81e2-0851b18b8f12"/><meta data-rh="true" name="pubp_event_id" itemprop="identifier" content="pubp://event/7f307847a0cf491ca6d4fb6d177510b8"/><meta d

Now while this example might have been fun, we haven't yet done anything more than we could with a web browser. To really see the power of programmatically making web requests we will need to interact with an API. For the rest of this lab we will be working with the [Yelp API](https://www.yelp.com/developers/documentation/v3/get_started) and Yelp data (for an extensive data dump see their [Academic Dataset Challenge](https://www.yelp.com/dataset_challenge)). 

## Yelp API Access

The reasons for using the Yelp API are threefold:

1. Incredibly rich dataset that combines:
    * entity data (users and businesses)
    * preferences (i.e. ratings)
    * geographic data (business location and check-ins)
    * temporal data
    * text in the form of reviews
    * and even images.
2. Well [documented API](https://www.yelp.com/developers/documentation/v3/get_started) with thorough examples.
3. Extensive data coverage so that you can find data that you know personally (from your home town/city or account). This will help with understanding and interpreting your results.

Yelp used to use OAuth tokens but has now switched to API Keys. **For the sake of backwards compatibility Yelp still provides a Client ID and Secret for OAuth, but you will not need those for this assignment.** 

To access the Yelp API, we will need to go through a few more steps than we did with the first NYT example. Most large web-scale companies use a combination of authentication and rate limiting to control access to their data and to ensure that everyone using it abides by their terms. The first step (even before we make any request) is to set up a Yelp account if you do not have one and get API credentials.

1. Create a [Yelp](https://www.yelp.com/login) account (if you do not have one already)
2. [Generate API keys](https://www.yelp.com/developers/v3/manage_app) (if you haven't already). You will only need the API Key (not the Client ID or Client Secret) -- more on that later.

Now that we have our accounts set up, we can start making requests!


### Question 2.2: Authenticated HTTP Request with the Yelp API (16%)

First, store your Yelp credentials in a local file (kept out of version control) which you can read in to authenticate with the API. This file can be any format/structure since you will fill in the function stub below.

For example, you may want to store your key in a file called `yelp_api_key.txt` (run in terminal):
```bash
echo 'YOUR_YELP_API_KEY' > yelp_api_key.txt
```

**KEEP THE API KEY FILE PRIVATE AND OUT OF VERSION CONTROL (and definitely do not submit it to Gradescope!)**

You can then read from the file using:

In [2]:
# 3% credit
with open('yelp_api_key.txt', 'r') as f:
    api_key = f.read().replace('\n','')
        # verify your api_key is correct
# DO NOT FORGET TO CLEAR THE OUTPUT TO KEEP YOUR API KEY PRIVATE

In [452]:
# 3% credit
def read_api_key(filepath):
    """
    Read the Yelp API Key from file.
    
    Args:
        filepath (string): File containing API Key
    Returns:
        api_key (string): The API Key
    """
    
    # feel free to modify this function if you are storing the API Key differently
    with open(filepath, 'r') as f:
        return f.read().replace('\n','')

Using the Yelp API, fill in the following function stub to make an authenticated request to the [search](https://www.yelp.com/developers/documentation/v3/business_search) endpoint. Remember Yelp allows you to pass the API Key via a special HTTP Header: `Authorization: Bearer <API_KEY>`. Check out the [docs](https://www.yelp.com/developers/documentation/v3/authentication) for more information.
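
As a minimal sketch of the header format described above, the snippet below builds the `Authorization` header and attaches it to a request object without sending anything. It uses the standard library's `urllib.request` only so that it runs offline; the helper name `build_auth_headers` and the placeholder key are assumptions made for illustration. With `requests`, the same dictionary would be passed as `headers=`.

```python
import urllib.request


def build_auth_headers(api_key):
    """Return HTTP headers carrying the API key as a Bearer token."""
    return {"Authorization": "Bearer " + api_key}


# Placeholder key for illustration only; never hard-code a real key.
example_headers = build_auth_headers("YOUR_YELP_API_KEY")

# Build (but do not send) a request to inspect what would go over the wire.
request = urllib.request.Request(
    "https://api.yelp.com/v3/businesses/search?location=Chicago",
    headers=example_headers,
)
print(request.get_method())                 # GET
print(request.get_header("Authorization"))  # Bearer YOUR_YELP_API_KEY
```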

In [572]:
# delete it
headers = {'Authorization' : 'Bearer %s' % read_api_key('yelp_api_key.txt')}
params = {'term':'seafood','location':'New York City'}
#requ = requests.get('https://api.yelp.com/v3/businesses/search', headers = headers, params=params)
#json.loads(requ.text)
def api_get_request(url, headers, url_params):
    
    """Send a HTTP GET request and return a json response 
    
    Args:
        url (string): API endpoint url
        headers (dict): A python dictionary containing HTTP headers including Authentication to be sent
        url_params (dict): The parameters (required and optional) supported by endpoint
        
    Returns:
        results (json): response as json"""
    
    http_method = 'GET'
    # See requests.request?
    resp = requests.get(url,headers=headers,params=url_params)
    return resp.json()  # call the method; `resp.json` alone returns the bound method, not the data
#json.loads(resp.text)

p = api_get_request('https://api.yelp.com/v3/businesses/search',headers,params)

In [592]:
# 3% credit
def api_get_request(url, headers, url_params):
    """
    Send a HTTP GET request and return a json response 
    
    Args:
        url (string): API endpoint url
        headers (dict): A python dictionary containing HTTP headers including Authentication to be sent
        url_params (dict): The parameters (required and optional) supported by endpoint
        
    Returns:
        results (json): response as json
    """
    http_method = 'GET'
    # See requests.request?
    response = requests.get(url,headers=headers,params=url_params)
    return response.json()
    
# 4% credit    
def location_search_params(api_key2, location, **kwargs):
    """
    Construct url, headers and url_params. Reference API docs (link above) to use the arguments
    """
    # What is the url endpoint for search?
    url = 'https://api.yelp.com/v3/businesses/search'
    # How is Authentication performed?
    Bearer='Bearer '+api_key2
    headers = {'Authorization' : Bearer}
    # SPACES in url is problematic. How should you handle location containing spaces?
    location = location.replace(" ","+")
    url_params = {'location':location}
    # Include keyword arguments in url_params
    for key,value in kwargs.items():
        url_params[key] = value 
    return url, headers, url_params

def yelp_search(api_key2, location, offset=0):
    """
    Make an authenticated request to the Yelp API.

    Args:
        api_key (string): Your Yelp API Key for Authentication
        location (string): Business Location
        offset (int): param for pagination

    Returns:
        total (integer): total number of businesses on Yelp corresponding to the location
        businesses (list): list of dicts representing each business
    """
    # Use the function's own arguments (not a global key or a hard-coded offset)
    url, headers, url_params = location_search_params(api_key2, location, offset=offset)
    response_json = api_get_request(url, headers, url_params)
    return response_json["total"], list(response_json["businesses"])


In [594]:
#3% credit
api_key2 = read_api_key('C:\\Users\\abdul\\api_key2.txt')
num_records,data = yelp_search(api_key2, 'Chicago')
print(num_records)
#8500
print(len(data))
#20
print(list(map(lambda x: x['name'], data)))
#['Girl & the Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', 'Art Institute of Chicago', 'Smoque BBQ', "Lou Malnati's Pizzeria", 'Alinea', "Kuma's Corner - Belmont", 'Little Goat Diner', "Bavette's Bar & Boeuf", 'Cafe Ba-Ba-Reeba!', "Portillo's Hot Dogs", 'Quartino Ristorante', "Pequod's Pizzeria", 'Crisp', "Joe's Seafood, Prime Steak & Stone Crab", 'Xoco', "Molly's Cupcakes", 'Millennium Park']

8500
20
['Girl & the Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', "Lou Malnati's Pizzeria", 'Art Institute of Chicago', 'Smoque BBQ', 'Cafe Ba-Ba-Reeba!', "Bavette's Bar & Boeuf", 'Alinea', 'Little Goat Diner', "Kuma's Corner - Belmont", 'Quartino Ristorante', "Pequod's Pizzeria", "Portillo's Hot Dogs", 'Crisp', "Joe's Seafood, Prime Steak & Stone Crab", 'Xoco', "Molly's Cupcakes", 'Sapori Trattoria']


Now that we have completed the "hello world" of working with the Yelp API, we are ready to really fly! The rest of the exercise will have a bit less direction since there are a variety of ways to retrieve the requested information, but you should have all the component knowledge at this point to work with the API. Yelp, being a fairly general platform, actually has many more businesses than just restaurants, but by using the flexibility of the API we can ask it to return only the restaurants.

## Parameterization and Pagination

Before we can get any reviews on restaurants, we need to get the metadata on ALL of the restaurants in Chicago. Notice above that while Yelp told us that there are ~8500, the response contained far fewer actual `Business` objects. This is due to pagination, a safeguard against returning __TOO__ much data in a single request (what would happen if there were 100,000 restaurants?). It can be used in conjunction with _rate limiting_ as a way to throttle and protect access to Yelp data.

> As a thought exercise, consider: If an API has 1,000,000 records, but only returns 10 records per page and limits you to 5 requests per second... how long will it take to acquire ALL of the records contained in the API?
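
One way to work through the thought exercise is to count the pages and divide by the request rate. The sketch below assumes the rate limit is the only bottleneck (no network latency, no retries); the function name `seconds_to_fetch_all` is made up for this example.

```python
import math


def seconds_to_fetch_all(total_records, page_size, requests_per_second):
    """Lower bound on download time when the rate limit is the only constraint."""
    pages = math.ceil(total_records / page_size)
    return pages / requests_per_second


seconds_needed = seconds_to_fetch_all(1_000_000, 10, 5)
print(seconds_needed)         # 20000.0 seconds
print(seconds_needed / 3600)  # about 5.6 hours
```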

One of the ways that APIs are an improvement over plain web scraping is the ability to make __parameterized__ requests. Just like the Python functions you have been writing have arguments (or parameters) that allow you to customize their behavior/actions (an output) without having to rewrite the function entirely, we can parameterize the queries we make to the Yelp API to filter the results it returns.
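
To make parameterization concrete, the sketch below shows how a dictionary of parameters becomes the query string of a URL, using the standard library's `urlencode`. The parameter values are illustrative. `requests` is expected to perform comparable encoding when given `params=`, so treat this as a way to see the mechanics rather than a description of what any particular server requires.

```python
from urllib.parse import urlencode

base_url = "https://api.yelp.com/v3/businesses/search"
url_params = {"location": "New York City", "categories": "restaurants", "limit": 20}

# Spaces in values are escaped automatically when the query string is built.
query = urlencode(url_params)
print(base_url + "?" + query)
# https://api.yelp.com/v3/businesses/search?location=New+York+City&categories=restaurants&limit=20

# A literal "+" placed in a value by hand is itself escaped.
manual_query = urlencode({"location": "New+York+City"})
print(manual_query)
# location=New%2BYork%2BCity
```

Changing a single entry in `url_params` (for example `limit` or `categories`) changes the request without touching the rest of the code, which is the point of parameterized queries.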

### Question 2.3: Acquire all of the restaurants in Chicago on Yelp (10%)

Again using the [API documentation](https://www.yelp.com/developers/documentation/v3/business_search) for the `search` endpoint, fill in the following function to retrieve all of the _Restaurants_ (using categories) for a given query. Again, you should use your `read_api_key()` function outside of the `all_restaurants()` stub to read the API Key used for the requests. You will need to account for __pagination__ and __[rate limiting](https://www.yelp.com/developers/faq)__ to:

1. Retrieve all of the Business objects (# of business objects should equal `total` in the response). **Paginate by querying 20 restaurants each request.**
2. Pause slightly (at least 200 milliseconds) between subsequent requests so as to not overwhelm the API (and get blocked).  
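
The two requirements above can be sketched independently of Yelp. In the example below, `fake_fetch_page` stands in for a real API call and the helper names are hypothetical; the point is only the shape of the loop: one request per offset, with a pause after each.

```python
import time


def page_offsets(total, limit=20):
    """Return the offsets of the pages needed to cover `total` records."""
    return range(0, total, limit)


def fetch_all(fetch_page, total, limit=20, pause_seconds=0.2):
    """Call fetch_page(offset, limit) for every page, pausing between calls."""
    records = []
    for offset in page_offsets(total, limit):
        records.extend(fetch_page(offset, limit))
        time.sleep(pause_seconds)  # stay under the rate limit
    return records


# Stand-in for a real API call: serves slices of a fake 45-record dataset.
fake_data = list(range(45))


def fake_fetch_page(offset, limit):
    return fake_data[offset:offset + limit]


print(list(page_offsets(45)))  # [0, 20, 40]
print(len(fetch_all(fake_fetch_page, total=45, pause_seconds=0)))  # 45
```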

As always with API access, make sure you follow all of the [API's policies](https://www.yelp.com/developers/api_terms) and use the API responsibly and respectfully.

**DO NOT MAKE TOO MANY REQUESTS TOO QUICKLY OR YOUR KEY MAY BE BLOCKED**

In [643]:
# 4% credit
def paginated_restaurant_search_requests(api_key, location, total):
    """
    Returns a list of tuples (url, headers, url_params) for paginated search of all restaurants
    Args:
        api_key (string): Your Yelp API Key for Authentication
        location (string): Business Location
        total (int): Total number of items to be fetched
    Returns:
        results (list): list of tuple (url, headers, url_params)
    """
    # HINT: Use total, offset and limit for pagination
    # You can reuse function location_search_params(...)
    tuple_list = []
    retrieved = 0
    while retrieved < total:
        url,header,param = location_search_params(api_key, location, limit=20, categories = "restaurants", offset=retrieved)
        tuple_list.append((url,header,param))
        retrieved += 20
    return tuple_list  

# 3% credit
def all_restaurants(api_key, location):
    """
    Construct the pagination requests for ALL the restaurants on Yelp for a given location.

    Args:
        api_key (string): Your Yelp API Key for Authentication
        location (string): Business Location

    Returns:
        results (list): list of dicts representing each restaurant
    """
    # Request the first page of restaurants (not all businesses) so that `total` counts restaurants only
    url, headers, url_params = location_search_params(api_key, location, offset=0, limit=20, categories="restaurants")
    # 
    response_json = api_get_request(url, headers, url_params)
    total_items = response_json["total"]
    
    all_restaurants_request = paginated_restaurant_search_requests(api_key, location, total_items)
    
    # Use the returned list of (url, headers, url_params) and api_get_request to retrieve all restaurants
    # REMEMBER to pause slightly after each request.
    restaurants = []
    for url, headers, url_params in all_restaurants_request:
        response_json = api_get_request(url, headers, url_params)
        restaurants.extend(response_json["businesses"])
        time.sleep(0.3)  # pause longer than the 200 ms minimum between requests
    return restaurants

You can test your function with an individual neighborhood in Chicago (for example, Greektown). Chicago itself has a lot of restaurants... meaning it will take a lot of time to download them all.

In [644]:
# 3% credit
data = all_restaurants(api_key, 'Greektown, Chicago, IL')
print(len(data))
# 102
print(list(map(lambda x:x['name'], data)))
# ['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'WJ Noodles', 'Athena Greek Restaurant', ...]

100
['Greek Islands Restaurant', 'Meli Cafe & Juice Bar', 'Artopolis', 'Athena Greek Restaurant', 'WJ Noodles', 'Zeus Restaurant', 'Green Street Smoked Meats', 'Santorini', 'Mr Greek Gyros', "Philly's Best", 'Primos Chicago Pizza Pasta', 'Monteverde', 'J.P. Graziano Grocery', '9 Muses', 'Sizzling Pot King', 'Sepia', 'Green Street Local', 'High Five Ramen', 'Spectrum Bar and Grill', 'Dawali Jerusalem Kitchen', 'The Allis', "Nando's Peri-Peri", "Lou Mitchell's", 'Jubilee Juice & Grill', "Formento's", 'Taco Burrito King', 'Dine', 'Chicken & Farm Shop', 'H Mart - Chicago', 'La Sardine', 'Loop Juice', "Blaze Fast-Fire'd Pizza", 'The Madison Bar and Kitchen', 'Parlor Pizza Bar', 'M2 Cafe', 'Booze Box', 'El Che Steakhouse & Bar', "Bombacigno's J & C Inn", 'Yolk West Loop', 'Blackwood BBQ', 'Morgan Street Cafe', 'Omakase Yume', "Giordano's", "Vero's Caffe & Gelato", 'Dirty Root', 'Ciao! Cafe & Wine Lounge', "Nonna's Pizza & Sandwiches", 'Bandit', 'Umami Burger - West Loop', 'Slightly Toasted',

Now that we have the metadata on all of the restaurants in Greektown (or at least the ones listed on Yelp), we can retrieve the reviews and ratings. The Yelp API gives us aggregate information on ratings but it doesn't give us the review text or individual users' ratings for a restaurant. For that we need to turn to web scraping, but to find out what pages to scrape we first need to parse our JSON from the API to extract the URLs of the restaurants.

In general, it is a best practice to separate the act of __downloading__ data and __parsing__ data. This ensures that your data processing pipeline is modular and extensible (and autogradable ;). This decoupling also solves the problem of expensive downloading but cheap parsing (in terms of computation and time).
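
A minimal sketch of this separation, under the assumption that raw responses are cached as files on disk: the download step only writes text, and the parse step only reads it, so parsing can be rerun or fixed without hitting the API again. The function names, the file name, and the sample payload are invented for illustration.

```python
import json
import os
import tempfile


def save_raw_response(text, path):
    """Download step (simulated): persist the raw response text unchanged."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def parse_names_from_file(path):
    """Parse step: read a cached response and extract business names."""
    with open(path, "r", encoding="utf-8") as f:
        payload = json.load(f)
    return [business["name"] for business in payload["businesses"]]


# Stand-in for text returned by a real request.
raw_text = '{"total": 2, "businesses": [{"name": "A"}, {"name": "B"}]}'

cache_path = os.path.join(tempfile.mkdtemp(), "search_page_0.json")
save_raw_response(raw_text, cache_path)

names = parse_names_from_file(cache_path)
print(names)  # ['A', 'B']
```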

### Question 2.4: Parse the API Responses and Extract the URLs (7%)

Because we want to separate the __downloading__ from the __parsing__, fill in the following function to parse the URLs pointing to the restaurants on `yelp.com`. As input your function should expect a string of [properly formatted JSON](http://www.json.org/) (which is similar to __BUT__ not the same as a Python dictionary) and as output should return a Python list of strings. Hint: print your `data` to see the JSON-formatted information you have. The input JSON will be structured as follows (same as the [sample](https://www.yelp.com/developers/documentation/v3/business_search) on the Yelp API page):

```json
{
  "total": 8228,
  "businesses": [
    {
      "rating": 4,
      "price": "$",
      "phone": "+14152520800",
      "id": "four-barrel-coffee-san-francisco",
      "is_closed": false,
      "categories": [
        {
          "alias": "coffee",
          "title": "Coffee & Tea"
        }
      ],
      "review_count": 1738,
      "name": "Four Barrel Coffee",
      "url": "https://www.yelp.com/biz/four-barrel-coffee-san-francisco",
      "coordinates": {
        "latitude": 37.7670169511878,
        "longitude": -122.42184275
      },
      "image_url": "http://s3-media2.fl.yelpcdn.com/bphoto/MmgtASP3l_t4tPCL1iAsCg/o.jpg",
      "location": {
        "city": "San Francisco",
        "country": "US",
        "address2": "",
        "address3": "",
        "state": "CA",
        "address1": "375 Valencia St",
        "zip_code": "94103"
      },
      "distance": 1604.23,
      "transactions": ["pickup", "delivery"]
    }
  ],
  "region": {
    "center": {
      "latitude": 37.767413217936834,
      "longitude": -122.42820739746094
    }
  }
}
```
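
To illustrate how JSON text differs from a Python dictionary, the sketch below parses a small made-up record (not taken from a real response) with the standard `json` module. JSON is a string until it is parsed, and its `false` and `null` literals become `False` and `None` in Python.

```python
import json

# A JSON document is text; note the lowercase `false` and `null`.
json_text = '{"name": "Four Barrel Coffee", "is_closed": false, "price": null}'

record = json.loads(json_text)  # text -> Python dict
print(type(json_text).__name__, type(record).__name__)  # str dict
print(record["is_closed"], record["price"])             # False None

# Going the other way turns the dict back into JSON text.
round_trip = json.dumps(record)
print(round_trip)
```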

In [677]:
# 4% credit
def parse_api_response(data):
    """
    Parse Yelp API results to extract restaurant URLs.
    
    Args:
        data (string): String of properly formatted JSON.

    Returns:
        (list): list of URLs as strings from the input JSON.
    """
    data = json.loads(data)
    all_business = list(data["businesses"])
    return list(map(lambda x:x['url'],all_business))
    

# 3% credit    
url, headers, url_params = location_search_params(api_key, "Chicago", offset=0)
response_text = json.dumps(api_get_request(url,headers,url_params))
parse_api_response(response_text)
# ['https://www.yelp.com/biz/girl-and-the-goat-chicago?adjust_creative=ioqEYAcUhZO272qCIvxcVA....',
#  'https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2?adjust_creative=ioqEYAcUhZO272qCIvxcVA...',..]

['https://www.yelp.com/biz/girl-and-the-goat-chicago?adjust_creative=sgQKVASka6HfWxkf9wH9iA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=sgQKVASka6HfWxkf9wH9iA',
 'https://www.yelp.com/biz/wildberry-pancakes-and-cafe-chicago-2?adjust_creative=sgQKVASka6HfWxkf9wH9iA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=sgQKVASka6HfWxkf9wH9iA',
 'https://www.yelp.com/biz/au-cheval-chicago?adjust_creative=sgQKVASka6HfWxkf9wH9iA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=sgQKVASka6HfWxkf9wH9iA',
 'https://www.yelp.com/biz/the-purple-pig-chicago?adjust_creative=sgQKVASka6HfWxkf9wH9iA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=sgQKVASka6HfWxkf9wH9iA',
 'https://www.yelp.com/biz/lou-malnatis-pizzeria-chicago?adjust_creative=sgQKVASka6HfWxkf9wH9iA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=sgQKVASka6HfWxkf9wH9iA',
 'https://www.yelp.com/biz/art-institute-of-chicago-chicago-

As we can see, JSON is quite trivial to parse (which is not the case with HTML, as we will see in a second) and work with programmatically. This is why it is one of the most ubiquitous data serialization formats (especially for RESTful APIs) and a huge benefit of working with a well-defined API if one exists. But APIs do not always exist or provide the data we might need, and as a last resort we can always scrape web pages...

## Working with Web Pages (and HTML)

Think of APIs as similar to accessing an application's database itself (something you can interactively query and receive structured data back). But the results are usually in a somewhat raw form with no formatting or visual representation (like the results from a database query). This is a benefit _AND_ a drawback depending on the end use case. For data science and _programmatic_ analysis this raw form is quite ideal, but for an end user requesting information from a _graphical interface_ (like a web browser) this is very far from ideal since it takes some cognitive overhead to interpret the raw information. And vice versa, if we have HTML it is quite easy for a human to visually interpret it, but to try to perform some type of programmatic analysis we first need to parse the HTML into a more structured form.

> As a general rule of thumb, if the data you need can be accessed or retrieved in a structured form (either from a bulk download or an API), prefer that first. But if the data you want (and need) is not available that way, as in our case, you need to resort to alternative (messier) means.

Going back to the "hello world" example of question 2.0 with the NYT, we will do something similar to retrieve the HTML of the Yelp site itself (rather than going through the API) programmatically as text. 

### Question 2.5: Parse a Yelp restaurant Page (4%)

Using `BeautifulSoup`, parse the HTML of a single Yelp restaurant page to extract the reviews in a structured form as well as the URL of the next page of reviews (or `None` if it is the last page). Fill in the following function stub to parse a single page of reviews and return:
* the reviews as a structured Python dictionary
* the HTML element containing the link/url for the next page of reviews (or None).

For each review be sure to structure your Python dictionary as follows (to be graded correctly). The order of the keys doesn't matter, only the keys and the data type of the values:

```python
{
    'author': str,
    'rating': float,
    'date': str,  # formatted as 'yyyy-mm-dd'
    'description': str
}

# Example
{
    'author': 'Topsy Kretts',
    'rating': 4.7,
    'date': '2016-01-23',
    'description': "Wonderful!"
}
```

Beautiful Soup can behave differently depending on the parser it uses. For maximum compatibility (and fewest errors), initialize the library with the Python standard library parser: `BeautifulSoup(markup, "html.parser")`.
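
To see what the standard library parser does on its own, here is a small sketch that uses `html.parser.HTMLParser` directly, without Beautiful Soup, to collect `itemprop`/`content` pairs from `<meta>` tags. The HTML snippet is invented for illustration and only imitates the kind of markup a review page might contain; the class name `MetaItempropCollector` is hypothetical. In the assignment itself you would still use `BeautifulSoup` as described above.

```python
from html.parser import HTMLParser


class MetaItempropCollector(HTMLParser):
    """Collect (itemprop, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) tuples
        if "itemprop" in attr_map and "content" in attr_map:
            self.pairs.append((attr_map["itemprop"], attr_map["content"]))


# Invented sample markup, not copied from a real page.
sample_html = """
<div itemprop="review">
  <meta itemprop="author" content="Topsy Kretts">
  <meta itemprop="ratingValue" content="4.0">
  <meta itemprop="datePublished" content="2016-01-23">
  <p itemprop="description">Wonderful!</p>
</div>
"""

collector = MetaItempropCollector()
collector.feed(sample_html)
print(collector.pairs)
# [('author', 'Topsy Kretts'), ('ratingValue', '4.0'), ('datePublished', '2016-01-23')]
```

Note that the attribute values arrive as strings, which is why a rating has to be converted with `float(...)` before it matches the required dictionary structure.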

Most of the function has been provided to you:

In [746]:
# 4% credit
def parse_page(html):
    """
    Parse the reviews on a single page of a restaurant.
    
    Args:
        html (string): String of HTML corresponding to a Yelp restaurant

    Returns:
        tuple(list, string): a tuple of two elements
            first element: list of dictionaries corresponding to the extracted review information
            second element: URL for the next page of reviews (or None if it is the last page)
    """
    soup = BeautifulSoup(html,'html.parser')
    url_next = soup.find('link',rel='next')
    if url_next:
        url_next = url_next.get('href')
    else:
        url_next = None

    reviews = soup.find_all('div', itemprop="review")
    reviews_list = []
    # HINT: print reviews to see which HTML tags to extract
    for r in reviews:
        # Build a fresh dictionary for each review
        reviews_list.append({
            'author': r.find('meta', itemprop="author")['content'],
            'rating': float(r.find('meta', itemprop="ratingValue")['content']),
            'date': r.find('meta', itemprop="datePublished")['content'],
            'description': r.find('p', itemprop="description").text,
        })
    return reviews_list, url_next

In [750]:
#del
d = retrieve_html('https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=220')    

r,s = parse_page(d[1])

In [753]:
r,s = parse_page(d[1])
r

[{'author': 'Vegus C.',
  'rating': '5.0',
  'date': '2017-01-08',
  'description': "I absolutely love every time I come into this restaurant growing up in the Puerto Rican household I've always appreciated the mini great Puerto Rican food I was born and raised in New York and now I live in Chicago and I never imagined that I would find a place to call home I love everything from the empanadas jibarito sandwiches and mofongos & maduros if you ever in the Pilsen neighborhood in Chicago stopping and try the bass authentic Puerto Rican food in Chicago and make sure you have a horchata\n"},
 {'author': 'Rebecca R.',
  'rating': '5.0',
  'date': '2017-05-16',
  'description': 'This place is phenomenal!  I often work nearby and I order take out whenever possible.  The plantains are perfect. The steak sandwich is their best menu item.  Perfectly cooked in garlic and onions and put in the middle of plantain patties.  You simply cannot go wrong with anything on the menu at Jibarito Stop.\n'},
 

### Question 2.6: Extract all Yelp reviews for a Single Restaurant (7%)

So now that we have parsed a single page, and figured out a method to go from one page to the next we are ready to combine these two techniques and actually crawl through web pages! 

Using `requests`, programmatically retrieve __ALL__ of the reviews for a __single__ restaurant (provided as a parameter). Just like the API was paginated, the HTML paginates its reviews (it would be a very long web page to show 300 reviews on a single page), and to get all the reviews you will need to parse and traverse the HTML. As input your function will receive a URL corresponding to a Yelp restaurant. As output return a list of dictionaries (structured the same as in Question 2.5) containing the relevant information from the reviews. You can use `parse_page()` here.

In [776]:
# 4% credits
def extract_reviews(url, html_fetcher):
    """
    Retrieve ALL of the reviews for a single restaurant on Yelp.

    Parameters:
        url (string): Yelp URL corresponding to the restaurant of interest.
        html_fetcher (function): A function that takes url and returns html status code and content
        

    Returns:
        reviews (list): list of dictionaries containing extracted review information
    """
    reviews = []
    next_url = url
    # Follow the "next page" links until parse_page reports there are none left
    while next_url is not None:
        # html_fetcher returns (status_code, html); only the HTML is needed here
        _, html = html_fetcher(next_url)
        page_reviews, next_url = parse_page(html)
        reviews.extend(page_reviews)
    return reviews

You can test your function with this code:

In [777]:
# 3% credits
data = extract_reviews('https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=220', html_fetcher=retrieve_html)
print(len(data))
# 40
print(data[0])
# {'author': 'Betsy F.', 'rating': '5.0', 'date': '2016-10-01', 'description': "Authentic, incredible ... " }

43
{'author': 'Vegus C.', 'rating': '5.0', 'date': '2017-01-08', 'description': "I absolutely love every time I come into this restaurant growing up in the Puerto Rican household I've always appreciated the mini great Puerto Rican food I was born and raised in New York and now I live in Chicago and I never imagined that I would find a place to call home I love everything from the empanadas jibarito sandwiches and mofongos & maduros if you ever in the Pilsen neighborhood in Chicago stopping and try the bass authentic Puerto Rican food in Chicago and make sure you have a horchata\n"}


# Submission

You're almost done! 

After executing all commands and completing this notebook, save your *hw1.ipynb* as a pdf file and upload it to Gradescope under *Homework 1 (written)*. Make sure you check that your pdf file includes all parts of your solution **(including the outputs)**. We recommend using the browser (not jupyter) for saving the pdf. For Chrome on a Mac, this is under *File->Print...->Open PDF in Preview* and when the PDF opens in Preview you can use *Save...* to save it. This part will be graded based on completion (having executed the code and showing the output) and it constitutes *60%* of HW 1.

Next, you need to copy the functions from Questions 1.1 and 1.2 into the corresponding functions in *hw1part1.py*. Similarly, you need to copy the functions from Questions 2.1, 2.2, 2.3, 2.4, 2.5 and 2.6 into the corresponding functions in *hw1part2.py*. Place your files *hw1part1.py*, *hw1part2.py*, and *hw1.ipynb* in a zip file and upload the zip file to Gradescope under *Homework 1 - (code)*. This part constitutes *40%* of HW 1. In order to get full points for this part, you need to pass all test cases that we will run against your *hw1part1.py* and *hw1part2.py* (and not the notebook) on Gradescope. To check whether your code runs locally, run the four tests in *tests_sample_part1* from your command line: 

`(cs418env) elena-macbook:hw1 elena$ python run_tests_sample.py part1`

You should see the following output:

```
....
----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK
```

Feel free to add more tests that check all parts of your code.

Similarly, you can run sample tests for part2 as follows:

`(cs418env) elena-macbook:hw1 elena$ python run_tests_sample.py part2`

You can submit to Gradescope as many times as you would like. We will only consider your last submission. If your last submission is after the deadline, the late homework policy applies.

After submitting the zip file, the autograder will run. You should see the following on your screen after the autograder finishes the execution:

<img src="correct.png" align="left" float="left"/>

This indicates that all the tests ran successfully on the server, and you're done! If your tests fail, you can debug your program locally by comparing the input, output and expected output (as shown for the first two test cases). Make sure `hw1part1.py`, `hw1part2.py` and `hw1.ipynb` are at the root of the zip file. This means you need to zip those files, not the folder containing them.