# Lecture Three - June 6th, 2017

* Python Dictionaries homework assignment
* Data Cleaner, Data Carpenter, Data Janitor?
* Dive into Pandas

## Python Dictionaries Homework

Write a program that counts the number of emails people wrote in `mbox-short.txt`.
**Extra Credit:** Do the same with days of the week.

Think about breaking this problem into a set of steps. 
* First, loop through the `mbox-short.txt` file line by line looking for lines with email addresses and days of the week.
* Second, split the string into parts and slice out the email addresses and days of the week.
* Third, adapt the "Dictionary as a set of counters" code from the book, but instead of counting words you are counting email addresses (and then days of the week). 

*Hint*: The lines you want to parse from the mbox file look like:
```
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
```
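
As a quick sketch of the second step, here is how the hint line above splits into parts. Calling `split()` with no argument collapses the double space before the day number. The variable names are illustrative only.

```python
# The sample line from the hint above
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# split() with no argument splits on any run of whitespace,
# so the double space before "5" does not produce an empty string
parts = line.split()

email = parts[1]        # second item is the email address
day_of_week = parts[2]  # third item is the day of the week

print(email)
print(day_of_week)
```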



In [None]:
# "Search" function takes 2 arguments:
# "n" is an integer corresponding to an element's position in a line (space separated) 
# "term" is a unique string that identifies lines to grab within mbox-short.txt
def search(n, term='From '):
    d = dict()
    with open('mbox-short.txt') as fhand:
        for line in fhand:
            # Only lines that begin with the term hold the address and date
            if line.startswith(term):
                # split() with no argument handles the double space in "Jan  5"
                e = line.split()
                # Add one to the count for this key, starting from zero
                d[e[n]] = d.get(e[n], 0) + 1
    return d  # Return dictionary with counts

# Run search(i) for i=1 (email) and i=2 (day of week) and print results
for i in [1, 2]:
    print(search(i), '\n')

## Data Cleaner, Data Carpenter, Data Janitor?

* Let's discuss the readings
* What kind of work were these articles talking about? Can you give some examples?
* What do you think about this role? Is this work a librarian could or should do?


## Dive into Pandas


* Pandas is a third party library for doing data analysis
* It is a foundational component of Python data science
* Developed by Wes McKinney while he worked in the finance industry, but now used far beyond it
* Vanilla Python can do many of the same things, but Pandas is *faster*
* The core of Pandas is its data structures

### Pandas Data Structures

* To understand Pandas, which is hard, you need to start with three data structures
    * Series - For one dimensional data
    * Dataframe - For two dimensional data
    * Index - For naming, selecting, and transforming data within a Pandas Series or Dataframe 

### Series

* A one-dimensional array of indexed data (like a list)
* Kind of like a blend of a Python list and dictionary
* You can create them from a Python list

### Dataframe

* A two-dimensional matrix of indexed data (like a spreadsheet)

### Index

* Used for Series and for both the rows and the columns of DataFrames


In [1]:
import pandas as pd

In [2]:
my_list = [0.25, 0.5, 0.75, 1.0]
data = pd.Series(my_list)
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

* You can index a Series just like a list
* Use index notation to grab the 2nd element of `data`

In [5]:
# hint: the 2nd element is 0.50
# your code here
print(my_list[1])
print(data[1])


0.5
0.5


* You can also slice a Series
* Use slices to grab the 2nd and 3rd elements of this series

In [13]:
# hint: the 2nd & 3rd elements are 0.50 and 0.75
# your code here
print(my_list[1:3])
print(data[1:3])


[0.5, 0.75]
1    0.50
2    0.75
dtype: float64


* A Series also acts like a Python dictionary, but an *ordered* one

In [14]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

* You can use indexing and slicing like above, but now with keys instead of numbers!


In [None]:
population['California']

* Like a Python dictionary, a Series is a list of key/value pairs
* But these are *ordered*, which means you can do slicing
* Try slicing this series, but with keys instead of numbers!

__Note: Use `.loc[]` and `.iloc[]` because they are explicit: `.loc` selects by label and `.iloc` selects by integer position, so their behavior is documented and predictable. Plain square brackets still work, though.__

In [18]:
# Hint: Use the same : notation, but use the state names listed above
# Your code here:
print(population.loc['California':'Florida'])
print(population.iloc[0:1])


California    38332521
Florida       19552860
dtype: int64
California    38332521
dtype: int64
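
One reason to prefer the explicit indexers: when the index itself is made of integers, it is easy to confuse labels with positions. The sketch below uses a small made-up Series (not part of the lecture data) to show the difference, and to show that `.loc` slices include the end label while `.iloc` slices stop before the end position.

```python
import pandas as pd

# A made-up Series whose labels are integers that do NOT start at 0
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# .loc uses the labels; the slice includes the end label 3
by_label = s.loc[1:3]

# .iloc uses positions; the slice stops before position 3
by_position = s.iloc[1:3]

print(by_label.tolist())     # ['a', 'b']
print(by_position.tolist())  # ['b', 'c']
```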


* There are a couple of ways of creating `Series` objects

In [19]:
# From a list with an implicit index
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [20]:
# From a list with an *explicit* index
pd.Series([2, 4, 6], index=['a','b','c'])

a    2
b    4
c    6
dtype: int64

In [21]:
# From a dictionary, so the keys become the index
# (this version of Pandas sorted them by key; newer versions keep the dictionary's order)
pd.Series({2:'a', 1:'b', 3:'c'})

1    b
2    a
3    c
dtype: object

### DataFrame

* `DataFrames` are the real workhorse of Pandas and Python Data Science
* We will be spending a lot of time with data inside of Dataframes, so buckle up!
* `DataFrames` contain two-dimensional data, just like an Excel spreadsheet
* In practice, a `DataFrame` is a bunch of `Series` lined up next to each other
    * Pandas lines them up by their shared keys (the index) and fills any gaps with `NaN`

In [22]:
# Start with our population Series
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [23]:
# Then create an area Series
area_dict = {'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

In [34]:
# Now mash them together into a DataFrame
states = pd.DataFrame({'population': population,
                       'area': area})
print(states)

states = pd.DataFrame({'population': population,
                       'area': area,
                       'test':{'California': None,
                               'california': 90210}})

print(states)
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193
              area  population   test
California  423967    38332521    NaN
Florida     170312    19552860    NaN
Illinois    149995    12882135    NaN
New York    141297    19651127    NaN
Texas       695662    26448193    NaN
california     NaN         NaN  90210


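|            | area     | population | test    |
|------------|----------|------------|---------|
| California | 423967.0 | 38332521.0 | NaN     |
| Florida    | 170312.0 | 19552860.0 | NaN     |
| Illinois   | 149995.0 | 12882135.0 | NaN     |
| New York   | 141297.0 | 19651127.0 | NaN     |
| Texas      | 695662.0 | 26448193.0 | NaN     |
| california | NaN      | NaN        | 90210.0 |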


* Pandas automatically lines everything up because they have shared index values

In [35]:
print(area.index)
print(population.index)
print(states.index)

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas', 'california'], dtype='object')
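
To see the alignment at work, here is a minimal sketch with two small made-up Series whose labels are in different orders and only partly overlap. Pandas matches values by label, not by position, and fills the gaps with `NaN`.

```python
import pandas as pd

# Two made-up Series with labels in different orders; 'c' and 'd' are not shared
left = pd.Series({'a': 1, 'b': 2, 'c': 3})
right = pd.Series({'b': 20, 'a': 10, 'd': 40})

# Values are matched by label when the DataFrame is built
df = pd.DataFrame({'left': left, 'right': right})
print(df)

# Row 'a' pairs 1 with 10 even though the positions differ
print(df.loc['a'].tolist())
```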


* A `DataFrame` actually has two indexes
* One for the rows (as seen above)
* And another for the columns

In [36]:
states.columns

Index(['area', 'population', 'test'], dtype='object')

## Indexes

* Pandas `Series` and `DataFrames` are containers for data
* Indexes (and indexing) are the mechanism that makes that data retrievable
* In a `Series` the index holds the key for each value
* In a `DataFrame` there is one index for the column headers and another for the row headers
* Indexing allows you to merge or join disparate datasets together
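
As a rough illustration of that last point, the sketch below joins two small tables on their shared index. The table and variable names are hypothetical; `DataFrame.join` matches rows by index label.

```python
import pandas as pd

# Two small tables that describe the same states, indexed by state name
pop_table = pd.DataFrame({'population': [38332521, 26448193]},
                         index=['California', 'Texas'])
capital_table = pd.DataFrame({'capital': ['Austin', 'Sacramento']},
                             index=['Texas', 'California'])

# join() lines rows up by index label, regardless of row order
combined = pop_table.join(capital_table)
print(combined)
```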

## Coding Exercise 

Let's write a script that parses information out of the `mbox-short.txt` file and puts it into a Pandas DataFrame.

* Start with the python dictionaries homework assignment.
* Parse every piece of information into a dictionary
* Aggregate all of those dictionaries into a list
* Create a Pandas DataFrame from that list of dictionaries


Transform this:
```
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
```
into this:
```
{'year': '2008', 'month': 'Jan', 'dayofweek': 'Sat', 'address': 'stephen.marquard@uct.ac.za', 'day': '5', 'time': '09:14:16'}
```
and then into this:
```
address      stephen.marquard@uct.ac.za
day                                   5
dayofweek                           Sat
month                               Jan
time                           09:14:16
year                               2008
Name: 0, dtype: object
```
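
The three stages above can be sketched on the single sample line. This is one possible approach, not the only one; the key names follow the dictionary shown above and the variable names are illustrative.

```python
import pandas as pd

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Stage 1: split the raw line into its whitespace-separated parts
parts = line.split()

# Stage 2: label each part by building a dictionary
record = {'address': parts[1], 'dayofweek': parts[2], 'month': parts[3],
          'day': parts[4], 'time': parts[5], 'year': parts[6]}

# Stage 3: a list of such dictionaries becomes a DataFrame, one row per dictionary
records = [record]
emails = pd.DataFrame(records)

# The first row is a Series, like the one shown above
print(emails.iloc[0])
```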


* Download the data manually with [this link](http://www.py4e.com/code3/mbox-short.txt) or run the cell below if you are on JupyterHub

In [68]:
!wget http://www.py4e.com/code3/mbox.txt

--2017-06-06 14:37:16--  http://www.py4e.com/code3/mbox.txt
Resolving www.py4e.com (www.py4e.com)... 104.27.159.166, 104.27.158.166, 2400:cb00:2048:1::681b:9fa6, ...
Connecting to www.py4e.com (www.py4e.com)|104.27.159.166|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.py4e.com/code3/mbox.txt [following]
--2017-06-06 14:37:16--  https://www.py4e.com/code3/mbox.txt
Connecting to www.py4e.com (www.py4e.com)|104.27.159.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘mbox.txt’

mbox.txt                [               <=>  ]   6.38M  2.21MB/s    in 2.9s    

2017-06-06 14:37:19 (2.21 MB/s) - ‘mbox.txt’ saved [6687002]



In [None]:
!head -n 10 mbox-short.txt

In [71]:
# Put email parsing code here

mbox_lists = []
with open('mbox-short.txt') as mbox:
    for line in mbox:
        # Lines that begin with 'From ' hold the address and timestamp
        if line.startswith('From '):
            row = line.split()
            # row[0] is the word 'From'; keep address, dayofweek, month,
            # day, time, and year (row[1] through row[6])
            mbox_lists.append(row[1:])
            

            
parsed_email_data = pd.DataFrame(mbox_lists, columns=['address',
                                                      'dayofweek',
                                                      'month',
                                                      'day',
                                                      'time',
                                                      'year'])

parsed_email_data


|   | address                    | dayofweek | month | day | time     | year |
|---|----------------------------|-----------|-------|-----|----------|------|
| 0 | stephen.marquard@uct.ac.za | Sat       | Jan   | 5   | 09:14:16 | 2008 |
| 1 | louis@media.berkeley.edu   | Fri       | Jan   | 4   | 18:10:48 | 2008 |
| 2 | zqian@umich.edu            | Fri       | Jan   | 4   | 16:10:39 | 2008 |
| 3 | rjlowe@iupui.edu           | Fri       | Jan   | 4   | 15:46:24 | 2008 |
| 4 | zqian@umich.edu            | Fri       | Jan   | 4   | 15:03:18 | 2008 |
| 5 | rjlowe@iupui.edu           | Fri       | Jan   | 4   | 14:50:18 | 2008 |
| 6 | cwen@iupui.edu             | Fri       | Jan   | 4   | 11:37:30 | 2008 |
| 7 | cwen@iupui.edu             | Fri       | Jan   | 4   | 11:35:08 | 2008 |
| 8 | gsilver@umich.edu          | Fri       | Jan   | 4   | 11:12:37 | 2008 |
| 9 | gsilver@umich.edu          | Fri       | Jan   | 4   | 11:11:52 | 2008 |


In [70]:
# writing the DataFrame to a CSV file
parsed_email_data.to_csv("parsed-emails.csv", index=False)

## Doing Stuff with Pandas

* Once your data is in a Pandas `DataFrame` you can easily use a ton of analytical tools
* You just have to get your data to fit into a dataframe
* Getting data to fit is a big part of the "data janitor" work...it is the craft of data carpentry
* However, as we will see, there is still a lot of carpentry work to do once your data fits into a `DataFrame`

### Here's the CSV reader!

In [57]:
# Read the parsed emails back in from the CSV file
parsed_email_data = pd.read_csv("parsed-emails.csv")
parsed_email_data

|   | address                    | dayofweek | month | day | time     | year |
|---|----------------------------|-----------|-------|-----|----------|------|
| 0 | stephen.marquard@uct.ac.za | Sat       | Jan   | 5   | 09:14:16 | 2008 |
| 1 | louis@media.berkeley.edu   | Fri       | Jan   | 4   | 18:10:48 | 2008 |
| 2 | zqian@umich.edu            | Fri       | Jan   | 4   | 16:10:39 | 2008 |
| 3 | rjlowe@iupui.edu           | Fri       | Jan   | 4   | 15:46:24 | 2008 |
| 4 | zqian@umich.edu            | Fri       | Jan   | 4   | 15:03:18 | 2008 |
| 5 | rjlowe@iupui.edu           | Fri       | Jan   | 4   | 14:50:18 | 2008 |
| 6 | cwen@iupui.edu             | Fri       | Jan   | 4   | 11:37:30 | 2008 |
| 7 | cwen@iupui.edu             | Fri       | Jan   | 4   | 11:35:08 | 2008 |
| 8 | gsilver@umich.edu          | Fri       | Jan   | 4   | 11:12:37 | 2008 |
| 9 | gsilver@umich.edu          | Fri       | Jan   | 4   | 11:11:52 | 2008 |


* This dataframe allows us to ask questions of the data, if you know how to ask.
* `value_counts()` is a `Series` method that counts how many times each unique value appears.
* First we need to extract the column we want

In [58]:
parsed_email_data['dayofweek']

0     Sat
1     Fri
2     Fri
3     Fri
4     Fri
5     Fri
6     Fri
7     Fri
8     Fri
9     Fri
10    Fri
11    Fri
12    Fri
13    Fri
14    Fri
15    Fri
16    Fri
17    Fri
18    Fri
19    Fri
20    Fri
21    Thu
22    Thu
23    Thu
24    Thu
25    Thu
26    Thu
Name: dayofweek, dtype: object

In [59]:
parsed_email_data['dayofweek'].value_counts()

Fri    20
Thu     6
Sat     1
Name: dayofweek, dtype: int64

#### Exercise

* Answer the following questions: 
    * Who wrote the most emails?
    * What month is the most popular?
    * How many emails per year?

In [61]:
# Who wrote the most emails?
parsed_email_data['address'].value_counts()

cwen@iupui.edu                   5
zqian@umich.edu                  4
david.horwitz@uct.ac.za          4
louis@media.berkeley.edu         3
gsilver@umich.edu                3
rjlowe@iupui.edu                 2
stephen.marquard@uct.ac.za       2
gopal.ramasammycook@gmail.com    1
ray@media.berkeley.edu           1
wagnermr@iupui.edu               1
antranig@caret.cam.ac.uk         1
Name: address, dtype: int64

In [62]:
# What month is the most popular?
parsed_email_data['month'].value_counts()

Jan    27
Name: month, dtype: int64

In [63]:
# How many emails per year?
parsed_email_data['year'].value_counts()

2008    27
Name: year, dtype: int64
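
Because `value_counts()` returns a Series sorted from most to least frequent, the top answer can also be pulled out directly. A small sketch on a made-up column of days:

```python
import pandas as pd

# A made-up column of days of the week
days = pd.Series(['Fri', 'Thu', 'Fri', 'Sat', 'Fri', 'Thu'])

counts = days.value_counts()

# idxmax() gives the label with the highest count, max() gives the count itself
print(counts.idxmax(), counts.max())

# normalize=True reports proportions instead of raw counts
proportions = days.value_counts(normalize=True)
print(proportions)
```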

* What if we wanted to tabulate the number of institutions?
* There are multiple ways of doing this: a Python way and a Pandas way
* First, let's try the Python way

In [64]:
# Start by looping over all the email addresses

# Make the series into a list, for maximum python
email_list = parsed_email_data['address'].tolist()
# loop over the list and print the email
for email in email_list:
    print(email)

stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
gsilver@umich.edu
zqian@umich.edu
gsilver@umich.edu
wagnermr@iupui.edu
zqian@umich.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
louis@media.berkeley.edu
ray@media.berkeley.edu
cwen@iupui.edu
cwen@iupui.edu
cwen@iupui.edu


In [65]:
# Python way of getting institutions

# Make the series into a list, for maximum python
email_list = parsed_email_data['address'].tolist()
# loop over the list and print the email
for email in email_list:
    # split the emails into a list of name and institution
    split_email = email.split("@")
    print(split_email)


['stephen.marquard', 'uct.ac.za']
['louis', 'media.berkeley.edu']
['zqian', 'umich.edu']
['rjlowe', 'iupui.edu']
['zqian', 'umich.edu']
['rjlowe', 'iupui.edu']
['cwen', 'iupui.edu']
['cwen', 'iupui.edu']
['gsilver', 'umich.edu']
['gsilver', 'umich.edu']
['zqian', 'umich.edu']
['gsilver', 'umich.edu']
['wagnermr', 'iupui.edu']
['zqian', 'umich.edu']
['antranig', 'caret.cam.ac.uk']
['gopal.ramasammycook', 'gmail.com']
['david.horwitz', 'uct.ac.za']
['david.horwitz', 'uct.ac.za']
['david.horwitz', 'uct.ac.za']
['david.horwitz', 'uct.ac.za']
['stephen.marquard', 'uct.ac.za']
['louis', 'media.berkeley.edu']
['louis', 'media.berkeley.edu']
['ray', 'media.berkeley.edu']
['cwen', 'iupui.edu']
['cwen', 'iupui.edu']
['cwen', 'iupui.edu']


In [66]:
# Python way of getting institutions

# Make the series into a list, for maximum python
email_list = parsed_email_data['address'].tolist()
# loop over the list and print the email
for email in email_list:
    # split the emails into a list of name and institution
    split_email = email.split("@")
    # print the second item in the list, the institution
    print(split_email[1])

uct.ac.za
media.berkeley.edu
umich.edu
iupui.edu
umich.edu
iupui.edu
iupui.edu
iupui.edu
umich.edu
umich.edu
umich.edu
umich.edu
iupui.edu
umich.edu
caret.cam.ac.uk
gmail.com
uct.ac.za
uct.ac.za
uct.ac.za
uct.ac.za
uct.ac.za
media.berkeley.edu
media.berkeley.edu
media.berkeley.edu
iupui.edu
iupui.edu
iupui.edu


* If the loop fails with an `IndexError`, there must be some bad data in the list (for example, an address with no `@`)
* We'll have to write some code to find the bad data as it is looping

In [None]:
# Python way of getting institutions

for email in parsed_email_data['address']:
    split_email = email.split("@")
    # an address without an "@" splits into fewer than two pieces
    if len(split_email) < 2:
        # print the bad example
        print(email)



In [None]:
# Another way of doing this with error handling

# an empty list to contain the data
institutions = []
# loop over the address, Pandas series behave like lists
for email in parsed_email_data['address']:
    
    # try to parse the email addresses and append just the institution to the list
    try:
        institution = email.split("@")[1]
        institutions.append(institution)
    except (IndexError, AttributeError):
        # IndexError: no "@" in the address; AttributeError: the value is not a string
        print("The bad email is: ", email)

# print the first ten items in the institutions list
print("First ten institutions:")
print(institutions[0:10])

# Then we could write a dictionary counter...but there is a better way!
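
For reference, the plain-Python counter mentioned in that last comment might look like the sketch below. It uses `collections.Counter` from the standard library in place of a hand-written dictionary loop, and a short made-up list stands in for `institutions`.

```python
from collections import Counter

# A short made-up list standing in for the institutions list built above
sample_institutions = ['umich.edu', 'iupui.edu', 'umich.edu', 'uct.ac.za']

# Counter tallies how many times each item appears
institution_counts = Counter(sample_institutions)

# most_common() returns (item, count) pairs, highest count first
for institution, count in institution_counts.most_common():
    print(institution, count)
```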

### Vectorized String Operations

* There is a Pandas way of doing this that is much more terse and compact
* Pandas has a set of string operations that do much of the painful work for you
* Especially handling bad data!

In [None]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

* But like above, this breaks very easily with missing values

In [72]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

Peter
Paul


AttributeError: 'NoneType' object has no attribute 'capitalize'

* The Pandas library has *vectorized string operations* that handle missing data

In [73]:
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [74]:
names.str.capitalize()


0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

* Look ma! No errors!
* Pandas includes a bunch of methods for doing things to strings.

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

#### Exercise

* In the cells below, try three of the string operations listed above on the Pandas Series `monte`
* Remember, you can hit tab to autocomplete and shift-tab to see documentation

In [75]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

In [87]:
# First
monte.str.zfill(20)

0    000000Graham Chapman
1    000000000John Cleese
2    0000000Terry Gilliam
3    00000000000Eric Idle
4    000000000Terry Jones
5    0000000Michael Palin
dtype: object

In [81]:
# Second
monte.str.istitle()

0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

In [83]:
# Third
monte.str.swapcase()

0    gRAHAM cHAPMAN
1       jOHN cLEESE
2     tERRY gILLIAM
3         eRIC iDLE
4       tERRY jONES
5     mICHAEL pALIN
dtype: object

* So now let's try tabulating the number of institutions the Pandas way

In [90]:
# use a vectorized string operation over the email addresses
parsed_email_data['address'].str.split("@")

0        [stephen.marquard, uct.ac.za]
1          [louis, media.berkeley.edu]
2                   [zqian, umich.edu]
3                  [rjlowe, iupui.edu]
4                   [zqian, umich.edu]
5                  [rjlowe, iupui.edu]
6                    [cwen, iupui.edu]
7                    [cwen, iupui.edu]
8                 [gsilver, umich.edu]
9                 [gsilver, umich.edu]
10                  [zqian, umich.edu]
11                [gsilver, umich.edu]
12               [wagnermr, iupui.edu]
13                  [zqian, umich.edu]
14         [antranig, caret.cam.ac.uk]
15    [gopal.ramasammycook, gmail.com]
16          [david.horwitz, uct.ac.za]
17          [david.horwitz, uct.ac.za]
18          [david.horwitz, uct.ac.za]
19          [david.horwitz, uct.ac.za]
20       [stephen.marquard, uct.ac.za]
21         [louis, media.berkeley.edu]
22         [louis, media.berkeley.edu]
23           [ray, media.berkeley.edu]
24                   [cwen, iupui.edu]
25                   [cwe

* Now we have a Series of list objects (you can tell from the square brackets)
* Let's get just the 2nd element of those lists. We can do that with [vectorized item access](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb#Vectorized-item-access-and-slicing)

In [91]:
# get the 2nd element (index 1) of each list with vectorized item access
parsed_email_data['address'].str.split("@").str.get(1)

0              uct.ac.za
1     media.berkeley.edu
2              umich.edu
3              iupui.edu
4              umich.edu
5              iupui.edu
6              iupui.edu
7              iupui.edu
8              umich.edu
9              umich.edu
10             umich.edu
11             umich.edu
12             iupui.edu
13             umich.edu
14       caret.cam.ac.uk
15             gmail.com
16             uct.ac.za
17             uct.ac.za
18             uct.ac.za
19             uct.ac.za
20             uct.ac.za
21    media.berkeley.edu
22    media.berkeley.edu
23    media.berkeley.edu
24             iupui.edu
25             iupui.edu
26             iupui.edu
dtype: object

In [93]:
institutions = parsed_email_data['address'].str.split("@").str.get(1)
institutions.value_counts()

iupui.edu             8
umich.edu             7
uct.ac.za             6
media.berkeley.edu    4
gmail.com             1
caret.cam.ac.uk       1
dtype: int64
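
Another option, sketched here on a few made-up addresses (one missing, one malformed), is to pass `expand=True` so the pieces land in separate columns. Missing or malformed values come through as missing rather than raising an error. The column names are illustrative.

```python
import pandas as pd

# Made-up addresses, including a missing value and one with no "@"
addresses = pd.Series(['zqian@umich.edu', None, 'not-an-email', 'cwen@iupui.edu'])

# expand=True returns a DataFrame with one column per piece
pieces = addresses.str.split('@', expand=True)
pieces.columns = ['user', 'institution']
print(pieces)

# Missing values are skipped by value_counts()
print(pieces['institution'].value_counts())
```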

In [95]:
parsed_email_data['institution'] = parsed_email_data['address'].str.split("@").str.get(1)
parsed_email_data

|   | address                    | dayofweek | month | day | time     | year | institution        |
|---|----------------------------|-----------|-------|-----|----------|------|--------------------|
| 0 | stephen.marquard@uct.ac.za | Sat       | Jan   | 5   | 09:14:16 | 2008 | uct.ac.za          |
| 1 | louis@media.berkeley.edu   | Fri       | Jan   | 4   | 18:10:48 | 2008 | media.berkeley.edu |
| 2 | zqian@umich.edu            | Fri       | Jan   | 4   | 16:10:39 | 2008 | umich.edu          |
| 3 | rjlowe@iupui.edu           | Fri       | Jan   | 4   | 15:46:24 | 2008 | iupui.edu          |
| 4 | zqian@umich.edu            | Fri       | Jan   | 4   | 15:03:18 | 2008 | umich.edu          |
| 5 | rjlowe@iupui.edu           | Fri       | Jan   | 4   | 14:50:18 | 2008 | iupui.edu          |
| 6 | cwen@iupui.edu             | Fri       | Jan   | 4   | 11:37:30 | 2008 | iupui.edu          |
| 7 | cwen@iupui.edu             | Fri       | Jan   | 4   | 11:35:08 | 2008 | iupui.edu          |
| 8 | gsilver@umich.edu          | Fri       | Jan   | 4   | 11:12:37 | 2008 | umich.edu          |
| 9 | gsilver@umich.edu          | Fri       | Jan   | 4   | 11:11:52 | 2008 | umich.edu          |


## Example: Recipe Database

* Let's walk through the recipe database example from the Python Data Science Handbook
* There are a few concepts and commands I haven't yet covered, but I'll explain them as I go along
* Download the recipe file from [this link](https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz) or run the cell below if you are on JupyterHub

In [None]:
# download the recipe file from the internet
!wget https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz
# unzip the file
!gunzip -f 20170107-061401-recipeitems.json.gz

* The recipe database is stored in the JSON file format
* JSON looks like this

In [None]:
# display the first line of the file
!head -n 1 20170107-061401-recipeitems.json

* This is JSON, a structured data format like CSV or XML
* It looks like gobbledygook, but there are patterns. What Python data structure does it look like?
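
One way to explore that question is to parse a single JSON line with the standard library's `json` module. The line below is a made-up, shortened record, not an actual row from the recipe file.

```python
import json

# A made-up JSON line in the same general shape as a recipe record
json_line = '{"name": "Pancakes", "ingredients": "flour\\neggs\\nmilk", "description": "A quick breakfast"}'

# json.loads turns the JSON text into a Python object
record = json.loads(json_line)

print(type(record))
print(record['name'])
print(record['ingredients'].split('\n'))
```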


In [None]:
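# StringIO lets Pandas read a string as if it were a file
from io import StringIO

# read the entire file into one JSON string
with open('20170107-061401-recipeitems.json', 'r') as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Join the lines with commas and wrap them in [ ] to form a single JSON array
    data_json = "[{0}]".format(','.join(data))
# read the result as JSON; newer versions of Pandas expect a file-like object
recipes = pd.read_json(StringIO(data_json))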
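
Pandas can also read this one-record-per-line layout directly with `lines=True`. The sketch below demonstrates the idea on two made-up lines held in memory; with the real file you would pass the filename instead, on the assumption that every line is a complete JSON record.

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line
json_lines = ('{"name": "Pancakes", "description": "A quick breakfast"}\n'
              '{"name": "Toast", "description": "Even quicker"}\n')

# lines=True tells read_json to treat each line as a separate record
sample = pd.read_json(io.StringIO(json_lines), lines=True)
print(sample.shape)
print(sample)
```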

In [None]:
recipes.shape

We see there are nearly 200,000 recipes, and 17 columns.
Let's take a look at one row to see what we have:

In [None]:
# display the first item in the DataFrame
recipes.iloc[0]

In [None]:
# Show the first five items in the DataFrame
recipes.head()

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
Let's start by taking a closer look at the ingredients:

In [None]:
# Summarize the length of the ingredients string
recipes['ingredients'].str.len().describe()

In [None]:
# which row has the longest ingredients string
recipes['ingredients'].str.len().idxmax()

In [None]:
# use iloc to fetch that specific row from the dataframe
recipes.iloc[135598]

In [None]:
# look at the ingredients string
recipes.iloc[135598]['ingredients']
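
Rather than typing the row number by hand, the result of `idxmax()` can be stored and reused. Note that `idxmax()` returns an index *label*, so `.loc` is the matching indexer. A sketch on a small made-up table:

```python
import pandas as pd

# A made-up stand-in for the recipes table
toy = pd.DataFrame({'name': ['Toast', 'Stew', 'Salad'],
                    'ingredients': ['bread',
                                    'beef\ncarrots\nonions\nstock',
                                    'lettuce\noil']})

# Label of the row with the longest ingredients string
longest = toy['ingredients'].str.len().idxmax()

# .loc looks the row up by that label
print(toy.loc[longest, 'name'])
```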

* WOW! That is a lot of ingredients! That might need to be cleaned by hand rather than by machine
* What other questions can we ask of the recipe data?

In [None]:
# How many breakfasts?
recipes.description.str.contains('[Bb]reakfast').sum()

In [None]:
# How many have cinnamon as an ingredient?
recipes.ingredients.str.contains('[Cc]innamon').sum()

In [None]:
# How many misspell cinnamon as cinamon?
recipes.ingredients.str.contains('[Cc]inamon').sum()
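
The `[Cc]` pattern handles the first letter only. As an alternative sketch, `str.contains` accepts `case=False` to ignore case entirely and `na=False` to count missing values as non-matches. The Series below is made up for illustration.

```python
import pandas as pd

# Made-up ingredient strings, including a missing value
ingredients = pd.Series(['1 tsp Cinnamon', 'ground CINNAMON', None, '2 eggs'])

# case=False ignores capitalization; na=False treats missing values as "no match"
has_cinnamon = ingredients.str.contains('cinnamon', case=False, na=False)

print(has_cinnamon.sum())
```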