# City of Chicago Data Set

### Builtin Superheroes (Screencast)

Taken from David Beazley's [presentation](https://www.youtube.com/watch?v=j6VSAsKAj98)

To get the Food Inspections data file, use [wget](https://linux.die.net/man/1/wget)

    wget -c "https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.csv?accessType=DOWNLOAD" -O Food_Inspections.csv

or alternatively use [curl](https://linux.die.net/man/1/curl)

    curl "https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.csv?accessType=DOWNLOAD" -o Food_Inspections.csv

__Sorry, I don't use Windows, so you'll have to work out how to download the file there yourself.__

***

### Set Up
Import the required libraries and load the data into memory.
Each row in the file becomes an element in a list.

We achieve this by using Python's built-in csv module.

First we open the file with the __open__ function.
The DictReader class then reads each row as a dict, using the
first row in the file as the keys. Alternatively, we could
pass the __fieldnames__ arg when creating the DictReader
instance to supply the keys ourselves instead of taking them
from the first line.

Using the __list__ function we convert the DictReader iterator to a list of dictionaries.
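As a rough sketch of the two ways of choosing keys, the snippet below reads a tiny made-up CSV from a string instead of the real file. The sample text and the key names `name` and `result` are assumptions for illustration only.

```python
import csv
import io

# Made-up sample data standing in for the real CSV file.
sample = "DBA Name,Results\nCAFE ONE,Pass\nCAFE TWO,Fail\n"

# Default behaviour: the first line supplies the keys.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["DBA Name"])

# With fieldnames given, every line (including the header line) is read as data.
renamed = list(csv.DictReader(io.StringIO(sample), fieldnames=["name", "result"]))
print(renamed[0])          # the old header line, now an ordinary row
print(renamed[1]["name"])
```

Note that when __fieldnames__ is supplied for a file that still has a header line, that header line shows up as the first row and usually needs to be skipped.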

What's a dictionary?

Dictionaries are Python data structures that are known in other languages as associative arrays or hash maps. A dictionary consists of a collection of key-value pairs. Each key-value pair maps the key to its associated value.
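A minimal illustration of key-value lookup, using a made-up row with only two of the dataset's columns:

```python
# A made-up row; the real rows have many more keys.
row = {"DBA Name": "CAFE ONE", "Results": "Pass"}

result = row["Results"]          # look up a value by its key
missing = row.get("Zip")         # .get() returns None when the key is absent
has_results = "Results" in row   # membership tests look at the keys
print(result, missing, has_results)
```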


N.B. **This is a very inefficient way of representing data in Python**. A better alternative
would be [pandas](https://pandas.pydata.org/).


In [None]:
import csv
from collections import Counter, defaultdict
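# Open the file in a with block so it is closed once the rows are loaded
with open('Food_Inspections.csv', newline='') as f:
    food = list(csv.DictReader(f))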


How many items in the food list (rows in the dataset)?

The **len()** function is a Python builtin function. It returns the number of elements/items in a collection.

In [None]:
len(food)

What are the contents of the first row?
List indices start at 0, so the first row is at index 0.

The published records contain data such as name, address, longitude & latitude coordinates,
inspection type, inspection date, inspection id and, most importantly for the business, the results.


Note that not every field has a value in every row.


In [None]:
food[0]

What are the contents of the second row?

Again, as list indices start at 0, the n<sup>th</sup> item is at index n-1.

In [None]:
food[1]

Each row has a __Results__ column. Here we get all the unique values in the column by using a
**set comprehension**.

Comprehensions are constructs that allow sequences to be built from other sequences. Using the set syntax in Python **{}** removes any duplicates and gives us the distinct values.

In [None]:
{row['Results'] for row in food}
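To see the de-duplication on something small, here is the same pattern on three made-up rows, next to a list comprehension for comparison:

```python
# Made-up rows; only the 'Results' key matters here.
rows = [{"Results": "Pass"}, {"Results": "Fail"}, {"Results": "Pass"}]

as_list = [row["Results"] for row in rows]   # keeps duplicates and order
as_set = {row["Results"] for row in rows}    # keeps one copy of each value
print(len(as_list), len(as_set))
```

One thing to watch for: an empty pair of braces `{}` creates an empty dict, not an empty set. Use `set()` for that.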

Let's get all the rows that have failed. We can do this by using a list comprehension. As previously noted, comprehensions are constructs that allow sequences to be built from other sequences. Comprehensions can also use a conditional to filter the existing data.

Here we use the condition `if row['Results'] == 'Fail'` to keep each row where the Results column has a value equal to __Fail__.

In [None]:
fail = [row for row in food if row['Results'] == 'Fail']
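The same filtering idea on a handful of made-up rows (the business names are invented for the example):

```python
rows = [
    {"DBA Name": "CAFE ONE", "Results": "Pass"},
    {"DBA Name": "CAFE TWO", "Results": "Fail"},
    {"DBA Name": "CAFE THREE", "Results": "Fail"},
]

# Keep only the rows whose Results value is exactly 'Fail'.
failed = [row for row in rows if row["Results"] == "Fail"]

# A comprehension can also transform while it filters.
failed_names = [row["DBA Name"] for row in rows if row["Results"] == "Fail"]
print(failed_names)
```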

How many inspections failed?

Again we can check how many items
are in the list using the builtin **len()**.

In [None]:
len(fail)

What are the contents of the first row in our fail list?

Note that a field may itself hold several values; for example, the Violations field is a string of violations separated by a **|** symbol.

In [None]:
fail[0]
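If the individual violations are needed, one possible approach is to split the string on the **|** symbol. The sample value below is made up; real entries are much longer.

```python
# Hypothetical Violations value, shortened for illustration.
violations = "1. EXAMPLE VIOLATION A | 2. EXAMPLE VIOLATION B | 3. EXAMPLE VIOLATION C"

# Split on the separator and trim the surrounding whitespace from each part.
parts = [part.strip() for part in violations.split("|")]
print(len(parts))
print(parts[0])
```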

Using a __Counter__ object we can create a counter, which is a dict subclass.
Elements are stored as dictionary keys and their counts are stored as values.

In [None]:
worst = Counter(row['DBA Name'] for row in fail)

The Counter class has a **most_common()** method that lists the n most common elements and their counts, ordered from the most common to the least. n is an optional keyword arg that defaults to **None**. If n is None, all element counts are listed.

What are the 5 names with the most fails? The result is a list of tuples.

In [None]:
worst.most_common(n=5)

What are the 15 names with the most fails?

As n is the first (and only) argument, we can pass the value positionally without the keyword.

In [None]:
worst.most_common(15)
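A small sketch of the three ways of calling **most_common()**, on made-up single-letter items rather than business names:

```python
from collections import Counter

counts = Counter(["A", "B", "A", "C", "A", "B"])

by_keyword = counts.most_common(n=2)    # keyword form
by_position = counts.most_common(2)     # positional form, same result
everything = counts.most_common()       # n defaults to None: all items
print(by_keyword)
print(everything)
```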

The data is not very clean, and we can see there are variations of the same value for **DBA Name**. There may be whitespace in names, long and/or short versions of names, and other spelling variations. For example **MCDONALDS** & **MC DONALDS** & **MCDONALD'S** probably represent the same name.

We can attempt to clean the data by removing every __'__ with the __replace()__ method (replacing it with an empty string) and then converting all characters to uppercase with the string method **upper()**.

The __replace()__ method of a string returns a copy of the string with all occurrences of substring **old** replaced by **new**. If the optional argument count is given, only the first count occurrences are replaced.

    >>> 'aaa'.replace('a', 'b')
    'bbb'
    >>> 'aaa'.replace('a', 'b', 2)
    'bba'
    

The **upper()** method of a string returns a copy of the string converted to uppercase.

    >>> 'aaa'.upper()
    'AAA'

In [None]:
fail = [{**row, 'DBA Name': row['DBA Name'].replace("'", '').upper()}
        for row in fail]
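To see what this cleaning step does and does not merge, here is a sketch on four made-up spellings. The helper `clean` mirrors the expression used above; `clean_more` is a hypothetical extra step (also dropping spaces) that is not part of the notebook and could wrongly merge unrelated names in real data.

```python
from collections import Counter

# Made-up variants of one business name.
names = ["MCDONALDS", "MC DONALDS", "McDonald's", "MCDONALD'S"]

def clean(name):
    # Same transformation as the notebook: drop apostrophes, then uppercase.
    return name.replace("'", "").upper()

def clean_more(name):
    # Hypothetical extension, not in the notebook: also remove spaces.
    return clean(name).replace(" ", "")

basic = Counter(clean(n) for n in names)
stricter = Counter(clean_more(n) for n in names)
print(basic)
print(stricter)
```

The basic cleaning still leaves **MC DONALDS** as a separate key, which is why the counts should be read as approximate.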

Calculate worst again with the updated version of fail, which holds our first attempt at cleaning the **DBA Name** so that each business appears under a single name.

In [None]:
worst = Counter(row['DBA Name'] for row in fail)

Are the results any different after cleaning the **DBA Name** value?

__Note__ the current dataset available is different to the one used in the video this notebook is based on.

In [None]:
worst.most_common(5)

In [None]:
worst.most_common(15)

We can use the Counter class to count how many times each **Address** is in the fail subset of the food dataset.

In [None]:
bad = Counter(row['Address'] for row in fail)

The five most common addresses in a list of tuples with address & count

In [None]:
bad.most_common(5)

The 15 most common addresses in a list of tuples with address & count

In [None]:
bad.most_common(15)

Make a defaultdict that uses **Counter** as its default factory. Each key of the dict maps to a Counter, which in turn maps its own keys to integer counts.

In [None]:
by_year = defaultdict(Counter)
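A short sketch of how a defaultdict of Counters behaves, using made-up years and addresses. Missing outer keys get a fresh Counter automatically, and a Counter reports 0 for keys it has not seen.

```python
from collections import Counter, defaultdict

by_group = defaultdict(Counter)

# No setup needed: the first access to '2015' creates an empty Counter.
by_group["2015"]["1 MAIN ST"] += 1
by_group["2015"]["1 MAIN ST"] += 1
by_group["2016"]["2 OAK AVE"] += 1

print(by_group["2015"]["1 MAIN ST"])
print(by_group["2015"]["NOT SEEN"])   # a Counter returns 0 for unknown keys
print(sorted(by_group))
```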

Iterate over each item in the fail list using a for loop.

Update the by_year dict using the year of the __'Inspection Date'__. `row['Inspection Date'][-4:]` gives the year as a string: it takes the value of __'Inspection Date'__ and then the last 4 chars of that string using slice notation. A slice works like `[from:to]`; `[-4:]` starts at the 4<sup>th</sup> char from the end and, because we don't specify the **to**, runs to the end of the string. Note that `[-4:-1]` is not equivalent: the **to** index is exclusive, so the last char would be dropped.
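A quick check of the slice behaviour, on a made-up date and assuming the MM/DD/YYYY layout implied by taking the last four characters:

```python
date = "07/04/2015"   # made-up value in MM/DD/YYYY form

year = date[-4:]              # from 4th-from-last to the end
truncated = date[-4:-1]       # the end index is exclusive, so the last char is lost
explicit = date[-4:len(date)] # equivalent to leaving the end out
print(year, truncated, explicit)
```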

Each key in the by_year dict is a distinct year. The year is the key to a Counter object, which is a dict subclass and contains the addresses and their counts.

We can access nested items in dictionaries by chaining square brackets: `my_dict['primary']['nested']`.
By using <span style="color:blue">by_year</span><span style="color:red">[row['Inspection Date'][-4:]]</span><span style="color:green">[row['Address']]</span> we reach the count for one address within one year.

By using the `+= 1` operator we increment the current count.


In [None]:
for row in fail:
    by_year[row['Inspection Date'][-4:]][row['Address']] += 1

Show the 5 most common addresses that failed in 2015 by using the key *2015* and calling the __most_common()__ method on the Counter stored at `by_year['2015']`

In [None]:
by_year['2015'].most_common(5)

Show the 5 most common addresses that failed in 2014 by using the key *2014* and calling the __most_common()__ method on the Counter stored at `by_year['2014']`

In [None]:
by_year['2014'].most_common(5)

Show the 5 most common addresses that failed in 2013 by using the key *2013* and calling the __most_common()__ method on the Counter stored at `by_year['2013']`

In [None]:
by_year['2013'].most_common(5)

Show the 5 most common addresses that failed in 2016 by using the key *2016* and calling the __most_common()__ method on the Counter stored at `by_year['2016']`

In [None]:
by_year['2016'].most_common(5)

The five most common addresses in a list of tuples with address & count

In [None]:
bad.most_common(5)

The *_* variable holds the result of the last cell executed.

That result is a list of tuples, and we can access the values using the [] notation and an index number. Here we ask for the first item in the list and then the first element of that tuple.

In [None]:
_[0][0]
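Outside an interactive session there is no *_* to rely on, so the sketch below uses an ordinary variable with made-up data in the same shape as a **most_common()** result:

```python
# Made-up list of (address, count) tuples.
result = [("1 MAIN ST", 12), ("2 OAK AVE", 9)]

first_pair = result[0]     # the first tuple in the list
address = result[0][0]     # first element of that tuple
count = result[0][1]       # second element of that tuple
print(first_pair, address, count)
```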

Using the builtin function __id()__ we get the identity of an object. This is guaranteed to be unique among simultaneously existing objects. CPython uses the object's memory address.

In [None]:
id(_)
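A small sketch of what identity means in practice: two names bound to the same object share an id, while an equal copy has its own.

```python
a = [1, 2, 3]
b = a          # a second name for the same list object
c = list(a)    # a new list with equal contents

same_object = id(a) == id(b)   # equivalent to: a is b
copy_differs = id(a) != id(c)  # both lists exist at once, so the ids differ
equal_values = a == c          # equality compares contents, not identity
print(same_object, copy_differs, equal_values)
```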

Let's get all the items that failed and have an address at O'Hare. We do this by using a list comprehension and filtering for all the __Addresses__ that start with the string *11601 W TOUHY*. Python strings have a builtin method __startswith__ which returns True if the string starts with the specified prefix, False otherwise.

Optional start & end args restrict the test to the part of the string between those positions.

We use startswith because there may be slight variations in addresses, for example Avenue may be shortened to Ave.

In [None]:
ohare = [row for row in fail if row['Address'].startswith('11601 W TOUHY')]
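A sketch of prefix matching on made-up address variants, including the optional start argument and the tuple-of-prefixes form that **startswith** also accepts:

```python
# Made-up address strings showing the kind of variation to expect.
addresses = ["11601 W TOUHY AVE", "11601 W TOUHY AVENUE", "500 N MAIN ST"]

matches = [a for a in addresses if a.startswith("11601 W TOUHY")]

# start=8 makes the test begin at index 8, where 'TOUHY' begins.
from_offset = "11601 W TOUHY AVE".startswith("TOUHY", 8)

# A tuple of prefixes matches if any one of them matches.
either = [a for a in addresses if a.startswith(("11601", "500"))]
print(matches, from_offset, len(either))
```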

Show a set of all the distinct __DBA Name__ values that have failed a health inspection at O'Hare

In [None]:
{row['DBA Name'] for row in ohare}

Show the contents of the first item in ohare

In [None]:
ohare[0]

Each business in ohare has a __DBA Name__ (Doing Business As) and __AKA Name__ (Also Known As).
We can identify the worst locations at O'Hare to eat by using a Counter object and counting each __AKA Name__.

In [None]:
c = Counter(row['AKA Name'] for row in ohare)

What are the 10 worst places to eat at O'Hare?

In [None]:
c.most_common(10)