# City of Chicago Data Set

### Builtin Superheroes (Screencast)

Based on David Beazley's [presentation](https://www.youtube.com/watch?v=j6VSAsKAj98)

To get the Food Inspections data file use [wget](https://linux.die.net/man/1/wget)

    wget -c "https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.csv?accessType=DOWNLOAD" -O Food_Inspections.csv

or alternatively use [curl](https://linux.die.net/man/1/curl)

    curl "https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.csv?accessType=DOWNLOAD" -o Food_Inspections.csv

__Sorry, I don't use Windows, so you'll have to work out how to download the file yourself on that OS.__

***

### Set Up

Rather than following the original implementation in the video using basic Python,
in this notebook we use [pandas](https://pandas.pydata.org/) to explore the data.

##### What is pandas?

Pandas is a high-level data manipulation tool built on the NumPy package.
Its key data structure is called the DataFrame.

It was developed by __Wes McKinney__.

##### What's a DataFrame?

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
It allows you to store and manipulate tabular data in rows of observations and columns of variables.

##### Import
To import the library we use `import pandas`. To keep references to the module short, we give it
an alias by using `as`
    
    import pandas as pd
    

##### Loading the data

We use the `read_csv` function in pandas to read the CSV file. This function takes dozens of args (the exact number
depends on the pandas version), most of which have defaults, but we can do things like setting the column names,
choosing which values are treated as NA, specifying the separator (delimiter) and selecting the cols to use.
 
By passing the file path of the CSV we want to open, we are telling pandas to load this file into memory as a table.

Pandas can also load data from other sources, such as Excel files, plain text, HTML tables
and SQL queries, amongst others.
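
A minimal sketch of a few of those `read_csv` options. The inline CSV text, the `;` separator and the `"missing"` marker are made up for illustration, so we don't need the real file to try it:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = (
    "Inspection ID;DBA Name;Results\n"
    "1;CAFE A;Pass\n"
    "2;CAFE B;missing\n"
    "3;CAFE C;Fail\n"
)

sample_df = pd.read_csv(
    io.StringIO(csv_text),            # any file path or file-like object works here
    sep=";",                          # the separator between fields
    usecols=["DBA Name", "Results"],  # only load these columns
    na_values=["missing"],            # treat this marker as NaN
)
print(sample_df)
```

The "missing" entry is loaded as NaN, and the `Inspection ID` column is skipped entirely.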



In [None]:
import pandas as pd

food_df = pd.read_csv("Food_Inspections.csv")

How many items in the food list (rows in the dataset)?

The **len()** function is a Python builtin function. It returns the number of elements/items in a collection.

We could also use the pandas `shape` attribute, which is a tuple containing the number of rows and columns.
It could be used like `n_rows, n_cols = food_df.shape`. Alternatively, we can get the number of rows from the length of the index: `len(food_df.index)`.
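
As a quick sketch on a tiny made-up DataFrame (the `demo` frame and its columns are only for illustration), the three approaches agree on the row count:

```python
import pandas as pd

# Hypothetical three-row frame used only to compare the approaches.
demo = pd.DataFrame({"name": ["A", "B", "C"], "risk": [1, 2, 3]})

n_rows, n_cols = demo.shape    # tuple of (rows, columns)
row_count = len(demo)          # builtin len() counts rows
index_count = len(demo.index)  # length of the row index

print(n_rows, n_cols, row_count, index_count)
```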

In [None]:
len(food_df)

What are the contents of the first row?

We can access the rows and columns in the dataframe by using indices, similar to lists,
through the `loc` and `iloc` indexers, which are exposed as properties on the DataFrame. (The older `ix` indexer
was deprecated and has been removed from current pandas releases.)

* DataFrame.loc[] : selects by label.
* DataFrame.iloc[] : selects by integer position.

They accept a single index, which returns a Series, or a list of indices, which returns a new DataFrame.

Indexing is also known as subset selection.
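
A small sketch of the difference between label-based and position-based selection. The `demo` frame and its `x`/`y`/`z` labels are made up for illustration:

```python
import pandas as pd

# Hypothetical frame with string labels as the row index.
demo = pd.DataFrame(
    {"name": ["A", "B", "C"], "score": [1, 2, 3]},
    index=["x", "y", "z"],
)

row_by_label = demo.loc["y"]   # select by label -> Series
row_by_pos = demo.iloc[1]      # select by position -> Series
subset = demo.iloc[[0, 2]]     # list of positions -> DataFrame

print(type(row_by_label).__name__, type(subset).__name__)
```

Here `demo.loc["y"]` and `demo.iloc[1]` pick the same row, because the label `y` sits at position 1.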


##### What's a Series?

A Series is a pandas data structure: a one-dimensional labeled array capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively
referred to as the index. It is one-dimensional, whereas a DataFrame is two-dimensional.

##### NaN values
Some of the values are __NaN__ (Not a Number), which acts as a placeholder for missing data.
This should not be confused with Python's built-in __None__.
Mathematical operations can be performed on NaN (the result is NaN), but not on None (they raise a `TypeError`).

* nan belongs to the class float
* None belongs to the class NoneType

You can check whether a value is NaN by calling `pd.isna()` on the value, or `isna()` / `isnull()` on a Series or
DataFrame. According to the pandas docs, `isnull()` is an alias of `isna()`.
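
A short sketch of how NaN and None behave differently. The values here are made up and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

nan_type = type(np.nan)    # NaN is an ordinary float
nan_plus_one = np.nan + 1  # arithmetic works, but the result is still NaN

try:
    None + 1               # None does not support arithmetic
    none_failed = False
except TypeError:
    none_failed = True

s = pd.Series([1.0, np.nan, 3.0])
mask = s.isna()            # element-wise check for missing values

print(nan_type, nan_plus_one, none_failed)
print(mask.tolist())
```

Note that `pd.isna()` treats both `np.nan` and `None` as missing.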

In [None]:
food_df.iloc[0]

What are the contents of the second row?

Again, as with lists, indices start at 0, so the n<sup>th</sup> item will be at index n-1.

In [None]:
food_df.iloc[1]

Each row has a __Results__ column. Here we get all the unique values in the column by first selecting the column with `[]`, which is similar to looking up a key in a dictionary, and then invoking the `unique()` function, which returns a NumPy array containing the unique values of the Series.


In [None]:
food_df["Results"].unique()

Let's get all the rows that have failed. We can do this by applying a boolean comparison to a column, which produces
a Series of True/False values (a mask), one per row. Indexing the DataFrame with that mask returns a new DataFrame
containing only the rows where the comparison is True.

By using the `.copy()` function we create a copy of the selected data. This means the copy can be manipulated and
edited independently, rather than just being a possible view of the original data.
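
A minimal sketch of the mask-then-copy pattern on a made-up frame (`demo` and its values are for illustration only):

```python
import pandas as pd

# Hypothetical three-row frame.
demo = pd.DataFrame(
    {"DBA Name": ["A", "B", "C"], "Results": ["Pass", "Fail", "Fail"]}
)

mask = demo["Results"] == "Fail"  # boolean Series, one value per row
failed = demo[mask].copy()        # independent copy of the matching rows

# Editing the copy leaves the original frame untouched.
failed["Results"] = "FAILED"

print(mask.tolist())
print(demo["Results"].tolist())
```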


In [None]:
fail = food_df[food_df.Results == "Fail"].copy()

How many inspections failed?

Again, we can check how many items are in the fail DataFrame
by using the builtin **len()**.

In [None]:
len(fail)

What are the contents of the first row in our fail list?

Note that a value may itself contain a list of items. For example, the Violations column holds a string of violations
separated by a **|** (pipe) symbol. Columns with no values are assigned a NumPy NaN value by pandas.

In [None]:
fail.iloc[0]

Using the `value_counts()` function we can count the number of times each value appears in
a column in the dataframe. It returns a Series containing counts of unique values, in descending order, so that
the first element is the most frequently occurring element.


__It excludes NA values by default__
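
A small sketch of `value_counts()` on a made-up Series, including the `dropna=False` option for when the missing values should be counted too:

```python
import pandas as pd

# Hypothetical names with one missing value.
s = pd.Series(["A", "B", "A", None, "A", "B"])

counts = s.value_counts()                      # missing value is ignored
counts_with_na = s.value_counts(dropna=False)  # missing value gets its own row

print(counts)
print(counts_with_na)
```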

In [None]:
worst = fail["DBA Name"].value_counts()
worst

Because a Series supports slicing like a list and `worst` is in descending order by default, we can use Python's
slice notation `[from:to]` to get the top n most common.

In [None]:
worst[:5]

What are the 15 most common fails? 

An alternative to using the list notation is using the `head()` function. It returns the first n rows based on position.

In [None]:
worst.head(15)

The data is not very clean and we can see there are variations of the same value for **DBA Name**.
There may be whitespace in names, long and/or short versions of names and other
spelling or punctuation variations. For example **MCDONALDS** & **MC DONALDS** & **MCDONALD'S** probably
represent the same name.

We can attempt to clean the data by removing all __'__ characters with the __replace()__ function, replacing them
with an empty string. We then make all characters in the string uppercase by using another string method,
**upper()**.

The __replace()__ function is a built-in string method and returns a copy of a string with all occurrences of substring
**old** replaced by **new**. If the optional argument count is given, only the first count occurrences are replaced.

    >>> 'aaa'.replace('a', 'b')
    'bbb'
    >>> 'aaa'.replace('a', 'b', 2)
    'bba'
    

The **upper()** function is a built-in string method and returns a copy of the string converted to uppercase.

    >>> 'aaa'.upper()
    'AAA'
    
We can invoke a function on values of Series/column in the dataframe by using the `apply()` function. This applies
the function to each value in the Series.

Here we use a Python __lambda__ expression, which is shorthand for creating anonymous functions:
`lambda args : expression`. They can have many args, but only 1 expression.

Because apply doesn't have an inplace option and returns a Series, we need to reassign the result to the original column.
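
As a sketch of an alternative, the same clean-up can be written with the vectorised `.str` accessor, which skips missing values instead of raising an error on them. The names below are made up for illustration:

```python
import pandas as pd

# Hypothetical names, including a missing one.
names = pd.Series(["McDonald's", "MC DONALDS", None])

# apply() with a lambda: drop the missing value first, since None has no .replace()
via_apply = names.dropna().apply(lambda x: x.replace("'", "").upper())

# .str accessor: missing values simply stay missing
via_str = names.str.replace("'", "", regex=False).str.upper()

print(via_apply.tolist())
print(via_str.isna().tolist())
```

Both approaches return a new Series, so the result still has to be assigned back to the column.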

In [None]:
fail["DBA Name"] = fail["DBA Name"].apply(lambda x: x.replace("'", "").upper())

Calculate the worst again with the updated version of fail that has the first attempt of cleaning the **DBA Name** and
attempting to use a single version of names

In [None]:
worst = fail["DBA Name"].value_counts()
worst

Are they any different after cleaning the **DBA Name** value ?

__Note__ the current dataset available is different to the one used in the video this notebook is based on.

In [None]:
worst.head(5)

In [None]:
worst.head(15)

We can use the `value_counts()` function again to count how many times each **Address** is in the fail dataframe.

In [None]:
bad = fail["Address"].value_counts()
bad

The five most common addresses in the bad Series

In [None]:
bad.head(5)

The 15 most common addresses in the bad Series

In [None]:
bad.head(15)

Let's drop the days & months from `Inspection Date` as we only require the year. This is done by using the apply method
to apply a function that returns a slice of the date, namely the last 4 chars, which hold the year.

In [None]:
fail["Inspection Date"] = fail["Inspection Date"].apply(lambda x: x[-4:])

We can create a `Series` with a two-level index, __'Inspection Date'__ & __'Address'__, by using the `DataFrame.groupby` function.
Then, by using the `count` function on the Address column, we count the number of times each unique address appears in
the fail dataframe for each year.

The group-by operation involves some combination of splitting the object, applying a function, and combining the results.
This can be used to group large amounts of data and compute operations on these groups.
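
A minimal sketch of grouping by two columns on a made-up frame. The addresses and years are invented, and `by_year_demo` is just an illustrative name:

```python
import pandas as pd

# Hypothetical failures: three in 2015 and one in 2014.
demo = pd.DataFrame(
    {
        "Inspection Date": ["2015", "2015", "2015", "2014"],
        "Address": ["1 MAIN ST", "1 MAIN ST", "2 OAK AVE", "1 MAIN ST"],
    }
)

# Split by (year, address), count the rows in each group, combine into a Series.
by_year_demo = demo.groupby(["Inspection Date", "Address"])["Address"].count()

# Selecting on the first index level leaves a Series indexed by Address.
top_2015 = by_year_demo["2015"].sort_values(ascending=False)

print(by_year_demo)
print(top_2015)
```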


In [None]:
by_year = fail.groupby(["Inspection Date", "Address"])["Address"].count()

Show the 5 most common addresses that failed for the year 2015 by using the key *2015*, then `sort_values` in
descending order. Finally by using the head function we can specify the number of items from the top we want,
the default is 5.

In [None]:
by_year["2015"].sort_values(ascending=False).head()

Show the 5 most common addresses that failed for the year 2014 by using the key *2014*, then sort_values in
descending order. Finally by using the head function we can specify the number of items from the top we want,
the default is 5.

In [None]:
by_year["2014"].sort_values(ascending=False).head()

Show the 5 most common addresses that failed for the year 2013 by using the key *2013*, then sort_values in
descending order. Finally by using the head function we can specify the number of items from the top we want,
the default is 5.

We can also optionally specify the sorting algorithm required.

In [None]:
by_year["2013"].sort_values(ascending=False, kind="quicksort").head(5)

Show the 5 most common addresses that failed for the year 2016 by using the key *2016*, then sort_values in
descending order. Finally by using the head function we can specify the number of items from the top we want,
the default is 5.

In [None]:
by_year["2016"].sort_values(ascending=False, kind="heapsort").head()

The five most common addresses in the bad Series. Sort in descending order and get the top 5 by invoking
the `head()` function.

In [None]:
bad.sort_values(ascending=False).head()

What's the address that is most common among the failed premises? We find it by first sorting the values in descending
order and then invoking the `head()` function with 1 as an arg. We can then read the index,
because the address is the actual index of the Series.

In [None]:
bad.sort_values(ascending=False).head(1).index[0]
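
A possible shortcut, sketched on made-up counts: `idxmax()` returns the index label of the largest value directly, without sorting first.

```python
import pandas as pd

# Hypothetical failure counts per address.
counts = pd.Series({"1 MAIN ST": 3, "11601 W TOUHY AVE": 7, "2 OAK AVE": 5})

via_sort = counts.sort_values(ascending=False).head(1).index[0]
via_idxmax = counts.idxmax()  # label of the maximum value

print(via_sort, via_idxmax)
```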

Let's get all the items that failed and have an address at O'Hare. We do this by checking whether each string value in
the Address column starts with a substring of the address that appears most often in the bad
Series.

Create a copy so we can change the data.

In [None]:
ohare = fail.loc[fail.Address.str.startswith("11601 W TOUHY", na=False)].copy()
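
A short sketch of what `na=False` does in `str.startswith`, using made-up addresses: a missing address counts as "no match", so the result is a clean boolean mask that can be used for selection.

```python
import pandas as pd

# Hypothetical addresses, one of them missing.
addresses = pd.Series(["11601 W TOUHY AVE", "500 N STATE ST", None])

at_ohare = addresses.str.startswith("11601 W TOUHY", na=False)

print(at_ohare.tolist())
```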

Show all the distinct __DBA Name__ that have failed a health inspection in ohare. We do this by using the [] notation
to get the column __DBA Name__ as a pandas Series. We can then use the function `.unique()` which returns the unique
values in the series as a `numpy.ndarray`

In [None]:
ohare["DBA Name"].unique()

Show the contents of the first item in ohare. We can do this in multiple ways, depending on the type of data structure
required.

* ohare[:1] returns a pandas.core.frame.DataFrame
* ohare.iloc[0] returns a pandas.core.series.Series
* ohare.head(1) returns a pandas.core.frame.DataFrame
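
The three options can be compared on a tiny made-up frame (`demo` is only for illustration):

```python
import pandas as pd

# Hypothetical two-row frame.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

as_slice = demo[:1]     # slicing keeps the DataFrame shape (1 row)
as_row = demo.iloc[0]   # a single position gives a Series
as_head = demo.head(1)  # head() also keeps the DataFrame shape

print(type(as_slice).__name__, type(as_row).__name__, type(as_head).__name__)
```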

In [None]:
ohare.iloc[0]

Each business in ohare has a __DBA Name__ (Doing Business As) and an __AKA Name__ (Also Known As).
We can identify the worst locations at O'Hare to eat by using the `groupby` function again, and counting
the values in __AKA Name__.

In [None]:
c = ohare.groupby(["AKA Name"])["AKA Name"].count()

What are the 10 worst places to eat in O'Hare? Sort the values in descending order, then get the first 10 rows by using
the `head()` function.

In [None]:
c.sort_values(ascending=False).head(10)

We can group all the rows in ohare by the __License #__ column. This creates a groupby object that maps each
__License #__ value (the key) to the rows that share it.

The keys are numeric, because pandas parsed the __License #__ column as numbers when reading the CSV.

In [None]:
inspections = ohare.groupby("License #")

Show all the keys in the inspections groupby object.

In [None]:
inspections.groups.keys()

Get the inspections per `License #`. Because the key is numeric, we can use an integer or a floating point value.

The `get_group()` function creates a DataFrame from a group.

The key `2308566.0` can be used as well as `2308566`.

In [None]:
inspections.get_group(2308566.0)

Sample using integer value for the key

In [None]:
inspections.get_group(34192)
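
A minimal sketch of why both key styles work, on a made-up frame. The licence numbers are reused from the cells above, but the rows are invented. The missing value forces the column to a float dtype, and an integer key compares equal to the matching float key:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the NaN makes "License #" a float column.
demo = pd.DataFrame(
    {
        "License #": [34192, 34192, np.nan, 2308566],
        "Results": ["Fail", "Fail", "Fail", "Fail"],
    }
)

groups = demo.groupby("License #")

by_int = groups.get_group(34192)      # integer key
by_float = groups.get_group(34192.0)  # float key, same group

print(demo["License #"].dtype)
print(by_int)
```

Rows with a missing `License #` are left out of the groups by default.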

Show the inspection dates for a given `License #`. The `get_group` function returns a new dataframe, and selecting
the `Inspection Date` column from it gives a Series.

If you require a list, you could use `inspections.get_group(34192)["Inspection Date"].tolist()`.

In [None]:
inspections.get_group(34192)["Inspection Date"]

Most common way a place in ohare fails an inspection.

Create a new DataFrame from a list. The list is generated by taking the Violations value and splitting it at the __|__ symbol.

In [None]:
pd.DataFrame(ohare.iloc[1]["Violations"].split("|"), columns=["Violations"])

In IPython and Jupyter, the `_` variable holds the output of the last cell executed.
Assign the last DataFrame that was created to the violations variable.

In [None]:
violations = _
violations

In [None]:
violations["Violations"].apply(lambda x: x[: x.find("- Comments:")])

Remove the comments from each of the `Violations`.

This is done by again invoking a function on each violation.
First we get the index position of the start of the string __"- Comments:"__, then take a substring of the violation
itself. Because Python strings are sequences of characters, we can use the slice notation `[from:to]` to get a substring
by index position.
In other words: for each Violation value, find the index where the comments start, take the substring from
the start of the value up to that index, strip any white space, and use the
result as the new value for the Violation.

This does not persist the result in the dataframe; it only changes the Series that is returned from `.apply()`.

In [None]:
violations["Violations"].apply(lambda x: x[: x.find("- Comments:")].strip())
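
One thing to watch for with this approach: `str.find()` returns -1 when the marker is absent, and slicing with `[:-1]` then silently drops the last character. A sketch of a more defensive version follows. `strip_comments` is a hypothetical helper and the violation strings are made up:

```python
MARKER = "- Comments:"


def strip_comments(text):
    """Hypothetical helper: drop everything from the comments marker onwards."""
    idx = text.find(MARKER)
    if idx == -1:
        # No comments present, so keep the whole text.
        return text.strip()
    return text[:idx].strip()


with_comment = "32. FOOD SURFACES CLEAN - Comments: WIPE DOWN COUNTERS"
without_comment = "32. FOOD SURFACES CLEAN"

# The plain slice loses the final character when the marker is missing.
naive = without_comment[: without_comment.find(MARKER)]

print(repr(naive))
print(repr(strip_comments(with_comment)))
print(repr(strip_comments(without_comment)))
```

Such a helper could be passed to `.apply()` in place of the lambda.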

Create a new DataFrame `all_violations` which contains all the Violations in the ohare dataframe.

Using a list comprehension, which mimics a **nested for loop**, we take each Violations value and split it
at the __|__ symbol, which results in a list. We then loop over that list and strip any white space.

The resulting `_all_violations` is a flat list of all the individual violations in `ohare["Violations"]`.

We can use the `_all_violations` variable to create a new dataframe that contains all the violations in the ohare DataFrame.
In [None]:
_all_violations = [
    violation.strip()
    for v_sublist in ohare["Violations"]
    for violation in str(v_sublist).split("|")
]
all_violations = pd.DataFrame(_all_violations, columns=["Violations"])
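
To make the nested loop explicit, here is a sketch that builds the same flat list three ways on made-up data: with a plain nested for loop, with the list comprehension, and, as one possible alternative, with the pandas `str.split()` and `explode()` functions.

```python
import pandas as pd

# Hypothetical Violations values, separated by the | symbol.
raw = pd.Series(
    [
        "1. A - Comments: x | 2. B - Comments: y",
        "3. C - Comments: z",
    ]
)

# 1. Explicit nested for loop.
flat_loop = []
for v_sublist in raw:
    for violation in str(v_sublist).split("|"):
        flat_loop.append(violation.strip())

# 2. The equivalent list comprehension (same loop order, outer loop first).
flat_comprehension = [
    violation.strip()
    for v_sublist in raw
    for violation in str(v_sublist).split("|")
]

# 3. pandas alternative: split into lists, then one row per list element.
flat_explode = raw.str.split("|").explode().str.strip().tolist()

print(flat_loop)
```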

Again using the previous method of getting a substring of the violation with the comments removed, we can persist this
by assigning the resulting Series to the DataFrame column `all_violations["Violations"]`

In [None]:
all_violations["Violations"] = all_violations["Violations"].apply(
    lambda x: x[: x.find("- Comments:")].strip()
)

What are the top 5 violations in all of ohare?

In [None]:
all_violations["Violations"].value_counts().head()