# Data Wrangling with Pandas

We've seen how to get data with Python. Now let's put it to work. From here on, we'll mostly use the PyData stack rather than Python's built-in functionality.

Our objective in this section is to learn enough to clean the larger sample of Chicago Health Inspection data and get it ready for modeling.

## Preliminaries: DataFrames

As mentioned, the core data structure in pandas is called a DataFrame. A DataFrame is a tabular data structure, holding many columns, similar to a spreadsheet.

The **Key Features** are

* Easy handling of **missing data**
* **Size mutability**: columns can be inserted and deleted from DataFrames
* Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
* Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
* Intelligent label-based **slicing**, **fancy indexing**, and **subsetting** of large data sets
* Intuitive **merging and joining** data sets
* Flexible **reshaping and pivoting** of data sets
* **Hierarchical labeling** of axes
* Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
* **Time series functionality**: 
  * date range generation and frequency conversion
  * moving window statistics
  * moving window linear regressions
  * date shifting and lagging, etc.

In [4]:
import pandas as pd

In [34]:
dta = pd.read_csv("data/health_inspection_chi.csv")

Pandas provides labelled **indices** to access rows and columns, should they have natural labels.

In [6]:
dta.index

RangeIndex(start=0, stop=25000, step=1)

In [25]:
dta.columns

Index(['address', 'aka_name', 'city', 'dba_name', 'facility_type',
       'inspection_date', 'inspection_id', 'inspection_type', 'latitude',
       'license_', 'location', 'longitude', 'results', 'risk', 'state',
       'violations', 'zip'],
      dtype='object')

For example, with this data set we have a natural unique identifier in the `inspection_id` column. We might wish to make this our index.

In [26]:
dta.head()

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
1,6744 N SHERIDAN RD,RICE THAI CAFE,CHICAGO,RICE THAI CAFE,Restaurant,2015-08-21T00:00:00.000,1482935,Canvass Re-Inspection,42.004881,2354674.0,"{'type': 'Point', 'coordinates': [-87.66101071...",-87.661011,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60626.0
2,160-164 E SUPERIOR ST,GINO'S EAST PIZZERIA,CHICAGO,GINO'S EAST PIZZERIA,Restaurant,2015-07-28T00:00:00.000,1447916,Suspected Food Poisoning Re-inspection,41.895863,1697132.0,"{'type': 'Point', 'coordinates': [-87.62325304...",-87.623253,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60611.0
3,5844-5846 N BROADWAY,RAS DASHEN ETHIOPIAN RESTAURANT INC,CHICAGO,RAS DASHEN ETHIOPIAN RESTAURANT INC,Restaurant,2017-01-04T00:00:00.000,1978933,Canvass,41.988326,1122395.0,"{'type': 'Point', 'coordinates': [-87.66036036...",-87.66036,Fail,Risk 1 (High),IL,"16. FOOD PROTECTED DURING STORAGE, PREPARATION...",60660.0
4,2352-2358 N MILWAUKEE AVE,EAST ROOM,CHICAGO,EAST ROOM,Liquor,2013-11-18T00:00:00.000,1375515,License Re-Inspection,41.923873,2263696.0,"{'type': 'Point', 'coordinates': [-87.69916285...",-87.699163,Pass,Risk 3 (Low),IL,,60647.0


In [27]:
dta = dta.set_index('inspection_id')

In [28]:
dta.index

Int64Index([  68091, 1482935, 1447916, 1978933, 1375515, 1434800, 1106632,
            1981312, 1501267, 2129593,
            ...
            1387782,  679681, 1981776, 2009959, 1770545, 1482355, 1459253,
            1751851, 2102939, 2009279],
           dtype='int64', name='inspection_id', length=25000)

In [29]:
dta.head()

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
1482935,6744 N SHERIDAN RD,RICE THAI CAFE,CHICAGO,RICE THAI CAFE,Restaurant,2015-08-21T00:00:00.000,Canvass Re-Inspection,42.004881,2354674.0,"{'type': 'Point', 'coordinates': [-87.66101071...",-87.661011,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60626.0
1447916,160-164 E SUPERIOR ST,GINO'S EAST PIZZERIA,CHICAGO,GINO'S EAST PIZZERIA,Restaurant,2015-07-28T00:00:00.000,Suspected Food Poisoning Re-inspection,41.895863,1697132.0,"{'type': 'Point', 'coordinates': [-87.62325304...",-87.623253,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60611.0
1978933,5844-5846 N BROADWAY,RAS DASHEN ETHIOPIAN RESTAURANT INC,CHICAGO,RAS DASHEN ETHIOPIAN RESTAURANT INC,Restaurant,2017-01-04T00:00:00.000,Canvass,41.988326,1122395.0,"{'type': 'Point', 'coordinates': [-87.66036036...",-87.66036,Fail,Risk 1 (High),IL,"16. FOOD PROTECTED DURING STORAGE, PREPARATION...",60660.0
1375515,2352-2358 N MILWAUKEE AVE,EAST ROOM,CHICAGO,EAST ROOM,Liquor,2013-11-18T00:00:00.000,License Re-Inspection,41.923873,2263696.0,"{'type': 'Point', 'coordinates': [-87.69916285...",-87.699163,Pass,Risk 3 (Low),IL,,60647.0
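
To see what `set_index` does without loading the full file, here is a minimal sketch on a hypothetical three-row sample (the `sample` and `indexed` names are made up for illustration). It shows that the method returns a new DataFrame and moves the column into the index.

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the inspections file
sample = pd.DataFrame({
    "inspection_id": [68091, 1482935, 1447916],
    "address": ["2804 N CLARK ST", "6744 N SHERIDAN RD", "160-164 E SUPERIOR ST"],
    "results": ["Pass", "Pass", "Pass"],
})

# set_index returns a new DataFrame; `sample` keeps its default RangeIndex
indexed = sample.set_index("inspection_id")

print(sample.index)    # default positional index
print(indexed.index)   # labels taken from the inspection_id column
print("inspection_id" in indexed.columns)  # False: the column moved into the index
```

Because the result is a new object, the notebook reassigns it with `dta = dta.set_index(...)`.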


## Indexing

To look at a column from a DataFrame, you can use attribute lookup.

In [30]:
dta.address

inspection_id
68091                         2804 N CLARK ST 
1482935                    6744 N SHERIDAN RD 
1447916                 160-164 E SUPERIOR ST 
1978933                  5844-5846 N BROADWAY 
1375515             2352-2358 N MILWAUKEE AVE 
1434800                     4140 W Addison ST 
1106632                    5657 N LINCOLN AVE 
1981312                     1234 N HALSTED ST 
1501267                           3700 S LOWE 
2129593                    2506 N LARAMIE AVE 
565226           131 N CLINTON ST BLDG BOOTH26
2102944                    3450 N LINCOLN AVE 
1482500                       4445 N BROADWAY 
1684292                       200 W MONROE ST 
1546447                 1431 N NORTH PARK AVE 
2169836                       2935 N BROADWAY 
1229384                  1366 N MILWAUKEE AVE 
2072068                     11601 W TOUHY AVE 
1227664                        3956 W 63RD ST 
2088450                    1424 W CHICAGO AVE 
1202567                        3350 W 47th ST 

Or you can use the **getitem** syntax that relies on square brackets `[]`, which is familiar from dealing with dictionaries (uses `__getitem__`).

In [None]:
dta['address']

These two operations return pandas **Series** objects. **Series** are like single-column DataFrames. If you want to preserve the DataFrame type, index the DataFrame with a list.

In [31]:
dta[['address']]

inspection_id,address
68091,2804 N CLARK ST
1482935,6744 N SHERIDAN RD
1447916,160-164 E SUPERIOR ST
1978933,5844-5846 N BROADWAY
1375515,2352-2358 N MILWAUKEE AVE
1434800,4140 W Addison ST
1106632,5657 N LINCOLN AVE
1981312,1234 N HALSTED ST
1501267,3700 S LOWE
2129593,2506 N LARAMIE AVE


You can use this syntax to pull out multiple columns.

In [32]:
dta[['address', 'inspection_date']]

inspection_id,address,inspection_date
68091,2804 N CLARK ST,2010-02-01T00:00:00.000
1482935,6744 N SHERIDAN RD,2015-08-21T00:00:00.000
1447916,160-164 E SUPERIOR ST,2015-07-28T00:00:00.000
1978933,5844-5846 N BROADWAY,2017-01-04T00:00:00.000
1375515,2352-2358 N MILWAUKEE AVE,2013-11-18T00:00:00.000
1434800,4140 W Addison ST,2014-04-14T00:00:00.000
1106632,5657 N LINCOLN AVE,2013-06-14T00:00:00.000
1981312,1234 N HALSTED ST,2017-01-11T00:00:00.000
1501267,3700 S LOWE,2014-10-28T00:00:00.000
2129593,2506 N LARAMIE AVE,2017-12-21T00:00:00.000
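
The difference between indexing with a string and indexing with a list is easy to check directly. The following is a small sketch on a hypothetical two-row DataFrame (`sample` is not part of the inspections data).

```python
import pandas as pd

# Hypothetical two-row sample indexed by inspection_id
sample = pd.DataFrame(
    {
        "address": ["2804 N CLARK ST", "3700 S LOWE"],
        "results": ["Pass", "Pass"],
    },
    index=pd.Index([68091, 1501267], name="inspection_id"),
)

as_series = sample["address"]     # a string key returns a Series
as_frame = sample[["address"]]    # a list of keys returns a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# Attribute lookup returns the same Series as the string key
print(sample.address.equals(as_series))
```

Attribute lookup only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.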


You can index the rows by using the **loc** and **iloc** accessors.

`loc` does *label-based* indexing. If the requested labels are not in the index, it raises a `KeyError`.

In [36]:
dta.loc[[1965287, 1329698]]

KeyError: 'None of [[1965287, 1329698]] are in the [index]'

`iloc`, on the other hand, provides *integer-based* indexing. We can pass a list of integer row positions.

In [37]:
dta.iloc[[0, 2]]

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
2,160-164 E SUPERIOR ST,GINO'S EAST PIZZERIA,CHICAGO,GINO'S EAST PIZZERIA,Restaurant,2015-07-28T00:00:00.000,1447916,Suspected Food Poisoning Re-inspection,41.895863,1697132.0,"{'type': 'Point', 'coordinates': [-87.62325304...",-87.623253,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60611.0


Both support the Python **slice notation** (`start:stop:step`). This can be really powerful.

In [38]:
dta.iloc[:5]

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
1,6744 N SHERIDAN RD,RICE THAI CAFE,CHICAGO,RICE THAI CAFE,Restaurant,2015-08-21T00:00:00.000,1482935,Canvass Re-Inspection,42.004881,2354674.0,"{'type': 'Point', 'coordinates': [-87.66101071...",-87.661011,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60626.0
2,160-164 E SUPERIOR ST,GINO'S EAST PIZZERIA,CHICAGO,GINO'S EAST PIZZERIA,Restaurant,2015-07-28T00:00:00.000,1447916,Suspected Food Poisoning Re-inspection,41.895863,1697132.0,"{'type': 'Point', 'coordinates': [-87.62325304...",-87.623253,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60611.0
3,5844-5846 N BROADWAY,RAS DASHEN ETHIOPIAN RESTAURANT INC,CHICAGO,RAS DASHEN ETHIOPIAN RESTAURANT INC,Restaurant,2017-01-04T00:00:00.000,1978933,Canvass,41.988326,1122395.0,"{'type': 'Point', 'coordinates': [-87.66036036...",-87.66036,Fail,Risk 1 (High),IL,"16. FOOD PROTECTED DURING STORAGE, PREPARATION...",60660.0
4,2352-2358 N MILWAUKEE AVE,EAST ROOM,CHICAGO,EAST ROOM,Liquor,2013-11-18T00:00:00.000,1375515,License Re-Inspection,41.923873,2263696.0,"{'type': 'Point', 'coordinates': [-87.69916285...",-87.699163,Pass,Risk 3 (Low),IL,,60647.0


In [39]:
dta.loc[:1335320]

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
1,6744 N SHERIDAN RD,RICE THAI CAFE,CHICAGO,RICE THAI CAFE,Restaurant,2015-08-21T00:00:00.000,1482935,Canvass Re-Inspection,42.004881,2354674.0,"{'type': 'Point', 'coordinates': [-87.66101071...",-87.661011,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60626.0
2,160-164 E SUPERIOR ST,GINO'S EAST PIZZERIA,CHICAGO,GINO'S EAST PIZZERIA,Restaurant,2015-07-28T00:00:00.000,1447916,Suspected Food Poisoning Re-inspection,41.895863,1697132.0,"{'type': 'Point', 'coordinates': [-87.62325304...",-87.623253,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60611.0
3,5844-5846 N BROADWAY,RAS DASHEN ETHIOPIAN RESTAURANT INC,CHICAGO,RAS DASHEN ETHIOPIAN RESTAURANT INC,Restaurant,2017-01-04T00:00:00.000,1978933,Canvass,41.988326,1122395.0,"{'type': 'Point', 'coordinates': [-87.66036036...",-87.660360,Fail,Risk 1 (High),IL,"16. FOOD PROTECTED DURING STORAGE, PREPARATION...",60660.0
4,2352-2358 N MILWAUKEE AVE,EAST ROOM,CHICAGO,EAST ROOM,Liquor,2013-11-18T00:00:00.000,1375515,License Re-Inspection,41.923873,2263696.0,"{'type': 'Point', 'coordinates': [-87.69916285...",-87.699163,Pass,Risk 3 (Low),IL,,60647.0
5,4140 W Addison ST,St. Viator,CHICAGO,St. Viator,School,2014-04-14T00:00:00.000,1434800,Canvass,41.946440,1878519.0,"{'type': 'Point', 'coordinates': [-87.73073817...",-87.730738,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60641.0
6,5657 N LINCOLN AVE,KABAB HOUSE,CHICAGO,"KABAB HOUSE, INC",Restaurant,2013-06-14T00:00:00.000,1106632,Canvass Re-Inspection,41.984405,1979795.0,"{'type': 'Point', 'coordinates': [-87.69705038...",-87.697050,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60659.0
7,1234 N HALSTED ST,SUBWAY,CHICAGO,SUBWAY 50886,Restaurant,2017-01-11T00:00:00.000,1981312,Canvass,41.904502,2115048.0,"{'type': 'Point', 'coordinates': [-87.64826686...",-87.648267,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60642.0
8,3700 S LOWE,BRIDGEPORT CATHOLIC SCHOOL(CAMPUS),CHICAGO,BRIDGEPORT CATHOLIC SCHOOL (CAMPUS),School,2014-10-28T00:00:00.000,1501267,Canvass,41.827056,2224981.0,"{'type': 'Point', 'coordinates': [-87.64250380...",-87.642504,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60609.0
9,2506 N LARAMIE AVE,Cuernavaca Bakery,CHICAGO,Cuernavaca Bakery,Bakery,2017-12-21T00:00:00.000,2129593,Canvass,41.926211,2192872.0,"{'type': 'Point', 'coordinates': [-87.75637041...",-87.756370,Out of Business,Risk 2 (Medium),IL,,60639.0


Note that these inspection ids are *not* sorted, yet we can still use slice notation.
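
One detail worth checking by hand is how the two accessors treat the end of a slice. This sketch uses a hypothetical four-row sample with an unsorted but unique index: the label slice includes its endpoint, while the positional slice follows the usual Python convention and excludes it.

```python
import pandas as pd

# Hypothetical sample with an unsorted, unique integer index
sample = pd.DataFrame(
    {"results": ["Pass", "Pass", "Pass", "Fail"]},
    index=pd.Index([68091, 1482935, 1447916, 1978933], name="inspection_id"),
)

by_label = sample.loc[:1447916]   # label slice: includes the row labelled 1447916
by_position = sample.iloc[:2]     # positional slice: stops before position 2

print(len(by_label))      # 3
print(len(by_position))   # 2
```

Label slicing on an unsorted index relies on the labels being unique and present in the index.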

Of course, we can also combine row and column indexing.

In [40]:
dta.iloc[:5, [0, 5]]

Unnamed: 0,address,inspection_date
0,2804 N CLARK ST,2010-02-01T00:00:00.000
1,6744 N SHERIDAN RD,2015-08-21T00:00:00.000
2,160-164 E SUPERIOR ST,2015-07-28T00:00:00.000
3,5844-5846 N BROADWAY,2017-01-04T00:00:00.000
4,2352-2358 N MILWAUKEE AVE,2013-11-18T00:00:00.000


In [41]:
dta.loc[:68091, ["address", "inspection_date"]]

Unnamed: 0,address,inspection_date
0,2804 N CLARK ST,2010-02-01T00:00:00.000
1,6744 N SHERIDAN RD,2015-08-21T00:00:00.000
2,160-164 E SUPERIOR ST,2015-07-28T00:00:00.000
3,5844-5846 N BROADWAY,2017-01-04T00:00:00.000
4,2352-2358 N MILWAUKEE AVE,2013-11-18T00:00:00.000
5,4140 W Addison ST,2014-04-14T00:00:00.000
6,5657 N LINCOLN AVE,2013-06-14T00:00:00.000
7,1234 N HALSTED ST,2017-01-11T00:00:00.000
8,3700 S LOWE,2014-10-28T00:00:00.000
9,2506 N LARAMIE AVE,2017-12-21T00:00:00.000
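
As a compact illustration of selecting rows and columns together, here is a sketch on a hypothetical three-row sample loosely based on the rows shown earlier. The first selection uses positions on both axes, the second uses labels on both axes.

```python
import pandas as pd

# Hypothetical three-row sample indexed by inspection_id
sample = pd.DataFrame(
    {
        "address": ["2804 N CLARK ST", "6744 N SHERIDAN RD", "160-164 E SUPERIOR ST"],
        "results": ["Pass", "Pass", "Pass"],
        "inspection_date": ["2010-02-01", "2015-08-21", "2015-07-28"],
    },
    index=pd.Index([68091, 1482935, 1447916], name="inspection_id"),
)

# Positions: first two rows, columns at positions 0 and 2
by_position = sample.iloc[:2, [0, 2]]

# Labels: two named rows, two named columns
by_label = sample.loc[[68091, 1447916], ["address", "inspection_date"]]

print(by_position)
print(by_label)
```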


## Cleaning Data for Types

So far, we've explicitly set an index. Next, we may want to convert the dates to datetime types. Here we'll use the **apply** method to apply a function to each element of a Series.

In [None]:
dta.inspection_date = dta.inspection_date.apply(pd.to_datetime)

In [None]:
dta.inspection_date
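
As an aside, `pd.to_datetime` also accepts a whole Series, which avoids calling the function once per element. The sketch below compares the two spellings on a hypothetical two-element Series that uses the timestamp format from the file.

```python
import pandas as pd

# Hypothetical sample using the timestamp format seen in the file
dates = pd.Series(["2010-02-01T00:00:00.000", "2015-08-21T00:00:00.000"])

elementwise = dates.apply(pd.to_datetime)  # calls to_datetime once per element
vectorized = pd.to_datetime(dates)         # parses the whole Series in one call

print(vectorized.dtype)
print((elementwise == vectorized).all())   # both spellings give the same timestamps
print(vectorized.dt.year.tolist())         # the .dt accessor works on datetime columns
```

On a column with thousands of rows, the single call is usually the faster of the two.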

Now let's cast zip code from a float to a string. Some zip codes can start with 0 (not in Chicago), and we need to account for that.

In [43]:
import numpy as np


def float_to_zip(zip_code):
    if np.isnan(zip_code):
        return np.nan
    
    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code

Here we use Python's **string formatting** facilities to convert from a numeric type to a string. Some of the zip codes are empty strings in the file. Pandas uses numpy's `NaN` to indicate missingness, so we'll return it here.
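
The format specification can be tried on its own. This is a small sketch with a made-up value (`zip_as_float`) showing what `{:05.0f}` produces.

```python
zip_as_float = 1234.0  # hypothetical zip code that lost its leading zero

# 0 -> pad with zeros, 5 -> total width, .0 -> no decimals, f -> fixed-point float
padded = "{:05.0f}".format(zip_as_float)

print(padded)                       # 01234
print("{:05.0f}".format(60657.0))   # 60657
print(f"{zip_as_float:05.0f}")      # 01234, the same spec in an f-string
```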

In [44]:
dta.zip = dta.zip.apply(float_to_zip)

In [45]:
dta.head()

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657
1,6744 N SHERIDAN RD,RICE THAI CAFE,CHICAGO,RICE THAI CAFE,Restaurant,2015-08-21T00:00:00.000,1482935,Canvass Re-Inspection,42.004881,2354674.0,"{'type': 'Point', 'coordinates': [-87.66101071...",-87.661011,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60626
2,160-164 E SUPERIOR ST,GINO'S EAST PIZZERIA,CHICAGO,GINO'S EAST PIZZERIA,Restaurant,2015-07-28T00:00:00.000,1447916,Suspected Food Poisoning Re-inspection,41.895863,1697132.0,"{'type': 'Point', 'coordinates': [-87.62325304...",-87.623253,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60611
3,5844-5846 N BROADWAY,RAS DASHEN ETHIOPIAN RESTAURANT INC,CHICAGO,RAS DASHEN ETHIOPIAN RESTAURANT INC,Restaurant,2017-01-04T00:00:00.000,1978933,Canvass,41.988326,1122395.0,"{'type': 'Point', 'coordinates': [-87.66036036...",-87.66036,Fail,Risk 1 (High),IL,"16. FOOD PROTECTED DURING STORAGE, PREPARATION...",60660
4,2352-2358 N MILWAUKEE AVE,EAST ROOM,CHICAGO,EAST ROOM,Liquor,2013-11-18T00:00:00.000,1375515,License Re-Inspection,41.923873,2263696.0,"{'type': 'Point', 'coordinates': [-87.69916285...",-87.699163,Pass,Risk 3 (Low),IL,,60647


DataFrames have a `dtypes` attribute for checking the data types. Pandas relies on NumPy's dtype objects. Here we see that the `object` dtype is used to hold strings. This is for technical reasons.

In [46]:
dta.dtypes[['inspection_date', 'zip']]

inspection_date    object
zip                object
dtype: object

We can also convert variables' types using `astype`. Here, we'll explicitly cast to the pandas Categorical type, which pandas provides on top of the native NumPy dtypes.

In [47]:
dta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
address            25000 non-null object
aka_name           24731 non-null object
city               24978 non-null object
dba_name           25000 non-null object
facility_type      24565 non-null object
inspection_date    25000 non-null object
inspection_id      25000 non-null int64
inspection_type    25000 non-null object
latitude           24822 non-null float64
license_           24999 non-null float64
location           24822 non-null object
longitude          24822 non-null float64
results            25000 non-null object
risk               24991 non-null object
state              24995 non-null object
violations         19538 non-null object
zip                24991 non-null object
dtypes: float64(3), int64(1), object(13)
memory usage: 3.2+ MB


In [48]:
dta.results = dta.results.astype('category')
dta.risk = dta.risk.astype('category')
dta.inspection_type = dta.inspection_type.astype('category')
dta.facility_type = dta.facility_type.astype('category')

Once the columns are categorical, we can use the `select_dtypes` method to pull out a DataFrame that contains only the requested types.

In [49]:
dta.select_dtypes(['category'])

Unnamed: 0,facility_type,inspection_type,results,risk
0,Restaurant,Canvass,Pass,Risk 2 (Medium)
1,Restaurant,Canvass Re-Inspection,Pass,Risk 1 (High)
2,Restaurant,Suspected Food Poisoning Re-inspection,Pass,Risk 1 (High)
3,Restaurant,Canvass,Fail,Risk 1 (High)
4,Liquor,License Re-Inspection,Pass,Risk 3 (Low)
5,School,Canvass,Pass,Risk 1 (High)
6,Restaurant,Canvass Re-Inspection,Pass,Risk 1 (High)
7,Restaurant,Canvass,Pass,Risk 1 (High)
8,School,Canvass,Pass,Risk 1 (High)
9,Bakery,Canvass,Out of Business,Risk 2 (Medium)
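
To get a feel for what a categorical column stores, here is a sketch on a hypothetical four-row sample. It uses the `.cat` accessor, which exposes the distinct categories and the integer code assigned to each row.

```python
import pandas as pd

# Hypothetical sample with one column worth treating as categorical
sample = pd.DataFrame({
    "results": ["Pass", "Fail", "Pass", "Out of Business"],
    "zip": ["60657", "60660", "60611", "60639"],
})
sample["results"] = sample["results"].astype("category")

print(sample["results"].cat.categories)   # the distinct labels, stored once
print(sample["results"].cat.codes)        # one small integer code per row
print(sample.select_dtypes(["category"]).columns.tolist())
```

Storing each label once and an integer per row is why categorical columns tend to use less memory than `object` columns with many repeated strings.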


Finally, we might want to exclude a column like `location` since we have the separate `latitude` and `longitude` columns. We can delete columns in a DataFrame using Python's built-in `del` statement.

In [50]:
del dta['location']
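
An alternative to `del` is the `drop` method, which returns a new DataFrame instead of modifying the existing one. A minimal sketch on a hypothetical one-row sample:

```python
import pandas as pd

# Hypothetical one-row sample with a redundant location column
sample = pd.DataFrame({
    "latitude": [41.932921],
    "longitude": [-87.645155],
    "location": ["{'type': 'Point'}"],
})

trimmed = sample.drop(columns=["location"])  # returns a new DataFrame

print(list(sample.columns))    # the original still has all three columns
print(list(trimmed.columns))   # the copy no longer has location
```

Which one to use is mostly a matter of taste: `del` is concise, while `drop` fits into method chains and leaves the original untouched.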

## Dealing with Types using csv Reader

We can do everything that we did above by providing options to `pd.read_csv`.

We saw before that `csv` reads everything in as strings, `json` does some type conversion with facility for doing more, and `pandas` does a bit more type conversion but it isn't always what we want. For example, we want the zip codes to stay strings.

Let's take a look at how to do this with pandas `read_csv`. First, we can use the `parse_dates` argument to read in the larger inspections data sample and tell pandas that one of our columns is a date column. We'll also go ahead and make `inspection_id` the index.

In [51]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv", 
    index_col="inspection_id",
    parse_dates=["inspection_date"]
)

Next, we want to turn the zip codes into strings. Unlike the earlier version, this function has to assume that its input (from the file) is a string rather than a float.

In [52]:
import numpy as np


def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan
    
    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code

In [53]:
float_to_zip('1234')

'01234'

In [54]:
float_to_zip('123456')

'123456'

As another example of defensive programming, we have to make sure that empty strings are handled.

In [55]:
float_to_zip('')

nan

We can supply this function to the `converters` argument.

In [56]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    converters={
        'zip': float_to_zip
    },
)

In [57]:
dta.head()

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657
1,6744 N SHERIDAN RD,RICE THAI CAFE,CHICAGO,RICE THAI CAFE,Restaurant,2015-08-21T00:00:00.000,1482935,Canvass Re-Inspection,42.004881,2354674.0,"{'type': 'Point', 'coordinates': [-87.66101071...",-87.661011,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60626
2,160-164 E SUPERIOR ST,GINO'S EAST PIZZERIA,CHICAGO,GINO'S EAST PIZZERIA,Restaurant,2015-07-28T00:00:00.000,1447916,Suspected Food Poisoning Re-inspection,41.895863,1697132.0,"{'type': 'Point', 'coordinates': [-87.62325304...",-87.623253,Pass,Risk 1 (High),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60611
3,5844-5846 N BROADWAY,RAS DASHEN ETHIOPIAN RESTAURANT INC,CHICAGO,RAS DASHEN ETHIOPIAN RESTAURANT INC,Restaurant,2017-01-04T00:00:00.000,1978933,Canvass,41.988326,1122395.0,"{'type': 'Point', 'coordinates': [-87.66036036...",-87.66036,Fail,Risk 1 (High),IL,"16. FOOD PROTECTED DURING STORAGE, PREPARATION...",60660
4,2352-2358 N MILWAUKEE AVE,EAST ROOM,CHICAGO,EAST ROOM,Liquor,2013-11-18T00:00:00.000,1375515,License Re-Inspection,41.923873,2263696.0,"{'type': 'Point', 'coordinates': [-87.69916285...",-87.699163,Pass,Risk 3 (Low),IL,,60647
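
To confirm that a converter receives the raw text of each field, the sketch below reads a hypothetical in-memory CSV (via `io.StringIO`) and records the type of every value passed in. The `to_zip` helper and the `csv_text` contents are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-column CSV; the second row has an empty zip field
csv_text = "name,zip\nA,60657.0\nB,\nC,1234.0\n"

seen_types = []


def to_zip(raw):
    seen_types.append(type(raw).__name__)  # record what pandas hands us
    try:
        return "{:05.0f}".format(float(raw))
    except ValueError:  # empty field
        return np.nan


sample = pd.read_csv(io.StringIO(csv_text), converters={"zip": to_zip})

print(seen_types)
print(sample["zip"].tolist())
```

Because the values arrive as strings, the `float(...)` call inside the converter is what makes the numeric formatting possible.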


To exclude `location`, we can take advantage of the fact that the `usecols` argument accepts a function. Pandas calls it with each column name and keeps the columns for which it returns `True`.

In [58]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    usecols=lambda col: col != 'location'
)

Here we are using a **lambda function** that returns `False` for the `location` column name and `True` for every other name. Lambda functions are also known as anonymous functions, because they don't have a name. Passing a small throwaway function as an argument is precisely their intended use.

Here we use a function `lambda x: x` to map the identity function over a list.

In [59]:
list(map(lambda x: x, [1, 2, 3]))

[1, 2, 3]
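
A lambda is interchangeable with a named function that has the same body. The sketch below passes both to `usecols` on a hypothetical one-row CSV held in memory (`keep_column` and `csv_text` are illustrative names) and checks that the results match.

```python
import io

import pandas as pd

# Hypothetical one-row CSV with a column we want to skip
csv_text = "inspection_id,location,results\n68091,somewhere,Pass\n"


def keep_column(col):
    """Named equivalent of `lambda col: col != 'location'`."""
    return col != "location"


with_lambda = pd.read_csv(io.StringIO(csv_text), usecols=lambda col: col != "location")
with_named = pd.read_csv(io.StringIO(csv_text), usecols=keep_column)

print(list(with_lambda.columns))
print(with_lambda.equals(with_named))
```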

Finally, in a few cases we may want to take advantage of the pandas native `categorical` type. We can use the `dtype` argument for this, passing a dictionary of type mappings.

In [60]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    dtype={
        'results': 'category',
        'risk': 'category',
        'inspection_type': 'category',
        'facility_type': 'category'
    }
)

In [61]:
dta.risk.head()

0    Risk 2 (Medium)
1      Risk 1 (High)
2      Risk 1 (High)
3      Risk 1 (High)
4       Risk 3 (Low)
Name: risk, dtype: category
Categories (4, object): [All, Risk 1 (High), Risk 2 (Medium), Risk 3 (Low)]

## Exercise

Put all of the above `read_csv` options together in a single call to `read_csv`.

In [None]:
# Type your solution here

In [None]:
# %load solutions/pandas_read_csv.py
import numpy as np
import pandas as pd


def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan

    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code


dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    index_col='inspection_id',
    parse_dates=['inspection_date'],
    converters={
        'zip': float_to_zip
    },
    usecols=lambda col: col != 'location',
    dtype={
        'results': 'category',
        'risk': 'category',
        'inspection_type': 'category',
        'facility_type': 'category'
    }
)


assert float_to_zip('1234') == '01234'
assert float_to_zip('123456') == '123456'
assert np.isnan(float_to_zip(''))


## String Cleaning

Ok, let's start to dig into the data a little bit more. One of the things we're going to be really interested in exploring is the free text of the violations field.

The first thing to notice is that the violations field has null values in it.

In [63]:
dta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
address            25000 non-null object
aka_name           24731 non-null object
city               24978 non-null object
dba_name           25000 non-null object
facility_type      24565 non-null category
inspection_date    25000 non-null object
inspection_id      25000 non-null int64
inspection_type    25000 non-null category
latitude           24822 non-null float64
license_           24999 non-null float64
location           24822 non-null object
longitude          24822 non-null float64
results            25000 non-null category
risk               24991 non-null category
state              24995 non-null object
violations         19538 non-null object
zip                24991 non-null float64
dtypes: category(4), float64(4), int64(1), object(8)
memory usage: 2.6+ MB


We may want to ask ourselves if these values are missing at random or if there is some reason there's no written violation field.

In [64]:
dta.loc[dta.violations.isnull()].head()

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
4,2352-2358 N MILWAUKEE AVE,EAST ROOM,CHICAGO,EAST ROOM,Liquor,2013-11-18T00:00:00.000,1375515,License Re-Inspection,41.923873,2263696.0,"{'type': 'Point', 'coordinates': [-87.69916285...",-87.699163,Pass,Risk 3 (Low),IL,,60647.0
9,2506 N LARAMIE AVE,Cuernavaca Bakery,CHICAGO,Cuernavaca Bakery,Bakery,2017-12-21T00:00:00.000,2129593,Canvass,41.926211,2192872.0,"{'type': 'Point', 'coordinates': [-87.75637041...",-87.75637,Out of Business,Risk 2 (Medium),IL,,60639.0
10,131 N CLINTON ST BLDG BOOTH26,PAPPARDELLE'S PASTA,CHICAGO,PAPPARDELLE'S PASTA,Restaurant,2011-02-23T00:00:00.000,565226,Canvass,41.884188,2022121.0,"{'type': 'Point', 'coordinates': [-87.64111966...",-87.64112,Out of Business,Risk 1 (High),IL,,60661.0
13,200 W MONROE ST,JERSEY MIKE'S SUBS,CHICAGO,JERSEY MIKE'S SUBS,Restaurant,2016-03-28T00:00:00.000,1684292,Canvass Re-Inspection,41.880769,2398227.0,"{'type': 'Point', 'coordinates': [-87.63387497...",-87.633875,Pass,Risk 1 (High),IL,,60606.0


It looks like we're ok: in the rows shown, the missing violations belong to inspections that passed or to businesses that had closed. The next thing to notice is that the violations field often holds many violations from the same visit.

In [65]:
with pd.option_context("display.max_colwidth", 500):
    print(dta.violations.head())

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    NaN
1    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: BROKEN RUBBER GASKET INSIDE THE DOORS OF DISPLAY FRONT COOLER.EXCESS GRAY TAPE HOLDING THE BROKEN RUBBER GASKET WITH DEBRIS AND BLACK SUBSTANCE ON IT,INSTRUCTED TO REPLACE. NEWSPAPER AND CARDBOARDS ON SHELVES THROUGHOUT THE PREMISES(FIRST AND SECOND FLOOR)CARD BOARDS AREA GREASY AND DUST BUILD-UP,INSTRUCTED TO REMOVE,SURFACE MUST BE SMOOTH,CLEANABLE AND NON-ABSORBENT MATERIAL.    | 33. FOOD

Let's split these out to make a longer DataFrame where each violation is a single row. Pandas provides a nice way to munge string data through the `str` accessor on string columns.

```python
dta.violations.str.<TAB>
```
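
As a starting point for exploring the accessor, here is a sketch on a hypothetical three-element Series (`notes`). It shows a few `str` methods and how missing values pass through them as `NaN` rather than raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical free-text entries; NaN stands in for a missing violations field
notes = pd.Series(["32. FOOD | 33. EQUIPMENT", np.nan, "16. STORAGE"])

lowered = notes.str.lower()            # element-wise lower-casing
has_food = notes.str.contains("FOOD")  # True/False per row, NaN where missing
lengths = notes.str.len()              # length of each string

print(lowered)
print(has_food)
print(lengths)
```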

## Exercise

Let's see how many violations we have per visit. What does the distribution of violations look like? Explore the methods on the `str` accessor and, perhaps, the `quantile` method.

In [None]:
# Type your solution here

In [None]:
# %load solutions/violation_distribution.py
import numpy as np
import pandas as pd


def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan

    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code


dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    index_col='inspection_id',
    parse_dates=['inspection_date'],
    converters={
        'zip': float_to_zip
    },
    usecols=lambda col: col != 'location',
    dtype={
        'results': 'category',
        'risk': 'category',
        'inspection_type': 'category',
        'facility_type': 'category'
    }
)


quantiles = [0, .05, .25, .50, .75, .95, 1.00]
(dta.violations.str.count(r"\|") + 1).quantile(quantiles)


In [70]:
quantiles = [0, .05, .25, .50, .75, .95, 1.00]
(dta.violations.str.count(r"\|") + 1).quantile(quantiles)

0.00     1.0
0.05     1.0
0.25     2.0
0.50     3.0
0.75     5.0
0.95     9.0
1.00    23.0
Name: violations, dtype: float64

In [71]:
(dta.violations.str.count(r"\|") + 1)

0         NaN
1         6.0
2         5.0
3         6.0
4         NaN
5         2.0
6         4.0
7         2.0
8         2.0
9         NaN
10        NaN
11        6.0
12        1.0
13        NaN
14        NaN
15        1.0
16        7.0
17        NaN
18        3.0
19        NaN
20        4.0
21        2.0
22        NaN
23        1.0
24        6.0
25        1.0
26       13.0
27        NaN
28        2.0
29        6.0
         ... 
24970     3.0
24971     3.0
24972     3.0
24973     1.0
24974     3.0
24975     2.0
24976     8.0
24977     9.0
24978     5.0
24979     3.0
24980     2.0
24981     8.0
24982     3.0
24983     NaN
24984     5.0
24985     9.0
24986     2.0
24987     2.0
24988     2.0
24989     2.0
24990     6.0
24991     NaN
24992     2.0
24993     6.0
24994     3.0
24995     6.0
24996     5.0
24997     1.0
24998     9.0
24999     5.0
Name: violations, Length: 25000, dtype: float64
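The counting trick used above can be checked on a tiny, hypothetical Series: the number of violations is the number of pipe separators plus one, and visits with no recorded violations come through as `NaN`. This is only a sketch on made-up data.

```python
import pandas as pd

# Hypothetical toy data: three visits, the last with no violations recorded
toy = pd.Series(["a | b | c", "a", None])

# str.count always treats the pattern as a regular expression,
# so the pipe has to be escaped
n_violations = toy.str.count(r"\|") + 1

# quantile skips missing values by default
median = n_violations.quantile(0.5)
```

If a count of zero is preferred over `NaN` for visits without violations, `n_violations.fillna(0)` is one option.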

OK, the number of violations per visit is manageable: the median is 3 and the maximum is 23. Let's split the violations and then turn them into a long DataFrame with a single row for each violation within each visit.

In [73]:
dta = dta.set_index('inspection_id')
violations = dta.violations.str.split(r"\|", expand=True)
violations.head()

| inspection_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 68091 | | | | | | | | | | | ... | | | | | | | | | | |
| 1482935 | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL... | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENS... | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GO... | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONST... | 36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES O... | 43. FOOD (ICE) DISPENSING UTENSILS, WASH CLOT... | | | | | ... | | | | | | | | | | |
| 1447916 | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL... | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENS... | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GO... | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONST... | 36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES O... | | | | | | ... | | | | | | | | | | |
| 1978933 | 16. FOOD PROTECTED DURING STORAGE, PREPARATION... | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPER... | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENS... | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GO... | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONST... | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED A... | | | | | ... | | | | | | | | | | |
| 1375515 | | | | | | | | | | | ... | | | | | | | | | | |
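A smaller, hypothetical example makes the shape of this result easier to see. With `expand=True`, each piece of the split gets its own column, shorter rows are padded with missing values, and a missing input produces a row that is missing everywhere. The index values and strings below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data indexed like the inspections
toy = pd.Series(
    ["1. A | 2. B | 3. C", "4. D", None],
    index=pd.Index([101, 102, 103], name="inspection_id"),
)

# One column per violation; the widest row decides the number of columns
wide = toy.str.split(r"\|", expand=True)

# The whitespace around each pipe is kept by the split
second_piece = wide.loc[101, 1]
```

Note that `second_piece` still carries the spaces that surrounded the pipe, which is why a later `strip` is useful.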


When we `unstack` the DataFrame, we get back a Series indexed by what's called a `MultiIndex`. This index has two *levels*. One is the original `inspection_id`. The other holds the rather meaningless column names, which are just the position of each violation within its visit.

In [74]:
violations.unstack().head()

   inspection_id
0  68091                                                          NaN
   1482935          32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
   1447916          32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
   1978933          16. FOOD PROTECTED DURING STORAGE, PREPARATION...
   1375515                                                        NaN
dtype: object
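To see the two levels more concretely, here is a sketch on a small, hypothetical wide DataFrame. Unstacking moves the column labels into the outer index level and keeps `inspection_id` as the inner level, so the result is a Series with one entry per (column, inspection) pair.

```python
import pandas as pd

# Hypothetical wide frame: one column per violation position
wide = pd.DataFrame(
    {0: ["1. A", "4. D"], 1: ["2. B", None]},
    index=pd.Index([101, 102], name="inspection_id"),
)

long = wide.unstack()              # Series with a two-level MultiIndex
n_levels = long.index.nlevels      # number of index levels
level_names = list(long.index.names)

# Select one value by (column label, inspection_id)
value = long.loc[(1, 101)]
```

The outer level has no name because the columns of `wide` were unnamed, which matches the blank header in the output above.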

Let's get rid of the empty rows first.

In [75]:
violations = violations.unstack().dropna()

In [77]:
violations.head()

   inspection_id
0  1482935          32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
   1447916          32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
   1978933          16. FOOD PROTECTED DURING STORAGE, PREPARATION...
   1434800          33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
   1106632          32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
dtype: object

Now we can drop the column name level, which we don't need.

In [78]:
violations.reset_index(level=0, drop=True, inplace=True)

In [79]:
violations.head()

inspection_id
1482935    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
1447916    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
1978933    16. FOOD PROTECTED DURING STORAGE, PREPARATION...
1434800    33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
1106632    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
dtype: object
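As an aside, pandas 0.25 and later also provide `Series.explode`, which reaches a similar long format in a single chain. This is a different technique from the `unstack` approach used in this section, sketched here on made-up data. One difference: the result is ordered by visit rather than by violation position.

```python
import pandas as pd

# Hypothetical toy data indexed like the inspections
toy = pd.Series(
    ["1. A | 2. B", "4. D", None],
    index=pd.Index([101, 102, 103], name="inspection_id"),
)

long = (
    toy.str.split("|")   # each element becomes a list of pieces
       .explode()        # one row per list item; the index label is repeated
       .dropna()         # visits without violations drop out
       .str.strip()      # remove whitespace left around the pipes
)
```

Either route ends with one row per violation, keyed by `inspection_id`.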

One last cleaning step may be helpful here. When we split on the pipe (`|`), we likely kept the whitespace that surrounded it. We can remove that with `strip`.

In [80]:
violations.str.startswith(" ").any()

True

In [81]:
violations.str.strip().head()

inspection_id
1482935    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
1447916    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
1978933    16. FOOD PROTECTED DURING STORAGE, PREPARATION...
1434800    33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
1106632    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
dtype: object

In [82]:
violations = violations.str.strip()

In [None]:
((violations.str.startswith(" ").any()) | 
 (violations.str.endswith(" ").any()))

Later, we'll see how to combine these violations back with our original data to do some analysis.

## Working with Dates and Categoricals

Above, we used the `str` accessor on a string column of the DataFrame. This isn't the only convenient accessor that pandas provides. There is also the `dt` accessor for datetime types and the `cat` accessor for categorical types.

```python
dta.inspection_date.dt.<TAB>
```

In [None]:
dta.inspection_date.head()

In [None]:
dta.inspection_date.dt.month.head()
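The `dt` accessor is only available when the column has a datetime dtype, for example after parsing with `parse_dates` or `pd.to_datetime`. A minimal sketch on a few made-up dates:

```python
import pandas as pd

# Hypothetical dates, not taken from the inspection file
dates = pd.Series(pd.to_datetime(["2016-01-15", "2017-06-30", "2017-12-31"]))

years = dates.dt.year                # calendar year of each date
months = dates.dt.month              # month number, 1 through 12
weekday_names = dates.dt.day_name()  # e.g. "Friday"
```

Components like these are handy as grouping keys, for instance to count inspections per month.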

Now, let's take a look at the categorical types.

```python
dta.risk.cat.<TAB>
```

In [None]:
dta.risk.head()

In [None]:
dta.risk.cat.codes.head()
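A categorical column stores each distinct label once and represents the rows as small integer codes. The sketch below uses made-up labels rather than the real `risk` values to show how `cat.categories` and `cat.codes` relate.

```python
import pandas as pd

# Hypothetical labels; the last entry is missing
risk = pd.Series(["high", "low", "high", None], dtype="category")

categories = list(risk.cat.categories)  # the distinct labels, sorted
codes = list(risk.cat.codes)            # integer position of each row's label

# A code of -1 marks a missing value
```

Here `categories` is `['high', 'low']` and `codes` is `[0, 1, 0, -1]`.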