# Pandas Cookbook Chapter 07

See the [Cookbook](http://github.com/jvns/pandas-cookbook) for the original material. Let's start on chapter 7.

In [2]:
# render plot inline
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Clean up messy data
First of all, how do we know whether the data is messy?

In [14]:
requests = pd.read_csv('data/311-service-requests.csv')
requests.head()

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


| Unnamed: 0 | Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26589651 | 10/31/2013 02:08:41 AM | | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | ... | | | | | | | | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
| 1 | 26593698 | 10/31/2013 02:01:04 AM | | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | ... | | | | | | | | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
| 2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | ... | | | | | | | | 40.84333 | -73.939144 | (40.84332975466513, -73.93914371913482) |
| 3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | ... | | | | | | | | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
| 4 | 26590930 | 10/31/2013 01:53:44 AM | | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | ... | | | | | | | | 40.807691 | -73.947387 | (40.80769092704951, -73.94738703491433) |


Sometimes we get a warning about mixed types in certain columns. For example, the warning above reports that the 'Incident Zip' column (column 8) has mixed types. To get a sense of whether a column has problems, we can:

1. detect whether there are NaN values;
2. use the unique() function to see all the values, which is useful when the values are strings;
3. plot a histogram to get a sense of the distribution.
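
The checks listed above can be tried on a small made-up Series before running them on the full dataset. This is only a minimal sketch: the values in `toy` are invented for illustration and are not taken from the 311 data, and a count of value types stands in for the histogram so the example needs no plotting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mixing floats, strings, and a missing value
toy = pd.Series([11432.0, '10032', np.nan, 'NO CLUE', 10023.0])

# 1. Count the missing values
n_missing = toy.isnull().sum()

# 2. List the distinct values (NaN is included)
distinct = toy.unique()

# 3. Count how many values of each Python type are present;
#    more than one type name signals a mixed-type column
type_counts = toy.dropna().map(lambda v: type(v).__name__).value_counts()

print(n_missing)
print(distinct)
print(type_counts)
```

Seeing both `float` and `str` in `type_counts` is the same symptom the mixed-types warning points at.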

### Detect NaN values

In [15]:
len(requests[requests['Incident Zip'].isnull()]) # there are 12262 missing values (NaN)

12262

### Get a glance at the data

In [16]:
requests['Incident Zip'].unique()

array([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, 11372.0, 11419.0,
       11417.0, 10011.0, 11225.0, 11218.0, 10003.0, 10029.0, 10466.0,
       11219.0, 10025.0, 10310.0, 11236.0, nan, 10033.0, 11216.0, 10016.0,
       10305.0, 10312.0, 10026.0, 10309.0, 10036.0, 11433.0, 11235.0,
       11213.0, 11379.0, 11101.0, 10014.0, 11231.0, 11234.0, 10457.0,
       10459.0, 10465.0, 11207.0, 10002.0, 10034.0, 11233.0, 10453.0,
       10456.0, 10469.0, 11374.0, 11221.0, 11421.0, 11215.0, 10007.0,
       10019.0, 11205.0, 11418.0, 11369.0, 11249.0, 10005.0, 10009.0,
       11211.0, 11412.0, 10458.0, 11229.0, 10065.0, 10030.0, 11222.0,
       10024.0, 10013.0, 11420.0, 11365.0, 10012.0, 11214.0, 11212.0,
       10022.0, 11232.0, 11040.0, 11226.0, 10281.0, 11102.0, 11208.0,
       10001.0, 10472.0, 11414.0, 11223.0, 10040.0, 11220.0, 11373.0,
       11203.0, 11691.0, 11356.0, 10017.0, 10452.0, 10280.0, 11217.0,
       10031.0, 11201.0, 11358.0, 10128.0, 11423.0, 10039.0, 10010.0,
       11209.0,

From the above we know that we have:

1. Floating point numbers and strings;
2. NaN;
3. Strange values such as 'NO CLUE'.

As a first step in cleaning up the data, we can pass options when loading it so that the 'Incident Zip' column is read as strings and strange values are treated as missing values (NaN). Let's load the data again.
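
To see what the two loading options do in isolation, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The CSV text and the name `csv_text` are made up for this example; only `na_values` and `dtype` are the options discussed above.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file
csv_text = "Unique Key,Incident Zip\n1,11432\n2,NO CLUE\n3,10032-1234\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=['NO CLUE'],        # treat this marker as a missing value
    dtype={'Incident Zip': str},  # keep zip codes as text, not floats
)

print(df['Incident Zip'].tolist())
print(df['Incident Zip'].isnull().sum())
```

Reading the column as text keeps values such as '10032-1234' intact, so they can be cleaned up explicitly afterwards.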

### Reload data with options

In [17]:
naValues = ['NO CLUE']
requests = pd.read_csv('data/311-service-requests.csv', na_values=naValues, dtype={'Incident Zip': str})
len(requests[requests['Incident Zip'].isnull()]) # there are 12263 missing values (NaN), increased by 1 (NO CLUE -> NaN)

12263

In [35]:
requests['Incident Zip'].unique()

array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
       '11417', '10011', '11225', '11218', '10003', '10029', '10466',
       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
       '10305', '10312', '10026', '10309', '10036', '11433', '11235',
       '11213', '11379', '11101', '10014', '11231', '11234', '10457',
       '10459', '10465', '11207', '10002', '10034', '11233', '10453',
       '10456', '10469', '11374', '11221', '11421', '11215', '10007',
       '10019', '11205', '11418', '11369', '11249', '10005', '10009',
       '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212',
       '10022', '11232', '11040', '11226', '10281', '11102', '11208',
       '10001', '10472', '11414', '11223', '10040', '11220', '11373',
       '11203', '11691', '11356', '10017', '10452', '10280', '11217',
       '10031', '11201', '11358', '10128', '11423', '10039', '10010',
       '11209',

### Deal with strange values

Now we have converted the strange values to NaN, but we still have other issues:

1. values like '00000' and '000000', which don't make sense;
2. other zip codes with '-'.

The zip codes that don't make sense can be converted to NaN. Those with '-' follow the pattern of a 5-digit zip code followed by another 4 digits, so we just take the first 5 digits.

First, let's pull out the zip codes that are longer than 5 characters.

In [18]:
longZipCodes = requests['Incident Zip'].str.len() > 5
requests['Incident Zip'][longZipCodes].unique()

array(['77092-2016', '55164-0737', '000000', '11549-3650', '29616-0759',
       '35209-3114'], dtype=object)

Let's deal with things that don't make sense and convert them to NaN.

In [20]:
zeroZips = requests['Incident Zip'].str.startswith('00000', na=False) # na=False: missing values count as no match
requests[zeroZips]

| Unnamed: 0 | Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42600 | 26529313 | 10/22/2013 02:51:06 PM | | TLC | Taxi and Limousine Commission | Taxi Complaint | Driver Complaint | | 0 | EWR EWR | ... | | | | | | | | | | |
| 60843 | 26507389 | 10/17/2013 05:48:44 PM | | TLC | Taxi and Limousine Commission | Taxi Complaint | Driver Complaint | Street | 0 | 1 NEWARK AIRPORT | ... | | | | | | | | | | |


Convert these zip codes to NaN, since they don't make sense:

In [21]:
requests.loc[zeroZips, 'Incident Zip'] = np.nan

# Don't use requests[zeroZips]['Incident Zip'] = np.nan here: chained indexing
# may assign to a temporary copy and leave requests unchanged.

len(requests[requests['Incident Zip'].isnull()]) # there are 12265 missing values (NaN), increased by 2

12265
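
The mask-then-assign pattern used above can be checked on a toy frame. This is a minimal sketch: the frame `df` and its values are invented, and only the column name mirrors the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are made up for illustration
df = pd.DataFrame({'Incident Zip': ['11432', '00000', None, '000000']})

# Boolean mask; na=False makes missing entries count as "no match"
zero_zips = df['Incident Zip'].str.startswith('00000', na=False)

# A single .loc call selects rows and column together,
# so the assignment writes into df itself
df.loc[zero_zips, 'Incident Zip'] = np.nan

print(df['Incident Zip'].isnull().sum())
```

The count includes the entry that was already missing plus the two all-zero codes, while '11432' is left untouched.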

Now let's deal with the zip codes that contain '-' by keeping only the first 5 digits.

In [23]:
# Find the zip codes containing '-' (missing values count as no match)
withDash = requests['Incident Zip'].str.contains('-', na=False)

# Keep only the first 5 characters of those zip codes
requests.loc[withDash, 'Incident Zip'] = requests.loc[withDash, 'Incident Zip'].str.slice(0, 5)

# Slicing the whole column would achieve the same result here:
# requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
requests['Incident Zip'].unique()

array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
       '11417', '10011', '11225', '11218', '10003', '10029', '10466',
       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
       '10305', '10312', '10026', '10309', '10036', '11433', '11235',
       '11213', '11379', '11101', '10014', '11231', '11234', '10457',
       '10459', '10465', '11207', '10002', '10034', '11233', '10453',
       '10456', '10469', '11374', '11221', '11421', '11215', '10007',
       '10019', '11205', '11418', '11369', '11249', '10005', '10009',
       '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212',
       '10022', '11232', '11040', '11226', '10281', '11102', '11208',
       '10001', '10472', '11414', '11223', '10040', '11220', '11373',
       '11203', '11691', '11356', '10017', '10452', '10280', '11217',
       '10031', '11201', '11358', '10128', '11423', '10039', '10010',
       '11209',

Now it looks much neater.

## Wrap everything together
Let's do it all again, this time with a function.

In [25]:
naValues = ['NO CLUE']
requests = pd.read_csv('data/311-service-requests.csv', na_values=naValues, dtype={'Incident Zip': str})


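def fixZipCodes(zips):
    """
    [Series] zips => [Series] zips
    
    All elements in zips are either NaN or string
    """
    # Keep the first 5 characters; this also trims the '-NNNN' suffix
    zips = zips.str.slice(0, 5)
    # All-zero zip codes are not real, so mark them as missing
    zips[zips == '00000'] = np.nan
    
    return zips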


requests['Incident Zip'] = fixZipCodes(requests['Incident Zip'])
requests['Incident Zip'].unique()

array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
       '11417', '10011', '11225', '11218', '10003', '10029', '10466',
       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
       '10305', '10312', '10026', '10309', '10036', '11433', '11235',
       '11213', '11379', '11101', '10014', '11231', '11234', '10457',
       '10459', '10465', '11207', '10002', '10034', '11233', '10453',
       '10456', '10469', '11374', '11221', '11421', '11215', '10007',
       '10019', '11205', '11418', '11369', '11249', '10005', '10009',
       '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212',
       '10022', '11232', '11040', '11226', '10281', '11102', '11208',
       '10001', '10472', '11414', '11223', '10040', '11220', '11373',
       '11203', '11691', '11356', '10017', '10452', '10280', '11217',
       '10031', '11201', '11358', '10128', '11423', '10039', '10010',
       '11209',
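
As a possible variation, the cleanup can be made stricter with a regular expression, so that anything that is not a 5-digit code (optionally followed by '-' and 4 digits) becomes NaN. This is a sketch of a different technique from the slicing used in `fixZipCodes`, not part of the original walkthrough; the function name `fix_zip_codes_strict` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def fix_zip_codes_strict(zips):
    """Hypothetical stricter variant: accept only 'NNNNN' or 'NNNNN-NNNN'."""
    # expand=False returns a Series; entries that don't match become NaN
    five = zips.str.extract(r'^(\d{5})(?:-\d{4})?$', expand=False)
    # An all-zero code is still not a real zip, so mark it as missing
    return five.mask(five == '00000')


# Made-up values covering each case
toy = pd.Series(['11432', '77092-2016', '000000', '00000', 'N/A', np.nan])
result = fix_zip_codes_strict(toy)
print(result.tolist())
```

Unlike plain slicing, this variant also turns malformed entries such as 'N/A' into NaN instead of keeping them, which may or may not be what a given analysis wants.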