# Clean addresses and dates in 311 no heat complaints

This notebook loads the source complaint data, parses zipcodes and apartment numbers from the 'Service Request Address' field, and derives year, month, date and hour columns from the 'Date/Time Opened' field. It writes a cleaned complaints file to data/processed.

In [174]:
# import packages
import pandas as pd
import re

In [175]:
# ignore chained assignment warning messages
pd.options.mode.chained_assignment = None

In [176]:
# import data
df = pd.read_csv('../data/source/311_Heat_Complaints.csv')

In [177]:
df.tail()

Unnamed: 0,SR Number,Date/Time Opened,Service Request Address,Service Request Status,Answer
16574,SR24-00139193,01/23/24 07:37 PM,10437 S HALE AVE<br>Chicago Illinois 60643,Open,
16575,SR24-00139626,01/23/24 09:56 PM,860 N DEWITT PL<br>Apt 1502<br>Chicago Illinoi...,Completed,
16576,SR24-00141986,01/24/24 10:14 AM,7311 S EAST END AVE<br>Chicago Illinois 60649,Open,
16577,SR24-00142172,01/24/24 10:30 AM,611 W BRIAR PL<br>103<br>Chicago Illinois 60657,Open,
16578,Total,16578,,,


In [178]:
# delete the total row (the row whose SR Number is 'Total')
df.drop(df[df['SR Number'] == 'Total'].index, inplace=True)

In [179]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16578 entries, 0 to 16577
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   SR Number                16578 non-null  object
 1   Date/Time Opened         16578 non-null  object
 2   Service Request Address  16573 non-null  object
 3   Service Request Status   16578 non-null  object
 4   Answer                   15028 non-null  object
dtypes: object(5)
memory usage: 647.7+ KB


## Parse zipcode from the address column

Splits the service request address on the br tag to isolate a clean street address portion, then uses a regex to find a five-digit zipcode pattern in the remaining address parts. Creates a new zipcode column with the five-digit zipcode.

In [180]:
# split Service Request Address by <br> into at most three columns (n=2 caps the number of splits)
df[['street_address', 'address_1', 'address_2']] = df['Service Request Address'].str.split('<br>', n=2, expand=True)

In [181]:
# replace None with an empty string
df.fillna('',inplace=True)

In [182]:
df.tail()

Unnamed: 0,SR Number,Date/Time Opened,Service Request Address,Service Request Status,Answer,street_address,address_1,address_2
16573,SR24-00139103,01/23/24 07:13 PM,860 N DEWITT PL<br>1704<br>Chicago Illinois 60611,Completed,,860 N DEWITT PL,1704,Chicago Illinois 60611
16574,SR24-00139193,01/23/24 07:37 PM,10437 S HALE AVE<br>Chicago Illinois 60643,Open,,10437 S HALE AVE,Chicago Illinois 60643,
16575,SR24-00139626,01/23/24 09:56 PM,860 N DEWITT PL<br>Apt 1502<br>Chicago Illinoi...,Completed,,860 N DEWITT PL,Apt 1502,Chicago Illinois 60611
16576,SR24-00141986,01/24/24 10:14 AM,7311 S EAST END AVE<br>Chicago Illinois 60649,Open,,7311 S EAST END AVE,Chicago Illinois 60649,
16577,SR24-00142172,01/24/24 10:30 AM,611 W BRIAR PL<br>103<br>Chicago Illinois 60657,Open,,611 W BRIAR PL,103,Chicago Illinois 60657


In [183]:
# parse address_1 and address_2 for a five digit zip code and put into a new column called zipcode

pattern = r'\b\d{5}\b'

def get_zip(value):
    zipcode = re.findall(pattern, str(value)) # returns a list of all matches
    if len(zipcode)>0: # if list contains zipcode
        return zipcode[0] # return first value in list
    else:
        return ''

In [184]:
# create a new column for zipcode
df['zipcode'] = ''

# loop through each row to parse zipcode first in address_1 and, if not found there, address_2
# write with df.at because the rows yielded by iterrows() may be copies of the data
for index, row in df.iterrows():
    zipcode = get_zip(row['address_1'])
    if zipcode == '':
        zipcode = get_zip(row['address_2'])
    df.at[index, 'zipcode'] = zipcode
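
As a possible alternative to the row loop, the same split-then-match idea can be sketched with pandas string methods. This is a minimal sketch on made-up rows that mirror the address formats shown above; the names `sample`, `parts` and `remainder` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy rows mirroring the address formats seen in this notebook (not the real data)
sample = pd.DataFrame({
    'Service Request Address': [
        '860 N DEWITT PL<br>1704<br>Chicago Illinois 60611',
        '10437 S HALE AVE<br>Chicago Illinois 60643',
        '1135 N Harlem AVE<br> 60302',
        None,
    ]
})

# Split once: everything after the first <br> is the non-street part of the address
parts = sample['Service Request Address'].str.split('<br>', n=1, expand=True)
sample['street_address'] = parts[0].fillna('')
remainder = parts[1].fillna('')

# str.extract returns the first match of the capture group, or NaN when nothing matches
sample['zipcode'] = remainder.str.extract(r'\b(\d{5})\b', expand=False).fillna('')

print(sample[['street_address', 'zipcode']])
```

Taking the first match in the text after the street address corresponds to checking address_1 before address_2, without iterating over rows.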

## Parse apartment number from the address column

Combines address_1 and address_2 to get all the non-street address info into one column, then removes the 'Chicago Illinois' substring and the five-digit zipcode to isolate the apartment number in a new column called apt_num.

In [185]:
# create a column for apartment number that combines address_1 and address_2
df['apt_num'] = df['address_1'] + ' ' + df['address_2']

In [186]:
df.tail()

Unnamed: 0,SR Number,Date/Time Opened,Service Request Address,Service Request Status,Answer,street_address,address_1,address_2,zipcode,apt_num
16573,SR24-00139103,01/23/24 07:13 PM,860 N DEWITT PL<br>1704<br>Chicago Illinois 60611,Completed,,860 N DEWITT PL,1704,Chicago Illinois 60611,60611,1704 Chicago Illinois 60611
16574,SR24-00139193,01/23/24 07:37 PM,10437 S HALE AVE<br>Chicago Illinois 60643,Open,,10437 S HALE AVE,Chicago Illinois 60643,,60643,Chicago Illinois 60643
16575,SR24-00139626,01/23/24 09:56 PM,860 N DEWITT PL<br>Apt 1502<br>Chicago Illinoi...,Completed,,860 N DEWITT PL,Apt 1502,Chicago Illinois 60611,60611,Apt 1502 Chicago Illinois 60611
16576,SR24-00141986,01/24/24 10:14 AM,7311 S EAST END AVE<br>Chicago Illinois 60649,Open,,7311 S EAST END AVE,Chicago Illinois 60649,,60649,Chicago Illinois 60649
16577,SR24-00142172,01/24/24 10:30 AM,611 W BRIAR PL<br>103<br>Chicago Illinois 60657,Open,,611 W BRIAR PL,103,Chicago Illinois 60657,60657,103 Chicago Illinois 60657


In [187]:
# find and replace Chicago Illinois substrings
city_pattern = r'Chicago Illinois\b'

df['apt_num'] = df['apt_num'].str.replace(city_pattern, '', regex=True)

# find and replace zipcode patterns
pattern = r'\b\d{5}\b'

df['apt_num'] = df['apt_num'].str.replace(pattern, '', regex=True)

# trim the whitespace left behind by the removals
df['apt_num'] = df['apt_num'].str.strip()


In [188]:
df.tail()

Unnamed: 0,SR Number,Date/Time Opened,Service Request Address,Service Request Status,Answer,street_address,address_1,address_2,zipcode,apt_num
16573,SR24-00139103,01/23/24 07:13 PM,860 N DEWITT PL<br>1704<br>Chicago Illinois 60611,Completed,,860 N DEWITT PL,1704,Chicago Illinois 60611,60611,1704
16574,SR24-00139193,01/23/24 07:37 PM,10437 S HALE AVE<br>Chicago Illinois 60643,Open,,10437 S HALE AVE,Chicago Illinois 60643,,60643,
16575,SR24-00139626,01/23/24 09:56 PM,860 N DEWITT PL<br>Apt 1502<br>Chicago Illinoi...,Completed,,860 N DEWITT PL,Apt 1502,Chicago Illinois 60611,60611,Apt 1502
16576,SR24-00141986,01/24/24 10:14 AM,7311 S EAST END AVE<br>Chicago Illinois 60649,Open,,7311 S EAST END AVE,Chicago Illinois 60649,,60649,
16577,SR24-00142172,01/24/24 10:30 AM,611 W BRIAR PL<br>103<br>Chicago Illinois 60657,Open,,611 W BRIAR PL,103,Chicago Illinois 60657,60657,103


## Clean date column

In [189]:
# convert Date/Time Opened to datetime
df['datetime_opened'] = pd.to_datetime(df['Date/Time Opened'])

# create a year opened col
df['year'] = df['datetime_opened'].dt.year

# create a month_year col
df['month_year'] = df['datetime_opened'].dt.strftime('%m/%Y')

# create a date col
df['date'] = df['datetime_opened'].dt.strftime('%Y-%m-%d')

# create an hour col
df['hour'] = df['datetime_opened'].dt.strftime('%H')
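
The timestamps shown in this notebook look like 'MM/DD/YY HH:MM AM/PM', so one option is to state that format explicitly and count any values that fail to parse. The sketch below uses made-up values; the `opened` and `parsed` names and the deliberately bad third value are illustrative assumptions, not part of the source data.

```python
import pandas as pd

# Two values in the format shown above, plus one deliberately malformed value
opened = pd.Series(['01/23/24 07:37 PM', '02/21/19 08:56 AM', 'not a date'])

# %I is the 12-hour clock hour and %p the AM/PM marker;
# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(opened, format='%m/%d/%y %I:%M %p', errors='coerce')

print(parsed)
print('unparsed rows:', parsed.isna().sum())
```

Checking the unparsed count before deriving the year, month_year, date and hour columns would surface malformed timestamps early.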

## Export cleaned data as csv

In [190]:
# add a city and state column
df['city'] = 'Chicago'
df['state'] = 'Illinois'

# delete working columns, address_1 and address_2
df.drop(columns=['address_1', 'address_2'], inplace=True)

In [191]:
df.head()

Unnamed: 0,SR Number,Date/Time Opened,Service Request Address,Service Request Status,Answer,street_address,zipcode,apt_num,datetime_opened,year,month_year,date,hour,city,state
0,SR18-00198455,12/19/18 02:53 PM,,Completed,No Cause,,,,2018-12-19 14:53:00,2018,12/2018,2018-12-19,14,Chicago,Illinois
1,SR19-01043676,02/20/19 01:40 PM,1135 N Harlem AVE<br> 60302,Completed,No Cause,1135 N Harlem AVE,60302.0,,2019-02-20 13:40:00,2019,02/2019,2019-02-20,13,Chicago,Illinois
2,SR19-01047333,02/21/19 08:56 AM,18231 S Sayre AVE<br> 60477,Completed,No Cause,18231 S Sayre AVE,60477.0,,2019-02-21 08:56:00,2019,02/2019,2019-02-21,8,Chicago,Illinois
3,SR19-01050631,02/21/19 03:29 PM,6726 N SHERIDAN RD<br> 60626,Completed,Processed for Hearing - Standard,6726 N SHERIDAN RD,60626.0,,2019-02-21 15:29:00,2019,02/2019,2019-02-21,15,Chicago,Illinois
4,SR19-01058995,02/23/19 11:35 AM,2746 N 74th AVE<br> 60707,Completed,No Cause,2746 N 74th AVE,60707.0,,2019-02-23 11:35:00,2019,02/2019,2019-02-23,11,Chicago,Illinois


In [192]:
df.to_csv('../data/processed/cleaned_complaints.csv')
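
When the cleaned file is read back, pandas infers column types by default, so a zipcode column made only of digits may come back as a number and empty strings as missing values. A minimal sketch of one way to keep those columns as text is shown below; it uses an in-memory buffer and a made-up one-row frame rather than the real cleaned_complaints.csv.

```python
import io
import pandas as pd

# Made-up one-row frame standing in for the cleaned output
cleaned = pd.DataFrame({'SR Number': ['SR19-01043676'], 'zipcode': ['60302'], 'apt_num': ['']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default read: zipcode is inferred as a numeric column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit read: keep zipcode as text and keep empty strings as empty strings
buffer.seek(0)
text_read = pd.read_csv(buffer, dtype={'zipcode': str}, keep_default_na=False)

print(default_read['zipcode'].dtype)
print(text_read['zipcode'].tolist(), text_read['apt_num'].tolist())
```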