# Reading in data

First, data from the NYC Housing Complaints dataset are read into a pandas DataFrame using the get_problem_data function. Note the query is set to:
- Only get records where the status date is between 2015-08-01 and 2015-10-01
Additionally, the function is called with lim=300, so up to 300,000 records are requested in pages of 1,000. Paging stops early if a page comes back with fewer than 1,000 records.
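The paging scheme described above can be sketched without touching the network. This is a minimal illustration written for Python 3, not the notebook's own code: `build_page_url`, `fetch_pages`, and the injected `fetch_page` callable are hypothetical names introduced here, and `fetch_page` stands in for the role `pd.read_json` plays in the cell below.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/gp4p-wib8.json"
PAGE_SIZE = 1000


def build_page_url(page, page_size=PAGE_SIZE):
    """Return the query URL for the zero-based page number `page`."""
    params = {
        "$where": ("statusdate between '2015-08-01T00:00:00' "
                   "and '2015-10-01T00:00:00'"),
        "$limit": page_size,
        "$order": ":id",  # stable ordering so successive pages do not overlap
        "$offset": page * page_size,
    }
    # Keep $, : and ' literal so the result resembles the URLs in the notebook.
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe="$:'")


def fetch_pages(lim, fetch_page):
    """Collect up to `lim` pages, stopping early at the first short page.

    `fetch_page` is a hypothetical callable that takes a URL and returns a
    list of records.
    """
    records = []
    for page in range(lim):
        batch = fetch_page(build_page_url(page))
        records.extend(batch)
        if len(batch) < PAGE_SIZE:  # a short page means the data ran out
            break
    return records
```

Passing the fetcher in as an argument keeps the stopping rule (short page, or `lim` pages reached) easy to check with a stub before running it against the live endpoint.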


In [21]:
import pandas as pd
import numpy as np
import re

### 

def get_problem_data(lim):
    '''get_problem_data takes one argument, the number of records to get (in thousands), and
    returns a pandas dataframe containing up to lim thousand records from the NYC Complaints Problem
    dataset (keeping only records with status dates between 2015-08-01 and 2015-10-01)'''
    
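    # First page: same filter and :id ordering as the later pages, so offsets stay consistent
    query = ("https://data.cityofnewyork.us/resource/gp4p-wib8.json?"
         "$where=statusdate%20between%20'2015-08-01T00:00:00'%20and%20'2015-10-01T00:00:00'"
         "&$limit=1000&$order=:id")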
    complaints_08_01_to_10_01 = pd.read_json(query)
    # Keep paging only if more than one thousand records were requested
    more_data = lim > 1
    i=1
    while more_data:
        query = ("https://data.cityofnewyork.us/resource/gp4p-wib8.json?"
         "$where=statusdate%20between%20'2015-08-01T00:00:00'%20and%20'2015-10-01T00:00:00'"
         "&$limit=1000&$order=:id&$offset=" + str(i*1000))
        data_page = pd.read_json(query)
        complaints_08_01_to_10_01 = pd.concat([complaints_08_01_to_10_01, data_page],ignore_index=True)
        print 'Currently have ', len(complaints_08_01_to_10_01), ' records'
        i +=1
        if len(data_page) < 1000 or i == lim:
            more_data = False
    return complaints_08_01_to_10_01
        
complaints_08_01_to_10_01 = get_problem_data(300)
short_df_complaints = complaints_08_01_to_10_01

Currently have  2000  records
Currently have  3000  records
Currently have  4000  records
Currently have  5000  records
Currently have  6000  records
Currently have  7000  records
Currently have  8000  records
Currently have  9000  records
Currently have  10000  records
Currently have  11000  records
Currently have  12000  records
Currently have  13000  records
Currently have  14000  records
Currently have  15000  records
Currently have  16000  records
Currently have  17000  records
Currently have  18000  records
Currently have  19000  records
Currently have  20000  records
Currently have  21000  records
Currently have  22000  records
Currently have  23000  records
Currently have  24000  records
Currently have  25000  records
Currently have  26000  records
Currently have  27000  records
Currently have  28000  records
Currently have  29000  records
Currently have  30000  records
Currently have  31000  records
Currently have  32000  records
Currently have  33000  records
Currently have  

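Next, each free-text status description is mapped to an integer code (0 to 6) using infer_complaint_status, where 0 marks descriptions that match none of the patterns. Finally, the number of observations for each code is reported, and any non-null descriptions left with code 0 are printed for review.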
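As an illustration of the coding step, the same first-match-wins logic can be written as a table of (pattern, code) pairs. This is a sketch for Python 3 under stated assumptions, not the notebook's function: `classify_status` and `STATUS_PATTERNS` are hypothetical names, the patterns are copied from the cell below, and the sample strings are made up for the example (only the lead-based paint phrase echoes real output).

```python
import re
from collections import Counter

# (pattern, code) pairs in priority order; the first matching pattern wins.
STATUS_PATTERNS = [
    (r'No\sviolations\swere\sissued', 1),
    (r'not\sable\sto\sgain\saccess', 2),
    (r'\.\sViolations\swere\sissued', 3),
    (r'\.\sViolations\swere\spreviously\sissued', 4),
    (r'conditions\swere\scorrected', 5),
    (r'advised\sby\sa\stenant', 5),
    (r'conditions\sare\sstill\sopen', 6),
]
COMPILED = [(re.compile(pattern), code) for pattern, code in STATUS_PATTERNS]


def classify_status(text):
    """Return the code of the first matching pattern, or 0 if none match."""
    text = str(text)  # missing values become strings such as 'None' and fall to 0
    for pattern, code in COMPILED:
        if pattern.search(text):
            return code
    return 0


# Made-up sample descriptions, one per branch, plus a missing value.
samples = [
    "An inspection was conducted. No violations were issued.",
    "The inspector was not able to gain access to the apartment.",
    "An inspection was conducted. Violations were issued.",
    "An inspection was conducted. Violations were previously issued.",
    "HPD was advised by a tenant that heat was restored.",
    "identified potential lead-based paint conditions",
    None,
]
codes = [classify_status(s) for s in samples]
counts = Counter(codes)  # tally of observations per code
```

Keeping the patterns in one list makes the priority order explicit and lets a new class be added without touching the control flow. On a DataFrame column, `Series.value_counts()` would give the same per-code tally as `Counter` does here.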

In [22]:

### Proposed classes for StatusDescription:
## 1. "No violations were issued" = r'No\sviolations\swere\sissued'
## 2. "not able to gain access" = r'not\sable\sto\sgain\saccess'
## 3. ". Violations were issued" = r'\.\sViolations\swere\sissued'
## 4. ". Violations were previously issued" = r'\.\sViolations\swere\spreviously\sissued'
## 5. "conditions were corrected" = r'conditions\swere\scorrected'
## 5. (also code 5) "advised by a tenant" [that heat or hot water was restored] = r'advised\sby\sa\stenant'
## 6. "conditions are still open" = r'conditions\sare\sstill\sopen'
## 0. Not one of the above.

def infer_complaint_status(input_string):
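    # Coerce to str so missing values can be searched too; under Python 2,
    # str() can fail on non-ASCII unicode, in which case the value is printed
    # and searched as-is.
    try:
        input_string = str(input_string)
    except Exception:
        print input_string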
    if bool(re.search(r'No\sviolations\swere\sissued', input_string)):
        code = 1
    elif bool(re.search(r'not\sable\sto\sgain\saccess', input_string)):
        code = 2
    elif bool(re.search(r'\.\sViolations\swere\sissued', input_string)):
        code = 3
    elif bool(re.search(r'\.\sViolations\swere\spreviously\sissued', input_string)):
        code = 4
    elif bool(re.search(r'conditions\swere\scorrected', input_string)):
        code = 5 
    elif bool(re.search(r'advised\sby\sa\stenant', input_string)):
        code = 5
    elif bool(re.search(r'conditions\sare\sstill\sopen', input_string)):
        code = 6
    else:
        code = 0
    return code

short_df_complaints.loc[:,'complaint_status_inferred'] = short_df_complaints.loc[:,'statusdescription'].map(infer_complaint_status)
grouped = short_df_complaints.groupby('complaint_status_inferred')

counts = pd.DataFrame()
counts['Code'] = grouped.groups.keys()
counts['Count'] = map(len, grouped.groups.values())
counts.to_csv('Counts.csv')
print 'Counts for each code'
print counts
print '\n'

## Prints the status descriptions for all code 0 (unmatched) cases
pd.options.display.max_colwidth = 300
group_0 = grouped.get_group(0)['statusdescription']
missed = group_0.loc[group_0.notnull()]  # drop missing descriptions
if len(missed) != 0:
    print "Status descriptions for all '0' code observations"
    print missed
    missed.to_csv('Missed_codes.csv')
else:
    print "No non-NaN values in group 0"
                

Counts for each code
   Code  Count
0     0    140
1     1  38475
2     2  22600
3     3  23445
4     4   2887
5     5   7224
6     6  12105


Status descriptions for all '0' code observations
55479                    The Department of Housing Preservation and Development conducted an inspection for the following conditions and identified potential lead-based paint conditions. HPD will attempt to contact you to schedule a follow-up inspection to test the paint for lead.
67472                    The Department of Housing Preservation and Development conducted an inspection for the following conditions and identified potential lead-based paint conditions. HPD will attempt to contact you to schedule a follow-up inspection to test the paint for lead.
76761                    The Department of Housing Preservation and Development conducted an inspection for the following conditions and identified potential lead-based paint conditions. HPD will attempt to contact you to schedule a follow-up 