# Reading in data

First, data from the NYC Housing Complaints dataset are read into a pandas DataFrame using the get_problem_data function. The query keeps only records whose status date is between 2015-08-01 and 2015-10-01. The function is called with lim=5, so 5000 records are pulled, in five requests of 1000 records each.

Finally, the DataFrame is randomly downsampled to 1000 records.
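
Because sample draws rows at random, the code counts reported later change from run to run. The sketch below shows one way to make the draw repeatable, assuming a fixed seed is acceptable for this analysis. The toy frame, its column name, and the seed value 0 are illustrative and do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for the 5000-row frame returned by get_problem_data (hypothetical data)
toy = pd.DataFrame({"row": range(5000)})

# random_state fixes the seed, so the same rows are drawn on every run
first = toy.sample(1000, random_state=0)
second = toy.sample(1000, random_state=0)

print(len(first))
print(first.index.equals(second.index))
```

In the notebook this would amount to passing random_state to the existing sample(1000) call.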

In [46]:
import pandas as pd
import numpy as np
import re

### 

def get_problem_data(lim):
    '''get_problem_data takes one argument, the number of records to get (in thousands), and
    returns a pandas dataframe containing up to lim thousand records from the NYC Complaints Problem
    dataset (keeping only records with status dates between 2015-08-01 and 2015-10-01)'''
    
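    # Order by :id here as well, so the first page is consistent with the offset pages below
    query = ("https://data.cityofnewyork.us/resource/gp4p-wib8.json?"
         "$where=statusdate%20between%20'2015-08-01T00:00:00'%20and%20'2015-10-01T00:00:00'"
         "&$limit=1000&$order=:id")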
    complaints_08_01_to_10_01 = pd.read_json(query)
    # Keep paging only if more than one page (1000 records) was requested
    more_data = lim > 1
    i = 1
    while more_data:
        query = ("https://data.cityofnewyork.us/resource/gp4p-wib8.json?"
         "$where=statusdate%20between%20'2015-08-01T00:00:00'%20and%20'2015-10-01T00:00:00'"
         "&$limit=1000&$order=:id&$offset=" + str(i*1000))
        data_page = pd.read_json(query)
        complaints_08_01_to_10_01 = pd.concat([complaints_08_01_to_10_01, data_page],ignore_index=True)
        # Function-call form of print produces the same output in Python 2 and Python 3
        print('Currently have  %d  records' % len(complaints_08_01_to_10_01))
        i +=1
        if len(data_page) < 1000 or i == lim:
            more_data = False
    return complaints_08_01_to_10_01
        
complaints_08_01_to_10_01 = get_problem_data(5)
short_df_complaints = complaints_08_01_to_10_01.sample(1000)

Currently have  2000  records
Currently have  3000  records
Currently have  4000  records
Currently have  5000  records
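
The paging scheme can be checked without touching the network. The sketch below only builds the request URLs, one per page of 1000 records. The helper name page_urls and the constants are hypothetical, and the use of urlencode with these quoting options is an assumption chosen to reproduce the hand-written query strings above, not the notebook's own method.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/gp4p-wib8.json"
PAGE_SIZE = 1000


def page_urls(lim, page_size=PAGE_SIZE):
    """Return one request URL per page, lim pages in total (hypothetical helper)."""
    where = ("statusdate between '2015-08-01T00:00:00' "
             "and '2015-10-01T00:00:00'")
    urls = []
    for page in range(lim):
        params = {
            "$where": where,
            "$limit": page_size,
            "$order": ":id",              # same ordering on every page
            "$offset": page * page_size,  # skip the records already fetched
        }
        # Leave $, : and ' unescaped so the URLs look like the ones written by hand
        urls.append(BASE_URL + "?" + urlencode(params, safe="$:'", quote_via=quote))
    return urls


urls = page_urls(5)
for url in urls:
    print(url)
```

Each URL differs only in its $offset value, which is what lets the loop in get_problem_data stop as soon as a page comes back with fewer than 1000 records.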


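Next, each complaint's status description is assigned a numeric code by infer_complaint_status, which searches the text for a fixed list of regular expressions and returns the code of the first one found. The number of observations with each code is then reported, followed by the status descriptions of the records that matched no pattern (code 0).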

In [49]:

### Proposed classes for StatusDescription:
## 1. "No violations were issued" = r'No\sviolations\swere\sissued'
## 2. "not able to gain access" = r'not\sable\sto\sgain\saccess'
## 3. ". Violations were issued" = r'\.\sViolations\swere\sissued'
## 4. ". Violations were previously issued" = r'\.\sViolations\swere\spreviously\sissued'
## 5. "conditions were corrected" = r'conditions\swere\scorrected'
## 5. "advised by a tenant" [that heat or hot water was restored] = r'advised\sby\sa\stenant' (same class as above)
## 6. "conditions are still open" = r'conditions\sare\sstill\sopen'
## 0. Not one of the above.

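def infer_complaint_status(input_string):
    '''Return the class code (1-6) of the first pattern found in input_string, or 0 if none match.'''
    try:
        input_string = str(input_string)
    except Exception:
        # Could not convert to str; print the value and match against it as-is
        print(input_string)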
    if bool(re.search(r'No\sviolations\swere\sissued', input_string)):
        code = 1
    elif bool(re.search(r'not\sable\sto\sgain\saccess', input_string)):
        code = 2
    elif bool(re.search(r'\.\sViolations\swere\sissued', input_string)):
        code = 3
    elif bool(re.search(r'\.\sViolations\swere\spreviously\sissued', input_string)):
        code = 4
    elif bool(re.search(r'conditions\swere\scorrected', input_string)):
        code = 5 
    elif bool(re.search(r'advised\sby\sa\stenant', input_string)):
        code = 5
    elif bool(re.search(r'conditions\sare\sstill\sopen', input_string)):
        code = 6
    else:
        code = 0
    return code

short_df_complaints.loc[:,'complaint_status_inferred'] = short_df_complaints.loc[:,'statusdescription'].map(infer_complaint_status)
grouped = short_df_complaints.groupby('complaint_status_inferred')

counts = pd.DataFrame()
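# list(...) keeps these assignments working in Python 3, where keys() and map() are lazy
counts['Code'] = list(grouped.groups.keys())
counts['Count'] = list(map(len, grouped.groups.values()))
print('Counts for each code')
print(counts)
print('\n')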

## Prints statusdescriptions for all code 0 cases (no pattern matched)
pd.options.display.max_colwidth = 300
print("Status descriptions for all '0' code observations")
print(grouped.get_group(0)['statusdescription'])

Counts for each code
   Code  Count
0     0     18
1     1    508
2     2    267
3     3    140
4     4     27
5     5     39
6     6      1


Status descriptions for all '0' code observations
1222    NaN
1469    NaN
1710    NaN
1471    NaN
2176    NaN
1472    NaN
2164    NaN
2109    NaN
1865    NaN
1223    NaN
2209    NaN
4096    NaN
1230    NaN
2112    NaN
1408    NaN
1192    NaN
1448    NaN
1741    NaN
Name: statusdescription, dtype: object
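
Every code 0 row listed above has a missing description rather than an unrecognised one. The sketch below restates the coding rule as an ordered table of patterns, which keeps the patterns and their codes in one place, and shows why a missing value ends up as 0. The helper name infer_code and the sample sentences are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# (pattern, code) pairs, checked in order; the first match wins
STATUS_PATTERNS = [
    (r'No\sviolations\swere\sissued', 1),
    (r'not\sable\sto\sgain\saccess', 2),
    (r'\.\sViolations\swere\sissued', 3),
    (r'\.\sViolations\swere\spreviously\sissued', 4),
    (r'conditions\swere\scorrected', 5),
    (r'advised\sby\sa\stenant', 5),
    (r'conditions\sare\sstill\sopen', 6),
]


def infer_code(text):
    """Return the code of the first matching pattern, or 0 if none match."""
    text = str(text)  # a missing value (NaN) becomes the string 'nan'
    for pattern, code in STATUS_PATTERNS:
        if re.search(pattern, text):
            return code
    return 0


# Invented example descriptions, plus one missing value
samples = [
    "An inspection was conducted. No violations were issued.",
    "The inspector was not able to gain access to the apartment.",
    "An inspection was conducted. Violations were issued.",
    float("nan"),
]
codes = [infer_code(s) for s in samples]
print(codes)           # [1, 2, 3, 0]
print(Counter(codes))  # tally of observations per code
```

Since 'nan' contains none of the phrases, missing descriptions always fall through to code 0, which matches the listing above.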
