# Multiple answer cleanup
Cleans up fields that contain comma-separated lists of answers. These fields come from multiple-select questions that also allow respondents to enter their own 'other' answer.

This notebook assumes the file named `responses_scrubbed_tagged.csv` has already been created by running the notebook named `free_text_tagging.ipynb`.  This notebook then creates a new data file called `responses_complete.csv` which has the complete set of final data.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display, Markdown, Latex

## Load tagged data set
This notebook assumes you have already scrubbed and tagged the data using the `free_text_tagging.ipynb` notebook.

In [2]:
df = pd.read_csv('responses_scrubbed_tagged.csv', index_col=None, parse_dates=['date'])
df.sample(n=10) #random sample of rows

Unnamed: 0,date,neighborhood,demographic,child_bus_freq,child_bike_freq,child_driven_freq,child_drive_freq,child_walk_freq,walk_freq,bike_freq,...,no_bike_reason_2,no_bike_reason_3,drive_reason_0,drive_reason_1,drive_reason_2,drive_reason_3,drive_reason_4,drive_reason_5,problem_tags,suggestion_tags
400,2019-07-08 17:14:37,Sunset Park,Adult with school kid,"Every day, or almost every day",Rarely or never - my children are too young,Once every few days,Rarely or never - my children are too young,Rarely or never - my children are too young,"Every day, or almost every day",Once every few days,...,,,My own personal preference,,,,,,"speeding,129,sidewalk condition","reduce speed,bike lanes"
179,2019-07-03 19:11:14,Mount Airy,Senior,,,,,,Once in a while,Rarely or never,...,,,Require a vehicle for emergency response,,,,,,driver awareness,
526,2019-07-19 08:40:30,Cortlandt,Adult with school kid,"Every day, or almost every day",Rarely or never - for other reasons,Once in a while,Rarely or never - my children are too young,Rarely or never - for other reasons,Once every few days,Once every few days,...,,,My own personal preference,,,,,,,
99,2019-07-03 16:27:21,Harmon,High School kid,,,,,,,,...,,,I don't have time to walk or bicycle,,,,,,side-by-side cyclists,"bike lanes, educate cyclists"
294,2019-07-04 14:19:02,Sunset Park,Adult with school kid,"Every day, or almost every day",Rarely or never - my children are too young,Rarely or never,Rarely or never - my children are too young,Rarely or never - my children are too young,Once every few days,Rarely or never,...,,,,,,,,,,
192,2019-07-03 19:56:38,Harmon,Adult with school kid,"Every day, or almost every day",Rarely or never - my children are too young,Rarely or never,Rarely or never - my children are too young,Rarely or never - my children are too young,"Every day, or almost every day",Once in a while,...,,,,,,,,,"driver awareness,speeding,aggressive driving,p...","enforce speed,stop signs on benedict,more stop..."
9,2019-05-23 15:14:38,Sunset Park,Adult without school kid,,,,,,Once in a while,Once every few days,...,,,My own personal preference,,,,,,driver awareness,"more crosswalks, maintain crosswalks"
19,2019-07-03 15:37:27,Harmon,Adult without school kid,,,,,,Once every few days,Rarely or never,...,,,I don't have time to walk or bicycle,,,,,,"cyclists breaking rules,cpa","enforce speed,maintain crosswalks,enforce cros..."
522,2019-07-18 15:07:23,Harmon,Senior,,,,,,Once every few days,Once in a while,...,,,My own personal preference,,,,,,129,educate cyclists
237,2019-07-04 01:49:00,Old Post Road North,Adult with school kid,"Every day, or almost every day",Rarely or never - my children are too young,Once in a while,Rarely or never - my children are too young,"Every day, or almost every day",Once in a while,Once in a while,...,,,Safety concerns with bicycling,My own personal preference,,,,,"cpa,n riverside,station lot,missing crosswalks...","illuminated crosswalks,better lighting"


## Answer options to each question
A dictionary that maps each multiple-select question to the list of answer options respondents could select from in the Google Form.

In [3]:
# a dictionary of columns that contain comma-separated multiple answer options
multiple_answer_options = {
    'schools': ['CET', 'PVC', 'CHHS', 'Homeschooled'],
    'child_drive_reason': ['No - we do not drive, or prefer not to drive', 'Our own personal preference', 'Lack of available busing where we live', 'The bus schedule does not match our schedule', 'Safety concerns with buses', 'Safety concerns with walking', 'Safety concerns with bicycling', "My child's health condition"],
    'child_no_walk_reason': ['No - they walk a lot', 'My child does not like to walk', 'We live too far to walk', "We don't have time to walk", 'Fear of dangerous driving', 'Lack of adequate sidewalks', 'Lack of adequate crosswalks at busy intersections', 'Lack of crossing guards at busy intersections', "My child's health condition", 'Visually unappealing route'],
    'child_no_bike_reason': ['No - they bicycle a lot', 'My child does not like to bicycle', 'My child is too young to bicycle', 'We live too far to bicycle', 'Our own personal preference', 'Fear of dangerous driving', 'Lack of adequate bike lanes', "My child's health condition", 'Visually unappealing route', 'Hills'],
    'no_walk_reason': ['No - I walk a lot', 'I live too far to walk', 'I do not like to walk', "I don't have time to walk", 'Fear of dangerous driving', 'Lack of adequate sidewalks', 'Lack of adequate crosswalks at intersections', 'Lack of crossing guards at intersections', "My health condition", 'My own personal preference', 'Visually unappealing'], 
    'no_bike_reason': ['No - I bicycle a lot', 'I live too far to bicycle', 'I do not like to bicycle', "I don't have time to bicycle", 'Fear of dangerous driving', 'Lack of adequate bike lanes', 'My health condition', 'My own personal preference', 'Visually unappealing'],
    'drive_reason': ['No - I do not drive, or prefer not to drive', "I don't have time to walk or bicycle", 'Safety concerns with walking', 'Safety concerns with bicycling', 'My own health condition', 'My own personal preference']
}

# remember that there was also an 'other' field where users could enter custom answers... we'll deal with that later.

## Fix the mistake we made splitting these columns
In the `cleanup.ipynb` notebook, we split up columns with multiple answer options into multiple columns, such as `child_drive_reason_0`, `child_drive_reason_1`, etc.  Merge them back together here.

In [4]:
# loop through each question title
for q in multiple_answer_options.keys():
    column_list = []
    for i in range(10):
        column_name = q + '_' + str(i) #auto-generate numbered column name
        if column_name in df.columns:
            column_list.append(column_name)
    if len(column_list) > 0:
        # create a new column with all answers comma-separated
        df[q] = df[column_list].apply(lambda x: ', '.join(x.fillna('').map(str)), axis=1)
        df[q] = df[q].str.strip(', ')
        
        # delete the old numbered columns
        df = df.drop(column_list, axis=1)        
df.columns

Index(['date', 'neighborhood', 'demographic', 'child_bus_freq',
       'child_bike_freq', 'child_driven_freq', 'child_drive_freq',
       'child_walk_freq', 'walk_freq', 'bike_freq', 'bikes_on_sidewalk',
       'self_jog_frequency', 'commutes', 'child_self_school',
       'child_self_bus_freq', 'child_self_bike_freq', 'child_self_driven_freq',
       'child_self_drive_freq', 'child_self_walk_freq', 'child_self_commutes',
       'child_self_has_children', 'commuter_distance', 'commuter_type',
       'commuter_walk_to_station_freq', 'commuter_bike_to_station_freq',
       'commuter_drive_to_station_freq', 'commuter_carpool_to_station_freq',
       'commuter_driven_to_station_freq', 'commuter_bus_to_station_freq',
       'feelings', 'problem_areas', 'drivers_are_safe', 'bicyclists_are_safe',
       'suggested_improvements', 'additional_comments', 'owns_business',
       'business_type', 'business_space', 'business_pedestrian_synergy',
       'business_bicyclists_synergy', 'business_bike_r

## Add in the tagging data
Tagging is done by the data analyst, and is assumed to be correct.  From here on, for simplicity, tags are stored and analyzed as if they had been a question with multiple answer options, with the most popular tags treated as the answer options.

In [5]:
#add the most popular complaint tags to the dictionary of questions and their answer options
multiple_answer_options['problem_tags'] = [ 'speeding', 'missing sidewalks', 'cpa', 'sidewalk condition', 'aggressive driving', 'driver awareness', 's riverside', '129', 'maple', 'cyclists breaking rules', 'cleveland', 'road surface', 'municipal place', 'missing crosswalks', 'bushes on sidewalk', 'grand', 'narrow roads', 'mt airy', 'shoprite', 'benedict', 'five corners', 'van wyck', 'dummy light', 'crossing guard', 'gottwald circle', 'n riverside', 'poor lighting', 'croton commons', 'truesdale', 'hmb trail', 'cet', 'on-street parking', 'old post n', 'olcott', 'old post s', 'morningside' ]

#add the most popular suggestion tags to the dictionary of questions and their answer options
multiple_answer_options['suggestion_tags'] = [ 'bike lanes', 'more sidewalks', 'enforce speed', 'maintain sidewalks', 'reduce speed', 'educate cyclists', 'more crosswalks', 'speed bumps', 'no bike lanes', 'maintain crosswalks', 'enforce crosswalks', 'sidewalk on mt airy s', 'reduce on-street parking', 'more traffic lights', 'better lighting', 'more stop signs', 'dedicated pedestrian walk signals', 'more signs', 'improve route 9 access', 'educate pedestrians', 'outlaw cycling', 'enforce trimming of bushes', 'speed cameras', 'parking on one side only', 'bike racks', 'traffic calming', 'speed indicator signs' ]


## Limit to just the multiple select columns
For debugging

In [6]:
column_list = list(multiple_answer_options.keys())
df = df[column_list]
df = df.fillna('')
df.sample(n=10) #random sample of rows

Unnamed: 0,schools,child_drive_reason,child_no_walk_reason,child_no_bike_reason,no_walk_reason,no_bike_reason,drive_reason,problem_tags,suggestion_tags
138,CHHS,The bus schedule does not match our schedule,We don't have time to walk,My child does not like to bicycle,"I live too far to walk, I don't have time to walk","I do not like to bicycle, I don't have time to...",Time,,
339,,,,,,,,"bushes on sidewalk,sidewalk condition,grand,mu...","more sidewalks,maintain sidewalks"
152,,,,,No - I walk a lot,Lack of adequate bike lanes,"No - I do not drive, or prefer not to drive","missing crosswalks,hmb trail","more crosswalks,reopen hmb river trail"
476,CHHS,My son attends school at Yorktown Tech Center.,He drives to Yorktown.,,,,My own personal preference,"brook st, municipal place,129,road surface",
10,Homeschooled,We don’t drive to school,"Fear of dangerous driving, My child doesn’t wa...",Fear of dangerous driving,Lack of adequate sidewalks,Fear of dangerous driving,I don't have time to walk or bicycle,"bungalow, missing sidewalks","more sidewalks, bike lanes"
388,CHHS,Safety concerns with buses,Fear of dangerous driving,Fear of dangerous driving,Fear of dangerous driving,Fear of dangerous driving,Safety concerns with walking,"dummy light,crosswalk condition,sidewalk condi...","pedestrian-only upper village,dedicated pedest..."
430,Happy Hearts Daycare!,Drive to Happy Hearts,My child is too young to walk =D,My child is to young to bike,"Fear of dangerous driving, Lack of adequate si...","Fear of dangerous driving, Lack of adequate bi...",My own personal preference,"benedict,cleveland,maple,speeding,cleveland,si...","enforce speed,speed bumps"
527,,,,,No - I walk a lot,Lack of adequate bike lanes,Too far and have two small children,"cleveland,sidewalk condition","maintain sidewalks,bike lanes"
407,CET,Our own personal preference,No - they walk a lot,"My child does not like to bicycle, My child is...","I do not like to walk, I don't have time to wa...","I do not like to bicycle, I don't have time to...","I don't have time to walk or bicycle, My own h...",,
277,,,,,"Lack of adequate sidewalks, Lack of adequate c...","Fear of dangerous driving, Lack of adequate bi...",Safety concerns with bicycling,"missing sidewalks,road surface","link the trails,bike lanes, more roundabouts"


## Remove commas from any given answer option so we can later split by comma cleanly

In [7]:
# loop through all questions that allow more than one answer
for question, answer_list in multiple_answer_options.items():
    # loop through every answer in the answer list for this question
    for answer in answer_list:
        # remove the commas, from each answer, if any
        if ',' in answer:
            # update the answers with a cleaned version (match the answer text literally, not as a regex)
            df[question] = df[question].str.replace(answer, answer.replace(',', ''), regex=False)
            
    # remove commas from the list of answers in the question/answer dictionary
    multiple_answer_options[question] = [answer.replace(',' , '') for answer in answer_list]

# show the cleaned up answer options
#multiple_answer_options
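
To see why this step matters, here is a minimal plain-Python sketch. The `raw` response is a made-up example that combines one predefined option (which itself contains a comma) with a free-text answer.

```python
# a predefined answer option that contains a comma
option = 'No - I do not drive, or prefer not to drive'

# hypothetical raw response: the option above plus one free-text 'other' answer
raw = option + ', Carrying groceries'

# splitting naively breaks the predefined option into two fragments
naive = [part.strip() for part in raw.split(',')]

# removing the comma from the option first keeps it in one piece
cleaned_option = option.replace(',', '')
cleaned = raw.replace(option, cleaned_option)
fixed = [part.strip() for part in cleaned.split(',')]

print(naive)
print(fixed)
```

The naive split yields three pieces, none of which matches a known option, while the cleaned version yields the option and the free-text answer as two separate items.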

## Tally answers
Indicate which users selected which answer options, and put any freeform text answers that users entered by clicking the 'other' option in Google Forms into '_other' columns.

In [8]:
def categorize_answers(response, question, answer_list):    
    """
    Takes a given set of responses to a multi-select question, and puts them into a well-formatted Series of dummy values.
    
    @param response A list of the user's responses to this question
    @param question The title of the question (these are codes)
    @param answer_list A list of the answer options that were presented to the user in the Google Form
    """
    
    # set up some lists that we will use to collect responses as dummy values
    indices = [] # will contain the text of the answer option that the user selected
    values = [] # will contain dummy values (1s) for each answer option the user selected, NaNs otherwise.
    
    # skip any nan or blank values
    if str(response) != 'nan' and type(answer_list) != float and type(question) != float and type(response) != float:
        #display(question + " - " + str(response))
        for answer_option in response:
            answer_option = answer_option.strip() # remove any leading/trailing whitespace
            if answer_option in answer_list and question + '_' + answer_option not in indices:
                indices.append(question + '_' + answer_option)
                values.append(1)
            elif answer_option.strip() != '' and question + '_' + 'other' not in indices:
                indices.append(question + '_' + 'other')
                values.append(answer_option)
    
        # put it all into a pandas Series
        response = pd.Series(values,index=indices).fillna(0).astype(str)
        
    return response

# slice up answers by comma and give each its own column
df2 = pd.DataFrame() # build a new dataframe from scratch

# loop through each question/answer item
for question, answer_list in multiple_answer_options.items():
    # get a new dataframe with the questions as indexes and the values as dummy values indicating user selections
    d = df[question].str.split(r',\s*').apply(categorize_answers, question=question, answer_list=answer_list)
    df2 = pd.concat([df2, d], sort=True) # stack these rows onto the new dataframe (DataFrame.append was removed in pandas 2.0)

# show the new dataframe
df2.sample(n=10) #random sample of rows



Unnamed: 0,child_drive_reason_Lack of available busing where we live,child_drive_reason_My child's health condition,child_drive_reason_No - we do not drive or prefer not to drive,child_drive_reason_Our own personal preference,child_drive_reason_Safety concerns with bicycling,child_drive_reason_Safety concerns with buses,child_drive_reason_Safety concerns with walking,child_drive_reason_The bus schedule does not match our schedule,child_drive_reason_other,child_no_bike_reason_Fear of dangerous driving,...,suggestion_tags_other,suggestion_tags_outlaw cycling,suggestion_tags_parking on one side only,suggestion_tags_reduce on-street parking,suggestion_tags_reduce speed,suggestion_tags_sidewalk on mt airy s,suggestion_tags_speed bumps,suggestion_tags_speed cameras,suggestion_tags_speed indicator signs,suggestion_tags_traffic calming
314,,,,,,,,,,,...,,,,,,,,,,
115,,,,,,,,,,,...,,,,,,,,,,
203,,,,,,,,,,,...,,,,,,,,,,
321,,,,,,,,,,,...,,,,,,,,,,
391,,,,,,,,,,,...,,,,,,,,,,
76,,,1.0,,,,,,,,...,,,,,,,,,,
393,,,,,,,,,,,...,,,,,,,,,,
232,,,,,,,,,,,...,,,,,,,,,,
295,,,,1.0,,,,,,,...,,,,,,,,,,
180,,,,,,,,,,,...,,,,,,,,,,
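
The tallying logic can be hard to follow from the sparse output above, so here is a simplified plain-Python sketch of what happens to a single response. The function name `tally` and the sample answers are hypothetical. It mirrors the behaviour of `categorize_answers` as written, including the fact that only the first unrecognised answer is kept in the `_other` column.

```python
def tally(selected, question, options):
    """Map one respondent's answers to {column_name: value}.

    Known options get a dummy value of 1; the first unrecognised
    answer is stored as text under '<question>_other'.
    """
    out = {}
    for answer in selected:
        answer = answer.strip()  # remove leading/trailing whitespace
        if answer in options:
            out[question + '_' + answer] = 1
        elif answer and question + '_other' not in out:
            out[question + '_other'] = answer
    return out


options = ['CET', 'PVC', 'CHHS', 'Homeschooled']
result = tally(['CET', ' Happy Hearts Daycare!', 'Another school'], 'schools', options)
print(result)
```

If keeping every free-text answer matters, one option would be to collect them in a list and join them, but that is a design change beyond what this notebook does.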


## Convert values to ints or booleans, where possible
Most columns contain NaNs or 1s.  NaNs are converted to 0s then to booleans, if desired.  Integer dummy values are perhaps more useful so we can take counts of responses by adding the ints up.  The 'other' columns contain text and are left alone.

In [9]:
# loop through each column
for col in df2.columns:
    try:
        # try to convert to int
        df2[col] = df2[col].fillna(0).astype(int)#.astype(bool)
    except (ValueError, TypeError):
        # if this column contains text, skip...
        continue
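
As a quick illustration of why integer dummies are convenient, the sketch below sums a made-up frame of dummy columns to get response counts, and shows the boolean alternative. The column names follow this notebook's naming pattern, but the values are invented.

```python
import pandas as pd

# hypothetical dummy values for three respondents
dummies = pd.DataFrame({
    'suggestion_tags_bike lanes': [1, 0, 1],
    'suggestion_tags_speed bumps': [0, 0, 1],
})

# summing integer dummies gives the number of respondents per answer option
counts = dummies.sum()

# booleans are an alternative if only presence/absence is needed
flags = dummies.astype(bool)

print(counts)
```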

## Group rows by 'id' so there is one row for each respondent

In [10]:
# add up all values in each column to get a full set of values for each id
# this drops the 'other' columns, since they are not numeric
df3 = df2.groupby(df2.index).sum(numeric_only=True)

# show the new grouped data... note that it is missing the 'other' fields... more on that later
df3.sample(n=10) #random sample of rows

Unnamed: 0,child_drive_reason_Lack of available busing where we live,child_drive_reason_My child's health condition,child_drive_reason_No - we do not drive or prefer not to drive,child_drive_reason_Our own personal preference,child_drive_reason_Safety concerns with bicycling,child_drive_reason_Safety concerns with buses,child_drive_reason_Safety concerns with walking,child_drive_reason_The bus schedule does not match our schedule,child_no_bike_reason_Fear of dangerous driving,child_no_bike_reason_Hills,...,suggestion_tags_no bike lanes,suggestion_tags_outlaw cycling,suggestion_tags_parking on one side only,suggestion_tags_reduce on-street parking,suggestion_tags_reduce speed,suggestion_tags_sidewalk on mt airy s,suggestion_tags_speed bumps,suggestion_tags_speed cameras,suggestion_tags_speed indicator signs,suggestion_tags_traffic calming
556,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
372,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
187,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
543,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
79,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
381,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
131,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
46,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
298,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
517,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
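
The grouping step works because each question contributes its own block of rows that share the same respondent index. A minimal sketch with two made-up question blocks (`q1_yes` and `q2_yes` are hypothetical column names):

```python
import pandas as pd

# two hypothetical per-question blocks that share the respondent index 0..1
block_a = pd.DataFrame({'q1_yes': [1, 0]}, index=[0, 1])
block_b = pd.DataFrame({'q2_yes': [0, 1]}, index=[0, 1])

# stacking the blocks repeats each respondent index once per question
stacked = pd.concat([block_a, block_b], sort=True).fillna(0).astype(int)

# summing within each index value collapses them to one row per respondent
collapsed = stacked.groupby(stacked.index).sum()
print(collapsed)
```

Since each dummy is non-zero in at most one block per respondent, the sum simply carries that value through.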


## Combine the selected answer options with the 'other' responses users gave

In [11]:
# loop through all questions
for question, answer_options in multiple_answer_options.items():
    # get a dataframe of just the 'other' answers to this question
    other_df = df2[pd.notnull(df2[question + '_other'])][question + '_other'].to_frame()
    # concatenate those responses to the original dataframe
    df3 = df3.join(other_df)
    
# show the data
df3.sample(n=10) #random sample of rows

Unnamed: 0,child_drive_reason_Lack of available busing where we live,child_drive_reason_My child's health condition,child_drive_reason_No - we do not drive or prefer not to drive,child_drive_reason_Our own personal preference,child_drive_reason_Safety concerns with bicycling,child_drive_reason_Safety concerns with buses,child_drive_reason_Safety concerns with walking,child_drive_reason_The bus schedule does not match our schedule,child_no_bike_reason_Fear of dangerous driving,child_no_bike_reason_Hills,...,suggestion_tags_traffic calming,schools_other,child_drive_reason_other,child_no_walk_reason_other,child_no_bike_reason_other,no_walk_reason_other,no_bike_reason_other,drive_reason_other,problem_tags_other,suggestion_tags_other
497,0,0,1,0,0,0,0,0,0,0,...,0,,,,,,,,crosswalk condition,
394,0,0,0,0,0,0,0,0,0,0,...,0,,,,,,Dangerous on upper Old Post Road (I think),A bit far for most things,,
530,0,0,0,0,0,0,0,0,0,0,...,0,,Bad weather,,,,,,,
276,0,0,0,0,0,0,0,0,0,0,...,0,,,,,,,,crossing guards,
215,0,0,0,0,0,0,0,0,0,0,...,0,,,,,,,,,
198,0,0,0,0,0,0,0,0,0,0,...,0,,,,,,,Carrying groceries,,
301,0,0,0,0,0,0,0,0,0,0,...,0,,,,,,,,,
74,0,0,0,0,0,0,0,0,0,0,...,0,,,,,,I don't own a Bicycle,,,
368,0,0,1,0,0,0,0,0,0,0,...,0,,,,,,,,more sidewalks,
147,0,0,1,0,0,0,0,0,1,0,...,0,,,,,,,,,


## Order the columns alphabetically
So it looks a bit more orderly

In [12]:
column_names = df3.columns.sort_values()
df3 = df3[column_names]
df3.sample(n=10) #random sample of rows

Unnamed: 0,child_drive_reason_Lack of available busing where we live,child_drive_reason_My child's health condition,child_drive_reason_No - we do not drive or prefer not to drive,child_drive_reason_Our own personal preference,child_drive_reason_Safety concerns with bicycling,child_drive_reason_Safety concerns with buses,child_drive_reason_Safety concerns with walking,child_drive_reason_The bus schedule does not match our schedule,child_drive_reason_other,child_no_bike_reason_Fear of dangerous driving,...,suggestion_tags_other,suggestion_tags_outlaw cycling,suggestion_tags_parking on one side only,suggestion_tags_reduce on-street parking,suggestion_tags_reduce speed,suggestion_tags_sidewalk on mt airy s,suggestion_tags_speed bumps,suggestion_tags_speed cameras,suggestion_tags_speed indicator signs,suggestion_tags_traffic calming
398,0,0,0,0,0,0,0,0,,0,...,,0,0,0,0,0,0,0,0,0
471,0,0,0,0,0,0,0,0,,0,...,,0,0,0,1,0,0,0,0,0
207,0,0,0,0,0,0,0,0,,0,...,,0,0,0,0,0,0,0,1,0
16,0,0,0,0,0,0,0,0,I only drive them if they miss the bus or have...,0,...,,0,0,0,0,0,0,0,0,0
434,0,0,0,0,0,0,0,0,,0,...,public transport up mt airy,0,0,0,1,0,0,0,0,0
197,0,0,0,0,0,0,0,0,,0,...,,0,0,0,0,0,0,0,0,0
116,0,0,0,0,0,0,0,0,Kids running late,1,...,,0,0,0,0,0,0,0,0,0
533,0,1,0,0,0,0,0,0,,0,...,,0,0,0,0,0,0,0,1,0
72,0,0,0,0,0,0,0,0,,0,...,,0,0,0,0,0,0,0,0,0
84,0,0,0,0,0,0,0,0,,0,...,run shuttles to station,0,0,0,0,0,0,0,0,0


## Merge with remaining data

In [13]:
# load up the original data from scratch again
df = pd.read_csv('responses_scrubbed_tagged.csv', index_col=None, parse_dates=['date'])

# merge the new columns with the old dataframe based on the indexes of each row
df_complete = pd.merge(df, df3, right_index=True, left_index=True)

df_complete.sample(n=10) #random sample of rows

Unnamed: 0,date,neighborhood,demographic,child_bus_freq,child_bike_freq,child_driven_freq,child_drive_freq,child_walk_freq,walk_freq,bike_freq,...,suggestion_tags_other,suggestion_tags_outlaw cycling,suggestion_tags_parking on one side only,suggestion_tags_reduce on-street parking,suggestion_tags_reduce speed,suggestion_tags_sidewalk on mt airy s,suggestion_tags_speed bumps,suggestion_tags_speed cameras,suggestion_tags_speed indicator signs,suggestion_tags_traffic calming
115,2019-07-03 16:37:55,Cortlandt,Adult without school kid,,,,,,Once every few days,Rarely or never,...,,0,0,0,1,0,0,0,0,1
44,2019-07-03 15:48:29,Harmon,Adult with school kid,"Every day, or almost every day",Rarely or never - my children are too young,Rarely or never,Rarely or never - my children are too young,Once in a while,"Every day, or almost every day",Rarely or never,...,,0,0,0,0,0,0,0,0,0
245,2019-07-04 06:53:28,Upper Village,Adult with school kid,Rarely or never,Rarely or never - for other reasons,Once every few days,Rarely or never - my children are too young,"Every day, or almost every day",Once every few days,Rarely or never,...,,0,0,0,1,0,1,0,0,0
232,2019-07-03 23:33:02,Mount Airy,Adult without school kid,,,,,,Once every few days,Rarely or never,...,lower taxes,0,0,0,0,0,0,0,0,0
539,2019-07-22 21:22:20,Sunset Park,Adult with school kid,"Every day, or almost every day",Rarely or never - for other reasons,Rarely or never,Rarely or never - my children are too young,"Every day, or almost every day",Once in a while,Rarely or never,...,,0,0,0,0,0,0,0,0,0
212,2019-07-03 21:13:46,Upper Village,Adult without school kid,,,,,,Once every few days,Rarely or never,...,,0,0,0,0,0,0,0,0,0
2,2019-02-28 08:41:08,Old Post Road North,Adult with school kid,Rarely or never,Once in a while,Once in a while,Rarely or never - my children are too young,"Every day, or almost every day","Every day, or almost every day",Once every few days,...,,0,0,0,0,0,0,0,1,0
200,2019-07-03 20:24:32,Old Post Road North,Adult with school kid,Rarely or never,Rarely or never - for other reasons,"Every day, or almost every day",Rarely or never - for other reasons,Rarely or never - for other reasons,Rarely or never,Rarely or never,...,,0,0,0,0,0,0,0,0,0
208,2019-07-03 20:56:56,Upper Village,Adult with school kid,Rarely or never,Rarely or never - for other reasons,Once every few days,Rarely or never - for other reasons,"Every day, or almost every day",Once every few days,Once in a while,...,,0,0,0,0,0,0,0,0,0
493,2019-07-14 00:15:04,Mount Airy,Adult with school kid,Rarely or never,Rarely or never - for other reasons,"Every day, or almost every day",Rarely or never - for other reasons,Once in a while,Rarely or never,Rarely or never,...,,0,0,0,0,1,0,0,0,0
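
Because the merge relies on both frames sharing the same row index, it can be worth guarding against accidental row duplication. The sketch below uses the `validate` argument of `pd.merge` on made-up frames; whether this check is needed for the real data is an assumption.

```python
import pandas as pd

# hypothetical stand-ins for the original responses and the new dummy columns
left = pd.DataFrame({'neighborhood': ['Harmon', 'Mount Airy']}, index=[0, 1])
right = pd.DataFrame({'suggestion_tags_reduce speed': [1, 0]}, index=[0, 1])

# validate='one_to_one' raises MergeError if either index has duplicate keys
merged = pd.merge(left, right, left_index=True, right_index=True,
                  validate='one_to_one')

# an inner merge on a shared index should keep one row per respondent
print(len(merged) == len(left))
```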


## Group together some related sets of tags from the free text responses
Several tags cover complaints about speeding, sidewalks, crosswalks, etc. Make aggregate columns so these topics are easier to search for.

In [14]:
def aggregate_tags(row, related_tags):
    """
    Looks for a dummy value of 1 in a set of related fields.
    Returns 1 if found, 0 otherwise.
    """
    # loop through each tag in the group
    for tag in related_tags:
        # if we find a one, then we're done
        if row[tag] == 1:
            return 1
    # otherwise, we did not find what we were looking for
    return 0

# make some new aggregate columns around topics that are otherwise represented across several fields

# a bunch of tags all related to concerns with drivers' behavior and speed
related_tags = ['problem_tags_driver awareness', 'problem_tags_aggressive driving', 'problem_tags_speeding', 'suggestion_tags_reduce speed', 'suggestion_tags_enforce speed', 'suggestion_tags_speed bumps', 'suggestion_tags_speed cameras', 'suggestion_tags_speed indicator signs', 'suggestion_tags_traffic calming']
df_complete['concerns_with_drivers'] = df_complete.apply(lambda row: aggregate_tags(row, related_tags), axis=1)

# a bunch of tags all related to concerns with sidewalks
related_tags = ['problem_tags_missing sidewalks', 'problem_tags_sidewalk condition', 'suggestion_tags_more sidewalks', 'suggestion_tags_maintain sidewalks']
df_complete['concerns_with_sidewalks'] = df_complete.apply(lambda row: aggregate_tags(row, related_tags), axis=1)

# a bunch of tags all related to concerns with crosswalks
related_tags = ['problem_tags_missing crosswalks', 'suggestion_tags_more crosswalks', 'suggestion_tags_maintain crosswalks', 'suggestion_tags_enforce crosswalks']
df_complete['concerns_with_crosswalks'] = df_complete.apply(lambda row: aggregate_tags(row, related_tags), axis=1)
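
The row-wise `apply` above could also be written as a vectorized expression. The sketch below is one possible alternative on a made-up frame, not what this notebook runs; it should give the same 1/0 result when the tag columns hold integer dummies.

```python
import pandas as pd

# hypothetical dummy columns for three respondents
toy = pd.DataFrame({
    'problem_tags_speeding': [1, 0, 0],
    'suggestion_tags_speed bumps': [0, 0, 1],
})

related = ['problem_tags_speeding', 'suggestion_tags_speed bumps']

# True where any related tag equals 1, then converted to a 1/0 dummy
toy['concerns_with_drivers'] = toy[related].eq(1).any(axis=1).astype(int)
print(toy['concerns_with_drivers'].tolist())
```

This avoids a Python-level loop over rows, which mostly matters for larger data sets than this survey.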


## Save to new CSV file

In [15]:
df_complete.to_csv('responses_complete.csv', index=False)

## Done!