## Import data from CSV file
This notebook assumes you have downloaded the survey responses as a CSV file from the Google Sheet linked to the Google Forms survey.

The code in this notebook reads the raw Google Forms response data from `responses.csv`, cleans it up, and saves the result to a new file named `responses_scrubbed.csv`. That file is then ready for processing by another notebook named `free_text_tagging.ipynb`.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display, Markdown, Latex, clear_output

df = pd.read_csv('responses.csv', index_col=None, parse_dates=['Timestamp'])
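
A quick check after loading can confirm that the `Timestamp` column was parsed as dates rather than left as text. The sketch below uses a two-row in-memory CSV as a stand-in for `responses.csv`; the second column header and the sample rows are invented for illustration, and only the `Timestamp` header comes from the notebook.

```python
import io
import pandas as pd

# hypothetical stand-in for responses.csv (only the 'Timestamp' header is from the notebook)
sample_csv = io.StringIO(
    "Timestamp,Which neighborhood do you live in?\n"
    "2019-07-03 15:45:47,Harmon\n"
    "2019-07-04 09:45:30,Mount Airy\n"
)

sample_df = pd.read_csv(sample_csv, index_col=None, parse_dates=['Timestamp'])

# True when pandas parsed the column as datetimes rather than leaving it as text
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(sample_df['Timestamp'])
print(timestamp_is_datetime, len(sample_df))
```

If the check comes back `False`, the timestamps in the export probably use a format pandas could not infer, and the column would need an explicit conversion before any date-based analysis.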

## Clean up column names
The original column names are the full survey questions, so let's simplify them. While we're at it, set the appropriate columns to the categorical type to speed up analysis.

In [2]:
# replace long column titles with shorter versions
df.columns = [ 'date', 'neighborhood', 'demographic', 'schools', 'child_bus_freq', 'child_bike_freq', 'child_driven_freq', 'child_drive_freq', 'child_walk_freq', 'child_drive_reason', 'child_no_walk_reason', 'child_no_bike_reason', 'walk_freq', 'bike_freq', 'bikes_on_sidewalk', 'self_jog_frequency', 'commutes', 'child_self_school', 'child_self_bus_freq', 'child_self_bike_freq', 'child_self_driven_freq', 'child_self_drive_freq', 'child_self_walk_freq', 'child_self_commutes', 'child_self_has_children', 'commuter_distance', 'commuter_type', 'commuter_walk_to_station_freq', 'commuter_bike_to_station_freq', 'commuter_drive_to_station_freq', 'commuter_carpool_to_station_freq', 'commuter_driven_to_station_freq', 'commuter_bus_to_station_freq', 'no_walk_reason', 'no_bike_reason', 'drive_reason', 'feelings', 'problem_areas', 'drivers_are_safe', 'bicyclists_are_safe', 'suggested_improvements', 'additional_comments', 'owns_business', 'business_type', 'business_space', 'business_pedestrian_synergy', 'business_bicyclists_synergy', 'business_bike_rack_interest', 'business_promotion_interest', 'business_additional_comments', 'final_comments', 'contact_interest', 'contact_info', 'wants_pdf' ]

# convert the columns that use a linear scale to the categorical type
categorical_columns = [ 'child_bus_freq', 'child_bike_freq', 'child_driven_freq', 'child_drive_freq', 'child_walk_freq', 'walk_freq', 'bike_freq', 'self_jog_frequency', 'child_self_bus_freq', 'child_self_bike_freq', 'child_self_driven_freq', 'child_self_drive_freq', 'child_self_walk_freq', 'commuter_walk_to_station_freq', 'commuter_bike_to_station_freq', 'commuter_drive_to_station_freq', 'commuter_carpool_to_station_freq', 'commuter_driven_to_station_freq', 'commuter_bus_to_station_freq', 'drivers_are_safe', 'bicyclists_are_safe']
for col in categorical_columns:
    df[col] = df[col].astype('category')
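
Because these columns hold a linear scale, an ordered categorical type may be worth considering in later analysis. The sketch below is one way to do that, not what the notebook does: the category order is an assumption inferred from the answer text visible in the sample output, and `freq_scale` and `walk_freq` are illustrative names.

```python
import pandas as pd

# assumed ordering of the frequency scale, from least to most frequent
freq_scale = pd.CategoricalDtype(
    categories=[
        'Rarely or never',
        'Once in a while',
        'Once every few days',
        'Every day, or almost every day',
    ],
    ordered=True,
)

walk_freq = pd.Series([
    'Once in a while',
    'Every day, or almost every day',
    'Rarely or never',
]).astype(freq_scale)

# ordered categories support comparisons and min/max along the scale
at_least_every_few_days = walk_freq >= 'Once every few days'
most_frequent = walk_freq.max()
print(at_least_every_few_days.tolist(), most_frequent)
```

Any answer that is not listed in `categories` becomes missing on conversion, so variants such as 'Rarely or never - my children are too young' would need to be added to the scale, or mapped onto it, first.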


## Remove contact information

In [3]:

# remove contact info for privacy
del df['contact_info']

df.sample(n=10) #random sample of rows

Unnamed: 0,date,neighborhood,demographic,schools,child_bus_freq,child_bike_freq,child_driven_freq,child_drive_freq,child_walk_freq,child_drive_reason,...,business_type,business_space,business_pedestrian_synergy,business_bicyclists_synergy,business_bike_rack_interest,business_promotion_interest,business_additional_comments,final_comments,contact_interest,wants_pdf
36,2019-07-03 15:45:47,Albany Post Road / Prickly Pear Hill / Scenic ...,Senior,,,,,,,,...,,,,,,,,,No,No
263,2019-07-04 09:45:30,Upper Village (the area nearest the Black Cow ...,Senior,,,,,,,,...,,,,,,,,,No,No
197,2019-07-03 20:10:41,Upper Village (the area nearest the Black Cow ...,Senior,,,,,,,,...,,,,,,,,,Yes,No
365,2019-07-06 21:57:20,Harmon,Adult with school age children living with me,"CET, PVC","Every day, or almost every day",Rarely or never - my children are too young,Once every few days,Rarely or never - my children are too young,Rarely or never - my children are too young,violin practice,...,,,,,,,,,No,No
433,2019-07-09 18:44:24,Cortlandt outside of Croton,Adult with school age children living with me,FWS,"Every day, or almost every day",Rarely or never - for other reasons,Once in a while,Rarely or never - my children are too young,Rarely or never - for other reasons,Our own personal preference,...,,,,,,,,,No,No
101,2019-07-03 16:27:40,Old Post Road N area,Adult with school age children living with me,CHHS,Once every few days,Rarely or never - for other reasons,"Every day, or almost every day",Rarely or never - for other reasons,Once every few days,Our own personal preference,...,,,,,,,,,Yes,No
282,2019-07-04 12:25:19,Mount Airy,Adult with school age children living with me,"CHHS, The Harvey School","Every day, or almost every day",Rarely or never - for other reasons,Once in a while,Rarely or never - my children are too young,Rarely or never - for other reasons,Our own personal preference,...,,,,,,,,,Yes,No
466,2019-07-10 20:38:54,Mount Airy,Adult with school age children living with me,CHHS,Once every few days,Rarely or never - for other reasons,"Every day, or almost every day",Once every few days,Once in a while,"Our own personal preference, The bus schedule ...",...,,,,,,,,,Yes,No
464,2019-07-10 19:03:25,Harmon,Adult with school age children living with me,"PVC, CHHS",Rarely or never,"Every day, or almost every day",Once in a while,Rarely or never - my children are too young,Once every few days,Weather related events,...,,,,,,,,,No,No
293,2019-07-04 14:04:42,Upper Village (the area nearest the Black Cow ...,Senior,,,,,,,,...,,,,,,,,Keep going.,Yes,No
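
Note that `del df['contact_info']` raises a `KeyError` if the cell is run a second time, because the column is already gone. A re-runnable alternative, sketched here on a made-up two-row frame, is `DataFrame.drop` with `errors='ignore'`.

```python
import pandas as pd

demo = pd.DataFrame({
    'neighborhood': ['Harmon', 'Mount Airy'],
    'contact_info': ['person@example.com', None],  # made-up values
})

# errors='ignore' makes the drop a no-op when the column is already gone
demo = demo.drop(columns=['contact_info'], errors='ignore')
demo = demo.drop(columns=['contact_info'], errors='ignore')  # second run does not raise

remaining_columns = list(demo.columns)
print(remaining_columns)
```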


## Clean up neighborhood names
Neighborhood names were verbose in the actual survey, and some respondents wrote in their own locations. We group those respondents into their nearest neighborhoods and use consistent neighborhood names here.

In [4]:
df['neighborhood'] = df['neighborhood'].str.strip() # remove white space
df['neighborhood'] = df['neighborhood'].replace({
    
    # simplifying neighborhood names
    'Albany Post Road / Prickly Pear Hill / Scenic Dr area': 'Albany Post Road', 
    'Cortlandt outside of Croton': 'Cortlandt',
    'Old Post Road N area': 'Old Post Road North',
    'Quaker Ridge / Quaker Bridge area': 'Quaker Ridge',
    'Sunset Park area': 'Sunset Park',
    'Teatown area': 'Teatown',
    'Upper Village (the area nearest the Black Cow coffee shop)': 'Upper Village',

    # lumping CET/library area into Harmon
    'CET': 'Harmon', 
    'By CET': 'Harmon',
    'By the library/cemetery/CET': 'Harmon',
    'End of Cleveland near path': 'Harmon',
    'Cleveland near CET/Library': 'Harmon',
    'Cleveland/Park': 'Harmon',
    'Duck Pond': 'Harmon',
    'Harmon Park': 'Harmon',
    'Irving Ave': 'Harmon',
    'Ridge Rd.': 'Harmon',
    'Truesdale Drive': 'Harmon',
    'along the croton river': 'Harmon',
    'Behind high school': 'Harmon',
    'Wells/Beekman Area': 'Harmon',
    
    # lumping Old Post Road South and Sunset Park together
    'Old Post Road S': 'Sunset Park', 
    
    # lumping North Riverside area into Croton Landing
    'North Riverside': 'Croton Landing',
    'Palmer on High St': 'Croton Landing',
    'lower village': 'Croton Landing',
    'Brook St': 'Croton Landing',
    
    # lumping Old Post Road North and Croton Landing together
    'Croton Landing': 'Old Post Road North',

    # lumping Wolf Road into Albany Post Road
    'wolf road': 'Albany Post Road',
    
    # lumping nearby streets into Upper Village
    'Bari Manor': 'Upper Village',
    'Harrison st': 'Upper Village',
    'Wells Ave': 'Upper Village',
    
    # lumping Batten Rd and Crompond Rd areas with Mount Airy
    'Batten Road': 'Mount Airy',
    'The trails': 'Mount Airy',
    '129 near dam': 'Mount Airy', 
    
    #lumping Teatown into Quaker Ridge, since there were few Teatown respondents
    'Teatown': 'Quaker Ridge' 
})

df['neighborhood'].value_counts()

Harmon                 166
Upper Village          118
Old Post Road North     77
Mount Airy              67
Sunset Park             48
Albany Post Road        31
Half Moon Bay           21
Quaker Ridge            11
Cortlandt               10
Ossining                 1
Name: neighborhood, dtype: int64
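
Write-in answers that the mapping does not cover stay in the data unchanged, as the single 'Ossining' row above suggests. One way to surface them is to compare the cleaned values against the set of names you expect. The sketch below assumes a tiny `expected` set and a made-up 'Somewhere new' answer; it also maps each raw label straight to its final group, so the result does not depend on how entries that feed into one another are resolved.

```python
import pandas as pd

# a few entries borrowed from the mapping above, each pointing at its final group
neighborhood_map = {
    'Teatown area': 'Quaker Ridge',
    'Quaker Ridge / Quaker Bridge area': 'Quaker Ridge',
    'By CET': 'Harmon',
}
expected = {'Harmon', 'Quaker Ridge'}  # assumed set of final names for this demo

raw = pd.Series([' Teatown area', 'By CET ', 'Harmon', 'Somewhere new'])
cleaned = raw.str.strip().replace(neighborhood_map)

# anything left over is a write-in that still needs a decision
unexpected = sorted(set(cleaned.dropna()) - expected)
print(cleaned.tolist(), unexpected)
```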

## Clean up demographics
Some respondents wrote in their own demographic descriptions, so we standardize them here.

In [5]:

df['demographic'] = df['demographic'].str.strip() # remove white space
df['demographic'] = df['demographic'].replace({
    
    # creating new category for adults with small children
    'Adult with a 22 year old and a 3 year old living with me': 'Adult with small child',
    'Adult with an infant': 'Adult with small child',
    'Adult with new baby': 'Adult with small child',
    'Adult with toddler': 'Adult with small child',
    'Adult with toddler living with me': 'Adult with small child',
    'Adult with toddlers living with me': 'Adult with small child',
    'Adult with young children': 'Adult with small child',
    'Adult with young children living with me': 'Adult with small child',
    'Adult with 3year old': 'Adult with small child',
    'Adult with children not yet in CET (pre-K)': 'Adult with small child',
    'Adult with daycare-age children living with me': 'Adult with small child',
    'Adult with infant living with us': 'Adult with small child',
    
    # lump adults with infants & toddlers into adults with no school age children
    'Adult with small child': 'Adult with no school age children living with me',
    'Adult with spouse and adult children living with us.': 'Adult with no school age children living with me',
    
    # lumping adults with college kids into the adults with no school age children category
    'adult with post college child living with me': 'Adult with no school age children living with me',
    'Adult with College Children': 'Adult with no school age children living with me',

    # lump adults with school children in addition to others
    'Adult with school age children and a senior living with me.': 'Adult with school kid'

})

# simplify response text
df['demographic'] = df['demographic'].replace({
    'Adult with school age children living with me': 'Adult with school kid',
    'Adult with no school age children living with me': 'Adult without school kid',
    'Adult with small child': 'Adult without school kid', # we didn't ask respondents whether they had small kids
    'College student': 'College kid',
    'High School student': 'High School kid',
    'Middle School student': 'Middle School kid',
})

df['demographic'].value_counts()

Adult with school kid       232
Adult without school kid    174
Senior                      140
College kid                   5
High School kid               3
Middle School kid             3
Name: demographic, dtype: int64
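
The two replacement passes above could also be composed into a single lookup table, which makes it easier to see where each raw answer ends up. This is a sketch under simplifying assumptions: it uses only a handful of the entries, sends write-ins directly to the survey's standard option (skipping the intermediate 'Adult with small child' step), and the names `write_ins`, `short_labels`, and `combined` are illustrative.

```python
import pandas as pd

# first pass: write-in answers -> standard survey options (subset of the notebook's mapping)
write_ins = {
    'Adult with toddler': 'Adult with no school age children living with me',
    'Adult with College Children': 'Adult with no school age children living with me',
    'Adult with school age children and a senior living with me.': 'Adult with school kid',
}

# second pass: standard survey options -> short labels
short_labels = {
    'Adult with school age children living with me': 'Adult with school kid',
    'Adult with no school age children living with me': 'Adult without school kid',
    'College student': 'College kid',
}

# compose the passes: follow each write-in through to its short label
combined = {raw: short_labels.get(std, std) for raw, std in write_ins.items()}
combined.update(short_labels)

demo = pd.Series(['Adult with toddler ', 'College student', 'Senior'])
cleaned = demo.str.strip().replace(combined)
print(cleaned.tolist())
```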

## Split up columns with comma-separated values into separate columns

In [6]:
# get a dataframe with each school in its own column, and 0 or 1 as the values indicating which row indicated that school
#df2 = df[pd.notnull(df['schools'])]
#df3 = df2['schools'].str.get_dummies(sep=', ')
#for column in df3:
#    column = column.strip() #remove whitespace
#df3

# split a column with comma-separated values into separate columns
def breakout_comma_separated_values(old_column_name):
    # split each response on ', ' and expand the pieces into their own columns (0, 1, 2, ...)
    pieces = df[old_column_name].str.split(', ', expand=True)

    # prefix each new column with the original column name, e.g. 'schools_0'
    new_df = pieces.rename(columns=lambda x: old_column_name + '_' + str(x))

    # return the new dataframe
    return new_df


# try it out
fields = [ 'schools', 'child_drive_reason', 'child_no_walk_reason', 'child_no_bike_reason', 'no_walk_reason', 'no_bike_reason', 'drive_reason' ]
#fields = ['schools']

for column_name in fields:

    # split up the column into multiple columns
    df2 = breakout_comma_separated_values(column_name)

    # join the new dataframe to the original dataframe
    df = pd.concat([df, df2], axis=1)

    # delete the original column from the original dataframe
    del df[column_name]

    display(df2.sample(n=10)) # random sample of rows


Unnamed: 0,schools_0,schools_1,schools_2
111,,,
501,Saint Augustine School,Ossining; Happy Hearts on Hudson,
357,,,
523,CET,PVC,
73,,,
39,,,
56,,,
99,,,
37,,,
258,,,


Unnamed: 0,child_drive_reason_0,child_drive_reason_1,child_drive_reason_2,child_drive_reason_3
496,,,,
39,,,,
140,,,,
370,,,,
469,,,,
516,,,,
165,,,,
233,,,,
50,No - we do not drive,or prefer not to drive,,
271,,,,


Unnamed: 0,child_no_walk_reason_0,child_no_walk_reason_1,child_no_walk_reason_2,child_no_walk_reason_3,child_no_walk_reason_4
30,,,,,
61,,,,,
272,,,,,
457,,,,,
101,My child does not like to walk,,,,
459,,,,,
315,We don't have time to walk,,,,
293,,,,,
194,,,,,
1,same as previous answer,,,,


Unnamed: 0,child_no_bike_reason_0,child_no_bike_reason_1,child_no_bike_reason_2,child_no_bike_reason_3
391,,,,
478,Fear of dangerous driving,Lack of adequate bike lanes,,
52,,,,
213,My child is too young to bicycle,Fear of dangerous driving,Hills,
375,My child is too young to bicycle,,,
360,,,,
154,,,,
142,My child does not like to bicycle,,,
45,,,,
144,,,,


Unnamed: 0,no_walk_reason_0,no_walk_reason_1,no_walk_reason_2,no_walk_reason_3,no_walk_reason_4,no_walk_reason_5
469,No - I walk a lot,,,,,
224,My own personal preference,,,,,
283,No - I walk a lot,,,,,
298,My health condition,,,,,
392,Lack of adequate sidewalks,,,,,
465,Lack of adequate sidewalks,,,,,
413,No - I walk a lot,,,,,
166,No - I walk a lot,,,,,
288,I prefer bicycling,,,,,
171,I don't have time to walk,,,,,


Unnamed: 0,no_bike_reason_0,no_bike_reason_1,no_bike_reason_2,no_bike_reason_3
156,I do not like to bicycle,,,
145,Too hilly.,,,
412,My own personal preference,,,
408,I do not like to bicycle,,,
175,My health condition,,,
484,I don't have time to bicycle,,,
550,Don't have bikes,but there are hills in our area that don't mak...,,
533,My health condition,,,
149,My own personal preference,,,
357,Lack of adequate bike lanes,,,


Unnamed: 0,drive_reason_0,drive_reason_1,drive_reason_2,drive_reason_3,drive_reason_4,drive_reason_5
31,Sunset Park area is uphill from everywhere.,,,,,
498,I don't have time to walk or bicycle,,,,,
534,,,,,,
161,I don't have time to walk or bicycle,,,,,
75,My own personal preference,,,,,
24,Too far,,,,,
477,I don't have time to walk or bicycle,,,,,
427,Safety concerns with walking,,,,,
79,,,,,,
283,When destination is too far for walking,,,,,
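
The samples above show a side effect of splitting on every `', '`: answers that contain a comma themselves, such as 'No - we do not drive, or prefer not to drive', are broken into two columns. One possible mitigation, offered as a heuristic sketch rather than what this notebook does, is to split only where the next piece starts with a capital letter. The helper `split_choices` and the sample responses are made up for illustration.

```python
import re
import pandas as pd

def split_choices(text):
    """Split on ', ' only when the next piece starts with a capital letter."""
    if pd.isna(text):
        return []
    return re.split(r', (?=[A-Z])', text)

# made-up responses that combine answer texts seen in the output above
responses = pd.Series([
    'Our own personal preference, No - we do not drive, or prefer not to drive',
    'Weather related events',
    None,
])

pieces = responses.apply(split_choices)

# expand the lists into prefixed columns, padding short rows with missing values
wide = pd.DataFrame(pieces.tolist(), index=responses.index).add_prefix('child_drive_reason_')
print(pieces.tolist())
print(list(wide.columns))
```

The heuristic still misfires on free-text answers where a capitalized word follows a comma, so a few rows may need manual review either way.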


## Display column names in cleaned up data


In [7]:
column_names = list(df.columns)
display(column_names)


['date',
 'neighborhood',
 'demographic',
 'child_bus_freq',
 'child_bike_freq',
 'child_driven_freq',
 'child_drive_freq',
 'child_walk_freq',
 'walk_freq',
 'bike_freq',
 'bikes_on_sidewalk',
 'self_jog_frequency',
 'commutes',
 'child_self_school',
 'child_self_bus_freq',
 'child_self_bike_freq',
 'child_self_driven_freq',
 'child_self_drive_freq',
 'child_self_walk_freq',
 'child_self_commutes',
 'child_self_has_children',
 'commuter_distance',
 'commuter_type',
 'commuter_walk_to_station_freq',
 'commuter_bike_to_station_freq',
 'commuter_drive_to_station_freq',
 'commuter_carpool_to_station_freq',
 'commuter_driven_to_station_freq',
 'commuter_bus_to_station_freq',
 'feelings',
 'problem_areas',
 'drivers_are_safe',
 'bicyclists_are_safe',
 'suggested_improvements',
 'additional_comments',
 'owns_business',
 'business_type',
 'business_space',
 'business_pedestrian_synergy',
 'business_bicyclists_synergy',
 'business_bike_rack_interest',
 'business_promotion_interest',
 'business

## Save cleaned up data to CSV file
Save the cleaned data so that subsequent notebooks can analyze it.

In [8]:
df.to_csv('responses_scrubbed.csv', index=False)
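
CSV stores plain text, so the categorical and datetime types set earlier are not preserved in `responses_scrubbed.csv`; a notebook that reads the file has to restore them. The sketch below illustrates this with an in-memory buffer and a made-up two-row frame instead of the real file.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({
    'date': pd.to_datetime(['2019-07-03 15:45:47', '2019-07-04 09:45:30']),
    'walk_freq': pd.Series(['Rarely or never', 'Once in a while'], dtype='category'),
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# re-parse the dates on load; the categorical column comes back as plain text
reloaded = pd.read_csv(buffer, parse_dates=['date'])
was_categorical_after_read = isinstance(reloaded['walk_freq'].dtype, pd.CategoricalDtype)

# restore the categorical type explicitly
reloaded['walk_freq'] = reloaded['walk_freq'].astype('category')
print(was_categorical_after_read, list(reloaded.columns))
```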