# SOC 116AC Data Cleaning (Phase 2)

## Tasks:
* [Task 1: Clean Years Worked](#first)
* [Task 2: Clean Years In US](#second)
* [Task 3: Clean Cities](#third)
* [Task 4: T/F Encode Employer Health/Safety Questions (Column BS)](#fourth)

## Setup

In [168]:
import numpy as np
import pandas as pd

import re

In [169]:
# Import data into pandas
# Now we have cleaner data to start with!

df = pd.read_csv('survey_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Worker ID,Employer ID,Employer Type,Language,Timestamp,Organization,Do you work in a private home or in a board & care facility with multiple adults?,Do you work in a private home or in a board & care facility with multiple adults? OTHER,How many years have you been working as a domestic worker or at a board and care home?,...,What Neighborhood Meet Employer (Route Owner),Employer Neighborhood,Employer Nearest Park Or Public Transit,Does Employer Provide PPE Info And Minimize Risk,Employer Health And Safety Precautious,Paid At Least Minimum Wage,Overtime Pay,Paid Sick Time,Meal Or Rest Breaks,Complaint Against Employer
0,0,1,1,Agency/Company,spanish,8/31/2021 15:28:35,CHIRLA,Private homes,,10 - 20 Years,...,,,Randolph y pacific,Yes,"Provide general PPE like gloves or masks etc.,...",Yes,Yes,No,Yes,I Don't Know
1,1,2,1,Elder Care Facility,english,8/31/2021 17:50:40,Pilipino Workers Center,Board and Care Home,,1 - 5 Years,...,,,Don’t know,Yes,Provide information on the use of chemical cle...,Yes,I never work that many hours for this employer...,No,No,No
2,2,2,2,Elder Care Facility,english,8/31/2021 17:50:40,Pilipino Workers Center,Board and Care Home,,1 - 5 Years,...,,,Poway park,Yes,Provide information on the use of chemical cle...,No,I never work that many hours for this employer...,No,Yes,No
3,3,3,1,Route Owner,spanish,8/31/2021 20:27:49,CHIRLA,Other,Limpio building,6 - 10 Years,...,Los angeles,No se,De pende lugar a ir,Yes,"Provide general PPE like gloves or masks etc.,...",Yes,I never work that many hours for this employer...,I Don't Know,Yes,No
4,4,4,1,Direct Pay from Household/Family,spanish,8/31/2021 20:28:04,CHIRLA,Private homes,,6 - 10 Years,...,,,Montana y Sepulveda,Yes,Provide information on the use of chemical cle...,Yes,Yes,Yes,Yes,No


In [170]:
df.columns

Index(['Unnamed: 0', 'Worker ID', 'Employer ID', 'Employer Type', 'Language',
       'Timestamp', 'Organization',
       'Do you work in a private home or in a board & care facility with multiple adults?',
       'Do you work in a private home or in a board & care facility with multiple adults? OTHER',
       'How many years have you been working as a domestic worker or at a board and care home?',
       'How many years have you been working as a domestic worker or at a board and care home? OTHER',
       'How long have you lived in the United States?',
       'How long have you lived in the United States? OTHER', 'Age', 'Gender',
       'Gender OTHER', 'Worker City', 'Worker Neighborhood', 'Worker Zip Code',
       'What is the closest public park or bus or train stop to where you live?',
       'On most days, how do you get to work?',
       'If on public transport, do you have to make any transfers? How many?',
       'On average, how much time do you spend traveling to work?',
    

## Task 1: Clean Years Worked <a class="anchor" id="first"></a>

In [171]:
years_worked_col = "How many years have you been working as a domestic worker or at a board and care home?"
years_worked_other_col = "How many years have you been working as a domestic worker or at a board and care home? OTHER"

In [172]:
N = 8

print(df.columns[N])
df[df.columns[N]]

Do you work in a private home or in a board & care facility with multiple adults? OTHER


0                                     NaN
1                                     NaN
2                                     NaN
3                         Limpio building
4                                     NaN
5                                     NaN
6                                     NaN
7                                     NaN
8                                     NaN
9                                     NaN
10                                    NaN
11                                    NaN
12                          live in nanny
13                                    NaN
14                                    NaN
15                                    NaN
16                                    NaN
17                                    NaN
18                                    NaN
19                                    NaN
20                                    NaN
21                                    NaN
22                                    NaN
23                                

In [173]:
df[df.columns[N]].value_counts()

Limpieza de casas                                                      6
Ahorita no trabajo                                                     4
Private homes, Nanny                                                   3
No                                                                     3
No cuido ancianos                                                      3
Home Care Agency/Registry                                              3
Casas Residenciales                                                    3
limpieza de casa                                                       2
Home Health Aide                                                       2
desde casa                                                             2
Limpio casas                                                           2
Los dos, Oficina y casa                                                2
Private homes, Board and Care Home                                     2
Ihss                                               

In [174]:
# All the rows with "Other"

df.iloc[:, 8:10][df[years_worked_col] == "Other"].head()

Unnamed: 0,Do you work in a private home or in a board & care facility with multiple adults? OTHER,How many years have you been working as a domestic worker or at a board and care home?
276,,Other
304,Vegetable Factory,Other
331,No,Other
415,,Other
431,,Other


In [175]:
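# Keep only the digits; regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
df['Years Worked OTHER'] = df[years_worked_other_col].str.replace(r'\D', '', regex=True)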

In [176]:
df[df['Years Worked OTHER'].notna()]['Years Worked OTHER']

12        4
53        5
54        5
55        5
76       21
101       3
112       3
149      29
157       2
160       2
222      25
271      27
272      27
273      27
276        
296      28
304        
331        
349      20
362      25
366    2010
373      25
375      37
397       2
415        
429      30
439        
440        
451        
456      43
458        
465      23
474       7
488      15
503        
504        
505        
515        
523      22
524      22
525        
526        
527        
545      22
Name: Years Worked OTHER, dtype: object

In [177]:
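# Row 366 gave the year they started (2010) instead of a duration.
# Replace it with the number of years worked as of the 2021 survey: 2021 - 2010 = 11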

df.at[366, 'Years Worked OTHER'] = 11
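
Row 366 is fixed by hand above. If more respondents had answered with a start year, the same conversion could be sketched as a small helper. The name `to_years` and the `survey_year=2021` default are assumptions (the year is taken from the 2021 timestamps in the data); this is an illustration, not part of the original cleaning.

```python
def to_years(digits, survey_year=2021):
    """Turn a digit string into a number of years.

    Values that look like a calendar year (1900..survey_year) are treated
    as a start year and converted to a duration. Blank or non-string
    input returns None. survey_year is an assumed default.
    """
    if not isinstance(digits, str) or not digits.strip():
        return None
    value = int(digits)
    if 1900 <= value <= survey_year:
        return survey_year - value
    return value


print(to_years("2010"))  # → 11 (start year converted to a duration)
print(to_years("27"))    # → 27 (already a duration)
```

Applied to the digit-only column, this would cover entries like `2010` without looking up row numbers, though any value it changes is still worth checking by eye.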

In [178]:
int_other_value_rows = df[df['Years Worked OTHER'].notna()]['Years Worked OTHER'].index
int_other_value_rows

Int64Index([ 12,  53,  54,  55,  76, 101, 112, 149, 157, 160, 222, 271, 272,
            273, 276, 296, 304, 331, 349, 362, 366, 373, 375, 397, 415, 429,
            439, 440, 451, 456, 458, 465, 474, 488, 503, 504, 505, 515, 523,
            524, 525, 526, 527, 545],
           dtype='int64')

In [179]:
for r in int_other_value_rows:
    years_worked = df.at[r, "Years Worked OTHER"]
    if years_worked == '':
        continue
    else:
        years_worked = int(years_worked)
    if 1 <= years_worked <= 5:
        df.at[r, years_worked_col] = "1 - 5 Years"
    if 6 <= years_worked <= 9:
        df.at[r, years_worked_col] = "6 - 10 Years"
    if 10 <= years_worked <= 19:
        df.at[r, years_worked_col] = "10 - 20 Years"
    if 20 <= years_worked <= 29:
        df.at[r, years_worked_col] = "20 - 30 Years"
    if 30 <= years_worked:
        df.at[r, years_worked_col] = "30+ Years"
    print(years_worked, end=", ")

4, 5, 5, 5, 21, 3, 3, 29, 2, 2, 25, 27, 27, 27, 28, 20, 25, 11, 25, 37, 2, 30, 43, 23, 7, 15, 22, 22, 22, 
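
The chain of `if` statements above can also be written as one function, which makes the bucket boundaries easier to read and reuse in Task 2. This is a minimal sketch; `years_to_bucket` is a hypothetical name, and the boundaries simply mirror the loop (10 goes to "10 - 20 Years", 20 to "20 - 30 Years", 30 and above to "30+ Years").

```python
def years_to_bucket(years):
    """Map a whole number of years to the survey's range label.

    Boundaries mirror the loop in the notebook. Values below 1 return
    None so the original answer can be left untouched.
    """
    if years < 1:
        return None
    if years <= 5:
        return "1 - 5 Years"
    if years <= 9:
        return "6 - 10 Years"
    if years <= 19:
        return "10 - 20 Years"
    if years <= 29:
        return "20 - 30 Years"
    return "30+ Years"


for y in (4, 7, 15, 29, 43):
    print(y, "->", years_to_bucket(y))
```

One caveat with the digit-stripping step that feeds this: an answer given in months, such as "7 meses", is reduced to `7` and would be labeled "6 - 10 Years", so answers with units other than years may need a manual check.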

In [180]:
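# To confirm the "Other"s went down, inspect df[years_worked_col].value_counts();
# df.columns[N] (N = 8) is the work-setting OTHER column, so the counts below are unchanged.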

df[df.columns[N]].value_counts()

Limpieza de casas                                                      6
Ahorita no trabajo                                                     4
Private homes, Nanny                                                   3
No                                                                     3
No cuido ancianos                                                      3
Home Care Agency/Registry                                              3
Casas Residenciales                                                    3
limpieza de casa                                                       2
Home Health Aide                                                       2
desde casa                                                             2
Limpio casas                                                           2
Los dos, Oficina y casa                                                2
Private homes, Board and Care Home                                     2
Ihss                                               

In [181]:
# Everything that's left in "Other" is blank... so we can get rid of it!

df[df[df.columns[N]] == "Other"]["Years Worked OTHER"]

Series([], Name: Years Worked OTHER, dtype: object)

In [182]:
df = df.drop(columns=["Years Worked OTHER"])

## Task 2: Clean Years in U.S. <a class="anchor" id="second"></a>

In [183]:
years_in_us_col = "How long have you lived in the United States?"
years_in_us_other_col = "How long have you lived in the United States? OTHER"

In [184]:
N = 10
print(df.columns[N])

df[df.columns[N]].head(10)

How many years have you been working as a domestic worker or at a board and care home? OTHER


0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
Name: How many years have you been working as a domestic worker or at a board and care home? OTHER, dtype: object

In [185]:
df[df.columns[N]].value_counts()

25                                                                            3
5                                                                             3
27 years                                                                      3
Limpio casas                                                                  3
Lipieza de casa,y edificios                                                   3
Trabajadora? Por que me discriminan?                                          2
22 años                                                                       2
43 years                                                                      1
37 años                                                                       1
7 meses                                                                       1
2 años                                                                        1
4 years                                                                       1
2 yrs less                              

In [186]:
df[df.columns[N:N + 2]]

Unnamed: 0,How many years have you been working as a domestic worker or at a board and care home? OTHER,How long have you lived in the United States?
0,,10 - 20 Years
1,,1 - 5 Years
2,,1 - 5 Years
3,,10 - 20 Years
4,,20 - 30 Years
5,,1 - 5 Years
6,,10 - 20 Years
7,,1 - 5 Years
8,,30+ Years
9,,10 - 20 Years


In [187]:
# All the rows with "Other"

df.iloc[:, 10:12][df[years_in_us_col] == "Other"].head()

Unnamed: 0,How many years have you been working as a domestic worker or at a board and care home? OTHER,How long have you lived in the United States?
388,,Other


In [188]:
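# Keep only the digits; regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
df['Years In US OTHER'] = df[years_in_us_other_col].str.replace(r'\D', '', regex=True)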

In [189]:
# Replace year started w/ how many years worked 

df.at[366, 'Years In US OTHER'] = 11

In [190]:
print(df[df['Years In US OTHER'].notna()]['Years In US OTHER'].to_string())

4          22
8          31
24         22
45          2
60         23
72         30
76         21
84         25
95         32
108        22
109        22
110        31
112        17
131        30
145        30
146        30
147        27
148        30
149        30
157         2
160         2
166        25
167        25
177        21
180        31
181        31
182        31
187        30
194        27
203        40
218        27
219        27
220        27
221        25
222        29
227        21
228        27
229        21
230      1978
244        40
249        32
254        24
261        34
262        25
271        27
272        27
273        27
284        21
286        22
291        35
292        35
293        35
296        28
310        28
316        32
318        28
321        24
322        24
326        23
327        23
328        23
330        28
335        22
350    198530
362        31
366        11
369        23
370        23
371        23
373        25
375        38
377   

In [191]:
int_other_value_rows = df[df['Years In US OTHER'].notna()]['Years In US OTHER'].index
int_other_value_rows

Int64Index([  4,   8,  24,  45,  60,  72,  76,  84,  95, 108,
            ...
            531, 537, 538, 539, 540, 541, 543, 544, 545, 559],
           dtype='int64', length=120)

In [192]:
for r in int_other_value_rows:
    years_in_us = df.at[r, 'Years In US OTHER']
    if years_in_us == '':
        continue
    else:
        years_in_us = int(years_in_us)
    if 1 <= years_in_us <= 5:
        df.at[r, years_in_us_col] = "1 - 5 Years"
    if 6 <= years_in_us <= 9:
        df.at[r, years_in_us_col] = "6 - 10 Years"
    if 10 <= years_in_us <= 19:
        df.at[r, years_in_us_col] = "10 - 20 Years"
    if 20 <= years_in_us <= 29:
        df.at[r, years_in_us_col] = "20 - 30 Years"
    if 30 <= years_in_us:
        df.at[r, years_in_us_col] = "30+ Years"
    print(years_in_us, end=" ")


22 31 22 2 23 30 21 25 32 22 22 31 17 30 30 30 27 30 30 2 2 25 25 21 31 31 31 30 27 40 27 27 27 25 29 21 27 21 1978 40 32 24 34 25 27 27 27 21 22 35 35 35 28 28 32 28 24 24 23 23 23 28 22 198530 31 11 23 23 23 25 38 40 40 40 21 22 25 30 30 22 24 24 37 30 30 28 30 23 36 41 23 32 30 24 24 30 23 32 26 30 23 22 27 21 35 30 25 24 29 20 25 25 25 25 25 23 22 25 23 
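
The printed values include `1978` and `198530`, which cannot be durations in years but still satisfy `30 <= years_in_us` and are therefore labeled "30+ Years". A quick sanity check before bucketing could look like the sketch below; `implausible_years` and the `max_years=100` cutoff are assumptions chosen for illustration.

```python
def implausible_years(values, max_years=100):
    """Return the values that cannot be a duration in years.

    max_years is an assumed upper bound, not something taken from the survey.
    """
    return [v for v in values if v < 0 or v > max_years]


sample = [22, 31, 1978, 40, 198530, 11]  # a few values from the output above
print(implausible_years(sample))  # → [1978, 198530]
```

Anything this flags is best inspected against the original free-text answer before deciding on a bucket.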

In [193]:
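# To confirm the "Other"s went down, inspect df[years_in_us_col].value_counts();
# df.columns[N] (N = 10) is the years-worked OTHER column, so the counts below are unchanged.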

df[df.columns[N]].value_counts()

25                                                                            3
5                                                                             3
27 years                                                                      3
Limpio casas                                                                  3
Lipieza de casa,y edificios                                                   3
Trabajadora? Por que me discriminan?                                          2
22 años                                                                       2
43 years                                                                      1
37 años                                                                       1
7 meses                                                                       1
2 años                                                                        1
4 years                                                                       1
2 yrs less                              

In [194]:
df = df.drop(columns=['Years In US OTHER'])

## Task 3: Clean Cities <a class="anchor" id="third"></a>

In [195]:
city_col_names = df.columns[df.columns.str.contains("City")]

df[city_col_names].head()

Unnamed: 0,Worker City,Employer City,What City Meet Employer (Route Owner)
0,Huntington Park,City Of Comerce,
1,San Diego,Carlsbad,
2,San Diego,San Diego,
3,Los Angeles,Na,No se
4,Los Angeles,Los Angeles,


In [196]:
employer_city_col = city_col_names[1]

df[employer_city_col].value_counts()

San Francisco           94
Los Angeles             35
Na                      25
San Jose                18
Oakland                 16
San Diego               13
Palo Alto               10
Costa Mesa              10
Santa Monica             9
Location Varies          8
Santa Ana                7
San Mateo                7
Fremont                  6
Newport Beach            6
Mountain View            6
Irvine                   5
Santa Rosa               5
Redwood City             5
Petaluma                 5
Beverly Hills            5
Sunnyvale                5
Vista                    4
Ihss                     4
San Leandro              4
Van Nuys                 4
La Palma                 4
Long Beach               4
Pasadena                 4
Huntington Beach         4
Chatsworth               4
                        ..
Dana Point               1
Newpor Beach             1
San Pedro                1
Philipsburg              1
Via Circle               1
Santa Eosa               1
S

In [197]:
# Make all title case (First letter of each word capitalized, rest lowercase)
# Also get rid of any extra spaces

df[employer_city_col] = df[employer_city_col].str.title()
df[employer_city_col] = df[employer_city_col].str.strip()

In [198]:
# Get rid of 'California' or 'Ca' in city name
df[employer_city_col] = df[employer_city_col].str.replace(r"California|Ca$|Ca\.$", '', regex=True)
df[employer_city_col] = df[employer_city_col].str.strip()

In [199]:
# These are all the city entries left with more than three words 
df[df[employer_city_col].str.split().str.len() > 3][employer_city_col]

114    El Contado De Sonoma
221      Cerca De Su Ciudad
224      Cerca De La Ciudad
Name: Employer City, dtype: object

In [200]:
# Remove character that is not either a word character or a space (punctuation)
df[employer_city_col] = df[employer_city_col].str.replace(r"[^\w\s]", '', regex=True)

In [201]:
# Replace consecutive spaces with just one
df[employer_city_col] = df[employer_city_col].str.replace(r"\s{2,}", ' ', regex=True)
df[employer_city_col] = df[employer_city_col].str.strip()
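
The normalization steps so far (title case, strip, drop the state suffix, remove punctuation, collapse spaces) can be checked on single strings with plain `re`. The sketch below is for experimenting with the patterns only; `clean_city` is a hypothetical helper, and the sample inputs are made up rather than taken from the survey.

```python
import re


def clean_city(raw):
    """Apply the normalization steps above to one city string."""
    city = raw.title().strip()
    city = re.sub(r"California|Ca$|Ca\.$", "", city).strip()
    city = re.sub(r"[^\w\s]", "", city)   # drop punctuation
    city = re.sub(r"\s{2,}", " ", city)   # collapse runs of spaces
    return city.strip()


print(clean_city("  san jose, ca. "))  # → San Jose
print(clean_city("I don't know"))      # → I DonT Know
print(clean_city("三藩市"))             # → 三藩市
```

The second example shows why entries such as "I DonT Know" turn up later: `str.title()` capitalizes the letter after an apostrophe, and the punctuation step then removes the apostrophe. The third shows that `\w` matches non-Latin characters in Python 3, so Chinese city names pass through unchanged.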

In [202]:
# Get rid of 'CA' after city name
# Values were title-cased above, so match 'Ca' as well as 'CA'
df[employer_city_col] = df[employer_city_col].str.replace(r"\s(?:CA|Ca)$| Cal$", '', regex=True)
df[employer_city_col] = df[employer_city_col].str.strip()

In [203]:
# Also some just 'La's
df[employer_city_col] = df[employer_city_col].str.replace(r"^La$", "Los Angeles", regex=True)
df[employer_city_col] = df[employer_city_col].str.strip()

In [204]:
# Use these functions to replace some city entries
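def replace_val(col, old, new):
    # Replace entries that match `old` exactly (escaped and anchored at both ends)
    df[col] = df[col].str.replace(r"^" + re.escape(old) + "$", new, regex=True)
    
def replace_many_with(col, olds, new):
    # Map every spelling in `olds` to the same value `new`
    for o in olds:
        replace_val(col, o, new)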
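
Because both helpers only replace entries that match in full, the same idea can be expressed as a dictionary lookup. The sketch below uses plain Python so it runs without the survey data; `canonical_city` and the small `corrections` dictionary are illustrative assumptions.

```python
def canonical_city(city, corrections):
    """Return the corrected spelling if the whole string is a known variant."""
    return corrections.get(city, city)


# Hypothetical subset of the spelling fixes used in this notebook
corrections = {
    "Costa Messa": "Costa Mesa",
    "Sandiego": "San Diego",
    "Sf": "San Francisco",
    "San Fco": "San Francisco",
}

for name in ("Sf", "Costa Messa", "Oakland", "Sandiego Bay"):
    print(name, "->", canonical_city(name, corrections))
```

"Sandiego Bay" is left alone because only exact matches are replaced, which is the same behavior the `^...$` anchors give. In pandas, passing such a dictionary to `Series.replace` performs this kind of whole-value replacement in one call.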

In [205]:
na_synonyms = ["No", "No Se", "I DonT Know", "None", "", "Not Sure", "No Aplica", "Not Always Same Employer Changes", \
              "Lives Far Away 46 Minutes Refuses To Disclose Info", "She DoesnT Know", \
               "Does Not Want To Reveal Employer Information"]

varies_synonyms = ["Diferentes", "Trabajo En Diferentes", "Different Cities No Permenant One", \
                   "Son Diferentes Personas Ala Q Les Trabajo", "It Depends On The Assignments", \
                   "Diferentes Lugares", "Solo Tengo 3 Empleo", "Diferentes Ciudades"]

not_working_synonyms = ["Does Not Currently Work", "No Tengo Mas", "No Trabajo"]

sf_synonyms = ["Sf", "San Francico", "San Fransciso", "San Fransisco", "三藩市", "San Fco"]

replace_many_with(employer_city_col, not_working_synonyms, "Currently Not Working")
replace_many_with(employer_city_col, na_synonyms, "Na")
replace_many_with(employer_city_col, varies_synonyms, "Location Varies")
replace_many_with(employer_city_col, sf_synonyms, "San Francisco")

replace_val(employer_city_col, "Costa Messa", "Costa Mesa")
replace_val(employer_city_col, "Los Ángeles", "Los Angeles")
replace_val(employer_city_col, "Newpor Beach", "Newport Beach")
replace_val(employer_city_col, "Daily City", "Daly City")
replace_val(employer_city_col, "Palto Alto", "Palo Alto")
replace_val(employer_city_col, "Sandiego", "San Diego")
replace_val(employer_city_col, "Santa Mónica", "Santa Monica")
replace_val(employer_city_col, "San José", "San Jose")
replace_val(employer_city_col, "Beverly", "Beverly Hills")
replace_val(employer_city_col, "紅木城", "Redwood City")
replace_val(employer_city_col, "Torrance City", "Torrance")
replace_val(employer_city_col, "El Contado De Sonoma", "Sonoma County")
replace_val(employer_city_col, "Formerly In Poway", "Poway")
replace_val(employer_city_col, "Sta Monica", "Santa Monica")
replace_val(employer_city_col, "Monclear", "Montclair")
replace_val(employer_city_col, "Westcovina", "West Covina")
replace_val(employer_city_col, "San Leonardo", "San Leandro")
replace_val(employer_city_col, "Vannuys", "Van Nuys")
replace_val(employer_city_col, "Cambell", "Campbell")
replace_val(employer_city_col, "New Port Beach", "Newport Beach")
replace_val(employer_city_col, "Harbor Cityharbor", "Harbor City")
replace_val(employer_city_col, "Condado De Orange", "Orange County")
replace_val(employer_city_col, "Millvalley", "Mill Valley")
replace_val(employer_city_col, "Yorbalinda", "Yorba Linda")
replace_val(employer_city_col, "Knott St Long Beach", "Long Beach")
replace_val(employer_city_col, "Burbon Tarsana Los Angeles North Hills", "Tarzana Los Angeles North Hills")

print(df[employer_city_col].value_counts().to_string())

San Francisco            94
Los Angeles              35
Na                       25
San Jose                 19
Oakland                  16
San Diego                13
Palo Alto                10
Costa Mesa               10
Santa Monica              9
Location Varies           8
Santa Ana                 7
San Mateo                 7
Newport Beach             7
Fremont                   6
Mountain View             6
Redwood City              5
Petaluma                  5
Santa Rosa                5
Irvine                    5
Sunnyvale                 5
Beverly Hills             5
Long Beach                4
Vista                     4
San Leandro               4
Van Nuys                  4
Huntington Beach          4
Pasadena                  4
Chatsworth                4
Ihss                      4
La Palma                  4
Currently Not Working     3
Carlsbad                  3
Glendale                  3
Carson                    3
Oceanside                 3
Alameda             

In [206]:
# Get rid of "En" before city names
df[df[employer_city_col].str.contains(r"^En ").fillna(False)][employer_city_col]

df[employer_city_col] = df[employer_city_col].str.replace(r"^En ", '', regex=True)
df[employer_city_col] = df[employer_city_col].str.strip()

In [207]:
# This is a Marriott in Oakland! Zip is: 94621
df[df[employer_city_col] == "Oakland 333 Heg Ember Road"]

df.at[90, "Employer Zip Code"] = 94621
replace_val(employer_city_col, "Oakland 333 Heg Ember Road", "Oakland")

In [208]:
# Zip is: 90501
df[df[employer_city_col] == "Western St In Torrance"]

df.at[129, "Employer Zip Code"] = 90501
replace_val(employer_city_col, "Western St In Torrance", "Torrance")

In [209]:
# "Sunset Boulevard Beverly Hills" Zip is: 90210
df[df[employer_city_col] == "Sunset Boulevard Beverly Hills"]["Employer Zip Code"]

df.at[79, "Employer Zip Code"] = 90210
replace_val(employer_city_col, "Sunset Boulevard Beverly Hills", "Beverly Hills")

In [210]:
# Venice, Los Angeles Zip is: 90291
df[df[employer_city_col] == "Venis"]["Employer Zip Code"]

df.at[376, "Employer Zip Code"] = 90291
replace_val(employer_city_col, "Venis", "Venice")

In [211]:
# Koreatown, Los Angeles Zip is: 90010
df.at[465, "Employer Zip Code"] = 90010
replace_val(employer_city_col, "El Barrio Koreano Y El Otro En Hollywood", "Los Angeles")

In [212]:
# "Wood Acre Drive" in SF Zip is: 94132
df.at[567, "Employer Zip Code"] = 94132
replace_val(employer_city_col, "Wood Acre Drive", "San Francisco")

In [213]:
# Zip is: 91784
df.at[122, "Employer Zip Code"] = 91784
replace_val(employer_city_col, "Romangel 1741 Erin Avenue Upland 91784 The Mother Thrice 25379 Wayne Mills Pl Suite 154 Valencia 91355", "Upland Valencia")

### Make a column "Additional Employer Cities"

In [214]:
# df[df[employer_city_col].str.split().str.len() > 2][employer_city_col]

0            City Of Comerce
13       Rancho Palos Verdes
18      Rolling Hills Estate
193    Robinson Olympic Blvd
221       Cerca De Su Ciudad
224       Cerca De La Ciudad
317    Currently Not Working
395           Marina Del Rey
421    Currently Not Working
458    Currently Not Working
Name: Employer City, dtype: object

In [215]:
# employer_secondary_city_col = "Employer City Secondary"

# df[employer_secondary_city_col] = ""

In [216]:
# def move_extra_cities(row, primary_city):
#     if df.at[row, employer_city_col] != primary_city:
#         secondary_cities = df.at[row, employer_city_col].replace(primary_city, '')
#         df.at[row, employer_city_col] = primary_city
#         df.at[row, employer_secondary_city_col] = secondary_cities

In [217]:
# move_extra_cities(32, "Fremont")
# move_extra_cities(60, "Petaluma")
# move_extra_cities(69, "Newport Beach")

# move_extra_cities(78, "Woodland Hills")
# move_extra_cities(91, "Santa Monica")
# move_extra_cities(122, "Upland")
# move_extra_cities(154, "Lake Forest")
# move_extra_cities(261, "Dana Point")
# move_extra_cities(264, "Sunnyvale")
# move_extra_cities(295, "San Francisco")
# move_extra_cities(319, "Tarzana")
# move_extra_cities(365, "San Jose")
# move_extra_cities(397, "San Francisco")
# move_extra_cities(406, "Santa Ana")
# move_extra_cities(442, "San Francisco")
# move_extra_cities(491, "San José")
# move_extra_cities(507, "Malibu")
# move_extra_cities(518, "Culver City")

In [218]:
# df[df[employer_city_col].str.split().str.len() > 2][employer_city_col]

0            City Of Comerce
13       Rancho Palos Verdes
18      Rolling Hills Estate
193    Robinson Olympic Blvd
221       Cerca De Su Ciudad
224       Cerca De La Ciudad
317    Currently Not Working
395           Marina Del Rey
421    Currently Not Working
458    Currently Not Working
Name: Employer City, dtype: object

In [219]:
# df[df[employer_city_col].str.len() > 13][employer_city_col]

0            City Of Comerce
13       Rancho Palos Verdes
18      Rolling Hills Estate
43          Huntington Beach
75            San Bernardino
78            Woodland Hills
84           Location Varies
128           Woodland Hills
182         Huntington Beach
192        Robinston Olympic
193    Robinson Olympic Blvd
221       Cerca De Su Ciudad
224       Cerca De La Ciudad
228          Location Varies
240         Huntington Beach
317    Currently Not Working
369         Rancho Cucamonga
375         Hacienda Heights
380          Location Varies
388          Location Varies
395           Marina Del Rey
421    Currently Not Working
453         Huntington Beach
458    Currently Not Working
487          American Canyon
553          Location Varies
555          Location Varies
560          Location Varies
566          Location Varies
Name: Employer City, dtype: object

In [220]:
# #Get rid of conjunctions before city names
# df[df[employer_secondary_city_col].str.contains(r"(^And )|(^Y )|(^O )|(^En )").fillna(False)][employer_secondary_city_col]

# df[employer_secondary_city_col] = df[employer_secondary_city_col].str.replace(r"(^And )|(^Y )|(^O )|(^En )", '')
# df[employer_secondary_city_col] = df[employer_secondary_city_col].str.strip()



In [221]:
# df[df[employer_secondary_city_col].notna()][[employer_city_col, employer_secondary_city_col]]

Unnamed: 0,Employer City,Employer City Secondary
0,City Of Comerce,
1,Carlsbad,
2,San Diego,
3,Na,
4,Los Angeles,
5,Los Angeles,
6,Petaluma,
7,Union City,
8,Na,
9,Dublin,


## Task 4: T/F Encode Employer Health/Safety Questions <a class="anchor" id="fourth"></a>

This column is named "Employer Health And Safety Precautious" (presumably a typo for "Precautions").

Response Choices:
* None
* Provide information on the use of chemical cleaning products
* Provide information on the handling of bodily fluids or bio waste
* Provide general PPE like gloves or masks etc.
* Provide training on how to prevent injuries? (such as to prevent injuries from repetitive motions or from lifting heavy objects or your clients)
* Ensure access to drinking water, toilets, and washing facilities

In [222]:
employer_health_safety_col_name = "Employer Health And Safety Precautious"

answer_options = np.array(["None",
                          "Provide information on the use of chemical cleaning products",
                          "Provide information on the handling of bodily fluids or bio waste",
                          "Provide general PPE like gloves or masks etc.",
                          "Provide training on how to prevent injuries? (such as to prevent injuries from repetitive motions or from lifting heavy objects or your clients)",
                          "Ensure access to drinking water, toilets, and washing facilities"])

employer_health_safety_col_num = np.where(df.columns == employer_health_safety_col_name)[0][0]
employer_health_safety_col_num

71

### Split this concatenated-responses column into separate columns

In [223]:
# Make new columns based on if an answer was checked (True) or not (False)
col_num = int(employer_health_safety_col_num) + 1

for ans in answer_options:
    df.insert(col_num, ans, df[employer_health_safety_col_name].str.contains(ans, regex=False))
    col_num += 1
    
df[answer_options].head()

Unnamed: 0,None,Provide information on the use of chemical cleaning products,Provide information on the handling of bodily fluids or bio waste,Provide general PPE like gloves or masks etc.,Provide training on how to prevent injuries? (such as to prevent injuries from repetitive motions or from lifting heavy objects or your clients),"Ensure access to drinking water, toilets, and washing facilities"
0,False,False,False,True,True,True
1,False,True,True,True,False,False
2,False,True,True,True,True,True
3,False,False,False,True,False,True
4,False,True,False,True,True,True
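
The encoding above tests for each option as a substring rather than splitting the response on commas, because one of the options ("Ensure access to drinking water, toilets, and washing facilities") contains commas itself. A pure-Python sketch of the same idea follows; `encode_choices`, the shortened `OPTIONS` list, and the sample `row` are assumptions made for illustration.

```python
# Shortened option list for illustration; the notebook uses all six options
OPTIONS = [
    "None",
    "Provide general PPE like gloves or masks etc.",
    "Ensure access to drinking water, toilets, and washing facilities",
]


def encode_choices(response, options=OPTIONS):
    """Return {option: True/False} for one concatenated response string."""
    return {opt: opt in response for opt in options}


# Hypothetical response in the same style as the survey column
row = ("Provide general PPE like gloves or masks etc., "
       "Ensure access to drinking water, toilets, and washing facilities")
flags = encode_choices(row)
print(flags["Provide general PPE like gloves or masks etc."])  # → True
print(flags["None"])                                           # → False
```

Substring matching has its own limit: a free-text answer that happens to contain the word "None" would also set the "None" flag, so that column may be worth a spot check.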


In [224]:
# Rename None column to a more useful name, like 'No Employer Health or Safety Precautions'
df = df.rename(columns={'None' : 'No Employer Health or Safety Precautions'})

df.iloc[:,employer_health_safety_col_num:employer_health_safety_col_num+12].head()

Unnamed: 0,Employer Health And Safety Precautious,No Employer Health or Safety Precautions,Provide information on the use of chemical cleaning products,Provide information on the handling of bodily fluids or bio waste,Provide general PPE like gloves or masks etc.,Provide training on how to prevent injuries? (such as to prevent injuries from repetitive motions or from lifting heavy objects or your clients),"Ensure access to drinking water, toilets, and washing facilities",Paid At Least Minimum Wage,Overtime Pay,Paid Sick Time,Meal Or Rest Breaks,Complaint Against Employer
0,"Provide general PPE like gloves or masks etc.,...",False,False,False,True,True,True,Yes,Yes,No,Yes,I Don't Know
1,Provide information on the use of chemical cle...,False,True,True,True,False,False,Yes,I never work that many hours for this employer...,No,No,No
2,Provide information on the use of chemical cle...,False,True,True,True,True,True,No,I never work that many hours for this employer...,No,Yes,No
3,"Provide general PPE like gloves or masks etc.,...",False,False,False,True,False,True,Yes,I never work that many hours for this employer...,I Don't Know,Yes,No
4,Provide information on the use of chemical cle...,False,True,False,True,True,True,Yes,Yes,Yes,Yes,No


### Make a separate column for the 'Other' responses

How to get the 'Other' responses from this column

To isolate the 'Other' responses, I copied the original column into a new 'Other' column and removed every predefined response from the copy, so that only the free-text 'Other' responses remain.

In [225]:
# Remove the concatenated responses that we made new columns for from the original column 
# This should only leave the 'Other' responses in our original column, and some leftover punctuation/whitespace
other_col_name = "Employer Health And Safety Precautious OTHER"


df.insert(col_num, other_col_name, df[employer_health_safety_col_name])

for ans in answer_options:
    df[other_col_name] = df[other_col_name].str.replace(ans + ",", '', regex=False)
    df[other_col_name] = df[other_col_name].str.replace(ans, '', regex=False)
    

df[other_col_name].head(15)

0         
1         
2         
3         
4         
5         
6         
7         
8         
9         
10        
11        
12        
13        
14        
Name: Employer Health And Safety Precautious OTHER, dtype: object

In [226]:
# Remove leading/trailing whitespace from the column
df[other_col_name] = df[other_col_name].str.strip()
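
The removal and strip steps can be tried on a single string to see what is left behind. In this sketch, `leftover_other`, the two-item `options` list, and the "Hand sanitizer" answer are all hypothetical. Unlike the cell above, it also strips stray commas, since removing an option can leave its separator behind.

```python
def leftover_other(response, options):
    """Strip every known option from a response and tidy what remains."""
    # Remove longer options first so a short option cannot split a longer one
    for opt in sorted(options, key=len, reverse=True):
        response = response.replace(opt, "")
    return response.strip(" ,")


options = ["None", "Provide general PPE like gloves or masks etc."]
answer = "Provide general PPE like gloves or masks etc., Hand sanitizer"
print(leftover_other(answer, options))  # → Hand sanitizer
```

Rows where only predefined options were checked come out as empty strings, which matches the blank values shown in the output above.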

In [227]:
df.iloc[:,employer_health_safety_col_num:employer_health_safety_col_num+12].head(12)

Unnamed: 0,Employer Health And Safety Precautious,No Employer Health or Safety Precautions,Provide information on the use of chemical cleaning products,Provide information on the handling of bodily fluids or bio waste,Provide general PPE like gloves or masks etc.,Provide training on how to prevent injuries? (such as to prevent injuries from repetitive motions or from lifting heavy objects or your clients),"Ensure access to drinking water, toilets, and washing facilities",Employer Health And Safety Precautious OTHER,Paid At Least Minimum Wage,Overtime Pay,Paid Sick Time,Meal Or Rest Breaks
0,"Provide general PPE like gloves or masks etc.,...",False,False,False,True,True,True,,Yes,Yes,No,Yes
1,Provide information on the use of chemical cle...,False,True,True,True,False,False,,Yes,I never work that many hours for this employer...,No,No
2,Provide information on the use of chemical cle...,False,True,True,True,True,True,,No,I never work that many hours for this employer...,No,Yes
3,"Provide general PPE like gloves or masks etc.,...",False,False,False,True,False,True,,Yes,I never work that many hours for this employer...,I Don't Know,Yes
4,Provide information on the use of chemical cle...,False,True,False,True,True,True,,Yes,Yes,Yes,Yes
5,Provide information on the use of chemical cle...,False,True,True,True,True,True,,Yes,Yes,Yes,Yes
6,"Ensure access to drinking water, toilets, and ...",False,False,False,False,False,True,,Yes,Yes,No,No
7,Provide information on the use of chemical cle...,False,True,True,True,True,True,,Yes,Yes,Yes,Yes
8,Provide information on the use of chemical cle...,False,True,False,True,False,True,,No,No,No,Yes
9,Provide information on the use of chemical cle...,False,True,True,True,False,True,,Yes,Yes,I Don't Know,Yes


## Done!!

In [228]:
# Now let's export cleaned data !!!!
# index=False keeps the row index from being written as another "Unnamed: 0" column
df.to_excel("output.xlsx", index=False)