Potential resource(s):
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

# Mental Health in Tech Project

## Data Sets

[OSMI Survey on Mental Health in the Tech Workplace in 2014](https://www.kaggle.com/osmi/mental-health-in-tech-survey) 

["Ongoing" OSMI survey from 2016](https://data.world/kittybot/osmi-mental-health-tech-2016)


## Questions

What factors are most significant in influencing whether or not a person believes disclosing a mental health issue would have negative consequences?

Can we predict, based on publicly available features of a person and company, whether that person is likely to believe disclosing a mental health issue would be harmful for their career?

## Exploring and Cleaning 2016 Data

In [1]:
import pandas as pd

In [2]:
df16 = pd.read_csv("./datasets/2016/mental-health-in-tech-2016_20161114.csv")
print df16.shape
df16.head(3)

(1433, 63)


Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",...,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always


In [3]:
# rename columns to shorter names that match the 2014 data set
question_col_map = {
    'Are you self-employed?': 'self_employed',
    'How many employees does your company or organization have?': 'num_employees',
    'Is your employer primarily a tech company/organization?': 'tech_company',
    'Is your primary role within your company related to tech/IT?': 'tech_role',
    'Does your employer provide mental health benefits as part of healthcare coverage?': 'benefits',
    'Do you know the options for mental health care available under your employer-provided coverage?': 'care_options',
    'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?': 'wellness_program',
    'Does your employer offer resources to learn more about mental health concerns and options for seeking help?': 'seek_help',
    'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?': 'anonymity',
    'If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:': 'leave',  
    'Do you think that discussing a mental health disorder with your employer would have negative consequences?': 'mental_health_consequence',
    'Do you think that discussing a physical health issue with your employer would have negative consequences?': 'phys_health_consequence',
    'Would you feel comfortable discussing a mental health disorder with your coworkers?': 'coworkers',  # "comfortable" updated from "willing"
    'Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?': 'supervisor',  # "comfortable" updated from "willing"
    'Do you feel that your employer takes mental health as seriously as physical health?': 'mental_vs_physical',
    'Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?': 'obs_consequence',
    'Do you have medical coverage (private insurance or state-provided) which includes treatment of \xc2\xa0mental health issues?': 'insurance',
    'Do you know local or online resources to seek help for a mental health disorder?': 'know_resources',
    'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?': 'revealed_contacts',
    'If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?': 'revealed_contacts_consequence',
    'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?': 'revealed_coworkers',
    'If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?': 'revealed_coworkers_consequence',
    'Do you believe your productivity is ever affected by a mental health issue?': 'productivity_impacted',  # close to 'work_interfere'
    'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?': 'percent_time_impacted',
    'Do you have previous employers?': 'prev_employer',
    'Have your previous employers provided mental health benefits?': 'prev_benefits',
    'Were you aware of the options for mental health care provided by your previous employers?': 'prev_care_options',
    'Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?': 'prev_wellness_program',
    'Did your previous employers provide resources to learn more about mental health issues and how to seek help?': 'prev_seek_help',
    'Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?': 'prev_anonymity',
    'Do you think that discussing a mental health disorder with previous employers would have negative consequences?': 'prev_mental_health_consequence',
    'Do you think that discussing a physical health issue with previous employers would have negative consequences?': 'prev_phys_health_consequence',
    'Would you have been willing to discuss a mental health issue with your previous co-workers?': 'prev_coworkers',
    'Would you have been willing to discuss a mental health issue with your direct supervisor(s)?': 'prev_supervisor',
    'Did you feel that your previous employers took mental health as seriously as physical health?': 'prev_mental_vs_physical',
    'Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?': 'prev_obs_consequence',
    'Would you be willing to bring up a physical health issue with a potential employer in an interview?': 'phys_health_interview',
    'Why or why not?': 'phys_health_interview_comment',
    'Would you bring up a mental health issue with a potential employer in an interview?': 'mental_health_interview',
    'Why or why not?.1': 'mental_health_interview_comment',
    'Do you feel that being identified as a person with a mental health issue would hurt your career?': 'hurt_career',
    'Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?': 'viewed_negatively',
    'How willing would you be to share with friends and family that you have a mental illness?': 'friends_family',
    'Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?': 'obs_negative_response',  # close to 'obs_consequence'
    'Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?': 'reluctant_due_to_obs',
    'Do you have a family history of mental illness?': 'family_history',
    'Have you had a mental health disorder in the past?': 'past_disorder',
    'Do you currently have a mental health disorder?': 'current_disorder',
    'If yes, what condition(s) have you been diagnosed with?': 'diagnosed_conditions',
    'If maybe, what condition(s) do you believe you have?': 'believed_conditions',
    'Have you been diagnosed with a mental health condition by a medical professional?': 'professional_diagnosed',
    'If so, what condition(s) were you diagnosed with?': 'professional_diagnoses',
    'Have you ever sought treatment for a mental health issue from a mental health professional?': 'treatment',
    'If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?': 'work_interfere_treated',
    'If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?': 'work_interfere_untreated',
    'What is your age?': 'age',
    'What is your gender?': 'gender',
    'What country do you live in?': 'live_in_country',
    'What US state or territory do you live in?': 'live_in_state',
    'What country do you work in?': 'work_in_country',
    'What US state or territory do you work in?': 'work_in_state',
    'Which of the following best describes your work position?': 'position',
    'Do you work remotely?': 'remote_work'  # previously asked at least 50% of time
}
df16.rename(columns=question_col_map, inplace=True)
df16.head(3)

Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,work_interfere_treated,work_interfere_untreated,age,gender,live_in_country,live_in_state,work_in_country,work_in_state,position,remote_work
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always


In [4]:
col_question_map = {col: question for question, col in question_col_map.items()}
# col_question_map
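
Inverting `question_col_map` only round-trips cleanly when every short column name is unique; if two questions shared a short name, one would silently overwrite the other. The sketch below shows one way to guard against that. The helper `invert_unique` and the two-entry `toy_map` are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def invert_unique(mapping):
    """Invert a dict, raising if two keys share the same value."""
    counts = Counter(mapping.values())
    dupes = sorted(value for value, n in counts.items() if n > 1)
    if dupes:
        raise ValueError("duplicate short names: {0}".format(dupes))
    return {value: key for key, value in mapping.items()}


# Toy stand-in for question_col_map (two entries copied from the notebook)
toy_map = {
    'What is your age?': 'age',
    'What is your gender?': 'gender',
}
inverted = invert_unique(toy_map)
```

Applied to the full `question_col_map`, the same call would either return the inverse mapping or name the colliding short names.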

<details><summary> Click to expand all **original questions/fields** </summary>
- Are you self-employed?  
- How many employees does your company or organization have?  
- Is your employer primarily a tech company/organization?  
- Is your primary role within your company related to tech/IT?  
- Does your employer provide mental health benefits as part of healthcare coverage?  
- Do you know the options for mental health care available under your employer-provided coverage?  
- Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?  
- Does your employer offer resources to learn more about mental health concerns and options for seeking help?  
- Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?  
- If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:  
- Do you think that discussing a mental health disorder with your employer would have negative consequences?  
- Do you think that discussing a physical health issue with your employer would have negative consequences?  
- Would you feel comfortable discussing a mental health disorder with your coworkers?  
- Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?  
- Do you feel that your employer takes mental health as seriously as physical health?  
- Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?  
- Do you have medical coverage (private insurance or state-provided) which includes treatment of mental health issues?  
- Do you know local or online resources to seek help for a mental health disorder?  
- If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?  
- If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?  
- If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?  
- If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?  
- Do you believe your productivity is ever affected by a mental health issue?  
- If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?  
- Do you have previous employers?  
- Have your previous employers provided mental health benefits?  
- Were you aware of the options for mental health care provided by your previous employers?  
- Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?  
- Did your previous employers provide resources to learn more about mental health issues and how to seek help?  
- Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?  
- Do you think that discussing a mental health disorder with previous employers would have negative consequences?  
- Do you think that discussing a physical health issue with previous employers would have negative consequences?  
- Would you have been willing to discuss a mental health issue with your previous co-workers?  
- Would you have been willing to discuss a mental health issue with your direct supervisor(s)?  
- Did you feel that your previous employers took mental health as seriously as physical health?  
- Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?  
- Would you be willing to bring up a physical health issue with a potential employer in an interview?  
- Why or why not?  
- Would you bring up a mental health issue with a potential employer in an interview?  
- Why or why not?.1  
- Do you feel that being identified as a person with a mental health issue would hurt your career?  
- Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?  
- How willing would you be to share with friends and family that you have a mental illness?  
- Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?  
- Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?  
- Do you have a family history of mental illness?  
- Have you had a mental health disorder in the past?  
- Do you currently have a mental health disorder?  
- If yes, what condition(s) have you been diagnosed with?  
- If maybe, what condition(s) do you believe you have?  
- Have you been diagnosed with a mental health condition by a medical professional?  
- If so, what condition(s) were you diagnosed with?  
- Have you ever sought treatment for a mental health issue from a mental health professional?  
- If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?  
- If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?  
- What is your age?  
- What is your gender?  
- What country do you live in?  
- What US state or territory do you live in?  
- What country do you work in?  
- What US state or territory do you work in?  
- Which of the following best describes your work position?  
- Do you work remotely?  
</details>

#### Quick NaN Check


In [5]:
# NaN check
counts = df16.count()
numrows = df16.shape[0]
for col in df16.columns:
    if counts[col] != numrows:
        print "{0}\n\t{1} NaNs\t\t{2} values".format(col, numrows-counts[col], counts[col])

num_employees
	287 NaNs		1146 values
tech_company
	287 NaNs		1146 values
tech_role
	1170 NaNs		263 values
benefits
	287 NaNs		1146 values
care_options
	420 NaNs		1013 values
wellness_program
	287 NaNs		1146 values
seek_help
	287 NaNs		1146 values
anonymity
	287 NaNs		1146 values
leave
	287 NaNs		1146 values
mental_health_consequence
	287 NaNs		1146 values
phys_health_consequence
	287 NaNs		1146 values
coworkers
	287 NaNs		1146 values
supervisor
	287 NaNs		1146 values
mental_vs_physical
	287 NaNs		1146 values
obs_consequence
	287 NaNs		1146 values
insurance
	1146 NaNs		287 values
know_resources
	1146 NaNs		287 values
revealed_contacts
	1146 NaNs		287 values
revealed_contacts_consequence
	1289 NaNs		144 values
revealed_coworkers
	1146 NaNs		287 values
revealed_coworkers_consequence
	1146 NaNs		287 values
productivity_impacted
	1146 NaNs		287 values
percent_time_impacted
	1229 NaNs		204 values
prev_benefits
	169 NaNs		1264 values
prev_care_options
	169 NaNs		1264 values
prev_wellness_prog
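
The same NaN tally can also be produced without an explicit loop. The following is a minimal sketch on a toy frame; the helper name `nan_summary` and the toy values are assumptions made for illustration only.

```python
import pandas as pd


def nan_summary(df):
    """Return NaN and non-NaN counts for every column that has any NaN."""
    summary = pd.DataFrame({
        'n_nan': df.isnull().sum(),   # missing entries per column
        'n_values': df.count(),       # non-missing entries per column
    })
    return summary[summary['n_nan'] > 0].sort_values('n_nan', ascending=False)


# Toy data; column names borrowed from the notebook, values invented
toy = pd.DataFrame({
    'self_employed': [0, 1, 0],
    'tech_role': [None, 1.0, None],
    'benefits': ['Yes', None, 'No'],
})
summary = nan_summary(toy)
```

Sorting by `n_nan` makes it easier to spot groups of columns that share the same missing count, such as the repeated 287 and 1146 above.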

In [6]:
df16[df16['revealed_coworkers'].isnull()]

Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,work_interfere_treated,work_interfere_untreated,age,gender,live_in_country,live_in_state,work_in_country,work_in_state,position,remote_work
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,...,Sometimes,Sometimes,43,Female,United States of America,Illinois,United States of America,Illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,Sometimes
5,0,More than 1000,1.0,,Yes,I am not sure,No,Yes,Yes,Somewhat easy,...,Not applicable to me,Often,42,Male,United Kingdom,,United Kingdom,,DevOps/SysAdmin|Support|Back-end Developer|Fro...,Sometimes
6,0,26-100,1.0,,I don't know,No,No,No,I don't know,Somewhat easy,...,Not applicable to me,Not applicable to me,30,M,United States of America,Tennessee,United States of America,Tennessee,Back-end Developer,Sometimes
7,0,More than 1000,1.0,,Yes,Yes,No,Yes,Yes,Very easy,...,Sometimes,Often,37,female,United States of America,Virginia,United States of America,Virginia,Dev Evangelist/Advocate|Back-end Developer,Always
8,0,26-100,0.0,1.0,I don't know,No,No,No,I don't know,Very difficult,...,Rarely,Often,44,Female,United States of America,California,United States of America,California,Support|Back-end Developer|One-person shop,Sometimes
10,0,26-100,1.0,,Yes,I am not sure,Yes,Yes,Yes,Very easy,...,Sometimes,Often,28,Male,United States of America,Oregon,United States of America,Oregon,Front-end Developer,Never
11,0,100-500,0.0,1.0,Yes,Yes,No,I don't know,I don't know,Somewhat difficult,...,Never,Rarely,34,Male,United States of America,Pennsylvania,United States of America,Pennsylvania,Executive Leadership,Sometimes


In [7]:
def preview_col(df, col):
    print col
    if col in col_question_map:
        print col_question_map[col]
    print df[col].value_counts(dropna=False)

### Ages

This is supposed to be a survey of working adults, so ages under 16 or over 80 are treated as invalid and replaced with NaN.

In [8]:
ages = df16.age.unique()
ages.sort()
print ages

[  3  15  17  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33
  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
  52  53  54  55  56  57  58  59  61  62  63  65  66  70  74  99 323]


In [9]:
nan = float('NaN')

df16.age = df16.age.map(lambda x: nan if (x<16 or x>80) else x)

In [10]:
ages = df16.age.unique()
ages.sort()
print ages

[ 17.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.  30.  31.  32.
  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.  46.  47.
  48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.  61.  62.  63.
  65.  66.  70.  74.  nan]
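
As an alternative to the `map` with a lambda, the same range filter can be written with vectorized pandas calls. This is a sketch under the assumption that the bounds stay at 16 and 80 inclusive; `clean_ages` and the toy values are hypothetical.

```python
import pandas as pd


def clean_ages(ages, low=16, high=80):
    """Keep ages within [low, high]; replace everything else with NaN."""
    # between() is inclusive on both ends, matching `x < 16 or x > 80`
    return ages.where(ages.between(low, high))


# A few values taken from the age listing above
toy_ages = pd.Series([3, 15, 17, 39, 99, 323])
cleaned = clean_ages(toy_ages)
```

As in the notebook's own output, the result becomes a float column because NaN cannot be stored in an integer column.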


### Gender responses

Gender was a free-text field, so responses vary widely in spelling and wording. To create a more manageable variable, I examine all the gender responses and categorize each as `female`, `male`, or `other`.  

Note that trans men and trans women are mapped to the `male` and `female` categories, respectively (this is also flagged in a comment in the code). Gender identity can affect feelings of being stigmatized, so this decision may be worth revisiting. 

In [11]:
# create dictionary to organize dummy data frames throughout
dummy_dfs = {}

In [12]:
print df16.gender.unique()

['Male' 'male' 'Male ' 'Female' 'M' 'female' 'm' 'I identify as female.'
 'female ' 'Bigender' 'non-binary' 'Female assigned at birth ' 'F' 'Woman'
 'man' 'fm' 'f' 'Cis female ' 'Transitioned, M2F'
 'Genderfluid (born female)' 'Other/Transfeminine'
 'Female or Multi-Gender Femme' 'Female ' 'woman' 'female/woman' 'Cis male'
 'Male.' 'Androgynous' 'male 9:1 female, roughly' nan 'Male (cis)' 'Other'
 'nb masculine' 'Cisgender Female' 'Man' 'Sex is male'
 'none of your business' 'genderqueer' 'cis male' 'Human' 'Genderfluid'
 'Enby' 'Malr' 'genderqueer woman' 'mtf' 'Queer' 'Agender' 'Dude' 'Fluid'
 "I'm a man why didn't you make this a drop down question. You should of asked sex? And I would of answered yes please. Seriously how much text can this take? "
 'mail' 'M|' 'Male/genderqueer' 'fem' 'Nonbinary' 'male ' 'human'
 'Female (props for making this a freeform field, though)' ' Female'
 'Unicorn' 'Cis Male' 'Male (trans, FtM)' 'Cis-woman' 'Genderqueer'
 'cisdude' 'Genderflux demi-girl' '

In [13]:
# categorize gender responses into male, female, other based on response
def categorize_gender(gender_response):
    if not isinstance(gender_response, str):
        return nan
    gender_response = gender_response.strip().lower()
    
    # caution - this discards gender identity detail that may be
    # correlated with mental health or feelings of being stigmatized
    male_responses = set(['male', 'm', 'man', 'cis male', 'male (cis)',
                          'male (trans)', 'cis man', 'cisdude',
                          'mal', 'male.', 'mail', 'maile', 'make', 'msle', 'malr',
                          "I'm a man why didn't you make this a drop down question. You should of asked sex? And I would of answered yes please. Seriously how much text can this take? ".strip().lower(),
                          'dude', 'male (trans, ftm)'
                         ])
    
    # caution - this discards gender identity detail that may be
    # correlated with mental health or feelings of being stigmatized
    female_responses = set(['female', 'f', 'woman', 'female (cis)', 'female/woman', 
                            'cisgender female', 'trans-female', 'mtf',
                            'trans woman', 'female (trans)', 
                            'cis-female/femme', 'cis female', 'transgender woman',
                            'femake', 'femail', 'transitioned, m2f', 'cis-woman',
                            'female (props for making this a freeform field, though)'
                           ])
    if gender_response in male_responses:
        return 'male'
    elif gender_response in female_responses:
        return 'female'
    else:
        return 'other'


In [14]:
categorized_gender_responses = df16.gender.map(categorize_gender)

In [15]:
categorized_gender_responses.value_counts(dropna=False)

male      1056
female     337
other       37
NaN          3
Name: gender, dtype: int64

In [16]:
dummy_dfs['gender'] = pd.get_dummies(categorized_gender_responses, prefix='gender', dummy_na=True)

In [17]:
df16.loc[:,'gender_category'] = categorized_gender_responses

In [18]:
df16[['gender', 'gender_category']].tail(3)

Unnamed: 0,gender,gender_category
1430,Male,male
1431,Female,female
1432,non-binary,other
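
One way to revisit the decision noted above, without changing the three categories, is to keep a separate flag for responses that mention a trans identity. The sketch below is only an assumption about how that could look: the pattern `TRANS_PATTERN`, the helper `mentions_trans`, and the choice of keywords are all hypothetical and would need review against the full response list.

```python
import pandas as pd

# Hypothetical keyword list; non-capturing group avoids a pandas warning
TRANS_PATTERN = r'\b(?:trans|mtf|ftm|m2f|f2m)'


def mentions_trans(responses):
    """Flag free-text responses that mention a trans identity."""
    return responses.str.lower().str.contains(TRANS_PATTERN, na=False)


# Responses copied from the unique values printed above, plus one missing
toy = pd.Series(['Male', 'Transitioned, M2F', None, 'mtf', 'Male (trans, FtM)'])
flags = mentions_trans(toy)
```

The flag could sit alongside `gender_category` as an extra feature, so the information discarded by the three-way grouping remains available for later analysis.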


### Yes/no questions

Many of the survey questions are yes/no questions, but some are stored as `'Yes'`/`'No'` strings and others as `0`/`1` integers. I convert them all to `1` for `Yes` and `0` for `No`.

Most of the columns with three responses have `'Yes'`, `'No'`, and a meaningful third option such as `'Maybe'` or `"I don't know"`.  Those will be one-hot encoded with dummy variables later.

In [19]:
for col in df16.columns:
    col_uniq = df16[col].unique()
    if len(col_uniq) == 2:
        print col, col_uniq

self_employed [0 1]
prev_employer [1 0]
professional_diagnosed ['Yes' 'No']
treatment [0 1]


In [20]:
def yes_no_same(response):
    if response=='Yes' or response==1:
        return 1
    elif response=='No' or response==0:
        return 0
    else:
        return response

In [21]:
def yes_no_same_column(df, column_name):
    df[column_name] = df[column_name].map(yes_no_same)

In [22]:
two_opt_cols = ['prev_employer', 'professional_diagnosed', 'treatment', 'self_employed']
for col in two_opt_cols:
    yes_no_same_column(df16, col)

In [23]:
df16[two_opt_cols].head(3)

Unnamed: 0,prev_employer,professional_diagnosed,treatment,self_employed
0,1,1,0,0
1,1,1,1,0
2,1,0,1,0
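
For columns that hold only `'Yes'`/`'No'` strings, a dictionary lookup is a compact alternative to `yes_no_same`. This is a sketch, not the notebook's method: `YES_NO_CODES` and `encode_yes_no` are hypothetical names, and unlike `yes_no_same` any value missing from the dictionary becomes NaN instead of passing through unchanged.

```python
import pandas as pd

# Only the two string labels are mapped; everything else becomes NaN
YES_NO_CODES = {'Yes': 1, 'No': 0}


def encode_yes_no(series):
    """Map 'Yes'/'No' strings to 1/0, leaving missing values as NaN."""
    return series.map(YES_NO_CODES)


toy = pd.Series(['Yes', 'No', None, 'Yes'])
encoded = encode_yes_no(toy)
```

The stricter behavior can be useful as a check, since an unexpected label shows up as a new NaN instead of remaining in the column as a string.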


#### Columns with 2 options and NaN

In [24]:
two_opt_with_nans = ['tech_company', 'tech_role', 'obs_consequence', 'insurance']

In [25]:
for col in two_opt_with_nans:
    df16[col] = df16[col].map(yes_no_same)
    print df16[col].value_counts(dropna=False)

 1.0    883
NaN     287
 0.0    263
Name: tech_company, dtype: int64
NaN     1170
 1.0     248
 0.0      15
Name: tech_role, dtype: int64
 0.0    1048
NaN      287
 1.0      98
Name: obs_consequence, dtype: int64
NaN     1146
 1.0     185
 0.0     102
Name: insurance, dtype: int64


In [26]:
# for col in two_opt_with_nans:
#     dummy_dfs[col] = pd.get_dummies(df16[col], prefix=col, dummy_na=True)

### Yes/no/maybe and other three-option columns

Many of the survey questions have yes/no/maybe or other meaningful third choices.  These columns will be one-hot encoded as dummy variables.

#### Three Option Columns

In [27]:
three_opt_cols = [col for col in df16.columns if len(df16[col].unique()) == 3]

for col in three_opt_cols:
    print col, df16[col].unique()

tech_company [  1.  nan   0.]
tech_role [ nan   1.   0.]
obs_consequence [  0.  nan   1.]
insurance [ nan   1.   0.]
phys_health_interview ['Maybe' 'Yes' 'No']
mental_health_interview ['Maybe' 'No' 'Yes']
family_history ['No' 'Yes' "I don't know"]
past_disorder ['Yes' 'Maybe' 'No']
current_disorder ['No' 'Yes' 'Maybe']
remote_work ['Sometimes' 'Never' 'Always']


In [28]:
def get_prefix(response):
    if not isinstance(response, str):
        return response
    else:
        return response.lower().replace(" ", "_").replace("'", "")
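
To make the effect of `get_prefix` concrete, here is a small usage sketch. The function is repeated so the block runs on its own (written with `isinstance`, which behaves the same for these inputs); the `examples` list is illustrative.

```python
def get_prefix(response):
    """Lower-case a string response and make it usable in a column name."""
    if not isinstance(response, str):
        return response  # NaN and numbers pass through unchanged
    return response.lower().replace(" ", "_").replace("'", "")


examples = ["I don't know", "United States of America", float('nan')]
normalized = [get_prefix(value) for value in examples]
# "I don't know" becomes 'i_dont_know', which is why the notebook shortens
# that response before calling get_prefix on the three-option columns
```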

In [29]:
true_three_opts = [
    'phys_health_interview', 
    'mental_health_interview', 
    'family_history', 
    'past_disorder', 
    'current_disorder', 
    'remote_work'
]

In [30]:
# shorten this response first; get_prefix (next cell) then turns 'dont know' into 'dont_know'
df16.replace({"I don't know": 'dont know'}, inplace=True)

In [31]:
for col in true_three_opts:
    df16[col] = df16[col].map(get_prefix)
    print df16[col].unique()

['maybe' 'yes' 'no']
['maybe' 'no' 'yes']
['no' 'yes' 'dont_know']
['yes' 'maybe' 'no']
['no' 'yes' 'maybe']
['sometimes' 'never' 'always']


In [32]:
for col in true_three_opts:
    dummy_dfs[col] = pd.get_dummies(df16[col], prefix=col)

In [33]:
dummy_dfs.keys()

['past_disorder',
 'family_history',
 'gender',
 'remote_work',
 'mental_health_interview',
 'phys_health_interview',
 'current_disorder']

**At this point**, the `dummy_dfs` dictionary holds dummy variable sets for all three-option columns and `gender`.

In [34]:
dummy_dfs['remote_work'].head()

Unnamed: 0,remote_work_always,remote_work_never,remote_work_sometimes
0,0,0,1
1,0,1,0
2,1,0,0
3,0,0,1
4,0,0,1
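
The notebook as shown keeps each set of dummy columns in `dummy_dfs` without combining them. One plausible next step, sketched here as an assumption and not taken from the source, is to concatenate the frames column-wise into a single feature matrix. The toy frame and the names `toy_dummies` and `features` are hypothetical.

```python
import pandas as pd

# Toy stand-in for two of the cleaned three-option columns
toy = pd.DataFrame({
    'remote_work': ['sometimes', 'never', 'always'],
    'family_history': ['no', 'yes', 'dont_know'],
})

# One dummy frame per column, keyed by column name like dummy_dfs
toy_dummies = {
    col: pd.get_dummies(toy[col], prefix=col) for col in toy.columns
}

# Concatenate in sorted key order so the column layout is reproducible
features = pd.concat([toy_dummies[col] for col in sorted(toy_dummies)], axis=1)
```

Because every frame shares the original row index, `pd.concat(..., axis=1)` aligns the rows without any extra bookkeeping.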


### Other categorical variables

#### Country & State

In [35]:
live_countries = df16['live_in_country'].map(get_prefix)
print live_countries.head(3)

0              united_kingdom
1    united_states_of_america
2              united_kingdom
Name: live_in_country, dtype: object


In [36]:
dummy_dfs['live_in_country'] = pd.get_dummies(live_countries, prefix='live_in')
dummy_dfs['live_in_country'].head(3)

Unnamed: 0,live_in_afghanistan,live_in_algeria,live_in_argentina,live_in_australia,live_in_austria,live_in_bangladesh,live_in_belgium,live_in_bosnia_and_herzegovina,live_in_brazil,live_in_brunei,...,live_in_slovakia,live_in_south_africa,live_in_spain,live_in_sweden,live_in_switzerland,live_in_taiwan,live_in_united_kingdom,live_in_united_states_of_america,live_in_venezuela,live_in_vietnam
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [37]:
def prefix_dummies(df, col, use_na=False):
    df[col] = df[col].map(get_prefix)
    dummy_dfs[col] = pd.get_dummies(df[col], prefix=col, dummy_na=use_na)

In [38]:
prefix_dummies(df16, 'live_in_state', True)

In [39]:
dummy_dfs['live_in_state'].head(3)

Unnamed: 0,live_in_state_alabama,live_in_state_alaska,live_in_state_arizona,live_in_state_california,live_in_state_colorado,live_in_state_connecticut,live_in_state_delaware,live_in_state_district_of_columbia,live_in_state_florida,live_in_state_georgia,...,live_in_state_south_dakota,live_in_state_tennessee,live_in_state_texas,live_in_state_utah,live_in_state_vermont,live_in_state_virginia,live_in_state_washington,live_in_state_west_virginia,live_in_state_wisconsin,live_in_state_nan
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
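
The `live_in_state_nan` column above comes from `dummy_na=True`. The sketch below contrasts the two settings on a toy series (values invented for illustration) to show what happens to respondents with no state recorded.

```python
import pandas as pd

# Second respondent has no US state, as with non-US respondents above
states = pd.Series(['illinois', None, 'oregon'])

# dummy_na=True adds an explicit indicator column for missing values
with_na = pd.get_dummies(states, prefix='live_in_state', dummy_na=True)

# Without it, a missing value is encoded as a row of all zeros
without_na = pd.get_dummies(states, prefix='live_in_state')
```

With the explicit indicator, "no state recorded" is a feature in its own right and is not confused with a row that merely lacks a match.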


In [40]:
prefix_dummies(df16, 'work_in_state', True)
prefix_dummies(df16, 'work_in_country')

In [41]:
dummy_dfs.keys()

['live_in_country',
 'work_in_state',
 'past_disorder',
 'family_history',
 'gender',
 'work_in_country',
 'live_in_state',
 'remote_work',
 'mental_health_interview',
 'phys_health_interview',
 'current_disorder']

#### If you have a mental health issue, do you feel that it interferes with your work?

In [42]:
preview_col(df16, 'work_interfere_treated')
print
preview_col(df16, 'work_interfere_untreated')

work_interfere_treated
If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?
Not applicable to me    557
Sometimes               369
Rarely                  322
Never                   120
Often                    65
Name: work_interfere_treated, dtype: int64

work_interfere_untreated
If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?
Often                   538
Not applicable to me    468
Sometimes               363
Rarely                   52
Never                    12
Name: work_interfere_untreated, dtype: int64


In [43]:
df16.replace({'Not applicable to me': 'doesnt apply'}, inplace=True)

In [44]:
prefix_dummies(df16, 'work_interfere_treated')
prefix_dummies(df16, 'work_interfere_untreated')


#### How many employees does your company or organization have?

In [45]:
preview_col(df16, 'num_employees')

num_employees
How many employees does your company or organization have?
26-100            292
NaN               287
More than 1000    256
100-500           248
6-25              210
500-1000           80
1-5                60
Name: num_employees, dtype: int64


In [46]:
df16.num_employees = df16.num_employees.map(lambda x: '1000+' if x=='More than 1000' else x)
print df16.num_employees.unique()

['26-100' '6-25' nan '1000+' '100-500' '500-1000' '1-5']


In [47]:
dummy_dfs['num_employees'] = pd.get_dummies(df16.num_employees, prefix='num_employees', dummy_na=True)
# dummy_dfs['num_employees'].head(3)

In [48]:
df16.replace({
    "I don't know": 'dont_know',
    "I am not sure": "dont_know",
    "I'm not sure": 'dont_know',
    "United States of America": "United States",
    "Not applicable to me": "doesnt apply",
    'Not eligible for coverage / N/A': "doesnt_apply",
    "No, I don't know any": 'none',
    'Yes, I know several': 'several',
    'I know some': 'some',
    'No, because it would impact me negatively': 'no_due_to_negative_impact',
    'Sometimes, if it comes up': 'sometimes',
    "No, because it doesn't matter": 'no_due_to_doesnt_matter',
    'Yes, always': 'always',
    'No, none did': 'none',
    'Some did': 'some',
    'Yes, they all did': 'all',
    'I was aware of some': 'some',
    'Yes, I was aware of all of them': 'all',
    'No, I only became aware later': 'aware_later',
    'None did': 'none',
    'Yes, all of them': 'all',
    'None of them': 'none',
    'Some of them': 'some',
    'Some of my previous employers': 'some',
    'No, at none of my previous employers': 'none',
    'Yes, at all of my previous employers': 'all',
    'Maybe/Not sure': 'maybe',
    'Yes, I observed': 'observed',
    'Yes, I experienced': 'experienced',
    'Yes, I think it would': 'yes_would',
    "No, I don't think it would": 'no_wouldnt',
    'Yes, it has': 'yes has',
    'No, it has not': 'no hasnt',
    'Yes, I think they would': 'yes would',
    "No, I don't think they would": 'no wouldnt',
    'No, they do not': 'no dont',
    'Yes, they do': 'yes do',
    'N/A (not currently aware)': 'not_currently_aware',
    'Not applicable to me (I do not have a mental illness)': 'no mental illness'
    }, 
    inplace=True)
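
As a rough illustration of how a flat dict passed to `DataFrame.replace` behaves, here is a sketch on a toy frame. The frame `toy` and its values are invented for the example; only the three mapping entries are taken from the cell above.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the survey, values are made up.
toy = pd.DataFrame({
    'benefits': ["I don't know", 'Yes', 'Not eligible for coverage / N/A'],
    'care_options': ['I am not sure', 'No', 'Yes'],
})

# A flat dict maps whole cell values to replacements in every column.
# Values that do not match a key exactly are left unchanged.
toy_clean = toy.replace({
    "I don't know": 'dont_know',
    'I am not sure': 'dont_know',
    'Not eligible for coverage / N/A': 'doesnt_apply',
})
print(toy_clean)
```

Because the match is on the whole cell value, near-duplicate phrasings such as "I am not sure" and "I'm not sure" each need their own key, which is why the mapping above lists both.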


#### If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:

In [50]:
v = preview_col

In [51]:
# x = df16[df16['mental_health_consequence'].isnull()].copy()
# print x.shape
# x.isnull().sum(axis=1).values
# x = x.dropna(axis=1)
# print x.shape
# print x.columns
# x.tail(3)

In [52]:
obj_cols = df16.columns[df16.dtypes==object]
print len(obj_cols)
long_cols = ['phys_health_interview_comment', 
             'mental_health_interview_comment',
             'professional_diagnoses',
             'position',
             'gender',
             'diagnosed_conditions',
             'believed_conditions'
            ]
# Skip the free-text and multi-select columns, then one-hot encode the rest,
# previewing each column's value counts along the way.
obj_cols = obj_cols.drop(long_cols)
for col in obj_cols:
    prefix_dummies(df16, col, True)
    v(df16, col)
    print

investigate = [
    'friends_family',
    'family_history',
    'mental_vs_physical'
]

55
num_employees
How many employees does your company or organization have?
26-100      292
NaN         287
1000+       256
100-500     248
6-25        210
500-1000     80
1-5          60
Name: num_employees, dtype: int64

benefits
Does your employer provide mental health benefits as part of healthcare coverage?
yes             531
dont_know       319
NaN             287
no              213
doesnt_apply     83
Name: benefits, dtype: int64

care_options
Do you know the options for mental health care available under your employer-provided coverage?
NaN          420
no           354
dont_know    352
yes          307
Name: care_options, dtype: int64

wellness_program
Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
no           813
NaN          287
yes          230
dont_know    103
Name: wellness_program, dtype: int64

seek_help
Does your employer offer resources to learn more about mental health concerns and options for seeking help?

## Exporting Cleaned Data


In [53]:
v(df16, 'reluctant_due_to_obs')

reluctant_due_to_obs
Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?
NaN      776
yes      246
no       234
maybe    177
Name: reluctant_due_to_obs, dtype: int64


In [54]:
df = pd.concat(dummy_dfs.values(), axis=1)
df = pd.concat([df16, df], axis=1)
print df.shape

(1433, 482)
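
The column count grows because every dummy frame is appended side by side to the original columns. A small sketch of the same two-step `pd.concat` on a toy frame follows; `toy` and `toy_dummy_dfs` are hypothetical stand-ins for `df16` and `dummy_dfs`.

```python
import pandas as pd

# Hypothetical stand-in for df16 with two categorical columns.
toy = pd.DataFrame({'remote_work': ['always', 'never'],
                    'benefits': ['yes', 'no']})

# Hypothetical stand-in for dummy_dfs: one dummy frame per column.
toy_dummy_dfs = {col: pd.get_dummies(toy[col], prefix=col) for col in toy.columns}

# axis=1 aligns the frames on the row index and stacks their columns.
wide = pd.concat(list(toy_dummy_dfs.values()), axis=1)
combined = pd.concat([toy, wide], axis=1)
print(combined.shape)
```

The combined frame keeps the original categorical columns alongside their dummies, so a model trained on it should select only the columns it needs.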


In [55]:
df.head(2)

Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,work_in_country_united_arab_emirates,work_in_country_united_kingdom,work_in_country_united_states_of_america,work_in_country_venezuela,work_in_country_vietnam,work_in_country_nan,phys_health_consequence_maybe,phys_health_consequence_no,phys_health_consequence_yes,phys_health_consequence_nan
0,0,26-100,1.0,,doesnt_apply,,no,no,dont_know,very_easy,...,0,1,0,0,0,0,0,1,0,0
1,0,6-25,1.0,,no,yes,yes,yes,yes,somewhat_easy,...,0,0,1,0,0,0,0,1,0,0


In [56]:
df.to_csv(path_or_buf="./datasets/2016/clean-mental-health-in-tech-2016.csv")

In [57]:
df16.to_csv(path_or_buf="./datasets/2016/clean-no-dummies-2016.csv")
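
By default `to_csv` writes the row index as an unnamed first column, so a later `read_csv` shows it as `Unnamed: 0` unless `index_col=0` is passed. A minimal round-trip sketch, using an in-memory buffer and a made-up `toy` frame rather than the real files (Python 3):

```python
import io
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({'age': [39.0, 29.0],
                    'remote_work': ['sometimes', 'never']})

buf = io.StringIO()
toy.to_csv(buf)   # the index is written as an unnamed first column
buf.seek(0)

# index_col=0 restores that first column as the index instead of
# loading it as a data column named 'Unnamed: 0'.
reloaded = pd.read_csv(buf, index_col=0)
print(list(reloaded.columns))
```

Passing `index=False` to `to_csv` is the other common option when the index carries no information.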

In [58]:
df16.head()

Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,work_interfere_untreated,age,gender,live_in_country,live_in_state,work_in_country,work_in_state,position,remote_work,gender_category
0,0,26-100,1.0,,doesnt_apply,,no,no,dont_know,very_easy,...,doesnt_apply,39.0,Male,united_kingdom,,united_kingdom,,Back-end Developer,sometimes,male
1,0,6-25,1.0,,no,yes,yes,yes,yes,somewhat_easy,...,sometimes,29.0,male,united_states,illinois,united_states_of_america,illinois,Back-end Developer|Front-end Developer,never,male
2,0,6-25,1.0,,no,,no,no,dont_know,neither_easy_nor_difficult,...,doesnt_apply,38.0,Male,united_kingdom,,united_kingdom,,Back-end Developer,always,male
3,1,,,,,,,,,,...,sometimes,43.0,male,united_kingdom,,united_kingdom,,Supervisor/Team Lead,sometimes,male
4,0,6-25,0.0,1.0,yes,yes,no,no,no,neither_easy_nor_difficult,...,sometimes,43.0,Female,united_states,illinois,united_states_of_america,illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,sometimes,female


In [59]:
df16

Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,work_interfere_untreated,age,gender,live_in_country,live_in_state,work_in_country,work_in_state,position,remote_work,gender_category
0,0,26-100,1.0,,doesnt_apply,,no,no,dont_know,very_easy,...,doesnt_apply,39.0,Male,united_kingdom,,united_kingdom,,Back-end Developer,sometimes,male
1,0,6-25,1.0,,no,yes,yes,yes,yes,somewhat_easy,...,sometimes,29.0,male,united_states,illinois,united_states_of_america,illinois,Back-end Developer|Front-end Developer,never,male
2,0,6-25,1.0,,no,,no,no,dont_know,neither_easy_nor_difficult,...,doesnt_apply,38.0,Male,united_kingdom,,united_kingdom,,Back-end Developer,always,male
3,1,,,,,,,,,,...,sometimes,43.0,male,united_kingdom,,united_kingdom,,Supervisor/Team Lead,sometimes,male
4,0,6-25,0.0,1.0,yes,yes,no,no,no,neither_easy_nor_difficult,...,sometimes,43.0,Female,united_states,illinois,united_states_of_america,illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,sometimes,female
5,0,1000+,1.0,,yes,dont_know,no,yes,yes,somewhat_easy,...,often,42.0,Male,united_kingdom,,united_kingdom,,DevOps/SysAdmin|Support|Back-end Developer|Fro...,sometimes,male
6,0,26-100,1.0,,dont_know,no,no,no,dont_know,somewhat_easy,...,doesnt_apply,30.0,M,united_states,tennessee,united_states_of_america,tennessee,Back-end Developer,sometimes,male
7,0,1000+,1.0,,yes,yes,no,yes,yes,very_easy,...,often,37.0,female,united_states,virginia,united_states_of_america,virginia,Dev Evangelist/Advocate|Back-end Developer,always,female
8,0,26-100,0.0,1.0,dont_know,no,no,no,dont_know,very_difficult,...,often,44.0,Female,united_states,california,united_states_of_america,california,Support|Back-end Developer|One-person shop,sometimes,female
9,1,,,,,,,,,,...,often,30.0,Male,united_states,kentucky,united_states_of_america,kentucky,One-person shop|Front-end Developer|Back-end D...,always,male
