# Mental Health in Tech Project

## Data Sets

[OSMI Survey on Mental Health in the Tech Workplace in 2014](https://www.kaggle.com/osmi/mental-health-in-tech-survey) 

["Ongoing" OSMI survey from 2016](https://www.kaggle.com/osmi/mental-health-in-tech-2016)


## Questions

## Process

In [None]:
# external tools
import pandas as pd
from numpy import nan  # used below to mark missing or invalid values

### Exploring and Cleaning 2014 Data

In [312]:
df = df14 = pd.read_csv("./datasets/2014/mental-health-in-tech-2014.csv")
print df14.shape
df14.head(3)

(1259, 27)


Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


In [313]:
df_original = pd.read_csv("./datasets/2014/osmi-mental-health-in-tech-original.csv")
print df_original.shape
# print df_original.columns   # original questions/fields

(1259, 27)


**Original fields/questions:**

<details><summary> Click to expand all original fields/questions </summary>
    
- Timestamp   
- Age  
- Gender   
- Country  
- If you live in the United States, which state or territory do you live in?  
- Are you self-employed?  
- Do you have a family history of mental illness?  
- Have you sought treatment for a mental health condition?  
- If you have a mental health condition, do you feel that it interferes with your work?  
- How many employees does your company or organization have?  
- Do you work remotely (outside of an office) at least 50% of the time?  
- Is your employer primarily a tech company/organization?  
- Does your employer provide mental health benefits?  
- Do you know the options for mental health care your employer provides?  
- Has your employer ever discussed mental health as part of an employee wellness program?  
- Does your employer provide resources to learn more about mental health issues and how to seek help?  
- Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?  
- How easy is it for you to take medical leave for a mental health condition?  
- Do you think that discussing a mental health issue with your employer would have negative consequences?  
- Do you think that discussing a physical health issue with your employer would have negative consequences?  
- Would you be willing to discuss a mental health issue with your coworkers?  
- Would you be willing to discuss a mental health issue with your direct supervisor(s)?  
- Would you bring up a mental health issue with a potential employer in an interview?  
- Would you bring up a physical health issue with a potential employer in an interview?  
- Do you feel that your employer takes mental health as seriously as physical health?  
- Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?  
- Any additional notes or comments
</details>

In [315]:
# standardize columns to have lowercase names
df14.rename(columns={'Age': 'age', 'Gender': 'gender', 'Country': 'country', 'Timestamp': 'timestamp'}, inplace=True)

In [317]:
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


In [319]:
# create a dictionary for my reference, 
#    to look up questions based on column names
column_names = df14.columns
questions = df_original.columns
# pair every column name with its question, including the last one ('comments')
col_question_map = dict(zip(column_names, questions))

# for example:
col_question_map['mental_vs_physical']

'Do you feel that your employer takes mental health as seriously as physical health?'

#### Dummy variables from gender responses

Gender responses appear to be free-text strings entered by respondents. To create a more manageable set of variables, I examine all the gender responses and categorize them into `female`, `male`, or `other` based on my judgement.  

Note that trans men and trans women map to the `male` and `female` categories, respectively (this is also flagged in a code comment). Gender identity can have an impact on feelings of being stigmatized, so this might be a decision to explore further. 

In [320]:
print df14.gender.unique()

['Female' 'M' 'Male' 'male' 'female' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'All' 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Agender'
 'cis-female/femme' 'Guy (-ish) ^_^' 'male leaning androgynous' 'Male '
 'Man' 'Trans woman' 'msle' 'Neuter' 'Female (trans)' 'queer'
 'Female (cis)' 'Mail' 'cis male' 'A little about you' 'Malr' 'p' 'femail'
 'Cis Man' 'ostensibly male, unsure what that really means']


In [321]:
# categorize gender responses into male, female, other based on response
def categorize_gender(gender_response):
    gender_response = gender_response.strip().lower()
    
    # caution - removing data about gender identity that may 
    # be correlated with mental health or feelings of being stigmatized
    male_responses = set(['male', 'm', 'man', 'cis male', 'male (cis)',
                          'trans-male', 'male (trans)', 'cis man',
                          'mal', 'mail', 'maile', 'make', 'msle', 'malr'])
    
    # caution - removing data about gender identity that may 
    # be correlated with mental health or feelings of being stigmatized
    female_responses = set(['female', 'f', 'woman', 'female (cis)', 
                            'trans-female', 'trans woman', 'female (trans)', 
                            'cis-female/femme', 'cis female', 
                            'femake', 'femail'])
    if gender_response in male_responses:
        return 'male'
    elif gender_response in female_responses:
        return 'female'
    else:
        return 'other'

# values mapped to 'other' for this data set:
#        ['male-ish', 'something kinda male?',
#        'queer/she/they', 'non-binary', 'nah', 'all', 'enby', 'fluid',
#        'genderqueer', 'androgyne', 'agender', 'guy (-ish) ^_^',
#        'male leaning androgynous', 'neuter', 'queer',
#        'a little about you', 'p',
#        'ostensibly male, unsure what that really means']
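
One way to explore that decision further is to keep gender identity in its own indicator column rather than discarding it. A minimal sketch, assuming a plain substring match on `'trans'` is adequate for this data set; the helper name `flag_trans_responses` and the sample values are hypothetical and not part of the analysis above:

```python
import pandas as pd

def flag_trans_responses(responses):
    # hypothetical helper: True where the free-text response mentions 'trans'
    # (assumption: a plain substring match is adequate for these responses)
    cleaned = responses.str.strip().str.lower()
    return cleaned.str.contains('trans', regex=False, na=False)

# small made-up sample in the style of the survey responses
sample_genders = pd.Series(['Female', 'Trans woman', 'M', 'Female (trans)'])
trans_flags = flag_trans_responses(sample_genders)
```

A flag like this could sit next to the `male`/`female`/`other` dummies, so the coarse categories stay usable without losing the identity information.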

In [322]:
categorized_gender_responses = df14.gender.map(categorize_gender)

In [323]:
categorized_gender_responses.unique()

array(['female', 'male', 'other'], dtype=object)

In [324]:
gender_dummies = pd.get_dummies(categorized_gender_responses, prefix='gender')
# df14 = pd.concat([df14, gender_dummies], axis=1)
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


#### Handle invalid ages

Ages range from `-1726` to `99999999999`.  This is a survey of working adults, so values under 16 or over 100 are treated as invalid and replaced with `nan`.

In [325]:
ages = df14.age.unique()
ages.sort()
print ages

[      -1726         -29          -1           5           8          11
          18          19          20          21          22          23
          24          25          26          27          28          29
          30          31          32          33          34          35
          36          37          38          39          40          41
          42          43          44          45          46          47
          48          49          50          51          53          54
          55          56          57          58          60          61
          62          65          72         329 99999999999]


In [326]:
df14.age = df14.age.map(lambda x: nan if (x < 16 or x > 100) else x)

In [327]:
ages = df14.age.unique()
ages.sort()
print ages

[ 18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.  30.  31.  32.
  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.  46.  47.
  48.  49.  50.  51.  53.  54.  55.  56.  57.  58.  60.  61.  62.  65.  72.
  nan]
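
The same filter can also be written without a lambda. A minimal sketch, assuming the same bounds of 16 and 100; `mask_invalid_ages` and the sample values are hypothetical names used only for illustration:

```python
import pandas as pd

def mask_invalid_ages(ages, low=16, high=100):
    # keep values inside [low, high]; everything else becomes NaN
    return ages.where(ages.between(low, high))

# made-up sample that mixes plausible ages with obvious junk values
sample_ages = pd.Series([-1726, 5, 18, 37, 72, 329, 99999999999])
cleaned_ages = mask_invalid_ages(sample_ages)
```

`Series.between` is inclusive on both ends by default, so ages of exactly 16 or 100 would be kept, matching the lambda version.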


In [328]:
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44.0,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32.0,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


#### Convert yes/no responses to categories

Many of the survey questions are yes/no questions, but the data is currently stored as strings. 

In [339]:
df14.self_employed.unique()

array([ nan,   1.,   0.])

In [340]:
def yes_no_nan(response):
    if response=='Yes' or response==1:
        return 1
    elif response=='No' or response==0:
        return 0
    else:
        return nan

In [341]:
df14.self_employed = df14.self_employed.map(yes_no_nan)
df14.self_employed.unique()

array([ nan,   1.,   0.])

In [342]:
# given a data frame and a column name, convert that column's
#   values from 'Yes' or 1, 'No' or 0, and anything else
#   to 1, 0, and nan
def yes_no_nan_column(df, column_name):
    df[column_name] = df[column_name].map(yes_no_nan)

In [343]:
for col in df14.columns:
    col_uniq = df14[col].unique()
    if len(col_uniq) == 2:
        print col, col_uniq

family_history [0 1]
treatment [1 0]
remote_work [0 1]
tech_company [1 0]
obs_consequence [0 1]


**Note:** For columns whose only valid values are `'Yes'` and `'No'`, it's appropriate to convert them with the same `yes_no_nan` function. 

Most of the columns with three responses have `'Maybe'`, `'Don't know'`, or some other meaningful third option.  Those will be preserved with dummy variables.  

In [344]:
for col in ['family_history', 'treatment', 'remote_work', 'tech_company', 'obs_consequence']:
    yes_no_nan_column(df14, col)
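
The yes/no conversion can also be expressed as a dictionary lookup. A minimal sketch, assuming the only meaningful values are `'Yes'`, `'No'`, `1`, and `0`; the name `YES_NO_MAP` and the sample data are hypothetical:

```python
import pandas as pd

# hypothetical lookup table mirroring the yes_no_nan rules
YES_NO_MAP = {'Yes': 1, 'No': 0, 1: 1, 0: 0}

# made-up responses, including a missing value and an unexpected answer
sample_responses = pd.Series(['Yes', 'No', None, 1, 0, 'Maybe'])

# Series.map with a dict turns any value that is not a key into NaN
converted = sample_responses.map(YES_NO_MAP)
```

Because unmatched values become `NaN`, the result is stored as floats whenever a column contains anything outside the lookup table.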

In [345]:
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,0,1,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,0,
1,2014-08-27 11:29:37,44.0,M,United States,IN,,0,0,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,0,
2,2014-08-27 11:29:44,32.0,Male,Canada,,,0,0,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,0,


#### Convert yes/no/maybe responses to categories

The yes/no questions have been coded as numbers, but there are still quite a few questions with yes/no/maybe or other meaningful third choices.  These columns will need dummy variables.

In [439]:
three_opt_cols = [col for col in df14.columns if len(df[col].unique()) == 3]

for col in three_opt_cols:
    print col, df[col].unique()

self_employed [ nan   1.   0.]
benefits ['yes' 'dontknow' 'no']
care_options ['notsure' 'no' 'yes']
wellness_program ['no' 'dontknow' 'yes']
seek_help ['yes' 'dontknow' 'no']
anonymity ['yes' 'dontknow' 'no']
mental_health_consequence ['no' 'maybe' 'yes']
phys_health_consequence ['no' 'yes' 'maybe']
coworkers ['some' 'no' 'yes']
supervisor ['yes' 'no' 'some']
mental_health_interview ['no' 'yes' 'maybe']
phys_health_interview ['maybe' 'no' 'yes']
mental_vs_physical ['yes' 'dontknow' 'no']


Other than `self_employed`, the columns listed above will use dummy variables.

#### Benefits

In [428]:
print col_question_map['benefits']
df = df14.copy()
print df['benefits'].unique()

Does your employer provide mental health benefits?
['Yes' "Don't know" 'No']


In [429]:
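def format_y_n_other(df, col_name, other_dict):
    # normalize one column in place: 'Yes'/'No' become lowercase, and
    #   other_dict maps each remaining raw response to its short label
    def format_response(response):
        if response in ['Yes', 'No', 'yes', 'no']:
            return response.lower()
        elif response in other_dict:
            return other_dict[response]
        elif response in other_dict.values():
            # already formatted, e.g. when the cell is re-run
            return response
        # anything else (including nan) falls through and becomes None
    df[col_name] = df[col_name].map(format_response)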

In [430]:
format_y_n_other(df, 'benefits', { "Don't know": 'dontknow' })

In [431]:
df.benefits.unique()

array(['yes', 'dontknow', 'no'], dtype=object)

#### Other "Don't Know" Columns

In [441]:
dontknow_cols = ['benefits', 'wellness_program', 'seek_help', 'anonymity', 
           'mental_vs_physical']
for col in dontknow_cols:
    format_y_n_other(df, col, { "Don't know": 'dontknow' })

In [442]:
maybe_cols = ['mental_health_consequence', 'phys_health_consequence',
            'mental_health_interview', 'phys_health_interview']
for col in maybe_cols:
    format_y_n_other(df, col, { 'Maybe': 'maybe' })

           
some_cols = ['coworkers', 'supervisor']        
for col in some_cols:
    format_y_n_other(df, col, { 'Some of them': 'some' })           

format_y_n_other(df, 'care_options', { 'Not sure' : 'notsure' })


In [451]:
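# drop self_employed, which is already numeric and needs no dummy variables
three_opt_cols = [col for col in three_opt_cols if col != 'self_employed']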
for col in three_opt_cols:
    print col, df[col].unique()

benefits ['yes' 'dontknow' 'no']
care_options ['notsure' 'no' 'yes']
wellness_program ['no' 'dontknow' 'yes']
seek_help ['yes' 'dontknow' 'no']
anonymity ['yes' 'dontknow' 'no']
mental_health_consequence ['no' 'maybe' 'yes']
phys_health_consequence ['no' 'yes' 'maybe']
coworkers ['some' 'no' 'yes']
supervisor ['yes' 'no' 'some']
mental_health_interview ['no' 'yes' 'maybe']
phys_health_interview ['maybe' 'no' 'yes']
mental_vs_physical ['yes' 'dontknow' 'no']


In [452]:
dummy_dfs = {}
for col in three_opt_cols:
    dummy_dfs[col] = pd.get_dummies(df[col], prefix=col)

In [453]:
dummy_dfs.keys()

['wellness_program',
 'benefits',
 'seek_help',
 'phys_health_interview',
 'mental_vs_physical',
 'care_options',
 'coworkers',
 'mental_health_consequence',
 'anonymity',
 'supervisor',
 'mental_health_interview',
 'phys_health_consequence']

In [455]:
dummy_dfs['benefits'].head(3)

Unnamed: 0,benefits_dontknow,benefits_no,benefits_yes
0,0,0,1
1,1,0,0
2,0,1,0


**At this point**, the `dummy_dfs` dictionary holds dummy variable sets for all three-option columns. 
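
A dictionary of dummy sets like this can later be stitched into a single wide table. A minimal sketch on made-up data, assuming every dummy set shares the same row index; `sample`, `sample_dummies`, and `combined` are hypothetical names:

```python
import pandas as pd

# made-up sample with two already-formatted three-option columns
sample = pd.DataFrame({
    'benefits': ['yes', 'dontknow', 'no'],
    'coworkers': ['some', 'no', 'yes'],
})

sample_cols = ['benefits', 'coworkers']
sample_dummies = {}
for col in sample_cols:
    sample_dummies[col] = pd.get_dummies(sample[col], prefix=col)

# align the dummy sets on the row index and place them side by side
combined = pd.concat([sample_dummies[col] for col in sample_cols], axis=1)
```

Iterating over an explicit column list, rather than the dictionary itself, keeps the column order predictable.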

#### Other Categorical Variables... continued

In [459]:
df.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,0,1,Often,6-25,...,Somewhat easy,no,no,some,yes,no,maybe,yes,0,
1,2014-08-27 11:29:37,44.0,M,United States,IN,,0,0,Rarely,More than 1000,...,Don't know,maybe,no,no,no,no,no,dontknow,0,
2,2014-08-27 11:29:44,32.0,Male,Canada,,,0,0,Rarely,6-25,...,Somewhat difficult,no,no,yes,yes,yes,yes,no,0,


In [497]:
print col_question_map['work_interfere']
print df['work_interfere'].unique()

If you have a mental health condition, do you feel that it interferes with your work?
['Often' 'Rarely' 'Never' 'Sometimes' nan]


In [514]:
df['work_interfere'] = df['work_interfere'].str.lower()

In [515]:
df['work_interfere'].unique()

array(['often', 'rarely', 'never', 'sometimes', nan], dtype=object)

In [518]:
dummy_dfs['work_interfere'] = pd.get_dummies(df.work_interfere, prefix='work_interfere')
dummy_dfs['work_interfere'].head(3)

Unnamed: 0,work_interfere_never,work_interfere_often,work_interfere_rarely,work_interfere_sometimes
0,0,1,0,0
1,0,0,1,0
2,0,0,1,0
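
Since `work_interfere` contains missing responses, note that `pd.get_dummies` leaves those rows as all zeros by default. A minimal sketch on made-up data of how `dummy_na=True` adds an explicit indicator for them instead; whether a missing answer deserves its own category here is an open assumption:

```python
import numpy as np
import pandas as pd

# made-up responses with one missing answer
sample_interfere = pd.Series(['often', 'rarely', np.nan, 'never'])

# default: the missing row gets zeros in every dummy column
without_nan = pd.get_dummies(sample_interfere, prefix='work_interfere')

# dummy_na=True: an extra column marks the missing row
with_nan = pd.get_dummies(sample_interfere, prefix='work_interfere', dummy_na=True)
```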


In [519]:
df.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37.0,female,united states,IL,,0,1,often,6-25,...,Somewhat easy,no,no,some,yes,no,maybe,yes,0,
1,2014-08-27 11:29:37,44.0,m,united states,IN,,0,0,rarely,More than 1000,...,Don't know,maybe,no,no,no,no,no,dontknow,0,
2,2014-08-27 11:29:44,32.0,male,canada,,,0,0,rarely,6-25,...,Somewhat difficult,no,no,yes,yes,yes,yes,no,0,


In [520]:
df2 = df14.copy()

In [521]:
df2.str.lower()

AttributeError: 'DataFrame' object has no attribute 'str'
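
The error arises because the `.str` accessor is defined on a pandas `Series`, not on a whole `DataFrame`. A minimal sketch of lowercasing one text column at a time, assuming text columns contain only strings and missing values; `lowercase_text_columns` and the sample frame are hypothetical:

```python
import pandas as pd

def lowercase_text_columns(frame):
    # hypothetical helper: return a copy with every text column lowercased
    result = frame.copy()
    for col in result.columns:
        column = result[col]
        # assumption: object/string columns hold only strings or missing values
        if column.dtype == object or pd.api.types.is_string_dtype(column):
            result[col] = column.str.lower()
    return result

# made-up sample with one text column and one numeric column
sample_frame = pd.DataFrame({
    'country': ['United States', 'Canada', None],
    'age': [37, 44, 32],
})
lowered = lowercase_text_columns(sample_frame)
```

Numeric columns are left untouched, and the original frame is not modified because the helper works on a copy.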

In [524]:
df.country.head(3)

0    united states
1    united states
2           canada
Name: country, dtype: object