# Mental Health in Tech Project

## Data Sets

[OSMI Survey on Mental Health in the Tech Workplace in 2014](https://www.kaggle.com/osmi/mental-health-in-tech-survey) 

["Ongoing" OSMI survey from 2016](https://www.kaggle.com/osmi/mental-health-in-tech-2016)


## Questions

## Process

In [None]:
# external tools
import pandas as pd

### Exploring and Cleaning 2014 Data

In [571]:
df = df14 = pd.read_csv("./datasets/2014/mental-health-in-tech-2014.csv")
print df14.shape
df14.head(3)

(1259, 27)


Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


In [572]:
df_original = pd.read_csv("./datasets/2014/osmi-mental-health-in-tech-original.csv")
print df_original.shape
# print df_original.columns   # original questions/fields

(1259, 27)


**Original fields/questions:**

<details><summary> Click to expand all original fields/questions </summary>
    
- Timestamp   
- Age  
- Gender   
- Country  
- If you live in the United States, which state or territory do you live in?  
- Are you self-employed?  
- Do you have a family history of mental illness?  
- Have you sought treatment for a mental health condition?  
- If you have a mental health condition, do you feel that it interferes with your work?  
- How many employees does your company or organization have?  
- Do you work remotely (outside of an office) at least 50% of the time?  
- Is your employer primarily a tech company/organization?  
- Does your employer provide mental health benefits?  
- Do you know the options for mental health care your employer provides?  
- Has your employer ever discussed mental health as part of an employee wellness program?  
- Does your employer provide resources to learn more about mental health issues and how to seek help?  
- Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?  
- How easy is it for you to take medical leave for a mental health condition?  
- Do you think that discussing a mental health issue with your employer would have negative consequences?  
- Do you think that discussing a physical health issue with your employer would have negative consequences?  
- Would you be willing to discuss a mental health issue with your coworkers?  
- Would you be willing to discuss a mental health issue with your direct supervisor(s)?  
- Would you bring up a mental health issue with a potential employer in an interview?  
- Would you bring up a physical health issue with a potential employer in an interview?  
- Do you feel that your employer takes mental health as seriously as physical health?  
- Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?  
- Any additional notes or comments
</details>

In [573]:
# standardize columns to have lowercase names
df14.rename(columns={'Age': 'age', 'Gender': 'gender', 'Country': 'country', 'Timestamp': 'timestamp'}, inplace=True)

In [574]:
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


In [575]:
# create a dictionary for my reference, 
#    to look up questions based on column names
column_names = df14.columns
questions = df_original.columns
col_question_map = { 
    column_names[i]: questions[i] for i in range(df_original.shape[1]-1) 
}

# for example:
col_question_map['mental_vs_physical']

'Do you feel that your employer takes mental health as seriously as physical health?'

#### Dummy variables from gender responses

Gender responses seem to be strings entered by the user. To create a more manageable set of variables, I examine all the gender responses and categorize them into `female`, `male`, or `other` based on my judgement.  

Note (and it's noted in a comment in the code) that trans men and women map to `male` and `female` categories, respectively. Gender identity can have an impact on feelings of being stigmatized, so this might be a decision to explore further. 

In [576]:
print df14.gender.unique()

['Female' 'M' 'Male' 'male' 'female' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'All' 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Agender'
 'cis-female/femme' 'Guy (-ish) ^_^' 'male leaning androgynous' 'Male '
 'Man' 'Trans woman' 'msle' 'Neuter' 'Female (trans)' 'queer'
 'Female (cis)' 'Mail' 'cis male' 'A little about you' 'Malr' 'p' 'femail'
 'Cis Man' 'ostensibly male, unsure what that really means']


In [577]:
# categorize gender responses into male, female, other based on response
def categorize_gender(gender_response):
    gender_response = gender_response.strip().lower()
    
    # caution - removing data about gender identity that may 
        # be correlated with mental health or feelings of being stigmatized
    male_responses = set(['male', 'm', 'man', 'cis male', 'male (cis)',
                          'trans-female', 'male (trans)', 'cis man',
                          'mal', 'mail', 'maile', 'make', 'msle', 'malr'])
    
    # caution - removing data about gender identity that may 
        # be correlated with mental health or feelings of being stigmatized
    female_responses = set(['female', 'f', 'woman', 'female (cis)', 
                            'trans-female', 'trans woman', 'female (trans)', 
                            'cis-female/femme', 'cis female', 
                            'femake', 'femail'])
    if gender_response in male_responses:
        return 'male'
    elif gender_response in female_responses:
        return 'female'
    else:
        return 'other'

# values mapped to 'other' for this data set:
#        ['male-ish', 'something kinda male?',
#        'queer/she/they', 'non-binary', 'nah', 'all', 'enby', 'fluid',
#        'genderqueer', 'androgyne', 'agender', 'guy (-ish) ^_^',
#        'male leaning androgynous', 'neuter', 'queer',
#        'a little about you', 'p',
#        'ostensibly male, unsure what that really means']

In [578]:
categorized_gender_responses = df14.gender.map(categorize_gender)

In [579]:
categorized_gender_responses.unique()

array(['female', 'male', 'other'], dtype=object)

In [580]:
gender_dummies = pd.get_dummies(categorized_gender_responses, prefix='gender')
# df14 = pd.concat([df14, gender_dummies], axis=1)
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


#### Handle invalid ages

Ages range from `-1726` to `99999999999`.  This is supposed to be a survey of working adults, so values under 16 and over 100 can be ignored.

In [581]:
ages = df14.age.unique()
ages.sort()
print ages

[      -1726         -29          -1           5           8          11
          18          19          20          21          22          23
          24          25          26          27          28          29
          30          31          32          33          34          35
          36          37          38          39          40          41
          42          43          44          45          46          47
          48          49          50          51          53          54
          55          56          57          58          60          61
          62          65          72         329 99999999999]


In [582]:
df14.age = df.age.map(lambda x: nan if (x<16 or x>100) else x)

In [583]:
ages = df14.age.unique()
ages.sort()
print ages

[ 18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.  30.  31.  32.
  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.  46.  47.
  48.  49.  50.  51.  53.  54.  55.  56.  57.  58.  60.  61.  62.  65.  72.
  nan]


In [584]:
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44.0,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32.0,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,


#### Convert yes/no responses to categories. 

Many of the survey questions are yes/no questions, but the data is currently stored as strings. 

In [585]:
df14.self_employed.unique()

array([nan, 'Yes', 'No'], dtype=object)

In [586]:
def yes_no_nan(response):
    if response=='Yes' or response==1:
        return 1
    elif response=='No' or response==0:
        return 0
    else:
        return nan

In [587]:
df14.self_employed = df14.self_employed.map(yes_no_nan)
df14.self_employed.unique()

array([ nan,   1.,   0.])

In [588]:
# given a data frame and a column name, convert that column's
#   values from 'Yes' or 1, 'No' or 0, and anything else
#   to 1, 0, and nan
def yes_no_nan_column(df, column_name):
    df[column_name] = df[column_name].map(yes_no_nan)

In [589]:
for col in df14.columns:
    col_uniq = df[col].unique()
    if len(col_uniq) == 2:
        print col, col_uniq

family_history ['No' 'Yes']
treatment ['Yes' 'No']
remote_work ['No' 'Yes']
tech_company ['Yes' 'No']
obs_consequence ['No' 'Yes']


**Note:** For columns that only have valid values `'Yes'` and `'No'`, it's appropriate to convert using the same `yes_no_nan` function(s). 

Most of the columns with three reponses have `'Maybe'`, `'Don't know'`, or some other meaningful third option.  Those will be preserved with dummy variables.  

In [590]:
for col in ['family_history', 'treatment', 'remote_work', 'tech_company', 'obs_consequence']:
    yes_no_nan_column(df14, col)

In [591]:
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,0,1,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,0,
1,2014-08-27 11:29:37,44.0,M,United States,IN,,0,0,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,0,
2,2014-08-27 11:29:44,32.0,Male,Canada,,,0,0,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,0,


#### Convert yes/no/maybe responses to categories. 

The yes/no questions have been coded as numbers, but there are still quite a few questions with yes/no/maybe or other meaningful third choices.  These columns will need dummy variables.

In [592]:
three_opt_cols = [col for col in df14.columns if len(df[col].unique()) == 3]

for col in three_opt_cols:
    print col, df[col].unique()

self_employed [ nan   1.   0.]
benefits ['Yes' "Don't know" 'No']
care_options ['Not sure' 'No' 'Yes']
wellness_program ['No' "Don't know" 'Yes']
seek_help ['Yes' "Don't know" 'No']
anonymity ['Yes' "Don't know" 'No']
mental_health_consequence ['No' 'Maybe' 'Yes']
phys_health_consequence ['No' 'Yes' 'Maybe']
coworkers ['Some of them' 'No' 'Yes']
supervisor ['Yes' 'No' 'Some of them']
mental_health_interview ['No' 'Yes' 'Maybe']
phys_health_interview ['Maybe' 'No' 'Yes']
mental_vs_physical ['Yes' "Don't know" 'No']


Other than `self_employed`, the columns listed above will use dummy variables.

#### Care Options

In [593]:
print col_question_map['care_options']
print df14['care_options'].unique()

Do you know the options for mental health care your employer provides?
['Not sure' 'No' 'Yes']


In [594]:
def format_y_n_other(df, col_name, other_dict):
    def format_response(response):
        if response in ['Yes', 'No', 'yes', 'no']:
            return response.lower()
        elif response in other_dict:
            return other_dict[response]
        elif response in other_dict.values():
            return response
    df[col_name] = df[col_name].map(format_response)

In [595]:
format_y_n_other(df14, 'care_options', { 'Not sure' : 'notsure' })

In [596]:
df14.care_options.unique()

array(['notsure', 'no', 'yes'], dtype=object)

#### Other "Three Option" Columns

In [597]:
dontknow_cols = ['benefits', 'wellness_program', 'seek_help', 'anonymity', 
           'mental_vs_physical']
for col in dontknow_cols:
    format_y_n_other(df14, col, { "Don't know": 'dontknow' })
    
maybe_cols = ['mental_health_consequence', 'phys_health_consequence',
            'mental_health_interview', 'phys_health_interview']
for col in maybe_cols:
    format_y_n_other(df14, col, { 'Maybe': 'maybe' })

           
some_cols = ['coworkers', 'supervisor']        
for col in some_cols:
    format_y_n_other(df14, col, { 'Some of them': 'some' })           


In [598]:
# three_opt_cols.pop(0)  # remove self_employed
for col in three_opt_cols:
    print col, df14[col].unique()

self_employed [ nan   1.   0.]
benefits ['yes' 'dontknow' 'no']
care_options ['notsure' 'no' 'yes']
wellness_program ['no' 'dontknow' 'yes']
seek_help ['yes' 'dontknow' 'no']
anonymity ['yes' 'dontknow' 'no']
mental_health_consequence ['no' 'maybe' 'yes']
phys_health_consequence ['no' 'yes' 'maybe']
coworkers ['some' 'no' 'yes']
supervisor ['yes' 'no' 'some']
mental_health_interview ['no' 'yes' 'maybe']
phys_health_interview ['maybe' 'no' 'yes']
mental_vs_physical ['yes' 'dontknow' 'no']


In [599]:
dummy_dfs = {}
for col in three_opt_cols:
    dummy_dfs[col] = pd.get_dummies(df14[col], prefix=col)

In [600]:
dummy_dfs.keys()

['wellness_program',
 'benefits',
 'seek_help',
 'self_employed',
 'mental_vs_physical',
 'care_options',
 'phys_health_interview',
 'coworkers',
 'mental_health_consequence',
 'anonymity',
 'supervisor',
 'mental_health_interview',
 'phys_health_consequence']

In [601]:
dummy_dfs['benefits'].head(3)

Unnamed: 0,benefits_dontknow,benefits_no,benefits_yes
0,0,0,1
1,1,0,0
2,0,1,0


**At this point**, the `dummy_dfs` dictionary holds dummy variable sets for all three-option columns. 

#### Other categorical variables... continued

In [602]:
# df14.head(3)

In [603]:
print col_question_map['work_interfere']
print df14['work_interfere'].unique()

If you have a mental health condition, do you feel that it interferes with your work?
['Often' 'Rarely' 'Never' 'Sometimes' nan]


In [604]:
df14['work_interfere'] = df['work_interfere'].str.lower()

In [605]:
df14['work_interfere'].unique()

array(['often', 'rarely', 'never', 'sometimes', nan], dtype=object)

In [606]:
dummy_dfs['work_interfere'] = pd.get_dummies(df14.work_interfere, prefix='work_interfere')
dummy_dfs['work_interfere'].head(3)

Unnamed: 0,work_interfere_never,work_interfere_often,work_interfere_rarely,work_interfere_sometimes
0,0,1,0,0
1,0,0,1,0
2,0,0,1,0


In [607]:
df14.head(3)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,0,1,often,6-25,...,Somewhat easy,no,no,some,yes,no,maybe,yes,0,
1,2014-08-27 11:29:37,44.0,M,United States,IN,,0,0,rarely,More than 1000,...,Don't know,maybe,no,no,no,no,no,dontknow,0,
2,2014-08-27 11:29:44,32.0,Male,Canada,,,0,0,rarely,6-25,...,Somewhat difficult,no,no,yes,yes,yes,yes,no,0,


In [608]:
df = df14.copy()

In [610]:
df.country = df.country.str.lower()

In [611]:
df.country.unique()

array(['united states', 'canada', 'united kingdom', 'bulgaria', 'france',
       'portugal', 'netherlands', 'switzerland', 'poland', 'australia',
       'germany', 'russia', 'mexico', 'brazil', 'slovenia', 'costa rica',
       'austria', 'ireland', 'india', 'south africa', 'italy', 'sweden',
       'colombia', 'latvia', 'romania', 'belgium', 'new zealand',
       'zimbabwe', 'spain', 'finland', 'uruguay', 'israel',
       'bosnia and herzegovina', 'hungary', 'singapore', 'japan',
       'nigeria', 'croatia', 'norway', 'thailand', 'denmark',
       'bahamas, the', 'greece', 'moldova', 'georgia', 'china',
       'czech republic', 'philippines'], dtype=object)

## This is the point at which I learned about [`pandas.DataFrame.replace`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html).

In [622]:
# better with regex?!
df.replace(
    to_replace = { 
        'country': {
            'united states': 'united_states',
            'united kingdom': 'united_kingdom',
            'costa rica':'costa_rica',
            'south africa': 'south_africa',
            'new zealand': 'new_zealand',
            'bosnia and herzegovina':'bosnia_and_herzegovina',
            'bahamas, the': 'the_bahamas',
            'czech republic':'czech_republic'
        }
    },
    inplace=True
)

In [623]:
dummy_dfs['country'] = pd.get_dummies(df.country, prefix='c')
dummy_dfs['country'].head(3)

Unnamed: 0,c_australia,c_austria,c_belgium,c_bosnia_and_herzegovina,c_brazil,c_bulgaria,c_canada,c_china,c_colombia,c_costa_rica,...,c_south_africa,c_spain,c_sweden,c_switzerland,c_thailand,c_the_bahamas,c_united_kingdom,c_united_states,c_uruguay,c_zimbabwe
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [639]:
dummy_dfs['state'] = pd.get_dummies(df.state)
dummy_dfs['state'].head(3)
# baseline is nan states

Unnamed: 0,AL,AZ,CA,CO,CT,DC,FL,GA,IA,ID,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [624]:
df.columns

Index([u'timestamp', u'age', u'gender', u'country', u'state', u'self_employed',
       u'family_history', u'treatment', u'work_interfere', u'no_employees',
       u'remote_work', u'tech_company', u'benefits', u'care_options',
       u'wellness_program', u'seek_help', u'anonymity', u'leave',
       u'mental_health_consequence', u'phys_health_consequence', u'coworkers',
       u'supervisor', u'mental_health_interview', u'phys_health_interview',
       u'mental_vs_physical', u'obs_consequence', u'comments'],
      dtype='object')

In [654]:
col_question_map['no_employees']

'How many employees does your company or organization have?'

In [655]:
df.no_employees.unique()

array(['6-25', 'More than 1000', '26-100', '100-500', '1-5', '500-1000'], dtype=object)

In [657]:
df.no_employees = df.no_employees.map(lambda x: '1000+' if x=='More than 1000' else x)

In [658]:
df.no_employees.unique()

array(['6-25', '1000+', '26-100', '100-500', '1-5', '500-1000'], dtype=object)

In [660]:
dummy_dfs['no_employees'] = pd.get_dummies(df.no_employees, prefix='num_employees')
dummy_dfs['no_employees'].head(3)

Unnamed: 0,num_employees_1-5,num_employees_100-500,num_employees_1000+,num_employees_26-100,num_employees_500-1000,num_employees_6-25
0,0,0,0,0,0,1
1,0,0,1,0,0,0
2,0,0,0,0,0,1


## A Model


In [662]:
df2 = df.copy()

In [666]:
col_question_map['mental_health_consequence']

'Do you think that discussing a mental health issue with your employer would have negative consequences?'

In [670]:
col_question_map

{'age': 'Age',
 'anonymity': 'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?',
 'benefits': 'Does your employer provide mental health benefits?',
 'care_options': 'Do you know the options for mental health care your employer provides?',
 'country': 'Country',
 'coworkers': 'Would you be willing to discuss a mental health issue with your coworkers?',
 'family_history': 'Do you have a family history of mental illness?',
 'gender': 'Gender',
 'leave': 'How easy is it for you to take medical leave for a mental health condition?',
 'mental_health_consequence': 'Do you think that discussing a mental health issue with your employer would have negative consequences?',
 'mental_health_interview': 'Would you bring up a mental health issue with a potential employer in an interview?',
 'mental_vs_physical': 'Do you feel that your employer takes mental health as seriously as physical health?',
 'no_employees': 'How many employees

In [716]:
# X = df2.copy()
# = df2['mental_health_consequence']
# X = df2['age'].copy()
# X = X.dropna()

In [723]:
y = dummy_dfs['mental_health_consequence']

In [717]:
from sklearn import model_selection

trainX, testX, trainY, testY = sklearn.model_selection.train_test_split(X, y, train_size=.6, stratify=y)

ValueError: Found input variables with inconsistent numbers of samples: [1251, 1259]

In [718]:
from sklearn import linear_model 
model = linear_model.LogisticRegression().fit(trainX, trainY)



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').