Potential resource(s):
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

# Mental Health in Tech Project

## Data Sets

[OSMI Survey on Mental Health in the Tech Workplace in 2014](https://www.kaggle.com/osmi/mental-health-in-tech-survey) 

["Ongoing" OSMI survey from 2016](https://data.world/kittybot/osmi-mental-health-tech-2016)


## Questions

What factors are most significant in influencing whether a person believes that disclosing a mental health issue would have negative consequences?

Can we predict, based on publicly available features of a person and their company, whether that person is likely to believe that disclosing a mental health issue would harm their career?

## Plan

In [1]:
# clean 2014 data:
#   - remove invalid ages  ( < 16, > 80)
#   - create gender categories
#   - create dictionary to map questions to column names (from original data set)
#   - add year column with all 2014s
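
The age and gender steps in this plan could look roughly like the sketch below. The sample rows, the `MALE`/`FEMALE` lookup sets, and the choice to blank out invalid ages rather than drop the rows are all assumptions for illustration, not the cleaning code actually used for these files.

```python
import pandas as pd

# Hypothetical raw responses; the real survey contains many more spellings.
raw = pd.DataFrame({
    'age': [37, 44, -1, 329, 15, 80],
    'gender': ['Female', 'M', 'male ', 'Trans woman', 'non-binary', 'f'],
})

# Assumption: ages outside 16-80 become missing values instead of dropping rows.
raw['age'] = raw['age'].where(raw['age'].between(16, 80))

# Hypothetical lookup sets for normalised (stripped, lower-cased) answers.
MALE = {'m', 'male', 'man', 'trans man', 'trans male'}
FEMALE = {'f', 'female', 'woman', 'trans woman', 'trans female'}

def gender_category(answer):
    """Map a free-text gender answer to 'male', 'female' or 'other'."""
    key = str(answer).strip().lower()
    if key in MALE:
        return 'male'
    if key in FEMALE:
        return 'female'
    return 'other'

raw['gender_category'] = raw['gender'].apply(gender_category)
raw['year'] = 2014
print(raw)
```

Anything the lookup sets do not recognise falls into `other`, so in practice the sets would need to be extended after inspecting `value_counts()` on the raw gender column.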

In [2]:
# clean 2016 data:
#   - replace full question column names with shorter titles
#   - remove invalid ages  ( < 16, > 80)
#   - create gender categories
#   - create dictionary to map questions to column names (from original data set)
#   - add year column with all 2016s
#   - split position column into dummies
#   - split physician diagnoses and diagnosed conditions into dummies
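
For the dummy-splitting steps, one possible approach is pandas' `Series.str.get_dummies`, which handles cells that hold several `|`-separated answers (the 2016 `position` column is stored that way). The three sample values and the `position_` prefix are assumptions for illustration.

```python
import pandas as pd

# Sample values in the same pipe-separated format as the 2016 position column.
positions = pd.Series([
    'Back-end Developer',
    'Back-end Developer|Front-end Developer',
    'Back-end Developer',
])

# One indicator column per distinct position; a row can have several 1s.
dummies = positions.str.get_dummies(sep='|')

# Hypothetical prefix so the new columns are easy to find later.
dummies = dummies.add_prefix('position_')
print(dummies)
```

The same call should work for the diagnosis columns if their multiple answers are also separated by `|`.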

In [3]:
# add features to 2016 to correspond to 2014:
#   - state, for when live in state matches work in state
#   - country, for when live in country matches work in country
#   - change "United States of America" to "United States" to match 2014 data
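
A minimal sketch of these derived columns follows. It assumes that mismatched rows get the value `multi` (as the derived-column descriptions later in the notebook state) and that two missing states count as a match. The helper name `merge_matching` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'live_in_country': ['United States of America', 'United Kingdom',
                        'United States of America', 'Canada'],
    'work_in_country': ['United States of America', 'United Kingdom',
                        'United States of America', 'United States of America'],
    'live_in_state': ['il', np.nan, 'ca', np.nan],
    'work_in_state': ['il', np.nan, 'ny', 'wa'],
})

def merge_matching(live, work):
    """Keep the value where both columns agree (or are both missing), else 'multi'."""
    both_missing = live.isna() & work.isna()
    same = (live == work) | both_missing
    return live.where(same, 'multi')

df['country'] = merge_matching(df['live_in_country'], df['work_in_country'])
# Align the country label with the spelling used in the 2014 data.
df['country'] = df['country'].replace({'United States of America': 'United States'})
df['state'] = merge_matching(df['live_in_state'], df['work_in_state'])
print(df[['country', 'state']])
```

The explicit `both_missing` term is needed because a comparison between two missing values evaluates to `False`.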

In [4]:
# combine 2014 and 2016 U.S. data into one data frame

In [5]:
# plot number of responses by age

In [6]:
# plot frequency of mental health consequences yes/no/maybe by age

In [7]:
# plot frequency of mental health consequence yes/no/maybe by age groups (quantiles, 4-6 groups)
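
The quantile grouping could be sketched with `pd.qcut`, which cuts the ages into bins holding roughly equal numbers of respondents. The eight sample rows and the choice of four groups are assumptions; the plotting call is left as a comment so the snippet runs without a display.

```python
import pandas as pd

# Hypothetical sample; the real data has one row per respondent.
df = pd.DataFrame({
    'age': [22, 25, 28, 31, 34, 37, 41, 48],
    'mental_health_consequence': ['no', 'maybe', 'yes', 'no',
                                  'no', 'maybe', 'yes', 'no'],
})

# Four quantile-based age groups (quartiles).
df['age_group'] = pd.qcut(df['age'], q=4)

# Count yes/no/maybe answers per age group, then convert to row-wise shares.
counts = pd.crosstab(df['age_group'], df['mental_health_consequence'])
share = counts.div(counts.sum(axis=1), axis=0)
print(share)

# In the notebook, a stacked bar chart would follow, for example:
# share.plot(kind='bar', stacked=True)
```

Using shares rather than raw counts keeps the groups comparable even when the quantile bins do not end up exactly the same size.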

In [8]:
################ Logistic Regression

In [9]:
# logistic regression 
# - combine mental health consequences (yes and maybe) vs no as boolean categorical Y to predict
# - pull out X variables that are public
# - create dummies for categorical public X variables
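
The input preparation described above might look like the following sketch. Which columns count as "public" is an assumption here (`public_cols` is a hypothetical choice), as are the sample rows. Model fitting is left out; something like scikit-learn's `LogisticRegression` could then be fitted on `X` and `y`.

```python
import pandas as pd

# Hypothetical sample rows using column names from the cleaned data.
df = pd.DataFrame({
    'mental_health_consequence': ['yes', 'no', 'maybe', 'no', None],
    'num_employees': ['6-25', '1000+', '6-25', '26-100', '6-25'],
    'tech_company': [1.0, 1.0, 0.0, 1.0, 1.0],
    'age': [37.0, 44.0, 31.0, 29.0, 52.0],
})

# Rows without an answer cannot be labelled, so drop them first.
df = df.dropna(subset=['mental_health_consequence'])

# Target: 1 if the respondent answered yes or maybe, 0 if no.
y = df['mental_health_consequence'].isin(['yes', 'maybe']).astype(int)

# Assumption: these are the publicly observable features.
public_cols = ['num_employees', 'tech_company', 'age']

# One indicator column per category of each categorical feature.
X = pd.get_dummies(df[public_cols], columns=['num_employees'])
print(X.columns.tolist())
print(y.tolist())
```

Dropping missing targets before calling `isin` matters: otherwise unanswered rows would silently be counted as "no".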

## Setup

In [10]:
import pandas as pd
import numpy as np
%matplotlib inline

## Import & Initial Data Cleaning

In [12]:
df14 = pd.read_csv('./datasets/2014/clean-no-dummies-2014.csv', index_col=0)
print(df14.shape)
df14.head(3)

(1259, 28)


Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,num_employees,...,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,gender_category
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,no,1,often,6-25,...,no,no,some_of_them,yes,no,maybe,yes,0,,female
1,2014-08-27 11:29:37,44.0,M,United States,IN,,no,0,rarely,1000+,...,maybe,no,no,no,no,no,dont_know,0,,male
2,2014-08-27 11:29:44,32.0,Male,Canada,,,no,0,rarely,6-25,...,no,no,yes,yes,yes,yes,no,0,,male


In [13]:
df16 = pd.read_csv('./datasets/2016/clean-no-dummies-2016.csv', index_col=0)
print(df16.shape)
df16.head(3)

(1433, 66)


Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,gender,live_in_country,live_in_state,work_in_country,work_in_state,position,remote_work,gender_category,state,country
0,0,26-100,1.0,,doesnt_apply,,no,no,dont_know,very_easy,...,Male,united_kingdom,,united_kingdom,,Back-end Developer,sometimes,male,,United Kingdom
1,0,6-25,1.0,,no,yes,yes,yes,yes,somewhat_easy,...,male,united_states,il,united_states,il,Back-end Developer|Front-end Developer,never,male,il,United States
2,0,6-25,1.0,,no,,no,no,dont_know,neither_easy_nor_difficult,...,Male,united_kingdom,,united_kingdom,,Back-end Developer,always,male,,United Kingdom


### Column-Question Mappings

These dictionaries make it easy to look up the survey question behind each column name.

Descriptions of derived columns can be added to the maps later.

In [14]:
# map column names to questions
df14.cq = {'age': 'Age',
 'anonymity': 'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?',
 'benefits': 'Does your employer provide mental health benefits?',
 'care_options': 'Do you know the options for mental health care your employer provides?',
 'country': 'Country',
 'coworkers': 'Would you be willing to discuss a mental health issue with your coworkers?',
 'family_history': 'Do you have a family history of mental illness?',
 'gender': 'Gender',
 'leave': 'How easy is it for you to take medical leave for a mental health condition?',
 'mental_health_consequence': 'Do you think that discussing a mental health issue with your employer would have negative consequences?',
 'mental_health_interview': 'Would you bring up a mental health issue with a potential employer in an interview?',
 'mental_vs_physical': 'Do you feel that your employer takes mental health as seriously as physical health?',
 'num_employees': 'How many employees does your company or organization have?',
 'obs_consequence': 'Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?',
 'phys_health_consequence': 'Do you think that discussing a physical health issue with your employer would have negative consequences?',
 'phys_health_interview': 'Would you bring up a physical health issue with a potential employer in an interview?',
 'remote_work': 'Do you work remotely (outside of an office) at least 50% of the time?',
 'seek_help': 'Does your employer provide resources to learn more about mental health issues and how to seek help?',
 'self_employed': 'Are you self-employed?',
 'state': 'If you live in the United States, which state or territory do you live in?',
 'supervisor': 'Would you be willing to discuss a mental health issue with your direct supervisor(s)?',
 'tech_company': 'Is your employer primarily a tech company/organization?',
 'timestamp': 'Timestamp',
 'treatment': 'Have you sought treatment for a mental health condition?',
 'wellness_program': 'Has your employer ever discussed mental health as part of an employee wellness program?',
 'work_interfere': 'If you have a mental health condition, do you feel that it interferes with your work?'}
df16.cq = {'age': 'What is your age?',
 'anonymity': 'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
 'benefits': 'Does your employer provide mental health benefits as part of healthcare coverage?',
 'care_options': 'Do you know the options for mental health care available under your employer-provided coverage?',
 'coworkers': 'Would you feel comfortable discussing a mental health disorder with your coworkers?',
 'current_believed_conditions': 'If maybe, what condition(s) do you believe you have?',
 'current_diagnosed_conditions': 'If yes, what condition(s) have you been diagnosed with?',
 'current_disorder': 'Do you currently have a mental health disorder?',
 'family_history': 'Do you have a family history of mental illness?',
 'friends_family': 'How willing would you be to share with friends and family that you have a mental illness?',
 'gender': 'What is your gender?',
 'hurt_career': 'Do you feel that being identified as a person with a mental health issue would hurt your career?',
 'insurance': 'Do you have medical coverage (private insurance or state-provided) which includes treatment of mental health issues?',
 'know_resources': 'Do you know local or online resources to seek help for a mental health disorder?',
 'leave': 'If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:',
 'live_in_country': 'What country do you live in?',
 'live_in_state': 'What US state or territory do you live in?',
 'mental_health_consequence': 'Do you think that discussing a mental health disorder with your employer would have negative consequences?',
 'mental_health_interview': 'Would you bring up a mental health issue with a potential employer in an interview?',
 'mental_health_interview_comment': 'Why or why not?',
 'mental_vs_physical': 'Do you feel that your employer takes mental health as seriously as physical health?',
 'num_employees': 'How many employees does your company or organization have?',
 'obs_consequence': 'Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?',
 'obs_negative_response': 'Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?',
 'past_disorder': 'Have you had a mental health disorder in the past?',
 'percent_time_impacted': 'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?',
 'phys_health_consequence': 'Do you think that discussing a physical health issue with your employer would have negative consequences?',
 'phys_health_interview': 'Would you be willing to bring up a physical health issue with a potential employer in an interview?',
 'phys_health_interview_comment': 'Why or why not?',
 'position': 'Which of the following best describes your work position?',
 'prev_anonymity': 'Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?',
 'prev_benefits': 'Have your previous employers provided mental health benefits?',
 'prev_care_options': 'Were you aware of the options for mental health care provided by your previous employers?',
 'prev_coworkers': 'Would you have been willing to discuss a mental health issue with your previous co-workers?',
 'prev_employer': 'Do you have previous employers?',
 'prev_mental_health_consequence': 'Do you think that discussing a mental health disorder with previous employers would have negative consequences?',
 'prev_mental_vs_physical': 'Did you feel that your previous employers took mental health as seriously as physical health?',
 'prev_obs_consequence': 'Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?',
 'prev_phys_health_consequence': 'Do you think that discussing a physical health issue with previous employers would have negative consequences?',
 'prev_seek_help': 'Did your previous employers provide resources to learn more about mental health issues and how to seek help?',
 'prev_supervisor': 'Would you have been willing to discuss a mental health issue with your direct supervisor(s)?',
 'prev_wellness_program': 'Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?',
 'productivity_impacted': 'Do you believe your productivity is ever affected by a mental health issue?',
 'professional_diagnosed': 'Have you been diagnosed with a mental health condition by a medical professional?',
 'professional_diagnoses': 'If so, what condition(s) were you diagnosed with?',
 'reluctant_due_to_obs': 'Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?',
 'remote_work': 'Do you work remotely?',
 'revealed_contacts': 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?',
 'revealed_contacts_consequence': 'If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?',
 'revealed_coworkers': 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?',
 'revealed_coworkers_consequence': 'If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?',
 'seek_help': 'Does your employer offer resources to learn more about mental health concerns and options for seeking help?',
 'self_employed': 'Are you self-employed?',
 'supervisor': 'Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?',
 'tech_company': 'Is your employer primarily a tech company/organization?',
 'tech_role': 'Is your primary role within your company related to tech/IT?',
 'treatment': 'Have you ever sought treatment for a mental health issue from a mental health professional?',
 'viewed_negatively': 'Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?',
 'wellness_program': 'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?',
 'work_in_country': 'What country do you work in?',
 'work_in_state': 'What US state or territory do you work in?',
 'work_interfere_treated': 'If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?',
 'work_interfere_untreated': 'If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?'}

#### Additional Derived Column Descriptions

In [15]:
gender_category_explanation = 'Derived column: categorized gender responses into male/female/other. Trans males and females are categorized as males and females, respectively.'
df14.cq['gender_category'] = gender_category_explanation
df16.cq['gender_category'] = gender_category_explanation

In [16]:
df16.cq['country'] = 'Derived column: a single column for when live_in_country and work_in_country match. Rows where the countries do not match have value "multi."'
df16.cq['state'] = 'Derived column: a single column for when live_in_state and work_in_state match (or are both nan). Rows where the states do not match have value "multi."'

In [17]:
print(df16.cq['live_in_country'])
print(df16.cq['work_in_country'])
print(df16.cq['country'])

What country do you live in?
What country do you work in?
Derived column: a single column for when live_in_country and work_in_country match. Rows where the countries do not match have value "multi."


### Overview function

In [18]:
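# helper function: print a column's survey question (if known) and its value counts
def preview_col(col, data):
    # data.cq maps column names to survey questions
    if col in data.cq:
        print(data.cq[col])
    # dropna=False keeps missing answers visible in the counts
    print(data[col].value_counts(dropna=False))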

In [19]:
preview_col('treatment', df14)
print('')
preview_col('treatment', df16)

Have you sought treatment for a mental health condition?
1    637
0    622
Name: treatment, dtype: int64

Have you ever sought treatment for a mental health issue from a mental health professional?
1    839
0    594
Name: treatment, dtype: int64


### Create Combined US Dataframe

In [20]:
df14['year'] = 2014
df16['year'] = 2016

In [80]:
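# keep only U.S. respondents from each year
usdf14 = df14[df14['country']=='United States']
usdf14.cq = df14.cq  # re-attach the question map; filtering does not carry it over
print(usdf14.shape)
usdf16 = df16[df16['country']=='United States']
usdf16.cq = df16.cq
print(usdf16.shape)
# stack both years; columns missing from one year are filled with NaN
usdf = pd.concat([usdf14, usdf16], axis=0)
print(usdf.shape)
# the old row labels are kept in a new 'index' column
usdf.reset_index(inplace=True)
usdf.head(3)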


(751, 29)
(837, 67)
(1588, 70)


Unnamed: 0,index,age,anonymity,benefits,care_options,comments,country,coworkers,current_believed_conditions,current_diagnosed_conditions,...,timestamp,treatment,viewed_negatively,wellness_program,work_in_country,work_in_state,work_interfere,work_interfere_treated,work_interfere_untreated,year
0,0,37.0,yes,yes,not_sure,,United States,some_of_them,,,...,2014-08-27 11:29:31,1,,no,,,often,,,2014
1,1,44.0,dont_know,dont_know,no,,United States,no,,,...,2014-08-27 11:29:37,0,,dont_know,,,rarely,,,2014
2,4,31.0,dont_know,yes,no,,United States,some_of_them,,,...,2014-08-27 11:30:22,0,,dont_know,,,never,,,2014


In [81]:
def question(col):
    q = ""
    if col in df14.cq:
        q += '{0}: {1}\n'.format(2014, df14.cq[col])
        q += '\n'
    if col in df16.cq:
        q += '{0}: {1}\n'.format(2016, df16.cq[col])
    return q
# print question('treatment')

In [82]:
cols14 = set(df14.columns)
cols16 = set(df16.columns)
cols = cols14.union(cols16)
usdf.cq = { c:question(c) for c in cols }
overlap = cols14.intersection(cols16) 
overlap

{'age',
 'anonymity',
 'benefits',
 'care_options',
 'country',
 'coworkers',
 'family_history',
 'gender',
 'gender_category',
 'leave',
 'mental_health_consequence',
 'mental_health_interview',
 'mental_vs_physical',
 'num_employees',
 'obs_consequence',
 'phys_health_consequence',
 'phys_health_interview',
 'remote_work',
 'seek_help',
 'self_employed',
 'state',
 'supervisor',
 'tech_company',
 'treatment',
 'wellness_program',
 'year'}
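
The overlap also accounts for the width of the combined frame: 29 columns from 2014 plus 67 from 2016, minus the 26 shared ones, gives the 70 columns reported by `usdf.shape`. The toy frames below (hypothetical values, real column names) sketch how `pd.concat` takes the union of columns and fills the gaps with missing values.

```python
import pandas as pd

# Two tiny stand-ins for the yearly frames; only 'age' and 'year' are shared.
a = pd.DataFrame({'age': [37.0], 'work_interfere': ['often'], 'year': [2014]})
b = pd.DataFrame({'age': [29.0], 'tech_role': [1.0], 'year': [2016]})

both = pd.concat([a, b], axis=0)
shared = set(a.columns) & set(b.columns)

# Width of the result = columns of a + columns of b - shared columns.
print(both.shape[1], a.shape[1] + b.shape[1] - len(shared))

# A column that exists in only one year is missing for the other year's rows.
print(both['tech_role'].isna().tolist())
```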

In [83]:
shouldnt_combine = ['remote_work', 'family_history', 'leave', 'coworkers']
more_work_to_combine = ['state', 'self_employed']

In [90]:
preview_col('year', usdf)


2016    837
2014    751
Name: year, dtype: int64


In [85]:
preview_col('supervisor', usdf14)

Would you be willing to discuss a mental health issue with your direct supervisor(s)?
yes             304
no              238
some_of_them    209
Name: supervisor, dtype: int64


In [86]:
preview_col('supervisor', usdf16)

Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?
yes      271
maybe    229
no       208
NaN      129
Name: supervisor, dtype: int64
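
The two previews above use different answer scales for `supervisor` (`some_of_them` in 2014, `maybe` in 2016), so a shared column name does not guarantee comparable values. A small check like the following sketch could flag such columns before they are combined. The helper name is hypothetical, and the sample lists simply copy the answers shown in the outputs above.

```python
def answers_in_one_year_only(answers_a, answers_b):
    """Return the answers that occur in only one of the two collections."""
    # Missing answers (NaN) are floats, so keeping only strings drops them.
    a = set(x for x in answers_a if isinstance(x, str))
    b = set(x for x in answers_b if isinstance(x, str))
    return a.symmetric_difference(b)

supervisor_2014 = ['yes', 'no', 'some_of_them']
supervisor_2016 = ['yes', 'maybe', 'no', float('nan')]

mismatch = answers_in_one_year_only(supervisor_2014, supervisor_2016)
print(sorted(mismatch))  # ['maybe', 'some_of_them']
```

In the notebook, the same function could be given `usdf14[col].unique()` and `usdf16[col].unique()` for each column in `overlap`; a non-empty result suggests the column needs remapping or belongs in `shouldnt_combine`.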
