# Mental Health in Tech Project

## Data Sets

[OSMI Survey on Mental Health in the Tech Workplace in 2014](https://www.kaggle.com/osmi/mental-health-in-tech-survey) 

["Ongoing" OSMI survey from 2016](https://data.world/kittybot/osmi-mental-health-tech-2016)


## Questions

What factors are most significant in influencing whether or not a person believes disclosing a mental health issue would have negative consequences?

Can we predict, based on publicly available features of a person and company, whether that person is likely to believe disclosing a mental health issue would be harmful to their career?

## Exploring and Cleaning 2014 Data

See `cleaning.ipynb` for the full cleaning steps.

In [1]:
import pandas as pd

In [2]:
df14 = pd.read_csv("./datasets/2014/mental-health-in-tech-2014.csv")
print(df14.shape)
# df14.head(3)

(1259, 27)


In [3]:
# standardize columns to have lowercase names
df14.rename(columns={'Age': 'age', 'Gender': 'gender', 'Country': 'country', 'Timestamp': 'timestamp'}, inplace=True)
# replace confusing no_employees column name
df14.rename(columns={'no_employees': 'num_employees'}, inplace=True)


In [4]:
df_original = pd.read_csv("./datasets/2014/osmi-mental-health-in-tech-original.csv")
print(df_original.shape)
# print df_original.columns   # original questions/fields

(1259, 27)


<details><summary> Click to expand all **original questions/fields** </summary>
    
- Timestamp   
- Age  
- Gender   
- Country  
- If you live in the United States, which state or territory do you live in?  
- Are you self-employed?  
- Do you have a family history of mental illness?  
- Have you sought treatment for a mental health condition?  
- If you have a mental health condition, do you feel that it interferes with your work?  
- How many employees does your company or organization have?  
- Do you work remotely (outside of an office) at least 50% of the time?  
- Is your employer primarily a tech company/organization?  
- Does your employer provide mental health benefits?  
- Do you know the options for mental health care your employer provides?  
- Has your employer ever discussed mental health as part of an employee wellness program?  
- Does your employer provide resources to learn more about mental health issues and how to seek help?  
- Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?  
- How easy is it for you to take medical leave for a mental health condition?  
- Do you think that discussing a mental health issue with your employer would have negative consequences?  
- Do you think that discussing a physical health issue with your employer would have negative consequences?  
- Would you be willing to discuss a mental health issue with your coworkers?  
- Would you be willing to discuss a mental health issue with your direct supervisor(s)?  
- Would you bring up a mental health issue with a potential employer in an interview?  
- Would you bring up a physical health issue with a potential employer in an interview?  
- Do you feel that your employer takes mental health as seriously as physical health?  
- Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?  
- Any additional notes or comments
</details>

In [5]:
# create reference to look up questions based on column names
column_names = df14.columns
questions = df_original.columns
col_question_map = { 
    column_names[i]: questions[i] for i in range(df_original.shape[1]-1) 
}

# for example:
col_question_map['mental_vs_physical']

'Do you feel that your employer takes mental health as seriously as physical health?'
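
The lookup above pairs the two column lists by position. A minimal sketch of the same idea using `zip`, with short made-up stand-in lists (`short_names`, `question_texts`, and `question_lookup` are hypothetical names, not part of the notebook):

```python
# Hypothetical stand-ins for df14.columns and df_original.columns
short_names = ["age", "gender", "mental_vs_physical"]
question_texts = [
    "Age",
    "Gender",
    "Do you feel that your employer takes mental health as seriously as physical health?",
]

# zip pairs the two sequences by position; the length check guards against misalignment
assert len(short_names) == len(question_texts)
question_lookup = dict(zip(short_names, question_texts))

print(question_lookup["mental_vs_physical"])
```

Note that the cell above stops one column short of the end (`range(df_original.shape[1]-1)`), so the last column is left out of `col_question_map`, whereas this sketch maps every entry.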

### Load Cleaned Data


In [6]:
df = pd.read_csv("./datasets/2014/clean-mental-health-in-tech-2014.csv", index_col=0)
print(df.shape)

(1259, 185)


In [7]:
# quick NaN check
counts = df.count()
numrows = df.shape[0]
for col in df.columns:
    if counts[col] != numrows:
        print("{0} has {1} NaNs".format(col, numrows-counts[col]))

age has 8 NaNs
state has 515 NaNs
self_employed has 18 NaNs
work_interfere has 264 NaNs
comments has 1095 NaNs
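
The loop above compares each column's non-null count with the row count. The same check can probably be written more compactly with `isnull().sum()`. The sketch below runs on a small made-up frame (`toy` and its values are assumptions, not the survey data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned survey data (values are made up)
toy = pd.DataFrame({
    "age": [37.0, np.nan, 44.0],
    "state": ["IL", None, None],
    "treatment": [1, 0, 1],
})

# isnull() marks missing cells; sum() counts them per column
nan_counts = toy.isnull().sum()
# keep only the columns that actually have missing values
nan_counts = nan_counts[nan_counts > 0]
print(nan_counts)
```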


In [8]:
df.head(2)

|   | timestamp | age | gender | country | state | self_employed | family_history | treatment | work_interfere | num_employees | ... | phys_health_consequence_no | phys_health_consequence_yes | leave_dont_know | leave_somewhat_difficult | leave_somewhat_easy | leave_very_difficult | leave_very_easy | gender_category_female | gender_category_male | gender_category_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37.0 | Female | United States | IL |  | no | 1 | often | 6-25 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2014-08-27 11:29:37 | 44.0 | M | United States | IN |  | no | 0 | rarely | 1000+ | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |


## A Model


In [9]:
from sklearn import tree
dtree = tree.DecisionTreeClassifier()

In [10]:
col_question_map['mental_health_consequence']

'Do you think that discussing a mental health issue with your employer would have negative consequences?'

In [11]:
tdf = df.dropna(subset=['self_employed', 'age'])
tdf = tdf[tdf['country'] == 'United States']
print(tdf.shape)
total_y = tdf['mental_health_consequence']

(735, 185)


In [12]:
# col_question_map

In [77]:
# information not explicitly related to mental health
employer_cols = ['self_employed', 'tech_company', 'num_employees',] #'country',]
person_cols = [ 'gender_category'] #, 'age']
private_cols = ['treatment', 'family_history', 'obs_consequence']
X_cols = employer_cols+person_cols+private_cols

In [78]:
total_X = tdf[X_cols].copy()
total_X.head(2)

|   | self_employed | tech_company | num_employees | gender_category | treatment | family_history | obs_consequence |
|---|---|---|---|---|---|---|---|
| 18 | yes | 1 | 1-5 | male | 0 | yes | 1 |
| 20 | no | 1 | 100-500 | male | 1 | yes | 0 |


In [79]:
mhc_labels = total_y.copy()  # mental_health_consequence labels
mhc_labels.replace(to_replace={'no': 0, 'maybe': 1, 'yes': 2}, inplace=True)
mhc_labels.head()

18    0
20    1
21    0
22    1
23    1
Name: mental_health_consequence, dtype: int64

In [80]:
# countries = total_X['country'].unique()
# country_dict = {countries[i]: i for i in range(len(countries))}
# print country_dict

In [81]:
employee_nums = total_X['num_employees'].unique()
employee_num_dict = {employee_nums[i]: i for i in range(len(employee_nums))}
print(employee_num_dict)

{'26-100': 2, '1-5': 0, '6-25': 4, '100-500': 1, '500-1000': 5, '1000+': 3}
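
The dictionary above numbers the size buckets in order of first appearance, so the codes do not follow company size (`'6-25'` maps to 4 while `'1000+'` maps to 3). A tree can still split on such codes, but if an ordinal encoding is wanted, one option is to list the buckets from smallest to largest. This is only a sketch: `size_order`, `ordered_employee_dict`, and the sample column are hypothetical, and the bucket labels are copied from the output above.

```python
import pandas as pd

# Bucket labels copied from the printed dictionary, listed smallest to largest
size_order = ["1-5", "6-25", "26-100", "100-500", "500-1000", "1000+"]
ordered_employee_dict = {label: rank for rank, label in enumerate(size_order)}

# Made-up sample column to show the mapping in use
sizes = pd.Series(["6-25", "1000+", "1-5"])
encoded = sizes.map(ordered_employee_dict)
print(encoded.tolist())
```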


In [82]:
totX = total_X.replace({
    'self_employed': {'yes': 2, 'no': 0},
    'gender_category': {'female': 0, 'other': 1, 'male': 2},
#     'country': country_dict,
    'num_employees': employee_num_dict,
    'family_history':  {'yes': 2, 'no': 0},
    'obs_consequence': {1: 2}
})

In [83]:
totX

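|   | self_employed | tech_company | num_employees | gender_category | treatment | family_history | obs_consequence |
|---|---|---|---|---|---|---|---|
| 18 | 2 | 1 | 0 | 2 | 0 | 2 | 2 |
| 20 | 0 | 1 | 1 | 2 | 1 | 2 | 0 |
| 21 | 2 | 1 | 0 | 2 | 0 | 0 | 0 |
| 22 | 0 | 1 | 2 | 2 | 1 | 0 | 0 |
| 23 | 0 | 0 | 3 | 2 | 1 | 0 | 0 |
| 24 | 0 | 1 | 2 | 2 | 1 | 2 | 0 |
| 25 | 0 | 0 | 3 | 2 | 1 | 2 | 0 |
| 26 | 0 | 1 | 0 | 2 | 0 | 0 | 0 |
| 27 | 0 | 1 | 4 | 0 | 1 | 2 | 0 |
| 28 | 0 | 1 | 2 | 2 | 1 | 0 | 0 |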


In [84]:
# quick NaN check
# counts = totX.count()
# numrows = totX.shape[0]
# for col in totX.columns:
#     if counts[col] != numrows:
#         print "{0} has {1} NaNs".format(col, numrows-counts[col])
totX.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 735 entries, 18 to 1258
Data columns (total 7 columns):
self_employed      735 non-null int64
tech_company       735 non-null int64
num_employees      735 non-null int64
gender_category    735 non-null int64
treatment          735 non-null int64
family_history     735 non-null int64
obs_consequence    735 non-null int64
dtypes: int64(7)
memory usage: 45.9 KB


In [85]:
dtree.fit(totX, total_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
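
The tree above is fit on every row, which leaves no unseen data to score it on. One common way to get a rough accuracy estimate is to hold out part of the rows. The sketch below does this on synthetic data: `X_demo`, `y_demo`, and `demo_tree` are made-up stand-ins for `totX`, `total_y`, and `dtree`, so the printed number says nothing about the survey.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for totX / total_y: 7 small-integer features, 3 classes
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 3, size=(200, 7))
y_demo = rng.choice(["no", "maybe", "yes"], size=200)

# Hold out 25% of rows; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_tree = DecisionTreeClassifier(random_state=0)
demo_tree.fit(X_train, y_train)

# score() reports mean accuracy on the held-out rows
test_accuracy = demo_tree.score(X_test, y_test)
print(test_accuracy)
```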

In [86]:
# from IPython import display
# from sklearn import datasets, tree, utils
# from sklearn.externals.six import StringIO  
# import pydotplus
# from sklearn import tree


# clf = tree.DecisionTreeClassifier()
# clf.fit(totX, total_y)

# tree_clf = dtree

# # Generate a plot of the decision tree
# dot_data = StringIO() 
# tree.export_graphviz(tree_clf, 
#                      class_names = total_y.tolist(),
#                      feature_names = totX.columns.tolist(), out_file=dot_data, 
#                     filled=True, ) 
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 

# display.Image(graph.create_png())


## Random Forest


In [87]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 20)
    
model.fit(totX, total_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [88]:

# # This is all that's needed for scikit-plot
# import matplotlib.pyplot as plt
# from scikitplot import classifier_factory

# classifier_factory(model)

# model.plot_roc_curve(totX.as_matrix(), total_y, random_state=2)
# plt.show()

In [89]:
features = totX.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values(by='Importance Score', inplace=True, ascending=False)

print(features_df)

          Features  Importance Score
2    num_employees          0.425356
3  gender_category          0.118824
5   family_history          0.100246
6  obs_consequence          0.099158
4        treatment          0.098956
1     tech_company          0.097150
0    self_employed          0.060310
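
Because the forest above is built with `random_state=None`, the importance scores will likely shift from run to run. A minimal sketch of making them repeatable by fixing the seed, again on synthetic data (`X_demo`, `y_demo`, and the `feat_*` names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are made up for illustration
rng = np.random.RandomState(1)
X_demo = pd.DataFrame(
    rng.randint(0, 3, size=(150, 3)), columns=["feat_a", "feat_b", "feat_c"]
)
y_demo = rng.choice(["no", "maybe", "yes"], size=150)

# Fixing random_state makes the forest, and so its importances, repeatable
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each score with its feature name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```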


In [90]:
# from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

# scores = cross_val_score(model, totX, total_y, scoring='roc_auc')
# print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

# for n_trees in range(1, 100, 10):
#     model = RandomForestClassifier(n_estimators = n_trees)
#     scores = cross_val_score(model, totX, total_y, scoring='roc_auc')
#     print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))
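
The commented-out cell above asks for `roc_auc` scoring. As far as I know, the plain `roc_auc` scorer is meant for binary targets, while `mental_health_consequence` takes three values (no, maybe, yes). A hedged sketch of the same cross-validation idea using accuracy instead, run on synthetic data (`X_demo` and `y_demo` are made-up stand-ins for `totX` and `total_y`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 7 small-integer features, 3 string classes
rng = np.random.RandomState(2)
X_demo = rng.randint(0, 3, size=(180, 7))
y_demo = rng.choice(["no", "maybe", "yes"], size=180)

forest = RandomForestClassifier(n_estimators=20, random_state=0)

# 3-fold cross-validation; accuracy works directly with multiclass labels
scores = cross_val_score(forest, X_demo, y_demo, cv=3, scoring="accuracy")
print("CV accuracy {}, average {}".format(scores, scores.mean()))
```

Since the labels here are random, the scores should hover around chance. On the real data the same call would give a first estimate of how well the chosen features predict the answer.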