Potential resource(s):
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

# Mental Health in Tech Project

## Data Sets

[OSMI Survey on Mental Health in the Tech Workplace in 2014](https://www.kaggle.com/osmi/mental-health-in-tech-survey) 

["Ongoing" OSMI survey from 2016](https://data.world/kittybot/osmi-mental-health-tech-2016)


## Questions

What factors are most significant in influencing whether or not a person believes disclosing a mental health issue would have negative consequences?

Can we predict, based on publicly available features of a person and company, whether that person is likely to believe disclosing a mental health issue would be harmful for their career?

## Exploring and Cleaning 2014 Data

In [2]:
import pandas as pd

In [71]:
df14 = pd.read_csv("./datasets/2014/clean-no-dummies-2014.csv", index_col=0)
df14['year'] = '2014'
print(df14.shape)
df14.head(3)

(1259, 29)


Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,num_employees,...,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,gender_category,year
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,0,1,often,6-25,...,no,some_of_them,yes,no,maybe,yes,0,,female,2014
1,2014-08-27 11:29:37,44.0,M,United States,IN,,0,0,rarely,1000+,...,no,no,no,no,no,dont_know,0,,male,2014
2,2014-08-27 11:29:44,32.0,Male,Canada,,,0,0,rarely,6-25,...,no,yes,yes,yes,yes,no,0,,male,2014


In [73]:
df16 = pd.read_csv("./datasets/2016/clean-no-dummies-2016.csv", index_col=0)
df16['year'] = '2016'
print(df16.shape)
df16.head(3)

(1433, 65)


Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,age,gender,live_in_country,live_in_state,work_in_country,work_in_state,position,remote_work,gender_category,year
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,39.0,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes,male,2016
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,29.0,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never,male,2016
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,38.0,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always,male,2016


In [78]:
# print df14.columns
# print df16.columns
colset14 = set(df14.columns)
colset16 = set(df16.columns)
print(colset14 - colset16)  # columns only in the 2014 data
print(colset16 - colset14)  # columns only in the 2016 data

set(['timestamp', 'state', 'work_interfere', 'comments', 'country'])
set(['hurt_career', 'revealed_contacts', 'prev_coworkers', 'prev_mental_health_consequence', 'mental_health_interview_comment', 'reluctant_due_to_obs', 'insurance', 'prev_seek_help', 'work_in_state', 'past_disorder', 'prev_employer', 'professional_diagnosed', 'prev_benefits', 'friends_family', 'diagnosed_conditions', 'prev_supervisor', 'revealed_coworkers', 'professional_diagnoses', 'prev_phys_health_consequence', 'tech_role', 'live_in_state', 'viewed_negatively', 'work_interfere_untreated', 'percent_time_impacted', 'prev_wellness_program', 'know_resources', 'believed_conditions', 'productivity_impacted', 'revealed_contacts_consequence', 'prev_anonymity', 'current_disorder', 'obs_negative_response', 'work_interfere_treated', 'revealed_coworkers_consequence', 'prev_mental_vs_physical', 'prev_obs_consequence', 'phys_health_interview_comment', 'live_in_country', 'prev_care_options', 'work_in_country', 'position'])
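The two set differences above list the columns unique to each survey year. As a minimal sketch on toy frames (the frames and the helper name `compare_columns` are illustrative assumptions, not part of the notebook), the same comparison can also report the shared columns, which are the ones that line up directly when the years are combined:

```python
import pandas as pd

# Toy stand-ins for df14 and df16; the column choices are illustrative only.
toy14 = pd.DataFrame({"age": [37], "country": ["United States"], "year": ["2014"]})
toy16 = pd.DataFrame({"age": [39], "work_in_country": ["United Kingdom"], "year": ["2016"]})


def compare_columns(left, right):
    """Return (only_left, only_right, shared) as sorted lists of column names."""
    lcols, rcols = set(left.columns), set(right.columns)
    return sorted(lcols - rcols), sorted(rcols - lcols), sorted(lcols & rcols)


only14, only16, shared = compare_columns(toy14, toy16)
print(only14)
print(only16)
print(shared)
```

Sorting the results keeps the printed lists stable between runs, which plain `set` output does not guarantee.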


##### Can I combine lived in and worked in locations?

In [79]:
cdf = df16[df16['live_in_country']!=df16['work_in_country']]
cdf[['live_in_country', 'work_in_country']].shape[0]

26

In [80]:
sdf = df16[df16['live_in_state'].notnull()]
sdf[sdf['live_in_state'] != sdf['work_in_state']].shape[0]

44
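One thing to keep in mind with these comparisons: in pandas, `NaN != NaN` evaluates to `True`, so a row where both locations are missing, or where only one is missing, counts as a mismatch. The state check above filters on `live_in_state` being present but not on `work_in_state`. The following is a sketch on made-up data, not a re-analysis of the survey, so it says nothing about whether the counts of 26 and 44 are affected:

```python
import numpy as np
import pandas as pd

# Made-up rows covering the four cases: equal, both missing, different, one missing.
toy = pd.DataFrame({
    "live_in_state": ["Illinois", np.nan, "Ohio", "Texas"],
    "work_in_state": ["Illinois", np.nan, "Indiana", np.nan],
})

# Plain != treats NaN as different from everything, including another NaN.
naive_mismatch = toy["live_in_state"] != toy["work_in_state"]

# Count a mismatch only when both values are present and actually differ.
both_present = toy["live_in_state"].notnull() & toy["work_in_state"].notnull()
true_mismatch = both_present & naive_mismatch

print(naive_mismatch.sum())
print(true_mismatch.sum())
```

Only the Ohio/Indiana row is a genuine disagreement; the naive comparison also flags the two rows with missing values.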

#### NaN Check


In [82]:
# print '2014 data'
# counts = df14.count()
# numrows = df14.shape[0]
# for col in df14.columns:
#     if counts[col] != numrows:
#         print "{0} has {1} NaNs".format(col, numrows-counts[col])
        
# print '\n2016 data'
# counts = df16.count()
# numrows = df16.shape[0]
# for col in df16.columns:
#     if counts[col] != numrows:
#         print "{0} has {1} NaNs".format(col, numrows-counts[col])
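The commented-out loops above derive NaN counts by subtracting `count()` from the number of rows. A more compact alternative, sketched here on a toy frame (the helper name `nan_report` is hypothetical), is to sum the boolean mask from `isnull()` and keep only the columns with at least one missing value:

```python
import numpy as np
import pandas as pd


def nan_report(frame):
    """Return a Series of NaN counts for the columns that contain any NaN."""
    counts = frame.isnull().sum()
    return counts[counts > 0]


# Toy data standing in for df14 / df16.
toy = pd.DataFrame({
    "age": [37.0, np.nan, 32.0],
    "state": ["IL", np.nan, np.nan],
    "treatment": [1, 0, 0],
})

report = nan_report(toy)
print(report)
```

Calling `nan_report(df14)` and `nan_report(df16)` would presumably give the same information as the two loops, one Series per year.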

In [83]:
df16.head(1)

Unnamed: 0,self_employed,num_employees,tech_company,tech_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,age,gender,live_in_country,live_in_state,work_in_country,work_in_state,position,remote_work,gender_category,year
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,39.0,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes,male,2016


In [84]:
df14.head(1)

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,num_employees,...,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,gender_category,year
0,2014-08-27 11:29:31,37.0,Female,United States,IL,,0,1,often,6-25,...,no,some_of_them,yes,no,maybe,yes,0,,female,2014


In [85]:
df = pd.concat([df14, df16])
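Because the two years share only some columns, `pd.concat` fills every column that is missing from one frame with NaN for that frame's rows. A small sketch with toy frames shows the effect; `ignore_index=True` and `sort=False` are options chosen for this illustration and are not used in the cell above, which keeps each frame's original index labels:

```python
import pandas as pd

# Toy stand-ins: one shared column and one year-specific column each.
toy14 = pd.DataFrame({"age": [37.0, 44.0], "country": ["United States", "Canada"]})
toy16 = pd.DataFrame({"age": [39.0], "work_in_country": ["United Kingdom"]})

# Stack the rows; columns absent from one frame become NaN for its rows.
combined = pd.concat([toy14, toy16], ignore_index=True, sort=False)

print(combined)
print(combined.isnull().sum())
```

Without `ignore_index=True`, the combined frame would contain repeated index labels (0, 1, 0 in this toy case), which matters for any later label-based lookup.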

In [93]:
df14[df14['country']=="United States"].shape[0]

751

In [97]:
usdf16 = df16[df16['work_in_country']=="United States of America"].copy()
usdf16.rename(columns={'work_in_country': 'country'}, inplace=True)

In [98]:
usdf14 = df14[df14['country']=="United States"]

In [145]:
usdf = pd.concat([usdf14, usdf16])
usdf['country'] = "United States"
# usdf['country'].value_counts(dropna=False)
usdf.head(2)
usdf['work_interfere_treated'].value_counts()

Often                   357
Not applicable to me    241
Sometimes               221
Rarely                   27
Never                     5
Name: work_interfere_treated, dtype: int64

In [None]:
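# Per-column label counts; a row-wise DataFrame.value_counts() would drop every row here, since each survey year leaves at least one of these columns NaN.
df[['work_interfere', 'work_interfere_treated', 'work_interfere_untreated']].apply(pd.Series.value_counts)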

In [155]:
import numpy as np

def work_interfere_category(row):
    """Return 0 if any column says 'never', NaN if any says 'doesnt_apply', else 1."""
    cols = ['work_interfere', 'work_interfere_treated', 'work_interfere_untreated']
    # Note: comparisons are case-sensitive; the 2016 columns use labels such as
    # 'Never' and 'Not applicable to me', which these lowercase checks do not match.
    if any(row[c] == 'never' for c in cols):
        return 0
    elif any(row[c] == 'doesnt_apply' for c in cols):
        return np.nan
    return 1
    

In [156]:
df.apply(work_interfere_category, axis=1).value_counts(dropna=False)

1    2479
0     213
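The result above contains no NaN category, which is consistent with the 2016 labels (`Never`, `Not applicable to me`, and so on) not matching the lowercase strings the function checks for. One possible way to handle this, sketched below on made-up rows, is to normalize the labels before categorizing. The mapping `LABEL_MAP`, the helper names, and in particular the assumption that "Not applicable to me" corresponds to `doesnt_apply` are all hypothetical and would need checking against the cleaned data:

```python
import numpy as np
import pandas as pd

COLS = ["work_interfere", "work_interfere_treated", "work_interfere_untreated"]

# Hypothetical mapping from lowercased survey labels to 2014-style labels.
LABEL_MAP = {"not applicable to me": "doesnt_apply"}


def normalize_label(value):
    """Lowercase and strip a label, leaving missing values as NaN."""
    if pd.isnull(value):
        return np.nan
    key = str(value).strip().lower()
    return LABEL_MAP.get(key, key)


def interfere_category(row):
    """Same rule as the notebook's function, applied to normalized labels."""
    labels = [normalize_label(row[c]) for c in COLS]
    if "never" in labels:
        return 0
    if "doesnt_apply" in labels:
        return np.nan
    return 1


# Made-up rows: a 2014-style row, two 2016-style rows, another 2014-style row.
toy = pd.DataFrame({
    "work_interfere": ["never", np.nan, np.nan, "often"],
    "work_interfere_treated": [np.nan, "Never", "Not applicable to me", np.nan],
    "work_interfere_untreated": [np.nan, "Often", "Not applicable to me", np.nan],
})

result = toy.apply(interfere_category, axis=1)
print(result.value_counts(dropna=False))
```

As in the original function, a row with all three columns missing still falls through to 1; whether that is the intended behavior is left open here.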
dtype: int64