## ID : 816000325
## Name: Ajay Sieunarine
## Email: ajay.sieunarine@my.uwi.edu

In this assignment I will use the provided survey to examine how companies treat employees who have mental health issues. The questions I want to answer are:

1. Are companies giving these employees enough attention?
2. Are companies offering them benefits?
3. Are these employees being treated fairly?
4. Are they being given adequate recovery time (if needed)?
5. Are they receiving medical treatment?

In [26]:
import pandas as pd # library to help with files and dataframes
from matplotlib import pyplot as plt # package to help with plotting points 
import seaborn as sns # package to help with graphs and plots for data vis
import numpy as np # library to help with data structs, np.array etc
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

In [27]:
df = pd.read_csv('survey.csv')

In [28]:
df.head() # check to see if the file is read into a dataframe

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [29]:
df = df.drop([ # drop the unnecessary cols
    'Timestamp', 
    'comments', 'no_employees'
], axis=1) 

In [30]:
df.head()

Unnamed: 0,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,remote_work,tech_company,...,wellness_program,seek_help,anonymity,leave,mental_health_consequence,coworkers,supervisor,mental_health_interview,mental_vs_physical,obs_consequence
0,37,Female,United States,IL,,No,Yes,Often,No,Yes,...,No,Yes,Yes,Somewhat easy,No,Some of them,Yes,No,Yes,No
1,44,M,United States,IN,,No,No,Rarely,No,No,...,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,Don't know,No
2,32,Male,Canada,,,No,No,Rarely,No,Yes,...,No,No,Don't know,Somewhat difficult,No,Yes,Yes,Yes,No,No
3,31,Male,United Kingdom,,,Yes,Yes,Often,No,Yes,...,No,No,No,Somewhat difficult,Yes,Some of them,No,Maybe,No,Yes
4,31,Male,United States,TX,,No,No,Never,Yes,Yes,...,Don't know,Don't know,Don't know,Don't know,No,Some of them,Yes,Yes,Don't know,No


In [31]:
df.dtypes

Age                           int64
Gender                       object
Country                      object
state                        object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
coworkers                    object
supervisor                   object
mental_health_interview      object
mental_vs_physical           object
obs_consequence              object
dtype: object

## I will now attempt to sanitize the data
- Apply LabelEncoders to the relevant cols
- Change the data type of cols (if needed)

## Changing data types of some cols

In [32]:
# give existing cols their appropriate types:
# every column becomes a string, then Age is cast back to an integer
df = df.astype(str)
df['Age'] = df['Age'].astype('int64')

In [33]:
df.dtypes

Age                           int64
Gender                       object
Country                      object
state                        object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
coworkers                    object
supervisor                   object
mental_health_interview      object
mental_vs_physical           object
obs_consequence              object
dtype: object

## Encoding Gender Column

In [34]:
gender_encoder = LabelEncoder()
gender_encoder.fit(df['Gender'])
gender_encoder.classes_

array(['A little about you', 'Agender', 'All', 'Androgyne', 'Cis Female',
       'Cis Male', 'Cis Man', 'Enby', 'F', 'Femake', 'Female', 'Female ',
       'Female (cis)', 'Female (trans)', 'Genderqueer', 'Guy (-ish) ^_^',
       'M', 'Mail', 'Make', 'Mal', 'Male', 'Male ', 'Male (CIS)',
       'Male-ish', 'Malr', 'Man', 'Nah', 'Neuter', 'Trans woman',
       'Trans-female', 'Woman', 'cis male', 'cis-female/femme', 'f',
       'femail', 'female', 'fluid', 'm', 'maile', 'male',
       'male leaning androgynous', 'msle', 'non-binary',
       'ostensibly male, unsure what that really means', 'p', 'queer',
       'queer/she/they', 'something kinda male?', 'woman'], dtype=object)

We can see that respondents entered many free-text variants, including misspellings. I am mainly interested in male and female respondents, so I will classify all remaining answers as 'Other'.

In [35]:
# add an encoded col, labeled 'sex', based on the first letter of Gender
df['sex'] = df['Gender'].apply(
    lambda x: 0 if x[0] in ('m', 'M')
    else 1 if x[0] in ('f', 'F')
    else 2
)
df[['Gender', 'sex']].head()

Unnamed: 0,Gender,sex
0,Female,1
1,M,0
2,Male,0
3,Male,0
4,Male,0


The apply step is working, so genders are now represented as follows:

- 0 -> Male
- 1 -> Female
- 2 -> Other
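
One caveat with a first-letter rule: answers such as 'Woman', 'Cis Female' and 'Cis Male' fall into 'Other', while 'fluid' is counted as Female. A minimal sketch of an alternative, assuming an explicit lookup of the spellings listed in `classes_` above; the name `encode_sex` and the way the spellings are grouped are assumptions, not part of the original analysis:

```python
# Hypothetical groupings built from the spellings seen in classes_ above.
MALE = {
    'm', 'male', 'man', 'mal', 'malr', 'mail', 'maile', 'make', 'msle',
    'cis male', 'cis man', 'male (cis)',
}
FEMALE = {
    'f', 'female', 'woman', 'femake', 'femail',
    'cis female', 'female (cis)',
}


def encode_sex(raw):
    """Return 0 for male, 1 for female, 2 for anything else."""
    # Normalise case and surrounding whitespace before the lookup.
    key = str(raw).strip().lower()
    if key in MALE:
        return 0
    if key in FEMALE:
        return 1
    return 2


print(encode_sex('Woman'), encode_sex('Cis Male'), encode_sex('fluid'))
```

It could be applied with `df['Gender'].apply(encode_sex)`. Which answers belong in each group is a judgment call, so the sets should be reviewed before use.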

Now I will drop the original Gender column.

In [36]:
del df['Gender']
df.head()

Unnamed: 0,Age,Country,state,self_employed,family_history,treatment,work_interfere,remote_work,tech_company,benefits,...,seek_help,anonymity,leave,mental_health_consequence,coworkers,supervisor,mental_health_interview,mental_vs_physical,obs_consequence,sex
0,37,United States,IL,,No,Yes,Often,No,Yes,Yes,...,Yes,Yes,Somewhat easy,No,Some of them,Yes,No,Yes,No,1
1,44,United States,IN,,No,No,Rarely,No,No,Don't know,...,Don't know,Don't know,Don't know,Maybe,No,No,No,Don't know,No,0
2,32,Canada,,,No,No,Rarely,No,Yes,No,...,No,Don't know,Somewhat difficult,No,Yes,Yes,Yes,No,No,0
3,31,United Kingdom,,,Yes,Yes,Often,No,Yes,No,...,No,No,Somewhat difficult,Yes,Some of them,No,Maybe,No,Yes,0
4,31,United States,TX,,No,No,Never,Yes,Yes,Yes,...,Don't know,Don't know,Don't know,No,Some of them,Yes,Yes,Don't know,No,0


## Encoding some other cols:

In [37]:
# se_encoder = LabelEncoder()
# se_encoder.fit(df['self_employed'])
# # se_encoder.classes_

# replace the self_employed column with an encoded col, labeled 'SelfEmployed'
df['SelfEmployed'] = df['self_employed'].apply(
    lambda x: 
    2 if pd.isnull(x)
    else 1 if x[0] == 'y' or x[0] == 'Y'
    else 0
)
del df['self_employed']

df.head()

Unnamed: 0,Age,Country,state,family_history,treatment,work_interfere,remote_work,tech_company,benefits,care_options,...,anonymity,leave,mental_health_consequence,coworkers,supervisor,mental_health_interview,mental_vs_physical,obs_consequence,sex,SelfEmployed
0,37,United States,IL,No,Yes,Often,No,Yes,Yes,Not sure,...,Yes,Somewhat easy,No,Some of them,Yes,No,Yes,No,1,0
1,44,United States,IN,No,No,Rarely,No,No,Don't know,No,...,Don't know,Don't know,Maybe,No,No,No,Don't know,No,0,0
2,32,Canada,,No,No,Rarely,No,Yes,No,No,...,Don't know,Somewhat difficult,No,Yes,Yes,Yes,No,No,0,0
3,31,United Kingdom,,Yes,Yes,Often,No,Yes,No,Yes,...,No,Somewhat difficult,Yes,Some of them,No,Maybe,No,Yes,0,0
4,31,United States,TX,No,No,Never,Yes,Yes,Yes,No,...,Don't know,Don't know,No,Some of them,Yes,Yes,Don't know,No,0,0
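
Note that the first five rows had no `self_employed` answer in the original data, yet they are encoded as 0 here rather than 2. A likely cause is that the earlier `df.astype(str)` stored missing values as the text 'nan', so the `pd.isnull` check does not match them. A minimal sketch of an encoder that treats both real missing values and the 'nan' text as missing; the name `encode_self_employed` is hypothetical:

```python
import math


def encode_self_employed(value):
    """Return 1 for yes, 0 for no, 2 for a missing answer."""
    # Real missing values: None or a float NaN.
    if value is None:
        return 2
    if isinstance(value, float) and math.isnan(value):
        return 2
    # Missing values that were already turned into text.
    text = str(value).strip().lower()
    if text in ('', 'nan'):
        return 2
    return 1 if text.startswith('y') else 0


print([encode_self_employed(v) for v in ['Yes', 'No', 'nan', float('nan')]])
```

It could be used as `df['self_employed'].apply(encode_self_employed)`. Whether missing answers should get their own code or be merged with 'No' is a separate modelling decision.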
