=====================================================================
# <font size = 5 color='Black'><u>Employee Wellness Prediction using Logistic Regression<u> 
=====================================================================

## Abstract

Take care of your employees: A drastic thing happened and ABCXYZ123 Technical Solutions have lost one of their important employees. The company is now very concerned about the health of their employees and would want you to find that set of employees who are in need or may be in need of treatment, taking into account multiple attributes that are already stored in the database. So buckle up the wellness of your employees is in your hand.


## Problem Statement

The objective is to predict values “treatment” attribute from the given features of the Test data.

## Data Description

• Timestamp

• Age

• Gender

• Country

• state -> If you live in the United States, which state or territory do you live in?

• self_employed -> Are you self-employed?

• family_history -> Do you have a family history of mental illness?

• work_interfere -> If you have a mental health condition, do you feel that it interferes with your work?

• no_employees -> How many employees does your company or organization have?

• remote_work -> Do you work remotely (outside of an office) at least 50% of the time?

• tech_company -> Is your employer primarily a tech company/organization?

• benefits -> Does your employer provide mental health benefits?

• care_options -> Do you know the options for mental health care your employer provides?

• wellness_program -> Has your employer ever discussed mental health as part of an employee wellness program?

• seek_help -> Does your employer provide resources to learn more about mental health issues and how to seek help?

• anonymity -> Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?

• leave -> How easy is it for you to take medical leave for a mental health condition?

• mental_health_consequence -> Do you think that discussing a mental health issue with your employer would have negative consequences?

• phys_health_consequence -> Do you think that discussing a physical health issue with your employer would have negative consequences?

• coworkers -> Would you be willing to discuss a mental health issue with your coworkers?

• supervisor -> Would you be willing to discuss a mental health issue with your direct supervisor(s)?

• mental_health_interview -> Would you bring up a mental health issue with a potential employer in an interview?

• phys_health_interview -> Would you bring up a physical health issue with a potential employer in an interview?

• mental_vs_physical -> Do you feel that your employer takes mental health as seriously as physical health?

• obs_consequence -> Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

• comments -> Any additional notes or comments.

• treatment -> Does he or she really needs treatment.

=====================================================================

### Importing Necessary Libraries

In [1]:
#to hide warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
#for working with arrays
import numpy as np
#for reading data and other manipulations
import pandas as pd
#for visulaization
import matplotlib.pyplot as plt
import seaborn as sns
#for encoding categorical variables
from sklearn.preprocessing import LabelEncoder
#for scaling the data
from sklearn.preprocessing import StandardScaler
#for splitting the data into train set ans test set
from sklearn.model_selection import train_test_split
#for performing Linear regression and other regularization techniques
from sklearn.linear_model import LogisticRegression
#for model evaluation
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
#for performing Stochastic Gradient Descent Classification
from sklearn.linear_model import SGDClassifier
#for feature selection
from sklearn.feature_selection import RFE
#for kfold cross validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

### Reading data 

In [3]:
#Creating a dataframe for train data
data = pd.read_csv(r"C:/Users/shree/Desktop/bhavika ml/ClassifierDataSet/training_.csv",header = 0,delimiter = ',')
#Creating a dataframe for test data
test_data = pd.read_csv(r"C:/Users/shree/Desktop/bhavika ml/ClassifierDataSet/test.csv",header = 0,delimiter = ',')

In [4]:
#to display all the columns and intended rows on the screen
pd.set_option('display.max_columns',None)
#to check the sample data
data.head()

Unnamed: 0,S.No,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,1,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,3,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,4,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,5,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [5]:
test_data.head()

Unnamed: 0,S.No,Timestamp,Age,Gender,Country,state,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,1,2014-08-29 11:32:22,39,Male,United Kingdom,,Yes,Yes,Sometimes,1-5,Yes,Yes,No,Yes,Yes,Yes,Yes,Somewhat difficult,No,No,Yes,Yes,No,Maybe,Yes,Yes,These result may be a tad confusing so a summa...
1,2,2014-08-29 11:32:44,26,female,United States,WA,No,Yes,Sometimes,More than 1000,No,Yes,Yes,Yes,No,No,Don't know,Don't know,No,No,Some of them,Yes,No,Maybe,No,Yes,I should note one of the places my employer fa...
2,3,2014-08-29 11:33:54,23,Female,United States,IL,No,Yes,Sometimes,26-100,No,No,No,No,No,No,Don't know,Somewhat difficult,Yes,No,No,Some of them,No,Maybe,No,No,
3,4,2014-08-29 11:34:07,35,Male,Switzerland,,No,Yes,Often,More than 1000,No,Yes,Don't know,Not sure,No,No,Yes,Very easy,No,No,Some of them,Some of them,No,Maybe,No,No,
4,5,2014-08-29 11:36:38,36,Male,United States,FL,No,No,Never,1-5,Yes,Yes,Don't know,Not sure,Don't know,Don't know,Don't know,Very easy,No,No,Some of them,Some of them,No,No,Don't know,No,


### Understanding the data

In [6]:
data.shape

(1048, 28)

Data has 1048 observation and 28 columns out of which 1 column has Dependent variable and other 27 are Independent var

In [7]:
data.columns

Index(['S.No', 'Timestamp', 'Age', 'Gender', 'Country', 'state',
       'self_employed', 'family_history', 'treatment', 'work_interfere',
       'no_employees', 'remote_work', 'tech_company', 'benefits',
       'care_options', 'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')

In [8]:
data.dtypes

S.No                          int64
Timestamp                    object
Age                           int64
Gender                       object
Country                      object
state                        object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
no_employees                 object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
comments                    

we can see that most of the variables are categorical variables

In [9]:
#to get insights from data by looking at their mean,std etc.,
data.describe(include = 'all')

Unnamed: 0,S.No,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
count,1048.0,1048,1048.0,1048,1048,636,1030,1048,1048,812,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,1048,132
unique,,1035,,45,42,45,2,2,2,4,6,2,2,3,3,3,3,3,5,3,3,3,3,3,3,3,2,128
top,,2014-08-27 15:55:07,,Male,United States,CA,No,No,No,Sometimes,6-25,No,Yes,Yes,No,No,No,Don't know,Don't know,No,No,Some of them,Yes,No,Maybe,Don't know,No,* Small family business - YMMV.
freq,,2,,518,644,123,906,643,536,386,245,733,870,400,422,692,534,685,466,424,773,651,439,835,461,471,902,5
mean,524.5,,95419880.0,,,,,,,,,,,,,,,,,,,,,,,,,
std,302.675844,,3089010000.0,,,,,,,,,,,,,,,,,,,,,,,,,
min,1.0,,-1726.0,,,,,,,,,,,,,,,,,,,,,,,,,
25%,262.75,,27.0,,,,,,,,,,,,,,,,,,,,,,,,,
50%,524.5,,31.0,,,,,,,,,,,,,,,,,,,,,,,,,
75%,786.25,,36.0,,,,,,,,,,,,,,,,,,,,,,,,,


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048 entries, 0 to 1047
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   S.No                       1048 non-null   int64 
 1   Timestamp                  1048 non-null   object
 2   Age                        1048 non-null   int64 
 3   Gender                     1048 non-null   object
 4   Country                    1048 non-null   object
 5   state                      636 non-null    object
 6   self_employed              1030 non-null   object
 7   family_history             1048 non-null   object
 8   treatment                  1048 non-null   object
 9   work_interfere             812 non-null    object
 10  no_employees               1048 non-null   object
 11  remote_work                1048 non-null   object
 12  tech_company               1048 non-null   object
 13  benefits                   1048 non-null   object
 14  care_opt

As per the Non-Null Count, we can see there are a lot of missing values in some columns like state, work_interfere and comments

In [11]:
# to check for the unique values in all the variables to understand the data
#we do not want unique values for timestamp and comments because they are going to be enormous
for i in data.columns:
    if i != 'Timestamp' and i != 'comments':
        print("COLUMN: ", i.upper())
        print("")
        print(data[i].unique())
        print("")

COLUMN:  S.NO

[   1    2    3 ... 1046 1047 1048]

COLUMN:  AGE

[         37          44          32          31          33          35
          39          42          23          29          36          27
          46          41          34          30          40          38
          50          24          18          28          26          22
          19          25          45          21         -29          43
          56          60          54         329          55 99999999999
          48          20          57          58          47          62
          51          65          49       -1726           5          53
          61           8]

COLUMN:  GENDER

['Female' 'M' 'Male' 'male' 'female' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'All' 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Agender'
 'cis-female/femme

We can see that we need to encode all the categorical variables to discrete/numerical variables

And we also noticed that some values of age like 99999999999,329,-1726,-29 are totally inaccurate and we should handle them

### Data Preprocessing

# <font size = 3 color='Black'><b>1). Feature Selection</b>

We can drop the insignificant independent variableby using domain specific knowledge

In this data, Sno,Timestamp,comments,country,state does not have any significance with respect to predictiong wellness of employees

In [12]:
#we'll drop the data from the copy of dataframe to keep the original data intact
df = data.drop(['S.No', 'Timestamp','Country', 'state','comments'], axis =1)
#will apply same to the test data
df_test = test_data.drop(['S.No', 'Timestamp','Country', 'state','comments'], axis =1)

In [13]:
df.head()

Unnamed: 0,Age,Gender,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,Female,,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,44,M,,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,32,Male,,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,31,Male,,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,31,Male,,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


In [14]:
df_test.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,39,Male,Yes,Yes,Sometimes,1-5,Yes,Yes,No,Yes,Yes,Yes,Yes,Somewhat difficult,No,No,Yes,Yes,No,Maybe,Yes,Yes
1,26,female,No,Yes,Sometimes,More than 1000,No,Yes,Yes,Yes,No,No,Don't know,Don't know,No,No,Some of them,Yes,No,Maybe,No,Yes
2,23,Female,No,Yes,Sometimes,26-100,No,No,No,No,No,No,Don't know,Somewhat difficult,Yes,No,No,Some of them,No,Maybe,No,No
3,35,Male,No,Yes,Often,More than 1000,No,Yes,Don't know,Not sure,No,No,Yes,Very easy,No,No,Some of them,Some of them,No,Maybe,No,No
4,36,Male,No,No,Never,1-5,Yes,Yes,Don't know,Not sure,Don't know,Don't know,Don't know,Very easy,No,No,Some of them,Some of them,No,No,Don't know,No


# <font size = 3 color='Black'><b>2). Handling Missing Values</b>

In [15]:
#Check for missing values count in whole dataset
df.isnull().sum()

Age                            0
Gender                         0
self_employed                 18
family_history                 0
treatment                      0
work_interfere               236
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
wellness_program               0
seek_help                      0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
coworkers                      0
supervisor                     0
mental_health_interview        0
phys_health_interview          0
mental_vs_physical             0
obs_consequence                0
dtype: int64

We can see there are Nan values in self_emplyed and work_interfere

As these are categorical variables, we'replace Nan with mode

In [16]:
df_test.isnull().sum()

Age                           0
Gender                        0
self_employed                 0
family_history                0
work_interfere               28
no_employees                  0
remote_work                   0
tech_company                  0
benefits                      0
care_options                  0
wellness_program              0
seek_help                     0
anonymity                     0
leave                         0
mental_health_consequence     0
phys_health_consequence       0
coworkers                     0
supervisor                    0
mental_health_interview       0
phys_health_interview         0
mental_vs_physical            0
obs_consequence               0
dtype: int64

There are missing values in work_interfere in test data

In [17]:
#we can use use fillna to fill the missing values with mean of the variable
#inplace is kept as true to work om only the missing values rather than the whole data and thereby reducing kernel load
df['self_employed'].fillna(df['self_employed'].mode()[0], inplace = True)

In [18]:
#df['work_interfere'].fillna(df['work_interfere'].mode()[0], inplace = True)
#as work_interfere is multi class variable so we added a new class for NaN values

In [19]:
df['work_interfere'].fillna("Don't Know", inplace = True)

In [20]:
df.isnull().sum()

Age                          0
Gender                       0
self_employed                0
family_history               0
treatment                    0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
dtype: int64

In [21]:
df_test['work_interfere'].fillna("Don't Know", inplace = True)

In [22]:
df_test.isnull().sum()

Age                          0
Gender                       0
self_employed                0
family_history               0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
dtype: int64

# <font size = 3 color='Black'><b>Handling Age variable</b>

The possible age range for employees can be considered between 17 to 65 and therefore we must replace/delete the values above or below this range

In [23]:
#calculating mean for values between 17 to 65
mean_age = int(df['Age'][(df['Age']<=65) & (df['Age']>=17)].mean())
mean_age_test = int(df_test['Age'][(df_test['Age']<=65) & (df_test['Age']>=17)].mean())

In [24]:
#Replacing values greater than 65 or lower than 17 with mean
df['Age'] = df['Age'].apply(lambda x: mean_age if (x>65 or x<17) else int(x))
df_test['Age']= df_test['Age'].apply(lambda x: mean_age_test if(x>65 or x<17) else int(x))

# <font size = 3 color='Black'><b>3). Converting Categorical/Discrete Var to Continuous Var</b>

We need to convert all variables except Age to discrete/numerical variables


1.) Gender

We need to manually encode the values to female(0),male(1) and others(2) as there are too many class for single labels

'''['Female' 'M' 'Male' 'male' 'female' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'All' 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Agender'
 'cis-female/femme' 'Guy (-ish) ^_^' 'male leaning androgynous' 'Male '
 'Man' 'Trans woman' 'msle' 'Neuter' 'Female (trans)' 'queer'
 'Female (cis)' 'Mail' 'cis male' 'A little about you' 'Malr']'''

In [25]:
df['Gender'] = df['Gender'].replace(('Female','female','F','Woman','f','Femake','woman','Female ','Cis Female','cis-female/femme','Female (cis)'),0)
df['Gender'] = df['Gender'].replace(('M','Male','male','m','Male-ish','maile','Mal','Make','Male ','Man','msle','Mail','Malr','Cis Male','Male (CIS)','cis male'),1)
df['Gender'] = df['Gender'].replace(('Trans-female','something kinda male?','queer/she/they','non-binary',
                                 'Nah','All','Enby','fluid','Genderqueer','Androgyne','Agender','Guy (-ish) ^_^',
                                  'male leaning androgynous','Trans woman','Neuter','Female (trans)','queer',
                                  'A little about you'),2)

In [26]:
# to check for the unique values of gender in test data
df_test['Gender'].unique()

array(['Male', 'female', 'Female', 'M', 'male', 'Male ', 'm', 'p', 'F',
       'Woman', 'femail', 'Cis Man',
       'ostensibly male, unsure what that really means', 'f'],
      dtype=object)

In [27]:
#Applying encoding to test data
df_test['Gender'] = df_test['Gender'].replace(('Female','female','F','Woman','f','femail'),0)
df_test['Gender'] = df_test['Gender'].replace(('M','Male','male','m','Male ','Cis Man'),1)
df_test['Gender'] = df_test['Gender'].replace(('p','ostensibly male, unsure what that really means'),2)

2.) no_employees

Unique values: ['6-25' 'More than 1000' '26-100' '100-500' '1-5' '500-1000']

In [28]:
df['no_employees'] = df['no_employees'].map({'1-5':0,'6-25':1,'26-100':2,'100-500':3,'500-1000':4,'More than 1000':5})

In [29]:
df_test['no_employees'] = df_test['no_employees'].map({'1-5':0,'6-25':1,'26-100':2,'100-500':3,'500-1000':4,'More than 1000':5})

In [30]:
df_test.dtypes

Age                           int64
Gender                        int64
self_employed                object
family_history               object
work_interfere               object
no_employees                  int64
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
dtype: object

3.)All other categorical values

In [32]:
#Reducing redundant class to a single class for 'leaves' variable
df['leave'] = df['leave'].replace(['Somewhat easy', 'Very easy'],'easy')
df_test['leave'] = df_test['leave'].replace(['Somewhat easy', 'Very easy'],'easy')

In [33]:
encode = LabelEncoder()
for i in df.columns:
    if df[i].dtype == 'object':
            df[i] = encode.fit_transform(df[i])

In [34]:
#To check if all the data types have been converted to int from object
df.dtypes

Age                          int64
Gender                       int64
self_employed                int32
family_history               int32
treatment                    int32
work_interfere               int32
no_employees                 int64
remote_work                  int32
tech_company                 int32
benefits                     int32
care_options                 int32
wellness_program             int32
seek_help                    int32
anonymity                    int32
leave                        int32
mental_health_consequence    int32
phys_health_consequence      int32
coworkers                    int32
supervisor                   int32
mental_health_interview      int32
phys_health_interview        int32
mental_vs_physical           int32
obs_consequence              int32
dtype: object

In [35]:
df.head()

Unnamed: 0,Age,Gender,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,0,0,0,1,2,1,0,1,2,1,1,2,2,3,1,1,1,2,1,0,2,0
1,44,1,0,0,0,3,5,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0
2,32,1,0,0,0,3,1,0,1,1,0,1,1,0,1,1,1,2,2,2,2,1,0
3,31,1,0,1,1,2,2,0,1,1,2,1,1,1,1,2,2,1,0,0,0,1,1
4,31,1,0,0,0,1,3,1,1,2,0,0,0,0,0,1,1,1,2,2,2,0,0


In [36]:
df.tail()

Unnamed: 0,Age,Gender,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
1043,26,1,0,0,1,4,1,0,1,2,2,1,1,2,0,0,1,1,0,1,0,0,0
1044,29,0,0,0,1,2,5,0,0,2,2,1,1,0,0,0,1,1,0,1,2,1,1
1045,26,0,0,1,1,4,3,0,1,2,1,1,1,2,0,0,1,1,0,1,0,0,0
1046,33,1,0,1,1,4,1,0,1,1,1,1,1,0,3,1,1,1,2,1,1,0,0
1047,28,1,0,1,0,4,3,0,1,1,0,1,1,0,3,1,1,1,1,1,1,0,1


In [37]:
#encoding variables for data from test set
for i in df_test.columns:
    if df_test[i].dtype == 'object':
            df_test[i] = encode.fit_transform(df_test[i])

In [38]:
#To check if all the data types have been converted to int from object
df_test.dtypes

Age                          int64
Gender                       int64
self_employed                int32
family_history               int32
work_interfere               int32
no_employees                 int64
remote_work                  int32
tech_company                 int32
benefits                     int32
care_options                 int32
wellness_program             int32
seek_help                    int32
anonymity                    int32
leave                        int32
mental_health_consequence    int32
phys_health_consequence      int32
coworkers                    int32
supervisor                   int32
mental_health_interview      int32
phys_health_interview        int32
mental_vs_physical           int32
obs_consequence              int32
dtype: object

In [39]:
df_test.head()

Unnamed: 0,Age,Gender,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,39,1,1,1,4,0,1,1,1,2,2,2,2,1,1,1,2,2,1,0,2,1
1,26,0,0,1,4,5,0,1,2,2,1,1,0,0,1,1,1,2,1,0,1,1
2,23,0,0,1,4,2,0,0,1,0,1,1,0,1,2,1,0,1,1,0,1,0
3,35,1,0,1,2,5,0,1,0,1,1,1,2,3,1,1,1,1,1,0,1,0
4,36,1,0,0,1,0,1,1,0,1,0,0,0,3,1,1,1,1,1,1,0,0


### Creating x and y

In [40]:
# Target/Dependent variable is Treatment
x = df.loc[:, df.columns != 'treatment']
y = df.loc[:,'treatment']

In [41]:
print(x.shape)
print(y.shape)

(1048, 22)
(1048,)


### Scaling the data

In [42]:
scaler = StandardScaler()
#X = scaler.fit_transform(X)
#sepatate function for fit and transform
scaler.fit(x)
x = scaler.transform(x)

In [43]:
scaler.fit(df_test)
df_test = scaler.transform(df_test)

### Splitting the data into train and test set

Splitting train set to train and validation set as we already have our test set

In [44]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state = 10,stratify = y)

In [45]:
print("x_train:",x_train.shape)
print("x_test:",x_test.shape)
print("y_train:",y_train.shape)
print("y_test:",y_test.shape)

x_train: (733, 22)
x_test: (315, 22)
y_train: (733,)
y_test: (315,)


### Building Logistic Regression Model

In [46]:
#training the model
lr = LogisticRegression()
lr.fit(x_train,y_train)

LogisticRegression()

In [47]:
#Looking for coefficients value of output equation as logistic regression outputs a equation too
Coefficients = list(zip(df.columns[:-1],lr.coef_.ravel())) #ravel() is used to flatten the array
print(Coefficients)
print(lr.intercept_)

[('Age', 0.1896944061067346), ('Gender', -0.22318918272604368), ('self_employed', -0.0942058997162623), ('family_history', 0.5616608157176773), ('treatment', 1.4504827447163287), ('work_interfere', -0.14156972993623465), ('no_employees', -0.0002355909740508867), ('remote_work', -0.14078868191665192), ('tech_company', 0.2774712007956207), ('benefits', 0.3076093048413515), ('care_options', -0.03859578013071909), ('wellness_program', -0.18125656069073956), ('seek_help', 0.30425037493390344), ('anonymity', -0.11391851138333618), ('leave', -0.015661146489882684), ('mental_health_consequence', 0.08588907566261504), ('phys_health_consequence', 0.244996221762841), ('coworkers', -0.1507637535829961), ('supervisor', 0.06351873282039477), ('mental_health_interview', 0.22953759653606068), ('phys_health_interview', 0.0888748849525661), ('mental_vs_physical', 0.20009829907827476)]
[-0.14404415]


In [48]:
#outputting Probability matrix
prob = lr.predict_proba(x_test)
prob

array([[0.87286481, 0.12713519],
       [0.62527109, 0.37472891],
       [0.27669703, 0.72330297],
       [0.88572324, 0.11427676],
       [0.24106397, 0.75893603],
       [0.51612181, 0.48387819],
       [0.58499005, 0.41500995],
       [0.91269501, 0.08730499],
       [0.29175169, 0.70824831],
       [0.98584205, 0.01415795],
       [0.66700094, 0.33299906],
       [0.45849488, 0.54150512],
       [0.03311294, 0.96688706],
       [0.07240624, 0.92759376],
       [0.15725009, 0.84274991],
       [0.19918978, 0.80081022],
       [0.07933   , 0.92067   ],
       [0.82168191, 0.17831809],
       [0.95435078, 0.04564922],
       [0.92531586, 0.07468414],
       [0.85870274, 0.14129726],
       [0.94467504, 0.05532496],
       [0.70997631, 0.29002369],
       [0.12666241, 0.87333759],
       [0.29801177, 0.70198823],
       [0.01176998, 0.98823002],
       [0.09765669, 0.90234331],
       [0.61732189, 0.38267811],
       [0.61818767, 0.38181233],
       [0.04927964, 0.95072036],
       [0.

In [49]:
#Predicting values
y_pred = lr.predict(x_test)

### Evaluation of model

In [50]:
cfm = confusion_matrix(y_test,y_pred)
print("Confusion Matrix: ")
print(cfm)

acc = accuracy_score(y_test,y_pred)
print("accuracy score",acc)

print('classification report : ')
print(classification_report(y_test,y_pred))

Confusion Matrix: 
[[134  27]
 [ 24 130]]
accuracy score 0.8380952380952381
classification report : 
              precision    recall  f1-score   support

           0       0.85      0.83      0.84       161
           1       0.83      0.84      0.84       154

    accuracy                           0.84       315
   macro avg       0.84      0.84      0.84       315
weighted avg       0.84      0.84      0.84       315



### Fine Tuning the model

#### 1.) Adjusting Threshold

Using Ridge Regression:

In [51]:
#to loop through various values of threshold and check which one gives least error(wrong predictions)
for i in np.arange(0.4,0.61,0.01):
    pred = np.where(prob[:,1]>i,1,0)
    cfm = confusion_matrix(y_test,pred)
    err = cfm[1,0]+cfm[0,1]
    print("Error for threshold,",i," : ",err, "and Type II error :",cfm[1,0],"and Type I error :",cfm[0,1])

Error for threshold, 0.4  :  46 and Type II error : 14 and Type I error : 32
Error for threshold, 0.41000000000000003  :  46 and Type II error : 14 and Type I error : 32
Error for threshold, 0.42000000000000004  :  47 and Type II error : 16 and Type I error : 31
Error for threshold, 0.43000000000000005  :  47 and Type II error : 16 and Type I error : 31
Error for threshold, 0.44000000000000006  :  46 and Type II error : 17 and Type I error : 29
Error for threshold, 0.45000000000000007  :  45 and Type II error : 17 and Type I error : 28
Error for threshold, 0.4600000000000001  :  46 and Type II error : 18 and Type I error : 28
Error for threshold, 0.4700000000000001  :  46 and Type II error : 18 and Type I error : 28
Error for threshold, 0.4800000000000001  :  48 and Type II error : 20 and Type I error : 28
Error for threshold, 0.4900000000000001  :  49 and Type II error : 22 and Type I error : 27
Error for threshold, 0.5000000000000001  :  51 and Type II error : 24 and Type I error : 2

We can see we are getting lesser Type 2 error at threshold 0.44 so we'll try prdeicting the value using that threshold and generate evaluation metrics

In [52]:
#prob is the probability matrix that we genearted above
y_pred_t = [1 if i>0.44 else 0 for i in prob[:,1]]

In [53]:
cfm = confusion_matrix(y_test,y_pred_t)
print("Confusion Matrix: ")
print(cfm)

acc = accuracy_score(y_test,y_pred_t)
print("accuracy score",acc)

print('classification report : ')
print(classification_report(y_test,y_pred_t))

Confusion Matrix: 
[[132  29]
 [ 17 137]]
accuracy score 0.8539682539682539
classification report : 
              precision    recall  f1-score   support

           0       0.89      0.82      0.85       161
           1       0.83      0.89      0.86       154

    accuracy                           0.85       315
   macro avg       0.86      0.85      0.85       315
weighted avg       0.86      0.85      0.85       315



We can conclude following from above evaluation:

    Recall for class 0 has been decresed by 1 unit but f1-score is  increased by 1 unit
    
    Recall for class 1 has been increased by 6 units and f1-score is  increased by 2 units
    
    Accuracy score has improved a little bit from 0.838 to 0.857

Therefore, we'll use the logistic regression model having threshold as 0.44

In [54]:
log_adj = LogisticRegression()
lr.fit(x_train,y_train)

LogisticRegression()

### SGD Classification

In [55]:
sgd = SGDClassifier(loss = 'log', random_state = 10, 
                    learning_rate ='constant', eta0 = 0.00001,
                   max_iter= 1000,shuffle = True,
                   early_stopping = True,n_iter_no_change = 5)
sgd.fit(x_train,y_train)
y_sgd = sgd.predict(x_test)

In [56]:
cfm = confusion_matrix(y_test,y_sgd)
print("Confusion Matrix: ")
print(cfm)

acc = accuracy_score(y_test,y_sgd)
print("accuracy score",acc)

print('classification report : ')
print(classification_report(y_test,y_sgd))

Confusion Matrix: 
[[129  32]
 [ 34 120]]
accuracy score 0.7904761904761904
classification report : 
              precision    recall  f1-score   support

           0       0.79      0.80      0.80       161
           1       0.79      0.78      0.78       154

    accuracy                           0.79       315
   macro avg       0.79      0.79      0.79       315
weighted avg       0.79      0.79      0.79       315



As we can see, accuracy score and f1-score for both the classes are lower than sklearn model so we'll not use this

### Recursive Feature Elimination(RFE)

In [61]:
col = df.loc[:,df.columns != 'treatment']
colname = col.columns
#lr is the our fitted model using LogisticRegression
rfe = RFE(lr,12)

#storing this into an object because we want to use some functions using that object 
model_rfe = rfe.fit(x_train,y_train) 

# total number features left
print("Number of Features left: ",model_rfe.n_features_)

#will give true false values to the columns, true if var exist in final model
print("Selected Features:")
print(list(zip(colname,model_rfe.support_)))

#also we can check ranking column name wise, to see which var is eliminated at which level,
#1 means present in final model, highest number rank is eliminated first
print("Feature Ranking: ")
print(list(zip(colname,model_rfe.ranking_))) 

Number of Features left:  12
Selected Features:
[('Age', True), ('Gender', True), ('self_employed', False), ('family_history', True), ('work_interfere', True), ('no_employees', False), ('remote_work', False), ('tech_company', False), ('benefits', True), ('care_options', True), ('wellness_program', False), ('seek_help', True), ('anonymity', True), ('leave', False), ('mental_health_consequence', False), ('phys_health_consequence', False), ('coworkers', True), ('supervisor', True), ('mental_health_interview', False), ('phys_health_interview', True), ('mental_vs_physical', False), ('obs_consequence', True)]
Feature Ranking: 
[('Age', 1), ('Gender', 1), ('self_employed', 6), ('family_history', 1), ('work_interfere', 1), ('no_employees', 3), ('remote_work', 11), ('tech_company', 2), ('benefits', 1), ('care_options', 1), ('wellness_program', 9), ('seek_help', 1), ('anonymity', 1), ('leave', 4), ('mental_health_consequence', 10), ('phys_health_consequence', 5), ('coworkers', 1), ('supervisor',

In [62]:
#predicting values using RFE model
y_rfe = model_rfe.predict(x_test)

In [63]:
cfm = confusion_matrix(y_test,y_rfe)
print("Confusion Matrix: ")
print(cfm)

acc = accuracy_score(y_test,y_rfe)
print("accuracy score",acc)

print('classification report : ')
print(classification_report(y_test,y_rfe))

Confusion Matrix: 
[[136  25]
 [ 27 127]]
accuracy score 0.834920634920635
classification report : 
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       161
           1       0.84      0.82      0.83       154

    accuracy                           0.83       315
   macro avg       0.83      0.83      0.83       315
weighted avg       0.83      0.83      0.83       315



We can see that the results are best when we build model over 12 variables selected using RFE

### K-Fold Cross Validation

In [64]:
#For evaluating Logistic Regression Model lr

kfold_cv = KFold(n_splits = 10, random_state = 10)

result = cross_val_score(estimator = lr,X= x_train,y=y_train, cv = kfold_cv)

print(result)
print(result.mean())

[0.78378378 0.85135135 0.7972973  0.76712329 0.82191781 0.84931507
 0.80821918 0.75342466 0.80821918 0.79452055]
0.8035172158459829


As we can see, the mean of accuracy of kfolds is slightly greater than the accuracy score of lr which was 0.695

In [65]:
#For evaluating SGD model

kfold_cv = KFold(n_splits = 10, random_state = 10)

result = cross_val_score(estimator = sgd,X= x_train,y=y_train, cv = kfold_cv)

print(result)
print(result.mean())

[0.66216216 0.86486486 0.78378378 0.71232877 0.79452055 0.79452055
 0.80821918 0.75342466 0.69863014 0.75342466]
0.7625879303961496


As we can see, the mean of accuracy of kfolds is slightly greater than the accuracy score of sgd which was 0.676

# <font size = 5 color='Black'><b>Therefore, we can conclude that the best model is the one built using Logistic  Regression<b> 

Predicting values for actual test data

In [67]:
#lr = LogisticRegression()
#lr.fit(x_train,y_train)
prob_test = lr.predict_proba(df_test)
prob_test

array([[0.04534142, 0.95465858],
       [0.06131426, 0.93868574],
       [0.20614177, 0.79385823],
       [0.62022864, 0.37977136],
       [0.91084696, 0.08915304],
       [0.60546166, 0.39453834],
       [0.03824112, 0.96175888],
       [0.96927692, 0.03072308],
       [0.96789283, 0.03210717],
       [0.48044387, 0.51955613],
       [0.20251929, 0.79748071],
       [0.84951589, 0.15048411],
       [0.35856143, 0.64143857],
       [0.92279604, 0.07720396],
       [0.92526633, 0.07473367],
       [0.22117079, 0.77882921],
       [0.46592636, 0.53407364],
       [0.93274744, 0.06725256],
       [0.88902939, 0.11097061],
       [0.73752149, 0.26247851],
       [0.98232704, 0.01767296],
       [0.32771826, 0.67228174],
       [0.3598264 , 0.6401736 ],
       [0.31638916, 0.68361084],
       [0.96480669, 0.03519331],
       [0.2761859 , 0.7238141 ],
       [0.04564294, 0.95435706],
       [0.11363525, 0.88636475],
       [0.18026769, 0.81973231],
       [0.064083  , 0.935917  ],
       [0.

In [68]:
pred_test =[1 if i>0.44 else 0 for i in prob_test[:,1]]

In [77]:
output = pd.DataFrame()
output['treatment'] = pred_test
output['treatment'] = output['treatment'].map({1:"Yes",0:"No"})

In [79]:
output.to_csv("output.csv",header=True,index=False)