# Employee Risk Prediction

This project builds a predictive risk-score model for each employee returning to work (RTW) after COVID-19, using a machine-learning algorithm.

---

We generated random data for a total of 10,000 employees, giving extra weight to some variables.

The variables were selected according to the potential risk factors identified so far. The 18 selected attributes are listed below.

      


**Attributes-**

1. Employee id (unique employee id)
2. Company's Location (office zip code)
3. Age (20-60)
4. Gender (Male/Female/Other)
5. Residence (employee's zip code)
6. Commute mode (Private/Public vehicle)
7. Shift (General/Night)
8. Office Structure (Cabin/Cubical/Continuous Desk)
9. Travel history (in past 30 days)
10. Current Disease (Yes/No)
11. Current infected (Yes/No)
12. Previously infected (Yes/No)
13. Dependent (Senior Citizen/Infant) (Yes: exists / No: does not exist)
14. Dependent's disease (Yes/No)
15. Dependent infected currently (Yes/No)
16. Vaccine Status (Yes/No)
17. Fitness Level (Fit/Unfit)
18. Infection Rate
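
The weighting idea can be illustrated with a minimal sketch. The probabilities, the seed, the choice of columns, and the helper name `make_employees` are assumptions for illustration only; the generator and weights actually used to build `data_infection.csv` are not shown in this notebook.

```python
import numpy as np
import pandas as pd

def make_employees(n, seed=0):
    """Draw n synthetic employees; every probability below is an assumed weight."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Employee id": np.arange(12001, 12001 + n),
        "Age": rng.integers(20, 61, size=n),  # 20-60 inclusive
        "Gender": rng.choice(["Female", "Male"], size=n, p=[0.48, 0.52]),
        "Shift": rng.choice(["General", "Night"], size=n, p=[0.9, 0.1]),
        "Commute mode": rng.choice(["Private", "Public"], size=n, p=[0.3, 0.7]),
        "Vaccine Status": rng.choice(["Yes", "No"], size=n, p=[0.5, 0.5]),
    })

sample = make_employees(1000)
```

Passing `p=[...]` to `rng.choice` is what gives one category more weight than another.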

# Importing Libraries and dataset

In [48]:
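import numpy as np
import pandas as pd 
from numpy import random  #random.choice(..., p=...) and random.randint are used for imputation below
import warnings
warnings.filterwarnings('ignore')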

In [49]:
Infection_data=pd.read_csv("data_infection.csv")

# Data Cleaning



 > **Error handling**



In [50]:
#Define custom exception classes
class duplicateID(Exception):
    pass
class invalidAge(Exception):
    pass
class invalidResidence(Exception):
    pass
class invalidDatatype(Exception):
    pass

1. **Handle duplicate Employee id**




In [51]:
duplicate_index=[]  #stays empty when every employee id is unique
try:
  if Infection_data["Employee id"].duplicated().any():
    #record every row that shares an id, then keep only the last occurrence
    duplicate_index=Infection_data.index[Infection_data["Employee id"].duplicated(keep=False)]
    Infection_data=Infection_data.drop_duplicates(subset='Employee id', keep="last")
    raise duplicateID
except duplicateID:
  duplicate_emp_id=list(map(lambda x: x + 12001, duplicate_index))

2. **Handle invalid Age error(Age should be greater than 18)**



In [52]:
try:
    #flag ages of 18 or below; missing ages are imputed later
    if (Infection_data.Age<=18).any():
      invalid_age_index=Infection_data.index[Infection_data.Age<=18]
      Infection_data=Infection_data.drop(invalid_age_index)
      raise invalidAge

except invalidAge:
  emp_inavlid_age=list(map(lambda x: x + 12001, invalid_age_index))

3. **Handle invalid Residence error(Residence should be within range)**


In [53]:
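try:
  #valid residence zip codes lie in 10000-14999
  #missing or non-numeric values are not flagged here; they are handled in later cells
  residence=pd.to_numeric(Infection_data["Residence"], errors="coerce")
  out_of_range=residence.notna() & ~residence.between(10000,14999)
  invalid_residence_index=Infection_data.index[out_of_range].tolist()
  if invalid_residence_index:
    Infection_data=Infection_data.drop(invalid_residence_index)
    raise invalidResidence

except invalidResidence:
  emp_resdience_out_of_range=list(map(lambda x: x + 12001, invalid_residence_index))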

4. **Handle invalid objects in specific datatype columns**

In [54]:
#for numeric columns
numeric_col=["Employee id","Company's Location","Age","Residence","Infection Rate"]
invalid_index=[]
try:
    Infection_data_col=Infection_data.select_dtypes(include=np.number).columns.tolist()
    #columns expected to be numeric that pandas did not parse as numeric
    bad_cols=[x for x in numeric_col if x not in Infection_data_col]
    for c in bad_cols:
        invalid_index+=Infection_data.index[Infection_data[c].apply(lambda x: isinstance(x, str))].tolist()
    if invalid_index:
        #drop every offending row once, after all columns have been checked
        invalid_index=sorted(set(invalid_index))
        Infection_data=Infection_data.drop(invalid_index)
        raise invalidDatatype

except invalidDatatype:
  emp_invalid_number=list(map(lambda x: x + 12001, invalid_index))
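
An alternative way to locate the offending rows is to coerce the column to numbers and look at what fails to parse. This is a sketch rather than the method used in this notebook; the toy frame and the helper name `invalid_numeric_rows` are hypothetical.

```python
import pandas as pd

def invalid_numeric_rows(df, column):
    """Return index labels whose value in `column` is present but not numeric."""
    coerced = pd.to_numeric(df[column], errors="coerce")
    # NaN after coercion but non-null before it means the entry could not be parsed
    mask = coerced.isna() & df[column].notna()
    return df.index[mask].tolist()

toy = pd.DataFrame({"Age": ["25", "abc", None, "41"]})
bad = invalid_numeric_rows(toy, "Age")
```

Here only the row holding `"abc"` is reported; the missing value is left for the imputation step.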
 

In [55]:
# for non numeric columns
non_numeric_col=['Gender', 'Commute mode', 'Shift', 'Office Structure',
       'Travel history(in past 30 days)', 'Current Disease',
       'Current infected', 'Previously infected',
       'Dependent(Senior Citizen/Infant)', "Dependent's disease",
       'Dependent infected currently', 'Vaccine Status', 'Fitness Level']
invalid_index=[]
try:
    Infection_data_col=Infection_data.select_dtypes(exclude=np.number).columns.tolist()
    #columns expected to hold text that pandas parsed as numeric
    bad_cols=[x for x in non_numeric_col if x not in Infection_data_col]
    for c in bad_cols:
        invalid_index+=Infection_data.index[Infection_data[c].apply(lambda x: isinstance(x, int))].tolist()
    if invalid_index:
        #drop every offending row once, after all columns have been checked
        invalid_index=sorted(set(invalid_index))
        Infection_data=Infection_data.drop(invalid_index)
        raise invalidDatatype

except invalidDatatype:
  emp_invalid_str=list(map(lambda x: x + 12001, invalid_index))




> **Handling missing values**



In [56]:
#should be removed if null, can't assign random prediction
col=['Employee id', 'Current Disease', 'Travel history(in past 30 days)','Current infected', 'Previously infected', "Dependent's disease", 'Dependent infected currently',"Vaccine Status"]
index_col=[]

for cols in col: 
    index_col.append(Infection_data[cols].index[Infection_data[cols].isna().values.tolist()])#must store null value index for future error generation
    Infection_data=Infection_data.dropna(subset=[cols])
    

Note: The indexes of null values in [Employee id, Current Disease, Travel history, Current infected, Previously infected, Dependent's disease, Dependent infected currently, Vaccine Status] must be stored, and those rows are dropped from the dataset.

In [57]:
#Company's Location column: fill missing values with the office zip code 10022
Infection_data["Company's Location"]=Infection_data["Company's Location"].fillna(10022)

In [58]:
#Age column
mean_age_seat = Infection_data.groupby("Office Structure").Age.mean()
overall_mean_age = int(Infection_data["Age"].mean())

nan_index = Infection_data.index[Infection_data['Age'].isna()]
for i in nan_index:
    seat = Infection_data.loc[i, "Office Structure"]
    #use the mean age of the employee's office structure when it is known
    if pd.notna(seat) and pd.notna(mean_age_seat.get(seat)):
        Infection_data.loc[i, "Age"] = int(mean_age_seat[seat])
    #otherwise fall back to the overall mean age
    else:
        Infection_data.loc[i, "Age"] = overall_mean_age
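
The same group-mean idea can be written without an explicit loop. The sketch below runs on a hypothetical toy frame, not on the notebook's data, and is only meant to show the pattern.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Office Structure": ["Cabin", "Cabin", "Cubical", "Cubical", None],
    "Age": [50.0, np.nan, 30.0, np.nan, np.nan],
})

# mean age per office structure, looked up for every row
group_mean = toy.groupby("Office Structure")["Age"].mean()
by_group = toy["Office Structure"].map(group_mean)

# group mean first, overall mean where the office structure is unknown
filled = toy["Age"].fillna(by_group).fillna(toy["Age"].mean())
toy["Age"] = filled.astype(int)
```

Rows with a known office structure take that group's mean; the last row has no office structure, so it takes the overall mean.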

In [59]:
#Gender column
nan_index=Infection_data.index[Infection_data['Gender'].isna()]
for i in nan_index:
    #.loc writes to the DataFrame itself; chained indexing may write to a copy
    Infection_data.loc[i,"Gender"]=random.choice(["Female","Male"], p=[0.48,0.52])

In [60]:
#Residence column
nan_index=Infection_data.index[Infection_data['Residence'].isna()]
for i in nan_index:
    x = random.randint(10000,15000)
    Infection_data.loc[i,"Residence"]=random.choice([10022,x], p=[0.9,0.1])#most residences should match the office location

In [61]:
#Commute mode column
nan_index=Infection_data.index[Infection_data['Commute mode'].isna()]
for i in nan_index:
    seating=Infection_data.loc[i,"Office Structure"]#commute mode generally depends on the employee's office type
    if seating=="Cabin":
        Infection_data.loc[i,"Commute mode"]="Private"
    else:
        Infection_data.loc[i,"Commute mode"]=random.choice(["Private","Public"], p=[0.3,0.7])

In [62]:
#Shift column
nan_index = Infection_data.index[Infection_data['Shift'].isna()]
for i in nan_index:
    Infection_data.loc[i, "Shift"] = random.choice(["General", "Night"], p=[0.9, 0.1])

In [63]:
#Office structure column
nan_index = Infection_data.index[Infection_data['Office Structure'].isna()]
for i in nan_index:
    age = Infection_data.loc[i, "Age"]
    #employees over 40 are assumed to sit in a cabin
    if age > 40:
        Infection_data.loc[i, "Office Structure"] = "Cabin"
    else:
        Infection_data.loc[i, "Office Structure"] = random.choice(
            ["Cubical", "Cabin", "Continous Desk"], p=[0.5, 0.1, 0.4])

In [64]:
#Dependent(Senior Citizen/Infant) column
dep_col = "Dependent(Senior Citizen/Infant)"
nan_index = Infection_data.index[Infection_data[dep_col].isna()]
#age range in which a dependent is assumed, per gender
likely_age = {"Female": (26, 40), "Male": (30, 44)}
for i in nan_index:
    gender = Infection_data.loc[i, "Gender"]
    if gender in likely_age:
        low, high = likely_age[gender]
        if low <= Infection_data.loc[i, "Age"] <= high:
            Infection_data.loc[i, dep_col] = "Yes"
        else:
            Infection_data.loc[i, dep_col] = random.choice(["Yes", "No"], p=[0.1, 0.9])
    #any other gender value defaults to no dependent
    else:
        Infection_data.loc[i, dep_col] = "No"

In [65]:
#17 Fitness column: fill missing values with the most frequent level
Infection_data["Fitness Level"]=Infection_data["Fitness Level"].fillna(Infection_data["Fitness Level"].mode()[0])

In [66]:
#18 Infection Rate
nan_index = Infection_data.index[Infection_data['Infection Rate'].isna()]
Infection_data_infect= Infection_data[~Infection_data.index.isin(nan_index)]

for i in nan_index:
   #average the infection rate of the two rows whose residence zip code is numerically closest
   #positions from argsort refer to Infection_data_infect, so index that frame
   nearest=(Infection_data_infect['Residence']-Infection_data.loc[i,"Residence"]).abs().to_numpy().argsort()[:2]
   Infection_data.loc[i,"Infection Rate"]=Infection_data_infect.iloc[nearest]["Infection Rate"].mean()



> **Employees whose risk can't be calculated**



In [67]:
column_name = []
mylists = []
base_id = 12001

#same order as `col` in the missing-value cell, so each index list gets the right label
name = [
    "Employee id", "Current Disease", "Travel history(in past 30 days)",
    "Current infected", "Previously infected", "Dependent's disease",
    "Dependent infected currently", "Vaccine Status"]

for colm, label in zip(index_col, name):
    mylists = mylists + colm.values.tolist()
    column_name = column_name + [label] * len(colm)

id = list(map(lambda x: x + base_id, mylists))

In [68]:
emp_null_data=pd.DataFrame(list(zip(id,column_name)), 
               columns =["Employee id",'Incomplete column'])

# Grouping numerical data

**Zone wise clustering**

In [69]:
!pip install kmeans1d



In [70]:
#Clustering based on infection rate in employee's residence
#!pip install kmeans1d
import kmeans1d

x=Infection_data["Infection Rate"].values  #select by name instead of column position 17
k = 3

clusters, centroids = kmeans1d.cluster(x, k)


In [71]:
Infection_data["Zone"]=clusters

#cluster ids are treated as increasing with infection rate: 0 is lowest, 2 is highest
zone_names={0:"Green",1:"Yellow",2:"Red"}
Infection_data["Zone"]=Infection_data["Zone"].map(zone_names)
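
The zone labels only make sense if cluster ids are ordered by infection rate. The sketch below shows that idea with a plain Lloyd's iteration in NumPy. It is a stand-in for illustration, not the algorithm inside `kmeans1d`; the function name `cluster_1d` and the sample rates are hypothetical.

```python
import numpy as np

def cluster_1d(values, k, iters=50):
    """Plain Lloyd's k-means on 1-D data; labels are ordered by centroid."""
    x = np.asarray(values, dtype=float)
    # start from evenly spaced quantiles so the centroids begin in sorted order
    centroids = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean()
    # relabel so that label 0 has the smallest centroid
    order = np.argsort(centroids)
    rank = np.empty(k, dtype=int)
    rank[order] = np.arange(k)
    return rank[labels], centroids[order]

rates = [0.01, 0.02, 0.03, 0.20, 0.22, 0.50, 0.55]
labels, centroids = cluster_1d(rates, 3)
zones = [["Green", "Yellow", "Red"][c] for c in labels]
```

The three lowest rates come out as Green, the middle two as Yellow, and the highest two as Red.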


**Age wise grouping**

In [72]:
def age(param):
    #ages up to 35 (including any 19-year-olds that passed the age check) form the youngest group
    if param <=35:
        return '20-35'
    elif param<=50:
        return '36-50'
    else:
        return "50+"
    
Infection_data['Age-group'] = Infection_data['Age'].apply(age)
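
The same banding can be expressed with `pd.cut`. This is a sketch on hypothetical ages; placing ages below 20 in the youngest band is an assumption made here.

```python
import pandas as pd

ages = pd.Series([19, 20, 35, 36, 50, 51, 60])
# right-closed bins: (0, 35], (35, 50], (50, inf)
age_group = pd.cut(ages, bins=[0, 35, 50, float("inf")],
                   labels=["20-35", "36-50", "50+"])
```

Because the bins are closed on the right, 35 falls in `20-35` and 50 falls in `36-50`.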

# Risk Score



> We define a **risk function** to **manually compute the risk_score** of each employee based on the **importance of each variable (10 means most important and 1 means least important)**.

---


> We take a **row with the worst-case scenario** in every column (for example, Current Disease = Yes, Travel history = Yes, and so on) and assign it the **worst risk = 100**.
> If a row has **one or more columns that are not in the worst case**, we **subtract a score (based on importance) from the worst risk score (here, 100)**.
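
The subtraction rule can be illustrated on a few columns. The weights below are the ones used by the risk function in this notebook, but only a subset of columns is shown, and the names `REDUCTIONS` and `risk_sketch` are hypothetical.

```python
# (column, value that lowers risk, points removed): an illustrative subset
REDUCTIONS = [
    ("Dependent infected currently", "No", 10),
    ("Current Disease", "No", 9),
    ("Vaccine Status", "Yes", 9),
    ("Travel history(in past 30 days)", "No", 8),
    ("Commute mode", "Private", 5),
]

def risk_sketch(employee, worst=100):
    """Start from the worst-case score and subtract one reduction per safe answer."""
    score = worst
    for column, safe_value, points in REDUCTIONS:
        if employee.get(column) == safe_value:
            score -= points
    return score

employee = {
    "Dependent infected currently": "No",
    "Current Disease": "Yes",
    "Vaccine Status": "Yes",
    "Travel history(in past 30 days)": "Yes",
    "Commute mode": "Public",
}
score = risk_sketch(employee)  # 100 - 10 - 9
```

This employee differs from the worst case in two columns, so the score drops from 100 to 81.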






In [73]:
def risk(Infection_data,Score):
    if Infection_data["Dependent infected currently"]=="No":
        Score=Score-10
    if Infection_data["Current Disease"]=="No":
        Score=Score-9

    if Infection_data["Age-group"]=="20-35":
        Score=Score-8
    if Infection_data["Age-group"]=="36-50":
        Score=Score-5
    if Infection_data["Fitness Level"]=="Fit":
        Score=Score-8
    if Infection_data["Travel history(in past 30 days)"]=="No":
        Score=Score-8
    if Infection_data["Zone"]=="Yellow":
        Score=Score-4
    if Infection_data["Zone"]=="Green":
        Score=Score-6
    if Infection_data["Commute mode"]=="Private":
        Score=Score-5
    if Infection_data["Vaccine Status"]=="Yes":
        Score=Score-9
    if Infection_data["Dependent's disease"]=="No":
        Score=Score-3
    if Infection_data["Office Structure"]=="Cabin":
        Score=Score-2
    if Infection_data["Office Structure"]=="Cubical":
        Score=Score-1
    if Infection_data["Dependent(Senior Citizen/Infant)"]=="No":
        Score=Score-2
    if Infection_data["Previously infected"]=="Yes":
        Score=Score-2
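    #NOTE: "Gender" never takes the value "No", so this 1-point reduction never applies
    if Infection_data["Gender"]=="No":
        Score=Score-1
    #night shift is the lower-risk case (the recommendations move General-shift staff to nights)
    if Infection_data["Shift"]=="Night":
        Score=Score-2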
        
    return Score

In [74]:
#score every row, starting from the worst-case value of 100
risk_score=[risk(row,100) for _,row in Infection_data.iterrows()]
Infection_data["Risk_Score"]=risk_score

# Feature Scaling

In [None]:
risk_data=Infection_data.copy()
#Scale Age and Infection Rate to the range 0-1
from sklearn.preprocessing import MinMaxScaler
sc= MinMaxScaler()
#fit_transform returns a 2-D array, so flatten it back to a single column
risk_data["Age"]= sc.fit_transform(risk_data[["Age"]]).ravel()
risk_data["Infection Rate"]= sc.fit_transform(risk_data[["Infection Rate"]]).ravel()
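
Min-max scaling maps the smallest value to 0 and the largest to 1. The sketch below shows the arithmetic in plain NumPy; the helper name `min_max` and the sample ages are hypothetical.

```python
import numpy as np

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(values, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)  # constant column: nothing to spread out
    return (x - x.min()) / span

scaled = min_max([20, 30, 40, 60])
```

For these ages the result is 0, 0.25, 0.5 and 1.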

# Label Encoding

In [None]:
col=list(risk_data.select_dtypes(include=['object']).columns)

from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
for cols in col:
    #assign by column name so the column takes the encoder's integer dtype
    risk_data[cols] = labelencoder_X.fit_transform(risk_data[cols])

#one-hot encode the (already label-encoded) office structure and drop one level as the baseline
risk_data=pd.get_dummies(risk_data,columns=["Office Structure"])
risk_data=risk_data.drop(['Office Structure_2'],axis=1)
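
To see what the two encoding steps produce, here is a sketch on a hypothetical toy frame. It uses `pd.Categorical(...).codes` as a stand-in for `LabelEncoder`; both number the distinct values in sorted order, so `Cabin` becomes 0, `Continous Desk` 1, and `Cubical` 2.

```python
import pandas as pd

toy = pd.DataFrame({
    "Shift": ["General", "Night", "General"],
    "Office Structure": ["Cubical", "Cabin", "Continous Desk"],
})

# integer codes follow the sorted order of the distinct values
for column in ["Shift", "Office Structure"]:
    toy[column] = pd.Categorical(toy[column]).codes

# one indicator column per office-structure code; drop code 2 as the baseline
encoded = pd.get_dummies(toy, columns=["Office Structure"])
encoded = encoded.drop(columns=["Office Structure_2"])
```

Under this ordering, the dropped `Office Structure_2` column corresponds to `Cubical`, which then acts as the reference level.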


# Machine Learning Model

In [None]:
from sklearn.model_selection import train_test_split 
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

In [None]:
#Not important in training the model
risk_data = risk_data.drop(['Employee id','Company\'s Location','Residence', 'Current infected','Age-group','Zone'], axis=1)

In [None]:
#Split into features and label
y = risk_data['Risk_Score']
X = risk_data.drop(['Risk_Score'], axis=1)

In [None]:
#train_test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 


> XGBOOST regression

In [None]:
xg_model= XGBRegressor(n_estimators=100,verbosity=0)  #verbosity=0 replaces the removed `silent` flag
xg_fit_model=xg_model.fit(X_train, y_train)

In [None]:
#hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
m_dep = [5,6,7,8]
gammas = [0.01,0.001,0.001]
min_c_wt = [1,5,10]
l_rate = [0.05,0.1, 0.2, 0.3]
n_est = [5,10,20,100]

param_grid = {'n_estimators': n_est, 'gamma': gammas, 'max_depth': m_dep,
              'min_child_weight': min_c_wt, 'learning_rate': l_rate}

xgb_cv= RandomizedSearchCV(estimator = xg_model, n_iter=100, param_distributions =  param_grid, random_state=51, cv=2, n_jobs=-1, refit=True)
xgb_fit_cv=xgb_cv.fit(X_train,y_train)


In [None]:
#use the best estimator found by the search; refit=True has already trained it on X_train
xgb_hyper_model=xgb_fit_cv.best_estimator_
xgb_fit_hyper=xgb_hyper_model

In [None]:
Predicted_risk= xgb_fit_hyper.predict(X_test)
Predicted_risk=[round(x) for x in Predicted_risk]
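
The imported `mean_absolute_error` is the natural way to check these predictions. The sketch below shows the same arithmetic in plain NumPy on hypothetical scores; the helper name `mean_abs_error` is not part of the notebook.

```python
import numpy as np

def mean_abs_error(y_true, y_pred):
    """Average absolute gap between actual and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# hypothetical scores, only to show the arithmetic: (2 + 3 + 0 + 4) / 4
mae = mean_abs_error([80, 62, 45, 30], [78, 65, 45, 34])
```

In the notebook itself, `mean_absolute_error(y_test, Predicted_risk)` would report the same measure for the held-out employees.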

# Recommendation

In [75]:
Infection_data=Infection_data.reset_index(drop=True)

import json

employee_recommendations = []

emp_id = Infection_data["Employee id"].tolist()
emp_score = Infection_data["Risk_Score"].tolist()

for i in range(len(Infection_data)) : 
  
    recommendations = []


    if(Infection_data.loc[i,"Vaccine Status"] == "No"):
        rec = {
        "text":"Get the employee vaccinated",
        "score": 9
        }
        recommendations.append(rec)

    if(Infection_data.loc[i,"Travel history(in past 30 days)"] == "Yes"):
        rec = {
        "text":"Wait 30 days from international travel dates",
        "score": 8
        }
        recommendations.append(rec)

    if(Infection_data.loc[i,"Zone"] == "Red"):
        rec = {
        "text":"Provide accommodation to employee in a zone with low infection rate",
        "score": 6
        }
        recommendations.append(rec)

    if(Infection_data.loc[i,"Commute mode"] == "Public"):
        rec = {
        "text":"Provide the employee with a private vehicle",
        "score": 5
        }
        recommendations.append(rec)

    if(Infection_data.loc[i,"Shift"] == "General"):
        rec = {
        "text":"Call the employee in night shift",
        "score": 2
        }
        recommendations.append(rec)

    if(Infection_data.loc[i,"Dependent(Senior Citizen/Infant)"] == "Yes"):
        rec = {
        "text":"Provide private accommodation to employee",
        "score": 2
        }
        recommendations.append(rec)

    if(Infection_data.loc[i,"Office Structure"] == "Continous Desk"):
        rec = {
        "text":"Shift the employee into a cubical",
        "score": 1
        }
        recommendations.append(rec)

    if(Infection_data.loc[i,"Office Structure"] == "Cubical"):
        rec = {
        "text":"Shift the employee into a cabin",
        "score": 1
        }
        recommendations.append(rec)
        
    
    recs_data = {
        "emp_id": emp_id[i],
        "recommendations": recommendations,
        "emp_score": emp_score[i]
    }
    
    employee_recommendations.append(recs_data)
    
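recommendations_data= json.dumps(employee_recommendations)
#print only the first few records; the full payload exceeds the notebook's output rate limit
print(json.dumps(employee_recommendations[:3], indent=2))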
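
The rule-per-column structure of the loop can also be written as a small table. This is a sketch covering three of the rules; the names `RULES` and `recommend` and the sample employee are hypothetical.

```python
import json

# (column, triggering value, recommendation text, score): a subset of the rules
RULES = [
    ("Vaccine Status", "No", "Get the employee vaccinated", 9),
    ("Travel history(in past 30 days)", "Yes",
     "Wait 30 days from international travel dates", 8),
    ("Commute mode", "Public", "Provide the employee with a private vehicle", 5),
]

def recommend(employee):
    """Return the recommendations whose trigger matches this employee."""
    return [{"text": text, "score": score}
            for column, value, text, score in RULES
            if employee.get(column) == value]

employee = {"Vaccine Status": "No",
            "Travel history(in past 30 days)": "No",
            "Commute mode": "Public"}
recs = recommend(employee)
payload = json.dumps({"emp_id": 12001, "recommendations": recs})
```

Adding a rule then means adding one row to the table rather than another `if` block.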







> **Final Analysis**





1. High Risk analysis



In [None]:
age_gender_high=Infection_data.groupby(['Age-group', 'Gender']).Risk_Score.apply(lambda x: round(len(x[x > 75])/len(x) * 100 , 2))

commute_shift_high=Infection_data.groupby(['Commute mode', 'Shift']).Risk_Score.apply(lambda x: round(len(x[x > 75])/len(x) * 100 , 2))

office_commute_high=Infection_data.groupby(["Office Structure", 'Commute mode']).Risk_Score.apply(lambda x: round(len(x[x > 75])/len(x) * 100 , 2))

age_disease_high=Infection_data.groupby(['Age-group', 'Current Disease']).Risk_Score.apply(lambda x: round(len(x[x > 75])/len(x) * 100 , 2))

2. Medium Risk analysis

In [None]:
#scores of exactly 45 and 75 are counted as medium so that every employee falls in one band
age_gender_medium=Infection_data.groupby(['Age-group', 'Gender']).Risk_Score.apply(lambda x: round(len(x[(x >= 45) & (x <= 75)])/len(x) * 100 , 2))

commute_shift_medium=Infection_data.groupby(['Commute mode', 'Shift']).Risk_Score.apply(lambda x: round(len(x[(x >= 45) & (x <= 75)])/len(x) * 100 , 2))

office_commute_medium=Infection_data.groupby(["Office Structure", 'Commute mode']).Risk_Score.apply(lambda x: round(len(x[(x >= 45) & (x <= 75)])/len(x) * 100 , 2))

age_disease_medium=Infection_data.groupby(['Age-group', 'Current Disease']).Risk_Score.apply(lambda x: round(len(x[(x >= 45) & (x <= 75)])/len(x) * 100 , 2))

3. Low Risk analysis

In [None]:
age_gender_low=Infection_data.groupby(['Age-group', 'Gender']).Risk_Score.apply(lambda x: round(len(x[x <45])/len(x) * 100 , 2))

commute_shift_low=Infection_data.groupby(['Commute mode', 'Shift']).Risk_Score.apply(lambda x: round(len(x[x<45])/len(x) * 100 , 2))

office_commute_low=Infection_data.groupby(["Office Structure", 'Commute mode']).Risk_Score.apply(lambda x: round(len(x[x<45])/len(x) * 100 , 2))

age_disease_low=Infection_data.groupby(['Age-group', 'Current Disease']).Risk_Score.apply(lambda x: round(len(x[x<45])/len(x) * 100 , 2))
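
The three analyses can be combined by first assigning each employee a band. The sketch below runs on a hypothetical toy frame, assumes integer scores, and assigns the boundary scores 45 and 75 to Medium, which is an assumption made here.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Risk_Score": [80, 45, 75, 30, 90],
})

# Low: below 45, Medium: 45-75 inclusive, High: above 75
band = pd.cut(toy["Risk_Score"], bins=[-1, 44, 75, 100],
              labels=["Low", "Medium", "High"])
toy["Band"] = band.astype(str)

# percentage of each gender falling in each band (each row sums to 100)
share = pd.crosstab(toy["Gender"], toy["Band"], normalize="index") * 100
```

Each row of `share` then holds the low, medium, and high percentages for one group in a single table.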