# INTRODUCTION

- *Context of the problem*
- *We will be wotking with five sources of data in csv format*

# DATA PREPROCESSING
Setting the environment

In [3]:
import numpy as np
import os
from numpy.random import default_rng
import matplotlib.pyplot as plt
import pandas as pd

## Loading the DataSets

In [4]:
# Change this to the relative/absolute path of the Datasets folder
#os.chdir("C:/Users/Delfina/OneDrive/Escritorio/Delfina/Francia 2025/CESI/AI/PROJECT/INDIAI/Datasets")
os.chdir("./Datasets")

general_data = pd.read_csv('general_data.csv').copy()
employee_survey_data = pd.read_csv('employee_survey_data.csv').copy()
manager_survey_data = pd.read_csv('manager_survey_data.csv').copy()
in_time = pd.read_csv('in_time.csv').copy()
out_time = pd.read_csv('out_time.csv').copy()

The datasets will be processed separately, then once clean thet'll be merged into one.

## General Data

Let's take a look at what the data looks like:

In [5]:
general_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeID,Gender,...,NumCompaniesWorked,Over18,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager
0,51,No,Travel_Rarely,Sales,6,2,Life Sciences,1,1,Female,...,1.0,Y,11,8,0,1.0,6,1,0,0
1,31,Yes,Travel_Frequently,Research & Development,10,1,Life Sciences,1,2,Female,...,0.0,Y,23,8,1,6.0,3,5,1,4
2,32,No,Travel_Frequently,Research & Development,17,4,Other,1,3,Male,...,1.0,Y,15,8,3,5.0,2,5,0,3
3,38,No,Non-Travel,Research & Development,2,5,Life Sciences,1,4,Male,...,3.0,Y,11,8,3,13.0,5,8,7,5
4,32,No,Travel_Rarely,Research & Development,10,1,Medical,1,5,Male,...,4.0,Y,12,8,2,9.0,2,6,0,4


### Dropping unnecesary attributes

Just by looking, we notice that:
- The attribute **Over18** is redundant: the more precise attribute **Age** is also present.
- The attribute **EmployeeCount** makes no sense, as each entry represents only one employee.
- We can divise two attributes that represent sensitive personal information, like **Gender** and **MaritalStatus**. We don't consider these relevant for the current analysis.

Let's now look at the metrics for each numerical attribute.

In [6]:
general_data.describe()

Unnamed: 0,Age,DistanceFromHome,Education,EmployeeCount,EmployeeID,JobLevel,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager
count,4410.0,4410.0,4410.0,4410.0,4410.0,4410.0,4410.0,4391.0,4410.0,4410.0,4410.0,4401.0,4410.0,4410.0,4410.0,4410.0
mean,36.92381,9.192517,2.912925,1.0,2205.5,2.063946,65029.312925,2.69483,15.209524,8.0,0.793878,11.279936,2.79932,7.008163,2.187755,4.123129
std,9.133301,8.105026,1.023933,0.0,1273.201673,1.106689,47068.888559,2.498887,3.659108,0.0,0.851883,7.782222,1.288978,6.125135,3.221699,3.567327
min,18.0,1.0,1.0,1.0,1.0,1.0,10090.0,0.0,11.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,30.0,2.0,2.0,1.0,1103.25,1.0,29110.0,1.0,12.0,8.0,0.0,6.0,2.0,3.0,0.0,2.0
50%,36.0,7.0,3.0,1.0,2205.5,2.0,49190.0,2.0,14.0,8.0,1.0,10.0,3.0,5.0,1.0,3.0
75%,43.0,14.0,4.0,1.0,3307.75,3.0,83800.0,4.0,18.0,8.0,1.0,15.0,3.0,9.0,3.0,7.0
max,60.0,29.0,5.0,1.0,4410.0,5.0,199990.0,9.0,25.0,8.0,3.0,40.0,6.0,40.0,15.0,17.0


- The attributes **EmployeeCount** and **StandardHours** have a standard deviation of 0.0, meaning that, for them, all entries have the same value -which are 1 and 8.0 respectively-. They are not useful to tell employees apart in any way.

We proceed by dropping the mentioned attributes, reducing the number of columns from 24 to 19.

In [7]:
general_data.drop('Gender', axis=1,inplace=True)
general_data.drop('MaritalStatus', axis=1,inplace=True)
general_data.drop('Over18', axis=1,inplace=True)
general_data.drop('EmployeeCount', axis=1,inplace=True)
general_data.drop('StandardHours', axis=1,inplace=True)
general_data

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeID,JobLevel,JobRole,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager
0,51,No,Travel_Rarely,Sales,6,2,Life Sciences,1,1,Healthcare Representative,131160,1.0,11,0,1.0,6,1,0,0
1,31,Yes,Travel_Frequently,Research & Development,10,1,Life Sciences,2,1,Research Scientist,41890,0.0,23,1,6.0,3,5,1,4
2,32,No,Travel_Frequently,Research & Development,17,4,Other,3,4,Sales Executive,193280,1.0,15,3,5.0,2,5,0,3
3,38,No,Non-Travel,Research & Development,2,5,Life Sciences,4,3,Human Resources,83210,3.0,11,3,13.0,5,8,7,5
4,32,No,Travel_Rarely,Research & Development,10,1,Medical,5,1,Sales Executive,23420,4.0,12,2,9.0,2,6,0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4405,42,No,Travel_Rarely,Research & Development,5,4,Medical,4406,1,Research Scientist,60290,3.0,17,1,10.0,5,3,0,2
4406,29,No,Travel_Rarely,Research & Development,2,4,Medical,4407,1,Laboratory Technician,26790,2.0,15,0,10.0,2,3,0,2
4407,25,No,Travel_Rarely,Research & Development,25,2,Life Sciences,4408,2,Sales Executive,37020,0.0,20,0,5.0,4,4,1,2
4408,42,No,Travel_Rarely,Sales,18,2,Medical,4409,1,Laboratory Technician,23980,0.0,14,1,10.0,2,9,7,8


In [8]:
general_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4410 non-null   int64  
 1   Attrition                4410 non-null   object 
 2   BusinessTravel           4410 non-null   object 
 3   Department               4410 non-null   object 
 4   DistanceFromHome         4410 non-null   int64  
 5   Education                4410 non-null   int64  
 6   EducationField           4410 non-null   object 
 7   EmployeeID               4410 non-null   int64  
 8   JobLevel                 4410 non-null   int64  
 9   JobRole                  4410 non-null   object 
 10  MonthlyIncome            4410 non-null   int64  
 11  NumCompaniesWorked       4391 non-null   float64
 12  PercentSalaryHike        4410 non-null   int64  
 13  StockOptionLevel         4410 non-null   int64  
 14  TotalWorkingYears       

### Treating missing values

In [15]:
def DisplayMissingValues(data, data_name):
    
    df = pd.DataFrame(data)

    missing_counts = df.isnull().sum()
    missing_columns = missing_counts[missing_counts > 0]
    
    if missing_columns.empty:
        print("[0] No missing values in:",data_name,"\n")
        return
    print(len(missing_columns)," attributes with missing values in: ",data_name)
    print(missing_columns.to_string(),"\n")

DisplayMissingValues(general_data, "general data")

2  attributes with missing values in:  general data
NumCompaniesWorked    19
TotalWorkingYears      9 



### Treating categorical values

### Survey Data

In [17]:
manager_survey_data

Unnamed: 0,EmployeeID,JobInvolvement,PerformanceRating
0,1,3,3
1,2,2,4
2,3,3,3
3,4,2,3
4,5,3,3
...,...,...,...
4405,4406,3,3
4406,4407,2,3
4407,4408,3,4
4408,4409,2,3


In [18]:
employee_survey_data

Unnamed: 0,EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance
0,1,3.0,4.0,2.0
1,2,3.0,2.0,4.0
2,3,2.0,2.0,1.0
3,4,4.0,4.0,3.0
4,5,4.0,1.0,3.0
...,...,...,...,...
4405,4406,4.0,1.0,3.0
4406,4407,4.0,4.0,3.0
4407,4408,1.0,3.0,3.0
4408,4409,4.0,1.0,3.0


In [19]:
in_time

Unnamed: 0.1,Unnamed: 0,2015-01-01,2015-01-02,2015-01-05,2015-01-06,2015-01-07,2015-01-08,2015-01-09,2015-01-12,2015-01-13,...,2015-12-18,2015-12-21,2015-12-22,2015-12-23,2015-12-24,2015-12-25,2015-12-28,2015-12-29,2015-12-30,2015-12-31
0,1,,2015-01-02 09:43:45,2015-01-05 10:08:48,2015-01-06 09:54:26,2015-01-07 09:34:31,2015-01-08 09:51:09,2015-01-09 10:09:25,2015-01-12 09:42:53,2015-01-13 10:13:06,...,,2015-12-21 09:55:29,2015-12-22 10:04:06,2015-12-23 10:14:27,2015-12-24 10:11:35,,2015-12-28 10:13:41,2015-12-29 10:03:36,2015-12-30 09:54:12,2015-12-31 10:12:44
1,2,,2015-01-02 10:15:44,2015-01-05 10:21:05,,2015-01-07 09:45:17,2015-01-08 10:09:04,2015-01-09 09:43:26,2015-01-12 10:00:07,2015-01-13 10:43:29,...,2015-12-18 10:37:17,2015-12-21 09:49:02,2015-12-22 10:33:51,2015-12-23 10:12:10,,,2015-12-28 09:31:45,2015-12-29 09:55:49,2015-12-30 10:32:25,2015-12-31 09:27:20
2,3,,2015-01-02 10:17:41,2015-01-05 09:50:50,2015-01-06 10:14:13,2015-01-07 09:47:27,2015-01-08 10:03:40,2015-01-09 10:05:49,2015-01-12 10:03:47,2015-01-13 10:21:26,...,2015-12-18 10:15:14,2015-12-21 10:10:28,2015-12-22 09:44:44,2015-12-23 10:15:54,2015-12-24 10:07:26,,2015-12-28 09:42:05,2015-12-29 09:43:36,2015-12-30 09:34:05,2015-12-31 10:28:39
3,4,,2015-01-02 10:05:06,2015-01-05 09:56:32,2015-01-06 10:11:07,2015-01-07 09:37:30,2015-01-08 10:02:08,2015-01-09 10:08:12,2015-01-12 10:13:42,2015-01-13 09:53:22,...,2015-12-18 10:17:38,2015-12-21 09:58:21,2015-12-22 10:04:25,2015-12-23 10:11:46,2015-12-24 09:43:15,,2015-12-28 09:52:44,2015-12-29 09:33:16,2015-12-30 10:18:12,2015-12-31 10:01:15
4,5,,2015-01-02 10:28:17,2015-01-05 09:49:58,2015-01-06 09:45:28,2015-01-07 09:49:37,2015-01-08 10:19:44,2015-01-09 10:00:50,2015-01-12 10:29:27,2015-01-13 09:59:32,...,2015-12-18 09:58:35,2015-12-21 10:03:41,2015-12-22 10:10:30,2015-12-23 10:13:36,2015-12-24 09:44:24,,2015-12-28 10:05:15,2015-12-29 10:30:53,2015-12-30 09:18:21,2015-12-31 09:41:09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4405,4406,,2015-01-02 09:20:32,2015-01-05 10:17:53,2015-01-06 10:26:51,2015-01-07 10:06:58,2015-01-08 09:45:06,2015-01-09 09:49:24,2015-01-12 09:37:10,2015-01-13 09:25:02,...,2015-12-18 10:01:06,2015-12-21 10:25:25,2015-12-22 10:16:11,2015-12-23 10:04:40,2015-12-24 09:45:40,,2015-12-28 10:15:39,2015-12-29 10:10:09,2015-12-30 09:28:19,2015-12-31 10:00:12
4406,4407,,2015-01-02 10:03:41,,2015-01-06 09:44:00,2015-01-07 09:42:10,2015-01-08 10:00:57,2015-01-09 09:44:04,2015-01-12 10:07:32,2015-01-13 10:05:11,...,2015-12-18 09:27:32,2015-12-21 09:41:24,2015-12-22 09:50:30,2015-12-23 10:32:21,2015-12-24 09:47:41,,2015-12-28 09:54:23,2015-12-29 10:13:32,2015-12-30 10:21:09,2015-12-31 10:09:48
4407,4408,,2015-01-02 10:01:01,2015-01-05 09:33:00,2015-01-06 09:49:17,2015-01-07 10:28:12,2015-01-08 09:47:38,2015-01-09 10:01:03,2015-01-12 09:49:12,2015-01-13 09:47:10,...,2015-12-18 10:00:57,2015-12-21 09:51:07,2015-12-22 10:02:10,2015-12-23 09:58:29,2015-12-24 09:56:05,,2015-12-28 09:59:24,,2015-12-30 10:02:36,2015-12-31 10:03:30
4408,4409,,2015-01-02 10:17:05,2015-01-05 10:02:27,2015-01-06 10:12:50,2015-01-07 10:12:31,2015-01-08 09:42:57,,2015-01-12 10:00:38,2015-01-13 09:48:03,...,2015-12-18 09:54:33,2015-12-21 10:01:08,2015-12-22 10:10:19,2015-12-23 09:42:30,2015-12-24 09:56:05,,2015-12-28 09:55:25,2015-12-29 09:54:42,2015-12-30 10:15:44,2015-12-31 09:56:47


In [20]:
out_time

Unnamed: 0.1,Unnamed: 0,2015-01-01,2015-01-02,2015-01-05,2015-01-06,2015-01-07,2015-01-08,2015-01-09,2015-01-12,2015-01-13,...,2015-12-18,2015-12-21,2015-12-22,2015-12-23,2015-12-24,2015-12-25,2015-12-28,2015-12-29,2015-12-30,2015-12-31
0,1,,2015-01-02 16:56:15,2015-01-05 17:20:11,2015-01-06 17:19:05,2015-01-07 16:34:55,2015-01-08 17:08:32,2015-01-09 17:38:29,2015-01-12 16:58:39,2015-01-13 18:02:58,...,,2015-12-21 17:15:50,2015-12-22 17:27:51,2015-12-23 16:44:44,2015-12-24 17:47:22,,2015-12-28 18:00:07,2015-12-29 17:22:30,2015-12-30 17:40:56,2015-12-31 17:17:33
1,2,,2015-01-02 18:22:17,2015-01-05 17:48:22,,2015-01-07 17:09:06,2015-01-08 17:34:04,2015-01-09 16:52:29,2015-01-12 17:36:48,2015-01-13 18:00:13,...,2015-12-18 18:31:28,2015-12-21 17:34:16,2015-12-22 18:16:35,2015-12-23 17:38:18,,,2015-12-28 17:08:38,2015-12-29 17:54:46,2015-12-30 18:31:35,2015-12-31 17:40:58
2,3,,2015-01-02 16:59:14,2015-01-05 17:06:46,2015-01-06 16:38:32,2015-01-07 16:33:21,2015-01-08 17:24:22,2015-01-09 16:57:30,2015-01-12 17:28:54,2015-01-13 17:21:25,...,2015-12-18 17:02:23,2015-12-21 17:20:17,2015-12-22 16:32:50,2015-12-23 16:59:43,2015-12-24 16:58:25,,2015-12-28 16:43:31,2015-12-29 17:09:56,2015-12-30 17:06:25,2015-12-31 17:15:50
3,4,,2015-01-02 17:25:24,2015-01-05 17:14:03,2015-01-06 17:07:42,2015-01-07 16:32:40,2015-01-08 16:53:11,2015-01-09 17:19:47,2015-01-12 17:13:37,2015-01-13 17:11:45,...,2015-12-18 17:55:23,2015-12-21 16:49:09,2015-12-22 17:24:00,2015-12-23 17:36:35,2015-12-24 16:48:21,,2015-12-28 17:19:34,2015-12-29 16:58:16,2015-12-30 17:40:11,2015-12-31 17:09:14
4,5,,2015-01-02 18:31:37,2015-01-05 17:49:15,2015-01-06 17:26:25,2015-01-07 17:37:59,2015-01-08 17:59:28,2015-01-09 17:44:08,2015-01-12 18:51:21,2015-01-13 18:14:58,...,2015-12-18 17:52:48,2015-12-21 17:43:35,2015-12-22 18:07:57,2015-12-23 18:00:49,2015-12-24 17:59:22,,2015-12-28 17:44:59,2015-12-29 18:47:00,2015-12-30 17:15:33,2015-12-31 17:42:14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4405,4406,,2015-01-02 17:27:37,2015-01-05 19:08:20,2015-01-06 18:50:49,2015-01-07 18:57:40,2015-01-08 17:58:31,2015-01-09 18:06:15,2015-01-12 17:58:48,2015-01-13 18:10:35,...,2015-12-18 18:06:05,2015-12-21 18:35:06,2015-12-22 18:33:44,2015-12-23 18:40:56,2015-12-24 18:21:29,,2015-12-28 18:44:35,2015-12-29 19:14:38,2015-12-30 18:24:56,2015-12-31 18:30:41
4406,4407,,2015-01-02 16:19:01,,2015-01-06 15:07:37,2015-01-07 15:25:50,2015-01-08 16:12:33,2015-01-09 15:26:56,2015-01-12 16:10:42,2015-01-13 16:22:43,...,2015-12-18 15:23:02,2015-12-21 15:31:14,2015-12-22 15:45:59,2015-12-23 16:38:59,2015-12-24 15:47:15,,2015-12-28 15:34:34,2015-12-29 16:47:02,2015-12-30 16:03:17,2015-12-31 16:18:39
4407,4408,,2015-01-02 17:17:35,2015-01-05 17:08:07,2015-01-06 17:27:46,2015-01-07 18:27:22,2015-01-08 17:05:25,2015-01-09 17:02:57,2015-01-12 17:35:45,2015-01-13 17:15:52,...,2015-12-18 17:48:05,2015-12-21 17:43:05,2015-12-22 17:47:23,2015-12-23 17:43:37,2015-12-24 17:20:12,,2015-12-28 17:43:28,,2015-12-30 17:48:14,2015-12-31 18:08:55
4408,4409,,2015-01-02 19:48:37,2015-01-05 19:37:40,2015-01-06 20:00:08,2015-01-07 19:35:59,2015-01-08 18:55:13,,2015-01-12 19:18:17,2015-01-13 19:24:02,...,2015-12-18 19:52:44,2015-12-21 19:21:35,2015-12-22 19:32:40,2015-12-23 18:57:00,2015-12-24 19:37:57,,2015-12-28 19:58:36,2015-12-29 18:55:26,2015-12-30 19:37:22,2015-12-31 19:33:45


## Treatment of Missing values

In [None]:
#os.chdir("./Datasets")

# Code to detect missing values
def DisplayMissingValues(data,data_name,display=False):
    
    #print("Missing values from:",data_name)
    df = pd.DataFrame(data)
    threshold = df.shape[0]

    filtered_df = df[df.columns[df.notnull().sum() < threshold]]
    
    if display:
        filtered_df.info()
        print()
    print("[",len(filtered_df.columns),"] attributes with missing values in:",data_name,"\n")


### General Data

In [None]:

DisplayMissingValues(general_data,"general_data",display=True)

total_employees = len(general_data)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   NumCompaniesWorked  4391 non-null   float64
 1   TotalWorkingYears   4401 non-null   float64
dtypes: float64(2)
memory usage: 69.0 KB

[ 2 ] attributes with missing values in: general_data 



2 attributes with missing values

### Employee Survey Data

In [None]:

DisplayMissingValues(employee_survey_data,"employee_survey_data",display=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   EnvironmentSatisfaction  4385 non-null   float64
 1   JobSatisfaction          4390 non-null   float64
 2   WorkLifeBalance          4372 non-null   float64
dtypes: float64(3)
memory usage: 103.5 KB

[ 3 ] attributes with missing values in: employee_survey_data 



### Manager Survey Data

In [None]:

DisplayMissingValues(manager_survey_data,"manager_survey_data",display=False)

[ 0 ] attributes with missing values in: manager_survey_data 



### in_time and out_time

In [None]:

DisplayMissingValues(in_time,"in_time")


DisplayMissingValues(out_time,"out_time")

[ 261 ] attributes with missing values in: in_time 

[ 261 ] attributes with missing values in: out_time 



Too many attributes with missing values! We need a different approach for these two dataframes...

The following steps are to be followed:
1. Remove entrances/days with all N/A values for in or out time.
2. Merge both datasets into one by two key attributes: EmployeeID and Date.
3. Compute average hours of work per day per employee as a new attribute.
4. Compute deviation as a new attribute, to measure how dynamic the employee's routine is.

In [51]:
nonsence_entries_in = pd.DataFrame(in_time.loc[:, in_time.isnull().sum() == 4410])
print(nonsence_entries_in.shape[1]," days with all null values for time of entrance.")
nonsence_entries_in

12  days with all null values for time of entrance.


Unnamed: 0,2015-01-01,2015-01-14,2015-01-26,2015-03-05,2015-05-01,2015-07-17,2015-09-17,2015-10-02,2015-11-09,2015-11-10,2015-11-11,2015-12-25
0,,,,,,,,,,,,
1,,,,,,,,,,,,
2,,,,,,,,,,,,
3,,,,,,,,,,,,
4,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
4405,,,,,,,,,,,,
4406,,,,,,,,,,,,
4407,,,,,,,,,,,,
4408,,,,,,,,,,,,


In [52]:
nonsence_entries_out = pd.DataFrame(out_time.loc[:, in_time.isnull().sum() == 4410])
print(nonsence_entries_out.shape[1]," days with all null values for time of exit.")
nonsence_entries_out

12  days with all null values for time of exit.


Unnamed: 0,2015-01-01,2015-01-14,2015-01-26,2015-03-05,2015-05-01,2015-07-17,2015-09-17,2015-10-02,2015-11-09,2015-11-10,2015-11-11,2015-12-25
0,,,,,,,,,,,,
1,,,,,,,,,,,,
2,,,,,,,,,,,,
3,,,,,,,,,,,,
4,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
4405,,,,,,,,,,,,
4406,,,,,,,,,,,,
4407,,,,,,,,,,,,
4408,,,,,,,,,,,,


In [53]:
in_time = in_time.drop(columns=nonsence_entries_in.columns)
out_time = out_time.drop(columns=nonsence_entries_out.columns)

In [54]:
DisplayMissingValues(in_time,"in_time")
DisplayMissingValues(out_time,"out_time")

[ 249 ] attributes with missing values in: in_time 

[ 249 ] attributes with missing values in: out_time 



In [55]:
# verifying all columns in in_time exist in out_time and viceversa
# (homemade version)

in_t = in_time.columns
out_t = out_time.columns

for date_in_t in in_t:
    if date_in_t not in out_t:
        print(date_in_t)

for date_out_t in in_t:
    if date_out_t not in in_t:
        print(date_out_t)

By retiring the non-workable days, we reduced the columns with missing values by 12. But still, too many days with missing values.

In [94]:
def GetTime(datetime):
    if pd.notnull(datetime):
        return datetime[11:16]
    return None

def TimeToMinutes(time):
    if pd.notnull(time):
        return int(time[0:2])*60 + int(time[4:6])
    return 0

In [None]:
working_hours = pd.DataFrame(columns=["EmployeeID","AvgWorkingHours","RecentInasistances"])
for id in range (1,total_employees):
    # para calcular el promedio: procuro que todo este ordenado por id del mas bajo al mas alto
    # voy contando las inasistencias
    in_time[1:][id].sum()
    out_time[1:][id]
    avg_working_hours = round(round((TimeToMinutes(out_t)-TimeToMinutes(in_t))/60,2).sum()/emp["WORKED_HOURS"].ne(0).sum(),2)
    total_inasistances = emp["WORKED_HOURS"].eq(0).sum()
    working_hours.loc[len(working_hours)] = (id,avg_working_hours,total_inasistances)

In [None]:
# don't run this again. It took me ONE HOUR.
def EmployeesWorkingHoursInfo_DF(in_time,out_time,ID):
    
    wh_info = pd.DataFrame(columns=['DATE','IN_TIME','OUT_TIME','WORKED_HOURS'])

    employee_in_time_data = in_time[in_time['Unnamed: 0']==ID]      # all entry times for given employee
    employee_out_time_data = out_time[out_time['Unnamed: 0']==ID]   # all exit times for given employee

    for i in range (1,len(employee_in_time_data.columns)):  # 250 iterations: all days employee worked
        in_t    = GetTime(employee_in_time_data[employee_in_time_data.columns[i]][ID-1])
        out_t   = GetTime(employee_out_time_data[employee_out_time_data.columns[i]][ID-1])
        wh_info.loc[len(wh_info)] = (employee_in_time_data.columns[i],in_t,out_t,round((TimeToMinutes(out_t)-TimeToMinutes(in_t))/60,2))

    return wh_info

info_emp_3 = EmployeesWorkingHoursInfo_DF(in_time,out_time,4400)
print("TOTAL WORKABLE DAYS:               ",max(info_emp_3.count()))
print("AVERAGE WORKING HOURS PER WORK DAY:",round(info_emp_3["WORKED_HOURS"].sum()/info_emp_3["WORKED_HOURS"].ne(0).sum(),2))
print("AMOUNT OF RECENT INASSISTANCES:    ",info_emp_3["WORKED_HOURS"].eq(0).sum())

# posibles nuevos atributos: avg worked hours, missed days, standard deviation as in punctuality at entry or outing?
working_hours = pd.DataFrame(columns=["EmployeeID","AvgWorkingHours","RecentInasistances"])
for id in range (1,total_employees):
    emp = EmployeesWorkingHoursInfo_DF(in_time,out_time,id)
    avg_working_hours = round(emp["WORKED_HOURS"].sum()/emp["WORKED_HOURS"].ne(0).sum(),2)
    total_inasistances = emp["WORKED_HOURS"].eq(0).sum()
    working_hours.loc[len(working_hours)] = (id,avg_working_hours,total_inasistances)


TOTAL WORKABLE DAYS:                249
AVERAGE WORKING HOURS PER WORK DAY: 6.36
AMOUNT OF RECENT INASSISTANCES:     22


No missing values

In [183]:
emp = EmployeesWorkingHoursInfo_DF(in_time,out_time,4410)
avg_working_hours = round(emp["WORKED_HOURS"].sum()/emp["WORKED_HOURS"].ne(0).sum(),2)
total_inasistances = emp["WORKED_HOURS"].eq(0).sum()
working_hours.loc[len(working_hours)] = (4410,avg_working_hours,total_inasistances)

In [192]:
copy_working_hours = working_hours.copy()
copy_working_hours["EmployeeID"] = copy_working_hours["EmployeeID"].astype(int)
copy_working_hours["RecentInasistances"] = copy_working_hours["RecentInasistances"].astype(int)
copy_working_hours

Unnamed: 0,EmployeeID,AvgWorkingHours,RecentInasistances
0,1,7.41,17
1,2,7.75,13
2,3,6.99,7
3,4,7.22,14
4,5,8.04,4
...,...,...,...
4405,4406,8.50,6
4406,4407,6.11,8
4407,4408,7.72,18
4411,4409,9.46,8


Merge all datasets into one

In [193]:
merge_key = "EmployeeID"
merged_df = pd.read_csv("general_data.csv")
merged_df = pd.merge(merged_df, pd.read_csv("employee_survey_data.csv"), on=merge_key, how="outer")
merged_df = pd.merge(merged_df, pd.read_csv("manager_survey_data.csv"), on=merge_key, how="outer")
merged_df = pd.merge(merged_df,copy_working_hours, on=merge_key, how="outer")

columns_order = ["EmployeeID"] + [col for col in merged_df.columns if col != "EmployeeID"]
merged_df = merged_df[columns_order]
#sorted_merged_df = merged_df.sort_values(by="EmployeeID")
merged_df.to_csv("..\merged_data.csv", index=True)
merged_df


Unnamed: 0,EmployeeID,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,Gender,...,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance,JobInvolvement,PerformanceRating,AvgWorkingHours,RecentInasistances
0,1,51,No,Travel_Rarely,Sales,6,2,Life Sciences,1,Female,...,1,0,0,3.0,4.0,2.0,3,3,7.41,17
1,2,31,Yes,Travel_Frequently,Research & Development,10,1,Life Sciences,1,Female,...,5,1,4,3.0,2.0,4.0,2,4,7.75,13
2,3,32,No,Travel_Frequently,Research & Development,17,4,Other,1,Male,...,5,0,3,2.0,2.0,1.0,3,3,6.99,7
3,4,38,No,Non-Travel,Research & Development,2,5,Life Sciences,1,Male,...,8,7,5,4.0,4.0,3.0,2,3,7.22,14
4,5,32,No,Travel_Rarely,Research & Development,10,1,Medical,1,Male,...,6,0,4,4.0,1.0,3.0,3,3,8.04,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4405,4406,42,No,Travel_Rarely,Research & Development,5,4,Medical,1,Female,...,3,0,2,4.0,1.0,3.0,3,3,8.50,6
4406,4407,29,No,Travel_Rarely,Research & Development,2,4,Medical,1,Male,...,3,0,2,4.0,4.0,3.0,2,3,6.11,8
4407,4408,25,No,Travel_Rarely,Research & Development,25,2,Life Sciences,1,Male,...,4,1,2,1.0,3.0,3.0,3,4,7.72,18
4408,4409,42,No,Travel_Rarely,Sales,18,2,Medical,1,Male,...,9,7,8,4.0,1.0,3.0,2,3,9.46,8
