# In this notebook we will do a visual analysis of the **Absenteeism at Work** dataset.
- This notebook is motivated by the data analysis workshop by Gururajan Govindan, Shubhangi Hora, and Konstantin Palagachev.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

%matplotlib inline
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('Absenteeism_at_work.csv', sep=';')
data.head().T

# Descriptive Statistics
#### As a rule of thumb, it is good practice to start the analysis by displaying the shape, the missing values, the column types, and a description of the data.

- Using the `info()` function:
    1. The shape is 740 entries across 21 columns.
    2. There are no missing values, and the type of every column is listed.
    3. Judging from the head of the data and the `info()` output, several integer-coded columns need decoding: "Month of absence", "Day of the week", "Seasons", "Disciplinary failure", "Education", "Social drinker", and "Social smoker".
    4. Three of these columns are yes/no flags stored as 0/1 (Disciplinary failure, Social drinker, and Social smoker).


In [None]:
data.info()
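
The same facts that `info()` reports can also be pulled out as explicit values, which is handy when they need to be checked programmatically. The sketch below is only an illustration: it runs on a small made-up frame named `toy` (not the real dataset), and one missing value is inserted on purpose so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [33, 50, 38, 39],
    "Social drinker": [1, 1, 0, 1],
    "Absenteeism time in hours": [4, 0, 2, np.nan],  # NaN added on purpose
})

n_rows, n_cols = toy.shape             # number of entries and columns
missing_per_column = toy.isna().sum()  # count of missing values per column
column_types = toy.dtypes              # dtype of each column

print(n_rows, n_cols)
print(missing_per_column)
print(column_types)
```

On the real data, replacing `toy` with `data` should give the same shape and missing-value counts that `info()` shows.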

- Using the `describe()` function to summarize the numerical data:
    1. There are 28 distinct reasons for absence, coded 0-28 with code 20 never occurring.
    2. Ages range from 27 to 58.
    3. Comparing the mean and the median, Transportation expense and Service time look roughly symmetric and may be approximately normally distributed, whereas Absenteeism time in hours does not appear to be.
    4. The Hit target column appears to be the percentage of the target achieved.

In [None]:
data.describe().T
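
The mean-versus-median comparison above can be made explicit by placing the two statistics side by side and adding the sample skewness. This is a minimal sketch on invented numbers (the frame `toy` is hypothetical, not taken from the dataset). A mean close to the median only hints at symmetry; it does not by itself establish normality.

```python
import pandas as pd

# Invented values: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "Service time": [8, 10, 12, 14, 16],
    "Absenteeism time in hours": [0, 1, 2, 3, 64],
})

# One row per column, with the statistics to compare side by side.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # about 0 for symmetric data, positive for a right tail
})
print(summary)
```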

# Data Preprocessing
- Decoding and Categorizing Variables:
    1. Some columns need to be decoded as mentioned in the info phase.
    2. We can split the Reason for absence into disease and non-disease ("yes/no") using the International Classification of Diseases (ICD).
    3. We can also group the Body mass index into four categories (Underweight, Normal weight, Overweight, and Obese).
    4. We can categorize absences by the employee's age group (Early young adults, Mid young adults, Mid career professionals, Late career professionals, and Pre-retirement).

In [None]:
# Define decoding dictionaries
month_decoding = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December",
    0: "Unknown"
}

day_of_week_decoding = {
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday"
}

season_decoding = {
    1: "Spring",
    2: "Summer",
    3: "Fall",
    4: "Winter"
}

education_decoding = {
    1: "High School",
    2: "Graduate",
    3: "Postgraduate",
    4: "Master/PhD"
}

yes_no_decoding = {
    0: "No",
    1: "Yes"
}
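
As an aside, the month dictionary could be generated instead of typed by hand. The sketch below uses the standard-library `calendar` module and assumes the default English ("C") locale, since `calendar.month_name` is locale-dependent. It is an alternative, not what the cell above does.

```python
import calendar

# Code 0 is kept as "Unknown", matching the hand-written dictionary.
month_decoding = {0: "Unknown"}

# calendar.month_name[1..12] holds the month names for the current locale.
month_decoding.update({i: calendar.month_name[i] for i in range(1, 13)})

print(month_decoding)
```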

In [None]:
preprocessed_data = data.copy()

# Map each encoded column to its decoding dictionary
decoding_dict = {
    "Month of absence": month_decoding,
    "Day of the week": day_of_week_decoding,
    "Seasons": season_decoding,
    "Education": education_decoding,
    "Disciplinary failure": yes_no_decoding,
    "Social drinker": yes_no_decoding,
    "Social smoker": yes_no_decoding
}

# Convert the numeric codes back to category labels with replace()
preprocessed_data.replace(decoding_dict, inplace=True)

# Display the first rows of the transformed data, transposed
preprocessed_data.head().T
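
To see how the nested dictionary passed to `replace()` behaves, here is a small sketch on an invented frame. The names `toy` and `toy_decoding` are hypothetical. The outer keys select columns and the inner dictionaries map old values to new ones, so a column that is not listed (here `Age`) is left untouched even when it contains the same numbers.

```python
import pandas as pd

# Invented rows; "Age" deliberately contains a 2 to show it is not decoded.
toy = pd.DataFrame({
    "Day of the week": [2, 3, 6],
    "Social drinker": [1, 0, 1],
    "Age": [2, 33, 50],
})

# Outer key = column name, inner dict = {code: label} for that column only.
toy_decoding = {
    "Day of the week": {2: "Monday", 3: "Tuesday", 4: "Wednesday",
                        5: "Thursday", 6: "Friday"},
    "Social drinker": {0: "No", 1: "Yes"},
}

decoded = toy.replace(toy_decoding)

# Sanity check: every value in a decoded column is one of the known labels.
all_decoded = decoded["Day of the week"].isin(
    toy_decoding["Day of the week"].values()
).all()

print(decoded)
print(all_decoded)
```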

- Creating the Disease column: True for disease-related reasons and False for all other reasons.

In [None]:
# Flag disease-related absences with a short lambda instead of a separate
# named function: reason codes 1-21 are ICD categories, 0 and 22-28 are not
preprocessed_data["Disease"] = preprocessed_data["Reason for absence"].apply(
    lambda val: 1 <= val <= 21
)
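
The same flag can be computed without `apply()`. The sketch below uses `Series.between`, which is inclusive at both ends by default, on a few invented reason codes. The series `reasons` is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Invented reason codes covering both sides of each boundary.
reasons = pd.Series([0, 1, 13, 21, 22, 28], name="Reason for absence")

# True for codes 1..21 (both ends included), False otherwise.
is_disease = reasons.between(1, 21)

print(is_disease.tolist())
```

On the real data this would read `preprocessed_data["Reason for absence"].between(1, 21)`.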

- Categorizing people by body mass index

In [None]:
def bod_ms_ind(val):
    if val < 18.5:
        return 'Underweight'
    elif val < 25:
        return 'Normal Weight'
    elif val < 30:
        return 'Overweight'
    else:
        return 'Obese'

preprocessed_data['BMI category'] = preprocessed_data['Body mass index'].apply(bod_ms_ind)
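
A binning like this can also be expressed with `pd.cut`. The sketch below reproduces the same thresholds on invented BMI values. `right=False` makes each interval closed on the left, which matches the `<` comparisons in the function above. Note that `pd.cut` returns a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Invented BMI values placed on and around each threshold.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 38.0])

# Intervals: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_category = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal Weight", "Overweight", "Obese"],
    right=False,
)

print(bmi_category.tolist())
```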


- Categorizing people by their age

In [None]:
def age_categorization(val):
    if val <= 30:
        return 'Early young adult'
    elif val <= 35:
        return 'Mid young adult'
    elif val <= 44:
        return 'Mid career professional'
    elif val <= 53:
        return 'Late career professional'
    else:
        return 'Pre-retirement'
    
preprocessed_data['Career level'] = preprocessed_data['Age'].apply(age_categorization)
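
The age bands can be written as a `pd.cut` call as well. In this sketch on invented ages, the default `right=True` keeps each upper bound inside its band, matching the `<=` comparisons above. Counting the result gives a quick check that every row received a label.

```python
import numpy as np
import pandas as pd

# Invented ages placed on and just after each upper bound.
ages = pd.Series([27, 30, 31, 35, 36, 44, 45, 53, 54, 58])

# Intervals: (-inf, 30], (30, 35], (35, 44], (44, 53], (53, inf)
career_level = pd.cut(
    ages,
    bins=[-np.inf, 30, 35, 44, 53, np.inf],
    labels=[
        "Early young adult",
        "Mid young adult",
        "Mid career professional",
        "Late career professional",
        "Pre-retirement",
    ],
)

# Number of rows per band; no NaN means every age fell into some band.
counts = career_level.value_counts(sort=False)
print(counts)
```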

# The dataset is now preprocessed, cleaned, and ready for the visual analysis.