# Preppin' Data Challenge
## 2023: Week 33 - HR Month - Combinations
## Created by: Ghafar Shah

- Challenge: https://preppindata.blogspot.com/2023/08/2023-week-33-hr-month-combinations.html

### About: 
The HR analyst used the data from last week to build a dashboard. The DC managers found it very useful, and they requested some new features. First, we need to add the employee’s tenure (how many months and years they have worked at that particular DC) to the dataset. 

Second, the HR analyst would like to keep the reports consistent from DC to DC, so they requested an aggregated dataset that fills in zeroes if a DC does not have any employees in a specific demographic group each month. For example, DC #1 did not have any employees in the 60-64 years age group for the month of February 2019, so we need to add a row for that combination, with 0 employees.
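One way to picture the zero-fill requirement is to build the full grid of combinations and reindex onto it. The sketch below uses made-up counts and an `ee_count` column on toy data; it is an alternative illustration of the idea, not the approach taken later in this notebook.

```python
import pandas as pd

# Hypothetical toy counts: DC 1 has nobody aged 60-64 in February 2019
counts = pd.DataFrame({
    "dc_nbr": [1, 1, 1],
    "month_end_date": ["2019-01-31", "2019-01-31", "2019-02-28"],
    "age_range": ["20-24", "60-64", "20-24"],
    "ee_count": [5, 1, 4],
})

keys = ["dc_nbr", "month_end_date", "age_range"]

# Build every DC x month x age_range combination seen in the data
full_index = pd.MultiIndex.from_product(
    [counts[col].unique() for col in keys],
    names=keys,
)

# Reindex onto the full grid; combinations with no employees become 0
filled = (
    counts.set_index(keys)
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(filled)
```

The February 60-64 combination now appears as its own row with a count of 0, which is the shape the DC managers asked for.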

In [None]:
# import libraries
import pandas as pd
import numpy as np

In [None]:
# read in the monthly employee table
ee_monthly_v3 = pd.read_csv('ee_monthly_v3.csv')

# preview dataframe
ee_monthly_v3

In [None]:
# read in demographics table
ee_dim_v3 = pd.read_csv('ee_dim_v3.csv')

# preview dataframe
ee_dim_v3

## Using the monthly table, calculate each employee’s tenure

For tenure_months, we want the number of full months between the employee's hire_date and either the month_end_date or the leave_date, whichever is earlier.
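A minimal sketch of one way to count full months, assuming a month only counts once the end date reaches the hire date's day of the month. The helper name `full_months` and the sample dates are hypothetical, and hires near a month end (for example on the 30th) may need a different convention.

```python
import pandas as pd

def full_months(start: pd.Series, end: pd.Series) -> pd.Series:
    """Count whole calendar months from start to end (hypothetical helper)."""
    months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)
    # A month only counts once the end day-of-month reaches the start day-of-month
    return months - (end.dt.day < start.dt.day).astype(int)

hire = pd.to_datetime(pd.Series(["2019-01-15", "2019-01-15", "2018-11-30"]))
end = pd.to_datetime(pd.Series(["2019-02-14", "2019-02-15", "2019-02-28"]))

tenure_months = full_months(hire, end)
print(tenure_months.tolist())  # [0, 1, 2]
```

The first employee is one day short of a full month, so the result is 0 rather than 1.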

In [None]:
# Convert the date columns from strings to datetime format
ee_monthly_v3['Leave_Date'] = pd.to_datetime(ee_monthly_v3['leave_date'], format='%d/%m/%Y')
ee_monthly_v3['Month_End_Date'] = pd.to_datetime(ee_monthly_v3['month_end_date'], format='%d/%m/%Y')
ee_monthly_v3['Hire_Date'] = pd.to_datetime(ee_monthly_v3['hire_date'], format='%d/%m/%Y')

In [None]:
# preview dataframe
ee_monthly_v3

In [None]:
# Drop the redundant string columns: hire_date, leave_date, month_end_date
ee_monthly_v3 = ee_monthly_v3.drop(columns=['hire_date', 'leave_date', 'month_end_date'])
ee_monthly_v3

In [None]:
# Create a new column for the earlier of Month_End_Date and Leave_Date (missing leave dates are skipped)
ee_monthly_v3['Soonest_Date_Month_End_or_Leave'] = ee_monthly_v3[['Month_End_Date', 'Leave_Date']].min(axis=1)
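A small illustration of why the row-wise `min` works when an employee has not left: pandas skips missing dates (`NaT`) by default, so the month end date is used. The dates below are made up.

```python
import pandas as pd

dates = pd.DataFrame({
    "Month_End_Date": pd.to_datetime(["2019-01-31", "2019-02-28"]),
    # The second employee is still employed, so there is no leave date
    "Leave_Date": pd.to_datetime(["2019-01-10", None]),
})

# Row-wise minimum; NaT values are ignored because skipna defaults to True
earliest = dates[["Month_End_Date", "Leave_Date"]].min(axis=1)
print(earliest)
```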

In [None]:
# preview dataframe
ee_monthly_v3

### Calculate Tenure Months

In [None]:
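# Count full calendar months from hire date to the earlier of month_end_date or leave_date.
# (Dividing by np.timedelta64(1, 'M') uses an average month length and may be rejected by newer pandas versions.)
start, end = ee_monthly_v3['Hire_Date'], ee_monthly_v3['Soonest_Date_Month_End_or_Leave']
ee_monthly_v3['Tenure_Months'] = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month) - (end.dt.day < start.dt.day).astype(int)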

In [None]:
# convert tenure months column to int
ee_monthly_v3['Tenure_Months'] = ee_monthly_v3['Tenure_Months'].astype(int)

In [None]:
# preview dataframe
ee_monthly_v3

### Calculate Tenure Years

In [None]:
# Calculate the number of full years from Tenure_Months
ee_monthly_v3['Tenure_Years'] = ee_monthly_v3['Tenure_Months'] // 12

In [None]:
# preview dataframe
ee_monthly_v3

### Join the ee_dim table to the monthly data on employee_id to get the employee attributes

In [None]:
# Join monthly employee table to ee_dim table on column employee_id
monthly_ee_dim_df = ee_monthly_v3.merge(ee_dim_v3, on='employee_id', how='left')
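A left join like this should keep exactly one output row per monthly row, provided each employee_id appears once in the dimension table. The sketch below, on made-up frames, shows how the `validate` argument of `merge` can check that assumption; the frame and column values are illustrative only.

```python
import pandas as pd

monthly = pd.DataFrame({"employee_id": [1, 1, 2], "dc_nbr": [10, 10, 20]})
dim = pd.DataFrame({"employee_id": [1, 2], "gender": ["F", "M"]})

# validate="many_to_one" raises MergeError if employee_id is duplicated in dim
joined = monthly.merge(dim, on="employee_id", how="left", validate="many_to_one")

# The row count is unchanged, so later counts will not be inflated
print(len(joined) == len(monthly))
```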

In [None]:
# preview dataframe
monthly_ee_dim_df

### Create a summary record for each DC/month/demographic:
- For each DC, month, and generation name, count the number of employees 
- Name the employee count “ee_count”
- Rename the generation_name column to “demographic_detail”
- Add a new column, demographic_type, which will have the same string in every row, “Generation Name”
- Repeat above steps for gender, nationality, age_range, and tenure_years
- Union all of the demographic summaries into one dataset
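The steps above could also be wrapped in one reusable function. This is a sketch on toy data, with a hypothetical helper name `summarise`; it covers the count, rename, label, and union steps only and does not add the zero-count rows.

```python
import pandas as pd

def summarise(df, column, label):
    """Count employees per DC, month and value of `column` (hypothetical helper)."""
    out = (
        df.groupby(["dc_nbr", "Month_End_Date", column])
        .size()
        .reset_index(name="ee_count")
        .rename(columns={column: "demographic_detail"})
    )
    out["demographic_type"] = label
    return out

toy = pd.DataFrame({
    "dc_nbr": [1, 1, 1],
    "Month_End_Date": pd.to_datetime(["2019-01-31"] * 3),
    "gender": ["F", "F", "M"],
    "generation_name": ["Gen X", "Millennial", "Millennial"],
})

# Union the per-demographic summaries into one long table
summary = pd.concat(
    [
        summarise(toy, "gender", "Gender"),
        summarise(toy, "generation_name", "Generation Name"),
    ],
    ignore_index=True,
)
print(summary)
```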

In [None]:
# Generation Name - group by three keys and then summarize each group
generation_summary_df = monthly_ee_dim_df.groupby(['dc_nbr', 'Month_End_Date', 'generation_name']).size().unstack(fill_value=0).reset_index()
generation_summary_df.rename(columns = {'generation_name':'demographic_detail'}, inplace = True)
generation_summary_df['demographic_type']='Generation Name' 


# Gender - group by three keys and then summarize each group
gender_summary_df = monthly_ee_dim_df.groupby(['dc_nbr', 'Month_End_Date', 'gender']).size().unstack(fill_value=0).reset_index()
gender_summary_df.rename(columns = {'gender':'demographic_detail'}, inplace = True)
gender_summary_df['demographic_type']='Gender'         

# Nationality - group by three keys and then summarize each group
nationality_summary_df = monthly_ee_dim_df.groupby(['dc_nbr', 'Month_End_Date', 'nationality']).size().unstack(fill_value=0).reset_index()
nationality_summary_df.rename(columns = {'nationality':'demographic_detail'}, inplace = True)
nationality_summary_df['demographic_type']='Nationality' 

# Tenure Years - group by three keys and then summarize each group
Tenure_Years_summary_df = monthly_ee_dim_df.groupby(['dc_nbr', 'Month_End_Date', 'Tenure_Years']).size().unstack(fill_value=0).reset_index()
Tenure_Years_summary_df.rename(columns = {'Tenure_Years':'demographic_detail'}, inplace = True)
Tenure_Years_summary_df['demographic_type']='Tenure' 

# Age Range - group by three keys and then summarize each group
Age_Range_summary_df = monthly_ee_dim_df.groupby(['dc_nbr', 'Month_End_Date', 'age_range']).size().unstack(fill_value=0).reset_index()
Age_Range_summary_df.rename(columns = {'age_range':'demographic_detail'}, inplace = True)
Age_Range_summary_df['demographic_type']='Age Range'

In [None]:
# 1) Preview generation dataframe
generation_summary_df

### We need to include the rows that have a zero employee count. The wide dataframe above already holds a 0 for each missing combination, so we'll use the pandas melt() function to reshape it from wide to long, giving one row per DC, month, and demographic value.
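A toy version of the wide-to-long round trip, using made-up rows: `unstack(fill_value=0)` creates the zero cells, and `melt` turns every cell, zeros included, into its own row.

```python
import pandas as pd

toy = pd.DataFrame({
    "dc_nbr": [1, 1, 1],
    "Month_End_Date": pd.to_datetime(["2019-01-31", "2019-01-31", "2019-02-28"]),
    "age_range": ["20-24", "60-64", "20-24"],
})

# Wide: one column per age_range, with 0 where a combination is absent
wide = (
    toy.groupby(["dc_nbr", "Month_End_Date", "age_range"])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)

# Long: one row per DC, month and age_range, zeros included
long_df = wide.melt(
    id_vars=["dc_nbr", "Month_End_Date"],
    var_name="demographic_detail",
    value_name="employee_count",
)
print(long_df)
```

Note that this only fills gaps for DC and month pairs that already have at least one employee in some group.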

In [None]:
# Melt to combine generation names into a single column
Generation_Name = generation_summary_df.melt(id_vars=['dc_nbr', 'Month_End_Date', 'demographic_type'], var_name='generation_name', value_name='employee_count')
Generation_Name.rename(columns={'generation_name': 'demographic_detail'}, inplace=True)
Generation_Name

In [None]:
# 2) Preview gender dataframe
gender_summary_df

In [None]:
# 2) Melt to combine genders into a single column
Gender = gender_summary_df.melt(id_vars=['dc_nbr', 'Month_End_Date', 'demographic_type'], var_name='gender', value_name='employee_count')
Gender.rename(columns={'gender': 'demographic_detail'}, inplace=True)
Gender

In [None]:
# 3) Preview nationality dataframe
nationality_summary_df

In [None]:
# 3) Melt to combine nationality into a single column
Nationality = nationality_summary_df.melt(id_vars=['dc_nbr', 'Month_End_Date', 'demographic_type'], var_name='nationality', value_name='employee_count')
Nationality.rename(columns={'nationality': 'demographic_detail'}, inplace=True)
Nationality

In [None]:
# 4) Preview Tenure Years dataframe
Tenure_Years_summary_df

In [None]:
# 4) Melt to combine tenure years into a single column
Tenure_Years = Tenure_Years_summary_df.melt(id_vars=['dc_nbr', 'Month_End_Date', 'demographic_type'], var_name='Tenure_Years', value_name='employee_count')
Tenure_Years.rename(columns={'Tenure_Years': 'demographic_detail'}, inplace=True)
Tenure_Years

In [None]:
# 5) Preview Age Range dataframe
Age_Range_summary_df

In [None]:
# 5) Melt to combine age ranges into a single column
Age_Range = Age_Range_summary_df.melt(id_vars=['dc_nbr', 'Month_End_Date', 'demographic_type'], var_name='age_range', value_name='employee_count')
Age_Range.rename(columns={'age_range': 'demographic_detail'}, inplace=True)
Age_Range

In [None]:
# Union the dataframe pivot summaries together
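# ignore_index=True gives the combined rows a fresh 0..n-1 index instead of repeating each summary's own index
combined_summaries = pd.concat([Age_Range, Tenure_Years, Nationality, Gender, Generation_Name], ignore_index=True)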

### Final DataFrame
Note: Uncomment code to export CSV

In [None]:
# Preview final dataframe
combined_summaries

# Uncomment the code below to export the data
#combined_summaries.to_csv('combined_summaries4.csv')

## Exploring Data Visualization with Python

Seaborn Annotated Heatmap Chart:
https://seaborn.pydata.org/examples/spreadsheet_heatmap.html

In [None]:
# import required libraries
import seaborn as sns
import matplotlib.pyplot as plt
import calendar # required to convert month numbers to month names

In [None]:
# Convert employee_count to integers
combined_summaries['employee_count'] = combined_summaries['employee_count'].astype(int)
combined_summaries

In [None]:
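# Pivot the dataframe to create a heatmap (months as rows, years as columns).
# Each employee is counted once per demographic_type, so keep a single type to avoid
# counting everyone five times (assumes every employee has a gender recorded).
headcount = combined_summaries[combined_summaries['demographic_type'] == 'Gender']
heatmap_data = headcount.pivot_table(index=headcount['Month_End_Date'].dt.month, 
                              columns=headcount['Month_End_Date'].dt.year, 
                              values='employee_count', 
                              aggfunc='sum').fillna(0).astype(int) # handles NaNs - replace with 0s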

# Preview heatmap
heatmap_data

In [None]:
# Set the heatmap plot size
plt.figure(figsize=(12, 8))

# Build out a heatmap with the employee counts in each cell
sns.set_theme()
ax = sns.heatmap(heatmap_data, annot=True, fmt="", linewidths=5, cmap='YlGnBu')

# Now, we'll format the month names from numbers to abbreviated names (e.g., 12 => Dec),
# using only the months present in the pivot so the labels always match the rows
month_names = [calendar.month_abbr[m] for m in heatmap_data.index]
ax.set_yticklabels(month_names, rotation=0)

# Increase font size of month labels on both x and y axes
ax.set_xticklabels(ax.get_xticklabels(), fontsize=16, rotation=45)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=16, rotation=0)

# Set the title and the x- and y-axis labels, with their font sizes
plt.title('Yearly and Monthly Workforce Trends', fontsize=22)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Month', fontsize=16)

# Save the visualization as an image; DPI stands for "Dots per Inch" and controls image quality
plt.savefig('heatmap_W332023.png', bbox_inches='tight', dpi=800, facecolor='white')

# Show the plot!
plt.show()