# HR Analytics: Understanding Employee Attrition
Analyzing trends and patterns in employee attrition using HR analytics data from Kaggle. The data is on Atlas Lab employees.

## Objectives
- Understand the distribution of employee attributes (age, department, role, etc.)
- Explore patterns in employee attrition
- Analyze job satisfaction and performance rating metrics
- Identify factors correlated with attrition

## Load and Clean Data
- run data_cleaning.py
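
The cleaning script below converts `Attrition` and `OverTime` with `Series.map`. With a dictionary, `map` returns `NaN` for any value that is not a key, so a stray label such as `'yes'` would silently become a missing value. A minimal sketch of a guard, assuming a hypothetical helper named `map_yes_no` (it is not part of data_cleaning.py) and made-up input values:

```python
import pandas as pd

def map_yes_no(series: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0 and fail loudly on unexpected labels.

    Hypothetical helper for illustration; not part of data_cleaning.py.
    """
    mapped = series.map({'Yes': 1, 'No': 0})
    # Values that were present before mapping but are NaN afterwards
    # were neither 'Yes' nor 'No'.
    unexpected = series[mapped.isna() & series.notna()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unmapped labels in {series.name}: {list(unexpected)}")
    return mapped

# Made-up values, only to show the call
clean = map_yes_no(pd.Series(['Yes', 'No', 'No'], name='Attrition'))
print(clean.tolist())
```

Genuinely missing inputs stay missing; only labels outside the dictionary raise an error.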

In [None]:
import pandas as pd
from pandas.api.types import CategoricalDtype

def clean_data():
    # load the data
    df_employee = pd.read_csv('Employee.csv')
    df_performance = pd.read_csv('PerformanceRating.csv')

    # merge both datasets
    df = pd.merge(df_employee, df_performance, on='EmployeeID')
    
    # map binary columns
    df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
    df['OverTime'] = df['OverTime'].map({'Yes': 1, 'No': 0})

    # convert ordinal columns
    levels = CategoricalDtype(categories=[1,2,3,4,5], ordered=True)
    cols_to_convert = [
        'EnvironmentSatisfaction', 
        'JobSatisfaction', 'RelationshipSatisfaction', 
        'WorkLifeBalance', 'SelfRating', 
        'ManagerRating', 'Education',
    ]
    for col in cols_to_convert:
        df[col] = df[col].astype(levels)

    # convert categorical columns
    categorical_cols = [
        'BusinessTravel', 'Department', 'State', 'Ethnicity', 'EducationField', 
        'JobRole', 'MaritalStatus', 'StockOptionLevel', 'TrainingOpportunitiesWithinYear',
        'TrainingOpportunitiesTaken'
    ]
    for col in categorical_cols:
        df[col] = df[col].astype('category')

    # date columns
    df['HireDate'] = pd.to_datetime(df['HireDate'])
    df['ReviewDate'] = pd.to_datetime(df['ReviewDate'])

    # fix typos in EducationField
    df['EducationField'] = df['EducationField'].str.strip().str.title()
    df['EducationField'] = df['EducationField'].replace({
        'Marketing ': 'Marketing',
        'Marketting': 'Marketing'
    })

    # drop unused columns
    df = df.drop(['FirstName', 'LastName', 'PerformanceID'], axis=1)
    

    # check summary before saving
    print("\nSummary:")
    print(df.info())
    print("\nMissing Values:")
    print(df.isnull().sum())
    print("\nPreview:")
    print(df.head())

    # save cleaned data
    df.to_csv('cleaned_employee_data.csv', index=False)
    print("Data cleaned and saved as cleaned_employee_data.csv")

if __name__ == "__main__":
    clean_data()


## 1. Exploratory Data Analysis - Overview of Dataset
- create helper functions in univariate.py (imported by eda.py)
    - 3 functions, one per data type: ql_stats (qualitative), qn_stats (quantitative), dt_stats (datetime)
    - each prints summaries and stats and saves plots for a column
- run eda.py
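
`run_eda()` below repeats the type conversions from `clean_data()` because a CSV file stores only text: ordered categoricals and datetimes do not survive the save and reload. A small sketch of that round trip, using an in-memory buffer and made-up values rather than the real dataset:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# Toy frame with made-up values (not taken from the dataset)
df = pd.DataFrame({
    'JobSatisfaction': pd.Series([3, 5, 1]).astype(levels),
    'HireDate': pd.to_datetime(['2012-01-03', '2015-03-28', '2022-12-03']),
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The categorical and datetime dtypes are gone after the reload
dtypes_after_reload = reloaded.dtypes
print(dtypes_after_reload)

# Re-cast, as run_eda() does
reloaded['JobSatisfaction'] = reloaded['JobSatisfaction'].astype(levels)
reloaded['HireDate'] = pd.to_datetime(reloaded['HireDate'])
print(reloaded.dtypes)
```

A format that preserves dtypes, such as a pickle written with `DataFrame.to_pickle`, would avoid the second round of casting, at the cost of a file that is no longer human-readable.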

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis, normaltest, probplot
import os

sns.set(style='whitegrid', palette='hls')
plt.rcParams['figure.figsize'] = (10, 6)

# Function to print out summary for qualitative variables 
# and proportion along w/plots.
def ql_stats(df, col):
    """
    Prints the summary and percentage of each category 
    in the specified column w/count plots.

    Parameters: 
    df (DataFrame): The DataFrame containing the data.
    col (str): The name of the column to summarize.
    """
    # make directory for column
    dir = f"{col}"
    os.makedirs(dir, exist_ok=True)

    print(f"\n--- Categorical Summary: {col} ---")
    counts = df[col].value_counts(dropna=False)
    percentages = df[col].value_counts(normalize=True, dropna=False) * 100

    summary = pd.DataFrame({
        'Count': counts,
        'Percentages': percentages.round(2)
    })
    print(summary)
    print(f"Unique categories: {df[col].nunique(dropna=False)}")
    print(f"Most frequent: {df[col].mode()[0]}")

    plt.figure(figsize=(12, 6))
    sns.countplot(x=col, data=df)
    
    if col in ['Ethnicity', 'EducationField', 'JobRole']:
        # rotate to make space for x labels
        plt.xticks(rotation=45, ha='right')
        plt.subplots_adjust(bottom=0.3)
    else:
        plt.tight_layout()
    
    plt.title(f'Distribution of {col}')
    # save the plot to the new directory
    filename = os.path.join(dir, f'Distribution_of_{col}.jpg')
    plt.savefig(filename, dpi=300, bbox_inches='tight')

    plt.show()

# Same concept but for quantitative variables
def qn_stats(df, col):
    """
    Prints out summary for quantitative columns and creates 
    histogram + KDE plots.

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    col (str): The name of the numeric column to summarize.
    """
    # make directory for column
    dir = f"{col}"
    os.makedirs(dir, exist_ok=True)

    # summary
    print(f"\n--- Numerical Summary: {col} ---")
    desc = df[col].describe()
    print(desc)

    # Boxplot
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")
    plt.xlabel(col)
    filename = os.path.join(dir, f"Boxplot_of_{col}.jpg")
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    plt.show()
    print(f"Mode: {df[col].mode()[0]}")
    print(f"Skewness: {skew(df[col].dropna()):.2f}")
    print(f"Kurtosis: {kurtosis(df[col].dropna()):.2f}")
    
    # Histogram + KDE
    sns.histplot(df[col], kde=True, bins=20)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel('Frequency')
    filename = os.path.join(dir, f'Distribution_of_{col}.jpg')
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    plt.show()

    # Normality Test (drop NaNs first; normaltest returns nan if any are present)
    stat, p = normaltest(df[col].dropna())
    print("\nD'Agostino and Pearson Test:")
    print(f" Statistic = {stat:.4f}, p-value = {p:.4f}")
    if p < 0.05:
        print("Data is not normally distributed.")
    else:
        print("No evidence against normality (fail to reject the null hypothesis).")

    # QQ Plot
    plt.figure(figsize=(6,6))
    probplot(df[col].dropna(), dist='norm', plot=plt)
    plt.title(f'QQ-Plot of {col}')
    plt.xlabel("Theoretical Quantiles")
    plt.ylabel('Sample Quantiles')
    plt.grid(True)
    filename = os.path.join(dir, f'QQ-Plot_of_{col}.jpg')
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    plt.show()

# Same thing but with datetime columns
def dt_stats(df, col):
    """
    Summarizes datetime columns and plots time series trends.
    
    Parameters:
    df (DataFrame): The DataFrame containing the data.
    col (str): The name of the datetime column to summarize.
    """
    # make directory for column
    dir = f"{col}"
    os.makedirs(dir, exist_ok=True)

    print(f"\n--- Datetime Summary: {col} ---")
    print(f"Min date: {df[col].min()}")
    print(f"Max date: {df[col].max()}")
    print(f"Range: {df[col].max() - df[col].min()}")
    print(f"Median: {df[col].median()}")
    print(f"Mode: {df[col].mode()[0]}")
    print(f"Unique dates: {df[col].nunique(dropna=False)}")

    #Counts per year
    year_counts = df[col].dt.year.value_counts().sort_index()
    print("\nCounts per year:")
    print(year_counts)

    # Yearly counts (the 'Y' alias is deprecated since pandas 2.2; use 'YE' on newer versions)
    dt_yearly = df.set_index(col).resample('Y').size()
    plt.figure(figsize=(14, 7))
    dt_yearly.plot(marker='o')
    plt.title(f"Yearly Count of {col}")
    plt.xlabel("Year")
    plt.ylabel("Count")
    plt.grid(True)
    plt.tight_layout()
    filename=os.path.join(dir, f'Yearly_Count_of_{col}.jpg')
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    plt.show()

import pandas as pd
from pandas.api.types import CategoricalDtype
from univariate import ql_stats, qn_stats, dt_stats


def run_eda():
    # Load the dataset
    df = pd.read_csv('cleaned_employee_data.csv')

    # Drop Employee ID from the columns
    df = df.drop(columns=['EmployeeID'])

    # recast types for EDA
    # ordinal columns
    levels = CategoricalDtype(categories=[1,2,3,4,5], ordered=True)
    ordinal_cols = [
    'EnvironmentSatisfaction', 
    'JobSatisfaction', 
    'RelationshipSatisfaction', 
    'WorkLifeBalance', 
    'SelfRating', 
    'ManagerRating', 
    'Education']

    for col in ordinal_cols:
        df[col] = df[col].astype(levels)
    
    # retyping the non-ordinal category columns
    categorical_cols = ['BusinessTravel', 'Department', 'State', 'Ethnicity', 'EducationField', 
        'JobRole', 'MaritalStatus', 'StockOptionLevel', 'TrainingOpportunitiesWithinYear',
        'TrainingOpportunitiesTaken']
    
    for col in categorical_cols:
        df[col] = df[col].astype('category')

    # date columns
    df['HireDate'] = pd.to_datetime(df['HireDate'])
    df['ReviewDate'] = pd.to_datetime(df['ReviewDate'])

    # Temporarily cast the Attrition and OverTime variables to categories
    df['Attrition'] = df['Attrition'].astype('category')
    df['OverTime'] = df['OverTime'].astype('category')

    # Loop through all columns and run stats, tests, and plots
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            dt_stats(df, col)
        elif pd.api.types.is_numeric_dtype(df[col]):
            qn_stats(df, col)
        else:
            ql_stats(df, col)

if __name__ == "__main__":
    run_eda()

## Key Observations:
### Gender
- 46.07% of employees identify as female
- 44.24% of employees identify as male
- 8.78% of employees identify as non-binary
- 0.91% of employees did not state their gender identity 
- Slightly more employees identify as female than male, with a small proportion identifying as non-binary or not disclosing.

### Age
- The average age of employees is 30.78 years (SD = 7.93)
- The youngest employee is 18 years old, and the oldest is 51 years old.
- 25% of employees are 25 or younger, 50% are 28 or younger, and 75% are 37 or younger.
- The most common age (mode) is 25 years old. 
- The distribution of ages is right-skewed (skewness = 0.67), indicating a higher concentration of younger employees.
- The kurtosis (-0.72) indicates a platykurtic distribution with a flatter shape and lighter tails. 
- The histogram shows a clear peak at age 25 and a more even spread of ages after 30.
- Since the p-value rounds to 0.0000 (below the 0.05 significance level), we reject the null hypothesis and conclude that the distribution of age is not normal.
- The QQ-Plot shows deviations from the theoretical normal line.
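
The kurtosis values in these notes come from `scipy.stats.kurtosis`, which by default returns excess (Fisher) kurtosis, so a normal distribution scores 0 rather than 3. A short sketch on synthetic, evenly spread values (not the Age column) shows both conventions side by side:

```python
import numpy as np
from scipy.stats import kurtosis

# Synthetic, evenly spread values standing in for a flat distribution
values = np.arange(1, 1001)

excess = kurtosis(values)                 # Fisher definition: normal = 0
pearson = kurtosis(values, fisher=False)  # Pearson definition: normal = 3

print(f"Excess kurtosis:  {excess:.2f}")
print(f"Pearson kurtosis: {pearson:.2f}")
```

The two numbers differ by exactly 3, so a negative value from the default call already means "flatter than normal" (platykurtic).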

### BusinessTravel
- 70.65% of employees engage in some travel, which is the most common category.
- 20.24% of employees are frequent travelers.
- 9.11% of employees do not travel at all.
- Overall, about 91% of employees (70.65% + 20.24%) have some travel requirement as part of their role.

### Department
- 63.45% of the company work in the Technology department
- 32.03% of the company work in the Sales department
- 4.52% of the company work in the Human Resources department
- Overall, more than half of the employees work in Technology.

### DistanceFromHome (KM)
- The average distance from home for an employee is 22.3 km.
- The standard deviation is 12.90 km.
- The shortest commute is 1 km, while the longest is 45 km.
- 25% of the company lives within 12 km of work, 50% within 22 km, and 75% within 33 km.
- The most common commute distance is 14 km. 
- The skewness (0.07) is close to 0 which indicates a nearly symmetrical distribution.
- The kurtosis (-1.16) suggests a platykurtic distribution with lighter tails and fewer extreme values.
- The distribution plot shows a very flat peak with no extreme outliers.
- Overall, employee commute distances vary widely but are symmetrically distributed without strong outliers.
- Since the p-value rounds to 0.0000 (below the 0.05 significance level), we reject the null hypothesis and conclude that the distribution of distance from home is not normal.
- The QQ-plot confirms the distribution deviates from a normal distribution.
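
The normality test p-values in these notes show up as 0 because the helper prints them with four decimals (`{p:.4f}`); the underlying value is very small rather than exactly zero. A sketch with synthetic commute-like values (an evenly spread 1 to 45 km range, not the real column) illustrates the effect:

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic values: each distance from 1 to 45 km repeated 20 times
values = np.tile(np.arange(1, 46), 20)

stat, p = normaltest(values)

print(f"p-value with 4 decimals:        {p:.4f}")  # how the helper prints it
print(f"p-value in scientific notation: {p:.2e}")
```

Printing with `:.2e` keeps the magnitude visible, which can be more informative than a string of zeros when many variables are compared.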

### State
- 60.35% of employees live in California.
- 27.57% of employees live in New York.
- 12.07% of employees live in Illinois. 
- The majority of employees residing in California suggests that the company is likely based there, with smaller offices in New York and Illinois.

### Ethnicity
- 49.61% of employees identify as White.
- 16.38% of employees identify as Mixed.
- 15.96% of employees identify as Black.
- 10.18% of employees identify as Asian.
- 4.23% of employees identify as American Indian or Alaska Native.
- 2.34% of employees identify as Native Hawaiian or Other Pacific Islander.
- 1.3% of employees identify as Other.
- Just under half of the company identifies as White.


### Education
- 39.26% of employees have a Bachelor's degree.
- 24.94% of employees have a Master's degree.
- 20.14% of employees have a GED (General Educational Development certificate).
- 12.52% of employees have no formal qualifications.
- 3.15% of employees have a Doctorate degree.
- The company hires mostly employees with secondary or higher education, with Bachelor's and Master's degrees being the most common.
- However, since more employees hold a GED or no formal qualifications than those with a Doctorate, advanced degrees like Doctorates may not be a significant factor in hiring decisions.

### Education Field
- 29.71% of employees majored in Computer Science.
- 23% of employees majored in Information Systems.
- 22.98% of employees majored in Marketing.
- 6.78% of employees majored in Business Studies.
- 6.65% of employees majored in Economics.
- 5.50% of employees majored in other studies.
- 3.4% of employees obtained a technical degree.
- 1.98% of employees majored in human resources. 
- It would be interesting to compare these Education Fields with the Department column to see how closely aligned employees' majors are to their departments.
- Analyzing Education Field against attrition might be less informative, since the majority of employees come from just a few majors (Computer Science, Information Systems, and Marketing).

### Job Role
- The 3 job roles with the most employees (from largest to smallest) are: sales executive (22.67%), data scientist (20.27%), and software engineer (19.88%).
- There are 10 additional job roles each comprising less than 9% of employees.
- This distribution aligns well with the most common education fields: Computer Science, Information Systems, and Marketing.
- This suggests the company may prioritize hiring candidates whose educational backgrounds closely match their job roles, leading to a strong alignment between qualifications and responsibilities.

### Marital Status
- 42.67% of employees are married
- 37.83% of employees are single
- 19.50% of employees are divorced
- The number of married and single employees is fairly close, with only a difference of 324 employees.
- Given that 50% of employees are 28 or younger and 75% are 37 or younger (per the Age summary), it makes sense that many are either single or newly married, reflecting typical young adult life stages.


### Salary
- The average annual salary of employees is $111,061.75.
- The standard deviation is $98,268.10, indicating high variability in salaries.
- The lowest salary recorded is $20,387, while the highest is $547,204 per year.
- The 25th percentile salary is $45,276, the median (50th percentile) is $75,667, and the 75th percentile is $127,427.
- The boxplot shows most salaries are concentrated on the lower end, with a significant number of outliers on the higher end, indicating that a small group of employees earns substantially more than the majority.
- The most common salary (mode) is $107,863.
- Positive skewness (1.91) indicates the distribution is right-tailed, with more extreme high values.
- The kurtosis is 3.79, suggesting a distribution with a sharper peak and heavier tails than a normal distribution.
- Since the p-value rounds to 0.0000 (below the 0.05 significance level), we reject the null hypothesis and conclude that salary is not normally distributed.
- The QQ plot supports this, showing deviations from the expected quantiles of a normal distribution.
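
The outliers in the salary boxplot follow from the quartiles listed above. By default a seaborn boxplot draws whiskers out to 1.5 times the interquartile range (IQR) and marks anything beyond as an outlier. A worked sketch of that arithmetic, using only the reported quartiles:

```python
# Quartiles reported in the Salary summary above
q1 = 45_276
q3 = 127_427

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR:         {iqr:,.2f}")
print(f"Lower fence: {lower_fence:,.2f}")
print(f"Upper fence: {upper_fence:,.2f}")
```

The lower fence is negative, so no salary can be a low outlier, while every salary above roughly $250,653 is flagged; the reported maximum of $547,204 is more than twice that threshold.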


### Stock Option Level
- 47.74% of employees have a stock option level of 0.
- 37.50% of employees have a stock option level of 1.
- 9.57% of employees have a stock option level of 2.
- 5.19% of employees have a stock option level of 3.
- This suggests that the company does not heavily rely on stock grants as part of employee compensation.

### Overtime
- 66.66% of employees do not work overtime.
- 33.34% of employees do work overtime. 
- It could be interesting to analyze whether working overtime impacts employee attrition.

### Hire Date
- The first day of hiring was January 3rd, 2012.
- The most recent day of hiring was December 3rd, 2022.
- The time span between the first and most recent hire is 3,987 days.
- The median hire date is March 28th, 2015.
- The date with the most hires was April 23rd, 2012.
- There are 1,048 unique hire dates in the dataset.
- The years with the highest number of hires, from highest to lowest, are 2012, 2013, and 2014.
- The year with the fewest hires was 2022.
- The time series plot shows a general decline in hiring over the years, with a small hiring spike in 2018.
- We can infer that many employees hired between 2012-2014 are still employed, resulting in fewer open roles in recent years.
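
The range printed by `dt_stats` is simply the latest date minus the earliest. The hire-date span can be checked with the standard library alone, using the two dates reported above:

```python
from datetime import date

first_hire = date(2012, 1, 3)   # first hire date reported above
last_hire = date(2022, 12, 3)   # most recent hire date reported above

span_days = (last_hire - first_hire).days
print(f"Span: {span_days:,} days")
```

The result matches the 3,987 days in the summary, a little under 11 years.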


### Attrition
- 66.3% of employees have stayed with the company since being hired.
- 33.7% of employees eventually left the company.
- Considering the decrease in hiring over recent years, the retention rate aligns with the trend of fewer new hires.


### Years At Company
- The average tenure of employees at this company is approximately 6 years.
- The standard deviation of tenure is 3.33 years.
- The shortest tenure recorded is 0 years.
- The longest tenure recorded is 10 years.
- Given the large hiring surge in 2012, it's expected to see many employees around the 10-year mark.
- 25% of employees have stayed for 3 years or less.
- 50% of employees stayed for 6 years or less.
- 75% of employees stayed for 9 years or less.
- The boxplot shows a longer left whisker than right: the upper half of tenures is packed between the median and the 10-year maximum, while shorter tenures spread out toward 0.
- The most common tenure (mode) is 10 years.
- Negative skewness (-0.32) indicates the distribution is left-skewed.
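- The kurtosis is below the normal-distribution baseline (0 for the excess kurtosis that `scipy.stats.kurtosis` reports by default, not 3), suggesting a flatter distribution with lighter tails, meaning fewer extreme tenure values.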
- The distribution plot shows a relatively consistent number of employees with tenures between 0 to 7 years.
- There is an increase in employees with tenure around 8 years and above, likely reflecting the 2012 hiring surge.
- Since the p-value rounds to 0.0000 (below the 0.05 significance level), we reject the null hypothesis and conclude the tenure data is not normally distributed.
- The QQ-Plot confirms this by showing flatter tails drifting away from the theoretical line on both ends.


### Years In Most Recent Role
- The average tenure in an employee's most recent role is 2.86 years.
- The standard deviation is 2.81 years.
- The shortest recorded tenure is 0 years (likely recent promotions or hires).
- The longest recorded tenure is 10 years.
- 25% of employees have been in their current role for less than 1 year (recorded as 0).
- 50% of employees have been in their current role for 2 years or less.
- 75% of employees have been in their current role for 5 years or less.
- The boxplot shows a long right tail, meaning fewer employees have stayed in the same role for extended periods.
- The most common tenure length is 0 years, suggesting many recent role changes or promotions.
- From the distribution, the number of employees in the same role steadily decreases from 1 to 10 years, indicating that employees who have been with the company longer often transition into new roles over time.
- The skewness (0.78) is positive, meaning the data is right-skewed with a long tail on the higher end.
- The kurtosis (-0.48) is negative (SciPy reports excess kurtosis, for which a normal distribution scores 0), indicating a flatter peak and lighter tails.
- Since the p-value rounds to 0.0000 (below the 0.05 significance level), we reject the null hypothesis and conclude that the distribution is not normal.
- The QQ-plot shows deviations from the diagonal on both ends, confirming the departure from normality.



### Years Since Last Promotion
- The average time since an employee's last promotion is 4 years.
- The standard deviation is 3.12 years.
- The shortest recorded gap since promotion is 0 years.
- The longest is 10 years.
- 25% of employees were last promoted 1 year ago or less.
- 50% of employees were last promoted 4 years ago or less.
- 75% of employees were last promoted 7 years ago or less.
- The boxplot shows a slightly longer right whisker, suggesting only a few employees have gone significantly longer without promotion.
- The most common value is 0 years, indicating many employees were promoted recently.
- The skewness (0.18) is close to 0, meaning the distribution is nearly symmetrical.
- The kurtosis (-1.21) is negative (SciPy reports excess kurtosis, for which a normal distribution scores 0), suggesting a flatter peak and lighter tails.
- The distribution remains relatively stable across 1-10 years, but most data points cluster toward the lower end.
- Since the p-value rounds to 0.0000 (below the 0.05 significance level), we reject the null hypothesis and conclude that the distribution is not normal.
- The QQ-plot confirms this, as the data points deviate from the theoretical line.

### Years With Current Manager
- The average tenure with a current manager is 2.81 years, with a standard deviation of 2.80 years.
- The minimum tenure is 0 years, and maximum is 10 years.
- 25% of employees have been with their manager for less than 1 year (recorded as 0), 50% for 2 years or less, and 75% for 5 years or less.
- The boxplot shows most of the data clustered toward the lower end, with a long right whisker.
- Positive skewness (0.84) indicates that the distribution is right-skewed, meaning only a small number of employees have unusually long tenures with their manager.
- The kurtosis (-0.36) indicates a flatter peak with fewer extreme outliers and is platykurtic.
- The mode is 0 years, meaning the largest group of employees are newly assigned to their current manager.
- This aligns with the observation that many employees were promoted under a year ago. However, the data suggests that quick promotions are not necessarily a company-wide trend, but a result of recent role changes.
- Since the p-value rounds to 0.0000 (below the 0.05 significance level), we reject the null hypothesis and conclude that the data is not normally distributed.
- The QQ plot also supports this conclusion.



### Review Date
- The earliest review was conducted on January 2nd, 2013, and the most recent one was on December 31st, 2022, with a span of 3,650 days.
- The median review date was September 15th, 2019.
- The day with the most reviews was May 22nd, 2022, which may coincide with a large promotion cycle.
- There were 2,771 unique review dates in the dataset.
- 2022 had the highest number of reviews.
- Since the most common value for years since last promotion is 0, many of these promotions likely occurred in 2022; the recent role changes appear to be promotions rather than demotions.
- The time series plot shows a steady, almost linear, increase in review counts each year.
- This trend could explain the company's relatively low attrition rate -- frequent reviews may coincide with career progression opportunities.
- Looking ahead, when the large hiring cohort from 2012-2014 begins to retire or leave, the company may need another significant hiring surge to maintain staffing levels.


### Environment Satisfaction
- 32.96% of employees rate their environment a 3.
- 32.42% of employees rate their environment a 4.
- 30.50% of employees rate their environment a 5.
- 2.10% of employees rate their environment a 2.
- 2.03% of employees rate their environment a 1.
- Ratings of 3-5 are almost evenly distributed, suggesting that most employees view the work environment positively or at least neutrally, with very few rating it poorly.
- 95.88% of employees rated their environment 3 or higher, indicating an overwhelmingly positive-to-neutral perception of the workplace.


### Job Satisfaction
- 25.12% of employees rate their job a 4.
- 24.95% of employees rate their job a 2.
- 24.61% of employees rate their job a 3.
- 23.39% of employees rate their job a 5.
- 1.94% of employees rate their job a 1. 
- The ratings are fairly balanced between high (4-5) and moderate/low (1-3) scores, but only 1.94% rated their job a 1, indicating extreme dissatisfaction is rare.

### Relationship Satisfaction
- 24.19% of employees rate their work relationships a 4.
- 25.15% of employees rate their work relationships a 2.
- 24.24% of employees rate their work relationships a 3.
- 23.39% of employees rate their work relationships a 5.
- 2.04% of employees rate their work relationships a 1.
- There's a fairly even distribution of ratings across 2, 3, 4, and 5, suggesting mixed but generally positive perceptions of workplace relationships, with very few employees giving the lowest rating.

### Training Opportunities Within Year
- 34.3% of employees had 3 training opportunities within the year.
- 33.9% of employees had 1 training opportunity within the year.
- 32.7% of employees had 2 training opportunities within the year.
- The distribution is relatively even across the three categories, suggesting that the number of training opportunities may be intentionally allocated, potentially based on job role. This would align with the fact that three major job roles make up most of the workforce.

### Training Opportunities Taken
- 36.50% of employees took 1 training opportunity
- 35.36% of employees took 0 training opportunities
- 19.20% of employees took 2 training opportunities
- 8.94% of employees took 3 training opportunities
- While training opportunities are fairly evenly offered, participation is more skewed, as most employees take either 0 or 1. The proportions of employees who took 0 and 1 training opportunities are fairly similar.
- This may be because employees aren't required to take all the training opportunities they're given. Even if employees have 2-3 opportunities available, many may only take 1 or none at all.

### Work Life Balance
- 25.43% of employees rated their work life balance a 4.
- 25.37% of employees rated their work life balance a 2.
- 24.89% of employees rated their work life balance a 3.
- 22.51% of employees rated their work life balance a 5.
- 1.80% of employees rated their work life balance a 1.
- Ratings between 2 and 5 each account for over 20% of responses, suggesting a broad spread of opinions but without extreme dissatisfaction. The very small proportion of employees rating it a 1 indicates that severe dissatisfaction with work-life balance is rare.

### Self Rating
- 34.13% of employees rated themselves a 3.
- 33.33% of employees rated themselves a 4.
- 32.54% of employees rated themselves a 5. 
- No employees rated themselves a 1 or 2.
- The absence of low ratings suggests that employees generally feel competent and satisfied with their own performance.
- The nearly even distribution across ratings 3-5 indicates a neutral-to-positive self-assessment culture, with no strong lean toward extreme self-confidence or self-criticism.

### Manager Rating
- 33.13% of employees were rated a 3 by their manager.
- 33.09% of employees were rated a 4 by their manager.
- 17.77% of employees were rated a 2 by their manager.
- 16.01% of employees were rated a 5 by their manager.
- No employees were rated a 1 by their manager.
- Ratings of 3 and 4 are the most common, with very similar proportions, suggesting managers often give mid-to-high evaluations.
- Ratings of 2 or 5 occur about half as often, indicating that judgements at either extreme, low or exceptional, are relatively rare.

## Nonparametric Statistical Testing
- using nonparametric tests since most of the numeric variables were shown not to be normally distributed (see the normality test results above).
- created a function that runs both the chi-square and Mann-Whitney U tests
- categorical variables: chi-square test of independence
- numeric variables: Mann-Whitney U test
- the significance level will be 0.05.
- goal: identify which variables have the strongest statistical association with attrition
- chi-square independence test interpretation: if the p-value is small (<0.05), we reject the null hypothesis and say there is a statistically significant relationship between the tested variable and attrition.
- Mann-Whitney U test interpretation: if the p-value is small (<0.05), we reject the null hypothesis and say that the distributions of the two groups (employees who left vs. stayed) differ significantly.
- datetime variables are excluded since they are not directly comparable in this context
- the results will help guide feature selection, transformations, and modeling.
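
The plan above can be sketched as a single dispatch function. This is a minimal illustration, not the notebook's actual implementation: the name `attrition_association`, the returned dictionary layout, and the toy values are all assumptions, and `Attrition` is assumed to be coded 0/1 as in the cleaned CSV.

```python
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

ALPHA = 0.05  # significance level stated above

def attrition_association(df, col, target='Attrition'):
    """Chi-square for categorical columns, Mann-Whitney U for numeric ones.

    Hypothetical helper for illustration; assumes target is coded 0/1.
    """
    data = df[[col, target]].dropna()
    if pd.api.types.is_numeric_dtype(data[col]):
        # Compare the column's distribution between stayers and leavers
        stayed = data.loc[data[target] == 0, col]
        left = data.loc[data[target] == 1, col]
        stat, p = mannwhitneyu(stayed, left, alternative='two-sided')
        test = 'mann-whitney u'
    else:
        table = pd.crosstab(data[col], data[target])
        # Drop unobserved categories, which would give zero expected counts
        table = table.loc[table.sum(axis=1) > 0]
        stat, p, _, _ = chi2_contingency(table)
        test = 'chi-square'
    return {'variable': col, 'test': test, 'statistic': stat,
            'p_value': p, 'significant': p < ALPHA}

# Toy data with made-up values, only to show the call
toy = pd.DataFrame({
    'Attrition': [0, 0, 0, 0, 1, 1, 1, 1],
    'OverTime': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
    'Salary': [90, 85, 80, 95, 40, 45, 50, 55],
})
results = pd.DataFrame(
    [attrition_association(toy, c) for c in ['OverTime', 'Salary']]
)
print(results)
```

Because the dispatch relies on dtype, binary columns stored as 0/1 (such as `OverTime` in the cleaned CSV) would need to be cast to `category` first, as `run_eda()` does, to be routed to the chi-square test.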