# Regression Analysis

Conduct a multiple linear regression of AnnualSalary on four independent variables: JobLevel, Gender, YearsAtCompany, and Education.

Test: Whether Gender is still a statistically significant predictor after controlling for job-related factors.

How much salary variation can my model explain? (R^2)

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [None]:
# Forward slashes work on every OS and avoid invalid escape sequences such as '\d' and '\p'
data = pd.read_csv('../data/processed/cleaned_hr_data.csv')
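
As an alternative way to write the data path, here is a minimal sketch using `pathlib`. It assumes the notebook sits one directory below the project root, as the relative path in the cell above suggests; the variable name `data_path` is illustrative and not part of the original notebook.

```python
from pathlib import Path

# Assumed layout: the notebook lives one level below the project root,
# so '..' leads to the folder that contains data/. Adjust if yours differs.
data_path = Path('..') / 'data' / 'processed' / 'cleaned_hr_data.csv'

# as_posix() always renders the path with forward slashes
print(data_path.as_posix())
```

`pd.read_csv` accepts a `Path` object directly, so `pd.read_csv(data_path)` would load the same file.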

What to look for:

p-value for C(Gender) → if < 0.05, it’s statistically significant

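Coefficient for C(Gender)[T.1] → the estimated salary difference for employees coded Gender = 1 relative to the reference group (Gender = 0), holding the other predictors fixed. Whether that is women or men depends on how Gender was coded.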

R² → how much variation your model explains
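
The three quantities above can also be read off a fitted model programmatically instead of from the printed summary. The sketch below uses synthetic stand-in data (not the HR dataset) in which salary depends on JobLevel only; the names `toy`, `toy_model`, and `gender_term` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the HR dataset): salary is driven by JobLevel only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'JobLevel': rng.integers(1, 6, size=n),
    'Gender': rng.integers(0, 2, size=n),
})
toy['AnnualSalary'] = 30_000 + 20_000 * toy['JobLevel'] + rng.normal(0, 5_000, size=n)

toy_model = smf.ols('AnnualSalary ~ JobLevel + C(Gender)', data=toy).fit()

# Find the gender term by prefix rather than hard-coding its label.
gender_term = next(name for name in toy_model.params.index
                   if name.startswith('C(Gender)'))

gender_coef = toy_model.params[gender_term]   # estimated salary gap
gender_p = toy_model.pvalues[gender_term]     # p-value of that gap
r_squared = toy_model.rsquared                # share of variance explained

print(f'{gender_term}: coef={gender_coef:.1f}, p={gender_p:.3f}')
print(f'R-squared: {r_squared:.3f}')
```

The same attribute lookups (`params`, `pvalues`, `rsquared`) should apply to the models fitted in this notebook.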

In [13]:
model = smf.ols(
    formula='AnnualSalary ~ JobLevel + C(Gender) + YearsAtCompany + Education',
    data=data
).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           AnnualSalary   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.4403
Date:                Tue, 16 Sep 2025   Prob (F-statistic):              0.780
Time:                        10:51:20   Log-Likelihood:                -17743.
No. Observations:                1470   AIC:                         3.550e+04
Df Residuals:                    1465   BIC:                         3.552e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       1.371e+05   4040.567     33.

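R² = 0.001 means the linear combination of the independent variables explains about 0.1% of the variance in AnnualSalary. Values closer to 1 indicate a better fit. Here the adjusted R² is negative (-0.002) and the overall F-test is not significant (p = 0.780), so the model has essentially no explanatory power.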
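
To make the meaning of R² concrete, the sketch below computes it by hand as 1 - SS_res / SS_tot on a few made-up salaries. The function `r_squared` and the toy numbers are illustrative assumptions, not values from the HR data.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up salaries for illustration only
salaries = [40_000, 50_000, 60_000, 70_000]
mean_salary = sum(salaries) / len(salaries)

# Predicting the mean for everyone explains none of the variance
r2_mean_only = r_squared(salaries, [mean_salary] * len(salaries))

# Predicting every salary exactly explains all of it
r2_perfect = r_squared(salaries, salaries)

print(r2_mean_only, r2_perfect)
```

An R² of 0.001 therefore sits almost exactly at the "predict the mean for everyone" end of this scale.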

| Variable            | Coef       | p-value | Interpretation                                                                                                            |
| ------------------- | ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------- |
| **Intercept**       | \$137,100  | 0.000   | Predicted salary when Gender = 0 and all numeric predictors equal 0; an extrapolation rather than a realistic baseline     |
| **C(Gender)\[T.1]** | -\$120     | 0.958   | Not statistically significant; no detectable salary difference between genders after controlling for the other predictors |
| **JobLevel**        | -\$1,000   | 0.398   | Unexpectedly **negative**, but not significant (p > 0.05)                                                                  |
| **YearsAtCompany**  | -\$48      | 0.821   | Not significant; no detectable effect on salary once the other variables are accounted for                                 |
| **Education**       | +\$821     | 0.449   | Not significant; no effect detected here either                                                                            |


In [15]:
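# Adding job roles to the model

# Replace spaces in column names with underscores so the names can be used in a formula.
# Note: this handles spaces only; other characters such as '&' are left unchanged.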
data.columns = data.columns.str.replace(' ', '_')
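
Replacing spaces leaves other characters in place: the column listing still shows `Department_Research_&_Development`, and a name like that cannot be written bare in a formula. A minimal sketch of a stricter cleaner, using a hypothetical helper `to_identifier` that is not part of the original notebook:

```python
import re

def to_identifier(name: str) -> str:
    """Collapse each run of non-word characters and underscores into one underscore."""
    return re.sub(r'[\W_]+', '_', name).strip('_')

print(to_identifier('Department_Research_&_Development'))
print(to_identifier('JobRole_Sales Executive'))
```

It could be applied with `data.columns = [to_identifier(c) for c in data.columns]`. The helper does not handle names that start with a digit, which would still be invalid in a formula.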


In [16]:
data.columns

Index(['DailyRate', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
       'Gender', 'JobLevel', 'OverTime', 'PercentSalaryHike',
       'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance',
       'YearsAtCompany', 'AnnualSalary', 'BusinessTravel_Travel_Frequently',
       'BusinessTravel_Travel_Rarely', 'Department_Research_&_Development',
       'Department_Sales', 'EducationField_Life_Sciences',
       'EducationField_Marketing', 'EducationField_Medical',
       'EducationField_Other', 'EducationField_Technical_Degree',
       'JobRole_Human_Resources', 'JobRole_Laboratory_Technician',
       'JobRole_Manager', 'JobRole_Manufacturing_Director',
       'JobRole_Research_Director', 'JobRole_Research_Scientist',
       'JobRole_Sales_Executive', 'JobRole_Sales_Representative'],
      dtype='object')
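
Given the column listing above, the `JobRole_` terms for a formula can be collected programmatically rather than typed out by hand. This is a sketch only: `columns` is an abbreviated copy of the listing (in the notebook it would be `data.columns`), and `base_terms` and `jobrole_terms` are illustrative names.

```python
# Abbreviated copy of the column listing; in the notebook, use data.columns instead.
columns = [
    'Education', 'Gender', 'JobLevel', 'YearsAtCompany', 'AnnualSalary',
    'Department_Sales',
    'JobRole_Human_Resources', 'JobRole_Laboratory_Technician',
    'JobRole_Manager', 'JobRole_Manufacturing_Director',
    'JobRole_Research_Director', 'JobRole_Research_Scientist',
    'JobRole_Sales_Executive', 'JobRole_Sales_Representative',
]

base_terms = ['JobLevel', 'C(Gender)', 'YearsAtCompany', 'Education']

# Keep every dummy column whose name starts with the JobRole_ prefix
jobrole_terms = [c for c in columns if c.startswith('JobRole_')]

formula = 'AnnualSalary ~ ' + ' + '.join(base_terms + jobrole_terms)
print(formula)
```

Building the formula this way keeps it in sync with the data if the set of job-role columns changes.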

In [17]:

model_with_jobroles = smf.ols(
    formula=(
        'AnnualSalary ~ JobLevel + C(Gender) + YearsAtCompany + Education'
        ' + JobRole_Human_Resources + JobRole_Laboratory_Technician'
        ' + JobRole_Manager + JobRole_Manufacturing_Director'
        ' + JobRole_Research_Director + JobRole_Research_Scientist'
        ' + JobRole_Sales_Executive + JobRole_Sales_Representative'
    ),
    data=data
).fit()
print(model_with_jobroles.summary())

                            OLS Regression Results                            
Dep. Variable:           AnnualSalary   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.4995
Date:                Tue, 16 Sep 2025   Prob (F-statistic):              0.916
Time:                        11:25:38   Log-Likelihood:                -17741.
No. Observations:                1470   AIC:                         3.551e+04
Df Residuals:                    1457   BIC:                         3.558e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------

Adding the job-role indicators raises R² only from 0.001 to 0.004, so the predictors collectively still explain less than 1% of the variance in salary. The adjusted R² is negative (-0.004) and the overall F-test is not significant (p = 0.916), so the regression performs no better than predicting the mean salary for every employee.
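
The negative adjusted R² values in both summaries follow from the standard adjustment formula, 1 - (1 - R²)(n - 1) / (n - p - 1), which penalises each added predictor. The sketch below plugs in the rounded R² values, the 1470 observations, and the model degrees of freedom (4 and 12) from the outputs; the helper `adjusted_r2` is an illustrative name.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Inputs are the rounded figures printed in the two summaries
baseline = adjusted_r2(0.001, n_obs=1470, n_predictors=4)
with_roles = adjusted_r2(0.004, n_obs=1470, n_predictors=12)

print(f'{baseline:.3f} {with_roles:.3f}')  # -0.002 -0.004
```

Because the eight extra predictors add almost no explained variance, the penalty outweighs the gain and the adjusted value drops further below zero.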

In [20]:
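# Export the coefficient table to CSV
# Note: this uses `model` (the first regression, without job roles), not `model_with_jobroles`
# Collect the estimates into a DataFrame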
summary_df = pd.DataFrame({
    'coef': model.params,
    'std_err': model.bse,
    't': model.tvalues,
    'p_value': model.pvalues,
    'conf_low': model.conf_int()[0],
    'conf_high': model.conf_int()[1]
})

# Optional: round for readability
summary_df = summary_df.round(4)

# Export to CSV
summary_df.to_csv('../outputs/model_summary.csv')