# **Multiple Regression Analysis: Predicting Employee Job Satisfaction**

Model Goal: The intended regression model seeks to determine **how various factors in a work environment contribute to an employee's overall job satisfaction score:**

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$

Where:

$Y$ = JobSatisfactionScore (The dependent variable)

$X_1$ = WeeklyHoursWorked

$X_2$ = RemoteWorkRatio

$X_3$ = TeamCohesionRating

| **Variable Name**            | **Description**                                                         | **Type**            | **Expected Relationship with JobSatisfactionScore**     |
| ---------------------------- | ----------------------------------------------------------------------- | ------------------- | ------------------------------------------------------- |
| **JobSatisfactionScore (Y)** | A composite index of satisfaction, 1 (Very Low) to 10 (Very High).      | Scale (1–10)        | **Dependent**                                           |
| **WeeklyHoursWorked (X1)**   | Average hours worked per week.                                          | Continuous          | **Negative** — More hours often lowers satisfaction     |
| **RemoteWorkRatio (X2)**     | Percentage of time working remotely (0% to 100%).                       | Continuous (0–100)  | **Positive** — Flexibility often increases satisfaction |
| **TeamCohesionRating (X3)**  | A measure of team effectiveness and support, 1 (Poor) to 5 (Excellent). | Ordinal/Scale (1–5) | **Positive** — Better teams improve satisfaction        |


# **Import data**

In [1]:
import pandas as pd

df = pd.read_csv('/content/job_satisfaction_dataset.csv')
display(df.head())

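| Unnamed: 0 | Employee ID | JobSatisfactionScore | WeeklyHoursWorked | RemoteWorkRatio | TeamCohesionRating |
| ---------- | ----------- | -------------------- | ----------------- | --------------- | ------------------ |
| 0          | 1           | 7.5                  | 42                | 50              | 4                  |
| 1          | 2           | 4.1                  | 55                | 10              | 2                  |
| 2          | 3           | 8.9                  | 38                | 80              | 5                  |
| 3          | 4           | 6.0                  | 45                | 25              | 3                  |
| 4          | 5           | 3.5                  | 50                | 0               | 1                  |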
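Before fitting, it can help to confirm that the columns fall inside the ranges documented in the variable table. The following is a minimal sketch that re-enters only the five preview rows by hand; the names `sample` and `expected_ranges` are illustrative and not part of the original notebook, and the same check would normally be run on the full `df`.

```python
import pandas as pd

# The five preview rows, re-entered by hand for a self-contained example.
sample = pd.DataFrame({
    "Employee ID": [1, 2, 3, 4, 5],
    "JobSatisfactionScore": [7.5, 4.1, 8.9, 6.0, 3.5],
    "WeeklyHoursWorked": [42, 55, 38, 45, 50],
    "RemoteWorkRatio": [50, 10, 80, 25, 0],
    "TeamCohesionRating": [4, 2, 5, 3, 1],
})

# Documented ranges taken from the variable table (inclusive bounds).
expected_ranges = {
    "JobSatisfactionScore": (1, 10),
    "RemoteWorkRatio": (0, 100),
    "TeamCohesionRating": (1, 5),
}

# Count values outside each documented range, plus any missing cells.
out_of_range = {
    col: int((~sample[col].between(low, high)).sum())
    for col, (low, high) in expected_ranges.items()
}
missing = int(sample.isna().sum().sum())

print("Out-of-range counts:", out_of_range)
print("Missing cells:", missing)
```

The `Unnamed: 0` column in the preview is typically a row index that was saved into the CSV; it is not used in the model formula, so it does not affect the fit.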


# **DEFINE THE MODEL FORMULA**

The formula is Y ~ X1 + X2 + X3

JobSatisfactionScore is the dependent variable (Y)

WeeklyHoursWorked, RemoteWorkRatio, and TeamCohesionRating are the independent variables (X's)

In [2]:
formula = 'JobSatisfactionScore ~ WeeklyHoursWorked + RemoteWorkRatio + TeamCohesionRating'

# **FIT THE OLS REGRESSION MODEL**

OLS stands for Ordinary Least Squares, the standard method for linear regression.

In [4]:
import statsmodels.formula.api as smf
model = smf.ols(formula=formula, data=df)
results = model.fit()
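The fitted results object exposes the same numbers that the printed summary rounds. As a hedged sketch, the block below fits the same formula on synthetic stand-in data (the original CSV is not reproduced here, and the coefficients used to generate `demo` are made up for illustration) and reads a few attributes directly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the generating coefficients are assumptions,
# chosen only so the example has the same column names and rough scales.
rng = np.random.default_rng(42)
n = 100
demo = pd.DataFrame({
    "WeeklyHoursWorked": rng.uniform(35, 60, n),
    "RemoteWorkRatio": rng.uniform(0, 100, n),
    "TeamCohesionRating": rng.integers(1, 6, n),
})
demo["JobSatisfactionScore"] = (
    8.0
    - 0.10 * demo["WeeklyHoursWorked"]
    + 0.02 * demo["RemoteWorkRatio"]
    + 0.40 * demo["TeamCohesionRating"]
    + rng.normal(0, 0.3, n)
)

demo_fit = smf.ols(
    "JobSatisfactionScore ~ WeeklyHoursWorked + RemoteWorkRatio + TeamCohesionRating",
    data=demo,
).fit()

print(demo_fit.params)            # unrounded coefficient estimates
print(demo_fit.pvalues)           # unrounded p-values (the summary shows 3 decimals)
print(demo_fit.conf_int())        # 95% confidence intervals by default
print(demo_fit.rsquared, demo_fit.rsquared_adj)
print(demo_fit.condition_number)  # the value printed as 'Cond. No.'
```

Reading `pvalues` directly is useful when the summary prints `0.000`, since that only means the value rounds to zero at three decimals.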

# **PRINT THE REGRESSION SUMMARY**

In [5]:
print(results.summary())

                             OLS Regression Results                             
Dep. Variable:     JobSatisfactionScore   R-squared:                       0.981
Model:                              OLS   Adj. R-squared:                  0.980
Method:                   Least Squares   F-statistic:                     1615.
Date:                  Wed, 10 Dec 2025   Prob (F-statistic):           5.42e-82
Time:                          06:55:03   Log-Likelihood:                -12.915
No. Observations:                   100   AIC:                             33.83
Df Residuals:                        96   BIC:                             44.25
Df Model:                             3                                         
Covariance Type:              nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept       

**Model Fit Statistics**

R-squared ($R^2$): This value (between 0 and 1) is the proportion of the variation in JobSatisfactionScore that is explained by the three independent variables in the model.

F-statistic: Tests whether the model as a whole is statistically significant (i.e., whether at least one of the predictors has a nonzero coefficient).
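For an OLS model with an intercept, the overall F-statistic can be recovered from $R^2$ as $F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}$, where $k$ is the number of predictors and $n$ the number of observations. A small sketch using the rounded values printed in the summary (variable names are illustrative):

```python
# Values read from the printed summary.
r_squared = 0.981   # R-squared, rounded to three decimals
n_obs = 100         # No. Observations
k = 3               # Df Model (number of predictors)

df_resid = n_obs - k - 1  # matches 'Df Residuals' in the summary

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

print("Df Residuals:", df_resid)
print("F-statistic from rounded R-squared:", round(f_stat, 1))
```

Because the printed R-squared is rounded, the result lands near, but not exactly on, the reported F-statistic of 1615; the unrounded `results.rsquared` would reproduce it more closely.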

# **Interpret Regression Coefficients and Statistical Significance**

This section interprets the OLS coefficients for `WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating`, describing each predictor's estimated effect on `JobSatisfactionScore` and its statistical significance (p-value). Later sections evaluate the model's fit using R-squared and Adjusted R-squared, address the multicollinearity warning ('Cond. No.'), and summarize the key findings on the factors influencing job satisfaction.

Based on the OLS regression summary:

*   **WeeklyHoursWorked (X1)**:
    *   **Coefficient**: -0.1062
    *   **P-value**: 0.000 (reported to three decimals, i.e. p < 0.001)
    *   **Interpretation**: For every additional hour in `WeeklyHoursWorked`, the `JobSatisfactionScore` is estimated to decrease by approximately **0.1062 points**, holding other variables constant. The **p-value below 0.001 (well under 0.05)** indicates that this **negative effect is statistically significant.**

*   **RemoteWorkRatio (X2)**:
    *   **Coefficient**: 0.0245
    *   **P-value**: 0.000 (reported to three decimals, i.e. p < 0.001)
    *   **Interpretation**: For every one-percentage-point increase in `RemoteWorkRatio`, the `JobSatisfactionScore` is estimated to increase by approximately **0.0245 points**, holding other variables constant. The **p-value below 0.001 (well under 0.05)** indicates that this **positive effect is statistically significant.**

*   **TeamCohesionRating (X3)**:
    *   **Coefficient**: 0.3545
    *   **P-value**: 0.000 (reported to three decimals, i.e. p < 0.001)
    *   **Interpretation**: For every one-point increase in `TeamCohesionRating`, the `JobSatisfactionScore` is estimated to increase by approximately **0.3545 points**, holding other variables constant. The **p-value below 0.001 (well under 0.05)** indicates that this **positive effect is statistically significant.**
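To make the "holding other variables constant" reading concrete, the sketch below combines the three reported coefficients into a predicted difference between two hypothetical employees. The intercept cancels when comparing two predictions, so it is not needed. The `changes` values are invented purely for illustration.

```python
# Coefficients as reported in the interpretation above.
coefs = {
    "WeeklyHoursWorked": -0.1062,
    "RemoteWorkRatio": 0.0245,
    "TeamCohesionRating": 0.3545,
}

# Hypothetical difference between employee B and employee A:
# B works 5 more hours, is remote 20 percentage points more,
# and rates team cohesion 1 point higher.
changes = {
    "WeeklyHoursWorked": 5,
    "RemoteWorkRatio": 20,
    "TeamCohesionRating": 1,
}

# Each predictor contributes coefficient * change to the predicted difference.
contributions = {name: coefs[name] * changes[name] for name in coefs}
predicted_change = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name}: {value:+.4f}")
print(f"Predicted difference in JobSatisfactionScore: {predicted_change:+.4f}")
```

In this example the extra hours lower the predicted score by about 0.53 points, while the extra remote time and higher cohesion rating add about 0.49 and 0.35, for a net difference of roughly +0.31.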

# **Evaluate Model Fit (R-squared)**

The regression summary provides the following key metrics for model fit:

*   **R-squared: 0.981**
*   **Adjusted R-squared: 0.980**

## **Explanation of R-squared**

**R-squared**, also known as the **coefficient of determination, measures the proportion of the variance in the dependent variable that can be explained by the independent variables in the model.** In this case, an R-squared value of **0.981** means that approximately **98.1%** of the variance in `JobSatisfactionScore` **can be explained by the independent variables** (`WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating`). This is a **very high value**, suggesting that **the model's predictors are highly effective in explaining job satisfaction.**

## **Explanation of Adjusted R-squared**

Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. Unlike R-squared, **the adjusted R-squared increases only if a new term improves the model more than would be expected by chance. It penalizes the inclusion of unnecessary predictors.** In this model, the Adjusted R-squared is **0.980**. The fact that it is **very close to the R-squared value (0.981)** indicates that the **penalty for model size is negligible here: with only three predictors and 100 observations, the high R-squared is not an artifact of the number of predictors.** Whether each individual predictor contributes meaningfully is judged from its own coefficient and p-value, not from this comparison.
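The adjustment follows $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$. The sketch below applies it to the summary's rounded values and, for contrast, to a hypothetical model with the same R-squared but 30 predictors; the function name `adjusted_r_squared` is illustrative.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for an OLS model that includes an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values from the summary: R-squared 0.981, 100 observations, 3 predictors.
adj_r2 = adjusted_r_squared(0.981, 100, 3)

# Hypothetical comparison: same R-squared, but 30 predictors.
adj_r2_many = adjusted_r_squared(0.981, 100, 30)

print(f"3 predictors:  {adj_r2:.3f}")
print(f"30 predictors: {adj_r2_many:.3f}")
```

With three predictors the penalty is about 0.001, which matches the reported 0.980; with 30 predictors the same R-squared would be adjusted down more noticeably.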

## **Discussion on Model Fit**

Both the R-squared (0.981) and Adjusted R-squared (0.980) values are exceptionally high, indicating a **very strong fit** for the model. This suggests that `WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating` are excellent predictors of `JobSatisfactionScore`, and the model effectively captures a large portion of the variability in job satisfaction.

An R-squared this high means that the model explains almost all of the variance in the dependent variable within this sample. The small gap between R-squared and Adjusted R-squared shows that this fit is not inflated by the number of predictors.

# **Addressing Multicollinearity: Interpreting the Condition Number**

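The regression summary provides a 'Cond. No.' (Condition Number) of **1.45e+03** (1450). statsmodels attaches a warning when this value exceeds 1000, noting that it might indicate strong multicollinearity or other numerical problems. A commonly cited rule of thumb treats values above about 30 as a sign of potential multicollinearity, but that threshold assumes the columns of the design matrix have been scaled to comparable size. statsmodels computes the condition number from the raw design matrix, including the intercept column, so predictors on very different scales (weekly hours in the 40s and 50s, percentages from 0 to 100, ratings from 1 to 5) can produce a large value even when they are only weakly correlated. The value is therefore a prompt to investigate rather than proof of multicollinearity.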

**What a High 'Cond. No.' Indicates:**

A high Condition Number can indicate multicollinearity among the independent variables (`WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating`). Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This means that **one independent variable can be approximately predicted as a linear combination of the others, making it difficult for the model to isolate the unique effect of each variable on the dependent variable (`JobSatisfactionScore`).**
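Because the condition number depends on how the columns are scaled, a large value does not by itself establish that predictors are correlated. The sketch below uses synthetic, independently generated predictors on roughly the same scales as this dataset (the ranges are assumptions chosen for illustration) and compares the condition number before and after standardizing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Independently drawn predictors: no built-in correlation between them.
hours = rng.uniform(35, 60, n)                   # weekly hours
remote = rng.uniform(0, 100, n)                  # remote-work percentage
cohesion = rng.integers(1, 6, n).astype(float)   # rating from 1 to 5

predictors = np.column_stack([hours, remote, cohesion])
intercept = np.ones((n, 1))

# Condition number of the raw design matrix (intercept + unscaled predictors).
cond_raw = np.linalg.cond(np.hstack([intercept, predictors]))

# Same predictors after centering and scaling each column to unit variance.
standardized = (predictors - predictors.mean(axis=0)) / predictors.std(axis=0)
cond_scaled = np.linalg.cond(np.hstack([intercept, standardized]))

# Largest absolute pairwise correlation among the predictors.
corr = np.corrcoef(predictors, rowvar=False)
max_abs_corr = np.abs(corr - np.eye(3)).max()

print(f"Condition number, raw columns:          {cond_raw:.1f}")
print(f"Condition number, standardized columns: {cond_scaled:.2f}")
print(f"Largest |correlation| between predictors: {max_abs_corr:.2f}")
```

In this synthetic case the raw condition number is large while the standardized one is close to 1, even though the predictors are essentially uncorrelated. Applying the same comparison to the real `df` would show how much of the reported 1450 is due to scale.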

**Implications of Multicollinearity for Reliability and Interpretation:**
1.  **Unreliable Coefficient Estimates:** While the overall model fit (e.g., R-squared) might still be good, the individual regression coefficients ($\beta_1, \beta_2, \beta_3$) become unstable and highly sensitive to small changes in the data. This means their estimated values might fluctuate wildly from one sample to another, making them unreliable.
2.  **Inflated Standard Errors:** Multicollinearity leads to larger standard errors for the coefficients. Larger standard errors result in smaller t-statistics and larger p-values, making it difficult to reject the null hypothesis that a coefficient is zero. Consequently, a variable that is actually important might appear statistically insignificant.
3.  **Difficulty in Interpretation:** It becomes challenging to interpret the individual impact of each independent variable on job satisfaction because their effects are intertwined. For example, if `WeeklyHoursWorked` and `RemoteWorkRatio` are highly correlated, it's hard to tell how much each contributes uniquely to job satisfaction.
4.  **Incorrect Signs of Coefficients:** In severe cases, multicollinearity can even cause coefficient estimates to have signs opposite to what is expected based on domain knowledge.

**Approaches to Address Multicollinearity:**
1.  **Collect More Data:** Sometimes, more data can help reduce the collinearity by providing more variation.
2.  **Remove Highly Correlated Variables:** If two or more variables are very highly correlated, consider removing one of them if it doesn't significantly impact the theoretical model.
3.  **Combine Variables:** Create a composite variable (e.g., an index) from highly correlated variables if they measure similar underlying concepts.
4.  **Feature Engineering:** Transform existing variables or create new ones to reduce collinearity.
5.  **Use Ridge Regression or Lasso Regression:** These regularized regression techniques can handle multicollinearity by adding a penalty term to the loss function, which shrinks the coefficients and can improve their stability. (This is more advanced and beyond a simple OLS context).

Given the high condition number, it's crucial to investigate the correlations between `WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating` to determine which variables might be contributing most to the multicollinearity and consider appropriate remedies.
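One common way to carry out that investigation is the variance inflation factor (VIF), which for predictor $j$ equals $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the others. The VIFs are also the diagonal of the inverse of the predictors' correlation matrix, which the sketch below uses. The data are synthetic and the helper name `variance_inflation_factors` is illustrative; values above roughly 5 to 10 are often taken as a sign of problematic collinearity.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF per column of X, via the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)  # nearly a copy of a, so a and c are collinear

# Two unrelated predictors: VIFs should be close to 1.
vif_independent = variance_inflation_factors(np.column_stack([a, b]))

# Adding the near-duplicate column inflates the VIFs of a and c, but not b.
vif_collinear = variance_inflation_factors(np.column_stack([a, b, c]))

print("Independent predictors:", np.round(vif_independent, 2))
print("With near-duplicate column:", np.round(vif_collinear, 2))
```

On the real data, the same function could be applied to `df[['WeeklyHoursWorked', 'RemoteWorkRatio', 'TeamCohesionRating']].to_numpy()`; statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.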

# **Summary**

1.  **How do `WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating` influence `JobSatisfactionScore`?**
    *   `WeeklyHoursWorked`: For every one-unit increase, `JobSatisfactionScore` is estimated to decrease by approximately 0.1062 points. This is a statistically significant negative impact.
    *   `RemoteWorkRatio`: For every one-unit (percentage point) increase, `JobSatisfactionScore` is estimated to increase by approximately 0.0245 points. This is a statistically significant positive impact.
    *   `TeamCohesionRating`: For every one-unit increase, `JobSatisfactionScore` is estimated to increase by approximately 0.3545 points. This is a statistically significant positive impact.

2.  **How well does the model explain the variance in `JobSatisfactionScore`?**
    The model demonstrates a very strong fit, with an R-squared of 0.981 and an Adjusted R-squared of 0.980. This means that approximately 98.1% of the variance in `JobSatisfactionScore` can be explained by the independent variables (`WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating`).

3.  **What are the implications of the 'Cond. No.' warning for multicollinearity?**
    The 'Cond. No.' of 1.45e+03 (1450) may indicate multicollinearity among the independent variables, although a value this large can also result from predictors measured on very different scales. If multicollinearity is confirmed, the individual coefficient estimates may be unstable, their standard errors inflated (potentially leading to false insignificance), and their unique impacts difficult to interpret reliably.

## **Data Analysis Key Findings**

*   **All three independent variables** (`WeeklyHoursWorked`, `RemoteWorkRatio`, `TeamCohesionRating`) are **statistically significant predictors** of `JobSatisfactionScore` (all p-values are reported as 0.000, i.e. below 0.001).

*   `WeeklyHoursWorked` has a **statistically significant negative impact** on job satisfaction, with a coefficient of -0.1062.

*   `RemoteWorkRatio` and `TeamCohesionRating` both have **statistically significant positive impacts on job satisfaction,** with coefficients of 0.0245 and 0.3545, respectively.

*   The model exhibits a remarkably high explanatory power, with an R-squared value of 0.981 and an Adjusted R-squared of 0.980, indicating that the **independent variables explain over 98% of the variability in `JobSatisfactionScore`.**

*   A high Condition Number of 1450 suggests **possible multicollinearity among the predictors, which, if confirmed, could compromise the reliability and interpretation of individual coefficient estimates despite the strong overall model fit.**

## **Insights**
*   Investigate the correlations between `WeeklyHoursWorked`, `RemoteWorkRatio`, and `TeamCohesionRating` to identify the source of multicollinearity and consider remedies such as removing highly correlated variables, combining them, or using advanced regression techniques.

*   Given the strong explanatory power of the model (high R-squared), **checking for multicollinearity and addressing it where it is confirmed** will further enhance the trustworthiness and interpretability of the individual factor contributions to `JobSatisfactionScore`, which is crucial for targeted interventions.
