Section 1: Multiple Regression Walk Through
Cybersecurity incidents are an increasing threat to safe and responsible business operations, but even the critical job of protecting sensitive data doesn’t come with an unlimited budget. Where should you invest funds on cybersecurity enforcement across different departments?
•	Why are some departments getting hit harder than others?
•	Which risk factors matter the most?
•	Can we predict which departments are likely to see more incidents in the near future?

Break your work in your lab notebook into the following stages, using the outline below.

1.	Data Familiarization & Hygiene
    a.	Inspect schema, ranges, and datatypes.
    b.	Detect and fix issues: missing values, impossible percentages, and text in numeric fields.
    c.	Apply simple imputation and clipping strategies.
2.	Exploratory Analysis (EDA)
    a.	Plot distributions (incidents histogram).
    b.	Explore relationships (incidents vs. vulnerabilities, patch time).
    c.	Spot skewness and potential transformations.
3.	Feature Engineering & Confounders
    a.	Create transformations (log_org_size, log_vulns).
    b.	Engineer interaction terms (e.g., vulnerabilities × endpoint coverage gaps).
    c.	Discuss how org size and budget confound relationships.
4.	Modeling – Naïve vs. Improved
    a.	Fit a baseline OLS model without confounders.
    b.	Fit an improved OLS model with transformations, confounders, and interactions.
    c.	Compare coefficients, R², AIC/BIC.
5.	Assumptions Diagnostics
    a.	Check linearity, normality, and homoscedasticity (residual plots, QQ plot, Breusch–Pagan).
    b.	Evaluate multicollinearity with VIF.
6.	Outliers & Influence
    a.	Identify leverage points and high Cook’s D values.
    b.	Discuss whether to drop or retain influential cases.
    c.	Refit model and compare metrics.
7.	Generalization Check
    a.	Train/test split; calculate RMSE on test data.
    b.	Reflect on in-sample vs. out-of-sample fit.
8.	Think About:
    a.	Which variables appear most predictive of incidents?
    b.	How do confounders shift interpretations?
    c.	What model would you present to leadership—and what caveats remain?


1.	Data Familiarization & Hygiene
    a.	Inspect schema, ranges, and datatypes.
    b.	Detect and fix issues: missing values, impossible percentages, and text in numeric fields.
    c.	Apply simple imputation and clipping strategies.



In [None]:
#load the csv into a data frame
#perform some familiarization steps with the data set
#inspect and display some meta/summary data about the data set

import pandas as pd

#load CSV to pandas data frame
cyber_df = pd.read_csv("Lab 05/cybersec.csv")

# 1a. Inspect schema and types
print("Schema and Data Types:\n", cyber_df.dtypes, "\n")

#show summary statistics
print("Summary Statistics:\n", cyber_df.describe(include='all'), "\n")

#show missing values
print("Missing Values:\n", cyber_df.isnull().sum(), "\n")

#show some unique values for object columns
object_columns = cyber_df.select_dtypes(include=['object']).columns
for col in object_columns:
    print(f"{col} unique values:", cyber_df[col].unique()[:10])
print()



In [None]:
# 1b. Detect & fix issues
#clip impossible % values, just in case
cyber_df['phishing_sim_click_rate'] = cyber_df['phishing_sim_click_rate'].clip(upper=1.0)

#fix text 'null' if present (not needed here, but included as good hygiene)
cyber_df.replace("null", pd.NA, inplace=True)

# 1c. impute missing values with the column mean
cyber_df['vuln_count'] = cyber_df['vuln_count'].fillna(cyber_df['vuln_count'].mean())
cyber_df['training_completion_rate'] = cyber_df['training_completion_rate'].fillna(cyber_df['training_completion_rate'].mean())

#check if all missing values handled
print("Post-cleaning Missing Values:\n", cyber_df.isnull().sum())

#quick preview (print explicitly; a bare head() only displays when it is the last line of a cell)
print(cyber_df.head())

#actual column names
print(cyber_df.columns.tolist())

#check the min and max values of the endpoint_coverage column
min_coverage = cyber_df['endpoint_coverage'].min()
max_coverage = cyber_df['endpoint_coverage'].max()

print(f"Endpoint Coverage ranges from {min_coverage}% to {max_coverage}%.")
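
The hygiene step lists "text in numeric fields", but the cell above only handles the literal string "null". Here is a minimal sketch of a more general coercion. It assumes a small made-up frame (the values are hypothetical, not taken from cybersec.csv) and reuses two of the lab's column names purely for illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration only.
toy = pd.DataFrame({
    "vuln_count": ["12", "null", "7", "n/a"],
    "phishing_sim_click_rate": [0.10, 1.40, -0.05, 0.25],
})

# Coerce text to numbers; anything unparseable becomes NaN instead of raising.
toy["vuln_count"] = pd.to_numeric(toy["vuln_count"], errors="coerce")

# Impute the gaps with the column mean, as in step 1c.
toy["vuln_count"] = toy["vuln_count"].fillna(toy["vuln_count"].mean())

# Clip rates to the valid [0, 1] range on both sides.
toy["phishing_sim_click_rate"] = toy["phishing_sim_click_rate"].clip(lower=0.0, upper=1.0)

print(toy)
```

Coercing before imputing matters: a column that still holds strings has dtype object, so mean() and clip() would not behave as intended on it.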


2.	Exploratory Analysis (EDA)
    a.	Plot distributions (incidents histogram).
    b.	Explore relationships (incidents vs. vulnerabilities, patch time).
    c.	Spot skewness and potential transformations.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import skew

#2a: Histogram of Security Incidents

plt.figure(figsize=(8, 5))
sns.histplot(cyber_df['security_incidents'], bins=30, kde=True)
plt.title("Histogram of Security Incidents (Original)")
plt.xlabel("Number of Security Incidents")
plt.ylabel("Frequency")
plt.show()


In [None]:
#2b: Explore relationships via scatter plots

#incidents vs. vuln_count
plt.figure(figsize=(8, 5))
sns.scatterplot(x='vuln_count', y='security_incidents', data=cyber_df)
plt.title("Security Incidents vs. Vulnerability Count")
plt.xlabel("Vulnerability Count")
plt.ylabel("Security Incidents")
plt.show()

#incidents vs. mean_time_to_patch
plt.figure(figsize=(8, 5))
sns.scatterplot(x='mean_time_to_patch', y='security_incidents', data=cyber_df)
plt.title("Security Incidents vs. Mean Time to Patch")
plt.xlabel("Mean Time to Patch (days)")
plt.ylabel("Security Incidents")
plt.show()




In [None]:
#2c: Skewness and Log Transformation

#original skewness
orig_skew = skew(cyber_df['security_incidents'].dropna())
print(f"Skewness of security_incidents (original): {orig_skew:.2f}")

#log transform
cyber_df['log_security_incidents'] = np.log1p(cyber_df['security_incidents'])

#skewness after transformation
log_skew = skew(cyber_df['log_security_incidents'].dropna())
print(f"Skewness after log transformation: {log_skew:.2f}")

#plot log-transformed histogram
plt.figure(figsize=(8, 5))
sns.histplot(cyber_df['log_security_incidents'], bins=30, kde=True)
plt.title("Histogram of Log-Transformed Security Incidents")
plt.xlabel("log(1 + Security Incidents)")
plt.ylabel("Frequency")
plt.show()

3.	Feature Engineering & Confounders
    a.	Create transformations (log_org_size, log_vulns).
    b.	Engineer interaction terms (e.g., vulnerabilities × endpoint coverage gaps).
    c.	Discuss how org size and budget confound relationships.


In [None]:
#create log-transformed columns

cyber_df['log_org_size'] = np.log1p(cyber_df['org_size'])
cyber_df['log_vulns'] = np.log1p(cyber_df['vuln_count'])

#visualize
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(cyber_df['log_org_size'], bins=30, kde=True, ax=axs[0])
axs[0].set_title("Log-transformed Org Size")

sns.histplot(cyber_df['log_vulns'], bins=30, kde=True, ax=axs[1])
axs[1].set_title("Log-transformed Vulnerability Count")

plt.tight_layout()
plt.show()

In [None]:
#create interaction term
#cyber_df['vuln_x_endpoint'] = cyber_df['vuln_count'] * (1 - cyber_df['endpoint_coverage'])
#print(cyber_df['endpoint_coverage'])

#visualize
#sns.scatterplot(x='vuln_x_endpoint', y='security_incidents', data=cyber_df)
#plt.title("Security Incidents vs. Vulnerability × Endpoint Gap Interaction")
#plt.xlabel("Vulnerability × Endpoint Gap")
#plt.ylabel("Security Incidents")
#plt.show()

#endpoint_coverage_gap if not already done
cyber_df['endpoint_coverage_gap'] = 100 - cyber_df['endpoint_coverage']

#scatterplot to show the relationship
plt.figure(figsize=(8, 5))
sns.scatterplot(data=cyber_df, x='endpoint_coverage', y='endpoint_coverage_gap')

#add reference line (y = 100 - x)
plt.plot([0, 100], [100, 0], color='red', linestyle='--', label='Gap = 100 - Coverage')

#labels and legend
plt.title('Endpoint Coverage vs. Coverage Gap')
plt.xlabel('Endpoint Coverage (%)')
plt.ylabel('Coverage Gap (%)')
plt.legend()
plt.grid(True)
plt.tight_layout()

plt.show()

3c.	Discuss how org size and budget confound relationships.

Organizational size (org_size) and budget (it_budget_per_emp) are likely confounding variables in this dataset. Larger organizations typically:
	•	Have higher vulnerability counts due to more assets
	•	Spend more on cybersecurity (not necessarily per employee, but in total)
	•	May still experience more incidents simply due to scale -
        More humans = more assets = more vulnerabilities = more RISK!

We should account for these confounders to avoid biased interpretations, such as overestimating the impact of vulnerabilities or underestimating the effectiveness of a security budget. Log-transforming org_size and including it as a predictor helps address nonlinear scaling and reduces skewness.
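
To make the confounding argument concrete, here is a small simulation sketch. The data-generating process is an assumption for illustration (it is not the lab dataset): organization size drives both vulnerabilities and incidents, and the true effect of vulnerabilities is fixed at 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed data-generating process (hypothetical): size drives both
# vulnerabilities and incidents; vulnerabilities have a true effect of 0.5.
log_size = rng.normal(5.0, 1.0, n)
vulns = 2.0 * log_size + rng.normal(0.0, 1.0, n)
incidents = 0.5 * vulns + 1.5 * log_size + rng.normal(0.0, 1.0, n)

def ols_coefs(y, *columns):
    """Least-squares coefficients with the intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

naive = ols_coefs(incidents, vulns)               # omits the confounder
adjusted = ols_coefs(incidents, vulns, log_size)  # controls for size

print(f"naive vuln coefficient:    {naive[1]:.2f}")
print(f"adjusted vuln coefficient: {adjusted[1]:.2f}")
```

Under these assumptions the naive slope absorbs part of the size effect and lands well above 0.5, while the adjusted slope recovers roughly the true value. That is the "overestimating the impact of vulnerabilities" risk described above.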

4.	Modeling – Naïve vs. Improved
    a.	Fit a baseline OLS model without confounders.
    b.	Fit an improved OLS model with transformations, confounders, and interactions.
    c.	Compare coefficients, R², AIC/BIC.


In [None]:
#4.	Modeling – Naïve vs. Improved
#    a.	Fit a baseline OLS model without confounders.
        
import statsmodels.api as sm

#define X and y for the baseline model (no confounders)
X_base = cyber_df[['vuln_count', 'mean_time_to_patch', 'endpoint_coverage_gap']]
y = cyber_df['security_incidents']

#add a constant to the predictors (for intercept)
X_base_const = sm.add_constant(X_base)

#fit the baseline model
baseline_model = sm.OLS(y, X_base_const).fit()

#print model summary
print(baseline_model.summary())

In [None]:
#  b.Fit an improved OLS model with transformations, confounders, 
#    and interactions.

#transformations
cyber_df['log_org_size'] = np.log(cyber_df['org_size'])
cyber_df['log_vuln_count'] = np.log(cyber_df['vuln_count'] + 1)  # +1 to avoid log(0)

# Interaction term
cyber_df['vuln_x_gap'] = cyber_df['vuln_count'] * cyber_df['endpoint_coverage_gap']

cyber_df_original = cyber_df.copy() #in case we need the original later

#define X and y for improved model ---
predictors = [
    'log_vuln_count',
    'mean_time_to_patch',
    'endpoint_coverage_gap',
    'vuln_x_gap',
    'log_org_size',
    'it_budget_per_emp'
]
X_improved = cyber_df[predictors]
y = cyber_df['security_incidents']

#add constant (intercept)
X_improved_const = sm.add_constant(X_improved)

#fit model
improved_model = sm.OLS(y, X_improved_const).fit()

#show summary
print(improved_model.summary())

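Compare coefficients, R², AIC/BIC.

The improved model has slightly higher R² and adjusted R², but its AIC is essentially unchanged and its BIC is higher, so the extra terms buy only a small gain in fit. It does add predictors with a clear interpretation (log_vuln_count, log_org_size, and the vuln_x_gap interaction).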

In [None]:
#the desired model comparison metrics
metrics = ['R-squared', 'Adjusted R²', 'AIC', 'BIC']

#setup baseline and improved model values to plot
baseline_values = [baseline_model.rsquared, 
                   baseline_model.rsquared_adj, 
                   baseline_model.aic, 
                   baseline_model.bic]
improved_values = [improved_model.rsquared, 
                   improved_model.rsquared_adj, 
                   improved_model.aic, 
                   improved_model.bic]

#create a comparison DataFrame
comparison_df = pd.DataFrame({
    'Metric': metrics,
    'Baseline Model': baseline_values,
    'Improved Model': improved_values
})

#try a plot
fig, axs = plt.subplots(1, 2, figsize=(14, 5))

#plot R² comparison
comparison_df.iloc[:2].plot(x='Metric', kind='bar', ax=axs[0], color=['#1f77b4', '#2ca02c'])
axs[0].set_title('R² Comparison')
axs[0].set_ylabel('Score')
axs[0].legend(loc='upper left')

#plot AIC/BIC comparison
comparison_df.iloc[2:].plot(x='Metric', kind='bar', ax=axs[1], color=['#1f77b4', '#2ca02c'])
axs[1].set_title('AIC/BIC Comparison')
axs[1].set_ylabel('Score (Lower is Better)')
axs[1].legend(loc='upper right')

plt.tight_layout()
plt.show()

I tried a visual comparison, but it didn't work well due to the scale: the differences between the two models are small relative to the height of the bars. I left it in so you can see the steps I took.

Next, I'll print the metrics in a table for easier reading, followed by some written thoughts on the model differences.

In [None]:
#metric names
metrics = ['R²', 'Adjusted R²', 'AIC', 'BIC']

#create a comparison DataFrame
coefficient_df = pd.DataFrame({
    'Metric': metrics,
    'Baseline Model': baseline_values,
    'Improved Model': improved_values
})

#format numbers for readability
coefficient_df['Baseline Model'] = coefficient_df['Baseline Model'].apply(lambda x: round(x, 3) if isinstance(x, float) else x)
coefficient_df['Improved Model'] = coefficient_df['Improved Model'].apply(lambda x: round(x, 3) if isinstance(x, float) else x)

#display the metric comparison table
print(coefficient_df)

Thoughts on comparing the two models:

R² is 0.255 for the baseline and 0.272 for the improved model, a slight improvement. The improved model explains about 1.7 percentage points more of the variance in the outcome (security_incidents) than the baseline model.

Adjusted R² is 0.246 for the baseline and 0.253 for the improved model. It also increases, though only by 0.007, suggesting that the additional predictors add a little explanatory value even after accounting for model complexity.

AIC is 919.032 for the baseline and 919.461 for the improved model, only a slight increase. A higher AIC usually points to a worse model, but a difference this small (~0.4) means AIC does not meaningfully separate the two.

BIC base is 932.955, improved is 943.825, which is higher in the (presumably) improved model: BIC penalizes complexity more heavily than AIC. The increase (~11 points) indicates that the improved model may be too complex for the added benefit.
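
The arithmetic behind these two comparisons can be checked directly. The sketch below assumes the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, that k counts the intercept (4 parameters in the baseline, 7 in the improved model), and that both models were fit on the same rows.

```python
import math

# Values reported in the comparison table above.
aic_base, aic_improved = 919.032, 919.461
bic_base, bic_improved = 932.955, 943.825

# Parameter counts including the intercept: 3 predictors vs. 6 predictors.
k_base, k_improved = 4, 7
extra_params = k_improved - k_base

delta_aic = aic_improved - aic_base
delta_bic = bic_improved - bic_base

# Both deltas share the same log-likelihood term and differ only in the
# penalty per parameter: 2 for AIC, ln(n) for BIC.
loglik_gain = (2 * extra_params - delta_aic) / 2
implied_n = math.exp((delta_bic - delta_aic) / extra_params + 2)

print(f"delta AIC: {delta_aic:.3f}")
print(f"delta BIC: {delta_bic:.3f}")
print(f"log-likelihood gain from {extra_params} extra terms: {loglik_gain:.3f}")
print(f"sample size implied by the two penalties: {implied_n:.0f}")
```

Under those assumptions the three extra terms raise the log-likelihood by a bit under 3, which almost exactly cancels AIC's penalty of 2 per parameter but falls well short of BIC's ln(n) penalty. The implied sample size of about 240 rows is a useful sanity check against the actual row count of the dataset.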


5.	Assumptions Diagnostics
    a.	Check linearity, normality, and homoscedasticity (residual plots, QQ plot, Breusch–Pagan).
    b.	Evaluate multicollinearity with VIF.


In [None]:

#setup components
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import OLSInfluence
from scipy import stats

#get residuals and fitted values from improved model
residuals = improved_model.resid
fitted = improved_model.fittedvalues

#linearity & Homoscedasticity
plt.figure(figsize=(8, 5))
sns.scatterplot(x=fitted, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Residuals vs Fitted (Linearity & Homoscedasticity)")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

#Normality: Q-Q Plot
sm.qqplot(residuals, line='s')
plt.title("Q-Q Plot (Normality of Residuals)")
plt.show()

#Breusch–Pagan Test for Homoscedasticity
#requires residuals and the design matrix (including the constant)
#use a separate name so the X_improved_const DataFrame is not overwritten with an array
exog = improved_model.model.exog
bp_test = het_breuschpagan(residuals, exog)

labels = ['Lagrange multiplier statistic', 'p-value', 
          'f-value', 'f p-value']
print("\nBreusch–Pagan Test Results:")
for name, value in zip(labels, bp_test):
    print(f"{name}: {value:.4f}")

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

#rebuild the design matrix (with intercept) as a DataFrame so column names are available
X_improved_const = sm.add_constant(X_improved)

#create a DataFrame for VIF values
#note: the VIF reported for the 'const' row is not meaningful and can be ignored
vif_data = pd.DataFrame()
vif_data["Variable"] = X_improved_const.columns
#pass a plain array; older statsmodels versions do not accept a DataFrame here
vif_data["VIF"] = [variance_inflation_factor(X_improved_const.values, i) 
                   for i in range(X_improved_const.shape[1])]

#format all float values to 3 decimal places
pd.set_option("display.float_format", "{:.3f}".format)

#display VIF table
print(vif_data)

log_vuln_count shows high multicollinearity, which is likely a problem: its coefficient and standard error may be unstable.

log_org_size is just barely showing high multicollinearity, which should be investigated further.
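
For intuition about what the VIF column measures, here is a sketch that computes it by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The data is synthetic and hypothetical (not the lab data), and the helper name vif is my own. The third column is an interaction built from a raw count, mirroring how vuln_x_gap is built from vuln_count.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (X should not contain an intercept column).

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j
    on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical synthetic predictors, not the lab data.
rng = np.random.default_rng(1)
vulns = rng.uniform(1, 50, 300)
gap = rng.uniform(0, 40, 300)
X = np.column_stack([np.log1p(vulns), gap, vulns * gap])

v = vif(X)
for name, value in zip(["log_vulns", "gap", "vulns_x_gap"], v):
    print(f"{name:12s} VIF = {value:.2f}")
```

Even with independently drawn vulns and gap, the product term is largely predictable from its components, so its VIF is inflated. This is one plausible reason for the high value seen on the vulnerability terms here, though the lab's own VIF table is what counts.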



6.	Outliers & Influence
    a.	Identify leverage points and high Cook’s D values.
    b.	Discuss whether to drop or retain influential cases.
    c.	Refit model and compare metrics.


In [None]:
#using our improved model
influence = improved_model.get_influence()

#leverage values
leverage = influence.hat_matrix_diag

#Cook's Distance
cooks_d, _ = influence.cooks_distance

# Create a summary DataFrame
influence_df = pd.DataFrame({
    "Leverage": leverage,
    "Cooks_D": cooks_d
})

#flag high leverage: rule of thumb: leverage > 2k/n
n = X_improved_const.shape[0]
k = X_improved_const.shape[1]
leverage_threshold = 2 * k / n
influence_df["High_Leverage"] = influence_df["Leverage"] > leverage_threshold

#flag high Cook's D: rule of thumb: Cook's D > 4/n
cooks_d_threshold = 4 / n
influence_df["High_CooksD"] = influence_df["Cooks_D"] > cooks_d_threshold

#display top suspicious points
print(influence_df.sort_values("Cooks_D", ascending=False).head(10))

#visualize Cook’s D and leverage
plt.figure(figsize=(10, 5))
#plt.stem(np.arange(n), cooks_d, markerfmt=",", basefmt=" ", use_line_collection=True)
plt.stem(np.arange(n), cooks_d, markerfmt=",", basefmt=" ")
plt.axhline(y=cooks_d_threshold, color='r', linestyle='--', label="Cook's D Threshold")
plt.xlabel("Observation Index")
plt.ylabel("Cook's Distance")
plt.title("Cook's Distance by Observation")
plt.legend()
plt.show()

plt.figure(figsize=(10, 5))
plt.scatter(leverage, cooks_d, alpha=0.6)
plt.axhline(y=cooks_d_threshold, color='red', linestyle='--', label="Cook's D Threshold")
plt.axvline(x=leverage_threshold, color='orange', linestyle='--', label="Leverage Threshold")
plt.xlabel("Leverage")
plt.ylabel("Cook's Distance")
plt.title("Influence Plot (Leverage vs Cook's D)")
plt.legend()
plt.show()

6b.	Discuss whether to drop or retain influential cases.
  
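All 10 points listed above have Cook’s D values that exceed the 4/n threshold, so they are considered influential.
	•	Only 4 of the 10 also have high leverage: indexes 208, 39, 186, and 129.
	•	Some points, like 106 or 200, have low leverage but still show moderate influence (via Cook’s D).
	•	The refit below drops only the points flagged on both criteria (high leverage and high Cook’s D) and retains the rest.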
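
To show where the two flags come from, here is a sketch that computes leverage and Cook's D by hand with NumPy and applies the same 2k/n and 4/n rules of thumb. The data is synthetic with one planted unusual row, and the helper name leverage_and_cooks is hypothetical.

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Return (leverage, cooks_d) for OLS of y on X; X must include the intercept column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Hat matrix diagonal: h_i = x_i (X'X)^-1 x_i'
    xtx_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    resid = y - X @ (xtx_inv @ X.T @ y)
    s2 = resid @ resid / (n - p)              # residual variance estimate
    cooks = resid**2 / (p * s2) * h / (1.0 - h) ** 2
    return h, cooks

# Hypothetical synthetic data with one planted unusual row.
rng = np.random.default_rng(2)
n = 100
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
x[0], y[0] = 6.0, -5.0                        # far out in x and off the line

X = np.column_stack([np.ones(n), x])
h, cooks = leverage_and_cooks(X, y)

p = X.shape[1]
flag_leverage = h > 2 * p / n
flag_cooks = cooks > 4 / n
print("high leverage rows:", np.flatnonzero(flag_leverage))
print("high Cook's D rows:", np.flatnonzero(flag_cooks))
print("flagged on both:   ", np.flatnonzero(flag_leverage & flag_cooks))
```

Leverage depends only on the predictors, while Cook's D combines leverage with the size of the residual. That is why a row can exceed the Cook's D threshold with low leverage, as some of the points above do.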



In [None]:
#start from the saved copy of the dataset and prepare X and y
new_cyber_df = cyber_df_original.copy()

y = new_cyber_df["security_incidents"]
X = new_cyber_df[["log_vuln_count", "mean_time_to_patch", "endpoint_coverage_gap", 
        "vuln_x_gap", "log_org_size", "it_budget_per_emp"]]
X_const = sm.add_constant(X)

#fit the original model on all rows and print it for later comparison
model_full = sm.OLS(y, X_const).fit()
print(model_full.summary())

#get influence measures (Cook’s D and leverage)
influence = model_full.get_influence()
cooks_d = influence.cooks_distance[0]
leverage = influence.hat_matrix_diag

#create a DataFrame of influence measures
influence_df = pd.DataFrame({
    'leverage': leverage,
    'cooks_d': cooks_d
})

#define thresholds
n = X_const.shape[0]
p = X_const.shape[1]
leverage_thresh = 2 * p / n
cooksd_thresh = 4 / n

#identify points to drop
to_drop = influence_df[(influence_df['leverage'] > leverage_thresh) &
                       (influence_df['cooks_d'] > cooksd_thresh)].index

#drop those rows
df_refit = new_cyber_df.drop(index=to_drop)

#refit the model
y_refit = df_refit["security_incidents"]
X_refit = df_refit[["log_vuln_count", "mean_time_to_patch", "endpoint_coverage_gap", 
                    "vuln_x_gap", "log_org_size", "it_budget_per_emp"]]
X_refit_const = sm.add_constant(X_refit)

model_refit = sm.OLS(y_refit, X_refit_const).fit()

#print refitted model summary
print(model_refit.summary())

7.	Generalization Check
    a.	Train/test split; calculate RMSE on test data.
    b.	Reflect on in-sample vs. out-of-sample fit.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

#define features and target again
features = ["log_vuln_count", "mean_time_to_patch", "endpoint_coverage_gap", 
            "vuln_x_gap", "log_org_size", "it_budget_per_emp"]
target = "security_incidents"

X = cyber_df[features]
y = cyber_df[target]

#add constant for statsmodels OLS
X_const = sm.add_constant(X)

#split the data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X_const, y, test_size=0.2, random_state=42)

In [None]:
#fit model on training data
model_train = sm.OLS(y_train, X_train).fit()

#predict on test data
y_pred_test = model_train.predict(X_test)

In [None]:
#calculate RMSE for test data
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

#also calculate RMSE on training data for comparison
y_pred_train = model_train.predict(X_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))

print(f"RMSE (Train): {rmse_train:.2f}")
print(f"RMSE (Test):  {rmse_test:.2f}")

RMSE results indicate:
	•	RMSE (Train): 1.64
	•	RMSE (Test): 1.45

My training RMSE is 1.64 and my test RMSE is 1.45. It is somewhat unusual for the test RMSE to be lower than the training RMSE, but not necessarily a problem. This suggests my model is not overfitting and seems to generalize reasonably well to unseen data. However, the sample may be too small to draw conclusions from a single train/test split with any confidence.
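
One way to see how much a single split can move is to repeat the 80/20 split many times and look at the spread of the test RMSE. The sketch below uses synthetic data and NumPy least squares rather than the lab data. The sample size, coefficients, and noise level are all assumptions chosen for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical synthetic data sized like a small lab dataset (n is assumed).
rng = np.random.default_rng(3)
n = 240
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([2.0, 0.8, -0.5, 0.3]) + rng.normal(0, 1.5, n)

train_scores, test_scores = [], []
for seed in range(200):
    perm = np.random.default_rng(seed).permutation(n)
    test_idx, train_idx = perm[: n // 5], perm[n // 5 :]   # 20% / 80%
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    train_scores.append(rmse(y[train_idx], X[train_idx] @ beta))
    test_scores.append(rmse(y[test_idx], X[test_idx] @ beta))

train_scores = np.array(train_scores)
test_scores = np.array(test_scores)
print(f"train RMSE: mean {train_scores.mean():.2f}")
print(f"test RMSE:  mean {test_scores.mean():.2f}, "
      f"min {test_scores.min():.2f}, max {test_scores.max():.2f}")
print(f"share of splits where test < train: {(test_scores < train_scores).mean():.0%}")
```

If a noticeable share of splits gives a test RMSE below the training RMSE, then seeing that once with random_state=42 says little on its own. The same loop could be run on the lab data to put a range around the 1.45 figure.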



In-Sample Fit (Training RMSE = 1.64):
	•	This represents how well the model performs on data it was trained on.
	•	Our training RMSE of 1.64 indicates that the model can predict reasonably well within the sample it learned from.
	•	However, relying solely on this metric can be misleading if the model is overfitting (i.e., too closely matching the training data).

Out-of-Sample Fit (Test RMSE = 1.45):
	•	This measures performance on unseen data, which better reflects how the model might perform in the real world.
	•	Our test RMSE of 1.45 is slightly better than the training RMSE, suggesting the model generalizes well.
	•	This is a positive sign — the model is not just memorizing the training data but capturing real patterns that apply to new data.


This tells me:
	•	No overfitting is apparent. If the test RMSE were much higher than the training RMSE, that would indicate poor generalization.
	•	The model may even be slightly underfitting, or the test set might be less noisy than the training set (easier to predict).
	•	Model performance is stable across datasets — a very good sign.
    


Our model’s performance is consistent and generalizes well from training to test data. From here we could:
	•	Report the train/test agreement as evidence of stable performance, keeping in mind that R² is only about 0.27.
	•	Consider further improvements (e.g., feature engineering or regularization).
	•	Or stop here if the goal was to build a reasonably accurate and interpretable model.


8.	Think About:
    a.	Which variables appear most predictive of incidents?
    b.	How do confounders shift interpretations?
    c.	What model would you present to leadership—and what caveats remain?

8.	Think About:
    a.	Which variables appear most predictive of incidents?

endpoint_coverage (well, the gap) is one of the strongest predictors. A positive and significant coefficient indicates that departments with larger endpoint security gaps tend to have more incidents. Makes sense — more unprotected endpoints = more risk.

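log_vuln_count is statistically significant with a negative coefficient. This seems counterintuitive, but the model also contains vuln_x_gap, which is built from vuln_count, and log_vuln_count showed high multicollinearity in the VIF check. Its sign therefore should not be read in isolation, since part of the vulnerability effect is carried by the interaction term. Still, it shows that vulnerability volume relates to incidents.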

vuln_x_gap (interaction term)
Strong, significant, and positive. Indicates that vulnerabilities are especially risky when endpoint coverage is also low. This interaction captures a compounding effect and is crucial for understanding risk.

log_org_size is marginally significant. It also makes sense that larger organizations (even log-transformed) appear more likely to report more incidents. This could be due to scale: more devices, more staff, more chances for attack.

mean_time_to_patch and it_budget_per_emp
These were not statistically significant in our model — suggesting they may not be reliable standalone predictors in this dataset.


8.	Think About:
    b.	How do confounders shift interpretations?

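Adding the confounders (log_org_size and it_budget_per_emp) means the coefficients on the risk factors are estimated with organization size and budget held fixed, so they are less likely to simply reflect scale. This gives us better insight into which factors matter, which leads to more effective and efficient prioritization and utilization of our valuable resources.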



8.	Think About:
    c.	What model would you present to leadership—and what caveats remain?

What Model Would I Present to Leadership?

I would present the Improved OLS Regression Model that includes:
	•	Log-transformed predictors (e.g., log_vuln_count, log_org_size)
	•	Interaction term (vuln_x_gap) to capture compounding risk
	•	Key confounders like org_size and it_budget_per_emp
	•	The endpoint_coverage_gap variable, which had one of the strongest coefficients (which seems to be obvious)

This model:
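	•	Explains slightly more variance than the naïve model (R² 0.272 vs. 0.255; adjusted R² 0.253 vs. 0.246)
	•	Has an essentially unchanged AIC (919.461 vs. 919.032) but a higher BIC (943.825 vs. 932.955), so the better fit comes with a complexity penalty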
	•	Uses interpretable variables that align with operational and strategic priorities
	•	Provides actionable insight into the areas most predictive of incident risk (like gaps in endpoint coverage and high vulnerability load)


What Caveats Remain?

Despite the model improvements, several limitations should be transparently communicated:

1. Causality Is Not Guaranteed
	•	The model identifies associations, not causal relationships.
	•	Leadership should not infer direct cause-effect (e.g., “fixing endpoint gaps will guarantee fewer incidents”).  This point should be clearly driven home. This is a "risk tolerance" conversation, not an absolute security conversation.

2. Potential Data Quality Issues
	•	Some predictors required imputation or transformation (e.g., missing values, skewed distributions).
	•	The data may still contain biases, inconsistencies, or outliers.  Although, most of the data is derived from raw logs. This reduces the likelihood of error in the original data captured.

3. Outliers & Influential Cases
	•	Some influential data points (high leverage or Cook’s D) exist.
	•	While these were identified and could be excluded for robustness checks, real-world decisions must consider whether those departments are edge cases or high-risk units. Let's not lose sight of the fact we are balancing investment against risk. There may not be a clear "formula" to apply to a final decision. Our job is to get the best information we can "in the language of risk" so decision makers can make INFORMED decisions and not just "hope" for bad things to not happen.

4. Multicollinearity Risk
	•	Some variables show multicollinearity: log_vuln_count had a high VIF and log_org_size was borderline (it may correlate with budget or vuln_count).
	•	Coefficients on these variables may be unstable, so VIF should be monitored in future models.

5. Need for Continuous Validation
	•	The threat landscape evolves rapidly.
	•	This is NOT a one-time evaluation. It should be repeated on a regular schedule, with decisions revisited as the results change.
	•	Recommend setting up a model retraining schedule (e.g., quarterly) as new incident and risk data is collected.

My Final Recommendation to Leadership:

“This model helps us quantify and prioritize cybersecurity risk across departments. The strongest predictors of incidents appear to be gaps in endpoint coverage and volume of vulnerabilities, especially when both are present. However, we should treat this model as a decision-support tool, not an absolute truth. Regular updates, ongoing data hygiene, and domain expertise should continue to inform our cyber risk strategy.”
