Section 1: Multiple Regression Walk Through
Cybersecurity incidents are a growing threat to safe and responsible business operations, but even the critical job of protecting sensitive data doesn’t come with an unlimited budget. How should you allocate cybersecurity spending across departments? Three questions guide the analysis:
•	Why are some departments getting hit harder than others?
•	Which risk factors matter the most?
•	Can we predict which departments are likely to see more incidents in the near future?

In your lab notebook, break your work into the following stages, using the outline below.

1.	Data Familiarization & Hygiene
    a.	Inspect schema, ranges, and datatypes.
    b.	Detect and fix issues: missing values, impossible percentages, and text in numeric fields.
    c.	Apply simple imputation and clipping strategies.
2.	Exploratory Analysis (EDA)
    a.	Plot distributions (incidents histogram).
    b.	Explore relationships (incidents vs. vulnerabilities, patch time).
    c.	Spot skewness and potential transformations.
3.	Feature Engineering & Confounders
    a.	Create transformations (log_org_size, log_vulns).
    b.	Engineer interaction terms (e.g., vulnerabilities × endpoint coverage gaps).
    c.	Discuss how org size and budget confound relationships.
4.	Modeling – Naïve vs. Improved
    a.	Fit a baseline OLS model without confounders.
    b.	Fit an improved OLS model with transformations, confounders, and interactions.
    c.	Compare coefficients, R², AIC/BIC.
5.	Assumptions Diagnostics
    a.	Check linearity, normality, and homoscedasticity (residual plots, QQ plot, Breusch–Pagan).
    b.	Evaluate multicollinearity with VIF.
6.	Outliers & Influence
    a.	Identify leverage points and high Cook’s D values.
    b.	Discuss whether to drop or retain influential cases.
    c.	Refit model and compare metrics.
7.	Generalization Check
    a.	Train/test split; calculate RMSE on test data.
    b.	Reflect on in-sample vs. out-of-sample fit.
8.	Think About:
    a.	Which variables appear most predictive of incidents?
    b.	How do confounders shift interpretations?
    c.	What model would you present to leadership—and what caveats remain?


1.	Data Familiarization & Hygiene
    a.	Inspect schema, ranges, and datatypes.
    b.	Detect and fix issues: missing values, impossible percentages, and text in numeric fields.
    c.	Apply simple imputation and clipping strategies.



In [None]:
import pandas as pd

# Load CSV
df = pd.read_csv("cybersec.csv")

# 1a. Inspect schema and types
print("Schema and Data Types:\n", df.dtypes, "\n")

# Summary statistics
print("Summary Statistics:\n", df.describe(include='all'), "\n")

# Missing values
print("Missing Values:\n", df.isnull().sum(), "\n")

# Sample unique values for object columns
object_columns = df.select_dtypes(include=['object']).columns
for col in object_columns:
    print(f"{col} unique values:", df[col].unique()[:10])
print()

# 1b. Detect & fix issues
# Convert text 'null' to a real missing value first, so it is counted and imputed below
# (not needed for this file, but included as good hygiene)
df.replace("null", pd.NA, inplace=True)

# Clip impossible rates into the valid [0, 1] range
df['phishing_sim_click_rate'] = df['phishing_sim_click_rate'].clip(lower=0.0, upper=1.0)

# 1c. Impute missing values (mean)
df['vuln_count'] = df['vuln_count'].fillna(df['vuln_count'].mean())
df['training_completion_rate'] = df['training_completion_rate'].fillna(df['training_completion_rate'].mean())

# Check if all missing values handled
print("Post-cleaning Missing Values:\n", df.isnull().sum())

# Optional preview (printed explicitly: a bare df.head() in the middle of a cell displays nothing)
print(df.head())

# actual column names
print(df.columns.tolist())

# Check the min and max of endpoint_coverage to confirm its scale.
# The interaction term in step 3 uses (1 - endpoint_coverage), which assumes a 0-1 fraction.
min_coverage = df['endpoint_coverage'].min()
max_coverage = df['endpoint_coverage'].max()

print(f"Endpoint coverage ranges from {min_coverage} to {max_coverage}.")
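
Step 1b also lists text in numeric fields, which the cell above only touches on. The sketch below shows one way to handle it on a small made-up frame: the column names mirror the lab data, but every value is hypothetical, and the median fill is an alternative to the mean imputation used above, not a correction of it.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the lab data, values are made up.
toy = pd.DataFrame({
    "vuln_count": ["12", "n/a", "30", None],
    "phishing_sim_click_rate": [0.10, 1.40, -0.05, 0.25],
})

# Text in a numeric field: anything that cannot be parsed becomes NaN.
toy["vuln_count"] = pd.to_numeric(toy["vuln_count"], errors="coerce")

# Impossible rates: force values into the valid [0, 1] range.
toy["phishing_sim_click_rate"] = toy["phishing_sim_click_rate"].clip(lower=0.0, upper=1.0)

# Simple imputation: fill the gaps with the column median.
toy["vuln_count"] = toy["vuln_count"].fillna(toy["vuln_count"].median())

print(toy)
```

Coercing first matters: until the column is numeric, `mean()`, `median()`, and `clip()` either fail or silently skip the bad rows.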

2.	Exploratory Analysis (EDA)
    a.	Plot distributions (incidents histogram).
    b.	Explore relationships (incidents vs. vulnerabilities, patch time).
    c.	Spot skewness and potential transformations.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import skew

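# Continue with the cleaned df from step 1. Reloading cybersec.csv here would
# discard the clipping and imputation applied above.
# (Adjust the load in step 1 if reading the data differently in the Snowflake UI.)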

# --- 2a: Histogram of Security Incidents ---
plt.figure(figsize=(8, 5))
sns.histplot(df['security_incidents'], bins=30, kde=True)
plt.title("Histogram of Security Incidents (Original)")
plt.xlabel("Number of Security Incidents")
plt.ylabel("Frequency")
plt.show()

# --- 2b: Explore relationships via scatter plots ---

# Incidents vs. vuln_count
plt.figure(figsize=(8, 5))
sns.scatterplot(x='vuln_count', y='security_incidents', data=df)
plt.title("Security Incidents vs. Vulnerability Count")
plt.xlabel("Vulnerability Count")
plt.ylabel("Security Incidents")
plt.show()

# Incidents vs. mean_time_to_patch
plt.figure(figsize=(8, 5))
sns.scatterplot(x='mean_time_to_patch', y='security_incidents', data=df)
plt.title("Security Incidents vs. Mean Time to Patch")
plt.xlabel("Mean Time to Patch (days)")
plt.ylabel("Security Incidents")
plt.show()

# --- 2c: Skewness and Log Transformation ---
# Original skewness
orig_skew = skew(df['security_incidents'].dropna())
print(f"Skewness of security_incidents (original): {orig_skew:.2f}")

# Log transform
df['log_security_incidents'] = np.log1p(df['security_incidents'])

# Skewness after transformation
log_skew = skew(df['log_security_incidents'].dropna())
print(f"Skewness after log transformation: {log_skew:.2f}")

# Plot log-transformed histogram
plt.figure(figsize=(8, 5))
sns.histplot(df['log_security_incidents'], bins=30, kde=True)
plt.title("Histogram of Log-Transformed Security Incidents")
plt.xlabel("log(1 + Security Incidents)")
plt.ylabel("Frequency")
plt.show()


3.	Feature Engineering & Confounders
    a.	Create transformations (log_org_size, log_vulns).
    b.	Engineer interaction terms (e.g., vulnerabilities × endpoint coverage gaps).
    c.	Discuss how org size and budget confound relationships.


In [None]:
# Create log-transformed columns
df['log_org_size'] = np.log1p(df['org_size'])
df['log_vulns'] = np.log1p(df['vuln_count'])

# Visualize histograms
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(df['log_org_size'], bins=30, kde=True, ax=axs[0])
axs[0].set_title("Log-transformed Org Size")

sns.histplot(df['log_vulns'], bins=30, kde=True, ax=axs[1])
axs[1].set_title("Log-transformed Vulnerability Count")

plt.tight_layout()
plt.show()

In [None]:
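# Create interaction term: vulnerabilities × endpoint coverage gap.
# Assumes endpoint_coverage is a 0-1 fraction, so (1 - coverage) is the uncovered share.
df['vuln_x_endpoint'] = df['vuln_count'] * (1 - df['endpoint_coverage'])
print(df['endpoint_coverage'].describe())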

# Optionally visualize
sns.scatterplot(x='vuln_x_endpoint', y='security_incidents', data=df)
plt.title("Security Incidents vs. Vulnerability × Endpoint Gap Interaction")
plt.xlabel("Vulnerability × Endpoint Gap")
plt.ylabel("Security Incidents")
plt.show()
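
For step 3c, a small simulation can make the confounding argument concrete. The sketch below uses synthetic data, not the lab dataset: it assumes that larger departments have both more vulnerabilities and more incidents, and that the true effect of vulnerabilities is 0.1. The helper `ols_coefs` is a hypothetical name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (an assumption for illustration, not the lab dataset):
# larger departments have more vulnerabilities AND more incidents.
log_org_size = rng.normal(5.0, 1.0, n)
vulns = 10.0 * log_org_size + rng.normal(0.0, 5.0, n)
incidents = 0.1 * vulns + 2.0 * log_org_size + rng.normal(0.0, 1.0, n)

def ols_coefs(y, *columns):
    """Least-squares coefficients for y ~ intercept + columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

naive = ols_coefs(incidents, vulns)                   # omits the confounder
adjusted = ols_coefs(incidents, vulns, log_org_size)  # controls for it

print(f"vulns slope, naive:    {naive[1]:.3f}")
print(f"vulns slope, adjusted: {adjusted[1]:.3f}  (simulated true value: 0.1)")
```

With org size left out, the vulnerability slope absorbs part of the size effect and comes out well above 0.1. The same mechanism could inflate or mask coefficients in the real data when org size or budget is omitted.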

4.	Modeling – Naïve vs. Improved
    a.	Fit a baseline OLS model without confounders.
    b.	Fit an improved OLS model with transformations, confounders, and interactions.
    c.	Compare coefficients, R², AIC/BIC.
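
A minimal sketch of the naïve-versus-improved comparison, using the statsmodels formula API. It runs on a synthetic stand-in frame called `demo` (a hypothetical name) so that it executes without cybersec.csv. The column names follow the earlier steps, the coefficients in the data-generating lines are assumptions, and a budget column is left out because its name is not shown in the lab data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the cleaned lab data (all values are assumptions).
demo = pd.DataFrame({
    "org_size": rng.lognormal(5.0, 1.0, n),
    "endpoint_coverage": rng.uniform(0.5, 1.0, n),
    "mean_time_to_patch": rng.uniform(1.0, 60.0, n),
})
demo["log_org_size"] = np.log1p(demo["org_size"])
demo["vuln_count"] = np.exp(0.6 * demo["log_org_size"] + rng.normal(0.0, 0.4, n))
demo["log_vulns"] = np.log1p(demo["vuln_count"])
demo["vuln_x_endpoint"] = demo["vuln_count"] * (1 - demo["endpoint_coverage"])
demo["security_incidents"] = (
    1.0 * demo["log_vulns"]
    + 1.5 * demo["log_org_size"]
    + 0.05 * demo["mean_time_to_patch"]
    + 0.1 * demo["vuln_x_endpoint"]
    + rng.normal(0.0, 1.0, n)
)

# 4a. Baseline: raw predictors, no confounders.
naive = smf.ols("security_incidents ~ vuln_count + mean_time_to_patch", data=demo).fit()

# 4b. Improved: log transform, confounder (org size), and interaction term.
improved = smf.ols(
    "security_incidents ~ log_vulns + mean_time_to_patch + log_org_size + vuln_x_endpoint",
    data=demo,
).fit()

# 4c. Side-by-side fit metrics (lower AIC/BIC is better).
comparison = pd.DataFrame(
    {
        "R2": [naive.rsquared, improved.rsquared],
        "adj_R2": [naive.rsquared_adj, improved.rsquared_adj],
        "AIC": [naive.aic, improved.aic],
        "BIC": [naive.bic, improved.bic],
    },
    index=["naive", "improved"],
)
print(comparison.round(3))
print(improved.params.round(3))
```

On the real data, swap `demo` for the cleaned `df`. Calling `naive.summary()` and `improved.summary()` prints the full coefficient tables for the comparison in 4c.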


5.	Assumptions Diagnostics
    a.	Check linearity, normality, and homoscedasticity (residual plots, QQ plot, Breusch–Pagan).
    b.	Evaluate multicollinearity with VIF.


6.	Outliers & Influence
    a.	Identify leverage points and high Cook’s D values.
    b.	Discuss whether to drop or retain influential cases.
    c.	Refit model and compare metrics.


7.	Generalization Check
    a.	Train/test split; calculate RMSE on test data.
    b.	Reflect on in-sample vs. out-of-sample fit.
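
A minimal sketch of the generalization check, assuming an 80/20 split and a two-predictor model. The data is synthetic, and `demo` and `rmse` are hypothetical names introduced here. On the lab data the same pattern applies to the cleaned `df` and the improved model formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(123)
n = 600

# Synthetic stand-in data (assumed coefficients, noise standard deviation of 1).
demo = pd.DataFrame({
    "log_vulns": rng.normal(3.0, 0.8, n),
    "log_org_size": rng.normal(5.0, 1.0, n),
})
demo["security_incidents"] = (
    1.0 * demo["log_vulns"] + 1.5 * demo["log_org_size"] + rng.normal(0.0, 1.0, n)
)

# 7a. 80/20 train/test split on shuffled row positions.
shuffled = rng.permutation(n)
cut = int(0.8 * n)
train = demo.iloc[shuffled[:cut]]
test = demo.iloc[shuffled[cut:]]

# Fit on the training rows only.
model = smf.ols("security_incidents ~ log_vulns + log_org_size", data=train).fit()

def rmse(actual, predicted):
    """Root mean squared error between two aligned series."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

rmse_train = rmse(train["security_incidents"], model.predict(train))
rmse_test = rmse(test["security_incidents"], model.predict(test))

# 7b. A test RMSE far above the training RMSE points to overfitting.
print(f"RMSE train: {rmse_train:.3f}")
print(f"RMSE test:  {rmse_test:.3f}")
```

If the target is log-transformed, compute RMSE on the same scale for both splits, and convert back with `np.expm1` before reporting errors in incident counts.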


8.	Think About:
    a.	Which variables appear most predictive of incidents?
    b.	How do confounders shift interpretations?
    c.	What model would you present to leadership—and what caveats remain?