Paul, this has been a great exercise for me and I'm eager for feedback.  I selected the cyber data set as this aligns perfectly with my day-to-day work and is a topic I am very passionate about.  One of the thoughts I've been having as we've worked through the semester: How do we know we've selected the best variables for building our models?  PCA seems to be the best answer I've seen so far.  In fact, it seems so incredible I wonder what could be better?  I'm sure there are more and better ways yet to learn, but this one sure seems to do the trick quite well.

Also, I tried to point out where I took different approaches with the cyber data the second time around.  It seemed the more I experimented with this PCA approach, the better the model helped me understand where to recommend we align our resources to be the most secure.  Admittedly, I ramble on toward the end of the notebook with multiple recommendation attempts (ignore anything after Conclusion if you like), but it seems there is always an answer just better than the last one, no matter how many times we adjust the approach.

# Exercise 08: Principal Component Analysis (PCA)

Pick from the data in one of our previous assignments or the midterm (agri_productivity.csv, **cybersec.csv**, midterm_construction_projects.csv). Conduct the same kind of PCA we did in the lab, focusing only on continuous variables.
1.	Explore the data
2.	Standardize the data so we can do PCA
3.	Fit PCA and look at the explained variance 
4.	Project the original data onto the necessary principal components
5.	Explain in detail what the PCA has told you about the data, give the principal components some intuitive meaning, and explain how you would use your insights from the PCA

I have selected the cyber data set (Lab 05) on which to apply a Principal Component Analysis.  To start, I will repeat the steps I took in Lab 05 to load, analyze, clean, transform and check the data BEFORE I begin the PCA.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import skew

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import PCA as PCAModel

1.	Data Familiarization & Hygiene
    a.	Inspect schema, ranges, and datatypes.
    b.	Detect and fix issues: missing values, impossible percentages, and text in numeric fields.
    c.	Apply simple imputation and clipping strategies.
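
One way to handle text in numeric fields (item b) before imputing is to coerce the column with `pd.to_numeric(..., errors="coerce")`, then clip and fill. The sketch below runs on a small made-up frame; the column name `click_rate` and its values are illustrative assumptions, not taken from cybersec.csv.

```python
import pandas as pd

# Hypothetical toy data: a rate stored as text, with a bad token and an impossible value
toy = pd.DataFrame({"click_rate": ["0.10", "null", "1.40", "0.30"]})

# Coerce text to numbers; anything unparseable becomes NaN
toy["click_rate"] = pd.to_numeric(toy["click_rate"], errors="coerce")

# Clip to the valid range for a proportion
toy["click_rate"] = toy["click_rate"].clip(lower=0.0, upper=1.0)

# Simple mean imputation for the remaining gaps
toy["click_rate"] = toy["click_rate"].fillna(toy["click_rate"].mean())

print(toy)
```

Coercing before clipping matters here: `clip` on a column that still holds strings would raise an error.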


In [None]:
#load the csv into a data frame
#perform some familiarization steps with the data set
#inspect and display some meta/summary data about the data set

#load CSV to pandas data frame
cyber_df = pd.read_csv("cybersec.csv")

#inspect schema and types
print("Schema and Data Types:\n", cyber_df.dtypes, "\n")

#show summary statistics
print("Summary Statistics:\n", cyber_df.describe(include='all'), "\n")

#show missing values
print("Missing Values:\n", cyber_df.isnull().sum(), "\n")

#show some unique values for object columns
object_columns = cyber_df.select_dtypes(include=['object']).columns
for col in object_columns:
    print(f"{col} unique values:", cyber_df[col].unique()[:10])
print()



In [None]:
#detect & fix issues
#clip impossible proportion values to the valid [0, 1] range, just in case
cyber_df['phishing_sim_click_rate'] = cyber_df['phishing_sim_click_rate'].clip(lower=0.0, upper=1.0)

#fix text 'null' if present (not needed here, but included as good hygiene)
cyber_df.replace("null", pd.NA, inplace=True)

#setup the missing values (mean)
cyber_df['vuln_count'] = cyber_df['vuln_count'].fillna(cyber_df['vuln_count'].mean())
cyber_df['training_completion_rate'] = cyber_df['training_completion_rate'].fillna(cyber_df['training_completion_rate'].mean())

#check if all missing values handled
print("Post-cleaning Missing Values:\n", cyber_df.isnull().sum())

#quick preview (print explicitly, since only the last expression in a cell is displayed)
print(cyber_df.head())

#actual column names
print(cyber_df.columns.tolist())

#check the min and max values of the endpoint_coverage column
min_coverage = cyber_df['endpoint_coverage'].min()
max_coverage = cyber_df['endpoint_coverage'].max()

print(f"Endpoint Coverage ranges from {min_coverage}% to {max_coverage}%.")


2.	Exploratory Analysis (EDA)
    a.	Plot distributions (incidents histogram).
    b.	Explore relationships (incidents vs. vulnerabilities, patch time).
    c.	Spot skewness and potential transformations.

In [None]:
#Histogram of Security Incidents

plt.figure(figsize=(8, 5))
sns.histplot(cyber_df['security_incidents'], bins=30, kde=True)
plt.title("Histogram of Security Incidents (Original)")
plt.xlabel("Number of Security Incidents")
plt.ylabel("Frequency")
plt.show()

#Explore relationships via scatter plots

#incidents vs. vuln_count
plt.figure(figsize=(8, 5))
sns.scatterplot(x='vuln_count', y='security_incidents', data=cyber_df)
plt.title("Security Incidents vs. Vulnerability Count")
plt.xlabel("Vulnerability Count")
plt.ylabel("Security Incidents")
plt.show()

#incidents vs. mean_time_to_patch
plt.figure(figsize=(8, 5))
sns.scatterplot(x='mean_time_to_patch', y='security_incidents', data=cyber_df)
plt.title("Security Incidents vs. Mean Time to Patch")
plt.xlabel("Mean Time to Patch (days)")
plt.ylabel("Security Incidents")
plt.show()

#Skewness and Log Transformation

#original skewness
orig_skew = skew(cyber_df['security_incidents'].dropna())
print(f"Skewness of security_incidents (original): {orig_skew:.2f}")

#log transform
cyber_df['log_security_incidents'] = np.log1p(cyber_df['security_incidents'])

#skewness after transformation
log_skew = skew(cyber_df['log_security_incidents'].dropna())
print(f"Skewness after log transformation: {log_skew:.2f}")

#plot log-transformed histogram
plt.figure(figsize=(8, 5))
sns.histplot(cyber_df['log_security_incidents'], bins=30, kde=True)
plt.title("Histogram of Log-Transformed Security Incidents")
plt.xlabel("log(1 + Security Incidents)")
plt.ylabel("Frequency")
plt.show()



In [None]:
#check skewness of all continuous variables first
print("Skewness Analysis of Original Variables:")
print("=" * 70)

original_continuous = [
    'org_size', 'it_budget_per_emp', 'mfa_coverage', 'endpoint_coverage',
    'vuln_count', 'mean_time_to_patch', 'phishing_sim_click_rate',
    'training_completion_rate', 'login_failures_per_user',
    'cloud_misconfig_count', 'security_incidents'
]

for var in original_continuous:
    sk = skew(cyber_df[var].dropna())
    interpretation = "Highly skewed" if abs(sk) > 1 else "Moderately skewed" if abs(sk) > 0.5 else "Roughly symmetric"
    print(f"{var:30s}: {sk:6.2f} - {interpretation}")

print("\n" + "=" * 70)
print("Applying Log Transformations to Highly Skewed Variables:")
print("=" * 70)

#log transform highly skewed variables
cyber_df['log_org_size'] = np.log1p(cyber_df['org_size'])
cyber_df['log_vuln_count'] = np.log1p(cyber_df['vuln_count'])
cyber_df['log_security_incidents'] = np.log1p(cyber_df['security_incidents'])
cyber_df['log_mean_time_to_patch'] = np.log1p(cyber_df['mean_time_to_patch'])
cyber_df['log_login_failures'] = np.log1p(cyber_df['login_failures_per_user'])
cyber_df['log_cloud_misconfig'] = np.log1p(cyber_df['cloud_misconfig_count'])

#verify transformations reduced skewness
print("\nSkewness After Transformation:")
transformed_vars = {
    'log_org_size': cyber_df['log_org_size'],
    'log_vuln_count': cyber_df['log_vuln_count'],
    'log_security_incidents': cyber_df['log_security_incidents'],
    'log_mean_time_to_patch': cyber_df['log_mean_time_to_patch'],
    'log_login_failures': cyber_df['log_login_failures'],
    'log_cloud_misconfig': cyber_df['log_cloud_misconfig']
}

for name, series in transformed_vars.items():
    sk = skew(series.dropna())
    print(f"{name:30s}: {sk:6.2f}")

cyber_df.head()
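
The transformations above are listed by hand. A minimal sketch of choosing them programmatically follows, using the same "highly skewed" cutoff of 1. The helper name `log_transform_skewed`, the `log_<col>` naming, and the synthetic data are all assumptions for illustration. Note that `Series.skew()` in pandas applies a bias correction, so its values differ slightly from `scipy.stats.skew` with default settings.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, columns, threshold=1.0):
    """Return a copy of df with log1p applied to columns whose |skew| exceeds threshold.

    Hypothetical helper for illustration; new columns are named 'log_<col>'.
    """
    out = df.copy()
    transformed = []
    for col in columns:
        series = out[col].dropna()
        # Only non-negative columns are transformed, so log1p is always defined
        if (series >= 0).all() and abs(series.skew()) > threshold:
            out[f"log_{col}"] = np.log1p(out[col])
            transformed.append(col)
    return out, transformed

# Synthetic example: one right-skewed column, one roughly symmetric column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "skewed": rng.lognormal(mean=0.0, sigma=1.5, size=500),
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=500),
})

demo_out, changed = log_transform_skewed(demo, ["skewed", "symmetric"])
print("Transformed columns:", changed)
```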

In [None]:
#select continuous variables for PCA - using TRANSFORMED versions where applicable
continuous_vars = [
    'log_org_size',                    # transformed from org_size
    'it_budget_per_emp',               # not skewed, use original
    'mfa_coverage',                    # not skewed, use original
    'endpoint_coverage',               # not skewed, use original
    'log_vuln_count',                  # transformed from vuln_count
    'log_mean_time_to_patch',          # transformed from mean_time_to_patch
    'phishing_sim_click_rate',         # not skewed, use original
    'training_completion_rate',        # not skewed, use original
    'log_login_failures',              # transformed from login_failures_per_user
    'log_cloud_misconfig',             # transformed from cloud_misconfig_count
    'log_security_incidents'           # transformed from security_incidents
]

print("Variables selected for PCA (using transformed versions):")
print("=" * 70)
for i, var in enumerate(continuous_vars, 1):
    print(f"{i:2d}. {var}")

#create a clean dataset with no missing values
pca_data = cyber_df[continuous_vars].dropna()

print(f"\nPCA Dataset shape: {pca_data.shape}")
print(f"Number of observations: {pca_data.shape[0]}")
print(f"Number of variables: {pca_data.shape[1]}")

#standardize the data (mean=0, std=1)
scaler = StandardScaler()
data_standardized = scaler.fit_transform(pca_data)

print("\n✓ Data has been standardized for PCA")
print(f"✓ Standardized data shape: {data_standardized.shape}")
print(f"✓ Mean of standardized data: {data_standardized.mean():.6f} (should be ~0)")
print(f"✓ Std of standardized data: {data_standardized.std():.6f} (should be ~1)")
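
The mean and standard deviation printed above pool every column into one number. A per-column check is more direct, since PCA needs each variable on the same scale. The sketch below uses synthetic data as a stand-in for `pca_data` and reproduces the z-score by hand with the population standard deviation (ddof=0), which is what `StandardScaler` uses by default.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for pca_data: three columns on very different scales
X = np.column_stack([
    rng.normal(1000.0, 250.0, size=200),
    rng.normal(0.5, 0.1, size=200),
    rng.normal(30.0, 8.0, size=200),
])

# Z-score each column using the population std (ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = Z.mean(axis=0)
col_stds = Z.std(axis=0)
print("Per-column means:", np.round(col_means, 6))
print("Per-column stds: ", np.round(col_stds, 6))
```

In the notebook, the equivalent check would be `data_standardized.mean(axis=0)` and `data_standardized.std(axis=0)`.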

In [None]:
#Fit PCA and analyze explained variance

#fit PCA with all components
pca = PCA()
pca_components = pca.fit_transform(data_standardized)

#get explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

#display variance explained by each component
variance_df = pd.DataFrame({
    'PC': [f'PC{i+1}' for i in range(len(explained_variance))],
    'Explained Variance': explained_variance,
    'Cumulative Variance': cumulative_variance
})

print("Explained Variance by Principal Component:\n")
print(variance_df)

#visualize explained variance
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

#scree plot
axes[0].bar(range(1, len(explained_variance) + 1), explained_variance)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot: Variance Explained by Each PC')
axes[0].set_xticks(range(1, len(explained_variance) + 1))

#cumulative variance plot
axes[1].plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
axes[1].axhline(y=0.8, color='r', linestyle='--', label='80% threshold')
axes[1].axhline(y=0.9, color='g', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Number of Principal Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance Explained')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

#determine number of components needed for 80% and 90% variance
n_components_80 = np.argmax(cumulative_variance >= 0.80) + 1
n_components_90 = np.argmax(cumulative_variance >= 0.90) + 1

print(f"\nComponents needed for 80% variance: {n_components_80}")
print(f"Components needed for 90% variance: {n_components_90}")
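
For standardized inputs, the explained variance ratios reported above are the eigenvalues of the correlation matrix divided by their sum, and that sum equals the number of variables. The sketch below shows this with NumPy alone on synthetic data (the data and variable names are made up for illustration) and applies the same "first component count to reach the threshold" rule.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 4 variables, the first two strongly correlated
base = rng.normal(size=(300, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(300, 1)),
    base + 0.3 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])

# Correlation matrix = covariance matrix of the standardized variables
corr = np.corrcoef(X, rowvar=False)

# eigvalsh returns ascending eigenvalues for a symmetric matrix; reverse to descending
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative ratio reaches the target
k_90 = int(np.argmax(cumulative >= 0.90)) + 1
print("Explained variance ratio:", np.round(ratio, 3))
print("Components for 90%:", k_90)
```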

In [None]:
#Analyze component loadings

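#get the loadings (eigenvectors scaled by the square root of their eigenvalues;
#for standardized inputs these approximate the correlations between original variables and PCs)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)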

#create a dataframe for easier interpretation
#focus on the first few most important components
n_components_to_show = min(5, len(explained_variance))

loadings_df = pd.DataFrame(
    loadings[:, :n_components_to_show],
    columns=[f'PC{i+1}' for i in range(n_components_to_show)],
    index=continuous_vars
)

print("Component Loadings (correlations with original variables):\n")
print(loadings_df.round(3))

#visualize loadings as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(loadings_df, annot=True, cmap='RdBu_r', center=0, 
            fmt='.2f', cbar_kws={'label': 'Loading'})
plt.title('PCA Loadings Heatmap: Variable Contributions to Principal Components')
plt.xlabel('Principal Component')
plt.ylabel('Original Variables')
plt.tight_layout()
plt.show()

#show strongest contributors to each PC
print("\nStrongest contributors to each principal component:")
for i in range(n_components_to_show):
    pc_name = f'PC{i+1}'
    loadings_sorted = loadings_df[pc_name].abs().sort_values(ascending=False)
    print(f"\n{pc_name} (explains {explained_variance[i]:.1%} of variance):")
    for var, loading in loadings_sorted.head(3).items():
        actual_loading = loadings_df.loc[var, pc_name]
        print(f"  {var}: {actual_loading:.3f}")
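
The loadings above are read as correlations between variables and components. A minimal NumPy sketch on synthetic data can confirm that reading: eigenvectors of the correlation matrix, scaled by the square root of their eigenvalues, match the directly computed correlations. The data here is made up. In the notebook the match is approximate rather than exact, because `StandardScaler` divides by n while `explained_variance_` divides by n - 1, so the values differ by a factor of sqrt(n / (n - 1)).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.5 * rng.normal(size=n),
    -shared + 0.5 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize with the population std (ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the correlation matrix, sorted by descending eigenvalue
R = (Z.T @ Z) / n
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings: each eigenvector (column) scaled by the square root of its eigenvalue
loadings = eigenvectors * np.sqrt(eigenvalues)

# Direct check: correlation of each variable with each component's scores
scores = Z @ eigenvectors
direct = np.array([
    [np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(3)]
    for j in range(3)
])

print(np.round(loadings, 3))
print("Max difference from direct correlations:", np.abs(loadings - direct).max())
```

Each row of squared loadings sums to 1 when all components are kept, which is a quick sanity check on a loadings table.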

In [None]:
#project data onto principal components

#define optimal_n before using it
optimal_n = n_components_90

#report the number of components being kept
print(f"Optimal n_components: {optimal_n}")

#refit PCA with optimal number of components (let's use components for 90% variance)
pca_optimal = PCAModel(n_components=optimal_n)
pca_transformed = pca_optimal.fit_transform(data_standardized)

#create dataframe with projected data
pca_df = pd.DataFrame(
    pca_transformed,
    columns=[f'PC{i+1}' for i in range(optimal_n)],
    index=pca_data.index
)

#add back some categorical variables for context
pca_df['department'] = cyber_df.loc[pca_data.index, 'department'].values
pca_df['quarter'] = cyber_df.loc[pca_data.index, 'quarter'].values
pca_df['security_incidents'] = cyber_df.loc[pca_data.index, 'security_incidents'].values

print(f"Projected data shape: {pca_df.shape}")
print("\nFirst few rows of projected data:")
print(pca_df.head())

#visualize first two principal components
plt.figure(figsize=(12, 5))

#colored by department
plt.subplot(1, 2, 1)
for dept in pca_df['department'].unique():
    mask = pca_df['department'] == dept
    plt.scatter(pca_df.loc[mask, 'PC1'], pca_df.loc[mask, 'PC2'], 
                label=dept, alpha=0.6)
plt.xlabel(f'PC1 ({explained_variance[0]:.1%} variance)')
plt.ylabel(f'PC2 ({explained_variance[1]:.1%} variance)')
plt.title('PCA Projection by Department')
plt.legend()
plt.grid(True, alpha=0.3)

#colored by security incidents (continuous)
plt.subplot(1, 2, 2)
scatter = plt.scatter(pca_df['PC1'], pca_df['PC2'], 
                     c=pca_df['security_incidents'], 
                     cmap='YlOrRd', alpha=0.6)
plt.colorbar(scatter, label='Security Incidents')
plt.xlabel(f'PC1 ({explained_variance[0]:.1%} variance)')
plt.ylabel(f'PC2 ({explained_variance[1]:.1%} variance)')
plt.title('PCA Projection by Security Incidents')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

I built some interpretation code to generate recommendations and provide insights based on the PCA (and data) as analyzed.

In [None]:
#interpretation and insights

print("=" * 80)
print("PCA INTERPRETATION AND INSIGHTS")
print("=" * 80)

print("\n1. DIMENSIONALITY REDUCTION:")
print(f"   - Original dimensions: {len(continuous_vars)} variables")
print(f"   - Reduced to {optimal_n} principal components")
print(f"   - Retained variance: {cumulative_variance[optimal_n-1]:.1%}")

print("\n2. PRINCIPAL COMPONENT MEANINGS:")

#interpret based on loadings
for i in range(min(3, optimal_n)):  # Interpret first 3 PCs
    pc_name = f'PC{i+1}'
    print(f"\n   {pc_name} ({explained_variance[i]:.1%} of variance):")
    
    #get top positive and negative loadings
    pc_loadings = loadings_df.iloc[:, i].sort_values()
    
    print(f"   Strong negative associations: {pc_loadings.head(2).to_dict()}")
    print(f"   Strong positive associations: {pc_loadings.tail(2).to_dict()}")
    
    #provide intuitive interpretation
    if i == 0:
        print("   Interpretation: This likely represents 'organizational scale and cybersecurity posture'")
    elif i == 1:
        print("   Interpretation: This likely represents 'security hygiene and response effectiveness'")
    elif i == 2:
        print("   Interpretation: This likely represents 'human factor vulnerabilities'")

print("\n3. WHAT I RECOMMEND WE DO:")
print("   - Create a cybersecurity maturity score based on PC1 (the most important component) as a baseline for tracking future progress")
print("   - Use these Principal Components for clustering to identify similar security profiles")
print("   - Build predictive models with reduced features (less overfitting)")
print("   - Identify departments that are outliers in Principal Component space")
print("   - Track cybersecurity posture changes over time using Principal Component coordinates")

#calculate correlation between PC1 and security incidents
pc1_incident_corr = np.corrcoef(pca_df['PC1'], pca_df['security_incidents'])[0, 1]

print("\n4. KEY INSIGHTS:")
print(f"   - PC1 correlation with security incidents: {pc1_incident_corr:.3f}")
print(f"   - Most variance comes from organizational and cybersecurity control differences")
print(f"   - {optimal_n} dimensions capture the essential patterns in our cybersecurity posture")
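
Because `log_security_incidents` is one of the PCA inputs, part of any correlation between a component and `security_incidents` comes from the outcome sitting inside the components. If the question is how the control metrics relate to incidents, one option is to fit the PCA on the predictors only and correlate the scores with the outcome afterward. The sketch below illustrates that idea on synthetic data; the predictors, the outcome, and the coefficient are all made up.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
# Synthetic predictors (stand-ins for control metrics) and an outcome driven by the first one
predictors = rng.normal(size=(n, 4))
outcome = 2.0 * predictors[:, 0] + rng.normal(size=n)

# Standardize predictors only; the outcome is deliberately left out of the PCA
Z = (predictors - predictors.mean(axis=0)) / predictors.std(axis=0)

# PCA via SVD: rows of Vt are the principal axes, scores are the projections
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

# Correlate each component's scores with the held-out outcome
pc_outcome_corr = np.array([
    np.corrcoef(scores[:, k], outcome)[0, 1] for k in range(scores.shape[1])
])
print(np.round(pc_outcome_corr, 3))
```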

Next, I added some code to build textual recommendations.  I learned how to write these to a file as text that I can then paste into a markdown cell (which I did).  It is probably overkill, but as the model improved I kept coming up with more potential conclusions for a more comprehensive recommendation list.  I made a couple of passes at how to build the recommendations.  They follow as ActionableInsightsCode, then ActionableInsightsCommentary (markdown).  There are two versions because I wanted to try a couple of different approaches.
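
A minimal sketch of the write-to-file step might look like the following. The file name `pca_conclusion.md` and the two numbers are placeholders for illustration, not results from this notebook; in practice they would come from the PCA variables.

```python
from pathlib import Path
import tempfile

# Placeholder values for illustration only; not results from the analysis
n_components = 7
retained = 0.912

markdown_text = (
    "# Conclusion\n\n"
    f"**{n_components} principal components** retain "
    f"**{retained:.1%}** of the variance.\n"
)

# Write to a temporary directory so the sketch does not overwrite real files
out_path = Path(tempfile.mkdtemp()) / "pca_conclusion.md"
out_path.write_text(markdown_text, encoding="utf-8")

# Read it back to confirm the round trip
print(out_path.read_text(encoding="utf-8"))
```

Inside Jupyter, `IPython.display.Markdown(markdown_text)` renders the same string directly in the cell output, which avoids the copy-and-paste step.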

In [None]:
# COMPREHENSIVE ANALYSIS WITH AUTO-GENERATED MARKDOWN CONCLUSION

print("=" * 80)
print("DETAILED ANALYSIS FOR ACTIONABLE RECOMMENDATIONS")
print("=" * 80)

# ============================================================================
# PART 1: IDENTIFY KEY DRIVERS OF SECURITY INCIDENTS
# ============================================================================

print("\n" + "=" * 80)
print("PART 1: WHAT DRIVES SECURITY INCIDENTS?")
print("=" * 80)

#calculate correlation of each PC with security incidents
pc_incident_correlations = {}
for i in range(optimal_n):
    pc_col = f'PC{i+1}'
    corr = np.corrcoef(pca_df[pc_col], pca_df['security_incidents'])[0, 1]
    pc_incident_correlations[pc_col] = corr
    print(f"\n{pc_col} correlation with security incidents: {corr:.3f}")
    print(f"  Variance explained: {explained_variance[i]:.1%}")

#find most important PC (highest absolute correlation with incidents)
most_correlated_pc_name = max(pc_incident_correlations, key=lambda k: abs(pc_incident_correlations[k]))
most_correlated_pc_idx = int(most_correlated_pc_name.replace('PC', '')) - 1
most_correlated_pc_corr = pc_incident_correlations[most_correlated_pc_name]

print(f"\n>>> MOST IMPORTANT: {most_correlated_pc_name}")
print(f"    Correlation with incidents: {most_correlated_pc_corr:.3f}")
print(f"    Variance explained: {explained_variance[most_correlated_pc_idx]:.1%}")

# ============================================================================
# PART 2: IDENTIFY TOP VARIABLES TO ADDRESS
# ============================================================================

print("\n" + "=" * 80)
print("PART 2: TOP VARIABLES THAT NEED ATTENTION")
print("=" * 80)

#calculate leverage scores (variance explained * correlation with incidents)
pc_incident_corrs_list = []
for i in range(optimal_n):
    pc_col = f'PC{i+1}'
    corr = abs(np.corrcoef(pca_df[pc_col], pca_df['security_incidents'])[0, 1])
    pc_incident_corrs_list.append(corr)

#weight by both variance explained and correlation with incidents
leverage_scores = explained_variance[:optimal_n] * np.array(pc_incident_corrs_list)
most_important_pc_idx = np.argmax(leverage_scores)
most_important_pc_name = f'PC{most_important_pc_idx + 1}'

print(f"\nMost important PC for reducing incidents: {most_important_pc_name}")
print(f"  - Explains {explained_variance[most_important_pc_idx]:.1%} of variance")
print(f"  - Correlation with incidents: {pc_incident_corrs_list[most_important_pc_idx]:.3f}")
print(f"  - Leverage score: {leverage_scores[most_important_pc_idx]:.3f}")

#get top 5 variables by absolute loading on most important PC
top_5_vars = loadings_df[most_important_pc_name].abs().sort_values(ascending=False).head(5)

print(f"\nTop 5 variables loading on {most_important_pc_name}:")
top_vars_list = []

for rank, (var, abs_loading) in enumerate(top_5_vars.items(), 1):
    actual_loading = loadings_df.loc[var, most_important_pc_name]
    
    #determine whether higher values go with more or fewer incidents (association, not causation)
    #loading and PC-incident correlation share a sign = higher values go with more incidents
    #loading and PC-incident correlation differ in sign = higher values go with fewer incidents
    pc_incident_corr = pc_incident_correlations[most_important_pc_name]
    
    if (actual_loading > 0 and pc_incident_corr > 0) or (actual_loading < 0 and pc_incident_corr < 0):
        direction = "Higher"
        effect = "MORE incidents"
        action = "REDUCE"
    else:
        direction = "Higher"
        effect = "FEWER incidents"
        action = "INCREASE"
    
    print(f"\n{rank}. {var}")
    print(f"   Loading: {actual_loading:.3f}")
    print(f"   Effect: {direction} values → {effect}")
    print(f"   → ACTION: {action} this metric")
    
    top_vars_list.append({
        'rank': rank,
        'variable': var,
        'loading': actual_loading,
        'abs_loading': abs_loading,
        'direction': direction,
        'effect': effect,
        'action': action
    })

# ============================================================================
# PART 3: SPECIFIC ACTIONABLE RECOMMENDATIONS
# ============================================================================

print("\n" + "=" * 80)
print("PART 3: SPECIFIC, ACTIONABLE RECOMMENDATIONS")
print("=" * 80)

#detailed recommendations for each variable
recommendations_detailed = {
    'log_mean_time_to_patch': {
        'problem': 'Slow patching increases vulnerability window',
        'quick_win': 'Prioritize critical patches, automate patch deployment for 30% of systems',
        'medium': 'Hire 1-2 dedicated patch management specialists ($120k-150k/year)',
        'long': 'Implement enterprise patch management system ($50k tool + 2 FTE)',
        'metric': 'Reduce mean time to patch from current to <7 days for critical patches'
    },
    'log_vuln_count': {
        'problem': 'High vulnerability counts indicate scanning or remediation gaps',
        'quick_win': 'Increase vulnerability scan frequency to weekly',
        'medium': 'Create vulnerability management SLA: Critical (7 days), High (30 days)',
        'long': 'Hire security engineer focused on vulnerability management ($130k-160k)',
        'metric': 'Reduce total vulnerability count by 40% in 12 months'
    },
    'training_completion_rate': {
        'problem': 'Incomplete training leaves users vulnerable to social engineering',
        'quick_win': 'Make security training mandatory with executive sponsorship',
        'medium': 'Implement quarterly micro-learning modules (10 min each)',
        'long': 'Gamify training with rewards, track completion in performance reviews',
        'metric': 'Achieve 95%+ completion rate within 6 months'
    },
    'phishing_sim_click_rate': {
        'problem': 'High click rates indicate users are vulnerable to phishing',
        'quick_win': 'Run monthly phishing simulations with immediate educational pop-ups',
        'medium': 'Implement targeted training for repeat clickers',
        'long': 'Deploy email security solution with URL rewriting and sandboxing ($15-20/user/year)',
        'metric': 'Reduce click rate from current to <5% organization-wide'
    },
    'mfa_coverage': {
        'problem': 'Accounts without MFA are vulnerable to credential compromise',
        'quick_win': 'Mandate MFA for all admin and privileged accounts (immediate)',
        'medium': 'Require MFA for all users accessing corporate resources (90 days)',
        'long': 'Implement passwordless authentication (FIDO2) for 50% of users',
        'metric': 'Achieve 100% MFA coverage for all users within 3 months'
    },
    'endpoint_coverage': {
        'problem': 'Unprotected endpoints are entry points for malware',
        'quick_win': 'Audit all endpoints, identify gaps, create deployment plan',
        'medium': 'Deploy EDR to remaining endpoints ($40-60/endpoint/year)',
        'long': 'Implement zero-trust architecture with continuous endpoint verification',
        'metric': 'Achieve 100% endpoint coverage within 6 months'
    },
    'log_login_failures': {
        'problem': 'High login failures may indicate brute force attacks or poor UX',
        'quick_win': 'Implement account lockout after 5 failed attempts',
        'medium': 'Deploy password manager to all users ($5-10/user/year)',
        'long': 'Implement SSO with adaptive authentication',
        'metric': 'Reduce login failures by 60% through better UX and security'
    },
    'log_cloud_misconfig': {
        'problem': 'Cloud misconfigurations expose data and services',
        'quick_win': 'Run immediate cloud security audit using free tools',
        'medium': 'Implement Cloud Security Posture Management (CSPM) tool ($20k-50k/year)',
        'long': 'Hire cloud security engineer or consultant ($140k-180k or $200/hr)',
        'metric': 'Reduce misconfigurations by 80%, maintain <5 open issues'
    },
    'it_budget_per_emp': {
        'problem': 'Insufficient budget limits security program effectiveness',
        'quick_win': 'Document current budget gaps with business risk analysis',
        'medium': 'Request 20-30% budget increase with ROI justification',
        'long': 'Benchmark against industry standards, align budget to risk appetite',
        'metric': 'Increase security budget to industry median per employee'
    },
    'log_org_size': {
        'problem': 'Larger organizations face scale challenges',
        'quick_win': 'Implement automation for repetitive security tasks',
        'medium': 'Scale security team proportionally (1 security FTE per 100 employees)',
        'long': 'Build security centers of excellence for enterprise-wide consistency',
        'metric': 'Maintain consistent security posture regardless of org size'
    },
    'log_security_incidents': {
        'problem': 'Lack of visibility into incident trends',
        'quick_win': 'Implement standardized incident reporting and tracking',
        'medium': 'Deploy SIEM for centralized security monitoring ($30k-100k/year)',
        'long': 'Build SOC capability with 24/7 monitoring',
        'metric': 'Reduce mean time to detect (MTTD) to <1 hour'
    }
}

print("\nDETAILED RECOMMENDATIONS FOR TOP 5 DRIVERS:\n")
for item in top_vars_list:
    var = item['variable']
    rank = item['rank']
    loading = item['loading']
    action = item['action']
    
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        print(f"\n{'='*70}")
        print(f"#{rank}: {var} (loading: {loading:.3f})")
        print(f"ACTION NEEDED: {action} this metric")
        print(f"{'='*70}")
        print(f"PROBLEM: {rec['problem']}")
        print(f"\n→ QUICK WIN (0-3 months, low cost):")
        print(f"  {rec['quick_win']}")
        print(f"\n→ MEDIUM-TERM (3-6 months, moderate investment):")
        print(f"  {rec['medium']}")
        print(f"\n→ LONG-TERM (6-12 months, significant investment):")
        print(f"  {rec['long']}")
        print(f"\n→ SUCCESS METRIC:")
        print(f"  {rec['metric']}")

# ============================================================================
# PART 4: DEPARTMENT-SPECIFIC RECOMMENDATIONS
# ============================================================================

print("\n" + "=" * 80)
print("PART 4: DEPARTMENT-SPECIFIC RECOMMENDATIONS")
print("=" * 80)

dept_pc_means = pca_df.groupby('department')[[f'PC{i+1}' for i in range(min(3, optimal_n))]].mean()
dept_incidents = pca_df.groupby('department')['security_incidents'].mean()

print("\nDepartment Profiles:")
print("-" * 70)
dept_analysis = pd.concat([dept_pc_means, dept_incidents], axis=1)
print(dept_analysis.round(2))

#identify best and worst performers
best_dept = dept_incidents.idxmin()
worst_dept = dept_incidents.idxmax()

print(f"\n>>> BEST PERFORMER: {best_dept}")
print(f"    Average incidents: {dept_incidents[best_dept]:.1f}")
print(f"    PC1 score: {dept_pc_means.loc[best_dept, 'PC1']:.2f}")

print(f"\n>>> NEEDS MOST ATTENTION: {worst_dept}")
print(f"    Average incidents: {dept_incidents[worst_dept]:.1f}")
print(f"    PC1 score: {dept_pc_means.loc[worst_dept, 'PC1']:.2f}")

print(f"\nRECOMMENDATION: Have {best_dept} share best practices with {worst_dept}")

#specific recommendations by department
print("\n" + "-" * 70)
print("DEPARTMENT-SPECIFIC ACTION PLANS:")
print("-" * 70)

for dept in dept_pc_means.index:
    print(f"\n{dept.upper()}:")
    pc1_score = dept_pc_means.loc[dept, 'PC1']
    pc2_score = dept_pc_means.loc[dept, 'PC2'] if 'PC2' in dept_pc_means.columns else 0
    avg_incidents = dept_incidents[dept]
    
    print(f"  Current incidents/quarter: {avg_incidents:.1f}")
    print(f"  PC1 score: {pc1_score:.2f} | PC2 score: {pc2_score:.2f}")
    
    #interpret PC1
    if pc1_score > 0.5:
        print(f"  • Large/complex organization - prioritize automation and additional staff")
    elif pc1_score < -0.5:
        print(f"  • Smaller organization - focus on foundational controls first")
    else:
        print(f"  • Medium-sized organization - balanced approach needed")
    
    #interpret PC2 if available
    if 'PC2' in dept_pc_means.columns:
        if pc2_score < -0.5:
            print(f"  • WEAK controls - PRIORITY: improve coverage metrics (MFA, endpoints)")
        elif pc2_score > 0.5:
            print(f"  • STRONG controls - maintain and focus on advanced threats")
    
    #compare to org average
    if avg_incidents > dept_incidents.mean():
        print(f"  ⚠️  ABOVE AVERAGE INCIDENTS - needs immediate attention")
    else:
        print(f"  ✓  Below average incidents - share practices with other teams")

# ============================================================================
# PART 5: ROI AND IMPACT ESTIMATION
# ============================================================================

print("\n" + "=" * 80)
print("PART 5: ESTIMATED IMPACT AND ROI")
print("=" * 80)

current_mean_incidents = pca_df['security_incidents'].mean()
current_total_incidents = pca_df['security_incidents'].sum()
n_dept_quarters = len(pca_df)

print(f"\nCURRENT STATE:")
print(f"  Average incidents per dept/quarter: {current_mean_incidents:.1f}")
print(f"  Total incidents in dataset: {current_total_incidents:.0f}")
print(f"  Number of observations: {n_dept_quarters}")

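#rough heuristic: treat the absolute correlation as a percentage reduction and cap it
#(a placeholder for discussion, not a statistical estimate of impact)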
correlation_strength = abs(most_correlated_pc_corr)
estimated_reduction_pct = min(correlation_strength * 100, 60)  # Cap at 60%

print(f"\nESTIMATED IMPACT:")
print(f"  {most_correlated_pc_name} correlation with incidents: {most_correlated_pc_corr:.3f}")
print(f"  By improving top 3-5 factors by 1 standard deviation:")
print(f"    • Expected incident reduction: ~{estimated_reduction_pct:.0f}%")
print(f"    • From {current_mean_incidents:.1f} to ~{current_mean_incidents * (1 - estimated_reduction_pct/100):.1f} incidents/quarter")
print(f"    • Annual reduction: ~{current_mean_incidents * 4 * estimated_reduction_pct/100:.0f} fewer incidents/year")

#cost-benefit
print(f"\nESTIMATED COSTS:")
print(f"  Quick wins: $0-10k (reallocate existing resources)")
print(f"  Medium-term: $100k-300k (tools + 1-2 FTE)")
print(f"  Long-term: $300k-600k/year (comprehensive program)")

cost_per_incident = 50000  # Adjust this for your organization
print(f"\nESTIMATED BENEFITS (assuming ${cost_per_incident:,} cost per incident):")
incidents_prevented = current_mean_incidents * 4 * estimated_reduction_pct / 100
annual_savings = incidents_prevented * cost_per_incident

print(f"  Incidents prevented per year: ~{incidents_prevented:.0f}")
print(f"  Annual cost savings: ${annual_savings:,.0f}")
print(f"  ROI: {(annual_savings / 300000 - 1) * 100:.0f}% (against medium-term investment)")

# ============================================================================
# PART 6: AUTO-GENERATE MARKDOWN CONCLUSION
# ============================================================================

#print("\n" + "=" * 80)
#print("AUTO-GENERATED MARKDOWN CONCLUSION (COPY-PASTE BELOW)")
#print("=" * 80)

#interpret what PC1 represents based on loadings
pc1_top_var = loadings_df.iloc[:, 0].abs().idxmax()
pc1_interpretation = "organizational scale and security complexity"
if 'training' in pc1_top_var or 'phishing' in pc1_top_var:
    pc1_interpretation = "human factor vulnerabilities and training effectiveness"
elif 'coverage' in pc1_top_var or 'mfa' in pc1_top_var:
    pc1_interpretation = "security control coverage and maturity"

markdown_output = f"""
# Conclusion: Actionable Insights from PCA

## What We Learned

This PCA analysis revealed that **{optimal_n} principal components** capture **{cumulative_variance[optimal_n-1]:.1%}** of the variance in our cybersecurity data, with **PC1 primarily representing {pc1_interpretation}**, explaining **{explained_variance[0]:.1%}** of the variance.

The analysis successfully reduced our monitoring from 11 separate metrics to {optimal_n} composite dimensions that capture the essential patterns in our security posture.

## Key Finding: What Drives Security Incidents

**{most_correlated_pc_name} shows the strongest relationship with security incidents** (correlation: **{most_correlated_pc_corr:.3f}**), explaining **{explained_variance[most_correlated_pc_idx]:.1%}** of variance. This tells us that focusing on the variables that load heavily on {most_correlated_pc_name} will have the greatest impact on reducing incidents.

## Specific Recommendations (Prioritized by Impact)

Based on the PCA loadings and their correlation with security incidents, we identified the following **prioritized actions**:

### 1. Immediate Actions (Highest ROI - Implement in 0-3 months)
"""

for item in top_vars_list[:3]:
    var = item['variable']
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        markdown_output += f"""
**{item['rank']}. {var}** (loading: {item['loading']:.3f}, ACTION: {item['action']})
- **Problem**: {rec['problem']}
- **Action**: {rec['quick_win']}
- **Target**: {rec['metric']}
"""

markdown_output += f"""
### 2. Medium-Term Investments (3-6 months, ~$100k-300k)
"""

for item in top_vars_list[:3]:
    var = item['variable']
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        markdown_output += f"""
**{var}**: {rec['medium']}
"""

markdown_output += f"""
### 3. Long-Term Strategic Initiatives (6-12 months, ~$300k-600k/year)
"""

for item in top_vars_list[:3]:
    var = item['variable']
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        markdown_output += f"""
**{var}**: {rec['long']}
"""

markdown_output += f"""
## Why These Matter

- **{most_correlated_pc_name}** explains {explained_variance[most_correlated_pc_idx]:.1%} of variance and correlates **{most_correlated_pc_corr:.3f}** with security incidents
- The top {len(top_vars_list)} variables loading on {most_correlated_pc_name} are directly actionable with clear interventions
- Improving these by 1 standard deviation could reduce incidents by approximately **{estimated_reduction_pct:.0f}%**

## Department-Specific Focus

**Best Performer**: **{best_dept}** (average {dept_incidents[best_dept]:.1f} incidents/quarter)
- Share their practices with other departments
- Document their success factors for replication

**Needs Most Attention**: **{worst_dept}** (average {dept_incidents[worst_dept]:.1f} incidents/quarter)
- Immediate intervention required
- Assign dedicated security resource for 90 days
- Implement quick wins from list above

### All Departments:
"""

for dept in dept_pc_means.index:
    avg_incidents = dept_incidents[dept]
    status = "⚠️ Above average" if avg_incidents > dept_incidents.mean() else "✓ Below average"
    
    markdown_output += f"""
**{dept}**: {avg_incidents:.1f} incidents/quarter ({status})
"""

markdown_output += f"""
## Resource Requirements

To implement these recommendations:

| Timeline | Investment | Key Activities |
|----------|-----------|----------------|
| **Quick wins (0-3 months)** | $0-10k | Mandatory training, MFA for admins, phishing sims |
| **Medium-term (3-6 months)** | $100k-300k | 1-2 FTE hires, EDR deployment, CSPM tool |
| **Long-term (6-12 months)** | $300k-600k/year | Additional staff, enterprise tools, automation |

## Expected Outcomes

### Quantitative Targets (12-month goals):
- **Reduce security incidents by {estimated_reduction_pct:.0f}%** (from {current_mean_incidents:.1f} to ~{current_mean_incidents * (1 - estimated_reduction_pct/100):.1f} per dept/quarter)
- Achieve **100% MFA coverage** across all users
- Reduce **mean time to patch to <7 days** for critical vulnerabilities
- Increase **training completion to 95%+** organization-wide
- Reduce **phishing click rate to <5%**

### Return on Investment:
- **Annual incidents prevented**: ~{incidents_prevented:.0f}
- **Estimated annual savings**: ${annual_savings:,.0f} (at ${cost_per_incident:,} per incident)
- **ROI**: ~{(annual_savings / 300000 - 1) * 100:.0f}% on medium-term investment

## Measuring Success

Track these PC scores quarterly to monitor improvement:

- **Target**: Move all departments' {most_correlated_pc_name} scores closer to {best_dept}'s profile
- **Dashboard**: Create quarterly PC score tracking with incident trends
- **Review**: Executive review of security posture using PC metrics (simpler than 11 individual metrics)

## Why PCA Was Valuable

Rather than tracking 11 separate metrics independently, PCA showed us that:

1. **Most variation comes from just {optimal_n} underlying factors** - we can focus our attention
2. **We can prioritize resources** on variables that load heavily on incident-correlated PCs ({most_correlated_pc_name})
3. **We have a quantitative basis** for budget requests: "Investing $300k in these {len(top_vars_list[:3])} areas will reduce incidents by {estimated_reduction_pct:.0f}%"
4. **Departments can be compared** fairly using PC scores rather than raw metrics that don't account for size/complexity

This transforms cybersecurity from reactive firefighting to **proactive, data-driven risk management** with clear priorities and measurable outcomes.

## Next Steps

1. **Week 1**: Present findings to leadership, secure budget approval for quick wins
2. **Month 1**: Implement all quick wins, begin hiring for medium-term roles
3. **Month 3**: Deploy tools (EDR, CSPM), complete first round of enhanced training
4. **Month 6**: Review PC scores, measure incident reduction, adjust strategy
5. **Month 12**: Full program assessment, demonstrate ROI, plan next phase

---

*Analysis based on {n_dept_quarters} department-quarter observations across {len(pca_df['department'].unique())} departments over {len(pca_df['quarter'].unique())} quarters.*
"""

#print(markdown_output)
#commented out as I copied to a separate markdown cell after generating

#save to file for easy copying
try:
    with open('PCA_Conclusion_AutoGenerated.md', 'w') as f:
        f.write(markdown_output)
    print("\n" + "=" * 80)
    print("✓ Markdown conclusion saved to: PCA_Conclusion_AutoGenerated.md and pasted to markdown cell just below:")
    print("=" * 80)
except:
    print("\n" + "=" * 80)
    print("Note: Could not save to file, but markdown is printed above")
    print("=" * 80)

# Conclusion: Actionable Insights from PCA  
## What We Learned  
This PCA analysis revealed that **8 principal components** capture **91.7%** of the variance in our cybersecurity data, with **PC1 primarily representing organizational scale and security complexity**, explaining **20.5%** of the variance.  
The analysis successfully reduced our monitoring from 11 separate metrics to 8 composite dimensions that capture the essential patterns in our security posture.  
## Key Finding: What Drives Security Incidents  
**PC1 shows the strongest relationship with security incidents** (correlation: **0.688**), explaining **20.5%** of variance. This tells us that focusing on the variables that load heavily on PC1 will have the greatest impact on reducing incidents.  
## Specific Recommendations (Prioritized by Impact)  
Based on the PCA loadings and their correlation with security incidents, we identified the following **prioritized actions**:  
### 1. Immediate Actions (Highest ROI - Implement in 0-3 months)  
**1. log_org_size** (loading: 0.857)  
- **Problem**: Larger organizations face scale challenges  
- **Action**: Implement automation for repetitive security tasks  
- **Target**: Maintain consistent security posture regardless of org size  
**2. log_vuln_count** (loading: 0.853)  
- **Problem**: High vulnerability counts indicate scanning or remediation gaps  
- **Action**: Increase vulnerability scan frequency to weekly  
- **Target**: Reduce total vulnerability count by 40% in 12 months  
**3. log_security_incidents** (loading: 0.700)  
- **Problem**: Lack of visibility into incident trends  
- **Action**: Implement standardized incident reporting and tracking  
- **Target**: Reduce mean time to detect (MTTD) to <1 hour  
### 2. Medium-Term Investments (3-6 months, ~$100k-300k)  
**log_org_size**: Scale security team proportionally (1 security FTE per 100 employees)  
**log_vuln_count**: Create vulnerability management SLA: Critical (7 days), High (30 days)  
**log_security_incidents**: Deploy SIEM for centralized security monitoring ($30k-100k/year)  
### 3. Long-Term Strategic Initiatives (6-12 months, ~$300k-600k/year)  
**log_org_size**: Build security centers of excellence for enterprise-wide consistency  
**log_vuln_count**: Hire security engineer focused on vulnerability management ($130k-160k)  
**log_security_incidents**: Build SOC capability with 24/7 monitoring  
## Why These Matter  
- **PC1** explains 20.5% of variance and correlates **0.688** with security incidents  
- The top 5 variables loading on PC1 are directly actionable with clear interventions  
- Improving these by 1 standard deviation could reduce incidents by approximately **60%**  
## Department-Specific Focus  
**Best Performer**: **Accounting** (average 0.0 incidents/quarter)  
- Share their practices with other departments  
- Document their success factors for replication  
**Needs Most Attention**: **Sales** (average 3.8 incidents/quarter)  
- Immediate intervention required  
- Assign dedicated security resource for 90 days  
- Implement quick wins from list above  
### All Departments:  
**Accounting**: 0.0 incidents/quarter (✓ Below average)  
**Business Development**: 0.0 incidents/quarter (✓ Below average)  
**Clinical Operations**: 1.2 incidents/quarter (⚠️ Above average)  
**Compliance**: 0.0 incidents/quarter (✓ Below average)  
**Customer Support**: 2.9 incidents/quarter (⚠️ Above average)  
**Data Science**: 0.0 incidents/quarter (✓ Below average)  
**DevOps**: 0.0 incidents/quarter (✓ Below average)  
**E-commerce**: 0.5 incidents/quarter (✓ Below average)  
**Engineering**: 0.0 incidents/quarter (✓ Below average)  
**Facilities**: 1.1 incidents/quarter (⚠️ Above average)  
**Field Services**: 0.0 incidents/quarter (✓ Below average)  
**Finance**: 0.0 incidents/quarter (✓ Below average)  
**HR**: 0.2 incidents/quarter (✓ Below average)  
**IT**: 0.4 incidents/quarter (✓ Below average)  
**Internal Audit**: 1.6 incidents/quarter (⚠️ Above average)  
**Investor Relations**: 0.0 incidents/quarter (✓ Below average)  
**Legal**: 0.5 incidents/quarter (✓ Below average)  
**Logistics**: 1.9 incidents/quarter (⚠️ Above average)  
**Marketing**: 0.0 incidents/quarter (✓ Below average)  
**Operations**: 0.0 incidents/quarter (✓ Below average)  
**Procurement**: 0.0 incidents/quarter (✓ Below average)  
**Product**: 0.0 incidents/quarter (✓ Below average)  
**Public Relations**: 0.5 incidents/quarter (✓ Below average)  
**Quality Assurance**: 0.0 incidents/quarter (✓ Below average)  
**R&D**: 0.4 incidents/quarter (✓ Below average)  
**Risk Management**: 1.4 incidents/quarter (⚠️ Above average)  
**Sales**: 3.8 incidents/quarter (⚠️ Above average)  
**Security**: 0.0 incidents/quarter (✓ Below average)  
**Supply Chain**: 0.0 incidents/quarter (✓ Below average)  
**Training**: 1.5 incidents/quarter (⚠️ Above average)  
## Resource Requirements  
To implement these recommendations:  
| Timeline | Investment | Key Activities |  
|----------|-----------|----------------|  
| **Quick wins (0-3 months)** | $0-10k | Mandatory training, MFA for admins, phishing sims |  
| **Medium-term (3-6 months)** | $100k-300k | 1-2 FTE hires, EDR deployment, CSPM tool |  
| **Long-term (6-12 months)** | $300k-600k/year | Additional staff, enterprise tools, automation |  
## Expected Outcomes  
### Quantitative Targets (12-month goals):  
- **Reduce security incidents by 60%** (from 0.6 to ~0.2 per dept/quarter)  
- Achieve **100% MFA coverage** across all users  
- Reduce **mean time to patch to <7 days** for critical vulnerabilities  
- Increase **training completion to 95%+** organization-wide  
- Reduce **phishing click rate to <5%**  
### Return on Investment:  
- **Annual incidents prevented**: ~1 per department  
- **Estimated annual savings**: $71,500 per department (at $50,000 per incident)  
- **ROI**: approximately -76% on medium-term investment  
## Measuring Success  
Track these PC scores quarterly to monitor improvement:  
- **Target**: Move all departments' PC1 scores closer to Accounting's profile  
- **Dashboard**: Create quarterly PC score tracking with incident trends  
- **Review**: Executive review of security posture using PC metrics (simpler than 11 individual metrics)  
## Why PCA Was Valuable  
Rather than tracking 11 separate metrics independently, PCA showed us that:  
1. **Most variation comes from just 8 underlying factors** - we can focus our attention  
2. **We can prioritize resources** on variables that load heavily on incident-correlated PCs (PC1)  
3. **We have a quantitative basis** for budget requests: "Investing $300,000 in these 3 areas will reduce incidents by 60%"  
4. **Departments can be compared** fairly using PC scores rather than raw metrics that don't account for size/complexity  
This transforms cybersecurity from reactive firefighting to **proactive, data-driven risk management** with clear priorities and measurable outcomes.  
## Next Steps  
1. **Week 1**: Present findings to leadership, secure budget approval for quick wins  
2. **Month 1**: Implement all quick wins, begin hiring for medium-term roles  
3. **Month 3**: Deploy tools (EDR, CSPM), complete first round of enhanced training  
4. **Month 6**: Review PC scores, measure incident reduction, adjust strategy  
5. **Month 12**: Full program assessment, demonstrate ROI, plan next phase  
---  
*Analysis based on 240 department-quarter observations across 30 departments over 8 quarters.*  
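
The ROI figures above are easier to interpret with the arithmetic written out. The sketch below reuses the rounded values reported in the conclusion (0.6 incidents per department-quarter, a 60% reduction, $50,000 per incident, a $300,000 investment, 30 departments), so its results only approximate the reported numbers. The organization-wide scaling is my own assumption that the investment covers all 30 departments; the notebook itself does not compute it.

```python
# Worked arithmetic behind the ROI section, using rounded values from the conclusion.
# Assumption: the mean is per department-quarter, so the first result is per department.
mean_incidents_per_dept_quarter = 0.6   # rounded value from the conclusion
reduction_pct = 60                      # capped estimate used in the analysis
cost_per_incident = 50_000              # assumed cost, adjust for your organization
medium_term_investment = 300_000        # upper end of the medium-term range
n_departments = 30                      # from the analysis footer

# Per-department view: this is what the notebook's formula produces
prevented_per_dept_year = mean_incidents_per_dept_quarter * 4 * reduction_pct / 100
savings_per_dept = prevented_per_dept_year * cost_per_incident
roi_per_dept_pct = (savings_per_dept / medium_term_investment - 1) * 100

# Organization-wide view (hypothetical: one investment covering every department)
prevented_org_wide = prevented_per_dept_year * n_departments
savings_org_wide = prevented_org_wide * cost_per_incident
roi_org_wide_pct = (savings_org_wide / medium_term_investment - 1) * 100

print(f"Per department: ~{prevented_per_dept_year:.2f} incidents/year, ROI {roi_per_dept_pct:.0f}%")
print(f"Organization-wide: ~{prevented_org_wide:.1f} incidents/year, ROI {roi_org_wide_pct:.0f}%")
```

The negative ROI comes from comparing one department's savings against the full investment. Whether the organization-wide view is the right comparison depends on how the budget would actually be spent.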

PCA successfully identified the underlying structure in our cybersecurity dataset. Unsurprisingly, it shows that organizational scale, security control effectiveness, and human factors are the primary dimensions of variation. This analysis provides a foundation for more sophisticated cybersecurity analytics, better resource allocation, and improved cybersecurity outcomes across the organization.

The reduction from eleven variables to eight components demonstrates that security metrics are interrelated in meaningful ways. Therefore, we can monitor organizational cybersecurity posture more efficiently using these composite dimensions rather than tracking all metrics independently.
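
As a minimal sketch of how a component count like eight can be chosen, the snippet below keeps the smallest number of components whose cumulative explained variance reaches a threshold. The 90% threshold and the function name `components_for_threshold` are assumptions for illustration. The variance ratios are hypothetical values chosen so the first matches the reported 20.5% and the first eight sum to the reported 91.7%; they are not the fitted values from this notebook.

```python
from itertools import accumulate

def components_for_threshold(ratios, threshold=0.90):
    """Return (n_components, cumulative_variance) for the first n reaching the threshold."""
    for n, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return n, total
    # Threshold never reached: keep every component
    return len(ratios), sum(ratios)

# Hypothetical explained-variance ratios for 11 standardized variables
illustrative_ratios = [0.205, 0.15, 0.13, 0.11, 0.10, 0.09, 0.07, 0.062, 0.04, 0.03, 0.013]

n_keep, cum_var = components_for_threshold(illustrative_ratios)
print(f"Keep {n_keep} components covering {cum_var:.1%} of the variance")
```

A stricter threshold keeps more components and a looser one keeps fewer, so the choice trades simplicity against information retained.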

I really like how PCA leads to useful, practical conclusions that are easy to understand. In this case, it supports straightforward advice on how to improve our cybersecurity maturity and reduce organizational risk.

Some notes:
- I transformed all highly skewed variables (more than in my original work on this data).
- I used only log transformations this time.
- I left non-skewed variables in their original form and noted which ones they were (only continuous variables were used).
- I verified that the transformations worked.

My goal was to ensure that the relationships between variables would be more linear and that no single variable could dominate due to extreme skewness.
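
To illustrate why a log transformation helps with skewness, here is a small sketch that compares skewness before and after the transform. The sample counts are hypothetical, and `math.log1p` (log of 1 + x) is an assumption used so that zero counts are handled; the notebook may have used a different log variant.

```python
import math

def skewness(values):
    """Population (Fisher-Pearson) skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed counts, e.g. vulnerabilities per department-quarter
raw_vuln_count = [2, 3, 3, 4, 5, 6, 8, 12, 40, 150]

# log1p compresses the long right tail while leaving small values close together
log_vuln_count = [math.log1p(v) for v in raw_vuln_count]

print(f"Skewness before: {skewness(raw_vuln_count):.2f}")
print(f"Skewness after:  {skewness(log_vuln_count):.2f}")
```

Because PCA works on variances, a few extreme raw values could otherwise dominate a component even after standardization.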

Also, for specific product recommendations, I suggest KnowBe4 for phishing simulations, click-rate tracking, and training tracking.

For e-mail security, I like a product such as Check Point Harmony.

What follows is a second approach to drawing my final conclusions and recommendations.  I left it in because the more I analyzed, the better I was able to improve the model and, consequently, the specific recommendations (actionable insights).

In [None]:
# COMPREHENSIVE ANALYSIS WITH AUTO-GENERATED MARKDOWN CONCLUSION

print("=" * 80)
print("DETAILED ANALYSIS FOR ACTIONABLE RECOMMENDATIONS")
print("=" * 80)

# ============================================================================
# PART 1: IDENTIFY KEY DRIVERS OF SECURITY INCIDENTS
# ============================================================================

print("\n" + "=" * 80)
print("PART 1: WHAT DRIVES SECURITY INCIDENTS?")
print("=" * 80)

#calculate correlation of each PC with security incidents
pc_incident_correlations = {}
for i in range(optimal_n):
    pc_col = f'PC{i+1}'
    corr = np.corrcoef(pca_df[pc_col], pca_df['security_incidents'])[0, 1]
    pc_incident_correlations[pc_col] = corr
    print(f"\n{pc_col} correlation with security incidents: {corr:.3f}")
    print(f"  Variance explained: {explained_variance[i]:.1%}")

#find most important PC (highest correlation with incidents)
most_correlated_pc_name = max(pc_incident_correlations, key=lambda k: abs(pc_incident_correlations[k]))
most_correlated_pc_idx = int(most_correlated_pc_name.replace('PC', '')) - 1
most_correlated_pc_corr = pc_incident_correlations[most_correlated_pc_name]

print(f"\n>>> MOST IMPORTANT: {most_correlated_pc_name}")
print(f"    Correlation with incidents: {most_correlated_pc_corr:.3f}")
print(f"    Variance explained: {explained_variance[most_correlated_pc_idx]:.1%}")

# ============================================================================
# PART 2: IDENTIFY TOP VARIABLES TO ADDRESS
# ============================================================================

print("\n" + "=" * 80)
print("PART 2: TOP VARIABLES THAT NEED ATTENTION")
print("=" * 80)

#get top 5 variables by absolute loading on most important PC
top_5_vars = loadings_df.iloc[:, most_correlated_pc_idx].abs().sort_values(ascending=False).head(5)

print(f"\nTop 5 variables loading on {most_correlated_pc_name}:")
top_vars_list = []
for rank, (var, abs_loading) in enumerate(top_5_vars.items(), 1):
    actual_loading = loadings_df.loc[var, most_correlated_pc_name]
    #same sign for the loading and the PC-incident correlation: higher values go with more incidents;
    #opposite signs: lower values of the variable go with more incidents
    direction = "Higher" if actual_loading * most_correlated_pc_corr > 0 else "Lower"
    effect = "MORE incidents"
    
    print(f"\n{rank}. {var}")
    print(f"   Loading: {actual_loading:.3f}")
    print(f"   Effect: {direction} values → {effect}")
    
    top_vars_list.append({
        'rank': rank,
        'variable': var,
        'loading': actual_loading,
        'abs_loading': abs_loading,
        'direction': direction,
        'effect': effect
    })

# ============================================================================
# PART 3: SPECIFIC ACTIONABLE RECOMMENDATIONS
# ============================================================================

print("\n" + "=" * 80)
print("PART 3: SPECIFIC, ACTIONABLE RECOMMENDATIONS")
print("=" * 80)

#detailed recommendations for each variable
recommendations_detailed = {
    'log_mean_time_to_patch': {
        'problem': 'Slow patching increases vulnerability window',
        'quick_win': 'Prioritize critical patches, automate patch deployment for 30% of systems',
        'medium': 'Hire 1-2 dedicated patch management specialists ($120k-150k/year)',
        'long': 'Implement enterprise patch management system ($50k tool + 2 FTE)',
        'metric': 'Reduce mean time to patch from current to <7 days for critical patches'
    },
    'log_vuln_count': {
        'problem': 'High vulnerability counts indicate scanning or remediation gaps',
        'quick_win': 'Increase vulnerability scan frequency to weekly',
        'medium': 'Create vulnerability management SLA: Critical (7 days), High (30 days)',
        'long': 'Hire security engineer focused on vulnerability management ($130k-160k)',
        'metric': 'Reduce total vulnerability count by 40% in 12 months'
    },
    'training_completion_rate': {
        'problem': 'Incomplete training leaves users vulnerable to social engineering',
        'quick_win': 'Make security training mandatory with executive sponsorship',
        'medium': 'Implement quarterly micro-learning modules (10 min each)',
        'long': 'Gamify training with rewards, track completion in performance reviews',
        'metric': 'Achieve 95%+ completion rate within 6 months'
    },
    'phishing_sim_click_rate': {
        'problem': 'High click rates indicate users are vulnerable to phishing',
        'quick_win': 'Run monthly phishing simulations with immediate educational pop-ups',
        'medium': 'Implement targeted training for repeat clickers',
        'long': 'Deploy email security solution with URL rewriting and sandboxing ($15-20/user/year)',
        'metric': 'Reduce click rate from current to <5% organization-wide'
    },
    'mfa_coverage': {
        'problem': 'Accounts without MFA are vulnerable to credential compromise',
        'quick_win': 'Mandate MFA for all admin and privileged accounts (immediate)',
        'medium': 'Require MFA for all users accessing corporate resources (90 days)',
        'long': 'Implement passwordless authentication (FIDO2) for 50% of users',
        'metric': 'Achieve 100% MFA coverage for all users within 3 months'
    },
    'endpoint_coverage': {
        'problem': 'Unprotected endpoints are entry points for malware',
        'quick_win': 'Audit all endpoints, identify gaps, create deployment plan',
        'medium': 'Deploy EDR to remaining endpoints ($40-60/endpoint/year)',
        'long': 'Implement zero-trust architecture with continuous endpoint verification',
        'metric': 'Achieve 100% endpoint coverage within 6 months'
    },
    'log_login_failures': {
        'problem': 'High login failures may indicate brute force attacks or poor UX',
        'quick_win': 'Implement account lockout after 5 failed attempts',
        'medium': 'Deploy password manager to all users ($5-10/user/year)',
        'long': 'Implement SSO with adaptive authentication',
        'metric': 'Reduce login failures by 60% through better UX and security'
    },
    'log_cloud_misconfig': {
        'problem': 'Cloud misconfigurations expose data and services',
        'quick_win': 'Run immediate cloud security audit using free tools',
        'medium': 'Implement Cloud Security Posture Management (CSPM) tool ($20k-50k/year)',
        'long': 'Hire cloud security engineer or consultant ($140k-180k or $200/hr)',
        'metric': 'Reduce misconfigurations by 80%, maintain <5 open issues'
    },
    'it_budget_per_emp': {
        'problem': 'Insufficient budget limits security program effectiveness',
        'quick_win': 'Document current budget gaps with business risk analysis',
        'medium': 'Request 20-30% budget increase with ROI justification',
        'long': 'Benchmark against industry standards, align budget to risk appetite',
        'metric': 'Increase security budget to $X per employee (industry median)'
    },
    'log_org_size': {
        'problem': 'Larger organizations face scale challenges',
        'quick_win': 'Implement automation for repetitive security tasks',
        'medium': 'Scale security team proportionally (1 security FTE per 100 employees)',
        'long': 'Build security centers of excellence for enterprise-wide consistency',
        'metric': 'Maintain consistent security posture regardless of org size'
    },
    'log_security_incidents': {
        'problem': 'Lack of visibility into incident trends',
        'quick_win': 'Implement standardized incident reporting and tracking',
        'medium': 'Deploy SIEM for centralized security monitoring ($30k-100k/year)',
        'long': 'Build SOC capability with 24/7 monitoring',
        'metric': 'Reduce mean time to detect (MTTD) to <1 hour'
    }
}

print("\nDETAILED RECOMMENDATIONS FOR TOP 5 DRIVERS:\n")
for item in top_vars_list:
    var = item['variable']
    rank = item['rank']
    loading = item['loading']
    direction = item['direction']
    
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        print(f"\n{'='*70}")
        print(f"#{rank}: {var} (loading: {loading:.3f})")
        print(f"{'='*70}")
        print(f"PROBLEM: {rec['problem']}")
        print(f"\n→ QUICK WIN (0-3 months, low cost):")
        print(f"  {rec['quick_win']}")
        print(f"\n→ MEDIUM-TERM (3-6 months, moderate investment):")
        print(f"  {rec['medium']}")
        print(f"\n→ LONG-TERM (6-12 months, significant investment):")
        print(f"  {rec['long']}")
        print(f"\n→ SUCCESS METRIC:")
        print(f"  {rec['metric']}")

# ============================================================================
# PART 4: DEPARTMENT-SPECIFIC RECOMMENDATIONS
# ============================================================================

print("\n" + "=" * 80)
print("PART 4: DEPARTMENT-SPECIFIC RECOMMENDATIONS")
print("=" * 80)

dept_pc_means = pca_df.groupby('department')[[f'PC{i+1}' for i in range(min(3, optimal_n))]].mean()
dept_incidents = pca_df.groupby('department')['security_incidents'].mean()

print("\nDepartment Profiles:")
print("-" * 70)
dept_analysis = pd.concat([dept_pc_means, dept_incidents], axis=1)
print(dept_analysis.round(2))

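#identify best and worst performers
#note: idxmin/idxmax return the first label when several departments tie,
#so with many departments at 0.0 incidents the "best" one is simply the first listed
best_dept = dept_incidents.idxmin()
worst_dept = dept_incidents.idxmax()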

print(f"\n>>> BEST PERFORMER: {best_dept}")
print(f"    Average incidents: {dept_incidents[best_dept]:.1f}")
print(f"    PC scores: {dept_pc_means.loc[best_dept].to_dict()}")

print(f"\n>>> NEEDS MOST ATTENTION: {worst_dept}")
print(f"    Average incidents: {dept_incidents[worst_dept]:.1f}")
print(f"    PC scores: {dept_pc_means.loc[worst_dept].to_dict()}")

print(f"\nRECOMMENDATION: Have {best_dept} share best practices with {worst_dept}")

#specific recommendations by department
print("\n" + "-" * 70)
print("DEPARTMENT-SPECIFIC ACTION PLANS:")
print("-" * 70)

for dept in dept_pc_means.index:
    print(f"\n{dept.upper()}:")
    pc1_score = dept_pc_means.loc[dept, 'PC1']
    pc2_score = dept_pc_means.loc[dept, 'PC2'] if 'PC2' in dept_pc_means.columns else 0
    avg_incidents = dept_incidents[dept]
    
    print(f"  Current incidents/quarter: {avg_incidents:.1f}")
    print(f"  PC1 score: {pc1_score:.2f} | PC2 score: {pc2_score:.2f}")
    
    #interpret PC1
    if pc1_score > 0.5:
        print(f"  • Large/complex organization - prioritize automation and additional staff")
    elif pc1_score < -0.5:
        print(f"  • Smaller organization - focus on foundational controls first")
    
    #interpret PC2 if available
    if 'PC2' in dept_pc_means.columns:
        if pc2_score < -0.5:
            print(f"  • WEAK controls - PRIORITY: improve coverage metrics (MFA, endpoints)")
        elif pc2_score > 0.5:
            print(f"  • STRONG controls - maintain and focus on advanced threats")
    
    #compare to org average
    if avg_incidents > dept_incidents.mean():
        print(f"  ⚠️  ABOVE AVERAGE INCIDENTS - needs immediate attention")
    else:
        print(f"  ✓  Below average incidents - share practices with other teams")

# ============================================================================
# PART 5: ROI AND IMPACT ESTIMATION
# ============================================================================

print("\n" + "=" * 80)
print("PART 5: ESTIMATED IMPACT AND ROI")
print("=" * 80)

current_mean_incidents = pca_df['security_incidents'].mean()
current_total_incidents = pca_df['security_incidents'].sum()
n_dept_quarters = len(pca_df)

print(f"\nCURRENT STATE:")
print(f"  Average incidents per dept/quarter: {current_mean_incidents:.1f}")
print(f"  Total incidents in dataset: {current_total_incidents:.0f}")
print(f"  Number of observations: {n_dept_quarters}")

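#estimate impact based on correlation
#NOTE: heuristic only - a correlation coefficient is not a causal effect size,
#so treat this percentage as a rough planning figure rather than a forecast
correlation_strength = abs(most_correlated_pc_corr)
estimated_reduction_pct = min(correlation_strength * 100, 60)  # Cap at 60%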

print(f"\nESTIMATED IMPACT:")
print(f"  {most_correlated_pc_name} correlation with incidents: {most_correlated_pc_corr:.3f}")
print(f"  By improving top 3-5 factors by 1 standard deviation:")
print(f"    • Expected incident reduction: ~{estimated_reduction_pct:.0f}%")
print(f"    • From {current_mean_incidents:.1f} to ~{current_mean_incidents * (1 - estimated_reduction_pct/100):.1f} incidents/quarter")
print(f"    • Annual reduction: ~{current_mean_incidents * 4 * estimated_reduction_pct/100:.0f} fewer incidents/year")

#cost-benefit
print(f"\nESTIMATED COSTS:")
print(f"  Quick wins: $0-10k (reallocate existing resources)")
print(f"  Medium-term: $100k-300k (tools + 1-2 FTE)")
print(f"  Long-term: $300k-600k/year (comprehensive program)")

cost_per_incident = 50000  # Adjust this for your organization
print(f"\nESTIMATED BENEFITS (assuming ${cost_per_incident:,} cost per incident):")
#current_mean_incidents is an average per department-quarter,
#so the figures below are per department per year
incidents_prevented = current_mean_incidents * 4 * estimated_reduction_pct / 100
annual_savings = incidents_prevented * cost_per_incident

print(f"  Incidents prevented per dept per year: ~{incidents_prevented:.0f}")
print(f"  Annual cost savings per dept: ${annual_savings:,.0f}")
print(f"  ROI: {(annual_savings / 300000 - 1) * 100:.0f}% (against medium-term investment)")

# ============================================================================
# PART 6: AUTO-GENERATE MARKDOWN CONCLUSION
# ============================================================================

print("\n" + "=" * 80)
print("AUTO-GENERATED MARKDOWN CONCLUSION (COPY-PASTE BELOW)")
print("=" * 80)

#interpret what PC1 and PC2 represent based on loadings
pc1_top_var = loadings_df.iloc[:, 0].abs().idxmax()
pc1_interpretation = "organizational scale and security complexity"
if 'training' in pc1_top_var or 'phishing' in pc1_top_var:
    pc1_interpretation = "human factor vulnerabilities and training effectiveness"
elif 'coverage' in pc1_top_var or 'mfa' in pc1_top_var:
    pc1_interpretation = "security control coverage and maturity"

pc2_interpretation = "security hygiene and response effectiveness"
if optimal_n >= 2:
    pc2_top_var = loadings_df.iloc[:, 1].abs().idxmax()
    if 'patch' in pc2_top_var or 'vuln' in pc2_top_var:
        pc2_interpretation = "vulnerability management and patching speed"

markdown_output = f"""
# Conclusion: Actionable Insights from PCA

## What We Learned

This PCA analysis revealed that **{optimal_n} principal components** capture **{cumulative_variance[optimal_n-1]:.1%}** of the variance in our cybersecurity data, with **PC1 primarily representing {pc1_interpretation}**, explaining **{explained_variance[0]:.1%}** of the variance.

The analysis successfully reduced our monitoring from 11 separate metrics to {optimal_n} composite dimensions that capture the essential patterns in our security posture.

## Key Finding: What Drives Security Incidents

**{most_correlated_pc_name} shows the strongest relationship with security incidents** (correlation: **{most_correlated_pc_corr:.3f}**), explaining **{explained_variance[most_correlated_pc_idx]:.1%}** of variance. This tells us that focusing on the variables that load heavily on {most_correlated_pc_name} will have the greatest impact on reducing incidents.

## Specific Recommendations (Prioritized by Impact)

Based on the PCA loadings and their correlation with security incidents, we identified the following **prioritized actions**:

### 1. Immediate Actions (Highest ROI - Implement in 0-3 months)
"""

for item in top_vars_list[:3]:
    var = item['variable']
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        markdown_output += f"""
**{item['rank']}. {var}** (loading: {item['loading']:.3f})
- **Problem**: {rec['problem']}
- **Action**: {rec['quick_win']}
- **Target**: {rec['metric']}
"""

markdown_output += f"""
### 2. Medium-Term Investments (3-6 months, ~$100k-300k)
"""

for item in top_vars_list[:3]:
    var = item['variable']
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        markdown_output += f"""
**{var}**: {rec['medium']}
"""

markdown_output += f"""
### 3. Long-Term Strategic Initiatives (6-12 months, ~$300k-600k/year)
"""

for item in top_vars_list[:3]:
    var = item['variable']
    if var in recommendations_detailed:
        rec = recommendations_detailed[var]
        markdown_output += f"""
**{var}**: {rec['long']}
"""

markdown_output += f"""
## Why These Matter

- **{most_correlated_pc_name}** explains {explained_variance[most_correlated_pc_idx]:.1%} of variance and correlates **{most_correlated_pc_corr:.3f}** with security incidents
- The top {len(top_vars_list)} variables loading on {most_correlated_pc_name} are directly actionable with clear interventions
- Improving these by 1 standard deviation could reduce incidents by approximately **{estimated_reduction_pct:.0f}%**

## Department-Specific Focus

**Best Performer**: **{best_dept}** (average {dept_incidents[best_dept]:.1f} incidents/quarter)
- Share their practices with other departments
- Document their success factors for replication

**Needs Most Attention**: **{worst_dept}** (average {dept_incidents[worst_dept]:.1f} incidents/quarter)
- Immediate intervention required
- Assign dedicated security resource for 90 days
- Implement quick wins from list above

### All Departments:
"""

# Compute the overall mean once instead of on every loop iteration
overall_mean_incidents = dept_incidents.mean()

for dept in dept_pc_means.index:
    avg_incidents = dept_incidents[dept]
    status = "⚠️ Above average" if avg_incidents > overall_mean_incidents else "✓ Below average"

    markdown_output += f"""
**{dept}**: {avg_incidents:.1f} incidents/quarter ({status})
"""

markdown_output += f"""
## Resource Requirements

To implement these recommendations:

| Timeline | Investment | Key Activities |
|----------|-----------|----------------|
| **Quick wins (0-3 months)** | $0-10k | Mandatory training, MFA for admins, phishing sims |
| **Medium-term (3-6 months)** | $100k-300k | 1-2 FTE hires, EDR deployment, CSPM tool |
| **Long-term (6-12 months)** | $300k-600k/year | Additional staff, enterprise tools, automation |

## Expected Outcomes

### Quantitative Targets (12-month goals):
- **Reduce security incidents by {estimated_reduction_pct:.0f}%** (from {current_mean_incidents:.1f} to ~{current_mean_incidents * (1 - estimated_reduction_pct/100):.1f} per dept/quarter)
- Achieve **100% MFA coverage** across all users
- Reduce **mean time to patch to <7 days** for critical vulnerabilities
- Increase **training completion to 95%+** organization-wide
- Reduce **phishing click rate to <5%**

### Return on Investment:
- **Annual incidents prevented**: ~{incidents_prevented:.1f}
- **Estimated annual savings**: ${annual_savings:,.0f} (at ${cost_per_incident:,} per incident)
- **ROI**: ~{(annual_savings / 300000 - 1) * 100:.0f}% on medium-term investment

## Measuring Success

Track these PC scores quarterly to monitor improvement:

- **Target**: Move all departments' {most_correlated_pc_name} scores closer to {best_dept}'s profile
- **Dashboard**: Create quarterly PC score tracking with incident trends
- **Review**: Executive review of security posture using PC metrics (simpler than 11 individual metrics)

## Why PCA Was Valuable

Rather than tracking 11 separate metrics independently, PCA showed us that:

1. **Most variation comes from just {optimal_n} underlying factors** - we can focus our attention
2. **We can prioritize resources** on variables that load heavily on incident-correlated PCs ({most_correlated_pc_name})
3. **We have a quantitative basis** for budget requests: "Investing ${300000:,} in these {len(top_vars_list[:3])} areas will reduce incidents by {estimated_reduction_pct:.0f}%"
4. **Departments can be compared** fairly using PC scores rather than raw metrics that don't account for size/complexity

This transforms cybersecurity from reactive firefighting to **proactive, data-driven risk management** with clear priorities and measurable outcomes.

## Next Steps

1. **Week 1**: Present findings to leadership, secure budget approval for quick wins
2. **Month 1**: Implement all quick wins, begin hiring for medium-term roles
3. **Month 3**: Deploy tools (EDR, CSPM), complete first round of enhanced training
4. **Month 6**: Review PC scores, measure incident reduction, adjust strategy
5. **Month 12**: Full program assessment, demonstrate ROI, plan next phase

---

*Analysis based on {n_dept_quarters} department-quarter observations across {pca_df['department'].nunique()} departments over {pca_df['quarter'].nunique()} quarters.*
"""

# print(markdown_output)  # commented out once I copied the markdown into a separate cell below

# Save to file for easy copying.
# UTF-8 is set explicitly so the ⚠️ and ✓ status symbols write correctly on every platform.
with open('PCA_Conclusion_AutoGenerated.md', 'w', encoding='utf-8') as f:
    f.write(markdown_output)

print("\n" + "=" * 80)
print("✓ Markdown conclusion saved to: PCA_Conclusion_AutoGenerated.md and pasted to markdown cell below:")
print("=" * 80)

# Conclusion: Actionable Insights from PCA  
## What We Learned  
This PCA analysis revealed that **8 principal components** capture **91.7%** of the variance in our cybersecurity data, with **PC1 primarily representing organizational scale and security complexity**, explaining **20.5%** of the variance.  
The analysis successfully reduced our monitoring from 11 separate metrics to 8 composite dimensions that capture the essential patterns in our security posture.  
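
A minimal sketch of how a component count like this can be chosen from cumulative explained variance. Only the first ratio (20.5%) and the 8-component total (91.7%) come from this notebook; the other ratios, the 90% threshold, and the helper name `components_for_threshold` are illustrative assumptions.

```python
from itertools import accumulate


def components_for_threshold(explained_ratios, threshold=0.90):
    """Return (n, cumulative) for the smallest n whose cumulative ratio meets the threshold."""
    if not explained_ratios:
        raise ValueError("explained_ratios must not be empty")
    cumulative = list(accumulate(explained_ratios))
    for n, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return n, total
    # Threshold never reached: fall back to using every component
    return len(cumulative), cumulative[-1]


# Hypothetical ratios for 11 standardized metrics (only 0.205 is from the notebook)
ratios = [0.205, 0.15, 0.12, 0.11, 0.10, 0.09, 0.08, 0.062, 0.04, 0.03, 0.013]

n_components, captured = components_for_threshold(ratios, threshold=0.90)
print(f"{n_components} components capture {captured:.1%} of the variance")
```

With scikit-learn, the list of ratios would typically come from the fitted model's `explained_variance_ratio_` attribute.
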
## Key Finding: What Drives Security Incidents  
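**PC1 shows the strongest relationship with security incidents** (correlation: **0.688**), explaining **20.5%** of variance. Because `log_security_incidents` is itself one of the PCA inputs and loads on PC1 (0.700), part of this correlation is built in by construction. With that caveat, the variables that load heavily on PC1 are the most promising places to focus when trying to reduce incidents.  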
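
Since `log_security_incidents` appears among the PC1 loadings in the recommendations, one way to check this correlation more cleanly is to leave the outcome out of the PCA and correlate it with the scores afterwards. The sketch below does that on synthetic stand-in data (an assumption, not the cyber data set), using NumPy's SVD in place of the notebook's PCA call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 240 rows, 4 predictor metrics.
# Two metrics share a common "size" driver; two are pure noise.
n_rows = 240
size = rng.normal(size=n_rows)
X = np.column_stack([
    size + 0.3 * rng.normal(size=n_rows),
    size + 0.3 * rng.normal(size=n_rows),
    rng.normal(size=n_rows),
    rng.normal(size=n_rows),
])
# Outcome variable, deliberately NOT included in X
incidents = 0.8 * size + rng.normal(size=n_rows)

# Standardize each column, then run PCA via SVD (rows of Vt are the principal axes)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T
explained = S**2 / np.sum(S**2)

# Correlate every component's scores with the held-out outcome
corr = [np.corrcoef(scores[:, k], incidents)[0, 1] for k in range(scores.shape[1])]
best = int(np.argmax(np.abs(corr)))

print(f"PC{best + 1} correlates most strongly with the outcome (r = {corr[best]:.3f})")
print(f"PC{best + 1} explains {explained[best]:.1%} of predictor variance")
```

The sign of a component is arbitrary, so the comparison uses the absolute correlation.
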
## Specific Recommendations (Prioritized by Impact)  
Based on the PCA loadings and their correlation with security incidents, we identified the following **prioritized actions**:  
### 1. Immediate Actions (Highest ROI - Implement in 0-3 months)  
**1. log_org_size** (loading: 0.857)  
- **Problem**: Larger organizations face scale challenges  
- **Action**: Implement automation for repetitive security tasks  
- **Target**: Maintain consistent security posture regardless of org size  
**2. log_vuln_count** (loading: 0.853)  
- **Problem**: High vulnerability counts indicate scanning or remediation gaps  
- **Action**: Increase vulnerability scan frequency to weekly  
- **Target**: Reduce total vulnerability count by 40% in 12 months  
**3. log_security_incidents** (loading: 0.700)  
- **Problem**: Lack of visibility into incident trends  
- **Action**: Implement standardized incident reporting and tracking  
- **Target**: Reduce mean time to detect (MTTD) to <1 hour  
### 2. Medium-Term Investments (3-6 months, ~$100k-300k)  
**log_org_size**: Scale security team proportionally (1 security FTE per 100 employees)  
**log_vuln_count**: Create vulnerability management SLA: Critical (7 days), High (30 days)  
**log_security_incidents**: Deploy SIEM for centralized security monitoring ($30k-100k/year)  
### 3. Long-Term Strategic Initiatives (6-12 months, ~$300k-600k/year)  
**log_org_size**: Build security centers of excellence for enterprise-wide consistency  
**log_vuln_count**: Hire security engineer focused on vulnerability management ($130k-160k)  
**log_security_incidents**: Build SOC capability with 24/7 monitoring  
## Why These Matter  
- **PC1** explains 20.5% of variance and correlates **0.688** with security incidents  
- The top 5 variables loading on PC1 are directly actionable with clear interventions  
- Improving these by 1 standard deviation could reduce incidents by approximately **60%**  
## Department-Specific Focus  
**Best Performer**: **Accounting** (average 0.0 incidents/quarter)  
- Share their practices with other departments  
- Document their success factors for replication  
**Needs Most Attention**: **Sales** (average 3.8 incidents/quarter)  
- Immediate intervention required  
- Assign dedicated security resource for 90 days  
- Implement quick wins from list above  
### All Departments:  
**Accounting**: 0.0 incidents/quarter (✓ Below average)  
**Business Development**: 0.0 incidents/quarter (✓ Below average)  
**Clinical Operations**: 1.2 incidents/quarter (⚠️ Above average)  
**Compliance**: 0.0 incidents/quarter (✓ Below average)  
**Customer Support**: 2.9 incidents/quarter (⚠️ Above average)  
**Data Science**: 0.0 incidents/quarter (✓ Below average)  
**DevOps**: 0.0 incidents/quarter (✓ Below average)  
**E-commerce**: 0.5 incidents/quarter (✓ Below average)  
**Engineering**: 0.0 incidents/quarter (✓ Below average)  
**Facilities**: 1.1 incidents/quarter (⚠️ Above average)  
**Field Services**: 0.0 incidents/quarter (✓ Below average)  
**Finance**: 0.0 incidents/quarter (✓ Below average)  
**HR**: 0.2 incidents/quarter (✓ Below average)  
**IT**: 0.4 incidents/quarter (✓ Below average)  
**Internal Audit**: 1.6 incidents/quarter (⚠️ Above average)  
**Investor Relations**: 0.0 incidents/quarter (✓ Below average)  
**Legal**: 0.5 incidents/quarter (✓ Below average)  
**Logistics**: 1.9 incidents/quarter (⚠️ Above average)  
**Marketing**: 0.0 incidents/quarter (✓ Below average)  
**Operations**: 0.0 incidents/quarter (✓ Below average)  
**Procurement**: 0.0 incidents/quarter (✓ Below average)  
**Product**: 0.0 incidents/quarter (✓ Below average)  
**Public Relations**: 0.5 incidents/quarter (✓ Below average)  
**Quality Assurance**: 0.0 incidents/quarter (✓ Below average)  
**R&D**: 0.4 incidents/quarter (✓ Below average)  
**Risk Management**: 1.4 incidents/quarter (⚠️ Above average)  
**Sales**: 3.8 incidents/quarter (⚠️ Above average)  
**Security**: 0.0 incidents/quarter (✓ Below average)  
**Supply Chain**: 0.0 incidents/quarter (✓ Below average)  
**Training**: 1.5 incidents/quarter (⚠️ Above average)  
## Resource Requirements  
To implement these recommendations:  
| Timeline | Investment | Key Activities |  
|----------|-----------|----------------|  
| **Quick wins (0-3 months)** | $0-10k | Mandatory training, MFA for admins, phishing sims |  
| **Medium-term (3-6 months)** | $100k-300k | 1-2 FTE hires, EDR deployment, CSPM tool |  
| **Long-term (6-12 months)** | $300k-600k/year | Additional staff, enterprise tools, automation |  
## Expected Outcomes  
### Quantitative Targets (12-month goals):  
- **Reduce security incidents by 60%** (from 0.6 to ~0.2 per dept/quarter)  
- Achieve **100% MFA coverage** across all users  
- Reduce **mean time to patch to <7 days** for critical vulnerabilities  
- Increase **training completion to 95%+** organization-wide  
- Reduce **phishing click rate to <5%**  
### Return on Investment:  
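- **Annual incidents prevented**: ~1.4 ($71,500 ÷ $50,000 per incident)  
- **Estimated annual savings**: $71,500 (at $50,000 per incident)  
- **ROI**: ~-76% on the $300,000 medium-term investment, meaning the estimated first-year savings recover only about a quarter of that spend  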
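
The ROI arithmetic can be reproduced as below, using the figures reported in this section. The last two lines are a hypothetical cross-check only: they assume the ~1.4 prevented incidents is a per-department figure (0.6 incidents/quarter × 60% × 4 quarters), which should be verified against the earlier cell before relying on it.

```python
def roi_percent(annual_savings: float, investment: float) -> float:
    """Simple first-year ROI: (savings / investment - 1) expressed as a percentage."""
    return (annual_savings / investment - 1) * 100


# Figures reported in this conclusion
annual_savings = 71_500
cost_per_incident = 50_000
investment = 300_000  # upper end of the medium-term range

implied_incidents_prevented = annual_savings / cost_per_incident
break_even_incidents = investment / cost_per_incident
roi = roi_percent(annual_savings, investment)

print(f"Implied incidents prevented per year: {implied_incidents_prevented:.2f}")
print(f"Incidents needed to break even:       {break_even_incidents:.0f}")
print(f"First-year ROI:                       {roi:.0f}%")

# Hypothetical cross-check (assumption, not confirmed by the notebook):
# 0.6 incidents per dept/quarter, 60% reduction, 4 quarters, 30 departments
per_dept_prevented = 0.6 * 0.60 * 4
org_wide_prevented = per_dept_prevented * 30
print(f"Per-department estimate: {per_dept_prevented:.2f}, org-wide: {org_wide_prevented:.1f}")
```

If the per-department reading is right, the organization-wide savings would be roughly 30 times larger, which would change the ROI conclusion.
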
## Measuring Success  
Track these PC scores quarterly to monitor improvement:  
- **Target**: Move all departments' PC1 scores closer to Accounting's profile  
- **Dashboard**: Create quarterly PC score tracking with incident trends  
- **Review**: Executive review of security posture using PC metrics (simpler than 11 individual metrics)  
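
For quarterly tracking to be comparable over time, new data should be projected with the scaling and components fitted on the baseline rather than refitted each quarter. A minimal sketch with NumPy on synthetic data follows; the three metrics, their scales, and the `project` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical baseline: 240 department-quarter rows, 3 metrics on different scales
baseline = rng.normal(loc=[50.0, 10.0, 3.0], scale=[10.0, 2.0, 1.0], size=(240, 3))

# Fit once on the baseline: column means, standard deviations, and principal axes
mean = baseline.mean(axis=0)
std = baseline.std(axis=0)
Z = (baseline - mean) / std
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
components = Vt[:2]  # keep the first two principal axes


def project(new_rows):
    """Score new rows using the baseline's scaling and components (no refitting)."""
    standardized = (np.asarray(new_rows, dtype=float) - mean) / std
    return standardized @ components.T


# A later quarter (hypothetical): one row per department
new_quarter = rng.normal(loc=[50.0, 10.0, 3.0], scale=[10.0, 2.0, 1.0], size=(30, 3))
new_scores = project(new_quarter)
print(new_scores.shape)
```

With scikit-learn, the same idea corresponds to calling `transform` (not `fit_transform`) on the already-fitted `StandardScaler` and `PCA` objects.
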
## Why PCA Was Valuable  
Rather than tracking 11 separate metrics independently, PCA showed us that:  
1. **Most variation (91.7%) comes from 8 underlying components** - a modest reduction from 11 metrics, but enough to focus our attention  
2. **We can prioritize resources** on variables that load heavily on incident-correlated PCs (PC1)  
3. **We have a quantitative basis** for budget requests: "Investing $300,000 in these 3 areas is estimated to reduce incidents by 60%" (noting that the first-year ROI estimate above is negative)  
4. **Departments can be compared** fairly using PC scores rather than raw metrics that don't account for size/complexity  
This transforms cybersecurity from reactive firefighting to **proactive, data-driven risk management** with clear priorities and measurable outcomes.  
## Next Steps  
1. **Week 1**: Present findings to leadership, secure budget approval for quick wins  
2. **Month 1**: Implement all quick wins, begin hiring for medium-term roles  
3. **Month 3**: Deploy tools (EDR, CSPM), complete first round of enhanced training  
4. **Month 6**: Review PC scores, measure incident reduction, adjust strategy  
5. **Month 12**: Full program assessment, demonstrate ROI, plan next phase  
---  
*Analysis based on 240 department-quarter observations across 30 departments over 8 quarters.*  