# Tier 1: Correlation Analysis

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 2497f608-0610-4ae4-bc4e-27d77a0f0af0

---

## Citation
Brandon Deloatch, "Tier 1: Correlation Analysis," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on the scikit-learn, pandas, NumPy, and Plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+, seaborn (imported by the notebook; no minimum version stated)
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 2497f608-0610-4ae4-bc4e-27d77a0f0af0
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
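
One way the auto-tracking note could be realized is sketched below. It is an illustration only: the helper name `capture_execution_metadata` and the default package list are assumptions, not part of the notebook.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("pandas", "numpy", "scipy", "plotly")):
    """Collect basic run metadata (hypothetical helper, not from the notebook)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "package_versions": versions,
    }


run_metadata = capture_execution_metadata()
print(json.dumps(run_metadata, indent=2))
```

Saving this dictionary next to the notebook output would record which library versions produced a given set of results.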

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from scipy import stats
from scipy.stats import pearsonr, spearmanr, kendalltau
import warnings
warnings.filterwarnings('ignore')

print(" Tier 1: Correlation Analysis - Libraries Loaded Successfully!")
print("=" * 60)
print("Available Correlation Methods:")
print("• Pearson Correlation - Linear relationships (parametric)")
print("• Spearman Correlation - Monotonic relationships (non-parametric)")
print("• Kendall's Tau - Rank-based relationships (robust)")
print("• Partial Correlation - Controlling for confounders")
print("• Point-Biserial - Continuous vs Binary variables")
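
The method list above mentions point-biserial correlation for a continuous variable against a binary one. A minimal sketch follows, using made-up data (the variable names, seed, and effect size are assumptions for illustration). It also shows that the point-biserial coefficient coincides with Pearson's r computed on a 0/1 coding.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

rng = np.random.default_rng(0)  # illustrative seed

# Hypothetical binary flag (e.g. recession = 1) and a continuous score shifted by the flag
is_recession = rng.integers(0, 2, size=500)
confidence_score = 60 - 8 * is_recession + rng.normal(0, 5, size=500)

# pointbiserialr takes the binary variable first, then the continuous one
r_pb, p_pb = pointbiserialr(is_recession, confidence_score)
r_pearson, _ = pearsonr(is_recession, confidence_score)

print(f"Point-biserial r = {r_pb:.3f} (p = {p_pb:.4g})")
print(f"Pearson r on the 0/1 coding = {r_pearson:.3f}")
```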

In [None]:
# Generate Comprehensive Sample Dataset
np.random.seed(42)

# Create realistic economic/business dataset
n_samples = 1000

# Base economic indicators
gdp_growth = np.random.normal(2.5, 1.2, n_samples)
unemployment = 8 - 0.5 * gdp_growth + np.random.normal(0, 0.8, n_samples)
inflation = 2 + 0.3 * gdp_growth + np.random.normal(0, 0.5, n_samples)

# Consumer and business metrics
consumer_confidence = 50 + 5 * gdp_growth - 2 * unemployment + np.random.normal(0, 5, n_samples)
stock_returns = 0.08 + 0.4 * gdp_growth - 0.2 * unemployment + np.random.normal(0, 0.15, n_samples)
interest_rates = 3 + 0.6 * inflation + np.random.normal(0, 0.3, n_samples)

# Retail and housing
retail_sales = 100 + 10 * consumer_confidence/10 + 5 * gdp_growth + np.random.normal(0, 8, n_samples)
housing_prices = 250000 + 15000 * gdp_growth - 8000 * interest_rates + np.random.normal(0, 20000, n_samples)

# Technology and innovation metrics
tech_investment = 50 + 8 * gdp_growth + 0.1 * stock_returns * 1000 + np.random.normal(0, 10, n_samples)
productivity = 100 + 2 * tech_investment/10 + np.random.normal(0, 5, n_samples)

# Create DataFrame
df = pd.DataFrame({
 'GDP_Growth': gdp_growth,
 'Unemployment_Rate': unemployment,
 'Inflation_Rate': inflation,
 'Consumer_Confidence': consumer_confidence,
 'Stock_Returns': stock_returns * 100, # Convert to percentage
 'Interest_Rates': interest_rates,
 'Retail_Sales_Index': retail_sales,
 'Housing_Price_Index': housing_prices / 1000, # In thousands
 'Tech_Investment': tech_investment,
 'Productivity_Index': productivity
})

# Ensure realistic bounds
df['Unemployment_Rate'] = np.clip(df['Unemployment_Rate'], 1, 15)
df['Inflation_Rate'] = np.clip(df['Inflation_Rate'], -2, 8)
df['Consumer_Confidence'] = np.clip(df['Consumer_Confidence'], 0, 100)
df['Interest_Rates'] = np.clip(df['Interest_Rates'], 0, 10)

print(" Economic Dataset Generated Successfully!")
print(f"Dataset Shape: {df.shape}")
print("\nDataset Overview:")
print(df.head())
print("\nBasic Statistics:")
print(df.describe().round(2))
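
Because the dataset is synthetic, the correlations it should produce can be derived from the generating equations. The sketch below does this for GDP growth versus unemployment, assuming the linear form `unemployment = 8 - 0.5 * gdp_growth + noise` used above and ignoring the later clipping step; the sample size and seed are illustrative choices.

```python
import numpy as np

# Parameters copied from the generating equation for unemployment
sd_gdp, slope, sd_noise = 1.2, -0.5, 0.8

# For y = a + b*x + e with noise e independent of x:
#   corr(x, y) = b * sd_x / sqrt(b^2 * sd_x^2 + sd_e^2)
expected_r = slope * sd_gdp / np.sqrt(slope**2 * sd_gdp**2 + sd_noise**2)

# Large simulated sample to check the formula
rng = np.random.default_rng(42)
n = 200_000
gdp = rng.normal(2.5, sd_gdp, n)
unemp = 8 + slope * gdp + rng.normal(0, sd_noise, n)
empirical_r = np.corrcoef(gdp, unemp)[0, 1]

print(f"Expected r: {expected_r:.3f}, simulated r: {empirical_r:.3f}")
```

Comparing such expected values with the matrix computed in the next cell is a quick sanity check on both the generator and the analysis code.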

In [None]:
# 1. PEARSON CORRELATION ANALYSIS
print(" 1. PEARSON CORRELATION ANALYSIS")
print("=" * 50)

# Calculate Pearson correlation matrix
pearson_corr = df.corr(method='pearson')

print("Pearson Correlation Matrix:")
print(pearson_corr.round(3))

# Create enhanced correlation heatmap
fig = make_subplots(
 rows=1, cols=2,
 subplot_titles=("Pearson Correlation Heatmap", "Correlation Strength Distribution"),
 specs=[[{"type": "heatmap"}, {"type": "histogram"}]]
)

# Correlation heatmap
fig.add_trace(
 go.Heatmap(
 z=pearson_corr.values,
 x=pearson_corr.columns,
 y=pearson_corr.columns,
 colorscale='RdBu_r',
 zmid=0,
 text=pearson_corr.round(3).values,
 texttemplate="%{text}",
 textfont={"size": 10},
 hoverongaps=False
 ),
 row=1, col=1
)

# Correlation distribution
corr_values = pearson_corr.values[np.triu_indices_from(pearson_corr.values, k=1)]
fig.add_trace(
 go.Histogram(
 x=corr_values,
 nbinsx=20,
 name="Correlation Distribution",
 marker_color="steelblue",
 opacity=0.7
 ),
 row=1, col=2
)

fig.update_layout(
 title="Pearson Correlation Analysis",
 height=500,
 showlegend=False
)

fig.show()

# Identify strongest correlations
def get_correlation_pairs(corr_matrix, threshold=0.5):
    """Extract variable pairs whose absolute correlation meets the threshold"""
    columns = ['Variable_1', 'Variable_2', 'Correlation', 'Strength']
    pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            corr_val = corr_matrix.iloc[i, j]
            if abs(corr_val) >= threshold:
                pairs.append({
                    'Variable_1': corr_matrix.columns[i],
                    'Variable_2': corr_matrix.columns[j],
                    'Correlation': corr_val,
                    'Strength': 'Strong' if abs(corr_val) >= 0.7 else 'Moderate'
                })
    # Naming the columns up front avoids a KeyError in sort_values when no pair qualifies
    return pd.DataFrame(pairs, columns=columns).sort_values('Correlation', key=abs, ascending=False)

strong_correlations = get_correlation_pairs(pearson_corr, threshold=0.5)
print("\nModerate and strong correlations (|r| >= 0.5):")
print(strong_correlations)
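
For reference, the Pearson coefficient reported above can be reproduced directly from its definition (covariance divided by the product of the standard deviations). The sketch below uses made-up data; `pearson_r` is a hypothetical helper written for illustration, not a function from the notebook.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson r from its definition: co-deviation over the product of deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))


rng = np.random.default_rng(1)  # illustrative seed
x_demo = rng.normal(size=300)
y_demo = 0.7 * x_demo + rng.normal(scale=0.5, size=300)

r_manual = pearson_r(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]
print(f"manual: {r_manual:.6f}, numpy: {r_numpy:.6f}")
```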

In [None]:
# 2. SPEARMAN RANK CORRELATION
print("\n 2. SPEARMAN RANK CORRELATION ANALYSIS")
print("=" * 50)

# Calculate Spearman correlation
spearman_corr = df.corr(method='spearman')

# Compare Pearson vs Spearman
comparison_data = []
for i in range(len(df.columns)):
    for j in range(i+1, len(df.columns)):
        var1, var2 = df.columns[i], df.columns[j]
        pearson_val = pearson_corr.iloc[i, j]
        spearman_val = spearman_corr.iloc[i, j]

        comparison_data.append({
            'Variable_Pair': f"{var1} vs {var2}",
            'Pearson': pearson_val,
            'Spearman': spearman_val,
            'Difference': abs(pearson_val - spearman_val)
        })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Difference', ascending=False)

# Visualize Pearson vs Spearman comparison
fig = make_subplots(
 rows=2, cols=2,
 subplot_titles=("Spearman Correlation Matrix", "Pearson vs Spearman Scatter",
 "Correlation Method Differences", "Non-linear Relationship Example"),
 specs=[[{"type": "heatmap"}, {"type": "scatter"}],
 [{"type": "bar"}, {"type": "scatter"}]]
)

# Spearman heatmap
fig.add_trace(
 go.Heatmap(
 z=spearman_corr.values,
 x=spearman_corr.columns,
 y=spearman_corr.columns,
 colorscale='Viridis',
 text=spearman_corr.round(3).values,
 texttemplate="%{text}",
 textfont={"size": 8}
 ),
 row=1, col=1
)

# Pearson vs Spearman scatter
fig.add_trace(
 go.Scatter(
 x=comparison_df['Pearson'],
 y=comparison_df['Spearman'],
 mode='markers+text',
 text=comparison_df['Variable_Pair'].str[:15] + '...',
 textposition="top center",
 marker=dict(size=8, color=comparison_df['Difference'],
 colorscale='Reds', showscale=True),
 name="Correlations"
 ),
 row=1, col=2
)

# Add diagonal line
fig.add_trace(
 go.Scatter(
 x=[-1, 1], y=[-1, 1],
 mode='lines',
 line=dict(dash='dash', color='gray'),
 name="Perfect Agreement"
 ),
 row=1, col=2
)

# Differences bar chart
top_diffs = comparison_df.head(8)
fig.add_trace(
 go.Bar(
 x=top_diffs['Difference'],
 y=top_diffs['Variable_Pair'],
 orientation='h',
 marker_color='coral',
 name="Correlation Differences"
 ),
 row=2, col=1
)

# Example of non-linear relationship
# Create quadratic relationship for demonstration
x_nonlinear = np.linspace(-3, 3, 100)
y_nonlinear = x_nonlinear**2 + np.random.normal(0, 0.5, 100)

fig.add_trace(
 go.Scatter(
 x=x_nonlinear,
 y=y_nonlinear,
 mode='markers',
 marker=dict(color='purple', size=6),
 name="Quadratic Relationship"
 ),
 row=2, col=2
)

fig.update_layout(
 title="Spearman vs Pearson Correlation Comparison",
 height=800,
 showlegend=True
)

fig.show()

print("Top differences between Pearson and Spearman correlations:")
print(comparison_df.head(5))

# Statistical significance testing
print(f"\n Statistical Significance Testing:")
sample_vars = ['GDP_Growth', 'Unemployment_Rate', 'Consumer_Confidence']
for i in range(len(sample_vars)):
    for j in range(i+1, len(sample_vars)):
        var1, var2 = sample_vars[i], sample_vars[j]

        # Pearson correlation with p-value
        pearson_r, pearson_p = pearsonr(df[var1], df[var2])

        # Spearman correlation with p-value
        spearman_r, spearman_p = spearmanr(df[var1], df[var2])

        print(f"{var1} vs {var2}:")
        print(f" Pearson: r={pearson_r:.3f}, p={pearson_p:.4f}")
        print(f" Spearman: ρ={spearman_r:.3f}, p={spearman_p:.4f}")
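
Spearman's coefficient is Pearson's coefficient applied to the ranks of the data, which is why it captures any monotonic relationship rather than only linear ones. The sketch below illustrates this with a made-up exponential relationship (the data and variable names are assumptions for illustration).

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Strictly increasing but strongly non-linear relationship
x_mono = np.linspace(0, 5, 50)
y_mono = np.exp(x_mono)

r_lin, _ = pearsonr(x_mono, y_mono)
rho, _ = spearmanr(x_mono, y_mono)

# Pearson on the ranks reproduces Spearman's rho
rho_from_ranks, _ = pearsonr(rankdata(x_mono), rankdata(y_mono))

print(f"Pearson r:        {r_lin:.3f}")
print(f"Spearman rho:     {rho:.3f}")
print(f"Pearson on ranks: {rho_from_ranks:.3f}")
```

A large gap between the two coefficients, as in this example, suggests a relationship that is monotonic but curved.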

In [None]:
# 3. KENDALL'S TAU CORRELATION
print("\n 3. KENDALL'S TAU CORRELATION")
print("=" * 50)

# Calculate Kendall's tau
kendall_corr = df.corr(method='kendall')

# Compare all three methods
method_comparison = pd.DataFrame({
 'Variable_Pair': comparison_df['Variable_Pair'],
 'Pearson': comparison_df['Pearson'],
 'Spearman': comparison_df['Spearman'],
 'Kendall': [kendalltau(df[pair.split(' vs ')[0]], df[pair.split(' vs ')[1]])[0]
 for pair in comparison_df['Variable_Pair']]
})

# Create comprehensive comparison plot
fig = make_subplots(
 rows=2, cols=2,
 subplot_titles=("Three Correlation Methods Comparison", "Method Correlations",
 "Kendall's Tau Matrix", "Correlation Method Stability"),
 specs=[[{"type": "scatter"}, {"type": "bar"}],
 [{"type": "heatmap"}, {"type": "box"}]]
)

# Pearson vs Spearman scatter; marker size and color encode Kendall's tau
fig.add_trace(
 go.Scatter(
 x=method_comparison['Pearson'],
 y=method_comparison['Spearman'],
 text=method_comparison['Variable_Pair'],
 mode='markers+text',
 marker=dict(
 size=abs(method_comparison['Kendall']) * 20,
 color=method_comparison['Kendall'],
 colorscale='RdYlBu_r',
 showscale=True,
 colorbar=dict(title="Kendall's Tau")
 ),
 name="All Methods"
 ),
 row=1, col=1
)

# Method correlations bar chart
method_corrs = [
 stats.pearsonr(method_comparison['Pearson'], method_comparison['Spearman'])[0],
 stats.pearsonr(method_comparison['Pearson'], method_comparison['Kendall'])[0],
 stats.pearsonr(method_comparison['Spearman'], method_comparison['Kendall'])[0]
]

fig.add_trace(
 go.Bar(
 x=['Pearson-Spearman', 'Pearson-Kendall', 'Spearman-Kendall'],
 y=method_corrs,
 marker_color=['skyblue', 'lightcoral', 'lightgreen'],
 name="Method Correlations"
 ),
 row=1, col=2
)

# Kendall's tau heatmap
fig.add_trace(
 go.Heatmap(
 z=kendall_corr.values,
 x=kendall_corr.columns,
 y=kendall_corr.columns,
 colorscale='Plasma',
 text=kendall_corr.round(3).values,
 texttemplate="%{text}",
 textfont={"size": 8}
 ),
 row=2, col=1
)

# Box plot of correlation values by method
correlation_values = []
methods = []
for method in ['Pearson', 'Spearman', 'Kendall']:
 values = method_comparison[method].values
 correlation_values.extend(values)
 methods.extend([method] * len(values))

for i, method in enumerate(['Pearson', 'Spearman', 'Kendall']):
 fig.add_trace(
 go.Box(
 y=method_comparison[method],
 name=method,
 boxpoints='outliers',
 marker_color=['blue', 'red', 'green'][i]
 ),
 row=2, col=2
 )

fig.update_layout(
 title="Comprehensive Correlation Method Analysis",
 height=800,
 showlegend=True
)

fig.show()

print("Correlation between different methods:")
print(f"Pearson-Spearman correlation: {method_corrs[0]:.3f}")
print(f"Pearson-Kendall correlation: {method_corrs[1]:.3f}")
print(f"Spearman-Kendall correlation: {method_corrs[2]:.3f}")
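
Kendall's tau is built from counts of concordant and discordant pairs. The sketch below spells out that counting for a tiny made-up sample without ties and compares it with `scipy.stats.kendalltau`; the helper `kendall_tau_no_ties` is hypothetical and does not handle tied values.

```python
from itertools import combinations

from scipy.stats import kendalltau


def kendall_tau_no_ties(x, y):
    """(concordant - discordant) / number of pairs; assumes no tied values."""
    concordant = discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
        sign = (x_i - x_j) * (y_i - y_j)
        if sign > 0:
            concordant += 1   # both variables order the pair the same way
        elif sign < 0:
            discordant += 1   # the variables order the pair in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


x_small = [1, 2, 3, 4, 5]
y_small = [2, 1, 4, 3, 5]

tau_manual = kendall_tau_no_ties(x_small, y_small)
tau_scipy, _ = kendalltau(x_small, y_small)
print(f"manual tau: {tau_manual:.3f}, scipy tau: {tau_scipy:.3f}")
```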

In [None]:
# 4. INTERACTIVE CORRELATION EXPLORER
print("\n 4. INTERACTIVE CORRELATION EXPLORER")
print("=" * 50)

# Create interactive correlation matrix with drill-down capability
def create_interactive_correlation_matrix(dataframe, method='pearson'):
    """Create an interactive correlation matrix with detailed hover information"""

    # Calculate correlation matrix
    corr_matrix = dataframe.corr(method=method)

    # Create mask for upper triangle (optional)
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

    # Create hover text with detailed statistics
    hover_text = []
    for i in range(len(corr_matrix.columns)):
        hover_text.append([])
        for j in range(len(corr_matrix.columns)):
            if i != j:
                var1, var2 = corr_matrix.columns[i], corr_matrix.columns[j]
                corr_val = corr_matrix.iloc[i, j]

                # Calculate additional statistics
                x_data, y_data = dataframe[var1], dataframe[var2]
                r_squared = corr_val ** 2

                hover_info = (
                    f"Variables: {var1} vs {var2}<br>"
                    f"Correlation: {corr_val:.4f}<br>"
                    f"R-squared: {r_squared:.4f}<br>"
                    f"Sample size: {len(x_data)}<br>"
                    f"Strength: {'Strong' if abs(corr_val) >= 0.7 else 'Moderate' if abs(corr_val) >= 0.3 else 'Weak'}"
                )
            else:
                hover_info = f"Variable: {corr_matrix.columns[i]}<br>Perfect correlation: 1.000"

            hover_text[i].append(hover_info)

    # Create the heatmap
    fig = go.Figure(data=go.Heatmap(
        z=corr_matrix.values,
        x=corr_matrix.columns,
        y=corr_matrix.columns,
        colorscale='RdBu_r',
        zmid=0,
        text=corr_matrix.round(3).values,
        texttemplate="%{text}",
        textfont={"size": 10},
        hovertemplate='%{hovertext}<extra></extra>',
        hovertext=hover_text
    ))

    fig.update_layout(
        title=f"Interactive {method.title()} Correlation Matrix<br><sub>Hover for detailed statistics</sub>",
        xaxis_title="Variables",
        yaxis_title="Variables",
        width=800,
        height=600
    )

    return fig

# Create interactive matrices for all methods
for method in ['pearson', 'spearman', 'kendall']:
 fig = create_interactive_correlation_matrix(df, method)
 fig.show()
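
The hover text above reports an "R-squared" value computed as the squared correlation. For Pearson's r this equals the R² of a simple linear regression of one variable on the other; for Spearman and Kendall the squared value does not carry that interpretation. The sketch below checks the Pearson case on made-up data (names and seed are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(7)  # illustrative seed
x_fit = rng.normal(size=400)
y_fit = 2.0 + 1.5 * x_fit + rng.normal(scale=2.0, size=400)

# Squared Pearson correlation
r = np.corrcoef(x_fit, y_fit)[0, 1]

# R^2 of the least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x_fit, y_fit, 1)
residuals = y_fit - (slope * x_fit + intercept)
r2_regression = 1 - np.sum(residuals**2) / np.sum((y_fit - y_fit.mean())**2)

print(f"r^2 = {r**2:.6f}, regression R^2 = {r2_regression:.6f}")
```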

In [None]:
# 5. CORRELATION SIGNIFICANCE AND CONFIDENCE INTERVALS
print("\n 5. CORRELATION SIGNIFICANCE & CONFIDENCE INTERVALS")
print("=" * 50)

def correlation_with_confidence(x, y, method='pearson', confidence=0.95):
    """Calculate a correlation, its p-value, and an approximate confidence interval"""
    n = len(x)

    if method == 'pearson':
        r, p_value = pearsonr(x, y)
    elif method == 'spearman':
        r, p_value = spearmanr(x, y)
    elif method == 'kendall':
        r, p_value = kendalltau(x, y)
    else:
        raise ValueError(f"Unknown correlation method: {method}")

    se = 1 / np.sqrt(n - 3)
    alpha = 1 - confidence
    z_critical = stats.norm.ppf(1 - alpha/2)

    if method == 'pearson' and abs(r) < 0.999:
        # Fisher's z-transformation: symmetric interval on the z scale, mapped back with tanh
        z = np.arctanh(r)
        ci_lower = np.tanh(z - z_critical * se)
        ci_upper = np.tanh(z + z_critical * se)
    else:
        # Rough normal approximation for the rank-based methods (it reuses the
        # Pearson standard error); bounds are clipped to stay inside [-1, 1]
        ci_lower = max(r - z_critical * se, -1.0)
        ci_upper = min(r + z_critical * se, 1.0)

    return {
        'correlation': r,
        'p_value': p_value,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        # Significance level follows the requested confidence (0.05 by default)
        'significant': p_value < alpha
    }

# Analyze key relationships with confidence intervals
key_relationships = [
 ('GDP_Growth', 'Unemployment_Rate'),
 ('GDP_Growth', 'Consumer_Confidence'),
 ('Interest_Rates', 'Housing_Price_Index'),
 ('Tech_Investment', 'Productivity_Index'),
 ('Inflation_Rate', 'Interest_Rates')
]

significance_results = []

for var1, var2 in key_relationships:
    x, y = df[var1], df[var2]

    # Calculate for all three methods
    for method in ['pearson', 'spearman', 'kendall']:
        result = correlation_with_confidence(x, y, method)
        significance_results.append({
            'Variable_1': var1,
            'Variable_2': var2,
            'Method': method.title(),
            'Correlation': result['correlation'],
            'P_Value': result['p_value'],
            'CI_Lower': result['ci_lower'],
            'CI_Upper': result['ci_upper'],
            'Significant': result['significant'],
            'CI_Width': result['ci_upper'] - result['ci_lower']
        })

significance_df = pd.DataFrame(significance_results)

# Visualize confidence intervals
fig = make_subplots(
 rows=2, cols=2,
 subplot_titles=("Correlation Confidence Intervals", "P-Value Distribution",
 "Significance by Method", "Confidence Interval Widths"),
 specs=[[{"type": "scatter"}, {"type": "histogram"}],
 [{"type": "bar"}, {"type": "box"}]]
)

# Confidence interval plot
for i, method in enumerate(['Pearson', 'Spearman', 'Kendall']):
 method_data = significance_df[significance_df['Method'] == method]

 fig.add_trace(
 go.Scatter(
 x=method_data['Correlation'],
 y=list(range(len(method_data))),
 error_x=dict(
 type='data',
 symmetric=False,
 array=method_data['CI_Upper'] - method_data['Correlation'],
 arrayminus=method_data['Correlation'] - method_data['CI_Lower']
 ),
 mode='markers',
 name=f"{method} CI",
 marker=dict(size=8),
 text=method_data['Variable_1'] + ' vs ' + method_data['Variable_2'],
 customdata=method_data[['CI_Lower', 'CI_Upper']].values, # actual CI bounds for the hover label
 hovertemplate="<b>%{text}</b><br>Correlation: %{x:.3f}<br>CI: [%{customdata[0]:.3f}, %{customdata[1]:.3f}]<extra></extra>"
 ),
 row=1, col=1
 )

# P-value distribution
fig.add_trace(
 go.Histogram(
 x=significance_df['P_Value'],
 nbinsx=20,
 name="P-Values",
 marker_color="lightblue",
 opacity=0.7
 ),
 row=1, col=2
)

# Add significance threshold line
fig.add_vline(x=0.05, line_dash="dash", line_color="red", row=1, col=2)

# Significance by method
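# Reindex so the bar order matches the blue/red/green colors used for Pearson/Spearman/Kendall
sig_by_method = significance_df.groupby('Method')['Significant'].sum().reindex(['Pearson', 'Spearman', 'Kendall'])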
fig.add_trace(
 go.Bar(
 x=sig_by_method.index,
 y=sig_by_method.values,
 marker_color=['blue', 'red', 'green'],
 name="Significant Correlations"
 ),
 row=2, col=1
)

# CI width by method
for method in ['Pearson', 'Spearman', 'Kendall']:
 method_data = significance_df[significance_df['Method'] == method]
 fig.add_trace(
 go.Box(
 y=method_data['CI_Width'],
 name=method,
 boxpoints='outliers'
 ),
 row=2, col=2
 )

fig.update_layout(
 title="Correlation Statistical Significance Analysis",
 height=800,
 showlegend=True
)

fig.show()

# Print significance summary
print(" Significance Summary:")
print(significance_df.groupby(['Method', 'Significant']).size().unstack(fill_value=0))

print(f"\n Key Findings:")
significant_strong = significance_df[
 (significance_df['Significant']) &
 (abs(significance_df['Correlation']) > 0.5)
]
print(f"• {len(significant_strong)} relationships show strong significant correlations")
print(f"• Average p-value: {significance_df['P_Value'].mean():.4f}")
print(f"• Most precise method (narrowest CI): {significance_df.groupby('Method')['CI_Width'].mean().idxmin()}")
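
To make the Fisher z-transformation used above concrete, the sketch below computes the interval for a fixed correlation at two sample sizes. The helper `fisher_ci` and the example values (r = 0.5, n = 100 and n = 1000) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def fisher_ci(r, n, confidence=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z-transformation."""
    z = np.arctanh(r)                     # map r to the z scale
    se = 1.0 / np.sqrt(n - 3)             # standard error on the z scale
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    return float(np.tanh(z - z_crit * se)), float(np.tanh(z + z_crit * se))


ci_small = fisher_ci(0.5, 100)
ci_large = fisher_ci(0.5, 1000)

print(f"n=100:  [{ci_small[0]:.3f}, {ci_small[1]:.3f}]")
print(f"n=1000: [{ci_large[0]:.3f}, {ci_large[1]:.3f}]")
```

The interval is symmetric on the z scale but not around r itself, and it narrows as the sample size grows.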

In [None]:
# 6. PARTIAL CORRELATION ANALYSIS
print("\n 6. PARTIAL CORRELATION ANALYSIS")
print("=" * 50)

from sklearn.linear_model import LinearRegression

def partial_correlation(data, x_var, y_var, control_vars):
 """
 Calculate partial correlation between x_var and y_var, controlling for control_vars
 """
 # Fit linear regression to remove effect of control variables
 X_control = data[control_vars].values

 # Residuals after removing control variable effects
 reg_x = LinearRegression().fit(X_control, data[x_var])
 residuals_x = data[x_var] - reg_x.predict(X_control)

 reg_y = LinearRegression().fit(X_control, data[y_var])
 residuals_y = data[y_var] - reg_y.predict(X_control)

 # Correlation of residuals is the partial correlation
 partial_corr = np.corrcoef(residuals_x, residuals_y)[0, 1]

 return partial_corr, residuals_x, residuals_y

# Each entry: (x variable, y variable, list of control variables)
target_relationships = [
 ('GDP_Growth', 'Consumer_Confidence', ['Unemployment_Rate']),
 ('Stock_Returns', 'Consumer_Confidence', ['GDP_Growth', 'Inflation_Rate']),
 ('Housing_Price_Index', 'Interest_Rates', ['GDP_Growth', 'Inflation_Rate']),
 ('Tech_Investment', 'Productivity_Index', ['GDP_Growth'])
]

partial_results = []

fig = make_subplots(
 rows=2, cols=2,
 subplot_titles=[f"{x} vs {y}<br>Controlling for {', '.join(controls)}"
 for x, y, controls in target_relationships],
 specs=[[{"type": "scatter"}, {"type": "scatter"}],
 [{"type": "scatter"}, {"type": "scatter"}]]
)

for idx, (x_var, y_var, control_vars) in enumerate(target_relationships):
 # Calculate simple correlation
 simple_corr = df[x_var].corr(df[y_var])

 # Calculate partial correlation
 partial_corr, residuals_x, residuals_y = partial_correlation(df, x_var, y_var, control_vars)

 partial_results.append({
 'X_Variable': x_var,
 'Y_Variable': y_var,
 'Control_Variables': ', '.join(control_vars),
 'Simple_Correlation': simple_corr,
 'Partial_Correlation': partial_corr,
 'Difference': simple_corr - partial_corr,
 'Control_Effect': abs(simple_corr - partial_corr)
 })

 # Plot partial correlation (residuals)
 row = (idx // 2) + 1
 col = (idx % 2) + 1

 fig.add_trace(
 go.Scatter(
 x=residuals_x,
 y=residuals_y,
 mode='markers',
 marker=dict(size=4, opacity=0.6),
 name=f"Partial r={partial_corr:.3f}",
 text=f"Simple r={simple_corr:.3f}<br>Partial r={partial_corr:.3f}",
 hovertemplate="<b>Partial Correlation</b><br>%{text}<br>X residual: %{x:.2f}<br>Y residual: %{y:.2f}<extra></extra>"
 ),
 row=row, col=col
 )

 # Add trend line for partial correlation
 z = np.polyfit(residuals_x, residuals_y, 1)
 x_trend = np.linspace(residuals_x.min(), residuals_x.max(), 100)
 y_trend = np.poly1d(z)(x_trend)

 fig.add_trace(
 go.Scatter(
 x=x_trend,
 y=y_trend,
 mode='lines',
 line=dict(dash='dash', color='red'),
 name=f"Trend (r={partial_corr:.3f})",
 showlegend=False
 ),
 row=row, col=col
 )

fig.update_layout(
 title="Partial Correlation Analysis - Residual Plots",
 height=600,
 showlegend=True
)

fig.show()

# Display partial correlation results
partial_df = pd.DataFrame(partial_results)
print(" Partial Correlation Results:")
print(partial_df.round(3))

# Visualize the comparison
fig2 = go.Figure()

x_pos = list(range(len(partial_df)))
fig2.add_trace(go.Bar(
 x=x_pos,
 y=partial_df['Simple_Correlation'],
 name='Simple Correlation',
 marker_color='lightblue',
 text=partial_df['Simple_Correlation'].round(3),
 textposition='auto'
))

fig2.add_trace(go.Bar(
 x=x_pos,
 y=partial_df['Partial_Correlation'],
 name='Partial Correlation',
 marker_color='lightcoral',
 text=partial_df['Partial_Correlation'].round(3),
 textposition='auto'
))

fig2.update_layout(
 title="Simple vs Partial Correlations Comparison",
 xaxis_title="Variable Pairs",
 yaxis_title="Correlation Coefficient",
 xaxis=dict(
 tickmode='array',
 tickvals=x_pos,
 ticktext=[f"{row.X_Variable} vs<br>{row.Y_Variable}" for _, row in partial_df.iterrows()]
 ),
 barmode='group',
 height=500
)

fig2.show()

print(f"\n Partial Correlation Insights:")
print(f"• Largest control effect: {partial_df.loc[partial_df['Control_Effect'].idxmax(), 'X_Variable']} vs {partial_df.loc[partial_df['Control_Effect'].idxmax(), 'Y_Variable']}")
print(f"• Average control effect: {partial_df['Control_Effect'].mean():.3f}")
print(f"• Cases where partial > simple: {sum(partial_df['Partial_Correlation'] > partial_df['Simple_Correlation'])}")
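
With a single control variable, the residual-based partial correlation used above has a closed form in terms of the three pairwise correlations. The sketch below compares the two on made-up data in which x and y are related only through a shared driver z; the function and variable names are assumptions for illustration.

```python
import numpy as np


def partial_corr_formula(x, y, z):
    """First-order partial correlation from the three pairwise Pearson correlations."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def partial_corr_residuals(x, y, z):
    """Same quantity via residuals of least-squares fits of x and y on z."""
    res_x = x - np.polyval(np.polyfit(z, x, 1), z)
    res_y = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(res_x, res_y)[0, 1]


rng = np.random.default_rng(3)  # illustrative seed
z_conf = rng.normal(size=1000)
x_obs = 0.8 * z_conf + rng.normal(size=1000)
y_obs = -0.6 * z_conf + rng.normal(size=1000)

simple_r = np.corrcoef(x_obs, y_obs)[0, 1]
partial_f = partial_corr_formula(x_obs, y_obs, z_conf)
partial_r = partial_corr_residuals(x_obs, y_obs, z_conf)
print(f"simple r = {simple_r:.3f}, partial (formula) = {partial_f:.3f}, partial (residuals) = {partial_r:.3f}")
```

Here the simple correlation is clearly negative, while the partial correlation is close to zero once the shared driver is controlled for.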

In [None]:
# 7. SPURIOUS CORRELATION DETECTION
print("\n 7. SPURIOUS CORRELATION DETECTION")
print("=" * 50)

# Create examples of potentially spurious correlations
np.random.seed(123)

# Time series that might show spurious correlation
time = np.arange(100)
trend1 = 2 * time + np.random.normal(0, 10, 100) # Linear trend
trend2 = 1.5 * time + np.random.normal(0, 8, 100) # Similar trend
random1 = np.cumsum(np.random.normal(0, 1, 100)) # Random walk
random2 = np.cumsum(np.random.normal(0, 1, 100)) # Another random walk

# Add some seasonal patterns
seasonal1 = 20 * np.sin(2 * np.pi * time / 12) + trend1
seasonal2 = 15 * np.cos(2 * np.pi * time / 12) + trend2

spurious_data = pd.DataFrame({
 'Time': time,
 'Trend_Series_1': trend1,
 'Trend_Series_2': trend2,
 'Random_Walk_1': random1,
 'Random_Walk_2': random2,
 'Seasonal_1': seasonal1,
 'Seasonal_2': seasonal2
})

# Calculate correlations
spurious_corr = spurious_data.drop('Time', axis=1).corr()

fig = make_subplots(
 rows=2, cols=3,
 subplot_titles=("Trending Series", "Random Walks", "Seasonal Series",
 "Detrended Series", "Differenced Random Walks", "Correlation Matrix"),
 specs=[[{"type": "scatter"}, {"type": "scatter"}, {"type": "scatter"}],
 [{"type": "scatter"}, {"type": "scatter"}, {"type": "heatmap"}]]
)

# Plot original series
series_pairs = [
 ('Trend_Series_1', 'Trend_Series_2', 1, 1),
 ('Random_Walk_1', 'Random_Walk_2', 1, 2),
 ('Seasonal_1', 'Seasonal_2', 1, 3)
]

for var1, var2, row, col in series_pairs:
 # Original series
 fig.add_trace(
 go.Scatter(
 x=spurious_data[var1],
 y=spurious_data[var2],
 mode='markers',
 marker=dict(size=4, opacity=0.7),
 name=f"{var1} vs {var2}",
 text=f"r = {spurious_data[var1].corr(spurious_data[var2]):.3f}",
 hovertemplate="<b>%{text}</b><br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>"
 ),
 row=row, col=col
 )

# Detrending and differencing to remove spurious correlations
from scipy import signal

# Detrend the trending series
detrended_1 = signal.detrend(spurious_data['Trend_Series_1'])
detrended_2 = signal.detrend(spurious_data['Trend_Series_2'])

fig.add_trace(
 go.Scatter(
 x=detrended_1,
 y=detrended_2,
 mode='markers',
 marker=dict(size=4, color='red', opacity=0.7),
 name=f"Detrended r={np.corrcoef(detrended_1, detrended_2)[0,1]:.3f}",
 text=f"Detrended r = {np.corrcoef(detrended_1, detrended_2)[0,1]:.3f}",
 hovertemplate="<b>%{text}</b><br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>"
 ),
 row=2, col=1
)

# Difference the random walks
diff_random1 = np.diff(spurious_data['Random_Walk_1'])
diff_random2 = np.diff(spurious_data['Random_Walk_2'])

fig.add_trace(
 go.Scatter(
 x=diff_random1,
 y=diff_random2,
 mode='markers',
 marker=dict(size=4, color='green', opacity=0.7),
 name=f"Differenced r={np.corrcoef(diff_random1, diff_random2)[0,1]:.3f}",
 text=f"Differenced r = {np.corrcoef(diff_random1, diff_random2)[0,1]:.3f}",
 hovertemplate="<b>%{text}</b><br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>"
 ),
 row=2, col=2
)

# Correlation matrix
fig.add_trace(
 go.Heatmap(
 z=spurious_corr.values,
 x=spurious_corr.columns,
 y=spurious_corr.columns,
 colorscale='RdBu_r',
 zmid=0,
 text=spurious_corr.round(3).values,
 texttemplate="%{text}",
 textfont={"size": 8}
 ),
 row=2, col=3
)

fig.update_layout(
 title="Spurious Correlation Detection and Correction",
 height=600,
 showlegend=True
)

fig.show()

# Spurious correlation tests
print(" Spurious Correlation Analysis:")
original_corrs = {
 'Trending Series': spurious_data['Trend_Series_1'].corr(spurious_data['Trend_Series_2']),
 'Random Walks': spurious_data['Random_Walk_1'].corr(spurious_data['Random_Walk_2']),
 'Seasonal Series': spurious_data['Seasonal_1'].corr(spurious_data['Seasonal_2'])
}

# signal.detrend removes only a linear trend, so the seasonal cycle itself remains in the last entry
corrected_corrs = {
 'Detrended Series': np.corrcoef(detrended_1, detrended_2)[0,1],
 'Differenced Random Walks': np.corrcoef(diff_random1, diff_random2)[0,1],
 'Detrended Seasonal Series': np.corrcoef(
 signal.detrend(spurious_data['Seasonal_1']),
 signal.detrend(spurious_data['Seasonal_2'])
 )[0,1]
}

print("Original correlations (potentially spurious):")
for name, corr in original_corrs.items():
    print(f"• {name}: {corr:.3f}")

print("\nAfter removing trends/patterns:")
for name, corr in corrected_corrs.items():
    print(f"• {name}: {corr:.3f}")
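
A single pair of random walks, as above, may or may not show a large correlation. The sketch below repeats the experiment many times to show how often independent random walks appear correlated in levels compared with their increments. The number of simulations, series length, 0.5 cutoff, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(123)  # illustrative seed
n_sims, n_steps = 500, 100

level_r = np.empty(n_sims)
diff_r = np.empty(n_sims)
for k in range(n_sims):
    # Two independent sets of increments; their cumulative sums are random walks
    steps_a = rng.normal(size=n_steps)
    steps_b = rng.normal(size=n_steps)
    level_r[k] = np.corrcoef(np.cumsum(steps_a), np.cumsum(steps_b))[0, 1]
    # Correlating the increments is equivalent to differencing the walks first
    diff_r[k] = np.corrcoef(steps_a, steps_b)[0, 1]

frac_levels = float(np.mean(np.abs(level_r) > 0.5))
frac_diffs = float(np.mean(np.abs(diff_r) > 0.5))
print(f"Share of |r| > 0.5 in levels:      {frac_levels:.2%}")
print(f"Share of |r| > 0.5 in differences: {frac_diffs:.2%}")
```

Large level correlations between unrelated walks are common, which is why differencing before correlating is the usual precaution for non-stationary series.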

In [None]:
# 8. BUSINESS INSIGHTS AND INTERPRETATION
print("\n 8. BUSINESS INSIGHTS & INTERPRETATION")
print("=" * 50)

# Create business-focused correlation insights
business_insights = []

# Define business-relevant thresholds and interpretations
def interpret_correlation(r, var1, var2):
    """Provide business interpretation of correlation"""
    abs_r = abs(r)

    if abs_r >= 0.8:
        strength = "Very Strong"
    elif abs_r >= 0.6:
        strength = "Strong"
    elif abs_r >= 0.4:
        strength = "Moderate"
    elif abs_r >= 0.2:
        strength = "Weak"
    else:
        strength = "Very Weak"
    direction = "Positive" if r > 0 else "Negative"

    # Business implications
    if abs_r >= 0.6:
        reliability = "High - suitable for strategic decisions"
        action = "Primary focus for policy/strategy"
    elif abs_r >= 0.4:
        reliability = "Moderate - consider in planning"
        action = "Secondary consideration"
    else:
        reliability = "Low - monitor but don't rely on"
        action = "Investigate other factors"

    return {
        'strength': strength,
        'direction': direction,
        'reliability': reliability,
        'action': action,
        'business_impact': get_business_meaning(var1, var2, r)
    }

def get_business_meaning(var1, var2, correlation):
    """Generate business-specific interpretations"""
    relationships = {
        ('GDP_Growth', 'Unemployment_Rate'): "Economic health indicator - inverse relationship expected",
        ('GDP_Growth', 'Consumer_Confidence'): "Economic sentiment alignment - positive relationship",
        ('Interest_Rates', 'Housing_Price_Index'): "Monetary policy impact on real estate",
        ('Tech_Investment', 'Productivity_Index'): "Innovation-productivity nexus",
        ('Inflation_Rate', 'Interest_Rates'): "Central bank policy response mechanism"
    }

    key = (var1, var2) if (var1, var2) in relationships else (var2, var1)
    base_meaning = relationships.get(key, "Economic relationship")

    if abs(correlation) >= 0.6:
        strength_comment = "Strong predictive relationship"
    elif abs(correlation) >= 0.4:
        strength_comment = "Moderate predictive power"
    else:
        strength_comment = "Limited predictive value"

    return f"{base_meaning}. {strength_comment}."

# Analyze key business relationships
key_business_pairs = [
    ('GDP_Growth', 'Unemployment_Rate'),
    ('GDP_Growth', 'Consumer_Confidence'),
    ('Interest_Rates', 'Housing_Price_Index'),
    ('Tech_Investment', 'Productivity_Index'),
    ('Inflation_Rate', 'Interest_Rates'),
    ('Consumer_Confidence', 'Retail_Sales_Index'),
    ('Stock_Returns', 'Consumer_Confidence')
]

for var1, var2 in key_business_pairs:
    correlation = df[var1].corr(df[var2])
    interpretation = interpret_correlation(correlation, var1, var2)

    business_insights.append({
        'Relationship': f"{var1} ↔ {var2}",
        'Correlation': correlation,
        'Strength': interpretation['strength'],
        'Direction': interpretation['direction'],
        'Business_Reliability': interpretation['reliability'],
        'Recommended_Action': interpretation['action'],
        'Business_Meaning': interpretation['business_impact']
    })

insights_df = pd.DataFrame(business_insights)

# Create business dashboard
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=("Correlation Strength Distribution", "Business Action Matrix",
                    "Reliability vs Correlation", "Key Economic Relationships"),
    specs=[[{"type": "histogram"}, {"type": "scatter"}],
           [{"type": "scatter"}, {"type": "bar"}]]
)

# Correlation strength distribution
fig.add_trace(
    go.Histogram(
        x=abs(insights_df['Correlation']),
        nbinsx=15,
        name="Correlation Strengths",
        marker_color="lightblue",
        opacity=0.7
    ),
    row=1, col=1
)

# Business action matrix
action_colors = {
    'Primary focus for policy/strategy': 'red',
    'Secondary consideration': 'orange',
    'Investigate other factors': 'gray'
}

fig.add_trace(
    go.Scatter(
        x=insights_df['Correlation'],
        # Pass a plain list rather than a range object as trace data
        y=list(range(len(insights_df))),
        mode='markers+text',
        marker=dict(
            size=abs(insights_df['Correlation']) * 30,
            color=[action_colors[action] for action in insights_df['Recommended_Action']],
            opacity=0.8
        ),
        text=insights_df['Relationship'],
        textposition="middle right",
        name="Action Priority",
        hovertemplate="<b>%{text}</b><br>Correlation: %{x:.3f}<br>Action: %{customdata}<extra></extra>",
        customdata=insights_df['Recommended_Action']
    ),
    row=1, col=2
)

# Reliability analysis (correlation vs strength)
strength_numeric = {
    'Very Strong': 5, 'Strong': 4, 'Moderate': 3, 'Weak': 2, 'Very Weak': 1
}

fig.add_trace(
    go.Scatter(
        x=abs(insights_df['Correlation']),
        y=[strength_numeric[s] for s in insights_df['Strength']],
        mode='markers',
        marker=dict(size=10, opacity=0.7, color='purple'),
        text=insights_df['Relationship'],
        name="Strength Assessment",
        hovertemplate="<b>%{text}</b><br>|Correlation|: %{x:.3f}<br>Strength: %{customdata}<extra></extra>",
        customdata=insights_df['Strength']
    ),
    row=2, col=1
)

# Key relationships bar chart
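# Rank by absolute correlation so strong negative relationships are not pushed to the bottom
top_relationships = insights_df.loc[
    insights_df['Correlation'].abs().sort_values(ascending=False).index
].head(7)
fig.add_trace(
    go.Bar(
        x=top_relationships['Correlation'],
        # Pass a plain list rather than a range object as trace data
        y=list(range(len(top_relationships))),
        orientation='h',
        marker_color='steelblue',
        text=top_relationships['Relationship'],
        textposition='auto',
        name="Top Correlations"
    ),
    row=2, col=2
)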

fig.update_layout(
    title="Business Correlation Analysis Dashboard",
    height=700,
    showlegend=True
)

fig.show()

print("Business Insights Summary:")
print(insights_df[['Relationship', 'Correlation', 'Strength', 'Recommended_Action']].round(3))

print("\nStrategic Recommendations:")
high_priority = insights_df[insights_df['Recommended_Action'] == 'Primary focus for policy/strategy']
print(f"• {len(high_priority)} relationships require primary strategic focus")
# Report the signed coefficient of the strongest relationship, not its absolute value
strongest = insights_df.loc[insights_df['Correlation'].abs().idxmax()]
print(f"• Strongest relationship: {strongest['Relationship']} (r={strongest['Correlation']:.3f})")

moderate_priority = insights_df[insights_df['Recommended_Action'] == 'Secondary consideration']
print(f"• {len(moderate_priority)} relationships are secondary considerations")

low_priority = insights_df[insights_df['Recommended_Action'] == 'Investigate other factors']
print(f"• {len(low_priority)} relationships need further investigation")

print("\nBusiness Intelligence:")
for _, row in high_priority.iterrows():
    print(f"• {row['Relationship']}: {row['Business_Meaning']}")

# LEARNING SUMMARY: Correlation Analysis

## Key Concepts Mastered

### 1. **Correlation Methods**
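- **Pearson Correlation**: Measures the strength of linear relationships; its significance test assumes approximately bivariate-normal data, and the coefficient is sensitive to outliers
- **Spearman Correlation**: Rank-based and non-parametric; captures monotonic relationships, including non-linear ones
- **Kendall's Tau**: Also rank-based, built on concordant and discordant pairs; generally somewhat more robust than Spearman and well suited to small samples or data with many ties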
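
The difference between the three methods is easiest to see on data that is monotonic but not linear. The following is a minimal sketch on synthetic data (the exponential relationship, sample size, and seed are illustrative assumptions, not taken from the notebook's dataset), using the standard `scipy.stats` functions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative data: y increases with x, but exponentially rather than linearly
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(0, 1.0, size=200)

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_r, spearman_p = stats.spearmanr(x, y)
kendall_tau, kendall_p = stats.kendalltau(x, y)

print(f"Pearson  r   = {pearson_r:.3f} (p = {pearson_p:.2e})")
print(f"Spearman rho = {spearman_r:.3f} (p = {spearman_p:.2e})")
print(f"Kendall  tau = {kendall_tau:.3f} (p = {kendall_p:.2e})")
```

Because the rank-based methods only look at ordering, Spearman should come out higher than Pearson here; Kendall's tau is on a different scale and is usually smaller in magnitude than Spearman for the same data.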

### 2. **Statistical Significance**
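- P-values give the probability of observing a correlation at least as extreme as the sample value if the true correlation were zero; a small p-value indicates statistical significance, not a strong relationship
- Confidence intervals provide a range of plausible values for the true correlation
- Sample size affects both: larger samples narrow confidence intervals and can make even weak correlations statistically significant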
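
As a rough illustration of how sample size drives precision, the sketch below computes a Pearson correlation with an approximate confidence interval via the Fisher z-transformation. The helper name `pearson_with_ci` and the synthetic data are assumptions made for this example; the interval is an approximation that presumes roughly bivariate-normal data.

```python
import numpy as np
from scipy import stats

def pearson_with_ci(x, y, alpha=0.05):
    """Pearson r, its p-value, and an approximate (1 - alpha) CI via Fisher's z."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r, p = stats.pearsonr(x, y)
    z = np.arctanh(r)                      # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)              # approximate standard error of z
    z_crit = stats.norm.ppf(1 - alpha / 2)
    lower = np.tanh(z - z_crit * se)       # back-transform to the r scale
    upper = np.tanh(z + z_crit * se)
    return r, p, (lower, upper)

rng = np.random.default_rng(0)
ci_widths = []
for n in (20, 200, 2000):
    # Illustrative data with a weak positive linear relationship
    x = rng.normal(size=n)
    y = 0.3 * x + rng.normal(size=n)
    r, p, (lower, upper) = pearson_with_ci(x, y)
    ci_widths.append(upper - lower)
    print(f"n={n:5d}  r={r:.3f}  p={p:.4f}  95% CI=({lower:.3f}, {upper:.3f})")
```

The interval width shrinks as `n` grows, while the underlying relationship stays the same weak one throughout.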

### 3. **Advanced Techniques**
- **Partial Correlation**: Controls for confounding variables
- **Spurious Correlation**: Relationships that appear meaningful but aren't causal
- **Detrending**: Removes artificial correlations due to common trends
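
The three ideas above can be sketched together on synthetic data. In this example (all variable names, coefficients, and the helper `partial_corr` are illustrative assumptions), partial correlation is computed by correlating regression residuals, and detrending is done by first-differencing, which is one common option among several:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(1)
n = 500

# Confounding: x and y are both driven by z but not by each other
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = z + rng.normal(scale=0.5, size=n)
raw_r = np.corrcoef(x, y)[0, 1]
partial_r = partial_corr(x, y, z)

# Spurious correlation: two otherwise independent series share a time trend
t = np.arange(n)
a = 0.05 * t + rng.normal(size=n)
b = 0.03 * t + rng.normal(size=n)
trend_r = np.corrcoef(a, b)[0, 1]
detrended_r = np.corrcoef(np.diff(a), np.diff(b))[0, 1]

print(f"x vs y: raw r = {raw_r:.3f}, partial r given z = {partial_r:.3f}")
print(f"a vs b: raw r = {trend_r:.3f}, after differencing r = {detrended_r:.3f}")
```

In both cases the raw correlation is high, and it should fall to near zero once the common driver (the confounder `z` or the shared trend) is removed.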

## Business Applications

### Strategic Decision Making
- **Strong correlations (|r| ≥ 0.6)**: Primary focus for strategic planning
- **Moderate correlations (0.3 ≤ |r| < 0.6)**: Secondary considerations
- **Weak correlations (|r| < 0.3)**: Monitor, but investigate other factors
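
These bands can be expressed as a small helper. The sketch below is one possible encoding: the function name `priority_from_correlation` is hypothetical, and assigning the boundary values 0.3 and 0.6 to the higher band is an assumption made for the example.

```python
def priority_from_correlation(r, strong=0.6, moderate=0.3):
    """Map a correlation coefficient to a strategic priority label.

    The sign is ignored: a strong negative relationship matters as much
    as a strong positive one. Boundary values fall into the higher band.
    """
    magnitude = abs(r)
    if magnitude >= strong:
        return "Primary focus for policy/strategy"
    if magnitude >= moderate:
        return "Secondary consideration"
    return "Investigate other factors"

for r in (-0.75, 0.45, 0.10):
    print(f"r = {r:+.2f} -> {priority_from_correlation(r)}")
```

Keeping the thresholds as parameters makes it straightforward to adjust them to the conventions of a particular domain.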

### Risk Management
- Understanding correlation helps in:
 - Portfolio diversification
 - Risk factor identification
 - Scenario planning
 - Economic forecasting

## Next Steps

1. **Tier 2: Predictive Models** - Use correlation insights for feature selection
2. **Causality Analysis** - Move beyond correlation to understand causation
3. **Multivariate Analysis** - Explore relationships among multiple variables simultaneously
4. **Time Series Analysis** - Understand how correlations change over time
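
As a small preview of the time series step, a rolling correlation shows how a relationship can change over time even when the full-sample correlation looks negligible. This is a sketch on synthetic data; the column names, window length, and regime change are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 400

x = rng.normal(size=n)
noise = rng.normal(size=n)
# First half: y moves with x; second half: y moves against x
sign = np.where(np.arange(n) < n // 2, 1.0, -1.0)
y = 0.8 * sign * x + 0.6 * noise

df_demo = pd.DataFrame({"x": x, "y": y})

overall_r = df_demo["x"].corr(df_demo["y"])
rolling_r = df_demo["x"].rolling(window=60).corr(df_demo["y"])

print(f"Full-sample correlation: {overall_r:.3f}")
print(f"Rolling correlation at row 150: {rolling_r.iloc[150]:.3f}")
print(f"Rolling correlation at last row: {rolling_r.iloc[-1]:.3f}")
```

The two regimes cancel out in the full-sample figure, which is why a single coefficient can hide a relationship that is strong in each period.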

## Pro Tips

- Always check for outliers that might inflate correlations
- Consider non-linear relationships that correlation might miss
- Use domain knowledge to interpret correlation strength
- Be cautious of spurious correlations in time series data
- Combine multiple correlation methods for robust analysis
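
The first two tips can be checked quickly in code. The sketch below uses synthetic data (the outlier position, sample sizes, and seed are assumptions for illustration) to show a single extreme point inflating Pearson's r and a perfectly deterministic quadratic relationship producing a Pearson correlation of essentially zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Outlier effect: 50 unrelated points, then one extreme point added
x = rng.normal(size=50)
y = rng.normal(size=50)
r_clean = stats.pearsonr(x, y)[0]

x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_outlier = stats.pearsonr(x_out, y_out)[0]
rho_outlier = stats.spearmanr(x_out, y_out)[0]

# Non-linear effect: v is fully determined by u, but not linearly
u = np.linspace(-3, 3, 101)
v = u ** 2
r_quadratic = stats.pearsonr(u, v)[0]

print(f"Pearson without outlier:  {r_clean:.3f}")
print(f"Pearson with outlier:     {r_outlier:.3f}")
print(f"Spearman with outlier:    {rho_outlier:.3f}")
print(f"Pearson for v = u^2:      {r_quadratic:.3f}")
```

Comparing Pearson against a rank-based coefficient, and plotting the data, are simple ways to catch both situations.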

**Remember**: *Correlation does not imply causation, but it's the first step in understanding relationships in your data!*