# Tier 1: Scatter Plot Analysis

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** f17a642e-2b48-426d-849d-9e76f40f1d53

---

## Citation
Brandon Deloatch, "Tier 1: Scatter Plot Analysis," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** f17a642e-2b48-426d-849d-9e76f40f1d53
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
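
The auto-tracking note can be made concrete with a small helper. The following is a minimal sketch using only the standard library; the function name `capture_execution_metadata` and the default package list are illustrative assumptions, not part of the original notebook.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("pandas", "numpy", "scikit-learn", "plotly", "scipy")):
    """Collect a small, JSON-serializable record of the runtime environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "package_versions": versions,
    }


execution_metadata = capture_execution_metadata()
print(json.dumps(execution_metadata, indent=2))
```

Storing this record next to the notebook outputs makes it easier to tell which library versions produced a given figure.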

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

print(" Tier 1: Scatter Plot Analysis - Libraries Loaded Successfully!")
print("=" * 65)
print("Available Scatter Plot Techniques:")
print("• Basic Scatter Plots - Simple X vs Y relationships")
print("• Color-Coded Scatter - Third dimension via color encoding")
print("• Size-Coded Scatter - Fourth dimension via marker size")
print("• Interactive Scatter - Zoom, pan, hover capabilities")
print("• Regression Lines - Linear, polynomial, and confidence bands")
print("• Outlier Detection - Statistical identification of anomalies")
print("• Animated Scatter - Time-based relationship evolution")

In [None]:
# Generate Comprehensive Business Dataset
np.random.seed(42)

def generate_business_relationships_dataset(n_samples=1000):
 """Generate realistic business dataset with various relationship types"""

 # Marketing and Sales relationships
 marketing_spend = np.random.exponential(scale=5000, size=n_samples) + 1000
 sales_base = marketing_spend * 1.8 + np.random.normal(0, 2000, n_samples)
 sales_revenue = np.maximum(sales_base, 5000) # Ensure positive sales

 # Customer metrics
 customer_acquisition_cost = marketing_spend / (sales_revenue / 10000) + np.random.normal(0, 50, n_samples)
 customer_satisfaction = 8.5 - (customer_acquisition_cost - 200) / 50 + np.random.normal(0, 0.8, n_samples)
 customer_satisfaction = np.clip(customer_satisfaction, 1, 10)

 # Product and pricing
 product_price = np.random.normal(100, 25, n_samples)
 demand = 1000 - 5 * product_price + np.random.normal(0, 100, n_samples)
 demand = np.maximum(demand, 50) # Minimum demand

 # Employee and productivity metrics
 employee_count = np.random.poisson(lam=25, size=n_samples) + 5
 revenue_per_employee = sales_revenue / employee_count + np.random.normal(0, 5000, n_samples)

 # Time-based seasonal effect
 time_period = np.arange(n_samples)
 seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * time_period / 250) # Quarterly cycles
 sales_revenue = sales_revenue * seasonal_factor

 # Geographic and categorical variables
 regions = np.random.choice(['North', 'South', 'East', 'West'], n_samples, p=[0.3, 0.25, 0.25, 0.2])
 business_types = np.random.choice(['B2B', 'B2C', 'B2B2C'], n_samples, p=[0.4, 0.5, 0.1])
 company_size = np.random.choice(['Startup', 'Small', 'Medium', 'Large'], n_samples, p=[0.2, 0.3, 0.3, 0.2])

 # Create quadratic relationship example
 # Concave quadratic: awareness rises with budget and flattens near the top of
 # the budget range, so values stay inside the 0-100 scale before clipping
 advertising_budget = np.random.uniform(1000, 20000, n_samples)
 brand_awareness = (
 0.008 * advertising_budget -
 2e-7 * advertising_budget**2 + # Diminishing returns
 np.random.normal(0, 5, n_samples)
 )
 brand_awareness = np.clip(brand_awareness, 0, 100)

 return pd.DataFrame({
 'marketing_spend': marketing_spend,
 'sales_revenue': sales_revenue,
 'customer_acq_cost': customer_acquisition_cost,
 'customer_satisfaction': customer_satisfaction,
 'product_price': product_price,
 'demand': demand,
 'employee_count': employee_count,
 'revenue_per_employee': revenue_per_employee,
 'advertising_budget': advertising_budget,
 'brand_awareness': brand_awareness,
 'region': regions,
 'business_type': business_types,
 'company_size': company_size,
 'time_period': time_period
 })

# Generate dataset
print(" Generating business relationships dataset...")
df = generate_business_relationships_dataset(1000)
print(f"Dataset Shape: {df.shape}")
print("\nDataset Overview:")
print(df.head())
print("\nBasic Statistics:")
print(df.describe().round(2))

In [None]:
# 1. BASIC SCATTER PLOT ANALYSIS
print(" 1. BASIC SCATTER PLOT ANALYSIS")
print("=" * 35)

# Create basic scatter plots for key relationships
fig = make_subplots(
 rows=2, cols=2,
 subplot_titles=(
 "Marketing Spend vs Sales Revenue",
 "Product Price vs Demand",
 "Customer Acquisition Cost vs Satisfaction",
 "Advertising Budget vs Brand Awareness"
 ),
 specs=[[{"secondary_y": False}, {"secondary_y": False}],
 [{"secondary_y": False}, {"secondary_y": False}]]
)

# Relationship 1: Marketing Spend vs Sales Revenue (Linear)
fig.add_trace(
 go.Scatter(
 x=df['marketing_spend'],
 y=df['sales_revenue'],
 mode='markers',
 marker=dict(size=6, opacity=0.7, color='blue'),
 name='Marketing-Sales',
 hovertemplate="Marketing: $%{x:,.0f}<br>Sales: $%{y:,.0f}<extra></extra>"
 ),
 row=1, col=1
)

# Add trend line
z = np.polyfit(df['marketing_spend'], df['sales_revenue'], 1)
p = np.poly1d(z)
x_trend = np.linspace(df['marketing_spend'].min(), df['marketing_spend'].max(), 100)
fig.add_trace(
 go.Scatter(
 x=x_trend, y=p(x_trend),
 mode='lines',
 line=dict(color='red', dash='dash'),
 name='Trend Line',
 showlegend=False
 ),
 row=1, col=1
)

# Relationship 2: Product Price vs Demand (Negative correlation)
fig.add_trace(
 go.Scatter(
 x=df['product_price'],
 y=df['demand'],
 mode='markers',
 marker=dict(size=6, opacity=0.7, color='green'),
 name='Price-Demand',
 hovertemplate="Price: $%{x:.2f}<br>Demand: %{y:.0f}<extra></extra>"
 ),
 row=1, col=2
)

# Relationship 3: Customer Acquisition Cost vs Satisfaction
fig.add_trace(
 go.Scatter(
 x=df['customer_acq_cost'],
 y=df['customer_satisfaction'],
 mode='markers',
 marker=dict(size=6, opacity=0.7, color='orange'),
 name='CAC-Satisfaction',
 hovertemplate="Acq Cost: $%{x:.2f}<br>Satisfaction: %{y:.1f}/10<extra></extra>"
 ),
 row=2, col=1
)

# Relationship 4: Advertising Budget vs Brand Awareness (Quadratic)
fig.add_trace(
 go.Scatter(
 x=df['advertising_budget'],
 y=df['brand_awareness'],
 mode='markers',
 marker=dict(size=6, opacity=0.7, color='purple'),
 name='Ad-Awareness',
 hovertemplate="Ad Budget: $%{x:,.0f}<br>Awareness: %{y:.1f}%<extra></extra>"
 ),
 row=2, col=2
)

fig.update_layout(
 title="Basic Scatter Plot Analysis: Key Business Relationships",
 height=600,
 showlegend=True
)

fig.show()

# Calculate and display correlation coefficients
print("\n Correlation Analysis:")
correlations = {
 'Marketing Spend vs Sales Revenue': df['marketing_spend'].corr(df['sales_revenue']),
 'Product Price vs Demand': df['product_price'].corr(df['demand']),
 'Customer Acq Cost vs Satisfaction': df['customer_acq_cost'].corr(df['customer_satisfaction']),
 'Advertising Budget vs Brand Awareness': df['advertising_budget'].corr(df['brand_awareness'])
}

for relationship, correlation in correlations.items():
 strength = "Strong" if abs(correlation) > 0.7 else "Moderate" if abs(correlation) > 0.4 else "Weak"
 direction = "Positive" if correlation > 0 else "Negative"
 print(f"• {relationship}: r={correlation:.3f} ({strength} {direction})")
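
Pearson's r, which `Series.corr` computes by default, only measures linear association, so a curved but monotone relationship can look weaker than it is. As a hedged illustration on made-up data (the variable names and the exponential shape below are assumptions for the example, not values from the business dataset), `scipy.stats` can report both Pearson and Spearman coefficients together with their p-values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic monotone-but-curved relationship (illustrative only)
x = rng.uniform(1, 10, size=200)
y_monotone = np.exp(x / 2) + rng.normal(0, 1, size=200)

# Pearson measures linear association; Spearman measures monotone (rank) association
pearson_r, pearson_p = stats.pearsonr(x, y_monotone)
spearman_rho, spearman_p = stats.spearmanr(x, y_monotone)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.2e})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.2e})")
```

When the two coefficients disagree noticeably, the scatter plot is usually the quickest way to see whether curvature or a handful of extreme points is responsible.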

In [None]:
# 2. MULTI-DIMENSIONAL SCATTER PLOTS
print("\n 2. MULTI-DIMENSIONAL SCATTER PLOTS")
print("=" * 40)

# Color-coded scatter plot (3rd dimension)
print("2.1 Color-Coded Scatter Plot:")
fig1 = px.scatter(
 df,
 x='marketing_spend',
 y='sales_revenue',
 color='region',
 title="Sales Revenue vs Marketing Spend by Region",
 labels={
 'marketing_spend': 'Marketing Spend ($)',
 'sales_revenue': 'Sales Revenue ($)',
 'region': 'Region'
 },
 hover_data=['customer_satisfaction', 'employee_count']
)
fig1.show()

# Size-coded scatter plot (4th dimension)
print("\n2.2 Size-Coded Scatter Plot:")
fig2 = px.scatter(
 df,
 x='product_price',
 y='demand',
 size='sales_revenue',
 color='business_type',
 title="Product Price vs Demand (Size = Sales Revenue, Color = Business Type)",
 labels={
 'product_price': 'Product Price ($)',
 'demand': 'Demand (units)',
 'sales_revenue': 'Sales Revenue ($)',
 'business_type': 'Business Type'
 },
 size_max=20
)
fig2.show()

# Color and size combined (5 dimensions)
print("\n2.3 Five-Dimensional Scatter Plot:")
fig3 = px.scatter(
 df,
 x='employee_count',
 y='revenue_per_employee',
 size='marketing_spend',
 color='customer_satisfaction',
 symbol='company_size',
 title="Employee Productivity Analysis (5 Dimensions)",
 labels={
 'employee_count': 'Employee Count',
 'revenue_per_employee': 'Revenue per Employee ($)',
 'marketing_spend': 'Marketing Spend ($)',
 'customer_satisfaction': 'Customer Satisfaction (1-10)',
 'company_size': 'Company Size'
 },
 color_continuous_scale='viridis',
 size_max=25
)
fig3.show()

print(" Multi-dimensional insights:")
print("• Color encoding reveals regional patterns in business performance")
print("• Size encoding shows how sales volume relates to pricing strategies")
print("• Symbol shapes distinguish company size categories")
print("• Combined encoding reveals complex multi-factor relationships")

In [None]:
# 3. ADVANCED TREND ANALYSIS
print("\n 3. ADVANCED TREND ANALYSIS")
print("=" * 30)

# Polynomial regression analysis
def fit_polynomial_trends(x, y, degrees=(1, 2, 3)):
    """Fit polynomial trends and return fitted curves with in-sample R-squared values"""
    results = {}

    for degree in degrees:
        # Fit polynomial
        poly_features = PolynomialFeatures(degree=degree)
        x_poly = poly_features.fit_transform(x.reshape(-1, 1))

        model = LinearRegression()
        model.fit(x_poly, y)

        # Generate smooth curve
        x_smooth = np.linspace(x.min(), x.max(), 100)
        x_smooth_poly = poly_features.transform(x_smooth.reshape(-1, 1))
        y_smooth = model.predict(x_smooth_poly)

        # Calculate R-squared on the data used for fitting
        y_pred = model.predict(x_poly)
        r2 = r2_score(y, y_pred)

        results[degree] = {
            'x_smooth': x_smooth,
            'y_smooth': y_smooth,
            'r2': r2,
            'model': model
        }

    return results

# Analyze advertising budget vs brand awareness (non-linear relationship)
x_data = df['advertising_budget'].values
y_data = df['brand_awareness'].values

trend_results = fit_polynomial_trends(x_data, y_data, degrees=[1, 2, 3])

# Create visualization
fig = go.Figure()

# Original data points
fig.add_trace(
 go.Scatter(
 x=df['advertising_budget'],
 y=df['brand_awareness'],
 mode='markers',
 marker=dict(size=8, opacity=0.6, color='lightblue'),
 name='Data Points',
 hovertemplate="Budget: $%{x:,.0f}<br>Awareness: %{y:.1f}%<extra></extra>"
 )
)

# Add polynomial trend lines
colors = ['red', 'green', 'purple']
for i, (degree, result) in enumerate(trend_results.items()):
 fig.add_trace(
 go.Scatter(
 x=result['x_smooth'],
 y=result['y_smooth'],
 mode='lines',
 line=dict(color=colors[i], width=3),
 name=f'Degree {degree} (R²={result["r2"]:.3f})',
 hovertemplate=f"Polynomial Degree {degree}<br>R-squared: {result['r2']:.3f}<extra></extra>"
 )
 )

fig.update_layout(
 title="Polynomial Trend Analysis: Advertising Budget vs Brand Awareness",
 xaxis_title="Advertising Budget ($)",
 yaxis_title="Brand Awareness (%)",
 height=500,
 showlegend=True
)
fig.show()

# Statistical analysis of trends
print(" Polynomial Trend Analysis Results:")
for degree, result in trend_results.items():
 print(f"• Degree {degree}: R² = {result['r2']:.4f}")

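# In-sample R² cannot decrease as the degree grows, so this selects the most
# flexible model by construction; confirm the choice on held-out data
best_degree = max(trend_results.keys(), key=lambda k: trend_results[k]['r2'])
print(f"• Highest in-sample R²: Degree {best_degree} polynomial (R² = {trend_results[best_degree]['r2']:.4f})")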

# Confidence intervals for linear regression
def calculate_confidence_intervals(x, y, confidence=0.95):
 """Calculate confidence intervals for linear regression"""
 from scipy.stats import t

 n = len(x)
 x_mean = np.mean(x)

 # Fit linear regression
 model = LinearRegression()
 model.fit(x.reshape(-1, 1), y)
 y_pred = model.predict(x.reshape(-1, 1))

 # Calculate residuals and standard error
 residuals = y - y_pred
 mse = np.sum(residuals**2) / (n - 2)
 se = np.sqrt(mse)

 # T-value for confidence interval
 alpha = 1 - confidence
 t_val = t.ppf(1 - alpha/2, n - 2)

 # Generate smooth prediction line
 x_smooth = np.linspace(x.min(), x.max(), 100)
 y_smooth = model.predict(x_smooth.reshape(-1, 1))

 # Calculate confidence intervals
 sxx = np.sum((x - x_mean)**2)
 se_pred = se * np.sqrt(1/n + (x_smooth - x_mean)**2 / sxx)

 ci_lower = y_smooth - t_val * se_pred
 ci_upper = y_smooth + t_val * se_pred

 return x_smooth, y_smooth, ci_lower, ci_upper

# Create confidence interval plot
x_ci, y_ci, ci_lower, ci_upper = calculate_confidence_intervals(
 df['marketing_spend'].values, df['sales_revenue'].values
)

fig_ci = go.Figure()

# Data points
fig_ci.add_trace(
 go.Scatter(
 x=df['marketing_spend'],
 y=df['sales_revenue'],
 mode='markers',
 marker=dict(size=6, opacity=0.6, color='blue'),
 name='Data Points'
 )
)

# Regression line
fig_ci.add_trace(
 go.Scatter(
 x=x_ci, y=y_ci,
 mode='lines',
 line=dict(color='red', width=2),
 name='Regression Line'
 )
)

# Confidence interval
fig_ci.add_trace(
 go.Scatter(
 x=np.concatenate([x_ci, x_ci[::-1]]),
 y=np.concatenate([ci_upper, ci_lower[::-1]]),
 fill='toself',
 fillcolor='rgba(255,0,0,0.2)',
 line=dict(color='rgba(255,255,255,0)'),
 name='95% Confidence Interval',
 showlegend=True
 )
)

fig_ci.update_layout(
 title="Linear Regression with 95% Confidence Intervals",
 xaxis_title="Marketing Spend ($)",
 yaxis_title="Sales Revenue ($)",
 height=500
)
fig_ci.show()
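
Because in-sample R² never decreases when a higher polynomial degree is added, it tends to favor the most flexible curve. One common alternative is to compare degrees by cross-validated R². The sketch below uses synthetic data with a known quadratic shape; the data, the degree list, and the 5-fold setting are all assumptions chosen for illustration rather than the notebook's own procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Synthetic data with a true quadratic trend plus noise (illustrative only)
x = rng.uniform(0, 10, size=300)
y = 2.0 + 1.5 * x - 0.1 * x**2 + rng.normal(0, 1.0, size=300)
X = x.reshape(-1, 1)

cv_scores = {}
for degree in (1, 2, 3, 6):
    # The pipeline refits the polynomial expansion inside every fold
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[degree] = fold_scores.mean()

for degree, score in cv_scores.items():
    print(f"Degree {degree}: mean cross-validated R² = {score:.4f}")
```

Higher degrees typically stop improving, or get slightly worse, once the true shape has been captured, which is the signal that the extra flexibility is fitting noise.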

In [None]:
# 4. OUTLIER DETECTION AND ANALYSIS
print("\n 4. OUTLIER DETECTION AND ANALYSIS")
print("=" * 38)

def detect_outliers_multiple_methods(x, y):
    """Detect outliers using multiple statistical methods"""
    outliers = {}

    # Method 1: Z-score (univariate)
    z_scores_x = np.abs(stats.zscore(x))
    z_scores_y = np.abs(stats.zscore(y))
    z_outliers = (z_scores_x > 3) | (z_scores_y > 3)
    outliers['Z-Score'] = z_outliers

    # Method 2: IQR method (univariate)
    def iqr_outliers(data):
        Q1 = np.percentile(data, 25)
        Q3 = np.percentile(data, 75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        return (data < lower_bound) | (data > upper_bound)

    iqr_outliers_x = iqr_outliers(x)
    iqr_outliers_y = iqr_outliers(y)
    outliers['IQR'] = iqr_outliers_x | iqr_outliers_y

    # Method 3: Mahalanobis distance (bivariate)
    from scipy.spatial.distance import mahalanobis
    data = np.column_stack([x, y])
    mean = np.mean(data, axis=0)
    cov = np.cov(data.T)

    try:
        inv_cov = np.linalg.inv(cov)
        mahal_distances = [mahalanobis(point, mean, inv_cov) for point in data]
        mahal_threshold = np.percentile(mahal_distances, 95) # Top 5%
        outliers['Mahalanobis'] = np.array(mahal_distances) > mahal_threshold
    except np.linalg.LinAlgError:
        # Singular covariance matrix: distances are undefined, so flag nothing
        outliers['Mahalanobis'] = np.zeros(len(x), dtype=bool)

    # Method 4: Isolation Forest
    from sklearn.ensemble import IsolationForest
    isolation_forest = IsolationForest(contamination=0.05, random_state=42)
    outlier_labels = isolation_forest.fit_predict(data)
    outliers['Isolation Forest'] = outlier_labels == -1

    return outliers

# Detect outliers in marketing spend vs sales revenue relationship
outliers = detect_outliers_multiple_methods(
 df['marketing_spend'].values,
 df['sales_revenue'].values
)

# Create comprehensive outlier visualization
fig = make_subplots(
 rows=2, cols=2,
 subplot_titles=list(outliers.keys()),
 specs=[[{"type": "scatter"}, {"type": "scatter"}],
 [{"type": "scatter"}, {"type": "scatter"}]]
)

positions = [(1,1), (1,2), (2,1), (2,2)]
colors = ['red', 'green', 'blue', 'purple']

for i, (method, outlier_mask) in enumerate(outliers.items()):
    row, col = positions[i]

    # Normal points
    normal_mask = ~outlier_mask
    fig.add_trace(
        go.Scatter(
            x=df['marketing_spend'][normal_mask],
            y=df['sales_revenue'][normal_mask],
            mode='markers',
            marker=dict(size=5, opacity=0.6, color='lightblue'),
            name='Normal',
            showlegend=(i==0)
        ),
        row=row, col=col
    )

    # Outlier points
    if np.any(outlier_mask):
        fig.add_trace(
            go.Scatter(
                x=df['marketing_spend'][outlier_mask],
                y=df['sales_revenue'][outlier_mask],
                mode='markers',
                marker=dict(size=8, color=colors[i], symbol='x'),
                name='Outliers',
                showlegend=(i==0)
            ),
            row=row, col=col
        )

fig.update_layout(
 title="Outlier Detection: Multiple Methods Comparison",
 height=600,
 showlegend=True
)
fig.show()

# Summary statistics for outlier detection
print(" Outlier Detection Summary:")
for method, outlier_mask in outliers.items():
 n_outliers = np.sum(outlier_mask)
 percentage = (n_outliers / len(df)) * 100
 print(f"• {method}: {n_outliers} outliers ({percentage:.1f}%)")

# Analyze outlier characteristics
print("\n Outlier Characteristics Analysis:")
combined_outliers = np.any(list(outliers.values()), axis=0)
if np.any(combined_outliers):
 outlier_data = df[combined_outliers]
 normal_data = df[~combined_outliers]

 print(f"• Total unique outliers: {np.sum(combined_outliers)} ({np.sum(combined_outliers)/len(df)*100:.1f}%)")
 print(f"• Average marketing spend - Outliers: ${outlier_data['marketing_spend'].mean():,.0f}")
 print(f"• Average marketing spend - Normal: ${normal_data['marketing_spend'].mean():,.0f}")
 print(f"• Average sales revenue - Outliers: ${outlier_data['sales_revenue'].mean():,.0f}")
 print(f"• Average sales revenue - Normal: ${normal_data['sales_revenue'].mean():,.0f}")
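
A percentile cutoff on Mahalanobis distance always flags a fixed share of points, whether or not anything unusual is present. If the data are roughly bivariate normal, squared Mahalanobis distances approximately follow a chi-square distribution with 2 degrees of freedom, which gives a distribution-based cutoff instead. The sketch below works under that normality assumption on synthetic data with a few planted anomalies; the 97.5% level and all variable names are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)

# Correlated bivariate normal cloud with five planted anomalies (illustrative only)
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])
points = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=500)
points[:5] = [[6, -6], [-6, 6], [7, -5], [-5, 7], [8, -8]]

# Squared Mahalanobis distance of every point from the sample mean
center = points.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(points, rowvar=False))
diff = points - center
d_squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Under bivariate normality, d² is approximately chi-square with 2 degrees of freedom
cutoff = chi2.ppf(0.975, df=2)
is_outlier = d_squared > cutoff

print(f"Cutoff on d²: {cutoff:.2f}")
print(f"Flagged {is_outlier.sum()} of {len(points)} points")
```

Note that the planted anomalies also inflate the sample covariance used here; a robust covariance estimate would reduce that masking effect, at the cost of extra complexity.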

In [None]:
# 5. INTERACTIVE SCATTER PLOT DASHBOARD
print("\n 5. INTERACTIVE SCATTER PLOT DASHBOARD")
print("=" * 42)

# Create comprehensive interactive dashboard
def create_interactive_scatter_dashboard(dataframe):
 """Create an interactive scatter plot dashboard with multiple features"""

 # Main scatter plot with dropdown selectors
 fig = go.Figure()

 # Default plot: Marketing vs Sales
 fig.add_trace(
 go.Scatter(
 x=dataframe['marketing_spend'],
 y=dataframe['sales_revenue'],
 mode='markers',
 marker=dict(
 size=8,
 opacity=0.7,
 color=dataframe['customer_satisfaction'],
 colorscale='viridis',
 showscale=True,
 colorbar=dict(title="Customer Satisfaction")
 ),
 text=dataframe['region'],
 hovertemplate=(
 "<b>Marketing:</b> $%{x:,.0f}<br>"
 "<b>Sales:</b> $%{y:,.0f}<br>"
 "<b>Region:</b> %{text}<br>"
 "<b>Satisfaction:</b> %{marker.color:.1f}/10"
 "<extra></extra>"
 ),
 name='Business Data'
 )
 )

 # Add dropdown menus for X and Y axes
 fig.update_layout(
 updatemenus=[
 dict(
 buttons=list([
 dict(label="Marketing vs Sales",
 method="restyle",
 args=[{"x": [dataframe['marketing_spend']],
 "y": [dataframe['sales_revenue']]}]),
 dict(label="Price vs Demand",
 method="restyle",
 args=[{"x": [dataframe['product_price']],
 "y": [dataframe['demand']]}]),
 dict(label="Employees vs Revenue/Employee",
 method="restyle",
 args=[{"x": [dataframe['employee_count']],
 "y": [dataframe['revenue_per_employee']]}]),
 dict(label="Ad Budget vs Brand Awareness",
 method="restyle",
 args=[{"x": [dataframe['advertising_budget']],
 "y": [dataframe['brand_awareness']]}])
 ]),
 direction="down",
 showactive=True,
 x=0.01,
 xanchor="left",
 y=1.02,
 yanchor="top"
 ),
 ]
 )

 fig.update_layout(
 title="Interactive Scatter Plot Dashboard<br><sub>Use dropdown to explore different relationships</sub>",
 xaxis_title="X Variable",
 yaxis_title="Y Variable",
 height=600,
 width=900
 )

 return fig

# Create and display interactive dashboard
dashboard_fig = create_interactive_scatter_dashboard(df)
dashboard_fig.show()

# Animated scatter plot showing evolution over time
print("\n 5.1 Animated Time Series Scatter Plot:")

# Create time-based animation
fig_anim = px.scatter(
 df.sort_values('time_period'),
 x='marketing_spend',
 y='sales_revenue',
 animation_frame='time_period',
 size='employee_count',
 color='region',
 hover_name='business_type',
 title="Business Performance Evolution Over Time",
 labels={
 'marketing_spend': 'Marketing Spend ($)',
 'sales_revenue': 'Sales Revenue ($)',
 'time_period': 'Time Period'
 },
 range_x=[df['marketing_spend'].min()*0.9, df['marketing_spend'].max()*1.1],
 range_y=[df['sales_revenue'].min()*0.9, df['sales_revenue'].max()*1.1],
 size_max=20
)

# Enhance animation settings
fig_anim.update_layout(
 updatemenus=[dict(
 type="buttons",
 direction="left",
 buttons=list([
 dict(label="Play",
 method="animate",
 args=[None, {"frame": {"duration": 100, "redraw": False},
 "fromcurrent": True}]),
 dict(label="Pause",
 method="animate",
 args=[[None], {"frame": {"duration": 0, "redraw": False},
 "mode": "immediate",
 "transition": {"duration": 0}}])
 ]),
 pad={"r": 10, "t": 87},
 showactive=False,
 x=0.011,
 xanchor="right",
 y=0,
 yanchor="top"
 )]
)

fig_anim.show()

print(" Interactive Features Available:")
print("• Dropdown menus for exploring different variable relationships")
print("• Color encoding for third-dimension insights")
print("• Hover information with detailed business metrics")
print("• Animation showing temporal evolution of relationships")
print("• Zoom and pan capabilities for detailed exploration")
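
When `animation_frame` is a column with one distinct value per row, each frame contains a single point and the animation has as many frames as rows. A possible workaround is to bucket the time index into a smaller number of frames first. The sketch below shows only the bucketing step on a stand-in DataFrame; the column name `time_bucket` and the choice of 20 frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Stand-in for the business dataset: one row per time period (illustrative only)
demo = pd.DataFrame({
    "time_period": np.arange(1000),
    "marketing_spend": rng.exponential(5000, size=1000) + 1000,
})

# Group consecutive periods into a fixed number of equal-width buckets
n_frames = 20
demo["time_bucket"] = pd.cut(demo["time_period"], bins=n_frames, labels=False)

frame_sizes = demo.groupby("time_bucket").size()
print(frame_sizes.describe())
```

Passing `animation_frame="time_bucket"` to `px.scatter` would then draw one frame per bucket, each containing a cloud of points rather than a single marker.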

In [None]:
# 6. CORRELATION MATRIX AND PAIR PLOTS
print("\n 6. CORRELATION MATRIX AND PAIR PLOTS")
print("=" * 40)

# Select numerical variables for correlation analysis
numerical_vars = ['marketing_spend', 'sales_revenue', 'customer_acq_cost',
 'customer_satisfaction', 'product_price', 'demand',
 'employee_count', 'revenue_per_employee', 'advertising_budget',
 'brand_awareness']

correlation_matrix = df[numerical_vars].corr()

# Create interactive correlation heatmap
fig_corr = go.Figure(data=go.Heatmap(
 z=correlation_matrix.values,
 x=correlation_matrix.columns,
 y=correlation_matrix.columns,
 colorscale='RdBu_r',
 zmid=0,
 text=correlation_matrix.round(3).values,
 texttemplate="%{text}",
 textfont={"size": 10},
 hoverongaps=False,
 hovertemplate="<b>%{y} vs %{x}</b><br>Correlation: %{z:.3f}<extra></extra>"
))

fig_corr.update_layout(
 title="Business Metrics Correlation Matrix",
 width=800,
 height=600,
 xaxis_title="Variables",
 yaxis_title="Variables"
)
fig_corr.show()

# Create enhanced pair plot
print("\n6.1 Enhanced Pair Plot Analysis:")

# Select key variables for pair plot
key_vars = ['marketing_spend', 'sales_revenue', 'customer_satisfaction', 'product_price']
pair_data = df[key_vars + ['region']].copy()

# Create pair plot matrix
n_vars = len(key_vars)
fig_pair = make_subplots(
 rows=n_vars, cols=n_vars,
 subplot_titles=[f"{row} vs {col}" if row != col else f"{row} Distribution"
 for row in key_vars for col in key_vars]
)

# Fix one color per region so the legend matches every panel
region_colors = dict(zip(df['region'].unique(), px.colors.qualitative.Plotly))

for i, var1 in enumerate(key_vars):
    for j, var2 in enumerate(key_vars):
        if i == j:
            # Diagonal: Distribution plot
            fig_pair.add_trace(
                go.Histogram(
                    x=df[var1],
                    nbinsx=20,
                    name=f'{var1} Dist',
                    showlegend=False,
                    marker_color='lightblue',
                    opacity=0.7
                ),
                row=i+1, col=j+1
            )
        else:
            # Off-diagonal: Scatter plot colored by region
            for region in df['region'].unique():
                region_data = df[df['region'] == region]
                fig_pair.add_trace(
                    go.Scatter(
                        x=region_data[var2],
                        y=region_data[var1],
                        mode='markers',
                        marker=dict(size=4, opacity=0.6, color=region_colors[region]),
                        name=region if i == 0 and j == 1 else None,
                        showlegend=(i == 0 and j == 1),
                        legendgroup=region
                    ),
                    row=i+1, col=j+1
                )

fig_pair.update_layout(
 title="Enhanced Pair Plot: Key Business Variables by Region",
 height=800,
 showlegend=True
)
fig_pair.show()

# Correlation strength analysis
print("\n Correlation Strength Analysis:")
correlations_list = []

for i in range(len(numerical_vars)):
    for j in range(i+1, len(numerical_vars)):
        var1, var2 = numerical_vars[i], numerical_vars[j]
        corr_val = correlation_matrix.loc[var1, var2]

        correlations_list.append({
            'Variable_1': var1,
            'Variable_2': var2,
            'Correlation': corr_val,
            'Abs_Correlation': abs(corr_val),
            'Strength': 'Very Strong' if abs(corr_val) >= 0.8 else
                        'Strong' if abs(corr_val) >= 0.6 else
                        'Moderate' if abs(corr_val) >= 0.4 else
                        'Weak' if abs(corr_val) >= 0.2 else 'Very Weak',
            'Direction': 'Positive' if corr_val > 0 else 'Negative'
        })

corr_df = pd.DataFrame(correlations_list)
corr_df = corr_df.sort_values('Abs_Correlation', ascending=False)

print("Top 10 Strongest Correlations:")
print(corr_df.head(10)[['Variable_1', 'Variable_2', 'Correlation', 'Strength', 'Direction']].to_string(index=False))

# Create correlation strength distribution
fig_strength = px.histogram(
 corr_df,
 x='Abs_Correlation',
 color='Strength',
 title="Distribution of Correlation Strengths in Business Data",
 labels={'Abs_Correlation': 'Absolute Correlation Coefficient'},
 nbins=20
)
fig_strength.show()

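print("\nKey Correlation Insights:")
strong_corrs = corr_df[corr_df['Abs_Correlation'] >= 0.6]
print(f"• {len(strong_corrs)} strong correlations (|r| ≥ 0.6) identified")

# corr_df is sorted by absolute value, so its first row may be negative;
# filter to positive values before reporting the strongest positive pair
positive_corrs = corr_df[corr_df['Correlation'] > 0]
if not positive_corrs.empty:
    print(f"• Strongest positive correlation: {positive_corrs.iloc[0]['Variable_1']} vs {positive_corrs.iloc[0]['Variable_2']} (r={positive_corrs.iloc[0]['Correlation']:.3f})")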

negative_corrs = corr_df[corr_df['Correlation'] < 0].sort_values('Correlation')
if not negative_corrs.empty:
 print(f"• Strongest negative correlation: {negative_corrs.iloc[0]['Variable_1']} vs {negative_corrs.iloc[0]['Variable_2']} (r={negative_corrs.iloc[0]['Correlation']:.3f})")
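
The pairwise loop over the correlation matrix can also be written without explicit indexing by masking the upper triangle and stacking it into a long table. This is a sketch on a small synthetic DataFrame; the column names `a` to `d` and the output names `variable_1`, `variable_2`, and `correlation` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Small synthetic frame where only columns a and b are strongly related
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
demo["b"] = demo["a"] * 0.9 + rng.normal(0, 0.3, size=200)

corr = demo.corr()

# Keep each unordered pair once: strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)   # everything outside the upper triangle becomes NaN
    .stack()
    .dropna()           # explicit, in case stack() keeps NaN entries
    .rename("correlation")
    .reset_index()
    .rename(columns={"level_0": "variable_1", "level_1": "variable_2"})
)

# Order by absolute strength, strongest first
order = pairs["correlation"].abs().sort_values(ascending=False).index
pairs = pairs.loc[order].reset_index(drop=True)
print(pairs)
```

For n variables this yields n(n-1)/2 rows, the same pairs that the nested loop visits.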

In [None]:
# 7. BUSINESS INSIGHTS AND INTERPRETATION
print("\n 7. BUSINESS INSIGHTS AND INTERPRETATION")
print("=" * 46)

def generate_business_insights(dataframe, correlation_data):
    """Generate actionable business insights from scatter plot analysis"""

    insights = {
        'high_impact_relationships': [],
        'optimization_opportunities': [],
        'risk_factors': [],
        'strategic_recommendations': []
    }

    # Analyze high-impact relationships
    marketing_sales_corr = dataframe['marketing_spend'].corr(dataframe['sales_revenue'])
    price_demand_corr = dataframe['product_price'].corr(dataframe['demand'])

    if marketing_sales_corr > 0.7:
        # A correlation coefficient is unitless; the dollar effect comes from the fitted slope
        sales_per_dollar = np.polyfit(dataframe['marketing_spend'], dataframe['sales_revenue'], 1)[0]
        insights['high_impact_relationships'].append(
            f"Strong marketing-sales relationship (r={marketing_sales_corr:.2f}): each additional $1 "
            f"in marketing is associated with ~${sales_per_dollar:.2f} in sales"
        )

    if price_demand_corr < -0.5:
        insights['optimization_opportunities'].append(
            "Negative price-demand correlation indicates demand sensitivity - consider dynamic pricing strategies"
        )

    # Identify outliers as opportunities or risks
    # (relies on the module-level `outliers` dict built in section 4)
    outliers_combined = np.any(list(outliers.values()), axis=0)
    outlier_performance = dataframe[outliers_combined]

    if not outlier_performance.empty:
        high_performers = outlier_performance[
            outlier_performance['sales_revenue'] > dataframe['sales_revenue'].quantile(0.9)
        ]

        if not high_performers.empty:
            insights['optimization_opportunities'].append(
                f"Found {len(high_performers)} high-performing outliers with unique patterns worth replicating"
            )

    # Customer satisfaction impact
    satisfaction_impact = dataframe.groupby(
        pd.cut(dataframe['customer_satisfaction'], bins=3, labels=['Low', 'Medium', 'High'])
    )['sales_revenue'].mean()

    if satisfaction_impact['High'] > satisfaction_impact['Low'] * 1.2:
        insights['strategic_recommendations'].append(
            "High-satisfaction segment averages over 20% more revenue than the low-satisfaction segment - prioritize customer experience investments"
        )

    # Regional performance analysis
    regional_performance = dataframe.groupby('region').agg({
        'sales_revenue': 'mean',
        'marketing_spend': 'mean',
        'customer_satisfaction': 'mean'
    })

    best_region = regional_performance['sales_revenue'].idxmax()
    worst_region = regional_performance['sales_revenue'].idxmin()

    insights['strategic_recommendations'].append(
        f"Regional optimization: {best_region} region outperforms {worst_region} by "
        f"{((regional_performance.loc[best_region, 'sales_revenue'] / regional_performance.loc[worst_region, 'sales_revenue']) - 1) * 100:.1f}%"
    )

    return insights

# Generate comprehensive business insights
business_insights = generate_business_insights(df, corr_df)

# Create business insight visualization
fig_insights = make_subplots(
 rows=2, cols=2,
 subplot_titles=(
 "Marketing ROI by Region",
 "Customer Satisfaction Impact",
 "Price vs Demand Sensitivity",
 "Regional Performance Comparison"
 ),
 specs=[[{"type": "bar"}, {"type": "scatter"}],
 [{"type": "scatter"}, {"type": "bar"}]]
)

# Marketing ROI by region
regional_roi = df.groupby('region').apply(
 lambda x: x['sales_revenue'].sum() / x['marketing_spend'].sum()
).reset_index()
regional_roi.columns = ['region', 'roi']

fig_insights.add_trace(
 go.Bar(
 x=regional_roi['region'],
 y=regional_roi['roi'],
 marker_color=['red', 'green', 'blue', 'orange'],
 name='Marketing ROI',
 text=regional_roi['roi'].round(2),
 textposition='auto'
 ),
 row=1, col=1
)

# Customer satisfaction impact
satisfaction_bins = pd.cut(df['customer_satisfaction'], bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
satisfaction_impact = df.groupby(satisfaction_bins)['sales_revenue'].mean().reset_index()

fig_insights.add_trace(
 go.Scatter(
 x=satisfaction_impact['customer_satisfaction'],
 y=satisfaction_impact['sales_revenue'],
 mode='markers+lines',
 marker=dict(size=10, color='green'),
 line=dict(color='green', width=3),
 name='Satisfaction Impact'
 ),
 row=1, col=2
)

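# Price sensitivity analysis
fig_insights.add_trace(
    go.Scatter(
        x=df['product_price'],
        y=df['demand'],
        mode='markers',
        marker=dict(size=5, opacity=0.6, color='purple'),
        name='Price Sensitivity'
    ),
    row=2, col=1
)

# go.Scatter accepts no `trendline` argument (that is a plotly.express option),
# so fit the OLS line with np.polyfit and draw it as a separate trace
price_fit = np.poly1d(np.polyfit(df['product_price'], df['demand'], 1))
price_grid = np.linspace(df['product_price'].min(), df['product_price'].max(), 100)
fig_insights.add_trace(
    go.Scatter(x=price_grid, y=price_fit(price_grid), mode='lines',
               line=dict(color='black', dash='dash'), name='OLS Trend'),
    row=2, col=1
)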

# Regional performance comparison
regional_performance = df.groupby('region').agg({
    'sales_revenue': 'mean',
    'customer_satisfaction': 'mean'
}).reset_index()

fig_insights.add_trace(
    go.Bar(
        x=regional_performance['region'],
        y=regional_performance['sales_revenue'],
        marker_color=['lightcoral', 'lightgreen', 'lightblue', 'lightyellow'],
        name='Avg Sales Revenue',
        text=regional_performance['sales_revenue'].round(0),
        textposition='auto'
    ),
    row=2, col=2
)

fig_insights.update_layout(
    title="Business Intelligence Dashboard: Key Performance Insights",
    height=700,
    showlegend=False
)
fig_insights.show()

# Display business insights
print("Strategic Business Insights:")
print("=" * 30)

for category, insights_list in business_insights.items():
    if insights_list:
        print(f"\n{category.replace('_', ' ').title()}:")
        for i, insight in enumerate(insights_list, 1):
            print(f"  {i}. {insight}")

# Key performance metrics
print("\nKey Performance Metrics:")
print(f"• Overall Marketing ROI: {df['sales_revenue'].sum() / df['marketing_spend'].sum():.2f}x")
print(f"• Average Customer Satisfaction: {df['customer_satisfaction'].mean():.1f}/10")
# Pearson r measures linear association between price and demand; it is not an elasticity
print(f"• Price-Demand Correlation (Pearson r): {df['product_price'].corr(df['demand']):.3f}")
print(f"• Employee Productivity: ${df['revenue_per_employee'].mean():,.0f} per employee")

# Recommendations summary
print("\nActionable Recommendations:")
print("1. Focus marketing investments in regions with highest ROI")
print("2. Implement customer satisfaction monitoring as leading indicator")
print("3. Consider premium pricing strategy where demand is less elastic")
print("4. Investigate outlier success patterns for replication opportunities")
print("5. Balance employee count with productivity metrics for optimal scaling")

# LEARNING SUMMARY: Scatter Plot Analysis

## Key Concepts Mastered

### 1. **Scatter Plot Fundamentals**
- **Basic Relationships**: Understanding positive, negative, and no correlation patterns
- **Trend Analysis**: Linear, polynomial, and non-linear relationship identification
- **Outlier Detection**: Multiple statistical methods for anomaly identification
- **Confidence Intervals**: Understanding uncertainty in relationship estimates
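
As a reminder of what "multiple statistical methods" can look like in practice, here is a minimal sketch of two common outlier rules, the IQR fence and the z-score threshold. The helper names `iqr_outliers` and `zscore_outliers`, the thresholds, and the synthetic sample are illustrative assumptions, not the functions used earlier in this notebook.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask for points more than `threshold` std devs from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=0)
    return np.abs(z) > threshold

# Illustrative data: 200 normal draws plus one injected extreme value
rng = np.random.default_rng(42)
sample = np.append(rng.normal(loc=100, scale=10, size=200), 250.0)

iqr_mask = iqr_outliers(sample)
z_mask = zscore_outliers(sample)
print(f"IQR rule flags {iqr_mask.sum()} points; z-score rule flags {z_mask.sum()} points")
```

The two rules often disagree on borderline points: the z-score depends on a mean and standard deviation that the outliers themselves inflate, while the IQR fence is based on quartiles and is less affected.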

### 2. **Multi-Dimensional Visualization**
- **Color Encoding**: Adding third dimension through color mapping
- **Size Encoding**: Representing fourth dimension through marker size
- **Symbol Encoding**: Using shapes for categorical distinctions
- **Interactive Features**: Hover information, zooming, and dynamic exploration

### 3. **Advanced Techniques**
- **Polynomial Regression**: Capturing non-linear relationships
- **Statistical Testing**: Correlation significance and confidence intervals
- **Animation**: Temporal evolution of relationships
- **Dashboard Creation**: Interactive exploration tools
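
For the statistical-testing item, one common way to attach a confidence interval to a Pearson correlation is the Fisher z-transformation. The sketch below assumes roughly bivariate-normal data and uses a hypothetical helper, `pearson_with_ci`; it is not necessarily the method applied earlier in the notebook.

```python
import math
import numpy as np

def pearson_with_ci(x, y, z_crit=1.96):
    """Pearson r with an approximate 95% CI via the Fisher z-transformation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if n < 4:
        raise ValueError("Fisher z interval needs at least 4 observations")
    r = float(np.corrcoef(x, y)[0, 1])
    # Clip so atanh stays finite when r is exactly +/-1
    r_clipped = min(max(r, -0.999999), 0.999999)
    z = math.atanh(r_clipped)       # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return r, lower, upper

# Illustrative synthetic data with a strong positive relationship
rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=300)
y = 2.0 * x + rng.normal(0, 1, size=300)

r, lower, upper = pearson_with_ci(x, y)
print(f"r = {r:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero corresponds to a correlation that is significant at roughly the 5% level under these assumptions.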

## Business Applications

### Strategic Analysis
- **Marketing ROI**: Quantifying marketing spend effectiveness
- **Price Elasticity**: Understanding demand sensitivity to pricing
- **Customer Insights**: Satisfaction impact on business outcomes
- **Regional Performance**: Geographic optimization opportunities
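
A note on price elasticity: a correlation coefficient measures how tightly price and demand move together, whereas elasticity measures the percentage change in demand per percentage change in price. A common approximation is the slope of a log-log regression. The sketch below uses synthetic data with a known elasticity; `log_log_elasticity` is a hypothetical helper, and the constant-elasticity form is an assumption rather than a property of this notebook's dataset.

```python
import numpy as np

def log_log_elasticity(price, demand):
    """Estimate constant price elasticity as the slope of log(demand) on log(price)."""
    price = np.asarray(price, dtype=float)
    demand = np.asarray(demand, dtype=float)
    if (price <= 0).any() or (demand <= 0).any():
        raise ValueError("log-log regression requires strictly positive values")
    slope, _intercept = np.polyfit(np.log(price), np.log(demand), 1)
    return float(slope)

# Synthetic data generated with a true elasticity of -1.5 (illustrative only)
rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=500)
demand = 5000 * price ** -1.5 * np.exp(rng.normal(0, 0.1, size=500))

elasticity = log_log_elasticity(price, demand)
correlation = float(np.corrcoef(price, demand)[0, 1])
print(f"Estimated elasticity: {elasticity:.2f}")
print(f"Pearson correlation:  {correlation:.2f}")
```

The two numbers answer different questions, so they should be reported under different labels.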

### Decision Support
Scatter plots provide:
- Clear visual evidence for business decisions
- Outlier identification for best practice replication
- Trend analysis for forecasting and planning
- Multi-factor relationship understanding

## Next Steps

1. **Tier 2: Predictive Models** - Use scatter plot insights for regression modeling
2. **Correlation to Causation** - Move from observational to causal analysis
3. **Time Series Analysis** - Understand how relationships evolve over time
4. **Advanced Visualization** - 3D plots and network analysis

## Pro Tips

- Always check for outliers - they often contain valuable insights
- Use multiple encoding dimensions (color, size, shape) for richer analysis
- Consider non-linear relationships, not just linear correlations
- Interactive plots are powerful for stakeholder presentations
- Combine statistical testing with visual analysis for robust conclusions
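
To illustrate the tip about non-linear relationships, the following sketch compares the fit of a straight line and a quadratic on synthetic curved data using R². The data, the helper `r_squared`, and the choice of degrees are assumptions for illustration only.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic data with a dominant quadratic component
rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 200)
y = 2.0 + 0.5 * x + 1.5 * x**2 + rng.normal(0, 1.0, size=x.size)

r2_by_degree = {}
for degree in (1, 2):
    coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    r2_by_degree[degree] = r_squared(y, np.polyval(coeffs, x))

for degree, r2 in r2_by_degree.items():
    print(f"degree {degree}: R^2 = {r2:.3f}")
```

A near-zero linear correlation here would not mean "no relationship"; the scatter plot and the quadratic fit both show a strong curved one. In-sample R² always rises with polynomial degree, so higher degrees should be checked on held-out data.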

## Common Pitfalls

- **Correlation ≠ Causation**: Strong relationships don't prove cause-and-effect
- **Outlier Sensitivity**: Extreme values can distort trend analysis
- **Scale Effects**: Different variable scales can mislead visual interpretation
- **Overplotting**: Too many points can obscure patterns in dense datasets
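
The outlier-sensitivity pitfall can be demonstrated with a small experiment: add one extreme point to otherwise well-behaved data and compare Pearson's r with a rank-based (Spearman) correlation. The data are synthetic, and `spearman_r` is a hypothetical helper that assumes no tied values.

```python
import numpy as np

def spearman_r(x, y):
    """Spearman rank correlation, assuming no ties (ranks via double argsort)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=50)
y = x + rng.normal(0, 0.5, size=50)  # clear positive relationship

# Inject a single extreme point that runs against the trend
x_out = np.append(x, 10.0)
y_out = np.append(y, -10.0)

pearson_clean = float(np.corrcoef(x, y)[0, 1])
pearson_out = float(np.corrcoef(x_out, y_out)[0, 1])
spearman_clean = spearman_r(x, y)
spearman_out = spearman_r(x_out, y_out)

print(f"Pearson:  clean = {pearson_clean:.2f}, with outlier = {pearson_out:.2f}")
print(f"Spearman: clean = {spearman_clean:.2f}, with outlier = {spearman_out:.2f}")
```

One point out of 51 moves Pearson's r far more than the rank-based measure, which is why it helps to inspect the scatter plot and report a robust statistic alongside r.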

**Remember**: *Scatter plots are your window into bivariate relationships - use them to guide deeper statistical analysis and business strategy!*