# Tier 4: Hierarchical Clustering

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 626f342d-2fd4-460e-aafa-3fc7c960c4b2

---

## Citation
Brandon Deloatch, "Tier 4: Hierarchical Clustering," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Random draws from NumPy lognormal, beta, gamma, and exponential distributions (seed 42) |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 626f342d-2fd4-460e-aafa-3fc7c960c4b2
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
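
As one possible way to capture the execution metadata mentioned above, the following minimal sketch records the interpreter, platform, and installed package versions using only the standard library. The helper name `capture_execution_metadata` and the package list are assumptions for illustration, not part of the original notebook.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("pandas", "numpy", "scikit-learn", "scipy", "plotly")):
    """Collect a small, JSON-serializable record of the runtime environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "package_versions": versions,
    }


execution_metadata = capture_execution_metadata()
print(json.dumps(execution_metadata, indent=2))
```

The resulting dictionary could be written next to the notebook outputs so that a later run can be compared against the environment that produced them.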

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist
import warnings
warnings.filterwarnings('ignore')

print("Tier 4: Hierarchical Clustering - Libraries Loaded!")
print("=" * 50)
print("Hierarchical Techniques:")
print("• Agglomerative (bottom-up) clustering")
print("• Multiple linkage criteria (Ward, complete, average)")
print("• Dendrogram visualization and interpretation")
print("• Optimal cluster number determination")
print("• Distance matrix analysis")

Tier 4: Hierarchical Clustering - Libraries Loaded!
Hierarchical Techniques:
• Agglomerative (bottom-up) clustering
• Multiple linkage criteria (Ward, complete, average)
• Dendrogram visualization and interpretation
• Optimal cluster number determination
• Distance matrix analysis
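
To make the bottom-up idea concrete before the main analysis, here is a minimal sketch on hypothetical toy data (five points on a line, not taken from the notebook). It shows how SciPy's linkage matrix records one merge per row and how `fcluster` cuts the tree into flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: five points on a line, chosen so no merge distances tie
points = np.array([[0.0], [1.0], [5.0], [6.5], [20.0]])

# Each row of Z records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
Z = linkage(points, method="average")
for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"merge {step}: clusters {int(a)} and {int(b)} "
          f"at distance {dist:.3f} -> size {int(size)}")

# Cut the tree into at most three flat clusters
labels = fcluster(Z, 3, criterion="maxclust")
print("flat labels:", labels)
```

Original observations are numbered 0 to n-1, and each merge creates a new cluster id starting at n, which is why later rows can refer to ids that are not in the input.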


In [None]:
# Generate hierarchical clustering datasets
np.random.seed(42)

# Product similarity data
products = [f"Product_{i:02d}" for i in range(1, 21)]
n_products = len(products)

# Create feature matrix (price, quality, brand_strength, market_share)
product_features = pd.DataFrame({
    'price': np.random.lognormal(3, 0.5, n_products),
    'quality_score': np.random.beta(2, 1, n_products) * 10,
    'brand_strength': np.random.gamma(2, 2, n_products),
    'market_share': np.random.exponential(2, n_products)
}, index=products)

# Normalize features
scaler = StandardScaler()
product_features_scaled = scaler.fit_transform(product_features)

print("Hierarchical Clustering Dataset:")
print(f"Products analyzed: {n_products}")
print(f"Features: {list(product_features.columns)}")
print(f"Price range: ${product_features['price'].min():.0f} - ${product_features['price'].max():.0f}")
print(f"Quality range: {product_features['quality_score'].min():.1f} - {product_features['quality_score'].max():.1f}")

Hierarchical Clustering Dataset:
Products analyzed: 20
Features: ['price', 'quality_score', 'brand_strength', 'market_share']
Price range: $8 - $44
Quality range: 1.6 - 9.8
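
The normalization step matters because Euclidean distances are otherwise dominated by whichever feature has the largest numeric range. The sketch below, on hypothetical two-column data (`demo` is an illustrative stand-in, not the notebook's `product_features`), checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales
demo = np.column_stack([
    rng.lognormal(3, 0.5, 20),   # price-like column
    rng.exponential(2, 20),      # share-like column
])

scaled = StandardScaler().fit_transform(demo)

# Manual z-score using the population standard deviation (ddof=0)
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print("column means after scaling:", scaled.mean(axis=0).round(6))
print("column stds after scaling: ", scaled.std(axis=0).round(6))
print("matches manual z-score:", np.allclose(scaled, manual))
```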


In [4]:
# 1. HIERARCHICAL CLUSTERING WITH DIFFERENT LINKAGES
print("1. HIERARCHICAL CLUSTERING ANALYSIS")
print("=" * 37)

# Test different linkage methods
linkage_methods = ['ward', 'complete', 'average', 'single']
linkage_results = {}

for method in linkage_methods:
    # Compute linkage matrix
    if method == 'ward':
        linkage_matrix = linkage(product_features_scaled, method='ward')
    else:
        distances = pdist(product_features_scaled)
        linkage_matrix = linkage(distances, method=method)

    # Cut the tree into a fixed number of flat clusters
    n_clusters = 4
    clusters = fcluster(linkage_matrix, n_clusters, criterion='maxclust')

    # Calculate silhouette score
    silhouette = silhouette_score(product_features_scaled, clusters)

    linkage_results[method] = {
        'linkage_matrix': linkage_matrix,
        'clusters': clusters,
        'silhouette': silhouette
    }

    print(f"{method.capitalize()} linkage: {n_clusters} clusters, silhouette = {silhouette:.3f}")

# Select best linkage method
best_method = max(linkage_results.keys(), key=lambda x: linkage_results[x]['silhouette'])
best_linkage = linkage_results[best_method]['linkage_matrix']
best_clusters = linkage_results[best_method]['clusters']

print(f"\nBest linkage method: {best_method} (silhouette = {linkage_results[best_method]['silhouette']:.3f})")

# Analyze cluster composition
product_features['cluster'] = best_clusters
cluster_analysis = product_features.groupby('cluster').agg({
    'price': ['mean', 'std', 'count'],
    'quality_score': ['mean', 'std'],
    'brand_strength': ['mean', 'std'],
    'market_share': ['mean', 'std']
}).round(2)

print(f"\nCluster Analysis:")
for cluster in range(1, n_clusters + 1):
    cluster_products = product_features[product_features['cluster'] == cluster]
    cluster_size = len(cluster_products)
    avg_price = cluster_products['price'].mean()
    avg_quality = cluster_products['quality_score'].mean()

    print(f"Cluster {cluster}: {cluster_size} products, avg price ${avg_price:.0f}, avg quality {avg_quality:.1f}")

1. HIERARCHICAL CLUSTERING ANALYSIS
Ward linkage: 4 clusters, silhouette = 0.247
Complete linkage: 4 clusters, silhouette = 0.192
Average linkage: 4 clusters, silhouette = 0.248
Single linkage: 4 clusters, silhouette = 0.194

Best linkage method: average (silhouette = 0.248)

Cluster Analysis:
Cluster 1: 2 products, avg price $44, avg quality 6.9
Cluster 2: 3 products, avg price $19, avg quality 8.3
Cluster 3: 14 products, avg price $17, avg quality 7.6
Cluster 4: 1 products, avg price $28, avg quality 1.6
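
The cell above fixes `n_clusters = 4` for every linkage method. One hedged way to approach the "optimal cluster number determination" listed in the first cell is to build the tree once, cut it at several candidate counts, and compare silhouette scores. The helper name `sweep_cluster_counts` and the blob data below are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sweep_cluster_counts(X, method="average", k_values=range(2, 7)):
    """Return {k: silhouette score} for flat cuts of a single linkage tree."""
    tree = linkage(X, method=method)  # build the hierarchy once
    scores = {}
    for k in k_values:
        labels = fcluster(tree, k, criterion="maxclust")
        n_found = len(np.unique(labels))
        # silhouette_score needs between 2 and n_samples - 1 distinct labels
        if 2 <= n_found <= len(X) - 1:
            scores[k] = silhouette_score(X, labels)
    return scores


# Hypothetical stand-in data: three well-separated 2-D blobs
X_demo, _ = make_blobs(n_samples=60, centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=0.5, random_state=0)

scores = sweep_cluster_counts(X_demo)
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette = {score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

In the notebook, the same helper could be called with `product_features_scaled`. With only 20 products and silhouette values near 0.2, the differences between candidate counts are likely to be small, so the result is better treated as a hint than as a definitive answer.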


In [5]:
# 2. INTERACTIVE DENDROGRAM AND CLUSTER VISUALIZATION
print("2. INTERACTIVE HIERARCHICAL VISUALIZATIONS")
print("=" * 44)

# Create comprehensive visualization (three panels; the dendrogram spans the top row)
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        f'Dendrogram ({best_method.capitalize()} Linkage)',
        'Product Clusters (Price vs Quality)',
        'Linkage Method Comparison'
    ],
    specs=[[{"colspan": 2}, None],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# Create dendrogram using plotly
def create_plotly_dendrogram(linkage_matrix, labels):
    """Build Plotly line traces for a dendrogram; also return the ordered leaf labels."""
    dendro = dendrogram(linkage_matrix, labels=labels, no_plot=True)

    # Each entry of icoord/dcoord holds the four x/y points of one U-shaped link
    dendro_traces = [
        go.Scatter(x=x, y=y, mode='lines',
                   line=dict(color='blue', width=1),
                   showlegend=False, hoverinfo='skip')
        for x, y in zip(dendro['icoord'], dendro['dcoord'])
    ]

    return dendro_traces, dendro['ivl']

# Add dendrogram traces
dendro_traces, leaf_labels = create_plotly_dendrogram(best_linkage, products)
for trace in dendro_traces:
    fig.add_trace(trace, row=1, col=1)

# SciPy places leaf i at x = 5 + 10*i, so label those positions with the product names
fig.update_xaxes(tickmode='array',
                 tickvals=[5 + 10 * i for i in range(len(leaf_labels))],
                 ticktext=leaf_labels, row=1, col=1)

# Product clusters scatter plot
colors = ['red', 'blue', 'green', 'purple', 'orange']
for cluster in range(1, n_clusters + 1):
    cluster_data = product_features[product_features['cluster'] == cluster]
    fig.add_trace(
        go.Scatter(x=cluster_data['price'],
                  y=cluster_data['quality_score'],
                  mode='markers+text',
                  text=cluster_data.index,
                  textposition='top center',
                  name=f'Cluster {cluster}',
                  marker=dict(color=colors[cluster-1], size=10, opacity=0.7)),
        row=2, col=1
    )

# Linkage method comparison
methods = list(linkage_results.keys())
silhouette_values = [linkage_results[method]['silhouette'] for method in methods]

fig.add_trace(
    go.Bar(x=methods, y=silhouette_values,
           marker=dict(color=['red' if m == best_method else 'lightblue' for m in methods]),
           name='Silhouette Scores'),
    row=2, col=2
)

fig.update_layout(height=800, title="Hierarchical Clustering Analysis Dashboard")
fig.update_xaxes(title_text="Products", row=1, col=1)
fig.update_xaxes(title_text="Price ($)", row=2, col=1)
fig.update_xaxes(title_text="Linkage Method", row=2, col=2)
fig.update_yaxes(title_text="Distance", row=1, col=1)
fig.update_yaxes(title_text="Quality Score", row=2, col=1)
fig.update_yaxes(title_text="Silhouette Score", row=2, col=2)
fig.show()

# Distance matrix heatmap
from scipy.spatial.distance import squareform

# pdist returns the condensed upper triangle; squareform expands it to an n x n matrix
distance_matrix = pdist(product_features_scaled)
distance_square = squareform(distance_matrix)

fig_heatmap = go.Figure(data=go.Heatmap(
    z=distance_square,
    x=products,
    y=products,
    colorscale='Viridis',
    text=np.round(distance_square, 2),
    texttemplate="%{text}",
    textfont={"size": 8}
))

fig_heatmap.update_layout(
    title="Product Distance Matrix",
    xaxis_title="Products",
    yaxis_title="Products",
    height=600
)
fig_heatmap.show()

# Business insights
print("\nBUSINESS INSIGHTS FROM HIERARCHICAL CLUSTERING:")

for cluster in range(1, n_clusters + 1):
    cluster_data = product_features[product_features['cluster'] == cluster]
    cluster_products = list(cluster_data.index)

    avg_price = cluster_data['price'].mean()
    avg_quality = cluster_data['quality_score'].mean()
    avg_brand = cluster_data['brand_strength'].mean()
    avg_market = cluster_data['market_share'].mean()

    # Label the segment by comparing cluster means with the portfolio-wide medians
    if avg_price > product_features['price'].median() and avg_quality > product_features['quality_score'].median():
        segment = "Premium Products"
    elif avg_price < product_features['price'].median() and avg_quality < product_features['quality_score'].median():
        segment = "Budget Products"
    elif avg_quality > product_features['quality_score'].median():
        segment = "Value Products"
    else:
        segment = "Standard Products"

    print(f"\nCluster {cluster}: {segment}")
    print(f"• Products: {', '.join(cluster_products)}")
    print(f"• Average price: ${avg_price:.0f}")
    print(f"• Average quality: {avg_quality:.1f}/10")
    print(f"• Brand strength: {avg_brand:.1f}")
    print(f"• Market share: {avg_market:.1f}%")

# Illustrative ROI estimate for product portfolio optimization.
# Note: the "portfolio value" below is a rough proxy (sum of prices times sum of
# market-share values), not a per-product revenue figure.
total_revenue = product_features['price'].sum() * product_features['market_share'].sum()
portfolio_optimization = 0.15  # assumed 15% improvement in portfolio efficiency
hierarchy_roi = total_revenue * portfolio_optimization

print("\nHIERARCHICAL CLUSTERING ROI:")
print(f"• Total product portfolio value: ${total_revenue:,.0f}")
print(f"• Portfolio optimization improvement: {portfolio_optimization*100:.0f}%")
print(f"• Estimated annual ROI: ${hierarchy_roi:,.0f}")
print("• Use cases: Product categorization, pricing strategy, market positioning")

2. INTERACTIVE HIERARCHICAL VISUALIZATIONS



BUSINESS INSIGHTS FROM HIERARCHICAL CLUSTERING:

Cluster 1: Standard Products
• Products: Product_04, Product_07
• Average price: $44
• Average quality: 6.9/10
• Brand strength: 4.4
• Market share: 4.5%

Cluster 2: Premium Products
• Products: Product_01, Product_06, Product_17
• Average price: $19
• Average quality: 8.3/10
• Brand strength: 3.7
• Market share: 5.9%

Cluster 3: Budget Products
• Products: Product_02, Product_05, Product_08, Product_09, Product_10, Product_11, Product_12, Product_13, Product_14, Product_15, Product_16, Product_18, Product_19, Product_20
• Average price: $17
• Average quality: 7.6/10
• Brand strength: 3.8
• Market share: 0.8%

Cluster 4: Standard Products
• Products: Product_03
• Average price: $28
• Average quality: 1.6/10
• Brand strength: 0.1
• Market share: 1.5%

HIERARCHICAL CLUSTERING ROI:
• Total product portfolio value: $16,213
• Portfolio optimization improvement: 15%
• Estimated annual ROI: $2,432
• Use cases: Product categorization, pricing