# Tier 1: Descriptive Statistics

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT

---

> **Citation:**
> Brandon Deloatch, "Tier 1: Descriptive Statistics," Quipu Research Labs, LLC, v1.3, 2025-10-02.

---

*This notebook is provided "as-is" for educational and research purposes. Users assume full responsibility for any results or applications derived from it.*

---

## Comprehensive Descriptive Statistics and Exploratory Analytics

**Learning Objectives:**
- Master fundamental descriptive statistics calculations
- Understand measures of central tendency and variability
- Learn distribution characteristics and data summarization
- Apply statistical measures to real-world datasets

**Cross-References:**
- **Foundation For:** `Tier1_Distribution.ipynb` (distribution analysis)
- **Complements:** `Tier1_Correlation.ipynb` (relationship analysis)
- **Advanced:** `Tier3_Statistics.ipynb` (advanced statistical methods)

**Key Applications:**
- Data quality assessment and validation
- Business intelligence and reporting
- Research data summarization
- Exploratory data analysis foundations
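
As a quick illustration of why both central tendency and variability deserve more than one measure, the sketch below compares the mean with the median, and the standard deviation with the IQR, on a small sample containing one extreme value. The `values` series is hypothetical toy data, not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data: five typical values plus one extreme value
values = pd.Series([10, 12, 11, 13, 12, 100])

mean_val = values.mean()      # pulled toward the extreme value
median_val = values.median()  # middle of the sorted data, barely affected
std_val = values.std()        # sample standard deviation (pandas uses ddof=1)
iqr_val = values.quantile(0.75) - values.quantile(0.25)  # spread of the middle 50%

print(f"mean={mean_val:.2f}, median={median_val:.2f}")
print(f"std={std_val:.2f}, IQR={iqr_val:.2f}")
```

The median and IQR stay close to the bulk of the data, while the mean and standard deviation are dominated by the single large value. Reporting both pairs side by side is a cheap way to spot skew or outliers.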

In [4]:
# Tier 1: Descriptive & Exploratory Analytics
# ============================================
# Professional implementation with comprehensive visualizations and synthetic data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Tier 1: Descriptive & Exploratory Analytics")
print("=" * 50)
print("CROSS-REFERENCES:")
print("• Prerequisites: None (Entry point to analytics suite)")
print("• Next Steps: Tier1_Pivot.ipynb (data aggregation)")
print("• Next Steps: Tier1_Scatter.ipynb (relationship analysis)")
print("• Feeds Into: ALL subsequent notebooks (foundational concepts)")
print("• Compare With: Advanced statistical analysis in Tier3_Statistics.ipynb")
print("• Full Guide: See CROSS_REFERENCE_GUIDE.md for complete learning paths")
print("=" * 50)
print("Purpose: Understand data structure, distributions, and relationships")
print("Techniques: Summary stats, distributions, correlations, pivot analysis")
print("Output: Comprehensive data profiling with professional visualizations")
print()

def generate_synthetic_dataset(n_samples=1000, seed=42):
    """Generate realistic synthetic business dataset for analysis."""
    np.random.seed(seed)

    # Generate correlated business metrics
    base_sales = np.random.normal(50000, 15000, n_samples)
    base_sales = np.maximum(base_sales, 10000)  # Minimum sales

    # Create related variables with realistic relationships
    data = {
        'sales_amount': base_sales,
        'marketing_spend': base_sales * 0.05 + np.random.normal(0, 500, n_samples),
        'customer_satisfaction': 7 + (base_sales - 50000) / 20000 + np.random.normal(0, 0.8, n_samples),
        'product_price': 100 + (base_sales - 50000) / 1000 + np.random.normal(0, 10, n_samples),
        'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples, p=[0.3, 0.25, 0.25, 0.2]),
        'product_category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], n_samples, p=[0.4, 0.3, 0.2, 0.1]),
        'quarter': np.random.choice(['Q1', 'Q2', 'Q3', 'Q4'], n_samples),
        'employee_count': np.random.poisson(15, n_samples) + 5,
        'years_in_business': np.random.exponential(5, n_samples),
    }

    # Ensure realistic ranges
    data['customer_satisfaction'] = np.clip(data['customer_satisfaction'], 1, 10)
    data['marketing_spend'] = np.maximum(data['marketing_spend'], 500)
    data['product_price'] = np.maximum(data['product_price'], 20)
    data['years_in_business'] = np.clip(data['years_in_business'], 0.5, 20)

    # Add some categorical derived variables
    data['business_size'] = pd.cut(data['employee_count'],
                                   bins=[0, 10, 25, 50, np.inf],
                                   labels=['Small', 'Medium', 'Large', 'Enterprise'])

    # bins=3 creates three equal-width intervals over the observed sales range
    data['performance_tier'] = pd.cut(data['sales_amount'],
                                      bins=3,
                                      labels=['Low', 'Medium', 'High'])

    return pd.DataFrame(data)

# Generate and load data
print("Generating synthetic business dataset...")
df = generate_synthetic_dataset(1000)
print(f"Generated dataset with {len(df)} records and {len(df.columns)} variables")
print()

# Display basic information
print("Dataset Overview:")
print("-" * 20)
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage().sum() / 1024:.1f} KB")
print()
print("Data Types:")
print(df.dtypes)
print()
print("First 5 rows:")
print(df.head())

Tier 1: Descriptive & Exploratory Analytics
CROSS-REFERENCES:
• Prerequisites: None (Entry point to analytics suite)
• Next Steps: Tier1_Pivot.ipynb (data aggregation)
• Next Steps: Tier1_Scatter.ipynb (relationship analysis)
• Feeds Into: ALL subsequent notebooks (foundational concepts)
• Compare With: Advanced statistical analysis in Tier3_Statistics.ipynb
• Full Guide: See CROSS_REFERENCE_GUIDE.md for complete learning paths
Purpose: Understand data structure, distributions, and relationships
Techniques: Summary stats, distributions, correlations, pivot analysis
Output: Comprehensive data profiling with professional visualizations

Generating synthetic business dataset...
Generated dataset with 1000 records and 11 variables

Dataset Overview:
--------------------
Shape: (1000, 11)
Memory usage: 72.7 KB

Data Types:
sales_amount              float64
marketing_spend           float64
customer_satisfaction     float64
product_price             float64
region                     object
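
The `performance_tier` column above is built with `pd.cut(..., bins=3)`, which splits the *value range* into three equal-width intervals, so the tiers need not contain equal numbers of records. The sketch below contrasts this with `pd.qcut`, which splits the *observations* into equal-sized groups. The seed and the `sales` series are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # hypothetical seed, unrelated to the notebook's dataset
sales = pd.Series(rng.normal(50_000, 15_000, 1_000))

labels = ['Low', 'Medium', 'High']
equal_width = pd.cut(sales, bins=3, labels=labels)  # thirds of the value range
equal_count = pd.qcut(sales, q=3, labels=labels)    # thirds of the observations

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

For roughly bell-shaped data, equal-width binning puts most records in the middle tier, while quantile binning gives each tier about a third of the records. Which one is appropriate depends on whether the tiers should reflect absolute sales levels or relative rank.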


In [5]:
# 1. COMPREHENSIVE SUMMARY STATISTICS
# ====================================

print("COMPREHENSIVE SUMMARY STATISTICS")
print("=" * 40)

# Basic descriptive statistics
print("\n1.1 Numerical Variables Summary:")
print("-" * 35)
numeric_cols = df.select_dtypes(include=[np.number]).columns
summary_stats = df[numeric_cols].describe()
print(summary_stats.round(2))

# Additional statistics
print("\n1.2 Additional Statistical Measures:")
print("-" * 38)
additional_stats = pd.DataFrame({
    'Skewness': df[numeric_cols].skew(),
    'Kurtosis': df[numeric_cols].kurtosis(),
    'Variance': df[numeric_cols].var(),
    'Std_Dev': df[numeric_cols].std(),
    'Range': df[numeric_cols].max() - df[numeric_cols].min(),
    'IQR': df[numeric_cols].quantile(0.75) - df[numeric_cols].quantile(0.25),
    'Missing_Count': df[numeric_cols].isnull().sum(),
    'Missing_Pct': (df[numeric_cols].isnull().sum() / len(df) * 100).round(2)
})
print(additional_stats.round(3))

# Categorical variables summary
print("\n1.3 Categorical Variables Summary:")
print("-" * 37)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    print(f"\n{col.upper()}:")
    value_counts = df[col].value_counts()
    percentages = (df[col].value_counts(normalize=True) * 100).round(2)
    summary_cat = pd.DataFrame({
        'Count': value_counts,
        'Percentage': percentages
    })
    print(summary_cat)

print("\n Summary statistics complete!")

COMPREHENSIVE SUMMARY STATISTICS

1.1 Numerical Variables Summary:
-----------------------------------
       sales_amount  marketing_spend  customer_satisfaction  product_price  \
count       1000.00          1000.00                1000.00        1000.00   
mean       50299.05          2551.75                   7.02         100.11   
std        14660.76           866.45                   1.09          17.78   
min        10000.00           500.00                   4.09          50.95   
25%        40286.15          2001.05                   6.25          87.21   
50%        50379.51          2531.53                   7.00          99.51   
75%        59719.16          3095.02                   7.76         112.18   
max       107790.97          6264.34                  10.00         162.30   

       employee_count  years_in_business  
count         1000.00            1000.00  
mean            19.97               4.97  
std              3.79               4.48  
min              9.00 
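
When reading the `Skewness` and `Kurtosis` columns, it helps to know what pandas computes: `Series.skew()` and `Series.kurtosis()` return bias-adjusted estimates, and the kurtosis is *excess* kurtosis, so a normal distribution scores near 0 rather than 3. The sketch below reproduces both values from the central moments. The sample is hypothetical toy data, and the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # hypothetical seed, not the notebook's data
x = pd.Series(rng.exponential(5, 1_000))  # right-skewed toy sample

n = len(x)
dev = x - x.mean()
m2 = (dev**2).mean()  # second central moment (divides by n)
m3 = (dev**3).mean()  # third central moment
m4 = (dev**4).mean()  # fourth central moment

g1 = m3 / m2**1.5     # uncorrected skewness
g2 = m4 / m2**2 - 3   # uncorrected excess kurtosis

# Small-sample adjustments that match the pandas estimators
skew_adj = np.sqrt(n * (n - 1)) / (n - 2) * g1
kurt_adj = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(np.isclose(skew_adj, x.skew()), np.isclose(kurt_adj, x.kurtosis()))
```

For samples of this size the adjustment is small, but it explains minor differences from tools that report the uncorrected moments.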

In [8]:
# 2. DISTRIBUTION ANALYSIS WITH PROFESSIONAL VISUALIZATIONS
# ==========================================================

print("DISTRIBUTION ANALYSIS")
print("=" * 25)

# Create subplots for multiple distribution visualizations
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[
        'Sales Amount Distribution', 'Customer Satisfaction Distribution',
        'Marketing Spend Distribution', 'Product Price Distribution',
        'Employee Count Distribution', 'Years in Business Distribution'
    ],
    specs=[[{"secondary_y": True}, {"secondary_y": True}],
           [{"secondary_y": True}, {"secondary_y": True}],
           [{"secondary_y": True}, {"secondary_y": True}]]
)

# Variables to analyze
dist_vars = ['sales_amount', 'customer_satisfaction', 'marketing_spend',
             'product_price', 'employee_count', 'years_in_business']

positions = [(1,1), (1,2), (2,1), (2,2), (3,1), (3,2)]

for var, (row, col) in zip(dist_vars, positions):
    # Histogram
    fig.add_trace(
        go.Histogram(x=df[var], name=f'{var} Histogram',
                    opacity=0.7, nbinsx=30,
                    marker_color='lightblue'),
        row=row, col=col
    )

    # Add mean line
    mean_val = df[var].mean()
    fig.add_vline(x=mean_val, line_dash="dash", line_color="red",
                 annotation_text=f"Mean: {mean_val:.1f}",
                 row=row, col=col)

fig.update_layout(height=900, title_text="Distribution Analysis Dashboard",
                  showlegend=False)
fig.show()

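# Statistical normality tests
print("\n2.1 Distribution Shape Analysis:")
print("-" * 32)
from scipy import stats

normality_results = []
for var in dist_vars:
    values = df[var].dropna()
    if len(values) < 5000:
        # Shapiro-Wilk: SciPy notes the p-value may be inaccurate for N > 5000
        stat, p_value = stats.shapiro(values)
        test_name = "Shapiro-Wilk"
    else:
        # Kolmogorov-Smirnov against a normal with the sample's mean and std.
        # Plain 'norm' would compare against N(0, 1) and reject any unscaled data.
        # Estimating the parameters from the data makes the p-value approximate.
        stat, p_value = stats.kstest(values, 'norm',
                                     args=(values.mean(), values.std()))
        test_name = "Kolmogorov-Smirnov"

    # p > 0.05 means normality is not rejected; it does not prove normality
    is_normal = "Yes" if p_value > 0.05 else "No"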
    normality_results.append({
        'Variable': var,
        'Test': test_name,
        'Statistic': round(stat, 4),
        'P-Value': round(p_value, 4),
        'Normal?': is_normal,
        'Skewness': round(df[var].skew(), 3),
        'Kurtosis': round(df[var].kurtosis(), 3)
    })

normality_df = pd.DataFrame(normality_results)
print(normality_df.to_string(index=False))

print("\n Distribution analysis complete!")

DISTRIBUTION ANALYSIS



2.1 Distribution Shape Analysis:
--------------------------------
             Variable         Test  Statistic  P-Value Normal?  Skewness  Kurtosis
         sales_amount Shapiro-Wilk     0.9981   0.3337     Yes     0.133     0.025
customer_satisfaction Shapiro-Wilk     0.9975   0.1368     Yes     0.063    -0.239
      marketing_spend Shapiro-Wilk     0.9949   0.0018      No     0.139     0.162
        product_price Shapiro-Wilk     0.9969   0.0487      No     0.150    -0.229
       employee_count Shapiro-Wilk     0.9893   0.0000      No     0.224    -0.052
    years_in_business Shapiro-Wilk     0.8541   0.0000      No     1.361     1.461

 Distribution analysis complete!
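
One detail of the Kolmogorov-Smirnov branch is worth a closer look: `stats.kstest(x, 'norm')` with no extra arguments tests against the *standard* normal N(0, 1), so data on any other scale is rejected regardless of its shape. The sketch below shows the difference on toy data that is normal by construction. The seed and sample are assumptions for illustration, and fitting the mean and standard deviation from the same sample makes the second p-value only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # hypothetical seed, not the notebook's data
x = rng.normal(50_000, 15_000, 6_000)  # normal data, but far from N(0, 1)

# Default 'norm' is the standard normal, so this rejects decisively
_, p_standard = stats.kstest(x, 'norm')

# Passing loc and scale compares against a normal fitted to the sample
_, p_fitted = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

print(f"p (standard normal): {p_standard:.4g}")
print(f"p (fitted normal):   {p_fitted:.4g}")
```

Formal tests also become very sensitive at large sample sizes, so a small p-value is best read together with the skewness, kurtosis, and the histogram rather than on its own.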


In [9]:
# 3. CORRELATION ANALYSIS AND HEATMAPS
# ====================================

print("CORRELATION ANALYSIS")
print("=" * 22)

# Calculate correlation matrices
numeric_data = df[numeric_cols]
correlation_matrix = numeric_data.corr()

# Create interactive correlation heatmap
fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    colorscale='RdBu',
    zmid=0,
    text=correlation_matrix.round(3).values,
    texttemplate="%{text}",
    textfont={"size": 10},
    hoverongaps=False
))

fig.update_layout(
    title="Correlation Matrix Heatmap",
    width=800, height=600,
    xaxis_title="Variables",
    yaxis_title="Variables"
)
fig.show()

# Correlation strength analysis
print("\n3.1 Strong Correlations (|r| > 0.5):")
print("-" * 35)
strong_corrs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.5:
            strong_corrs.append({
                'Variable_1': correlation_matrix.columns[i],
                'Variable_2': correlation_matrix.columns[j],
                'Correlation': round(corr_val, 3),
                'Strength': 'Strong' if abs(corr_val) > 0.7 else 'Moderate'
            })

if strong_corrs:
    strong_corr_df = pd.DataFrame(strong_corrs)
    strong_corr_df = strong_corr_df.sort_values('Correlation', key=abs, ascending=False)
    print(strong_corr_df.to_string(index=False))
else:
    print("No strong correlations found (|r| > 0.5)")

# Pairplot for key relationships
print("\n3.2 Creating pairplot for key variables...")
key_vars = ['sales_amount', 'marketing_spend', 'customer_satisfaction', 'product_price']

# Build the pairplot grid with Plotly subplots (row variable on y, column variable on x)
fig_pair = make_subplots(rows=len(key_vars), cols=len(key_vars),
                         subplot_titles=[f"{x} vs {y}" for x in key_vars for y in key_vars])

for i, var1 in enumerate(key_vars):
    for j, var2 in enumerate(key_vars):
        if i == j:
            # Diagonal: histogram
            fig_pair.add_trace(
                go.Histogram(x=df[var1], name=var1, showlegend=False),
                row=i+1, col=j+1
            )
        else:
            # Off-diagonal: scatter plot
            fig_pair.add_trace(
                go.Scatter(x=df[var2], y=df[var1], mode='markers',
                          name=f'{var1} vs {var2}', showlegend=False,
                          marker=dict(size=4, opacity=0.6)),
                row=i+1, col=j+1
            )

fig_pair.update_layout(height=800, title_text="Pairplot of Key Variables")
fig_pair.show()

print("\n Correlation analysis complete!")

CORRELATION ANALYSIS



3.1 Strong Correlations (|r| > 0.5):
-----------------------------------
           Variable_1            Variable_2  Correlation Strength
         sales_amount       marketing_spend        0.820   Strong
         sales_amount         product_price        0.816   Strong
         sales_amount customer_satisfaction        0.691 Moderate
      marketing_spend         product_price        0.651 Moderate
customer_satisfaction         product_price        0.573 Moderate
      marketing_spend customer_satisfaction        0.563 Moderate

3.2 Creating pairplot for key variables...



 Correlation analysis complete!
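
The nested loop that collects correlated pairs can also be written without explicit indexing. The sketch below masks everything except the strict upper triangle of the correlation matrix, so each pair appears once, and then reshapes the result into a long table. The `toy` frame and its columns are hypothetical, and the 0.5 threshold simply mirrors the one used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # hypothetical seed, not the notebook's data
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.3, size=500),  # strongly related to a
    'c': rng.normal(size=500),                 # unrelated noise
})

corr = toy.corr()
# Keep only entries strictly above the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().reset_index()
pairs.columns = ['Variable_1', 'Variable_2', 'Correlation']

strong = pairs[pairs['Correlation'].abs() > 0.5]
print(strong.to_string(index=False))
```

Note that `DataFrame.corr()` defaults to the Pearson coefficient, which measures linear association only; passing `method='spearman'` is an option when a monotonic but non-linear relationship is suspected.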


In [10]:
# 4. PIVOT TABLES AND CROSS-TABULATION ANALYSIS
# ==============================================

print("PIVOT TABLES & CROSS-TABULATION")
print("=" * 35)

# 4.1 Sales analysis by region and category
print("\n4.1 Sales Analysis by Region and Product Category:")
print("-" * 50)
pivot_sales = df.pivot_table(
    values=['sales_amount', 'marketing_spend', 'customer_satisfaction'],
    index='region',
    columns='product_category',
    aggfunc={'sales_amount': 'mean', 'marketing_spend': 'sum', 'customer_satisfaction': 'mean'}
)

print("Average Sales by Region and Category:")
print(pivot_sales['sales_amount'].round(0))
print("\nTotal Marketing Spend by Region and Category:")
print(pivot_sales['marketing_spend'].round(0))

# Interactive pivot visualization
fig = px.sunburst(
    df,
    path=['region', 'product_category'],
    values='sales_amount',
    title="Sales Distribution: Region → Product Category"
)
fig.show()

# 4.2 Cross-tabulation analysis
print("\n4.2 Cross-Tabulation Analysis:")
print("-" * 31)

# Business size vs Performance tier
crosstab = pd.crosstab(df['business_size'], df['performance_tier'],
                       margins=True, normalize='index')
print("Business Size vs Performance Tier (Row Percentages):")
print(crosstab.round(3))

# Visualization of cross-tab
fig = px.density_heatmap(
    df, x='business_size', y='performance_tier',
    title="Business Size vs Performance Tier Distribution"
)
fig.show()

# 4.3 Aggregated metrics by categories
print("\n4.3 Key Metrics by Categories:")
print("-" * 32)

# Group by region
region_summary = df.groupby('region').agg({
    'sales_amount': ['count', 'mean', 'median', 'std'],
    'customer_satisfaction': 'mean',
    'marketing_spend': 'sum',
    'employee_count': 'mean'
}).round(2)

region_summary.columns = ['_'.join(col).strip() for col in region_summary.columns]
print("Regional Summary:")
print(region_summary)

# Group by product category
category_summary = df.groupby('product_category').agg({
    'sales_amount': ['count', 'mean', 'sum'],
    'product_price': 'mean',
    'customer_satisfaction': 'mean'
}).round(2)

category_summary.columns = ['_'.join(col).strip() for col in category_summary.columns]
print("\nProduct Category Summary:")
print(category_summary)

print("\n Pivot analysis complete!")

PIVOT TABLES & CROSS-TABULATION

4.1 Sales Analysis by Region and Product Category:
--------------------------------------------------
Average Sales by Region and Category:
product_category    Books  Clothing  Electronics     Home
region                                                   
East              48042.0   53472.0      48758.0  48510.0
North             49705.0   51515.0      49309.0  51230.0
South             52930.0   52301.0      51800.0  52986.0
West              44333.0   48549.0      47542.0  50793.0

Total Marketing Spend by Region and Category:
product_category    Books  Clothing  Electronics      Home
region                                                    
East              60875.0  192115.0     233719.0  116439.0
North             70437.0  228237.0     318527.0  147857.0
South             79435.0  236144.0     282411.0   79763.0
West              53799.0  121286.0     210715.0  119993.0



4.2 Cross-Tabulation Analysis:
-------------------------------
Business Size vs Performance Tier (Row Percentages):
performance_tier    Low  Medium   High
business_size                         
Small             0.400   0.400  0.200
Medium            0.312   0.638  0.050
Large             0.321   0.643  0.036
All               0.313   0.637  0.050



4.3 Key Metrics by Categories:
--------------------------------
Regional Summary:
        sales_amount_count  sales_amount_mean  sales_amount_median  \
region                                                               
East                   239           49993.26             50528.95   
North                  304           50371.97             50408.07   
South                  252           52247.31             51627.34   
West                   205           48152.51             48278.95   

        sales_amount_std  customer_satisfaction_mean  marketing_spend_sum  \
region                                                                      
East            14478.41                        6.96            603147.97   
North           15067.46                        7.03            765058.48   
South           14301.18                        7.09            677752.31   
West            14481.60                        6.98            505792.59   

        employee_count_mean  
reg