# Week 12: Financial Analysis and Reporting
## Python Session 1: Descriptive Statistics

### Learning Objectives
- Perform descriptive statistical analysis on financial data
- Understand central tendency and spread measures
- Analyze distribution patterns and identify outliers
- Apply statistical analysis to Nigerian e-commerce data

### Business Context
Today we analyze the financial performance of our Nigerian e-commerce marketplace using statistical methods to understand customer behavior, payment patterns, and regional performance differences.

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set styling for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# For displaying all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")

### Loading Financial Data from Database

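In production, this data would come from our Supabase database. This notebook uses a small in-memory sample with the same structure so that the analysis runs without a database connection.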
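
As a rough sketch of what a query-to-DataFrame load could look like, the block below uses an in-memory SQLite database as a stand-in, since no connection details are given here. The table name `payments` and its three rows are hypothetical. Supabase is backed by PostgreSQL, so in practice `conn` would come from a PostgreSQL driver or SQLAlchemy engine set up with the project's credentials, which this sketch does not attempt.

```python
import sqlite3
import pandas as pd

# Stand-in for the real database: an in-memory SQLite table.
# The table name `payments` and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (customer_state TEXT, payment_type TEXT, payment_value REAL)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("SP", "credit_card", 136.39),
        ("SP", "boleto", 145.50),
        ("RJ", "credit_card", 158.08),
    ],
)
conn.commit()

# Pull only the columns the analysis needs into a DataFrame.
query = "SELECT customer_state, payment_type, payment_value FROM payments"
df_loaded = pd.read_sql_query(query, conn)
conn.close()

print(df_loaded.dtypes)
print(df_loaded)
```

Selecting named columns rather than `SELECT *` keeps the DataFrame small and makes the notebook less sensitive to schema changes.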

In [None]:
# Note: In actual implementation, connect to Supabase using the appropriate method
# For this notebook, we'll use the data structure from our live queries

# Sample data structure based on our database queries
data = {
    'customer_state': ['SP', 'SP', 'SP', 'RJ', 'RJ', 'MG', 'MG', 'RS', 'BA'],
    'payment_value': [136.39, 145.50, 88.73, 158.08, 105.58, 154.12, 104.37, 155.45, 169.76],
    'payment_type': ['credit_card', 'boleto', 'credit_card', 'credit_card', 'credit_card', 
                    'boleto', 'credit_card', 'credit_card', 'boleto'],
    'order_purchase_timestamp': pd.date_range('2018-01-01', periods=9, freq='D'),
    'customer_id': [f'customer_{i:04d}' for i in range(1, 10)]
}

df_financial = pd.DataFrame(data)
print("Sample financial data:")
print(df_financial.head())
print(f"\nDataset shape: {df_financial.shape}")

## 2. Central Tendency Measures

Central tendency measures help us understand the "typical" or "average" values in our financial data.

### Mean (Average)

The mean is the sum of all values divided by the count of values. It's sensitive to outliers.
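
To make the outlier sensitivity concrete, the sketch below recomputes the mean of the nine sample payments after adding one large order. The ₦5,000 bulk order is a hypothetical value chosen only for illustration.

```python
import pandas as pd

payments = pd.Series(
    [136.39, 145.50, 88.73, 158.08, 105.58, 154.12, 104.37, 155.45, 169.76]
)

# Mean by definition: sum of the values divided by their count.
manual_mean = payments.sum() / payments.count()

# Append one hypothetical bulk order and recompute.
with_bulk_order = pd.concat([payments, pd.Series([5000.00])], ignore_index=True)

print(f"Mean from definition:    {manual_mean:.2f}")
print(f"Mean from pandas:        {payments.mean():.2f}")
print(f"Mean with bulk order:    {with_bulk_order.mean():.2f}")
```

A single extreme order moves the mean far away from what any typical customer paid, which is why the mean should be read alongside the median.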

In [None]:
# Calculate mean payment values
mean_payment = df_financial['payment_value'].mean()
print(f"Mean payment value: ₦{mean_payment:.2f}")

# Calculate mean by payment type
mean_by_payment_type = df_financial.groupby('payment_type')['payment_value'].mean()
print("\nMean payment by payment type:")
for payment_type, mean_val in mean_by_payment_type.items():
    print(f"{payment_type}: ₦{mean_val:.2f}")

# Calculate mean by state
mean_by_state = df_financial.groupby('customer_state')['payment_value'].mean().sort_values(ascending=False)
print("\nMean payment by state (ranked):")
for state, mean_val in mean_by_state.items():
    print(f"{state}: ₦{mean_val:.2f}")

### Median

The median is the middle value when data is sorted. It's less sensitive to outliers than the mean.
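
The sketch below computes the median by hand for an odd and an even number of observations, using the nine sample payments plus one hypothetical ₦5,000 bulk order, and compares how far the median moves with how far the mean moves.

```python
import pandas as pd

payments = pd.Series(
    [136.39, 145.50, 88.73, 158.08, 105.58, 154.12, 104.37, 155.45, 169.76]
)

# Odd count: the median is the single middle value of the sorted data.
sorted_vals = payments.sort_values().reset_index(drop=True)
middle = sorted_vals.iloc[len(sorted_vals) // 2]

# Even count (after adding a hypothetical bulk order):
# the median is the average of the two middle values.
with_bulk_order = pd.concat([payments, pd.Series([5000.00])], ignore_index=True)
sorted_bulk = with_bulk_order.sort_values().reset_index(drop=True)
n = len(sorted_bulk)
middle_pair_avg = (sorted_bulk.iloc[n // 2 - 1] + sorted_bulk.iloc[n // 2]) / 2

median_shift = middle_pair_avg - middle
mean_shift = with_bulk_order.mean() - payments.mean()

print(f"Median before: {middle:.2f}, after: {middle_pair_avg:.2f}")
print(f"Median shift:  {median_shift:.2f}")
print(f"Mean shift:    {mean_shift:.2f}")
```

The median depends only on the order of the values, so one extreme order shifts it by at most one position in the sorted list.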

In [None]:
# Calculate median payment values
median_payment = df_financial['payment_value'].median()
print(f"Median payment value: ₦{median_payment:.2f}")

# Calculate median by payment type
median_by_payment_type = df_financial.groupby('payment_type')['payment_value'].median()
print("\nMedian payment by payment type:")
for payment_type, median_val in median_by_payment_type.items():
    print(f"{payment_type}: ₦{median_val:.2f}")

# Compare mean vs median
print("\nMean vs Median Comparison:")
print(f"Overall Mean: ₦{mean_payment:.2f}")
print(f"Overall Median: ₦{median_payment:.2f}")
print(f"Difference: ₦{abs(mean_payment - median_payment):.2f}")

if mean_payment > median_payment:
    print("→ Mean > Median: Suggests right-skewed distribution (some high-value orders)")
elif mean_payment < median_payment:
    print("→ Mean < Median: Suggests left-skewed distribution (some low-value orders)")
else:
    print("→ Mean = Median: Suggests symmetric distribution")

### Mode

The mode is the most frequently occurring value in the dataset.
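
A dataset can have more than one mode. The sketch below uses a small hypothetical list of state codes with a tie to show that `Series.mode()` returns every tied value in sorted order, so taking element `[0]` picks the alphabetically first one, not a uniquely most common one.

```python
import pandas as pd

# Hypothetical data with a tie: SP and RJ both appear twice.
states = pd.Series(["SP", "SP", "RJ", "RJ", "MG"])

modes = states.mode()            # all tied values, sorted
counts = states.value_counts()   # frequency of each value
tied = sorted(counts[counts == counts.max()].index)

print(f"All modes: {list(modes)}")
print(f"First mode only: {modes[0]}")
print(f"Values sharing the top count: {tied}")
```

Checking `len(modes)` before reporting a single "most common" value avoids presenting a tie as a clear winner.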

In [None]:
# Calculate mode for categorical variables
mode_state = df_financial['customer_state'].mode()[0]
mode_payment_type = df_financial['payment_type'].mode()[0]

print(f"Most common customer state: {mode_state}")
print(f"Most common payment type: {mode_payment_type}")

# Calculate frequency counts
print("\nState frequency:")
print(df_financial['customer_state'].value_counts())

print("\nPayment type frequency:")
print(df_financial['payment_type'].value_counts())

## 3. Measures of Spread

Measures of spread tell us how much variability exists in our financial data.

### Standard Deviation and Variance

Variance averages the squared deviations from the mean. Standard deviation is its square root, which brings the measure back to the same units as the payments (₦).
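
One detail worth checking is which divisor a library uses. The sketch below computes the variance of the sample payments by hand with both divisors and compares them with the library defaults: pandas divides by n - 1 (sample variance), while NumPy divides by n (population variance) unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

payments = pd.Series(
    [136.39, 145.50, 88.73, 158.08, 105.58, 154.12, 104.37, 155.45, 169.76]
)
n = len(payments)
squared_deviations = (payments - payments.mean()) ** 2

sample_var = squared_deviations.sum() / (n - 1)   # divides by n - 1
population_var = squared_deviations.sum() / n     # divides by n

print(f"Sample variance (manual):     {sample_var:.2f}")
print(f"pandas .var() default:        {payments.var():.2f}")
print(f"Population variance (manual): {population_var:.2f}")
print(f"NumPy np.var() default:       {np.var(payments.to_numpy()):.2f}")
print(f"Sample std (sqrt of var):     {np.sqrt(sample_var):.2f}")
```

With only nine observations the two versions differ noticeably, so it helps to state which one a report uses.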

In [None]:
# Calculate spread measures
std_payment = df_financial['payment_value'].std()
var_payment = df_financial['payment_value'].var()
range_payment = df_financial['payment_value'].max() - df_financial['payment_value'].min()

print(f"Standard deviation of payments: ₦{std_payment:.2f}")
print(f"Variance of payments: {var_payment:.2f} (₦²)")
print(f"Range of payments: ₦{range_payment:.2f}")
print(f"Coefficient of variation: {(std_payment/mean_payment)*100:.1f}%")

# Calculate spread by payment type
print("\nSpread measures by payment type:")
spread_by_type = df_financial.groupby('payment_type')['payment_value'].agg([
    'mean', 'median', 'std', 'min', 'max', 'count'
]).round(2)
print(spread_by_type)

# Business interpretation
print("\nBusiness Interpretation:")
for payment_type in spread_by_type.index:
    cv = (spread_by_type.loc[payment_type, 'std'] / spread_by_type.loc[payment_type, 'mean']) * 100
    if cv > 50:
        variability = "high variability"
    elif cv > 25:
        variability = "moderate variability"
    else:
        variability = "low variability"
    print(f"{payment_type}: {variability} (CV = {cv:.1f}%)")

### Percentiles and Quartiles

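The p-th percentile is the value below which p% of the observations fall. Quartiles are the 25th, 50th, and 75th percentiles, and the interquartile range (IQR) is the distance between the 25th and 75th.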
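
The sketch below reproduces NumPy's default percentile method (linear interpolation between the two nearest sorted values) by hand on the sample payments. The helper name `percentile_linear` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

payments = pd.Series(
    [136.39, 145.50, 88.73, 158.08, 105.58, 154.12, 104.37, 155.45, 169.76]
)
sorted_vals = np.sort(payments.to_numpy())
n = len(sorted_vals)


def percentile_linear(p):
    """p-th percentile via linear interpolation between neighbouring sorted values."""
    position = (p / 100) * (n - 1)      # fractional index into the sorted data
    lower = int(np.floor(position))
    upper = min(lower + 1, n - 1)
    fraction = position - lower
    return sorted_vals[lower] + fraction * (sorted_vals[upper] - sorted_vals[lower])


for p in [25, 50, 75, 90, 99]:
    manual = percentile_linear(p)
    library = np.percentile(payments, p)
    print(f"{p}th percentile: manual={manual:.2f}, numpy={library:.2f}")
```

With nine observations, the 90th and 99th percentiles are interpolated between the two largest payments, so they say little about the true upper tail until more data is available.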

In [None]:
# Calculate percentiles
percentiles = [25, 50, 75, 90, 95, 99]
payment_percentiles = np.percentile(df_financial['payment_value'], percentiles)

print("Payment Value Percentiles:")
for pct, val in zip(percentiles, payment_percentiles):
    print(f"{pct}th percentile: ₦{val:.2f}")

# Calculate Interquartile Range (IQR)
Q1 = np.percentile(df_financial['payment_value'], 25)
Q3 = np.percentile(df_financial['payment_value'], 75)
IQR = Q3 - Q1

print(f"\nInterquartile Range (IQR): ₦{IQR:.2f}")
print(f"First Quartile (Q1): ₦{Q1:.2f}")
print(f"Third Quartile (Q3): ₦{Q3:.2f}")

# Calculate percentiles by payment type
print("\nPayment percentiles by payment type:")
for payment_type in df_financial['payment_type'].unique():
    subset = df_financial[df_financial['payment_type'] == payment_type]['payment_value']
    p25, p50, p75 = np.percentile(subset, [25, 50, 75])
    print(f"{payment_type}: 25th=₦{p25:.2f}, 50th=₦{p50:.2f}, 75th=₦{p75:.2f}")

## 4. Distribution Analysis

Understanding the shape of our data distribution helps identify patterns and anomalies.
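
Shape is usually summarized with skewness and kurtosis. The sketch below computes both from central moments on the sample payments and applies the small-sample bias corrections that, to my understanding, pandas uses in `Series.skew()` and `Series.kurtosis()`; the comparison at the end checks that assumption rather than taking it for granted.

```python
import numpy as np
import pandas as pd

payments = pd.Series(
    [136.39, 145.50, 88.73, 158.08, 105.58, 154.12, 104.37, 155.45, 169.76]
)
n = len(payments)
dev = payments - payments.mean()

# Central moments (dividing by n).
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()
m4 = (dev ** 4).mean()

# Uncorrected moment ratios.
g1 = m3 / m2 ** 1.5          # skewness
g2 = m4 / m2 ** 2 - 3        # excess kurtosis (normal distribution = 0)

# Small-sample bias corrections.
skew_corrected = g1 * np.sqrt(n * (n - 1)) / (n - 2)
kurt_corrected = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(f"Skewness: manual={skew_corrected:.4f}, pandas={payments.skew():.4f}")
print(f"Excess kurtosis: manual={kurt_corrected:.4f}, pandas={payments.kurtosis():.4f}")
```

A negative skewness means the longer tail is on the low side, which matches a mean that sits below the median.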

In [None]:
# Create visualizations for distribution analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Financial Data Distribution Analysis', fontsize=16, fontweight='bold')

# 1. Histogram of payment values
axes[0, 0].hist(df_financial['payment_value'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].axvline(mean_payment, color='red', linestyle='--', linewidth=2, label=f'Mean: ₦{mean_payment:.2f}')
axes[0, 0].axvline(median_payment, color='green', linestyle='--', linewidth=2, label=f'Median: ₦{median_payment:.2f}')
axes[0, 0].set_title('Distribution of Payment Values')
axes[0, 0].set_xlabel('Payment Value (₦)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Box plot by payment type
payment_types = df_financial['payment_type'].unique()
payment_data = [df_financial[df_financial['payment_type'] == pt]['payment_value'] for pt in payment_types]
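box_plot = axes[0, 1].boxplot(payment_data, patch_artist=True)
# Set tick labels separately: boxplot's `labels` argument is deprecated in Matplotlib 3.9+
axes[0, 1].set_xticklabels(payment_types)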
colors = ['lightblue', 'lightgreen', 'lightcoral']
for patch, color in zip(box_plot['boxes'], colors[:len(payment_types)]):
    patch.set_facecolor(color)
axes[0, 1].set_title('Payment Distribution by Type')
axes[0, 1].set_ylabel('Payment Value (₦)')
axes[0, 1].grid(True, alpha=0.3)

# 3. Bar chart of average payments by state
state_avg = df_financial.groupby('customer_state')['payment_value'].mean().sort_values(ascending=False)
bars = axes[1, 0].bar(state_avg.index, state_avg.values, color='orange', alpha=0.7)
axes[1, 0].set_title('Average Payment by State')
axes[1, 0].set_xlabel('State')
axes[1, 0].set_ylabel('Average Payment (₦)')
axes[1, 0].tick_params(axis='x', rotation=45)
# Add value labels on bars
for bar, value in zip(bars, state_avg.values):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                    f'₦{value:.0f}', ha='center', va='bottom')
axes[1, 0].grid(True, alpha=0.3)

# 4. Pie chart of payment types
payment_counts = df_financial['payment_type'].value_counts()
colors_pie = ['gold', 'lightcoral', 'lightskyblue']
axes[1, 1].pie(payment_counts.values, labels=payment_counts.index, autopct='%1.1f%%', 
               colors=colors_pie[:len(payment_counts)], startangle=90)
axes[1, 1].set_title('Payment Type Distribution')

plt.tight_layout()
plt.show()

# Print distribution summary
print("\n=== DISTRIBUTION SUMMARY ===")
skewness = df_financial['payment_value'].skew()
excess_kurtosis = df_financial['payment_value'].kurtosis()  # pandas reports excess kurtosis (normal = 0)
print(f"Skewness: {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")

if skewness > 0.5:
    print("→ Distribution is right-skewed (long tail to the right)")
elif skewness < -0.5:
    print("→ Distribution is left-skewed (long tail to the left)")
else:
    print("→ Distribution is approximately symmetric")

## 5. Outlier Detection

Identifying outliers is crucial for financial analysis as they can indicate fraud, data entry errors, or exceptional transactions.
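
The two common rules behave very differently on small samples. The sketch below replaces the largest sample payment with a hypothetical ₦100,000 transaction and applies both the IQR rule and a z-score rule with threshold 3. The z-scores use the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`.

```python
import numpy as np

# Sample payments with one hypothetical extreme transaction (100000.00).
values = np.array(
    [136.39, 145.50, 88.73, 158.08, 105.58, 154.12, 104.37, 155.45, 100000.00]
)
n = len(values)

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 standard deviations from the mean.
z = np.abs((values - values.mean()) / values.std())  # np.std defaults to ddof=0
z_flags = z > 3

# With a population standard deviation, no |z| can exceed sqrt(n - 1).
max_possible_z = np.sqrt(n - 1)

print(f"IQR rule flags:     {int(iqr_flags.sum())}")
print(f"Z-score rule flags: {int(z_flags.sum())}")
print(f"Largest |z|: {z.max():.3f} (cap for n={n}: {max_possible_z:.3f})")
```

The extreme value inflates the mean and standard deviation it is measured against, so with nine rows the z-score rule cannot reach 3 however large the transaction is. The IQR rule is based on quartiles and is not affected in the same way.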

In [None]:
# Outlier detection using IQR method
def detect_outliers_iqr(data):
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    return outliers, lower_bound, upper_bound

# Detect outliers in payment values
outliers, lower_bound, upper_bound = detect_outliers_iqr(df_financial['payment_value'])

print(f"Lower bound for outliers: ₦{lower_bound:.2f}")
print(f"Upper bound for outliers: ₦{upper_bound:.2f}")
print(f"Number of outliers detected: {len(outliers)}")

if len(outliers) > 0:
    print("\nOutlier details:")
    print(outliers.describe())
    
    # Show outliers with context, selecting rows by the outliers' own index
    print("\nOutlier transactions:")
    print(df_financial.loc[outliers.index, ['customer_state', 'payment_value', 'payment_type']])
else:
    print("No outliers detected using IQR method.")

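# Outlier detection using Z-score method
# Note: scipy.stats.zscore uses the population standard deviation by default, and in
# that case no |z| can exceed sqrt(n - 1). For this 9-row sample the cap is about 2.83,
# so threshold=3 cannot flag anything here; it becomes useful on the full dataset.
def detect_outliers_zscore(data, threshold=3):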
    z_scores = np.abs(stats.zscore(data))
    outliers = data[z_scores > threshold]
    return outliers

zscore_outliers = detect_outliers_zscore(df_financial['payment_value'])
print(f"\nZ-score outliers (threshold=3): {len(zscore_outliers)}")

# Business impact assessment
print("\n=== BUSINESS IMPACT ASSESSMENT ===")
total_revenue = df_financial['payment_value'].sum()
outlier_revenue = outliers.sum() if len(outliers) > 0 else 0
outlier_percentage = (outlier_revenue / total_revenue) * 100

print(f"Total revenue: ₦{total_revenue:,.2f}")
print(f"Outlier revenue: ₦{outlier_revenue:,.2f}")
print(f"Outlier revenue percentage: {outlier_percentage:.2f}%")

if outlier_percentage > 5:
    print("→ High impact: Review outliers for potential fraud or special handling")
elif outlier_percentage > 1:
    print("→ Moderate impact: Monitor outliers and investigate patterns")
else:
    print("→ Low impact: Outliers have minimal effect on overall revenue")

## 6. Business Insights and Recommendations

Let's translate our statistical analysis into actionable business insights.
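
Before reading group-level figures as insights, it helps to check how many transactions sit behind each one. The sketch below, using the same sample rows, shows that a state with a single transaction has an undefined (NaN) standard deviation, and filters the summary to states with a minimum number of transactions. The cutoff `MIN_TRANSACTIONS = 3` is a hypothetical choice for this tiny sample, not a recommended value.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_state": ["SP", "SP", "SP", "RJ", "RJ", "MG", "MG", "RS", "BA"],
    "payment_value": [136.39, 145.50, 88.73, 158.08, 105.58,
                      154.12, 104.37, 155.45, 169.76],
})

summary = df.groupby("customer_state")["payment_value"].agg(["count", "mean", "std"])
summary["cv_pct"] = summary["std"] / summary["mean"] * 100

# Hypothetical minimum group size before a state's figures are interpreted.
MIN_TRANSACTIONS = 3
reliable = summary[summary["count"] >= MIN_TRANSACTIONS]

print(summary.round(2))
print(f"\nStates with at least {MIN_TRANSACTIONS} transactions: {list(reliable.index)}")
```

States that fall below the cutoff are better reported as "insufficient data" than ranked next to states with many orders.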

In [None]:
# Comprehensive statistical summary
print("=== FINANCIAL PERFORMANCE STATISTICAL SUMMARY ===")
print()

# Overall statistics
print("OVERALL PAYMENT STATISTICS:")
print(f"  • Total Revenue: ₦{df_financial['payment_value'].sum():,.2f}")
print(f"  • Average Order Value: ₦{mean_payment:.2f}")
print(f"  • Median Order Value: ₦{median_payment:.2f}")
print(f"  • Standard Deviation: ₦{std_payment:.2f}")
print(f"  • Payment Range: ₦{df_financial['payment_value'].min():.2f} - ₦{df_financial['payment_value'].max():.2f}")
print(f"  • Coefficient of Variation: {(std_payment/mean_payment)*100:.1f}%")
print()

# Payment type analysis
print("PAYMENT TYPE ANALYSIS:")
payment_analysis = df_financial.groupby('payment_type').agg({
    'payment_value': ['count', 'sum', 'mean', 'std'],
    'customer_id': 'nunique'
}).round(2)
payment_analysis.columns = ['Transactions', 'Revenue', 'Avg_Order', 'Std_Dev', 'Unique_Customers']
print(payment_analysis)
print()

# Regional analysis
print("REGIONAL PERFORMANCE:")
regional_analysis = df_financial.groupby('customer_state').agg({
    'payment_value': ['count', 'sum', 'mean', 'std'],
    'customer_id': 'nunique'
}).round(2)
regional_analysis.columns = ['Transactions', 'Revenue', 'Avg_Order', 'Std_Dev', 'Unique_Customers']
regional_analysis = regional_analysis.sort_values('Revenue', ascending=False)
print(regional_analysis)
print()

# Business recommendations
print("=== BUSINESS RECOMMENDATIONS ===")
print()

# Payment optimization recommendations
print("1. PAYMENT OPTIMIZATION:")
best_payment_type = payment_analysis['Revenue'].idxmax()
best_avg_order = payment_analysis['Avg_Order'].idxmax()
print(f"   • Focus on {best_payment_type} for maximum revenue generation")
print(f"   • {best_avg_order} has highest average order value - explore upselling opportunities")

# Regional strategy recommendations
print("\n2. REGIONAL STRATEGY:")
top_state = regional_analysis.index[0]
highest_avg_state = regional_analysis['Avg_Order'].idxmax()
print(f"   • {top_state} generates highest revenue - maintain strong presence")
print(f"   • {highest_avg_state} has highest average order value - target premium customers")

# Risk assessment
print("\n3. RISK ASSESSMENT:")
# States with a single transaction have an undefined (NaN) standard deviation and are excluded
cv_by_state = (regional_analysis['Std_Dev'] / regional_analysis['Avg_Order']).dropna()
high_cv_states = cv_by_state[cv_by_state > 0.5]
if len(high_cv_states) > 0:
    print(f"   • High variability in: {list(high_cv_states.index)} - review order mix and pricing in these states")
else:
    print("   • No state with enough data shows high variability (CV > 50%)")

# Performance metrics
print("\n4. PERFORMANCE MONITORING:")
print("   • Monitor order value distribution monthly")
print("   • Track payment type adoption rates")
print("   • Set alerts for outlier transactions")
print("   • Benchmark regional performance against national averages")

## 7. Exercise for Practice

### Exercise: Nigerian State Performance Analysis

Using the statistical techniques learned today, analyze the performance of different Nigerian states and answer the following business questions:

1. Which state shows the most consistent order values (lowest coefficient of variation)?
2. Are there any states with payment patterns that suggest premium customer segments?
3. What payment strategy would you recommend for each state based on their statistical profiles?

**Deliverable:** Create a brief statistical report with actionable recommendations for regional marketing strategies.

In [None]:
# Exercise workspace
# Add your code here to complete the exercise

# Hint: Use the techniques learned in this notebook
# 1. Calculate coefficient of variation for each state
# 2. Identify states with high average order values and low variance
# 3. Create recommendations based on statistical profiles

print("Exercise workspace - Start your analysis here!")

---

## Summary

In this session, we've covered:

- ✅ **Central Tendency Measures**: Mean, median, and mode for understanding typical financial values
- ✅ **Spread Measures**: Standard deviation, variance, and percentiles for understanding variability
- ✅ **Distribution Analysis**: Visual and statistical analysis of payment value distributions
- ✅ **Outlier Detection**: Methods for identifying unusual transactions that may require investigation
- ✅ **Business Applications**: Translating statistical insights into actionable business strategies

### Key Takeaways for Nigerian E-commerce:

1. **Regional Performance**: Different Nigerian states show distinct payment patterns
2. **Payment Type Preferences**: Understanding payment method distributions is crucial for optimization
3. **Variability Analysis**: High variability in some regions may indicate market opportunities or risks
4. **Outlier Monitoring**: Unusual transactions should be investigated for fraud prevention or customer service

### Next Session Preview:

In our next session, we'll explore **Correlation Analysis** to understand relationships between different business metrics and identify key drivers of financial performance.