# 🛒 India Household Structure & E-commerce Analysis
## Product Discovery for Quick-Commerce Expansion

**Objective:** Analyze whether household composition correlates with online purchasing behavior across Indian states/UTs, and translate findings into actionable product decisions.

**Analysis Type:** Product Discovery (Not Academic Research)

**Target Audience:** Product Leadership & Strategy Team

---

### 📋 Key Research Questions

1. Do regions with smaller household sizes show higher online purchase adoption?
2. How does category-wise purchasing differ between single-heavy and family-heavy regions?
3. Does internet availability moderate the household structure effect?

### 🧪 Hypotheses

- **H1:** Smaller household sizes → Higher online purchase likelihood
- **H2:** Family-heavy regions over-index on essentials/bulk; Single-heavy on convenience
- **H3:** Household structure matters primarily when internet access is present

### ⚠️ Key Assumption

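The **Bachelor vs Family** distinction is approximated using **household size proxies** (households of one or two members are labelled Single/Small, larger ones Family); it is not directly measured demographic data.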

---

**Date:** January 2026  
**Data Source:** MoSPI HCES 2022-23 (Simulated for demonstration)  
**Analyst:** Product Team

## 1. Setup and Data Loading

In [None]:
# Import necessary libraries
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.append('../src')

# Data processing
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency, pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully")
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# Generate sample dataset (mimicking HCES 2022-23 structure)
from data_collection import create_sample_dataset

# Create sample data
df_raw = create_sample_dataset()

# Save for future reference (create the data directory if it does not exist yet)
os.makedirs('../data', exist_ok=True)
df_raw.to_csv('../data/sample_hces_data.csv', index=False)

print(f"\n📊 Dataset Overview:")
print(f"   Total households: {len(df_raw):,}")
print(f"   States covered: {df_raw['State'].nunique()}")
print(f"   Date range: HCES 2022-23 (simulated)")
print(f"\n   Columns: {list(df_raw.columns)}")

df_raw.head(10)

In [None]:
# Data quality check
print("🔍 Data Quality Check\n")
print("Missing Values:")
print(df_raw.isnull().sum())
print(f"\nDuplicates: {df_raw.duplicated().sum()}")

# Create household size buckets
def create_hh_bucket(size):
    if size == 1:
        return '1 (Single-person)'
    elif size in [2, 3]:
        return '2-3 (Small)'
    elif size in [4, 5]:
        return '4-5 (Medium)'
    else:
        return '6+ (Large)'

df = df_raw.copy()
df['HH_Size_Bucket'] = df['Household_Size'].apply(create_hh_bucket)
df['HH_Type'] = df['Household_Size'].apply(lambda x: 'Single/Small' if x <= 2 else 'Family')
df['Sector'] = df['Urban'].apply(lambda x: 'Urban' if x == 1 else 'Rural')

print(f"\n✅ Created derived features:")
print(f"   - HH_Size_Bucket: {df['HH_Size_Bucket'].nunique()} categories")
print(f"   - HH_Type: {df['HH_Type'].nunique()} categories")
print(f"   - Sector: {df['Sector'].nunique()} categories")

# Summary statistics
print(f"\n📈 Summary Statistics:")
print(df[['Household_Size', 'Internet_Access', 'Online_Purchase']].describe())

In [None]:
# Household size distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Overall household size distribution
axes[0, 0].hist(df['Household_Size'], bins=range(1, 11), edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Household Size')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Household Sizes')
axes[0, 0].axvline(df['Household_Size'].mean(), color='red', linestyle='--', label=f'Mean: {df["Household_Size"].mean():.1f}')
axes[0, 0].legend()

# 2. Household size bucket distribution (sorted so buckets run from smallest to largest)
hh_bucket_counts = df['HH_Size_Bucket'].value_counts().sort_index()
axes[0, 1].bar(range(len(hh_bucket_counts)), hh_bucket_counts.values, color='steelblue')
axes[0, 1].set_xticks(range(len(hh_bucket_counts)))
axes[0, 1].set_xticklabels(hh_bucket_counts.index, rotation=45, ha='right')
axes[0, 1].set_ylabel('Number of Households')
axes[0, 1].set_title('Household Size Buckets')

# 3. Urban vs Rural household size
urban_rural = df.groupby(['Sector', 'HH_Size_Bucket']).size().unstack()
urban_rural.plot(kind='bar', ax=axes[1, 0], stacked=False)
axes[1, 0].set_title('Household Size Distribution: Urban vs Rural')
axes[1, 0].set_xlabel('Sector')
axes[1, 0].set_ylabel('Count')
axes[1, 0].legend(title='HH Size', bbox_to_anchor=(1.05, 1))

# 4. State-wise average household size (top 15)
state_avg_hh = df.groupby('State')['Household_Size'].mean().sort_values(ascending=False).head(15)
axes[1, 1].barh(range(len(state_avg_hh)), state_avg_hh.values, color='coral')
axes[1, 1].set_yticks(range(len(state_avg_hh)))
axes[1, 1].set_yticklabels(state_avg_hh.index)
axes[1, 1].set_xlabel('Average Household Size')
axes[1, 1].set_title('Top 15 States by Avg Household Size')
axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

print(f"\n📊 Household Structure Summary:")
print(f"   National average HH size: {df['Household_Size'].mean():.2f}")
print(f"   Single-person HH: {(df['Household_Size'] == 1).sum():,} ({(df['Household_Size'] == 1).mean():.1%})")
print(f"   Small HH (2-3): {df['HH_Size_Bucket'].str.contains('Small').sum():,}")
print(f"   Large HH (6+): {df['HH_Size_Bucket'].str.contains('Large').sum():,}")

In [None]:
# H1 Results
print("🧪 HYPOTHESIS 1: Household Size vs Online Adoption\n")
print(f"Correlation: {analysis_results['h1']['correlation']:.4f}")
print(f"P-value: {analysis_results['h1']['correlation_p_value']:.4f}")
print(f"Chi-square: {analysis_results['h1']['chi_square']:.2f} (p={analysis_results['h1']['chi_square_p_value']:.4f})")
print(f"\n{analysis_results['h1']['conclusion']}")

print("\n" + "="*80)

# H2 Results
print("\n🧪 HYPOTHESIS 2: Category Preferences by Household Type\n")
if 'error' not in analysis_results['h2']:
    print("Category Skew Indices (>1.0 = over-indexing):")
    for cat, skew in analysis_results['h2']['category_skew_index'].items():
        print(f"\n{cat}:")
        for hh_type, value in skew.items():
            indicator = "📈" if value > 1.1 else "📉" if value < 0.9 else "➡️"
            print(f"  {indicator} {hh_type}: {value:.2f}x")
    print(f"\n{analysis_results['h2']['conclusion']}")
else:
    print(analysis_results['h2']['error'])

print("\n" + "="*80)

# H3 Results
print("\n🧪 HYPOTHESIS 3: Internet Access Moderation\n")
if 'error' not in analysis_results['h3']:
    print(f"Correlation WITH internet: {analysis_results['h3']['correlation_with_internet']:.4f}")
    print(f"Correlation WITHOUT internet: {analysis_results['h3']['correlation_without_internet']:.4f}")
    print(f"\n{analysis_results['h3']['conclusion']}")
else:
    print(analysis_results['h3']['error'])
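
The two H3 correlations printed above compare the household-size effect among households with and without internet access. A minimal sketch of such a stratified comparison follows, on toy data. The `correlation_by_internet` helper and the toy data-generating rule are assumptions for illustration and may differ from what the analysis module actually does; the column names and result keys mirror those used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy data only: hypothetical households, not HCES records.
rng = np.random.default_rng(7)
n = 4_000
toy = pd.DataFrame({
    "Household_Size": rng.integers(1, 9, size=n),   # sizes 1..8
    "Internet_Access": rng.integers(0, 2, size=n),  # 0 = no access, 1 = access
})
# Assumed rule for the toy data: size matters only when the household is online.
p_online = np.clip(0.60 - 0.06 * toy["Household_Size"], 0.05, 0.95)
p_buy = np.where(toy["Internet_Access"] == 1, p_online, 0.03)
toy["Online_Purchase"] = (rng.random(n) < p_buy).astype(int)


def correlation_by_internet(df: pd.DataFrame) -> dict:
    """Size-vs-purchase correlation computed separately per internet stratum."""
    out = {}
    strata = [(1, "correlation_with_internet"), (0, "correlation_without_internet")]
    for flag, label in strata:
        sub = df[df["Internet_Access"] == flag]
        r, _ = pearsonr(sub["Household_Size"], sub["Online_Purchase"])
        out[label] = float(r)
    return out


h3 = correlation_by_internet(toy)
print(h3)
```

Comparing subgroup correlations is a descriptive check. A more formal test of the conditional effect would fit a model with a `Household_Size` × `Internet_Access` interaction term.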

In [None]:
# Generate product memo
print("📝 Generating Product Memo...")

memo_writer = ProductMemoWriter(insights, expansion_strategy, merchandising_matrix, features)
memo = memo_writer.write_memo('../outputs/product_memo.md')

print("\n✅ Deliverables generated:")
print("   1. Product Memo: outputs/product_memo.md")
print("   2. Product Decision Slide: outputs/product_decision_slide.md")
print("   3. Visualizations: visualizations/*.html")

print("\n\n" + "="*80)
print(" 🎉 ANALYSIS COMPLETE!")
print("="*80)
print("\nNext Steps:")
print("   1. Review product memo with stakeholders")
print("   2. Present decision slide in leadership meeting")
print("   3. Prioritize top 3 experiments")
print("   4. Validate insights with actual customer data")

## 7. Generate Deliverables

In [None]:
# Feature prioritization
print("\n\n🚀 PRODUCT FEATURE PRIORITIZATION:")
print("="*80)

features = insight_gen.generate_feature_prioritization()

for i, feature in enumerate(features, 1):
    print(f"\n{i}. {feature['feature']} [Priority: {feature['priority_score']}/10]")
    print(f"   Description: {feature['description']}")
    print(f"   Target: {feature['target_segment']}")
    print(f"   Impact: {feature['expected_impact']}")
    print(f"   Effort: {feature['effort']}")

In [None]:
# Merchandising matrix
print("\n\n📦 MERCHANDISING STRATEGY:")
print("="*80)

merchandising_matrix = insight_gen.generate_merchandising_matrix()

if not merchandising_matrix.empty:
    print("\nCategory Stocking Recommendations by Neighborhood Type:\n")
    print(merchandising_matrix.to_string(index=False))
else:
    print("Category data not available for merchandising recommendations")

In [None]:
# Expansion strategy
print("\n\n🗺️ MARKET EXPANSION STRATEGY:")
print("="*80)

expansion_strategy = insight_gen.generate_expansion_strategy()

print("\n🥇 TIER 1 - Expand Aggressively:")
print(f"   States: {', '.join(expansion_strategy['tier_1_states'][:8])}")
print(f"   {expansion_strategy['rationale']['tier_1']}")

print("\n🥈 TIER 2 - Selective Pilots:")
print(f"   States: {', '.join(expansion_strategy['tier_2_states'][:8])}")
print(f"   {expansion_strategy['rationale']['tier_2']}")

print("\n🥉 TIER 3 - Monitor Only:")
print(f"   States: {', '.join(expansion_strategy['tier_3_states'][:5])}")
print(f"   {expansion_strategy['rationale']['tier_3']}")

In [None]:
# Generate product insights
from product_insights import ProductInsightsGenerator, ProductMemoWriter

print("💡 Generating Product Insights...")
print("="*80)

insight_gen = ProductInsightsGenerator(analysis_results, df)

# Generate all insights
insights = insight_gen.generate_all_insights()

print(f"\n🎯 TOP {len(insights)} ACTIONABLE INSIGHTS:\n")
for i, insight in enumerate(insights, 1):
    print(f"\n{'='*80}")
    print(f"INSIGHT #{i} [{insight['priority']} Priority]")
    print(f"{'='*80}")
    print(f"\n📊 Finding:")
    print(f"   {insight['insight']}")
    print(f"\n🎯 Product Implication:")
    print(f"   {insight['implication']}")
    print(f"\n📈 Metric Impact:")
    print(f"   {insight['metric_impact']}")
    print(f"\n✅ Product Action:")
    print(f"   {insight['product_action']}")

## 6. Product Insights & Recommendations

In [None]:
# 4. Category visualizations (if H2 data available)
if 'category_skew_index' in analysis_results['h2']:
    print("🎯 Creating category skew visualizations...")
    cat_viz = CategorySkewVisualizer()
    
    # Heatmap
    category_heatmap = cat_viz.create_category_heatmap(analysis_results['h2']['category_skew_index'])
    category_heatmap.show()
    
    # Comparison bars
    if 'category_penetration' in analysis_results['h2']:
        category_bars = cat_viz.create_category_comparison_bars(analysis_results['h2']['category_penetration'])
        category_bars.show()

# 5. Executive Summary Dashboard
print("\n📋 Creating executive summary dashboard...")
exec_summary = create_executive_summary_viz(analysis_results)
exec_summary.show()

In [None]:
# Create comprehensive visualizations
from visualization import (
    IndiaMapVisualizer, 
    HouseholdAdoptionVisualizer,
    CategorySkewVisualizer,
    create_executive_summary_viz
)

# 1. India Penetration Map
print("🗺️ Creating India penetration map...")
map_viz = IndiaMapVisualizer()
india_map = map_viz.create_penetration_map(analysis_results['state_penetration'])
india_map.show()

# 2. Household Size vs Adoption
print("\n📊 Creating household size adoption chart...")
hh_viz = HouseholdAdoptionVisualizer()
hh_adoption = hh_viz.create_adoption_by_size_chart(analysis_results['household_size_penetration'])
hh_adoption.show()

# 3. Scatter with trendline
print("\n📈 Creating scatter plot with trendline...")
scatter_plot = hh_viz.create_scatter_with_trendline(df)
scatter_plot.show()

## 5. Visualizations & Dashboard

### 4.2 Hypothesis Testing Results

**H1:** Smaller household sizes correlate with higher online purchase likelihood  
**H2:** Category preferences differ by household type  
**H3:** Internet access moderates the household structure effect
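
For H2, one common way to express "over-indexing" is a skew index: a category's share of purchases within a household type divided by its share across all households, so values above 1.0 indicate over-indexing. The sketch below assumes that definition. The `Primary_Category` column, the category labels, and the `category_skew_index` helper are hypothetical; the project's analysis module may define the index differently.

```python
import pandas as pd

# Toy purchases: one row per buying household.
# `Primary_Category` is a hypothetical column used only for this example.
toy = pd.DataFrame({
    "HH_Type": ["Single/Small"] * 4 + ["Family"] * 6,
    "Primary_Category": [
        # Single/Small households
        "Ready-to-eat", "Ready-to-eat", "Ready-to-eat", "Staples",
        # Family households
        "Staples", "Staples", "Staples", "Staples", "Ready-to-eat", "Staples",
    ],
})


def category_skew_index(
    df: pd.DataFrame,
    segment_col: str = "HH_Type",
    category_col: str = "Primary_Category",
) -> pd.DataFrame:
    """Share of each category within a segment divided by its overall share."""
    # Rows: categories; columns: segments; values: share of that segment's purchases.
    within = pd.crosstab(df[category_col], df[segment_col], normalize="columns")
    # Share of each category across all households.
    overall = df[category_col].value_counts(normalize=True)
    # Divide each category row by its overall share; > 1.0 means over-indexing.
    return within.div(overall, axis=0)


skew = category_skew_index(toy)
print(skew.round(2))
```

In this toy example, Single/Small households over-index on the ready-to-eat category and Family households over-index on staples, which is the pattern H2 predicts.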

In [None]:
# Display key penetration metrics
print("📊 ONLINE PURCHASE PENETRATION METRICS\n")

print("1. Overall Penetration:")
print(analysis_results['overall_penetration'])

print("\n2. Top 10 States by Penetration:")
print(analysis_results['state_penetration'].nlargest(10, 'Penetration_%'))

print("\n3. Penetration by Household Size:")
print(analysis_results['household_size_penetration'])

print("\n4. Urban vs Rural:")
print(analysis_results['urban_rural_penetration'])

print("\n5. Internet Access Impact:")
print(analysis_results['internet_penetration'])
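
The penetration tables printed above can be understood as the share of households in each group with `Online_Purchase == 1`. A minimal sketch of that calculation follows. The `penetration_by` helper, the `Households` and `Purchasers` columns, and the toy rows are assumptions for illustration; only `State`, `Online_Purchase`, and the `Penetration_%` column name appear in this notebook.

```python
import pandas as pd

# Toy data only: two hypothetical states with four households each.
toy = pd.DataFrame({
    "State": ["State A"] * 4 + ["State B"] * 4,
    "Online_Purchase": [1, 1, 0, 1, 0, 0, 1, 0],
})


def penetration_by(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Percentage of households in each group that purchased online."""
    out = (
        df.groupby(group_col)["Online_Purchase"]
        .agg(Households="count", Purchasers="sum")
        .reset_index()
    )
    out["Penetration_%"] = 100 * out["Purchasers"] / out["Households"]
    # Highest-penetration groups first.
    return out.sort_values("Penetration_%", ascending=False, ignore_index=True)


state_pen = penetration_by(toy, "State")
print(state_pen)
```

The same helper could be called with `HH_Size_Bucket`, `Sector`, or `Internet_Access` as `group_col` to produce the other breakdowns.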

### 4.1 Online Purchase Penetration Results

In [None]:
# Run comprehensive analysis using our analysis module
from analysis import run_full_analysis

print("🔬 Running Full Analysis Pipeline...")
print("="*80)

analysis_results = run_full_analysis(df)

print("\n✅ Analysis complete! Results stored in 'analysis_results' dictionary")

## 4. Core Analysis: Run All Hypothesis Tests

## 3. Exploratory Data Analysis - Household Structure

## 2. Data Cleaning and Preprocessing