# 📊 Online Shopping Data Analysis - HUIT Big Data

**Author:** HUIT student  
**Course:** Big Data Analytics  
**Objective:** Analyze customer behavior and build a product recommendation system

## 🎯 Analysis Objectives

1. **Exploratory Data Analysis (EDA)** - explore the data
2. **Customer Segmentation** - group customers into segments
3. **Product Analysis** - analyze products
4. **Recommendation System** - recommend products
5. **Business Insights** - business information

## 📦 Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Vietnamese font support: matplotlib tries these families in order and
# falls back to its default when none is installed
plt.rcParams['font.family'] = ['Arial Unicode MS', 'Arial', 'sans-serif']

print("📚 Libraries imported successfully!")

## 📁 Load Data

In [None]:
# Load datasets
try:
    customers_df = pd.read_csv('../data/sample/customers.csv')
    products_df = pd.read_csv('../data/sample/products.csv')
    transactions_df = pd.read_csv('../data/sample/transactions.csv')
    
    print("✅ Data loaded successfully!")
    print(f"📊 Customers: {len(customers_df):,} records")
    print(f"📦 Products: {len(products_df):,} records")
    print(f"💳 Transactions: {len(transactions_df):,} records")
    
except FileNotFoundError:
    print("❌ Data files not found!")
    print("Please run the data generation script first:")
    print("python data/sample/generate_data.py")
    
    # Generate sample data if not exists
    import sys
    sys.path.append('../')
    
    from data.sample.generate_data import DataGenerator
    
    print("🔄 Generating sample data...")
    generator = DataGenerator()
    
    customers_df = generator.generate_customers(1000)
    products_df = generator.generate_products(500)
    transactions_df = generator.generate_transactions(customers_df, products_df, 10000)
    
    # Save data
    generator.save_data(customers_df, products_df, transactions_df, '../data/sample')
    print("✅ Sample data generated and saved!")
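
The later cells rely on specific columns being present in each file. A minimal sketch of a sanity check, assuming the column names that this notebook references further down; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical helper names, not part of the project code:

```python
import pandas as pd

# Columns referenced by later cells in this notebook (assumed schema).
REQUIRED_COLUMNS = {
    "customers": {"customer_id", "age", "gender", "city", "registration_date"},
    "products": {"product_id", "product_name", "category", "price"},
    "transactions": {"transaction_id", "customer_id", "product_id", "category",
                     "quantity", "total_amount", "timestamp"},
}

def missing_columns(df: pd.DataFrame, table: str) -> set:
    """Return the required columns that are absent from df."""
    return REQUIRED_COLUMNS[table] - set(df.columns)

# Tiny stand-in frame; in the notebook you would pass products_df instead.
demo_products = pd.DataFrame(
    {"product_id": ["P1"], "product_name": ["Demo"], "price": [1000]}
)
print(missing_columns(demo_products, "products"))  # {'category'}
```

Running the check once per table right after loading would surface a schema mismatch before any chart or aggregation fails with a less obvious `KeyError`.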

## 🔍 Exploratory Data Analysis (EDA)

### Dataset Overview

In [None]:
# Dataset info
print("🔍 DATASET OVERVIEW")
print("=" * 50)

# DataFrame.info() prints its report directly and returns None,
# so it is called without print() to avoid a stray "None" line
print("\n📊 CUSTOMERS")
customers_df.info()
print("\nFirst 5 records:")
display(customers_df.head())

print("\n📦 PRODUCTS")
products_df.info()
print("\nFirst 5 records:")
display(products_df.head())

print("\n💳 TRANSACTIONS")
transactions_df.info()
print("\nFirst 5 records:")
display(transactions_df.head())

In [None]:
# Data preprocessing
transactions_df['timestamp'] = pd.to_datetime(transactions_df['timestamp'])
transactions_df['date'] = transactions_df['timestamp'].dt.date
transactions_df['hour'] = transactions_df['timestamp'].dt.hour
transactions_df['day_of_week'] = transactions_df['timestamp'].dt.day_name()
transactions_df['month'] = transactions_df['timestamp'].dt.month_name()

print("✅ Data preprocessing completed!")
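
The preprocessing cell assumes every `timestamp` value parses cleanly. A minimal sketch, on made-up values, of how `errors="coerce"` in `pd.to_datetime` can surface unparseable rows before the derived columns are built; the sample strings are illustrative and do not come from the dataset:

```python
import pandas as pd

# Made-up raw values: one valid timestamp and one that cannot be parsed.
raw = pd.Series(["2024-01-05 10:30:00", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

bad_rows = parsed.isna().sum()
print(f"Unparseable timestamps: {bad_rows}")

# Derived features work as in the notebook; NaT rows simply yield missing values.
features = pd.DataFrame({
    "hour": parsed.dt.hour,
    "day_of_week": parsed.dt.day_name(),
})
print(features)
```

Whether to drop or repair such rows depends on the data source; the sketch only shows how to count them.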

### Basic Statistics

In [None]:
# Basic statistics
print("📈 BASIC STATISTICS")
print("=" * 50)

total_revenue = transactions_df['total_amount'].sum()
avg_order_value = transactions_df['total_amount'].mean()
total_customers = customers_df['customer_id'].nunique()
total_products = products_df['product_id'].nunique()
active_customers = transactions_df['customer_id'].nunique()

stats_data = {
    'Metric': [
        'Total Revenue (VND)',
        'Average Order Value (VND)',
        'Total Customers',
        'Active Customers', 
        'Total Products',
        'Total Transactions',
        'Customer Activation Rate (%)'
    ],
    'Value': [
        f"{total_revenue:,.0f}",
        f"{avg_order_value:,.0f}",
        f"{total_customers:,}",
        f"{active_customers:,}",
        f"{total_products:,}",
        f"{len(transactions_df):,}",
        f"{(active_customers/total_customers*100):.1f}%"
    ]
}

stats_df = pd.DataFrame(stats_data)
display(stats_df)

## 📊 Data Visualization

### Sales Analysis

In [None]:
# Sales by category
category_sales = transactions_df.groupby('category').agg({
    'total_amount': 'sum',
    'quantity': 'sum',
    'transaction_id': 'count'
}).round(2)

category_sales.columns = ['Revenue', 'Quantity_Sold', 'Transactions']
category_sales = category_sales.sort_values('Revenue', ascending=False)

# Create subplots
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Revenue by Category', 'Quantity Sold by Category', 
                   'Transactions by Category', 'Category Performance'),
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"type": "bar"}, {"type": "scatter"}]]
)

# Revenue by category
fig.add_trace(
    go.Bar(x=category_sales.index, y=category_sales['Revenue'], 
           name='Revenue', marker_color='lightblue'),
    row=1, col=1
)

# Quantity by category
fig.add_trace(
    go.Bar(x=category_sales.index, y=category_sales['Quantity_Sold'], 
           name='Quantity', marker_color='lightgreen'),
    row=1, col=2
)

# Transactions by category
fig.add_trace(
    go.Bar(x=category_sales.index, y=category_sales['Transactions'], 
           name='Transactions', marker_color='lightcoral'),
    row=2, col=1
)

# Scatter plot: Revenue vs Transactions
fig.add_trace(
    go.Scatter(x=category_sales['Transactions'], y=category_sales['Revenue'],
               mode='markers+text', text=category_sales.index,
               textposition='top center', name='Performance',
               marker=dict(size=10, color='purple')),
    row=2, col=2
)

fig.update_layout(height=800, title_text="📊 Sales Analysis by Category")
fig.show()

print("\n📊 Top 5 Categories by Revenue:")
display(category_sales.head())

In [None]:
# Time series analysis
daily_sales = transactions_df.groupby('date')['total_amount'].sum().reset_index()
daily_sales['date'] = pd.to_datetime(daily_sales['date'])

# Sales trend
fig = px.line(daily_sales, x='date', y='total_amount',
              title='📈 Daily Sales Trend',
              labels={'total_amount': 'Revenue (VND)', 'date': 'Date'})

fig.update_layout(height=400)
fig.show()

# Sales by hour
hourly_sales = transactions_df.groupby('hour')['total_amount'].sum()

fig = px.bar(x=hourly_sales.index, y=hourly_sales.values,
             title='⏰ Sales by Hour of Day',
             labels={'x': 'Hour', 'y': 'Revenue (VND)'})

fig.update_layout(height=400)
fig.show()

### Customer Analysis

In [None]:
# Customer demographics
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Age Distribution', 'Gender Distribution', 
                   'City Distribution', 'Registration Trend'),
    specs=[[{"type": "histogram"}, {"type": "pie"}],
           [{"type": "bar"}, {"type": "bar"}]]
)

# Age distribution
fig.add_trace(
    go.Histogram(x=customers_df['age'], nbinsx=20, name='Age'),
    row=1, col=1
)

# Gender distribution
gender_counts = customers_df['gender'].value_counts()
fig.add_trace(
    go.Pie(labels=gender_counts.index, values=gender_counts.values, name='Gender'),
    row=1, col=2
)

# City distribution (top 10)
city_counts = customers_df['city'].value_counts().head(10)
fig.add_trace(
    go.Bar(x=city_counts.index, y=city_counts.values, name='City'),
    row=2, col=1
)

# Registration trend by month
customers_df['registration_date'] = pd.to_datetime(customers_df['registration_date'])
reg_trend = customers_df['registration_date'].dt.to_period('M').value_counts().sort_index()

fig.add_trace(
    go.Bar(x=[str(x) for x in reg_trend.index], y=reg_trend.values, name='Registrations'),
    row=2, col=2
)

fig.update_layout(height=800, title_text="👥 Customer Demographics Analysis")
fig.show()

In [None]:
# Customer behavior analysis
customer_behavior = transactions_df.groupby('customer_id').agg({
    'total_amount': ['sum', 'mean', 'count'],
    'product_id': 'nunique',
    'timestamp': ['min', 'max']
}).round(2)

# Flatten column names
customer_behavior.columns = ['total_spent', 'avg_order_value', 'num_orders', 
                           'unique_products', 'first_order', 'last_order']

# Calculate customer lifetime (days)
customer_behavior['lifetime_days'] = (
    customer_behavior['last_order'] - customer_behavior['first_order']
).dt.days

# Customer value segments
customer_behavior['value_segment'] = pd.cut(
    customer_behavior['total_spent'],
    bins=[0, 1000000, 5000000, 10000000, float('inf')],
    labels=['Low Value', 'Medium Value', 'High Value', 'VIP']
)

# Display statistics
print("💳 CUSTOMER BEHAVIOR STATISTICS")
print("=" * 50)
display(customer_behavior.describe())

# Value segment distribution
segment_dist = customer_behavior['value_segment'].value_counts()

fig = px.pie(values=segment_dist.values, names=segment_dist.index,
             title='🎯 Customer Value Segments')
fig.show()

print("\n🎯 Customer Segments:")
print(segment_dist)
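
The value segments are built with `pd.cut`, whose bins are right-closed by default, so a value exactly on a boundary falls into the lower segment and a value equal to the first edge (0) is left unassigned. A small sketch using the same bin edges and labels on made-up spend figures; `include_lowest=True` is shown as one possible adjustment, not as what the notebook does:

```python
import pandas as pd

bins = [0, 1_000_000, 5_000_000, 10_000_000, float("inf")]
labels = ["Low Value", "Medium Value", "High Value", "VIP"]

# Made-up totals chosen to sit on or next to the bin edges.
spend = pd.Series([0, 1_000_000, 1_000_001, 10_000_000, 25_000_000])

# Default behaviour: intervals are (0, 1M], (1M, 5M], (5M, 10M], (10M, inf].
default_cut = pd.cut(spend, bins=bins, labels=labels)

# include_lowest=True also assigns the value 0 to the first segment.
inclusive_cut = pd.cut(spend, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"spend": spend, "default": default_cut, "inclusive": inclusive_cut}))
```

In the aggregated `customer_behavior` table a total of exactly 0 is unlikely, but it can occur if the data contains refunds or zero-priced orders.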

## 🎯 Customer Segmentation (RFM Analysis)

RFM analysis segments customers along three dimensions:
- **Recency (R)**: time elapsed since the customer's most recent purchase
- **Frequency (F)**: how often the customer purchases
- **Monetary (M)**: how much the customer has spent

In [None]:
# RFM Analysis
from datetime import datetime

# Calculate RFM metrics
current_date = transactions_df['timestamp'].max() + pd.Timedelta(days=1)

rfm = transactions_df.groupby('customer_id').agg({
    'timestamp': lambda x: (current_date - x.max()).days,  # Recency
    'transaction_id': 'count',  # Frequency
    'total_amount': 'sum'  # Monetary
}).round(2)

rfm.columns = ['Recency', 'Frequency', 'Monetary']

# Create RFM scores (1-5 scale)
# Rank first so tied values cannot produce duplicate quantile edges in pd.qcut
rfm['R_Score'] = pd.qcut(rfm['Recency'].rank(method='first'), 5, labels=[5,4,3,2,1])  # Lower recency = higher score
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'].rank(method='first'), 5, labels=[1,2,3,4,5])

# Combine RFM scores
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
rfm['RFM_Total'] = rfm['R_Score'].astype(int) + rfm['F_Score'].astype(int) + rfm['M_Score'].astype(int)

# Define customer segments based on RFM scores
def rfm_segment(row):
    if row['RFM_Total'] >= 13:
        return 'Champions'
    elif row['RFM_Total'] >= 11:
        return 'Loyal Customers'
    elif row['RFM_Total'] >= 9:
        return 'Potential Loyalists'
    elif row['RFM_Total'] >= 7:
        return 'At Risk'
    elif row['RFM_Total'] >= 5:
        return 'Cannot Lose Them'
    else:
        return 'Lost Customers'

rfm['Segment'] = rfm.apply(rfm_segment, axis=1)

print("🎯 RFM ANALYSIS RESULTS")
print("=" * 50)
display(rfm.head(10))

# Segment distribution
# customer_id is the index of rfm, so count rows via a regular column instead
segment_summary = rfm.groupby('Segment').agg(
    Avg_Recency=('Recency', 'mean'),
    Avg_Frequency=('Frequency', 'mean'),
    Avg_Monetary=('Monetary', 'mean'),
    Total_Revenue=('Monetary', 'sum'),
    Customer_Count=('Recency', 'count')
).round(2)

print("\n📊 Segment Summary:")
display(segment_summary)
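
Quintile scoring with `pd.qcut` can fail when many customers share the same value, which is common for purchase counts. The sketch below, on made-up frequencies, shows the failure and the `rank(method='first')` workaround used for the F score; the data is illustrative only:

```python
import pandas as pd

# Made-up purchase counts with many ties at 1, as is typical for frequency.
frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 7], name="Frequency")

# Direct quantile binning fails because several quantile edges coincide.
try:
    pd.qcut(frequency, 5, labels=[1, 2, 3, 4, 5])
    raised = False
except ValueError:
    raised = True
print(f"qcut on raw values raised ValueError: {raised}")

# Ranking first gives every row a distinct position, so the edges are unique.
f_score = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(pd.DataFrame({"Frequency": frequency, "F_Score": f_score}))
```

The trade-off is that tied customers can land in different score buckets depending on row order, so scores near a bucket boundary should be read with that in mind.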

In [None]:
# Visualize RFM segments
segment_counts = rfm['Segment'].value_counts()

# Create subplots for RFM visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Customer Segments', 'RFM Score Distribution',
                   'Recency vs Frequency', 'Frequency vs Monetary'),
    specs=[[{"type": "pie"}, {"type": "histogram"}],
           [{"type": "scatter"}, {"type": "scatter"}]]
)

# Customer segments pie chart
fig.add_trace(
    go.Pie(labels=segment_counts.index, values=segment_counts.values, name='Segments'),
    row=1, col=1
)

# RFM total score distribution
fig.add_trace(
    go.Histogram(x=rfm['RFM_Total'], nbinsx=15, name='RFM Scores'),
    row=1, col=2
)

# Recency vs Frequency scatter
colors = {'Champions': 'gold', 'Loyal Customers': 'green', 'Potential Loyalists': 'blue',
          'At Risk': 'orange', 'Cannot Lose Them': 'red', 'Lost Customers': 'gray'}

for segment in rfm['Segment'].unique():
    segment_data = rfm[rfm['Segment'] == segment]
    fig.add_trace(
        go.Scatter(x=segment_data['Recency'], y=segment_data['Frequency'],
                  mode='markers', name=segment,
                  marker=dict(color=colors.get(segment, 'black'))),
        row=2, col=1
    )

# Frequency vs Monetary scatter
for segment in rfm['Segment'].unique():
    segment_data = rfm[rfm['Segment'] == segment]
    fig.add_trace(
        go.Scatter(x=segment_data['Frequency'], y=segment_data['Monetary'],
                  mode='markers', name=segment, showlegend=False,
                  marker=dict(color=colors.get(segment, 'black'))),
        row=2, col=2
    )

fig.update_layout(height=800, title_text="🎯 RFM Customer Segmentation Analysis")
fig.show()

print(f"\n📈 Total customers segmented: {len(rfm):,}")
print("\n🏆 Segment Recommendations:")
print("• Champions: Reward and retain them")
print("• Loyal Customers: Upsell premium products")
print("• Potential Loyalists: Develop loyalty programs")
print("• At Risk: Send personalized offers")
print("• Cannot Lose Them: Win-back campaigns")
print("• Lost Customers: Reactivation campaigns")

## 📦 Product Analysis

Analyze product performance and trends.

In [None]:
# Product performance analysis
product_performance = transactions_df.groupby('product_id').agg({
    'quantity': 'sum',
    'total_amount': 'sum',
    'transaction_id': 'count',
    'customer_id': 'nunique'
}).round(2)

product_performance.columns = ['Total_Quantity', 'Total_Revenue', 'Total_Orders', 'Unique_Customers']

# Merge with product information (reset_index makes product_id a regular column)
product_analysis = product_performance.reset_index().merge(products_df, on='product_id', how='left')

# Calculate additional metrics
product_analysis['Revenue_per_Unit'] = product_analysis['Total_Revenue'] / product_analysis['Total_Quantity']
product_analysis['Customer_Penetration'] = (product_analysis['Unique_Customers'] / 
                                          transactions_df['customer_id'].nunique()) * 100

# Top products by different metrics
print("🏆 TOP PERFORMING PRODUCTS")
print("=" * 50)

print("\n💰 Top 10 by Revenue:")
top_revenue = product_analysis.nlargest(10, 'Total_Revenue')[['product_name', 'category', 'Total_Revenue', 'Total_Quantity']]
display(top_revenue)

print("\n📈 Top 10 by Quantity Sold:")
top_quantity = product_analysis.nlargest(10, 'Total_Quantity')[['product_name', 'category', 'Total_Quantity', 'Total_Revenue']]
display(top_quantity)

print("\n👥 Top 10 by Customer Reach:")
top_customers = product_analysis.nlargest(10, 'Unique_Customers')[['product_name', 'category', 'Unique_Customers', 'Customer_Penetration']]
display(top_customers)

In [None]:
# Product category analysis
category_analysis = product_analysis.groupby('category').agg({
    'Total_Revenue': 'sum',
    'Total_Quantity': 'sum',
    'Total_Orders': 'sum',
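    'Unique_Customers': 'sum',  # sum of per-product counts: a customer who bought several products in a category is counted more than once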
    'product_id': 'count',
    'price': 'mean'
}).round(2)

category_analysis.columns = ['Total_Revenue', 'Total_Quantity', 'Total_Orders', 
                           'Total_Customers', 'Product_Count', 'Avg_Price']

category_analysis['Revenue_per_Product'] = category_analysis['Total_Revenue'] / category_analysis['Product_Count']
category_analysis = category_analysis.sort_values('Total_Revenue', ascending=False)

print("📊 CATEGORY PERFORMANCE ANALYSIS")
print("=" * 50)
display(category_analysis)

# Visualize category performance
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Revenue by Category', 'Average Price by Category',
                   'Products Count by Category', 'Revenue per Product'),
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"type": "bar"}, {"type": "bar"}]]
)

# Revenue by category
fig.add_trace(
    go.Bar(x=category_analysis.index, y=category_analysis['Total_Revenue'],
           name='Revenue', marker_color='skyblue'),
    row=1, col=1
)

# Average price by category
fig.add_trace(
    go.Bar(x=category_analysis.index, y=category_analysis['Avg_Price'],
           name='Avg Price', marker_color='lightgreen'),
    row=1, col=2
)

# Product count by category
fig.add_trace(
    go.Bar(x=category_analysis.index, y=category_analysis['Product_Count'],
           name='Product Count', marker_color='salmon'),
    row=2, col=1
)

# Revenue per product
fig.add_trace(
    go.Bar(x=category_analysis.index, y=category_analysis['Revenue_per_Product'],
           name='Revenue per Product', marker_color='gold'),
    row=2, col=2
)

fig.update_layout(height=800, title_text="📦 Product Category Analysis")
fig.update_xaxes(tickangle=45)
fig.show()
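
The `Total_Customers` column in the category table is the sum of each product's unique-customer count, which overstates reach when one customer buys several products in the same category. A small sketch on made-up line items contrasts that sum with a distinct count per category; the frame `tx` is hypothetical:

```python
import pandas as pd

# Made-up line items: customer C1 buys two different products in category A.
tx = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3"],
    "product_id": ["P1", "P2", "P1", "P3"],
    "category": ["A", "A", "A", "B"],
})

# Unique customers per product, then summed per category (counts C1 twice in A).
per_product = tx.groupby(["category", "product_id"])["customer_id"].nunique()
summed = per_product.groupby(level="category").sum()

# Distinct customers per category (counts C1 once in A).
distinct = tx.groupby("category")["customer_id"].nunique()

print(pd.DataFrame({"summed_per_product": summed, "distinct_customers": distinct}))
```

If true category reach is needed, a distinct count computed directly from `transactions_df` would be the safer figure.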

## 🔍 Market Basket Analysis

Find products that are frequently bought together.

In [None]:
# Market basket analysis
from itertools import combinations
from collections import Counter

# Get transactions with multiple items
basket_transactions = transactions_df.groupby('transaction_id')['product_id'].apply(list).reset_index()
basket_transactions = basket_transactions[basket_transactions['product_id'].apply(len) > 1]

print(f"📊 Transactions with multiple items: {len(basket_transactions):,}")
print(f"📊 Total transactions: {transactions_df['transaction_id'].nunique():,}")
print(f"📊 Multi-item rate: {len(basket_transactions)/transactions_df['transaction_id'].nunique()*100:.1f}%")

# Find frequent item pairs
# Sort and de-duplicate each basket so (A, B) and (B, A) count as the same pair
item_pairs = []
for items in basket_transactions['product_id']:
    pairs = list(combinations(sorted(set(items)), 2))
    item_pairs.extend(pairs)

# Count pair frequencies
pair_counts = Counter(item_pairs)
top_pairs = pair_counts.most_common(20)

# Create DataFrame for visualization
pair_df = pd.DataFrame(top_pairs, columns=['Product_Pair', 'Frequency'])
pair_df['Product_1'] = pair_df['Product_Pair'].apply(lambda x: x[0])
pair_df['Product_2'] = pair_df['Product_Pair'].apply(lambda x: x[1])

# Get product names
product_names = dict(zip(products_df['product_id'], products_df['product_name']))
pair_df['Product_1_Name'] = pair_df['Product_1'].map(product_names)
pair_df['Product_2_Name'] = pair_df['Product_2'].map(product_names)
pair_df['Pair_Name'] = pair_df['Product_1_Name'] + ' + ' + pair_df['Product_2_Name']

print("\n🛒 TOP 20 PRODUCT PAIRS (Frequently Bought Together)")
print("=" * 70)
display(pair_df[['Pair_Name', 'Frequency']].head(20))

# Visualize top pairs
fig = px.bar(pair_df.head(10), x='Frequency', y='Pair_Name',
             orientation='h', title='🛒 Top 10 Product Pairs (Market Basket)',
             labels={'Frequency': 'Co-occurrence Count', 'Pair_Name': 'Product Pairs'})
fig.update_layout(height=500, yaxis={'categoryorder': 'total ascending'})
fig.show()
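
Raw co-occurrence counts favor products that are popular overall. One common way to adjust for that is lift, the pair's support divided by the product of the individual supports. The sketch below uses made-up baskets and only the standard library; `baskets` and `lift` are hypothetical names, and lift is offered as a possible extension rather than something the notebook computes:

```python
from collections import Counter
from itertools import combinations

# Made-up baskets: each inner list is one transaction's product IDs.
baskets = [
    ["P1", "P2", "P3"],
    ["P1", "P2"],
    ["P2", "P3"],
    ["P1", "P4"],
]

n = len(baskets)

# Number of baskets containing each item, and each unordered pair.
item_counts = Counter(item for basket in baskets for item in set(basket))
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
)

def lift(a, b):
    """support(a, b) / (support(a) * support(b)); above 1 suggests a positive association."""
    pair = tuple(sorted((a, b)))
    support_ab = pair_counts[pair] / n
    return support_ab / ((item_counts[a] / n) * (item_counts[b] / n))

for a, b in sorted(pair_counts):
    print(f"{a} + {b}: count={pair_counts[(a, b)]}, lift={lift(a, b):.3f}")
```

With so few baskets the numbers are only illustrative; on real data a minimum support threshold would normally be applied before ranking pairs by lift.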

## 🎯 Recommendation System Analysis

Analysis to support building the product recommendation system.

In [None]:
# Create user-item interaction matrix
user_item_matrix = transactions_df.pivot_table(
    index='customer_id', 
    columns='product_id', 
    values='quantity', 
    aggfunc='sum', 
    fill_value=0
)

print(f"📊 User-Item Matrix Shape: {user_item_matrix.shape}")
print(f"📊 Sparsity: {(user_item_matrix == 0).sum().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]) * 100:.2f}%")

# Calculate basic recommendation metrics
# Most popular products (by number of unique customers)
product_popularity = transactions_df.groupby('product_id')['customer_id'].nunique().sort_values(ascending=False)
popular_products = product_popularity.head(20)

# Get product names for popular products
popular_products_df = popular_products.reset_index()
popular_products_df['product_name'] = popular_products_df['product_id'].map(product_names)
popular_products_df.columns = ['product_id', 'unique_customers', 'product_name']

print("\n⭐ TOP 20 MOST POPULAR PRODUCTS (by unique customers)")
print("=" * 60)
display(popular_products_df[['product_name', 'unique_customers']].head(20))

# Visualize popularity
fig = px.bar(popular_products_df.head(15), x='unique_customers', y='product_name',
             orientation='h', title='⭐ Top 15 Most Popular Products',
             labels={'unique_customers': 'Number of Unique Customers', 'product_name': 'Product'})
fig.update_layout(height=600, yaxis={'categoryorder': 'total ascending'})
fig.show()
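
A user-item matrix like the one built above is the usual input for item-based collaborative filtering. As a rough sketch of that idea, assuming cosine similarity between product columns, the example below works on a made-up 3x3 matrix; `toy_matrix` and `item_similarity` are hypothetical names, and nothing here is confirmed to be the project's actual model:

```python
import numpy as np
import pandas as pd

# Made-up purchase quantities; rows are customers, columns are products.
toy_matrix = pd.DataFrame(
    [[2, 1, 0], [1, 1, 0], [0, 0, 3]],
    index=["C1", "C2", "C3"],
    columns=["P1", "P2", "P3"],
)

def item_similarity(matrix: pd.DataFrame) -> pd.DataFrame:
    """Cosine similarity between the product columns of a user-item matrix."""
    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for products never bought
    normalized = values / norms
    sim = normalized.T @ normalized
    return pd.DataFrame(sim, index=matrix.columns, columns=matrix.columns)

sim = item_similarity(toy_matrix)

# Products most similar to P1, excluding P1 itself.
print(sim["P1"].drop("P1").sort_values(ascending=False))
```

For the full `user_item_matrix`, a dense similarity matrix grows with the square of the product count, so a sparse representation may be preferable at larger scale.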

In [None]:
# Simple content-based recommendation example
def get_similar_products_by_category(product_id, top_n=5):
    """
    Get similar products based on category and price range
    """
    columns = ['product_id', 'product_name', 'price']

    # Get target product info
    target_product = products_df[products_df['product_id'] == product_id]

    if target_product.empty:
        # Return an empty DataFrame so callers always get the same type
        return pd.DataFrame(columns=columns)

    target_category = target_product['category'].iloc[0]
    target_price = target_product['price'].iloc[0]

    # Find similar products: same category, price within 0.5x-2x of the target
    # .copy() avoids writing to a view of products_df (SettingWithCopyWarning)
    similar_products = products_df[
        (products_df['category'] == target_category) &
        (products_df['product_id'] != product_id) &
        (products_df['price'] >= target_price * 0.5) &
        (products_df['price'] <= target_price * 2)
    ].copy()

    # Sort by price similarity
    similar_products['price_diff'] = (similar_products['price'] - target_price).abs()
    similar_products = similar_products.sort_values('price_diff').head(top_n)

    return similar_products[columns]

# Example: Get recommendations for a popular product
example_product_id = popular_products_df['product_id'].iloc[0]
example_product_name = popular_products_df['product_name'].iloc[0]

print("\n🎯 CONTENT-BASED RECOMMENDATIONS")
print(f"Target Product: {example_product_name} ({example_product_id})")
print("=" * 60)

recommendations = get_similar_products_by_category(example_product_id, 10)
display(recommendations)

# Customer purchase pattern analysis
customer_categories = transactions_df.groupby('customer_id')['category'].apply(list).reset_index()
customer_categories['unique_categories'] = customer_categories['category'].apply(lambda x: len(set(x)))
customer_categories['total_purchases'] = customer_categories['category'].apply(len)

# Category diversity analysis
diversity_stats = customer_categories['unique_categories'].describe()
print("\n📊 CUSTOMER CATEGORY DIVERSITY")
print("=" * 40)
print(f"Average categories per customer: {diversity_stats['mean']:.2f}")
print(f"Max categories per customer: {diversity_stats['max']:.0f}")
print(f"Customers buying from 1 category only: {(customer_categories['unique_categories'] == 1).sum():,} ({(customer_categories['unique_categories'] == 1).mean()*100:.1f}%)")

# Visualize category diversity
fig = px.histogram(customer_categories, x='unique_categories', nbins=20,
                   title='📊 Customer Category Diversity Distribution',
                   labels={'unique_categories': 'Number of Unique Categories per Customer',
                          'count': 'Number of Customers'})
fig.show()

## 📈 Business Intelligence & Insights

A summary of the key insights for the business.

In [None]:
# Key Business Metrics Dashboard
print("📊 KEY BUSINESS METRICS & INSIGHTS")
print("=" * 60)

# 1. Revenue Insights
total_revenue = transactions_df['total_amount'].sum()
avg_order_value = transactions_df['total_amount'].mean()
median_order_value = transactions_df['total_amount'].median()

print("\n💰 REVENUE INSIGHTS")
print(f"Total Revenue: {total_revenue:,.0f} VND")
print(f"Average Order Value: {avg_order_value:,.0f} VND")
print(f"Median Order Value: {median_order_value:,.0f} VND")

# 2. Customer Insights
total_customers = customers_df['customer_id'].nunique()
active_customers = transactions_df['customer_id'].nunique()
customer_activation_rate = active_customers / total_customers * 100

avg_orders_per_customer = transactions_df.groupby('customer_id').size().mean()
avg_revenue_per_customer = customer_behavior['total_spent'].mean()

print("\n👥 CUSTOMER INSIGHTS")
print(f"Total Registered Customers: {total_customers:,}")
print(f"Active Customers: {active_customers:,}")
print(f"Customer Activation Rate: {customer_activation_rate:.1f}%")
print(f"Average Orders per Customer: {avg_orders_per_customer:.1f}")
print(f"Average Revenue per Customer: {avg_revenue_per_customer:,.0f} VND")

# 3. Product Insights
total_products = products_df['product_id'].nunique()
products_sold = transactions_df['product_id'].nunique()
product_sell_through_rate = products_sold / total_products * 100

avg_price = products_df['price'].mean()
best_category = category_analysis.index[0]
best_category_revenue = category_analysis['Total_Revenue'].iloc[0]

print("\n📦 PRODUCT INSIGHTS")
print(f"Total Products in Catalog: {total_products:,}")
print(f"Products with Sales: {products_sold:,}")
print(f"Product Sell-through Rate: {product_sell_through_rate:.1f}%")
print(f"Average Product Price: {avg_price:,.0f} VND")
print(f"Best Performing Category: {best_category} ({best_category_revenue:,.0f} VND)")

# 4. Operational Insights
# Guard against a zero-day span (all transactions on one day) to avoid division by zero
date_range = max((transactions_df['timestamp'].max() - transactions_df['timestamp'].min()).days, 1)
daily_avg_transactions = len(transactions_df) / date_range
daily_avg_revenue = total_revenue / date_range

peak_hour = transactions_df.groupby('hour').size().idxmax()
peak_day = transactions_df['day_of_week'].value_counts().index[0]

print("\n⚡ OPERATIONAL INSIGHTS")
print(f"Data Period: {date_range} days")
print(f"Daily Average Transactions: {daily_avg_transactions:.1f}")
print(f"Daily Average Revenue: {daily_avg_revenue:,.0f} VND")
print(f"Peak Sales Hour: {peak_hour}:00")
print(f"Peak Sales Day: {peak_day}")

# 5. Recommendation System Potential
repeat_customers = customer_behavior[customer_behavior['num_orders'] > 1].shape[0]
repeat_rate = repeat_customers / len(customer_behavior) * 100

cross_category_customers = customer_categories[customer_categories['unique_categories'] > 1].shape[0]
cross_sell_potential = cross_category_customers / len(customer_categories) * 100

print("\n🎯 RECOMMENDATION SYSTEM POTENTIAL")
print(f"Repeat Customers: {repeat_customers:,} ({repeat_rate:.1f}%)")
print(f"Cross-category Customers: {cross_category_customers:,} ({cross_sell_potential:.1f}%)")
print(f"Multi-item Purchase Rate: {len(basket_transactions)/len(transactions_df['transaction_id'].unique())*100:.1f}%")

In [None]:
# Final Summary Dashboard
print("\n" + "="*80)
print("🚀 HUIT BIG DATA PROJECT - EXECUTIVE SUMMARY")
print("="*80)

print("\n📋 PROJECT OVERVIEW:")
print("• E-commerce Data Analysis & Recommendation System")
print("• Big Data processing with Apache Spark")
print("• Machine Learning for product recommendations")
print("• Interactive web demo with analytics dashboard")

print("\n📊 DATA PROCESSED:")
print(f"• {len(customers_df):,} Customer records")
print(f"• {len(products_df):,} Product records")
print(f"• {len(transactions_df):,} Transaction records")
print(f"• {total_revenue:,.0f} VND total revenue analyzed")

print("\n🎯 KEY FINDINGS:")
print(f"• Customer activation rate: {customer_activation_rate:.1f}%")
print(f"• Average order value: {avg_order_value:,.0f} VND")
print(f"• Top category: {best_category}")
print(f"• Repeat purchase rate: {repeat_rate:.1f}%")
print(f"• Cross-sell potential: {cross_sell_potential:.1f}%")

print("\n🔮 RECOMMENDATION STRATEGIES:")
print("• Collaborative Filtering for similar customers")
print("• Content-based filtering for similar products")
print("• Market basket analysis for cross-selling")
print("• Customer segmentation for targeted marketing")
print("• Real-time personalization engine")

print("\n💡 BUSINESS RECOMMENDATIONS:")
print("• Focus on Champions and Loyal Customer segments")
print("• Implement win-back campaigns for At Risk customers")
print("• Optimize inventory for top-performing categories")
print("• Develop cross-selling strategies using basket analysis")
print("• Deploy recommendation system to increase AOV")

print("\n🌟 NEXT STEPS:")
print("• Deploy ML models to production")
print("• Set up real-time data streaming")
print("• A/B test recommendation algorithms")
print("• Monitor recommendation system performance")
print("• Scale system for larger datasets")

print("\n" + "="*80)
print("🎓 HUIT Big Data Analytics Project - Successfully Completed!")
print("="*80)

## 🎯 Conclusion

This notebook carried out a comprehensive analysis of online shopping data, covering:

### ✅ Analyses performed:
1. **Exploratory Data Analysis (EDA)** - basic data exploration
2. **Customer Segmentation** - segmenting customers with RFM analysis
3. **Product Analysis** - product and category performance
4. **Market Basket Analysis** - finding products frequently bought together
5. **Recommendation System Foundation** - groundwork for a recommendation system
6. **Business Intelligence** - insights to support business decisions

### 🚀 Practical applications:
- Optimizing marketing strategy
- Improving the customer experience
- Increasing sales
- Managing inventory efficiently
- Developing new products

### 📚 Big Data techniques applied:
- **Apache Spark** for large-scale data processing
- **Machine Learning** for recommendation systems
- **Data Visualization** for business intelligence
- **Statistical Analysis** for customer insights

---

**Author:** HUIT student  
**Course:** Big Data Analytics  
**University:** Đại học Công nghiệp TP.HCM