# User Behavior Analysis

Analyze user behavior patterns from activity logs.

## User Data Source

**Description**: Contains basic user profile information

### Schema

| Column | Type | Description |
|--------|------|-------------|
| user_id | Integer | Unique user identifier (1-100) |
| age | Integer | User age (18-80 years) |
| signup_date | DateTime | Account creation date (2023-01-01 onwards) |
| country | String | User's country (USA, UK, CN, JP, DE) |
| premium | Boolean | Premium subscription status |

### Statistics

- **Total Records**: 100 users
- **Date Range**: 2023-01-01 to 2023-04-10
- **Countries**: 5 different countries
- **Premium Users**: ~50% of dataset
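
The schema constraints above can be checked before any downstream step runs. The following is a minimal sketch, not part of the pipeline: `sample_users`, `ALLOWED_COUNTRIES`, and `check_user_schema` are hypothetical names, and the three rows are made-up stand-ins for `users.csv`.

```python
import pandas as pd

# Hypothetical sample rows shaped like the documented schema (not real data).
sample_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 41, 67],
    "signup_date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "country": ["USA", "JP", "DE"],
    "premium": [True, False, True],
})

# Country codes listed in the schema table.
ALLOWED_COUNTRIES = ["USA", "UK", "CN", "JP", "DE"]


def check_user_schema(df):
    """Return a list of human-readable schema violations (empty if none)."""
    expected = ["user_id", "age", "signup_date", "country", "premium"]
    missing = [col for col in expected if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if not df["user_id"].is_unique:
        problems.append("user_id is not unique")
    if not df["age"].between(18, 80).all():
        problems.append("age outside 18-80")
    if (df["signup_date"] < pd.Timestamp("2023-01-01")).any():
        problems.append("signup_date before 2023-01-01")
    if not df["country"].isin(ALLOWED_COUNTRIES).all():
        problems.append("unexpected country code")
    return problems


print(check_user_schema(sample_users))
```

Returning a list of messages rather than raising on the first failure lets every violation be reported in one pass.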

In [None]:
# ===== System-managed metadata (auto-generated, understand to edit) =====
# @node_type: data_source
# @node_id: user_data
# @execution_status: validated
# @name: Load User Data
# ===== End of system-managed metadata =====

import pandas as pd

# Test

# Load user data from CSV
user_data = pd.read_csv('users.csv')
print(f"Loaded {len(user_data)} users")

In [None]:
# @node_id: user_data
# @result_format: parquet
import pandas as pd
import os

# Load result from parquet
result_path = r'../projects/test_user_behavior_analysis/parquets/user_data.parquet'
if os.path.exists(result_path):
    user_data = pd.read_parquet(result_path)
    display(user_data)
else:
    print(f"Result file not found: {result_path}")

## Activity Data Source

**Description**: User activity event log from the platform

### Schema

| Column | Type | Description |
|--------|------|-------------|
| user_id | Integer | User identifier |
| activity_type | String | Type of activity (login, click, purchase, view) |
| timestamp | DateTime | When activity occurred |
| duration_seconds | Integer | Duration of activity (1-300 seconds) |

### Statistics

- **Total Records**: 500 events
- **Date Range**: 2024-01-01 onwards
- **Activity Types**: 4 categories
- **Avg Duration**: ~150 seconds per activity
- **Users Covered**: 100 users
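
The figures in this list are simple aggregates over the event log. As a rough illustration, they could be derived as below. The four events in `sample_activity` are invented for the example and are not taken from `activity.csv`.

```python
import pandas as pd

# Hypothetical events shaped like the documented activity schema.
sample_activity = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "activity_type": ["login", "click", "purchase", "view"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:05",
        "2024-01-02 14:30", "2024-01-03 20:15",
    ]),
    "duration_seconds": [30, 120, 300, 150],
})

# Number of events per activity type.
type_counts = sample_activity["activity_type"].value_counts()

# Mean duration across all events, in seconds.
avg_duration = sample_activity["duration_seconds"].mean()

# Number of distinct users that appear in the log.
users_covered = sample_activity["user_id"].nunique()

print(type_counts)
print(f"avg duration: {avg_duration:.1f}s, users covered: {users_covered}")
```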

In [None]:
# ===== System-managed metadata (auto-generated, understand to edit) =====
# @node_type: data_source
# @node_id: activity_data
# @execution_status: validated
# @name: Load Activity Data
# ===== End of system-managed metadata =====

import pandas as pd

# Load user activity logs
activity_data = pd.read_csv('activity.csv')
print(f"Loaded {len(activity_data)} activity records")

In [None]:
# @node_id: activity_data
# @result_format: parquet
import pandas as pd
import os

# Load result from parquet
result_path = r'../projects/test_user_behavior_analysis/parquets/activity_data.parquet'
if os.path.exists(result_path):
    activity_data = pd.read_parquet(result_path)
    display(activity_data)
else:
    print(f"Result file not found: {result_path}")

## Merged Dataset

**Description**: Combines user profiles with their activity logs using a left join on `user_id`

### Operation

- **Join Type**: Left join on `user_id`
- **Left Table**: user_data (100 rows)
- **Right Table**: activity_data (500 rows)
- **Result**: All users with their associated activities

### Output Schema

Combines all columns from both sources:
- From user_data: user_id, age, signup_date, country, premium
- From activity_data: activity_type, timestamp, duration_seconds

### Statistics

- **Total Records**: 500 activity records
- **Users Represented**: 100 users
- **Columns**: 8 total columns (5 from user_data plus 3 from activity_data; `user_id` is shared)
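
A left join keeps every user, including any without activity, so it can help to verify the join's shape explicitly. The sketch below uses small made-up frames (`users`, `events`) rather than the real sources. The `validate` and `indicator` arguments of `DataFrame.merge` are optional extras that the merge cell itself does not use.

```python
import pandas as pd

# Hypothetical inputs: three users, one of whom (user 3) has no events.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 41, 67],
    "signup_date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "country": ["USA", "JP", "DE"],
    "premium": [True, False, True],
})
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "activity_type": ["login", "click", "view"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "duration_seconds": [30, 120, 60],
})

merged = users.merge(
    events,
    on="user_id",
    how="left",
    validate="one_to_many",  # raises MergeError if user_id repeats in `users`
    indicator=True,          # adds a `_merge` column describing each row's origin
)

# Users kept by the left join that have no matching activity rows.
no_activity = merged.loc[merged["_merge"] == "left_only", "user_id"].tolist()

# Rows: 2 (user 1) + 1 (user 2) + 1 (user 3, activity columns missing).
# Columns: the eight listed in the output schema plus the `_merge` indicator.
print(merged.shape)
print(no_activity)
```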

In [None]:
# ===== System-managed metadata (auto-generated, understand to edit) =====
# @node_type: compute
# @node_id: merged_data
# @execution_status: validated
# @depends_on: [user_data, activity_data]
# @name: Merge Datasets
# ===== End of system-managed metadata =====

import pandas as pd

# Merge user and activity data
merged_data = user_data.merge(
    activity_data,
    on='user_id',
    how='left'
)
print(f"Merged dataset shape: {merged_data.shape}")

In [None]:
# @node_id: merged_data
# @result_format: parquet
import pandas as pd
import os

# Load result from parquet
result_path = r'../projects/test_user_behavior_analysis/parquets/merged_data.parquet'
if os.path.exists(result_path):
    merged_data = pd.read_parquet(result_path)
    display(merged_data)
else:
    print(f"Result file not found: {result_path}")

## Summary Statistics

**Description**: Key statistics computed from merged user and activity data

### Computed Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| total_users | 100 | Number of unique users |
| total_activities | 500 | Total activity events recorded |
| avg_age | ~45 | Average user age |
| premium_ratio | 0.5 | Proportion of premium users |

### Insights

- Average user age is approximately 45 years
- Activity data shows 500 total interactions across 100 users
- About half of the user base has premium subscriptions
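
Because the merged table repeats each user once per activity, an average taken over merged rows is weighted by activity count and can differ from the per-user average. The toy frames below are hypothetical and chosen only to make the difference visible.

```python
import pandas as pd

# Hypothetical data: user 1 (age 20) is three times as active as user 2 (age 60).
users = pd.DataFrame({
    "user_id": [1, 2],
    "age": [20, 60],
    "premium": [True, False],
})
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "activity_type": ["login", "click", "view", "login"],
})
merged = users.merge(events, on="user_id", how="left")

# Each user counted once: (20 + 60) / 2
per_user_avg_age = users["age"].mean()

# Each user counted once per event: (20 * 3 + 60) / 4
weighted_avg_age = merged["age"].mean()

# The mean of a boolean column is the share of True values.
premium_ratio = users["premium"].mean()

print(per_user_avg_age, weighted_avg_age, premium_ratio)
```

For a metric described as "average user age", the per-user figure is usually the intended one.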

In [None]:
# ===== System-managed metadata (auto-generated, understand to edit) =====
# @node_type: compute
# @node_id: statistics
# @execution_status: validated
# @depends_on: [merged_data, user_data]
# @name: Compute Statistics
# ===== End of system-managed metadata =====

import pandas as pd

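# Calculate statistics from merged data and user data
metrics = {
    'total_users': len(user_data),
    # Count only rows with an activity; a left join also keeps users without events
    'total_activities': int(merged_data['activity_type'].notna().sum()),
    # Average over users, not over merged rows (which repeat a user per event)
    'avg_age': user_data['age'].mean(),
    'premium_ratio': (user_data['premium'].sum() / len(user_data))
}
# Store as a metric/value table, the layout the report cell reads
statistics = pd.DataFrame({'metric': list(metrics), 'value': list(metrics.values())})
print(statistics)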

In [None]:
# @node_id: statistics
# @result_format: parquet
import pandas as pd
import os

# Load result from parquet
result_path = r'../projects/test_user_behavior_analysis/parquets/statistics.parquet'
if os.path.exists(result_path):
    statistics = pd.read_parquet(result_path)
    display(statistics)
else:
    print(f"Result file not found: {result_path}")

## Analysis Report

**Description**: Final report summarizing the user behavior analysis

### Report Contents

- **Title**: User Behavior Analysis Report
- **Generated**: 2024-11-07
- **Status**: Completed

### Key Findings

This report presents the consolidated analysis of user behavior patterns derived from:
1. Basic user demographic information (100 users)
2. Activity event logs (500 recorded events)
3. Aggregated statistics from the merged dataset

### Report Structure

- Executive summary of key metrics
- User segmentation analysis
- Activity type distribution
- Premium vs standard user comparison
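
The activity type distribution and the premium vs standard comparison could both be produced from the merged table with a few aggregations. This is a sketch under assumptions: the `merged` frame is a made-up stand-in, and the `tier` column and its labels are hypothetical names introduced for readability.

```python
import pandas as pd

# Hypothetical slice of a merged user/activity table.
merged = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "premium": [True, True, False, False],
    "activity_type": ["login", "purchase", "view", "login"],
    "duration_seconds": [30, 90, 60, 120],
})

# Share of events per activity type.
activity_share = merged["activity_type"].value_counts(normalize=True)

# Label each row by subscription tier, then compare the two groups.
merged["tier"] = merged["premium"].map({True: "premium", False: "standard"})
by_tier = merged.groupby("tier").agg(
    users=("user_id", "nunique"),
    events=("activity_type", "count"),
    avg_duration=("duration_seconds", "mean"),
)

print(activity_share)
print(by_tier)
```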

In [None]:
# ===== System-managed metadata (auto-generated, understand to edit) =====
# @node_type: compute
# @node_id: report
# @execution_status: not_executed
# @depends_on: [statistics]
# @name: Generate Report
# ===== End of system-managed metadata =====

import pandas as pd
from datetime import datetime

# Generate report based on statistics
report_data = {
    'metric': [
        'Report Title',
        'Generated At',
        'Analysis Status',
        'Total Users Analyzed',
        'Total Activities Recorded',
        'Average User Age',
        'Premium User Ratio'
    ],
    'value': [
        'User Behavior Analysis Report',
        datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'Completed',
        statistics.loc[statistics['metric'] == 'total_users', 'value'].values[0],
        statistics.loc[statistics['metric'] == 'total_activities', 'value'].values[0],
        statistics.loc[statistics['metric'] == 'avg_age', 'value'].values[0],
        statistics.loc[statistics['metric'] == 'premium_ratio', 'value'].values[0]
    ]
}

# Test

# Convert to DataFrame
report = pd.DataFrame(report_data)
print("Report Generated Successfully")
print(report)

In [None]:
# @node_id: report
# @result_format: parquet
import pandas as pd
import os

# Load result from parquet
result_path = r'../projects/test_user_behavior_analysis/parquets/report.parquet'
if os.path.exists(result_path):
    report = pd.read_parquet(result_path)
    display(report)
else:
    print(f"Result file not found: {result_path}")

In [None]:
# ===== System-managed metadata (auto-generated, understand to edit) =====
# @node_type: image
# @node_id: behavior_chart
# @execution_status: validated
# @depends_on: [statistics]
# @name: Visualize User Engagement
# ===== End of system-managed metadata =====

import matplotlib
matplotlib.use('Agg')  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt

# Create engagement visualization
fig, ax = plt.subplots(figsize=(12, 6))

# Hard-coded sample user counts per engagement level (placeholder values, not computed from statistics)
engagement_levels = ['Low\n(0-33)', 'Medium\n(34-66)', 'High\n(67-100)']
user_counts = [20, 50, 30]
colors = ['#e74c3c', '#f39c12', '#27ae60']

bars = ax.bar(engagement_levels, user_counts, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)}',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

ax.set_ylabel('Number of Users', fontsize=12)
ax.set_title('User Engagement Distribution', fontsize=14, fontweight='bold')
ax.set_ylim(0, max(user_counts) * 1.1)

plt.tight_layout()

# Save image
import os
os.makedirs('parquets', exist_ok=True)
plt.savefig('parquets/behavior_chart.png', dpi=150, bbox_inches='tight')
print("✓ Engagement chart saved to parquets/behavior_chart.png")

In [None]:
# @node_id: behavior_chart
# @result_format: image
from IPython.display import Image, display
import os

# Load and display engagement chart
image_path = r'parquets/behavior_chart.png'
if os.path.exists(image_path):
    display(Image(filename=image_path))
else:
    print(f"Image file not found: {image_path}")

## User Engagement Visualization

**Description**: Visual representation of user engagement scores and activity distribution

### Visualization Details

- **Chart Type**: Bar Chart
- **X-Axis**: Engagement level (Low, Medium, High)
- **Y-Axis**: Number of users
- **Color Coding**:
  - Green: High engagement (67-100)
  - Orange: Medium engagement (34-66)
  - Red: Low engagement (0-33)

### Key Metrics Displayed

- Total users analyzed: 100
- Average engagement score: ~60
- High engagement users: ~30%
- Low engagement users: ~20%
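
The chart cell plots fixed sample counts. If per-user engagement scores were available, the counts per level could be derived by binning. The sketch below assumes the 0-33 / 34-66 / 67-100 bands used as bar labels in the chart cell. The `scores` values are hypothetical, since how an engagement score is computed is not defined in this notebook.

```python
import pandas as pd

# Hypothetical engagement scores on a 0-100 scale (not from the dataset).
scores = pd.Series([12, 33, 34, 50, 66, 67, 95], name="engagement_score")

# Bin edges are right-inclusive: [0, 33], (33, 66], (66, 100].
levels = pd.cut(
    scores,
    bins=[0, 33, 66, 100],
    labels=["Low", "Medium", "High"],
    include_lowest=True,  # make a score of exactly 0 fall into "Low"
)

# Count users per level, ordered Low -> Medium -> High.
user_counts = levels.value_counts().sort_index()
print(user_counts)
```

The resulting `user_counts` could replace the hard-coded list passed to `ax.bar`.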