# Quality Quotes AI - User Engagement Data Generation

## Overview
This notebook generates realistic dummy data to simulate user interactions with the Quality Quotes AI platform.

### Data Generation Strategy
- **User Base**: 5,000 simulated users over 92 days (July 1 to September 30, 2025)
- **Quote Categories**: Motivation, Inspiration, Wellness, Success, Mindfulness, Gratitude, Growth
- **Metrics**: Quote generation frequency, interaction types, session duration
- **Realistic Patterns**: Time-of-day variations, weekend trends, engagement correlations

### Technical Approach
- Use statistical distributions (Poisson, normal, exponential, gamma) for realistic variability
- Implement correlations (e.g., more quotes in a session → longer session duration)
- Add temporal patterns (peak hours, day-of-week effects)
- Include user segmentation (power users, casual users, new users)
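
As a rough illustration of the approach above, the sketch below draws from the distributions that the generation code later in this notebook calls. The parameter values and variable names here are illustrative assumptions rather than the notebook's configuration, and the sketch uses a local `np.random.default_rng` generator instead of the global seed set in the first cell.

```python
import numpy as np

# Local generator (assumption: the notebook itself seeds the global NumPy state)
rng = np.random.default_rng(42)

n = 10_000
sessions = rng.poisson(lam=15, size=n)            # counts, e.g. sessions per user
quote_len = rng.normal(loc=80, scale=25, size=n)  # symmetric spread, e.g. characters
signup_age = rng.exponential(scale=30, size=n)    # skewed towards small values, e.g. days
duration = rng.gamma(shape=2, scale=240, size=n)  # positive, right-skewed; mean = shape * scale

# Clip to a plausible range and convert to whole characters
quote_len = np.clip(quote_len, 30, 200).astype(int)

print(f"sessions mean:   {sessions.mean():.2f}")
print(f"quote_len range: {quote_len.min()} to {quote_len.max()}")
print(f"signup_age mean: {signup_age.mean():.2f}")
print(f"duration mean:   {duration.mean():.2f}")
```

Poisson suits non-negative counts, while the gamma draw keeps durations positive with a long right tail, which a normal draw would not guarantee.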

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from faker import Faker
import random
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)
fake = Faker()
Faker.seed(42)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.4


## 1. Configuration and Constants

In [2]:
# Configuration
NUM_USERS = 5000
START_DATE = datetime(2025, 7, 1)
END_DATE = datetime(2025, 9, 30)
NUM_DAYS = (END_DATE - START_DATE).days + 1

# Quote categories with popularity weights
QUOTE_CATEGORIES = {
    'Motivation': 0.25,
    'Inspiration': 0.20,
    'Wellness': 0.15,
    'Success': 0.15,
    'Mindfulness': 0.10,
    'Gratitude': 0.08,
    'Growth': 0.07
}

# Interaction types
INTERACTION_TYPES = ['view', 'like', 'share', 'save', 'copy']

# User segments with characteristics
USER_SEGMENTS = {
    'power_user': {'ratio': 0.15, 'avg_sessions': 25, 'avg_duration': 12},
    'regular_user': {'ratio': 0.35, 'avg_sessions': 15, 'avg_duration': 8},
    'casual_user': {'ratio': 0.50, 'avg_sessions': 5, 'avg_duration': 4}
}

print(f"Configuration:")
print(f"  Period: {START_DATE.date()} to {END_DATE.date()} ({NUM_DAYS} days)")
print(f"  Users: {NUM_USERS:,}")
print(f"  Categories: {list(QUOTE_CATEGORIES.keys())}")

Configuration:
  Period: 2025-07-01 to 2025-09-30 (92 days)
  Users: 5,000
  Categories: ['Motivation', 'Inspiration', 'Wellness', 'Success', 'Mindfulness', 'Gratitude', 'Growth']


## 2. Generate User Profiles

Create diverse user profiles with demographic information and behavioral segments.
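
The generator in this section sizes each segment with `int(num_users * ratio)`. That is exact for 5,000 users, but truncation can drop users when the products are not whole numbers. A minimal sketch of one alternative, largest-remainder allocation, follows; `allocate_segments` is a hypothetical helper and not part of the notebook.

```python
def allocate_segments(num_users, ratios):
    """Split num_users across segments so the counts always sum to num_users."""
    raw = {seg: num_users * r for seg, r in ratios.items()}
    counts = {seg: int(v) for seg, v in raw.items()}  # truncated share per segment
    leftover = num_users - sum(counts.values())
    # Give the remaining users to the segments with the largest fractional parts
    by_remainder = sorted(raw, key=lambda seg: raw[seg] - counts[seg], reverse=True)
    for seg in by_remainder[:leftover]:
        counts[seg] += 1
    return counts

# Ratios copied from USER_SEGMENTS
ratios = {'power_user': 0.15, 'regular_user': 0.35, 'casual_user': 0.50}

print(allocate_segments(5000, ratios))
print(allocate_segments(1001, ratios))
```

With 1,001 users, plain truncation would yield only 1,000 profiles; the helper assigns the extra user to the segment with the largest remainder.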

In [3]:
def generate_users(num_users):
    """
    Generate user profiles with demographics and behavioral segments.
    
    Returns:
        DataFrame with user_id, name, email, signup_date, segment, location, country
    """
    users = []
    
    # Calculate users per segment
    segment_counts = {
        seg: int(num_users * info['ratio']) 
        for seg, info in USER_SEGMENTS.items()
    }
    
    user_id = 1
    for segment, count in segment_counts.items():
        for _ in range(count):
            # Random signup date (weighted towards recent)
            days_ago = int(np.random.exponential(30))
            days_ago = min(days_ago, NUM_DAYS - 1)
            signup_date = END_DATE - timedelta(days=days_ago)
            
            users.append({
                'user_id': f'U{user_id:05d}',
                'name': fake.name(),
                'email': fake.email(),
                'signup_date': signup_date,
                'segment': segment,
                'location': fake.city(),
                'country': fake.country()
            })
            user_id += 1
    
    return pd.DataFrame(users)

# Generate users
users_df = generate_users(NUM_USERS)

print(f"Generated {len(users_df):,} user profiles")
print(f"\nUser segment distribution:")
print(users_df['segment'].value_counts())
print(f"\nSample users:")
users_df.head()

Generated 5,000 user profiles

User segment distribution:
segment
casual_user     2500
regular_user    1750
power_user       750
Name: count, dtype: int64

Sample users:


| | user_id | name | email | signup_date | segment | location | country |
|---|---|---|---|---|---|---|---|
| 0 | U00001 | Allison Hill | donaldgarcia@example.net | 2025-09-16 | power_user | New Roberttown | Bermuda |
| 1 | U00002 | Jonathan Johnson | jennifermiles@example.com | 2025-07-02 | power_user | South Bridget | Sudan |
| 2 | U00003 | Olivia Moore | jpeterson@example.org | 2025-08-22 | power_user | Curtisfurt | Bhutan |
| 3 | U00004 | Gregory Baker | blairamanda@example.com | 2025-09-03 | power_user | New Kellystad | Morocco |
| 4 | U00005 | Benjamin Stanley | tracie31@example.com | 2025-09-25 | power_user | Shawnstad | Central African Republic |


## 3. Generate Quote Generation Events

Simulate quote generation with realistic temporal patterns:
- Peak hours: Morning (7-9 AM) and Evening (7-9 PM)
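- Weekend multiplier (1.3×) applied to the hour weights; since it scales every hour equally, it does not by itself raise weekend volume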
- User segment influences frequency
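
One way to make a weekend effect show up in the data is to weight the choice of day rather than the choice of hour. The sketch below is an assumption-laden illustration, not the notebook's method: it reuses the 1.3 weekend weight and the 92-day window, and the names `WEEKEND_WEIGHT`, `day_weights`, and `sampled` are hypothetical.

```python
import random
from datetime import datetime, timedelta

random.seed(0)

start = datetime(2025, 7, 1)
days = [start + timedelta(days=i) for i in range(92)]

# random.choices treats weights as relative, so scaling every hour weight by the
# same factor leaves the hour distribution unchanged. Weighting the days instead
# changes how often Saturdays and Sundays are drawn.
WEEKEND_WEIGHT = 1.3  # assumed value, mirroring the multiplier in the notebook
day_weights = [WEEKEND_WEIGHT if d.weekday() >= 5 else 1.0 for d in days]

sampled = random.choices(days, weights=day_weights, k=50_000)

calendar_share = sum(d.weekday() >= 5 for d in days) / len(days)
weekend_share = sum(d.weekday() >= 5 for d in sampled) / len(sampled)

print(f"Weekend share of calendar days: {calendar_share:.3f}")
print(f"Weekend share of sampled days:  {weekend_share:.3f}")
```

The sampled weekend share should sit noticeably above the calendar share, which is the pattern a weekend boost is meant to produce.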

In [4]:
def get_hour_weight(hour):
    """Return activity weight for given hour (0-23)"""
    # Peak hours: 7-9 AM and 7-9 PM
    if hour in [7, 8, 19, 20]:
        return 2.5
    elif hour in [9, 10, 11, 18, 21, 22]:
        return 1.8
    elif hour in [12, 13, 14, 15, 16, 17]:
        return 1.2
    elif hour in [6, 23]:
        return 0.8
    else:
        return 0.3

def generate_quote_events(users_df):
    """
    Generate quote generation events for all users.
    
    Returns:
        DataFrame with quote_id, user_id, timestamp, category, quote_length
    """
    events = []
    quote_id = 1
    
    categories = list(QUOTE_CATEGORIES.keys())
    category_weights = list(QUOTE_CATEGORIES.values())
    
    for _, user in users_df.iterrows():
        user_id = user['user_id']
        segment = user['segment']
        signup_date = user['signup_date']
        
        # Number of sessions based on segment
        avg_sessions = USER_SEGMENTS[segment]['avg_sessions']
        num_sessions = max(1, int(np.random.poisson(avg_sessions)))
        
        # Generate sessions only after signup
        active_days = (END_DATE - signup_date).days + 1
        if active_days <= 0:
            continue
            
        for _ in range(num_sessions):
            # Random day after signup
            day_offset = random.randint(0, active_days - 1)
            session_date = signup_date + timedelta(days=day_offset)
            
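            # Weekend boost
            # NOTE: the multiplier scales all 24 hour weights equally and
            # random.choices uses relative weights, so it changes neither the
            # sampled hour nor how often weekend days are chosen.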
            is_weekend = session_date.weekday() >= 5
            weekend_multiplier = 1.3 if is_weekend else 1.0
            
            # Random hour weighted by activity patterns
            hours = list(range(24))
            hour_weights = [get_hour_weight(h) * weekend_multiplier for h in hours]
            hour = random.choices(hours, weights=hour_weights)[0]
            
            minute = random.randint(0, 59)
            second = random.randint(0, 59)
            
            timestamp = session_date.replace(hour=hour, minute=minute, second=second)
            
            # Quotes per session (1-5, higher for power users)
            quotes_in_session = np.random.poisson(2 if segment == 'power_user' else 1.2)
            quotes_in_session = max(1, min(5, quotes_in_session))
            
            for q in range(quotes_in_session):
                # Select category based on weights
                category = random.choices(categories, weights=category_weights)[0]
                
                # Quote length varies by category
                if category in ['Mindfulness', 'Wellness']:
                    quote_length = int(np.random.normal(120, 30))
                else:
                    quote_length = int(np.random.normal(80, 25))
                quote_length = max(30, min(200, quote_length))
                
                events.append({
                    'quote_id': f'Q{quote_id:07d}',
                    'user_id': user_id,
                    'timestamp': timestamp + timedelta(seconds=q*30),  # 30 s apart; late sessions can cross midnight
                    'category': category,
                    'quote_length': quote_length
                })
                quote_id += 1
    
    return pd.DataFrame(events)

# Generate quote events
print("Generating quote events... (this may take a minute)")
quotes_df = generate_quote_events(users_df)

print(f"\nGenerated {len(quotes_df):,} quote generation events")
print(f"\nQuotes by category:")
print(quotes_df['category'].value_counts())
print(f"\nDate range: {quotes_df['timestamp'].min()} to {quotes_df['timestamp'].max()}")
print(f"\nSample quotes:")
quotes_df.head(10)

Generating quote events... (this may take a minute)

Generated 97,859 quote generation events

Quotes by category:
category
Motivation     24478
Inspiration    19603
Success        14693
Wellness       14533
Mindfulness     9896
Gratitude       7822
Growth          6834
Name: count, dtype: int64

Date range: 2025-07-01 01:56:17 to 2025-10-01 00:00:48

Sample quotes:


| | quote_id | user_id | timestamp | category | quote_length |
|---|---|---|---|---|---|
| 0 | Q0000001 | U00001 | 2025-09-26 07:47:17 | Motivation | 93 |
| 1 | Q0000002 | U00001 | 2025-09-26 07:47:47 | Motivation | 90 |
| 2 | Q0000003 | U00001 | 2025-09-26 07:48:17 | Motivation | 73 |
| 3 | Q0000004 | U00001 | 2025-09-27 21:05:37 | Inspiration | 48 |
| 4 | Q0000005 | U00001 | 2025-09-16 07:14:32 | Success | 95 |
| 5 | Q0000006 | U00001 | 2025-09-16 07:15:02 | Wellness | 63 |
| 6 | Q0000007 | U00001 | 2025-09-27 18:34:26 | Motivation | 64 |
| 7 | Q0000008 | U00001 | 2025-09-27 18:34:56 | Wellness | 130 |
| 8 | Q0000009 | U00001 | 2025-09-27 18:35:26 | Mindfulness | 129 |
| 9 | Q0000010 | U00001 | 2025-09-27 18:35:56 | Motivation | 32 |
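
The date range printed above runs into 2025-10-01 even though the period ends on 2025-09-30. The per-quote 30-second offsets are added after the session time is drawn, so a session starting just before midnight on the last day can spill over. A minimal sketch of one way to trim such rows, using a small hand-made frame rather than the generated data (`period_end` and `in_period` are hypothetical names):

```python
import pandas as pd

END_DATE = pd.Timestamp('2025-09-30')
period_end = END_DATE + pd.Timedelta(days=1)  # exclusive bound: midnight after the last day

quotes = pd.DataFrame({
    'quote_id': ['Q1', 'Q2', 'Q3'],
    'timestamp': pd.to_datetime([
        '2025-09-30 23:59:18',
        '2025-09-30 23:59:48',
        '2025-10-01 00:00:18',  # pushed past the period by the per-quote offset
    ]),
})

# Keep only quotes that fall inside the configured period
in_period = quotes[quotes['timestamp'] < period_end]
print(in_period)
```

Filtering is the simplest option; capping the session start time so that the last quote still lands on the session date would preserve the row count instead.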


## 4. Generate User Interactions

Create interaction events (views, likes, saves, shares, copies) with realistic engagement patterns:
- Every quote gets a view, but not every quote gets further interactions
- Engagement rate varies by category
- Power users interact more
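
The rules above can be sketched as a small probability model. The category rates, segment multipliers, and follow-up probabilities are copied from the generation code in this section; the helper names `engagement_prob` and `expected_non_view` are hypothetical and only illustrate how the numbers combine.

```python
category_engagement = {
    'Motivation': 0.65, 'Inspiration': 0.60, 'Success': 0.55, 'Wellness': 0.50,
    'Growth': 0.48, 'Mindfulness': 0.45, 'Gratitude': 0.42,
}
segment_multiplier = {'power_user': 1.4, 'regular_user': 1.0, 'casual_user': 0.7}

# Probability of each action once the user passes the engagement gate
followup = {'like': 0.80, 'save': 0.35, 'share': 0.25, 'copy': 0.30}

def engagement_prob(category, segment):
    """Chance that a quote receives any follow-up action, capped at 0.95."""
    return min(0.95, category_engagement[category] * segment_multiplier[segment])

def expected_non_view(category, segment):
    """Expected non-view interactions per quote; actions are drawn independently."""
    return engagement_prob(category, segment) * sum(followup.values())

for segment in segment_multiplier:
    p = engagement_prob('Motivation', segment)
    e = expected_non_view('Motivation', segment)
    print(f"{segment:13s} engagement={p:.3f}  expected non-view interactions={e:.3f}")
```

Because the four follow-up probabilities sum to 1.7, an engaged quote collects more than one action on average, so interaction counts per quote can exceed 1.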

In [5]:
def generate_interactions(quotes_df, users_df):
    """
    Generate user interactions with quotes.
    
    Returns:
        DataFrame with interaction_id, quote_id, user_id, interaction_type, timestamp
    """
    interactions = []
    interaction_id = 1
    
    # Merge to get user segment
    quotes_with_segment = quotes_df.merge(users_df[['user_id', 'segment']], on='user_id')
    
    # Engagement probability by category
    category_engagement = {
        'Motivation': 0.65,
        'Inspiration': 0.60,
        'Success': 0.55,
        'Wellness': 0.50,
        'Growth': 0.48,
        'Mindfulness': 0.45,
        'Gratitude': 0.42
    }
    
    for _, quote in quotes_with_segment.iterrows():
        quote_id = quote['quote_id']
        user_id = quote['user_id']
        category = quote['category']
        segment = quote['segment']
        timestamp = quote['timestamp']
        
        # Base engagement probability
        base_prob = category_engagement[category]
        
        # Segment multiplier
        segment_multiplier = {
            'power_user': 1.4,
            'regular_user': 1.0,
            'casual_user': 0.7
        }[segment]
        
        engagement_prob = min(0.95, base_prob * segment_multiplier)
        
        # Always add 'view' interaction
        interactions.append({
            'interaction_id': f'I{interaction_id:08d}',
            'quote_id': quote_id,
            'user_id': user_id,
            'interaction_type': 'view',
            'timestamp': timestamp
        })
        interaction_id += 1
        
        # Decide if user engages further
        if random.random() < engagement_prob:
            # After the engagement gate, each action below is drawn independently
            # (a save, share, or copy does not require a like)
            # Like (most common)
            if random.random() < 0.8:
                interactions.append({
                    'interaction_id': f'I{interaction_id:08d}',
                    'quote_id': quote_id,
                    'user_id': user_id,
                    'interaction_type': 'like',
                    'timestamp': timestamp + timedelta(seconds=random.randint(1, 5))
                })
                interaction_id += 1
            
            # Save
            if random.random() < 0.35:
                interactions.append({
                    'interaction_id': f'I{interaction_id:08d}',
                    'quote_id': quote_id,
                    'user_id': user_id,
                    'interaction_type': 'save',
                    'timestamp': timestamp + timedelta(seconds=random.randint(2, 8))
                })
                interaction_id += 1
            
            # Share (less common)
            if random.random() < 0.25:
                interactions.append({
                    'interaction_id': f'I{interaction_id:08d}',
                    'quote_id': quote_id,
                    'user_id': user_id,
                    'interaction_type': 'share',
                    'timestamp': timestamp + timedelta(seconds=random.randint(5, 15))
                })
                interaction_id += 1
            
            # Copy
            if random.random() < 0.30:
                interactions.append({
                    'interaction_id': f'I{interaction_id:08d}',
                    'quote_id': quote_id,
                    'user_id': user_id,
                    'interaction_type': 'copy',
                    'timestamp': timestamp + timedelta(seconds=random.randint(3, 10))
                })
                interaction_id += 1
    
    return pd.DataFrame(interactions)

# Generate interactions
print("Generating user interactions...")
interactions_df = generate_interactions(quotes_df, users_df)

print(f"\nGenerated {len(interactions_df):,} interactions")
print(f"\nInteractions by type:")
print(interactions_df['interaction_type'].value_counts())
print(f"\nSample interactions:")
interactions_df.head(10)

Generating user interactions...

Generated 199,270 interactions

Interactions by type:
interaction_type
view     97859
like     47635
save     20985
copy     17964
share    14827
Name: count, dtype: int64

Sample interactions:


| | interaction_id | quote_id | user_id | interaction_type | timestamp |
|---|---|---|---|---|---|
| 0 | I00000001 | Q0000001 | U00001 | view | 2025-09-26 07:47:17 |
| 1 | I00000002 | Q0000002 | U00001 | view | 2025-09-26 07:47:47 |
| 2 | I00000003 | Q0000002 | U00001 | like | 2025-09-26 07:47:51 |
| 3 | I00000004 | Q0000002 | U00001 | copy | 2025-09-26 07:47:50 |
| 4 | I00000005 | Q0000003 | U00001 | view | 2025-09-26 07:48:17 |
| 5 | I00000006 | Q0000003 | U00001 | like | 2025-09-26 07:48:18 |
| 6 | I00000007 | Q0000003 | U00001 | share | 2025-09-26 07:48:25 |
| 7 | I00000008 | Q0000003 | U00001 | copy | 2025-09-26 07:48:23 |
| 8 | I00000009 | Q0000004 | U00001 | view | 2025-09-27 21:05:37 |
| 9 | I00000010 | Q0000004 | U00001 | like | 2025-09-27 21:05:39 |


## 5. Generate Session Duration Data

Simulate time spent per session from the user's segment and the number of quotes generated in that session.
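
The cell below groups quotes into sessions with fixed 30-minute clock windows (`dt.floor('30min')`), which can split activity that straddles a window boundary. A minimal sketch of a gap-based alternative follows, where a new session starts only after more than 30 minutes of inactivity. The sample rows and the `session_no` column are illustrative assumptions, not notebook output.

```python
import pandas as pd

quotes = pd.DataFrame({
    'user_id': ['U1', 'U1', 'U1', 'U2'],
    'timestamp': pd.to_datetime([
        '2025-09-01 07:29:50',
        '2025-09-01 07:30:20',  # 30 s later, but in a different fixed window
        '2025-09-01 09:00:00',
        '2025-09-01 07:31:00',
    ]),
})
quotes = quotes.sort_values(['user_id', 'timestamp']).reset_index(drop=True)

# Fixed windows: the grouping key used by floor('30min')
quotes['window'] = quotes['timestamp'].dt.floor('30min')

# Gap-based: a session starts at a user's first event or after a gap over 30 minutes
gap = quotes.groupby('user_id')['timestamp'].diff()
new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))
quotes['session_no'] = new_session.cumsum()

fixed_sessions = quotes.groupby(['user_id', 'window']).ngroups
gap_sessions = quotes['session_no'].nunique()

print(f"Fixed-window sessions: {fixed_sessions}")
print(f"Gap-based sessions:    {gap_sessions}")
```

The fixed-window approach is simpler and fully vectorised; the gap-based version keeps the two quotes at 07:29:50 and 07:30:20 in one session.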

In [6]:
def generate_sessions(quotes_df, users_df):
    """
    Generate session data with duration based on user activity.
    
    Returns:
        DataFrame with session_id, user_id, start_time, duration_seconds, duration_minutes, quotes_generated
    """
    # Merge to get user segment
    quotes_with_segment = quotes_df.merge(users_df[['user_id', 'segment']], on='user_id')
    
    # Group quotes by user and fixed 30-minute clock windows (each treated as one session)
    quotes_with_segment['session_window'] = quotes_with_segment['timestamp'].dt.floor('30min')
    
    sessions = []
    session_id = 1
    
    for (user_id, session_window), group in quotes_with_segment.groupby(['user_id', 'session_window']):
        segment = group['segment'].iloc[0]
        quotes_in_session = len(group)
        start_time = group['timestamp'].min()
        
        # Base duration from segment
        avg_duration = USER_SEGMENTS[segment]['avg_duration'] * 60  # Convert to seconds
        
        # Random base duration: gamma draw with mean avg_duration (shape * scale)
        base_duration = np.random.gamma(shape=2, scale=avg_duration/2)
        
        # Add time per quote (30-90 seconds per quote)
        time_per_quote = quotes_in_session * random.randint(30, 90)
        
        duration = int(base_duration + time_per_quote)
        duration = max(60, min(3600, duration))  # Between 1 min and 1 hour
        
        sessions.append({
            'session_id': f'S{session_id:07d}',
            'user_id': user_id,
            'start_time': start_time,
            'duration_seconds': duration,
            'duration_minutes': round(duration / 60, 2),
            'quotes_generated': quotes_in_session
        })
        session_id += 1
    
    return pd.DataFrame(sessions)

# Generate sessions
sessions_df = generate_sessions(quotes_df, users_df)

print(f"Generated {len(sessions_df):,} sessions")
print(f"\nSession statistics:")
print(sessions_df[['duration_minutes', 'quotes_generated']].describe())
print(f"\nSample sessions:")
sessions_df.head(10)

Generated 56,766 sessions

Session statistics:
       duration_minutes  quotes_generated
count      56766.000000      56766.000000
mean          10.176308          1.723902
std            7.219309          1.042020
min            1.000000          1.000000
25%            5.070000          1.000000
50%            8.300000          1.000000
75%           13.230000          2.000000
max           60.000000         14.000000

Sample sessions:


| | session_id | user_id | start_time | duration_seconds | duration_minutes | quotes_generated |
|---|---|---|---|---|---|---|
| 0 | S0000001 | U00001 | 2025-09-16 07:14:32 | 634 | 10.57 | 2 |
| 1 | S0000002 | U00001 | 2025-09-17 20:55:06 | 839 | 13.98 | 4 |
| 2 | S0000003 | U00001 | 2025-09-20 20:39:56 | 713 | 11.88 | 2 |
| 3 | S0000004 | U00001 | 2025-09-20 23:40:44 | 435 | 7.25 | 2 |
| 4 | S0000005 | U00001 | 2025-09-21 08:08:32 | 1103 | 18.38 | 2 |
| 5 | S0000006 | U00001 | 2025-09-22 07:00:46 | 628 | 10.47 | 1 |
| 6 | S0000007 | U00001 | 2025-09-22 09:13:58 | 1330 | 22.17 | 1 |
| 7 | S0000008 | U00001 | 2025-09-23 15:35:55 | 458 | 7.63 | 1 |
| 8 | S0000009 | U00001 | 2025-09-24 07:59:24 | 1337 | 22.28 | 1 |
| 9 | S0000010 | U00001 | 2025-09-24 15:47:37 | 423 | 7.05 | 2 |


## 6. Data Quality Checks

In [7]:
print("=" * 60)
print("DATA QUALITY REPORT")
print("=" * 60)

print("\n1. USERS")
print(f"   Total users: {len(users_df):,}")
print(f"   Missing values: {users_df.isnull().sum().sum()}")
print(f"   Duplicate user_ids: {users_df['user_id'].duplicated().sum()}")

print("\n2. QUOTES")
print(f"   Total quotes: {len(quotes_df):,}")
print(f"   Missing values: {quotes_df.isnull().sum().sum()}")
print(f"   Unique users who generated quotes: {quotes_df['user_id'].nunique():,}")
print(f"   Quotes per user (avg): {len(quotes_df) / quotes_df['user_id'].nunique():.1f}")

print("\n3. INTERACTIONS")
print(f"   Total interactions: {len(interactions_df):,}")
print(f"   Missing values: {interactions_df.isnull().sum().sum()}")
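# Note: this divides all non-view interactions by the number of quotes, so it is
# an interactions-per-quote ratio and can exceed 100%.
print(f"   Engagement rate (non-view): {((len(interactions_df) - interactions_df[interactions_df['interaction_type']=='view'].shape[0]) / len(quotes_df) * 100):.1f}%")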

print("\n4. SESSIONS")
print(f"   Total sessions: {len(sessions_df):,}")
print(f"   Missing values: {sessions_df.isnull().sum().sum()}")
print(f"   Avg session duration: {sessions_df['duration_minutes'].mean():.1f} minutes")
print(f"   Avg quotes per session: {sessions_df['quotes_generated'].mean():.1f}")

print("\n5. TEMPORAL COVERAGE")
print(f"   Quote date range: {quotes_df['timestamp'].min().date()} to {quotes_df['timestamp'].max().date()}")
print(f"   Days with activity: {quotes_df['timestamp'].dt.date.nunique()}")

print("\n" + "=" * 60)

DATA QUALITY REPORT

1. USERS
   Total users: 5,000
   Missing values: 0
   Duplicate user_ids: 0

2. QUOTES
   Total quotes: 97,859
   Missing values: 0
   Unique users who generated quotes: 5,000
   Quotes per user (avg): 19.6

3. INTERACTIONS
   Total interactions: 199,270
   Missing values: 0
   Engagement rate (non-view): 103.6%

4. SESSIONS
   Total sessions: 56,766
   Missing values: 0
   Avg session duration: 10.2 minutes
   Avg quotes per session: 1.7

5. TEMPORAL COVERAGE
   Quote date range: 2025-07-01 to 2025-10-01
   Days with activity: 93
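
The engagement figure in the report above exceeds 100% because it counts every non-view interaction, and a single quote can collect several. A minimal sketch of separating that ratio from a bounded engagement rate (share of quotes with at least one non-view interaction) follows; it uses a small hand-made frame, and `per_quote` and `engaged_rate` are hypothetical names.

```python
import pandas as pd

interactions = pd.DataFrame({
    'quote_id': ['Q1', 'Q2', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3', 'Q4'],
    'interaction_type': ['view', 'view', 'like', 'copy', 'save',
                         'view', 'share', 'like', 'view'],
})

is_view = interactions['interaction_type'] == 'view'
total_quotes = interactions.loc[is_view, 'quote_id'].nunique()
non_view = interactions[~is_view]

# Interactions per quote: unbounded, since one quote may collect several actions
per_quote = len(non_view) / total_quotes

# Engagement rate: share of quotes with at least one non-view action, at most 1
engaged_rate = non_view['quote_id'].nunique() / total_quotes

print(f"Non-view interactions per quote: {per_quote:.2f}")
print(f"Engaged quotes: {engaged_rate:.1%}")
```

In this toy frame only two of four quotes are engaged, yet the per-quote ratio is above 1, which mirrors the pattern in the report.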



## 7. Export Data to CSV Files

Save all datasets for use in visualization and analysis notebooks.
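
When these files are read back in a later notebook, timestamp columns come back as plain strings unless they are parsed explicitly. The sketch below round-trips a two-row frame through a temporary directory to show this; the temporary path and the sample rows are assumptions for the demo, not the notebook's `data/` files.

```python
import os
import tempfile

import pandas as pd

quotes = pd.DataFrame({
    'quote_id': ['Q0000001', 'Q0000002'],
    'user_id': ['U00001', 'U00001'],
    'timestamp': pd.to_datetime(['2025-09-26 07:47:17', '2025-09-26 07:47:47']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'quotes.csv')  # temporary file for the demo only
    quotes.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV does not preserve
    reloaded = pd.read_csv(path, parse_dates=['timestamp'])

print(reloaded.dtypes)
```

Writing with `index=False`, as the export cell does, also avoids an extra unnamed index column when the file is reloaded.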

In [8]:
# Create data directory if it doesn't exist
import os
os.makedirs('data', exist_ok=True)

# Export to CSV
users_df.to_csv('data/users.csv', index=False)
quotes_df.to_csv('data/quotes.csv', index=False)
interactions_df.to_csv('data/interactions.csv', index=False)
sessions_df.to_csv('data/sessions.csv', index=False)

print("Data exported successfully!")
print("\nFiles created:")
for filename in ['users.csv', 'quotes.csv', 'interactions.csv', 'sessions.csv']:
    filepath = f'data/{filename}'
    size_kb = os.path.getsize(filepath) / 1024
    print(f"  {filepath:30s} - {size_kb:>8.1f} KB")

Data exported successfully!

Files created:
  data/users.csv                 -    448.8 KB
  data/quotes.csv                -   4734.4 KB
  data/interactions.csv          -   9939.1 KB
  data/sessions.csv              -   2625.9 KB


## Summary

### Data Generation Methodology

**1. Realistic Patterns Implemented:**
- **Temporal patterns**: Peak hours (morning/evening); the weekend multiplier scales all hour weights equally, so it does not change the sampled data
- **User segmentation**: Power users (15%), Regular (35%), Casual (50%)
- **Statistical distributions**: Poisson for counts, Normal for quote lengths, Gamma for session durations, Exponential for signup timing
- **Correlations**: Session duration correlates with quotes generated and user segment

**2. Data Structure:**
- **Users**: Demographics, signup date, behavioral segment
- **Quotes**: Timestamp, category, length, user reference
- **Interactions**: Type (view/like/share/save/copy), timestamp
- **Sessions**: Duration, activity count, temporal information

**3. Key Design Decisions:**
- Quote categories weighted by expected popularity
- Engagement model: every quote gets a view; engaged users then like, save, share, or copy, each drawn independently
- Time-of-day activity based on typical social media patterns
- Session grouping within 30-minute windows

**Next Steps:**
Proceed to `02_visualizations.ipynb` to create data visualizations and analyze patterns.