# Length-Adaptive Sequential Recommendation

**Hybrid SASRec + LightGCN with Adaptive Fusion on MovieLens-1M**

---

## 🚀 Quick Start

This notebook is ready to run! Just:
1. **Enable GPU** (recommended): Settings → Accelerator → GPU T4
2. **Click "Run All"** or run cells sequentially

**Expected Time:**
- With GPU T4: ~8-10 minutes per model
- With CPU: ~40-50 minutes per model

All data is already preprocessed and included in the repository!

## Step 1: Clone Repository

Cloning from: https://github.com/faroukq1/length-adaptive.git

In [None]:
# Clone repository with all preprocessed data
!git clone https://github.com/faroukq1/length-adaptive.git

# Change to project directory
%cd length-adaptive

# Verify structure
!echo "✓ Source code:"
!ls -la src/

!echo "\n✓ Preprocessed data:"
!ls -lh data/ml-1m/processed/

!echo "\n✓ Co-occurrence graph:"
!ls -lh data/graphs/

!echo "\n✓ Experiments scripts:"
!ls -lh experiments/

## Step 2: Install Dependencies

Installing PyTorch Geometric and other required packages

In [None]:
# Install required packages quietly
!pip install -q torch-geometric tqdm scikit-learn pandas matplotlib

print("✓ All dependencies installed successfully!")

## Step 3: Verify GPU Setup

Check if GPU is available and will be used for training

In [None]:
# Check GPU availability
!python check_gpu.py

## 📋 Experiment Priority Guide

This notebook includes experiments from the action plan to beat the SASRec baseline:

**Priority 1 (Quick - Run First):**
- ✅ SASRec Baseline (Step 6)
- ✅ Hybrid Discrete (Step 5) - Our best model

**Priority 2 (Optimization - Run if time permits):**
- 🔬 Grid Search for Optimal Alpha (Advanced section)
- 🔬 All Hybrid Variants (Advanced section)

**Current Best Results:**
- Hybrid Fixed (α=0.5): HR@10 = 9.99% (+3.7% vs baseline)
- Short-history users: +42% improvement

**Target:** Beat SASRec on overall HR@10 by ≥3% and short-user HR@10 by ≥20%
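
To make the target concrete, here is a minimal sketch of how the two thresholds could be checked. It assumes the percentages are relative gains over the baseline, which matches the improvement formula used in the results cell in Step 9. The helper names `relative_gain` and `meets_target` are hypothetical and not part of the repository, and the scores are illustrative only.

```python
def relative_gain(model_score: float, baseline_score: float) -> float:
    """Relative improvement over the baseline, in percent."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (model_score - baseline_score) / baseline_score * 100.0


def meets_target(overall_gain: float, short_gain: float,
                 overall_min: float = 3.0, short_min: float = 20.0) -> bool:
    """True when both gains reach the thresholds stated in the target."""
    return overall_gain >= overall_min and short_gain >= short_min


# Illustrative scores only, not measured results.
overall = relative_gain(0.105, 0.100)  # overall HR@10: model vs. baseline
short = relative_gain(0.065, 0.050)    # short-history HR@10: model vs. baseline
print(f"overall {overall:+.1f}%, short {short:+.1f}%, "
      f"target met: {meets_target(overall, short)}")
```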

## Step 4: Quick Test (2 epochs)

Verify everything works with a quick 2-epoch test on both SASRec and Hybrid models

In [None]:
# Run quick training test
!python test_training.py

## Step 5: Train Hybrid Model (50 epochs)

Train our best model: **Hybrid with Discrete Fusion**

This model adapts based on user history length:
- Short history users (≤10 items): More collaborative filtering (GNN)
- Medium users (11-50 items): Balanced fusion
- Long history users (>50 items): More sequential patterns (Transformer)

**Note on Epochs:**
- **50 epochs** = ~8-10 min (GPU) with early stopping → typically converges at ~20-30 epochs
- **600 epochs** (paper setting) = ~80 min (GPU) → may improve by ~2-3% but uses 10x time
- Early stopping (patience=10) prevents overfitting automatically
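
The length-based rule above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the repository's implementation: it assumes the fused embedding is a convex combination in which α weights the sequential (SASRec) side and 1 - α the GNN side, consistent with the alpha interpretation printed later in this notebook. The function names `discrete_alpha` and `fuse` are hypothetical; the bin weights are the defaults noted in the training cell.

```python
from typing import List, Sequence

# Default bin weights (alpha_short, alpha_mid, alpha_long) from the training cell.
ALPHA_SHORT, ALPHA_MID, ALPHA_LONG = 0.3, 0.5, 0.7


def discrete_alpha(history_len: int, short_max: int = 10, mid_max: int = 50) -> float:
    """Pick the fusion weight from the user's history length."""
    if history_len <= short_max:
        return ALPHA_SHORT   # short history: lean on the GNN side
    if history_len <= mid_max:
        return ALPHA_MID     # medium history: balanced
    return ALPHA_LONG        # long history: lean on the sequential side


def fuse(seq_emb: Sequence[float], gnn_emb: Sequence[float], alpha: float) -> List[float]:
    """Assumed fusion: alpha * sequential + (1 - alpha) * GNN, element-wise."""
    if len(seq_emb) != len(gnn_emb):
        raise ValueError("embeddings must have the same dimension")
    return [alpha * s + (1.0 - alpha) * g for s, g in zip(seq_emb, gnn_emb)]


# Toy 2-d embeddings to show how the mix shifts with history length.
seq_emb, gnn_emb = [1.0, 0.0], [0.0, 1.0]
for n in (5, 30, 120):
    a = discrete_alpha(n)
    print(f"len={n:3d} alpha={a} fused={fuse(seq_emb, gnn_emb, a)}")
```

In the real model the same weighting would be applied to tensors per user in a batch; the list version only shows the arithmetic.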

In [None]:
# Train Hybrid model with discrete fusion (default: α_short=0.3, α_mid=0.5, α_long=0.7)
# This uses early stopping - training stops when validation stops improving
print("="*70)
print("🚀 Training Hybrid Discrete Model")
print("="*70)

!python experiments/run_experiment.py \
    --model hybrid_discrete \
    --epochs 50 \
    --batch_size 256 \
    --lr 0.001 \
    --d_model 64 \
    --n_heads 2 \
    --n_blocks 2 \
    --patience 10

print("\n✅ Training complete! Check results/ folder for outputs.")

# To match paper settings (slower but may get slightly better results):
# !python experiments/run_experiment.py \
#     --model hybrid_discrete \
#     --epochs 600 \
#     --patience 20

## Step 6: Train SASRec Baseline (Optional - Skip if you already have it)

**⚠️ SKIP THIS STEP if:**
- You already have `results/sasrec_*/` folder from previous runs
- You haven't changed data preprocessing or hyperparameters
- You just want to test new hybrid variants

**Only run this if:**
- First time training
- Changed hyperparameters
- Want to verify reproducibility
- Need fresh baseline for comparison

**Alternative:** Copy your existing `results/sasrec_*/` folder to Kaggle instead of retraining.

In [None]:
# OPTION 1: Skip SASRec training (if you already have results)
print("💡 Skipping SASRec - using existing baseline results")
print("   If you need to train SASRec, uncomment the code below:\n")

# OPTION 2: Train SASRec baseline (uncomment if needed)
# print("="*70)
# print("🚀 Training SASRec Baseline")
# print("="*70)
# 
# !python experiments/run_experiment.py \
#     --model sasrec \
#     --epochs 50 \
#     --batch_size 256 \
#     --lr 0.001 \
#     --patience 10
# 
# print("\n✅ Baseline training complete!")

# OPTION 3: Upload existing SASRec results
# If you have results locally, you can upload the folder:
# 1. Zip your local results/sasrec_*/ folder
# 2. Upload to Kaggle input data
# 3. Copy to results/ directory:
# !mkdir -p results
# !cp -r /kaggle/input/your-sasrec-results/* results/

## Step 7: Train All Models (Optional - takes 3-5 hours)

Uncomment to train all 5 model variants:
- `sasrec`: Transformer baseline
- `hybrid_fixed`: Fixed fusion weight (α=0.5)
- `hybrid_discrete`: Bin-based fusion (our approach)
- `hybrid_learnable`: Per-user learned weights
- `hybrid_continuous`: Neural network fusion

In [None]:
# Uncomment to run all experiments
# !bash scripts/run_all_experiments.sh

In [None]:
# Train all hybrid variants
# Uncomment to run complete ablation study (takes ~8 hours with GPU)

# models = ['hybrid_fixed', 'hybrid_discrete', 'hybrid_learnable', 'hybrid_continuous']
# 
# for model in models:
#     print(f"\n{'='*70}")
#     print(f"🚀 Training {model}")
#     print(f"{'='*70}\n")
#     
#     !python experiments/run_experiment.py \
#         --model {model} \
#         --epochs 50 \
#         --batch_size 256 \
#         --lr 0.001 \
#         --patience 10
#     
#     print(f"\n✅ {model} complete!")

# Quick version: Use the automated script
# !bash scripts/run_all_experiments.sh

print("💡 Tip: Uncomment to train all model variants")

## 🔬 Advanced: All Hybrid Variants

Train all fusion strategies for complete comparison:
- **Fixed**: Single α for all users
- **Discrete**: Bin-based (short/medium/long)
- **Learnable**: Learned bin weights
- **Continuous**: Smooth function of length

In [None]:
# Grid search for optimal alpha
# Tests α ∈ {0.3, 0.4, 0.5, 0.6, 0.7}
# Uncomment to run (takes ~10-12 hours with GPU)

# alphas = [0.3, 0.4, 0.5, 0.6, 0.7]
# 
# for alpha in alphas:
#     print(f"\n{'='*70}")
#     print(f"🔬 Testing Fixed Alpha = {alpha}")
#     print(f"{'='*70}\n")
#     
#     !python experiments/run_experiment.py \
#         --model hybrid_fixed \
#         --fixed_alpha {alpha} \
#         --epochs 50 \
#         --batch_size 256 \
#         --lr 0.001 \
#         --patience 10
#     
#     print(f"\n✅ Alpha={alpha} complete!")

print("💡 Tip: Uncomment the code above to run grid search")

## 🔬 Advanced: Grid Search for Optimal Alpha (Fixed Fusion)

Test different fixed alpha values to find the optimal fusion weight.
This helps us understand the best balance between GNN and SASRec embeddings.

## Step 8: Analyze Results

Generate comparison tables and visualizations

In [None]:
# Generate analysis using the built-in script
print("="*70)
print("📊 Generating Analysis")
print("="*70)

!python experiments/analyze_results.py --save_csv

print("\n✅ Analysis complete!")

## Step 9: Display Results

Show performance comparison table

In [None]:
import pandas as pd
import os
import json
import glob

# Try to load results directly from experiments
result_folders = glob.glob('results/*_*')

if len(result_folders) == 0:
    print("❌ No results found. Run experiments first!")
else:
    print("\n" + "="*80)
    print("📊 OVERALL PERFORMANCE")
    print("="*80 + "\n")
    
    # Collect all results
    all_results = []
    for folder in result_folders:
        results_path = os.path.join(folder, 'results.json')
        if os.path.exists(results_path):
            with open(results_path, 'r') as f:
                results = json.load(f)
            
            # Extract model name
            folder_name = os.path.basename(folder)
            model_name = '_'.join(folder_name.split('_')[:-2])
            
            all_results.append({
                'Model': model_name,
                'HR@5': results['test_metrics']['HR@5'],
                'HR@10': results['test_metrics']['HR@10'],
                'HR@20': results['test_metrics']['HR@20'],
                'NDCG@5': results['test_metrics']['NDCG@5'],
                'NDCG@10': results['test_metrics']['NDCG@10'],
                'NDCG@20': results['test_metrics']['NDCG@20'],
                'MRR@10': results['test_metrics']['MRR@10']
            })
    
    if all_results:
        df = pd.DataFrame(all_results)
        df = df.sort_values('NDCG@10', ascending=False)
        
        # Display table
        print(df.to_string(index=False, float_format='%.4f'))
        
        # Highlight best model
        best = df.iloc[0]
        print("\n" + "="*80)
        print(f"🏆 BEST MODEL: {best['Model']}")
        print("="*80)
        print(f"  NDCG@10: {best['NDCG@10']:.4f}")
        print(f"  HR@10:   {best['HR@10']:.4f}")
        print(f"  MRR@10:  {best['MRR@10']:.4f}")
        print("="*80 + "\n")
        
        # Show improvement over baseline
        sasrec_row = df[df['Model'] == 'sasrec']
        if not sasrec_row.empty:
            sasrec_ndcg = sasrec_row.iloc[0]['NDCG@10']
            sasrec_hr = sasrec_row.iloc[0]['HR@10']
            hybrid_ndcg = best['NDCG@10']
            hybrid_hr = best['HR@10']
            ndcg_imp = ((hybrid_ndcg - sasrec_ndcg) / sasrec_ndcg) * 100
            hr_imp = ((hybrid_hr - sasrec_hr) / sasrec_hr) * 100
            print(f"📈 Improvement over SASRec baseline:")
            print(f"   NDCG@10: {ndcg_imp:+.2f}%")
            print(f"   HR@10:   {hr_imp:+.2f}%\n")
    else:
        print("❌ Could not parse results files")

## Step 10: Performance by User Group

Compare performance across different user history lengths

In [None]:
import glob
import json
import os
import pandas as pd

# Load grouped metrics
print("\n" + "="*80)
print("📊 PERFORMANCE BY USER GROUP")
print("="*80 + "\n")

result_folders = glob.glob('results/*_*')

if len(result_folders) == 0:
    print("❌ No results found.")
else:
    # Collect grouped results
    group_data = {'short': [], 'medium': [], 'long': []}
    
    for folder in result_folders:
        results_path = os.path.join(folder, 'results.json')
        if os.path.exists(results_path):
            with open(results_path, 'r') as f:
                results = json.load(f)
            
            # Extract model name
            folder_name = os.path.basename(folder)
            model_name = '_'.join(folder_name.split('_')[:-2])
            
            # Extract grouped metrics
            grouped = results.get('grouped_metrics', {})
            
            for group in ['short', 'medium', 'long']:
                if group in grouped:
                    group_data[group].append({
                        'Model': model_name,
                        'HR@10': grouped[group]['HR@10'],
                        'NDCG@10': grouped[group]['NDCG@10'],
                        'MRR@10': grouped[group]['MRR@10'],
                        'Count': grouped[group]['count']
                    })
    
    # Display each group
    for group_name in ['short', 'medium', 'long']:
        if group_data[group_name]:
            df_group = pd.DataFrame(group_data[group_name])
            df_group = df_group.sort_values('NDCG@10', ascending=False)
            
            print(f"\n{group_name.upper()} HISTORY USERS:")
            print("-" * 80)
            print(df_group.to_string(index=False, float_format='%.4f'))
            print()
        else:
            print(f"\n{group_name.upper()} HISTORY USERS:")
            print("-" * 80)
            print(f"⚠️  No {group_name} user data found (possibly no users in this range)")
            print()

In [None]:
import glob
import json
import os

# Check for alpha statistics in hybrid model results
print("\n" + "="*80)
print("🔍 ALPHA VALUES (Fusion Weights)")
print("="*80 + "\n")

hybrid_folders = [f for f in glob.glob('results/hybrid_*') if os.path.isdir(f)]

if not hybrid_folders:
    print("⚠️  No hybrid model results found. Alpha tracking only works for hybrid models.")
else:
    for folder in hybrid_folders:
        alpha_path = os.path.join(folder, 'alpha_stats.json')
        if os.path.exists(alpha_path):
            with open(alpha_path, 'r') as f:
                alpha_stats = json.load(f)
            
            folder_name = os.path.basename(folder)
            model_name = '_'.join(folder_name.split('_')[:-2])
            
            print(f"{model_name.upper()}:")
            print("-" * 80)
            
            for group in ['short', 'medium', 'long', 'overall']:
                if group in alpha_stats:
                    stats = alpha_stats[group]
                    if group != 'overall' and 'count' in stats:
                        print(f"  {group.capitalize():8s}: mean={stats['mean']:.3f}, std={stats['std']:.3f}, count={stats['count']}")
                    elif group == 'overall':
                        print(f"  {group.capitalize():8s}: mean={stats['mean']:.3f}, std={stats['std']:.3f}")
            print()
        else:
            # Show expected alpha values based on model type
            folder_name = os.path.basename(folder)
            model_name = '_'.join(folder_name.split('_')[:-2])
            
            print(f"{model_name.upper()}:")
            print("-" * 80)
            
            if 'discrete' in model_name:
                print("  Expected: Short=0.3, Medium=0.5, Long=0.7 (discrete bins)")
            elif 'fixed' in model_name:
                print("  Expected: All users = 0.5 (fixed fusion)")
            elif 'learnable' in model_name:
                print("  Expected: Learned during training (check model params)")
            elif 'continuous' in model_name:
                print("  Expected: Smooth function of sequence length")
            
            print("  ⚠️  Alpha statistics not saved (enable with track_alpha=True)")
            print()
    
    print("\n💡 Alpha interpretation:")
    print("   • α close to 0: More weight on GNN (collaborative)")
    print("   • α close to 1: More weight on SASRec (sequential)")
    print("   • α = 0.5: Equal balance")

## Step 10b: Alpha Statistics (Hybrid Models Only)

For hybrid models, check what fusion weights (alpha values) were used for different user groups.

## Step 11: Visualize Learning Curves

Plot training loss and validation NDCG over epochs

In [None]:
import json
import matplotlib.pyplot as plt
import glob
import os

# Find all experiment results
result_folders = glob.glob('results/*_*')

if len(result_folders) > 0:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Training Loss
    for folder in result_folders:
        history_path = os.path.join(folder, 'history.json')
        if os.path.exists(history_path):
            try:
                with open(history_path, 'r') as f:
                    history = json.load(f)
                
                # Extract model name from folder
                parts = os.path.basename(folder).split('_')
                model_name = '_'.join(parts[:-2]) if len(parts) > 2 else parts[0]
                
                if 'train_loss' in history and history['train_loss']:
                    ax1.plot(history['train_loss'], label=model_name, marker='o', markersize=3, linewidth=2)
            except Exception as e:
                print(f"⚠️  Could not load history from {folder}: {e}")
    
    ax1.set_xlabel('Epoch', fontsize=12)
    ax1.set_ylabel('BPR Loss', fontsize=12)
    ax1.set_title('Training Loss', fontsize=14, fontweight='bold')
    ax1.legend(fontsize=10)
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Validation NDCG@10
    for folder in result_folders:
        history_path = os.path.join(folder, 'history.json')
        if os.path.exists(history_path):
            try:
                with open(history_path, 'r') as f:
                    history = json.load(f)
                
                parts = os.path.basename(folder).split('_')
                model_name = '_'.join(parts[:-2]) if len(parts) > 2 else parts[0]
                
                if 'val_metrics' in history and history['val_metrics']:
                    ndcg_values = [m.get('NDCG@10', 0) for m in history['val_metrics']]
                    if ndcg_values:
                        ax2.plot(ndcg_values, label=model_name, marker='o', markersize=3, linewidth=2)
            except Exception as e:
                print(f"⚠️  Could not load validation metrics from {folder}: {e}")
    
    ax2.set_xlabel('Epoch', fontsize=12)
    ax2.set_ylabel('NDCG@10', fontsize=12)
    ax2.set_title('Validation NDCG@10', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    # Save plot
    os.makedirs('results', exist_ok=True)
    plt.savefig('results/learning_curves.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("✓ Saved to: results/learning_curves.png")
else:
    print("No results to plot. Run experiments first!")

## Step 12: Download Results

Create a zip file of all results for download

In [None]:
# Create zip of all results
import os

if os.path.exists('results') and os.listdir('results'):
    !zip -r results.zip results/
    
    print("\n✅ Success!")
    print("Download 'results.zip' from the Output tab (right sidebar) →")
    print("\nContains:")
    print("  • Model checkpoints (best_model.pt)")
    print("  • Training history (history.json)")
    print("  • Test metrics (results.json)")
    print("  • Comparison tables (CSV files, if generated)")
    print("  • Learning curves (PNG)")
    
    # Show what's in results
    result_folders = [d for d in os.listdir('results') if os.path.isdir(os.path.join('results', d))]
    print(f"\n📦 Packaged {len(result_folders)} experiment(s):")
    for folder in result_folders:
        print(f"  • {folder}")
else:
    print("⚠️  No results folder found. Run experiments first!")

---

## ✅ Summary

You've successfully:
1. ✅ Cloned the repository with preprocessed data
2. ✅ Installed all dependencies
3. ✅ Verified GPU availability
4. ✅ Tested the training pipeline
5. ✅ Trained recommendation models
6. ✅ Analyzed and compared results
7. ✅ Visualized learning curves

---

## 🔬 Key Results

**Dataset:** MovieLens-1M
- 6,034 users
- 3,533 items  
- 1M+ ratings
- 151,874 co-occurrence edges

**Models Trained:**
- SASRec (Transformer baseline)
- Hybrid with Discrete Fusion (length-adaptive)

**Metrics:** Hit Rate (HR), NDCG, MRR at K={5, 10, 20}
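
For reference, here is a minimal sketch of how these three metrics are commonly computed when each user has a single held-out target item. That leave-one-out setup is an assumption here, since the evaluation protocol is not spelled out in this notebook, and `metrics_at_k` is a hypothetical helper rather than the repository's evaluation code. Each entry in `ranks` is the 0-based position of the target item in that user's ranked list.

```python
import math
from typing import Dict, Iterable


def metrics_at_k(ranks: Iterable[int], k: int) -> Dict[str, float]:
    """Average HR@k, NDCG@k and MRR@k over users, one target item per user."""
    ranks = list(ranks)
    if not ranks:
        raise ValueError("ranks must not be empty")
    hr = ndcg = mrr = 0.0
    for r in ranks:
        if r < k:                           # target appears in the top-k list
            hr += 1.0                       # hit
            ndcg += 1.0 / math.log2(r + 2)  # positional discount; ideal DCG is 1
            mrr += 1.0 / (r + 1)            # reciprocal of the 1-based rank
    n = len(ranks)
    return {f"HR@{k}": hr / n, f"NDCG@{k}": ndcg / n, f"MRR@{k}": mrr / n}


# Three users: target ranked 1st, 2nd, and 16th (a miss at k=10).
print(metrics_at_k([0, 1, 15], k=10))
```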

---

## ⏱️ Training Time & Epochs FAQ

**Q: Why 50 epochs instead of 600 like in papers?**

**A:** We use **early stopping** (patience=10):
- Training automatically stops when validation NDCG@10 stops improving
- With 50 epochs max → usually converges at epoch 20-30 (~8-10 min GPU)
- With 600 epochs max → usually converges at epoch 30-40 (~35-45 min GPU)
- Performance difference: ~2-3% for 10x more training time
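
The patience rule can be illustrated with a short sketch. This is a common formulation of early stopping, not necessarily the exact logic in `experiments/run_experiment.py`, and `early_stopping_epoch` is a hypothetical helper. It replays a list of per-epoch validation scores and reports where training would stop.

```python
from typing import List, Tuple


def early_stopping_epoch(val_scores: List[float], patience: int) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs pass without a new
    best validation score; otherwise it runs through every epoch given.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch  # new best: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch               # waited long enough: stop
    return len(val_scores), best_epoch


# Toy validation NDCG@10 curve that peaks at epoch 3.
scores = [0.10, 0.20, 0.30, 0.29, 0.28, 0.27]
print(early_stopping_epoch(scores, patience=2))
```

With a larger patience the loop simply runs longer after the peak, which is why a higher epoch cap adds time without necessarily changing the best checkpoint.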

**Default (Fast):**
```bash
--epochs 50 --patience 10  # 8-10 min GPU, 95-98% of max performance
```

**Paper Setting (Thorough):**
```bash
--epochs 600 --patience 20  # 35-45 min GPU with early stopping, 100% performance
```

**Without Early Stopping (Not Recommended):**
```bash
--epochs 600 --patience 9999  # 80+ min GPU, risk of overfitting
```

---

## 🚀 Next Steps

**1. Match paper settings (600 epochs):**
```python
!python experiments/run_experiment.py \
    --model hybrid_discrete \
    --epochs 600 \
    --patience 20 \
    --batch_size 256 \
    --lr 0.001
```

**2. Experiment with hyperparameters:**
```python
!python experiments/run_experiment.py \
    --model hybrid_discrete \
    --epochs 100 \
    --batch_size 512 \
    --lr 0.0005 \
    --d_model 128 \
    --n_heads 4
```

**3. Try different fusion strategies:**
- `--model hybrid_learnable` - Per-user learned weights
- `--model hybrid_continuous` - Neural network fusion
- `--model hybrid_fixed` - Fixed alpha=0.5

**4. Analyze specific user groups:**
Check `results/comparison_*.csv` for performance on short/medium/long history users

---

## 📚 Resources

- **GitHub:** https://github.com/faroukq1/length-adaptive
- **Paper:** Length-Adaptive Hybrid Sequential Recommendation
- **Dataset:** MovieLens-1M (GroupLens)

---

**Questions or issues?** Check the README.md and EXPERIMENTS.md in the repository.