# Length-Adaptive Sequential Recommendation - Paper Experiments

**Publication-Quality Training on MovieLens-1M**

---

## 🎓 Paper-Level Configuration

**Training Settings:**
- Max Epochs: 200 (with early stopping)
- Patience: 20
- Expected convergence: epoch 40-60
- Batch size: 256
- Learning rate: 0.001
- Model: d_model=64, n_heads=2, n_blocks=2
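
The early-stopping rule in these settings (up to 200 epochs, patience 20) can be illustrated with a minimal sketch. The `EarlyStopper` class and `run_with_early_stopping` function are hypothetical helpers written for this note, not the repository's trainer, and the sketch assumes a validation score where higher is better (for example NDCG@10).

```python
class EarlyStopper:
    """Track the best validation score and signal when to stop (hypothetical helper)."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0  # consecutive epochs without improvement

    def step(self, epoch, score):
        """Record one epoch's validation score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_with_early_stopping(val_scores, max_epochs=200, patience=20):
    """Replay per-epoch validation scores; return (best_epoch, last_epoch_run)."""
    stopper = EarlyStopper(patience)
    last_epoch = -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        last_epoch = epoch
        if stopper.step(epoch, score):
            break
    return stopper.best_epoch, last_epoch
```

With patience 20, a run whose best validation score falls at epoch 40-60 would stop roughly 20 epochs later, well before the 200-epoch cap.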

**Models to Train:**
1. ✅ SASRec (Transformer baseline)
2. ✅ Hybrid Fixed (α=0.5)
3. ✅ Hybrid Discrete (bin-based fusion)
4. ✅ Hybrid Learnable (learned weights)
5. ✅ Hybrid Continuous (neural fusion)
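
To make the difference between the hybrid variants concrete, here is a rough sketch of three ways a fusion weight α could be chosen from a user's history length. This is an illustration under assumptions, not the repository's model code: it assumes α weights a graph-based embedding against a sequence embedding, and the bin thresholds, bin values, and the parameters `w` and `b` are all hypothetical. In a learnable variant the bin values would presumably be trainable parameters rather than constants.

```python
import math


def fuse(seq_emb, graph_emb, alpha):
    """Convex combination: alpha * graph_emb + (1 - alpha) * seq_emb."""
    return [alpha * g + (1.0 - alpha) * s for s, g in zip(seq_emb, graph_emb)]


def alpha_fixed(length):
    """Fixed fusion: the same weight for every user, regardless of history length."""
    return 0.5


def alpha_discrete(length, bins=((10, 0.8), (50, 0.5)), default=0.2):
    """Bin-based fusion: pick alpha by history-length bin (hypothetical thresholds)."""
    for upper, alpha in bins:
        if length <= upper:
            return alpha
    return default


def alpha_continuous(length, w=-1.0, b=3.0):
    """Smooth fusion: sigmoid over log-length; w and b stand in for learned parameters."""
    return 1.0 / (1.0 + math.exp(-(w * math.log1p(length) + b)))


# Example: the same pair of embeddings fused for a short and a long history.
seq_emb, graph_emb = [0.0, 2.0], [2.0, 0.0]
for n in (5, 100):
    print(n, fuse(seq_emb, graph_emb, alpha_discrete(n)))
```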

**Time Estimate: ~3-4 hours total on a T4 GPU**

---

## 📋 Quick Start

1. Enable the GPU T4 accelerator
2. Enable Internet access
3. Run Steps 1-10 sequentially
4. Download `results_paper.zip` at the end

## Step 1: Clone Repository

In [None]:
# Clone repository
!git clone https://github.com/faroukq1/length-adaptive.git

# Change to project directory
%cd length-adaptive

# Verify structure
!ls -lh experiments/

print("\n✅ Repository cloned successfully!")

## Step 2: Install Dependencies

In [None]:
# Install required packages quietly
!pip install -q torch-geometric tqdm scikit-learn pandas matplotlib

print("✓ All dependencies installed successfully!")

## Step 3: Verify GPU

In [None]:
# Check GPU availability
!python check_gpu.py

## Step 4: Prepare Data

Downloads MovieLens-1M and preprocesses if needed (2-3 minutes)
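
For orientation, each line of the raw `ratings.dat` file has the form `UserID::MovieID::Rating::Timestamp`. The sketch below shows one way such lines could be turned into per-user, time-ordered item sequences. The `build_sequences` function is a hypothetical illustration; the actual logic lives in `src.data.preprocess` and may filter, remap IDs, or split the data differently.

```python
from collections import defaultdict


def build_sequences(lines):
    """Group MovieLens-1M rating lines by user and order each user's items by timestamp."""
    events = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, movie_id, _rating, timestamp = line.split("::")
        events[int(user_id)].append((int(timestamp), int(movie_id)))
    # Sort each user's (timestamp, movie) pairs, then keep only the movie IDs.
    return {user: [movie for _, movie in sorted(pairs)] for user, pairs in events.items()}


# Small inline sample in the ratings.dat format.
sample = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "1::914::3::978301968",
    "2::1357::5::978298709",
]
sequences = build_sequences(sample)
print(sequences)
```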

In [None]:
import os

# Check if preprocessed data exists
data_file = 'data/ml-1m/processed/sequences.pkl'
graph_file = 'data/graphs/cooccurrence_graph.pkl'

print("="*70)
print("🔍 Checking Data Files")
print("="*70)

if os.path.exists(data_file):
    print(f"✅ Sequential data found: {data_file}")
    print(f"   Size: {os.path.getsize(data_file) / 1024 / 1024:.2f} MB")
else:
    print(f"❌ Sequential data NOT found: {data_file}")

if os.path.exists(graph_file):
    print(f"✅ Graph data found: {graph_file}")
    print(f"   Size: {os.path.getsize(graph_file) / 1024 / 1024:.2f} MB")
else:
    print(f"❌ Graph data NOT found: {graph_file}")

# Check raw data
raw_file = 'data/ml-1m/raw/ml-1m/ratings.dat'
if os.path.exists(raw_file):
    print(f"✅ Raw data found: {raw_file}")
else:
    print(f"❌ Raw data NOT found: {raw_file}")

print("="*70)

# If data is missing, run preprocessing
if not os.path.exists(data_file) or not os.path.exists(graph_file):
    print("\n🔧 Running preprocessing...")
    print("This will take 2-3 minutes.\n")
    
    # Download MovieLens-1M if needed
    if not os.path.exists(raw_file):
        print("📥 Downloading MovieLens-1M dataset...")
        !mkdir -p data/ml-1m/raw
        !wget -q http://files.grouplens.org/datasets/movielens/ml-1m.zip
        !unzip -q ml-1m.zip
        !mv ml-1m data/ml-1m/raw/
        !rm -f ml-1m.zip
        print("✅ Download complete!\n")
    
    # Run preprocessing
    print("🔄 Preprocessing sequential data...")
    !python -m src.data.preprocess
    
    # Build graph
    print("\n🔄 Building co-occurrence graph...")
    !python -m src.data.graph_builder
    
    print("\n✅ Preprocessing complete!")
    print("="*70)
else:
    print("\n✅ All data files ready!")
    print("="*70)

## Step 5: Run Paper Experiments

**⏱️ Time: ~3-4 hours total (GPU T4)**

This will train all 5 models sequentially with paper-quality settings:
- 200 max epochs with early stopping (patience=20)
- Models typically converge at epoch 40-60
- Full ablation study for publication

In [None]:
# Run all paper experiments
print("="*80)
print("🎓 PAPER-LEVEL EXPERIMENTS")
print("="*80)
print("")
print("Training 5 models with 200 epochs, early stopping patience=20")
print("Expected convergence: epoch 40-60")
print("Time estimate: ~3-4 hours with GPU T4")
print("")
print("Models:")
print("  1. SASRec (baseline)")
print("  2. Hybrid Fixed (α=0.5)")
print("  3. Hybrid Discrete (bin-based)")
print("  4. Hybrid Learnable (learned weights)")
print("  5. Hybrid Continuous (neural fusion)")
print("")
print("="*80)

# Run the paper experiments script
!bash scripts/run_paper_experiments.sh

print("\n✅ All paper experiments complete!")

## Step 6: Analyze Results

Generate comprehensive comparison tables and statistics
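
The tables report HR@K, NDCG@K, and MRR@K. As a reminder of what these measure, here is a minimal sketch assuming a leave-one-out setup in which each user has exactly one held-out target item and `rank` is its 1-based position in the model's ranked list. The function names are hypothetical, and the repository's evaluator (for example its candidate set or negative sampling) may differ.

```python
import math


def metrics_at_k(rank, k=10):
    """Metrics for one user whose single held-out item sits at 1-based position `rank`."""
    if rank > k:
        return {"HR": 0.0, "NDCG": 0.0, "MRR": 0.0}
    return {
        "HR": 1.0,                           # the target appears in the top k
        "NDCG": 1.0 / math.log2(rank + 1),   # ideal DCG is 1 with a single relevant item
        "MRR": 1.0 / rank,                   # reciprocal rank, truncated at k
    }


def average_metrics(ranks, k=10):
    """Average the per-user metrics over a list of target-item ranks."""
    totals = {"HR": 0.0, "NDCG": 0.0, "MRR": 0.0}
    for rank in ranks:
        for name, value in metrics_at_k(rank, k).items():
            totals[name] += value
    return {name: total / len(ranks) for name, total in totals.items()}


print(average_metrics([1, 3, 11], k=10))
```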

In [None]:
# Generate analysis
print("="*70)
print("📊 Generating Analysis")
print("="*70)

!python experiments/analyze_results.py --save_csv

print("\n✅ Analysis complete!")

## Step 7: Display Results

Show comprehensive performance comparison

In [None]:
import pandas as pd
import os
import json
import glob

# Try to load results directly from experiments
result_folders = glob.glob('results/*_*')

if len(result_folders) == 0:
    print("❌ No results found. Run experiments first!")
else:
    print("\n" + "="*80)
    print("📊 OVERALL PERFORMANCE")
    print("="*80 + "\n")
    
    # Collect all results
    all_results = []
    for folder in result_folders:
        results_path = os.path.join(folder, 'results.json')
        if os.path.exists(results_path):
            with open(results_path, 'r') as f:
                results = json.load(f)
            
            # Extract model name
            folder_name = os.path.basename(folder)
            model_name = '_'.join(folder_name.split('_')[:-2])
            
            all_results.append({
                'Model': model_name,
                'HR@5': results['test_metrics']['HR@5'],
                'HR@10': results['test_metrics']['HR@10'],
                'HR@20': results['test_metrics']['HR@20'],
                'NDCG@5': results['test_metrics']['NDCG@5'],
                'NDCG@10': results['test_metrics']['NDCG@10'],
                'NDCG@20': results['test_metrics']['NDCG@20'],
                'MRR@10': results['test_metrics']['MRR@10']
            })
    
    if all_results:
        df = pd.DataFrame(all_results)
        df = df.sort_values('NDCG@10', ascending=False)
        
        # Display table
        print(df.to_string(index=False, float_format='%.4f'))
        
        # Highlight best model
        best = df.iloc[0]
        print("\n" + "="*80)
        print(f"🏆 BEST MODEL: {best['Model']}")
        print("="*80)
        print(f"  NDCG@10: {best['NDCG@10']:.4f}")
        print(f"  HR@10:   {best['HR@10']:.4f}")
        print(f"  MRR@10:  {best['MRR@10']:.4f}")
        print("="*80 + "\n")
        
        # Show improvement of the best model over the SASRec baseline
        # (skipped when SASRec itself is the best model)
        sasrec_row = df[df['Model'] == 'sasrec']
        if not sasrec_row.empty and best['Model'] != 'sasrec':
            sasrec_ndcg = sasrec_row.iloc[0]['NDCG@10']
            sasrec_hr = sasrec_row.iloc[0]['HR@10']
            best_ndcg = best['NDCG@10']
            best_hr = best['HR@10']
            ndcg_imp = ((best_ndcg - sasrec_ndcg) / sasrec_ndcg) * 100
            hr_imp = ((best_hr - sasrec_hr) / sasrec_hr) * 100
            print(f"📈 Improvement of {best['Model']} over SASRec baseline:")
            print(f"   NDCG@10: {ndcg_imp:+.2f}%")
            print(f"   HR@10:   {hr_imp:+.2f}%\n")
    else:
        print("❌ Could not parse results files")

## Step 8: Performance by User Group
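
The grouped metrics split users into short, medium, and long history buckets. A minimal sketch of that kind of bucketing follows. The thresholds `short_max=10` and `medium_max=50` and both function names are hypothetical; the repository's actual cut-offs are not stated in this notebook.

```python
def length_group(history_length, short_max=10, medium_max=50):
    """Assign a user to a bucket by history length (hypothetical thresholds)."""
    if history_length <= short_max:
        return "short"
    if history_length <= medium_max:
        return "medium"
    return "long"


def group_average(records, short_max=10, medium_max=50):
    """Average a per-user metric within each bucket.

    `records` is an iterable of (history_length, metric_value) pairs.
    Returns {group: (mean, count)} for the groups that have at least one user.
    """
    sums, counts = {}, {}
    for history_length, value in records:
        group = length_group(history_length, short_max, medium_max)
        sums[group] = sums.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return {group: (sums[group] / counts[group], counts[group]) for group in sums}


print(group_average([(5, 0.2), (8, 0.4), (30, 0.5), (200, 0.9)]))
```

A bucket with no users is simply absent from the result, which matches the "no users in this range" case handled by the display code in this step.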

In [None]:
import glob
import json
import os
import pandas as pd

# Load grouped metrics
print("\n" + "="*80)
print("📊 PERFORMANCE BY USER GROUP")
print("="*80 + "\n")

result_folders = glob.glob('results/*_*')

if len(result_folders) == 0:
    print("❌ No results found.")
else:
    # Collect grouped results
    group_data = {'short': [], 'medium': [], 'long': []}
    
    for folder in result_folders:
        results_path = os.path.join(folder, 'results.json')
        if os.path.exists(results_path):
            with open(results_path, 'r') as f:
                results = json.load(f)
            
            # Extract model name
            folder_name = os.path.basename(folder)
            model_name = '_'.join(folder_name.split('_')[:-2])
            
            # Extract grouped metrics
            grouped = results.get('grouped_metrics', {})
            
            for group in ['short', 'medium', 'long']:
                if group in grouped:
                    group_data[group].append({
                        'Model': model_name,
                        'HR@10': grouped[group]['HR@10'],
                        'NDCG@10': grouped[group]['NDCG@10'],
                        'MRR@10': grouped[group]['MRR@10'],
                        'Count': grouped[group]['count']
                    })
    
    # Display each group
    for group_name in ['short', 'medium', 'long']:
        if group_data[group_name]:
            df_group = pd.DataFrame(group_data[group_name])
            df_group = df_group.sort_values('NDCG@10', ascending=False)
            
            print(f"\n{group_name.upper()} HISTORY USERS:")
            print("-" * 80)
            print(df_group.to_string(index=False, float_format='%.4f'))
            print()
        else:
            print(f"\n{group_name.upper()} HISTORY USERS:")
            print("-" * 80)
            print(f"⚠️  No {group_name} user data found (possibly no users in this range)")
            print()

## Step 9: Visualize Learning Curves

In [None]:
import json
import matplotlib.pyplot as plt
import glob
import os

# Find all experiment results
result_folders = glob.glob('results/*_*')

if len(result_folders) > 0:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Training Loss
    for folder in result_folders:
        history_path = os.path.join(folder, 'history.json')
        if os.path.exists(history_path):
            try:
                with open(history_path, 'r') as f:
                    history = json.load(f)
                
                # Extract model name from folder
                parts = os.path.basename(folder).split('_')
                model_name = '_'.join(parts[:-2]) if len(parts) > 2 else parts[0]
                
                if 'train_loss' in history and history['train_loss']:
                    ax1.plot(history['train_loss'], label=model_name, marker='o', markersize=3, linewidth=2)
            except Exception as e:
                print(f"⚠️  Could not load history from {folder}: {e}")
    
    ax1.set_xlabel('Epoch', fontsize=12)
    ax1.set_ylabel('BPR Loss', fontsize=12)
    ax1.set_title('Training Loss', fontsize=14, fontweight='bold')
    ax1.legend(fontsize=10)
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Validation NDCG@10
    for folder in result_folders:
        history_path = os.path.join(folder, 'history.json')
        if os.path.exists(history_path):
            try:
                with open(history_path, 'r') as f:
                    history = json.load(f)
                
                parts = os.path.basename(folder).split('_')
                model_name = '_'.join(parts[:-2]) if len(parts) > 2 else parts[0]
                
                if 'val_metrics' in history and history['val_metrics']:
                    ndcg_values = [m.get('NDCG@10', 0) for m in history['val_metrics']]
                    if ndcg_values:
                        ax2.plot(ndcg_values, label=model_name, marker='o', markersize=3, linewidth=2)
            except Exception as e:
                print(f"⚠️  Could not load validation metrics from {folder}: {e}")
    
    ax2.set_xlabel('Epoch', fontsize=12)
    ax2.set_ylabel('NDCG@10', fontsize=12)
    ax2.set_title('Validation NDCG@10', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    # Save plot
    os.makedirs('results', exist_ok=True)
    plt.savefig('results/learning_curves.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("✓ Saved to: results/learning_curves.png")
else:
    print("No results to plot. Run experiments first!")

## Step 10: Download Results

Creates a zip file with all results for local analysis
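
If the `zip` command-line tool is not available in the runtime, the same archive could be produced with Python's standard library. This is an optional alternative sketch, not part of the original notebook; `zip_results` is a hypothetical helper built on `shutil.make_archive`.

```python
import os
import shutil


def zip_results(results_dir="results", archive_stem="results_paper"):
    """Zip `results_dir` into `<archive_stem>.zip`; return the archive path, or None if empty."""
    if not os.path.isdir(results_dir) or not os.listdir(results_dir):
        return None
    results_dir = os.path.normpath(os.path.abspath(results_dir))
    # Archive the folder itself so entries look like "results/<run>/results.json".
    return shutil.make_archive(
        archive_stem,
        "zip",
        root_dir=os.path.dirname(results_dir),
        base_dir=os.path.basename(results_dir),
    )
```

Calling `zip_results()` from the project directory would write `results_paper.zip` next to the `results/` folder.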

In [None]:
# Create zip of all results
import os

if os.path.exists('results') and os.listdir('results'):
    !zip -r results_paper.zip results/
    
    print("\n✅ Success!")
    print("Download 'results_paper.zip' from the Output tab (right sidebar) →")
    print("\nContains:")
    print("  • Model checkpoints (best_model.pt)")
    print("  • Training history (history.json)")
    print("  • Test metrics (results.json)")
    print("  • Comparison tables (CSV files)")
    print("  • Learning curves (PNG)")
    
    # Show what's in results
    result_folders = [d for d in os.listdir('results') if os.path.isdir(os.path.join('results', d))]
    print(f"\n📦 Packaged {len(result_folders)} experiment(s):")
    for folder in result_folders:
        print(f"  • {folder}")
else:
    print("⚠️  No results folder found. Run experiments first!")

---

## ✅ Paper Experiments Complete!

You now have publication-quality results for:
- SASRec baseline
- 4 hybrid fusion strategies
- Complete ablation study
- Performance by user groups
- Learning curves visualization

**Next Steps:**
1. Download `results_paper.zip`
2. Use for paper tables and figures
3. Report best model and improvements over baseline

---

## 📚 Citation

```
@article{yourname2026length,
  title={Length-Adaptive Hybrid Sequential Recommendation},
  author={Your Name},
  journal={arXiv preprint},
  year={2026}
}
```