# HLA Locus-Specific Embeddings Analysis

This notebook demonstrates how to use the locus-specific embeddings generated by the `analyze_locus_embeddings.py` script and how to interpret the visualizations.

## Setup

First, let's set up our environment and import the necessary libraries.

In [None]:
import os
import sys
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from IPython.display import Image, display, Markdown, HTML
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

# Add parent directory to path to import modules
notebook_dir = Path().resolve()
project_dir = notebook_dir.parent if notebook_dir.name == 'notebooks' else notebook_dir
sys.path.insert(0, str(project_dir))

# Import our modules
try:
    from src.models.protbert import ProtBERTEncoder
    from src.analysis.visualization import HLAEmbeddingVisualizer
    from src.utils.logging import setup_logging
    logger = setup_logging(level="INFO")
    dependencies_ok = True
except ImportError as e:
    print(f"Error importing required package: {e}")
    dependencies_ok = False

# Set paths
data_dir = project_dir / "data"
sequence_file = data_dir / "processed" / "hla_sequences.pkl"
embeddings_dir = data_dir / "embeddings"
analysis_dir = data_dir / "analysis" / "locus_embeddings"

print(f"Project directory: {project_dir}")
print(f"Sequence file exists: {sequence_file.exists()}")
print(f"Analysis directory exists: {analysis_dir.exists()}")

## Check Generated Visualizations

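The helper below displays the plots that the analysis script saved for a given locus. It looks for files named `hla_<locus>_<type>.png` under `data/analysis/locus_embeddings/<class>/plots`, where `<type>` is one of `umap`, `tsne`, `pca`, or `groups`.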

In [None]:
def display_visualizations(locus, class_type):
    """Display visualizations for a specific locus"""
    plots_dir = analysis_dir / class_type / "plots"
    
    if not plots_dir.exists():
        print(f"Directory {plots_dir} does not exist. Run the analysis script first.")
        return
    
    # Render headings as Markdown so they display as headings, not raw "###" text
    display(Markdown(f"### HLA-{locus} Visualizations"))
    
    for viz_type in ["umap", "tsne", "pca", "groups"]:
        plot_file = plots_dir / f"hla_{locus}_{viz_type}.png"
        if plot_file.exists():
            display(Markdown(f"#### {viz_type.upper()} Projection"))
            display(Image(filename=str(plot_file)))
        else:
            print(f"{viz_type.upper()} plot not found for HLA-{locus}")

In [None]:
# Display visualizations for Class I loci
for locus in ["A", "B", "C"]:
    display_visualizations(locus, "class1")

In [None]:
# Display visualizations for Class II loci
for locus in ["DRB1", "DQB1", "DPB1"]:
    display_visualizations(locus, "class2")
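
Before scrolling through every image, it can help to see at a glance which plots exist. The sketch below is one way to do that, reusing the file naming pattern from `display_visualizations`. The helper name `plot_inventory` and the temporary demo directory are hypothetical; in the notebook you would pass `analysis_dir` instead.

```python
import tempfile
from pathlib import Path

VIZ_TYPES = ["umap", "tsne", "pca", "groups"]

def plot_inventory(analysis_dir, loci_by_class):
    """Map (locus, viz_type) to True/False depending on whether the PNG exists."""
    inventory = {}
    for class_type, loci in loci_by_class.items():
        plots_dir = Path(analysis_dir) / class_type / "plots"
        for locus in loci:
            for viz_type in VIZ_TYPES:
                plot_file = plots_dir / f"hla_{locus}_{viz_type}.png"
                inventory[(locus, viz_type)] = plot_file.exists()
    return inventory

# Demo against a throwaway directory containing a single plot file
with tempfile.TemporaryDirectory() as tmp:
    demo_plots = Path(tmp) / "class1" / "plots"
    demo_plots.mkdir(parents=True)
    (demo_plots / "hla_A_umap.png").touch()
    inventory = plot_inventory(tmp, {"class1": ["A", "B"]})

missing = sorted(key for key, found in inventory.items() if not found)
print(f"{len(missing)} of {len(inventory)} expected plots are missing")
```

With the real data, a call such as `plot_inventory(analysis_dir, {"class1": ["A", "B", "C"], "class2": ["DRB1", "DQB1", "DPB1"]})` would cover every locus shown above.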

## Load and Examine Embeddings

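Let's load the embeddings for a specific locus and examine them in more detail. The cells below assume each pickle file holds a dictionary that maps allele names to embedding vectors.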

In [None]:
def load_embeddings(locus, class_type):
    """Load embeddings for a specific locus"""
    embeddings_file = analysis_dir / class_type / "embeddings" / f"hla_{locus}_embeddings.pkl"
    
    if not embeddings_file.exists():
        print(f"File {embeddings_file} not found. Run the analysis script first.")
        return None
    
    try:
        with open(embeddings_file, 'rb') as f:
            embeddings = pickle.load(f)
        print(f"Loaded {len(embeddings)} embeddings for HLA-{locus}")
        return embeddings
    except Exception as e:
        print(f"Error loading embeddings: {e}")
        return None

In [None]:
# Load HLA-A embeddings
embeddings = load_embeddings("A", "class1")

if embeddings:
    # Show some basic stats for the first allele in the dictionary
    allele = next(iter(embeddings))
    embedding = embeddings[allele]
    print(f"Sample allele: {allele}")
    print(f"Embedding shape: {embedding.shape}")
    print(f"First 5 dimensions: {embedding[:5]}")
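
A natural next step after inspecting one vector is to compare alleles with each other. The following is a minimal sketch of pairwise cosine similarity, assuming `embeddings` is a dictionary of allele name to 1-D NumPy vector as the cell above suggests. The allele names and random vectors are placeholders so the snippet runs on its own, and `cosine_similarity_matrix` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return (alleles, matrix) where matrix[i, j] is the cosine similarity
    between the embeddings of alleles[i] and alleles[j]."""
    alleles = sorted(embeddings)
    # Stack the vectors into an (n_alleles, dim) matrix
    X = np.stack([np.asarray(embeddings[a], dtype=float) for a in alleles])
    # L2-normalize each row; the clip guards against division by zero
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)
    return alleles, X @ X.T

# Placeholder data standing in for the real HLA-A embeddings
rng = np.random.default_rng(0)
demo_embeddings = {
    name: rng.normal(size=16) for name in ["A*01:01", "A*02:01", "A*03:01"]
}

alleles, sim = cosine_similarity_matrix(demo_embeddings)
print(alleles)
print(sim.round(2))
```

Passing the dictionary returned by `load_embeddings("A", "class1")` in place of `demo_embeddings` should give the same kind of matrix for the real alleles, provided the stored vectors all have the same length.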

## Custom Analysis and Visualization

To perform custom analysis and visualization, initialize the encoder and visualizer:

In [None]:
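# Initialize encoder and visualizer; both stay None if setup is skipped or fails
encoder = None
visualizer = None
if dependencies_ok and sequence_file.exists():
    try:
        encoder = ProtBERTEncoder(
            sequence_file=sequence_file,
            cache_dir=embeddings_dir
        )
        visualizer = HLAEmbeddingVisualizer(encoder)
        print(f"Initialized encoder with {len(encoder.sequences)} sequences")
    except Exception as e:
        encoder = visualizer = None
        print(f"Error initializing encoder: {e}")
else:
    print("Skipping encoder setup: missing dependencies or sequence file.")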
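
The methods of `HLAEmbeddingVisualizer` are not shown in this notebook, so the sketch below does not call it. Instead it illustrates one kind of custom analysis you might run on a loaded embedding dictionary: a 2-D PCA projection computed with NumPy's SVD. The helper name `pca_project` and the random placeholder vectors are assumptions made for illustration.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project allele embeddings onto their top principal components.

    Returns (alleles, coords, explained) where coords has shape
    (n_alleles, n_components) and explained holds the fraction of
    variance captured by each returned component.
    """
    alleles = sorted(embeddings)
    X = np.stack([np.asarray(embeddings[a], dtype=float) for a in alleles])
    # Center each dimension so the SVD describes variance around the mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return alleles, coords, explained[:n_components]

# Placeholder data standing in for real embeddings
rng = np.random.default_rng(1)
demo_names = ["A*01:01", "A*02:01", "A*03:01", "A*11:01", "A*24:02", "A*68:01"]
demo_embeddings = {name: rng.normal(size=32) for name in demo_names}

alleles, coords, explained = pca_project(demo_embeddings)
for name, (x, y) in zip(alleles, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
print(f"Variance explained by the two components: {explained.sum():.1%}")
```

The returned `coords` can be passed to `plt.scatter(coords[:, 0], coords[:, 1])` to draw a plot comparable to the saved PCA images, although the script's own plots may use different preprocessing.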

## How to Run the Analysis

If you don't see any visualizations above, you need to run the analysis script first:

```bash
# For all loci
python scripts/run_locus_analysis.py

# For only Class I loci with debug information
python scripts/run_locus_analysis.py --class1-only --debug

# For only Class II loci with debug information
python scripts/run_locus_analysis.py --class2-only --debug
```

These commands generate the embeddings and plots that the cells above read from `data/analysis/locus_embeddings`.
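
If you prefer to stay inside the notebook, the same commands can be launched with `subprocess`. This is a minimal sketch that uses only the script path and flags listed above; the helper name `build_analysis_command` is hypothetical, and the script is only run if it is found relative to the current working directory.

```python
import subprocess
import sys
from pathlib import Path

def build_analysis_command(class_filter=None, debug=False,
                           script="scripts/run_locus_analysis.py"):
    """Build the argument list for the analysis script.

    class_filter: None for all loci, "class1" or "class2" to restrict the run.
    """
    cmd = [sys.executable, script]
    if class_filter == "class1":
        cmd.append("--class1-only")
    elif class_filter == "class2":
        cmd.append("--class2-only")
    elif class_filter is not None:
        raise ValueError(f"Unknown class filter: {class_filter!r}")
    if debug:
        cmd.append("--debug")
    return cmd

cmd = build_analysis_command(class_filter="class1", debug=True)
print("Command:", " ".join(cmd))

# Only launch the script when it exists relative to the working directory
if Path(cmd[1]).exists():
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"Exit code: {result.returncode}")
    print(result.stdout[-2000:])  # show the tail of the script output
else:
    print("Script not found; change to the project root before running.")
```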