# Enformer Embeddings Tutorial

This tutorial shows how to load and inspect embeddings extracted from DNA sequences with the Enformer model.

## Table of Contents
1. [Loading Embeddings](#1-loading-embeddings)
2. [Inspecting the Data](#2-inspecting-the-data)

## 1. Loading Embeddings

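First, load the embeddings from the `.npz` file. The archive stores the sequence IDs under the key `ids` and the embedding array under the key `embeddings`.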

In [2]:
import numpy as np

In [3]:
# Load embeddings from npz file
embeddings_file = "test_files/embeddings.npz"  # Update this path to your embeddings file

# Load the data
data = np.load(embeddings_file)

# Extract arrays
sequence_ids = data["ids"]
embeddings = data["embeddings"]

print(f"Loaded embeddings from: {embeddings_file}")
print(f"Available keys in npz file: {list(data.keys())}")

Loaded embeddings from: test_files/embeddings.npz
Available keys in npz file: ['ids', 'embeddings']
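
The loading step above can be wrapped in a small helper that also checks the two arrays line up. This is a minimal sketch, not part of the original notebook: the function name `load_embeddings` and the temporary demo file are hypothetical, and the only assumption carried over from the tutorial is that the archive holds the keys `ids` and `embeddings`.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_embeddings(path):
    """Load 'ids' and 'embeddings' from an npz file and check that they align.

    Hypothetical helper for illustration; the key names follow the tutorial.
    """
    # np.load returns an NpzFile, which can be used as a context manager
    # so the underlying file handle is closed once the arrays are read.
    with np.load(path) as data:
        missing = {"ids", "embeddings"} - set(data.files)
        if missing:
            raise KeyError(f"Missing keys in {path}: {sorted(missing)}")
        ids = data["ids"]
        emb = data["embeddings"]
    # Each ID should correspond to one entry along the first axis.
    if len(ids) != emb.shape[0]:
        raise ValueError(f"{len(ids)} IDs but {emb.shape[0]} embedding entries")
    return ids, emb


# Small synthetic archive standing in for a real embeddings file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_embeddings.npz"
    np.savez(
        demo_path,
        ids=np.array(["seq1", "seq2", "seq3"]),
        embeddings=np.zeros((3, 4, 8), dtype=np.float32),
    )
    demo_ids, demo_emb = load_embeddings(demo_path)

print(demo_ids, demo_emb.shape)
```

To use it on real data, call `load_embeddings("test_files/embeddings.npz")` with the path to your own file.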


## 2. Inspecting the Data

Next, check the type and shape of each array, then the dtype and summary statistics of the embedding values.


In [4]:
# Inspect sequence IDs
print("Sequence IDs:")
print(f"  Type: {type(sequence_ids)}")
print(f"  Shape: {sequence_ids.shape}")
print(f"  Number of sequences: {len(sequence_ids)}")
print(f"  First 5 IDs: {sequence_ids[:5]}")
print()

# Inspect embeddings
print("Embeddings:")
print(f"  Type: {type(embeddings)}")
print(f"  Shape: {embeddings.shape}")
print(f"  Dtype: {embeddings.dtype}")
print(f"  Min value: {embeddings.min():.4f}")
print(f"  Max value: {embeddings.max():.4f}")
print(f"  Mean value: {embeddings.mean():.4f}")
print(f"  Std value: {embeddings.std():.4f}")


Sequence IDs:
  Type: <class 'numpy.ndarray'>
  Shape: (3,)
  Number of sequences: 3
  First 5 IDs: ['seq1' 'seq2' 'seq3']

Embeddings:
  Type: <class 'numpy.ndarray'>
  Shape: (3, 896, 3072)
  Dtype: float32
  Min value: -0.1636
  Max value: 0.6920
  Mean value: 0.0315
  Std value: 0.1442
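
The shape `(3, 896, 3072)` reads as (sequences, positions, channels). In the published Enformer architecture, 896 matches the number of 128 bp output bins and 3072 matches the channel width of the trunk output; that interpretation is assumed here rather than confirmed by the file itself. The sketch below uses a synthetic array of the same shape and dtype (not real Enformer output) to show the in-memory size and one common way to reduce each sequence to a single vector by mean-pooling over positions. The variable names are hypothetical.

```python
import numpy as np

# Synthetic stand-in with the same shape and dtype as the tutorial's array.
# The values are random placeholders, not real Enformer output.
demo_ids = np.array(["seq1", "seq2", "seq3"])
rng = np.random.default_rng(0)
demo_embeddings = rng.standard_normal((3, 896, 3072), dtype=np.float32)

# Memory footprint: 3 * 896 * 3072 values at 4 bytes each.
size_mib = demo_embeddings.nbytes / 2**20
print(f"In-memory size: {size_mib:.1f} MiB")  # In-memory size: 31.5 MiB

# Mean-pool over the position axis (axis=1) to get one vector per sequence.
pooled = demo_embeddings.mean(axis=1)
print(f"Pooled shape: {pooled.shape}")  # Pooled shape: (3, 3072)

# Map each ID to its pooled vector; this relies on the IDs and the first
# axis of the embeddings being stored in the same order.
pooled_by_id = {str(seq_id): vec for seq_id, vec in zip(demo_ids, pooled)}
print(f"seq2 vector shape: {pooled_by_id['seq2'].shape}")  # (3072,)
```

Mean-pooling discards positional information, so it suits sequence-level comparisons; keep the full `(896, 3072)` slice when position along the sequence matters.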
