This notebook shows how to download the BioFM model (`m42-health/BioFM-265M`), convert a VCF file into an annotated dataset with the BioToken framework, and extract variant embeddings for analysis.

In [None]:
from biofm_eval import AnnotatedModel, AnnotationTokenizer, Embedder, VCFConverter
import torch

# Define the model and tokenizer paths
MODEL_PATH = 'm42-health/BioFM-265M'
TOKENIZER_PATH = 'm42-health/BioFM-265M'

# Load the pre-trained BioFM model and tokenizer
model = AnnotatedModel.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH)

# Initialize the embedder
embedder = Embedder(model, tokenizer)

# Configure VCF conversion with gene annotations and reference genome
vcf_converter = VCFConverter(
    gene_annotation_path='./gencode.v38.annotation.gff3',
    reference_genome_path='./GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna'
)

# Convert VCF file into an annotated dataset
annotated_dataset = vcf_converter.vcf_to_annotated_dataset(vcf_path='./HG01779_b.vcf.gz', max_variants=200)

# Extract the embeddings
embeddings = embedder.get_dataset_embeddings(annotated_dataset)
print(embeddings)
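
The cell above reads three local files (the GENCODE annotation, the reference genome, and the VCF), so a missing download would only surface once the converter runs. A minimal sketch of a pre-flight check using only the standard library is shown here. The helper name `find_missing_files` is hypothetical and is not part of `biofm_eval`; the paths are copied from the cell above.

```python
from pathlib import Path

# Paths copied from the cell above; adjust them to wherever the files were downloaded.
REQUIRED_FILES = [
    './gencode.v38.annotation.gff3',
    './GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna',
    './HG01779_b.vcf.gz',
]


def find_missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files.

    Hypothetical helper for illustration; not part of biofm_eval.
    """
    return [p for p in paths if not Path(p).is_file()]


missing = find_missing_files(REQUIRED_FILES)
if missing:
    print('Missing input files:', ', '.join(missing))
else:
    print('All input files found.')
```

Running this before constructing `VCFConverter` gives a readable message listing exactly which inputs still need to be fetched.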


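The following code block sketches a Plotly scatter plot for checking whether variant embeddings form clusters. It uses a random placeholder array rather than the embeddings extracted above, so the plot shows no meaningful structure until real two-dimensional coordinates are substituted.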

In [None]:
import plotly.express as px
import numpy as np

# Placeholder only: a random (100, 2) array stands in for the embeddings.
# Substitute real embeddings reduced to two columns before interpreting the plot.
dummy_embeddings = np.random.default_rng(0).random((100, 2))

fig = px.scatter(
    x=dummy_embeddings[:, 0],
    y=dummy_embeddings[:, 1],
    title='Visualization of Variant Embeddings',
)
fig.show()
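
Model embeddings usually have far more than two dimensions, so they need to be reduced before they can replace the placeholder array. One common option is principal component analysis (PCA); the sketch below implements it with NumPy's SVD. It assumes the embeddings can be converted to a 2-D float array with one row per variant. The return type of `get_dataset_embeddings` is not shown in this notebook, so that conversion is an assumption, and `project_to_2d` is a hypothetical helper name. Synthetic data is used here so the block runs on its own.

```python
import numpy as np


def project_to_2d(embeddings):
    """Project an (n_samples, n_features) array onto its first two principal components.

    Hypothetical helper for illustration; assumes the input converts to a 2-D float array.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    if X.ndim != 2 or X.shape[0] < 2 or X.shape[1] < 2:
        raise ValueError('expected a 2-D array with at least 2 rows and 2 columns')
    # Center each feature so the SVD captures variance rather than the mean offset.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Synthetic stand-in for real embeddings: 200 variants, 16 features each.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 16))
coords = project_to_2d(fake_embeddings)
print(coords.shape)  # (200, 2)
```

The two columns `coords[:, 0]` and `coords[:, 1]` can then be passed to `px.scatter` in place of the placeholder columns.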






***

### [Created with BioloGPT](https://biologpt.com/?q=Paper%20Review%3A%20BioToken%20and%20BioFM%20-%20Biologically-Informed%20Tokenization%20Enables%20Accurate%20and%20Efficient%20Genomic%20Foundation%20Models)
***