# 02 | TFBS-VCF Intersection using PyBedTools

This notebook demonstrates how to run the TFBS × VCF intersection pipeline using `deepvregulome.intersect`.
The pipeline loads transcription factor binding site (TFBS) lists and somatic mutation VCFs, then performs a bedtools-style intersection to identify variants that overlap TFBS regions.
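
To make the coordinate logic concrete, here is a minimal pure-Python sketch of what a bedtools-style intersection computes for single-nucleotide variants. It is not the `deepvregulome.intersect` implementation; the function name `variants_in_regions` and the tuple layouts are assumptions made for illustration. BED intervals are 0-based and half-open, while VCF positions are 1-based, so a variant at `POS` occupies the 0-based interval `[POS - 1, POS)`.

```python
from collections import defaultdict


def variants_in_regions(variants, regions):
    """Return the (chrom, pos) variants that fall inside any region.

    variants: iterable of (chrom, pos) with 1-based VCF positions.
    regions:  iterable of (chrom, start, end) in BED coordinates
              (0-based start, exclusive end).
    """
    # Group regions by chromosome so each variant is only compared
    # against intervals on its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    hits = []
    for chrom, pos in variants:
        zero_based = pos - 1  # convert the VCF position to BED coordinates
        if any(start <= zero_based < end for start, end in by_chrom.get(chrom, [])):
            hits.append((chrom, pos))
    return hits


# Toy data, not taken from any real TFBS list or VCF.
tfbs_regions = [("chr1", 100, 200)]
somatic_variants = [("chr1", 100), ("chr1", 101), ("chr1", 200), ("chr1", 201), ("chr2", 150)]
print(variants_in_regions(somatic_variants, tfbs_regions))
# [('chr1', 101), ('chr1', 200)]
```

The linear scan is only meant to show the overlap rule; on real data an interval-aware tool such as bedtools is the better choice.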

In [None]:
import os
import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside a notebook
from deepvregulome.intersect import run_tfbs_intersection_pipeline

## 🔧 Load config.yaml
Make sure the paths (VCF folder, TFBS folder, output directory, etc.) are set correctly in `config.yaml`. The script loads this file automatically.
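
Before running the pipeline, it can help to confirm that the configured paths exist. The sketch below checks a config that has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`). The key names `vcf_dir`, `tfbs_dir`, and `output_dir` and the helper `find_config_problems` are hypothetical; substitute the keys your `config.yaml` actually uses.

```python
import os

# Hypothetical key names; replace them with the keys used in your config.yaml.
REQUIRED_PATH_KEYS = ("vcf_dir", "tfbs_dir", "output_dir")


def find_config_problems(config, required_keys=REQUIRED_PATH_KEYS):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    for key in required_keys:
        value = config.get(key)
        if not value:
            problems.append(f"missing key: {key}")
        elif not os.path.isdir(value):
            problems.append(f"{key} is not an existing directory: {value}")
    return problems


# An in-memory dict standing in for the parsed config.yaml.
example_config = {"vcf_dir": os.getcwd(), "tfbs_dir": "", "output_dir": "/no/such/dir"}
for problem in find_config_problems(example_config):
    print(problem)
```

If the output directory is created at run time, it may make sense to leave it out of the required keys.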

## 🚀 Run Intersection
Run the full TFBS–mutation overlap pipeline.

In [None]:
# Specify the TFBS model list directory (e.g., output_4.txt to output_45.txt)
tfbs_list_dir = "/home/campus.stonybrook.edu/pdutta/Github/Postdoc/DNABERT_data_processing/TFBS/tfbs_list_folder/output_files"

# Run the pipeline with multiprocessing across files in the folder
run_tfbs_intersection_pipeline(tfbs_list_dir, start=4, end=6, num_processes=3)  # Adjust for demo/testing
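
A quick pre-flight check can show which list files a given `start`/`end` range would cover. This is a sketch under two assumptions that the notebook does not confirm: the files follow the `output_<i>.txt` pattern mentioned in the comment above, and `end` is inclusive. The helper `list_tfbs_list_files` is hypothetical and not part of `deepvregulome`.

```python
import os


def list_tfbs_list_files(list_dir, start, end):
    """Split output_<i>.txt for i in [start, end] into (found, missing) paths."""
    found, missing = [], []
    for i in range(start, end + 1):  # assumption: `end` is inclusive
        path = os.path.join(list_dir, f"output_{i}.txt")
        (found if os.path.isfile(path) else missing).append(path)
    return found, missing


# Point list_dir at your own TFBS list folder; "." is only a placeholder.
found, missing = list_tfbs_list_files(".", start=4, end=6)
print(f"{len(found)} list file(s) found, {len(missing)} missing")
```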

## 📊 Summary Output
Each TFBS folder in the output will contain:
- `VCF_statistics.tsv`: Number of variants, overlaps, and fields
- `intersected_vcf_data.pkl`: Dictionary of per-patient intersection DataFrames
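
The pickle file can be inspected with the standard library. The sketch below assumes the dictionary is keyed by patient identifier and that each value supports `len()` (as a DataFrame does); the helper name `rows_per_patient` and the relative path are hypothetical. Unpickling DataFrames requires pandas to be installed, and pickle files should only be loaded from trusted sources.

```python
import os
import pickle


def rows_per_patient(pkl_path):
    """Load the pickled dict and return {patient_id: number_of_rows}."""
    with open(pkl_path, "rb") as handle:
        per_patient = pickle.load(handle)
    return {patient: len(table) for patient, table in per_patient.items()}


# Hypothetical location; adjust to the TFBS folder in your output directory.
pkl_path = os.path.join("CTCF", "intersected_vcf_data.pkl")
if os.path.exists(pkl_path):
    for patient, n_rows in rows_per_patient(pkl_path).items():
        print(patient, n_rows)
else:
    print("Pickle file not found; run the pipeline first.")
```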

In [None]:
# Load and display summary statistics for one TFBS
stats_path = os.path.join(
    "/data/projects/GDC_Cancer_Wise/New_data/Brain/Generated_files/Intersected_Data/Somatic/300bp_TFBS",
    "CTCF",  # Example TFBS model
    "VCF_statistics.tsv"
)

if os.path.exists(stats_path):
    df_stats = pd.read_csv(stats_path, sep="\t")
    display(df_stats.head())
else:
    print(f"Summary file not found at {stats_path}; run the pipeline first.")