```
═══════════════════════════════════════════════════════════════════════════════
gRNA DATA PREPARATION PIPELINE v2.1 - COMPLETE WORKING IMPLEMENTATION
═══════════════════════════════════════════════════════════════════════════════

COMPLETE, TESTED, READY-TO-RUN pipeline for gRNA classification data preparation.

IMPROVEMENTS:
1. Multi-source negative sampling (maxicircle + transcripts + minicircle)
2. Proper Altschul-Erikson dinucleotide shuffling
3. GTF-based gRNA region exclusion
4. Complete 112-feature extraction (verified count)
5. Rigorous quality control


Date: November 25, 2025
Version: 2.1 COMPLETE
═══════════════════════════════════════════════════════════════════════════════
```
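
The dinucleotide shuffling named in the list above can be illustrated with a small sketch. This is not the pipeline's own implementation, which is not shown in this section; it is a minimal version of the Altschul-Erikson idea, assuming an uppercase DNA string as input. The names `dinucleotide_shuffle` and `_reaches` are hypothetical. The shuffle treats the sequence as a walk over a graph whose edges are dinucleotides, then draws a new random walk that uses every edge exactly once, so all dinucleotide counts (and the first and last base) are preserved.

```python
import random
from collections import defaultdict


def _reaches(start, target, final_edge):
    """Follow the chosen final edges from start; True if the path hits target."""
    seen = set()
    cur = start
    while cur != target:
        if cur in seen:          # ran into a cycle, target is unreachable
            return False
        seen.add(cur)
        cur = final_edge[cur]
    return True


def dinucleotide_shuffle(seq, rng):
    """Return a random permutation of seq with identical dinucleotide counts."""
    if len(seq) < 3:
        return seq
    last = seq[-1]

    # edges[a] lists every base that directly follows base a in seq
    edges = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        edges[a].append(b)

    # Choose one "final" outgoing edge per base (except the last base) and
    # retry until these edges lead every base to the last base.
    while True:
        final_edge = {v: rng.choice(out) for v, out in edges.items() if v != last}
        if all(_reaches(v, last, final_edge) for v in final_edge):
            break

    # Shuffle the remaining edges and keep the final edge at the end of each list
    for v, out in edges.items():
        if v in final_edge:
            out.remove(final_edge[v])
            rng.shuffle(out)
            out.append(final_edge[v])
        else:
            rng.shuffle(out)

    # Walk the graph from the first base, consuming each edge list in order
    used = defaultdict(int)
    result = [seq[0]]
    for _ in range(len(seq) - 1):
        cur = result[-1]
        result.append(edges[cur][used[cur]])
        used[cur] += 1
    return "".join(result)


rng = random.Random(42)
original = "ATATATTTGCATTTAATAGG"
print(original)
print(dinucleotide_shuffle(original, rng))
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible without touching global random state.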

In [3]:
import sys
import warnings
import json
import re
from pathlib import Path
from collections import Counter, defaultdict
from typing import Dict, Tuple, List, Set, Optional

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import networkx as nx

from Bio import SeqIO
from Bio.Seq import Seq
from sklearn.model_selection import train_test_split

# Configure
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

print('='*80)
print('gRNA DATA PREPARATION PIPELINE V2.1 - COMPLETE')
print('='*80)
print('\n✓ Imports loaded')
print(f'  NumPy: {np.__version__}')
print(f'  Pandas: {pd.__version__}')
print(f'  NetworkX: {nx.__version__}')

gRNA DATA PREPARATION PIPELINE V2.1 - COMPLETE

✓ Imports loaded
  NumPy: 2.3.5
  Pandas: 2.3.3
  NetworkX: 3.5


# CONFIGURE PATHS

In [4]:
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data'
REF_DIR = DATA_DIR / 'gRNAs' / 'Cooper_2022'

# Input files
GRNA_FILE = REF_DIR / 'mOs.gRNA.final.fasta'
MINICIRCLE_FILE = REF_DIR / 'mOs.Cooper.minicircle.fasta'
GTF_FILE = REF_DIR / 'mOs.gRNA.final.gtf'
MAXICIRCLE_FILE = PROJECT_ROOT / 'notes_dump/minicircle_maxcircle_strain_cmp-master/data-deposit/maxcircle/29-13_maxicircle.fasta'
TRANSCRIPTS_FILE = DATA_DIR / 'gRNAs' / 'Tbrucei_transcripts' / 'AnTat1.1_transcripts-20.fasta'


# Output directories
PROCESSED_DIR = DATA_DIR / 'processed'
PLOTS_DIR = DATA_DIR / 'plots'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

print('\n' + '='*80)
print('FILE VALIDATION')
print('='*80)
print('\nInput files:')
all_files_exist = True
for filepath in [GRNA_FILE, MINICIRCLE_FILE, GTF_FILE, MAXICIRCLE_FILE, TRANSCRIPTS_FILE]:
    status = '✓' if filepath.exists() else '✗ MISSING'
    print(f'  {status} {filepath.name}')
    if not filepath.exists():
        all_files_exist = False

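if not all_files_exist:
    print('\n⚠ WARNING: Some files missing!')
    # Raise instead of calling sys.exit(1) so the notebook stops with a clear error
    raise FileNotFoundError('One or more input files are missing (see the list above).')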

print('\nOutput:')
print(f'  Data: {PROCESSED_DIR}')
print(f'  Plots: {PLOTS_DIR}')


FILE VALIDATION

Input files:
  ✓ mOs.gRNA.final.fasta
  ✓ mOs.Cooper.minicircle.fasta
  ✓ mOs.gRNA.final.gtf
  ✓ 29-13_maxicircle.fasta
  ✓ AnTat1.1_transcripts-20.fasta

Output:
  Data: /Users/anna/projects/grna-inspector/data/processed
  Plots: /Users/anna/projects/grna-inspector/data/plots


# Data Exploration

**Summary**  
The exploration cell below shows that the input data are internally consistent and ready for pipeline development. Its findings also guide the negative sampling strategy.  
  
Key findings: the dataset contains 1,158 canonical gRNA sequences (mean length 40 nucleotides) annotated across 390 minicircles. The GTF annotations match the FASTA sequences exactly, which supports data integrity. The high AT content (71.5%) and the presence of characteristic patterns such as ATATA motifs (27% of sequences) and poly-T tracts (38%) are consistent with genuine guide RNA sequences.  
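
The composition figures quoted above can be computed along the following lines. This is a sketch rather than the notebook's actual code: the function name `composition_stats` is hypothetical, and the minimum poly-T tract length (`poly_t_min=4`) is an assumption, since the threshold used for the 38% figure is not stated here.

```python
import re


def composition_stats(sequences, poly_t_min=4):
    """Summarise AT content and motif frequency for a list of sequences.

    poly_t_min is an assumed threshold: a poly-T tract is taken to be
    a run of at least this many consecutive T bases.
    """
    # Normalise case and treat RNA-style U as T
    seqs = [s.upper().replace("U", "T") for s in sequences]
    n = len(seqs)
    total_bases = sum(len(s) for s in seqs)
    at_bases = sum(s.count("A") + s.count("T") for s in seqs)

    poly_t = re.compile("T{%d,}" % poly_t_min)
    with_atata = sum("ATATA" in s for s in seqs)
    with_poly_t = sum(bool(poly_t.search(s)) for s in seqs)

    return {
        "at_content_pct": 100 * at_bases / total_bases,  # pooled over all bases
        "atata_pct": 100 * with_atata / n,               # share of sequences
        "poly_t_pct": 100 * with_poly_t / n,             # share of sequences
    }


toy = ["ATATAGC", "TTTTGGCC", "GCGC", "AATT"]
for key, value in composition_stats(toy).items():
    print(f"{key}: {value:.1f}")
```

AT content is pooled over all bases here; averaging per-sequence AT fractions instead would give a slightly different number when sequence lengths vary.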
  
Negative sampling: each minicircle averages 886 bp of non-gRNA sequence, which adds up to more than 345,000 bp across the 390 minicircles as a source of authentic negative examples. Together with the maxicircle (23,016 bp) and the 32 transcript sequences, this provides ample, biologically realistic material for length-matched negative examples that preserve dinucleotide composition.  
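
One way to draw length-matched negatives from the non-gRNA regions is sketched below. The helper names `non_grna_intervals` and `sample_negative` are hypothetical, and the sketch treats each minicircle as a linear sequence, ignoring windows that wrap around the circle. GTF coordinates are 1-based and inclusive, so they are converted to Python's 0-based, half-open slices before the gaps are computed.

```python
import random


def non_grna_intervals(seq_len, grna_regions):
    """Return 0-based half-open intervals not covered by any gRNA.

    grna_regions holds (start, end) pairs in GTF convention
    (1-based, inclusive); overlapping annotations are merged.
    """
    spans = sorted((start - 1, end) for start, end in grna_regions)
    free = []
    cursor = 0                      # first position not yet known to be gRNA
    for s, e in spans:
        if s > cursor:
            free.append((cursor, s))
        cursor = max(cursor, e)
    if cursor < seq_len:
        free.append((cursor, seq_len))
    return free


def sample_negative(sequence, grna_regions, length, rng):
    """Draw one window of the given length that avoids all gRNA regions."""
    candidates = [
        (s, e)
        for s, e in non_grna_intervals(len(sequence), grna_regions)
        if e - s >= length
    ]
    if not candidates:
        return None                 # no gap is long enough
    # Weight each gap by its number of valid start positions so that
    # every possible window is equally likely
    weights = [e - s - length + 1 for s, e in candidates]
    s, e = rng.choices(candidates, weights=weights, k=1)[0]
    offset = rng.randrange(s, e - length + 1)
    return sequence[offset:offset + length]


rng = random.Random(42)
minicircle = "A" * 10 + "G" * 20 + "C" * 60 + "G" * 10
annotations = [(11, 20), (15, 30), (91, 100)]   # toy GTF coordinates
print(non_grna_intervals(len(minicircle), annotations))
print(sample_negative(minicircle, annotations, 40, rng))
```

Sampling the window length from the observed gRNA length distribution, rather than fixing it, would keep the negative set length-matched to the positives.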

```
=============================================================================
CELL 1: COMPREHENSIVE DATA EXPLORATION & QUALITY ASSESSMENT
=============================================================================
Purpose: Deeply understand all input files, their structure, relationships,
         and biological content before building the pipeline.
         
This cell answers:
- What do we have? (file inventory)
- What does each column/field mean?
- How do files relate to each other?
- What's the data quality?
- What are the key biological patterns?
=============================================================================
```

## SECTION 1: FILE INVENTORY & BASIC STATS

In [7]:
print("="*80)
print("SECTION 1: FILE INVENTORY & OVERVIEW")
print("="*80 + "\n")

# Define all project files
files = {
    'gRNA_fasta': GRNA_FILE,
    'gRNA_gtf': GTF_FILE,
    'minicircle_fasta': MINICIRCLE_FILE,
    'maxicircle_fasta': MAXICIRCLE_FILE,
    'transcripts_fasta': TRANSCRIPTS_FILE
}

# Check existence and get basic stats
for name, path in files.items():
    p = Path(path)
    if p.exists():
        size_kb = p.stat().st_size / 1024
        print(f"✓ {name:20s} exists ({size_kb:.1f} KB)")
    else:
        print(f"✗ {name:20s} MISSING!")

print("\n" + "="*80)
print("SECTION 2: GTF ANNOTATION ANALYSIS")
print("="*80 + "\n")

# Parse GTF file to understand gRNA annotations
print("GTF Format Explanation:")
print("-" * 40)
print("GTF = Gene Transfer Format, a standard annotation format")
print("Each line describes one genomic feature (here: gRNA location)")
print()

# Load GTF
gtf_data = []
with open(files['gRNA_gtf'], 'r') as f:
    for line in f:
        # Skip blank lines and '#' comment/header lines
        if line.strip() and not line.startswith('#'):
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue  # not a complete 9-column GTF record
            # GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
            seqname = parts[0]      # Minicircle ID (e.g., mO_001)
            source = parts[1]        # Data source
            feature = parts[2]       # Feature type (transcript)
            start = int(parts[3])    # Start position (1-based)
            end = int(parts[4])      # End position (inclusive)
            strand = parts[6]        # Strand (+/-)
            attributes = parts[8]    # Additional info
            
            # Parse attributes; a missing key yields "" instead of raising AttributeError
            gene_match = re.search(r'gene_id "([^"]+)"', attributes)
            transcript_match = re.search(r'transcript_id "([^"]+)"', attributes)
            note = re.search(r'note "([^"]+)"', attributes)
            gene_id = gene_match.group(1) if gene_match else ""
            transcript_id = transcript_match.group(1) if transcript_match else ""
            note_text = note.group(1) if note else ""
            
            # Extract gRNA name from note
            name_match = re.search(r'name: ([^;]+)', note_text)
            grna_name = name_match.group(1) if name_match else ""
            
            gtf_data.append({
                'minicircle': seqname,
                'start': start,
                'end': end,
                'strand': strand,
                'length': end - start + 1,
                'gene_id': gene_id,
                'transcript_id': transcript_id,
                'grna_name': grna_name,
                'full_note': note_text
            })

gtf_df = pd.DataFrame(gtf_data)

print(f"Total gRNA annotations: {len(gtf_df)}")
print(f"Unique minicircles with gRNA: {gtf_df['minicircle'].nunique()}")
print(f"Unique gene_ids: {gtf_df['gene_id'].nunique()}")
print(f"Unique transcript_ids: {gtf_df['transcript_id'].nunique()}")
print()

print("GTF Column Meanings:")
print("-" * 40)
print("minicircle    : Which minicircle DNA molecule contains this gRNA")
print("start/end     : Genomic coordinates (1-based, inclusive)")
print("strand        : DNA strand orientation (+/- or forward/reverse)")
print("length        : gRNA sequence length in nucleotides")
print("gene_id       : Links multiple transcripts to same genomic locus")
print("transcript_id : Unique identifier for each gRNA transcript variant")
print("grna_name     : Descriptive name with target gene info")
print()

print("Length distribution of annotated gRNA:")
print(gtf_df['length'].describe())
print()

print("Sample annotations:")
print(gtf_df.head(3).to_string())
print()

SECTION 1: FILE INVENTORY & OVERVIEW

✓ gRNA_fasta           exists (77.0 KB)
✓ gRNA_gtf             exists (316.1 KB)
✓ minicircle_fasta     exists (422.6 KB)
✓ maxicircle_fasta     exists (18.1 KB)
✓ transcripts_fasta    exists (24.3 KB)

SECTION 2: GTF ANNOTATION ANALYSIS

GTF Format Explanation:
----------------------------------------
GTF = Gene Transfer Format, a standard annotation format
Each line describes one genomic feature (here: gRNA location)

Total gRNA annotations: 1158
Unique minicircles with gRNA: 390
Unique gene_ids: 390
Unique transcript_ids: 1158

GTF Column Meanings:
----------------------------------------
minicircle    : Which minicircle DNA molecule contains this gRNA
start/end     : Genomic coordinates (1-based, inclusive)
strand        : DNA strand orientation (+/- or forward/reverse)
length        : gRNA sequence length in nucleotides
gene_id       : Links multiple transcripts to same genomic locus
transcript_id : Unique identifier for each gRNA transcript v