# 1. Preprocess Data

In this notebook we preprocess genomic data on R-loops generated by the [Arsuaga–Vázquez Lab](https://arsuaga-vazquez-lab.faculty.ucdavis.edu) at UC Davis and reported by [Ferrari *et al.*](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013376) in the paper *“The R-loop grammar predicts R-loop formation under different topological constraints”*. The raw data is available from the [Zenodo repository](https://zenodo.org/records/15742754) as the file `fasta_bed.zip`.

The experimental data was obtained using **single-molecule R-loop footprinting and sequencing (SMRF-seq)**. This technique identifies the genomic start and end positions of R-loops formed on DNA plasmids under different initial topologies.

Each experiment is defined by a combination of the following parameters:

- **Plasmid**:  
  Two different DNA plasmids are used: **pFC53** and **pFC8**.

- **DNA topology**:  
  Plasmids are prepared in different topological states prior to transcription:
  - **Linearized**: Circular plasmids are linearized before transcription.
  - **Supercoiled**: Plasmids retain their native bacterial supercoiling, with a supercoiling density of approximately $-0.07$.
  - **Gyrase-treated (Gyrasecr)**: Plasmids are treated with gyrase before transcription, which makes the supercoiling density more negative, approximately $-0.14$.

- **R-loop position**:  
  For each molecule, SMRF-seq identifies the **starting** and **ending** coordinates of the R-loop along the plasmid sequence. In this notebook, we focus on the **starting locations** of R-loops.


We use this data to construct libraries of short DNA sequences centered at the starting positions of R-loops. In subsequent notebooks, we show that these collections of words are structurally distinct: topological signatures derived from their **Insertion Chain Complexes** can differentiate between sets of words associated with R-loops formed under linear versus supercoiled DNA topologies.

### Construction of sequence libraries

We build these libraries as follows:

1. For each plasmid and each R-loop start site, we extract a 14-nucleotide window of DNA sequence centered at the start of the R-loop. If $a_1$ denotes the first nucleotide of the R-loop, the window is $$a_{-7}a_{-6}\dots a_{-1}a_1a_2\dots a_6a_7,$$ with seven nucleotides on each side of the start boundary (the indexing skips $a_0$).

2. From this window, we extract all contiguous $k$-mers, for $k=1, \dots, 14$, that contain at least one of the two nucleotides flanking the start boundary, $a_{-1}$ or $a_1$:

   - **Length 1**:  $a_{-1}, \quad a_1$

   - **Length 2**:  $a_{-2}a_{-1}, \quad a_{-1}a_1, \quad a_1a_2$

   - **Length 3**:  $a_{-3}a_{-2}a_{-1},\quad a_{-2}a_{-1}a_1,\quad a_{-1}a_1a_2, \quad a_1a_2a_3$

   - $\vdots$

   - **Length 14**:  $a_{-7}a_{-6}\dots a_{-1}a_1\dots a_6a_7$
  
3. We aggregate all extracted $k$-mers across all R-loop start sites for a given experimental condition and compute their frequencies.

4. We normalize these frequencies by dividing each value by the maximum observed frequency, obtaining weights in the interval $(0,1]$.

5. We retain only the $N = 500$ words with the highest normalized frequencies to form the representative sequence library for that condition.
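
The steps above can be sketched in plain Python. This is a minimal illustration, not the code in `utils.preprocess`: the helper names `junction_kmers` and `build_library` are hypothetical, and the selection rule (keep every $k$-mer that contains $a_{-1}$ or $a_1$) is our reading of the examples listed in step 2.

```python
from collections import Counter


def junction_kmers(window, side=7):
    """Return all k-mers of `window` that contain a_{-1} or a_1.

    Hypothetical helper: `window` must have length 2 * side, with a_{-1}
    at index side - 1 and a_1 at index side.
    """
    if len(window) != 2 * side:
        raise ValueError("window must have length 2 * side")
    kmers = []
    for k in range(1, 2 * side + 1):
        first = max(0, side - k)        # leftmost start whose k-mer still reaches a_{-1}
        last = min(side, 2 * side - k)  # start no later than a_1 and fit inside the window
        for i in range(first, last + 1):
            kmers.append(window[i:i + k])
    return kmers


def build_library(windows, side=7, top_n=500):
    """Count k-mers over all windows, normalize by the maximum count, keep the top_n."""
    counts = Counter()
    for window in windows:
        counts.update(junction_kmers(window, side))
    if not counts:
        return []
    max_count = max(counts.values())
    # most_common sorts by count in descending order
    return [(word, count / max_count) for word, count in counts.most_common(top_n)]


# Toy input: two made-up 14-nucleotide windows (not real plasmid sequence)
toy_library = build_library(["ACGTACGTACGTAC", "ACGTACGTTCGTAC"], top_n=5)
```

Under this reading, each start site contributes $35 + 28 = 63$ $k$-mers: $k+1$ for each $k \le 7$ and $15-k$ for each $k \ge 8$.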

The resulting processed datasets are saved as `.pickle` files in the `data/preprocessed` directory for efficient use in the subsequent steps of the analysis.
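
For later steps, the saved file can be read back and indexed by experiment name. The sketch below assumes the layout written at the end of this notebook (a list of dictionaries with keys `exp_name` and `table`). The helper `load_libraries` is a hypothetical name, and the toy round trip uses plain lists as stand-ins for the `Table_Words` DataFrames so that it runs without the real data.

```python
import os
import pickle
import tempfile


def load_libraries(path):
    """Load the pickled list of libraries and index it by experiment name."""
    with open(path, "rb") as handle:
        libraries = pickle.load(handle)
    return {entry["exp_name"]: entry["table"] for entry in libraries}


# Toy round trip with made-up experiment names and words
toy = [
    {"exp_name": "toy_linear", "table": [("ACG", 1.0), ("CG", 0.5)]},
    {"exp_name": "toy_supercoiled", "table": [("TTA", 1.0)]},
]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_libraries.pickle")
    with open(toy_path, "wb") as handle:
        pickle.dump(toy, handle)
    by_name = load_libraries(toy_path)
```

Loading the real file requires `pandas` to be installed, since each `table` entry is a pickled DataFrame.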


In [1]:
#Make sure biopython is installed
#!pip install biopython

In [2]:
from utils.preprocess import read_subsequence, produce_table, get_words_combined
import pickle
import pandas as pd
import numpy as np
import os

In [3]:
# Folder paths in the notebook
folder_path_bed = "data/raw/fasta_bed/bed-files/"
folder_path_fasta = "data/raw/fasta_bed/fasta-files/"

max_N_words=500
max_window_side=7

# FASTA mapping
dictionary_fastas = {'pFC53':'pFC53.fa', 'pFC8':'pFC8.fa', 'pFC19FIXED':'pFC8.fa'}

In [4]:
# List all BED files (sorted, since os.listdir returns entries in arbitrary order)
bed_files = sorted(f for f in os.listdir(folder_path_bed) if f.endswith('.bed'))

#Create libraries for the Bed Files
Libraries=[]

for file in bed_files:

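    # Read BED file. Note: read_csv treats the first line as a header by default;
    # pass header=None if the BED files have no header row.
    bed_df = pd.read_csv(os.path.join(folder_path_bed, file), sep="\t")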

    #Rename BED file columns
    bed_df.columns = ["chrom", "start", "end", "extra_info", "extra_bit", "direction"]
    
    #Build sequence table around R-loop starts
    Table = produce_table(bed_df, folder_path_fasta, dictionary_fastas,max_window_size=max_window_side)
    
    # Generate combined k-mers
    words, freqs = get_words_combined(Table, max_w=max_window_side)

    # Normalize frequencies
    freqs=np.array(freqs)/np.max(freqs)
    
    # Build Library DataFrame and select top N words
    Table_Words=pd.DataFrame({'Words':words, 'Freqs':freqs})
    Table_Words=Table_Words.sort_values('Freqs',ascending=False).reset_index(drop=True).iloc[0:max_N_words]

    # Get experiment name from file
    exp_name = file.replace('.bed', '')
    
    # Append to libraries list
    Libraries.append({'exp_name': exp_name, 'table': Table_Words})
     
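# Save libraries (create the output directory if it does not exist yet)
output_path = 'data/preprocessed/R_loops_14-mers_starting_site.pickle'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(output_path, 'wb') as handle:
    pickle.dump(Libraries, handle)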
