<a href="https://colab.research.google.com/github/abh2180/te-binding-ml4fg/blob/main/notebooks/01_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML4FG
## Interim Report: Code

# 01 – Alignment and QC (hg38 + eCLIP)

This notebook:
- Mounts Google Drive and sets persistent data paths
- Downloads hg38 and raw eCLIP data into Drive (one-time)
- Provides stubs for alignment, IDR, TE intersections, and QC

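Once the data is in Drive, the download cells can be skipped. Each one checks for an existing file before downloading, so rerunning them is harmless, and later cells only need the paths defined in the first code cell.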


In [4]:
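import os

from google.colab import drive

# Mount Google Drive so downloaded data persists across Colab sessions.
drive.mount("/content/drive")

# Persistent project layout on Drive: raw inputs and processed outputs.
PROJECT_ROOT = "/content/drive/MyDrive/te-binding-ml4fg-data"
RAW_DIR = os.path.join(PROJECT_ROOT, "raw")
HG38_DIR = os.path.join(RAW_DIR, "hg38")
ECLIP_DIR = os.path.join(RAW_DIR, "eclip")
TE_DIR = os.path.join(RAW_DIR, "te_annotations")
PROC_DIR = os.path.join(PROJECT_ROOT, "processed")
ALIGN_DIR = os.path.join(PROC_DIR, "aligned")
PEAK_DIR = os.path.join(PROC_DIR, "peaks")

# Create every directory up front; existing directories are left untouched.
for d in [PROJECT_ROOT, RAW_DIR, HG38_DIR, ECLIP_DIR, TE_DIR, PROC_DIR, ALIGN_DIR, PEAK_DIR]:
    os.makedirs(d, exist_ok=True)

print("RAW_DIR:", RAW_DIR)
print("PROC_DIR:", PROC_DIR)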


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
RAW_DIR: /content/drive/MyDrive/te-binding-ml4fg-data/raw
PROC_DIR: /content/drive/MyDrive/te-binding-ml4fg-data/processed
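
The same directory layout can also be expressed with `pathlib`, which avoids string concatenation and makes it easy to point the notebook at a different root (for example, a local folder when not running in Colab). This is a minimal sketch: the function name `project_layout` and the dictionary keys are illustrative and not part of the notebook, while the folder names mirror the cell above.

```python
from pathlib import Path


def project_layout(root):
    """Return the project's directory layout as a dict of Path objects.

    The key names are hypothetical; the folder names follow the setup cell.
    """
    root = Path(root)
    raw = root / "raw"
    proc = root / "processed"
    layout = {
        "root": root,
        "raw": raw,
        "hg38": raw / "hg38",
        "eclip": raw / "eclip",
        "te": raw / "te_annotations",
        "processed": proc,
        "aligned": proc / "aligned",
        "peaks": proc / "peaks",
    }
    # Create any missing directories; existing ones are left untouched.
    for path in layout.values():
        path.mkdir(parents=True, exist_ok=True)
    return layout
```

In Colab this could be called as `project_layout("/content/drive/MyDrive/te-binding-ml4fg-data")` after Drive is mounted.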


In [5]:
%%bash
HG38_DIR="/content/drive/MyDrive/te-binding-ml4fg-data/raw/hg38"
mkdir -p "$HG38_DIR"
cd "$HG38_DIR"

if [ ! -f "hg38.fa.gz" ]; then
  echo "Downloading hg38..."
  wget -O hg38.fa.gz "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
else
  echo "hg38.fa.gz already exists in Drive, skipping download."
fi

hg38.fa.gz already exists in Drive, skipping download.
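
The existence check above does not say whether the file on Drive is a usable gzipped FASTA. A quick sanity check is to read its first line and confirm it looks like a FASTA header. The helper below is a sketch with a hypothetical name (`first_fasta_header`); it only inspects the start of the file, so it would not detect a download that was truncated partway through.

```python
import gzip


def first_fasta_header(path):
    """Return the first line of a gzipped FASTA file, without the newline.

    Raises gzip.BadGzipFile if the file is not valid gzip, and ValueError
    if the first line does not start with '>'.
    """
    with gzip.open(path, "rt") as handle:
        line = handle.readline().rstrip("\n")
    if not line.startswith(">"):
        raise ValueError(f"{path} does not start with a FASTA header: {line[:40]!r}")
    return line
```

For example, `first_fasta_header(os.path.join(HG38_DIR, "hg38.fa.gz"))` should return the name line of the first sequence in the file.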


In [11]:
%%bash
TE_DIR="/content/drive/MyDrive/te-binding-ml4fg-data/raw/te_annotations"
mkdir -p "$TE_DIR"
cd "$TE_DIR"

# Safety: only download if the file does NOT exist
if [ ! -f "GRCh38_GENCODE_rmsk_TE.gtf.gz" ]; then
  echo "GRCh38_GENCODE_rmsk_TE.gtf.gz not found, downloading..."
  wget -O GRCh38_GENCODE_rmsk_TE.gtf.gz "https://www.dropbox.com/scl/fo/jdpgn6fl8ngd3th3zebap/ACdZkShDC1au-OckIipI5kM/TEtranscripts/TE_GTF?dl=1&file_subpath=%2FGRCh38_GENCODE_rmsk_TE.gtf.gz&rlkey=41oz6ppggy82uha5i3yo1rnlx"
else
  echo "GRCh38_GENCODE_rmsk_TE.gtf.gz already exists in Drive, skipping download."
fi

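# A shared-folder link may return a zip of the whole folder rather than the
# single file, so warn if the result is not a valid gzip stream.
gzip -t GRCh38_GENCODE_rmsk_TE.gtf.gz || echo "WARNING: GRCh38_GENCODE_rmsk_TE.gtf.gz is not a valid gzip file" >&2

ls -lh GRCh38_GENCODE_rmsk_TE.gtf.gz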


Removing any old/corrupt copy...
Re-downloading GRCh38_GENCODE_rmsk_TE.gtf.gz from Dropbox...
Download complete.
-rw------- 1 root root 3.4G Nov  9 02:39 GRCh38_GENCODE_rmsk_TE.gtf.gz


--2025-11-09 02:37:47--  https://www.dropbox.com/scl/fo/jdpgn6fl8ngd3th3zebap/ACdZkShDC1au-OckIipI5kM/TEtranscripts/TE_GTF?dl=1&file_subpath=%2FGRCh38_GENCODE_rmsk_TE.gtf.gz&rlkey=41oz6ppggy82uha5i3yo1rnlx
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:6020:18::a27d:4012
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucfe86ce7ae840e09ae631115a7c.dl.dropboxusercontent.com/zip_download_get/CYPOVTyVoG0AvkZK2mhfxDOj6E5Ef4XuyA6K-u6b4ajoDxdCVwaKl1Lh5XfohRa1n7qTFdKLPCOSp275coeUmIbeGEMJ3lE4wW0_2dyp9baZWw# [following]
--2025-11-09 02:37:48--  https://ucfe86ce7ae840e09ae631115a7c.dl.dropboxusercontent.com/zip_download_get/CYPOVTyVoG0AvkZK2mhfxDOj6E5Ef4XuyA6K-u6b4ajoDxdCVwaKl1Lh5XfohRa1n7qTFdKLPCOSp275coeUmIbeGEMJ3lE4wW0_2dyp9baZWw
Resolving ucfe86ce7ae840e09ae631115a7c.dl.dropboxusercontent.com (ucfe86ce7ae840e09ae631115a7c.dl.dropboxusercontent.com)... 162.125.6.15, 2620:1
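
The redirect to a `zip_download_get` URL in the log above suggests that Dropbox may have served a zip archive of the shared folder rather than the single `.gtf.gz` file, which would also be consistent with the 3.4G size reported by `ls`. One way to catch this is to inspect the leading magic bytes of the downloaded file. The following is a minimal sketch; `sniff_compression` is a hypothetical helper, not part of the notebook.

```python
def sniff_compression(path):
    """Classify a file as 'gzip', 'zip', or 'unknown' from its first bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    if magic[:2] == b"\x1f\x8b":      # gzip magic number
        return "gzip"
    if magic[:4] == b"PK\x03\x04":    # local file header of a non-empty zip archive
        return "zip"
    return "unknown"
```

If this returns `"zip"` for `GRCh38_GENCODE_rmsk_TE.gtf.gz`, the archive would need to be unpacked and the real `.gtf.gz` extracted before any downstream step reads it.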

In [18]:
%%bash
TE_DIR="/content/drive/MyDrive/te-binding-ml4fg-data/raw/te_annotations"
cd "$TE_DIR"

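echo "Backing up old BED..."
# Only back up if a previous BED exists, so the cell also works on a fresh run.
if [ -f GRCh38_GENCODE_rmsk_TE.bed ]; then
  mv GRCh38_GENCODE_rmsk_TE.bed GRCh38_GENCODE_rmsk_TE.bed.old
fi

echo "Rebuilding GRCh38_GENCODE_rmsk_TE.bed with correct attributes..."
# GTF is 1-based inclusive and BED is 0-based half-open, hence start = $4 - 1.
# Column 9 (the full attribute string) is kept as the BED name field.
zcat GRCh38_GENCODE_rmsk_TE.gtf.gz | \
  awk -F '\t' 'BEGIN{OFS="\t"} $3=="exon" {print $1, $4-1, $5, $9}' \
  > GRCh38_GENCODE_rmsk_TE.bed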

echo
ls -lh GRCh38_GENCODE_rmsk_TE.bed*
echo
echo "New BED first few lines:"
head -n 5 GRCh38_GENCODE_rmsk_TE.bed


Backing up old BED...
Rebuilding GRCh38_GENCODE_rmsk_TE.bed with correct attributes...

-rw------- 1 root root 598M Nov  9 02:53 GRCh38_GENCODE_rmsk_TE.bed
-rw------- 1 root root 148M Nov  9 02:48 GRCh38_GENCODE_rmsk_TE.bed.old

New BED first few lines:
chr1	67108753	67109046	gene_id "L1P5"; transcript_id "L1P5"; family_id "L1"; class_id "LINE"; gene_name "L1P5:TE";
chr1	8388315	8388618	gene_id "AluY"; transcript_id "AluY"; family_id "Alu"; class_id "SINE"; gene_name "AluY:TE";
chr1	25165803	25166380	gene_id "L1MB5"; transcript_id "L1MB5"; family_id "L1"; class_id "LINE"; gene_name "L1MB5:TE";
chr1	33554185	33554483	gene_id "AluSc"; transcript_id "AluSc"; family_id "Alu"; class_id "SINE"; gene_name "AluSc:TE";
chr1	41942894	41943205	gene_id "AluY"; transcript_id "AluY_dup1"; family_id "Alu"; class_id "SINE"; gene_name "AluY:TE";
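
Because the fourth BED column holds the raw GTF attribute string, downstream code has to parse it to recover fields such as `family_id` and `class_id`. The sketch below shows one way to do that in Python. The function name `parse_te_bed_line` is an assumption for illustration, and the regular expression assumes every attribute has the `key "value";` form seen in the output above.

```python
import re

# Matches GTF-style attribute pairs such as: gene_id "AluY";
_ATTR_RE = re.compile(r'(\w+) "([^"]*)";')


def parse_te_bed_line(line):
    """Split one 4-column BED line into coordinates plus an attribute dict."""
    chrom, start, end, attrs = line.rstrip("\n").split("\t", 3)
    record = {"chrom": chrom, "start": int(start), "end": int(end)}
    # Each key/value pair in column 4 becomes its own entry in the record.
    record.update(dict(_ATTR_RE.findall(attrs)))
    return record
```

Since BED intervals are half-open, the length of an element is simply `record["end"] - record["start"]`.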


In [16]:
%%bash
apt-get update -y
apt-get install -y bedtools samtools

bedtools --version
samtools --version | head -n 1

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:3 https://cli.github.com/packages stable InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [2,123 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,426 kB]
Hit:12 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/univers

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
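
Before running later cells, it can help to confirm from Python that the command-line tools installed above are actually on `PATH`, for instance after a Colab runtime restart. This is a small sketch using the standard library; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("bedtools", "samtools")):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means both tools were found; otherwise the install cell needs to be rerun.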


In [19]:
%%bash
TE_DIR="/content/drive/MyDrive/te-binding-ml4fg-data/raw/te_annotations"
cd "$TE_DIR"

echo "Contents of TE_DIR:"
ls -lh

echo
echo "First few lines of BED:"
head -n 5 GRCh38_GENCODE_rmsk_TE.bed


Contents of TE_DIR:
total 833M
-rw------- 1 root root 598M Nov  9 02:53 GRCh38_GENCODE_rmsk_TE.bed
-rw------- 1 root root 148M Nov  9 02:48 GRCh38_GENCODE_rmsk_TE.bed.old
-rw------- 1 root root  88M Nov  9 02:45 GRCh38_GENCODE_rmsk_TE.gtf.gz

First few lines of BED:
chr1	67108753	67109046	gene_id "L1P5"; transcript_id "L1P5"; family_id "L1"; class_id "LINE"; gene_name "L1P5:TE";
chr1	8388315	8388618	gene_id "AluY"; transcript_id "AluY"; family_id "Alu"; class_id "SINE"; gene_name "AluY:TE";
chr1	25165803	25166380	gene_id "L1MB5"; transcript_id "L1MB5"; family_id "L1"; class_id "LINE"; gene_name "L1MB5:TE";
chr1	33554185	33554483	gene_id "AluSc"; transcript_id "AluSc"; family_id "Alu"; class_id "SINE"; gene_name "AluSc:TE";
chr1	41942894	41943205	gene_id "AluY"; transcript_id "AluY_dup1"; family_id "Alu"; class_id "SINE"; gene_name "AluY:TE";
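
As a further QC step on the rebuilt BED, the records can be tallied per TE class to see whether the expected classes (for example `LINE` and `SINE`) are present. The sketch below assumes the `class_id "..."` attribute format shown above; `count_te_classes` is an illustrative name, and records without a `class_id` are counted under `"unknown"`.

```python
import re
from collections import Counter

_CLASS_RE = re.compile(r'class_id "([^"]*)"')


def count_te_classes(lines):
    """Count BED records per TE class, reading `class_id` from column 4."""
    counts = Counter()
    for line in lines:
        match = _CLASS_RE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle on `GRCh38_GENCODE_rmsk_TE.bed` can be passed directly and the file is streamed rather than loaded into memory.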
