# Data Download and Verification

This notebook helps you download and verify public datasets for regulatory genomics modeling. It covers the DeepSEA training bundle, ENCODE, and Roadmap Epigenomics. Outputs are organized in `data/raw/` for downstream preprocessing.

## DeepSEA Training Bundle
- [DeepSEA download page](http://deepsea.princeton.edu/help/)
- Download the training bundle (large .tar.gz) manually or with the script below.

In [8]:
import os, requests, tarfile, shutil
from pathlib import Path
RAW = Path('data/raw')
RAW.mkdir(parents=True, exist_ok=True)
deepsea_url = 'http://deepsea.princeton.edu/media/code/deepsea_train_bundle.v0.9.tar.gz'
deepsea_tar = RAW / 'deepsea_train_bundle.v0.9.tar.gz'
if not deepsea_tar.exists():
    print('Downloading DeepSEA training bundle...')
    r = requests.get(deepsea_url, stream=True)
    with open(deepsea_tar, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
    print('Download complete.')
else:
    print('DeepSEA bundle already downloaded.')
# Extract
deepsea_dir = RAW / 'deepsea_train_bundle'
if not deepsea_dir.exists():
    print('Extracting...')
    with tarfile.open(deepsea_tar, 'r:gz') as tar:
        tar.extractall(RAW)
    print('Extraction complete.')
else:
    print('DeepSEA bundle already extracted.')
# List contents
print('DeepSEA files:', list(deepsea_dir.glob('*')))

DeepSEA bundle already downloaded.
Extracting...


  tar.extractall(RAW)


Extraction complete.
DeepSEA files: []


## ENCODE Data
- [ENCODE Project](https://www.encodeproject.org/)
- Use the ENCODE REST API to search and download metadata or files. See [API docs](https://www.encodeproject.org/help/rest-api/).

In [5]:
import requests
# Example: Search for DNase-seq experiments in human (GRCh38)
search_url = 'https://www.encodeproject.org/search/'
params = {
    'type': 'experiment',
    'assay_title': 'DNase-seq',
    'assembly': 'GRCh38',
    'status': 'released',
    'limit': 5,
    'format': 'json'
}
r = requests.get(search_url, params=params, headers={'accept': 'application/json'})
results = r.json()['@graph']
for exp in results:
    print(exp['accession'], exp['assay_title'], exp['biosample_term_name'])

## Roadmap Epigenomics Data
- [Roadmap Epigenomics Portal](https://egg2.wustl.edu/roadmap/web_portal/)
- Data is available via web portal, AWS Open Data, and GEO. Download manually or script as needed.

In [6]:
# Example: List available Roadmap files (manual download recommended for large files)
print('See portal for available files: https://egg2.wustl.edu/roadmap/web_portal/')

See portal for available files: https://egg2.wustl.edu/roadmap/web_portal/


## Verification
- Check that expected files exist in `data/raw/` and print their sizes.

In [7]:
for f in RAW.glob('*'):
    print(f, f.stat().st_size // 1024, 'KB')

data/raw/deepsea_train 0 KB
data/raw/coords.tsv 0 KB
data/raw/reference.fa.fai 0 KB
data/raw/deepsea_train_bundle.v0.9.tar.gz 3732296 KB
data/raw/reference.fa 195 KB
