# Bulk histone mark ChIP-Seq

This tutorial is a brief guide to using the seven histone-mark-specific clocks and the pan-histone-mark clock that we developed. Link to the [preprint](https://www.biorxiv.org/content/10.1101/2023.08.21.554165v3).

## Import packages

We only need two packages for this tutorial.

In [1]:
import pandas as pd
import pyaging as pya

## Download and load example data

Let's download an example H3K4me3 ChIP-Seq bigWig file from the ENCODE project.

In [2]:
pya.data.download_example_data('histone_mark')

|-----> 🏗️ Starting download_example_data function
|-----> ⚙️ Download data started
|-----------> Data found in ./pyaging_data/ENCFF386QWG.bigWig
|-----> ✅ Download data finished [0.0007s]
|-----> 🎉 Done! [0.0020s]


To show that multiple bigWig files can be converted into a single DataFrame in one call, let's simply pass the same file path twice.

In [3]:
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'])

|-----> 🏗️ Starting bigwig_to_df function
|-----> ⚙️ Load Ensembl genome metadata started
|-----> ⚙️ Download data started
|-----------> Data found in ./pyaging_data/Ensembl-105-EnsDb-for-Homo-sapiens-genes.csv
|-----> ✅ Download data finished [0.0006s]
|-----> ✅ Load Ensembl genome metadata finished [0.0006s]
|-----> ⚙️ Processing bigWig files started
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----------> Processing file: pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 100.0000%
|-----> ✅ Processing bigWig files finished [0.1994s]
|-----> 🎉 Done! [22.8922s]


In [4]:
df.head()

| | ENSG00000223972 | ENSG00000227232 | ENSG00000278267 | ENSG00000243485 | ENSG00000284332 | ENSG00000237613 | ENSG00000268020 | ENSG00000240361 | ENSG00000186092 | ENSG00000238009 | ... | ENSG00000237801 | ENSG00000237040 | ENSG00000124333 | ENSG00000228410 | ENSG00000223484 | ENSG00000124334 | ENSG00000270726 | ENSG00000185203 | ENSG00000182484 | ENSG00000227159 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pyaging_data/ENCFF386QWG.bigWig | 0.028616 | 0.030415 | 0.027783 | 0.028616 | 0.028616 | 0.028616 | 0.044171 | 0.036474 | 0.030784 | 0.03181 | ... | 0.034435 | 0.006822 | 1.413119 | 0.029424 | 0.140005 | 0.049786 | 0.069296 | 0.332126 | 0.028596 | 0.028616 |
| pyaging_data/ENCFF386QWG.bigWig | 0.028616 | 0.030415 | 0.027783 | 0.028616 | 0.028616 | 0.028616 | 0.044171 | 0.036474 | 0.030784 | 0.03181 | ... | 0.034435 | 0.006822 | 1.413119 | 0.029424 | 0.140005 | 0.049786 | 0.069296 | 0.332126 | 0.028596 | 0.028616 |
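
Because the same path was passed twice, both rows carry the same label. The following is a minimal sketch of one way to give repeated labels distinct names; `make_unique` is a hypothetical helper written for this illustration and is not part of pyaging.

```python
def make_unique(labels):
    """Append a numeric suffix to repeated labels; the first occurrence is kept unchanged."""
    seen = {}    # label -> number of times it has appeared so far
    unique = []
    for label in labels:
        count = seen.get(label, 0)
        unique.append(label if count == 0 else f"{label}-{count}")
        seen[label] = count + 1
    return unique


# Same row labels as in the example above (one file path repeated twice)
labels = ["pyaging_data/ENCFF386QWG.bigWig", "pyaging_data/ENCFF386QWG.bigWig"]
unique_labels = make_unique(labels)
print(unique_labels)
```

With a pandas DataFrame, the result could be assigned back with `df.index = make_unique(df.index)`. The sketch assumes that no existing label already ends in a suffix such as `-1`.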


## Convert data to AnnData object

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

In [5]:
adata = pya.preprocess.df_to_adata(df)

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> ✅ Impute missing values finished [0.0006s]
|-----> ⚙️ Log data statistics started
|-----------> There are 2 observations
|-----------> There are 62241 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> ✅ Log data statistics finished [0.0014s]
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.0019s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0039s]
|-----> ⚙️ Add unstructured data to anndata started
|-----> ✅ Add unstructured data to anndata finished [0.0008s]
|-----> 🎉 Done! [0.0090s]


  utils.warn_names_duplicates("obs")


## Predict age

We can either predict with one clock at a time or with several at once. For convenience, let's simply pass a few clocks of interest together. The function is insensitive to the capitalization of the clock names.

In [6]:
adata = pya.pred.predict_age(adata, ['h3k4me3', 'h3k9me3', 'panhistone'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0005s]
|-----> Processing clock: h3k4me3
|-----------> ⚙️ Load clock started
|-----------> ⚙️ Download data started
|-----------> Downloading data to ./pyaging_data/h3k4me3.pt
|-----------> in progress: 100.0000%
|-----------> ✅ Download data finished [0.0003s]
|-----------> ✅ Load clock finished [0.0003s]
|-----------> ⚙️ Check features in adata started
|-----------> All features are present in adata.var_names.
|-----------> ✅ Check features in adata finished [1.3708s]
|-----------> ⚙️ Convert adata.X to torch.tensor and filter features started
|-----------> ✅ Convert adata.X to torch.tensor and filter features finished [0.0021s]
|-----------> ⚙️ Initialize model started
|-----------> ✅ Initialize model finished [0.0028s]
|-----------> ⚙️ Predict ages with model started
|-----------> ✅ Predict ages with model finished [0.0024s]
|------

In [7]:
adata.obs.head()

| | h3k4me3 | h3k9me3 | panhistone |
|---|---|---|---|
| pyaging_data/ENCFF386QWG.bigWig | 53.998566 | 44.322887 | 54.021847 |
| pyaging_data/ENCFF386QWG.bigWig | 53.998566 | 44.322887 | 54.021847 |
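
The three clocks give somewhat different estimates for the same sample. As a rough illustration of how the predictions could be compared, the sketch below computes their mean and range from the values printed above. The numbers are copied by hand into a plain dictionary rather than read from `adata.obs`, and averaging clocks is only an illustration here, not a procedure taken from the preprint.

```python
from statistics import mean

# Predictions for one sample, copied from the adata.obs output above
predictions = {"h3k4me3": 53.998566, "h3k9me3": 44.322887, "panhistone": 54.021847}

consensus = mean(predictions.values())                             # average across clocks
spread = max(predictions.values()) - min(predictions.values())     # largest disagreement

print(f"mean = {consensus:.2f}, range = {spread:.2f}")
```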


## Get citation

The DOI, citation, and some metadata are automatically added to the AnnData object under `adata.uns[CLOCKNAME_metadata]`.

In [8]:
adata.uns['h3k4me3_metadata']

{'species': 'Homo sapiens',
 'data_type': 'histone_mark',
 'year': 2023,
 'citation': 'de Lima Camillo, Lucas Paulo, et al. "Histone mark age of human tissues and cells." bioRxiv (2023): 2023-08.',
 'doi': 'https://doi.org/10.1101/2023.08.21.554165'}

In [9]:
adata.uns['h3k9me3_metadata']

{'species': 'Homo sapiens',
 'data_type': 'histone_mark',
 'year': 2023,
 'citation': 'de Lima Camillo, Lucas Paulo, et al. "Histone mark age of human tissues and cells." bioRxiv (2023): 2023-08.',
 'doi': 'https://doi.org/10.1101/2023.08.21.554165'}

In [10]:
adata.uns['panhistone_metadata']

{'species': 'Homo sapiens',
 'data_type': 'histone_mark',
 'year': 2023,
 'citation': 'de Lima Camillo, Lucas Paulo, et al. "Histone mark age of human tissues and cells." bioRxiv (2023): 2023-08.',
 'doi': 'https://doi.org/10.1101/2023.08.21.554165'}
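
Since every clock stores its metadata under a key ending in `_metadata`, the citations can also be collected in a loop instead of one cell per clock. The sketch below uses a plain dictionary, `uns`, as a stand-in for `adata.uns`, filled with the values shown above; `unique_citations` is a hypothetical helper for this illustration, not a pyaging function.

```python
# Stand-in for adata.uns, populated with the metadata printed above
uns = {
    f"{clock}_metadata": {
        "species": "Homo sapiens",
        "data_type": "histone_mark",
        "year": 2023,
        "citation": 'de Lima Camillo, Lucas Paulo, et al. "Histone mark age of '
                    'human tissues and cells." bioRxiv (2023): 2023-08.',
        "doi": "https://doi.org/10.1101/2023.08.21.554165",
    }
    for clock in ("h3k4me3", "h3k9me3", "panhistone")
}


def unique_citations(uns_like):
    """Return each distinct citation once, in order of first appearance."""
    citations = []
    for key, meta in uns_like.items():
        # Only look at entries that follow the CLOCKNAME_metadata naming pattern
        if key.endswith("_metadata") and meta["citation"] not in citations:
            citations.append(meta["citation"])
    return citations


for citation in unique_citations(uns):
    print(citation)
```

Here all three clocks come from the same preprint, so a single citation is printed. With the real object, `adata.uns` could be passed in place of `uns`, assuming every `_metadata` entry contains a `citation` field.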