[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rsinghlab/pyaging/blob/main/tutorials/tutorial_histonemarkchipseq.ipynb) [![Open In nbviewer](https://img.shields.io/badge/View%20in-nbviewer-orange)](https://nbviewer.jupyter.org/github/rsinghlab/pyaging/blob/main/tutorials/tutorial_histonemarkchipseq.ipynb)

# Bulk histone mark ChIP-Seq

This tutorial is a brief guide to using the seven histone-mark-specific clocks and the pan-histone-mark clock that we developed. See the [preprint](https://www.biorxiv.org/content/10.1101/2023.08.21.554165v3) for details.

We just need two packages for this tutorial.

In [1]:
import pandas as pd
import pyaging as pya

## Download and load example data

Let's download an example H3K4me3 ChIP-Seq bigWig file from the ENCODE project.

In [None]:
pya.data.download_example_data('ENCFF386QWG')

|-----> 🏗️ Starting download_example_data function
|-----------> Downloading data to pyaging_data/ENCFF386QWG.bigWig
|-----------> in progress: 19.0045%

To show that multiple bigWig files can be converted into a single DataFrame at once, let's simply pass the same file path twice.

In [None]:
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'])

In [None]:
df.index = ['sample1', 'sample2'] # rename the rows to avoid an anndata warning about duplicate sample names

In [None]:
df.head()
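When real data come from many files, typing sample names by hand gets tedious. The sketch below shows one way to derive row labels from the file names and to number any repeated stems, as happens here where the same path is passed twice. The helper `unique_sample_names` is hypothetical and not part of pyaging; it uses only the standard library.

```python
from collections import Counter
from pathlib import Path


def unique_sample_names(paths):
    """Derive one label per path from the file stem (hypothetical helper)."""
    stems = [Path(p).stem for p in paths]
    totals = Counter(stems)  # how often each stem occurs overall
    seen = Counter()         # how often each stem has been used so far
    names = []
    for stem in stems:
        seen[stem] += 1
        # Only add a numeric suffix when the stem is actually repeated.
        names.append(f"{stem}_{seen[stem]}" if totals[stem] > 1 else stem)
    return names


paths = ["pyaging_data/ENCFF386QWG.bigWig", "pyaging_data/ENCFF386QWG.bigWig"]
names = unique_sample_names(paths)
print(names)  # ['ENCFF386QWG_1', 'ENCFF386QWG_2']
```

The resulting list could then be assigned with `df.index = names` in place of the hard-coded labels above.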

## Convert data to AnnData object

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

In [None]:
adata = pya.preprocess.df_to_adata(df)

Note that the original DataFrame is stored in `X_original` under layers. This is what the `adata` object looks like:

In [None]:
adata

## Predict age

We can run a single clock or several clocks in one call. For convenience, let's pass a few clocks of interest at once. Clock names are case-insensitive.

In [None]:
pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'])

In [None]:
adata.obs.head()
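To illustrate what case-insensitive clock names mean in practice, here is a minimal sketch of a lookup that ignores capitalization. It is an illustration only and not pyaging's actual implementation; the function `resolve_clock_names` and the list `available` are assumptions made for this example.

```python
def resolve_clock_names(requested, available):
    """Map user-supplied names onto canonical names, ignoring case (illustrative only)."""
    # Index the canonical names by their lowercase form.
    canonical = {name.lower(): name for name in available}
    resolved, unknown = [], []
    for name in requested:
        key = name.lower()
        if key in canonical:
            resolved.append(canonical[key])
        else:
            unknown.append(name)
    if unknown:
        raise KeyError(f"Unknown clock name(s): {unknown}")
    return resolved


available = ["CamilloH3K4me3", "CamilloH3K9me3", "CamilloPanHistone"]
resolved = resolve_clock_names(["camilloh3k4me3", "CAMILLOPANHISTONE"], available)
print(resolved)  # ['CamilloH3K4me3', 'CamilloPanHistone']
```

Under this scheme, `'camilloh3k4me3'` and `'CamilloH3K4me3'` refer to the same clock, while a misspelled name raises an error instead of being silently skipped.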

Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, set `verbose` to `False`.

In [None]:
pya.data.download_example_data('ENCFF386QWG', verbose=False)
df = pya.pp.bigwig_to_df(['pyaging_data/ENCFF386QWG.bigWig', 'pyaging_data/ENCFF386QWG.bigWig'], verbose=False)
df.index = ['sample1', 'sample2']
adata = pya.preprocess.df_to_adata(df, verbose=False)
pya.pred.predict_age(adata, ['CamilloH3K4me3', 'CamilloH3K9me3', 'CamilloPanHistone'], verbose=False)

In [None]:
adata.obs.head()

After age prediction, each clock's predictions are added to `adata.obs`. In addition, the percentage of missing values for each clock and other metadata are stored in `adata.uns`.

In [None]:
adata
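As a rough intuition for the missing-value percentage, the sketch below computes the share of a clock's input features that are absent from a dataset. This assumes that "missing" means a feature the clock expects but the data do not contain; pyaging's exact definition may differ. The function `percent_missing` and the feature names are hypothetical.

```python
def percent_missing(clock_features, data_features):
    """Percentage of a clock's input features that are absent from the data."""
    clock_features = list(clock_features)
    if not clock_features:
        raise ValueError("clock_features must not be empty")
    present = set(data_features)
    n_missing = sum(1 for feature in clock_features if feature not in present)
    return 100.0 * n_missing / len(clock_features)


# Hypothetical feature names, for illustration only.
clock_features = ["gene_a", "gene_b", "gene_c", "gene_d"]
data_features = ["gene_a", "gene_c", "gene_d", "gene_x"]
pct = percent_missing(clock_features, data_features)
print(pct)  # 25.0
```

A high percentage suggests that a prediction relies heavily on imputed or default values and should be interpreted with care.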

## Get citation

The DOI, citation, and some metadata are automatically added to the AnnData object under `adata.uns['CLOCKNAME_metadata']`, where `CLOCKNAME` is the clock name in lowercase.

In [None]:
adata.uns['camilloh3k4me3_metadata']
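When several clocks have been run, their metadata entries can be gathered in one pass. The sketch below builds the key following the lowercase `CLOCKNAME_metadata` pattern seen above and collects the entries from a plain dictionary standing in for `adata.uns`. The helpers `metadata_key` and `collect_metadata` are hypothetical, and the dictionary contents are placeholders, not real pyaging output.

```python
def metadata_key(clock_name):
    """Build the key for a clock's metadata entry (pattern taken from the example above)."""
    return f"{clock_name.lower()}_metadata"


def collect_metadata(uns, clock_names):
    """Return {clock name: metadata} for every clock that has an entry in `uns`."""
    return {
        name: uns[metadata_key(name)]
        for name in clock_names
        if metadata_key(name) in uns
    }


# A plain dict standing in for adata.uns; the values are placeholders.
uns = {
    "camilloh3k4me3_metadata": {"doi": "<placeholder>"},
    "camillopanhistone_metadata": {"doi": "<placeholder>"},
}
found = collect_metadata(uns, ["CamilloH3K4me3", "CamilloH3K9me3", "CamilloPanHistone"])
print(sorted(found))  # ['CamilloH3K4me3', 'CamilloPanHistone']
```

With a real AnnData object, `adata.uns` could be passed in place of the stand-in dictionary, since it supports the same key lookups.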