[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rsinghlab/pyaging/blob/main/tutorials/tutorial_dnam.ipynb) [![Open In nbviewer](https://img.shields.io/badge/View%20in-nbviewer-orange)](https://nbviewer.jupyter.org/github/rsinghlab/pyaging/blob/main/tutorials/tutorial_dnam.ipynb)

# Bulk DNA methylation

This tutorial is a brief guide to running bulk DNA-methylation epigenetic clocks with `pyaging`. To demonstrate the breadth of available models, the human section covers the four clocks listed below; the mouse and mammalian sections that follow use four clocks each.

- Horvath's 2013 ElasticNet-based clock ([paper](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r115));
  
- AltumAge, a highly accurate deep-learning-based clock ([paper](https://www.nature.com/articles/s41514-022-00085-y));

- PCGrimAge, a principal-component-based version of the GrimAge clock ([paper](https://www.nature.com/articles/s43587-022-00248-2));

- DunedinPACE, a biomarker of the pace of aging ([paper](https://elifesciences.org/articles/73420)).

This tutorial needs only two packages: `pandas` and `pyaging`.

In [1]:
import pandas as pd
import pyaging as pya

## Homo sapiens

### Download and load example data

Let's download the publicly available dataset GSE139307, measured with Illumina's 450k array. The CpG coverage of the 450k array should be sufficient for most clocks.

In [2]:
pya.data.download_example_data('GSE139307')

|-----> 🏗️ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE139307.pkl
|-----> 🎉 Done! [0.0008s]


In [3]:
df = pd.read_pickle('pyaging_data/GSE139307.pkl')

In [4]:
df.head()

Unnamed: 0,dataset,tissue_type,age,gender,cg00000029,cg00000108,cg00000109,cg00000165,cg00000236,cg00000289,...,ch.X.93511680F,ch.X.938089F,ch.X.94051109R,ch.X.94260649R,ch.X.967194F,ch.X.97129969R,ch.X.97133160R,ch.X.97651759F,ch.X.97737721F,ch.X.98007042R
GSM4137709,GSE139307,sperm,84.0,M,0.084811,0.920696,0.856851,0.084567,0.838699,0.247273,...,0.061751,0.045942,0.037631,0.056455,0.249872,0.049022,0.085691,0.037435,0.07782,0.106234
GSM4137710,GSE139307,sperm,69.0,M,0.099626,0.919073,0.890024,0.115541,0.852584,0.198103,...,0.075077,0.041849,0.032573,0.08979,0.250245,0.079095,0.079756,0.046229,0.091256,0.120241
GSM4137711,GSE139307,sperm,69.0,M,0.117228,0.920276,0.894317,0.117127,0.839258,0.21341,...,0.068679,0.049515,0.058097,0.079919,0.299758,0.079305,0.089815,0.065364,0.086864,0.156005
GSM4137712,GSE139307,sperm,69.0,M,0.077096,0.910204,0.9084,0.073885,0.861615,0.163276,...,0.070091,0.033289,0.038836,0.108213,0.295428,0.050731,0.099943,0.047597,0.07848,0.10748
GSM4137713,GSE139307,sperm,67.0,M,0.063524,0.911608,0.884643,0.079877,0.864654,0.176169,...,0.082368,0.038411,0.048787,0.088631,0.316694,0.041873,0.079303,0.048823,0.08901,0.117903


For PCGrimAge, both age and sex are features. Therefore, to get the full prediction, let's convert the column `gender` into a column called `female`, with 1 being female and 0 being male.

In [5]:
# the model input must be numeric, so encode sex as 0/1 rather than as strings
df['female'] = (df['gender'] == 'F').astype(int)
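
The encoding above treats every label other than `'F'` as male, so a missing or differently spelled label would silently become 0. A minimal sketch of a stricter encoding, assuming the only valid labels are `'F'` and `'M'` (the helper name `encode_female` and the toy data are hypothetical and not part of `pyaging`):

```python
import pandas as pd


def encode_female(gender: pd.Series) -> pd.Series:
    """Map 'F' -> 1 and 'M' -> 0, raising on missing or unknown labels."""
    mapping = {"F": 1, "M": 0}
    unexpected = set(gender.dropna().unique()) - set(mapping)
    if unexpected or gender.isna().any():
        raise ValueError(f"Unexpected or missing sex labels: {sorted(unexpected)}")
    return gender.map(mapping).astype(int)


# Toy data (made up for illustration)
toy = pd.DataFrame({"gender": ["M", "F", "M"]}, index=["s1", "s2", "s3"])
toy["female"] = encode_female(toy["gender"])
```

With the tutorial data, the equivalent call would be `df['female'] = encode_female(df['gender'])`.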

### Convert data to AnnData object

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

In [6]:
adata = pya.pp.df_to_adata(df, metadata_cols=['gender', 'tissue_type', 'dataset'], imputer_strategy='knn')

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----------? Dropping 1 columns with only NAs: ['cg01550828'], etc.
|-----> ⚠️ Create anndata object finished [0.2956s]
|-----> ⚙️ Add metadata to anndata started
|-----------> Adding provided metadata to adata.obs
|-----> ✅ Add metadata to anndata finished [0.0010s]
|-----> ⚙️ Log data statistics started
|-----------> There are 37 observations
|-----------> There are 485513 features
|-----------> Total missing values: 489
|-----------> Percentage of missing values: 0.00%
|-----> ✅ Log data statistics finished [0.0193s]
|-----> ⚙️ Impute missing values started
|-----------> Imputing missing values using knn strategy
|-----> ✅ Impute missing values finished [4.9587s]
|-----> ⚙️ Add imputer strategy to adata.uns started
|-----> ✅ Add imputer strategy to adata.uns finished [0.0002s]
|-----> 🎉 Done! [5.3741s]


Note that the original data is stored in the `X_original` layer and the imputed data in the `X_imputed` layer. This is what the `adata` object looks like:

In [7]:
adata

AnnData object with n_obs × n_vars = 37 × 485513
    obs: 'gender', 'tissue_type', 'dataset'
    var: 'percent_na'
    uns: 'imputer_strategy'
    layers: 'X_original', 'X_imputed'

### Predict age

We can predict with one clock at a time or with several at once. For convenience, let's pass all four clocks of interest in a single call. Clock names are case-insensitive.

In [8]:
adata = pya.pred.predict_age(adata, ['Horvath2013', 'AltumAge', 'PCGrimAge', 'DunedinPACE'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0006s]
|-----> 🕒 Processing clock: horvath2013
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/horvath2013.pt
|-----------> ✅ Load clock finished [0.0031s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------> ✅ Check features in adata finished [0.0013s]
|-----------> ⚙️ Preprocess data started
|-----------------> There is no preprocessing to be done
|-----------> ✅ Preprocess data finished [0.1128s]
|-----------> ⚙️ Initialize model started
|-----------> ✅ Initialize model finished [0.0013s]
|-----------> ⚙️ Predict ages with model started
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0338s]
|-----------> ⚙️ Convert tensor to numpy array started
|-----------> ✅ Convert tensor to n

In [9]:
adata.obs.head()

Unnamed: 0,gender,tissue_type,dataset,horvath2013,altumage,pcgrimage,dunedinpace
GSM4137709,M,sperm,GSE139307,33.624776,37.007213,95.506114,1.326356
GSM4137710,M,sperm,GSE139307,28.829344,29.426899,83.934244,1.215606
GSM4137711,M,sperm,GSE139307,28.316545,22.798928,82.709334,1.271146
GSM4137712,M,sperm,GSE139307,24.85063,18.079173,84.269462,1.276884
GSM4137713,M,sperm,GSE139307,25.942111,20.071985,84.356985,1.26194
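
The log lines and the column names in `adata.obs` suggest that clock names are lowercased before use. A minimal sketch of that normalization follows; it is an assumption about the behavior rather than the library's actual code, and `normalize_clock_names` is a hypothetical helper:

```python
def normalize_clock_names(clock_names):
    """Lowercase clock names, dropping duplicates that differ only in case."""
    seen = set()
    normalized = []
    for name in clock_names:
        key = name.lower()
        if key not in seen:
            seen.add(key)
            normalized.append(key)
    return normalized


# The names passed to predict_age above, mapped to the expected column names
columns = normalize_clock_names(["Horvath2013", "AltumAge", "PCGrimAge", "DunedinPACE"])
```

This is handy for selecting the prediction columns afterwards, for example `adata.obs[columns]`.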


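Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set `verbose=False`. Note that the cell below uses `imputer_strategy='mean'` rather than `'knn'`, which likely explains why a few predictions differ slightly from those above.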

In [10]:
pya.data.download_example_data('GSE139307', verbose=False)
df = pd.read_pickle('pyaging_data/GSE139307.pkl')
df['female'] = (df['gender'] == 'F').astype(int)
adata = pya.preprocess.df_to_adata(df, metadata_cols=['gender', 'tissue_type', 'dataset'], imputer_strategy='mean', verbose=False)
adata = pya.pred.predict_age(adata, ['Horvath2013', 'AltumAge', 'PCGrimAge', 'DunedinPACE'], verbose=False)

In [11]:
adata.obs.head()

Unnamed: 0,gender,tissue_type,dataset,horvath2013,altumage,pcgrimage,dunedinpace
GSM4137709,M,sperm,GSE139307,33.624776,37.007213,95.50578,1.326341
GSM4137710,M,sperm,GSE139307,28.829344,29.426899,83.934244,1.215608
GSM4137711,M,sperm,GSE139307,28.316545,22.805551,82.709334,1.271094
GSM4137712,M,sperm,GSE139307,24.85063,18.060107,84.269462,1.276884
GSM4137713,M,sperm,GSE139307,25.942111,20.071985,84.356985,1.26194


After age prediction, the clocks are added to `adata.obs`. Moreover, the percent of missing values for each clock and other metadata are included in `adata.uns`.

In [12]:
adata

AnnData object with n_obs × n_vars = 37 × 485513
    obs: 'gender', 'tissue_type', 'dataset', 'horvath2013', 'altumage', 'pcgrimage', 'dunedinpace'
    var: 'percent_na'
    uns: 'imputer_strategy', 'horvath2013_percent_na', 'horvath2013_metadata', 'altumage_percent_na', 'altumage_metadata', 'pcgrimage_percent_na', 'pcgrimage_metadata', 'dunedinpace_percent_na', 'dunedinpace_metadata'
    layers: 'X_original', 'X_imputed'

### Get citation

The DOI, citation, and some other metadata are automatically added to the AnnData object under `adata.uns['CLOCKNAME_metadata']`, where `CLOCKNAME` is the lowercase clock name.

In [13]:
adata.uns['horvath2013_metadata']

{'species': 'Homo sapiens',
 'data_type': 'methylation',
 'year': 2013,
 'implementation_approved_by_author(s)': '⌛',
 'preprocessing': None,
 'postprocessing': 'anti_log_linear',
 'citation': 'Horvath, Steve. "DNA methylation age of human tissues and cell types." Genome biology 14.10 (2013): 1-20.',
 'doi': 'https://doi.org/10.1186/gb-2013-14-10-r115',
 'notes': None}
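
The `'postprocessing': 'anti_log_linear'` entry refers to the age transformation from Horvath's 2013 paper: the model is fit on a transformed age that is logarithmic up to an adult age of 20 years and linear afterwards, so the raw model output has to be mapped back to years. A minimal sketch of the transformation and its inverse, following the formula in the paper (how `pyaging` implements this internally is not shown in this tutorial, and the function names are hypothetical):

```python
import math

ADULT_AGE = 20  # adult age used for humans in Horvath (2013)


def log_linear(age, adult_age=ADULT_AGE):
    """Forward transformation: logarithmic up to adult_age, linear afterwards."""
    if age <= adult_age:
        return math.log(age + 1) - math.log(adult_age + 1)
    return (age - adult_age) / (adult_age + 1)


def anti_log_linear(y, adult_age=ADULT_AGE):
    """Inverse transformation: map a transformed age back to years."""
    if y < 0:
        # logarithmic branch (ages below adult_age)
        return (1 + adult_age) * math.exp(y) - 1
    # linear branch (ages at or above adult_age)
    return (1 + adult_age) * y + adult_age


# Round trip for a few ages, in years
round_trip = [anti_log_linear(log_linear(age)) for age in (0, 5, 20, 33.6, 84)]
```

A transformed value of 0 corresponds to exactly the adult age, which is where the two branches meet.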

In [14]:
adata.uns['altumage_metadata']

{'species': 'Homo sapiens',
 'data_type': 'methylation',
 'year': 2022,
 'implementation_approved_by_author(s)': '✅',
 'preprocessing': 'scale',
 'postprocessing': None,
 'citation': 'de Lima Camillo, Lucas Paulo, Louis R. Lapierre, and Ritambhara Singh. "A pan-tissue DNA-methylation epigenetic clock based on deep learning." npj Aging 8.1 (2022): 4.',
 'doi': 'https://doi.org/10.1038/s41514-022-00085-y',
 'notes': None}

In [15]:
adata.uns['pcgrimage_metadata']

{'species': 'Homo sapiens',
 'data_type': 'methylation',
 'year': 2022,
 'implementation_approved_by_author(s)': '⌛',
 'preprocessing': None,
 'postprocessing': None,
 'citation': 'Higgins-Chen, Albert T., et al. "A computational solution for bolstering reliability of epigenetic clocks: Implications for clinical trials and longitudinal tracking." Nature aging 2.7 (2022): 644-661.',
 'doi': 'https://doi.org/10.1038/s43587-022-00248-2',
 'notes': None}

In [16]:
adata.uns['dunedinpace_metadata']

{'species': 'Homo sapiens',
 'data_type': 'methylation',
 'year': 2022,
 'implementation_approved_by_author(s)': '⌛',
 'preprocessing': 'quantile_normalization_with_gold_standard',
 'postprocessing': None,
 'citation': 'Belsky, Daniel W., et al. "DunedinPACE, a DNA methylation biomarker of the pace of aging." Elife 11 (2022): e73420.',
 'doi': 'https://doi.org/10.7554/eLife.73420',
 'notes': "The automatic failure if fewer than 80% of the CpG probes are available is not implemented and left to the user's discretion."}
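
The note above leaves the 80% probe-coverage check to the user. A minimal sketch of such a check, assuming you have the list of CpG probes the clock requires (how to obtain that list from `pyaging` is not covered here; `check_probe_coverage` and the probe lists below are hypothetical):

```python
def check_probe_coverage(required_probes, available_probes, min_fraction=0.8):
    """Return the fraction of required probes that are available.

    Raises ValueError if that fraction is below min_fraction.
    """
    required = set(required_probes)
    if not required:
        raise ValueError("required_probes is empty")
    fraction = len(required & set(available_probes)) / len(required)
    if fraction < min_fraction:
        raise ValueError(
            f"Only {fraction:.1%} of required probes are available "
            f"(minimum {min_fraction:.0%})"
        )
    return fraction


# Toy lists for illustration only (not the real DunedinPACE probe list)
required_probes = ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"]
available_probes = ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000289"]
coverage = check_probe_coverage(required_probes, available_probes)  # 4 of 5 present
```

In practice, `available_probes` would be the feature names of the input data, for example the columns of the original DataFrame.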

## Mus musculus

### Download and load example data

Let's download the publicly available dataset GSE130735, which contains RRBS samples from mouse. Because the data is RRBS, it covers a very large number of CpG sites; the example file is a subset with about 1.8 million of them.

In [17]:
pya.data.download_example_data('GSE130735')

|-----> 🏗️ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE130735_subset.pkl
|-----> 🎉 Done! [0.0008s]


In [18]:
df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')

In [19]:
df.head()

Unnamed: 0,chr1:3020814,chr1:3020842,chr1:3020877,chr1:3020891,chr1:3020945,chr1:3020971,chr1:3020987,chr1:3021012,chr1:3037802,chr1:3037820,...,chrY:1825397,chrY:4682362,chrY:32122892,chrY:85867071,chrY:85867083,chrY:85867117,chrY:85867137,chrY:85867139,chrY:85867178,chrY:88224179
GSM3752631,0.609,0.25,0.408,0.189,0.068,0.373,0.571,0.252,0.333,0.158,...,,,,,,,,,,
GSM3752625,,,0.973,0.984,0.912,0.915,0.987,0.974,0.991,0.932,...,,,,,,,,,,
GSM3752634,,,0.526,0.131,0.0,0.038,0.469,0.769,0.772,0.146,...,,,,,,,,,,
GSM3752620,0.931,0.92,0.988,0.949,0.897,0.921,0.907,0.958,1.0,0.867,...,,,,,,,,,,
GSM3752622,,,0.205,0.382,0.091,0.132,0.174,0.227,0.108,0.053,...,,,,,,,,,,


### Convert data to AnnData object

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

In [20]:
adata = pya.pp.df_to_adata(df, imputer_strategy='mean')

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [1.0154s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0005s]
|-----> ⚙️ Log data statistics started
|-----------> There are 14 observations
|-----------> There are 1778324 features
|-----------> Total missing values: 6322346
|-----------> Percentage of missing values: 25.39%
|-----> ✅ Log data statistics finished [0.0152s]
|-----> ⚙️ Impute missing values started
|-----------> Imputing missing values using mean strategy
|-----> ✅ Impute missing values finished [0.3602s]
|-----> ⚙️ Add imputer strategy to adata.uns started
|-----> ✅ Add imputer strategy to adata.uns finished [0.0003s]
|-----> 🎉 Done! [1.3944s]
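
To make the logged statistics concrete: the missing percentage is the number of missing entries divided by observations × features, here 6322346 / (14 × 1778324) ≈ 25.39%. The sketch below reproduces both steps on a toy matrix with NumPy. It assumes that the mean strategy fills each missing entry with the mean of its feature column, which is an assumption about the imputer and is not taken from the `pyaging` source:

```python
import numpy as np

# Toy matrix: rows are samples, columns are CpG sites (values made up)
X = np.array([
    [0.2, np.nan, 0.9],
    [0.4, 0.5, np.nan],
    [np.nan, 0.7, 0.3],
    [0.6, 0.6, 0.6],
])

# Missing-value statistics, as in the log above
n_missing = int(np.isnan(X).sum())
percent_missing = 100 * n_missing / X.size

# Mean imputation: replace each NaN with the mean of its column
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# The same percentage formula applied to the numbers in the log
logged_percent = 100 * 6322346 / (14 * 1778324)
```

With a quarter of the entries missing, as in this dataset, the choice of imputation strategy can matter more than it does for the 450k example above.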


### Predict age

We can predict with one clock at a time or with several at once. For convenience, let's pass all four mouse clocks of interest in a single call. Clock names are case-insensitive.

In [21]:
adata = pya.pred.predict_age(adata, ['ThompsonMultiTissue', 'MeerMultiTissue', 'PetkovichBlood', 'StubbsMultiTissue'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0009s]
|-----> 🕒 Processing clock: thompsonmultitissue
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/thompsonmultitissue.pt
|-----------> ✅ Load clock finished [0.0020s]
|-----------> ⚙️ Check features in adata started
|-----------------? 1 out of 582 features (0.17%) are missing: ['chr4:91376687'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Expanded adata with 1 missing features
|-----------> ⚠️ Check features in adata finished [0.4493s]
|-----------> ⚙️ Preprocess data started
|-----------------> There is no preprocessing to be done
|-----------> ✅ Preprocess data finished [0.0512s]
|-----------> ⚙️ Initialize model started
|-----------> ✅ Initialize model finished [0.0009s]
|-----------> ⚙️ Predict ages with model started
|-----------------> in progress: 
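
The log above reports that one of the 582 clock features is absent from the data and is filled entirely with 0. On a plain DataFrame, the same idea can be sketched with `reindex`, which adds the missing columns with a constant value. This is a toy illustration: the feature list is hypothetical, and `pyaging` performs this step internally on the AnnData object and may implement it differently.

```python
import pandas as pd

# Two samples and two CpG sites taken from the table above
toy = pd.DataFrame(
    {"chr1:3020877": [0.408, 0.973], "chr1:3020891": [0.189, 0.984]},
    index=["GSM3752631", "GSM3752625"],
)

# Hypothetical clock feature list; 'chr4:91376687' is absent from the data
clock_features = ["chr1:3020877", "chr4:91376687"]

# Report which features are missing, then add them as all-zero columns
missing = [f for f in clock_features if f not in toy.columns]
clock_input = toy.reindex(columns=clock_features, fill_value=0.0)
```

`fill_value` only applies to the columns that `reindex` introduces, so values already present in the data are left untouched.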

All of these age predictions are in units of months.

In [22]:
adata.obs.head()

Unnamed: 0,thompsonmultitissue,meermultitissue,petkovichblood,stubbsmultitissue
GSM3752631,19.634113,7.315183,6.472695,2.198807
GSM3752625,-1.410461,0.028221,2.794689,1.843469
GSM3752634,61.058783,21.322178,9.511293,2.608401
GSM3752620,-2.663815,1.611947,2.155649,1.865847
GSM3752622,20.594114,7.592145,6.978587,2.152412
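
Since the mouse clocks report months, dividing by 12 gives years. A short sketch using the first three rows and two of the columns from the table above:

```python
import pandas as pd

# Predictions in months, copied from the table above
pred_months = pd.DataFrame(
    {
        "thompsonmultitissue": [19.634113, -1.410461, 61.058783],
        "meermultitissue": [7.315183, 0.028221, 21.322178],
    },
    index=["GSM3752631", "GSM3752625", "GSM3752634"],
)

# Convert every prediction column from months to years
pred_years = pred_months / 12
```

On the full result, the same conversion would be `adata.obs / 12`, provided `adata.obs` holds only the clock columns, as it does here.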


Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set `verbose=False`.

In [23]:
pya.data.download_example_data('GSE130735', verbose=False)
df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')
adata = pya.preprocess.df_to_adata(df, imputer_strategy='mean', verbose=False)
adata = pya.pred.predict_age(adata, ['ThompsonMultiTissue', 'MeerMultiTissue', 'PetkovichBlood', 'StubbsMultiTissue'], verbose=False)

In [24]:
adata.obs.head()

Unnamed: 0,thompsonmultitissue,meermultitissue,petkovichblood,stubbsmultitissue
GSM3752631,19.634113,7.315183,6.472695,2.198807
GSM3752625,-1.410461,0.028221,2.794689,1.843469
GSM3752634,61.058783,21.322178,9.511293,2.608401
GSM3752620,-2.663815,1.611947,2.155649,1.865847
GSM3752622,20.594114,7.592145,6.978587,2.152412


After age prediction, the clocks are added to `adata.obs`. Moreover, the percent of missing values for each clock and other metadata are included in `adata.uns`.

In [25]:
adata

AnnData object with n_obs × n_vars = 14 × 1778324
    obs: 'thompsonmultitissue', 'meermultitissue', 'petkovichblood', 'stubbsmultitissue'
    var: 'percent_na'
    uns: 'imputer_strategy', 'thompsonmultitissue_percent_na', 'thompsonmultitissue_metadata', 'meermultitissue_percent_na', 'meermultitissue_metadata', 'petkovichblood_percent_na', 'petkovichblood_metadata', 'stubbsmultitissue_percent_na', 'stubbsmultitissue_metadata'
    layers: 'X_original', 'X_imputed'

### Get citation

The DOI, citation, and some other metadata are automatically added to the AnnData object under `adata.uns['CLOCKNAME_metadata']`, where `CLOCKNAME` is the lowercase clock name.

In [26]:
adata.uns['thompsonmultitissue_metadata']

{'species': 'Mus musculus',
 'data_type': 'methylation',
 'year': 2018,
 'implementation_approved_by_author(s)': '✅',
 'preprocessing': None,
 'postprocessing': None,
 'citation': 'Thompson, Michael J., et al. "A multi-tissue full lifespan epigenetic clock for mice." Aging (Albany NY) 10.10 (2018): 2832.',
 'doi': 'https://doi.org/10.18632/aging.101590',
 'notes': None}

In [27]:
adata.uns['meermultitissue_metadata']

{'species': 'Mus musculus',
 'data_type': 'methylation',
 'year': 2018,
 'implementation_approved_by_author(s)': '⌛',
 'preprocessing': None,
 'postprocessing': None,
 'citation': 'Meer, Margarita V., et al. "A whole lifespan mouse multi-tissue DNA methylation clock." Elife 7 (2018): e40675.',
 'doi': 'https://doi.org/10.7554/eLife.40675',
 'notes': None}

In [28]:
adata.uns['petkovichblood_metadata']

{'species': 'Mus musculus',
 'data_type': 'methylation',
 'year': 2017,
 'implementation_approved_by_author(s)': '⌛',
 'preprocessing': None,
 'postprocessing': 'petkovichblood',
 'citation': 'Petkovich, Daniel A., et al. "Using DNA methylation profiling to evaluate biological age and longevity interventions." Cell metabolism 25.4 (2017): 954-960.',
 'doi': 'https://doi.org/10.1016/j.cmet.2017.03.016',
 'notes': None}

In [29]:
adata.uns['stubbsmultitissue_metadata']

{'species': 'Mus musculus',
 'data_type': 'methylation',
 'year': 2017,
 'implementation_approved_by_author(s)': '⌛',
 'preprocessing': None,
 'postprocessing': 'stubbsmultitissue',
 'citation': 'Stubbs, Thomas M., et al. "Multi-tissue DNA methylation age predictor in mouse." Genome biology 18 (2017): 1-14.',
 'doi': 'https://doi.org/10.1186/s13059-017-1203-5',
 'notes': None}
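
Because every clock stores its metadata under a key ending in `_metadata`, the citations for all clocks that were run can be gathered in one loop. A minimal sketch, using a plain dictionary as a stand-in for `adata.uns` with abbreviated entries from the outputs above (`collect_citations` is a hypothetical helper, not a `pyaging` function):

```python
def collect_citations(uns):
    """Return {clock_name: {'citation': ..., 'doi': ...}} for every *_metadata key."""
    suffix = "_metadata"
    return {
        key[: -len(suffix)]: {"citation": meta["citation"], "doi": meta["doi"]}
        for key, meta in uns.items()
        if key.endswith(suffix)
    }


# Stand-in for adata.uns, with only the fields used here
uns_stand_in = {
    "imputer_strategy": "mean",
    "petkovichblood_metadata": {
        "citation": 'Petkovich, Daniel A., et al. "Using DNA methylation profiling to '
        'evaluate biological age and longevity interventions." '
        "Cell metabolism 25.4 (2017): 954-960.",
        "doi": "https://doi.org/10.1016/j.cmet.2017.03.016",
    },
    "stubbsmultitissue_metadata": {
        "citation": 'Stubbs, Thomas M., et al. "Multi-tissue DNA methylation age '
        'predictor in mouse." Genome biology 18 (2017): 1-14.',
        "doi": "https://doi.org/10.1186/s13059-017-1203-5",
    },
}

citations = collect_citations(uns_stand_in)
```

With a real AnnData object, the call would be `collect_citations(adata.uns)`.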

## Mammalian

### Download and load example data

Let's download the publicly available dataset GSE223748, measured with Illumina's Mammalian Methylation array. The CpGs on this array (~37k) lie in highly conserved sequences. We will use a subset of that data.

In [30]:
pya.data.download_example_data('GSE223748')

|-----> 🏗️ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE223748_subset.pkl
|-----> 🎉 Done! [0.0007s]


In [31]:
df = pd.read_pickle('pyaging_data/GSE223748_subset.pkl')

In [32]:
df.head()

Unnamed: 0,cg00000165,cg00001209,cg00001364,cg00001582,cg00002920,cg00003994,cg00004555,cg00005112,cg00005271,cg00006213,...,rs7746156_II_F_C_37550,rs798149_II_F_C_37528,rs845016_II_F_C_37529,rs877309_II_F_C_37552,rs9292570_I_F_C_37499,rs9363764_II_F_C_37541,rs939290_II_F_C_37535,rs951295_I_F_C_37507,rs966367_II_F_C_37551,rs9839873_II_F_C_37532
204509080002_R01C02,0.094879,0.916154,0.890314,0.053583,0.490381,0.034852,0.159705,0.763959,0.973245,0.928975,...,0.488592,0.491361,0.480024,0.5,0.484252,0.489448,0.505585,0.505335,0.485003,0.510081
202897220142_R04C02,0.497077,0.441263,0.915314,0.047339,0.651029,0.037774,0.082634,0.4158,0.702857,0.821715,...,0.508102,0.500299,0.507261,0.490684,0.499673,0.497256,0.564106,0.482151,0.486667,0.505236
204529320092_R01C02,0.321141,0.834158,0.881194,0.056124,0.68835,0.030225,0.086776,0.777588,0.974587,0.923934,...,0.520404,0.509568,0.507549,0.501659,0.492823,0.487243,0.516018,0.471244,0.491066,0.491759
202794570004_R02C01,0.495226,0.924121,0.915812,0.050866,0.688335,0.032344,0.113318,0.872094,0.969189,0.917076,...,0.499314,0.516132,0.487009,0.487146,0.469119,0.495125,0.548238,0.512283,0.514257,0.49252
203531420070_R05C02,0.183954,0.934332,0.924153,0.055032,0.717495,0.037108,0.098632,0.859614,0.973422,0.963446,...,0.501432,0.509412,0.485055,0.497272,0.480637,0.467502,0.494246,0.500924,0.531334,0.503709


### Convert data to AnnData object

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

In [33]:
adata = pya.pp.df_to_adata(df, imputer_strategy='mean')

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.0159s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> ⚙️ Log data statistics started
|-----------> There are 100 observations
|-----------> There are 37554 features
|-----------> Total missing values: 0
|-----------> Percentage of missing values: 0.00%
|-----> ✅ Log data statistics finished [0.0040s]
|-----> ⚙️ Impute missing values started
|-----------> No missing values found. No imputation necessary
|-----> ✅ Impute missing values finished [0.0058s]
|-----> 🎉 Done! [0.0290s]


This is what the `adata` object looks like:

In [34]:
adata

AnnData object with n_obs × n_vars = 100 × 37554
    var: 'percent_na'
    layers: 'X_original'

### Predict age

We can predict with one clock at a time or with several at once. For convenience, let's pass all four available mammalian clocks in a single call. Clock names are case-insensitive.

In [35]:
adata = pya.pred.predict_age(adata, ['Mammalian1', 'Mammalian2', 'Mammalian3', 'MammalianLifespan',])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0006s]
|-----> 🕒 Processing clock: mammalian1
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/mammalian1.pt
|-----------> ✅ Load clock finished [0.0018s]
|-----------> ⚙️ Check features in adata started
|-----------------> All features are present in adata.var_names.
|-----------> ✅ Check features in adata finished [0.0007s]
|-----------> ⚙️ Preprocess data started
|-----------------> There is no preprocessing to be done
|-----------> ✅ Preprocess data finished [0.0032s]
|-----------> ⚙️ Initialize model started
|-----------> ✅ Initialize model finished [0.0009s]
|-----------> ⚙️ Predict ages with model started
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0032s]
|-----------> ⚙️ Convert tensor to numpy array started
|-----------> ✅ Convert tensor to num

Of note, Mammalian1 predicts age in years, Mammalian2 predicts relative age (0 to 1), and Mammalian3 predicts a log-linear transformed version of age. To convert the Mammalian3 output to years, one needs to know the "adult_age" of the species tested in order to apply the inverse transformation.

In [36]:
adata.obs.head()

Unnamed: 0,mammalian1,mammalian2,mammalian3,mammalianlifespan
204509080002_R01C02,26.372437,0.313472,0.250647,93.886067
202897220142_R04C02,1.176586,0.356768,1.38463,6.999176
204529320092_R01C02,18.776438,0.253814,-0.202204,73.335119
202794570004_R02C01,0.890973,0.320041,1.136356,5.332615
203531420070_R05C02,10.371315,0.142494,-0.55929,68.409331


Having so much information printed can be overwhelming, particularly when running several clocks at once. In such cases, just set `verbose=False`.

In [37]:
pya.data.download_example_data('GSE223748', verbose=False)
df = pd.read_pickle('pyaging_data/GSE223748_subset.pkl')
adata = pya.preprocess.df_to_adata(df, imputer_strategy='mean', verbose=False)
adata = pya.pred.predict_age(adata, ['Mammalian1', 'Mammalian2', 'Mammalian3', 'MammalianLifespan',], verbose=False)

In [38]:
adata.obs.head()

Unnamed: 0,mammalian1,mammalian2,mammalian3,mammalianlifespan
204509080002_R01C02,26.372437,0.313472,0.250647,93.886067
202897220142_R04C02,1.176586,0.356768,1.38463,6.999176
204529320092_R01C02,18.776438,0.253814,-0.202204,73.335119
202794570004_R02C01,0.890973,0.320041,1.136356,5.332615
203531420070_R05C02,10.371315,0.142494,-0.55929,68.409331


After age prediction, the clocks are added to `adata.obs`. Moreover, the percent of missing values for each clock and other metadata are included in `adata.uns`.

In [39]:
adata

AnnData object with n_obs × n_vars = 100 × 37554
    obs: 'mammalian1', 'mammalian2', 'mammalian3', 'mammalianlifespan'
    var: 'percent_na'
    uns: 'mammalian1_percent_na', 'mammalian1_metadata', 'mammalian2_percent_na', 'mammalian2_metadata', 'mammalian3_percent_na', 'mammalian3_metadata', 'mammalianlifespan_percent_na', 'mammalianlifespan_metadata'
    layers: 'X_original'

### Get citation

The DOI, citation, and some other metadata are automatically added to the AnnData object under `adata.uns['CLOCKNAME_metadata']`, where `CLOCKNAME` is the lowercase clock name.

In [40]:
adata.uns['mammalian1_metadata']

{'species': 'multi',
 'data_type': 'methylation',
 'year': 2023,
 'preprocessing': None,
 'postprocessing': 'anti_logp2',
 'citation': 'Lu, A. T., et al. "Universal DNA methylation age across mammalian tissues." Nature aging 3.9 (2023): 1144-1166.',
 'doi': 'https://doi.org/10.1038/s43587-023-00462-6',
 'notes': 'This is the DNAm age predictor from the paper in which there is no adjustment for species',
 'implementation_approved_by_author(s)': '⌛'}

In [41]:
adata.uns['mammalianlifespan_metadata']

{'species': 'multi',
 'data_type': 'methylation',
 'year': 2023,
 'preprocessing': None,
 'postprocessing': 'anti_log',
 'citation': 'Li, Caesar Z., et al. "Epigenetic predictors of species maximum lifespan and other life history traits in mammals." bioRxiv (2023): 2023-11.',
 'doi': 'https://doi.org/10.1101/2023.11.02.565286',
 'notes': 'This is still a preprint, so the model might change',
 'implementation_approved_by_author(s)': '⌛'}