# Stubbs

## Index
1. [Define required clock parameters](#Define-required-clock-parameters)
2. [Download necessary data](#Download-necessary-data)
3. [Load data](#Load-data)
4. [Extract features and weights](#Extract-features-and-weights)
5. [Load weights into pyaging model](#Load-weights-into-pyaging-model)
6. [Add reference values](#Add-reference-values)
7. [Add preprocessing and postprocessing steps](#Add-preprocessing-and-postprocesssing-steps)
8. [Check all data objects](#Check-all-data-objects)
9. [Write clock dictionary](#Write-clock-dictionary)
10. [Clear directory](#Clear-directory)

Let's first import some packages:

In [1]:
import os
import math
import marshal
import shutil
import json
import numpy as np
import torch
import pandas as pd
import pyaging as pya

## Define required clock parameters

Let's define some required information first:

In [2]:
clock_name = 'stubbs'
data_type = 'methylation'
model_class = 'LinearModel'
species = 'Mus musculus'
year = 2017
approved_by_author = '⌛'
citation = "Stubbs, Thomas M., et al. \"Multi-tissue DNA methylation age predictor in mouse.\" Genome biology 18 (2017): 1-14."
doi = "https://doi.org/10.1186/s13059-017-1203-5"
notes = None

## Download necessary data

#### Download directly with curl

In [3]:
supplementary_url = "https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvNDA2NzUvZWxpZmUtNDA2NzUtc3VwcDMtdjIueGxzeA--/elife-40675-supp3-v2.xlsx?_hash=qzOMc4yUFACfDFG%2FlgxkFTHWt%2BSXSmP9zz1BM3oOTRM%3D"
supplementary_file_name = "coefficients.xlsx"
# Quote the URL so the shell does not interpret '?' and other special characters
os.system(f"curl -o {supplementary_file_name} '{supplementary_url}'")

0

#### Download GitHub repository

In [4]:
github_url = "https://github.com/EpigenomeClock/MouseEpigeneticClock.git"
github_folder_name = github_url.split('/')[-1].split('.')[0]
os.system(f"git clone {github_url}")

0
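
The folder name above is derived by string splitting on the clone URL. A minimal sketch of the same derivation as a reusable helper, together with an argument list that could be passed to `subprocess.run` instead of building a shell string for `os.system`. The helper names `repo_folder_name` and `clone_command` are hypothetical and not part of the notebook:

```python
import subprocess


def repo_folder_name(git_url):
    """Return the folder that `git clone` creates for a URL ending in .git."""
    last_part = git_url.rstrip("/").split("/")[-1]
    # Drop a trailing ".git" if present; same result as split('.')[0] for this URL
    return last_part[:-4] if last_part.endswith(".git") else last_part


def clone_command(git_url):
    """Build the argument list for cloning, avoiding shell string interpolation."""
    return ["git", "clone", git_url]


github_url = "https://github.com/EpigenomeClock/MouseEpigeneticClock.git"
folder = repo_folder_name(github_url)
command = clone_command(github_url)
# To actually clone (requires network access and git on the PATH):
# subprocess.run(command, check=True)
```

With `check=True`, a failed clone raises an exception instead of returning a nonzero exit code that is easy to overlook.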

## Load data

#### From Excel file

In [5]:
df = pd.read_excel('coefficients.xlsx', sheet_name='Young age multi-tissue', nrows=329)

#### From tab-delimited text file

In [6]:
reference_feature_values_df = pd.read_table('MouseEpigeneticClock/TrainingMatrix/TrainingData_Babraham_Reizel_Cannon.txt', index_col=0)

## Extract features and weights

First, let's extract the features and weights:

In [7]:
df['feature'] = df['Chromosome'].astype(str) + ':' + df['Position'].astype(int).astype(str)
df['coefficient'] = df['Weight']

df.head()

Unnamed: 0,Chromosome,Position,Feature (within 2kb),Weight,feature,coefficient
0,chr1,73924900,Tns1,-0.000998,chr1:73924900,-0.000998
1,chr1,153890242,5830403L16Rik,-0.000239,chr1:153890242,-0.000239
2,chr1,180350659,Itpkb,-0.002815,chr1:180350659,-0.002815
3,chr2,25579878,Gm996,-0.007452,chr2:25579878,-0.007452
4,chr2,131180254,Spef1,-0.002223,chr2:131180254,-0.002223


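Then, let's create lists for features and weights. Be careful about the intercept: in many coefficient tables it appears as a row with its own feature name and must be separated from the features. Here all 329 rows are used as features and the intercept is set to zero.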

In [8]:
features = df['feature'].tolist()
weights = torch.tensor(df['coefficient'].tolist()).unsqueeze(0)
intercept = torch.tensor([0.0])

## Load weights into pyaging model

#### Linear model

In [9]:
model = pya.models.LinearModel(input_dim=len(features))

model.linear.weight.data = weights.float()
model.linear.bias.data = intercept.float()

model

LinearModel(
  (linear): Linear(in_features=329, out_features=1, bias=True)
)
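
For intuition, the layer above computes a weighted sum of the feature values plus the bias. A pure-Python sketch of that computation on made-up numbers; the toy weights and inputs are illustrative assumptions, not Stubbs coefficients:

```python
def linear_score(x, weights, intercept=0.0):
    """Weighted sum of inputs plus intercept, as a one-output linear layer computes."""
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    return sum(w * v for w, v in zip(weights, x)) + intercept


toy_weights = [0.5, -0.25, 1.0]   # illustrative values only
toy_input = [2.0, 4.0, 1.0]       # one sample, features in the same order as the weights
score = linear_score(toy_input, toy_weights)
```

The order of the values in `x` must match the order of `features`, since the weights are matched to inputs by position only.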

## Add reference values

Some clocks provide reference values that can stand in for missing features. These values may also cover features that are used only during preprocessing rather than by the clock itself. Let's add a dictionary with the feature names as the keys.

In [10]:
reference_feature_values_df.index = ['chr' + index for index in reference_feature_values_df.index]
reference_feature_values_df = reference_feature_values_df.T

reference_feature_values = dict(zip(reference_feature_values_df.columns.tolist(), reference_feature_values_df.mean().tolist()))
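
One way such a dictionary could be used is to fill in features that a sample lacks. The sketch below is an assumption about usage, not pyaging's actual imputation code; `fill_missing_features`, the site names, and the values are all hypothetical:

```python
def fill_missing_features(sample, features, reference_values):
    """Return values in `features` order, using reference values where the sample has none."""
    filled = []
    for name in features:
        if name in sample and sample[name] is not None:
            filled.append(sample[name])
        else:
            # Fall back to the reference (training mean) value for this site
            filled.append(reference_values[name])
    return filled


toy_reference = {"chr1:100": 0.25, "chr1:200": 0.75}   # made-up site names and means
toy_sample = {"chr1:100": 0.5}                         # chr1:200 was not measured
row = fill_missing_features(toy_sample, ["chr1:100", "chr1:200"], toy_reference)
```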

## Add preprocessing and postprocesssing steps

The preprocessing and postprocessing objects are dictionaries with the following format; all items are required. The preprocessing function takes x as a numpy array.

In [11]:
def preprocessing_function(x):
    """
    Apply quantile normalization on x using gold standard means
    and then scale with the means and standard deviation.
    """
    gold_standard_means = preprocessing_helper_objects[0]
    gold_standard_stds = preprocessing_helper_objects[1]

    # Sort the gold standard means
    sorted_gold_standard = np.sort(gold_standard_means)

    # Iterate through each row in x
    for i in range(x.shape[0]):
        # Sort the row data and store the original indices
        sorted_indices = np.argsort(x[i, :])
        sorted_data = x[i, sorted_indices]

        # Map the sorted data to their quantile values in the gold standard
        quantile_indices = np.round(
            np.linspace(0, len(sorted_gold_standard) - 1, len(sorted_data))
        ).astype(int)
        normalized_data = sorted_gold_standard[quantile_indices]

        # Re-order the normalized data to the original order
        original_order_indices = np.argsort(sorted_indices)
        x[i, :] = normalized_data[original_order_indices]

    # Avoid division by zero in case of a column with constant value
    gold_standard_stds = np.array(gold_standard_stds)
    gold_standard_stds[np.abs(gold_standard_stds) < 10e-10] = 1

    x = (x - gold_standard_means) / gold_standard_stds
    return x
    
preprocessing_function_string = marshal.dumps(preprocessing_function.__code__)

preprocessing_helper_objects = [reference_feature_values_df.mean().tolist(), reference_feature_values_df.std().tolist()]

preprocessing = {
    'name': 'quantile_normalization_and_scale_with_gold_standard',
    'preprocessing_function': preprocessing_function_string,
    'preprocessing_helper_objects': preprocessing_helper_objects
}
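
The rank-mapping step of `preprocessing_function` can be hard to follow in array form. A pure-Python sketch of the same idea for a single row, written without numpy so that it stands alone; `quantile_map_row` is a hypothetical name and the numbers are made up:

```python
def quantile_map_row(row, gold_standard):
    """Replace each value in `row` with the gold-standard value of matching rank."""
    sorted_gold = sorted(gold_standard)
    n, m = len(row), len(sorted_gold)

    # Evenly spaced positions in the sorted gold standard, one per rank in the row
    if n == 1:
        positions = [0]
    else:
        positions = [round(k * (m - 1) / (n - 1)) for k in range(n)]

    # Indices of the row's values from smallest to largest
    order = sorted(range(n), key=lambda j: row[j])

    # Write the gold-standard value for each rank back at the original position
    mapped = [0.0] * n
    for rank, original_index in enumerate(order):
        mapped[original_index] = sorted_gold[positions[rank]]
    return mapped


toy_row = [0.9, 0.1, 0.5]     # made-up methylation values for one sample
toy_gold = [0.2, 0.4, 0.6]    # made-up gold-standard means
mapped = quantile_map_row(toy_row, toy_gold)
```

The largest value in the row receives the largest gold-standard value, and so on down the ranks, so the row keeps its ordering while taking on the gold-standard distribution. The notebook's function then standardizes the result with the gold-standard means and standard deviations.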

Postprocessing follows the same pattern. Remember that the function must be compatible with torch and is applied to each number individually.

In [12]:
def postprocessing_function(x):
    """
    Applies a conversion from the output of an ElasticNet to mouse age in months.
    """
    age = math.exp(0.1207 * (x**2) + 1.2424 * x + 2.5440) - 3
    age = age * (7 / 30.5)  # weeks to months
    return age
    
postprocessing_function_string = marshal.dumps(postprocessing_function.__code__)

postprocessing_helper_objects = [1]

postprocessing = {
    'name': 'stubbs',
    'postprocessing_function': postprocessing_function_string,
    'postprocessing_helper_objects': postprocessing_helper_objects
}
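
Because only the function's code object is serialized, names such as `math` are not stored with it and must be supplied again when the function is rebuilt. A minimal sketch of that round trip using the standard `marshal` and `types` modules; how pyaging itself restores the function is not shown in this notebook, so the loading half is an assumption:

```python
import marshal
import math
import types


def postprocessing_function(x):
    """Convert the linear model output to mouse age in months."""
    age = math.exp(0.1207 * (x**2) + 1.2424 * x + 2.5440) - 3
    return age * (7 / 30.5)  # weeks to months


# Serialize only the code object, as in the cell above
payload = marshal.dumps(postprocessing_function.__code__)

# Rebuild a callable; the globals dict must provide every name the code uses
restored = types.FunctionType(marshal.loads(payload), {"math": math})

months_at_zero = restored(0.0)
```

The marshal format is tied to the Python version, so a payload written by one version may not load in another.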

## Check all data objects

Let's print all data objects to check if they make sense.

#### features

In [13]:
def my_print_function():
    print(f"There are {len(features)} features.")
    print(features)
pya.utils.print_to_scrollable_output(my_print_function)

#### reference_feature_values

In [14]:
def my_print_function():
    if reference_feature_values:
        print(f"There are {len(reference_feature_values)} reference feature values.")
    print(reference_feature_values)
pya.utils.print_to_scrollable_output(my_print_function)

#### preprocessing

In [15]:
def my_print_function():
    print(preprocessing)
    if preprocessing:
        print(preprocessing_helper_objects)
pya.utils.print_to_scrollable_output(my_print_function)

#### postprocessing

In [16]:
def my_print_function():
    print(postprocessing)
    if postprocessing:
        print(postprocessing_helper_objects)
pya.utils.print_to_scrollable_output(my_print_function)

#### weight_dict

In [17]:
def my_print_function():
    for name, param in model.named_parameters():
        print(f"Layer: {name}")
        print(f"Shape: {param.shape}")
        print(param.data)
pya.utils.print_to_scrollable_output(my_print_function)

## Write clock dictionary

Let's put everything together and save:

In [18]:
clock_dict = {
    # Metadata
    'clock_name': clock_name,
    'data_type': data_type,
    'model_class': model_class,
    'species': species,
    'year': year,
    'approved_by_author': approved_by_author,
    'citation': citation,
    'doi': doi,
    "notes": notes,

    # Data
    'reference_feature_values': reference_feature_values if reference_feature_values else None,
    'preprocessing': preprocessing if preprocessing else None, 
    'features': features,
    'weight_dict': model.state_dict(),
    'postprocessing': postprocessing if postprocessing else None,
}

torch.save(clock_dict, f'../weights/{clock_name}.pt')

## Clear directory

Delete all files that are not Jupyter notebooks, along with every folder in the current directory:

In [19]:
# Function to remove a folder and all its contents
def remove_folder(path):
    try:
        shutil.rmtree(path)
        print(f"Deleted folder: {path}")
    except Exception as e:
        print(f"Error deleting folder {path}: {e}")

# Get a list of all files and folders in the current directory
all_items = os.listdir('.')

# Loop through the items
for item in all_items:
    # Check if it's a file and does not end with .ipynb
    if os.path.isfile(item) and not item.endswith('.ipynb'):
        os.remove(item)
        print(f"Deleted file: {item}")
    # Check if it's a folder
    elif os.path.isdir(item):
        remove_folder(item)

Deleted file: coefficients.xlsx
Deleted folder: .ipynb_checkpoints
Deleted folder: MouseEpigeneticClock
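
Since this cell deletes files irreversibly, it can help to list what would be removed first. A sketch of a dry-run helper that follows the same rules as the loop above; `plan_cleanup` is a hypothetical name, and the demonstration runs in a temporary directory rather than the notebook folder:

```python
import os
import tempfile


def plan_cleanup(directory="."):
    """Return (files, folders) that the cleanup loop would delete, without deleting them."""
    files, folders = [], []
    for item in sorted(os.listdir(directory)):
        path = os.path.join(directory, item)
        if os.path.isfile(path) and not item.endswith(".ipynb"):
            files.append(item)
        elif os.path.isdir(path):
            folders.append(item)
    return files, folders


# Demonstrate on a throwaway directory with one notebook, one data file, one folder
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "stubbs.ipynb"), "w").close()
    open(os.path.join(tmp, "coefficients.xlsx"), "w").close()
    os.mkdir(os.path.join(tmp, "MouseEpigeneticClock"))
    files_to_delete, folders_to_delete = plan_cleanup(tmp)
```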
