# EpiGePT tutorial

This minimal tutorial shows how to use the pre-trained EpiGePT model to predict epigenomic signals for any genomic region and cell type. As of September 2023, the training data for EpiGePT has been expanded to cover 105 cell types, and the model has been updated to the hg38 reference genome. All the data mentioned in this tutorial can be downloaded from the [Download](https://health.tsinghua.edu.cn/epigept/download.php) page.

## Initialization

In [1]:
import os
# Select the GPU before importing torch; change '5' to a device index available on your machine.
os.environ['CUDA_VISIBLE_DEVICES']='5'
import torch
from pyfasta import Fasta
import numpy as np
import pandas as pd
from model_hg38 import EpiGePT
from model_hg38.config import *
from model_hg38.utils import *

## Load pre-trained model

Load the parameters of the pre-trained model. The model checkpoint can be downloaded from [here](https://health.tsinghua.edu.cn/epigept/help/model.ckpt); the code below expects it at `pretrainModel/model.ckpt`. The reference genome can be downloaded from [here](https://health.tsinghua.edu.cn/epigept/help/hg38.fa), and the code for this tutorial from [here](https://health.tsinghua.edu.cn/epigept/help/code.tar.gz).

In [2]:
model = EpiGePT.EpiGePT(WORD_NUM,TF_DIM,BATCH_SIZE)
model = load_weights(model,'pretrainModel/model.ckpt')

## Predict

Users need to prepare a matrix with dimensions (1000, 711), representing the binding states of 711 transcription factors (TFs) on 1000 genomic bins. This matrix can be obtained by motif scanning with the HOMER tool. A 711-dimensional vector is also required, holding the quantile-normalized TPM values of the same 711 TFs. See the [EpiGePT repository](https://github.com/ZjGaothu/EpiGePT) for specific instructions on both steps.
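The quantile-normalization step mentioned above can be sketched as follows. This is a generic illustration, not the EpiGePT preprocessing code: it assumes a TPM matrix of shape (711 TFs, number of samples) and normalizes each column against the mean of the sorted columns. The function name `quantile_normalize`, the toy data, and the choice of reference distribution are all assumptions; the repository's instructions remain the authoritative source.

```python
import numpy as np

def quantile_normalize(tpm):
    """Quantile-normalize the columns of a (n_tfs, n_samples) matrix.

    Hypothetical helper for illustration; EpiGePT's own pipeline may
    use a different reference distribution or extra transforms.
    """
    tpm = np.asarray(tpm, dtype=float)
    # Rank of each entry within its column (0 = smallest); ties are broken by position.
    ranks = np.argsort(np.argsort(tpm, axis=0), axis=0)
    # Reference distribution: mean of the sorted columns at each rank.
    reference = np.sort(tpm, axis=0).mean(axis=1)
    # Replace every value with the reference value at its rank.
    return reference[ranks]

# Toy data: random TPM values for 711 TFs in 5 samples (assumed sizes).
rng = np.random.default_rng(0)
toy_tpm = rng.gamma(shape=2.0, scale=10.0, size=(711, 5))

normalized = quantile_normalize(toy_tpm)
# The 711-dimensional expression vector for one sample (here the first).
tf_expression = normalized[:, 0]
```

After normalization every column shares the same set of values, so expression vectors from different samples become directly comparable.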

In [3]:
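SEQ_LENGTH = 128000 # input sequence length in bp
# Placeholder: random values standing in for the real (1000 bins, 711 TFs) feature matrix
input_tf_feature = np.random.rand(1000, 711)
# Placeholder: all-zero array standing in for the encoded sequence, shape (batch, 4 nucleotide channels, SEQ_LENGTH)
input_seq_feature = np.zeros((1,4,SEQ_LENGTH))
predict = model_predict(model,input_seq_feature,input_tf_feature)
predict.shape # (BATCH_SIZE, number of bins, number of epigenomic profiles)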

(1, 1000, 8)
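The prediction cell above feeds an all-zero array as the sequence input. A minimal sketch of how a real DNA sequence could be turned into an array of the same shape `(1, 4, SEQ_LENGTH)` is shown below. The helper `one_hot_encode`, the A/C/G/T channel order, and the all-zero handling of `N` are assumptions for illustration; check the utilities shipped with the tutorial code for the encoding the model was actually trained with.

```python
import numpy as np

SEQ_LENGTH = 128000                 # sequence length used in the tutorial
NUM_BINS = 1000                     # number of bins in the model output
BIN_SIZE = SEQ_LENGTH // NUM_BINS   # 128 bp of sequence per predicted bin

def one_hot_encode(seq, alphabet="ACGT"):
    """Encode a DNA string as a (4, len(seq)) one-hot array.

    Hypothetical helper: the channel order is an assumption, and
    bases outside the alphabet (such as N) are left as all zeros.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = np.zeros((len(alphabet), len(seq)))
    for pos, base in enumerate(seq.upper()):
        row = index.get(base)
        if row is not None:
            encoded[row, pos] = 1.0
    return encoded

# Toy sequence of the right length; in practice it would be read from hg38.fa.
toy_seq = "ACGTN" * (SEQ_LENGTH // 5)

# Add a leading batch dimension to match the (1, 4, SEQ_LENGTH) input shape.
input_seq_feature = one_hot_encode(toy_seq)[np.newaxis, ...]
```

With 128,000 bp of input and 1000 output bins, each row of the `(1, 1000, 8)` prediction corresponds to a 128 bp window of the input region.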