## This notebook was created by Jason Miller https://github.com/ShepherdCode/ShepherdML/blob/master/Localization

## GenCode
https://www.gencodegenes.org/human/release_42.html

This notebook runs before Train/Test Split.

This notebook reduces our GenCode sequence files to:

    genes with RCI values in LncAtlas
    transcripts whose GenCode annotation looks good
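
Both filters come down to set-membership tests on the gene ID and the transcript ID. The following is a minimal sketch of that idea, not the notebook's actual code: the helper `keep_transcript` and the small ID sets are hypothetical, and the real allow-lists are built from LncAtlas and the GFF3 annotation.

```python
# Hypothetical allow-lists, for illustration only.
atlas_genes = {'ENSG00000243485', 'ENSG00000238009'}
good_transcripts = {'ENST00000473358.1', 'ENST00000466430.5'}

def keep_transcript(transcript_id, gene_id, allow_genes, allow_transcripts):
    """Return True only if the gene and the transcript both pass their filter."""
    return gene_id in allow_genes and transcript_id in allow_transcripts

records = [
    ('ENST00000473358.1', 'ENSG00000243485'),  # gene and transcript both allowed
    ('ENST00000469289.1', 'ENSG00000243485'),  # transcript missing from the good set
    ('ENST00000000001.1', 'ENSG00000000001'),  # made-up IDs: gene missing from the atlas set
]
kept = [r for r in records
        if keep_transcript(r[0], r[1], atlas_genes, good_transcripts)]
print(kept)
```

Using sets keeps each lookup cheap, which matters when scanning more than a hundred thousand transcripts.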



Human sequence in FASTA:

    genomic DNA
    gene cDNA (with introns)
    transcript CDS (without introns)
    noncoding ncRNA

Human gene annotation in GFF.

Files used by this notebook:

    gencode.v42.annotation.gff3
    gencode.v42.lncRNA_transcripts.fa
    gencode.v42.pc_transcripts.fa

In [1]:
import os

In [2]:
from datetime import datetime
print(datetime.now())

2025-05-19 23:10:20.916684


In [8]:
DATA_DIR = "../../data/GenCode/"
ATLAS_DIR = "../../data/LncAtlas/"

In [4]:
os.listdir(DATA_DIR)

['gencode.v42.annotation.gff3',
 'gencode.v42.lncRNA_transcripts.fa',
 'gencode.v42.pc_transcripts.fa']

In [5]:
# GenCode inputs

ANNOTATION = 'gencode.v42.annotation.gff3'
NONCODING_SEQUENCE = 'gencode.v42.lncRNA_transcripts.fa'
CODING_SEQUENCE = 'gencode.v42.pc_transcripts.fa'
# GenCode outputs
CODING_CSV = 'gencode.v42.pc_transcripts.csv'
NONCODING_CSV = 'gencode.v42.lncRNA_transcripts.csv'
# Atlas inputs
ATLAS_FILE = 'lncATLAS_all_data_RCI.csv'

In [None]:
# dff = pd.read_csv(DATA_DIR+ANNOTATION)

In [6]:
class fasta_reader():
    '''
    Parser for human transcripts FASTA file from GenCode.
    '''
    def __init__(self,infile,outfile,biotype):
        '''
        Biotype should reflect the filename: either 'pc' or 'lncRNA'.
        '''
        self.infile = infile
        self.outfile = outfile
        self.biotype = biotype
        self.FASTA_DEFCHAR = '>'  # signals a defline = definition line
        self.count_in = 0
        self.count_out = 0
        self.allow_genes = None
        self.allow_transcripts = None
        self.headers='transcript_id,gene_id,biotype,length,sequence\n'
    def allow_these_genes(self,genes:set):
        self.allow_genes = genes
    def allow_these_transcripts(self,trans:set):
        self.allow_transcripts = trans
    def print_one_sequence(self,handle,tran,gene,seq):
        allow_genes = self.allow_genes
        allow_trans = self.allow_transcripts
        if seq is not None:
            # sequence is None when we encounter the first defline
            if allow_genes is None or gene in allow_genes:
                if allow_trans is None or tran in allow_trans:
                    biotype = self.biotype
                    length = str(len(seq))
                    outstr = ','.join((tran,gene,biotype,length,seq))
                    handle.write(outstr+'\n')
                    self.count_out += 1
    def fasta_to_csv(self):
        with open(self.outfile,'w') as handle:
            handle.write(self.headers)
            with open(self.infile,'r') as fasta:
                transcript_id = None
                gene_id = None
                next_seq = None
                for line in fasta:
                    if line[0]==self.FASTA_DEFCHAR:
                        self.count_in += 1
                        # The defline starts with '>'
                        # The defline has fields separated by vertical bar
                        # Wrap up the previous sequence before moving on to the next.
                        self.print_one_sequence(handle,transcript_id,gene_id,next_seq)
                        tokens = line.split('|')
                        transcript_id = tokens[0][1:] # chop off '>'
                        gene_id = tokens[1]
                        version_index=gene_id.find('.')
                        if version_index>=0:
                            # chop off version number, as in ENSG00000198888.2
                            gene_id = gene_id[:version_index]
                        next_seq = ""   # get ready for one to many sequence lines
                    else:
                        # In FASTA format, one sequence may continue to next line
                        next_seq = next_seq + line.strip()
            self.print_one_sequence(handle,transcript_id,gene_id,next_seq)
        print(" Input sequences: %d"%self.count_in)
        print("Output sequences: %d"%self.count_out)

In [9]:
def load_atlas_genes(filepath):
    genes = set()
    with open (filepath,'r') as handle:
        header = None
        for row in handle:
            if header is None:
                header = row
            else:
                fields = row.split(',')
                gene_id = fields[0]
                value = fields[3]
                if (value != 'NA'):
                    genes.add(gene_id)  # set removes dupes
    return genes
print(datetime.now())
atlas_genes = load_atlas_genes(ATLAS_DIR+ATLAS_FILE)
print('Atlas good genes:', len(atlas_genes))

2025-05-19 23:14:46.446961
Atlas good genes: 25172


In [10]:
atlas_genes

{'ENSG00000003056',
 'ENSG00000156564',
 'ENSG00000108448',
 'ENSG00000178199',
 'ENSG00000228560',
 'ENSG00000271533',
 'ENSG00000188626',
 'ENSG00000123610',
 'ENSG00000136274',
 'ENSG00000258957',
 'ENSG00000136758',
 'ENSG00000107779',
 'ENSG00000162086',
 'ENSG00000113448',
 'ENSG00000152779',
 'ENSG00000035687',
 'ENSG00000137404',
 'ENSG00000282572',
 'ENSG00000123600',
 'ENSG00000258102',
 'ENSG00000139767',
 'ENSG00000235192',
 'ENSG00000253649',
 'ENSG00000231749',
 'ENSG00000176092',
 'ENSG00000188677',
 'ENSG00000120915',
 'ENSG00000281348',
 'ENSG00000256577',
 'ENSG00000280422',
 'ENSG00000138074',
 'ENSG00000279141',
 'ENSG00000162704',
 'ENSG00000249645',
 'ENSG00000185917',
 'ENSG00000204673',
 'ENSG00000103512',
 'ENSG00000279781',
 'ENSG00000271840',
 'ENSG00000027075',
 'ENSG00000272328',
 'ENSG00000280789',
 'ENSG00000089053',
 'ENSG00000267470',
 'ENSG00000144452',
 'ENSG00000047346',
 'ENSG00000283075',
 'ENSG00000196189',
 'ENSG00000178685',
 'ENSG00000272343',




We keep transcripts with the following combinations of GenCode annotation:

    gene_type=transcript_type=lncRNA
    gene_type=transcript_type=protein_coding

The goal is to avoid these GenCode transcript types:

    transcript_type=retained_intron
    transcript_type=protein_coding_CDS_not_defined
    transcript_type=protein_coding_LoF # note: a partial string match on 'protein_coding' would wrongly keep this type
    transcript_type=nonsense_mediated_decay
    transcript_type=non_stop_decay
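
The partial-match pitfall can be shown in a few lines. This is a sketch under assumptions: the helper `is_good_transcript` is hypothetical and only illustrates why an exact comparison is safer than a prefix test. The type names are the ones listed above.

```python
def is_good_transcript(gene_type, transcript_type):
    """Exact-match test: keep lncRNA/lncRNA or protein_coding/protein_coding only."""
    return (transcript_type == gene_type
            and transcript_type in ('protein_coding', 'lncRNA'))

candidates = [
    'protein_coding',
    'protein_coding_LoF',
    'protein_coding_CDS_not_defined',
    'retained_intron',
]

# A prefix test lets two unwanted types through.
loose = [t for t in candidates if t.startswith('protein_coding')]
# An exact comparison keeps only the intended type.
strict = [t for t in candidates if is_good_transcript('protein_coding', t)]

print(loose)
print(strict)
```

The same reasoning applies to substring tests such as `'protein_coding' in ttype`.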



In [11]:
def load_annotated_transcripts(filepath):
    pc_tids = set()
    nc_tids = set()
    with open (filepath,'r') as handle:
        for row in handle:
            columns = row.split('\t')
            # Avoid comment lines
            if len(columns)>=9 and columns[2] == 'transcript':
                comments = columns[8]
                pairs = comments.split(';')
                tid = None
                gtype = None
                ttype = None
                for pair in pairs:
                    if pair.startswith('ID=ENST'):
                        tid = pair[3:]
                    elif pair.startswith('gene_type='):
                        gtype = pair[10:]
                    elif pair.startswith('transcript_type='):
                        ttype = pair[16:]
                if ttype is not None:
                    if tid is None:
                        raise Exception('transcript type without ID')
                    if ttype==gtype:
                        if ttype=='protein_coding':
                            pc_tids.add(tid)
                        elif ttype=='lncRNA':
                            nc_tids.add(tid)
    return pc_tids, nc_tids
print(datetime.now())
gencode_pc_transcripts,gencode_nc_transcripts = load_annotated_transcripts(DATA_DIR+ANNOTATION)
print('Gencode good pc transcripts:', len(gencode_pc_transcripts))
print('Gencode good nc transcripts:', len(gencode_nc_transcripts))

2025-05-19 23:15:08.936001
Gencode good pc transcripts: 89305
Gencode good nc transcripts: 56049


In [None]:
# gencode_nc_transcripts

lncRNA

Non-coding genes can also have different transcripts called isoforms.

Typical defline of the GenCode lncRNA file:

>ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|
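
The defline above can be split on the vertical bar to recover the IDs. This is a small sketch, assuming the first field is the transcript ID, the second is the versioned gene ID, and the seventh is the transcript length, as the example suggests. The function name `parse_gencode_defline` is hypothetical.

```python
def parse_gencode_defline(defline):
    """Extract transcript ID, unversioned gene ID, and declared length."""
    tokens = defline.strip().split('|')
    transcript_id = tokens[0].lstrip('>')   # drop the leading '>'
    gene_id = tokens[1].split('.')[0]       # drop the version suffix, e.g. '.1'
    declared_length = int(tokens[6])        # assumption: field 7 is the length
    return transcript_id, gene_id, declared_length

defline = '>ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|'
tid, gid, length = parse_gencode_defline(defline)
print(tid, gid, length)
```

The transcript ID keeps its version suffix here while the gene ID loses it, so that gene IDs can be compared with unversioned IDs from another source.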


In [12]:
# GenCode inputs

ANNOTATION = 'gencode.v42.annotation.gff3'
NONCODING_SEQUENCE = 'gencode.v42.lncRNA_transcripts.fa'
CODING_SEQUENCE = 'gencode.v42.pc_transcripts.fa'
# GenCode outputs
CODING_CSV = 'gencode.v42.pc_transcripts.csv'
NONCODING_CSV = 'gencode.v42.lncRNA_transcripts.csv'
# Atlas inputs
ATLAS_DIR = "/content/drive/My Drive/LncRNA/data/LncAtlas_0510/"
ATLAS_FILE = 'lncATLAS_all_data_RCI.csv'

In [13]:
print(datetime.now())
infile = DATA_DIR + NONCODING_SEQUENCE
outfile = DATA_DIR + NONCODING_CSV
converter = fasta_reader(infile,outfile,'lncRNA')
converter.allow_these_genes(atlas_genes)
converter.allow_these_transcripts(gencode_nc_transcripts)
converter.fasta_to_csv()
# Without gencode_nc_transcripts filter, Input sequences: 57936 Output sequences: 30139

2025-05-19 23:15:30.803103
 Input sequences: 57936
Output sequences: 29117


In [14]:
print(datetime.now())
infile = DATA_DIR + CODING_SEQUENCE
outfile = DATA_DIR + CODING_CSV
converter = fasta_reader(infile,outfile,'protein_coding')
converter.allow_these_genes(atlas_genes)
converter.allow_these_transcripts(gencode_pc_transcripts)
converter.fasta_to_csv()


2025-05-19 23:15:46.732902
 Input sequences: 111053
Output sequences: 85641


In [19]:
GENCODE_DIR = "../../data/GenCode/"
ATLAS_DIR= "../../data/LncAtlas/"
NONCODING_ALL = 'gencode.v42.lncRNA_transcripts.csv'
CODING_ALL = 'gencode.v42.pc_transcripts.csv'
ATLAS_DATA='lncATLAS_all_data_RCI.csv'

from random import Random
from datetime import datetime
print(datetime.now())
TEST_PORTION = 0.2
# Output gene lists
CODING_TEST_ID = 'CNRCI_coding_test_genes.gc42.csv'
CODING_TRAIN_ID = 'CNRCI_coding_train_genes.gc42.csv'
NONCODING_TEST_ID = 'CNRCI_noncoding_test_genes.gc42.csv'
NONCODING_TRAIN_ID = 'CNRCI_noncoding_train_genes.gc42.csv'
# Output sequence files
CODING_TEST_SEQ = 'CNRCI_coding_test_transcripts.gc42.csv'
CODING_TRAIN_SEQ = 'CNRCI_coding_train_transcripts.gc42.csv'
NONCODING_TEST_SEQ = 'CNRCI_noncoding_test_transcripts.gc42.csv'
NONCODING_TRAIN_SEQ = 'CNRCI_noncoding_train_transcripts.gc42.csv'

2025-05-19 23:17:38.414733


In [20]:
def load_sequence_data(filepath):
    '''
    Load transcript sequences. Also,
    Load IDs of the genes for which we have sequence.
    The long RNA strings preclude the use of csv.reader utility.
    Expect csv file with this header line:
    transcript_id,gene_id,biotype,length,sequence
    '''
    gene_set=set()
    sequence_data=[]
    with open (filepath) as handle:
        header = None
        for line in handle:
            if header is None:
                header = line
            else:
                line = line.strip()
                fields = line.split(',')
                transcript_id = fields[0]
                gene_id       = fields[1]
                biotype       = fields[2]
                length        = int(fields[3])
                sequence      = fields[4]
                if length != len(sequence):
                    print(line)
                    raise Exception('Lengths do not match')
                gene_set.add(gene_id)
                sequence_data.append(fields)
    gene_list = sorted(list(gene_set))
    return gene_list,sequence_data

In [21]:
def inplace_shuffle(rows):
    generator = Random()
    generator.seed(42)
    generator.shuffle(rows)  # in-place

def train_test_split(rows):
    length = len(rows)
    divider = int(length*TEST_PORTION)
    train_set = rows[divider:]
    test_set = rows[:divider]
    return (train_set,test_set)
def save_csv(rows,filepath):
    with open(filepath,'w') as handle:
        header = 'gene_id'
        handle.write(header)
        handle.write('\n')
        for line in rows:
            handle.write(line)
            handle.write('\n')
def assert_exclusivity(list1,list2):
    set1=set(list1)
    set2=set(list2)
    if len(set1)!=len(list1) or len(set2)!=len(list2):
        raise Exception('Lists contained duplicates')
    intersection = set1 & set2
    if len(intersection)!=0:
        raise Exception('Lists are not exclusive')
def get_atlas_genes(filepath):
    gene_set = set()
    with open (filepath, 'r') as handle:
        header = None
        for row in handle:
            if header is None:
                header = row
            else:
                row = row.strip()
                fields = row.split(',')
                gene_id = fields[0]
                cell_line = fields[1]
                data_type = fields[2]
                value = fields[3]
                if data_type == 'CNRCI' and value != 'NA':
                    gene_set.add(gene_id)
    return gene_set
def get_intersection(all_data, subset_ids):
    intersection = []
    for gene_id in all_data:
        if gene_id in subset_ids:
            intersection.append(gene_id)
    return intersection



In [22]:
print(datetime.now())
coding_genes,    coding_sequence    = load_sequence_data(GENCODE_DIR+CODING_ALL)
noncoding_genes, noncoding_sequence = load_sequence_data(GENCODE_DIR+NONCODING_ALL)

print('First few coding genes:')
print(coding_genes[:5])
print('First few noncoding genes:')
print(noncoding_genes[:5])
print(datetime.now())

2025-05-19 23:17:49.670508
First few coding genes:
['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460']
First few noncoding genes:
['ENSG00000082929', 'ENSG00000099869', 'ENSG00000105501', 'ENSG00000115934', 'ENSG00000116652']
2025-05-19 23:17:50.997833


In [23]:
import pandas as pd
dff = pd.read_csv(GENCODE_DIR+NONCODING_ALL)

In [24]:
dff

Unnamed: 0,transcript_id,gene_id,biotype,length,sequence
0,ENST00000473358.1,ENSG00000243485,lncRNA,712,GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCC...
1,ENST00000469289.1,ENSG00000243485,lncRNA,535,TCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCTGGAATCCGTTGGC...
2,ENST00000466430.5,ENSG00000238009,lncRNA,2748,CTGATCCATATGAATTCCTCTTATTAAGAAAAATAAAGCATCCAGG...
3,ENST00000477740.5,ENSG00000238009,lncRNA,491,GACAAGTTCGAGCATCTTAAAATGATTCAACAGGAGGAGATAAGGA...
4,ENST00000471248.1,ENSG00000238009,lncRNA,629,GAAGCTCGAGGAAGAGAAAAAAAAACTGGAAGGAGAAATCATAGAT...
...,...,...,...,...,...
29112,ENST00000667496.1,ENSG00000229236,lncRNA,4474,GCTCTTGTTGCCCATGCTGGAGTGCAGGGGCGCGATCTGCCCGTCT...
29113,ENST00000659275.1,ENSG00000229236,lncRNA,3567,CACCCTGGCCATAGCCCATGTCAGCAGTAACCTATGTCTTTGTTTT...
29114,ENST00000666666.1,ENSG00000229236,lncRNA,1469,CTTTGCTAAAAAACGGTCTTCCAGTTTCAGAAGTTCGTGGGTCATT...
29115,ENST00000382764.1,ENSG00000183146,lncRNA,878,ACAGGAGGACAAGGACTCAGGGGTCTGCTGGTCCATCTCTGCACCT...


In [25]:
len(noncoding_genes)

6270

In [26]:
atlas_genes = get_atlas_genes(ATLAS_DIR+ATLAS_DATA)
coding_genes = get_intersection(coding_genes, atlas_genes)
noncoding_genes = get_intersection(noncoding_genes, atlas_genes)

In [None]:
len(coding_genes)

17472

In [27]:
print(datetime.now())
inplace_shuffle(coding_genes)
inplace_shuffle(noncoding_genes)

print('First few coding genes:')
print(coding_genes[:5])
print('First few noncoding genes:')
print(noncoding_genes[:5])
print(datetime.now())

2025-05-19 23:18:53.885857
First few coding genes:
['ENSG00000107679', 'ENSG00000164647', 'ENSG00000169299', 'ENSG00000156787', 'ENSG00000101464']
First few noncoding genes:
['ENSG00000185186', 'ENSG00000259005', 'ENSG00000250775', 'ENSG00000266801', 'ENSG00000236352']
2025-05-19 23:18:53.892853


In [28]:
print(datetime.now())
coding_train_set,   coding_test_set    = train_test_split(coding_genes)
noncoding_train_set,noncoding_test_set = train_test_split(noncoding_genes)

print('First few coding train genes:')
print(coding_train_set[:5])
print('First few coding test genes:')
print(coding_test_set[:5])
print('First few noncoding train genes:')
print(noncoding_train_set[:5])
print('First few noncoding test genes:')
print(noncoding_test_set[:5])
print(datetime.now())

2025-05-19 23:18:56.313563
First few coding train genes:
['ENSG00000212659', 'ENSG00000103269', 'ENSG00000156206', 'ENSG00000106608', 'ENSG00000150627']
First few coding test genes:
['ENSG00000107679', 'ENSG00000164647', 'ENSG00000169299', 'ENSG00000156787', 'ENSG00000101464']
First few noncoding train genes:
['ENSG00000213904', 'ENSG00000261766', 'ENSG00000273145', 'ENSG00000259977', 'ENSG00000277463']
First few noncoding test genes:
['ENSG00000185186', 'ENSG00000259005', 'ENSG00000250775', 'ENSG00000266801', 'ENSG00000236352']
2025-05-19 23:18:56.314562


In [29]:


print(datetime.now())
coding_train_sort    = sorted(coding_train_set)
coding_test_sort     = sorted(coding_test_set)
noncoding_train_sort = sorted(noncoding_train_set)
noncoding_test_sort  = sorted(noncoding_test_set)
coding_train_set     = None
coding_test_set      = None
noncoding_train_set  = None
noncoding_test_set   = None

print('First few coding train genes:')
print(coding_train_sort[:5])
print('First few coding test genes:')
print(coding_test_sort[:5])
print('First few noncoding train genes:')
print(noncoding_train_sort[:5])
print('First few noncoding test genes:')
print(noncoding_test_sort[:5])
print(datetime.now())



2025-05-19 23:19:01.182267
First few coding train genes:
['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460']
First few coding test genes:
['ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001460', 'ENSG00000001626']
First few noncoding train genes:
['ENSG00000099869', 'ENSG00000105501', 'ENSG00000116652', 'ENSG00000117242', 'ENSG00000120664']
First few noncoding test genes:
['ENSG00000082929', 'ENSG00000124915', 'ENSG00000130600', 'ENSG00000145063', 'ENSG00000146666']
2025-05-19 23:19:01.186776


In [30]:
filename = ATLAS_DIR + NONCODING_TEST_ID
print(filename)
save_csv( noncoding_test_sort, filename)

filename = ATLAS_DIR + NONCODING_TRAIN_ID
print(filename)
save_csv( noncoding_train_sort, filename)

../../data/LncAtlas/CNRCI_noncoding_test_genes.gc42.csv
../../data/LncAtlas/CNRCI_noncoding_train_genes.gc42.csv


In [31]:
print('Noncoding filtered, train, test:',
      len(noncoding_genes), len(noncoding_train_sort), len(noncoding_test_sort))

Noncoding filtered, train, test: 5827 4662 1165


In [32]:
def save_seq(gene_list,sequence_data,filepath):
    with open(filepath,'w') as handle:
        header = 'transcript_id,gene_id,biotype,length,sequence'
        handle.write(header)
        handle.write('\n')
        valid_ids = set(gene_list)
        for fields in sequence_data:
            transcript_id = fields[0]
            gene_id       = fields[1]
            biotype       = fields[2]
            length        = fields[3]
            sequence      = fields[4]
            if gene_id in valid_ids:
                line = ','.join(fields)
                handle.write(line)
                handle.write('\n')

In [33]:
print('Save transcript sequences')
filename = ATLAS_DIR + CODING_TEST_SEQ
print(filename)
save_seq( coding_test_sort, coding_sequence, filename)

filename = ATLAS_DIR + CODING_TRAIN_SEQ
print(filename)
save_seq( coding_train_sort, coding_sequence, filename)

filename = ATLAS_DIR + NONCODING_TEST_SEQ
print(filename)
save_seq( noncoding_test_sort, noncoding_sequence, filename)

filename = ATLAS_DIR + NONCODING_TRAIN_SEQ
print(filename)
save_seq( noncoding_train_sort, noncoding_sequence, filename)

Save transcript sequences
../../data/LncAtlas/CNRCI_coding_test_transcripts.gc42.csv
../../data/LncAtlas/CNRCI_coding_train_transcripts.gc42.csv
../../data/LncAtlas/CNRCI_noncoding_test_transcripts.gc42.csv
../../data/LncAtlas/CNRCI_noncoding_train_transcripts.gc42.csv


In [34]:
print('done')
print(datetime.now())

done
2025-05-19 23:19:43.375474


## mRNA

The cds file contains the coding sequence, that is, the processed transcript, which is the gene sequence minus the untranslated regions (UTRs) and introns.

Genes can have different transcripts called isoforms.

Some protein_coding genes have degenerate, non-coding transcripts. For example, a protein_coding gene might have one transcript marked nonsense_mediated_decay because it is ill-formed.

Here, we filter for the protein_coding transcripts only.

Typical defline of cds file:
>ENST00000631435.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
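
A defline in this style is a list of space-separated `key:value` tokens followed by a free-text description. The sketch below shows one way to turn it into a dictionary. The function name `parse_ensembl_defline` and the dictionary keys `transcript_id` and `seq_type` are hypothetical, and the layout is assumed from the single example above.

```python
def parse_ensembl_defline(defline):
    """Parse a space-separated key:value defline into a dict."""
    text = defline.strip().lstrip('>')
    # The description may contain spaces and colons, so split it off first.
    head, _, description = text.partition(' description:')
    tokens = head.split()
    fields = {'transcript_id': tokens[0], 'seq_type': tokens[1]}
    for token in tokens[2:]:
        key, _, value = token.partition(':')  # split on the first colon only
        fields[key] = value
    fields['description'] = description
    return fields

defline = ('ENST00000631435.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 '
           'gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene '
           'gene_symbol:TRBD1 description:T cell receptor beta diversity 1 '
           '[Source:HGNC Symbol;Acc:HGNC:12158]')
fields = parse_ensembl_defline(defline)
is_protein_coding = fields['transcript_biotype'] == 'protein_coding'
print(fields['gene'], fields['transcript_biotype'], is_protein_coding)
```

This example transcript is a TR_D_gene, so a protein_coding filter would drop it.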


## lncRNA

The ncrna file contains non-coding sequence, that is, transcripts from non-coding genes.

Non-coding genes can have different transcripts called isoforms.

Some non-coding genes have degenerate transcripts. For example, a lncRNA gene might have one transcript marked retained_intron because it was not processed correctly.

Here, we filter for the lncRNA transcripts only.

Typical defline of ncrna file:
>ENST00000516993.1 ncrna chromosome:GRCh38:1:26593940:26594041:-1 gene:ENSG00000252802.1 gene_biotype:misc_RNA transcript_biotype:misc_RNA gene_symbol:Y_RNA description:Y RNA [Source:RFAM;Acc:RF00019]
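
The lncRNA filter can be sketched as a biotype check on each defline. This assumes that "lncRNA transcripts only" means both `gene_biotype` and `transcript_biotype` equal `lncRNA`. The helper names are hypothetical, and the second and third deflines are made-up records for illustration.

```python
import re

# Capture the two biotype tags wherever they occur in the defline.
BIOTYPE_RE = re.compile(r'\b(gene_biotype|transcript_biotype):(\S+)')

def biotypes(defline):
    """Return a dict such as {'gene_biotype': ..., 'transcript_biotype': ...}."""
    return dict(BIOTYPE_RE.findall(defline))

def keep_defline(defline, wanted):
    """Keep a record only if both biotypes equal the wanted label exactly."""
    found = biotypes(defline)
    return (found.get('gene_biotype') == wanted
            and found.get('transcript_biotype') == wanted)

y_rna = ('>ENST00000516993.1 ncrna chromosome:GRCh38:1:26593940:26594041:-1 '
         'gene:ENSG00000252802.1 gene_biotype:misc_RNA transcript_biotype:misc_RNA '
         'gene_symbol:Y_RNA description:Y RNA [Source:RFAM;Acc:RF00019]')
# Hypothetical deflines with invented IDs and coordinates.
good = ('>ENST00000000000.1 ncrna chromosome:GRCh38:1:1:100:1 '
        'gene:ENSG00000000000.1 gene_biotype:lncRNA transcript_biotype:lncRNA')
degenerate = ('>ENST00000000001.1 ncrna chromosome:GRCh38:1:1:100:1 '
              'gene:ENSG00000000000.1 gene_biotype:lncRNA transcript_biotype:retained_intron')

results = [keep_defline(d, 'lncRNA') for d in (y_rna, good, degenerate)]
print(results)
```

Requiring both tags to match drops degenerate transcripts of lncRNA genes as well as records from other non-coding gene types.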
