# MCB 112 Pset 01: the case of the dead sand mouse
* Eric Yang
* 09/14/2020

# 1. check that the gene names match

In [1]:
# Extract gene data and names in Moriarty_SuppTable1
with open('Moriarty_SuppTable1', 'r') as infile:
    next(infile)
    moriarty = [] # names of genes plus timestamp data
    m_genes = [] # just names of genes
    for line in infile:
        gene = line.strip().split()
        moriarty.append(gene)
        m_genes.append(gene[0])
m_genes

['anise',
 'apricot',
 'artichoke',
 'arugula',
 'asparagus',
 'avocado',
 'banana',
 'basil',
 'beet',
 'blackberry',
 'blueberry',
 'broccoli',
 'butternut',
 'cabbage',
 'cantaloupe',
 'caraway',
 'carrot',
 'cauliflower',
 'cayenne',
 'celery',
 'chard',
 'cherry',
 'chestnut',
 'chickpea',
 'cilantro',
 'clementine',
 'coconut',
 'coriander',
 'cranberry',
 'cucumber',
 'currant',
 'eggplant',
 'elderberry',
 'endive',
 'fennel',
 'fig',
 'garlic',
 'ginger',
 'gooseberry',
 'grape',
 'grapefruit',
 'guava',
 'honeydew',
 'horseradish',
 'huckleberry',
 'juniper',
 'kiwi',
 'kohlrabi',
 'lavender',
 'leek',
 'lentil',
 'lettuce',
 'lime',
 'maize',
 'mango',
 'melon',
 'mulberry',
 'mushroom',
 'mustard',
 'nectarine',
 'okra',
 'olive',
 'onion',
 'orange',
 'oregano',
 'papaya',
 'parsley',
 'parsnip',
 'pea',
 'peach',
 'pear',
 'pepper',
 'persimmon',
 'pineapple',
 'plantain',
 'plum',
 'pomegranate',
 'potato',
 'pumpkin',
 'quince',
 'radish',
 'raisin',
 'raspberry',
 'rhu

After some analysis, I noticed that there are 2 duplicate gene names in moriarty. So, I chose to store the moriarty data in a 2D list rather than a dictionary mapping gene name to data, because dictionary keys must be unique and a repeated name would overwrite the earlier row. In addition, lists are easier to sort than dictionaries. For adler, I will make a 2D list as well as a dictionary, for ease of merging the two datasets later.
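As a minimal sketch of how such duplicates could be found, here is one approach using `collections.Counter`. The helper name `find_duplicates` and the example list are hypothetical stand-ins, not the actual analysis or the real `m_genes`.

```python
from collections import Counter

def find_duplicates(names):
    """Return {name: count} for every name that appears more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}

# Hypothetical stand-in for m_genes; the real list comes from Moriarty_SuppTable1
example_names = ['geneA', 'geneB', 'geneA', 'geneC', 'geneB']
print(find_duplicates(example_names))
```

Calling `find_duplicates(m_genes)` on the real list would show which names repeat and how often.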

In [2]:
# Extract gene data and names in Adler_SuppTable2
with open('Adler_SuppTable2', 'r') as infile:
    next(infile)
    adler = [] # names of genes plus synth and halflife data
    a_map = {} # key is gene name, values are synth rate and halflife, used for mapping later
    for line in infile:
        gene = line.strip().split()
        adler.append(gene)
        a_map[gene[0]]=[gene[1], gene[2]]
a_genes = a_map.keys() #no duplicates in adler, so can just access map keys to get all genes
a_genes

dict_keys(['INO80B', 'AF131216.1', 'BTLA', 'MSI2', 'ZNF18', 'HMGCR', 'PIGU', 'olive', 'NATD1', 'SURF1', 'PRSS53', 'CDCA7', 'LRFN3', 'CREB3L4', 'TXNDC17', 'ING1', 'RCC1', 'HMGN1', 'IGFL3', 'RNASE9', 'CPTP', 'PLCH1', 'RAE1', 'HIST1H4K', 'ABC7-42404400C24.1', 'SEMA4G', 'TRAF5', 'SYNGR3', 'PIM1', 'ZNF76', 'UBA1', 'EPHA7', 'GCGR', 'CLEC19A', 'CT47A7', 'ANP32A', 'OR2L3', 'CR392000.1', 'GLTSCR2', 'GSKIP', 'AC087651.1', 'PFN3', 'SF3B4', 'TIAF1', 'RP11-20I23.1', 'STRN4', 'SEPT7', 'SCNN1A', 'ESCO2', 'OTOL1', 'FBXW5', 'SLN', 'NFYA', 'ZNF735', 'MBD6', 'TMEM177', 'NUDT5', 'CASP6', 'ATP11B', 'APCDD1L', 'C8A', 'HOXA11', 'TRIM4', 'KIAA0895', 'SURF2', 'FPGT-TNNI3K', 'oregano', 'PARPBP', 'HIST2H4B', 'ALS2', 'ARV1', 'USP54', 'IQCA1L', 'lime', 'HS3ST4', 'ARMC5', 'PLA2G4E', 'RP11-392E22.9', 'ITPKB', 'TRIM32', 'INPP5E', 'CXCL8', 'SPTB', 'CHN2', 'FAM117A', 'KCNA4', 'RCVRN', 'TMCO6', 'SRPRA', 'GATC', 'KCNH1', 'OR8J3', 'FAM136A', 'TUBA4B', 'DCUN1D5', 'ADAP2', 'ALDH7A1', 'IK', 'XRCC1', 'EBF4', 'NCAM2', 'TSHZ1',

In [3]:
# get gene names that appear in moriarty but not in adler
moriarty_diff = []
for gene in m_genes:
    if gene not in a_genes:
        moriarty_diff.append(gene)
moriarty_diff

['15-Sep',
 '2-Mar',
 '1-Mar',
 '10-Sep',
 '7-Mar',
 '4-Mar',
 '2-Sep',
 '11-Sep',
 '1-Mar',
 '6-Mar',
 '11-Mar',
 '3-Mar',
 '8-Sep',
 '7-Sep',
 '14-Sep',
 '6-Sep',
 '1-Dec',
 '8-Mar',
 '5-Mar',
 '9-Mar',
 '12-Sep',
 '1-Sep',
 '4-Sep',
 '10-Mar',
 '9-Sep',
 '2-Mar',
 '5-Sep',
 '3-Sep']

It looks like dates were accidentally entered into the "gene_name" column in the Moriarty file! I'm not sure whether these entries are meaningless and can be deleted, or whether they are actually genes in Adler that were mislabeled in Moriarty. I'm going to compare the lengths of these lists for a quick sanity check.

In [4]:
print('moriarty genes:', len(m_genes))
print('adler genes:', len(a_genes))
print('genes in moriarty not adler:', len(moriarty_diff))

moriarty genes: 20031
adler genes: 20031
genes in moriarty not adler: 28


Uh oh... both files have 20031 genes, so it looks like these mislabeled genes in Moriarty actually correspond to some genes in Adler, although I'm not sure which one corresponds to which. I'm going to create a list of genes that are in Adler but not in Moriarty to figure out which genes are mislabeled in Moriarty.

In [5]:
m_gene_set = set(m_genes) # set membership is much faster than scanning the list for every gene
moriarty_missing = []
for gene in a_genes:
    if gene not in m_gene_set:
        moriarty_missing.append(gene)
print(moriarty_missing)
print('genes in adler not in moriarty:', len(moriarty_missing))

['SEPT7', 'MARCH10', 'SEPT12', 'SEPT4', 'MARCH7', 'SEPT11', 'MARCH1', 'MARCH9', 'SEPT10', 'MARCH11', 'SEPT6', 'MARC1', 'SEPT14', 'MARCH3', 'SEPT8', 'MARC2', 'SEPT9', 'SEPT2', 'MARCH2', 'MARCH6', 'SEPT1', 'SEPT5', 'MARCH4', 'SEP15', 'MARCH5', 'MARCH8', 'DEC1', 'SEPT3']
genes in adler not in moriarty: 28


Oh wait! These genes in Adler look like dates too (SEPT7, MARCH10, DEC1)! It looks like these are the same 28 rows in both datasets, just written differently: Adler has month-like gene symbols, and Moriarty has the matching day-month strings. This is probably an Excel artifact, where names that look like dates get autocorrected during entry. I will keep them in both gene lists for now...
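A rough sketch of how the two spellings might be lined up, assuming the Moriarty names are what a spreadsheet produces when it reads a symbol such as SEPT7 as "September 7". The function `excel_date_form` and its prefix rule are my own assumptions for illustration, not a documented Excel behavior, and the short lists are samples taken from the outputs above.

```python
import re
from collections import Counter

MONTHS = ['JANUARY', 'FEBRUARY', 'MARCH', 'APRIL', 'MAY', 'JUNE', 'JULY',
          'AUGUST', 'SEPTEMBER', 'OCTOBER', 'NOVEMBER', 'DECEMBER']

def excel_date_form(symbol):
    """Guess the 'D-Mon' string a date-like gene symbol might be turned into.

    Returns None when the symbol is not a month-name prefix followed by a day.
    """
    match = re.fullmatch(r'([A-Z]{3,})(\d{1,2})', symbol)
    if match is None:
        return None
    letters, day = match.group(1), int(match.group(2))
    for month in MONTHS:
        # Assumption: any prefix of a month name (MAR, MARC, MARCH, SEP, SEPT) counts
        if month.startswith(letters) and 1 <= day <= 31:
            return '{}-{}'.format(day, month[:3].title())
    return None

# Sample of names from moriarty_missing (Adler side) and moriarty_diff (Moriarty side)
adler_side = ['SEPT7', 'MARCH1', 'MARC1', 'SEP15', 'DEC1']
moriarty_side = ['15-Sep', '1-Mar', '7-Sep', '1-Dec', '1-Mar']

for symbol in adler_side:
    print(symbol, '->', excel_date_form(symbol))

# Compare as multisets, because different symbols can collapse to the same date
print(Counter(excel_date_form(s) for s in adler_side) == Counter(moriarty_side))  # True
```

Note that MARCH1 and MARC1 both map to `1-Mar` under this rule, so the conversion cannot be reversed uniquely from the name alone. That would also be consistent with the duplicate names seen in moriarty.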

# 2. explore the data

In [6]:
# 5 genes with the highest mRNA synthesis rate
adler.sort(key = lambda row: float(row[1])) # sort ascending by synthesis rate
adler[-5:]

[['MAPK10', '62.8', '6.2'],
 ['RNASE7', '78.3', '15.6'],
 ['CFAP100', '83.1', '17.9'],
 ['TMEM2', '87.5', '6.1'],
 ['SCT', '174.5', '16.1']]

In [7]:
# 5 genes with the highest mRNA half life
adler.sort(key = lambda row: float(row[2])) # sort ascending by halflife
adler[-5:]

[['EIF4A1', '1.7', '56.8'],
 ['ECE2', '32.5', '56.8'],
 ['ERF', '4.4', '59.2'],
 ['SELL', '5.9', '62.5'],
 ['GIN1', '8.4', '66.3']]

In [8]:
# 5 genes with highest ratio of expression at t = 96 vs t = 0
for obs in moriarty: # adds ratio of each gene to last column
    obs.append(round(float(obs[5])/float(obs[1]),2))
moriarty.sort(key = lambda row: row[6]) # sort ascending by the 96h/0h ratio
moriarty[-5:]

[['ERF', '265.7', '534.4', '826.6', '1999.8', '5036.7', 18.96],
 ['HNF1A', '26.1', '50.0', '74.3', '168.9', '515.5', 19.75],
 ['EIF4A1', '97.9', '192.2', '313.4', '675.5', '2215.2', 22.63],
 ['SELL', '371.1', '730.1', '1264.2', '2858.9', '9452.8', 25.47],
 ['GIN1', '585.1', '1078.6', '1739.0', '4174.9', '15473.5', 26.45]]

# 3. figure out what happened 

In [9]:
# Combine moriarty and adler files, merge by gene name, ignore genes not found in both files
# column names
combined = [['# gene_name', ' tpm[12h]/tpm[0]',  'tpm[24h]/tpm[0]', 'tpm[48h]/tpm[0]', 
             'tpm[96h]/tpm[0]', 'synth_rate', 'halflife']]
for obs in moriarty:
    if obs[0] in a_genes: 
        line = [obs[0], round(float(obs[2])/float(obs[1]),2), 
                round(float(obs[3])/float(obs[1]),2),
                round(float(obs[4])/float(obs[1]),2), 
                round(float(obs[5])/float(obs[1]),2), 
               a_map[obs[0]][0], a_map[obs[0]][1]]
        combined.append(line)
combined

[['# gene_name',
  ' tpm[12h]/tpm[0]',
  'tpm[24h]/tpm[0]',
  'tpm[48h]/tpm[0]',
  'tpm[96h]/tpm[0]',
  'synth_rate',
  'halflife'],
 ['arugula', 0.61, 0.35, 0.09, 0.0, '1.2', '9.1'],
 ['asparagus', 0.4, 0.15, 0.02, 0.0, '2.5', '7.1'],
 ['avocado', 0.59, 0.32, 0.07, 0.0, '5.3', '9.0'],
 ['blackberry', 0.56, 0.33, 0.08, 0.0, '4.5', '9.0'],
 ['blueberry', 0.77, 0.43, 0.11, 0.0, '0.9', '9.5'],
 ['cayenne', 0.57, 0.29, 0.06, 0.0, '5.6', '8.7'],
 ['celery', 0.58, 0.35, 0.09, 0.0, '7.3', '9.1'],
 ['cherry', 0.59, 0.29, 0.06, 0.0, '2.0', '8.8'],
 ['chickpea', 0.63, 0.28, 0.06, 0.0, '7.7', '8.5'],
 ['coconut', 0.41, 0.14, 0.01, 0.0, '1.1', '6.8'],
 ['elderberry', 0.46, 0.2, 0.04, 0.0, '1.2', '7.8'],
 ['fig', 0.64, 0.42, 0.12, 0.0, '0.7', '9.8'],
 ['grapefruit', 0.51, 0.25, 0.05, 0.0, '10.1', '8.2'],
 ['huckleberry', 0.6, 0.35, 0.08, 0.0, '2.4', '9.1'],
 ['kiwi', 0.07, 0.0, 0.0, 0.0, '0.3', '3.8'],
 ['lavender', 0.41, 0.13, 0.01, 0.0, '4.1', '6.7'],
 ['leek', 0.25, 0.06, 0.0, 0.0, '5.8', '5.4']

In [10]:
# Save combined dataset to file
with open('moriarty_adler.txt', 'w') as outfile:
    for line in combined:
        for word in line:
            outfile.write('{:<17}'.format(str(word)))
        outfile.write('\n')

In [11]:
# store combined as dict for ease of data exploration
c_map = {}
for obs in combined:
    c_map[obs[0]] = obs[1:] # key is gene name, values is a list of ' tpm[12h]/tpm[0]',  'tpm[24h]/tpm[0]', 'tpm[48h]/tpm[0]', 
                            #'tpm[96h]/tpm[0]', 'synth_rate', 'halflife']
del c_map['# gene_name'] # remove header from map

# 4 genes in given plots
print('tomato: ' + str(c_map['tomato']))
print('MLX: ' + str(c_map['MLX']))
print('ANAPC15: ' + str(c_map['ANAPC15']))
print('chestnut: ' + str(c_map['chestnut']))

tomato: [0.56, 0.29, 0.06, 0.0, '10.2', '8.7']
MLX: [1.2, 1.58, 1.41, 1.0, '5.5', '20.2']
ANAPC15: [1.51, 1.64, 2.26, 1.6, '0.6', '23.1']
chestnut: [1.67, 2.65, 3.7, 6.06, '3.1', '32.4']


In [12]:
# 5 genes with highest mRNA synthesis rates
print('MAPK10: ' + str(c_map['MAPK10']))
print('RNASE7: ' + str(c_map['RNASE7']))
print('CFAP100: ' + str(c_map['CFAP100']))
print('TMEM2: ' + str(c_map['TMEM2']))
print('SCT: ' + str(c_map['SCT']))

MAPK10: [0.32, 0.09, 0.01, 0.0, '62.8', '6.2']
RNASE7: [1.13, 1.1, 0.86, 0.24, '78.3', '15.6']
CFAP100: [1.22, 1.14, 1.02, 0.52, '83.1', '17.9']
TMEM2: [0.29, 0.08, 0.01, 0.0, '87.5', '6.1']
SCT: [1.12, 1.06, 0.83, 0.31, '174.5', '16.1']


In [13]:
# 5 genes with the longest mRNA halflife
print('EIF4A1: ' + str(c_map['EIF4A1']))
print('ECE2: ' + str(c_map['ECE2']))
print('ERF: ' + str(c_map['ERF']))
print('SELL: ' + str(c_map['SELL']))
print('GIN1: ' + str(c_map['GIN1']))

EIF4A1: [1.96, 3.2, 6.9, 22.63, '1.7', '56.8']
ECE2: [1.7, 2.86, 6.33, 18.51, '32.5', '56.8']
ERF: [2.01, 3.11, 7.53, 18.96, '4.4', '59.2']
SELL: [1.97, 3.41, 7.7, 25.47, '5.9', '62.5']
GIN1: [1.84, 2.97, 7.14, 26.45, '8.4', '66.3']


In [14]:
# 5 genes with the highest ratio of expression at t = 96 vs t = 0
print('ERF: ' + str(c_map['ERF']))
print('HNF1A: ' + str(c_map['HNF1A']))
print('EIF4A1: ' + str(c_map['EIF4A1']))
print('SELL: ' + str(c_map['SELL']))
print('GIN1: ' + str(c_map['GIN1']))

ERF: [2.01, 3.11, 7.53, 18.96, '4.4', '59.2']
HNF1A: [1.92, 2.85, 6.47, 19.75, '0.4', '56.8']
EIF4A1: [1.96, 3.2, 6.9, 22.63, '1.7', '56.8']
SELL: [1.97, 3.41, 7.7, 25.47, '5.9', '62.5']
GIN1: [1.84, 2.97, 7.14, 26.45, '8.4', '66.3']


After some data exploration, here is my hypothesis. Moriarty sees "increased expression" (TPM) in some genes because their mRNAs have longer half-lives than the others. Four of the five genes with the highest ratio of expression at t = 96 vs t = 0 are also among the five genes with the longest mRNA half-lives, and the fifth, HNF1A, has a half-life of 56.8, tied with EIF4A1 and ECE2. TPM captures relative transcript abundance: an individual gene's abundance (numerator) as a fraction of the total abundance across all genes (denominator) at a time point. When the genes with shorter half-lives decay, the total abundance decreases. For genes with longer half-lives, their individual transcript abundance does not decrease as fast as the total, causing their TPM fraction to increase over time. Of note is that this phenomenon is independent of mRNA synthesis rate, since the mouse is dead and no longer producing new mRNA. Genes with high mRNA synthesis rates but relatively short half-lives still decay fairly rapidly in TPM: MAPK10 and TMEM2, with synthesis rates above 60 and half-lives near 6, are at 0.0 by t = 96. This allows us to infer that the change in TPM is driven by mRNA half-life. In summary, genes with long half-lives show an increase in TPM over time not because of heightened cortical gene expression and synthesis after death, but because their transcripts, already present in the mouse, decay slowly relative to the others.
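To illustrate the argument, here is a toy model, not a fit to the data. It assumes that abundance before death sits at a steady state proportional to synthesis rate times half-life, that after death each mRNA decays exponentially with no new synthesis, and that half-lives are in hours like the time points. The function `tpm_ratios` is hypothetical, and the five genes (synthesis rate, half-life) are taken from the outputs above. With only five genes in the denominator, the numbers will not match the real ratios.

```python
def tpm_ratios(genes, t):
    """genes: {name: (synth_rate, halflife)}; returns {name: tpm(t) / tpm(0)}."""
    # Assumed steady state before death: abundance proportional to synth_rate * halflife
    n0 = {g: s * h for g, (s, h) in genes.items()}
    # Assumed pure exponential decay after death, with no new synthesis
    nt = {g: n0[g] * 2 ** (-t / genes[g][1]) for g in genes}
    total0, total_t = sum(n0.values()), sum(nt.values())
    # TPM is a fraction of the total, so the 1e6 scale factor cancels in the ratio
    return {g: (nt[g] / total_t) / (n0[g] / total0) for g in genes}

# (synth_rate, halflife) pairs as listed in the combined table above
toy = {'kiwi': (0.3, 3.8), 'tomato': (10.2, 8.7), 'MLX': (5.5, 20.2),
       'chestnut': (3.1, 32.4), 'GIN1': (8.4, 66.3)}

for t in (12, 24, 48, 96):
    ratios = tpm_ratios(toy, t)
    print(t, {g: round(r, 2) for g, r in ratios.items()})
```

In this model each gene's ratio is its own decay factor `2 ** (-t / halflife)` divided by the fraction of total mRNA remaining, and that second factor is the same for every gene. So the ordering of the ratios depends only on half-life, while synthesis rate only sets the starting abundance.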