# Data Preprocessing

## Overview

The raw data from Eraslan et al. is a tab-separated table containing gene names, the related Ensembl IDs, and measured or calculated values for mRNA abundance, protein abundance, and protein-to-mRNA ratio (PTR).

The first Jupyter cells explore the data roughly. After that, the values relevant for the upcoming analysis are extracted.

Some problems arose:
- Not all transcript IDs are the current canonical transcript for their gene. In most cases the given IDs point to transcripts that do not translate. In that case the gene has to be identified and its up-to-date transcript ID resolved.
- Some transcript IDs are still annotated with "nonsense mediated decay". These will be thrown out as they do not translate successfully.
- The GENCODE data set contains duplicate files for some of the given transcript IDs. The duplicates are the exact same files under different names and can easily be filtered with a regex.

In [1]:
# library dependencies
import pandas as pd
from pathlib import Path
from bs4 import BeautifulSoup
import requests
import re

## Reading the data

In [2]:
# raw data file and path
datafile = '../data/Eraslan-EV3.tsv'

# sanity check if the file exists
if not Path(datafile).is_file():
    print('Data file not found!')

## Exploring the data

In [3]:
# reading the data into a dataframe and looking at the first entries
df = pd.read_csv(datafile, sep='\t')
df

Unnamed: 0,GeneName,EnsemblGeneID,EnsemblTranscriptID,EnsemblProteinID,Adrenal_mRNA,Appendices_mRNA,Brain_mRNA,Colon_mRNA,Duodenum_mRNA,Endometrium_mRNA,...,Rectum_PTR,Salivarygland_PTR,Smallintestine_PTR,Smoothmuscle_PTR,Spleen_PTR,Stomach_PTR,Testis_PTR,Thyroid_PTR,Tonsil_PTR,Urinarybladder_PTR
0,A1BG,ENSG00000121410,ENST00000263100,ENSP00000263100,,1.073,,,,,...,,7.718,,,7.313,,,,,
1,A1CF,ENSG00000148584,ENST00000373993,ENSP00000363105,,,,1.971,2.324,,...,5.147,,5.202,,,5.8143,,,,
2,A2M,ENSG00000175899,ENST00000318602,ENSP00000323929,3.154,3.021,2.824,3.321,3.006,3.344,...,6.081,5.726,5.699,4.997,5.136,6.5349,5.820,6.060,5.675,5.8286
3,A2ML1,ENSG00000166535,ENST00000299698,ENSP00000299698,,,1.355,,,,...,,,,,,,2.350,,5.249,
4,A4GALT,ENSG00000128274,ENST00000401850,ENSP00000384794,1.625,1.567,,,,,...,4.731,4.508,,,,4.0613,4.832,,,4.2430
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11570,ZXDB,ENSG00000198455,ENST00000374888,ENSP00000364023,,,,,,,...,,,,,,,,,4.681,
11571,ZYG11B,ENSG00000162378,ENST00000294353,ENSP00000294353,1.930,1.589,1.995,1.627,1.531,2.082,...,4.962,4.987,5.076,4.827,4.255,4.0412,5.389,4.250,4.439,4.1460
11572,ZYX,ENSG00000159840,ENST00000322764,ENSP00000324422,2.414,2.978,2.349,2.257,2.572,3.175,...,6.268,5.564,5.708,6.284,6.159,5.8846,5.582,5.598,5.968,5.3358
11573,ZZEF1,ENSG00000074755,ENST00000381638,ENSP00000371051,1.851,1.904,1.866,2.140,2.175,1.689,...,5.540,5.181,5.303,5.038,5.110,5.0834,5.047,5.038,5.130,5.0619


In [4]:
# looking at the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11575 entries, 0 to 11574
Data columns (total 91 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   GeneName                11575 non-null  object
 1   EnsemblGeneID           11575 non-null  object
 2   EnsemblTranscriptID     11575 non-null  object
 3   EnsemblProteinID        11575 non-null  object
 4   Adrenal_mRNA            11575 non-null  object
 5   Appendices_mRNA         11575 non-null  object
 6   Brain_mRNA              11575 non-null  object
 7   Colon_mRNA              11575 non-null  object
 8   Duodenum_mRNA           11575 non-null  object
 9   Endometrium_mRNA        11575 non-null  object
 10  Esophagus_mRNA          11575 non-null  object
 11  Fallopiantube_mRNA      11575 non-null  object
 12  Fat_mRNA                11575 non-null  object
 13  Gallbladder_mRNA        11575 non-null  object
 14  Heart_mRNA              11575 non-null  object
 15  Ki

In [5]:
df.describe()

Unnamed: 0,GeneName,EnsemblGeneID,EnsemblTranscriptID,EnsemblProteinID,Adrenal_mRNA,Appendices_mRNA,Brain_mRNA,Colon_mRNA,Duodenum_mRNA,Endometrium_mRNA,...,Rectum_PTR,Salivarygland_PTR,Smallintestine_PTR,Smoothmuscle_PTR,Spleen_PTR,Stomach_PTR,Testis_PTR,Thyroid_PTR,Tonsil_PTR,Urinarybladder_PTR
count,11575,11575,11575,11575,11575.0,11575.0,11575.0,11575.0,11575.0,11575.0,...,11575.0,11575.0,11575.0,11575.0,11575.0,11575.0,11575.0,11575.0,11575.0,11575.0
unique,11575,11575,11575,11575,2132.0,2049.0,2020.0,2143.0,2183.0,2043.0,...,3175.0,3297.0,3117.0,3248.0,3109.0,7372.0,3299.0,3238.0,3102.0,6971.0
top,A1BG,ENSG00000121410,ENST00000263100,ENSP00000263100,,,,,,,...,,,,,,,,,,
freq,1,1,1,1,3162.0,4011.0,3079.0,3396.0,2971.0,3395.0,...,3471.0,3643.0,2981.0,3632.0,4022.0,3124.0,2938.0,3962.0,3603.0,3697.0


In each tissue column, roughly 3000 to 4000 of the 11575 values are NA.
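
The NA count per column can also be checked directly. The following is a minimal sketch on made-up toy data, assuming the missing values were read as strings (the `object` dtype reported by `df.info()` hints at this, but the exact encoding in the file is not shown here). The names `toy` and `count_missing` are hypothetical and not part of the notebook.

```python
import pandas as pd


def count_missing(frame, columns):
    """Count, per column, the entries that cannot be parsed as numbers."""
    # anything that is not a number (e.g. the string 'NA') becomes NaN
    numeric = frame[columns].apply(pd.to_numeric, errors="coerce")
    return numeric.isna().sum()


# toy stand-in for the real table; tissue name and values are invented
toy = pd.DataFrame({
    "GeneName": ["G1", "G2", "G3"],
    "Liver_mRNA": ["1.5", "NA", "2.0"],
    "Liver_PTR": ["NA", "NA", "5.1"],
})

value_cols = [c for c in toy.columns if c.endswith(("_mRNA", "_PTR"))]
missing = count_missing(toy, value_cols)
print(missing)
```

Applied to the real dataframe, the same call would take `df` and its `_mRNA`, `_PTR` and protein columns instead of `toy`.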

## Extracting the relevant columns

Only the _EnsemblTranscriptID_ and _PTR_ values per tissue are necessary for training the network.

In [6]:
df2 = df[['EnsemblTranscriptID'] + [ col for col in df.columns if col.endswith('_PTR') ]].copy()
df2

Unnamed: 0,EnsemblTranscriptID,Adrenal_PTR,Appendices_PTR,Brain_PTR,Colon_PTR,Duodenum_PTR,Endometrium_PTR,Esophagus_PTR,Fallopiantube_PTR,Fat_PTR,...,Rectum_PTR,Salivarygland_PTR,Smallintestine_PTR,Smoothmuscle_PTR,Spleen_PTR,Stomach_PTR,Testis_PTR,Thyroid_PTR,Tonsil_PTR,Urinarybladder_PTR
0,ENST00000263100,,8.277,,,,,,7.841,,...,,7.718,,,7.313,,,,,
1,ENST00000373993,,,,5.135,5.371,,,,,...,5.147,,5.202,,,5.8143,,,,
2,ENST00000318602,6.290,6.328,5.948,5.811,6.068,5.383,5.881,6.119,6.410,...,6.081,5.726,5.699,4.997,5.136,6.5349,5.820,6.060,5.675,5.8286
3,ENST00000299698,,,3.995,,,,4.129,,,...,,,,,,,2.350,,5.249,
4,ENST00000401850,3.843,4.601,,,,,4.013,3.683,,...,4.731,4.508,,,,4.0613,4.832,,,4.2430
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11570,ENST00000374888,,,,,,,,,,...,,,,,,,,,4.681,
11571,ENST00000294353,4.461,5.013,5.047,4.566,5.184,4.826,5.102,4.670,5.756,...,4.962,4.987,5.076,4.827,4.255,4.0412,5.389,4.250,4.439,4.1460
11572,ENST00000322764,5.664,5.524,5.478,5.915,5.811,5.817,5.943,5.509,4.931,...,6.268,5.564,5.708,6.284,6.159,5.8846,5.582,5.598,5.968,5.3358
11573,ENST00000381638,5.112,4.918,5.139,5.190,5.442,5.602,4.715,4.956,5.033,...,5.540,5.181,5.303,5.038,5.110,5.0834,5.047,5.038,5.130,5.0619


Next, the transcript IDs are cross-referenced with the BED and Fasta files from the GENCODE data set (release 43).

The paths set below expect the GENCODE repo to be checked out at the given location relative to this notebook.

In [7]:
# raw data paths
gencode_path = '../../GENCODE43/protein_coding/'
bed = Path(gencode_path) / 'BED6__protein_coding_strict/'
fa = Path(gencode_path) / 'FA_protein_coding_strict_mRNA/'

# file names look like this
# for the BED file : ENST00000370801.8.bed
# for the Fasta file : ENST00000370801.8:0-6412.fasta
# .8 is the version suffix of the Ensembl transcript ID
# :0-6412 is the nucleotide range (start-end), i.e. the sequence length

# count of processed transcript IDs
count_all = 0
# success count
count_found = 0
# multiple files found for transcript
count_multi = 0

# extend the dataframe
# number of files found
df2['bed_files'] = 0
df2['fa_files'] = 0
# file path and name
df2['bed'] = ''
df2['fa'] = ''

# checking if all the transcript Fasta and BED files per transcript exist
for tid in df2['EnsemblTranscriptID']:
    # increase overall count
    count_all += 1

    # list and count files
    bed_file_list = list(bed.glob(tid + '*.bed'))
    bed_file_count = len(bed_file_list)
    fa_file_list = list(fa.glob(tid + '*.fasta'))
    fa_file_count = len(fa_file_list)

    # update dataframe
    df2.loc[ df2['EnsemblTranscriptID'] == tid, 'bed_files'] = bed_file_count
    df2.loc[ df2['EnsemblTranscriptID'] == tid, 'fa_files'] = fa_file_count

    # check BED and Fasta file count
    if bed_file_count == 1 and fa_file_count == 1:
        # exactly one BED and FA file
        
        # increase hit count
        count_found += 1
        
        # update file name information
        df2.loc[ df2['EnsemblTranscriptID'] == tid, 'bed'] = str(bed_file_list[0])
        df2.loc[ df2['EnsemblTranscriptID'] == tid, 'fa'] = str(fa_file_list[0])
    elif bed_file_count == 2 and fa_file_count == 2:
        # special case where there are duplicate files
        print(tid, 'more than one BED/Fasta file present. selecting')

        # increase hit count
        count_found += 1
        count_multi += 1

        # find correct BED file and update table
        for f in bed_file_list:
            temp_bed_file = str(f)
            if re.search(r'.*ENST\d+\.\d+\.bed', temp_bed_file):
                df2.loc[ df2['EnsemblTranscriptID'] == tid, 'bed'] = temp_bed_file
                print('   ', temp_bed_file)

        # find correct Fasta file and update table
        for f in fa_file_list:
            temp_fa_file = str(f)
            if re.search(r'.*ENST\d+\.\d+:\d+-\d+\.fasta', temp_fa_file):
                df2.loc[ df2['EnsemblTranscriptID'] == tid, 'fa'] = temp_fa_file
                print('   ', temp_fa_file)

        # update file count in table
        df2.loc[ df2['EnsemblTranscriptID'] == tid, 'bed_files'] = 1
        df2.loc[ df2['EnsemblTranscriptID'] == tid, 'fa_files'] = 1
    else:
        # everything else ends up here
        print(tid, 'bed count:', bed_file_count, 'fa count:', fa_file_count, 'bed files:', bed_file_list, 'fa files:', fa_file_list)

print('searched for', count_all, 'and found', count_found)
print('found multiple files for', count_multi, 'transcripts')
print('missing or otherwise off:', count_all - count_found)

ENST00000435683 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000263817 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000370449 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000376887 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000260645 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000622407 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000331789 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000366779 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000355413 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000373176 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST00000313871 more than one BED/Fasta file present. selecting
    ../../GENCODE43/protein_coding/BED6__protein_coding_strict/ENST00000313871.9.bed
    ../../GENCODE43/protein_coding/FA_protein_coding_strict_mRNA/ENST00000313871.9:0-3199.fasta
ENST00000564546 bed count: 0 fa count: 0 bed files: [] fa files: []
ENST000
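
The file name conventions described in the cell comments can be captured in two anchored patterns. The following is a minimal sketch that groups a directory listing by transcript ID and keeps only names in the expected form, so that a differently named duplicate drops out. The helper `index_by_transcript`, the pattern names, and the `_copy` duplicate name are hypothetical; the actual names of the duplicate files are not shown in this notebook.

```python
import re
from collections import defaultdict

# expected names, as described in the cell comments:
#   ENST00000370801.8.bed
#   ENST00000370801.8:0-6412.fasta
BED_NAME = re.compile(r"^(ENST\d+)\.(\d+)\.bed$")
FASTA_NAME = re.compile(r"^(ENST\d+)\.(\d+):(\d+)-(\d+)\.fasta$")


def index_by_transcript(file_names, pattern):
    """Map transcript ID -> list of file names that match `pattern` fully."""
    index = defaultdict(list)
    for name in file_names:
        match = pattern.match(name)
        if match:
            index[match.group(1)].append(name)
    return dict(index)


# hypothetical listing; the '_copy' entry stands in for a duplicate file
bed_names = [
    "ENST00000313871.9.bed",
    "ENST00000313871.9_copy.bed",
    "ENST00000370801.8.bed",
]
bed_index = index_by_transcript(bed_names, BED_NAME)
print(bed_index)
```

With the real directories, the listing could come from `[p.name for p in bed.glob('*.bed')]`. Listing each directory once and then looking up IDs in the dictionary avoids one `glob` call per transcript.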

In [9]:
# temporary backup of df2
# for restoration if the processing below fails (again)
df2_backup = df2.copy()

In [28]:
# ONLY RUN THIS FOR DEBUGGING
df2 = df2_backup.copy()

In [29]:
# entries with two transcript files per entry
# the GENCODE data set contains a couple of transcript files under multiple different names
# the file content is exactly the same
# this has been corrected in the previous cell, so this sanity check should reveal 0 rows
df2.loc[ df2['bed_files'] == 2 ]

Unnamed: 0,EnsemblTranscriptID,Adrenal_PTR,Appendices_PTR,Brain_PTR,Colon_PTR,Duodenum_PTR,Endometrium_PTR,Esophagus_PTR,Fallopiantube_PTR,Fat_PTR,...,Spleen_PTR,Stomach_PTR,Testis_PTR,Thyroid_PTR,Tonsil_PTR,Urinarybladder_PTR,bed_files,fa_files,bed,fa


In [30]:
# check again the number of missing files (with a count of 0 in the bed_files column)
df2.loc[ df2['bed_files'] == 0, 'EnsemblTranscriptID' ].count()

294

In [32]:
df2[ df2['bed'] == '' ]['EnsemblTranscriptID' ].count()

294

In [33]:
# show some of the IDs with missing files
df2.loc[ df2['bed_files'] == 0, 'EnsemblTranscriptID' ]

28       ENST00000435683
33       ENST00000263817
39       ENST00000370449
41       ENST00000376887
54       ENST00000260645
              ...       
11323    ENST00000545588
11411    ENST00000309776
11441    ENST00000534834
11513    ENST00000309495
11559    ENST00000543588
Name: EnsemblTranscriptID, Length: 294, dtype: object

Sampling the IDs for which files are missing showed that the transcript is either deprecated or not the canonical form in the GENCODE data set (release 43). For some IDs the transcripts either do not translate or a different gene translates to the associated protein.

This is to be expected: the Eraslan et al. study dates from 2019, and the underlying data in the Ensembl database has since been updated with newer research results.

The following cells query the Ensembl website directly for the transcript IDs in question, as this is the fastest way to resolve the issue.

In [34]:
def find_new_transcript(content):
    """Extract the first listed transcript ID from an Ensembl HTML document.

    Keyword Arguments:
    content -- content of a Python requests response (the HTML document)

    Returns:
    A string either empty or containing the transcript ID.
    """
    # parse the HTML document
    soup = BeautifulSoup(content, 'html.parser')
    # check if a specific table exists
    table = soup.find(id='transcripts_table')
    if table is not None:
        # if so extract the transcript ID
        href = table.tbody.td.a.attrs['href']
        transcript = re.sub(r'.*(ENST0\d+)', r'\1', href)
        print('   Current canonical transcript is', transcript)
    else:
        # if not return an empty string
        transcript = ''
        print('   No current transcript found!')

    return transcript

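def check_files_and_update_df(old_tid, new_tid):
    """Cross reference the new transcript ID with files in the GENCODE data set
    and update the dataframe row of the old transcript ID
    (bad hack as it uses variables globally defined at the beginning of this notebook!)

    Keyword Arguments:
    old_tid -- the transcript ID string currently stored in the dataframe
    new_tid -- the resolved transcript ID string to look up and store instead

    Returns:
    True if exactly one BED and one Fasta file were found for new_tid and False otherwise
    """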
    # search and count files with a given name
    bed_file_list = list(bed.glob(new_tid + '*.bed'))
    bed_files = len(bed_file_list)
    fa_file_list = list(fa.glob(new_tid + '*.fasta'))
    fa_files = len(fa_file_list)

    # check how many files were found
    if bed_files == 1 and fa_files == 1:
        # if it's 1 everything is perfect
        print('   FA and BED files found. Updating dataframe with current information')
        # update dataframe
        df2.loc[ df2['EnsemblTranscriptID'] == old_tid, 'bed_files' ] = bed_files
        df2.loc[ df2['EnsemblTranscriptID'] == old_tid, 'fa_files' ] = fa_files
        df2.loc[ df2['EnsemblTranscriptID'] == old_tid, 'bed'] = str(bed_file_list[0])
        df2.loc[ df2['EnsemblTranscriptID'] == old_tid, 'fa'] = str(fa_file_list[0])
        df2.loc[ df2['EnsemblTranscriptID'] == old_tid, 'EnsemblTranscriptID' ] = new_tid
        
        return True
    else:
        # if there are none or more than one, manual processing is needed
        print('   FA and BED file count invalid. File lists', bed_file_list, fa_file_list)

    return False
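
One detail of the `re.sub` calls above: when the pattern does not match, `re.sub` returns the input unchanged, and any text after the ID is kept. The following is a minimal sketch of a stricter alternative using `re.search`, assuming that Ensembl transcript IDs consist of `ENST` followed by 11 digits (as all IDs in this notebook do). The helper name and the href values are hypothetical.

```python
import re

# assumption: a transcript ID is 'ENST' followed by exactly 11 digits
TRANSCRIPT_ID = re.compile(r"ENST\d{11}")


def extract_transcript_id(href):
    """Return the first transcript ID found in `href`, or '' if there is none."""
    match = TRANSCRIPT_ID.search(href)
    return match.group(0) if match else ""


# hypothetical href values for illustration
with_id = extract_transcript_id("/Homo_sapiens/Transcript/Summary?t=ENST00000650372;g=ENSG00000073734")
without_id = extract_transcript_id("/Homo_sapiens/Info/Index")
print(repr(with_id), repr(without_id))
```

Returning an empty string for "not found" matches the convention `find_new_transcript` already uses.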

In [36]:
# loop over all transcript IDs without a transcript file associated with it

for tid in df2.loc[ df2['bed_files'] == 0, 'EnsemblTranscriptID' ]: #.head(2):
    print('processing', tid)
    # Ensembl URL for resolving the given transcript ID
    url = 'https://www.ensembl.org/Homo_sapiens/Transcript/Idhistory?t=' + tid
    # retrieve the document
    r = requests.get(url)
    # parse the document
    soup = BeautifulSoup(r.content, 'html.parser')
    # check for specific strings in the page
    if re.search(r'This transcript is not in the current gene set', soup.get_text()):
        table_content = soup.tbody
        # loop over all listed genes
        for i in table_content.find_all_next('td'):
            # search for columns with a gene link
            if i.find('a') and re.search(r'.*Gene.*ENSG\d+', i.a.attrs['href']):
                # extract the gene ID
                href = i.a.attrs['href']
                gene = re.sub(r'.*(ENSG0\d+)', r'\1', href)
                print('   Transcript is deprecated, resolved gene is', gene)

                # Ensembl URL to resolve a gene ID
                url = 'https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=' + gene
                r = requests.get(url)
                new_tid = find_new_transcript(r.content)
                if check_files_and_update_df(tid, new_tid): break
    
    elif re.search(r'Show transcript table', soup.get_text()):
        # the transcript is not the current canonical version and needs updating
        new_tid = find_new_transcript(r.content)
        check_files_and_update_df(tid, new_tid)
    else:
        print('   Some other error occurred for this transcript')

processing ENST00000263817
   Transcript is deprecated, resolved gene is ENSG00000073734
   Current canonical transcript is ENST00000650372
   FA and BED files found. Updating dataframe with current information
processing ENST00000370449
   Transcript is deprecated, resolved gene is ENSG00000023839
   Current canonical transcript is ENST00000647814
   FA and BED files found. Updating dataframe with current information
processing ENST00000376887
   Transcript is deprecated, resolved gene is ENSG00000125257
   Current canonical transcript is ENST00000645237
   FA and BED files found. Updating dataframe with current information
processing ENST00000260645
   Transcript is deprecated, resolved gene is ENSG00000138075
   Current canonical transcript is ENST00000405322
   FA and BED files found. Updating dataframe with current information
processing ENST00000622407
   Transcript is deprecated, resolved gene is ENSG00000119673
   Current canonical transcript is ENST00000238651
   FA and BED fi


   Transcript is deprecated, resolved gene is ENSG00000275896
   Current canonical transcript is ENST00000539842
   FA and BED files found. Updating dataframe with current information
processing ENST00000606149
   Transcript is deprecated, resolved gene is ENSG00000100519
   Current canonical transcript is ENST00000445930
   FA and BED files found. Updating dataframe with current information
processing ENST00000620216
   Transcript is deprecated, resolved gene is ENSG00000099341
   Current canonical transcript is ENST00000215071
   FA and BED files found. Updating dataframe with current information
processing ENST00000619580
   Transcript is deprecated, resolved gene is ENSG00000159335
   Current canonical transcript is ENST00000309083
   FA and BED files found. Updating dataframe with current information
processing ENST00000306726
   Transcript is deprecated, resolved gene is ENSG00000169410
   Current canonical transcript is ENST00000618819
   FA and BED files found. Updating datafra

In [40]:
# check if there are still entries with unresolved transcript files
# spoiler, there are 10
df2.loc[ df2['bed_files'] != 1, 'EnsemblTranscriptID' ].count()

10

In [39]:
# show the 10 IDs with missing transcript files
df2.loc[ df2['bed_files'] == 0 ]

Unnamed: 0,EnsemblTranscriptID,Adrenal_PTR,Appendices_PTR,Brain_PTR,Colon_PTR,Duodenum_PTR,Endometrium_PTR,Esophagus_PTR,Fallopiantube_PTR,Fat_PTR,...,Spleen_PTR,Stomach_PTR,Testis_PTR,Thyroid_PTR,Tonsil_PTR,Urinarybladder_PTR,bed_files,fa_files,bed,fa
220,ENST00000366779,5.703,5.19,5.879,5.753,5.569,5.699,5.663,5.875,5.788,...,5.582,5.4896,5.586,6.075,5.588,5.0744,0,0,,
1377,ENST00000382387,,,,,,,,,,...,,,,,,,0,0,,
1532,ENST00000624406,,,7.201,,,,,,,...,,,,6.864,,,0,0,,
1940,ENST00000619537,,,,,,,,,,...,,,,,,,0,0,,
1941,ENST00000620528,,,,,6.221,,,,,...,,,,,,,0,0,,
3679,ENST00000369384,4.442,,,,,,,,4.537,...,,,,,,5.2395,0,0,,
4224,ENST00000368232,4.089,3.836,3.029,,3.773,3.914,4.63,4.463,,...,3.664,4.294,3.668,4.027,3.985,2.9934,0,0,,
7810,ENST00000357304,4.154,3.726,3.551,4.201,3.729,3.665,3.392,4.057,3.011,...,3.515,3.7242,3.83,3.025,3.805,3.5673,0,0,,
9042,ENST00000534261,,5.356,,,,,,,5.152,...,5.351,,,,,5.5499,0,0,,
10776,ENST00000610664,,,,,,,,,6.141,...,7.237,,,,,,0,0,,


These transcript IDs presumably resolved in 2019, when the paper using this data was published, but the current database no longer resolves them (due to more up-to-date research results).

Being brave, these 10 entries will be deleted.

In [None]:
# remove the 10 missing entries
df2.drop(df2[df2['bed_files'] == 0].index, inplace=True)

In [None]:
# verify the entries are gone from the dataframe
df2.loc[ df2['bed_files'] == 0 ]
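
The drop-and-verify step above can also be written as a boolean mask with explicit assertions, which fails loudly if rows are left over. This is a minimal sketch on made-up toy data; `toy`, `unresolved` and `cleaned` are hypothetical names.

```python
import pandas as pd

# toy stand-in for df2; only the columns needed for the check
toy = pd.DataFrame({
    "EnsemblTranscriptID": ["T1", "T2", "T3", "T4"],
    "bed_files": [1, 0, 1, 0],
})

# rows without an associated BED file
unresolved = toy["bed_files"] == 0
n_unresolved = int(unresolved.sum())

# keep only the resolved rows and renumber the index
cleaned = toy.loc[~unresolved].reset_index(drop=True)

# sanity checks: the right number of rows is gone, and none are unresolved
assert len(cleaned) == len(toy) - n_unresolved
assert (cleaned["bed_files"] == 1).all()
print(cleaned)
```

Unlike the in-place `drop` used in the notebook, this variant returns a new dataframe and resets the index. Whether the original row labels should be kept is a design choice.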

In [23]:
# show the rest again
df2

Unnamed: 0,EnsemblTranscriptID,Adrenal_PTR,Appendices_PTR,Brain_PTR,Colon_PTR,Duodenum_PTR,Endometrium_PTR,Esophagus_PTR,Fallopiantube_PTR,Fat_PTR,...,Spleen_PTR,Stomach_PTR,Testis_PTR,Thyroid_PTR,Tonsil_PTR,Urinarybladder_PTR,bed_files,fa_files,bed,fa
0,ENST00000263100,,8.277,,,,,,7.841,,...,7.313,,,,,,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
1,ENST00000373993,,,,5.135,5.371,,,,,...,,5.8143,,,,,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
2,ENST00000318602,6.290,6.328,5.948,5.811,6.068,5.383,5.881,6.119,6.410,...,5.136,6.5349,5.820,6.060,5.675,5.8286,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
3,ENST00000299698,,,3.995,,,,4.129,,,...,,,2.350,,5.249,,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
4,ENST00000401850,3.843,4.601,,,,,4.013,3.683,,...,,4.0613,4.832,,,4.2430,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11570,ENST00000374888,,,,,,,,,,...,,,,,4.681,,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
11571,ENST00000294353,4.461,5.013,5.047,4.566,5.184,4.826,5.102,4.670,5.756,...,4.255,4.0412,5.389,4.250,4.439,4.1460,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
11572,ENST00000322764,5.664,5.524,5.478,5.915,5.811,5.817,5.943,5.509,4.931,...,6.159,5.8846,5.582,5.598,5.968,5.3358,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...
11573,ENST00000381638,5.112,4.918,5.139,5.190,5.442,5.602,4.715,4.956,5.033,...,5.110,5.0834,5.047,5.038,5.130,5.0619,1,1,../../GENCODE43/protein_coding/BED6__protein_c...,../../GENCODE43/protein_coding/FA_protein_codi...


In [None]:
# write the current preprocessed table to file
datafile = '../data/preproc_stage1.csv'
df2.to_csv(datafile, index=False)
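
A quick way to confirm the export is to read the file back and compare it with the dataframe in memory. This is a minimal sketch on toy data that writes to a temporary directory, so it does not touch the real output file. The file name `preproc_toy.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy stand-in for df2 with one missing PTR value
toy = pd.DataFrame({
    "EnsemblTranscriptID": ["ENST00000263100", "ENST00000373993"],
    "Spleen_PTR": [7.313, None],
    "bed_files": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "preproc_toy.csv"
    # index=False as in the cell above, so no extra index column is written
    toy.to_csv(out, index=False)
    restored = pd.read_csv(out)

same_shape = restored.shape == toy.shape
same_ids = restored["EnsemblTranscriptID"].tolist() == toy["EnsemblTranscriptID"].tolist()
print(same_shape, same_ids)
```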