# GSD: Mine Stenger et al 2020 supplemental Excel file for high confidence pet genes


Reference for the supplemental data table:  
- [Systematic analysis of nuclear gene function in respiratory growth and expression of the mitochondrial genome in S. cerevisiae. Stenger M, Le DT, Klecker T, Westermann B. Microb Cell. 2020 Jun 30;7(9):234-249. doi: 10.15698/mic2020.09.729. PMID: 32904421](https://pubmed.ncbi.nlm.nih.gov/32904421/)

To fetch the Excel file, I'm actually going to use the PMC posted version from [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7453639/).

(For now the plan is just to use data from Table S1 of [Stenger et al., 2020](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7453639/bin/mic-07-234-s02.xlsx).)

The generated list of gene identifiers will then be entered on [the 'Lists' page at YeastMine](https://yeastmine.yeastgenome.org/yeastmine/bag.do) to make a gene list in my YeastMine account.

-----

## Preparation


#### Fetch the Excel file, convert it, and tidy the data into a usable form.


The next cell gets the supplemental data table as an Excel file.

In [1]:
!curl -OL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7453639/bin/mic-07-234-s02.xlsx

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  851k  100  851k    0     0  1483k      0 --:--:-- --:--:-- --:--:-- 1481k
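
Because `curl` is run here without `--fail`, a failed request can still leave an HTML error page saved under the `.xlsx` name. The check below is a minimal sketch of how to confirm the download before converting it. It assumes that a valid `.xlsx` file is a ZIP archive containing a `[Content_Types].xml` member; the helper name `looks_like_xlsx` is hypothetical and not part of the original notebook.

```python
import zipfile

def looks_like_xlsx(path):
    """Return True if `path` is a ZIP archive holding '[Content_Types].xml'.

    Hypothetical helper: a quick structural check, not a full validation
    of the spreadsheet contents.
    """
    if not zipfile.is_zipfile(path):
        # Covers missing files and non-ZIP content such as an HTML error page.
        return False
    with zipfile.ZipFile(path) as zf:
        return "[Content_Types].xml" in zf.namelist()
```

In the notebook, `looks_like_xlsx('mic-07-234-s02.xlsx')` would be expected to return `True` after a successful download.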


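Then a package needed for the conversion from Excel file to Pandas dataframe is installed here. (Pandas reads `.xlsx` files with `openpyxl`; `xlrd` version 2.0 and later only supports the older `.xls` format.)

In [2]:
!pip install openpyxl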



Convert the Excel file to a dataframe.

In [3]:
import pandas as pd
df = pd.read_excel('mic-07-234-s02.xlsx', sheet_name=0, header=0) 

Look at the top rows to see if that worked.

In [4]:
df.head()

| | ORF | standard name | description | Morgenstern et al. (2017) | Dimmer et al. (2002) | Luban et al. (2005) | Merz & Westermann (2009) | this study | pet score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | YAL001C | TFC3 | Subunit of RNA polymerase III transcription in... | 0 | nd | nd | nd | nd | nd |
| 1 | YAL002W | VPS8 | Membrane-binding component of the CORVET compl... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | YAL003W | EFB1 | Translation elongation factor 1 beta; stimulat... | 0 | nd | nd | nd | nd | nd |
| 3 | YAL004W | | Dubious open reading frame; unlikely to encode... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | YAL005C | SSA1 | ATPase involved in protein folding and NLS-dir... | 0 | 0 | 0 | 0 | 0 | 0 |


How many genes in total?

In [5]:
len(df)

6634

That number should read 6634 if all is well. This means there is data on 6634 genes in the converted Pandas dataframe.  
(Note that when you open the source file in Excel, the header labels are on row 1 and the last gene is on row 6635, so 6634 is the total number of genes listed in the Excel file.)

To filter on the values in the 'pet score' column, we first need to convert that column to numeric, changing non-numeric values (such as 'nd') to `NaN`.

In [6]:
df['pet score'] = pd.to_numeric(df['pet score'], errors='coerce')

Verify that the 'pet score' column now holds numeric values by looking at the data type of each column. (If this line had been run prior to the `pd.to_numeric` step, it would have read 'object' for that column's data type.)

In [7]:
df.dtypes

ORF                           object
standard name                 object
description                   object
Morgenstern et al. (2017)     object
Dimmer et al. (2002)          object
Luban et al. (2005)           object
Merz & Westermann (2009)      object
this study                    object
pet score                    float64
dtype: object
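
As a small illustration of what the coercion above does, the sketch below runs `pd.to_numeric` with `errors='coerce'` on a toy column. The values are made up for the example and are not taken from Table S1; only the 'nd' placeholder mirrors what appears in the real data.

```python
import pandas as pd

# Toy stand-in for the 'pet score' column: numbers mixed with the 'nd' placeholder.
toy = pd.Series([0, 0.75, 'nd', 1], name='pet score')

# Anything that cannot be parsed as a number becomes NaN.
coerced = pd.to_numeric(toy, errors='coerce')

# Count how many entries were turned into NaN by the coercion.
n_missing = int(coerced.isna().sum())
print(coerced.dtype, n_missing)
```

Counting the `NaN` values this way on the real column would show how many genes had no numeric pet score.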

## Collect the systematic gene identifiers for the high confidence pet genes

>"We propose that genes with a pet score higher than 0.5 should be regarded as high confidence pet genes. This definition requires that a high confidence pet mutant has to repeatedly show a respiratory-deficient phenotype, but it does not exclude mutants that yielded one false-negative result. According to this definition there are 254 high confidence pet genes in yeast, 79% of which encode mitochondrial proteins (Tables 1 and S2)." - from the Stenger et al., 2020 paper, pg. 236, the first full paragraph of the top right column.

So subsetting the rows of Table S1 (the Excel file) to those with a 'pet score' > 0.5 should yield the 254 genes listed in Table S2. Table S2 itself is embedded in a PDF with other supplemental tables from the paper and is not easily mined cleanly by placing the PDF contents in a text file.

In [8]:
high_confidence_df = df[df["pet score"] > 0.5]

If that subsetting worked, `high_confidence_df` should have 254 rows, according to Stenger et al., 2020.  
Checking that:

In [9]:
len(high_confidence_df)

254
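
The comparison used for the subset is strict, matching the paper's "higher than 0.5" wording, and rows whose score was coerced to `NaN` never satisfy it. A minimal sketch on toy data (the identifiers and scores below are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ORF': ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'],  # made-up identifiers
    'pet score': [0.5, 0.75, np.nan, 1.0],
})

# Strictly greater than 0.5: a score of exactly 0.5 is excluded,
# and NaN compares as False, so that row is dropped as well.
kept = toy[toy['pet score'] > 0.5]
kept_ids = kept['ORF'].tolist()
print(kept_ids)
```

Only `GENE_B` and `GENE_D` remain, so no separate step is needed to discard the 'nd' rows before filtering.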

Subsetting to the high confidence pet genes worked, because we see 254 genes as expected. Now we need the systematic identifiers to take to YeastMine to make a list.

We'll save the list of systematic gene names, taken from the 'ORF' column, as a file. (I'm going to rename the column to a clearer name, since it will be the header of the produced file.)

In [10]:
high_confidence_df = high_confidence_df.rename(columns={'ORF': 'systematic_gene_name'})
high_confidence_df['systematic_gene_name'].to_csv('high_confidence_pet_genes.tsv', sep='\t', index=False)
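
Since the saved file starts with a header line, it can help to turn it into a plain block of identifiers before pasting into the YeastMine 'Lists' page. The sketch below assumes the list form accepts identifiers separated by newlines; the helper name `ids_for_pasting` is hypothetical.

```python
import pandas as pd

def ids_for_pasting(tsv_path, column='systematic_gene_name'):
    """Read the saved TSV and return the identifiers one per line, without the header.

    Hypothetical helper; `column` defaults to the header written by the cell above.
    """
    ids = pd.read_csv(tsv_path, sep='\t')[column].dropna().astype(str)
    return '\n'.join(ids.str.strip())
```

In the notebook, `print(ids_for_pasting('high_confidence_pet_genes.tsv'))` would show the identifiers ready to copy.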

In [None]:
import time

def executeSomething():
    # Placeholder task: print a dot, then wait before the next round.
    print('.')
    time.sleep(480)  # 8 minutes x 60 seconds

# Repeats until the kernel is interrupted.
while True:
    executeSomething()

.
.
