# How to use `extract_gene_list_from_screen_copied_MitCOM_data.py` to collect standard gene names from the MitCOM portal

This Jupyter notebook describes how to use [the script `extract_gene_list_from_screen_copied_MitCOM_data.py`](https://github.com/fomightez/proteomicswork/blob/main/mitcom-utiltities/extract_gene_list_from_screen_copied_MitCOM_data.py) to extract the standard gene name identifiers for the proteins listed in the data presented at the [MitCOM: Yeast mitochondria complexome portal](https://www.complexomics.org/datasets/mitcom) associated with [Schulte et al 2023 'Mitochondrial complexome reveals quality-control pathways of protein import'](https://www.nature.com/articles/s41586-022-05641-w#Sec29). With the names extracted, you can use them in analyses or to access related information at other sites, such as YeastMine.

-------

## Preparation: Getting data from the portal 

Go to the [MitCOM: Yeast mitochondria complexome portal](https://www.complexomics.org/datasets/mitcom) and choose one of the datasets to view.

Now select a protein. For this example, we'll examine the proteins listed as correlating with it. (This example uses 'ACEB_YEAST/ICL2'.) The manuscript says, in the figure legend for 'Extended Data Fig. 3 Using the Profile Viewer for interactive data exploration':

>"Simultaneously, a list of 20 proteins exhibiting the highest Pearson correlation with the selected target peak segment is displayed."

Click to the left of the square to the left of the first protein listed, skipping the header of the list. Then, while holding down the left mouse button, swipe down and to the right. Only the first two columns will look like they are selected; however, that isn't the case, as you'll see. Copy the selection and paste it into a text file.

You'll have something like this in your text file:

```text

GCSP_YEAST

GCV2

Glycine dehydrogenase (decarboxylating), mitochondrial (EC 1.4.4.2) (Glycine cleavage system P protein) (Glycine decarboxylase) (Glycine decarboxylase complex subunit P) (Glycine dehydrogenase (aminomethyl-transferring))	GCV2 GSD2 YMR189W YM9646.01	114451	178	0.907
	
ESP1_YEAST

ESP1

Separin (EC 3.4.22.49) (Separase)	ESP1 YGR098C	187447	1	0.863
	
MYO4_YEAST

MYO4

Myosin-4 (SWI5-dependent HO expression protein 1)	MYO4 SHE1 YAL029C FUN22	169344	1	0.781
	
KOG1_YEAST

KOG1

Target of rapamycin complex 1 subunit KOG1 (TORC1 subunit KOG1) (Kontroller of growth protein 1) (Local anesthetic-sensitive protein 24)	KOG1 LAS24 YHR186C H9998.14	177609	17	0.772
	
SLM1_YEAST

SLM1

Phosphatidylinositol 4,5-bisphosphate-binding protein SLM1 (Synthetic lethal with MSS4 protein 1) (TORC2 effector protein SLM1)	SLM1 LIT2 YIL105C	77995	15	0.740
	
ODO1_YEAST

KGD1

2-oxoglutarate dehydrogenase, mitochondrial (EC 1.2.4.2) (2-oxoglutarate dehydrogenase complex component E1) (OGDC-E1) (Alpha-ketoglutarate dehydrogenase)	KGD1 OGD1 YIL125W	114416	392	0.732
	
DPOG_YEAST

MIP1

DNA polymerase gamma (EC 2.7.7.7) (Mitochondrial DNA polymerase catalytic subunit)	MIP1 YOR330C	143502	18	0.714
	
C1TM_YEAST

MIS1

C-1-tetrahydrofolate synthase, mitochondrial (C1-THF synthase) [Includes: Methylenetetrahydrofolate dehydrogenase (EC 1.5.1.5); Methenyltetrahydrofolate cyclohydrolase (EC 3.5.4.9); Formyltetrahydrofolate synthetase (EC 6.3.4.3)]	MIS1 YBR084W YBR0751	106217	204	0.711
	
AFT2_YEAST

AFT2

Iron-regulated transcriptional activator AFT2 (Activator of iron transcription protein 2)	AFT2 YPL202C	47105	1	0.704
	
FOL1_YEAST

FOL1

Folic acid synthesis protein FOL1 [Includes: Dihydroneopterin aldolase (DHNA) (EC 4.1.2.25) (7,8-dihydroneopterin aldolase) (FASA) (FASB); 6-hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK) (EC 2.7.6.3) (2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase) (7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase) (PPPK) (FASC); Dihydropteroate synthase (DHPS) (EC 2.5.1.15) (Dihydropteroate pyrophosphorylase) (FASD)]	FOL1 YNL256W N0848	93120	8	0.697
	
POS5_YEAST

POS5

NADH kinase POS5, mitochondrial (EC 2.7.1.86)	POS5 YPL188W	46247	28	0.693
	
GLU2A_YEAST

ROT2

Glucosidase 2 subunit alpha (EC 3.2.1.207) (Alpha-glucosidase II subunit alpha) (Glucosidase II subunit alpha) (Reversal of TOR2 lethality protein 2)	ROT2 GLS2 YBR229C YBR1526	110266	15	0.679
	
RPOM_YEAST

RPO41

DNA-directed RNA polymerase, mitochondrial (EC 2.7.7.6)	RPO41 YFL036W	153047	98	0.677
	
ATM1_YEAST

ATM1

Iron-sulfur clusters transporter ATM1, mitochondrial	ATM1 MDY YMR301C YM9952.03C	77522	65	0.674
	
ISD11_YEAST

ISD11

Protein ISD11 (Iron-sulfur protein biogenesis, desulfurase-interacting protein 11)	ISD11 YER048W-A	11266	32	0.669
	
SOV1_YEAST

SOV1

Protein SOV1, mitochondrial	SOV1 YMR066W YM9916.05	104748	27	0.665
	
ALDH5_YEAST

ALD5

Aldehyde dehydrogenase 5, mitochondrial (EC 1.2.1.5)	ALD5 ALD3 ALDH5 YER073W	56693	21	0.657
	
NFS1_YEAST

NFS1

Cysteine desulfurase, mitochondrial (EC 2.8.1.7) (tRNA-splicing protein SPL1)	NFS1 SPL1 YCL017C	54467	104	0.652
	
SYC_YEAST

CRS1

Cysteine--tRNA ligase (EC 6.1.1.16) (Cysteinyl-tRNA synthetase) (CysRS)	YNL247W N0885	87530	23	0.636
```

(Note: depending on where you click on the first line, you may or may not have an empty line first. The script disregards that leading empty line, so it isn't an issue.)
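To make the layout of the pasted text concrete, here is a minimal sketch of how gene names could be pulled out of it. This is not the code in `extract_gene_list_from_screen_copied_MitCOM_data.py`; the function name `gene_names_from_pasted_text` and the pattern `ENTRY_NAME` are made up for illustration. It assumes each record starts with a line like `GCSP_YEAST` and that the next non-empty line is the gene name, as in the example above. The descriptions in the sample are shortened.

```python
import re

# Two records in the same layout as the pasted portal text (tabs written as \t).
# Note the leading empty line and the lone-tab line between records.
sample = (
    "\n"
    "GCSP_YEAST\n"
    "\n"
    "GCV2\n"
    "\n"
    "Glycine dehydrogenase (decarboxylating), mitochondrial\tGCV2 GSD2 YMR189W YM9646.01\t114451\t178\t0.907\n"
    "\t\n"
    "ODO1_YEAST\n"
    "\n"
    "KGD1\n"
    "\n"
    "2-oxoglutarate dehydrogenase, mitochondrial\tKGD1 OGD1 YIL125W\t114416\t392\t0.732\n"
)

# Hypothetical pattern for the first line of each record, e.g. 'GCSP_YEAST'.
ENTRY_NAME = re.compile(r"^[A-Z0-9]+_YEAST$")


def gene_names_from_pasted_text(text):
    """Return the gene name that follows each entry-name line."""
    # Drop empty and whitespace-only lines, including the lone-tab separators.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    names = []
    for i, line in enumerate(lines[:-1]):
        if ENTRY_NAME.match(line):
            # The next non-empty line after the entry name is the gene name.
            names.append(lines[i + 1])
    return names


print(gene_names_from_pasted_text(sample))
```

Because empty lines are dropped before anything else happens, a leading empty line in the pasted text makes no difference to this sketch either.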

Save that file where you want to run the script.

For the example here, running the next cell saves the example data to `data_to_extract.txt` for you.


In [1]:
s='''
GCSP_YEAST

GCV2

Glycine dehydrogenase (decarboxylating), mitochondrial (EC 1.4.4.2) (Glycine cleavage system P protein) (Glycine decarboxylase) (Glycine decarboxylase complex subunit P) (Glycine dehydrogenase (aminomethyl-transferring))	GCV2 GSD2 YMR189W YM9646.01	114451	178	0.907
	
ESP1_YEAST

ESP1

Separin (EC 3.4.22.49) (Separase)	ESP1 YGR098C	187447	1	0.863
	
MYO4_YEAST

MYO4

Myosin-4 (SWI5-dependent HO expression protein 1)	MYO4 SHE1 YAL029C FUN22	169344	1	0.781
	
KOG1_YEAST

KOG1

Target of rapamycin complex 1 subunit KOG1 (TORC1 subunit KOG1) (Kontroller of growth protein 1) (Local anesthetic-sensitive protein 24)	KOG1 LAS24 YHR186C H9998.14	177609	17	0.772
	
SLM1_YEAST

SLM1

Phosphatidylinositol 4,5-bisphosphate-binding protein SLM1 (Synthetic lethal with MSS4 protein 1) (TORC2 effector protein SLM1)	SLM1 LIT2 YIL105C	77995	15	0.740
	
ODO1_YEAST

KGD1

2-oxoglutarate dehydrogenase, mitochondrial (EC 1.2.4.2) (2-oxoglutarate dehydrogenase complex component E1) (OGDC-E1) (Alpha-ketoglutarate dehydrogenase)	KGD1 OGD1 YIL125W	114416	392	0.732
	
DPOG_YEAST

MIP1

DNA polymerase gamma (EC 2.7.7.7) (Mitochondrial DNA polymerase catalytic subunit)	MIP1 YOR330C	143502	18	0.714
	
C1TM_YEAST

MIS1

C-1-tetrahydrofolate synthase, mitochondrial (C1-THF synthase) [Includes: Methylenetetrahydrofolate dehydrogenase (EC 1.5.1.5); Methenyltetrahydrofolate cyclohydrolase (EC 3.5.4.9); Formyltetrahydrofolate synthetase (EC 6.3.4.3)]	MIS1 YBR084W YBR0751	106217	204	0.711
	
AFT2_YEAST

AFT2

Iron-regulated transcriptional activator AFT2 (Activator of iron transcription protein 2)	AFT2 YPL202C	47105	1	0.704
	
FOL1_YEAST

FOL1

Folic acid synthesis protein FOL1 [Includes: Dihydroneopterin aldolase (DHNA) (EC 4.1.2.25) (7,8-dihydroneopterin aldolase) (FASA) (FASB); 6-hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK) (EC 2.7.6.3) (2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase) (7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase) (PPPK) (FASC); Dihydropteroate synthase (DHPS) (EC 2.5.1.15) (Dihydropteroate pyrophosphorylase) (FASD)]	FOL1 YNL256W N0848	93120	8	0.697
	
POS5_YEAST

POS5

NADH kinase POS5, mitochondrial (EC 2.7.1.86)	POS5 YPL188W	46247	28	0.693
	
GLU2A_YEAST

ROT2

Glucosidase 2 subunit alpha (EC 3.2.1.207) (Alpha-glucosidase II subunit alpha) (Glucosidase II subunit alpha) (Reversal of TOR2 lethality protein 2)	ROT2 GLS2 YBR229C YBR1526	110266	15	0.679
	
RPOM_YEAST

RPO41

DNA-directed RNA polymerase, mitochondrial (EC 2.7.7.6)	RPO41 YFL036W	153047	98	0.677
	
ATM1_YEAST

ATM1

Iron-sulfur clusters transporter ATM1, mitochondrial	ATM1 MDY YMR301C YM9952.03C	77522	65	0.674
	
ISD11_YEAST

ISD11

Protein ISD11 (Iron-sulfur protein biogenesis, desulfurase-interacting protein 11)	ISD11 YER048W-A	11266	32	0.669
	
SOV1_YEAST

SOV1

Protein SOV1, mitochondrial	SOV1 YMR066W YM9916.05	104748	27	0.665
	
ALDH5_YEAST

ALD5

Aldehyde dehydrogenase 5, mitochondrial (EC 1.2.1.5)	ALD5 ALD3 ALDH5 YER073W	56693	21	0.657
	
NFS1_YEAST

NFS1

Cysteine desulfurase, mitochondrial (EC 2.8.1.7) (tRNA-splicing protein SPL1)	NFS1 SPL1 YCL017C	54467	104	0.652
	
SYC_YEAST

CRS1

Cysteine--tRNA ligase (EC 6.1.1.16) (Cysteinyl-tRNA synthetase) (CysRS)	YNL247W N0885	87530	23	0.636
'''
%store s >data_to_extract.txt

Writing 's' (str) to file 'data_to_extract.txt'.


Currently the script will process any file that has the `.md` extension, so we'll copy the data to a file with that extension. (Note that the extension that signals which files the script acts on can be altered by editing `extension_for_processing` under 'USER ADJUSTABLE VALUES' in the script.)

In [2]:
!cp data_to_extract.txt data_to_extract.md
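
As an illustration of what extension-based selection means in practice, here is a minimal sketch. It is not the script's actual code: the helper `files_to_process` is hypothetical, and only the name `extension_for_processing` and the `.md` value come from the description above. The point is that every file in the directory with the matching extension is picked up, not just the one you prepared.

```python
from pathlib import Path
import tempfile

# Same name as the setting described above; the value mirrors the current default.
extension_for_processing = ".md"


def files_to_process(directory, extension=extension_for_processing):
    """Return files in `directory` whose suffix matches `extension` (hypothetical helper)."""
    matches = [p for p in Path(directory).iterdir()
               if p.is_file() and p.suffix == extension]
    return sorted(matches, key=lambda p: p.name)


# Try it in a throwaway directory so nothing in your working directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data_to_extract.txt", "data_to_extract.md", "README.md"):
        (Path(tmp) / name).write_text("")
    found = [p.name for p in files_to_process(tmp)]

print(found)
```

In this sketch the `.txt` file is skipped while both `.md` files are selected, so any other `.md` file sitting next to your data would be processed as well.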

### Preparation: Getting the script to do the work

Running the next cell will get the current version of the script `extract_gene_list_from_screen_copied_MitCOM_data.py`.

In [3]:
!curl -OL https://raw.githubusercontent.com/fomightez/proteomicswork/main/mitcom-utiltities/extract_gene_list_from_screen_copied_MitCOM_data.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9037  100  9037    0     0  58681      0 --:--:-- --:--:-- --:--:-- 58681


In [4]:
%run extract_gene_list_from_screen_copied_MitCOM_data.py

Processing ./data_to_extract.md...
Saved gene names in ./data_to_extract_EXTRACTEDgenes.tsv.
Processing ./README.md...
Saved gene names in ./README_EXTRACTEDgenes.tsv.


If you were using the script on the command line/terminal, the above line would look something like this:

```shell
python extract_gene_list_from_screen_copied_MitCOM_data.py
```

In this demo, you may also see the script act on the file `README.md`. We'll just disregard that result; the file is small and the computational task isn't demanding, so the overhead is negligible.

The real target was `data_to_extract.md`, and the result is `data_to_extract_EXTRACTEDgenes.tsv`.

We can examine the content of the resulting file produced:

In [5]:
cat data_to_extract_EXTRACTEDgenes.tsv

gene_name
GCV2
ESP1
MYO4
KOG1
SLM1
KGD1
MIP1
MIS1
AFT2
FOL1
POS5
ROT2
RPO41
ATM1
ISD11
SOV1
ALD5
NFS1
CRS1


Those paying attention may notice there are only 19 gene names listed there.  
Consistent with off-by-one errors being [one of the hard things in computer science](https://twitter.com/codinghorror/status/506010907021828096?lang=en), despite what is stated in the manuscript, it seems there are only 19 in the lists produced from the correlation window. (I've tried it many times and it is always 19 in the correlation window itself. It's not the script missing one.) Even the reviewers didn't catch it.
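
If you want to check the count independently of the script, one rough approach is to count the lines in the pasted text that look like entry names. The sketch below is only an assumption-laden check, not part of the script: `count_entries` is a made-up helper, and it assumes every record contributes exactly one line ending in `_YEAST`.

```python
from pathlib import Path


def count_entries(text):
    """Count lines that look like entry names, e.g. 'GCSP_YEAST' (hypothetical helper)."""
    return sum(1 for line in text.splitlines()
               if line.strip().endswith("_YEAST"))


# Tiny stand-in for the pasted text: three entries.
sample = "GCSP_YEAST\n\nGCV2\n\nESP1_YEAST\n\nESP1\n\nMYO4_YEAST\n\nMYO4\n"
print(count_entries(sample))

# If the file saved earlier in this notebook is present, count its entries too.
saved = Path("data_to_extract.txt")
if saved.exists():
    print(count_entries(saved.read_text()))
```

Comparing that number with the number of rows in the `_EXTRACTEDgenes.tsv` file is a quick way to confirm nothing was dropped during extraction.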

Note that this script isn't limited to the list seen when in the 'Correlation' tab. You can use it on the table in the main 'Protein List' area, too, if for some reason you want to, or even on just a segment within that table. However, that table is available as Sheet 1 in the Excel spreadsheet file available as [Supplementary Table 1 (41586_2022_5641_MOESM4_ESM.xlsx)](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-022-05641-w/MediaObjects/41586_2022_5641_MOESM4_ESM.xlsx) among the 'Supplementary information' available with [Schulte et al 2023 'Mitochondrial complexome reveals quality-control pathways of protein import'](https://www.nature.com/articles/s41586-022-05641-w#Sec29). Still, I could imagine that copying from the portal would be useful if you had several proteins selected and wanted to record them and get the names out later.

Now you can collect the filtered data you want from the portal and follow the above process to collect the standard gene name identifiers into a form you can use in analyses or elsewhere, such as at YeastMine.
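
As one possible next step, here is a minimal sketch of reading a result file back into Python. It assumes the file has a single column with the header `gene_name`, as in the output shown above; the helper `read_gene_names` and the demo file name are made up for illustration. To keep the example self-contained, it writes a small stand-in file to a temporary directory rather than relying on `data_to_extract_EXTRACTEDgenes.tsv` being present.

```python
import csv
import tempfile
from pathlib import Path


def read_gene_names(tsv_path):
    """Read the 'gene_name' column of a tab-separated file into a list (hypothetical helper)."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["gene_name"] for row in reader if row["gene_name"]]


# Small stand-in for a result file, with the same header as the real output.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_EXTRACTEDgenes.tsv"
    demo.write_text("gene_name\nGCV2\nESP1\nMYO4\n")
    genes = read_gene_names(demo)

# Join the names into one string for pasting into a web form.
print(", ".join(genes))
```

With the real file you would call `read_gene_names("data_to_extract_EXTRACTEDgenes.tsv")` instead. Check which separator the site you are pasting into expects; the comma used here is just one common choice.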

--------

Enjoy!