# Pre-processing of "COVID-19" IntAct MITAB 2.5 tabular data

This notebook pre-processes **MITAB 2.5** tabular data to make it more human-readable and suitable for generating NDEx networks with our Python TSV loader.

## Network Structure

### Nodes
The __name__ will be an official gene symbol or other short human-readable name (for non-protein nodes).

The __represent__ will be an official identifier as specified in the **"ID(s) interactor A/B"** columns.

__Attribute 1 = alias__ This will be a list of strings with the content contained in the **"Alias(es) interactor A/B"** columns.

__Attribute 2 = taxid__ This will be a string with the content of the **"Taxid interactor A/B"** columns.

__Attribute 3 = type__ This will be a string with the content of the **"Type(s) interactor A/B"** columns.

### Edges

Edges will have the default predicate **interacts-with**.

Edges will also have several attributes from the following columns in the original data set:

> 'Interaction detection method(s)'

> 'Publication 1st author(s)'

> 'Publication Identifier(s)'

> 'Interaction type(s)'

> 'Confidence value(s)'

> 'Biological role(s) interactor A'

> 'Biological role(s) interactor B'

> 'Experimental role(s) interactor A'

> 'Experimental role(s) interactor B'

> "Expansion method(s)"

> "Interaction Xref(s)"

> "Interaction annotation(s)"

> "Host organism(s)"

> "Interaction parameter(s)"


## Packages and data import

In [1]:
import pandas as pd

In [2]:
original_data = pd.read_csv('annot__dat_mitab2_5.txt', sep='\t')
original_data.head(3)

Unnamed: 0,#ID(s) interactor A,ID(s) interactor B,Alt. ID(s) interactor A,Alt. ID(s) interactor B,Alias(es) interactor A,Alias(es) interactor B,Interaction detection method(s),Publication 1st author(s),Publication Identifier(s),Taxid interactor A,...,Checksum(s) interactor A,Checksum(s) interactor B,Interaction Checksum(s),Negative,Feature(s) interactor A,Feature(s) interactor B,Stoichiometry(s) interactor A,Stoichiometry(s) interactor B,Identification method participant A,Identification method participant B
0,uniprotkb:P31809,uniprotkb:P11224,intact:EBI-8663470|uniprotkb:Q61353|intact:MIN...,intact:EBI-16196052|uniprotkb:O39227,psi-mi:ceam1_mouse(display_long)|uniprotkb:Cea...,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"psi-mi:""MI:1247""(microscale thermophoresis)",Walls et al. (2016),pubmed:26855426|doi:10.1038/nature16988|imex:I...,taxid:10090(mouse)|taxid:10090(Mus musculus),...,rogid:TVzltAt+4VNWHR+xzvc8oHRwQ1010090,rogid:6rg8ivahgyrx/wjATOTsFDqOrDw11142,rigid:Y99xRD1y8eE2/yVdAd9yAjVKJ9Y,False,sufficient binding region:35-142,sufficient binding region:15-1231,-,-,"psi-mi:""MI:0396""(predetermined participant)","psi-mi:""MI:0396""(predetermined participant)"
1,uniprotkb:P11224,uniprotkb:P11224,intact:EBI-16196052|uniprotkb:O39227,intact:EBI-16196052|uniprotkb:O39227,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"psi-mi:""MI:0067""(light scattering)",Walls et al. (2016),pubmed:26855426|doi:10.1038/nature16988|imex:I...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...",...,rogid:6rg8ivahgyrx/wjATOTsFDqOrDw11142,rogid:6rg8ivahgyrx/wjATOTsFDqOrDw11142,rigid:5rNkJ7rt9fT0v8b81I4UVRcB8cI,False,sufficient binding region:15-1231,sufficient binding region:15-1231,-,0,"psi-mi:""MI:0396""(predetermined participant)","psi-mi:""MI:0396""(predetermined participant)"
2,uniprotkb:P11224,uniprotkb:P11224,intact:EBI-16196052|uniprotkb:O39227,intact:EBI-16196052|uniprotkb:O39227,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"psi-mi:""MI:0071""(molecular sieving)",Walls et al. (2016),pubmed:26855426|doi:10.1038/nature16988|imex:I...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...",...,rogid:6rg8ivahgyrx/wjATOTsFDqOrDw11142,rogid:6rg8ivahgyrx/wjATOTsFDqOrDw11142,rigid:5rNkJ7rt9fT0v8b81I4UVRcB8cI,False,sufficient binding region:15-1231,sufficient binding region:15-1231,-,0,"psi-mi:""MI:0396""(predetermined participant)","psi-mi:""MI:0396""(predetermined participant)"


## Selection of data subset to use

In [3]:
# Get all column headers as a list

original_headers = original_data.columns.values.tolist()
print(original_headers)

['#ID(s) interactor A', 'ID(s) interactor B', 'Alt. ID(s) interactor A', 'Alt. ID(s) interactor B', 'Alias(es) interactor A', 'Alias(es) interactor B', 'Interaction detection method(s)', 'Publication 1st author(s)', 'Publication Identifier(s)', 'Taxid interactor A', 'Taxid interactor B', 'Interaction type(s)', 'Source database(s)', 'Interaction identifier(s)', 'Confidence value(s)', 'Expansion method(s)', 'Biological role(s) interactor A', 'Biological role(s) interactor B', 'Experimental role(s) interactor A', 'Experimental role(s) interactor B', 'Type(s) interactor A', 'Type(s) interactor B', 'Xref(s) interactor A', 'Xref(s) interactor B', 'Interaction Xref(s)', 'Annotation(s) interactor A', 'Annotation(s) interactor B', 'Interaction annotation(s)', 'Host organism(s)', 'Interaction parameter(s)', 'Creation date', 'Update date', 'Checksum(s) interactor A', 'Checksum(s) interactor B', 'Interaction Checksum(s)', 'Negative', 'Feature(s) interactor A', 'Feature(s) interactor B', 'Stoichiom

In [4]:
# Select the data columns to use and create a new dataframe

cols = ['#ID(s) interactor A', 'ID(s) interactor B', 'Alias(es) interactor A', 'Alias(es) interactor B', 'Taxid interactor A', 'Taxid interactor B', 'Type(s) interactor A', 'Type(s) interactor B', 'Interaction detection method(s)', 'Publication 1st author(s)', 'Publication Identifier(s)', 'Interaction type(s)', 'Confidence value(s)', 'Biological role(s) interactor A', 'Biological role(s) interactor B', 'Experimental role(s) interactor A', 'Experimental role(s) interactor B', "Expansion method(s)", "Interaction Xref(s)", "Interaction annotation(s)", "Host organism(s)", "Interaction parameter(s)"]
selected_data = original_data[cols].reset_index(drop=True)
selected_data.head(5)

Unnamed: 0,#ID(s) interactor A,ID(s) interactor B,Alias(es) interactor A,Alias(es) interactor B,Taxid interactor A,Taxid interactor B,Type(s) interactor A,Type(s) interactor B,Interaction detection method(s),Publication 1st author(s),...,Confidence value(s),Biological role(s) interactor A,Biological role(s) interactor B,Experimental role(s) interactor A,Experimental role(s) interactor B,Expansion method(s),Interaction Xref(s),Interaction annotation(s),Host organism(s),Interaction parameter(s)
0,uniprotkb:P31809,uniprotkb:P11224,psi-mi:ceam1_mouse(display_long)|uniprotkb:Cea...,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,taxid:10090(mouse)|taxid:10090(Mus musculus),"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:1247""(microscale thermophoresis)",Walls et al. (2016),...,intact-miscore:0.44,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,taxid:-1(in vitro)|taxid:-1(In vitro),-
1,uniprotkb:P11224,uniprotkb:P11224,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0067""(light scattering)",Walls et al. (2016),...,intact-miscore:0.60,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,"taxid:7227(drome)|taxid:7227(""Drosophila melan...",-
2,uniprotkb:P11224,uniprotkb:P11224,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0071""(molecular sieving)",Walls et al. (2016),...,intact-miscore:0.60,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,"taxid:7227(drome)|taxid:7227(""Drosophila melan...",-
3,uniprotkb:P11224,uniprotkb:P11224,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0410""(3D electron microscopy)",Walls et al. (2016),...,intact-miscore:0.60,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,"taxid:7227(drome)|taxid:7227(""Drosophila melan...",-
4,uniprotkb:Q0ZME7,uniprotkb:Q0ZME7,psi-mi:spike_cvhn5(display_long)|uniprotkb:S(g...,psi-mi:spike_cvhn5(display_long)|uniprotkb:S(g...,"taxid:443241(cvhn5)|taxid:443241(""Human corona...","taxid:443241(cvhn5)|taxid:443241(""Human corona...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0410""(3D electron microscopy)",Kirchdoerfer et al. (2016),...,intact-miscore:0.46,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,taxid:9606(human)|taxid:9606(Homo sapiens),-


## Extraction of 'gene name' to use as 'node name' (node label)

The **'Alias(es) interactor A/B'** columns contain a list of aliases for each interactor.
Among those, there are also official gene symbols, so it makes sense to extract the info and use it as node labels.

However, there are complications: 
1. Not all nodes have the same number of aliases; this means that the (gene name) alias might be in the first, second or nth position, if it is present at all.
2. All aliases have prefixes that further complicate extracting the information.
3. Last but not least, not all nodes are proteins; some of them are chemicals/small molecules, so a different alias should be extracted for these nodes (the best candidate is the one flagged as (display_short)).
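
To make these three complications concrete, here is a minimal sketch that splits one alias cell into its parts. The names `ALIAS_PATTERN` and `parse_aliases` are hypothetical (they are not part of this notebook), and the sketch assumes that every entry follows the `database:value(label)` shape seen in the rows above and that `-` marks an empty cell.

```python
import re

# Hypothetical pattern: database prefix, value, and the label in the trailing parentheses.
ALIAS_PATTERN = re.compile(r'^(?P<db>[^:]+):(?P<value>.+)\((?P<label>[^()]*)\)$')

def parse_aliases(cell):
    """Return a list of (database, value, label) tuples for one alias cell."""
    # Assumption: '-' (or a non-string such as NaN) means "no aliases".
    if not isinstance(cell, str) or cell == '-':
        return []
    parsed = []
    for entry in cell.split('|'):
        match = ALIAS_PATTERN.match(entry)
        if match:
            # Some values are wrapped in double quotes, e.g. "CHEBI:145416".
            parsed.append((match['db'], match['value'].strip('"'), match['label']))
    return parsed

example = 'psi-mi:spike_cvma5(display_long)|uniprotkb:S(gene name)|psi-mi:S(display_short)'
parsed_example = parse_aliases(example)
print(parsed_example)
```

Once the label is available as its own field, choosing between (gene name) and (display_short) no longer depends on the position of the alias in the list.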

Below are 2 options to attempt retrieval of the useful information.

### Option A

- Split each alias string at the vertical bars and store the substrings as a list in the same dataframe column.
- Then, iterate through every item in each list and discard all those that don't contain '(gene name)'.

In [5]:
# split each string into a list within the original df

selected_data['Alias(es) interactor A'] = selected_data['Alias(es) interactor A'].str.split('|', expand=False)
selected_data.head(20)

Unnamed: 0,#ID(s) interactor A,ID(s) interactor B,Alias(es) interactor A,Alias(es) interactor B,Taxid interactor A,Taxid interactor B,Type(s) interactor A,Type(s) interactor B,Interaction detection method(s),Publication 1st author(s),...,Confidence value(s),Biological role(s) interactor A,Biological role(s) interactor B,Experimental role(s) interactor A,Experimental role(s) interactor B,Expansion method(s),Interaction Xref(s),Interaction annotation(s),Host organism(s),Interaction parameter(s)
0,uniprotkb:P31809,uniprotkb:P11224,"[psi-mi:ceam1_mouse(display_long), uniprotkb:C...",psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,taxid:10090(mouse)|taxid:10090(Mus musculus),"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:1247""(microscale thermophoresis)",Walls et al. (2016),...,intact-miscore:0.44,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,taxid:-1(in vitro)|taxid:-1(In vitro),-
1,uniprotkb:P11224,uniprotkb:P11224,"[psi-mi:spike_cvma5(display_long), uniprotkb:S...",psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0067""(light scattering)",Walls et al. (2016),...,intact-miscore:0.60,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,"taxid:7227(drome)|taxid:7227(""Drosophila melan...",-
2,uniprotkb:P11224,uniprotkb:P11224,"[psi-mi:spike_cvma5(display_long), uniprotkb:S...",psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0071""(molecular sieving)",Walls et al. (2016),...,intact-miscore:0.60,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,"taxid:7227(drome)|taxid:7227(""Drosophila melan...",-
3,uniprotkb:P11224,uniprotkb:P11224,"[psi-mi:spike_cvma5(display_long), uniprotkb:S...",psi-mi:spike_cvma5(display_long)|uniprotkb:S(g...,"taxid:11142(cvma5)|taxid:11142(""Murine coronav...","taxid:11142(cvma5)|taxid:11142(""Murine coronav...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0410""(3D electron microscopy)",Walls et al. (2016),...,intact-miscore:0.60,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,"taxid:7227(drome)|taxid:7227(""Drosophila melan...",-
4,uniprotkb:Q0ZME7,uniprotkb:Q0ZME7,"[psi-mi:spike_cvhn5(display_long), uniprotkb:S...",psi-mi:spike_cvhn5(display_long)|uniprotkb:S(g...,"taxid:443241(cvhn5)|taxid:443241(""Human corona...","taxid:443241(cvhn5)|taxid:443241(""Human corona...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0410""(3D electron microscopy)",Kirchdoerfer et al. (2016),...,intact-miscore:0.46,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,taxid:9606(human)|taxid:9606(Homo sapiens),-
5,uniprotkb:Q0ZME7,uniprotkb:Q0ZME7,"[psi-mi:spike_cvhn5(display_long), uniprotkb:S...",psi-mi:spike_cvhn5(display_long)|uniprotkb:S(g...,"taxid:443241(cvhn5)|taxid:443241(""Human corona...","taxid:443241(cvhn5)|taxid:443241(""Human corona...","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0020""(transmission electron microsc...",Kirchdoerfer et al. (2016),...,intact-miscore:0.46,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)",-,"psi-mi:""MI:0465""(dip)",dataset:Coronavirus - Interactions investigate...,taxid:9606(human)|taxid:9606(Homo sapiens),-
6,"chebi:""CHEBI:145416""",uniprotkb:Q696P8,"[psi-mi:""methyl 9-o-acetyl-5-(acetylamino)-3,5...",psi-mi:q696p8_cvhoc(display_long)|uniprotkb:E2...,-,taxid:31631(HCoV-OC43)|taxid:31631(Human coron...,"psi-mi:""MI:0328""(small molecule)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0410""(3D electron microscopy)",Tortorici et al. (2019),...,intact-miscore:0.36,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0497""(neutral component)","psi-mi:""MI:0497""(neutral component)",-,emdb:EMD-0557(see-also),"figure legend:Fig. 1, Table 1, Supplementary F...",taxid:9606(human-293f)|taxid:9606(Homo sapiens...,-
7,uniprotkb:Q696P8,uniprotkb:Q696P8,"[psi-mi:q696p8_cvhoc(display_long), uniprotkb:...",psi-mi:q696p8_cvhoc(display_long)|uniprotkb:E2...,taxid:31631(HCoV-OC43)|taxid:31631(Human coron...,taxid:31631(HCoV-OC43)|taxid:31631(Human coron...,"psi-mi:""MI:0326""(protein)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0410""(3D electron microscopy)",Tortorici et al. (2019),...,intact-miscore:0.36,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0497""(neutral component)","psi-mi:""MI:0497""(neutral component)",-,emdb:EMD-20070(see-also),"figure legend:Fig. 1, Table 1, Supplementary F...",taxid:9606(human-293f)|taxid:9606(Homo sapiens...,-
8,intact:EBI-20623174,uniprotkb:Q53F19,"[psi-mi:cvhsa-oligo(display_short), psi-mi:EBI...",psi-mi:ncbp3_human(display_long)|uniprotkb:C17...,taxid:694009(sars-cov)|taxid:694009(Human SARS...,taxid:9606(human)|taxid:9606(Homo sapiens),"psi-mi:""MI:0320""(ribonucleic acid)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0096""(pull down)",Gebhardt et al. (2015),...,intact-miscore:0.59,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0496""(bait)","psi-mi:""MI:0498""(prey)","psi-mi:""MI:1060""(spoke expansion)",-,figure legend:Fig. 2D|comment:a similar intera...,taxid:-1(in vitro)|taxid:-1(In vitro),-
9,intact:EBI-20623174,uniprotkb:O60573,"[psi-mi:cvhsa-oligo(display_short), psi-mi:EBI...",psi-mi:if4e2_human(display_long)|uniprotkb:EIF...,taxid:694009(sars-cov)|taxid:694009(Human SARS...,taxid:9606(human)|taxid:9606(Homo sapiens),"psi-mi:""MI:0320""(ribonucleic acid)","psi-mi:""MI:0326""(protein)","psi-mi:""MI:0096""(pull down)",Gebhardt et al. (2015),...,intact-miscore:0.35,"psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0499""(unspecified role)","psi-mi:""MI:0496""(bait)","psi-mi:""MI:0498""(prey)","psi-mi:""MI:1060""(spoke expansion)",-,figure legend:Fig. 2D|comment:a similar intera...,taxid:-1(in vitro)|taxid:-1(In vitro),-


In [7]:
# create a list of values for a specific dataframe column, then iterate through the
# values (that are themselves lists) and remove all that do not match the desired condition

aliases_list = selected_data['Alias(es) interactor A'].to_list()
print(aliases_list)   

[['psi-mi:ceam1_mouse(display_long)', 'uniprotkb:Ceacam1(gene name)', 'psi-mi:Ceacam1(display_short)', 'uniprotkb:Bgp(gene name synonym)', 'uniprotkb:Bgp1(gene name synonym)', 'uniprotkb:Biliary glycoprotein 1(gene name synonym)', 'uniprotkb:Murine hepatitis virus receptor(gene name synonym)', 'uniprotkb:MHVR1(gene name synonym)', 'uniprotkb:Biliary glycoprotein D(gene name synonym)'], ['psi-mi:spike_cvma5(display_long)', 'uniprotkb:S(gene name)', 'psi-mi:S(display_short)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:Peplomer protein(gene name synonym)', 'uniprotkb:3(orf name)'], ['psi-mi:spike_cvma5(display_long)', 'uniprotkb:S(gene name)', 'psi-mi:S(display_short)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:Peplomer protein(gene name synonym)', 'uniprotkb:3(orf name)'], ['psi-mi:spike_cvma5(display_long)', 'uniprotkb:S(gene name)', 'psi-mi:S(display_short)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:Peplomer protein(gene name synonym)', 'uniprotkb:3(orf name)'], ['psi-mi:sp

In [8]:
# Iterate through each alias list and remove every element that does not match the
# specified condition; in this case, an element (string) that does not contain the
# substring 'gene display' is deleted. I expect this to preserve all the strings
# tagged (gene name), (gene name synonym), (display_long) and (display_short).

for item in aliases_list:
    for element in item:
        if 'gene display' not in element:
            item.remove(element)
            
print(aliases_list)    

[['uniprotkb:Ceacam1(gene name)', 'uniprotkb:Bgp(gene name synonym)', 'uniprotkb:Biliary glycoprotein 1(gene name synonym)', 'uniprotkb:MHVR1(gene name synonym)'], ['uniprotkb:S(gene name)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:E2(gene name synonym)', 'uniprotkb:3(orf name)'], ['psi-mi:"CHEBI:145416"(display_long)'], ['uniprotkb:E2(gene name synonym)', 'uniprotkb:S(gene name)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:5

In [9]:
# The list is now contaminated with gene name synonyms, which must be removed.
# Also, some entries will have (display_short). Those should be eliminated too.
# (display_long) is what I want to keep because some values only have these.
# I can use the same loop again.

for item in aliases_list:
    for element in item:
        if 'synonym' in element or 'short' in element:
            item.remove(element)
            
print(aliases_list)    

[['uniprotkb:Ceacam1(gene name)', 'uniprotkb:Biliary glycoprotein 1(gene name synonym)'], ['uniprotkb:S(gene name)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:3(orf name)'], ['uniprotkb:S(gene name)', 'uniprotkb:3(orf name)'], ['psi-mi:"CHEBI:145416"(display_long)'], ['uniprotkb:S(gene name)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:EBI-20623174(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['psi-mi:"CHEBI:50210"(display_long)'], ['uniprotkb:NCBP3(gene name)'], ['uniprotkb:NCBP3(gene name)'], ['uniprotkb:NCBP3(gene name)'], ['uni

In [10]:
# Create a new dataframe with the cleaned-up aliases

cleaned_aliases = pd.DataFrame(aliases_list)
cleaned_aliases.head(700)

Unnamed: 0,0,1,2
0,uniprotkb:Ceacam1(gene name),uniprotkb:Biliary glycoprotein 1(gene name syn...,
1,uniprotkb:S(gene name),uniprotkb:3(orf name),
2,uniprotkb:S(gene name),uniprotkb:3(orf name),
3,uniprotkb:S(gene name),uniprotkb:3(orf name),
4,uniprotkb:S(gene name),uniprotkb:3(orf name),
...,...,...,...
695,,,
696,,,
697,,,
698,,,


In [None]:
# Column '0' looks good, BUT some cells end up with None values and I don't understand
# why that happens...

# In addition, column 1 contains values tagged (orf name):
# those should have been eliminated earlier, during the first for loop...
# Am I missing something?

# I also need to remove the prefixes as well as the text in parentheses.
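
A likely explanation for both observations, offered as a sketch rather than a definitive fix. First, `'gene display' in element` tests for the single literal substring `gene display`, which no alias contains, so every element is a removal candidate. Second, calling `item.remove()` while iterating over the same list makes the loop skip the element that follows each removal, so roughly every other entry survives, including `(orf name)`. Finally, `pd.DataFrame` pads shorter lists with `None`, and a list that ends up empty becomes a row made entirely of `None`. The helper name `pick_alias` and the priority order below are assumptions, not something defined elsewhere in this notebook.

```python
# Assumed priority: gene name first, then display_short, then display_long.
LABEL_PRIORITY = ('(gene name)', '(display_short)', '(display_long)')

def pick_alias(aliases, priority=LABEL_PRIORITY):
    """Return the first alias carrying the highest-priority label, or None."""
    for label in priority:
        # endswith('(gene name)') does not match '(gene name synonym)'.
        matches = [alias for alias in aliases if alias.endswith(label)]
        if matches:
            return matches[0]
    return None

# One alias list copied from the printed output above.
aliases = ['psi-mi:spike_cvma5(display_long)', 'uniprotkb:S(gene name)',
           'psi-mi:S(display_short)', 'uniprotkb:E2(gene name synonym)',
           'uniprotkb:Peplomer protein(gene name synonym)', 'uniprotkb:3(orf name)']

# The pattern from the first for loop, run on a copy: removing while iterating
# skips the element right after each removal.
buggy = list(aliases)
for element in buggy:
    if 'gene display' not in element:
        buggy.remove(element)

# Non-mutating alternative: build a new list instead of editing the one being iterated.
kept = [alias for alias in aliases if alias.endswith(LABEL_PRIORITY)]

print(buggy)
print(kept)
print(pick_alias(aliases))
```

Running the loop on the copy reproduces the three survivors printed above for this row, which supports the skipping explanation.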

In [None]:
cleaned_aliases[0] = cleaned_aliases[0].str.replace("uniprotkb:", "", regex=False)
cleaned_aliases[0] = cleaned_aliases[0].str.replace("psi-mi:", "", regex=False)
cleaned_aliases[0] = cleaned_aliases[0].str.replace("(gene name)", "", regex=False)
cleaned_aliases[0] = cleaned_aliases[0].str.replace("(display_long)", "", regex=False)
cleaned_aliases[0] = cleaned_aliases[0].str.replace("(display_short)", "", regex=False)

cleaned_aliases[0] = cleaned_aliases[0].str.strip('"')

cleaned_aliases.head(700)
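
The chained replacements can also be expressed as two anchored regular expressions, which avoids listing each prefix and label on its own line. This is a minimal sketch on a toy Series (the variable names `sample` and `cleaned` are hypothetical); the sample values are copied from the printed lists above, and the set of prefixes and labels is assumed to be limited to the ones handled in the cell above.

```python
import pandas as pd

# Toy input: three aliases taken from the output above, plus a missing value.
sample = pd.Series([
    'uniprotkb:Ceacam1(gene name)',
    'psi-mi:"CHEBI:145416"(display_long)',
    'psi-mi:EBI-20623174(display_long)',
    None,
])

cleaned = (
    sample
    .str.replace(r'^(uniprotkb|psi-mi):', '', regex=True)                          # drop the database prefix
    .str.replace(r'\((gene name|display_long|display_short)\)$', '', regex=True)   # drop the trailing label
    .str.strip('"')                                                                # drop surrounding quotes
)
print(cleaned.tolist())
```

Anchoring with `^` and `$` keeps a prefix or label that happens to appear in the middle of a value from being removed by accident.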

### Option B

Alternatively, setting 'expand=True' splits the substrings into separate columns of a new dataframe.

Most of the gene names end up in column 1, so option B is pretty good, but some nodes will not have human-readable names for the reasons explained above.

In [None]:
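# Use the untouched column from original_data: Option A replaced the strings in
# selected_data['Alias(es) interactor A'] with lists, which .str.split cannot split.
expanded_aliases = original_data['Alias(es) interactor A'].str.split('|', expand=True)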
expanded_aliases.head(5)

In [None]:
# Most of the gene names are in column 1, so option B is pretty good,
# but some nodes will not have human-readable names.

# Option A might be better because we can manipulate the list of values... But I can't figure out how to do that.
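
One position-independent alternative to both options, sketched under the assumption that the gene symbol is always written as `uniprotkb:<symbol>(gene name)`: `Series.str.extract` with a regular expression pulls the symbol out wherever it sits in the string. The variable names are hypothetical and the sample strings are abbreviated versions of rows shown above.

```python
import pandas as pd

alias_column = pd.Series([
    'psi-mi:spike_cvma5(display_long)|uniprotkb:S(gene name)|psi-mi:S(display_short)|uniprotkb:E2(gene name synonym)',
    'psi-mi:cvhsa-oligo(display_short)|psi-mi:EBI-20623174(display_long)',
])

# (?:^|\|) anchors the match at the start of an alias entry;
# \(gene name\) does not match '(gene name synonym)'.
gene_name = alias_column.str.extract(r'(?:^|\|)uniprotkb:([^|]+?)\(gene name\)', expand=False)
print(gene_name.tolist())
```

Rows without a (gene name) alias, such as the non-protein node in the second string, come back as missing values, so they can be identified and filled from another label afterwards.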

In [None]:
# Option A

In [None]:
********************

In [None]:
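# NOTE: df2 and df3 are not defined earlier in this notebook. Assumption: df3 is the
# expanded alias table from Option B and df2 is the selected-columns dataframe.
df3 = expanded_aliases
df2 = selected_data

df3[1] = df3[1].str.replace("(gene name)", "", regex=False)
df3[1] = df3[1].str.replace("(gene name synonym)", "", regex=False)
df3[1] = df3[1].str.replace("(display_long)", "", regex=False)
df3.head()

In [None]:
df2['name A'] = df3[1]
df2.head()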

In [None]:
# SAME PROCEDURE NEEDS TO BE DONE FOR "ALIAS (ES) INTERACTOR B" column
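
A minimal sketch of how the same procedure could be applied to both alias columns in one loop. The function name `extract_node_name`, the `PRIORITY` order (gene name, then display_short, then display_long) and the toy dataframe are assumptions for illustration; the alias strings are abbreviated from rows shown earlier.

```python
import pandas as pd

# Assumed fallback order: (database prefix, label) pairs, tried from first to last.
PRIORITY = (('uniprotkb', 'gene name'), ('psi-mi', 'display_short'), ('psi-mi', 'display_long'))

def extract_node_name(alias_series):
    """Return one readable name per row, using the first label in PRIORITY that is present."""
    candidates = [
        alias_series.str.extract(rf'(?:^|\|){db}:([^|]+?)\({label}\)', expand=False)
        for db, label in PRIORITY
    ]
    name = candidates[0]
    for fallback in candidates[1:]:
        name = name.fillna(fallback)  # only fills rows that are still missing
    return name.str.strip('"')

toy = pd.DataFrame({
    'Alias(es) interactor A': [
        'psi-mi:ceam1_mouse(display_long)|uniprotkb:Ceacam1(gene name)|psi-mi:Ceacam1(display_short)',
        'psi-mi:cvhsa-oligo(display_short)|psi-mi:EBI-20623174(display_long)',
    ],
    'Alias(es) interactor B': [
        'psi-mi:spike_cvma5(display_long)|uniprotkb:S(gene name)|psi-mi:S(display_short)',
        'psi-mi:spike_cvma5(display_long)|uniprotkb:S(gene name)',
    ],
})

for side in ('A', 'B'):
    toy[f'name {side}'] = extract_node_name(toy[f'Alias(es) interactor {side}'])

print(toy[['name A', 'name B']])
```

Because the function takes the raw pipe-delimited strings, it should be fed a column that has not already been split into lists.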

In [None]:
# Correct header name and drop duplicate columns
df2['ID(s) interactor A'] = df2['#ID(s) interactor A']
df2.drop(columns =["Alias(es) interactor A", "Alias(es) interactor B", '#ID(s) interactor A'], inplace = True)
df2.head(3)

In [None]:
# Get column headers as a list
new_cols = df2.columns.values.tolist()
print(new_cols)

# Re-order columns in the final dataframe
# ('name B' must already exist, i.e. the interactor B aliases must have been processed too)
df2 = df2[['ID(s) interactor A', 'ID(s) interactor B', 'name A', 'name B', 'Interaction detection method(s)', 'Publication 1st author(s)', 'Publication Identifier(s)', 'Taxid interactor A', 'Taxid interactor B', 'Interaction type(s)', 'Confidence value(s)', 'Biological role(s) interactor A', 'Biological role(s) interactor B', 'Experimental role(s) interactor A', 'Experimental role(s) interactor B']]
df2.head(3)