# Use of hu.MAP 3.0 data programmatically with Python, taking advantage of Jupyter features

Work through the cells in this notebook, which go through preparation steps and then some example rounds of queries. This will give you an idea of what is occurring when programmatically mining and annotating proteins identified in complexes in human cells. You could then edit this to run queries on the complexes of your favorite human proteins.  
However, before spending a lot of time on that, I suggest seeing [the next notebook in this series](Making_many_hu.MAP3_reports_easily_using_Snakemake.ipynb), as you are probably interested in at least a handful of proteins and it provides a more convenient way to query the complexes of multiple proteins. What the next notebook produces is very similar to what you see here, yet all you need to do is provide a list of protein/gene identifiers; it automatically processes them to produce report summaries like this file for each valid identifier in the list. (That system checks each round that what you provided is valid, so if your identifiers are not working here, see that one. There is no 'check' step in this demonstration Jupyter notebook.)

-------

What this Jupyter notebook file will produce when run:

- This Jupyter `.ipynb` file containing the following information:
    - list of all the proteins found in complexes along with the examined protein.
    - details about the individual complexes the examined protein occurs in, featuring extra information from UniProt.
    - list of proteins not observed to directly complex with the examined protein, yet complex with proteins that do directly complex.
- tab-separated files containing the details in the first two bullet points listed above
- an HTML file with the details about the individual complexes the examined protein occurs in, featuring extra information from UniProt. (The idea is you can use this anytime independent of Jupyter, to possibly share with others or convert to PDF & then share.)

#### Preparation

##### Get the complexes with confidence scores

While `curl -OL "https://humap3.proteincomplexes.org/static/downloads/humap3/hu.MAP3.0_complexes_wConfidenceScores_total15326_wGenenames_20240922.csv"` works on my local machine, the involved port may be blocked on MyBinder for getting it from the original resource. (Actually, it turns out that data is not in tidy form and has inconsistencies in the separators used, so remediation was necessary anyway; see [here](additional_nbs/standardizing_initial_data/README_as.ipynb) if you are interested in more on that.) We'll obtain a standardized copy of that data, placed where it is accessible in MyBinder-served sessions, by running the next cell:

In [1]:
!curl -OL https://raw.githubusercontent.com/fomightez/humap3-binder/refs/heads/main/additional_nbs/standardizing_initial_data/hu.MAP3.0_complexes_wConfidenceScores_total15326_wGenenames_20240922InOrderMatched.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1243k  100 1243k    0     0  2182k      0 --:--:-- --:--:-- --:--:-- 2186k


##### Put the data on the complexes into Pandas dataframe

(I'm using uv here just because I want to learn about it. I could have run the code in the script right in this notebook, and skipped the pickling and read pickle steps.)

Get the script to use with `uv` to read in the raw data and make a dataframe.

In [2]:
!curl -OL https://raw.githubusercontent.com/fomightez/structurework/refs/heads/master/humap3-utilities/complexes_rawCSV_to_df.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1007  100  1007    0     0   2968      0 --:--:-- --:--:-- --:--:--  2970


In [3]:
!uv run complexes_rawCSV_to_df.py hu.MAP3.0_complexes_wConfidenceScores_total15326_wGenenames_20240922InOrderMatched.csv
import pandas as pd
rd_df = pd.read_pickle('raw_complexes_pickled_df.pkl')
rd_df

Reading inline script metadata from `complexes_rawCSV_to_df.py`
Installed 10 packages in 135ms


Unnamed: 0,HuMAP3_ID,ComplexConfidence,Uniprot_ACCs,genenames
0,huMAP3_00000.1,1,Q9UGQ2 P20963 Q9NWV4,CACFD1 CD247 CZIB
1,huMAP3_00001.1,1,Q9NWB1 O94887 Q9NQ92,RBFOX1 FARP2 COPRS
2,huMAP3_00002.1,1,Q8N3D4 Q9Y3A4,EHBP1L1 RRP7A
3,huMAP3_00003.1,1,Q5T2D2 O00429,TREML2 DNM1L
4,huMAP3_00004.1,1,Q9H9C1 Q9H267 O95460 P21941 P78540,VIPAS39 VPS33B MATN4 MATN1 ARG2
...,...,...,...,...
15321,huMAP3_15345.1,6,O14628 Q3SXZ3,ZNF195 ZNF718
15322,huMAP3_15346.1,6,Q6ZWT7 P08910 Q86VD1 Q9UJQ1 Q9Y6X9,MBOAT2 ABHD2 MORC1 LAMP5 MORC2
15323,huMAP3_15347.1,6,A6ND91 Q4V339,ASPDH ZNG1F
15324,huMAP3_15348.1,6,A6NKF2 P08217 Q8IVW6 Q99856,ARID3C CELA2A ARID3B ARID3A


That's a lot of complexes!
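To get a feel for a table shaped like this before querying it, something along these lines could work. This is only a sketch: `toy_df` is a three-row stand-in built from rows shown in the output above, and the names `members_per_complex` and `per_confidence` are my own. On the real data you would use `rd_df` in place of `toy_df`.

```python
import pandas as pd

# Three rows copied from the output above, standing in for the full `rd_df`
toy_df = pd.DataFrame({
    "HuMAP3_ID": ["huMAP3_00002.1", "huMAP3_00003.1", "huMAP3_15347.1"],
    "ComplexConfidence": [1, 1, 6],
    "Uniprot_ACCs": ["Q8N3D4 Q9Y3A4", "Q5T2D2 O00429", "A6ND91 Q4V339"],
    "genenames": ["EHBP1L1 RRP7A", "TREML2 DNM1L", "ASPDH ZNG1F"],
})

# Number of proteins in each complex: split the space-separated accessions and count
members_per_complex = toy_df["Uniprot_ACCs"].str.split().str.len()

# Number of complexes at each confidence level
per_confidence = toy_df.groupby("ComplexConfidence").size()

print(members_per_complex.tolist())
print(per_confidence)
```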

--------

## Analyze complexes for a protein

Let's start in this notebook with a single protein and use Python/Pandas to access the data easily.  

The next line will define the identifier for this protein to be used as the search term for this entire notebook. (The query can be done with the human gene name or the UniProt accession number.) 

In [4]:
search_term = "ROGDI"

With that set, the rest of this notebook will do three things:  
- First we'll just list all the proteins found in complexes along with that corresponding protein.
- Then we'll detail the individual complexes themselves, adding extra information from UniProt.
- Finally, we'll go out another 'layer' by collecting a list of proteins that hu.MAP 3.0 data places in complexes with the query protein's partners, yet that don't appear as members of any complex along with the query protein itself.
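The idea behind that last 'layer' can be sketched with set arithmetic on a toy table. Everything here is made up for illustration: the complex IDs, the member names, and the helper functions `members` and `complexes_with`. It is not necessarily how the later cells of this notebook do it.

```python
import pandas as pd

# Toy stand-in for the hu.MAP 3.0 table; IDs and members are invented
toy_df = pd.DataFrame({
    "HuMAP3_ID": ["cplx_A", "cplx_B", "cplx_C"],
    "genenames": ["QUERY P1 P2", "P1 P3", "P4 P5"],
})

def members(df):
    """Return the set of all gene names found in the given rows."""
    return {name for names in df["genenames"] for name in names.split()}

def complexes_with(df, names):
    """Return the rows whose member list shares at least one name with `names`."""
    mask = df["genenames"].apply(lambda s: not names.isdisjoint(s.split()))
    return df[mask]

query = "QUERY"
# First layer: everything that shares a complex with the query (query included)
first_layer = members(complexes_with(toy_df, {query}))
partners = first_layer - {query}
# Second layer: proteins that share a complex with a partner but never with the query
second_layer = members(complexes_with(toy_df, partners)) - first_layer

print(sorted(first_layer))
print(sorted(second_layer))
```

Here `P3` ends up in the second layer because it complexes with `P1` but never appears alongside `QUERY`, while `P4` and `P5` are unrelated and are left out.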

-----

### Show all proteins in related complexes with details added from Uniprot

Run the following cell to initiate the query that will collect all the proteins found in complexes with the query protein.

In [5]:
# run the query collecting all proteins it occurs with
pattern = fr'\b{search_term}\b' # Create a regex pattern with word boundaries
rows_with_term = rd_df[rd_df['Uniprot_ACCs'].str.contains(pattern, case=False, regex=True) | rd_df['genenames'].str.contains(pattern, case=False, regex=True)]
list_all_associated_acc_name_tuples = []
for row in rows_with_term.itertuples():
    #print(row)
    list_all_associated_acc_name_tuples.extend((item1, item2) for item1, item2 in zip(row.Uniprot_ACCs.split(), row.genenames.split()))
partners_df = pd.DataFrame(set(list_all_associated_acc_name_tuples), columns=['Uniprot_ACCs', 'genenames'])
import rich
rich.print(f"\n[bold black]THE {len(partners_df)} PROTEINS OCCURRING IN COMPLEXES WITH '{search_term}':[/bold black]\n")
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(partners_df.style.hide())

Uniprot_ACCs,genenames
O95398,RAPGEF3
Q9UNQ0,ABCG2
Q3ZAQ7,VMA21
Q8N5C1,CALHM5
Q15904,ATP6AP1
Q8NEY4,ATP6V1C2
Q13336,SLC14A1
A0AVI4,TMEM129
Q96Q45,TMEM237
Q16864,ATP6V1F


(Note: if you try to set the `search_term` to a non-valid identifier, you'll see as the output here `NameError: name 'rd_df' is not defined`. The easiest way to see whether what you are using is a valid identifier & present in the data is to double-click, in the file browser pane on the left side, on the file `hu.MAP3.0_complexes_wConfidenceScores_total15326_wGenenames_20240922InOrderMatched.csv` retrieved in the preparation step above. Then, with that `.CSV` file open, choose `Edit` > `Find` from the menu, enter your identifier in the box in the upper right, and hit enter. If you hit enter and nothing happens, that identifier is not valid or no data is available. There is a programmatic check for this in the next notebook in the series.)
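If you would rather check an identifier in code than with `Edit` > `Find`, a minimal sketch could look like the following. The function name `identifier_in_data` is hypothetical, and this is not the check used in the next notebook. It reuses the word-boundary matching from the query cell above, adding `re.escape()` so that any special characters in the identifier are matched literally. `toy_df` holds two rows taken from outputs in this notebook, and you would pass `rd_df` instead.

```python
import re
import pandas as pd

def identifier_in_data(df, term):
    """Return True if `term` matches a whole accession or gene name in any row."""
    pattern = fr"\b{re.escape(term)}\b"  # whole-word match, special characters escaped
    hits = (
        df["Uniprot_ACCs"].str.contains(pattern, case=False, regex=True)
        | df["genenames"].str.contains(pattern, case=False, regex=True)
    )
    return bool(hits.any())

# Two rows standing in for the full table
toy_df = pd.DataFrame({
    "Uniprot_ACCs": ["Q8TDJ6 Q9GZN7", "O14628 Q3SXZ3"],
    "genenames": ["DMXL2 ROGDI", "ZNF195 ZNF718"],
})

print(identifier_in_data(toy_df, "ROGDI"))   # gene name present
print(identifier_in_data(toy_df, "Q9GZN7"))  # accession present
print(identifier_in_data(toy_df, "ROGD"))    # partial name, not a whole-word match
```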

For now, the information is inclusive, meaning the search term protein is listed among them. I could easily change that.
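As a sketch of how the listing could be made exclusive instead, the row for the query protein can be filtered out by comparing both columns against the search term, ignoring case. The small `partners_df` below is a stand-in with the same two columns, and `partners_only_df` is a name I made up.

```python
import pandas as pd

search_term = "ROGDI"
# Stand-in for the real `partners_df`, with the same two columns
partners_df = pd.DataFrame({
    "Uniprot_ACCs": ["Q8TDJ6", "Q9GZN7", "Q9Y485"],
    "genenames": ["DMXL2", "ROGDI", "DMXL1"],
})

# Flag the row where either the accession or the gene name equals the query
is_query = (
    (partners_df["Uniprot_ACCs"].str.upper() == search_term.upper())
    | (partners_df["genenames"].str.upper() == search_term.upper())
)
partners_only_df = partners_df[~is_query].reset_index(drop=True)
print(partners_only_df)
```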

The convenience of Pandas makes that easy to store for later use as a tab-separated file that will work with Excel.  
Make sure to download it to your local machine.

In [6]:
import datetime
now = datetime.datetime.now()
partners_df.to_csv(f'{search_term}_complexes_partners_humap3{now.strftime("_%Y_%m_%d")}.tsv', sep='\t',index = False) 

Show the file made:

In [7]:
ls *_complexes_partners_*

ROGDI_complexes_partners_humap3_2024_11_05.tsv


Make sure to download that if it is useful because this session is temporary.
That file should open in Excel just fine. (I could actually produce Excel files using openpyxl but leave that for later expansion.)

### Show all complexes that protein is in with extra information

The extra annotation information will come from the UniProt KnowledgeBase, using the package unipressed. This starts to show how doing this in Python/Jupyter can add convenience.

Run the following cell, and then those below it in this section, to perform the query for this round.

In [8]:
# Next few cells will run the query collecting all complexes it occurs with and adding details
pattern = fr'\b{search_term}\b' # Create a regex pattern with word boundaries
rows_with_term_df = rd_df[rd_df['Uniprot_ACCs'].str.contains(pattern, case=False, regex=True) | rd_df['genenames'].str.contains(pattern, case=False, regex=True)].copy()
# make the dataframe have each row be a single protein
# to prepare for pandas `explode()`, first split the content of both columns into lists
rows_with_term_df['Uniprot_ACCs'] = rows_with_term_df['Uniprot_ACCs'].str.split()
rows_with_term_df['genenames'] = rows_with_term_df['genenames'].str.split()
# Now use explode to create a new row for each element in both columns
df_expanded = rows_with_term_df.explode(['Uniprot_ACCs', 'genenames']).copy()
# Reset the index 
df_expanded = df_expanded.reset_index(drop=True)
# Display the last few rows of the expanded dataframe
print(df_expanded.tail())
# Next add extra information from UniProt for each protein

          HuMAP3_ID  ComplexConfidence Uniprot_ACCs genenames
297  huMAP3_13872.1                  6       Q8NI08     NCOA7
298  huMAP3_13872.1                  6       Q8TDJ6     DMXL2
299  huMAP3_13872.1                  6       Q9GZN7     ROGDI
300  huMAP3_13872.1                  6       Q9Y485     DMXL1
301  huMAP3_13872.1                  6       Q9Y4E6      WDR7
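To see the split-then-explode step in isolation, here is a small sketch on two rows taken from the raw table shown earlier. The names `toy_df` and `long_df` are just for this illustration. Passing a list of columns to `explode()` needs pandas 1.3 or later, and the lists in each row must be the same length across the exploded columns.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "HuMAP3_ID": ["huMAP3_00002.1", "huMAP3_00003.1"],
    "Uniprot_ACCs": ["Q8N3D4 Q9Y3A4", "Q5T2D2 O00429"],
    "genenames": ["EHBP1L1 RRP7A", "TREML2 DNM1L"],
})

# Turn the space-separated strings into lists of equal length per row
toy_df["Uniprot_ACCs"] = toy_df["Uniprot_ACCs"].str.split()
toy_df["genenames"] = toy_df["genenames"].str.split()

# Explode both columns together so accession and gene name stay paired
long_df = toy_df.explode(["Uniprot_ACCs", "genenames"]).reset_index(drop=True)
print(long_df)
```

Each complex row becomes one row per protein, with the `HuMAP3_ID` repeated.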


In [9]:
# This cell makes lookup table with the extra information; it takes a while to run & so is in a cell on its own to save time during development
import time
from unipressed import UniprotkbClient
lookup_dict = {}
accs = set(df_expanded['Uniprot_ACCs'].to_list())
for acc in accs:
    uniprot_record = UniprotkbClient.fetch_one(acc)
    protein_name = uniprot_record['proteinDescription']['recommendedName']['fullName']['value']
    # collect the disease IDs from any DISEASE comments in the record
    disease_info_list = []
    if 'comments' in uniprot_record:
        for comment in uniprot_record['comments']:
            if comment['commentType'] == 'DISEASE':
                disease_info = comment.get('disease', {})
                disease_id = disease_info.get('diseaseId', 'Unknown disease ID')
                disease_info_list.append(disease_id)
    if not disease_info_list:
        disease_info_list = ['None reported']
    # keep only the first two diseases listed
    disease_info = '; '.join(disease_info_list[:2])
    lookup_dict[acc] = {'protein_name':protein_name, 'disease': disease_info}
    time.sleep(1.1) # don't slam the API

In [10]:
# Use collected information to enhance the dataframe
pn_dict = {k: v['protein_name'] for k, v in lookup_dict.items()}
disease_dict = {k: v['disease'] for k, v in lookup_dict.items()}
df_expanded['protein_name'] = df_expanded['Uniprot_ACCs'].map(pn_dict)
df_expanded['disease'] = df_expanded['Uniprot_ACCs'].map(disease_dict)
conf_val2text_dict = {
    1: 'Extremely High',
    2: 'Very High',
    3: 'High',
    4: 'Moderate High',
    5: 'Medium High',
    6: 'Medium'
}
# Use vectorized mapping to convert confidence values to text
df_expanded['ComplexConfidence'] = df_expanded['ComplexConfidence'].map(conf_val2text_dict)
base_uniprot_url = 'https://www.uniprot.org/uniprotkb/'
format_str = '{}{}/'
df_expanded = df_expanded.assign(Link=base_uniprot_url + df_expanded['Uniprot_ACCs'])
df_expanded

Unnamed: 0,HuMAP3_ID,ComplexConfidence,Uniprot_ACCs,genenames,protein_name,disease,Link
0,huMAP3_01501.1,Extremely High,O75348,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,huMAP3_01501.1,Extremely High,P15313,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progre...",https://www.uniprot.org/uniprotkb/P15313
2,huMAP3_01501.1,Extremely High,P21281,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congen...",https://www.uniprot.org/uniprotkb/P21281
3,huMAP3_01501.1,Extremely High,P36543,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
4,huMAP3_01501.1,Extremely High,Q8N8Y2,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
...,...,...,...,...,...,...,...
297,huMAP3_13872.1,Medium,Q8NI08,NCOA7,Nuclear receptor coactivator 7,None reported,https://www.uniprot.org/uniprotkb/Q8NI08
298,huMAP3_13872.1,Medium,Q8TDJ6,DMXL2,DmX-like protein 2,Polyendocrine-polyneuropathy syndrome; Deafnes...,https://www.uniprot.org/uniprotkb/Q8TDJ6
299,huMAP3_13872.1,Medium,Q9GZN7,ROGDI,Protein rogdi homolog,Kohlschuetter-Toenz syndrome,https://www.uniprot.org/uniprotkb/Q9GZN7
300,huMAP3_13872.1,Medium,Q9Y485,DMXL1,DmX-like protein 1,None reported,https://www.uniprot.org/uniprotkb/Q9Y485


**Note diseases are limited to the first two listed at UniProt.**  
The data will be displayed below arranged better so don't worry about studying this output yet. 
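If you want more (or fewer) than two diseases per protein, the extraction could be wrapped in a small function with an adjustable limit. This is only a sketch: `first_diseases` is a hypothetical helper, and `mock_record` is a made-up dictionary that mimics only the `comments` part of a UniProt record, so no network call is needed to try it.

```python
def first_diseases(record, limit=2):
    """Join the first `limit` disease IDs found in a UniProt-style record dict."""
    names = [
        comment.get("disease", {}).get("diseaseId", "Unknown disease ID")
        for comment in record.get("comments", [])
        if comment.get("commentType") == "DISEASE"
    ]
    if not names:
        return "None reported"
    return "; ".join(names[:limit])  # limit=None keeps every disease

# Made-up record with three DISEASE comments and one unrelated comment
mock_record = {
    "comments": [
        {"commentType": "FUNCTION"},
        {"commentType": "DISEASE", "disease": {"diseaseId": "Disease A"}},
        {"commentType": "DISEASE", "disease": {"diseaseId": "Disease B"}},
        {"commentType": "DISEASE", "disease": {"diseaseId": "Disease C"}},
    ]
}

print(first_diseases(mock_record))              # default: first two
print(first_diseases(mock_record, limit=None))  # all three
print(first_diseases({}))                       # record without comments
```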

Saving that as tab-separated data.

In [11]:
import datetime
now = datetime.datetime.now()
df_expanded.to_csv(f'{search_term}_complexesHUMAP3{now.strftime("_%Y_%m_%d")}.tsv', sep='\t',index = False) 

In [12]:
ls *_complexesHUMAP3*

ROGDI_complexesHUMAP3_2024_11_05.tsv


If you are doing these steps with settings other than the demonstration, you may wish to save that to your local machine as this session is temporary.

-------------

### Detailing all the complexes nicely

Now with that dataframe in hand, we can group them by the individual complex and display each nicely and completely.

In [13]:
import datetime
import rich
now = datetime.datetime.now()
grouped = df_expanded.groupby(['HuMAP3_ID','ComplexConfidence'])
# each group key is a (HuMAP3_ID, confidence text) tuple; unpacking it avoids shadowing Python's built-in `complex`
for (complex_id, confidence), grouped_df in grouped:
    rich.print(f"Complex: [bold black]{complex_id}[/bold black]\tConfidence: [bold black]{confidence}[/bold black]\tProteins: [bold black]{len(grouped_df)}[/bold black]")
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
        # show only the gene name, protein name, disease, and link columns
        display(grouped_df[grouped_df.columns[3:]].reset_index(drop=True))
    # save the complete rows for this complex as tab-separated data
    grouped_df.to_csv(f'{complex_id}_{search_term}_complx_CONF_{"_".join(confidence.split())}_{len(grouped_df)}_proteins{now.strftime("_%Y_%m_%d")}.tsv', sep='\t',index = False) 

Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
2,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
3,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
4,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
5,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
6,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
7,RAPGEF3,Rap guanine nucleotide exchange factor 3,None reported,https://www.uniprot.org/uniprotkb/O95398
8,ATP6V1F,V-type proton ATPase subunit F,None reported,https://www.uniprot.org/uniprotkb/Q16864
9,TBC1D24,TBC1 domain family member 24,Familial infantile myoclonic epilepsy; Developmental and epileptic encephalopathy 16,https://www.uniprot.org/uniprotkb/Q9ULP9


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
1,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
2,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
3,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
4,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
5,RAPGEF3,Rap guanine nucleotide exchange factor 3,None reported,https://www.uniprot.org/uniprotkb/O95398
6,TBC1D24,TBC1 domain family member 24,Familial infantile myoclonic epilepsy; Developmental and epileptic encephalopathy 16,https://www.uniprot.org/uniprotkb/Q9ULP9
7,NCOA7,Nuclear receptor coactivator 7,None reported,https://www.uniprot.org/uniprotkb/Q8NI08
8,ROGDI,Protein rogdi homolog,Kohlschuetter-Toenz syndrome,https://www.uniprot.org/uniprotkb/Q9GZN7
9,DMXL1,DmX-like protein 1,None reported,https://www.uniprot.org/uniprotkb/Q9Y485


Unnamed: 0,genenames,protein_name,disease,Link
0,DMXL2,DmX-like protein 2,"Polyendocrine-polyneuropathy syndrome; Deafness, autosomal dominant, 71",https://www.uniprot.org/uniprotkb/Q8TDJ6
1,ROGDI,Protein rogdi homolog,Kohlschuetter-Toenz syndrome,https://www.uniprot.org/uniprotkb/Q9GZN7
2,DMXL1,DmX-like protein 1,None reported,https://www.uniprot.org/uniprotkb/Q9Y485
3,WDR7,WD repeat-containing protein 7,None reported,https://www.uniprot.org/uniprotkb/Q9Y4E6


Unnamed: 0,genenames,protein_name,disease,Link
0,DMXL2,DmX-like protein 2,"Polyendocrine-polyneuropathy syndrome; Deafness, autosomal dominant, 71",https://www.uniprot.org/uniprotkb/Q8TDJ6
1,ROGDI,Protein rogdi homolog,Kohlschuetter-Toenz syndrome,https://www.uniprot.org/uniprotkb/Q9GZN7


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
2,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
3,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
4,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
5,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
6,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
7,ATP6V1G3,V-type proton ATPase subunit G 3,None reported,https://www.uniprot.org/uniprotkb/Q96LB4
8,RAPGEF3,Rap guanine nucleotide exchange factor 3,None reported,https://www.uniprot.org/uniprotkb/O95398
9,ATP6V1A,V-type proton ATPase catalytic subunit A,"Cutis laxa, autosomal recessive, 2D; Epileptic encephalopathy, infantile or early childhood, 3",https://www.uniprot.org/uniprotkb/P38606


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
2,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
3,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
4,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
5,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
6,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
7,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
8,ATP6V1G3,V-type proton ATPase subunit G 3,None reported,https://www.uniprot.org/uniprotkb/Q96LB4
9,RAPGEF3,Rap guanine nucleotide exchange factor 3,None reported,https://www.uniprot.org/uniprotkb/O95398


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
2,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
3,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
4,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
5,ATP6V0D1,V-type proton ATPase subunit d 1,None reported,https://www.uniprot.org/uniprotkb/P61421
6,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
7,ATP6V0A1,V-type proton ATPase 116 kDa subunit a 1,Developmental and epileptic encephalopathy 104; Neurodevelopmental disorder with epilepsy and brain atrophy,https://www.uniprot.org/uniprotkb/Q93050
8,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
9,ATP6V1G3,V-type proton ATPase subunit G 3,None reported,https://www.uniprot.org/uniprotkb/Q96LB4


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1G2,V-type proton ATPase subunit G 2,None reported,https://www.uniprot.org/uniprotkb/O95670
2,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
3,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
4,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
5,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
6,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
7,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
8,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
9,ATP6V1G3,V-type proton ATPase subunit G 3,None reported,https://www.uniprot.org/uniprotkb/Q96LB4


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1G2,V-type proton ATPase subunit G 2,None reported,https://www.uniprot.org/uniprotkb/O95670
2,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
3,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
4,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
5,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
6,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
7,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
8,ATP6V1G3,V-type proton ATPase subunit G 3,None reported,https://www.uniprot.org/uniprotkb/Q96LB4
9,RAPGEF3,Rap guanine nucleotide exchange factor 3,None reported,https://www.uniprot.org/uniprotkb/O95398


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1G2,V-type proton ATPase subunit G 2,None reported,https://www.uniprot.org/uniprotkb/O95670
2,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
3,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
4,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
5,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
6,ATP6V0D1,V-type proton ATPase subunit d 1,None reported,https://www.uniprot.org/uniprotkb/P61421
7,KIAA2013,Uncharacterized protein KIAA2013,None reported,https://www.uniprot.org/uniprotkb/Q8IYS2
8,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
9,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
1,ATP6V1G2,V-type proton ATPase subunit G 2,None reported,https://www.uniprot.org/uniprotkb/O95670
2,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
3,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
4,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
5,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
6,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
7,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
8,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05
9,ATP6V1G3,V-type proton ATPase subunit G 3,None reported,https://www.uniprot.org/uniprotkb/Q96LB4


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V0E1,V-type proton ATPase subunit e 1,None reported,https://www.uniprot.org/uniprotkb/O15342
1,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
2,ATP6V1G2,V-type proton ATPase subunit G 2,None reported,https://www.uniprot.org/uniprotkb/O95670
3,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
4,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
5,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
6,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
7,ATP6V0D1,V-type proton ATPase subunit d 1,None reported,https://www.uniprot.org/uniprotkb/P61421
8,KIAA2013,Uncharacterized protein KIAA2013,None reported,https://www.uniprot.org/uniprotkb/Q8IYS2
9,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V0E1,V-type proton ATPase subunit e 1,None reported,https://www.uniprot.org/uniprotkb/O15342
1,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
2,ATP6V1G2,V-type proton ATPase subunit G 2,None reported,https://www.uniprot.org/uniprotkb/O95670
3,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
4,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
5,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
6,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
7,ATP6V0D1,V-type proton ATPase subunit d 1,None reported,https://www.uniprot.org/uniprotkb/P61421
8,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
9,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4


Unnamed: 0,genenames,protein_name,disease,Link
0,ATP6V0E1,V-type proton ATPase subunit e 1,None reported,https://www.uniprot.org/uniprotkb/O15342
1,ATP6V1G1,V-type proton ATPase subunit G 1,None reported,https://www.uniprot.org/uniprotkb/O75348
2,ATP6V1G2,V-type proton ATPase subunit G 2,None reported,https://www.uniprot.org/uniprotkb/O95670
3,ATP6V1B1,"V-type proton ATPase subunit B, kidney isoform","Renal tubular acidosis, distal, 2, with progressive sensorineural hearing loss",https://www.uniprot.org/uniprotkb/P15313
4,ATP6V1B2,"V-type proton ATPase subunit B, brain isoform","Zimmermann-Laband syndrome 2; Deafness, congenital, with onychodystrophy, autosomal dominant",https://www.uniprot.org/uniprotkb/P21281
5,ATP6V1C1,V-type proton ATPase subunit C 1,None reported,https://www.uniprot.org/uniprotkb/P21283
6,ATP6V1E1,V-type proton ATPase subunit E 1,"Cutis laxa, autosomal recessive, 2C",https://www.uniprot.org/uniprotkb/P36543
7,ATP6V0D2,V-type proton ATPase subunit d 2,None reported,https://www.uniprot.org/uniprotkb/Q8N8Y2
8,ATP6V1C2,V-type proton ATPase subunit C 2,None reported,https://www.uniprot.org/uniprotkb/Q8NEY4
9,ATP6V1E2,V-type proton ATPase subunit E 2,None reported,https://www.uniprot.org/uniprotkb/Q96A05


**Keep in mind the disease entries are limited to the first two listed at UniProt.** 

These have been saved as tab-separated data. You may wish to download them, although the same information is already present in the previously saved tab-separated data.

In [14]:
ls *complx_CONF_*

huMAP3_01501.1_ROGDI_complex_CONF_Extremely_High_15_proteins_2024_11_05.tsv
huMAP3_01899.1_ROGDI_complex_CONF_Extremely_High_11_proteins_2024_11_05.tsv
huMAP3_03469.1_ROGDI_complex_CONF_Extremely_High_4_proteins_2024_11_05.tsv
huMAP3_05952.1_ROGDI_complex_CONF_Very_High_2_proteins_2024_11_05.tsv
huMAP3_07099.1_ROGDI_complex_CONF_Very_High_18_proteins_2024_11_05.tsv
huMAP3_07329.1_ROGDI_complex_CONF_Very_High_20_proteins_2024_11_05.tsv
huMAP3_08381.1_ROGDI_complex_CONF_High_20_proteins_2024_11_05.tsv
huMAP3_08678.1_ROGDI_complex_CONF_High_22_proteins_2024_11_05.tsv
huMAP3_09242.1_ROGDI_complex_CONF_Moderate_High_20_proteins_2024_11_05.tsv
huMAP3_09477.1_ROGDI_complex_CONF_Moderate_High_26_proteins_2024_11_05.tsv
huMAP3_09873.1_ROGDI_complex_CONF_Moderate_High_24_proteins_2024_11_05.tsv
huMAP3_10511.1_ROGDI_complex_CONF_Medium_High_35_proteins_2024_11_05.tsv
huMAP3_12042.1_ROGDI_complex_CONF_Medium_56_proteins_2024_11_05.tsv
huMAP3_13872.1_ROGDI_complex_CONF_Medium_29_proteins_2024_11_05
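
The file names in the listing encode the complex identifier, the query, the confidence label, the number of proteins, and the date. As a minimal sketch, assuming the naming pattern seen in the listing holds for all the saved files, a regular expression can pull those fields out for sorting or filtering. The `parse_complex_filename` helper and the pattern itself are illustrative and are not part of this notebook's own code.

```python
import re

# Pattern inferred from the file listing (an assumption, not an official spec):
# huMAP3_<id>_<query>_complex_CONF_<confidence>_<n>_proteins_<YYYY_MM_DD>.tsv
FILENAME_PATTERN = re.compile(
    r"^(?P<complex_id>huMAP3_[\d.]+)_(?P<query>.+?)_complex_CONF_"
    r"(?P<confidence>.+?)_(?P<n_proteins>\d+)_proteins_"
    r"(?P<date>\d{4}_\d{2}_\d{2})\.tsv$"
)

def parse_complex_filename(filename):
    """Hypothetical helper: return the fields encoded in a file name, or None."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["n_proteins"] = int(fields["n_proteins"])
    fields["confidence"] = fields["confidence"].replace("_", " ")
    return fields

example = "huMAP3_01501.1_ROGDI_complex_CONF_Extremely_High_15_proteins_2024_11_05.tsv"
parsed = parse_complex_filename(example)
print(parsed)
```

A name that does not fit the assumed pattern returns `None`, so such files can be skipped rather than mis-parsed.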

However, you might be admiring the output above for each complex and wish you didn't need Jupyter to view it, or you might want to share it with someone or print it out separately from this notebook.  
Running the next cell makes an HTML file that makes that easier:

In [15]:
def getTableHTML(df):
    """
    From https://stackoverflow.com/a/49687866/2007153
    
    Get a Jupyter like html of pandas dataframe with header underline (except index)
    """
    styles = [
        #table properties
        dict(selector=" ", 
             props=[("margin","0"),
                    ("font-family",'"Helvetica", "Arial", sans-serif'),
                    ("border-collapse", "collapse"),
                    ("border","none")]),
        #background shading
        dict(selector="tbody tr:nth-child(even)",
             props=[("background-color", "#f4f4f4")]), # TO SHOW IT IS BEING USED AND NOT NORMAL PANDAS COLORING, change this from `#f4f4f4` to `#fee` to add a reddish tinge
        dict(selector="tbody tr:nth-child(odd)",
             props=[("background-color", "#fff")]),  
        #cell spacing
        dict(selector="td", 
             props=[("padding", ".5em")]),
        #header cell properties (excluding index)
        dict(selector="thead th:not(:first-child)", 
             props=[("font-size", "80%"),
                    ("text-align", "center"),
                    ("border-bottom", "2px solid #666"),
                    ("padding", ".5em")]),
        #index header cell properties (no border)
        dict(selector="thead th:first-child", 
             props=[("font-size", "80%"),
                    ("text-align", "center"),
                    ("padding", ".5em")]),
    ]
    return (df.style.set_table_styles(styles)).to_html()

collected_html = ""
# `complex_keys` is the group key tuple; its first two items are shown as the complex and its confidence.
# (Named `complex_keys` rather than `complex` to avoid shadowing the Python built-in `complex`.)
for complex_keys, grouped_df in grouped:
    collected_html += (f"Complex: <strong>{complex_keys[0]}</strong>&emsp;Confidence: <strong>{complex_keys[1]}</strong>&emsp;Proteins: <strong>{len(grouped_df)}</strong><br>")
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
        #display(grouped_df [grouped_df .columns[3:]].reset_index(drop=True))
        collected_html += getTableHTML(grouped_df)
        collected_html += "<br><br><br>" # `<br>` is the valid HTML line break; `</br>` is not
collected_html_fn = f'{search_term}_individ_complexes_details{now.strftime("_%Y_%m_%d")}.html'
%store collected_html >{collected_html_fn}

Writing 'collected_html' (str) to file 'ROGDI_individ_complexes_details_2024_11_05.html'.
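
The `%store collected_html >{collected_html_fn}` line is an IPython magic, so it only works inside Jupyter/IPython. As a minimal sketch for use in a plain Python script, the same string could be written with the standard library's `pathlib`. The placeholder values below stand in for the notebook's `collected_html` and `collected_html_fn`, and the temporary directory is only there to keep the example self-contained.

```python
import tempfile
from pathlib import Path

# Stand-ins for the notebook's variables (assumed values, for illustration only).
collected_html = "Complex: <strong>example</strong><br><table><tr><td>demo</td></tr></table>"
collected_html_fn = "EXAMPLE_individ_complexes_details_2024_11_05.html"

# Write into a temporary directory so the sketch does not touch your working files.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / collected_html_fn
out_path.write_text(collected_html, encoding="utf-8")

# Read the file back to confirm the round trip.
round_trip = out_path.read_text(encoding="utf-8")
print(f"Wrote {len(round_trip)} characters to {out_path.name}")
```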


Download that file so you can then share that HTML file and tell the recipient to open it with a browser. Or if you prefer a PDF, after downloading it to your own machine, open it in your browser and print to PDF. (**TIP**: when using `'File'` > `'Print'` for printing to a PDF on a Mac, toggle on '`Background graphics`' to get the nice shading you see in the HTML file; to find '`Background graphics`', in the Print Dialog box, click the drop-down for '`More Settings`' to reveal at the bottom '`Options`'  with '`Background graphics`' to the right of it.)

-----

### 'Adjacent-complex' proteins?

How can we scale up and go beyond just which complexes your favorite protein occurs in? This is where the power of having this data in conjunction with Python starts to show.

We have a list of proteins our favorite protein is known to occur with in the complexes, i.e. 'complexed proteins'. What if we then go out another layer, collect all the complexes those 'complexed proteins' are in, and highlight any new proteins represented? This would build a list of proteins that share a complex with one of the 'complexed proteins' but aren't in any of the query protein's complexes. It would also build up a network of interactions our favorite protein may be involved in influencing.

So how do we get a list of 'Adjacent-complex' proteins observed in the complexes in the data? This section is meant to show that, with Python, such a thing is easy and quick.
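
To make the idea concrete before working with the real data, here is a toy sketch using plain Python sets. The complex names and single-letter proteins are made up for illustration; only the logic mirrors what the cells in this section do with the hu.MAP 3.0 data.

```python
# Toy data: made-up complexes mapped to made-up member proteins.
toy_complexes = {
    "cplx1": {"Q", "A", "B"},   # contains the query protein Q
    "cplx2": {"Q", "C"},        # contains the query protein Q
    "cplx3": {"A", "X", "Y"},   # shares A with a query complex
    "cplx4": {"C", "Z"},        # shares C with a query complex
    "cplx5": {"M", "N"},        # unrelated to the query
}
query = "Q"

# Layer 1: proteins seen in any complex with the query ('complexed proteins').
complexed = set()
for members in toy_complexes.values():
    if query in members:
        complexed |= members
complexed.discard(query)

# Layer 2: proteins sharing a complex with a 'complexed protein',
# excluding the query itself and the layer-1 proteins.
adjacent = set()
for members in toy_complexes.values():
    if members & complexed:
        adjacent |= members
adjacent -= complexed | {query}

print(sorted(complexed))  # ['A', 'B', 'C']
print(sorted(adjacent))   # ['X', 'Y', 'Z']
```

Proteins `M` and `N` never appear because their complex has no member in common with any of the query's complexes.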

First, we define two groups. One holds the proteins we want to skip looking into further, with the query protein from above at the top of the list. (You can further modify the `skip_proteins` list as you see fit.) The other collects all the remaining proteins, which serve as the starting points for finding the proteins that interact with them. Since we left the query protein in our list above, we need to be sure to filter it out; that is where the first list also comes in.

In [16]:
skip_proteins = [search_term] # you can put any other gene names or accessions after
# that in quotes to also skip those, for example: `skip_proteins = [search_term, 'ATP6V0A4']`
# The idea being to leave out any you expect, to make it easier to clue in on any new ones.

In [17]:
ptuples = [(row['Uniprot_ACCs'], row['genenames']) for index, row in df_expanded.iterrows()]
unique_ptuples = list(set(ptuples))
# skip any in the `skip_proteins` list (filter the de-duplicated list, not the original `ptuples`)
unique_ptuples = [ptuple for ptuple in unique_ptuples if all(element not in skip_proteins for element in ptuple)]

Now, with those two lists in hand, go through each protein we observed complexed with our query protein, collect the proteins that share a complex with it, and see if any new ones come up.

In [18]:
# run the query on each collecting all proteins it occurs with and removing any already in skip_proteins or those in the complexes directly with the query protein
rd_df = pd.read_pickle('raw_complexes_pickled_df.pkl') # make sure in memory
adjacent_proteins_dfs = []
for current_acc, pn in unique_ptuples:
    pattern = fr'\b{current_acc}\b' # Create a regex pattern with word boundaries
    rows_with_term_df = rd_df[rd_df['Uniprot_ACCs'].str.contains(pattern, case=False, regex=True)].copy()
    # explode these to be entries per row
    # to prepare to use pandas `explode()` to do that, first make the content in be lists
    rows_with_term_df['Uniprot_ACCs'] = rows_with_term_df['Uniprot_ACCs'].str.split()
    rows_with_term_df['genenames'] = rows_with_term_df['genenames'].str.split()
    # Now use explode to create a new row for each element in both columns
    df_expanded2 = rows_with_term_df.explode(['Uniprot_ACCs', 'genenames']).copy()
    # Reset the index 
    df_expanded2 = df_expanded2.reset_index(drop=True)
    #remove those that are in `skip_proteins` list or already in the ptuples
    accs_in_ptuples = [i[0] for i in unique_ptuples]
    new_partners_df = df_expanded2[~df_expanded2['Uniprot_ACCs'].isin(accs_in_ptuples)]
    new_partners_df = new_partners_df[~new_partners_df['Uniprot_ACCs'].isin(skip_proteins)]
    new_partners_df = new_partners_df[~new_partners_df['genenames'].isin(skip_proteins)]
    adjacent_proteins_dfs.append(new_partners_df)
if adjacent_proteins_dfs:
    final_new_partners_df = pd.concat(adjacent_proteins_dfs, ignore_index=True)
else:
    rich.print("Nothing 'adjacent' identified.")
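
Two details in the cell above are worth seeing in isolation: the `\b` word boundaries keep an accession from matching inside a longer string, and calling `explode()` on both columns at once keeps each accession paired with its gene name. Here is a minimal sketch on a made-up two-row DataFrame. The accessions and gene names are invented for illustration, and exploding several columns at once assumes pandas 1.3 or later.

```python
import pandas as pd

# Made-up rows in the same space-separated layout as the raw complexes data.
toy_df = pd.DataFrame({
    "Uniprot_ACCs": ["P12345 Q99999", "P123456 O11111"],
    "genenames": ["GENEA GENEB", "GENEC GENED"],
})

acc = "P12345"
# Without word boundaries, 'P12345' also matches inside 'P123456'.
loose_hits = toy_df[toy_df["Uniprot_ACCs"].str.contains(acc, regex=False)]
# With word boundaries, only the exact accession matches.
exact_hits = toy_df[toy_df["Uniprot_ACCs"].str.contains(fr"\b{acc}\b", regex=True)].copy()

# Split both columns into lists, then explode them together so pairs stay aligned.
exact_hits["Uniprot_ACCs"] = exact_hits["Uniprot_ACCs"].str.split()
exact_hits["genenames"] = exact_hits["genenames"].str.split()
exploded = exact_hits.explode(["Uniprot_ACCs", "genenames"]).reset_index(drop=True)

print(len(loose_hits), len(exact_hits))
print(exploded)
```

The loose match picks up both rows while the bounded match picks up only the first, which is why the pattern with `\b` is the safer choice when accessions are stored as space-separated text.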

In [19]:
import rich # imported before the `try` so it is available in the `except` branch too
try:
    list_all_associated_adj_name_tuples = []
    for row in final_new_partners_df.itertuples():
        #print(row)
        list_all_associated_adj_name_tuples.extend((item1, item2) for item1, item2 in zip(row.Uniprot_ACCs.split(), row.genenames.split()))
    adj_df = pd.DataFrame(set(list_all_associated_adj_name_tuples), columns=['Uniprot_ACCs', 'genenames'])
    rich.print(f"\n[bold black]THE {len(adj_df)} PROTEINS THAT AREN'T IN '{search_term}' COMPLEXES THAT ARE\nOBSERVED IN OTHER COMPLEXES WITH PROTEINS FOUND IN '{search_term}' COMPLEXES:[/bold black]\n")
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(adj_df.style.hide())
except NameError:
    # `final_new_partners_df` is only defined when the previous cell found something
    rich.print("Likely, nothing 'adjacent' identified; see above cell.")

Uniprot_ACCs,genenames
P00966,ASS1
P60468,SEC61B
P60903,S100A10
Q8WWG9,KCNE4
P16050,ALOX15
Q9BUB7,TMEM70
Q9UJZ1,STOML2
O15155,BET1
Q8WUM4,PDCD6IP
O60725,ICMT


--------

Enjoy!

While you could edit the search term above and start analyzing the data for a protein of interest, it is unlikely you are interested in just one protein. Therefore, it is suggested that you first continue on to the next notebook in this series, because it makes clear how to do this programmatically for more than one protein and bundles the collected data for easy download; see 'Available Notebooks' listed [here](../index.ipynb). Or click [here to open the next notebook in this series](Making_many_hu.MAP3_reports_easily_using_Snakemake.ipynb).

See my [humap3-binder repo](https://github.com/fomightez/humap3-binder) and [humap3-utilities](https://github.com/fomightez/structurework/humap3-utilities) for related information & resources for this notebook.