## (Technical) Comprehensive analysis of protein identifiers in hu.MAP 2.0 complexes that disappear in hu.MAP 3.0 complexes

Note that this notebook is labeled as **technical** because it will not be of interest to most users. If you want to look at differences for the proteins you are interested in, you'll most likely want to see the following instead (or at least initially):

- [Highlight differences between hu.MAP 2.0 and hu.MAP 3.0 data](Highlight_differences_between_hu.MAP2_and_hu.MAP3_data.ipynb)
- [Using snakemake to highlight differences between hu.MAP 2.0 and hu.MAP 3.0 data for multiple identifiers](Using_snakemake_to_highlight_differences_between_hu.MAP2_and_hu.MAP3_data_for_multiple_identifiers.ipynb)

Only if, after that, you are trying to understand what is going on with the differences through a dataset-wide examination would you be interested in this notebook, as a way to start collecting information in a more comprehensive way.  


**Impetus of this technical notebook:**

In the course of examining the hu.MAP 3.0 complexes for a complex of interest, we noted that one of the proteins we knew to be important to the complex was lost in the hu.MAP 3.0 data, yet present in related complexes in the hu.MAP 2.0 data. To understand this better, we wanted a more global survey of proteins that went missing from related complexes in the hu.MAP 3.0 data. You may be in a similar situation with a protein you are interested in and want to explore comprehensively whether your case is a rare instance or common in the hu.MAP 3.0 data. If so, you may want to see what was done here and adapt it further to look at your proteins of interest in relation to others.

**Overview of this technical notebook:**

- First, it finds proteins that are 'ubiquitous' in hu.MAP 2.0 complexes yet go missing in version 3.0 data. To be 'ubiquitous', the number of associated complexes has to be more than one. That may be a strict criterion, but there was an abundance without applying that condition, and a substantial group was still identified even with it applied, so it is probably not too strict at this time.  
- Then it goes to another level and is more strict to identify the proteins that go missing entirely from version 3.0 data.
- The last parts of the notebook look more broadly & directly at identifiers gone entirely in version 3 without regard to related complexes or how many times the protein occurs in version 2 data.

The main script that supports this notebook carries out that first bullet item. It takes all the identifiers occurring in both v2 and v3 data and uses them as queries to see which 'companion' proteins that associate 'ubiquitously' with them in the complexes in the hu.MAP 2.0 data go missing in the hu.MAP 3.0 data for **related** complexes. Then, in an even stricter analysis, the script looks for which of those identifiers from hu.MAP 2.0 data are entirely missing in hu.MAP 3.0 data. Additional code in the notebook looks directly for identifiers from hu.MAP 2.0 data that are entirely missing in hu.MAP 3.0 data and examines how this relates to what the script looking at related complexes identified.
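
The 'companion' logic described above can be sketched on toy data. This is only one plausible reading of the description, not the code in the supporting script. The function name `find_disappearing_companions`, the dictionaries `v2_complexes` and `v3_complexes`, and the identifiers `P1` to `P5` are all made up for illustration, and 'related' is assumed here to mean "any version 3 complex that contains the query identifier".

```python
from collections import Counter

def find_disappearing_companions(query, v2_complexes, v3_complexes):
    """Return companions seen with `query` in more than one v2 complex
    but in none of the v3 complexes that contain `query`."""
    v2_with_query = [m for m in v2_complexes.values() if query in m]
    v3_with_query = [m for m in v3_complexes.values() if query in m]
    # Count in how many of the query's v2 complexes each companion occurs.
    companion_counts = Counter(
        acc for members in v2_with_query for acc in members if acc != query
    )
    # Assumed meaning of 'ubiquitous': seen with the query in more than one complex.
    ubiquitous = {acc for acc, n in companion_counts.items() if n > 1}
    # Everything that still accompanies the query somewhere in v3.
    still_with_query_in_v3 = set().union(*v3_with_query)
    return sorted(ubiquitous - still_with_query_in_v3)

# Toy data (made-up identifiers, not real hu.MAP content).
v2_complexes = {
    "c1": {"P1", "P2", "P3"},
    "c2": {"P1", "P2", "P4"},
    "c3": {"P1", "P3"},
}
v3_complexes = {
    "d1": {"P1", "P3"},
    "d2": {"P2", "P5"},
}

disappearing = find_disappearing_companions("P1", v2_complexes, v3_complexes)
print(disappearing)  # ['P2']

# The stricter question: is the companion gone from v3 data entirely?
all_v3_accs = set().union(*v3_complexes.values())
gone_entirely = [acc for acc in disappearing if acc not in all_v3_accs]
print(gone_entirely)  # []
```

In this toy case `P2` disappears from the complexes related to `P1` but still occurs elsewhere in the version 3 data, which is the distinction the two levels of strictness are meant to capture.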

Many of the issues raised here are not fully explored at this time. (That is left to the reader, as there are a lot of directions to go in carrying this further. The analysis revealed that others may be unfortunate as well, and that although much of the change in version 3 represents improvement, the processing may have been too strict. It at least highlights that there is room for improvement with version 4.0.)  
The idea is that the collections generated will serve as a foundation for extending this Jupyter `.ipynb` to suit your curiosities in this more panoramic view of the version 3.0 dataset.  

------

### Preparation: to save time get already-generated dictionaries 

To save time running the main script, get already-generated dictionaries containing details on the involved collection of identifiers so that it isn't necessary to iterate over the identifiers again to collect all that data from scratch.

In [1]:
from shutil import copyfile
files_to_copy_to_here = [
    'disappearing_identified_with_dict.pkl',
    'disappeared_total_complexes_dict.pkl',
    'v2_complexes_with_disappearing_dict.pkl',
    'gone_entirely_in_3_identified_via_dict.pkl',
    'gone_entirely_in_3_total_complexes_dict.pkl',
    'v2_complexes_dict_for_gone_in_3.pkl'
]
#storage_path = "additional_nbs/stored_additional_data/used_for_checking_shifting_n_disappearing/" # if running from root
storage_path = "stored_additional_data/used_for_checking_shifting_n_disappearing/"
for fn in files_to_copy_to_here:
    copyfile(f"{storage_path}{fn}", f"./{fn}")

(Note: Delete that step to start from square one and generate the dictionaries again. Or, if you already ran that step and now need to run the skipped steps again because the script `identify_id_going_missing_from_related_complexes_and_going_entirely_inV3_Data.py` has been edited, delete the file `disappearing_identified_with_dict.pkl` in the current working directory. That is enough to trigger a complete run, and new files with pickled dictionaries will get made.)
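
For orientation, here is a minimal sketch of that load-if-present, otherwise-rebuild pattern. It assumes the dictionaries are stored with Python's `pickle` module (the `.pkl` extension suggests this, but the script's actual logic is not shown here). The names `load_or_build` and `expensive_build`, and the demo file name, are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(pkl_path, build_func):
    """Load a pickled object if the file exists; otherwise build and save it."""
    pkl_path = Path(pkl_path)
    if pkl_path.exists():
        with pkl_path.open("rb") as handle:
            return pickle.load(handle)
    result = build_func()
    with pkl_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result

# Demonstration in a temporary directory so nothing real is overwritten.
build_calls = []

def expensive_build():
    # Stand-in for the slow iteration over all identifiers.
    build_calls.append(1)
    return {"Q9NXF1": ["Q9Y4W2", "Q5SY16", "Q9H4L4", "Q9BV38"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_dict.pkl"
    first = load_or_build(demo_path, expensive_build)   # builds and pickles
    second = load_or_build(demo_path, expensive_build)  # loads from the pickle

print(len(build_calls))  # 1
```

Deleting the pickle file is what forces the slow path again, which matches the behaviour the note above describes.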

In [2]:
%run -i identify_id_going_missing_from_related_complexes_and_going_entirely_inV3_Data.py







In [3]:
len(accessions_shared_by_both_2and3)

9574

To put this number of identifiers shared between versions 2 and 3 of the data in better context, there are 9963 unique UniProt accession identifiers in hu.MAP 2.0 data and 13769 in hu.MAP 3.0 data. (Those numbers come from the 'Standardizing_identifier_order_...' notebooks under the 'standardizing_initial_data' directory.) That is a substantial increase.  
We need to bear in mind that this entire notebook deals only with those that also occur in hu.MAP 2.0 data and not those gained beyond that in hu.MAP 3.0 data.

With that script run, we can assign some variables to make it easier to handle some of what was generated later. For example, we need to make a Python `set` object out of the list of keys in order to use Python's 'set math' later. The next cell does that for two of the main groups of items the script made. (We won't actually use the lists the next cell makes again, but I wanted to make clear what is being used in the conversion to a `set` object. Of further note, nothing is dropped in the conversion. Often `set()` is used on lists to filter the items down to unique occurrences. That is not what is happening here, which can be shown by running `assert len(disappearing_identified_with_dict.keys()) == len(set(disappearing_identified_with_dict.keys()))` and `assert len(gone_entirely_in_3_identified_via_dict.keys()) == len(set(gone_entirely_in_3_identified_via_dict.keys()))` to verify that no filtering takes place, just typecasting.)

In [4]:
assert len(disappearing_identified_with_dict.keys()) == len(set(disappearing_identified_with_dict.keys()))
assert len(gone_entirely_in_3_identified_via_dict.keys()) == len(set(gone_entirely_in_3_identified_via_dict.keys()))

In [5]:
# .keys() returns a view, so wrap it in list() to get an actual list
ubiquitous_disappearing_list = list(disappearing_identified_with_dict.keys())
ubiquitous_disappearing_set = set(disappearing_identified_with_dict.keys())

gone_entirely_in_3_list = list(gone_entirely_in_3_identified_via_dict.keys())
gone_entirely_in_3_set = set(gone_entirely_in_3_identified_via_dict.keys())

With the script run and some preliminary preparations made for building on the details the script collected, we are finally ready to look at something interesting: the total number of 'ubiquitous', 'companion' proteins that disappear from related complexes when going from hu.MAP 2.0 data to hu.MAP 3.0 data.

In [6]:
len(disappearing_identified_with_dict.keys())

2055

That is 2055 'ubiquitous' complex members that seem to go missing from what are expected to be 'related' complexes. (This doesn't mean they go missing entirely though!)

This seems substantial and probably gives some initial indication of the shifting or refinement that occurs between versions 2 and 3 of the hu.MAP data. It may also be higher than expected because, if shifting or refinement is taking place, the concept of what is 'related' may need to shift too. While that may be true, this programmatic way of highlighting cases where proteins in version 2 are dramatically altered isn't meant to address that bigger notion. We just want to gauge what has changed and collect the associated details.

You can look at individual examples like so:

In [7]:
disappearing_identified_with_dict['Q9NXF1'] 

['Q9Y4W2', 'Q5SY16', 'Q9H4L4', 'Q9BV38']

In [8]:
disappeared_total_complexes_dict['Q9NXF1']

4

Looking at the list of identifiers that helped highlight `Q9NXF1`, we see `['Q9Y4W2', 'Q5SY16', 'Q9H4L4', 'Q9BV38']`.
Each of those is known to be related to a known complex, so it is interesting that `Q9NXF1`, which is also known to be in that complex, occurs together with them in version 2 data but is lost in version 3 data. The related identifiers are all in version 3 data, but the complexes they occur in there lack `Q9NXF1`.  

So it seems obvious that someone interested in `Q9NXF1` and the related "Five Friends of Methylated CHTOP (5FMC) complex" would wonder why it is gone in version 3 data and question an 'improvement' that comes at the cost of losing what look to be good assignments.

I suggest, though, that you not bother doing that type of analysis for what you are interested in just yet. While looking at individual identifiers this way may be sufficient for your needs, tabular data will be made from this and more below, and the derivatives of that will be better for exploring.

We'll save those Python dictionaries as text so that they can be perused if desired.

In [9]:
%store disappearing_identified_with_dict >disappearing_identified_with_dict.txt
%store disappeared_total_complexes_dict >disappeared_total_complexes_dict.txt

Writing 'disappearing_identified_with_dict' (dict) to file 'disappearing_identified_with_dict.txt'.
Writing 'disappeared_total_complexes_dict' (dict) to file 'disappeared_total_complexes_dict.txt'.


Before going on to look more at what the script run above identified from among the complexes as 'ubiquitous' yet gone entirely in version 3 data, we'll display some of the data already collected in a more combined and manageable form. The code in the next cell makes a dataframe composed of what has already been gleaned. This will probably be the form you want to peruse things in, if you are so inclined.

In [10]:
disappearing_identified_with_STRversion_dict = {k:(', ').join(v) for k,v in disappearing_identified_with_dict.items()}
df_related_going_disappearing = pd.DataFrame.from_dict(
    disappearing_identified_with_STRversion_dict,orient='index').reset_index()
df_related_going_disappearing.columns = ['missing_acc', 'associated_identifiers_in_expected_related_huMAP3_complexes']
# add the column with the number of v2 complexes seen for each identifier
df_related_going_disappearing.insert(1, 'number_of_complexes', df_related_going_disappearing['missing_acc'].map(disappeared_total_complexes_dict))
# add the column with the ids of the complexes seen 
v2_complexes_with_disappearing_STRversion_dict = {k:(', ').join(v).replace('HuMAP2_','') for k,v in v2_complexes_with_disappearing_dict.items()} # in interest of streamlining also remove `HuMAP2_` that occurs on all complex ids
df_related_going_disappearing.insert(2, 'assoc_complex_ids', df_related_going_disappearing['missing_acc'].map(v2_complexes_with_disappearing_STRversion_dict))
'''
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_related_going_disappearing)
'''
df_related_going_disappearing

|  | missing_acc | number_of_complexes | assoc_complex_ids | associated_identifiers_in_expected_related_huMAP3_complexes |
|---|---|---|---|---|
| 0 | Q13137 | 3 | 00806, 06259, 06729 | Q9NVV4 |
| 1 | Q9NXF1 | 4 | 00441, 03649, 06152, 06210 | Q9Y4W2, Q5SY16, Q9H4L4, Q9BV38 |
| 2 | Q8NBF6 | 2 | 02970, 05289 | Q6UXD5, Q9UBG0 |
| 3 | P80108 | 2 | 02970, 05289 | Q6UXD5, Q9UBG0 |
| 4 | Q96LR4 | 2 | 02970, 05289 | Q6UXD5, Q9UBG0 |
| ... | ... | ... | ... | ... |
| 2050 | Q9Y3B1 | 2 | 00879, 03806 | Q9Y255 |
| 2051 | A8K855 | 2 | 02902, 06401 | Q15506 |
| 2052 | Q8N6M3 | 2 | 01400, 01887 | Q09328 |
| 2053 | Q9H4B8 | 2 | 02957, 04336 | Q9BTN0 |


Running the next cell saves that dataframe as a CSV file, which you can open here or in Excel to look over the whole set of details.

In [11]:
df_related_going_disappearing.to_csv("df_related_going_disappearing.csv")

--------

The script that was run above also highlighted those 'ubiquitous' ones that were in related complexes in hu.MAP 2.0 data and completely absent in hu.MAP 3.0 data. Running the next cell will show you that is a small fraction of the 2055 above.

In [12]:
len(gone_entirely_in_3_identified_via_dict)

151

So the group is winnowed to 151 by applying the much stricter condition that the identifier cannot be anywhere in the hu.MAP 3.0 data at all.

This further emphasizes what was touched upon earlier about the shifting or refinement that occurs between versions 2 and 3 of the hu.MAP data. Only 151 of the 'ubiquitous' ones found in related complexes disappear entirely in the hu.MAP 3.0 data. The other 1904 of the 2055 'ubiquitous' ones found in related complexes at least show up somewhere in the hu.MAP 3.0 data.
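
The arithmetic behind those figures is simple enough to check directly. The sketch below uses only the two counts reported in this notebook, plus made-up toy sets (`ACC_A` and so on are placeholders, not real accessions) to show the equivalent set operation.

```python
# Counts reported by the cells above.
total_ubiquitous_disappearing = 2055  # len(disappearing_identified_with_dict)
gone_entirely = 151                   # len(gone_entirely_in_3_identified_via_dict)

still_somewhere_in_v3 = total_ubiquitous_disappearing - gone_entirely
fraction_gone = gone_entirely / total_ubiquitous_disappearing
print(still_somewhere_in_v3)   # 1904
print(f"{fraction_gone:.1%}")  # 7.3%

# The same idea with set math, on toy placeholder identifiers.
toy_ubiquitous = {"ACC_A", "ACC_B", "ACC_C"}
toy_gone_entirely = {"ACC_B"}
toy_still_in_v3 = toy_ubiquitous - toy_gone_entirely
print(sorted(toy_still_in_v3))  # ['ACC_A', 'ACC_C']
```

In the notebook itself, the analogous expression would be `ubiquitous_disappearing_set - gone_entirely_in_3_set`, assuming (as the text states) that the second group is drawn from the first.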

Before we go on to look at how what has been identified as 'ubiquitous' yet missing from related complexes compares with everything gone from version 3 data, let's repeat the steps done with the larger group so these 151 can be perused more easily and/or in combined forms.

In [13]:
%store gone_entirely_in_3_identified_via_dict >gone_entirely_in_3_identified_via_dict.txt
%store gone_entirely_in_3_total_complexes_dict >gone_entirely_in_3_total_complexes_dict.txt

Writing 'gone_entirely_in_3_identified_via_dict' (dict) to file 'gone_entirely_in_3_identified_via_dict.txt'.
Writing 'gone_entirely_in_3_total_complexes_dict' (dict) to file 'gone_entirely_in_3_total_complexes_dict.txt'.


In [14]:
gone_entirely_in_3_identified_via_STRversion_dict = {k:(', ').join(v) for k,v in gone_entirely_in_3_identified_via_dict.items()}
df_related_gone_entirely_in_3 = pd.DataFrame.from_dict(
    gone_entirely_in_3_identified_via_STRversion_dict,orient='index').reset_index()
df_related_gone_entirely_in_3.columns = ['missing_acc', 'associated_identifiers_in_expected_related_huMAP3_complexes']
# add the column with the number of v2 complexes seen for each identifier
df_related_gone_entirely_in_3.insert(1, 'number_of_complexes', df_related_gone_entirely_in_3['missing_acc'].map(gone_entirely_in_3_total_complexes_dict))
# add the column with the ids of the complexes seen 
gone_v2_complexes_with_disappearing_STRversion_dict = {k:(', ').join(v).replace('HuMAP2_','') for k,v in v2_complexes_dict_for_gone_in_3.items()} # in interest of streamlining also remove `HuMAP2_` that occurs on all complex ids
df_related_gone_entirely_in_3.insert(2, 'assoc_complex_ids', df_related_gone_entirely_in_3['missing_acc'].map(gone_v2_complexes_with_disappearing_STRversion_dict))
'''
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_related_gone_entirely_in_3)
'''
df_related_gone_entirely_in_3

|  | missing_acc | number_of_complexes | assoc_complex_ids | associated_identifiers_in_expected_related_huMAP3_complexes |
|---|---|---|---|---|
| 0 | Q9NXF1 | 4 | 00441, 03649, 06152, 06210 | Q9Y4W2, Q5SY16, Q9H4L4, Q9BV38 |
| 1 | SPECIAL_HGNC15792 | 3 | 02624, 03560, 00611 | Q8TBG4, Q04446 |
| 2 | O75533 | 4 | 02124, 02351, 02806, 05865 | Q9BWJ5, Q7RTV0, Q13435, Q15393, Q9Y3B4, Q15427 |
| 3 | Q5JNZ5 | 3 | 01493, 02859, 05267 | Q02539 |
| 4 | O76021 | 2 | 00300, 06948 | Q5JTH9 |
| ... | ... | ... | ... | ... |
| 146 | Q6P996 | 3 | 01309, 01760, 02725 | Q9Y6A1 |
| 147 | O75396 | 2 | 00854, 02979 | Q8WV48 |
| 148 | Q9BVI4 | 4 | 00135, 01287, 02156, 04976 | P78316 |
| 149 | Q8N4H0 | 2 | 02522, 04359 | O95239 |


In [15]:
df_related_gone_entirely_in_3.to_csv("df_related_gone_entirely_in_3.csv")

Open that file to review it more easily.
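
Rather than scanning the file by eye, the CSV can also be read back and filtered for identifiers of interest. The sketch below is self-contained: it uses an in-memory stand-in for the saved file holding three rows copied from the table above, and `my_proteins` is a hypothetical list in which `X00000` is a made-up placeholder. With the real file, `io.StringIO(csv_text)` would be replaced by the file name.

```python
import io
import pandas as pd

# Stand-in for the saved CSV: three rows copied from the table above.
csv_text = ''',missing_acc,number_of_complexes,assoc_complex_ids,associated_identifiers_in_expected_related_huMAP3_complexes
0,Q9NXF1,4,"00441, 03649, 06152, 06210","Q9Y4W2, Q5SY16, Q9H4L4, Q9BV38"
2,O75533,4,"02124, 02351, 02806, 05865","Q9BWJ5, Q7RTV0, Q13435, Q15393, Q9Y3B4, Q15427"
4,O76021,2,"00300, 06948",Q5JTH9
'''

# index_col=0 restores the index that to_csv() wrote as the first column;
# reading the complex ids as text keeps their leading zeros.
df = pd.read_csv(io.StringIO(csv_text), index_col=0,
                 dtype={"assoc_complex_ids": str})

my_proteins = ["O75533", "X00000"]  # hypothetical identifiers of interest
hits = df[df["missing_acc"].isin(my_proteins)]
not_listed = sorted(set(my_proteins) - set(df["missing_acc"]))

print(hits[["missing_acc", "number_of_complexes", "assoc_complex_ids"]])
print(not_listed)  # ['X00000']
```

An identifier that ends up in `not_listed` is simply absent from this particular table; it says nothing by itself about whether that identifier occurs in version 3 data.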

The analysis with the script limited things to proteins in complexes where you can see related, 'ubiquitous' proteins that show up in version 3 data without the proteins that version 2 data revealed as associated, whether or not those proteins showed up at all in version 3. But what if we look at things seen in version 2 data but gone in version 3 without regard to whether there are proteins that look associated in version 2? In a practical sense, that means proteins that only need to show up once in version 2 but are gone entirely in version 3 data.

I do this not only because your protein might be in such a group, but also because it gives a better sense of how many of the 'disappearing' ones were already revealed by the approach above, which focused on expecting to see related proteins, based on association in version 2 data, in the complexes seen in version 3 data.

-----

**Checking whether we get the same identifiers if we just take what was in hu.MAP 2.0 data and is gone in version 3 data.**  

Run the next cell to use set subtraction to collect what remains from hu.MAP 2.0 data if you subtract out everything seen in hu.MAP 3.0 data, and then in subsequent cell(s) see whether the result matches the keys in `gone_entirely_in_3_identified_via_dict`:

In [16]:
directly_determined_gone_missing_accs = set(v2data_info_as_df['Uniprot_ACCs']) - (set(v3data_info_as_df['Uniprot_ACCs']))
len(directly_determined_gone_missing_accs)

484

This is bigger than the 151 that we mentioned being gone entirely in version 3 data: a little over three times as many, though still only a few hundred.  
It should include the 151 we discussed above; it is larger because I had restricted 'ubiquitous' to being defined by two or more complexes. We can check that this group includes the 151 by running the next cell, which uses more of Python's set math:

In [17]:
gone_entirely_in_3_set.issubset(directly_determined_gone_missing_accs)

True

That's true, as expected.

So how many have we added here specifically?

In [18]:
len(directly_determined_gone_missing_accs - gone_entirely_in_3_set)

333

So we do indeed get all the same 151 identifiers if we just take what was in hu.MAP 2.0 data and is gone in version 3 data, plus presumably an additional 333 that are not really 'ubiquitous' and disappearing from related complexes, because the complex the identifier has disappeared from is unique.

So how do I prove that about these additional ones? They won't be among the dictionaries I already collected.  
And some among the 333 could be in more than one hu.MAP 2.0 complex, because they can occur in unrelated hu.MAP 2.0 complexes as long as that identifier is gone from hu.MAP 3.0 data entirely.  
So how to check that? The next three cells set up a preliminary examination.

In [19]:
three_three_three_set = directly_determined_gone_missing_accs - gone_entirely_in_3_set
len(three_three_three_set)

333

In [20]:
%store three_three_three_set >three_three_three_set.txt

Writing 'three_three_three_set' (set) to file 'three_three_three_set.txt'.


In [21]:
!head three_three_three_set.txt

{'A0A075B759',
 'A0A0G2JN01',
 'A0M8Q6',
 'A1Z1Q3',
 'A2A3N6',
 'A4FU69',
 'A6NKF1',
 'A6NKQ9',
 'A6NLF2',
 'A8MWD9',


I downloaded `three_three_three_set.txt` to check a few and see if they only occur in one complex, or at least in unrelated complexes, as I expect.

Some results:  
`A6NKF1` only occurs once in hu.MAP 2.0 data and none in hu.MAP 3.0 data, which is consistent with what I expected.  
It wouldn't be in the group of 151 I had in `df_related_gone_entirely_in_3` and associated data because, to be in the 'ubiquitous' group in the first place, by the constraint I applied, it has to be in more than one complex.
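
The manual spot check above could be extended to every identifier in the set at once by counting how many hu.MAP 2.0 complexes each one occurs in. The sketch below runs on toy data and rests on assumptions: it supposes the version 2 table is in long format with one row per complex and accession pair, the column name `HuMAP2_ID` is a guess (only `Uniprot_ACCs` appears in the cells above), and `ACC_A` to `ACC_D` are placeholders.

```python
import pandas as pd

# Toy long-format table: one row per (complex, accession) pair.
# `HuMAP2_ID` is an assumed column name; `Uniprot_ACCs` is used above.
toy_v2 = pd.DataFrame({
    "HuMAP2_ID": ["c1", "c1", "c2", "c2", "c3", "c3"],
    "Uniprot_ACCs": ["ACC_A", "ACC_B", "ACC_B", "ACC_C", "ACC_C", "ACC_D"],
})

# Stand-in for three_three_three_set.
toy_gone_only_directly = {"ACC_A", "ACC_C"}

# Keep only rows for the identifiers of interest, then count distinct
# complexes per identifier.
subset = toy_v2[toy_v2["Uniprot_ACCs"].isin(toy_gone_only_directly)]
complexes_per_acc = subset.groupby("Uniprot_ACCs")["HuMAP2_ID"].nunique()

in_single_complex = sorted(complexes_per_acc[complexes_per_acc == 1].index)
in_multiple_complexes = sorted(complexes_per_acc[complexes_per_acc > 1].index)

print(in_single_complex)      # ['ACC_A']
print(in_multiple_complexes)  # ['ACC_C']
```

Identifiers landing in `in_multiple_complexes` would be the ones worth a closer look, since whether their version 2 complexes count as unrelated depends on how the script defines 'related'.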

I know this isn't exhaustive and could be done better. However, at this point I see there is a fair number of identifiers that either shift or get refined and a few hundred that don't show up at all in version 3 data. There isn't really a reason to look into the nature of these 333 much further for now.

It would be better if identifiers that look legitimate in version 2 didn't disappear in version 3, and you may be unfortunate and have a protein in this group. Know, though, that this is the case for only a few hundred identifiers, and this is still an interesting dataset on the whole and seems informative at a biological level.

----------
Enjoy!