### Checking for duplicates in the two main columns of hu.MAP 2.0 CSV

I observed some duplicates in the `Uniprot_ACCs` & `genenames` columns in the **raw** hu.MAP 2.0 data provided by the authors, and I thought I'd check how prevalent they are.

------

#### Preparation

##### Get the complexes with confidence scores

Because the author-provided source didn't work for the hu.MAP 3.0 data, I expected `curl -OL "http://humap2.proteincomplexes.org/static/downloads/humap2/humap2_complexes_20200809.txt"` to work on my local machine yet fail on MyBinder, where the port involved in fetching from the original resource may be blocked. Because of that expectation, I made a copy at https://gist.githubusercontent.com/fomightez/af3edda957e4d71acbaa30192e74e9af/raw/108a8c3fb3374a74ef3ca5d772a9dfe96e996c93/humap2_complexes_20200809.txt where MyBinder would have access. However, the curl of the original source works!!
(Keeping a note about my copy now, but using the original source.)

In [1]:
!curl -OL "http://humap2.proteincomplexes.org/static/downloads/humap2/humap2_complexes_20200809.txt"
# If that fails, uncomment & try the next line which will be guaranteed to work with MyBinder:
#!curl -OL https://gist.githubusercontent.com/fomightez/af3edda957e4d71acbaa30192e74e9af/raw/108a8c3fb3374a74ef3ca5d772a9dfe96e996c93/humap2_complexes_20200809.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  500k  100  500k    0     0   449k      0  0:00:01  0:00:01 --:--:--  449k
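
The same try-the-original-then-fall-back logic can be sketched in plain Python for cases where shelling out to `curl` isn't convenient. This is only a minimal sketch: the function name `download_with_fallback` and its `fetch` parameter are my own illustrative choices, not part of the notebook, and the two URLs are the ones given above.

```python
import urllib.request

PRIMARY_URL = "http://humap2.proteincomplexes.org/static/downloads/humap2/humap2_complexes_20200809.txt"
MIRROR_URL = "https://gist.githubusercontent.com/fomightez/af3edda957e4d71acbaa30192e74e9af/raw/108a8c3fb3374a74ef3ca5d772a9dfe96e996c93/humap2_complexes_20200809.txt"


def download_with_fallback(urls, dest, fetch=urllib.request.urlretrieve):
    """Try each URL in order and return the first one that downloads.

    `fetch` is a hypothetical injection point (defaults to urlretrieve)
    so the fallback logic can be exercised without network access.
    """
    errors = []
    for url in urls:
        try:
            fetch(url, dest)
            return url
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            errors.append(f"{url}: {err}")
    raise RuntimeError("All downloads failed:\n" + "\n".join(errors))


# Usage (left commented out so nothing is fetched on import):
# used = download_with_fallback([PRIMARY_URL, MIRROR_URL], "humap2_complexes_20200809.txt")
# print(f"Downloaded from {used}")
```

The function returns the URL that succeeded, so the notebook can record which source was actually used.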


##### Put the data on the complexes into Pandas dataframe

(I'm using uv here just because I want to learn about it. I could have run the code in the script right in this notebook, and skipped the pickling and read pickle steps.)

Get the script to use with `uv` to read in the raw data and make a dataframe.

In [2]:
!curl -OL https://raw.githubusercontent.com/fomightez/structurework/refs/heads/master/humap3-utilities/complexes_rawCSV_to_df.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1007  100  1007    0     0   4499      0 --:--:-- --:--:-- --:--:--  4495


In [5]:
!uv run complexes_rawCSV_to_df.py humap2_complexes_20200809.txt
import pandas as pd
rd_df = pd.read_pickle('raw_complexes_pickled_df.pkl')
rd_df

Reading inline script metadata from `complexes_rawCSV_to_df.py`

| | HuMAP2_ID | Confidence | Uniprot_ACCs | genenames |
|---|---|---|---|---|
| 0 | HuMAP2_00000 | 3 | Q9BQS8 O95900 | FYCO1 TRUB2 |
| 1 | HuMAP2_00001 | 4 | P08133 Q15797 Q99426 Q9H4M9 P68402 Q15102 | ANXA6 SMAD1 TBCB EHD1 PAFAH1B2 PAFAH1B3 |
| 2 | HuMAP2_00002 | 5 | Q93062 Q9NZC3 Q9UF11 Q15038 Q6ZRY4 A1KXE4 O432... | RBPMS GDE1 PLEKHB1 DAZAP2 RBPMS2 FAM168B RBFOX... |
| 3 | HuMAP2_00003 | 5 | Q15836 Q16563 Q29983 Q8WUM9 O14974 Q9Y5Y0 Q149... | VAMP3 SYPL1 MICA SLC20A1 PPP1R12A FLVCR1 DRAP1... |
| 4 | HuMAP2_00004 | 4 | Q8WV99 Q9NQT8 Q9H672 P20774 Q49A92 | ZFAND2B KIF13B ASB7 OGN C8orf34 |
| ... | ... | ... | ... | ... |
| 6960 | HuMAP2_07014 | 4 | Q9HC97 Q92871 Q6S8J3 P13727 P31152 | GPR35 PMM1 POTEE PRG2 MAPK4 |
| 6961 | HuMAP2_07015 | 4 | Q9H5L6 Q8N5N7 Q96E29 O75127 Q9NPE2 Q96I51 | THAP9 MRPL50 MTERF3 PTCD1 NGRN RCC1L |
| 6962 | HuMAP2_07016 | 5 | Q99697 Q8NE31 P17509 P31274 Q2T9J0 Q8TAC2 P529... | PITX2 FAM13C HOXB6 HOXC9 TYSND1 JOSD2 NKX2-5 D... |
| 6963 | HuMAP2_07017 | 2 | Q53GT1 Q96GP6 P49448 | KLHL22 SCARF2 GLUD2 |


That's a lot of complexes!

--------

### Now to check for duplicates

In [7]:
# check for duplicates in the two main columns
rows_with_issue_content = []
for row in rd_df.itertuples():
    accs = row.Uniprot_ACCs.split()
    set_of_accs = set(accs)
    if len(accs) != len(set_of_accs):
        print(f"`{row.Uniprot_ACCs}` displays duplicates.")
        rows_with_issue_content.append(row)
    gns = row.genenames.split()
    set_of_gns = set(gns)
    if len(gns) != len(set_of_gns):
        print(f"`{row.genenames}` displays duplicates.")
        rows_with_issue_content.append(row)
if not rows_with_issue_content:
    import rich
    rich.print("[bold]There are no occurrences of duplicate identifiers observed in the two main columns[/bold].")

`Q9P2F5 Q9GZY0 O14640 O14641 Q9GZY0` displays duplicates.
`STOX2 NXF2 DVL1 DVL2 NXF2` displays duplicates.
`P47929 P47929` displays duplicates.
`LGALS7 LGALS7` displays duplicates.
` P22392 O00746 Q13232 P22392 Q9H4I3` displays duplicates.
`NME2P1 NME2 NME4 NME3 NME2 TRABD` displays duplicates.
`Q9GZY0 Q9GZY0` displays duplicates.
`NXF2 NXF2` displays duplicates.
`Q6P1K8 Q6P1K8 Q03393` displays duplicates.
`GTF2H2C GTF2H2C PTS` displays duplicates.
`Q9Y3F4 P08621 O75534 Q9P2G9 Q7Z422 Q9H840 Q9BRS8 O14893 Q9H9B4 Q9NWZ8 O43353 Q8WXD5 P57678 Q16637 Q16637 Q9H3H3 Q9UHI6 Q9P2E3 P83369 Q8TEQ6` displays duplicates.
`STRAP SNRNP70 CSDE1 KLHL8 SZRD1 GEMIN7 LARP6 GEMIN2 SFXN1 GEMIN8 RIPK2 GEMIN6 GEMIN4 SMN1 SMN1 C11orf68 DDX20 ZNFX1 LSM11 GEMIN5` displays duplicates.
`P0DML3 P01241 P0DML3` displays duplicates.
`CSH2 GH1 CSH2` displays duplicates.
`Q8IWZ3 P43355 O15480 O75179 Q8IWZ3` displays duplicates.
`ANKHD1 MAGEA1 MAGEB3 ANKRD17 ANKHD1` displays duplicates.
`P59665 P18065 Q8N8U9 P59665` displays duplicates.
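
The check above reports that a row contains a repeat, but not which identifier repeats or how often. A minimal sketch of one way to get that detail uses `collections.Counter`. The helper name `repeated_tokens` is hypothetical, and the example strings are copied from the notebook output above.

```python
from collections import Counter


def repeated_tokens(field):
    """Return {token: count} for each whitespace-separated token seen more than once."""
    counts = Counter(field.split())
    return {token: n for token, n in counts.items() if n > 1}


# Values copied from the notebook output above
print(repeated_tokens("Q9P2F5 Q9GZY0 O14640 O14641 Q9GZY0"))  # {'Q9GZY0': 2}
print(repeated_tokens("Q9BQS8 O95900"))                       # {}
```

Applied per row, for example via `rd_df["Uniprot_ACCs"].map(repeated_tokens)`, this would give a column whose non-empty entries mark the affected complexes, which makes it easy to count how prevalent the issue is.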

**Seems to be an issue!**   
I'll be sure to check again after I standardize the data further and build in a fix!
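
As a starting point for such a fix, here is a minimal sketch of order-preserving de-duplication for a single cell. It is not what the final fix will necessarily look like: the helper name `dedupe_preserving_order` is hypothetical, and I am assuming that the first occurrence of each identifier should be kept and that each column can be cleaned independently.

```python
def dedupe_preserving_order(field):
    """Drop repeated whitespace-separated tokens, keeping the first occurrence of each.

    dict.fromkeys keeps insertion order, so the original ordering of the
    remaining tokens is unchanged. Assumption: each column is cleaned on its own.
    """
    return " ".join(dict.fromkeys(field.split()))


# Values copied from the notebook output above
print(dedupe_preserving_order("STOX2 NXF2 DVL1 DVL2 NXF2"))  # STOX2 NXF2 DVL1 DVL2
print(dedupe_preserving_order("P47929 P47929"))              # P47929
```

One caveat: `split()` also normalizes whitespace, so a leading space such as the one in the `Uniprot_ACCs` value for the `NME2P1` row disappears. If that space stands for a missing accession, cleaning the two columns independently would hide the mismatch in length between them, so that case probably deserves separate handling.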

-----

Enjoy!