### Checking for duplicates in the two main columns

I observed some duplicates in the `Uniprot_ACCs` & `genenames` columns in the **raw** hu.MAP 2.0 data, and I realized I wasn't sure I had checked whether the same was true of the hu.MAP 3.0 data.

I am going to check that now in the **standardized, balanced, FIXED In-Order Matched CSV data**. At this point, **that is the data I am using, and so it is the most important.**   
Separately, I checked the raw hu.MAP 3.0 CSV data the same way **and also found none there**. I realized that, because I am only assessing for duplicates, entries with multiple gene names joined by semicolons still get split, so the same code can detect duplicates with no special handling needed. (I did that by editing the preparation cells in a copy of this notebook to get the raw data like I do at the start of `Standardizing_identifier_order_in_humap3-provided_csv.ipynb`.)

------

#### Preparation

##### Get the complexes with confidence scores

While `curl -OL "https://humap3.proteincomplexes.org/static/downloads/humap3/hu.MAP3.0_complexes_wConfidenceScores_total15326_wGenenames_20240922.csv"` works on my local machine, the port involved may be blocked on MyBinder, which would prevent getting the file from the original resource. (Actually, it turns out that data is not in tidy form and has inconsistencies in the separators used, so remediation was necessary anyway; see [here](additional_nbs/standardizing_initial_data/README_as.ipynb) if interested in more on that.) We'll obtain a standardized copy of that data, placed where it is accessible in MyBinder-served sessions, by running the next cell:

In [4]:
!curl -OL https://raw.githubusercontent.com/fomightez/humap3-binder/refs/heads/main/additional_nbs/standardizing_initial_data/hu.MAP3.0_complexes_wConfidenceScores_total15326_wGenenames_20240922InOrderMatched.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1243k  100 1243k    0     0  2149k      0 --:--:-- --:--:-- --:--:-- 2152k


##### Put the data on the complexes into Pandas dataframe

(I'm using uv here just because I want to learn about it. I could have run the code in the script right in this notebook, and skipped the pickling and read pickle steps.)
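For comparison, the in-notebook route mentioned above might look something like the following minimal sketch. It does not reproduce what `complexes_rawCSV_to_df.py` actually does; it assumes that a plain `pd.read_csv()` is enough and that the CSV has a header row with the four column names seen in the dataframe output later in this notebook. To keep it self-contained, an in-memory stand-in (`sample_csv`, a made-up name) holding three rows copied from that output is used instead of the downloaded file.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV: three rows copied from the dataframe
# output shown in this notebook.
# Assumption: the real file has a header row with these four column names.
sample_csv = io.StringIO(
    "HuMAP3_ID,ComplexConfidence,Uniprot_ACCs,genenames\n"
    "huMAP3_00000.1,1,Q9UGQ2 P20963 Q9NWV4,CACFD1 CD247 CZIB\n"
    "huMAP3_00001.1,1,Q9NWB1 O94887 Q9NQ92,RBFOX1 FARP2 COPRS\n"
    "huMAP3_00002.1,1,Q8N3D4 Q9Y3A4,EHBP1L1 RRP7A\n"
)

# Read straight into a dataframe; no pickling or un-pickling step is involved.
direct_df = pd.read_csv(sample_csv)
print(direct_df.shape)
```

In a real session, the path to the downloaded `...InOrderMatched.csv` file would be passed to `pd.read_csv()` in place of `sample_csv`.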

Get the script to use with `uv` to read in the raw data and make a dataframe.

In [5]:
!curl -OL https://raw.githubusercontent.com/fomightez/structurework/refs/heads/master/humap3-utilities/complexes_rawCSV_to_df.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1007  100  1007    0     0   3460      0 --:--:-- --:--:-- --:--:--  3460


In [6]:
!uv run complexes_rawCSV_to_df.py hu.MAP3.0_complexes_wConfidenceScores_total15326_wGenenames_20240922InOrderMatched.csv
import pandas as pd
rd_df = pd.read_pickle('raw_complexes_pickled_df.pkl')
rd_df

Reading inline script metadata from `complexes_rawCSV_to_df.py`
Installed 10 packages in 135ms


| | HuMAP3_ID | ComplexConfidence | Uniprot_ACCs | genenames |
|---|---|---|---|---|
| 0 | huMAP3_00000.1 | 1 | Q9UGQ2 P20963 Q9NWV4 | CACFD1 CD247 CZIB |
| 1 | huMAP3_00001.1 | 1 | Q9NWB1 O94887 Q9NQ92 | RBFOX1 FARP2 COPRS |
| 2 | huMAP3_00002.1 | 1 | Q8N3D4 Q9Y3A4 | EHBP1L1 RRP7A |
| 3 | huMAP3_00003.1 | 1 | Q5T2D2 O00429 | TREML2 DNM1L |
| 4 | huMAP3_00004.1 | 1 | Q9H9C1 Q9H267 O95460 P21941 P78540 | VIPAS39 VPS33B MATN4 MATN1 ARG2 |
| ... | ... | ... | ... | ... |
| 15321 | huMAP3_15345.1 | 6 | O14628 Q3SXZ3 | ZNF195 ZNF718 |
| 15322 | huMAP3_15346.1 | 6 | Q6ZWT7 P08910 Q86VD1 Q9UJQ1 Q9Y6X9 | MBOAT2 ABHD2 MORC1 LAMP5 MORC2 |
| 15323 | huMAP3_15347.1 | 6 | A6ND91 Q4V339 | ASPDH ZNG1F |
| 15324 | huMAP3_15348.1 | 6 | A6NKF2 P08217 Q8IVW6 Q99856 | ARID3C CELA2A ARID3B ARID3A |


That's a lot of complexes!

--------

### Now to check for duplicates

In [24]:
# Check for duplicates in the two main columns
import rich

# `SPECIALin_UniProt_but_no_gene` can occur more than once in a row because it
# is used to balance the columns, so remove it from consideration
identifiers_not_to_consider = ['SPECIALin_UniProt_but_no_gene']
rows_with_issue_content = []
for row in rd_df.itertuples():
    # A row has duplicates if its identifier list is longer than its set
    accs = row.Uniprot_ACCs.split()
    if len(accs) != len(set(accs)):
        print(f"`{row.Uniprot_ACCs}` displays duplicates.")
        rows_with_issue_content.append(row)
    gns = [x for x in row.genenames.split() if x not in identifiers_not_to_consider]
    if len(gns) != len(set(gns)):
        print(f"`{row.genenames}` displays duplicates.")
        rows_with_issue_content.append(row)
if not rows_with_issue_content:
    rich.print("[bold]There are no occurrences of duplicate identifiers observed in the two main columns[/bold].")
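The check above only reports that a row contains duplicates, not which identifiers are repeated. If that detail were ever needed, a small helper along the following lines could supply it. This is only a sketch: `find_duplicates` and `PLACEHOLDER` are hypothetical names, not part of the notebook or of the hu.MAP resources, and the second and third test cells are made-up values used purely for illustration.

```python
from collections import Counter

# Placeholder that the check above excludes because it may legitimately repeat.
PLACEHOLDER = "SPECIALin_UniProt_but_no_gene"


def find_duplicates(cell, ignore=(PLACEHOLDER,)):
    """Return the identifiers occurring more than once in a space-separated cell."""
    counts = Counter(tok for tok in cell.split() if tok not in ignore)
    return sorted(tok for tok, n in counts.items() if n > 1)


# Cell copied from the first row of the dataframe shown earlier.
print(find_duplicates("Q9UGQ2 P20963 Q9NWV4"))
# Made-up cell with a repeated accession, for illustration only.
print(find_duplicates("P12345 Q67890 P12345"))
# Made-up cell showing that a repeated placeholder is ignored.
print(find_duplicates(f"GENEA {PLACEHOLDER} {PLACEHOLDER}"))
```

Applied to the dataframe, something like `rd_df["Uniprot_ACCs"].map(find_duplicates)` should give a per-row list of repeated accessions, which would be empty for every row if there are no duplicates.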

-----

Enjoy!