# Adding two rank-score columns (ESM1b and AlphaMissense) to compare the models
 - To facilitate comparison between scores, we added rank scores for most functional prediction scores and conservation scores, replacing the "converted" scores of previous versions.

 - AlphaMissense_rankscore: AlphaMissense scores were ranked among all AlphaMissense scores in dbNSFP. The rankscore is the ratio of the rank of the AlphaMissense_score to the total number of AlphaMissense scores in dbNSFP.
 - ESM1b_rankscore: ESM1b scores were first negated (i.e., -ESM1b_score), then ranked among all -ESM1b_score values in dbNSFP. The rankscore is the ratio of the rank of the -ESM1b_score to the total number of ESM1b scores in dbNSFP.
 - Use exactly the same set of proteins for both models so that the ranking is fair (20247 proteins).
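
The rank-score definition above can be illustrated with a minimal sketch. The helper name `rank_scores` is hypothetical, and the tie handling (tied scores share their average rank) is an assumption, since the description above does not say how ties are resolved. Negating the ESM1b scores before ranking means that the most negative LLRs end up with the highest rank scores.

```python
from typing import Sequence


def rank_scores(scores: Sequence[float], negate: bool = False) -> list[float]:
    """Return rank / N for every score (hypothetical helper, not the project's code).

    Assumption: tied scores share their average rank.
    """
    values = [-s if negate else s for s in scores]
    n = len(values)
    # Indices of the scores, ordered from the smallest to the largest value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the run while the next value ties with the current one.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # The 1-based positions start+1 .. end+1 share their average rank.
        average_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return [r / n for r in ranks]


# AlphaMissense-style scores: ranked as they are.
am_example = rank_scores([0.10, 0.90, 0.50, 0.50])
# ESM1b-style LLRs: negated first, so the most negative LLR ranks highest.
esm_example = rank_scores([-12.0, -0.5, -7.0, -7.0], negate=True)

print(am_example)   # [0.25, 1.0, 0.625, 0.625]
print(esm_example)  # [1.0, 0.25, 0.625, 0.625]
```

On NumPy arrays, `scipy.stats.rankdata(values, method="average") / len(values)` would give the same result.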

In [1]:
# --- Project Setup ---
from setup_notebook import setup_project_root
setup_project_root()

from src.project_config import get_paths
from src.rank_processing.exclude_isoforms import exclude_isoforms
from src.rank_processing.list_files import list_csv_files
import os
import pandas as pd
from scipy.stats import rankdata
from glob import glob
from src.rank_processing.rank_data import rank_data

In [2]:
paths = get_paths()
data_AM_raw = paths["data"] / "raw" / "AlphaMissense_csv"
data_ESM_raw = paths["data"] / "raw" / "ALL_hum_isoforms_ESM1b_LLR"

data_AM_rank = paths["processed"] / "AlphaMissense_rank_csv"
data_ESM_rank = paths["processed"] / "ESM1b_rank_csv"
data_ESM_no_isoform = paths["processed"] / "ESM1b_no_isoforms"
protein_ID_folder = paths["processed"] / "Protein_IDs_Per_Experiment"

# Create output directories
os.makedirs(data_AM_rank, exist_ok=True)
os.makedirs(data_ESM_rank, exist_ok=True)
os.makedirs(data_ESM_no_isoform, exist_ok=True)
os.makedirs(protein_ID_folder, exist_ok=True)

### 1. Find the overlap between the ESM1b and AlphaMissense datasets for fair ranking

In [5]:
# Skip isoform CSV files and copy the non-isoform files to a new folder
exclude_isoforms(data_ESM_raw, data_ESM_no_isoform)

Copying non-isoform CSVs: 100%|██████████| 20335/20335 [00:12<00:00, 1667.46it/s]

✅ Done: copied 20335 files to 'data/processed/ESM1b_no_isoforms'.
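
How `exclude_isoforms` decides what counts as an isoform is not shown in this notebook. One plausible rule, sketched below, relies on the UniProt convention that isoform identifiers carry a `-<number>` suffix (for example `P12345-2`). The function name `is_isoform_file`, the example filenames, and the assumption that the ESM1b files are named `<UniProt ID>_LLR.csv` are all illustrative.

```python
import re

# Assumed pattern: a UniProt isoform accession ends in "-<number>", e.g. "P12345-2".
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def is_isoform_file(filename: str, suffix: str = "_LLR.csv") -> bool:
    """Return True if the protein ID in the filename looks like an isoform ID."""
    # Strip the assumed file suffix to recover the protein ID.
    stem = filename[: -len(suffix)] if filename.endswith(suffix) else filename
    return bool(ISOFORM_SUFFIX.search(stem))


# Hypothetical filenames, for illustration only.
filenames = ["P12345_LLR.csv", "P12345-2_LLR.csv", "Q9Y6K9_LLR.csv"]
canonical = [name for name in filenames if not is_isoform_file(name)]
print(canonical)  # ['P12345_LLR.csv', 'Q9Y6K9_LLR.csv']
```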





In [None]:
# List all CSV files in the raw data directories and save them for later
list_csv_files(data_ESM_no_isoform, "_LLR.csv", protein_ID_folder / "ESM1b_no_isoforms.csv", "ESM_no_isoform")
list_csv_files(data_AM_raw, ".csv", protein_ID_folder / "AlphaMissense_csv.csv", "AlphaMissense_csv")

In [4]:
# Find overlap between AlphaMissense and ESM1b datasets using listed csv files from above
AM_csv = pd.read_csv(protein_ID_folder / "AlphaMissense_csv.csv")
ESM_csv = pd.read_csv(protein_ID_folder / "ESM1b_no_isoforms.csv")

# Find the intersection and save it as CSV (sorted so the row order is reproducible)
pd.DataFrame(
    sorted(set(AM_csv["AlphaMissense_csv"]).intersection(ESM_csv["ESM_no_isoform"])),
    columns=["Protein_ID"],
).to_csv(protein_ID_folder / "intersection_protein_ids_to_be_ranked.csv", index=False)
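
Before ranking, it can help to check how many protein IDs each dataset loses in the intersection. The sketch below uses plain Python sets and made-up IDs. `compare_id_sets` is a hypothetical helper, not part of `src.rank_processing`.

```python
def compare_id_sets(am_ids: set[str], esm_ids: set[str]) -> dict[str, list[str]]:
    """Split two ID sets into shared IDs and IDs unique to each side."""
    return {
        "shared": sorted(am_ids & esm_ids),
        "only_alphamissense": sorted(am_ids - esm_ids),
        "only_esm1b": sorted(esm_ids - am_ids),
    }


# Made-up protein IDs, for illustration only.
summary = compare_id_sets(
    am_ids={"P04637", "P38398", "Q9Y6K9"},
    esm_ids={"P04637", "Q9Y6K9", "O15553"},
)
for group, ids in summary.items():
    print(f"{group}: {len(ids)} -> {ids}")
```

Only the `shared` IDs would be ranked. The other two groups show which proteins are dropped from each model.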

### 2. Rank all proteins in the protein-ID intersection for both models

In [7]:
protein_id_csv_pathway = protein_ID_folder / "intersection_protein_ids_to_be_ranked.csv"

In [37]:
rank_data(input_dir=data_AM_raw, output_dir=data_AM_rank, model="AlphaMissense", intersection_csv_path=protein_id_csv_pathway)

Collecting AM scores...


Reading scores: 100%|██████████| 20246/20246 [00:51<00:00, 392.08it/s]


Computing global ranks...
Processing per-file and writing outputs...


Processing files: 100%|██████████| 20246/20246 [04:41<00:00, 71.99it/s] 

AlphaMissense Scores were successfully ranked.





In [8]:
rank_data(input_dir=data_ESM_no_isoform, output_dir=data_ESM_rank, model="ESM", intersection_csv_path=protein_id_csv_pathway)

Collecting LLR scores...


Reading LLR scores: 100%|██████████| 20246/20246 [03:11<00:00, 105.97it/s]


Computing global ranks...
Writing ranked matrices...


100%|██████████| 20246/20246 [02:46<00:00, 121.76it/s]

ESM1b Scores were successfully ranked.
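
The log messages above suggest three phases: collect the scores from every file, compute ranks over the pooled scores, then write the results per file. The sketch below shows that pooling idea under stated assumptions. `global_rank_scores` is a hypothetical stand-in for `rank_data`, it works on in-memory lists instead of CSV files, and tied scores are assumed to share their average rank.

```python
from bisect import bisect_left, bisect_right


def global_rank_scores(
    scores_by_protein: dict[str, list[float]], negate: bool = False
) -> dict[str, list[float]]:
    """Rank every score against the pool of all proteins' scores."""
    sign = -1.0 if negate else 1.0
    # Phase 1: pool the (optionally negated) scores of all proteins.
    pooled = sorted(sign * s for scores in scores_by_protein.values() for s in scores)
    total = len(pooled)

    def rank_score(value: float) -> float:
        # Phase 2: the 1-based positions lo+1 .. hi hold this value,
        # so the average rank is (lo + 1 + hi) / 2.
        lo = bisect_left(pooled, value)
        hi = bisect_right(pooled, value)
        return (lo + 1 + hi) / 2 / total

    # Phase 3: map the global rank scores back to each protein.
    return {
        protein_id: [rank_score(sign * s) for s in scores]
        for protein_id, scores in scores_by_protein.items()
    }


# Two made-up proteins with two scores each.
ranked = global_rank_scores({"P1": [0.1, 0.9], "P2": [0.5, 0.3]})
print(ranked)  # {'P1': [0.25, 1.0], 'P2': [0.75, 0.5]}
```

Because the ranks come from the pooled scores rather than from each file separately, a rank score has the same meaning for every protein, and restricting both models to the same protein set keeps the two pools comparable.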



