# LoincRanker Notebook Overview

This notebook implements a learning-to-rank pipeline for LOINC data using an AdaRank model. It covers:
- Data merging from multi-sheet Excel files
- Feature extraction techniques (lexical, semantic, and numerical)
- Construction of a combined sparse feature matrix
- Query-aware model training and evaluation
- Analysis of model results

The underlying AdaRank model is implemented using the repository git@github.com:rueycheng/AdaRank.git. The AdaRank algorithm was modified to include extra regularization mechanisms.

Example predictions can be found in the output of the cells and in the folder Results/.

---

## Content of notebook 

1. Data Loading 
2. Feature Extraction and Data Merging
3. Numerical Feature Processing
4. Feature Combination
5. Labels and Query Identifiers
6. Model Training and Evaluation
7. Results Analysis





In [57]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import GroupKFold
from sentence_transformers import SentenceTransformer
from adarank import AdaRank
from adarankv2 import AdaRankv2
from metrics import NDCGScorer
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import save_npz, load_npz


In [None]:



""" val_file_path = 'Input-test/loinc_query_terms_testing.xlsx'
train_file_path = 'Input-test/loinc_query_terms_training.xlsx' """
val_file_path = 'Input/loinc_query_terms_testing.xlsx'
train_file_path = 'Input/loinc_query_terms_training.xlsx'

# Load the pretrained sentence-embedding model.
# We use a medical-domain model from Hugging Face: https://huggingface.co/ls-da3m0ns/bge_large_medical
# For faster runs, the model can be switched to a more general, lightweight one (replace with "all-MiniLM-L6-v2").

embedder = SentenceTransformer('ls-da3m0ns/bge_large_medical')
#embedder = SentenceTransformer('all-MiniLM-L6-v2') # lightweight model


### Define function for semantic similarity computation
  
- Uses the predefined embedding model (`embedder`).
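
The similarity function in the following cell encodes the query and the field once per row, so a query shared by every row of a sheet is embedded many times. A minimal sketch of a batched alternative is shown here. The names `batched_embedding_similarity`, `encode`, and `toy_encode` are hypothetical, and `encode` is assumed to behave like `embedder.encode` (a list of strings in, a 2-D array out). The toy encoder only stands in for the real model so the sketch runs without downloading anything.

```python
import numpy as np

def batched_embedding_similarity(queries, fields, encode):
    """Cosine similarity per (query, field) pair, embedding each distinct text once.

    `encode` is any callable mapping a list of strings to a 2-D array
    (assumption: the same contract as `embedder.encode`).
    Pairs with a missing value (None or NaN) get a similarity of 0.0.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and np.isnan(value))

    queries, fields = list(queries), list(fields)
    # Collect distinct non-missing texts so that each is embedded only once.
    texts = sorted({str(v) for v in queries + fields if not is_missing(v)})
    sims = np.zeros(len(queries))
    if not texts:
        return sims
    vectors = np.asarray(encode(texts), dtype=float)
    # L2-normalise so that a plain dot product equals the cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors = vectors / np.where(norms == 0, 1.0, norms)
    lookup = dict(zip(texts, vectors))
    for i, (q, f) in enumerate(zip(queries, fields)):
        if not (is_missing(q) or is_missing(f)):
            sims[i] = float(lookup[str(q)] @ lookup[str(f)])
    return sims

def toy_encode(texts):
    """Hypothetical stand-in for a real embedding model (two fixed toy vectors)."""
    table = {"x": [3.0, 0.0], "y": [0.0, 2.0]}
    return np.array([table[t] for t in texts])

demo_sims = batched_embedding_similarity(["x", "x", None], ["x", "y", "x"], toy_encode)
print(demo_sims)
```

In the notebook this could be called as `batched_embedding_similarity(merged_df['query'], merged_df['long_common_name'], embedder.encode)`, under the assumption stated above.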

In [59]:
# Function to safely calculate similarity between a query and a field
def calculate_embedding_similarity(query, field):
    if pd.isna(query) or pd.isna(field):
        return 0
    query_embedding = embedder.encode([str(query)])[0]
    field_embedding = embedder.encode([str(field)])[0]
    return cosine_similarity([query_embedding], [field_embedding])[0][0]



### Data Loading and Merging
- Iterate over all sheet names in the Excel file.
- Read each sheet into a temporary DataFrame.
- Add a new column (`query`) to capture the sheet name.
- Concatenate all the sheets into a single DataFrame (`merged_df`).
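
The four steps above can also be sketched with a single `pd.concat` over a mapping of sheet name to DataFrame. `pd.read_excel(path, sheet_name=None)` returns such a mapping for all sheets. The helper name `merge_sheets` and the rows below are made up for illustration and are not taken from the input files.

```python
import pandas as pd

def merge_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping and store the sheet name in `query`."""
    # The dict keys become the outer index level, which is named "query" here.
    merged = pd.concat(sheets, names=["query", "row"])
    # Move the sheet name into a regular column and renumber the rows.
    return merged.reset_index(level="query").reset_index(drop=True)

# Toy stand-in for pd.read_excel(train_file_path, sheet_name=None)
toy_sheets = {
    "glucose in blood": pd.DataFrame(
        {"long_common_name": ["Glucose in Blood"], "relevant": [1]}
    ),
    "iron": pd.DataFrame(
        {"long_common_name": ["Iron in Serum", "Calcium in Urine"], "relevant": [1, 0]}
    ),
}
merged_example = merge_sheets(toy_sheets)
print(merged_example)
```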

In [60]:
#Read the file and merge all sheets into one dataframe
xls = pd.ExcelFile(train_file_path)
dataframes = []
for sheet_name in xls.sheet_names:
    temp_df = pd.read_excel(xls, sheet_name=sheet_name)
    temp_df['query'] = sheet_name  # Each sheet name is used as the query text
    dataframes.append(temp_df)
merged_df = pd.concat(dataframes, ignore_index=True)


# ------------------------------
# Compute jaccard similarity for property, system, and component

# Define a function to compute the Jaccard similarity between two strings.
def jaccard_similarity(str1, str2):
    # If either string is not a valid string, return 0.
    if not isinstance(str1, str) or not isinstance(str2, str):
        return 0.0
    set1 = set(str1.lower().split())
    set2 = set(str2.lower().split())
    union = set1.union(set2)
    if not union:
        return 0.0
    return float(len(set1.intersection(set2))) / len(union)

# Compute Jaccard similarity for each of the columns, comparing the 'query' to each.
merged_df['system_jaccard'] = merged_df.apply(
    lambda row: jaccard_similarity(row['query'], row['system']) if pd.notnull(row['system']) else 0.0, axis=1)
merged_df['component_jaccard'] = merged_df.apply(
    lambda row: jaccard_similarity(row['query'], row['component']) if pd.notnull(row['component']) else 0.0, axis=1)
merged_df['property_jaccard'] = merged_df.apply(
    lambda row: jaccard_similarity(row['query'], row['property']) if pd.notnull(row['property']) else 0.0, axis=1)

# inspect the computed similarities (not needed for training)
""" print("System Jaccard similarity stats:")
print(merged_df['system_jaccard'].describe())
print("Component Jaccard similarity stats:")
print(merged_df['component_jaccard'].describe())
print("Property Jaccard similarity stats:")
print(merged_df['property_jaccard'].describe()) """

# Convert the columns to sparse matrices so they can be used as features (csr_matrix is imported at the top).
X_system_jaccard = csr_matrix(merged_df[['system_jaccard']].values)
X_component_jaccard = csr_matrix(merged_df[['component_jaccard']].values)
X_property_jaccard = csr_matrix(merged_df[['property_jaccard']].values)

# ------------------------------
# Compute lexical similarity for long_common_name

# We want to compute, for each row, the similarity between the query and the long_common_name.
# First, build a TF-IDF vectorizer fitted on the union of all queries and names.
corpus = pd.concat([merged_df['query'], merged_df['long_common_name']])
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Transform the query and long_common_name columns
X_query = vectorizer.transform(merged_df['query'])
X_name_tfidf = vectorizer.transform(merged_df['long_common_name'])

# Compute cosine similarity for each row
cosine_sim = np.array([cosine_similarity(X_query[i], X_name_tfidf[i])[0, 0] 
                         for i in range(X_query.shape[0])])
merged_df['name_cosine_sim'] = cosine_sim

# Now use the computed cosine similarity as the feature for the name field.
X_name_lexical_similarity = csr_matrix(merged_df[['name_cosine_sim']].values)

# ------------------------------
# Compute semantic similarity for features property, system, component, long_common_name

# Create embedding-based similarity scores between the query and each of:
# long_common_name, property, system, and component (applied row by row).
merged_df['name_similarity'] = merged_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['long_common_name']), axis=1)
merged_df['property_similarity'] = merged_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['property']), axis=1)
merged_df['system_similarity'] = merged_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['system']), axis=1)
merged_df['component_similarity'] = merged_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['component']), axis=1)

# Convert to sparse matrices
X_name_semantic_similarity = csr_matrix(merged_df[['name_similarity']].values)
X_property_similarity = csr_matrix(merged_df[['property_similarity']].values)
X_system_similarity = csr_matrix(merged_df[['system_similarity']].values)
X_component_similarity = csr_matrix(merged_df[['component_similarity']].values)

# ------------------------------
# Process numerical feature: 'rank' using StandardScaler
# Treat a rank of 0 (i.e. a missing rank) as a very high rank by replacing it with max_rank + 1.
# Note: replace(0, ...) only touches zeros; actual NaN values are left unchanged.
max_rank = merged_df['rank'].max()
merged_df['rank'] = merged_df['rank'].replace(0, max_rank + 1)

scaler = StandardScaler()
X_rank = scaler.fit_transform(merged_df[['rank']])
X_rank_sparse = csr_matrix(X_rank)

# ------------------------------
# Combine all features into one sparse matrix.
# Note: X_property_similarity is computed above but is not included here.
X = hstack([
    X_rank_sparse,
    X_name_semantic_similarity, 
    X_name_lexical_similarity,
    X_system_similarity, 
    X_component_similarity,
    X_system_jaccard,
    X_component_jaccard,
    X_property_jaccard
])

# Labels and query identifiers
y = merged_df['relevant'].values
# convert query strings to integers -> required for AdaRank
merged_df['qid_numeric'] = pd.factorize(merged_df['query'])[0]
qid = merged_df['qid_numeric'].values
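
As a worked example of the token-level Jaccard features computed in the cell above, the sketch below repeats the same set arithmetic under the hypothetical name `token_jaccard`. The field strings are made up; the query strings follow the style of the sheet names.

```python
def token_jaccard(a, b):
    """Jaccard similarity of the lower-cased, whitespace-separated tokens of two strings."""
    if not isinstance(a, str) or not isinstance(b, str):
        return 0.0
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# {glucose, in, blood} vs {glucose, in, serum}: 2 shared tokens out of 4 distinct ones
print(token_jaccard("glucose in blood", "Glucose in Serum"))                # 0.5

# Brackets stay attached to their token: 3 shared tokens out of 4 distinct ones
print(token_jaccard("glucose in blood", "Glucose [Mass/volume] in Blood"))  # 0.75

# Abbreviations do not match their long form: "bld" and "blood" are different tokens
print(token_jaccard("cholesterol in Bld", "Blood"))                         # 0.0
```

Because matching is exact per token, abbreviations and punctuation lower this score, which is one reason the embedding-based similarities are computed alongside it.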

In [61]:
merged_df.to_csv('idxdata/indexed_dataset.csv', index=False)
# Save the feature matrix X to a file for later retrieval
save_npz('idxdata/X_sparse.npz', X)
# Save the labels y to a file for later retrieval
np.save('idxdata/y.npy', y)
# Save the query identifiers qid to a file for later retrieval
np.save('idxdata/qid.npy', qid)

In [62]:

X = load_npz('idxdata/X_sparse.npz')
merged_df = pd.read_csv('idxdata/indexed_dataset.csv')
y = np.load('idxdata/y.npy')
qid = np.load('idxdata/qid.npy') 
# Create a string array capturing the name of the features -> used later during evaluation
feature_names = [
    "X_rank_sparse",
    "X_name_semantic_similarity", 
    "X_name_lexical_similarity",
    "X_system_similarity", 
    "X_component_similarity",
    "X_system_jaccard",
    "X_component_jaccard",
    "X_property_jaccard"
]
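
The feature names above are maintained by hand and must match the column order used when the matrix was stacked. One way to keep the two in sync is to build both from a single mapping, sketched below. The helper `stack_named_features` is hypothetical, the values are made up, and a dense NumPy array is used for brevity where the notebook stacks sparse matrices.

```python
import numpy as np

def stack_named_features(columns):
    """Stack 1-D feature columns side by side and return (matrix, names).

    `columns` maps a feature name to a 1-D sequence. Dict insertion order
    defines the column order, so names and columns cannot drift apart.
    """
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError(f"feature columns differ in length: {sorted(lengths)}")
    matrix = np.column_stack([np.asarray(columns[name], dtype=float) for name in names])
    return matrix, names

# Made-up values for three of the features
X_demo, demo_names = stack_named_features({
    "rank": [0.1, -1.2, 0.7],
    "name_semantic_similarity": [0.41, 0.38, 0.52],
    "system_jaccard": [0.0, 0.5, 0.0],
})
print(demo_names, X_demo.shape)
```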

In [63]:
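# Query-aware train/test splitting: all rows of a query end up in the same split.
# We disable the random state (random_state=None) due to the small dataset, so the split differs between runs.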
splitter = GroupShuffleSplit(test_size=0.2, random_state=None)
train_idx, test_idx = next(splitter.split(X, y, groups=qid))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
qid_train, qid_test = qid[train_idx], qid[test_idx]

# Create DataFrames for the train and test sets using the respective indices
train_df = merged_df.iloc[train_idx].copy()
test_df = merged_df.iloc[test_idx].copy()

#Debugging prints
print("Unique queries in train:", np.unique(qid[train_idx]))
print("Unique queries in test:", np.unique(qid[test_idx]))
print("Unique queries and their frequencies:", np.unique(qid, return_counts=True))

#print("X_train:\n", X_train.toarray())
#print("y_train:", y_train)

# Check basic statistics.
# All selected features and the label show non-zero variance / standard deviation,
# so we do not apply feature selection beforehand.
print("X_train summary stats (mean, std):", np.mean(X_train.toarray(), axis=0), np.std(X_train.toarray(), axis=0))
#print("y_train distribution:", np.unique(y_train, return_counts=True))
print("y_train summary stats (mean, std):", np.mean(y_train, axis=0), np.std(y_train, axis=0))

# ------------------------------
# Train and evaluate AdaRank
model = AdaRankv2(max_iter=500, estop=50, scorer=NDCGScorer(k=5))
#model = AdaRank(max_iter=100, estop=100, scorer=NDCGScorer(k=5))
model.fit(X_train, y_train, qid_train)

# Test NDCG for different values of k (the predictions do not depend on k, so compute them once)
y_pred = model.predict(X_test, qid_test)
for k in (1, 2, 3, 4, 5, 10, 20):
    score = NDCGScorer(k=k)(y_test, y_pred, qid_test).mean()
    print(f"NDCG Score {score}, K {k}")

Unique queries in train: [0 1 3 5 6 7]
Unique queries in test: [2 4]
Unique queries and their frequencies: (array([0, 1, 2, 3, 4, 5, 6, 7]), array([517, 491, 517, 514, 550, 521, 550, 550]))
X_train summary stats (mean, std): [0.0056889  0.39012529 0.0403473  0.36812505 0.38595793 0.00403012
 0.01871622 0.05472479] [0.95613794 0.11574428 0.08996267 0.08048696 0.12437721 0.03642981
 0.1132546  0.22744226]
y_train summary stats (mean, std): 0.10976773783009863 0.31260003448778023
NDCG Score 1.0, K 1
NDCG Score 1.0, K 2
NDCG Score 1.0, K 3
NDCG Score 1.0, K 4
NDCG Score 1.0, K 5
NDCG Score 1.0, K 10
NDCG Score 1.0, K 20
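
To make the scores above easier to interpret, here is a minimal sketch of NDCG@k for a single query. It uses the common gain `2^rel - 1` with a `log2` position discount; whether `NDCGScorer` from the `metrics` module uses exactly this formulation is an assumption. The function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(y_true, y_score, k):
    """NDCG@k for one query: DCG of the predicted order divided by the ideal DCG."""
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    ranked = [y_true[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(y_true, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [1, 0, 1, 0]
print(ndcg_at_k(labels, [0.9, 0.1, 0.8, 0.2], k=2))  # both relevant rows on top: 1.0
print(ndcg_at_k(labels, [0.9, 0.8, 0.1, 0.2], k=2))  # one irrelevant row at position 2: below 1.0
```

With binary labels, NDCG@k equals 1.0 as soon as the top k rows are all relevant. About 11% of the training rows are labelled relevant (the mean of `y_train` printed above), so if the test queries are similar, each has far more than 20 relevant candidates and a perfect score at k ≤ 20 does not require ranking every relevant row first.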


In [64]:
# Showcasing some predictions from the test set

# Create a DataFrame for the test set using the test indices
test_df = merged_df.iloc[test_idx].copy()
test_df['y_true'] = y_test
test_df['y_pred'] = y_pred

#Check how many distinct queries are in the test set
unique_test_queries = np.unique(qid_test)
print("Number of unique queries in test:", len(unique_test_queries))
print(unique_test_queries)

all_queries = merged_df['query'].unique()
print("Number of distinct queries overall:", len(all_queries))
print(all_queries)

# for each query in the test set, print the top 10 predictions (sorted by predicted score)
# The model seems to predict well; however, this could be due to the small dataset and overfitting.
print("Top Predictions per Query:")
for query, group in test_df.groupby('query'):
    sorted_group = group.sort_values(by='y_pred', ascending=False)
    print(f"Query: {query}")
    print(sorted_group[['long_common_name', 'y_true', 'y_pred']].head(10))
    print("-" * 40)

Number of unique queries in test: 2
[2 4]
Number of distinct queries overall: 8
['glucose in blood' 'bilirubin in plasma' 'White blood cells count'
 'cholesterol in Bld' 'fever virus' 'calcium oxalate crystals' 'iron'
 'PrThr']
Top Predictions per Query:
Query: White blood cells count
                                       long_common_name  y_true    y_pred
1098  Blasts/Leukocytes [Pure number fraction] in Bo...       1  4.701988
1088  Deprecated Myeloblasts/100 leukocytes in Blood...       1  4.667403
1122  Myelocytes/Leukocytes in Stem cell product by ...       1  4.658710
1112  Deprecated Monocytes+Macrophages/100 leukocyte...       1  4.655984
1097          Large unstained cells/Leukocytes in Blood       1  4.650530
1079          Nonhematic cells/Leukocytes in Body fluid       1  4.629085
1121  Eosinophils/Leukocytes in Stem cell product by...       1  4.628964
1105                    Heterophils/Leukocytes in Blood       1  4.614963
1095  Other cells/Leukocytes in Synovial fluid b
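
The concern noted in the cell above, that the good scores may stem from the small dataset, can be probed by evaluating on every query in turn instead of on one random split. A minimal sketch of such a leave-one-query-out split follows; `leave_one_query_out` is a hypothetical helper, and scikit-learn's `LeaveOneGroupOut` offers the same kind of split.

```python
import numpy as np

def leave_one_query_out(qid):
    """Yield (held_out_query, train_idx, test_idx), holding out one query at a time."""
    qid = np.asarray(qid)
    for query in np.unique(qid):
        test_mask = qid == query
        yield query, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy query ids: three queries with 2, 3 and 1 rows
qid_demo = np.array([0, 0, 1, 1, 1, 2])
folds = list(leave_one_query_out(qid_demo))
for query, train_idx, test_idx in folds:
    print(query, train_idx, test_idx)
```

Inside the loop, a fresh model would be fitted on `X[train_idx]` and scored on `X[test_idx]` as in the training cell, giving one NDCG value per held-out query.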

In [65]:
# Check how the model weights the features 
coef_zip = model.coef_
# Print feature coefficients explicitly
print(coef_zip)
for name, coef in zip(feature_names, coef_zip):
    print(f"Feature: {name}, Importance (coef): {coef:.4f}")

[1.20870325 1.23835749 0.         0.         0.         0.
 0.         0.        ]
Feature: X_rank_sparse, Importance (coef): 1.2087
Feature: X_name_semantic_similarity, Importance (coef): 1.2384
Feature: X_name_lexical_similarity, Importance (coef): 0.0000
Feature: X_system_similarity, Importance (coef): 0.0000
Feature: X_component_similarity, Importance (coef): 0.0000
Feature: X_system_jaccard, Importance (coef): 0.0000
Feature: X_component_jaccard, Importance (coef): 0.0000
Feature: X_property_jaccard, Importance (coef): 0.0000
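
Only two features receive a nonzero weight above. In the original AdaRank formulation each boosting round selects a single feature as its weak ranker and adds a coefficient for it, which would be consistent with this output. The sketch below shows one such round and the query re-weighting step. Whether `AdaRankv2` follows exactly this scheme is an assumption, since its regularization changes are not shown here; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def adarank_round(perf, query_weights):
    """One round of the textbook AdaRank update.

    perf[i, j]       metric in [0, 1] (e.g. NDCG@k) on query i when documents
                     are ordered by feature j alone
    query_weights[i] current weight P_t(i) of query i (weights sum to 1)
    Returns the index of the chosen feature and its coefficient alpha.
    """
    perf = np.asarray(perf, dtype=float)
    query_weights = np.asarray(query_weights, dtype=float)
    # Weak ranker: the single feature with the best weighted performance.
    best = int(np.argmax(query_weights @ perf))
    e = perf[:, best]
    # alpha_t = 0.5 * ln( sum_i P(i) (1 + E_i) / sum_i P(i) (1 - E_i) )
    # (undefined if the feature is perfect on every query, i.e. all E_i == 1)
    alpha = 0.5 * np.log((query_weights @ (1 + e)) / (query_weights @ (1 - e)))
    return best, alpha

def reweight_queries(ensemble_perf):
    """P_{t+1}(i) proportional to exp(-E_i), with E_i the combined ranker's metric on query i."""
    w = np.exp(-np.asarray(ensemble_perf, dtype=float))
    return w / w.sum()

toy_perf = [[0.2, 0.9],   # query 0: metric for feature 0 and feature 1
            [0.4, 0.7]]   # query 1
best_feature, alpha = adarank_round(toy_perf, [0.5, 0.5])
new_weights = reweight_queries([0.9, 0.7])
print(best_feature, alpha, new_weights)
```

Queries on which the current ensemble performs worse get a larger weight in the next round, so later rounds focus on them.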


In [66]:
# Prepare the validation data (repeats the training-time feature steps; could be refactored into a shared function)
xls_val = pd.ExcelFile(val_file_path)
dataframes_val = []
for sheet_name in xls_val.sheet_names:
    temp_df = pd.read_excel(xls_val, sheet_name=sheet_name)
    temp_df['query'] = sheet_name  # Each sheet name is the query text
    dataframes_val.append(temp_df)
val_df = pd.concat(dataframes_val, ignore_index=True)

# ------------------------------
# Compute Lexical Similarity for 'long_common_name'

corpus = pd.concat([val_df['query'], val_df['long_common_name']])
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
X_query_val = vectorizer.transform(val_df['query'])
X_name_tfidf_val = vectorizer.transform(val_df['long_common_name'])
cosine_sim_val = np.array([
    cosine_similarity(X_query_val[i], X_name_tfidf_val[i])[0, 0] 
    for i in range(X_query_val.shape[0])
])
val_df['name_lexical_similarity'] = cosine_sim_val
X_name_lexical_similarity_val = csr_matrix(val_df[['name_lexical_similarity']].values)

# ------------------------------
# Compute Jaccard Similarity for 'component', 'system', and 'property'

# Compute Jaccard similarity for each of the columns, comparing the 'query' to each.
val_df['component_jaccard'] = val_df.apply(
    lambda row: jaccard_similarity(row['query'], row['component']) if pd.notnull(row['component']) else 0.0, axis=1)
val_df['system_jaccard'] = val_df.apply(
    lambda row: jaccard_similarity(row['query'], row['system']) if pd.notnull(row['system']) else 0.0, axis=1)
val_df['property_jaccard'] = val_df.apply(
    lambda row: jaccard_similarity(row['query'], row['property']) if pd.notnull(row['property']) else 0.0, axis=1)

# Convert the columns to sparse matrices so they can be used as features (csr_matrix is imported at the top).
X_component_jaccard_val = csr_matrix(val_df[['component_jaccard']].values)
X_system_jaccard_val = csr_matrix(val_df[['system_jaccard']].values)
X_property_jaccard_val = csr_matrix(val_df[['property_jaccard']].values)
# ------------------------------
# Compute Semantic Similarities

val_df['name_semantic_similarity'] = val_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['long_common_name']), axis=1)
val_df['property_semantic_similarity'] = val_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['property']), axis=1)
val_df['system_semantic_similarity'] = val_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['system']), axis=1)
val_df['component_semantic_similarity'] = val_df.apply(
    lambda row: calculate_embedding_similarity(row['query'], row['component']), axis=1)

X_name_semantic_similarity_val = csr_matrix(val_df[['name_semantic_similarity']].values)
X_property_similarity_val = csr_matrix(val_df[['property_semantic_similarity']].values)
X_system_similarity_val = csr_matrix(val_df[['system_semantic_similarity']].values)
X_component_similarity_val = csr_matrix(val_df[['component_semantic_similarity']].values)

# ------------------------------
# Process Numerical Feature: 'rank'
# Replace zeros with a high rank value as before (here derived from the validation maximum; NaN values are not touched)
max_rank_val = val_df['rank'].max()
val_df['rank'] = val_df['rank'].replace(0, max_rank_val + 1)
# Refit a scaler on the training ranks so that validation ranks are scaled with the training statistics.
scaler_val = StandardScaler().fit(merged_df[['rank']])
X_rank_val = scaler_val.transform(val_df[['rank']])
X_rank_sparse_val = csr_matrix(X_rank_val)

# ------------------------------
# Combine All Features into One Matrix
# Use the same ordering as during training.
X_val = hstack([
    X_rank_sparse_val,
    X_name_semantic_similarity_val, 
    X_name_lexical_similarity_val,
    X_system_similarity_val, 
    X_component_similarity_val,
    X_system_jaccard_val,
    X_component_jaccard_val,
    X_property_jaccard_val
])

KeyboardInterrupt: 
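
In the cell above, the replacement value for missing ranks comes from the validation data (`max_rank_val + 1`), while the scaler is fitted on the training ranks. A minimal sketch of keeping both steps tied to the training data is shown below. `RankPreprocessor` is a hypothetical helper written with plain NumPy, not part of the notebook or of scikit-learn, and it only handles ranks encoded as 0.

```python
import numpy as np

class RankPreprocessor:
    """Replace missing ranks (encoded as 0) and standardise with training statistics."""

    def fit(self, train_rank):
        rank = np.asarray(train_rank, dtype=float)
        # Missing ranks are pushed just past the worst rank seen in training.
        self.fill_value_ = rank.max() + 1
        filled = np.where(rank == 0, self.fill_value_, rank)
        self.mean_ = filled.mean()
        # Population standard deviation (ddof=0), the convention StandardScaler uses.
        self.std_ = filled.std() or 1.0
        return self

    def transform(self, rank):
        rank = np.asarray(rank, dtype=float)
        filled = np.where(rank == 0, self.fill_value_, rank)
        # Return a column vector, matching the shape of a single-feature matrix.
        return ((filled - self.mean_) / self.std_).reshape(-1, 1)

# Toy ranks: the training data contains one missing rank (0)
prep = RankPreprocessor().fit([1, 2, 3, 0])
print(prep.transform([0, 1]))
```

Fitting once on the raw training ranks and calling `transform` on both training and validation ranks applies the same fill value, mean, and standard deviation to both sets.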

In [None]:
val_df['qid_numeric'] = pd.factorize(val_df['query'])[0]
qid_val = val_df['qid_numeric'].values

# Predict Using the Fitted AdaRank Model
# Assume your AdaRank model has already been trained and is available as 'model'
y_val_pred = model.predict(X_val, qid_val)

# Add the predictions to the DataFrame for inspection.
val_df['y_pred'] = y_val_pred
# Print Example Predictions Grouped by Query
print("Validation Predictions:")
for query, group in val_df.groupby('query'):
    print(f"\nQuery: {query}")
    # Adjust column names as needed; here we print the document 'long_common_name' and its prediction.
    print(group[['long_common_name', 'y_pred']]) 

Validation Predictions:

Query: Creatinine Urine or Blood
                                     long_common_name  y_pred
0   11-Deoxycortisol [Mass/volume] in Serum or Plasma     0.0
1   Hemoglobin.gastrointestinal [Presence] in Vomitus     0.0
2           Busulfan [Mass/volume] in Serum or Plasma     0.0
3   Fetal Trisomy 13 risk [Likelihood] based on Pl...     0.0
4   Cholesterol esters/Cholesterol.total in Serum ...     0.0
5    Bacteria identified in Isolate by Aerobe culture     0.0
6                   Meperidine [Presence] in Specimen     0.0
7           Green Bean IgE Ab [Units/volume] in Serum     0.0
8       Japanese Cedar IgE Ab [Units/volume] in Serum     0.0
9   Chronic lymphocytic leukemia gene targeted mut...     0.0
10                              Cell type in Specimen     0.0
11  Drugs identified in Blood by Screen method Nom...     0.0
12  California Live Oak IgE Ab [Units/volume] in S...     0.0
13  Entamoeba histolytica DNA [Presence] in Stool ...     0.0
14  Calcium 

In [None]:
# Function to assign ranking within each query group
def assign_ranking(group):
    group = group.sort_values(by='y_pred', ascending=False).copy()
    group['AdaRank Ranking'] = range(1, len(group) + 1)
    return group

# Create a dictionary where each key is a query and value is the ranked DataFrame for that query
query_groups = {query: assign_ranking(group) for query, group in val_df.groupby('query')}

# Write each query's results to a separate sheet in an Excel file
output_file = 'Results/validation_results_by_query.xlsx'
with pd.ExcelWriter(output_file) as writer:
    for query, df_group in query_groups.items():
        # Excel sheet names can have a maximum of 31 characters, so we truncate if necessary
        sheet_name = query if len(query) <= 31 else query[:31]
        df_group.to_excel(writer, sheet_name=sheet_name, index=False)

print(f"Validation results saved to {output_file}")

Validation results saved to Results/validation_results_by_query.xlsx
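
The export cell above truncates sheet names to 31 characters. Excel also rejects the characters `[ ] : * ? / \` in sheet names and requires names to be unique regardless of case, so two long queries with the same first 31 characters would collide. A minimal sketch of a helper covering these cases follows; `unique_sheet_names` is hypothetical and the example queries, apart from the first, are made up.

```python
import re

def unique_sheet_names(queries, max_len=31):
    """Map each query to an Excel-safe, unique sheet name."""
    names, used = {}, set()
    for query in queries:
        # Replace the characters Excel forbids in sheet names: [ ] : * ? / \
        base = re.sub(r"[\[\]:*?/\\]", "_", str(query)).strip() or "sheet"
        name, counter = base[:max_len], 1
        # Truncation can make two long queries collide; add a numeric suffix.
        while name.lower() in used:
            suffix = f"_{counter}"
            name = base[:max_len - len(suffix)] + suffix
            counter += 1
        used.add(name.lower())
        names[query] = name
    return names

demo_queries = ["Creatinine Urine or Blood", "a" * 40, "a" * 41, "ratio A/B"]
sheet_names = unique_sheet_names(demo_queries)
for query, name in sheet_names.items():
    print(repr(name))
```

The resulting mapping could replace the inline truncation, for example `sheet_name=sheet_names[query]` inside the writer loop.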
