# BioMed: Real World Data - Staged Logistic Regression Model

---

**Group:**
- González Méndez, Alvaro (alvaro.gmendez@alumnos.upm.es)
- Reyes Castro, Didier Yamil (didier.reyes.castro@alumnos.upm.es)
- Rodriguez Fernández, Cristina (cristina.rodriguezfernandez@alumnos.upm.es)

**Course:** BioMedical Informatics - 2025/26

**Institution:** Polytechnic University of Madrid (UPM)

**Date:** October 2025

---

## Goals

The goal of the assignment is to implement a staged logistic regression model with real-world biomedical data. The model will be used to rank LOINC documents based on their relevance to specific clinical queries.

## 0 Setup

In [1]:
#!pip install pandas scikit-learn numpy

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np
import math, re

## 1 Implementation

**Change the `VERSION` variable to switch between different dataset versions (2 contains extra training queries).**

In [3]:
# Loading datasets

DATASET_FIRST_STAGE = 'data/first_stage_data.csv'
DATASET_SECOND_STAGE = 'data/second_stage_data.csv'
DATASET_FIRST_STAGE_V2 = 'data/first_stage_data_v2.csv'
DATASET_SECOND_STAGE_V2 = 'data/second_stage_data_v2.csv'

VERSION = 1

try:
    if VERSION == 1:
        df_first_stage = pd.read_csv(DATASET_FIRST_STAGE, decimal=',')
        df_second_stage = pd.read_csv(DATASET_SECOND_STAGE, decimal=',')
    else:
        df_first_stage = pd.read_csv(DATASET_FIRST_STAGE_V2, decimal='.')
        df_second_stage = pd.read_csv(DATASET_SECOND_STAGE_V2, decimal='.')
except FileNotFoundError as e:
    print(f"Error loading datasets: {e}")
    exit(1)

### 1.1 Part A: Train First Logistic Regression Model (Intra-Clue)

The elementary clues taken into account for the first stage are: TF, IDF, is_in_component and is_in_system.

In [4]:
features_1 = ['TF', 'IDF', 'is_in_component', 'is_in_system']
target_1 = 'relevance'

X1 = df_first_stage[features_1]
Y1 = df_first_stage[target_1]

In [5]:
# Logistic Regression default parameters: penalty='l2', C=1.0, solver='lbfgs'
# solver can be changed to 'liblinear' as it is great for small datasets and binary
# classification. Check: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
model_1 = LogisticRegression()
model_1.fit(X1, Y1)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


### 1.2 Part B: Generate Second-Level Dataset

1. Get the log-odds from the first model: apply it to the first-stage dataset to obtain log O(R | A_i) for each elementary clue.
2. Sum the log-odds per document and query (group by `loinc_num` and `query_id`) to obtain the Z score of each document.
3. Add the Z score to the second-stage dataset; for documents with no clues (N = 0), set Z to 0.
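
Written out, the steps above and the second-stage model trained in Part C take the following form, assuming the feature sets used in this notebook. The coefficient symbols $b_j$ and $c_j$ are notation introduced here for illustration and do not appear in the code; the sum runs over the elementary clues $A_i$ recorded for one document and query.

$$
\begin{aligned}
\log O(R \mid A_i) &= b_0 + b_1\,\mathrm{TF}_i + b_2\,\mathrm{IDF}_i + b_3\,\mathrm{is\_in\_component}_i + b_4\,\mathrm{is\_in\_system}_i \\
Z &= \sum_{i} \log O(R \mid A_i) \\
\log O(R \mid A_1, \dots, A_N) &= c_0 + c_1\,Z + c_2\,N
\end{aligned}
$$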

In [6]:
# 1. Get log-Odds
df_first_stage['log_odds'] = model_1.decision_function(X1)
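
For a binary `LogisticRegression`, `decision_function` returns the log-odds of the positive class, which is why it is used here instead of `predict_proba`. The sketch below checks that relationship on made-up toy data; `X_toy`, `y_toy` and `toy_model` are illustrative names and are not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, made-up data: one feature, binary relevance labels
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Log-odds of the positive class, one value per row
log_odds = toy_model.decision_function(X_toy)

# Probability of the positive class, one value per row
p_relevant = toy_model.predict_proba(X_toy)[:, 1]

# log(p / (1 - p)) recovers the decision_function output
log_odds_from_proba = np.log(p_relevant / (1.0 - p_relevant))
print(np.allclose(log_odds, log_odds_from_proba))
```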

In [7]:
# 2. Calculate Z score per document
Z = df_first_stage.groupby(['loinc_num', 'query_id'])['log_odds'].sum().reset_index()
Z = Z.rename(columns={'log_odds': 'Z'})
print(Z)

    loinc_num                 query_id         Z
0      1003-3         bilirubin_plasma -2.026019
1      1003-3            glucose_blood -2.141078
2      1003-3  white_blood_cells_count -2.141078
3     10331-7            glucose_blood -2.387183
4     10331-7  white_blood_cells_count -2.387183
..        ...                      ...       ...
131     925-8  white_blood_cells_count -1.198329
132     933-2            glucose_blood -1.198329
133     933-2  white_blood_cells_count -1.198329
134     934-0            glucose_blood -1.198329
135     934-0  white_blood_cells_count -1.198329

[136 rows x 3 columns]


In [8]:
# 3. Complete the second stage dataset with the Z score (for those documents with 0 clues (N) fill Z with 0)
df_second_stage = df_second_stage.merge(Z, on=['loinc_num', 'query_id'], how='left')
df_second_stage['Z'] = df_second_stage['Z'].fillna(0)
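
The group-by, left merge and `fillna` steps can be followed on a small example. The sketch below uses made-up documents (`A-1`, `B-2`, `C-3`) and log-odds values; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical first-stage rows: one row per (document, query, clue)
toy_first = pd.DataFrame({
    'loinc_num': ['A-1', 'A-1', 'B-2'],
    'query_id':  ['q1',  'q1',  'q1'],
    'log_odds':  [-1.0,  -0.5,  -2.0],
})

# Hypothetical second-stage rows: one row per (document, query)
toy_second = pd.DataFrame({
    'loinc_num': ['A-1', 'B-2', 'C-3'],
    'query_id':  ['q1',  'q1',  'q1'],
    'N':         [2,     1,     0],
})

# Sum the clue-level log-odds per document and query
toy_Z = (toy_first.groupby(['loinc_num', 'query_id'])['log_odds']
         .sum().reset_index().rename(columns={'log_odds': 'Z'}))

# The left merge keeps C-3, which has no clues and therefore no Z yet
toy_second = toy_second.merge(toy_Z, on=['loinc_num', 'query_id'], how='left')
toy_second['Z'] = toy_second['Z'].fillna(0)
print(toy_second)
```

Document `C-3` has no first-stage rows, so the merge leaves its `Z` as NaN until `fillna(0)` replaces it.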

In [9]:
display(df_second_stage)

| Unnamed: 0 | loinc_num | query_id | N | relevance | Z |
|---|---|---|---|---|---|
| 0 | 74774-1 | glucose_blood | 2 | 1 | -1.438414 |
| 1 | 35184-1 | glucose_blood | 2 | 1 | -1.684519 |
| 2 | 14764-5 | glucose_blood | 2 | 1 | -1.438414 |
| 3 | 14749-6 | glucose_blood | 2 | 1 | -1.684519 |
| 4 | 934-0 | glucose_blood | 1 | 0 | -1.198329 |
| ... | ... | ... | ... | ... | ... |
| 196 | 14423-8 | white_blood_cells_count | 0 | 0 | 0.000000 |
| 197 | 13317-3 | white_blood_cells_count | 0 | 0 | 0.000000 |
| 198 | 1250-0 | white_blood_cells_count | 1 | 0 | -2.633288 |
| 199 | 10331-7 | white_blood_cells_count | 1 | 0 | -2.387183 |


### 1.3 Part C: Train Second Logistic Regression Model (Inter-Clue)

In [10]:
features_2 = ['Z', 'N']
target_2 = 'relevance'

X2 = df_second_stage[features_2]
Y2 = df_second_stage[target_2]

In [11]:
model_2 = LogisticRegression()
model_2.fit(X2, Y2)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


## 2 Retrieval

This section ranks the corpus documents for a given query using the two-stage logistic regression model.

**Change the `SEARCH_QUERY` variable to test with different queries or modify `NUM_RESULTS` to see more or fewer results.**

In [12]:
# Ideally, these structures should be generated from a large
# biomedical knowledge base. They are hardcoded here for simplicity.
THESAURUS = {
    'glucose': ['glucose'],
    'blood': ['bld', 'serum', 'ser', 'plasma', 'plas'],
    'bilirubin': ['bilirubin'],
    'plasma': ['plas'],
    'white blood cells': ['wbc', 'leukocytes', 'lymphocytes', 'monocytes'],
    'hydrothorax': ['pleural fluid','plr fld'],
    'creatinine': ['creatinine'],
    'serum': ['serum','ser'],
    'tyrosine': ['tyrosine'],
    'calcium': ['calcium'],
    'albumin': ['albumin']
}



# Ideally, this mapping should be generated at runtime from the THESAURUS
# This would be part of a large information retrieval module but it is
# out of scope for this example.
QUERY_TO_CONCEPTS = {
    'glucose in blood': ['glucose', 'blood'],
    'bilirubin in plasma': ['bilirubin', 'plasma'],
    'white blood cells count': ['white blood cells','blood','cells','count'],
    'creatinine in serum': ['creatinine','serum'],
    'tyrosine in blood': ['tyrosine','blood'],
    'glucose in hydrothorax':['glucose','hydrothorax'],
    'calcium in blood': ['calcium','blood'],
    'albumin in blood': ['albumin','blood']
}

# Getting our Corpus
CORPUS_PATH = 'data/loinc_docs.csv'
try:
    df_corpus = pd.read_csv(CORPUS_PATH)
except FileNotFoundError as e:
    print(f"Error loading corpus dataset: {e}")
    exit(1)
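
The comments above note that the query-to-concept mapping would ideally be derived from the thesaurus at runtime. The sketch below is one possible way to do that, not the method used in this notebook: `concepts_from_query` is a hypothetical helper that returns the thesaurus keys found as whole words in the query, and `TOY_THESAURUS` repeats a few entries of `THESAURUS` so that the block is self-contained.

```python
import re

# Small subset of the notebook's THESAURUS, repeated so the sketch is self-contained
TOY_THESAURUS = {
    'glucose': ['glucose'],
    'blood': ['bld', 'serum', 'ser', 'plasma', 'plas'],
    'white blood cells': ['wbc', 'leukocytes', 'lymphocytes', 'monocytes'],
}

def concepts_from_query(query, thesaurus):
    """Hypothetical helper: return thesaurus keys found as whole words in the query."""
    hits = []
    for concept in thesaurus:
        match = re.search(r'\b' + re.escape(concept.lower()) + r'\b', query.lower())
        if match:
            hits.append((match.start(), -len(concept), concept))
    # Order by position in the query; longer concepts first on ties
    return [concept for _, _, concept in sorted(hits)]

print(concepts_from_query('glucose in blood', TOY_THESAURUS))
print(concepts_from_query('white blood cells count', TOY_THESAURUS))
```

Because it only looks up thesaurus keys, this sketch returns fewer concepts for 'white blood cells count' than the hand-written mapping, which also lists 'cells' and 'count'.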

In [13]:
def check_match(field, concept_terms):
    return any(term.lower() in field.lower() for term in concept_terms)

def check_appears_in_document(loinc_doc, concept_terms):
    return check_match(loinc_doc['long_common_name'], concept_terms) or \
        check_match(loinc_doc['component'], concept_terms) or \
        check_match(loinc_doc['system'], concept_terms)

In [14]:
def build_first_stage_dataset(concepts,query):
  # Ensure elementary_clues_df is empty before filling
  elementary_clues_df = pd.DataFrame(columns=['loinc_num', 'query_id', 'composite_clue', 'TF', 'IDF', 'is_in_component', 'is_in_system'])

  # Calculate total number of documents
  N = len(df_corpus)

  # Calculate document frequency for original composites (for IDF of the composite itself)
  doc_freq_composites = {}
  for composite in concepts:
      lower_composite = composite.lower()
      # Get the composite and its synonyms, including the composite itself
      terms_to_check_doc_freq = [lower_composite]
      if composite in THESAURUS:
          terms_to_check_doc_freq.extend([synonym.lower() for synonym in THESAURUS[composite]])

      # Count the number of documents where ANY of the terms in this group appear (using whole word match)
      count = df_corpus.apply(lambda row: any(re.search(r'\b' + re.escape(term) + r'\b', str(row['long_common_name']).lower()) or
                                      re.search(r'\b' + re.escape(term) + r'\b', str(row['component']).lower()) or
                                      re.search(r'\b' + re.escape(term) + r'\b', str(row['system']).lower()) for term in terms_to_check_doc_freq), axis=1).sum()
      doc_freq_composites[composite] = count

  # Pre-calculate terms to count for each original composite (composite + its synonyms)
  terms_to_count_map = {}
  for composite in concepts:
      lower_composite = composite.lower()
      terms = [lower_composite]
      if composite in THESAURUS:
          terms.extend([synonym.lower() for synonym in THESAURUS[composite]])
      terms_to_count_map[composite] = terms


  # Iterate over each document (row) in the corpus DataFrame (df_corpus)
  for index, row in df_corpus.iterrows():
      loinc_num = row['loinc_num']
      long_common_name = str(row['long_common_name']).lower()
      component = str(row['component']).lower()
      system = str(row['system']).lower()

      # Iterate over each original composite in the concepts list
      for composite in concepts:
          lower_composite = composite.lower()
          terms_to_count = terms_to_count_map[composite]

          # Check if ANY term related to this composite (composite or synonyms) is present in the document (using whole word match)
          any_term_found = any(re.search(r'\b' + re.escape(term) + r'\b', long_common_name) or
                              re.search(r'\b' + re.escape(term) + r'\b', component) or
                              re.search(r'\b' + re.escape(term) + r'\b', system) for term in terms_to_count)

          # If at least one related term was found in the document
          if any_term_found:
              # Calculate the total counts for X1, X3, X4 by summing occurrences of all related terms in the document (using whole word match)
              total_x1 = sum(len(re.findall(r'\b' + re.escape(term) + r'\b', long_common_name)) for term in terms_to_count)
              x3 = int(sum(len(re.findall(r'\b' + re.escape(term) + r'\b', component)) for term in terms_to_count) > 0)
              x4 = int(sum(len(re.findall(r'\b' + re.escape(term) + r'\b', system)) for term in terms_to_count) > 0)

              # Calculate X2 (IDF) for the ORIGINAL composite term
              x2 = np.log10(N / (doc_freq_composites.get(composite, 0)))

              # Create a new row dictionary for the original composite in this document
              new_row = {
                  'loinc_num': loinc_num,
                  'query_id': query,
                  'composite_clue': composite,
                  'TF': total_x1,
                  'IDF': x2,
                  'is_in_component': x3,
                  'is_in_system': x4
              }
              # Add the new row to the elementary_clues_df DataFrame
              elementary_clues_df.loc[len(elementary_clues_df)] = new_row

  return elementary_clues_df
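
To make the four clues concrete, the sketch below computes them for one concept on a made-up three-document corpus, using the same whole-word matching and base-10 IDF as `build_first_stage_dataset`. The helper `count_whole_word`, the corpus `toy_corpus` and the term list are illustrative assumptions, not part of the assignment data.

```python
import re
import numpy as np

def count_whole_word(term, text):
    """Count whole-word occurrences of term in text (case-insensitive)."""
    return len(re.findall(r'\b' + re.escape(term.lower()) + r'\b', text.lower()))

# Made-up three-document corpus with the same fields as df_corpus
toy_corpus = [
    {'long_common_name': 'Glucose [Mass/volume] in Blood', 'component': 'Glucose', 'system': 'Bld'},
    {'long_common_name': 'Glucose [Moles/volume] in Urine', 'component': 'Glucose', 'system': 'Urine'},
    {'long_common_name': 'Bilirubin [Mass/volume] in Plasma', 'component': 'Bilirubin', 'system': 'Plas'},
]
terms = ['blood', 'bld']  # a concept plus one synonym

def doc_matches(doc):
    """True if any term appears as a whole word in any field of the document."""
    return any(count_whole_word(t, field) > 0 for t in terms for field in doc.values())

# IDF: log10 of (corpus size / number of documents matching the concept)
doc_freq = sum(doc_matches(doc) for doc in toy_corpus)
idf = np.log10(len(toy_corpus) / doc_freq)

# Clues for the first document
doc = toy_corpus[0]
tf = sum(count_whole_word(t, doc['long_common_name']) for t in terms)
is_in_component = int(any(count_whole_word(t, doc['component']) for t in terms))
is_in_system = int(any(count_whole_word(t, doc['system']) for t in terms))
print(tf, round(idf, 4), is_in_component, is_in_system)
```

Only the first document matches the concept, so the document frequency is 1 and the IDF is log10(3).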

In [15]:
def build_second_stage_dataset(concepts, df, query):
  # Create the second dataset DataFrame
  second_dataset_df = pd.DataFrame(columns=['loinc_num', 'query_id', 'Z', 'N'])

  # Iterate over each document (row) in the corpus DataFrame (df_corpus)
  for index, row in df_corpus.iterrows():
      loinc_num = row['loinc_num']
      long_common_name = str(row['long_common_name']).lower()
      component = str(row['component']).lower()
      system = str(row['system']).lower()

      # Calculate N: number of unique composites (matched directly or via a thesaurus synonym) found in long_common_name
      found_composites = set()
      # Iterate over each original composite in the composites list
      for composite in concepts:
          lower_composite = composite.lower()
          # Get the composite and its synonyms, including the composite itself
          terms_to_check = [lower_composite]
          if composite in THESAURUS:
              terms_to_check.extend([synonym.lower() for synonym in THESAURUS[composite]])

          # Check if ANY of the terms related to this composite (composite or synonyms) is present in the document (using whole word match)
          if any(re.search(r'\b' + re.escape(term) + r'\b', long_common_name) for term in terms_to_check):
              found_composites.add(composite)

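      # Sum the first-stage log-odds of this document's clues. Documents with
      # no clues get Z = 0, matching the training data (Part B, step 3);
      # without this default, dz would be undefined or carried over from the
      # previous document.
      z = df[df['loinc_num'] == loinc_num]
      dz = z['log_odds'].sum() if len(z) > 0 else 0.0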

      n_count = len(found_composites)

      # Create a new row dictionary for the second dataset
      new_row = {
          'loinc_num': loinc_num,
          'query_id': query,
          'Z': dz,
          'N': n_count
      }

      # Add the new row to the second_dataset_df DataFrame
      second_dataset_df.loc[len(second_dataset_df)] = new_row

  return second_dataset_df

In [16]:
def rank_documents(query):

    # 1. Get the concepts for the query. Again this would be part of a larger
    # information retrieval module.
    concepts = QUERY_TO_CONCEPTS.get(query)

    if not concepts:
        print(f"No concepts found for query: {query}")
        return None

    # 2. Build dataset #1 for the query
    df_first_stage_query = build_first_stage_dataset(concepts, query)

    # 3. Get log-odds from first model
    df_first_stage_query['log_odds'] = model_1.decision_function(df_first_stage_query[features_1])
    # display(df_first_stage_query)

    # 4. Build second stage dataset
    df_second_stage_query = build_second_stage_dataset(concepts,df_first_stage_query, query)
    # display(df_second_stage_query)
    # print("Reached this point")

    # 5. Predict relevance using second model
    df_second_stage_query['final_score'] = model_2.decision_function(df_second_stage_query[features_2])

    # Merge with df_corpus to get the 'long_common_name'
    df_second_stage_query = pd.merge(df_second_stage_query, df_corpus[['loinc_num', 'long_common_name']], on='loinc_num', how='left')

    # 6. Rank documents based on final score
    df_ranked = df_second_stage_query.sort_values(by='final_score', ascending=False)

    return df_ranked[['loinc_num', 'long_common_name', 'final_score']]
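
The `final_score` column holds log-odds, so the values can be negative and are not probabilities. If a probability of relevance is preferred for display, the logistic function converts one into the other without changing the ranking, since it is strictly increasing. The sketch below assumes a hypothetical helper `log_odds_to_probability` and made-up scores.

```python
import numpy as np

def log_odds_to_probability(score):
    """Map a log-odds score to a probability with the logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(score, dtype=float)))

# Made-up log-odds scores, already sorted in descending order
toy_scores = np.array([1.0, 0.0, -2.0])
toy_probabilities = log_odds_to_probability(toy_scores)
print(toy_probabilities)
```

A score of 0 corresponds to a probability of 0.5, and the order of the documents is the same under either scale.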

In [17]:
SEARCH_QUERY = "glucose in blood"
NUM_RESULTS = 5

ranked_list_1 = rank_documents(SEARCH_QUERY)
if ranked_list_1 is not None:
    print(f"--- Top {NUM_RESULTS} Results for '{SEARCH_QUERY}' ---")
    print(ranked_list_1.head(NUM_RESULTS))

--- Top 5 Results for 'glucose in blood' ---
   loinc_num                                   long_common_name  final_score
2    14764-5  Glucose [Moles/volume] in Serum or Plasma --3 ...     1.156934
0    74774-1    Glucose [Mass/volume] in Serum, Plasma or Blood     1.014601
1    35184-1  Fasting glucose [Mass or Moles/volume] in Seru...     0.872269
3    14749-6          Glucose [Moles/volume] in Serum or Plasma     0.872269
59   15076-3                    Glucose [Moles/volume] in Urine    -1.805175


In [21]:
SEARCH_QUERY = "tyrosine in blood"
NUM_RESULTS = 5

ranked_list_1 = rank_documents(SEARCH_QUERY)
if ranked_list_1 is not None:
    print(f"--- Top {NUM_RESULTS} Results for '{SEARCH_QUERY}' ---")
    print(ranked_list_1.head(NUM_RESULTS))

--- Top 5 Results for 'tyrosine in blood' ---
   loinc_num                                   long_common_name  final_score
17    3082-5  Tyrosine aminotransferase [Mass/volume] in Plasma     1.538578
7      890-4  Blood group antibody screen [Presence] in Seru...    -2.716547
6      925-8                   Blood product disposition [Type]    -2.904406
5      933-2                                 Blood product type    -2.904406
4      934-0                          Blood product unit ID [#]    -2.904406


In [19]:
SEARCH_QUERY = "calcium in blood"
NUM_RESULTS = 5

ranked_list_1 = rank_documents(SEARCH_QUERY)
if ranked_list_1 is not None:
    print(f"--- Top {NUM_RESULTS} Results for '{SEARCH_QUERY}' ---")
    print(ranked_list_1.head(NUM_RESULTS))

--- Top 5 Results for 'calcium in blood' ---
   loinc_num                                   long_common_name  final_score
55   17861-6           Calcium [Mass/volume] in Serum or Plasma     0.954553
37    1995-0  Calcium.ionized [Moles/volume] in Serum or Plasma     0.954553
19   29265-6  Calcium [Moles/volume] corrected for albumin i...     0.954553
38    1994-3            Calcium.ionized [Moles/volume] in Blood     0.812221
12   54439-5                Calcium bilirubinate/Total in Stone    -1.722892
