# BioMed: Real World Data - Staged Logistic Regression Model

---

**Group:**
- González Méndez, Alvaro ()
- Reyes Castro, Didier Yamil (didier.reyes.castro@alumnos.upm.es)
- Rodriguez Fernández, Cristina ()

**Course:** BioMedical Informatics - 2025/26

**Institution:** Polytechnic University of Madrid (UPM)

**Date:** October 2026

---

## Goals

This assignment implements a staged logistic regression model on real-world biomedical data. The model is used to rank LOINC documents by their relevance to specific clinical queries.

## 0 Setup

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib

## 1 Implementation

In [None]:
# Loading datasets

DATASET_FIRST_STAGE = 'data/first_stage_data.csv'
DATASET_SECOND_STAGE = 'data/second_stage_data.csv'

MODEL_1_PATH = 'first_stage_logistic_regression_model.joblib'
MODEL_2_PATH = 'second_stage_logistic_regression_model.joblib'

try:
    df_first_stage = pd.read_csv(DATASET_FIRST_STAGE)
    df_second_stage = pd.read_csv(DATASET_SECOND_STAGE)
except FileNotFoundError as e:
    # Stop here: every later cell depends on both datasets being loaded
    raise SystemExit(f"Error loading datasets: {e}")

### 1.1 Part A: Train First Logistic Regression Model (Intra-Clue)

The elementary clues taken into account for the first stage are: TF, IDF, is_in_component and is_in_system.

In [None]:
features_1 = ['TF', 'IDF', 'is_in_component', 'is_in_system']
target_1 = 'relevance'

X1 = df_first_stage[features_1]
Y1 = df_first_stage[target_1]

In [None]:
# Logistic Regression default parameters: penalty='l2', C=1.0, solver='lbfgs'
# solver can be changed to 'liblinear' as it is great for small datasets and binary
# classification. Check: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
model_1 = LogisticRegression()
model_1.fit(X1, Y1)

# Save the trained model to a file
joblib.dump(model_1, MODEL_1_PATH)

### 1.2 Part B: Generate Second-Level Dataset

1. Get the log-odds from the first model: apply it to the first-stage dataset to obtain log O(R|A_i) for each elementary clue A_i.
2. Sum the log-odds per document: group by `doc_id` and `query_id` and sum the log-odds, which gives the Z score of each document-query pair.
3. Complete the second-stage dataset with the Z score; documents with no clues (N = 0) get Z = 0.
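
In symbols, the steps above can be sketched as follows. The coefficient names $c_j$ and $d_j$ and the shorthand $x_{\mathrm{comp},i}$, $x_{\mathrm{sys},i}$ (for `is_in_component` and `is_in_system`) are notation assumed here for illustration; they do not appear in the assignment text. The last line anticipates the second-stage model, which uses `Z` and `N` as its features.

$$
\begin{aligned}
\log O(R \mid A_i) &= c_0 + c_1\,\mathrm{TF}_i + c_2\,\mathrm{IDF}_i + c_3\,x_{\mathrm{comp},i} + c_4\,x_{\mathrm{sys},i} \\
Z &= \sum_{i=1}^{N} \log O(R \mid A_i) \\
\log O(R \mid A_1, \dots, A_N) &= d_0 + d_1\,Z + d_2\,N
\end{aligned}
$$

When a document has no clues for a query, the sum is empty, which is why Z is set to 0 in that case.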

In [None]:
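# 1. Get log-odds: for a binary logistic regression, decision_function returns
# the linear score w·x + b, which is the log-odds of the positive class
df_first_stage['log_odds'] = model_1.decision_function(X1)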
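
As a sanity check on this step, the sketch below verifies on toy data that `decision_function` agrees with both the logit of `predict_proba` and the explicit linear form. The arrays `X_toy` and `y_toy` are made-up values, not taken from the assignment datasets, and the check assumes a binary target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical values, for illustration only)
X_toy = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 0.2],
                  [3.0, 0.1], [0.5, 0.9], [2.5, 0.3]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Log-odds as returned by scikit-learn
log_odds = toy_model.decision_function(X_toy)

# Log-odds recovered from the predicted probability of the positive class
p = toy_model.predict_proba(X_toy)[:, 1]
log_odds_from_proba = np.log(p / (1.0 - p))

# Log-odds computed by hand from the fitted coefficients
linear_form = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]

same = bool(np.allclose(log_odds, log_odds_from_proba)
            and np.allclose(log_odds, linear_form))
print(same)
```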

In [None]:
# 2. Calculate Z score per document
Z = df_first_stage.groupby(['doc_id', 'query_id'])['log_odds'].sum().reset_index()
Z = Z.rename(columns={'log_odds': 'Z'})

In [None]:
# 3. Complete the second stage dataset with the Z score (for those documents with 0 clues (N) fill Z with 0)
df_second_stage = df_second_stage.merge(Z, on=['doc_id', 'query_id'], how='left')
df_second_stage['Z'] = df_second_stage['Z'].fillna(0)
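
The group-sum and left-merge steps can be illustrated on a small example. The frames `first` and `second` below are hypothetical stand-ins for the two datasets, with made-up values; only the column names `doc_id`, `query_id`, `log_odds`, `Z` and `N` follow the notebook.

```python
import pandas as pd

# Hypothetical first-stage rows: one row per (query, document, clue)
first = pd.DataFrame({
    'query_id': [1, 1, 1, 2],
    'doc_id': ['A', 'A', 'B', 'A'],
    'log_odds': [0.5, -1.5, 2.0, 0.25],
})

# Hypothetical second-stage rows: one row per (query, document)
second = pd.DataFrame({
    'query_id': [1, 1, 1, 2],
    'doc_id': ['A', 'B', 'C', 'A'],
    'N': [2, 1, 0, 1],
})

# Sum the log-odds of all clues belonging to the same document-query pair
Z_toy = (first.groupby(['doc_id', 'query_id'], as_index=False)['log_odds']
              .sum()
              .rename(columns={'log_odds': 'Z'}))

# A left merge keeps every second-stage row; pairs with no clues get NaN,
# which is then replaced by 0
merged = second.merge(Z_toy, on=['doc_id', 'query_id'], how='left')
merged['Z'] = merged['Z'].fillna(0)

print(merged)
```

Document C has no first-stage rows for query 1, so its Z comes from `fillna(0)` rather than from the sum.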

### 1.3 Part C: Train Second Logistic Regression Model (Inter-Clue)

In [None]:
features_2 = ['Z', 'N']
target_2 = 'relevance'

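# The Z and N features live in the second-stage dataset, not the first
X2 = df_first_stage[features_2] if False else df_second_stage[features_2]
Y2 = df_second_stage[target_2]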

In [None]:
model_2 = LogisticRegression()
model_2.fit(X2, Y2)

# Save the trained model to a file
joblib.dump(model_2, MODEL_2_PATH)
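
One way the second-stage model could be used for ranking is sketched below: score each document-query pair with the predicted probability of relevance and sort within each query. This is an assumption about how the ranking is produced, not something the notebook specifies, and the `toy` frame, `toy_model_2` and the `score` column are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical second-stage data (made-up values, for illustration only)
toy = pd.DataFrame({
    'query_id': [1, 1, 1, 2, 2, 2],
    'doc_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Z': [2.0, -1.0, 0.0, -2.0, 1.5, 0.0],
    'N': [2, 1, 0, 1, 2, 0],
    'relevance': [1, 0, 0, 0, 1, 0],
})

toy_model_2 = LogisticRegression().fit(toy[['Z', 'N']], toy['relevance'])

# Probability of the positive (relevant) class for every document-query pair
toy['score'] = toy_model_2.predict_proba(toy[['Z', 'N']])[:, 1]

# Rank documents inside each query, highest score first
ranking = toy.sort_values(['query_id', 'score'], ascending=[True, False])

# Best-ranked document per query
top = ranking.groupby('query_id').head(1)
print(top[['query_id', 'doc_id']])
```

Sorting by the log-odds from `decision_function` would give the same order, since the sigmoid is monotonic.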

## 2 Evaluation