# Predictive Modeling of Hospital Length of Stay and Discharge Type
# [Step 1: Data cleaning - Laboratory data]

This notebook explores and cleans the **laboratory results dataset** that will be used in the analysis.

The dataset, provided by the Insel Data Science Center (IDSC) of Bern, contains laboratory data spanning approximately 16 years from Inselspital, the university hospital of Bern.

The data includes:

- Laboratory test results from hospital visits.

# 1. Import libraries and load datasets

In [1]:
# Import data manipulation library
import pandas as pd
# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Set data path
lab_data_path = "/home/anna/Desktop/Master_thesis/raw_data/RITM0154633_del_20240923/RITM0154633_lab.csv"
lab_output_path = "/home/anna/Desktop/Master_thesis/output_data/cleaned_lab_data"

# Load lab dataset 
lab_data = pd.read_csv(lab_data_path)

### First look at the dataset:

In [3]:
display(lab_data)

Unnamed: 0,dim_patient_bk_pseudo,dim_fall_bk_pseudo,bezeichnung,kurzbezeichnung,methodenummer,ergebnis_numerisch,ergebnis_text,unit
0,1,171465,Natrium,,1,138.0,138,mmol/L
1,1,171465,Kalium,KA,3,4.6,4.6,mmol/L
2,1,171465,Hämolytisch,H-Se,42,4.0,4,Unknown
3,1,171465,Lipämisch,L-Se,43,3.0,3,Unknown
4,1,171465,Ikterisch,I-Se,44,1.0,1,Unknown
...,...,...,...,...,...,...,...,...
20486999,240990,415184,RDW,RDWn,9117,14.3,14.3,%
20487000,240990,415184,Thrombozyten,THZn,9119,226.0,226,G/L
20487001,240990,415184,MPV,MPVn,9204,9.0,9.0,fL
20487002,240990,415184,Normoblasten maschinell,NRBCmn,9205,0.0,0.00,/100 Leuk.


# 2. Lab data cleaning and exploration

## 2.1 Extract information about the dataset

The dataset comprises 20,487,004 rows and 8 columns representing different variables. 
These variables include a mix of data types, specifically:
- one float variable, 
- three integer variables, 
- and four string variables.

In [4]:
# Information about the dataset
print(lab_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20487004 entries, 0 to 20487003
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   dim_patient_bk_pseudo  int64  
 1   dim_fall_bk_pseudo     int64  
 2   bezeichnung            object 
 3   kurzbezeichnung        object 
 4   methodenummer          int64  
 5   ergebnis_numerisch     float64
 6   ergebnis_text          object 
 7   unit                   object 
dtypes: float64(1), int64(3), object(4)
memory usage: 1.2+ GB
None


In [5]:
# Optimize memory usage by converting numerical columns to smaller data types
  
lab_data["ergebnis_numerisch"] = lab_data["ergebnis_numerisch"].astype("float32")  # Reduce float precision  

for col in ["dim_patient_bk_pseudo", "dim_fall_bk_pseudo", "methodenummer"]:  
    lab_data[col] = lab_data[col].astype("int32")  # Use smaller integer type  
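One caveat of the downcast to `float32` is that many decimal values can no longer be stored exactly, which affects exact-equality checks such as the `isin([9999, 5001.36])` filter used later in this notebook. The sketch below uses toy values (not taken from the dataset) and assumes NumPy is available; the tolerance `atol=1e-2` is an illustrative choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Toy values, not taken from the dataset
s64 = pd.Series([9999.0, 5001.36, 4.6])
s32 = s64.astype("float32")

# Whole numbers of this size are stored exactly in float32
exact_match = bool((s32 == 9999).iloc[0])

# 5001.36 is rounded when stored as float32; upcasting back reveals the difference
roundtrip_equal = bool(s32.astype("float64").iloc[1] == 5001.36)

# A tolerance-based comparison is robust to this rounding
close_match = bool(np.isclose(s32.to_numpy(), 5001.36, rtol=0, atol=1e-2)[1])

print(exact_match, roundtrip_equal, close_match)
```

Whether a direct comparison against a decimal literal still matches after the downcast can depend on the library versions in use, so a tolerance-based check is the safer option for non-integer placeholders.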

In [6]:
# Translate columns from German to English

# Translation dictionary
column_translation = {
    "dim_patient_bk_pseudo": "patient_id",
    "dim_fall_bk_pseudo": "case_id",
    "bezeichnung": "test_name",
    "kurzbezeichnung": "test_abbr",
    "methodenummer": "method_number",
    "ergebnis_numerisch": "numeric_result",
    "ergebnis_text": "text_result",
    "unit": "unit"
}

# Rename columns
lab_data.rename(columns=column_translation, inplace=True)

# Check new column names
print(lab_data.columns)

display(lab_data)

Index(['patient_id', 'case_id', 'test_name', 'test_abbr', 'method_number',
       'numeric_result', 'text_result', 'unit'],
      dtype='object')


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
0,1,171465,Natrium,,1,138.0,138,mmol/L
1,1,171465,Kalium,KA,3,4.6,4.6,mmol/L
2,1,171465,Hämolytisch,H-Se,42,4.0,4,Unknown
3,1,171465,Lipämisch,L-Se,43,3.0,3,Unknown
4,1,171465,Ikterisch,I-Se,44,1.0,1,Unknown
...,...,...,...,...,...,...,...,...
20486999,240990,415184,RDW,RDWn,9117,14.3,14.3,%
20487000,240990,415184,Thrombozyten,THZn,9119,226.0,226,G/L
20487001,240990,415184,MPV,MPVn,9204,9.0,9.0,fL
20487002,240990,415184,Normoblasten maschinell,NRBCmn,9205,0.0,0.00,/100 Leuk.


In [7]:
# Get the number of unique entries per column
unique_counts_lab = lab_data.nunique()

# Display results
print(unique_counts_lab)

patient_id        182238
case_id           311611
test_name           3469
test_abbr           4608
method_number       4695
numeric_result     62295
text_result        98458
unit                 168
dtype: int64


## 2.2 Check for missing values

In [8]:
# Total missing values
print(f"Total missing values:\n{lab_data.isna().sum().sum()}")

# Check rows with missing values
print(lab_data.isna().sum())

Total missing values:
2740336
patient_id              0
case_id                 0
test_name             779
test_abbr          387136
method_number           0
numeric_result    1175245
text_result       1171815
unit                 5361
dtype: int64
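Absolute counts are hard to judge against roughly 20 million rows, so the share of missing values per column can be easier to read. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "test_name": ["Natrium", None, "Kalium", "RDW"],
    "unit": ["mmol/L", "Unknown", None, None],
})

# isna() gives booleans; their mean is the fraction of missing entries per column
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)
```

Applied to the real data, the same expression would be `lab_data.isna().mean().mul(100)`.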


### 2.2.1 Explore missing values in test_name column

In [9]:
# Isolate rows with null values
null_test_name = lab_data[lab_data['test_name'].isnull()]

print(f"Number of missing values in test_name column: {len(null_test_name)}")

# Display rows with missing test_name
display(null_test_name)

Number of missing values in test_name column: 779


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
63168,772,352562,,B1adn+,34509,,,Unknown
63169,772,352562,,K2adn-,34513,,,Unknown
63170,772,352562,,B2adn-,34514,,,Unknown
147627,1800,154714,,K1adn+,34502,,,Unknown
192265,2337,86522,,B1adn-,34508,,,Unknown
...,...,...,...,...,...,...,...,...
20443413,240733,402028,,B1-dn+,34510,,,Unknown
20443414,240733,402028,,B1-dn-,34511,,,Unknown
20455809,240808,407027,,K1adn-,34501,,,Unknown
20455810,240808,407027,,B1adn-,34508,,,Unknown


### 2.2.2 Explore missing values in test_abbr column

In [10]:
# Isolate rows with null values
null_test_abbr = lab_data[lab_data['test_abbr'].isnull()]

print(f"Number of missing values in test_abbr column: {len(null_test_abbr)}")

# Display rows with null values
display(null_test_abbr)

Number of missing values in test_abbr column: 387136


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
0,1,171465,Natrium,,1,138.0,138,mmol/L
129,1,333396,Natrium,,1,137.0,137,mmol/L
249,2,27091,Natrium,,1,136.0,136,mmol/L
250,2,27091,Natrium,,1,135.0,135,mmol/L
331,2,36154,Natrium,,1,138.0,138,mmol/L
...,...,...,...,...,...,...,...,...
20486787,240988,393440,Natrium,,1,140.0,140,mmol/L
20486855,240988,412516,Natrium,,1,139.0,139,mmol/L
20486856,240988,412516,Natrium,,1,137.0,137,mmol/L
20486949,240989,393141,Natrium,,1,141.0,141,mmol/L


In [11]:
# Print unique test_name values where test_abbr is missing
print(f"Unique test_name values where test_abbr is missing: {null_test_abbr['test_name'].unique()}")

# Replace missing test_abbr values where test_name is 'Natrium'
print(f"\nReplace missing test_abbr values where test_name is 'Natrium'...\n")
lab_data.loc[lab_data['test_name'] == 'Natrium', 'test_abbr'] = 'Na'

# Print number of missing values in test_abbr column after replacement
print(f"Number of missing values in test_abbr column after replacement: {lab_data['test_abbr'].isnull().sum()}")

Unique test_name values where test_abbr is missing: ['Natrium']

Replace missing test_abbr values where test_name is 'Natrium'...

Number of missing values in test_abbr column after replacement: 0
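Note that the replacement above assigns 'Na' to every Natrium row, not only to rows with a missing abbreviation. Here this is harmless if all Natrium rows lack an abbreviation, but a more conservative variant would restrict the assignment to missing entries. A sketch on toy data, where "NA-x" is a made-up abbreviation used only for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "test_name": ["Natrium", "Natrium", "Kalium"],
    "test_abbr": [None, "NA-x", "KA"],  # "NA-x" is hypothetical
})

# Fill only rows where the abbreviation is actually missing
fill_mask = (toy["test_name"] == "Natrium") & toy["test_abbr"].isna()
toy.loc[fill_mask, "test_abbr"] = "Na"
print(toy)
```

Which behavior is preferable depends on whether existing abbreviations for the same test should be harmonized or preserved.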


In [12]:
# Remove temporary variable
del null_test_abbr

### 2.2.3 Explore missing values in numeric_result and text_result columns

In [13]:
# Isolate rows with null values
null_num_res = lab_data[lab_data['numeric_result'].isna()]

# Display rows with missing numeric_result
display(null_num_res)

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
120,1,171465,Neutrophile,NEUm#n,9225,,,G/L
121,1,171465,Lymphozyten,LYMm#n,9227,,,G/L
122,1,171465,Monozyten,MONm#n,9229,,,G/L
123,1,171465,Eosinophile,EOSm#n,9231,,,G/L
124,1,171465,Basophile,BASm#n,9233,,,G/L
...,...,...,...,...,...,...,...,...
20486853,240988,393440,Hämatogramm 2,HGR 2n,9702,,,Unknown
20486933,240988,412516,Hämatogramm 2,HGR 2n,9702,,,Unknown
20486948,240988,425806,Hämatogramm 2,HGR 2n,9702,,,Unknown
20486978,240989,393141,Hämatogramm 2,HGR 2n,9702,,,Unknown


In [14]:
# Store the initial number of rows
initial_rows = len(lab_data)

# Remove rows where both text_result and numeric_result are missing
lab_data = lab_data.dropna(subset=['text_result', 'numeric_result'], how='all')

# Compute the number of removed rows
removed_rows = initial_rows - len(lab_data)

# Print the result
print(f"Number of rows removed: {removed_rows}")

Number of rows removed: 1171775
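The `how='all'` argument matters here: a row is dropped only when every column listed in `subset` is missing, so rows that still carry a text result are kept. A small sketch with toy values:

```python
import pandas as pd

toy = pd.DataFrame({
    "numeric_result": [1.0, None, None],
    "text_result": ["1", "AL", None],
})

# Only the last row has both results missing, so only that row is dropped
kept = toy.dropna(subset=["text_result", "numeric_result"], how="all")
print(len(toy) - len(kept))
```

With the default `how='any'`, the second row (text result only) would be dropped as well.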


In [15]:
# Total missing values
print(f"Total missing values after cleaning:\n{lab_data.isna().sum().sum()}")

# Check rows with missing values
print(lab_data.isna().sum())

Total missing values after cleaning:
8852
patient_id           0
case_id              0
test_name            0
test_abbr            0
method_number        0
numeric_result    3470
text_result         40
unit              5342
dtype: int64


In [16]:
# Remove temporary variable
del null_num_res

### 2.2.4 Explore missing values in numeric_result column

In [17]:
# Isolate rows with null values in 'numeric_result' column
num_res_null = lab_data[lab_data['numeric_result'].isna()]

print(f"Number of missing values in numeric_result column: {len(num_res_null)}")

# Display rows with missing numeric_result
display(num_res_null)

Number of missing values in numeric_result column: 3470


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
5673,54,244839,Fragestellung,CCFFra,59001,,AL,Unknown
63096,772,352562,Megakaryopoese,KMAMK1,28030,,1,Unknown
63097,772,352562,Megakaryopoese,KMAMK3,28032,,0,Unknown
63098,772,352562,Erythropoese,KMAER1,28035,,1,Unknown
63099,772,352562,Erythropoese,KMAER3,28037,,2,Unknown
...,...,...,...,...,...,...,...,...
20470723,240867,419994,Bekannte Diagnose,StB167,54667,,AndSynd,Unknown
20470862,240867,423278,Fragestellung,StB149,54649,,k.A.,Unknown
20470863,240867,423278,Medikamente,StB151,54651,,k.A.,Unknown
20470866,240867,423278,Bekannte Diagnose,StB167,54667,,k.A.,Unknown


In [18]:
# Print unique lab test abbreviations with missing numeric_result
print(f"Unique lab TEST_ABBR with missing numeric_result:\n\n{num_res_null['test_abbr'].unique()}\n")

print(f"\n**********************************************************************\n")
# Print unique lab test names with missing numeric_result
print(f"Unique lab TEST_NAME with missing numeric_result:\n\n{num_res_null['test_name'].unique()}\n")

Unique lab TEST_ABBR with missing numeric_result:

['CCFFra' 'KMAMK1' 'KMAMK3' 'KMAER1' 'KMAER3' 'KMAML1' 'KMAPL1' 'KMAPL4'
 'KMALY1' 'akid' 'KMAML3' 'KMAEG1' 'igm-q' 'Symp' 'StB149' 'StB151'
 'StB167' 'Vdiag' 'StB133' 'StB139' 'StB141' 'StB143' 'igaqko' 'igmqko']


**********************************************************************

Unique lab TEST_NAME with missing numeric_result:

['Fragestellung' 'Megakaryopoese' 'Erythropoese' 'Myelopoese'
 'Plasmazellen' 'Lymphozyten' 'Antikörperidentifikation' 'Eisen'
 'IgM Quotient' 'Symptome' 'Medikamente' 'Bekannte Diagnose'
 'Klinischer Verdacht' 'Erwartetes Urinkreatinin' 'Körpergrösse in m'
 'BMI' 'IgA Quot. korr' 'IgM Quot. korr']



In [19]:
# Fill missing numeric_result with converted numeric values from text_result
mask = lab_data["numeric_result"].isna()  # Identify missing numeric_result
lab_data.loc[mask, "numeric_result"] = pd.to_numeric(lab_data.loc[mask, "text_result"], errors="coerce")

# Check missing values in 'numeric_result' column after filling
lab_data[lab_data['numeric_result'].isna()]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
5673,54,244839,Fragestellung,CCFFra,59001,,AL,Unknown
63180,772,352562,Fragestellung,CCFFra,59001,,AL,Unknown
88810,1095,96417,Antikörperidentifikation,akid,7023,,Anti-s,Unknown
109837,1329,258399,Fragestellung,CCFFra,59001,,CD34,Unknown
115344,1409,265648,Fragestellung,CCFFra,59001,,LST,Unknown
...,...,...,...,...,...,...,...,...
20470723,240867,419994,Bekannte Diagnose,StB167,54667,,AndSynd,Unknown
20470862,240867,423278,Fragestellung,StB149,54649,,k.A.,Unknown
20470863,240867,423278,Medikamente,StB151,54651,,k.A.,Unknown
20470866,240867,423278,Bekannte Diagnose,StB167,54667,,k.A.,Unknown


In [20]:
# Filter rows where 'numeric_result' is missing and 'text_result' starts with '>' or '<'
lab_data[lab_data['numeric_result'].isna() & lab_data['text_result'].str.startswith(('>', '<'), na=False)]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
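The empty result above indicates that no censored results (text starting with '<' or '>') are left without a numeric value. If such rows did occur, the comparator and the number could be separated with a regular expression. The sketch below is one possible approach on toy strings; the column names `comparator` and `value` are illustrative assumptions.

```python
import pandas as pd

text = pd.Series(["< 0.1", ">100", "4.6", "k.A."])

# Optional comparator, optional spaces, then a plain decimal number
pattern = r"^\s*(?P<comparator>[<>])?\s*(?P<value>\d+(?:\.\d+)?)\s*$"
parts = text.str.extract(pattern)

# Strings that do not match the pattern become missing values
parts["value"] = pd.to_numeric(parts["value"])
print(parts)
```

Keeping the comparator in its own column preserves the information that a value is a detection limit rather than a measurement.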


In [21]:
# Remove rows where numeric_result is still missing
lab_data = lab_data.dropna(subset=["numeric_result"])

In [22]:
# Check rows with missing values again
print(lab_data.isna().sum())

patient_id           0
case_id              0
test_name            0
test_abbr            0
method_number        0
numeric_result       0
text_result         40
unit              5342
dtype: int64


In [23]:
# Remove temporary variable
del num_res_null

## 2.3 Check and remove duplicated rows

In [24]:
duplicate_rows_lab = lab_data[lab_data.duplicated()]
print(f"Number of duplicate rows: {len(duplicate_rows_lab)}")

Number of duplicate rows: 25


In [25]:
# View duplicated rows
display(duplicate_rows_lab)

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
165491,2019,30187,"Heparin, unfrakt.",HAXaU,6152,0.09,< 0.1,AntiXa/mL
531739,6628,195679,Basophile,BASO2n,9130,0.06,0.06,G/L
724316,8904,370137,INR,INRiH,6608,0.96,<1.0,Unknown
785263,9644,78608,Monozyten,MONO2n,9132,0.85,0.85,G/L
1154100,14122,126261,INR,INRiH,6608,0.95,<1.0,Unknown
1541282,19085,231531,Hämatokrit,Hk.n,9110,0.37,0.37,L/L
1541284,19085,231531,Hämatokrit,Hkn,9111,0.37,0.37,L/L
1803787,22437,220248,INR,INRiH,6608,0.97,<1.0,Unknown
3990912,49820,380729,INR,INRiH,6608,0.98,<1.0,Unknown
4152251,51845,387034,INR,INRiH,6608,0.96,<1.0,Unknown


In [26]:
# Remove duplicate rows and check
lab_data = lab_data.drop_duplicates()
print(f"Number of remaining duplicate rows: {lab_data.duplicated().sum()}")

Number of remaining duplicate rows: 0
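By default, `duplicated()` flags only the repeats and leaves the first occurrence unflagged. To inspect every copy side by side before deciding whether to drop them, `keep=False` can be used. Since the columns shown here contain no timestamp, identical rows could in principle also be genuine repeated measurements, which is worth keeping in mind. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({
    "case_id": [1, 1, 2],
    "test_abbr": ["INRiH", "INRiH", "KA"],
    "numeric_result": [0.96, 0.96, 4.6],
})

# Default: only the second copy is flagged
first_pass = int(toy.duplicated().sum())

# keep=False: every member of a duplicate group is flagged
all_copies = toy[toy.duplicated(keep=False)]
print(first_pass, len(all_copies))
```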


In [27]:
# Remove temporary variable
del duplicate_rows_lab

## 2.4 Check plausibility of numeric_result values

### 2.4.1 Check negative values

In [28]:
# Count rows with negative numeric_result
negative_rows = (lab_data['numeric_result'] < 0)
print(f"Number of rows with negative numeric_result: {negative_rows.sum()}")

Number of rows with negative numeric_result: 276389


In [29]:
# Display rows with negative values
lab_data[lab_data['numeric_result'] < 0]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
610,2,335525,Basen-Excess Std.,BE-St,26,-1.9,-1.9,mmol/L
611,2,335525,Basen-Excess Std.,BE-St,26,-2.1,-2.1,mmol/L
1666,21,205034,Basen-Excess-2,BE2,23,-0.8,-0.8,mmol/L
1667,21,205034,Basen-Excess,BE,24,-0.8,-0.8,mmol/L
1713,22,71424,Basen-Excess Std.,BE-St,26,-2.4,-2.4,mmol/L
...,...,...,...,...,...,...,...,...
20486429,240985,423650,Basen-Excess Std.,BE-St,26,-4.9,-4.9,mmol/L
20486430,240985,423650,Basen-Excess Std.,BE-St,26,-1.6,-1.6,mmol/L
20486431,240985,423650,Basen-Excess Std.,BE-St,26,-1.3,-1.3,mmol/L
20486432,240985,423650,Basen-Excess Std.,BE-St,26,-0.1,-0.1,mmol/L


In [30]:
# Calculate the number of negative results for each test and method combination
negative_counts = (
    lab_data[lab_data['numeric_result'] < 0]
    .groupby(['test_name', 'method_number', 'unit'])
    .size()
    .reset_index(name='negative_count')
    .sort_values(by='negative_count', ascending=False)  # Sorting in descending order
)

print(negative_counts.to_string(index=False))  # Show all rows

               test_name  method_number    unit  negative_count
       Basen-Excess Std.             26  mmol/L          110488
            Basen-Excess             24  mmol/L           76873
          Basen-Excess-2             23  mmol/L           76869
       Basen-Excess akt.          70004  mmol/L            7362
       Basen-Excess akt.          70104  mmol/L            2684
                D-Dimere           6504    µg/L             676
             Hämolytisch             47 Unknown             310
          HIV -1/2 Ag-Ak          60030 Unknown             242
                    COHb            551       %             131
      Faktor VII (koag.)           6312       %             121
           Methämoglobin             35       %             109
        Faktor X (koag.)           6313       %              91
                Delta-He           9214      pg              72
     C-reaktives Protein             67    mg/L              55
            Erythrozyten            380 

Of the tests listed above, only Basen-Excess (base excess) can take physiologically meaningful negative values. Negative results are therefore kept for Basen-Excess tests only and removed for all other tests, where they most likely reflect errors rather than meaningful physiological measurements.

In [31]:
# Keep all tests with "Basen-Excess" in the test name and remove other tests with negative values
lab_data = lab_data[~((lab_data['numeric_result'] < 0) & (~lab_data['test_name'].str.contains('Basen-Excess')))]

# Display rows with negative values
lab_data[lab_data['numeric_result'] < 0]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
610,2,335525,Basen-Excess Std.,BE-St,26,-1.9,-1.9,mmol/L
611,2,335525,Basen-Excess Std.,BE-St,26,-2.1,-2.1,mmol/L
1666,21,205034,Basen-Excess-2,BE2,23,-0.8,-0.8,mmol/L
1667,21,205034,Basen-Excess,BE,24,-0.8,-0.8,mmol/L
1713,22,71424,Basen-Excess Std.,BE-St,26,-2.4,-2.4,mmol/L
...,...,...,...,...,...,...,...,...
20486429,240985,423650,Basen-Excess Std.,BE-St,26,-4.9,-4.9,mmol/L
20486430,240985,423650,Basen-Excess Std.,BE-St,26,-1.6,-1.6,mmol/L
20486431,240985,423650,Basen-Excess Std.,BE-St,26,-1.3,-1.3,mmol/L
20486432,240985,423650,Basen-Excess Std.,BE-St,26,-0.1,-0.1,mmol/L


In [32]:
# Check after cleaning: print the number of negative results for each test and method combination
print(
    lab_data[lab_data['numeric_result'] < 0]
    .groupby(['test_name', 'method_number', 'unit'])
    .size()
    .reset_index(name='negative_count')
    .sort_values(by='negative_count', ascending=False)  # Sorting in descending order
)

           test_name  method_number    unit  negative_count
1  Basen-Excess Std.             26  mmol/L          110488
0       Basen-Excess             24  mmol/L           76873
6     Basen-Excess-2             23  mmol/L           76869
2  Basen-Excess akt.          70004  mmol/L            7362
3  Basen-Excess akt.          70104  mmol/L            2684
4  Basen-Excess akt.          70204  mmol/L               9
5     Basen-Excess-1             22  mmol/L               6


### 2.4.2 Filtering out values incompatible with life
    
Plausibility limits were defined for selected tests, and values outside these limits are removed as incompatible with life.

*NOTE: Consider other ranges, or address outliers later in the analysis.*

- Natrium (sodium): remove values below 100 mmol/L or above 191 mmol/L (Vogt, W. *et al.*, 1992).
- Kalium (potassium): remove values below 1.2 mmol/L or above 9.8 mmol/L (Janssens, P. M. W. *et al.*, 2021).
- Chloride: remove values below 65 mmol/L or above 138 mmol/L (Vogt, W. *et al.*, 1992).
- pH: remove values below 6.8 or above 7.8 (both venous and arterial) (Janssens, P. M. W. *et al.*, 2021).

In [33]:
# Define the limits for each test
limits = {
    'KA': {'min': 1.2, 'max': 9.8},  # Kalium (potassium)
    'Na': {'min': 100, 'max': 191},  # Natrium (sodium)
    'CL': {'min': 65, 'max': 138},   # Chloride
    'pH': {'min': 6.8, 'max': 7.8}   # pH
}

# Loop through each test abbreviation in limits and filter out rows outside the range
for test_abbr, limit in limits.items():
    lab_data = lab_data[~((lab_data['test_abbr'] == test_abbr) & 
                          ((lab_data['numeric_result'] < limit['min']) | 
                           (lab_data['numeric_result'] > limit['max'])))]

# After running this, `lab_data` will have rows removed where numeric_result is outside the specified range

lab_data

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
0,1,171465,Natrium,Na,1,138.0,138,mmol/L
1,1,171465,Kalium,KA,3,4.6,4.6,mmol/L
2,1,171465,Hämolytisch,H-Se,42,4.0,4,Unknown
3,1,171465,Lipämisch,L-Se,43,3.0,3,Unknown
4,1,171465,Ikterisch,I-Se,44,1.0,1,Unknown
...,...,...,...,...,...,...,...,...
20486998,240990,415184,MCHC,MCHCn,9116,333.0,333,g/L
20486999,240990,415184,RDW,RDWn,9117,14.3,14.3,%
20487000,240990,415184,Thrombozyten,THZn,9119,226.0,226,G/L
20487001,240990,415184,MPV,MPVn,9204,9.0,9.0,fL
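The loop above re-filters the full table once per test. An alternative is to map each row to its own limits and build a single mask, which also makes it easy to count how many rows each limit removes. The sketch below runs on toy data with a subset of the limits; the helper names `bounds`, `lower`, and `upper` are illustrative.

```python
import pandas as pd

limits = {"KA": {"min": 1.2, "max": 9.8}, "Na": {"min": 100, "max": 191}}

toy = pd.DataFrame({
    "test_abbr": ["KA", "KA", "Na", "RDWn"],
    "numeric_result": [4.6, 12.0, 95.0, 14.3],
})

# One row per test abbreviation, with 'min' and 'max' columns
bounds = pd.DataFrame.from_dict(limits, orient="index")

# Tests without defined limits map to NaN, and comparisons with NaN are False
lower = toy["test_abbr"].map(bounds["min"])
upper = toy["test_abbr"].map(bounds["max"])
out_of_range = (toy["numeric_result"] < lower) | (toy["numeric_result"] > upper)

removed_per_test = toy.loc[out_of_range, "test_abbr"].value_counts()
filtered = toy[~out_of_range]
print(removed_per_test)
```

Reporting `removed_per_test` alongside the filter documents how strongly each limit affects the data.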


### 2.4.3 Check for outliers in the remaining laboratory tests

**Most important laboratory tests**

Select the most frequent laboratory measurements, i.e. tests that are missing in fewer than 80% of cases.

In [34]:
# Count the number of unique cases in which each lab test was measured
lab_test_counts = lab_data.groupby("test_abbr", as_index=False)["case_id"].nunique()
lab_test_counts.rename(columns={"case_id": "num_cases"}, inplace=True)

# Total number of unique cases
total_cases = lab_data["case_id"].nunique()

# Percentage of cases in which the test is missing
lab_test_counts["missing_percentage"] = 100 - (lab_test_counts["num_cases"] / total_cases * 100)

# Sort tests from most to least frequent
lab_test_counts.sort_values(by="num_cases", ascending=False, inplace=True)

# Keep only the test abbreviation and its missing percentage
missing_counts = lab_test_counts[["test_abbr", "missing_percentage"]]

# Define the threshold for missing percentage (keep tests with < 80% missing)
filtered_lab_tests = missing_counts[missing_counts["missing_percentage"] < 80]
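Here "missing" is defined per case: a test counts as missing for a case if that case has no row for the test at all. The toy example below, with made-up case IDs, illustrates the arithmetic:

```python
import pandas as pd

toy = pd.DataFrame({
    "case_id":   [1, 1, 1, 2, 3, 3],
    "test_abbr": ["KA", "KA", "Na", "KA", "KA", "CRP"],
})

total_cases = toy["case_id"].nunique()

# Repeated measurements within a case are counted once
cases_per_test = toy.groupby("test_abbr")["case_id"].nunique()
missing_percentage = 100 - cases_per_test / total_cases * 100
print(missing_percentage.round(2))
```

KA appears in all three cases and is therefore never missing, while Na and CRP each appear in one of three cases.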


In [35]:
print(filtered_lab_tests.to_string(index=False))

test_abbr  missing_percentage
       KA           14.163553
    Leukn           15.379800
      Hbn           15.380763
     Eryn           15.382368
      Hkn           15.383331
     THZn           15.384936
    MCHCn           15.385257
     MCHn           15.385578
     MCVn           15.385578
     RDWn           15.458123
     MPVn           16.120655
       Na           16.224657
        L           18.770463
        I           18.771426
        H           18.830489
       GL           22.091471
   NRBCmn           23.693232
      CRP           24.653968
       CR           25.418256
    INRiH           25.541196
     QUHD           25.546653
   ENTN1n           29.520884
   EPIGFR           30.735205
   Quicks           36.439916
     ASAT           61.683230
     ALAT           62.087362
   Ben-ID           64.571858
      GGT           65.177574
     UREA           65.340960
       CA           71.260737
    EC3-U           71.549632
    PH4-U           71.550595
    GLUC3 

The following cell looks up the test names behind some of the abbreviations in the list of most frequent laboratory tests:

- **Ben-ID** is Benutzer (user)  ->  remove (likely not a relevant test)
- **FARBE3** is Farbe (color)  ->  keep for now; may be excluded from the modeling later
- **TRUEB3** is Trübung (turbidity)  ->  keep for now; may be excluded from the modeling later
- **QUHD** is Quick (prothrombin time)  ->  keep (likely a valid laboratory test)
- **ENTN1n** is Entnahmeart (type of sample collection)  ->  keep as metadata (useful for understanding sample collection)

In addition:
- **TNT_hn** is Hilfsanalyse (auxiliary analysis)  ->  remove (not relevant to the analysis)

In [36]:
# Look up the test names behind selected abbreviations
lab_data[lab_data["test_abbr"].isin(["Ben-ID", "EART", "FARBE3", "TRUEB3", "QUHD", "ENTN1n"])]["test_name"].unique()

array(['Farbe', 'Trübung', 'Benutzer', 'Quick', 'Entnahmeart',
       'Quick venös'], dtype=object)

In [37]:
# List of test abbreviations to be removed based on their descriptions
tests_to_remove = [
    "Ben-ID",  # Benutzer (user)
    "TNT_hn"   # Hilfsanalyse (auxiliary analysis)
]

# Filter out the rows where 'test_abbr' is in the tests_to_remove list
lab_data = lab_data[~lab_data["test_abbr"].isin(tests_to_remove)]

In [38]:
# Filter rows where 'text_result' starts with '>' or '<'
lab_data[lab_data['text_result'].str.startswith(('>', '<'), na=False)]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
16,1,171465,eGFR nach CKD-EPI,EPIGFR,64,100.000000,> 90,mL/min
165,1,333396,Bakterien,BA-Ux,384,32164.800781,> 10000,/µL
167,1,333396,Pilze,PI-Ux,386,2.800000,<10,/µL
170,1,333396,Erythrozyten,ERY-CH,490,5.000000,>20,Unknown
171,1,333396,Leukozyten,LK-CH,491,5.000000,>20,Unknown
...,...,...,...,...,...,...,...,...
20486899,240988,412516,TSH,TSH,2005,101.000000,> 100,mU/L
20486903,240988,412516,Folsäure,FOLIII,2374,50.000000,> 45.4,nmol/L
20486959,240989,393141,eGFR nach CKD-EPI,EPIGFR,64,100.000000,> 90,mL/min
20486960,240989,393141,C-reaktives Protein,CRP,67,1.000000,< 3,mg/L


Within the most frequent laboratory tests, extreme values that are most likely placeholders or errors (identified in a previous exploratory data analysis step) are checked and removed.

**Note**: as an alternative, placeholder values could be replaced with the upper threshold given in the corresponding 'text_result' column.
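The alternative mentioned in the note could look roughly like the sketch below. It assumes that the text result of a placeholder row has the form '>100' and that 9999 is the only placeholder value for the test; both are assumptions for illustration, shown on toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    "test_abbr": ["QUHD", "QUHD"],
    "numeric_result": [9999.0, 85.0],
    "text_result": [">100", "85"],
})

# Rows whose numeric value is the assumed placeholder
is_placeholder = (toy["test_abbr"] == "QUHD") & (toy["numeric_result"] == 9999)

# Strip a leading comparator and spaces, then parse the threshold
threshold = pd.to_numeric(toy["text_result"].str.lstrip("<> "), errors="coerce")

# Overwrite the placeholder with the threshold from the text result
toy.loc[is_placeholder, "numeric_result"] = threshold[is_placeholder]
print(toy)
```

This keeps the row in the dataset at the cost of treating a censored value as if it were measured at the threshold.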

In [39]:
# Define the conditions for extreme values
extreme_values = (
    ((lab_data["test_abbr"] == "INRiH") & (lab_data["numeric_result"].isin([9999, 5001.36]))) |
    ((lab_data["test_abbr"] == "QUHD") & (lab_data["numeric_result"] == 9999)) |
    ((lab_data["test_abbr"] == "Tbga") & (lab_data["numeric_result"] == 34103))
)

# Display rows that will be removed
lab_data[extreme_values]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
3833,33,187592,Quick,QUHD,6800,9999.0,>100,%
1033900,12633,71315,Quick,QUHD,6800,9999.0,>100,%
3867468,48306,53283,Körpertemperatur,Tbga,32,34103.0,34103.0,°C
4356463,54491,189920,Quick,QUHD,6800,9999.0,>100,%
5566541,69617,326336,Quick,QUHD,6800,9999.0,>100,%
5951445,74276,4105,Quick,QUHD,6800,9999.0,>100,%
7425897,92754,313524,Quick,QUHD,6800,9999.0,>100,%
7871705,98342,371573,Quick,QUHD,6800,9999.0,>100,%
10226695,127669,182835,Quick,QUHD,6800,9999.0,>100,%
11330332,141456,185165,Quick,QUHD,6800,9999.0,>100,%


In [40]:
# Remove rows with extreme values
lab_data = lab_data[~extreme_values]

lab_data.head()

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
0,1,171465,Natrium,Na,1,138.0,138.0,mmol/L
1,1,171465,Kalium,KA,3,4.6,4.6,mmol/L
2,1,171465,Hämolytisch,H-Se,42,4.0,4.0,Unknown
3,1,171465,Lipämisch,L-Se,43,3.0,3.0,Unknown
4,1,171465,Ikterisch,I-Se,44,1.0,1.0,Unknown


# 3. Save cleaned dataset

In [41]:
lab_data.to_csv(lab_output_path, index=False)