# Predictive Modeling of Hospital Length of Stay and Discharge Type
# [Step 1: Data cleaning - Laboratory data]

This notebook explores the laboratory results dataset and performs an initial cleaning so that it can be used in the later analysis.

The dataset, provided by the Insel Data Science Center (IDSC) of Bern, contains laboratory data spanning approximately 16 years from Inselspital, the university hospital of Bern.

The **data** includes:

- Laboratory test results from hospital visits.

**Libraries used:**  
- pandas  

**Last updated:** May 13, 2025 (Table of contents was added)

### Table of contents and summary of the cleaning steps

1. [Import libraries and load data](#1-import-libraries-and-load-data)
  - Imported `pandas` library.
  - Loaded the laboratory results dataset from a CSV file.

2. [Initial exploration and cleaning](#2-initial-exploration-and-cleaning)
  - 2.1 [Extract information about the dataset](#2-1-extract-information-about-the-dataset)
    - Used `.info()` and `.nunique()` to inspect data structure and unique values.
    - Memory Optimization: converted numerical columns to more efficient data types (`float32`, `int32`).
    - Column Renaming: translated German column names to English using a dictionary and `rename()`.
  - 2.2 [Check for missing values](#2-2-check-for-missing-values)
    - Checked for missing values in all columns.
    - Explored missing values in `test_name` and `test_abbr`.
    - Replaced missing `test_abbr` for 'Natrium' with 'Na'.
    - Removed rows where both `numeric_result` and `text_result` were missing.
    - Converted numeric values in `text_result` to `numeric_result` where possible.
    - Dropped rows where `numeric_result` was still missing after conversion.
  - 2.3 [Check and remove duplicated rows](#2-3-check-remove-duplicates)
    - Identified duplicate rows.
    - Removed duplicates from the dataset.

3. [Save cleaned dataset](#3-save-cleaned-data)
  - Saved the cleaned DataFrame to a new CSV file for further use.

# 1. Import libraries and load dataset <a id='1-import-libraries-and-load-data'></a>

In [None]:
# Import data manipulation library
import pandas as pd

In [2]:
# Set data path
lab_data_path = "/home/anna/Desktop/Master_thesis/raw_data/RITM0154633_del_20240923/RITM0154633_lab.csv"
lab_output_path = "/home/anna/Desktop/Master_thesis/output_data/cleaned_lab_data"

# Load lab dataset 
lab_data = pd.read_csv(lab_data_path)

### First look at the dataset:

In [3]:
display(lab_data)

Unnamed: 0,dim_patient_bk_pseudo,dim_fall_bk_pseudo,bezeichnung,kurzbezeichnung,methodenummer,ergebnis_numerisch,ergebnis_text,unit
0,1,171465,Natrium,,1,138.0,138,mmol/L
1,1,171465,Kalium,KA,3,4.6,4.6,mmol/L
2,1,171465,Hämolytisch,H-Se,42,4.0,4,Unknown
3,1,171465,Lipämisch,L-Se,43,3.0,3,Unknown
4,1,171465,Ikterisch,I-Se,44,1.0,1,Unknown
...,...,...,...,...,...,...,...,...
20486999,240990,415184,RDW,RDWn,9117,14.3,14.3,%
20487000,240990,415184,Thrombozyten,THZn,9119,226.0,226,G/L
20487001,240990,415184,MPV,MPVn,9204,9.0,9.0,fL
20487002,240990,415184,Normoblasten maschinell,NRBCmn,9205,0.0,0.00,/100 Leuk.


# 2. Lab data exploration and cleaning <a id='2-initial-exploration-and-cleaning'></a>

## 2.1 Extract information about the dataset <a id="2-1-extract-information-about-the-dataset"></a>

The dataset comprises 20,487,004 rows and 8 columns.
The columns have the following data types:
- one float column (`float64`),
- three integer columns (`int64`),
- and four string columns (stored as `object`).

In [4]:
# Information about the dataset
print(lab_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20487004 entries, 0 to 20487003
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   dim_patient_bk_pseudo  int64  
 1   dim_fall_bk_pseudo     int64  
 2   bezeichnung            object 
 3   kurzbezeichnung        object 
 4   methodenummer          int64  
 5   ergebnis_numerisch     float64
 6   ergebnis_text          object 
 7   unit                   object 
dtypes: float64(1), int64(3), object(4)
memory usage: 1.2+ GB
None


In [5]:
# Optimize memory usage by converting numerical columns to smaller data types
  
lab_data["ergebnis_numerisch"] = lab_data["ergebnis_numerisch"].astype("float32")  # Reduce float precision  

for col in ["dim_patient_bk_pseudo", "dim_fall_bk_pseudo", "methodenummer"]:  
    lab_data[col] = lab_data[col].astype("int32")  # Use smaller integer type  
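
The casts above assume that every identifier fits into `int32` and that `float32` precision (roughly 7 significant decimal digits) is sufficient for the lab results. Below is a minimal sketch of how the integer assumption could be checked before casting. The helper name `fits_int32` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fits_int32(series: pd.Series) -> bool:
    """Return True if every value of an integer Series fits into int32."""
    limits = np.iinfo(np.int32)
    return bool(series.min() >= limits.min and series.max() <= limits.max)

# Toy data standing in for the identifier columns (values are made up)
toy = pd.DataFrame({
    "dim_patient_bk_pseudo": [1, 240990],
    "too_large": [1, 3_000_000_000],  # exceeds the int32 maximum of 2,147,483,647
})

# Map each column name to whether a cast to int32 would be lossless
checks = {col: fits_int32(toy[col]) for col in toy.columns}
print(checks)
```

Running a check like this on the real columns before `astype("int32")` would surface a column that cannot be represented in the smaller type.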

In [6]:
# Translate columns from German to English

# Translation dictionary
column_translation = {
    "dim_patient_bk_pseudo": "patient_id",
    "dim_fall_bk_pseudo": "case_id",
    "bezeichnung": "test_name",
    "kurzbezeichnung": "test_abbr",
    "methodenummer": "method_number",
    "ergebnis_numerisch": "numeric_result",
    "ergebnis_text": "text_result",
    "unit": "unit"
}

# Rename columns
lab_data.rename(columns=column_translation, inplace=True)

# Check new column names
print(lab_data.columns)

display(lab_data)

Index(['patient_id', 'case_id', 'test_name', 'test_abbr', 'method_number',
       'numeric_result', 'text_result', 'unit'],
      dtype='object')


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
0,1,171465,Natrium,,1,138.0,138,mmol/L
1,1,171465,Kalium,KA,3,4.6,4.6,mmol/L
2,1,171465,Hämolytisch,H-Se,42,4.0,4,Unknown
3,1,171465,Lipämisch,L-Se,43,3.0,3,Unknown
4,1,171465,Ikterisch,I-Se,44,1.0,1,Unknown
...,...,...,...,...,...,...,...,...
20486999,240990,415184,RDW,RDWn,9117,14.3,14.3,%
20487000,240990,415184,Thrombozyten,THZn,9119,226.0,226,G/L
20487001,240990,415184,MPV,MPVn,9204,9.0,9.0,fL
20487002,240990,415184,Normoblasten maschinell,NRBCmn,9205,0.0,0.00,/100 Leuk.


In [7]:
lab_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20487004 entries, 0 to 20487003
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   patient_id      int32  
 1   case_id         int32  
 2   test_name       object 
 3   test_abbr       object 
 4   method_number   int32  
 5   numeric_result  float32
 6   text_result     object 
 7   unit            object 
dtypes: float32(1), int32(3), object(4)
memory usage: 937.8+ MB


In [8]:
# Get the number of unique entries per column
unique_counts_lab = lab_data.nunique()

# Display results
print(unique_counts_lab)

patient_id        182238
case_id           311611
test_name           3469
test_abbr           4608
method_number       4695
numeric_result     62295
text_result        98458
unit                 168
dtype: int64
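
There are more distinct `test_abbr` values (4,608) than `test_name` values (3,469), which suggests that some test names map to several abbreviations. The sketch below shows how such names could be listed. The rows are made up, apart from the two 'Hämatokrit' abbreviations, which appear in the duplicate rows of section 2.3.

```python
import pandas as pd

# Made-up rows: one name with two abbreviations, one with a single abbreviation
toy = pd.DataFrame({
    "test_name": ["Hämatokrit", "Hämatokrit", "Kalium", "Kalium"],
    "test_abbr": ["Hk.n", "Hkn", "KA", "KA"],
})

# Count distinct abbreviations per test name and keep names with more than one
abbr_per_name = toy.groupby("test_name")["test_abbr"].nunique()
ambiguous_names = abbr_per_name[abbr_per_name > 1]
print(ambiguous_names)
```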


## 2.2 Check for missing values <a id="2-2-check-for-missing-values"></a>

In [9]:
# Total missing values
print(f"Total missing values:\n{lab_data.isna().sum().sum()}")

# Check rows with missing values
print(lab_data.isna().sum())

Total missing values:
2740336
patient_id              0
case_id                 0
test_name             779
test_abbr          387136
method_number           0
numeric_result    1175245
text_result       1171815
unit                 5361
dtype: int64


### 2.2.1 Explore missing values in test_name column

In [10]:
# Isolate rows with null values
null_test_name = lab_data[lab_data['test_name'].isnull()]

print(f"Number of missing values in test_name column: {len(null_test_name)}")

# Display rows with null values
display(null_test_name)

Number of missing values in test_name column: 779


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
63168,772,352562,,B1adn+,34509,,,Unknown
63169,772,352562,,K2adn-,34513,,,Unknown
63170,772,352562,,B2adn-,34514,,,Unknown
147627,1800,154714,,K1adn+,34502,,,Unknown
192265,2337,86522,,B1adn-,34508,,,Unknown
...,...,...,...,...,...,...,...,...
20443413,240733,402028,,B1-dn+,34510,,,Unknown
20443414,240733,402028,,B1-dn-,34511,,,Unknown
20455809,240808,407027,,K1adn-,34501,,,Unknown
20455810,240808,407027,,B1adn-,34508,,,Unknown


### 2.2.2 Explore missing values in test_abbr column

In [11]:
# Isolate rows with null values
null_test_abbr = lab_data[lab_data['test_abbr'].isnull()]

print(f"Number of missing values in test_abbr column: {len(null_test_abbr)}")

# Display rows with null values
display(null_test_abbr)

Number of missing values in test_abbr column: 387136


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
0,1,171465,Natrium,,1,138.0,138,mmol/L
129,1,333396,Natrium,,1,137.0,137,mmol/L
249,2,27091,Natrium,,1,136.0,136,mmol/L
250,2,27091,Natrium,,1,135.0,135,mmol/L
331,2,36154,Natrium,,1,138.0,138,mmol/L
...,...,...,...,...,...,...,...,...
20486787,240988,393440,Natrium,,1,140.0,140,mmol/L
20486855,240988,412516,Natrium,,1,139.0,139,mmol/L
20486856,240988,412516,Natrium,,1,137.0,137,mmol/L
20486949,240989,393141,Natrium,,1,141.0,141,mmol/L


In [12]:
# Print unique test_name values where test_abbr is missing
print(f"Unique test_name values where test_abbr is missing: {null_test_abbr['test_name'].unique()}")

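# Replace missing test_abbr values where test_name is 'Natrium'
print("\nReplace missing test_abbr values where test_name is 'Natrium'...\n")
# Restrict the assignment to rows whose abbreviation is actually missing
natrium_missing_abbr = (lab_data['test_name'] == 'Natrium') & lab_data['test_abbr'].isna()
lab_data.loc[natrium_missing_abbr, 'test_abbr'] = 'Na'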

# Print number of missing values in test_abbr column after replacement
print(f"Number of missing values in test_abbr column after replacement: {lab_data['test_abbr'].isnull().sum()}")

Unique test_name values where test_abbr is missing: ['Natrium']

Replace missing test_abbr values where test_name is 'Natrium'...

Number of missing values in test_abbr column after replacement: 0
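
Every missing `test_abbr` belongs to 'Natrium'. One possible explanation, which this notebook does not verify, is that the raw file stores the abbreviation as the literal string `NA`, which `pd.read_csv` interprets as a missing value by default. The sketch below reproduces that behavior on a made-up two-row CSV and shows how `keep_default_na=False` with an explicit `na_values` list avoids it.

```python
import io
import pandas as pd

# Made-up CSV snippet; "NA" is a hypothetical abbreviation, not confirmed raw data
raw = "bezeichnung,kurzbezeichnung\nNatrium,NA\nKalium,KA\n"

# Default parsing: the string "NA" is treated as a missing value
default_parse = pd.read_csv(io.StringIO(raw))

# Only empty fields count as missing, so "NA" survives as text
strict_parse = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])

print(default_parse["kurzbezeichnung"].isna().sum())
print(strict_parse["kurzbezeichnung"].tolist())
```

If the raw file does contain `NA`, loading it this way would make the manual replacement unnecessary. Checking the raw CSV directly would confirm or rule out this explanation.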


In [13]:
# Remove temporary variable
del null_test_abbr

### 2.2.3 Explore missing values in numeric_result and text_result columns

In [14]:
# Isolate rows with null values
null_num_res = lab_data[lab_data['numeric_result'].isna()]

# Display rows with null values
display(null_num_res)

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
120,1,171465,Neutrophile,NEUm#n,9225,,,G/L
121,1,171465,Lymphozyten,LYMm#n,9227,,,G/L
122,1,171465,Monozyten,MONm#n,9229,,,G/L
123,1,171465,Eosinophile,EOSm#n,9231,,,G/L
124,1,171465,Basophile,BASm#n,9233,,,G/L
...,...,...,...,...,...,...,...,...
20486853,240988,393440,Hämatogramm 2,HGR 2n,9702,,,Unknown
20486933,240988,412516,Hämatogramm 2,HGR 2n,9702,,,Unknown
20486948,240988,425806,Hämatogramm 2,HGR 2n,9702,,,Unknown
20486978,240989,393141,Hämatogramm 2,HGR 2n,9702,,,Unknown


In [15]:
# Store the initial number of rows
initial_rows = len(lab_data)

# Remove rows where both text_result and numeric_result are missing
lab_data = lab_data.dropna(subset=['text_result', 'numeric_result'], how='all')

# Compute the number of removed rows
removed_rows = initial_rows - len(lab_data)

# Print the result
print(f"Number of rows removed: {removed_rows}")

Number of rows removed: 1171775
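
With `how='all'`, `dropna` removes a row only when every column listed in `subset` is missing. The default `how='any'` would also drop rows that still carry a usable text result. A small illustration on made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "numeric_result": [138.0, np.nan, np.nan],
    "text_result": ["138", "AL", None],
})

# Drops only the last row, where both results are missing
kept_all = toy.dropna(subset=["text_result", "numeric_result"], how="all")

# Drops every row with at least one missing result
kept_any = toy.dropna(subset=["text_result", "numeric_result"], how="any")

print(len(kept_all), len(kept_any))
```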


In [16]:
# Total missing values
print(f"Total missing values after cleaning:\n{lab_data.isna().sum().sum()}")

# Check rows with missing values
print(lab_data.isna().sum())

Total missing values after cleaning:
8852
patient_id           0
case_id              0
test_name            0
test_abbr            0
method_number        0
numeric_result    3470
text_result         40
unit              5342
dtype: int64


In [17]:
# Remove temporary variable
del null_num_res

### 2.2.4 Explore missing values in numeric_result column

In [18]:
# Isolate rows with null values in 'numeric_result' column
num_res_null = lab_data[lab_data['numeric_result'].isna()]

print(f"Number of missing values in numeric_result column: {len(num_res_null)}")

# Display rows with null values
display(num_res_null)

Number of missing values in numeric_result column: 3470


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
5673,54,244839,Fragestellung,CCFFra,59001,,AL,Unknown
63096,772,352562,Megakaryopoese,KMAMK1,28030,,1,Unknown
63097,772,352562,Megakaryopoese,KMAMK3,28032,,0,Unknown
63098,772,352562,Erythropoese,KMAER1,28035,,1,Unknown
63099,772,352562,Erythropoese,KMAER3,28037,,2,Unknown
...,...,...,...,...,...,...,...,...
20470723,240867,419994,Bekannte Diagnose,StB167,54667,,AndSynd,Unknown
20470862,240867,423278,Fragestellung,StB149,54649,,k.A.,Unknown
20470863,240867,423278,Medikamente,StB151,54651,,k.A.,Unknown
20470866,240867,423278,Bekannte Diagnose,StB167,54667,,k.A.,Unknown


In [19]:
# Print unique lab test abbreviations with missing numeric_result
print(f"Unique lab TEST_ABBR with missing numeric_result:\n\n{num_res_null['test_abbr'].unique()}\n")

print(f"\n**********************************************************************\n")
# Print unique lab test names with missing numeric_result
print(f"Unique lab TEST_NAME with missing numeric_result:\n\n{num_res_null['test_name'].unique()}\n")

Unique lab TEST_ABBR with missing numeric_result:

['CCFFra' 'KMAMK1' 'KMAMK3' 'KMAER1' 'KMAER3' 'KMAML1' 'KMAPL1' 'KMAPL4'
 'KMALY1' 'akid' 'KMAML3' 'KMAEG1' 'igm-q' 'Symp' 'StB149' 'StB151'
 'StB167' 'Vdiag' 'StB133' 'StB139' 'StB141' 'StB143' 'igaqko' 'igmqko']


**********************************************************************

Unique lab TEST_NAME with missing numeric_result:

['Fragestellung' 'Megakaryopoese' 'Erythropoese' 'Myelopoese'
 'Plasmazellen' 'Lymphozyten' 'Antikörperidentifikation' 'Eisen'
 'IgM Quotient' 'Symptome' 'Medikamente' 'Bekannte Diagnose'
 'Klinischer Verdacht' 'Erwartetes Urinkreatinin' 'Körpergrösse in m'
 'BMI' 'IgA Quot. korr' 'IgM Quot. korr']



In [20]:
# Isolate rows where numeric_result is missing but text_result can be converted to a number
# (pd.to_numeric is applied to the whole column at once instead of element by element)
text_as_number = pd.to_numeric(lab_data['text_result'], errors='coerce')
text_to_num = lab_data[lab_data['numeric_result'].isna() & text_as_number.notna()]

# Print the number of such rows
print(f"Number of rows where numeric_result is missing but text_result is a number: {len(text_to_num)}")

# Display these rows for inspection
display(text_to_num)

Number of rows where numeric_result is missing but text_result is a number: 1799


Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
63096,772,352562,Megakaryopoese,KMAMK1,28030,,1,Unknown
63097,772,352562,Megakaryopoese,KMAMK3,28032,,0,Unknown
63098,772,352562,Erythropoese,KMAER1,28035,,1,Unknown
63099,772,352562,Erythropoese,KMAER3,28037,,2,Unknown
63100,772,352562,Myelopoese,KMAML1,28040,,1,Unknown
...,...,...,...,...,...,...,...,...
20455863,240808,421529,Myelopoese,KMAML1,28040,,1,Unknown
20455864,240808,421529,Myelopoese,KMAML3,28042,,7,Unknown
20455867,240808,421529,Plasmazellen,KMAPL1,28055,,1,Unknown
20455868,240808,421529,Lymphozyten,KMALY1,28065,,1,Unknown


In [21]:
# Print unique lab test abbreviations where 'numeric_result' is missing but 'text_result' contains a numeric value
print(f"Unique lab TEST_ABBR with missing numeric_result where text_result is a number:\n\n{text_to_num['test_abbr'].unique()}\n")

print(f"\n**********************************************************************\n")

# Print unique lab test names where 'numeric_result' is missing but 'text_result' contains a numeric value
print(f"Unique lab TEST_NAME with missing numeric_result where text_result is a number:\n\n{text_to_num['test_name'].unique()}\n")

Unique lab TEST_ABBR with missing numeric_result where text_result is a number:

['KMAMK1' 'KMAMK3' 'KMAER1' 'KMAER3' 'KMAML1' 'KMAPL1' 'KMAPL4' 'KMALY1'
 'KMAML3' 'KMAEG1']


**********************************************************************

Unique lab TEST_NAME with missing numeric_result where text_result is a number:

['Megakaryopoese' 'Erythropoese' 'Myelopoese' 'Plasmazellen' 'Lymphozyten'
 'Eisen']



In [22]:
# Fill missing numeric_result with converted numeric values from text_result
mask = lab_data["numeric_result"].isna()  # Identify missing numeric_result
lab_data.loc[mask, "numeric_result"] = pd.to_numeric(lab_data.loc[mask, "text_result"], errors="coerce")

# Check missing values in 'numeric_result' column after filling
lab_data[lab_data['numeric_result'].isna()]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
5673,54,244839,Fragestellung,CCFFra,59001,,AL,Unknown
63180,772,352562,Fragestellung,CCFFra,59001,,AL,Unknown
88810,1095,96417,Antikörperidentifikation,akid,7023,,Anti-s,Unknown
109837,1329,258399,Fragestellung,CCFFra,59001,,CD34,Unknown
115344,1409,265648,Fragestellung,CCFFra,59001,,LST,Unknown
...,...,...,...,...,...,...,...,...
20470723,240867,419994,Bekannte Diagnose,StB167,54667,,AndSynd,Unknown
20470862,240867,423278,Fragestellung,StB149,54649,,k.A.,Unknown
20470863,240867,423278,Medikamente,StB151,54651,,k.A.,Unknown
20470866,240867,423278,Bekannte Diagnose,StB167,54667,,k.A.,Unknown


In [23]:
# Filter rows where 'numeric_result' is missing and 'text_result' starts with '>' or '<'
lab_data[lab_data['numeric_result'].isna() & lab_data['text_result'].str.startswith(('>', '<'), na=False)]

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit


In [24]:
# Remove rows where numeric_result is still missing
lab_data = lab_data.dropna(subset=["numeric_result"])

In [25]:
# Check rows with missing values again
print(lab_data.isna().sum())

patient_id           0
case_id              0
test_name            0
test_abbr            0
method_number        0
numeric_result       0
text_result         40
unit              5342
dtype: int64


In [26]:
# Remove temporary variable
del num_res_null

## 2.3 Check and remove duplicated rows <a id="2-3-check-remove-duplicates"></a>

In [27]:
duplicate_rows_lab = lab_data[lab_data.duplicated()]
print(f"Number of duplicate rows: {len(duplicate_rows_lab)}")

Number of duplicate rows: 25


In [28]:
# View duplicated rows
display(duplicate_rows_lab)

Unnamed: 0,patient_id,case_id,test_name,test_abbr,method_number,numeric_result,text_result,unit
165491,2019,30187,"Heparin, unfrakt.",HAXaU,6152,0.09,< 0.1,AntiXa/mL
531739,6628,195679,Basophile,BASO2n,9130,0.06,0.06,G/L
724316,8904,370137,INR,INRiH,6608,0.96,<1.0,Unknown
785263,9644,78608,Monozyten,MONO2n,9132,0.85,0.85,G/L
1154100,14122,126261,INR,INRiH,6608,0.95,<1.0,Unknown
1541282,19085,231531,Hämatokrit,Hk.n,9110,0.37,0.37,L/L
1541284,19085,231531,Hämatokrit,Hkn,9111,0.37,0.37,L/L
1803787,22437,220248,INR,INRiH,6608,0.97,<1.0,Unknown
3990912,49820,380729,INR,INRiH,6608,0.98,<1.0,Unknown
4152251,51845,387034,INR,INRiH,6608,0.96,<1.0,Unknown


In [29]:
# Remove duplicate rows and check
lab_data = lab_data.drop_duplicates()
print(f"Number of remaining duplicate rows: {lab_data.duplicated().sum()}")

Number of remaining duplicate rows: 0
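
`duplicated()` uses `keep='first'` by default, so the count of 25 reported in this section covers only the repeated copies, not the first occurrence of each row. To inspect every member of a duplicate group side by side, `keep=False` can be used, as in this sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "case_id": [1, 1, 2, 1],
    "test_abbr": ["INRiH", "INRiH", "KA", "INRiH"],
    "numeric_result": [0.96, 0.96, 4.6, 0.96],
})

extra_copies = toy.duplicated().sum()          # repeats only (keep='first')
all_members = toy[toy.duplicated(keep=False)]  # every row of each duplicate group
deduplicated = toy.drop_duplicates()           # first occurrence of each row is kept

print(extra_copies, len(all_members), len(deduplicated))
```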


In [30]:
# Remove temporary variable
del duplicate_rows_lab

# 3. Save cleaned dataset <a id="3-save-cleaned-data"></a>

In [31]:
lab_data.to_csv(lab_output_path, index=False)
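
CSV files do not store dtypes, so the `int32`/`float32` downcasts from section 2.1 are lost on reload unless they are passed to `read_csv` again. The sketch below shows a round trip, using an in-memory buffer and a made-up row in place of the real output file. The `dtypes` mapping is an illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Column dtypes to restore on reload (mirrors the casts in section 2.1)
dtypes = {
    "patient_id": "int32",
    "case_id": "int32",
    "method_number": "int32",
    "numeric_result": "float32",
}

toy = pd.DataFrame({
    "patient_id": [1], "case_id": [171465], "test_name": ["Natrium"],
    "test_abbr": ["Na"], "method_number": [1], "numeric_result": [138.0],
    "text_result": ["138"], "unit": ["mmol/L"],
}).astype(dtypes)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without an explicit dtype, text_result "138" would be parsed as an integer
reloaded = pd.read_csv(buffer, dtype={**dtypes, "text_result": str})
print(reloaded.dtypes)
```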