# Part 5 Demo 1: Enriching Data with Terminologies

This demo shows how raw data can be enriched by linking it to terminologies.

## Libraries & Configuration

This section loads the required packages and sets configuration variables, for example for the data sources.

Note: on Google Colab, loading the packages can take several minutes, especially on the first run. In that case, do not run this block again; wait for the execution to finish.

In [1]:
# Install any required packages that are still missing, then load all of them
packages <- c("readr", "dplyr", "stringr", "tidyr", "icd.data")
install.packages(setdiff(packages, rownames(installed.packages())))
lapply(packages, require, character.only = TRUE)

# Base URL of the MIMIC III demo raw data
base_url <- "https://raw.githubusercontent.com/ganslats/TMF-School-Datenanalyse-Visualisierung/master/Rohdaten/mimic-iii-demo/"

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Loading required package: readr
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: stringr
Loading required package: tidyr
Loading required package: icd.data

## Load Selected MIMIC III Raw Data

In the following cells we load two tables from the MIMIC III dataset: the diagnoses documented for the hospital admissions (table `diagnoses`) and the ICD9 diagnosis catalog (table `d_icd`).

In [2]:
# Load the diagnoses; ICD9 codes are read as character, not as numbers
mimic.diagnoses.raw <- read_delim(paste0(base_url, "DIAGNOSES_ICD.csv"),
                                  col_types = cols(row_id = col_double(), subject_id = col_double(), hadm_id = col_double(), seq_num = col_double(), icd9_code = col_character()),
                                  skip = 0, delim = ",")
head(mimic.diagnoses.raw)

row_id,subject_id,hadm_id,seq_num,icd9_code
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
112344,10006,142345,1,99591
112345,10006,142345,2,99662
112346,10006,142345,3,5672
112347,10006,142345,4,40391
112348,10006,142345,5,42731
112349,10006,142345,6,4280
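
The `icd9_code` column is read with `col_character()` rather than as a number. A minimal base R sketch of why that matters: ICD-9 codes can start with a zero or a letter (V and E codes), and a numeric conversion destroys both. The three codes below are illustrative values written in the undotted style used in the table above.

```r
# Illustrative ICD-9 codes in undotted form (not taken from the loaded data)
codes <- c("0389", "4019", "V3000")

# Numeric conversion drops the leading zero and turns the V code into NA
as_number <- suppressWarnings(as.numeric(codes))
print(as_number)

# Converting back to character does not restore the original codes
round_trip <- as.character(as_number)
print(round_trip)
```

Keeping the codes as strings also means that later joins against a catalog compare text with text.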


In [4]:
# Load the ICD9 diagnosis catalog
mimic.d_icd.raw <- read_delim(paste0(base_url, "D_ICD_DIAGNOSES.csv"),
                              col_types = cols(row_id = col_double(), icd9_code = col_character(), short_title = col_character(), long_title = col_character()),
                              skip = 0, delim = ",")
head(mimic.d_icd.raw)

row_id,icd9_code,short_title,long_title
<dbl>,<chr>,<chr>,<chr>
1,1716,Erythem nod tb-oth test,"Erythema nodosum with hypersensitivity reaction in tuberculosis, tubercle bacilli not found by bacteriological or histological examination, but tuberculosis confirmed by other methods [inoculation of animals]"
2,1720,TB periph lymph-unspec,"Tuberculosis of peripheral lymph nodes, unspecified"
3,1721,TB periph lymph-no exam,"Tuberculosis of peripheral lymph nodes, bacteriological or histological examination not done"
4,1722,TB periph lymph-exam unk,"Tuberculosis of peripheral lymph nodes, bacteriological or histological examination unknown (at present)"
5,1723,TB periph lymph-micro dx,"Tuberculosis of peripheral lymph nodes, tubercle bacilli found (in sputum) by microscopy"
6,1724,TB periph lymph-cult dx,"Tuberculosis of peripheral lymph nodes, tubercle bacilli not found (in sputum) by microscopy, but found by bacterial culture"


## Determine the Most Frequent Diagnoses

In this block we count how often each ICD9 code was documented and list the ten most frequent codes.

In [5]:
# Count rows per ICD9 code, sort by descending frequency, show the top 10
mimic.diagnoses.raw %>%
    group_by(icd9_code) %>%
    summarize(n = n(), .groups = "drop") %>%
    arrange(desc(n)) %>%
    head(10)

icd9_code,n
<chr>,<int>
4019,53
42731,48
5849,45
4280,39
25000,31
51881,31
2724,29
5990,27
486,26
2859,25
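
The same tally can be written more compactly with `dplyr::count()`. The following is a sketch on a small made-up data frame; the `toy` table and its values are illustrative and not MIMIC data.

```r
library(dplyr)

# Hypothetical mini table with one row per documented diagnosis
toy <- tibble(icd9_code = c("4019", "42731", "4019", "5849", "4019", "42731"))

# Explicit form: group, tally, sort
explicit <- toy %>%
    group_by(icd9_code) %>%
    summarize(n = n(), .groups = "drop") %>%
    arrange(desc(n))

# Shorthand: count() groups and tallies; sort = TRUE orders by descending n
short <- toy %>% count(icd9_code, sort = TRUE)

print(short)
```

Both forms return one row per code with its frequency in column `n`; which one to use is mostly a matter of taste.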


## Add Labels from the ICD9 Catalog

In [None]:
# Attach the short title of each code from the catalog, then count as before
mimic.diagnoses.raw %>%
    inner_join(mimic.d_icd.raw %>% select(icd9_code, short_title), by = "icd9_code") %>%
    group_by(icd9_code, short_title) %>%
    summarize(n = n(), .groups = "drop") %>%
    arrange(desc(n)) %>%
    head(10)
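
`inner_join()` keeps only rows whose code exists in the catalog, so diagnoses with a code missing from the catalog would silently drop out of the tally. A hedged sketch of how to check for such codes with `anti_join()`, using made-up tables (`diagnoses`, `catalog`, and the code `99999` are illustrative assumptions):

```r
library(dplyr)

# Hypothetical diagnoses; "99999" stands in for a code absent from the catalog
diagnoses <- tibble(hadm_id   = c(1, 1, 2, 3),
                    icd9_code = c("4019", "42731", "4019", "99999"))
catalog   <- tibble(icd9_code   = c("4019", "42731"),
                    short_title = c("Hypertension NOS", "Atrial fibrillation"))

# Rows that survive the inner join
joined <- diagnoses %>% inner_join(catalog, by = "icd9_code")

# Codes that have no catalog entry, with the number of affected rows
unmatched <- diagnoses %>%
    anti_join(catalog, by = "icd9_code") %>%
    count(icd9_code)

print(nrow(diagnoses) - nrow(joined))  # number of rows lost by the join
print(unmatched)
```

If unmatched codes should stay in the result, `left_join()` keeps them and fills `short_title` with `NA`.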

## Add the ICD9 Hierarchy

### Load the ICD9 hierarchy from the R package "icd.data" and add a string version of the ICD code

In [None]:
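# Copy the hierarchy table shipped with icd.data
icd9.hierarchy.raw <- icd9cm_hierarchy
# Add the code as a plain character column so it can serve as the join key
icd9.hierarchy.raw$icd9_code <- as.character(icd9.hierarchy.raw$code)
head(icd9.hierarchy.raw)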

## Aggregation at the Level of ICD9 Sub-Chapters

In [None]:
# Map each diagnosis to its sub-chapter, count per sub-chapter, show the top 10
mimic.diagnoses.raw %>%
    inner_join(icd9.hierarchy.raw, by = "icd9_code") %>%
    group_by(sub_chapter) %>%
    summarize(n = n(), .groups = "drop") %>%
    arrange(desc(n)) %>%
    head(10)
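
To make the aggregation step concrete without downloading anything, here is a small sketch with a hand-made hierarchy table. The rows of `toy_hierarchy` are typed in for illustration, and the labels are simplified and may differ from those in `icd9cm_hierarchy`. The `chapter` column is an assumption about a further, coarser level; only `sub_chapter` is used in the cell above.

```r
library(dplyr)

# Hypothetical diagnoses and a hand-typed excerpt of a code hierarchy
toy_diagnoses <- tibble(icd9_code = c("4019", "4019", "42731", "4280", "5849"))

toy_hierarchy <- tibble(
    icd9_code   = c("4019", "42731", "4280", "5849"),
    sub_chapter = c("Hypertensive Disease",
                    "Other Forms Of Heart Disease",
                    "Other Forms Of Heart Disease",
                    "Nephritis, Nephrotic Syndrome, And Nephrosis"),
    chapter     = c(rep("Diseases Of The Circulatory System", 3),
                    "Diseases Of The Genitourinary System")
)

# Same pattern as above: join on the code, then count per hierarchy level
by_sub_chapter <- toy_diagnoses %>%
    inner_join(toy_hierarchy, by = "icd9_code") %>%
    count(sub_chapter, sort = TRUE)

# Switching the grouping column aggregates at a coarser level
by_chapter <- toy_diagnoses %>%
    inner_join(toy_hierarchy, by = "icd9_code") %>%
    count(chapter, sort = TRUE)

print(by_sub_chapter)
print(by_chapter)
```

The point of the sketch is that the join stays the same and only the grouping column changes, so one terminology table supports several levels of aggregation.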