# Database Review and Profiling

## Table Structure

### Eight tables:

| Table          | Columns | Rows    | Business Primary Key | Apparent FKs          |
|----------------|---------|---------|----------------------|-----------------------|
| CARE_GIVERS    | 5       | 7567    | CGID                 |                       |
| DIAGNOSES      | 6       | 651047  | ROW_ID               | SUBJECT_ID, ICD9_CODE |
| ICD_DIAGNOSES  | 5       | 14567   | ICD9_CODE            |                       |
| ICD_LABS       | 7       | 753     | ITEMID               | LOINC_CODE            |
| LABS           | 7       | 3014967 | index                | SUBJECT_ID, ITEMID    |
| PATIENTS       | 9       | 58909   | SUBJECT_ID           |                       |
| TRANSFERS      | 8       | 30084   | index                | SUBJECT_ID            |
| TREATMENT_TEAM | 6       | 105789  | index                | SUBJECT_ID, CGID      |

## Assumptions and Questions

* Index and ROW_ID - are they all unique IDs?
    * No! In PATIENTS, the same ROW_ID and SUBJECT_ID can point to more than one row. Some of the duplicates share a birthdate; they always share the same gender.
    * There are a host of DODs, but I'm not focusing on this b/c it doesn't appear to be part of the equation
    * Key question: Can we behave as though the same SUBJECT_ID is the same pt throughout?
    * Not duplicated:
        * CARE_GIVERS
        * LABS
* What is HADM_ID?
    * I think it's Hospital Admission ID
* Assumption: all the patients in the database were admitted during the period of interest.
    * WRONG. The TRANSFERS table has admits from much of 2000, so we'll have to narrow down the patients to those who have an admit date in June and work with their HADM_ID.
    * ALSO: TRANSFERS does not reliably have a to/from unit. This is going to make it difficult to specify exactly which unit the pt is in.
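
The uniqueness question above can be checked directly. A minimal sketch, assuming ROW_ID and SUBJECT_ID are the literal column names in PATIENTS (they are the names used in these notes); swapping in another table name repeats the check elsewhere:

```sql
-- Key pairs that appear on more than one row in PATIENTS.
-- An empty result would mean the pair is unique.
SELECT ROW_ID, SUBJECT_ID, COUNT(*) AS n_rows
FROM PATIENTS
GROUP BY ROW_ID, SUBJECT_ID
HAVING COUNT(*) > 1
ORDER BY n_rows DESC;
```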

## Early Analysis

How many of these anomalies actually matter?

Let's look at the requirements for the SQL extract:

All patients in the cohort should have a diagnosis of Heart Failure or Cardiac Dysfunction and should have a lab result for Troponin
some time during their admission.

This means:
1. Gather all patients whose admit d/t was in June 2000
2. Inner join to DX HF or CD
    * Does this DX have to be made during this admission? Can a prior DX qualify the patient?
    * __HF__: `lower(SHORT_TITLE) like '%heart f%'`
    * __CD__: This is complicated. There is no specific dx called "Cardiac Dysfunction." We could do one of a few things:
       * Hand-pick dx descriptions with a string
       * Select a range of ICD codes. Grossly speaking, it looks like we might want:
           * 390-392  Acute Rheumatic Fever
           * 393-398  Chronic Rheumatic Heart Disease
           * 410-414  Ischemic Heart Disease
           * 415-417  Diseases Of Pulmonary Circulation
           * 420-429  Other Forms Of Heart Disease
       * BUT there are a host of other conditions outside these ranges, particularly a group of neonatal codes (there are many NICU admissions in this data set) and a group of cardiac inflammations caused by pathogens in the 0* and 1* hierarchies. Need guidance.
3. Inner join to a lab result for Troponin during that stay <-- __that__ stay
    * We have two ICD_LABS entries for Troponin: __do we want both__?
       * 51002,Troponin I
       * 51003,Troponin T
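
The two string filters in the steps above can be previewed against the lookup tables before they are joined to anything. A sketch, assuming SHORT_TITLE lives in ICD_DIAGNOSES and that ICD_LABS has a LABEL column; the patterns themselves are still open questions:

```sql
-- Candidate Heart Failure codes, for hand review
SELECT ICD9_CODE, SHORT_TITLE
FROM ICD_DIAGNOSES
WHERE lower(SHORT_TITLE) LIKE '%heart f%'
ORDER BY ICD9_CODE;

-- Candidate Troponin lab items (the notes above expect 51002 and 51003)
SELECT ITEMID, LABEL
FROM ICD_LABS
WHERE lower(LABEL) LIKE 'tropo%'
ORDER BY ITEMID;
```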


## Building out the Query

### Identify the admitted patients

This gives us 4,839 distinct SUBJECT_ID/HADM_ID pairs:
```
SELECT DISTINCT T.SUBJECT_ID, T.HADM_ID
FROM TRANSFERS T
WHERE T.EVENTTYPE = 'admit'
  AND T.INTIME BETWEEN '2000-06-01' AND '2000-07-01';
```
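
One caveat: `BETWEEN` is inclusive at both ends, so whether an admit stamped exactly at midnight on July 1 is counted depends on how INTIME is stored and compared. A half-open range sidesteps that. This is a sketch of the alternative, not the query that produced the count above, so its row count may differ slightly:

```sql
SELECT DISTINCT T.SUBJECT_ID, T.HADM_ID
FROM TRANSFERS T
WHERE T.EVENTTYPE = 'admit'
  AND T.INTIME >= '2000-06-01'   -- from the start of June 1
  AND T.INTIME <  '2000-07-01';  -- excludes anything on or after July 1
```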

### Filter for the Troponin Labs

We can pretty quickly filter this on the lab result. The call below returns 1,841 distinct patients; the underlying join matches 5,868 Troponin lab results for this patient population:

```
SELECT DISTINCT pts.SUBJECT_ID
FROM (SELECT DISTINCT T.SUBJECT_ID, T.HADM_ID
      FROM TRANSFERS T
      WHERE T.EVENTTYPE = 'admit'
        AND T.INTIME BETWEEN '2000-06-01' AND '2000-07-01') pts
         INNER JOIN
     -- Troponin results only; keep just the join keys rather than SELECT *
     (SELECT LABS.SUBJECT_ID, LABS.HADM_ID
      FROM LABS
               INNER JOIN ICD_LABS ON LABS.ITEMID = ICD_LABS.ITEMID
                   AND lower(ICD_LABS.label) LIKE 'tropo%') tropo
     ON pts.SUBJECT_ID = tropo.SUBJECT_ID
         AND pts.HADM_ID = tropo.HADM_ID;
```
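
Both figures can come from a single pass over the same join. A sketch, assuming SUBJECT_ID and HADM_ID on the lab side come from LABS and that ITEMID is unique in ICD_LABS; `lab_results` and `patients` are illustrative aliases:

```sql
SELECT COUNT(*)                       AS lab_results,  -- one row per matching Troponin result
       COUNT(DISTINCT pts.SUBJECT_ID) AS patients      -- each patient counted once
FROM (SELECT DISTINCT T.SUBJECT_ID, T.HADM_ID
      FROM TRANSFERS T
      WHERE T.EVENTTYPE = 'admit'
        AND T.INTIME BETWEEN '2000-06-01' AND '2000-07-01') pts
         INNER JOIN LABS ON pts.SUBJECT_ID = LABS.SUBJECT_ID
                        AND pts.HADM_ID = LABS.HADM_ID
         INNER JOIN ICD_LABS ON LABS.ITEMID = ICD_LABS.ITEMID
                            AND lower(ICD_LABS.LABEL) LIKE 'tropo%';
```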

### DRAFT: Filter for Heart Disease

```
SELECT *
FROM ICD_DIAGNOSES
WHERE SUBSTR(ICD9_CODE, 1, 3) IN
    ('390', '391', '392', -- acute rheumatic
     '393', '394', '395', '396', '397', '398', -- chronic rheumatic
     '410', '411', '412', '413', '414', -- ischemic hd
     '415', '416', '417', -- pulmonary circulation
     '420', '421', '422', '423', '424', '425', '426', '427', '428', '429'); -- other hd
```

This returns 197 DX rows and can be tuned as requirements clarify.
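
To see where those rows come from, the same filter can be grouped by three-digit category. A sketch, assuming ICD9_CODE is stored as text with no decimal point and with leading zeros intact (otherwise a code such as 0391 would masquerade as 391); the range predicates are meant to cover the same categories as the list above:

```sql
SELECT SUBSTR(ICD9_CODE, 1, 3) AS category,
       COUNT(*)                AS n_codes
FROM ICD_DIAGNOSES
WHERE SUBSTR(ICD9_CODE, 1, 3) BETWEEN '390' AND '398'   -- acute and chronic rheumatic
   OR SUBSTR(ICD9_CODE, 1, 3) BETWEEN '410' AND '417'   -- ischemic hd, pulmonary circulation
   OR SUBSTR(ICD9_CODE, 1, 3) BETWEEN '420' AND '429'   -- other hd
GROUP BY SUBSTR(ICD9_CODE, 1, 3)
ORDER BY category;
```

Categories with a surprisingly large or small `n_codes` are natural places to start tuning.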

### Final Draft Query

```
SELECT DISTINCT pts.SUBJECT_ID
FROM (SELECT DISTINCT T.SUBJECT_ID, T.HADM_ID
      FROM TRANSFERS T
      WHERE T.EVENTTYPE = 'admit'
        AND T.INTIME BETWEEN '2000-06-01' AND '2000-07-01') pts
INNER JOIN
     -- Troponin results only; keep just the join keys rather than SELECT *
     (SELECT LABS.SUBJECT_ID, LABS.HADM_ID
      FROM LABS
               INNER JOIN ICD_LABS ON LABS.ITEMID = ICD_LABS.ITEMID
                   AND lower(ICD_LABS.label) LIKE 'tropo%') tropo
     ON pts.SUBJECT_ID = tropo.SUBJECT_ID
         AND pts.HADM_ID = tropo.HADM_ID
INNER JOIN
     -- Heart disease diagnoses recorded against the same admission
     (SELECT DIAGNOSES.SUBJECT_ID, DIAGNOSES.HADM_ID
      FROM DIAGNOSES
      WHERE SUBSTR(DIAGNOSES.ICD9_CODE, 1, 3) IN
          ('390', '391', '392', -- acute rheumatic
           '393', '394', '395', '396', '397', '398', -- chronic rheumatic
           '410', '411', '412', '413', '414', -- ischemic hd
           '415', '416', '417', -- pulmonary circulation
           '420', '421', '422', '423', '424', '425', '426', '427', '428', '429') -- other hd
     ) dx
     ON pts.SUBJECT_ID = dx.SUBJECT_ID
         AND pts.HADM_ID = dx.HADM_ID;
```
This returns 1,433 distinct patients. Since Troponin is a test pretty specific to cardiac care, I suspect there are in fact more pts we need to gather up with a wider net in the DIAGNOSES table.
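
One way to size that wider net is to list diagnosis titles that look cardiac but fall outside the selected ranges. A sketch, assuming SHORT_TITLE is in ICD_DIAGNOSES and ICD9_CODE is stored as text; the keyword patterns are illustrative guesses, not a vetted clinical list:

```sql
SELECT ICD9_CODE, SHORT_TITLE
FROM ICD_DIAGNOSES
WHERE (lower(SHORT_TITLE) LIKE '%heart%'
    OR lower(SHORT_TITLE) LIKE '%card%')   -- cardiac, myocarditis, cardiomyopathy, ...
  AND NOT (SUBSTR(ICD9_CODE, 1, 3) BETWEEN '390' AND '398'
        OR SUBSTR(ICD9_CODE, 1, 3) BETWEEN '410' AND '417'
        OR SUBSTR(ICD9_CODE, 1, 3) BETWEEN '420' AND '429')
ORDER BY ICD9_CODE;
```

The result is a review list for whoever owns the cohort definition, not an automatic addition to the filter.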

# Questions for the retrospective analysis

* Does "every twelve hours" mean every twelve hours from the moment of a given patient's admission, or does it mean we can take measurements for everyone at 7AM and 7PM?
* Does "in the patient's unit" mean a "unit" as labeled in the TRANSFERS table?
* If it's the TRANSFERS table, what are the rules for imputing a unit? Many times PREV and CURR are NULL.
* What generates a row in the TRANSFERS table? Sometimes they come within minutes, sometimes after a week.
* The CARE_GIVERS table is inconsistent. What is a "Nurse"? Is there a specific combination of LABEL/DESCRIPTION that identifies them? What about NPs? They often get lumped in with residents. Selecting on LABEL in ('RN', 'Rn', 'RNs', 'rn') gets close, but what about students? Is an LPN a Nurse? Is a Read Only RN a Nurse?
* Need to validate the lab test as LOINC 33762-6 Natriuretic peptide.B prohormone
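
Two quick inventories would help frame the last two questions. A sketch, assuming LABEL and DESCRIPTION are the literal CARE_GIVERS column names and that LOINC_CODE in ICD_LABS is stored in the hyphenated form shown above:

```sql
-- Every LABEL/DESCRIPTION combination, most common first
SELECT LABEL, DESCRIPTION, COUNT(*) AS n_caregivers
FROM CARE_GIVERS
GROUP BY LABEL, DESCRIPTION
ORDER BY n_caregivers DESC;

-- Which lab item(s), if any, carry the LOINC code in question
SELECT ITEMID, LABEL, LOINC_CODE
FROM ICD_LABS
WHERE LOINC_CODE = '33762-6';
```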