# Workshop 1 <br> Gaining Clinical Insight from Text
## 1.3 - Pattern Matching in MIMIC-III

In this section, we will look at a few examples of what free text notes look like in MIMIC-III and use some pattern matching.  We will be using SQLite in Firefox ESR.  Please refer to **0.0 - Installation of SQLite** for installation instructions.
<br>
Of note, different versions of SQL may have slightly different syntax.
<br>
Refer to the following website for more information:
https://www.postgresql.org/docs/9.1/static/functions-string.html

## Looking at Regular Expressions in MIMIC-III
In this section we will use SQL to practice regular expressions.  Note that SQLite does not have built-in regular expression functions, so we will be using user-defined functions that were written for this workshop.  If you use PostgreSQL, you can find information here:<br>
https://www.postgresql.org/docs/9.3/static/functions-matching.html<br>
https://www.postgresql.org/docs/9.1/static/functions-string.html
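
To give a feel for what a user-defined regular expression function in SQLite can look like, here is a minimal sketch using Python's standard `sqlite3` and `re` modules. The workshop's own functions are not shown in this section, so the implementations below are assumptions: they only borrow the names `regexp_match` and `regexp_val` and the `(pattern, text)` argument order used in the queries of this section.

```python
import re
import sqlite3

def regexp_match(pattern, text):
    """Return 1 if the pattern matches starting at the beginning of text, else 0."""
    if text is None:
        return 0
    return 1 if re.match(pattern, text, re.DOTALL) else 0

def regexp_val(pattern, text):
    """Return capture group 1 of the first match, or None when nothing matches."""
    if text is None:
        return None
    found = re.search(pattern, text)
    return found.group(1) if found else None

# Register the Python functions so SQL statements on this connection can call them.
# These are assumed stand-ins, not the workshop's actual implementations.
conn = sqlite3.connect(":memory:")
conn.create_function("regexp_match", 2, regexp_match)
conn.create_function("regexp_val", 2, regexp_val)

row = conn.execute(
    "SELECT regexp_match('.*STREP.*', 'VIRIDANS STREPTOCOCCI'), "
    "regexp_val('HR \\(bpm\\): (\\d+)', 'HR (bpm): 106')"
).fetchone()
print(row)  # (1, '106')
```

Once a function is registered this way, it can be used in a `SELECT` or `WHERE` clause like any built-in SQL function.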

In the past, we have used LIKE to do simple matching. Let's take a look at this by using the microbiologyevents table.

First, let's take a look at all the distinct microorganisms. Click on the **f(x)** icon. Select **Execute SQL** and type the following text into the window.  When you are finished, click **Run SQL**.
```SQL
SELECT distinct micro.org_name
FROM microbiologyevents as micro
ORDER by org_name
```
Organism|
---|
"ALPHA STREPTOCOCCI"|
"BETA STREPTOCOCCUS GROUP B"|
"BURKHOLDERIA (PSEUDOMONAS) CEPACIA"|
"PRESUMPTIVE PEPTOSTREPTOCOCCUS SPECIES"|
"PSEUDOMONAS AERUGINOSA"|
"STREPTOCOCCUS PNEUMONIAE"|
"VIRIDANS STREPTOCOCCI"|

We can use LIKE to find all the types of STREP in this table.
<br>    
```SQL
SELECT distinct micro.org_name
FROM microbiologyevents as micro
WHERE org_name like '%STREP%'
ORDER by org_name
```
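<BR>
A regular expression can match several alternatives in one pattern. For example, the following query finds all rows where the organism name contains ALPHA or BETA: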
```SQL
SELECT *
FROM microbiologyevents as micro
WHERE regexp_match('.*ALPHA.*|.*BETA.*', org_name)
ORDER by org_name
```
<img src="images/query_like.jpg" alt="Version"  style="width:600px;"/>
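
The two queries above can be compared side by side in a small, self-contained sketch. It loads the organism names listed earlier into an in-memory SQLite table; the `regexp_match` function registered here is a hypothetical stand-in for the workshop's user-defined function, assumed to take `(pattern, text)` and to match from the start of the text.

```python
import re
import sqlite3

organisms = [
    "ALPHA STREPTOCOCCI",
    "BETA STREPTOCOCCUS GROUP B",
    "BURKHOLDERIA (PSEUDOMONAS) CEPACIA",
    "PRESUMPTIVE PEPTOSTREPTOCOCCUS SPECIES",
    "PSEUDOMONAS AERUGINOSA",
    "STREPTOCOCCUS PNEUMONIAE",
    "VIRIDANS STREPTOCOCCI",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE microbiologyevents (org_name TEXT)")
conn.executemany(
    "INSERT INTO microbiologyevents VALUES (?)", [(name,) for name in organisms]
)

# Assumed stand-in for the workshop's regexp_match(pattern, text).
conn.create_function(
    "regexp_match",
    2,
    lambda pattern, text: 1 if text is not None and re.match(pattern, text) else 0,
)

# LIKE: % is a wildcard for any run of characters.
like_rows = [r[0] for r in conn.execute(
    "SELECT DISTINCT org_name FROM microbiologyevents "
    "WHERE org_name LIKE '%STREP%' ORDER BY org_name"
)]

# Regular expression: | separates alternatives within a single pattern.
regex_rows = [r[0] for r in conn.execute(
    "SELECT DISTINCT org_name FROM microbiologyevents "
    "WHERE regexp_match('.*ALPHA.*|.*BETA.*', org_name) ORDER BY org_name"
)]

print(len(like_rows))  # 5
print(regex_rows)      # ['ALPHA STREPTOCOCCI', 'BETA STREPTOCOCCUS GROUP B']
```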


Now let's take a look at echo files.  Here is an example of the text at the top of each echo file.  If you want to take a closer look (and this is encouraged! You should know what your text is like before you analyze it), you can highlight the rows you are interested in, copy them in CSV format, and paste them into a text editor.

```
PATIENT/TEST INFORMATION:
Indication: Endocarditis. Left ventricular function. Valvular heart disease.
Height: (in) 64
Weight (lb): 170
BSA (m2): 1.83 m2
BP (mm Hg): 92/61
HR (bpm): 106
Status: Inpatient
Date/Time: [**2144-2-11**] at 12:07
Test: TTE (Complete)
Doppler: Full Doppler and color Doppler
Contrast: None
Technical Quality: Adequate
```

    
<BR>
You can use regular expressions to pull useful information out of these files.  Let's say you want to get all the heights. You can run a query like this:
    
```SQL
SELECT *
, REGEXP_VAL('Height: \(in\) (.*?)\n',text) as height
FROM  echo
```

Congratulations! You have just structured free text!
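
To see what the capture group `(.*?)` is doing, the same pattern can be tried outside the database. This is a minimal Python sketch on part of the sample echo header above; `first_group` is a hypothetical helper that mimics what `REGEXP_VAL` appears to do (return the text captured by the first set of parentheses), which is an assumption about the workshop function.

```python
import re

# Part of the sample echo header shown earlier in this section.
echo_header = (
    "PATIENT/TEST INFORMATION:\n"
    "Indication: Endocarditis. Left ventricular function. Valvular heart disease.\n"
    "Height: (in) 64\n"
    "Weight (lb): 170\n"
    "BSA (m2): 1.83 m2\n"
    "BP (mm Hg): 92/61\n"
    "HR (bpm): 106\n"
    "Status: Inpatient\n"
)

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None when nothing matches."""
    found = re.search(pattern, text)
    return found.group(1) if found else None

# \( and \) match literal parentheses; (.*?) captures as little as possible
# before the newline that ends the line.
height = first_group(r"Height: \(in\) (.*?)\n", echo_header)
heart_rate = first_group(r"HR \(bpm\): (.*?)\n", echo_header)
print(height, heart_rate)  # 64 106
```

The captured values are still text, so they would need to be converted to numbers before doing any arithmetic on them.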

Exercise 3
1. How do you find the weight?
2. How do you find the indication?

Click for answer to 1.

<input id="height" name="answer" type="hidden" 
value="

SELECT *
, REGEXP_VAL('Weight \(lb\): (.*?)\n',text) as weight
FROM  echo
">

Click for answer to 2.

<input id="height" name="answer" type="hidden" 
value="

SELECT *
, REGEXP_VAL('Indication: (.*?)\n',text) as Indication
FROM  echo

">

## Regular Expression Substitution
<BR>
You can use regular expressions to substitute one word or part of a word for something else.  Below is the syntax for creating an additional column that is the same as org_name except that all instances of Streptococcus are replaced with SC.
```SQL
select *
    , regexp_replace(org_name, 'STREPTOCOCCUS', 'SC', 'i') as ABBREVIATION
from microbiologyevents
```
<BR>
Instead of replacing only the word Streptococcus with SC, you can replace the entire value with SC if you want.  (The . matches any single character, and the * means the preceding element is repeated zero or more times, so .* matches any run of characters.)
```SQL
select *, regexp_replace(org_name, '.*STREPTOCOCCUS.*', 'SC', 'i') as ABBREVIATION
from microbiologyevents
```


Now we will take a look at the ECG notes table.  An ECG, or electrocardiogram, records the electrical activity of the heart.  The text portion of a sample report looks like this:
```
Sinus rhythm. Left ventricular hypertrophy with ST-T wave changes. Compared to
the previous tracing of [**2124-12-7**] ST segment depressions and T wave inversions
persist in leads I and aVL. ST segment depressions is again recorded in
leads V2-V3. The rate has slowed. Otherwise, no diagnostic interim change.
TRACING #1
```

The formatting varies between notes.  Some entries do not have any periods, and each sentence is separated by a line break (newline or enter):
```
Sinus rhythm
Left bundle branch block
Since previous tracing of [**2147-9-9**], atrial ectopy absent
```
Other entries have sentences that end with periods, sometimes with extra line breaks (newlines or enters) in the middle of a sentence:
```
Sinus rhythm with an atrial premature beat.  Short P-R interval.  Leftward
axis.  Intraventricular conduction delay of left bundle-branch block type.
Since the previous tracing of [**2147-9-23**] the rate is faster.  The Q-T interval is
shorter but is still prolonged.
```
In order to make the text more uniform we can do the following:

1. If the text element has no periods, we can replace the line breaks (newlines) with periods:
    ```
    Sinus rhythm.  Left bundle branch block. Since previous tracing of [**2147-9-9**], atrial ectopy absent   
    ```
2. If the text element has periods, the line breaks are extra. We can replace all line breaks (newlines) with spaces:
    ```
    Sinus rhythm with an atrial premature beat. Short P-R interval. Leftward axis. Intraventricular conduction delay of left bundle-branch block type. Since the previous tracing of [**2147-9-23**] the rate is faster. The Q-T interval is shorter but is still prolonged.
    ```

You can do the above with something like this, which replaces line breaks (newlines or enters), written as '\n', with a space or a period:

```SQL
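-- LIKE '%.%' is true when the note contains at least one period
SELECT text as before,
    CASE WHEN text LIKE '%.%'
         THEN REGEXP_REPLACE(text, '\n', ' ', 'g')   -- has periods: line breaks become spaces
         ELSE REGEXP_REPLACE(text, '\n', '.', 'g')   -- no periods: line breaks become periods
    END as after
FROM  ECG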
```
Pay attention to the text before and after this is run.
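
The same two-case rule can be sketched in Python to check it on the sample notes above. The function name `normalize_ecg_text` is made up for this illustration, and the second sample is shortened from the note shown earlier; the replacement strings mirror the query, so a line break becomes a single period with no following space.

```python
import re

def normalize_ecg_text(text):
    """Replace line breaks with spaces if the note has periods, else with periods."""
    if "." in text:
        return re.sub(r"\n", " ", text)
    return re.sub(r"\n", ".", text)

no_periods = (
    "Sinus rhythm\n"
    "Left bundle branch block\n"
    "Since previous tracing of [**2147-9-9**], atrial ectopy absent"
)
with_periods = (
    "Sinus rhythm with an atrial premature beat.  Short P-R interval.  Leftward\n"
    "axis.  The Q-T interval is\n"
    "shorter but is still prolonged."
)

print(normalize_ecg_text(no_periods))
# Sinus rhythm.Left bundle branch block.Since previous tracing of [**2147-9-9**], atrial ectopy absent
print(normalize_ecg_text(with_periods))
# Sinus rhythm with an atrial premature beat.  Short P-R interval.  Leftward axis.  The Q-T interval is shorter but is still prolonged.
```

Replacing the line break with a period followed by a space instead would give the spacing shown in the target examples.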

Let's say we want to find reports that were clearly marked as sinus rhythm.  This means that the electrical activity of the heart originates from the sinus node (where it should come from).  Sinus rhythm includes sinus tachycardia (which means a fast heartbeat that originates at the sinus node).  We also want to identify patients with atrial fibrillation, which is an abnormal heart rhythm.

```SQL

```
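
As one possible starting point, the idea can be sketched in Python with two case-insensitive patterns. This is only an illustration under stated assumptions: the patterns, the `classify` helper, and the second sample report are made up for this sketch, and simple matching like this does not handle negation (for example, a note saying sinus rhythm is absent).

```python
import re

# \s+ allows the two words to be split across a line break.
SINUS = re.compile(r"sinus\s+(rhythm|tachycardia)", re.IGNORECASE)
AFIB = re.compile(r"atrial\s+fibrillation", re.IGNORECASE)

def classify(text):
    """Flag whether a report mentions sinus rhythm/tachycardia or atrial fibrillation."""
    return {
        "sinus": bool(SINUS.search(text)),
        "afib": bool(AFIB.search(text)),
    }

reports = [
    "Sinus rhythm. Left ventricular hypertrophy with ST-T wave changes.",
    "Atrial fibrillation. The rate has slowed.",  # made-up example text
    "Sinus\ntachycardia",
]

for report in reports:
    print(classify(report))
# {'sinus': True, 'afib': False}
# {'sinus': False, 'afib': True}
# {'sinus': True, 'afib': False}
```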