## AI for Medicine Course 3 Week 2 lecture exercises - Cleaning Text

For this notebook we'll be using the re module, which is part of Python's Standard Library and provides support for regular expressions (aka regexp). If you aren't familiar with regexp, we strongly recommend checking the [official docs](https://docs.python.org/3/library/re.html).

Regular expressions allow you to perform searches and replacements in strings based on patterns.
As a quick intro let's check some examples:

We'll be using the search method which has the form 

```python
search(pattern, text)
```
It returns a match object if the pattern is found anywhere in the text, or None if no match is found.
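
As a minimal sketch of what search hands back (the search terms "Effusion" and "Edema" are illustrative choices, not part of the exercises), the match object can be inspected with its group() and span() methods:

```python
import re

# re.search scans the whole string and returns the first match as a Match object.
match = re.search("Effusion", "Pleural Effusion")
print(match.group())  # the text that was matched
print(match.span())   # (start, end) indices of the match

# With no match, re.search returns None, which is falsy in an `if` test.
no_match = re.search("Edema", "Pleural Effusion")
print(no_match)
```

Because None is falsy and a match object is truthy, the result can be used directly as a condition, which is what the examples in this notebook rely on.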

Let's begin with a simple example. In the next three examples we try to match a pattern against the string "Pleural Effusion". In particular, notice the special characters:
- ^ anchors the pattern that follows it to the start of the string ("starts with")
- $ anchors the pattern that precedes it to the end of the string ("ends with")
- | means "or": the text may match the pattern on either side of it

Can you see why the first two examples output a match unlike the third one?

In [1]:
import re

In [2]:
print("match found" if re.search("^Pl|ion$", "Pleural Effusion") else "no match found")

match found


In [3]:
print("match found" if re.search("^Sa|ion$", "Pleural Effusion") else "no match found")

match found


In [4]:
print("match found" if re.search("^Ut|xs$", "Pleural Effusion") else "no match found")

no match found
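
To make the answer to the question above concrete, here is a small sketch of how the "or" operator splits these patterns. The grouped pattern at the end is an extra illustration that does not appear in the exercises:

```python
import re

text = "Pleural Effusion"

# "^Sa|ion$" is read as (^Sa) OR (ion$), so only one branch has to match.
either_branch = re.search("^Sa|ion$", text)
print(either_branch.group())

# Neither "^Ut" nor "xs$" applies here, so the result is None.
neither_branch = re.search("^Ut|xs$", text)
print(neither_branch)

# Parentheses change the meaning: "^(Sa|ion)$" needs the whole string to be "Sa" or "ion".
grouped = re.search("^(Sa|ion)$", text)
print(grouped)
```

The second example matches even though the string does not start with "Sa", because the "ion$" branch succeeds on its own.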

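Next is a more advanced example. We want to match a slash that has a letter immediately before it and a letter immediately after it. The pattern uses a lookbehind (?<=...) and a lookahead (?=...), which check the neighbouring characters without making them part of the match: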

In [5]:
print("match found" if re.search('(?<=[a-zA-Z])/(?=[a-zA-Z])', "O8OOO/9YYYkkk") else "no match found")

no match found

In [6]:
print("match found" if re.search('(?<=[a-zA-Z])/(?=[a-zA-Z])', "XXXX/YYYY") else "no match found")

match found
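
The following sketch looks inside the two results above. It reuses the same pattern and strings and only adds calls to group() and span() to show what was actually matched:

```python
import re

pattern = r"(?<=[a-zA-Z])/(?=[a-zA-Z])"

# The lookbehind and lookahead only inspect the neighbours; the match itself is just "/".
match = re.search(pattern, "XXXX/YYYY")
print(match.group(), match.span())

# In "O8OOO/9YYYkkk" the slash is followed by the digit 9, so the lookahead fails.
digit_after = re.search(pattern, "O8OOO/9YYYkkk")
print(digit_after)
```

Since only the slash is consumed, a later substitution replaces the slash alone and leaves the surrounding letters intact.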


Let's implement a clean() function. It should receive a sentence as input, clean it up and then return the clean version of it. By "cleaning" we refer to:
   1. Convert to lowercase only
   2. Change "and/or" to "or"
   3. Change "/" to " or " when it separates two alternative words, such as tomatos/tomatoes
   4. Replace double periods ".." with single period "."
   5. Insert the appropriate space after periods or commas
   6. Convert multi whitespaces to a single whitespace

Let's take one step at a time. First, let's convert the sentence to lowercase. We will be using a sample sentence to see how it changes along the way:

In [7]:
sentence = "     BIBASILAR OPACITIES,likely representing bilateral pleural effusions with ATELECTASIS   and/or PNEUMONIA.."

Use the built-in lower() method to change all characters of a string to their lowercase counterparts. This makes the first step very easy to implement:

In [8]:
sentence = sentence.lower()
sentence

'     bibasilar opacities,likely representing bilateral pleural effusions with atelectasis   and/or pneumonia..'

Let's do steps 2 and 3 in a single cell. The re module provides the sub() function, which replaces every occurrence of a pattern in a string with another string:

In [9]:
sentence = re.sub('and/or', 'or', sentence)
sentence = re.sub('(?<=[a-zA-Z])/(?=[a-zA-Z])', ' or ', sentence)
sentence

'     bibasilar opacities,likely representing bilateral pleural effusions with atelectasis   or pneumonia..'
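
The sample sentence contains no slash other than the one in "and/or", so step 3 has no visible effect there. As a sketch, using short made-up strings (hypothetical, not from the notebook's data), this is how the slash rule behaves and why "and/or" is handled first:

```python
import re

slash = r"(?<=[a-zA-Z])/(?=[a-zA-Z])"

# A slash between two letters becomes " or ".
alternatives = re.sub(slash, " or ", "tomatos/tomatoes")
print(alternatives)

# A slash next to a digit is left alone.
ratio = re.sub(slash, " or ", "ratio 1/2")
print(ratio)

# If "and/or" were not replaced first, the slash rule would produce "and or or".
wrong_order = re.sub(slash, " or ", "atelectasis and/or pneumonia")
print(wrong_order)
```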

Sometimes regexp is overkill because you're only trying to match a simple, fixed string. In those cases it is better to use Python's built-in replace() method. Let's do that for step 4:

In [10]:
sentence = sentence.replace("..", ".")
sentence

'     bibasilar opacities,likely representing bilateral pleural effusions with atelectasis   or pneumonia.'
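
One thing to keep in mind is that replace() targets the exact two-character sequence. The sketch below compares it with a regexp alternative on a run of three periods. The regexp version is an assumption about how you might extend the step and is not what the notebook uses:

```python
import re

# str.replace handles the exact two-character sequence "..".
two_dots = "pneumonia..".replace("..", ".")
print(two_dots)

# With three periods, a single pass still leaves a doubled period behind.
three_dots = "pneumonia...".replace("..", ".")
print(three_dots)

# A regexp quantifier collapses any run of two or more periods in one pass.
collapsed = re.sub(r"\.{2,}", ".", "pneumonia...")
print(collapsed)
```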

For step 5 let's use a lesser-known built-in string method, translate(). You can read everything about it [here](https://docs.python.org/3/library/stdtypes.html#str.translate). Notice that it is usually used alongside the maketrans() method:

In [11]:
punctuation_spacer = str.maketrans({key: f"{key} " for key in ".,"})
sentence = sentence.translate(punctuation_spacer)
sentence

'     bibasilar opacities, likely representing bilateral pleural effusions with atelectasis   or pneumonia. '
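
To see what these two methods are doing, the sketch below prints the table built by maketrans() and applies it to two short strings. Both strings are made up for illustration, and the decimal example is a caveat worth checking against your own data rather than a flaw in the exercise:

```python
# str.maketrans converts the single-character keys into Unicode code points.
punctuation_spacer = str.maketrans({key: f"{key} " for key in ".,"})
print(punctuation_spacer)

# translate() replaces each mapped character with its replacement string.
spaced = "lungs,with no effusion.".translate(punctuation_spacer)
print(repr(spaced))

# Caveat: every period is treated the same way, so decimal numbers are split too.
decimal = "nodule measuring 3.5 cm".translate(punctuation_spacer)
print(decimal)
```

The trailing space added after the final period is harmless here because the whitespace cleanup in step 6 removes it.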

For step 6 we can collapse repeated whitespace by combining Python's split() and join() methods. This can also be done using regexp:

In [12]:
sentence = ' '.join(sentence.split())
sentence

'bibasilar opacities, likely representing bilateral pleural effusions with atelectasis or pneumonia.'
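
For comparison, here is a sketch of the regexp route mentioned above next to the split()/join() approach. The sample text, including the tab character, is a made-up example:

```python
import re

text = "   atelectasis   or \t pneumonia. "

# split() with no arguments splits on any run of whitespace and drops empty strings.
joined = " ".join(text.split())

# Regexp version: collapse each whitespace run to one space, then strip both ends.
collapsed = re.sub(r"\s+", " ", text).strip()

print(joined)
print(joined == collapsed)
```

Both give the same result on this input. The split()/join() version needs no separate strip() call, which is one reason to prefer it here.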

Now we have a much cleaner sentence, won't you agree?

Putting it all together into a function, we can test this implementation on various sentences:

In [17]:
def clean(sentence):
    lower_sentence = sentence.lower()
    corrected_sentence = re.sub('and/or', 'or', lower_sentence)
    corrected_sentence = re.sub('(?<=[a-zA-Z])/(?=[a-zA-Z])', ' or ', corrected_sentence)
    clean_sentence = corrected_sentence.replace("..", ".")
    punctuation_spacer = str.maketrans({key: f"{key} " for key in ".,"})
    clean_sentence = clean_sentence.translate(punctuation_spacer)
    clean_sentence = ' '.join(clean_sentence.split())
    return clean_sentence

sentences = ["     BIBASILAR OPACITIES,likely representing bilateral pleural effusions with ATELECTASIS   and/or PNEUMONIA..",
             "Small left pleural effusion/decreased lung volumes bilaterally.left RetroCardiac Atelectasis.",
             "PA  and lateral views of the chest demonstrate   clear lungs,with NO focal air space opacity and/or pleural effusion.",
             "worrisome nodule in the Right Upper  lobe.CANNOT exclude neoplasm.."]

for n, sentence in enumerate(sentences):
    print("\n##########################\n")
    print(f"Sentence number: {n+1}")
    print(f"Raw sentence: \n{sentence}")
    print(f"Cleaned sentence: \n{clean(sentence)}")


##########################

Sentence number: 1
Raw sentence: 
     BIBASILAR OPACITIES,likely representing bilateral pleural effusions with ATELECTASIS   and/or PNEUMONIA..
Cleaned sentence: 
bibasilar opacities, likely representing bilateral pleural effusions with atelectasis or pneumonia.

##########################

Sentence number: 2
Raw sentence: 
Small left pleural effusion/decreased lung volumes bilaterally.left RetroCardiac Atelectasis.
Cleaned sentence: 
small left pleural effusion or decreased lung volumes bilaterally. left retrocardiac atelectasis.

##########################

Sentence number: 3
Raw sentence: 
PA  and lateral views of the chest demonstrate   clear lungs,with NO focal air space opacity and/or pleural effusion.
Cleaned sentence: 
pa and lateral views of the chest demonstrate clear lungs, with no focal air space opacity or pleural effusion.

##########################

Sentence number: 4
Raw sentence: 
worrisome nodule in the Right Upper  lobe.CANNOT exclude neoplasm..
Cleaned sentence: 
worrisome nodule in the right upper lobe. cannot exclude neoplasm.
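
If you would rather check the function automatically than read printed output, a few assertions can do it. This is only a sketch: clean() is repeated in condensed form so the block runs on its own, and the expected strings are the cleaned sentences 2 and 3 printed above.

```python
import re

def clean(sentence):
    # Same six steps as the notebook's clean(), condensed to reuse one variable.
    text = sentence.lower()
    text = re.sub('and/or', 'or', text)
    text = re.sub('(?<=[a-zA-Z])/(?=[a-zA-Z])', ' or ', text)
    text = text.replace("..", ".")
    text = text.translate(str.maketrans({key: f"{key} " for key in ".,"}))
    return ' '.join(text.split())

# Raw sentence -> expected cleaned sentence, taken from the output above.
cases = {
    "Small left pleural effusion/decreased lung volumes bilaterally.left RetroCardiac Atelectasis.":
        "small left pleural effusion or decreased lung volumes bilaterally. left retrocardiac atelectasis.",
    "PA  and lateral views of the chest demonstrate   clear lungs,with NO focal air space opacity and/or pleural effusion.":
        "pa and lateral views of the chest demonstrate clear lungs, with no focal air space opacity or pleural effusion.",
}

for raw, expected in cases.items():
    assert clean(raw) == expected
    # Cleaning an already clean sentence should leave it unchanged.
    assert clean(expected) == expected

print("all checks passed")
```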

**Congratulations on finishing this lecture notebook!** Hopefully you now better understand regexp and some built-in methods that can be leveraged to clean up text. The clean() function will be used in your graded assignment, so it is worth understanding how it works.