## TT01: Natural Language Processing Pipeline

Natural language processing (NLP) has many uses in healthcare, such as extracting structured data from free-text notes in electronic medical records. The site mtsamples.com hosts medical transcription samples that are de-identified and freely available for educational purposes. This task uses a sample physical therapy note, found here: https://mtsamples.com/site/pages/sample.asp?Type=68-Physical%20Medicine%20-%20Rehab&Sample=1613-Physical%20Therapy%20-%20Back%20Pain

### Create virtual environment

To create and activate a virtual environment in Python (on Windows), enter the following in the PowerShell terminal:

python -m venv tt_venv

tt_venv\Scripts\activate

Ensure the newly created venv is selected as the notebook kernel.
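
To confirm from inside the notebook that the kernel really runs in the virtual environment, one option is to compare `sys.prefix` with `sys.base_prefix`. This is a minimal sketch that assumes the environment was created with `venv`; the helper name `in_virtualenv` is illustrative and not part of the original notebook.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the printed interpreter path does not sit inside `tt_venv`, the notebook is probably still using a different kernel.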

## Web Scraping Text Data with BeautifulSoup4


In [1]:
#if needed: pip install beautifulsoup4 requests spacy
from bs4 import BeautifulSoup
import requests
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.language import Language
import uuid

In [2]:
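url = "https://mtsamples.com/site/pages/sample.asp?Type=68-Physical%20Medicine%20-%20Rehab&Sample=1613-Physical%20Therapy%20-%20Back%20Pain"
# a timeout keeps the cell from hanging if the site does not respond
response = requests.get(url, timeout=30)
# stop early on an HTTP error (4xx/5xx) instead of parsing an error page
response.raise_for_status()
html_content = response.text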

### Extract and Clean Data
The raw HTML contains much more than the note itself. Inspecting the page in the browser shows which elements hold the text we want.
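
The extraction in this section relies on each heading sitting in a `<b>` tag with its text as the next sibling node. A minimal sketch of that pattern follows, run on a made-up HTML fragment. The fragment and the names `sample_html`, `sample_soup`, and `pairs` are illustrative and not taken from the page.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics a "<b>HEADING:</b> text" layout
sample_html = (
    "<div>"
    "<b>DIAGNOSIS:</b> Low back pain.<br>"
    "<b>EMPTY:</b><br>"
    "<b>HISTORY:</b> Fall in 2006."
    "</div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for b_tag in sample_soup.find_all("b"):
    sibling = b_tag.next_sibling
    # A text node is a str subclass; a following tag such as <br> is not
    if isinstance(sibling, str) and sibling.strip():
        pairs.append((b_tag.get_text(strip=True), sibling.strip()))

print(pairs)
```

The `EMPTY:` heading is skipped because its next sibling is a `<br>` tag rather than text, which is the case the `isinstance` check guards against.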

In [3]:
soup = BeautifulSoup(response.content, "html.parser")

In [5]:
#print(soup)

In [6]:
# Search for a specific tag or element
# get all the headers (<b> tags)
headers = soup.find_all("b")
full_text = ""

for header in headers[1:]:
    # look at the node immediately following the header
    sibling = header.next_sibling
    # only keep headers followed directly by non-empty text
    # (next_sibling can be None or another tag, neither of which is a string)
    if isinstance(sibling, str) and sibling.strip():
        # strip() removes surrounding whitespace
        heading_text = header.text.strip()
        full_text += f"{heading_text} {sibling.strip()}; "

full_text

"Sample Name: Physical Therapy - Back Pain; Description: Patient was referred to Physical Therapy, secondary to low back pain and degenerative disk disease. The patient states she has had a cauterization of some sort to the nerves in her low back to help alleviate with painful symptoms. The patient would benefit from skilled physical therapy intervention.; DIAGNOSIS: Low back pain and degenerative lumbar disk.; HISTORY: The patient is a 59-year-old female, who was referred to Physical Therapy, secondary to low back pain and degenerative disk disease. The patient states she has had a cauterization of some sort to the nerves in her low back to help alleviate with painful symptoms. The patient states that this occurred in October 2008 as well as November 2008. The patient has a history of low back pain, secondary to a fall that originally occurred in 2006. The patient states that she slipped on a newly waxed floor and fell on her tailbone and low back region. The patient then had her seco

## Preprocessing Text with spaCy

In the terminal, run: python -m spacy download en_core_web_sm

In [7]:
#load an English language model
nlp = spacy.load("en_core_web_sm")

#create a spaCy document
doc = nlp(full_text)

### Explore spaCy features for NLP

In [8]:
#view tokens, part of speech tagging, dependency labels, lemmatization
for token in doc[11:17]:
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    token_lemma = token.lemma_

    print(f"Token text: {token_text}\nToken part of speech: {token_pos}\nToken dependency label: {token_dep}\nToken lemmatization: {token_lemma}\n")    

Token text: Patient
Token part of speech: PROPN
Token dependency label: nsubjpass
Token lemmatization: Patient

Token text: was
Token part of speech: AUX
Token dependency label: auxpass
Token lemmatization: be

Token text: referred
Token part of speech: VERB
Token dependency label: ROOT
Token lemmatization: refer

Token text: to
Token part of speech: ADP
Token dependency label: prep
Token lemmatization: to

Token text: Physical
Token part of speech: PROPN
Token dependency label: compound
Token lemmatization: Physical

Token text: Therapy
Token part of speech: PROPN
Token dependency label: pobj
Token lemmatization: Therapy



In [9]:
# view entity labels
for ent in doc.ents:
    print(ent.text, ent.label_)

Physical Therapy PERSON
59-year-old DATE
Physical Therapy PERSON
October 2008 DATE
November 2008 DATE
2006 DATE
second ORDINAL
March 2006 DATE
daily DATE
one CARDINAL
approximately June 2008 DATE
MEDICAL IMAGING ORG
7/10 CARDINAL
Pain Analog Scale FAC
0 CARDINAL
10 CARDINAL
10 CARDINAL
PATIENT PERSON
daily DATE
26 cm QUANTITY
52.5 cm QUANTITY
4/5 CARDINAL
six-minute TIME
approximately 400 feet QUANTITY
700 feet QUANTITY
two minutes TIME
six-minute TIME
PROGNOSIS ORG
Prognosis ORG
three CARDINAL
six weeks DATE
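
The abbreviations in the two outputs above (for example `nsubjpass` or `FAC`) can be looked up with `spacy.explain`, which returns a short description for labels in spaCy's glossary. The list of labels below is copied from the outputs; the variable names are illustrative.

```python
import spacy

# Labels copied from the token and entity outputs above
labels = ["nsubjpass", "auxpass", "pobj", "PROPN", "FAC", "ORG", "CARDINAL"]

# spacy.explain returns None for labels it does not know
explanations = {label: spacy.explain(label) for label in labels}
for label, meaning in explanations.items():
    print(f"{label}: {meaning}")
```

Reading the descriptions next to the entity output shows that several labels do not fit the clinical text, such as "Physical Therapy" tagged as `PERSON` or "PROGNOSIS" tagged as `ORG`. This is likely because `en_core_web_sm` is a general-purpose model rather than one trained on medical notes.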


In [10]:
#spaCy encodes strings to hash values, which we can view
patient_hash = nlp.vocab.strings["Patient"]

#we can also retrieve a string using the hash
patient_string = nlp.vocab.strings[patient_hash]

print(f"Hash for 'Patient': {patient_hash}")
print(f"String for hash {patient_hash}: {patient_string}")

Hash for 'Patient': 9416364957002412138
String for hash 9416364957002412138: Patient
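
The same round trip can be tried on a standalone `StringStore`, which is the class behind `nlp.vocab.strings`. This sketch assumes only that spaCy is installed, with no model required. It illustrates that the hash depends only on the string, while the reverse lookup needs a store that has seen the string. The names `store` and `other_store` are illustrative.

```python
from spacy.strings import StringStore

store = StringStore()
# add() registers the string and returns its 64-bit hash
patient_hash = store.add("Patient")

# hash -> string works because this store has seen the string
print(store[patient_hash])
print("Patient" in store)

# The hash is computed from the string alone, so a second,
# independent store produces the same value
other_store = StringStore()
same_hash = other_store.add("Patient")
print(patient_hash == same_hash)
```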


In [11]:
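#we can also view the vector for a token
#note: en_core_web_sm ships without static word vectors, so token.vector
#falls back to the context-sensitive tensor produced by the pipeline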
patient_vector = doc[11].vector
print(patient_vector)

[-3.9741808e-01 -1.3856400e+00  5.7806504e-01 -1.5399754e-02
  1.7877106e-01  2.1425787e-01  1.2081375e+00 -7.1735270e-02
 -3.1131499e-02 -1.1783031e+00 -4.3315195e-02  1.3181183e+00
 -1.1444057e-01 -8.3655721e-01 -1.1318797e+00  1.5556824e-01
 -4.7439426e-01  1.8106326e-02  4.3560645e-01 -4.0725997e-01
 -1.7060134e-01  1.0878589e+00  2.1799409e-01  4.9136853e-01
 -3.4494674e-01  2.0363297e-01  1.0529954e+00 -2.6651493e-01
  3.3629447e-02  2.1072283e+00 -1.5693143e-01 -8.5809922e-01
  6.7423582e-01 -7.0788735e-01  2.2461680e-01  8.9101803e-01
 -1.1783483e+00  8.1692055e-02 -5.7475716e-01  9.8924422e-01
  2.3335974e-01 -1.2498984e+00 -8.1326163e-01  5.7638109e-01
  5.9747797e-01  4.4530320e-01  1.1371143e+00  6.0283279e-01
  1.3172561e-01 -8.4578639e-01 -6.5022427e-01  3.1630656e-01
 -5.7802093e-01 -7.2950244e-01  7.8096412e-02 -5.0547588e-01
  4.4854665e-01 -3.1139100e-01  9.7099662e-02 -2.5116798e-01
 -2.5167334e-01  7.6783502e-01 -3.6242005e-01 -5.7144147e-01
 -1.4918718e-01 -1.64393

## Create a preprocessing pipeline

In [10]:
#"en_core_web_sm" is a prebuilt spaCy pipeline
#we can view the components of the pipeline to understand what it is doing
print("Pipeline components:\n")

for n in nlp.pipe_names:
    print(n)

Pipeline components:

tok2vec
tagger
parser
attribute_ruler
lemmatizer
ner


### Add custom pipeline components

In [12]:
# Component 1: Cleaning Component
@Language.component("cleaning_component")
def cleaning_component(doc):
    """
    Cleans the text:
    - Tokenization, lemmatization, and stopword removal
    - Removes punctuation and spaces
    - Stores cleaned text in doc.user_data
    """
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop  # Remove stopwords
        and not token.is_punct  # Remove punctuation
        and not token.is_space  # Remove spaces
    ]
    cleaned_text = " ".join(tokens)
    doc.user_data["cleaned_text"] = cleaned_text  # Store cleaned text
    return doc
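
One side effect to be aware of: spaCy's default English stop-word list includes "back", which is why "low back pain" appears as "low pain" in the cleaned output later in this notebook. A possible workaround is to exempt selected terms from stop-word removal. The sketch below uses a blank English pipeline so it runs without a model download. The `KEEP_TERMS` set and `clean_tokens` helper are hypothetical names, not part of the original pipeline. The sketch lowercases tokens instead of lemmatizing because a blank pipeline has no lemmatizer.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no trained components

# Hypothetical allow-list of clinically meaningful words to keep
KEEP_TERMS = {"back"}

def clean_tokens(text):
    """Lowercase tokens, dropping punctuation, spaces, and most stopwords."""
    return [
        token.lower_
        for token in nlp_blank(text)
        if (not token.is_stop or token.lower_ in KEEP_TERMS)
        and not token.is_punct
        and not token.is_space
    ]

# "back" is flagged as a stopword by default
print(nlp_blank.vocab["back"].is_stop)
print(clean_tokens("The patient has low back pain."))
```

Which terms belong on the allow-list is a domain decision. Anatomical words that double as common English words are the usual candidates.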

In [13]:
# Component 2: Chunking Component
@Language.component("chunking_component")
def chunking_component(doc, chunk_size=200, overlap=20):
    """
    Splits the cleaned text into chunks and stores them in doc.user_data:
    - Uses the cleaned text from doc.user_data["cleaned_text"]
    - Adds overlap between chunks to retain context
    """
    cleaned_text = doc.user_data.get("cleaned_text", "")
    words = cleaned_text.split()
    chunks = []

    # Distance between the start of one chunk and the start of the next
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    # Resolve the document ID once so every chunk shares the same ID
    document_id = doc.user_data.setdefault("document_id", str(uuid.uuid4()))

    for i in range(0, len(words), step):
        chunk_text = " ".join(words[i:i + chunk_size])
        chunk_metadata = {
            "document_id": document_id,
            "chunk_index": i // step
        }
        chunks.append({"text": chunk_text, "metadata": chunk_metadata})

    doc.user_data["chunks"] = chunks  # Store chunks
    return doc
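
To make the windowing arithmetic concrete, here is a small standalone sketch of the same sliding-window idea on a toy word list. The function name `sliding_chunks` and the toy sizes are illustrative; the loop mirrors the `range(0, len(words), chunk_size - overlap)` stepping used in the component.

```python
def sliding_chunks(words, chunk_size, overlap):
    """Split a list of words into overlapping windows."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

toy_words = [f"w{n}" for n in range(12)]
toy_chunks = sliding_chunks(toy_words, chunk_size=5, overlap=2)
for index, chunk in enumerate(toy_chunks):
    print(index, chunk)
```

With 12 words, a chunk size of 5, and an overlap of 2, the windows start at positions 0, 3, 6, and 9. Each chunk repeats the last two words of the previous one, and the final chunk is shorter than the rest.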

In [14]:
#Component 3: Generate a unique document ID
@Language.component("doc_id_component")
def document_id_component(doc):
    """Add a globally unique document ID using uuid4"""
    doc_id = str(uuid.uuid4())
    doc.user_data["document_id"] = doc_id
    return doc

In [15]:
# Add components to the pipeline
nlp.add_pipe("doc_id_component", first=True)          # Generate document ID
nlp.add_pipe("cleaning_component", after="ner")  # Clean text
nlp.add_pipe("chunking_component", last=True)         # Chunk text

<function __main__.chunking_component(doc, chunk_size=200, overlap=20)>

In [16]:
#we can see our functions are added to the pipeline
for n in nlp.pipe_names:
    print(n)

doc_id_component
tok2vec
tagger
parser
attribute_ruler
lemmatizer
ner
cleaning_component
chunking_component
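
spaCy calls a function registered with `@Language.component` with only the `doc`. As a result, the `chunk_size` and `overlap` defaults in `chunking_component` cannot be overridden through `nlp.add_pipe`. If configurable settings are needed, spaCy's `Language.factory` decorator is one option. The sketch below is an assumption about how that could look, not the notebook's method. The component name `configurable_chunker` and the names `demo_nlp` and `demo_doc` are hypothetical. It chunks raw token texts on a blank pipeline so it runs without a model download.

```python
import spacy
from spacy.language import Language

@Language.factory(
    "configurable_chunker",  # hypothetical component name
    default_config={"chunk_size": 200, "overlap": 20},
)
def create_configurable_chunker(nlp, name, chunk_size: int, overlap: int):
    """Build a chunker whose window settings come from the pipeline config."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    def chunker(doc):
        # Chunk raw token texts here; the notebook's component chunks cleaned text
        words = [token.text for token in doc if not token.is_space]
        doc.user_data["chunks"] = [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)
        ]
        return doc

    return chunker

demo_nlp = spacy.blank("en")  # blank pipeline, so no model download is needed
demo_nlp.add_pipe("configurable_chunker", config={"chunk_size": 4, "overlap": 1})
demo_doc = demo_nlp("one two three four five six seven")
print(demo_doc.user_data["chunks"])
```

The `config` dictionary passed to `add_pipe` overrides the values in `default_config`, so the same component can be reused with different window sizes.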


## Test the Finished Pipeline

In [17]:
#reprocess the text in our updated pipeline
doc = nlp(full_text)

# Access results
cleaned_text = doc.user_data.get("cleaned_text")
print("Cleaned Text:", cleaned_text)  # Print the full cleaned text

chunks = doc.user_data.get("chunks")
for chunk in chunks[:2]:  # Print first 2 chunks for brevity
    print(f"Chunk Text: {chunk['text']}")
    print(f"Metadata: {chunk['metadata']}")

Cleaned Text: sample physical therapy pain description patient refer physical therapy secondary low pain degenerative disk disease patient state cauterization sort nerve low help alleviate painful symptom patient benefit skilled physical therapy intervention diagnosis low pain degenerative lumbar disk history patient 59 year old female refer physical therapy secondary low pain degenerative disk disease patient state cauterization sort nerve low help alleviate painful symptom patient state occur october 2008 november 2008 patient history low pain secondary fall originally occur 2006 patient state slip newly wax floor fall tailbone low region patient second fall march 2006 patient state qualify range handgun lose footing state fall weakness low extremity loss balance past medical history past medical history significant allergy thyroid problem past surgical history patient past surgical history appendectomy hysterectomy social history patient state live single level home husband good hea