# Using a Custom Tokenizer (Not the spaCy Tokenizer) with spaCy

This example demonstrates how to use a tokenizer from the Hugging Face `tokenizers` library with spaCy. Pretrained tokenizers can be used as well. The last example also includes a method for finding spaces based on the tokenizer output, followed by using the result in a more complex pipeline with a sectionizer.

Useful links:
- https://huggingface.co/docs/tokenizers/quicktour
- https://spacy.io/usage/linguistic-features#how-tokenizer-works

```
# Install tokenizer library
pip install tokenizers
```
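```
# Get Data Set
import os
import urllib.request

# Create the target folder and save the archive under the name the unzip step expects
os.makedirs("./data/wikitext-103-raw-v1", exist_ok=True)
urllib.request.urlretrieve("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip", "./data/wikitext-103-raw-v1/wikitext-103-raw-v1.zip")
```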

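```
# Unzip Data Set
from zipfile import ZipFile
# Extract next to the archive so the files land where the training step looks for them
# (assumes the archive's top-level folder is wikitext-103-raw/)
with ZipFile('./data/wikitext-103-raw-v1/wikitext-103-raw-v1.zip', 'r') as f:
    f.extractall('./data/wikitext-103-raw-v1/')
```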
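
Before training, it can help to confirm that the extracted files are where the training step expects them. The following is a minimal sketch using only the standard library; the helper name `missing_splits` is hypothetical, and the folder layout is assumed from the paths used in the training cell.

```python
from pathlib import Path


def missing_splits(raw_dir, splits=("test", "train", "valid")):
    """Return the split names whose wiki.<split>.raw file is not in raw_dir."""
    raw_dir = Path(raw_dir)
    return [s for s in splits if not (raw_dir / f"wiki.{s}.raw").is_file()]


# Folder assumed from the training cell's file list
missing = missing_splits("./data/wikitext-103-raw-v1/wikitext-103-raw")
if missing:
    print("Missing splits:", missing)
else:
    print("All splits found.")
```

An empty list means all three files were found; any names returned point to a download or extraction path problem.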

### Train Model

In [62]:
# Define Tokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Define Trainer
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

In [63]:
# Train Tokenizer
files = [f"./data/wikitext-103-raw-v1/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
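
To give a feel for what BPE training does, here is a toy sketch of the greedy merge loop in plain Python. It is an illustration only, not the `tokenizers` implementation: the word counts are made up, and the helper names (`count_pairs`, `merge_pair`, `toy_bpe`) are hypothetical.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged_words[key] = merged_words.get(key, 0) + freq
    return merged_words


def toy_bpe(word_counts, num_merges):
    """Start from single characters and greedily apply up to num_merges merges."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return words, merges


# Made-up word frequencies for illustration
words, merges = toy_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)
print(list(words))
```

Frequent character sequences end up as single symbols, which is why common words stay whole in the tokenizer output while rarer ones are split into pieces.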

### Save & Load Tokenizer Config

In [65]:
tokenizer.save("./data/wikitext-103-raw-v1/wikitext-103-raw/tokenizer-wiki.json")

In [9]:
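# from_file is a static method that returns a new Tokenizer, so keep the result
tokenizer = Tokenizer.from_file("./data/wikitext-103-raw-v1/wikitext-103-raw/tokenizer-wiki.json")
tokenizer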

<tokenizers.Tokenizer at 0x1f581910ed0>

### Test Tokenizer

In [66]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)

['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']


### Using Custom Tokenizer With spaCy
Tokenize the medical document.

In [67]:
medical_doc = """REASON FOR REFERRAL: , Ms. A is a 60-year-old African-American female with 12 years of education who was referred for neuropsychological evaluation by Dr. X after she demonstrated mild cognitive deficits on a neuropsychological screening evaluation during a followup appointment with him for stroke in July.  A comprehensive evaluation was requested to assess current cognitive functioning and assist with diagnostic decisions and treatment planning.,RELEVANT BACKGROUND INFORMATION:,  Historical information was obtained from a review of available medical records and clinical interview with Ms. A.  A summary of pertinent information is presented below.  Please refer to the patient's medical chart for a more complete history.,HISTORY OF PRESENTING PROBLEM:,  Ms. A presented to the ABC Hospital emergency department on 07/26/2009 reporting that after she had woken up that morning she noticed numbness and weakness in her left hand, slurred speech, and left facial droop.  Neurological evaluation with Dr. X confirmed left hemiparesis.  Brain CT showed no evidence of intracranial hemorrhage or mass effect and that she received TPA and had moderate improvement in left-sided weakness.  These symptoms were thought to be due to a right middle cerebral artery stroke.  She was transferred to the ICU for monitoring.  Ultrasound of the carotids showed 20% to 30% stenosis of the right ICA and 0% to 19% stenosis of the left ICA.  On 07/29/2009, she was admitted for acute inpatient rehabilitation for the treatment of residual functional deficits of her acute ischemic right MCA/CVA.  At discharge on 08/06/2009, she was mainly on supervision for all ADLs and walking with a rolling walker, but tolerating increased ambulation with a cane.  She was discharged home with recommendations for outpatient physical therapy.  
She returned to the Sinai ER on 08/2009/2009 due to reported left arm pain, numbness, and weakness, which lasted 10 to 15 minutes and she reported that it felt "just like the stroke."  Brain CT on 08/2009/2009 was read as showing "mild chronic microvascular ischemic change of deep white matter," but no acute or significant interval change compared to her previous scan.  Neurological examination with Dr. Y was within normal limits, but she was admitted for a more extensive workup.  Due to left arm pain an ultrasound was completed on her left upper extremity, but it did not show deep vein thrombosis.,Followup CT on 08/10/2009 showed no significant interval change.  MRI could not be completed due to the patient's weight.  She was discharged on 08/11/2009 in stable condition after it was determined that this event was not neurological in origin; however, note that Ms. A referred to this as a second stroke.,Ms. A presented for a followup outpatient neurological evaluation with Dr. X on 09/22/2009, at which time a brief neuropsychological screening was also conducted.  She demonstrated significant impairments in confrontation naming, abstract verbal reasoning, and visual and verbal memory and thus a more comprehensive evaluation was suggested due to her intent to return to her full-time work duty.  During the current interview, Ms. A reported that she noticed mild memory problems including some difficultly remembering conversations, events, and at times forgetting to take her medications.  She also reported mild difficulty finding words in conversation, solving novel problems and tasks (e.g. difficulty learning to use her camcorder), but overall denied significant cognitive deficits in attention, concentration, language or other areas of cognitive functioning.  When asked about her return to work, she said that she was still on light duty due to limited physical activity because of residual left leg weakness.  
She reported that no one had indicated to her that she appeared less capable of performing her job duties, but said that she was also receiving fewer files to process and enter data into the computer at the Social Security Agency that she works at.  Note also that she had some difficulty explaining exactly what her job involved.  She also reported having problems falling asleep at work and that she is working full-time although on light duty.,OTHER MEDICAL HISTORY:  ,As mentioned, Ms. A continues to have some residual left leg weakness and continues to use a rolling walker for ambulation, but she reported that her motor functioning had improved significantly.  She was diagnosed with sleep apnea approximately two years ago and was recently counseled by Dr. X on the need to use her CPAP because she indicated she never used it at night.  She reported that since her appointment with Dr. X, she has been using it "every other night."  When asked about daytime fatigue, Ms. A initially denied that she was having any difficulties, but repeatedly indicated that she was falling asleep at work and thought that it was due to looking at a computer screen.  She reported at times "snoring" and forgetting where she is at and said that a supervisor offered to give her coffee at one point.  She receives approximately two to five hours of sleep per night.  Other current untreated risk factors include obesity and hypercholesterolemia.  Her medical history is also significant for hypertension, asthma, abdominal adenocarcinoma status post hysterectomy with bilateral salpingo-oophorectomy, colonic benign polyps status post resection, benign lesions of the breast status post lumpectomy, and deep vein thrombosis in the left lower extremity status post six months of anticoagulation (which she had discontinued just prior to her stroke).,CURRENT MEDICATIONS: , Aspirin 81 mg daily, Colace 100 mg b.i.d., Lipitor 80 mg daily, and albuterol MDI p.r.n.,SUBSTANCE USE:,  Ms. 
A denied drinking alcohol or using illicit drugs.  She used to smoke a pack of cigarettes per day, but quit five to six years ago.,FAMILY MEDICAL HISTORY: , Ms. A had difficulty providing information on familial medical history.  She reported that her mother died three to four years ago from lung cancer.  Her father has gout and blood clots.  Siblings have reportedly been treated for asthma and GI tumors.  She was unsure of familial history of other conditions such as hypertension, high cholesterol, stroke, etc.,SOCIAL HISTORY: , Ms. A completed high school degree.  She reported that she primarily obtained B's and C's in school.  She received some tutoring for algebra in middle school, but denied ever having been held back a grade failing any classes or having any problems with attention or hyperactivity.,She currently works for the Social Security Administration in data processing.  As mentioned, she has returned to full-time work, but continues to perform only light duties due to her physical condition.  She is now living on her own.  She has never driven.  She reported that she continues to perform ADLs independently such as cooking and cleaning.  She lost her husband in 2005 and has three adult daughters.  She previously reported some concerns that her children wanted her to move into assisted living, but she did not discuss that during this current evaluation.  She also reported number of other family members who had recently passed away.  She has returned to activities she enjoys such as quire, knitting, and cooking and plans to go on a cruise to the Bahamas at the end of October.,PSYCHIATRIC HISTORY: , Ms. A did not report a history of psychological or psychiatric treatment.  She reported that her current mood was good, but did describe some anxiety and nervousness about various issues such as her return to work, her upcoming trip, and other events.  
She reported that this only "comes and goes.",TASKS ADMINISTERED:,Clinical Interview,Adult History Questionnaire,Wechsler Test of Adult Reading (WTAR),Mini Mental Status Exam (MMSE),Cognistat Neurobehavioral Cognitive Status Examination,Repeatable Battery for the Assessment of Neuropsychological Status (RBANS; Form XX),Mattis Dementia Rating Scale, 2nd Edition (DRS-2),Neuropsychological Assessment Battery (NAB),Wechsler Adult Intelligence Scale, Third Edition (WAIS-III),Wechsler Adult Intelligence Scale, Fourth Edition (WAIS-IV),Wechsler Abbreviated Scale of Intelligence (WASI),Test of Variables of Attention (TOVA),Auditory Consonant Trigrams (ACT),Paced Auditory Serial Addition Test (PASAT),Ruff 2 & 7 Selective Attention Test,Symbol Digit Modalities Test (SDMT),Multilingual Aphasia Examination, Second Edition (MAE-II),  Token Test,  Sentence Repetition,  Visual Naming,  Controlled Oral Word Association,  Spelling Test,  Aural Comprehension,  Reading Comprehension,Boston Naming Test, Second Edition (BNT-2),Animal Naming Test"""

In [254]:
output = tokenizer.encode(medical_doc.lower())
print(output.tokens)

['reason', 'for', 'refer', 'ral', ':', ',', 'ms', '.', 'a', 'is', 'a', '60', '-', 'year', '-', 'old', 'af', 'rican', '-', 'amer', 'ican', 'female', 'with', '12', 'years', 'of', 'education', 'who', 'was', 'referred', 'for', 'ne', 'u', 'rop', 'sy', 'ch', 'ological', 'evaluation', 'by', 'dr', '.', 'x', 'after', 'she', 'demonstrated', 'mild', 'cognitive', 'def', 'ic', 'its', 'on', 'a', 'ne', 'u', 'rop', 'sy', 'ch', 'ological', 'screening', 'evaluation', 'during', 'a', 'follow', 'up', 'appointment', 'with', 'him', 'for', 'stroke', 'in', 'ju', 'ly', '.', 'a', 'comprehensive', 'evaluation', 'was', 'requested', 'to', 'assess', 'current', 'cognitive', 'functioning', 'and', 'assist', 'with', 'diag', 'nostic', 'decisions', 'and', 'treatment', 'planning', '.', ',', 'relevant', 'background', 'information', ':', ',', 'historical', 'information', 'was', 'obtained', 'from', 'a', 'review', 'of', 'available', 'medical', 'records', 'and', 'clinical', 'interview', 'with', 'ms', '.', 'a', '.', 'a', 'summar

### Process without defining spaces.

In [255]:
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = output.tokens
spaces = None
doc = Doc(nlp.vocab, words=words, spaces=spaces)

doc.text

'reason for refer ral : , ms . a is a 60 - year - old af rican - amer ican female with 12 years of education who was referred for ne u rop sy ch ological evaluation by dr . x after she demonstrated mild cognitive def ic its on a ne u rop sy ch ological screening evaluation during a follow up appointment with him for stroke in ju ly . a comprehensive evaluation was requested to assess current cognitive functioning and assist with diag nostic decisions and treatment planning . , relevant background information : , historical information was obtained from a review of available medical records and clinical interview with ms . a . a summary of pert inent information is presented below . please refer to the patient \' s medical chart for a more complete history . , history of presenting problem : , ms . a presented to the ab c hospital emergency department on 07 / 26 / 2009 reporting that after she had w ok en up that morning she noticed num b ness and weakness in her left hand , sl ur red s
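
When `spaces` is not given, spaCy treats every token as being followed by a space, which is why subword pieces such as "refer" and "ral" appear as separate words in the text above. The plain-Python sketch below models how a list of words and a list of trailing-space flags combine into text; `join_tokens` is a hypothetical helper, not a spaCy function.

```python
def join_tokens(words, spaces=None):
    """Rebuild text from tokens and per-token trailing-space flags."""
    if spaces is None:
        # Model of the default: every token is followed by a space
        spaces = [True] * len(words)
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


pieces = ["refer", "ral", ":"]
print(repr(join_tokens(pieces)))                         # default flags
print(repr(join_tokens(pieces, [False, False, False])))  # explicit flags
```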

### Process when defining spaces.

In [256]:
# Find Spaces
# Tokens that share a word ID are pieces of the same word, so a token is
# followed by a space only when the next token starts a new word.
# Encoding.words is deprecated in recent tokenizers releases; word_ids replaces it.
word_ids = output.word_ids
spaces = []
for i in range(len(word_ids)):
    is_last = i + 1 == len(word_ids)
    spaces.append(not is_last and word_ids[i] != word_ids[i + 1])

# Re-Run Spacy
nlp = spacy.blank("en")
words = output.tokens
doc = Doc(nlp.vocab, words=words, spaces=spaces)

doc.text

'reason for referral : , ms . a is a 60 - year - old african - american female with 12 years of education who was referred for neuropsychological evaluation by dr . x after she demonstrated mild cognitive deficits on a neuropsychological screening evaluation during a followup appointment with him for stroke in july . a comprehensive evaluation was requested to assess current cognitive functioning and assist with diagnostic decisions and treatment planning ., relevant background information :, historical information was obtained from a review of available medical records and clinical interview with ms . a . a summary of pertinent information is presented below . please refer to the patient \' s medical chart for a more complete history ., history of presenting problem :, ms . a presented to the abc hospital emergency department on 07 / 26 / 2009 reporting that after she had woken up that morning she noticed numbness and weakness in her left hand , slurred speech , and left facial droop .
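
The word-ID approach still puts a space around punctuation that the `Whitespace` pre-tokenizer split off (for example `dr . x`), because each punctuation run counts as its own word. An alternative sketch, assuming character offsets are available for each token (the `Encoding` object exposes them as `offsets`), checks whether the character right after each token is whitespace. The text and offsets below are hand-written for illustration, and `spaces_from_offsets` is a hypothetical helper.

```python
def spaces_from_offsets(text, offsets):
    """Flag a token as space-followed if the character right after it is whitespace."""
    return [end < len(text) and text[end].isspace() for _, end in offsets]


# Hand-written example; with the tokenizer above one would pass the encoded
# string and output.offsets instead
text = "dr. x confirmed"
offsets = [(0, 2), (2, 3), (4, 5), (6, 15)]

tokens = [text[start:end] for start, end in offsets]
spaces = spaces_from_offsets(text, offsets)
rebuilt = "".join(tok + (" " if sp else "") for tok, sp in zip(tokens, spaces))

print(tokens)
print(spaces)
print(rebuilt)
```

Because each flag can only encode a single trailing space, runs of several whitespace characters in the source text collapse to one space in the rebuilt text.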

### Bonus Example - Add Sectionizer

Note that not all sections are detected, because the tokenizer splits some words into subword pieces that no longer match the section rules. For instance, "reason for referral:" was tokenized as "reason", "for", "refer", "ral", ":".
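
One possible workaround, sketched here under the assumption that the per-token word IDs from the space-finding cell are available and that the encoding contains no special tokens, is to merge subword pieces back into whole words before building the `Doc`. The helper `merge_subwords` is hypothetical, and the word IDs below are hand-written to mirror the example in the note.

```python
from itertools import groupby


def merge_subwords(tokens, word_ids):
    """Concatenate consecutive tokens that share the same word ID."""
    merged = []
    for _, group in groupby(zip(word_ids, tokens), key=lambda pair: pair[0]):
        merged.append("".join(token for _, token in group))
    return merged


# Hand-written word IDs: "refer" and "ral" belong to the same word
tokens = ["reason", "for", "refer", "ral", ":"]
word_ids = [0, 1, 2, 2, 3]
print(merge_subwords(tokens, word_ids))
```

The merged list is shorter than the original token list, so the `spaces` flags would need to be recomputed at the word level before passing both to `Doc`.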

In [259]:
import re
import spacy
from spacy.tokens import Doc
from medspacy.section_detection import Sectionizer, SectionRule
from medspacy.visualization import visualize_ent

nlp = spacy.blank("en")
sectionizer = nlp.add_pipe("medspacy_sectionizer", config={"rules":None})

# Find section headers (upper-case text ending in a colon) and clean up the strings
sections = [x.replace(":","").replace(",","").strip() for x in re.findall('[A-Z, ]*:', medical_doc)]
new_patterns = []
for s in sections:
    if not s:
        continue  # skip matches that contained only punctuation or spaces
    section_cat = s.lower().replace(" ","_")
    section_lit = "{}:".format(s)
    new_patterns.append(SectionRule(category=section_cat, literal=section_lit))

sectionizer.add(new_patterns)

# Reuse the tokens and spaces computed from the custom tokenizer
doc = Doc(nlp.vocab, words=words, spaces=spaces)
sectionizer(doc)
visualize_ent(doc)

doc._.sections

[Section(category=None at 0 : 0 in the doc with a body at 0 : 94 based on the rule None,
 Section(category=relevant_background_information at 94 : 98 in the doc with a body at 98 : 144 based on the rule SectionRule(literal="RELEVANT BACKGROUND INFORMATION:", category="relevant_background_information", pattern=None, on_match=None, parents=None, parent_required=False),
 Section(category=history_of_presenting_problem at 144 : 149 in the doc with a body at 149 : 885 based on the rule SectionRule(literal="HISTORY OF PRESENTING PROBLEM:", category="history_of_presenting_problem", pattern=None, on_match=None, parents=None, parent_required=False),
 Section(category=other_medical_history at 885 : 889 in the doc with a body at 889 : 1190 based on the rule SectionRule(literal="OTHER MEDICAL HISTORY:", category="other_medical_history", pattern=None, on_match=None, parents=None, parent_required=False),
 Section(category=current_medications at 1190 : 1193 in the doc with a body at 1193 : 1231 based 
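
The header-extraction step from the sectionizer cell can be tried on its own with the standard library. This is a minimal sketch on a shortened sample string adapted from the medical document; `find_headers` and `to_category` are hypothetical helper names.

```python
import re

# Shortened sample adapted from the medical document
sample = (
    "REASON FOR REFERRAL: , Ms. A is a 60-year-old female."
    ",RELEVANT BACKGROUND INFORMATION:,  Historical information was obtained."
)


def find_headers(text):
    """Return upper-case runs that end in a colon, with punctuation stripped."""
    headers = []
    for match in re.findall(r"[A-Z, ]*:", text):
        header = match.replace(":", "").replace(",", "").strip()
        if header:  # skip matches that held only punctuation or spaces
            headers.append(header)
    return headers


def to_category(header):
    """Turn a header such as 'REASON FOR REFERRAL' into a category name."""
    return header.lower().replace(" ", "_")


headers = find_headers(sample)
print(headers)
print([to_category(h) for h in headers])
```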