# Correcting Transcription Errors

If we have a list of valid medications, then a foundational model can use it as a reference to correct transcription errors in hard to read prescriptions.

Using Drugs@FDA as an example
https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files

In [119]:
import sqlite3
import csv
import re
import numpy as np

import rapidfuzz

 ## Fuzzy Matching
First, we need to identify any words in the transcription that are close to valid drug names or active ingredients. We can use fuzzy matching to get likely matches based on metrics like Levenshtein Distance. To do that, create a word list or dictionary of all the possibly words we care about.

In [2]:
def clean_word(word: str) -> str:
    """Clean a single word by removing special characters and converting to uppercase"""
    return re.sub(r'[^A-Za-z]', '', word).upper()

In [97]:
def should_keep_word(word: str) -> bool:
    """Determine if a word should be kept in the final list"""
    # Skip empty strings
    if not word:
        return False

    # Skip short words
    if len(word) < 3:
        return False

    # Skip common conjunctions and articles and any words that don't differentiate
    skip_words = {'AND', 'OR', 'WITH', 'IN', 'THE', 'DAILY'}

    # Skip common chemical terms
    chemical_terms = {'SODIUM', 'HYDROCHLORIDE', 'HCL', 'SULFATE', 'PHOSPHATE',
                      'ACETATE', 'CITRATE', 'COMPLEX', 'RESIN', 'ASPARTATE'}
    if word in skip_words | chemical_terms:
        return False

    return True

In [98]:
def compact_drug_names(filename: str) -> list:
    """
    Read drug names from file and return set of individual cleaned words
    """
    unique_words = set()

    with open(filename, 'r') as file:
        #discard first line header
        file.readline()
        for line in file:
            # Remove any text between **
            line = re.sub(r'\*\*.*?\*\*', '', line)

            # Split on common delimiters
            words = re.split(r'[,;/\(\)]', line)

            for word_group in words:
                # Split into individual words
                individual_words = word_group.split()

                # Clean and filter each word
                for word in individual_words:
                    cleaned_word = clean_word(word)
                    if should_keep_word(cleaned_word):
                        unique_words.add(cleaned_word)

    return sorted(list(unique_words))


In [99]:
sorted_names = compact_drug_names("data/Products.txt")

In [116]:
print(len(sorted_names))

7767


In [100]:
# Write the result to a new file
with open('data/compact_drug_names.txt', 'w') as outfile:
    for name in sorted_names:
        outfile.write(f"{name}\n")

In [107]:
class DrugNameMatcher:
    def __init__(self, word_list: list[str], threshold: int = 80):
        self.word_list = word_list
        self.threshold = threshold

    def find_matches(self, query: str, limit: int = 5) -> list[tuple[str, int, int | str]]:
        """
        Find matching drug names for the query.
        Returns empty list if no matches meet the threshold.
        """
        matches = rapidfuzz.process.extract(
            query.upper(),
            self.word_list,
            scorer=rapidfuzz.fuzz.WRatio,
            limit=limit,
            score_cutoff=self.threshold
        )
        return matches

In [108]:
matcher = DrugNameMatcher(sorted_names, threshold=80)

In [140]:
# Test with mix of drug names and common medical text
test_words = [
    "Heprin",  # Should match
    "1x",  # Should not match
    "daily",  # Should not match
    "Demerol",  # Should match
    "take",  # Should not match
    "Amphetm1ne",  # Should match
    "fexafenodine",
    "albeturol",
]
upper_test_words = [w.upper() for w in test_words]

In [141]:
terms = set()
for word in upper_test_words:
    matches = rapidfuzz.process.extract(
            word,
            sorted_names,
            scorer=rapidfuzz.fuzz.WRatio,
            score_cutoff=80
        )
    if matches:
        for match, score, _ in matches:
            terms.add(match)

In [142]:
matches_matrix = rapidfuzz.process.cdist(upper_test_words, sorted_names, scorer=rapidfuzz.fuzz.WRatio, score_cutoff=80)
terms = set(sorted_names[i] for i in np.where(matches_matrix >= 0.8)[0])

## Drug Details
It could be sufficient to provide the list of likely words to the foundational model, but our data set includes valid strengths for each drug. By providing the full details of the likely medications, the foundational model can correct transcription errors in strength as well.

The dictionary includes terms from the drug names and active ingredients. We can use a substring search to get all the records for the likely words. To reduce the results somewhat, filter the substring matches by word boundaries. It is more likely that the transcription error will be a similar length to the correct term.

Since the dataset is relatively small and doesn't change frequently, it's fine to use sqlite stored in S3 and downloaded on initialization.

In [11]:
def clean_strength(text):
    # Remove comments between ** and **
    return re.sub(r'\*\*.*?\*\*', '', text).strip()

In [65]:
consolidated = []

# Read and process the data
with open('data/Products.txt', 'r') as file:
    next(file)  # Skip header
    reader = csv.reader(file, delimiter='\t')

    for row in reader:
        if len(row) >= 8:
            active_ingredient = row[6]
            form = row[2]
            drug_name = row[5]
            # Clean strength by removing comments between **
            strength = clean_strength(row[3])

            consolidated.append((drug_name, active_ingredient, form, strength))


In [68]:
# Create/connect to SQLite database
conn = sqlite3.connect('data/drugs.db')
cursor = conn.cursor()

# Create table with consolidated columns
cursor.execute('''
    CREATE TABLE IF NOT EXISTS drugs (
        drug_name TEXT,
        active_ingredient TEXT,
        strength TEXT,
        form TEXT
    )
''')

# Create index for text search on active_ingredient
cursor.execute('CREATE INDEX IF NOT EXISTS idx_active_ingredient ON drugs(active_ingredient)')
cursor.execute('CREATE INDEX IF NOT EXISTS idx_drug_name ON drugs(drug_name)')


<sqlite3.Cursor at 0x10423a340>

In [69]:
# Insert consolidated data
for (drug_name, active_ingredient, form, strength) in consolidated:

    cursor.execute('''
        INSERT INTO drugs (active_ingredient, form, drug_name, strength)
        VALUES (?, ?, ?, ?)
    ''', (active_ingredient, form, drug_name, strength))

# Commit changes and close connection
conn.commit()
conn.close()

In [74]:
def search_drugs(search_terms: list[str]):
    conn = sqlite3.connect('data/drugs.db')
    cursor = conn.cursor()

    # Create the WHERE clause dynamically with OR conditions
    where_conditions = []
    params = []
    for term in search_terms:
        where_conditions.append('''
            (drug_name LIKE ? OR active_ingredient LIKE ?)
        ''')
        params.extend([f'%{term}%', f'%{term}%'])

    query = f'''
        SELECT active_ingredient, drug_name, form, strength
        FROM drugs
        WHERE {' OR '.join(where_conditions)}
    '''

    cursor.execute(query, params)
    results = cursor.fetchall()
    conn.close()
    pattern = r'\b(' + r'|'.join(re.escape(term) for term in search_terms) + r')\b'
    filtered_results = [
        row for row in results
        if re.search(pattern, row[0], re.IGNORECASE) or
           re.search(pattern, row[1], re.IGNORECASE)
    ]
    return filtered_results


In [143]:
results = search_drugs(list(terms))
print(len(results))

51
