# NLP-based Clinical Terms Standardization (Proof of Concept)

This notebook demonstrates how to standardize autoimmune encephalitis (AIE) subtype names using NLP techniques:
- TF-IDF Cosine Similarity
- Fuzzy Matching
- Semantic Similarity via spaCy

You can run this notebook without installing anything using [Binder](https://mybinder.org).
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/davidzhao1015/nlp-clinical-term-standardization/edit/main/match-string-terms_2025.03.31_DZ.ipynb)

In [56]:
# Install dependencies if running on Google Colab or Binder
!pip install -q spacy fuzzywuzzy scikit-learn
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## Load Libraries

In [57]:
import pandas as pd
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer # Term Frequency-Inverse Document Frequency (TF-IDF)
from sklearn.feature_extraction.text import CountVectorizer # CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity # Cosine similarity

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

import spacy

## Create a custom function to standardize clinical terms

The custom function `standardize_clinical_terms` takes a single reported clinical term and a list of standard terms, and returns the standard term that best matches it, together with the number of methods that agreed on that result.

The function combines three text-mining techniques. `cosine_similarity` from scikit-learn compares TF-IDF vectors of the reported term and each standard term. The `fuzzywuzzy` library performs fuzzy string matching. The `spacy` library supplies word vectors from the `en_core_web_md` model, which are used to compute semantic similarity. Each method votes for its best-matching standard term when the score reaches its threshold (0.7 for cosine and semantic similarity, 80 for fuzzy matching) and otherwise votes for the reported term itself. The candidate with the highest mean score is returned.
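
To make the TF-IDF cosine step concrete, here is a minimal pure-Python sketch of what that score measures. It is an illustration rather than the notebook's implementation: the helpers `tokenize`, `tfidf_vectors`, and `cosine` are hypothetical names, the standard list is shortened to three terms, and the smoothed IDF formula is written to follow scikit-learn's documented default, so small numerical differences from `TfidfVectorizer` are possible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep word tokens of two or more characters,
    # similar in spirit to TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs):
    # Build one sparse {token: weight} vector per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    return [
        {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


standard_terms = ["NMDAR", "LGI1", "CASPR2"]  # shortened list for illustration
reported = "NMDAR Encephalitis"

# Fit on the standard terms plus the reported term, as the function does.
vectors = tfidf_vectors(standard_terms + [reported])
scores = {
    term: cosine(vec, vectors[-1])
    for term, vec in zip(standard_terms, vectors[:-1])
}
best_term = max(scores, key=scores.get)
print(best_term, round(scores[best_term], 2))
```

In this sketch "NMDAR Encephalitis" scores about 0.62 against "NMDAR", which is below the 0.7 cutoff, because the extra word "encephalitis" carries weight that the standard term lacks. A term like this therefore depends on the other two methods to be matched.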

In [58]:
def standardize_clinical_terms(reported_term, std_term_list):
    """
    Standardize a clinical term based on TF-IDF similarity, fuzzy string matching, and semantic similarity.

    Parameters:
        reported_term (str): The clinical term to standardize.
        std_term_list (list): A list of standard terms.

    Returns:
        Tuple: (standardized_term, number_of_matching_methods)
    """
    # Check inputs
    if not isinstance(reported_term, str):
        raise ValueError("reported_term must be a string.")
    if not std_term_list:
        raise ValueError("std_term_list must contain at least one term.")

    best_match_collection = defaultdict(list)

    # --- Cosine Similarity (TF-IDF) ---
    all_terms = std_term_list + [reported_term]
    tfidf_matrix = TfidfVectorizer().fit_transform(all_terms)
    similarity_matrix = cosine_similarity(tfidf_matrix[:-1], tfidf_matrix[-1:])
    max_score = similarity_matrix.max()
    cosine_score = max_score * 100
    if max_score >= 0.7:
        best_match_idx = similarity_matrix.argmax()
        best_match = std_term_list[best_match_idx]
    else:
        best_match = reported_term
    best_match_collection[best_match].append(cosine_score)

    # --- Fuzzy Matching ---
    best_match_fuzzy, fuzzy_score = process.extractOne(reported_term, std_term_list)
    if fuzzy_score >= 80:
        best_match_collection[best_match_fuzzy].append(fuzzy_score)
    else:
        best_match_collection[reported_term].append(fuzzy_score)

    # --- Semantic Similarity ---
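    # Load the medium-sized English model once and cache it on the function,
    # since reloading it on every call is slow.
    if not hasattr(standardize_clinical_terms, "_nlp"):
        standardize_clinical_terms._nlp = spacy.load("en_core_web_md")
    nlp = standardize_clinical_terms._nlp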

    reported_doc = nlp(reported_term)
    best_match_semantic = reported_term
    best_score = -1
    for std_term in std_term_list:
        std_doc = nlp(std_term)
        if reported_doc.vector_norm == 0 or std_doc.vector_norm == 0:
            continue
        score = reported_doc.similarity(std_doc)
        if score > best_score:
            best_score = score
            best_match_semantic = std_term
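    # best_score stays at -1 when no term has a usable vector; clamp at 0 so
    # that sentinel (or a negative similarity) is not averaged in as a negative score.
    semantic_score = max(best_score, 0) * 100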
    if best_score >= 0.7:
        best_match_collection[best_match_semantic].append(semantic_score)
    else:
        best_match_collection[reported_term].append(semantic_score)

    # --- Format Output ---
    best_match_df = pd.DataFrame([
        {"Standardized": k, "Score": sum(v)/len(v), "Count": len(v)}
        for k, v in best_match_collection.items()
    ])
    best_match_df.sort_values(by="Score", ascending=False, inplace=True)

    return best_match_df.iloc[0]["Standardized"], best_match_df.iloc[0]['Count']
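
The selection rule in the Format Output step can be restated without pandas. The sketch below is a simplified illustration with made-up scores; `pick_winner` and the `votes` values are hypothetical and not part of the notebook. It shows that the winner is the candidate with the highest mean score, not necessarily the one most methods voted for.

```python
from collections import defaultdict


def pick_winner(votes):
    """Return (term, n_methods) for the candidate with the highest mean score.

    votes: list of (candidate_term, score) pairs, one pair per method.
    """
    grouped = defaultdict(list)
    for term, score in votes:
        grouped[term].append(score)
    # Rank candidates by their mean score, highest first.
    ranked = sorted(
        grouped.items(),
        key=lambda item: sum(item[1]) / len(item[1]),
        reverse=True,
    )
    best_term, best_scores = ranked[0]
    return best_term, len(best_scores)


# Made-up scores: two methods fall back to the reported term,
# one method finds a strong match in the standard list.
votes = [
    ("Anti-NMDAR Encephalitis", 45.0),
    ("NMDAR", 90.0),
    ("Anti-NMDAR Encephalitis", 60.0),
]
winner = pick_winner(votes)
print(winner)  # ('NMDAR', 1)
```

With these numbers the reported term averages 52.5 across two votes, while "NMDAR" scores 90 from a single vote, so "NMDAR" is returned with a count of 1. A count of 1 therefore means that only one method supported the returned term.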

## Test Case: Standardizing AIE subtypes

In [59]:
# Reported AIE subtypes
subtypes_reported = pd.DataFrame({"reported_term": ["Anti-NMDAR Encephalitis",
                                                  "NMDAR Encephalitis",
                                                  "NMDAR",
                                                  "NMDA-R",
                                                  "LGI1 Autoimmune Encephalitis",
                                                  "Caspr2"]})  

print(subtypes_reported)                                             


                  reported_term
0       Anti-NMDAR Encephalitis
1            NMDAR Encephalitis
2                         NMDAR
3                        NMDA-R
4  LGI1 Autoimmune Encephalitis
5                        Caspr2


In [60]:
# Standard AIE subtypes list
subtype_std = [
    "NMDAR",
    "LGI1",
    "CASPR2",
    "AMPAR",
    "GABAAR",
    "GABABR",
    "DPPX",
    "Dopamine-2R",
    "mGluR5",
    "Neurexin-3α",
    "IgLON5",
    "P/Q type VGCC",
    "mGluR1",
    "GlyR",
    "SOX-1"
]

In [61]:
# Standardize the reported AIE subtypes
subtypes_reported[['standard_term', 'match_type']] = subtypes_reported['reported_term'].apply(lambda x: pd.Series(standardize_clinical_terms(x, std_term_list=subtype_std)))

In [62]:
subtypes_reported.head(10)

                  reported_term  standard_term  match_type
0       Anti-NMDAR Encephalitis          NMDAR           1
1            NMDAR Encephalitis          NMDAR           1
2                         NMDAR          NMDAR           3
3                        NMDA-R          NMDAR           1
4  LGI1 Autoimmune Encephalitis           LGI1           1
5                        Caspr2         CASPR2           2


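The test case standardizes six reported AIE subtype names against the standard list, and all six map to the intended standard term. The `match_type` column holds the number of methods (out of three) that agreed on the returned term: all three for the exact match "NMDAR", two for "Caspr2", and one for each of the remaining terms.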
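
Because `match_type` counts how many methods supported the returned term, it can serve as a rough confidence signal. The sketch below, using rows copied from the output above, lists the terms that only one method supported so they can be checked by hand. The cutoff `MIN_AGREEMENT` is a hypothetical choice for illustration, not a validated threshold.

```python
# (reported_term, standard_term, match_type) rows copied from the output above
results = [
    ("Anti-NMDAR Encephalitis", "NMDAR", 1),
    ("NMDAR Encephalitis", "NMDAR", 1),
    ("NMDAR", "NMDAR", 3),
    ("NMDA-R", "NMDAR", 1),
    ("LGI1 Autoimmune Encephalitis", "LGI1", 1),
    ("Caspr2", "CASPR2", 2),
]

MIN_AGREEMENT = 2  # hypothetical cutoff: require at least two agreeing methods

# Keep the rows where fewer than MIN_AGREEMENT methods agreed.
needs_review = [row for row in results if row[2] < MIN_AGREEMENT]

for reported, standard, n_methods in needs_review:
    print(f"{reported!r} -> {standard!r} (supported by {n_methods} method)")
```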