# Project - Text Cleaning NLP


## Application of various NLP techniques to a small set of medical texts.

## This project demonstrates a few basic cleaning tasks in NLP.

In [None]:
medical_texts = [
    "Diabetes mellitus (DM), commonly known as diabetes, is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time.",
    "Symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger.",
    "Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body.",
    "These contrast with benign tumors, which do not spread. Treatment options may include surgery, radiation therapy, chemotherapy, or palliation.",
    "Hypertension, also known as high blood pressure (HBP), is a long-term medical condition in which the blood pressure in the arteries is persistently elevated."
]


1. Lowercase Conversion:

  Task: Convert all the text entries in medical_texts to lowercase.
  Objective: Standardize the text to eliminate case sensitivity issues in subsequent analyses.

2. Stopword Removal:

  Task: Remove all stopwords from the texts using a comprehensive medical stopword list (students can be directed to use a custom list or a library such as NLTK for general English stopwords).
  Objective: Eliminate common words that may not contribute useful information to the analysis.

3. Punctuation Removal:

  Task: Remove all punctuation from the texts in medical_texts.
  Objective: Simplify the text for easier tokenization and analysis.

4. Regular Expression Filtering:

  Task: Use a regular expression to filter the dataset to include only texts containing any of the following keywords: ['diabetes', 'cancer', 'hypertension'].
  Objective: Focus the dataset on texts related to specific medical conditions for targeted analysis.

5. Lemmatization:

  Task: Apply lemmatization to the texts to reduce words to their base or dictionary form.
  Objective: Normalize word forms to ensure that variations of a word are processed as a single item.

6. Bonus - Comprehensive Preprocessing Function:

  Task: Create a function, clean_medical_text(text), that incorporates all the preprocessing steps: normalization, punctuation removal, keyword filtering, lemmatization, and stopword removal.
  Objective: Produce a clean output from any given medical text, demonstrating how to streamline these preprocessing steps into a single, reusable function.

## Instructions:
Implement each task in Python using libraries such as NLTK for NLP-related tasks and re for regular expressions.
Provide detailed explanations for your choices in preprocessing techniques and discuss their impact on the dataset.
Include comments in your code to explain the purpose and function of each code block.

## Submission:
  Submit a well-documented Jupyter Notebook, prepared in Google Colab, containing all code and comments.
  Ensure your final function is tested with examples from medical_texts to demonstrate its effectiveness.

NLP - Data Preprocessing

In [None]:
medical_texts = [
    "Diabetes mellitus (DM), commonly known as diabetes, is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time.",
    "Symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger.",
    "Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body.",
    "These contrast with benign tumors, which do not spread. Treatment options may include surgery, radiation therapy, chemotherapy, or palliation.",
    "Hypertension, also known as high blood pressure (HBP), is a long-term medical condition in which the blood pressure in the arteries is persistently elevated."
]

1. Standardizing the text to eliminate case-sensitivity issues: converting all the text entries in medical_texts to lowercase.

In [None]:
!pip install nltk pandas



In [None]:
revised_medical_texts = [x.lower() for x in medical_texts]
revised_medical_texts

['diabetes mellitus (dm), commonly known as diabetes, is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time.',
 'symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger.',
 'cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body.',
 'these contrast with benign tumors, which do not spread. treatment options may include surgery, radiation therapy, chemotherapy, or palliation.',
 'hypertension, also known as high blood pressure (hbp), is a long-term medical condition in which the blood pressure in the arteries is persistently elevated.']

2. Removing stopwords from the texts using a comprehensive medical stopword list: eliminating common words that add little to the analysis.

In [None]:
import nltk                                     # importing nltk toolkit
from nltk.corpus import stopwords               # importing stopword module from nltk
nltk.download('stopwords')                      # downloading stopword dataset to local nltk installation
english_stopwords = set(stopwords.words('english'))  # Getting list of English stopwords

# Add domain-specific stopwords
medical_stopwords = english_stopwords.union({
    "patient", "doctor", "mg", "ml", "tablet", "symptom",
    "treatment", "disease", "prescription", "daily", "diagnosed",  # comma added so "diagnosed" and "dose" stay separate entries
    "dose", "day", "result", "study"
})


print("Comprehensive Medical Stopwords:", medical_stopwords)

Comprehensive Medical Stopwords: {'shan', 'didn', "shouldn't", 'this', 's', 'when', 'he', 'too', 'under', 'above', 'wasn', 'through', 'at', 'has', 'd', 'have', 'me', 'up', "weren't", 'for', 'these', 'their', 'does', 'them', 'whom', 'we', 'after', 'him', 'in', "won't", 'patient', 'now', 'did', 'don', 'same', 'can', 'between', 'they', 'herself', "don't", 'the', 'she', 'of', 'because', 'nor', 'your', 'only', "mightn't", 'there', "aren't", 'itself', 'again', 'how', 'just', 'being', "couldn't", 'ourselves', 'day', 'yourselves', 'once', 'over', 'been', 'aren', 'further', 'during', 'm', 'until', 'and', 'my', 'that', 'so', 'out', 'it', 'a', "should've", 've', 'do', 'result', 'some', 'its', "that'll", "isn't", 'were', 'himself', 'our', 'off', 'is', 'very', "hasn't", 'should', "doesn't", 'o', 'on', "mustn't", 'doctor', "you've", "didn't", 'against', 'ain', 'to', 'any', 'hasn', 'be', "needn't", 'had', 'all', 'am', 'tablet', 'shouldn', 'ml', 'then', 'from', 'as', 'his', 'what', 'will', 'daily', 'y

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
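# NOTE: revised_medical_texts is a list of whole sentences, so each `word` below is an
# entire sentence. No sentence equals a stopword, so nothing is filtered out here;
# the five texts are simply joined into one string.
revised_medical_texts2 = ' '.join([word for word in revised_medical_texts if word.lower() not in medical_stopwords])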
revised_medical_texts2

'diabetes mellitus (dm), commonly known as diabetes, is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time. symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger. cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. these contrast with benign tumors, which do not spread. treatment options may include surgery, radiation therapy, chemotherapy, or palliation. hypertension, also known as high blood pressure (hbp), is a long-term medical condition in which the blood pressure in the arteries is persistently elevated.'
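
The output above still contains words such as "of", "is" and "a" because the comparison against the stopword set is made per sentence rather than per word. A minimal sketch of word-level filtering follows. It assumes a small hand-written set (`demo_stopwords`, hypothetical) in place of the full NLTK-based `medical_stopwords`, so that it runs without any downloads; `remove_stopwords` is likewise an illustrative helper name, not part of the notebook.

```python
import string

# Hypothetical, abbreviated stopword set for illustration only;
# in the notebook this role is played by `medical_stopwords`.
demo_stopwords = {"of", "and", "is", "a", "the", "as", "by", "over", "symptom"}

def remove_stopwords(text, stopwords):
    """Drop tokens whose punctuation-stripped, lowercased form is a stopword."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation only for the comparison,
        # so a kept token such as "hunger." retains its full stop.
        core = token.strip(string.punctuation).lower()
        if core not in stopwords:
            kept.append(token)
    return " ".join(kept)

sample = "symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger."
print(remove_stopwords(sample, demo_stopwords))
```

Note that "symptoms" survives even though "symptom" is in the set: stopword matching is exact, so plural forms are only caught if the text is lemmatized before filtering or the plural is added to the list.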

3. Removing all punctuation from the texts with a regular expression: simplifying the text for easier tokenization and analysis.

In [None]:
import re
np_medical_texts = []
no_punct_medical_texts = re.sub(r'[^\w\s]','',revised_medical_texts2)
np_medical_texts.append(no_punct_medical_texts)
np_medical_texts

['diabetes mellitus dm commonly known as diabetes is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time symptoms of high blood sugar include frequent urination increased thirst and increased hunger cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body these contrast with benign tumors which do not spread treatment options may include surgery radiation therapy chemotherapy or palliation hypertension also known as high blood pressure hbp is a longterm medical condition in which the blood pressure in the arteries is persistently elevated']
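
The substitution above runs on the single joined string and deletes hyphens outright, which is why "long-term" appears as "longterm" in the output. A sketch of an alternative, under the assumption that one would rather keep one entry per text and turn hyphens into spaces (`strip_punctuation` is a hypothetical helper name):

```python
import re

def strip_punctuation(text):
    """Replace hyphens with spaces, drop other punctuation, collapse whitespace."""
    text = text.replace("-", " ")             # "long-term" -> "long term"
    text = re.sub(r"[^\w\s]", "", text)       # same pattern as the cell above
    return re.sub(r"\s+", " ", text).strip()  # tidy any doubled spaces

texts = [
    "diabetes mellitus (dm), commonly known as diabetes.",
    "a long-term medical condition.",
]
cleaned = [strip_punctuation(t) for t in texts]
print(cleaned)
```

Keeping the texts as separate list entries also means a later keyword filter can keep or drop each text individually.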

4. Using a regular expression to limit the dataset to texts containing the keywords ['diabetes', 'cancer', 'hypertension']: focusing medical_texts on specific medical conditions for targeted analysis (with tokenization).

In [None]:
nltk.download('punkt')           # Downloading necessary Tokenization data
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# Defining keywords
keywords = ['diabetes', 'cancer', 'hypertension']

# Creating regex pattern
pattern = re.compile(r'\b(?:' + '|'.join(keywords) + r')\b', flags=re.IGNORECASE)

# Filter texts that match the keywords
filtered_texts = [text for text in np_medical_texts if pattern.search(text)]

# Tokenizing sentences and highlighting matches
tokenized_results = [
    {
        "original_text": text,
        "matched_keywords": pattern.findall(text),
        "tokenized_sentences": sent_tokenize(text),
    }
    for text in filtered_texts
]

# Print results
for result in tokenized_results:
    print("Original Text:", result["original_text"])
    print("Matched Keywords:", result["matched_keywords"])
    print("Tokenized Sentences:", result["tokenized_sentences"])
    print("-" * 50)

Original Text: diabetes mellitus dm commonly known as diabetes is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time symptoms of high blood sugar include frequent urination increased thirst and increased hunger cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body these contrast with benign tumors which do not spread treatment options may include surgery radiation therapy chemotherapy or palliation hypertension also known as high blood pressure hbp is a longterm medical condition in which the blood pressure in the arteries is persistently elevated
Matched Keywords: ['diabetes', 'diabetes', 'cancer', 'hypertension']
Tokenized Sentences: ['diabetes mellitus dm commonly known as diabetes is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time symptoms of high blood sugar include frequent urination increased thirs

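Refining the keyword search to include the word before and after the keyword. Note that the pattern below only restricts the match to "diabetes" when it is not followed by a hyphen; it does not capture neighbouring words.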

In [None]:
# Defining keywords
keywords = ['diabetes', 'cancer', 'hypertension']

# Creating regex pattern
pattern = re.compile(r'\bdiabetes\b(?!-)', flags=re.IGNORECASE)

# Filter texts that match the keywords
filtered_texts = [text for text in np_medical_texts if pattern.search(text)]

# Tokenizing sentences and highlighting matches
tokenized_results = [
    {
        "original_text": text,
        "matched_keywords": pattern.findall(text),
        "tokenized_sentences": sent_tokenize(text),
    }
    for text in filtered_texts
]

# Print results
for result in tokenized_results:
    print("Original Text:", result["original_text"])
    print("Matched Keywords:", result["matched_keywords"])
    print("Tokenized Sentences:", result["tokenized_sentences"])
    print("-" * 50)

Original Text: diabetes mellitus dm commonly known as diabetes is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time symptoms of high blood sugar include frequent urination increased thirst and increased hunger cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body these contrast with benign tumors which do not spread treatment options may include surgery radiation therapy chemotherapy or palliation hypertension also known as high blood pressure hbp is a longterm medical condition in which the blood pressure in the arteries is persistently elevated
Matched Keywords: ['diabetes', 'diabetes']
Tokenized Sentences: ['diabetes mellitus dm commonly known as diabetes is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time symptoms of high blood sugar include frequent urination increased thirst and increased hunger can
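
If the aim is to see the word before and after each keyword, one possible approach uses optional capture groups around the keyword alternation. The group layout and the helper name `keyword_context` are assumptions made for this sketch, not the notebook's method.

```python
import re

keywords = ["diabetes", "cancer", "hypertension"]
alternation = "|".join(re.escape(k) for k in keywords)

# Optional word before, the keyword itself, optional word after.
context_pattern = re.compile(
    r"(?:(\w+)\s+)?\b(" + alternation + r")\b(?:\s+(\w+))?",
    flags=re.IGNORECASE,
)

def keyword_context(text):
    """Return (previous_word, keyword, next_word) tuples; missing neighbours are ''."""
    return [
        (m.group(1) or "", m.group(2), m.group(3) or "")
        for m in context_pattern.finditer(text)
    ]

sample = "diabetes mellitus dm commonly known as diabetes is a group"
print(keyword_context(sample))
```

Because regex matches do not overlap, a word consumed as the "next word" of one keyword cannot also be reported as the "previous word" of a keyword that follows it directly.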

In [None]:
import pandas as pd

data = {
    'id': [1, 2, 3, 4, 5],
    'revised_medical_texts' : ['diabetes mellitus (dm), commonly known as diabetes, is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time.',
 'symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger.',
 'cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body.',
 'these contrast with benign tumors, which do not spread. treatment options may include surgery, radiation therapy, chemotherapy, or palliation.',
 'hypertension, also known as high blood pressure (hbp), is a long-term medical condition in which the blood pressure in the arteries is persistently elevated.']
}
df = pd.DataFrame(data)

# Define the keywords
keywords = ['diabetes', 'cancer', 'hypertension']

# Create a regex pattern
pattern = '|'.join(keywords)  # "diabetes|cancer|hypertension"

# Filter the dataset
filtered_df = df[df['revised_medical_texts'].str.contains(pattern, case=False, na=False)]  # case-insensitive

print(filtered_df)

   id                              revised_medical_texts
0   1  diabetes mellitus (dm), commonly known as diab...
2   3  cancer is a group of diseases involving abnorm...
4   5  hypertension, also known as high blood pressur...


In [None]:
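# Tokenize each cleaned text into words; the later cells iterate over words[0],
# the token list of the first (and only) entry in np_medical_texts
words = [word_tokenize(text) for text in np_medical_texts]
words[0]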

['diabetes',
 'mellitus',
 'dm',
 'commonly',
 'known',
 'as',
 'diabetes',
 'is',
 'a',
 'group',
 'of',
 'metabolic',
 'disorders',
 'characterized',
 'by',
 'a',
 'high',
 'blood',
 'sugar',
 'level',
 'over',
 'a',
 'prolonged',
 'period',
 'of',
 'time',
 'symptoms',
 'of',
 'high',
 'blood',
 'sugar',
 'include',
 'frequent',
 'urination',
 'increased',
 'thirst',
 'and',
 'increased',
 'hunger',
 'cancer',
 'is',
 'a',
 'group',
 'of',
 'diseases',
 'involving',
 'abnormal',
 'cell',
 'growth',
 'with',
 'the',
 'potential',
 'to',
 'invade',
 'or',
 'spread',
 'to',
 'other',
 'parts',
 'of',
 'the',
 'body',
 'these',
 'contrast',
 'with',
 'benign',
 'tumors',
 'which',
 'do',
 'not',
 'spread',
 'treatment',
 'options',
 'may',
 'include',
 'surgery',
 'radiation',
 'therapy',
 'chemotherapy',
 'or',
 'palliation',
 'hypertension',
 'also',
 'known',
 'as',
 'high',
 'blood',
 'pressure',
 'hbp',
 'is',
 'a',
 'longterm',
 'medical',
 'condition',
 'in',
 'which',
 'the',
 '

5. Lemmatization: reducing words to their base or dictionary form, so that variations of a word are processed as a single item.

# Stemming

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()                            # Initializing Porter Stemmer

for word in words[0]:
  print(word, ":", ps.stem(word))

diabetes : diabet
mellitus : mellitu
dm : dm
commonly : commonli
known : known
as : as
diabetes : diabet
is : is
a : a
group : group
of : of
metabolic : metabol
disorders : disord
characterized : character
by : by
a : a
high : high
blood : blood
sugar : sugar
level : level
over : over
a : a
prolonged : prolong
period : period
of : of
time : time
symptoms : symptom
of : of
high : high
blood : blood
sugar : sugar
include : includ
frequent : frequent
urination : urin
increased : increas
thirst : thirst
and : and
increased : increas
hunger : hunger
cancer : cancer
is : is
a : a
group : group
of : of
diseases : diseas
involving : involv
abnormal : abnorm
cell : cell
growth : growth
with : with
the : the
potential : potenti
to : to
invade : invad
or : or
spread : spread
to : to
other : other
parts : part
of : of
the : the
body : bodi
these : these
contrast : contrast
with : with
benign : benign
tumors : tumor
which : which
do : do
not : not
spread : spread
treatment : treatment
options : opt

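Lemmatization is similar to stemming, but it uses a dictionary (WordNet, in NLTK) to reduce words to valid base forms. Stemming strips suffixes by rule and can produce forms that are not real words, such as "diabet", "commonli" and "bodi" in the output above. Lemmatization is not perfect either: without part-of-speech information, WordNetLemmatizer treats every word as a noun, so "as" becomes "a" and verb forms such as "characterized" are left unchanged in the output below.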
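
The difference in mechanism can be illustrated with a toy example. This is a deliberately simplified sketch: `toy_stem` is a crude suffix stripper and not the Porter algorithm, and `TOY_LEMMAS` is a hypothetical four-entry dictionary standing in for WordNet.

```python
def toy_stem(word):
    """Crude rule-based suffix stripping (illustrative only, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Strip the first matching suffix, keeping at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary of base forms, standing in for WordNet.
TOY_LEMMAS = {
    "diseases": "disease",
    "tumors": "tumor",
    "parts": "part",
    "diabetes": "diabetes",
}

def toy_lemmatize(word):
    """Dictionary lookup: return the listed base form, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["diseases", "tumors", "parts", "diabetes"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
```

The rule-based function cuts "diabetes" down to "diabet" because it only sees a suffix, while the lookup returns the listed form, which is the behaviour a dictionary-backed lemmatizer relies on.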

# Lemmatization

In [None]:
nltk.download('wordnet')                                     # downloading wordnet for Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for word in words[0]:
  print(word, ":", lemmatizer.lemmatize(word))

diabetes : diabetes
mellitus : mellitus
dm : dm
commonly : commonly
known : known
as : a
diabetes : diabetes
is : is
a : a
group : group
of : of
metabolic : metabolic
disorders : disorder
characterized : characterized
by : by
a : a
high : high
blood : blood
sugar : sugar
level : level
over : over
a : a
prolonged : prolonged
period : period
of : of
time : time
symptoms : symptom
of : of
high : high
blood : blood
sugar : sugar
include : include
frequent : frequent
urination : urination
increased : increased
thirst : thirst
and : and
increased : increased
hunger : hunger
cancer : cancer
is : is
a : a
group : group
of : of
diseases : disease
involving : involving
abnormal : abnormal
cell : cell
growth : growth
with : with
the : the
potential : potential
to : to
invade : invade
or : or
spread : spread
to : to
other : other
parts : part
of : of
the : the
body : body
these : these
contrast : contrast
with : with
benign : benign
tumors : tumor
which : which
do : do
not : not
spread : spread
tr

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Bonus - Comprehensive Preprocessing Function:

Task: Create a function, clean_medical_text(text), that incorporates all the preprocessing steps: normalization, punctuation removal, keyword filtering, lemmatization, and stopword removal.
Objective: Produce a clean output from any given medical text, demonstrating how to streamline these preprocessing steps into a single, reusable function.

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer

# Ensure necessary NLTK resources are downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Define the function
def clean_medical_text(text):
    """
    Preprocesses medical text by normalizing, removing punctuation,
    filtering keywords, lemmatizing, and removing stopwords.

    Parameters:
        text (str): The input medical text.

    Returns:
        dict: Contains cleaned text, matched keywords, and tokenized sentences.
    """
    # Normalization: Convert to lowercase
    text = text.lower()

    # Define comprehensive stopwords (English + medical-specific)
    english_stopwords = set(stopwords.words('english'))
    medical_stopwords = english_stopwords.union({
        "patient", "doctor", "mg", "ml", "tablet", "symptom",
        "treatment", "disease", "prescription", "daily", "diagnosed",
        "dose", "day", "result", "study"
    })

    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize the text into words
    words = word_tokenize(text)

    # Remove stopwords
    words = [word for word in words if word not in medical_stopwords]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Reconstruct the cleaned text
    cleaned_text = ' '.join(lemmatized_words)

    # Define keywords for filtering
    keywords = ['diabetes', 'cancer', 'hypertension']
    pattern = re.compile(r'\b(?:' + '|'.join(keywords) + r')\b', flags=re.IGNORECASE)

    # Filter for keywords and tokenize sentences
    matched_keywords = pattern.findall(cleaned_text)
    tokenized_sentences = sent_tokenize(cleaned_text)

    # Return results as a dictionary
    return {
        "cleaned_text": cleaned_text,
        "matched_keywords": matched_keywords,
        "tokenized_sentences": tokenized_sentences
    }

# Example usage
if __name__ == "__main__":
    medical_texts = [
        "The patient was diagnosed with diabetes and hypertension.",
        "Cancer treatment is ongoing.",
        "The prescribed dose is 500mg daily for the patient.",
        "No significant findings were noted in the study."
    ]

    # Apply the function to each medical text
    for text in medical_texts:
        result = clean_medical_text(text)
        print("Original Text:", text)
        print("Cleaned Text:", result["cleaned_text"])
        print("Matched Keywords:", result["matched_keywords"])
        print("Tokenized Sentences:", result["tokenized_sentences"])
        print("-" * 50)


Original Text: The patient was diagnosed with diabetes and hypertension.
Cleaned Text: diabetes hypertension
Matched Keywords: ['diabetes', 'hypertension']
Tokenized Sentences: ['diabetes hypertension']
--------------------------------------------------
Original Text: Cancer treatment is ongoing.
Cleaned Text: cancer ongoing
Matched Keywords: ['cancer']
Tokenized Sentences: ['cancer ongoing']
--------------------------------------------------
Original Text: The prescribed dose is 500mg daily for the patient.
Cleaned Text: prescribed 500mg
Matched Keywords: []
Tokenized Sentences: ['prescribed 500mg']
--------------------------------------------------
Original Text: No significant findings were noted in the study.
Cleaned Text: significant finding noted
Matched Keywords: []
Tokenized Sentences: ['significant finding noted']
--------------------------------------------------


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
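
clean_medical_text reports which keywords each text contains but does not drop texts without a match. If the dataset itself should be limited to the keyword conditions, a small filter can be applied first. The sketch below assumes a module-level compiled pattern and a hypothetical helper, `filter_by_keyword`; it uses only the standard library.

```python
import re

KEYWORDS = ["diabetes", "cancer", "hypertension"]

# Compile once at module level so the pattern is not rebuilt on every call.
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\b",
    flags=re.IGNORECASE,
)

def filter_by_keyword(texts):
    """Keep only the texts that mention at least one keyword."""
    return [t for t in texts if KEYWORD_RE.search(t)]

sample_texts = [
    "The patient was diagnosed with diabetes and hypertension.",
    "Cancer treatment is ongoing.",
    "The prescribed dose is 500mg daily for the patient.",
    "No significant findings were noted in the study.",
]

for text in filter_by_keyword(sample_texts):
    print(text)
```

The surviving texts could then be passed one by one to clean_medical_text, so that only condition-related texts reach the cleaning step.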
