In [56]:
# Import the drive module from google.colab to interact with Google Drive.
# Mount Google Drive to access files stored there. This will prompt for authentication.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [57]:
# Import the os module for operating system functionalities.
import os
# Change the current working directory to the specified path in Google Drive.
# This ensures that subsequent file operations are performed relative to this directory.
os.chdir('/content/drive/MyDrive/Academics/Visiting Lectures/2026-H1/202601-SDP-AU/Session-12-Natural-Language-Processing')

In [58]:
# Import the pandas library for data manipulation and analysis.
import pandas as pd

# Load the 'hospital_exit_interviews.csv' dataset into a pandas DataFrame.
# The dataset is located in the 'Data' subdirectory.
exit_interview = pd.read_csv('Data/hospital_exit_interviews.csv')

# Display the first few rows of the DataFrame to inspect its structure and content.
exit_interview.head()

Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive
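
The `Unnamed: 0` column in the preview is most likely the row index that was written out when the CSV was saved, and `Discharge Date` is read as plain text in day-month-year form. The sketch below shows one way to handle both at load time. It uses a two-row inline sample whose interview text is made up to mirror the layout above (it is not taken from the real file), so it runs without the Drive path.

```python
import io
import pandas as pd

# Hypothetical two-row sample mirroring the column layout shown above.
sample_csv = io.StringIO(
    ",Patient_ID,Discharge Date,Exit Interview,Customer Sentiment\n"
    "0,P00001,15-04-2024,Billing was confusing.,Negative\n"
    "1,P00002,16-11-2024,There were significant delays.,Negative\n"
)

# index_col=0 uses the first (unnamed) column as the index instead of
# loading it as a data column called 'Unnamed: 0'.
sample_df = pd.read_csv(sample_csv, index_col=0)

# Parse day-month-year strings into real datetime values.
sample_df["Discharge Date"] = pd.to_datetime(
    sample_df["Discharge Date"], format="%d-%m-%Y"
)

print(sample_df.dtypes)
```

The same two arguments could be applied to the `pd.read_csv('Data/hospital_exit_interviews.csv')` call, assuming the real file has the same layout as the preview.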


## **Introduction to Text Preprocessing**

Text preprocessing is a crucial step in Natural Language Processing (NLP) that involves cleaning and transforming raw text data into a more suitable and standardized format for analysis and model training. Raw text often contains noise, inconsistencies, and irrelevant information that can hinder the performance of NLP models.

Preprocessing matters because it removes variation that carries no signal and shrinks the vocabulary, and with it the feature space, that downstream models must handle. Key preprocessing steps include:

1.  **Tokenization**: Breaking down text into smaller units, such as words or phrases (tokens).
2.  **Lowercasing**: Converting all text to lowercase to treat words like 'The' and 'the' as the same.
3.  **Punctuation Removal**: Eliminating punctuation marks (e.g., periods, commas, question marks) that often don't contribute to the semantic meaning.
4.  **Stop Word Removal**: Deleting common words (e.g., 'a', 'an', 'the', 'is') that carry little lexical meaning and can add noise.
5.  **Stemming**: Reducing words to a crude stem by stripping suffixes with heuristic rules (e.g., 'running' and 'runs' both become 'run'). Irregular forms such as 'ran' are left unchanged, and the stem is not always a real word (e.g., 'insurance' becomes 'insur').
6.  **Lemmatization**: Reducing words to their dictionary form (lemma) using a vocabulary and the word's part of speech, which yields valid words (e.g., 'better' becomes 'good' when tagged as an adjective, whereas Porter stemming leaves it as 'better').

After these steps the text is cleaner and more consistent, which generally makes it easier for algorithms to process. Each step also discards some information, so which steps to apply depends on the downstream task.

## **Tokenization**


In [59]:
# Import the Natural Language Toolkit (nltk).
import nltk

# Download the 'punkt' tokenizer models from NLTK.
# The 'punkt' tokenizer is used for splitting text into words and sentences.
nltk.download('punkt')
print("NLTK and 'punkt' tokenizer downloaded.")

NLTK and 'punkt' tokenizer downloaded.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [60]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt_tab' resource, which recent NLTK releases look up for word tokenization.
# Without it, word_tokenize can raise a LookupError.
nltk.download('punkt_tab')

def tokenize_text(text):
    """
    Tokenizes the input text into individual words.

    Args:
        text (str): The input string to be tokenized.

    Returns:
        list: A list of tokens (words).
    """
    return word_tokenize(text)

# Apply the tokenize_text function to the 'Exit Interview' column of the DataFrame.
# Store the resulting list of tokens in a new column named 'tokens'.
exit_interview['tokens'] = exit_interview['Exit Interview'].apply(tokenize_text)
print("Text tokenization complete. Displaying the first few rows with the new 'tokens' column:")
exit_interview.head()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Text tokenization complete. Displaying the first few rows with the new 'tokens' column:


Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment,tokens
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative,"[Billing, and, insurance, processing, was, con..."
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative,"[There, were, significant, delays, in, diagnos..."
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive,"[I, was, impressed, with, the, hospital, ’, s,..."
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral,"[The, medical, care, met, basic, expectations,..."
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive,"[The, nursing, staff, was, consistently, atten..."
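
Row 2 of the output shows `hospital’s` split into `hospital`, `’`, `s`. The text uses a typographic apostrophe (U+2019), which ends up as a token of its own here. One option, sketched below under the assumption that normalizing quotes is acceptable for this dataset, is to map curly quotes to their ASCII equivalents before tokenizing. The helper name `normalize_quotes` is illustrative and not part of NLTK.

```python
# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)

print(normalize_quotes("I was impressed with the hospital\u2019s cleanliness"))
```

Applied to the `Exit Interview` column before `tokenize_text`, this would let the tokenizer see a standard apostrophe instead of a symbol it does not recognize as one.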


## **Lowercasing**


In [61]:
def to_lowercase(tokens):
    """
    Converts all tokens in a list to lowercase.

    Args:
        tokens (list): A list of string tokens.

    Returns:
        list: A new list with all tokens converted to lowercase.
    """
    return [token.lower() for token in tokens]

# Apply the to_lowercase function to the 'tokens' column.
# Store the results in a new column named 'tokens_lower'.
exit_interview['tokens_lower'] = exit_interview['tokens'].apply(to_lowercase)

print("Tokens converted to lowercase. Displaying the first few rows with the new 'tokens_lower' column:")
exit_interview.head()

Tokens converted to lowercase. Displaying the first few rows with the new 'tokens_lower' column:


Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment,tokens,tokens_lower
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative,"[Billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con..."
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative,"[There, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos..."
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive,"[I, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,..."
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral,"[The, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,..."
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive,"[The, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten..."


## **Punctuation Removal**

In [62]:
import string

def remove_punctuation(tokens):
    """
    Removes punctuation from a list of tokens.

    Args:
        tokens (list): A list of string tokens.

    Returns:
        list: A new list with punctuation removed from each token.
    """
    # Build a translation table that deletes every character in string.punctuation.
    # Note: string.punctuation covers ASCII punctuation only, so typographic marks
    # such as the curly apostrophe (’) are left in place.
    translator = str.maketrans('', '', string.punctuation)
    # Strip punctuation from each token, then drop tokens that end up empty
    # (i.e., tokens that consisted solely of punctuation).
    stripped = (token.translate(translator) for token in tokens)
    return [token for token in stripped if token]

# Apply the remove_punctuation function to the 'tokens_lower' column.
# Store the results in a new column named 'tokens_no_punct'.
exit_interview['tokens_no_punct'] = exit_interview['tokens_lower'].apply(remove_punctuation)

print("Punctuation removal complete. Displaying the first few rows with the new 'tokens_no_punct' column:")
exit_interview.head()

Punctuation removal complete. Displaying the first few rows with the new 'tokens_no_punct' column:


Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment,tokens,tokens_lower,tokens_no_punct
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative,"[Billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con..."
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative,"[There, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos..."
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive,"[I, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,..."
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral,"[The, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,..."
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive,"[The, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten..."
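
In row 2 the `’` token is still present in `tokens_no_punct`. That is expected, because `string.punctuation` contains ASCII characters only. A minimal sketch of a Unicode-aware variant follows. It treats a character as punctuation if it is in `string.punctuation` or if its Unicode general category starts with `P`. The function names are illustrative, and the sample tokens are a hand-written approximation of the row above.

```python
import string
import unicodedata

def is_punct(ch):
    """True for ASCII punctuation and for any Unicode punctuation character."""
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")

def remove_punctuation_unicode(tokens):
    """Strip punctuation characters from each token and drop empty results."""
    cleaned = ("".join(ch for ch in token if not is_punct(ch)) for token in tokens)
    return [token for token in cleaned if token]

sample = ["i", "was", "impressed", "with", "the", "hospital", "\u2019", "s", ","]
print(remove_punctuation_unicode(sample))
# ['i', 'was', 'impressed', 'with', 'the', 'hospital', 's']
```

Keeping the `string.punctuation` check alongside the category test matters because ASCII symbols such as `$` and `+` are classified as symbols, not punctuation, in Unicode.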


## **Stop Word Removal**


In [63]:
from nltk.corpus import stopwords
import nltk

# Download the 'stopwords' corpus from NLTK.
# Stop words are common words (e.g., 'the', 'is', 'a') that often carry little meaning
# and can be removed to reduce noise in text data.
nltk.download('stopwords')

# Get the set of English stop words for efficient lookup.
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    """
    Removes common English stop words from a list of tokens.

    Args:
        tokens (list): A list of string tokens.

    Returns:
        list: A new list with stop words removed.
    """
    # Filter out tokens that are present in the stop_words set.
    return [token for token in tokens if token not in stop_words]

# Apply the remove_stopwords function to the 'tokens_no_punct' column.
# Store the results in a new column named 'tokens_no_stopwords'.
exit_interview['tokens_no_stopwords'] = exit_interview['tokens_no_punct'].apply(remove_stopwords)

print("Stop word removal complete. Displaying the first few rows with the new 'tokens_no_stopwords' column:")
exit_interview.head()

Stop word removal complete. Displaying the first few rows with the new 'tokens_no_stopwords' column:


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment,tokens,tokens_lower,tokens_no_punct,tokens_no_stopwords
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative,"[Billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con...","[billing, insurance, processing, confusing, po..."
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative,"[There, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos...","[significant, delays, diagnostic, procedures, ..."
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive,"[I, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,...","[impressed, hospital, ’, cleanliness, infrastr..."
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral,"[The, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,...","[medical, care, met, basic, expectations, staf..."
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive,"[The, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten...","[nursing, staff, consistently, attentive, comp..."
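
NLTK's English stop word list includes negations such as `not` and `no`. Because this dataset carries a `Customer Sentiment` label, dropping them could blur the difference between phrases like "helpful" and "not helpful". The sketch below shows one way to keep negations. It uses a tiny stand-in stop list so that it runs without any download, and the function and variable names are illustrative.

```python
# Tiny stand-in stop list for illustration only; in the notebook this would be
# set(stopwords.words('english')) from NLTK.
stop_words_demo = {"the", "was", "is", "a", "and", "not", "no", "nor"}

# Negation words to keep, since they can flip the polarity of a sentence.
negations = {"not", "no", "nor"}

def remove_stopwords_keep_negation(tokens, stop_words=stop_words_demo, keep=negations):
    """Drop stop words from tokens, but retain any word listed in keep."""
    effective = stop_words - keep
    return [token for token in tokens if token not in effective]

print(remove_stopwords_keep_negation(["the", "staff", "was", "not", "helpful"]))
# ['staff', 'not', 'helpful']
```

Whether to keep negations depends on the downstream model. For a bag-of-words sentiment classifier they are often worth retaining.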


## **Stemming**

In [64]:
from nltk.stem import PorterStemmer

# Initialize the Porter Stemmer. Stemming reduces words to their root or base form
# by removing suffixes (e.g., 'running', 'runs' -> 'run').
stemmer = PorterStemmer()

def apply_stemming(tokens):
    """
    Applies Porter Stemming to a list of tokens.

    Args:
        tokens (list): A list of string tokens.

    Returns:
        list: A new list with stemmed tokens.
    """
    # Apply the stemmer to each token in the list.
    return [stemmer.stem(token) for token in tokens]

# Apply the apply_stemming function to the 'tokens_no_stopwords' column.
# Store the results in a new column named 'tokens_stemmed'.
exit_interview['tokens_stemmed'] = exit_interview['tokens_no_stopwords'].apply(apply_stemming)

print("Stemming complete. Displaying the first few rows with the new 'tokens_stemmed' column:")
exit_interview.head()

Stemming complete. Displaying the first few rows with the new 'tokens_stemmed' column:


Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment,tokens,tokens_lower,tokens_no_punct,tokens_no_stopwords,tokens_stemmed
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative,"[Billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con...","[billing, insurance, processing, confusing, po...","[bill, insur, process, confus, poorli, coordin..."
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative,"[There, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos...","[significant, delays, diagnostic, procedures, ...","[signific, delay, diagnost, procedur, consult,..."
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive,"[I, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,...","[impressed, hospital, ’, cleanliness, infrastr...","[impress, hospit, ’, cleanli, infrastructur, o..."
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral,"[The, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,...","[medical, care, met, basic, expectations, staf...","[medic, care, met, basic, expect, staff, perfo..."
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive,"[The, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten...","[nursing, staff, consistently, attentive, comp...","[nurs, staff, consist, attent, compassion, reg..."
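
To make the behaviour of rule-based stemming concrete, the short sketch below runs `PorterStemmer` on a handful of words, two of which (`insurance`, `cleanliness`) appear in the output above. It needs only the `nltk` package itself, with no extra data download.

```python
from nltk.stem import PorterStemmer

stemmer_demo = PorterStemmer()

# Regular inflections, an irregular verb form, a comparative, and two words
# from the exit interviews.
words = ["running", "runs", "ran", "better", "insurance", "cleanliness"]
stems = {word: stemmer_demo.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Suffix stripping merges `running` and `runs`, but it has no knowledge of irregular forms, so `ran` and `better` pass through unchanged, and stems such as `insur` are not dictionary words.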


## **Lemmatization**


In [65]:
from nltk.stem import WordNetLemmatizer
import nltk

# Download the 'wordnet' corpus from NLTK.
# WordNet is a lexical database for English, used by the lemmatizer.
nltk.download('wordnet')

# Initialize the WordNet Lemmatizer. Lemmatization reduces words to their dictionary
# form (lemma), so the results are valid words. Note: lemmatize() treats every token
# as a noun unless a part-of-speech tag is passed via its `pos` argument.
lemmatizer = WordNetLemmatizer()

def apply_lemmatization(tokens):
    """
    Applies WordNet Lemmatization to a list of tokens.

    Args:
        tokens (list): A list of string tokens.

    Returns:
        list: A new list with lemmatized tokens.
    """
    # Apply the lemmatizer to each token (default pos='n', so only noun forms are reduced).
    return [lemmatizer.lemmatize(token) for token in tokens]

# Apply the apply_lemmatization function to the 'tokens_no_stopwords' column.
# Store the results in a new column named 'tokens_lemmatized'.
exit_interview['tokens_lemmatized'] = exit_interview['tokens_no_stopwords'].apply(apply_lemmatization)

print("Lemmatization complete. Displaying the first few rows with the new 'tokens_lemmatized' column:")
exit_interview.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatization complete. Displaying the first few rows with the new 'tokens_lemmatized' column:


Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment,tokens,tokens_lower,tokens_no_punct,tokens_no_stopwords,tokens_stemmed,tokens_lemmatized
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative,"[Billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con...","[billing, and, insurance, processing, was, con...","[billing, insurance, processing, confusing, po...","[bill, insur, process, confus, poorli, coordin...","[billing, insurance, processing, confusing, po..."
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative,"[There, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos...","[there, were, significant, delays, in, diagnos...","[significant, delays, diagnostic, procedures, ...","[signific, delay, diagnost, procedur, consult,...","[significant, delay, diagnostic, procedure, co..."
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive,"[I, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,...","[i, was, impressed, with, the, hospital, ’, s,...","[impressed, hospital, ’, cleanliness, infrastr...","[impress, hospit, ’, cleanli, infrastructur, o...","[impressed, hospital, ’, cleanliness, infrastr..."
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral,"[The, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,...","[the, medical, care, met, basic, expectations,...","[medical, care, met, basic, expectations, staf...","[medic, care, met, basic, expect, staff, perfo...","[medical, care, met, basic, expectation, staff..."
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive,"[The, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten...","[the, nursing, staff, was, consistently, atten...","[nursing, staff, consistently, attentive, comp...","[nurs, staff, consist, attent, compassion, reg...","[nursing, staff, consistently, attentive, comp..."
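
In the output above, tokens such as `billing`, `processing`, `confusing` and `impressed` come back unchanged, while plural nouns such as `delays` are reduced. That is consistent with `lemmatize()` defaulting to `pos='n'`. To lemmatize verbs and adjectives, each token needs a part-of-speech code. The helper below is a common, simplified mapping from Penn Treebank tags to the single-letter codes WordNet uses. The function name is illustrative and not part of NLTK.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

for tag in ["NN", "VBG", "JJR", "RB", "CD"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

The returned code can be passed as `lemmatizer.lemmatize(token, pos=code)`. The tags themselves would come from a tagger such as `nltk.pos_tag`, which requires its own model download and is not used elsewhere in this notebook.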


## Summary:

### Data Analysis Key Findings

*   **Data Loading and Inspection**: The `exit_interview` DataFrame was successfully loaded and its initial structure, including columns like `Patient_ID`, `Discharge Date`, `Exit Interview`, and `Customer Sentiment`, was confirmed.
*   **Text Preprocessing Introduction**: An introductory overview of text preprocessing and its importance in Natural Language Processing (NLP) was provided, outlining key steps such as tokenization, lowercasing, punctuation removal, stop word removal, stemming, and lemmatization.
*   **Tokenization**: The `Exit Interview` column was successfully tokenized into words using `nltk.word_tokenize()`, with a new column named `tokens` added to the DataFrame. An initial `LookupError` for the `punkt_tab` resource was resolved by downloading the necessary NLTK data.
*   **Lowercasing**: All tokens in the `tokens` column were converted to lowercase, and the results were stored in a new column called `tokens_lower`.
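*   **Punctuation Removal**: ASCII punctuation marks were removed from the lowercased tokens using `string.punctuation` and `str.maketrans`, with the cleaned tokens stored in a new column named `tokens_no_punct`. The typographic apostrophe (`’`) is not part of `string.punctuation` and therefore remains as a token.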
*   **Stop Word Removal**: Common stop words were successfully removed from the `tokens_no_punct` column, and the output was saved to a new column called `tokens_no_stopwords`.
*   **Stemming**: Stemming was applied to the `tokens_no_stopwords` column using NLTK's `PorterStemmer`, reducing words to their stems (e.g., `insurance` to `insur`), and the results were stored in a new column named `tokens_stemmed`.
*   **Lemmatization**: Lemmatization was applied to the `tokens_no_stopwords` column using NLTK's `WordNetLemmatizer`, and the results were stored in a new column named `tokens_lemmatized`. Plural nouns were reduced (e.g., `delays` to `delay`), while forms such as `confusing` and `impressed` were left unchanged because the lemmatizer was called without part-of-speech tags.

### Insights or Next Steps

*   All intended text preprocessing steps (tokenization, lowercasing, punctuation removal, stop word removal, stemming, and lemmatization) were completed. The `exit_interview` DataFrame now holds one new column per stage of text cleaning, ready for subsequent NLP tasks.
*   Two refinements are worth considering: passing part-of-speech tags to `WordNetLemmatizer.lemmatize` so that verb and adjective forms are also reduced, and handling the typographic apostrophe (`’`), which survives punctuation removal and is still present in the final token columns.
