In [13]:
from google.colab import drive
# Mount Google Drive to access files stored in the user's Drive.
drive.mount('/content/drive')

import os
# Change the current working directory to the specified path within Google Drive.
# This ensures that subsequent file operations (e.g., reading CSVs) are performed relative to this directory.
os.chdir('/content/drive/MyDrive/Academics/Visiting Lectures/2026-H1/202601-SDP-AU/Session-12-Natural-Language-Processing')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [14]:
# Import the pandas library for data manipulation and analysis.
import pandas as pd

# Load the 'hospital_exit_interviews.csv' dataset into a pandas DataFrame.
# The dataset is located in the 'Data' subdirectory.
exit_interview = pd.read_csv('Data/hospital_exit_interviews.csv')

# Display the first few rows of the DataFrame to inspect its structure and content.
exit_interview.head()

Unnamed: 0,Patient_ID,Discharge Date,Exit Interview,Customer Sentiment
0,P00001,15-04-2024,Billing and insurance processing was confusing...,Negative
1,P00002,16-11-2024,There were significant delays in diagnostic pr...,Negative
2,P00003,18-07-2024,I was impressed with the hospital’s cleanlines...,Positive
3,P00004,13-03-2024,"The medical care met basic expectations, and t...",Neutral
4,P00005,09-11-2024,The nursing staff was consistently attentive a...,Positive


## **Load Libraries and Preprocessing Setup**

In [15]:
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Download the 'stopwords' corpus if not already downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

print("Necessary libraries imported and NLTK stopwords set up.")

Necessary libraries imported and NLTK stopwords set up.


## **Text Preprocessing**



Clean the 'Exit Interview' text by defining a function that converts text to lowercase and removes punctuation, numbers, and extra spaces, then apply that function to create a new `cleaned_interview` column.



In [16]:
import string

def preprocess_text(text):
    """
    Cleans a given text string by converting it to lowercase, removing punctuation,
    numbers, and extra spaces.

    Args:
        text (str): The input text string to be cleaned.

    Returns:
        str: The cleaned text string.
    """
    # Convert to lowercase to ensure consistency
    text = text.lower()
    # Remove ASCII punctuation (string.punctuation); typographic characters such as ’ are left in place
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    # Remove all numeric characters
    text = re.sub(r'\d+', '', text)
    # Replace multiple spaces with a single space and strip leading/trailing spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the preprocessing function to the 'Exit Interview' column
exit_interview['cleaned_interview'] = exit_interview['Exit Interview'].apply(preprocess_text)

# Display the first few rows with the new cleaned column to verify the preprocessing
print(exit_interview[['Exit Interview', 'cleaned_interview']].head())

                                      Exit Interview  \
0  Billing and insurance processing was confusing...   
1  There were significant delays in diagnostic pr...   
2  I was impressed with the hospital’s cleanlines...   
3  The medical care met basic expectations, and t...   
4  The nursing staff was consistently attentive a...   

                                   cleaned_interview  
0  billing and insurance processing was confusing...  
1  there were significant delays in diagnostic pr...  
2  i was impressed with the hospital’s cleanlines...  
3  the medical care met basic expectations and th...  
4  the nursing staff was consistently attentive a...  
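
The cleaned output above still contains the typographic apostrophe in "hospital’s", because `string.punctuation` covers only ASCII characters. A minimal sketch of a Unicode-aware variant follows, assuming it is acceptable for this dataset to drop every character whose Unicode category is punctuation or symbol. The name `preprocess_text_unicode` is hypothetical and not part of the notebook.

```python
import re
import unicodedata

def preprocess_text_unicode(text):
    """Lowercase, then drop punctuation/symbols (ASCII and Unicode), digits, and extra spaces."""
    text = text.lower()
    # Unicode categories starting with "P" are punctuation and "S" are symbols;
    # together they cover string.punctuation plus characters such as ’ and “.
    text = "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S")
    )
    # Remove digits, then collapse runs of whitespace.
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_text_unicode("I was impressed with the hospital’s cleanliness, 100%!"))
```

Swapping this in would change the tokens seen by the vectorizer (for example "hospital’s" becomes "hospitals"), so the feature count reported later may differ.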


## **Vectorize Text Data**
Apply TF-IDF vectorization to convert the cleaned 'Exit Interview' text into numerical features suitable for machine learning models. This step will transform the text into a sparse matrix of TF-IDF features.


In [17]:
# Initialize TfidfVectorizer with English stopwords to convert text into numerical features.
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
# Fit the vectorizer to the 'cleaned_interview' column and transform the text data into a TF-IDF matrix.
X = tfidf_vectorizer.fit_transform(exit_interview['cleaned_interview'])

print("TF-IDF vectorization completed. Shape of the feature matrix (X):")
print(X.shape)

TF-IDF vectorization completed. Shape of the feature matrix (X):
(10000, 133)
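
To make the values stored in `X` less opaque, the sketch below reproduces the weighting in plain Python, assuming the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, raw term counts). The three toy documents are invented for illustration and do not come from the dataset, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Toy documents (hypothetical, already cleaned and stopword-free).
docs = [
    "staff friendly helpful",
    "billing confusing slow",
    "staff slow",
]

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        # L2-normalize each row so every document vector has unit length.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

Terms that occur in fewer documents receive a larger weight, so in the first toy document "friendly" outweighs "staff", which also appears in the third.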


Next, encode the target variable 'Customer Sentiment' as integers with `LabelEncoder`, because the classifier trained below expects numeric class labels.



In [18]:
# Initialize LabelEncoder to convert categorical sentiment labels into numerical format.
label_encoder = LabelEncoder()
# Fit the encoder to the 'Customer Sentiment' column and transform the labels.
y = label_encoder.fit_transform(exit_interview['Customer Sentiment'])

print("Customer Sentiment labels encoded. First 5 encoded values:")
print(y[:5])
print("Original classes and their encoded values:")
# Print the mapping of original class names to their encoded numerical values.
for i, class_name in enumerate(label_encoder.classes_):
    print(f"{class_name}: {i}")

Customer Sentiment labels encoded. First 5 encoded values:
[0 0 2 1 2]
Original classes and their encoded values:
Negative: 0
Neutral: 1
Positive: 2
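
The mapping above follows from `LabelEncoder` sorting the unique class names before numbering them. A minimal pure-Python sketch of the same idea, using the five labels shown in the earlier `head()` output, also shows how predictions could be decoded back to names. The variable names are hypothetical.

```python
# First five sentiment labels from the head() output earlier in the notebook.
labels = ["Negative", "Negative", "Positive", "Neutral", "Positive"]

# Sort the unique class names, then number them in that order.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}
to_name = {code: name for name, code in to_code.items()}

encoded = [to_code[name] for name in labels]
decoded = [to_name[code] for code in encoded]

print(encoded)
print(decoded)
```

With the fitted encoder in the notebook, `label_encoder.inverse_transform(...)` plays the role of `to_name` here.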


## **Sentiment Analysis Model Development**

In [19]:
# Split the feature matrix X and the encoded target labels y into training and testing sets.
# test_size=0.2 means 20% of the data will be used for testing, and 80% for training.
# random_state=42 ensures reproducibility of the split.
# stratify=y ensures that the proportion of target labels is the same in both training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Data split into training and testing sets.")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Data split into training and testing sets.
X_train shape: (8000, 133)
X_test shape: (2000, 133)
y_train shape: (8000,)
y_test shape: (2000,)


In [20]:
xgb_model = XGBClassifier(eval_metric='mlogloss', random_state=42)
xgb_model.fit(X_train, y_train)

print("XGBoost Classifier trained successfully.")

XGBoost Classifier trained successfully.


**After training the XGBoost classifier, the next task is to evaluate its performance.** This involves making predictions on the test set (`X_test`) and then calculating accuracy, precision, recall, and F1-score.

In [21]:
y_pred = xgb_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# average='weighted' averages the per-class scores, weighting each class by its support (true count) in y_test.
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Optional: Print a more detailed classification report
# print("\nClassification Report:")
# print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Model Evaluation Metrics:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-Score: 1.0000
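
The `average='weighted'` setting used above combines per-class scores in proportion to how many true samples each class has. Below is a minimal pure-Python sketch of that calculation for precision, with the unweighted macro average shown for contrast. The toy labels are invented and are not the notebook's predictions, and the helper names are hypothetical.

```python
from collections import Counter

def per_class_precision(y_true, y_pred):
    """Precision for each class: correct predictions / all predictions of that class."""
    classes = sorted(set(y_true) | set(y_pred))
    result = {}
    for c in classes:
        predicted_c = sum(1 for p in y_pred if p == c)
        correct_c = sum(1 for t, p in zip(y_true, y_pred) if t == p == c)
        result[c] = correct_c / predicted_c if predicted_c else 0.0
    return result

def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    prec = per_class_precision(y_true, y_pred)
    return sum(prec[c] * support[c] for c in support) / len(y_true)

def macro_precision(y_true, y_pred):
    """Plain mean of per-class precision, ignoring class sizes."""
    prec = per_class_precision(y_true, y_pred)
    return sum(prec.values()) / len(prec)

# Toy example: class 0 is twice as frequent as classes 1 and 2.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

weighted = weighted_precision(y_true, y_pred)
macro = macro_precision(y_true, y_pred)
print(f"weighted: {weighted:.4f}, macro: {macro:.4f}")
```

When every prediction is correct, as in the results above, all per-class scores are 1.0 and the two averages coincide, so the choice of averaging only matters once errors appear.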


## Summary:

### Data Analysis Key Findings
*   Text data from the 'Exit Interview' column was preprocessed by converting it to lowercase, removing punctuation and numbers, and standardizing whitespace, creating a `cleaned_interview` column.
*   TF-IDF vectorization transformed the cleaned text into a feature matrix `X` of shape (10000, 133), that is, 10,000 samples and 133 unique terms remaining after English stopwords were removed.
*   The 'Customer Sentiment' target variable was encoded into numerical labels: 'Negative' as 0, 'Neutral' as 1, and 'Positive' as 2.
*   The dataset was split into training and testing sets with an 80/20 ratio, resulting in `X_train` and `y_train` shapes of (8000, 133) and (8000,) respectively, and `X_test` and `y_test` shapes of (2000, 133) and (2000,).
*   An XGBoost Classifier was successfully trained on the vectorized and split data.
*   The model achieved perfect performance metrics on the test set: Accuracy: 1.0000, Precision: 1.0000, Recall: 1.0000, and F1-Score: 1.0000.

### Insights or Next Steps
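*   The perfect evaluation scores (1.00 for Accuracy, Precision, Recall, and F1-Score) on a held-out test set are unusual for free-text feedback. They point less toward overfitting, which would normally appear as a gap between training and test scores, and more toward data leakage or a highly separable dataset (for example, templated or repeated responses) that might not represent real-world complexity. Note also that the TF-IDF vectorizer was fitted on all 10,000 rows before the train/test split, so test-set term statistics contributed to the features. It's crucial to investigate the dataset for these issues before trusting the model.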
*   Further steps should include cross-validation, hyperparameter tuning of the XGBoost classifier, and potentially analyzing feature importance to understand the drivers of sentiment, especially given the perfect scores.
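
One concrete starting point for the leakage investigation is to measure how many test-set texts also occur verbatim in the training set. With 10,000 interviews but only 133 vocabulary terms, repeated or templated responses are plausible, although nothing in the notebook confirms this. The sketch below uses invented toy texts, and `overlap_fraction` is a hypothetical helper.

```python
def overlap_fraction(train_texts, test_texts):
    """Fraction of test texts that also occur verbatim in the training texts."""
    if not test_texts:
        return 0.0
    train_set = set(train_texts)
    shared = sum(1 for text in test_texts if text in train_set)
    return shared / len(test_texts)

# Toy data (hypothetical): one of the two test texts duplicates a training text.
train = ["staff was kind", "billing was confusing", "staff was kind"]
test = ["staff was kind", "food was cold"]

print(overlap_fraction(train, test))
```

In the notebook, the same check could be run by splitting `exit_interview['cleaned_interview']` with the same `test_size`, `random_state`, and `stratify` arguments and passing the two resulting columns as lists. A value close to 1.0 would suggest the test set mostly repeats training examples.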
