## Load Libraries and Dataset

In this section, we import the libraries needed for text preprocessing.
We then load the raw dataset from the raw data directory and confirm that:

- the file is read correctly
- the dataset has the expected number of rows and columns
- there are no missing values in the `Review` or `Rating` columns

These checks confirm that the initial data quality is sufficient before we apply any preprocessing operations.
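The checks listed above can be gathered into one small helper. This is a minimal sketch: the function name `check_raw_data` and the two-row sample CSV are hypothetical stand-ins, not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the raw CSV file.
SAMPLE_CSV = io.StringIO(
    "Review,Rating\n"
    "nice hotel expensive parking,4\n"
    "ok nothing special,2\n"
)

def check_raw_data(df, expected_columns=("Review", "Rating")):
    """Return basic quality indicators for the raw review data."""
    missing_cols = [c for c in expected_columns if c not in df.columns]
    present = [c for c in expected_columns if c in df.columns]
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "missing_columns": missing_cols,
        # Number of NaN values in each expected column that is present.
        "null_counts": df[present].isnull().sum().to_dict(),
    }

sample_df = pd.read_csv(SAMPLE_CSV)
report = check_raw_data(sample_df)
print(report)
```

Returning a dictionary rather than raising an error keeps the helper usable for exploration; assertions on the report can be added where a hard failure is preferred.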

In [20]:
import pandas as pd
import numpy as np
import nltk
import spacy
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word list (skipped if already up to date)
nltk.download('punkt', quiet=False)
nltk.download('punkt_tab', quiet=False)
nltk.download('stopwords', quiet=False)

# Load the raw TripAdvisor hotel reviews and inspect the dimensions
df = pd.read_csv('../data/raw_data/tripadvisor_hotel_reviews.csv')
df.shape

[nltk_data] Downloading package punkt to /Users/taanone1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/taanone1/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/taanone1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(20491, 2)

In [21]:
df.columns

Index(['Review', 'Rating'], dtype='object')

In [22]:
# Count missing values per column (a second .isnull() would always report 0)
df.isnull().sum()

Review    0
Rating    0
dtype: int64

## Basic Text Cleaning

We prepare the raw text for further NLP processing.
The cleaning operations include:

- converting text to lowercase
- removing punctuation
- removing digits
- stripping extra whitespace

These transformations reduce noise in the text and ensure consistent formatting.
Cleaning the text at this stage helps downstream methods, such as tokenization and topic modeling, to operate more effectively.
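The four steps can be illustrated on a single string with the standard `re` module. This is a sketch only: the function name `clean_text` and the sample sentence are illustrative, and collapsing internal whitespace with `\s+` is one reading of "stripping extra whitespace", stated here as an assumption.

```python
import re

def clean_text(text):
    """Apply the four cleaning steps to a single string."""
    text = str(text).lower()                   # lowercase
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation (keeps letters, digits, underscore)
    text = re.sub(r"\d+", "", text)            # drop digits
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace runs, trim the ends
    return text

example = "Nice rooms, not 4* experience!"
cleaned = clean_text(example)
print(cleaned)  # nice rooms not experience
```

Note that removing "4*" leaves two adjacent spaces, which is why the whitespace step runs last.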

In [23]:
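df['Review_clean'] = (
    df['Review']
        .astype(str)
        .str.lower()
        .str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation
        .str.replace(r'\d+', '', regex=True)      # remove digits
        .str.replace(r'\s+', ' ', regex=True)     # collapse repeated whitespace
        .str.strip()                              # trim leading/trailing whitespace
)

# Split each cleaned review into word tokens
df['Review_tokens'] = df['Review_clean'].apply(word_tokenize)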

df.head()

Unnamed: 0,Review,Rating,Review_clean,Review_tokens
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,"[nice, hotel, expensive, parking, got, good, d..."
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,"[ok, nothing, special, charge, diamond, member..."
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not experience hotel monaco seattl...,"[nice, rooms, not, experience, hotel, monaco, ..."
3,"unique, great stay, wonderful time hotel monac...",5,unique great stay wonderful time hotel monaco ...,"[unique, great, stay, wonderful, time, hotel, ..."
4,"great stay great stay, went seahawk game aweso...",5,great stay great stay went seahawk game awesom...,"[great, stay, great, stay, went, seahawk, game..."


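## Tokenization and Stop-Word Removal

Tokenization splits the cleaned review text into individual words (tokens).
The `Review_tokens` column was created in the previous cell with NLTK’s `word_tokenize` function, which is designed for English text.

Tokenization is an essential step because most NLP techniques operate at the token level rather than on full sentences.
In this step we also remove English stop words (very frequent words such as "the" or "and") using NLTK’s stop-word list. This list contains "not", so negations are dropped from the filtered tokens.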
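The stop-word filtering applied in this section's code cell can be sketched on a plain token list. The miniature `STOP_WORDS` set below is a hypothetical stand-in for NLTK's much longer English list, and the `keep` parameter for preserving negations is an optional design choice, not something the notebook does.

```python
# Hypothetical miniature stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "was", "not", "and", "is", "very"}
# Negations one might want to preserve (an assumption, adjust as needed).
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=frozenset()):
    """Drop tokens found in stop_words, except those listed in keep."""
    return [t for t in tokens if t not in stop_words or t in keep]

tokens = ["the", "room", "was", "not", "clean"]
plain = remove_stop_words(tokens)
with_negation = remove_stop_words(tokens, keep=NEGATIONS)
print(plain)          # ['room', 'clean']
print(with_negation)  # ['room', 'not', 'clean']
```

The two results show how dropping "not" can reverse the apparent meaning of a review, which may matter more for sentiment-oriented analyses than for topic modeling.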

In [24]:
# Build the stop-word set once so that membership tests are fast
stop_words = set(nltk.corpus.stopwords.words('english'))

# Keep only the tokens that are not stop words
df['Review_filtered'] = df['Review_tokens'].apply(
    lambda tokens: [word for word in tokens if word not in stop_words]
)
df.head()

Unnamed: 0,Review,Rating,Review_clean,Review_tokens,Review_filtered
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,"[nice, hotel, expensive, parking, got, good, d...","[nice, hotel, expensive, parking, got, good, d..."
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,"[ok, nothing, special, charge, diamond, member...","[ok, nothing, special, charge, diamond, member..."
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not experience hotel monaco seattl...,"[nice, rooms, not, experience, hotel, monaco, ...","[nice, rooms, experience, hotel, monaco, seatt..."
3,"unique, great stay, wonderful time hotel monac...",5,unique great stay wonderful time hotel monaco ...,"[unique, great, stay, wonderful, time, hotel, ...","[unique, great, stay, wonderful, time, hotel, ..."
4,"great stay great stay, went seahawk game aweso...",5,great stay great stay went seahawk game awesom...,"[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, went, seahawk, game..."


## Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma).

**For example:**

- “rooms” → “room”
- “running” → “run”
- “better” → “good”

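We use spaCy’s `en_core_web_sm` model, which lemmatizes each token using its predicted part of speech.
Because the input here is the stop-word-filtered tokens joined back into a string, the tagger sees less context than it would on the original sentences.
Lemmatization reduces vocabulary size and groups different word forms under one lemma, which typically helps topic modeling.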

In [25]:
import spacy
!python -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm')

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [26]:
def lemmatize_tokens(tokens):
    """Join the tokens into one string, run the spaCy pipeline, and return the lemmas."""
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc]

In [29]:
df["Review_lemmatized"] = df["Review_filtered"].apply(lemmatize_tokens)
df.head()


Unnamed: 0,Review,Rating,Review_clean,Review_tokens,Review_filtered,Review_lemmatized
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,"[nice, hotel, expensive, parking, got, good, d...","[nice, hotel, expensive, parking, got, good, d...","[nice, hotel, expensive, parking, get, good, d..."
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,"[ok, nothing, special, charge, diamond, member...","[ok, nothing, special, charge, diamond, member...","[ok, nothing, special, charge, diamond, member..."
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not experience hotel monaco seattl...,"[nice, rooms, not, experience, hotel, monaco, ...","[nice, rooms, experience, hotel, monaco, seatt...","[nice, room, experience, hotel, monaco, seattl..."
3,"unique, great stay, wonderful time hotel monac...",5,unique great stay wonderful time hotel monaco ...,"[unique, great, stay, wonderful, time, hotel, ...","[unique, great, stay, wonderful, time, hotel, ...","[unique, great, stay, wonderful, time, hotel, ..."
4,"great stay great stay, went seahawk game aweso...",5,great stay great stay went seahawk game awesom...,"[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, go, seahawk, game, ..."
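Calling `nlp` once per review, as `lemmatize_tokens` does, can be slow on roughly 20,000 reviews. One possible alternative is spaCy's `nlp.pipe`, which processes texts in batches. The sketch below is an assumption about how that could look: the helper name `lemmatize_batch` and the batch size are illustrative, and the pipeline is passed in as an argument so the function does not depend on a global variable.

```python
def lemmatize_batch(token_lists, nlp, batch_size=256):
    """Lemmatize many token lists in one streaming pass through nlp.pipe."""
    # Join each token list back into a string, as lemmatize_tokens does.
    texts = (" ".join(tokens) for tokens in token_lists)
    return [
        [token.lemma_ for token in doc]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]

# Possible usage with the pipeline loaded earlier (requires the installed model):
# lemmas = lemmatize_batch(df["Review_filtered"], nlp)
```

The result is a list of lemma lists in the same order as the input, so it can be compared directly against the output of the per-row version on a small sample before switching.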


## Rebuilding Cleaned Review Text

After lemmatization, we convert the list of lemmas back into a single text string.

In [30]:
# Join each list of lemmas into a single space-separated string
df["Cleaned_Review"] = df["Review_lemmatized"].apply(' '.join)
df.head()

Unnamed: 0,Review,Rating,Review_clean,Review_tokens,Review_filtered,Review_lemmatized,Cleaned_Review
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,"[nice, hotel, expensive, parking, got, good, d...","[nice, hotel, expensive, parking, got, good, d...","[nice, hotel, expensive, parking, get, good, d...",nice hotel expensive parking get good deal sta...
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,"[ok, nothing, special, charge, diamond, member...","[ok, nothing, special, charge, diamond, member...","[ok, nothing, special, charge, diamond, member...",ok nothing special charge diamond member hilto...
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not experience hotel monaco seattl...,"[nice, rooms, not, experience, hotel, monaco, ...","[nice, rooms, experience, hotel, monaco, seatt...","[nice, room, experience, hotel, monaco, seattl...",nice room experience hotel monaco seattle good...
3,"unique, great stay, wonderful time hotel monac...",5,unique great stay wonderful time hotel monaco ...,"[unique, great, stay, wonderful, time, hotel, ...","[unique, great, stay, wonderful, time, hotel, ...","[unique, great, stay, wonderful, time, hotel, ...",unique great stay wonderful time hotel monaco ...
4,"great stay great stay, went seahawk game aweso...",5,great stay great stay went seahawk game awesom...,"[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, go, seahawk, game, ...",great stay great stay go seahawk game awesome ...


## Remove Extremely Short Reviews

We calculate the number of words in each cleaned review and remove all observations with fewer than three words.

Very short reviews carry too little information for modeling and may introduce noise, so filtering them out makes downstream analyses more reliable.
In this dataset, no review falls below the threshold: the row count remains 20,491 after filtering.
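To make the effect of the filter visible, the row count before and after can be compared. The following is a minimal sketch on toy data; the helper name `drop_short_reviews` and the three sample reviews are hypothetical.

```python
import pandas as pd

def drop_short_reviews(df, text_col="Cleaned_Review", min_words=3):
    """Return the rows with at least min_words words, plus the number dropped."""
    word_counts = df[text_col].str.split().str.len()
    kept = df[word_counts >= min_words].copy()
    return kept, len(df) - len(kept)

# Hypothetical toy data: the second review has only two words.
toy = pd.DataFrame(
    {"Cleaned_Review": ["nice hotel good deal", "great stay", "ok nothing special"]}
)
kept, n_dropped = drop_short_reviews(toy)
print(f"kept {len(kept)} rows, dropped {n_dropped}")
```

Reporting the number of dropped rows makes it easy to notice when a threshold removes nothing, or removes far more than expected.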

In [31]:
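# Count whitespace-separated words in each cleaned review
df["Review_word_count"] = df["Cleaned_Review"].str.split().str.len()
# Keep only reviews with at least three words
df = df[df["Review_word_count"] >= 3]
df.shape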

(20491, 8)

In [32]:
df.to_csv("../data/processed_data/cleaned_reviews.csv", index=False)
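One point to keep in mind when reading the saved file back: CSV stores list-valued columns such as `Review_tokens` as text, so they come back as strings rather than lists. The sketch below shows one way to restore them with `ast.literal_eval`; the one-row frame and the in-memory buffer are hypothetical stand-ins for the real file.

```python
import ast
import io
import pandas as pd

# Hypothetical one-row frame with a list-valued column, like Review_tokens.
frame = pd.DataFrame({"Review_tokens": [["nice", "hotel"]]})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # the list is written as the text "['nice', 'hotel']"

buffer.seek(0)
as_text = pd.read_csv(buffer)      # default read: the column holds strings

buffer.seek(0)
# literal_eval parses the Python-literal text back into a list.
as_list = pd.read_csv(buffer, converters={"Review_tokens": ast.literal_eval})

print(type(as_text.loc[0, "Review_tokens"]).__name__)  # str
print(as_list.loc[0, "Review_tokens"])                 # ['nice', 'hotel']
```

If only `Cleaned_Review` is needed downstream, dropping the list columns before saving avoids the issue entirely.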