# Data Preprocessing for TripAdvisor Review Dataset

The [TripAdvisor Review Rating](https://huggingface.co/datasets/jniimi/tripadvisor-review-rating) dataset contains TripAdvisor hotel reviews together with per-aspect ratings and review metadata. This notebook preprocesses the data and prepares it for analysis and model training.

The dataset contains the following columns:
- `hotel_id`: The ID of the hotel.
- `user_id`: The ID of the user who wrote the review.
- `title`: The title of the review.
- `text`: The text of the review.
- `overall`: The overall rating given by the user.
- `cleanliness`: The cleanliness rating given by the user.
- `value`: The value rating given by the user.
- `location`: The location rating given by the user.
- `rooms`: The rooms rating given by the user.
- `sleep_quality`: The sleep quality rating given by the user.
- `stay_year`: The year of the stay.
- `post_date`: The date the review was posted.
- `freq`: The frequency of the review.
- `review`: The full review text: the title and the text joined by a newline.
- `char`: The character count of the review.
- `lang`: The language of the review, stored as a label such as `__label__en`.

In [23]:
# List of packages to install
%pip install datasets
%pip install pandas
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [5]:
from datasets import load_dataset
import pandas as pd

from sklearn.model_selection import train_test_split

In [6]:
ds = load_dataset("jniimi/tripadvisor-review-rating")
raw_data = pd.DataFrame(ds['train'])

display(raw_data.head())

print(raw_data.shape)
print(raw_data.columns)

| | hotel_id | user_id | title | text | overall | cleanliness | value | location | rooms | sleep_quality | stay_year | post_date | freq | review | char | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 127781101 | 2262DCBFC351F42A9DD30AC8BAD24686 | Really excellent Hilton | Stayed here on business trips and the hotel is... | 5.0 | 4.0 | 5.0 | 4.0 | 5.0 | 4.0 | 2012 | 2012-04-13 | 1 | Really excellent Hilton\nStayed here on busine... | 204 | __label__en |
| 1 | 137380592 | 8477E11DABF4D6743885E401BB4C8CCF | Exceptional service and comfort | Spent two nights here for a wedding in Brookly... | 5.0 | 5.0 | 4.0 | 5.0 | 4.0 | 5.0 | 2012 | 2012-08-16 | 1 | Exceptional service and comfort\nSpent two nig... | 621 | __label__en |
| 2 | 129673371 | 483A193B7113ADFFD5CE30849564F69C | Nice room and five star service | Great place for a 3-night stay. Our king room ... | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | 4.0 | 2012 | 2012-05-09 | 1 | Nice room and five star service\nGreat place f... | 1259 | __label__en |
| 3 | 129006626 | E5A63DD7239A7057746D4644A5C986EB | BRILLIANT hotel, my #1 Chicago pick for busine... | This is my favorite hotel in Chicago, and I've... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 2012 | 2012-04-28 | 1 | BRILLIANT hotel, my #1 Chicago pick for busine... | 2242 | __label__en |
| 4 | 139168159 | CBFE281C9386225267BC52518836A6C2 | Convenient and comfortable | BEST. BREAKFAST. EVER. Couldn't have been happ... | 5.0 | 5.0 | 4.0 | 5.0 | 4.0 | 5.0 | 2012 | 2012-09-02 | 1 | Convenient and comfortable\nBEST. BREAKFAST. E... | 511 | __label__en |


(201295, 16)
Index(['hotel_id', 'user_id', 'title', 'text', 'overall', 'cleanliness',
       'value', 'location', 'rooms', 'sleep_quality', 'stay_year', 'post_date',
       'freq', 'review', 'char', 'lang'],
      dtype='object')
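
The preview suggests that `review` is the title and text joined by a newline and that `lang` carries a `__label__` prefix, but five rows are not proof. The following is a minimal sketch of how these properties, along with missing values and duplicates, could be checked before cleaning. The helper `inspect_reviews` and the sample rows are hypothetical and made up for illustration; on the real data one would pass `raw_data` instead.

```python
import pandas as pd

def inspect_reviews(frame: pd.DataFrame) -> dict:
    """Summarise properties that a head() preview only hints at."""
    combined = frame["title"] + "\n" + frame["text"]
    return {
        # Share of rows where `review` equals title + newline + text.
        "review_matches_title_text": float((frame["review"] == combined).mean()),
        # Language codes with the `__label__` prefix stripped.
        "languages": frame["lang"]
        .str.replace("__label__", "", regex=False)
        .value_counts()
        .to_dict(),
        # Rows that dropna() and drop_duplicates() would remove.
        "rows_with_missing": int(frame.isna().any(axis=1).sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Made-up sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "title": ["Great stay", "Noisy room", "Great stay"],
    "text": ["Clean and quiet.", "Thin walls.", "Clean and quiet."],
    "review": [
        "Great stay\nClean and quiet.",
        "Noisy room\nThin walls.",
        "Great stay\nClean and quiet.",
    ],
    "lang": ["__label__en", "__label__en", "__label__en"],
})

summary = inspect_reviews(sample)
print(summary)
```

Running the same helper on `raw_data` would show how many rows the later `dropna()` and `drop_duplicates()` calls can be expected to remove.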


# Preprocessing

In [7]:
# Drop metadata columns that are not needed downstream
df = raw_data.drop(columns=['stay_year', 'post_date', 'freq', 'lang'])

# Drop rows with missing values in any remaining column
df = df.dropna()

# Drop the duplicates
df = df.drop_duplicates()

# Shuffle the data
df = df.sample(frac=1).reset_index(drop=True)

# Hold out 20% as the test set, then 20% of the remainder as the validation set (about 64/16/20 overall)
train_df, test_df = train_test_split(df, test_size=0.2)

train_df, val_df = train_test_split(train_df, test_size=0.2)

print(train_df.shape)
print(val_df.shape)
print(test_df.shape)

(128828, 12)
(32208, 12)
(40259, 12)
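
The split above is not seeded, so the rows in each partition change from run to run even though the shapes stay the same. As a hedged sketch, the same two-stage split could be made reproducible by passing `random_state` to `train_test_split`. The helper name `split_frame`, the seed value 42, and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_frame(frame: pd.DataFrame, test_size: float = 0.2,
                val_size: float = 0.2, seed: int = 42):
    """Two-stage split: hold out a test set, then carve validation from the rest."""
    # train_test_split shuffles by default, so no separate shuffle is needed.
    train_val, test = train_test_split(frame, test_size=test_size, random_state=seed)
    train, val = train_test_split(train_val, test_size=val_size, random_state=seed)
    return train, val, test

# Toy frame standing in for the cleaned DataFrame `df`.
toy = pd.DataFrame({"overall": [float(i % 5 + 1) for i in range(100)]})

train, val, test = split_frame(toy)
print(len(train), len(val), len(test))

# The same seed yields the same partitions on a second call.
train_again, _, _ = split_frame(toy)
```

With a fixed seed the three partitions stay disjoint and identical across runs, which makes later evaluation numbers comparable.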
