**This notebook identifies and isolates the English-language reviews in a larger dataset of hotel reviews. It loads the review data, applies a language detection algorithm to each review text, and keeps only the reviews detected as English. It then applies basic text cleaning (removing HTML tags, stripping punctuation, and converting text to lowercase) to prepare the data for subsequent analysis. The filtered reviews, together with their cleaned text, are saved to a new CSV file for use in later stages of the project.**

In [1]:
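from langdetect import DetectorFactory, detect
import pandas as pd
import re
import string

# langdetect is non-deterministic by default; fixing the seed makes
# repeated runs assign the same language labels
DetectorFactory.seed = 0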

In [2]:
df = pd.read_csv('./data/nyc_hotels.csv')

In [3]:
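# Function to detect language
def detect_language(text):
    """Return the language code for `text`, or "unknown" if detection fails."""
    # Missing reviews cannot be classified
    if pd.isna(text):
        return "unknown"
    try:
        return detect(text)
    except Exception:
        # langdetect raises on empty or feature-less text (e.g. digits only)
        return "unknown"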

In [4]:
# Apply the language detection function to the review text
df['language'] = df['text'].apply(detect_language)

In [5]:
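# Keep only English reviews; .copy() avoids SettingWithCopyWarning when a column is added later
reviews_df = df[df['language'] == 'en'].copy()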

In [6]:
print(f'All languages shape {df.shape}')
print(f'En languages shape {reviews_df.shape}')

All languages shape (267057, 18)
En languages shape (206370, 18)
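
The two shapes above show how many rows the language filter removed. As a quick check, the arithmetic can be sketched as follows; the variable names are illustrative and the counts are copied from the printed output rather than recomputed from the data.

```python
# Row counts copied from the shape output above
total_reviews = 267_057
english_reviews = 206_370

# Rows labeled with a non-English language or "unknown"
dropped = total_reviews - english_reviews
english_share = english_reviews / total_reviews

print(f"Dropped (non-English or unknown): {dropped}")
print(f"English share: {english_share:.1%}")
```

This gives 60687 dropped rows, so roughly 77.3% of the reviews are kept.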


In [7]:
# Data Cleaning

In [9]:
# Function to clean text
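def basic_clean_text(text):
    """Strip HTML tags and ASCII punctuation from `text`, then lowercase it."""
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove punctuation (ASCII characters listed in string.punctuation)
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Convert to lowercase
    text = text.lower()

    return text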
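
To make the effect of the three cleaning steps concrete, here is a small sketch that runs the same logic on a few made-up strings. The sample strings are hypothetical and not taken from the dataset; the function body repeats the notebook's steps so the block runs on its own.

```python
import re
import string

def basic_clean_text(text):
    # Same three steps as the notebook's function
    text = re.sub(r'<.*?>', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()

# Hypothetical sample strings, not taken from the dataset
samples = [
    "<p>Great location!</p> Would stay again.",
    "Clean,quiet room",
    "It\u2019s a 5-star stay",
]

cleaned = [basic_clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two behaviors are worth keeping in mind. Punctuation is deleted rather than replaced by a space, so "Clean,quiet" becomes "cleanquiet". And `string.punctuation` covers ASCII characters only, so a typographic apostrophe (U+2019) survives the cleaning.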

In [10]:
# Apply the basic_clean_text function to the 'text' column
reviews_df.loc[:, 'cleaned_text'] = reviews_df['text'].apply(basic_clean_text)

# Display the first few rows of the cleaned dataset
reviews_df[['text', 'cleaned_text']].head()

   text                                               cleaned_text
0  Stayed in a king suite for 11 nights and yes i...  stayed in a king suite for 11 nights and yes i...
1  On every visit to NYC, the Hotel Beacon is the...  on every visit to nyc the hotel beacon is the ...
2  This is a great property in Midtown. We two di...  this is a great property in midtown we two dif...
3  The Andaz is a nice hotel in a central locatio...  the andaz is a nice hotel in a central locatio...
4  I have stayed at each of the US Andaz properti...  i have stayed at each of the us andaz properti...


In [11]:
reviews_df.to_csv('./data/eng_reviews.csv', index = False)