# Project: NLP to Analyze Sentiments of Consumer Reviews of Hotels

In this project, we apply sentiment analysis, a natural language processing (NLP) technique, to hotel reviews. The goal is to tell whether a review is positive or negative from its text alone, without looking at the rating. We will use the following libraries:
- NLTK: a widely used Python library for NLP
- Gensim: a topic-modelling and vector-space-modelling toolkit
- Scikit-learn: a widely used Python machine learning library

The dataset is from Kaggle and contains 515,000 customer reviews and scores for 1,493 luxury hotels across Europe, as listed on Booking.com.
The dataset can be found here: https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe/version/1

In [63]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Part 1: Load the Data and Select

In [None]:
# load the data into a DataFrame
reviews = pd.read_csv('/Users/Amir/PythonProjects/HotelRRDataset/Hotel_Reviews.csv')
reviews.info()

# concatenate the negative and positive reviews into a single column (on Booking.com there are two
# separate fields where consumers write negative and positive comments)
reviews["Review"] = reviews["Negative_Review"] + reviews["Positive_Review"]

# we define a bad review as one with an overall rating lower than 5. We label the reviews using the actual
# ratings so that we can compare them with our predictions and evaluate the performance
reviews["Bad_Review?"] = reviews["Reviewer_Score"].apply(lambda x: 1 if x < 5 else 0)
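
The labelling rule can be checked on a handful of made-up scores before running it on the full dataset. The sketch below is only an illustration: the `toy` frame and its values are hypothetical, and the comparison is written as a vectorized expression, which gives the same labels as the `apply` call above. Note that a score of exactly 5.0 is labelled 0 (not bad).

```python
import pandas as pd

# toy scores standing in for the Reviewer_Score column (hypothetical values)
toy = pd.DataFrame({"Reviewer_Score": [2.5, 4.9, 5.0, 7.1, 9.6, 10.0]})

# same rule as in the notebook: strictly below 5 is a bad review
toy["Bad_Review?"] = (toy["Reviewer_Score"] < 5).astype(int)

# share of each class; a strong imbalance matters when evaluating a classifier
class_share = toy["Bad_Review?"].value_counts(normalize=True)

print(toy)
print(class_share)
```

Running `value_counts(normalize=True)` on the real `reviews["Bad_Review?"]` column would show how balanced the two classes are.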

In [65]:
# Select only the relevant data for Sentiment Analysis
reviews_SA = reviews[["Review","Bad_Review?"]]
reviews_SA.head(10)

Unnamed: 0,Review,Bad_Review?
0,I am so angry that i made this post available...,1
1,No Negative No real complaints the hotel was g...,0
2,Rooms are nice but for elderly a bit difficul...,0
3,My room was dirty and I was afraid to walk ba...,1
4,You When I booked with your company on line y...,0
5,Backyard of the hotel is total mess shouldn t...,0
6,Cleaner did not change our sheet and duvet ev...,1
7,Apart from the price for the brekfast Everyth...,0
8,Even though the pictures show very clean room...,0
9,The aircondition makes so much noise and its ...,0


In [None]:
# Sample some data to increase the computation speed
# sample 10% of the data 
reviews_SA = reviews_SA.sample(frac = 0.1, replace = False, random_state=42)
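
A plain random sample does not guarantee that the share of bad reviews stays the same as in the full data. If that matters, one possible variant (not what the cell above does) is to sample each class separately with pandas' `groupby(...).sample(...)`. The toy frame below is hypothetical and only demonstrates the mechanics.

```python
import pandas as pd

# toy frame: 10 bad reviews and 90 good ones (hypothetical data)
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(100)],
    "Bad_Review?": [1] * 10 + [0] * 90,
})

# draw 10% from each class separately so the class ratio is preserved
stratified = toy.groupby("Bad_Review?").sample(frac=0.1, random_state=42)

print(stratified["Bad_Review?"].value_counts())
```

With 10 bad and 90 good rows, this draws 1 bad and 9 good reviews, so the 10% share of bad reviews is kept.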

In [66]:
# Clean-up
# per the data spec, if the reviewer leaves no negative or no positive comment,
# the associated column contains 'No Negative' or 'No Positive'. We need to remove these placeholders
# turn off the chained-assignment warning
pd.options.mode.chained_assignment = None
reviews_SA["Review"] = reviews_SA["Review"].apply(lambda x: x.replace('No Negative','').replace('No Positive',''))
# see the second row (index 1) for an example
reviews_SA.head(10)

Unnamed: 0,Review,Bad_Review?
0,I am so angry that i made this post available...,1
1,No real complaints the hotel was great great ...,0
2,Rooms are nice but for elderly a bit difficul...,0
3,My room was dirty and I was afraid to walk ba...,1
4,You When I booked with your company on line y...,0
5,Backyard of the hotel is total mess shouldn t...,0
6,Cleaner did not change our sheet and duvet ev...,1
7,Apart from the price for the brekfast Everyth...,0
8,Even though the pictures show very clean room...,0
9,The aircondition makes so much noise and its ...,0
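
The effect of the clean-up is easiest to see on two short strings. The sketch below is a minimal illustration: `strip_placeholders` and the example reviews are hypothetical, and the extra `.strip()` call is an addition that the cell above does not make (there, the space that followed the placeholder is left in place).

```python
def strip_placeholders(review):
    """Remove the 'No Negative' / 'No Positive' fillers and trim whitespace."""
    cleaned = review.replace("No Negative", "").replace("No Positive", "")
    # trimming is an extra step, not part of the original clean-up
    return cleaned.strip()

examples = [
    "No Negative No real complaints the hotel was great",
    " My room was dirty No Positive",
]
cleaned = [strip_placeholders(r) for r in examples]
print(cleaned)
```

One caveat of this approach: a review that happens to contain the phrase "No Negative" or "No Positive" in its own text would lose that phrase as well.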


## Part 2: Text Preprocessing
Text is among the least structured forms of data: it contains many kinds of noise and cannot be analyzed reliably without pre-processing. The process of cleaning and standardizing text so that it is noise-free and ready for analysis is known as text preprocessing.

In [74]:
# Import required libraries 
import nltk
nltk.download('stopwords')
from nltk.corpus import wordnet
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /Users/Amir/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Part 2A: Lower the Text

In [68]:
# lower text
def text_low(text):
    text = text.lower()
    return text 
reviews_SA["Review"] = reviews_SA["Review"].apply(lambda x: text_low(x))
reviews_SA.head(3)

Unnamed: 0,Review,Bad_Review?
0,i am so angry that i made this post available...,1
1,no real complaints the hotel was great great ...,0
2,rooms are nice but for elderly a bit difficul...,0


## Part 2B: Tokenize
Tokenize the text (split it into words) and remove punctuation.

In [69]:
# tokenize text and remove punctuation
def text_tok(text):
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    return text

reviews_SA["Review"] = reviews_SA["Review"].apply(lambda x: text_tok(x))
reviews_SA.head(3)


Unnamed: 0,Review,Bad_Review?
0,"[, i, am, so, angry, that, i, made, this, post...",1
1,"[, no, real, complaints, the, hotel, was, grea...",0
2,"[, rooms, are, nice, but, for, elderly, a, bit...",0
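
The empty first token in the output above comes from splitting on a single space: a leading space, or two spaces in a row, produces an empty string. The sketch below reproduces this on a made-up sentence and, as one possible alternative, shows that `str.split()` without arguments avoids the empty tokens. The `sample` string is hypothetical.

```python
import string

def text_tok(text):
    # split on single spaces, then trim punctuation from both ends of each piece
    return [word.strip(string.punctuation) for word in text.split(" ")]

sample = " no real complaints, the hotel  was great!"

tokens = text_tok(sample)
# split() with no argument collapses runs of whitespace, so no empty tokens appear
tokens_ws = [word.strip(string.punctuation) for word in sample.split()]

print(tokens)
print(tokens_ws)
```

Because `strip` only trims the ends of each piece, punctuation inside a word is left untouched. The empty tokens are harmless as long as a later step removes them.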


## Part 2C: Noise Removal

In [75]:
# build the stop-word set once; rebuilding the list for every review and
# scanning it for every token makes this step very slow on a large sample
stop = set(stopwords.words('english'))

def text_noise(text):
    # remove words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove uninformative stop words like 'the', 'a', 'this', etc.
    text = [x for x in text if x not in stop]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    return text

reviews_SA["Review"] = reviews_SA["Review"].apply(lambda x: text_noise(x))
reviews_SA.head(3)
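
The three filters can be traced on a short token list. The sketch below is an illustration under stated assumptions: `STOP` is a tiny hand-picked stop list used only so the example runs without downloading NLTK data, and the token list is made up. The notebook itself uses NLTK's English stop-word list.

```python
# tiny hand-picked stop list for illustration only (assumption, not NLTK's list)
STOP = {"i", "am", "so", "that", "this", "the", "a", "was"}

def text_noise(tokens, stop=STOP):
    # drop tokens that contain a digit
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    # drop stop words; membership tests on a set are fast
    tokens = [t for t in tokens if t not in stop]
    # drop empty tokens left over from tokenization
    return [t for t in tokens if len(t) > 0]

tokens = ["", "i", "am", "so", "angry", "that", "room", "101",
          "was", "dirty", "2nd", "night"]
filtered = text_noise(tokens)
print(filtered)
```

NLTK's English stop-word list includes negations such as "no" and "not", so this step removes them too. Whether that hurts sentiment detection is worth checking, since "not clean" and "clean" then reduce to the same token.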


## Part 2D: POS Tagging
Part-of-speech (POS) tagging assigns a tag to every word to indicate whether it is a noun, a verb, etc., so that each word can be matched to the corresponding category in the WordNet lexical database.
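
This section has no code yet, so the following is only a sketch of one common way to connect POS tags to WordNet, not necessarily the approach intended here. NLTK's `pos_tag` returns Penn Treebank tags (for example `NNS`, `JJ`, `VBD`), while WordNet uses four single-letter codes. The helper `get_wordnet_pos` and the hand-written `tagged` pairs are hypothetical; the pairs stand in for real `pos_tag` output so the example runs without any NLTK downloads.

```python
# WordNet part-of-speech codes (the values of nltk.corpus.wordnet.ADJ, VERB, NOUN, ADV)
WN_ADJ, WN_VERB, WN_NOUN, WN_ADV = "a", "v", "n", "r"

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS code."""
    if treebank_tag.startswith("J"):
        return WN_ADJ
    if treebank_tag.startswith("V"):
        return WN_VERB
    if treebank_tag.startswith("R"):
        return WN_ADV
    # treat everything else as a noun
    return WN_NOUN

# hand-written (token, tag) pairs in the shape that nltk.pos_tag returns
tagged = [("rooms", "NNS"), ("dirty", "JJ"), ("booked", "VBD"), ("really", "RB")]
mapped = [(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
```

With the required NLTK resources downloaded, `pos_tag(tokens)` would supply the pairs, and each code could be passed as the `pos` argument of `WordNetLemmatizer().lemmatize(word, pos)`.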

## Part 3: 

In [1]:
# Drop unwanted columns and return results in the same dataframe
#features_to_remove = ['address','attributes','business_id','categories','city','hours','is_open','latitude','longitude','name','neighborhood','postal_code','state','time']
# axis=1 lets Pandas know we want to drop columns, not rows
# inplace=True lets us drop the columns right here in our DataFrame, instead of returning a new DataFrame
#df.drop(labels=features_to_remove, axis=1, inplace=True)
#df.info()

In [3]:
# check for missing values (if any, they will prevent LR from running)
#df.isna().any()
# replace all of our NaNs with 0s
#df.fillna({'weekday_checkins':0,'weekend_checkins':0, 'average_tip_length':0, 'number_tips':0, 'average_caption_length':0,'number_pics':0}, inplace=True)
# check for missing values
#df.isna().any()

## Part 4: 

## Part 5: 

## Part 6: 