In [1]:
import pandas as pd 
import numpy as np
import prepare
# Show the full review text by removing the column-width limit
pd.set_option('display.max_colwidth', None)

# Rate My Review
## An Analysis of Hotel Reviews in Texas
#### Xavier Carter, September 2021

----

#### The Dataset
- Using Selenium, 13,801 reviews were gathered from hotels across four major cities in Texas (Houston, Austin, Dallas, and San Antonio).

#### Project Goals
- Analyze reviews to understand how the review text relates to the review rating.
- Build a machine learning model to predict what rating a review should get.

#### Executive Summary
- Executive Summary here

----

## Acquire
- Review information was gathered from TripAdvisor.com using Selenium (see acquire1.py and acquire2.py).
- To save time, at most 35 reviews were collected per hotel, since some hotels had hundreds of reviews.
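
The per-hotel cap can be expressed separately from the scraping itself. The sketch below is not taken from acquire1.py or acquire2.py; `take_capped` and the page iterator it consumes are hypothetical names used only to illustrate the idea of stopping after 35 reviews.

```python
from itertools import chain, islice

MAX_REVIEWS_PER_HOTEL = 35  # the cap described above


def take_capped(review_pages, cap=MAX_REVIEWS_PER_HOTEL):
    """Return at most `cap` reviews from an iterable of pages (lists of reviews).

    Hypothetical helper: `review_pages` stands in for whatever the scraper
    yields page by page. Pages are consumed lazily, so once the cap is
    reached no further pages are requested.
    """
    return list(islice(chain.from_iterable(review_pages), cap))


# Toy usage: a generator of fake pages with 10 reviews each
fake_pages = ([f"review {page}-{i}" for i in range(10)] for page in range(100))
capped = take_capped(fake_pages)
```

Because the pages are read lazily, a hotel with hundreds of reviews costs no more page loads than one with exactly 35.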

In [2]:
df = pd.read_csv('hotel_data.csv')

In [3]:
df.head(2)

Unnamed: 0,hotel_name,hotel_city,date_of_stay,review_rating,review
0,Drury Plaza Hotel San Antonio Riverwalk,San Antonio,September 2021,5,Joseph was so helpful and attentive! Awesome customer service. Made our trip more enjoyable! This will now be our go to hotel when we come to San Antonio. Everything about the hotel was nice and the staff was very friendly. Very pleased with the whole experience.
1,Drury Plaza Hotel San Antonio Riverwalk,San Antonio,September 2020,5,"We stayed one night at the Drury Plaza Riverwalk in mid-September. Sooo enjoyed our stay. Definitely our favorite hotel on the Riverwalk. We specifically stayed here for the rooms with the balconies overlooking the San Fernando Cathedral. I sat on that balcony all day long, reading and enjoying the view, even despite the day of rain! Love the separate bedroom! The afternoon happy hour could have easily sufficed for dinner had the allure of the Riverwalk restaurants not been there. The indoor pool/hot tub was nice, and the fitness center was perfectly equipped with great views while running the treadmill. The breakfast was hearty and very good quality...love that they have biscuits and gravy! Every employee we encountered was upbeat and kind and seemed to be interested in serving"


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13801 entries, 0 to 13800
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   hotel_name     13801 non-null  object
 1   hotel_city     13801 non-null  object
 2   date_of_stay   13801 non-null  object
 3   review_rating  13801 non-null  int64 
 4   review         13801 non-null  object
dtypes: int64(1), object(4)
memory usage: 539.2+ KB


In [5]:
df.isna().sum()

hotel_name       0
hotel_city       0
date_of_stay     0
review_rating    0
review           0
dtype: int64

In [6]:
df.describe()

Unnamed: 0,review_rating
count,13801.0
mean,3.622564
std,1.559053
min,1.0
25%,2.0
50%,4.0
75%,5.0
max,5.0


In [7]:
for i in df.columns:
    print(df[i].value_counts())
    print('---------------------------')

Fairmont Austin                                             70
La Cantera Resort & Spa                                     70
Homewood Suites by Hilton Dallas-Park Central Area          35
Hyatt House Austin / Downtown                               35
La Quinta Inn & Suites by Wyndham Austin Near the Domain    35
                                                            ..
Americas Best Value Inn & Suites Northeast Houston I-610     1
Studio 6 Houston, TX- Intercontinental Airport South         1
Garden Inn & Suites                                          1
Ramada Limited Addison                                       1
Scottish Inn & Suites                                        1
Name: hotel_name, Length: 548, dtype: int64
---------------------------
Austin         4033
San Antonio    3633
Houston        3574
Dallas         2561
Name: hotel_city, dtype: int64
---------------------------
 August 2021      1356
 July 2021        1215
 June 2021         696
 February 2020     696
 May

#### Acquire Findings 

#### To-Dos:
1. The cap was 35 reviews per hotel and each review should be unique. Since value counts of 70 and 2 were seen, duplicates exist in the data and need to be removed.

2. Month and year can be split into their own separate columns.

3. There are no null or missing values.

4. Standardize the English text using NLP processing (standard cleaning with NLTK).
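
To-dos 1 and 2 can be sketched in pandas as follows. This is a minimal illustration on made-up rows, not the code in `prepare.py`; the toy frame `sample` and the assumption that `date_of_stay` always looks like "Month YYYY" (possibly with surrounding whitespace) are mine.

```python
import pandas as pd

# Toy rows shaped like hotel_data.csv; the values are made up for illustration.
sample = pd.DataFrame({
    "hotel_name": ["Hotel A", "Hotel A", "Hotel B"],
    "date_of_stay": [" August 2021", " August 2021", " July 2021"],
    "review_rating": [5, 5, 2],
    "review": ["Great stay.", "Great stay.", "Room was not clean."],
})

# To-do 1: drop exact duplicate rows and renumber the index.
deduped = sample.drop_duplicates().reset_index(drop=True)

# To-do 2: split "Month YYYY" into two columns (strip whitespace first).
parts = deduped["date_of_stay"].str.strip().str.split(" ", n=1, expand=True)
deduped["month_of_stay"] = parts[0]
deduped["year_of_stay"] = parts[1].astype(int)
deduped = deduped.drop(columns="date_of_stay")
```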

----

## Prepare
- In preparation, we will:
     * Drop duplicates
     * Split month and year into separate columns
     * Drop the date-of-stay column
     * Prep the review content (basic cleaning, tokenizing, lemmatizing, and removing stop words, while keeping common negation words such as "not", since they add to negative sentiment)
     * Make columns for word count and character count
     * Create columns for negative, positive, and neutral sentiment
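
The stop-word step with negations kept can be sketched as below. This is an assumption-laden illustration rather than the contents of `prepare.prep_review_data`: the tiny `STOP_WORDS` set stands in for a full list such as NLTK's, `clean_review` is a hypothetical name, and lemmatizing is left out.

```python
# Tiny stand-in stop-word list; a real run would likely use a full English list.
STOP_WORDS = {"the", "a", "is", "to", "and", "was", "it", "not", "no", "for"}
# Negations to keep, because dropping them flips the meaning of a sentence.
KEEP_NEGATIONS = {"not", "no"}


def clean_review(text, stop_words=STOP_WORDS, keep=KEEP_NEGATIONS):
    """Lowercase, split on whitespace, and drop stop words except negations."""
    active_stop_words = stop_words - keep
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in active_stop_words)


cleaned = clean_review("The pool is still not open")
message_length = len(cleaned)      # character count of the cleaned text
word_count = len(cleaned.split())  # whitespace-separated token count
```

Keeping "not" means "not open" survives cleaning, so a sentiment scorer still sees the negation.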

In [8]:
df = prepare.prep_review_data(df)

In [9]:
len(df)

13732

----

## Outliers 
- Here, we'll take a look at possible anomalies:
     * Positive-sentiment reviews with low ratings
     * Negative-sentiment reviews with high ratings

In [10]:
# Flag reviews whose sentiment score contradicts the star rating
postive_when_neg  = (df.positive_sentiment >= 0.5) & (df.review_rating < 3)
negative_when_pos = (df.negative_sentiment >= 0.5) & (df.review_rating > 3)

In [11]:
df[postive_when_neg].sample(2)

Unnamed: 0,hotel_name,hotel_city,review_rating,review,month_of_stay,year_of_stay,review_cleaned,message_length,word_count,positive_sentiment,negative_sentiment,neatral_sentiment
12130,Best Western Plus Northwest Inn & Suites,Houston,1,The place smelled. The ac struggled to cool. The lobby needed better air flow. The pool is still not open. The Togo breakfast is a joke. The workers are the best assets this location has going for it.,June,2021,place smelled. ac struggled cool. lobby needed better air flow. pool still not open. togo breakfast joke. worker best asset location ha going .,143,24,0.503,0.069,0.429
12994,Hilton Garden Inn San Antonio At The Rim,San Antonio,1,"Fabulous stay. The front desk staff was friendly and helpful. I enjoyed the breakfast daily and the breakfast crew as well. Room 404 - temperature control was excellent, I could adjust from heat to cool without issues. Necessary because of the temperature swings",November,2019,"fabulous stay. front desk staff friendly helpful. enjoyed breakfast daily breakfast crew well. room 404 - temperature control excellent , could adjust heat cool without issues. necessary temperature swing",204,29,0.51,0.0,0.49


In [12]:
df[negative_when_pos].head()

Unnamed: 0,hotel_name,hotel_city,review_rating,review,month_of_stay,year_of_stay,review_cleaned,message_length,word_count,positive_sentiment,negative_sentiment,neatral_sentiment
13793,Econo Lodge Inn & Suites,Dallas,5,"Service of staff was a little bad and rude, In the lobby the didn't have the uniform. And the location is not good, is very dangerous.",March,2014,"service staff little bad rude , lobby ' uniform. location not good , dangerous .",80,15,0.141,0.526,0.332


#### Most of these are probably misclicks: the reviewer likely meant to give a higher or lower rating than the one recorded, judging by the sentiment of the text. We will remove these rows from the dataframe as outliers.

In [13]:
# Collect the index labels of the flagged rows
drop1 = df[postive_when_neg].index.to_list()
drop2 = df[negative_when_pos].index.to_list()

In [14]:
df = df.drop(drop1)

In [15]:
df = df.drop(drop2)

In [20]:
# Renumber rows from 0 without keeping the old index as a column
df = df.reset_index(drop=True)

In [22]:
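# 'level_0' only appears if reset_index() was run more than once; skip it when absent
df = df.drop(columns='level_0', errors='ignore')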

In [24]:
len(df)

13721
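
The outlier removal above can also be written as a single boolean filter. The sketch below runs on a made-up frame (`toy`) that only borrows the column names from the prepared data; it is an alternative formulation, not the code that produced the counts shown.

```python
import pandas as pd

# Toy frame with the same column names as the prepared data; values are made up.
toy = pd.DataFrame({
    "review_rating":      [1, 5, 5, 2],
    "positive_sentiment": [0.51, 0.14, 0.60, 0.10],
    "negative_sentiment": [0.00, 0.53, 0.05, 0.55],
})

positive_but_low = (toy.positive_sentiment >= 0.5) & (toy.review_rating < 3)
negative_but_high = (toy.negative_sentiment >= 0.5) & (toy.review_rating > 3)

# Keep only rows matching neither mask, then renumber the index from 0.
filtered = toy[~(positive_but_low | negative_but_high)].reset_index(drop=True)
```

Filtering with the inverted mask avoids building lists of index labels, and `reset_index(drop=True)` leaves no stray `index` or `level_0` column behind.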