## Data Science - Final Project

For my final project I use natural language processing to analyze review text for Universal Studios parks, score the sentiment of each review, and compare that sentiment with the star rating the reviewer gave.

I am using a Kaggle dataset [Reviews of Universal Studios](https://www.kaggle.com/dwiknrd/reviewuniversalstudio).

I created a Dash application based on the information in this notebook, used Dash Bootstrap Components for a responsive page layout, and deployed the app to Heroku.

### Imports
---
Import the libraries used in the notebook for analysis and visualization.

In [1]:
import calendar
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

import plotly.graph_objects as go
import plotly.express as px

Suppress all `FutureWarning` messages.

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

I will use the VADER (Valence Aware Dictionary and sEntiment Reasoner) model from the Natural Language Toolkit (NLTK).

In [3]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\fmcguirk\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Import the data
Import data from source file.

In [4]:
df_reviews = pd.read_csv('../data/universal_studio_branches.csv')

## Exploratory data analysis

Check the data in the dataset.
* Check the size (shape) of the dataset
* Check for null values
* Check the data types of the columns

In [5]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50904 entries, 0 to 50903
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   reviewer      50904 non-null  object 
 1   rating        50904 non-null  float64
 2   written_date  50904 non-null  object 
 3   title         50904 non-null  object 
 4   review_text   50904 non-null  object 
 5   branch        50904 non-null  object 
dtypes: float64(1), object(5)
memory usage: 2.3+ MB


In [6]:
df_reviews.isna().sum()

reviewer        0
rating          0
written_date    0
title           0
review_text     0
branch          0
dtype: int64

In [7]:
df_reviews['rating'].value_counts()

5.0    28202
4.0    13514
3.0     5229
2.0     1986
1.0     1973
Name: rating, dtype: int64

In [8]:
df_reviews.head(2)

|   | reviewer | rating | written_date | title | review_text | branch |
|---|---|---|---|---|---|---|
| 0 | Kelly B | 2.0 | May 30, 2021 | Universal is a complete Disaster - stick with ... | We went to Universal over Memorial Day weekend... | Universal Studios Florida |
| 1 | Jon | 1.0 | May 30, 2021 | Food is hard to get. | The food service is horrible. I’m not reviewin... | Universal Studios Florida |


## Data Cleanup

Perform data cleanup:
* Change the rating column to integer since there are no fractional values
* Create a new date column by converting the values in the written_date column
* Remove 'Universal Studios' from branch names
* Drop columns we will not be using

In [9]:
df_reviews['rating'] = df_reviews['rating'].astype(int)

In [10]:
df_reviews['date'] = pd.to_datetime(df_reviews['written_date'])
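
The `written_date` values shown earlier look like `May 30, 2021`. As a minimal sketch, assuming every row follows that month-name format, an explicit `format` makes the parse stricter, and `errors='coerce'` turns anything that does not match into `NaT` so it can be counted. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same style as the written_date column
written = pd.Series(['May 30, 2021', 'July 4, 2019', 'not a date'])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(written, format='%B %d, %Y', errors='coerce')

print(parsed)
print('unparseable rows:', parsed.isna().sum())
```

If the count of unparseable rows is zero on the real data, the stricter parse and the default one should agree.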

In [11]:
df_reviews['branch'] = df_reviews['branch'].str.replace('Universal Studios ', '', regex=False)

In [12]:
# drop columns we will not use
df_reviews = df_reviews.drop(['reviewer', 'written_date'], axis=1)

In [13]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50904 entries, 0 to 50903
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   rating       50904 non-null  int64         
 1   title        50904 non-null  object        
 2   review_text  50904 non-null  object        
 3   branch       50904 non-null  object        
 4   date         50904 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 1.9+ MB


In [14]:
df_reviews.head(2)

|   | rating | title | review_text | branch | date |
|---|---|---|---|---|---|
| 0 | 2 | Universal is a complete Disaster - stick with ... | We went to Universal over Memorial Day weekend... | Florida | 2021-05-30 |
| 1 | 1 | Food is hard to get. | The food service is horrible. I’m not reviewin... | Florida | 2021-05-30 |


## Visualize number of reviews per year for each park

Define a function to compare the number of reviews for each park per year:
* create a new dataframe with a summary of the number of reviews for each park per year
* plot the new dataframe

In [15]:
def park_review_counts(year_list):
    
    parks_list = ['Florida', 'Japan', 'Singapore']

    review_counts_df = pd.DataFrame()
    for park in parks_list:
        for year in year_list:
            counts = df_reviews[(df_reviews['branch']==park) & (df_reviews['date'].dt.year == year)]['title'].count()
            review_counts_df.loc[year, park] = counts

    # display results with plotly
    trace0 = go.Bar(x= review_counts_df.index,
                    y= review_counts_df['Florida'].values,
                    name='Florida')
    trace1 = go.Bar(x= review_counts_df.index,
                    y= review_counts_df['Japan'].values,
                    name='Japan')
    trace2 = go.Bar(x= review_counts_df.index,
                    y= review_counts_df['Singapore'].values,
                    name='Singapore')

    # now the layout
    layout=go.Layout(title='Park Reviews Count',
                       xaxis= dict(title='Year'),
                       yaxis=dict(title='Count'))

    # bind using go.Figure
    fig = go.Figure(data=[trace0, trace1, trace2], layout=layout)
    fig.show()
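
The nested loop that fills `review_counts_df` can also be written as a single cross-tabulation. The following is a sketch on a made-up toy frame (the `toy` data is hypothetical and only has the two columns needed), not a replacement for the function above.

```python
import pandas as pd

# Toy stand-in for df_reviews with only the columns needed here
toy = pd.DataFrame({
    'branch': ['Florida', 'Florida', 'Japan', 'Singapore', 'Japan'],
    'date': pd.to_datetime(['2019-07-01', '2019-08-15', '2019-01-10',
                            '2020-03-05', '2020-06-20']),
})

# Rows are years, columns are parks, cells are review counts
review_counts = pd.crosstab(toy['date'].dt.year, toy['branch'])
print(review_counts)
```

Park and year combinations with no reviews come back as 0, so the result can be plotted the same way as `review_counts_df`.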


Get list of unique years in the dataset.

In [16]:
years = df_reviews['date'].dt.year.unique()
years = np.sort(years).tolist()
print(years)

[2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]


In [17]:
park_review_counts(years)

There are very few reviews before 2012 or after 2020, so restrict the list of years to 2012-2020.

In [18]:
years = np.arange(2012, 2021)
park_review_counts(years)

## Visualize the reviews for July 2019 for Universal Studios Florida
Create a new dataframe with only the reviews for Universal Studios Florida for July 2019.

In [19]:
park = 'Florida'
year = 2019
month = 7

df_florida_2019_07 = df_reviews[(df_reviews['branch']==park) & \
                      (df_reviews['date'].dt.year == year) & (df_reviews['date'].dt.month == month)].copy()
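
The same park, year, and month filter is needed several times in this notebook. One way to avoid repeating the boolean mask is a small helper; `filter_reviews` is a hypothetical name introduced here for illustration, and the toy data is made up.

```python
import pandas as pd

def filter_reviews(df, park, year, month):
    """Return a copy of the rows for one park in one calendar month."""
    mask = (
        (df['branch'] == park)
        & (df['date'].dt.year == year)
        & (df['date'].dt.month == month)
    )
    return df[mask].copy()

# Toy stand-in for df_reviews
toy = pd.DataFrame({
    'branch': ['Florida', 'Florida', 'Japan', 'Florida'],
    'date': pd.to_datetime(['2019-07-01', '2019-07-20',
                            '2019-07-05', '2019-08-01']),
    'rating': [5, 3, 4, 1],
})

subset = filter_reviews(toy, 'Florida', 2019, 7)
print(subset)
```

Wrapping the conditions in parentheses also removes the need for backslash line continuations.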

In [20]:
print(df_florida_2019_07.shape)
df_florida_2019_07.sort_values(by='rating', ascending=False).head()

(233, 5)


|   | rating | title | review_text | branch | date |
|---|---|---|---|---|---|
| 1930 | 5 | Worth Staying at the Hotels | We stayed at the Lowes Sapphire Falls and we h... | Florida | 2019-07-14 |
| 1952 | 5 | An amazing day | Important tip, before visiting the Park, downl... | Florida | 2019-07-10 |
| 1944 | 5 | VIP Tour w/ PJ | We had a 9:30 a.m. public VIP tour with PJ and... | Florida | 2019-07-12 |
| 1945 | 5 | Familyvacation | Excellent! As a movie fan this is the best par... | Florida | 2019-07-12 |
| 1946 | 5 | Birthday | I am a big kid at heart! My husband made this ... | Florida | 2019-07-11 |


In [21]:
df_florida_2019_07['rating'].value_counts()

5    122
4     44
3     31
1     22
2     14
Name: rating, dtype: int64

Visualize the daily count of each rating for Universal Studios Florida in July 2019.

In [22]:
df_ratings = pd.get_dummies(df_florida_2019_07, columns = ['rating'], prefix='', prefix_sep='')
# select the rating columns with a list (tuple-style selection after groupby is not supported in recent pandas)
df_ratings_month = df_ratings.groupby('date')[['1', '2', '3', '4', '5']].sum().astype(int)

column_list = df_ratings_month.select_dtypes(include='number').columns.to_list()

fig = px.line(df_ratings_month, 
              x=df_ratings_month.index, 
              y=column_list
             )

fig.update_layout(
    xaxis=dict(tickformat='%m-%d', title='Month-Day'),
    yaxis=dict(title='Total Ratings by Class'),
    title=dict(text=park + ' Ratings - ' + calendar.month_name[month] + ' ' + str(year)),
    legend=dict(title='Rating Classes')
)
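
To make the reshaping step concrete, here is a small sketch on made-up data showing what `get_dummies` followed by a per-day sum produces. The `toy` frame is hypothetical and contains only ratings 4 and 5.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-07-01', '2019-07-01', '2019-07-02']),
    'rating': [5, 4, 5],
})

# One indicator column per rating value, named '4' and '5' here
dummies = pd.get_dummies(toy, columns=['rating'], prefix='', prefix_sep='')

# Summing the indicators per day gives the daily count of each rating
daily = dummies.groupby('date')[['4', '5']].sum()
print(daily)
```

Only rating values that occur in the data get a column, so selecting `'1'` through `'5'` works for a month like July 2019 where all five ratings appear, but would raise a `KeyError` for a month in which one of them is missing.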


## Analyze Sentiment of Reviews
---
First, we will create a new DataFrame and re-order the columns.

In [23]:
df_analyzed_reviews = df_reviews[['rating', 'date', 'title', 'review_text', 'branch']].copy()

Instantiate the sentiment analyzer, get the sentiment scores for each review, and use the compound score to label the review as positive (compound >= 0.5), negative (compound <= -0.5), or neutral (anything in between).

In [24]:
neg = []
neu = []
pos = []
sia = SentimentIntensityAnalyzer()
for review in df_analyzed_reviews['review_text']:
    sent = sia.polarity_scores(review)
    pos.append(1 if sent['compound'] >= 0.5 else 0)
    neg.append(1 if sent['compound'] <= -0.5 else 0)
    neu.append(1 if sent['compound'] > -0.5 and sent['compound'] < 0.5 else 0)
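
The three conditions above are mutually exclusive, so each review gets exactly one label. The sketch below expresses the same rule as a small function; `label_sentiment` and its `threshold` parameter are hypothetical names, and the scores are made up rather than taken from the dataset. The cutoff of 0.5 is the one chosen in this notebook; the VADER authors' documentation suggests 0.05 as a typical cutoff, which would label far fewer reviews as neutral.

```python
def label_sentiment(compound, threshold=0.5):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'

# Made-up compound scores, including the two boundary values
for score in (0.93, 0.5, 0.12, -0.49, -0.5):
    print(score, label_sentiment(score))
```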

Append the accumulated values as new columns to our dataframe.

In [25]:
df_analyzed_reviews['neg'] = neg
df_analyzed_reviews['neu'] = neu
df_analyzed_reviews['pos'] = pos

How does the data look now?

In [26]:
print(df_analyzed_reviews.shape)
df_analyzed_reviews.head()

(50904, 8)


|   | rating | date | title | review_text | branch | neg | neu | pos |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2021-05-30 | Universal is a complete Disaster - stick with ... | We went to Universal over Memorial Day weekend... | Florida | 1 | 0 | 0 |
| 1 | 1 | 2021-05-30 | Food is hard to get. | The food service is horrible. I’m not reviewin... | Florida | 1 | 0 | 0 |
| 2 | 2 | 2021-05-30 | Disappointed | I booked this vacation mainly to ride Hagrid m... | Florida | 0 | 0 | 1 |
| 3 | 4 | 2021-05-29 | My opinion | When a person tries the test seat for the ride... | Florida | 0 | 1 | 0 |
| 4 | 5 | 2021-05-28 | The Bourne Stuntacular...MUST SEE | Ok, I can't stress enough to anyone and everyo... | Florida | 0 | 0 | 1 |


## Export processed data
---
Since the analysis of the review text can take some time, export the entire dataframe excluding the review_text column. This exported data will be used by the app for the interactive visualizations.

In [27]:
df_analyzed_reviews.drop(['review_text'], axis=1).to_csv('../data/df_analyzed_reviews.csv', index=False)

## Visualization
---
Let's look at the reviews for Universal Studios Florida in July 2019 and create a plot of this data.

In [28]:
park = 'Florida'
year = 2019
month = 7

df_florida_2019_07 = df_analyzed_reviews[(df_analyzed_reviews['branch']==park) & \
                          (df_analyzed_reviews['date'].dt.year == year) & \
                              (df_analyzed_reviews['date'].dt.month == month)].copy()

In [29]:
df_florida_2019_07.head()

|   | rating | date | title | review_text | branch | neg | neu | pos |
|---|---|---|---|---|---|---|---|---|
| 1814 | 4 | 2019-07-31 | Good fun | We enjoyed our day at universal studios. The p... | Florida | 0 | 0 | 1 |
| 1815 | 5 | 2019-07-31 | Epic | Absolutely brilliant for all the family! We ca... | Florida | 0 | 0 | 1 |
| 1816 | 3 | 2019-07-31 | Good but not special | Sorry to say but I have been here before and i... | Florida | 1 | 0 | 0 |
| 1817 | 4 | 2019-07-31 | Fun Times Despite Teenagers | We visited Universal Studios(US) with two teen... | Florida | 0 | 0 | 1 |
| 1818 | 4 | 2019-07-31 | Good Time | Overall Universal excellent... I took my 8 yea... | Florida | 0 | 0 | 1 |


Create a new dataframe summarizing the review sentiment and group by the rating value assigned to the review.

In [30]:
ratings_list = [1, 2, 3, 4, 5]
summary_df = pd.DataFrame(columns=['neutral', 'negative', 'positive'], \
                             index=ratings_list)
for rating in ratings_list:
    num_positive = df_florida_2019_07[(df_florida_2019_07['rating'] == rating) \
                                    & (df_florida_2019_07['pos'] == 1)]['title'].count()
    num_negative = df_florida_2019_07[(df_florida_2019_07['rating'] == rating) \
                                    & (df_florida_2019_07['neg'] == 1)]['title'].count()
    num_neutral = df_florida_2019_07[(df_florida_2019_07['rating'] == rating) \
                                    & (df_florida_2019_07['neu'] == 1)]['title'].count()
    
    summary_df.loc[rating] = [num_neutral, num_negative, num_positive]
summary_df.head()

|   | neutral | negative | positive |
|---|---|---|---|
| 1 | 6 | 9 | 7 |
| 2 | 2 | 4 | 8 |
| 3 | 12 | 2 | 17 |
| 4 | 5 | 1 | 38 |
| 5 | 9 | 1 | 112 |
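
Because `neg`, `neu`, and `pos` are 0/1 indicator columns, the same summary can be built by grouping on `rating` and summing. This is a sketch on a made-up toy frame, not the notebook's data; `reindex` is used so that ratings with no reviews still appear as rows of zeros.

```python
import pandas as pd

# Toy stand-in for the filtered dataframe (values are made up)
toy = pd.DataFrame({
    'rating': [1, 1, 3, 5, 5, 5],
    'neg':    [1, 0, 0, 0, 0, 0],
    'neu':    [0, 1, 1, 0, 0, 1],
    'pos':    [0, 0, 0, 1, 1, 0],
})

summary = (
    toy.groupby('rating')[['neu', 'neg', 'pos']].sum()
       .reindex([1, 2, 3, 4, 5], fill_value=0)
       .rename(columns={'neu': 'neutral', 'neg': 'negative', 'pos': 'positive'})
)
print(summary)
```

Unlike the loop version, this keeps integer dtypes in the result instead of `object`.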


Create a scatter plot for each sentiment value (neutral, negative, and positive).

In [31]:
data = []
for col in summary_df.columns.to_list():
    data.append(
        go.Scatter(
            x=summary_df.index,
            y=summary_df[col],
            name=col
        )
    )

data

[Scatter({
     'name': 'neutral', 'x': array([1, 2, 3, 4, 5], dtype=int64), 'y': array([6, 2, 12, 5, 9], dtype=object)
 }),
 Scatter({
     'name': 'negative', 'x': array([1, 2, 3, 4, 5], dtype=int64), 'y': array([9, 4, 2, 1, 1], dtype=object)
 }),
 Scatter({
     'name': 'positive', 'x': array([1, 2, 3, 4, 5], dtype=int64), 'y': array([7, 8, 17, 38, 112], dtype=object)
 })]

Show the plot.

In [32]:
fig=go.Figure(data=data)
fig.show()

### Define a function encapsulating the steps performed to create the plot.

In [33]:
def check_sentiment(park, year, month):
    # select the reviews for the requested park, year, and month
    df_park_month = df_analyzed_reviews[(df_analyzed_reviews['branch']==park) & \
                          (df_analyzed_reviews['date'].dt.year == year) & \
                              (df_analyzed_reviews['date'].dt.month == month)].copy()
    
    print('Reviews for Universal Studio ', park, ' for ', year, '/', month)
    print('  Total reviews analyzed: ', df_park_month.shape[0])


    summary_cols = ['neutral', 'negative', 'positive']
    summary_index = [1, 2, 3, 4, 5]

    summary_df = pd.DataFrame(columns=summary_cols, \
                             index=summary_index)

    for rating in summary_index:
        num_positive = df_park_month[(df_park_month['rating'] == rating) \
                                     & (df_park_month['pos'] == 1)]['title'].count()
        num_negative = df_park_month[(df_park_month['rating'] == rating) \
                                     & (df_park_month['neg'] == 1)]['title'].count()
        num_neutral = df_park_month[(df_park_month['rating'] == rating) \
                                     & (df_park_month['neu'] == 1)]['title'].count()
        
        summary_df.loc[rating] = [num_neutral, num_negative, num_positive]

    display("Summary DataFrame", summary_df)

    data = []
    for col in summary_df.columns.to_list():
        data.append(
            go.Scatter(
                x=summary_df.index,
                y=summary_df[col],
                name=col
            )
        )

    month_name = calendar.month_name[month]
    the_title = f'{park} - Review Sentiment - {month_name} {year}'

    fig = go.Figure(data)

    fig.update_layout(
        xaxis=dict(title='Rating'),
        yaxis=dict(title='Total Reviews'),
        title=dict(text=the_title),
        title_x=0.45,      # shift title to the right to be closer to center
        title_y=1.00,      # Move plot title to be closer to top
        # shift legend down a small amount to make space for modebar
        legend=dict(title='Sentiment', yanchor='top', y=0.90),
        # mode: Compare data on hover (shows tags for all values at selected x position)
        hovermode='x',
        # reduce space around plot (top, bottom, left, right)
        margin=dict(t=20, b=60, l=10, r=10),
    )

    return fig

### Now, check sentiment of reviews for Florida in July 2019

In [34]:
fig = check_sentiment('Florida', 2019, 7)
fig.show()

Reviews for Universal Studio  Florida  for  2019 / 7
  Total reviews analyzed:  233


'Summary DataFrame'

|   | neutral | negative | positive |
|---|---|---|---|
| 1 | 6 | 9 | 7 |
| 2 | 2 | 4 | 8 |
| 3 | 12 | 2 | 17 |
| 4 | 5 | 1 | 38 |
| 5 | 9 | 1 | 112 |
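
The raw counts are hard to compare across ratings because 5-star reviews are far more numerous than the rest. One possible way to compare sentiment with rating is to divide each row by its total, as sketched below. The counts are copied from the summary table above; the normalization step itself is an addition for illustration.

```python
import pandas as pd

# Counts copied from the summary table for Florida, July 2019
summary = pd.DataFrame(
    {'neutral': [6, 2, 12, 5, 9],
     'negative': [9, 4, 2, 1, 1],
     'positive': [7, 8, 17, 38, 112]},
    index=[1, 2, 3, 4, 5],
)

# Divide each row by its total so every rating sums to 1.0
shares = summary.div(summary.sum(axis=1), axis=0)
print(shares.round(3))
```

Read this way, 112 of the 122 five-star reviews are labelled positive, while only 9 of the 22 one-star reviews are labelled negative, which suggests the sentiment label tracks high ratings more closely than low ones for this month.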


## Visualize using Saved Data
---
The processed dataframe was saved to a CSV file without the review_text column.

Import the saved CSV file, convert the columns to the expected data types, and then visualize again.

In [35]:
df_analyzed_reviews = pd.read_csv('../data/df_analyzed_reviews.csv')

Convert the date column from text back to datetime.

In [36]:
df_analyzed_reviews['date'] = pd.to_datetime(df_analyzed_reviews['date'])
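
A CSV file stores dates as text, which is why the conversion is needed after reloading. As an alternative sketch, `read_csv` can do the conversion while reading via `parse_dates`. The example uses an in-memory string with one made-up row instead of the real file so that it runs on its own.

```python
import io
import pandas as pd

# In-memory stand-in for df_analyzed_reviews.csv (the row is made up)
csv_text = (
    "rating,date,title,branch,neg,neu,pos\n"
    "5,2019-07-14,Great day,Florida,0,0,1\n"
)

# Without parse_dates the date column comes back as plain strings
as_text = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the named columns while reading
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(as_text['date'].dtype, as_dates['date'].dtype)
```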

Check the structure of the dataset.

In [37]:
df_analyzed_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50904 entries, 0 to 50903
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   rating  50904 non-null  int64         
 1   date    50904 non-null  datetime64[ns]
 2   title   50904 non-null  object        
 3   branch  50904 non-null  object        
 4   neg     50904 non-null  int64         
 5   neu     50904 non-null  int64         
 6   pos     50904 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 2.7+ MB


Now, visualize the selected data using the same parameters as before.

In [38]:
fig = check_sentiment('Florida', 2019, 7)
fig.show()

Reviews for Universal Studio  Florida  for  2019 / 7
  Total reviews analyzed:  233


'Summary DataFrame'

|   | neutral | negative | positive |
|---|---|---|---|
| 1 | 6 | 9 | 7 |
| 2 | 2 | 4 | 8 |
| 3 | 12 | 2 | 17 |
| 4 | 5 | 1 | 38 |
| 5 | 9 | 1 | 112 |
