# Lab 4 - Sentiment analysis from texts

In this lab, you will learn:
* How to clean texts
* How to generate a document-term matrix from texts
* How to do sentiment analysis from texts

This lab is written by Jisun AN (jisunan@smu.edu.sg) and Michelle KAN (michellekan@smu.edu.sg).


# 1. Getting the data

In this lab, we will use restaurant review data. 

Each review was manually annotated by humans with an aspect and a sentiment.

One review may cover two or more aspects and thus carry two or more sentiments.

Reviews labeled as conflicting were excluded.

"restaurant_reviews.tsv" is a tab-separated file whose fields are "sid \t text \t aspect \t sentiment".

sid is the review id, text is the review text, aspect is one of four labels (food, service, ambience, price), and sentiment is one of three labels (positive, negative, neutral).
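To make the file layout concrete, the sketch below parses a small sample with the same four fields. The two rows are invented for illustration and are not taken from the real file; the sketch also assumes the file starts with a header row, as the column names used later in this lab suggest.

```python
import io
import pandas as pd

# Two invented rows in the same "sid \t text \t aspect \t sentiment" layout
sample_tsv = (
    "sid\ttext\taspect\tsentiment\n"
    "1\tThe pasta was delicious.\tfood\tpositive\n"
    "2\tOur waiter ignored us.\tservice\tnegative\n"
)

# sep="\t" tells pandas that fields are separated by tabs
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample_df.columns.tolist())
print(sample_df.shape)
```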

In [None]:
# Import Pandas to analyze the data
import pandas as pd


In [None]:
# Read the file using pandas' read_table function (read_csv with sep="\t" works as well)
ori_df = pd.read_table("https://raw.githubusercontent.com/anjisun221/css_codes/main/restaurant_reviews.tsv", sep="\t")

print(ori_df.shape)
ori_df.head()

In [None]:
# to see entire text 
pd.set_option('display.max_colwidth', 150)
ori_df.head()

In [None]:
ori_df['sentiment'].value_counts()

In [None]:
ori_df['aspect'].value_counts()

### Combine review by aspect + sentiment (e.g., all positive reviews about food)

In [None]:
list_sentiment = ['positive', 'negative', 'neutral']
list_aspect = ['food', 'service', 'ambience', 'price']

data_combined = {}

for each_sent in list_sentiment:
    for each_aspect in list_aspect:
        
        new_label = each_aspect+"_"+each_sent
        print(new_label)
        
        tmp_df = ori_df.query("sentiment==@each_sent and aspect==@each_aspect")
        texts = " ".join(tmp_df['text'].to_list())
        data_combined[new_label] = [texts]


In [None]:
data_combined

Turn dictionary into dataframe

In [None]:
df = pd.DataFrame.from_dict(data_combined, orient='index')
df.columns = ['text']
df = df.sort_index()
df.head()


In [None]:
# Let's take a look at the negative reviews for ambience
df.text.loc['ambience_negative']


We will save this dataframe to a file later, in the Corpus section.
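As an aside, the nested loop that combines reviews by aspect and sentiment can also be written with a pandas `groupby`. The sketch below uses a small invented dataframe (`toy_df` is a stand-in for `ori_df`) and is one possible alternative, not a replacement for the lab's code. One difference to keep in mind: `groupby` only returns label combinations that actually occur, whereas the loop creates an empty string for combinations with no reviews.

```python
import pandas as pd

# Invented reviews; the real lab uses ori_df loaded from the TSV file
toy_df = pd.DataFrame({
    "text": ["Great pasta.", "Tasty soup.", "Slow waiter.", "Rude staff."],
    "aspect": ["food", "food", "service", "service"],
    "sentiment": ["positive", "positive", "negative", "negative"],
})

# Build the "<aspect>_<sentiment>" label, then join all texts that share a label
toy_df["label"] = toy_df["aspect"] + "_" + toy_df["sentiment"]
combined = toy_df.groupby("label")["text"].apply(" ".join).to_frame()
print(combined)
```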

# 2. Cleaning the data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (e.g., the newline character \n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...

### Round 1. Let's make text lowercase, remove punctuation, and remove words containing numbers.
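Before applying these steps to the whole dataframe, it can help to try them on a single sentence. The sketch below uses a hypothetical helper, `clean_sentence`, and an invented example sentence. Note that removing a word leaves its surrounding spaces behind, so the cleaned string can contain double spaces; `split()` collapses them.

```python
import re
import string

def clean_sentence(text):
    """Lowercase, strip punctuation, and drop tokens that contain a digit."""
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

example = "We paid $25 for 2 tiny Tacos!"
cleaned = clean_sentence(example)
print(cleaned.split())
```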

In [None]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)  # raw string avoids invalid-escape warnings
    return text

round1 = lambda x: clean_text_round1(x)


In [None]:
# Let's take a look at the updated text
df_clean = pd.DataFrame(df['text'].apply(round1))
df_clean.head()


### Round 2. Let's remove stopwords. 

A stop word is a commonly used word (such as "the", "a", "an", "in"). For some analyses, such as sentiment analysis, stop words often carry little meaning, so we remove them.
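One interaction between the cleaning rounds is worth seeing in isolation: once punctuation has been stripped, a contraction such as "don't" becomes "dont" and no longer matches a stop list that stores it with the apostrophe. The sketch below uses a tiny hand-written stop list (an assumption for illustration; NLTK's list is much longer). This is a likely reason why tokens such as "dont" and "didnt" can survive stop word removal.

```python
import re
import string

# A tiny hand-written stop list for illustration (NLTK's real list is much longer)
toy_stop = {"the", "a", "was", "i", "don't"}

raw = "I don't like the soup"
lowered = raw.lower()
no_punct = re.sub('[%s]' % re.escape(string.punctuation), '', lowered)

# Filter the same sentence before and after punctuation removal
kept_with_punct = [w for w in lowered.split() if w not in toy_stop]
kept_without_punct = [w for w in no_punct.split() if w not in toy_stop]

print(kept_with_punct)     # "don't" matches the stop list and is dropped
print(kept_without_punct)  # "dont" no longer matches and is kept
```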

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes membership tests fast

# Exclude stop words with a list comprehension; it is applied to each row with pandas' apply below
round2 = lambda x: ' '.join([word for word in x.split() if word not in stop])


In [None]:
df_clean = pd.DataFrame(df_clean['text'].apply(round2))
df_clean.head()


**NOTE:** This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:
* Mark 'outstanding' and 'outstand' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...
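As a minimal sketch of the bi-gram idea mentioned above, the hypothetical helper `make_ngrams` below joins each run of consecutive tokens into a single term, so that "thank you" becomes one token. This is an illustration only; CountVectorizer's `ngram_range` parameter offers a built-in way to do something similar.

```python
def make_ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "thank you for the great service".split()
bigrams = make_ngrams(tokens, n=2)
print(bigrams)
```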

## Organizing the data

We will have clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.


In [None]:
df

In [None]:
# Let's add the full label as well
full_labels = ['ambience_negative', 'ambience_neutral', 'ambience_positive',
       'food_negative', 'food_neutral', 'food_positive', 'price_negative',
       'price_neutral', 'price_positive', 'service_negative',
       'service_neutral', 'service_positive']

df['category'] = full_labels
df


In [None]:
# Let's pickle it for later use
df.to_pickle("corpus.pkl")


### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.
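To show what a document-term matrix contains, the sketch below builds one by hand for two invented documents using only the standard library. It is a simplified illustration: CountVectorizer applies its own tokenization rules (for example, it lowercases text and by default ignores single-character tokens), so its output can differ on real data.

```python
from collections import Counter

# Two invented documents keyed by category name
docs = {
    "food_positive": "great pasta great wine",
    "service_negative": "slow waiter rude waiter",
}

# Vocabulary: every distinct word, sorted so columns have a stable order
vocab = sorted({word for text in docs.values() for word in text.split()})

# One row per document, one count per vocabulary word
dtm = {}
for name, text in docs.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocab]

print(vocab)
for name, row in dtm.items():
    print(name, row)
```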



In [None]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')  # you can remove stop words using CountVectorizer as well
data_cv = cv.fit_transform(df_clean.text)
# get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out()
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = df.index
data_dtm


In [None]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")


In [None]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
import pickle
df_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as f:  # the with-block closes the file after writing
    pickle.dump(cv, f)


## Additional Exercises


1. Can you add an additional regular expression to the clean_text_round1 function to further clean the text?
2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?

# 3. Exploratory Data Analysis

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

We are going to look at the **Most common words** and **Amount of love/hate** for each category.


## Most common words

In [None]:
# If you start from Section 3, please uncomment below code
# import pandas as pd
# data_dtm = pd.read_pickle('./dtm.pkl')

data = data_dtm.transpose()
data.head()

In [None]:
# Find the top 30 words said by each category
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))

top_dict

In [None]:
# Print the top 15 words said by each category
for category, top_words in top_dict.items():
    print(category)
    print(', '.join([word for word, count in top_words[:15]]))
    print('---')

**NOTE:** At this point, we could go on and create word clouds. However, by looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.
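The selection rule used in the next cells (treat a word as a stop word if it is a top word in more than half of the categories) can be sketched on a small invented example. The category names and word lists below are made up, and only four categories are used instead of the lab's twelve.

```python
from collections import Counter

# Invented top-word lists for four categories (the lab has twelve)
toy_top = {
    "food_positive": ["food", "great", "pasta"],
    "food_negative": ["food", "bland", "cold"],
    "service_positive": ["food", "great", "friendly"],
    "service_negative": ["food", "slow", "rude"],
}

# Count in how many categories each word appears as a top word
category_counts = Counter(word for words in toy_top.values() for word in words)

# Keep words that are a top word in more than half of the categories
threshold = len(toy_top) / 2
shared = [word for word, n in category_counts.most_common() if n > threshold]
print(shared)
```

Here "great" appears in exactly half of the categories, so it stays in the vocabulary; only "food" is flagged.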


In [None]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each category
words = []
for category in data.columns:
    top = [word for (word, count) in top_dict[category]]
    for t in top:
        words.append(t)
        
words


In [None]:
# Let's aggregate this list and identify the most common words along with how many categories they occur in
Counter(words).most_common()


In [None]:
# If a word is a top word in more than half of the 12 categories (count > 6), add it to the stop word list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 6]
add_stop_words

In [None]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
df_clean = pd.read_pickle('./data_clean.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix
# (recent scikit-learn versions expect stop_words as a list, not a frozenset)
cv = CountVectorizer(stop_words=list(stop_words))
data_cv = cv.fit_transform(df_clean.text)
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_stop.index = df_clean.index

# Pickle it for later use
# If you start from Section 3, please uncomment the import below
# import pickle
with open("cv_stop.pkl", "wb") as f:
    pickle.dump(cv, f)
data_stop.to_pickle("dtm_stop.pkl")


In [None]:
# Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)


In [None]:
# Reset the output dimensions
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [16, 6]

full_labels = ['ambience_negative', 'ambience_neutral', 'ambience_positive',
       'food_negative', 'food_neutral', 'food_positive', 'price_negative',
       'price_neutral', 'price_positive', 'service_negative',
       'service_neutral', 'service_positive']

# Create subplots for each category
for index, category in enumerate(data.columns):
    wc.generate(df_clean.text[category])
    
    plt.subplot(3, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(full_labels[index])
    
plt.show()


Findings 

* Reviews of different aspects seem to have different sets of frequent words. E.g., ambience: table, atmosphere, decor, etc., while price: price, worth, cheap, etc.


## Amount of love/hate words in positive/negative reviews

In [None]:
# Let's isolate just these words
data_lovehate_words = data.transpose()[['love', 'best', 'dont', 'worst', 'didnt']]
data_lovehate = pd.concat([data_lovehate_words.love + data_lovehate_words.best, data_lovehate_words.dont + data_lovehate_words.worst + data_lovehate_words.didnt], axis=1)
data_lovehate.columns = ['love', 'hate']
data_lovehate


In [None]:
# Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [12, 6]
fig, ax = plt.subplots()

for i, category in enumerate(data_lovehate.index):
    x = data_lovehate.love.loc[category]
    y = data_lovehate.hate.loc[category]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_labels[i], fontsize=10)
    plt.xlim(-5, 85) 

plt.title('Number of Love/Hate Words Used in Reviews', fontsize=20)
plt.xlabel('Number of love/best', fontsize=15)
plt.ylabel('Number of dont/didnt/worst', fontsize=15)

xpoints = ypoints = plt.xlim()
plt.plot(xpoints, ypoints, linestyle='--', color='k', lw=3, scalex=False, scaley=False)

plt.show()



## Exercise 1

What other word counts do you think would be interesting to compare instead of the love/hate words? Create a scatter plot comparing them.


In [None]:
# Write your code here 
# If you get an error that you don't have 'data' defined, please uncomment below
# import pandas as pd
# data_dtm = pd.read_pickle('./dtm.pkl')
# data = data_dtm.transpose()


# Let's isolate some words


In [None]:
# Let's create a scatter plot of your findings



# 4. Sentiment Analysis

We will examine whether a sentiment analysis method is useful for distinguishing positive, neutral, and negative reviews.

In this lab, we will use **TextBlob** for sentiment analysis.

1. **TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary depending on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.
2. **Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
   * **Polarity**: How positive or negative a word is. -1 is very negative. +1 is very positive.
   * **Subjectivity**: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.

For more information on how TextBlob implements its sentiment function, see [this write-up](https://planspace.org/20150607-textblob_sentiment/).
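The idea of averaging word-level labels can be sketched with a toy lexicon. Everything below is hypothetical: the scores are invented and are not TextBlob's actual lexicon values, and `toy_polarity` is a stand-in for illustration only. TextBlob's real implementation does more than a plain average (the write-up linked above covers the details).

```python
# Hypothetical polarity scores; these are NOT TextBlob's actual lexicon values
toy_lexicon = {"great": 0.8, "friendly": 0.5, "slow": -0.3, "terrible": -1.0}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon; 0.0 if none are found."""
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# One positive and one negative word partly cancel out
print(toy_polarity("great food but terrible service"))
# No lexicon words at all, so the text is scored as neutral
print(toy_polarity("the menu is printed on paper"))
```

A mixed review ends up near the middle of the scale, which is worth remembering when interpreting the averaged scores of long, combined texts.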

Let's take a look at the sentiment of the various categories. 


In [None]:
# If you start from Section 4, please uncomment below code
# import pandas as pd

# We'll start by reading in the corpus, which preserves word order
data = pd.read_pickle('./corpus.pkl')
data

In [None]:
# install textblob
# !pip install textblob

In [None]:
# Create quick lambda functions to find the polarity and subjectivity of each category
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['text'].apply(pol)
data['subjectivity'] = data['text'].apply(sub)
data


In [None]:
# Let's plot the results
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

for index, category in enumerate(data.index):
    x = data.polarity.loc[category]
    y = data.subjectivity.loc[category]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['category'].loc[category], fontsize=10)  # look up by label, not by position
#     plt.xlim(-.01, .12) 
    plt.xlim(-.1, .55) 
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()


### Exercise 2

Let's compare the sentiment of the reviews as computed by TextBlob and VADER.

You will need to apply VADER to analyze the sentiment of the reviews.

1. Define a function that returns the compound score for a given sentence
2. Apply that function to compute the VADER score; name the new column 'vader_sent'
3. Draw a scatter plot to compare Vader sentiment score (x-axis) with TextBlob polarity score (y-axis)
4. What's your conclusion?


In [None]:
!pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 

In [None]:
# 1. Write a python function that returns VADER's 'compound score' of a sentence


In [None]:
# 2. Apply vader_compound_score on data. New column name would be 'vader_sent'


In [None]:
# 3. Let's draw scatter plot



4. Conclusion



