**Preparation:** Run the following code (select the cell and press `CTRL+Enter` or `CMD+Enter`) so that long lines of output wrap and are easier to read.

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

Whenever you want to add extra cells, you can do that by clicking `+Code` in the upper-left corner, or by pressing `Esc` (stop editing the cell) and then `A` (add a cell above) or `B` (add a cell below).

# BOOK REVIEWS

Let's start with consumer reviews. We'll look at book reviews. The data set we'll use is a subset of the Amazon review data prepared by Jianmo Ni (https://nijianmo.github.io/amazon/index.html).

We can ask, for example, what terms are indicative of the review being positive and try to build a model for predicting whether a review is positive or negative.

First, let's download the data.

1. **Download the file using the following code**

In [None]:
# The following code downloads the file from GitHub:
!wget https://raw.githubusercontent.com/amjassem/DxU-Intro-to-Text-Analytics/main/Data/reviews.json

As you can see from the file extension, this is a **JSON** file, a popular format for storing data. Here, each line stores the data for a single review.

*But there's more to a review than just the text*. Each review has **attributes** (text, rating, date), which have **values**.

2. **Let's read the first line and see what it looks like:**

In [None]:
# Read a single line from a file
with open('reviews.json', 'r') as f:
  print(f.readline())

Let's focus on the text of the review. For this, we'll use the `json` package. You can then easily see the values for each attribute.

3. **Run the code below:**

In [None]:
import json

# Read a line
with open('reviews.json', 'r') as f:
  line = f.readline()

# Parse it using json
review = json.loads(line)

# Read the rating
rating = review['overall']
print('Rating:', rating)

# Read the text
text = review['reviewText']
print(text)

# Part 1: Natural Language Processing

## 1.A Tokenization

In order to analyze the text using statistical methods we need to **pre-process** it.

Many packages can do all the following steps at once (and actually compute it quite a bit faster) but it's nice to be able to do it yourself in case you want to do something non-standard.

First of all, we'll **tokenize** the document. We want to split the long string into the smallest meaningful bits - tokens - and discard everything that is not informative.

A lot of text manipulation is based on *regular expressions* (regex) - finding particular characters or sequences of characters (a **pattern**) in the text. The `re` package provides all the functionality for this.

You can find basics of regex [here](https://cheatography.com/davechild/cheat-sheets/regular-expressions/). For example:

* `abc` will look for "abc" in the text,
* `[abc]` will look for "a", "b" or "c",
* `[^abc]` will look for any character that is **not** "a", "b" or "c",
* `[a-j]` will look for a letter between "a" and "j", similar for `[K-Z]` or `[4-7]`
* `a*` will look for 0 or more "a"s
* `a+` will look for 1 or more "a"s
* `\s` is used for white space, `\n` for a new line
* `.` matches any character except a new line (`\n`)
* You can combine the above (and more) to make arbitrarily complex patterns, e.g. `[^\s]+@[^\s]+\.[^\s]+` would find e-mail addresses.
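
To make the patterns above concrete, here is a minimal sketch that tries a few of them with `re.findall`. The sample string is made up for illustration, and the e-mail pattern is only a rough matcher, not a complete validator.

```python
import re

# Made-up sample text (not taken from the review data)
sample = "Contact jane.doe@example.com about the 3 books ordered on 2020-05-17"

# Literal pattern: finds every occurrence of "book"
literal = re.findall(r'book', sample)
print(literal)

# Character range combined with "+": runs of one or more digits
digits = re.findall(r'[0-9]+', sample)
print(digits)

# Combined pattern: non-space characters around "@" and "."
emails = re.findall(r'[^\s]+@[^\s]+\.[^\s]+', sample)
print(emails)
```

Writing the patterns as raw strings (`r'...'`) keeps Python from interpreting backslashes before the regex engine sees them.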

We'll start by **splitting** the text using `re.split(pattern, string)`.

4. **Test a couple of patterns to split on.**

In [None]:
import re
# Splits the text according to some pattern
print(re.split('', text))

If we're interested only in keeping words, it might make sense to split on anything that is not an alphabetic character.

5. **Split the text into only words**

In [None]:
# Splits the text on anything that is not letters
tokens = re.split('[^A-Za-z]+', text)
print(tokens)

If you want to remove a pattern from a text, you can also use `re.sub(pattern, '', string)`, which replaces every match of the pattern with an empty string.
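
As a small illustration of `re.sub`, the sketch below removes non-letter characters from a made-up string, and then shows the common variation of replacing them with a space so that word boundaries survive.

```python
import re

# Made-up sample text
sample = "Loved it!!! 10/10, would read again :)"

# Replace every run of non-letter characters with an empty string
letters_only = re.sub(r'[^A-Za-z]+', '', sample)
print(letters_only)

# Replace the same runs with a single space to keep the words apart
spaced = re.sub(r'[^A-Za-z]+', ' ', sample).strip()
print(spaced)
```

Which of the two is appropriate depends on whether you still need to tell the words apart afterwards.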

## 1.B Cleaning the tokens - stopwords, stemming, lemmatization

We've split the string into tokens, but those tokens are still quite messy. 

* Is it really informative to have "King" and "king" be different terms? We might want to just have everything in **lower case**.
* How informative is "the" really? Such terms are called **stopwords** and we might want to exclude them.
* Is the distinction between "order" and "orders" one we really need? We might **stem** or **lemmatize** the tokens.

The exact approach to pre-processing will always depend on what features you need to consider in the text!

For now, let's start by getting a list of stopwords. For this, we'll use the `nltk` package, useful in all kinds of NLP tasks.

6. **Run the following code to get the list of stopwords:**

In [None]:
import nltk
# Downloads and loads stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
# We only need the english ones
stops = stopwords.words('english')
print(stops)

Let's move on to **stemming** and **lemmatization**. Once again, we can use `nltk` to do either, although for lemmatizing we do need to first download extra resources. To lemmatize a word we also need to know what **part-of-speech** (POS) it is.

7. **Run the code below to prepare the stemmer and the lemmatizer:**

In [None]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Initialize the stemmer and the lemmatizer
stemmer =  PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define a function to get the POS of a word (in correct format)
def pos(word):
    # There's different formats for POS
    # We want the one expected by the lemmatizer
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
                
    return tag_dict.get(tag, wordnet.NOUN)

8. **Check the stemming and lemmatization results for the following:**
  * orders
  * ordering
  * orderly
  * wrote
  * nothing


In [None]:
# Choose a word
word = ''
print('Word: ' + word)

# Find the stem of a word
stem = stemmer.stem(word)
print('Stem: ' + stem)

# Find the lemma of a word (given its POS)
lemma = lemmatizer.lemmatize(word, pos(word))
print('Lemma: ' + lemma)

Let's prepare a function to tokenize the documents and clean the tokens all at once.

9. **Which parts of the code below achieve the following tasks?**
* Split a text into tokens
* Convert to lower case
* Exclude stopwords
* Exclude anything shorter than 3 characters
* Lemmatize the token



In [None]:
def CleanToken(token):
  """Converts a token to lower-case and lemmatizes it"""

  token = token.lower() # A

  token = lemmatizer.lemmatize(token, pos(token)) # B
  return token

def CleanDocument(text):
  """Splits the document, cleans the tokens, and removes uninformative ones"""
  
  tokens = re.split('[^A-Za-z]+', text) # C

  tokens = [CleanToken(token) for token in tokens] # D

  tokens = [token for token in tokens if token not in stops and len(token) >= 3] # E
  return tokens

10. **Let's see whether it works:**

In [None]:
# Clean a document and print it
print(CleanDocument(text))

Now that we have a function that tokenizes and cleans a text, let's run it on all the reviews.

We need to read each review JSON (line of the file), find the text and tokenize it.

*If this takes too long, you can run the next code cell to download the pre-processed data:*

11. a) **Run the following to pre-process the whole corpus:**

In [None]:
corpus = []
ratings = []


# The with-block closes the file automatically
with open('reviews.json', 'r') as f:
  for line in f:

    # Parse line and get the text review
    try:
      review = json.loads(line)
      text = review['reviewText']
    except (ValueError, KeyError):
      # Not all reviews have text, we just skip those
      continue

    # Tokenize and clean
    doc = CleanDocument(text)

    # Add to the corpus
    corpus.append(doc)

    # Record the rating
    rating = review['overall']
    ratings.append(rating)

nDoc = len(corpus)

print('#Docs:', nDoc, ', #Ratings:', len(ratings))

11. b) **Run the following to download already pre-processed data**

In [None]:
# The following code downloads the pre-processed corpus from GitHub:
!wget https://raw.githubusercontent.com/amjassem/DxU-Intro-to-Text-Analytics/main/Data/reviews_corpus.pickle

# The file is serialized, this unpacks it
import pickle
with open('reviews_corpus.pickle', 'rb') as f:
  corpus = pickle.load(f)

ratings = []

# The with-block closes the file automatically
with open('reviews.json', 'r') as f:
  for line in f:

    # Parse line and get the text review
    try:
      review = json.loads(line)
      text = review['reviewText']
    except (ValueError, KeyError):
      # Not all reviews have text, we just skip those
      continue

    # Record the rating
    rating = review['overall']
    ratings.append(rating)

nDoc = len(corpus)

print('#Docs:', nDoc, ', #Ratings:', len(ratings))

## 1.C Vectorization

In the previous section we've defined the features of the text that interest us - what words are used.

Still, a list of tokens is not going to work for many of the methods we want to use. We might want to **vectorize** our data - express it as a vector of numeric values.

A common way to do that is to compute a TF-IDF:
* each element of the vector tells us how many times each term is used in a document
* but the values are scaled down by how common the term is across documents
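
The two bullets above can be sketched by hand on a toy corpus. This uses the textbook weighting, the raw count times `log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the term. The corpus and the helper name `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data)
docs = [
    ["great", "book", "great", "story"],
    ["boring", "book"],
    ["great", "characters"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Textbook TF-IDF: raw count times log(N / document frequency)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / df[term])
            for term, count in counts.items()}

weights = tfidf_vector(docs[0])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 3))
```

"story" appears once but only in this document, so it ends up with a higher weight than "book", which also appears once but is shared with another document. Library implementations often differ in the details: by default `sklearn`'s `TfidfVectorizer` uses a smoothed IDF and rescales each document vector to unit length, so its numbers will not match this sketch exactly.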

`sklearn` provides a lot of functionalities for statistics and machine learning, including text analytics. We'll use the `TfidfVectorizer`. It actually can perform tokenization as well, but since we've already done that we'll tell it to not do that (hence the use of `ident_func`).

You can see what other options you might set in the vectorizer by running `?TfidfVectorizer`. Perhaps it's worth using some of them, for example `min_df=`?

12. **Run the following to define a vectorizer object:**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Returns the documents unchanged
def ident_func(doc):
    return doc

# The object for vectorizing the corpus
vecTfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=ident_func,
    preprocessor=ident_func,
    token_pattern=None)

Now that we have our vectorizer object ready, we can transform the corpus.

13. **Run the following to get the TF-IDF:**
* See how the document is represented

In [None]:
# Calculates the TF-IDF weight of each term in each document
tfidf = vecTfidf.fit_transform(corpus)

# Save the vocabulary (an attribute of the vectorizer object)
vocabulary =  vecTfidf.get_feature_names_out()
print('Number of terms:', len(vocabulary))
print(vocabulary[0:100])

# Print the TF-IDF vector for the first document
print('TF-IDF of a doc')
d = 0
for i in tfidf[d].nonzero()[1]:
  print(tfidf[d, i], vocabulary[i])

# Part 2: Analysis

We have finished our NLP. Now we can use the pre-processed data in statistical analysis. Let's see if we can understand what words are indicative of a high/low rating.

## 2.A Regression

We can start by running a regression.

14. **Why would running an OLS regression not be a good idea?**
* The code below prints the size of our predictor matrix (# observations, # variables)


In [None]:
# Print the dimensions of the TF-IDF matrix
print(tfidf.shape)

It might be better to run a **penalized regression**. `sklearn` can do many of those:
* `LinearRegression()` for OLS - let's not use this one
* `Lasso()`
* `Ridge()`
* `ElasticNet()`
* and many more. Run `dir(linear_model)` to print all the methods in a module.

For penalized regression you might want to first standardize the data.
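
For reference, these are the objectives the first three estimators minimize, as given in the scikit-learn documentation (intercept omitted for brevity), with $n$ observations, coefficient vector $w$ and penalty strength $\alpha$:

$$
\begin{aligned}
\text{OLS:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
$$

The penalty acts on the size of the coefficients, and the size of a coefficient depends on the scale of its predictor. Without standardization, terms measured on a smaller scale would be penalized more heavily for the same effect.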


15. **Run one of those models (perhaps Lasso?) on our data:**

In [None]:
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

# Scale the data so that it has equal st. dev.
# (TF-IDF is a sparse matrix, which cannot be centered, hence with_mean=False)
scaler = StandardScaler(with_mean=False)
X = scaler.fit_transform(tfidf)

# Run a model (replace MODELNAME with e.g. Lasso, Ridge or ElasticNet)
clf = linear_model.MODELNAME()
clf.fit(X, ratings)

# Print the R^2 (on the same scaled data the model was fit on):
score = clf.score(X, ratings)
print('R^2:', score)

# Print the number of selected variables:
nSelected = (clf.coef_ != 0).sum()
print('Variables selected:', nSelected)

The results above might be poor (especially for Lasso). That's because we're running the model with the default value of `alpha` (you can see the default values by running e.g. `?linear_model.Lasso`).

Let's try to optimize it.

The following code:
* Splits the data-set into training and testing subsets.
* Runs a model for each value of `alpha`
* Calculates the score based on the testing set
* `tqdm` is a package for printing progress bars for loops. Useful when the code is taking a while.

16. **Run the following and decide which alpha is best (you can try a different model)**
* You might want to select a different value of alpha - for very small values the model might run very long


In [None]:
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

# Split the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, ratings,
                                                    test_size=0.33, random_state=1)

# Select values of alpha
alphas = [1e-4, 1e-3, 1e-2, 1e-1]

for alpha in tqdm(alphas):

  # Run a model (replace MODELNAME with e.g. Lasso)
  clf = linear_model.MODELNAME(alpha=alpha)
  clf.fit(X_train, y_train)

  # Compute the R^2 on the testing set:
  score = clf.score(X_test, y_test)
  # Print an empty line
  print()

  # Count the selected variables and print the results:
  nSelected = (clf.coef_ != 0).sum()
  print('alpha:', alpha, 'R^2:', score, 'Variables selected:', nSelected)

Based on an estimated model, we might want to print the terms with the strongest coefficients.

17. **Re-run the model on the whole data set**

In [None]:
clf = linear_model.Lasso(alpha=1e-4)
clf.fit(tfidf, ratings)

Let's find out which terms have the largest coefficients.

Coefficients of a model (`clf.coef_`) are stored as an array. We can use the `numpy` package to find the index of the elements with the smallest (most negative) values.

18. **Run the following to find the most negative terms.**
* You can find the most positive terms by searching `-clf.coef_` (biggest terms become smallest).
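
The `argsort` trick can be checked on a small example first. The coefficients and terms below are made up for illustration and do not come from the fitted model.

```python
import numpy as np

# Hypothetical coefficients and the matching vocabulary
coef = np.array([0.5, -1.2, 0.0, 2.0, -0.3])
terms = np.array(["good", "boring", "book", "excellent", "slow"])

# argsort returns the indices that would sort the array in ascending order
most_negative = np.argsort(coef)[:2]
print(terms[most_negative])

# Negating the array reverses the order, so the largest values come first
most_positive = np.argsort(-coef)[:2]
print(terms[most_positive])
```

Indexing the vocabulary with the selected indices returns the terms in the same order as their coefficients.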

In [None]:
import numpy as np

topN = 20
# Finds the index of topN terms with the smallest values
sel = np.argsort(clf.coef_)[0:topN]

for i in sel:
  # Print the term and its coefficient
  print(vocabulary[i], clf.coef_[i])

## 2.B Sentiment analysis

In the previous section we've built a regression model to identify which terms are indicative of a negative/positive review. Perhaps it's a bit of an overkill?

There are lexicons that can tell us whether a term is positive or negative, maybe they'd do better?

`nltk` provides a way to assign sentiment scores to texts.

19. **Run the following to prepare the analyzer:**

In [None]:
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
# Defines the object for sentiment analysis
sia = SentimentIntensityAnalyzer()

20. **Try out sentiment analysis on a couple of words. You can even use phrases or sentences!**

In [None]:
word = "amazing"
# Print the sentiment score
print(word, sia.polarity_scores(word))

Let's run the sentiment analysis for our documents. The analyzer expects a string, so we'll convert the tokens back into one long string.

21. **Run the following to calculate sentiment score for the documents:**

In [None]:
import numpy as np

# Initialize an array to store the scores
sentiments = np.zeros((nDoc, 2))

for d, doc in enumerate(tqdm(corpus)):
  # Get the sentiment scores for a document
  sentiment = sia.polarity_scores(" ".join(doc))
  # Record the positive and negative scores
  sentiments[d, 0] = sentiment['pos']
  sentiments[d, 1] = sentiment['neg']

Can we explain the rating based on the sentiment of the words? This time we only have two variables, so we can use OLS.

22. **Run the following model:**

In [None]:
# Run a model
clf = linear_model.LinearRegression()
clf.fit(sentiments, ratings)

# Print the R^2:
score = clf.score(sentiments, ratings)
print('R^2:', score)

Did we do better than with penalized regression?

* Actually, the analyzer does its own processing, and we've removed things such as "not". You could try running it on the original, unprocessed texts.
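
To see why removing "not" matters, here is a minimal sketch with a tiny stand-in stopword set and a simplified cleaning function (no lemmatization). Both `toy_stops` and `toy_clean` are hypothetical helpers written for this illustration.

```python
import re

# Tiny stand-in for a stopword list (hypothetical; the NLTK list is much longer)
toy_stops = {"this", "is", "not", "a", "the", "was"}

def toy_clean(text):
    """Lower-case, split on non-letters, drop stopwords and short tokens."""
    tokens = re.split(r'[^a-z]+', text.lower())
    return [t for t in tokens if t not in toy_stops and len(t) >= 3]

positive = toy_clean("This is a good book.")
negative = toy_clean("This is not a good book.")

print(positive)
print(negative)

# Both reviews reduce to the same tokens, so any scorer sees identical input
print(positive == negative)
```

After cleaning, the negated review is indistinguishable from the positive one, which is one reason a sentiment analyzer may do better on the raw text.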

# Part 3: Topic modelling

Sentiment analysis is a form of dimensionality reduction. Above, we've reduced our documents to two dimensions (positive and negative sentiment). There are also other dimensionality reduction techniques we can try.

One of them is **topic modelling**. Using LDA we can find the topics in an unsupervised fashion - we do not specify them, but instead we'll learn the best fitting topics from the data.

For the case of customer reviews the topics might refer to e.g. product categories, aspects of the products, or even sentiments.

Perhaps the usage of some of the topics in the review text is indicative of the rating?

LDA doesn't use TF-IDF, but the actual counts.

23. **Run the following to get document-terms count matrix**
* The size of the vocabulary impacts the estimation time. You might want to exclude rare words using `min_df=` to simplify the model.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare vectorizer
vecCount = CountVectorizer(
    analyzer='word',
    tokenizer=ident_func,
    preprocessor=ident_func,
    min_df=1,
    token_pattern=None)

# Transform the corpus into counts
counts = vecCount.fit_transform(corpus)

# Get the vocabulary
vocabulary = vecCount.get_feature_names_out()

print('Size of the vocabulary:', len(vocabulary))


Let's run the LDA model. Select a number of topics to estimate.

Usually you'd want to estimate the model until it converges (reaches a stable state). Unfortunately we don't have that much time, so let's run the model with just 10 iterations (it will still take some time, about 2-3 min.).

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# Define the number of topics to estimate
nTop = 10

# Run the model, record the proportions
lda = LatentDirichletAllocation(n_components=nTop, max_iter=10)
proportions = lda.fit_transform(counts)

We can now see the estimated topic distributions, and the proportion of each topic in each document.

24. **Run the following to see the top words in a given topic**

In [None]:
# Select a topic
k = 0
# Number of top terms
nTerm = 25

# Find the nTerm top words
sel = np.argsort(-lda.components_[k])[0:nTerm]
for i in sel:
  print(vocabulary[i], lda.components_[k, i])

The above way of printing topics is not very nice to look at.

Instead, we might want to use a wordcloud - an image showing the top terms, with each term's size scaled by its importance.

25. **Run the following to see the estimated topics**
* Do they have a clear interpretation?
* Looking at the topics, should we perhaps have added some additional steps in pre-processing?

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Number of top terms
nTerm = 150

# Creates subplots, one for each wordcloud
fig, axs = plt.subplots(nTop, figsize=(8, 4*nTop))

for k in range(nTop):
  
  # Find the nTerm top words
  sel = np.argsort(-lda.components_[k])[0:nTerm]
  
  # Get it in form of (term, frequency)
  topic = [(vocabulary[i], lda.components_[k, i]) for i in sel]
  topic = dict(topic)

  # Create the wordcloud
  wordcloud = WordCloud(prefer_horizontal=1).generate_from_frequencies(topic)

  # Plot it
  axs[k].imshow(wordcloud)
  axs[k].axis("off")

26. **Try running a regression of `ratings` on the document-topic mixing proportions (we saved them as `proportions`).**