In [1]:
import seaborn as sns
sns.set()

In [2]:
from static_grader import grader

# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data.  This miniproject is a basic introduction to handling it.  Our objective is to extract the sentiment (positive or negative) of review text and gain insight from it, using Yelp review data.

## Metrics and scoring

The first two questions task you to build models, of increasing complexity, to predict the rating of a review from its text. The grader uses a test set to evaluate your model's performance against our reference solution, using the $R^2$ score. It **is** possible to receive a score greater than one, indicating that you've beaten our reference model. We compare our model's score on a test set to your score on the same test set. See how high you can go!

The final two questions ask only for the result of a calculation, and your results will be compared directly to those of a reference solution.

## Download and parse the data


To start, let's download the data set from Amazon S3:

In [3]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_review_reduced.json.gz'

The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in `json` package has a `loads` function that converts a JSON string into a Python dictionary. We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` package, but is *substantially* faster (at the cost of non-robust handling of malformed JSON). We will use that inside a list comprehension to get a list of dictionaries:

In [4]:
import gzip
import ujson as json

with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]

The scikit-learn API requires that we keep labels (in this case, the star ratings) and features in separate data structures.

In [45]:
stars = [row['stars'] for row in data]

# Questions


## Question 1: bag_of_words_model

Build a linear model predicting the star rating based on the text reviews. Apply the bag-of-words model using the [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to produce a feature matrix giving the counts of each word in each review.

**Hints**:
1. You will need to extract the review text from the raw input data, a list of dictionaries. You can take an approach similar to the one you took in the `ml` miniproject, first converting the data into a pandas data frame and then using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer), or you can build a custom transformer to extract the text. Either way, remember that the `transform` method of `CountVectorizer` expects a 1D array of text.

1. Try choosing different values for `min_df` (minimum document frequency cutoff) and `max_df` in `CountVectorizer`. Setting `min_df` to zero admits rare words which might only appear once in the entire corpus.  This is both prone to overfitting and makes your data unmanageably large.  Don't forget to use cross-validation to select the right value.

1. Try using [`LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) or [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge). There is also [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridge#sklearn.linear_model.RidgeCV), which has built-in leave-one-out cross-validation. If the memory footprint is too big, try switching to [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor). Don't forget to search for the optimal value of the regularization parameter. How do the regularization parameter `alpha` and the values of `min_df` and `max_df` from `CountVectorizer` change the answer?

1. You will likely pick up several hyperparameters between the vectorization step and the regularization of the predictor. While it is more strictly correct to do a grid search over all of them at once, this can take a long time. Quite often, doing a grid search over a single hyperparameter at a time can produce similar results.  Alternatively, the grid search may be done over a smaller subset of the data, as long as it is representative of the whole.

1. Finally, assemble a pipeline that will transform the data from a list of dictionaries all the way to predictions.  This will allow you to submit the model's `predict` method to the grader for scoring, since the test set used by the grader is a list of dictionaries.
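The hints about cross-validating `min_df` and `alpha` can be sketched on a tiny corpus. This is a minimal illustration, not the reference solution: `toy_texts`, `toy_stars`, and the grid values are made up, and the real review data would call for a much wider range of `min_df` values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the extracted review text.
toy_texts = [
    "great food great service", "terrible food", "good place", "bad service",
    "great place", "terrible service bad food", "good food", "bad place",
] * 3
toy_stars = [5, 1, 4, 2, 5, 1, 4, 2] * 3

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("regressor", Ridge()),
])

# Step names are joined to parameter names with a double underscore.
param_grid = {
    "vectorizer__min_df": [1, 2],
    "regressor__alpha": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation over every combination in the grid, scored by R^2.
search = GridSearchCV(toy_pipeline, param_grid, cv=3, scoring="r2")
search.fit(toy_texts, toy_stars)

print(search.best_params_)
```

Searching both hyperparameters jointly is the "strictly correct" option mentioned above; replacing `param_grid` with a single key gives the cheaper one-at-a-time search.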

In [7]:
data[1]

{'votes': {'funny': 6, 'useful': 0, 'cool': 0},
 'user_id': 'ZYaumz29bl9qHpu-KVtMGA',
 'review_id': 'ow1c4Lcl3ObWxDC2yurwjQ',
 'stars': 4,
 'date': '2009-05-04',
 'text': "If you like lot lizards, you'll love the Pine Cone!",
 'type': 'review',
 'business_id': 'JwUE5GmEO-sH1FuwJgKBlQ'}

In [5]:
#Extract review text from input data

from sklearn.base import TransformerMixin

class DictListValueExtractor(TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        extracted_values = []
        for item in X:
            if self.key in item:
                extracted_values.append(item[self.key])
        return extracted_values
    
# Create the transformer instance with the key you want to extract
transformer = DictListValueExtractor(key='text')

# Apply the transformation to the data
result = transformer.transform(data)

In [8]:
len(result)

253272

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

#cv = CountVectorizer(min_df=17500, max_df=0.8)
#cv.fit(result)

#print(cv.get_feature_names_out())

In [12]:
# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, stars, test_size=0.2, random_state=42)

# Step 3: Create the pipeline with DictListValueExtractor, CountVectorizer, and Ridge
pipeline = Pipeline([
    ('extractor', DictListValueExtractor(key='text')),            # Extract the 'text' value from each review dictionary
    ('vectorizer', CountVectorizer(min_df=8000, max_df=0.8)),     # Convert text to a bag-of-words representation
    ('regressor', Ridge(alpha=0.1))                               # Ridge regression (L2-regularized linear model)
])

# Step 4: Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [None]:
bag_of_words_model = ...

bag_of_words_model.fit(data, stars)

In [13]:
grader.score('nlp__bag_of_words_model', pipeline.predict)

Your score: 1.0248


## Question 2: bigram_model

In a bigram model, we'll consider both single words and pairs of consecutive words that appear. This is going to be a much higher-dimensional problem so you should be careful about overfitting. You should also use a vectorizer that applies some sort of normalization, e.g., the `TfidfVectorizer` or a word count vectorizer combined with `TfidfTransformer`.

Sometimes, reducing the dimension can be useful. If you're using the `TfidfVectorizer`, you can change the `max_features` hyperparameter to reduce the size of the resulting vocabulary. For `HashingVectorizer`, you can adjust the size of the feature matrix through `n_features`.
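As a rough illustration of the two size controls just mentioned, the sketch below fits both vectorizers on a three-sentence corpus. The corpus and the particular values of `max_features` and `n_features` are arbitrary choices for demonstration only.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Hypothetical toy corpus.
toy_texts = [
    "the pizza was great",
    "the service was slow",
    "great pizza and great service",
]

# max_features keeps only the most frequent terms (unigrams and bigrams here).
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5)
X_tfidf = tfidf.fit_transform(toy_texts)

# HashingVectorizer is stateless: n_features fixes the number of columns up front.
hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**4)
X_hash = hasher.transform(toy_texts)

print(X_tfidf.shape)
print(X_hash.shape)
```

In both cases the number of columns is capped regardless of how many distinct unigrams and bigrams the corpus contains.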

**A side note on multi-stage model evaluation:** When your model consists of a pipeline with several stages, it can be worthwhile to evaluate which parts of the pipeline have the greatest impact on the overall accuracy (or other metric) of the model. This allows you to focus your efforts on improving the important algorithms, and leaving the rest "good enough".

One way to accomplish this is through ceiling analysis, which can be useful when you have a training set with ground truth values at each stage. Let's say you're training a model to extract image captions from websites and return a list of names that were in the caption. Your overall accuracy at some point reaches 70%. You can try manually giving the model what you know are the correct image captions from the training set, and see how the accuracy improves (maybe up to 75%). Alternatively, giving the model the perfect name parsing for each caption increases accuracy to 90%. This indicates that the name parsing is a much more promising target for further work, and the caption extraction is a relatively smaller factor in the overall performance.

If you don't know the right answers at different stages of the pipeline, you can still evaluate how important different parts of the model are to its performance by changing or removing certain steps while keeping everything else constant. You might try this kind of analysis to determine how important adding stopwords and stemming to your NLP model actually is, and how that importance changes with parameters like the number of features.
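One way to run this kind of comparison is to cross-validate two pipelines that differ in a single step. The sketch below toggles stop-word removal and holds everything else fixed; the data, the helper name `mean_cv_score`, and the hyperparameters are placeholders rather than recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for review text and star ratings.
toy_texts = [
    "the food was great", "the food was terrible", "a very good place",
    "a very bad place", "great service", "terrible service",
] * 3
toy_stars = [5, 1, 4, 2, 5, 1] * 3


def mean_cv_score(stop_words):
    """Mean cross-validated R^2 for a pipeline that differs only in stop_words."""
    model = Pipeline([
        ("vectorizer", TfidfVectorizer(stop_words=stop_words)),
        ("regressor", Ridge(alpha=1.0)),
    ])
    scores = cross_val_score(model, toy_texts, toy_stars, cv=3, scoring="r2")
    return scores.mean()


scores = {
    "keep stop words": mean_cv_score(None),
    "remove stop words": mean_cv_score("english"),
}
for label, score in scores.items():
    print(f"{label}: {score:.3f}")
```

The difference between the two scores is a rough estimate of how much that one step matters for this model and this data.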

In [7]:
#Extract review text from input data
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin

class DictListValueExtractor(TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        extracted_values = []
        for item in X:
            if self.key in item:
                extracted_values.append(item[self.key])
        return extracted_values
    
# Create the transformer instance with the key you want to extract
transformer = DictListValueExtractor(key='text')

# Apply the transformation to the data
result = transformer.transform(data)

In [8]:
# Step 2: Split the data into training and testing sets
#X_train, X_test, y_train, y_test = train_test_split(result, stars, test_size=0.2, random_state=42)

# Step 3: Create the pipeline with DictListValueExtractor, CountVectorizer (bigrams), TfidfTransformer, and Ridge
pipeline = Pipeline([
    ('extractor', DictListValueExtractor(key='text')),  # Extract the 'text' value from each review dictionary
    # Note: ngram_range=(2, 2) keeps bigrams only; ngram_range=(1, 2) would also include single words
    ('vectorizer', CountVectorizer(min_df=200, max_df=0.8, ngram_range=(2, 2))),  # Bag-of-words counts over bigrams
    ('tfidf', TfidfTransformer()),                       # Apply TF-IDF weighting
    ('regressor', Ridge(alpha=1.0))                      # Ridge regression (L2-regularized linear model)
])

# Step 4: Fit the pipeline on the training data
pipeline.fit(data, stars)

In [None]:
bigram_model = ...

bigram_model.fit(data, stars)

In [9]:
grader.score('nlp__bigram_model', pipeline.predict)

Your score: 1.0268


## Question 3: word_polarity

Let's consider a different approach and try to derive some insight from our analysis.  

We want to determine the most "polarizing words" in the corpus of reviews.  In other words, we want to identify words that strongly signal a review is either positive or negative.  For example, we understand that a word like "terrible" will most likely appear in negative rather than positive reviews.  

During training, the [naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#) calculates probabilities such as $Pr(\textrm{terrible}\ |\ \textrm{negative}),$ the probability that the word "terrible" appears in the review text, given that the review is negative.  Using these probabilities, we can define a **polarity score** for each word $w$,

$$\textrm{polarity}(w) = \log\left(\frac{Pr(w\ |\ \textrm{positive})}{Pr(w\ |\ \textrm{negative})}\right).$$

Polarity analysis is an example where a simpler model (naive Bayes) offers more explicability than more complicated models.  Aside from this, naive Bayes models are easy to train, the training process is parallelizable, and these models lend themselves well to online learning.  Given enough training data, naive Bayes models have performed well in NLP applications such as spam filtering.  

For this problem, you are asked to determine the top 25 most positive polar words and the 25 most negative polar words.  For this analysis, you should:

1.  **Filter** the collection of reviews you were using above to **only keep** the one-star and five-star reviews. Since these are the "most polar" reviews, it should give us the most polarizing words.   
1.  Use the naive Bayes model, `MultinomialNB`.  
1.  Use TF-IDF weighting.
1.  Remove stop words.
1.  As mentioned, generate a (Python) list with most positive (25 words) and most negative (25 words) polar words.  

A naive Bayes model (after training) stores the log of the probabilities in an attribute of the model.  It is a `numpy` array of shape (number of classes, number of features).  You will need the mapping from feature indices to words to find the most polarizing words.  

In [100]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from sklearn.naive_bayes import MultinomialNB

class DictListValueExtractor(TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        extracted_values = []
        for item in X:
            if self.key in item:
                extracted_values.append(item[self.key])
        return extracted_values

In [118]:
# Filter the dictionaries based on 'stars' before applying the pipeline, also filter stars vector so the length matches
filtered_results = [item for item, star in zip(data, stars) if star in [1, 5]]

filtered_stars = [row['stars'] for row in data if row['stars'] in [1, 5]]

# Create the pipeline
pipeline = Pipeline([
    ('extractor', DictListValueExtractor(key='text')),  # Extract the 'text' value from each review dictionary
    ('vectorizer', TfidfVectorizer(stop_words='english')),  
    ('regressor', MultinomialNB())                      
])

# Fit the pipeline on the training data
pipeline.fit(filtered_results, filtered_stars)

In [119]:
#Step 2: Access the MultinomialNB classifier from the pipeline using its name
nb_classifier = pipeline.named_steps['regressor']
log_prob = nb_classifier.feature_log_prob_

In [121]:
# Step 3: Access the feature names from the TfidfVectorizer
feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()

In [123]:
# Step 5: Calculate the polarity scores for each word
polarities = log_prob[1] - log_prob[0]

In [124]:
# Step 6: Get the indices of the words with the highest and lowest polarity scores
top_positive_words_indices = polarities.argsort()[::-1][:25]  # Words with highest positive polarity
top_negative_words_indices = polarities.argsort()[:25]        # Words with lowest (most negative) polarity

# Step 7: Retrieve the actual words from the feature names
top_positive_words = [feature_names[idx] for idx in top_positive_words_indices]
top_negative_words = [feature_names[idx] for idx in top_negative_words_indices]

In [125]:
merged_top_words = top_positive_words + top_negative_words

In [21]:
# We're only keeping the one and five star reviews
grader.check(len(filtered_results) == 116576)

True

In [None]:
polar_words = ['perfection'] * 50

In [126]:
grader.score('nlp__word_polarity', merged_top_words)

Your score: 1.0000


## Question 4: food_bigrams

Look over all reviews of restaurants.  You can determine which businesses are restaurants by looking in the `yelp_train_academic_dataset_business.json.gz` file from the ml project or downloaded below.

In [6]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_business.json.gz'

In [7]:
with gzip.open('yelp_train_academic_dataset_business.json.gz') as f:
    business_data = [json.loads(line) for line in f]

Each row of this file corresponds to a single business.  The `categories` key gives a list of categories for each; take all where "Restaurants" appears.

In [8]:
restaurant_ids = [d for d in business_data if 'Restaurants' in d.get('categories', [])]

In [8]:
# Look at the categories to check for spelling and capitalization
grader.check(len(restaurant_ids) == 12876)

True

The "business_id" here is the same as in the review data.  Use this to extract the review text for all reviews of restaurants.

In [9]:
# Use a set so that each membership test is O(1) rather than a scan of the whole list
business_ids = {d.get('business_id') for d in restaurant_ids}
restaurant_reviews = [d for d in data if d.get('business_id') in business_ids]

In [10]:
# Just reviews of restaurants
# restaurant_ids is helpful here
grader.check(len(restaurant_reviews) == 143361)

True

In [11]:
restaurant_reviews[0]

{'votes': {'funny': 6, 'useful': 0, 'cool': 0},
 'user_id': 'ZYaumz29bl9qHpu-KVtMGA',
 'review_id': 'ow1c4Lcl3ObWxDC2yurwjQ',
 'stars': 4,
 'date': '2009-05-04',
 'text': "If you like lot lizards, you'll love the Pine Cone!",
 'type': 'review',
 'business_id': 'JwUE5GmEO-sH1FuwJgKBlQ'}

We want to find collocations --- that is, bigrams that are "special" and appear more often than you'd expect from chance. We can think of the corpus as defining an empirical distribution over all *n*-grams.  We can find word pairs that are unlikely to occur consecutively based on the underlying probability of their words. Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1) p(w_2)} $$

is high.  Return the top 100 (mostly food) bigrams with this statistic with the 'right' prior factor (see below).

Estimating the probabilities is simply a matter of counting, and there are a number of approaches that will work.  One is to use one of the tokenizers to count up how many times each word and each bigram appears in each review, and then sum those up over all reviews.  You might want to know that the `CountVectorizer` has a `.get_feature_names_out()` method which gives the string associated with each column.  (Question for thought: Why doesn't the `HashingVectorizer` have a similar method?)

*Questions:* This statistic is a ratio and problematic when the denominator is small.  We can fix this by applying Bayesian smoothing to $p(w)$ (i.e. mixing the empirical distribution with the uniform distribution over the vocabulary).

1. How does changing this smoothing parameter affect the word pairs you get qualitatively?

2. We can interpret the smoothing parameter as adding a constant number of occurrences of each word to our distribution.  Does this help you set a reasonable value for this 'prior factor'?

3. For fun: also check out [Amazon's Statistically Improbable Phrases](http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases).
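To get a feel for the first two questions, here is a small numeric sketch of the smoothed statistic. Every count in it (corpus size, vocabulary size, the two example bigrams) is invented, and smoothing is applied to the word probabilities only, which is one possible reading of the description above.

```python
def smoothed_ratio(bigram_count, count_1, count_2, prior,
                   total_words=100_000, total_bigrams=100_000, vocab_size=5_000):
    """p(w1 w2) / (p(w1) p(w2)) with `prior` pseudo-occurrences added to each word.

    The default corpus sizes are hypothetical values chosen for illustration.
    """
    p_bigram = bigram_count / total_bigrams
    # Adding `prior` to every word in the vocabulary grows the total accordingly.
    denominator = total_words + prior * vocab_size
    p_1 = (count_1 + prior) / denominator
    p_2 = (count_2 + prior) / denominator
    return p_bigram / (p_1 * p_2)


# A one-off pair of words that each appear once, always together.
rare_unsmoothed = smoothed_ratio(1, 1, 1, prior=0)
rare_smoothed = smoothed_ratio(1, 1, 1, prior=30)

# A frequent pair: each word appears 500 times, 400 of them together.
common_unsmoothed = smoothed_ratio(400, 500, 500, prior=0)
common_smoothed = smoothed_ratio(400, 500, 500, prior=30)

print(rare_unsmoothed > common_unsmoothed)
print(rare_smoothed > common_smoothed)
```

With no prior the one-off pair dominates; with 30 pseudo-occurrences per word the frequent pair ranks higher, because the prior swamps words that have very few real occurrences.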

*Implementation note:*
As you adjust the size of the Bayesian smoothing parameter, you will notice first nonsense phrases being removed and then legitimate bigrams being removed, leaving you with only generic bigrams.  The goal is to find a value of the smoothing parameter between these two transitions.

The reference solution is not an aggressive filterer: it errs on the side of leaving apparently nonsensical words. On further consideration, many of these are actually somewhat meaningful. The smoothing parameter chosen in the reference solution is equivalent to giving each word 30 previous appearances prior to considering this data.  This was chosen by generating a list of bigrams for a range of smoothing parameters and seeing how many of the bigrams were shared between neighboring values.  When the shared fraction reached 95%, we judged the solution to have converged.

There are a few reviews that include the same nonsense strings multiple times.  To keep these from showing up in our results, we set `min_df=10`, to ensure that a bigram occurs in at least 10 reviews before we consider it.

In [11]:
from sklearn.base import TransformerMixin

class DictListValueExtractor(TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        extracted_values = []
        for item in X:
            if self.key in item:
                extracted_values.append(item[self.key])
        return extracted_values
    
# Create the transformer instance with the key you want to extract
transformer = DictListValueExtractor(key='text')

# Apply the transformation to the data
result = transformer.transform(restaurant_reviews)

In [12]:
result[0]

"If you like lot lizards, you'll love the Pine Cone!"

In [74]:
from sklearn.feature_extraction.text import CountVectorizer

def calculate_bigram_statistic_with_smoothing(reviews):
    # Step 1: Create a CountVectorizer instance for bigrams
    vectorizer = CountVectorizer(min_df=10, ngram_range=(2, 2))

    # Step 2: Fit the CountVectorizer with customer reviews
    X = vectorizer.fit_transform(reviews)

    # Step 3: Get the names of the features (bigrams) in the vocabulary
    feature_names = vectorizer.get_feature_names_out()

    # Step 4: Calculate the frequencies of bigram and individual words
    bigram_frequencies = X.sum(axis=0).A1  # Get the sum of occurrences for each bigram

    # Initialize dictionaries to store the frequencies of individual words
    word1_frequencies = {}
    word2_frequencies = {}

    for feature_name, frequency in zip(feature_names, bigram_frequencies):
        word1, word2 = feature_name.split(' ')
        word1_frequencies[word1] = word1_frequencies.get(word1, 0) + frequency
        word2_frequencies[word2] = word2_frequencies.get(word2, 0) + frequency

    # Step 5: Calculate the statistic p(w1 w2) / (p(w1) p(w2)) for each bigram with smoothing
    alpha = 30  # Smoothing parameter (adjust as needed)
    bigram_statistics_with_smoothing = {}

    for feature_name, frequency in zip(feature_names, bigram_frequencies):
        word1, word2 = feature_name.split(' ')
        bigram_frequency_with_smoothing = frequency
        word1_frequency_with_smoothing = word1_frequencies[word1] + alpha
        word2_frequency_with_smoothing = word2_frequencies[word2] + alpha

        statistic_with_smoothing = bigram_frequency_with_smoothing / (word1_frequency_with_smoothing * word2_frequency_with_smoothing)
        bigram_statistics_with_smoothing[feature_name] = statistic_with_smoothing

    # Step 6: Sort the bigrams by statistic with smoothing in descending order
    sorted_bigrams_with_smoothing = sorted(bigram_statistics_with_smoothing.items(), key=lambda x: x[1], reverse=True)

    return sorted_bigrams_with_smoothing[:100]  # Return the top 100 most significant bigrams with smoothing

In [72]:
top_bigrams = calculate_bigram_statistic_with_smoothing(result)

first_items_list = [item[0] for item in top_bigrams]
print(first_items_list)

['bradley ogden', 'cinema suites', 'identity crisis', 'va bene', 'lomo saltado', 'lechon kawali', 'scantily clad', 'chino bandido', 'kao tod', 'woody allen', 'cabo wabo', 'ropa vieja', 'artery clogging', 'adam richman', 'bi bim', 'emeril lagasse', 'gulab jamun', 'harry potter', 'knick knacks', 'vice versa', 'van buren', 'cien agaves', 'rula bula', 'ama ebi', 'cheeburger cheeburger', 'dean martin', 'feng shui', 'molecular gastronomy', 'si senor', 'womp womp', 'miller lite', 'rick moonen', 'itty bitty', 'lindo michoacan', 'malai kofta', 'mt everest', 'rogan josh', 'yada yada', 'casey moore', 'toby keith', 'fo sho', 'patatas bravas', 'pura vida', 'wal mart', 'cochinita pibil', 'valle luna', 'aloo gobi', 'dom demarco', 'haricot vert', 'krispy kreme', 'nanay gloria', 'tammie coe', 'tutti santi', 'woon sen', 'celiac disease', 'highway robbery', 'lloyd wright', 'nuoc mam', 'riff raff', 'yadda yadda', 'alain ducasse', 'dueling pianos', 'khai hoan', 'log cabin', 'nove italiano', 'shirley temple

In [None]:
top100 = ['haricot vert'] * 100

In [73]:
grader.score('nlp__food_bigrams', first_items_list)

Your score: 0.9100


*Copyright &copy; 2022 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*