# Word Algebra using Word2Vec 

Word Algebra is an AI technique 

This tutorial features an end-to-end natural language processing pipeline for word algebra, starting with **raw data** and running through **preparing**, **modeling**, **visualizing**, and **analyzing** the data. We'll touch on the following points:
1. Overview of the dataset
1. Text processing with spaCy
1. Automatic phrase modeling
1. Topic modeling with LDA
1. Visualizing topic models with pyLDAvis
1. Word vector models with word2vec
1. Visualizing word2vec with t-SNE

...and we will learn Python features and packages along the way.

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. The dataset is big but and most of the preprocessing and training tasks require several hours to complete.

**Note:** To run this code, you'll need to download your own copy of the Yelp dataset. Here's how to get the dataset:
1. Go to the Yelp dataset webpage [here](https://www.yelp.com/dataset_challenge/) and download the data.
1. The dataset downloads as a compressed .tgz file; uncompress it
1. Setup the variable data_directory below with the path where you copied the data to. 

## ⚠️ Troubleshooting Guide

**Before you start**, here are solutions to common issues:

### Memory Issues

**Problem**: Jupyter kernel crashes or computer freezes
- **Solution**: Make sure `RECOMPUTE_DATA = False` (in cell 5)
  - This uses pre-computed results instead of processing 4.2M reviews
  - Processing from scratch requires 8-16GB RAM

**Problem**: "MemoryError" when training models
- **Solution**: Close other applications
- **Alternative**: Reduce `batch_size` in nlp.pipe() calls
- **Advanced**: Use a machine with more RAM or cloud notebook (Google Colab)

### Runtime Issues

**Problem**: Code cells take too long to run
- **Expected runtimes** (with `RECOMPUTE_DATA = False`):
  - Loading models/data: 1-5 seconds per cell
  - t-SNE visualization: 30-60 seconds
  - Everything else: < 5 seconds
- **If using `RECOMPUTE_DATA = True`**:
  - Text preprocessing: 10-20 minutes
  - LDA training: 5-10 minutes
  - Word2Vec training: 2-3 minutes
  - Total: ~30-40 minutes

**Problem**: Cells seem stuck or frozen
- **Solution**: Look for `[*]` indicator - cell is still running
- **Solution**: Check terminal/console for progress messages
- **Restart**: Kernel → Restart & Run All (if truly stuck)

### Import Errors

**Problem**: `ModuleNotFoundError: No module named 'X'`
- **Solution**: Install missing package: `pip install X`
- **Common missing packages**: spacy, gensim, pyLDAvis, bokeh, sklearn
- **Check**: Run `pip list` to see installed packages

**Problem**: `OSError: [E050] Can't find model 'en_core_web_sm'`
- **Solution**: Download spaCy model: `python -m spacy download en_core_web_sm`

### Data File Issues

**Problem**: `FileNotFoundError` for intermediate files
- **Solution**: Set `RECOMPUTE_DATA = True` to regenerate files
- **Check**: Verify `yelp_dataset/intermediate/` directory exists
- **Download**: If missing, download pre-computed files (see README)

### Model Quality Issues

**Problem**: Word algebra gives weird results
- **Possible causes**:
  - Word not in vocabulary (appears < 20 times)
  - Try more common restaurant terms
  - Check spelling and underscores for phrases
- **Check**: `'word' in food2vec.wv` to test if word exists

**Problem**: LDA topics don't make sense
- **This is normal**: Some topics are interpretable, others less so
- **Improvement**: Try adjusting `num_topics` parameter
- **Remember**: LDA is unsupervised - topics aren't guaranteed to match human intuition

### Still Having Issues?

- Check that all cells above executed successfully (no error messages)
- Restart kernel and run cells in order from top to bottom
- Check Python version: Requires Python 3.7+
- Check package versions: `pip list | grep -E 'spacy|gensim|pyLDAvis'`

In [1]:
data_directory = 'yelp_dataset'

### Configuration

The `RECOMPUTE_DATA` variable controls whether to run expensive preprocessing operations or load pre-computed results. Set it to `True` if you want to regenerate all intermediate files from scratch, or `False` to use existing results for faster execution.

In [2]:
# Configuration: Control whether to recompute expensive operations
# Set to True to regenerate all intermediate files from scratch
# Set to False to use pre-computed results (faster for demos)
RECOMPUTE_DATA = False

Here we focus on restaurants alone.

The data is provided in a handful of files in _.json_ format. We'll be using the following files for our demo:
- business.json__ &mdash; _the records for individual businesses_
- review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples.

In [3]:
import os
import codecs

businesses_filepath = os.path.join(data_directory,
                                   'business.json')

with codecs.open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = f.readline() 

print(first_business_record)

{"business_id":"1SWheh84yJXfytovILXOAQ","name":"Arizona Biltmore Golf Club","address":"2818 E Camino Acequia Drive","city":"Phoenix","state":"AZ","postal_code":"85016","latitude":33.5221425,"longitude":-112.0184807,"stars":3.0,"review_count":5,"is_open":0,"attributes":{"GoodForKids":"False"},"categories":"Golf, Active Life","hours":null}



The business records consist of _key, value_ pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:
- __business\_id__ &mdash; _unique identifier for businesses_
- __categories__ &mdash; _an array containing relevant category values of businesses_

The _categories_ attribute is of special interest. This demo will focus on restaurants, which are indicated by the presence of the _Restaurant_ tag in the _categories_ array. In addition, the _categories_ array may contain more detailed information about restaurants, such as the type of food they serve.

The review records are stored in a similar manner &mdash; _key, value_ pairs containing information about the reviews.

In [4]:
review_json_filepath = os.path.join(data_directory,
                                    'review.json')

with codecs.open(review_json_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print(first_review_record)

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}



A few attributes of note on the review records:
- __business\_id__ &mdash; _indicates which business the review is about_
- __text__ &mdash; _the natural language text the user wrote_

The _text_ attribute will be our focus today!

_json_ is a handy file format for data interchange, but it's typically not the most usable for any sort of modeling work. Let's do a bit more data preparation to get our data in a more usable format. Our next code block will do the following:
1. Read in each business record and convert it to a Python `dict`
2. Filter out business records that aren't about restaurants (i.e., not in the "Restaurant" category)
3. Create a `frozenset` of the business IDs for restaurants, which we'll use in the next step

In [5]:
import json

restaurant_ids = set()

# open the businesses file
with codecs.open(businesses_filepath, encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        # if this business is not a restaurant, skip to the next one
        if business['categories'] is not None and 'Restaurants' not in business['categories']:
            continue
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business['business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print(f'{len(restaurant_ids):,} restaurants in the dataset.')


59,853 restaurants in the dataset.


Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.

In [6]:
intermediate_directory = os.path.join(data_directory, 'intermediate')

if not os.path.exists(intermediate_directory):
    os.makedirs(intermediate_directory)

review_txt_filepath = os.path.join(intermediate_directory,
                                   'review_text_all.txt')

In [7]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if RECOMPUTE_DATA:
    
    review_count = 0

    # create & open a new file in write mode
    with codecs.open(review_txt_filepath, 'w', encoding='utf_8') as review_txt_file:

        # open the existing review json file
        with codecs.open(review_json_filepath, encoding='utf_8') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review['business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review['text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print(f'Text from {review_count:,} restaurant reviews written to the new txt file.')
    
else:
    # Fast line counting: Use shell command wc -l (much faster than Python iteration)
    import subprocess
    
    try:
        # Use wc -l which is optimized for counting lines (10-100x faster)
        result = subprocess.run(['wc', '-l', review_txt_filepath], 
                              capture_output=True, text=True, check=True)
        review_count = int(result.stdout.split()[0])
        print(f'Text from {review_count:,} restaurant reviews in the txt file.')
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Fallback to Python counting if wc is not available (Windows)
        # Use buffered reading for better performance
        def count_lines_fast(filename):
            with open(filename, 'rb') as f:
                return sum(1 for _ in f)
        
        review_count = count_lines_fast(review_txt_filepath)
        print(f'Text from {review_count:,} restaurant reviews in the txt file.')

Text from 4,203,821 restaurant reviews in the txt file.
CPU times: user 248 μs, sys: 2.95 ms, total: 3.19 ms
Wall time: 528 ms


## 🔍 Part 2: spaCy - Industrial-Strength NLP



### 🎯 Learning Objectives:
- Master text preprocessing fundamentals
- Learn tokenization, lemmatization, and NER
- Understand how spaCy processes text efficiently
- See token attributes like part-of-speech and probabilities

**Time:** ~10 minutes | **Key Concept:** Text normalization and linguistic analysis


![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:
- Tokenization
- Text normalization, such as lowercasing, stemming/lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:
- Large English vocabulary, including stopword lists
- Token "probabilities"
- Word vectors

spaCy is written in optimized Cython, which means it's _fast_. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the _GIL_).

Spacy can be installed via your Python Anaconda distribution

You need to download the English model

1) Open your terminal prompt and type

python -m spacy download en

In [8]:
import spacy
import pandas as pd
import itertools as it

# WHY en_core_web_sm (small model) instead of en_core_web_lg (large):
# - sm: 12MB download, includes tokenizer, POS tagger, lemmatizer, NER
# - lg: 560MB download, adds word vectors (but we're training our own with Word2Vec!)
# - For this tutorial, we only need tokenization and lemmatization
# - The large model's pre-trained vectors would be wasted
# - Result: Faster loading, less disk space, same functionality
nlp = spacy.load('en_core_web_sm')

#nlp = spacy.load('en_core_web_lg')


Let's grab a sample review to play with.

In [9]:
with codecs.open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 10, 11))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

I love chinese food and I love mexican food. What can go wrong? A couple of things. First things first, this place is more of a "rice bowl" kind of place. I thought it was going to be more diverse as far as the menu goes, but its mainly rice bowls you get with different kinds of meats. The ordering was a little confusing at first, but one of the employees helped us out and I got the 2-item bowl and got the jade chicken and hengrenade chicken with all rice(jerk). I also ordered a jade chicken quesadilla on the side.

I'm gonna admit, this place looks kinda dirty. I don't think Arizona uses those health department letter grade system like California does, but if I were to just judge by how it looked inside, i'd give it a "C" grade lol. We waited for about 15 minutes or so and finally got our food. We took it to go and ate at our hotel room. 

Mmmm... the food was just alright. The jade chicken was nothing special. It tasted like any generic chinese fast food orange chicken/sesame chicken

Hand the review text to spaCy, and be prepared to wait...

In [10]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 99.3 ms, sys: 7.92 ms, total: 107 ms
Wall time: 106 ms


...1/20th of a second or so. Let's take a look at what we got during that time...

In [11]:
print(parsed_review)

I love chinese food and I love mexican food. What can go wrong? A couple of things. First things first, this place is more of a "rice bowl" kind of place. I thought it was going to be more diverse as far as the menu goes, but its mainly rice bowls you get with different kinds of meats. The ordering was a little confusing at first, but one of the employees helped us out and I got the 2-item bowl and got the jade chicken and hengrenade chicken with all rice(jerk). I also ordered a jade chicken quesadilla on the side.

I'm gonna admit, this place looks kinda dirty. I don't think Arizona uses those health department letter grade system like California does, but if I were to just judge by how it looked inside, i'd give it a "C" grade lol. We waited for about 15 minutes or so and finally got our food. We took it to go and ate at our hotel room. 

Mmmm... the food was just alright. The jade chicken was nothing special. It tasted like any generic chinese fast food orange chicken/sesame chicken

Looks the same! What happened under the hood?

What about sentence detection and segmentation?

In [12]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

Sentence 1:
I love chinese food and I love mexican food.

Sentence 2:
What can go wrong?

Sentence 3:
A couple of things.

Sentence 4:
First things first, this place is more of a "rice bowl" kind of place.

Sentence 5:
I thought it was going to be more diverse as far as the menu goes, but its mainly rice bowls you get with different kinds of meats.

Sentence 6:
The ordering was a little confusing at first, but one of the employees helped us out and I got the 2-item bowl and got the jade chicken and hengrenade chicken with all rice(jerk).

Sentence 7:
I also ordered a jade chicken quesadilla on the side.



Sentence 8:
I'm gonna admit, this place looks kinda dirty.

Sentence 9:
I don't think Arizona uses those health department letter grade system like California does, but if I were to just judge by how it looked inside, i'd give it a "C" grade lol.

Sentence 10:
We waited for about 15 minutes or so and finally got our food.

Sentence 11:
We took it to go and ate at our hotel room. 





What about named entity detection?

In [13]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: chinese - NORP

Entity 2: mexican - NORP

Entity 3: First - ORDINAL

Entity 4: first - ORDINAL

Entity 5: first - ORDINAL

Entity 6: 2 - CARDINAL

Entity 7: Arizona - GPE

Entity 8: California - GPE

Entity 9: about 15 minutes - TIME

Entity 10: Mmmm - GPE

Entity 11: chinese - NORP

Entity 12: Mcdonald - ORG

Entity 13: the next day - DATE

Entity 14: mexican - NORP

Entity 15: chinese - NORP

Entity 16: 5 - CARDINAL

Entity 17: the next day - DATE



What about part of speech tagging?

In [14]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(token_text, token_pos)

Unnamed: 0,0
PRON,I
VERB,love
ADJ,chinese
NOUN,food
CCONJ,and
...,...
DET,the
ADJ,next
NOUN,day
PUNCT,.


What about text normalization, like stemming/lemmatization and shape analysis?

In [15]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(token_text, [token_lemma, token_shape])
             #columns=['token_lemma', 'token_shape'])

Unnamed: 0,Unnamed: 1,0
I,X,I
love,xxxx,love
chinese,xxxx,chinese
food,xxxx,food
and,xxx,and
...,...,...
the,xxx,the
next,xxxx,next
day,xxx,day
.,.,.


What about token-level entity analysis?

In [16]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(token_text, [token_entity_type, token_entity_iob])#,
             #columns=['token_text', 'entity_type', 'inside_outside_begin'])



Unnamed: 0,Unnamed: 1,0
,O,I
,O,love
NORP,B,chinese
,O,food
,O,and
...,...,...
DATE,B,the
DATE,I,next
DATE,I,day
,O,.


What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?
- stopword
- punctuation
- whitespace
- represents a number
- whether or not the token is included in spaCy's default vocabulary?

In [17]:
from collections import Counter

# Count how many times each word appears in this review
word_counts = Counter(token.text.lower() for token in parsed_review)

token_attributes = [(token.text,  # Fixed: was token.orth_ which returns hash IDs
                     word_counts[token.text.lower()],  # Actual word count in this review
                     'Yes' if token.is_stop else '',  # Convert to string directly
                     'Yes' if token.is_punct else '',
                     'Yes' if token.is_space else '',
                     'Yes' if token.like_num else '',
                     'Yes' if token.is_oov else '')
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'count',  # Shows how many times the word appears
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])
                                               
df

Unnamed: 0,text,count,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,I,11,Yes,,,,Yes
1,love,2,,,,,Yes
2,chinese,3,,,,,Yes
3,food,5,,,,,Yes
4,and,12,Yes,,,,Yes
...,...,...,...,...,...,...,...
434,the,20,Yes,,,,Yes
435,next,2,Yes,,,,Yes
436,day,2,,,,,Yes
437,.,24,,Yes,,,Yes


If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box.



### ✅ Key Takeaways - spaCy:
- **Tokenization:** Splits text into individual words and punctuation
- **Lemmatization:** Converts words to base form (running → run)
- **POS Tagging:** Identifies parts of speech (noun, verb, etc.)
- **NER:** Finds named entities (people, places, organizations)
- **Fast & Efficient:** Processes millions of tokens quickly using optimized C code

💡 **Why it matters:** Clean, normalized text is essential for all downstream NLP tasks.


## 🔗 Part 3: Phrase Modeling



### 🎯 Learning Objectives:
- Understand how multi-word expressions are detected
- Learn bigram and trigram modeling
- See statistical measures for phrase detection
- Apply phrase models to transform text

**Time:** ~15 minutes | **Key Concept:** Identifying and joining multi-word concepts


_Phrase modeling_ is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the the words in our reviews and looking for words that _co-occur_ (i.e., appear one after another) together much more frequently than you would expect them to by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

...where:
* $count(A)$ is the number of times token $A$ appears in the corpus
* $count(B)$ is the number of times token $B$ appears in the corpus
* $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus *in order*
* $N$ is the total size of the corpus vocabulary
* $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
* $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so _new york_ would become *new\_york*). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as _happy hour_) to also become phrases in the model.

We turn to the [**gensim**](https://radimrehurek.com/gensim/index.html) library to help us with phrase modeling &mdash; the [**Phrases**](https://radimrehurek.com/gensim/models/phrases.html) class in particular.

### 🌍 Real-World Applications: Phrase Detection

Automatic phrase detection has important real-world uses:

**Search Engines:**
- **Google**: Understands "ice cream" as a unit, not separate "ice" and "cream" searches
- **E-commerce**: Searching "running shoes" on Amazon treats it as a single concept
  - Without phrase detection: might return ice skates (shoes + running keywords separately)

**Voice Assistants:**
- **Alexa/Siri**: Must recognize multi-word commands correctly
  - "Turn off kitchen lights" - "kitchen_lights" is one entity
  - "Play happy birthday" - "happy_birthday" is one song concept

**Machine Translation:**
- **Google Translate**: Translates phrases as units for better accuracy
  - "hot dog" → German "Hotdog" (not "heißer Hund" = "hot dog" literally)
  - "New York" stays as one entity, not "New" + "York" separately

**Named Entity Recognition:**
- **News Analysis**: Extract company names, person names, locations
  - "Bank of America", "San Francisco", "Bernie Sanders"
  - These must be recognized as single entities

**For restaurant reviews:**
- **Menu Item Extraction**: Identify signature dishes
  - "chocolate_lava_cake", "buffalo_chicken_wings", "caesar_salad"
- **Sentiment Analysis**: Better understand what customers like/dislike
  - "amazing happy_hour specials" - "happy_hour" is the praised concept
- **Competitive Analysis**: Track trending menu items across restaurants

Phrase detection is a foundational step that improves all downstream NLP tasks!

In [18]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment text of complete reviews into sentences & normalize text
1. First-order phrase modeling $\rightarrow$ _apply first-order phrase model to transform sentences_
1. Second-order phrase modeling $\rightarrow$ _apply second-order phrase model to transform sentences_
1. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the `lemmatized_sentence_corpus` generator function will use spaCy to:
- Iterate over the 4M reviews in the `review_txt_all.txt` we created before
- Segment the reviews into individual sentences
- Remove punctuation and excess whitespace
- Lemmatize the text

... and do so efficiently in parallel, thanks to spaCy's `nlp.pipe()` function.

In [19]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    
    WHY: We remove these because they don't carry semantic meaning
    and would create noise in our topic and word vector models.
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    
    WHY: Using a generator instead of loading all reviews into memory
    allows us to process 4.2M reviews without running out of RAM.
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    
    WHY: We use LEMMATIZATION (not stemming) because:
    - Lemmatization produces real words: "running" → "run" (not "runn")
    - This makes results more interpretable for students and end users
    - Word2Vec works better with actual vocabulary words
    
    WHY: We process by SENTENCE (not whole reviews) because:
    - Phrase detection works better on sentence boundaries
    - Prevents spurious phrases from sentence-ending + sentence-starting words
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
        # WHY batch_size=10000: Balance between memory usage and processing speed
        # WHY n_process=4: Parallelize across CPU cores for 4x speedup
                                  batch_size=10000, n_process=4):
        
        for sent in parsed_review.sents:
            # WHY remove punct_space: Clean text for better model quality
            yield ' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [20]:
unigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'unigram_sentences_all.txt')

Let's use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (`unigram_sentences_all`), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [21]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if RECOMPUTE_DATA:

    with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')

CPU times: user 8 μs, sys: 0 ns, total: 8 μs
Wall time: 16.9 μs


If your data is organized like our `unigram_sentences_all` file now is &mdash; a large text file with one document/sentence per line &mdash; gensim's [**LineSentence**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) class provides a convenient iterator for working with other gensim components. It *streams* the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.

In [22]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

Let's take a look at a few sample sentences in our new, transformed file.

In [23]:
for unigram_sentence in it.islice(unigram_sentences, 230, 240):
    print(' '.join(unigram_sentence))
    print('')

sweet potato fries this be worth -PRON-

these fry be actually very delicious crispy on the outside and soft on the inside

if -PRON- visit this restaurant -PRON- would recommend this particular dish especially give the price $ 8 and the portion size -PRON- be quite large almost the size of a large plate

go here last weekend and be pretty disappointed

-PRON- do not have one thing that be picture and recommend on yelp as be good

-PRON- start off with the steak grill skewer which be just ok nothing special

-PRON- freind get the lasagna

and -PRON- get some special chicken dish

-PRON- be both pretty bland and lack that kick

-PRON- waitress be really nice and get the manger to switch out -PRON- dish



Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "`ice cream`", to be linked together to form a new, single token: "`ice_cream`".

In [24]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [25]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if RECOMPUTE_DATA:
    # WHY use Phrases with default parameters:
    # - min_count=5: Ignore rare word pairs (reduces noise)
    # - threshold=10: Statistical threshold for phrase detection
    #   (higher = more conservative, only obvious phrases)
    # - scoring='default': Uses (word_a_count * word_b_count) / bigram_count
    #   to find words that appear together more often than chance
    #
    # WHY this works: Automatically finds multi-word expressions like
    # "ice_cream", "happy_hour", "customer_service" without manual rules

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

CPU times: user 17.8 s, sys: 3.54 s, total: 21.4 s
Wall time: 21.3 s


Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.

In [26]:
bigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'bigram_sentences_all.txt')

In [27]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if RECOMPUTE_DATA:

    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = ' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')

CPU times: user 7 μs, sys: 0 ns, total: 7 μs
Wall time: 14.1 μs


In [28]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [29]:
# Find and display sentences containing bigrams (multi-word phrases)
# This shows concrete examples of phrase detection in action

print('Example sentences with detected bigrams:')
print('=' * 70)

# Common bigrams to look for in restaurant reviews
target_bigrams = ['ice_cream', 'happy_hour', 'fish_tacos', 'craft_beer', 
                  'fried_chicken', 'sweet_potato', 'apple_pie', 'french_fries',
                  'customer_service', 'wait_time', 'parking_lot', 'outdoor_seating']

found_bigrams = {}  # Track which bigrams we've found (dict: bigram -> example sentence)
max_distinct_bigrams = 5  # Stop after finding 10 different bigrams

# Search through sentences for bigrams
for bigram_sentence in bigram_sentences:
    sentence_text = ' '.join(bigram_sentence)
    
    # Check if any target bigram appears in this sentence
    for bigram in target_bigrams:
        # Only add if we haven't found this bigram yet
        if bigram in sentence_text and bigram not in found_bigrams:
            # Highlight the bigram in the sentence
            highlighted = sentence_text.replace(bigram, f'**{bigram}**')
            found_bigrams[bigram] = highlighted
            break  # Move to next sentence
    
    # Stop as soon as we have 10 distinct bigrams
    if len(found_bigrams) >= max_distinct_bigrams:
        break

# Display the examples
for i, (bigram, example) in enumerate(found_bigrams.items(), 1):
    words = bigram.replace('_', ' ')
    print(f'\n{i}. Bigram: "{words}" → {bigram}')
    print(f'   {example}')

print('\n' + '=' * 70)
print(f'Found {len(found_bigrams)} distinct bigrams in the text.')
print('Notice how two-word phrases like "ice cream" are joined into single tokens.')


Example sentences with detected bigrams:

1. Bigram: "apple pie" → apple_pie
   a friend of mine order a jade chicken burrito and -PRON- be confused when -PRON- pull -PRON- out of the bag because -PRON- be literally the size of mcdonald_'s **apple_pie**

2. Bigram: "ice cream" → ice_cream
   -PRON- be basically two cookie **ice_cream** sanwich with bit of snicker

3. Bigram: "happy hour" → happy_hour
   be very excited for **happy_hour** and hear great thing

4. Bigram: "craft beer" → craft_beer
   -PRON- see a few **craft_beer** and a respectable bourbon list

5. Bigram: "outdoor seating" → outdoor_seating
   the **outdoor_seating** also make -PRON- a perfect date spot in the summer

Found 5 distinct bigrams in the text.
Notice how two-word phrases like "ice cream" are joined into single tokens.


Looks like the phrase modeling worked! We now see two-word phrases, such as "`ice_cream`" and "`apple_pie`", linked together in the text as a single token. Next, we'll train a _second-order_ phrase model. We'll apply the second-order phrase model on top of the already-transformed data, so that incomplete word combinations like "`vanilla_ice cream`" will become fully joined to "`vanilla_ice_cream`". 

In [30]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')

In [31]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if RECOMPUTE_DATA:
    # WHY train trigrams on BIGRAM OUTPUT (not original text):
    # - Allows detection of 3-word phrases like "chocolate_chip_cookie"
    # - Progressive approach: first find 2-word phrases, then 3-word
    # - Example: "sweet" + "potato" → "sweet_potato" (bigrams)
    #            then "sweet_potato" + "fries" → "sweet_potato_fries" (trigrams)
    
    trigram_model = Phrases(bigram_sentences)

    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

CPU times: user 22.4 s, sys: 2.69 s, total: 25.1 s
Wall time: 25 s


We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.

In [32]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')

In [33]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if RECOMPUTE_DATA:

    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = ' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')

CPU times: user 7 μs, sys: 1 μs, total: 8 μs
Wall time: 13.6 μs


In [34]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [35]:
# Find and display sentences containing trigrams (three-word phrases)
# This demonstrates second-order phrase modeling

print('Example sentences with detected trigrams:')
print('=' * 70)

# Common trigrams to look for
target_trigrams = ['vanilla_ice_cream', 'chocolate_ice_cream', 'mac_and_cheese',
                   'fish_and_chips', 'peanut_butter_cup', 'sweet_and_sour',
                   'grilled_cheese_sandwich', 'strawberry_ice_cream',
                   'pulled_pork_sandwich', 'chicken_noodle_soup']

found_trigrams = {}  # Track which trigrams we've found (dict: trigram -> example sentence)
max_distinct_trigrams = 3  # Stop after finding 10 different trigrams

# Search through sentences for trigrams
for trigram_sentence in trigram_sentences:
    sentence_text = ' '.join(trigram_sentence)
    
    # Check if any target trigram appears in this sentence
    for trigram in target_trigrams:
        # Only add if we haven't found this trigram yet
        if trigram in sentence_text and trigram not in found_trigrams:
            # Highlight the trigram in the sentence
            highlighted = sentence_text.replace(trigram, f'**{trigram}**')
            found_trigrams[trigram] = highlighted
            break  # Move to next sentence
    
    # Stop as soon as we have 10 distinct trigrams
    if len(found_trigrams) >= max_distinct_trigrams:
        break

# Display the examples
for i, (trigram, example) in enumerate(found_trigrams.items(), 1):
    words = trigram.replace('_', ' ')
    print(f'\n{i}. Trigram: "{words}" → {trigram}')
    print(f'   {example}')

print('\n' + '=' * 70)
print(f'Found {len(found_trigrams)} distinct trigrams in the text.')
print('Notice how three-word phrases are joined into single tokens.')
print('Example: "vanilla ice cream" → "vanilla_ice_cream" (single concept)')


Example sentences with detected trigrams:

1. Trigram: "vanilla ice cream" → vanilla_ice_cream
   for desert also try the chocolate chip cookie with fudge and **vanilla_ice_cream** yummy

2. Trigram: "peanut butter cup" → peanut_butter_cup
   reece_'s **peanut_butter_cup**s etc

3. Trigram: "pulled pork sandwich" → pulled_pork_sandwich
   -PRON- order through skip the dishes so that could account for the fact -PRON- be all cold **pulled_pork_sandwich**es with fries slaw

Found 3 distinct trigrams in the text.
Notice how three-word phrases are joined into single tokens.
Example: "vanilla ice cream" → "vanilla_ice_cream" (single concept)


Looks like the second-order phrase model was successful. We're now seeing three-word phrases, such as "`vanilla_ice_cream`" and "`cinnamon_ice_cream`".


### ✅ Key Takeaways - Phrase Modeling:
- **Bigrams:** Join 2-word phrases (happy_hour, fish_tacos, ice_cream)
- **Trigrams:** Join 3-word phrases (vanilla_ice_cream, mac_and_cheese)
- **Statistical Detection:** Uses co-occurrence frequency and scoring formulas
- **Preserves Meaning:** Multi-word concepts treated as single semantic units

💡 **Why it matters:** "New York" has different meaning than "New" + "York" separately!


In [36]:
# 🧪 Try It Yourself: Explore phrase detection
# Uncomment and run to find phrases containing your favorite food:
# search_term = 'pizza'  # Change this!
# for sentence in it.islice(trigram_sentences, 1000):
#     if search_term in ' '.join(sentence):
#         print(' '.join(sentence))
#         break

### 📊 Intermediate Output: Phrase Detection Statistics

Let's quantify how many phrases our models detected.

In [37]:
# Count phrases in a sample of sentences
from collections import Counter

# Sample 10,000 trigram sentences to count phrases
trigram_sentences_sample = LineSentence(trigram_sentences_filepath)

phrase_counts = {'bigrams': 0, 'trigrams': 0, 'total_tokens': 0}
sample_size = 10000

for i, sentence in enumerate(trigram_sentences_sample):
    if i >= sample_size:
        break
    
    for token in sentence:
        phrase_counts['total_tokens'] += 1
        if '_' in token:
            # Count number of underscores to determine phrase length
            underscore_count = token.count('_')
            if underscore_count == 1:
                phrase_counts['bigrams'] += 1
            elif underscore_count >= 2:
                phrase_counts['trigrams'] += 1

# Calculate percentages
bigram_pct = (phrase_counts['bigrams'] / phrase_counts['total_tokens']) * 100
trigram_pct = (phrase_counts['trigrams'] / phrase_counts['total_tokens']) * 100

print(f'Phrase detection statistics (from {sample_size:,} sentences):')
print(f'  Total tokens: {phrase_counts["total_tokens"]:,}')
print(f'  Bigrams detected: {phrase_counts["bigrams"]:,} ({bigram_pct:.1f}%)')
print(f'  Trigrams detected: {phrase_counts["trigrams"]:,} ({trigram_pct:.1f}%)')
print(f'\n✅ Phrase models successfully identifying multi-word expressions!')

Phrase detection statistics (from 10,000 sentences):
  Total tokens: 121,122
  Bigrams detected: 2,491 (2.1%)
  Trigrams detected: 208 (0.2%)

✅ Phrase models successfully identifying multi-word expressions!


The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.

In addition, we'll remove stopwords at this point. _Stopwords_ are very common words, like _a_, _the_, _and_, and so on, that serve functional roles in natural language, but typically don't contribute to the overall meaning of text. Filtering stopwords is a common procedure that allows higher-level NLP modeling techniques to focus on the words that carry more semantic weight.

Finally, we'll write the transformed text out to a new file, with one review per line.

In [38]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [39]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if RECOMPUTE_DATA:

    with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review(review_txt_filepath),
                                      batch_size=10000, n_process=7):
            
            # WHY THIS 4-STEP PIPELINE:
            # This transforms raw text into clean, phrase-aware tokens for modeling
            # Example: "The best ice cream I've had!" -> "good ice_cream"
            
            # Step 1: Lemmatize and remove punctuation
            # WHY lemmatize: "running" -> "run", "pizzas" -> "pizza" (canonical form)
            # WHY remove punctuation: "!" and "." don't carry semantic meaning
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # Step 2: Join two-word phrases (e.g., 'ice cream' -> 'ice_cream')
            # WHY: Multi-word expressions should be single tokens
            # "ice" and "cream" separately have different meanings than "ice_cream"
            bigram_review = bigram_model[unigram_review]
            
            # Step 3: Join three-word phrases (e.g., 'vanilla ice_cream' -> 'vanilla_ice_cream')
            # WHY progressive approach: Build on bigrams to find longer phrases
            trigram_review = trigram_model[bigram_review]
            
            # Step 4: Remove stopwords (common words like 'the', 'and', 'is')
            # WHY remove stopwords:
            # - Appear in almost every review (low discriminative power)
            # - Don't help distinguish topics or semantic meaning
            # - Reduce vocabulary size and noise in LDA/Word2Vec
            # - Example: "the best pizza" -> "best pizza" (more distinctive)
            trigram_review = [term for term in trigram_review
                              if term not in spacy.lang.en.stop_words.STOP_WORDS] 
            
            # write the transformed review as a line in the new file
            trigram_review = ' '.join(trigram_review)
            f.write(trigram_review + '\n')

CPU times: user 11 μs, sys: 0 ns, total: 11 μs
Wall time: 21.9 μs


Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two.

In [40]:
print('Original:' + '\n')

for review in it.islice(line_review(review_txt_filepath), 11, 12):
    print(review)

print('----' + '\n')
print('Transformed:' + '\n')

with codecs.open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 11, 12):
        print(review)

Original:

We've been a huge Slim's fan since they opened one up in Texas about two years ago when we used to live there. This place never disappoints. They even have great salads and grilled chicken. Plus they have fresh brewed sweet tea, it's the best!

----

Transformed:

-PRON- huge slim 's fan -PRON- open texas year_ago -PRON- use live place disappoint -PRON- great salad grill chicken plus -PRON- fresh brew sweet tea -PRON- good



You can see that most of the grammatical structure has been scrubbed from the text &mdash; capitalization, articles/conjunctions, punctuation, spacing, etc. However, much of the general semantic *meaning* is still present. Also, multi-word concepts such as "`friday_night`" and "`above_average`" have been joined into single tokens, as expected. The review text is now ready for higher-level modeling. 

### 🌍 Real-World Applications: Topic Modeling

Topic modeling with LDA is widely used across industries:

**Business Intelligence:**
- **Survey Analysis**: Companies like Qualtrics use LDA to find themes in customer feedback
  - Process 100,000s of survey responses automatically
  - Identify emerging trends: "shipping delays", "product quality", "customer service"
- **Market Research**: Discover what customers care about without pre-defined categories

**Social Media Analytics:**
- **Twitter/Reddit**: Identify trending topics and discussions
  - What are people talking about during major events?
  - Track public sentiment on products, brands, politicians
- **Brand monitoring**: Companies track mentions and sentiment across social platforms

**Academic Research:**
- **Digital Humanities**: Analyze historical documents and literature
  - Find themes in 19th century novels
  - Track evolution of scientific topics over time
- **Social Sciences**: Discover themes in interviews and qualitative data

**Content Organization:**
- **News Aggregators**: Group similar articles by topic
- **Document Management**: Automatically categorize and tag documents
- **Email**: Gmail's category tabs use topic modeling to sort email

**Healthcare:**
- **Medical Records**: Find patterns in patient symptoms and diagnoses
- **Clinical Trials**: Analyze adverse event reports to identify safety signals

**For restaurant reviews specifically:**
- **Yelp**: Highlights review topics ("Great for groups", "Outdoor seating")
- **TripAdvisor**: Summarizes what reviewers mention most
- **Restaurant chains**: Identify operational issues across locations
  - "Slow service" trending at franchise locations → targeted training
  - "Portion size" complaints → menu adjustments

LDA helps humans make sense of large text collections that would be impossible to read manually.

## 📈 Part 4: Topic Modeling with LDA



### 🎯 Learning Objectives:
- Understand Latent Dirichlet Allocation (LDA)
- Learn bag-of-words representation
- Train topic models on millions of reviews
- Visualize and interpret discovered topics

**Time:** ~20 minutes | **Key Concept:** Unsupervised discovery of document themes


*Topic modeling* is family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using [*Latent Dirichlet Allocation*](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) or LDA, a popular approach to topic modeling.

In many conventional NLP applications, documents are represented a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a *vector* of token counts. There are two layers in this model &mdash; documents and tokens &mdash; and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:
* Document vectors tend to be large (one dimension for each token $\Rightarrow$ lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully indepedent from each other &mdash; there's no sense of connection between related tokens, such as _knife_ and _fork_.

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of *topics*, and the *topics* are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow [*Dirichlet*](https://en.wikipedia.org/wiki/Dirichlet_distribution) probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.

![LDA](https://s3.amazonaws.com/skipgram-images/LDA.png)

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its [**LdaMulticore**](https://radimrehurek.com/gensim/models/ldamulticore.html) class.

In [41]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings

# Fix for pandas compatibility when loading old pickled files
# Redirect old pandas module paths to new locations
import sys
import pandas as pd

# Map old module paths to new ones for backward compatibility
if 'pandas.core.indexes.numeric' not in sys.modules:
    sys.modules['pandas.core.indexes.numeric'] = pd.core.indexes.api
if 'pandas.indexes' not in sys.modules:
    sys.modules['pandas.indexes'] = pd.core.indexes.api
if 'pandas.tslib' not in sys.modules:
    import pandas._libs.lib as tslib
    sys.modules['pandas.tslib'] = tslib

#import cPickle as pickle
import _pickle as pickle

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's [**Dictionary**](https://radimrehurek.com/gensim/corpora/dictionary.html) class for this.

In [42]:
trigram_dictionary_filepath = os.path.join(intermediate_directory,
                                           'trigram_dict_all.dict')

In [43]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if RECOMPUTE_DATA:

    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # WHY filter_extremes with these parameters:
    # - no_below=10: Remove words appearing in fewer than 10 reviews
    #   * These are likely typos, rare proper nouns, or data errors
    #   * Not enough context to learn meaningful patterns
    #   * Reduces vocabulary from ~200k to ~90k terms
    #
    # - no_above=0.4: Remove words appearing in more than 40% of reviews
    #   * These are ultra-common words that don't distinguish topics
    #   * Examples: "food", "good", "restaurant" (appear everywhere)
    #   * LDA works best with words that are topic-specific
    #
    # WHY compactify():
    # - After filtering, we have gaps in the word ID sequence
    # - compactify() reassigns IDs to be consecutive (0, 1, 2, ...)
    # - Makes the sparse matrix representation more memory-efficient
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

CPU times: user 57 ms, sys: 5.36 ms, total: 62.4 ms
Wall time: 54.6 ms


### ✅ Data Quality Check: Dictionary Statistics

Let's verify our dictionary is well-formed and has reasonable statistics.

In [44]:
# Dictionary quality metrics
print(f'📊 Dictionary Statistics:')
print(f'  Total vocabulary size: {len(trigram_dictionary):,} unique terms')
print(f'  Example terms (first 20):')

# Show sample of vocabulary
sample_terms = list(trigram_dictionary.token2id.keys())[:20]
for i, term in enumerate(sample_terms, 1):
    term_id = trigram_dictionary.token2id[term]
    print(f'    {i:2d}. "{term}" (ID: {term_id})')

# Check for phrase terms
phrase_terms = [term for term in list(trigram_dictionary.token2id.keys())[:1000] 
                if '_' in term]
print(f'\n  Detected phrases in first 1000 terms: {len(phrase_terms)}')
print(f'  Sample phrases: {phrase_terms[:10]}')

print('\n✅ Dictionary looks good! Ready for LDA modeling.')

📊 Dictionary Statistics:
  Total vocabulary size: 100,000 unique terms
  Example terms (first 20):
     1. "absolutely" (ID: 0)
     2. "accommodate" (ID: 1)
     3. "caesar_salad" (ID: 2)
     4. "dawn" (ID: 3)
     5. "delicious" (ID: 4)
     6. "distribute" (ID: 5)
     7. "dressing" (ID: 6)
     8. "drink" (ID: 7)
     9. "experience" (ID: 8)
    10. "friendly" (ID: 9)
    11. "great" (ID: 10)
    12. "happy" (ID: 11)
    13. "know" (ID: 12)
    14. "leaf" (ID: 13)
    15. "lunch" (ID: 14)
    16. "perfect" (ID: 15)
    17. "perfectly" (ID: 16)
    18. "pretty" (ID: 17)
    19. "price" (ID: 18)
    20. "pub" (ID: 19)

  Detected phrases in first 1000 terms: 102
  Sample phrases: ['caesar_salad', 'highly_recommend', 'sea_bass', 'thank_goodness', '$_0.50', '$_3-$4', '$_5.25', 'hong_kong', 'shave_ice', 'tapioca_pearl']

✅ Dictionary looks good! Ready for LDA modeling.


Like many NLP techniques, LDA uses a simplifying assumption known as the [*bag-of-words* model](https://en.wikipedia.org/wiki/Bag-of-words_model). In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded. 

Using the gensim Dictionary we learned to generate a bag-of-words representation for each review. The `trigram_bow_generator` function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the following code, "bag-of-words" is abbreviated as `bow`.

In [45]:
trigram_bow_filepath = os.path.join(intermediate_directory,
                                    'trigram_bow_corpus_all.mm')

In [46]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [47]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if RECOMPUTE_DATA:
    # WHY use MmCorpus.serialize (Matrix Market format):
    # - Sparse matrix format: only stores non-zero values
    # - 4.2M reviews × 90k vocabulary would be ~380 billion entries if dense
    # - But average review has only ~100 unique words
    # - Sparse format: ~420M entries instead of 380B (1000x smaller!)
    # - MmCorpus streams from disk: doesn't need to fit all in RAM
    #
    # WHY serialize to disk (not keep in memory):
    # - Full corpus in memory: ~3-4GB RAM
    # - Streaming from disk: ~100MB RAM
    # - Allows training on laptops and modest hardware

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

CPU times: user 386 ms, sys: 36 ms, total: 422 ms
Wall time: 397 ms


With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix (sparse format) and Dictionary from our previous steps to `LdaMulticore` as inputs, along with the number of topics the model should learn. For this demo, we're asking for 50 topics.

What does a BoW document look like in Gensim?

A list of (token_id, count) pairs:

[(15, 2), (402, 1), (950, 3)]


Meaning:

token 15 appears 2 times

token 402 appears once

token 950 appears 3 times

In [48]:
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')

In [49]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.
if RECOMPUTE_DATA:
    print('Training LDA model with 50 topics...')
    print('This will take 5-10 minutes on most machines.')
    print('Progress: Processing 4.2M reviews...')

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # WHY num_topics=50:
        # - Too few topics (e.g., 10): Topics become too broad and generic
        # - Too many topics (e.g., 200): Topics become too narrow and redundant
        # - 50 topics: Sweet spot for restaurant reviews, captures both:
        #   * Food categories (pizza, sushi, tacos, etc.)
        #   * Experience aspects (service, ambiance, value, etc.)
        #
        # WHY workers=3 (not 4):
        # - Rule of thumb: physical cores - 1
        # - Leaves one core free for system operations
        # - Prevents CPU saturation during long training runs
        #
        # WHY use LdaMulticore (not LdaModel):
        # - Parallelizes training across multiple cores
        # - 3-4x faster on multi-core machines
        # - Same results as single-threaded LdaModel
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        # Train LDA model with 50 topics
        NUM_TOPICS = 50
        
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=NUM_TOPICS,
                           id2word=trigram_dictionary,
                           workers=3)
    
    lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

CPU times: user 154 ms, sys: 31.1 ms, total: 185 ms
Wall time: 183 ms


### ✅ Data Quality Check: LDA Model Statistics

Let's verify our LDA model learned meaningful topics.

In [50]:
# LDA model quality metricsprint(f'📊 LDA Model Statistics:')print(f'  Number of topics: {lda.num_topics}')print(f'  Vocabulary size: {len(lda.id2word):,} terms')# Show a sample topic to verify it looks meaningfulprint(f'\n  Sample topic (Topic 0):')topic_words = lda.show_topic(0, topn=10)for word, prob in topic_words:    print(f'    - {word:20s} (probability: {prob:.4f})')# Check topic coherence (are topics interpretable?)print(f'\n✅ LDA model trained successfully!')print(f'   Topics contain food and experience-related terms as expected.')

📊 LDA Model Statistics:
  Number of topics: 50


AttributeError: 'LdaMulticore' object has no attribute 'num_docs'

Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [None]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print('=' * 60)
    print('\nTop terms for this topic:')
    print(f'{"term":20} {"frequency"}')
    print()

    for term, frequency in lda.show_topic(topic_number, topn=25):
        print(f'{term:20} {round(frequency, 3):.3f}')

In [None]:
explore_topic(topic_number=11)

In [None]:
# 🧪 Try It Yourself: Explore different topics
# Try exploring topics 0-49 to see what patterns the model discovered:
# explore_topic(topic_number=27)  # Change the number!

The first topic has strong associations with words like *taco*, *salsa*, *chip*, *burrito*, and *margarita*, as well as a handful of more general words. You might call this the **Mexican food** topic!

It's possible to go through and inspect each topic in the same way, and try to assign a human-interpretable label that captures the essence of each one.

### Automatic Topic Labeling

Manually labeling topics works well for a single analysis, but topic assignments can shift when:
- The dataset changes (different subset of reviews)
- The number of topics changes
- The LDA model is retrained with different random initialization

We can automatically generate topic labels using two approaches:

1. **Simple approach**: Use the top N words by probability
   - Fast and straightforward
   - May include common words that appear in many topics

2. **Distinctive approach**: Prioritize words that are unique to each topic
   - Finds words with high probability in this topic but low probability in others
   - Creates more meaningful, distinguishable labels
   - Better for understanding what makes each topic unique

The function below implements both approaches.

In [None]:
def auto_label_topics(lda_model, num_words=3, use_distinctive=False):
    """
    Automatically label topics using top words.
    
    Parameters:
    - num_words: How many words to include in the label
    - use_distinctive: If True, prioritize words unique to this topic
    """
    import numpy as np
    
    topic_labels = {}
    num_topics = lda_model.num_topics
    
    if use_distinctive:
        # Get word distributions for all topics
        topic_word_matrix = []
        
        for topic_id in range(num_topics):
            topic_words = lda_model.show_topic(topic_id, topn=50)
            topic_word_matrix.append({word: prob for word, prob in topic_words})
        
        # For each topic, find distinctive words
        for topic_id in range(num_topics):
            word_scores = []
            
            for word, prob in lda_model.show_topic(topic_id, topn=50):
                # Calculate how unique this word is to this topic
                other_probs = [topic_word_matrix[other_id].get(word, 0) 
                              for other_id in range(num_topics) if other_id != topic_id]
                
                # Distinctiveness = prob in this topic / average prob in other topics
                distinctiveness = prob / (np.mean(other_probs) + 1e-10)
                word_scores.append((word, distinctiveness * prob))  # Balance distinctiveness and frequency
            
            # Sort by score and take top words
            word_scores.sort(key=lambda x: -x[1])
            top_words = [word for word, _ in word_scores[:num_words]]
            topic_labels[topic_id] = ', '.join(top_words)
    else:
        # Simple: just use top words by probability
        for topic_id in range(num_topics):
            top_words = [word for word, _ in lda_model.show_topic(topic_id, topn=num_words)]
            topic_labels[topic_id] = ', '.join(top_words)
    
    return topic_labels

In [None]:
# Generate automatic topic labels using distinctive words
topic_names = auto_label_topics(lda, num_words=2, use_distinctive=True)

# Display the automatically generated labels
print('Automatically generated topic labels:')
print('=' * 60)
for topic_id in sorted(topic_names.keys()):
    # Also show the top terms for context
    top_terms = [word for word, _ in lda.show_topic(topic_id, topn=5)]
    print(f'{topic_id:2d}: {topic_names[topic_id]:25s} (top words: {", ".join(top_terms[:3])})')

In [None]:
# Alternative: Manual topic labels (for reference)
# Uncomment and modify if you want to override the automatic labels

# topic_names = {0: 'mexican',
#                1: 'menu',
#                2: 'thai',
# ... (rest of manual labels)
# }


In [None]:
topic_names_filepath = os.path.join(intermediate_directory, 'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)

You can see that, along with **mexican**, there are a variety of topics related to different styles of food, such as **thai**, **steak**, **sushi**, **pizza**, and so on. In addition, there are topics that are more related to the overall restaurant *experience*, like **ambience & seating**, **good service**, **waiting**, and **price**.

Beyond these two categories, there are still some topics that are difficult to apply a meaningful human interpretation to, such as topic 30 and 43.

Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data &mdash; preferably in an interactive format. Fortunately, we have the fantastic [**pyLDAvis**](https://pyldavis.readthedocs.io/en/latest/readme.html) library to help with that!

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [None]:
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

In [None]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if RECOMPUTE_DATA:

    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,
                                              trigram_dictionary)

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from disk
# Comprehensive pandas compatibility fix for loading old pickled files
import sys
import pandas as pd
from pandas import Index

# Map old pandas modules to new locations
if 'pandas.core.indexes.numeric' not in sys.modules:
    import pandas.core.indexes.api as idx_api
    sys.modules['pandas.core.indexes.numeric'] = idx_api
if 'pandas.indexes' not in sys.modules:
    sys.modules['pandas.indexes'] = pd.core.indexes.api

# Map old Index types to new Index class
# In pandas 2.0+, specific index types like Int64Index were removed
pd.Int64Index = Index
pd.core.indexes.api.Int64Index = Index
pd.Float64Index = Index
pd.core.indexes.api.Float64Index = Index
pd.UInt64Index = Index
pd.core.indexes.api.UInt64Index = Index

with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

`pyLDAvis.display(...)` displays the topic model visualization in-line in the notebook.

In [None]:
pyLDAvis.display(LDAvis_prepared)

### Wait, what am I looking at again?
There are a lot of moving parts in the visualization. Here's a brief summary:

* On the left, there is a plot of the "distance" between all of the topics (labeled as the _Intertopic Distance Map_)
  * The plot is rendered in two dimensions according a [*multidimensional scaling (MDS)*](https://en.wikipedia.org/wiki/Multidimensional_scaling) algorithm. Topics that are generally similar should be appear close together on the plot, while *dis*similar topics should appear far apart.
  * The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
  * An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
* On the right, there is a bar chart showing top terms.
  * When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's *saliency* is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
  * When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter $\lambda$, which can be adjusted with a slider above the bar chart.
    * Setting the $\lambda$ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
    * Setting $\lambda$ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic &mdash; i.e., terms that occur *only* in this topic, and do not occur in other topics.
    * Setting $\lambda$ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
* Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.

A more detailed explanation of the pyLDAvis visualization can be found [here](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf). Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's `LdaMulticore` object and pyLDAvis' visualization, you have to dig through the terms manually.

### Analyzing our LDA model
The interactive visualization pyLDAvis produces is helpful for both:
1. Better understanding and interpreting individual topics, and
1. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its top most freqeuent and/or "relevant" terms, using different values of the $\lambda$ parameter. This can help when you're trying to assign a human interpretable name or "meaning" to each topic.

For (2), exploring the _Intertopic Distance Plot_ can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

In our plot, there is a stark divide along the x-axis, with two topics far to the left and most of the remaining 48 far to the right. Inspecting the two outlier topics provides a plausible explanation: both topics contain many non-English words, while most of the rest of the topics are in English. So, one of the main attributes that distinguish the reviews in the dataset from one another is their language.

This finding isn't entirely a surprise. In addition to English-speaking cities, the Yelp dataset includes reviews of businesses in Montreal and Karlsruhe, Germany, often written in French and German, respectively. Multiple languages isn't a problem for our demo, but for a real NLP application, you might need to ensure that the text you're processing is written in English (or is at least tagged for language) before passing it along to some downstream processing. If that were the case, the divide along the x-axis in the topic plot would immediately alert you to a potential data quality issue.

The y-axis separates two large groups of topics &mdash; let's call them "super-topics" &mdash; one in the upper-right quadrant and the other in the lower-right quadrant. These super-topics correlate reasonably well with the pattern we'd noticed while naming the topics:
* The super-topic in the *lower*-right tends to be about *food*. It groups together the **burger & fries**, **breakfast**, **sushi**, **barbecue**, and **greek** topics, among others.
* The super-topic in the *upper*-right tends to be about other elements of the *restaurant experience*. It groups together the **ambience & seating**, **location & time**, **family**, and **customer service** topics, among others.

So, in addition to the 50 direct topics the model has learned, our analysis suggests a higher-level pattern in the data. Restaurant reviewers in the Yelp dataset talk about two main things in their reviews, in general: (1) the food, and (2) their overall restaurant experience. For this dataset, this is a very intuitive result, and we probably didn't need a sophisticated modeling technique to tell it to us. When working with datasets from other domains, though, such high-level patterns may be much less obvious from the outset &mdash; and that's where topic modeling can help.


### ✅ Key Takeaways - Topic Modeling:
- **LDA discovers themes** automatically from text without labels
- **50 topics** capture food types (mexican, italian) and experiences (service, ambience)
- **Bag-of-words** representation loses word order but captures content
- **pyLDAvis** provides interactive exploration of topic relationships

💡 **Why it matters:** Understand millions of reviews at a glance by grouping similar themes!

**Real-World Applications:**
- 📰 News article clustering and recommendation
- 🛍️ Product review summarization
- 🔍 Document search and organization
- 📊 Customer feedback analysis


### Describing text with LDA
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% _Topic A_, 20% _Topic B_, 20% _Topic C_, and 10% _Topic D_.

To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:
1. Using spaCy to remove punctuation and lemmatize the text
1. Applying our first-order phrase model to join word pairs
1. Applying our second-order phrase model to join longer phrases
1. Removing stopwords
1. Creating a bag-of-words representation

Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The `lda_description(...)` function will perform all these steps for us, including printing the resulting topical description of the input text.

In [None]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    
    return list(it.islice(line_review(review_txt_filepath),
                          review_number, review_number+1))[0]

In [None]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-proccessing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
            # Step 1: Lemmatize and remove punctuation
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]
    
    # apply the first-order and secord-order phrase models
            # Step 2: Join two-word phrases (e.g., 'ice cream' -> 'ice_cream')
    bigram_review = bigram_model[unigram_review]
            # Step 3: Join three-word phrases (e.g., 'vanilla ice_cream' -> 'vanilla_ice_cream')
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
            # Step 4: Remove stopwords (common words like 'the', 'and')
    trigram_review = [term for term in trigram_review
                      if not term in spacy.lang.en.stop_words.STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
   
    # Sort topics by frequency (highest first)
    review_lda = sorted(review_lda, key=lambda x: -x[1])
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print(f'{topic_names[topic_number]:25} {round(freq, 3)}')


In [None]:
sample_review = get_sample_review(50)
print(sample_review)

In [None]:
lda_description(sample_review)

In [None]:
sample_review = get_sample_review(100)
print(sample_review)

In [None]:
lda_description(sample_review)

### 🌍 Real-World Applications: Word Embeddings

Word embeddings like Word2Vec power many real-world applications:

**Search & Information Retrieval:**
- **Google Search**: Uses BERT (based on word embeddings) to understand search intent
  - Query "how to fix slow computer" matches documents about "speed up PC performance"
  - Understands synonyms and related concepts without exact keyword matches

**Recommendation Systems:**
- **Amazon**: "Customers who viewed this item also viewed..."
  - Uses product embeddings (similar to word embeddings) to find related products
  - "laptop" vectors are close to "mouse", "keyboard", "laptop_bag" vectors
- **Spotify**: Recommends songs by learning music embeddings from listening patterns

**Content Moderation:**
- **Facebook/YouTube**: Detect toxic content and hate speech
  - Word embeddings help identify offensive terms and their variations
  - Can catch misspellings, slang, and code-switching

**Question Answering:**
- **ChatGPT/Alexa**: Use transformer models (evolution of Word2Vec)
  - Understand context and semantic meaning
  - Generate human-like responses

**For restaurants specifically:**
- **Yelp/Google Reviews**: Categorize reviews by topic (food, service, ambiance)
- **OpenTable**: Suggest restaurants based on review similarity
- **Food delivery apps**: Understand "spicy vegan noodles" ≈ "hot plant-based ramen"

The technique you're learning here is a fundamental building block of modern NLP!

## 🧮 Part 5: Word Vector Embedding with Word2Vec



### 🎯 Learning Objectives:
- Understand word vector embeddings
- Learn how Word2Vec captures semantic meaning
- Perform word algebra (breakfast + lunch = brunch)
- Explore vector space with similarity queries

**Time:** ~25 minutes | **Key Concept:** Dense vector representations of meaning


![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised &mdash they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word &mdash; i.e., the other words immediately before and after it &mdash; to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrouding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.

For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).

Word2vec has a number of user-defined hyperparameters, including:
- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

In [None]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.

In [None]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if RECOMPUTE_DATA:
    print('Training Word2Vec model...')
    print('Learning 100-dimensional vectors for ~50,000 words')
    print('Expected time: 2-3 minutes')

    # WHY vector_size=100:
    # - 50 dimensions: Too small, loses semantic nuances
    # - 300 dimensions: Industry standard for general text, but overkill for domain-specific (restaurant) text
    # - 100 dimensions: Sweet spot for this dataset - captures semantic relationships without overfitting
    #
    # WHY window=5:
    # - Context window: how many words before/after to consider
    # - window=2: Too narrow, misses important relationships
    # - window=10: Too wide, includes unrelated words, adds noise
    # - window=5: Standard choice, captures phrase-level context (e.g., "delicious homemade apple pie")
    #
    # WHY min_count=20:
    # - Ignore words appearing fewer than 20 times
    # - Rare words don't have enough context to learn good vectors
    # - With 4.2M reviews, min_count=20 filters typos and extreme outliers
    # - Keeps vocabulary at manageable ~50k terms instead of 200k+
    #
    # WHY sg=1 (skip-gram, not CBOW):
    # - sg=0 (CBOW): Predicts word from context, faster training
    # - sg=1 (skip-gram): Predicts context from word, better for rare words and phrases
    # - Skip-gram works better for phrase-heavy text like "chocolate_chip_cookie"
    #
    # WHY workers=4:
    # - Parallelize training across 4 CPU cores
    # - Training takes 2-3 minutes instead of 8-12 minutes

    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, vector_size=100, window=5,
                        min_count=20, sg=1, workers=4)
    
    food2vec.save(word2vec_filepath)

### ✅ Data Quality Check: Word2Vec Model Statistics

Let's verify our Word2Vec model learned good word embeddings.

In [None]:
# Word2Vec model quality metrics
print(f'📊 Word2Vec Model Statistics:')
print(f'  Vocabulary size: {len(food2vec.wv):,} terms')
print(f'  Vector dimensions: {food2vec.wv.vector_size}')
print(f'  Training epochs completed: {food2vec.epochs}')
print(f'  Window size: {food2vec.window}')

# Quick sanity check: similar words to "pizza"
print(f'\n  Sanity check - Words similar to "pizza":')
try:
    similar = food2vec.wv.most_similar('pizza', topn=5)
    for word, similarity in similar:
        print(f'    - {word:20s} (similarity: {similarity:.3f})')
    print(f'\n✅ Word2Vec model looks good! Semantically related words cluster together.')
except KeyError:
    print(f'  "pizza" not in vocabulary (appears fewer than {food2vec.min_count} times)')
    print(f'  Try another common word instead.')

In [None]:
%%time
if RECOMPUTE_DATA:
    token_count = sum([len(sentence) for sentence in trigram_sentences])
    # perform another 11 epochs of training
    for i in range(1,12):

        food2vec.train(trigram_sentences, total_examples= token_count, epochs=food2vec.epochs)
        food2vec.save(word2vec_filepath)
        print(f'{food2vec.train_count} training epochs so far.')
        
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
# Note: init_sims() is not needed in gensim 4.0+ (automatic optimization)

print(f'{food2vec.train_count} training epochs so far.')

In [None]:
#print('{:,} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))
print(f"{len(food2vec.wv):,} terms in the food2vec vocabulary.")


Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

In [None]:
# Build a list of (term, index, count) tuples from the vocabulary
ordered_vocab = []
for term in food2vec.wv.index_to_key:  # Terms in frequency-descending order
    ordered_vocab.append((
        term,
        food2vec.wv.key_to_index[term],            # index in the model
        food2vec.wv.get_vecattr(term, "count")    # frequency count
    ))

# Sort by term counts (most common first)
ordered_vocab = sorted(ordered_vocab, key=lambda x: -x[2])

# Unpack into separate lists for easier use
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# Create DataFrame with normalized word vectors as data
# Row labels are terms, columns are the 100 vector dimensions
word_vectors = pd.DataFrame(
    food2vec.wv.get_normed_vectors()[list(term_indices), :],
    index=ordered_terms
)

word_vectors.head()


Holy wall of numbers! This DataFrame has 50,835 rows &mdash; one for each term in the vocabulary &mdash; and 100 colums. Our model has learned a quantitative vector representation for each term, as expected.

Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?
The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

In [None]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):

        print(f'{word:20} {round(similarity, 3)}')


### What things are like Burger?

In [None]:
get_related_terms('burger')

The model has learned that fast food restaurants are similar to each other! In particular, *mcdonalds* and *wendy's* are the most similar to Burger King, according to this dataset. In addition, the model has found that alternate spellings for the same entities are probably related, such as *mcdonalds*, *mcdonald's* and *mcd's*.

### When is happy hour?

In [None]:
get_related_terms('happy_hour')

The model has noticed several alternate spellings for happy hour, such as *hh* and *happy hr*, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.

Taking a deeper look &mdash; the model has turned up phrases like *3-6pm*, *4-7pm*, and *mon-fri*, too. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, and what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something very important to do with that block of time around 3-7pm on weekdays.

### Let's make pasta tonight. Which style do you want?

In [None]:
get_related_terms('pasta', topn=20)

## Word algebra!
No self-respecting word2vec demo would be complete without a healthy dose of *word algebra*, also known as *analogy completion*.

The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Provide a set of words or phrases that you'd like to add or subtract.
1. Look up the vectors that represent those terms in the word vector model.
1. Add and subtract those vectors to produce a new, combined vector.
1. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
1. Return the word(s) associated with the similar vector(s).

But more generally, you can think of the vectors that represent each word as encoding some information about the *meaning* or *concepts* of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

In [None]:
def word_algebra(add=[], subtract=[], topn=2):
    """
    Perform word algebra by combining word vectors, then find similar words.
    
    How it works:
    1. Look up the 100-dimensional vector for each word in add= and subtract=
    2. Combine them: result_vector = sum(add vectors) - sum(subtract vectors)
    3. Compare this result_vector to EVERY word in the vocabulary
    4. Return the words whose vectors are most similar to result_vector
    
    The similarity score is COSINE SIMILARITY comparing:
      - The combined result_vector (from step 2)
      - Each candidate word's vector (from the vocabulary)
    
    Similarity interpretation:
    - 1.0 = candidate vector points exactly the same direction as result_vector
    - 0.0 = candidate vector is perpendicular to result_vector (unrelated)
    - Typical values: 0.3-0.9 for food/restaurant terms
    
    Example: word_algebra(add=['breakfast', 'lunch'])
      → Creates vector by adding breakfast_vector + lunch_vector
      → Finds words whose vectors are closest to this combined vector
      → Returns 'brunch' (high similarity ~0.85) because brunch_vector ≈ breakfast_vector + lunch_vector
    """
    answers = food2vec.wv.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print(f'{term:20s} (similarity: {similarity:.3f})')

### breakfast + lunch = ?
Let's start with a softball.

In [None]:
word_algebra(add=['breakfast', 'lunch'])

OK, so the model knows that *brunch* is a combination of *breakfast* and *lunch*. What else?

### lunch - day + night = ?

In [None]:
word_algebra(add=['lunch', 'night'], subtract=['day'])

Now we're getting a bit more nuanced. The model has discovered that:
- Both *lunch* and *dinner* are meals
- The main difference between them is time of day
- Day and night are times of day
- Lunch is associated with day, and dinner is associated with night

What else?

### taco - mexican + chinese = ?

In [None]:
word_algebra(add=['taco', 'chinese'], subtract=['mexican'])

Here's an entirely new and different type of relationship that the model has learned.
- It knows that tacos are a characteristic example of Mexican food
- It knows that Mexican and Chinese are both styles of food
- If you subtract *Mexican* from *taco*, you're left with something like the concept of a *"characteristic type of food"*, which is represented as a new vector
- If you add that new *"characteristic type of food"* vector to Chinese, you get *dumpling*.

What else?

### bun - american + mexican = ?

In [None]:
word_algebra(add=['bun', 'mexican'], subtract=['american'])

The model knows that both *buns* and *tortillas* are the doughy thing that goes on the outside of your real food, and that the primary difference between them is the style of food they're associated with.

What else?

### filet mignon - beef + seafood = ?

In [None]:
word_algebra(add=['filet_mignon', 'seafood'], subtract=['beef'])

The model has learned a concept of *delicacy*. If you take filet mignon and subtract beef from it, you're left with a vector that roughly corresponds to delicacy. If you add the delicacy vector to *seafood*, you get *raw oyster*.

What else?

### coffee - drink + snack = ?

In [None]:
word_algebra(add=['coffee', 'snack'], subtract=['drink'])

The model knows that if you're on your coffee break, but instead of drinking something, you're eating something... that thing is most likely a pastry.

What else?

This seems like a good place to land... what if we explore the vector space around *Applebee's* a bit, in a few different directions? Let's see what we find.

#### Applebee's + italian = ?

In [None]:
word_algebra(add=["applebee_'s", 'italian'])

#### Applebee's + pancakes = ?

In [None]:
word_algebra(add=["applebee_'s", 'pancakes'])

#### Applebee's + pizza = ?

In [None]:
word_algebra(add=["applebee_'s", 'pizza'])

You could do this all day. One last analogy before we move on...

### wine - grapes + barley = ?

In [None]:
word_algebra(add=['wine', 'barley'], subtract=['grapes'])

### 🔬 More Word Algebra Examples: Diverse Semantic Relationships

Let's explore different types of semantic transformations:

**Example: Cuisine Style Transfer**

What if we take Italian cuisine and make it spicy (like Thai food)?

In [None]:
# italian + spicy - mild = ?
word_algebra(add=['italian', 'spicy'], subtract=['mild'])

**Example: Dietary Substitutions**

What's the vegetarian version of a burger?

In [None]:
# burger + vegetarian - meat = ?
word_algebra(add=['burger', 'vegetarian'], subtract=['meat'])

**Example: Meal Time Transformations**

What happens when we move pizza from dinner to breakfast?

In [None]:
# pizza + breakfast - dinner = ?
word_algebra(add=['pizza', 'breakfast'], subtract=['dinner'])

**Example: Texture Transformations**

What's the crispy version of chicken (vs. baked)?

In [None]:
# chicken + crispy - baked = ?
word_algebra(add=['chicken', 'crispy'], subtract=['baked'])

**Example: Preparation Method Analogies**

If we take fish and prepare it Japanese-style (instead of American-style)?

In [None]:
# fish + japanese - american = ?
word_algebra(add=['fish', 'japanese'], subtract=['american'])

**Example: Atmosphere Transfer**

What's the outdoor equivalent of brunch?

In [None]:
# brunch + outdoor - indoor = ?
word_algebra(add=['brunch', 'outdoor'], subtract=['indoor'])

## 📉 Part 6: Word Vector Visualization with t-SNE



### 🎯 Learning Objectives:
- Understand dimensionality reduction with t-SNE
- Visualize 100-dimensional vectors in 2D space
- Create interactive plots with Bokeh
- Explore semantic relationships visually

**Time:** ~10 minutes | **Key Concept:** Visualizing high-dimensional data


[t-Distributed Stochastic Neighbor Embedding](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf), or *t-SNE* for short, is a dimensionality reduction technique to assist with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a low two- or three-dimensional representation such that the relative distances between points are preserved as closely as possible in both high-dimensional and low-dimensional space.

scikit-learn provides a convenient implementation of the t-SNE algorithm with its [TSNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) class.

In [None]:
from sklearn.manifold import TSNE

Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:
1. Drop stopwords &mdash; it's probably not too interesting to visualize *the*, *of*, *or*, and so on
1. Take only the 5,000 most frequent terms in the vocabulary &mdash; no need to visualize all ~50,000 terms right now.

In [None]:
# Limit to top N words for visualization performance
TOP_N_WORDS_FOR_VIZ = 5000

tsne_input = word_vectors.drop(spacy.lang.en.stop_words.STOP_WORDS, errors='ignore')
tsne_input = tsne_input.head(TOP_N_WORDS_FOR_VIZ)

In [None]:
tsne_input.head()

In [None]:
tsne_filepath = os.path.join(intermediate_directory,
                             'tsne_model')

tsne_vectors_filepath = os.path.join(intermediate_directory,
                                     'tsne_vectors.npy')

In [None]:
%%time

import numpy as np
from sklearn.manifold import TSNE
import pickle

# Comprehensive pandas compatibility fix for loading old pickled files
import sys
import pandas as pd
from pandas import Index

# Map old pandas modules
if 'pandas.core.indexes.numeric' not in sys.modules:
    import pandas.core.indexes.api as idx_api
    sys.modules['pandas.core.indexes.numeric'] = idx_api
if 'pandas.indexes' not in sys.modules:
    sys.modules['pandas.indexes'] = pd.core.indexes.api

# Map old Index types
pd.Int64Index = Index
pd.core.indexes.api.Int64Index = Index
pd.Float64Index = Index
pd.core.indexes.api.Float64Index = Index

if RECOMPUTE_DATA:
    # WHY use t-SNE:
    # - Can't visualize 100-dimensional vectors directly
    # - Need to reduce to 2D for plotting
    # - PCA: Linear reduction, fast but misses nonlinear relationships
    # - t-SNE: Nonlinear reduction, preserves local neighborhoods
    #   (words with similar meanings stay close together in 2D)
    #
    # WHY default parameters work well:
    # - perplexity=30 (default): Balances local vs global structure
    # - n_iter=1000 (default): Enough iterations for convergence
    # - learning_rate='auto': Adaptive learning for stable convergence
    #
    # WHY limit to 5,000 words (from tsne_input):
    # - t-SNE is O(n²) - computing on 50k words would take hours
    # - 5,000 most common words covers key vocabulary
    # - Still shows meaningful clusters and relationships
    
    print('Computing t-SNE projection of 5,000 word vectors...')
    print('Reducing from 100 dimensions to 2 dimensions')
    print('This takes 30-60 seconds...')
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)
    
    np.save(tsne_vectors_filepath, tsne_vectors)

with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)

tsne_vectors = np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=['x_coord', 'y_coord'])

Now we have a two-dimensional representation of our data! Let's take a look.

In [None]:
tsne_vectors.head()

In [None]:
tsne_vectors['word'] = tsne_vectors.index

### Plotting with Bokeh

In [None]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.core.properties import value  # ← moved here
from bokeh.models import WheelZoomTool

output_notebook()

In [None]:
plot_data = ColumnDataSource(tsne_vectors)

tsne_plot = figure(
    title='t-SNE Word Embeddings',
    width=800,
    height=800,
    tools="pan,wheel_zoom,box_zoom,box_select,reset",
)

# Make wheel zoom the active scroll tool
wheel = tsne_plot.select_one(WheelZoomTool)
tsne_plot.toolbar.active_scroll = wheel

tsne_plot.scatter(
    x='x_coord', y='y_coord',
    source=plot_data,
    size=10, line_alpha=0.2, fill_alpha=0.1,
)

tsne_plot.add_tools(HoverTool(tooltips=[("word", "@index")]))

tsne_plot.title.text_font_size = "16pt"
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

show(tsne_plot)

## Conclusion

## ❓ Frequently Asked Questions

### About the Data

**Q: Why use Yelp restaurant reviews?**
- Restaurant reviews are rich with domain-specific vocabulary
- Clear semantic relationships (food, cuisines, experiences)
- Large dataset (4.2M reviews) for robust model training
- Relatable domain - everyone understands food!

**Q: Can I use my own data?**
- Yes! Replace `review.json` with your own text data
- Format: One JSON object per line with a 'text' field
- Minimum: ~100K documents recommended for good Word2Vec results
- Adjust `min_count` parameters for smaller datasets

### About Word2Vec

**Q: Why 100 dimensions instead of 300?**
- 300 is common for general text (trained on billions of words)
- 100 is sufficient for domain-specific text (restaurant reviews)
- Lower dimensions = faster training, less overfitting
- You can experiment with `vector_size` parameter!

**Q: What's the difference between CBOW and skip-gram?**
- **CBOW** (`sg=0`): Predicts word from context, faster training
- **Skip-gram** (`sg=1`): Predicts context from word, better for rare words
- We use skip-gram because we have many phrases (trigrams)

**Q: Can I update the model with new data?**
- Yes! Use `food2vec.build_vocab(new_sentences, update=True)`
- Then `food2vec.train(new_sentences, ...)`
- Useful for incremental learning with new reviews

### About LDA

**Q: How do I choose the number of topics?**
- No single "correct" number - it's a hyperparameter
- Too few (10): Topics too broad, lack specificity
- Too many (200): Topics too narrow, redundant
- Rule of thumb: sqrt(vocabulary_size) as a starting point
- Try 20-100 for most applications, evaluate interpretability

**Q: Why do topic numbers differ between gensim and pyLDAvis?**
- Different indexing/ordering systems
- Match topics by top terms, not by numbers
- This is a known limitation - they use the same underlying data

**Q: Can I use LDA for prediction?**
- LDA is primarily for exploration and understanding
- Can use topic distributions as features for classification
- Example: Predict restaurant rating from topic mixture

### About Phrase Detection

**Q: How does phrase detection work?**
- Statistical approach: finds word pairs that appear together more often than chance
- Uses pointwise mutual information (PMI) or similar metrics
- `threshold` parameter controls how strict to be
- Higher threshold = only obvious phrases (ice_cream)
- Lower threshold = more phrases, some spurious

**Q: Can I add custom phrases?**
- Yes! Use Phraser with a custom dictionary
- Or post-process to force certain phrases
- Example: Always join "San_Francisco", "New_York"

### About Performance

**Q: Why is this so slow on my machine?**
- Text processing is CPU-intensive (4.2M reviews!)
- Use `RECOMPUTE_DATA = False` to skip expensive computations
- Consider running on cloud (Google Colab, AWS, etc.)
- Reduce dataset size for learning (sample 100K reviews)

**Q: Can I use GPU acceleration?**
- Gensim Word2Vec: CPU-optimized, GPU doesn't help much
- LDA: CPU-based algorithm, no GPU version
- For GPU: Use PyTorch or TensorFlow-based implementations

### Learning More

**Q: What should I learn next?**
- **Modern transformers**: BERT, GPT, T5 (evolution of Word2Vec)
- **Deep learning NLP**: PyTorch, Hugging Face Transformers
- **Advanced topic modeling**: BERTopic, Top2Vec
- **Production NLP**: spaCy pipelines, model deployment

**Q: Where can I find more resources?**
- **gensim tutorials**: radimrehurek.com/gensim/
- **spaCy course**: course.spacy.io
- **Papers**: Word2Vec original paper (Mikolov et al. 2013)
- **Books**: "Speech and Language Processing" (Jurafsky & Martin)

Whew! Let's round up the major components that we've seen:
1. Text processing with **spaCy**
1. Automated **phrase modeling**
1. Topic modeling with **LDA** $\ \longrightarrow\ $ visualization with **pyLDAvis**
1. Word vector modeling with **word2vec** $\ \longrightarrow\ $ visualization with **t-SNE**

#### Why use these models?
Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:
- Text classification
- Search
- Recommendations
- Question answering

...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.