## Take a few hours so don't run

# TOPIC IDENTIFICATION

Install yake if you haven't already. Yake is an unsupervised approach to key phrase detection. We can specify the size of n-grams to detect. We will use trigrams. https://github.com/LIAAD/yake

In [None]:
% pip install git+https://github.com/LIAAD/yake

In [None]:
import pandas as pd

reviews = pd.read_json('../data/ys-reviews-with-categories.json')

restuarants = reviews.loc[reviews.category == "restaurant"]

sample: list[str] = restuarants.text.tolist()

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation

import yake as yk

stop_words = stopwords.words("english")
# Define a keyword extractor that returns the top 10,000 trigrams
keyword_extractor = yk.KeywordExtractor(n=3,top=10000)
filtered_words = []
    
def removed_stopwords(text: str) -> str:
    # make lower case and remove all punctuations
    text = text.lower().translate(str.maketrans('', '', punctuation))
    tokens = word_tokenize(text)
    # filter stopwords    
    filtered = [word for word in tokens if word not in stop_words]
    filtered = [word for word in tokens if word not in filtered_words]
    return ' '.join(filtered)

# Join all restaurant reviews into one corpus
corpus = "\n".join(sample)
corpus = removed_stopwords(corpus)
# Perform keyword extraction on review corpus
extractions = keyword_extractor.extract_keywords(corpus)

# Save results to a CSV
with open('keywords.csv', 'w') as outfile:
    outfile.write(pd.DataFrame(extractions).to_csv(index=False))

In [None]:
keywords_df = pd.read_csv('keywords.csv')
keywords = keywords_df[0:1000][keywords_df.columns[0]].tolist()
keywords

['food service food',
 'food food service',
 'service food food',
 'service food service',
 'food customer service',
 'friendly staff food',
 'friendly service food',
 'fast service food',
 'food fast service',
 'food friendly service',
 'food service service',
 'food excellent service',
 'nice place food',
 'customer service food',
 'food friendly staff',
 'place service food',
 'place food service',
 'food excellent food',
 'service food nice',
 'service excellent food',
 'love love love',
 'food nice place',
 'excellent food food',
 'food delicious food',
 'service friendly staff',
 'service delicious food',
 'service awesome food',
 'service service food',
 'excellent food service',
 'food service',
 'service food love',
 'food service love',
 'food service prices',
 'restaurant food service',
 'food place service',
 'food food',
 'service food',
 'place food place',
 'food nice service',
 'excellent service food',
 'delicious food service',
 'food service fast',
 'food food friend

## Insights

**Observations:**
- For our purposes, we will ignore value words like "good", "nice", or "great" as they do not provide context on the subject
- Food is the most common meaningful topic that shows up in the restaurant review corpus
- Service is the next most common
- Location descriptors like "place" and "atmosphere" show up occasionally
- Cleanliness shows up occasionally
- Price shows up occassionally

**Conclusions**
- We have determined that food, service, location, cleanliness, and price are common topics in the restaurant review corpus.
- Now, we will detect if each of these topics is present in a review using word embeddings and cosine similarity.

# TOPIC DETECTION

## Set Up

Install libraries. SpaCy is for NLP pipeline and Sense2Vec is for token similarity.

In [None]:
%pip install sense2vec
%pip install spacy

Run this if you don't have `ys-reviews-restaurants.json` already!

In [2]:
import pandas as pd
reviews = pd.read_json('../data/ys-reviews-with-categories.json')

restaurants = reviews.loc[reviews.category == "restaurant"]
restaurants.drop(columns=['category'], inplace=True)
restaurants.to_json('../data/ys-reviews-restaurants.json', orient='records')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  restaurants.drop(columns=['category'], inplace=True)


Run this command to download the default spaCy pipeline. If this doesn't work, try running in the command line.

In [None]:
! python -m spacy download encore_web_sm

Go to this link to download a zip archive of the Reddit Sense2Vec model: https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
Make sure the folder `s2v_old` is in your directory!

## Start here if you've already done set up!

In [1]:
import pandas as pd
import spacy
from spacy import Language
from sense2vec import Sense2Vec
# 
s2v = Sense2Vec().from_disk("../s2v_old")
# Create a pipe that converts lemmas to lower case:
@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc) :
    for token in doc :
        token.lemma_ = token.lemma_.lower()
    return doc
# Initialize default spaCy pipeline
nlp = spacy.load('en_core_web_sm', disable=['ner'])
# lower_case_lemmas to pipeline
nlp.add_pipe(factory_name="lower_case_lemmas", after="tagger")
# Sanity check to make sure we have the right pipeline order
print(nlp.pipe_names)

['tok2vec', 'tagger', 'lower_case_lemmas', 'parser', 'attribute_ruler', 'lemmatizer']


Load the reviews into a spaCy doc object. This takes ~45 minutes to run because it is tokenizing, parsing, lemmatizing, etc.

In [None]:

# Pipe restaurant review text into spaCy pipeline
# Each review is a "doc" in "docs"
reviews = pd.read_json('../data/ys-reviews-restaurants.json', orient='records')
docs = list(nlp.pipe(reviews['text'].to_list()))

Use `topicDetection()` for food and service. `topic_list` contains similar words that describe the topic. Sense2Vec takes the average of the vector representations of each word. This defines a vector that is centered around the topic region of of vector space.

In [None]:
# Detect if a topic defined by topic_list is present in a sentence (span from spaCy doc)
# If a doc has n sentences, return a list of n booleans, where each index represent a topic present or not
# pos is a list of parts of speech to consider from doc
# thresh is a threshold for cosine similarity. If similarity > threshold, topic is present
def topicDetection(sentence, topic_list : list[str], pos : list[str], thresh) -> list[int]:
    indices = []
    for i, token in enumerate(sentence):
      # Construct string to pass to Sense2Vec
      s = token.lemma_ + "|" + token.pos_
      # Only consider tokens that Sense2Vec model knows and are from specified part of speech
      if (s in s2v and token.pos_ in pos) and (s2v.similarity(s, topic_list) > thresh):
        indices.append(i)
    # return a list of indices where topic was detected
    return indices
    

Use `separateTopicsDetection()` on location, clean, and price topics. These topics are described by disjoint concepts that don't create a meaningful average vector. For example, "clean" and "dirty" are opposite ideas. `separateTopicsDetection()` looks for tokens that are similar to at least word in `topics_list`.

In [None]:
# Operates like TopicDetection, except looks or matches to each string in topics_list seperately
# Instead of averaging their vector representations
def seperateTopicsDetection(sentence, topics_list : list[str], thresh, exclude_pos = []) -> list[int]:
    indices = []
    for i, token in enumerate(sentence):
      # Skip token if explicitly told to ignore part of speech
      if token.pos_ in exclude_pos:
        continue
      # Construct string to pass to Sense2Vec
      s = token.lemma_ + "|" + token.pos_
      # Only consider tokens that Sense2Vec model knows
      if s in s2v:
        # Add to indices list if token matches at least one topic from topic_list
        for topic in topics_list:
          if s2v.similarity(s, topics_list) > thresh:
            indices.append(i)
            break
    return indices

## Detect Food topic

Perform `topicDetection()` on each sentence of a doc in docs. `topicDetection()` returns a list of indices where topic was detected in the sentence. Record the token lemma, doc index, sentence index, and token index in `food_hits` list.

In [None]:
# Sense2Vec will compute an average vector from the vector representation of these tokens
food = ["food|NOUN", "pizza|NOUN", "meal|NOUN", "taco|NOUN", "chinese|ADJ", "mexican|ADJ", "sushi|NOUN", "bone|NOUN", "drink|NOUN", "pho|NOUN", "curry|NOUN", "coffee|NOUN", "teriyaki|NOUN"]
food_hits = []
for i, doc in enumerate(docs):
  for j, sentence in enumerate(doc.sents):
    # 
    for k in topicDetection(sentence, food, ["NOUN", "ADJ"], 0.6):
      # for each token where the food topic is detected
      # record lemma, doc index, sentence index, and token index
      food_hits.append([sentence[k].lemma_ , i, j, k])

Convert `food_hits` to a dataframe and save to JSON file.

In [None]:
food_hits = pd.DataFrame(data=food_hits, columns=['lemma', 'doc_index', 'sentence_index', 'token_index'])
food_hits.to_json('../data/topics/food-hits-restaurant-reviews.json', orient='records')

## Detect Service Topic

Same process as the food topic.

In [None]:
service = ["waiter|NOUN", "staff|NOUN", "service|NOUN", "employee|NOUN"]
service_hits = []
for i, doc in enumerate(docs):
  for j, sentence in enumerate(doc.sents):
    for k in topicDetection(sentence, service, ["NOUN", "ADJ"], 0.7):
      # for each token where the food topic is detected
      # record lemma, doc index, sentence index, and token index
      service_hits.append([sentence[k].lemma_ , i, j, k])

Issue: the word "restaurant" is very similar to the service topic, but shouldn't be included. Filter out instances of restaurant.

In [None]:
remove = ["restaurant", "restraunt", "restaraunt"]
service_hits = pd.DataFrame(data=service_hits, columns=['lemma', 'doc_index', 'sentence_index', 'token_index'])
# Remove "restaurant" or any typos from service hits
service_hits = service_hits[~((service_hits['lemma'] == "restaurant") | (service_hits['lemma'] == "restraunt") | (service_hits['lemma'] == "restaraunt"))]
service_hits.to_json('../data/topics/service-hits-restaurant-reviews.json', orient='records')

## Detect Location Topic

Use `seperateTopicsDetection()` to detect if tokens match any of the tokens in `location` list.

In [None]:
location = ["crowded|ADJ", "atmosphere|NOUN", "quiet|ADJ", "interior|NOUN", "music|NOUN", "environment|NOUN", "space|NOUN", "vibe|NOUN", "location|NOUN"]
location_hits = []
for i, doc in enumerate(docs):
  for j, sentence in enumerate(doc.sents):
    for k in seperateTopicsDetection(sentence, location, 0.67):
      # for each token where the food topic is detected
      # record lemma, doc index, sentence index, and token index
      location_hits.append([sentence[k].lemma_ , i, j, k])

In [None]:
location_hits = pd.DataFrame(data=location_hits, columns=['lemma', 'doc_index', 'sentence_index', 'token_index'])
location_hits = location_hits[~((location_hits['lemma'] == "especially"))]
location_hits.to_json('../data/topics/location-hits-restaurant-reviews.json', orient='records')

## Detect Clean Topic

In [None]:
clean = ["clean|ADJ", "dirty|ADJ", "fly|NOUN", "cockroach|NOUN", "filthy|ADJ", "spotless|ADJ"]
clean_hits = []
for i, doc in enumerate(docs):
  for j, sentence in enumerate(doc.sents):
    for k in seperateTopicsDetection(sentence, clean, 0.7):
      # for each token where the food topic is detected
      # record lemma, doc index, sentence index, and token index
      clean_hits.append([sentence[k].lemma_ , i, j, k])

In [None]:
clean_hits = pd.DataFrame(data=clean_hits, columns=['lemma', 'doc_index', 'sentence_index', 'token_index'])
clean_hits.to_json('../data/topics/clean-hits-restaurant-reviews.json', orient='records')

## Detect Price Topic

In [None]:
price = ["cheap|ADJ", "expensive|ADJ", "price|NOUN", "worth|NOUN", "payment|NOUN", "tip|NOUN"]
price_hits = []
for i, doc in enumerate(docs):
  for j, sentence in enumerate(doc.sents):
    # exclude verbs like "pay" or "buy"
    for k in seperateTopicsDetection(sentence, price, 0.7, ["VERB"]):
      # for each token where the food topic is detected
      # record lemma, doc index, sentence index, and token index
      price_hits.append([sentence[k].lemma_ , i, j, k])

In [None]:
price_hits = pd.DataFrame(data=price_hits, columns=['lemma', 'doc_index', 'sentence_index', 'token_index'])
# clean_hits = clean_hits[~((location_hits['lemma'] == "especially"))]
price_hits.to_json('../data/topics/price-hits-restaurant-reviews.json', orient='records')