# Document Creation for NLP Tasks
This notebook loads the Google reviews data and creates two distinct documents for each review:
1. **Review Document:** The raw text of the customer review.
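2. **Business Document:** A concatenation of the business name, category, and description.

The two documents are then pooled into one corpus, used to train an LDA topic model, and compared through the cosine similarity of their topic vectors.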

In [1]:
import pandas as pd

## 1. Load the data

In [3]:
df = pd.read_csv('/Users/yumin/Documents/GitHub/TikTok-TechJam-2025/data_gpt_labeler/final_data_labeled_1.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,rating,text,business_name,business_category,business_description,_id,policy_label
0,0,5,My husband took me here for my birthday! The ...,Buona Sera,"['Italian restaurant', 'Restaurant']",Casual trattoria serving familiar Italian entr...,1.089786951023629e+20_1507964592497,1
1,1,4,was a great place. Now closed. Too bad. We'll...,Buona Sera,"['Italian restaurant', 'Restaurant']",Casual trattoria serving familiar Italian entr...,1.124085651879834e+20_1519880201044,1
2,2,5,"cozy, great food. love how you can sign your b...",Buona Sera,"['Italian restaurant', 'Restaurant']",Casual trattoria serving familiar Italian entr...,1.1571615809204435e+20_1490033397528,1
3,3,4,"Not your average little Italian joint, don't e...",Buona Sera,"['Italian restaurant', 'Restaurant']",Casual trattoria serving familiar Italian entr...,1.181306411818766e+20_1511165172600,1
4,4,3,Service was great and the mozzarella fritta an...,Buona Sera,"['Italian restaurant', 'Restaurant']",Casual trattoria serving familiar Italian entr...,1.1430211684992369e+20_1505179108178,1


## 2. Create Review and Business Documents

In [5]:
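# Fill missing values before casting so NaN does not become the literal string 'nan'.
df['review_document'] = df['text'].fillna('').astype(str)
# Join name, category, and description with single spaces.
business_cols = ['business_name', 'business_category', 'business_description']
df['business_document'] = df[business_cols].fillna('').astype(str).agg(' '.join, axis=1)
df[['review_document', 'business_document']].head()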

Unnamed: 0,review_document,business_document
0,My husband took me here for my birthday! The ...,"Buona Sera ['Italian restaurant', 'Restaurant'..."
1,was a great place. Now closed. Too bad. We'll...,"Buona Sera ['Italian restaurant', 'Restaurant'..."
2,"cozy, great food. love how you can sign your b...","Buona Sera ['Italian restaurant', 'Restaurant'..."
3,"Not your average little Italian joint, don't e...","Buona Sera ['Italian restaurant', 'Restaurant'..."
4,Service was great and the mozzarella fritta an...,"Buona Sera ['Italian restaurant', 'Restaurant'..."
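
The output above shows that `business_category` is stored as a stringified Python list, so the brackets and quotes end up inside the business document. A minimal sketch of one way to flatten it first, assuming every category value is either such a string, a plain label, or missing. The helper names `parse_categories` and `build_business_document` are hypothetical and not part of the notebook:

```python
import ast


def parse_categories(raw):
    """Turn "['A', 'B']" into ['A', 'B']; fall back to the raw string if it does not parse."""
    if not isinstance(raw, str) or not raw.strip():
        return []  # missing value (None, NaN, empty string)
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return [raw]  # a plain label such as "Cafe"
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]


def build_business_document(name, category, description):
    """Join name, flattened category labels, and description with single spaces."""
    parts = [name or "", " ".join(parse_categories(category)), description or ""]
    return " ".join(part for part in parts if part)


example = build_business_document(
    "Buona Sera",
    "['Italian restaurant', 'Restaurant']",
    "Casual trattoria serving familiar Italian entrees.",
)
print(example)
```

Because the later preprocessing step keeps only alphabetic tokens, the brackets are dropped anyway before LDA. Flattening mainly matters if the business document is reused elsewhere in raw form.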


## 3. Display a single example

In [6]:
print('Review Document:')
print(df['review_document'].iloc[0])
print('---')
print('Business Document:')
print(df['business_document'].iloc[0])

Review Document:
My husband took me here for my birthday!  The best meal we have had here! The food is fresh and delicious!  They have a large selection of wine. It's is a very peaceful,  romantic setting.  I loved it. We will definitely come back!
---
Business Document:
Buona Sera ['Italian restaurant', 'Restaurant'] Casual trattoria serving familiar Italian entrees in a family-friendly setting with funky decor.


## 4. Create the Corpus

In [7]:
# Interleave the documents: each review is followed by its business document.
corpus = []
for _, row in df.iterrows():
    corpus.append(row['review_document'])
    corpus.append(row['business_document'])
print(f'Total number of documents in the corpus: {len(corpus)}')
print('---')
print('First 4 documents in the corpus:')
for doc in corpus[:4]:
    print(doc)
print('---')

Total number of documents in the corpus: 20000
---
First 4 documents in the corpus:
My husband took me here for my birthday!  The best meal we have had here! The food is fresh and delicious!  They have a large selection of wine. It's is a very peaceful,  romantic setting.  I loved it. We will definitely come back!
Buona Sera ['Italian restaurant', 'Restaurant'] Casual trattoria serving familiar Italian entrees in a family-friendly setting with funky decor.
was a great place.  Now closed. Too bad. We'll miss having comfort food from Buona Seras
Buona Sera ['Italian restaurant', 'Restaurant'] Casual trattoria serving familiar Italian entrees in a family-friendly setting with funky decor.
---
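
The same interleaving can be written without a row-by-row loop. The sketch below uses two short hypothetical lists in place of the DataFrame columns. With the notebook's data, the inputs would be `df['review_document']` and `df['business_document']`:

```python
from itertools import chain

# Hypothetical stand-ins for the two DataFrame columns.
reviews = ["great pasta and wine", "now closed, too bad"]
businesses = ["Buona Sera Italian restaurant", "Buona Sera Italian restaurant"]

# zip pairs each review with its business document; chain flattens the pairs in order.
corpus_sketch = list(chain.from_iterable(zip(reviews, businesses)))
print(corpus_sketch)

# Optional: keep each distinct business document once, preserving first-seen order.
unique_businesses = list(dict.fromkeys(businesses))
print(unique_businesses)
```

As the printed corpus shows, a business with many reviews contributes the same business document many times. Whether to deduplicate it before training is a modelling choice. The notebook keeps the duplicates.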


## 5. Preprocess Corpus for LDA

In [8]:
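import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Tokenizer models, stop-word list, and WordNet data for the lemmatizer.
# Recent NLTK releases may additionally require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stop words, then lemmatize."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(w) for w in tokens if w.isalpha() and w not in stop_words]

# One token list per document, in corpus order.
processed_corpus = [preprocess_text(doc) for doc in corpus]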

[nltk_data] Downloading package punkt to /Users/yumin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/yumin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/yumin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
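
To make the effect of `preprocess_text` concrete, the sketch below mimics its filters (lowercasing, alphabetic tokens only, stop-word removal) using only the standard library. It is a simplified stand-in, not the notebook's pipeline. The regex tokenizer and the tiny `TOY_STOP_WORDS` set are assumptions, and lemmatization is skipped:

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "is", "a", "and", "we", "have", "here", "it", "for", "me", "my"}


def toy_preprocess(text):
    """Lowercase, keep runs of letters as tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in TOY_STOP_WORDS]


tokens = toy_preprocess("The food is fresh and delicious!")
print(tokens)  # ['food', 'fresh', 'delicious']
```

Punctuation and digits disappear and only content words remain. That is the form the dictionary and bag-of-words steps expect.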


## 6. Create Dictionary and Corpus for LDA

In [9]:
from gensim import corpora
dictionary = corpora.Dictionary(processed_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
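
The bag-of-words step maps each token to an integer ID and then represents every document as a list of `(token_id, count)` pairs. The stdlib sketch below illustrates that idea on toy token lists. The ID assignment order is illustrative only and need not match the IDs gensim's `Dictionary` would assign:

```python
from collections import Counter

# Hypothetical preprocessed documents (lists of tokens).
docs = [["food", "fresh", "food"], ["wine", "food"]]

# Assign each new token the next free integer ID.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Count known tokens and return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


bow = [to_bow(doc) for doc in docs]
print(token2id)  # {'food': 0, 'fresh': 1, 'wine': 2}
print(bow)       # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Word order is discarded at this point. LDA only sees how often each vocabulary entry occurs in each document.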

## 7. Train the LDA Model

In [10]:
from gensim.models import LdaModel
lda_model = LdaModel(bow_corpus, num_topics=100, id2word=dictionary, passes=15)
for idx, topic in lda_model.print_topics(-1):
    print(f'Topic: {idx} Words: {topic}')

Topic: 0 Words: 0.498*"range" + 0.220*"snug" + 0.141*"neighborhood" + 0.012*"non" + 0.011*"trailhead" + 0.007*"confusing" + 0.004*"choosing" + 0.004*"shortest" + 0.001*"gated" + 0.000*"musbi"
Topic: 1 Words: 0.094*"starbucks" + 0.078*"bite" + 0.078*"coffeehouse" + 0.077*"light" + 0.076*"signature" + 0.076*"roast" + 0.075*"wifi" + 0.075*"chain" + 0.075*"known" + 0.074*"availability"
Topic: 2 Words: 0.177*"please" + 0.102*"city" + 0.081*"broad" + 0.072*"cashier" + 0.050*"reward" + 0.047*"purchase" + 0.041*"ranch" + 0.040*"onolicious" + 0.023*"unlike" + 0.012*"answering"
Topic: 3 Words: 0.235*"taste" + 0.231*"like" + 0.133*"ice" + 0.120*"much" + 0.079*"cream" + 0.058*"server" + 0.033*"flavor" + 0.018*"butter" + 0.016*"treat" + 0.009*"question"
Topic: 4 Words: 0.247*"kitchen" + 0.117*"super" + 0.113*"experience" + 0.083*"wonderful" + 0.043*"quickly" + 0.040*"various" + 0.036*"located" + 0.027*"ate" + 0.021*"sign" + 0.021*"incredible"
Topic: 5 Words: 0.274*"topping" + 0.242*"valley" + 0.055

## 8. Visualize the Topics

In [12]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
vis

## 9. Compute Similarity Score

In [13]:
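import numpy as np
from scipy.spatial.distance import cosine

def get_lda_vector(text):
    """Return the dense topic distribution of `text` under the trained LDA model."""
    processed_text = preprocess_text(text)
    bow_vector = dictionary.doc2bow(processed_text)
    # minimum_probability=0.0 asks gensim to report every topic, not just the dominant ones.
    lda_vector = lda_model.get_document_topics(bow_vector, minimum_probability=0.0)
    dense_vector = np.zeros(lda_model.num_topics)
    for topic_num, prop_topic in lda_vector:
        dense_vector[topic_num] = prop_topic
    return dense_vector

def calculate_cosine_similarity(vec1, vec2):
    """Cosine similarity, computed as 1 minus SciPy's cosine distance."""
    return 1 - cosine(vec1, vec2)
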
# Example with the first review
sample_review = df['review_document'].iloc[0]
sample_business = df['business_document'].iloc[0]
review_vector = get_lda_vector(sample_review)
business_vector = get_lda_vector(sample_business)
similarity_score = calculate_cosine_similarity(review_vector, business_vector)
print(f'Review: {sample_review}')
print(f'Business: {sample_business}')
print(f'Similarity Score (Trustworthiness): {similarity_score:.4f}')

Review: My husband took me here for my birthday!  The best meal we have had here! The food is fresh and delicious!  They have a large selection of wine. It's is a very peaceful,  romantic setting.  I loved it. We will definitely come back!
Business: Buona Sera ['Italian restaurant', 'Restaurant'] Casual trattoria serving familiar Italian entrees in a family-friendly setting with funky decor.
Similarity Score (Trustworthiness): 0.1355
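
For reference, the score above is the cosine of the angle between the two topic vectors: their dot product divided by the product of their lengths. The sketch below spells that out in plain Python on hypothetical three-topic vectors. The zero-vector guard is an added assumption and is not something the notebook's function does:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as sharing no topics
    return dot / (norm_u * norm_v)


# Hypothetical topic distributions over three topics.
review_topics = [0.7, 0.2, 0.1]
business_topics = [0.1, 0.2, 0.7]

same = cosine_similarity(review_topics, review_topics)
different = cosine_similarity(review_topics, business_topics)
print(f"{same:.4f} {different:.4f}")
```

Topic proportions are non-negative, so the score falls between 0 and 1. Values near 1 mean the review and the business description load on the same topics.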
