### Word Embeddings Neural Network Training

The purpose of this notebook is to train and save a Word2Vec model that converts each word in the Yelp reviews corpus to a word vector. Once training is complete, the model can be used in this sentiment analysis project to create a new and improved dataset for classifying the sentiment of a Yelp review, and possibly to support an app review summarizer. The neural network is trained on 20,000 reviews.

### Technical Challenges

The main technical challenges to overcome will be the following:
1. Implementing the gensim package correctly so the Word2Vec model can be trained
2. Balancing the training time with the accuracy, as the number of weights the model will need to solve for could be as high as 900 million (about 1.5 million words in the corpus × 300 neurons × 2 layers); the actual count depends on how many unique words are kept in the vocabulary
3. Evaluating the performance of this model, comparing it to the performance of the other approaches, and interpreting the results
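
The estimate in item 2 can be reproduced with a line of arithmetic. This is only a sketch: the function name is ours, and it assumes the model holds one `vocab_size x vector_size` weight matrix per layer, using the figures quoted in the list above.

```python
def estimate_weight_count(vocab_size: int, vector_size: int, n_layers: int = 2) -> int:
    """Rough weight count: one vocab_size x vector_size matrix per layer."""
    return vocab_size * vector_size * n_layers

# Figures quoted above: ~1.5 million words, 300 neurons, 2 layers
print(f"{estimate_weight_count(1_500_000, 300):,}")  # 900,000,000
```

Passing a smaller `vocab_size` (for example, the number of unique words that survive filtering) shows how quickly the estimate shrinks.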

**We will use the gensim package to achieve these goals.**

### Import Libraries

In [9]:
import gensim
import gensim.models
from gensim import utils
import pandas as pd
import numpy as np

### Load the Yelp dataset

In [2]:
reviews = pd.read_csv("yelp_reviews_v3.csv")
reviews.drop("Unnamed: 0", axis = 1, inplace = True)
reviews.head(5)

,date,review,rating,isEdited,title,userName,developerResponse
0,2024-11-22 22:44:23,I say it can be fantastic because some people ...,5,False,Yelp can be fantastic,Robg80,
1,2024-12-12 22:08:33,Yelp's developers have been spamming false 5-s...,1,False,Review botting should not be tolerated!,itsbad):,
2,2024-10-11 18:43:56,I will not be using Yelp ever again. After a t...,1,False,Horrible,jennausuwiajdneka,
3,2024-09-22 20:35:32,During think tank meetings with other business...,1,False,Is yelp fair?,Srepman,"{'id': 46973211, 'body': 'Thank you for taking..."
4,2024-12-13 03:52:13,If I could give this place a 0 star I absolute...,1,False,Horrible service,Tsimmons96,


### Define a generator to feed each review to the model

In [5]:
# This will give the tokenized docs (reviews) to the Word2Vec model
class MyCorpus:
    def __iter__(self):
        for doc in reviews['review']:
            yield utils.simple_preprocess(doc)
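
Word2Vec passes over the corpus more than once (once to build the vocabulary, then once per training epoch), which is why the corpus is wrapped in a class with `__iter__` rather than written as a plain generator function. The toy sketch below illustrates the difference. The names `one_shot`, `Restartable`, and `toy_docs` are stand-ins that are not part of this notebook, and `str.split` is used in place of `simple_preprocess` to keep the example dependency-free.

```python
def one_shot(docs):
    """A plain generator: it can be consumed only once."""
    for doc in docs:
        yield doc.lower().split()


class Restartable:
    """An iterable whose __iter__ starts a fresh pass every time it is called."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc.lower().split()


toy_docs = ["Great food", "Horrible service"]  # hypothetical stand-in for reviews['review']

gen = one_shot(toy_docs)
first_pass = list(gen)
second_pass = list(gen)      # empty: the generator is already exhausted

corpus_like = Restartable(toy_docs)
pass_a = list(corpus_like)
pass_b = list(corpus_like)   # the same two documents again

print(len(first_pass), len(second_pass))  # 2 0
print(len(pass_a), len(pass_b))           # 2 2
```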

In [8]:
# Display the first 3 docs
corpus = MyCorpus()

for i, doc in enumerate(corpus):
    if i > 2:
        break
    else:
        print(f"Doc {i + 1}")
        print(doc)

Doc 1
['say', 'it', 'can', 'be', 'fantastic', 'because', 'some', 'people', 'choose', 'to', 'not', 'post', 'positive', 'reviews', 'and', 'try', 'to', 'disparage', 'businesses', 'with', 'multiple', 'reviews', 'that', 'are', 'intentionally', 'left', 'to', 'hurt', 'the', 'business', 'think', 'yelp', 'has', 'done', 'better', 'job', 'of', 'making', 'sure', 'that', 'these', 'reviews', 'are', 'accurate', 'and', 'also', 'giving', 'the', 'business', 'owners', 'the', 'opportunity', 'to', 'review', 'the', 'reviews', 'in', 'some', 'cases', 'revoking', 'the', 'reviews', 'which', 'believe', 'business', 'owner', 'should', 'have', 'the', 'right', 'to', 'do', 'will', 'always', 'use', 'yelp', 'use', 'it', 'for', 'everything', 'that', 'search', 'for', 'to', 'make', 'sure', 'that', 'get', 'the', 'best', 'quality', 'service', 'the', 'only', 'way', 'yelp', 'piggy', 'get', 'better', 'is', 'if', 'we', 'make', 'it', 'better', 'by', 'posting', 'honest', 'reviews', 'remembering', 'the', 'good', 'service', 'as', '

### Train the Word2Vec Model

See documentation for details: https://radimrehurek.com/gensim/models/word2vec.html     
The following code initializes the model, builds the vocabulary, and trains the neural network, which we can then query for word embeddings.
- Note: we don't need to explicitly call `.train()` here, because passing `sentences` to the constructor already builds the vocabulary and trains the model, and we don't need to update the weights afterwards. `.train()` is useful when the model will be trained over multiple sessions.

In [12]:
model = gensim.models.Word2Vec(
    sentences = corpus, # document source
    vector_size = 300, # no. of neurons to use/size of the word vectors
    window = 5, # context window
    sg = 1, # skip-gram model
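    min_count = 2, # minimum number of times a word has to appear to be kept
    workers = 2, # number of worker threads (parallelization)
    negative = 10, # number of negative samples drawn per positive example
    seed = 34 # random seed for initialization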
)

### Explore the model

In [14]:
model.wv.most_similar(positive = 'horrible')

[('terrible', 0.8331689834594727),
 ('lousy', 0.7492699027061462),
 ('horrific', 0.7417861223220825),
 ('awful', 0.7399375438690186),
 ('crappy', 0.7157835960388184),
 ('nonexistent', 0.7104875445365906),
 ('horrendous', 0.7045280933380127),
 ('prejudice', 0.6949297189712524),
 ('weak', 0.6941744685173035),
 ('disgusting', 0.6920444965362549)]

In [15]:
model.wv.most_similar(positive = 'yelp')

[('yelps', 0.763877272605896),
 ('religiously', 0.7354929447174072),
 ('ap', 0.7177634239196777),
 ('factor', 0.7171481251716614),
 ('unknown', 0.7122806906700134),
 ('exclusively', 0.7120497226715088),
 ('researching', 0.7117982506752014),
 ('heavily', 0.7111935019493103),
 ('contributing', 0.711171567440033),
 ('incentives', 0.7108086943626404)]

In [16]:
model.wv.most_similar(positive = 'amazing')

[('awesome', 0.8157830238342285),
 ('phenomenal', 0.793861448764801),
 ('incredible', 0.7890821099281311),
 ('fantastic', 0.7877715826034546),
 ('outstanding', 0.7709816694259644),
 ('fabulous', 0.7696936130523682),
 ('superb', 0.7609773278236389),
 ('terrific', 0.7587746977806091),
 ('wonderful', 0.7521498203277588),
 ('spectacular', 0.7124101519584656)]
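
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking by hand on three made-up 3-dimensional vectors. The numbers are invented for illustration and are not taken from the trained model, and `rank_neighbors` is a hypothetical helper, not a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbors(word, vectors):
    """Sort every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up toy vectors (assumption), not values from the trained model
toy_vectors = {
    "amazing": [0.9, 0.1, 0.3],
    "awesome": [0.8, 0.2, 0.4],
    "horrible": [-0.7, 0.6, -0.2],
}

ranked = rank_neighbors("amazing", toy_vectors)
print(ranked[0][0])  # awesome
```

A score close to 1 means two words point in nearly the same direction in the embedding space; a negative score means they point in opposing directions.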

### Save the model to disk

In [17]:
import tempfile

with tempfile.NamedTemporaryFile(prefix = 'gensim-model-', delete = False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)

Note: this saves the model to a temporary filepath. If you need to load it again, find the file at that path (stored in `temporary_filepath`) and move it to the working directory.
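
One possible way to do the move is sketched below. `move_model_files` is a hypothetical helper, not a gensim function. It assumes that gensim may write large arrays to sibling files that share the model's filename prefix, so it moves every file starting with that prefix and renames them consistently. The target name `yelp_word2vec.model` is only an example.

```python
import shutil
from pathlib import Path


def move_model_files(saved_path, dest_dir, new_name):
    """Move a saved model, plus any sibling files sharing its filename prefix."""
    src = Path(saved_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for f in sorted(src.parent.iterdir()):
        if f.is_file() and f.name.startswith(src.name):
            # Keep any suffix after the original name, but swap in the new base name
            target = dest_dir / (new_name + f.name[len(src.name):])
            shutil.move(str(f), str(target))
            moved.append(target)
    return moved


# Example call (left commented out because it moves real files):
# move_model_files(temporary_filepath, Path.cwd(), "yelp_word2vec.model")
```

After the move, the model can be reloaded with `gensim.models.Word2Vec.load("yelp_word2vec.model")`.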

### Create a dataframe using the model

In [18]:
# Calculates the average word vector for a document
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0
    for token in tokens:
        try:
            vec += model.wv[token].reshape((1, size))
            count += 1
        except KeyError:
            # Skip tokens that are not in the model's vocabulary
            continue

    if count != 0:
        vec /= count
    # Always return a vector; a doc with no known tokens gets all zeros
    return vec
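
To make the averaging concrete, here is a toy sketch with a made-up two-dimensional embedding table. `toy_embeddings` and `average_vector` are illustrative names, and the dictionary stands in for `model.wv`. Out-of-vocabulary tokens are skipped, and a document with no known tokens is mapped to a zero vector, which is one possible convention.

```python
import numpy as np

# Made-up 2-dimensional embeddings (assumption), standing in for model.wv
toy_embeddings = {
    "good": np.array([1.0, 0.0]),
    "food": np.array([0.0, 1.0]),
}


def average_vector(tokens, embeddings, size):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)


print(average_vector(["good", "food", "zzz"], toy_embeddings, 2))  # [0.5 0.5]
print(average_vector(["zzz"], toy_embeddings, 2))                  # [0. 0.]
```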

In [19]:
# Counts the docs in the corpus (it is an iterable, so it has no len())
def calculate_num_docs(corp):
    return sum(1 for _ in corp)

num_docs = calculate_num_docs(corpus)

# Creates a dataframe where each row represents a doc's average word vector
wordvec_arrays = np.zeros((num_docs, 300))

for i, doc in enumerate(corpus):
    wordvec_arrays[i, :] = word_vector(doc, 300)

wordvec_df = pd.DataFrame(wordvec_arrays)

wordvec_df

,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.227540,-0.170826,0.115066,0.149463,-0.026197,0.008393,-0.024517,0.025794,0.067489,0.097428,...,-0.036327,0.172643,0.034766,0.276930,0.098627,-0.086076,-0.055905,-0.056114,-0.002599,0.101394
1,-0.199238,-0.200579,0.116510,0.111658,-0.064551,0.020515,-0.017088,0.006766,0.083793,0.163523,...,-0.059782,0.178555,0.034858,0.283163,0.098012,-0.086388,-0.065492,-0.055511,-0.080624,0.102011
2,-0.229403,-0.180689,0.149343,0.133359,-0.036580,0.031731,-0.011592,-0.025878,0.081141,0.153825,...,-0.038574,0.177649,0.016944,0.292426,0.114490,-0.116326,-0.075048,-0.072371,-0.070862,0.127622
3,-0.237999,-0.193224,0.123863,0.154978,-0.065055,0.050980,0.001317,0.003766,0.086413,0.102024,...,-0.042580,0.149542,0.012284,0.302916,0.106729,-0.074216,-0.042382,-0.052624,-0.039711,0.112512
4,-0.203110,-0.214482,0.103448,0.181231,0.010507,-0.013264,-0.050693,0.035647,0.080304,0.143670,...,-0.005573,0.180557,0.059697,0.240646,0.098397,-0.047990,-0.073707,-0.052928,-0.058238,0.162846
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,-0.249139,-0.171055,0.097228,0.166573,-0.020160,0.042528,-0.017260,0.009062,0.058821,0.084938,...,-0.031552,0.154122,0.021346,0.282789,0.088902,-0.080313,-0.079995,-0.057006,0.013764,0.097351
19996,-0.215090,-0.221100,0.088470,0.121938,0.019137,-0.050513,0.013177,0.038186,0.057446,0.141945,...,-0.013193,0.158821,0.045236,0.262819,0.139202,-0.017365,-0.102897,-0.045341,-0.061465,0.188925
19997,-0.189118,-0.215414,0.072661,0.152371,-0.037037,-0.008754,0.008548,0.101564,0.067704,0.090034,...,-0.012003,0.181716,0.094480,0.284015,0.117345,-0.019341,-0.033495,-0.063061,-0.003370,0.059299
19998,-0.238020,-0.189898,0.068737,0.205997,-0.052160,0.024170,-0.036865,0.039188,0.100074,0.162745,...,-0.038671,0.158611,0.056266,0.266769,0.099429,-0.048835,-0.034498,-0.065789,0.064915,0.104764


The above dataframe will be used to build random forest and logistic regression models for sentiment classification.