
# Word2Vec Model for Cell Phone and Accessories Reviews

### Install Libraries

This cell installs the libraries this notebook needs: `gensim` for training word embeddings and `python-Levenshtein` for fast string similarity calculations (it is not imported directly in this notebook).

In [6]:
!pip install gensim
!pip install python-Levenshtein

Collecting gensim
  Using cached gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Using cached gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


### Import Libraries

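This cell imports `gensim`, `pandas`, the `simple_preprocess` tokenizer from `gensim.utils`, and the standard-library `os` module, which is used later to count CPU cores.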

In [None]:
import gensim
import pandas as pd
from gensim.utils import simple_preprocess
import os

### Load Data

This cell loads the dataset from the file "Cell_Phones_and_Accessories_5.json" into a pandas DataFrame and displays the first five rows. The file stores one JSON object per line, which is why `lines=True` is passed to `pd.read_json`.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [8]:
df = pd.read_json("Cell_Phones_and_Accessories_5.json", lines=True)
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |
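To make the `lines=True` argument concrete, here is a minimal sketch of the JSON Lines layout using only the standard library. The two records are made up for illustration and are not taken from the dataset; the real file has more fields per record.

```python
import io
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
raw = io.StringIO(
    '{"reviewText": "Works great", "overall": 5}\n'
    '{"reviewText": "Stopped charging", "overall": 2}\n'
)

# Parse each non-empty line on its own, which is the idea behind lines=True.
records = [json.loads(line) for line in raw if line.strip()]
print(len(records), records[0]["overall"])
```

A regular JSON file would instead hold a single array or object, and `pd.read_json` would be called without `lines=True`.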


### Check Data Shape

This cell displays the number of rows and columns in the DataFrame.

In [9]:
df.shape

(194439, 9)

### Preprocess Review Text

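This cell preprocesses the 'reviewText' column of the DataFrame using `gensim.utils.simple_preprocess`. With its default settings, the function lowercases the text, splits it into alphabetic tokens, and drops tokens shorter than 2 or longer than 15 characters. Rows with a missing review text are dropped first with `dropna()`.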

In [10]:
review_text = [simple_preprocess(text) for text in df['reviewText'].dropna()]

### Display Original Review Text

This cell displays the original text of the first review in the dataset, for comparison with its preprocessed form.

In [12]:
df.reviewText.loc[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

### Display Preprocessed Review Text Example

This cell displays the preprocessed tokens for the first review in the dataset. Punctuation and one-letter tokens such as "I" are gone, and "don't" is reduced to `don`.

In [23]:
review_text[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
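The token list above can be approximated with a few lines of standard-library Python. This is only a rough sketch of the behaviour seen in the output, not gensim's actual implementation: the function name `rough_preprocess` and the regular expression are assumptions chosen for illustration, and the length limits mirror the defaults of `simple_preprocess`.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, keep alphabetic runs, filter by length."""
    # [^\W\d]+ matches runs of word characters that are not digits.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

review = (
    "They look good and stick good! I just don't like the rounded shape "
    "because I was always bumping it and Siri kept popping up and it was "
    "irritating. I just won't buy a product like this again"
)
tokens = rough_preprocess(review)
print(tokens[:9])
```

The apostrophe splits "don't" into `don` and `t`, and the length filter then removes `t`, which explains the `'don'` and `'won'` entries in the output.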

### Initialize Word2Vec Model

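This cell initializes a Word2Vec model with a context window of 10 words, a minimum word count of 2 (words seen fewer than 2 times are ignored), and one worker thread per CPU core as reported by `os.cpu_count()`. All other parameters keep their default values.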

In [17]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=os.cpu_count(),
)
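To illustrate what the `window` parameter controls, the sketch below lists the (target, context) pairs that a fixed window produces for a short token list. The helper `context_pairs` is hypothetical and simplified: by default gensim also shrinks the window randomly per target word and subsamples frequent words, both of which this sketch ignores.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["they", "look", "good", "and", "stick", "good"]
pairs = context_pairs(tokens, window=2)
print(pairs[:3])
```

A larger window such as 10 pairs each word with more distant neighbours, which tends to capture topical rather than purely syntactic similarity.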

### Build Vocabulary

This cell builds the vocabulary for the Word2Vec model from the preprocessed review text.

In [18]:
model.build_vocab(review_text, progress_per=1000)
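The effect of `min_count=2` during vocabulary building can be sketched with a plain word counter. The tiny corpus below is made up for illustration, and the code only mimics the frequency filter, not the rest of what `build_vocab` does.

```python
from collections import Counter

# Hypothetical tokenized corpus: a list of token lists, like review_text.
corpus = [
    ["good", "case", "good", "price"],
    ["bad", "case"],
]
min_count = 2

# Count every token across all sentences, then keep the frequent ones.
counts = Counter(token for sentence in corpus for token in sentence)
vocab = sorted(word for word, n in counts.items() if n >= min_count)
print(vocab)
```

Words that fall below the threshold ("price" and "bad" here) get no vector, so looking them up in a trained model raises a `KeyError`.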

### Train Word2Vec Model

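This cell trains the Word2Vec model on the preprocessed review text for 10 epochs. `train` returns a tuple of (effective word count, raw word count): the effective count excludes words outside the vocabulary and words removed by frequent-word subsampling.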

In [19]:
model.train(review_text, total_examples=model.corpus_count, epochs=10)

(123010819, 167737950)
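The two numbers printed above can be unpacked with simple arithmetic. This sketch assumes the tuple is (effective word count, raw word count) summed over all epochs, as gensim's documentation describes; the variable names are chosen here for illustration.

```python
# Tuple printed by model.train above.
effective_words, raw_words = 123010819, 167737950
epochs = 10

# Raw tokens seen in a single pass over the corpus.
raw_per_epoch = raw_words // epochs

# Share of raw tokens that actually contributed to training.
kept_fraction = effective_words / raw_words
print(raw_per_epoch, round(kept_fraction, 3))
```

Under that assumption, roughly 73% of the tokens were used for training, with the rest dropped as rare words or subsampled frequent words.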

### Find Most Similar Words

This cell finds and displays the ten words most similar to "horrible", ranked by the cosine similarity of their vectors in the trained Word2Vec model.

In [25]:
model.wv.most_similar("horrible")

[('terrible', 0.9057676196098328),
 ('awful', 0.8142561912536621),
 ('atrocious', 0.6511831879615784),
 ('poor', 0.6298261880874634),
 ('lousy', 0.6170556545257568),
 ('crappy', 0.6105121970176697),
 ('bad', 0.5979559421539307),
 ('weird', 0.5782413482666016),
 ('horrid', 0.5412096977233887),
 ('phenomenal', 0.5305992960929871)]
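The ranking above rests on cosine similarity between word vectors. The sketch below reproduces the idea with hand-written two-dimensional vectors; the vectors, the `cosine` helper, and the toy `most_similar` function are all illustrative assumptions and are unrelated to the values the trained model holds.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d vectors; real Word2Vec vectors have many more dimensions.
toy_vectors = {
    "horrible": [1.0, 0.0],
    "terrible": [0.9, 0.1],
    "weird": [0.5, 0.5],
    "case": [0.0, 1.0],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("horrible", toy_vectors)
print(ranked)
```

Because the score depends only on the angle between vectors, words used in similar contexts rank high even when their meanings are opposite, which may explain why "phenomenal" appears in the list above.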

### Calculate Word Similarity (Good vs Nice)

This cell calculates and displays the cosine similarity between the vectors for "good" and "nice" in the trained Word2Vec model.

In [21]:
float(model.wv.similarity(w1="good", w2="nice"))

0.7101591229438782

### Calculate Word Similarity (Good vs Terrible)

This cell calculates and displays the cosine similarity between the vectors for "good" and "terrible". The score is lower than for "good" and "nice" but still well above zero, since antonyms tend to occur in similar contexts.

In [22]:
float(model.wv.similarity(w1="good", w2="terrible"))

0.5507915019989014