# Contents:
1. Explore Amazon Reviews from Cell Phones and Accessories category.
2. Apply Word2Vec
3. Study the similarities of words produced by the model

### Import relevant libraries

In [1]:
import pandas as pd
import gensim

### Read the data
1. Link to the dataset:
- http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

2. lines parameter:
- Set lines=True when every line of the file is a separate JSON object (JSON Lines format), as is the case for this dataset.
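
To make the effect of the lines parameter concrete, here is a minimal sketch using only the standard library. The two records are made up for illustration and are not taken from the dataset; the point is only that a file with one JSON object per line cannot be parsed as a single JSON document, but can be parsed line by line.

```python
import json

# Two hypothetical records in JSON Lines form: one complete JSON object per line.
raw = (
    '{"asin": "A1", "overall": 4, "reviewText": "Looks good"}\n'
    '{"asin": "A2", "overall": 5, "reviewText": "Really great product"}\n'
)

# Parsing the whole string as one JSON document fails,
# because it contains more than one top-level object.
try:
    json.loads(raw)
    parsed_as_one = True
except json.JSONDecodeError:
    parsed_as_one = False

# Parsing line by line works; this is roughly what lines=True asks pandas to do.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(parsed_as_one, len(records), records[1]["overall"])
```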

In [2]:
file_path = "C:/Users/satha/Downloads/reviews_Cell_Phones_and_Accessories_5.json/Cell_Phones_and_Accessories_5.json"
df = pd.read_json(file_path, lines= True)
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


In [3]:
df.shape

(194439, 9)

### Column we are interested in

In [4]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

### Preprocessing 
1. Use simple_preprocess from gensim
2. Word2Vec
3. build_vocab

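#### 1. gensim.utils.simple_preprocess:
- Tokenization: Splits text into tokens made of alphabetic characters, so punctuation and digits are dropped.
- Lowercasing: Converts all tokens to lowercase.
- Length Filtering: Discards tokens shorter than min_len (default 2) or longer than max_len (default 15). This is why "I" and the "t" of "don't" are missing from the output below.
- Optional Accent Removal: With deacc=True, strips accent marks from tokens (off by default).
- No Stop Word Removal: Common words such as "and" and "the" are kept; stop words have to be filtered in a separate step if needed.
- Encoding Handling: Converts the input to Unicode before tokenizing.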
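
The behaviour listed above can be approximated with a few lines of standard-library Python. This is only a rough sketch to show where tokens disappear: approx_simple_preprocess is a hypothetical helper, not gensim's implementation, and it ignores details such as accent removal.

```python
import re

# Runs of word characters that are not digits (approximation of gensim's tokenizer).
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

review_tokens = approx_simple_preprocess(
    "They look good and stick good! I just don't like it"
)
digit_tokens = approx_simple_preprocess("iPhone 4s case")

print(review_tokens)
print(digit_tokens)
```

In this sketch "I" and the "t" left over from "don't" are removed by the length filter, not by any stop word list, which matches the output of the real function shown in the notebook.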

In [5]:
# for demonstration:
gensim.utils.simple_preprocess(df.reviewText[0])

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

In [6]:
# On all the reviews
Reviews = df.reviewText.apply(gensim.utils.simple_preprocess)

#### Word2Vec
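1. Parameters:
- window: Maximum distance between the current word and a context word within a review.
- min_count: Words that occur fewer times than this in the corpus are ignored.
- workers: Number of worker threads used for training.
- sg: Training algorithm; 0 selects CBOW, 1 selects skip-gram.
- vector_size: Dimensionality of the word vectors (named size before gensim 4.0; the default is 100).
- seed: Seed for the random number generator. Fully deterministic runs also need a single worker thread (workers=1).
- epochs: Number of passes over the corpus (named iter before gensim 4.0).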
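
To illustrate what the window parameter means, the sketch below lists the (center word, context word) pairs that fall inside a fixed window. context_pairs is a hypothetical helper for illustration only; the real training code differs in details, for example it may sample a smaller effective window for each word.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["they", "look", "good", "and", "stick"]
pairs = context_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

With window=1 only direct neighbours are paired; the notebook uses window=10, so each word sees up to ten words on either side.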

In [7]:
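model = gensim.models.Word2Vec(
    window=10,     # context words considered on each side
    min_count=2,   # ignore words seen fewer than 2 times
    workers=4,     # worker threads used for training
    sg=0,          # 0 = CBOW, 1 = skip-gram
    epochs=10,     # passes over the corpus
)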

#### model.build_vocab
1. Use:
- Scans the corpus once and creates the model's vocabulary.
- After this step the model knows which words it will learn vectors for and how many reviews the corpus contains (corpus_count).
2. Parameters:
- progress_per: Controls how often progress is logged while the corpus is scanned. The messages are only visible if Python logging is configured at INFO level.
3. Need:
- Feature Space: Each word in the vocabulary gets its own vector.
- Reduced Dimensionality: The model focuses on a smaller, corpus-specific vocabulary.
- Efficiency: A smaller vocabulary saves memory and speeds up training.
- Consistency: Every occurrence of a word maps to the same vocabulary entry.
- Minimizing Noise: Words that occur fewer than min_count times are dropped. Stop words are not removed by this step.
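
The core idea of vocabulary building with a min_count threshold can be sketched with collections.Counter. The toy corpus below is made up for illustration, and the sketch leaves out everything else build_vocab does (such as preparing the weights).

```python
from collections import Counter

# Hypothetical tokenized corpus: one list of tokens per review.
toy_corpus = [
    ["good", "case"],
    ["good", "charger"],
    ["bad", "case"],
]
min_count = 2

# Count every token across all reviews.
counts = Counter(token for review in toy_corpus for token in review)

# Keep only tokens that reach the threshold, most frequent first.
vocab = sorted(
    (word for word, count in counts.items() if count >= min_count),
    key=lambda word: (-counts[word], word),
)

print(vocab)            # words that survive the min_count filter
print(len(toy_corpus))  # plays the role of corpus_count
```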

In [8]:
model.build_vocab(
    Reviews,
    progress_per= 1000
)

In [9]:
print(f"The no. of epochs = {model.epochs}")
print(f"The no. of samples = {model.corpus_count}")

The no. of epochs = 10
The no. of samples = 194439


#### Training

In [10]:
model.train(Reviews, 
            total_examples= model.corpus_count,
            epochs= model.epochs)

(123016755, 167737950)

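- The first number is the effective number of words the model trained on, summed over all epochs. Words outside the vocabulary (below min_count) and frequent words dropped by subsampling are not counted.
- The second number is the raw number of words in the corpus, summed over all epochs (here 10 epochs of 16,773,795 words each).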
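
As a quick sanity check, the two numbers returned above can be broken down with plain arithmetic. The values are copied from the training output; the variable names are chosen here for illustration.

```python
# Values copied from the output of model.train above.
effective_words, raw_words = 123016755, 167737950
epochs = 10

# Raw words seen in a single pass over the corpus.
raw_words_per_epoch = raw_words // epochs

# Share of raw words that actually contributed to training.
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)
print(round(kept_fraction, 3))
```

Roughly 73% of the raw words were used for training; the rest were skipped because they were out of vocabulary or were dropped by subsampling.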

In [12]:
model.save("./saved_model.model")