# 1. Word2Vec using Gensim
`Word2Vec` is a popular technique in natural language processing (NLP) to learn word embeddings, which are dense vector representations of words that capture their syntactic and semantic meanings. Developed by Tomas Mikolov and his team at Google in 2013, `Word2Vec` is designed to map words into a continuous vector space such that words with similar meanings are located in close proximity to each other.

### Key Concepts of Word2Vec:

1. **Training Models**: Word2Vec provides two model architectures for training:
   - **Continuous Bag of Words (CBOW)**: This model predicts a target word based on its context. The context is typically a fixed-size window of surrounding words. CBOW is faster and has slightly better accuracy for frequent words.
   - **Skip-Gram**: This model works the opposite way to CBOW; it uses a word to predict its surrounding context words. Skip-Gram performs well with small amounts of data and represents even rare words or phrases well.

2. **Dimensionality and Context Window**: The size of the embedding vectors and the context window size are crucial parameters. The vector size (typically between 50 and 300) balances between capturing enough information about words and keeping the model size manageable. The window size determines how many words before and after the target word are considered as context.

3. **Optimization Techniques**:
   - **Negative Sampling**: Used to speed up training and improve quality of the resulting word vectors by only updating a subset of the model’s weights during training rather than all of them.
   - **Hierarchical Softmax**: An alternative to negative sampling, particularly useful for large vocabularies, as it speeds up computation.

4. **Training Process**: During training, Word2Vec adjusts the word vectors in the embedding space to ensure that words that share common contexts are located close to one another in the space. This is achieved through a process of feeding word pairs according to their linguistic contexts into a simple neural network with a single hidden layer, and continuously adjusting the weights (word vectors) using gradient descent based on the loss between predicted and actual context words.
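
To make the idea of "feeding word pairs according to their contexts" concrete, here is a minimal sketch of how Skip-Gram style (center, context) pairs could be generated from one tokenized sentence. The function name `skipgram_pairs` and the toy sentence are made up for illustration; the original implementation also samples a smaller effective window at random for each center word, which this sketch leaves out.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, already tokenized.
sentence = ["the", "bottle", "keeps", "water", "cold"]
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

With `window=1` each word is paired only with its immediate neighbours. CBOW uses the same windows but groups all context words of a position together to predict the center word.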

**Applications**:
   - **Semantic Similarity**: Word2Vec can measure the semantic similarity between two words based on their vector proximity.
   - **Analogies**: It can solve analogies. For example, given "man is to woman as king is to ?", the model can predict "queen".
   - **Word Clustering**: Grouping words with similar meanings.
   - **Feature Vectors**: Word vectors can be used as feature inputs for various machine learning models.
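
The similarity and analogy applications both come down to vector arithmetic plus cosine similarity. The sketch below illustrates this with hand-made 2-dimensional vectors; the numbers in `toy_vectors` and the helper names are assumptions chosen so the example works, not values learned by a real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors: axis 0 loosely means "royalty", axis 1 "gender".
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


answer = solve_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

A trained model works the same way in principle, only with learned vectors of a few hundred dimensions and a much larger candidate set.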



What is Gensim?
----
**Gensim** is an open-source Python library designed specifically for unsupervised topic modeling and natural language processing. It is particularly useful for handling large text collections, using data streaming and efficient incremental algorithms, which makes it distinct from other NLP libraries that often require the entire dataset to fit into memory.

### Key Features of Gensim:

1. **Efficiency**: Gensim is highly efficient with its memory usage and processing speed, which is achieved through its use of incremental algorithms. This means that Gensim can work with large datasets that do not fit into memory, processing them in a streaming fashion.

2. **Scalability**: It is scalable in both computational resources and the size of the data. Gensim is designed to handle large-scale text collections with the help of data streaming and efficient data structures.

3. **Ease of Use**: Despite its focus on efficiency and scalability, Gensim is user-friendly and easy to get started with, requiring minimal setup and external dependencies.

4. **Algorithm Variety**: Gensim includes implementations of various popular algorithms for topic modeling and vector space modeling, including:
   - **Word2Vec**: To generate word embeddings by training a shallow neural network.
   - **Doc2Vec**: An extension of Word2Vec that learns to represent documents in addition to words.
   - **Latent Dirichlet Allocation (LDA)**: A probabilistic model for discovering abstract topics within a collection of documents.
   - **TF-IDF (Term Frequency-Inverse Document Frequency)**: A statistical measure used to evaluate how important a word is to a document in a collection or corpus.

5. **Integration**: It can be easily integrated with other machine learning frameworks like Scikit-Learn for further analysis and model enhancements.

6. **Support for Various Text Formats**: Gensim can easily handle raw text or Bag of Words (BoW) formats, and it includes tools for preprocessing text like tokenization, stemming, and more.
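
The streaming idea behind the efficiency and scalability points can be sketched as a small iterable class that reads one line at a time instead of loading the whole corpus. The class name `StreamedCorpus`, the file name, and the whitespace tokenization are assumptions for illustration. The object is a restartable iterable rather than a one-shot generator because training typically makes several passes over the data.

```python
import os
import tempfile


class StreamedCorpus:
    """Restartable iterable that yields one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so only one line is held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()  # simplistic whitespace tokenizer
                if tokens:  # skip blank lines
                    yield tokens


# Demo: a small temporary file stands in for a large corpus on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("Great water bottle\n\nSturdy hiking boots\n")

    corpus = StreamedCorpus(corpus_path)
    first_pass = list(corpus)
    second_pass = list(corpus)  # iterating again re-reads the file

print(first_pass)
```

An object like this can be passed wherever a list of token lists is expected, as long as the consumer only iterates over it.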

### Typical Uses of Gensim:
- **Topic Modeling**: Discovering abstract topics from a large volume of text.
- **Similarity Queries**: Finding similar documents or words in a collection.
- **Document Clustering and Classification**: Grouping text documents into clusters or classifying them into predefined categories based on their content.

Gensim is particularly favored in academic environments and industry settings where the handling of large textual data is required, and it is a go-to library for many researchers and practitioners working on topic modeling and document similarity tasks.

> Let's try Gensim ourselves...

In [1]:
import gensim
import pandas as pd

# 2. Reading and Exploring the Dataset

In [2]:
df = pd.read_json("/kaggle/input/sports-and-outdoor-review-dataset/Sports_and_Outdoors_5.json", lines=True)
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AIXZKN4ACSKI | 1881509818 | David Briner | [0, 0] | This came in on time and I am veru happy with ... | 5 | Woks very good | 1390694400 | 01 26, 2014 |
| 1 | A1L5P841VIO02V | 1881509818 | Jason A. Kramer | [1, 1] | I had a factory Glock tool that I was using fo... | 5 | Works as well as the factory tool | 1328140800 | 02 2, 2012 |
| 2 | AB2W04NI4OEAD | 1881509818 | J. Fernald | [2, 2] | If you don't have a 3/32 punch or would like t... | 4 | It's a punch, that's all. | 1330387200 | 02 28, 2012 |
| 3 | A148SVSWKTJKU6 | 1881509818 | Jusitn A. Watts "Maverick9614" | [0, 0] | This works no better than any 3/32 punch you w... | 4 | It's a punch with a Glock logo. | 1328400000 | 02 5, 2012 |
| 4 | AAAWJ6LW9WMOO | 1881509818 | Material Man | [0, 0] | I purchased this thinking maybe I need a speci... | 4 | Ok,tool does what a regular punch does. | 1366675200 | 04 23, 2013 |


In [3]:
df.shape

(296337, 9)

### Simple Preprocessing & Tokenization
The first step in most data science tasks is cleaning the data.
For NLP, this typically means converting all words to lower case, trimming whitespace, and removing punctuation.
We will do the same here.

We could additionally remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reduce words to their root forms, e.g. 'running' to 'run'.

`gensim.utils.simple_preprocess` handles the basic steps: it lowercases the text, splits it into tokens, drops punctuation and digits, and by default discards tokens shorter than 2 or longer than 15 characters. It does not remove stop words or reduce words to their root forms.

In [4]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [5]:
review_text

0         [this, came, in, on, time, and, am, veru, happ...
1         [had, factory, glock, tool, that, was, using, ...
2         [if, you, don, have, punch, or, would, like, t...
3         [this, works, no, better, than, any, punch, yo...
4         [purchased, this, thinking, maybe, need, speci...
                                ...                        
296332    [this, is, water, bottle, done, right, it, is,...
296333    [if, you, re, looking, for, an, insulated, wat...
296334    [this, hydracentials, sporty, oz, double, insu...
296335    [as, usual, received, this, item, free, in, ex...
296336    [hydracentials, insulated, oz, water, bottle, ...
Name: reviewText, Length: 296337, dtype: object

In [6]:
review_text.loc[0][:10]

['this', 'came', 'in', 'on', 'time', 'and', 'am', 'veru', 'happy', 'with']

In [7]:
df.reviewText.loc[0]

'This came in on time and I am veru happy with it, I haved used it already and it makes taking out the pins in my glock 32 very easy'

Comparing the two outputs shows that preprocessing also drops very short tokens such as `I`: by default, `simple_preprocess` discards tokens shorter than two characters.
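
To see why the first review comes out the way it does, here is a rough pure-Python approximation of that preprocessing. It is not Gensim's actual implementation: the function name `approx_simple_preprocess` and the regular expression are assumptions, and edge cases such as underscores or accented characters may be handled differently by the real function.

```python
import re

# Runs of letters only: digits, punctuation and underscores act as separators.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens, drop very short or very long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


raw = ("This came in on time and I am veru happy with it, I haved used it "
       "already and it makes taking out the pins in my glock 32 very easy")
tokens = approx_simple_preprocess(raw)
print(tokens[:10])
```

The single-letter word `I` falls below the minimum length and the number `32` contains no letters, so neither survives.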

# 3. Training the Word2Vec Model

Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before and 10 words after the current word. Set `min_count=2` so that words appearing fewer than 2 times in the whole corpus are ignored.

`workers` sets how many CPU threads are used for training.
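
The effect of `min_count` can be illustrated with a simple frequency filter over the corpus. This is only a sketch of the idea with a made-up helper `prune_vocab` and toy sentences; Gensim's own vocabulary building does more work than this.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count every token in the corpus and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical tokenized reviews.
toy_reviews = [
    ["great", "bottle"],
    ["great", "boots"],
    ["sturdy", "bottle"],
]
vocab = prune_vocab(toy_reviews, min_count=2)
print(vocab)
```

Words that occur only once, here `boots` and `sturdy`, are dropped from the vocabulary, so the model never learns a vector for them.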

#### Initialize the model

In [8]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)


#### Build Vocabulary

In [9]:
model.build_vocab(review_text, progress_per=1000)

#### Train the Word2Vec Model

In [10]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(91338651, 121496535)

# 4. Save the Model

Save the model so that it can be reused in other applications.

In [11]:
model.save("./word2vec-Sports-and-Outdoors-5.model")

# 5. Finding Similar Words and Similarity between words
[Further Reading](https://radimrehurek.com/gensim/models/word2vec.html)

In [12]:
model.wv.most_similar("awesome")

[('amazing', 0.8858018517494202),
 ('fantastic', 0.8143672347068787),
 ('awsome', 0.7724143862724304),
 ('incredible', 0.7592576742172241),
 ('excellent', 0.7334620356559753),
 ('outstanding', 0.731977641582489),
 ('great', 0.7243435978889465),
 ('wonderful', 0.7196093201637268),
 ('exceptional', 0.6570886373519897),
 ('unbeatable', 0.6394888162612915)]

In [13]:
model.wv.similarity(w1="adidas", w2="nike")

0.85530967

In [14]:
model.wv.similarity(w1="great", w2="good")

0.7907072

### Further Reading

You can read more about Gensim's Word2Vec implementation at https://radimrehurek.com/gensim/models/word2vec.html

Explore other datasets related to Amazon reviews: http://jmcauley.ucsd.edu/data/amazon/
    - You can use these datasets for further exercises.