**Word2Vec** is a popular **word embedding technique** developed by a team at Google, used to convert words into **dense numerical vectors** that capture **semantic relationships** between them.

---

### **Core Idea:**

Word2Vec transforms words into vectors in such a way that **words with similar meanings** have **similar vectors** (i.e., they lie close together in the vector space).

---

### **How it works:**

It uses a **shallow neural network** trained on large text corpora using one of two methods:

#### 1. **CBOW (Continuous Bag of Words)**

* Predicts the **target word** from the surrounding context words.
* Example: Given “I \_\_\_ football every weekend”, predict the word “play”.

#### 2. **Skip-gram**

* Predicts the **context words** from a single target word.
* Example: Given the word “play”, predict words like “I”, “football”, “weekend”.

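The difference between the two objectives is easiest to see in the training examples each one produces. The sketch below is only an illustration in plain Python: the helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the window size of 2 are assumptions made for this example, not part of any library, and it leaves out the neural network and the sampling tricks that real implementations use.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    # CBOW: one example per position, (list of context words) -> target word
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: one example per (target word, single context word) pair
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

tokens = "i play football every weekend".split()

print(cbow_examples(tokens)[1])
# (['i', 'football', 'every'], 'play')

print(skipgram_examples(tokens)[2:5])
# [('play', 'i'), ('play', 'football'), ('play', 'every')]
```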
---

### **Output:**

Each word is represented as a vector (typically 100 to 300 dimensions), and arithmetic on these vectors reflects relationships between words:

* **king - man + woman ≈ queen**
* **Paris - France + Italy ≈ Rome**

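To make the arithmetic concrete, here is a small sketch with hand-made 2-D vectors. The numbers are invented for illustration (one axis loosely stands for "royalty", the other for "gender") and are not real Word2Vec embeddings; the `analogy` helper is likewise a hypothetical name. It computes `a - b + c` and returns the nearest remaining word by cosine similarity, skipping the three query words.

```python
import math

# Hand-made toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Not real embeddings.
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, otherwise they tend to win trivially
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman", toy))  # queen
```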
---

### **Advantages:**

* Captures **semantic meaning** of words.
* Works well for **analogies and word similarities**.
* Vectors are **dense** and low-dimensional (unlike the sparse, vocabulary-sized vectors from BoW or TF-IDF).

---

### **Disadvantages:**

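* Ignores **word order** inside the context window, so it does not model **grammar** or syntax.
* Cannot produce vectors for **out-of-vocabulary** words: a word never seen during training has no embedding.
* Assigns a single vector to each word, so it can't capture **polysemy** (same word, different meanings) directly.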


In [1]:
# pretrained model: Google News vectors (300 dimensions), loaded with gensim

from gensim.models import KeyedVectors

In [3]:
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

In [6]:
model['cricket'].shape 
# each word is represented by a 300-dimensional vector

(300,)

In [7]:
model.most_similar('man')

[('woman', 0.7664011716842651),
 ('boy', 0.6824871301651001),
 ('teenager', 0.6586929559707642),
 ('teenage_girl', 0.6147903203964233),
 ('girl', 0.5921714305877686),
 ('suspected_purse_snatcher', 0.5716364979743958),
 ('robber', 0.5585119128227234),
 ('Robbery_suspect', 0.5584410429000854),
 ('teen_ager', 0.5549196600914001),
 ('men', 0.5489761233329773)]

In [8]:
model.most_similar('movie')

[('film', 0.8676772117614746),
 ('movies', 0.8013108372688293),
 ('films', 0.7363011837005615),
 ('moive', 0.6830361485481262),
 ('Movie', 0.6693680286407471),
 ('horror_flick', 0.6577848792076111),
 ('sequel', 0.6577793955802917),
 ('Guy_Ritchie_Revolver', 0.6509751677513123),
 ('romantic_comedy', 0.6413198709487915),
 ('flick', 0.6321909427642822)]

In [9]:
model.most_similar('war')

[('wars', 0.7484657764434814),
 ('War', 0.6410669684410095),
 ('invasion', 0.5892110466957092),
 ('Persian_Gulf_War', 0.5890660285949707),
 ('Vietnam_War', 0.5886476039886475),
 ('Iraq', 0.5885993242263794),
 ('unwinnable_quagmire', 0.5681803226470947),
 ('un_winnable', 0.5606350302696228),
 ('occupation', 0.5506216883659363),
 ('conflict', 0.5506187081336975)]

In [12]:
# analogies: king - man + woman ≈ queen
model.most_similar(positive=["woman", "king"], negative=["man"])

[('queen', 0.7118191123008728),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902430415153503),
 ('crown_prince', 0.5499458909034729),
 ('prince', 0.5377322435379028),
 ('kings', 0.5236843824386597),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134939193726),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006)]

In [15]:
model.similarity('cat','dog')

# cosine similarity between the two word vectors

0.76094574

In [16]:
model.doesnt_match(["breakfast", "dinner", "lunch", "banana"])

'banana'

In [17]:
# analogy: nurse - woman + man ≈ ?
model.most_similar(positive=["man", "nurse"], negative=["woman"])

[('nurses', 0.5750778913497925),
 ('medic', 0.5732707977294922),
 ('registered_nurse', 0.5555101037025452),
 ('x_ray_technician', 0.5553551316261292),
 ('Nurse', 0.5527041554450989),
 ('doctor', 0.542094886302948),
 ('respiratory_therapist', 0.5328323245048523),
 ('nursing', 0.5252007842063904),
 ('paramedic', 0.5221819281578064),
 ('physician', 0.500717043876648)]

In [22]:
# text vectorisation for NLP tasks: average the word vectors into one sentence vector

import numpy as np
import pandas as pd

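def sentence_vector(sentence):
    # Keep only the words the model has a vector for; unknown words are skipped
    words = [word for word in sentence.lower().split() if word in model]
    # Average the word vectors; fall back to a zero vector if no word is known
    return np.mean([model[word] for word in words], axis=0) if words else np.zeros(model.vector_size)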

vec = sentence_vector("I love natural language processing")
pd.Series(vec)

0     -0.031787
1      0.017822
2      0.059863
3      0.146631
4     -0.094458
         ...   
295   -0.047931
296   -0.007642
297   -0.098560
298   -0.024707
299    0.045444
Length: 300, dtype: float32

In [23]:
vec = sentence_vector("I like data science")
pd.Series(vec)

0     -0.071472
1      0.024292
2      0.132496
3      0.149841
4     -0.031219
         ...   
295   -0.001678
296    0.044617
297   -0.058624
298    0.016479
299   -0.006271
Length: 300, dtype: float32
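
Once sentences are reduced to fixed-length vectors, a common next step is to compare them. The sketch below shows one way to do that with cosine similarity in NumPy. The `cosine_similarity` helper and the three short stand-in vectors are assumptions made for this example; in the notebook the inputs would be the 300-dimensional outputs of `sentence_vector(...)`. The zero-norm guard matters here because `sentence_vector` returns an all-zero vector when none of the words are in the vocabulary.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm else 0.0

# Stand-in vectors for illustration only (real sentence vectors have 300 values)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine_similarity(a, b))            # close to 1.0: same direction
print(cosine_similarity(a, c))            # 0.0: nothing in common
print(cosine_similarity(np.zeros(3), a))  # 0.0: guarded zero-vector case
```

In the notebook, the same helper could be called as `cosine_similarity(sentence_vector("I love natural language processing"), sentence_vector("I like data science"))`.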