## Word Embeddings in NLP (Word2vec, GloVe, fastText)

<img src="https://miro.medium.com/v2/resize:fit:1060/format:webp/1*LdviucnshWgIIcQvhTTF-g.png" width="250">

Word embeddings are vector representations of words in which words with similar meanings have similar vectors.

### Word Vectors
Older methods such as **One-Hot Encoding** create large, sparse vectors in which words like "King" and "Queen" have no measurable similarity: every pair of distinct words is equally far apart. Embeddings instead place words in a continuous, dense vector space where words with similar meanings lie close to each other.

A **one-hot encoded vector** is a numerical representation of categorical data: a vector of length $N$ (the total number of unique categories) with a '1' at the index of the specific category and '0's elsewhere.
 - It transforms categorical, non-numerical data (e.g., [Red, Blue, Green]) into binary vectors (e.g., Red = [1, 0, 0], Blue = [0, 1, 0], Green = [0, 0, 1]) suitable for machine learning algorithms.
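
As a minimal sketch of the encoding described above (the helper name `one_hot` is hypothetical, and NumPy is an assumed dependency), each category gets one row of an identity matrix:

```python
import numpy as np

def one_hot(categories):
    """Map each category to a one-hot vector of length N = number of categories."""
    eye = np.eye(len(categories), dtype=int)
    return {cat: eye[i] for i, cat in enumerate(categories)}

vectors = one_hot(["Red", "Blue", "Green"])
print(vectors["Red"])    # [1 0 0]
print(vectors["Blue"])   # [0 1 0]
print(vectors["Green"])  # [0 0 1]

# Any two distinct one-hot vectors are orthogonal, so their dot product is 0.
# This is why one-hot vectors carry no notion of similarity between words.
print(int(vectors["Red"] @ vectors["Blue"]))  # 0
```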

### Word2vec (The prediction approach)
Word2vec has two architectures:
- **CBOW (Continuous Bag of Words)**:
    - Predicts a "target" word based on its context. (e.g., "The [?] sat on the mat").
    - Faster to train
- **Skip-gram**:
    - Predicts the "context" words given a target word. (e.g., "Cat -> [The, sat, on, the, mat]").
    - Slower to train, but works well even with small amounts of training data and represents infrequent words better
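
To make the difference concrete, here is a small sketch of how training examples could be generated for each architecture. The function names and the `window` parameter are illustrative assumptions, not part of any library:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target, context_word) pair per neighbour in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens, window=1)[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cbow_pairs(tokens, window=1)[1])
# (['the', 'sat'], 'cat')
```

Skip-gram produces several training pairs per position while CBOW produces one, which is one reason CBOW tends to train faster.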

Word2vec is based on a shallow neural network with one hidden layer. The hidden layer has no activation function, so it performs a linear transformation that maps words into a lower-dimensional space.
- Every word in the vocabulary is initialized with a random vector (usually 100 to 300 dimensions).
- These are then passed through the neural network to predict the context/target.
- The network calculates the error (how far the prediction was from the actual word).
- The network adjusts the weights.
- After training, the input-to-hidden weights are the **Embeddings**. Each row of that weight matrix is now a meaningful vector for one word.
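
The steps above can be sketched in NumPy for the skip-gram case. This is a toy illustration under assumptions that are not in the original: a five-word vocabulary, 8 dimensions instead of 100 to 300, a full softmax output, and plain SGD. The names `W_in`, `W_out`, and `train_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension (toy value)

W_in = rng.normal(scale=0.1, size=(V, D))   # input-to-hidden weights: one row per word
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden-to-output weights

def train_step(target, context, lr=0.1):
    """One SGD step of skip-gram with a full softmax; returns the loss."""
    t, c = word_to_id[target], word_to_id[context]
    h = W_in[t].copy()                 # hidden layer: a row lookup, no activation
    scores = h @ W_out                 # one score per vocabulary word, shape (V,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()               # softmax
    loss = -np.log(probs[c])           # cross-entropy against the true context word
    grad_scores = probs.copy()
    grad_scores[c] -= 1.0              # gradient of the loss w.r.t. the scores
    grad_h = W_out @ grad_scores       # computed before W_out is modified
    W_out[:, :] -= lr * np.outer(h, grad_scores)
    W_in[t] -= lr * grad_h
    return loss

losses = [train_step("cat", "sat") for _ in range(200)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")

# After training, the embedding of a word is its row in W_in.
cat_vector = W_in[word_to_id["cat"]]
print(cat_vector.shape)  # (8,)
```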

#### Optimization 
If the vocabulary has 100,000 words, the final "Softmax" layer of the neural network has to calculate probabilities for 100,000 outputs every single time. Word2Vec solves this with:
- Negative Sampling: Instead of updating the output weights for all 100,000 words, the model updates them only for the "correct" word and a small handful (5–20) of "noise" words (randomly sampled words treated as not belonging to the context).
- Hierarchical Softmax: Uses a binary tree (Huffman tree) to reduce the complexity of the output layer from $O(V)$ to $O(\log V)$.
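
A rough sketch of a single negative-sampling update follows. Several simplifications are assumptions: noise words are drawn uniformly (the original word2vec draws them from the unigram distribution raised to the 3/4 power), `W_out` is stored with one row per word, and the function name `negative_sampling_step` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(W_in, W_out, target_id, context_id, noise_ids, lr=0.05):
    """Update only 1 + len(noise_ids) output rows instead of all V rows."""
    h = W_in[target_id].copy()
    ids = np.array([context_id, *noise_ids])  # assumed distinct
    labels = np.zeros(len(ids))
    labels[0] = 1.0                           # 1 for the true context word, 0 for noise
    out_vecs = W_out[ids]                     # shape (k + 1, D)
    preds = sigmoid(out_vecs @ h)
    loss = -np.log(preds[0]) - np.sum(np.log(1.0 - preds[1:]))
    err = preds - labels                      # gradient of the loss w.r.t. the scores
    grad_h = err @ out_vecs
    W_out[ids] -= lr * np.outer(err, h)
    W_in[target_id] -= lr * grad_h
    return loss

rng = np.random.default_rng(0)
V, D = 1000, 16
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(V, D))
before = W_out.copy()

target_id, context_id = 3, 7
candidates = np.setdiff1d(np.arange(V), [target_id, context_id])
noise_ids = rng.choice(candidates, size=5, replace=False)  # uniform, for simplicity

loss = negative_sampling_step(W_in, W_out, target_id, context_id, noise_ids)
changed_rows = np.flatnonzero(np.any(W_out != before, axis=1))
print(len(changed_rows), "of", V, "output rows updated")
```

The cost per step depends on the number of noise words rather than on the vocabulary size, which is the point of the technique.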

#### Result
Because the model forces words that appear in similar contexts to have similar weights, they end up near each other in space.
- **Cosine Similarity**: The cosine of the angle between two vectors measures how similar they are, which can be used to find related words such as synonyms.
- **Clustering**: Words like "Apple," "Banana," and "Orange" will form a cluster in the "Fruit" region of the vector space.
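
As an illustration, cosine similarity can be computed as below. The three vectors are hand-made toy values chosen for the example, not trained embeddings, and `most_similar` is a hypothetical helper:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors for illustration only (not learned from data).
embeddings = {
    "apple":  np.array([0.9, 0.8, 0.1]),
    "banana": np.array([0.8, 0.9, 0.2]),
    "car":    np.array([0.1, 0.2, 0.9]),
}

def most_similar(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(most_similar("apple", embeddings))  # banana
```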

### GloVe (Global Vectors for Word Representation)
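While Word2Vec learns from one local context window at a time, GloVe learns from global co-occurrence counts.
- **How it works**: It builds a large matrix of how often every word appears within a context window of every other word across the entire corpus (e.g., Wikipedia).
- **The Logic**: If "Ice" and "Solid" co-occur frequently, but "Steam" and "Solid" do not, training pushes the vectors for Ice and Solid closer together than those for Steam and Solid. The vectors are fit so that the dot product of two word vectors approximates the logarithm of their co-occurrence count.
- **Advantage**: It uses corpus-wide co-occurrence statistics directly, whereas Word2Vec only sees them indirectly through individual windows.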
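
The counting stage can be sketched as follows. This covers only the co-occurrence matrix, not the GloVe training that fits vectors to it. The function name, the toy sentence, and the unweighted counts are assumptions for illustration (GloVe itself weights a pair that is d words apart by 1/d):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "ice is solid and cold steam is hot gas".split()
counts = cooccurrence_counts(tokens, window=2)

# "ice" and "solid" share a window; "steam" and "solid" never do.
print(counts[("ice", "solid")], counts[("steam", "solid")])  # 1 0
```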

### fastText
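fastText is an extension of Word2Vec that adds sub-word information.
- **The Innovation**: Instead of treating a word as the smallest unit, it breaks words into character n-grams (sub-words), with `<` and `>` marking the word boundaries. For example, with n = 3, "apple" becomes [<ap, app, ppl, ple, le>]. A word's vector is the sum of its n-gram vectors.
- **The Main Feature**: It can handle Out-of-Vocabulary (OOV) words. If the model hasn't seen the word "biophysics," it can still build a vector for it from character n-grams it shares with known words such as "biology" and "physics."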
- **Use Case**: Excellent for languages with lots of prefixes/suffixes (like German or Turkish).
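
A minimal sketch of the n-gram extraction is shown below. The function name `char_ngrams` is hypothetical; the default range of 3 to 6 characters mirrors fastText's default `minn`/`maxn` settings, and the real library additionally keeps the whole word as its own unit and hashes n-grams into buckets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word`, with < and > as boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("apple", 3, 3))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# An unseen word still shares n-grams with known words,
# so a vector can be assembled for it from those shared pieces.
shared = sorted(set(char_ngrams("biophysics")) & set(char_ngrams("physics")))
print(shared)
```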