## One-Hot Encoding in NLP

### Context
One-hot encoding is a fundamental technique in Natural Language Processing (NLP) used to represent categorical data, such as words or characters, in a numerical format.

#### Key Points:
- **Purpose**: Converts categorical data (e.g., words, characters) into numerical representations.
- **Usage**:
  - Often used as a preprocessing step in NLP tasks.
  - Prepares text data for input into machine learning models.
- **How It Works**:
  - Each word/token is represented as a vector whose length equals the vocabulary size.
  - Every element of the vector is `0`, except the one at the word's index, which is `1`.

Understanding one-hot encoding makes it easier to move on to more advanced embedding techniques such as Word2Vec or GloVe.
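
One way to see the link to embeddings: multiplying a one-hot vector by a matrix selects a single row of that matrix, which is what an embedding lookup does. The sketch below is only an illustration. The sizes and the random `embedding_matrix` are hypothetical stand-ins, not trained Word2Vec or GloVe weights.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration
vocab_size, embedding_dim = 4, 3

# Hypothetical embedding matrix: random values stand in for trained weights
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# One-hot vector for the word at index 2
word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1

# The product picks out row `word_index` of the matrix
dense_vector = one_hot @ embedding_matrix

print(np.allclose(dense_vector, embedding_matrix[word_index]))  # True
```

In practice the row is usually looked up by index directly, which avoids building the one-hot vector at all.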


### Example

Let's implement one-hot encoding for a small example corpus in Python.


```python
# Import necessary libraries
import numpy as np

# Sample text corpus
corpus = ["hello world", "machine learning", "hello machine"]

# Step 1: Tokenize the corpus and collect the unique words.
# Sorting makes the index assignment reproducible; iterating over a plain
# set of strings can produce a different order from one run to the next.
tokens = sorted(set(word for sentence in corpus for word in sentence.split()))

# Step 2: Create a word-to-index mapping
word_to_index = {word: idx for idx, word in enumerate(tokens)}
print("Word-to-Index Mapping:", word_to_index)

# Step 3: Generate one-hot encodings
vocab_size = len(tokens)
one_hot_encodings = {}

for word, idx in word_to_index.items():
    encoding = np.zeros(vocab_size)  # all zeros ...
    encoding[idx] = 1                # ... except at the word's index
    one_hot_encodings[word] = encoding

print("\nOne-Hot Encodings:")
for word, encoding in one_hot_encodings.items():
    print(f"{word}: {encoding}")

# Encoding an example sentence
sentence = "hello world"
encoded_sentence = [one_hot_encodings[word] for word in sentence.split()]
print("\nEncoded Sentence:")
for word, encoding in zip(sentence.split(), encoded_sentence):
    print(f"{word}: {encoding}")

# Shape of the encoded sentence
encoded_sentence_array = np.array(encoded_sentence)
print("\nShape of the encoded sentence:", encoded_sentence_array.shape)
```

Output:

```
Word-to-Index Mapping: {'hello': 0, 'learning': 1, 'machine': 2, 'world': 3}

One-Hot Encodings:
hello: [1. 0. 0. 0.]
learning: [0. 1. 0. 0.]
machine: [0. 0. 1. 0.]
world: [0. 0. 0. 1.]

Encoded Sentence:
hello: [1. 0. 0. 0.]
world: [0. 0. 0. 1.]

Shape of the encoded sentence: (2, 4)
```
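
scikit-learn also provides a `OneHotEncoder` class, so the same encoding can be produced without a hand-written loop. The following is a minimal sketch, assuming scikit-learn is installed. The word `"python"` is a made-up example of a word that is not in the corpus; with `handle_unknown="ignore"`, such a word is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

corpus = ["hello world", "machine learning", "hello machine"]
words = sorted({word for sentence in corpus for word in sentence.split()})

# OneHotEncoder expects a 2D array of shape (n_samples, n_features),
# so each word becomes a row with a single column.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array(words).reshape(-1, 1))

# The learned vocabulary, in the column order used by the encoder
vocabulary = list(encoder.categories_[0])
print(vocabulary)

# "python" is not in the vocabulary and is encoded as all zeros
sentence = "hello python"
encoded = encoder.transform(np.array(sentence.split()).reshape(-1, 1)).toarray()
print(encoded)
```

The encoder returns a sparse matrix by default; `.toarray()` converts it to a dense NumPy array for printing.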


#### Explanation of the Shape
- The encoded sentence has shape `(n, m)`, where:
  - `n` is the number of words in the sentence.
  - `m` is the size of the vocabulary (number of unique words in the corpus).
- Each row in the encoded sentence is the one-hot vector for a single word.
- For example, the sentence "hello world" has 2 words and the vocabulary has 4, so the encoded sentence has shape `(2, 4)`.
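
The shape rule can be checked on a longer sentence. This is a small sketch, assuming the four-word vocabulary from the example in sorted order; `encode_sentence` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np

# Assumed vocabulary: the four corpus words, sorted so indices are reproducible
vocabulary = ["hello", "learning", "machine", "world"]
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

def encode_sentence(sentence):
    """Return an (n, m) matrix with one one-hot row per word."""
    indices = [word_to_index[word] for word in sentence.split()]
    # Row i of the identity matrix is the one-hot vector for index i
    return np.eye(len(vocabulary))[indices]

encoded = encode_sentence("hello machine learning")
print(encoded.shape)  # (3, 4): 3 words, vocabulary of 4

# argmax along each row recovers the original word indices
decoded = [vocabulary[i] for i in encoded.argmax(axis=1)]
print(decoded)  # ['hello', 'machine', 'learning']
```

Because each row contains exactly one `1`, the encoding is lossless: the original words can always be recovered from the matrix.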


#### Advantages and Limitations

| **Advantages**                                | **Limitations**                                                                 |
|-----------------------------------------------|---------------------------------------------------------------------------------|
| Simple and easy to implement                  | High dimensionality for large vocabularies                                     |
| Effective for small vocabularies              | Does not capture semantic relationships between words                          |
| Provides a foundation for understanding more advanced techniques | Inefficient representation for datasets with a large number of unique categories |
| Interpretable and deterministic representation| Sparse matrices lead to memory inefficiency                                    |
| Useful for tasks requiring clear separation of categories | Unable to handle unseen words or categories during inference                   |
| No computation required beyond basic indexing | No inherent ability to represent word similarity or context                    |
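
Two of the limitations in the table can be made concrete with a few lines of NumPy. The sketch below assumes the four-word vocabulary from the example; the 50,000-word vocabulary size is a hypothetical figure used only to illustrate how memory use grows.

```python
import numpy as np

vocabulary = ["hello", "learning", "machine", "world"]
one_hot = np.eye(len(vocabulary))  # row i is the vector for word i

# No semantic similarity: the dot product of any two different one-hot
# vectors is 0, so every pair of distinct words looks equally unrelated.
similarity = one_hot @ one_hot.T
print(similarity)

# High dimensionality: a dense float64 one-hot vector needs one entry per
# vocabulary word, while the word's index alone fits in a single integer.
vocab_size = 50_000  # hypothetical vocabulary size
dense_bytes_per_word = vocab_size * np.dtype(np.float64).itemsize
index_bytes_per_word = np.dtype(np.int64).itemsize
print(dense_bytes_per_word, index_bytes_per_word)  # 400000 8
```

The similarity matrix is the identity matrix: each word matches only itself, which is why one-hot vectors cannot express that two words are related.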



### Conclusion
One-hot encoding is a simple technique for converting categorical text data into numerical form so that it can be used as input to machine learning models. It works well for small vocabularies, but its vectors grow with the vocabulary size and carry no semantic meaning. Word embeddings (e.g., Word2Vec, GloVe) address these limitations by representing words in a dense, continuous vector space that captures semantic relationships between them.