# One-hot Encoding

One-hot encoding is one of the simplest forms of word vectorization. Each word in the vocabulary is represented by a vector whose length equals the vocabulary size: every entry is zero except for a single one at the position assigned to that word.

![one-hot-encoding.jpg](../imgs/one-hot-encoding.jpg)
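
As a minimal sketch of this definition, the snippet below builds one vector per word for a three-word vocabulary. The names `toy_vocabulary` and `one_hot_vector` are hypothetical and chosen only for illustration; they are not part of the notebook's own code.

```python
# Hypothetical toy vocabulary, used only to illustrate the definition
toy_vocabulary = ["cats", "dogs", "love"]

def one_hot_vector(word, vocabulary):
    """Return a list of zeros with a single 1 at the position of `word`."""
    index = vocabulary.index(word)  # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(vocabulary))]

for word in toy_vocabulary:
    print(word, one_hot_vector(word, toy_vocabulary))
# cats [1, 0, 0]
# dogs [0, 1, 0]
# love [0, 0, 1]
```

Every vector has the same length as the vocabulary, so the representation grows by one dimension for each new word.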

## Corpus
We define a small set of sample sentences to form our corpus.

In [1]:
# Define the corpus with a few sample sentences
corpus = [
    "I love cats.",
    "I hate cats and dogs.",
    "I have a dog.",
]

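Each document in the corpus is split into words on single spaces. Because punctuation is not removed, tokens such as `cats.` and `cats` are treated as different words.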

In [2]:
corpus = [doc.split(" ") for doc in corpus]
print("Corpus split into words:", corpus)

Corpus split into words: [['I', 'love', 'cats.'], ['I', 'hate', 'cats', 'and', 'dogs.'], ['I', 'have', 'a', 'dog.']]
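
If the trailing periods are unwanted, one possible alternative is to extract word characters with a regular expression instead of splitting on spaces. The sketch below is an assumption about how that could look, not the tokenization used in the rest of this notebook; `raw_sentences` and `tokenize` are hypothetical names.

```python
import re

# Same sentences as the corpus, kept in a separate (hypothetical) variable
raw_sentences = [
    "I love cats.",
    "I hate cats and dogs.",
    "I have a dog.",
]

def tokenize(sentence):
    """Return the runs of letters, digits, and apostrophes in `sentence`."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

tokenized = [tokenize(sentence) for sentence in raw_sentences]
print(tokenized)
# [['I', 'love', 'cats'], ['I', 'hate', 'cats', 'and', 'dogs'], ['I', 'have', 'a', 'dog']]
```

With this tokenization, `cats` appears as the same token in the first two sentences. Lowercasing or stemming (which would also merge `dog` and `dogs`) are further choices left out of this sketch.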


## Creating the Vocabulary

In [3]:
# Create a vocabulary set
vocabulary = set()
for document in corpus:
    for term in document:
        vocabulary.add(term)
        
# Convert the vocabulary set to a list (set order is arbitrary, so the list order can differ between runs)
vocabulary = list(vocabulary)
print("Vocabulary list:", vocabulary)

Vocabulary list: ['and', 'I', 'dogs.', 'cats', 'cats.', 'hate', 'have', 'love', 'dog.', 'a']
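
Since a set has no guaranteed order, the positions in the list above may change from run to run. A minimal sketch of one way to get stable positions, assuming a sorted vocabulary is acceptable, is shown below; `tokenized_corpus`, `sorted_vocabulary`, and `word_to_index` are hypothetical names introduced here.

```python
# Tokenized corpus as printed earlier, repeated so the sketch is self-contained
tokenized_corpus = [
    ["I", "love", "cats."],
    ["I", "hate", "cats", "and", "dogs."],
    ["I", "have", "a", "dog."],
]

# Sorting the unique terms gives the same order on every run
sorted_vocabulary = sorted({term for doc in tokenized_corpus for term in doc})

# Map each term to its fixed position in the vector
word_to_index = {term: i for i, term in enumerate(sorted_vocabulary)}

print(sorted_vocabulary)
# ['I', 'a', 'and', 'cats', 'cats.', 'dog.', 'dogs.', 'hate', 'have', 'love']
print(word_to_index["love"])
# 9
```

Note that the sorted order differs from the list printed above, so vectors built from it would have their columns in a different order.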


## One-Hot Encoding Implementation

In [4]:
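def one_hot_encoding(corpus):
    """Return one binary vector per document in a tokenized corpus.

    Each vector has one entry per vocabulary term: 1 if the term occurs
    in the document, 0 otherwise.
    """
    # Collect the unique terms across all documents
    vocabulary = set()
    for document in corpus:
        for term in document:
            vocabulary.add(term)

    one_hot_matrix = list()

    for document in corpus:
        # Build this document's row, one column per vocabulary term.
        # Column order follows the set's iteration order, which is arbitrary.
        temp = list()
        for term in vocabulary:
            if term in document:
                temp.append(1)
            else:
                temp.append(0)
        one_hot_matrix.append(temp)

    return one_hot_matrix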

In [5]:
one_hot_encoding(corpus)

[[0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
 [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 1, 0, 1, 1]]
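
Each row above describes a whole document, so it can contain several ones. The sketch below shows one way to relate this to per-word one-hot vectors: a document's row is the element-wise maximum of the one-hot vectors of its words. It assumes the vocabulary order printed earlier; `sample_vocabulary`, `sample_document`, `word_vectors`, and `document_vector` are hypothetical names.

```python
# Vocabulary in the order printed earlier (the order may differ on another run)
sample_vocabulary = ["and", "I", "dogs.", "cats", "cats.",
                     "hate", "have", "love", "dog.", "a"]
sample_document = ["I", "love", "cats."]

def word_vectors(document, vocabulary):
    """Return one one-hot vector per token in `document`."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for term in document:
        vector = [0] * len(vocabulary)
        vector[index[term]] = 1  # single 1 at the term's position
        vectors.append(vector)
    return vectors

def document_vector(document, vocabulary):
    """Combine the per-word vectors with an element-wise maximum."""
    return [max(column) for column in zip(*word_vectors(document, vocabulary))]

for term, vector in zip(sample_document, word_vectors(sample_document, sample_vocabulary)):
    print(term, vector)
# I [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# love [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
# cats. [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

print(document_vector(sample_document, sample_vocabulary))
# [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
```

The combined vector matches the first row of the matrix above. Word order and repetition are lost in that step, which is why the per-word vectors are kept when sequence information matters.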