# One-Hot Encoding Basics

One-hot encoding is a foundational technique in natural language processing for converting words into a numerical format that machine learning models can understand. The process begins by building a vocabulary from the text data, where each unique word is assigned a distinct integer index. This index is then used to create a binary vector for each word: the vector has all zeros except for a single one at the position corresponding to the word’s index in the vocabulary.
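As a minimal sketch of the vocabulary-building step, the snippet below assigns each unique word an integer index in order of first appearance. The helper name `build_vocab`, the toy corpus, and the choice of lowercasing plus whitespace tokenization are assumptions made for illustration; indices start at 1 only to match the convention used in the rest of this section.

```python
def build_vocab(sentences):
    """Map each unique word to a 1-based index, in order of first appearance."""
    word_to_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_to_index:
                # The next free index is the current vocabulary size plus one
                word_to_index[word] = len(word_to_index) + 1
    return word_to_index

corpus = ["the cat chases a mouse", "the dog runs"]  # hypothetical toy corpus
example_vocab = build_vocab(corpus)
print(example_vocab)
# {'the': 1, 'cat': 2, 'chases': 3, 'a': 4, 'mouse': 5, 'dog': 6, 'runs': 7}
```

Real pipelines usually add punctuation handling and a cap on vocabulary size, which this sketch leaves out.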

For example, consider the following vocabulary, where each word is mapped to a unique number:

In [None]:
# Example vocabulary
vocab = {"cat": 1,
         "chases": 2,
         "mouse": 3,
         "dog": 4,
         "runs": 5,
         "cheese": 6,
         "eats": 7,
         "the": 8,
         "a": 9
         }

To one-hot encode a sentence such as `"cat chases mouse"`, we can use the following function:

In [None]:
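def get_onehot_vector(text):
    """Return one one-hot vector (a list of 0s and 1s) per whitespace-separated word."""
    onehot_encoded = []
    for word in text.split():
        # Start from an all-zero vector as long as the vocabulary
        temp = [0] * len(vocab)
        if word in vocab:
            # Vocabulary indices start at 1, list positions start at 0
            temp[vocab[word]-1] = 1
        # A word missing from the vocabulary keeps its all-zero vector
        onehot_encoded.append(temp)
    return onehot_encoded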

get_onehot_vector("cat chases mouse")
# Output: [[1,0,0,0,0,0,0,0,0], [0,1,0,0,0,0,0,0,0], [0,0,1,0,0,0,0,0,0]]

In this output, each word in the sentence is represented as a vector of length 9 (the size of the vocabulary), with a single one indicating the position of that word in the vocabulary.
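One detail of the function above is easy to miss: a word that is not in the vocabulary is not skipped, it is kept as an all-zero vector. The sketch below restates the same logic in self-contained form so this can be checked directly. The function name `onehot_with_vocab`, the explicit `vocab` argument, and the test word `"bird"` are hypothetical and used only for illustration.

```python
vocab = {"cat": 1, "chases": 2, "mouse": 3, "dog": 4, "runs": 5,
         "cheese": 6, "eats": 7, "the": 8, "a": 9}

def onehot_with_vocab(text, vocab):
    """Same logic as get_onehot_vector, with the vocabulary passed explicitly."""
    vectors = []
    for word in text.split():
        vec = [0] * len(vocab)
        if word in vocab:
            vec[vocab[word] - 1] = 1  # shift the 1-based index to a 0-based position
        vectors.append(vec)
    return vectors

vectors = onehot_with_vocab("cat chases bird", vocab)
print(vectors[0])  # [1, 0, 0, 0, 0, 0, 0, 0, 0]
print(vectors[2])  # [0, 0, 0, 0, 0, 0, 0, 0, 0]  ("bird" is out of vocabulary)
```

Whether an all-zero vector is acceptable depends on the downstream model; a common alternative is to reserve a dedicated index for unknown words.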

We can also perform one-hot encoding using scikit-learn’s `CountVectorizer`, which transforms text into numerical features. By default `CountVectorizer` counts word occurrences; setting `binary=True` caps every entry at 1, so a word is represented as 1 if it appears in the text and 0 otherwise. For a whole document the result is a multi-hot (binary bag-of-words) vector, which can contain several ones, rather than a one-hot vector in the strict sense.

For example, let’s define our vocabulary and use `CountVectorizer` as follows:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

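# Define the vocabulary as a list; the list order fixes the column order.
# A separate name avoids overwriting the `vocab` dict used by get_onehot_vector.
vocab_list = ["cat", "chases", "mouse", "dog", "runs", "cheese", "eats", "the", "a"]

# binary=True records presence (0 or 1) instead of counts.
# The default token_pattern drops single-character tokens, so "a" would never
# be matched; this pattern keeps them.
vectorizer = CountVectorizer(vocabulary=vocab_list, binary=True,
                             token_pattern=r"(?u)\b\w+\b")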

# Transform the sentence
X = vectorizer.transform(["cat chases mouse"])

# Convert to array
print(X.toarray())
# Output: [[1 1 1 0 0 0 0 0 0]]

This output is a single vector representing the entire sentence `"cat chases mouse"`. Each position corresponds to a word in our vocabulary, in the order the list was given, with a 1 indicating that the word appears in the sentence.
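Because a sentence-level vector records only which words are present, two sentences with the same words in a different order map to the same vector. The pure-Python sketch below illustrates this without scikit-learn. The helper `multi_hot` and the list name `vocab_order` are hypothetical, and the whitespace tokenization is a simplification of what `CountVectorizer` does.

```python
vocab_order = ["cat", "chases", "mouse", "dog", "runs", "cheese", "eats", "the", "a"]

def multi_hot(text, vocab_order):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    present = set(text.split())
    return [1 if word in present else 0 for word in vocab_order]

first = multi_hot("cat chases mouse", vocab_order)
second = multi_hot("mouse chases cat", vocab_order)
print(first)            # [1, 1, 1, 0, 0, 0, 0, 0, 0]
print(first == second)  # True: word order is not preserved
```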

If we want a separate one-hot vector for each word, as in our earlier manual approach, we can pass the words to `transform` as individual documents:

In [None]:
words = "cat chases mouse".split()
X = vectorizer.transform(words)
print(X.toarray())
# Output: [[1 0 0 0 0 0 0 0 0],
#          [0 1 0 0 0 0 0 0 0],
#          [0 0 1 0 0 0 0 0 0]]

Now, each row is a one-hot vector for a single word, matching the output of our manual one-hot encoding function.
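The per-word rows and the earlier sentence-level vector are closely related: taking the column-wise maximum over the rows recovers the sentence-level vector. The following sketch shows this with plain lists, copying the rows from the output above; the variable names are hypothetical.

```python
# Per-word one-hot rows for "cat chases mouse", as printed above
word_vectors = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
]

# zip(*rows) iterates over columns; max marks a column as 1 if any row has a 1 there
sentence_vector = [max(column) for column in zip(*word_vectors)]
print(sentence_vector)  # [1, 1, 1, 0, 0, 0, 0, 0, 0]
```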

To see the mapping of words to their indices in the vector, we can access the vocabulary dictionary that `CountVectorizer` uses:

In [None]:
print(vectorizer.vocabulary_)
# Output: {'cat': 0, 'chases': 1, 'mouse': 2, ...}

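Notice that the indices in `vocabulary_` start from 0 and follow the order of the list we passed in, whereas our earlier dictionary started from 1. The resulting vectors are still identical, because the manual function subtracts 1 from each index before setting the position.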
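To make the index offset concrete, the sketch below converts the 1-based dictionary from the start of this section into a 0-based mapping and uses it to look up which word a one-hot vector stands for. The names `zero_based` and `index_to_word` are hypothetical, and the example assumes the vector contains exactly one 1.

```python
vocab = {"cat": 1, "chases": 2, "mouse": 3, "dog": 4, "runs": 5,
         "cheese": 6, "eats": 7, "the": 8, "a": 9}

# Shift every index down by one so it matches a 0-based vector position
zero_based = {word: idx - 1 for word, idx in vocab.items()}

# Invert the mapping to go from a vector position back to a word
index_to_word = {idx: word for word, idx in zero_based.items()}

vector = [0, 1, 0, 0, 0, 0, 0, 0, 0]
position = vector.index(1)      # position of the single 1
print(zero_based["cat"])        # 0
print(index_to_word[position])  # chases
```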