# One-hot encoding of text

## One-hot encoding with pure Python

In [1]:
documents = ["Dog bites man", "Man bites dog", "Dogs eat meat.", "Man eats food"]

docs = [doc.lower().replace(".", "") for doc in documents]
docs

['dog bites man', 'man bites dog', 'dogs eat meat', 'man eats food']

In [3]:
# Build the vocabulary: map each distinct word to a 1-based integer id
vocab = {}
count = 0

for doc in docs:
    for word in doc.split(" "):
        if word not in vocab:
            count += 1
            vocab[word] = count

print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'dogs': 4, 'eat': 5, 'meat': 6, 'eats': 7, 'food': 8}


In [4]:
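# Create the one-hot encoder function
def get_onehot_vector(text):
    """Return one one-hot vector per word; out-of-vocabulary words map to all zeros."""
    onehot_encoded = []
    vocab_size = len(vocab)

    for word in text.split():
        temp = [0] * vocab_size

        # vocab ids start at 1, so subtract one to get a 0-based position
        if word in vocab:
            temp[vocab[word] - 1] = 1

        onehot_encoded.append(temp)

    return onehot_encoded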

#### One-hot encode the first sentence

In [5]:
first_sentence = docs[0]
first_sentence

'dog bites man'

In [6]:
get_onehot_vector(first_sentence)

[[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0]]

#### One-hot encode the second sentence

In [7]:
print(docs[1])
get_onehot_vector(docs[1])

man bites dog


[[0, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0]]
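Each one-hot vector has a single 1, so it can be mapped back to its word by reversing the vocabulary lookup. The sketch below is illustrative only: `index_to_word` and `decode_onehot` are hypothetical names that do not appear in the notebook, and the vocabulary is copied from the output above.

```python
# Vocabulary as printed above (ids are 1-based).
vocab = {'dog': 1, 'bites': 2, 'man': 3, 'dogs': 4,
         'eat': 5, 'meat': 6, 'eats': 7, 'food': 8}

# Reverse lookup: 0-based vector position -> word.
index_to_word = {idx - 1: word for word, idx in vocab.items()}


def decode_onehot(vectors):
    """Hypothetical helper: map one-hot vectors back to words.

    All-zero vectors (out-of-vocabulary words) decode to None.
    """
    words = []
    for vector in vectors:
        words.append(index_to_word[vector.index(1)] if 1 in vector else None)
    return words


encoded = [[0, 0, 1, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0, 0],
           [1, 0, 0, 0, 0, 0, 0, 0]]
decoded_words = decode_onehot(encoded)
print(decoded_words)
```

The order of the vectors carries the word order of the sentence, which is the only thing that distinguishes "dog bites man" from "man bites dog" in this representation.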

In [12]:
# One-hot encoding: some words are in the vocab, some are not
get_onehot_vector("man and dog are good")

[[0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0]]

In [13]:
# One-hot encoding of text with no words in the vocab
get_onehot_vector("fires in volcanos are lava")

[[0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0]]
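As the last two outputs show, every out-of-vocabulary word collapses to the same all-zero vector, so unknown words cannot be told apart from padding or from each other. One common workaround is to reserve an extra dimension for unknown words. The sketch below is an assumption layered on top of the notebook, not part of it: `UNK_TOKEN`, `build_vocab`, and `get_onehot_vector_with_unk` are hypothetical names, and the ids here are 0-based for simplicity.

```python
# Hypothetical sketch: reserve one extra dimension for out-of-vocabulary words.
UNK_TOKEN = "<unk>"  # assumed marker, not used in the original notebook


def build_vocab(docs):
    """Map each distinct word to a 0-based index; the last index is the UNK slot."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    vocab[UNK_TOKEN] = len(vocab)
    return vocab


def get_onehot_vector_with_unk(text, vocab):
    """One-hot encode each word; unseen words set the UNK position instead."""
    size = len(vocab)
    encoded = []
    for word in text.split():
        vector = [0] * size
        vector[vocab.get(word, vocab[UNK_TOKEN])] = 1
        encoded.append(vector)
    return encoded


docs = ["dog bites man", "man bites dog", "dogs eat meat", "man eats food"]
vocab_unk = build_vocab(docs)
vectors = get_onehot_vector_with_unk("man and dog", vocab_unk)
print(vectors)
```

With this variant every vector contains exactly one 1, at the cost of one extra dimension. All unknown words still share the same vector.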

## One hot encoding using scikit-learn

We encode our corpus as a one-hot numeric array using **scikit-learn's OneHotEncoder**.

**One-Hot Encoding**: Each word w in the corpus vocabulary is given a unique integer id between 1 and |V|, where V is the set of words in the corpus vocabulary. Each word is then represented by a |V|-dimensional binary vector that is all 0s except for a single 1 at the position given by the word's id.

**Label Encoding**: Each word w in the corpus is converted into an integer between 0 and n-1, where n is the number of unique words in the corpus.
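The two definitions are closely related: a label encoding assigns the integer id, and the one-hot vector is the row of an identity matrix selected by that id. A minimal NumPy sketch of this relationship follows. The names `word_to_id`, `identity`, and `onehot_dog` are illustrative, and the alphabetical ordering of ids is an assumption chosen for the example.

```python
import numpy as np

# Toy vocabulary; sorting makes the id assignment deterministic (0-based here).
words = ["dog", "bites", "man", "eats", "meat", "food"]
vocab_sorted = sorted(set(words))

# Label encoding: each word gets an integer id in [0, n-1].
word_to_id = {word: idx for idx, word in enumerate(vocab_sorted)}

# One-hot encoding: row i of the n x n identity matrix is the vector for id i.
identity = np.eye(len(vocab_sorted), dtype=int)
onehot_dog = identity[word_to_id["dog"]]

print(word_to_id)
print(onehot_dog)
```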


In [16]:
%pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.6.1 threadpoolctl-3.6.0
Note: you may need to restart the kernel to use updated packages.


In [14]:
S1 = 'dog bites man'
S2 = 'man bites dog'
S3 = 'dog eats meat'
S4 = 'man eats food'

In [23]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = [S1.split(), S2.split(), S3.split(), S4.split()]

values = data[0] + data[1] + data[2] + data[3]

print("The data: ",values)

The data:  ['dog', 'bites', 'man', 'man', 'bites', 'dog', 'dog', 'eats', 'meat', 'man', 'eats', 'food']


In [24]:
# Label Encoding
label_encoder = LabelEncoder()

integer_encoded = label_encoder.fit_transform(values)

print("Label Encoded:", integer_encoded)

Label Encoded: [1 0 4 4 0 1 1 2 5 4 2 3]
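To see which integer belongs to which word, the fitted encoder can be inspected. The sketch below repeats the fit so that it is self-contained. It uses `LabelEncoder.classes_`, which holds the unique labels in sorted order (a label's position is its id), and `inverse_transform`, which maps ids back to words. The variable name `decoded` is illustrative.

```python
from sklearn.preprocessing import LabelEncoder

values = ['dog', 'bites', 'man', 'man', 'bites', 'dog',
          'dog', 'eats', 'meat', 'man', 'eats', 'food']

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# classes_ lists the unique labels in sorted order; index == assigned id.
print(list(label_encoder.classes_))

# inverse_transform maps the integer ids back to the original words.
decoded = label_encoder.inverse_transform(integer_encoded)
print(list(decoded))
```

Because the classes are sorted alphabetically, 'bites' gets id 0 and 'dog' gets id 1, which explains why the encoded sequence starts with `1 0 4` rather than `0 1 2`.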


In [27]:

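import numpy as np

# OneHotEncoder expects a 2D array of shape (n_samples, n_features).
# Reshape the flat word list so this also works when sentences differ in length.
data_2d = np.array(values).reshape(-1, 1)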
print(data_2d)

[['dog']
 ['bites']
 ['man']
 ['man']
 ['bites']
 ['dog']
 ['dog']
 ['eats']
 ['meat']
 ['man']
 ['eats']
 ['food']]


In [33]:
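# One-Hot Encoding
# sparse_output=False returns a dense NumPy array
# (the parameter was named `sparse` before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(data_2d)
print("Onehot Encoded Matrix:\n", onehot_encoded)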

Onehot Encoded Matrix:
 [[0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]]
