# Feature Extraction 

## 1. One-Hot Encoding

One-Hot Encoding is a method that represents each token (word) in a vocabulary as a binary vector whose length equals the vocabulary size.
Only one position in the vector is marked with 1, and all other positions are 0.

Each word has its own unique vector.

When applied to a sentence, One-Hot Encoding produces a matrix with one row per token; each row is that token's one-hot vector. For example, take two sentences:

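* `sen1 = i love nlp`

* `sen2 = i dont like nlp`

Together the two sentences contain five distinct words (`i`, `love`, `nlp`, `dont`, `like`), so each vector has length 5:

`[ i ]     → [1, 0, 0, 0, 0]`

`[ love ]  → [0, 1, 0, 0, 0]`

`[ nlp ]   → [0, 0, 1, 0, 0]`

After vectorization:

`sen1 = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]`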
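
The mapping above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than the pandas approach used in the cells that follow: the helper names `build_vocab` and `one_hot` are illustrative, and the vocabulary order (first appearance across the sentences) is an assumption chosen to match the vectors shown.

```python
def build_vocab(sentences):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tokens in sentences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot(tokens, vocab):
    """Return one row per token; each row has a single 1 at the token's index."""
    matrix = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[vocab[tok]] = 1
        matrix.append(row)
    return matrix


sen1 = "i love nlp".split()
sen2 = "i dont like nlp".split()

# Hypothetical helpers defined above, not library functions.
vocab = build_vocab([sen1, sen2])
sen1_matrix = one_hot(sen1, vocab)
sen2_matrix = one_hot(sen2, vocab)

print(vocab)
for row in sen1_matrix:
    print(row)
```

The matrix has as many rows as the sentence has tokens and as many columns as the vocabulary has words, so sentences of different lengths produce matrices of different heights.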


In [1]:
# using pandas

In [2]:
import pandas as pd

sentences = {
    0: ["i", "love", "nlp"],
    1: ["nlp", "is", "fun", "and", "powerful"],
    2: ["i", "am", "learning"],
    3: ["deep", "learning", "and", "nlp"]
}

In [3]:
# Build exploded DataFrame: one row per (sentence_id, token)
rows = []
for sid, tokens in sentences.items():
    for t in tokens:
        rows.append({"sent_id": sid, "token": t})

In [4]:
rows

[{'sent_id': 0, 'token': 'i'},
 {'sent_id': 0, 'token': 'love'},
 {'sent_id': 0, 'token': 'nlp'},
 {'sent_id': 1, 'token': 'nlp'},
 {'sent_id': 1, 'token': 'is'},
 {'sent_id': 1, 'token': 'fun'},
 {'sent_id': 1, 'token': 'and'},
 {'sent_id': 1, 'token': 'powerful'},
 {'sent_id': 2, 'token': 'i'},
 {'sent_id': 2, 'token': 'am'},
 {'sent_id': 2, 'token': 'learning'},
 {'sent_id': 3, 'token': 'deep'},
 {'sent_id': 3, 'token': 'learning'},
 {'sent_id': 3, 'token': 'and'},
 {'sent_id': 3, 'token': 'nlp'}]

In [5]:
df = pd.DataFrame(rows)
df

| | sent_id | token |
|---|---|---|
| 0 | 0 | i |
| 1 | 0 | love |
| 2 | 0 | nlp |
| 3 | 1 | nlp |
| 4 | 1 | is |
| 5 | 1 | fun |
| 6 | 1 | and |
| 7 | 1 | powerful |
| 8 | 2 | i |
| 9 | 2 | am |


In [6]:
sentence_matrices_catograc = []

for sid in sorted(df['sent_id'].unique()):
    mat = df[df['sent_id'] == sid].drop(columns=['sent_id']).values
    sentence_matrices_catograc.append(mat)

In [7]:
sentence_matrices_catograc

[array([['i'],
        ['love'],
        ['nlp']], dtype=object),
 array([['nlp'],
        ['is'],
        ['fun'],
        ['and'],
        ['powerful']], dtype=object),
 array([['i'],
        ['am'],
        ['learning']], dtype=object),
 array([['deep'],
        ['learning'],
        ['and'],
        ['nlp']], dtype=object)]

In [8]:
# Create one indicator (dummy) column per distinct token; each row is one token occurrence
dummies = pd.get_dummies(df["token"]).astype(int)
dummies

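| | am | and | deep | fun | i | is | learning | love | nlp | powerful |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |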


In [9]:
df_with_dummies = pd.concat([df[["sent_id"]], dummies], axis=1)
df_with_dummies

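| | sent_id | am | and | deep | fun | i | is | learning | love | nlp | powerful |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |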


In [10]:
print("Vocabulary (columns):", list(dummies.columns), "\n")

Vocabulary (columns): ['am', 'and', 'deep', 'fun', 'i', 'is', 'learning', 'love', 'nlp', 'powerful'] 



In [11]:
sentence_matrices_catograc

[array([['i'],
        ['love'],
        ['nlp']], dtype=object),
 array([['nlp'],
        ['is'],
        ['fun'],
        ['and'],
        ['powerful']], dtype=object),
 array([['i'],
        ['am'],
        ['learning']], dtype=object),
 array([['deep'],
        ['learning'],
        ['and'],
        ['nlp']], dtype=object)]

In [12]:
sentence_matrices = []

for sid in sorted(df_with_dummies['sent_id'].unique()):
    mat = df_with_dummies[df_with_dummies['sent_id'] == sid].drop(columns=['sent_id']).values
    sentence_matrices.append(mat)

In [13]:
sentence_matrices

[array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]]),
 array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]),
 array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]),
 array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])]
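
One way to sanity-check matrices like these is to map every row back to its token. The sketch below is an illustration, not part of the original notebook: it hard-codes the vocabulary printed earlier and the first sentence's matrix, and the helper name `decode` is hypothetical.

```python
# Column order taken from the vocabulary printed above.
vocab = ["am", "and", "deep", "fun", "i", "is", "learning", "love", "nlp", "powerful"]

# One-hot matrix of the first sentence, copied from the output above.
first_sentence = [
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
]


def decode(matrix, vocab):
    """Turn each one-hot row back into the token whose column holds the 1."""
    tokens = []
    for row in matrix:
        if sum(row) != 1:
            raise ValueError(f"not a one-hot row: {row}")
        tokens.append(vocab[row.index(1)])
    return tokens


decoded = decode(first_sentence, vocab)
print(decoded)  # ['i', 'love', 'nlp']
```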

In [14]:
# using sklearn 

In [15]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

In [16]:
one_hot=pd.DataFrame()

In [17]:
# Fit OneHotEncoder on all tokens (vocabulary)
enc = OneHotEncoder(handle_unknown="ignore")
token_ohe = enc.fit_transform(df[['token']])   # shape: (total_tokens, vocab_size)
arr=token_ohe.toarray()
arr

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])

In [18]:
vocab = enc.get_feature_names_out()  # token names as columns
vocab

array(['token_am', 'token_and', 'token_deep', 'token_fun', 'token_i',
       'token_is', 'token_learning', 'token_love', 'token_nlp',
       'token_powerful'], dtype=object)
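
The `handle_unknown="ignore"` setting matters once the encoder meets a token it did not see during fitting: instead of raising an error, it emits an all-zero row for that token. The plain-Python sketch below only imitates that behavior to make it visible. `encode_token` is a hypothetical helper, not a scikit-learn function, and `"python"` is an invented out-of-vocabulary token.

```python
# Vocabulary learned during fitting (the encoder's tokens, without the "token_" prefix).
vocab = ["am", "and", "deep", "fun", "i", "is", "learning", "love", "nlp", "powerful"]
index = {tok: i for i, tok in enumerate(vocab)}


def encode_token(token, index):
    """One-hot encode a token; an unknown token becomes an all-zero row."""
    row = [0.0] * len(index)
    if token in index:
        row[index[token]] = 1.0
    return row


known = encode_token("nlp", index)
unknown = encode_token("python", index)   # not in the vocabulary

print(known)
print(unknown)
```

An all-zero row says only that the token was unseen, so two different unknown tokens become indistinguishable after encoding.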

## 2. Bag of Words

Bag of Words (BoW) is a unigram-based text representation technique where each document (sentence or paragraph) is converted into a vector based solely on the frequency of individual words, ignoring grammar, order, and context.
In a BoW model, the text is treated as a “bag” of unigrams (single words), where only the presence or count of each word matters.
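
The counting idea can be sketched without any library. The block below is an illustrative approximation of a count-based vectorizer under stated assumptions: text is lowercased, a token is any run of two or more word characters, and the vocabulary is sorted alphabetically. The sentences are the small example corpus used in this section, and the names `tokenize` and `bag_of_words` are hypothetical.

```python
import re
from collections import Counter

docs = [
    "I love cats",
    "I like cats",
    "cats are good",
    "I dont like cats",
    "I like dogs",
]


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix with one row per document)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words the document does not contain.
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix


vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Word order is lost in this representation: any reordering of a sentence's words yields the same row.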

In [19]:
from pathlib import Path

In [20]:
# Load the example corpus; data/little.csv is expected to exist already
root = Path("data")
root.mkdir(exist_ok=True)
path = root / "little.csv"
df = pd.read_csv(path)

In [21]:
df

| | text | output |
|---|---|---|
| 0 | I love cats | 1 |
| 1 | I like cats | 1 |
| 2 | cats are good | 1 |
| 3 | I dont like cats | 0 |
| 4 | I like dogs | 0 |


In [22]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
# max_features optionally caps the vocabulary size (default: None, meaning no limit)

In [23]:
bow=cv.fit_transform(df['text'])
bow=bow.toarray()
bow

array([[0, 1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 1, 0, 1, 0],
       [0, 0, 1, 0, 0, 1, 0]])

In [24]:
cv.vocabulary_

{'love': 6, 'cats': 1, 'like': 5, 'are': 0, 'good': 4, 'dont': 3, 'dogs': 2}

Note that `i` has no entry in the vocabulary: by default `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters, so the one-letter word "I" is dropped.

In [25]:
bow[1]

array([0, 1, 0, 0, 0, 1, 0])

## 3. N-grams

An n-gram is a contiguous sequence of n tokens (usually words) extracted from a text. N-grams represent local context by preserving the order of tokens within a fixed-length window of size n.
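
As a rough illustration of the definition, the sketch below slides a window of size n over a token list. The function name `ngrams` and the example tokens are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, joined with spaces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["dont", "like", "cats"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
too_long = ngrams(tokens, 4)   # fewer than n tokens: no n-grams at all

print(unigrams)
print(bigrams)
print(trigrams)
print(too_long)
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when k is smaller than n.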

`Unigram (n = 1), i.e. BoW (bag of words)`

A unigram is an n-gram consisting of a single token. It treats each word independently without considering surrounding words.

In [26]:
cv=CountVectorizer(ngram_range=(1,1))

In [27]:
bow=cv.fit_transform(df['text'])
bow=bow.toarray()
bow

array([[0, 1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 1, 0, 1, 0],
       [0, 0, 1, 0, 0, 1, 0]])

In [28]:
cv.vocabulary_

{'love': 6, 'cats': 1, 'like': 5, 'are': 0, 'good': 4, 'dont': 3, 'dogs': 2}

`Bigram (n = 2)`

A bigram is an n-gram made of two consecutive tokens. It captures the direct relationship between a word and its immediate successor.

In [29]:
cv=CountVectorizer(ngram_range=(2,2))

In [30]:
bow=cv.fit_transform(df['text'])
bow=bow.toarray()
bow

array([[0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 0]])

In [31]:
cv.vocabulary_

{'love cats': 5,
 'like cats': 3,
 'cats are': 1,
 'are good': 0,
 'dont like': 2,
 'like dogs': 4}

`Trigram (n = 3)`

A trigram is an n-gram composed of three consecutive tokens. It preserves slightly more context than a bigram by relating each word to the two words that precede it.

In [32]:
cv=CountVectorizer(ngram_range=(3,3))

In [33]:
bow=cv.fit_transform(df['text'])
bow=bow.toarray()
bow

array([[0, 0],
       [0, 0],
       [1, 0],
       [0, 1],
       [0, 0]])

In [34]:
cv.vocabulary_

{'cats are good': 0, 'dont like cats': 1}

`Unigrams to trigrams (n = 1 to 3)`

Setting `ngram_range=(1, 3)` keeps every unigram, bigram, and trigram in a single vocabulary.

In [35]:
cv=CountVectorizer(ngram_range=(1,3))

In [36]:
bow=cv.fit_transform(df['text'])
bow=bow.toarray()
bow

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0]])

In [37]:
cv.vocabulary_

{'love': 13,
 'cats': 2,
 'love cats': 14,
 'like': 10,
 'like cats': 11,
 'are': 0,
 'good': 9,
 'cats are': 3,
 'are good': 1,
 'cats are good': 4,
 'dont': 6,
 'dont like': 7,
 'dont like cats': 8,
 'dogs': 5,
 'like dogs': 12}
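
The 15 columns can be accounted for by counting the distinct n-grams of each size. The following sketch is a hand-rolled approximation, assuming lowercasing and tokens of two or more word characters (so "I" is dropped); the helper names `tokenize` and `distinct_ngrams` are hypothetical.

```python
import re

docs = ["I love cats", "I like cats", "cats are good", "I dont like cats", "I like dogs"]


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def distinct_ngrams(docs, n):
    """Collect the set of distinct n-grams over all documents."""
    found = set()
    for doc in docs:
        toks = tokenize(doc)
        for i in range(len(toks) - n + 1):
            found.add(" ".join(toks[i:i + n]))
    return found


# Number of distinct n-grams for each n, and the merged, sorted vocabulary.
sizes = {n: len(distinct_ngrams(docs, n)) for n in (1, 2, 3)}
combined = sorted(set().union(*(distinct_ngrams(docs, n) for n in (1, 2, 3))))

print(sizes)                        # {1: 7, 2: 6, 3: 2}
print(len(combined))                # 15
print(combined.index("love cats"))  # 14
```

Only two sentences still have three or more tokens after tokenization, which is why just two trigrams exist.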

## 4. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical weighting measure used to evaluate how important a word is to a document within a collection (corpus).
It increases proportionally with the number of times a word appears in a document (TF) but is offset by how frequently the word appears across all documents (IDF), which reduces the weight of common terms.

Term Frequency (TF)

* `TF(t, d) = f(t, d) / |d|`

where:

* `f(t, d)` = frequency of term t in document d

* `|d|` = total number of terms in document d

Inverse Document Frequency (IDF)

* `IDF(t) = log( N / n(t) )`

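Smoothed form (the default in scikit-learn's `TfidfVectorizer`, computed with the natural logarithm; the added 1s prevent division by zero and keep every IDF at least 1):

* `IDF(t) = log( (N + 1) / (n(t) + 1) ) + 1`

where:

* `N` = total number of documents

* `n(t)` = number of documents containing term t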

TF-IDF

* `TF-IDF(t, d) = TF(t, d) × IDF(t)`
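
To make the formulas concrete, the sketch below evaluates them by hand on a tiny pre-tokenized corpus. It assumes the natural logarithm, applies no normalization to the result, and uses illustrative function names (`tf`, `idf`, `tfidf`) that are not library calls.

```python
import math

# Pre-tokenized toy corpus (one-letter words already removed).
docs = [
    ["love", "cats"],
    ["like", "cats"],
    ["cats", "are", "good"],
    ["dont", "like", "cats"],
    ["like", "dogs"],
]


def tf(term, doc):
    """TF(t, d) = f(t, d) / |d|."""
    return doc.count(term) / len(doc)


def idf(term, docs, smooth=False):
    """IDF(t) = log(N / n(t)), or the smoothed variant when smooth=True."""
    n_docs = len(docs)
    n_t = sum(1 for doc in docs if term in doc)
    if smooth:
        return math.log((n_docs + 1) / (n_t + 1)) + 1
    return math.log(n_docs / n_t)


def tfidf(term, doc, docs, smooth=False):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc) * idf(term, docs, smooth)


print(round(idf("cats", docs), 4))               # common term: low IDF
print(round(idf("love", docs), 4))               # rare term: high IDF
print(round(idf("cats", docs, smooth=True), 4))  # smoothed variant
print(round(tfidf("love", docs[0], docs), 4))    # weight of "love" in the first document
```

The term "cats" appears in four of the five documents and therefore receives a much smaller IDF than "love", which appears in only one.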

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()

In [39]:
tfidf.fit_transform(df['text']).toarray()

array([[0.        , 0.49084524, 0.        , 0.        , 0.        ,
        0.        , 0.87124678],
       [0.        , 0.64374446, 0.        , 0.        , 0.        ,
        0.76524053, 0.        ],
       [0.65690037, 0.37008621, 0.        , 0.        , 0.65690037,
        0.        , 0.        ],
       [0.        , 0.42395393, 0.        , 0.75251519, 0.        ,
        0.5039682 , 0.        ],
       [0.        , 0.        , 0.83088075, 0.        , 0.        ,
        0.55645052, 0.        ]])

In [40]:
tfidf.idf_

array([2.09861229, 1.18232156, 2.09861229, 2.09861229, 2.09861229,
       1.40546511, 2.09861229])

In [41]:
tfidf.get_feature_names_out()

array(['are', 'cats', 'dogs', 'dont', 'good', 'like', 'love'],
      dtype=object)
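
The rows printed by `fit_transform` are not the raw TF × IDF products: with its default settings, `TfidfVectorizer` uses the raw term count as TF, the smoothed IDF, and then rescales every row to unit Euclidean (L2) length. The sketch below is an illustrative check of that last step. It copies the counts of the first document and the `idf_` values from the outputs above instead of calling scikit-learn.

```python
import math

# Term counts for "I love cats" over ['are', 'cats', 'dogs', 'dont', 'good', 'like', 'love'].
counts = [0, 1, 0, 0, 0, 0, 1]

# IDF values copied from tfidf.idf_ above.
idf = [2.09861229, 1.18232156, 2.09861229, 2.09861229, 2.09861229, 1.40546511, 2.09861229]

# Raw weights: term count times IDF.
raw = [c * w for c, w in zip(counts, idf)]

# L2 normalization: divide every entry by the Euclidean length of the row.
length = math.sqrt(sum(x * x for x in raw))
normalized = [x / length for x in raw]

print([round(x, 4) for x in normalized])
```

Because each row is rescaled by a constant, dividing the counts by the document length first (as in the TF formula) would leave the normalized row unchanged.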