# Representation of data

https://www.analyticsvidhya.com/blog/2022/02/machine-learning-techniques-for-text-representation-in-nlp/

## One-Hot Encoding

Each sentence becomes a binary vector with one entry per vocabulary word: 1 if the word occurs in the sentence, 0 otherwise.  
Not very funny.

In [1]:
import numpy as np
import pandas as pd

sentences = ['Author writes on Analytics Vidhya', 'Vidhya reads and writes comment on Analytics Vidhya', 'Vidhya appreciates author']
df = pd.DataFrame({"text":sentences, "output":[1,1,0]})
df.head()

| | text | output |
|---|---|---|
| 0 | Author writes on Analytics Vidhya | 1 |
| 1 | Vidhya reads and writes comment on Analytics V... | 1 |
| 2 | Vidhya appreciates author | 0 |


In [2]:
# Define a function that takes a list of unique words and a sentence and returns a one-hot encoded array
def one_hot_encode(unique_words, sentence):
    one_hot_encoded = [0] * len(unique_words)
    for word in sentence.split():
        if word in unique_words:
            one_hot_encoded[unique_words.index(word)] = 1
    return one_hot_encoded

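# Create a list of unique words.
# Note: matching is case-sensitive ("Author" and "author" are separate entries),
# and set() ordering is arbitrary, so the column order can differ between runs.
unique_words = list(set(" ".join(df["text"]).split()))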

# Apply the one-hot encoding function to each sentence in the "text" column
one_hot_encoded = [one_hot_encode(unique_words, sentence) for sentence in df["text"]]

# Print the resulting array
print(one_hot_encoded)

[[1, 0, 0, 0, 0, 1, 0, 1, 1, 1], [1, 1, 0, 1, 1, 1, 0, 1, 0, 1], [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]]
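
The cell above looks up each word with `unique_words.index(word)`, which rescans the list every time, and the `set` ordering makes the columns unpredictable. As a minimal sketch (not the original notebook's code), the variant below sorts the vocabulary and uses a dictionary for lookups. The names `word_to_index` and `one_hot_encode_sorted` are illustrative assumptions, and matching stays case-sensitive as in the cell above.

```python
sentences = [
    "Author writes on Analytics Vidhya",
    "Vidhya reads and writes comment on Analytics Vidhya",
    "Vidhya appreciates author",
]

# Sort the vocabulary so the column order is the same on every run.
vocabulary = sorted(set(" ".join(sentences).split()))
# Map each word to its column index (hypothetical helper name).
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot_encode_sorted(sentence):
    """Return a 0/1 presence vector over the sorted vocabulary."""
    encoded = [0] * len(vocabulary)
    for word in sentence.split():
        index = word_to_index.get(word)
        if index is not None:  # ignore words outside the vocabulary
            encoded[index] = 1
    return encoded


encoded_sentences = [one_hot_encode_sorted(s) for s in sentences]
print(vocabulary)
print(encoded_sentences)
```

Because uppercase letters sort before lowercase ones, `Author` and `author` still occupy separate columns here; lowercasing the text first would merge them.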


## Bag of words

Each sentence becomes a vector with one entry per vocabulary word, holding the number of times that word occurs in the sentence.  
Not very funny, either.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
bow = cv.fit_transform(df['text'])
cv.vocabulary_

{'author': 3,
 'writes': 8,
 'on': 5,
 'analytics': 0,
 'vidhya': 7,
 'reads': 6,
 'and': 1,
 'comment': 4,
 'appreciates': 2}

In [4]:
print(bow[1].toarray())

[[1 1 0 0 1 1 1 2 1]]
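
To make the counts above less opaque, here is a rough pure-Python sketch of the same idea using `collections.Counter`. It assumes lowercasing plus whitespace splitting is enough, which only approximates `CountVectorizer`'s default tokenizer (this corpus has no punctuation or one-letter words). The function name `bag_of_words` is illustrative.

```python
from collections import Counter

sentences = [
    "Author writes on Analytics Vidhya",
    "Vidhya reads and writes comment on Analytics Vidhya",
    "Vidhya appreciates author",
]

# Lowercase and split on whitespace (a simplification of CountVectorizer's tokenizer).
tokenized = [s.lower().split() for s in sentences]
# Alphabetical vocabulary, one column per distinct word.
vocabulary = sorted({token for doc in tokenized for token in doc})


def bag_of_words(tokens):
    """Return the count of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


bow_rows = [bag_of_words(doc) for doc in tokenized]
print(vocabulary)
print(bow_rows[1])
```

The second row has a 2 in the `vidhya` column because that word occurs twice in the second sentence.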


## N-grams

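Count contiguous sequences of n words instead of single words (bigrams in the example below). Because some word order is preserved, n-grams may capture more of the sentence's meaning than single-word counts.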

In [5]:
# ngram_range is a (min_n, max_n) tuple; (2, 2) keeps bigrams only
cv = CountVectorizer(ngram_range=(2, 2))
bow = cv.fit_transform(df['text'])
cv.vocabulary_

{'author writes': 3,
 'writes on': 10,
 'on analytics': 5,
 'analytics vidhya': 0,
 'vidhya reads': 8,
 'reads and': 6,
 'and writes': 1,
 'writes comment': 9,
 'comment on': 4,
 'vidhya appreciates': 7,
 'appreciates author': 2}
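
For intuition about where these vocabulary entries come from, the sketch below builds the bigrams of one sentence by sliding a window over its tokens. The helper `ngrams` is a hypothetical name, and the whitespace tokenization is again a simplification of what `CountVectorizer` does.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-word sequences of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Vidhya reads and writes comment on Analytics Vidhya"
tokens = sentence.lower().split()

bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)
print(bigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so sentences shorter than n contribute nothing.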

## TF-IDF (Term Frequency and Inverse Document Frequency)

The core idea is to weight each word by how informative it is: a word that appears many times in a document but rarely in the rest of the corpus is important for that document, so it receives a higher weight. The weight is the product of two terms, TF and IDF.

### Term frequency (TF)

Term frequency is the number of times a word occurs in a document divided by the total number of terms in that document.
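
A minimal sketch of this definition, assuming whitespace tokenization and lowercasing (the function name `term_frequency` is illustrative):

```python
from collections import Counter


def term_frequency(tokens):
    """Map each term to (occurrences in the document) / (total terms in the document)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tokens = "Vidhya reads and writes comment on Analytics Vidhya".lower().split()
tf = term_frequency(tokens)
print(tf["vidhya"])  # 2 occurrences out of 8 terms
print(tf["reads"])   # 1 occurrence out of 8 terms
```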

### Inverse document frequency (IDF)

Inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term t.
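
The definition can be sketched directly, using the natural logarithm and the three example sentences. The function name `inverse_document_frequency` is an assumption for illustration, and this is the plain textbook form, not necessarily what a given library computes.

```python
import math

sentences = [
    "Author writes on Analytics Vidhya",
    "Vidhya reads and writes comment on Analytics Vidhya",
    "Vidhya appreciates author",
]
# Represent each document as the set of its lowercase words.
documents = [set(s.lower().split()) for s in sentences]


def inverse_document_frequency(term, documents):
    """Return log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / df)


print(inverse_document_frequency("vidhya", documents))
print(inverse_document_frequency("appreciates", documents))
```

Under this definition a word that occurs in every document, such as `vidhya`, gets an IDF of 0, while a word found in a single document gets the largest value.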

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.46609584, 0.        , 0.        , 0.46609584, 0.        ,
        0.46609584, 0.        , 0.361965  , 0.46609584],
       [0.30716074, 0.40387946, 0.        , 0.        , 0.40387946,
        0.30716074, 0.40387946, 0.47707544, 0.30716074],
       [0.        , 0.        , 0.72033345, 0.54783215, 0.        ,
        0.        , 0.        , 0.42544054, 0.        ]])

In [7]:
print(list(zip(tfidf.get_feature_names_out(), tfidf.idf_)))

[('analytics', 1.2876820724517808), ('and', 1.6931471805599454), ('appreciates', 1.6931471805599454), ('author', 1.2876820724517808), ('comment', 1.6931471805599454), ('on', 1.2876820724517808), ('reads', 1.6931471805599454), ('vidhya', 1.0), ('writes', 1.2876820724517808)]
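
Note that `vidhya` has an IDF of 1.0 here rather than 0. With its default settings, `TfidfVectorizer` uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, multiplies it by the raw term count, and then scales each row to unit L2 length. The pure-Python sketch below reproduces the numbers above under those assumptions; the helper names are illustrative and the whitespace tokenization is a simplification.

```python
import math
from collections import Counter

sentences = [
    "Author writes on Analytics Vidhya",
    "Vidhya reads and writes comment on Analytics Vidhya",
    "Vidhya appreciates author",
]
tokenized = [s.lower().split() for s in sentences]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed IDF (assumed to match the vectorizer's default smooth_idf=True).
smoothed_idf = {
    term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary
}


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalized across the row."""
    counts = Counter(tokens)
    weights = [counts[term] * smoothed_idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


tfidf_rows = [tfidf_row(doc) for doc in tokenized]
print({term: round(value, 4) for term, value in smoothed_idf.items()})
print([round(w, 8) for w in tfidf_rows[0]])
```

The term frequency used here is the raw count, not the count divided by document length; the row normalization plays that role instead.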
