# <center>NLP💬🔉 By 🎯Udaya</center>

# One-Hot Encoding (OHE)
#### Description:
One-hot encoding represents each word in a vocabulary as a binary vector. Each vector has the same length as the size of the vocabulary, and only one element is "1" (hot), indicating the presence of the word, while all other elements are "0".

#### Example:
For a vocabulary of {apple, orange, banana}:

* "apple" → [1, 0, 0]
* "orange" → [0, 1, 0]
* "banana" → [0, 0, 1]

In [1]:
import numpy as np

words = np.array(['apple', 'orange', 'banana']).reshape(-1, 1)
words

array([['apple'],
       ['orange'],
       ['banana']], dtype='<U6')

In [2]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoder

In [3]:
one_hot_encoded = encoder.fit_transform(words)
print(one_hot_encoded)

[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
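
The rows printed above do not match the hand-written example because `OneHotEncoder` orders its categories alphabetically (apple, banana, orange), so "orange" ends up in the third column. The following is a minimal pure-Python sketch of the same idea, assuming a sorted vocabulary; `one_hot` is a hypothetical helper name used only for illustration and is not part of scikit-learn.

```python
def one_hot(word, vocabulary):
    """Return a list with a single 1 at the word's index in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

words = ["apple", "orange", "banana"]

# Sort the vocabulary, mirroring the alphabetical category order seen above.
vocabulary = sorted(set(words))

encoded = {word: one_hot(word, vocabulary) for word in words}
for word, vector in encoded.items():
    print(word, vector)
```

Each vector has the same length as the vocabulary, which is why this representation grows quickly as the vocabulary grows.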


#### Advantages:

* Simple and easy to implement.
* No loss of information for small vocabularies.
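
#### Disadvantages:

* Inefficient for large vocabularies (high dimensionality).
* Does not capture semantic meaning or relationships between words.
* Sparsity: almost every element of each vector is 0.
* No fixed-length representation for text: a sentence becomes one vector per word, so sentences of different lengths produce inputs of different sizes, while most models expect a fixed-size input.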

# Bag of Words (BoW)
#### Description:
Bag of Words is a method where the text is represented as an unordered collection of words, disregarding grammar and word order but keeping multiplicity. It typically involves creating a vector of word frequencies.

#### Example:
For the sentences "I love apples" and "I love oranges":

#### Vocabulary: {I, love, apples, oranges}
* "I love apples" → [1, 1, 1, 0]
* "I love oranges" → [1, 1, 0, 1]

### Example 1

In [4]:
sentences = ["I love apples", "I love oranges"]

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

vectorizer

In [6]:
bow_encoded = vectorizer.fit_transform(sentences).toarray()
print(bow_encoded)

[[1 1 0]
 [0 1 1]]
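
The printed matrix has three columns rather than four because `CountVectorizer` lowercases the text and, with its default `token_pattern` (`(?u)\b\w\w+\b`), keeps only tokens of two or more characters. The word "I" is therefore dropped and the vocabulary becomes apples, love, oranges. The sketch below reproduces this behaviour with the standard library only; `tokenize` and `bag_of_words` are illustrative names, and the regular expression is assumed to mirror the default tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens made of 2+ word characters
    # (assumed to mirror CountVectorizer's default token_pattern).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def bag_of_words(sentences):
    tokenized = [tokenize(s) for s in sentences]
    # Alphabetically sorted vocabulary, one column per term.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["I love apples", "I love oranges"])
print(vocabulary)
print(vectors)
```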


### Example 2

In [7]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'text':["people of odishaare good","odisha love naveen pattnaik","odisha is not a good state"],"O/P":[1,1,0]})
df

| | text | O/P |
|---|---|---|
| 0 | people of odishaare good | 1 |
| 1 | odisha love naveen pattnaik | 1 |
| 2 | odisha is not a good state | 0 |


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv

In [9]:
bow = cv.fit_transform(df['text'])
bow

<3x11 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [10]:
# Vocab

print(cv.vocabulary_)
print("Index positions 👆👆")

{'people': 9, 'of': 7, 'odishaare': 6, 'good': 0, 'odisha': 5, 'love': 2, 'naveen': 3, 'pattnaik': 8, 'is': 1, 'not': 4, 'state': 10}
Index positions 👆👆


In [11]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[1 0 0 0 0 0 1 1 0 1 0]]
[[0 0 1 1 0 1 0 0 1 0 0]]
[[1 1 0 0 1 1 0 0 0 0 1]]


In [12]:
bow.toarray()

array([[1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0],
       [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]], dtype=int64)

### Types

### 1. Binary Bag of Words
#### Description:
Binary Bag of Words represents whether a word is present or absent in a document, ignoring its frequency. It assigns a binary value (0 or 1) to each word in the vocabulary, indicating its presence (1) or absence (0) in the document.

#### Example:
Consider the sentences:

* "I love apples"
* "I love oranges"
#### The vocabulary would be: ['I', 'apples', 'love', 'oranges']

The binary Bag of Words representation would be:

* "I love apples" → [1, 1, 1, 0]
* "I love oranges" → [1, 0, 1, 1]
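
Note that `CountVectorizer` lowercases the text and, by default, ignores single-character tokens such as "I", so the output of the code has three columns (apples, love, oranges) instead of four.

#### Code Example (using scikit-learn):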

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)

sentences = ["I love apples", "I love oranges"]

binary_bow_encoded = vectorizer.fit_transform(sentences).toarray()
print(binary_bow_encoded)

[[1 1 0]
 [0 1 1]]


### 2. Count Bag of Words
#### Description:
Count Bag of Words represents the frequency of each word in the document. It assigns an integer value to each word in the vocabulary, indicating the number of times that word appears in the document.

#### Example:
Consider the same sentences:

* "I love apples"
* "I love oranges"

The vocabulary and the Count Bag of Words representation are the same as in the Binary Bag of Words example above, because every word occurs only once per sentence. In general, each entry holds the number of occurrences of the word instead of a 0/1 flag.

#### Code Example (using scikit-learn) :

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()


sentences = ["I love apples", "I love oranges"]
count_bow_encoded = vectorizer.fit_transform(sentences).toarray()
print(count_bow_encoded)

[[1 1 0]
 [0 1 1]]
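
Because every word occurs exactly once in these two sentences, the binary and count outputs are identical. The difference only shows up when a word repeats. The sketch below uses a made-up sentence (not from the examples above) and plain Python to contrast the two variants.

```python
from collections import Counter

# Illustrative sentence with a repeated word.
sentence = "apples apples love apples"
tokens = sentence.split()
vocabulary = sorted(set(tokens))

counts = Counter(tokens)

# Count BoW: number of occurrences of each vocabulary term.
count_vector = [counts[term] for term in vocabulary]

# Binary BoW: 1 if the term occurs at all, else 0.
binary_vector = [1 if counts[term] > 0 else 0 for term in vocabulary]

print(vocabulary)
print("count: ", count_vector)
print("binary:", binary_vector)
```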


#### Advantages:
* Simple and intuitive representation.
* Captures the occurrence of words in the document.
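* Produces a fixed-size vector for every document, regardless of its length.

#### Disadvantages:
* Ignores word order and context, so sentences containing the same words in a different order get the same vector.
* Produces sparse matrices when the vocabulary is large.
* Out-of-vocabulary words: words not seen while fitting the vocabulary are ignored when new text is transformed.
* Common words may dominate the representation.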
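
Two of these disadvantages are easy to see in a few lines of Python. The sketch below uses a small fixed vocabulary and made-up sentences; `vectorize` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def vectorize(text, vocabulary):
    counts = Counter(text.lower().split())
    # Words that are not in the vocabulary are silently ignored.
    return [counts[term] for term in vocabulary]

vocabulary = ["bites", "dog", "man"]

a = vectorize("dog bites man", vocabulary)
b = vectorize("man bites dog", vocabulary)  # same words, opposite meaning
c = vectorize("dog bites cat", vocabulary)  # "cat" is out of vocabulary

print(a, b, c)
```

The first two sentences receive the same vector even though they mean different things, and "cat" leaves no trace in the third vector.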

#### Let's deal with word ordering

# Uni-grams (1-grams)
Description: Uni-grams are single words or tokens, each considered on its own without regard to its neighbors. They are the simplest form of text representation, where each word is treated individually.

#### Example:
For the sentence "The quick brown fox", the Uni-grams are ["The", "quick", "brown", "fox"].

In [15]:
from nltk.tokenize import word_tokenize
sentence = "Udaya is looking for a new role in NLP"


tokens = word_tokenize(sentence)

uni_grams = tokens

print("Uni-grams:", uni_grams)

Uni-grams: ['Udaya', 'is', 'looking', 'for', 'a', 'new', 'role', 'in', 'NLP']


# Bi-grams (2-grams)
Description: Bi-grams are sequences of two adjacent words or tokens. They capture pairs of words that occur together in the text and provide more context than Uni-grams.

#### Example:
For the sentence "The quick brown fox", the Bi-grams are ["The quick", "quick brown", "brown fox"].

In [16]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize


sentence = "The quick brown fox"
tokens = word_tokenize(sentence)


bi_grams = list(ngrams(tokens, 2))
print("Bi-grams:", bi_grams)

Bi-grams: [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]


# N-grams
Description: N-grams refer to sequences of $n$ adjacent words or tokens. They can capture longer patterns and dependencies within the text.

#### Example:
For the sentence "The quick brown fox", examples of higher N-grams include:

* 3-grams (Trigrams): ["The quick brown", "quick brown fox"]
* 4-grams (Four-grams): ["The quick brown fox"]

In [17]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize


sentence = "The quick brown fox"
tokens = word_tokenize(sentence)

In [18]:
# Trigrams (3-grams)
tri_grams = list(ngrams(tokens, 3))
print("Trigrams:", tri_grams)

Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox')]


In [19]:
# Four-grams (4-grams)
four_grams = list(ngrams(tokens, 4))
print("Four-grams:", four_grams)

Four-grams: [('The', 'quick', 'brown', 'fox')]
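
The same sliding-window idea can be written without NLTK. The sketch below is a minimal pure-Python version that uses whitespace tokenization; `make_ngrams` is an illustrative name. A sentence of m tokens yields m - n + 1 n-grams, and none when n is larger than m.

```python
def make_ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    # zip over n shifted copies of the token list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "The quick brown fox".split()

for n in (1, 2, 3, 4):
    print(n, make_ngrams(tokens, n))
```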


### Advantages of N-grams:
* Contextual Information: N-grams capture sequential patterns in text, providing contextual information crucial for tasks like language modeling and sentiment analysis.

* Feature Representation: They serve as effective feature representations in machine learning models, enhancing accuracy by incorporating word sequences.

* Flexibility: N-grams can be adjusted to capture different levels of linguistic context (e.g., Bi-grams, Tri-grams), offering flexibility in analysis.

### Disadvantages of N-grams:
* Data Sparsity: Higher-order N-grams may suffer from data sparsity issues, as many sequences may occur infrequently or only once in the dataset.

* Computational Cost: Storing and processing N-grams, especially in large datasets, can be memory-intensive and computationally expensive.

* Overfitting: Higher-order N-grams can lead to overfitting, where the model learns specific sequences that do not generalize well to new data.

* Generalization: N-grams capture local dependencies but may miss long-range dependencies or semantic relationships between distant words.
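
The data sparsity point can be illustrated by counting how many distinct n-grams occur only once as n grows. The sketch below uses a short made-up text, so the numbers are only indicative of the trend.

```python
from collections import Counter

# Illustrative toy text.
text = "the cat sat on the mat and the cat ate the fish"
tokens = text.split()

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in the token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

summary = {}
for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    summary[n] = (len(counts), singletons)
    print(f"n={n}: {len(counts)} distinct n-grams, {singletons} seen only once")
```

As n increases, a larger share of the n-grams is seen only once, which gives a model little evidence to learn from.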

# TF-IDF (Term Frequency-Inverse Document Frequency)
Term Frequency (TF): Measures how often a term $t$ appears in a document $d$, relative to the total number of terms in that document.

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

#### Example: If the word "cat" appears 3 times in a document with a total of 100 words, the TF for "cat" in that document is 3/100 = 0.03

Inverse Document Frequency (IDF): Measures how informative a term is across the entire corpus $D$. It diminishes the weight of terms that appear in many documents and emphasizes terms that are rare in the corpus. The $+1$ in the denominator avoids division by zero for terms that appear in no document.

$$\mathrm{IDF}(t, D) = \log\left(\frac{\text{total number of documents in the corpus } D}{\text{number of documents containing term } t + 1}\right)$$

#### Example: If there are 1,000,000 documents in the corpus and the word "cat" appears in 100,000 documents, the IDF for "cat" is log(1,000,000 / (100,000 + 1)) ≈ log(10)

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$
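
The three definitions above translate almost directly into code. The sketch below is a minimal pure-Python version, assuming a natural logarithm, whitespace tokenization, and a made-up four-document corpus; the function names are illustrative.

```python
import math

def tf(term, doc_tokens):
    # Share of the document's tokens that are `term`.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Number of documents containing the term, with +1 in the denominator.
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / (containing + 1))

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Illustrative corpus (an assumption, not data from this notebook).
corpus = ["the cat sat", "the dog barked", "the bird sang", "a cat purred"]
docs = [text.split() for text in corpus]

for term in ("the", "cat", "dog"):
    print(term, round(tf_idf(term, docs[0] if term != "dog" else docs[1], docs), 4))
```

In this toy corpus the rarer term "dog" receives a higher IDF than "cat", while "the" gets an IDF of zero.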

#### Application
TF-IDF is used primarily in:

* Information Retrieval: Ranking documents based on relevance to a query.
* Keyword Extraction: Identifying important words or phrases within a document.
* Text Summarization: Focusing on the most significant terms to represent the essence of a document.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [21]:
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)
# Get the TF-IDF matrix
tfidf_matrix = X.toarray()

tfidf_matrix

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [22]:
# Print the TF-IDF matrix
print("TF-IDF Matrix: 👇")
print(tfidf_matrix)

TF-IDF Matrix: 👇
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]


In [23]:
# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
print("\nFeature Names: 👇")
print(feature_names)


Feature Names: 👇
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
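
The values in the matrix above do not follow the textbook formula exactly. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses raw term counts as TF, computes IDF as ln((1 + n) / (1 + df)) + 1, and then scales every row to unit Euclidean length. The sketch below re-derives the first two rows by hand under those assumptions, using only the standard library; the tokenizing regular expression is assumed to match the vectorizer's default.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in corpus]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency and smoothed IDF for every term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))  # L2 normalization
    return [w / norm for w in weights]

first_row = dict(zip(vocabulary, tfidf_row(docs[0])))
second_row = dict(zip(vocabulary, tfidf_row(docs[1])))
print({term: round(value, 8) for term, value in first_row.items()})
```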


# Word2Vec Models
Word2Vec is a popular technique in Natural Language Processing (NLP) that was developed by Tomas Mikolov and his team at Google. It is used to produce word embeddings, which are dense vector representations of words in a continuous vector space.

### There are two primary models in Word2Vec:

### i) Continuous Bag of Words (CBOW):

* Predicts the target (center) word from the surrounding context words.
* The context word vectors are averaged or summed to predict the target word.
* Typically faster to train than Skip-gram, which makes it efficient for larger datasets.
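
To make the CBOW setup concrete, the sketch below lists the (context, target) training examples that a fixed window would produce for one sentence. It only shows how the examples are formed, not how the network is trained. `cbow_pairs` is an illustrative helper, and real implementations add details (such as negative sampling) that are omitted here.

```python
def cbow_pairs(tokens, window):
    """Build (context_words, target_word) examples with a fixed window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = cbow_pairs(tokens, window=2)

for context, target in pairs:
    print(context, "->", target)
```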

In [24]:
from gensim.models import Word2Vec

# using common_texts as a sample dataset
from gensim.test.utils import common_texts  

# Define the model with CBOW architecture (sg=0)
model_cbow = Word2Vec(sentences=common_texts, vector_size=100, 
                      window=5, min_count=1, workers=4, sg=0)

#### Train the model

In [25]:
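# The constructor above already trained the model because `sentences` was passed;
# this call continues training for 10 more epochs.
model_cbow.train(common_texts, total_examples=len(common_texts), epochs=10)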

vector = model_cbow.wv['computer']

vector

array([-0.00515774, -0.00667028, -0.0077791 ,  0.00831315, -0.00198292,
       -0.00685696, -0.0041556 ,  0.00514562, -0.00286997, -0.00375075,
        0.0016219 , -0.0027771 , -0.00158482,  0.0010748 , -0.00297881,
        0.00852176,  0.00391207, -0.00996176,  0.00626142, -0.00675622,
        0.00076966,  0.00440552, -0.00510486, -0.00211128,  0.00809783,
       -0.00424503, -0.00763848,  0.00926061, -0.00215612, -0.00472081,
        0.00857329,  0.00428459,  0.0043261 ,  0.00928722, -0.00845554,
        0.00525685,  0.00203994,  0.0041895 ,  0.00169839,  0.00446543,
        0.0044876 ,  0.0061063 , -0.00320303, -0.00457706, -0.00042664,
        0.00253447, -0.00326412,  0.00605948,  0.00415534,  0.00776685,
        0.00257002,  0.00811905, -0.00138761,  0.00808028,  0.0037181 ,
       -0.00804967, -0.00393476, -0.0024726 ,  0.00489447, -0.00087241,
       -0.00283173,  0.00783599,  0.00932561, -0.0016154 , -0.00516075,
       -0.00470313, -0.00484746, -0.00960562,  0.00137242, -0.00

#### Training from scratch

In [26]:
from gensim.models import Word2Vec


text_data = [
    "This is the first sentence",
    "Here is another sentence",
    "Word embeddings are a type of word representation",
    "Word2Vec is a popular word embedding technique",
    "Gensim is a library for topic modelling and document indexing"
]

# Tokenize the text data
tokenized_sentences = [sentence.lower().split() for sentence in text_data]


# Initialize the Word2Vec model. Because `sentences` is passed here,
# the constructor builds the vocabulary and runs an initial training pass.
model = Word2Vec(
    sentences=tokenized_sentences,  # Tokenized sentences
    vector_size=100,                # Size of word vectors
    window=5,                       # Context window size
    min_count=1,                    # Minimum frequency count for words
    workers=4,                      # Number of worker threads to train the model
    sg=0                            # Training algorithm: 0 for CBOW, 1 for Skip-gram
)

# Continue training for 10 more epochs on the same sentences
model.train(tokenized_sentences, total_examples=len(tokenized_sentences), epochs=10)

# Save the model
model.save("word2vec.model")

# Load the model
model = Word2Vec.load("word2vec.model")

# Get the vector for a specific word
vector = model.wv['word2vec']

# Find similar words
similar_words = model.wv.most_similar('word2vec')
print("Vector for 'word2vec':", vector)

Vector for 'word2vec': [-7.1953363e-03  4.2302292e-03  2.1700179e-03  7.4393377e-03
 -4.8945886e-03 -4.5648050e-03 -6.0984357e-03  3.3043448e-03
 -4.5003183e-03  8.5221231e-03 -4.2910338e-03 -9.1071632e-03
 -4.8183631e-03  6.4211967e-03 -6.3796784e-03 -5.2579357e-03
 -7.3032537e-03  6.0265823e-03  3.3544898e-03  2.8453746e-03
 -3.1387201e-03  6.0324157e-03 -6.1554727e-03 -1.9737275e-03
 -5.9837005e-03 -9.8952802e-04 -2.0193043e-03  8.4806308e-03
  7.9721030e-05 -8.5778618e-03 -5.4202839e-03 -6.8745711e-03
  2.6961106e-03  9.4524780e-03 -5.8118002e-03  8.2588550e-03
  8.5373139e-03 -7.0707994e-03 -8.8822590e-03  9.4671631e-03
  8.3741248e-03 -4.6951962e-03 -6.7239637e-03  7.8414902e-03
  3.7663493e-03  8.0907475e-03 -7.5726504e-03 -9.5301094e-03
  1.5815090e-03 -9.8086046e-03 -4.8825922e-03 -3.4565933e-03
  9.6187191e-03  8.6261341e-03 -2.8344153e-03  5.8288481e-03
  8.2397368e-03 -2.2637893e-03  9.5287645e-03  7.1633095e-03
  2.0453148e-03 -3.8449026e-03 -5.0832839e-03 -3.0441210e-03
 

In [27]:
# Print the results

print("Words similar to 'word2vec': 👇\n", similar_words)

Words similar to 'word2vec': 👇
 [('topic', 0.1968393474817276), ('word', 0.12312232702970505), ('modelling', 0.08296788483858109), ('first', 0.08066906034946442), ('popular', 0.06567048281431198), ('is', 0.054732438176870346), ('library', 0.03177386522293091), ('and', 0.016726572066545486), ('indexing', 0.015996284782886505), ('are', 0.013051173649728298)]


### ii) Skip-gram:

* Predicts the surrounding context words given a target (center) word.
* Each target word is paired with every word in its context window, and the model learns to predict those context words.
* Works well for infrequent words and smaller datasets.
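
Skip-gram reverses the direction of prediction, so each training example is a (target, context) pair. The sketch below enumerates those pairs for a fixed window. `skipgram_pairs` is an illustrative helper that shows only how the pairs are formed, not the training itself.

```python
def skipgram_pairs(tokens, window):
    """Build (target_word, context_word) pairs with a fixed window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)

for target, context in pairs:
    print(target, "->", context)
```

Each occurrence of a word produces several pairs, so even rare words contribute multiple training examples.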

In [28]:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts  
# using common_texts as a sample dataset

# Define the model with Skip-gram architecture (sg=1)
model_skipgram = Word2Vec(sentences=common_texts, 
                          vector_size=100, window=5, min_count=1, 
                          workers=4, sg=1)

# Continue training for 10 more epochs (the constructor above already trained once)
model_skipgram.train(common_texts, 
                     total_examples=len(common_texts), epochs=10)

# Access the vector for a specific word
vector = model_skipgram.wv['computer']

# Print the vector
print(vector)

[-0.00515774 -0.00667028 -0.0077791   0.00831315 -0.00198292 -0.00685696
 -0.0041556   0.00514562 -0.00286997 -0.00375075  0.0016219  -0.0027771
 -0.00158482  0.0010748  -0.00297881  0.00852176  0.00391207 -0.00996176
  0.00626142 -0.00675622  0.00076966  0.00440552 -0.00510486 -0.00211128
  0.00809783 -0.00424503 -0.00763848  0.00926061 -0.00215612 -0.00472081
  0.00857329  0.00428459  0.0043261   0.00928722 -0.00845554  0.00525685
  0.00203994  0.0041895   0.00169839  0.00446543  0.0044876   0.0061063
 -0.00320303 -0.00457706 -0.00042664  0.00253447 -0.00326412  0.00605948
  0.00415534  0.00776685  0.00257002  0.00811905 -0.00138761  0.00808028
  0.0037181  -0.00804967 -0.00393476 -0.0024726   0.00489447 -0.00087241
 -0.00283173  0.00783599  0.00932561 -0.0016154  -0.00516075 -0.00470313
 -0.00484746 -0.00960562  0.00137242 -0.00422615  0.00252744  0.00561612
 -0.00406709 -0.00959937  0.00154715 -0.00670207  0.0024959  -0.00378173
  0.00708048  0.00064041  0.00356198 -0.00273993 -0.0

## Average Word2Vec Implementation

In [29]:
!pip install gensim


In [30]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

### Model used: `word2vec-google-news-300`

In [31]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

In [32]:
vector_king = wv['king']
vector_king

# 300 dim

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [33]:
vector_king.shape

(300,)

In [34]:
wv['cricket']

array([-3.67187500e-01, -1.21582031e-01,  2.85156250e-01,  8.15429688e-02,
        3.19824219e-02, -3.19824219e-02,  1.34765625e-01, -2.73437500e-01,
        9.46044922e-03, -1.07421875e-01,  2.48046875e-01, -6.05468750e-01,
        5.02929688e-02,  2.98828125e-01,  9.57031250e-02,  1.39648438e-01,
       -5.41992188e-02,  2.91015625e-01,  2.85156250e-01,  1.51367188e-01,
       -2.89062500e-01, -3.46679688e-02,  1.81884766e-02, -3.92578125e-01,
        2.46093750e-01,  2.51953125e-01, -9.86328125e-02,  3.22265625e-01,
        4.49218750e-01, -1.36718750e-01, -2.34375000e-01,  4.12597656e-02,
       -2.15820312e-01,  1.69921875e-01,  2.56347656e-02,  1.50146484e-02,
       -3.75976562e-02,  6.95800781e-03,  4.00390625e-01,  2.09960938e-01,
        1.17675781e-01, -4.19921875e-02,  2.34375000e-01,  2.03125000e-01,
       -1.86523438e-01, -2.46093750e-01,  3.12500000e-01, -2.59765625e-01,
       -1.06933594e-01,  1.04003906e-01, -1.79687500e-01,  5.71289062e-02,
       -7.41577148e-03, -

In [35]:
wv.most_similar('happy')

[('glad', 0.7408890724182129),
 ('pleased', 0.6632170677185059),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049172401428),
 ('satisfied', 0.6437949538230896),
 ('proud', 0.636042058467865),
 ('delighted', 0.627237856388092),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247665286064148)]
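
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes cosine similarity by hand on made-up 3-dimensional vectors (not real embeddings) to show what the number measures. Note that a high score reflects similar usage contexts rather than identical meaning, which is why an antonym such as "disappointed" can appear in the list above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
happy = [0.9, 0.1, 0.3]
glad = [0.8, 0.2, 0.3]
table = [-0.2, 0.9, -0.4]

print("happy vs glad: ", cosine_similarity(happy, glad))
print("happy vs table:", cosine_similarity(happy, table))
```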

In [36]:
wv.similarity('women','queen')

0.20191032
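
The "Average Word2Vec" heading above refers to representing a whole sentence by a single vector. A common way to do this is to average the vectors of its words and skip words that have no embedding. The sketch below shows the idea on made-up 3-dimensional vectors stored in a plain dictionary. It is one possible approach, not necessarily the exact implementation intended here. With the loaded `wv` object, the same membership test (`token in wv`) and lookup (`wv[token]`) can be used in place of the dictionary.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None  # no token of the sentence is in the vocabulary
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy embeddings, invented for illustration only.
toy_vectors = {
    "i": [0.0, 0.2, 0.4],
    "love": [0.6, 0.2, 0.0],
    "cricket": [0.3, 0.8, 0.2],
}

# "matches" has no embedding here, so it is skipped.
sentence_vector = average_vector("i love cricket matches".split(), toy_vectors)
print(sentence_vector)
```

Every sentence maps to a vector of the same dimension regardless of its length, which gives a fixed-size input for a downstream classifier.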