## Embeddings
* Embeddings represent words as numbers so that computers can measure how similar they are:
  * Each word is turned into a vector.
  * Similar words get similar vectors.

## What Are Embeddings?
* Embeddings are vector representations of categorical data, such as words.
* Because the vectors encode meaning, computers can compare items and capture the relationships between them.
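The idea of mapping categories to vectors can be sketched as a simple lookup table. This is a minimal illustration, not how any particular library stores embeddings: the vocabulary, the dimension of 3, and the names `token_to_index`, `embedding_table`, and `embed` are all assumptions made for the example, and the values are random rather than learned.

```python
import random

# Hypothetical categorical vocabulary; any set of category labels would do.
vocabulary = ["king", "queen", "apple", "banana"]
embedding_dim = 3  # illustrative size only

# Map each category to an integer index.
token_to_index = {token: i for i, token in enumerate(vocabulary)}

# An embedding table holds one dense vector (row) per category.
# Values are random here; a real model would learn them during training.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocabulary
]

def embed(token):
    """Return the dense vector stored for a known token."""
    return embedding_table[token_to_index[token]]

print(embed("queen"))
```

Looking up a token is just an index into the table, which is why the vectors only become useful once training has moved related categories close together.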

## Why Do We Need Embeddings?
* In the past, methods like one-hot encoding were used to represent words.
  * This approach has limitations:
    * No semantic meaning - one-hot vectors don't capture relationships between words.
    * High dimensionality - each vector is as long as the vocabulary, so for large vocabularies the vectors become very sparse and memory-intensive.
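Both limitations can be seen in a few lines of plain Python. This is a small sketch with a made-up four-word vocabulary; the helpers `one_hot` and `dot` are written here for illustration and are not part of any library.

```python
def one_hot(token, vocabulary):
    """Return a one-hot list: 1 at the token's index, 0 everywhere else."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(token)] = 1
    return vector

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

vocabulary = ["king", "queen", "apple", "banana"]  # hypothetical toy vocabulary

king = one_hot("king", vocabulary)
queen = one_hot("queen", vocabulary)
apple = one_hot("apple", vocabulary)

# Any two different words have a dot product of 0, so "king" looks
# exactly as unrelated to "queen" as it does to "apple".
print(dot(king, queen), dot(king, apple))

# The vector length equals the vocabulary size, and all but one entry are zero.
print(len(king), sum(king))
```

With a realistic vocabulary of tens of thousands of words, each vector would have that many entries and still contain only a single 1.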

## What Can We Do with Embeddings?
* If you ask a computer to find words similar to "king", it compares the vectors and finds that "queen" is close to "king", for example because both score highly on a dimension that loosely corresponds to royalty.
* For sentences, embeddings help computers capture meaning, so they can find relevant information even when the exact words don't match.
* Embeddings allow computers to group items by meaning, not just by exact words.
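"Close" is usually measured with cosine similarity between vectors. The sketch below uses hand-made 3-dimensional vectors invented purely for illustration (the first axis loosely plays the role of a "royalty" score and the last a "fruit" score); trained embeddings do not come with labelled dimensions like this. The names `toy_embeddings`, `cosine_similarity`, and `most_similar` are assumptions for the example.

```python
import math

# Invented vectors for illustration only; not taken from any trained model.
toy_embeddings = {
    "king":   [0.9, 0.7, 0.1],
    "queen":  [0.9, 0.3, 0.1],
    "apple":  [0.1, 0.5, 0.9],
    "banana": [0.0, 0.5, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = embeddings[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("king", toy_embeddings))
```

With these toy values, "queen" ranks first for "king" and "banana" ranks first for "apple", which is the grouping-by-meaning behaviour described above.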

In [7]:
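def generate_ngrams(text, n):
    """Return all n-grams (as tuples of words) from whitespace-split text."""
    words = text.split()
    # Slice n consecutive words starting at each valid position.
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return ngrams

dummy_text = "This is a sample text for n-grams implementation in NLP using Python."

trigrams = generate_ngrams(dummy_text, 3)
print("Trigrams:", trigrams)

Trigrams: [('This', 'is', 'a'), ('is', 'a', 'sample'), ('a', 'sample', 'text'), ('sample', 'text', 'for'), ('text', 'for', 'n-grams'), ('for', 'n-grams', 'implementation'), ('n-grams', 'implementation', 'in'), ('implementation', 'in', 'NLP'), ('in', 'NLP', 'using'), ('NLP', 'using', 'Python.')]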


In [9]:
!pip install gensim nltk



In [13]:
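import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data; newer NLTK releases may ask for "punkt_tab" instead.
nltk.download("punkt")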

dummy_corpus = ["This is a sample sentence.", "Another example sentence."]

tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in dummy_corpus]

# CBOW model: sg=0 selects CBOW (sg=1 would select skip-gram)
cbow_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, sg=0, min_count=1)


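# Test the CBOW model: look up one word's vector and its nearest neighbours.
# With a corpus this small, the similarity scores are not meaningful.
word_vector = cbow_model.wv['sample']
similar_words = cbow_model.wv.most_similar('sample', topn=3)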

print("Word Vector for 'sample' : ", word_vector)
print("Similar Words to 'sample' : ", similar_words)

Word Vector for 'sample' :  [-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765