# Vector Semantics and Embeddings

**Vector semantics** represents the meaning of a word as a vector, that is, a point in a multi-dimensional space. This is the standard way to represent word meaning in NLP.

**Embedding** refers to a learned vector representation of a word.


## TF-IDF (Term Frequency-Inverse Document Frequency)
In a **term-document matrix**, each row represents a word in the vocabulary and each column represents a document (here, a sentence) from the collection. Each cell holds the number of times that word occurs in that document. TF-IDF reweights these raw counts so that words occurring in many documents count for less.



In [1]:
import nltk
#nltk.download('all')

# paragraph holds a short text about Nepal
paragraph = """Nepal is a beautiful country. Nepal is small country. Nepal is peaceful country."""

print(paragraph)

Nepal is a beautiful country. Nepal is small country. Nepal is peaceful country.




In [2]:
import string
from nltk import sent_tokenize, word_tokenize

#convert to lower case
para = paragraph.lower()
print(para)

nepal is a beautiful country. nepal is small country. nepal is peaceful country.


In [3]:
# Define utility functions to handle numbers, punctuation and whitespace
# Import the regular expression module
import re

# Define function to remove numbers from the text, if any
def remove_numbers(para):
    result = re.sub(r'\d+', '', para)
    return result


# Define function to remove punctuations from the text
def remove_punctuations(para):
    translator = str.maketrans('', '', string.punctuation)
    return para.translate(translator)


# Remove whitespace from text
def remove_whitespace(para):
    return " ".join(para.split())

In [4]:
# Use those utilities to clean numbers and whitespace.
# remove_punctuations is not called here, so the periods stay in the text.
para = remove_numbers(para)
para = remove_whitespace(para)

print(para)

nepal is a beautiful country. nepal is small country. nepal is peaceful country.


In [5]:
# Sentence tokenization for the original paragraph
sent_tokens = sent_tokenize(paragraph)
print(sent_tokens)

['Nepal is a beautiful country.', 'Nepal is small country.', 'Nepal is peaceful country.']


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
trans = tfidf.fit_transform(sent_tokens)
print(trans)

  (0, 1)	0.4128585720620119
  (0, 0)	0.6990303272568005
  (0, 2)	0.4128585720620119
  (0, 3)	0.4128585720620119
  (1, 5)	0.6990303272568005
  (1, 1)	0.4128585720620119
  (1, 2)	0.4128585720620119
  (1, 3)	0.4128585720620119
  (2, 4)	0.6990303272568005
  (2, 1)	0.4128585720620119
  (2, 2)	0.4128585720620119
  (2, 3)	0.4128585720620119


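In each (x, y) pair shown above, x is the index of the sentence (document) and y is the index of the word (feature). The number next to the pair is the TF-IDF weight of that word in that sentence. The word "a" gets no index because TfidfVectorizer's default token pattern only keeps tokens of two or more characters.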
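To see where values such as 0.699 and 0.413 come from, the sketch below recomputes the weights of the first sentence by hand. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, raw term counts, L2 normalisation); the helper name `smoothed_idf` is made up for this example.

```python
import math

n_docs = 3  # the three sentences act as three documents


def smoothed_idf(doc_freq, n_docs):
    # Default scikit-learn formula (smooth_idf=True):
    # idf = ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# First sentence once "a" is dropped by the default tokenizer.
# "beautiful" occurs in 1 sentence; the other words occur in all 3.
doc_freq = {"beautiful": 1, "country": 3, "is": 3, "nepal": 3}

# Every word occurs once in the sentence, so term frequency is 1.
raw = {word: 1 * smoothed_idf(df, n_docs) for word, df in doc_freq.items()}

# L2-normalise the row so its vector has unit length.
norm = math.sqrt(sum(value * value for value in raw.values()))
weights = {word: value / norm for word, value in raw.items()}

for word, weight in weights.items():
    print(f"{word:<10} {weight:.4f}")
```

The rare word "beautiful" ends up with the larger weight, while the words shared by all three sentences get the same, smaller weight.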

In [7]:
# Get the words used as features
tfidf.get_feature_names_out()

array(['beautiful', 'country', 'is', 'nepal', 'peaceful', 'small'],
      dtype=object)