## Comparison of different Word Embeddings on Text Similarity — A use case in NLP

Natural Language Processing (NLP) is a key component of Artificial Intelligence (AI) that enables machines to understand human language. A large share of the information generated today, such as reviews, comments, posts, and articles, is unstructured natural-language text. NLP lets machines understand and extract patterns from such text through techniques such as text similarity, information retrieval, document classification, entity extraction, and clustering.

Text similarity is one of the essential NLP techniques. It measures how close two chunks of text are, either in meaning (semantic similarity) or in surface form (lexical similarity). Computers need data in numeric form to perform any machine learning task, so text is first encoded with a word embedding technique such as Bag of Words, TF-IDF, or word2vec. Once the text is encoded, you can perform NLP operations such as finding similar sentences to extract semantically similar questions from an FAQ corpus, searching a database for similar documents, or recommending semantically similar news articles.
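To make the idea of encoding text as numbers concrete, here is a minimal sketch of a Bag of Words encoding followed by cosine similarity, written with only the Python standard library. The function names `bag_of_words` and `cosine_similarity`, the whitespace tokenization, and the example sentences are assumptions made for illustration; they are not the method used in the rest of this article.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count how often each lower-cased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    # Dot product over the tokens the two vectors share.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical FAQ-style questions, used only for this illustration.
q1 = bag_of_words("how do i reset my password")
q2 = bag_of_words("how can i reset my account password")
q3 = bag_of_words("what is the delivery time")

print(round(cosine_similarity(q1, q2), 3))  # 0.772, five shared tokens
print(round(cosine_similarity(q1, q3), 3))  # 0.0, no shared tokens
```

Bag of Words only rewards shared surface tokens, so two sentences with the same meaning but different wording would still score low. That gap is what embeddings such as word2vec aim to close.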

In [4]:
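import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from unidecode import unidecode

# One-time setup: word_tokenize and stopwords need NLTK data packages.
# nltk.download('punkt')      # newer NLTK releases may ask for 'punkt_tab' instead
# nltk.download('stopwords')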


def pre_process(corpus):
    # Convert the input text to lower case.
    corpus = corpus.lower()
    # Build one set holding NLTK's English stop words and the
    # punctuation characters from the string module.
    stopset = set(stopwords.words('english') + list(string.punctuation))
    # Split the text into word tokens and drop stop words and punctuation.
    corpus = " ".join([i for i in word_tokenize(corpus) if i not in stopset])
    # Transliterate non-ASCII characters to their closest ASCII form.
    corpus = unidecode(corpus)
    return corpus


pre_process(
    "Sample of non ASCII: Ceñía. How to remove stopwords and punctuations?")

'sample non ascii cenia remove stopwords punctuations'

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
sentence = "The striped bats are hanging on their feet for best"
words = word_tokenize(sentence)

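# lemmatize() treats every word as a noun unless a pos argument is given,
# so verbs such as "hanging" and "are" come back unchanged.
for w in words:
    print(w, " : ", lemmatizer.lemmatize(w))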

The  :  The
striped  :  striped
bats  :  bat
are  :  are
hanging  :  hanging
on  :  on
their  :  their
feet  :  foot
for  :  for
best  :  best
