# 1. Semantic similarity
- Semantic similarity is the task of finding sentences with a similar meaning, that is, sentences that appear in a similar context.
- For this task I will use sentence embeddings and describe how they can be used in real applications.

### Similarity Search in Vector Space
- For this example I will use Sentence-BERT (sentence embeddings using Siamese BERT networks).
- To demonstrate the use of vector fields, I will import the pre-trained bert-base-nli-mean-tokens model.
- Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP (Natural Language Processing) pre-training developed by Google.
- This model will generate one embedding vector for every sentence. Similar sentences tend to appear in a similar context.
- When comparing embedding vectors, it is common to use cosine similarity.
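
A minimal sketch of the comparison mentioned above: cosine similarity is the dot product of two vectors divided by the product of their norms. The `cosine_similarity` helper and the toy 3-dimensional vectors are assumptions made for illustration; they are not output of the BERT model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors (hypothetical, not real sentence embeddings).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # close to 0: unrelated directions
```

Note that scipy's `cdist` with the `"cosine"` metric returns the cosine distance, which is 1 minus the similarity. That is why the search loop further down prints `1 - distances[...]` as the score.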

In [76]:
import numpy as np

from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cdist

Load the pretrained BERT model with the sentence_transformers library:

In [48]:
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

Corpus with example sentences:

In [102]:
sentences = [
    "Compressing / Decompressing Folders & Files",
    "How do you tell whether a string is an IP or a hostname",
    "Convert Bytes to Floating Point Numbers in Python"
]

We first generate one embedding for every sentence in the corpus:

In [103]:
%%time
corpus_embeddings = embedder.encode(sentences)

CPU times: user 12.7 ms, sys: 22.4 ms, total: 35.1 ms
Wall time: 36.4 ms


Each sentence is represented by an embedding vector with 768 numbers:

In [104]:
corpus_embeddings[0].shape

(768,)

Then, we generate the embeddings for different query sentences:

In [105]:
%%time
queries = [
    "zipping up files",
    "determine if something is an IP"
]
query_embeddings = embedder.encode(queries)

CPU times: user 24.3 ms, sys: 0 ns, total: 24.3 ms
Wall time: 23 ms


We then use scipy to find the most-similar embeddings for queries in the corpus:

In [106]:
for query, query_embedding in zip(queries, query_embeddings):
    distances = cdist([query_embedding], corpus_embeddings, "cosine")[0]
    distances_sorted = np.argsort(distances)
    
    print('Query: %s' % query)
    print('Top 3 most similar sentences in corpus:')
    for i in range(3):
        print('\t%s (Score %.4f)' % (sentences[distances_sorted[i]], 1-distances[distances_sorted[i]]))
    print('\n')

Query: zipping up files
Top 3 most similar sentences in corpus:
	Compressing / Decompressing Folders & Files (Score 0.7851)
	Convert Bytes to Floating Point Numbers in Python (Score 0.3926)
	How do you tell whether a string is an IP or a hostname (Score 0.3224)


Query: determine if something is an IP
Top 3 most similar sentences in corpus:
	How do you tell whether a string is an IP or a hostname (Score 0.8345)
	Convert Bytes to Floating Point Numbers in Python (Score 0.4138)
	Compressing / Decompressing Folders & Files (Score 0.3728)




### Similarity Search in Vector Space with Elasticsearch
- With Elasticsearch, we can determine textual similarity by using vector embeddings.
- For sentence embeddings, the cosine similarity between two sentence vectors can reveal the semantic similarity of the corresponding sentences.
- Starting from Elasticsearch 7.2 cosine similarity is available as a predefined function which is usable for document scoring.

### We could use text embeddings to allow for retrieving similar sentences:
1. During indexing, each sentence is run through a sentence embedding model to produce a numeric vector.
2. When a user enters a query, it is run through the same sentence embedding model to produce a vector.
3. To rank the responses, we calculate the vector similarity between each sentence and the query vector.
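
The three steps above could be sketched as plain request bodies, shown here as Python dictionaries so that no running cluster is needed. The field names `text` and `embedding`, the `build_similarity_query` helper and the all-zeros query vector are assumptions for illustration. The script syntax follows the form documented for recent 7.x releases and may differ in other Elasticsearch versions.

```python
import json

EMBEDDING_DIMS = 768  # size of the sentence embeddings used in this notebook

# Hypothetical index mapping: the field names "text" and "embedding" are assumptions.
index_body = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": EMBEDDING_DIMS},
        }
    }
}


def build_similarity_query(query_vector, size=3):
    """Build a script_score query that ranks documents by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative.
                    # The exact script syntax depends on the Elasticsearch version.
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    # Plain floats so the body can be serialized to JSON.
                    "params": {"query_vector": [float(x) for x in query_vector]},
                },
            }
        },
    }


# Placeholder vector; in practice this would come from embedder.encode(...).
query_body = build_similarity_query([0.0] * EMBEDDING_DIMS)
print(json.dumps(index_body, indent=2))
```

In practice the `match_all` part would be replaced by a filter query, so that the script only scores a subset of the documents.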

### Limitations
- Sentence scoring with cosine similarity is relatively expensive and should be used together with filters to limit the number of sentences for which scores need to be calculated.
- For larger-scale similarity search in dense vectors, the FAISS library for "billion-scale similarity search with GPUs" might be a good choice.
- This is an example of how embedding models could be used with vector fields, and not as a production-ready solution.
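
A minimal sketch of the "filter first, then score" idea from the first limitation, using plain numpy instead of Elasticsearch. The corpus, its `tag` values, the 2-dimensional vectors and the `search` helper are all hypothetical.

```python
import numpy as np

# Hypothetical corpus: each sentence has a tag and a toy 2-d embedding.
corpus = [
    {"text": "Compressing files", "tag": "files", "vec": [1.0, 0.0]},
    {"text": "Unzipping archives", "tag": "files", "vec": [0.8, 0.6]},
    {"text": "Validating IP addresses", "tag": "network", "vec": [0.0, 1.0]},
]


def search(query_vec, tag, top_k=2):
    """Score only the sentences whose tag matches, then rank by cosine similarity."""
    candidates = [doc for doc in corpus if doc["tag"] == tag]  # cheap filter first
    if not candidates:
        return []
    matrix = np.array([doc["vec"] for doc in candidates], dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every remaining candidate at once.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)[:top_k]
    return [(candidates[i]["text"], float(scores[i])) for i in order]


results = search([1.0, 0.0], tag="files")
for text, score in results:
    print("%s (Score %.4f)" % (text, score))
```

The expensive similarity computation only runs on the sentences that survive the filter, which is the point of combining both steps.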

### Conclusions
- Embedding techniques provide a powerful way to capture the linguistic content of a piece of text.
- By indexing embeddings and scoring based on vector distance, we can compare sentences using a notion of similarity that goes beyond their word-level overlap.
- This method can be used together with filters to limit the number of sentences for which scores need to be calculated.

# 2. Topic Analysis
- Topic analysis is a machine learning technique that automatically assigns topics to text data. Topic analysis tools analyze unstructured text, including emails and social media interactions.
- There are two different approaches to topic analysis:
 - Topic modeling: used to discover the main topics within a bunch of texts
 - Topic classification: used to automatically categorize texts by topics

### Topic Modeling
- Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of sentences, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of sentences.
- It doesn’t require a predefined list of tags or training data that’s been previously classified by humans.
- Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data.
- By detecting patterns such as word frequency and distance between words, a topic model clusters feedback that is similar, along with the words and expressions that appear most often. With this information, you can quickly deduce what each set of texts is talking about.
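
As a small illustration of the word counting that topic modeling starts from, the sketch below builds a bag of words per document with the standard library. The tokenized documents are made up for this example.

```python
from collections import Counter

# Hypothetical, already-tokenized sentences.
docs = [
    ["bank", "money", "tax", "bank"],
    ["game", "season", "goal", "game"],
    ["bank", "fund", "money"],
]

# Word counts per document: the bag-of-words view that topic models start from.
bags = [Counter(tokens) for tokens in docs]

# Corpus-wide counts show which words appear most often overall.
total = Counter()
for bag in bags:
    total.update(bag)

print(bags[0])
print(total.most_common(1))
```

Word order is discarded at this point; only the counts are kept.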

### Latent Dirichlet Allocation (LDA)
- Latent Dirichlet Allocation (LDA) is based on two underlying assumptions: the distributional hypothesis (similar topics make use of similar words) and the statistical mixture hypothesis (sentences talk about several topics, for which a statistical distribution can be determined).
- The purpose of LDA is mapping each sentence to a set of topics which covers a good deal of the words in the sentence.
- LDA ignores syntactic information and treats sentences as bags of words. It also assumes that all words in the document can be assigned a probability of belonging to a topic.
- The goal of LDA is to determine the mixture of topics that a sentence contains.
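
The assumptions above can be read as a generative story: each sentence has a mixture of topics, and each word is drawn from one of those topics. The sketch below samples a sentence that way with numpy. The vocabulary, the two hand-written topics and the `generate_sentence` helper are hypothetical, and this is not how gensim implements LDA; training a model means inferring these distributions from data, which is the reverse direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and two hand-written topics (word distributions).
vocab = ["game", "team", "player", "bank", "money", "tax"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "finance" topic
])


def generate_sentence(n_words, alpha=0.5):
    """Sample a sentence following the LDA generative story."""
    # 1. Draw the sentence's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet([alpha] * topic_word.shape[0])
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        z = rng.choice(topic_word.shape[0], p=theta)
        # 3. Pick a word from that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words


theta, words = generate_sentence(8)
print("topic mixture:", np.round(theta, 3))
print("words:", words)
```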

LDA assumes that topics and documents look like this:

<img src="images/1.png">

And, when LDA models a new document, it works this way:

<img src="images/2.png">

In [53]:
import gensim
import nltk
import re
import stop_words

tokenizer = nltk.tokenize.RegexpTokenizer(r'\b[^\d\W]+\b')
en_stop = stop_words.get_stop_words('en')
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    tokens = [word for word in tokens if word not in en_stop]  # remove stop words from tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # lemmatize tokens
    tokens = [word for word in tokens if len(word) > 1]  # remove single-character tokens
    return tokens

Load an LDA model pretrained on the NYTimes News Articles dataset:

In [19]:
MODEL_PATH = 'nytimes_lda.model'
dictionary = gensim.corpora.dictionary.Dictionary.load('dictionary_nytimes')
ldamodel = gensim.models.LdaModel.load(MODEL_PATH)

This model was trained with 25 topics. Let's see some of them:

In [63]:
topics = ldamodel.print_topics()
for i in range(10):
    print('topic: %d - words: %s' % (topics[i][0], re.findall(r"[a-zA-Z]+", topics[i][1])))

topic: 6 - words: ['zika', 'puerto', 'virus', 'rico', 'woman', 'health', 'mosquito', 'tobacco', 'clare', 'pregnancy']
topic: 7 - words: ['art', 'work', 'museum', 'artist', 'new', 'mr', 'gallery', 'will', 'fashion', 'said']
topic: 0 - words: ['dr', 'said', 'drug', 'study', 'patient', 'health', 'year', 'can', 'medical', 'cancer']
topic: 1 - words: ['bank', 'money', 'tax', 'financial', 'pay', 'million', 'cost', 'year', 'fund', 'bond']
topic: 23 - words: ['china', 'chinese', 'russian', 'russia', 'north', 'nuclear', 'korea', 'state', 'beijing', 'united']
topic: 13 - words: ['game', 'goal', 'first', 'second', 'yankee', 'two', 'season', 'series', 'said', 'scored']
topic: 2 - words: ['company', 'percent', 'said', 'year', 'new', 'google', 'like', 'technology', 'facebook', 'service']
topic: 21 - words: ['company', 'million', 'billion', 'year', 'business', 'deal', 'sale', 'investor', 'executive', 'said']
topic: 9 - words: ['said', 'state', 'law', 'court', 'case', 'will', 'justice', 'federal', 'ri

Let's clean a sentence and transform it into a bag of words:

In [39]:
sentence = 'Credit the Cavaliers for playing stout defense and the officials for allowing physical play.'
tokenized_data = clean_text(sentence)
bow = dictionary.doc2bow(tokenized_data)

Get topic for a sentence:

In [62]:
topics = ldamodel.get_document_topics(bow)
topics = sorted(topics, key=lambda x: x[1], reverse=True)
topic_id, topic_score = topics[0]

print('Sentence: %s' % sentence)
print('Topic id: %d - Similarity Score: %.3f' % (topic_id, topic_score))
print('Topic words: %s' % re.findall(r"[a-zA-Z]+", ldamodel.print_topic(topic_id)))

Sentence: Credit the Cavaliers for playing stout defense and the officials for allowing physical play.
Topic id: 10 - Similarity Score: 0.732
Topic words: ['game', 'point', 'team', 'curry', 'warrior', 'said', 'player', 'season', 'first', 'coach']


### Some notes about LDA Model from gensim:
 * Is streamed: training documents may come in sequentially, no random access required.
 * Runs in constant memory w.r.t. the number of documents: size of the training corpus does not affect memory footprint, can process corpora larger than RAM.
 * Is distributed: makes use of a cluster of machines, if available, to speed up model estimation.
 * This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. The model can also be updated with new documents for online training.

### Topic Classification
- Topic classification is a supervised machine learning technique, one that needs training before being able to automatically analyze texts.

Once the text is transformed into vectors and the training data is tagged with the expected tags, this information is fed to an algorithm to create the classification model:

<img src="images/5.png">

This classification model is now able to classify new texts because it has learned how to make predictions automatically:

<img src="images/6.png">
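
A minimal sketch of the train-then-predict flow described above, using a hand-rolled multinomial Naive Bayes classifier from the standard library. Naive Bayes is just one possible algorithm, chosen here for brevity; the training sentences, tags and the `train` / `predict` helpers are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data (tokenized text, topic tag).
training_data = [
    (["team", "won", "game"], "sports"),
    (["player", "scored", "goal"], "sports"),
    (["bank", "raised", "tax"], "finance"),
    (["fund", "lost", "money"], "finance"),
]


def train(samples):
    """Count words per tag and how often each tag occurs."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    for tokens, tag in samples:
        word_counts[tag].update(tokens)
        tag_counts[tag] += 1
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, tag_counts, vocab


def predict(tokens, model):
    """Return the tag with the highest log-probability under Naive Bayes."""
    word_counts, tag_counts, vocab = model
    total_docs = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in tag_counts:
        score = math.log(tag_counts[tag] / total_docs)  # log prior
        total_words = sum(word_counts[tag].values())
        for word in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[tag][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


model = train(training_data)
print(predict(["player", "won", "goal"], model))
print(predict(["bank", "money"], model))
```

Unlike topic modeling, the tags here are fixed in advance by the labeled examples, so the classifier can only ever answer with one of them.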

### Use Cases & Applications
- Topic modeling and topic classification can automatically tag user messages according to a specific topic.
- This can be used in a query to retrieve similar messages from a database.

# 3. Other methods to extract relevant information from texts

### Named Entity Recognition:
* Named entity extractors locate and classify named entities, like names, organizations, locations, and monetary values, in unstructured texts. AI programs recognize these titles and values through their unique word sequences, and then classify them.
* Named entity extraction can be used to reveal important data and provide content recommendations.

<img src="images/4.png">

### Keyword Extraction
* Keyword extraction extracts relevant terms and phrases from within a text. These are terms that help to summarize the text, are significant to the writer’s viewpoint, or significant to the overall concept of the text.

<img src="images/3.png">
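
One common way to pick out such terms is TF-IDF, which favors words that are frequent in one text but rare across the others. The sketch below is an illustration under that assumption and is not necessarily the method behind the figure above; the tokenized documents and the `top_keywords` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
docs = [
    ["python", "convert", "bytes", "float", "python"],
    ["python", "zip", "folder", "zip", "files"],
    ["python", "check", "ip", "hostname", "string"],
]


def top_keywords(doc_index, k=2):
    """Rank the words of one document by TF-IDF and return the top k."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    scores = {
        word: (count / len(docs[doc_index])) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


print(top_keywords(1))
```

The word "python" occurs in every document, so its inverse document frequency is zero and it is never selected as a keyword, even where it is the most frequent word.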

### Intent Classification
* Intent classification automatically finds purpose and goals in text.
* With an intent classifier, you could easily locate a particular kind of query among the numerous user interactions you receive on a daily basis, and automatically categorize it.

# Final thoughts
* The methods enumerated above can be combined in order to make better suggestions to the user.
* Topic Analysis, Named Entity Recognition, Keyword Extraction and Intent Classification can be used to extract relevant information from text.
* Semantic Similarity should be used to find sentences that appear in a similar context with a given query and should be used together with filters (Topic Analysis, NER, Keywords and Intent Labels) to limit the number of sentences for which scores need to be calculated.

# References
1. Papers:
 * Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf
 * BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/pdf/1810.04805.pdf
 * Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks: https://arxiv.org/pdf/1908.10084.pdf
 * Latent Dirichlet Allocation: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
2. Libraries:
 * https://github.com/UKPLab/sentence-transformers
 * https://github.com/facebookresearch/faiss
 * https://github.com/deepmipt/DeepPavlov/tree/master/examples
3. Interesting topics:
 * Google’s Universal Sentence Encoder: https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html

Contact me:
- www.linkedin.com/in/catalinmelter/
- catalin.melter@gmail.com