# 1.- Semantic Text Similarity

## 1.1.- Theoretical background

<img src='c1.PNG'>

### 1.1.1- WordNet semantic dictionary

<img src='c2.PNG'>

### 1.1.1.1- WordNet Path Similarity

<img src='c3.PNG'>

### 1.1.1.2- Lowest/Least common subsumer (LCS) and Lin Similarity

The LCS of a pair of words is their closest shared ancestor (hypernym) in the WordNet hierarchy. For example, an elk is a kind of deer (elk is a hyponym of deer, just as boxer is a hyponym of dog), so *LCS(deer,elk)=deer*. Deer and giraffe are both ruminants, so *LCS(deer,giraffe)=ruminant*. Horses are not ruminants, so we have to move further up the hierarchy to find the LCS: *LCS(deer,horse)=ungulate*.

<img src='c4.PNG'>

<img src='c5.PNG'>
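The LCS and path-similarity ideas above can be sketched on a tiny hand-made hierarchy. This is only an illustration: the `PARENT` table is a made-up stand-in for WordNet that contains just the animals mentioned in these notes, and the helper names (`ancestors`, `lcs`, `path_similarity`) are hypothetical, not NLTK functions.

```python
# Toy is-a hierarchy (child -> parent); a hand-made stand-in for WordNet.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
}

def ancestors(word):
    """Return [word, parent, grandparent, ...] up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcs(a, b):
    """Lowest common subsumer: first node on a's chain that also subsumes b."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    return None

def path_similarity(a, b):
    """1 / (1 + number of edges on the path from a to b through their LCS)."""
    common = lcs(a, b)
    if common is None:
        return None
    dist = ancestors(a).index(common) + ancestors(b).index(common)
    return 1 / (1 + dist)

print(lcs("deer", "elk"), lcs("deer", "giraffe"), lcs("deer", "horse"))
print(path_similarity("deer", "elk"), path_similarity("deer", "horse"))
```

In this toy tree, deer and elk are one edge apart (similarity 1/2), while deer and horse are six edges apart (similarity 1/7).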

### 1.1.2- Collocations and Distributional similarity

This approach is based on the quote "You know a word by the company it keeps". Two words that frequently appear in similar contexts are more likely to be semantically related. Likewise, if one word can replace another in the same context and the meaning remains the same, the two are more likely to be semantically related.

<img src='c6.PNG'>

<img src='c7.PNG'>
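One simple way to make "the company a word keeps" concrete is to count the neighbours of each word and compare the resulting count vectors. The sketch below assumes a context of one word on each side and uses cosine similarity; the window size, the toy sentence, and the function names are illustrative choices, not something prescribed by these notes.

```python
import math
from collections import Counter, defaultdict

def context_vectors(tokens, window=1):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus: "coffee" and "tea" share the same neighbours, "book" does not.
tokens = "i drink hot coffee . i drink hot tea . i read a book .".split()
vecs = context_vectors(tokens)
print(cosine(vecs["coffee"], vecs["tea"]))
print(cosine(vecs["coffee"], vecs["book"]))
```

Here "coffee" and "tea" get a higher score than "coffee" and "book" because they occur between the same neighbouring words.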

Once you have defined this context, you can compute the strength of association between words based on how frequently they co-occur, that is, how frequently they collocate. That's why these are called collocations. For example, if two words keep appearing next to each other, you would say they are very highly related. On the other hand, if they don't occur together, they are not necessarily very similar.

<img src='c8.PNG'>
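A common way to score this strength of association is pointwise mutual information, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))). The following is a minimal sketch for adjacent word pairs, assuming plain relative frequencies as probability estimates; library implementations may count slightly differently, and `pmi_scores` and `min_count` are hypothetical names.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=1):
    """PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2))) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens, n_bigrams = len(tokens), len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs that are too rare to trust
        p_pair = count / n_bigrams
        p_w1, p_w2 = unigrams[w1] / n_tokens, unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "new york is big and new york is busy and the day is new".split()
scores = pmi_scores(tokens)
print(scores[("new", "york")], scores[("is", "new")])
print(sorted(pmi_scores(tokens, min_count=2)))
```

Pairs seen only once can get very high PMI simply because their words are rare, which is why a minimum count is often applied before ranking.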

## 1.2.- Applying it to Python

#### Importing WordNet into Python through NLTK

In [3]:
import nltk
from nltk.corpus import wordnet as wn

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

#### Find the appropriate sense of the words

In [11]:
deer = wn.synset('deer.n.01')   # first synset of the word "deer" as a noun
elk = wn.synset('elk.n.01')     # first synset of the word "elk" as a noun
horse = wn.synset('horse.n.01') # first synset of the word "horse" as a noun

#### 1_Find path similarity - Based on the WordNet hierarchy

In [10]:
print(deer.path_similarity(elk))   # would be 0.5 as it is 1/2
print(deer.path_similarity(horse)) # would be 0.14 as it is 1/7

# Look at c3 in order to recall the reasoning of this calculation

#### 2_Find Lin similarity - Based on LCS framework corpus

In [13]:
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.


True

In [14]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat') # information content computed from the Brown corpus

print(deer.lin_similarity(elk, brown_ic))   # would be 0.77
print(deer.lin_similarity(horse, brown_ic)) # would be 0.86

Notice that Lin similarity does not use the distance between two concepts explicitly. Deer and horse, which are far apart in the WordNet hierarchy, still get a higher similarity (0.86) than deer and elk (0.77). That is because the measure relies on the information content of the words, estimated from the typical contexts in which they appear in a corpus: on that basis deer and horse come out as close, since both are basically mammals. Elk, however, is a very specific instance of deer, and under Lin similarity it does not come out as close.
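To see why the ordering can differ from path similarity, it helps to write the Lin formula out: Lin(u, v) = 2 · IC(LCS(u, v)) / (IC(u) + IC(v)), where IC(c) = -log P(c). The sketch below uses made-up concept probabilities chosen purely for illustration (they are not the Brown corpus values, and the function names are hypothetical).

```python
import math

def information_content(prob):
    """IC(c) = -log P(c): rarer concepts carry more information."""
    return -math.log(prob)

def lin_similarity(p_u, p_v, p_lcs):
    """Lin(u, v) = 2 * IC(LCS(u, v)) / (IC(u) + IC(v))."""
    return 2 * information_content(p_lcs) / (information_content(p_u) + information_content(p_v))

# Made-up concept probabilities, for illustration only (not Brown corpus values).
P = {"ungulate": 0.0135, "deer": 0.0067, "horse": 0.0067, "elk": 0.0005}

deer_elk = lin_similarity(P["deer"], P["elk"], P["deer"])          # LCS(deer, elk) = deer
deer_horse = lin_similarity(P["deer"], P["horse"], P["ungulate"])  # LCS(deer, horse) = ungulate
print(deer_elk, deer_horse)
```

With these invented numbers, the very rare concept "elk" has a high information content, which enlarges the denominator and lowers its score against "deer", even though the two are adjacent in the hierarchy.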

#### 3_NLTK Collocations and Association measures

In [None]:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures() # association measures for bigrams, such as PMI

# `text` is assumed to be a list of tokens from our corpus
finder = BigramCollocationFinder.from_words(text) # counts all the bigrams found in the input

finder.nbest(bigram_measures.pmi, 10) # top 10 bigrams (pairs) ranked by PMI

You first create the bigram association measures from NLTK collocations, then build a finder from a corpus (given here as `text`), and finally get the top 10 pairs using the PMI measure from bigram_measures.

**The finder also offers other useful methods, such as a frequency filter.** Suppose you only want to keep bigrams that occur 10 or more times; then you can call finder.apply_freq_filter(10). That removes any pair that does not occur at least 10 times in your corpus.

In [None]:
finder.apply_freq_filter(10) # filters the finder in place: only pairs that appear at least 10 times in the corpus are kept

<img src='c9.PNG'>

------------------------------------------

# 2.- Topic Modeling

In the picture below, some words have been highlighted. Words such as genes and genomes are highlighted in yellow; words such as computer, predictions, computer analysis, and computation are in blue; and organism, survive, and life are in pink. This demonstrates that an article is typically made up of different topics or sub-units that intermingle seamlessly.

<img src='c10.PNG'>

**Summing up, this shows that documents are typically a mixture of topics.**

<img src='c11.PNG'>

**So what is topic modeling? Topic modeling is a coarse-level analysis of what is in a text collection. When you have a large corpus and want to make sense of what the collection is about, you would probably use topic modeling.**

<img src='c12.PNG'>

A topic is the subject or theme of a discourse, and topics are represented by a word distribution. That means each word has some probability of appearing in a given topic, and the same word has different probabilities depending on the topic. For example, if you see words such as basketball, player, or score, you are more likely to be in the topic of sports. Conversely, if you are in the topic of sports, words such as player, team, and score are more likely to appear. The word "team" could also appear in social science, for instance. However, the probability of seeing this word in that topic would be lower than in the sports topic.
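As a small illustration of "a topic is a word distribution", the sketch below stores two hypothetical topics as dictionaries of word probabilities and computes the probability of a word in a document that mixes both. All numbers and names (`TOPICS`, `word_probability`, `doc_mix`) are invented for the example.

```python
# Hypothetical word distributions for two topics (each sums to 1).
TOPICS = {
    "sports": {"player": 0.4, "team": 0.3, "score": 0.2, "policy": 0.1},
    "social science": {"player": 0.05, "team": 0.1, "score": 0.05, "policy": 0.8},
}

def word_probability(word, topic_mix):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(weight * TOPICS[topic].get(word, 0.0) for topic, weight in topic_mix.items())

# A document that is 70% about sports and 30% about social science.
doc_mix = {"sports": 0.7, "social science": 0.3}
print(TOPICS["sports"]["team"], TOPICS["social science"]["team"])
print(word_probability("team", doc_mix))
```

The word "team" appears in both topics, but with a higher probability under sports; its probability in the mixed document is the weighted sum 0.7 × 0.3 + 0.3 × 0.1.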

So when you're doing topic modeling, what is known? **You are given a text collection or corpus, and you are somehow given the number of topics**. **What's not known are the actual topics and the distribution of topics** (the percentage of the text that each topic accounts for is unknown).

<img src='c13.PNG'>

**LDA is one of the most popular topic models. It is used to figure out the distribution of topics in a particular document and the probability of each word in a topic.**

# 3.- Generative Models and LDA

### 3.1.- Theoretical background

<img src='c14.PNG'>

**In practice, the key question when modeling is: how many topics are there?** Finding or even guessing this number is a hard task.

<img src='c15.PNG'>
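As a rough illustration of the generative view behind topic models, a document can be produced word by word: draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. The sketch below is a simplification that assumes the mixture and the topics are already known (LDA additionally places Dirichlet priors on them and has to infer them); the topics, probabilities, and function name are hypothetical.

```python
import random

# Hypothetical topics, each a probability distribution over a few words.
TOPICS = {
    "genetics": {"gene": 0.5, "genome": 0.3, "organism": 0.2},
    "computing": {"computer": 0.5, "prediction": 0.3, "analysis": 0.2},
}

def generate_document(topic_mix, n_words, rng):
    """For each word slot: draw a topic from the mixture, then a word from that topic."""
    topic_names = list(topic_mix)
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=[topic_mix[t] for t in topic_names])[0]
        vocab = list(TOPICS[topic])
        words.append(rng.choices(vocab, weights=[TOPICS[topic][w] for w in vocab])[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"genetics": 0.6, "computing": 0.4}, n_words=10, rng=rng)
print(doc)
```

Topic modeling runs this story in reverse: given only the generated words, it tries to recover the topics and the mixture.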

### 3.2.- Working with LDA on Python

<img src='c16.PNG'>

**Summing up, building the document-term matrix is the important first step when working with LDA. Once you have built it, you build the LDA model on top of it**:

In [1]:
import gensim
from gensim import corpora, models

# doc_set is assumed to be a list of pre-processed, tokenized documents (a list of lists of words)
dictionary = corpora.Dictionary(doc_set) # creating the dictionary mapping between IDs and words
corpus = [dictionary.doc2bow(doc) for doc in doc_set] # creating the corpus (the document-term matrix)
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50) # setting up the model, specifying the number
# of topics we want to learn
print(lda.print_topics(num_topics=4, num_words=5)) # prints the top 5 words of each of the 4 topics
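To make the dictionary and document-term matrix steps less opaque, here is a standard-library sketch of the same idea: map every token to an integer ID, then represent each document as sparse (ID, count) pairs. It is similar in spirit to the dictionary and bag-of-words steps above but is not gensim's implementation; `build_vocabulary`, `to_bow`, and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer ID to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Sparse bag-of-words: sorted (token_id, count) pairs for one document."""
    return sorted(Counter(vocab[t] for t in doc if t in vocab).items())

# Hypothetical, already tokenized documents standing in for a real document set.
toy_docs = [["gene", "genome", "gene"], ["computer", "analysis", "gene"]]
vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_corpus)
```

Each row of the document-term matrix corresponds to one document, and only the non-zero counts are stored.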

LDA models can also be used to find the topic distribution of documents.

<img src='c17.PNG'>

# 4.- Information Extraction

Once we know how to get sets of words from analyzed text, how can we convert this unstructured text into structured form? You don't necessarily have to convert all of it; the real question is how to extract the relevant information from unstructured text. If you want to make that information searchable or usable later on, you would probably store it in a structured form, as has traditionally been done. In that sense, it is a conversion from unstructured to structured text.

**Here comes Information Extraction's task, as its goal is to identify and extract fields of interest from free text.**

<img src='c18.PNG'>

**What counts as a field of interest in a text?**

The four W's (what, who, where, and when) can help us detect named entities.

<img src='c19.PNG'>

<img src='c20.PNG'>

<img src='c21.PNG'>

Named Entity Recognition is a key building block in addressing these advanced NLP tasks. Named Entity Recognition systems use the supervised machine learning and text mining approaches that we have discussed in the course. For example, if the entity you need to recognize is a date, you typically use the regular expressions that we talked about in week one. If you are extracting names, you not only use a machine learning model to identify what is an entity and what label it should get, but the features you use also come from what we talked about in week two. For example, we want to know whether a word is capitalized, but also: what is its part of speech? Is it a noun or a verb? What semantic role does the word play in the given context of the sentence? These could all be features that you put into a named entity recognition model. NLTK has a built-in NER model, already trained, for the standard person, organization, and location task.
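The two ingredients mentioned above can be sketched with the standard library: a regular expression for dates, and a small feature dictionary for a token that a supervised model could consume. The date formats, feature names, and example sentences are assumptions made for illustration; part-of-speech and semantic-role features would come from separate taggers and are left out.

```python
import re

# Assumed date formats for illustration: 23-04-2017 or 23/04/2017.
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b")

def extract_dates(text):
    """Return every substring that looks like a numeric date."""
    return DATE_RE.findall(text)

def token_features(tokens, i):
    """Simple surface features for the token at position i."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_first": i == 0,
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }

text = "Ann Arbor hosted the meeting on 23-04-2017 and again on 5/6/2018."
print(extract_dates(text))

tokens = "She moved to Ann Arbor last year".split()
print(token_features(tokens, 3))
```

A capitalized token that is not at the start of the sentence, such as "Ann" here, is a typical hint that the word may be part of a name.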