## Lemmatization:

Wikipedia defines lemmatization as 'the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form'.

In NLP, this process aims to find the correct lemma for each word in a text. For instance, for the word "running" we want the lemma "run", without confusing it with "running" used as an adjective (as in "running shoes").

There are many ways to get the root (or stem) of a word, and the right one depends on our objectives or tasks.

In [1]:
from scripts import normalization
from scripts.tokenizer import SpacyTokenizer

In [2]:
sent = "I love to listen to some music when I'm running"

In [3]:
spacy_tokenizer = SpacyTokenizer('en_core_web_lg')
# To run this, the spaCy model must be downloaded first. You can download it via delivery/install_modules; check the README.

In [None]:
# Tokenize sentences using spacy and whitespace tokenization
tokenized_whitespace = spacy_tokenizer.wsp_tokenizer(sent)
tokenized_whitespace

As a result we get a list of tokens, just as they appear in the original sentence. To get the lemma for each of them we need something else. As a first approach, we can try to get the stem of each token.

In [None]:
from nltk.stem import SnowballStemmer
stm = SnowballStemmer('english')

In [None]:
# Stemming with SnowballStemmer, just as an example
stemmed_sent = [stm.stem(word) for word in tokenized_whitespace]
stemmed_sent

The stems we got appear to be correct, but this process does not take into account the morphological category of the analyzed token. One way to improve the results is to consider the POS tag of the token when looking up its lemma. Take the word "running": it could be a NOUN meaning the "execution" of some process or, on the other hand, a form of the VERB 'to run'. Moreover, in our example the "'m" token is not mapped to the lemma of the verb 'to be'. So we can instantiate a lemmatizer that takes the morphological information of the word into account.
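To make the role of the POS tag concrete, here is a minimal sketch of a lemma lookup keyed by token and POS tag. The lexicon `TOY_LEXICON` and the function `toy_lemma` are hypothetical and written only for illustration; a real lemmatizer relies on a full lexicon and a tagger rather than a handful of entries.

```python
# Toy lexicon keyed by (token, POS tag). The entries are illustrative
# assumptions, not the output of any real lemmatizer.
TOY_LEXICON = {
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("'m", "AUX"): "be",
}


def toy_lemma(token, pos):
    """Return the lemma for (token, pos); fall back to the lowercased token."""
    return TOY_LEXICON.get((token.lower(), pos), token.lower())


print(toy_lemma("running", "VERB"))  # run
print(toy_lemma("running", "NOUN"))  # running
print(toy_lemma("'m", "AUX"))        # be
```

The same surface form maps to different lemmas depending on its tag, which is exactly the information a stemmer does not use.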

In [None]:
# Lemmatize the sentence using spaCy. This method keeps stopwords and returns a lemma for every token
lematized_sentence = spacy_tokenizer.lemmatizer(sent)
lematized_sentence

In this case we can see that the "'m" token was recognized as an inflection of 'to be', and we finally have a list of lemmas for the original sentence.



But our list of lemmas contains many words that carry little meaning. Words like 'to' or the pronoun 'I', for instance, do not give us much information for this task. If we want to compare two or more sentences, we could start by cleaning them.

In [None]:
# This method removes stopwords and pronouns and returns only the relevant lemmas of a sentence
clean_base_sentence = spacy_tokenizer.lemma_tokenizer(sent)
clean_base_sentence

This last method could be useful if we want to calculate the similarity between two or more sentences. For instance, we could use its output as input to a TF-IDF score that measures their similarity.
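The cleaning step described above can be sketched as a simple filter over a lemma list. The stopword set `TOY_STOPWORDS`, the function `keep_content_lemmas`, and the hand-written lemma list are all assumptions made for illustration; they are not what `SpacyTokenizer.lemma_tokenizer` uses internally.

```python
# Hypothetical, deliberately tiny stopword/pronoun set.
TOY_STOPWORDS = frozenset({"i", "to", "some", "when", "be"})


def keep_content_lemmas(lemmas, stopwords=TOY_STOPWORDS):
    """Drop lemmas that appear in the stopword set (case-insensitive)."""
    return [lemma for lemma in lemmas if lemma.lower() not in stopwords]


# Lemma list written by hand (an assumption, not real lemmatizer output).
lemmas = ["I", "love", "to", "listen", "to", "some", "music", "when", "I", "be", "run"]
print(keep_content_lemmas(lemmas))  # ['love', 'listen', 'music', 'run']
```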

## Similarity

### TF-IDF, or Lexical Similarity

In [4]:
from scripts.vectorizers import TFIDFVectorizer

In [11]:
tfidf_vectorizer = TFIDFVectorizer(spacy_tokenizer.lemmatizer)
# Any tokenizer method of SpacyTokenizer (or another tokenizer) can be passed

This tokenizer gets the lemma for each token and keeps only the meaningful words of the sentence. Stopwords and pronouns are left out of the result to avoid distorting the score.

In [7]:
#sent = "I love to listen to some music when I'm running"

sentences = [
    "I like to run and listen rock music",
    "I love music"
    ]


In [8]:
print(tfidf_vectorizer.calculate_tfidf(sent, sentences))

{'most_similar_sentence': 'I love music', 'score': 0.6670239951717103}




If our aim is a real similarity measurement, we need better approaches. Still, TF-IDF is good for setting a baseline. So far, if two sentences share some words, specific words or parts of them, we could run into undesirable results. We have not considered false positive cases until now.
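For reference, the TF-IDF baseline can be sketched in plain Python. This is a minimal sketch, not the code behind `scripts.vectorizers.TFIDFVectorizer`: the function names are illustrative, the input strings are hand-written lemma lists (an assumption), and the smoothed IDF follows the default formula documented for scikit-learn's `TfidfTransformer`. Whether the notebook's vectorizer uses the same settings is also an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenized_docs]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, candidates):
    """Score the query against each candidate and return the best match."""
    docs = [s.lower().split() for s in [query] + candidates]
    vectors = tfidf_vectors(docs)
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"most_similar_sentence": candidates[best], "score": scores[best]}


# Hand-written lemma strings standing in for cleaned sentences.
result = most_similar("love listen music run", ["like run listen rock music", "love music"])
print(result["most_similar_sentence"], round(result["score"], 3))  # love music 0.667
```

The score only reflects shared terms and their weights, which is why it works as a baseline but says nothing about meaning.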


In [13]:
fp_sent = "I really want to share it with you"
fp_sent2 = [
    "I don't want to share it with you",
    "I want to share it with you"]

In [14]:
print(tfidf_vectorizer.calculate_tfidf(fp_sent, fp_sent2))

{'most_similar_sentence': "I don't want to share it with you", 'score': 0.6410554491745126}
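A quick way to see why lexical measures struggle with this case is to look at raw word overlap. The sketch below uses plain Jaccard overlap on whitespace tokens, which is a simpler measure than TF-IDF and is used here only as an illustration; `jaccard` is a helper written for this example.

```python
def jaccard(a, b):
    """Share of distinct lowercased tokens that two sentences have in common."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)


query = "I really want to share it with you"
negated = "I don't want to share it with you"
plain = "I want to share it with you"

print(round(jaccard(query, negated), 3))  # 0.778
print(round(jaccard(query, plain), 3))    # 0.875
```

Both candidates score high, although the negated sentence means the opposite of the query. Any measure based only on shared words will tend to treat it as a close match.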


### Word Embeddings
But we could use word embeddings to improve our notion of 'similarity'. With domain-specific embeddings trained on domain-specific data, our similarity measurement could change. This kind of semantic similarity could be useful for us:

In [None]:
from scripts.vectorizers import FastTextVectorizer

As I have no specification of the domain, nor domain data to train domain-specific word embeddings, I chose one of the pretrained models from the fastText library.

In [15]:
fasttext_vectorizer = FastTextVectorizer('models/fasttext/cc.en.300.bin')

# I will use this arbitrarily chosen fastText model to evaluate the similarity between one sentence and other(s).
# Loading can be slow.
# Custom-trained vectors can also be used.

NameError: name 'FastTextVectorizer' is not defined
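Independently of the model, one common way to turn word vectors into a sentence vector is to average them. The sketch below assumes this strategy and uses made-up 3-dimensional vectors (`TOY_VECTORS`); it is not necessarily what `FastTextVectorizer.vectorize_sentences` does. Skipping unknown tokens is also a simplification, since fastText `.bin` models can build vectors for unseen words from subword n-grams.

```python
# Hypothetical 3-dimensional word vectors; the real cc.en.300 vectors have 300.
TOY_VECTORS = {
    "love": [0.9, 0.1, 0.0],
    "music": [0.1, 0.8, 0.1],
    "run": [0.0, 0.2, 0.9],
}


def sentence_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


print(sentence_vector(["love", "music"], TOY_VECTORS))
```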

In [None]:
from scripts import cosine_similarity as cs

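def eval_similarity(sents_list, vectorizer):
    """Compare the first sentence in sents_list against all the others."""
    vectorized_sents = vectorizer.vectorize_sentences(sents_list)
    # Similarity of the first sentence (the query) to each remaining sentence
    similarity = cs.cosine_one_many(vectorized_sents[0], vectorized_sents[1:])
    max_sim = max(similarity)
    # +1 because similarity[i] corresponds to sents_list[i + 1]
    # TODO: review the implementation of cosine_one_many
    max_sim_idx = similarity.index(max_sim) + 1
    sent_idx_sim = sents_list[max_sim_idx]

    result = dict(most_similar_sentence=sent_idx_sim, score=max_sim)
    return result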
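Since `scripts.cosine_similarity` is not shown here, the following is a hypothetical stand-in for `cosine_one_many`, assuming it takes one dense vector plus a list of dense vectors and returns a plain list of scores (which is what the `.index` call above needs). The actual module may differ.

```python
import math


def cosine_dense(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_one_many(query_vec, other_vecs):
    """Return the similarity of query_vec to each vector in other_vecs."""
    return [cosine_dense(query_vec, v) for v in other_vecs]


scores = cosine_one_many([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(scores.index(max(scores)))  # 0
```

Returning a list keeps `max` and `.index` valid. If the real function returned a NumPy array instead, `.index` would not be available and the lookup would need to change.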

In [None]:
all_sentences  = [sent] + sentences

In [None]:
print(eval_similarity(all_sentences, fasttext_vectorizer))

## Disclaimers:
Using these fastText embeddings was maybe not the best way to achieve this task, at least in terms of my own machine's resources (loading the model takes too much time, and spending it just to compare two sentences was, I guess, not the best experience).

It must be improved. The first direction I would look at is training domain-specific word embeddings with fastText, or trying a BERT model with a custom-trained layer, but it all depends on the task and the type of data I have.

Another way to improve the similarity measurement could be to parse the sentences and attend to the dependencies between the tokens inside a sentence. I must be honest: I have never done it before, but I know of some papers that take this approach.




# 3. Problem: 
#### Suppose you had a corpus of corporate agreements, for example: vendor agreements, NDAs, privacy policies, etc. Describe your thoughts on building a model that is able to cluster the documents so that you would expect NDAs to be in one cluster and privacy policies in another cluster.

### Answer
It depends on the quantity of data and on whether the data is labeled, which decides whether I go for a supervised or an unsupervised process. As a first approach I would maybe try to implement an SVM, as I have done before.
We could build a training set for each class of our classifier and try to get better results as we improve the training process.


On the other hand, if I choose the unsupervised path, I could try some clustering algorithm (such as KMeans or another), supposing K=2. The drawback here is determining this number, or whether it is the correct one for our real domain.
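As a rough illustration of the unsupervised path, here is a minimal k-means sketch in plain Python. The document vectors are made-up term counts over a hypothetical four-word vocabulary, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic. In practice I would expect to use TF-IDF or embedding vectors and a library implementation.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, inertia


# Hypothetical counts for ["confidential", "disclose", "cookies", "personal"].
docs = [
    [3, 2, 0, 0],  # NDA-like
    [2, 3, 0, 1],  # NDA-like
    [0, 0, 3, 2],  # privacy-policy-like
    [0, 1, 2, 3],  # privacy-policy-like
]

for k in (1, 2, 3):
    labels, inertia = kmeans(docs, k)
    print(k, labels, round(inertia, 2))
```

Comparing the inertia across several values of K (the "elbow" heuristic) is one way to approach the question of which K fits the domain, although it does not settle it on its own.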