### Exploring usefulness of NLP features like POS, Named Entity Recognition, Coreference and Lemmatization using CoreNLP, Spacy and Fasttext

#### Task Sentence Completion: Probabilistic Approach

The task of sentence completion in NLP is to predict missing words in a sentence, or an incomplete beginning or ending of a sentence. For example, in "John _ Susan in the mall" a missing word needs to be predicted such that the sentence is coherent and syntactically correct. Sentence completion has numerous applications, such as email reply generation, spell checking, bot interaction, and judging the correctness of automated machine translation.

One of the simplest and most popular methods of sentence completion is the N-gram model. For example, in a bigram model we estimate the probability of each pair of adjacent words in a training corpus, and use these estimates to score candidate words for the gap in a test sentence. The limitation of this method is that it does not capture relations between the missing word(s) and words other than the immediate neighbours. Consider the sentence "I saw a tiger which was really very _". The missing word could be either "fierce" or "talkative". A bigram model compares P(fierce | very) and P(talkative | very), which does not take into account the noun "tiger" that the adjective describes.
I will briefly describe how NLP tasks like POS tagging, named entity recognition and dependency parsing are useful for this task.
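
To make the bigram limitation concrete, here is a minimal sketch that estimates bigram probabilities from a tiny corpus. The corpus and the helper name `bigram_prob` are invented for illustration; they are not part of the notebook's data.

```python
from collections import Counter, defaultdict

# Toy training corpus (invented for illustration).
corpus = [
    "the tiger was very fierce",
    "the lion was very fierce",
    "my neighbour was very talkative",
    "my uncle was very talkative",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for prev, cur in zip(words, words[1:]):
        bigram_counts[prev][cur] += 1


def bigram_prob(cur, prev):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 if prev was never seen."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0


p_fierce = bigram_prob("fierce", "very")
p_talkative = bigram_prob("talkative", "very")
print(p_fierce, p_talkative)
```

On this toy corpus both candidates get the same score after "very", so the bigram model alone cannot use "tiger" to choose between them.
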

Part of Speech Tagging: Given a corpus, we first get the part-of-speech tag for each word. Using these tags we can build a Hidden Markov Model by calculating the transition probabilities (e.g. the probability of a verb following a noun) and the emission probabilities (i.e. how probable a word is given a tag). The transition probabilities are used to predict the tag of the missing word(s), and the emission probabilities are used to score each candidate word given that tag.
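
The counting behind this can be sketched as follows. This is a simplified, greedy illustration on a hand-tagged toy corpus: it looks only at the tag before the gap, whereas a full HMM would also use the following tags (for example with the Viterbi algorithm). The helper names `prob` and `complete` are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (illustrative; tags follow the Penn Treebank style).
tagged_sentences = [
    [("John", "NNP"), ("met", "VBD"), ("Susan", "NNP")],
    [("Susan", "NNP"), ("saw", "VBD"), ("John", "NNP")],
    [("John", "NNP"), ("saw", "VBD"), ("Susan", "NNP")],
]

transitions = defaultdict(Counter)  # transitions[prev_tag][tag] -> count
emissions = defaultdict(Counter)    # emissions[tag][word] -> count
for sent in tagged_sentences:
    prev_tag = "<s>"  # sentence-start marker
    for word, tag in sent:
        transitions[prev_tag][tag] += 1
        emissions[tag][word] += 1
        prev_tag = tag


def prob(table, given, outcome):
    """Relative frequency of `outcome` among everything counted under `given`."""
    total = sum(table[given].values())
    return table[given][outcome] / total if total else 0.0


def complete(prev_tag):
    """Pick the most likely tag after prev_tag, then the most likely word for that tag."""
    best_tag = max(transitions[prev_tag], key=lambda t: prob(transitions, prev_tag, t))
    best_word = max(emissions[best_tag], key=lambda w: prob(emissions, best_tag, w))
    score = prob(transitions, prev_tag, best_tag) * prob(emissions, best_tag, best_word)
    return best_tag, best_word, score


# Fill the gap in "John _ Susan": the word before the gap is tagged NNP.
tag, word, score = complete("NNP")
print(tag, word, score)
```
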

Dependency Parsing: In a dependency parse tree each word is a node in the tree and directed edges represent relations between words. This information is useful for sentence completion. Instead of calculating $P(w_i \mid w_{i-1})$ we can calculate $P(w_i \mid p_i)$, where $p_i$ is the parent of $w_i$ in the dependency tree. For example, in the previous example "I saw a tiger which was really very fierce/talkative", instead of using P(fierce | very) and P(talkative | very), P(fierce | tiger) and P(talkative | tiger) can be used, since 'fierce' and 'tiger', and 'talkative' and 'tiger', are connected in the parse tree of our sentence.
Given a training corpus, we obtain a dependency parse tree for each sentence and estimate the probability of each pair of words that are connected in a tree. To predict the missing word in a test sentence, a dependency parse tree is created for the sentence filled in with each possible word (or only with words already shortlisted using POS). The probability of each candidate is then calculated from the relation probabilities estimated at training time.
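
A minimal sketch of this scoring step, assuming the dependency parser has already produced (child, parent) word pairs for the training corpus. The pairs below are hand-written for illustration, and `dep_prob` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# (child, parent) pairs as they might come out of a dependency parser run
# over a training corpus. These pairs are hand-written for illustration.
dependency_pairs = [
    ("fierce", "tiger"), ("fierce", "tiger"), ("fierce", "lion"),
    ("hungry", "tiger"), ("talkative", "neighbour"), ("talkative", "uncle"),
]

child_given_parent = defaultdict(Counter)
for child, parent in dependency_pairs:
    child_given_parent[parent][child] += 1


def dep_prob(child, parent):
    """Estimate P(child | parent) from the counted dependency pairs."""
    total = sum(child_given_parent[parent].values())
    return child_given_parent[parent][child] / total if total else 0.0


# Score the two candidates against "tiger", the word connected to the gap.
candidates = ["fierce", "talkative"]
best = max(candidates, key=lambda w: dep_prob(w, "tiger"))
print(best, dep_prob("fierce", "tiger"), dep_prob("talkative", "tiger"))
```

Unlike the bigram score conditioned on "very", this score depends on "tiger", so the two candidates are no longer tied.
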

#### Task Question Answering 

Given a text, the task is to find answers to questions from that text. Named entity recognition is one of the important steps of this task. Let's say we have a large corpus of data. Our system accepts a question from the user and determines the named entity type expected of the answer (for example, a "who" question expects a person). The system then looks into the corpus and eliminates text that does not mention an entity of the expected type. In this way we are able to eliminate texts that are not useful for answering the question. The selected text can then be processed further to extract the relevant answer. The words in the text can also be replaced by their named entity labels for this task.
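
The filtering step can be sketched as below. The question-word mapping `EXPECTED_TYPE`, the passage list and the function `candidate_passages` are assumptions made for illustration. In practice the entity labels would come from an NER tagger rather than being written by hand.

```python
# Hypothetical mapping from question word to the entity type expected in the answer.
EXPECTED_TYPE = {"who": "PERSON", "where": "LOC", "when": "DATE"}

# Passages paired with the entity labels an NER tagger might have produced
# for them (hand-written here).
passages = [
    ("John met Susan in the mall.", {"PERSON"}),
    ("She is traveling to Europe next week.", {"LOC", "DATE"}),
    ("The weather was pleasant.", set()),
]


def candidate_passages(question):
    """Keep only passages that mention an entity of the expected answer type."""
    first_word = question.strip().split()[0].lower()
    expected = EXPECTED_TYPE.get(first_word)
    if expected is None:
        # Unknown question type: do not filter anything out.
        return [text for text, _ in passages]
    return [text for text, labels in passages if expected in labels]


print(candidate_passages("Where is Susan traveling?"))
```
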

Coreference information can be used in this task to replace each pronoun with the noun it is referring to. For instance, in our example "John met Susan in the mall. She told him that she is traveling to Europe next week.", the question answering system can be fed the text "John met Susan in the mall. Susan told John that Susan is traveling to Europe next week." after anaphora resolution.
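
As a small illustration of the substitution itself, the sketch below applies hand-written coreference chains to the tokenized example. The `chains` structure and the `resolve` function are hypothetical, and a coreference resolver would normally supply the mention positions.

```python
tokens = [
    ["John", "met", "Susan", "in", "the", "mall", "."],
    ["She", "told", "him", "that", "she", "is", "traveling",
     "to", "Europe", "next", "week", "."],
]

# Hand-written coreference chains: antecedent -> positions of the pronouns
# that refer to it, as (sentence index, token index) pairs.
chains = {
    "Susan": [(1, 0), (1, 4)],
    "John": [(1, 2)],
}


def resolve(tokens, chains):
    """Return the text with every listed pronoun replaced by its antecedent."""
    resolved = [list(sentence) for sentence in tokens]  # copy, keep the input intact
    for antecedent, mentions in chains.items():
        for sent_idx, tok_idx in mentions:
            resolved[sent_idx][tok_idx] = antecedent
    return " ".join(" ".join(sentence) for sentence in resolved)


resolved_text = resolve(tokens, chains)
print(resolved_text)
```
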

The task of question answering also requires word search in the given corpus. It is useful to replace different words that share a lemma with that lemma, which significantly reduces the search space. For example, consider the text "She is a caring person. She cares for all. Last year she took care of all the patients". Here the words caring, cares and care have the same lemma 'care'.
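
A minimal sketch of lemma-based search on this example. The lemma table `LEMMAS` is written by hand to follow the text above, and `normalize` is a hypothetical helper. A lemmatizer such as the ones in CoreNLP or Spacy would normally provide the lemmas.

```python
# Hand-written lemma table for the example sentence.
LEMMAS = {"caring": "care", "cares": "care", "care": "care", "took": "take", "is": "be"}


def normalize(text):
    """Lowercase, strip simple punctuation, and map each word to its lemma."""
    words = [w.strip(".,").lower() for w in text.split()]
    return [LEMMAS.get(w, w) for w in words if w]


text = ("She is a caring person. She cares for all. "
        "Last year she took care of all the patients")
index = normalize(text)

# A query for any form of the word now matches every form in the text.
query = normalize("cares")[0]
hits = [i for i, lemma in enumerate(index) if lemma == query]
print(query, hits)
```
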

### Comparing CoreNLP and Spacy

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('John met Susan in the mall. She told him that she is traveling to Europe next week.')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)

John john PROPN NNP nsubj
met meet VERB VBD ROOT
Susan susan PROPN NNP dobj
in in ADP IN prep
the the DET DT det
mall mall NOUN NN pobj
. . PUNCT . punct
She -PRON- PRON PRP nsubj
told tell VERB VBD ROOT
him -PRON- PRON PRP dobj
that that ADP IN mark
she -PRON- PRON PRP nsubj
is be VERB VBZ aux
traveling travel VERB VBG ccomp
to to ADP IN prep
Europe europe PROPN NNP pobj
next next ADJ JJ amod
week week NOUN NN npadvmod
. . PUNCT . punct


Spacy gives two POS tags for each token: a simple (coarse-grained) part-of-speech tag and a detailed (fine-grained) part-of-speech tag.
1. John is tagged as a proper noun (simple tag) and as proper noun, singular (detailed tag) by Spacy; CoreNLP also tags John as proper noun, singular. The same agreement holds for met, Susan, in, the, mall, ., She, him, that, she, is, traveling, Europe and week. For the words 'to' and 'next' the two give different results. Spacy tags 'to' as an adposition (detailed tag IN, conjunction/preposition), whereas CoreNLP gives it the infinitival-to tag (TO). Similarly, for 'next' Spacy gives adjective (JJ), whereas CoreNLP gives conjunction/preposition (IN).
2. The lemma of 'She' and 'him' differs between Spacy and CoreNLP. Spacy gives '-PRON-' as the lemma, whereas CoreNLP gives 'she' as the lemma of 'She'.
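
Such a comparison can also be done programmatically. The sketch below assumes both taggers produce the same tokenization. The function name `tag_differences` is hypothetical, the Spacy tags are copied from the output above, and the CoreNLP tags are the ones described in the list.

```python
def tag_differences(tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every position where the two taggers disagree."""
    return [
        (tok_a, tag_a, tag_b)
        for (tok_a, tag_a), (tok_b, tag_b) in zip(tags_a, tags_b)
        if tok_a == tok_b and tag_a != tag_b
    ]


# Detailed tags for the end of the second sentence.
spacy_tags = [("traveling", "VBG"), ("to", "IN"), ("Europe", "NNP"),
              ("next", "JJ"), ("week", "NN")]
corenlp_tags = [("traveling", "VBG"), ("to", "TO"), ("Europe", "NNP"),
                ("next", "IN"), ("week", "NN")]

differences = tag_differences(spacy_tags, corenlp_tags)
print(differences)
```
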

#### Get JSON from CoreNLP server

In [27]:
import json
import requests

sentence = "Smith ate an apple. It was very tasty. He felt nice. John also ate it, but he did not like it."
# Annotators to run on the server and the format of the response
parameters = {"annotators":"tokenize,ssplit,pos,ner,lemma,dcoref,depparse,parse",
              "outputFormat":"json"}
# POST the raw text to a CoreNLP server that is assumed to be running locally on port 9000
resp = requests.post('http://localhost:9000', params=parameters, data=sentence)
# Parse the JSON body into nested dicts and lists
data = resp.json()
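
The cells below read the response through the keys `sentences`, `tokens`, `word`, `pos` and `lemma`. A small generator can hide that nesting. This is only a sketch: `iter_tokens` is a hypothetical helper, and `sample` is a hand-built stand-in with the same keys, not a real server response.

```python
def iter_tokens(response):
    """Yield (sentence index, word, pos, lemma) for every token in a CoreNLP-style response."""
    for sent_idx, sentence in enumerate(response["sentences"]):
        for token in sentence["tokens"]:
            yield sent_idx, token["word"], token["pos"], token["lemma"]


# Hand-built stand-in with the same keys the notebook reads from the response.
sample = {
    "sentences": [
        {"tokens": [
            {"word": "Smith", "pos": "NNP", "lemma": "Smith"},
            {"word": "ate", "pos": "VBD", "lemma": "eat"},
        ]},
    ]
}

rows = list(iter_tokens(sample))
print(rows)
```
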

#### Part of speech : Vectorized using one hot encoding

##### First step is to get a unique one hot encoding for each tag

In [28]:
import numpy as np
import math
import pandas as pd
# Extract the part-of-speech tag of every token in the CoreNLP response
all_pos_tags = []
for sent in data['sentences']:
    for token in sent['tokens']:
        all_pos_tags.append(token['pos'])

# Keep each tag once, in order of first appearance
unique_tags = []
for tag in all_pos_tags:
    if tag not in unique_tags:
        unique_tags.append(tag)
df = pd.DataFrame(unique_tags)
# One row per unique tag, one POS_<tag> indicator column per tag
vectorized_pos = pd.get_dummies(df, prefix ='POS')
vectorized_pos

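| | POS_, | POS_. | POS_CC | POS_DT | POS_IN | POS_JJ | POS_NN | POS_NNP | POS_PRP | POS_RB | POS_VBD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |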


##### Next replace all tags in the corpus with their vector

In [5]:
# Look up the one-hot vector for the POS tag of every token in the text
vectorized_pos_for_each_word = []
for sent in data['sentences']:
    for token in sent['tokens']:
        vectorized_pos_for_each_word.append(vectorized_pos['POS_' + token['pos']])
numpy_vector = np.array(vectorized_pos_for_each_word)
numpy_vector

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0,

##### Summing along axis 0 (over all tokens) gives the count of each tag in the text as a single vector

In [31]:
final_vector = numpy_vector.sum(axis=0)
final_vector

array([2, 5, 1, 1, 4, 5, 3, 2, 1, 1, 1], dtype=uint64)
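
The same idea can be checked without pandas. The sketch below uses a hand-written tag sequence in Penn Treebank style (an assumption, not the server output) and confirms that summing the one-hot vectors column by column gives the per-tag counts. `one_hot` and `tag_index` are hypothetical names.

```python
from collections import Counter

# Hand-written tag sequence for illustration.
tags = ["NNP", "VBD", "DT", "NN", ".", "PRP", "VBD", "RB", "JJ", "."]

# Assign each distinct tag a column, in order of first appearance.
tag_index = {}
for tag in tags:
    tag_index.setdefault(tag, len(tag_index))


def one_hot(tag):
    """Vector with a single 1 in the column that belongs to `tag`."""
    vec = [0] * len(tag_index)
    vec[tag_index[tag]] = 1
    return vec


vectors = [one_hot(tag) for tag in tags]
# Column-wise sum over all tokens gives the count of each tag.
counts = [sum(column) for column in zip(*vectors)]
print(counts)
```
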

#### Lemma

##### Each word in the text will be replaced by its lemma. The resulting sentence will be vectorized using word2vec

In [33]:
# replace each word with its lemma
pos_word_mapping = {}
word_in_sentence = []
for i in range(0, len(data['sentences'])):
    sentence_words = []
    for j in range(0, len(data['sentences'][i]['tokens'])):
        sentence_words.append(data['sentences'][i]['tokens'][j]['word'])
        if data['sentences'][i]['tokens'][j]['word'] not in pos_word_mapping:
            pos_word_mapping[data['sentences'][i]['tokens'][j]['word']] = data['sentences'][i]['tokens'][j]['pos']
    word_in_sentence.append(sentence_words)
    
def get_modified_sentence(word_in_sentence):
    modified_sentence = ""
    for i in range(0, len(word_in_sentence)):
        modified_sentence = modified_sentence + ' '.join(word_in_sentence[i])
    return modified_sentence
for i in range(0, len(data['sentences'])):
    for j in range(0, len(data['sentences'][i]['tokens'])):
        word_in_sentence[i][j] = data['sentences'][i]['tokens'][j]['lemma']

print("Sentence after replacing words by their lemmas")
sentence = get_modified_sentence(word_in_sentence)
print(sentence)

Sentence after replacing words by their lemmas
Smith eat a apple .it be very tasty .he feel nice .John also eat it , but he do not like it .


#### Anaphora resolution

##### Each pronoun in the text is replaced by the noun it refers to, i.e. a word tagged NN (noun, singular or mass), NNS (noun, plural), NNP (proper noun, singular) or NNPS (proper noun, plural). The resulting sentence will be vectorized using word2vec

In [36]:

coreferences = []
for key in data['corefs'].keys():
    if len(data['corefs'][key]) > 1:
        coreferences.append(data['corefs'][key])
nouns = []
for i in range(len(coreferences)):
    for j in range(0, len(coreferences[i])):
        tok = coreferences[i][j]['text'].split()
        for each in tok:
            if pos_word_mapping[each] in ['NN', 'NNS','NNP','NNPS']:
                nouns.append(each)
                break

for i in range(len(coreferences)):
    for j in range(0, len(coreferences[i])):
        tok = coreferences[i][j]['text'].split()
        # overwrite the head word of every mention that does not already contain the noun
        if nouns[i] not in tok:
            word_in_sentence[coreferences[i][j]['sentNum'] -1][coreferences[i][j]['headIndex']-1] = nouns[i]
print("Text after replacing coreference by the noun it is referring to:")
print(get_modified_sentence(word_in_sentence))

Text after replacing coreference by the noun it is referring to:
Smith eat a apple .apple be very tasty .Smith feel nice .John also eat apple , but John do not like apple .


Now the text can be vectorized using word2vec.

In [37]:
import nltk
from nltk.data import find
import gensim
nltk.download('word2vec_sample')
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package word2vec_sample to /Users/Aarushi-
[nltk_data]     Mac/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


In [39]:
text_vector = []
for sentence_words in word_in_sentence:
    for word in sentence_words:
        # skip tokens (e.g. punctuation) that have no word2vec entry;
        # `word in model` works on gensim KeyedVectors, whereas `model.vocab` was removed in gensim 4
        if word in model:
            text_vector.append(model[word])
text_vector = np.array(text_vector)
print(text_vector)
print("Shape of the resulting vector", text_vector.shape)

[[ 0.0487721   0.0482533  -0.0679696  ... -0.0793844   0.0747147
   0.0835352 ]
 [-0.0557888  -0.0381166  -0.0111751  ... -0.0171524   0.0745005
  -0.0183652 ]
 [-0.0205091  -0.0509621  -0.00384546 ...  0.0435042  -0.00660332
   0.109382  ]
 ...
 [ 0.0484068  -0.0542491   0.0678809  ... -0.0620387   0.02782
  -0.0745577 ]
 [ 0.0568747   0.075654   -0.00163481 ...  0.0241449  -0.0799465
   0.0391684 ]
 [-0.0205091  -0.0509621  -0.00384546 ...  0.0435042  -0.00660332
   0.109382  ]]
Shape of the resulting vector (20, 300)
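
The matrix above has one row per in-vocabulary token, so its number of rows changes with the length of the text. If a fixed-length representation is needed, one common option is to average the rows (mean pooling). This step is an assumption on my part rather than something the notebook does with the word2vec vectors. With NumPy it would be `text_vector.mean(axis=0)`, and the plain-Python sketch below shows the same computation on toy vectors. `mean_pool` is a hypothetical helper name.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector of the same length."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Two toy 3-dimensional "word vectors" standing in for the 300-dimensional rows.
toy = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
pooled = mean_pool(toy)
print(pooled)
```
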


### Spacy

#### Lemma

In [11]:
sentence = "John met Susan in the mall. She told him that she is traveling to Europe next week."
nlp = spacy.load('en_core_web_sm')
doc = nlp(sentence)
spacy_result = []
word_in_sentence = []
for token in doc:
    spacy_result.append([token.text, token.lemma_, token.tag_, token.dep_])
    word_in_sentence.append(token.text)

In [12]:
spacy_result

[['John', 'john', 'NNP', 'nsubj'],
 ['met', 'meet', 'VBD', 'ROOT'],
 ['Susan', 'susan', 'NNP', 'dobj'],
 ['in', 'in', 'IN', 'prep'],
 ['the', 'the', 'DT', 'det'],
 ['mall', 'mall', 'NN', 'pobj'],
 ['.', '.', '.', 'punct'],
 ['She', '-PRON-', 'PRP', 'nsubj'],
 ['told', 'tell', 'VBD', 'ROOT'],
 ['him', '-PRON-', 'PRP', 'dobj'],
 ['that', 'that', 'IN', 'mark'],
 ['she', '-PRON-', 'PRP', 'nsubj'],
 ['is', 'be', 'VBZ', 'aux'],
 ['traveling', 'travel', 'VBG', 'ccomp'],
 ['to', 'to', 'IN', 'prep'],
 ['Europe', 'europe', 'NNP', 'pobj'],
 ['next', 'next', 'JJ', 'amod'],
 ['week', 'week', 'NN', 'npadvmod'],
 ['.', '.', '.', 'punct']]

In [13]:
for i in range(len(word_in_sentence)):
    word_in_sentence[i] = spacy_result[i][1]
print("Sentence after word is replaced with lemma using Spacy")
print(' '.join(word_in_sentence))

Sentence after word is replaced with lemma using Spacy
john meet susan in the mall . -PRON- tell -PRON- that -PRON- be travel to europe next week .


#### Named entity recognition

In [14]:
ner_dict = {}
for ent in doc.ents:
    ner_dict[ent.text] = ent.label_

In [15]:
# replace each token that is a single-token named entity with its entity label
for i in range(len(word_in_sentence)):
    if spacy_result[i][0] in ner_dict:
        word_in_sentence[i] = ner_dict[spacy_result[i][0]]
print("Sentence after word is replaced with its named entity using Spacy")
print(' '.join(word_in_sentence))

Sentence after word is replaced with its named entity using Spacy
PERSON meet PERSON in the mall . -PRON- tell -PRON- that -PRON- be travel to LOC next week .


#### Vectorization of the sentence using spacy

In [16]:
doc = nlp(' '.join(word_in_sentence))
doc.vector

array([ 1.32948613e+00,  1.31929946e+00,  1.48117661e+00,  1.40827036e+00,
        4.33249027e-01,  3.18837523e-01, -1.30168188e+00, -1.05192912e+00,
        1.28231049e+00,  8.63556027e-01,  8.53094533e-02,  7.18300715e-02,
       -2.28958391e-03, -1.61102140e+00,  1.13392644e-01, -5.83186805e-01,
       -9.30373132e-01, -1.25734675e+00, -1.14793098e+00,  2.48901650e-01,
       -2.39418611e-01, -8.04257572e-01, -2.90454477e-01,  2.39891887e-01,
       -1.88766420e-01,  6.62348390e-01,  1.47018284e-01,  1.01151800e+00,
       -9.57176983e-01, -1.30864906e+00, -9.61815566e-02, -3.02313417e-01,
        6.92688167e-01, -1.09056699e+00, -2.20576331e-01, -8.94120395e-01,
        1.61020505e+00,  4.89308685e-02,  7.09146321e-01, -1.24255955e+00,
       -1.06523025e+00,  9.00087178e-01,  9.51568007e-01, -1.35924184e+00,
       -8.24857533e-01,  1.42789078e+00,  3.76268514e-02, -1.01400010e-01,
        1.61380982e+00,  5.48961520e-01,  3.81963521e-01, -6.37452602e-01,
       -9.81649905e-02,  

#### Training a fastText model to classify text from a novel into the 'Romantic' or 'Adventure' category

In [17]:
# importing the model
import fasttext

Training a skip-gram model on a corpus taken from Pride and Prejudice by Jane Austen and Adventures of Huckleberry Finn by Mark Twain, using the skipgram function in fastText. A CBOW model can also be used instead.

In [18]:
model = fasttext.skipgram('data.txt', 'model')

This creates model.bin and model.vec. We now load the model using the load_model function, then train a supervised classifier on the training data in the file 'train_data.txt'. Each line of the training file starts with the label, prefixed with __label__, followed by the training sentence.

In [19]:
model = fasttext.load_model('model.bin')
classifier = fasttext.supervised('train_data.txt', 'model')
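
The training file format described above can be produced with a small helper. This is a sketch: `to_fasttext_line` is a hypothetical function, and the two example sentences are shortened quotes from the novels used only to show the layout.

```python
def to_fasttext_line(label, sentence):
    """Format one example as '__label__<label> <sentence>' on a single line."""
    # Collapse whitespace so embedded newlines cannot split one example in two.
    return "__label__{} {}".format(label, " ".join(sentence.split()))


examples = [
    ("Adventure", "By and by he rolled out and jumped up on his feet looking wild."),
    ("Romantic", "With the officers! cried Lydia."),
]
lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```
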

We now evaluate the classifier on the test data.

In [20]:
result = classifier.test('test.txt')
print("Precision:", result.precision)
print("Recall:", result.recall)
result.nexamples

Precision: 0.75
Recall: 0.75


12
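
When every example has exactly one true label and the classifier returns exactly one prediction, precision and recall at 1 are the same number: the fraction of correct predictions. The sketch below illustrates this with made-up labels. They are not the notebook's test data, only chosen so that 9 of 12 are correct, the count that gives 0.75 on 12 examples. `precision_recall_at_1` is a hypothetical helper.

```python
def precision_recall_at_1(predicted, actual):
    """Precision and recall when each example has one true label and one prediction."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    precision = correct / len(predicted)  # correct predictions / predictions made
    recall = correct / len(actual)        # correct predictions / true labels
    return precision, recall


# Made-up labels for illustration: 9 of the 12 predictions match.
predicted_labels = ["Adventure"] * 6 + ["Romantic"] * 6
actual_labels = ["Adventure"] * 6 + ["Romantic"] * 3 + ["Adventure"] * 3

precision, recall = precision_recall_at_1(predicted_labels, actual_labels)
print(precision, recall)
```
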

We can also pass text from the novels and check the predicted label along with its probability. The first sentence is from the 'Adventure' category and the other from 'Romantic'. The classifier predicts both correctly.

In [54]:
texts = ['By and by he rolled out and jumped up on his feet looking wild, and he see me and went for me.', 'With the officers! cried Lydia. I wonder my aunt did not tell us of _that_.']
# Or with the probability
labels = classifier.predict_proba(texts)
print(labels)

[[('Adventure', 0.5)], [('Romantic', 0.5)]]


#### Classification using Spacy vectors

We performed the same classification task on the same data, vectorized by Spacy, and used sklearn logistic regression to train the classifier. The classifier reached an accuracy of about 92% (11 of 12 test sentences) at identifying whether a sentence is from the Adventure novel or the Romantic novel.

In [21]:
def get_vectors(filename):
    labels = []
    sentences = []
    with open(filename,"r") as f:
        for line in f:
            splitted_data = line.split(" ",1)
            labels.append(splitted_data[0])
            sentences.append(splitted_data[1])
    labels_num = []
    for e in labels:
        if e == '__label__Adventure':
            labels_num.append(0)
        else:
            labels_num.append(1)
    vectorized_sentences = []
    for sentence in sentences:
        doc = nlp(sentence)
        vectorized_sentences.append(doc.vector)
    return labels_num, vectorized_sentences

In [22]:
labels_num, vectorized_sentences = get_vectors('train_data.txt')
labels_num_test, vectorized_sentences_test = get_vectors('test.txt')

In [23]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(vectorized_sentences, labels_num)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [24]:
print("Accuracy of:", model.score(vectorized_sentences_test, labels_num_test))

Accuracy of: 0.9166666666666666


In [25]:
predicted = model.predict(vectorized_sentences_test)
label_names = {0: "__label__Adventure", 1: "__label__Romantic"}
# print the predicted and actual label side by side for every test sentence
for pred, actual in zip(predicted, labels_num_test):
    print("Predicted:", label_names[pred], "Actual:", label_names[actual])

Predicted: __label__Adventure Actual: __label__Adventure
Predicted: __label__Romantic Actual: __label__Adventure
Predicted: __label__Adventure Actual: __label__Adventure
Predicted: __label__Adventure Actual: __label__Adventure
Predicted: __label__Adventure Actual: __label__Adventure
Predicted: __label__Romantic Actual: __label__Romantic
Predicted: __label__Romantic Actual: __label__Romantic
Predicted: __label__Romantic Actual: __label__Romantic
Predicted: __label__Romantic Actual: __label__Romantic
Predicted: __label__Romantic Actual: __label__Romantic
Predicted: __label__Romantic Actual: __label__Romantic
Predicted: __label__Romantic Actual: __label__Romantic