### Define Kneser-Ney smoothing in Plain English

Kneser-Ney smoothing is a method that helps solve the problem of predicting n-grams that the training set has not seen before. It takes a little probability away from the n-grams that were seen and gives it to unseen ones, weighting each candidate word by how many different word histories it has followed, and uses the result to predict which word comes next.
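As a rough illustration of the idea, here is a minimal sketch of interpolated Kneser-Ney for bigrams. The function name `kneser_ney_bigram`, the toy corpus, and the fixed discount of 0.75 are assumptions made for this example; real language-model toolkits handle higher orders and estimate the discount from data.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Return a function prob(word, prev) with interpolated Kneser-Ney bigram estimates.

    Assumes `tokens` holds at least two words; `discount` is an illustrative default.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_totals = Counter()        # how often each word occurs as a history
    followers = defaultdict(set)      # distinct words seen after each history
    predecessors = defaultdict(set)   # distinct histories seen before each word
    for (prev, word), count in bigram_counts.items():
        history_totals[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    bigram_types = float(len(bigram_counts))

    def prob(word, prev):
        # Continuation probability: share of distinct bigram types that end in `word`
        p_continuation = len(predecessors.get(word, ())) / bigram_types
        total = history_totals[prev]
        if total == 0:
            return p_continuation  # unseen history: fall back entirely
        # Subtract a fixed discount from every seen bigram count
        discounted = max(bigram_counts[(prev, word)] - discount, 0) / total
        # Hand the discounted mass to the continuation distribution
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_continuation

    return prob

tokens = "the cat sat on the mat the cat ate".split()
p = kneser_ney_bigram(tokens)
print(p("cat", "the"))  # bigram seen in the toy corpus
print(p("sat", "the"))  # unseen bigram, still gets a non-zero probability
```

Because the discounted mass is redistributed rather than dropped, the probabilities for a given history still sum to one over the vocabulary.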

### Define Parse Tree in Plain English

A parse tree is a representation of the syntactic structure of a sentence according to a context-free grammar. It lets us work with individual parts of the sentence as subtrees.
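To make the subtree idea concrete, here is a small sketch that stores a parse tree as nested tuples and pulls out its noun phrases. The tree shape is hand-written for illustration (it is not the output of a parser), and the helper names `leaves` and `subtrees` are hypothetical.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
# This bracketing of "I am your father" is written by hand for illustration.
tree = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("VBP", "am"),
               ("NP", ("PRP$", "your"), ("NN", "father"))))

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def subtrees(node, label):
    """Yield every subtree whose root carries the given label."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield node
    for child in node[1:]:
        for match in subtrees(child, label):
            yield match

noun_phrases = [leaves(phrase) for phrase in subtrees(tree, "NP")]
print(noun_phrases)  # [['I'], ['your', 'father']]
```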

### Word2Vec in Plain English

Word2Vec is a shallow neural network that turns words into a numerical representation. Each word is represented by a continuous vector, learned with, for example, a continuous bag-of-words (CBOW) model that predicts a word from its neighbors. This is important because computers can compare and calculate with vectors in a vector space far more easily than with raw words.
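The sketch below shows only the data-preparation step of a continuous bag-of-words model: turning a sentence into (context words, target word) training pairs. The function name `cbow_pairs` and the window size are assumptions for this example; the network that learns the vectors from these pairs is not shown.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

tokens = "the star wars saga continues".split()
for context, target in cbow_pairs(tokens, window=1):
    print("%s -> %s" % (context, target))
```

A CBOW model would be trained to predict each target from the combined vectors of its context words.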

### Recall, Precision, F1 score by code

In [5]:
from __future__ import division

#confusion matrix
matrix = [[713 ,  8],
         [ 33 , 80]]

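# Rows are actual classes, columns are predicted classes:
# matrix[1][1] = true positives, matrix[0][1] = false positives, matrix[1][0] = false negatives
# precision = TP / (TP + FP), recall = TP / (TP + FN)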

def confusion(matrix):
    precision = matrix[1][1] / (matrix[1][1] + matrix[0][1]) 
    recall = matrix[1][1] / (matrix[1][1] + matrix[1][0])
    f1_score = (2*precision*recall) / (recall + precision)
    return "precision: %s recall: %s f1_score: %s" % (precision, recall, f1_score)

In [6]:
print confusion(matrix)

precision: 0.909090909091 recall: 0.70796460177 f1_score: 0.796019900498


### Recall and F1 score with existing library

In [7]:
import numpy as np
from sklearn.metrics import precision_score ,recall_score, confusion_matrix, f1_score
y_true = np.array([1,0,1,0,1,0])
y_pred = np.array([1,1,1,1,1,1])

print precision_score(y_true, y_pred)
print recall_score(y_true, y_pred)
print f1_score(y_true, y_pred)

0.5
1.0
0.666666666667
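To see where the library's numbers come from, the same scores can be recomputed by counting true positives, false positives, and false negatives directly from the two label lists. This is a plain-Python sketch for checking the arithmetic, not how scikit-learn is implemented.

```python
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, actually 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, actually 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, actually 1

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print("precision: %s recall: %s f1: %s" % (precision, recall, f1))
```

With every prediction set to 1, nothing positive is missed (recall 1.0), but half of the predictions are wrong (precision 0.5).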


#Edit Distance Code


In [8]:
def levenshtein(s1, s2):
    """Takes 2 words, returns Levenshtein distance.
    
    >>> levenshtein('foo', 'poo')
    1
    
    >>> levenshtein('intention', 'execution')
    5

    """
    
    if len(s1) > len(s2): # Make s1 the shorter word (bookkeeping to be consistent)
        s1 , s2 = s2 , s1
 
    if len(s1) == 0: # If the shorter word is empty,
        # the cost is simply inserting every letter of the other word, i.e. its length
        return len(s2)
 
    previous_row = range(len(s2) + 1) # Distances from the empty prefix of s1 to each prefix of s2
   
    for i, c1 in enumerate(s1): # Iterate through the first word 
        current_row = [i + 1]
        for j, c2 in enumerate(s2): # Iterate through the second word
            if c1 == c2:
                current_row.append(previous_row[j])
            else:
                insertions = previous_row[j + 1] + 1 
                deletions = current_row[-1] + 1
                substitutions = previous_row[j] + 1
                current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


In [9]:
print levenshtein('foo', 'poo')
print levenshtein('intention', 'execution')

1
5
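The `levenshtein` function above fills in, one row at a time, the standard dynamic-programming table for edit distance. Written as a recurrence, with $D(i, j)$ the distance between the first $i$ letters of `s1` and the first $j$ letters of `s2`, and assuming (as the code does) a cost of 1 for each insertion, deletion, and substitution:

$$
\begin{aligned}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &= \min
\begin{cases}
D(i-1, j) + 1 \\
D(i, j-1) + 1 \\
D(i-1, j-1) + [\, s_1[i] \neq s_2[j] \,]
\end{cases}
\end{aligned}
$$

Here $[\, s_1[i] \neq s_2[j] \,]$ is 1 when the two letters differ and 0 when they match, and the final answer is $D(|s_1|, |s_2|)$, which is the last entry of the last row.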


In [10]:
#using NLP Library

from nltk.metrics import distance

print distance.edit_distance('foo','poo')
print distance.edit_distance('intention', 'execution')

1
5


#Ngrams

In [52]:
from nltk.util import skipgrams
sentence = "The Star Wars saga continues with this seventh entry"
sentence = sentence.split(' ')
sentence
list(skipgrams(sentence, 2, 2))

[('The', 'Star'),
 ('The', 'Wars'),
 ('The', 'saga'),
 ('Star', 'Wars'),
 ('Star', 'saga'),
 ('Star', 'continues'),
 ('Wars', 'saga'),
 ('Wars', 'continues'),
 ('Wars', 'with'),
 ('saga', 'continues'),
 ('saga', 'with'),
 ('saga', 'this'),
 ('continues', 'with'),
 ('continues', 'this'),
 ('continues', 'seventh'),
 ('with', 'this'),
 ('with', 'seventh'),
 ('with', 'entry'),
 ('this', 'seventh'),
 ('this', 'entry'),
 ('seventh', 'entry')]
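The call `skipgrams(sentence, 2, 2)` asks for word pairs (n = 2) with up to two words skipped between them (k = 2). A plain-Python sketch of that bigram case, using a hypothetical helper named `skip_bigrams`, could look like this:

```python
def skip_bigrams(tokens, k):
    """Return all ordered word pairs with at most k words skipped between them."""
    pairs = []
    for i, first in enumerate(tokens):
        # The partner may be the next word or any of the k words after it
        for second in tokens[i + 1:i + 2 + k]:
            pairs.append((first, second))
    return pairs

tokens = "The Star Wars saga".split(' ')
print(skip_bigrams(tokens, 2))
# [('The', 'Star'), ('The', 'Wars'), ('The', 'saga'), ('Star', 'Wars'), ('Star', 'saga'), ('Wars', 'saga')]
```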

In [54]:
from nltk.util import ngrams
print list(ngrams(sentence,2))
print list(ngrams(sentence,3))

[('The', 'Star'), ('Star', 'Wars'), ('Wars', 'saga'), ('saga', 'continues'), ('continues', 'with'), ('with', 'this'), ('this', 'seventh'), ('seventh', 'entry')]
[('The', 'Star', 'Wars'), ('Star', 'Wars', 'saga'), ('Wars', 'saga', 'continues'), ('saga', 'continues', 'with'), ('continues', 'with', 'this'), ('with', 'this', 'seventh'), ('this', 'seventh', 'entry')]


In [28]:
sentence = "The Star Wars saga continues with this seventh entry"

def bigram(sentence):
    """Return every pair of adjacent words in a space-separated sentence."""
    words = sentence.split(' ')
    # Stop one word before the end so each slice holds exactly two words
    return [words[i:i+2] for i in range(len(words) - 1)]

In [29]:
bigram(sentence)

[['The', 'Star'],
 ['Star', 'Wars'],
 ['Wars', 'saga'],
 ['saga', 'continues'],
 ['continues', 'with'],
 ['with', 'this'],
 ['this', 'seventh'],
 ['seventh', 'entry']]

In [30]:
def trigram(sentence):
    """Return every run of three adjacent words in a space-separated sentence."""
    words = sentence.split(' ')
    # Stop two words before the end so each slice holds exactly three words
    return [words[i:i+3] for i in range(len(words) - 2)]

In [31]:
trigram(sentence)

[['The', 'Star', 'Wars'],
 ['Star', 'Wars', 'saga'],
 ['Wars', 'saga', 'continues'],
 ['saga', 'continues', 'with'],
 ['continues', 'with', 'this'],
 ['with', 'this', 'seventh'],
 ['this', 'seventh', 'entry']]

In [56]:
list(ngrams(sentence.split(' '),3)) # sentence was reset to a string in In [28], so split it into tokens first

[('The', 'Star', 'Wars'),
 ('Star', 'Wars', 'saga'),
 ('Wars', 'saga', 'continues'),
 ('saga', 'continues', 'with'),
 ('continues', 'with', 'this'),
 ('with', 'this', 'seventh'),
 ('this', 'seventh', 'entry')]
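The hand-written `bigram` and `trigram` functions differ only in the window size, so they can be folded into one helper. The sketch below, with the hypothetical name `ngrams_zip`, zips together n shifted copies of the token list and returns tuples like `nltk.util.ngrams` does:

```python
def ngrams_zip(tokens, n):
    """Return the n-grams of a token list as tuples."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] line up the i-th word of every n-gram
    return list(zip(*[tokens[i:] for i in range(n)]))

words = "The Star Wars saga".split(' ')
print(ngrams_zip(words, 2))  # [('The', 'Star'), ('Star', 'Wars'), ('Wars', 'saga')]
print(ngrams_zip(words, 3))  # [('The', 'Star', 'Wars'), ('Star', 'Wars', 'saga')]
```

`zip` stops at the shortest shifted copy, so no length check is needed, and a sentence shorter than n simply yields an empty list.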

#disambiguate word sense

In [35]:
from nltk.wsd import lesk

sentence = "Luke, I am your father."
sentence = sentence.split(' ')
print(lesk(sentence, 'father'))

Synset('father.n.01')
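The Lesk approach picks the sense whose dictionary definition shares the most words with the surrounding sentence. Here is a simplified sketch of that overlap count using two WordNet glosses for "father" typed in by hand. The helper name `simplified_lesk` and the lowercasing and punctuation stripping are choices made for this example, and NLTK's own `lesk` may differ in its details.

```python
import string

def simplified_lesk(context_tokens, glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    def normalize(tokens):
        # Lowercase and strip surrounding punctuation so "father." matches "father"
        cleaned = (t.strip(string.punctuation).lower() for t in tokens)
        return set(t for t in cleaned if t)

    context = normalize(context_tokens)
    return max(glosses, key=lambda sense: len(context & normalize(glosses[sense].split())))

# Two WordNet definitions of "father", copied by hand for illustration
glosses = {
    "father.n.01": "a male parent (also used as a term of address to your father)",
    "forefather.n.01": "the founder of a family",
}
best = simplified_lesk("Luke, I am your father.".split(' '), glosses)
print(best)  # father.n.01
```

The first gloss shares "your" and "father" with the sentence while the second shares nothing, so the male-parent sense wins.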


In [36]:
from nltk.corpus import wordnet as wn

for ss in wn.synsets('father'):
    print(ss, ss.definition())
    
# Based on the WordNet synsets, 'father' in this case is a male parent

(Synset('father.n.01'), u'a male parent (also used as a term of address to your father)')
(Synset('forefather.n.01'), u'the founder of a family')
(Synset('father.n.03'), u"`Father' is a term of address for priests in some churches (especially the Roman Catholic Church or the Orthodox Catholic Church); `Padre' is frequently used in the military")
(Synset('church_father.n.01'), u'(Christianity) any of about 70 theologians in the period from the 2nd to the 7th century whose writing established and confirmed official church doctrine; in the Roman Catholic Church some were later declared saints and became Doctor of the Church; the best known Latin Church Fathers are Ambrose, Augustine, Gregory the Great, and Jerome; those who wrote in Greek include Athanasius, Basil, Gregory Nazianzen, and John Chrysostom')
(Synset('father.n.05'), u'a person who holds an important or distinguished position in some organization')
(Synset('father.n.06'), u'God when considered as the first person in the Trinit

#code using an existing NLP library to tag part-of-speech

In [37]:
import nltk
pos_tag = nltk.pos_tag(sentence)

In [38]:
pos_tag

[('Luke,', 'NNP'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('your', 'PRP$'),
 ('father.', 'NN')]

In [39]:
from textblob import TextBlob

star_wars = TextBlob("Luke, I am your father.")

In [40]:
star_wars.tags

[('Luke', u'NNP'),
 ('I', u'PRP'),
 ('am', u'VBP'),
 ('your', u'PRP$'),
 ('father', u'NN')]

In [59]:
star_wars.ngrams(n=2)
 

[WordList(['Luke', 'I']),
 WordList(['I', 'am']),
 WordList(['am', 'your']),
 WordList(['your', 'father'])]