##NLTK-based ngram and POS-ngram parser

###Requirements:
* NLTK
* NLTK downloads:
<pre><code>
# in a Python interpreter, type:
import nltk
nltk.download() 
</code></pre>
A pop-up window appears. In the window, choose:
* Models => Treebanks Part of Speech Tagger, Treebank Part of Speech Tagger (Maximum entropy), Punkt Tokenizer Models
* Corpora => WordNet (needed by the `WordNetLemmatizer` used below)
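
The same resources can also be fetched without the pop-up window, which helps on machines with no display. The following is a minimal sketch, not part of the original notebook. The resource identifiers in `RESOURCES` are assumptions and vary between NLTK releases (newer releases ship an `averaged_perceptron_tagger` instead of the maximum entropy tagger). `download_resources` is a hypothetical helper name.

```python
# Assumed resource identifiers; check them against your NLTK version.
RESOURCES = ["punkt", "wordnet", "maxent_treebank_pos_tagger"]


def download_resources(resources=RESOURCES):
    """Download each resource non-interactively.

    Returns a dict mapping each resource name to True/False (success flag).
    """
    import nltk  # imported lazily so this file loads even before NLTK is installed

    return {name: nltk.download(name, quiet=True) for name in resources}


# Usage (needs NLTK installed and network access):
# print(download_resources())
```

A `False` value for a name usually means the identifier does not exist in the installed NLTK release or the download failed.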

In [1]:
import string
import nltk  # install NLTK from http://www.nltk.org/
from nltk.stem.wordnet import WordNetLemmatizer

# Sentence splitter; requires the Punkt tokenizer package (install it via nltk.download())
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
lmtzr = WordNetLemmatizer()
# Translation map for unicode strings (maps each punctuation code point to None); not used by the functions below
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
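
The `remove_punctuation_map` dictionary is the form that `translate` expects for unicode strings. On Python 3, where `string.maketrans` no longer exists, the dictionary alone is enough to strip punctuation. A minimal sketch, assuming Python 3; `normalize` is a hypothetical helper that is not part of this notebook, and it skips the lemmatization step:

```python
import string

# Map every ASCII punctuation code point to None so translate() deletes it.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace (hypothetical helper)."""
    return text.lower().translate(remove_punctuation_map).split()


print(normalize("Guys, shall we meet to discuss next steps?"))
# ['guys', 'shall', 'we', 'meet', 'to', 'discuss', 'next', 'steps']
```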

In [2]:
# Tokenizers
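# Given a single sentence, return a list of word ngrams
# Note: split the text into sentences first (e.g. with sent_detector)
# Identity translation table; the characters to delete are passed to translate() separately
table = string.maketrans("","")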

def get_word_ngrams(text, n):
    ngram_list = []
    word_list = [lmtzr.lemmatize(word) for word in text.lower().translate(table, string.punctuation).split()]
    padding = ["<s>"] * (n -1)
    word_list = padding + word_list + padding
    for i in range(len(word_list) - (n-1)):
        ngram_list.append(" ".join(word_list[i:i+(n)]))
    return ngram_list
    
# Given a single sentence, return a list of part-of-speech (POS) tag ngrams
def get_pos_ngrams(text, n):
    pos_list = []
    word_list = [lmtzr.lemmatize(word) for word in text.lower().translate(table, string.punctuation).split()]
    padding = ["<s>"] * (n -1)
    pos_list_temp = [pair[1] for pair in nltk.pos_tag(word_list)]  # keep only the tag from each (word, tag) pair
    pos_list_temp = padding + pos_list_temp + padding
    for i in range(len(pos_list_temp) - (n-1)):
        pos_list.append(" ".join(pos_list_temp[i:i+(n)]))
    return pos_list
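
Both functions end with the same step: pad the token list with `n - 1` `<s>` markers on each side, then slide a window of width `n` over it. That shared step can be sketched on its own as below. This is an illustration assuming Python 3, and `pad_and_window` is a hypothetical name that does not appear in the notebook.

```python
def pad_and_window(tokens, n, pad="<s>"):
    """Return space-joined ngrams over tokens, padded with n-1 markers per side."""
    padding = [pad] * (n - 1)
    padded = padding + list(tokens) + padding
    # One window per start position that still leaves room for n items.
    return [" ".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


print(pad_and_window(["guy", "shall", "we"], 2))
# ['<s> guy', 'guy shall', 'shall we', 'we <s>']
```

The helper works on any list of strings, so it accepts lemmatized words and POS tags alike. With `n = 1` no padding is added and the tokens come back unchanged.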


In [3]:
text = "Guys, shall we meet to discuss next steps? Anyone interested to meet on Friday? "

sentences = sent_detector.tokenize(text.strip())
for sent in sentences:
    print sent
    print "Unigrams: ",get_word_ngrams(sent, 1)
    print "Bigrams: ", get_word_ngrams(sent, 2)
    print "Part of speech bigrams: ", get_pos_ngrams(sent, 2), "\n\n"

Guys, shall we meet to discuss next steps?
Unigrams:  [u'guy', 'shall', 'we', 'meet', 'to', u'discus', 'next', u'step']
Bigrams:  [u'<s> guy', u'guy shall', 'shall we', 'we meet', 'meet to', u'to discus', u'discus next', u'next step', u'step <s>']
Part of speech bigrams:  ['<s> NN', 'NN MD', 'MD PRP', 'PRP VBP', 'VBP TO', 'TO VB', 'VB JJ', 'JJ NN', 'NN <s>'] 


Anyone interested to meet on Friday?
Unigrams:  ['anyone', 'interested', 'to', 'meet', 'on', 'friday']
Bigrams:  ['<s> anyone', 'anyone interested', 'interested to', 'to meet', 'meet on', 'on friday', 'friday <s>']
Part of speech bigrams:  ['<s> NN', 'NN VBD', 'VBD TO', 'TO VB', 'VB IN', 'IN NN', 'NN <s>'] 


