<h1 align="center"> Natural Language Processing - 102 </h1>

## Program so far 
***

* Python
* Statistics
* Supervised Machine Learning
* Unsupervised Machine Learning
* NLP 101

## Agenda for the Day
***

- Topic Modeling
- POS Tagging
- Chunking
- Parsing

## Topic Modeling

NLP is all about unstructured data, and one of the problems the industry faces today is the sheer amount of data that any system has to process. It is often impractical to read through a huge volume of data to extract insights from it. Consider Google News: hundreds of thousands of news articles are published every day. We therefore need a way to group news by keywords in order to understand what is going on.

![alt text](../images/topic_modelling.png "Title")

- The ones in red are classes: they are fixed, and with the help of training data we can build a news classifier for them.
- The ones in green are topics, which are identified at run time. The process of identifying topics is entirely unsupervised. Topic modelling is one of the best ways to understand and represent unstructured text without actually reading through it.

__Topic Modelling__, as the name suggests, is a process that automatically identifies the topics present in a text object and derives hidden patterns exhibited by a text corpus, thus assisting better decision making.

A __Topic__ can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should result in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”, “crops”, “wheat” for a topic – “Farming”.

### Applications of Topic Modelling

- Document Clustering.
    1. Group news.
    2. Group emails.
    3. Group similar medical notes etc.
- Keywords Generation. Can be used for SEO.
- Build WordCloud.
- Build Search Engines.
- Build knowledge graphs (aka ontologies).



### How Topic Modelling Works

Topics are generally the important words in a text.
- A frequency count is one way to identify topics.
- TF-IDF can also be used for topic modelling.
- The most famous approach is LDA (Latent Dirichlet Allocation).
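
The first two approaches can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy documents and the helper names (`term_frequencies`, `inverse_document_frequencies`, `top_keywords`) are made up for this example, and the TF-IDF variant used is the simple `tf * log(N / df)` form. It assumes Python 3.

```python
import math
from collections import Counter

# Toy corpus (made up): each document is already tokenized and lower-cased.
docs = [
    ["doctor", "patient", "hospital", "health", "doctor"],
    ["farm", "crops", "wheat", "farm", "health"],
    ["hospital", "patient", "health", "nurse"],
]

def term_frequencies(doc):
    """Relative frequency of each term within one document."""
    counts = Counter(doc)
    total = len(doc)
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequencies(corpus):
    """log(N / df) per term, where df is the number of documents containing it."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def top_keywords(doc, idf, k=2):
    """Rank the terms of one document by TF-IDF (ties broken alphabetically)."""
    tf = term_frequencies(doc)
    scores = {term: tf[term] * idf[term] for term in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

idf = inverse_document_frequencies(docs)
for doc in docs:
    print(top_keywords(doc, idf))
```

Note that "health" appears in every document, so its IDF is zero and it never shows up as a keyword, even though a plain frequency count would rank it highly.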

## Latent Dirichlet Allocation

Suppose you have the following set of sentences:

- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.

LDA tries to identify words that are used in similar contexts by estimating how likely words are to occur together in the same documents.
In the above example, LDA might create topics like:
    
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals).
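
To make the idea concrete, here is a rough sketch of one way such topics can be fitted: a toy collapsed Gibbs sampler in plain Python. It is not the algorithm gensim uses (gensim's `LdaModel` is based on online variational Bayes), and the function name `lda_gibbs`, the hyperparameter values and the hand-picked tokens for the five sentences are all assumptions made for illustration. With such a tiny corpus the resulting topics are noisy and depend on the random seed.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                 # words of doc d in topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                               # all words in topic k
    assignments = []

    # Step 1: assign every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each assignment given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # weight of topic t is proportional to
                # (how much doc d uses topic t) * (how much topic t uses word w)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word

# Hand-picked content words from the five example sentences (an assumption).
docs = [
    ["eat", "broccoli", "bananas"],
    ["banana", "spinach", "smoothie", "breakfast"],
    ["chinchillas", "kittens", "cute"],
    ["sister", "adopted", "kitten"],
    ["cute", "hamster", "munching", "broccoli"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, sorted(counts, key=counts.get, reverse=True)[:4])
```

Each word ends up assigned to a topic that is both popular in its document and fond of that word, which is the intuition behind the percentages shown above.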



## How Does LDA Work?

A detailed understanding of LDA requires the Bayesian probabilistic approach. However, here is an intuitive explanation of how LDA operates:

Let's from the 

### Topic Modelling with Gensim

The [gensim](https://radimrehurek.com/gensim/) package in Python implements most of the common topic modelling algorithms.

* We'll walk through a basic application of Topic Modeling with LDA
* We'll also cover the basic NLP operations necessary for the application
    

In [1]:
# create sample documents
# We will use author data
import nltk

# read one article and split it into sentences; each sentence is one "document"
with open("../data/C50train/AaronPressman/2537newsML.txt") as f:
    text = f.read()
sents = nltk.sent_tokenize(text)

# compile documents
doc_complete = sents

### Let's fast-forward through pre-processing

* After processing a single document, we have a tokenized, stopped and stemmed list of its words
* Let’s fast forward, loop through all our documents and append each processed one to *texts*
* So *texts* is a list of lists, one list for each of our original documents

In [2]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from pprint import pprint

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = set(stopwords.words('english'))

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

Using TensorFlow backend.


In [3]:
   
# compile sample documents into a list
doc_set = sents

# list for tokenized documents in loop
texts = []

In [4]:
# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

print("\n##### texts")
print(texts)

print("\n##### The lines in texts")
for line in texts:
    print(line)


##### texts
[['break', 'u', u'justic', u'depart', 'world', 'wide', 'web', 'site', 'last', 'week', u'highlight', 'internet', u'continu', u'vulner', u'hacker'], [u'unidentifi', u'hacker', u'gain', u'access', u'depart', 'web', 'page', 'august', '16', u'replac', 'hate', u'fill', u'diatrib', u'label', u'depart', u'injustic', u'includ', 'swastika', u'pictur', 'adolf', 'hitler'], [u'justic', u'offici', u'quickli', u'pull', 'plug', u'vandalis', 'page', u'secur', u'flaw', u'allow', u'hacker', 'gain', u'entri', u'like', 'exist', u'thousand', u'corpor', u'govern', 'web', u'site', u'secur', u'expert', 'said'], ['vast', u'major', u'site', u'vulner', 'said', 'richard', 'power', 'senior', 'analyst', u'comput', u'secur', u'institut'], [u'justic', u'depart', u'singl'], [u'justic', u'depart', u'offici', 'said', u'compromis', 'web', 'site', u'connect', u'comput', u'contain', u'sensit', u'file'], ['web', 'site', 'http', 'www', 'usdoj', 'gov', u'includ', u'copi', u'press', u'releas', u'speech', u'publicli

## What's next?

* To generate an LDA model, we need to understand how frequently each term occurs within each document
* To do that, we need to construct a document-term matrix with a package called *gensim*

# Topic Modeling with gensim

## Getting started with gensim?




In [5]:
from gensim import corpora, models

dictionary = corpora.Dictionary(texts)
print(dictionary)

Dictionary(175 unique tokens: [u'theme', u'identifi', u'offici', u'month', u'report']...)


* The Dictionary() function traverses texts, assigning a unique integer id to each unique token while also collecting word counts and relevant statistics
* To see each token’s unique integer id, try -

In [6]:
print(dictionary.token2id)

{u'theme': 129, u'identifi': 167, u'offici': 40, u'month': 117, u'report': 97, u'simpli': 139, u'hate': 28, u'yet': 140, u'islam': 88, u'web': 4, u'adolf': 29, u'sensit': 58, u'manufactur': 118, u'offic': 95, u'breach': 79, u'vulner': 13, u'swastika': 27, u'take': 137, u'non': 130, u'march': 83, u'variou': 112, u'wide': 1, u'financ': 159, u'kind': 136, u'nation': 85, u'break': 9, u'mention': 154, u'govern': 45, u'press': 64, u'world': 3, u'password': 150, u'vast': 55, u'measur': 135, u'gov': 72, u'like': 36, u'corpor': 41, u'elgan': 107, u'magazin': 101, u'editor': 109, u'contain': 61, u'found': 100, u'page': 21, u'prevent': 172, u'www': 62, u'replac': 19, u'continu': 8, u'past': 80, u'growth': 125, u'connect': 59, u'year': 86, u'close': 165, u'happen': 141, u'beyond': 144, u'hacker': 14, u'said': 35, u'expert': 37, u'spokesman': 75, u'abl': 115, u'bad': 138, u'label': 22, u'access': 23, u'use': 143, u'internet': 11, u'injustic': 15, u'insecur': 105, u'common': 131, u'specialist': 113,

Next, each document must be converted into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation using our dictionary -

In [7]:
corpus = [dictionary.doc2bow(text) for text in texts]
for line in corpus:
    print(line)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
[(2, 2), (4, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)]
[(4, 1), (5, 1), (7, 1), (14, 1), (21, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 2), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1)]
[(7, 1), (13, 1), (35, 1), (38, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1)]
[(2, 1), (5, 1), (56, 1)]
[(2, 1), (4, 1), (5, 1), (7, 1), (35, 1), (40, 1), (52, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1)]
[(4, 1), (7, 1), (25, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1)]
[(35, 1), (36, 1), (38, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1)]
[(80, 1), (81, 1), (82, 1)]
[(4, 1), (7, 1),

* The doc2bow() function converts a tokenized document into a bag-of-words, using the dictionary to map each token to its id
* The result, *corpus*, is a list of vectors, one per document
* Each document vector is a series of tuples
* The tuples are (term ID, term frequency) pairs
* Only terms that actually occur are included - terms that do not occur in a document will not appear in that document’s vector
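
The bullets above can be mimicked in plain Python. This is a simplified sketch, not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and the ids gensim assigns to the same tokens may differ.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

toy_texts = [["web", "site", "hacker", "web"], ["hacker", "secur", "site"]]
token2id = build_token2id(toy_texts)
toy_corpus = [to_bow(text, token2id) for text in toy_texts]

print(token2id)    # {'web': 0, 'site': 1, 'hacker': 2, 'secur': 3}
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]

# A token that was never seen while building the mapping is simply dropped.
print(to_bow(["hacker", "broccoli"], token2id))  # [(2, 1)]
```

The last line shows why a document made up mostly of unseen words carries very little information for the model.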

## Creating the LDA Model

*corpus* is a (sparse) document-term matrix and now we’re ready to generate an LDA model

In [58]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)

## Parameters to the LDA model

https://radimrehurek.com/gensim/models/ldamodel.html
* num_topics
    - optional in the API (gensim defaults to 100), but in practice you should choose it yourself
    - An LDA model requires the user to determine how many topics should be generated
    - Our document set is small, so we’re only asking for three topics
* id2word
    - optional in the API, but needed to get readable topics
    - The LdaModel class uses our previous dictionary to map ids to strings
* passes
    - optional (defaults to 1)
    - The number of laps the model will take through corpus
    - More passes generally give the model more chance to converge, with diminishing returns
    - A lot of passes can be slow on a very large corpus.

In [59]:
print(ldamodel)

LdaModel(num_terms=175, num_topics=3, decay=0.5, chunksize=2000)


In [60]:
print(ldamodel.print_topics())

[(0, u'0.033*"secur" + 0.033*"site" + 0.023*"web" + 0.023*"hacker" + 0.023*"depart" + 0.017*"comput" + 0.017*"said" + 0.017*"justic" + 0.012*"offici" + 0.012*"gain"'), (1, u'0.033*"said" + 0.033*"hacker" + 0.033*"site" + 0.025*"internet" + 0.025*"web" + 0.025*"secur" + 0.018*"flaw" + 0.018*"window" + 0.018*"magazin" + 0.010*"vulner"'), (2, u'0.020*"magazin" + 0.020*"site" + 0.020*"take" + 0.020*"web" + 0.020*"said" + 0.011*"fidel" + 0.011*"hole" + 0.011*"institut" + 0.011*"spokeswoman" + 0.011*"access"')]


In [61]:
for topic in ldamodel.print_topics(num_topics=2):
    print(topic)

(2, u'0.020*"magazin" + 0.020*"site" + 0.020*"take" + 0.020*"web" + 0.020*"said" + 0.011*"fidel" + 0.011*"hole" + 0.011*"institut" + 0.011*"spokeswoman" + 0.011*"access"')
(0, u'0.033*"secur" + 0.033*"site" + 0.023*"web" + 0.023*"hacker" + 0.023*"depart" + 0.017*"comput" + 0.017*"said" + 0.017*"justic" + 0.012*"offici" + 0.012*"gain"')


In [62]:
for topic in ldamodel.print_topics(num_topics=3, num_words=3):
    print(topic)

(0, u'0.033*"secur" + 0.033*"site" + 0.023*"web"')
(1, u'0.033*"said" + 0.033*"hacker" + 0.033*"site"')
(2, u'0.020*"magazin" + 0.020*"site" + 0.020*"take"')


* Within each topic are the three most probable words to appear in that topic

## Topics in detail
Let's now look at a topic in detail. Let us see how distinct the topics are, and if they seem to capture any context.

In [63]:
print(ldamodel.print_topic(topicno=0))

0.033*"secur" + 0.033*"site" + 0.023*"web" + 0.023*"hacker" + 0.023*"depart" + 0.017*"comput" + 0.017*"said" + 0.017*"justic" + 0.012*"offici" + 0.012*"gain"


In [64]:
print(ldamodel.print_topic(topicno=1))

0.033*"said" + 0.033*"hacker" + 0.033*"site" + 0.025*"internet" + 0.025*"web" + 0.025*"secur" + 0.018*"flaw" + 0.018*"window" + 0.018*"magazin" + 0.010*"vulner"


In [65]:
print(ldamodel.print_topic(topicno=2))

0.020*"magazin" + 0.020*"site" + 0.020*"take" + 0.020*"web" + 0.020*"said" + 0.011*"fidel" + 0.011*"hole" + 0.011*"institut" + 0.011*"spokeswoman" + 0.011*"access"


## Do the topics make sense?

In [66]:
for topic in ldamodel.print_topics(num_topics=3, num_words=3):
    print(topic)

(0, u'0.033*"secur" + 0.033*"site" + 0.023*"web"')
(1, u'0.033*"said" + 0.033*"hacker" + 0.033*"site"')
(2, u'0.020*"magazin" + 0.020*"site" + 0.020*"take"')


## Refining the model

Two topics seem like a better fit for our documents!

In [67]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)

for topic in ldamodel.print_topics(num_topics=2, num_words=4):
    print(topic)

(0, u'0.037*"secur" + 0.032*"site" + 0.018*"magazin" + 0.017*"said"')
(1, u'0.028*"web" + 0.028*"said" + 0.027*"hacker" + 0.027*"site"')


Let's try it with more passes:

In [68]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=200)

for topic in ldamodel.print_topics(num_topics=2, num_words=4):
    print(topic)

(0, u'0.023*"site" + 0.023*"web" + 0.017*"take" + 0.010*"hole"')
(1, u'0.033*"secur" + 0.033*"site" + 0.030*"said" + 0.029*"hacker"')


## Predicting Topic for new documents

In [None]:
doc_f = "Are Health professionals justified in saying that broccoli is good for your health?"

new_docs = [doc_f]

# list for tokenized new documents
new_texts = []

# loop through the new documents, applying the same pre-processing as for training
for doc in new_docs:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    new_texts.append(stemmed_tokens)

# reuse the dictionary built from the training documents: the model only knows
# those term ids, so building a new dictionary here would produce mismatched ids.
# doc2bow silently drops tokens that are not in the dictionary.
new_corpus = [dictionary.doc2bow(text) for text in new_texts]

# list of (topic id, probability) pairs for the new document
infer = ldamodel[new_corpus[0]]
print(infer)

# https://radimrehurek.com/gensim/wiki.html

In [70]:
# Let's check the default LDA parameters
print(ldamodel)

LdaModel(num_terms=175, num_topics=2, decay=0.5, chunksize=2000)


## Deep Tree Parsing

One of the advanced topics in NLP is the syntactic analysis of text, in which we try to analyze and understand the structure of a sentence. In the NLP world this process is called deep tree parsing, and it lets us analyze relationships within the text.
- Text parsing is important when you want to know the relationships in a text. For example, in <i>Delhi is capital of India</i>, Delhi and India are related through the relationship <b>is capital of</b>.

In [46]:
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)
sent = "the dog saw a man in the park"
tokens=nltk.word_tokenize(sent)
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(tokens):
    print(tree)


(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man) (PP (P in) (NP (Det the) (N park))))))
(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man))
    (PP (P in) (NP (Det the) (N park)))))


![alt text](../images/deep_parsing.png "Title")

Here as well we have to define our own grammar, which is quite a tedious job. But there are other NLP packages, such as Stanford CoreNLP, which provide functions to generate a parse tree from unstructured text without defining any grammar.
- A parse tree gives us meaningful relations between words and also the kind of relation they share. These are also called facts.
- Tree parsing is used to build knowledge bases from unstructured corpora. Check DBpedia.
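
As a rough illustration of how a fact can be read off a parse tree, the sketch below encodes the second parse shown above as nested tuples and pulls out a (subject, relation, object) triple. The tuple encoding and the helpers `leaves`, `child` and `extract_fact` are assumptions made for this example; they only handle the simple `S -> NP VP` shape of the toy grammar.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
tree = (
    "S",
    ("NP", ("Det", "the"), ("N", "dog")),
    ("VP",
        ("V", "saw"),
        ("NP", ("Det", "a"), ("N", "man")),
        ("PP", ("P", "in"), ("NP", ("Det", "the"), ("N", "park")))),
)

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for sub in node[1:]:
        words.extend(leaves(sub))
    return words

def child(node, label):
    """Return the first direct child with the given label, or None."""
    for sub in node[1:]:
        if not isinstance(sub, str) and sub[0] == label:
            return sub
    return None

def extract_fact(sentence_tree):
    """Read a (subject, relation, object) triple off an S -> NP VP tree."""
    subject = child(sentence_tree, "NP")
    verb_phrase = child(sentence_tree, "VP")
    verb = child(verb_phrase, "V")
    obj = child(verb_phrase, "NP")
    return (" ".join(leaves(subject)), " ".join(leaves(verb)), " ".join(leaves(obj)))

print(extract_fact(tree))  # ('the dog', 'saw', 'a man')
```

Real sentences need far more cases than this, which is one reason to rely on a trained parser rather than hand-written rules.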

### POS Tagging

Part-of-speech (POS) tags are labels for grammatical categories (nouns, verbs, adverbs, adjectives). POS tagging classifies tokens into their parts of speech and labels them according to a tagset, which is the collection of tags used for tagging. Parts of speech are also known as word classes or lexical categories. Here is the definition from Wikipedia:
    
<i>In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.</i>

In [15]:
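# tokenize the sentence first; pos_tag expects a list of tokens
tokens = nltk.word_tokenize("The Justice Department shouldn't be singled out.")
nltk.pos_tag(tokens)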

[('The', 'DT'),
 ('Justice', 'NNP'),
 ('Department', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('be', 'VB'),
 ('singled', 'VBN'),
 ('out', 'RP'),
 ('.', '.')]

These DT, NNP, MD etc. are POS tags taken from the standard Penn Treebank tagset. The full list can be found here:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

POS tagging is one of the basic and most important components of NLP: NLP relies heavily on linguistics, i.e. the way a language is written, and grammar is an important part of that. POS tagging is widely regarded as a largely solved problem, and it works well when the text is well-formed.

POS tagging is also a supervised learning problem; taggers use features such as the previous word, the next word, whether the first letter is capitalized, etc.
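
The kind of features mentioned above could be computed as in the following sketch. The function name `token_features` and the `<START>` / `<END>` placeholders are hypothetical; this is not how NLTK's tagger is implemented, only an illustration of the inputs a supervised tagger might see.

```python
def token_features(tokens, i):
    """Describe the token at position i with simple contextual features."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

sample_tokens = ["The", "Justice", "Department", "should", "be", "singled", "out"]
print(token_features(sample_tokens, 1))
# {'word': 'justice', 'is_capitalized': True, 'prev_word': 'the', 'next_word': 'department'}
```

One such dictionary per token, paired with the correct tag, forms a training example for a classifier.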

NLTK has a function to get POS tags, and it is applied after tokenization.

In our Author Identification problem, we can create multiple features using POS tagging.
1. Number of nouns, verbs, adjectives etc.
2. How many times a sentence starts with an adverb, meaning words like Basically, Typically etc.
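
A minimal sketch of these two features, assuming the input is a list of sentences already tagged with Penn Treebank tags (for example by `nltk.pos_tag`). The helper name `pos_features` and the hand-tagged sample sentences are made up for illustration.

```python
from collections import Counter

def pos_features(tagged_sentences):
    """Count coarse POS classes and sentences that start with an adverb.

    tagged_sentences is a list of sentences, each a list of (word, tag) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for _, tag in sentence:
            if tag.startswith("NN"):
                counts["nouns"] += 1
            elif tag.startswith("VB"):
                counts["verbs"] += 1
            elif tag.startswith("JJ"):
                counts["adjectives"] += 1
            elif tag.startswith("RB"):
                counts["adverbs"] += 1
        if sentence and sentence[0][1].startswith("RB"):
            counts["adverb_initial_sentences"] += 1
    return dict(counts)

# Hand-tagged sample (an assumption, not tagger output).
sample_sentences = [
    [("Basically", "RB"), ("hackers", "NNS"), ("gained", "VBD"), ("access", "NN")],
    [("The", "DT"), ("site", "NN"), ("was", "VBD"), ("insecure", "JJ")],
]
print(pos_features(sample_sentences))
```

In practice the counts would usually be normalized by document length so that long and short articles are comparable.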
    

In [23]:
# We can get more details about any POS tag using the help function of NLTK as follows.
nltk.help.upenn_tagset("RB")

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...


### Chunking

Chunking is the process of extracting phrases (aka chunks) from unstructured text. Simple tokens may not represent the actual meaning of the text, so it is advisable to treat a phrase such as "New Delhi" as a single unit instead of the separate words New and Delhi.

- Chunking is done using linguistic rules (language grammar rules), such as: when two proper nouns occur together, merge them into a single unit. For example, "South Africa".
- Chunking works on top of POS tagging: it takes POS tags as input and provides chunks as output.
- Similar to POS tags, there is a standard set of chunk tags such as Noun Phrase (NP), Verb Phrase (VP) etc.
- Many data scientists use n-grams instead of a chunker, but n-grams end up creating lots of meaningless terms.
- Chunking is very important when you want to extract information such as locations, person names etc. from text. In NLP this is called Named Entity Extraction.
- In Author Identification, we can have features like how many named entities the author uses in a sentence.
- Another feature is which countries/continents the author mostly refers to in their articles.
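
The proper-noun rule from the first bullet can be sketched without any library, assuming the tokens are already POS-tagged. The helper `merge_proper_nouns` and the hand-tagged sample are hypothetical and only cover this single rule.

```python
def merge_proper_nouns(tagged):
    """Join runs of consecutive NNP/NNPS tokens into single chunks."""
    chunks, run = [], []
    for word, tag in tagged:
        if tag in ("NNP", "NNPS"):
            run.append(word)
        else:
            if run:
                chunks.append((" ".join(run), "NNP"))
                run = []
            chunks.append((word, tag))
    if run:  # flush a run that ends the sentence
        chunks.append((" ".join(run), "NNP"))
    return chunks

sample_tagged = [("He", "PRP"), ("visited", "VBD"), ("South", "NNP"), ("Africa", "NNP"),
                 ("and", "CC"), ("New", "NNP"), ("Delhi", "NNP")]
print(merge_proper_nouns(sample_tagged))
# [('He', 'PRP'), ('visited', 'VBD'), ('South Africa', 'NNP'), ('and', 'CC'), ('New Delhi', 'NNP')]
```

Unlike bigrams, this produces "South Africa" and "New Delhi" without also emitting pairs such as "visited South" or "and New".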

There are a lot of libraries which give phrases out of the box, such as spaCy or TextBlob. NLTK just provides a mechanism to generate chunks using regular expressions.


In [40]:
# Define your grammar using regular expressions over POS tags
# For example, an optional determiner (the/an/a) followed by adjectives and nouns forms a noun phrase, such as "a greedy dog"
grammar = ('''
    NP: {<DT>? <JJ>* <NN>*} # NP
    P: {<IN>}           # Preposition
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V.*> <PP|RB|V.*>*}  # VP -> V (PP|RB|V)*
    ''')
line = "Unidentified hackers gained access to the department's web page on August 16 and replaced it with a hate-filled diatribe labelled the Department of Injustice that included a swastika and a picture of Adolf Hitler."
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(line))
tree = chunkParser.parse(tagged)
# prints the full tree first, then every chunk (subtree) it contains
for subtree in tree.subtrees():
    print(subtree)


(S
  (NP Unidentified/JJ)
  hackers/NNS
  (VP gained/VBD)
  (NP access/NN)
  to/TO
  (NP the/DT department/NN)
  's/POS
  (NP web/JJ page/NN)
  (P on/IN)
  August/NNP
  16/CD
  and/CC
  (VP replaced/VBD)
  it/PRP
  (PP (P with/IN) (NP a/DT hate-filled/JJ diatribe/NN))
  (VP labelled/VBD)
  (NP the/DT)
  Department/NNP
  (P of/IN)
  Injustice/NNP
  that/WDT
  (VP included/VBD)
  (NP a/DT swastika/NN)
  and/CC
  (NP a/DT picture/NN)
  (P of/IN)
  Adolf/NNP
  Hitler/NNP
  ./.)
(NP Unidentified/JJ)
(VP gained/VBD)
(NP access/NN)
(NP the/DT department/NN)
(NP web/JJ page/NN)
(P on/IN)
(VP replaced/VBD)
(PP (P with/IN) (NP a/DT hate-filled/JJ diatribe/NN))
(P with/IN)
(NP a/DT hate-filled/JJ diatribe/NN)
(VP labelled/VBD)
(NP the/DT)
(P of/IN)
(VP included/VBD)
(NP a/DT swastika/NN)
(NP a/DT picture/NN)
(P of/IN)
