# Information Extraction and Named Entity Recognition

## Overview
In this tutorial you will learn how to extract information from unstructured text. This is an important step that lets us make sense of the raw data we usually get in the real world.
This tutorial covers the following:
* Information Extraction and its main techniques
* Regular Expressions and how to use them in Python
* Named Entity Recognition
* Conditional Random Fields (a popular algorithm used in many NER systems)

## Learning Objectives

After completing this tutorial, you will be able to do the following:
    * Gain hands-on knowledge on using Regular Expressions to extract intended data from unstructured text.
    * Know how to extract Named Entities from raw text
    * Use regex in Python
    * Explore a popular algorithm behind NER systems.
    


# Information Extraction

### 1.1 Why do we need IE?


This tutorial aims to provide an understanding of __Information Extraction (IE)__, a field closely related to __Information Retrieval (IR)__: the whats, the hows and the whys. IE is the study of techniques used to extract the necessary information from raw data. It is a vast field and one of the most important areas of study in data science, both theoretically and practically, and it has been an active area of research for decades. Cleaning data and extracting the features needed from large amounts of unstructured data is also commonly estimated to take up 50-60% of the time a data scientist spends on a real project.


Typically, in real-life scenarios, applications generate a lot of data. Much of this data may or may not be needed by data science algorithms. IE forms a bridge between the unstructured raw data and the structured, formatted numerical data that machine learning algorithms need to perform their task. The flow diagram below gives a high-level view of where IE systems fit into the big picture.

<img src="../images/ie_flowdiag.png"/>

### Impact of IE/IR

* With advancements in IE systems and techniques, we can process data at a larger scale and with greater accuracy.

* Many big data technologies arose from the need to process vast amounts of data before or after feeding it into the actual business logic.

* The accuracy of machine learning algorithms has increased because of the increased accuracy of the feature sets acquired from raw text.

### Techniques / Approaches

Following are the major techniques used to extract information from text:-

* __Regular Expressions__ : Simple patterns used to find and extract matching information in the text.
* __Classification Algorithms__ : Classification algorithms are also used to extract a subset of text of interest.
* __Deep Learning Systems__ : Recurrent Neural Networks (RNN) are used to extract information of interest from raw text, and do so with impressive accuracy.
* __Probabilistic Models__ : Probabilistic models like Conditional Random Fields (CRF) are popular for their efficiency in capturing sequential information.


We are going to take up Regular Expressions and study them in detail.

### What is a Regular Expression?

A regular expression is a sequence of one or more characters that defines a pattern. Usually this pattern represents a string; intuitively, it is an abstract representation of a string. Regular expressions are a very powerful tool which can be used to search for a string, find all the occurrences of a string in a text, or extract information from a piece of text.
Following is a very basic regular expression (in fact, the most basic kind):-

Regular Expression -> sh

The above regular expression says: find all the occurrences in a text of 's' immediately followed by 'h'. Let's see an example in the text below.

Text -> She sells sea <mark>sh</mark>ells on the sea <mark>sh</mark>ore.

**Note:** The first candidate for a match is the Sh in She, but it is ignored because regexes (short for regular expressions) are case sensitive. The regex 'sh' therefore selects only the lower-case occurrences of the string sh.
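
The case-sensitivity point can be checked with a short Python sketch. It uses `re.findall` from Python's standard `re` module (introduced later in this tutorial); the variable names are only illustrative.

```python
import re

text = "She sells sea shells on the sea shore."

# Case-sensitive search: the capitalised "Sh" in "She" is not matched
lowercase_matches = re.findall(r"sh", text)
print(lowercase_matches)

# With the IGNORECASE flag, the "Sh" in "She" is matched as well
all_matches = re.findall(r"sh", text, flags=re.IGNORECASE)
print(all_matches)
```

The first call finds two matches and the second finds three, the extra one being the `Sh` at the start of the sentence.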

### Okay, I am getting a hang of it now. Tell me more...

Regular expressions are broadly made up of characters with a special meaning, called metacharacters, and conditions specifying their repetition, called multipliers (more commonly known as quantifiers). Let's try and understand them in a bit more detail.
*  __Metacharacter__ - These are characters with a special meaning in regular expressions, which outline the selection criterion for the string.

We will see some of the metacharacters used in regex below.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __1. .(dot)__ - The dot is a metacharacter which represents any character (much like the joker in a game of cards), except a newline by default. The important thing to note here is that the dot represents only a **single** character.

For Example:

Sentence -> She sells sea-shells on the sea-shore. 

>In the sentence, there are 38 matches to the regex .(dot), since each character (including the white-space) is a match.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __2. [] (Brackets)__ - The brackets specify a set or range of characters to be matched. They narrow down the dot(.), which selects any character, to a predefined set of characters.
>Sample Usage-:
[a-e], select any occurrence of a,b,c,d,e <br/>
[1-8], select any occurrence of 1,2,3,4,5,6,7,8<br/>
For Example:
<br/>
Regex -> [1-9] will select a single digit in the sentence
<br/>
Sentence -> A leap year comes after <mark>4</mark> years.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __3. ^ (Negation)__ - When placed as the first character inside brackets, the caret negates the set: it lets us find all the characters which do not belong to the given set.

>For Example:
<br/>
Regex -> [^1-9] will select all the characters except the digits in the range 1-9.
<br/>
Sentence -> <mark> A leap year comes after</mark> 4 <mark> years</mark>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __4. \ (Escape Character)__ - The escape character lets us escape the metacharacters of regular expressions, so that they are matched literally.

>Regex -> \..* will select the '.'(dot) and all the characters after it. Note that we escaped the '.'(dot) so that the regex engine treats it as a literal dot in the sentence rather than as a metacharacter.
<br/>
Sentence -> Hi<mark>. How are you ?</mark>
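
The four metacharacters above can be tried out with a minimal Python sketch. It reuses the example sentences from this section and the standard `re` module; the variable names are illustrative only.

```python
import re

sentence = "She sells sea-shells on the sea-shore."
leap = "A leap year comes after 4 years."

# . matches every single character, including spaces and punctuation
dot_matches = re.findall(r".", sentence)
print(len(dot_matches))

# [1-9] matches one digit from 1 to 9
digits = re.findall(r"[1-9]", leap)
print(digits)

# [^1-9] matches every character that is NOT a digit from 1 to 9
non_digits = "".join(re.findall(r"[^1-9]", leap))
print(non_digits)

# \. matches a literal dot; .* then takes the rest of the line
after_dot = re.search(r"\..*", "Hi. How are you ?").group()
print(after_dot)
```

Note that `[^1-9]` also matches the spaces around the digit, so only the `4` itself is dropped from the joined result.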



* __Multipliers__ - These specify how many times the preceding expression should be repeated. When used in conjunction with an expression, a multiplier indicates the allowed number of repetitions of that expression.
Following are the multipliers in regular expressions.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __1. * (asterisk)__ - The expression occurs zero or more times. Example:

>&nbsp;&nbsp;&nbsp;&nbsp;Taking the above example, the regex \..* uses the '*' multiplier to select any number of characters that come after the dot.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __2. + (plus)__ - The expression occurs one or more times.

> &nbsp;&nbsp;&nbsp;&nbsp;
Continuing the previous example, if we modify the regex to '\..+', it means: select the dot and one or more 
characters after it. Note that there will be no match if there are no characters after the dot. For example,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;
Sentence -> Hi.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;
Regex -> '\..*' : There will be a match. The dot will be selected, since the regex does not require any character to be 
&nbsp;&nbsp;&nbsp;&nbsp;
present after the dot.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;
Regex -> '\..+' : There will be no match, since there are no characters after the dot.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __3. ? (question mark)__ - Expression occurs **zero or 1 time**

>Regex- s.?a<br/>
Sentence-> She sells <mark>sea</mark> shells on the <mark>sea</mark> shore

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __4. {3}__ - Expression occurs exactly **three** times

>Regex- s.{3}ls<br/>
Sentence-> She sells sea <mark>shells</mark> on the sea shore

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __5. {3,5}__ - Expression occurs between 3 and 5 times. 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __6. {3,}__ - Expression occurs **at least** three times. This means that the expression must be repeated a minimum of 3 times, with no cap on the maximum number of repetitions.

>Regex- (sea){2,}<br/>
Sentence - She sells sea shells on the <mark>seasea</mark> shore
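
The multiplier examples above can be reproduced with the following sketch, which again relies only on the standard `re` module. The variable names are illustrative. `re.search(...).group()` is used for the `(sea){2,}` case because `re.findall` would return only the contents of the parenthesised group.

```python
import re

sentence = "She sells sea shells on the sea shore"

# * allows zero characters after the dot, + requires at least one
star = re.search(r"\..*", "Hi.")
plus = re.search(r"\..+", "Hi.")
print(star.group(), plus)

# ? makes the preceding expression optional (zero or one time)
optional = re.findall(r"s.?a", sentence)
print(optional)

# {3} requires exactly three repetitions of the preceding expression
exactly_three = re.findall(r"s.{3}ls", sentence)
print(exactly_three)

# {2,} requires at least two repetitions of the group (sea)
repeated = re.search(r"(sea){2,}", "She sells sea shells on the seasea shore").group()
print(repeated)
```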

#### Some simple Regular Expressions

Let us take a look at some simple regular expressions and try to understand how they really work.

* Regex- **a.p**

    The above regular expression will try to find three characters: a, followed by any single character, followed by p

    Text- The <mark>app</mark>le of my eye
    
* Regex **a.p.***

    The above regex will try to find three or more characters. It will do the following:
        * Find a sequence that starts with a
        * Then accept any character (the dot)
        * Then match p
        * Then match 0 or more instances of any character.
        
    Let us apply this regex in the above example
    
    Text - The <mark>apple of my eye</mark>
    
* Let's up things a bit. Now we'll try to find an email address in a piece of text (did someone say data scraping?).

   Regex - **[a-z]+@.+\..{2,10}**
   
   Here is what the above regex means:-
       * Find one or more characters in the range a-z, until you get '@'
       * Then find one or more characters, until you find '.'
       * Then find between 2 and 10 characters after the '.' (assuming the longest tld available is .technology which is 10 characters)
        
    Text -> Please write to us at <mark>info@grayatom.com</mark>
    
    The above regex is kept simple, and there are a lot of cases that it misses. For example, since regexes are case sensitive, it will miss capitalized letters. We also encounter numbers in emails, which this regex would miss as well. We leave it to the student, as an exercise, to revise the regex to cover all the cases.
    
### 1.2 Regular Expression in Python

Regular Expressions in Python are supported by the 're' module, which is part of the standard library. This module contains all the functionality related to regular expressions. Let us dive into code and try to understand how to use regular expressions to find the desired information.

```python

# Import the re package
import re

# Let us use the same text as mentioned in the example above
text = "The apple of my eye"

# The following snippet uses the search function of the re module. It takes two parameters:
# 1. The regular expression to search for
# 2. The text in which the pattern is to be searched
matched = re.search('a.p',text)

# If the pattern is found, re.search returns a match object (which is truthy);
# otherwise it returns None.

if matched:
    print("Pattern Found")
```
#### Output

```
Pattern Found

```
#### RE package in more detail

The re module mainly provides 3 operations based on regular expressions. 

1. Match (`re.match`) - Checks for the pattern only at the start of the string.
2. Search (`re.search`) - Checks for the pattern anywhere in the string.
3. Search and Replace (`re.sub`) - As the name suggests, it searches for a given pattern and replaces all the occurrences with the string provided. It additionally takes a `count` parameter, which specifies the maximum number of occurrences to replace.

The overall structure of the match and search functions looks like this.

|pattern|string|flags|
|-------|------|-----|
|Regex pattern|String to search|Modifier flags|
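
In Python's `re` module these three operations correspond to `re.match`, `re.search` and `re.sub`. The sketch below contrasts them on the sample sentence; the replacement word `lake` and the variable names are made up for illustration.

```python
import re

text = "She sells sea shells on the sea shore."

# match only looks at the start of the string, so 'sea' is not found
at_start = re.match(r"sea", text)
print(at_start)

# search scans the whole string and reports the first occurrence
first = re.search(r"sea", text)
print(first.span())

# sub replaces every occurrence, or at most `count` occurrences
replaced_all = re.sub(r"sea", "lake", text)
replaced_one = re.sub(r"sea", "lake", text, count=1)
print(replaced_all)
print(replaced_one)
```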

__Group Functions:__

Group functions, when used on the result of match / search, return the matched text: `group()` returns the entire match, `group(n)` returns the n-th parenthesised group, and `groups()` returns all the parenthesised groups as a tuple.
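
As a small illustration of the group functions, the sketch below wraps parts of a simplified email pattern in parentheses. This pattern is an assumption made for the example and is not the regex used elsewhere in this tutorial.

```python
import re

text = "Please write to us at info@grayatom.com"

# Three parenthesised groups: user name, domain name, top-level domain
m = re.search(r"([a-z]+)@([a-z]+)\.([a-z]+)", text)

whole = m.group()     # the entire match (same as m.group(0))
user = m.group(1)     # only the first parenthesised group
parts = m.groups()    # all parenthesised groups as a tuple
print(whole, user, parts)
```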

__Modifier Flags:__

Following are some of the modifier flags. The reader is encouraged to read about all of them.

|Flag|Description|
|----|-----------|
|I/IGNORECASE|Perform case-insensitive matching|
|L/LOCALE|Interpret words as per the current locale|
|M/MULTILINE|When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline)|
|S/DOTALL|Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).|
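
The effect of the `MULTILINE` and `DOTALL` flags can be seen on a two-line string. The sample text and variable names in this sketch are only illustrative.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the very start of the string
default_starts = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline
multiline_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(default_starts, multiline_starts)

# Without DOTALL, '.' stops at the newline, so nothing spans both lines
no_dotall = re.search(r"first.*second", text)
# With DOTALL, '.' matches the newline too
with_dotall = re.search(r"first.*second", text, flags=re.DOTALL).group()
print(no_dotall, repr(with_dotall))
```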


Now let's dive into some more examples and see how we can use the various re functions.


```python
# Import re package
import re
# Sample text
text = "She sells sea-shells on the sea shore."

# Match 'sh' at the start of the text, ignoring case
match = re.match(r'sh',text,flags=re.IGNORECASE)
# Printing the match object itself shows its repr, which is not very readable:
# something like <re.Match object; span=(0, 2), match='Sh'>
# (older Python versions print <_sre.SRE_Match object; ...> instead)
print(match)

# Groups to the rescue: group() returns the matched text
match = re.match(r'sh',text,flags=re.IGNORECASE).group()

# Prints the matched expression
print(match)

```

#### Output


```python

<_sre.SRE_Match object; span=(0, 2), match='Sh'>
Sh

```


```python
# Now let us up things a bit. Let us take the email example we discussed earlier.

text = "Please write to us at info@greyatom.com"

# We use the search function since we expect the email to be anywhere and not just at the start.
# Note the escaped dot (\.) before the top-level domain, as in the regex discussed earlier.
match = re.search(r'[a-z]+@.+\..{2,10}',text).group(0)

# group(0) returns the entire matched string
print(match)

```
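
The limitations mentioned earlier can be observed directly. The sketch below runs the same simple regex on two made-up variations of the address (one with a digit, one with a capital letter); it deliberately does not show an improved regex, since that is left as the exercise below.

```python
import re

email_regex = r"[a-z]+@.+\..{2,10}"

# A digit right before '@' breaks the [a-z]+ part, so nothing is found
with_digit = re.search(email_regex, "Please write to us at info1@grayatom.com")
print(with_digit)

# The capital letter is skipped, so only part of the address is captured
with_caps = re.search(email_regex, "Please write to us at Info@grayatom.com")
print(with_caps.group())
```

Because `re.search` returns `None` when nothing matches, calling `.group()` directly on the first result would raise an `AttributeError`; checking the result before using it is the safer habit.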

## Task:

### Develop a fully functional Email Regex
The above email regex does not cover all the cases. The reader is expected to complete the following exercise and figure out a regex that covers the cases generally found in an email.
__Hint:__ The above regex will fail to collect the email info1@grayatom.com

#### Instructions
* Identify all possible use cases and write a regex that covers all the email types.
* Replace the empty regex with your regex. Please do not modify any other lines.
* Run the cell. If your regex is right, a success message will be printed. Otherwise, an error message will be printed and you can try again.






```python
# Please do not modify the below line
%run corrections.py

# Please type in the regex below
regex=""

#Please do not modify the line below
print(checkEmailRegex(regex))
```

#### Output

```
The regex is Incorrect. Please try again
```


# 2. Named Entity Recognition (NER)

### 2.1 What is Named Entity Recognition?

In the previous chapter, we learnt about Information Extraction. One important technique under the Information Extraction (IE) umbrella is Named Entity Recognition, popularly called NER. 

In this technique, we try to identify the named entities mentioned in an unstructured text. The named entities could be anything ranging from a person's name to a company's name like Apple Computers to city names like Mumbai. In simpler words, you can think of named entities as nouns (proper nouns, to be more specific). Names of people like Peter, of companies like Microsoft, of places like Manhattan are all named entities.
Commonly occurring Named Entities are:
1. Person's Name
2. Company / Organization Names
3. City / State / Country Name (Location)

### Applications

There are a bunch of areas in which NER systems are used. Some of the use cases could be:
1. In support cases, to identify the product in which a fault is mentioned.
2. In news, to identify the main subjects of a story.
3. To find relations between the various entities described in a document.

In short, NER systems are an integral part of NLP: they can provide insight into the data right away, and their practical applications in day-to-day NLP tasks are immense.


## How Does an NER System Work?

Most NER systems today use the following processes to identify the named entities in a sentence or any unstructured piece of text.
1. Tokenization
2. PoS Tagging
3. Classification

Let's get to know each of the processes in short. For the rest of the tutorial we will use the following piece of text for analysis. It's a summary of a popular Marvel movie released recently.

Captain Marvel is an extraterrestrial Kree warrior who finds herself caught in the middle of an intergalactic battle between her people and the Skrulls. Living on Earth in 1995, she keeps having recurring memories of another life as U.S. Air Force pilot Carol Danvers. With help from Nick Fury, Captain Marvel tries to uncover the secrets of her past while harnessing her special superpowers to end the war with the evil Skrulls.

``` python
text="Captain Marvel is an extraterrestrial Kree warrior who finds herself caught in the middle of an intergalactic battle between her people and the Skrulls. Living on Earth in 1995, she keeps having recurring memories of another life as U.S. Air Force pilot Carol Danvers. With help from Nick Fury, Captain Marvel tries to uncover the secrets of her past while harnessing her special superpowers to end the war with the evil Skrulls."

```

### 1. Tokenization

Tokenization is the process in which the sentence or text is broken down into the individual words that make it up. So the above text will be broken down into an array of words like ["Captain","Marvel","is","an","extraterrestrial"....].
Tokenization is the first step in preprocessing: it eliminates all the white space, prepares a word array of the entire text and makes it ready for further processing.
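
To make the idea concrete, here is a deliberately simplified tokenizer built on a regular expression. It is only a sketch for illustration and is not how NLTK or Spacy tokenize text; real tokenizers handle abbreviations such as "U.S.", contractions and many other cases that this pattern ignores.

```python
import re

sample = "Captain Marvel is an extraterrestrial Kree warrior."

# \w+ grabs runs of word characters; [^\w\s] grabs single punctuation marks.
# White space is never matched, so it is dropped from the result.
tokens = re.findall(r"\w+|[^\w\s]", sample)
print(tokens)
```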

### 2. PoS Tagging

PoS or Part of Speech tagging is the process in which each of the words tokenized in step 1 is tagged with a part of speech like Noun, Verb, Adjective, Pronoun, Conjunction etc. A sample of PoS tagged words can be seen in the figure below. 
To decipher the abbreviations in the picture (or in further discussions), please consult the PoS tag list at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

<img src="../images/postagged.png"/>

### 3. Classification

After PoS tagging, we take the output and run it through a classifier to identify the chunks as
a Person, Organization, Location etc. 
Over the years various algorithms have been used for this, from rule/grammar based approaches in the early days to sophisticated Deep Learning based algorithms today. 
Two popular algorithms in use today are:
1. Conditional Random Fields (Mostly in use today. Readers will learn more about this in the following section of the tutorial.)
2. RNN based deep neural networks (Gaining a lot of popularity)




### 2.2 Named Entity Recognition in Practice Today

A lot of work has been done in the area of Entity Recognition and there are some popular tools available today. Some of them are:
1. NLTK (Available in Python)
2. Spacy (Python)
3. Stanford NER (Java, also available in Python as an external package)

We will take a look at some code samples from NLTK and Spacy.

### NER using NLTK

As mentioned earlier, any NER system requires tokenizing, PoS tagging and finally identifying the named entities.
Let's try to find the Named Entities in our sample text.

``` python

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text_tokens = nltk.word_tokenize(text)
#    The above statement breaks the text sample into tokens of independent words. For example, the first line
#    of the text would be broken down into something like
#        "Captain","Marvel","is","an","extraterrestrial"........        
    
pos_tags = nltk.pos_tag(text_tokens)

#    The above code will now tag the tokenized words into various Parts of Speech. An example of it can be seen in the 
#    following image

```

<img src="../images/postagged.png"/>

```python

"""    
    We now use NER Classifier to Identify the named entities in the piece of unstructured text. Following is a sample
    snippet to do so. Running the below snippet will produce the classified Named entity chunks (this is done 
    using an inbuilt classifier) which classifies the tagged text into Person, Organization etc. 
    The following figure shows this in more detail.
"""
ne_chunks = nltk.ne_chunk(pos_tags)
for name in ne_chunks:
    if hasattr(name, 'label'):
        print(name.label(), ' - '.join(c[0] for c in name.leaves()))
```
<img src="../images/nechunk.png"/>

As you can see from the picture above, Marvel is recognized as a Person, Skrulls is recognized as an Organization and Carol is recognized as a Person.

### NER using Spacy

Now let's quickly take a look at the NER output of another NER system which has been gaining a lot of popularity
lately, Spacy. Following is a sample snippet.

```python
import spacy
#The following line loads the English pipeline needed by Spacy, viz. the tokenizer, tagger, parser and NER
model = spacy.load('en_core_web_sm')

#Pass the above text to Spacy's model. This will tokenize the words, PoS tag them and identify the named
#entities automatically.

parsed = model(text)

#Now, to find named entities
for entity in parsed.ents:
    print(entity.text, entity.label_)
    
```

The following figure contains the output of the program when run:

<img src="../images/spacy.png"/>

__Let's see some of the insights that we can draw from our sample text:__

```python
# Gain more insight into the relationship of the words with each other. Following is a snippet to show the syntactic
# dependencies
from spacy import displacy
displacy.serve(parsed, style='dep')

#The output of the above is attached as an image for reference. The image is pretty wide, so it has been cropped 
#to fit here

```

<img src="../images/dep_tree.png"/>

Now let's also try to plot a bar graph depicting the number of Entities which were found and the Type of those entities (like Person, Organization etc).

```python
import matplotlib.pyplot as plt
from collections import Counter

# Count how many entities of each type (Person, Organization etc.) were found
counter = Counter(entity.label_ for entity in parsed.ents)

plt.xlabel('Entity Types', fontsize=5)
plt.ylabel('Num Occurrences', fontsize=5)
# Convert the dictionary views to lists so that the call also works on older matplotlib versions
plt.bar(list(counter.keys()), list(counter.values()))
plt.show()

```

#### Output

<img src="../images/barr.png"/>

## Task:
### Find out Named Entities in a sample text
This is an exercise in which the reader is expected to find the named entities in a sample text. We will be using both NLTK and Spacy NER in the exercise.

#### Instructions
Please follow the instructions given below to complete the exercise.

    1. Load the NER library
    2. In case of NLTK:
        a) Tokenize the text.
        b) Find out the POS tags.
        c) Use the POS tags to find out the Named Entities in the text.
        
    3. In case of Spacy:
        a) Load the English model.
        b) Pass the text to the loaded model to get the Named Entities.

```python
#Import NLTK
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
#Import NLTK End

#Import Spacy
import spacy
model = spacy.load('en_core_web_sm')
#Import Spacy End

text = "Avengers: Endgame is an upcoming American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and set for distribution by Walt Disney Studios Motion Pictures. It is the direct sequel to 2018's Avengers: Infinity War, a sequel to 2012's Marvel's The Avengers and 2015's Avengers: Age of Ultron, and the 22nd film in the Marvel Cinematic Universe (MCU). The film is directed by Anthony and Joe Russo with a screenplay by Christopher Markus and Stephen McFeely and features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, Scarlett Johansson, Jeremy Renner, Don Cheadle, Paul Rudd, Brie Larson, Karen Gillan, Danai Gurira, Bradley Cooper, and Josh Brolin. In the film, the surviving members of the Avengers and their allies work to reverse the damage caused by Thanos in Infinity War.The film was announced in October 2014 as Avengers: Infinity War – Part 2. The Russo brothers came on board to direct in April 2015, and by May, Markus and McFeely signed on to script the film. In July 2016, Marvel removed the title, referring to it simply as Untitled Avengers film. Filming began in August 2017 at Pinewood Atlanta Studios in Fayette County, Georgia, shooting back-to-back with Infinity War, and ended in January 2018. Additional filming took place in the Metro and Downtown Atlanta areas and New York. The official title was revealed in December 2018."

#NLTK specific Code start

text_tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(text_tokens)
ne_chunks = nltk.ne_chunk(pos_tags)
for name in ne_chunks:
    if hasattr(name, 'label'):
        print(name.label(), ' - '.join(c[0] for c in name.leaves()))
#NLTK specific Code end

#Spacy specific Code start
parsed = model(text)

for entity in parsed.ents:
    print(entity.text, entity.label_)
#Spacy specific code end
```

#### Output

```
GPE American
ORGANIZATION Marvel - Comics
ORGANIZATION Avengers
PERSON Marvel - Studios
PERSON Walt - Disney - Studios
PERSON Marvel
ORGANIZATION Avengers
GPE Ultron
ORGANIZATION Marvel - Cinematic - Universe
ORGANIZATION MCU
PERSON Anthony
PERSON Joe - Russo
PERSON Christopher - Markus
PERSON Stephen - McFeely
PERSON Robert - Downey - Jr.
PERSON Chris - Evans
PERSON Mark - Ruffalo
PERSON Chris - Hemsworth
PERSON Scarlett - Johansson
PERSON Jeremy - Renner
PERSON Don - Cheadle
PERSON Paul - Rudd
PERSON Brie - Larson
PERSON Karen - Gillan
PERSON Danai - Gurira
PERSON Bradley - Cooper
PERSON Josh - Brolin
ORGANIZATION Avengers
GPE Thanos
GPE Infinity
GPE Russo
PERSON Markus
PERSON Marvel
ORGANIZATION Avengers
ORGANIZATION Pinewood - Atlanta - Studios
GPE Fayette - County
GPE Georgia
ORGANIZATION Infinity - War
ORGANIZATION Metro
PERSON Downtown - Atlanta
GPE New - York
American NORP
the Marvel Comics ORG
Marvel Studios ORG
Walt Disney Studios Motion Pictures ORG
2018 DATE
2012 DATE
Marve
```

# 3. Conditional Random Fields


### 3.1 Introduction to CRF
Named Entity Recognition has been a very active area of study in Information Extraction, because of the challenges it poses and the valuable information that can be extracted from a piece of unstructured text. Hence, a lot of tools have been successfully applied to it. However, many of these tools have difficulty modelling overlapping, non-independent features, like the PoS tags of the surrounding words, their capitalization patterns etc., which in essence give a lot of insight into the structure of the sentence (in the English language at least). Let us try to understand this in some detail.

__Tesla said, he is not going to Spain.__

Let us try to analyze the above sentence from a non technical perspective. Following are the conclusions that we can draw after reading the sentence.

    * The subject of the sentence is a proper noun (Tesla).
    * The verbs are: said, going
    * The predicate of the sentence is said, he is not going to Spain.
    * There is one more proper noun in the sentence (Spain)
    
If we look at the above sentence more closely, we can also make the following observations:-

    * Only two words are capitalized in the sentence, and they are both proper nouns (or named entities).
    * If a verb follows a capitalized word, then that word is a noun.
    
Notice that the last two observations involved considering all the words in the sentence and their characteristics, like whether or not the next word is a verb, which of the words start with a capital letter etc. We will call this (and similar) information the contextual information. These are features that depend on the preceding or the succeeding observation (word) in the sentence.
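
A rough sketch of collecting such contextual information for the example sentence is shown below. The naive punctuation handling and the dictionary keys are assumptions made for illustration; a real pipeline would use a proper tokenizer and PoS tagger as described earlier.

```python
sentence = "Tesla said, he is not going to Spain."

# Naive tokenization: split punctuation off as separate tokens
tokens = sentence.replace(",", " ,").replace(".", " .").split()

context = []
for i, word in enumerate(tokens):
    # The next word is part of the context of the current word
    next_word = tokens[i + 1] if i + 1 < len(tokens) else None
    context.append({
        "word": word,
        "is_capitalized": word[0].isupper(),
        "next_word": next_word,
    })

capitalized = [c["word"] for c in context if c["is_capitalized"]]
print(capitalized)
```

Each entry pairs a word with information about its neighbourhood, which is the kind of input the feature functions discussed below operate on.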

__Conditional Random Fields (CRF)__ are <a href="https://en.wikipedia.org/wiki/Discriminative_model" target="_blank">discriminative</a> <a href="https://en.wikipedia.org/wiki/Graphical_model" target="_blank"> graphical models</a> that can model this overlap (or correlation) between features. A CRF is a sequence modelling tool which factors in the correlation and predicts a conditional probability. It defines the conditional probability of the tags given the words, i.e., given the words in a sentence / text, what is the probability that each belongs to a certain class (tag) of named entities, like Person, Organization, Location etc. 

For simplicity, we will be studying a special case of CRF, linear chain CRF. Mathematically, it can be denoted as:

$P(Z|X)=\frac{1}{K}\exp\left(\sum_{n=1}^N\sum_{i=1}^F\lambda_if_i(z_{n-1},z_n,X,n)\right)$

where,

Z = { $z_1,z_2...z_N$ } is the sequence of N labels (tags) like Person, Organization, Location etc. These are the classes which need to be predicted by the CRF model.

X = { $x_1,x_2...x_N$ } is the sequence of N observations (words) in the sentence / text.

$f_i$ is the i-th of the F feature functions.

$\lambda_i$ is the learned parameter, i.e. the weight of the feature function $f_i$.

K is the partition function (or normalization function), which normalizes the value into the range (0,1) so as to make it a valid probability. Mathematically, $K$ is the sum of the exponentiated weighted feature scores over all possible label sequences $Z$, and it is defined as:

$K=\sum_Z\exp\left(\sum_{n=1}^N\sum_{i=1}^F\lambda_if_i(z_{n-1},z_n,X,n)\right)$
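
To see how the two formulas fit together, the sketch below evaluates them by brute force on a two-word input. Everything specific in it is an assumption made for illustration: the two-label set, the `<START>` placeholder used for $z_0$, the single toy feature function and its weight. Enumerating every label sequence is only feasible for tiny inputs; practical CRF libraries compute $K$ with dynamic programming instead.

```python
import math
from itertools import product

LABELS = ["Person", "Other"]   # hypothetical label set
START = "<START>"              # hypothetical placeholder for z_0

def capital_is_person(z_prev, z, words, n):
    """Toy feature: fires when a capitalized word is tagged Person."""
    return 1.0 if words[n][0].isupper() and z == "Person" else 0.0

features = [capital_is_person]
weights = [2.0]                # hypothetical value of lambda_1

def unnormalized_score(labels, words):
    """exp of the weighted sum of all features over all positions."""
    total = 0.0
    for n in range(len(words)):
        z_prev = labels[n - 1] if n > 0 else START
        for lam, f in zip(weights, features):
            total += lam * f(z_prev, labels[n], words, n)
    return math.exp(total)

def probability(labels, words):
    """P(Z|X): the score of this labelling divided by the partition function K."""
    K = sum(unnormalized_score(z, words)
            for z in product(LABELS, repeat=len(words)))
    return unnormalized_score(labels, words) / K

words = ["Tesla", "said"]
probs = {z: probability(z, words) for z in product(LABELS, repeat=len(words))}
for z, p in probs.items():
    print(z, round(p, 4))
```

The probabilities of all label sequences sum to 1, and the sequences that tag "Tesla" as Person receive a larger share because the toy feature has a positive weight.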


## Feature Function

Feature functions are the central piece of a CRF, because they provide the contextual information to the model. In a linear-chain CRF, a feature function is denoted by $f_i(z_{n-1},z_n,X,n)$. This means the function has access to the following information:

    * The previous and the current state, which in our example are the tags (z).
    * All the input words in the sentence / text.
    * The position of the current word in the sentence / text.
    
For example, we can define a simple feature function that produces binary values, i.e., 1 if the current word is Tesla and the current state is Person, and 0 otherwise.

$f_1(z_{n-1},z_n,X,n)=\begin{cases}1&\text{if } z_n = \text{Person and } x_n = \text{Tesla}\\0&\text{otherwise}\end{cases}$

The way the above feature gets used depends directly on the weight $\lambda_1$ associated with it. When $\lambda_1>0$, the model prefers the tag 'Person' whenever the word 'Tesla' is encountered. Similarly, when $\lambda_1<0$, the model avoids associating the tag 'Person' with the word 'Tesla'. And finally, when $\lambda_1=0$, the feature has no effect on the behaviour of the model.

Let us define another feature function which tries to capture the relationship between the tag 'Person' and a following word 'said', for example __Tesla said,__ :-

$f_2(z_{n-1},z_n,X,n)=\begin{cases}1&\text{if } z_n = \text{Person and } x_{n+1} = \text{said}\\0&\text{otherwise}\end{cases}$

For this feature function, a weight $\lambda_2>0$ makes the model prefer the tag 'Person' for a word whenever 'said' follows it.

Now if we compare the two feature functions, we find that both $f_1$ and $f_2$ apply to the example __Tesla said,..__. This is an example of overlapping features, wherein both feature functions get activated on a particular input. Together (assuming positive weights) they boost the probability of the word 'Tesla' being classified as 'Person'.
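
The overlap can be made concrete by writing the two feature functions in Python and evaluating them at the position of 'Tesla'. The weight values and the extra label 'Other' are hypothetical, chosen only for this sketch; in a trained CRF the weights are learned from data.

```python
def f1(z_prev, z, words, n):
    """1 if the current word is 'Tesla' and it is tagged Person."""
    return 1 if z == "Person" and words[n] == "Tesla" else 0

def f2(z_prev, z, words, n):
    """1 if the current word is tagged Person and the next word is 'said'."""
    has_next = n + 1 < len(words)
    return 1 if z == "Person" and has_next and words[n + 1] == "said" else 0

words = ["Tesla", "said", ",", "he", "is", "not", "going", "to", "Spain", "."]
lambda_1, lambda_2 = 1.5, 0.8   # hypothetical weights

def local_score(tag, n):
    """Weighted sum of the features for one tag at position n (previous tag unused here)."""
    return lambda_1 * f1(None, tag, words, n) + lambda_2 * f2(None, tag, words, n)

person_score = local_score("Person", 0)   # both features fire for 'Tesla'
other_score = local_score("Other", 0)     # neither feature fires
print(person_score, other_score)
```

Both features contribute to the score of the tag 'Person' at position 0, while any other tag gets no contribution from them.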

### Feature Selection

We discussed feature functions in detail. Now let us take a look at how we arrive at the selection of these features for a given NER task. Mostly, the selection of which features to use involves a lot of the techniques used in feature engineering. However, one can always start with a few features and build more along the way:

* One of the features could be a simple combination of words and tags. For example, (x = Tesla, z = Person), (x = SpaceX, z = Organization).

* Another feature set could be the position of the capitalized words in the sentence.

* Another feature could be the neighbouring words and their POS tags.

So, the idea is to select the candidate features that would help increase the effectiveness of the CRF model.


## Applications of CRF

Typically, CRFs find their use in processing correlated sequential data, like in identifying parts of speech in a sentence. The part of speech of a word relies on its position in the sentence, i.e., on which words come before or after it. By using features that take advantage of this, we can use a CRF to learn which word belongs to which part of speech.

Another very popular application of CRFs is Named Entity Recognition, where they are used to extract the proper nouns in a sentence and then classify each of them as a Person, a Place, a Company, etc.

CRFs also find use in Computer Vision for image segmentation and analysis. They have also been used to identify objects and their attributes in images (as shown below).

<img src='../images/crf_image.png'/>

Image source: https://www.oreilly.com/library/view/deep-learning-for/9781788295628/a64a86ee-4873-4833-9196-81c70e9e3389.xhtml
        
### 3.2 CRF using Python

In the section above, we gained a deeper insight into Conditional Random Fields, how they work and their applications. In this section we will see how we can implement our own CRF using Python.

To implement CRF using Python, we will be using the following libraries (available in Python):

    * Pandas (pandas) - To read the dataset and use its 'dataframe' data structure for processing.
    * Sklearn (sklearn) - We will be using sklearn's APIs for splitting the given dataset into train and test datasets so that we can train our model on one dataset and test it on the other.
    * Sklearn CRF suite (sklearn_crfsuite) - We will also be using CRF suite's CRF implementation API. Both for learning and for practical applications, it is helpful to use a well-written and tested implementation of a machine learning algorithm.
    
We will now acquaint ourselves with the necessary steps that we need to do to use CRF in python.


1. Import Pandas, the CRF suite and sklearn's train_test_split helper for later use

```python
import pandas as pd
import sklearn_crfsuite
from sklearn.model_selection import train_test_split
```

2. Once we have the dataset, we use the loaded libraries to read the dataset from the CSV file and put it in
a dataframe (a pandas data structure)

```python
# error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) skips malformed rows instead
data_from_csv = pd.read_csv("../data/ner.csv", encoding="ISO-8859-1", on_bad_lines="skip")
```
The above snippet loads the contents of the CSV file into a pandas dataframe and assigns it to the
variable data_from_csv

3. Once we have the data, our next step is to prepare it in a way that we can feed it to the
sklearn_crfsuite API. The input to the API is a set of feature objects, one per word, which consist of the following features:
  1. Whether or not the word is in lower case.
  2. The adjacent words of the word.
  3. Whether or not the word is in upper case.
  4. Whether or not the word is in title case (starts with a capital letter).
  5. Whether or not the word consists of digits only.
  6. The POS tag of the word.
  7. The POS tags of the adjacent words.
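The feature list above could be turned into code roughly as follows. sklearn_crfsuite accepts each sentence as a list of feature dictionaries, one dictionary per word. The function names, the feature keys and the assumption that a sentence is a list of (word, POS tag) tuples are illustrative choices here, not something the dataset or the library prescribes.

```python
def word2features(sentence, i):
    """Build the feature dict for the i-th word of a sentence.

    `sentence` is assumed to be a list of (word, pos_tag) tuples.
    """
    word, pos = sentence[i]
    features = {
        "word.islower()": word.islower(),
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "postag": pos,
    }
    if i > 0:
        # Previous word and its POS tag
        prev_word, prev_pos = sentence[i - 1]
        features["-1:word.lower()"] = prev_word.lower()
        features["-1:postag"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        # Next word and its POS tag
        next_word, next_pos = sentence[i + 1]
        features["+1:word.lower()"] = next_word.lower()
        features["+1:postag"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence2features(sentence):
    """Feature dicts for every word of one sentence."""
    return [word2features(sentence, i) for i in range(len(sentence))]


# Hypothetical example sentence
example = [("Tesla", "NNP"), ("said", "VBD"), ("hello", "UH")]
X_example = [sentence2features(example)]
print(X_example[0][0])
```

Applying `sentence2features` to every sentence of the dataset would give the nested list of dictionaries that is passed to the CRF for training.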

4. Once we have processed the data, we need to train our CRF model on it. The CRF implementation of
sklearn_crfsuite can be initialized in the following way, where c1 and c2 are the coefficients for L1 and L2 regularization:-

```python
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
```

5. Once we have the model initialized, we call the fit() function of the CRF model to train it. The function takes the following parameters:

    1. X -> Features to be trained.
    2. Y -> The class labels against which the training needs to be done.

```python
crf.fit(X,Y)
```

6. The above function trains the CRF model on the dataset provided. Once trained, we use the model to predict the tags for the test dataset. The following snippet would be used for this.

```python
Y_pred = crf.predict(X_test)  # one list of predicted tags per test sentence
```

We pass the unseen test data to the model and get the predicted class labels (Y_pred) as output. Voila!
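To get a first impression of how good the predictions are, the predicted tags can be compared with the true tags of the test set, word by word. The helper below is a minimal sketch in plain Python; the function name and the example tags are made up for illustration. Since most words in NER data usually carry the 'O' tag, plain accuracy tends to look optimistic, so treat it only as a quick sanity check.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of words whose predicted tag equals the true tag.

    Both arguments are lists of tag lists, one tag list per sentence.
    """
    correct = 0
    total = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        for t, p in zip(true_tags, pred_tags):
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0


# Hypothetical true and predicted tags for two short sentences
y_true_example = [["B-per", "O", "O"], ["O", "B-org"]]
y_pred_example = [["B-per", "O", "O"], ["O", "O"]]
acc = token_accuracy(y_true_example, y_pred_example)
print(acc)
```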

# 4. Named Entity Recognition using Deep Learning

### 4.1 Model Architecture

In the previous chapter, we saw how we could extract named entities from raw text using CRF. In this chapter we will see how we can do the same using Deep Learning. To do this, we will be using a very popular deep learning framework in Python, known as Keras. We will be using a special type of RNN, i.e., Bi-LSTM (bidirectional long short-term memory), for this tutorial. More on Bi-LSTM in the following topic.

The dataset used in this tutorial will be the same as that in the previous tutorial so that the readers can correlate things easily. We will be modifying the input feature code a bit to make it compatible with the Keras engine requirements.

## Bi-LSTM Cell

Bi-LSTM cells are a special form of LSTM (Long Short Term Memory) cells. Psst, if you are not familiar with LSTMs or RNNs in general, I recommend that you read the following blogs:

For RNN -> http://karpathy.github.io/2015/05/21/rnn-effectiveness/

For LSTM -> https://colah.github.io/posts/2015-08-Understanding-LSTMs/


A bidirectional LSTM is in essence two sets of LSTM cells working together. One set reads the sequence from start to end and carries contextual information from the past to the future, whereas the other set reads it from end to start and carries information from the future to the past. The outputs of both sets are combined, so every word is represented with both its left and its right context. The following figure would help in gaining a better understanding of Bi-LSTM cells.

<img src='../images/lstm_bilstm.png'/>
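The idea of running over the sequence in both directions can be sketched with a toy recurrent cell in plain Python. Note the assumptions: this is a simple scalar tanh cell and not a real LSTM (it has no gates), and the weights `w_x` and `w_h` as well as the input sequence are arbitrary values chosen for illustration.

```python
import math


def run_rnn(inputs, w_x=0.5, w_h=0.5):
    """Run a toy scalar recurrent cell over `inputs` and return every hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # new state depends on the input and the old state
        states.append(h)
    return states


def run_bidirectional(inputs):
    """Pair a past-to-future state with a future-to-past state for every position."""
    forward = run_rnn(inputs)                # reads the sequence start to end
    backward = run_rnn(inputs[::-1])[::-1]   # reads end to start, then re-aligned
    return list(zip(forward, backward))


sequence = [1.0, 0.0, -1.0]
states = run_bidirectional(sequence)
for position, (fwd, bwd) in enumerate(states):
    print(position, round(fwd, 4), round(bwd, 4))
```

At the first position the forward state has only seen the first input, while the backward state has already seen the whole sequence; at the last position it is the other way round.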

## NER using Keras

To find named entities, we will be using a high-level neural network API called Keras. Using Keras gives us two major benefits:

    * We do not have to delve into the low-level implementation of Bi-LSTM and can rather concentrate on the application-specific problem.
    
    * Keras works on top of many popular backends like TensorFlow, CNTK and Theano, and exposes high-level APIs for ease of use. 

The overall architecture of how the various components work together can be seen in the flow diagram below.

<img src='../images/deeplearning_flow.png'/>


Let's now see the steps needed for identifying named entities using Keras.

1. The first and foremost step is to import all the necessary libraries that we are going to need for our exercise. We are going to import the following libraries:-
    
    * NumPy (for numerical calculations)
    
    * Pandas (reading and processing our dataset)
    
    * Sklearn (many helper libraries to process the dataset, split the train and test data, etc.)
    
    * Keras (for the Bi-LSTM implementation)

```python
import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
```

2. After the imports, we use the loaded libraries to read the dataset from the CSV file and put it in
a dataframe (a pandas data structure)

```python
# error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) skips malformed rows instead
data_from_csv = pd.read_csv("../data/ner.csv", encoding="ISO-8859-1", on_bad_lines="skip")
```

The above snippet loads the contents of the CSV file into a pandas dataframe and assigns it to the
variable data_from_csv

3. Once the dataset is loaded, we process it as per the input requirements of the Keras API. The processing of the data is going to be similar to what we did in the CRF section, wherein we extract all the words along with their POS tags.

Once we have the dataset, we are going to pad every sentence to a fixed length. This is done primarily because the input layer of the neural network requires all the input vectors to have the same dimensionality. The following snippet maps every word to its integer index and pads the resulting sequences at the end:

```python
from keras.preprocessing.sequence import pad_sequences
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words - 1)
```
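The snippet above relies on `sentences`, `word2idx`, `n_words` and `max_len`, which have to be built from the dataset first. The sketch below shows one possible way to do that, together with a plain-Python version of what post-padding does. The two example sentences and the "ENDPAD" padding token are assumptions made for illustration, and the helper does not truncate sentences that are longer than `max_len`.

```python
# Hypothetical sentences as lists of (word, pos_tag, ner_tag) tuples
sentences = [
    [("Tesla", "NNP", "B-per"), ("said", "VBD", "O"), ("hello", "UH", "O")],
    [("SpaceX", "NNP", "B-org"), ("launched", "VBD", "O")],
]

# Vocabulary plus one extra padding token placed last, so its index is n_words - 1
words = sorted({w[0] for s in sentences for w in s}) + ["ENDPAD"]
n_words = len(words)
word2idx = {w: i for i, w in enumerate(words)}

# Here the longest sentence defines the fixed input length
max_len = max(len(s) for s in sentences)


def pad_post(seq, max_len, value):
    """Append `value` until the sequence is `max_len` long (no truncation)."""
    return seq + [value] * (max_len - len(seq))


X = [[word2idx[w[0]] for w in s] for s in sentences]
X = [pad_post(seq, max_len, n_words - 1) for seq in X]
print(X)
```

The second sentence has only two words, so it receives the index of the padding token once at the end.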

4. Once we have the data ready, we split it into training data and test data. As the name suggests, the test data is to test the efficiency and accuracy of our model after training.

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
```

5. We are now ready to feed the data to Keras' Bi-LSTM API. Let's try to understand the process in detail.

    * The first step is to instantiate a Keras tensor. A Keras tensor is a tensor object from the underlying backend (in our example TensorFlow; it could be CNTK or Theano as well). It carries certain attributes that allow us to build a Keras model just by knowing the inputs and outputs of the model. The following snippet creates the input tensor. Since every padded sentence is a vector of max_len word indices, the shape tuple is (max_len,):
```python
input = Input(shape=(max_len,))  # one integer word index per position of the padded sentence
```

As you can see, Input() needs at least the shape of the input vectors as a parameter.


    * We will now get the word embeddings for the input data. Word embeddings, on a very high level, are feature vectors that represent the words in the text, so every input word is converted into a vector denoting that word. The Embedding layer in Keras is defined as the first hidden layer of the network. It must specify 3 arguments:-
    
    > Input Dimension (input_dim) : Size of the vocabulary of the text data.
    > Output Dimension (output_dim) : The size of the vector space in which all the words will be embedded.
    > Input Length (input_length) : The length of the input sequences.

```python
embedding = Embedding(input_dim=n_words, output_dim=50, input_length=max_len)(input)  # apply the layer to the input tensor
```
Once we have the word embeddings for each of the words in the input text, we feed this to the Bidirectional API.

    * We use the Bidirectional API to initialize the LSTM cells. Bidirectional is a wrapper around Keras' recurrent layers: it runs one copy of the wrapped layer over the sequence in each direction and combines their outputs. It takes the following parameter:
    > layer : A recurrent layer instance, in our case an LSTM layer. The LSTM layer takes a host of parameters, but the primary one is units, the dimensionality of its hidden state. Setting return_sequences=True makes it return an output for every word rather than only for the last one, which we need because every word gets a tag.


```python
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(embedding)
```

     * Once we have added the Bi-LSTM layer, we must prepare the output layer. To do this, we wrap a Dense layer (a densely connected NN layer) in a TimeDistributed layer. In Keras, the output of the Bi-LSTM has the shape (batch size, time steps, features), so the second dimension is the time dimension. The TimeDistributed wrapper applies the same Dense layer, with the same weights, to every temporal slice of the input, i.e., to every word of the sentence, which gives one softmax distribution over the tags per word.

```python
out = TimeDistributed(Dense(len(tags), activation="softmax"))(model)  # softmax output layer
model = Model(input, out)  # build the model from its input and output tensors
```
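What the time-distributed softmax layer computes can be sketched without Keras. The plain-Python example below applies one dense layer with a softmax to every time step of a short sequence. All names, the weights and the input values are made up for illustration; in the real model the weights are learned and the inputs are the Bi-LSTM outputs.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax(features, weights, bias):
    """One dense layer followed by softmax; `weights` has one row per output tag."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


def time_distributed(layer_fn, sequence):
    """Apply the same layer, with the same weights, to every time step."""
    return [layer_fn(step) for step in sequence]


# Hypothetical values: 3 time steps, 2 features per step, 3 tags
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]

probs = time_distributed(lambda step: dense_softmax(step, weights, bias), sequence)
for step_probs in probs:
    print([round(p, 3) for p in step_probs])
```

Every time step ends up with its own probability distribution over the tags, which is exactly the per-word output that NER needs.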

    * Now, we need to compile our model and specify the optimizer, the loss function and the metrics.
```python
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
```

    * Once the model is compiled, we train the model using the input data as shown in the following:
    
```python
trained = model.fit(X_train, np.array(y_train), batch_size=32, epochs=5, validation_split=0.1, verbose=1)  # batch_size=32 is only an example value
```

We then use the trained model to predict the named entities from text. 
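With a softmax output for every time step, the model returns one probability per tag for each word of a sentence. Turning this back into tag names means picking the most probable tag at every position. The sketch below shows this decoding step in plain Python; the tag list and the probabilities are invented for illustration and stand in for a real model output.

```python
def decode_tags(probabilities, idx2tag):
    """Pick the most probable tag at every time step of one sentence."""
    decoded = []
    for step in probabilities:
        best = max(range(len(step)), key=lambda i: step[i])  # index of the largest probability
        decoded.append(idx2tag[best])
    return decoded


# Hypothetical tag set and model output for a three-word sentence
tags = ["B-per", "B-org", "O"]
idx2tag = dict(enumerate(tags))
example_output = [
    [0.80, 0.10, 0.10],
    [0.10, 0.20, 0.70],
    [0.05, 0.05, 0.90],
]

decoded = decode_tags(example_output, idx2tag)
print(decoded)
```

For padded sentences, the positions that hold the padding token would normally be dropped before reading off the entities.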

