# Stanford NER Tagging

NER stands for **Named Entity Recognition**. It extracts information from unstructured text by identifying the real-world entities mentioned in it (e.g. Person, Organization, Event). The recognized entities can then be used to determine the relationships between different named entities.

This guide applies to the Windows operating system only.

Credits: The code methods belong to Chuck Dishmon.

## Set-up NLTK in Python

First, install NLTK via Anaconda:
- [Anaconda](https://repo.anaconda.com/archive/Anaconda3-5.2.0-Windows-x86_64.exe)

The link above installs a distribution of Python packages and applications that includes NLTK.

After installing Anaconda, search for Jupyter Notebook and open it.
Run this code:
```python
import nltk
# Download the Punkt tokenizer models used by word_tokenize
nltk.download("punkt")
```

## Set-up Java for Stanford NER
Because Stanford NER runs on Java, you also need the Java Runtime Environment (JRE) in order to call it through NLTK from Python.

To download (only if you do not have Java installed):
- [Java](https://java.com/en/download/)
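
To confirm that Java is reachable before continuing, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; `find_java` is a hypothetical helper name and is not part of NLTK.

```python
import shutil

def find_java():
    """Return the full path of the java executable found on PATH, or None."""
    return shutil.which("java")

java_exe = find_java()
if java_exe is None:
    # Java may still be installed but not on PATH; its full path can be supplied later
    print("Java was not found on PATH.")
else:
    print("Java found at:", java_exe)
```

If the check reports that Java is missing even though it is installed, note the full path to `java.exe` so it can be supplied explicitly in the Python code.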


## Set-up Stanford NER for Python

Stanford NER is a Java implementation of a Named Entity Recognizer. NER labels sequences of words in a text that are the names of things, such as person and company names, or gene and protein names. It has good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and it also supports other languages.

Once you have installed NLTK, download the following Stanford archives and extract each one into its own folder:

1. [Stanford Parser](https://nlp.stanford.edu/software/stanford-parser-full-2018-02-27.zip)
2. [Stanford NER](https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip)
3. [Stanford Log-Linear POS Tagger](https://nlp.stanford.edu/software/stanford-postagger-full-2018-02-27.zip)


## Configurations

### 1. Environment Variables Window
Once you have all the relevant packages and files, open the Windows search panel and search for "Environment Variables".

### 2. Create two new variables

#### CLASSPATH
Create a `CLASSPATH` user variable by clicking the **New** button, then add the following folders as its value (when entering several paths in one value, separate them with semicolons):
```
Drive:\path\to\stanford-ner-2018-02-27
Drive:\path\to\stanford-parser-full-2018-02-27
Drive:\path\to\stanford-postagger-full-2018-02-27
```
Example:
```
C:\Users\Daryl\Desktop\Stanford NER\stanford-ner-2018-02-27
C:\Users\Daryl\Desktop\Stanford NER\stanford-parser-full-2018-02-27
C:\Users\Daryl\Desktop\Stanford NER\stanford-postagger-full-2018-02-27
```

#### STANFORD_MODELS
Create another user variable named `STANFORD_MODELS` and add the following folders as its value (again separated by semicolons):
```
Drive:\path\to\stanford-postagger-full-2018-02-27\models
Drive:\path\to\stanford-ner-2018-02-27\classifiers
```
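
To verify that both variables are visible to Python and point at real folders, a check like the following may be useful. It is a sketch under the assumption that the entries are joined with the Windows path-list separator (a semicolon, exposed in Python as `os.pathsep`); `check_env_paths` is a hypothetical helper, not an NLTK function.

```python
import os

def check_env_paths(var_name, environ=None):
    """Split an environment variable on the OS path separator and
    report, for each entry, whether it is an existing directory."""
    environ = os.environ if environ is None else environ
    value = environ.get(var_name, "")
    entries = [p for p in value.split(os.pathsep) if p]
    return [(p, os.path.isdir(p)) for p in entries]

for name in ("CLASSPATH", "STANFORD_MODELS"):
    results = check_env_paths(name)
    if not results:
        print(name, "is not set")
    for path, exists in results:
        print(name, "->", path, "(found)" if exists else "(missing)")
```

Environment variables are read when a process starts, so Jupyter Notebook needs to be restarted after the variables are created or changed.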

## Testing the Stanford NER

```python
#import packages
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.chunk import conlltags2tree
from nltk.tree import Tree
```
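Adjust the paths below so that they point to the corresponding files on your machine, and add an `r` prefix (raw string literal) before the opening quotation mark of each path so that backslashes are not interpreted as escape characters.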
```python
#Configurations for py-java and setting paths for stanfordNERTagger
import os
java_path = r"C:/Program Files/Java/jre1.8.0_144/bin/java.exe"
os.environ['JAVAHOME'] = java_path

stanford_ner_path = r"C:/Users/Daryl/Desktop/Stanford NER/stanford-ner-2018-02-27/stanford-ner-2018-02-27/stanford-ner-3.9.1.jar"
stanford_classifier = r"C:/Users/Daryl/Desktop/Stanford NER/stanford-ner-2018-02-27/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz"
```
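
Because extracting the zip files can produce nested folders (as in the example paths above), locating the exact files by name can save some guesswork. The following is a minimal sketch using only the standard library; `find_files` is a hypothetical helper, and the root folder shown is only an example location.

```python
from pathlib import Path

def find_files(root, filename):
    """Return every file named `filename` under `root`, searching recursively.
    An empty list is returned if `root` does not exist."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(filename) if p.is_file())

# Example root folder; replace it with the folder that holds your extracted archives
root = r"C:/Users/Daryl/Desktop/Stanford NER"
print(find_files(root, "stanford-ner-3.9.1.jar"))
print(find_files(root, "english.all.3class.distsim.crf.ser.gz"))
```

Any path printed here can be copied into `stanford_ner_path` or `stanford_classifier`.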

### Methods

```python
#Read the text file and tokenize its contents into words
def process_text(text_file):
    #errors="ignore" skips characters that cannot be decoded
    with open(text_file, errors="ignore") as f:
        text = f.read()
    tokenized_text = word_tokenize(text)
    return tokenized_text
```

```python
#Create a Stanford NER Tagger object and use to tag each tokenized word
def stanford_tagger(tokenized_text):
    #create tagger 
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    #tag every tokenized word
    classified_text = st.tag(tokenized_text)
    return classified_text
```

```python
#Tag tokens with standard NLP BIO tags
def bio_tagger(classified_text):
    bio_tagged = []
    prev_tag = "O"
    for token, tag in classified_text:
        if tag == "O": #O
            bio_tagged.append((token,tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O": #Begin NE
            bio_tagged.append((token, "B-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag: #Inside NE
            bio_tagged.append((token, "I-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag: #Adjacent NE
            bio_tagged.append((token, "B-" + tag))
            prev_tag = tag
    return bio_tagged
```
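
To see what the BIO scheme produces, here is a small worked example that does not need Java or the Stanford files. It uses a condensed function, `to_bio` (a hypothetical name), that is intended to apply the same rules as `bio_tagger`: `B-` marks the first token of an entity and `I-` marks a token that continues it.

```python
def to_bio(classified_text):
    """Condensed sketch of the BIO rules: B- begins an entity, I- continues it."""
    bio_tagged = []
    prev_tag = "O"
    for token, tag in classified_text:
        if tag == "O":
            bio_tagged.append((token, tag))          # outside any entity
        elif tag == prev_tag:
            bio_tagged.append((token, "I-" + tag))   # same tag as previous token
        else:
            bio_tagged.append((token, "B-" + tag))   # new entity starts here
        prev_tag = tag
    return bio_tagged

sample = [("US", "LOCATION"), ("President", "O"),
          ("Donald", "PERSON"), ("Trump", "PERSON")]
print(to_bio(sample))
# [('US', 'B-LOCATION'), ('President', 'O'), ('Donald', 'B-PERSON'), ('Trump', 'I-PERSON')]
```

Note that two different entities of the same class that appear back to back are merged into one, because the tagger output carries no boundary between them.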

```python
#Create Chunk Tree
from nltk import pos_tag  #may require nltk.download("averaged_perceptron_tagger")

def stanford_tree(bio_tagged):
    tokens, ne_tags = zip(*bio_tagged)
    #part-of-speech tags are needed for the (token, pos, ne) triples
    pos_tags = [pos for token, pos in pos_tag(tokens)]
    conlltags = list(zip(tokens, pos_tags, ne_tags))
    ne_tree = conlltags2tree(conlltags)
    return ne_tree
```

```python
#Parse named entities from tree
def structure_ne(ne_tree):
    ne = []
    for subtree in ne_tree:
        #a Tree subtree is a named entity chunk; plain tuples are "O" tokens
        if isinstance(subtree, Tree):
            ne_label = subtree.label()
            ne_string = " ".join([token for token, pos in subtree.leaves()])
            ne.append((ne_string, ne_label))
    return ne
```

```python
#Run the tagging by calling the methods
def main():
    tokenized_text = process_text(r"C:\Users\Daryl\Desktop\Data Analysis\Stanford NER\text1.txt")
    tagged = stanford_tagger(tokenized_text)
    return tagged

#In a notebook cell, the returned list is displayed as the cell output
main()
```

Sample output:
```
[('``', 'O'),
 ('Singapore', 'LOCATION'),
 ('has', 'O'),
 ('also', 'O'),
 ('registered', 'O'),
 ('our', 'O'),
 ('concerns', 'O'),
 ('with', 'O'),
 ('the', 'O'),
 ('relevant', 'O'),
 ('US', 'LOCATION'),
 ('and', 'O'),
 ('China', 'LOCATION'),
 ('departments', 'O'),
 ('and', 'O'),
 ('is', 'O'),
 ('continuing', 'O'),
 ('to', 'O'),
 ('engage', 'O'),
 ('them', 'O'),
 (',', 'O'),
 ("''", 'O'),
 ('said', 'O'),
 ('the', 'O'),
 ('spokesperson', 'O'),
 (',', 'O'),
 ('who', 'O'),
 ('did', 'O'),
 ('not', 'O'),
 ('respond', 'O'),
 ('to', 'O'),
 ('a', 'O'),
 ('query', 'O'),
 ('on', 'O'),
 ('the', 'O'),
 ('number', 'O'),
 ('of', 'O'),
 ('firms', 'O'),
 ('affected', 'O'),
 ('.', 'O'),
 ('In', 'O'),
 ('January', 'O'),
 (',', 'O'),
 ('US', 'LOCATION'),
 ('President', 'O'),
 ('Donald', 'PERSON'),
 ('Trump', 'PERSON')]
```
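
The tagger output labels each token separately, so multi-word names such as "Donald Trump" appear as two entries. One simple way to merge them, shown here as a sketch rather than as the tree-based approach of the methods above, is to group consecutive tokens that share the same tag with `itertools.groupby`. The name `group_entities` is hypothetical, and the input is the tail of the sample output.

```python
from itertools import groupby

def group_entities(tagged):
    """Merge runs of consecutive tokens that share the same non-O tag."""
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in run), tag))
    return entities

tail = [("In", "O"), ("January", "O"), (",", "O"), ("US", "LOCATION"),
        ("President", "O"), ("Donald", "PERSON"), ("Trump", "PERSON")]
print(group_entities(tail))
# [('US', 'LOCATION'), ('Donald Trump', 'PERSON')]
```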
