# Results for Place Name Recognition using Wapiti and NeuroNER

This notebook reports results for several configurations of the Wapiti and NeuroNER tools on different training and test sets. The datasets used so far are:

* ACE corpus
* Conll corpus


In [1]:
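## Calculates and prints token-level precision and recall for the results file resf,
## counting only the tags listed in tagf. Each line is expected to end with the
## gold tag followed by the predicted tag.
def acc(resf,tagf):
    t=0.0   # correct tokens among those whose gold tag is in tagf
    c=0     # tokens whose gold tag is in tagf
    p=0.0   # correct tokens among those whose predicted tag is in tagf
    ptot=0  # tokens whose predicted tag is in tagf
    with open(resf) as f:
        res1=f.readlines()
    for line in res1:
        line1=line.split()
        if len(line1)>2:
            if line1[-2] in tagf:
                if line1[-2]==line1[-1]:
                    t+=1
                c+=1
            if line1[-1] in tagf:
                if line1[-1]==line1[-2]:
                    p+=1
                ptot+=1
    print("Total predictions: "+str(ptot))
    print("Total entities: "+ str(c))
    # Guard against empty counts so the cell does not fail with ZeroDivisionError
    rec=t/c if c else 0.0
    pre=p/ptot if ptot else 0.0
    print("recall: "+str(rec))
    print("precision: "+str(pre))
    return pre,rec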
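
The function above reads the gold tag from the second-to-last column and the predicted tag from the last column of each line. The sketch below applies the same counting to an in-memory list of lines, so the logic can be checked without a results file. The name `token_scores` and the sample lines are made up for illustration; the column layout (token, POS tag, gold tag, prediction) is an assumption about how the results files look.

```python
def token_scores(lines, tags):
    """Token-level precision and recall over the labels listed in `tags`.

    Assumes every usable line has at least three whitespace-separated
    columns, with the gold label second to last and the prediction last.
    """
    correct = 0      # tokens where gold == predicted and the label is in `tags`
    gold_total = 0   # tokens whose gold label is in `tags`
    pred_total = 0   # tokens whose predicted label is in `tags`
    for line in lines:
        cols = line.split()
        if len(cols) <= 2:
            continue  # blank sentence separators and malformed lines
        gold, pred = cols[-2], cols[-1]
        if gold in tags:
            gold_total += 1
        if pred in tags:
            pred_total += 1
        if gold in tags and gold == pred:
            correct += 1
    precision = float(correct) / pred_total if pred_total else 0.0
    recall = float(correct) / gold_total if gold_total else 0.0
    return precision, recall


# Made-up sample in the assumed layout: token, POS tag, gold label, prediction.
sample = [
    "Ankara NNP I-LOC I-LOC",
    "is VBZ O O",
    "in IN O O",
    "Turkey NNP I-LOC O",
    "",
    "Paris NNP I-PER I-LOC",
]
precision, recall = token_scores(sample, ["I-LOC"])
print(precision, recall)
```

In this sample two tokens are gold `I-LOC`, two are predicted `I-LOC`, and only one of them agrees, so both precision and recall come out at one half.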

In [2]:
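## Calculates and prints token-level precision and recall for the results file resf,
## counting every tag other than "O" (or "0"). Each line is expected to end with
## the gold tag followed by the predicted tag.
def acc2(resf):
    t=0.0   # correct tokens among those with a non-O gold tag
    c=0     # tokens with a non-O gold tag
    p=0.0   # correct tokens among those with a non-O predicted tag
    ptot=0  # tokens with a non-O predicted tag
    with open(resf) as f:
        res1=f.readlines()
    for line in res1:
        line1=line.split()
        if len(line1)>2:
            if line1[-2]!="O" and line1[-2]!="0":
                if line1[-2]==line1[-1]:
                    t+=1
                c+=1
            if line1[-1]!="O" and line1[-1]!="0":
                if line1[-1]==line1[-2]:
                    p+=1
                ptot+=1
    print("Total predictions: "+str(ptot))
    print("Total entities: "+ str(c))
    # Guard against empty counts so the cell does not fail with ZeroDivisionError
    rec=t/c if c else 0.0
    pre=p/ptot if ptot else 0.0
    print("recall: "+str(rec))
    print("precision: "+str(pre))
    return pre,rec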

In [9]:
## Calculates and returns the F-beta score for the given beta, precision and recall
def fbeta(beta,pre,rec):
    den = beta*beta*pre + rec
    num = (beta*beta+1)*pre*rec
    return num/den if den else 0.0
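
The score computed above is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to the harmonic mean of precision and recall for beta = 1. As a quick check, the sketch below redefines the function under the hypothetical name `f_beta` and feeds it the precision and recall reported for Result 1 (testa) later in this notebook.

```python
def f_beta(beta, precision, recall):
    """F-beta score; returns 0.0 when the denominator is zero."""
    denominator = beta * beta * precision + recall
    if denominator == 0:
        return 0.0
    return (beta * beta + 1) * precision * recall / denominator


# Precision and recall as printed for Result 1, testa.
precision, recall = 0.846153846154, 0.872015281757

f1 = f_beta(1, precision, recall)        # balanced
f2 = f_beta(2, precision, recall)        # weights recall more heavily
f_half = f_beta(0.5, precision, recall)  # weights precision more heavily
print(f1, f2, f_half)
```

Because recall is higher than precision here, F2 lands above F1 and F0.5 lands below it; all three stay between the precision and recall values.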

In [12]:
tagsace=["GPE","LOC"]
tagscon=["I-LOC"]


***NOTE:*** The `resultsfile` variable must be set to the correct path for each result. The cells below use the default paths; a cell will fail if its results file cannot be found at the given path.

## 1) Wapiti Results

Wapiti is a machine learning tool for sequence labelling tasks such as POS tagging and NER. It uses Conditional Random Fields (CRFs), which are well suited to labelling sequential data. More information about the tool is available in its manual:

https://wapiti.limsi.fr/manual.html

## Results for Conll alone

The results below use the Conll dataset and the predefined Conll features for learning.

### Result 1

* **Training set**: Conll training set 

* **Test set**: Conll testa,testb
* **Pattern file**: nppattern.txt

* **Configurations**: L1 norm penalty 5 

* **Terminal call for training**: `wapiti train -p patternfile -1 5 trainfile modelfile`
* **Terminal call for prediction**: `wapiti label -m modelfile testfile outputfile`
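
When many configurations are compared, it can help to build these calls programmatically. The sketch below only assembles the argument lists, using the `-p`, `-1` and `-m` flags exactly as they appear in the terminal calls above. The helper names `wapiti_train_cmd` and `wapiti_label_cmd` are hypothetical, and running the commands assumes a `wapiti` binary on the PATH.

```python
import subprocess


def wapiti_train_cmd(pattern_file, train_file, model_file, l1_penalty=None):
    """Argument list for `wapiti train`; the L1 penalty flag is optional."""
    cmd = ["wapiti", "train", "-p", pattern_file]
    if l1_penalty is not None:
        cmd += ["-1", str(l1_penalty)]
    cmd += [train_file, model_file]
    return cmd


def wapiti_label_cmd(model_file, test_file, output_file):
    """Argument list for `wapiti label`."""
    return ["wapiti", "label", "-m", model_file, test_file, output_file]


train_cmd = wapiti_train_cmd("patternfile", "trainfile", "modelfile", l1_penalty=5)
label_cmd = wapiti_label_cmd("modelfile", "testfile", "outputfile")
print(" ".join(train_cmd))
print(" ".join(label_cmd))

# To actually run them (requires wapiti to be installed):
# subprocess.check_call(train_cmd)
# subprocess.check_call(label_cmd)
```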

**testa**

In [13]:
resultsfile="results/resa.txt"
beta=1
pre,rec=acc(resultsfile,tagscon)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 2158
Total entities: 2094
1826.0
1826.0
recall: 0.872015281757
precision: 0.846153846154
F1 score:0.858889934149


**testb**

In [5]:
resultsfile="results/resb.txt"
beta=1
pre,rec=acc(resultsfile,tagscon)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 1946
Total entities: 1919
recall: 0.813444502345
precision: 0.802158273381
F1 score:0.807761966365


### Result 2

* **Training set**: Conll training set 

* **Test set**: Conll testa,testb
* **Pattern file**: nppattern.txt

* **Configurations**: default mode with no L1 penalty 

* **Terminal call for training**: `wapiti train -p patternfile trainfile modelfile`
* **Terminal call for prediction**: `wapiti label -m modelfile testfile outputfile`

**testa**

In [6]:
resultsfile="results/resa2.txt"
beta=1
pre,rec=acc(resultsfile,tagscon)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 2098
Total entities: 2094
recall: 0.916427889207
precision: 0.914680648236
F1 score:0.915553435115


**testb**

In [7]:
resultsfile="results/resb2.txt"
beta=1
pre,rec=acc(resultsfile,tagscon)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 1878
Total entities: 1919
recall: 0.853048462741
precision: 0.87167199148
F1 score:0.862259678694


### Result 3

* **Training set**: Conll training set (sentence-split version)

* **Test set**: Conll testa,testb
* **Pattern file**: nppattern.txt

* **Configurations**: default mode with no L1 penalty

* **Terminal call for training**: `wapiti train -p patternfile trainfile modelfile`
* **Terminal call for prediction**: `wapiti label -m modelfile testfile outputfile`

**testa**

In [8]:
resultsfile="results/wapresa"
beta=1
pre,rec=acc(resultsfile,tagscon)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 2108
Total entities: 2094
recall: 0.920248328558
precision: 0.914136622391
F1 score:0.917182294146


**testb**

In [9]:
resultsfile="results/wapresb"
beta=1
pre,rec=acc(resultsfile,tagscon)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 1896
Total entities: 1919
recall: 0.852006253257
precision: 0.862341772152
F1 score:0.857142857143


## Results for ACE alone

The ACE dataset contains many entity types. For the purpose of our project we only consider entities tagged GPE and LOC. As with Conll, we ignore entity boundaries (the BIO representation).
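
One way to ignore boundaries is to collapse BIO-prefixed labels to their bare entity type before scoring. The sketch below shows that idea. It is an assumption about preprocessing, not code taken from this project, and the helper name `strip_bio` is made up.

```python
def strip_bio(label):
    """Drop a leading B- or I- prefix; other labels (such as O) stay unchanged."""
    if label[:2] in ("B-", "I-"):
        return label[2:]
    return label


labels = ["B-GPE", "I-GPE", "O", "B-LOC", "LOC"]
collapsed = [strip_bio(label) for label in labels]
print(collapsed)
```

After collapsing, a tag list such as `tagsace` can be matched directly against the bare `GPE` and `LOC` labels.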

### Result 4

No features other than the surface form and regex patterns are given to the CRF learner, so these scores can be considered the baseline performance of Wapiti on ACE alone.
There is considerable room for improvement over these scores.

* **Training set**: 90% of ACE corpus (no features except for surface form + regex)
* **Test set**:  10% of ACE corpus
* **Pattern file**: acepats
* **Configurations**: L1 penalty 1

* **Terminal call for training**: `wapiti train -p patternfile -1 1 trainfile modelfile`
* **Terminal call for prediction**: `wapiti label -m modelfile testfile outputfile`

In [10]:
resultsfile="results/res1"
beta=1
pre,rec=acc(resultsfile,tagsace)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 358
Total entities: 284
recall: 0.838028169014
precision: 0.664804469274
F1 score:0.741433021807


### Result 5

* **Training set**: 90% of ACE corpus (no features except for surface form + regex)
* **Test set**:  10% of ACE corpus
* **Pattern file**: acepats
* **Configurations**: Default mode with no penalty

* **Terminal call for training**: `wapiti train -p patternfile trainfile modelfile`
* **Terminal call for prediction**: `wapiti label -m modelfile testfile outputfile`

In [11]:
resultsfile="results/res2"
beta=1
pre,rec=acc(resultsfile,tagsace)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 349
Total entities: 284
recall: 0.845070422535
precision: 0.687679083095
F1 score:0.758293838863


### Result 6

* **Training set**: 100% of Conll and 70% of ACE corpus (no features except for surface form + regex)
* **Test set**:  20% of ACE corpus
* **Pattern file**: acepats
* **Configurations**: Default mode with no penalty

* **Terminal call for training**: `wapiti train -p patternfile trainfile modelfile`
* **Terminal call for prediction**: `wapiti label -m modelfile testfile outputfile`

Results are in the "results/merres*.txt" files (three files, one for each of three slightly different models). The scores are similar for all three configurations.

In [17]:
resultsfile="results/merres.txt"
beta=1
pre,rec=acc2(resultsfile)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 1679
Total entities: 1897
recall: 0.724828676858
precision: 0.818939845146
F1 score:0.769015659955


In [18]:
resultsfile="results/merres2.txt"
beta=1
pre,rec=acc2(resultsfile)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 1727
Total entities: 1897
recall: 0.731154454402
precision: 0.803126809496
F1 score:0.765452538631


In [19]:
resultsfile="results/merres3.txt"
beta=1
pre,rec=acc2(resultsfile)
f=fbeta(beta,pre,rec)
print("F1 score:"+str(f))

Total predictions: 1762
Total entities: 1897
recall: 0.733790195045
precision: 0.790011350738
F1 score:0.760863623941


## 2) NeuroNER Results

NeuroNER is a named entity recognition tool that uses a Bi-LSTM CRF with word and character embeddings.

The tool is reported to give state-of-the-art results on the Conll dataset.

More information is available at:

http://neuroner.com/

The GitHub repo https://github.com/Franck-Dernoncourt/NeuroNER also has very detailed information about the tool.

### Pretrained Model on Conll testset

As the pretrained model was trained on the Conll training set, it performs very well on the Conll test set.

The confusion matrix for this result is in the "results/NeuroNer_Conlltest_Confmat.pdf" file in our GitHub repo.

In [5]:
res1=open("results/NeuroNer_Conlltestres.txt").read()
print(res1)

processed 46435 tokens with 5648 phrases; found: 5663 phrases; correct: 5127.
accuracy:  97.89%; precision:  90.54%; recall:  90.78%; FB1:  90.66
              LOC: precision:  92.53%; recall:  92.81%; FB1:  92.67  1673
             MISC: precision:  81.12%; recall:  80.20%; FB1:  80.66  694
              ORG: precision:  86.98%; recall:  89.28%; FB1:  88.12  1705
              PER: precision:  96.35%; recall:  94.81%; FB1:  95.57  1591
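
The overall figures in this output follow from the three counts on its first line: precision is correct / found, recall is correct / phrases, and FB1 is their harmonic mean. The sketch below recomputes them from the header line printed above. The function name `overall_scores` and the regular expression are illustrative assumptions about the line layout.

```python
import re


def overall_scores(header_line):
    """Recompute precision, recall and FB1 (in percent) from the summary line."""
    match = re.search(
        r"with (\d+) phrases; found: (\d+) phrases; correct: (\d+)", header_line
    )
    gold, found, correct = (int(group) for group in match.groups())
    precision = 100.0 * correct / found
    recall = 100.0 * correct / gold
    fb1 = 2 * precision * recall / (precision + recall)
    return precision, recall, fb1


header = "processed 46435 tokens with 5648 phrases; found: 5663 phrases; correct: 5127."
precision, recall, fb1 = overall_scores(header)
print(round(precision, 2), round(recall, 2), round(fb1, 2))
```

The recomputed values agree with the 90.54% precision, 90.78% recall and 90.66 FB1 shown in the output.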



### Pretrained Model on ACE testset

As the model is trained on a different dataset, the scores drop significantly, which suggests that pretrained models perform poorly on different domains with little entity overlap. Another issue is that the model pretrained on Conll predicts the PER and MISC tags, which are not available in our ACE corpus. The results are therefore misleading and this comparison is not meaningful. So we change the labels in the Conll dataset and train the model again.
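
The relabelling step can be sketched as below: any label whose entity type is not kept is mapped to `O`. Keeping `LOC` and `ORG` is an assumption inferred from the entity types that appear in the later NeuroNER outputs; the actual relabelling script is not shown in this notebook, and the helper names are hypothetical.

```python
def relabel(label, keep=("LOC", "ORG")):
    """Map a label such as I-PER to O unless its entity type is in `keep`."""
    if label == "O":
        return label
    entity_type = label.split("-", 1)[-1]  # "I-PER" -> "PER", "LOC" -> "LOC"
    return label if entity_type in keep else "O"


def relabel_line(line, keep=("LOC", "ORG")):
    """Relabel the last column of a whitespace-separated token line."""
    cols = line.split()
    if not cols:
        return line  # keep blank sentence separators as they are
    cols[-1] = relabel(cols[-1], keep)
    return " ".join(cols)


# Made-up lines in Conll column layout: token, POS tag, chunk tag, entity label.
lines = ["Obama NNP I-NP I-PER", "visited VBD I-VP O", "Ankara NNP I-NP I-LOC", ""]
relabelled = [relabel_line(line) for line in lines]
print(relabelled)
```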

### Training Model on Conll and testing on ACE test set

We start by testing our model on ACE using the Conll dataset for training. We merged the Conll train, test and validation sets into a single corpus. Training is done on this corpus; for validation and testing we use 10% and 20% of ACE respectively. The confusion matrix is in:

"**results/NeuroNer_Conlltrain_ACEtest_confmat.pdf**"

In [21]:
contracetesres1=open("results/NeuroNer_Conlltrain_ACEtest_res.txt").read()
print(contracetesres1)

processed 49689 tokens with 1408 phrases; found: 1273 phrases; correct: 845.
accuracy:  97.47%; precision:  66.38%; recall:  60.01%; FB1:  63.04
              LOC: precision:  71.54%; recall:  63.31%; FB1:  67.17  931
              ORG: precision:  52.34%; recall:  50.28%; FB1:  51.29  342



### Training Model using ACE dataset alone

Next we trained NeuroNER after splitting the ACE corpus into 3 sets using corpsplitter.py:

* train.txt
* valid.txt
* test.txt
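
The sketch below shows one way such a split could be done at sentence level, using the 70/10/20 proportions mentioned elsewhere in this notebook. It is not the contents of corpsplitter.py, which is not shown here; the function names, the shuffling and the fixed seed are all assumptions.

```python
import random


def read_sentences(lines):
    """Group token lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.rstrip("\n"))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_corpus(sentences, train_pct=70, valid_pct=10, seed=0):
    """Shuffle sentences and cut them into train, valid and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains is the test set
    )


# Ten made-up one-token sentences separated by blank lines.
raw_lines = []
for i in range(10):
    raw_lines += ["token%d O\n" % i, "\n"]

sentences = read_sentences(raw_lines)
train, valid, test = split_corpus(sentences)
print(len(train), len(valid), len(test))
```

Splitting on sentence boundaries rather than on raw lines keeps every sentence intact in exactly one of the three files.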

Below are the results of the trained model on the ACE test set. As the test and validation sets are small, the results can be misleading. They are only shown to give a rough idea of how well the model performs. The confusion matrix is included in the GitHub repo as "**results/ACE_test_confmat.pdf**".

Training the model on the ACE dataset increases the scores significantly. The results suggest that NER systems perform well when the training and test data come from the same domain or the same source.

In [22]:
aceres1=open("results/ACEtrain_testres.txt").read()
print(aceres1)

processed 49689 tokens with 1408 phrases; found: 1521 phrases; correct: 1221.
accuracy:  98.69%; precision:  80.28%; recall:  86.72%; FB1:  83.37
              LOC: precision:  83.20%; recall:  91.35%; FB1:  87.09  1155
              ORG: precision:  71.04%; recall:  73.03%; FB1:  72.02  366



### Training Model on ACE and Conll datasets combined

In this step we merged the 2 datasets (100% of Conll and 70% of ACE) and trained the model on this larger corpus.
Testing is done on 20% of ACE and validation uses the remaining 10%.

As the corpus size is doubled, training time increases accordingly: each epoch takes around 900 seconds. Below are the results for the ANN trained for *34* epochs, calculated using the token metric (not the Conll metric). Merging the two corpora increases the F1 score from 83.37 (ACE alone) to 85.16.
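
The difference between a token metric and a phrase-based (Conll-style) metric can be illustrated on a toy example: a token metric gives credit for every correctly labelled token, while a phrase metric only counts an entity when its whole span and type match. The sketch below is a simplified illustration with made-up labels and hypothetical function names. It is not NeuroNER's or conlleval's actual implementation.

```python
def spans(labels):
    """Return the set of (start, end, type) spans in a B-/I- labelled sequence."""
    found, start, kind = set(), None, None
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, etype = label.partition("-")
        continues = prefix == "I" and etype == kind
        if not continues:
            if kind is not None:
                found.add((start, i, kind))
            if prefix in ("B", "I"):
                start, kind = i, etype
            else:
                start, kind = None, None
    return found


def phrase_scores(gold, pred):
    """Precision and recall over exactly matching spans."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = float(correct) / len(pred_spans) if pred_spans else 0.0
    recall = float(correct) / len(gold_spans) if gold_spans else 0.0
    return precision, recall


def token_level_scores(gold, pred):
    """Precision and recall over individual non-O tokens."""
    correct = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    gold_total = sum(1 for g in gold if g != "O")
    pred_total = sum(1 for p in pred if p != "O")
    precision = float(correct) / pred_total if pred_total else 0.0
    recall = float(correct) / gold_total if gold_total else 0.0
    return precision, recall


gold = ["B-LOC", "I-LOC", "O", "B-ORG"]
pred = ["B-LOC", "O", "O", "B-ORG"]  # the two-token LOC entity is cut short
print(phrase_scores(gold, pred))
print(token_level_scores(gold, pred))
```

Here the truncated LOC entity counts as a miss at phrase level, but its first token still earns credit at token level, so token-level scores come out higher for the same predictions.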

* **"results/mergedtrain_ACEtest_confmat.pdf"** is the confusion matrix file.


In [6]:
resmerged=open("results/mergedtrain_ACEtestres.txt").read()
print(resmerged)

processed 49689 tokens with 1408 phrases; found: 1443 phrases; correct: 1214.
accuracy:  98.85%; precision:  84.13%; recall:  86.22%; FB1:  85.16
              LOC: precision:  85.69%; recall:  92.21%; FB1:  88.83  1132
              ORG: precision:  78.46%; recall:  68.54%; FB1:  73.16  311



In [30]:
### I could add POS tags to the ACE dataset
import nltk
text=nltk.word_tokenize("We are going out. Just you and me.")
print(nltk.pos_tag(text))

[('We', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('out', 'RP'), ('.', '.'), ('Just', 'NNP'), ('you', 'PRP'), ('and', 'CC'), ('me', 'PRP'), ('.', '.')]


In [35]:
pre,rec=acc2("wapitideneme/acedene/second/res3")

Total predictions: 8698
Total entities: 8938
recall: 0.956477959275
precision: 0.982869625201
