Showing 8 changed files with 46 additions and 116 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -101,12 +101,10 @@ ENV/ | |
.mypy_cache/ | ||
|
||
# Custom project stuff | ||
data/ | ||
pkl/ | ||
.pkl | ||
runs/ | ||
.vscode | ||
model/ | ||
resources/ | ||
trash/ | ||
out/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,75 +1,31 @@ | ||
# ner-topo-ff | ||
Named Entity Recognizer (NER) for entity extraction using a feedforward deep neural network and distance supervision | ||
# semeval-ffnn-baseline | ||
This project presents a baseline system for Task 12 i.e. Named Entity Recognition (NER) and Concept Resolution subtasks that uses a 2-layer feedforward neural network. | ||
|
||
Requirements: | ||
tensorflow | ||
numpy | ||
argparse | |
gensim | ||
|
||
To run : | ||
python ff_model.py | ||
|
||
Argparse prompt | ||
|
||
usage: ff_model.py [-h] [--train TRAIN] [--test TEST] [--val VAL] | ||
[--dist DIST] [--pubdir PUBDIR] [--outdir OUTDIR] | ||
[--emb_loc EMB_LOC] [--embvocab EMBVOCAB] | ||
[--hid_dim HID_DIM] [--lrn_rate LRN_RATE] | ||
[--feat_cap FEAT_CAP] [--feat_dict FEAT_DICT] | ||
[--dropout DROPOUT] [--window_size WINDOW_SIZE] | ||
[--dist_epochs DIST_EPOCHS] [--train_epochs TRAIN_EPOCHS] | ||
[--eval_interval EVAL_INTERVAL] [--n_classes {2,3}] | ||
[--batch_size BATCH_SIZE] [--restore RESTORE] [--save SAVE] | ||
|
||
optional arguments: | ||
-h, --help show this help message and exit | ||
--train _TRAIN_ train file location | ||
--test _TEST_ test file location | ||
--val _VAL_ val file location | ||
--dist _DIST_ distance supervision files dir. | ||
--pubdir _PUBDIR_ pubmed files dir. To be production set. | ||
--outdir _OUTDIR_ Output dir for ffmodel annotated pubmed files. | ||
--emb_loc _EMB_LOC_ word2vec embedding location | ||
--embvocab _EMBVOCAB_ load top n words in word emb | ||
--hid_dim _HID_DIM_ dimension of hidden layers | ||
--lrn_rate _LRN_RATE_ learning rate | ||
--feat_cap _FEAT_CAP_ Capitalization feature | ||
--feat_dict _FEAT_DICT_ Dictionary feature | ||
--dropout _DROPOUT_ dropout probability | ||
--window_size _WINDOW_SIZE_ context window size - 3/5/7 | ||
--dist_epochs _DIST_EPOCHS_ number of distsup epochs | ||
--train_epochs _TRAIN_EPOCHS_ number of train epochs | ||
--eval_interval _EVAL_INTERVAL_ evaluate once in _ epochs | ||
--n_classes _{2,3}_ number of classes | ||
--batch_size _BATCH_SIZE_ batch size of training | ||
--restore _RESTORE_ path of saved model | ||
--save _SAVE_ path to save model | ||
Dependencies: | ||
1) ```python``` | ||
2) ```geonames-services``` for disambiguation and normalization | |
|
||
Input files: | ||
Requirements: | ||
1) Directory containing BRAT annotated files i.e. corpus files containing article texts (.txt) and respective annotation files (.ann). You can extract the training files from the provided dataset and place the .ann and .txt files in the ```data/train``` directory. | ||
2) A file containing word embeddings i.e word vectors that can be loaded using the gensim model. You can download word embeddings trained on PubMed and Wikipedia articles from http://bio.nlplab.org/ and place the bin file in the ```resources``` directory. | ||
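Review note: the first requirement above relies on BRAT standoff files, so a reader may find it useful to see how a `.ann` file lines up with its `.txt` file. The following is a minimal sketch, not the project's own loader. It assumes the usual BRAT text-bound annotation layout (`ID<TAB>Type start end<TAB>text`); the `Entity` type, the `parse_ann` helper and the `Location` label are hypothetical names chosen for illustration.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """One text-bound annotation from a BRAT .ann file (hypothetical helper type)."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_text: str) -> List[Entity]:
    """Collect text-bound ("T") annotations and skip every other line type."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # notes, normalizations, relations, etc. are ignored here
        ann_id, type_and_offsets, surface = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous spans look like "3 8;9 14": keep first start and last end.
        fragments = [fragment.split() for fragment in offsets.split(";")]
        start, end = int(fragments[0][0]), int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, surface))
    return entities


# Toy document and annotation, made up for this sketch.
doc = "in South Korea."
ann = "T1\tLocation 3 14\tSouth Korea\n#1\tAnnotatorNotes T1\texample note"
entities = parse_ann(ann)
```

Checking `doc[e.start:e.end] == e.text` for every entity is a cheap way to catch `.ann` and `.txt` files that have drifted out of sync before generating training files.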
|
||
Annotated input expected as a file containing tokens on each line along with their respective annotations B/I/O or I/O separated by tab-spaces. | ||
Install dependencies: | ||
``` | ||
pip install --upgrade -r requirements.txt | ||
``` | ||
|
||
To train the model: | ||
1) Create the files required for training by running the following command: | ||
``` | ||
Overall O | ||
, O | ||
these O | ||
results O | ||
indicate O | ||
widespread O | ||
human-to-animal O | ||
transmission O | ||
of O | ||
pandemic O | ||
( O | ||
H1N1 O | ||
) O | ||
2009 O | ||
influenza O | ||
viruses O | ||
in O | ||
South B | ||
Korea I | ||
. O | ||
python gen_training_files.py -t data/train -e resources/wikipedia-pubmed-and-PMC-w2v.bin -o resources/ | ||
``` | ||
|
||
2) Train the model by running the following command: | ||
``` | ||
python ffnn_train.py | ||
``` | ||
|
||
3) To annotate files using the trained model, run the following command: | ||
``` | ||
python ffnn_run.py | ||
``` |
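Review note: the example removed from the README in this hunk showed the token-per-line input, where each token is followed by a B/I/O (or I/O) tag. As an illustration of how such a file maps to entity mentions, here is a small sketch. It is an assumption about how the tags could be decoded, not the code in this repository; `read_tagged_tokens` and `extract_spans` are hypothetical names, and the reader splits on any trailing whitespace rather than strictly on tabs.

```python
from typing import List, Tuple


def read_tagged_tokens(text: str) -> List[Tuple[str, str]]:
    """Parse "token<whitespace>tag" lines, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        token, tag = line.rsplit(None, 1)  # the tag is the last field
        pairs.append((token, tag))
    return pairs


def extract_spans(pairs: List[Tuple[str, str]]) -> List[str]:
    """Join consecutive B/I (or I-only) tokens into entity mentions."""
    spans, current = [], []
    for token, tag in pairs:
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]  # B always opens a new mention
        elif tag == "I":
            current.append(token)  # in the I/O scheme, I may open a mention
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Tail of the example that this commit removes from the README.
sample = "viruses\tO\nin\tO\nSouth\tB\nKorea\tI\n.\tO\n"
mentions = extract_spans(read_tagged_tokens(sample))
```

With the I/O scheme, two adjacent mentions cannot be told apart, which is the usual reason for preferring the three-class B/I/O variant.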
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# ignore everything except .gitignore just to keep data directory | ||
* | ||
!.gitignore |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
tensorflow==1.50 | ||
numpy==1.14.0 | ||
gensim==2.1.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# ignore everything except .gitignore just to keep data directory | ||
* | ||
!.gitignore |