60 changes: 33 additions & 27 deletions README.md
@@ -1,18 +1,18 @@
# AGROVOC autocoder

This Open-Ag auto-classification model is a product of
[Foundation Center](http://foundationcenter.org/). It was developed
by Dave Hollander (dfh@foundationcenter.org) and Bereketab Lakew
(bkl@foundationcenter.org). This project uses Python 3.5.x in order to handle all
UTF-8 encoding issues.

Training data for this model was obtained from
[Food and Agriculture Organization of the United Nations](http://agris.fao.org/agris-search/index.do).

The predictions are available for free via the `/text/ag_classification` REST API endpoint
at [apibeta.foundationcenter.org](https://apibeta.foundationcenter.org/docs/v2.0/documentation.html#/README).
If you wish to host the model locally, the pre-trained models can be
[downloaded](https://s3.amazonaws.com/fc-public/svm/open_ag_models.zip), and
should be stored in `src\model\clf_data\`. They can be served using the Flask
server included in this project.

@@ -26,35 +26,41 @@ supported in Windows):
On Ubuntu:

./bootstrap.sh

This will install Anaconda with Python 3, which includes dependent
libraries such as scikit-learn.

On MacOS:

brew install qt
brew install python3
pip3 install -U numpy scipy scikit-learn
pip3 install -r requirements.txt


If you already have Python 3, scikit-learn,
NumPy and SciPy installed, you will only need to run

pip install -r requirements.txt

You will also need to get the NLTK corpora data:

python - <<EOF
import nltk
nltk.download('all-corpora')
EOF

Review comment (Owner): We can add to bootstrap.sh the following lines:

python -m nltk.downloader punkt
python -m nltk.downloader stopwords
python -m nltk.downloader wordnet

rather than downloading the entire corpus. This will also set up the environment in one script rather than having to run separate statements.
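
The review suggestion to fetch only the needed resources could also be expressed from Python. The following is a minimal sketch, not part of the PR: `REQUIRED_RESOURCES` and `download_required` are hypothetical names, and the three resource names are taken from the review comment rather than from an audit of what the project imports. It relies on `nltk.download` returning a truthy value on success.

```python
# Hypothetical helper: fetch only the NLTK resources named in the review
# comment instead of the full 'all-corpora' collection.
REQUIRED_RESOURCES = ("punkt", "stopwords", "wordnet")


def download_required(download=None, resources=REQUIRED_RESOURCES):
    """Download each resource and return the names that failed."""
    if download is None:
        # Imported lazily so the helper can be exercised without NLTK installed.
        import nltk
        download = nltk.download
    # Keep the name of every resource whose download reported failure.
    return [name for name in resources if not download(name)]


# Typical use (requires NLTK and network access):
#     failed = download_required()
#     if failed:
#         raise SystemExit("could not download: " + ", ".join(failed))
```

Passing the downloader in as an argument keeps the helper easy to check without network access.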

## Directories

## data

This directory contains the Python classes for scraping the FAO AGROVOC coded text
documents. Using the scraping scripts requires either Ubuntu or MacOS to compile
the necessary libraries.

## model

This contains all of the classes, scripts and server files for training
the AGROVOC models and serving up the prediction API.

## MySQL Setup
@@ -63,7 +69,7 @@ To connect to a MySQL instance within python:

1. from utils.database import MySQLDataBase
2. Pass in the connection JSON defined in config.py

In MySQL, create a new database called `agrovoc_autocode`. Within that database,
create a table for containing the training documents that the AGROVOC
scraping results can be inserted into.
@@ -76,11 +82,11 @@ scraping results can be inserted into.
`page` INT NULL,
`search_term` VARCHAR(100) NULL,
PRIMARY KEY (`id`));

A code lookup table called `agrovoc_terms` in the `agrovoc_autocode` database should be
created from the `agris_data.csv` file contained in
`db/`. In MySQL run [create_hierarchy_table](db/create_hierarchy_table.sql).
Finally, run [split_training_test](db/split_training_test.sql) to
split the documents into disjoint training and test sets.

## Training
@@ -89,13 +95,13 @@ After setting up the MySQL environment, and training data has been collected usi
the included scraping scripts, the models can be trained for each hierarchy by running:

python train.py model_name validation_percent hierarchy_level CV_folds

from within `src/model`. An example of training the model for the first
AGROVOC code hierarchy is

python train.py model_h1 0.1 1 3

where the model will be named model_h1, we hold out 10% of the input set for
on-the-fly cross-validation, we train hierarchy level 1, and we do 3-fold cross-validation. Models
are pickled in `src/data` with a hashed unique identifier
as a name prefix to prevent model overwrites.
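
To make the argument order concrete, here is an illustrative sketch using `argparse`. How `train.py` actually reads its arguments is not shown in this diff, so `parse_train_args` is a hypothetical name and the range check on `validation_percent` is an assumption based on the `0.1` in the example.

```python
import argparse


def parse_train_args(argv):
    """Parse the four positional arguments in the order the README lists them."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("model_name", help="name given to the trained model")
    parser.add_argument("validation_percent", type=float,
                        help="fraction of the input set held out, e.g. 0.1 for 10%%")
    parser.add_argument("hierarchy_level", type=int,
                        help="AGROVOC hierarchy level to train")
    parser.add_argument("CV_folds", type=int,
                        help="number of cross-validation folds")
    args = parser.parse_args(argv)
    # Assumption: the value is a fraction, as in the README example (0.1 = 10%).
    if not 0.0 < args.validation_percent < 1.0:
        parser.error("validation_percent must be between 0 and 1")
    return args


args = parse_train_args(["model_h1", "0.1", "1", "3"])
print(args.model_name, args.validation_percent, args.hierarchy_level, args.CV_folds)
# prints: model_h1 0.1 1 3
```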
@@ -118,26 +124,26 @@ a lookup dictionary.
To start the API server simply run

python model/server.py

from within `src/`.

The API accepts the following parameters:

text (string) : The text that will be passed through the models to predict the relevant AGROVOC terms. This can be anything from
a single word up to multi-page documents with many paragraphs.

chunk (string) : Input text can be broken up into individual sentences that are predicted separately, and the results are
aggregated. This parameter toggles this feature on and off (accepts 'true' and 'false', optional, 'false' by default).

threshold (string) : (optional, set to 'high' to only accept predictions with a high confidence score).

rollup (string) : Some AGROVOC codes are 'children' of more generic taxonomy terms. By default the model tries to return
only the most specific codes. Setting this parameter to 'false' will also return 'parent' codes.
ex) The text 'apples' will predict the 'apples' and 'fruit' AGROVOC codes, but 'fruit' is the parent of 'apples', so
with rollup turned on only 'apples' will be returned (accepts 'true' and 'false', optional, 'true' by default).

form (string) : Specifies the response format (optional, JSON by default, set to 'xml' to return XML).
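
The rollup behaviour described above can be illustrated with a small sketch. It assumes a hypothetical `PARENT_OF` mapping from each code to its parent. The server's own hierarchy lookup is not shown in this diff, so `roll_up` only illustrates the idea and is not the project's implementation.

```python
# Hypothetical child -> parent lookup; the real hierarchy comes from AGROVOC.
PARENT_OF = {"apples": "fruit", "fruit": "plant products"}


def ancestors(code, parent_of):
    """Yield every ancestor of `code`, nearest first."""
    seen = set()
    while code in parent_of and code not in seen:
        seen.add(code)  # guards against accidental cycles in the lookup
        code = parent_of[code]
        yield code


def roll_up(predicted, parent_of=PARENT_OF):
    """Keep only the most specific codes: drop any code that is an
    ancestor of another predicted code."""
    predicted = set(predicted)
    covered = {a for code in predicted for a in ancestors(code, parent_of)}
    return sorted(predicted - covered)


print(roll_up(["apples", "fruit"]))  # ['apples']
```

With `rollup` set to 'false', the equivalent of this filtering step would simply be skipped and both codes returned.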

To make a call to the API from Python you can do:

```python
@@ -150,7 +156,7 @@ opts = {"text": "I want to grow apples. I'm interested in raising cows.",
"threshold": "low",
"rollup": "false"
}

req = request.Request(url, data=json.dumps(opts).encode('utf8'), headers={"Content-Type": "application/json"})
response = request.urlopen(req).read().decode('utf8')
print(response)
6 changes: 5 additions & 1 deletion bootstrap.sh
@@ -7,4 +7,8 @@ bash Anaconda3-4.3.0-Linux-x86_64.sh -b
export PATH=/home/ubuntu/anaconda3/bin:$PATH
sudo apt-get install build-essential -y
conda install -c anaconda qt=4.8.6 -y
pip install -r requirements.txt -y
python - <<EOF
import nltk
nltk.download('all-corpora')
EOF

Review comment (Owner): Let's replace this block with:

python -m nltk.downloader punkt
python -m nltk.downloader stopwords
python -m nltk.downloader wordnet
1 change: 1 addition & 0 deletions requirements.txt
@@ -10,3 +10,4 @@ dill
flask
flask-cors
matplotlib
pandas

Review comment (Owner): Pandas is included with Anaconda. It should already be installed by `bash Anaconda3-4.3.0-Linux-x86_64.sh -b`.