fcappdev · tobybatch · Jun 6, 2017 · Jun 6, 2017 · Jul 17, 2017 · fcappdev
diff --git a/README.md b/README.md
@@ -1,18 +1,18 @@
 # AGROVOC autocoder
 
-This Open-Ag auto-classification model is a product of 
+This Open-Ag auto-classification model is a product of
 [Foundation Center](http://foundationcenter.org/). It was developed
-by Dave Hollander (dfh@foundationcenter.org) and Bereketab Lakew 
+by Dave Hollander (dfh@foundationcenter.org) and Bereketab Lakew
 (bkl@foundationcenter.org). This project uses Python 3.5.x in order to handle all
-UTF-8 encoding issues. 
+UTF-8 encoding issues.
 
-Training data for this model was obtained from 
+Training data for this model was obtained from
 [Food and Agriculture Organization of the United Nations](http://agris.fao.org/agris-search/index.do).
 
 The predictions are available for free via the `/text/ag_classification` REST API endpoint
 at [apibeta.foundationcenter.org](https://apibeta.foundationcenter.org/docs/v2.0/documentation.html#/README).
-If you wish to host the model locally, the pre-trained models can be 
-[downloaded](https://s3.amazonaws.com/fc-public/svm/open_ag_models.zip), and 
+If you wish to host the model locally, the pre-trained models can be
+[downloaded](https://s3.amazonaws.com/fc-public/svm/open_ag_models.zip), and
 should be stored in `src\model\clf_data\`. They can be served using the Flask
 server included in this project.
 
@@ -26,35 +26,41 @@ supported in Windows):
 On Ubuntu:
 
     ./bootstrap.sh
-    
+
 This will install Anaconda with Python 3, which includes dependent
-libraries such as scikit-learn. 
-    
+libraries such as scikit-learn.
+
 On MacOS:
 
     brew install qt
     brew install python3
     pip3 install -U numpy scipy scikit-learn
     pip3 install -r requirements.txt
 
-    
+
 If you already have Python 3, scikit-learn,
 NumPy and SciPy installed, you will only need to do
 
     pip install -r requirements.txt
 
+You will also need to get the NLTK corpora data:
+
+    python - <<EOF
+    import nltk
+    nltk.download('all-corpora')
+    EOF
 
 ## Directories
 
 ## data
 
-This directory contains the python classes for scraping the FAO AGROVOC coded text 
-documents.  Using the scraping scripts requires either Ubuntu, or MacOS to compile 
+This directory contains the python classes for scraping the FAO AGROVOC coded text
+documents.  Using the scraping scripts requires either Ubuntu, or MacOS to compile
 the necessary libraries.
 
 ## model
 
-This contains all of the classes, scripts and server files for training 
+This contains all of the classes, scripts and server files for training
 the AGROVOC models and serving up the prediction API.
 
 ## MySQL Setup
@@ -63,7 +69,7 @@ To connect to a MySQL instance within python:
 
     1. from utils.database import MySQLDataBase
     2. Pass in the connection JSON defined in config.py
-    
+
 In MySQL, create a new database called `agrovoc_autocode`. Within that database
 create a table for containing the training documents that the AGROVOC
 scraping results can be inserted into.
@@ -76,11 +82,11 @@ scraping results can be inserted into.
       `page` INT NULL,
       `search_term` VARCHAR(100) NULL,
       PRIMARY KEY (`id`));
-      
+
 A code lookup table called `agrovoc_terms` in the `agrovoc_autocode` database should be
-created from the agris_data.csv table contained in 
+created from the agris_data.csv table contained in
 `db/`. In MySQL run [create_hierarchy_table](db/create_hierarchy_table.sql).
-Finally, run [split_training_test](db/split_training_test.sql) to 
+Finally, run [split_training_test](db/split_training_test.sql) to
 separate the test and training sets into separate, disjoint sets.
 
 ## Training
@@ -89,13 +95,13 @@ After setting up the MySQL environment, and training data has been collected usi
 the included scraping scripts, the models can be trained for each hierarchy by running:
 
     python train.py model_name validation_percent hierarchy_level CV_folds
-    
+
 from within `src/model`. An example of training the model for the first
 AGROVOC code hierarchy is
 
     python train.py model_h1 0.1 1 3
-    
-where the model will be named model_h1, we hold 10% of the input set for 
+
+where the model will be named model_h1, we hold 10% of the input set for
 on-the-fly cross-validation, and we are doing 3-fold cross-validation. Models
 are pickled in `src/data` with a hashed unique-identifier
 as a name prefix to prevent model overwrites.
@@ -118,26 +124,26 @@ a lookup dictionary.
 To start the API server simply run
 
     python model/server.py
-    
+
 from within `src/`.
 
 The API accepts the following parameters:
 
     text (string) : This is the text that will be passed through the models to predict the relevant AGROVOC terms. This can be anything from
     a single word up to multi-page documents with many paragraphs.
-    
-    chunk (string) : Input text can be broken up into individual sentences that are predicted separately, and the results are 
+
+    chunk (string) : Input text can be broken up into individual sentences that are predicted separately, and the results are
     aggregated. This parameter toggles this feature on and off (accepts 'true' and 'false', optional, 'false' by default).
-    
+
     threshold (string) : (optional, set to 'high' to only accept predictions with a high confidence score).
-    
+
     rollup (string) : Some AGROVOC codes are 'children' of more generic taxonomy terms. The model by default tries to return
     the most specific codes, only. Setting this parameter to 'false' will also return 'parent' codes.
     ex) The text 'apples' will predict the 'apples' and 'fruit' AGROVOC codes, but 'fruit' is the parent of 'apples', and so
     with rollup turned on only 'apples' will be returned. (accepts 'true' and 'false', optional, 'true' by default).
 
     form (string) : specify whether the response is returned in XML (JSON by default, accepts 'xml')
-    
+
 To make a call to the API from Python you can do:
 
 ```python
@@ -150,7 +156,7 @@ opts = {"text": "I want to grow apples. I'm interested in raising cows.",
         "threshold": "low",
         "rollup": "false"
         }
-        
+
 req = request.Request(url, data=json.dumps(opts).encode('utf8'), headers={"Content-Type": "application/json"})
 response = request.urlopen(req).read().decode('utf8')
 print(response)

diff --git a/bootstrap.sh b/bootstrap.sh
@@ -7,4 +7,8 @@ bash Anaconda3-4.3.0-Linux-x86_64.sh -b
 export PATH=/home/ubuntu/anaconda3/bin:$PATH
 sudo apt-get install build-essential -y
 conda install -c anaconda qt=4.8.6 -y
-pip install -r requirements.txt -y
+pip install -r requirements.txt
+python - <<EOF
+import nltk
+nltk.download('all-corpora')
+EOF
diff --git a/requirements.txt b/requirements.txt
@@ -10,3 +10,4 @@ dill
 flask
 flask-cors
 matplotlib
+pandas
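
The README touched by this diff documents five API parameters (`text`, `chunk`, `threshold`, `rollup`, `form`). As a minimal sketch of how those might be assembled into a request, the helper below builds the JSON payload without sending it. The base URL (host and port of a locally hosted Flask server) and the `build_request` helper are assumptions for illustration, and it is also assumed that the local server exposes the same `/text/ag_classification` path as the hosted API.

```python
import json
from urllib import request

# Hypothetical base URL for a locally hosted Flask server; the host and port
# are assumptions and are not stated in the README.
BASE_URL = "http://localhost:5000"
# Endpoint path named in the README for the hosted API; assumed to match locally.
ENDPOINT = "/text/ag_classification"


def build_request(text, chunk=False, threshold=None, rollup=True, form=None):
    """Build (but do not send) a POST request for the classification API."""
    opts = {
        "text": text,
        # The README describes these flags as the strings 'true' / 'false'.
        "chunk": "true" if chunk else "false",
        "rollup": "true" if rollup else "false",
    }
    # Optional parameters are only included when the caller sets them.
    if threshold is not None:
        opts["threshold"] = threshold
    if form is not None:
        opts["form"] = form
    return request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(opts).encode("utf8"),
        headers={"Content-Type": "application/json"},
    )


req = build_request("I want to grow apples.", chunk=True, threshold="high")
payload = json.loads(req.data.decode("utf8"))
print(payload)
```

Passing `req` to `request.urlopen` would send it, as in the README's own example; the defaults here mirror the documented ones (`chunk` off, `rollup` on, JSON response).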