A simple supervised text classification CLI framework.
To install from source:

```sh
git clone https://github.com/crowdbreaks/text-classification.git
cd text-classification
pip install -e .
```

Or install directly from GitHub:

```sh
pip install git+https://github.com/crowdbreaks/text-classification.git
```

Note: You may need to install additional packages for full functionality.
For a list of available commands, run

```sh
txcl --help
```

Output:

```
usage: txcl [-h] {main,deploy,plot,print} ...

positional arguments:
  {main,deploy,plot,print}
                        sub-commands
    main                main pipeline
    deploy              deployment
    plot                plotting
    print               printing

optional arguments:
  -h, --help            show this help message and exit
```
To access the help message for a specific subcommand, run, e.g.

```sh
txcl main train --help
```

Output:

```
usage: txcl main train [-h] [-c C] [--parallel]

Trains model based on config.

optional arguments:
  -h, --help        show this help message and exit
  -c C, --config C  name/path of configuration file (default: config.json)
  --parallel        run in parallel (only recommended for CPU-training) (default: False)
```
Many of the commands require a path to a configuration (config) file in JSON format. A config file defines a list of 'runs', each of which specifies a set of parameters that will be used to run your command of choice.

The general structure of the config file looks like this:

```
{
    "globals": { global_params },
    "runs": [
        { params_run_1 },
        { params_run_2 }
    ]
}
```
Put all the parameters necessary for each run of your command under the "runs" key. If you have parameters that should be fixed across all runs, put them under the "globals" key. For example, you might want to preprocess the same data in a number of different ways, or train a model with a few fixed parameters and a few varied ones. In these cases, you can put the data path or the fixed model parameters under "globals".
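For instance, since everything under "globals" is merged into each run's parameters, a config that preprocesses the same data in two different ways could look like the following sketch (the file path and run names are placeholders, not taken from the repository):

```json
{
    "globals": {
        "data": {"train": "./data/train.csv"}
    },
    "runs": [
        {"name": "run_lower_case", "preprocess": {"lower_case": true}},
        {"name": "run_no_punct", "preprocess": {"remove_punctuation": true}}
    ]
}
```

Both runs inherit the training-data path from "globals" and differ only in their preprocessing switch.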
In summary:

- `"globals"`: whichever parameters you put in here will be merged with each of the runs' parameters
- `"runs"`: these are your run-specific parameters
  - `path` (strings, paths)
    - `data` (default `'./data'`)
    - `output` (default `'./output'`)
    - `tmp` (default `'./tmp'`)
    - `other` (default `'./other'`)
  - `data` (strings, paths, default `None`)
    - `train`
    - `val`
    - `test`
  - `folders` (string, mode name, default `'new'`)
    - `new`: create new folders (output folder, run folders), throw an error in case of existing folders
    - `overwrite`: create new folders, overwrite existing folders
    - `existing`: use existing folders, throw an error if no folder is found
  - `name` (string): [folder] name of the run
  - `preprocess` (dictionary): preprocessing parameters

    ```
    standardize_func_name: str = 'standardize'
    min_num_tokens: int = 0
    min_num_chars: int = 0
    lower_case: bool = False
    asciify: bool = False
    remove_punctuation: bool = False
    standardize_punctuation: bool = False
    asciify_emoji: bool = False
    remove_emoji: bool = False
    replace_url_with: Union[str, None] = None
    replace_user_with: Union[str, None] = None
    replace_email_with: Union[str, None] = None
    lemmatize: bool = False
    remove_stop_words: bool = False
    ```

  - `model`
    - `name` (string): model's name
      - `fasttext`
      - `bert`
    - `params` (dictionary): model parameters (check separately in the docs for each model)
    - ...other model-specific parameters (check separately in the docs for each model)
  - `test_only` (boolean, default `False`, `txcl main train`-specific)
  - `write_test_output` (boolean, default `False`, `txcl main train`-specific)
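Putting these keys together, a single run entry might look like the following sketch (all values are illustrative; `lr` is shown only as an example of a model-specific parameter):

```json
{
    "name": "example_run",
    "folders": "overwrite",
    "data": {
        "train": "./data/train.csv",
        "val": "./data/val.csv",
        "test": "./data/test.csv"
    },
    "preprocess": {"lower_case": true, "min_num_tokens": 3},
    "model": {"name": "fasttext", "params": {"lr": 0.05}},
    "write_test_output": true
}
```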
Required and optional keys for each command (keys marked "(default)" have default values):

- `txcl main preprocess`
  - `name`
  - `path` (default)
  - `data`
    - `train`, or `val`, or `test`
  - `preprocess` (default)
  - `model`
    - `name`
- `txcl main train`
  - `name`
  - `path` (default)
  - `data`
    - `train` (not required if `test_only` is `True`)
    - `val`
    - `test`
  - `preprocess` (retrieved automatically if the data folder is a preprocessing run)
  - `model`
    - `name`
    - `params`
  - `test_only` (default)
  - `write_test_output` (default)
- `txcl main predict`
  - `name`
  - `path` (default)
  - `data` (default)
  - `model`
    - `name`
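As an illustration of the `txcl main train` requirements above, a minimal config could look like this (file paths and the run name are placeholders):

```json
{
    "runs": [
        {
            "name": "train_fasttext_minimal",
            "data": {
                "train": "./data/train.csv",
                "val": "./data/val.csv",
                "test": "./data/test.csv"
            },
            "model": {
                "name": "fasttext",
                "params": {}
            }
        }
    ]
}
```

All other keys (`path`, `folders`, `test_only`, `write_test_output`) fall back to their defaults.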
To generate a config file that contains multiple runs following a parameter grid, use `txcl main generate-config`. For example,

```sh
txcl main generate-config \
    --mode preprocess \
    --name standardize_ag-news \
    --train-data './data/ag-news/train.csv' \
    --test-data './data/ag-news/dev.csv' \
    --model fasttext \
    --globals 'folders:overwrite' \
        'preprocess.standardize_func_name:standardize' \
    --params 'preprocess.lower_case:val:true,false' \
        'preprocess.remove_punctuation:val:true,false'
```

will generate a config file with 4 runs with varying `lower_case` and `remove_punctuation` preprocessing keys. To learn more about this command, run

```sh
txcl main generate-config --help
```
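Conceptually, the command above enumerates the 2x2 grid of parameter values. A sketch of what the generated config could contain is below (the run-name scheme and exact layout are illustrative, not the tool's actual output):

```json
{
    "globals": {
        "folders": "overwrite",
        "preprocess": {"standardize_func_name": "standardize"}
    },
    "runs": [
        {"name": "standardize_ag-news_0", "preprocess": {"lower_case": true, "remove_punctuation": true}},
        {"name": "standardize_ag-news_1", "preprocess": {"lower_case": true, "remove_punctuation": false}},
        {"name": "standardize_ag-news_2", "preprocess": {"lower_case": false, "remove_punctuation": true}},
        {"name": "standardize_ag-news_3", "preprocess": {"lower_case": false, "remove_punctuation": false}}
    ]
}
```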
In this example, you will train a FastText classifier on the AG's News Topic Classification Dataset example data.

- Go to `path/to/text-classification/test` on your machine

  ```sh
  cd path/to/text-classification/test
  ```

- Check out the data

  ```sh
  head -n 10 data/ag-news/all.csv
  ```
  You should see something like this:

  label | text
  ---|---
  2 | Finish line is in sight for Martin
  1 | Blair backs India's quest for permanent seat on UN Security Council (AFP)
  3 | "Ford Expands Recall of SUVs to 600,000"

  The dataset contains 4 types of labels:

  - 1 = World
  - 2 = Sports
  - 3 = Business
  - 4 = Sci/Tech

  Note: It is important that the CSV files (train, validation and test) have columns named `text` and `label`.
- Check out the configs (preprocessing and training)

  ```sh
  cat configs/cli/config.preprocess.ag-news.json
  cat configs/cli/config.train.ag-news.json
  ```

- Preprocess the data

  ```sh
  txcl main preprocess -c configs/cli/config.preprocess.ag-news.json
  ```

  You can find your preprocessed data, along with an exhaustive config file and label mapping, in `./output/preprocess_standardize`.
- Train a model

  Running this command trains and then automatically evaluates the model on the test set. If no `-c` option is given, it will look for a file called `config.json` in the current folder.

  ```sh
  txcl main train -c configs/cli/config.train.ag-news.json
  ```

  The trained models' artefacts, performance scores and run logs can be found in `./output/train_fasttext_lr_0.05_ws_3`, `./output/train_fasttext_lr_0.005_ws_3`, `./output/train_fasttext_lr_0.05_ws_5` and `./output/train_fasttext_lr_0.005_ws_5`.
- View the results

  After training, you can enter your output folder and run the list-runs command

  ```sh
  cd output
  txcl main ls
  ```

  to get a list of all trained models. Output:

  ```
  List runs
  ---------
  - Mode: Validation
  - Pattern: *

                                f1_macro  precision_macro  recall_macro  accuracy
  name
  train_fasttext_lr_0.005_ws_5  0.860970         0.861144      0.861245  0.860972
  train_fasttext_lr_0.005_ws_3  0.860365         0.860538      0.860666  0.860384
  train_fasttext_lr_0.05_ws_3   0.861411         0.861472      0.861625  0.861364
  train_fasttext_lr_0.05_ws_5   0.860968         0.861013      0.861198  0.860933
  ```
- Test the model of your choice

  After you finish the fine-tuning process, pick your best model and test it with

  ```sh
  cd ..
  txcl main test -r output/train_fasttext_lr_0.05_ws_3
  ```
Feel free to add new text classification models to this project. All trained models inherit from a `BaseModel` class defined under `models/`. It contains a blueprint of the methods any new model should implement.