A simple supervised text classification CLI framework.
To install from source:

```sh
git clone https://github.com/crowdbreaks/text-classification.git
cd text-classification
pip install -e .
```

Or install directly from GitHub:

```sh
pip install git+https://github.com/crowdbreaks/text-classification.git
```

Note: You may need to install additional packages for full functionality.
For a list of available commands, run

```sh
txcl --help
```

Output:

```
usage: txcl [-h] {main,deploy,plot,print} ...

positional arguments:
  {main,deploy,plot,print}
                        sub-commands
    main                main pipeline
    deploy              deployment
    plot                plotting
    print               printing

optional arguments:
  -h, --help            show this help message and exit
```
To access the help message for a specific subcommand, run, e.g.

```sh
txcl main train --help
```

Output:

```
usage: txcl main train [-h] [-c C] [--parallel]

Trains model based on config.

optional arguments:
  -h, --help        show this help message and exit
  -c C, --config C  name/path of configuration file (default: config.json)
  --parallel        run in parallel (only recommended for CPU-training) (default: False)
```
Many of the commands require a path to a configuration (config) file in JSON format. A config file defines a list of 'runs', each of which specifies a set of parameters that will be used to run your command of choice.

The general structure of the config file looks like this:

```
{
    "globals": { global_params },
    "runs": [
        { params_run_1 },
        { params_run_2 }
    ]
}
```
Put all the parameters necessary for each run of your command under the "runs" key. If you have parameters that should be fixed across all runs, put them under the "globals" key. For example, you might want to preprocess the same data in a number of different ways, or train a model with a few fixed parameters and a few varied ones. In these cases, you can put the data path or the fixed model parameters under "globals".
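For instance, since everything under "globals" is merged into each run's parameters, a config that preprocesses the same data in two different ways could look like the following sketch (the file path and run names are placeholders, not taken from the repository):

```json
{
    "globals": {
        "data": {"train": "./data/train.csv"}
    },
    "runs": [
        {"name": "run_lower_case", "preprocess": {"lower_case": true}},
        {"name": "run_no_punct", "preprocess": {"remove_punctuation": true}}
    ]
}
```

Both runs inherit the training-data path from "globals" and differ only in their preprocessing switch.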
In summary:

- `"globals"`: whichever parameters you put in here will be merged with each of the runs' parameters
- `"runs"`: these are your run-specific parameters
  - `path` (strings, paths)
    - `data` (default `'./data'`)
    - `output` (default `'./output'`)
    - `tmp` (default `'./tmp'`)
    - `other` (default `'./other'`)
  - `data` (strings, paths, default `None`)
    - `train`
    - `val`
    - `test`
  - `folders` (string, mode name, default `'new'`)
    - `new`: create new folders (output folder, run folders), throw an error in case of existing folders
    - `overwrite`: create new folders, overwrite existing folders
    - `existing`: use existing folders, throw an error if no folder is found
  - `name` (string): [folder] name of the run
  - `preprocess` (dictionary): preprocessing parameters

    ```
    standardize_func_name: str = 'standardize'
    min_num_tokens: int = 0
    min_num_chars: int = 0
    lower_case: bool = False
    asciify: bool = False
    remove_punctuation: bool = False
    standardize_punctuation: bool = False
    asciify_emoji: bool = False
    remove_emoji: bool = False
    replace_url_with: Union[str, None] = None
    replace_user_with: Union[str, None] = None
    replace_email_with: Union[str, None] = None
    lemmatize: bool = False
    remove_stop_words: bool = False
    ```

  - `model`
    - `name` (string): model's name
      - `fasttext`
      - `bert`
    - `params` (dictionary): model parameters (check separately in the docs for each model)
    - ...other model-specific parameters (check separately in the docs for each model)
  - `test_only` (boolean, default `False`, `txcl main train`-specific)
  - `write_test_output` (boolean, default `False`, `txcl main train`-specific)
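Putting these keys together, a single run entry might look like the following sketch (all values are illustrative; `lr` is shown only as an example of a model-specific parameter):

```json
{
    "name": "example_run",
    "folders": "overwrite",
    "data": {
        "train": "./data/train.csv",
        "val": "./data/val.csv",
        "test": "./data/test.csv"
    },
    "preprocess": {"lower_case": true, "min_num_tokens": 3},
    "model": {"name": "fasttext", "params": {"lr": 0.05}},
    "write_test_output": true
}
```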
Required and optional keys for each command (keys marked "(default)" have default values):

- `txcl main preprocess`
  - `name`
  - `path` (default)
  - `data`
    - `train`, or `val`, or `test`
  - `preprocess` (default)
  - `model`
    - `name`
- `txcl main train`
  - `name`
  - `path` (default)
  - `data`
    - `train` (not required if `test_only` is `True`)
    - `val`
    - `test`
  - `preprocess` (retrieved automatically if the data folder is a preprocessing run)
  - `model`
    - `name`
    - `params`
  - `test_only` (default)
  - `write_test_output` (default)
- `txcl main predict`
  - `name`
  - `path` (default)
  - `data` (default)
  - `model`
    - `name`
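As an illustration of the `txcl main train` requirements above, a minimal config could look like this (file paths and the run name are placeholders):

```json
{
    "runs": [
        {
            "name": "train_fasttext_minimal",
            "data": {
                "train": "./data/train.csv",
                "val": "./data/val.csv",
                "test": "./data/test.csv"
            },
            "model": {
                "name": "fasttext",
                "params": {}
            }
        }
    ]
}
```

All other keys (`path`, `folders`, `test_only`, `write_test_output`) fall back to their defaults.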
To generate a config file that contains multiple runs following a parameter grid, use `txcl main generate-config`. For example,

```sh
txcl main generate-config \
    --mode preprocess \
    --name standardize_ag-news \
    --train-data './data/ag-news/train.csv' \
    --test-data './data/ag-news/dev.csv' \
    --model fasttext \
    --globals 'folders:overwrite' \
        'preprocess.standardize_func_name:standardize' \
    --params 'preprocess.lower_case:val:true,false' \
        'preprocess.remove_punctuation:val:true,false'
```

will generate a config file with 4 runs with varying `lower_case` and `remove_punctuation` preprocessing keys. To learn more about this command, run

```sh
txcl main generate-config --help
```
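Conceptually, the command above enumerates the 2x2 grid of parameter values. A sketch of what the generated config could contain is below (the run-name scheme and exact layout are illustrative, not the tool's actual output):

```json
{
    "globals": {
        "folders": "overwrite",
        "preprocess": {"standardize_func_name": "standardize"}
    },
    "runs": [
        {"name": "standardize_ag-news_0", "preprocess": {"lower_case": true, "remove_punctuation": true}},
        {"name": "standardize_ag-news_1", "preprocess": {"lower_case": true, "remove_punctuation": false}},
        {"name": "standardize_ag-news_2", "preprocess": {"lower_case": false, "remove_punctuation": true}},
        {"name": "standardize_ag-news_3", "preprocess": {"lower_case": false, "remove_punctuation": false}}
    ]
}
```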
In this example, you will train a FastText classifier on the AG's News Topic Classification Dataset example data.

- Go to `path/to/text-classification/test` on your machine

  ```sh
  cd path/to/text-classification/test
  ```

- Check out the data

  ```sh
  head -n 10 data/ag-news/all.csv
  ```
  You should see something like this:

  label | text
  ---|---
  2 | Finish line is in sight for Martin
  1 | Blair backs India's quest for permanent seat on UN Security Council (AFP)
  3 | "Ford Expands Recall of SUVs to 600,000"

  The dataset contains 4 types of labels:

  - 1 = World
  - 2 = Sports
  - 3 = Business
  - 4 = Sci/Tech

  Note: It is important that the CSV files (train, validation and test) have columns named `text` and `label`.
- Check out the configs (preprocessing and training)

  ```sh
  cat configs/cli/config.preprocess.ag-news.json
  cat configs/cli/config.train.ag-news.json
  ```

- Preprocess the data

  ```sh
  txcl main preprocess -c configs/cli/config.preprocess.ag-news.json
  ```

  You can find your preprocessed data, along with an exhaustive config file and label mapping, in `./output/preprocess_standardize`.
- Train a model

  Running this command trains and then automatically evaluates the model on the test set. If no `-c` option is given, it will look for a file called `config.json` in the current folder.

  ```sh
  txcl main train -c configs/cli/config.train.ag-news.json
  ```

  The trained models' artefacts, performance scores and run logs can be found in `./output/train_fasttext_lr_0.05_ws_3`, `./output/train_fasttext_lr_0.005_ws_3`, `./output/train_fasttext_lr_0.05_ws_5` and `./output/train_fasttext_lr_0.005_ws_5`.
- View the results

  After training, you can enter your output folder and run the list-runs command

  ```sh
  cd output
  txcl main ls
  ```

  to get a list of all trained models. Output:

  ```
  List runs
  ---------
  - Mode: Validation
  - Pattern: *

                                f1_macro  precision_macro  recall_macro  accuracy
  name
  train_fasttext_lr_0.005_ws_5  0.860970         0.861144      0.861245  0.860972
  train_fasttext_lr_0.005_ws_3  0.860365         0.860538      0.860666  0.860384
  train_fasttext_lr_0.05_ws_3   0.861411         0.861472      0.861625  0.861364
  train_fasttext_lr_0.05_ws_5   0.860968         0.861013      0.861198  0.860933
  ```
- Test the model of your choice

  After you finish the fine-tuning process, pick your best model and test it with

  ```sh
  cd ..
  txcl main test -r output/train_fasttext_lr_0.05_ws_3
  ```
Feel free to add new text classification models to this project. All trained models inherit from a `BaseModel` class defined under `models/`. It contains a blueprint of the methods any new model should implement.