Cancer Hallmark Annotator
CHmAnnot is an automatic annotator of cancer hallmarks in biomedical literature. It supports abstract-level annotation: given the abstract of a paper in PubMed, it predicts the hallmark labels related to that paper. It is mainly used to evaluate computational models and tune model parameters. It also provides several utility functions to manipulate the dataset and post-process the results.
The following instructions will help you get a copy of the source code as well as the datasets, and run the programs on your own machine.
First, install a Python interpreter (tested 2.7.12) and the following packages:
- numpy (tested 1.11.1)
- scipy (tested 1.11.1)
- matplotlib (tested 1.5.1)
- pandas (tested 0.18.1)
- scikit-learn (tested 0.17.1)
- pyyaml (tested 3.11)
- openpyxl (tested 2.3.2)
- rdflib [optional] (tested 4.2.1)
The simplest way to get started is to use the Anaconda Python distribution. If you have limited disk space, the Miniconda installer is recommended. After installing Miniconda and adding the path of folder
$PATH variable, run the following command:
conda install scikit-learn pandas matplotlib openpyxl
Download the Source Code
You can clone the repository of this project and then update the submodule after entering the main folder:
git clone https://github.com/cskyan/chmannot.git
cd chmannot
git submodule update --init --recursive
Or you can clone the repository and submodules simultaneously:
git clone --recursive https://github.com/cskyan/chmannot.git
Configure Environment Variable
- Add the path of folder
$PATH variable so that you can run the scripts wherever you want. Remember to grant execution permissions to all the files located in
- Add the path of folder
$PYTHONPATH variable so that the Python Interpreter can find the library
The global configuration file is stored as
etc/config.yaml. The configurations of different functions in different modules are separated, which looks like the code snippet below.
MODULE1:
- function: FUNCTION1
  params:
    PARAMETER1: VALUE1
    PARAMETER2: VALUE2
- function: FUNCTION2
  params:
    PARAMETER1: VALUE1
MODULE2:
- function: FUNCTION1
  params:
    PARAMETER1: VALUE1
Hence you can access a specific parameter VALUE using the triple (MODULE, FUNCTION, PARAMETER). The utility module
bionlp.util.io can be used to read the parameters in the configuration file:
import bionlp.util.io as io
cfgr = io.cfg_reader(CONFIG_FILE_PATH)
cfg = cfgr(MODULE, FUNCTION)
VALUE = cfg[PARAMETER]
Parameters under the function init are defined in module scope, while parameters under the function common are shared among all the functions inside the corresponding module.
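To make the triple-based lookup concrete, here is a minimal sketch of how a parsed configuration could be queried. It is not the implementation in bionlp.util.io: the helper name lookup_params is hypothetical, the nested structure stands in for what a YAML parser would return for the layout shown above, and merging the common entry as defaults is an assumption about how "shared" parameters might be applied.

```python
# Parsed form of a configuration laid out like the snippet above
# (a stand-in for what a YAML parser would return). All names are placeholders.
config = {
    "MODULE1": [
        {"function": "common", "params": {"PARAMETER1": "SHARED"}},
        {"function": "FUNCTION1",
         "params": {"PARAMETER1": "VALUE1", "PARAMETER2": "VALUE2"}},
        {"function": "FUNCTION2", "params": {"PARAMETER2": "VALUE2"}},
    ],
}


def lookup_params(config, module, function):
    """Hypothetical helper: return the params of (module, function).

    Assumption: parameters listed under 'common' act as defaults that a
    function-specific entry may override.
    """
    entries = config.get(module, [])
    merged = {}
    for name in ("common", function):
        for entry in entries:
            if entry.get("function") == name:
                merged.update(entry.get("params") or {})
    return merged


cfg = lookup_params(config, "MODULE1", "FUNCTION1")
value = cfg["PARAMETER1"]  # function-specific value wins over the shared one
```

Looking up FUNCTION2 in this sketch returns its own PARAMETER2 together with the shared PARAMETER1, which is one plausible reading of "shared among all the functions".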
Locate the Pre-Generated Dataset
After cloning the repository, you can download some pre-generated datasets here. The datasets described below are organized as CSR sparse matrices, stored in compressed
npz files using the function of
| File | Description |
|------|-------------|
| udt_orig_X.npz | Standard dataset filtered by UDT |
| udt_exp_X.npz | Expanded dataset filtered by UDT |
| dt_orig_X.npz | Standard dataset filtered by DT |
| dt_exp_X.npz | Expanded dataset filtered by DT |
| union_filt_X.npz | Standard dataset filtered by DF |
| X_[0-9].npz | Separated standard dataset |
| Y.npz | Cancer hallmark labels |
| y_[0-9].npz | Separated cancer hallmark label |
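Since the datasets are stored as CSR sparse matrices, a short illustration of that layout may help when inspecting the files. The sketch below uses plain Python lists rather than the project's loader; the function name csr_to_rows and the toy values are made up for illustration. CSR keeps three arrays: the non-zero values (data), their column positions (indices), and row boundaries into those arrays (indptr).

```python
def csr_to_rows(data, indices, indptr, n_cols):
    """Expand CSR arrays into a dense list of rows (illustration only)."""
    rows = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        # Entries of row r live in data[indptr[r]:indptr[r + 1]].
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows


# Toy 3x3 matrix with three non-zero entries; the second row is empty.
data = [1, 2, 3]
indices = [0, 2, 1]
indptr = [0, 2, 2, 3]
dense = csr_to_rows(data, indices, indptr, n_cols=3)
# dense == [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
```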
In order to locate the dataset you want to use, please rename it to 'X.npz', and change the parameter
DATA_PATH of module
etc/config.yaml into the location of 'X.npz'.
You can load a dataset into a Pandas DataFrame, with the corresponding
PMID as index and each feature as column name, using the utility function
import bionlp.util.io as io
X = io.read_df('X.npz', with_idx=True, sparse_fmt='csr')
A Simple Example
You can run a demo using the following command:
chm_annot.py -m demo
If your operating system is Windows, please use the Python Interpreter to execute the python scripts:
python chm_annot.py -m demo
This demo will automatically download a dataset and perform a 5-fold cross validation on the proposed method UDT-RF. The log is printed to standard output and the results are saved on the disk.
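For readers unfamiliar with K-fold cross validation, the following sketch shows how sample indices can be partitioned into train and test folds. It is an illustration in plain Python, not the code the demo runs; the function name kfold_indices is hypothetical, and contiguous unshuffled folds are an assumption made for simplicity.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs.

    The first n_samples % k folds receive one extra sample so that every
    sample appears in exactly one test fold.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = kfold_indices(10, 5)
# Each of the 5 folds holds out 2 samples and trains on the other 8.
```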
For the best performance, tune the parameters of your selected model and write them to the model configuration file so that the tuned parameters can be used for model evaluation.
Setup parameter range
You can edit the function
bin/chm_annot.py to change the range of parameter tuning. Please uncomment the code lines corresponding to your selected model and change the range of the parameters or append other values you want to test.
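As an illustration of what a parameter range amounts to, the sketch below expands a dictionary of candidate values into every combination an exhaustive search would visit. The helper expand_grid and the parameter names are hypothetical and are not taken from bin/chm_annot.py; the actual ranges in the script may be structured differently.

```python
import itertools


def expand_grid(param_ranges):
    """Return one dict per combination of the candidate values."""
    names = sorted(param_ranges)
    value_lists = [param_ranges[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


# Hypothetical ranges for a tree-ensemble classifier.
grid = expand_grid({"n_estimators": [50, 100], "max_depth": [5, 10, None]})
# 2 * 3 = 6 candidate settings
```

Appending one more value to a range multiplies the number of candidates, so wide ranges can make exhaustive tuning slow.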
Run parameter tuning script
You can choose an approach for parameter tuning using the following command.
chm_annot.py -t -r
Convert the result to a configuration file
You can use the utility function in
bin/chm_helper.py to transform your tuning result with the following command:
chm_helper.py -m n2y -l TUNING_OUTPUT_FOLDER_PATH
Then copy the basename of the configuration file ending with
.yaml to the parameter
mdl_cfg of module
The pre-tuned parameters for some models are stored in
You can combine different feature selection models and classification models into a pipeline that serves as the final computational model.
You can uncomment the corresponding code lines of the models you want to evaluate in function
bin/chm_annot.py for feature selection and classification respectively.
In addition, you can use command line parameter
-c to adopt the pre-combined model in function
gen_cb_models. To make use of the parameters stored in configuration file, you can use command line parameter
-c -b to adopt the pre-combined model with optimized parameters.
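The idea of pairing feature selection with classification can be sketched as a simple cross product of model names. This is only an illustration: the helper combine_models is hypothetical, the short names are the ones that appear in this document (UDT, DT, DF, RF), and treating all three filters as interchangeable selectors is an assumption. The real pipelines are built inside bin/chm_annot.py.

```python
import itertools

# Short names as they appear in this document; the lists are illustrative.
feature_selectors = ["UDT", "DT", "DF"]
classifiers = ["RF"]


def combine_models(selectors, clfs):
    """Return (pipeline_name, selector, classifier) for every pairing."""
    return [("%s-%s" % (fs, clf), fs, clf)
            for fs, clf in itertools.product(selectors, clfs)]


pipelines = combine_models(feature_selectors, classifiers)
# The first entry is named "UDT-RF", matching the method used in the demo.
```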
You can re-generate the dataset from the pre-processed files stored in
DATA_PATH using the following command:
chm_gendata.py -m gen
It will also generate separated label data
y_[0-9].npz for single label running.
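Conceptually, separating the label data means taking one column of the samples-by-labels matrix per hallmark. The sketch below shows this with plain lists; split_labels and the toy matrix are made up for illustration and say nothing about how chm_gendata.py stores the result.

```python
def split_labels(Y):
    """Split a samples-by-labels 0/1 matrix into one vector per label."""
    return [list(column) for column in zip(*Y)]


# Toy label matrix: 3 samples, 3 labels.
Y = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
per_label = split_labels(Y)
# per_label[0] is the label vector for the first label across all samples.
```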
A feature selection method can also be applied to the dataset in advance by uncommenting the corresponding code line in function
If you only want to apply feature selection on the generated dataset or generate separated label data, you can use command line parameter
-l. Make sure your dataset has already been renamed to 'X.npz'; the processed dataset will be generated as 'new_X.npz'.
Common Parameter Setting
- -p [0-9]
specify which label you want to use independently
indicate that you want to use all labels simultaneously
- -k NUM
specify K-fold cross validation
- -a [micro | macro]
specify which average strategy you want to use for multi-label annotation
- -n NUM
specify how many CPU cores you want to use simultaneously for parallel computing
Other parameter specification can be obtained using
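The difference between the micro and macro averaging strategies for multi-label annotation can be seen in a small worked example. The sketch below computes the F1 score both ways from scratch; it is not the evaluation code used by chm_annot.py, the function names are hypothetical, and F1 is chosen here only as a representative metric.

```python
def f1(tp, fp, fn):
    """F1 score from true positive, false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for 0/1 label matrices of equal shape."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single score.
    micro = f1(*[sum(col) for col in zip(*counts)])
    return micro, macro


y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]
micro, macro = micro_macro_f1(y_true, y_pred)
# micro == 0.75, macro == 0.5
```

In this example the frequent first label is predicted perfectly and the rare second label is always wrong, so the macro average (0.5) is pulled down more than the micro average (0.75).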