Hierarchical Multi-Label Text Classification

This repository is a PyTorch implementation of the research project, whose paper was accepted at CIKM'19.

The main objective of the project is to solve the hierarchical multi-label text classification (HMTC) problem. Unlike flat multi-label text classification, HMTC assigns each instance (object) to multiple categories that are organized in a hierarchy. It is a fundamental but challenging task in numerous applications.

Requirements

Python 3.6+
PyTorch 1.1.0+
NumPy
Gensim

Introduction

Many real-world applications organize data in a hierarchical structure, where classes are specialized into subclasses or grouped into superclasses. For example, an electronic document (e.g. web pages, digital libraries, patents and e-mails) is associated with multiple categories, and all these categories are stored hierarchically in a tree or Directed Acyclic Graph (DAG).

The hierarchy provides an elegant way to show the characteristics of the data and offers a multi-dimensional perspective for tackling the classification problem.

The figure shows an example of predefined labels in the hierarchical multi-label classification of patent documents.

Documents are shown as colored rectangles, labels as rounded rectangles.
Circles in the rounded rectangles indicate that the corresponding document has been assigned the label.
Arrows indicate a hierarchical structure between labels.
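
To make the label hierarchy described above concrete, here is a minimal sketch in plain Python. The label names, the `PARENT` mapping, and the helper functions are all hypothetical and chosen for illustration; they cover only the tree case (one parent per label) and are not necessarily how this repository encodes labels.

```python
# Hypothetical toy hierarchy: child label -> parent label (None marks a top-level label).
PARENT = {
    "A": None,
    "B": None,
    "A01": "A",
    "A02": "A",
    "B01": "B",
    "A01/1": "A01",
}


def depth(label):
    """Return the 0-based level of a label (top-level labels are level 0)."""
    level = 0
    while PARENT[label] is not None:
        label = PARENT[label]
        level += 1
    return level


def with_ancestors(labels):
    """Expand a label set so every ancestor of an assigned label is included."""
    expanded = set()
    for label in labels:
        while label is not None:
            expanded.add(label)
            label = PARENT[label]
    return expanded


def multi_hot_per_level(labels):
    """Encode a label set as one multi-hot vector per hierarchy level."""
    levels = {}
    for label in sorted(PARENT):
        levels.setdefault(depth(label), []).append(label)
    assigned = with_ancestors(labels)
    return [[1 if name in assigned else 0 for name in levels[d]] for d in sorted(levels)]


doc_labels = {"A01/1", "B01"}
print(sorted(with_ancestors(doc_labels)))  # ['A', 'A01', 'A01/1', 'B', 'B01']
print(multi_hot_per_level(doc_labels))     # [[1, 1], [1, 0, 1], [1]]
```

Each inner list corresponds to one level of the hierarchy, so a document assigned a deep label is also marked with that label's ancestors at the levels above it.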

Data

See the data format in the data folder, which includes sample data files.

Text Segment

You can use the jieba package if you are working with Chinese text data.
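
As a rough illustration of the segmentation step, the sketch below wraps `jieba.lcut` in a small helper. The `segment` function and the character-level fallback used when jieba is not installed are assumptions of this example, not part of the repository.

```python
def segment(text):
    """Split text into tokens: jieba words when available, else single characters."""
    try:
        import jieba  # third-party package: pip install jieba
    except ImportError:
        # Fallback assumed for this sketch only: one token per non-space character.
        return [ch for ch in text if not ch.isspace()]
    # jieba.lcut returns a list of tokens; drop whitespace-only tokens.
    return [tok for tok in jieba.lcut(text) if tok.strip()]


tokens = segment("层次多标签文本分类")
print(" ".join(tokens))
```

Joining the tokens with spaces gives whitespace-delimited text, which is a common input form for word-vector training tools.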

Data Format

This repository can be used with other text classification datasets in two ways:

Convert your dataset into the same format as the sample.
Modify the data preprocessing code in data_helpers.py.

Which option is appropriate depends on your data and task.

Pre-trained Word Vectors

You can pre-train word vectors on your own corpus in several ways:

Use the gensim package.
Use the GloVe tools.
Alternatively, train a fastText model.

Network Structure

Reference

If you want to follow the paper or use the code, please cite the following in your work:

@inproceedings{huang2019hierarchical,
  author    = {Wei Huang and
               Enhong Chen and
               Qi Liu and
               Yuying Chen and
               Zai Huang and
               Yang Liu and
               Zhou Zhao and
               Dan Zhang and
               Shijin Wang},
  title     = {Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach},
  booktitle = {Proceedings of the 28th {ACM} {CIKM} International Conference on Information and Knowledge Management, {CIKM} 2019, Beijing, CHINA, Nov 3-7, 2019},
  pages     = {1051--1060},
  year      = {2019},
}
