Deep Learning Techniques for Text Classification

Master's Thesis Project in the Computer Control and Automation Program at Nanyang Technological University, Singapore.

Project Summary

The experiment evaluates the performance of several popular deep learning models, such as feedforward, recurrent, convolutional, and ensemble-based neural networks, on five different datasets. We build each model on top of two separate feature-extraction methods, bag-of-words and word embedding, to capture the information within the text.

The results show that word embedding provides a robust feature extractor for all the models, leading to better final predictions. The experiments also highlight the effectiveness of the ensemble-based and temporal convolutional neural networks, which achieve strong performance and even compete with the state-of-the-art benchmark models.

Datasets

| Dataset | Classes | Average Sentence Length | Dataset Size | Vocab Size | Test Size* |
| ------- | ------- | ----------------------- | ------------ | ---------- | ---------- |
| MR      | 2       | 20                      | 10662        | 18758      | CV         |
| SUBJ    | 2       | 23                      | 10000        | 21322      | CV         |
| TREC    | 6       | 10                      | 5952         | 8759       | 500        |
| CR      | 2       | 19                      | 3775         | 5334       | CV         |
| MPQA    | 2       | 3                       | 10606        | 6234       | CV         |
  • MR. Movie Reviews – classifying a review as positive or negative [1]. link
  • SUBJ. Subjectivity – classifying a sentence as subjective or objective [2]. link
  • TREC. Text REtrieval Conference – classifying a question into six categories (a person, location, numeric information, etc.) [3]. link
  • CR. Customer Reviews – classifying a product review (cameras, MP3s, etc.) as positive or negative [4]. link
  • MPQA. Multi-Perspective Question Answering – opinion polarity detection [5]. link

*In the Test Size column, CV stands for cross-validation: the original dataset has no standard train/test split, so we use a 10-fold CV. The AcademiaSinicaNLPLab repository [6] provides access to all these datasets.
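
For the datasets evaluated with CV, the scheme can be reproduced along the lines of the sketch below. It assumes a model object with an sklearn-style fit/score interface; `build_model` is a hypothetical stand-in for any of the models described later, and the stratified split is an assumption made here to keep class ratios stable across folds.

```python
# Minimal 10-fold cross-validation sketch (build_model is hypothetical).
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(texts, labels, build_model, n_splits=10, seed=42):
    texts, labels = np.asarray(texts), np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(texts, labels):
        model = build_model()                        # fresh model per fold
        model.fit(texts[train_idx], labels[train_idx])
        scores.append(model.score(texts[test_idx], labels[test_idx]))
    return float(np.mean(scores))                    # average accuracy over folds
```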

The Proposed Deep Learning Models

| Model*               | Bag-of-Words | WE-avg | WE-random | WE-static | WE-dynamic |
| -------------------- | ------------ | ------ | --------- | --------- | ---------- |
| SNN                  | ✓            | ✓      | -         | -         | -          |
| edRVFL               | ✓            | ✓      | -         | -         | -          |
| 1D CNN               | -            | -      | ✓         | ✓         | ✓          |
| TCN                  | -            | -      | ✓         | ✓         | ✓          |
| BiGRU/BiLSTM         | -            | -      | ✓         | ✓         | ✓          |
| Stacked BiGRU/BiLSTM | -            | -      | ✓         | ✓         | ✓          |
| Ensemble CNN-GRU     | -            | -      | ✓         | ✓         | ✓          |

WE: word embedding (Word2Vec, GloVe, fastText, BERT, etc.).
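
In Keras terms, the random, static, and dynamic modes differ only in how the `Embedding` layer is initialized and whether its weights are trainable (WE-avg instead feeds averaged word vectors to the SNN/edRVFL models). A minimal sketch, where the sizes are placeholders and `embedding_matrix` stands in for the pre-trained vectors:

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

vocab_size, embed_dim = 20000, 300                    # placeholder sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))  # filled from pre-trained vectors in practice

# random: weights learned from scratch during training
we_random = Embedding(vocab_size, embed_dim, trainable=True)

# static: pre-trained vectors (e.g. Word2Vec), frozen during training
we_static = Embedding(vocab_size, embed_dim,
                      embeddings_initializer=Constant(embedding_matrix),
                      trainable=False)

# dynamic: pre-trained vectors, fine-tuned along with the rest of the model
we_dynamic = Embedding(vocab_size, embed_dim,
                       embeddings_initializer=Constant(embedding_matrix),
                       trainable=True)
```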

*For all models, training uses early stopping, so the model stops training before it overfits the training data.
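
With Keras, early stopping is typically the standard callback shown below; the monitored metric and patience value here are illustrative guesses, not the thesis's exact settings.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss stops improving and roll back to the best weights.
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=100, callbacks=[early_stop])
```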

The benchmarks used in this work:

  • CNN-multichannel (Yoon Kim, 2014) [7]
  • SuBiLSTM (Siddhartha Brahma, 2018) [8]
  • SuBiLSTM-Tied (Siddhartha Brahma, 2018) [8]
  • USE_T+CNN (Cer et al., 2018) [9]

Results

The proposed models against benchmarks

| Model                                       | MR   | SUBJ  | TREC | CR   | MPQA |
| ------------------------------------------- | ---- | ----- | ---- | ---- | ---- |
| edRVFL-BoW                                  | 76.2 | 89.4  | 75.2 | 78.0 | 85.0 |
| edRVFL-avg                                  | 77.0 | 90.6  | 83.6 | 78.5 | 86.7 |
| SNN-a/b/c-BoW                               | 77.4 | 90.8  | 76.2 | 79.7 | 86.0 |
| SNN-c-avg                                   | 78.3 | 91.6  | 85.8 | 80.5 | 87.6 |
| 1D CNN-rand (baseline)                      | 77.6 | 92.05 | 89.8 | 80.4 | 86.4 |
| 1D CNN-static                               | 79.0 | 92.51 | 92.2 | 81.4 | 88.6 |
| 1D CNN-dynamic                              | 79.4 | 92.8  | 91.6 | 82.2 | 87.5 |
| TCN-rand                                    | 77.3 | 91.4  | 90.0 | 81.2 | 86.3 |
| TCN-static                                  | 80.3 | 92.3  | 93.6 | 83.9 | 88.3 |
| TCN-dynamic                                 | 80.0 | 92.4  | 91.8 | 82.9 | 88.1 |
| BiLSTM-rand                                 | 77.6 | 91.9  | 88.4 | 80.6 | 86.3 |
| BiLSTM-static                               | 79.5 | 92.5  | 90.4 | 81.7 | 88.2 |
| BiLSTM-dynamic                              | 79.8 | 92.6  | 88.8 | 81.8 | 88.0 |
| BiGRU-rand                                  | 77.2 | 92.2  | 89.0 | 80.1 | 86.1 |
| BiGRU-static                                | 79.5 | 92.3  | 91.8 | 82.4 | 88.1 |
| BiGRU-dynamic                               | 79.2 | 93.0  | 90.6 | 81.6 | 88.1 |
| Stacked BiLSTM-rand                         | 77.7 | 91.9  | 89.6 | 79.7 | 86.1 |
| Stacked BiLSTM-static                       | 79.4 | 92.2  | 91.6 | 80.9 | 88.1 |
| Stacked BiLSTM-dynamic                      | 80.0 | 92.5  | 88.4 | 81.7 | 88.1 |
| Stacked BiGRU-rand                          | 76.9 | 92.3  | 89.2 | 80.1 | 85.9 |
| Stacked BiGRU-static                        | 79.6 | 92.3  | 92.0 | 81.5 | 88.1 |
| Stacked BiGRU-dynamic                       | 79.5 | 92.7  | 91.0 | 81.6 | 88.0 |
| Ensemble CNN-GRU-rand                       | 77.0 | 91.7  | 88.0 | 80.9 | 86.3 |
| Ensemble CNN-GRU-static                     | 79.8 | 92.7  | 93.0 | 82.5 | 88.4 |
| Ensemble CNN-GRU-dynamic                    | 79.4 | 92.6  | 89.6 | 82.4 | 88.0 |
| CNN-multichannel (Yoon Kim, 2014) [7]       | 81.1 | 93.2  | 92.2 | 85.0 | 89.4 |
| SuBiLSTM (Siddhartha Brahma, 2018) [8]      | 81.4 | 93.2  | 89.8 | 86.4 | 90.7 |
| SuBiLSTM-Tied (Siddhartha Brahma, 2018) [8] | 81.6 | 93.0  | 90.4 | 86.5 | 90.5 |
| USE_T+CNN (Cer et al., 2018) [9]            | 81.2 | 93.6  | 98.1 | 87.5 | 87.3 |

The average accuracy margin of each model over the baseline on the five datasets.

  • The green bars represent the benchmark models.
  • The purple bars depict the top six proposed models that beat the baseline.
  • The red bar is the proposed model with the lowest accuracy margin.
  • A minus (-) sign indicates the model's average accuracy falls below the baseline across the datasets.

The average rank value (ARV) of each model against the benchmarks.

  • The figure shows that the top six proposed models (violet bars) achieve high average ranks and can compete with the benchmarks (green bars).
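
Both summary statistics can be recomputed directly from the results table; a sketch using pandas, shown here on three rows of the table with 1D CNN-rand as the baseline:

```python
import pandas as pd

# Three rows from the results table above; accuracies in percent.
results = pd.DataFrame(
    {"MR":   [77.6, 80.3, 79.8],
     "SUBJ": [92.05, 92.3, 92.7],
     "TREC": [89.8, 93.6, 93.0],
     "CR":   [80.4, 83.9, 82.5],
     "MPQA": [86.4, 88.3, 88.4]},
    index=["1D CNN-rand (baseline)", "TCN-static", "Ensemble CNN-GRU-static"])

# Average accuracy margin: mean difference to the baseline over the 5 datasets.
margin = (results - results.loc["1D CNN-rand (baseline)"]).mean(axis=1)

# Average rank value: rank the models per dataset (1 = best), then average.
arv = results.rank(axis=0, ascending=False).mean(axis=1)
print(margin, arv, sep="\n\n")
```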

Random vs. Static vs. Dynamic

The figure illustrates the effect of different word embedding modes on the model performance.

  • The static word embedding using pre-trained Word2Vec consistently performs best. Compared with the random mode, the static mode can raise a model's average accuracy by up to 3%.
  • The dynamic Word2Vec can still improve model performance; however, the change is not significant, and in some cases a model can even end up with lower accuracy.
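
Building the pre-trained matrix behind the static and dynamic modes usually looks like the gensim-based sketch below. The model name is the standard Google News Word2Vec release; `word_index` (a token-to-id mapping, e.g. from a Keras Tokenizer) is an assumed input, and the uniform initialization for out-of-vocabulary words follows common practice rather than a confirmed detail of this project.

```python
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")   # pre-trained Word2Vec vectors

def build_embedding_matrix(word_index, dim=300):
    # Row 0 is reserved for padding; OOV words keep a small random init.
    matrix = np.random.uniform(-0.25, 0.25, (len(word_index) + 1, dim))
    matrix[0] = 0.0
    for word, i in word_index.items():
        if word in w2v:
            matrix[i] = w2v[word]
    return matrix
```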

Conclusion

This thesis has presented a comprehensive experiment on building deep learning models with two different feature-extraction methods on five text classification datasets. In conclusion, the following are the essential insights from this project:

  • With a suitable feature extraction, such as word embedding, a deeper neural network can deliver a better final prediction;
  • In the edRVFL model, sigmoid works best as the activation function for the text classification task;
  • To represent text using BoW, binary proves to be the best word-scoring method, followed by freq, count, and TF-IDF (see the sketch after this list);
  • Any model built on top of word embedding performs exceptionally well;
  • Using a pre-trained word embedding such as Word2Vec can increase model accuracy by a large margin;
  • TCN is an excellent alternative to recurrent architectures and has proven effective in classifying text data;
  • Ensemble learning-based models can make better predictions than a single model trained independently;
  • TCN and Ensemble CNN-GRU are the best-performing models we obtained in this series of text classification tasks.
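
The four BoW word-scoring methods map directly onto the modes of the Keras `Tokenizer`; a minimal sketch with a toy corpus as a placeholder:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

docs = ["the movie was great", "the movie was terrible"]   # toy corpus
tok = Tokenizer()
tok.fit_on_texts(docs)

# The four word-scoring methods, best-performing first (per the findings above).
for mode in ("binary", "freq", "count", "tfidf"):
    X = tok.texts_to_matrix(docs, mode=mode)   # shape: (n_docs, vocab_size + 1)
    print(mode, X[0])
```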

References

  • [1] B. Pang, L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales”, In Proceedings of ACL’05, 2005.
  • [2] B. Pang, L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts”, In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), 2004.
  • [3] X. Li, D. Roth, “Learning question classifiers”, In Proceedings of COLING ’02, 2002.
  • [4] M. Hu, B. Liu, “Mining and summarizing customer reviews”, In Proceedings of KDD ’04, 2004.
  • [5] J. Wiebe, T. Wilson, and C. Cardie, “Annotating expressions of opinions and emotions in language”, Language Resources and Evaluation, 39(2):165–210, 2005.
  • [6] AcademiaSinicaNLPLab, “sentiment_dataset”, https://github.com/AcademiaSinicaNLPLab/sentiment_dataset, January, 2021.
  • [7] Y. Kim, “Convolutional Neural Networks for Sentence Classification”, In Proceedings of EMNLP, 2014.
  • [8] S. Brahma, “Improved Sentence Modeling using Suffix Bidirectional LSTM”, arXiv, September, 2018.
  • [9] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, “Universal Sentence Encoder”, arXiv, April, 2018.
