We conducted a study on sentiment analysis of Twitter data to predict disasters, comparing bag-of-words (BOW), context-free, and contextual embeddings. We used three traditional machine learning models (decision tree, random forest, and logistic regression) and three popular pre-trained context-free embeddings (Skip-gram, FastText, and GloVe). For contextual embeddings, we used the pre-trained BERT (bert-base-uncased) model.
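As an illustration of the BOW representation mentioned above, here is a minimal sketch using only the Python standard library. It is not the code used in this repository: the function names `build_vocabulary` and `bow_vector`, the whitespace tokenization, and the sample tweets are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (hypothetical helper)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            # Assign the next free index the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(text, vocab):
    """Return the token-count vector of `text` over `vocab` (hypothetical helper)."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = n
    return vector


# Made-up sample tweets, for illustration only.
tweets = ["forest fire near the town", "the fire the fire"]
vocab = build_vocabulary(tweets)
vectors = [bow_vector(t, vocab) for t in tweets]
```

Each tweet becomes a fixed-length count vector that a classifier such as logistic regression can consume. Word order and context are discarded, which is the limitation the embedding-based models address.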
We used data from a Kaggle competition. To download the data, please visit the following link: https://www.kaggle.com/c/nlp-getting-started
To download the pre-trained embeddings, please visit the following links:
- https://nlp.stanford.edu/projects/glove/
- https://github.com/google-research/bert
- "Advances in Pre-Training Distributed Word Representations", Mikolov T. G. et al., Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)
To reproduce the experimental results, follow these steps:
- Download the data from the Kaggle competition and place it in the "data" folder.
- Download the pre-trained embeddings from the links above and place them in the same folder as the source code. Set the embedding file name in the Python code that loads it. For example, update the GLOVE_EMB variable in GloVe_softmax.py with the name of your GloVe embedding file.
- Run "data_analysis.py" to see basic statistics for the training dataset.
- Run the other Python files one by one to obtain the results of the different machine learning models on the different embeddings.
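To show what the GLOVE_EMB setting refers to, the sketch below parses the plain-text GloVe format (one word per line followed by its space-separated vector values). It is an assumption-laden illustration, not the loader in GloVe_softmax.py: the helper names `load_glove` and `load_glove_file` are hypothetical, and the file name is only an example of a file from the GloVe download page.

```python
import io

# Example value only; set this to the name of the file you downloaded.
GLOVE_EMB = "glove.6B.100d.txt"


def load_glove(lines):
    """Parse GloVe text-format lines into a {word: vector} dict (hypothetical helper)."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings


def load_glove_file(path=GLOVE_EMB):
    """Open the embedding file at `path` and parse it (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return load_glove(f)


# Tiny in-memory stand-in for a real embedding file, with made-up values.
sample = io.StringIO("fire 0.1 0.2\nflood 0.3 0.4\n")
sample_embeddings = load_glove(sample)
```

With a real file in place, `load_glove_file()` would return the full vocabulary. In practice the vectors would usually be converted to numpy arrays or a torch tensor before training.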
The following environment is used for the implementation:
python==3.6.2
torch==0.4.1
numpy==1.15.1
sklearn==0.19.2
Please acknowledge the following work in papers or derivative software:
@article{deb2022comparative,
title={Comparative analysis of contextual and context-free embeddings in disaster prediction from Twitter data},
author={Deb, Sumona and Chanda, Ashis Kumar},
journal={Machine Learning with Applications},
pages={100253},
year={2022},
publisher={Elsevier}
}