Deep Sentiment SE Cross-Platform

This repository contains the data, code, pre-trained models, and experiment results for our research project: An Empirical Study of Three Deep Learning Sentiment Detection Tools for SE in Cross-Platform Settings.

Overview of Sentiment4SE

A deep learning based Sentiment Analysis tool for Software Engineering datasets.

In this research work, we studied the performance of different sentiment analysis tools on Software Engineering datasets. We specifically focused on the performance of newly developed deep learning based tools: BERT4SentiSE, SEntiMoji, and RNN4SentiSE.

Benchmark Datasets

For this study, we used three benchmark cross-platform datasets, from GitHub, Jira, and Stack Overflow:

The GitHub dataset contains around 7000 pull request and commit comments. The dataset is well balanced: 28% positive, 29% negative, and 43% neutral emotions.
The Jira dataset contains around 6000 issue comments and sentences from open source software projects (e.g., Apache, Spring), annotated by software developers. This dataset was originally labelled with six emotions (i.e., love, joy, surprise, anger, fear, and sadness). To be consistent with the other datasets, we map love and joy to a positive emotion, and anger and sadness to a negative emotion. Surprise cases are discarded because they could be either positive or negative. Finally, the absence of emotions is labelled as neutral. The dataset is not well balanced: 19% positive, 14% negative, and 67% neutral emotions.
The Stack Overflow dataset contains 4423 Stack Overflow posts, including questions, comments, and answers, manually annotated by twelve trained coders. Each post was annotated by three raters and received its final polarity by majority voting. The dataset is quite well balanced: 35% positive, 27% negative, and 38% neutral emotions.
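
The label handling described above for the Jira and Stack Overflow datasets can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the string labels are assumptions. The text does not say how fear is mapped or how a three-way disagreement between raters is resolved, so the sketch leaves fear unmapped and returns `None` on a tie.

```python
from collections import Counter
from typing import Optional, Sequence

# Emotion-to-polarity mapping stated for the Jira dataset.
# A value of None means "discard this sample".
# "fear" is deliberately absent: the text above does not say how it is handled.
EMOTION_TO_POLARITY = {
    "love": "positive",
    "joy": "positive",
    "anger": "negative",
    "sadness": "negative",
    "surprise": None,  # could be positive or negative, so it is dropped
}


def to_polarity(emotion: Optional[str]) -> Optional[str]:
    """Map an emotion label to a polarity; None/empty input means no emotion."""
    if not emotion:
        return "neutral"
    key = emotion.strip().lower()
    if key not in EMOTION_TO_POLARITY:
        raise ValueError(f"no polarity mapping stated for emotion: {emotion!r}")
    return EMOTION_TO_POLARITY[key]


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by most raters, or None when the top two tie."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie handling is an assumption of this sketch
    return ranked[0][0]


if __name__ == "__main__":
    print(to_polarity("joy"))
    print(majority_vote(["positive", "positive", "neutral"]))
```

Samples for which `to_polarity` returns `None` would be filtered out before training or evaluation.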

Sentiment Analysis Tools

For this study, we used three deep learning based tools, two shallow supervised ML tools, and one rule-based tool.

Deep learning based tools:
1. SEntiMoji: proposed by Chen et al. in 2019, is an SE-customized sentiment classifier based on DeepMoji. It learns vector representations of text by leveraging how emojis are used alongside text on Twitter and GitHub. The authors reported that it outperforms existing methods on Lin et al.’s dataset.
2. BERT4SentiSE: proposed by Biswas et al. in 2020, is a BERT-based pre-trained transformer model. It explores the effectiveness of using a BERT-based model for SA4SE. The authors report that the BERT classifier achieves reliable performance on SE datasets. They also report that BERT combined with a larger dataset provides better results.
3. RNN4SentiSE: proposed by Biswas et al. in 2019, is based on generic word embeddings from Google News data. Its generic word embeddings were also updated using software domain-specific word embeddings from Stack Overflow posts. For this study, we used this tool as a base model for RNN-based SA4SE tools that use neither pre-trained word vectors nor fine-tuning. So, for this tool, we generated our own word embeddings.
Shallow ML based tools:
1. Senti4SD: proposed by Calefato et al. in 2018, is a polarity classifier that uses bag of words (BoW), sentiment lexicons, and word embeddings as features. This toolkit was originally trained and validated on 4K questions and answers from Stack Overflow. It allows customization by taking a gold standard dataset as input.
2. SentiCR: proposed by Ahmed et al. in 2017, computes term frequency-inverse document frequency (TF-IDF) over a bag of words and uses it as features. It pre-processes the input: it handles negations and stop words and removes code snippets. It also leverages SMOTE to handle class imbalance in the training dataset. It is based on Gradient Boosting Tree (GBT) and was originally trained and tested on 2000 code-review comments.
Rule based tools:
1. SentistrengthSE: developed by Md Rakibul Islam and Minhaz F. Zibran on top of SentiStrength by introducing rules and sentiment words specific to Software Engineering.
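
To make the TF-IDF feature mentioned for SentiCR concrete, here is a small dependency-free sketch of the weighting. It is not SentiCR's implementation: the function name `tf_idf`, the whitespace tokenizer, and the unsmoothed formula (relative term frequency times log(N / df)) are assumptions chosen for brevity, and library implementations usually add smoothing and normalization.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / df),
    where N is the number of documents and df is the number of
    documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    # Hypothetical review comments, made up for this illustration.
    comments = ["the fix works", "the fix fails", "the build fails again"]
    for comment, vector in zip(comments, tf_idf(comments)):
        print(comment, vector)
```

A term that occurs in every comment (here "the") receives weight zero, while terms that are specific to one comment receive the largest weights, which is why TF-IDF is a common feature for short-text classifiers.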

Repo Structure

/analysis dir
- Contains the Jupyter notebooks that were used to analyze the outputs and produce the results.
/datasets dir
- Contains the raw and processed datasets.
- /combined.csv file
  - contains the combined dataset that is used by the following deep and shallow machine learning based sentiment analysis tools.
  - contains the 10-fold stratified sampling generated with scikit-learn. This 10-fold sampling is used to report the performance of each tool in within-platform settings.
/generated_output dir
- Contains the combined outputs generated by all the tools on all the datasets.
/manual_labeling dir
- Contains the files that we generated after performing manual labeling.
- error_categorization.csv file contains labeling of BERT4SentiSE and SEntiMoji errors.
- sentistrengthse_errors.csv file contains labeling of SentistrengthSE errors.
/tools dir
- Contains the source code of all the sentiment analysis tools used in this study.
- /deep_learning_based dir
  - /bert4sentise contains replication package of BERT4SentiSE
  - /rnn4sentise contains the complete replication package of RNN4SentiSE
  - /sentimoji contains the complete replication package of SEntiMoji
- /shallow_learning_based dir
  - /senti4sd contains the complete replication package of Senti4SD
  - /senticr contains the complete replication package of SentiCR
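
The 10-fold stratified sampling stored in combined.csv was produced with scikit-learn, whose `StratifiedKFold` class in `sklearn.model_selection` provides this kind of split. The exact call and the column names are not given here, so the following is only a dependency-free sketch of the idea: every fold keeps roughly the same class ratio as the full dataset. The function name `stratified_folds`, the seed, and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], n_folds: int = 10,
                     seed: int = 42) -> List[int]:
    """Return a fold id (0..n_folds-1) for every sample index."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of one class round-robin over the folds,
        # so each fold receives a near-equal share of that class.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


if __name__ == "__main__":
    # Toy labels, made up for this illustration.
    labels = ["positive"] * 30 + ["negative"] * 30 + ["neutral"] * 40
    folds = stratified_folds(labels)
    for k in range(10):
        members = Counter(labels[i] for i, f in enumerate(folds) if f == k)
        print(k, dict(members))
```

In a within-platform evaluation, each fold would serve once as the test set while the remaining nine folds are used for training.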

Note

Each tool's folder contains its own readme.md file, which includes information about the tool from the original repository as well as some of our updated documentation. For each tool, we provide a requirements.txt file that contains the environment configuration we used during our experiments.

Declaration

We upload all the benchmark datasets that we used for this study to this repository for convenience. We do not claim any rights on them because they were not generated and released by us. If you use any of them, please make sure you fulfill the licenses that they were released with and consider citing the original papers. The folders in this repository contain links and license information for the original repositories.
