Dataset, scripts, and additional material for the EMSE submission "Best-Answer Prediction in Technical Q&A Sites"

Dataset, scripts, and additional material for Best-Answer Prediction in Technical Q&A Sites

Original data dumps

Original dumps refer to the data extracted "as is" from the following technical Q&A sites:

Download

The data dumps and the description of their file formats are available from here.

Experimental datasets

The datasets containing the features extracted from the data dump of each Q&A site are available for download here. A description of each feature is also available.

Python and R scripts

Setup

To ensure proper execution, first run the following commands, which check for all the required R and Python packages and install any that are missing.

$ Rscript requirements.R
$ pip install -r requirements.txt

Automated parameter tuning

To start the automated parameter tuning via caret, run the run-tuning.sh script as described below.

$ run-tuning.sh models_file data_file
  • The models_file param indicates the file containing (one per line) a list of models (learners) to be tuned. See the file models/models.txt for an example.
  • The data_file param indicates the file containing the data to be used for the tuning stage.
  • As output, a TXT file will be created under the output/tuning/ subfolder for each tuned model, containing the best param configuration and execution times.

Note. The tuning step is very time consuming and will take several hours for each model; the more models in the input file, the longer the script will take to finish.
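Because a tuning run can take hours, it may be worth checking the models file before launching it. The snippet below is a minimal sketch, not part of the repository: `read_models` is a hypothetical helper, and it assumes only what is stated above, namely that the file lists one model per line. The names `rf` and `nnet` are illustrative.

```python
import tempfile
from pathlib import Path


def read_models(models_file):
    """Return the model names listed one per line, ignoring blank lines.

    Hypothetical helper: rejects empty files and duplicated entries,
    which would waste hours of tuning time.
    """
    lines = Path(models_file).read_text().splitlines()
    names = [line.strip() for line in lines if line.strip()]
    if not names:
        raise ValueError(f"{models_file} lists no models")
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicated models: {duplicates}")
    return names


# Demo on a throwaway file; the model names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "models.txt"
    demo_file.write_text("rf\n\nnnet\n")
    models = read_models(demo_file)
    print(models)
```

In practice the same function could be pointed at models/models.txt before calling run-tuning.sh.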

Default (untuned) model performance

To compute the AUC performance of each model with its default parameter settings, run the script below.

$ sh run-default-predictions.sh path/to/input/so-dataset.csv path/to/models/models.txt
  • As output, the file output/untuned/AUC-all-models.txt will be created with the AUC values.

Note. The prediction step is very time consuming and will take several hours to complete; the more models in the input file and the larger the dataset chosen, the longer the script will take to finish.
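For readers unfamiliar with the AUC values reported in this step, the following sketch shows one standard way to compute AUC: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. This is an illustration only; the function name `auc` and the toy data are assumptions, and the repository's R scripts may compute the value differently.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: 1 for best answers (positive class), 0 otherwise.
    scores: predicted probability or score for the positive class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairwise wins of positives over negatives; a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (made-up data): 2 positives, 3 negatives.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
result = auc(labels, scores)
print(round(result, 4))
```

The pairwise formulation is quadratic in the number of instances, so it is suitable for illustration rather than for the full datasets.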

Scott-Knott ESD model clustering

To cluster models by AUC performance into non-overlapping groups, run the following scripts:

$ python collect-metrics.py --in path/to/metrics/folder.txt --out outfile --ext file_extension --sep field_sep --runs N
  • path/to/metrics/folder.txt - where the tuning script stored the execution log per model for each run
  • outfile - the name of the file in which to store the following main metrics per model per run:
    • AUC
    • F1
    • G-mean
    • Balance
    • Time taken
  • file_extension - the extension of the output file, chosen in {txt, csv, xls}
  • field_sep - the character used to separate fields in the output file, either , or ;
  • N - the number of runs used in the tuning step (e.g., 10, 100)
$ Rscript skesd-test.R metrics_outfiles runsN
  • metrics_outfiles - the file with metrics generated by the Python script at the previous step
  • runsN - the number of runs, must match the same param from the previous step
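As a reference for the metrics collected above, the sketch below computes F1, G-mean, and Balance from the four cells of a binary confusion matrix. The definitions are the ones commonly used in the prediction literature and are assumed here, not taken from collect-metrics.py: G-mean as the geometric mean of recall and specificity, and Balance as one minus the normalized distance from the ideal ROC point (false-positive rate 0, recall 1). The function name and the counts are hypothetical.

```python
import math


def summary_metrics(tp, fp, tn, fn):
    """Compute F1, G-mean and Balance from confusion-matrix counts.

    Assumes non-degenerate counts (no zero denominators).
    """
    recall = tp / (tp + fn)          # true-positive rate (pd)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)     # true-negative rate
    pf = fp / (fp + tn)              # false-positive rate
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    # Distance from the ideal point (pf=0, pd=1), scaled to [0, 1].
    balance = 1 - math.sqrt(pf ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    return {"F1": f1, "G-mean": g_mean, "Balance": balance}


# Made-up counts for illustration.
m = summary_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```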

Feature selection

The following script performs wrapper-based feature selection using the R package Boruta; for the sake of completeness, it also performs Correlation-based Feature Selection (CFS).

$ Rscript feature-selection.R dataset_file dataset_name featN
  • dataset_file - the dataset used for feature selection
  • dataset_name - the name of the dataset, chosen in {so, docusign, dwolla, scn, yahoo}; so by default
  • featN - the number of features to select, 10 by default
  • As output, the script will generate the file output/feature-selection/feature-subset.txt containing:
    • The output of Boruta
    • The output of CFS, with both Spearman and Pearson correlation values
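To illustrate why both correlation values are reported, the sketch below computes Pearson and Spearman coefficients in plain Python. It does not reproduce CFS or feature-selection.R; it only shows the two coefficients, with Spearman computed as the Pearson correlation of average ranks. All function names and the toy data are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation: measures linear association."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relation (made-up data).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson: {r_pearson:.4f}, Spearman: {r_spearman:.4f}")
```

On this toy data Spearman reaches 1 because the relation is perfectly monotonic, while Pearson stays below 1 because it is not linear.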

Prediction experiment

Once the models have been tuned, you can execute the best-answer prediction experiment. Run the run-predictions.sh script as described below.

$ run-predictions.sh training_file models_file data_file
  • The training_file param indicates the file containing the dataset for training the learners.
  • The models_file param indicates the file containing (one per line) a list of models (learners) to be used in the prediction experiment.
  • The data_file param indicates the file containing the test dataset.
  • As output, the following folders and files will be created:
    • output/cm - containing a TXT file for each test set and model with the confusion matrix
    • output/misclassifications - containing a TXT file for each test set and model, listing the cases where wrong predictions (errors) occurred
    • output/plots - containing a ROC plot image file for each test set and model specified as input
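The relation between the confusion matrix and the misclassification listing can be sketched as follows. This is an illustration under assumed conventions (label 1 marks a best answer; rows are identified by position), not the logic of test.R, and the function name and data are hypothetical.

```python
def confusion_and_errors(actual, predicted, positive=1):
    """Return confusion-matrix counts and the indices of wrong predictions."""
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    errors = []
    for i, (a, p) in enumerate(zip(actual, predicted)):
        if a == p:
            cm["tp" if a == positive else "tn"] += 1
        else:
            # Predicted positive but actually negative is a false positive;
            # the opposite case is a false negative.
            cm["fp" if p == positive else "fn"] += 1
            errors.append(i)
    return cm, errors


# Made-up labels for five answers.
actual = [1, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1]
cm, errors = confusion_and_errors(actual, predicted)
print(cm)
print("misclassified rows:", errors)
```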

Note. Before running the prediction experiment, the file test.R must be manually edited in order to customize the tuneGrid var (dataframe) containing the best param configuration for each learner model. As of now, the script contains the grids for the 4 models in the file models/top-cluster.txt.

Testing the scripts

When executed without running the .sh files (e.g., via RStudio or Rscript), these scripts by default open the test file input/example.csv, which contains a few hundred lines from the Stack Overflow dataset. This test file is intended to show how the scripts work in general and the output they produce. Expect considerably longer execution times when running the scripts with the other input files.