My research codes on StackOverflow dataset
Latest commit db3428f Jul 14, 2019


My Research on StackOverflow

This repository contains the source code developed for my research on StackOverflow.

Published Papers:

Arash Dargahi Nobari, Sajad Sotudeh Gharebagh and Mahmood Neshati, “Skill Translation Models in Expert Finding”,
In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17), Aug 2017.

You may check the paper (PDF) for more information.

Arash Dargahi Nobari, Mahmood Neshati and Sajad Sotudeh Gharebagh, 
“Quality-aware skill translation models for expert finding on StackOverflow”, Information Systems, 2019.

You may check the paper (PDF) for more information.

Requirements

JDK 8 and Apache Lucene 6.2.1 are required to run the code.

The machine learning algorithms are in a separate Python repository that uses TensorFlow.

Data

All data (including the test collection, golden sets, etc.) and libraries are excluded from the Git repository.

These files can be downloaded from Dropbox. The download includes three folders: data_java, data_php and lib.

The lib folder includes all libraries (including Apache Lucene and jsoup) required to run the project.

The data_java and data_php folders contain the following files and folders (the names below are from the Java directory; the PHP dataset uses the same layout):

  • golden: The golden collection described in the paper.
  • java_a_tag.txt: Tags for each answer. (Answers have no tags of their own; their tags are taken from the related question.)
  • java_q_tag.txt: Tags for each question.
  • java_Q_A.txt: Each question and its answer IDs.
  • Posts.xml: This file has been removed from the data because of its very large size. It is the main dataset, obtained from archive.org, and it contained posts from 2008-07-31 to 2015-03-08 at the time we downloaded it. The version used in our paper can be downloaded here.
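
The idea behind the answer tag file (answers inherit the tags of their question) can be sketched with plain JDK 8 code. This is only an illustration, not the code from this repository: it assumes the usual Stack Exchange dump layout, where every post is a `row` element with `Id`, `PostTypeId` (1 for questions, 2 for answers), `ParentId` and `Tags` attributes, and where tags are stored as `<java><lucene>`. The class name `AnswerTagSketch` and its methods are hypothetical, and the sketch keeps everything in memory, which may not be practical for the full Posts.xml.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the repository.
public class AnswerTagSketch {

    /** Turns "<java><lucene>" into [java, lucene]; null or empty input gives an empty list. */
    public static List<String> parseTags(String raw) {
        List<String> tags = new ArrayList<String>();
        if (raw == null || raw.length() < 2) {
            return tags;
        }
        // Drop the leading '<' and trailing '>', then split on the "><" separators.
        for (String tag : raw.substring(1, raw.length() - 1).split("><")) {
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /** Maps each answer id to the tags of its parent question (assumed attribute names). */
    public static Map<String, List<String>> answerTags(Reader postsXml) {
        Map<String, List<String>> questionTags = new HashMap<String, List<String>>();
        Map<String, String> answerParent = new LinkedHashMap<String, String>();
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(postsXml);
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT
                        || !"row".equals(xml.getLocalName())) {
                    continue;
                }
                String id = xml.getAttributeValue(null, "Id");
                String type = xml.getAttributeValue(null, "PostTypeId");
                if ("1".equals(type)) {
                    questionTags.put(id, parseTags(xml.getAttributeValue(null, "Tags")));
                } else if ("2".equals(type)) {
                    answerParent.put(id, xml.getAttributeValue(null, "ParentId"));
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse posts XML", e);
        }
        // Resolve answers after the full pass, so row order in the file does not matter.
        Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, String> entry : answerParent.entrySet()) {
            List<String> tags = questionTags.get(entry.getValue());
            result.put(entry.getKey(), tags != null ? tags : new ArrayList<String>());
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<posts>"
                + "<row Id=\"1\" PostTypeId=\"1\" Tags=\"&lt;java&gt;&lt;lucene&gt;\" />"
                + "<row Id=\"2\" PostTypeId=\"2\" ParentId=\"1\" />"
                + "</posts>";
        System.out.println(answerTags(new StringReader(sample))); // prints {2=[java, lucene]}
    }
}
```

Answers whose parent question is not in the input get an empty tag list in this sketch; whether the repository handles such answers the same way is not stated here.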

Usage

The code includes a simple GUI to make it easier to use.

Add a data folder to the project and put Posts.xml inside it. Then index the posts using the corresponding button. All other functions have their own buttons.

Citation

Please cite the paper if you use the code in this repository.
