The 2nd place solution for WSDM Cup 2017: Vandalism Detection
01_xml_to_csv.py
02_join_data.py
03_extract_features.py
04_train_svm.py
README.md
feature_extraction_utils.py
tira_client.py


WSDM Cup 2017: Vandalism Detection

Running the solution:

  • download all the data, unpack it, and put the files in the data folder
  • run 01_xml_to_csv.py to convert the Wikidata dump files into a set of CSV files
  • run 02_join_data.py to join the data from the XML files with the meta information and labels
  • 03_extract_features.py processes the data so that the model can use it. This includes:
    • specifying the training, validation and testing folds
    • processing the information about the users (including the meta information)
    • extracting useful features from the comments
  • 04_train_svm.py creates two models:
    • a vectorizer that builds a large one-hot-encoded matrix from all the string features
    • a linear SVM with an L1 penalty that performs the classification
  • tira_client.py runs the model on http://tira.io/
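
To illustrate the one-hot encoding and linear scoring described for 04_train_svm.py, here is a minimal sketch in plain Python. It is not the code from this repository: the function names, the column names (`user`, `action`), their values, and the weights are all hypothetical, and the weights are hand-picked rather than learned. In the real pipeline an SVM solver with an L1 penalty would learn them, and the penalty would push many of them to exactly zero.

```python
from collections import Counter


def fit_vocabulary(rows, min_count=1):
    """Assign a column index to every 'column=value' token seen at least min_count times."""
    counts = Counter(f"{col}={val}" for row in rows for col, val in row.items())
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: idx for idx, tok in enumerate(kept)}


def one_hot(row, vocabulary):
    """Return the sorted indices of the columns set to 1 for this row (sparse form)."""
    tokens = (f"{col}={val}" for col, val in row.items())
    # Tokens never seen while fitting the vocabulary are silently dropped.
    return sorted(vocabulary[tok] for tok in tokens if tok in vocabulary)


def decision_function(active, weights, bias=0.0):
    """Linear model score: bias plus the weights of the active one-hot columns."""
    return bias + sum(weights[i] for i in active)


# Hypothetical string features for three revisions.
rows = [
    {"user": "anonymous", "action": "set_label"},
    {"user": "registered", "action": "set_label"},
    {"user": "anonymous", "action": "add_claim"},
]

vocabulary = fit_vocabulary(rows)
encoded = [one_hot(row, vocabulary) for row in rows]

# Hand-picked sparse weights (assumption); zeros mimic what an L1 penalty encourages.
weights = [0.0, -0.5, 1.0, 0.0]
scores = [decision_function(active, weights) for active in encoded]
```

Storing only the active column indices keeps the representation sparse, which matters when the string features produce a very large number of one-hot columns.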