The repository contains the supplementary material (readme) for the paper:
- Deriving Dynamic Knowledge from Academic Social Tagging Data: A Novel Research Direction, iConference 2017 (paper, poster).
The program provides a simplified implementation of the data cleaning workflow for noisy user-generated social tagging data, such as Bibsonomy, CiteULike, and MovieLens. It takes a list of raw tags as input and outputs cleaned multi-word and single-word tag groups based on simple morphological and statistical analyses; see the extracted tag groups in (Material 2).
An illustration of the data cleaning process is below:
The Lee-Lemmatizer, contained in the repository, is applied for lemmatisation of single-word tags.
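The grouping idea behind the workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in tag-cleaning.py: the helper names (`normalise`, `is_multiword`, `group_tags`), the rule of dropping every non-alphanumeric character, and the choice of the most frequent variant as the standard tag are all hypothetical simplifications, and lemmatisation is omitted.

```python
import re
from collections import Counter, defaultdict


def normalise(tag):
    """Lower-case a tag and drop every non-alphanumeric character (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def is_multiword(tag):
    """Heuristic: a separator or an inner capital letter marks a multi-word tag."""
    return bool(re.search(r"[\s_\-]|[a-z][A-Z]", tag))


def group_tags(raw_tags):
    """Map each standard tag to the list of raw variants that share its normal form."""
    counts = Counter(raw_tags)
    by_key = defaultdict(list)
    for tag in counts:
        key = normalise(tag)
        if key:  # tags made only of special characters are discarded
            by_key[key].append(tag)
    groups = {}
    for variants in by_key.values():
        # Most frequent variant first; ties are broken alphabetically.
        variants.sort(key=lambda t: (-counts[t], t))
        groups[variants[0]] = variants
    return groups


raw = [
    "machine-learning", "machine_learning", "MachineLearning",
    "machine-learning", "web2.0", "Web2.0", "web2.0", "!!!",
]
groups = group_tags(raw)

# Split the groups: a group is multi-word if any of its variants looks multi-word.
multi = {s: v for s, v in groups.items() if any(is_multiword(t) for t in v)}
single = {s: v for s, v in groups.items() if s not in multi}

for standard, variants in groups.items():
    print(standard, "<-", variants)
```

Here the three spellings of "machine learning" collapse into one multi-word group, the two spellings of "web2.0" form a single-word group, and the tag "!!!" is dropped because nothing remains after normalisation.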
The supplementary material contains the readme files, the extracted multi-word tag groups (Material 2) and single-word tag groups (Material 3) from the Bibsonomy data, and the specification of the treatment of special characters (Material 1). For details, see the description file.
A simplified code implementation in Python, tag-cleaning.py, is provided; it applies the data cleaning steps to the CiteULike-a dataset. The code does not implement all steps described in the paper, but retains the main ideas. The program takes the whole user-generated tag set from the CiteULike-a dataset as input and outputs a list of tag groups with their standard tags.
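As a rough sketch of the input step, the snippet below reads a tag file and counts how often each raw tag occurs. It assumes the file holds one tag per line, which is an assumption about the layout of tag.dat rather than something stated here; the `load_tags` helper and the sample tags are hypothetical, and an in-memory file stands in for the real dataset.

```python
import io
from collections import Counter


def load_tags(handle):
    """Return the non-empty, whitespace-stripped lines of an open text file."""
    return [line.strip() for line in handle if line.strip()]


# In-memory stand-in for the dataset file (assumed layout: one tag per line).
sample = io.StringIO("semantic-web\nSemantic_Web\n\nontology\nsemantic-web\n")

tags = load_tags(sample)
counts = Counter(tags)  # raw frequencies, usable for simple statistical filtering

for tag, n in counts.most_common():
    print(n, tag)
```

With a real file, `open(path, encoding="utf-8")` would replace the `io.StringIO` object.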
- Thanks to the Lee-Lemmatizer by Qingxiang Jia, released under the GNU GPL v3; see the license information in the lee-lemmatizer Google Code repository.
- The official Bibsonomy dataset is acquired from https://www.kde.cs.uni-kassel.de/bibsonomy/dumps/ after request.
- The CiteULike-a dataset file, tag.dat, is from Collaborative topic regression with social regularization for tag recommendation (Wang, Chen, and Li, 2013, link).
- The Google Translation API, see Cloud Translation documentation, was used to detect the language of tag groups from the Bibsonomy dataset.