acadTags/Tag-Data-Cleaning

Data Cleaning workflow for social tags

The repository contains the supplementary material (readme) for the paper:

  • Deriving Dynamic Knowledge from Academic Social Tagging Data: A Novel Research Direction, iConference 2017 (paper, poster).

The program provides a simplified implementation of a data cleaning workflow for noisy user-generated social tagging data, such as data from Bibsonomy, CiteULike, and MovieLens. It takes a list of raw tags as input and outputs cleaned multi-word and single-word tag groups based on simple morphological and statistical analyses; see the extracted tag groups in Material 2.

An illustration of the data cleaning process is below:

The Lee-Lemmatizer, contained in the repository, is applied for lemmatisation of single-word tags.
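As a rough illustration of how lemmatisation can merge single-word tag variants into groups, here is a minimal Python sketch. It does not call the Lee-Lemmatizer, whose interface is not described here. The function `naive_lemma` is a hypothetical stand-in that only strips a few English plural endings, and `group_by_lemma` is likewise an illustrative name, not a function from the repository.

```python
from collections import defaultdict


def naive_lemma(tag):
    """Hypothetical stand-in for a real lemmatizer (assumption, not the
    Lee-Lemmatizer): lowercases the tag and strips simple plural endings."""
    t = tag.lower()
    if len(t) > 4 and t.endswith("ies"):
        return t[:-3] + "y"          # ontologies -> ontology
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        return t[:-1]                # networks -> network
    return t


def group_by_lemma(tags, lemmatize=naive_lemma):
    """Group single-word tags that share the same lemma."""
    groups = defaultdict(set)
    for tag in tags:
        groups[lemmatize(tag)].add(tag)
    # Sort the variants so the output is deterministic.
    return {lemma: sorted(variants) for lemma, variants in groups.items()}


if __name__ == "__main__":
    print(group_by_lemma(["ontologies", "ontology", "Networks", "network"]))
```

Any real lemmatizer can be passed in through the `lemmatize` parameter. The naive stand-in mishandles many words (for example, singular words ending in "s"), so it is only useful for showing the grouping step.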

The supplementary material contains the readme files, the specification for the treatment of special characters (Material 1), and the multi-word tag groups (Material 2) and single-word tag groups (Material 3) extracted from the Bibsonomy data. For details, see the description file.

A simplified Python implementation, tag-cleaning.py, applies the data cleaning steps to the CiteULike-a dataset. The code does not implement all steps described in the paper, but retains the main ideas. The program takes the whole user-generated tag set from the CiteULike-a dataset as input and outputs a list of tag groups with their standard tags.
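The general idea of turning raw tags into tag groups can be sketched as follows. This is not the code in tag-cleaning.py. It is a minimal illustration under two stated assumptions: variants are matched by lowercasing and dropping every non-alphanumeric ASCII character, and the most frequent variant in a group is chosen as its standard tag. The names `normalise_key`, `group_variants`, and `min_count` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalise_key(tag):
    """Assumed matching rule: lowercase, then drop anything outside a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def group_variants(raw_tags, min_count=1):
    """Group tag variants and pick the most frequent variant as the standard tag.

    raw_tags  : iterable of raw tag strings, one entry per tag assignment
    min_count : groups used fewer times than this in total are discarded
    """
    counts = Counter(raw_tags)
    groups = defaultdict(list)
    for tag, n in counts.items():
        key = normalise_key(tag)
        if key:                      # skip tags made only of special characters
            groups[key].append((tag, n))

    result = {}
    for variants in groups.values():
        if sum(n for _, n in variants) < min_count:
            continue
        # Most frequent first; ties are broken alphabetically.
        variants.sort(key=lambda v: (-v[1], v[0]))
        standard = variants[0][0]
        result[standard] = [tag for tag, _ in variants]
    return result


if __name__ == "__main__":
    raw = ["semantic-web", "semantic_web", "semantic-web", "SemanticWeb", "ontology"]
    for standard, variants in group_variants(raw).items():
        print(standard, "<-", variants)
```

Choosing the most frequent variant is one simple statistical criterion. Raising `min_count` gives a crude way to filter out rarely used tags.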

Acknowledgement
