Data Cleaning workflow for social tags

The repository contains the supplementary material (readme) for the paper:

Deriving Dynamic Knowledge from Academic Social Tagging Data: A Novel Research Direction, iConference 2017 (paper, poster).

The program provides a simplified implementation of the data cleaning workflow for noisy user-generated social tagging data, such as Bibsonomy, CiteULike, and MovieLens. It takes a list of raw tags as input and outputs cleaned multi-word and single-word tag groups based on simple morphological and statistical analyses; see the extracted tag groups in Material 2.
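To make the input and output concrete, here is a minimal sketch of how raw tag variants could be collected into multi-word and single-word groups. It is not the workflow from the paper or the code in this repository: the function names `split_words` and `group_tags`, the separator set, and the choice of the most frequent variant as the group's representative are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical separator set; the actual treatment of special characters
# is specified in Material 1, not here.
SEPARATORS = re.compile(r"[\s_\-.:/+]+")


def split_words(tag):
    """Split a raw tag on separators and camelCase boundaries (illustrative)."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tag)
    return [w.lower() for w in SEPARATORS.split(spaced) if w]


def group_tags(raw_tags):
    """Group variants that share a word sequence.

    Returns two dicts (multi-word, single-word), each mapping a
    representative tag to the sorted list of its variants.
    """
    counts = Counter(raw_tags)
    groups = defaultdict(list)
    for tag in counts:
        words = split_words(tag)
        if words:  # skip tags made only of separators
            groups[tuple(words)].append(tag)

    multi, single = {}, {}
    for words, variants in groups.items():
        # Assumption: the most frequent variant represents the group;
        # ties are broken alphabetically.
        standard = min(variants, key=lambda t: (-counts[t], t))
        target = multi if len(words) > 1 else single
        target[standard] = sorted(variants)
    return multi, single


raw = ["semantic_web", "SemanticWeb", "semantic-web", "semantic_web",
       "ontology", "Ontology", "ontology"]
multi_groups, single_groups = group_tags(raw)
print(multi_groups)
print(single_groups)
```

Grouping on the lower-cased word sequence is the simplest possible morphological key, and the frequency count is the simplest possible statistic; the full workflow applies more steps than this.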

An illustration of the data cleaning process is given in data-clean-bib.png.

The Lee-Lemmatizer, contained in the repository, is applied for lemmatisation of single-word tags.
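As a rough sketch of what lemmatisation contributes, the snippet below merges single-word tags that share a lemma. The interface of the Lee-Lemmatizer is not reproduced here: `merge_by_lemma` accepts any callable that maps a word to its lemma, and `toy_lemmatize` is a hypothetical stand-in that only strips a plural "s".

```python
from collections import defaultdict


def toy_lemmatize(word):
    """Stand-in lemmatiser that strips a plural 's'. NOT the Lee-Lemmatizer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def merge_by_lemma(single_word_tags, lemmatize=toy_lemmatize):
    """Map each lemma to the sorted set of single-word tags that reduce to it."""
    merged = defaultdict(set)
    for tag in single_word_tags:
        merged[lemmatize(tag.lower())].add(tag)
    return {lemma: sorted(tags) for lemma, tags in merged.items()}


print(merge_by_lemma(["networks", "network", "Network", "class"]))
```

Passing the lemmatiser as a parameter keeps the grouping logic independent of any particular lemmatiser, so a real one can be swapped in through a thin wrapper function.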

The supplementary material contains the readme files, the extracted multi-word tag groups (Material 2) and single-word tag groups (Material 3) from the Bibsonomy data, and the specification for the treatment of special characters (Material 1). For details, see the description file.

A simplified Python implementation, tag-cleaning.py, applies the data cleaning steps to the CiteULike-a dataset. The code does not implement all the steps described in the paper, but retains the main ideas. The program takes the whole user-generated tag set from the CiteULike-a dataset as input and outputs a list of tag groups with standard tags.

Acknowledgement

Thanks to Qingxiang Jia for the Lee-Lemmatizer, which is licensed under the GNU GPL v3; the license information is in the lee-lemmatizer Google Code repository.
The official Bibsonomy dataset was acquired from https://www.kde.cs.uni-kassel.de/bibsonomy/dumps/ upon request.
The CiteULike-a dataset file, tag.dat, is from "Collaborative topic regression with social regularization for tag recommendation" (Wang, Chen, and Li, 2013).
The Google Translation API (see the Cloud Translation documentation) was used to detect the language of tag groups from the Bibsonomy dataset.

