Author: Nerses Nersesyan
This project shows how to work with the various data sets in the Wikipedia Talk project on Figshare using fastText.
It is important to note that there is an existing API demo created by Jigsaw. The API scores a comment based on its potential impact on a conversation. More detailed information about this project can be found here.
In this notebook we show how to build a simple fastText classifier for detecting personal attacks, then apply it to a random sample of the comment corpus to see whether discussions on user pages contain more personal attacks than discussions on article pages.
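As a rough illustration of the first step, the sketch below formats labeled comments the way fastText's supervised mode expects (one example per line, prefixed with `__label__<name>`) and wraps the training call. The label names `attack` and `ok`, the `NEWLINE_TOKEN`/`TAB_TOKEN` cleanup, the helper names, and the hyperparameter values are assumptions made for this example, not settings taken from the notebook.

```python
import re


def to_fasttext_line(comment: str, is_attack: bool) -> str:
    """Format one labeled comment as a fastText supervised-training line."""
    # Assumption: raw comments mark line breaks and tabs with these tokens.
    text = comment.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    # Lowercase and collapse whitespace so each example fits on one line.
    text = re.sub(r"\s+", " ", text.lower()).strip()
    # Hypothetical label names; fastText only requires the "__label__" prefix.
    label = "__label__attack" if is_attack else "__label__ok"
    return f"{label} {text}"


def train_attack_classifier(train_path: str):
    """Train a supervised fastText model on a file of formatted lines."""
    # Imported lazily so the formatting helper works without fastText installed.
    import fasttext

    # Placeholder hyperparameters, not tuned values.
    return fasttext.train_supervised(
        input=train_path, epoch=10, lr=0.5, wordNgrams=2
    )
```

Once a model is trained, `model.test(path)` returns the number of examples together with precision and recall at one, which for a single-label problem like this one can serve as a first accuracy check.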
The number of social media users grows every day, and online discussion has become integral to people’s experience of the internet. It would be naive to expect online discussion to be free of abuse or harassment. Manually moderating comments and discussion forums can be tedious and expensive, so any tool that improves moderation quality and reduces its cost would be in demand.
Research paper containing documentation on the data collection and modeling methodology.
Deliverable
Create a classifier using fastText with accuracy higher than 90%.
Milestone 1
Building a fastText-based classifier for personal attacks
Milestone 2
- Model tuning
- Applying the classifier to the Wikipedia Talk corpus
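For the second milestone, comparing user pages with article pages comes down to computing the share of comments flagged as attacks in each namespace. The sketch below is one possible way to do that. It assumes each sampled comment arrives as a `(namespace, text)` pair and that the trained model emits a hypothetical `__label__attack` label; both helper names are invented for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def attack_rate_by_namespace(
    comments: Iterable[Tuple[str, str]],
    is_attack: Callable[[str], bool],
) -> Dict[str, float]:
    """Return the fraction of comments flagged as attacks per namespace."""
    totals = defaultdict(int)
    attacks = defaultdict(int)
    for namespace, text in comments:
        totals[namespace] += 1
        if is_attack(text):
            attacks[namespace] += 1
    return {ns: attacks[ns] / totals[ns] for ns in totals}


def make_fasttext_predicate(model, threshold: float = 0.5) -> Callable[[str], bool]:
    """Wrap a trained fastText model as a text -> bool predicate."""

    def is_attack(text: str) -> bool:
        # fastText's predict rejects newlines, so flatten the text first.
        labels, probs = model.predict(text.replace("\n", " "))
        # Assumption: the model was trained with a "__label__attack" label.
        return labels[0] == "__label__attack" and probs[0] >= threshold

    return is_attack
```

Keeping the predicate separate from the aggregation means the rate computation can be checked with a trivial stand-in classifier before the real model is plugged in.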
The Wikipedia Talk project dataset was used to train and evaluate the model. The Wikipedia Talk project release includes:
- a large historical corpus of discussion comments on Wikipedia talk pages
- a sample of over 100k comments with human labels for whether the comment contains a personal attack
- a sample of over 100k comments with human labels for whether the comment has an aggressive tone
Please refer to the wiki for documentation of the schema of each data set.
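Human-labeled samples like these typically need to be reduced to one label per comment before training. The sketch below shows a simple majority vote. It assumes each comment received several annotator judgments, given as `(rev_id, attack)` pairs with `attack` equal to 0 or 1; those field names are illustrative and should be checked against the schema documentation.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def majority_vote_labels(
    annotations: Iterable[Tuple[Hashable, int]],
) -> Dict[Hashable, bool]:
    """Collapse per-annotator judgments into one boolean label per comment."""
    votes = defaultdict(list)
    for rev_id, attack in annotations:
        votes[rev_id].append(int(attack))
    # A comment counts as an attack only if a strict majority said so;
    # ties are resolved as "not an attack".
    return {rev_id: sum(v) / len(v) > 0.5 for rev_id, v in votes.items()}
```

Resolving ties as "not an attack" is a design choice in this sketch; a stricter or looser threshold would change the class balance of the training set.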