Data related to investigation of chat client censorship
Latest commit 59ac843 Sep 20, 2018
Type Name Latest commit message Commit time
LINE Update README.md Jul 3, 2018
SVP Update README.md Jul 3, 2018
TOM-Skype--Sina-UC Update README.md Jul 3, 2018
chinese-games Update README.md Jul 3, 2018
livestream Update README.md Jul 3, 2018
open-source Add README.md Aug 22, 2018
wechat Update README.md Aug 22, 2018
README.md Update README.md Sep 20, 2018
categories_keyword_censorship.csv keyword descriptions Oct 31, 2016
themes_keyword_censorship.csv Add files via upload Oct 29, 2016

README.md

Overview

This repository contains keyword blacklists and lists of other content, such as URLs and images, used to trigger censorship in apps used in China. With the exception of WeChat, these lists were obtained through reverse engineering and are exhaustive: they contain every keyword used to trigger censorship on these platforms.

Full details on the data collection methods, analysis methods, and results are available below.

Chat apps

The research below tracks daily changes to censorship in three chat apps used in China: TOM-Skype, Sina UC, and LINE. Overall, our chat app data consists of over 4,000 blacklisted keywords.

Data: TOM-Skype and Sina UC, LINE
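
Tracking daily changes amounts to comparing consecutive snapshots of a blacklist. The following is a minimal sketch of that comparison in Python, not the method used in the research. The function name `diff_snapshots` and the placeholder keywords are assumptions made for illustration and do not come from the datasets.

```python
def diff_snapshots(previous, current):
    """Return (added, removed) keywords between two blacklist snapshots."""
    prev_set, curr_set = set(previous), set(current)
    added = sorted(curr_set - prev_set)      # present today, absent yesterday
    removed = sorted(prev_set - curr_set)    # present yesterday, absent today
    return added, removed


# Hypothetical snapshots from two consecutive days (placeholder keywords).
day_1 = ["keyword_a", "keyword_b", "keyword_c"]
day_2 = ["keyword_b", "keyword_c", "keyword_d"]

added, removed = diff_snapshots(day_1, day_2)
print("added:", added)      # added: ['keyword_d']
print("removed:", removed)  # removed: ['keyword_a']
```

Using sets keeps the comparison independent of keyword order within a snapshot, which matters if an app reorders its list between updates.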

Live-streaming apps

The research below tracks hourly changes to censorship in three live-streaming apps in China: YY, Sina Show, and 9158. It also documents the keywords censored by GuaGua, which does not include a mechanism for downloading updates to its censorship blacklists. Overall, our live-streaming data consists of over 20,000 blacklisted keywords.

Data: Original live-streaming data (2015), Updated live-streaming data (2017)

Mobile games

Our research on mobile games analyzes domestic Chinese games as well as international games that have been altered to comply with Chinese regulations. Overall, we found hundreds of mobile games performing censorship, collectively censoring over 100,000 unique blacklisted keywords.

Data: Mobile games

Open source projects

This research analyzes Chinese censorship in open source projects. We extracted over 1,000 Chinese keyword blacklists from open source projects on GitHub, collectively spanning over 200,000 unique blacklisted keywords.

Data: Open source blacklists
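
A count of unique keywords across many blacklists can be reproduced by taking the union of the lists. The sketch below shows one way to do this in Python. The function name `merge_blacklists`, the whitespace normalization, and the placeholder keywords are assumptions for illustration; the research may have normalized keywords differently.

```python
def merge_blacklists(blacklists):
    """Return the set of unique, non-empty keywords across all blacklists."""
    unique = set()
    for blacklist in blacklists:
        for keyword in blacklist:
            cleaned = keyword.strip()  # assumed normalization: trim whitespace
            if cleaned:                # skip blank entries
                unique.add(cleaned)
    return unique


# Hypothetical blacklists (placeholder keywords, not taken from the data).
list_a = ["alpha", "beta "]
list_b = ["beta", "", "gamma"]

merged = merge_blacklists([list_a, list_b])
print(len(merged))     # 3
print(sorted(merged))  # ['alpha', 'beta', 'gamma']
```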

WeChat

Our research on WeChat censorship uses sample testing to determine which types of content, such as words, URLs, and images, can be communicated over the platform and which are censored. We have studied which categories of content WeChat filters in general, as well as what content it filters in response to specific events.

Data: Keywords and URLs (November 2016), 709 Crackdown keywords and images (April 2017), Liu Xiaobo keywords and images (July 2017), 19th Party Congress keywords (November 2017), Image filtering test data (May 2018)

Keyword Content Analysis

Datasets include raw keyword lists collected from the applications. Many also include processed data, such as translations and categorizations of keywords. Keywords were translated to English using a combination of machine and human translation. By interpreting these translations alongside contextual information, we coded each keyword into content categories, which are grouped under six general themes according to a codebook.
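
Once keywords are coded, a common first step is to count how many fall under each theme. The following is a minimal sketch in Python using only the standard library. The column names `keyword`, `category`, and `theme`, along with the sample rows, are assumptions made for illustration; check the header row of the actual CSV files before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in an assumed layout; the real files may differ.
SAMPLE_CSV = """keyword,category,theme
keyword_a,category_1,theme_x
keyword_b,category_2,theme_x
keyword_c,category_3,theme_y
"""


def count_by_theme(csv_file):
    """Count coded keywords per theme from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["theme"] for row in reader)


counts = count_by_theme(io.StringIO(SAMPLE_CSV))
for theme, total in counts.most_common():
    print(theme, total)
```

To run this against a file on disk, pass an object opened with `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.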

License

All data is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). The license is available in full here and summarized here.