This repository contains keyword blacklists, along with lists of other content such as URLs and images, that trigger censorship in apps used in China. With the exception of the WeChat data, these lists were obtained by reverse engineering and are exhaustive: they contain every keyword used to trigger censorship on these platforms.
Full details on data collection, analysis methods, and results are available below.
The research below tracks daily changes to censorship in three different chat apps used in China: TOM-Skype, Sina UC, and Line. Overall, our chat app data consists of over 4,000 blacklisted keywords.
The research below tracks hourly changes to censorship in three different live streaming apps in China: YY, Sina Show, and 9158; and documents the keywords censored by GuaGua, which does not include a mechanism for downloading updates to its censorship blacklists. Overall, our live-streaming data consists of over 20,000 blacklisted keywords.
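Tracking changes to a blacklist over time amounts to comparing successive snapshots. As a minimal sketch, assuming each snapshot is a plain-text file with one keyword per line (the actual file layout in this repository may differ, and `load_keywords` and `diff_snapshots` are hypothetical helper names, not part of the original tooling), the additions and removals between two snapshots could be computed as follows:

```python
from pathlib import Path


def load_keywords(path):
    """Read one keyword per line, skipping blank lines (assumed file format)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def diff_snapshots(old, new):
    """Return (added, removed) keyword sets between two snapshots."""
    return new - old, old - new
```

Calling `diff_snapshots(load_keywords(earlier), load_keywords(later))` on two snapshot files would then give the keywords added to and removed from the blacklist between the two collection times.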
Our research on mobile games analyzes domestic Chinese games as well as international games that have been altered to comply with Chinese regulations. Overall, we found hundreds of mobile games performing censorship, collectively censoring over 100,000 unique blacklisted keywords.
Data: Mobile games
Open source projects
This research analyzes Chinese censorship in open source projects. We extracted over 1,000 Chinese keyword blacklists from open source projects on GitHub, collectively spanning over 200,000 unique blacklisted keywords.
- The effect of information controls on developers in China: An analysis of censorship in Chinese open source projects
Data: Open source blacklists
Our research on WeChat censorship uses sample testing to determine what type of content, such as words, URLs, and images, can be communicated over the platform and which content is censored. We have studied what categorical content WeChat generally filters in addition to what content WeChat filters in response to specific events.
- One App, Two Systems: How WeChat uses one censorship policy in China and another internationally
- We (can’t) Chat: “709 Crackdown” Discussions Blocked on Weibo and WeChat
- Remembering Liu Xiaobo: Analyzing censorship of the death of Liu Xiaobo on WeChat and Weibo
- Managing the Message: What you can’t say about the 19th National Communist Party Congress on WeChat
- (Can’t) Picture This: An Analysis of Image Filtering on WeChat Moments (paper)
Data: Keywords and URLs (November 2016), 709 Crackdown keywords and images (April 2017), Liu Xiaobo keywords and images (July 2017), 19th Party Congress keywords (November 2017), Image filtering test data (May 2018)
Keyword Content Analysis
Datasets include the raw keyword lists collected from the applications. Many also include processed data, such as translations and categorizations of the keywords. Keywords were translated to English using a combination of machine and human translation. By interpreting these translations together with contextual information, we coded each keyword into content categories, grouped under six general themes, according to a codebook.
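Once keywords have been coded, a common first step is to tally how many fall under each theme. The sketch below assumes a processed dataset stored as a CSV file with a header row and a column named `theme`; that column name, and the helper `count_by_theme`, are illustrative assumptions and may not match the actual files in this repository.

```python
import csv
from collections import Counter


def count_by_theme(csv_path, theme_column="theme"):
    """Count keywords per theme in a processed CSV (column name is assumed)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return Counter(row[theme_column].strip() for row in reader)
```

The returned `Counter` supports `most_common()`, which lists themes from most to least frequent. If a dataset uses a different column name, pass it as `theme_column`.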