# Novaya Gazeta Censorship

Novaya Gazeta was recently forced to start censoring itself or risk being shut down and having its journalists imprisoned.

https://www.reuters.com/world/russias-novaya-gazeta-cuts-ukraine-war-reporting-under-censorship-2022-03-04/

What articles have they had to remove?

First we can discover the articles from the Novaya Gazeta website that the Wayback Machine has archived since the start of 2020, roughly two years before the invasion of Ukraine. The article URLs seem to start with the `/articles/` path prefix and are returned with a `text/html` MIME type.

In [1]:
import wayback
import datetime

client = wayback.WaybackClient()

# search Wayback for archived HTML pages at novayagazeta.ru
results = client.search(
    url='https://novayagazeta.ru/articles/',
    matchType='prefix',
    from_date=datetime.date(2020, 1, 1),
    filter_field='mimetype:text/html'
)

# collect the unique URLs that are found
urls = set()

for record in results:
    if record.url not in urls:
        urls.add(record.url)
        print(len(urls), end="\r")

64196

That's a lot of URLs. Let's look at some of them:

In [2]:
list(urls)[0:500]

['https://novayagazeta.ru/articles/2021/09/07/zachem-im-duma-zachem-oni-dume',
 'https://novayagazeta.ru/articles/2021/08/07/vlasti-litvy-poobeshchali-po-300-evro-nelegalnym-migrantam-soglasivshimsia-vernutsia-na-rodinu-news?utm_source=tw&utm_medium=novaya&utm_campaign=nelegalnye-migranty--proniknuvshie-v-lit',
 'https://www.novayagazeta.ru/articles/2021/01/21/88818-spiski-poydut-v-ministerstvo',
 'https://novayagazeta.ru/articles/2020/07/14/86270-zalozhniki-bumazhnogo-blagopoluchiya',
 'https://novayagazeta.ru/articles/2021/11/05',
 'https://novayagazeta.ru/articles/2020/08/15/86688-rasstrelnyy-pistolet-vzyal?print=true',
 'https://novayagazeta.ru/articles/2020/05/14/85369-virus-molchaniya',
 'https://novayagazeta.ru/articles/2021/04/06/artdokfest-gonimyi',
 'https://novayagazeta.ru/articles/2021/09/25/kuklovody-protiv-obektivnogo-zla',
 'https://novayagazeta.ru/articles/2004/05/ic_search_white_24dp.svg',
 'https://www.novayagazeta.ru/articles/2021/01/14/88696-premier-razreshil-ubivat

There are some with URL tracking parameters like:

    https://novayagazeta.ru/articles/2021/10/20/poputchiki-na-titanike?utm_source=tw&utm_medium=novaya&utm_campaign=-chto-smotret-v-blizhayshee-vremya-v-kino
    
We can remove any with '?' in them.

In [3]:
urls = list(filter(lambda url: '?' not in url, list(urls)))
len(urls)

40604

That's much more manageable. There are also URLs with 'www' in them which can be removed.

In [4]:
urls = list(filter(lambda url: 'https://www' not in url, urls))
len(urls)

32461
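As an aside, dropping every URL that contains a `?` or a `www` host also drops any article that was only ever captured in that form. A gentler alternative, sketched here as an assumption rather than what this notebook did, is to normalize each URL and then de-duplicate. The `normalize` helper is hypothetical, and the counts reported in this notebook come from the filters above, not from this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Drop the query string and fragment, and a leading 'www.' from the host."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[len('www.'):]
    return urlunsplit((parts.scheme, host, parts.path, '', ''))

# three URLs taken from the listing above
sample = [
    'https://novayagazeta.ru/articles/2020/08/15/86688-rasstrelnyy-pistolet-vzyal?print=true',
    'https://www.novayagazeta.ru/articles/2021/01/21/88818-spiski-poydut-v-ministerstvo',
    'https://novayagazeta.ru/articles/2020/05/14/85369-virus-molchaniya',
]

# a set comprehension de-duplicates URLs that normalize to the same value
normalized = sorted({normalize(url) for url in sample})
```

Applied to the full set, this would look like `urls = sorted({normalize(url) for url in urls})`.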

That's still quite a few to check. Let's focus on URLs from 2022 first.

In [5]:
urls_2022 = list(filter(lambda url: url.startswith('https://novayagazeta.ru/articles/2022/'), urls))
len(urls_2022)

2875

How about URLs from 2022 that have 'ukraine' in them?

In [6]:
import re

urls_2022_ukraine = list(filter(lambda url: 'ukraine' in url, urls_2022))
len(urls_2022_ukraine)

100

In [7]:
urls_2022_ukraine

['https://novayagazeta.ru/articles/2022/03/21/dvadtsat-shestoi-den-boevykh-deistvii-v-ukraine-glavnoe',
 'https://novayagazeta.ru/articles/2022/03/16/oon-s-nachala-boevykh-deistvii-v-ukraine-pogib-691-mirnyi-zhitel-sredi-nikh-48-detei-news',
 'https://novayagazeta.ru/articles/2022/03/17/dvadtsat-vtoroi-den-voennykh-deistvii-v-ukraine-glavnoe',
 'https://novayagazeta.ru/articles/2022/03/25/minekonomiki-frg-germaniia-znachitelno-sokratila-zavisimost-ot-postavok-gaza-nefti-i-uglia-iz-rossii-s-momenta-nachala-boevykh-deistvii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/03/21/oon-s-nachala-boevykh-deistvii-v-ukraine-pogibli-ne-menee-925-mirnykh-zhitelei-news',
 'https://novayagazeta.ru/articles/2022/03/23/bloomberg-glava-tsb-nabiullina-sobiralas-uiti-v-otstavku-posle-nachala-spetsoperatsii-v-ukraine-no-putin-ei-otkazal-news',
 'https://novayagazeta.ru/articles/2022/03/23/iaponskii-proizvoditel-tekhniki-sharp-priostanovil-postavki-v-rossiiu-iz-za-sobytii-v-ukraine-news',
 'https
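The slugs are transliterated Russian, so the substring `ukraine` only catches the prepositional form ("v-ukraine"). Other case endings and the adjective, such as `ukrainy`, `ukrainu`, or `ukrainskii`, would be missed. A minimal sketch of a broader match on the shared stem follows; the `mentions_ukraine` helper and the two URLs marked hypothetical are illustrations, not data from the archive.

```python
import re

# the stem matches 'ukraine', 'ukrainy', 'ukrainu', 'ukrainskii', ...
# more terms could be added to the pattern with '|'
UKRAINE_STEM = re.compile(r'ukrain', re.IGNORECASE)

def mentions_ukraine(url):
    """Return True if the URL contains any form of the transliterated stem."""
    return UKRAINE_STEM.search(url) is not None

sample = [
    # taken from the listing above
    'https://novayagazeta.ru/articles/2022/03/17/dvadtsat-vtoroi-den-voennykh-deistvii-v-ukraine-glavnoe',
    # hypothetical slug using the genitive form
    'https://novayagazeta.ru/articles/2022/03/01/primer-pro-granitsu-ukrainy-news',
    # hypothetical slug with no mention at all
    'https://novayagazeta.ru/articles/2022/03/01/primer-bez-upominaniia-news',
]

matches = [url for url in sample if mentions_ukraine(url)]
```

On the real data this would be `list(filter(mentions_ukraine, urls_2022))`, which may return more than the 100 URLs found above.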

The tricky thing is that Novaya Gazeta doesn't return a 404 error when you request a URL for something that doesn't exist on their site:

In [8]:
! curl -i 'https://novayagazeta.ru/thisdoesntexistontheirwebsite/'

HTTP/2 200
date: Sun, 27 Mar 2022 11:59:05 GMT
content-type: text/html
expires: Sun, 27 Mar 2022 11:59:05 GMT
cache-control: max-age=0
last-modified: Sunday, 27-Mar-2022 11:59:05 GMT
strict-transport-security: max-age=63072000; includeSubDomains; preload
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 6f27f53b994c18cc-EWR

<!doctype html><html lang="ru" itemscope="itemscope" itemtype="http://schema.org/WebSite" xmlns="http://www.w3.org/1999/html"><head><title>Новая Газета - novayagazeta.ru</title><meta charset="UTF-8" /><meta name="viewport" content="width=device-width,height=device-height,maximum-scale=1,minimum-scale=1,initial-scale=1" /><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /><link rel="preload" href="https://novayagazeta.ru/api/v1/get/main" as="fetch" type="application/json" cr

Also this page has a bunch of JavaScript on it that needs to be run in order to tell if it's a 404 error. The number 404 doesn't appear in the HTML above, but it does appear in the page you see if you visit: https://novayagazeta.ru/thisdoesntexistontheirwebsite/

![404](images/novaya-gazeta-404.png)

We can use [pyppeteer](https://pypi.org/project/pyppeteer/) to automate the checking.

In [14]:
import asyncio
import pyppeteer

async def check(urls):
    """Return the URLs whose rendered page shows the site's error message."""
    missing = []
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    for i, url in enumerate(urls, start=1):
        # pause between requests without blocking the event loop
        await asyncio.sleep(.5)
        print(f'checking:{i}/{len(urls)} found:{len(missing)}', end="\r")
        # wait for the page's JavaScript to finish loading content
        await page.goto(url, {'waitUntil': 'networkidle2'})
        content = await page.evaluate('document.body.textContent', force_expr=True)
        # the error page is served with a 200 status, so match its text instead
        is_missing = 'Ошибка. Похоже, вы перешли по неправильной ссылке. Попробуйте найти материал через поиск на сайте или сообщите нам через ctrl+enter, если что-то сломалось.' in content
        if is_missing:
            missing.append(url)
    await browser.close()
    return missing

In [15]:
test_urls = [
    'https://novayagazeta.ru/thisdoesntexistontheirwebsite/',
    'https://novayagazeta.ru/articles/2022/03/15/na-ukraine-evropy'
]

result = await check(test_urls)
result

checking:2/2 found:1

['https://novayagazeta.ru/thisdoesntexistontheirwebsite/']
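The check relies on an exact match against one long message, so a stray line break or extra space in the rendered text would make it miss. A slightly more forgiving version, sketched here as an assumption, collapses whitespace and looks only for the second sentence of the same message. The `looks_missing` helper and `ERROR_MARKER` constant are hypothetical names.

```python
import re

# second sentence of the error message used in check() above
ERROR_MARKER = 'Похоже, вы перешли по неправильной ссылке'

def looks_missing(text):
    """Return True if the page text contains the site's error message."""
    # collapse runs of whitespace (including newlines) into single spaces
    collapsed = re.sub(r'\s+', ' ', text)
    return ERROR_MARKER in collapsed
```

Inside `check`, the `is_missing` line could then become `is_missing = looks_missing(content)`.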

Let's try it on the full set of 2022 Ukraine articles.

In [16]:
result = await check(urls_2022_ukraine)

checking:100/100 found:12

In [17]:
result

['https://novayagazeta.ru/articles/2022/02/26/on-sam-ne-znal-chto-ikh-tuda-povezut-smi-pogovorili-s-rodstvennikami-soldata-kotoryi-predpolozhitelno-popal-v-plen-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/24/putin-obiavil-spetsoperatsiiu-v-ukraine-onlainhttps://novayagazeta.ru/articles/2022/02/24/putin-obiavil-spetsoperatsiiu-v-ukraine-onlain',
 'https://novayagazeta.ru/articles/2022/03/02/bolee-160-laureatov-nobelevskoi-premii-prizvali-rossiiu-prekratit-voennye-deistviia-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/28/glavnyi-prokuror-mezhdunarodnogo-ugolovnogo-suda-anonsiroval-rassledovanie-voennykh-prestuplenii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/24/voine-net-opravdanii-rossiiskie-zhurnalisty-vystupili-protiv-spetsoperatsii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/28/piatyi-okaiannyi-den-voina-v-ukraine-prodolzhaetsia-glavnoe',
 'https://novayagazeta.ru/articles/2022/02/24/putin-obiavil-spetsoperatsiiu-v-ukrai

How about all the 2022 articles?

In [None]:
all_results = await check(urls_2022)

checking:210/2875 found:14
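A run over 2,875 URLs takes a long time, and `check` only returns its results at the very end, so an interruption loses everything found so far. One way to make the run resumable, sketched here under the assumption that progress is kept in a local JSON file, is to check the URLs in batches and save after each one. The helper names and the `checked-2022.json` filename are hypothetical.

```python
import json
from pathlib import Path

def load_progress(path):
    """Return saved progress, or an empty record if no checkpoint exists yet."""
    file = Path(path)
    if file.exists():
        return json.loads(file.read_text(encoding='utf-8'))
    return {'checked': [], 'missing': []}

def save_progress(path, progress):
    """Write the progress record to disk as JSON."""
    Path(path).write_text(json.dumps(progress, indent=2), encoding='utf-8')

def batches(urls, progress, size=50):
    """Yield lists of at most `size` URLs that have not been checked yet."""
    done = set(progress['checked'])
    todo = [url for url in urls if url not in done]
    for start in range(0, len(todo), size):
        yield todo[start:start + size]

# In the notebook, each batch could then be passed to check():
#
# progress = load_progress('checked-2022.json')
# for batch in batches(urls_2022, progress):
#     missing = await check(batch)
#     progress['checked'].extend(batch)
#     progress['missing'].extend(missing)
#     save_progress('checked-2022.json', progress)
```

Re-running the cell after an interruption would skip the URLs already recorded in the file.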