# Novaya Gazeta Censorship

Novaya Gazeta was recently forced to start censoring itself or risk being shut down and having its journalists imprisoned.

https://www.reuters.com/world/russias-novaya-gazeta-cuts-ukraine-war-reporting-under-censorship-2022-03-04/

What articles have they had to remove?

First we can discover the articles on the Novaya Gazeta website that the Wayback Machine has archived since the start of 2020, well before the invasion of Ukraine. The article URLs seem to start with the `/articles/` path prefix and are returned with a `text/html` mimetype.

In [1]:
import wayback
import datetime

client = wayback.WaybackClient()

# search Wayback for archived HTML pages at novayagazeta.ru
results = client.search(
    url='https://novayagazeta.ru/articles/',
    matchType='prefix',
    from_date=datetime.date(2020, 1, 1),
    filter_field='mimetype:text/html'
)

# collect the unique URLs that are found
urls = set()

for record in results:
    if record.url not in urls:
        urls.add(record.url)
        print(len(urls), end="\r")

64246

That's a lot of URLs. Let's look at some of them:

In [2]:
list(urls)[0:500]

['https://novayagazeta.ru/articles/2021/10/17/shtrikhbreikery?utm_source=tw&utm_medium=novaya&utm_campaign=moskvich-zaprosil-dela-repressirovannyh-r',
 'https://novayagazeta.ru/articles/2019/12/18/83213-i-merkel-ne-pomogla',
 'https://novayagazeta.ru/articles/2021/06/17/chto-proizoshlo-za-noch-17-iiunia-korotko?utm_source=tw&utm_medium=novaya&utm_campaign=sbornaya-italii-stala-pervoy-vyshedshey-v-p',
 'https://novayagazeta.ru/articles/2021/08/17/zhurnalista-petra-maniakhina-oshtrafovali-za-semku-s-drona-ranee-ego-vnesli-v-spisok-smi-inoagentov-news',
 'https://novayagazeta.ru/articles/2021/11/11/dinozavry-tozhe-mogli-rassuzhdat-o-klimate-no-udaril-asteroid-i-vsio',
 'https://novayagazeta.ru/articles/2019/03/12/79847-sleduet-priznat-tot-fakt-chto-rossiya-spasla-ne-tolko-siriyu-no-i-livan',
 'https://novayagazeta.ru/articles/2021/11/16/glava-nasa-soobshchil-ob-ugroze-kosmonavtam-iz-za-ispytanii-rossiei-protivosputnikovogo-oruzhiia-news',
 'https://novayagazeta.ru/articles/2021/12/17/v-mo

Some of them have URL tracking parameters, like:

    https://novayagazeta.ru/articles/2021/10/20/poputchiki-na-titanike?utm_source=tw&utm_medium=novaya&utm_campaign=-chto-smotret-v-blizhayshee-vremya-v-kino

We can remove any URL with a '?' in it.

In [3]:
urls = list(filter(lambda url: '?' not in url, list(urls)))
len(urls)

40615

That's much more manageable. There are also URLs that start with 'https://www', which can be removed.

In [4]:
urls = list(filter(lambda url: 'https://www' not in url, urls))
len(urls)

32469
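Note that both filters discard URL variants rather than normalizing them, so an article that was captured only with tracking parameters, or only on the `www` host, drops out of the list. A minimal sketch of an alternative, assuming it is acceptable to treat such variants as the same article: strip the query string and a leading `www.`, then deduplicate. The `normalize` helper and the `sample` list are illustrative and not part of the original notebook, and the `www` URL in the sample is made up.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Drop the query string, fragment, and a leading 'www.' from a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[len('www.'):]
    return urlunsplit((parts.scheme, host, parts.path, '', ''))

# illustrative sample: the first URL appears in the output above (tracking
# parameters shortened), the second is a made-up www variant of it
sample = [
    'https://novayagazeta.ru/articles/2021/10/17/shtrikhbreikery?utm_source=tw&utm_medium=novaya',
    'https://www.novayagazeta.ru/articles/2021/10/17/shtrikhbreikery',
    'https://novayagazeta.ru/articles/2019/12/18/83213-i-merkel-ne-pomogla',
]

# a set removes the duplicates that normalization produces
normalized = sorted({normalize(url) for url in sample})
print(normalized)
```

The rest of this notebook continues with the filtered list of 32,469 URLs.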

That's still quite a few to check. Let's focus on URLs from 2022 first.

In [5]:
urls_2022 = list(filter(lambda url: url.startswith('https://novayagazeta.ru/articles/2022/'), urls))
len(urls_2022)

2878

How about URLs from 2022 that have 'ukraine' in them?

In [6]:
# keep only the 2022 URLs whose slug contains 'ukraine'
urls_2022_ukraine = list(filter(lambda url: 'ukraine' in url, urls_2022))
len(urls_2022_ukraine)

101
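The substring test only matches slugs that contain the exact form `ukraine`. Because the slugs are transliterated Russian, other inflections (for example `ukrainy` or `ukrainu`) would be missed. A hedged sketch of a broader filter that matches the stem `ukrain` with a regular expression follows. The `UKRAINE_STEM` pattern and the `sample` list are illustrative, the last two URLs in the sample are made up, and this broader pattern was not used to produce the counts in this notebook.

```python
import re

# match any inflection that starts with the stem 'ukrain'
UKRAINE_STEM = re.compile(r'ukrain', re.IGNORECASE)

sample = [
    # two URLs that appear in this notebook
    'https://novayagazeta.ru/articles/2022/03/16/ft-moskva-i-kiev-razrabotali-plan-prekrashcheniia-boevykh-deistvii-v-ukraine-news',
    'https://novayagazeta.ru/articles/2022/03/15/na-ukraine-evropy',
    # two made-up URLs, for illustration only
    'https://novayagazeta.ru/articles/2022/01/01/example-granitsa-ukrainy',
    'https://novayagazeta.ru/articles/2022/01/01/example-pogoda-news',
]

exact = [url for url in sample if 'ukraine' in url]
broad = [url for url in sample if UKRAINE_STEM.search(url)]
print(len(exact), len(broad))
```

A broader pattern returns more URLs to check in the browser, so there is a trade-off between coverage and run time.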

In [7]:
urls_2022_ukraine

['https://novayagazeta.ru/articles/2022/03/25/minekonomiki-frg-germaniia-znachitelno-sokratila-zavisimost-ot-postavok-gaza-nefti-i-uglia-iz-rossii-s-momenta-nachala-boevykh-deistvii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/12/mid-rossii-obiavil-ob-optimizatsii-chisla-rossiiskikh-diplomatov-v-ukraine-iz-za-ugrozy-silovykh-aktsii-news',
 'https://novayagazeta.ru/articles/2022/03/22/novartis-priostanavlivaet-klinicheskie-ispytaniia-v-rossii-na-fone-boevykh-deistvii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/03/16/ft-moskva-i-kiev-razrabotali-plan-prekrashcheniia-boevykh-deistvii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/03/22/fontanka-obnaruzhila-v-peterburge-fabriku-po-napisaniiu-kommentariev-v-podderzhku-voennykh-deistvii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/03/25/avstraliia-i-iaponiia-vveli-novye-sanktsii-protiv-rossii-iz-za-situatsii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/03/16/ssha-vydeliat-ukraine-

The tricky thing is that Novaya Gazeta doesn't return a 404 error when you request a URL that doesn't exist on their site. The response still has a 200 status:

In [8]:
! curl -i 'https://novayagazeta.ru/thisdoesntexistontheirwebsite/'

HTTP/2 200
date: Sun, 27 Mar 2022 23:32:19 GMT
content-type: text/html
expires: Sun, 27 Mar 2022 23:32:19 GMT
cache-control: max-age=0
last-modified: Sunday, 27-Mar-2022 23:32:19 GMT
strict-transport-security: max-age=63072000; includeSubDomains; preload
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 6f2becb71a905950-IAD

<!doctype html><html lang="ru" itemscope="itemscope" itemtype="http://schema.org/WebSite" xmlns="http://www.w3.org/1999/html"><head><title>Новая Газета - novayagazeta.ru</title><meta charset="UTF-8" /><meta name="viewport" content="width=device-width,height=device-height,maximum-scale=1,minimum-scale=1,initial-scale=1" /><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /><link rel="preload" href="https://novayagazeta.ru/api/v1/get/main" as="fetch" type="application/json" cr

The page also relies on JavaScript that has to run before you can tell that it's a 404 error. The number 404 doesn't appear in the HTML above, but it does appear in the rendered page you see if you visit https://novayagazeta.ru/thisdoesntexistontheirwebsite/ in a browser:

![404](images/novaya-gazeta-404.png)

We can use [pyppeteer](https://pypi.org/project/pyppeteer/) to automate the checking.

In [12]:
import time
import asyncio
import pyppeteer

async def check(urls):
    """Return the URLs whose rendered page shows the site's 'not found' message."""
    missing = []
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    for i, url in enumerate(urls, start=1):
        # pause between requests without blocking the event loop
        await asyncio.sleep(.5)
        print(f'checking:{i}/{len(urls)} found:{len(missing)}', end="\r")
        try:
            # let the JavaScript render the page; a timeout of 0 disables the timeout
            await page.goto(url, {'waitUntil': 'networkidle2', 'timeout': 0})
            content = await page.evaluate('document.body.textContent', force_expr=True)
            is_missing = 'Ошибка. Похоже, вы перешли по неправильной ссылке. Попробуйте найти материал через поиск на сайте или сообщите нам через ctrl+enter, если что-то сломалось.' in content
            if is_missing:
                missing.append(url)
        except Exception as e:
            # the URL that failed is skipped; start a fresh browser for the rest
            print(f'getting a new browser: {e}')
            await browser.close()
            browser = await pyppeteer.launch()
            page = await browser.newPage()

    await browser.close()
    return missing
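The check depends on one long literal string matching the rendered page text exactly. A minimal sketch of a slightly more tolerant variant, assuming the opening sentences of the error message are distinctive enough on their own: collapse whitespace before comparing and match a shorter marker. The `MISSING_MARKER` constant and the `looks_missing` helper are hypothetical names, not part of the original notebook.

```python
import re

# opening sentences of the error message that check() looks for
MISSING_MARKER = 'Ошибка. Похоже, вы перешли по неправильной ссылке.'

def looks_missing(text):
    """Return True if the rendered page text contains the error marker."""
    # collapse runs of whitespace (including newlines) into single spaces
    collapsed = re.sub(r'\s+', ' ', text)
    return MISSING_MARKER in collapsed

# the marker is still found when the text is broken across lines
print(looks_missing('Ошибка.\n  Похоже, вы перешли по\nнеправильной ссылке. Попробуйте найти материал'))
# an ordinary page title does not match
print(looks_missing('Новая Газета - novayagazeta.ru'))
```

Inside `check`, the text passed in would be the `content` string returned by `page.evaluate`.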

In [10]:
test_urls = [
    'https://novayagazeta.ru/thisdoesntexistontheirwebsite/',
    'https://novayagazeta.ru/articles/2022/03/15/na-ukraine-evropy'
]

result = await check(test_urls)
result

checking:2/2 found:1

['https://novayagazeta.ru/thisdoesntexistontheirwebsite/']

Let's try it on the full set of 2022 Ukraine articles.

In [16]:
result = await check(urls_2022_ukraine)

checking:100/100 found:12

In [17]:
result

['https://novayagazeta.ru/articles/2022/02/26/on-sam-ne-znal-chto-ikh-tuda-povezut-smi-pogovorili-s-rodstvennikami-soldata-kotoryi-predpolozhitelno-popal-v-plen-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/24/putin-obiavil-spetsoperatsiiu-v-ukraine-onlainhttps://novayagazeta.ru/articles/2022/02/24/putin-obiavil-spetsoperatsiiu-v-ukraine-onlain',
 'https://novayagazeta.ru/articles/2022/03/02/bolee-160-laureatov-nobelevskoi-premii-prizvali-rossiiu-prekratit-voennye-deistviia-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/28/glavnyi-prokuror-mezhdunarodnogo-ugolovnogo-suda-anonsiroval-rassledovanie-voennykh-prestuplenii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/24/voine-net-opravdanii-rossiiskie-zhurnalisty-vystupili-protiv-spetsoperatsii-v-ukraine-news',
 'https://novayagazeta.ru/articles/2022/02/28/piatyi-okaiannyi-den-voina-v-ukraine-prodolzhaetsia-glavnoe',
 'https://novayagazeta.ru/articles/2022/02/24/putin-obiavil-spetsoperatsiiu-v-ukrai

How about all the 2022 articles?

In [13]:
all_results = await check(urls_2022)

getting a new browser: Execution context was destroyed, most likely because of a navigation.
checking:2878/2878 found:191
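Since each article URL embeds its publication date in the path (`/articles/YYYY/MM/DD/`), one quick way to see when the missing articles were published is to tally them by date. A small sketch follows, using three of the URLs from the Ukraine results above as a stand-in for `all_results`. The names `DATE_IN_PATH` and `count_by_date` are illustrative.

```python
import re
from collections import Counter

# capture the year, month, and day from the /articles/ path
DATE_IN_PATH = re.compile(r'/articles/(\d{4})/(\d{2})/(\d{2})/')

def count_by_date(urls):
    """Count URLs per YYYY-MM-DD date found in their /articles/ path."""
    counts = Counter()
    for url in urls:
        match = DATE_IN_PATH.search(url)
        if match:
            counts['-'.join(match.groups())] += 1
    return counts

# three URLs taken from the missing Ukraine articles listed above
sample = [
    'https://novayagazeta.ru/articles/2022/03/02/bolee-160-laureatov-nobelevskoi-premii-prizvali-rossiiu-prekratit-voennye-deistviia-v-ukraine-news',
    'https://novayagazeta.ru/articles/2022/02/28/glavnyi-prokuror-mezhdunarodnogo-ugolovnogo-suda-anonsiroval-rassledovanie-voennykh-prestuplenii-v-ukraine-news',
    'https://novayagazeta.ru/articles/2022/02/28/piatyi-okaiannyi-den-voina-v-ukraine-prodolzhaetsia-glavnoe',
]

for date, n in sorted(count_by_date(sample).items()):
    print(date, n)
```

In the notebook, `count_by_date(all_results)` would give the tally for all 191 missing URLs.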

Now it would be useful to get the Wayback URLs of the archived versions, along with their titles.

In [29]:
import wayback
import datetime

wb = wayback.WaybackClient()
browser = await pyppeteer.launch()
page = await browser.newPage()

async def get_info(url):
    
    # find the first Wayback URL that doesn't have the missing message in it
    for result in wb.search(url=url, to_date=datetime.date(2022, 3, 5), from_date=datetime.date(2020, 1, 1)):
        wb_url = result.view_url
    
        # load the page in the Wayback Machine and give it 10 more seconds to render
        await page.goto(wb_url, {'waitUntil': 'networkidle2', 'timeout': 0})
        # asyncio was imported earlier; unlike time.sleep, this doesn't block the event loop
        await asyncio.sleep(10)

        # skip if it is a content missing page
        content = await page.evaluate('document.body.textContent', force_expr=True)
        is_missing = 'Ошибка. Похоже, вы перешли по неправильной ссылке. Попробуйте найти материал через поиск на сайте или сообщите нам через ctrl+enter, если что-то сломалось.' in content
        if is_missing:
            continue
        
        # get the title from the page and return
        title = await page.JJeval('head title', 'nodes => nodes[0].innerText')
        return {
            "url": wb_url,
            "title": title
        }
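The `view_url` values follow the pattern `http://web.archive.org/web/<timestamp>/<original URL>`, where the timestamp is a 14-digit `YYYYMMDDhhmmss` capture time. A hedged sketch for pulling the capture time out of such a URL, for instance to confirm that a capture predates the 5 March 2022 cutoff used in the search. The `capture_time` helper is hypothetical, and the example pairs an article URL from this notebook with a made-up timestamp.

```python
import re
from datetime import datetime

# 14-digit timestamp, an optional modifier such as id_, then the original URL
WAYBACK_URL = re.compile(r'^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$')

def capture_time(wb_url):
    """Return (capture datetime, original URL) for a Wayback view URL."""
    match = WAYBACK_URL.match(wb_url)
    if match is None:
        raise ValueError(f'not a Wayback view URL: {wb_url}')
    timestamp, original = match.groups()
    return datetime.strptime(timestamp, '%Y%m%d%H%M%S'), original

# made-up timestamp, for illustration only
example = ('http://web.archive.org/web/20220301120000/'
           'https://novayagazeta.ru/articles/2022/02/28/piatyi-okaiannyi-den-voina-v-ukraine-prodolzhaetsia-glavnoe')

captured, original = capture_time(example)
print(captured.isoformat(), original)
print(captured < datetime(2022, 3, 5))
```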

Try to get the info from Wayback for the pages that appear missing.

In [30]:
info = {}
for url in all_results:
    info[url] = await get_info(url)

We can put them in a DataFrame to make them easier to work with.

In [46]:
import pandas

pandas.set_option('display.width', 250)

df = pandas.DataFrame(info)
df = df.transpose()
df = df.reset_index()
df.columns = ['url', 'wayback_url', 'title']
df

| | url | wayback_url | title |
|---|---|---|---|
| 0 | https://novayagazeta.ru/articles/2022/03/01/go... | http://web.archive.org/web/20220301104646/http... | Новая Газета - novayagazeta.ru |
| 1 | https://novayagazeta.ru/articles/2022/02/24/ch... | http://web.archive.org/web/20220224063831/http... | Wayback Machine |
| 2 | https://novayagazeta.ru/articles/2022/03/02/po... | http://web.archive.org/web/20220302103843/http... | Новая Газета - novayagazeta.ru |
| 3 | https://novayagazeta.ru/articles/2022/02/26/kr... | http://web.archive.org/web/20220226103728/http... | «Кричать от боли все будут на одном языке»: бо... |
| 4 | https://novayagazeta.ru/articles/2022/02/27/et... | http://web.archive.org/web/20220227115621/http... | Новая Газета - novayagazeta.ru |
| 5 | https://novayagazeta.ru/articles/2022/03/02/pr... | http://web.archive.org/web/20220302141535/http... | Новая Газета - novayagazeta.ru |
| 6 | https://novayagazeta.ru/articles/2022/02/27/et... | http://web.archive.org/web/20220227095521/http... | Новая Газета - novayagazeta.ru |
| 7 | https://novayagazeta.ru/articles/2022/02/24/iu... | http://web.archive.org/web/20220224171820/http... | Юлия Латынина: Мир просто разделится на свобод... |
| 8 | https://novayagazeta.ru/articles/2022/02/27/ko... | http://web.archive.org/web/20220227190346/http... | Новая Газета - novayagazeta.ru |
| 9 | https://novayagazeta.ru/articles/2022/02/24/ze... | http://web.archive.org/web/20220224171703/http... | Зеленский: российские войска пытаются захватит... |


In [48]:
df.to_csv('data/novaya-gazeta.csv', index=False)
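
Several titles in the table are just the site name (`Новая Газета - novayagazeta.ru`) or `Wayback Machine`, which suggests that those captures did not render an article title. A small sketch for separating out such entries before relying on the titles, assuming these two strings are the only generic titles. `GENERIC_TITLES`, `has_article_title`, and the `sample_info` dictionary (shaped like `info`, with placeholder keys and URLs) are illustrative.

```python
# titles that indicate the capture did not render the article itself
GENERIC_TITLES = {'Новая Газета - novayagazeta.ru', 'Wayback Machine'}

def has_article_title(entry):
    """True if entry is a dict whose title is present and not generic."""
    if not entry or not entry.get('title'):
        return False
    return entry['title'].strip() not in GENERIC_TITLES

# placeholder keys and URLs; the titles are copied from the table above
sample_info = {
    'article-a': {'url': 'wayback-url-a', 'title': 'Новая Газета - novayagazeta.ru'},
    'article-b': {'url': 'wayback-url-b', 'title': 'Wayback Machine'},
    'article-c': {'url': 'wayback-url-c', 'title': 'Юлия Латынина: Мир просто разделится на свобод...'},
    'article-d': None,  # get_info returns None when no usable capture is found
}

titled = [key for key, entry in sample_info.items() if has_article_title(entry)]
print(titled)
```

Entries that fail the check could be retried with a different capture or reviewed by hand.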