# Coronawiki dataset exploration

The purpose of the following notebook is to get familiar with the given Coronawiki data, as it is split among multiple files which serve different purposes.

As such, we will attempt to do the following tasks in this notebook:

- Preprocessing of the data, to make it easier to use (splitting the dataframes, changing their format, etc.)
- Data wrangling: much of the data consists of time series that can be combined to derive interesting results
- First analysis phase

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pycountry_convert as pc
import os

## Timeseries <a id='timeseries'></a>

The most important data in this dataset are time series of Wikipedia page views from January 2018 to July 2020 for 14 different languages. For each language we have the total views across that language's Wikipedia, the views of the articles related to Covid-19, and the share of the total views that those articles represent. Finally, for the same time window, we also have the views broken down by topic.

In [2]:
timeseries = pd.read_json("aggregated_timeseries.json.gz")
timeseries.head()

Unnamed: 0,ja.m,it,da.m,tr,no.m,en,sr,tr.m,en.m,no,...,ko.m,fi.m,sr.m,ja,fr,fi,ca,it.m,sv.m,ko
len,1197788,1594039,256451,346007,516838,6047509,632128,345790,6045654,531478,...,489181,480638,396063,1197856,2195949,481854,642031,1588312,1959446,490314
sum,"{'2018-01-01 00:00:00': 22328288, '2018-01-02 ...","{'2018-01-01 00:00:00': 3338750, '2018-01-02 0...","{'2018-01-01 00:00:00': 765123, '2018-01-02 00...","{'2018-01-01 00:00:00': 407629, '2018-01-02 00...","{'2018-01-01 00:00:00': 715031, '2018-01-02 00...","{'2018-01-01 00:00:00': 86763830, '2018-01-02 ...","{'2018-01-01 00:00:00': 192409, '2018-01-02 00...","{'2018-01-01 00:00:00': 493684, '2018-01-02 00...","{'2018-01-01 00:00:00': 135822131, '2018-01-02...","{'2018-01-01 00:00:00': 224417, '2018-01-02 00...",...,"{'2018-01-01 00:00:00': 1484496, '2018-01-02 0...","{'2018-01-01 00:00:00': 1319053, '2018-01-02 0...","{'2018-01-01 00:00:00': 451383, '2018-01-02 00...","{'2018-01-01 00:00:00': 7828155, '2018-01-02 0...","{'2018-01-01 00:00:00': 6441009, '2018-01-02 0...","{'2018-01-01 00:00:00': 523135, '2018-01-02 00...","{'2018-01-01 00:00:00': 111910, '2018-01-02 00...","{'2018-01-01 00:00:00': 12856884, '2018-01-02 ...","{'2018-01-01 00:00:00': 2383474, '2018-01-02 0...","{'2018-01-01 00:00:00': 819174, '2018-01-02 00..."
covid,"{'len': 30, 'sum': {'2018-01-01 00:00:00': 55,...","{'len': 33, 'sum': {'2018-01-01 00:00:00': 50,...","{'len': 4, 'sum': {'2018-01-01 00:00:00': 0, '...","{'len': 64, 'sum': {'2018-01-01 00:00:00': 1, ...","{'len': 10, 'sum': {'2018-01-01 00:00:00': 7, ...","{'len': 306, 'sum': {'2018-01-01 00:00:00': 57...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 6, '...","{'len': 64, 'sum': {'2018-01-01 00:00:00': 3, ...","{'len': 306, 'sum': {'2018-01-01 00:00:00': 91...","{'len': 10, 'sum': {'2018-01-01 00:00:00': 2, ...",...,"{'len': 113, 'sum': {'2018-01-01 00:00:00': 6,...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 0, '...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 11, ...","{'len': 30, 'sum': {'2018-01-01 00:00:00': 26,...","{'len': 16, 'sum': {'2018-01-01 00:00:00': 62,...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 2, '...","{'len': 49, 'sum': {'2018-01-01 00:00:00': 6, ...","{'len': 33, 'sum': {'2018-01-01 00:00:00': 139...","{'len': 8, 'sum': {'2018-01-01 00:00:00': 19, ...","{'len': 113, 'sum': {'2018-01-01 00:00:00': 3,..."
topics,{'Culture.Biography.Biography*': {'len': 14904...,{'Culture.Biography.Biography*': {'len': 29427...,{'Culture.Biography.Biography*': {'len': 57720...,{'Culture.Biography.Biography*': {'len': 70443...,{'Culture.Biography.Biography*': {'len': 11603...,{'Culture.Biography.Biography*': {'len': 14038...,{'Culture.Biography.Biography*': {'len': 37718...,{'Culture.Biography.Biography*': {'len': 70434...,{'Culture.Biography.Biography*': {'len': 14038...,{'Culture.Biography.Biography*': {'len': 11804...,...,{'Culture.Biography.Biography*': {'len': 75406...,{'Culture.Biography.Biography*': {'len': 10422...,{'Culture.Biography.Biography*': {'len': 37580...,{'Culture.Biography.Biography*': {'len': 14904...,{'Culture.Biography.Biography*': {'len': 38258...,{'Culture.Biography.Biography*': {'len': 10444...,{'Culture.Biography.Biography*': {'len': 10175...,{'Culture.Biography.Biography*': {'len': 29422...,{'Culture.Biography.Biography*': {'len': 14668...,{'Culture.Biography.Biography*': {'len': 75498...


In [3]:
timeseries.columns

Index(['ja.m', 'it', 'da.m', 'tr', 'no.m', 'en', 'sr', 'tr.m', 'en.m', 'no',
       'sv', 'nl.m', 'nl', 'da', 'de', 'fr.m', 'ca.m', 'de.m', 'ko.m', 'fi.m',
       'sr.m', 'ja', 'fr', 'fi', 'ca', 'it.m', 'sv.m', 'ko'],
      dtype='object')

Correspondence between column codes and languages:
- ja -> Japanese
- it -> Italian
- da -> Danish
- tr -> Turkish
- no -> Norwegian
- en -> English
- sr -> Serbian
- sv -> Swedish
- nl -> Dutch
- de -> German
- fr -> French
- ca -> Catalan
- ko -> Korean
- fi -> Finnish

The codes `tr` and `ca` were the least obvious ones; according to https://www.loc.gov/standards/iso639-2/php/langcodes-search.php, they correspond to Turkish and Catalan respectively. Each language also appears a second time with a `.m` suffix (for example `ja.m`), which holds the mobile traffic for that language.
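
To keep labels readable later on, the correspondence above can be stored in a small lookup table. The following is only a sketch: `LANGUAGE_NAMES` and `describe_column` are hypothetical helpers that are not part of the dataset, and the code assumes that a trailing `.m` marks the mobile version of a language.

```python
# Hypothetical lookup table built from the correspondence listed above.
LANGUAGE_NAMES = {
    "ja": "Japanese", "it": "Italian", "da": "Danish", "tr": "Turkish",
    "no": "Norwegian", "en": "English", "sr": "Serbian", "sv": "Swedish",
    "nl": "Dutch", "de": "German", "fr": "French", "ca": "Catalan",
    "ko": "Korean", "fi": "Finnish",
}


def describe_column(column_code):
    """Split a column code such as 'ja.m' into (language name, platform).

    Assumption: a trailing '.m' marks mobile traffic; anything else is desktop.
    """
    language_code, _, suffix = column_code.partition(".")
    platform = "mobile" if suffix == "m" else "desktop"
    return LANGUAGE_NAMES[language_code], platform


print(describe_column("ja.m"))
print(describe_column("sv"))
```

Such a helper could be used, for instance, to build plot titles from the column codes.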

### Splitting the timeseries data into different dataframes

As we can see, the format of the data isn't ideal: for each language, the data is split into three Python dictionaries corresponding to the pieces described above. It would be nice to separate these pieces so that we can read directly, for each date, the total number of views across all languages, for example, instead of having to iterate over each language's dictionary every time.

This will also make the analysis phase easier later on.

### Total sum of views, views of articles related to Covid

<a id='extraction_format'></a>
In this part of the code we extract the following data for each date (three dataframes in total, although the last two carry largely the same information):
- For every language's Wikipedia, the total number of views on that particular date
- For every language's Wikipedia, the total number of views for articles related to Covid-19 on that particular date
- For every language's Wikipedia, the share of views going to articles related to Covid-19 on that particular date (stored under the key `percent`; judging from the values, it is a fraction between 0 and 1 rather than a number out of 100)

Note that the last two dataframes might be redundant, but as we're given the data anyway, we choose to extract both.
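
The redundancy mentioned above can be checked on a few values. This sketch uses the 2018-01-01 values of `ko`, `sv.m` and `it.m` as they appear in the outputs of `sum_data_df` and `covid_sum_data_df` in this notebook, and assumes that the stored `percent` is simply the ratio of Covid views to total views; the variable names are illustrative.

```python
import pandas as pd

# Values for 2018-01-01, copied from the dataframe outputs shown in this notebook.
total_views = pd.Series({"ko": 819174, "sv.m": 2383474, "it.m": 12856884})
covid_views = pd.Series({"ko": 3, "sv.m": 19, "it.m": 139})

# Element-wise division, aligned on the language code.
covid_share = covid_views / total_views
print(covid_share)
```

The resulting ratios can then be compared with the first row of `covid_percent_data_df`.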

---


Every resulting dataframe will have the following format:

| Column name | Description |
|---------------|-------------|
| dates | A date between January 2018 and July 2020 (both inclusive) |
| language_code | Depending on the dataframe: the total number of views for that language's Wikipedia, the number of views of its Covid-related articles, or the share of views going to those articles. There are 28 such columns, as there are 14 languages and the desktop and mobile data are stored separately. |


---

We also extract another dataframe that simply maps each language to the number of Covid-related articles considered in the original study (the `len` field of the `covid` entry).

In [4]:
timeseries_total_sum_dict = {}
timeseries_covid_len_dict = {}
timeseries_covid_sum_dict = {}
timeseries_covid_percent_dict = {}
for cn in timeseries.columns:
    timeseries_total_sum_dict[cn] = timeseries[cn]['sum']
    timeseries_covid_len_dict[cn] = timeseries[cn]['covid']['len']
    timeseries_covid_sum_dict[cn] = timeseries[cn]['covid']['sum']
    timeseries_covid_percent_dict[cn] = timeseries[cn]['covid']['percent']

In [5]:
sum_data_df = pd.DataFrame.from_dict(timeseries_total_sum_dict, orient = 'index').T
covid_len_data_df = pd.DataFrame.from_dict(timeseries_covid_len_dict, orient = 'index', columns = ['len']).T
covid_sum_data_df = pd.DataFrame.from_dict(timeseries_covid_sum_dict, orient = 'index').T
covid_percent_data_df = pd.DataFrame.from_dict(timeseries_covid_percent_dict, orient = 'index').T

In [6]:
# Keep only the 'YYYY-MM-DD' part of each timestamp string
index_without_time = [x[:10] for x in covid_sum_data_df.index]
sum_data_df.index = covid_sum_data_df.index = covid_percent_data_df.index = pd.to_datetime(index_without_time)
sum_data_df['dates'] = covid_sum_data_df['dates'] = covid_percent_data_df['dates'] = sum_data_df.index
# Put 'dates' first, followed by the language columns in reverse order
new_column_order = [sum_data_df.columns[-1]] + list(sum_data_df.columns[-2::-1])
sum_data_df = sum_data_df[new_column_order]
covid_sum_data_df = covid_sum_data_df[new_column_order]
covid_percent_data_df = covid_percent_data_df[new_column_order]
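
For reference, the same reshaping can be sketched more compactly on a toy input. The data below is made up and only mimics the nested layout of the real file; `toy_timeseries` and `toy_sum_df` are illustrative names, and this is an alternative sketch rather than the code used in the rest of the notebook.

```python
import pandas as pd

# Toy stand-in for the nested structure of the real file (values are made up).
toy_timeseries = {
    "en": {"sum": {"2018-01-01 00:00:00": 10, "2018-01-02 00:00:00": 12}},
    "fr": {"sum": {"2018-01-01 00:00:00": 4, "2018-01-02 00:00:00": 5}},
}

# A dict of dicts becomes one column per language code, one row per timestamp string.
toy_sum_df = pd.DataFrame({code: entry["sum"] for code, entry in toy_timeseries.items()})

# Parse the timestamp strings into a DatetimeIndex and expose them as a 'dates' column.
toy_sum_df.index = pd.to_datetime(toy_sum_df.index)
toy_sum_df.insert(0, "dates", toy_sum_df.index)
print(toy_sum_df)
```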

In [7]:
sum_data_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,819174,2383474,12856884,111910,523135,6441009,7828155,451383,1319053,...,224417,135822131,493684,192409,86763830,715031,407629,765123,3338750,22328288
2018-01-02,2018-01-02,959239,1873096,12887390,198405,648344,9079323,8759385,462824,1094280,...,374771,127087359,483443,253653,112245349,536506,426791,443384,5428428,22278953
2018-01-03,2018-01-03,1037688,1863012,12859488,188728,644605,9746428,9996156,404880,1022615,...,459743,116606137,471814,272278,121868290,552379,468642,415545,5640812,23632758
2018-01-04,2018-01-04,956653,1810874,12359845,203167,643311,10034517,11976989,391631,1001547,...,479999,115650878,462107,273699,112888840,528468,462860,416943,5794860,21893587
2018-01-05,2018-01-05,955955,2191670,12107559,168126,614471,9511358,12746833,380159,1012466,...,444863,116127950,485637,257239,109213987,600262,407226,455392,5475376,20734837


In [8]:
sum_data_df.describe()

Unnamed: 0,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,ko.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
count,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,...,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0
mean,874037.7,1813193.0,11640490.0,329022.9,644900.4,8435672.0,11672720.0,519027.2,1158003.0,1368267.0,...,374530.8,133521000.0,715529.4,252778.569459,96663720.0,600669.9,434908.1,488067.505832,5075290.0,22370180.0
std,300111.9,261651.3,1874893.0,123645.7,155912.6,1591981.0,1515054.0,145947.5,143856.5,154664.1,...,144232.8,16170660.0,736893.9,71858.292498,13686200.0,94177.09,250353.1,87080.725791,1075515.0,3253867.0
min,397361.0,1373266.0,6793179.0,80448.0,314787.0,4797522.0,7608159.0,302358.0,816682.0,1027580.0,...,144046.0,107679100.0,261064.0,136739.0,63816990.0,374384.0,199125.0,331218.0,2533502.0,16681970.0
25%,726851.5,1605410.0,10450770.0,234550.5,522218.5,7250682.0,10553180.0,421916.0,1052212.0,1263100.0,...,256012.0,122450700.0,329145.0,209797.5,85068150.0,534279.0,285390.5,422152.0,4328148.0,20047340.0
50%,841536.0,1745874.0,11145170.0,326918.0,639540.0,8547916.0,11902900.0,498823.0,1129898.0,1350474.0,...,368713.0,130876600.0,361886.0,242344.0,97696630.0,581605.0,327488.0,469240.0,5119110.0,21542440.0
75%,967503.5,2011081.0,12371170.0,407496.5,751778.5,9615001.0,12733250.0,570201.5,1236041.0,1447855.0,...,482826.0,141391400.0,439639.0,277463.5,106438000.0,651776.5,434542.0,542283.0,5697126.0,24132590.0
max,5192512.0,2586867.0,20482900.0,1193818.0,1471053.0,17975460.0,25754020.0,1120918.0,1785992.0,2520902.0,...,1604503.0,202324000.0,3933698.0,993185.0,130745400.0,1036690.0,1508685.0,996658.0,11427150.0,37700610.0


In [9]:
covid_sum_data_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,3,19,139,6,2,62,26,11,0,...,2,911,3,6,575,7,1,0,50,55
2018-01-02,2018-01-02,20,12,187,6,2,91,42,20,0,...,4,1006,2,13,1081,5,3,2,103,55
2018-01-03,2018-01-03,20,13,162,9,2,109,53,15,8,...,3,919,2,11,1265,2,6,1,130,51
2018-01-04,2018-01-04,16,14,180,10,3,107,114,30,0,...,1,1026,9,6,1167,2,1,0,112,46
2018-01-05,2018-01-05,20,11,127,4,0,113,134,33,0,...,4,978,2,13,1054,2,0,0,119,70


In [10]:
covid_sum_data_df.describe()

Unnamed: 0,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,ko.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
count,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,...,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0,943.0
mean,368.194062,183.844115,7198.760339,86.745493,211.302227,1032.460233,3714.471898,57.279958,347.965005,613.879109,...,81.768823,127670.2,1236.389183,36.420997,97021.31,118.137858,435.064687,85.723224,2796.328738,4417.660657
std,1060.276765,530.529934,22508.643958,204.737455,701.724271,2821.946155,9394.238911,168.85586,1326.090918,2133.247119,...,227.749964,512254.6,5897.450636,100.384272,372776.3,396.63254,1842.725387,318.663849,8007.486218,11419.205557
min,1.0,0.0,21.0,0.0,0.0,15.0,25.0,0.0,0.0,2.0,...,0.0,680.0,0.0,0.0,385.0,0.0,0.0,0.0,10.0,30.0
25%,16.0,4.0,71.0,6.0,1.0,61.0,74.0,3.0,1.0,12.0,...,1.0,919.0,1.0,2.0,751.5,1.0,2.0,0.0,53.0,71.5
50%,26.0,7.0,104.0,13.0,2.0,102.0,114.0,6.0,2.0,19.0,...,3.0,1107.0,2.0,4.0,993.0,3.0,3.0,1.0,93.0,95.0
75%,45.0,14.0,406.0,40.0,4.0,254.5,234.5,21.0,4.0,35.0,...,6.0,2544.5,6.0,10.0,2073.5,7.0,7.0,3.0,289.5,180.0
max,7486.0,4220.0,147772.0,2407.0,6438.0,19564.0,56433.0,1542.0,17410.0,18720.0,...,1754.0,3966072.0,89855.0,1355.0,2509511.0,4089.0,21383.0,3156.0,50203.0,74684.0


In [11]:
covid_percent_data_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,4e-06,8e-06,1.1e-05,5.4e-05,4e-06,1e-05,3e-06,2.4e-05,0.0,...,9e-06,7e-06,6e-06,3.1e-05,7e-06,1e-05,2e-06,0.0,1.5e-05,2e-06
2018-01-02,2018-01-02,2.1e-05,6e-06,1.5e-05,3e-05,3e-06,1e-05,5e-06,4.3e-05,0.0,...,1.1e-05,8e-06,4e-06,5.1e-05,1e-05,9e-06,7e-06,5e-06,1.9e-05,2e-06
2018-01-03,2018-01-03,1.9e-05,7e-06,1.3e-05,4.8e-05,3e-06,1.1e-05,5e-06,3.7e-05,8e-06,...,7e-06,8e-06,4e-06,4e-05,1e-05,4e-06,1.3e-05,2e-06,2.3e-05,2e-06
2018-01-04,2018-01-04,1.7e-05,8e-06,1.5e-05,4.9e-05,5e-06,1.1e-05,1e-05,7.7e-05,0.0,...,2e-06,9e-06,1.9e-05,2.2e-05,1e-05,4e-06,2e-06,0.0,1.9e-05,2e-06
2018-01-05,2018-01-05,2.1e-05,5e-06,1e-05,2.4e-05,0.0,1.2e-05,1.1e-05,8.7e-05,0.0,...,9e-06,8e-06,4e-06,5.1e-05,1e-05,3e-06,0.0,0.0,2.2e-05,3e-06


### Checking for missing data

Before continuing further, let us check for missing data in the timeseries; this will help us avoid bad surprises later on.

In [12]:
sum_data_df.isnull().any().any(), covid_sum_data_df.isnull().any().any(), covid_percent_data_df.isnull().any().any()

(False, False, True)

There appears to be some missing data in the percentage dataframe; let's check by language where that missing data is reported.

In [13]:
missing_data_language = covid_percent_data_df.isnull().any(axis = 0)
missing_data_language[missing_data_language]

sv    True
dtype: bool

Looking at the paper, this corresponds to Swedish. Let's check in the other dataframes how that missing data manifests itself.

In [14]:
sum_data_df.loc[:,['dates','sv']]

Unnamed: 0,dates,sv
2018-01-01,2018-01-01,0
2018-01-02,2018-01-02,0
2018-01-03,2018-01-03,0
2018-01-04,2018-01-04,0
2018-01-05,2018-01-05,0
...,...,...
2020-07-27,2020-07-27,622918
2020-07-28,2020-07-28,645601
2020-07-29,2020-07-29,639190
2020-07-30,2020-07-30,613870


In [15]:
covid_sum_data_df.loc[:,['dates', 'sv']]

Unnamed: 0,dates,sv
2018-01-01,2018-01-01,0
2018-01-02,2018-01-02,0
2018-01-03,2018-01-03,0
2018-01-04,2018-01-04,0
2018-01-05,2018-01-05,0
...,...,...
2020-07-27,2020-07-27,245
2020-07-28,2020-07-28,291
2020-07-29,2020-07-29,296
2020-07-30,2020-07-30,267


In both dataframes, the missing data shows up as 0 values on the affected dates. The NaN values in the percentage dataframe therefore come from a division of 0 by 0. Let us now find the first date for which we have data for Swedish.
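
Both points can be illustrated on a small made-up series: dividing 0 by 0 in pandas produces NaN rather than an error, and the first date with data can be read off a boolean mask. The names and values below are illustrative only.

```python
import pandas as pd

# Toy series mimicking a language whose early observations are all zero (made-up values).
dates = pd.to_datetime(["2018-12-30", "2018-12-31", "2019-01-01", "2019-01-02"])
total = pd.Series([0, 0, 600, 800], index=dates)
covid = pd.Series([0, 0, 1, 2], index=dates)

# 0 / 0 yields NaN in pandas instead of raising an error.
share = covid / total
print(share.isna().sum())

# idxmax on a boolean mask returns the label of the first True value.
first_date_with_data = (total > 0).idxmax()
print(first_date_with_data)
```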

In [16]:
swedish_mask = sum_data_df.loc[:,'sv'] > 0

In [17]:
sum_data_df.loc[:,'sv'][swedish_mask]

2019-01-01    607516
2019-01-02    821962
2019-01-03    872335
2019-01-04    854304
2019-01-05    775861
               ...  
2020-07-27    622918
2020-07-28    645601
2020-07-29    639190
2020-07-30    613870
2020-07-31    549610
Name: sv, Length: 578, dtype: int64

In [18]:
covid_sum_data_df.loc[:,'sv'][swedish_mask]

2019-01-01      1
2019-01-02      1
2019-01-03      3
2019-01-04      5
2019-01-05      2
             ... 
2020-07-27    245
2020-07-28    291
2020-07-29    296
2020-07-30    267
2020-07-31    252
Name: sv, Length: 578, dtype: int64

From what we can see, all the 2018 data for the Swedish desktop column (`sv`) is missing: the first non-zero value is on 2019-01-01. We will need to take that into consideration when doing our analysis.

The reason is not clear: the Swedish version of Wikipedia has existed since 2001, so it is strange that a whole year of data is missing. It may simply not have been collected in the first place for the original paper.
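
One possible way to take this into account is to turn the placeholder zeros into NaN, so that summary statistics skip them. This is a sketch on made-up values, and it assumes that a 0 always means "not recorded", which would also hide any genuine zero.

```python
import numpy as np
import pandas as pd

# Made-up daily views: the first two days were never recorded and show up as 0.
views = pd.Series(
    [0, 0, 600, 800],
    index=pd.to_datetime(["2018-12-30", "2018-12-31", "2019-01-01", "2019-01-02"]),
)

# Replace the placeholder zeros with NaN so that pandas skips them in statistics.
views_masked = views.where(views > 0, np.nan)

print(views.mean())         # the zeros pull the mean down
print(views_masked.mean())  # NaN values are ignored by default
```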

### Topics data

Now we will extract, for each language, all the views per topic in a form that is easier to use. In the original data, all topic-related information is stored in a single dictionary; we are going to reshape it so that each column corresponds to a different topic and each row to a different language.

In [19]:
country_to_topics = {}
for cn in timeseries.columns:
    country_to_topics[cn] = timeseries[cn]['topics']
topics_df = pd.DataFrame.from_dict(country_to_topics, orient = 'index')

In [20]:
countries_to_topics_len = {}
countries_to_topics_sum = {}
countries_to_topics_percent = {}
for country in topics_df.index:
    countries_to_topics_len[country] = {}
    countries_to_topics_sum[country] = {}
    countries_to_topics_percent[country] = {}
    for topic in topics_df.columns:
        countries_to_topics_len[country][topic] = topics_df.loc[country,topic]['len']
        countries_to_topics_sum[country][topic] = topics_df.loc[country,topic]['sum']
        countries_to_topics_percent[country][topic] = topics_df.loc[country,topic]['percent']
countries_to_topics_len_df = pd.DataFrame.from_dict(countries_to_topics_len, orient = 'index')
countries_to_topics_sum_df = pd.DataFrame.from_dict(countries_to_topics_sum, orient = 'index')
countries_to_topics_percent_df = pd.DataFrame.from_dict(countries_to_topics_percent, orient = 'index')

In [21]:
countries_to_topics_sum_df.head()

Unnamed: 0,Culture.Biography.Biography*,Culture.Biography.Women,Culture.Food and drink,Culture.Internet culture,Culture.Linguistics,Culture.Literature,Culture.Media.Books,Culture.Media.Entertainment,Culture.Media.Films,Culture.Media.Media*,...,STEM.Computing,STEM.Earth and environment,STEM.Engineering,STEM.Libraries & Information,STEM.Mathematics,STEM.Medicine & Health,STEM.Physics,STEM.STEM*,STEM.Space,STEM.Technology
ja.m,"{'2018-01-01 00:00:00': 6629234, '2018-01-02 0...","{'2018-01-01 00:00:00': 1462146, '2018-01-02 0...","{'2018-01-01 00:00:00': 302934, '2018-01-02 00...","{'2018-01-01 00:00:00': 443986, '2018-01-02 00...","{'2018-01-01 00:00:00': 109480, '2018-01-02 00...","{'2018-01-01 00:00:00': 2140880, '2018-01-02 0...","{'2018-01-01 00:00:00': 97435, '2018-01-02 00:...","{'2018-01-01 00:00:00': 238059, '2018-01-02 00...","{'2018-01-01 00:00:00': 681533, '2018-01-02 00...","{'2018-01-01 00:00:00': 4264889, '2018-01-02 0...",...,"{'2018-01-01 00:00:00': 91338, '2018-01-02 00:...","{'2018-01-01 00:00:00': 72493, '2018-01-02 00:...","{'2018-01-01 00:00:00': 316615, '2018-01-02 00...","{'2018-01-01 00:00:00': 10072, '2018-01-02 00:...","{'2018-01-01 00:00:00': 44902, '2018-01-02 00:...","{'2018-01-01 00:00:00': 485801, '2018-01-02 00...","{'2018-01-01 00:00:00': 76863, '2018-01-02 00:...","{'2018-01-01 00:00:00': 1793359, '2018-01-02 0...","{'2018-01-01 00:00:00': 64445, '2018-01-02 00:...","{'2018-01-01 00:00:00': 264636, '2018-01-02 00..."
it,"{'2018-01-01 00:00:00': 809879, '2018-01-02 00...","{'2018-01-01 00:00:00': 193009, '2018-01-02 00...","{'2018-01-01 00:00:00': 34632, '2018-01-02 00:...","{'2018-01-01 00:00:00': 66037, '2018-01-02 00:...","{'2018-01-01 00:00:00': 23304, '2018-01-02 00:...","{'2018-01-01 00:00:00': 206403, '2018-01-02 00...","{'2018-01-01 00:00:00': 50646, '2018-01-02 00:...","{'2018-01-01 00:00:00': 86717, '2018-01-02 00:...","{'2018-01-01 00:00:00': 395631, '2018-01-02 00...","{'2018-01-01 00:00:00': 1137084, '2018-01-02 0...",...,"{'2018-01-01 00:00:00': 41406, '2018-01-02 00:...","{'2018-01-01 00:00:00': 20273, '2018-01-02 00:...","{'2018-01-01 00:00:00': 51490, '2018-01-02 00:...","{'2018-01-01 00:00:00': 7526, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 15705, '2018-01-02 00:...","{'2018-01-01 00:00:00': 76109, '2018-01-02 00:...","{'2018-01-01 00:00:00': 31334, '2018-01-02 00:...","{'2018-01-01 00:00:00': 383789, '2018-01-02 00...","{'2018-01-01 00:00:00': 18815, '2018-01-02 00:...","{'2018-01-01 00:00:00': 78432, '2018-01-02 00:..."
da.m,"{'2018-01-01 00:00:00': 289706, '2018-01-02 00...","{'2018-01-01 00:00:00': 74001, '2018-01-02 00:...","{'2018-01-01 00:00:00': 13610, '2018-01-02 00:...","{'2018-01-01 00:00:00': 4361, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 4238, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 28733, '2018-01-02 00:...","{'2018-01-01 00:00:00': 8817, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 13707, '2018-01-02 00:...","{'2018-01-01 00:00:00': 47315, '2018-01-02 00:...","{'2018-01-01 00:00:00': 269483, '2018-01-02 00...",...,"{'2018-01-01 00:00:00': 2505, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 3840, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6923, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 226, '2018-01-02 00:00...","{'2018-01-01 00:00:00': 1783, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 12618, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3836, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 62767, '2018-01-02 00:...","{'2018-01-01 00:00:00': 2775, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 7414, '2018-01-02 00:0..."
tr,"{'2018-01-01 00:00:00': 98424, '2018-01-02 00:...","{'2018-01-01 00:00:00': 14151, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3154, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 7279, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 9300, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 14074, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3511, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 3366, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 10859, '2018-01-02 00:...","{'2018-01-01 00:00:00': 54290, '2018-01-02 00:...",...,"{'2018-01-01 00:00:00': 7422, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 3512, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6465, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 1297, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 2732, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 8190, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6441, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 59353, '2018-01-02 00:...","{'2018-01-01 00:00:00': 4067, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 13951, '2018-01-02 00:..."
no.m,"{'2018-01-01 00:00:00': 232404, '2018-01-02 00...","{'2018-01-01 00:00:00': 64920, '2018-01-02 00:...","{'2018-01-01 00:00:00': 15889, '2018-01-02 00:...","{'2018-01-01 00:00:00': 5802, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 7222, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 27984, '2018-01-02 00:...","{'2018-01-01 00:00:00': 9003, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 9807, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 27301, '2018-01-02 00:...","{'2018-01-01 00:00:00': 153531, '2018-01-02 00...",...,"{'2018-01-01 00:00:00': 2907, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6839, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 10382, '2018-01-02 00:...","{'2018-01-01 00:00:00': 596, '2018-01-02 00:00...","{'2018-01-01 00:00:00': 1871, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 21247, '2018-01-02 00:...","{'2018-01-01 00:00:00': 6604, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 97684, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3576, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 10887, '2018-01-02 00:..."


In [22]:
countries_to_topics_sum_df.columns

Index(['Culture.Biography.Biography*', 'Culture.Biography.Women',
       'Culture.Food and drink', 'Culture.Internet culture',
       'Culture.Linguistics', 'Culture.Literature', 'Culture.Media.Books',
       'Culture.Media.Entertainment', 'Culture.Media.Films',
       'Culture.Media.Media*', 'Culture.Media.Music', 'Culture.Media.Radio',
       'Culture.Media.Software', 'Culture.Media.Television',
       'Culture.Media.Video games', 'Culture.Performing arts',
       'Culture.Philosophy and religion', 'Culture.Sports',
       'Culture.Visual arts.Architecture',
       'Culture.Visual arts.Comics and Anime', 'Culture.Visual arts.Fashion',
       'Culture.Visual arts.Visual arts*', 'Geography.Geographical',
       'Geography.Regions.Africa.Africa*',
       'Geography.Regions.Africa.Central Africa',
       'Geography.Regions.Africa.Eastern Africa',
       'Geography.Regions.Africa.Northern Africa',
       'Geography.Regions.Africa.Southern Africa',
       'Geography.Regions.Africa.Western 

However, we might not be interested in all available topics. For our project, it might be useful to isolate only the data about articles related to the environment. Looking at the column names, only one of them covers this topic (`STEM.Earth and environment`), so we will extract that topic alone into two dataframes that have the same format as [here](#extraction_format).
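
The idea can be sketched on toy data: a single topic column holds one `{timestamp: views}` dictionary per language, and expanding it gives a frame with one row per date and one column per language. The values are made up and the names `topic_column` and `toy_environment_df` are illustrative.

```python
import pandas as pd

# Toy stand-in for one topic column: a Series indexed by language code,
# where each cell is a {timestamp: views} dict (values are made up).
topic_column = pd.Series(
    {
        "en": {"2018-01-01 00:00:00": 7, "2018-01-02 00:00:00": 9},
        "fr": {"2018-01-01 00:00:00": 2, "2018-01-02 00:00:00": 3},
    },
    name="STEM.Earth and environment",
)

# Expand the dicts: one row per timestamp, one column per language code.
toy_environment_df = pd.DataFrame(topic_column.to_dict())
toy_environment_df.index = pd.to_datetime(toy_environment_df.index)
print(toy_environment_df)
```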

In [23]:
sum_environment_df = countries_to_topics_sum_df['STEM.Earth and environment']
percent_environment_df = countries_to_topics_percent_df['STEM.Earth and environment']
country_to_env_data_sum = {}
country_to_env_data_percent = {}
for country in sum_environment_df.index:
    country_to_env_data_sum[country] = sum_environment_df[country]
    country_to_env_data_percent[country] = percent_environment_df[country]
sum_environment_df = pd.DataFrame.from_dict(country_to_env_data_sum, orient = 'index').T
sum_environment_df.index = pd.to_datetime(index_without_time)
percent_environment_df = pd.DataFrame.from_dict(country_to_env_data_percent, orient = 'index').T
percent_environment_df.index = pd.to_datetime(index_without_time)
sum_environment_df['dates'] = percent_environment_df['dates'] = pd.to_datetime(index_without_time)
sum_environment_df = sum_environment_df[new_column_order]
percent_environment_df = percent_environment_df[new_column_order]

In [24]:
sum_environment_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,5120,19954,57629,2198,4987,60078,38004,1868,10513,...,2906,906437,2899,1282,709659,6839,3512,3840,20273,72493
2018-01-02,2018-01-02,5995,17666,73085,4645,7407,92710,46235,2813,10424,...,6096,981638,2992,1956,973197,5857,3462,3936,41790,96614
2018-01-03,2018-01-03,7069,19606,75968,4387,7117,104062,55066,2535,9656,...,8860,998147,3560,2150,1115237,6683,4137,4166,45349,107578
2018-01-04,2018-01-04,6201,19213,76988,5124,7637,106732,73793,4035,10072,...,9953,1059701,3520,2628,1097507,6919,4148,4285,47540,97229
2018-01-05,2018-01-05,5942,18918,71269,3684,6912,98577,85140,2780,9442,...,8706,951065,2828,2368,1016865,6446,3297,4431,43588,98070


In [25]:
percent_environment_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,0.002914,0.00326,0.001771,0.009242,0.004212,0.003844,0.002296,0.001607,0.003341,...,0.005613,0.002359,0.002373,0.002651,0.003689,0.003708,0.003699,0.001947,0.002589,0.001414
2018-01-02,2018-01-02,0.00281,0.003829,0.00222,0.011023,0.005175,0.00425,0.00256,0.002492,0.004122,...,0.007101,0.002764,0.002486,0.003157,0.003676,0.004326,0.0034,0.003672,0.003256,0.002001
2018-01-03,2018-01-03,0.003077,0.004315,0.002311,0.011034,0.005038,0.004465,0.002667,0.002561,0.004057,...,0.008214,0.003071,0.003047,0.003244,0.00392,0.004906,0.00376,0.004139,0.003413,0.002074
2018-01-04,2018-01-04,0.00309,0.004363,0.002481,0.011972,0.0054,0.004455,0.003049,0.004285,0.004323,...,0.00887,0.003305,0.003057,0.00396,0.004073,0.005288,0.003828,0.004273,0.00358,0.002029
2018-01-05,2018-01-05,0.003026,0.003549,0.00233,0.010285,0.005054,0.004351,0.003345,0.003053,0.003987,...,0.008315,0.00293,0.002363,0.003787,0.003883,0.004078,0.003457,0.003925,0.003422,0.002136


### First analysis

In [26]:
figures_path = './Figures/'
timeseries_path = 'timeseries/'
hists_path = 'hists/'
total_views_path = 'all_views/'
covid_views_path = 'covid_views/'
topic_views_path = 'topic_views/'

In [27]:
def make_sub_dirs(main_dir):
    # Create the per-category output folders; exist_ok avoids errors on re-runs
    for sub_dir in (total_views_path,
                    covid_views_path,
                    topic_views_path + total_views_path,
                    topic_views_path + covid_views_path):
        os.makedirs(main_dir + sub_dir, exist_ok=True)

In [28]:
# makedirs also creates the missing parent folders (Figures/, timeseries/, hists/, topic_views/)
make_sub_dirs(figures_path + timeseries_path)
make_sub_dirs(figures_path + hists_path)

In [29]:
def lineplot_language_views_timeseries(data, country, covid_views = False):
    fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
    x = data.dates
    y = data[country]
    g = sns.lineplot( x = x, y = y)
    plt.xticks(fontsize=8)
    g.set(xlabel='Dates')
    if covid_views:
        title = 'Wikipedia page views for articles related to Covid-19 for {}'.format(country)
    else:
        title = 'Wikipedia page views for {}'.format(country)
    g.set(ylabel='Page views', title = title)
    #ax.set(xscale="log")
    if covid_views:
        plt.savefig(figures_path + timeseries_path + covid_views_path + title + ".jpg")
    else:
        plt.savefig(figures_path + timeseries_path + total_views_path + title + ".jpg")
    plt.close(fig)
    #plt.show()

In [30]:
def hist_language_views(data, country, covid_views = False):
    fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
    g = sns.histplot(data = data, x = country, bins = 50)
    if covid_views:
        title = 'Wikipedia views distribution for articles related to Covid-19 {}'.format(country)
        sub_dir = covid_views_path
    else:
        title = 'Wikipedia views distribution for {}'.format(country)
        sub_dir = total_views_path
    # Show the title on the figure as well as using it for the file name
    g.set(title = title)
    plt.savefig(figures_path + hists_path + sub_dir + title + ".jpg")
    plt.close(fig)

In [31]:
for country_code in timeseries.columns:
    lineplot_language_views_timeseries(sum_data_df, country_code)
    hist_language_views(sum_data_df, country_code)
    lineplot_language_views_timeseries(covid_sum_data_df, country_code, True)
    hist_language_views(covid_sum_data_df, country_code, True)

In [32]:
def lineplot_topic_views_timeseries(topic_data, country, covid_views = False, topic = 'environment'):
    fig, ax = plt.subplots(figsize=(10, 6), dpi=100)
    x = topic_data.dates
    y = topic_data[country]
    g = sns.lineplot( x = x, y = y)
    plt.xticks(fontsize=8)
    g.set(xlabel='Dates')
    if covid_views:
        title = 'Wikipedia page views for articles related to Covid-19 for {0} for the {1} topic'.format(country, topic)
    else:
        title = 'Wikipedia page views for {0} for the {1} topic'.format(country,topic)
    g.set(ylabel='Page views', title = title)
    if covid_views:
        plt.savefig(figures_path + timeseries_path + topic_views_path + covid_views_path + title + ".jpg")
    else:
        plt.savefig(figures_path + timeseries_path + topic_views_path + total_views_path + title + ".jpg")
        
    plt.close(fig)

In [33]:
for country_code in timeseries.columns:
    lineplot_topic_views_timeseries(sum_environment_df, country_code)
    lineplot_topic_views_timeseries(percent_environment_df, country_code, True)

## Mobility data

The second type of data we have is mobility data, which comes from two different sources. The first is Apple, which stopped publishing the data in April 2022; the second is Google, whose data is still available and more up to date (17th of October).

### Apple mobility

In [34]:
apple_mobility = pd.read_csv("applemobilitytrends-2020-04-20.csv.gz")
apple_mobility.head()

Unnamed: 0,geo_type,region,transportation_type,2020-01-13,2020-01-14,2020-01-15,2020-01-16,2020-01-17,2020-01-18,2020-01-19,...,2020-04-11,2020-04-12,2020-04-13,2020-04-14,2020-04-15,2020-04-16,2020-04-17,2020-04-18,2020-04-19,2020-04-20
0,country/region,Albania,driving,100,95.3,101.43,97.2,103.55,112.67,104.83,...,25.47,24.89,32.64,31.43,30.67,30.0,29.26,22.94,24.55,31.51
1,country/region,Albania,walking,100,100.68,98.93,98.46,100.85,100.13,82.13,...,27.63,29.59,35.52,38.08,35.48,39.15,34.58,27.76,27.93,36.72
2,country/region,Argentina,driving,100,97.07,102.45,111.21,118.45,124.01,95.44,...,19.4,12.89,21.1,22.29,23.55,24.4,27.17,23.19,14.54,26.67
3,country/region,Argentina,walking,100,95.11,101.37,112.67,116.72,114.14,84.54,...,15.75,10.45,16.35,16.66,17.42,18.18,18.8,17.03,10.59,18.44
4,country/region,Australia,driving,100,102.98,104.21,108.63,109.08,89.0,99.35,...,26.95,31.72,53.14,55.91,56.56,58.77,47.51,36.9,53.34,56.93


In [35]:
apple_mobility[apple_mobility.columns[:3]]

Unnamed: 0,geo_type,region,transportation_type
0,country/region,Albania,driving
1,country/region,Albania,walking
2,country/region,Argentina,driving
3,country/region,Argentina,walking
4,country/region,Australia,driving
...,...,...,...
390,city,Washington DC,transit
391,city,Washington DC,walking
392,city,Zurich,driving
393,city,Zurich,transit


In [36]:
apple_mobility[apple_mobility.columns[3:]].T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,385,386,387,388,389,390,391,392,393,394
2020-01-13,100.00,100.00,100.00,100.00,100.00,100.00,100.00,100.00,100.00,100.00,...,100.00,100.00,100.00,100.00,100.00,100.00,100.00,100.00,100.00,100.00
2020-01-14,95.30,100.68,97.07,95.11,102.98,101.78,101.31,101.14,101.55,101.19,...,93.81,105.24,103.12,103.45,105.82,100.78,99.07,102.38,101.51,106.27
2020-01-15,101.43,98.93,102.45,101.37,104.21,100.64,101.82,104.24,105.59,107.49,...,86.78,91.04,106.60,106.04,109.02,103.92,109.61,110.84,108.93,116.73
2020-01-16,97.20,98.46,111.21,112.67,108.63,99.58,104.52,112.21,112.24,107.67,...,96.86,111.66,126.18,116.05,110.37,105.02,104.16,105.48,97.87,115.31
2020-01-17,103.55,100.85,118.45,116.72,109.08,98.34,113.73,117.23,123.36,117.38,...,104.61,139.11,113.78,128.79,123.98,112.26,123.16,113.83,103.91,118.22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-16,30.00,39.15,24.40,18.18,58.77,21.84,47.84,56.02,45.58,51.74,...,20.48,56.76,56.77,38.40,52.38,17.10,37.76,77.13,33.65,71.41
2020-04-17,29.26,34.58,27.17,18.80,47.51,20.13,45.82,54.07,45.03,50.94,...,19.97,55.37,54.79,36.91,59.52,17.14,42.50,78.01,36.77,74.96
2020-04-18,22.94,27.76,23.19,17.03,36.90,17.64,38.71,45.63,43.04,41.76,...,20.71,59.25,47.68,35.83,54.52,16.61,41.80,73.62,36.75,76.06
2020-04-19,24.55,27.93,14.54,10.59,53.34,21.93,43.30,45.82,32.29,44.35,...,18.73,46.00,43.37,25.08,47.06,16.66,43.51,71.97,37.66,74.22


In [37]:
print(apple_mobility.transportation_type.unique()) # Three types of transportation
print(apple_mobility.geo_type.unique()) # Granularity

['driving' 'walking' 'transit']
['country/region' 'city']


The Apple mobility data we have starts on 13 January 2020 and ends on 20 April 2020. This is a short time window, and there doesn't appear to be earlier data, as it was collected specifically for Covid-19 mobility tracking. We could however look for newer data (post-April 2020) on the web.

Three types of transportation are tracked: driving, walking, and transit. The data comes at two granularities: country/region level, or city level (the cities are often country capitals).

For each day and region, we have the usage of each transportation mode as a percentage of a pre-pandemic baseline from early 2020: in the table above, every series equals 100 on 2020-01-13, the first day of data.
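Since the dates are spread across columns, a long format (one row per region, transportation type, and date) is often easier to plot and to join with the Wikipedia time series. The cell below is a minimal sketch on a toy frame that copies a few values from the head above; the column names `relative_volume` and `change_from_baseline` are our own choices, not part of the dataset.

```python
import pandas as pd

# Toy frame in the same wide layout as apple_mobility
# (values copied from the first two rows shown above).
wide = pd.DataFrame({
    "geo_type": ["country/region", "country/region"],
    "region": ["Albania", "Albania"],
    "transportation_type": ["driving", "walking"],
    "2020-01-13": [100.0, 100.0],
    "2020-01-14": [95.3, 100.68],
    "2020-01-15": [101.43, 98.93],
})

id_cols = ["geo_type", "region", "transportation_type"]

# One row per (region, transportation type, date).
# "relative_volume" is a hypothetical name for the value column.
long = wide.melt(id_vars=id_cols, var_name="date", value_name="relative_volume")
long["date"] = pd.to_datetime(long["date"])

# Change with respect to the baseline of 100, in percentage points.
long["change_from_baseline"] = long["relative_volume"] - 100

print(long.sort_values(["transportation_type", "date"]))
```

The same `melt` call should apply to the full `apple_mobility` frame, assuming its first three columns are the identifiers and all remaining columns are dates, as shown above.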

In [38]:
apple_mobility.isnull().any().any() # There doesn't appear to be missing data

False

In [39]:
apple_mobility_walking = apple_mobility[apple_mobility.transportation_type == 'walking']
apple_mobility_driving = apple_mobility[apple_mobility.transportation_type == 'driving']
apple_mobility_transit = apple_mobility[apple_mobility.transportation_type == 'transit']

#### First analysis

### Global mobility from Google

In [40]:
global_mobility_report = pd.read_csv("Global_Mobility_Report.csv.gz")



In [41]:
global_mobility_report

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,AE,United Arab Emirates,,,,,,2020-02-15,0.0,4.0,5.0,0.0,2.0,1.0
1,AE,United Arab Emirates,,,,,,2020-02-16,1.0,4.0,4.0,1.0,2.0,1.0
2,AE,United Arab Emirates,,,,,,2020-02-17,-1.0,1.0,5.0,1.0,2.0,1.0
3,AE,United Arab Emirates,,,,,,2020-02-18,-2.0,1.0,5.0,0.0,2.0,1.0
4,AE,United Arab Emirates,,,,,,2020-02-19,-2.0,0.0,4.0,-1.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2111407,ZW,Zimbabwe,Midlands Province,Kwekwe,,,,2020-08-19,,,,,-9.0,
2111408,ZW,Zimbabwe,Midlands Province,Kwekwe,,,,2020-08-20,,,,,-5.0,
2111409,ZW,Zimbabwe,Midlands Province,Kwekwe,,,,2020-08-21,,,,,-5.0,
2111410,ZW,Zimbabwe,Midlands Province,Kwekwe,,,,2020-08-24,,,,,-4.0,


In [42]:
print(min(global_mobility_report.date.unique()), max(global_mobility_report.date.unique()))
global_mobility_report.date = pd.to_datetime(global_mobility_report.date)

2020-02-15 2020-08-25


The Google mobility data we have starts on 15 February 2020 and ends on 25 August 2020. This is a longer window than the Apple data we were given, although both were collected in the context of Covid-19. The full data can be found in the *Outside data* folder.

This data has more levels of granularity: for the United Arab Emirates, for example, a row may describe the whole country, or the column *sub_region_1* may specify that the row focuses on Abu Dhabi. The granularity can be made finer still with *sub_region_2*.

For each day and region, we have the **percentage change** in usage of various location types (workplaces, etc.) relative to a pre-pandemic baseline computed in early 2020.
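Because rows at different granularities are mixed in one table, it helps to isolate the country-wide rows before comparing countries. The sketch below assumes that a country-wide row is one where `sub_region_1`, `sub_region_2` and `metro_area` are all empty; this matches the rows displayed above but has not been checked against the full file. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of global_mobility_report (made-up values).
report = pd.DataFrame({
    "country_region": ["United Arab Emirates"] * 4,
    "sub_region_1": [np.nan, np.nan, "Abu Dhabi", "Abu Dhabi"],
    "sub_region_2": [np.nan] * 4,
    "metro_area": [np.nan] * 4,
    "date": pd.to_datetime(["2020-02-15", "2020-02-16", "2020-02-15", "2020-02-16"]),
    "workplaces_percent_change_from_baseline": [2.0, 2.0, 1.0, 3.0],
})

# Assumption: country-wide rows have no sub-region and no metro area.
is_country_level = (
    report["sub_region_1"].isnull()
    & report["sub_region_2"].isnull()
    & report["metro_area"].isnull()
)
country_level = report[is_country_level]

print(country_level[["country_region", "date", "workplaces_percent_change_from_baseline"]])
```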

In [43]:
global_mobility_report.isnull().sum()/global_mobility_report.shape[0]

country_region_code                                   0.000709
country_region                                        0.000000
sub_region_1                                          0.018154
sub_region_2                                          0.180897
metro_area                                            0.994122
iso_3166_2_code                                       0.813653
census_fips_code                                      0.762436
date                                                  0.000000
retail_and_recreation_percent_change_from_baseline    0.363799
grocery_and_pharmacy_percent_change_from_baseline     0.373900
parks_percent_change_from_baseline                    0.536971
transit_stations_percent_change_from_baseline         0.502960
workplaces_percent_change_from_baseline               0.050385
residential_percent_change_from_baseline              0.499080
dtype: float64

As expected, coarse-grained fields are more complete than fine-grained ones: *country_region* is never missing, while about 1.8% of the *sub_region_1* values and 18.1% of the *sub_region_2* values are null. The metropolitan area is very rarely defined: about 99.4% of its values are empty.

The differences from baseline are sparse as well: apart from workplaces, with a missing rate of only about 5.0%, the missing rates range from 36.4% (retail) to 53.7% (parks).

We can look in more detail at the entries with missing differences from baseline. First, let's check the intersection of these missing values, to see for example whether the absence of one field implies the absence of the others.

In [44]:
retail_missing = (global_mobility_report.retail_and_recreation_percent_change_from_baseline.isnull())
grocery_pharmecy_missing = (global_mobility_report.grocery_and_pharmacy_percent_change_from_baseline.isnull())
park_missing = (global_mobility_report.parks_percent_change_from_baseline.isnull())
transit_stations_missing = (global_mobility_report.transit_stations_percent_change_from_baseline.isnull())
workplace_missing = (global_mobility_report.workplaces_percent_change_from_baseline.isnull())
residential_missing = (global_mobility_report.residential_percent_change_from_baseline.isnull())

In [45]:
missing_dict = {}
missing_dict['retail'] = retail_missing
missing_dict['grocery'] = grocery_pharmecy_missing
missing_dict['park'] = park_missing
missing_dict['transit_stations'] = transit_stations_missing
missing_dict['workplace'] = workplace_missing
missing_dict['residential'] = residential_missing

In [46]:
all_missing = retail_missing & grocery_pharmecy_missing & park_missing & transit_stations_missing & workplace_missing & residential_missing

In [47]:
all_missing.any()

False

Since no row has all six features missing, every entry has at least one feature available. It also follows that there is no feature whose absence implies the absence of all the others.
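The check above only tells us about all six features at once. To see whether the absence of one feature tends to come with the absence of another, we could compute pairwise conditional missing rates. The cell below is a sketch on a toy frame with made-up values and shortened column names; entry `(A, B)` of `cond` is the share of rows missing `A` that also miss `B`.

```python
import numpy as np
import pandas as pd

# Toy data: no row has every feature missing, as in the real report.
toy = pd.DataFrame({
    "retail": [1.0, np.nan, np.nan, 4.0],
    "parks": [np.nan, np.nan, np.nan, 2.0],
    "workplaces": [1.0, 2.0, 3.0, np.nan],
})

# 1 where the value is missing, 0 otherwise.
mask = toy.isnull().astype(int)

# co_missing.loc[A, B] = number of rows where both A and B are missing.
co_missing = mask.T.dot(mask)

# Divide each row A by the number of rows missing A:
# cond.loc[A, B] = P(B missing | A missing).
cond = co_missing.div(mask.sum(), axis=0)

print(cond)
```

In this toy example `cond.loc["retail", "parks"]` is 1.0, meaning that every row missing retail also misses parks, while the reverse is not true.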

#### First analysis

Let's now see, for each feature, which countries have missing values; maybe only a strict subset of countries is affected. This will tell us about the quality and availability of the data for these regions of the world. We start with the countries missing some retail data.

In [48]:
# Total number of countries considered
print("Total number of countries considered in the collected data: {}"\
      .format(global_mobility_report.country_region.unique().size))

Total number of countries considered in the collected data: 135


In [49]:
# For each location type, list the countries with at least one missing value
countries_missing = {}
for location in missing_dict:
    countries_missing[location] = global_mobility_report[missing_dict[location]].country_region.unique()

In [50]:
intersection_of_countries = set(countries_missing['retail'])
union_of_countries = set(countries_missing['retail'])
for location in countries_missing:
    print("--------------------------------------------")
    intersection_of_countries = intersection_of_countries.intersection(set(countries_missing[location]))
    union_of_countries = union_of_countries.union(set(countries_missing[location]))
    print("Length of the current intersection: {}".format(len(intersection_of_countries)))
    print("Length of the current union: {}".format(len(union_of_countries)))

--------------------------------------------
Length of the current intersection: 81
Length of the current union: 81
--------------------------------------------
Length of the current intersection: 80
Length of the current union: 86
--------------------------------------------
Length of the current intersection: 77
Length of the current union: 90
--------------------------------------------
Length of the current intersection: 77
Length of the current union: 93
--------------------------------------------
Length of the current intersection: 56
Length of the current union: 93
--------------------------------------------
Length of the current intersection: 56
Length of the current union: 98


As we can see, out of the 135 countries in total, only 98 have at least one missing value in at least one feature, so the remaining 37 never have missing data. For 56 of those 98, at least one missing value is recorded in every feature. This of course doesn't mean that data is always missing for these countries; it only means that any missing value we encounter belongs to one of these 98. It might also be interesting to know which countries *never* have missing data.
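The union and intersection above can also be obtained in one pass by asking, per country and per feature, whether any value is missing. This is an alternative sketch on a toy frame (made-up values, two features only), not the method used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy data: three countries, two features.
toy = pd.DataFrame({
    "country_region": ["A", "A", "B", "B", "C", "C"],
    "retail": [1.0, np.nan, 1.0, 2.0, np.nan, 1.0],
    "parks": [np.nan, 1.0, 1.0, 2.0, 3.0, 1.0],
})
features = ["retail", "parks"]

# has_missing.loc[country, feature] is True if that country has
# at least one missing value for that feature.
has_missing = toy[features].isnull().groupby(toy["country_region"]).any()

in_union = has_missing.any(axis=1)         # missing in at least one feature
in_intersection = has_missing.all(axis=1)  # missing in every feature

print("Union:", list(has_missing.index[in_union]))
print("Intersection:", list(has_missing.index[in_intersection]))
print("Never missing:", list(has_missing.index[~in_union]))
```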

We will now map the 98 countries of the union to their continents.

In [51]:
missing_countries_code = global_mobility_report[global_mobility_report['country_region'].isin(union_of_countries)].country_region_code.unique()

In [52]:
continent_to_country_missing = {}
for code in missing_countries_code:
    # Skip country codes that were read as NaN
    if pd.isna(code):
        continue
    continent = pc.country_alpha2_to_continent_code(code)
    continent_to_country_missing.setdefault(continent, []).append(code)
print(continent_to_country_missing.keys())

dict_keys(['AS', 'NA', 'AF', 'SA', 'EU', 'OC'])


The continent each code corresponds to can be found at https://datahub.io/core/continent-codes. Let's now see the countries from which data is never missing.

In [53]:
not_missing_countries_code = global_mobility_report[~(global_mobility_report['country_region'].isin(union_of_countries))].country_region_code.unique()

In [54]:
continent_to_country_not_missing = {}
for code in not_missing_countries_code:
    # Skip country codes that were read as NaN
    if pd.isna(code):
        continue
    continent = pc.country_alpha2_to_continent_code(code)
    continent_to_country_not_missing.setdefault(continent, []).append(code)
print(continent_to_country_not_missing.keys())

dict_keys(['AS', 'EU', 'NA', 'OC', 'AF', 'SA'])


It appears that data can be missing from any part of the world, so no conclusion can immediately be drawn about a country based on the continent it is on.
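The loops above skip country codes that are not strings, which corresponds to the small share of missing `country_region_code` values seen earlier. We have not checked which rows are affected; one possible cause is that `pd.read_csv` treats the string `NA` as a missing value by default, and `NA` is also the ISO alpha-2 code of Namibia. The sketch below reproduces this behaviour on a made-up three-line CSV and shows one way to keep the code as a string.

```python
import io

import pandas as pd

# Made-up CSV: "NA" is a real country code, the empty field is truly missing.
csv_text = "country_region_code,country_region\nNA,Namibia\nFR,France\n,Unknown\n"

# Default parsing: the string "NA" is interpreted as a missing value.
default = pd.read_csv(io.StringIO(csv_text))

# Only treat empty fields as missing, so "NA" survives as a string.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default["country_region_code"].isnull().tolist())
print(kept["country_region_code"].isnull().tolist())
```

If this turns out to be the cause here, the same two arguments could be passed when reading `Global_Mobility_Report.csv.gz`.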

## Interventions

In this data, each language is mapped to its main country of usage. The exception is English, which is widely used in so many countries that it couldn't be mapped to a single one; as a result, most of the data is missing for that language.

For each country, the file records key dates of the pandemic timeline, such as the first registered case, the first death, etc.

Note that the paper says that nine languages are each spoken mainly in a single country, but we have more than that here (the reason is unknown).

In [55]:
interventions = pd.read_csv("interventions.csv")
interventions

Unnamed: 0,lang,1st case,1st death,School closure,Public events banned,Lockdown,Mobility,Normalcy
0,fr,2020-01-24,2020-02-14,2020-03-14,2020-03-13,2020-03-17,2020-03-16,2020-07-02
1,da,2020-02-27,2020-03-12,2020-03-13,2020-03-12,2020-03-18,2020-03-11,2020-06-05
2,de,2020-01-27,2020-03-09,2020-03-14,2020-03-22,2020-03-22,2020-03-16,2020-07-10
3,it,2020-01-31,2020-02-22,2020-03-05,2020-03-09,2020-03-11,2020-03-11,2020-06-26
4,nl,2020-02-27,2020-03-06,2020-03-11,2020-03-24,,2020-03-16,2020-05-29
5,no,2020-02-26,2020-02-26,2020-03-13,2020-03-12,2020-03-24,2020-03-11,2020-06-04
6,sr,2020-03-06,2020-03-20,2020-03-15,2020-03-21,2020-03-21,2020-03-16,2020-05-02
7,sv,2020-01-31,2020-03-11,2020-03-18,2020-03-12,,2020-03-11,2020-06-05
8,ko,2020-01-20,2020-02-20,2020-02-23,,,2020-02-25,2020-04-15
9,ca,2020-01-31,2020-02-13,2020-03-12,2020-03-08,2020-03-14,2020-03-16,


In [56]:
# Parse every date column (all columns except 'lang') as datetime
def transform_column(column_name):
    interventions[column_name] = pd.to_datetime(interventions[column_name])

for intervention in interventions.columns[1:]:
    transform_column(intervention)
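Once the columns are datetimes, delays between events are simple subtractions. The cell below is a small sketch that rebuilds three rows of the table above (fr, it, nl) and computes the number of days between the first case and the lockdown; `days_to_lockdown` is a name we introduce for illustration.

```python
import pandas as pd

# Three rows copied from the interventions table above.
toy = pd.DataFrame({
    "lang": ["fr", "it", "nl"],
    "1st case": ["2020-01-24", "2020-01-31", "2020-02-27"],
    "Lockdown": ["2020-03-17", "2020-03-11", None],
})

for column in ["1st case", "Lockdown"]:
    toy[column] = pd.to_datetime(toy[column])

# Missing dates become NaT, so the delay is NaN where no lockdown is recorded.
toy["days_to_lockdown"] = (toy["Lockdown"] - toy["1st case"]).dt.days

print(toy)
```

Rows without a recorded lockdown, such as nl, keep a missing delay instead of raising an error.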

## Topics

This file maps each considered article to the topics it is related to. A single article can be mapped to multiple topics. The number of articles per topic can be found in the [timeseries](#timeseries) data.

In [57]:
topics_linked = pd.read_csv("topics_linked.csv.xz")
topics_linked

Unnamed: 0,index,Geography.Regions.Asia.Central Asia,Geography.Regions.Europe.Eastern Europe,History and Society.Military and warfare,Culture.Media.Television,History and Society.Education,Culture.Media.Books,Geography.Regions.Africa.Africa*,Culture.Visual arts.Architecture,Culture.Biography.Women,...,Geography.Regions.Asia.West Asia,STEM.Chemistry,Geography.Regions.Europe.Northern Europe,Culture.Media.Video games,Geography.Regions.Asia.Southeast Asia,Culture.Media.Entertainment,Culture.Media.Music,Geography.Regions.Asia.Asia*,Geography.Regions.Asia.North Asia,qid
0,Rosmalen,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q2001490
1,Commelinales,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q290349
2,Transport_in_Honduras,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q1130638
3,QuakeC,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q2122062
4,Food_writing,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q5465542
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4306121,Faimaala_Filipo,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,Q84090991
4306122,Jonathan_Horne,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q1666264
4306123,Steven_Da_Costa,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q22921600
4306124,The_Silence_of_Dr._Evans,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,Q4301095
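Since every topic is a boolean column, counting is a matter of summing along one axis or the other. The sketch below uses a toy frame with three articles and three of the topic columns shown above; it assumes that every column other than `index` and `qid` is a topic flag.

```python
import pandas as pd

# Toy frame with the same layout as topics_linked (three topics only).
toy = pd.DataFrame({
    "index": ["Rosmalen", "Faimaala_Filipo", "The_Silence_of_Dr._Evans"],
    "Culture.Biography.Women": [False, True, False],
    "Geography.Regions.Europe.Eastern Europe": [False, False, True],
    "History and Society.Military and warfare": [False, False, True],
    "qid": ["Q2001490", "Q84090991", "Q4301095"],
})

# Assumption: everything except the article name and the qid is a topic flag.
topic_cols = [c for c in toy.columns if c not in ("index", "qid")]

# Number of topics each article is mapped to.
topics_per_article = toy.set_index("index")[topic_cols].sum(axis=1)

# Number of articles mapped to each topic.
articles_per_topic = toy[topic_cols].sum()

print(topics_per_article)
print(articles_per_topic)
```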


# Data wrangling