# Coronawiki dataset exploration

The purpose of this notebook is to get familiar with the Coronawiki data, which is split across multiple files that serve different purposes.

As such, we will attempt to do the following tasks in this notebook:

- Preprocessing of the data to make it more comfortable to use (splitting the dataframes, giving them another format, etc.)
- Data wrangling: much of the data consists of time series that can be combined to derive interesting results
- First analysis phase

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

## Timeseries

The most important data in this dataset are time series of Wikipedia views from January 2018 to July 2020 for 14 different languages. For each language, we have the total views across that language's Wikipedia, the views of the articles related to Covid-19, and the share of the total views that the latter represent. Finally, for the same time window, we also have the views per topic.

In [2]:
timeseries = pd.read_json("aggregated_timeseries.json.gz")
timeseries.head()

Unnamed: 0,ja.m,it,da.m,tr,no.m,en,sr,tr.m,en.m,no,...,ko.m,fi.m,sr.m,ja,fr,fi,ca,it.m,sv.m,ko
len,1197788,1594039,256451,346007,516838,6047509,632128,345790,6045654,531478,...,489181,480638,396063,1197856,2195949,481854,642031,1588312,1959446,490314
sum,"{'2018-01-01 00:00:00': 22328288, '2018-01-02 ...","{'2018-01-01 00:00:00': 3338750, '2018-01-02 0...","{'2018-01-01 00:00:00': 765123, '2018-01-02 00...","{'2018-01-01 00:00:00': 407629, '2018-01-02 00...","{'2018-01-01 00:00:00': 715031, '2018-01-02 00...","{'2018-01-01 00:00:00': 86763830, '2018-01-02 ...","{'2018-01-01 00:00:00': 192409, '2018-01-02 00...","{'2018-01-01 00:00:00': 493684, '2018-01-02 00...","{'2018-01-01 00:00:00': 135822131, '2018-01-02...","{'2018-01-01 00:00:00': 224417, '2018-01-02 00...",...,"{'2018-01-01 00:00:00': 1484496, '2018-01-02 0...","{'2018-01-01 00:00:00': 1319053, '2018-01-02 0...","{'2018-01-01 00:00:00': 451383, '2018-01-02 00...","{'2018-01-01 00:00:00': 7828155, '2018-01-02 0...","{'2018-01-01 00:00:00': 6441009, '2018-01-02 0...","{'2018-01-01 00:00:00': 523135, '2018-01-02 00...","{'2018-01-01 00:00:00': 111910, '2018-01-02 00...","{'2018-01-01 00:00:00': 12856884, '2018-01-02 ...","{'2018-01-01 00:00:00': 2383474, '2018-01-02 0...","{'2018-01-01 00:00:00': 819174, '2018-01-02 00..."
covid,"{'len': 30, 'sum': {'2018-01-01 00:00:00': 55,...","{'len': 33, 'sum': {'2018-01-01 00:00:00': 50,...","{'len': 4, 'sum': {'2018-01-01 00:00:00': 0, '...","{'len': 64, 'sum': {'2018-01-01 00:00:00': 1, ...","{'len': 10, 'sum': {'2018-01-01 00:00:00': 7, ...","{'len': 306, 'sum': {'2018-01-01 00:00:00': 57...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 6, '...","{'len': 64, 'sum': {'2018-01-01 00:00:00': 3, ...","{'len': 306, 'sum': {'2018-01-01 00:00:00': 91...","{'len': 10, 'sum': {'2018-01-01 00:00:00': 2, ...",...,"{'len': 113, 'sum': {'2018-01-01 00:00:00': 6,...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 0, '...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 11, ...","{'len': 30, 'sum': {'2018-01-01 00:00:00': 26,...","{'len': 16, 'sum': {'2018-01-01 00:00:00': 62,...","{'len': 9, 'sum': {'2018-01-01 00:00:00': 2, '...","{'len': 49, 'sum': {'2018-01-01 00:00:00': 6, ...","{'len': 33, 'sum': {'2018-01-01 00:00:00': 139...","{'len': 8, 'sum': {'2018-01-01 00:00:00': 19, ...","{'len': 113, 'sum': {'2018-01-01 00:00:00': 3,..."
topics,{'Culture.Biography.Biography*': {'len': 14904...,{'Culture.Biography.Biography*': {'len': 29427...,{'Culture.Biography.Biography*': {'len': 57720...,{'Culture.Biography.Biography*': {'len': 70443...,{'Culture.Biography.Biography*': {'len': 11603...,{'Culture.Biography.Biography*': {'len': 14038...,{'Culture.Biography.Biography*': {'len': 37718...,{'Culture.Biography.Biography*': {'len': 70434...,{'Culture.Biography.Biography*': {'len': 14038...,{'Culture.Biography.Biography*': {'len': 11804...,...,{'Culture.Biography.Biography*': {'len': 75406...,{'Culture.Biography.Biography*': {'len': 10422...,{'Culture.Biography.Biography*': {'len': 37580...,{'Culture.Biography.Biography*': {'len': 14904...,{'Culture.Biography.Biography*': {'len': 38258...,{'Culture.Biography.Biography*': {'len': 10444...,{'Culture.Biography.Biography*': {'len': 10175...,{'Culture.Biography.Biography*': {'len': 29422...,{'Culture.Biography.Biography*': {'len': 14668...,{'Culture.Biography.Biography*': {'len': 75498...


In [3]:
timeseries.columns

Index(['ja.m', 'it', 'da.m', 'tr', 'no.m', 'en', 'sr', 'tr.m', 'en.m', 'no',
       'sv', 'nl.m', 'nl', 'da', 'de', 'fr.m', 'ca.m', 'de.m', 'ko.m', 'fi.m',
       'sr.m', 'ja', 'fr', 'fi', 'ca', 'it.m', 'sv.m', 'ko'],
      dtype='object')

Correspondence:
- ja -> Japanese
- it -> Italian
- da -> Danish
- tr -> Turkish
- no -> Norwegian
- en -> English
- sr -> Serbian
- sv -> Swedish
- nl -> Dutch
- de -> German
- fr -> French
- ca -> Catalan
- ko -> Korean
- fi -> Finnish

The codes `tr` and `ca` were the least obvious ones. According to https://www.loc.gov/standards/iso639-2/php/langcodes-search.php, they correspond to Turkish and Catalan respectively.
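As a small illustration of the correspondence above, the sketch below decodes a column name into a language name and a platform. The names `LANGUAGE_NAMES` and `describe_column` are hypothetical helpers introduced here, and the code assumes that the `.m` suffix marks the mobile version of a Wikipedia while a bare code marks the desktop version.

```python
# Illustrative mapping from language code to language name (hypothetical helper).
LANGUAGE_NAMES = {
    "ja": "Japanese", "it": "Italian", "da": "Danish", "tr": "Turkish",
    "no": "Norwegian", "en": "English", "sr": "Serbian", "sv": "Swedish",
    "nl": "Dutch", "de": "German", "fr": "French", "ca": "Catalan",
    "ko": "Korean", "fi": "Finnish",
}


def describe_column(column_name):
    """Split a column name such as 'ja.m' into (language name, platform)."""
    code, _, suffix = column_name.partition(".")
    # Assumption: the '.m' suffix marks the mobile site, no suffix means desktop.
    platform = "mobile" if suffix == "m" else "desktop"
    return LANGUAGE_NAMES[code], platform


print(describe_column("ja.m"))
print(describe_column("sv"))
```

Keeping the language and the platform separate makes it easier to later add up desktop and mobile views for a given language.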

### Splitting the timeseries data into different dataframes

As we can see, the data's format isn't ideal: for each language, the data is split into 3 Python dictionaries corresponding to the data described above. It would be nice to separate these pieces of data so that we can read directly, for each date, quantities such as the total number of views across all languages, instead of having to iterate over each language's dictionary every time.

This will also make the analysis phase easier later on.
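To illustrate why a date-by-language layout is convenient, here is a minimal sketch on toy data. The numbers are made up and `views_by_language` is a hypothetical stand-in for the nested dictionaries of the real file.

```python
import pandas as pd

# Toy stand-in for the nested input: one dictionary of daily views per language code.
# The numbers are made up for illustration.
views_by_language = {
    "en": {"2018-01-01 00:00:00": 100, "2018-01-02 00:00:00": 120},
    "fr": {"2018-01-01 00:00:00": 10, "2018-01-02 00:00:00": 15},
}

# One row per date, one column per language.
views_df = pd.DataFrame.from_dict(views_by_language, orient="index").T
views_df.index = pd.to_datetime(views_df.index)

# With this layout, the total across languages is a single row-wise sum.
total_views = views_df.sum(axis=1)
print(total_views)
```

Once the data is in this shape, most per-date aggregations reduce to one pandas call instead of a loop over dictionaries.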

### Total sum of views, views of articles related to Covid

<a id='extraction_format'></a>
In this part of the code we extract the following data for each date (this yields three dataframes, although the last two carry the same information in different forms):
- For every language's Wikipedia, the total number of views on that particular date
- For every language's Wikipedia, the total number of views for articles related to Covid-19 on that particular date
- For every language's Wikipedia, the percentage of views for articles related to Covid-19 on that particular date

Note that the last two dataframes are partly redundant, but since the data is provided anyway, we extract both.
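The redundancy can be checked by recomputing the share from the two other quantities. The sketch below does this for two values of the `ko` column taken from the `head()` outputs in this notebook; it assumes that the stored "percent" is the plain ratio of Covid-related views to total views (a fraction, not multiplied by 100).

```python
import pandas as pd

dates = pd.to_datetime(["2018-01-01", "2018-01-02"])
# Values copied from the first two rows of the `ko` column in the head() outputs.
total_views = pd.DataFrame({"ko": [819174, 959239]}, index=dates)
covid_views = pd.DataFrame({"ko": [3, 20]}, index=dates)

# Assumption: the stored "percent" is the plain ratio covid views / total views.
recomputed_share = covid_views / total_views
print(recomputed_share)
```

The recomputed values are roughly 3.7e-06 and 2.1e-05, which is consistent with the rounded values displayed for `ko` in the percentage dataframe.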

---


Every resulting dataframe will have the following format:

| Column name          | Description |
|----------------------|-------------|
| dates                | A date between January 2018 and July 2020 (both inclusive) |
| language_code        | Depending on the dataframe: the total number of views for that language's Wikipedia, the number of views of Covid-related articles on that same Wikipedia, or the share of total views that the latter represent. There are 28 such columns, as there are 14 languages and the desktop and mobile data are kept separate. |


---

We also extract another dataframe that simply maps each language to the number of articles that were considered in the original experiment.

In [4]:
timeseries_total_sum_dict = {}
timeseries_covid_len_dict = {}
timeseries_covid_sum_dict = {}
timeseries_covid_percent_dict = {}
for cn in timeseries.columns:
    timeseries_total_sum_dict[cn] = timeseries[cn]['sum']
    timeseries_covid_len_dict[cn] = timeseries[cn]['covid']['len']
    timeseries_covid_sum_dict[cn] = timeseries[cn]['covid']['sum']
    timeseries_covid_percent_dict[cn] = timeseries[cn]['covid']['percent']

In [5]:
sum_data_df = pd.DataFrame.from_dict(timeseries_total_sum_dict, orient = 'index').T
covid_len_data_df = pd.DataFrame.from_dict(timeseries_covid_len_dict, orient = 'index', columns = ['len']).T
covid_sum_data_df = pd.DataFrame.from_dict(timeseries_covid_sum_dict, orient = 'index').T
covid_percent_data_df = pd.DataFrame.from_dict(timeseries_covid_percent_dict, orient = 'index').T

In [6]:
# Keep only the date part of each timestamp string ('2018-01-01 00:00:00' -> '2018-01-01')
index_without_time = [x[:10] for x in covid_sum_data_df.index]
# Use the same DatetimeIndex for the three dataframes
sum_data_df.index = covid_sum_data_df.index = covid_percent_data_df.index = pd.to_datetime(index_without_time)
# Also expose the dates as a regular column
sum_data_df['dates'] = covid_sum_data_df['dates'] = covid_percent_data_df['dates'] = sum_data_df.index
# Put 'dates' first, followed by the language columns in reverse order
new_column_order = [sum_data_df.columns[-1]] + list(sum_data_df.columns[-2::-1])
sum_data_df = sum_data_df[new_column_order]
covid_sum_data_df = covid_sum_data_df[new_column_order]
covid_percent_data_df = covid_percent_data_df[new_column_order]

In [7]:
sum_data_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,819174,2383474,12856884,111910,523135,6441009,7828155,451383,1319053,...,224417,135822131,493684,192409,86763830,715031,407629,765123,3338750,22328288
2018-01-02,2018-01-02,959239,1873096,12887390,198405,648344,9079323,8759385,462824,1094280,...,374771,127087359,483443,253653,112245349,536506,426791,443384,5428428,22278953
2018-01-03,2018-01-03,1037688,1863012,12859488,188728,644605,9746428,9996156,404880,1022615,...,459743,116606137,471814,272278,121868290,552379,468642,415545,5640812,23632758
2018-01-04,2018-01-04,956653,1810874,12359845,203167,643311,10034517,11976989,391631,1001547,...,479999,115650878,462107,273699,112888840,528468,462860,416943,5794860,21893587
2018-01-05,2018-01-05,955955,2191670,12107559,168126,614471,9511358,12746833,380159,1012466,...,444863,116127950,485637,257239,109213987,600262,407226,455392,5475376,20734837


In [8]:
covid_sum_data_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,3,19,139,6,2,62,26,11,0,...,2,911,3,6,575,7,1,0,50,55
2018-01-02,2018-01-02,20,12,187,6,2,91,42,20,0,...,4,1006,2,13,1081,5,3,2,103,55
2018-01-03,2018-01-03,20,13,162,9,2,109,53,15,8,...,3,919,2,11,1265,2,6,1,130,51
2018-01-04,2018-01-04,16,14,180,10,3,107,114,30,0,...,1,1026,9,6,1167,2,1,0,112,46
2018-01-05,2018-01-05,20,11,127,4,0,113,134,33,0,...,4,978,2,13,1054,2,0,0,119,70


In [9]:
covid_percent_data_df.head()

Unnamed: 0,dates,ko,sv.m,it.m,ca,fi,fr,ja,sr.m,fi.m,...,no,en.m,tr.m,sr,en,no.m,tr,da.m,it,ja.m
2018-01-01,2018-01-01,4e-06,8e-06,1.1e-05,5.4e-05,4e-06,1e-05,3e-06,2.4e-05,0.0,...,9e-06,7e-06,6e-06,3.1e-05,7e-06,1e-05,2e-06,0.0,1.5e-05,2e-06
2018-01-02,2018-01-02,2.1e-05,6e-06,1.5e-05,3e-05,3e-06,1e-05,5e-06,4.3e-05,0.0,...,1.1e-05,8e-06,4e-06,5.1e-05,1e-05,9e-06,7e-06,5e-06,1.9e-05,2e-06
2018-01-03,2018-01-03,1.9e-05,7e-06,1.3e-05,4.8e-05,3e-06,1.1e-05,5e-06,3.7e-05,8e-06,...,7e-06,8e-06,4e-06,4e-05,1e-05,4e-06,1.3e-05,2e-06,2.3e-05,2e-06
2018-01-04,2018-01-04,1.7e-05,8e-06,1.5e-05,4.9e-05,5e-06,1.1e-05,1e-05,7.7e-05,0.0,...,2e-06,9e-06,1.9e-05,2.2e-05,1e-05,4e-06,2e-06,0.0,1.9e-05,2e-06
2018-01-05,2018-01-05,2.1e-05,5e-06,1e-05,2.4e-05,0.0,1.2e-05,1.1e-05,8.7e-05,0.0,...,9e-06,8e-06,4e-06,5.1e-05,1e-05,3e-06,0.0,0.0,2.2e-05,3e-06


### Checking for missing data

Before continuing further, let us check for missing data in the timeseries; this will help us avoid bad surprises later on.

In [10]:
sum_data_df.isnull().any().any(), covid_sum_data_df.isnull().any().any(), covid_percent_data_df.isnull().any().any()

(False, False, True)

There appears to be some missing data in the percentage dataframe; let's check by language where that missing data is reported.

In [11]:
missing_data_language = covid_percent_data_df.isnull().any(axis = 0)
missing_data_language[missing_data_language]

sv    True
dtype: bool

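According to the paper, this code corresponds to Swedish. Only the desktop column `sv` is affected; the mobile column `sv.m` is not reported as missing. Let's check in the other dataframes how that missing data manifests itself.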

In [12]:
sum_data_df.loc[:,['dates','sv']]

Unnamed: 0,dates,sv
2018-01-01,2018-01-01,0
2018-01-02,2018-01-02,0
2018-01-03,2018-01-03,0
2018-01-04,2018-01-04,0
2018-01-05,2018-01-05,0
...,...,...
2020-07-27,2020-07-27,622918
2020-07-28,2020-07-28,645601
2020-07-29,2020-07-29,639190
2020-07-30,2020-07-30,613870


In [13]:
covid_sum_data_df.loc[:,['dates', 'sv']]

Unnamed: 0,dates,sv
2018-01-01,2018-01-01,0
2018-01-02,2018-01-02,0
2018-01-03,2018-01-03,0
2018-01-04,2018-01-04,0
2018-01-05,2018-01-05,0
...,...,...
2020-07-27,2020-07-27,245
2020-07-28,2020-07-28,291
2020-07-29,2020-07-29,296
2020-07-30,2020-07-30,267


In both dataframes, the missing data shows up as 0 values on the affected dates. The NaN values in the percentage dataframe therefore result from dividing 0 by 0. Let us now find the first date for which we have data for Swedish.
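To make the origin of the NaN values concrete, here is a minimal sketch. The 2019-01-01 values are the ones reported for `sv` in this notebook, while the two zero rows are illustrative stand-ins for the 2018 dates.

```python
import pandas as pd

days = ["2018-12-30", "2018-12-31", "2019-01-01"]
# Zeros stand in for the 2018 rows; the last values are those of 2019-01-01.
total_views = pd.Series([0, 0, 607516], index=days)
covid_views = pd.Series([0, 0, 1], index=days)

# 0 / 0 is undefined, so pandas stores NaN for the first two days.
share = covid_views / total_views
print(share.isna())
```

Note that a non-zero numerator divided by 0 would give `inf` rather than NaN, so NaN here indicates that both quantities are 0.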

In [14]:
swedish_mask = sum_data_df.loc[:,'sv'] > 0

In [15]:
sum_data_df.loc[:,'sv'][swedish_mask]

2019-01-01    607516
2019-01-02    821962
2019-01-03    872335
2019-01-04    854304
2019-01-05    775861
               ...  
2020-07-27    622918
2020-07-28    645601
2020-07-29    639190
2020-07-30    613870
2020-07-31    549610
Name: sv, Length: 578, dtype: int64

In [16]:
covid_sum_data_df.loc[:,'sv'][swedish_mask]

2019-01-01      1
2019-01-02      1
2019-01-03      3
2019-01-04      5
2019-01-05      2
             ... 
2020-07-27    245
2020-07-28    291
2020-07-29    296
2020-07-30    267
2020-07-31    252
Name: sv, Length: 578, dtype: int64

From what we can see, all the Swedish desktop data from 2018 is missing: the first non-zero value is on 2019-01-01. We will need to take that into consideration in our analysis.

The reason behind this is not clear. The Swedish version of Wikipedia has existed since 2001, so it is strange that a whole year of data is absent; it may have been lost, or it may never have been collected in the first place.
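One possible way to take this into account is to mark the affected values as missing instead of leaving them at 0, so that they do not bias averages. The sketch below uses a toy frame (the `fr` values are made up) and assumes that a 0 in `sv` before 2019-01-01 means "not collected" rather than "no views".

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2018-12-31", "2019-01-01", "2019-01-02"])
# Toy frame: `sv` is 0 before 2019 (missing), the `fr` values are made up.
views = pd.DataFrame({"sv": [0, 607516, 821962], "fr": [5, 6, 7]}, index=dates)

# Assumption: a 0 in `sv` before 2019-01-01 means "not collected", not "no views".
cleaned = views.astype(float)
cleaned.loc[cleaned.index < "2019-01-01", "sv"] = np.nan

# NaN values are skipped by default, so the mean only uses observed days.
print(cleaned["sv"].mean())
```

Working on a float copy keeps the original counts untouched, which is useful if the zeros need to be inspected again later.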

### Topics data

Now we will extract, for each language, all the views per topic in a way that makes the data more usable. In the original data, all topic-related information is stored in a single dictionary; we are going to separate it so that each column corresponds to a different topic and each row to a different language.

In [17]:
country_to_topics = {}
for cn in timeseries.columns:
    country_to_topics[cn] = timeseries[cn]['topics']
topics_df = pd.DataFrame.from_dict(country_to_topics, orient = 'index')

In [18]:
countries_to_topics_len = {}
countries_to_topics_sum = {}
countries_to_topics_percent = {}
for country in topics_df.index:
    countries_to_topics_len[country] = {}
    countries_to_topics_sum[country] = {}
    countries_to_topics_percent[country] = {}
    for topic in topics_df.columns:
        countries_to_topics_len[country][topic] = topics_df.loc[country,topic]['len']
        countries_to_topics_sum[country][topic] = topics_df.loc[country,topic]['sum']
        countries_to_topics_percent[country][topic] = topics_df.loc[country,topic]['percent']
countries_to_topics_len_df = pd.DataFrame.from_dict(countries_to_topics_len, orient = 'index')
countries_to_topics_sum_df = pd.DataFrame.from_dict(countries_to_topics_sum, orient = 'index')
countries_to_topics_percent_df = pd.DataFrame.from_dict(countries_to_topics_percent, orient = 'index')

In [19]:
countries_to_topics_sum_df.head()

Unnamed: 0,Culture.Biography.Biography*,Culture.Biography.Women,Culture.Food and drink,Culture.Internet culture,Culture.Linguistics,Culture.Literature,Culture.Media.Books,Culture.Media.Entertainment,Culture.Media.Films,Culture.Media.Media*,...,STEM.Computing,STEM.Earth and environment,STEM.Engineering,STEM.Libraries & Information,STEM.Mathematics,STEM.Medicine & Health,STEM.Physics,STEM.STEM*,STEM.Space,STEM.Technology
ja.m,"{'2018-01-01 00:00:00': 6629234, '2018-01-02 0...","{'2018-01-01 00:00:00': 1462146, '2018-01-02 0...","{'2018-01-01 00:00:00': 302934, '2018-01-02 00...","{'2018-01-01 00:00:00': 443986, '2018-01-02 00...","{'2018-01-01 00:00:00': 109480, '2018-01-02 00...","{'2018-01-01 00:00:00': 2140880, '2018-01-02 0...","{'2018-01-01 00:00:00': 97435, '2018-01-02 00:...","{'2018-01-01 00:00:00': 238059, '2018-01-02 00...","{'2018-01-01 00:00:00': 681533, '2018-01-02 00...","{'2018-01-01 00:00:00': 4264889, '2018-01-02 0...",...,"{'2018-01-01 00:00:00': 91338, '2018-01-02 00:...","{'2018-01-01 00:00:00': 72493, '2018-01-02 00:...","{'2018-01-01 00:00:00': 316615, '2018-01-02 00...","{'2018-01-01 00:00:00': 10072, '2018-01-02 00:...","{'2018-01-01 00:00:00': 44902, '2018-01-02 00:...","{'2018-01-01 00:00:00': 485801, '2018-01-02 00...","{'2018-01-01 00:00:00': 76863, '2018-01-02 00:...","{'2018-01-01 00:00:00': 1793359, '2018-01-02 0...","{'2018-01-01 00:00:00': 64445, '2018-01-02 00:...","{'2018-01-01 00:00:00': 264636, '2018-01-02 00..."
it,"{'2018-01-01 00:00:00': 809879, '2018-01-02 00...","{'2018-01-01 00:00:00': 193009, '2018-01-02 00...","{'2018-01-01 00:00:00': 34632, '2018-01-02 00:...","{'2018-01-01 00:00:00': 66037, '2018-01-02 00:...","{'2018-01-01 00:00:00': 23304, '2018-01-02 00:...","{'2018-01-01 00:00:00': 206403, '2018-01-02 00...","{'2018-01-01 00:00:00': 50646, '2018-01-02 00:...","{'2018-01-01 00:00:00': 86717, '2018-01-02 00:...","{'2018-01-01 00:00:00': 395631, '2018-01-02 00...","{'2018-01-01 00:00:00': 1137084, '2018-01-02 0...",...,"{'2018-01-01 00:00:00': 41406, '2018-01-02 00:...","{'2018-01-01 00:00:00': 20273, '2018-01-02 00:...","{'2018-01-01 00:00:00': 51490, '2018-01-02 00:...","{'2018-01-01 00:00:00': 7526, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 15705, '2018-01-02 00:...","{'2018-01-01 00:00:00': 76109, '2018-01-02 00:...","{'2018-01-01 00:00:00': 31334, '2018-01-02 00:...","{'2018-01-01 00:00:00': 383789, '2018-01-02 00...","{'2018-01-01 00:00:00': 18815, '2018-01-02 00:...","{'2018-01-01 00:00:00': 78432, '2018-01-02 00:..."
da.m,"{'2018-01-01 00:00:00': 289706, '2018-01-02 00...","{'2018-01-01 00:00:00': 74001, '2018-01-02 00:...","{'2018-01-01 00:00:00': 13610, '2018-01-02 00:...","{'2018-01-01 00:00:00': 4361, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 4238, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 28733, '2018-01-02 00:...","{'2018-01-01 00:00:00': 8817, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 13707, '2018-01-02 00:...","{'2018-01-01 00:00:00': 47315, '2018-01-02 00:...","{'2018-01-01 00:00:00': 269483, '2018-01-02 00...",...,"{'2018-01-01 00:00:00': 2505, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 3840, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6923, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 226, '2018-01-02 00:00...","{'2018-01-01 00:00:00': 1783, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 12618, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3836, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 62767, '2018-01-02 00:...","{'2018-01-01 00:00:00': 2775, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 7414, '2018-01-02 00:0..."
tr,"{'2018-01-01 00:00:00': 98424, '2018-01-02 00:...","{'2018-01-01 00:00:00': 14151, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3154, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 7279, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 9300, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 14074, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3511, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 3366, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 10859, '2018-01-02 00:...","{'2018-01-01 00:00:00': 54290, '2018-01-02 00:...",...,"{'2018-01-01 00:00:00': 7422, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 3512, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6465, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 1297, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 2732, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 8190, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6441, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 59353, '2018-01-02 00:...","{'2018-01-01 00:00:00': 4067, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 13951, '2018-01-02 00:..."
no.m,"{'2018-01-01 00:00:00': 232404, '2018-01-02 00...","{'2018-01-01 00:00:00': 64920, '2018-01-02 00:...","{'2018-01-01 00:00:00': 15889, '2018-01-02 00:...","{'2018-01-01 00:00:00': 5802, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 7222, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 27984, '2018-01-02 00:...","{'2018-01-01 00:00:00': 9003, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 9807, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 27301, '2018-01-02 00:...","{'2018-01-01 00:00:00': 153531, '2018-01-02 00...",...,"{'2018-01-01 00:00:00': 2907, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 6839, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 10382, '2018-01-02 00:...","{'2018-01-01 00:00:00': 596, '2018-01-02 00:00...","{'2018-01-01 00:00:00': 1871, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 21247, '2018-01-02 00:...","{'2018-01-01 00:00:00': 6604, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 97684, '2018-01-02 00:...","{'2018-01-01 00:00:00': 3576, '2018-01-02 00:0...","{'2018-01-01 00:00:00': 10887, '2018-01-02 00:..."


In [20]:
countries_to_topics_sum_df.columns

Index(['Culture.Biography.Biography*', 'Culture.Biography.Women',
       'Culture.Food and drink', 'Culture.Internet culture',
       'Culture.Linguistics', 'Culture.Literature', 'Culture.Media.Books',
       'Culture.Media.Entertainment', 'Culture.Media.Films',
       'Culture.Media.Media*', 'Culture.Media.Music', 'Culture.Media.Radio',
       'Culture.Media.Software', 'Culture.Media.Television',
       'Culture.Media.Video games', 'Culture.Performing arts',
       'Culture.Philosophy and religion', 'Culture.Sports',
       'Culture.Visual arts.Architecture',
       'Culture.Visual arts.Comics and Anime', 'Culture.Visual arts.Fashion',
       'Culture.Visual arts.Visual arts*', 'Geography.Geographical',
       'Geography.Regions.Africa.Africa*',
       'Geography.Regions.Africa.Central Africa',
       'Geography.Regions.Africa.Eastern Africa',
       'Geography.Regions.Africa.Northern Africa',
       'Geography.Regions.Africa.Southern Africa',
       'Geography.Regions.Africa.Western 

However, we might not be interested in all available topics. For our project, it might be useful to isolate only the data about articles related to the environment. Examining the columns, only one of them covers this topic (`STEM.Earth and environment`), so we will extract only that topic into two dataframes that have the same format as [here](#extraction_format).

In [21]:
sum_environment_df = countries_to_topics_sum_df['STEM.Earth and environment']
percent_environment_df = countries_to_topics_percent_df['STEM.Earth and environment']
country_to_env_data_sum = {}
country_to_env_data_percent = {}
for country in sum_environment_df.index:
    country_to_env_data_sum[country] = sum_environment_df[country]
    country_to_env_data_percent[country] = percent_environment_df[country]
sum_environment_df = pd.DataFrame.from_dict(country_to_env_data_sum, orient = 'index').T
sum_environment_df.index = index_without_time
percent_environment_df = pd.DataFrame.from_dict(country_to_env_data_percent, orient = 'index').T
percent_environment_df.index = index_without_time

In [22]:
sum_environment_df.head()

Unnamed: 0,ja.m,it,da.m,tr,no.m,en,sr,tr.m,en.m,no,...,ko.m,fi.m,sr.m,ja,fr,fi,ca,it.m,sv.m,ko
2018-01-01,72493,20273,3840,3512,6839,709659,1282,2899,906437,2906,...,5291,10513,1868,38004,60078,4987,2198,57629,19954,5120
2018-01-02,96614,41790,3936,3462,5857,973197,1956,2992,981638,6096,...,5236,10424,2813,46235,92710,7407,4645,73085,17666,5995
2018-01-03,107578,45349,4166,4137,6683,1115237,2150,3560,998147,8860,...,4722,9656,2535,55066,104062,7117,4387,75968,19606,7069
2018-01-04,97229,47540,4285,4148,6919,1097507,2628,3520,1059701,9953,...,4908,10072,4035,73793,106732,7637,5124,76988,19213,6201
2018-01-05,98070,43588,4431,3297,6446,1016865,2368,2828,951065,8706,...,5100,9442,2780,85140,98577,6912,3684,71269,18918,5942


In [23]:
percent_environment_df.head()

Unnamed: 0,ja.m,it,da.m,tr,no.m,en,sr,tr.m,en.m,no,...,ko.m,fi.m,sr.m,ja,fr,fi,ca,it.m,sv.m,ko
2018-01-01,0.001414,0.002589,0.001947,0.003699,0.003708,0.003689,0.002651,0.002373,0.002359,0.005613,...,0.001663,0.003341,0.001607,0.002296,0.003844,0.004212,0.009242,0.001771,0.00326,0.002914
2018-01-02,0.002001,0.003256,0.003672,0.0034,0.004326,0.003676,0.003157,0.002486,0.002764,0.007101,...,0.002024,0.004122,0.002492,0.00256,0.00425,0.005175,0.011023,0.00222,0.003829,0.00281
2018-01-03,0.002074,0.003413,0.004139,0.00376,0.004906,0.00392,0.003244,0.003047,0.003071,0.008214,...,0.001953,0.004057,0.002561,0.002667,0.004465,0.005038,0.011034,0.002311,0.004315,0.003077
2018-01-04,0.002029,0.00358,0.004273,0.003828,0.005288,0.004073,0.00396,0.003057,0.003305,0.00887,...,0.00209,0.004323,0.004285,0.003049,0.004455,0.0054,0.011972,0.002481,0.004363,0.00309
2018-01-05,0.002136,0.003422,0.003925,0.003457,0.004078,0.003883,0.003787,0.002363,0.00293,0.008315,...,0.002155,0.003987,0.003053,0.003345,0.004351,0.005054,0.010285,0.00233,0.003549,0.003026


## Mobility data

The second type of data we have is mobility data, which comes from two different sources: Apple, which stopped publishing its data in April 2022, and Google, whose Community Mobility Reports stopped being updated in October 2022 (the historical data remains available).

### Apple mobility

In [24]:
apple_mobility = pd.read_csv("applemobilitytrends-2020-04-20.csv.gz")
#apple_mobility = apple_mobility.T
apple_mobility.head()

Unnamed: 0,geo_type,region,transportation_type,2020-01-13,2020-01-14,2020-01-15,2020-01-16,2020-01-17,2020-01-18,2020-01-19,...,2020-04-11,2020-04-12,2020-04-13,2020-04-14,2020-04-15,2020-04-16,2020-04-17,2020-04-18,2020-04-19,2020-04-20
0,country/region,Albania,driving,100,95.3,101.43,97.2,103.55,112.67,104.83,...,25.47,24.89,32.64,31.43,30.67,30.0,29.26,22.94,24.55,31.51
1,country/region,Albania,walking,100,100.68,98.93,98.46,100.85,100.13,82.13,...,27.63,29.59,35.52,38.08,35.48,39.15,34.58,27.76,27.93,36.72
2,country/region,Argentina,driving,100,97.07,102.45,111.21,118.45,124.01,95.44,...,19.4,12.89,21.1,22.29,23.55,24.4,27.17,23.19,14.54,26.67
3,country/region,Argentina,walking,100,95.11,101.37,112.67,116.72,114.14,84.54,...,15.75,10.45,16.35,16.66,17.42,18.18,18.8,17.03,10.59,18.44
4,country/region,Australia,driving,100,102.98,104.21,108.63,109.08,89.0,99.35,...,26.95,31.72,53.14,55.91,56.56,58.77,47.51,36.9,53.34,56.93


In [25]:
print(apple_mobility.transportation_type.unique())
print(apple_mobility.geo_type.unique())

['driving' 'walking' 'transit']
['country/region' 'city']


In [26]:
apple_mobility.isnull().any().any()

False

The Apple mobility data we have runs from 13 January 2020 to 20 April 2020. This isn't a big time window, and there does not appear to be earlier data, as it was collected specifically for Covid-19 mobility tracking. We could however try to look for newer data (post-April 2020) on the web.

Three types of transportation are tracked here: driving, walking, and transit. The data also comes at two levels of geographic granularity: country/region level, or city level (the cities are often country capitals).

Per day and region, we have the usage of each transportation mode as a percentage of a pre-pandemic baseline from early 2020. In the rows shown above, the first day of the window, 13 January 2020, has the value 100.
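Since the dates are stored as columns, time series analysis is usually easier after reshaping the table to a long format. The sketch below does this on two rows copied from the Apple table shown above, truncated to three dates; the names `relative_volume` and `change_from_baseline` are hypothetical column names chosen for this example.

```python
import pandas as pd

# Two rows copied from the Apple table above, truncated to three dates.
wide = pd.DataFrame({
    "geo_type": ["country/region", "country/region"],
    "region": ["Albania", "Albania"],
    "transportation_type": ["driving", "walking"],
    "2020-01-13": [100.0, 100.0],
    "2020-01-14": [95.3, 100.68],
    "2020-01-15": [101.43, 98.93],
})

id_columns = ["geo_type", "region", "transportation_type"]
# Turn the date columns into rows: one (region, transport mode, date) observation per row.
long_df = wide.melt(id_vars=id_columns, var_name="date", value_name="relative_volume")
long_df["date"] = pd.to_datetime(long_df["date"])

# Express each value as a change with respect to the baseline of 100.
long_df["change_from_baseline"] = long_df["relative_volume"] - 100
print(long_df)
```

Expressing the values as a change from the baseline also puts them on the same footing as the Google data, which is given as a percent change.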

In [27]:
apple_mobility_walking = apple_mobility[apple_mobility.transportation_type == 'walking']
apple_mobility_driving = apple_mobility[apple_mobility.transportation_type == 'driving']
apple_mobility_transit = apple_mobility[apple_mobility.transportation_type == 'transit']

### Global mobility from Google

In [28]:
global_mobility_report = pd.read_csv("Global_Mobility_Report.csv.gz")


In [29]:
global_mobility_report.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,AE,United Arab Emirates,,,,,,2020-02-15,0.0,4.0,5.0,0.0,2.0,1.0
1,AE,United Arab Emirates,,,,,,2020-02-16,1.0,4.0,4.0,1.0,2.0,1.0
2,AE,United Arab Emirates,,,,,,2020-02-17,-1.0,1.0,5.0,1.0,2.0,1.0
3,AE,United Arab Emirates,,,,,,2020-02-18,-2.0,1.0,5.0,0.0,2.0,1.0
4,AE,United Arab Emirates,,,,,,2020-02-19,-2.0,0.0,4.0,-1.0,2.0,1.0


In [30]:
global_mobility_report.sub_region_1.unique()

array([nan, 'Abu Dhabi', 'Ajman', ..., 'Matabeleland North Province',
       'Matabeleland South Province', 'Midlands Province'], dtype=object)

In [31]:
min(global_mobility_report.date.unique()), max(global_mobility_report.date.unique())

('2020-02-15', '2020-08-25')

The Google mobility data we have runs from 15 February 2020 to 25 August 2020. This is a longer window than the Apple data we were given, even though both were collected in the context of Covid-19. As with Apple, we can probably extend this data.

There are more levels of granularity with this data: for example, for the United Arab Emirates, we might simply talk about the whole country, or it could be specified in the column *sub_region_1* that the row is actually focused on the city of Abu Dhabi. This granularity can be made finer with *sub_region_2*.

Per day and region, we have the **percent change** in visits to various location types (workplaces, etc.) relative to a pre-pandemic baseline computed in early 2020.
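Because rows at different levels of granularity are mixed in the same table, it is often useful to isolate the country-level rows first. The sketch below assumes that a row describes a whole country when `sub_region_1`, `sub_region_2` and `metro_area` are all empty; the toy rows mimic the report's columns, and the Abu Dhabi value is made up.

```python
import pandas as pd

# Toy rows mimicking the columns of the Google report; the Abu Dhabi value is made up.
report = pd.DataFrame({
    "country_region_code": ["AE", "AE", "AE"],
    "sub_region_1": [None, "Abu Dhabi", None],
    "sub_region_2": [None, None, None],
    "metro_area": [None, None, None],
    "date": ["2020-02-15", "2020-02-15", "2020-02-16"],
    "workplaces_percent_change_from_baseline": [2.0, 3.0, 2.0],
})

# Assumption: a row describes a whole country when every finer-grained column is empty.
region_columns = ["sub_region_1", "sub_region_2", "metro_area"]
is_country_level = report[region_columns].isna().all(axis=1)

country_level = report[is_country_level].copy()
country_level["date"] = pd.to_datetime(country_level["date"])
print(country_level)
```

Filtering this way avoids double counting a region when aggregating, since sub-region rows describe parts of what the country-level row already covers.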

## Interventions

In this data, each language is mapped to its main country of usage, except for English, which is widely used in so many countries that it could not be mapped to a single one. As a result, most of the data is missing for that language.

For each country, the pandemic timeline is represented by key dates such as the first registered case, the first death, etc.

Note that the paper says that nine languages are spoken in a single country, but we have more than that here (the reason is unknown).

In [32]:
interventions = pd.read_csv("interventions.csv")
interventions

Unnamed: 0,lang,1st case,1st death,School closure,Public events banned,Lockdown,Mobility,Normalcy
0,fr,2020-01-24,2020-02-14,2020-03-14,2020-03-13,2020-03-17,2020-03-16,2020-07-02
1,da,2020-02-27,2020-03-12,2020-03-13,2020-03-12,2020-03-18,2020-03-11,2020-06-05
2,de,2020-01-27,2020-03-09,2020-03-14,2020-03-22,2020-03-22,2020-03-16,2020-07-10
3,it,2020-01-31,2020-02-22,2020-03-05,2020-03-09,2020-03-11,2020-03-11,2020-06-26
4,nl,2020-02-27,2020-03-06,2020-03-11,2020-03-24,,2020-03-16,2020-05-29
5,no,2020-02-26,2020-02-26,2020-03-13,2020-03-12,2020-03-24,2020-03-11,2020-06-04
6,sr,2020-03-06,2020-03-20,2020-03-15,2020-03-21,2020-03-21,2020-03-16,2020-05-02
7,sv,2020-01-31,2020-03-11,2020-03-18,2020-03-12,,2020-03-11,2020-06-05
8,ko,2020-01-20,2020-02-20,2020-02-23,,,2020-02-25,2020-04-15
9,ca,2020-01-31,2020-02-13,2020-03-12,2020-03-08,2020-03-14,2020-03-16,
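The dates in this table are read as plain strings, so they have to be converted before any date arithmetic. Here is a minimal sketch on three rows copied from the table above; `interventions_sample` and `days_to_lockdown` are hypothetical names used for this example.

```python
import io

import pandas as pd

# Three rows copied from the table above; empty fields are missing interventions.
csv_text = """lang,1st case,1st death,School closure,Public events banned,Lockdown,Mobility,Normalcy
fr,2020-01-24,2020-02-14,2020-03-14,2020-03-13,2020-03-17,2020-03-16,2020-07-02
it,2020-01-31,2020-02-22,2020-03-05,2020-03-09,2020-03-11,2020-03-11,2020-06-26
sv,2020-01-31,2020-03-11,2020-03-18,2020-03-12,,2020-03-11,2020-06-05
"""

interventions_sample = pd.read_csv(io.StringIO(csv_text), index_col="lang")
# Convert every column to datetime; empty cells become NaT.
interventions_sample = interventions_sample.apply(pd.to_datetime)

# Days elapsed between the first registered case and the lockdown, per language.
days_to_lockdown = (
    interventions_sample["Lockdown"] - interventions_sample["1st case"]
).dt.days
print(days_to_lockdown)
```

Languages without a recorded lockdown, such as `sv`, end up with a missing value rather than raising an error, which keeps later aggregations straightforward.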


## Topics

This file simply maps each considered article to the topics it is related to. A single article can be mapped to multiple topics.

In [33]:
topics_linked = pd.read_csv("topics_linked.csv.xz")
topics_linked

Unnamed: 0,index,Geography.Regions.Asia.Central Asia,Geography.Regions.Europe.Eastern Europe,History and Society.Military and warfare,Culture.Media.Television,History and Society.Education,Culture.Media.Books,Geography.Regions.Africa.Africa*,Culture.Visual arts.Architecture,Culture.Biography.Women,...,Geography.Regions.Asia.West Asia,STEM.Chemistry,Geography.Regions.Europe.Northern Europe,Culture.Media.Video games,Geography.Regions.Asia.Southeast Asia,Culture.Media.Entertainment,Culture.Media.Music,Geography.Regions.Asia.Asia*,Geography.Regions.Asia.North Asia,qid
0,Rosmalen,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q2001490
1,Commelinales,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q290349
2,Transport_in_Honduras,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q1130638
3,QuakeC,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q2122062
4,Food_writing,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q5465542
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4306121,Faimaala_Filipo,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,Q84090991
4306122,Jonathan_Horne,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q1666264
4306123,Steven_Da_Costa,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Q22921600
4306124,The_Silence_of_Dr._Evans,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,Q4301095
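Since every topic is a boolean column, selecting the articles of a topic or counting the topics of an article only takes a few lines. The sketch below uses three rows taken from the output above, restricted to three topic columns; `topics_sample` is a hypothetical stand-in for the full table.

```python
import pandas as pd

# Toy version of the table: `index` is the article title, each topic is a boolean column.
# Rows are taken from the output above, restricted to three topic columns.
topics_sample = pd.DataFrame({
    "index": ["Rosmalen", "Faimaala_Filipo", "The_Silence_of_Dr._Evans"],
    "Geography.Regions.Europe.Eastern Europe": [False, False, True],
    "History and Society.Military and warfare": [False, False, True],
    "Culture.Biography.Women": [False, True, False],
    "qid": ["Q2001490", "Q84090991", "Q4301095"],
})

topic_columns = [c for c in topics_sample.columns if c not in ("index", "qid")]

# Number of topics attached to each article (True counts as 1).
topics_per_article = topics_sample.set_index("index")[topic_columns].sum(axis=1)

# Titles of the articles tagged with one given topic.
is_women_topic = topics_sample["Culture.Biography.Women"]
women_articles = topics_sample.loc[is_women_topic, "index"].tolist()

print(topics_per_article)
print(women_articles)
```

The same boolean mask approach could be used to list the articles tagged with the environment topic that we isolated earlier.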
