In [None]:
'''
File name: milestone3.ipynb
Authors: Charlotte Sertic, Arthur Nussbaumer, Carl Penning, Robin Debalme
Date created: 21/12/2022
Date last modified: 23/12/2022
Python version: 3.x.x
'''

# Milestone 3 research and data visualisation

### Table of Contents

* [Digital Propagation](#chapter1)
    * [Geographic physical & digital propagation of COVID-19 during the 1st wave](#section_1_1)
    * [Digital and physical propagation of Covid-19 during the whole pandemic](#section_1_2) 
    * [Potential relationship between pageviews, deaths and cases](#section_1_3) 
        * [During the beginning of the 1st wave](#Subsection_1_1) 
        * [At the end of the pandemic](#Subsection_1_2) 
        * [Evolution during the whole pandemic](#Subsection_1_3)
* [Trust and mobility](#chapter2)
    * [Mobility](#section_2_1)
    * [Mobility and death and pageviews](#section_2_2)
    * [Mobility and trust](#Section_2_3)
       
        
* [Case studies](#chapter3)      


We import all the libraries needed to compute and plot our results.

In [127]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from scipy import stats
from helper import *
# General-purpose packages
import datetime
import math
import json
import zipfile
import ssl
# Regression models and animated charts
import statsmodels.formula.api as smf
import statsmodels.api as sm
import plotly.express as px
import bar_chart_race as brc
from statsmodels.regression.rolling import RollingOLS

# To download data
import requests
import io
import gzip

# To create the map charts
import iso3166
import plotly
from iso3166 import countries
import plotly.graph_objects as go

## Digital propagation <a class="anchor" id="chapter1"></a>
### Geographic physical & digital propagation of COVID-19 during the 1st wave <a class="anchor" id="section_1_1"></a>

We first load the raw data for pageviews and population, and then clean it to obtain a dataframe of pageviews, cumulative pageviews, pageviews per 100,000 inhabitants, and cumulative pageviews per 100,000 inhabitants using the `get_pageviews_df` function.

For cases and deaths due to COVID-19, we do the same, but we obtain the raw data using URLs with the `get_cases_deaths_df` function.

For our initial analysis, we consider the data from the start of the COVID-19 pandemic, from **January 22, 2020** to **July 31, 2020**. We want to analyse the digital propagation of COVID-19 geographically. Because of limitations of the library used to create the map charts, we restrict this analysis to the 1st wave.
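
The helper `get_pageviews_df` lives in `helper.py` and is not shown in this notebook. As a rough illustration of the four outputs described above (and not the helper's actual implementation), the sketch below derives them from a small made-up table of daily pageviews. The function name `normalise_pageviews`, the country codes and the population figures are assumptions.

```python
import pandas as pd

def normalise_pageviews(daily: pd.DataFrame, population: pd.Series):
    """Return (daily, cumulative, daily per 100k, cumulative per 100k).

    `daily` has one row per date and one column per country code;
    `population` maps the same codes to a number of inhabitants.
    Hypothetical stand-in, not the code of helper.get_pageviews_df.
    """
    cumulative = daily.cumsum()
    per_100k = daily.div(population, axis=1) * 100_000
    cumulative_per_100k = per_100k.cumsum()
    return daily, cumulative, per_100k, cumulative_per_100k

# Made-up numbers, for illustration only
dates = pd.date_range("2020-01-22", periods=3, freq="D")
daily = pd.DataFrame({"it": [100, 300, 600], "sv": [10, 20, 30]}, index=dates)
population = pd.Series({"it": 60_000_000, "sv": 10_000_000})

_, cumul, per100k, cumul100k = normalise_pageviews(daily, population)
print(cumul100k.round(3))
```

Normalising before accumulating keeps countries of very different sizes comparable on the same colour scale.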

In [99]:
#Loading raw df from csv file
pageview_df = pd.read_csv("page_views_covid_related.csv.gz")
population_df = pd.read_csv("Population_countries.csv")
#get cleaned dfs, cumulative df and per 100k of population dfs for pageviews, covid cases and deaths data 
df_pageviews, df_pageviews_cumul, df_pageviews100k, df_pageviews_cumul100k = get_pageviews_df(pageview_df, population_df, get_country_dict('original'), '2020-01-22', '2020-09-01')
deaths, cases, deaths_cumul, cases_cumul, deaths100k, deaths100k_cumul, cases100k, cases100k_cumul = get_cases_deaths_df(population_df, get_country_dict('original'), '2020-01-22', '2020-09-01')

Here, we create a dictionary `o_country_dict` that maps country names to their language codes using the `get_country_dict` function. We then create a reversed version of the dictionary, `inv_o_country_dict`, that maps language codes to country names. We also define a dictionary `other_country_name` with the countries whose names are spelled differently by the mapping library, so that we can use them later for our map chart plots.
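
As a small illustration of these lookups, the sketch below inverts a toy country-to-code dictionary and resolves a name through an override table before falling back to the name itself. The toy entries and the function `map_name` are assumptions made for the example; the real dictionary comes from `get_country_dict`.

```python
# Toy stand-in for get_country_dict('original'): country name -> language code
country_to_code = {"Italy": "it", "Sweden": "sv", "Russia": "ru"}

# Invert it: language code -> country name (assumes every code is unique)
code_to_country = {code: name for name, code in country_to_code.items()}

# Names that the mapping library spells differently (illustrative subset)
name_overrides = {"Russia": "Russian Federation"}

def map_name(code: str) -> str:
    """Return the name to use on the map for a given language code."""
    name = code_to_country[code]
    return name_overrides.get(name, name)

print(map_name("ru"))  # Russian Federation
print(map_name("it"))  # Italy
```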

In [100]:
# Get a dictionary mapping country names to their language codes
o_country_dict = get_country_dict('original')

# Reverse the dictionary to map language codes to country names
inv_o_country_dict = {v: k for k, v in o_country_dict.items()}

# Add exceptions for countries whose ISO 3166 names differ from the names used in our data
other_country_name = {
    "Russia": "Russian Federation",
    "Turkey":"Türkiye",
    "Vietnam" : "Viet Nam",
    "South Korea" : "Korea, Republic of"  # "Korea, Democratic People's Republic of" is North Korea
}

The following code creates map charts using the cumulative deaths per 100k of population `deaths100k_cumul`, the cumulative pageviews per 100k of population `df_pageviews_cumul100k` and the cumulative cases per 100k of population `cases100k_cumul`.

In [101]:
# Create a df for the mapchart using the cumulative death data
deaths_mapchart = mapcharts_df(deaths100k_cumul, o_country_dict, 'deaths')
# Create map chart
mapcharts(deaths_mapchart, 'deaths', 'country', """Number of cumulative COVID-19 deaths per 100k inhabitants""",
'The colour of the country corresponds to how many deaths per 100k inhabitants were recorded from 22-01-2020.', 15, 10, 'Reds')

![title](screen_plots/mapdeaths.png)

With this map, we can see that Italy was heavily impacted by Covid-19 early on, and that deaths spiked significantly. Belgium and Sweden followed two weeks later. At the end of the first wave (**July 31, 2020**), we can see significant disparities between countries. For example, Poland had only 5.4 deaths per 100,000 inhabitants compared to Sweden and Italy, which had ten times more deaths.

In [102]:
# Create a df for the mapchart using the cumulative pageviews data
pageviews_mapchart = mapcharts_df(df_pageviews_cumul100k, o_country_dict, 'pageviews')
# Create map chart
mapcharts(pageviews_mapchart, 'pageviews', 'country', """Number of cumulative pageviews per 100k inhabitants""",
'The colour of the country corresponds to how many pageviews per 100k inhabitants were recorded from 22-01-2020.', 15, 10, 'ylorbr')

We can see that pageviews of Covid-related articles evolve differently from deaths. The countries with the highest pageviews (Germany: 38k per 100k of population and Czechia: 34k per 100k) have relatively low numbers of deaths, at 11 and 4, respectively. On the other hand, Sweden, which has a high number of deaths, has low pageviews (5.7). This may be because the Swedish Wikipedia has few Covid-related articles, so Sweden's inhabitants may be informing themselves using the English Wikipedia. However, considering the number of articles in Czech (48), it is difficult to explain the four-order-of-magnitude difference between the pageviews per 100k inhabitants of Sweden and Czechia solely by the difference in the number of articles. Another hypothesis is that Swedish people seek less information about Covid-19. Later, we will investigate the possible relationship between pageviews and the number of deaths due to COVID-19.

In [103]:
COVID_RELATED_ARTICLES_PATH = "data/COVID_related_pages_project.csv"
df_covid_articles = pd.read_csv(COVID_RELATED_ARTICLES_PATH)
print("Number of covid related articles in swedish: {}".format(df_covid_articles[df_covid_articles['project'] == 'sv.wikipedia']['page'].count()))
print("Number of covid related articles in german: {}".format(df_covid_articles[df_covid_articles['project'] == 'de.wikipedia']['page'].count()))
print("Number of covid related articles in czech: {}".format(df_covid_articles[df_covid_articles['project'] == 'cs.wikipedia']['page'].count()))

Number of covid related articles in swedish: 17
Number of covid related articles in german: 253
Number of covid related articles in czech: 48


We decided not to use the number of cases to measure the spread of Covid-19 because, at the beginning of the pandemic, many cases went undetected due to the lack of testing. This data becomes more useful when considering the subsequent waves of the pandemic in 2021 and 2022.

In [104]:
# Create a df for the mapchart using the cumulative cases data
cases_mapchart = mapcharts_df(cases100k_cumul, o_country_dict, 'Cases')
# Create map chart
mapcharts(cases_mapchart, 'Cases', 'country', 'Number of cumulative cases per 100000 inhabitants and per country')


### Digital and physical propagation of Covid-19 during the whole pandemic <a class="anchor" id="section_1_2"></a>

For our next analysis, we consider the data from the whole COVID-19 pandemic, from **January 22, 2020** to **July 31, 2022**. We will explore the evolution of monthly pageviews, deaths, and cases for the top 15 countries of each variable.  

As before, we build the pageviews dataframes with `get_pageviews_df` and the cases and deaths dataframes with `get_cases_deaths_df`, this time over the whole date range.

In [105]:
df_pageviews, df_pageviews_cumul, df_pageviews100k, df_pageviews_cumul100k = get_pageviews_df(pageview_df, population_df, get_country_dict('original'), '2020-01-22', '2022-07-31')
deaths, cases, deaths_cumul, cases_cumul, deaths100k, deaths100k_cumul, cases100k, cases100k_cumul = get_cases_deaths_df(population_df, get_country_dict('original'), '2020-01-22', '2022-07-31')

To analyse the propagation and identify the different waves, we create different bar chart races:
- Monthly pageviews of Covid-19 related articles by country per 1M inhabitants
- Monthly deaths due to Covid-19 by country per 1M inhabitants
- Monthly Covid-19 cases by country per 1M inhabitants
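
`get_race_bar_df` is defined in `helper.py` and is not shown here. One plausible way to prepare such data, assuming a daily per-100k dataframe indexed by date, is to sum the values within each calendar month and multiply by 10 to express them per 1M inhabitants. The function `monthly_per_million` and the numbers below are made up for illustration.

```python
import pandas as pd

def monthly_per_million(daily_per_100k: pd.DataFrame) -> pd.DataFrame:
    """Sum daily per-100k values by calendar month and rescale to per 1M."""
    months = daily_per_100k.index.to_period("M")
    monthly = daily_per_100k.groupby(months).sum()
    return monthly * 10  # 1M inhabitants = 10 x 100k inhabitants

# Four made-up days spanning the end of March and the start of April 2020
dates = pd.date_range("2020-03-30", periods=4, freq="D")
daily = pd.DataFrame({"it": [1.0, 2.0, 3.0, 4.0], "sv": [0.5, 0.5, 0.5, 0.5]}, index=dates)

monthly = monthly_per_million(daily)
print(monthly)
```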

In [106]:

df_pageviews100k_br = get_race_bar_df(df_pageviews100k)
brc.bar_chart_race(df_pageviews100k_br * 10, n_bars= 15, sort= 'desc', period_length=2800, filename= 'pageviewsbarRace.mp4', filter_column_colors= True, steps_per_period=50, title='Monthly pageviews of Covid-19 related articles by country per 1M inhabitants', title_size= '10')


FixedFormatter should only be used together with FixedLocator



In [107]:
deaths100k_br = get_race_bar_df(deaths100k)
brc.bar_chart_race(deaths100k_br * 10, n_bars= 15, sort= 'desc', period_length=2800, filename= 'deathsbarRace.mp4', filter_column_colors= True, steps_per_period=50, title='Monthly deaths due to Covid-19 by country per 1M inhabitants', title_size= '10')


FixedFormatter should only be used together with FixedLocator



In [108]:
cases100k_br = get_race_bar_df(cases100k)
brc.bar_chart_race(cases100k_br * 10, n_bars= 15, sort= 'desc', period_length=2800, filename= 'casesbarRace.mp4', filter_column_colors= True, steps_per_period=50, title='Monthly Covid-19 cases by country per 1M inhabitants', title_size= '10')


FixedFormatter should only be used together with FixedLocator



Looking at the different bar races, one may deduce the following points:

- COVID cases and deaths are strongly related: an increase in cases leads to an increase in deaths after a short delay.

- Different waves of the pandemic occurred at different times in different countries. European countries had their first wave in March 2020 and then another important wave in November 2020. For Central and West Asian countries such as Kyrgyzstan, Kazakhstan and Israel, the first wave appears later, in July 2020, and for Botswana only in March 2021.

- There is a strong relation between the evolution of the pandemic and the number of pageviews related to it during the first wave, but this relation does not remain constant throughout the pandemic. Italy illustrates this: it entered the top 3 of all three bar races at the same time, then dropped down in each of them.

- Some countries, such as Germany and the Netherlands, remained consistently high in terms of pageviews per million inhabitants despite relatively low numbers of cases and deaths.

- Overall, one can observe that pageviews are more consistent over the whole period.

- China was never among the top 15 countries in terms of pageviews, cases and deaths, possibly due to government censorship of information related to COVID-19.
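
The delay between cases and deaths noted in the first point can be estimated rather than read off the animation. The sketch below is one simple way to do it, assuming two daily series of equal length: it shifts deaths by 0 to 30 days and keeps the shift with the highest correlation. The function `best_lag` and the synthetic wave are illustrative and do not use the notebook's data.

```python
import numpy as np

def best_lag(cases: np.ndarray, deaths: np.ndarray, max_lag: int = 30) -> int:
    """Return the delay (in days) of `deaths` that best correlates with `cases`."""
    correlations = []
    for lag in range(max_lag + 1):
        # Compare cases on day t with deaths on day t + lag
        x = cases[: len(cases) - lag]
        y = deaths[lag:]
        correlations.append(np.corrcoef(x, y)[0, 1])
    return int(np.argmax(correlations))

# Synthetic wave: deaths are a scaled copy of cases delayed by 14 days
days = np.arange(200)
cases = np.exp(-((days - 80) ** 2) / (2 * 15 ** 2))
deaths = 0.02 * np.exp(-((days - 94) ** 2) / (2 * 15 ** 2))

print(best_lag(cases, deaths))
```

On real data the correlation curve is usually flatter around its maximum, so the estimated delay should be read as approximate.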

### Potential relationship between pageviews, deaths and cases <a class="anchor" id="section_1_3"></a>
#### During the beginning of the 1st wave <a class="anchor" id="Subsection_1_1"></a>

In [121]:
#loading the data of interest
_, _, _, df_pageviews_cumul100k = get_pageviews_df(pageview_df, population_df, get_country_dict('original'), '2020-01-22', '2020-05-22')
_, _, _, _, _, deaths100k_cumul, _, cases100k_cumul = get_cases_deaths_df(population_df, get_country_dict('original'), '2020-01-22', '2020-05-22')

To correct the skewed distribution of the data, we take the log of the cumulative pageviews and deaths. This helps to normalize the data and better visualize any trends or patterns.
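
To make the effect of the transformation concrete, the sketch below compares the sample skewness of a heavily right-skewed set of values before and after taking the log. The numbers are made up and the helper `skewness` is written out by hand for the example.

```python
import numpy as np

def skewness(values: np.ndarray) -> float:
    """Sample skewness: the third standardised moment."""
    centred = values - values.mean()
    return float((centred ** 3).mean() / (centred ** 2).mean() ** 1.5)

# Made-up, heavily right-skewed cumulative counts per 100k
raw = np.array([2.0, 5.0, 9.0, 20.0, 45.0, 110.0, 400.0, 3000.0, 38000.0])
logged = np.log(raw)

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

A skewness closer to zero means that a few very large countries no longer dominate the least-squares fit.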

In [122]:
#preprocess the data to fit our regressions
data = pd.DataFrame()
#y of both regressions, log of cumulative pageviews (zeros become NaN so that the log is defined)
data['y'] = np.log(df_pageviews_cumul100k.max().transpose().replace(0, np.nan))
#x of second regression, log of cumulative cases
data['x2'] = np.log(cases100k_cumul.max().transpose().replace(0, np.nan))
#x of first regression, log of cumulative deaths
data['x'] = np.log(deaths100k_cumul.max().transpose().replace(0, np.nan))
#drop countries with a missing value in any of the three variables
data = data.dropna()
data = data.rename(index= {v: k for k, v in get_country_dict('original').items()}).reset_index().rename(columns = {'index': 'Country'})

#fit a linear regression by ordinary least squares (OLS)
model = smf.ols('y ~ x', data=data).fit()
#get model summary
results = model.summary()

print("Our model summary is:")
results

Our model summary is:


Dep. Variable:,y,R-squared:,0.512
Model:,OLS,Adj. R-squared:,0.497
Method:,Least Squares,F-statistic:,34.63
Date:,"Thu, 22 Dec 2022",Prob (F-statistic):,1.35e-06
Time:,19:13:48,Log-Likelihood:,-59.005
No. Observations:,35,AIC:,122.0
Df Residuals:,33,BIC:,125.1
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.4383,0.237,31.445,0.000,6.957,7.920
x,0.7979,0.136,5.885,0.000,0.522,1.074

Omnibus:,6.063,Durbin-Watson:,2.411
Prob(Omnibus):,0.048,Jarque-Bera (JB):,4.561
Skew:,-0.81,Prob(JB):,0.102
Kurtosis:,3.711,Cond. No.,1.88


The model has an R-squared value of 0.512, which means that 51.2% of the variance in the log of cumulative pageviews is explained by the log of cumulative deaths. The coefficient of x is 0.7979: since both variables are logged, a 1% increase in cumulative deaths per 100k is associated with roughly a 0.8% increase in cumulative pageviews per 100k during the first four months of the COVID-19 pandemic. The more deaths a country has, the more its population seeks information about COVID-19 on Wikipedia. Both the intercept and the coefficient are statistically significant (p < 0.001).
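
To see how a slope in a log-log regression translates back to the original scale, the sketch below fits the same kind of line with `numpy.polyfit` on made-up data generated with a known exponent of 0.8, then computes what doubling deaths implies for pageviews. The data and variable names are illustrative and are not the notebook's.

```python
import numpy as np

# Synthetic data following pageviews = 1700 * deaths**0.8 exactly
deaths = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 40.0])
pageviews = 1700 * deaths ** 0.8

# Fit log(pageviews) = intercept + slope * log(deaths)
slope, intercept = np.polyfit(np.log(deaths), np.log(pageviews), deg=1)

# In a log-log model, multiplying deaths by k multiplies pageviews by k**slope
doubling_factor = 2 ** slope
print(round(slope, 3), round(doubling_factor, 3))
```

With a slope below 1, pageviews grow less than proportionally: twice the deaths corresponds to about 1.74 times the pageviews in this example.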

In [123]:
model = smf.ols('y ~ x2', data=data).fit()

results = model.summary()

#Show the results of the linear regression
print("Our model summary is:")
results

Our model summary is:


Dep. Variable:,y,R-squared:,0.373
Model:,OLS,Adj. R-squared:,0.354
Method:,Least Squares,F-statistic:,19.65
Date:,"Thu, 22 Dec 2022",Prob (F-statistic):,9.71e-05
Time:,19:27:58,Log-Likelihood:,-63.387
No. Observations:,35,AIC:,130.8
Df Residuals:,33,BIC:,133.9
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.4021,0.814,5.410,0.000,2.747,6.057
x2,0.8640,0.195,4.433,0.000,0.467,1.261

Omnibus:,7.102,Durbin-Watson:,2.588
Prob(Omnibus):,0.029,Jarque-Bera (JB):,5.666
Skew:,-0.921,Prob(JB):,0.0588
Kurtosis:,3.703,Cond. No.,13.9


The model has an R-squared value of 0.373, which means that 37.3% of the variance in the log of cumulative pageviews is explained by the log of cumulative cases. The coefficient of x2 is 0.864: a 1% increase in cumulative cases per 100k is associated with roughly a 0.86% increase in cumulative pageviews per 100k during the first four months of the COVID-19 pandemic. The more cases a country has, the more its population seeks information about COVID-19 on Wikipedia. Both the intercept and the coefficient are statistically significant (p < 0.001).

Do these relationships persist later in the pandemic? Or do they disappear as people become more informed about COVID-19?

#### At the end of the pandemic <a class="anchor" id="Subsection_1_2"></a>

In [124]:
#load data of interest
_, _, _, df_pageviews_cumul100k = get_pageviews_df(pageview_df, population_df, get_country_dict('original'), '2022-02-01', '2022-07-01')
_, _, _, _, _, deaths100k_cumul, _, cases100k_cumul = get_cases_deaths_df(population_df, get_country_dict('original'), '2022-02-01', '2022-07-01')

In [125]:
#preprocess the data to fit our regressions
data = pd.DataFrame()
#y of both regressions, log of cumulative pageviews (zeros become NaN so that the log is defined)
data['y'] = np.log(df_pageviews_cumul100k.max().transpose().replace(0, np.nan))
#x of second regression, log of cumulative cases
data['x2'] = np.log(cases100k_cumul.max().transpose().replace(0, np.nan))
#x of first regression, log of cumulative deaths
data['x'] = np.log(deaths100k_cumul.max().transpose().replace(0, np.nan))
#drop countries with a missing value in any of the three variables
data = data.dropna()
data = data.rename(index= {v: k for k, v in get_country_dict('original').items()}).reset_index().rename(columns = {'index': 'Country'})


#fit a linear regression by ordinary least squares (OLS)
model = smf.ols('y ~ x', data=data).fit()
#get model summary
results = model.summary()

print("Our model summary is:")
results

Our model summary is:


Dep. Variable:,y,R-squared:,0.271
Model:,OLS,Adj. R-squared:,0.251
Method:,Least Squares,F-statistic:,13.04
Date:,"Thu, 22 Dec 2022",Prob (F-statistic):,0.000947
Time:,19:37:53,Log-Likelihood:,-51.947
No. Observations:,37,AIC:,107.9
Df Residuals:,35,BIC:,111.1
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.4250,0.314,14.081,0.000,3.787,5.063
x,0.3696,0.102,3.611,0.001,0.162,0.577

Omnibus:,3.64,Durbin-Watson:,2.142
Prob(Omnibus):,0.162,Jarque-Bera (JB):,1.596
Skew:,0.01,Prob(JB):,0.45
Kurtosis:,1.983,Cond. No.,6.25


Both the intercept and the coefficient are statistically significant, indicating that a relationship still exists between the two variables. However, compared to the same model at the beginning of the pandemic, the R-squared value is lower (0.271 versus 0.512), as are the intercept (4.43 versus 7.44) and the coefficient (0.37 versus 0.80). This suggests that a given level of cumulative deaths is now associated with fewer pageviews than at the beginning of the pandemic. It is possible that as people have become more familiar with COVID-19 and the measures taken to control its spread, their interest in reading about the topic has decreased. Additionally, the impact of COVID-19 on people's daily lives may have decreased over time, reducing their interest in reading about it. It would be interesting to further explore the reasons behind this change in the relationship between the two variables.

In [126]:
model = smf.ols('y ~ x2', data=data).fit()

results = model.summary()

print("Our model summary is:")
results

Our model summary is:


Dep. Variable:,y,R-squared:,0.405
Model:,OLS,Adj. R-squared:,0.388
Method:,Least Squares,F-statistic:,23.86
Date:,"Thu, 22 Dec 2022",Prob (F-statistic):,2.27e-05
Time:,19:38:30,Log-Likelihood:,-48.188
No. Observations:,37,AIC:,100.4
Df Residuals:,35,BIC:,103.6
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.8711,0.735,2.544,0.016,0.378,3.364
x2,0.4225,0.086,4.885,0.000,0.247,0.598

Omnibus:,1.09,Durbin-Watson:,2.013
Prob(Omnibus):,0.58,Jarque-Bera (JB):,0.925
Skew:,-0.128,Prob(JB):,0.63
Kurtosis:,2.269,Cond. No.,42.1


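The same holds for cases: the intercept and the coefficient are lower than at the beginning of the pandemic, although the R-squared is slightly higher (0.405 versus 0.373, a difference of about 0.03).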

Let's now take a look at the evolution throughout the whole pandemic!

#### Evolution during the whole pandemic <a class="anchor" id="Subsection_1_3"></a>

In [128]:
#load of data of interest
_, _, df_pageviews_100k, df_pageviews_cumul100k = get_pageviews_df(pageview_df, population_df, get_country_dict('original'), '2020-01-22', '2022-07-01')
_, _, _, _, deaths100k, deaths100k_cumul, cases100k, cases100k_cumul = get_cases_deaths_df(population_df, get_country_dict('original'), '2020-01-22', '2022-07-01')

In [129]:
#rename columns to country names and drop the date column
deaths100k = deaths100k.rename(columns= {v: k for k, v in get_country_dict('original').items()}).reset_index().rename(columns = {'index': 'Country'})
deaths100k = deaths100k.drop('date', axis= 1)
df_pageviews_100k = df_pageviews_100k.rename(columns= {v: k for k, v in get_country_dict('original').items()}).reset_index().rename(columns = {'index': 'Country'})
df_pageviews_100k = df_pageviews_100k.drop('date', axis= 1)
#get the 60-day rolling sums of deaths and pageviews, keeping one point every 60 days
df_pageviews_100k_rollingcumul = df_pageviews_100k.rolling(60, min_periods=1).sum()
df_pageviews_100k_rollingcumul = df_pageviews_100k_rollingcumul.loc[df_pageviews_100k_rollingcumul.index % 60 == 59]
df_pageviews_100k_rollingcumul = df_pageviews_100k_rollingcumul.reset_index().drop('index', axis= 1)
deaths100k_rollingcumul = deaths100k.rolling(60, min_periods=1).sum()
deaths100k_rollingcumul = deaths100k_rollingcumul.loc[deaths100k_rollingcumul.index % 60 == 59] #take one point every 2 months
deaths100k_rollingcumul = deaths100k_rollingcumul.reset_index().drop('index', axis= 1)
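
The cell above keeps every 60th value of a 60-day rolling sum, which amounts to summing consecutive, non-overlapping 60-day blocks. The toy check below, on made-up data, illustrates that equivalence; the variable names are chosen for the example only.

```python
import numpy as np
import pandas as pd

window = 60
daily = pd.Series(np.arange(180, dtype=float))  # 180 days of made-up daily values

# Rolling sum sampled at the last day of each block (index 59, 119, 179)
rolling = daily.rolling(window, min_periods=1).sum()
sampled = rolling.loc[rolling.index % window == window - 1].reset_index(drop=True)

# Direct sum over consecutive blocks of 60 days
blocks = daily.groupby(daily.index // window).sum()

print(sampled.tolist())
print(blocks.tolist())
```

Both approaches give one value per two-month block, so each regression point later on is built from data that does not overlap with its neighbours.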

In [130]:
#create data in format for plot
data = pd.DataFrame()
#df used in iteration
data_tmp = pd.DataFrame()
for i in df_pageviews_100k_rollingcumul.index.values:
    if i == 0:
        data['Log of Pageviews'] = np.log(df_pageviews_100k_rollingcumul.transpose()[i] + 0.01) #apply offset to eliminate 0
        data['Log of Deaths'] = np.log(deaths100k_rollingcumul.transpose()[i] + 0.01) #apply offset to eliminate 0
        data['i'] = i
    else:
        data_tmp ['Log of Pageviews'] = np.log(df_pageviews_100k_rollingcumul.transpose()[i] + 0.01) #apply offset to eliminate 0
        data_tmp ['Log of Deaths'] = np.log(deaths100k_rollingcumul.transpose()[i] + 0.01) #apply offset to eliminate 0
        data_tmp ['i'] = i
        data = pd.concat([data, data_tmp], axis= 0)

data = data.reset_index().rename(columns= {'index' : 'country'}).dropna()
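#set the date back for the plot
#note: the blocks are 60 days long, so this 59-day step drifts by one day per period relative to the true end of each block
data['Date'] = data['i'].apply(lambda x:pd.to_datetime('2020-01-22') + datetime.timedelta(days= x * 59 + 59)) #date offset from the data points
data['Date'] = data['Date'].astype(str)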
data.head()

,country,Log of Pageviews,Log of Deaths,i,Date
0,Italy,8.893485,2.095085,0,2020-03-21
1,Russia,8.812605,-4.538064,0,2020-03-21
2,China,6.335957,-1.428455,0,2020-03-21
3,Albania,4.75946,-2.519797,0,2020-03-21
4,Bangladesh,2.609923,-4.490556,0,2020-03-21


In [132]:
#new columns to differentiate country of interest to plot them in different colors in the plot
data['Data points']  = np.where(data['country']== 'Germany', 'Germany', 'Other')
data['Data points']  = np.where(data['country']== 'Bangladesh', 'Bangladesh', data['Data points'])
data['Data points']  = np.where(data['country']== 'Sweden', 'Sweden', data['Data points'])
data['Data points']  = np.where(data['country']==  'Thailand', 'Thailand', data['Data points'])
data.head()

,country,Log of Pageviews,Log of Deaths,i,Date,Data points
0,Italy,8.893485,2.095085,0,2020-03-21,Other
1,Russia,8.812605,-4.538064,0,2020-03-21,Other
2,China,6.335957,-1.428455,0,2020-03-21,Other
3,Albania,4.75946,-2.519797,0,2020-03-21,Other
4,Bangladesh,2.609923,-4.490556,0,2020-03-21,Bangladesh


In [133]:
#initial plot, used only to obtain the OLS regression results
fig = px.scatter(data, x="Log of Deaths", y="Log of Pageviews", animation_frame="Date", hover_name="country", trendline="ols",
 trendline_color_override= 'black', range_x= [-5, 5], range_y=[-5, 12], color="Data points",
        title= format_title("Relationship between Log of cumulative deaths and Log of cumulative pageviews",
         "Each data point represents the log of the 2-month cumulative sum of pageviews and deaths per 100k inhabitants", 10, 14))
#get result of regression
results = px.get_trendline_results(fig)
tab_result = pd.DataFrame(columns= ['Period', 'Intercept', 'Intercept p-value', 'Coefficient', 'Coefficient p-value'])
date_end = '22-01-2020'
#default label, overwritten period by period in the loop below
data['Period'] = data['Date']
#create summary of the models with params and their respective p-values
for i in range(14):
    results1 = results.iloc[i]["px_fit_results"].params
    results2 = results.iloc[i]["px_fit_results"].pvalues
    date_begin = date_end
    date_end = results.iloc[i]["Date"]
    period = "{} to {}".format(date_begin, date_end)
    #label only the rows of period i, so that labels set in earlier iterations are kept
    data.loc[data['i'] == i, 'Period'] = period
    tab_result = pd.concat([tab_result, pd.DataFrame({'Period': [period], 'Intercept': [results1[0]], 'Intercept p-value': [results2[0]],
                 'Coefficient': [results1[1]], 'Coefficient p-value': [results2[1]]})], axis= 0)

#final plot with period as animation
fig = px.scatter(data, x="Log of Deaths", y="Log of Pageviews", animation_frame="Period", hover_name="country", trendline="ols",
 trendline_color_override= 'black', range_x= [-5, 5], range_y=[-5, 12], color="Data points",
        title= format_title("Relationship between Log of cumulative deaths and Log of cumulative pageviews",
         "Each data point represents the log of the 2-month cumulative sum of pageviews and deaths per 100k inhabitants", 10, 14))
fig.update_traces(marker_size=10)
fig.update_layout(height=500, width= 800)
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000
fig.show()
fig.write_html("data/regressionDeathsPageviews.html",default_width= 500, default_height= 800)
tab_result = tab_result.set_index('Period').round(3)
tab_result.to_html('data/table_res_reg.html')
tab_result

Period,Intercept,Intercept p-value,Coefficient,Coefficient p-value
22-01-2020 to 2020-03-21,8.801,0.0,0.81,0.031
2020-03-21 to 2020-05-19,7.115,0.0,0.657,0.0
2020-05-19 to 2020-07-17,6.126,0.0,0.226,0.074
2020-07-17 to 2020-09-14,5.746,0.0,0.227,0.094
2020-09-14 to 2020-11-12,5.432,0.0,0.47,0.0
2020-11-12 to 2021-01-10,5.146,0.0,0.391,0.0
2021-01-10 to 2021-03-10,5.227,0.0,0.294,0.0
2021-03-10 to 2021-05-08,5.114,0.0,0.217,0.007
2021-05-08 to 2021-07-06,4.902,0.0,0.055,0.679
2021-07-06 to 2021-09-03,4.737,0.0,0.055,0.668


Looking at this figure, one notices that the slope of the line decreases over time. It goes from 0.81 (p-value: 0.03) to 0.055 (p-value: 0.67). These values show that the relation between COVID-19 deaths and Wikipedia pageviews was strongest early in the pandemic. With the arrival of new waves it weakens, and the p-value of 0.67 means that we cannot reject the hypothesis of a zero slope in the middle of 2021. A plausible explanation is that people were already well informed about the disease.
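
The per-period coefficients in the table above are read from Plotly's trendline results. A similar per-period fit can be sketched without Plotly by grouping on the period and fitting a line within each group. The column names below match the notebook's, but the data is synthetic and the helper `slope_per_period` is hypothetical.

```python
import numpy as np
import pandas as pd

def slope_per_period(df: pd.DataFrame) -> pd.Series:
    """Fit Log of Pageviews ~ Log of Deaths separately for each period."""
    slopes = {}
    for period, group in df.groupby("Period", sort=False):
        slope, _intercept = np.polyfit(group["Log of Deaths"], group["Log of Pageviews"], deg=1)
        slopes[period] = slope
    return pd.Series(slopes, name="Coefficient")

# Synthetic data: a steep relationship early on, an almost flat one later
x = np.linspace(-2.0, 3.0, 20)
early = pd.DataFrame({"Period": "early", "Log of Deaths": x, "Log of Pageviews": 8.8 + 0.8 * x})
late = pd.DataFrame({"Period": "late", "Log of Deaths": x, "Log of Pageviews": 4.7 + 0.05 * x})
toy = pd.concat([early, late], ignore_index=True)

slopes = slope_per_period(toy)
print(slopes.round(3))
```

This sketch returns slopes only; p-values such as those in the table require a statistical fit, for example with `statsmodels`.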

Another point shown on the animated plot is how the position of each country relative to the regression line evolves. Some countries, such as Germany, stay above it all the time; others, such as Bangladesh, stay below it; and most of them change position over time. This highlights the interest, or even the fear, of the population towards COVID-19. The first group searches for more information on COVID-19 than average for a given number of deaths. The second group, on the other hand, shows less interest in the disease. Finally, the last group's position changes with the different waves in each country.

Finally, we notice an increase of the slope of the regression line at the end of 2021 which coincides with the arrival of the Omicron variant and the third dose of vaccine in Europe.

In [109]:
from plotly.subplots import make_subplots
df_pageviews_cumulsumed = df_pageviews.sum(axis= 1).to_frame().rolling(7, min_periods=1).sum().rename(columns={0: 'Cumulative pageviews'}).reset_index()
deaths_cumulsumed = deaths.sum(axis= 1).to_frame().rolling(7, min_periods=1).sum().rename(columns={0: 'Cumulative deaths'}).reset_index()
df_correlation = df_pageviews.sum(axis= 1).rolling(60, min_periods=60).corr(deaths.sum(axis= 1)).to_frame().rename(columns={0: '2 months rolling correlation'}).reset_index()
fig = make_subplots(rows=3, cols=1, subplot_titles=(f'<span style="font-size: 12px;">7-day rolling sum of the COVID-19 related deaths in considered countries</span>',
         f'<span style="font-size: 12px;">7-day rolling sum of the COVID related articles pageviews in considered countries</span>',
          f'<span style="font-size: 12px;">Two months rolling correlation between deaths and pageviews</span>'))
fig.add_trace(go.Scatter(x= deaths_cumulsumed['date'], y= deaths_cumulsumed['Cumulative deaths'], name= 'Weekly rolling deaths'), row= 1, col= 1)
fig.add_trace(go.Scatter(x= df_pageviews_cumulsumed['date'], y= df_pageviews_cumulsumed['Cumulative pageviews'], name= 'Weekly rolling pageviews'), row= 2, col= 1)
fig.add_trace(go.Scatter(x= df_correlation['date'], y= df_correlation['2 months rolling correlation'], name= 'Two months rolling correlation'), row= 3, col= 1)
fig.update_layout(height=700, width= 900, 
    title_text= f'<span style="font-size: 16px;"><b>Plots of the physical and digital propagation of COVID-19 and their correlation</b></span>', 
    xaxis3_title = 'Date',
    yaxis1_title = 'Deaths',
    yaxis2_title = 'Pageviews',
    yaxis3_title = 'Corr')
fig.show()
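
The third panel relies on pandas' rolling correlation. The sketch below shows how `rolling(...).corr(...)` behaves on synthetic data in which pageviews follow deaths during the first 60 days and move in the opposite direction afterwards. The series are made up and only meant to illustrate the call.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-22", periods=120, freq="D")
rng = np.random.default_rng(0)
deaths = pd.Series(rng.poisson(50, size=120).astype(float), index=dates)

# First 60 days: pageviews track deaths; last 60 days: they move opposite to deaths
first_half = np.arange(120) < 60
pageviews = pd.Series(np.where(first_half, 100.0 * deaths, 10_000.0 - 100.0 * deaths), index=dates)

# Pearson correlation over a trailing 60-day window; NaN until 60 days are available
rolling_corr = pageviews.rolling(60, min_periods=60).corr(deaths)

print(rolling_corr.iloc[58], rolling_corr.iloc[59], rolling_corr.iloc[119])
```

Because `min_periods=60`, the first 59 values are missing, which is why a correlation curve computed this way starts about two months after the first date.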

## Trust and Mobility <a class="anchor" id="chapter2"></a>
### Mobility <a class="anchor" id="section_2_1"></a>

During the pandemic, people engaged in a variety of activities in response to the spread of COVID-19. As the pandemic unfolded, many people turned to Wikipedia and other sources of information to learn more about the virus and the measures being taken to combat it. Some people may have reduced their outings in order to reduce the risk of transmission, while others may have continued to go out as usual, either due to a sense of trust in their government's handling of the crisis or for other reasons. Let's examine how mobility was affected by the COVID-19 pandemic.

Here we extract the mobility of all countries and separate them into two categories:
- Moving: which accounts for movement in Retail and Recreation, Parks, Transit stations, and Workplaces.
- Covid: which accounts for movement in Grocery and Pharmacy and Residential.

The metric is a change in mobility from the baseline in percentage, meaning that positive values represent more mobility than normal and negative values represent the inverse. The baseline is defined as the period before the COVID-19 pandemic.

We will focus on the moving category and calculate the median of all countries per day.
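
The two steps described above, expressing mobility as a percentage change from a baseline and taking the median across countries for each day, can be sketched as follows. The visit counts, the baseline value and the function `percent_change_from_baseline` are made up for illustration; the notebook's mobility data is already expressed as percentage changes.

```python
import pandas as pd

def percent_change_from_baseline(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline` (negative means less mobility)."""
    return (value - baseline) * 100.0 / baseline

# Hypothetical visit counts: 100 visits on a baseline day, 30 during a lockdown
print(percent_change_from_baseline(30, 100))  # -70.0

# Made-up long-format data: percentage change from baseline per country and day
mobility = pd.DataFrame({
    "date": ["2020-03-01"] * 3 + ["2020-03-20"] * 3,
    "country_region": ["Italy", "Sweden", "Germany"] * 2,
    "moving category": [-5.0, 2.0, 0.0, -70.0, -20.0, -45.0],
})

# Median across countries for each day
median_mobility = mobility.groupby("date")["moving category"].median()
print(median_mobility)
```

The median is less sensitive than the mean to a single country with an extreme change, which suits a summary over a few dozen countries.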

In [110]:
#Load mobility data 

mobility = get_mobility_df(get_country_dict('trust gov mobility'))
moving_mobility_df = mobility['moving category'].reset_index()

# Select countries that appear in both the trust and mobility dataset
moving_mobility_df['country_region'] =moving_mobility_df['country_region'].map(get_country_dict('trust gov mobility'))

# Calculate median moving mobility on each day and save in a dataframe
median_mobility_df = moving_mobility_df.groupby('date')['moving category'].median().to_frame().reset_index()
median_mobility_df = median_mobility_df.rename(columns={'moving category':'Median mobility change from baseline', 'date': 'Date'})


Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.



Let's plot the time series of the median mobility over the six-month period. 

In [111]:
fig = px.line(median_mobility_df, x='Date', y='Median mobility change from baseline', title='Median mobility change from baseline over time<br><sup>The median mobility change from baseline was calculated over 30 countries</sup>')
fig.show()

We can see that there is a sharp drop in median mobility in the middle of March 2020, exactly when lockdowns were implemented in Europe. The curve then slowly increases until June, after which mobility rises above the baseline, meaning that people are going out more than usual. This may be due to a combination of factors, such as a desire to socialize after months of staying indoors and the natural tendency to go out more and be more social during the summer months.

### Mobility and death and pageviews <a class="anchor" id="section_2_2"></a>

Now let's analyse the changes in mobility over the six-month period with respect to COVID deaths and pageviews, to see how these are related. We also introduce the Trust dataset, which contains the percentage of people in each country who trust their government. To visualize this, we create a bubble scatter plot.
We first load the pageviews, deaths, and trust data.


In [112]:
# Load and preprocess pageviews and deaths 
pageview_df = pd.read_csv("page_views_covid_related.csv.gz")
population_df = pd.read_csv("Population_countries.csv")

_,_, df_pageviews100k,_ = get_pageviews_df(pageview_df, population_df, get_country_dict('trust gov mobility'), '2020-01-22', '2020-11-22')
_, _, _, _, deaths100k, _, _,_ = get_cases_deaths_df(population_df, get_country_dict('trust gov mobility'), '2020-01-22', '2020-11-22')

# Transpose dataframes, reshape them to long format and rename columns
df_pageviews100k = df_pageviews100k.transpose().stack().to_frame().reset_index().rename(columns={"level_0": "country", "date": "date", 0:"pageviews per 100k"}, errors="raise")
deaths100k = deaths100k.transpose().stack().to_frame().reset_index().rename(columns={"level_0": "country", "date": "date", 0:"deaths per 100k"}, errors="raise")

# Merge pageviews and deaths onto df_animation which will contain columns : ['Month-Year','Country',	'Cumulative pageviews per 100k','Cumulative deaths per 100k','Mobility change from baseline','Trust']
df_animation = df_pageviews100k.merge(deaths100k, on=['country','date'])
df_animation['date'] = pd.to_datetime(df_animation['date'])
df_animation=df_animation.rename(columns={"country": "Country"})

In [113]:
# Load and preprocess the trust dataset
data_folder = 'data_2/'
df_trust_gov = pd.read_csv(data_folder+'share-who-trust-government.csv.zip') 
df_trust_gov = df_trust_gov.set_index("Entity")[["Trust the national government in this country"]].transpose()[COUNTRY_OWN_LANG_TRUST_GOV.keys()].rename(columns= COUNTRY_OWN_LANG_TRUST_GOV)
country_dict = get_country_dict('trust gov mobility')

# Map the trust category to the countries in df_animation
trust_dict =df_trust_gov.to_dict('index')
df_animation['Trust'] = df_animation['Country'].map(trust_dict['Trust the national government in this country'])
df_animation.head()

Unnamed: 0,Country,date,pageviews per 100k,deaths per 100k,Trust
0,it,2020-01-22,40.388883,0.0,52.3
1,it,2020-01-23,46.878422,0.0,52.3
2,it,2020-01-24,38.747154,0.0,52.3
3,it,2020-01-25,54.543748,0.0,52.3
4,it,2020-01-26,55.928115,0.0,52.3


In [114]:
# Monthly aggregation applied to each (Month-Year, Country) group
def f(x):
    d = {}
    d['Cumulative pageviews per 100k'] = x['pageviews per 100k'].sum()  # total pageviews in the month
    d['Cumulative deaths per 100k'] = x['deaths per 100k'].sum()  # total deaths in the month
    d["Mobility change from baseline"] = x['moving category'].mean()  # mean mobility change in the month

    return pd.Series(d, index=['Cumulative pageviews per 100k','Cumulative deaths per 100k', "Mobility change from baseline"])
    
# Add mobility to df_animation
# merge dataframe with the mobility dataset
df_animation = df_animation.merge(moving_mobility_df, left_on=['Country','date'],  right_on=['country_region','date'])
df_animation = df_animation.drop(columns=['country_region','country_region_code'])
df_animation['date']=df_animation['date'].astype(str)

# Rename the date to make it clearer
df_animation['Month-Year'] = pd.to_datetime(df_animation['date']).dt.to_period('M') 
df_animation['Month-Year']=df_animation['Month-Year'].astype(str)

# Rename the country codes to country names
inv_map = {v: k for k, v in get_country_dict('trust gov mobility').items()}
df_animation['Country'] = df_animation['Country'].map(inv_map)

# Apply the aggregation function f defined above to each (Month-Year, Country) group
grouped_df = df_animation.groupby(['Month-Year','Country']).apply(f)
grouped_df = grouped_df.reset_index()
grouped_df =grouped_df.merge(df_animation[['Country','Month-Year','Trust']], how='left', on=['Month-Year','Country']).drop_duplicates()

# Change trust to value between 0 and 1
grouped_df["Trust"] = grouped_df["Trust"]/100

df = grouped_df.round(decimals=3) # Round dataset values to 3 decimal places

grouped_df.head()

Unnamed: 0,Month-Year,Country,Cumulative pageviews per 100k,Cumulative deaths per 100k,Mobility change from baseline,Trust
0,2020-02,Bangladesh,0.556806,0.0,5.25,0.83
15,2020-02,Bulgaria,161.407208,0.0,5.097669,0.404
30,2020-02,Croatia,171.62918,0.0,8.503883,0.42
45,2020-02,Czechia,1925.544347,0.0,6.41,0.458
60,2020-02,Denmark,275.594008,0.0,-1.578199,0.793
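
As a side note, the same monthly aggregation can be written with pandas named aggregation instead of `groupby(...).apply(f)`. The sketch below runs on a small made-up dataframe (the countries `A` and `B` and all values are hypothetical) and assumes that `Trust` is constant within each country, so that taking the first value per group is enough and the follow-up merge and `drop_duplicates` are not needed.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df_animation
toy = pd.DataFrame({
    "Month-Year": ["2020-03", "2020-03", "2020-04", "2020-03"],
    "Country": ["A", "A", "A", "B"],
    "pageviews per 100k": [10.0, 30.0, 5.0, 7.0],
    "deaths per 100k": [1.0, 2.0, 0.5, 0.0],
    "moving category": [-20.0, -40.0, -10.0, -5.0],
    "Trust": [0.5, 0.5, 0.5, 0.8],
})

# One output column per (input column, aggregation function) pair
monthly = (
    toy.groupby(["Month-Year", "Country"])
    .agg(**{
        "Cumulative pageviews per 100k": ("pageviews per 100k", "sum"),
        "Cumulative deaths per 100k": ("deaths per 100k", "sum"),
        "Mobility change from baseline": ("moving category", "mean"),
        "Trust": ("Trust", "first"),  # assumes one trust value per country
    })
    .reset_index()
)
print(monthly)
```

Named aggregation avoids calling a Python function once per group, which can matter on larger datasets, and keeps the output column names next to the operations that produce them.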


Let's plot the bubble scatter plot:

In [115]:
fig = px.scatter(df,x="Mobility change from baseline", y="Cumulative deaths per 100k" , animation_frame="Month-Year", animation_group="Country",
           size="Cumulative pageviews per 100k" ,color="Trust", hover_name="Country", range_x=[-80,60],range_y=[-20,40], size_max=80, template='plotly_dark' )
      
fig.update_layout(height=500, width= 900,title_text="Cumulative deaths per 100k vs. Mobility change from baseline, 2020")
fig.update_layout(margin=dict(l=40, r=10, t=125, b=10))  
fig.add_annotation(xref='paper',yref='paper',x=0, y=1.28,showarrow=False,text = "The x-axis shows the mean mobility change from baseline per month.")
fig.add_annotation(xref='paper',yref='paper',x=0, y=1.22,showarrow=False,text = "The y-axis shows the cumulative deaths per 100k per month.")
fig.add_annotation(xref='paper',yref='paper',x=0, y=1.16,showarrow=False,text = "The size of the bubble shows the cumulative pageviews per 100k per month ")
fig.add_annotation(xref='paper',yref='paper',x=0, y=1.10,showarrow=False,text = "The colour of the bubble corresponds to how much people trust their government.")
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000
fig.show()

fig.write_html('data_2/Images/bubble_plot.html')

We can clearly see the bubbles moving up and expanding as they move to the left. In other words, as COVID-related deaths rise, COVID-related pageviews rise and overall mobility falls. We also see the reverse effect as we reach July. Additionally, in March and April, countries with lower levels of trust in their government show a larger drop in mobility from the baseline, indicating that people there are going out less. Let's explore this in further detail.

### Mobility and trust <a class="anchor" id="Section_2_3"></a>
Let's look at the relationship between mobility and trust. For mobility, we will use the minimum mobility value, as this is where we saw the most disparities in the plot above.

In [117]:
# Create mobility_trust dataframe 
mobility_trust = mobility['moving category'].reset_index()
mobility_trust['country_region'] =mobility_trust['country_region'].map(get_country_dict('trust gov mobility'))
mobility_trust = mobility_trust.groupby('country_region')['moving category'].min().to_frame()
mobility_trust = mobility_trust.reset_index()
mobility_trust['Trust the national government in this country'] = mobility_trust['country_region'].map(trust_dict['Trust the national government in this country'])
mobility_trust['country_region'] = mobility_trust['country_region'].map(inv_map)
mobility_trust = mobility_trust.rename(columns={"moving category": "Min mobility change from baseline","country_region": "Country"})
mobility_trust.head()

Unnamed: 0,Country,Min mobility change from baseline,Trust the national government in this country
0,Bulgaria,-62.327553,40.4
1,Bangladesh,-61.75,83.0
2,Czechia,-65.633333,45.8
3,Denmark,-35.839475,79.3
4,Germany,-56.426471,82.0


In [118]:
fig = px.scatter(mobility_trust, x='Trust the national government in this country', y="Min mobility change from baseline",text="Country",
     trendline="ols", title='Relationship between min mobility change from baseline and trust in the national government in a country, 2020')
fig.update_traces(textposition="bottom right")
fig.update_layout(margin=dict(l=20, r=20, t=100, b=0))  
fig.add_annotation(xref='paper',yref='paper',x=0, y=1.13,showarrow=False,text = "The x-axis shows the trust in the national government in a country.")
fig.add_annotation(xref='paper',yref='paper',x=0, y=1.09,showarrow=False,text = "The y-axis shows the minimum mobility change from baseline over the whole period.")
fig.update_layout(yaxis_ticksuffix = "%")
fig.update_layout(xaxis_ticksuffix = "%")
fig.show()

The plot above shows the OLS regression line of minimum mobility change on trust in the national government.
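
For readers who want to see what an OLS trendline computes, here is a minimal sketch of the closed-form fit for a single predictor. It uses only the five rows printed by `mobility_trust.head()` above, so its slope and intercept will differ from the notebook's fit on all countries; the helper name `ols_line` is hypothetical.

```python
# Five rows copied from mobility_trust.head(); the notebook's fit uses all countries
trust = [40.4, 83.0, 45.8, 79.3, 82.0]
min_mobility = [-62.327553, -61.75, -65.633333, -35.839475, -56.426471]


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = ols_line(trust, min_mobility)
print(f"Intercept: {intercept:.3f}, slope: {slope:.3f}")
```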

In [120]:
# Linear regression summary
result = px.get_trendline_results(fig)
result_pval = result.iloc[0]['px_fit_results'].pvalues
result_coeff = result.iloc[0]['px_fit_results'].params
print('Regression Summary: ')
print('Intercept: {}, p-value: {}'.format(result_coeff[0],result_pval[0]))
print('Coefficient: {}, p-value: {}'.format(result_coeff[1],result_pval[1]))

Regression Summary: 
Intercept: -89.95318296260504, p-value: 8.413095308649931e-10
Coefficient: 0.48095989431734365, p-value: 0.007790561016220862


Interpretation:

The results of this analysis show a statistically significant positive relationship (p ≈ 0.008) between trust in the government and minimum mobility change during the COVID-19 pandemic: the higher the trust, the smaller the largest drop in mobility. Specifically, the fitted line implies that a hypothetical country with 0% trust in the government would have a minimum mobility change of about -90%. Additionally, for every 10 percentage point increase in trust in the government, we would expect the minimum mobility change to be about 4.8 percentage points higher (that is, less negative).

This suggests that populations that trust their government are more likely to maintain their mobility and lifestyle during the pandemic, potentially because they feel more confident and less scared of the virus.
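
To make the interpretation concrete, the fitted line from the regression summary above can be evaluated at a few trust levels. This is only a sketch: the intercept and coefficient are copied from the printed output, while the function name `predicted_min_mobility` and the chosen trust levels are illustrative assumptions.

```python
# Values copied from the regression summary printed above
INTERCEPT = -89.95318296260504
SLOPE = 0.48095989431734365


def predicted_min_mobility(trust_percent):
    """Predicted minimum mobility change (in %) for a given trust level (in %)."""
    return INTERCEPT + SLOPE * trust_percent


for trust in (0, 40, 80):
    print(f"trust {trust}% -> predicted min mobility {predicted_min_mobility(trust):.1f}%")
# trust 0% -> predicted min mobility -90.0%
# trust 40% -> predicted min mobility -70.7%
# trust 80% -> predicted min mobility -51.5%
```

The difference between two predictions that are 10 percentage points of trust apart is always 10 times the coefficient, that is about 4.8 percentage points.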