# Milestone 3 research and data visualisation

### Table of Contents

* [Digital Propagation](#chapter1)
    * [Geographic digital propagation of COVID-19 during the 1st wave](#section_1_1)
    * [Loading the data](#section_1_2)
    * [Time series analysis](#Section_1_3)
    
        * [Check for stationarity](#section_1_3_1)
        * [Autocorrelation](#section_1_3_2)
        * [Decomposition](#section_1_3_3)
        
    * [Google mobility data](#Section_1_4) 
    
        * [Data processing](#section_1_4_1)
        * [Analysis per country](#section_1_4_2)
        
        
* [COVID-19 dataset](#chapter2)
    * [Downloading the data](#section_2_1)
    * [Overview of the data](#section_2_2)
    * [Time series analysis](#Section_2_3)
        * [Check for stationarity](#section_2_3_1)
        * [Autocorrelation](#section_2_3_2)
        * [Decomposition](#section_2_3_3)
        
        
* [Pearson Correlation](#chapter3)      


* [Trust dataset](#chapter4)  
    * [Visualizing Government trust](#section_4_1)
    * [Visualizing Trust in Journalists](#section_4_2)
    * [Visualizing Trust in Science](#section_4_3)
    * [Analysis](#section_4_4)


* [Clustering](#chapter5) 

We import all the libraries needed to compute and plot our results.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from scipy import stats
from helper import *
from scipy.stats.mstats import gmean
# Import the remaining packages
import datetime
import math
import json
import zipfile  
import ssl
from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm
import plotly.express as px
import bar_chart_race as brc

# To download data
import requests
import io
import gzip

# To create the map charts
import iso3166
import plotly
from iso3166 import countries
import plotly.graph_objects as go

## Digital propagation <a class="anchor" id="chapter1"></a>
### Geographic digital propagation of COVID-19 during the 1st wave <a class="anchor" id="section_1_1"></a>

We first load the raw data for pageviews and population, and then clean it to obtain a dataframe of pageviews, cumulative pageviews, pageviews per 100,000 inhabitants, and cumulative pageviews per 100,000 inhabitants using the `get_pageviews_df` function.

For cases and deaths due to COVID-19, we do the same, but we obtain the raw data using URLs with the `get_cases_deaths_df` function.

For our initial analysis, we consider the data from the start of the COVID-19 pandemic, from **January 22, 2020** to **July 31, 2020**. We want to analyse the digital propagation of COVID-19 geographically. Due to the limitations of the library used to create the map charts, we restrict this analysis to the 1st wave.

In [2]:
# Load the raw dataframes from the CSV files
pageview_df = pd.read_csv("page_views_covid_related.csv.gz")
population_df = pd.read_csv("Population_countries.csv")
# Get the cleaned, cumulative and per-100k-inhabitants dataframes for pageviews, COVID-19 cases and deaths
df_pageviews, df_pageviews_cumul, df_pageviews100k, df_pageviews_cumul100k = get_pageviews_df(pageview_df, population_df, get_country_dict('original'), '2020-01-22', '2020-09-01')
deaths, cases, deaths_cumul, cases_cumul, deaths100k, deaths100k_cumul, cases100k, cases100k_cumul = get_cases_deaths_df(population_df, get_country_dict('original'), '2020-01-22', '2020-09-01')
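
To make the four kinds of dataframe concrete, here is a minimal sketch of how cumulative and per-100k values can be derived from daily counts. It is not the code of `get_pageviews_df` (which lives in `helper.py` and is not shown here); the names `daily` and `population` and all the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical daily counts, one column per country code, indexed by date.
daily = pd.DataFrame(
    {"it": [10, 20, 30], "fr": [5, 0, 15]},
    index=pd.date_range("2020-01-22", periods=3, freq="D"),
)
# Hypothetical populations, keyed like the columns of `daily`.
population = pd.Series({"it": 60_000_000, "fr": 65_000_000})

# Running total of the daily counts.
cumulative = daily.cumsum()
# Daily counts normalised to 100,000 inhabitants (each column divided by its population).
per_100k = daily.div(population, axis="columns") * 100_000
# Running total of the normalised counts.
cumulative_per_100k = per_100k.cumsum()
```

Normalising by population before accumulating keeps countries of very different sizes comparable on the same colour scale.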

Here, we create a dictionary `o_country_dict` that maps country names to their language codes using the `get_country_dict` function, then build a reversed version, `inv_o_country_dict`, that maps language codes back to country names. We also define a dictionary `other_country_name` listing the countries whose common names differ from their official ISO 3166 names, so that we can use them later for our map chart plots.

In [3]:
# Get a dictionary mapping country names to their language codes
o_country_dict = get_country_dict('original')

# Reverse the dictionary to map language codes to country names
inv_o_country_dict = {v: k for k, v in o_country_dict.items()}

# Add exceptions for countries whose common name differs from the official ISO 3166 name
other_country_name = {
    "Russia": "Russian Federation",
    "Turkey": "Türkiye",
    "Vietnam": "Viet Nam",
    "South Korea": "Korea, Republic of"
}
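
As an illustration of how the reversed dictionary and the name exceptions can work together, here is a small sketch. The mapping `country_to_lang` is a made-up stand-in for `get_country_dict('original')`, and `official_name` is a hypothetical helper, not a function from `helper.py`. Note that inverting a dictionary only works cleanly when every language code appears once.

```python
# Hypothetical stand-in for get_country_dict('original').
country_to_lang = {"Italy": "it", "Russia": "ru", "Vietnam": "vi"}

# Reverse the mapping: language code -> country name (assumes the codes are unique).
lang_to_country = {lang: name for name, lang in country_to_lang.items()}

# Countries whose common name differs from the official name.
name_overrides = {"Russia": "Russian Federation", "Vietnam": "Viet Nam"}


def official_name(lang_code):
    """Return the official country name for a language code (hypothetical helper)."""
    name = lang_to_country[lang_code]
    # Fall back to the common name when there is no exception for it.
    return name_overrides.get(name, name)


names = [official_name(code) for code in ("it", "ru", "vi")]
```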

The following cells create map charts from the cumulative deaths (`deaths100k_cumul`), the cumulative pageviews (`df_pageviews_cumul100k`) and the cumulative cases (`cases100k_cumul`), each per 100k inhabitants. For each of them, `mapcharts_df` combines the data with the country dictionary `o_country_dict` into a map chart dataframe, and `mapcharts` plots it, taking the map chart dataframe, the data column (e.g. `'deaths'`) and the hover column `'country'`. The resulting maps show, for each country, the cumulative number of COVID-19 deaths, pageviews and cases per 100k inhabitants.

In [4]:
# Create a df for the mapchart using the cumulative death data
deaths_mapchart = mapcharts_df(deaths100k_cumul, o_country_dict, 'deaths')
# Create map chart
mapcharts(deaths_mapchart, 'deaths', 'country', """Number of cumulative COVID-19 deaths per 100k inhabitants""",
'The colour of the country corresponds to how many deaths per 100k inhabitants were recorded from 22-01-2020.', 15, 10, 'Reds')

In [5]:
# Create a df for the mapchart using the cumulative pageviews data
pageviews_mapchart = mapcharts_df(df_pageviews_cumul100k, o_country_dict, 'pageviews')
# Create map chart
mapcharts(pageviews_mapchart, 'pageviews', 'country', """Number of cumulative pageviews per 100k inhabitants""",
'The colour of the country corresponds to how many pageviews per 100k inhabitants were recorded from 22-01-2020.', 15, 10, 'ylorbr')

In [6]:
# Create a df for the mapchart using the cumulative cases data
cases_mapchart = mapcharts_df(cases100k_cumul, o_country_dict, 'Cases')
# Create map chart
mapcharts(cases_mapchart, 'Cases', 'country', 'Number of cumulative cases per 100000 inhabitants and per country')


### Loading the data <a class="anchor" id="section_1_2"></a>

We reload the pageviews, cases and deaths data with `get_pageviews_df` and `get_cases_deaths_df`, exactly as above, but over a longer period.

For the rest of our analysis, we consider the data from the start of the COVID-19 pandemic, from **January 22, 2020** to **July 31, 2022**.

In [8]:
df_pageviews, df_pageviews_cumul, df_pageviews100k, df_pageviews_cumul100k = get_pageviews_df(pageview_df, population_df, get_country_dict('original'), '2020-01-22', '2022-07-31')
deaths, cases, deaths_cumul, cases_cumul, deaths100k, deaths100k_cumul, cases100k, cases100k_cumul = get_cases_deaths_df(population_df, get_country_dict('original'), '2020-01-22', '2022-07-31')

In [None]:
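def get_race_bar_df(df):
    # Rename the language-code columns to country names and turn the index into a column
    df_br = df.rename(columns= {v: k for k, v in get_country_dict('original').items()}).reset_index().rename(columns = {'index': 'Country'})
    # Keep only the numeric country columns
    df_br = df_br.drop('date', axis= 1)
    # Sum over a 14-day rolling window, then keep one row per complete 14-day block
    df_br = df_br.rolling(14, min_periods=1).sum()
    df_br= df_br.loc[df_br.index % 14 == 13].reset_index()
    # Convert the day offsets back to dates, counted from 2020-01-22
    df_br.index= df_br['index'].apply(lambda x: pd.to_datetime('2020-01-22') + datetime.timedelta(days= x))
    df_br = df_br.drop('index', axis= 1)
    return df_br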
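
For comparison, the same kind of biweekly aggregation can be sketched with `resample` when the dataframe keeps its date index. This is an alternative under stated assumptions, not the function used in this notebook: `biweekly_sums` and the toy data are hypothetical. Unlike `get_race_bar_df`, `resample` labels each window by its first day and keeps a trailing incomplete window.

```python
import pandas as pd


def biweekly_sums(daily: pd.DataFrame) -> pd.DataFrame:
    """Sum daily values over consecutive, non-overlapping 14-day windows."""
    return daily.resample("14D").sum()


# Toy data: 28 days starting on 2020-01-22, i.e. exactly two 14-day windows.
dates = pd.date_range("2020-01-22", periods=28, freq="D")
daily = pd.DataFrame({"A": 1.0, "B": range(28)}, index=dates)

biweekly = biweekly_sums(daily)

# The rolling-window approach: a 14-day rolling sum, keeping every 14th row.
rolled = daily.rolling(14, min_periods=1).sum().iloc[13::14]
```

On complete windows both approaches give the same totals; they differ only in the date used to label each window (first day for `resample`, last day for the rolling version).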

In [9]:

df_pageviews100k_br = get_race_bar_df(df_pageviews100k)
brc.bar_chart_race(df_pageviews100k_br, n_bars= 15, sort= 'desc', period_length=2000, filename= 'pageviewsbarRace.mp4', filter_column_colors= True, steps_per_period=40, title='Biweekly pageviews of Covid-19 related articles by country per 100k inhabitants', title_size= '10')


In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`


FixedFormatter should only be used together with FixedLocator





In [10]:
deaths100k_br = get_race_bar_df(deaths100k)
brc.bar_chart_race(deaths100k_br, n_bars= 15, sort= 'desc', period_length=2000, filename= 'deathsbarRace.mp4', filter_column_colors= True, steps_per_period=50, title='Biweekly deaths due to Covid-19 by country per 100k inhabitants', title_size= '10')





In [None]:
cases100k_br = get_race_bar_df(cases100k)
brc.bar_chart_race(cases100k_br, n_bars= 15, sort= 'desc', period_length=2000, filename= 'casesbarRace.mp4', filter_column_colors= True, steps_per_period=50, title='Biweekly Covid-19 cases by country per 100k inhabitants', title_size= '10')

