# Individual Research Project: Examining the Relationship Between Search Phrase Popularity and COVID-19 Statistics
## Marta Kacperska 

---
### Links to the datasets

* WHO data on global COVID-19 cases and deaths - https://data.humdata.org/dataset/coronavirus-covid-19-cases-and-deaths
* COVID-19 vaccination data - https://data.humdata.org/m/dataset/covid-19-vaccinations
* COVID-19 testing data for Europe - https://www.ecdc.europa.eu/en/publications-data/covid-19-testing
* Unified search popularity data for the phrases covid, quarantine, covid testing, respiratory disease, vaccination - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid,quarantine,vaccination,respiratory%20disease,covid%20testing&hl=pl
* Unified search popularity data for the phrases covid, pneumonia, asthma, lung cancer, cystic fibrosis - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid,pneumonia,asthma,lung%20cancer,cystic%20fibrosis&hl=pl
* Search popularity data for the phrase covid - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid&hl=pl
* Search popularity data for the phrase quarantine - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=quarantine&hl=pl
* Search popularity data for the phrase respiratory disease - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=respiratory%20disease&hl=pl
* Search popularity data for the phrase covid testing - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid%20testing&hl=pl
* Search popularity data for the phrase vaccination - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=vaccination&hl=pl
* Search popularity data for the phrase pneumonia - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=pneumonia&hl=pl
* Search popularity data for the phrase asthma - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=asthma&hl=pl
* Search popularity data for the phrase lung cancer - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=lung%20cancer&hl=pl
* Search popularity data for the phrase cystic fibrosis - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=cystic%20fibrosis&hl=pl
---

## Preparing the data for analysis

In [1]:
import numpy as np
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta

### describeDataSet
- prints the number of objects, the number of attributes, and the number of missing values
- if the classAttr parameter is given, the function also prints the number of classes and the number of observations in each class
- the function takes a data frame and, optionally, a decision attribute

In [2]:
def describeDataSet(frame, classAttr=None):
    numOfObjects, numOfAttributes = frame.shape
    print("Number of objects in dataset: {0}".format(numOfObjects))
    print("Number of attributes in dataset: {0}".format(numOfAttributes))

    if classAttr is not None:
        classes = frame[classAttr].unique()
        print("Number of classes: {0}".format(len(classes)))

        obsPerClass = frame[classAttr].value_counts()
        obsPerClass = obsPerClass.sort_index()
        print("Number of observations per class: \n{0}".format(obsPerClass))

    print("Number of fields with missing data: {0}".format(pd.isnull(frame).sum().sum()))

### countNullsInColumns
- the function returns a dictionary mapping column names to the number of missing values found in each of them
- if a column contains no NaN values, the function prints a message saying so
- the function takes a data frame

In [3]:
def countNullsInColumns(frame):
    nulls = pd.isnull(frame)
    columns = {}
    for col in nulls.columns:
        # count the True entries directly instead of relying on an exception
        nullCount = int(nulls[col].sum())
        if nullCount > 0:
            columns[col] = nullCount
        else:
            print("There is no null in column {0}".format(col))
    return columns 
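
The same per-column counts can also be obtained without an explicit loop. The sketch below is an alternative, not the function used in this notebook; the name `nullCountsPerColumn` and the sample frame are made up for illustration, and the variant skips the printed messages.

```python
import numpy as np
import pandas as pd

def nullCountsPerColumn(frame):
    # isna().sum() counts the missing values in every column at once
    counts = frame.isna().sum()
    # keep only the columns that actually contain missing values
    return counts[counts > 0].to_dict()

# hypothetical sample frame, used only to demonstrate the call
sample = pd.DataFrame({
    "code": ["AF", None, "AL"],
    "cases": [0, 1, 2],
    "region": [np.nan, np.nan, "EURO"],
})
nullCounts = nullCountsPerColumn(sample)
print(nullCounts)
```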

Loading the frame with global COVID-19 case data  
Data source:  
https://data.humdata.org/dataset/coronavirus-covid-19-cases-and-deaths

In [4]:
globalCases = pd.read_csv("data/WHO-COVID-19-global-data.csv")
globalCases.head(10)

Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,EMRO,0,0,0,0
1,2020-01-12,AF,Afghanistan,EMRO,0,0,0,0
2,2020-01-19,AF,Afghanistan,EMRO,0,0,0,0
3,2020-01-26,AF,Afghanistan,EMRO,0,0,0,0
4,2020-02-02,AF,Afghanistan,EMRO,0,0,0,0
5,2020-02-09,AF,Afghanistan,EMRO,0,0,0,0
6,2020-02-16,AF,Afghanistan,EMRO,0,0,0,0
7,2020-02-23,AF,Afghanistan,EMRO,0,0,0,0
8,2020-03-01,AF,Afghanistan,EMRO,1,1,0,0
9,2020-03-08,AF,Afghanistan,EMRO,0,1,0,0


Describing the frame with COVID-19 case data

In [5]:
describeDataSet(globalCases, "Country_code")

Number of objects in dataset: 50160
Number of attributes in dataset: 8
Number of classes: 234
Number of observations per class: 
Country_code
AD    209
AE    209
AF    209
AG    209
AI    209
     ... 
YE    209
YT    209
ZA    209
ZM    209
ZW    209
Name: count, Length: 233, dtype: int64
Number of fields with missing data: 7733


The dataset contains many missing values (7733 fields)

In [6]:
countNullsInColumns(globalCases)

There is no null in column Date_reported
There is no null in column New_cases
There is no null in column Cumulative_cases
There is no null in column New_deaths
There is no null in column Cumulative_deaths


{'Country_code': 1463, 'Country': 1254, 'WHO_region': 5016}

Since we need the country code and name, and grouping by region may also prove useful, we drop the rows with these missing values

In [7]:
globalCasesCleaned = globalCases.dropna(how="any")
describeDataSet(globalCasesCleaned, "Country_code")

Number of objects in dataset: 44935
Number of attributes in dataset: 8
Number of classes: 215
Number of observations per class: 
Country_code
AD    209
AE    209
AF    209
AG    209
AI    209
     ... 
XK    209
YE    209
ZA    209
ZM    209
ZW    209
Name: count, Length: 215, dtype: int64
Number of fields with missing data: 0


Displaying the date range covered by the measurements

In [8]:
earliestDate = globalCasesCleaned["Date_reported"].min()
latestDate = globalCasesCleaned["Date_reported"].max()
print("Earliest Date: {0}\nLatest Date: {1}".format(earliestDate,latestDate))

Earliest Date: 2020-01-05
Latest Date: 2023-12-31


Loading the frame with data on the population and area of countries  
Data source: the data were provided for the laboratory classes of the course "Podstawy reprezentacji i analizy danych" in the winter semester of the 2023/2024 academic year

In [9]:
listOfCountries = pd.read_excel("data/alphabetical-list-of-countries2020.xlsx", index_col=1, header=4)
listOfCountries = listOfCountries.loc[:,["Country", "Population", "Land Area", "Density"]]
listOfCountries.head(10)

Unnamed: 0_level_0,Country,Population,Land Area,Density
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Afghanistan,38928346,652.86,60.0
2,Albania,2877797,27.4,105.0
3,Algeria,43851044,2381740.0,18.0
4,Andorra,77265,470.0,164.0
5,Angola,32866272,1246700.0,26.0
6,Antigua and Barbuda,97929,440.0,223.0
7,Argentina,45195774,2736690.0,17.0
8,Armenia,2963243,28.47,104.0
9,Australia,25499884,7682300.0,3.0
10,Austria,9006398,82.409,109.0


Description of the frame containing country data

In [10]:
describeDataSet(listOfCountries)

Number of objects in dataset: 195
Number of attributes in dataset: 4
Number of fields with missing data: 0


We can see that this dataset contains fewer countries than the COVID cases dataset. We drop the countries without a match so that both datasets cover the same countries

In [11]:
globalCasesMatched = globalCasesCleaned[globalCasesCleaned["Country"].isin(listOfCountries["Country"])] # drop from globalCases the countries missing from listOfCountries so that the sets of countries match
describeDataSet(globalCasesMatched, "Country") # number of records per country


Number of objects in dataset: 36575
Number of attributes in dataset: 8
Number of classes: 175
Number of observations per class: 
Country
Afghanistan    209
Albania        209
Algeria        209
Andorra        209
Angola         209
              ... 
Uzbekistan     209
Vanuatu        209
Yemen          209
Zambia         209
Zimbabwe       209
Name: count, Length: 175, dtype: int64
Number of fields with missing data: 0
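
Because the match is made on country names, spelling differences between the sources silently drop countries. A minimal sketch for listing the names that did not match, using made-up miniature frames (the names `cases` and `countries` and the sample values are illustrative assumptions, not the notebook's data):

```python
import pandas as pd

# hypothetical miniature versions of the two frames
cases = pd.DataFrame({"Country": ["Poland", "Czechia", "Kosovo", "Poland"]})
countries = pd.DataFrame({"Country": ["Poland", "Czech Republic", "Austria"]})

# names present in the cases data but absent from the country list
unmatched = sorted(set(cases["Country"]) - set(countries["Country"]))
print(unmatched)

# the isin filter keeps only rows whose name appears in both frames
matched = cases[cases["Country"].isin(countries["Country"])]
print(matched["Country"].nunique())
```

Inspecting such a list before filtering shows whether a country is missing or only spelled differently in one of the sources.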


We now have both population data and COVID case data for 175 countries  
  
Loading the frame with data on COVID-19 tests performed and converting the date format  
Data source:  
https://www.ecdc.europa.eu/en/publications-data/covid-19-testing

In [12]:
covidTesting = pd.read_excel("data/testing_on_covid.xlsx") # third data frame
for idx in covidTesting.index: # build a date column from year_week
    covidTesting.loc[idx,"year"] = str(covidTesting.loc[idx,"year_week"])[:4]
    covidTesting.loc[idx,"week"] = str(covidTesting.loc[idx,"year_week"])[-2:]
    covidTesting.loc[idx,"date"] = datetime.date(int(covidTesting.loc[idx,"year"]), 1, 1) + relativedelta(weeks=int(covidTesting.loc[idx,"week"]))
    # start from the first day of the year and add the given number of weeks
covidTesting = covidTesting.drop(columns=["year", "week", "year_week"])
covidTesting.head(10)

Unnamed: 0,country,country_code,level,region,region_name,new_cases,tests_done,population,testing_rate,positivity_rate,testing_data_source,date
0,Austria,AT,national,AT,Austria,,,8978929,,,,2020-01-08
1,Austria,AT,national,AT,Austria,,,8978929,,,,2020-01-15
2,Austria,AT,national,AT,Austria,,,8978929,,,,2020-01-22
3,Austria,AT,national,AT,Austria,,,8978929,,,,2020-01-29
4,Austria,AT,national,AT,Austria,,,8978929,,,,2020-02-05
5,Austria,AT,national,AT,Austria,,,8978929,,,,2020-02-12
6,Austria,AT,national,AT,Austria,,,8978929,,,,2020-02-19
7,Austria,AT,national,AT,Austria,,,8978929,,,,2020-02-26
8,Austria,AT,national,AT,Austria,12.0,,8978929,,,,2020-03-04
9,Austria,AT,national,AT,Austria,115.0,,8978929,,,,2020-03-11
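
The loop above approximates each week's date as January 1 plus a number of weeks. If `year_week` holds ISO week labels such as `2020-W10` (an assumption about the ECDC file, whose raw values are not shown here), a loop-free alternative could map each label to the Monday of that ISO week. This is only a sketch with a hypothetical helper `isoWeekToDate`, and it yields different dates than the ones printed above.

```python
import datetime
import pandas as pd

def isoWeekToDate(yearWeek):
    # assumes labels like "2020-W10": first 4 characters are the year, last 2 the week
    year = int(str(yearWeek)[:4])
    week = int(str(yearWeek)[-2:])
    # Monday (weekday 1) of the given ISO week; requires Python 3.8+
    return datetime.date.fromisocalendar(year, week, 1)

# hypothetical sample values
sample = pd.DataFrame({"year_week": ["2020-W01", "2020-W10"]})
sample["date"] = sample["year_week"].map(isoWeekToDate)
print(sample)
```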


Description of the data frame and a summary of the missing values

In [13]:
describeDataSet(covidTesting)
countNullsInColumns(covidTesting)

Number of objects in dataset: 6120
Number of attributes in dataset: 12
Number of fields with missing data: 4849
There is no null in column country
There is no null in column country_code
There is no null in column level
There is no null in column region
There is no null in column region_name
There is no null in column population
There is no null in column date


{'new_cases': 308,
 'tests_done': 1129,
 'testing_rate': 1129,
 'positivity_rate': 1154,
 'testing_data_source': 1129}

These gaps are expected, so we fill them with zeros (when there were no new tests or recorded cases, NaN was stored as the value)

In [14]:
covidTesting = covidTesting.fillna(0)
covidTesting.head(10)

Unnamed: 0,country,country_code,level,region,region_name,new_cases,tests_done,population,testing_rate,positivity_rate,testing_data_source,date
0,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-01-08
1,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-01-15
2,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-01-22
3,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-01-29
4,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-02-05
5,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-02-12
6,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-02-19
7,Austria,AT,national,AT,Austria,0.0,0.0,8978929,0.0,0.0,0,2020-02-26
8,Austria,AT,national,AT,Austria,12.0,0.0,8978929,0.0,0.0,0,2020-03-04
9,Austria,AT,national,AT,Austria,115.0,0.0,8978929,0.0,0.0,0,2020-03-11


In [15]:
print(covidTesting["level"].value_counts())
print(covidTesting["testing_data_source"].value_counts())

level
national    6120
Name: count, dtype: int64
testing_data_source
TESSy COVID-19    4991
0                 1129
Name: count, dtype: int64


The columns above add nothing to our dataset (level is constant, and testing_data_source has only one real value besides the filled-in zeros), so we can drop them

In [16]:
covidTesting = covidTesting.drop(columns=["level","testing_data_source"])
covidTesting.head(5)

Unnamed: 0,country,country_code,region,region_name,new_cases,tests_done,population,testing_rate,positivity_rate,date
0,Austria,AT,AT,Austria,0.0,0.0,8978929,0.0,0.0,2020-01-08
1,Austria,AT,AT,Austria,0.0,0.0,8978929,0.0,0.0,2020-01-15
2,Austria,AT,AT,Austria,0.0,0.0,8978929,0.0,0.0,2020-01-22
3,Austria,AT,AT,Austria,0.0,0.0,8978929,0.0,0.0,2020-01-29
4,Austria,AT,AT,Austria,0.0,0.0,8978929,0.0,0.0,2020-02-05


In [17]:
matchingName = (covidTesting["country"] == covidTesting["region_name"]).sum()
matchingCode = (covidTesting["country_code"] == covidTesting["region"]).sum()
print("All records: {0}\n sum of the same country and region_name: {1}\n sum of the same country_code and region {2}"
      .format(covidTesting.shape[0], matchingName, matchingCode))

All records: 6120
 sum of the same country and region_name: 6120
 sum of the same country_code and region 6120


One attribute in each of these pairs is redundant, so we can drop it

In [18]:
covidTesting = covidTesting.drop(columns=["region_name", "region"])
covidTesting.head(5)

Unnamed: 0,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,date
0,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-08
1,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-15
2,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-22
3,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-29
4,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-02-05


Displaying the date range covered by the covidTesting frame

In [19]:
earliestDateTesting = covidTesting["date"].min()
latestDateTesting = covidTesting["date"].max()
print("Earliest Date: {0}\nLatest Date: {1}".format(earliestDateTesting,latestDateTesting))

Earliest Date: 2020-01-08
Latest Date: 2023-11-26


We display the number of countries in both frames and keep only the countries present in both, so that the frames can be joined later

In [20]:
print("Number of countries in covidTesting:",len(covidTesting["country"].unique())) # count the countries in the European testing data

Number of countries in covidTesting: 30


In [21]:
# check how many European countries are in the global data
print("Number of european countries in globalCases:",len(globalCasesMatched[globalCasesMatched["WHO_region"] == "EURO"]["Country"].unique()))

Number of european countries in globalCases: 48


In [22]:
# 28 countries match between globalCasesCleaned and covidTesting
globalCasesEurope = globalCasesCleaned[globalCasesCleaned["Country_code"].isin(covidTesting["country_code"])]
print("Number of european countries in globalCases:",len(globalCasesEurope["Country_code"].unique()))

Number of european countries in globalCases: 28


In [23]:
# restrict covidTesting to the matched countries
covidTesting = covidTesting[covidTesting["country_code"].isin(globalCasesEurope["Country_code"])]
print("Number of countries in covidTesting:",len(covidTesting["country_code"].unique()))

Number of countries in covidTesting: 28


We have unified the sets of countries in the European and global data  
  
We load the frame with vaccination data  
Data source:  
https://data.humdata.org/m/dataset/covid-19-vaccinations  

In [24]:
globalVaccinations = pd.read_csv("data/vaccinations.csv")
globalVaccinations.head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,22/02/2021,0.0,0.0,,,,,0.0,0.0,,,,,
1,Afghanistan,AFG,23/02/2021,,,,,,1367.0,,,,,33.0,1367.0,0.003
2,Afghanistan,AFG,24/02/2021,,,,,,1367.0,,,,,33.0,1367.0,0.003
3,Afghanistan,AFG,25/02/2021,,,,,,1367.0,,,,,33.0,1367.0,0.003
4,Afghanistan,AFG,26/02/2021,,,,,,1367.0,,,,,33.0,1367.0,0.003
5,Afghanistan,AFG,27/02/2021,,,,,,1367.0,,,,,33.0,1367.0,0.003
6,Afghanistan,AFG,28/02/2021,8200.0,8200.0,,,,1367.0,0.02,0.02,,,33.0,1367.0,0.003
7,Afghanistan,AFG,01/03/2021,,,,,,1580.0,,,,,38.0,1580.0,0.004
8,Afghanistan,AFG,02/03/2021,,,,,,1794.0,,,,,44.0,1794.0,0.004
9,Afghanistan,AFG,03/03/2021,,,,,,2008.0,,,,,49.0,2008.0,0.005
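
The `date` column here uses a day/month/year text format, unlike the ISO dates in the cases data, so comparing or sorting these strings directly would not follow calendar order. A minimal sketch of parsing them with an explicit format, on a few sample values in the same format:

```python
import pandas as pd

# sample rows in the same dd/mm/yyyy text format as the vaccination data
sample = pd.DataFrame({"date": ["22/02/2021", "01/03/2021", "28/02/2021"]})

# as plain strings, "01/03/2021" sorts first although it is the latest date
print(sample["date"].min())

# an explicit format avoids any day/month ambiguity
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y")
print(sample["date"].min(), sample["date"].max())
```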


In [25]:
globalVaccinations = globalVaccinations.dropna(subset=["total_vaccinations"]) # drop rows where total_vaccinations is null (days without measurements are not needed)
globalVaccinations

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,22/02/2021,0.0,0.0,,,,,0.00,0.00,,,,,
6,Afghanistan,AFG,28/02/2021,8200.0,8200.0,,,,1367.0,0.02,0.02,,,33.0,1367.0,0.003
22,Afghanistan,AFG,16/03/2021,54000.0,54000.0,,,,2862.0,0.13,0.13,,,70.0,2862.0,0.007
44,Afghanistan,AFG,07/04/2021,120000.0,120000.0,,,,3000.0,0.29,0.29,,,73.0,3000.0,0.007
59,Afghanistan,AFG,22/04/2021,240000.0,240000.0,,,,8000.0,0.58,0.58,,,195.0,8000.0,0.019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186471,Zimbabwe,ZWE,01/10/2022,12212594.0,6435008.0,4747263.0,1030323.0,,4037.0,74.83,39.43,29.09,6.31,247.0,1455.0,0.009
186472,Zimbabwe,ZWE,02/10/2022,12214870.0,6435481.0,4748232.0,1031157.0,2276.0,3870.0,74.84,39.43,29.09,6.32,237.0,1414.0,0.009
186473,Zimbabwe,ZWE,03/10/2022,12216848.0,6435970.0,4748940.0,1031938.0,1978.0,3777.0,74.86,39.43,29.10,6.32,231.0,1417.0,0.009
186475,Zimbabwe,ZWE,05/10/2022,12219760.0,6436704.0,4750104.0,1032952.0,,2076.0,74.87,39.44,29.11,6.33,127.0,638.0,0.004


We describe the frame

In [26]:
describeDataSet(globalVaccinations)

Number of objects in dataset: 80837
Number of attributes in dataset: 16
Number of fields with missing data: 105316


For the planned exploratory analysis the set of attributes is too detailed. Only "total_vaccinations" and "daily_vaccinations" will be useful

In [27]:
globalVaccinations = globalVaccinations.loc[:,["location", "iso_code", "date", "total_vaccinations", "daily_vaccinations"]]

We can see that the country codes differ (two-letter codes in the cases data, three-letter codes here). Fortunately, most of the names match. We need to make the sets of countries in the cases and vaccination data agree

In [28]:
globalVaccinationsMatched = globalVaccinations[globalVaccinations["location"].isin(globalCasesMatched["Country"])]
print("Countries in globalVaccinations:", len(globalVaccinationsMatched["location"].unique()))

Countries in globalVaccinations: 167


In [29]:
globalCasesMatchedVacc = globalCasesMatched[globalCasesMatched["Country"].isin(globalVaccinationsMatched["location"])]
print("Countries in globalCases:", len(globalCasesMatchedVacc["Country"].unique()))

Countries in globalCases: 167


We have also unified the sets of countries in the global cases and global vaccination data  
  
We describe the globalVaccinationsMatched frame

In [30]:
describeDataSet(globalVaccinationsMatched)
countNullsInColumns(globalVaccinationsMatched)

Number of objects in dataset: 52901
Number of attributes in dataset: 5
Number of fields with missing data: 167
There is no null in column location
There is no null in column iso_code
There is no null in column date
There is no null in column total_vaccinations


{'daily_vaccinations': 167}

In [31]:
globalVaccinationsMatched

Unnamed: 0,location,iso_code,date,total_vaccinations,daily_vaccinations
0,Afghanistan,AFG,22/02/2021,0.0,
6,Afghanistan,AFG,28/02/2021,8200.0,1367.0
22,Afghanistan,AFG,16/03/2021,54000.0,2862.0
44,Afghanistan,AFG,07/04/2021,120000.0,3000.0
59,Afghanistan,AFG,22/04/2021,240000.0,8000.0
...,...,...,...,...,...
186471,Zimbabwe,ZWE,01/10/2022,12212594.0,4037.0
186472,Zimbabwe,ZWE,02/10/2022,12214870.0,3870.0
186473,Zimbabwe,ZWE,03/10/2022,12216848.0,3777.0
186475,Zimbabwe,ZWE,05/10/2022,12219760.0,2076.0


We fill the missing values with zeros

In [32]:
globalVaccinationsMatched = globalVaccinationsMatched.fillna(0) # fill nulls with zeros

In [33]:
countNullsInColumns(globalVaccinationsMatched)

There is no null in column location
There is no null in column iso_code
There is no null in column date
There is no null in column total_vaccinations
There is no null in column daily_vaccinations


{}

To unify the testing data with the vaccination data, we copy the globalVaccinations frame and keep only the European countries

In [34]:
europeVaccinations = globalVaccinations.copy() # copy the globalVaccinations frame; next we extract only the European countries
europeVaccinations = europeVaccinations.fillna(0)
europeVaccinations

Unnamed: 0,location,iso_code,date,total_vaccinations,daily_vaccinations
0,Afghanistan,AFG,22/02/2021,0.0,0.0
6,Afghanistan,AFG,28/02/2021,8200.0,1367.0
22,Afghanistan,AFG,16/03/2021,54000.0,2862.0
44,Afghanistan,AFG,07/04/2021,120000.0,3000.0
59,Afghanistan,AFG,22/04/2021,240000.0,8000.0
...,...,...,...,...,...
186471,Zimbabwe,ZWE,01/10/2022,12212594.0,4037.0
186472,Zimbabwe,ZWE,02/10/2022,12214870.0,3870.0
186473,Zimbabwe,ZWE,03/10/2022,12216848.0,3777.0
186475,Zimbabwe,ZWE,05/10/2022,12219760.0,2076.0


Keeping only the European countries in the vaccination data

In [35]:
europeVaccinations = europeVaccinations[europeVaccinations["location"].isin(covidTesting["country"])] # keep only the countries present in covidTesting, which contains European countries only
print("Countries in europeVaccinations:", len(europeVaccinations["location"].unique()))

Countries in europeVaccinations: 28


In [36]:
europeVaccinations.head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,daily_vaccinations
11858,Austria,AUT,08/01/2021,31790.0,0.0
11865,Austria,AUT,15/01/2021,117665.0,12268.0
11872,Austria,AUT,22/01/2021,216189.0,14075.0
11879,Austria,AUT,29/01/2021,265704.0,7074.0
11886,Austria,AUT,05/02/2021,364418.0,14102.0
11893,Austria,AUT,12/02/2021,482438.0,16860.0
11900,Austria,AUT,19/02/2021,586391.0,14850.0
11907,Austria,AUT,26/02/2021,715245.0,18408.0
11914,Austria,AUT,05/03/2021,885732.0,24355.0
11921,Austria,AUT,12/03/2021,1146023.0,37184.0


## Trends

We load the frame with trends scaled relative to the search frequency of the phrase covid  
Data source:  
https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid,quarantine,vaccination,respiratory%20disease,covid%20testing&hl=pl

In [37]:
trendsUnified = pd.read_csv("data/trendsUnified.csv", index_col=0,header=1) # the remaining attributes are scaled relative to the "covid: (Cały świat)" attribute
trendsUnified

Unnamed: 0_level_0,covid: (Cały świat),vaccine: (Cały świat),respiratory disease: (Cały świat),covid testing: (Cały świat),quarantine: (Cały świat)
Tydzień,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-12-01,0,1,<1,<1,<1
2019-12-08,<1,1,<1,<1,<1
2019-12-15,<1,1,<1,0,<1
2019-12-22,<1,<1,<1,<1,<1
2019-12-29,0,1,<1,0,<1
...,...,...,...,...,...
2023-12-17,8,1,<1,<1,<1
2023-12-24,7,1,<1,<1,<1
2023-12-31,6,1,<1,<1,<1
2024-01-07,5,1,<1,<1,<1


We can see that several cells contain the value <1.
I have decided to replace them with 0.5 to simplify the calculations

In [38]:
trendsUnified = trendsUnified.replace(to_replace="<1", value=0.5) # the value 100 corresponds to the peak search week for "covid"; the other attributes are scaled relative to it
trendsUnified

Unnamed: 0_level_0,covid: (Cały świat),vaccine: (Cały świat),respiratory disease: (Cały świat),covid testing: (Cały świat),quarantine: (Cały świat)
Tydzień,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-12-01,0,1,0.5,0.5,0.5
2019-12-08,0.5,1,0.5,0.5,0.5
2019-12-15,0.5,1,0.5,0,0.5
2019-12-22,0.5,0.5,0.5,0.5,0.5
2019-12-29,0,1,0.5,0,0.5
...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5
2023-12-24,7,1,0.5,0.5,0.5
2023-12-31,6,1,0.5,0.5,0.5
2024-01-07,5,1,0.5,0.5,0.5


We describe the trends frame

In [39]:
describeDataSet(trendsUnified)

Number of objects in dataset: 216
Number of attributes in dataset: 5
Number of fields with missing data: 0


Every dataset downloaded from Google Trends has the same layout, so in each of them we need to replace <1 with 0.5
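
Since the same two steps repeat for every Google Trends export, they could be wrapped in one helper. The sketch below is an illustration under stated assumptions, not the notebook's code: `loadTrends` is a hypothetical name, the in-memory CSV stands in for a real export, and the helper additionally converts the columns to numbers and the index to dates, because replacing the text `<1` alone can leave a column holding a mix of strings and floats.

```python
import io
import pandas as pd

def loadTrends(source, belowOne=0.5):
    # the first line of an export is a category line, so the header is on row 1
    frame = pd.read_csv(source, index_col=0, header=1)
    # swap the "<1" marker for a numeric stand-in, then make every column numeric
    frame = frame.astype(str).replace("<1", str(belowOne)).apply(pd.to_numeric)
    # parse the week labels so the index can be aligned with other dated data
    frame.index = pd.to_datetime(frame.index)
    return frame

# hypothetical in-memory file imitating the layout of a Google Trends export
sampleCsv = io.StringIO(
    "Category: All categories\n"
    "Week,covid: (Worldwide)\n"
    "2019-12-01,0\n"
    "2019-12-08,<1\n"
    "2019-12-15,8\n"
)
trends = loadTrends(sampleCsv)
print(trends)
print(trends.dtypes)
```

With a helper like this, each per-phrase cell would reduce to a single call with the path of the corresponding CSV file.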

We load the frame with the next batch of data scaled relative to covid searches  
https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid,pneumonia,asthma,lung%20cancer,cystic%20fibrosis&hl=pl  

In [40]:
trendsUnified2 = pd.read_csv("data/trendsUnified2.csv", index_col=0,header=1) # the remaining attributes are scaled relative to the "covid: (Cały świat)" attribute
trendsUnified2 = trendsUnified2.replace(to_replace="<1", value=0.5) # the value 100 corresponds to the peak search week for "covid"; the other attributes are scaled relative to it
trendsUnified2.columns = ["covid_unified", "pneumonia_unified", "asthma_unified", "lungCancer_unified", "cysticFibrosis_unified"]
describeDataSet(trendsUnified2)
trendsUnified2

Number of objects in dataset: 216
Number of attributes in dataset: 5
Number of fields with missing data: 0


Unnamed: 0_level_0,covid_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified
Tydzień,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-12-01,0,1,0.5,0.5,0.5
2019-12-08,0,1,0.5,0.5,0.5
2019-12-15,0.5,1,0.5,0.5,0.5
2019-12-22,0,1,0.5,0.5,0.5
2019-12-29,0,1,0.5,0.5,0.5
...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5
2023-12-24,7,1,0.5,0.5,0.5
2023-12-31,6,1,0.5,0.5,0.5
2024-01-07,5,1,0.5,0.5,0.5


Data on the individual phrases, not scaled relative to covid, may also be useful. We repeat the same steps for each phrase  
* Search popularity data for the phrase covid - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid&hl=pl
* Search popularity data for the phrase quarantine - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=quarantine&hl=pl
* Search popularity data for the phrase respiratory disease - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=respiratory%20disease&hl=pl
* Search popularity data for the phrase covid testing - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=covid%20testing&hl=pl
* Search popularity data for the phrase vaccination - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=vaccination&hl=pl
* Search popularity data for the phrase pneumonia - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=pneumonia&hl=pl
* Search popularity data for the phrase asthma - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=asthma&hl=pl
* Search popularity data for the phrase lung cancer - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=lung%20cancer&hl=pl
* Search popularity data for the phrase cystic fibrosis - https://trends.google.com/trends/explore?date=2019-12-01%202024-01-14&q=cystic%20fibrosis&hl=pl

In [41]:
covidTrends = pd.read_csv("data/covidTrends.csv", index_col=0,header=1) # the covid column here is the same as in the frames above
covidTrends = covidTrends.replace(to_replace="<1", value=0.5)
describeDataSet(covidTrends)
covidTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,covid: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,0.0
2019-12-08,0.5
2019-12-15,0.5
2019-12-22,0.5
2019-12-29,0.0


In [42]:
vaccineTrends = pd.read_csv("data/vaccineTrends.csv", index_col=0,header=1) # the value 100 marks the week in which "vaccine" was searched most often
vaccineTrends = vaccineTrends.replace(to_replace="<1", value=0.5)
describeDataSet(vaccineTrends)
vaccineTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,vaccine: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,3
2019-12-08,3
2019-12-15,3
2019-12-22,2
2019-12-29,3


In [43]:
testingTrends = pd.read_csv("data/testingTrends.csv", index_col=0,header=1)
testingTrends = testingTrends.replace(to_replace="<1", value=0.5)
describeDataSet(testingTrends)
testingTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,covid testing: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,0.5
2019-12-08,0.5
2019-12-15,0.0
2019-12-22,0.0
2019-12-29,0.0


In [44]:
quarantineTrends = pd.read_csv("data/quarantineTrends.csv", index_col=0,header=1)
quarantineTrends = quarantineTrends.replace(to_replace="<1", value=0.5)
describeDataSet(quarantineTrends)
quarantineTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,quarantine: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,1
2019-12-08,1
2019-12-15,1
2019-12-22,1
2019-12-29,1


In [45]:
respiratoryTrends = pd.read_csv("data/respiratoryTrends.csv", index_col=0,header=1)
respiratoryTrends = respiratoryTrends.replace(to_replace="<1", value=0.5)
describeDataSet(respiratoryTrends)
respiratoryTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,respiratory disease: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,21
2019-12-08,20
2019-12-15,19
2019-12-22,11
2019-12-29,12


In [46]:
pneumoniaTrends = pd.read_csv("data/pneumoniaTrends.csv", index_col=0,header=1)
pneumoniaTrends = pneumoniaTrends.replace(to_replace="<1", value=0.5)
describeDataSet(pneumoniaTrends)
pneumoniaTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,pneumonia: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,37
2019-12-08,37
2019-12-15,39
2019-12-22,51
2019-12-29,48


In [47]:
lungCancerTrends = pd.read_csv("data/lungCancerTrends.csv", index_col=0,header=1)
lungCancerTrends = lungCancerTrends.replace(to_replace="<1", value=0.5)
describeDataSet(lungCancerTrends)
lungCancerTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,lung cancer: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,49
2019-12-08,48
2019-12-15,45
2019-12-22,42
2019-12-29,44


In [48]:
cysticFibrosisTrends = pd.read_csv("data/cysticFibrosisTrends.csv", index_col=0,header=1)
cysticFibrosisTrends = cysticFibrosisTrends.replace(to_replace="<1", value=0.5)
describeDataSet(cysticFibrosisTrends)
cysticFibrosisTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,cystic fibrosis: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,67
2019-12-08,73
2019-12-15,65
2019-12-22,51
2019-12-29,65


In [49]:
asthmaTrends = pd.read_csv("data/asthmaTrends.csv", index_col=0,header=1)
asthmaTrends = asthmaTrends.replace(to_replace="<1", value=0.5)
describeDataSet(asthmaTrends)
asthmaTrends.head(5)

Number of objects in dataset: 216
Number of attributes in dataset: 1
Number of fields with missing data: 0


Unnamed: 0_level_0,asthma: (Cały świat)
Tydzień,Unnamed: 1_level_1
2019-12-01,30
2019-12-08,28
2019-12-15,30
2019-12-22,26
2019-12-29,29


I will merge all the trend frames into one to make later use easier  
First we rename the attributes scaled relative to covid

In [50]:
allTrends = trendsUnified.copy()
allTrends.columns = ["covid_unified", "vaccine_unified", "respiratory_disease_unified", "covid_testing_unified", "quarantine_unified"] # renamed columns
allTrends.index = allTrends.index.rename("week") 
allTrends

Unnamed: 0_level_0,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified
week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-12-01,0,1,0.5,0.5,0.5
2019-12-08,0.5,1,0.5,0.5,0.5
2019-12-15,0.5,1,0.5,0,0.5
2019-12-22,0.5,0.5,0.5,0.5,0.5
2019-12-29,0,1,0.5,0,0.5
...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5
2023-12-24,7,1,0.5,0.5,0.5
2023-12-31,6,1,0.5,0.5,0.5
2024-01-07,5,1,0.5,0.5,0.5


Now we add to the frame the remaining attributes scaled relative to covid

In [51]:
allTrends.loc[:,["pneumonia_unified", "asthma_unified", "lungCancer_unified", "cysticFibrosis_unified"]] = trendsUnified2.loc[:,["pneumonia_unified", "asthma_unified", "lungCancer_unified", "cysticFibrosis_unified"]]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5
...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5


Now we add one column per phrase with values that are not scaled relative to covid (each phrase is normalized on its own)

In [52]:
allTrends["covid_solo"] = covidTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0
...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5


In [53]:
allTrends["vaccine_solo"] = vaccineTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3
...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5


In [54]:
allTrends["respiratory_disease_solo"] = respiratoryTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo,respiratory_disease_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3,21
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3,20
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3,19
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2,11
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3,12
...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6,28
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4,20
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5,24
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5,30


In [55]:
allTrends["covid_testing_solo"] = testingTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo,respiratory_disease_solo,covid_testing_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3,21,0.5
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3,20,0.5
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3,19,0
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2,11,0
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3,12,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6,28,1
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4,20,1
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5,24,1
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5,30,1


In [56]:
allTrends["quarantine_solo"] = quarantineTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3,21,0.5,1
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3,20,0.5,1
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3,19,0,1
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2,11,0,1
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6,28,1,1
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4,20,1,1
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5,24,1,1
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5,30,1,1


In [58]:
allTrends["pneumonia_solo"] = pneumoniaTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3,21,0.5,1,37
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3,20,0.5,1,37
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3,19,0,1,39
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2,11,0,1,51
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3,12,0,1,48
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6,28,1,1,45
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4,20,1,1,45
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5,24,1,1,43
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5,30,1,1,42


In [59]:
allTrends["asthma_solo"] = asthmaTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3,21,0.5,1,37,30
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3,20,0.5,1,37,28
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3,19,0,1,39,30
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2,11,0,1,51,26
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3,12,0,1,48,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6,28,1,1,45,29
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4,20,1,1,45,28
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5,24,1,1,43,29
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5,30,1,1,42,31


In [60]:
allTrends["lungCancer_solo"] = lungCancerTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3,21,0.5,1,37,30,49
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3,20,0.5,1,37,28,48
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3,19,0,1,39,30,45
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2,11,0,1,51,26,42
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3,12,0,1,48,29,44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6,28,1,1,45,29,41
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4,20,1,1,45,28,34
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5,24,1,1,43,29,35
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5,30,1,1,42,31,41


In [61]:
allTrends["cysticFibrosis_solo"] = cysticFibrosisTrends.iloc[:,0]
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,covid_solo,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo,cysticFibrosis_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0,3,21,0.5,1,37,30,49,67
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,3,20,0.5,1,37,28,48,73
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,0.5,3,19,0,1,39,30,45,65
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,0.5,2,11,0,1,51,26,42,51
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,0,3,12,0,1,48,29,44,65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,8,6,28,1,1,45,29,41,42
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,7,4,20,1,1,45,28,34,34
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,5,24,1,1,43,29,35,63
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,5,30,1,1,42,31,41,68
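
The cells above repeat the same assignment for every phrase. As a sketch only, the same result could be reached with one loop. The dictionary `solo_frames` and the toy frames below are assumptions standing in for the per-phrase frames loaded earlier:

```python
import pandas as pd

weeks = ["2019-12-01", "2019-12-08"]

# Hypothetical stand-ins for the per-phrase frames (one value column each)
solo_frames = {
    "vaccine": pd.DataFrame({"vaccination: (Worldwide)": [3, 3]}, index=weeks),
    "asthma": pd.DataFrame({"asthma: (Worldwide)": [30, 28]}, index=weeks),
}

# Stand-in for the frame that already holds the "unified" columns
all_trends = pd.DataFrame({"covid_unified": [0.0, 0.5]}, index=weeks)

for name, frame in solo_frames.items():
    # The first (and only) column of each frame becomes "<name>_solo";
    # pandas aligns the values on the shared week index
    all_trends[f"{name}_solo"] = frame.iloc[:, 0]
```

Because the assignment aligns on the index, all frames need to use the same week labels.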


Since the phrase "covid" had the highest value in Google Trends, the columns "covid_unified" and "covid_solo" carry identical information. I will drop one of them.

In [62]:
allTrends = allTrends.drop(columns="covid_solo")
#number of searches for each phrase normalized to 0-100. "unified" columns are scaled relative to the number of covid searches, "solo" columns are normalized to themselves
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo,cysticFibrosis_solo
2019-12-01,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,3,21,0.5,1,37,30,49,67
2019-12-08,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,3,20,0.5,1,37,28,48,73
2019-12-15,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,3,19,0,1,39,30,45,65
2019-12-22,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,2,11,0,1,51,26,42,51
2019-12-29,0,1,0.5,0,0.5,1,0.5,0.5,0.5,3,12,0,1,48,29,44,65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-17,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,28,1,1,45,29,41,42
2023-12-24,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,4,20,1,1,45,28,34,34
2023-12-31,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,24,1,1,43,29,35,63
2024-01-07,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,30,1,1,42,31,41,68
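
The claim that two columns carry identical information can be checked before one of them is dropped. A minimal sketch on toy values (the frame `trends` is an assumption, not the real `allTrends`):

```python
import pandas as pd

# Toy frame with a duplicated column, mimicking covid_unified / covid_solo
trends = pd.DataFrame(
    {
        "covid_unified": [0.0, 0.5, 8.0],
        "covid_solo": [0.0, 0.5, 8.0],
        "vaccine_solo": [3, 3, 6],
    }
)

# Compare as floats so that a difference in dtype alone does not matter
identical = bool(
    (trends["covid_unified"].astype(float) == trends["covid_solo"].astype(float)).all()
)

# Drop the redundant column only when the check passes
if identical:
    trends = trends.drop(columns="covid_solo")
```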


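Date unification. Measurements are one week apart. We shift each date back to the Monday of its week, so that the first measurement falls on a Monday and every following one comes a week later. The source dates are Sundays, so each one moves back by six days.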

In [63]:
allTrends.index = pd.to_datetime(allTrends.index) 
#subtract the weekday number to shift the date back to Monday
allTrends.index = allTrends.index - pd.to_timedelta(allTrends.index.weekday, unit='D')#each date is moved to the first day (Monday) of the week it falls in
allTrends

week,covid_unified,vaccine_unified,respiratory_disease_unified,covid_testing_unified,quarantine_unified,pneumonia_unified,asthma_unified,lungCancer_unified,cysticFibrosis_unified,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo,cysticFibrosis_solo
2019-11-25,0,1,0.5,0.5,0.5,1,0.5,0.5,0.5,3,21,0.5,1,37,30,49,67
2019-12-02,0.5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,3,20,0.5,1,37,28,48,73
2019-12-09,0.5,1,0.5,0,0.5,1,0.5,0.5,0.5,3,19,0,1,39,30,45,65
2019-12-16,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0.5,2,11,0,1,51,26,42,51
2019-12-23,0,1,0.5,0,0.5,1,0.5,0.5,0.5,3,12,0,1,48,29,44,65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-12-11,8,1,0.5,0.5,0.5,1,0.5,0.5,0.5,6,28,1,1,45,29,41,42
2023-12-18,7,1,0.5,0.5,0.5,1,0.5,0.5,0.5,4,20,1,1,45,28,34,34
2023-12-25,6,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,24,1,1,43,29,35,63
2024-01-01,5,1,0.5,0.5,0.5,1,0.5,0.5,0.5,5,30,1,1,42,31,41,68
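
The shift to Monday relies on pandas numbering weekdays from Monday = 0 to Sunday = 6. A small illustration with three hand-picked dates (a Sunday, a Wednesday and a Monday):

```python
import pandas as pd

# Sunday, Wednesday and Monday, chosen only to show the three cases
index = pd.to_datetime(["2019-12-01", "2019-12-04", "2019-12-09"])

# Subtracting the weekday number (0..6) as days lands on the Monday of that week
week_start = index - pd.to_timedelta(index.weekday, unit="D")

week_start_labels = list(week_start.strftime("%Y-%m-%d"))
```

A Sunday therefore moves back six days, while a Monday stays where it is.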


We describe the frame with the search trend data

In [64]:
describeDataSet(allTrends)

Number of objects in dataset: 216
Number of attributes in dataset: 17
Number of fields with missing data: 0


## Merging the case and vaccination frames

#### To merge the frames we must make sure the dates match. For each date we take the start of its week

In [65]:
globalVaccinationsMatched["date"] = pd.to_datetime(globalVaccinationsMatched["date"])


In [66]:
globalVaccinationsMatched.head(5) 

Unnamed: 0,location,iso_code,date,total_vaccinations,daily_vaccinations
0,Afghanistan,AFG,2021-02-22,0.0,0.0
6,Afghanistan,AFG,2021-02-28,8200.0,1367.0
22,Afghanistan,AFG,2021-03-16,54000.0,2862.0
44,Afghanistan,AFG,2021-04-07,120000.0,3000.0
59,Afghanistan,AFG,2021-04-22,240000.0,8000.0


In [67]:
globalCasesMatchedVacc["Date_reported"] = pd.to_datetime(globalCasesMatchedVacc["Date_reported"]) #converts the column to datetime

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  globalCasesMatchedVacc["Date_reported"] = pd.to_datetime(globalCasesMatchedVacc["Date_reported"]) #converts the column to datetime
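
The warning above is the one pandas emits when a column is assigned on a frame that was itself selected from another frame. One common way to avoid it, assuming the frame was created by filtering a larger one, is to take an explicit copy first. The names and values below are toy assumptions:

```python
import pandas as pd

# Toy stand-in for the full cases table
cases = pd.DataFrame(
    {
        "Country": ["Austria", "Austria", "Brazil"],
        "Date_reported": ["2020-01-05", "2020-01-12", "2020-01-05"],
    }
)

# .copy() makes the subset an independent frame, so the assignment is unambiguous
subset = cases[cases["Country"] == "Austria"].copy()
subset["Date_reported"] = pd.to_datetime(subset["Date_reported"])
```

The original frame keeps its text dates, and only the copied subset is converted.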


In [68]:
globalCasesMatchedVacc.head(5)

Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,EMRO,0,0,0,0
1,2020-01-12,AF,Afghanistan,EMRO,0,0,0,0
2,2020-01-19,AF,Afghanistan,EMRO,0,0,0,0
3,2020-01-26,AF,Afghanistan,EMRO,0,0,0,0
4,2020-02-02,AF,Afghanistan,EMRO,0,0,0,0


We can see that cases are reported once a week, while vaccinations are reported more often. This needs to be unified.  
We add a week_start column holding the start date of the week in which the measurement was taken

In [69]:
globalCasesMatchedVaccToMerge = globalCasesMatchedVacc.copy()
globalCasesMatchedVaccToMerge['week_start'] = globalCasesMatchedVaccToMerge['Date_reported'] - pd.to_timedelta(globalCasesMatchedVaccToMerge['Date_reported'].dt.weekday, unit='D')
globalCasesMatchedVaccToMerge.head(5)

Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths,week_start
0,2020-01-05,AF,Afghanistan,EMRO,0,0,0,0,2019-12-30
1,2020-01-12,AF,Afghanistan,EMRO,0,0,0,0,2020-01-06
2,2020-01-19,AF,Afghanistan,EMRO,0,0,0,0,2020-01-13
3,2020-01-26,AF,Afghanistan,EMRO,0,0,0,0,2020-01-20
4,2020-02-02,AF,Afghanistan,EMRO,0,0,0,0,2020-01-27


In [70]:
globalVaccinationsMatched.head(21) #the frame may hold more than one measurement per week

Unnamed: 0,location,iso_code,date,total_vaccinations,daily_vaccinations
0,Afghanistan,AFG,2021-02-22,0.0,0.0
6,Afghanistan,AFG,2021-02-28,8200.0,1367.0
22,Afghanistan,AFG,2021-03-16,54000.0,2862.0
44,Afghanistan,AFG,2021-04-07,120000.0,3000.0
59,Afghanistan,AFG,2021-04-22,240000.0,8000.0
78,Afghanistan,AFG,2021-05-11,504502.0,13921.0
87,Afghanistan,AFG,2021-05-20,547901.0,4822.0
91,Afghanistan,AFG,2021-05-24,573277.0,5692.0
93,Afghanistan,AFG,2021-05-26,590454.0,6768.0
94,Afghanistan,AFG,2021-05-27,593313.0,6487.0
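
How many measurements fall into each week can be counted directly. A minimal sketch using four of the Afghanistan dates shown above; the frame name `vacc` is an assumption:

```python
import pandas as pd

vacc = pd.DataFrame(
    {
        "location": ["Afghanistan"] * 4,
        "date": pd.to_datetime(
            ["2021-05-20", "2021-05-24", "2021-05-26", "2021-05-27"]
        ),
    }
)

# Monday of the week each measurement belongs to
vacc["week_start"] = vacc["date"] - pd.to_timedelta(vacc["date"].dt.weekday, unit="D")

# Number of measurements per country and week
per_week = vacc.groupby(["location", "week_start"]).size()

# Weeks that hold more than one measurement and therefore need aggregating
multiple = per_week[per_week > 1]
```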


## resampleVaccination
- converts a frame of vaccination data (where measurements are not necessarily taken weekly) into a frame with one measurement per week, dated to the first day of that week
- The function processes the frame country by country. For each country it selects the observations that fall in the same week (the frame must already contain a week_start column).
- For the selected observations it sums the values in the "daily_vaccinations" column and stores the result as weeklyVaccinations
- For the selected observations it takes the largest total_vaccinations value and stores it as totalVaccinations
- This way several measurements from a given week are turned into a single one dated to the first day of that week
- the function takes a data frame as its parameter

In [71]:
def resampleVaccination(frame):
    vaccinationCountries = frame["location"].unique()
    countryFrames = [] #one resampled frame per country, concatenated once at the end
    for country in vaccinationCountries:
        countryFrame = frame[frame["location"] == country] #rows for the current country only
        currentCode = countryFrame["iso_code"].iloc[0] #ISO code taken from the first row of the country
        weekStarts = countryFrame["week_start"].unique()
        totalVaccinations = []
        weeklyVacs = []
        for currentDate in weekStarts:
            dateAndCountryFrame = countryFrame[countryFrame['week_start'] == currentDate]
            totalVaccinations.append(dateAndCountryFrame["total_vaccinations"].max()) #largest cumulative count in the week, most likely the last measurement of that week
            weeklyVacs.append(dateAndCountryFrame["daily_vaccinations"].sum()) #sum of the daily values reported in the week, stored as the weekly figure
        values = {"date": weekStarts,
                "location": country,
                "iso_code": currentCode,
                "totalVaccinations": totalVaccinations,
                "weeklyVaccinations":weeklyVacs}
        countryFrames.append(pd.DataFrame(values)) #collect the rows for this country
    if not countryFrames:
        return pd.DataFrame() #no countries in the input, return an empty frame
    return pd.concat(countryFrames, ignore_index=True, axis=0)
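
The per-country, per-week aggregation described above can also be sketched with `groupby`. This is an alternative formulation under the assumption that `location`, `iso_code` and `week_start` contain no missing values; the function name `resample_vaccination_groupby` is hypothetical and the rows are four of the Afghanistan measurements shown earlier:

```python
import pandas as pd


def resample_vaccination_groupby(frame):
    """Aggregate vaccination rows to one row per country and week (sketch)."""
    weekly = (
        frame.groupby(["location", "iso_code", "week_start"], sort=False)
        .agg(
            totalVaccinations=("total_vaccinations", "max"),  # largest cumulative value
            weeklyVaccinations=("daily_vaccinations", "sum"),  # sum of reported daily values
        )
        .reset_index()
        .rename(columns={"week_start": "date"})
    )
    # Same column order as the loop-based function
    return weekly[["date", "location", "iso_code", "totalVaccinations", "weeklyVaccinations"]]


sample = pd.DataFrame(
    {
        "location": ["Afghanistan"] * 4,
        "iso_code": ["AFG"] * 4,
        "date": pd.to_datetime(["2021-05-20", "2021-05-24", "2021-05-26", "2021-05-27"]),
        "total_vaccinations": [547901.0, 573277.0, 590454.0, 593313.0],
        "daily_vaccinations": [4822.0, 5692.0, 6768.0, 6487.0],
    }
)
sample["week_start"] = sample["date"] - pd.to_timedelta(sample["date"].dt.weekday, unit="D")

resampled = resample_vaccination_groupby(sample)
```

Note that the weekly figure is only the sum of the daily values present in the frame, so weeks with few reported rows will have a correspondingly small total.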

Preparing the frame with global vaccination data for merging

In [72]:
globalVaccinationsMatchedToMerge = globalVaccinationsMatched.copy()
globalVaccinationsMatchedToMerge['week_start'] = globalVaccinationsMatchedToMerge['date'] - pd.to_timedelta(globalVaccinationsMatchedToMerge['date'].dt.weekday, unit='D')
newGlobalVaccinationsToMerge = resampleVaccination(globalVaccinationsMatchedToMerge) #weekly measurements dated to the first day of the week
newGlobalVaccinationsToMerge.head(10)

Unnamed: 0,date,location,iso_code,totalVaccinations,weeklyVaccinations
0,2021-02-22,Afghanistan,AFG,8200.0,1367.0
1,2021-03-15,Afghanistan,AFG,54000.0,2862.0
2,2021-04-05,Afghanistan,AFG,120000.0,3000.0
3,2021-04-19,Afghanistan,AFG,240000.0,8000.0
4,2021-05-10,Afghanistan,AFG,504502.0,13921.0
5,2021-05-17,Afghanistan,AFG,547901.0,4822.0
6,2021-05-24,Afghanistan,AFG,600152.0,23693.0
7,2021-05-31,Afghanistan,AFG,630305.0,10404.0
8,2021-06-07,Afghanistan,AFG,641295.0,3388.0
9,2021-06-14,Afghanistan,AFG,662003.0,3272.0


We merge the case and vaccination frames to make the later analysis easier

In [73]:
globalCasesAndVaccinations = pd.merge(globalCasesMatchedVaccToMerge,newGlobalVaccinationsToMerge, how ='inner', left_on = ["Country", "week_start"], right_on = ["location", "date"]) #merge keys: country first, then date
#globalCasesMatchedVaccToMerge: globalCases prepared for merging with Vaccinations, one measurement per week
globalCasesAndVaccinations

Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths,week_start,date,location,iso_code,totalVaccinations,weeklyVaccinations
0,2021-02-28,AF,Afghanistan,EMRO,110,55714,11,2443,2021-02-22,2021-02-22,Afghanistan,AFG,8200.0,1367.0
1,2021-03-21,AF,Afghanistan,EMRO,118,56103,6,2463,2021-03-15,2021-03-15,Afghanistan,AFG,54000.0,2862.0
2,2021-04-11,AF,Afghanistan,EMRO,484,57160,24,2521,2021-04-05,2021-04-05,Afghanistan,AFG,120000.0,3000.0
3,2021-04-25,AF,Afghanistan,EMRO,1050,58843,43,2582,2021-04-19,2021-04-19,Afghanistan,AFG,240000.0,8000.0
4,2021-05-16,AF,Afghanistan,EMRO,1642,63484,56,2742,2021-05-10,2021-05-10,Afghanistan,AFG,504502.0,13921.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13664,2022-08-28,ZW,Zimbabwe,AFRO,75,256699,1,5593,2022-08-22,2022-08-22,Zimbabwe,ZWE,12145591.0,18468.0
13665,2022-09-11,ZW,Zimbabwe,AFRO,115,256870,0,5596,2022-09-05,2022-09-05,Zimbabwe,ZWE,12161598.0,1334.0
13666,2022-09-25,ZW,Zimbabwe,AFRO,344,257340,1,5599,2022-09-19,2022-09-19,Zimbabwe,ZWE,12187779.0,6949.0
13667,2022-10-02,ZW,Zimbabwe,AFRO,177,257517,3,5602,2022-09-26,2022-09-26,Zimbabwe,ZWE,12214870.0,20360.0
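
An inner merge silently drops weeks that exist in only one of the two frames. As a sketch, an outer merge with `indicator=True` shows how many rows matched. The two Afghanistan rows reuse values from the tables above, while the 2021-03-01 case row is a made-up value added only to produce an unmatched week:

```python
import pandas as pd

cases = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Afghanistan", "Afghanistan"],
        "week_start": pd.to_datetime(["2021-02-22", "2021-03-01", "2021-03-15"]),
        "New_cases": [110, 100, 118],  # the middle value is illustrative only
    }
)
vaccinations = pd.DataFrame(
    {
        "location": ["Afghanistan", "Afghanistan"],
        "date": pd.to_datetime(["2021-02-22", "2021-03-15"]),
        "totalVaccinations": [8200.0, 54000.0],
    }
)

# The extra "_merge" column records where each row came from
checked = pd.merge(
    cases,
    vaccinations,
    how="outer",
    left_on=["Country", "week_start"],
    right_on=["location", "date"],
    indicator=True,
)

matched = int((checked["_merge"] == "both").sum())
cases_only = int((checked["_merge"] == "left_only").sum())
```

Rows counted in `cases_only` are the ones an inner merge would discard.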


Selecting only the relevant attributes

In [74]:
arguments = ["date", "Country", "Country_code", "iso_code", "WHO_region", "New_cases", "Cumulative_cases", "New_deaths", 
             "Cumulative_deaths", "totalVaccinations", "weeklyVaccinations"] #relevant attributes
globalCasesAndVaccinationsReduced = globalCasesAndVaccinations.loc[:,arguments]
globalCasesAndVaccinationsReduced

Unnamed: 0,date,Country,Country_code,iso_code,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths,totalVaccinations,weeklyVaccinations
0,2021-02-22,Afghanistan,AF,AFG,EMRO,110,55714,11,2443,8200.0,1367.0
1,2021-03-15,Afghanistan,AF,AFG,EMRO,118,56103,6,2463,54000.0,2862.0
2,2021-04-05,Afghanistan,AF,AFG,EMRO,484,57160,24,2521,120000.0,3000.0
3,2021-04-19,Afghanistan,AF,AFG,EMRO,1050,58843,43,2582,240000.0,8000.0
4,2021-05-10,Afghanistan,AF,AFG,EMRO,1642,63484,56,2742,504502.0,13921.0
...,...,...,...,...,...,...,...,...,...,...,...
13664,2022-08-22,Zimbabwe,ZW,ZWE,AFRO,75,256699,1,5593,12145591.0,18468.0
13665,2022-09-05,Zimbabwe,ZW,ZWE,AFRO,115,256870,0,5596,12161598.0,1334.0
13666,2022-09-19,Zimbabwe,ZW,ZWE,AFRO,344,257340,1,5599,12187779.0,6949.0
13667,2022-09-26,Zimbabwe,ZW,ZWE,AFRO,177,257517,3,5602,12214870.0,20360.0


### Merging the case and testing frames

In [75]:
globalCasesEurope2 = globalCasesEurope.copy()
globalCasesEurope2["Date_reported"] = pd.to_datetime(globalCasesEurope2["Date_reported"]) # convert to datetime
globalCasesEurope2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5852 entries, 2508 to 43889
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date_reported      5852 non-null   datetime64[ns]
 1   Country_code       5852 non-null   object        
 2   Country            5852 non-null   object        
 3   WHO_region         5852 non-null   object        
 4   New_cases          5852 non-null   int64         
 5   Cumulative_cases   5852 non-null   int64         
 6   New_deaths         5852 non-null   int64         
 7   Cumulative_deaths  5852 non-null   int64         
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 411.5+ KB


In [76]:
covidTesting2 = covidTesting.copy()
covidTesting2["date"] = pd.to_datetime(covidTesting2["date"])
covidTesting2.head(5)

Unnamed: 0,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,date
0,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-08
1,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-15
2,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-22
3,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-29
4,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-02-05


Creating the week_start column to unify the dates. Here the measurements are weekly, so this step is sufficient

In [77]:
globalCasesEuropeToMerge = globalCasesEurope2.copy()
globalCasesEuropeToMerge['week_start'] = globalCasesEuropeToMerge['Date_reported'] - pd.to_timedelta(globalCasesEuropeToMerge['Date_reported'].dt.weekday, unit='D')
globalCasesEuropeToMerge.head(5)

Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths,week_start
2508,2020-01-05,AT,Austria,EURO,0,0,0,0,2019-12-30
2509,2020-01-12,AT,Austria,EURO,0,0,0,0,2020-01-06
2510,2020-01-19,AT,Austria,EURO,0,0,0,0,2020-01-13
2511,2020-01-26,AT,Austria,EURO,0,0,0,0,2020-01-20
2512,2020-02-02,AT,Austria,EURO,0,0,0,0,2020-01-27


In [78]:
covidTestingToMerge = covidTesting2.copy()
covidTestingToMerge['week_start'] = covidTestingToMerge['date'] - pd.to_timedelta(covidTestingToMerge['date'].dt.weekday, unit='D')
covidTestingToMerge.head(5)

Unnamed: 0,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,date,week_start
0,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-08,2020-01-06
1,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-15,2020-01-13
2,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-22,2020-01-20
3,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-29,2020-01-27
4,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-02-05,2020-02-03


Merging the frames with covid-19 testing data and covid cases in Europe

In [79]:
europeCasesAndTesting = pd.merge(globalCasesEuropeToMerge,covidTestingToMerge, how ='inner', left_on = ["Country", "week_start"], right_on = ["country", "week_start"])
europeCasesAndTesting

Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths,week_start,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,date
0,2020-01-12,AT,Austria,EURO,0,0,0,0,2020-01-06,Austria,AT,0.0,0.0,8978929,0.000000,0.000000,2020-01-08
1,2020-01-19,AT,Austria,EURO,0,0,0,0,2020-01-13,Austria,AT,0.0,0.0,8978929,0.000000,0.000000,2020-01-15
2,2020-01-26,AT,Austria,EURO,0,0,0,0,2020-01-20,Austria,AT,0.0,0.0,8978929,0.000000,0.000000,2020-01-22
3,2020-02-02,AT,Austria,EURO,0,0,0,0,2020-01-27,Austria,AT,0.0,0.0,8978929,0.000000,0.000000,2020-01-29
4,2020-02-09,AT,Austria,EURO,0,0,0,0,2020-02-03,Austria,AT,0.0,0.0,8978929,0.000000,0.000000,2020-02-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5503,2023-10-29,SE,Sweden,EURO,1286,2720686,52,25012,2023-10-23,Sweden,SE,1992.0,7317.0,10452326,70.003557,27.224272,2023-10-29
5504,2023-11-05,SE,Sweden,EURO,2730,2723416,105,25117,2023-10-30,Sweden,SE,2730.0,8819.0,10452326,84.373564,30.955891,2023-11-05
5505,2023-11-12,SE,Sweden,EURO,3523,2726939,169,25286,2023-11-06,Sweden,SE,3523.0,10180.0,10452326,97.394589,34.607073,2023-11-12
5506,2023-11-19,SE,Sweden,EURO,3344,2730283,193,25479,2023-11-13,Sweden,SE,3344.0,10297.0,10452326,98.513958,32.475478,2023-11-19


We select only the relevant attributes

In [80]:
arguments = ["week_start", "Country", "Country_code", "WHO_region", "New_cases", "Cumulative_cases", "New_deaths", 
             "Cumulative_deaths", "new_cases", "tests_done", "population", "testing_rate", "positivity_rate"]
europeCasesAndTesting = europeCasesAndTesting.loc[:,arguments]
europeCasesAndTesting = europeCasesAndTesting.rename(columns={"week_start": "date"})
europeCasesAndTesting.head(10)

Unnamed: 0,date,Country,Country_code,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths,new_cases,tests_done,population,testing_rate,positivity_rate
0,2020-01-06,Austria,AT,EURO,0,0,0,0,0.0,0.0,8978929,0.0,0.0
1,2020-01-13,Austria,AT,EURO,0,0,0,0,0.0,0.0,8978929,0.0,0.0
2,2020-01-20,Austria,AT,EURO,0,0,0,0,0.0,0.0,8978929,0.0,0.0
3,2020-01-27,Austria,AT,EURO,0,0,0,0,0.0,0.0,8978929,0.0,0.0
4,2020-02-03,Austria,AT,EURO,0,0,0,0,0.0,0.0,8978929,0.0,0.0
5,2020-02-10,Austria,AT,EURO,0,0,0,0,0.0,0.0,8978929,0.0,0.0
6,2020-02-17,Austria,AT,EURO,0,0,0,0,0.0,0.0,8978929,0.0,0.0
7,2020-02-24,Austria,AT,EURO,7,7,0,0,0.0,0.0,8978929,0.0,0.0
8,2020-03-02,Austria,AT,EURO,68,75,0,0,12.0,0.0,8978929,0.0,0.0
9,2020-03-09,Austria,AT,EURO,513,588,1,1,115.0,0.0,8978929,0.0,0.0
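
After this merge the frame holds two case columns that come from different frames, `New_cases` (from the cases frame) and `new_cases` (from the testing frame), and the preview shows that they do not always agree. A minimal sketch of quantifying the gap, using the three Austrian rows above where cases first appear; the column name `case_gap` is an assumption:

```python
import pandas as pd

# Rows 7-9 of the preview above (Austria, late February to early March 2020)
merged = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-02-24", "2020-03-02", "2020-03-09"]),
        "New_cases": [7, 68, 513],
        "new_cases": [0.0, 12.0, 115.0],
    }
)

# Difference between the two reported weekly counts
merged["case_gap"] = merged["New_cases"] - merged["new_cases"]
```

Which of the two columns to use in later analysis is a choice this sketch does not make.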


## Merging the vaccination and testing frames

In [81]:
covidTestingVacc = covidTesting.copy()
covidTestingVacc["date"] = pd.to_datetime(covidTestingVacc["date"])
covidTestingVacc.head(5)

Unnamed: 0,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,date
0,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-08
1,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-15
2,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-22
3,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-29
4,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-02-05


In [82]:
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning.
europeVaccinations = europeVaccinations.copy()
europeVaccinations["date"] = pd.to_datetime(europeVaccinations["date"], dayfirst=True)
europeVaccinations.head(5)


Unnamed: 0,location,iso_code,date,total_vaccinations,daily_vaccinations
11858,Austria,AUT,2021-01-08,31790.0,0.0
11865,Austria,AUT,2021-01-15,117665.0,12268.0
11872,Austria,AUT,2021-01-22,216189.0,14075.0
11879,Austria,AUT,2021-01-29,265704.0,7074.0
11886,Austria,AUT,2021-02-05,364418.0,14102.0


Snapping each date to the first day (Monday) of its week

In [83]:
europeVaccinationsToMerge = europeVaccinations.copy()
europeVaccinationsToMerge['week_start'] = europeVaccinationsToMerge['date'] - pd.to_timedelta(europeVaccinationsToMerge['date'].dt.weekday, unit='D')
europeVaccinationsToMerge.head(5)

Unnamed: 0,location,iso_code,date,total_vaccinations,daily_vaccinations,week_start
11858,Austria,AUT,2021-01-08,31790.0,0.0,2021-01-04
11865,Austria,AUT,2021-01-15,117665.0,12268.0,2021-01-11
11872,Austria,AUT,2021-01-22,216189.0,14075.0,2021-01-18
11879,Austria,AUT,2021-01-29,265704.0,7074.0,2021-01-25
11886,Austria,AUT,2021-02-05,364418.0,14102.0,2021-02-01


In [84]:
covidTestingVaccToMerge = covidTestingVacc.copy()
covidTestingVaccToMerge['week_start'] = covidTestingVaccToMerge['date'] - pd.to_timedelta(covidTestingVaccToMerge['date'].dt.weekday, unit='D')
covidTestingVaccToMerge.head(5)

Unnamed: 0,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,date,week_start
0,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-08,2020-01-06
1,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-15,2020-01-13
2,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-22,2020-01-20
3,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-01-29,2020-01-27
4,Austria,AT,0.0,0.0,8978929,0.0,0.0,2020-02-05,2020-02-03
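The week-start alignment used in the two cells above can be checked on a few hand-picked dates. The sketch below is illustrative only: the `sample` frame and its dates are made up, and only the subtraction mirrors the notebook's cells.

```python
import pandas as pd

# Hypothetical sample dates (not taken from the datasets):
# a Friday, a Wednesday and a Monday.
sample = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-08", "2020-01-08", "2021-01-04"])}
)

# dt.weekday is 0 for Monday ... 6 for Sunday, so subtracting that many days
# moves every date back to the Monday of its week.
sample["week_start"] = sample["date"] - pd.to_timedelta(
    sample["date"].dt.weekday, unit="D"
)

print(sample)
```

A date that already falls on a Monday is left unchanged, so applying the step twice is harmless.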


In [85]:
newEuropeVaccinationsToMerge = resampleVaccination(europeVaccinationsToMerge)
newEuropeVaccinationsToMerge.head(10)

Unnamed: 0,date,location,iso_code,totalVaccinations,weeklyVaccinations
0,2021-01-04,Austria,AUT,31790.0,0.0
1,2021-01-11,Austria,AUT,117665.0,12268.0
2,2021-01-18,Austria,AUT,216189.0,14075.0
3,2021-01-25,Austria,AUT,265704.0,7074.0
4,2021-02-01,Austria,AUT,364418.0,14102.0
5,2021-02-08,Austria,AUT,482438.0,16860.0
6,2021-02-15,Austria,AUT,586391.0,14850.0
7,2021-02-22,Austria,AUT,715245.0,18408.0
8,2021-03-01,Austria,AUT,885732.0,24355.0
9,2021-03-08,Austria,AUT,1146023.0,37184.0


Merging the frames with vaccination and testing data

In [86]:
covidTestingVaccToMerge = covidTestingVaccToMerge.drop(columns="date")
europeVaccinationsAndTesting = pd.merge(newEuropeVaccinationsToMerge, covidTestingVaccToMerge, how="inner", left_on=["location", "date"], right_on=["country", "week_start"])
europeVaccinationsAndTesting

Unnamed: 0,date,location,iso_code,totalVaccinations,weeklyVaccinations,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,week_start
0,2021-01-04,Austria,AUT,31790.0,0.0,Austria,AT,14916.0,127657.0,8978929,1421.739720,11.684436,2021-01-04
1,2021-01-04,Austria,AUT,31790.0,0.0,Austria,AT,14636.0,137035.0,8978929,1526.184248,10.680483,2021-01-04
2,2021-01-11,Austria,AUT,117665.0,12268.0,Austria,AT,10948.0,911735.0,8978929,10154.162039,1.200788,2021-01-11
3,2021-01-18,Austria,AUT,216189.0,14075.0,Austria,AT,10156.0,1144351.0,8978929,12744.849636,0.887490,2021-01-18
4,2021-01-25,Austria,AUT,265704.0,7074.0,Austria,AT,9513.0,655713.0,8978929,7302.797472,1.450787,2021-01-25
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3394,2023-10-02,Sweden,SWE,26253144.0,354.0,Sweden,SE,698.0,4834.0,10452326,46.248079,14.439388,2023-10-02
3395,2023-10-09,Sweden,SWE,26262103.0,1280.0,Sweden,SE,964.0,5555.0,10452326,53.146065,17.353735,2023-10-09
3396,2023-10-16,Sweden,SWE,26274229.0,1732.0,Sweden,SE,1286.0,6132.0,10452326,58.666368,20.971950,2023-10-16
3397,2023-10-23,Sweden,SWE,26317969.0,6249.0,Sweden,SE,1992.0,7317.0,10452326,70.003557,27.224272,2023-10-23
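The merged output above contains two rows for Austria in the week of 2021-01-04, which suggests that the testing frame holds more than one row for that country and week. A minimal sketch of how such duplicate keys could be detected before merging; the small frames below are made-up stand-ins that reuse only the column names and a few values shown above, and `validate` is an optional `pd.merge` argument that the notebook does not use.

```python
import pandas as pd

# Made-up stand-ins for the two frames; only the column names follow the notebook.
vacc = pd.DataFrame({
    "location": ["Austria", "Austria"],
    "date": pd.to_datetime(["2021-01-04", "2021-01-11"]),
    "weeklyVaccinations": [0.0, 12268.0],
})
testing = pd.DataFrame({
    "country": ["Austria", "Austria", "Austria"],
    "week_start": pd.to_datetime(["2021-01-04", "2021-01-04", "2021-01-11"]),
    "new_cases": [14916.0, 14636.0, 10948.0],
})

# Rows whose (country, week_start) key occurs more than once.
dupes = testing[testing.duplicated(subset=["country", "week_start"], keep=False)]
print(dupes)

# validate="one_to_one" makes pandas raise MergeError instead of silently
# repeating the vaccination row for every matching testing row.
try:
    pd.merge(vacc, testing, how="inner",
             left_on=["location", "date"], right_on=["country", "week_start"],
             validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

print(merge_ok)
```

Whether duplicated weeks should be aggregated or dropped depends on why they occur, which the notebook does not state.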


Selecting only the relevant attributes

In [87]:
arguments = ["date", "country", "country_code", "new_cases", "tests_done", "population", "testing_rate", "positivity_rate",
             "totalVaccinations", "weeklyVaccinations"]
europeVaccinationsAndTesting = europeVaccinationsAndTesting.loc[:,arguments]
europeVaccinationsAndTesting.head(10)

Unnamed: 0,date,country,country_code,new_cases,tests_done,population,testing_rate,positivity_rate,totalVaccinations,weeklyVaccinations
0,2021-01-04,Austria,AT,14916.0,127657.0,8978929,1421.73972,11.684436,31790.0,0.0
1,2021-01-04,Austria,AT,14636.0,137035.0,8978929,1526.184248,10.680483,31790.0,0.0
2,2021-01-11,Austria,AT,10948.0,911735.0,8978929,10154.162039,1.200788,117665.0,12268.0
3,2021-01-18,Austria,AT,10156.0,1144351.0,8978929,12744.849636,0.88749,216189.0,14075.0
4,2021-01-25,Austria,AT,9513.0,655713.0,8978929,7302.797472,1.450787,265704.0,7074.0
5,2021-02-01,Austria,AT,9115.0,960284.0,8978929,10694.861269,0.949198,364418.0,14102.0
6,2021-02-08,Austria,AT,9458.0,1477892.0,8978929,16459.557704,0.639966,482438.0,16860.0
7,2021-02-15,Austria,AT,11627.0,1500910.0,8978929,16715.913446,0.774663,586391.0,14850.0
8,2021-02-22,Austria,AT,14178.0,1582385.0,8978929,17623.315654,0.895989,715245.0,18408.0
9,2021-03-01,Austria,AT,16313.0,1651867.0,8978929,18397.149593,0.987549,885732.0,24355.0


## Creating a frame with summed tests and trends

To do this, we sum all new cases and tests across countries for each date.

In [103]:
covidTestingSumToMergeWithTrends = covidTestingVaccToMerge.copy()
covidTestingSumToMergeWithTrends = covidTestingSumToMergeWithTrends.loc[:, ["week_start","new_cases", "tests_done"]]
covidTestingSumToMergeWithTrends = covidTestingSumToMergeWithTrends.groupby("week_start").sum()
covidTestingSumToMergeWithTrends

week_start,new_cases,tests_done
2020-01-06,11.0,23372.0
2020-01-13,12.0,4.0
2020-01-20,10.0,0.0
2020-01-27,20.0,0.0
2020-02-03,32.0,61.0
...,...,...
2023-10-23,56146.0,405996.0
2023-10-30,51034.0,171026.0
2023-11-06,60718.0,392145.0
2023-11-13,72587.0,385654.0


Next, we add the trend information to this frame. First we need to select only the relevant attributes from the trends frame.

In [108]:
trendsToMergeWithSum = allTrends.copy()
trendsToMergeWithSum = trendsToMergeWithSum.drop(columns=["vaccine_unified", "respiratory_disease_unified", "covid_testing_unified", "quarantine_unified", "pneumonia_unified", "asthma_unified", "lungCancer_unified", "cysticFibrosis_unified"])
trendsToMergeWithSum

week,covid_unified,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo,cysticFibrosis_solo
2019-11-25,0,3,21,0.5,1,37,30,49,67
2019-12-02,0.5,3,20,0.5,1,37,28,48,73
2019-12-09,0.5,3,19,0,1,39,30,45,65
2019-12-16,0.5,2,11,0,1,51,26,42,51
2019-12-23,0,3,12,0,1,48,29,44,65
...,...,...,...,...,...,...,...,...,...
2023-12-11,8,6,28,1,1,45,29,41,42
2023-12-18,7,4,20,1,1,45,28,34,34
2023-12-25,6,5,24,1,1,43,29,35,63
2024-01-01,5,5,30,1,1,42,31,41,68


We merge these frames.

In [110]:
covidTestingSumAndTrends = pd.merge(covidTestingSumToMergeWithTrends, trendsToMergeWithSum, how="inner", left_index=True, right_index=True)
covidTestingSumAndTrends

Unnamed: 0,new_cases,tests_done,covid_unified,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo,cysticFibrosis_solo
2020-01-06,11.0,23372.0,0.5,3,21,0.5,1,42,31,49,63
2020-01-13,12.0,4.0,0.5,3,24,0,2,53,29,49,65
2020-01-20,10.0,0.0,0.5,4,30,0,4,57,32,52,67
2020-01-27,20.0,0.0,0.5,4,30,0,5,51,32,100,62
2020-02-03,32.0,61.0,1,4,25,0.5,4,45,31,52,64
...,...,...,...,...,...,...,...,...,...,...,...
2023-10-23,56146.0,405996.0,4,6,32,0.5,1,35,32,45,55
2023-10-30,51034.0,171026.0,4,6,34,0.5,1,36,34,45,55
2023-11-06,60718.0,392145.0,4,6,36,0.5,1,39,35,47,49
2023-11-13,72587.0,385654.0,4,5,61,0.5,1,47,31,37,43
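An inner merge on the index keeps only the weeks present in both frames; in the output above, the trend weeks before 2020-01-06 and after 2023-11-13 no longer appear. A hedged sketch of how the dropped weeks could be listed, using made-up frames and the `indicator` option of `pd.merge`, which the notebook does not use:

```python
import pandas as pd

# Made-up weekly frames indexed by week start; values are illustrative only.
sums = pd.DataFrame(
    {"new_cases": [11.0, 12.0]},
    index=pd.to_datetime(["2020-01-06", "2020-01-13"]),
)
trends = pd.DataFrame(
    {"covid_unified": [0.0, 0.5, 0.5]},
    index=pd.to_datetime(["2019-12-30", "2020-01-06", "2020-01-13"]),
)

# Inner merge on the index, as in the notebook: only shared weeks survive.
inner = pd.merge(sums, trends, how="inner", left_index=True, right_index=True)

# An outer merge with indicator=True shows which weeks exist on one side only.
outer = pd.merge(sums, trends, how="outer",
                 left_index=True, right_index=True, indicator=True)
dropped = outer[outer["_merge"] != "both"]

print(inner)
print(dropped)
```

Both indexes need the same dtype (here datetimes) for the weeks to match.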


## Creating a frame with summed cases and deaths and trends

In [114]:
covidCasesSumToMergeWithTrends = globalCasesMatchedVaccToMerge.copy()
covidCasesSumToMergeWithTrends = covidCasesSumToMergeWithTrends.loc[:, ["week_start","New_cases", "New_deaths"]]
covidCasesSumToMergeWithTrends = covidCasesSumToMergeWithTrends.groupby("week_start").sum()
covidCasesSumToMergeWithTrends

week_start,New_cases,New_deaths
2019-12-30,2,3
2020-01-06,45,1
2020-01-13,89,2
2020-01-20,1884,56
2020-01-27,12517,309
...,...,...
2023-11-27,172369,1202
2023-12-04,204569,991
2023-12-11,214319,624
2023-12-18,117736,162


In [115]:
covidCasesSumAndTrends = pd.merge(covidCasesSumToMergeWithTrends, trendsToMergeWithSum, how="inner", left_index=True, right_index=True)
covidCasesSumAndTrends

Unnamed: 0,New_cases,New_deaths,covid_unified,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo,cysticFibrosis_solo
2019-12-30,2,3,0,3,21,0.5,1,49,29,46,60
2020-01-06,45,1,0.5,3,21,0.5,1,42,31,49,63
2020-01-13,89,2,0.5,3,24,0,2,53,29,49,65
2020-01-20,1884,56,0.5,4,30,0,4,57,32,52,67
2020-01-27,12517,309,0.5,4,30,0,5,51,32,100,62
...,...,...,...,...,...,...,...,...,...,...,...
2023-11-27,172369,1202,5,6,83,1,1,65,33,43,54
2023-12-04,204569,991,6,6,44,1,1,48,33,49,55
2023-12-11,214319,624,8,6,28,1,1,45,29,41,42
2023-12-18,117736,162,7,4,20,1,1,45,28,34,34


## Creating a frame with summed vaccinations and trends

In [120]:
vaccinationsSumToMergeWithTrends = newGlobalVaccinationsToMerge.copy()
vaccinationsSumToMergeWithTrends = vaccinationsSumToMergeWithTrends.loc[:, ["date","totalVaccinations", "weeklyVaccinations"]]
vaccinationsSumToMergeWithTrends = vaccinationsSumToMergeWithTrends.groupby("date").sum()
vaccinationsSumToMergeWithTrends

date,totalVaccinations,weeklyVaccinations
2020-11-30,1.000000e+00,0.0
2020-12-07,1.500000e+01,5.0
2020-12-14,1.520779e+06,19581.0
2020-12-21,5.553360e+05,328139.0
2020-12-28,6.674180e+06,1503724.0
...,...,...
2023-12-11,2.446261e+09,44399.0
2023-12-18,2.429868e+09,37197.0
2023-12-25,2.348428e+08,21263.0
2024-01-01,2.231744e+08,31353.0


In [121]:
vaccinationsSumAndTrends = pd.merge(vaccinationsSumToMergeWithTrends, trendsToMergeWithSum, how="inner", left_index=True, right_index=True)
vaccinationsSumAndTrends

Unnamed: 0,totalVaccinations,weeklyVaccinations,covid_unified,vaccine_solo,respiratory_disease_solo,covid_testing_solo,quarantine_solo,pneumonia_solo,asthma_solo,lungCancer_solo,cysticFibrosis_solo
2020-11-30,1.000000e+00,0.0,58,32,32,29,14,31,28,39,52
2020-12-07,1.500000e+01,5.0,59,39,28,32,13,31,27,36,46
2020-12-14,1.520779e+06,19581.0,60,31,17,33,12,27,24,33,36
2020-12-21,5.553360e+05,328139.0,67,36,17,33,12,30,26,34,37
2020-12-28,6.674180e+06,1503724.0,66,46,24,32,12,32,30,34,42
...,...,...,...,...,...,...,...,...,...,...,...
2023-12-11,2.446261e+09,44399.0,8,6,28,1,1,45,29,41,42
2023-12-18,2.429868e+09,37197.0,7,4,20,1,1,45,28,34,34
2023-12-25,2.348428e+08,21263.0,6,5,24,1,1,43,29,35,63
2024-01-01,2.231744e+08,31353.0,5,5,30,1,1,42,31,41,68


## Saving the new data

In [88]:
allTrends.to_csv("data/cleanData/allGoogleTrends.csv") #trends

In [89]:
globalCasesAndVaccinationsReduced.to_csv("data/cleanData/covidCasesWithVaccinations.csv") #Covid cases with Vaccination data

In [90]:
europeCasesAndTesting.to_csv("data/cleanData/europeCasesAndTesting.csv") # Cases And Testing (Europe)

In [91]:
listOfCountries.to_csv("data/cleanData/listOfCountries.csv") #List of countries

In [92]:
europeVaccinationsAndTesting.to_csv("data/cleanData/europeVaccinationsAndTesting.csv") #Vaccinations and Testing (Europe)

In [122]:
covidTestingSumAndTrends.to_csv("data/cleanData/covidTestingSumAndTrends.csv")

In [123]:
covidCasesSumAndTrends.to_csv("data/cleanData/covidCasesSumAndTrends.csv")

In [124]:
vaccinationsSumAndTrends.to_csv("data/cleanData/vaccinationsSumAndTrends.csv")
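`to_csv` writes the index as the first column, so reading these files back needs `index_col=0` (and `parse_dates` for the weekly frames). Otherwise the index reappears as an `Unnamed: 0` column. A minimal round-trip sketch, using a made-up frame and an in-memory buffer rather than the notebook's `data/cleanData/` paths:

```python
import io

import pandas as pd

# Made-up frame with a weekly DatetimeIndex, standing in for one of the saved frames.
frame = pd.DataFrame(
    {"new_cases": [11.0, 12.0]},
    index=pd.to_datetime(["2020-01-06", "2020-01-13"]),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)

# Default read: the unnamed index comes back as a regular "Unnamed: 0" column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Restoring the index and parsing it as dates gives back the original layout.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(naive.columns.tolist())
print(restored.index)
```

Alternatively, passing `index=False` to `to_csv` avoids the extra column for frames whose index carries no information.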