# Process Notebook
*   Author: Alex 470066919
*   Co-author: none
*   Reviewed by: none
*   Created: 14 Oct 2020
*   Last edited: 2 Nov 2020

The purpose of this process notebook is to analyse the country **Italy** and record any findings.

# Libraries
This process notebook utilises the following libraries:

In [67]:
# beginning date: 29 Oct 2020
# end date: 29 Oct 2020

import pandas as pd
import os
import shutil
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np

# require orca to save charts as static png images
# quoted so that the shell does not treat ">" as an output redirect
!pip install "plotly>=4.0.0"
!wget https://github.com/plotly/orca/releases/download/v1.2.1/orca-1.2.1-x86_64.AppImage -O /usr/local/bin/orca
!chmod +x /usr/local/bin/orca
!apt-get install xvfb libgtk2.0-0 libgconf-2-4

/usr/local/bin/orca: Text file busy
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libgtk2.0-0 is already the newest version (2.24.32-1ubuntu1).
libgconf-2-4 is already the newest version (3.2.6-4ubuntu1).
xvfb is already the newest version (2:1.19.6-1ubuntu4.7).
0 upgraded, 0 newly installed, 0 to remove and 11 not upgraded.


The *pandas* library is used to...

# Dataset

## Import data
Our dataset is a snapshot taken from the 'Our World in Data' GitHub repository and is saved as a CSV file in our [GitHub repository](https://github.sydney.edu.au/awon6941/DATA3406_Group4/blob/master/README.md#contributing). It is made up of a collection of sources, namely the [European Centre for Disease Prevention and Control (ECDC)](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide), official testing reports, United Nations, World Bank, etc. More details are provided in their [codebook](https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv).

In the following code, we are using pandas to read our data into a variable called `df1`.

In [37]:
# beginning date: 14 Oct 2020
# end date: 29 Oct 2020

# the raw data url is taken from our github repository
url = 'https://raw.github.sydney.edu.au/awon6941/DATA3406_Group4/master/data_raw/owid-covid-data.csv?token=AAAA6WNAI3I6TQCB5CALNRS7UONQC'
df1 = pd.read_csv(url)

# understanding the data
print(df1.shape)
print(df1.columns.tolist())
df1.head()

(50090, 41)
['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'tests_per_case', 'positive_rate', 'tests_units', 'stringency_index', 'population', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand', 'life_expectancy', 'human_development_index']


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,tests_per_case,positive_rate,tests_units,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
0,ABW,North America,Aruba,2020-03-13,2.0,2.0,,0.0,0.0,,18.733,18.733,,0.0,0.0,,,,,,,,,,,0.0,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,,76.29,
1,ABW,North America,Aruba,2020-03-19,,,0.286,,,0.0,,,2.676,,,0.0,,,,,,,,,,33.33,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,,76.29,
2,ABW,North America,Aruba,2020-03-20,4.0,2.0,0.286,0.0,0.0,0.0,37.465,18.733,2.676,0.0,0.0,0.0,,,,,,,,,,33.33,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,,76.29,
3,ABW,North America,Aruba,2020-03-21,,,0.286,,,0.0,,,2.676,,,0.0,,,,,,,,,,44.44,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,,76.29,
4,ABW,North America,Aruba,2020-03-22,,,0.286,,,0.0,,,2.676,,,0.0,,,,,,,,,,44.44,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,,76.29,


To better understand the data, we look at the properties `shape`, `columns` and `head`. As seen in the print results, our data has **50090** rows and **41 features**. From the first five rows of data, we can also see that missing values in the columns are represented as `NaN`.
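As a quick follow-up to the missing values noted above, a sketch along the following lines counts the `NaN`s in each column. The helper name `missing_summary` and the small `sample` frame are illustrative assumptions, not part of the notebook; in practice the function would be applied to `df1`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and share of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    return summary.sort_values("missing", ascending=False)


# Illustrative stand-in for df1 (values are made up for the example).
sample = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "total_cases": [2.0, np.nan, 4.0, np.nan],
    "new_tests": [np.nan, np.nan, np.nan, 10.0],
})

print(missing_summary(sample))
```

Columns with a large share of missing values (the testing columns, judging by the first rows shown above) may need more care in later analysis.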

## Clean data
Before performing any processing, the data has to be cleaned. This includes handling zero and missing values, choosing and justifying the rounding precision of each column, and handling outliers.

In [3]:
# beginning date: 29 Oct 2020
# end date: 31 Oct 2020

# scale huge numbers down
df1['population (million)'] = df1['population'].astype(float)/1000000

# justify rationale and convert significant figures
dfClean = df1.round({'population': -3, # nearest thousand
                      'aged_65_older': 1,
                      'aged_70_older': 1,
                      'gdp_per_capita': 2, # dollars
                      'extreme_poverty': 2, # since it is an extremely small number
                      'diabetes_prevalence': 1,
                      'female_smokers': 1,
                      'male_smokers': 1,
                      'handwashing_facilities': 1,
                      'hospital_beds_per_thousand': 1, # because some countries only 0.5
                      'human_development_index': 2, # normalized number from 0 to 1
                      'population (million)': 3, # nearest thousand,
                      'population_density': 0,
                      'median_age': 0,
                      'cardiovasc_death_rate': 0, # number of deaths, not percentage
                      'life_expectancy': 0,
                      })

# remove outliers
# TODO: can also visualise outliers using Boxplot, scatterplot
# TODO: test if there exists any negative values
dfClean['human_development_index'] = dfClean['human_development_index'].clip(0, 1)
zeroToHundred = ['stringency_index',
                 'median_age', # cant be older than 100
                 'life_expectancy', # cant be older than 100
                 'aged_65_older', # percentage
                 'aged_70_older', # percentage
                 'extreme_poverty', # percentage
                 'diabetes_prevalence', # percentage
                 'female_smokers', # percentage
                 'male_smokers', # percentage
                 'handwashing_facilities' # percentage
                 ]
dfClean[zeroToHundred] = dfClean[zeroToHundred].clip(0, 100)

1.   **Scaling down huge numbers**

An additional column `population (million)` was created to express the population in millions. It is rounded to three decimal places, which is equivalent to the nearest thousand people.

2.   **0 values**

The driving question of this study is to see how deadly the COVID-19 virus is. This means that the values of 0 are valuable information too. It can represent the timeline when the virus began to affect countries around the world. As such, these values are untouched.

3.   **Missing values**

In the data shown above, the missing values have already been handled by `pd.read_csv()`, which encodes missing values as `NaN` automatically. `NaN` is used to represent missing data for simplicity and performance reasons. For example, `NaN` values are skipped by default when summing a column (which has the same effect as treating them as zero), and rows whose grouping key is `NaN` are excluded by default from a `groupby`.
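The `NaN` behaviour described above can be checked with a small sketch. The frame `demo` and its values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values purely for illustration.
demo = pd.DataFrame({
    "continent": ["Europe", "Europe", np.nan, "Asia"],
    "new_cases": [10.0, np.nan, 5.0, 7.0],
})

# sum() skips NaN by default (skipna=True), so NaN contributes nothing.
total = demo["new_cases"].sum()

# With skipna=False a single NaN makes the whole sum NaN.
total_strict = demo["new_cases"].sum(skipna=False)

# Rows whose group key is NaN are left out of the groupby result.
by_continent = demo.groupby("continent")["new_cases"].sum()

print(total)
print(total_strict)
print(by_continent)
```

Note that the row with a missing continent drops out of the grouped totals, so grouped sums need not add up to the column total.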

4.   **Significant figures**

In the code above, we round each metadata column to an appropriate precision:

- `population`: nearest thousand
- `population (million)`: 3 decimal places, equivalent to the nearest thousand
- `gdp_per_capita`: 2 decimal places, since it is a currency value
- `hospital_beds_per_thousand`: 1 decimal place, because some countries have only 0.5
- `human_development_index`: 2 decimal places, because it is a normalized number from 0 to 1
- `extreme_poverty`: 2 decimal places, because it can be an extremely small percentage
- `aged_65_older`: 1 decimal place, percentage
- `aged_70_older`: 1 decimal place, percentage
- `diabetes_prevalence`: 1 decimal place, percentage
- `female_smokers`: 1 decimal place, percentage
- `male_smokers`: 1 decimal place, percentage
- `handwashing_facilities`: 1 decimal place, percentage

The data in columns `population_density`, `median_age`, `cardiovasc_death_rate` and `life_expectancy` should be integers rather than floats. However, because a standard integer column in pandas cannot hold missing values, these columns remain floats and were instead rounded to the nearest whole number.
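Recent pandas versions also provide a nullable integer dtype, `Int64` (with a capital I), which can hold whole numbers alongside missing values. The sketch below shows this alternative, which the notebook does not use; the values are made up, and whether the dtype suits the later plotting code is untested.

```python
import numpy as np
import pandas as pd

# Made-up median ages with one missing value.
median_age = pd.Series([41.2, 27.6, np.nan])

# Round first, then cast to the nullable integer dtype.
median_age_int = median_age.round(0).astype("Int64")

print(median_age_int)
```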

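5.   **Remove outliers**

Values outside the valid range of a column are not dropped. Instead, `clip` caps them at the nearest bound of the range:

- `human_development_index`: 0 to 1
- `stringency_index`: 0 to 100
- `median_age`: 0 to 100, cannot be older than 100
- `life_expectancy`: 0 to 100, cannot be older than 100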
- `extreme_poverty`: 0 to 100, percentage
- `aged_65_older`: 0 to 100, percentage
- `aged_70_older`: 0 to 100, percentage
- `diabetes_prevalence`: 0 to 100, percentage
- `female_smokers`: 0 to 100, percentage
- `male_smokers`: 0 to 100, percentage
- `handwashing_facilities`: 0 to 100, percentage
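Before clipping, it may be worth counting how many values actually fall outside each range, which would also cover the TODO about negative values. The following is a minimal sketch; `count_out_of_range` is a hypothetical helper name and the sample values are made up.

```python
import numpy as np
import pandas as pd


def count_out_of_range(df, columns, lower, upper):
    """Count values per column that fall outside [lower, upper]; NaN is ignored."""
    subset = df[columns]
    outside = (subset < lower) | (subset > upper)
    return outside.sum()


# Made-up values: one too large, one negative, one missing.
sample = pd.DataFrame({
    "stringency_index": [33.33, 104.0, np.nan],
    "male_smokers": [-1.0, 24.7, 30.7],
})

print(count_out_of_range(sample, ["stringency_index", "male_smokers"], 0, 100))

# Clipping then caps the offending values at the bounds; NaN stays NaN.
clipped = sample.clip(0, 100)
print(clipped)
```

If the counts are all zero for the real data, the clipping step changes nothing and simply acts as a safeguard.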

## Handling the data

The raw data and cleaned data are saved in separate folders, *data_raw* and *data_engineered* respectively. The raw data is in a tidy format. However, because the metadata stays the same over time for each location, we split the cleaned data into two dataframes: *dfCountries*, which holds the time-varying case data, and *dfMetadata*, which holds one row of metadata per location.

In [62]:
# beginning date: 31 Oct 2020
# end date: 2 Nov 2020

# split the data into cases and metadata
dfCases = dfClean.iloc[:, 0:26]
dfMetadata = dfClean.drop(dfClean.iloc[:, 3:26], axis=1)

# remove rows for international and world data (na values for iso_code and continent)
dfCases = dfCases[dfCases['iso_code'].notna()]
dfCountries = dfCases[dfCases['continent'].notna()]

# group data by location and take maximum value
dfMetadata = dfMetadata.groupby(['location'], sort=False).max().reset_index()

# print shape of cleaned data
print(dfClean.shape)
print(dfCountries.shape)
print(dfMetadata.shape)

dfMetadata.tail()

(50090, 42)
(49512, 26)
(212, 19)


Unnamed: 0,location,iso_code,continent,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population (million)
207,South Africa,ZAF,Africa,59309000.0,47.0,27.0,5.3,3.1,12294.88,18.9,200.0,5.5,8.1,33.2,44.0,2.3,64.0,0.7,59.309
208,Zambia,ZMB,Africa,18384000.0,23.0,18.0,2.5,1.5,3689.25,57.5,234.0,3.9,3.1,24.7,13.9,2.0,64.0,0.59,18.384
209,Zimbabwe,ZWE,Africa,14863000.0,43.0,20.0,2.8,1.9,1899.78,21.4,308.0,1.8,1.6,30.7,36.8,1.7,61.0,0.54,14.863
210,World,OWID_WRL,,7794799000.0,58.0,31.0,8.7,5.4,15469.21,10.0,233.0,8.5,6.4,34.6,60.1,2.7,73.0,,7794.799
211,International,,,,,,,,,,,,,,,,,,


The dataframe `dfCountries` has **49512 rows** and **26 columns**. The rows for "International" and "World" were removed so that cases are not double counted when summing across locations.

The dataframe `dfMetadata` has **212 rows** and **19 columns**. It represents 212 different locations, with the location name and 18 other columns that describe each location.
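Collapsing the metadata with `groupby(...).max()` relies on each metadata column being constant over time within a location. One way to check that assumption is to count the distinct values per location, as sketched below. `constant_within_groups` is a hypothetical helper name, and the case numbers in `sample` are made up.

```python
import pandas as pd


def constant_within_groups(df, key, columns):
    """Return, per column, whether every group has at most one distinct non-missing value."""
    distinct = df.groupby(key)[columns].nunique()
    return (distinct <= 1).all()


# Small stand-in for dfClean; the total_cases values are made up.
sample = pd.DataFrame({
    "location": ["Italy", "Italy", "Zambia", "Zambia"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"],
    "median_age": [48.0, 48.0, 18.0, 18.0],
    "total_cases": [10.0, 25.0, 0.0, 0.0],
})

print(constant_within_groups(sample, "location", ["median_age", "total_cases"]))
```

A column that reports `False` varies over time and would be better kept with the case data than with the metadata.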

## Saving Data to Local Directory

The processed data are saved in a folder named `data_engineered`, which will later be downloaded into the local drive. This is so that the data can be easily uploaded to GitHub if necessary. Similarly, a folder named `images` is created to store our charts and visualisations as png files.

In [66]:
# beginning date: 2 Nov 2020
# end date: 2 Nov 2020

# restart by removing the folders and their zip files
def removeDirAndZip(folderName):
    if os.path.isfile(folderName + ".zip"):
        os.remove(folderName + ".zip")
    dirpath = './' + folderName
    if os.path.exists(dirpath) and os.path.isdir(dirpath):
        shutil.rmtree(dirpath, ignore_errors=True)
removeDirAndZip('data_engineered')
removeDirAndZip('images')

# create directory, save dataframe into csv files, and zip folder
if not os.path.exists('data_engineered'):
    os.makedirs('data_engineered')
dfClean.to_csv('./data_engineered/dataset_cleaned.csv')
dfCountries.to_csv('./data_engineered/dataset_countries_only.csv')
dfMetadata.to_csv('./data_engineered/dataset_metadata_only.csv')
!zip -r "data_engineered.zip" "./data_engineered"

# create folder to save charts as images
if not os.path.exists('images'):
    os.makedirs('images')

  adding: data_engineered/ (stored 0%)
  adding: data_engineered/dataset_countries_only.csv (deflated 78%)
  adding: data_engineered/dataset_metadata_only.csv (deflated 55%)
  adding: data_engineered/dataset_cleaned.csv (deflated 84%)


The function, `removeDirAndZip`, was created to remove any existing data in the two folders and also remove their zip files. This is to ensure the data downloaded by the user is up to date.
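The `!zip` shell command is available in Colab but not in every environment. A portable alternative, sketched here on the assumption that only the Python standard library is available, is `shutil.make_archive`. The function name `zip_folder` and the temporary directory are illustrative and not part of the notebook.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder_path):
    """Zip folder_path into <folder_path>.zip, replacing any old archive."""
    archive = folder_path + ".zip"
    if os.path.isfile(archive):
        os.remove(archive)
    parent = os.path.dirname(os.path.abspath(folder_path))
    name = os.path.basename(os.path.normpath(folder_path))
    # root_dir/base_dir keep the folder name as the top-level entry in the archive.
    return shutil.make_archive(folder_path, "zip", root_dir=parent, base_dir=name)


# Demonstration in a throwaway directory.
work_dir = tempfile.mkdtemp()
data_dir = os.path.join(work_dir, "data_engineered")
os.makedirs(data_dir, exist_ok=True)
with open(os.path.join(data_dir, "example.csv"), "w") as handle:
    handle.write("location,total_cases\n")

archive_path = zip_folder(data_dir)
with zipfile.ZipFile(archive_path) as archive:
    names = archive.namelist()
print(names)

# Clean up the throwaway directory.
shutil.rmtree(work_dir)
```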

## Filter by Country
We can filter the data by country. In the code below, we will observe and compare **Italy** to two other countries, **China** and **Singapore**.

In [5]:
# beginning date: 14 Oct 2020
# end date: 29 Oct 2020

# filter our dataset by location/country
# (exact match, so that a name contained in another country's name is not selected too)
dfItaly = dfCountries[dfCountries['location'] == "Italy"]
dfSing = dfCountries[dfCountries['location'] == "Singapore"]
dfChina = dfCountries[dfCountries['location'] == "China"]

# print the dimensions of the three datasets
print(dfItaly.shape)
print(dfSing.shape)
print(dfChina.shape)

# print the first five rows of data for Italy
dfItaly.head()

(289, 26)
(289, 26)
(289, 26)


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,tests_per_case,positive_rate,tests_units,stringency_index
23098,ITA,Europe,Italy,2019-12-31,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,
23099,ITA,Europe,Italy,2020-01-01,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,0.0
23100,ITA,Europe,Italy,2020-01-02,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,0.0
23101,ITA,Europe,Italy,2020-01-03,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,0.0
23102,ITA,Europe,Italy,2020-01-04,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,0.0


Based on the dataframes for each country, we can see that each country has the same amount of data **(289 rows, 26 columns)**.
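When filtering by country name, substring matching (`str.contains`) and exact matching (`==` or `isin`) can give different results for countries whose names overlap. The row counts above suggest this does not affect Italy, Singapore or China, but the sketch below shows how it could matter for other countries. The `locations` frame is a made-up stand-in for the `location` column.

```python
import pandas as pd

# Country names chosen because they overlap with each other.
locations = pd.DataFrame({
    "location": ["Niger", "Nigeria", "Guinea", "Guinea-Bissau", "Italy"],
})

# Substring match: "Niger" also selects "Nigeria".
substring_match = locations[locations["location"].str.contains("Niger")]

# Exact match: only the requested country is selected.
exact_match = locations[locations["location"] == "Niger"]

print(substring_match["location"].tolist())
print(exact_match["location"].tolist())
```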




## Bias

Data sourced from [European Centre for Disease Prevention and Control (ECDC)](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide)



*   Describe how you considered the potential bias in the data collection and how that may have affected your results

- Potential for bias among the data analysts and the target external audience, particularly confirmation bias and anchoring bias.
- Confirmation bias in the data collection itself: are patients classified as COVID-19 cases right away based on their symptoms, and are deaths assumed to be due to COVID-19?



# Analysis

## Functions
The three functions below were created to keep the code clean and compact. The first function, `printFilterCols`, takes a list of columns and a list of countries and prints the specified columns for those countries. The second and third functions, `plotLineChartForCountries` and `plotPXLineChart`, take a list of countries and plot a line chart of the specified variable.

In [6]:
# beginning date: 29 Oct 2020
# end date: 31 Oct 2020

# function to filter columns for the countries
def printFilterCols(df, cols, countries):
  # extract the columns and filter by the three countries
  dfFilter = pd.DataFrame(df,columns=cols)
  dfFilter = dfFilter[dfFilter['location'].isin(countries)]
  print(dfFilter)

In [45]:
# beginning date: 29 Oct 2020
# end date: 2 Nov 2020

# TODO: code to save chart as images
# plots a line chart
def plotLineChartForCountries(df, countries, variable, chartTitle):
  dfDate = df.sort_values(by=['date', 'location'])
  for country in countries:
    # exact match, so that a name contained in another country's name is not selected too
    dfCountry = dfDate[dfDate['location'] == country]
    plt.plot(dfCountry['date'], dfCountry[variable], label=country)
  plt.xlabel("date")
  plt.ylabel(variable)
  plt.title(chartTitle)
  plt.legend()
  plt.show()

# plots an interactive line chart
def plotPXLineChart(df, countries, variable, chartTitle):
  dfFilter = df[df['location'].isin(countries)]
  fig = px.line(dfFilter,
                 x = 'date',
                 y = variable,
                 color = 'location',
                 labels={'location' : "Country",
                         'date': "Date",
                         variable: variable},
                 title=chartTitle)
  
  fileName = chartTitle.replace(" ", "_")
  fig.write_image("images/" + fileName.lower() + ".png")
  fig.show()

## a. Exploratory Analysis of Resources
The article ["Coronavirus: Let’s not forget the world’s poorest countries"](https://www.un.org/africarenewal/web-features/coronavirus/coronavirus-let%E2%80%99s-not-forget-world%E2%80%99s-poorest-countries) highlights that even though less developed countries (LDCs) have reported fewer COVID-19 cases than hotspots like Italy and the United States, we should expect these numbers to rise over the next few months. This is because of limited access to test kits and already poor medical facilities.

In the code below, we look at each country's resources, in terms of GDP per capita and the number of hospital beds per thousand, to see whether there really is an inverse relationship between resources and the number of COVID-19 cases.



In [46]:
# beginning date: 14 Oct 2020
# end date: 29 Oct 2020

# print the max value of columns for the countries
printFilterCols(dfMetadata, ['location','gdp_per_capita','hospital_beds_per_thousand'], ['Italy', 'Singapore', 'China', 'India'])

plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'total_cases', "Total cases over time")
plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'total_deaths', "Total deaths over time")
plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'new_cases', "New cases over time")
plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'new_deaths', "New deaths over time")

plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'total_tests', "Total tests over time")
plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'new_tests', "New tests over time")

      location  gdp_per_capita  hospital_beds_per_thousand
36       China        15308.71                         4.3
91       India         6426.67                         0.5
97       Italy        35220.08                         3.2
167  Singapore        85535.38                         2.4


From the results, we can observe that India has the lowest GDP per capita and the lowest number of hospital beds per thousand of the four countries, which suggests that its medical facilities are more limited. The four charts of cases and deaths also agree that the number of COVID-19 cases in LDCs is on the rise. Since May 2020, the number of new COVID-19 cases in India has risen drastically, overtaking Italy.




## b. Confirmatory Analysis of Age
The findings of the Centers for Disease Control and Prevention (CDC), as seen on their [website](https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/older-adults.html#:~:text=The%20greatest%20risk%20for%20severe,as%20having%20underlying%20medical%20conditions), state that the risk of death from COVID-19 increases with age.

By looking at the countries' population, in terms of median age, percentage aged 65 and older, percentage aged 70 and older and life expectancy, we aim to confirm this relationship between old age and death due to COVID-19.

In [9]:
# beginning date: 14 Oct 2020
# end date: 29 Oct 2020

# print the max value of columns for the countries
printFilterCols(dfMetadata, ['location','median_age','aged_65_older','aged_70_older','life_expectancy'], ['Italy', 'Singapore', 'China', 'India'])

# total deaths over time not as good as we are looking at proportion of elderly!
# plotPXLineChart(dfClean, ['Italy', 'Singapore', 'China', 'India'], 'total_deaths', "Total deaths over time")
plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'total_deaths_per_million', "Total deaths per million over time")

      location  median_age  aged_65_older  aged_70_older  life_expectancy
36       China        39.0           10.6            5.9             77.0
91       India        28.0            6.0            3.4             70.0
97       Italy        48.0           23.0           16.2             84.0
167  Singapore        42.0           12.9            7.0             84.0


The chart above shows the total number of COVID-19 deaths per million. Italy, the country with the highest median age and the highest proportions of people aged 65 and older and 70 and older, has suffered the most deaths per million over time. This is consistent with the finding that age increases the risk of death from the COVID-19 virus.

## c. Confirmatory Analysis of Population Density
According to this [study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7439635/) conducted in Algeria, the spread of the COVID-19 virus increases with population density. This could be due to the nature of the virus, which spreads through close contact with an infected individual (approximately 2 arm lengths).

In the code below, we look at the countries' population density to see if the hypothesis is true.

In [10]:
# beginning date: 14 Oct 2020
# end date: 29 Oct 2020

# print the max value of columns for the countries
printFilterCols(dfMetadata, ['location','population (million)', 'population_density'], ['Italy', 'Singapore', 'China', 'India'])

plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'total_cases_per_million', "Total cases per million over time")
plotPXLineChart(dfCountries, ['Italy', 'Singapore', 'China', 'India'], 'new_cases_per_million', "New cases per million over time")

      location  population (million)  population_density
36       China              1439.324               148.0
91       India              1380.004               450.0
97       Italy                60.462               206.0
167  Singapore                 5.850              7916.0


The charts above show that Singapore has a higher number of total and new COVID-19 cases per million over time than the other three countries. Since Singapore's population density also far exceeds theirs, this is consistent with population density having an impact on the spread of the virus.



# Visualisations

## a. Heat maps
Below are some visualisations of the total cases, total deaths and new cases of the COVID-19 virus in the form of heat maps. What is most interesting is the heat map for the number of new cases. At the start of the global pandemic, China was the only country with a high number of new cases. However, as time passed, China managed to control the spread of the virus. On the other hand, countries such as the United States started showing huge numbers of new cases and have not recovered since.


In [11]:
# beginning date: 29 Oct 2020
# end date: 29 Oct 2020

# function returns a heat map figure
# sort data by date and location
def generate_heat_map(df, variable, rangeStart, rangeEnd, chartTitle):
  dfDate = df.sort_values(by=['date', 'location'])
  return px.choropleth(data_frame = dfDate,
                      locations="iso_code",
                      color=variable,
                      color_continuous_scale='ylorbr',
                      hover_name="location",
                      animation_frame="date",
                      range_color=[rangeStart,rangeEnd],
                      title = chartTitle)

In [12]:
# beginning date: 29 Oct 2020
# end date: 29 Oct 2020

# TODO: get max and min values of total cases as the range

# chart heat map, based on total_cases
fig = generate_heat_map(dfCountries, "total_cases", 100, 1000000, "Heat Map of Total COVID-19 Cases")
fig.show()

In [13]:
# beginning date: 29 Oct 2020
# end date: 29 Oct 2020

# TODO: get max and min values of total cases as the range

# chart heat map, based on total_deaths
fig = generate_heat_map(dfCountries, "total_deaths", 100, 50000, "Heat Map of Total COVID-19 Deaths")
fig.show()

In [14]:
# beginning date: 29 Oct 2020
# end date: 29 Oct 2020

# TODO: get max and min values of total cases as the range

# chart heat map, based on new_cases
fig = generate_heat_map(dfCountries, "new_cases", 0, 10000, "Heat Map of New COVID-19 Cases")
fig.show()
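The TODOs in the three cells above mention deriving the colour range from the data instead of hard-coding it. One possible approach is sketched below; `color_range` is a hypothetical helper, the choice of an upper quantile rather than the maximum is an assumption, and the sample values are made up.

```python
import numpy as np
import pandas as pd


def color_range(df, variable, upper_quantile=0.99):
    """Return (low, high) for a colour scale: the minimum and an upper quantile.

    Using a quantile instead of the maximum keeps a few extreme values
    from washing out the rest of the map; NaN values are ignored.
    """
    values = df[variable].dropna()
    return float(values.min()), float(values.quantile(upper_quantile))


# Made-up values for illustration.
sample = pd.DataFrame({"new_cases": [0.0, 5.0, 10.0, np.nan, 1000.0]})

# upper_quantile=1.0 reproduces the plain minimum and maximum.
low, high = color_range(sample, "new_cases", upper_quantile=1.0)
print(low, high)
```

The two values could then be passed as `rangeStart` and `rangeEnd` to `generate_heat_map`.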

## b. Tree Map
This chart is not working yet: the cell below ends with a `TypeError`.
Reference: https://plotly.com/python/plotly-express/

In [15]:
# beginning date: 29 Oct 2020
# end date: 29 Oct 2020

dfPopulation = pd.DataFrame(dfCountries,columns=['location','population (million)'])
dfMillion = dfPopulation.groupby(['location'], sort=False)['population (million)'].max()
print(sum(dfMillion))

dfCountries["world"] = "world" # in order to have a single root node
fig = px.treemap(dfCountries, path=['world', 'continent', 'country'], values='total_cases',
                  color='total_cases', hover_data=['iso_code'])
fig.show()

nan




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



TypeError: ignored
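A few things in the cell above look like possible causes. The `path` refers to a `country` column, while the data uses `location`. `dfCountries` has no `population (million)` column (it lives in `dfMetadata`), which would explain the `nan`. The frame also holds one row per date rather than one per country. The sketch below is one possible way to prepare the data and build the treemap. The helper names `latest_totals` and `build_treemap` are hypothetical, the sample rows are made up, and `build_treemap` has not been run against the full dataset.

```python
import pandas as pd


def latest_totals(df, value_col="total_cases"):
    """Keep the most recent positive value per location that has a continent."""
    rows = df.dropna(subset=[value_col, "continent"])
    rows = rows[rows[value_col] > 0]
    rows = rows.sort_values("date")
    latest = rows.groupby("location", as_index=False).last()
    # assign() returns a new frame, avoiding the chained-assignment warning.
    return latest.assign(world="world")


def build_treemap(df, value_col="total_cases"):
    """Build a treemap figure; plotly is imported here so latest_totals works without it."""
    import plotly.express as px

    data = latest_totals(df, value_col)
    return px.treemap(data, path=["world", "continent", "location"],
                      values=value_col, color=value_col,
                      hover_data=["iso_code"])


# Made-up rows for illustration.
sample = pd.DataFrame({
    "iso_code": ["ITA", "ITA", "SGP", "OWID_WRL"],
    "continent": ["Europe", "Europe", "Asia", None],
    "location": ["Italy", "Italy", "Singapore", "World"],
    "date": ["2020-10-01", "2020-10-02", "2020-10-02", "2020-10-02"],
    "total_cases": [100.0, 120.0, 50.0, 170.0],
})

print(latest_totals(sample))
```

In the notebook, `build_treemap(dfCountries).show()` would be the intended call.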

# Uncertainty

*   Culture
*   Deaths due to COVID or something else?



# Download files

In [52]:
# beginning date: 2 Nov 2020
# end date: 2 Nov 2020

# zip images folder
!zip -r "images.zip" "./images"

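# download both data_engineered and images folder
# (files is the Colab download helper; this assumes the notebook runs in Google Colab)
from google.colab import files
files.download("./data_engineered.zip")
files.download("./images.zip")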

  adding: images/ (stored 0%)
  adding: images/new_cases_over_time.png (deflated 20%)
  adding: images/new_deaths_over_time.png (deflated 18%)
  adding: images/.ipynb_checkpoints/ (stored 0%)
  adding: images/total_tests_over_time.png (deflated 24%)
  adding: images/total_deaths_over_time.png (deflated 21%)
  adding: images/total_cases_over_time.png (deflated 22%)
  adding: images/new_tests_over_time.png (deflated 20%)



# References for code


*   https://www.tutorialspoint.com/How-can-I-create-a-directory-if-it-does-not-exist-using-Python
*   https://stackoverflow.com/questions/43765117/how-to-check-existence-of-a-folder-and-then-remove-it
*   https://stackoverflow.com/questions/57262385/saving-or-downloading-plotly-iplot-images-on-google-colaboratory
*   https://colab.research.google.com/drive/1xinRwhXtlL-9Y0KbPrTmTxNdcN-Hvq4m#scrollTo=W2FsS5y_xzX3

