In [None]:
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/SN_web_lightmode.png" width="300">
</center>


<h1>Analysis of Global COVID-19 Pandemic Data</h1>

Estimated time needed: **90** minutes



## Overview:

There are 10 tasks in this final project. All tasks will be graded by your peers who are also completing this assignment within the same session.

You need to submit a screenshot of the code and output for each task for review.

If you need to refresh your memory about specific coding details, you may refer to previous hands-on labs for code examples.


In [None]:
# This lab requires the 'httr' and 'rvest' packages, which are already pre-loaded into this lab environment.
# However, if you are working in your local RStudio, please uncomment the code below and install the packages.

#install.packages("httr")
#install.packages("rvest")

In [4]:
library(httr)
library(rvest)

Loading required package: xml2


Note: if you cannot import the above libraries, please use install.packages() to install them first.


## TASK 1: Get a `COVID-19 pandemic` Wiki page using HTTP request


First, let's write a function that uses an HTTP request to get a public COVID-19 Wiki page.

Before you write the function, you can open this public page at https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country using a web browser.

The goal of task 1 is to get the HTML page using an HTTP request (`httr` library).
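
The request URL is just a base URL followed by a query string. The sketch below assembles it with base R only; the helper name `build_query_url` is hypothetical (it is not part of `httr`) and is shown only to illustrate how the pieces fit together.

```r
# Hypothetical helper (not part of httr): join a base URL and named query parameters.
build_query_url <- function(base_url, params) {
  # URL-encode each value; reserved characters such as ":" are left untouched
  encoded <- vapply(params, utils::URLencode, character(1), reserved = FALSE)
  # Turn list(title = "x") into "title=x", joining several parameters with "&"
  query <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste0(base_url, "?", query)
}

full_url <- build_query_url(
  "https://en.wikipedia.org/w/index.php",
  list(title = "Template:COVID-19_testing_by_country")
)
print(full_url)
```

The resulting string could then be passed to `httr::GET(url = full_url)`.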


In [5]:
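get_wiki_covid19_page <- function() {
    # Base URL of the Wikipedia index page
    wiki_base_url <- "https://en.wikipedia.org/w/index.php"
    # Query parameter that selects the COVID-19 testing template page
    url_parameter <- "title=Template:COVID-19_testing_by_country"
    # Assemble the full URL and send the GET request
    full_url <- paste0(wiki_base_url, "?", url_parameter)
    response <- httr::GET(url = full_url)
    return(response)
}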

Call the `get_wiki_covid19_page` function to get an HTTP response with the target HTML page.


In [6]:
# Call the get_wiki_covid19_page function once and print the response
wiki_response <- get_wiki_covid19_page()
print(wiki_response)

Response [https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country]
  Date: 2024-01-04 19:34
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 448 kB
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-fea...
<head>
<meta charset="UTF-8">
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-heade...
"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","...
"CS1 Russian-language sources (ru)","CS1 Bosnian-language sources (bs)","CS1 ...
"CS1 Malagasy-language sources (mg)","CS1 Malay-language sources (ms)","CS1 R...
"wgIsProbablyEditable":false,"wgRelevantPageIsProbablyEditable":false,"wgRest...
...


## TASK 2: Extract COVID-19 testing data table from the wiki HTML page


On the COVID-19 testing wiki page, you should see a data table: a `<table>` node that contains COVID-19 testing data by country:

<a href="https://cognitiveclass.ai/">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/covid-19-by-country.png" width="400" align="center">
</a>

Note that the numbers you actually see on your page may differ from those above, because the pandemic was still ongoing when this notebook was created.

The goal of task 2 is to extract the data table above and convert it into a data frame.


Now use the `read_html` function in the `rvest` library to get the root HTML node from the response.


In [7]:
# Get the root html node from the http response in task 1
root_node <- read_html(wiki_response)
root_node

{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...

Get the tables in the HTML root node using `html_nodes` function.


In [8]:
# Get the table node from the root html node
table_node<-html_nodes(root_node, "table")
table_node

{xml_nodeset (4)}
[1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" rol ...
[2] <table class="wikitable plainrowheaders sortable collapsible autocollapse ...
[3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentatio ...
[4] <table class="wikitable mw-templatedata-doc-params">\n<caption><p class=" ...

Read the specific table from the multiple tables in `table_node` using the `html_table` function and convert it into a data frame using `as.data.frame`.

_Hint: read the `table_node` element at index 2 (e.g. `table_node[2]`)._
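
Indexing with `[2]` and `[[2]]` gives different kinds of objects, which is why the hint pairs `table_node[2]` with `as.data.frame`. The sketch below uses a plain list of two small data frames as a stand-in for a list of parsed tables; the values are made up and no web page is parsed.

```r
# A plain list of data frames stands in for a list of parsed tables
# (an assumption for illustration; the contents are made up).
tables <- list(
  data.frame(note = "update banner"),
  data.frame(country = c("Afghanistan", "Albania"), tested = c("154767", "428654"))
)

one_element_list <- tables[2]   # single brackets: still a list, of length 1
testing_table <- tables[[2]]    # double brackets: the data frame itself

class(one_element_list)
class(testing_table)
nrow(testing_table)
```

Single brackets keep the container, so a further conversion step is needed; double brackets extract the element directly.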


In [9]:
# Read the table node and convert it into a data frame, and print the data frame for review
covid19_df<-as.data.frame(html_table(table_node[2]))
head(covid19_df)
tail(covid19_df)

Unnamed: 0_level_0,Country.or.region,Date.a.,Tested,Units.b.,Confirmed.cases.,Confirmed..tested..,Tested..population..,Confirmed..population..,Ref.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Afghanistan,17 Dec 2020,154767,samples,49621,32.1,0.4,0.13,[1]
2,Albania,18 Feb 2021,428654,samples,96838,22.6,15.0,3.4,[2]
3,Algeria,2 Nov 2020,230553,samples,58574,25.4,0.53,0.13,[3][4]
4,Andorra,23 Feb 2022,300307,samples,37958,12.6,387.0,49.0,[5]
5,Angola,2 Feb 2021,399228,samples,20981,5.3,1.3,0.067,[6]
6,Antigua and Barbuda,6 Mar 2021,15268,samples,832,5.4,15.9,0.86,[7]


Unnamed: 0_level_0,Country.or.region,Date.a.,Tested,Units.b.,Confirmed.cases.,Confirmed..tested..,Tested..population..,Confirmed..population..,Ref.
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
168,Uzbekistan,7 Sep 2020,2630000,samples,43975,1.7,7.7,0.13,[189]
169,Venezuela,30 Mar 2021,3179074,samples,159149,5.0,11.0,0.55,[190]
170,Vietnam,28 Aug 2022,45772571,samples,11403302,24.9,46.4,11.6,[191]
171,Zambia,10 Mar 2022,3301860,samples,314850,9.5,19.0,1.8,[192]
172,Zimbabwe,15 Oct 2022,2529087,samples,257893,10.2,17.0,1.7,[3][193]
173,(table footnotes and reference-list CSS, repeated in every column; output truncated)


## TASK 3: Pre-process and export the extracted data frame

The goal of task 3 is to pre-process the extracted data frame from the previous step, and export it as a csv file


Let's get a summary of the data frame


In [10]:
# Print the summary of the data frame
summary(covid19_df)

 Country.or.region    Date.a.             Tested            Units.b.        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed.cases.   Confirmed..tested.. Tested..population..
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed..population..     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  

As you can see from the summary, the column names are a little difficult to understand and some column data types are not correct. For example, the `Tested` column shows as `character`.

As such, the data frame read from the HTML table needs some pre-processing, such as removing irrelevant columns, renaming columns, and converting columns into proper data types.
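
The type conversion usually comes down to stripping thousands separators and then calling `as.numeric`. A minimal sketch on made-up values in the style of the scraped columns:

```r
# Toy values in the same style as the scraped columns (made up for illustration)
tested_raw <- c("154,767", "428654", "N/A")

# Strip thousands separators, then convert; text that is not a number becomes NA.
# suppressWarnings() hides the coercion warning that "N/A" would otherwise raise.
tested_num <- suppressWarnings(as.numeric(gsub(",", "", tested_raw)))

print(tested_num)
class(tested_num)
```

Entries that cannot be parsed end up as `NA`, so it is worth checking for them before summing a column.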


We have prepared a pre-processing function for you to convert the data frame, but you can also try to write one yourself.


In [11]:
preprocess_covid_data_frame <- function(data_frame) {

    # Remove the World row, if present
    data_frame <- data_frame[data_frame$Country.or.region != "World", ]
    # Remove the last row, which holds the table footnotes rather than country data
    data_frame <- data_frame[-nrow(data_frame), ]

    # The Units and Ref columns are not needed, so remove them
    data_frame["Ref."] <- NULL
    data_frame["Units.b."] <- NULL

    # Rename the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")

    # Convert column data types (strip thousands separators before converting to numeric)
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",", "", data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",", "", data_frame$confirmed))
    data_frame$confirmed.tested.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.tested.ratio))
    data_frame$tested.population.ratio <- as.numeric(gsub(",", "", data_frame$tested.population.ratio))
    data_frame$confirmed.population.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.population.ratio))

    return(data_frame)
}


Call the `preprocess_covid_data_frame` function


In [13]:
# call `preprocess_covid_data_frame` function and assign it to a new data frame
df<-preprocess_covid_data_frame(covid19_df)
head(df)

Unnamed: 0_level_0,country,date,tested,confirmed,confirmed.tested.ratio,tested.population.ratio,confirmed.population.ratio
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,Afghanistan,17 Dec 2020,154767,49621,32.1,0.4,0.13
2,Albania,18 Feb 2021,428654,96838,22.6,15.0,3.4
3,Algeria,2 Nov 2020,230553,58574,25.4,0.53,0.13
4,Andorra,23 Feb 2022,300307,37958,12.6,387.0,49.0
5,Angola,2 Feb 2021,399228,20981,5.3,1.3,0.067
6,Antigua and Barbuda,6 Mar 2021,15268,832,5.4,15.9,0.86


Get the summary of the processed data frame again



In [14]:
# Print the summary of the processed data frame again
summary(df)
# Print the processed data frame itself for review
print(df)

                   country        date    tested confirmed
1              Afghanistan 17 Dec 2020    154767     49621
2                  Albania 18 Feb 2021    428654     96838
3                  Algeria  2 Nov 2020    230553     58574
4                  Andorra 23 Feb 2022    300307     37958
5                   Angola  2 Feb 2021    399228     20981
6      Antigua and Barbuda  6 Mar 2021     15268       832
7                Argentina 16 Apr 2022  35716069   9060495
8                  Armenia 29 May 2022   3099602    422963
9                Australia  9 Sep 2022  78548492  10112229
10                 Austria  1 Feb 2023 205817752   5789991
11              Azerbaijan 11 May 2022   6838458    792638
12                 Bahamas 28 Nov 2022    259366     37483
13                 Bahrain  3 Dec 2022  10578766    696614
14              Bangladesh 24 Jul 2021   7417714   1151644
15                Barbados 14 Oct 2022    770100    103014
16                 Belarus  9 May 2022  13217569    9828

After pre-processing, you can see that the columns and column names are simplified, and the column types are converted into the correct types.


The data frame has the following columns:

- **country** - The name of the country
- **date** - Reported date
- **tested** - Total tested cases by the reported date
- **confirmed** - Total confirmed cases by the reported date
- **confirmed.tested.ratio** - The ratio of confirmed cases to the tested cases
- **tested.population.ratio** - The ratio of tested cases to the population of the country
- **confirmed.population.ratio** - The ratio of confirmed cases to the population of the country


OK, we can call the `write.csv()` function to save the data frame to a csv file.
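
A quick way to confirm that an export worked is to read the file back and compare shapes. The sketch below does this on a made-up data frame and writes to a temporary file, so nothing in the working directory is touched.

```r
# Small stand-in data frame (values made up for illustration)
toy_df <- data.frame(country = c("A", "B"), tested = c(100, 250))

# Write to a temporary file; row.names = FALSE avoids an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_df, file = csv_path, row.names = FALSE)

# Read it back and confirm the shape survived the round trip
round_trip <- read.csv(csv_path)
file.exists(csv_path)
dim(round_trip)
```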


In [26]:
# Export the data frame to a csv file
getwd()
write.csv(df, file = "/resources/labs/authoride/IBMSkillsNetwork+RP0101EN/v2/M5_Final/covid.csv", row.names = FALSE)

Note that for IBM Watson Studio, there is no traditional "hard disk" associated with an R workspace.

Even if you call the `write.csv()` function to save the data frame as a csv file, it won't be shown in the IBM Cloud Object Storage asset UI automatically.

However, you may still check whether `covid.csv` exists using the following code snippet:


In [16]:
# Get working directory
wd <- getwd()
# Build the path of the exported file
file_path <- file.path(wd, "covid.csv")
# Print the file path and check that the file exists
print(file_path)
file.exists(file_path)

[1] "/resources/labs/authoride/IBMSkillsNetwork+RP0101EN/v2/M5_Final/covid.csv"


**Optional Step**: If you have difficulties finishing the web scraping tasks above, you may still continue with the next tasks by downloading a provided csv file from here:


In [22]:
## Download a sample csv file (download.file() returns only a status code, so its result is not stored)
download.file("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/dataset/covid.csv", destfile = "covid.csv")
covid_data_frame_csv <- read.csv("covid.csv", header = TRUE, sep = ",")

## TASK 4: Get a subset of the extracted data frame

The goal of task 4 is to get the 5th to 10th rows from the data frame with only `country` and `confirmed` columns selected
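
Row and column selection can be combined in a single pair of brackets, `df[rows, columns]`. A minimal sketch on a made-up data frame:

```r
# Made-up data frame for illustration
toy_df <- data.frame(
  country = c("A", "B", "C", "D", "E", "F"),
  tested = c(10, 20, 30, 40, 50, 60),
  confirmed = c(1, 2, 3, 4, 5, 6)
)

# Rows 2 to 4, keeping only the country and confirmed columns
subset_df <- toy_df[2:4, c("country", "confirmed")]
print(subset_df)
```

The original row numbers are kept as row names in the result, which makes it easy to see which rows were selected.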


In [23]:
# Read covid_data_frame_csv from the csv file and preview the first 10 rows
covid_data_frame_csv <- read.csv("/resources/labs/authoride/IBMSkillsNetwork+RP0101EN/v2/M5_Final/covid.csv")
head(covid_data_frame_csv, 10)
# Get the 5th to 10th rows, with two "country" "confirmed" columns
covid_data_frame_csv[5:10, c("country", "confirmed")]

country,date,tested,confirmed,confirmed.tested.ratio,tested.population.ratio,confirmed.population.ratio
<fct>,<fct>,<dbl>,<int>,<dbl>,<dbl>,<dbl>
Afghanistan,17 Dec 2020,154767,49621,32.10,0.40,0.1300
Albania,18 Feb 2021,428654,96838,22.60,15.00,3.4000
Algeria,2 Nov 2020,230553,58574,25.40,0.53,0.1300
Andorra,8 Mar 2021,159725,11066,6.90,206.00,14.3000
Angola,12 Mar 2021,399228,20981,5.30,1.30,0.0670
Antigua and Barbuda,6 Mar 2021,15268,832,5.40,15.90,0.8600
Argentina,14 Mar 2021,7998673,2195722,27.50,17.60,4.8000
Armenia,13 Mar 2021,765879,177104,23.10,25.90,6.0000
Australia,15 Mar 2021,14933604,29130,0.20,59.50,0.1200
Austria,13 Mar 2021,17906847,488007,2.70,201.00,5.5000


Unnamed: 0_level_0,country,confirmed
Unnamed: 0_level_1,<fct>,<int>
5,Angola,20981
6,Antigua and Barbuda,832
7,Argentina,2195722
8,Armenia,177104
9,Australia,29130
10,Austria,488007


## TASK 5: Calculate worldwide COVID testing positive ratio

The goal of task 5 is to get the total confirmed and tested cases worldwide, and to figure out the overall positive ratio using `confirmed cases / tested cases`.
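
If either column contains missing values, `sum()` returns `NA`. One possible way to handle that, sketched on made-up numbers, is to keep only the rows where both counts are known so that the numerator and denominator cover the same countries.

```r
# Made-up counts for illustration; the third row has no tested value
toy_df <- data.frame(
  tested = c(1000, 4000, NA),
  confirmed = c(100, 150, 20)
)

# Keep only rows where both counts are known
complete_rows <- toy_df[complete.cases(toy_df[, c("tested", "confirmed")]), ]

total_confirmed <- sum(complete_rows$confirmed)
total_tested <- sum(complete_rows$tested)
positive_ratio <- total_confirmed / total_tested
positive_ratio
```

Here the ratio is 250 / 5000 = 0.05, that is, 5% of tests were positive in the toy data.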


In [24]:
# Get the total confirmed cases worldwide
total_confirmed <- sum(covid_data_frame_csv$confirmed)
total_confirmed
# Get the total tested cases worldwide
total_tested <- sum(covid_data_frame_csv$tested)
total_tested
# Get the positive ratio (confirmed / tested)
positive_ratio <- total_confirmed / total_tested
positive_ratio

## TASK 6: Get a country list which reported their testing data 

The goal of task 6 is to get a catalog or sorted list of countries that have reported their COVID-19 testing data.
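
Converting a factor to a character vector makes the sort operate on the names themselves rather than on factor levels. A minimal sketch with three country names:

```r
# A small factor, as read.csv() may produce for text columns in older R versions
countries <- factor(c("Zambia", "Albania", "Vietnam"))

# Convert to character, then sort in both directions
country_names <- as.character(countries)
atoz <- sort(country_names)
ztoa <- sort(country_names, decreasing = TRUE)

print(atoz)
print(ztoa)
```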


In [28]:
# Get the `country` column
covid_data_frame_csv['country']
# Check its class (should be Factor)
class(covid_data_frame_csv$country)
# Convert the country column into character so that you can easily sort them
covid_data_frame_csv$country <- as.character(covid_data_frame_csv$country)
# Sort the countries AtoZ
atoz_country <- sort(covid_data_frame_csv$country)
# Sort the countries ZtoA
ztoa_country <- sort(covid_data_frame_csv$country, decreasing = TRUE)
# Print the sorted ZtoA list
print(ztoa_country)

country
<chr>
Afghanistan
Albania
Algeria
Andorra
Angola
Antigua and Barbuda
Argentina
Armenia
Australia
Austria


## TASK 7: Identify countries names with a specific pattern

The goal of task 7 is to use a regular expression to find any countries whose names start with `United`.
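
`grep()` returns positions by default and the matching strings when `value = TRUE` is passed to it. The sketch below uses a made-up vector of names; the `^` anchor is an optional addition that restricts matches to names that begin with `United`.

```r
# Made-up vector of names for illustration
country_names <- c("Tanzania, United Republic of", "United Kingdom", "United States", "Uruguay")

# Positions of the matches (the default behaviour)
match_positions <- grep("^United.+", country_names)

# The matching names themselves
match_names <- grep("^United.+", country_names, value = TRUE)

# Without the ^ anchor, "United" may appear anywhere in the name
unanchored_names <- grep("United.+", country_names, value = TRUE)

print(match_positions)
print(match_names)
print(unanchored_names)
```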


In [44]:
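# Use a regular expression `United.+` to find matches;
# value = TRUE must be passed to grep() so that it returns the names rather than their positions
united_countries <- grep("United.+", covid_data_frame_csv$country, value = TRUE)
# Print the matched country names
united_countries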

## TASK 8: Pick two countries you are interested in, and then review their testing data

The goal of task 8 is to compare the COVID-19 test data between two countries. You will need to select two rows from the data frame, and select the `country`, `confirmed`, and `confirmed.population.ratio` columns.


In [81]:
# Select a subset (should be only one row) of data frame based on a selected country name and columns
argentina <- covid_data_frame_csv[covid_data_frame_csv$country == "Argentina", c("country", "confirmed", "confirmed.population.ratio")]
argentina
# Select a subset (should be only one row) of data frame based on a selected country name and columns
dr <- covid_data_frame_csv[covid_data_frame_csv$country == "Dominican Republic", c("country", "confirmed", "confirmed.population.ratio")]
dr

Unnamed: 0_level_0,country,confirmed,confirmed.population.ratio
Unnamed: 0_level_1,<chr>,<int>,<dbl>
7,Argentina,2195722,4.8


Unnamed: 0_level_0,country,confirmed,confirmed.population.ratio
Unnamed: 0_level_1,<chr>,<int>,<dbl>
44,Dominican Republic,243778,2.2


## TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal of task 9 is to find out which of the two countries you selected has the larger ratio of confirmed cases to population, which may indicate that the country has a higher COVID-19 infection risk.


In [82]:
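# Select the two countries to compare by name
argentina <- covid_data_frame_csv[covid_data_frame_csv$country == "Argentina", ]
dr <- covid_data_frame_csv[covid_data_frame_csv$country == "Dominican Republic", ]
# Use if-else statement
if (argentina$confirmed.population.ratio > dr$confirmed.population.ratio) {
    print("Argentina has a larger confirmed to population ratio than the Dominican Republic")
} else {
    print("The Dominican Republic has a larger confirmed to population ratio than Argentina")
}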


## TASK 10: Find countries with confirmed to population ratio less than a threshold

The goal of task 10 is to find out which countries have a confirmed to population ratio of less than 1%, which may indicate that the risk in those countries is relatively low.


In [None]:
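# Get a subset of any countries with `confirmed.population.ratio` less than the threshold (the column is in percent, so 1% is 1)
subset(covid_data_frame_csv, confirmed.population.ratio < 1, select = c(country, confirmed.population.ratio))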
