<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/SN_web_lightmode.png" width="300">
</center>


<h1>Analysis of Global COVID-19 Pandemic Data</h1>

Estimated time needed: **90** minutes



## Overview:

There are 10 tasks in this final project. All tasks will be graded by your peers who are also completing this assignment within the same session.

You need to submit a screenshot of the code and output for each task for review.

If you need to refresh your memory about specific coding details, you may refer to previous hands-on labs for code examples.


In [1]:
install.packages("rvest")


In [None]:
install.packages("httr")

In [None]:
install.packages("stringr")

In [None]:
library(httr)
library(rvest)

In [None]:
library("stringr")

Note: if you cannot import the above libraries, please use install.packages() to install them first.


## TASK 1: Get a `COVID-19 pandemic` Wiki page using HTTP request


First, let's write a function that uses an HTTP request to get a public COVID-19 Wiki page.

Before you write the function, you can open this public page in a web browser: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country

The goal of task 1 is to get the HTML page using an HTTP request (`httr` library).



In [None]:

get_wiki_covid19_page <- function() {

  # Our target COVID-19 wiki page URL is: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country
  # It has two parts:
  #   1) the base URL `https://en.wikipedia.org/w/index.php`
  #   2) the URL parameter `title=Template:COVID-19_testing_by_country`, separated from the base by a question mark `?`

  # Wiki page base
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"

  # Create a list with an element called `title` to specify which page to get from Wiki;
  # in our case, it is `Template:COVID-19_testing_by_country`
  url_query_param <- list(title = "Template:COVID-19_testing_by_country")

  # Use the `GET` function in the httr library with a `url` argument and a `query` argument to get an HTTP response.
  # The `<<-` operator assigns `response` in the global environment so that later cells can reuse it outside the function.
  response <<- GET(wiki_base_url, query = url_query_param)

  # Use the `return` function to return the response
  return(response)
}

Call the `get_wiki_covid19_page` function to get an HTTP response with the target HTML page.

In [None]:
# Call the get_wiki_covid19_page function and print the response
get_wiki_covid19_page()
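
To see how the base URL and the `title` parameter combine into one request URL without sending a request, here is a minimal base-R sketch. The helper `build_query_url` is a hypothetical name introduced only for illustration; `httr::GET` does its own encoding internally, which may differ in detail.

```r
# Hypothetical helper (not part of httr): joins a base URL and a named list of
# query parameters into one URL string, percent-encoding each value.
build_query_url <- function(base_url, query) {
  encoded <- vapply(
    query,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )
  query_string <- paste(names(query), encoded, sep = "=", collapse = "&")
  paste0(base_url, "?", query_string)
}

example_url <- build_query_url(
  "https://en.wikipedia.org/w/index.php",
  list(title = "Template:COVID-19_testing_by_country")
)
print(example_url)
```

Decoding the result with `utils::URLdecode(example_url)` gives back the readable form of the URL shown in Task 1.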

## TASK 2: Extract COVID-19 testing data table from the wiki HTML page


On the COVID-19 testing wiki page, you should see a `<table>` node that contains COVID-19 testing data by country:

<a href="https://cognitiveclass.ai/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkRP0101ENCoursera889-2022-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/covid-19-by-country.png" width="400" align="center">
</a>

Note that the numbers you see on your page may differ from those above, because the pandemic was still ongoing when this notebook was created.

The goal of task 2 is to extract the above data table and convert it into a data frame.


Now use the `read_html` function in the rvest library to get the root HTML node from the response.


In [None]:
# testing the content of the response variable
response

In [None]:
content(response)

In [None]:
response_headers <- headers(response)
response_headers

In [None]:
# Get the root html node from the http response in task 1
root_node <- read_html(response)

Get the correct table from the HTML root node using the `html_node` function.
Use your browser's Inspect tool on the live website to identify the attributes of the desired table.

In this case, we're looking for the table with the following description: `<table class="wikitable plainrowheaders sortable collapsible autocollapse ...`

In [None]:
# Get the table nodes from the root html node
table_node <- html_nodes(root_node, "table")
table_node 

Scraping HTML tables with rvest: 
https://uc-r.github.io/scraping_HTML_tables

In [None]:
# getting the table node
second_tbl_node <- html_nodes(root_node, "table") %>% .[2]
second_tbl_node

Read the table node as a data frame using the `html_table` function.



In [None]:
raw_covid19_df <- html_table(table_node)
raw_covid19_df

In [None]:
# Read the table node and convert it into a data frame, and print the data frame for review
# Parsing the table nodes into a dataframe
covid_table <- html_table(second_tbl_node)
covid_table

## TASK 3: Pre-process and export the extracted data frame

The goal of task 3 is to pre-process the data frame extracted in the previous step and export it as a csv file.


Let's get a summary of the data frame.


In [None]:
# Print the summary of the data frame
summary(covid_table)

As you can see from the summary, the column names are a little difficult to understand and some column data types are not correct. For example, the `Tested` column is read as `character`.

As such, the data frame read from the HTML table needs some pre-processing, such as removing irrelevant columns, renaming columns, and converting columns into proper data types.


We have prepared a pre-processing function for you to convert the data frame, but you can also try to write one yourself.


In [None]:
preprocess_covid_data_frame <- function(data_frame) {
    
    shape <- dim(data_frame)

    # Remove the World row
    data_frame<-data_frame[!(data_frame$`Country or region`=="World"),]
    # Remove the last row
    data_frame <- data_frame[1:172, ]
    
    # We don't need the Units and Ref. columns, so they can be removed
    data_frame["Ref."] <- NULL
    data_frame["Units[b]"] <- NULL
    
    # Renaming the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
    
    # Convert column data types
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",","",data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",","",data_frame$confirmed))
    data_frame$'confirmed.tested.ratio' <- as.numeric(gsub(",","",data_frame$`confirmed.tested.ratio`))
    data_frame$'tested.population.ratio' <- as.numeric(gsub(",","",data_frame$`tested.population.ratio`))
    data_frame$'confirmed.population.ratio' <- as.numeric(gsub(",","",data_frame$`confirmed.population.ratio`))
    
    return(data_frame)
}
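
The type conversion used inside the function can be tried on its own. The following is a minimal sketch on a made-up two-row data frame (the values are illustrative, not real testing data): the thousands separators are removed with `gsub` and the remaining text is converted with `as.numeric`.

```r
# Toy data frame with numbers stored as text, as they arrive from an HTML table
toy_counts_df <- data.frame(
  country = c("A", "B"),
  tested = c("1,234,567", "89,012"),
  stringsAsFactors = FALSE
)

# Remove the thousands separators, then convert the text to numbers
toy_counts_df$tested <- as.numeric(gsub(",", "", toy_counts_df$tested))

# The tested column is now numeric
str(toy_counts_df)
```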

Call the `preprocess_covid_data_frame` function


In [None]:
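# Call the `preprocess_covid_data_frame` function and assign the result to a new data frame.
# `covid_table` is a list holding one data frame, so extract that data frame with [[1]] first.
covid19_df <- preprocess_covid_data_frame(covid_table[[1]])

`html_table()` applied to a node set (such as `second_tbl_node`) returns a list of data frames, not a single data frame. That is why `summary(covid_table)` reports a single entry, shown as `[1,]`, rather than the table's 173 rows, and why passing `covid_table` directly to `preprocess_covid_data_frame` fails. Extracting the first list element with `[[1]]` gives the function the data frame it expects.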

In [None]:
covid19_df

Get the summary of the processed data frame again.


In [None]:
# Print the summary of the processed data frame again
summary(covid19_df)

After pre-processing, you can see that the columns and column names are simplified, and the column types are converted into the correct types.


The data frame has the following columns:

- **country** - The name of the country
- **date** - Reported date
- **tested** - Total tested cases by the reported date
- **confirmed** - Total confirmed cases by the reported date
- **confirmed.tested.ratio** - The ratio of confirmed cases to the tested cases
- **tested.population.ratio** - The ratio of tested cases to the population of the country
- **confirmed.population.ratio** - The ratio of confirmed cases to the population of the country


Now we can call the `write.csv()` function to save the data frame to a csv file.


In [None]:
# Export the data frame to a csv file
# file dir shortened afterwards
write.csv(covid19_df, file= "*/jupyternotebookfiles/coursera-R/final project/covid.csv", row.names=FALSE)

Note that in IBM Watson Studio, there is no traditional "hard disk" associated with an R workspace.

Even if you call the `write.csv()` function to save the data frame as a csv file, it won't be shown in the IBM Cloud Object Storage asset UI automatically.

However, you can still check whether `covid.csv` exists using the following code snippet:


In [None]:
# Get working directory
#  wd <- getwd() # if we were saving to the working directory
proj_folder <- '*/jupyternotebookfiles/coursera-R/final project' # shortened file dir


# Get the path of the exported file
#  file_path <- paste(wd, sep="", "/covid.csv") # if we saved to the working directory
file_path <- paste(proj_folder, sep="", "/covid.csv")

# Print the file path and check that the file exists
print(file_path)
file.exists(file_path)
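
The write-then-check pattern can also be tried in isolation. This is a minimal sketch that writes a made-up data frame to R's temporary directory (an assumption for this example; the notebook itself writes to its project folder), confirms the file exists, and reads it back.

```r
# Made-up data for illustration only
demo_df <- data.frame(country = c("A", "B"), confirmed = c(10, 20))

# Build the path with file.path(), which inserts the separator for us
demo_path <- file.path(tempdir(), "covid_demo.csv")
write.csv(demo_df, file = demo_path, row.names = FALSE)

# Confirm the file exists, then read it back to check its contents
print(file.exists(demo_path))
demo_read <- read.csv(demo_path, header = TRUE)
print(demo_read)
```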

**Optional Step**: If you have difficulties finishing the above web scraping tasks, you can still continue with the next tasks by downloading a provided csv file from here:


In [None]:
# # Download a sample csv file
# covid_csv_file <- download.file("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/dataset/covid.csv", destfile="covid.csv")

## TASK 4: Get a subset of the extracted data frame

The goal of task 4 is to get the 5th to 10th rows from the data frame with only the `country` and `confirmed` columns selected.


We will use the `covid.csv` file rather than the `covid_new.csv` file.

In [None]:
# Read covid_data_frame_csv from the csv file
covid_data_frame_csv <- read.csv("covid.csv", header=TRUE, sep=",")
covid_data_frame_csv

In [None]:
# Get the 5th to 10th rows, with two "country" "confirmed" columns
covid_data_frame_csv[5:10, c("country", "confirmed")]

## TASK 5: Calculate worldwide COVID testing positive ratio

The goal of task 5 is to get the total confirmed and tested cases worldwide, and to figure out the overall positive ratio using `confirmed cases / tested cases`.


In [None]:
# Get the total confirmed cases worldwide
# (as.numeric() avoids integer overflow if the column was read as integer)
total_confirmed_cases_worldwide <- sum(as.numeric(covid_data_frame_csv$confirmed))

# Get the total tested cases worldwide
total_tested_cases_worldwide <- sum(as.numeric(covid_data_frame_csv$tested))

# Get the positive ratio (confirmed / tested)
# Case counts are whole numbers, but their ratio is a decimal value
positive_ratio <- total_confirmed_cases_worldwide / total_tested_cases_worldwide
positive_ratio
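
One thing to watch for in this calculation is missing data: if any row has an `NA` in `confirmed` or `tested`, `sum()` returns `NA`. The sketch below uses made-up vectors to show the effect and one possible way around it with `na.rm = TRUE`. Whether skipping missing values is appropriate depends on the data, since the numerator and denominator may then cover different rows.

```r
# Made-up values for illustration; NA marks a missing report
toy_confirmed <- c(100, NA, 50)
toy_tested <- c(1000, 2000, NA)

# Without na.rm, a single NA makes the whole sum NA
print(sum(toy_confirmed))

# With na.rm = TRUE, missing values are skipped
toy_ratio <- sum(toy_confirmed, na.rm = TRUE) / sum(toy_tested, na.rm = TRUE)
print(toy_ratio)
```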

## TASK 6: Get a country list which reported their testing data 

The goal of task 6 is to get a catalog or sorted list of countries that have reported their COVID-19 testing data.


In [None]:
# Get the `country` column
country_tosort <- covid_data_frame_csv$country 
country_tosort

In [None]:
# Check its class (Factor in R versions before 4.0; character in R 4.0 and later, where `stringsAsFactors` defaults to FALSE)
str(country_tosort)
class(country_tosort)

In [None]:
# Convert the country column into character so that you can easily sort them
as.character(country_tosort) -> country_tosort
country_tosort

Some country names contain unnecessary bits. For example, 'Georgia[h]', 'France[f][g]', 'Moldova[j]' and 'Northern Cyprus[k]' all have extra bits in square brackets, and that appears to be the trend throughout the column.

Match everything until Parenthesis: https://stackoverflow.com/questions/13867860/match-everything-until-parenthesis

In [None]:
# using regex to remove the extra bits on the country
as.list(sub(" *\\[.*", "", country_tosort)) -> country_aslist
country_aslist
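
To check what the pattern ` *\\[.*` does, here is a minimal sketch on a few of the names mentioned above plus one name without a footnote marker. The pattern matches any spaces before the first `[` and everything after it, so names without brackets pass through unchanged.

```r
# Sample names, some with footnote markers in square brackets
raw_names <- c("Georgia[h]", "France[f][g]", "Moldova[j]", "Northern Cyprus[k]", "Nigeria")

# Remove the first "[" and everything that follows it
clean_names <- sub(" *\\[.*", "", raw_names)
print(clean_names)
```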

In [None]:
class(country_aslist)

In [None]:
country_asvector <- unlist(country_aslist) 
country_asvector

In [None]:
# Sort the countries atoz
country_asvector_atoz <- sort(country_asvector, decreasing = FALSE)
print(country_asvector_atoz)

In [None]:
# Sort the countries ztoa
country_asvector_ztoa <- sort(country_asvector, decreasing = TRUE)
print(country_asvector_ztoa)

In [None]:
# Convert the country vector sorted in descending order back to a list
country_list_ztoa <- list(country_asvector_ztoa)

# Print the sorted ZtoA list
print(country_list_ztoa)

## TASK 7: Identify countries names with a specific pattern

The goal of task 7 is to use a regular expression to find any countries whose names start with `United`.


Regular expression to match string starting with a specific word: 
https://stackoverflow.com/questions/1240504/regular-expression-to-match-string-starting-with-a-specific-word

In [None]:
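# Use a regular expression `United.+` to find matches
pattern <- "United.+"

# Print the matched country names
# (str_extract returns NA for names that do not match, so drop those)
matched_countries <- str_extract(country_asvector, pattern)
print(matched_countries[!is.na(matched_countries)])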
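
As an alternative sketch in base R, `grepl` with the anchored pattern `^United` keeps only names that begin with `United`. This swaps in a different technique from the `str_extract` call above, and the names in `toy_countries` are a made-up sample for illustration.

```r
# Made-up sample of country names
toy_countries <- c("Nigeria", "United Kingdom", "United States", "Tanzania", "United Arab Emirates")

# "^" anchors the match at the start of the string
united_countries <- toy_countries[grepl("^United", toy_countries)]
print(united_countries)
```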

## TASK 8: Pick two countries you are interested in, and then review their testing data

The goal of task 8 is to compare the COVID-19 testing data of two countries. You will need to select two rows from the data frame, keeping the `country`, `confirmed`, and `confirmed.population.ratio` columns.


In [None]:
head(covid_data_frame_csv)

In [None]:
# Select a one-row subset of the data frame (row 118, Nigeria in this data set) with the columns of interest
nga_covid19 <- covid_data_frame_csv[118, c("country", "confirmed", "confirmed.population.ratio")]
nga_covid19

# Select a one-row subset of the data frame (row 165, the United Kingdom in this data set) with the same columns
uk_covid19 <- covid_data_frame_csv[165, c("country", "confirmed", "confirmed.population.ratio")]
uk_covid19

## TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal of task 9 is to find out which of the two countries selected above has the larger ratio of confirmed cases to population, which may indicate that the country has a higher COVID-19 infection risk.


In [None]:
# Both nga_covid19$confirmed.population.ratio and as.numeric(nga_covid19[,3]) work here,
# as both return numeric values when inspected with the built-in class() function;
# the same applies to uk_covid19

# The `dataframe$confirmed.population.ratio` form is used below
# because it is more descriptive and requires less function nesting

# Use if-else statement
if (nga_covid19$confirmed.population.ratio > uk_covid19$confirmed.population.ratio) {
   print('Nigeria has a larger ratio of confirmed Covid-19 cases to population than the United Kingdom')
} else if (nga_covid19$confirmed.population.ratio == uk_covid19$confirmed.population.ratio) {
   print('Nigeria and the United Kingdom have the same ratio of confirmed Covid-19 cases to population')
} else {
   print('The United Kingdom has a larger ratio of confirmed Covid-19 cases to population than Nigeria')
}

## TASK 10: Find countries with a confirmed to population ratio less than a threshold

The goal of task 10 is to find out which countries have a confirmed to population ratio of less than 1%, which may indicate that the risk in those countries is relatively low.


How to print the column when another column met the condition:
https://stackoverflow.com/questions/41089883/how-to-print-the-column-when-another-column-met-the-condition

In [None]:
# Define the 1% threshold (assuming the ratio column is expressed in percent)
one_pct_threshold <- 1.0000

# Keep only the `country` and `confirmed.population.ratio` columns
threshold_df <- covid_data_frame_csv[c("country", "confirmed.population.ratio")]
threshold_df

In [None]:
# Again, we need to remove the unwanted parts of the country name
threshold_df$country <- sub(" *\\[.*", "", threshold_df$country)
threshold_df

In [None]:
class(threshold_df$country)

In [None]:
threshold_df$country <- as.character(threshold_df$country)
class(threshold_df$country)

In [None]:
# Print the entries in the country column (threshold_df[,1])
# whose confirmed.population.ratio (threshold_df[,2])
# is less than the 1% threshold
threshold_df[,1][threshold_df[,2] < one_pct_threshold]
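
If the ratio column contains missing values, plain logical indexing returns an `NA` entry for each of them. The following minimal sketch, on a made-up data frame, shows that behaviour and one possible way to avoid it by wrapping the condition in `which()`.

```r
# Made-up data for illustration; country C has no reported ratio
toy_ratio_df <- data.frame(
  country = c("A", "B", "C"),
  confirmed.population.ratio = c(0.4, 2.5, NA),
  stringsAsFactors = FALSE
)
toy_threshold <- 1

# Plain logical indexing keeps an NA entry for the row with a missing ratio
with_na <- toy_ratio_df$country[toy_ratio_df$confirmed.population.ratio < toy_threshold]
print(with_na)

# which() returns only the positions where the condition is TRUE
below_threshold <- toy_ratio_df$country[which(toy_ratio_df$confirmed.population.ratio < toy_threshold)]
print(below_threshold)
```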