## Exercise 4

In [None]:
'''
    Get the COVID-19 dataset from the data sources. The number of observations should be more than
    100. Then report the following information:
    a. Data source details (e.g. link)
    b. Explain the unit and necessity of each variable
    c. Find the missing values (rows and columns) and replace them with the mean (tidy dataset)
    d. Generate two new variables (Var1: mean, Var2: median of an available variable)
    e. Rename two existing variables
    f. Create a plot using the following instructions (using the 7 layers of the Grammar of Graphics)
    i. Choose the x and y axes (aes)
    ii. geom_point() - specify the parameters, size: 5, color: red, alpha: 1/5
    iii. Use facet grid, cartesian coordinates and geom_smooth()
    iv. Assign titles to the x axis, the y axis and the graph
    v. Export the graph to your working directory with the file name “covid_19_dataset.png”
'''

## Solution

In [None]:
# Data Source
# https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data
# This dataset contains information about COVID-19 cases reported around the world
# Dataset size: [59557, 12]
# Each row of this dataset is the daily COVID-19 report for one specific country

# Columns' unit, description
    # Column 1: dateRep -> report date in the format DD/MM/YYYY (string)
    # Column 2: day -> day of the month (double)
    # Column 3: month -> month of the year (double)
    # Column 4: year -> year (double)
    # Column 5: cases -> number of cases reported on that day (double)
    # Column 6: deaths -> number of deaths reported on that day (double)
    # Column 7: countriesAndTerritories -> name of the country (string)
    # Column 8: geoId -> ID of the country (string)
    # Column 9: countryterritoryCode -> territory code of the country (string)
    # Column 10: popData2019 -> total population of the country (value from 2019) (double)
    # Column 11: continentExp -> continent of the country (string)
    # Column 12: Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 -> cumulative number of cases per 100000 people over 14 days (double)

# import dataset
library(tidyverse)
dataset = read_csv('../input/covid19-cases/covid_19_cases.csv')
print(paste('Dimension of dataset: '))
print(dim(dataset))

In [None]:
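# columns that contain missing values
list_na_columns = colnames(dataset)[ apply(dataset, 2, anyNA) ]
# collapse the names so they are printed as one message instead of one line per column
print(paste('Columns containing missing values:', paste(list_na_columns, collapse = ', ')))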
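
The exercise asks for missing values by row as well as by column. A minimal sketch of both counts, using base R only. The `toy` data frame and its values are made up for illustration and are not taken from the dataset above:

```r
# Toy data standing in for the real dataset (values are made up for illustration)
toy <- data.frame(
    country = c("A", "B", NA, "D"),
    cases   = c(10, NA, 3, 7),
    deaths  = c(1, 0, 0, 2)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Indices of the rows that contain at least one missing value
na_rows <- which(!complete.cases(toy))

print(na_per_column)
print(na_rows)
```

The same two calls should work on `dataset` directly, since a tibble is also a data frame.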

In [None]:
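# replace missing values in the numeric columns with the column mean (Tidy Dataset)
avg_popData2019 <- mean(dataset$popData2019, na.rm = TRUE)

avg_cnf14doc19cp1000 <- mean(dataset$`Cumulative_number_for_14_days_of_COVID-19_cases_per_100000`, na.rm = TRUE)

# the means above were only computed; this step actually writes them into the NA cells
dataset <- dataset %>%
    replace_na(list(
        popData2019 = avg_popData2019,
        `Cumulative_number_for_14_days_of_COVID-19_cases_per_100000` = avg_cnf14doc19cp1000
    ))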
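
When many numeric columns contain missing values, computing each mean by hand gets repetitive. One possible alternative is to impute every numeric column in a single step with `across()`. This is a sketch on made-up data, assuming `dplyr` (1.0 or later) and `tidyr` are installed; the `toy` tibble is hypothetical:

```r
library(dplyr)
library(tidyr)

# Toy data: one character column and two numeric columns with gaps (made-up values)
toy <- tibble(
    country = c("A", "B", NA),
    cases   = c(10, NA, 20),
    pop     = c(NA, 100, 300)
)

# Replace NA in every numeric column with that column's mean;
# non-numeric columns such as `country` are left untouched
toy_imputed <- toy %>%
    mutate(across(where(is.numeric), ~ replace_na(.x, mean(.x, na.rm = TRUE))))

print(toy_imputed)
```

The character column still holds its `NA`, which is why a separate decision (for example dropping those rows) is needed for non-numeric columns.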

In [None]:
print(paste('Mean of popData2019: ', avg_popData2019))
print(paste('Mean of Cumulative_number_for_14_days_of_COVID-19_cases_per_100000: ', avg_cnf14doc19cp1000))
print(paste('Dimension of dataset after replacing NA values with the mean: '))
print(dim(dataset))

In [None]:
# Some columns that contain missing values are not of numeric or logical type, so no mean can be computed for them.
# In this case I simply drop the rows that still contain missing values.
dataset_drop_na <- dataset %>% na.omit()
print(paste('Dimension of dataset after dropping NA values:'))
print(dim(dataset_drop_na))

In [None]:
# create 2 new variables
# mean_cases_14_days_100000: mean of the column Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
# median_cases: median of the column cases
dataset_new_var <- mutate(
    dataset_drop_na,
    mean_cases_14_days_100000 = avg_cnf14doc19cp1000,
    median_cases = median(dataset$cases, na.rm = TRUE)
)
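
The two new variables above hold one constant value for the whole table. If a per-country summary is more useful, a grouped `mutate()` is one way to get it. This is a sketch on made-up data, assuming `dplyr` is installed; the `toy` tibble and its column values are hypothetical:

```r
library(dplyr)

# Toy data: daily case counts for two hypothetical countries (made-up values)
toy <- tibble(
    country = c("A", "A", "A", "B", "B"),
    cases   = c(1, 2, 9, 4, 6)
)

# Compute the mean and median of `cases` within each country
# and attach them to every row of that country
toy_stats <- toy %>%
    group_by(country) %>%
    mutate(
        mean_cases   = mean(cases),
        median_cases = median(cases)
    ) %>%
    ungroup()

print(toy_stats)
```

On the real data the grouping column would be `countriesAndTerritories`.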

In [None]:
# rename dataset columns
dataset_renamed <- rename(dataset_new_var,
    geo_id = geoId,
    country_territory_code = countryterritoryCode
)

In [None]:
# Plot the graph
# filter the dataset for Cambodia
cambodia_dataset = dataset[dataset$countriesAndTerritories == "Cambodia",]

In [None]:
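# build a proper Date from the year/month/day columns so the x axis is ordered in time
cambodia_dataset$date <- as.Date(with(cambodia_dataset, paste(year, month, day, sep = "-")))
covid_plot <- ggplot(cambodia_dataset, aes(x = date, y = cases)) +
    geom_point(size=5, color='red', alpha=1/5) +
    geom_smooth() + ggtitle("Daily number of cases in Cambodia") + xlab("Date") + ylab("Cases")
covid_plot
# ggsave() is a standalone call, not a layer, so it is not chained with `+`
ggsave(filename = "covid_19_dataset.png", plot = covid_plot, units = "cm", width = 25, height = 25)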
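
The exercise also asks for a facet grid and cartesian coordinates. A minimal sketch with all seven layers spelled out, assuming `ggplot2` is installed. The `toy` data frame, the two country names and the choice to facet by country are illustrative assumptions, and the file is written to a temporary directory here rather than the working directory:

```r
library(ggplot2)

# Made-up daily counts for two hypothetical countries
toy <- data.frame(
    date    = rep(seq(as.Date("2020-03-01"), by = "day", length.out = 30), times = 2),
    country = rep(c("Country A", "Country B"), each = 30),
    cases   = c(1:30, seq(60, 2, by = -2))
)

covid_plot <- ggplot(toy, aes(x = date, y = cases)) +        # data and aesthetics
    geom_point(size = 5, color = "red", alpha = 1/5) +       # geometry
    geom_smooth(method = "loess", formula = y ~ x) +         # statistics
    facet_grid(. ~ country) +                                # facets
    coord_cartesian() +                                      # coordinates
    labs(title = "Daily number of cases", x = "Date", y = "Cases") +
    theme_minimal()                                          # theme

# Save the plot; ggsave() takes the plot object as an argument
out_file <- file.path(tempdir(), "covid_19_dataset.png")
ggsave(filename = out_file, plot = covid_plot, units = "cm", width = 25, height = 25)
```

Passing `filename = "covid_19_dataset.png"` instead would write the file to the working directory, as the exercise requests.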
