# Exercise: Statistical Analysis - 1

Statistical analysis tells you how close an estimate obtained from your data (e.g. a mean, a difference in proportions, or a confidence interval) is likely to be to the true value in the target population of interest. For example, if you conduct a clinical trial of a new locally manufactured COVID-19 vaccine in your country and want to show that it is better than imported vaccines, you need statistical analysis to support your claim. In a fantasy world where you are almighty, all-powerful, and ultra-rich, everyone in your country would participate in the trial. That would be the perfect scenario. Now pinch yourself and welcome back to reality.

Suppose you push through with your clinical trial and enroll 150 subjects, who are randomly assigned to each group and follow the protocol to the letter. The results show that the efficacy of your local vaccine is higher than that of the imported vaccine (95% vs 50%). Given these results and the sample size, can you claim that your local vaccine protects your fellow citizens better than the imported vaccine? To answer this question, you need statistical inference.

By the end of this exercise, you will know how to:
1. perform hypothesis testing
2. fit a linear regression model

Load the **tidyverse** package.

In [None]:
# Load tidyverse
____

## COVID-19 case-fatality rate and smoking

In this exercise, you will explore whether there is a relationship between smoking and the COVID-19 case-fatality rate. The COVID-19 data is available from the [Our World in Data](https://ourworldindata.org/) website, which contains a rich collection of global data and figures from the University of Oxford. Import this data from their [GitHub site](https://github.com/owid/covid-19-data/tree/master/public/data). Check the [codebook](https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv) for the description of the different variables in this dataset.

In [None]:
# Import the COVID-19 data
link <- "https://covid.ourworldindata.org/data/owid-covid-data.csv"
cfr  <- ____(link)

In [None]:
# Check data structure
____

In [None]:
# Check the top and bottom rows
____(cfr)
____(cfr)

In [None]:
# Is the data tidy? Why or why not?
# Answer: _____

### COVID-19 case-fatality rate

Create a column containing the outcome variable, COVID-19 case-fatality rate. But before that, you need to do some data cleaning.

In [None]:
# Subset latest data
cfr <- cfr %>%
    ____(date == max(date))
head(cfr)

Note that there are rows with OWID_* in the first column, which contain aggregated data. Remove these rows since they are not needed for the statistical analysis. The cell below does this by dropping the rows whose `continent` value is missing.
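To illustrate the idea of dropping rows with a missing value in one column, here is a minimal base R sketch on made-up data. The data frame `toy` and its values are hypothetical, and base R indexing is used here only as one possible approach, not necessarily the function the exercise expects.

```r
# Toy data (hypothetical values, not taken from the OWID file)
toy <- data.frame(
  location  = c("Country A", "World", "Country B"),
  continent = c("Asia", NA, "Europe")
)

# Keep only the rows where continent is not missing
toy_clean <- toy[!is.na(toy$continent), ]
toy_clean
```

The aggregate row ("World") is dropped because its `continent` is `NA`, while the two country rows are kept.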

In [None]:
# Remove rows containing NAs in continent column 
cfr <- cfr %>% 
    ____(continent)
head(cfr)

Recall from the previous exercise that the case-fatality rate (CFR) of COVID-19 is calculated as the total number of deaths attributed to COVID-19 divided by the total number of confirmed COVID-19 cases, then multiplied by 100.
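As a quick check of the arithmetic, the formula can be tried on made-up numbers first. The counts below are hypothetical, and the variable names `deaths` and `cases` are placeholders rather than column names from the dataset.

```r
# Hypothetical counts for two made-up countries
deaths <- c(50, 10)
cases  <- c(2000, 1000)

# CFR (%) = total deaths / total confirmed cases * 100
cfr_pct <- deaths / cases * 100
cfr_pct
```

The same expression, applied to the relevant columns of the dataset, gives one CFR value per country.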

In [None]:
# Create the outcome variable, CFR
cfr <- cfr %>% 
    ____(cfr = (____/____)*100
    )

head(cfr)

In [None]:
# Check the summary statistics of CFR
____

In [None]:
# Visualize histogram of CFR
cfr %>% ggplot(aes(cfr)) +
    ____

Our explanatory variable is smoking. In this dataset, there are two relevant columns: `male_smokers` and `female_smokers`. These numbers represent the proportion of male and female smokers in each country, respectively. Explore the male smokers variable first.

In [None]:
# Check the summary statistics of male smokers
____

In [None]:
# Visualize histogram of male smokers
cfr %>% ggplot(aes(male_smokers)) +
    ____

Visualize the relationship between two variables, CFR and proportion of male smokers, using a scatter plot.

In [None]:
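# Load ggrepel and ggthemes packages
library(____)
library(____)
# scales provides trans_breaks(), trans_format(), and math_format() used in the log-scale plot
library(scales)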

In [None]:
cfr %>% ggplot(aes(x = male_smokers, y = cfr, label = location)) +
        ____(size = 3, alpha = 0.5) +
        geom_text_repel(size = 4) +
        xlab("Proportion of male smokers") +
        ylab("COVID-19 case-fatality rate") +
        theme_clean() 

Notice that the data points are crowded at the bottom of the graph. Transform the y-axis to logarithmic scale.
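To see why a logarithmic scale helps with crowded points, consider this small sketch. The values in `x` are hypothetical and only chosen to span several orders of magnitude.

```r
# Hypothetical values spanning several orders of magnitude
x <- c(0.1, 1, 10, 100)

# On the raw scale, most values are tiny relative to the largest one
x / max(x)

# On the log10 scale, the same values are evenly spaced
log10(x)
```

A log axis applies the same idea to the plot: small values are spread apart instead of being squeezed against the bottom of the graph.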

In [None]:
cfr %>% ggplot(aes(x = male_smokers, y = cfr, label = location)) +
        ____(size = 3, alpha = 0.5) +
        geom_text_repel(size = 4) +
        xlab("Proportion of male smokers") +
        ylab("COVID-19 case-fatality rate (log scale)") +
        theme_clean() +
        ____(breaks = trans_breaks("log10", function(x) 10^x),
            labels = trans_format("log10", math_format(10^.x))) 

Countries located on the right side of the graph have a higher proportion of male smokers, while countries near the top of the graph have a higher COVID-19 case-fatality rate.

In [None]:
# What is the proportion of male smokers in your country?
# Answer: ____

## Inference for linear regression

In [None]:
# What is the statement of the problem?
# Answer: Is there a relationship between CFR and proportion of male smokers?

In [None]:
# What is the null hypothesis?
# Answer: The true coefficient of the proportion of male smokers is equal to zero.

In [None]:
# What is the alternative hypothesis?
# Answer: The true coefficient of the proportion of male smokers is NOT equal to zero.

Generate the linear regression model using the **`lm( )`** function. The **`lm( )`** function can also be used for analysis of variance. The `formula` argument specifies the variables to be used for fitting the model. A typical model has the form `response ~ explanatory`, where `response` is the outcome or response variable and `explanatory` is the explanatory or predictor variable for the outcome.
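The `response ~ explanatory` form can be tried on a dataset that ships with R. This is only a sketch: `mtcars`, `mpg`, and `wt` are stand-ins chosen for illustration and have nothing to do with the COVID-19 data.

```r
# mtcars is a built-in dataset; mpg plays the role of the response
# and wt the role of the explanatory variable
fit <- lm(formula = mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line
coef(fit)
```

The first coefficient is the intercept and the second is the slope for the explanatory variable.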

In [None]:
model <- cfr %>%
    ____(formula = ____ ~ ____)

Use the **`summary()`** function to access the result summaries of the linear regression model.

In [None]:
result <- model %>% 
    ____

result

Use the **`coefficients`** accessor to obtain the coefficient estimates of the variables in this model, which are located in the first column.
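The coefficients table of a model summary is a matrix, so its columns can be selected by position or by name. The sketch below uses the built-in `mtcars` data as a stand-in, not the exercise data.

```r
# Stand-in model on built-in data
fit   <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

# Column names of the coefficients matrix
colnames(coefs)

# First column: the estimates; fourth column: the p-values
coefs[, "Estimate"]
coefs[, "Pr(>|t|)"]
```

Selecting columns by name is slightly more robust than by position if you are unsure of the column order.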

In [None]:
result$____

In [None]:
# What is the linear equation of the regression model? outcome = intercept + slope(explanatory) 
# Answer: cfr = 2.31 - 0.005(male_smokers) 

If you are interested in accessing the residuals, use the **`resid( )`** function.
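A residual is the observed response minus the value predicted by the model. The following sketch checks this on the built-in `mtcars` data, used here purely as a stand-in.

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

# Residual = observed response - fitted value
manual_resid <- mtcars$mpg - fitted(fit)

# Compare with resid(); unname() drops the row labels before comparing
all.equal(unname(resid(fit)), unname(manual_resid))
```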

In [None]:
result %>% ____

In [None]:
# What is the interpretation of this linear equation?
# Answer: The model predicts a 0.005 percentage-point decrease in CFR for each additional percentage point of male smokers.

Use the **`coefficients`** accessor to obtain the p-values of interest, which are located in the fourth column.

In [None]:
result$____

In [None]:
# Based on the p-value, what is the statistical decision? Do you reject or fail to reject the null?
# Answer: Fail to reject the null hypothesis since the p-value is large (p-value = 0.7446)
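The decision rule can be written out explicitly. This sketch assumes a significance level of 0.05, which the exercise does not state, and reuses the p-value quoted in the answer above.

```r
alpha   <- 0.05     # assumed significance level; not stated in the exercise
p_value <- 0.7446   # p-value quoted in the answer above

# Reject the null only when the p-value falls below alpha
decision <- if (p_value < alpha) "reject the null" else "fail to reject the null"
decision
```

A large p-value means the data do not provide enough evidence against the null hypothesis; it does not prove that the null hypothesis is true.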

Use the **`r.squared`** accessor to obtain the R-squared value of the model. 

In [None]:
result$____

In [None]:
# What is the interpretation of R-squared value?
# Answer: _____

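Calculate the correlation coefficient by taking the square root of the coefficient of determination (R-squared). For a model with a single explanatory variable, this gives the magnitude of the correlation; its sign is the same as the sign of the slope.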
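The link between R-squared and the correlation coefficient can be verified on a stand-in model. The sketch below uses the built-in `mtcars` data, not the exercise data, and compares the square root of R-squared with the result of `cor()`.

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

r_from_model <- sqrt(summary(fit)$r.squared)
r_direct     <- cor(mtcars$wt, mtcars$mpg)

# The magnitudes agree; the sign of the correlation follows the slope
c(from_model = r_from_model, direct = r_direct, slope = unname(coef(fit)["wt"]))
```

Because the square root is never negative, check the sign of the slope before describing the direction of the relationship.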

In [None]:
sqrt(result$_____)

In [None]:
# Based on the value of the correlation coefficient, what is the relationship between CFR and the proportion of male smokers?
# Answer: There is almost no linear correlation between CFR and the proportion of male smokers.

Add a trendline using the **`geom_smooth( )`** function to visualize the relationship between the two variables. Use the `method` argument to specify `lm` for linear regression and set the `se` argument to `TRUE` to display the confidence interval of the fitted line.
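As a rough illustration of these two arguments, here is a sketch on the built-in `mtcars` data. The variables `wt` and `mpg` are stand-ins for illustration only, and the explicit `formula = y ~ x` is added here just to avoid the informational message that `geom_smooth( )` otherwise prints.

```r
library(ggplot2)

# Scatter plot with a fitted straight line and its confidence band
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(size = 3, alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, se = TRUE, fill = "lightblue")
p
```

Setting `se = FALSE` instead would draw the line without the shaded band.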

In [None]:
cfr %>% ggplot(aes(x = male_smokers, y = log(cfr), label = location)) +
        ____(size = 3, alpha = 0.5) +
        xlab("Proportion of male smokers") +
        ylab("COVID-19 case-fatality rate (log scale)") +
        theme_clean() +
        ____(method = ____, se = ____, fill = "lightblue")

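The fitted regression line (blue line) is almost horizontal, indicating little to no linear relationship between CFR and the proportion of male smokers. Note that this plot fits the line to `log(cfr)`, whereas the model above was fitted to the untransformed `cfr`.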

Next, you are going to evaluate the relationship between COVID-19 CFR and the proportion of female smokers.

In [None]:
# Write your code below


In [None]:
# What is the relationship between COVID-19 CFR and the proportion of female smokers?
# Answer: ____