Plotting data distributions using R
An important part of any data analysis is exploratory data analysis (EDA), and one of its steps is to check how the data are distributed. Many statistical tests assume normality, so it is important to know how to check whether the data fulfil that assumption. We will demonstrate how to produce:
- Histograms
- Density plots
- Bar plots
- Box plots
- Violin plots
- QQ-plots and the Shapiro-Wilk test
First, install and load the tidyverse package if you haven't yet.
# install.packages("tidyverse")
library(tidyverse)
We will start by simulating normally distributed data with mean 80 and standard deviation 10. ggplot requires its input as a data frame, so we create a data frame with a column called x_norm before we pipe it to ggplot to produce a histogram with the classic bell-shaped distribution.
norm_data <- data.frame(x_norm = rnorm(n=1000, mean=80, sd=10))
norm_data %>%
ggplot(aes(x_norm)) +
geom_histogram()
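Because rnorm() draws random numbers, the histogram looks slightly different on every run, and geom_histogram() falls back to 30 bins (with a message) when no bin setting is given. A minimal sketch of a reproducible variant follows; the seed 42 and the bin width of 2 are arbitrary assumptions for illustration, not values from the exercise.

```r
library(ggplot2)  # already attached if you loaded the tidyverse

# Fix the random seed so the simulated data are the same on every run.
# The value 42 is arbitrary.
set.seed(42)
norm_data <- data.frame(x_norm = rnorm(n = 1000, mean = 80, sd = 10))

# Setting binwidth explicitly avoids the default of 30 bins.
# A width of 2 is an illustrative choice, not a recommendation.
p <- ggplot(norm_data, aes(x_norm)) +
  geom_histogram(binwidth = 2)
p
```

Trying a few different bin widths is usually worthwhile, since the apparent shape of a histogram depends on them.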
E1. Download the demo data using the following line of code.
df_in <- read.table("https://raw.githubusercontent.com/bcfgothenburg/Hands-on/master/distribution_demo.txt",
header = TRUE,
stringsAsFactors = TRUE)
E2. Have a look at the data using head() and summary() on the object df_in.
Solution
head(df_in)
summary(df_in)
E3. We start by creating a basic histogram of the Height column for the total population from the object df_in. To do this, we use the ggplot() function with aes(Height) and the layer geom_histogram(). Hint: Type ?geom_histogram in the console for an example.
Solution
df_in %>%
ggplot(aes(Height)) +
geom_histogram()
E4. To make the histogram look nicer, add the layers ggtitle("Total population") and theme_classic(). From now on we use theme_classic() for all the plots.
Solution
df_in %>%
ggplot(aes(Height)) +
geom_histogram() +
ggtitle("Total population") +
theme_classic()
E5. Now we will create grouped histograms by the column Sex. To achieve this, add group=Sex, fill=Sex within aes() in the code above. By default the groups are stacked, so the bars still show the total population, now colored by the column Sex. If you specify position="identity" within geom_histogram(), you get a separate histogram for each level of Sex, each drawn from the baseline. To better visualize the groups, add alpha=0.5 to geom_histogram(); this makes the overlapping histograms semi-transparent.
Solution
df_in %>%
ggplot(aes(Height, group=Sex, fill=Sex)) +
geom_histogram(alpha=0.5, position="identity") +
theme_classic()
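The difference between the default stacking and position="identity" can be checked numerically. The sketch below uses a small simulated data frame, toy_df, which is hypothetical: its group means and standard deviations are made up and it only mimics the Height and Sex columns of df_in. layer_data() returns the values ggplot2 computed for a layer.

```r
library(ggplot2)

# Hypothetical stand-in for df_in with the same column names.
set.seed(1)
toy_df <- data.frame(
  Height = c(rnorm(200, mean = 165, sd = 7), rnorm(200, mean = 179, sd = 7)),
  Sex = factor(rep(c("F", "M"), each = 200))
)

# Default position: the groups are stacked on top of each other.
p_stack <- ggplot(toy_df, aes(Height, fill = Sex)) +
  geom_histogram(bins = 30)

# position = "identity": each group is drawn from the baseline.
p_ident <- ggplot(toy_df, aes(Height, fill = Sex)) +
  geom_histogram(bins = 30, alpha = 0.5, position = "identity")

# The tallest stacked bar reflects the total count in a bin,
# the tallest overlaid bar only the count of a single group.
stack_top <- max(layer_data(p_stack)$ymax)
ident_top <- max(layer_data(p_ident)$ymax)
c(stacked = stack_top, identity = ident_top)
```

The counts per group are the same in both plots; only the vertical placement of the bars changes.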
E6. We start by creating a basic density plot of the Height column for the total population from the object df_in. To do this, we use the ggplot() function with aes(Height) and the layer geom_density(). Hint: Type ?geom_density in the console for an example.
Solution
df_in %>%
ggplot(aes(Height)) +
geom_density() +
theme_classic()
E7. As we did for the histograms, we will create grouped density plots by the column Sex. To achieve this, add group=Sex, fill=Sex within aes() in the code above. To better visualize the groups, add alpha=0.5 to geom_density(); this makes the overlapping densities semi-transparent.
Solution
df_in %>%
ggplot(aes(Height, group=Sex, fill=Sex)) +
geom_density(alpha=0.5) +
theme_classic()
E8. Now we will use a bar plot to look at the distribution of the discrete variable training_week. We will use geom_bar() instead of geom_histogram(). Note that ggplot by default creates a stacked bar plot.
Solution
df_in %>%
ggplot(aes(training_week, group=Sex, fill=Sex)) +
geom_bar() +
theme_classic()
E9. To make it easier to see the differences between the groups, we will now create an unstacked bar plot by adding position="dodge2" in the geom_bar() layer. We will also add width=0.4 to make the bars thinner.
Solution
df_in %>%
ggplot(aes(training_week, group=Sex, fill=Sex)) +
geom_bar(position="dodge2", width=0.4) +
theme_classic()
E10. The bar plot above can easily be improved by making training_week a factor before plotting. We do this using mutate(training_week = as.factor(training_week)). Now the x-axis has tick marks only at the factor levels.
Solution
df_in %>%
mutate(training_week = as.factor(training_week)) %>%
ggplot(aes(training_week, group=Sex, fill=Sex)) +
geom_bar(position="dodge2", width=0.4) +
theme_classic()
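geom_bar() counts the rows in each group, so the bar heights can be cross-checked with a plain contingency table. The sketch below uses a hypothetical data frame, toy_df, whose training_week values (0 to 5) and group sizes are invented for illustration; only the column names follow df_in.

```r
# Hypothetical stand-in for df_in; the values are made up.
set.seed(1)
toy_df <- data.frame(
  training_week = sample(0:5, size = 100, replace = TRUE),
  Sex = factor(sample(c("F", "M"), size = 100, replace = TRUE))
)

# One cell per bar of the dodged bar plot: rows are weeks, columns are groups.
counts <- table(training_week = toy_df$training_week, Sex = toy_df$Sex)
counts

# as.factor() turns each observed value into one level,
# which is why the x-axis then shows one tick per level.
week_levels <- levels(as.factor(toy_df$training_week))
week_levels
```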
E11. We can also use a classic box plot to compare the distribution of heights between the groups by using the layer geom_boxplot() and specifying ggplot(aes(x=Sex, y=Height, group=Sex, fill=Sex)). Because fill is mapped to Sex, ggplot adds a legend to the right. In this case the legend is not needed, since the x-axis already labels the groups, so we remove it by adding the layer theme(legend.position="none").
Solution
df_in %>%
ggplot(aes(x=Sex, y=Height, group=Sex, fill=Sex)) +
geom_boxplot() +
theme_classic() +
theme(legend.position="none")
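To read a box plot it helps to know which numbers it draws. The sketch below computes, per group, the median, the first and third quartiles (the hinges), and the whisker ends as the most extreme observations within 1.5 times the interquartile range of the hinges. The data frame toy_df and the helper box_stats() are hypothetical names introduced only for this illustration.

```r
# Hypothetical stand-in for df_in; the values are made up.
set.seed(1)
toy_df <- data.frame(
  Height = c(rnorm(50, mean = 165, sd = 7), rnorm(50, mean = 179, sd = 7)),
  Sex = factor(rep(c("F", "M"), each = 50))
)

# Illustrative helper: the five numbers behind one box.
box_stats <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  c(
    lower_whisker = min(x[x >= q[1] - 1.5 * iqr]),
    lower_hinge   = q[1],
    median        = q[2],
    upper_hinge   = q[3],
    upper_whisker = max(x[x <= q[3] + 1.5 * iqr])
  )
}

# One column per group, one row per statistic.
stats_by_sex <- sapply(split(toy_df$Height, toy_df$Sex), box_stats)
stats_by_sex
```

Observations beyond the whisker ends are the ones a box plot draws as individual points.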
E12. The last type of visualization of distributions will be the violin plot. This is created by using ggplot(aes(x=Sex, y=Height)) and the layer geom_violin(fill="darkgrey"). This creates a basic violin plot.
Solution
df_in %>%
ggplot(aes(x=Sex, y=Height)) +
geom_violin(fill="darkgrey") +
theme_classic()
E13. We overlay the violin plot with jittered data points by adding geom_jitter(width=0.1).
Solution
df_in %>%
ggplot(aes(x=Sex, y=Height)) +
geom_violin(fill="darkgrey") +
geom_jitter(width=0.1) +
theme_classic()
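Jittering adds random noise, so the points land in slightly different places each time the plot is drawn. If a reproducible figure is needed, one option is to pass a seed through position_jitter(). The sketch below assumes a hypothetical data frame toy_df; the seed 123 and the choice of height = 0 (no vertical noise, so the plotted heights stay exact) are illustrative assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for df_in; the values are made up.
set.seed(1)
toy_df <- data.frame(
  Height = c(rnorm(50, mean = 165, sd = 7), rnorm(50, mean = 179, sd = 7)),
  Sex = factor(rep(c("F", "M"), each = 50))
)

p <- ggplot(toy_df, aes(x = Sex, y = Height)) +
  geom_violin(fill = "darkgrey") +
  # The seed fixes the horizontal noise; height = 0 leaves y untouched.
  geom_jitter(position = position_jitter(width = 0.1, height = 0, seed = 123)) +
  theme_classic()
p

# Building the jitter layer (layer 2) twice gives identical x positions.
x_first  <- layer_data(p, 2)$x
x_second <- layer_data(p, 2)$x
identical(x_first, x_second)
```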
E14. Next, we replace the jittered data points with a narrow box plot drawn inside the violin.
Solution
df_in %>%
ggplot(aes(x=Sex, y=Height)) +
geom_violin(fill="darkgrey") +
geom_boxplot(width=0.04, fill="white") +
theme_classic()
E16. Now we will make a QQ-plot. We will reuse the normally distributed data in the data frame norm_data. To create a QQ-plot, we pipe norm_data to ggplot(aes(sample = x_norm)) and use the layers stat_qq() and stat_qq_line(). Hint: type ?stat_qq for an example. Ideally, the points for normally distributed data fall on the straight reference line drawn by stat_qq_line(), but in practice this rarely occurs for real data. Deviations are most often seen at the ends of the QQ-plot. Note that this is a subjective visual assessment.
Solution
norm_data %>%
ggplot(aes(sample = x_norm)) +
stat_qq() +
stat_qq_line()
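What the QQ-plot shows can be reproduced with base R: the sorted sample is plotted against the matching quantiles of a standard normal distribution, and the reference line passes through the first and third quartiles of both. The sketch below is an illustration under assumed settings (seed 42, same mean and standard deviation as the simulated data above). For data that are not standardized, the slope of the line is close to the standard deviation and the intercept close to the mean.

```r
set.seed(42)  # arbitrary seed for reproducibility
x <- rnorm(1000, mean = 80, sd = 10)

# Coordinates of the QQ points: theoretical vs. sample quantiles.
theoretical <- qnorm(ppoints(length(x)))
sample_q <- sort(x)

# Reference line through the first and third quartiles.
y_q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
x_q <- qnorm(c(0.25, 0.75))
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

c(slope = slope, intercept = intercept)
```

A base R plot of the same picture would be plot(theoretical, sample_q) followed by abline(intercept, slope).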
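E17. To formally test the assumption of a normal distribution we can use shapiro.test(), which performs the Shapiro-Wilk test. Its null hypothesis is that the data are normally distributed, so a small p-value indicates a deviation from normality. Hint: type ?shapiro.test to see an example.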
Solution
shapiro.test(norm_data$x_norm)
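The object returned by shapiro.test() is a list, so the test statistic W and the p-value can be extracted and compared against a significance level. The sketch below contrasts simulated normal data with clearly skewed (exponential) data; the seed, the sample sizes, and the 0.05 threshold are assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility
normal_x <- rnorm(1000, mean = 80, sd = 10)
skewed_x <- rexp(1000, rate = 1)  # strongly right-skewed

res_normal <- shapiro.test(normal_x)
res_skewed <- shapiro.test(skewed_x)

# W close to 1 is consistent with normality.
c(normal = unname(res_normal$statistic), skewed = unname(res_skewed$statistic))

# A p-value below the chosen level rejects the hypothesis of normality.
alpha <- 0.05  # conventional, but still a choice
reject_skewed <- res_skewed$p.value < alpha
reject_skewed
```

Note that shapiro.test() accepts between 3 and 5000 observations, and that with large samples even small, practically irrelevant deviations can give a small p-value, which is one reason to look at the QQ-plot as well.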
E18. Make a QQ-plot of the variable dist in the built-in dataset attenu and run a Shapiro-Wilk test to check normality. Hint: Use the code from E16 and E17.
Solution
attenu %>%
ggplot(aes(sample = dist)) +
stat_qq() +
stat_qq_line()
shapiro.test(attenu$dist)
E19. Log-transformation is a common way to bring right-skewed data closer to a normal distribution. Log-transform the variable dist and repeat the QQ-plot and Shapiro-Wilk test from E18 to check whether the log-transformed variable is normally distributed.
Solution
attenu %>%
ggplot(aes(sample = log(dist))) +
stat_qq() +
stat_qq_line()
shapiro.test(log(attenu$dist))
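Two quick checks can accompany a log-transformation: log() is only defined for strictly positive values, and the Shapiro-Wilk statistic W (where values near 1 indicate agreement with normality) gives a rough before-and-after comparison. This is a minimal sketch of that idea, not part of the original solution.

```r
# attenu ships with R in the datasets package.
# log() requires strictly positive input, so check that first.
stopifnot(all(attenu$dist > 0))

w_raw <- unname(shapiro.test(attenu$dist)$statistic)
w_log <- unname(shapiro.test(log(attenu$dist))$statistic)

# Compare W before and after the transformation.
c(raw = w_raw, log = w_log)
```

A higher W after the transformation means the data are closer to normal than before; it does not by itself mean that the transformed data pass the test.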
Developed by Björn Andersson and Jari Martikainen 2023