Plotting data distributions using R

Bioinformatics and Data Centre, Gothenburg edited this page Nov 23, 2023 · 1 revision

Hands-on Exercises: Plotting data distributions using R

Background

An important part of any data analysis is exploratory data analysis (EDA), and one of its steps is to check how the data are distributed. Many statistical tests assume normality, so it is important to know how to check whether the data fulfil that assumption. We will demonstrate how to produce:

  • Histograms
  • Density plots
  • Bar plots
  • Box plots
  • Violin plots
  • QQ-plots and the Shapiro-Wilk test

Before you start

First install and load the tidyverse package if you haven't yet.

# install.packages("tidyverse")
library(tidyverse)

Histogram

Checking the distribution with a histogram in ggplot2

We will start by simulating normally distributed data with mean 80 and standard deviation 10. ggplot requires its input as a data frame, so we create a data frame with a column called x_norm and pipe it to ggplot to produce a histogram with the classic bell-shaped distribution.

norm_data <- data.frame(x_norm = rnorm(n=1000, mean=80, sd=10))

norm_data %>%
 ggplot(aes(x_norm)) +
 geom_histogram()
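Called without arguments, geom_histogram() uses 30 bins and prints a message suggesting that a better value be chosen, and rnorm() gives different values on every run. The sketch below shows one way to make the example reproducible and to set the bin width explicitly. The seed value 1 and the bin width of 2 are arbitrary assumptions made for illustration, not part of the original exercise.

```r
library(ggplot2)
library(dplyr)

# Fix the random seed so the simulated data are identical on every run
# (the seed value 1 is an arbitrary choice).
set.seed(1)
norm_data <- data.frame(x_norm = rnorm(n = 1000, mean = 80, sd = 10))

# Setting binwidth replaces the default of 30 bins;
# a width of 2 units is an assumption made for this illustration.
p_hist <- norm_data %>%
  ggplot(aes(x_norm)) +
  geom_histogram(binwidth = 2)

p_hist
```

Narrower bins show more detail but also more noise, so it is worth trying a few values before settling on one.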

Exercises

E1. Download the demo data using the following line of code.

df_in <- read.table("https://raw.githubusercontent.com/bcfgothenburg/Hands-on/master/distribution_demo.txt",
                    header = TRUE, 
                    stringsAsFactors = TRUE)

E2. Have a look at the data using head() and summary() on the object df_in.

Solution
head(df_in)
summary(df_in)

E3. We start by creating a basic histogram of the Height column for the total population in the object df_in. To do this, we use the ggplot() function with aes(Height) and the layer geom_histogram(). Hint: Type ?geom_histogram in the console for an example.

Solution
df_in %>%
 ggplot(aes(Height)) +
 geom_histogram()

E4. To make the histogram look nicer, add the layers ggtitle("Total population") and theme_classic(). From now on we use theme_classic() for all plots.

Solution
df_in %>%
 ggplot(aes(Height)) +
 geom_histogram() +
 ggtitle("Total population") +
 theme_classic()

E5. Now we will create histograms grouped by the column Sex. To achieve this, add group=Sex, fill=Sex within aes() in the code above. By default the bars of the groups are stacked, so the outline still shows the total population, now colored by Sex. If you specify position="identity" within geom_histogram(), each group gets its own histogram starting from zero. To better visualize the groups, add alpha=0.5 to geom_histogram(); this makes the overlapping histograms semi-transparent.

Solution
df_in %>%
 ggplot(aes(Height, group=Sex, fill=Sex)) +
 geom_histogram(alpha=0.5, position="identity") +
 theme_classic()
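The difference between stacked and overlaid histograms can be checked numerically as well as visually. The following is a minimal sketch using simulated data as a stand-in for df_in: the data frame sim, its values, and the group labels "F" and "M" are invented for illustration. layer_data() returns the values ggplot2 computed for drawing a layer.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for df_in (all values are invented for illustration).
sim <- data.frame(
  Height = c(rnorm(500, mean = 165, sd = 7), rnorm(500, mean = 178, sd = 7)),
  Sex    = factor(rep(c("F", "M"), each = 500))
)

# Default position ("stack"): bars of one group sit on top of the other.
p_stack <- ggplot(sim, aes(Height, fill = Sex)) +
  geom_histogram(binwidth = 2) +
  theme_classic()

# position = "identity": every bar starts at zero, so the groups overlap.
p_ident <- ggplot(sim, aes(Height, fill = Sex)) +
  geom_histogram(binwidth = 2, alpha = 0.5, position = "identity") +
  theme_classic()

# Lower edge of the bars in each version.
range(layer_data(p_stack)$ymin)  # some bars start above zero
range(layer_data(p_ident)$ymin)  # all bars start at zero
```

In the stacked version the height of a colored segment, not its top edge, gives the count for that group, which is why the overlaid version is usually easier to read when comparing groups.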

E6. Next we create a basic density plot of the Height column for the total population in the object df_in. To do this, we use the ggplot() function with aes(Height) and the layer geom_density(). Hint: Type ?geom_density in the console for an example.

Solution
df_in %>%
  ggplot(aes(Height)) +
  geom_density() +
  theme_classic()

E7. As we did for the histograms, we will create density plots grouped by the column Sex. To achieve this, add group=Sex, fill=Sex within aes() in the code above. To better visualize the groups, add alpha=0.5 to geom_density(); this makes the overlapping densities semi-transparent.

Solution
df_in %>%
  ggplot(aes(Height, group=Sex, fill=Sex)) +
  geom_density(alpha=0.5) +
  theme_classic()
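How smooth a density plot looks depends on the bandwidth of the kernel estimate, which geom_density() lets you scale with its adjust argument. The sketch below is an illustration on simulated data. The data frame sim and the adjust values 2 and 0.5 are assumptions chosen for this example and are not part of the exercise.

```r
library(ggplot2)

set.seed(1)
# Simulated heights (invented values, used only for illustration).
sim <- data.frame(Height = rnorm(300, mean = 170, sd = 8))

# adjust > 1 widens the bandwidth and gives a smoother curve.
p_smooth <- ggplot(sim, aes(Height)) +
  geom_density(adjust = 2) +
  theme_classic()

# adjust < 1 narrows the bandwidth and gives a bumpier curve.
p_rough <- ggplot(sim, aes(Height)) +
  geom_density(adjust = 0.5) +
  theme_classic()

# Compare the highest point of the two estimated curves.
max(layer_data(p_smooth)$density)
max(layer_data(p_rough)$density)
```

A heavily smoothed curve can hide real features such as two peaks, while a very rough one can suggest peaks that are only noise, so the default is a reasonable starting point.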

E8. Now we will use a bar plot to look at the distribution of the discrete variable training_week. We will use geom_bar() instead of geom_histogram(). Note that ggplot by default creates a stacked bar plot.

Solution
df_in %>%
 ggplot(aes(training_week, group=Sex, fill=Sex)) +
 geom_bar() +
 theme_classic()

E9. To make it easier to see the differences between the groups, we will now create a side-by-side (dodged) bar plot by adding position="dodge2" in the geom_bar() layer. We will also add width=0.4 to make the bars thinner.

Solution
df_in %>%
  ggplot(aes(training_week, group=Sex, fill=Sex)) +
  geom_bar(position="dodge2", width=0.4) +
  theme_classic()

E10. The bar plot above can easily be improved by making training_week a factor before plotting. We do this using mutate(training_week = as.factor(training_week)). Now we just have tick marks on the x-axis at the factor levels.

Solution
df_in %>%
  mutate(training_week = as.factor(training_week)) %>%
  ggplot(aes(training_week, group=Sex, fill=Sex)) +
  geom_bar(position="dodge2", width=0.4) +
  theme_classic()
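geom_bar() counts the rows in each group before drawing, and the same counts can be produced as a table with dplyr's count(). The sketch below uses simulated data, since the actual values of training_week are not shown here. The data frame sim, the week values 0 to 5, and the labels "F" and "M" are invented for illustration.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for df_in (all values are invented for illustration).
sim <- data.frame(
  training_week = sample(0:5, size = 200, replace = TRUE),
  Sex           = factor(sample(c("F", "M"), size = 200, replace = TRUE))
)

# One row per combination of training_week and Sex; the column n holds
# the number of observations, i.e. the bar heights in the bar plot.
counts <- sim %>%
  mutate(training_week = as.factor(training_week)) %>%
  count(training_week, Sex)

counts
```

Looking at the table next to the plot is a quick way to confirm that the bars show what you expect.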

E11. We can also use a classic box plot to compare the distribution of heights between the groups by using the layer geom_boxplot() and specifying ggplot(aes(x=Sex, y=Height, group=Sex, fill=Sex)). Because fill is mapped to Sex, ggplot adds a legend to the right. In this case the legend is not needed, since the groups are already labelled on the x-axis, so we remove it by adding the layer theme(legend.position="none").

Solution
df_in %>%
 ggplot(aes(x=Sex, y=Height, group=Sex, fill=Sex)) +
 geom_boxplot() +
 theme_classic() +
 theme(legend.position="none")
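The box in a box plot spans the first to the third quartile with a line at the median, and by default the whiskers reach to the most extreme points within 1.5 times the interquartile range of the box. These numbers can be computed directly, as in the sketch below. It uses simulated data, and the data frame sim and the column names of box_stats are invented for illustration.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for df_in (all values are invented for illustration).
sim <- data.frame(
  Height = c(rnorm(500, mean = 165, sd = 7), rnorm(500, mean = 178, sd = 7)),
  Sex    = factor(rep(c("F", "M"), each = 500))
)

box_stats <- sim %>%
  group_by(Sex) %>%
  summarise(
    q1     = quantile(Height, 0.25, names = FALSE),  # lower edge of the box
    median = median(Height),                         # line inside the box
    q3     = quantile(Height, 0.75, names = FALSE),  # upper edge of the box
    iqr    = IQR(Height),                            # height of the box
    # Points outside these limits are drawn individually as outliers.
    lower_fence = q1 - 1.5 * iqr,
    upper_fence = q3 + 1.5 * iqr
  )

box_stats
```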

E12. The last type of visualization of distributions will be the violin plot. This is created by using ggplot(aes(x=Sex, y=Height)) and the layer geom_violin(fill="darkgrey"). This creates a basic violin plot.

Solution
df_in %>%
  ggplot(aes(x=Sex, y=Height)) +
  geom_violin(fill="darkgrey") +
  theme_classic()

E13. We overlay the violin plot with jittered data points by adding geom_jitter(width=0.1).

Solution
df_in %>%
 ggplot(aes(x=Sex, y=Height)) +
 geom_violin(fill="darkgrey") +
 geom_jitter(width=0.1) + 
 theme_classic()

E14. Next, we add a box plot instead of jittered data points.

Solution
df_in %>%
 ggplot(aes(x=Sex, y=Height)) +
 geom_violin(fill="darkgrey") +
 geom_boxplot(width=0.04, fill="white") + 
 theme_classic()

E16. Now we will make a QQ-plot. We will reuse the normally distributed data in the data frame norm_data. To create a QQ-plot, pipe norm_data to ggplot(aes(sample = x_norm)) and use the layers stat_qq() and stat_qq_line(). Hint: type ?stat_qq for an example. Ideally, the points of normally distributed data fall along the straight reference line drawn by stat_qq_line(), but in practice this rarely holds exactly for real data. Deviations are most often seen at the ends of the QQ-plot. Note that this is a subjective visual assessment.

Solution
norm_data %>% 
 ggplot(aes(sample = x_norm)) + 
 stat_qq() + 
 stat_qq_line()
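To see what the QQ-plot actually shows, the same quantities can be computed by hand in base R. In the sketch below the sorted sample is paired with standard normal quantiles, and the reference line is taken through the first and third quartiles, which is the default used by stat_qq_line(). The seed and all variable names are assumptions made for illustration.

```r
set.seed(1)
x <- rnorm(1000, mean = 80, sd = 10)
n <- length(x)

# x-axis of the QQ-plot: quantiles expected under a standard normal.
theoretical <- qnorm(ppoints(n))
# y-axis of the QQ-plot: the observed values in increasing order.
sample_q <- sort(x)

# Reference line through the first and third quartiles.
y_q <- quantile(x, c(0.25, 0.75), names = FALSE)
x_q <- qnorm(c(0.25, 0.75))
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

c(intercept = intercept, slope = slope)
```

Because the theoretical quantiles come from a standard normal distribution, the intercept of the line estimates the centre of the data and the slope estimates its spread, so for these simulated values they should land near 80 and 10.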

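E17. To formally test the assumption of normality we can use the Shapiro-Wilk test, shapiro.test(). Its null hypothesis is that the data come from a normal distribution, so a small p-value indicates a deviation from normality. Hint: type ?shapiro.test to see an example.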

Solution
shapiro.test(norm_data$x_norm)
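The result of shapiro.test() is easier to interpret when compared with a sample that is clearly not normal. The sketch below is a minimal illustration. The exponential sample x_skew, the seed, and the 0.05 threshold are assumptions added for contrast and are not part of the exercise.

```r
set.seed(1)
x_norm <- rnorm(1000, mean = 80, sd = 10)  # simulated normal data
x_skew <- rexp(1000, rate = 1)             # right-skewed data, for contrast

res_norm <- shapiro.test(x_norm)
res_skew <- shapiro.test(x_skew)

# The test statistic W is close to 1 when the data look normal.
res_norm$statistic
res_skew$statistic

# A p-value below the chosen threshold (0.05 is a common convention)
# leads to rejecting the null hypothesis of normality.
alpha <- 0.05
res_norm$p.value < alpha
res_skew$p.value < alpha
```

Note that shapiro.test() accepts between 3 and 5000 observations, and that with large samples even small, practically irrelevant deviations can give a small p-value, so it is best read together with the QQ-plot.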

E18. Make a QQ-plot of the variable dist in the built-in dataset attenu and run a Shapiro-Wilk test to check normality. Hint: Use the code from E16 and E17.

Solution
attenu %>% 
  ggplot(aes(sample = dist)) + 
  stat_qq() + 
  stat_qq_line()

shapiro.test(attenu$dist)

E19. Log-transformation is a common way to bring right-skewed data closer to a normal distribution. Log-transform the variable dist and repeat the QQ-plot and Shapiro-Wilk test from E18 to check whether the log-transformed variable is normally distributed.

Solution
attenu %>% 
 ggplot(aes(sample = log(dist))) + 
 stat_qq() + 
 stat_qq_line()

shapiro.test(log(attenu$dist))
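The logarithm is only defined for positive values, so it is worth checking the variable before transforming it. The sketch below adds such a check and puts the Shapiro-Wilk statistic W for the raw and log-transformed variable side by side. The variable names are invented for illustration, and this is one possible way to compare the two, not the only one.

```r
# attenu is a built-in dataset, so no packages are needed.
dist_raw <- attenu$dist

# log() gives -Inf for zero and NaN for negative values,
# so stop early if any value is not strictly positive.
stopifnot(all(dist_raw > 0))

# W is close to 1 when the data look normal.
w_raw <- unname(shapiro.test(dist_raw)$statistic)
w_log <- unname(shapiro.test(log(dist_raw))$statistic)

c(raw = w_raw, log = w_log)
```

A W closer to 1 after the transformation suggests an improvement, but the p-value and the QQ-plot still decide whether the transformed variable can reasonably be treated as normal.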

Developed by Björn Andersson and Jari Martikainen 2023