# R Basics - Statistical Data Analysis

by Dr. Matthias Duschl & Dr. Daniel Lee, 2025

## Solutions for the exercises

In [None]:
aphasiker <- read.table("data/Aphasiker.csv", sep = ";", dec = ",", header = TRUE, encoding = "UTF-8")
claims <- read.table("data/Auto_Insurance_Claims_Sample.csv", sep = ",", header = T)

## Exercise 1: examine the Aphasiker data set

Learn about the data by applying the functions below to the Aphasiker data.

In [None]:
aphasiker

In [None]:
head(aphasiker)

In [None]:
tail(aphasiker)

In [None]:
nrow(aphasiker)

In [None]:
colnames(aphasiker)

In [None]:
summary(aphasiker)

In [None]:
str(aphasiker)

## Exercise 2: data types
Enter some values with different types into your R terminal.

In [None]:
42
-5.37
5.2e6
"foobar"
"here's some more character data"
TRUE
FALSE

Assign these values to variable names so that you can reuse them.

In [None]:
my.number <- 25
my.characters <- "Wir schaffen das!"
my.truth <- FALSE
my.number
my.characters
my.truth

Try different mathematical operations with data of different types.
What happens?

In [None]:
25 + 42
25 / 42
25 ^ 42

In [None]:
"Wir" + "schaffen"

In [None]:
TRUE + TRUE
TRUE + FALSE
TRUE - FALSE
FALSE * FALSE

In [None]:
25 + TRUE
25 - FALSE

Inspect the type of your variables with the ``class`` and ``str`` functions.
Note that these functions can be really useful when you receive a value from, for example, a function you didn't write yourself.
They let you find out what type the returned data has.

In [None]:
str(my.number)
class(my.number)
str(my.characters)
class(my.characters)
str(my.truth)
class(my.truth)
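
The remark about values returned by functions you didn't write yourself can be tried out on a built-in function. The following is a minimal sketch; the measurements are made up for illustration and `t.test()` merely stands in for any function whose return value you want to inspect.

```r
# Made-up measurements, used only for illustration
measurements <- c(4.9, 5.1, 5.3, 4.8, 5.0, 5.2)

# t.test() returns a more complex object than a single number
result <- t.test(measurements, mu = 5)

class(result)   # The class of the returned object
str(result)     # Its internal structure: a list with named components
names(result)   # The components you can access with $
result$p.value  # For example, extract the p-value as a plain number
```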

## Exercise 3: vectors
Create a numeric, character, and boolean vector. Each should be 5 elements long.

In [None]:
c(1, 2, 3, 4, 5)
c("these", "are", "exactly", "five", "words")
c(T, T, F, F, T)

Save those vectors to variables.

In [None]:
n.vector <- c(1, 2, 3, 4, 5)
c.vector <- c("these", "are", "exactly", "five", "words")
b.vector <- c(T, T, F, F, T)

Concatenate vectors with each other.
Remember, vectors can be concatenated as long as it is possible to "promote" their values to a common data type.

In [None]:
c(n.vector, n.vector)
c(n.vector, 5, 6)
c(n.vector, c(5, 6))
c(n.vector, c.vector)
c(c.vector, b.vector)
c(n.vector, c.vector, b.vector)
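
To see which common type the values are "promoted" to, you can check the class of the concatenated result. A small sketch with illustrative vectors (the variable names are not part of the exercise):

```r
# Small example vectors, independent of the exercise variables
lgl <- c(TRUE, FALSE)
int <- 1:2
dbl <- c(1.5, 2.5)
chr <- c("a", "b")

class(c(lgl, int))  # "integer": TRUE/FALSE become 1/0
class(c(int, dbl))  # "numeric": integers become doubles
class(c(dbl, chr))  # "character": numbers become strings
c(lgl, chr)         # "TRUE" "FALSE" "a" "b"
```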

Using your new vectors, do the following:
- Extract the 3rd and 2nd from last elements
- Extract the first 2 elements using
  - Indexing
  - A boolean vector
- Experiment with boolean vectors as selectors

In [None]:
n.vector[3]
n.vector[length(n.vector) - 1]  # 2nd from last; n.vector[-2] would instead drop the 2nd element
n.vector[1:2]
n.vector[c(T, T, F, F, F)]
n.vector[b.vector]

Note how all operations are "vectorised": they are applied element by element across the vector.
Try:
- Combining 2 vectors of equal length in a computation
- "Recycling" a vector with length 1 against a longer vector
- Combining 2 vectors with incompatible lengths

In [None]:
n.vector + b.vector
n.vector + 200
n.vector + 1:2
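
The recycling behaviour can be made explicit with a few small vectors. This is only a sketch with illustrative names; `tryCatch()` is used here to capture the warning text that R emits when the lengths do not fit evenly.

```r
short <- 1:2
long  <- 1:6
odd   <- 1:5

long + 10     # The length-1 vector is recycled: 11 12 13 14 15 16
long + short  # Lengths 6 and 2 fit evenly: 2 4 4 6 6 8

# Lengths 5 and 2 do not fit evenly: R still recycles, but warns
msg <- tryCatch(odd + short, warning = function(w) conditionMessage(w))
msg           # The text of the warning

result <- suppressWarnings(odd + short)
result        # 2 4 4 6 6
```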

## Exercise 4: functions
- Use the ``max`` and ``mean`` functions on some vectors that you create
- Read the help for ``max``. Can you understand how to use it? How would you cite this function in a paper if you were so inclined?
- Find out what the ``pmax`` function does

In [None]:
max(100:1)
min(100:1)
mean(100:1)
?max
?pmax
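
The difference between `max` and `pmax` is easiest to see side by side. A short sketch with made-up vectors; since `max` is part of base R, `citation()` without arguments prints how to cite R itself.

```r
a <- c(1, 8, 3)
b <- c(5, 2, 9)

max(a, b)   # One overall maximum across all arguments: 9
pmax(a, b)  # "Parallel" maximum, element by element: 5 8 9
mean(a)     # Arithmetic mean of a: 4

# max() belongs to base R, so citing it means citing R itself
citation()
```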

## Exercise 5: NAs
- Make a vector that contains ``NA`` values and see how it performs with different operations and functions
- Filter ``NA``s out of a vector
- Use the ``na.rm = TRUE`` argument to ignore ``NA``s in a function

In [None]:
na.vector <- 1:10
na.vector[5] <- NA
na.vector

In [None]:
na.vector + 1

In [None]:
na.vector * 2

In [None]:
mean(na.vector)

In [None]:
min(na.vector)

In [None]:
is.na(na.vector)  # Find NAs in vector

In [None]:
na.vector[!is.na(na.vector)]  # ! negates a boolean vector

In [None]:
mean(na.vector, na.rm = TRUE)
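
A common pitfall is to test for `NA` with `==`. The sketch below, using an illustrative vector, shows why `is.na()` is needed and how `NA` propagates through comparisons and selections.

```r
values <- c(3, NA, 7)

values == NA               # NA NA NA: comparing with NA is itself "unknown"
is.na(values)              # FALSE TRUE FALSE
values > 5                 # FALSE NA TRUE
values[values > 5]         # NA 7: the NA is carried into the selection
values[which(values > 5)]  # 7: which() keeps only positions that are TRUE
```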

## Exercise 6: sorting and filtering
- Read the help for the `sort` function. How would you use the `na.last` argument?
- Sort the vectors you have created using different arguments

In [None]:
sort(n.vector)
sort(c.vector, decreasing = T)
sort(b.vector)
sort(na.vector)
sort(na.vector, na.last = TRUE)
sort(na.vector, na.last = F)

Remove all the numbers out of `n.vector` that are both odd **and** are less than 4.

- Filter it using
  - Boolean expressions
  - Vectors
  - Element-wise logical combinations (`&`, `|`)

In [None]:
n.vector
filter.vector <- n.vector %% 2 == 0  # TRUE for even numbers, FALSE for odd ones
filter.vector[n.vector >= 4] <- TRUE  # Also keep everything >= 4
n.vector[filter.vector]  # Did the filter work?
# Overwrite the original vector (in practice, prefer assigning to a new variable)
n.vector <- n.vector[filter.vector]
n.vector
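
The same filter can also be written as a single logical expression. A minimal sketch on a fresh vector (the names `numbers`, `is.odd` and `is.small` are illustrative), leaving the original data untouched:

```r
numbers <- 1:10

is.odd   <- numbers %% 2 == 1
is.small <- numbers < 4

# Keep everything that is NOT (odd AND small)
kept <- numbers[!(is.odd & is.small)]
kept  # 2 4 5 6 7 8 9 10

# Equivalent by De Morgan's law: even OR at least 4
same <- numbers[!is.odd | !is.small]
identical(kept, same)  # TRUE
```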

Filter out all words from `c.vector` that start with letters after "l" in the alphabet.
How many of those words are there?

In [None]:
filtered.words <- c.vector[c.vector < "l"]
filtered.words
length(filtered.words)

## Exercise 7: working with random numbers

* Generate a sample of 100 normally distributed numbers centered at 0 with a standard deviation of 200.
* What's their standard deviation? Is it really what you specified?
* What's their variance?
* Sort them in ascending order.
* Sort them in descending order.
* Filter out negative values.
* Do the exercises again for a series of exponentially distributed values, this time filtering out values under 1.

In [None]:
series <- rnorm(100, sd = 200)

In [None]:
sd(series)

In [None]:
var(series)

In [None]:
sort(series)

In [None]:
sort(series, decreasing = TRUE)

In [None]:
series[series > 0]

In [None]:
series <- rexp(100)
sd(series)
var(series)
sort(series)
sort(series, decreasing = TRUE)
series[series > 1]
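
Because the numbers are random, every run gives slightly different results. If you want reproducible output, you can fix the random seed first. A sketch, where the seed value 42 is an arbitrary choice:

```r
set.seed(42)  # Arbitrary seed, chosen only for illustration
first <- rnorm(100, mean = 0, sd = 200)

set.seed(42)  # Same seed again
second <- rnorm(100, mean = 0, sd = 200)

identical(first, second)            # TRUE: same seed, same numbers
sd(first)                           # Close to, but not exactly, 200
all.equal(var(first), sd(first)^2)  # TRUE: the variance is the squared sd
```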

## Exercise 8: working with data.frames

Answer the following questions using the aphasiker data:
* How many patients are in the data?
* What is the average age of the patients?
* How many women are among the patients? (Make a frequency table)

Answer the following questions using the claims file:
* What is the mean, median, minimum and maximum of the Total Claim Amount?
* What is the average Total Claim Amount for highly-qualified (Doctor and Master) versus the other insured?

In [None]:
nrow(aphasiker)  # Number of rows in the table = number of patients
length(aphasiker$Patienten_ID)  # Length of one variable = number of patients
mean(aphasiker$Alter)
table(aphasiker$Geschlecht) 

In [None]:
mean(claims$Total.Claim.Amount)
median(claims$Total.Claim.Amount)
min(claims$Total.Claim.Amount)
max(claims$Total.Claim.Amount)

In [None]:
summary(claims$Total.Claim.Amount)

In [None]:
claims$high.qualified <- FALSE
claims[claims$Education == "Doctor" | claims$Education == "Master", ]$high.qualified <- TRUE
mean(claims[claims$high.qualified == TRUE, ]$Total.Claim.Amount)
mean(claims[claims$high.qualified == FALSE, ]$Total.Claim.Amount)
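
An alternative way to compare group means is to build the flag with `%in%` and let `tapply()` or `aggregate()` do the grouping. The sketch below uses a small made-up data frame (`toy`) that only borrows the column names from the claims data.

```r
# Made-up data with the same column names as the claims file
toy <- data.frame(
  Education = c("Doctor", "Master", "Bachelor", "College", "Master", "Bachelor"),
  Total.Claim.Amount = c(300, 500, 200, 400, 100, 600)
)

# TRUE for Doctor or Master, FALSE otherwise
toy$high.qualified <- toy$Education %in% c("Doctor", "Master")

# Mean per group in one call
group.means <- tapply(toy$Total.Claim.Amount, toy$high.qualified, mean)
group.means  # FALSE: 400, TRUE: 300

# The same result as a data frame
aggregate(Total.Claim.Amount ~ high.qualified, data = toy, FUN = mean)
```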

## Exercise 9: Visualize it!
Use the claims data and explore the distribution of the variables visually. For instance,
* Create a histogram to show the Income distribution of the customers. Are there values you might want to filter out?
* Create a barplot for state, education or claim reason. For this, you need to count the frequencies first using ``table()``.

Try to customize your plots using colors and axes labels. 

In [None]:
hist(claims$Income)
hist(claims[claims$Income>0,]$Income, main="Income Distribution", xlab="Income")

In [None]:
barplot(table(claims$State))
barplot(table(claims$Education))
barplot(table(claims$Claim.Reason))

## Exercise 10: Regression lines
Use the claims data and visualize the relationship between "Claim Amount" and "Total Claim Amount".

Try to customize your scatterplot:
* Choose a color for the dots and a different symbol with the parameter ``pch``
* Control the size of the symbols (``?par`` shows you an overview on all graphical parameters)
* Add a linear regression line using ``abline()`` and the coefficients from a linear model ``lm()``: ``abline(lm(var1 ~ var2))``

Which one is the dependent variable in the model?

In [None]:
plot(claims$Total.Claim.Amount, claims$Claim.Amount,
     pch=19, col = "darkgrey", cex=0.5,
     xlab="Total Claim Amount",
     ylab="Claim Amount")
abline(lm(claims$Claim.Amount ~ claims$Total.Claim.Amount), col="red")
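
In a model formula, the variable on the left of `~` is the dependent variable. The following sketch uses made-up, exactly linear data to show that swapping the two sides yields a different model.

```r
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1  # Made-up, exactly linear relationship

model <- lm(y ~ x)     # y is the dependent variable
coef(model)            # Intercept 1, slope 2

reversed <- lm(x ~ y)  # Now x is the dependent variable
coef(reversed)         # Intercept -0.5, slope 0.5
```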

## Exercise 11: statistical testing

Test the following null-hypotheses:
* The gender distribution of this R course today is random
* The claim amount (from the claims data set) is normally distributed
* The claim amount (from the claims data set) is exponentially distributed. The rate (lambda) of an exponential distribution is defined as 1 / expected value
* There is no difference in the average claim amount between employed and unemployed (from variable employment status). Can the t-test be used? If not, use the non-parametric alternative ``wilcox.test()``

In [None]:
hist(claims$Claim.Amount)
mean(claims$Claim.Amount)
hist(rexp(10000, rate=1/mean(claims$Claim.Amount)))
ks.test(claims$Claim.Amount, "pexp", rate=1/mean(claims$Claim.Amount))

In [None]:
table(claims$EmploymentStatus)
claims_amount_employed <- claims[claims$EmploymentStatus == "Employed", ]$Claim.Amount
claims_amount_unemployed <- claims[claims$EmploymentStatus == "Unemployed", ]$Claim.Amount
t.test(claims_amount_employed, claims_amount_unemployed)
wilcox.test(claims_amount_employed, claims_amount_unemployed)
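
For the first two hypotheses, one possible approach is an exact binomial test and a Shapiro-Wilk test. This is only a sketch: the head counts are hypothetical, the expected share of 50% is an assumption, and simulated skewed data stand in for the claim amount. Note that `shapiro.test()` accepts at most 5000 values.

```r
# Hypothetical head count for the course (replace with the real numbers)
n.women <- 12
n.total <- 20

# H0: each participant is a woman with probability 0.5 (an assumption)
gender.test <- binom.test(n.women, n.total, p = 0.5)
gender.test$p.value  # A large p-value gives no reason to reject H0

# Simulated, right-skewed data standing in for the claim amount
set.seed(1)
skewed <- rexp(200, rate = 1 / 800)

# H0: the data are normally distributed
normality.test <- shapiro.test(skewed)
normality.test$p.value  # A small p-value speaks against normality
```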

## Exercise 12: data manipulation

Get all claims with open complaints and sort by claim amount (descending). Try with and without the pipe operator ``|>``.

In [None]:
library(dplyr)

In [None]:
arrange(filter(select(claims, 
                      Customer,
                      Claim.Amount,
                      Number.of.Open.Complaints), 
               Number.of.Open.Complaints >= 1), 
        desc(Claim.Amount))

In [None]:
claims |>
    select(Customer, Claim.Amount, Number.of.Open.Complaints) |>
    filter(Number.of.Open.Complaints >= 1) |>
    arrange(desc(Claim.Amount))
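
The pipe operator is not tied to `dplyr`: `x |> f(y)` is simply another way of writing `f(x, y)`. A small sketch with base R functions and made-up numbers, assuming R 4.1 or newer (where the native pipe was introduced):

```r
values <- c(5, 3, 9, 1, 7)

# Nested calls are read from the inside out
nested <- head(sort(values, decreasing = TRUE), 3)

# The pipe passes the left-hand side as the first argument of the next call
piped <- values |>
  sort(decreasing = TRUE) |>
  head(3)

piped                     # 9 7 5
identical(nested, piped)  # TRUE
```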

## Exercise 13: correlation analyses

Are there any correlations between Claim Amount and Total Claim Amount / Months Since Policy Inception / Income?
Please use the appropriate statistical correlation tests. 

In [None]:
cor.test(claims$Claim.Amount, claims$Total.Claim.Amount, method="spearman")
cor.test(claims$Claim.Amount, claims$Months.Since.Policy.Inception, method="spearman")
cor.test(claims$Claim.Amount, claims$Income, method="spearman")

In [None]:
cor(claims[, c(5, 12, 17)], method = "spearman")
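
The solution uses Spearman's rank correlation, presumably because the claim data are not normally distributed. The sketch below, on made-up data, illustrates how the rank-based and the linear (Pearson) coefficient can differ for a monotone but non-linear relationship.

```r
x <- 1:10
y <- exp(x)  # Made-up data: monotone, but strongly non-linear

rho.pearson  <- cor(x, y, method = "pearson")
rho.spearman <- cor(x, y, method = "spearman")

rho.pearson   # Clearly below 1: the relationship is not linear
rho.spearman  # 1: the ranks of x and y agree perfectly
```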

## Exercise 14: discover advanced statistical modelling methods in R

Depending on your interest, do one or more of these exercises.

1. Improve the model for predicting the claim amount (e.g. by introducing new co-variates).
2. Read more on ANOVA in R:
  - [Data Camp tutorial on ANOVA](https://www.statmethods.net/stats/anova.html) 
  - [Tutorial for one-way ANOVA](http://www.sthda.com/english/wiki/one-way-anova-test-in-r)
  - [Tutorial for two-way ANOVA](http://www.sthda.com/english/wiki/two-way-anova-test-in-r)
  - [ANOVA with package car (in German)](https://statistikguru.de/r/r-anova-ancova-manova.html)
3. Go to the [CRAN Task Views](http://cran.r-project.org/web/views/) and find a package containing your advanced method. Open its reference manual and (if available) its vignettes. For instance, you could try to run a neural network with the package ``nnet`` by opening the help with ``?nnet`` and reproducing the given example.

## Exercise 15: create your first ``ggplot``

Consult the [ggplot2 references](https://ggplot2.tidyverse.org/reference/index.html) for inspiration and guidance.

## Exercise 16: working with `apply()`
- Create a copy of `aphasiker` with only numeric columns
- Find the minimum, maximum and standard deviation of each column

In [None]:
numeric.aphasiker <- aphasiker[, c(3,4, 6:14)]
apply(numeric.aphasiker, 2, min, na.rm = T)
apply(numeric.aphasiker, 2, max, na.rm = T)
apply(numeric.aphasiker, 2, sd, na.rm = T)
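
The second argument of `apply()` selects the direction: 1 works row by row, 2 column by column. A sketch on a small made-up data frame (`toy`), with `sapply()` as a column-wise alternative for data frames:

```r
# Made-up data frame with some missing values
toy <- data.frame(age = c(61, 47, NA, 70), score = c(12, 30, 25, NA))

col.max <- apply(toy, 2, max, na.rm = TRUE)  # Over columns
col.max  # age 70, score 30

row.max <- apply(toy, 1, max, na.rm = TRUE)  # Over rows
row.max  # 61 47 25 70

# sapply() also works column by column on a data frame
sapply(toy, max, na.rm = TRUE)
```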

## Exercise 17: programming in R
- Write an if/else clause
- Write a for loop
- Write a while loop
- Write a function that could replace the `mean` function

In [None]:
# Repeat this with different values for hungry
# Also try using conditions that have to be evaluated, like "n - 4"
hungry <- TRUE
if (hungry) {
  print("I'll cook something.")
  print("And then I'll eat it.")
  print("Afterwards I need to wash the dishes.")
} else {
  print("I think I'll go running instead.")
  print("Oh, my MP3 player isn't charged.")
  print("Maybe I'll just read a book?")
}

In [None]:
# Each member is assigned to the variable you specify
nums <- runif(10, 0, 100)  # 10 random numbers between 0 and 100
for (i in nums) {  # Each member of nums is assigned to i for the block's body
  print(i)
}

In [None]:
i <- 0
while (i < 10) {
  print(i)
  i <- i + 1
}

In [None]:
mittel <- function(numbers) {
  # This returns the mean of the numbers passed to the function.
  sum(numbers) / length(numbers)  # The result is returned to the caller
}
nums <- 1:10  # Generate a series of numbers
mean(nums) == mittel(nums)  # Result is true: it worked!
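
To come closer to the built-in `mean`, the function could also handle missing values. The following is a sketch of one possible extension; the name `mittel2` and its `na.rm` argument are illustrative. The comparison uses `all.equal()`, which tolerates tiny floating-point differences that `==` would not.

```r
# Illustrative extension of mittel(): optionally ignore NAs
mittel2 <- function(numbers, na.rm = FALSE) {
  if (na.rm) {
    numbers <- numbers[!is.na(numbers)]  # Drop missing values first
  }
  sum(numbers) / length(numbers)
}

with.na <- c(0.1, 0.2, NA, 0.4)  # Made-up values including an NA

mittel2(with.na)                # NA, just like mean(with.na)
mittel2(with.na, na.rm = TRUE)  # Mean of the remaining three values

# Compare with the built-in function, allowing for rounding error
isTRUE(all.equal(mittel2(with.na, na.rm = TRUE), mean(with.na, na.rm = TRUE)))
```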