# R Basics - Statistical Data Analysis

### Dr. Matthias Duschl & Dr. Daniel Lee, 2020

#### Solutions for the exercises

In [None]:
aphasiker <- read.table("data/Aphasiker.csv", sep=";", dec=",", header=TRUE)
claims <- read.table("data/Auto_Insurance_Claims_Sample.csv", sep=",", header=TRUE)

## Exercise 1: examine the Aphasiker data set

Learn about the data by applying the functions below to the Aphasiker data.

In [None]:
head(aphasiker)

In [None]:
tail(aphasiker)

In [None]:
nrow(aphasiker)

In [None]:
colnames(aphasiker)

In [None]:
summary(aphasiker)

In [None]:
str(aphasiker)

## Exercise 2: Objects
TODO: Renumber ensuing exercises.

Cover the material from the "Objects in R" section of Lecture.ipynb onwards:

- Working with data types
  - Numeric
  - Character
  - Logical
- Creating and combining vectors
- Extracting individual elements and ranges of elements using ranges and vectors
- Use some functions
- Play with ``NA``s
- Show how to use ``na.rm = TRUE``
- Sort stuff
- Filter it using
  - Boolean expressions
  - Vectors
  - Element-wise logical combinations (``&``, ``|``)
  
TODO: The lecture material from http://localhost:8888/notebooks/Downloads/mara-r-basics-2020/Lecture.ipynb#Random-numbers onwards still needs triaging.
The lecture has been reviewed up to http://localhost:8888/lab#The-apply()-function; the corresponding exercises still need to be moved in here.
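
The topics listed above could be walked through with a short sketch like the following. The vectors and values are made up for illustration and are not taken from the lecture notebook.

```r
# Data types: numeric, character, logical
n <- 3.14
s <- "aphasia"
b <- TRUE
class(n); class(s); class(b)

# Creating and combining vectors
a <- c(4, 8, 15)
v <- c(a, 16, NA, 42)   # NA marks a missing value

# Extracting single elements and ranges
v[1]          # first element (R indexes from 1)
v[2:4]        # elements 2 to 4
v[c(1, 6)]    # elements picked by an index vector

# Functions and NA handling
mean(v)                # NA, because one value is missing
mean(v, na.rm = TRUE)  # drops the NA before averaging

# Sorting (sort() drops NAs by default)
sorted <- sort(v, decreasing = TRUE)

# Filtering with logical expressions; & and | work element-wise
big <- v[!is.na(v) & v > 10]
edge <- v[!is.na(v) & (v < 5 | v > 40)]
```

Checking ``is.na()`` first keeps missing values out of the filtered result; without it, the ``NA`` would be carried into the output.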

## Exercise 2: working with random numbers

* Generate a sample of 100 normally distributed numbers centered at 0 with a standard deviation of 200.
* What's their standard deviation? Is it really what you specified?
* What's their variance?
* Sort them in ascending order.
* Sort them in descending order.
* Filter out negative values.
* Do the exercises again for a series of exponentially distributed values, this time filtering out values under 1.

In [None]:
series <- rnorm(100, sd = 200)

In [None]:
sd(series)

In [None]:
var(series)

In [None]:
sort(series)

In [None]:
sort(series, decreasing = TRUE)

In [None]:
series[series > 0]

In [None]:
series <- rexp(100)
sd(series)
var(series)
sort(series)
sort(series, decreasing = TRUE)
series[series > 1]
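
On the question of whether the sample's standard deviation matches the specified one: it generally will not match exactly, because the sample is random. The sketch below illustrates this, assuming an arbitrary seed (``42``) and a second, larger sample size (``100000``) that are not part of the exercise.

```r
set.seed(42)  # arbitrary seed, chosen only so the run is reproducible

small <- rnorm(100, mean = 0, sd = 200)
large <- rnorm(100000, mean = 0, sd = 200)

sd(small)  # close to 200, but not exactly 200
sd(large)  # typically much closer to 200

# The variance is the square of the standard deviation
all.equal(var(small), sd(small)^2)
```

The larger the sample, the closer the estimate tends to be to the parameter passed to ``rnorm()``.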

## Exercise 3: working with data.frames

Answer the following questions using the aphasiker data:
* How many patients are in the data?
* What is the average age of the patients?
* How many women are among the patients? (Make a frequency table)

Answer the following questions using the claims file:
* What are the mean, median, minimum and maximum of the Total Claim Amount?
* What is the average Total Claim Amount for highly qualified customers (Doctor or Master) versus all other insured customers?

In [None]:
nrow(aphasiker)  # Number of rows in the table = number of patients
length(aphasiker$Patienten_ID)  # Length of one variable = number of patients
mean(aphasiker$Alter)
table(aphasiker$Geschlecht) 

In [None]:
mean(claims$Total.Claim.Amount)
median(claims$Total.Claim.Amount)
min(claims$Total.Claim.Amount)
max(claims$Total.Claim.Amount)

In [None]:
summary(claims$Total.Claim.Amount)

In [None]:
# Flag customers whose education level is Doctor or Master
claims$high.qualified <- claims$Education %in% c("Doctor", "Master")
# Compare the mean Total Claim Amount between the two groups
mean(claims$Total.Claim.Amount[claims$high.qualified])
mean(claims$Total.Claim.Amount[!claims$high.qualified])
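
As an alternative to computing each group mean separately, ``tapply()`` can return both at once. The sketch below uses a small made-up data frame (``toy``) in place of the claims file, so the numbers are purely illustrative.

```r
# Toy data standing in for the claims file (values are made up)
toy <- data.frame(
  Education = c("Doctor", "Master", "Bachelor", "College", "Master", "Bachelor"),
  Total.Claim.Amount = c(300, 500, 200, 400, 100, 600)
)

# TRUE for Doctor or Master, FALSE otherwise
toy$high.qualified <- toy$Education %in% c("Doctor", "Master")

# Mean of Total.Claim.Amount within each group
group_means <- tapply(toy$Total.Claim.Amount, toy$high.qualified, mean)
group_means
```

The result is a named vector with one entry per group, which scales to more than two groups without extra code.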

## Exercise 4: Visualize it!
Use the claims data and explore the distribution of the variables visually. For instance,
* create a histogram to show the Income distribution of the customers. Are there values you might want to filter out?
* create a barplot for state, education or claim reason. For this, you need to count the frequencies first using ``table()``.

Try to customize your plots using colors and axes labels. 

In [None]:
hist(claims$Income)
hist(claims[claims$Income>0,]$Income, main="Income Distribution", xlab="Income")

In [None]:
barplot(table(claims$State))
barplot(table(claims$Education))
barplot(table(claims$Claim.Reason))
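
The barplots above use the defaults. One possible way to customize colors and axis labels is sketched below, using a made-up vector (``education``) in place of ``claims$Education``.

```r
# Made-up categories standing in for claims$Education
education <- c("Bachelor", "College", "Bachelor", "Master", "Doctor",
               "College", "Bachelor", "High School or Below")

# Count the frequencies and order the bars from most to least frequent
counts <- sort(table(education), decreasing = TRUE)

barplot(counts,
        col = "steelblue",
        main = "Customers by education",
        xlab = "Education",
        ylab = "Number of customers",
        cex.names = 0.8)  # shrink the category labels slightly
```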

## Exercise 5: Regression lines
Use the claims data and visualize the relationship between "Claim Amount" and "Total Claim Amount".

Try to customize your scatterplot:
* Choose a color for the dots and a different symbol with the parameter ``pch``
* Control the size of the symbols (``?par`` shows you an overview of all graphical parameters)
* Add a linear regression line using ``abline()`` and the coefficients from a linear model ``lm()``: ``abline(lm(var1 ~ var2))``

Which one is the dependent variable in the model?

In [None]:
plot(claims$Total.Claim.Amount, claims$Claim.Amount,
     pch=19, col = "darkgrey", cex=0.5,
     xlab="Total Claim Amount",
     ylab="Claim Amount")
abline(lm(claims$Claim.Amount ~ claims$Total.Claim.Amount), col="red")
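
In a model formula ``y ~ x``, the variable on the left of the tilde is the dependent variable, so in the solution above it is ``Claim.Amount``, which is also plotted on the y-axis. The sketch below shows the same pattern on simulated data; the names ``sim``, ``x`` and ``y`` and all coefficients are assumptions for illustration only.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-ins: y depends linearly on x plus noise
x <- runif(200, 0, 1000)
y <- 50 + 2 * x + rnorm(200, sd = 100)
sim <- data.frame(x = x, y = y)

# y is the dependent variable, x the explanatory variable
fit <- lm(y ~ x, data = sim)
coef(fit)  # estimated intercept and slope

# The explanatory variable goes on the x-axis, the dependent one on the y-axis
plot(sim$x, sim$y, pch = 19, col = "darkgrey", cex = 0.5,
     xlab = "x", ylab = "y")
abline(fit, col = "red")
```

Passing the data frame through ``data =`` keeps the formula short and gives the coefficients readable names.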

## Exercise 6: statistical testing

Test the following null hypotheses:
* The gender distribution of this R course today is random
* The claim amount (from the claims data set) is normally distributed
* The claim amount (from the claims data set) is exponentially distributed. The rate/lambda of an exponential distribution is defined as 1/expected value
* There is no difference in the average claim amount between employed and unemployed customers (from the variable ``EmploymentStatus``). Can the t-test be used? If not, use the non-parametric alternative ``wilcox.test()``

In [None]:
hist(claims$Claim.Amount)
mean(claims$Claim.Amount)
hist(rexp(10000, rate=1/mean(claims$Claim.Amount)))
ks.test(claims$Claim.Amount, "pexp", rate=1/mean(claims$Claim.Amount))

In [None]:
table(claims$EmploymentStatus)
claims_amount_employed <- claims[claims$EmploymentStatus == "Employed", ]$Claim.Amount
claims_amount_unemployed <- claims[claims$EmploymentStatus == "Unemployed", ]$Claim.Amount
t.test(claims_amount_employed, claims_amount_unemployed)
wilcox.test(claims_amount_employed, claims_amount_unemployed)
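
The cells above cover the exponential fit and the group comparison. For the first two hypotheses, one possible approach is sketched below. The head count (7 women among 20 participants) is hypothetical, and ``amounts`` is a simulated stand-in for ``claims$Claim.Amount``. Note that estimating the distribution parameters from the same data makes the Kolmogorov-Smirnov p-value only approximate.

```r
# Hypothetical head count for the course: 7 women among 20 participants.
# Null hypothesis: each participant is a woman with probability 0.5.
gender_test <- binom.test(7, 20, p = 0.5)
gender_test$p.value

# Simulated, right-skewed stand-in for the claim amounts
set.seed(7)  # arbitrary seed for reproducibility
amounts <- rexp(1000, rate = 1 / 800)

# Null hypothesis: the amounts follow a normal distribution
# with the sample's own mean and standard deviation
norm_test <- ks.test(amounts, "pnorm", mean = mean(amounts), sd = sd(amounts))
norm_test$p.value
```

A large p-value in the first test means the observed split is compatible with chance; a very small p-value in the second speaks against normality.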

## Exercise 7: data manipulation

Get all claims with open complaints and sort by claim amount (descending). Try with and without the pipe operator ``%>%``.

In [None]:
library(dplyr)

In [None]:
arrange(filter(select(claims, 
                      Customer,
                      Claim.Amount,
                      Number.of.Open.Complaints), 
               Number.of.Open.Complaints >= 1), 
        desc(Claim.Amount))

In [None]:
claims %>%
    select(Customer, Claim.Amount, Number.of.Open.Complaints) %>%
    filter(Number.of.Open.Complaints >= 1) %>%
    arrange(desc(Claim.Amount))

## Exercise 8: correlation analyses

Are there any correlations between Claim Amount and Total Claim Amount / Months Since Policy Inception / Income?
Please use the appropriate statistical correlation tests. 

In [None]:
cor.test(claims$Claim.Amount, claims$Total.Claim.Amount, method="spearman")
cor.test(claims$Claim.Amount, claims$Months.Since.Policy.Inception, method="spearman")
cor.test(claims$Claim.Amount, claims$Income, method="spearman")

In [None]:
cor(claims[, c(5, 12, 17)], method = "spearman")
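
The solution uses Spearman's rank correlation, which only assumes a monotonic relationship and does not require normally distributed variables. The sketch below contrasts it with Pearson's coefficient on simulated, skewed data; ``x`` and ``y`` are made-up variables, not columns of the claims file.

```r
set.seed(3)  # arbitrary seed for reproducibility

# A skewed variable and a strictly increasing, non-linear function of it
x <- rexp(500)
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotonic association

pearson
spearman
```

Because ``y`` increases strictly with ``x``, the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient is lower because the relationship is not linear.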

## Exercise 9: discover advanced statistical modelling methods in R

Depending on your interest, do one or more of these exercises.

1. Improve the model for predicting the claim amount (e.g. by introducing new covariates).
2. Do some non-linear curve fitting. Try to run and understand the code from these tutorials:
    - [On curve fitting using R](http://davetang.org/muse/2013/05/09/on-curve-fitting/) - shows how to estimate a linear model with polynomials
    - [Simple nonlinear least squares curve fitting in R](http://www.walkingrandomly.com/?p=5254) - shows how to fit a nonlinear least-squares model
3. Read more on ANOVA in R:
    - [Data Camp page on ANOVA](https://www.statmethods.net/stats/anova.html)
    - [Description of using factor analysis in psychological research](http://personality-project.org/r/r.guide.html#factoranal)
4. Go to the [CRAN Task Views](http://cran.r-project.org/web/views/) and find a package containing your advanced method. Open its reference manual and (if available) its vignettes. For instance, you could try to run a neural network with the package ``nnet`` by opening the help with ``?nnet`` and reproducing the given example.

## Exercise 10: create your first ``ggplot``

Consult the [ggplot2 references](https://ggplot2.tidyverse.org/reference/index.html) for inspiration and guidance.
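
As a possible starting point, the sketch below builds a scatterplot with one regression line per group. It assumes the ``ggplot2`` package is installed, and it uses a simulated data frame (``toy``) whose column names merely imitate the claims data; the values are made up.

```r
library(ggplot2)

set.seed(10)  # arbitrary seed for reproducibility

# Simulated stand-in for the claims data
toy <- data.frame(
  Income = runif(100, 10000, 90000),
  Gender = sample(c("F", "M"), 100, replace = TRUE)
)
toy$Claim.Amount <- 300 + 0.005 * toy$Income + rnorm(100, sd = 50)

# Map columns to aesthetics, then add layers with `+`
p <- ggplot(toy, aes(x = Income, y = Claim.Amount, colour = Gender)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Claim amount by income", x = "Income", y = "Claim amount")

print(p)
```

Each ``+`` adds a layer or a setting to the plot object, so the plot can be built up and inspected step by step.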

In [None]:

############
# Challenges
############

## These are special challenges for the adventurous among you.
# 1. Test a condition to control the program's flow
# Repeat this with different values for hungry
# Also try using conditions that have to be evaluated, like "n - 4"
hungry <- TRUE
if (hungry) {
  print("I'll cook something.")
  print("And then I'll eat it.")
  print("Afterwards I need to wash the dishes.")
} else {
  print("I think I'll go running instead.")
  print("Oh, my MP3 player isn't charged.")
  print("Maybe I'll just read a book?")
}

# 2. Repeat a section of code as long as a condition is true
i <- 0
while (i < 10) {
  print(i)
  i <- i + 1
}

# 3. Repeat a section of code for every member of a set
# Each member is assigned to the variable you specify
nums <- runif(10, 0, 100)  # 10 random numbers between 0 and 100
for (i in nums) {  # Each member of nums is assigned to i for the block's body
  print(i)
}

# 4. Write your own function!
mittel <- function(numbers) {
  # This returns the mean of the numbers passed to the function.
  sum(numbers) / length(numbers)  # The result is returned to the caller
}
nums <- 1:10  # Generate a series of numbers
mean(nums) == mittel(nums)  # Result is true: it worked!

