*Managerial Problem Solving*

# Tutorial 9 - Hypothesis Testing and Regression Analysis

Toni Greif<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2019

## Hypothesis Testing
Hypothesis testing draws inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters.

- $H_0$  Null hypothesis: describes an existing theory (conservative, adversarial)
- $H_1$  Alternative hypothesis: the complement of $H_0$ 

Using sample data, we either:
- reject $H_0$ and conclude the sample data provides sufficient evidence to support $H_1$, or
- fail to reject $H_0$ and conclude the sample data does not support $H_1$.

### Understanding Risk in Hypothesis Testing
A test has four possible outcomes, two of which are incorrect conclusions:
- $H_0$ is true and the test correctly fails to reject $H_0$
- $H_0$ is false and the test correctly rejects $H_0$
- $H_0$ is true and the test incorrectly rejects $H_0$ (called a *Type I error*)
- $H_0$ is false and the test incorrectly fails to reject $H_0$ (called a *Type II error*)
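
The Type I error rate can be illustrated with a small simulation. The following is only a sketch: the data are simulated so that $H_0$ is true by construction, and the sample size, number of repetitions, seed, and significance level are arbitrary assumptions.

```R
# Simulate many samples for which H0 (true mean = 0) actually holds
set.seed(42)            # arbitrary seed, for reproducibility only
alpha <- 0.05           # assumed significance level
n_repetitions <- 2000   # assumed number of simulated samples

p_values <- replicate(
  n_repetitions,
  t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value
)

# Share of samples in which the true H0 is wrongly rejected (Type I error)
type1_rate <- mean(p_values < alpha)
type1_rate
```

Under these assumptions, the observed share should lie close to `alpha`, since the significance level is the Type I error probability we are willing to accept.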

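We are typically most concerned about Type I errors, because $H_0$ is chosen as the conservative default:
- Innocent person convicted ($H_0$: the defendant is innocent)
- Ineffective treatment approved ($H_0$: the treatment has no effect)
- Sick person considered healthy ($H_0$: the person is sick)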

### Steps of Hypothesis Testing procedures
1. Identify the population parameter and formulate the hypotheses to test.
2. Select a level of significance (related to the risk of drawing an incorrect conclusion).
3. Determine a decision rule on which to base a conclusion.
4. Collect data and calculate a test statistic.
5. Apply the decision rule and draw a conclusion.

The key competences in hypothesis testing are choosing the correct test statistic and interpreting the results (critical value, p-value, confidence interval, ...).
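
As a sketch of steps 3 and 5, a right-tailed t-test can be decided either by comparing the test statistic with a critical value or by comparing the p-value with the significance level. The values of `t_stat` and `dof` below are hypothetical and only serve as an illustration.

```R
alpha  <- 0.05   # significance level (step 2)
t_stat <- 2.1    # hypothetical test statistic (step 4)
dof    <- 24     # hypothetical degrees of freedom (n - 1)

# Decision rule via the critical value of the t-distribution
t_crit <- qt(1 - alpha, df = dof)
reject_by_critical_value <- t_stat > t_crit

# Decision rule via the p-value (right tail)
p_value <- pt(t_stat, df = dof, lower.tail = FALSE)
reject_by_p_value <- p_value < alpha

c(t_crit = t_crit, p_value = p_value)
c(reject_by_critical_value, reject_by_p_value)
```

Both rules always lead to the same decision; they are two views of the same comparison.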

### Computing the Test Statistic
**One-sample test on a mean, σ unknown**

$$t=\frac{\bar{x}-\mu_0}{s\ /\sqrt{n}}$$


**One-sample test on a proportion**

$$z=\frac{\hat{p}-\pi_0}{\sqrt{\pi_0(1-\pi_0)\ /n}}$$

with $\hat{p}=\frac{\text{number of sample items with the characteristic of interest}}{\text{size of the sample}}$
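
The proportion formula can be written as a small helper function. This is a minimal sketch: the name `prop_z` and the example counts (40 of 200, $\pi_0 = 0.25$) are made up for illustration. As a plausibility check, the squared z-statistic should match the chi-squared statistic reported by `prop.test()` without continuity correction.

```R
# z-statistic for a one-sample test on a proportion (hypothetical helper)
prop_z <- function(count, n, pi_0) {
  p_hat <- count / n
  (p_hat - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)
}

# Made-up example: 40 of 200 sampled items show the characteristic
z <- prop_z(count = 40, n = 200, pi_0 = 0.25)
z

# Built-in test on the same numbers; its statistic equals z^2
chi_sq <- unname(prop.test(40, 200, p = 0.25, correct = FALSE)$statistic)
c(z_squared = z^2, chi_squared = chi_sq)
```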

However, we will rely on pre-installed test functions in most applications.

### Exercise 1

Use the mtcars data set to test the following hypothesis:
- The average mpg of a car is below 20.

In [1]:
library(tidyverse)


In [2]:
df <- mtcars

Manual calculation of the t-statistic:
$$t=\frac{\bar{x}-\mu_0}{s\ /\sqrt{n}}$$
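
One possible way to carry out the manual calculation for `mpg` with $\mu_0 = 20$ is sketched below. The variable names are illustrative, and the right-tailed p-value assumes the alternative $H_1$: mean(mpg) > 20.

```R
df <- mtcars

mu_0  <- 20            # hypothesized mean
x_bar <- mean(df$mpg)  # sample mean
s     <- sd(df$mpg)    # sample standard deviation
n     <- nrow(df)      # sample size

# t-statistic according to the formula above
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
t_stat

# Right-tailed p-value from the t-distribution with n - 1 degrees of freedom
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)
p_value
```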

...and now using the pre-installed function:

```R
    t.test()
```

- $H_0: \mu_{mpg} \leq 20$
- $H_1: \mu_{mpg} > 20$
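
A possible call for these hypotheses is sketched below; `alternative = "greater"` encodes the right-tailed $H_1$.

```R
df <- mtcars

# One-sample t-test of H0: mean <= 20 against H1: mean > 20
result <- t.test(df$mpg, mu = 20, alternative = "greater")
result

# Individual components of the result
result$statistic   # t-statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
```

$H_0$ is rejected at the 5% level only if `result$p.value` is below 0.05.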

Use the pre-installed function to test the following hypothesis:
- Cars with more than 4 cylinders have a lower mpg than cars with 4 or fewer cylinders.
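
One way to set this up is a two-sample t-test with $H_0$: mean mpg (more than 4 cylinders) $\geq$ mean mpg (4 or fewer cylinders), and $H_1$ as the complement. The sketch below relies on the default of `t.test()`, the Welch test without the equal-variance assumption; whether that is the variant intended here is an assumption.

```R
df <- mtcars

# Split mpg into the two groups
many_cyl <- df$mpg[df$cyl > 4]
few_cyl  <- df$mpg[df$cyl <= 4]

# H1: mean(many_cyl) < mean(few_cyl), hence alternative = "less"
result <- t.test(many_cyl, few_cyl, alternative = "less")
result

c(mean_many = mean(many_cyl), mean_few = mean(few_cyl))
```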

### Exercise 2

The file *roomInspections.csv* summarizes the room inspection results of a hotel chain. The sample covers 1000 inspected hotel rooms.

In [3]:
df <- read.csv("data/T09/roomInspections.csv")

In [4]:
df %>% head()

| room | roomOk |
|---|---|
| `<int>` | `<lgl>` |
| 1 | True |
| 2 | True |
| 3 | True |
| 4 | False |
| 5 | True |
| 6 | True |



The management wants the share of rooms not matching the standard to be below 2%. Formulate a suitable hypothesis and test it.
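
A possible formulation is $H_0: \pi \geq 0.02$ against $H_1: \pi < 0.02$, where $\pi$ is the share of rooms not matching the standard. The sketch below shows the mechanics with `prop.test()`. Because the CSV file is not available here, it uses a made-up stand-in data frame `rooms` with the same two columns; the count of 12 failed rooms is purely hypothetical and says nothing about the real data.

```R
# Hypothetical stand-in for the inspection data (NOT the real file contents)
rooms <- data.frame(
  room   = 1:1000,
  roomOk = rep(c(TRUE, FALSE), times = c(988, 12))
)

not_ok <- sum(!rooms$roomOk)   # number of rooms not matching the standard
n      <- nrow(rooms)

# Left-tailed test of H0: pi >= 0.02 against H1: pi < 0.02
result <- prop.test(not_ok, n, p = 0.02, alternative = "less", correct = FALSE)
result$estimate
result$p.value
```

With the real data, `rooms` would be replaced by the data frame read from the CSV file.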

### Exercise 3
A retailer believes that a new marketing strategy can increase revenue. Until now, customer spending in 15 different categories has averaged 70.00€, both for customers aged 18 to 34 and for customers aged 35 and above. After the new marketing strategy is launched, customer spending is analyzed.

1. Set up the hypotheses to test the success of the marketing strategy.
2. 300 of the surveyed customers are aged between 18 and 34. Their average spending is 75.86€ with a standard deviation of 50.90€. Has the average spending changed significantly?
3. 700 of the surveyed customers are aged 35 or above. Their average spending is 68.53€ with a standard deviation of 45.29€. Has the average spending of this group changed significantly?
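
Since only summary statistics are given, the t-statistic has to be computed from the formula rather than with `t.test()`. The sketch below uses a hypothetical helper `t_from_summary` and assumes the right-tailed setup $H_0: \mu \leq 70$, $H_1: \mu > 70$, which matches the retailer's belief that spending increases. A two-sided test would be the alternative reading of "changed".

```R
# One-sample t-test from summary statistics (hypothetical helper)
t_from_summary <- function(x_bar, s, n, mu_0) {
  t_stat  <- (x_bar - mu_0) / (s / sqrt(n))
  p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # right tail
  c(t = t_stat, df = n - 1, p_value = p_value)
}

# Customers aged 18 to 34
young <- t_from_summary(x_bar = 75.86, s = 50.90, n = 300, mu_0 = 70)

# Customers aged 35 and above
older <- t_from_summary(x_bar = 68.53, s = 45.29, n = 700, mu_0 = 70)

rbind(young, older)
```

Each group's $H_0$ is rejected at the 5% level if its `p_value` is below 0.05.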

## Regression Analysis
### Results of Regression Analysis

**Information on model quality:**
- Standard error (SE)
    - Typical deviation of the observed data from the values predicted by the model
- Pearson correlation coefficient $(R)$
    - Strength and direction of linear correlation $(-1 \leq R \leq 1)$
- Coefficient of determination $(R^2)$
    - Share of the variance in the dependent variable explained by the model ('predictive power')
    
**Intercept and slope of regression function (Regression coefficients)**

**Confidence intervals**
- Interval that covers the true regression coefficient at a 95% confidence level (95% of intervals constructed this way from repeated samples contain the true value)
    - If 0 is covered by the interval, the coefficient is not statistically significant at the 5% level
    - The same information is conveyed by the coefficients’ p-values (significant if p-value < 0.05)
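
These quantities can all be read from a fitted `lm` object. The following sketch uses a simple regression of `mpg` on `wt` from the built-in `mtcars` data purely as an illustration; the choice of variables is an assumption and not part of the exercises.

```R
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

# Model quality
s$sigma                        # residual standard error
r <- cor(mtcars$mpg, mtcars$wt)  # Pearson correlation coefficient
s$r.squared                    # coefficient of determination

# Intercept and slope
coef(fit)

# 95% confidence intervals and p-values of the coefficients
ci       <- confint(fit, level = 0.95)
p_values <- s$coefficients[, "Pr(>|t|)"]

# A coefficient is significant at 5% exactly when its interval excludes 0
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
cbind(ci, p_value = p_values, excludes_zero)
```

In a simple regression with one independent variable, `s$r.squared` equals the squared Pearson correlation `r^2`.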


Load the dataset “income.csv”.

In [5]:
income <- read.csv("data/T09/income.csv")

In [6]:
income %>% head()

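| Population | Income | Illiteracy | Life.Exp | Murder | HS.Grad | Frost | Area |
|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` |
| 3615 | 3624 | 2.1 | 69.05 | 15.1 | 41.3 | 20 | 50708 |
| 365 | 6315 | 1.5 | 69.31 | 11.3 | 66.7 | 152 | 566432 |
| 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 |
| 2110 | 3378 | 1.9 | 70.66 | 10.1 | 39.9 | 65 | 51945 |
| 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 |
| 2541 | 4884 | 0.7 | 72.06 | 6.8 | 63.9 | 166 | 103766 |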


Perform a multiple linear regression. Use the income as the dependent variable and all other variables as independent variables.
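
A possible way to fit the initial model is sketched below. Because the CSV file is not available here, the sketch uses R's built-in `state.x77` data as a stand-in. This is an assumption: its columns and first rows match the preview above, but it is not guaranteed to be identical to *income.csv*.

```R
# Stand-in for income.csv (assumption): built-in US state data
income <- data.frame(state.x77)   # column names become Life.Exp, HS.Grad

# Income as dependent variable, all other columns as independent variables
full_model <- lm(Income ~ ., data = income)

# Coefficient estimates, standard errors, t-values and p-values
summary(full_model)$coefficients
```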

After fitting the initial model, keep removing the insignificant independent variables (5% significance level). Which independent variables have a significant influence on the life expectancy of the state inhabitants?

What share of total variance in the data can be explained by our regression model?
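
The stepwise removal and the explained share of variance can be sketched as follows. The helper `backward_eliminate` is hypothetical, the built-in `state.x77` data serve as an assumed stand-in for *income.csv*, and the response is set to `"Income"` as in the regression instruction above; it can be changed to `"Life.Exp"` to analyze life expectancy instead.

```R
# Stand-in for income.csv (assumption): built-in US state data
states <- data.frame(state.x77)

# Repeatedly drop the least significant predictor until all remaining
# predictors have p-values below alpha (hypothetical helper)
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit      <- lm(reformulate(predictors, response = response), data = data)
    coefs    <- summary(fit)$coefficients
    p_values <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    if (max(p_values) < alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(which.max(p_values)))
  }
}

final_model <- backward_eliminate(states, response = "Income")

# Remaining predictors with their p-values
final_p <- summary(final_model)$coefficients[-1, "Pr(>|t|)"]
final_p

# Share of total variance explained by the final model
r_squared <- summary(final_model)$r.squared
r_squared
```

Removing one variable at a time matters because p-values of the remaining variables change after each refit.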