*Managerial Problem Solving*

# Tutorial 9 - Hypothesis Testing and Regression Analysis

Toni Greif<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

Summer Semester 2019

## Hypothesis Testing
Hypothesis testing draws inferences about two contrasting propositions (each called a hypothesis) concerning the value of one or more population parameters.

- $H_0$  Null hypothesis: describes an existing theory (conservative, adversarial)
- $H_1$  Alternative hypothesis: the complement of $H_0$ 

Using sample data, we either:
- reject $H_0$ and conclude the sample data provides sufficient evidence to support $H_1$, or
- fail to reject $H_0$ and conclude the sample data does not support $H_1$.

### Understanding Risk in Hypothesis Testing
We always risk drawing an incorrect conclusion. Four outcomes are possible:
- $H_0$ is true and the test correctly fails to reject $H_0$
- $H_0$ is false and the test correctly rejects $H_0$
- $H_0$ is true and the test incorrectly rejects $H_0$ (called a *Type I error*)
- $H_0$ is false and the test incorrectly fails to reject $H_0$ (called a *Type II error*)

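We are typically most concerned about Type I errors, for example:
- Innocent person convicted ($H_0$: the defendant is innocent)
- Ineffective treatment approved ($H_0$: the treatment has no effect)
- Sick person considered healthy ($H_0$: the person is sick)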
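
The Type I error rate can be illustrated with a small simulation. The sketch below is not part of the original tutorial; the sample size, the number of repetitions, and the seed are arbitrary assumptions. It repeatedly draws samples for which $H_0$ is true and counts how often a test at the 5% level rejects it anyway.

```R
# Simulate many samples for which H0 (mean = 0) is actually true
set.seed(42)     # arbitrary seed, only for reproducibility
alpha <- 0.05    # level of significance
nSim <- 2000     # hypothetical number of repetitions

# p-value of a two-sided one-sample t-test for each simulated sample of size 30
pValues <- replicate(nSim, t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value)

# Share of simulations in which the true H0 is incorrectly rejected
typeIRate <- mean(pValues < alpha)
typeIRate
```

The resulting share should be close to the chosen level of significance, because the significance level is the Type I error probability we accept.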

### Steps of Hypothesis Testing Procedures
1. Identify the population parameter and formulate the hypotheses to test.
2. Select a level of significance (related to the risk of drawing an incorrect conclusion).
3. Determine a decision rule on which to base a conclusion.
4. Collect data and calculate a test statistic.
5. Apply the decision rule and draw a conclusion.

The key competences in hypothesis testing are choosing the correct test statistic and interpreting the results (critical value, p-value, confidence interval, ...).
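
As a minimal sketch, the five steps can be traced in R as follows. The sample values and the hypothesized mean of 30 are made up for illustration and do not come from the tutorial data.

```R
# Hypothetical sample: delivery times in minutes (made-up values)
x <- c(31, 28, 35, 30, 33, 29, 36, 32, 34, 30)

# 1. Hypotheses: H0: mu <= 30, H1: mu > 30
mu0 <- 30
# 2. Level of significance
alpha <- 0.05
# 3. Decision rule: reject H0 if the t statistic exceeds the critical value
tCrit <- qt(1 - alpha, df = length(x) - 1)
# 4. Test statistic calculated from the sample
tValue <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
# 5. Apply the decision rule
rejectH0 <- tValue > tCrit

c(t = tValue, critical = tCrit)
rejectH0
```

The same decision follows from comparing the p-value of `t.test(x, mu = 30, alternative = "greater")` with `alpha`.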

### Computing the Test Statistics
**One-sample test on a mean, σ unknown**

$$t=\frac{\bar{x}-\mu_0}{s\ /\sqrt{n}}$$


**One-sample test on a proportion**

$$z=\frac{\hat{p}-\pi_0}{\sqrt{\pi_0(1-\pi_0)\ /n}}$$

with $\hat{p}=\frac{\text{number of sample items with the characteristic of interest}}{\text{size of the sample}}$

However, we will rely on pre-installed test functions in most applications.
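
To connect the proportion formula with the built-in function, the following sketch computes $z$ by hand and compares it with `prop.test()`. The counts (58 out of 200) and $\pi_0 = 0.25$ are hypothetical values chosen only for illustration.

```R
# Hypothetical counts: 58 items with the characteristic in a sample of 200
successes <- 58
n <- 200
pi0 <- 0.25      # hypothesized population proportion

# z statistic according to the formula above
pHat <- successes / n
zValue <- (pHat - pi0) / sqrt(pi0 * (1 - pi0) / n)
zValue

# prop.test() reports X-squared, which equals z^2 when the
# continuity correction is switched off (correct = FALSE)
res <- prop.test(successes, n, p = pi0, correct = FALSE)
unname(res$statistic)
zValue^2
```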

### Exercise 1

Use the mtcars data set to test the following hypothesis:
- The average mpg of a car is below 20.

In [1]:
library(tidyverse)


In [2]:
df <- mtcars

Manual calculation of the t-statistic:
$$t=\frac{\bar{x}-\mu_0}{s\ /\sqrt{n}}$$

In [3]:
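# t statistic, then its upper-tail p-value with n - 1 degrees of freedom
tValue <- (mean(df$mpg) - 20) / (sd(df$mpg) / sqrt(nrow(df)))
tValue
pt(tValue, df = nrow(df) - 1, lower.tail = FALSE)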

...and now using the pre-installed function:

```R
    t.test()
```

- $H_0: \mu_{\text{mpg}} \leq 20$
- $H_1: \mu_{\text{mpg}} > 20$

In [4]:
t.test(df$mpg,
       mu = 20,
       alternative = "greater")


	One Sample t-test

data:  df$mpg
t = 0.08506, df = 31, p-value = 0.4664
alternative hypothesis: true mean is greater than 20
95 percent confidence interval:
 18.28418      Inf
sample estimates:
mean of x 
 20.09062 
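
The printed result can also be used programmatically. The sketch below repeats the test on `mtcars$mpg` and reads the individual elements from the returned object; the 5% level of significance is an assumption, since the exercise does not state one.

```R
# Same test as above, stored in an object
res <- t.test(mtcars$mpg, mu = 20, alternative = "greater")
alpha <- 0.05                 # assumed level of significance

res$statistic                 # t value
res$parameter                 # degrees of freedom (n - 1)
res$p.value                   # p-value of the one-sided test

# Decision rule based on the p-value
res$p.value < alpha           # FALSE: fail to reject H0
```

With a p-value of about 0.47, the sample does not provide sufficient evidence that the mean mpg exceeds 20.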


Use the pre-installed function to test the following hypothesis:
- Cars with more than 4 cylinders have a lower mpg than cars with 4 or fewer cylinders.

In [5]:
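# H0: mean(mpg["bigCars"]) >= mean(mpg["smallCars"])
# H1: mean(mpg["bigCars"]) < mean(mpg["smallCars"])

# x = small cars (cyl <= 4), y = big cars (cyl > 4), so
# alternative = "greater" tests mean(small) - mean(big) > 0, which is H1
t.test(df %>% filter(cyl <= 4) %>% select(mpg),
       df %>% filter(cyl > 4) %>% select(mpg),
       alternative = "greater")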


	Welch Two Sample t-test

data:  df %>% filter(cyl <= 4) %>% select(mpg) and df %>% filter(cyl > 4) %>% select(mpg)
t = 6.5737, df = 15.266, p-value = 4.045e-06
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 7.348035      Inf
sample estimates:
mean of x mean of y 
 26.66364  16.64762 
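
An alternative way to write the same comparison is the formula interface of `t.test()`. This is a sketch, not the tutorial's own solution; the helper column `bigCar` is a name introduced here for illustration.

```R
# Copy of the data with a hypothetical grouping column
cars <- mtcars
cars$bigCar <- cars$cyl > 4   # FALSE = 4 or fewer cylinders, TRUE = more than 4

# Groups are ordered FALSE, TRUE, so "greater" tests
# mean(small cars) - mean(big cars) > 0
res <- t.test(mpg ~ bigCar, data = cars, alternative = "greater")
res$statistic
res$p.value
```

Because `var.equal` defaults to `FALSE`, this is again the Welch test and should reproduce the t value of about 6.57 shown above.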


### Exercise 2

The file *roomInspections.csv* summarizes the room inspection results of a hotel chain. The sample covers 1000 inspected hotel rooms.

In [6]:
df <- read.csv("data/T09/roomInspections.csv")

In [7]:
df %>% head()

| room `<int>` | roomOk `<lgl>` |
|---|---|
| 1 | True |
| 2 | True |
| 3 | True |
| 4 | False |
| 5 | True |
| 6 | True |



The management wants the share of rooms not matching the standard to be below 2%. Formulate a suitable hypothesis and test it.

In [8]:
# H0: p(FALSE) <= 0.02
# H1: p(FALSE) > 0.02

prop.test(df %>% filter(roomOk == FALSE) %>% nrow(),
          df %>% nrow(),
          p = 0.02,
          alternative = "greater")


	1-sample proportions test without continuity correction

data:  df %>% filter(roomOk == FALSE) %>% nrow() out of df %>% nrow(), null probability 0.02
X-squared = 0, df = 1, p-value = 0.5
alternative hypothesis: true p is greater than 0.02
95 percent confidence interval:
 0.01390848 1.00000000
sample estimates:
   p 
0.02 


### Exercise 3
A retailer believes that a new marketing strategy can improve revenues. Until now, customer spending across 15 different categories has averaged 70.00€, both for customers aged 18 to 34 and for customers aged 35+. After the new marketing strategy is launched, customer spending is analyzed.

1. Set up the hypotheses to test the success of the marketing strategy.
2. 300 of the surveyed customers are aged between 18 and 34. Their average spending is 75.86€ with a standard deviation of 50.90€. Has the average spending changed significantly?
3. 700 of the surveyed customers are aged 35 or older. Their average spending is 68.53€ with a standard deviation of 45.29€. Has the average spending of this group changed significantly?

In [9]:
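# t statistic for a one-sample test on a mean (sigma unknown)
tStat <- function(xbar, mu0, s, n){
  (xbar - mu0) / (s / sqrt(n))
}

# Customers aged 18 to 34 (one-sided test)
# H0: mean(spending) <= 70
# H1: mean(spending) > 70

tStat(75.86, 70, 50.90, 300)
qt(0.95, 299)    # critical value at the 5% level; reject H0 if t is larger

# Customers aged 35+ (two-sided test)
# H0: mean(spending) = 70
# H1: mean(spending) != 70

tStat(68.53, 70, 45.29, 700)
qt(0.975, 699)   # two-sided critical value; reject H0 if |t| is larger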
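
Instead of comparing with critical values, the same decisions can be reached via p-values. The sketch below assumes a 5% level of significance and treats the first group as a one-sided test and the second group as a two-sided test, following the hypotheses in the comments above.

```R
# t statistic for a one-sample test on a mean
tStat <- function(xbar, mu0, s, n) (xbar - mu0) / (s / sqrt(n))

# Customers aged 18 to 34: one-sided p-value (H1: mean > 70)
t1 <- tStat(75.86, 70, 50.90, 300)
p1 <- pt(t1, df = 300 - 1, lower.tail = FALSE)

# Customers aged 35+: two-sided p-value (H1: mean != 70)
t2 <- tStat(68.53, 70, 45.29, 700)
p2 <- 2 * pt(-abs(t2), df = 700 - 1)

c(t1 = t1, p1 = p1, t2 = t2, p2 = p2)
```

For the younger group the p-value is below 0.05, so $H_0$ is rejected; for the older group it is well above 0.05, so $H_0$ cannot be rejected.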

## Regression Analysis
### Results of Regression Analysis

**Information on model quality:**
- Standard error (SE)
    - Typical deviation of the observed data from the model's predictions
- Pearson correlation coefficient $(R)$
    - Strength and direction of the linear correlation $(-1 \leq R \leq 1)$
- Coefficient of determination $(R^2)$
    - Share of the variance of the dependent variable explained by the model (its 'predictive power')
    
**Intercept and slope of regression function (Regression coefficients)**

**Confidence intervals**
- Interval that covers the true regression coefficient at a 95% confidence level (about 95% of intervals constructed this way contain the true value)
    - If 0 is covered by the interval, the coefficient is not statistically significant at the 5% level
    - The coefficients’ p-values convey the same information: a coefficient is significant exactly when its p-value is below 0.05
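
The link between confidence intervals and p-values can be checked with `confint()`. The model below (`mpg ~ wt + qsec` on `mtcars`) is only an illustrative assumption and is not part of the exercise.

```R
# Illustrative model on a built-in data set
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# 95% confidence intervals and p-values of all coefficients
ci <- confint(fit, level = 0.95)
pVals <- coef(summary(fit))[, "Pr(>|t|)"]

# A coefficient is significant at the 5% level exactly when
# its 95% interval does not cover 0
excludesZero <- ci[, 1] > 0 | ci[, 2] < 0
cbind(ci, p = pVals, excludesZero)
```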


Load the dataset “income.csv”.

In [10]:
income <- read.csv("data/T09/income.csv")

In [11]:
income %>% head()

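| Population `<int>` | Income `<int>` | Illiteracy `<dbl>` | Life.Exp `<dbl>` | Murder `<dbl>` | HS.Grad `<dbl>` | Frost `<int>` | Area `<int>` |
|---|---|---|---|---|---|---|---|
| 3615 | 3624 | 2.1 | 69.05 | 15.1 | 41.3 | 20 | 50708 |
| 365 | 6315 | 1.5 | 69.31 | 11.3 | 66.7 | 152 | 566432 |
| 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 |
| 2110 | 3378 | 1.9 | 70.66 | 10.1 | 39.9 | 65 | 51945 |
| 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 |
| 2541 | 4884 | 0.7 | 72.06 | 6.8 | 63.9 | 166 | 103766 |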


Perform a multiple linear regression. To do so, use Income as the dependent variable and all other variables as independent variables.


In [12]:
fit1a <- lm(Income ~ ., data=income)
summary(fit1a)


Call:
lm(formula = Income ~ ., data = income)

Residuals:
    Min      1Q  Median      3Q     Max 
-899.73 -226.98  -53.63  232.37  966.02 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  3.183e+03  6.980e+03   0.456   0.6508  
Population   3.999e-02  1.808e-02   2.212   0.0325 *
Illiteracy  -1.437e+02  2.302e+02  -0.624   0.5359  
Life.Exp    -8.688e+00  9.739e+01  -0.089   0.9293  
Murder      -1.154e+01  4.151e+01  -0.278   0.7823  
HS.Grad      3.417e+01  1.455e+01   2.349   0.0236 *
Frost        1.916e-01  2.061e+00   0.093   0.9264  
Area         1.662e-03  1.021e-03   1.627   0.1111  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 470.1 on 42 degrees of freedom
Multiple R-squared:  0.4983,	Adjusted R-squared:  0.4146 
F-statistic: 5.958 on 7 and 42 DF,  p-value: 7.485e-05


After fitting the initial model, keep removing the independent variable with the highest p-value until all remaining variables are significant at the 5% level. Which independent variables have a significant influence on the income of the state inhabitants?

In [13]:
# Remove Life.Exp (highest p-value in the full model)
fit1a <- lm(Income ~ Population + 
                    Illiteracy + Murder + HS.Grad + 
                    Frost + Area, data=income)
summary(fit1a)

# Remove Frost
fit1a <- lm(Income ~ Population + 
                    Illiteracy + Murder + HS.Grad + 
                    Area, data=income)
summary(fit1a)

# Remove Murder
fit1a <- lm(Income ~ Population + 
                    Illiteracy + HS.Grad + 
                    Area, data=income)
summary(fit1a)

# Remove Illiteracy
fit1a <- lm(Income ~ Population + 
                    HS.Grad + 
                    Area, data=income)
summary(fit1a)

# Remove Area; all remaining variables are significant at the 5% level
fit1a <- lm(Income ~ Population + 
                    HS.Grad, data=income)
summary(fit1a)


Call:
lm(formula = Income ~ Population + Illiteracy + Murder + HS.Grad + 
    Frost + Area, data = income)

Residuals:
    Min      1Q  Median      3Q     Max 
-906.99 -219.42  -59.47  231.33  968.55 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  2.567e+03  1.018e+03   2.522   0.0155 *
Population   3.954e-02  1.718e-02   2.301   0.0263 *
Illiteracy  -1.440e+02  2.275e+02  -0.633   0.5301  
Murder      -8.929e+00  2.906e+01  -0.307   0.7601  
HS.Grad      3.375e+01  1.361e+01   2.480   0.0171 *
Frost        2.414e-01  1.961e+00   0.123   0.9026  
Area         1.663e-03  1.009e-03   1.648   0.1067  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 464.7 on 43 degrees of freedom
Multiple R-squared:  0.4982,	Adjusted R-squared:  0.4281 
F-statistic: 7.114 on 6 and 43 DF,  p-value: 2.617e-05



Call:
lm(formula = Income ~ Population + Illiteracy + Murder + HS.Grad + 
    Area, data = income)

Residuals:
    Min      1Q  Median      3Q     Max 
-905.01 -219.80  -60.24  232.74  965.49 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  2.641e+03  8.109e+02   3.257  0.00217 **
Population   3.885e-02  1.605e-02   2.421  0.01969 * 
Illiteracy  -1.605e+02  1.815e+02  -0.884  0.38124   
Murder      -9.291e+00  2.858e+01  -0.325  0.74666   
HS.Grad      3.325e+01  1.284e+01   2.590  0.01295 * 
Area         1.701e-03  9.499e-04   1.791  0.08019 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 459.4 on 44 degrees of freedom
Multiple R-squared:  0.498,	Adjusted R-squared:  0.4409 
F-statistic: 8.729 on 5 and 44 DF,  p-value: 8.236e-06



Call:
lm(formula = Income ~ Population + Illiteracy + HS.Grad + Area, 
    data = income)

Residuals:
    Min      1Q  Median      3Q     Max 
-912.18 -216.28  -61.89  217.13  947.96 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  2.577e+03  7.788e+02   3.309  0.00185 **
Population   3.684e-02  1.466e-02   2.513  0.01563 * 
Illiteracy  -1.900e+02  1.557e+02  -1.220  0.22889   
HS.Grad      3.411e+01  1.244e+01   2.742  0.00873 **
Area         1.601e-03  8.895e-04   1.800  0.07860 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 454.9 on 45 degrees of freedom
Multiple R-squared:  0.4968,	Adjusted R-squared:  0.452 
F-statistic: 11.11 on 4 and 45 DF,  p-value: 2.372e-06



Call:
lm(formula = Income ~ Population + HS.Grad + Area, data = income)

Residuals:
    Min      1Q  Median      3Q     Max 
-956.47 -225.60  -14.54  198.88  974.12 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.807e+03  4.581e+02   3.944 0.000272 ***
Population  3.620e-02  1.473e-02   2.458 0.017820 *  
HS.Grad     4.508e+01  8.634e+00   5.221 4.17e-06 ***
Area        1.150e-03  8.135e-04   1.414 0.164049    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 457.3 on 46 degrees of freedom
Multiple R-squared:  0.4801,	Adjusted R-squared:  0.4462 
F-statistic: 14.16 on 3 and 46 DF,  p-value: 1.137e-06



Call:
lm(formula = Income ~ Population + HS.Grad, data = income)

Residuals:
    Min      1Q  Median      3Q     Max 
-998.14 -237.26  -43.64  202.88 1355.74 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.664e+03  4.516e+02   3.685 0.000591 ***
Population  3.743e-02  1.486e-02   2.519 0.015237 *  
HS.Grad     4.920e+01  8.213e+00   5.990 2.78e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.1 on 47 degrees of freedom
Multiple R-squared:  0.4575,	Adjusted R-squared:  0.4345 
F-statistic: 19.82 on 2 and 47 DF,  p-value: 5.723e-07
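
The manual elimination above can also be written as a loop. The following is a sketch under one loud assumption: *income.csv* is not available here, so the code uses R's built-in `state.x77` data set, whose first rows match the values shown above. The helper names `states`, `worst` and `kept` are introduced only for this illustration.

```R
# Assumption: income.csv corresponds to the built-in state.x77 data set
states <- as.data.frame(state.x77)
names(states) <- make.names(names(states))   # "Life Exp" -> "Life.Exp"

# Start with the full model and drop the least significant variable each round
fit <- lm(Income ~ ., data = states)
repeat {
  # p-values of all predictors (first row is the intercept, so it is excluded)
  pVals <- coef(summary(fit))[-1, "Pr(>|t|)", drop = FALSE]
  if (nrow(pVals) == 0 || max(pVals) < 0.05) break
  worst <- rownames(pVals)[which.max(pVals)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

# Variables that remain significant at the 5% level
kept <- attr(terms(fit), "term.labels")
kept
```

Note that this purely p-value-driven procedure is a simple heuristic; it mirrors the steps of the exercise rather than a general recommendation for model selection.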


What share of total variance in the data can be explained by our regression model?
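
The share of explained variance is the coefficient of determination $R^2$, reported as "Multiple R-squared" in the summary output (0.4575 for the final model above, i.e. roughly 46%). As a hedged sketch, it can also be computed by hand; because *income.csv* is not available here, the code assumes that the built-in `state.x77` data set holds the same data.

```R
# Assumption: income.csv corresponds to the built-in state.x77 data set
states <- as.data.frame(state.x77)
names(states) <- make.names(names(states))   # "HS Grad" -> "HS.Grad"

# Final model from the elimination procedure
fit <- lm(Income ~ Population + HS.Grad, data = states)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ssRes <- sum(residuals(fit)^2)
ssTot <- sum((states$Income - mean(states$Income))^2)
r2 <- 1 - ssRes / ssTot

r2
summary(fit)$r.squared   # value reported by summary()
```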