# Determine the effect of staff on two dependent variables: days in advance cancelled and average price

*In this video, you will look at the effect of the staff member on both the number of days in advance the appointment was cancelled and the average price. Luckily, a MANOVA lets you look at both dependent variables at the same time!*

*Of course, the first thing to do is load your libraries. You want mvnormtest, which lets you test for multivariate normality, and car, which provides leveneTest() for checking homogeneity of variance. The manova() function itself ships with base R's stats package, so it needs no extra library. Lastly, you'll need IDPmisc to deal with missing data if you have any.*

## Load Libraries

In [18]:
library("mvnormtest")
library("car")
library("IDPmisc")

"package 'IDPmisc' was built under R version 3.5.3"

## Read in Data

In [3]:
salon <- read.csv("C://Users/meredith.dodd/Documents/Data Science/105 Intermediate Statistics/Lesson 3/client_cancellations.csv")

In [4]:
head(salon)

cancel.date,cancel.date.month,code,service.code,service.desc,staff,booking.date,booking.date.month,canceled.by,days.in.adv,avg.price
3/10/2018,March,KOOM01,SHCW,Women's hair cut,JJ,4/3/2018,April,JJ,24,88.44
3/27/2018,March,WIL*01,SHCW,Women's hair cut,JJ,3/29/2018,March,JJ,2,88.44
4/3/2018,April,BUDG02,SHCM,Men's hair cut,SINEAD,4/21/2018,April,BECKY,18,41.41
4/3/2018,April,HILJ01,CFC,Color full color,KELLY,4/3/2018,April,JJ,0,63.13
4/3/2018,April,STEM01,SHCW,Women's hair cut,BECKY,4/21/2018,April,JJ,18,67.84
4/3/2018,April,STRH01,CHLFH,Highlights full,KELLY,4/4/2018,April,JJ,1,120.0


*Next, there are a few data wrangling tasks. You will need to make sure that both of your dependent variables are numeric, and you will need to isolate them in a matrix of their own. These data wrangling requirements are not for the MANOVA itself, but rather for mshapiro.test(), the test for multivariate normality, which you'll run shortly.*

## Data Wrangling

### Make sure dependent variables are numeric

*Your first task is to make sure that your dependent variables are numeric, and not factor or string data. It doesn't matter whether they are classified as num or as int, but they need to be some form of number. You can check the format with the str() function, which stands for structure.*

In [5]:
str(salon$days.in.adv)

 int [1:243] 24 2 18 0 18 1 55 0 0 1 ...


In [6]:
str(salon$avg.price)

 num [1:243] 88.4 88.4 41.4 63.1 67.8 ...


*Excellent! Looks like they are both good to go!*

*Next, you will want to remove missing data. Although the MANOVA itself can handle missing data, mshapiro.test() cannot, and you will get an error if you try to run it with missing data.*

### Remove missing data

In [19]:
salon2 <- NaRV.omit(salon)
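
*If you'd like to see what this kind of cleaning does without IDPmisc, here is a minimal base R sketch. It assumes, as the IDPmisc documentation describes, that NaRV.omit() drops rows containing NA, NaN, Inf or -Inf. The toy data frame and the names toy, is_finite_row and toy_clean are made up for illustration; they are not part of the salon data.*

```r
# Hypothetical toy data standing in for the two dependent variables
toy <- data.frame(
  days.in.adv = c(24, 2, NA, 0, 18),
  avg.price   = c(88.44, 88.44, 41.41, Inf, 67.84)
)

# Flag rows in which every value is a regular (finite) number.
# is.finite() is FALSE for NA, NaN, Inf and -Inf.
is_finite_row <- apply(toy, 1, function(row) all(is.finite(row)))

# Keep only the flagged rows
toy_clean <- toy[is_finite_row, ]

nrow(toy)        # rows before cleaning
nrow(toy_clean)  # rows after cleaning
```

*Note that this sketch only works on numeric columns, which is why it is shown on the dependent variables alone.*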

*And then, also for mshapiro.test(), you need to isolate your dependent variables by themselves and turn them into a matrix. You'll use the keeps approach to subset your data: store the names of the columns you want in a vector called keeps, then keep only those columns.*

### Subset and format as a matrix to test the assumption of multivariate normality

In [20]:
keeps <- c("days.in.adv", "avg.price")
salon3 <- salon2[keeps]

In [22]:
head(salon3)

days.in.adv,avg.price
24,88.44
2,88.44
18,41.41
0,63.13
18,67.84
1,120.0


*And then simply use the function as.matrix() to turn that data frame into a matrix.*

In [23]:
salon4 <- as.matrix(salon3)

In [24]:
head(salon4)

,days.in.adv,avg.price
1,24,88.44
2,2,88.44
3,18,41.41
4,0,63.13
5,18,67.84
6,1,120.0


## Test Assumptions

*Now you are all set up to test even the most finicky of assumptions!*

### Sample Size

*First up is sample size. The rule of thumb is at least 20 cases per independent variable. You have one IV (staff) and 243 rows of data, so you meet this assumption without a problem.*
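
*The arithmetic behind that check can be sketched in a couple of lines. The row count comes from the str() output earlier, and the staff vector holds only the six values shown by head(salon); on the real data you would call table(salon$staff) instead to see how the cases spread across staff members.*

```r
# Figures taken from the notebook: 243 rows, one independent variable (staff)
n_cases <- 243
n_ivs <- 1

# Rule of thumb from the lesson: at least 20 cases per independent variable
n_cases >= 20 * n_ivs

# The six staff values printed by head(salon), used here only to show
# how table() counts cases per group
staff <- factor(c("JJ", "JJ", "SINEAD", "KELLY", "BECKY", "KELLY"))
table(staff)
```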

### Multivariate Normality

*Next, you want to test for multivariate normality with the multivariate Shapiro-Wilk test, the function for which is mshapiro.test(). The argument is the matrix of your two dependent variables that you created while wrangling your data, transposed with t() because mshapiro.test() expects each variable in its own row.*

In [25]:
mshapiro.test(t(salon4))


	Shapiro-Wilk normality test

data:  Z
W = 0.55738, p-value < 2.2e-16


*You want this result to be non-significant. Here the p-value is far below .05, so the test is significant, which means you have violated the assumption of multivariate normality and should not normally run a MANOVA. You will continue here just for learning purposes.*
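
*If the t() call looks mysterious, this small base R sketch shows what it does to the shape of the data. It uses only the six rows printed by head(salon4); the name toy is made up for illustration, and the sketch does not run the normality test itself.*

```r
# The six rows shown by head(salon4), rebuilt as a matrix
toy <- cbind(
  days.in.adv = c(24, 2, 18, 0, 18, 1),
  avg.price   = c(88.44, 88.44, 41.41, 63.13, 67.84, 120.00)
)

dim(toy)     # observations in rows, variables in columns
dim(t(toy))  # after transposing: variables in rows, observations in columns

# The variable names move from the columns to the rows
rownames(t(toy))
```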

### Homogeneity of Variance

*Next, you will test the assumption of homogeneity of variance. This needs to be done individually with a Levene's test for each dependent variable you have.*

In [12]:
leveneTest(days.in.adv ~ staff, data=salon)

,Df,F value,Pr(>F)
group,5,0.7968467,0.5528728
,237,,


In [13]:
leveneTest(avg.price ~ staff, data=salon)

,Df,F value,Pr(>F)
group,5,9.83148,1.508878e-08
,235,,


*You wanted both of these to be non-significant in order to meet the assumption of homogeneity of variance. Unfortunately, average price does not meet this assumption, although days in advance does. You will proceed with this analysis anyway, even having violated the assumption, just for learning purposes.*
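
*To see what Levene's test is doing under the hood, here is a minimal base R sketch of the median-centered version, which is the default in car's leveneTest(): take each value's absolute deviation from its group median, then run a one-way ANOVA on those deviations. The data are simulated, and the names toy, abs_dev, levene and p_value are made up for illustration.*

```r
set.seed(1)

# Simulated data: three hypothetical staff groups with unequal spread in price
toy <- data.frame(
  staff = factor(rep(c("BECKY", "JJ", "KELLY"), each = 20)),
  avg.price = c(rnorm(20, mean = 68, sd = 5),
                rnorm(20, mean = 88, sd = 20),
                rnorm(20, mean = 63, sd = 10))
)

# Absolute deviation of each price from the median of its own staff group
toy$abs_dev <- abs(toy$avg.price - ave(toy$avg.price, toy$staff, FUN = median))

# A one-way ANOVA on the deviations is the Levene (median-centered) test
levene <- anova(lm(abs_dev ~ staff, data = toy))
levene

# A p-value below .05 would indicate unequal variances across groups
p_value <- levene[["Pr(>F)"]][1]
p_value
```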

### Absence of Multicollinearity

*Lastly, you need to test for multicollinearity. This can be done just by running a correlation between your dependent variables. You are looking for a correlation that is lower than .7 or so. You can test for correlation in any way you like, though in this case, you'll use cor.test().*

In [14]:
# cor.test() drops incomplete pairs on its own, so no use= argument is needed
cor.test(salon$days.in.adv, salon$avg.price, method="pearson")


	Pearson's product-moment correlation

data:  salon$days.in.adv and salon$avg.price
t = -1.2739, df = 239, p-value = 0.2039
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.20634802  0.04470744
sample estimates:
        cor 
-0.08212297 


*The correlation between your two dependent variables is very small, about -.08, which is far below the .7 threshold, so you definitely have not violated the assumption of absence of multicollinearity, and are good to proceed. Note that it doesn't matter whether the correlation is positive or negative; it's really just the strength of the correlation that matters.*
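
*Because only the strength matters, the check boils down to comparing the absolute value of the correlation with the threshold. The sketch below does that for the estimate reported above, and then for a small made-up data frame (toy) built from the six head(salon) rows plus one hypothetical row with a missing value, using cor() and its use argument.*

```r
# Estimate reported by cor.test() in the notebook
r <- -0.08212297
abs(r) < 0.7   # TRUE: the sign is ignored, only the strength counts

# Six rows from head(salon) plus one hypothetical row with a missing value
toy <- data.frame(
  days.in.adv = c(24, 2, 18, 0, 18, 1, NA),
  avg.price   = c(88.44, 88.44, 41.41, 63.13, 67.84, 120.00, 50.00)
)

# cor() needs to be told how to treat missing values
r_toy <- cor(toy$days.in.adv, toy$avg.price, use = "complete.obs")
abs(r_toy) < 0.7
```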

# END VIDEO 1

## The Analysis

*Alright! You have now tested all the necessary assumptions, and it is time to actually run your MANOVA! You will use the function manova(). Inside cbind(), you place both of your dependent variables; this is how they get joined together to become one uber dependent variable. Then you put the rest of your model information after the tilde. In this case, you just have the IV, but just like ANOVA, you could add in covariates or additional factors if you'd like. Run summary() on the model you just created, and you get this information:*

In [15]:
MANOVA <- manova(cbind(days.in.adv, avg.price) ~ staff, data = salon)
summary(MANOVA)

           Df Pillai approx F num Df den Df    Pr(>F)    
staff       5 0.2471   6.6253     10    470 1.215e-09 ***
Residuals 235                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

*Looks like the staff does have a significant effect on the two dependent variables rolled together. But which dependent variable is driving this significance? Days in advance? Price? Both? In order to find out, you'll need to do a post hoc. And with MANOVAs, the first post hoc you'll do is actually an ANOVA, not a t-test!*
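
*The Pillai column in that output is the default multivariate test statistic in summary(). Here is a minimal sketch, on simulated data, of fitting the same kind of model and asking for a different statistic through the test argument. The data frame toy and the model name fit are made up for illustration and are not the salon results.*

```r
set.seed(42)

# Simulated data: three hypothetical staff groups, two dependent variables
toy <- data.frame(
  staff = factor(rep(c("BECKY", "JJ", "KELLY"), each = 20)),
  days.in.adv = rpois(60, lambda = 10),
  avg.price = rnorm(60, mean = rep(c(68, 88, 63), each = 20), sd = 10)
)

# cbind() joins the two dependent variables on the left of the tilde
fit <- manova(cbind(days.in.adv, avg.price) ~ staff, data = toy)

summary(fit)                  # Pillai's trace (the default)
summary(fit, test = "Wilks")  # Wilks' lambda instead
```

*The other accepted values for test are "Hotelling-Lawley" and "Roy".*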

## Post Hocs

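*Luckily, however, there is code for this, so you don't have to set up a separate ANOVA for each dependent variable. Simply pass your multivariate model to the summary.aov() function. The cell also specifies test = "wilks"; that argument selects the multivariate statistic in summary() and should make no difference to the univariate F tests reported here.*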

In [16]:
summary.aov(MANOVA, test = "wilks") 

 Response days.in.adv :
             Df  Sum Sq Mean Sq F value Pr(>F)
staff         5   489.4  97.871  0.9081 0.4764
Residuals   235 25327.5 107.776               

 Response avg.price :
             Df Sum Sq Mean Sq F value    Pr(>F)    
staff         5  95215 19042.9   14.24 3.553e-12 ***
Residuals   235 314254  1337.2                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2 observations deleted due to missingness

*Now you get the responses back for both ANOVAs. Looks like the average price was driving things, not the days in advance, because days in advance is NOT significant, but average price is. From here, you can absolutely run your normal ANOVA post hocs, and then look at the means to see which staff member is performing better. This has already been done in previous videos with this data, so it will not be repeated here.*
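
*As a reminder of what that follow-up might look like, here is a minimal sketch on simulated data. It uses Tukey's HSD, which is one common ANOVA post hoc and not necessarily the one used in the earlier videos. The data frame toy and the names fit_price, tukey and group_means are made up for illustration.*

```r
set.seed(42)

# Simulated data: three hypothetical staff groups and their average prices
toy <- data.frame(
  staff = factor(rep(c("BECKY", "JJ", "KELLY"), each = 20)),
  avg.price = rnorm(60, mean = rep(c(68, 88, 63), each = 20), sd = 10)
)

# One-way ANOVA on the dependent variable that drove the MANOVA result
fit_price <- aov(avg.price ~ staff, data = toy)

# Pairwise comparisons between staff members
tukey <- TukeyHSD(fit_price)
tukey

# Group means, to see which staff member sits higher or lower
group_means <- tapply(toy$avg.price, toy$staff, mean)
group_means
```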