### Multivariate and Multiple Regression

Multivariate regression deals with the case where there is more than one response variable to regress on one or more predictors. Multiple regression, by contrast, deals with the case where there is one dependent variable but more than one independent variable.
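
The distinction shows up directly in R's formula interface. The sketch below uses simulated data (all variable names are hypothetical and not part of the energy dataset): a single response on the left-hand side gives a multiple regression, while a matrix of responses built with `cbind()` gives a multivariate regression.

```r
# Simulated data for illustration only; every name here is hypothetical.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- 1 + 2 * x1 - x2 + rnorm(n)
y2 <- 3 - x1 + 0.5 * x2 + rnorm(n)

# Multiple regression: one response, several predictors.
fit_multiple <- lm(y1 ~ x1 + x2)

# Multivariate regression: several responses, same predictors.
fit_multivariate <- lm(cbind(y1, y2) ~ x1 + x2)

class(fit_multiple)          # "lm"
class(fit_multivariate)      # "mlm" "lm"
dim(coef(fit_multivariate))  # one column of coefficients per response
```

The multivariate fit stores its coefficients as a matrix, with one row per term and one column per response.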

For example, suppose a doctor has collected data on the cholesterol, blood pressure, and weight of different patients. He has also collected data on the patients' eating habits, such as the amount of vegetables in the diet and how many ounces of red meat, dairy products, and chocolate are consumed per week. He wants to investigate the relationship between the three measures of health and the four measures of eating habits. Multivariate regression is suited to this kind of problem.

The dataset used in this notebook comes from an energy analysis of 12 different simulated building shapes. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. The dataset has 768 samples and 8 features. There are two response variables, Heating load and Cooling load, which are predicted using the eight independent variables.

Start by loading the data from the /datasets/energy efficiency folder.

In [1]:
energy=read.csv("../../../datasets/energy efficiency/ENB2012_data.csv")
head(energy)
dim(energy)

X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2,X,X.1
0.98,514.5,294.0,110.25,7,2,0,0,15.55,21.33,,
0.98,514.5,294.0,110.25,7,3,0,0,15.55,21.33,,
0.98,514.5,294.0,110.25,7,4,0,0,15.55,21.33,,
0.98,514.5,294.0,110.25,7,5,0,0,15.55,21.33,,
0.9,563.5,318.5,122.5,7,2,0,0,20.84,28.28,,
0.9,563.5,318.5,122.5,7,3,0,0,21.46,25.38,,


In [2]:
summary(energy)

       X1               X2              X3              X4       
 Min.   :0.6200   Min.   :514.5   Min.   :245.0   Min.   :110.2  
 1st Qu.:0.6825   1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9  
 Median :0.7500   Median :673.8   Median :318.5   Median :183.8  
 Mean   :0.7642   Mean   :671.7   Mean   :318.5   Mean   :176.6  
 3rd Qu.:0.8300   3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5  
 Max.   :0.9800   Max.   :808.5   Max.   :416.5   Max.   :220.5  
 NA's   :528      NA's   :528     NA's   :528     NA's   :528    
       X5             X6             X7               X8              Y1       
 Min.   :3.50   Min.   :2.00   Min.   :0.0000   Min.   :0.000   Min.   : 6.01  
 1st Qu.:3.50   1st Qu.:2.75   1st Qu.:0.1000   1st Qu.:1.750   1st Qu.:12.99  
 Median :5.25   Median :3.50   Median :0.2500   Median :3.000   Median :18.95  
 Mean   :5.25   Mean   :3.50   Mean   :0.2344   Mean   :2.812   Mean   :22.31  
 3rd Qu.:7.00   3rd Qu.:4.25   3rd Qu.:0.4000   3rd Qu.:4.000   3rd Qu.:

We don't need the last two columns, X and X.1, as they are junk data with all NA values. Exclude them from the dataset.

In [3]:
energy = energy[, !(colnames(energy) %in% c("X","X.1"))]
head(energy)

X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0.98,514.5,294.0,110.25,7,2,0,0,15.55,21.33
0.98,514.5,294.0,110.25,7,3,0,0,15.55,21.33
0.98,514.5,294.0,110.25,7,4,0,0,15.55,21.33
0.98,514.5,294.0,110.25,7,5,0,0,15.55,21.33
0.9,563.5,318.5,122.5,7,2,0,0,20.84,28.28
0.9,563.5,318.5,122.5,7,3,0,0,21.46,25.38
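
Dropping columns by name works when the junk columns are known in advance. As a hedged alternative, the sketch below removes every column that is entirely NA without naming it. The data frame `toy` and its values are made up for illustration.

```r
# Toy data frame (hypothetical values) with two all-NA columns,
# mimicking the stray X and X.1 columns.
toy <- data.frame(
  X1  = c(0.98, 0.90, NA),
  Y1  = c(15.55, 20.84, NA),
  X   = c(NA, NA, NA),
  X.1 = c(NA, NA, NA)
)

# Keep a column only if it has at least one non-missing value.
keep <- colSums(!is.na(toy)) > 0
toy_clean <- toy[, keep, drop = FALSE]

names(toy_clean)
```

Columns with only some missing values, such as `X1` here, are kept; only fully empty columns are removed.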


Use the readme.md file to get the attribute headers. The column names as per the readme file are Relative_Compactness, Surface_Area, Wall_Area, Roof_Area, Overall_Height, Orientation, Glazing_Area, Glazing_Area_Distribution, Heating_Load, Cooling_Load.

In [4]:
names(energy) = c("Relative_Compactness", "Surface_Area", "Wall_Area" , "Roof_Area", "Overall_Height", "Orientation", "Glazing_Area", "Glazing_Area_Distribution", "Heating_Load", "Cooling_Load")

There are 528 NA values in every column of the dataset, including the dependent variables Heating_Load and Cooling_Load. Omit these rows from the dataset.

In [5]:
energy = energy[complete.cases(energy),]
# str(energy)
summary(energy)

 Relative_Compactness  Surface_Area     Wall_Area       Roof_Area    
 Min.   :0.6200       Min.   :514.5   Min.   :245.0   Min.   :110.2  
 1st Qu.:0.6825       1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9  
 Median :0.7500       Median :673.8   Median :318.5   Median :183.8  
 Mean   :0.7642       Mean   :671.7   Mean   :318.5   Mean   :176.6  
 3rd Qu.:0.8300       3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5  
 Max.   :0.9800       Max.   :808.5   Max.   :416.5   Max.   :220.5  
 Overall_Height  Orientation    Glazing_Area    Glazing_Area_Distribution
 Min.   :3.50   Min.   :2.00   Min.   :0.0000   Min.   :0.000            
 1st Qu.:3.50   1st Qu.:2.75   1st Qu.:0.1000   1st Qu.:1.750            
 Median :5.25   Median :3.50   Median :0.2500   Median :3.000            
 Mean   :5.25   Mean   :3.50   Mean   :0.2344   Mean   :2.812            
 3rd Qu.:7.00   3rd Qu.:4.25   3rd Qu.:0.4000   3rd Qu.:4.000            
 Max.   :7.00   Max.   :5.00   Max.   :0.4000   Max.   :5.000     
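
Before dropping rows it can help to confirm where the missing values are. The sketch below, on a made-up data frame, counts NA values per column and then keeps only the complete rows. `na.omit()` is shown as a shortcut that keeps the same rows in this case.

```r
# Toy data frame with hypothetical values.
toy <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(10, NA, NA, 40)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Keep only rows with no missing value in any column.
complete_rows <- toy[complete.cases(toy), ]

# na.omit() drops the same rows and records them in an attribute.
omitted <- na.omit(toy)

nrow(complete_rows)
nrow(omitted)
```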

Check how the variables are correlated: first across all variables, then between the two dependent variables alone.

In [8]:
cor(energy)>0.6

,Relative_Compactness,Surface_Area,Wall_Area,Roof_Area,Overall_Height,Orientation,Glazing_Area,Glazing_Area_Distribution,Heating_Load,Cooling_Load
Relative_Compactness,True,False,False,False,True,False,False,False,True,True
Surface_Area,False,True,False,True,False,False,False,False,False,False
Wall_Area,False,False,True,False,False,False,False,False,False,False
Roof_Area,False,True,False,True,False,False,False,False,False,False
Overall_Height,True,False,False,False,True,False,False,False,True,True
Orientation,False,False,False,False,False,True,False,False,False,False
Glazing_Area,False,False,False,False,False,False,True,False,False,False
Glazing_Area_Distribution,False,False,False,False,False,False,False,True,False,False
Heating_Load,True,False,False,False,True,False,False,False,True,True
Cooling_Load,True,False,False,False,True,False,False,False,True,True


In [9]:
cor(energy[,9:10])

,Heating_Load,Cooling_Load
Heating_Load,1.0,0.9758618
Cooling_Load,0.9758618,1.0
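
Note that the threshold `cor(energy) > 0.6` only flags strong positive correlations; a strongly negative correlation would show up as False. A minimal sketch on simulated data (variable names are hypothetical), comparing the signed and absolute thresholds:

```r
# Simulated data for illustration only.
set.seed(42)
x     <- rnorm(200)
pos   <- x + rnorm(200, sd = 0.3)    # strongly positively related to x
neg   <- -x + rnorm(200, sd = 0.3)   # strongly negatively related to x
noise <- rnorm(200)                  # unrelated to x
toy   <- data.frame(x, pos, neg, noise)

r <- cor(toy)
signed_flag   <- r > 0.6        # misses the x/neg pair
absolute_flag <- abs(r) > 0.6   # catches both directions

signed_flag["x", "neg"]
absolute_flag["x", "neg"]
```

Thresholding on `abs(r)` is one way to see strong relationships in either direction at a glance.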


So the dependent variables are highly correlated with each other (0.976). Relative_Compactness and Overall_Height are the only predictors with a positive correlation above 0.6 with the dependent variables. Construct the model to predict Heating_Load and Cooling_Load using the rest of the variables. For multivariate regression, all response variables should be combined into a single object, a matrix with one column per response. You will see why when we apply the model. Let's combine the response variables Heating_Load and Cooling_Load into a variable called 'load'.

In [10]:
# cbind() combines two variables column-wise. load will be a numeric matrix with Heating_Load and Cooling_Load as its two (unnamed) columns.
load  <- cbind(energy$Heating_Load, energy$Cooling_Load)
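
Because `cbind()` is called here on two unnamed expressions, the resulting matrix has no column names, which is why the model summary labels the responses generically as Y1 and Y2. A small sketch, on a made-up data frame, of how naming the columns inside `cbind()` would carry through to the fitted model (the name `load_named` and all values are hypothetical):

```r
# Simulated stand-in for the energy data; values are not real.
set.seed(7)
toy <- data.frame(
  Heating_Load   = runif(20, 6, 43),
  Cooling_Load   = runif(20, 10, 48),
  Overall_Height = rep(c(3.5, 7), each = 10)
)

# Without names: a plain numeric matrix with NULL column names.
unnamed <- cbind(toy$Heating_Load, toy$Cooling_Load)
is.matrix(unnamed)
colnames(unnamed)

# With names: the response labels survive into the model.
load_named <- cbind(Heating_Load = toy$Heating_Load,
                    Cooling_Load = toy$Cooling_Load)

fit <- lm(load_named ~ Overall_Height, data = toy)
colnames(coef(fit))
```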

In [11]:
# Build a multivariate regression model to predict Heating_Load and Cooling_Load using all the independent variables.
# Include all variables in the data with the notation " ~ ."
# Exclude Heating_Load and Cooling_Load from the predictors by
# using the minus operator ( - )

# energy_reg is a linear model of load against all variables except Heating_Load and Cooling_Load
energy_reg <- lm(load ~ . -Heating_Load - Cooling_Load, data=energy)
summary(energy_reg)

Response Y1 :

Call:
lm(formula = Y1 ~ (Relative_Compactness + Surface_Area + Wall_Area + 
    Roof_Area + Overall_Height + Orientation + Glazing_Area + 
    Glazing_Area_Distribution + Heating_Load + Cooling_Load) - 
    Heating_Load - Cooling_Load, data = energy)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8965 -1.3196 -0.0252  1.3532  7.7052 

Coefficients: (1 not defined because of singularities)
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                84.014521  19.033607   4.414 1.16e-05 ***
Relative_Compactness      -64.773991  10.289445  -6.295 5.19e-10 ***
Surface_Area               -0.087290   0.017075  -5.112 4.04e-07 ***
Wall_Area                   0.060813   0.006648   9.148  < 2e-16 ***
Roof_Area                         NA         NA      NA       NA    
Overall_Height              4.169939   0.337990  12.337  < 2e-16 ***
Orientation                -0.023328   0.094705  -0.246  0.80550    
Glazing_Area               19.93
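
The note "1 not defined because of singularities" and the NA row for Roof_Area indicate that one predictor is an exact linear combination of others. In the rows printed earlier, Surface_Area equals Wall_Area + 2 * Roof_Area (for example, 514.5 = 294.0 + 2 * 110.25), which would explain why `lm()` cannot estimate a separate coefficient for the last of the three. The sketch below reproduces the effect on made-up data with that same dependency built in; the response `y` is simulated and the lower-case names are hypothetical.

```r
# Made-up predictors with an exact linear dependency built in.
set.seed(3)
wall    <- c(294.0, 318.5, 343.0, 416.5, 245.0, 367.5, 294.0, 343.0)
roof    <- c(110.25, 122.5, 147.0, 220.5, 110.25, 220.5, 147.0, 122.5)
surface <- wall + 2 * roof
y       <- 0.05 * wall - 0.1 * roof + rnorm(length(wall))  # simulated response

fit <- lm(y ~ surface + wall + roof)
coef(fit)                 # the coefficient for roof is NA

X <- model.matrix(fit)
ncol(X)                   # number of columns, including the intercept
qr(X)$rank                # rank is one less than the column count

alias(fit)                # reports which term is aliased
```

One possible remedy, not taken in this notebook, is to drop one of the three dependent predictors from the formula before fitting.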

In [12]:
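## Run a MANOVA on the linear model constructed in the cell above and summarize it

summary(manova(energy_reg))

## The summary shows that every variable except Orientation (p = 0.077) has a significant effect at the 5% level.
## Roof_Area does not appear because its coefficient could not be estimated (see the singularity note above).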

                           Df  Pillai approx F num Df den Df    Pr(>F)    
Relative_Compactness        1 0.82502  1789.27      2    759 < 2.2e-16 ***
Surface_Area                1 0.56418   491.28      2    759 < 2.2e-16 ***
Wall_Area                   1 0.80016  1519.50      2    759 < 2.2e-16 ***
Overall_Height              1 0.17495    80.47      2    759 < 2.2e-16 ***
Orientation                 1 0.00675     2.58      2    759 0.0766512 .  
Glazing_Area                1 0.47606   344.82      2    759 < 2.2e-16 ***
Glazing_Area_Distribution   1 0.02200     8.54      2    759 0.0002153 ***
Residuals                 760                                             
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
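
`summary()` on a `manova` object reports Pillai's trace by default, and the tests are sequential, so each term is assessed after the terms listed before it. As a sketch on simulated data (all names hypothetical), the `test` argument selects another statistic such as Wilks' lambda. Here the MANOVA is fitted directly from a formula instead of from an existing `lm` object.

```r
# Simulated data for illustration only.
set.seed(11)
n   <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y1 <- 2 * toy$x1 + rnorm(n)    # depends on x1 only
toy$y2 <- -toy$x1 + rnorm(n)       # depends on x1 only

fit <- manova(cbind(y1, y2) ~ x1 + x2, data = toy)

pillai <- summary(fit)                  # default: Pillai's trace
wilks  <- summary(fit, test = "Wilks")  # Wilks' lambda instead

pillai$stats
wilks$stats
```

The `stats` component holds the same table that is printed, which makes it easy to pull out individual p-values programmatically.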