# Assignment 6
### Multiple Regression

### Dataset

Course evaluations are used to obtain anonymous feedback about a course in order to improve it. However, the use of course evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching-related characteristics, such as the physical appearance of the instructor. [A paper from 2005](http://www.sciencedirect.com/science/article/pii/S0272775704001165) found that instructors who are viewed as better looking receive higher instructional ratings.

In this practical we will have a look at this data, and perform multiple regression to see if there is any indication that this is true.

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import probplot

sns.set_style("whitegrid")

In [None]:
courseEval = pd.read_csv('CourseEvaluations.csv')
courseEval.head()

### Beauty influencing score

The conclusion of the paper was that the perceived beauty of the lecturer has an influence on the score given in the course evaluations. First, we will create a simple linear regression model with `bty_avg` as the predictor and `score` as the response.

## Exercise 1

1. Create a linear regression model of `bty_avg` predicting `score`. 
2. Plot the data with regression line.
3. Write down the obtained regression equation and the adjusted $R^{2}$ value.<div style="text-align: right"> **3 points** </div>

In [None]:
# your code/answer here
# score is the dependent variable (response) and bty_avg is the independent (predictor)
def print_question(question_number, sep_line_width = 78):
    print(f"Question {question_number}")
    print(sep_line_width * "=")
    
print_question(1)
print("")
formula_string = "score ~ bty_avg"

model = sm.formula.ols(formula=formula_string, data=courseEval)
model_fitted = model.fit()

print(model_fitted.summary())

plt.figure(figsize=(10, 6))
sns.regplot(data=courseEval, x='bty_avg', y='score')
plt.title('score vs bty_avg')
plt.show()

print(f"The regression equation is: score = {model_fitted.params['Intercept']:.4f} + {model_fitted.params['bty_avg']:.4f} x bty_avg")
print(f"The adjusted R^2 value is {model_fitted.rsquared_adj:.4f}")

### Adding more variables and dummy coding

More variables, other than `bty_avg`, can have an impact on the `score`. But instead of looking at these variables separately, we can create a multiple regression model with more than one explanatory variable. The next variable that we will add is `gender`.

The problem with the `gender` variable is that it is a nominal variable (male/female). In order to use it for regression we need to convert these levels into a numerical variable with the values 0 and 1, called an **indicator variable** (also referred to as a dummy variable).
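As a minimal sketch of what dummy coding does, the toy example below maps a two-level nominal column to 0/1. The values are invented for illustration and are not taken from the course data; `toy` and `is_female` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration only (not from CourseEvaluations.csv)
toy = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Map each level to 0 or 1; the level coded 0 acts as the reference group
toy["is_female"] = toy["gender"].map({"male": 0, "female": 1})

print(toy)
```

In a regression, the coefficient of such an indicator is the estimated difference in the response between the group coded 1 and the reference group, with the other predictors held fixed.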

In [None]:
dummy_coding = {'male': 0, 'female': 1}
gender_dummy = courseEval['gender'].copy()
gender_dummy = gender_dummy.replace(dummy_coding)
courseEval['gender_dummy']=gender_dummy
courseEval.tail()

## Exercise 2

Now, the dummy coding has been applied. 

1. Create and fit a model with `bty_avg` and `gender_dummy` predicting the `score` variable. (Hint: you can use the + sign in the formula string to add multiple explanatory variables to your model). 
2. Again, write down the formula and the adjusted $R^{2}$. 
3. Do you think the model with gender and perceived beauty is a better model than the model with beauty alone? Justify your answer.<div style="text-align: right"> **3 points** </div>
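The formula syntax from the hint can be sketched on simulated data. Everything below is an assumption made for illustration: the data frame `sim`, its columns `x`, `g` and `y`, and the true coefficients are made up and have nothing to do with the course evaluations.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: y depends on a numeric predictor x and a 0/1 indicator g
rng = np.random.default_rng(0)
n = 200
sim = pd.DataFrame({"x": rng.normal(size=n), "g": rng.integers(0, 2, size=n)})
sim["y"] = 1.0 + 2.0 * sim["x"] + 3.0 * sim["g"] + rng.normal(scale=0.5, size=n)

# One explanatory variable versus two, joined with "+" in the formula string
fit_one = smf.ols("y ~ x", data=sim).fit()
fit_two = smf.ols("y ~ x + g", data=sim).fit()

print(f"adjusted R^2, y ~ x:     {fit_one.rsquared_adj:.3f}")
print(f"adjusted R^2, y ~ x + g: {fit_two.rsquared_adj:.3f}")
print(fit_two.params)
```

Because adjusted $R^{2}$ penalizes extra parameters, it only increases when the added variable explains enough additional variance, which makes it a reasonable basis for comparing the two models.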

In [16]:
# your code/answer here

## Exercise 3

1. Plot `bty_avg` vs `score` in a scatterplot and draw two regression lines in it, one for `male` and one for `female` professors. (Hint: first create two separate equations for males and females). Make sure the two groups and lines have different colors and add a legend so you know which color represents which group. 
2. If two professors had the same beauty score, would the `male` or the `female` professor tend to have a higher score?<div style="text-align: right"> **5 points** </div>

In [17]:
# your code/answer here

## Exercise 4
P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots. To check the model conditions you will need to make the following plots (see page 271 of the book for more details about assumptions and example plots): 
1. scatterplot of the (absolute values of the) residuals (y-axis) against the predicted values (x-axis)
2. a histogram and QQ-plot of the residuals
3. scatterplot of the residuals (y-axis) against the order of collection (x-axis) 
4. scatterplots of the residuals (y-axis) against each of the explanatory variables (x-axis)

Make sure you use subplots (`plt.subplot`) to arrange these plots in a structured manner. All plots must have a title and labels for the x- and y-axes.

5. For each of the plots, describe whether the assumptions of multiple regression are met, or not.<div style="text-align: right"> **9 points** </div>
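The layout of such a diagnostic figure can be sketched on simulated data. This is only an illustration of two of the plots, not a complete answer: the data frame `sim` and its columns are invented, and `plt.subplots` is used here as one possible way to arrange the panels.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy.stats import probplot

# Simulated data that satisfies the regression conditions by construction
rng = np.random.default_rng(2)
sim = pd.DataFrame({"x": rng.uniform(0, 10, size=150)})
sim["y"] = 3.0 + 0.5 * sim["x"] + rng.normal(scale=1.0, size=150)
fitted = smf.ols("y ~ x", data=sim).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: residuals against fitted values (look for fanning or curvature)
axes[0].scatter(fitted.fittedvalues, fitted.resid, alpha=0.6)
axes[0].axhline(0, color="grey", linestyle="--")
axes[0].set_title("Residuals vs fitted values")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")

# Right panel: normal QQ-plot (points near the line suggest normal residuals)
probplot(fitted.resid, dist="norm", plot=axes[1])
axes[1].set_title("Normal QQ-plot of residuals")
axes[1].set_xlabel("Theoretical quantiles")
axes[1].set_ylabel("Ordered residuals")

fig.tight_layout()
plt.show()
```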

In [18]:
# your code/answer here

### The search for the best model 
Now we will incorporate more predictors into the regression model. We will start with a full model that predicts professor score based on ethnicity, gender, language of the university where they got their degree (`language`), age, proportion of students that filled out evaluations (`cls_perc_eval`), class size (`cls_students`), course level (`cls_level`), number of professors (`cls_profs`), number of credits (`cls_credits`), average beauty rating (`bty_avg`), outfit (`pic_outfit`), and picture color (`pic_color`).

In [None]:
m_full = sm.formula.ols(formula = 'score ~ ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg + pic_outfit + pic_color', data = courseEval)
multi_reg = m_full.fit()
print(multi_reg.summary())

## Exercise 5
Interpret the coefficient associated with the ethnicity variable.<div style="text-align: right"> **2 points** </div>

## Exercise 6
One of the things that makes multiple regression interesting, but also difficult, is that coefficient estimates depend on the other variables that are included in the model.

1. Drop the variable with the largest p-value and re-fit the model. 
2. Did the coefficients and significance of the other explanatory variables change?  
3. What happened to `ethnicity`? Describe and interpret your observations in terms of collinearity with the other explanatory variables.<div style="text-align: right"> **3 points** </div>
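How collinearity affects coefficient estimates can be illustrated with a small simulation. The sketch below assumes two made-up predictors, `x1` and `x2`, where `x2` is almost a copy of `x1` and the response depends on `x1` only; none of this comes from the course data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)             # nearly collinear with x1
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)  # y depends on x1 only
sim = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

fit_single = smf.ols("y ~ x1", data=sim).fit()
fit_both = smf.ols("y ~ x1 + x2", data=sim).fit()

# Compare the estimate and standard error of x1 with and without x2
print(f"y ~ x1:      coef = {fit_single.params['x1']:.3f}, se = {fit_single.bse['x1']:.3f}")
print(f"y ~ x1 + x2: coef = {fit_both.params['x1']:.3f}, se = {fit_both.bse['x1']:.3f}")
```

When two predictors carry nearly the same information, the model cannot separate their contributions, so the standard error of `x1` grows once `x2` is included and its p-value can change substantially.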

In [20]:
# your code/answer here

## Exercise 7
Using backward-selection and p-value as the selection criterion, determine the best model. 
1. Describe the order in which you removed predictor variables.
2. Show the output for the final model. <div style="text-align: right"> **2 points** </div>
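Backward selection by p-value can be sketched as a loop. The helper `backward_select` below is hypothetical and is demonstrated on simulated data; it assumes numeric predictors only, because categorical terms produce coefficient names such as `gender[T.male]` that no longer match the column name, and the 0.05 threshold is an assumption as well.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_select(data, response, predictors, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value.

    Stops when every remaining predictor has a p-value below alpha.
    Returns the final fitted model and the predictors in removal order.
    """
    remaining = list(predictors)
    removed = []
    while remaining:
        formula = f"{response} ~ " + " + ".join(remaining)
        fitted = smf.ols(formula, data=data).fit()
        pvalues = fitted.pvalues.drop("Intercept")
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha:
            return fitted, removed
        remaining.remove(worst)
        removed.append(worst)
    # Nothing survived: fall back to the intercept-only model
    return smf.ols(f"{response} ~ 1", data=data).fit(), removed

# Simulated data: y depends on a and b; c and d are pure noise
rng = np.random.default_rng(3)
sim = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
sim["y"] = 2.0 * sim["a"] - 1.5 * sim["b"] + rng.normal(size=300)

final_model, removal_order = backward_select(sim, "y", ["a", "b", "c", "d"])
print("Removed, in order:", removal_order)
print(final_model.params)
```

The model is re-fitted after every removal because the p-values of the remaining predictors change each time a variable is dropped.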

In [21]:
# your code/answer here

## Exercise 8
Verify that the conditions for this reduced, final model are reasonable using diagnostic plots.
To get the predicted values for a regression model, you can use the `predict` method of the fitted model object. For the full model fitted above Exercise 5, that would be: `predicted_value = multi_reg.predict()`. Make sure you use subplots to arrange the diagnostic plots in a structured manner.<div style="text-align: right"> **5 points** </div>
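As a small sketch of how the predicted values relate to the residuals, the example below fits a model to simulated data (the data frame `sim` and its columns are invented for illustration) and checks that the residuals equal the observed values minus the predictions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
sim = pd.DataFrame({"x": rng.normal(size=100)})
sim["y"] = 1.0 + 2.0 * sim["x"] + rng.normal(scale=0.5, size=100)

fitted = smf.ols("y ~ x", data=sim).fit()

# With no arguments, predict() returns predictions for the data used in fitting
predicted_value = fitted.predict()

# Residual = observed value - predicted value
manual_resid = sim["y"].to_numpy() - predicted_value
print(np.allclose(manual_resid, fitted.resid))
```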

In [22]:
# your code/answer here

## Exercise 9
In Exercises 4 and 8, we looked at residuals through histograms and probability plots and asked whether they are actually normally distributed. In both cases we were unsure whether the residuals deviate too much from normality. Non-normality of residuals is not a problem *per se* for linear regression, but it can be problematic for inference (i.e., deriving and interpreting p-values). Several formal tests exist that can tell us whether residuals are normally distributed or not. One such test is the Shapiro-Wilk test; another is the Kolmogorov-Smirnov test.

1. Run the Shapiro-Wilk test from `scipy.stats` on the residuals from the previous exercise.
2. Are the residuals normally distributed according to this test? Interpret the test output.<div style="text-align: right"> **2 points** </div>
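The call itself can be sketched on simulated samples rather than on the model residuals. The two samples below (one drawn from a normal distribution, one from an exponential distribution) are made up purely to show how the output reads.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(size=300),
    "skewed": rng.exponential(size=300),
}

# shapiro returns the test statistic W and the p-value
results = {name: shapiro(sample) for name, sample in samples.items()}

for name, (statistic, p_value) in results.items():
    print(f"{name}: W = {statistic:.4f}, p-value = {p_value:.4g}")
```

The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution. A small p-value is evidence against normality, while a large p-value means only that the test found no evidence against it.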

## Exercise 10
1. Based on your final model, describe the characteristics of a professor and course at the University of Texas at Austin that would be associated with a high evaluation score. 
2. Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Justify your answer.<div style="text-align: right"> **2 points** </div>

**Total number of points**: 36