<a href="https://colab.research.google.com/github/craigalexander/DAS23/blob/main/Week_6_Lab_Regression_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analysis Skills - Regression Modelling with Python




# Introduction

In Week 4, we covered how to model data in R using regression models. This week, we will look at how we can carry out the same tasks using Python. 

For this lab, we will be using the same examples as in Week 4.

As before, the key idea behind modelling the data is to infer a relationship between:


*   an **outcome or response variable** $y$ and
*   one or more **explanatory (or predictor) variables** $x$, which can also be referred to as **independent variables** or **covariates**.

Modelling can be used for two purposes:


1.   **Explanation**: describing the relationship between an outcome variable $y$ and an explanatory variable $x$, and determining the potential significance of such relationships using quantifiable measures.
2.   **Prediction**: predicting the outcome variable $y$ given information from one or more explanatory variables.



There are many different modelling techniques. However, we will begin with one of the most commonly used and easiest to understand approaches: linear regression. In particular, we will start by looking at simple linear regression (SLR), where we have only one explanatory variable. We will then extend these models to allow for more than one explanatory variable using multiple linear regression (MLR).


# Simple Linear Regression

For a response variable  $y$ and an explanatory variable  $x$, the data can be expressed as:

$$(y_{i},x_{i}), i=1, \dots, n$$

That is, we have $n$ observations of $y$ and $x$. A statistical model is a mathematical statement describing the variability in a random variable $y$, including any relationship with the explanatory variable $x$. The inclusion of random (unpredictable) components $\epsilon$ makes the model statistical rather than deterministic. A simple linear regression model involves, as the name suggests, fitting a straight line to the data. Hence, a simple linear regression model can be written as follows:

$$y_i = \alpha + \beta x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2),$$

where

*   $y_i$ is the $i^{\text{th}}$ observation of the response variable;
*   $\alpha$ is the **intercept**; 
*   $\beta$ is the **slope** of the regression line;
*   $x_i$ is the $i^{\text{th}}$ observation of the explanatory variable; and
*   $\epsilon_i$ is the $i^{\text{th}}$ random component.
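
To make these components concrete, here is a minimal sketch that simulates data from the model $y_i = \alpha + \beta x_i + \epsilon_i$ and then recovers the intercept and slope by least squares. The parameter values, sample size, and seed are hypothetical and chosen only for illustration; `np.polyfit` is used here as one convenient way to fit a straight line.

```python
import numpy as np

# Hypothetical "true" parameter values, chosen only for illustration
alpha, beta, sigma = 2.0, 0.5, 1.0
n = 200

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=n)        # explanatory variable x_i
eps = rng.normal(0, sigma, size=n)    # random components epsilon_i
y = alpha + beta * x + eps            # response y_i

# Least-squares fit of a degree-1 polynomial; coefficients are returned
# highest degree first, i.e. (slope, intercept)
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
print(f"intercept estimate: {alpha_hat:.3f}, slope estimate: {beta_hat:.3f}")
```

With a reasonably large sample, the estimates should land close to the values used to generate the data, although they will not match exactly because of the random components.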

The random components, $\epsilon_i$, are normally distributed with mean zero and constant variance $\sigma^2$, so that we are essentially adding random white noise to the deterministic part of the model $(\alpha + \beta x_i)$. Thus, the full probability model for $y_i$ given $x_i$, written $(y_{i}|x_i)$, is

$$(y_{i}|x_i) \sim N(\alpha + \beta x_i, \sigma^2)$$

Hence, the mean comes from the deterministic part of the model, while the variance comes from the random part. 
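
To see why, take the expectation and variance of $y_i = \alpha + \beta x_i + \epsilon_i$ conditional on $x_i$, treating $\alpha + \beta x_i$ as fixed and assuming the random components have mean zero:

$$E(y_i|x_i) = \alpha + \beta x_i + E(\epsilon_i) = \alpha + \beta x_i, \qquad \text{Var}(y_i|x_i) = \text{Var}(\epsilon_i) = \sigma^2.$$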

In the lab next week, we will look at how to choose between different models. For this lab, we will focus on using the p-value to decide which terms to include in the model. We will use the standard convention of choosing a 5% level of significance, and then include a term in the model if the p-value of the associated parameter estimate is less than 0.05 (i.e. we reject the null hypothesis that the parameter is zero).
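
As a rough illustration of this decision rule, the sketch below fits a straight line to simulated data and compares the slope's p-value with 0.05. The data and parameter values are hypothetical, and `scipy.stats.linregress` is used only as one convenient option; the tools used elsewhere in this lab may differ.

```python
import numpy as np
from scipy import stats

# Simulated data with a hypothetical true slope of 0.5 (illustration only)
rng = np.random.default_rng(seed=1)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

# linregress reports a two-sided p-value for the null hypothesis
# that the slope is zero
result = stats.linregress(x, y)

# Decision rule at the 5% significance level
include_x = result.pvalue < 0.05
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.3g}, include x: {include_x}")
```

A p-value below 0.05 leads us to reject the null hypothesis of a zero slope and keep the explanatory variable in the model.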

# Simple linear regression with one numerical explanatory variable

First, we will import the following libraries:

```python
import pandas as pd
import numpy as np
```

Student feedback in higher education is extremely important when it comes to the evaluation of teaching techniques, materials, and improvements in teaching methods and technologies. However, there have been studies into potential bias factors when feedback is provided, such as the physical appearance of the teacher; see [Economics of Education Review](https://www.sciencedirect.com/journal/economics-of-education-review) for details. Here, we shall look at a study of student evaluations of $n = 463$ professors from The University of Texas at Austin. In particular, we will examine the evaluation scores of the instructors based purely on one numerical variable: their beauty score. Therefore, our simple linear regression model will consist of



*   the numerical outcome variable *teaching score* ($y$); and
*   the numerical explanatory variable *beauty score* ($x$).



# Exploratory data analysis

