# [LEGALST-123] Lab 06: Regression and Causal Inference

This lab will review Ordinary Least Squares regression, the use of regression for causal inference, and interpreting regression models (including the idea of hypothesis testing). The idea here is to review how causal inference models are used in the social sciences (here, with data on bike rentals) and how to interpret those models.

In [None]:
# Just run this cell
from collections import Counter
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_squared_error

**Here are some helpful resources to reference while doing this lab**:


*   [Python Reference Table](https://www.data8.org/fa23/reference/)
*   [Data 8 textbook - Regression](https://inferentialthinking.com/chapters/15/2/Regression_Line.html?highlight=regression)

<hr style="border: 1px solid #fdb515;" />

## Data

The data we are exploring is collected from a bike sharing system in Washington D.C.

In [None]:
# Run this cell to load the data, no further action is needed
bike = pd.read_csv("https://github.com/ds-modules/data/raw/main/bikeshare.txt")

# Note that we're taking a random sample of the dataset since our dataset is large.
bike = bike.sample(n=1000, random_state=42).reset_index(drop=True)
bike.head(10)

The variables in this data frame are defined as:

| Variable | Description |
| ---------| ------------|
| instant  | record index|
| dteday   | date |
| season | 1: spring, 2: summer, 3: fall, 4: winter |
| yr | 0: 2011, 1: 2012 |
| mnth | month (1 to 12) |
| hr | hour (0 to 23) |
| holiday | whether the day is a holiday or not |
| weekday | day of the week |
| workingday | if the day is neither weekend nor holiday |
| weathersit | 1: clear or partly cloudy, 2: mist and clouds, 3: light snow or rain, 4: heavy snow or rain |
| temp | normalized temperature in Celsius (divided by 41) |
| atemp | normalized "feels-like" temperature in Celsius (divided by 50) |
| hum | normalized percent humidity (divided by 100) |
| windspeed | normalized wind speed (divided by 67) |
| casual | count of casual users |
| registered | count of registered users |
| cnt | count of total rental bikes, including casual and registered |

## A Note on Data Preparation

Reflecting back on Lab 3, it’s crucial to remember the importance of cleaning our dataset to enhance the quality of our analysis. Below are some strategies that could be beneficial:


1. Addressing Missing Data
  *   Identify and handle missing data points to ensure they don’t negatively impact our analysis.
  *   This could involve removing or imputing missing values depending on the situation. <br> <br>

2. Recode Categorical Variables
  *   Transform categorical variables into dichotomous variables, taking on values of 0 or 1, to enable analysis and interpretation. Be warned: sometimes categorical variables may take integer values, not strings. <br> <br>

3. Standardize Scale
  *   Ensure that all scales are recoded in a consistent direction, enhancing the interpretability of our results (we'll discuss the specifics of this later on!).
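
As a rough illustration of the first two strategies, here is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are hypothetical, not the bikeshare data). It assumes that dropping incomplete rows is acceptable and that `season` is a category stored as integer codes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the bikeshare file): one missing value and
# one categorical variable stored as integer codes.
toy = pd.DataFrame({
    "temp": [0.24, np.nan, 0.60, 0.72],
    "season": [1, 2, 3, 4],   # integer codes, but really categories
    "cnt": [16, 40, 110, 95],
})

# 1. Count missing values per column, then drop incomplete rows
missing_counts = toy.isna().sum()
toy_clean = toy.dropna().reset_index(drop=True)

# 2. Recode the integer-coded category into 0/1 indicator columns.
#    drop_first=True leaves one category out as the reference group,
#    which avoids perfect multicollinearity with the intercept.
dummies = pd.get_dummies(toy_clean["season"], prefix="season",
                         drop_first=True, dtype=int)
toy_ready = pd.concat([toy_clean.drop(columns="season"), dummies], axis=1)

print(missing_counts)
print(toy_ready)
```

Whether to drop or impute missing rows depends on how much data is missing and why, so treat `dropna()` here as one option rather than a default.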

<hr style="border: 1px solid #fdb515;" />

## Simple Linear Regression

Recall from Data 8 that the least-squares regression line is the unique straight line that minimizes root mean squared error (RMSE) among all possible fit lines. Using this property, we can find the equation of the regression line by finding the pair of slope and intercept values that minimize the root mean squared error.

**Simple linear regression** in this case refers to a specific type of least-squares regression in which we are only picking one independent variable and fitting a model to predict our dependent variable.
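
To make the "minimizes RMSE" property concrete, here is a small sketch with made-up numbers (the arrays `x` and `y` are hypothetical). It computes the slope and intercept with the standard closed-form formulas for one predictor, then checks that nudging the line in any direction does not lower the RMSE.

```python
import numpy as np

# Hypothetical toy data, not the bikeshare sample
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([30.0, 90.0, 160.0, 200.0, 270.0])

def rmse(slope, intercept):
    """Root mean squared error of the line y = slope * x + intercept."""
    residuals = y - (slope * x + intercept)
    return np.sqrt(np.mean(residuals ** 2))

# Closed-form least-squares solution for a single predictor
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

best = rmse(slope, intercept)

# RMSE for lines shifted away from the least-squares line
nudged = [rmse(slope + ds, intercept + di)
          for ds in (-5, 0, 5) for di in (-5, 0, 5)]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, RMSE={best:.2f}")
print("Lowest RMSE among nudged lines:", round(min(nudged), 2))
```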

For this example, we're going to explore the relationship between temperature (`temp`) and count (`cnt`). Let's do a simple linear regression with `"temp"` as a predictor for `"cnt"`.

In [None]:
# Preparing the data for modeling
X = bike[['temp']]  # Predictor
Y = bike['cnt']     # Response

# Creating a scatter plot to visualize the data (without OLS Model First)
plt.figure(figsize=(8, 5))
plt.scatter(X, Y, alpha=0.5)
plt.title('Bike Rentals vs Temperature')
plt.xlabel('Normalized Temperature (Celsius)')
plt.ylabel('Count of Total Bike Rentals')
plt.show()

Based solely on the plot above, it looks like there may be a slight positive relationship between the temperature and count of bike rentals -- that is, generally, it seems like the number of bike rentals goes up as the temperature increases. To see if this is an accurate interpretation, we create the linear regression model below.

In order to fit the linear regression model, **we use the `LinearRegression` class from the scikit-learn package.** First, we **define our model as a linear regression** by creating a `LinearRegression` object, then we **fit our data with a predictor (X) and response (Y) variable** by calling its `.fit()` method. We can then use the fitted model to predict the response variable and analyze how well it performs by comparing the predicted response value to the observed response.

In [None]:
# Fitting the linear regression model
model = LinearRegression()
model.fit(X, Y)

# Making predictions
y_pred = model.predict(X)

# Plotting the regression line
plt.figure(figsize=(8, 5))
plt.scatter(X, Y, alpha=0.5)  # actual points
plt.plot(X, y_pred, color='red', linewidth=2)  # regression line
plt.title('Bike Rentals vs Temperature')
plt.xlabel('Normalized Temperature (Celsius)')
plt.ylabel('Count of Total Bike Rentals')
plt.show()

# Output the model coefficients
print(f"Coefficient (slope): {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")

# Calculating and printing RMSE
rmse = np.sqrt(mean_squared_error(Y, y_pred))  # RMSE is the square root of the MSE
print(f"Root Mean Squared Error: {rmse}")

#### Analysis of Simple Linear Regression Results

The coefficient, also known as the slope, indicates the change in the count of total bike rentals for each one-unit increase in normalized temperature. Since `temp` has been divided by 41, a one-unit increase corresponds to 41 degrees Celsius. The positive value suggests a direct relationship between temperature and bike rentals. The intercept represents the expected count of bike rentals when the normalized temperature is 0.
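
If you would rather read the slope per degree Celsius, you can undo the normalization with a little arithmetic. The sketch below assumes, as the variable table states, that `temp` is Celsius divided by 41. The slope value is a made-up placeholder, not the output of the cell above.

```python
# Hypothetical slope, NOT the fitted value from the model above
slope_per_normalized_unit = 400.0

# From the variable table: temp = Celsius / 41
TEMP_DIVISOR = 41

# One degree Celsius is 1/41 of a normalized unit
slope_per_degree_celsius = slope_per_normalized_unit / TEMP_DIVISOR

print(f"{slope_per_degree_celsius:.2f} additional rentals per 1 degree Celsius")
```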

However, it is important to consider the assumptions behind linear regression when interpreting these results. The main assumptions include linearity, independence, homoscedasticity, and normal distribution of residuals. If any of these assumptions do not hold, the predictions and inferences made from the model may be unreliable. Therefore, further analysis and diagnostics are necessary to validate these assumptions.


**Steps of Linear Regression**

Just like that, we've created a simple linear regression model! As we saw with the code above, it's a simple model to create and allows for interpretability of coefficients for us to understand clearly how our model makes its predictions.

Here is a full breakdown of the steps to performing Ordinary Least Squares (OLS) regression:

1. **Choose Your Variables**
  * Decide on which variable you want to predict (the dependent variable) and which variables you will use to predict it (the independent variables). <br> <br>
2. **Gather Data**
  * Collect the data for all the variables you have decided to use. Ensure there are no missing values and that the data is clean. <br> <br>
3. **Visualize Data**
  * Plot the data to get a sense of the relationship between the independent and dependent variables. Look for trends, patterns, and potential outliers. <br> <br>
4. **Fit the Model**
  * Use the OLS method to find the coefficients that minimize the sum of the squares of the residuals (the differences between the observed values and the values predicted by the model). <br> <br>
5. **Analyze Results**
  * Examine the coefficients to see how changes in the independent variables are expected to affect the dependent variable. Check the R-squared value to see how well the model explains the variability of the dependent variable. <br> <br>
6. **Diagnostic Checks (More on this at the bottom of the notebook)**
  * Perform diagnostic tests to ensure the model assumptions are not violated. This may include checking for linearity, homoscedasticity, independence, and normality of residuals. <br> <br>
7. **Make Predictions**
  * Use the model to make predictions on new data, applying the coefficients to the independent variables to get predicted values for the dependent variable. <br> <br>
8. **Evaluate the Model**
  * Assess the model's performance using metrics like RMSE (Root Mean Squared Error) and validate it on a test dataset to ensure that it generalizes well to unseen data.
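
Steps 5 and 8 mention R-squared and RMSE. As a minimal sketch of how both come from the residuals, here is the calculation by hand on made-up observed and predicted values (`y_obs` and `y_hat` are hypothetical, not model output from this lab).

```python
import numpy as np

# Hypothetical observed and predicted values
y_obs = np.array([12.0, 40.0, 95.0, 150.0, 210.0, 260.0])
y_hat = np.array([20.0, 55.0, 90.0, 140.0, 200.0, 262.0])

# Residuals: observed minus predicted
residuals = y_obs - y_hat

# R-squared: share of the variability in y that the predictions account for
ss_res = np.sum(residuals ** 2)                   # unexplained variation
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)      # total variation
r_squared = 1 - ss_res / ss_tot

# RMSE: typical size of a prediction error, in the units of y
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"R-squared: {r_squared:.3f}")
print(f"RMSE: {rmse:.2f}")
```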

<hr style="border: 1px solid #fdb515;" />

## Multiple Linear Regression

In the previous example, you saw an implementation of a simple linear regression model. In the following example, we'll take a look at another type of OLS model -- multiple linear regression. As the name implies, multiple linear regression allows us to predict a dependent variable as a linear combination of multiple independent variables.

In this example, we'll also implement and break down many of the steps that we discussed in the outline of linear models above!

### Steps 1 and 2: Choosing and Gathering Our Data

For this example, we'll be predicting the `"cnt"` variable from the independent features `"temp"`, `"hum"`, `"windspeed"`, `"season"`, and `"weathersit"`. Let's gather all of these columns and set them up as variables, and we'll also double check that none of the values are missing.

In [None]:
# Choosing a subset of variables for the multiple regression model
features = ['temp', 'hum', 'windspeed', 'season', 'weathersit']  # example feature set
X_multi = bike[features]
y_multi = bike['cnt']

In [None]:
# We can use a combination of the .isna() and .sum() methods to make sure there are no missing values.
X_multi.isna().sum()

In [None]:
y_multi.isna().sum()

We now have our independent and dependent variables set up in `X_multi` and `y_multi` respectively, and our data is null-free and ready to use.

### Step 2.5: Scaling and Splitting the Data

We didn't explicitly mention this in the steps above, but two incredibly important steps with regard to treating our data before performing linear regression are scaling and splitting the data.

Scaling the data refers to the process of **placing the numerical features on a standard scale.** If we skip this step, coefficients for features measured in different units are not directly comparable, so we can over- or underestimate the relative effect that each independent variable has on the dependent variable.

Splitting the data refers to **dividing the data into two parts: a training set and a testing set.** We use the training set as the data that we actually fit the model with, and we use the testing set when we're looking into the accuracy or error rates of our predictions. It's crucial for us to take this step to avoid **overfitting our model, which is what happens when our model performs very well on our training data but poorly on unseen data.** We'll dive a bit deeper into this later on!
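
To see what standardizing does under the hood, here is a sketch in plain NumPy on made-up features (the array `X` and the 80/20 row split are hypothetical). It learns the mean and standard deviation from the training rows only and reuses them on the test rows, which mirrors the fit-on-train, transform-on-test pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))  # hypothetical features

# Split first (here: first 80 rows for training, last 20 for testing)
X_train, X_test = X[:80], X[80:]

# Learn the mean and standard deviation from the training rows only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma  # reuse the training statistics

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(3))
print("Test means after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The test means will usually be close to, but not exactly, zero. That is expected, because the test rows never contributed to `mu` and `sigma`.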

In the code cell below, we use two tools from the scikit-learn package (the `train_test_split` function and the `StandardScaler` class) to easily do these tasks for us.

In [None]:
# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

# Standardizing the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Step 3. Visualize Data

As in the simple linear regression model, let's look at the relationships between our independent and dependent variables before we do any model creation, in order to give us more insight into our data.

Take a look at the scatterplots we create below for some of the continuous independent variables against our dependent variable.

Remember that:


*   Continuous variables can take any value within a range (e.g., temperature, height).
*   Discrete variables can only take specific, separate values (e.g., count of people, number of cars).
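
One quick, informal way to spot variables that take only a few distinct values is to count the unique values in each column. This is only a heuristic sketch on a made-up table (`toy` is hypothetical), and a small number of unique values is a hint to look closer, not a rule.

```python
import pandas as pd

# Hypothetical toy data, not the bikeshare sample
toy = pd.DataFrame({
    "temp": [0.22, 0.24, 0.31, 0.46, 0.52, 0.64],
    "weathersit": [1, 1, 2, 1, 3, 2],
})

# Number of distinct values in each column
unique_counts = toy.nunique()
print(unique_counts)
```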



In [None]:
# Scatter plot for 'temp' vs 'cnt'
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_train['temp'], y=y_train)
plt.title('Temperature vs Total Bike Rentals')
plt.xlabel('Normalized Temperature (Celsius)')
plt.ylabel('Count of Total Bike Rentals')
plt.show()

# Scatter plot for 'hum' vs 'cnt'
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_train['hum'], y=y_train)
plt.title('Humidity vs Total Bike Rentals')
plt.xlabel('Normalized Humidity')
plt.ylabel('Count of Total Bike Rentals')
plt.show()

# Scatter plot for 'windspeed' vs 'cnt'
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_train['windspeed'], y=y_train)
plt.title('Windspeed vs Total Bike Rentals')
plt.xlabel('Normalized Windspeed')
plt.ylabel('Count of Total Bike Rentals')
plt.show()

Some of the features above seem like they might be suitable for linear regression, while others seem like they aren't quite as applicable. Understanding these trends before creating a model can be helpful in providing more context for the results of the model.

### Steps 4, 7, and 8: Fitting the Model, Making Predictions, and Evaluating the Model

We'll combine some of the steps specified above because they go together quite well and are easy to implement thanks to scikit-learn.

As before, we'll define the model, fit it to our training data, make predictions for our test data, and then see how the model performs on the test data.

In [None]:
# Fitting the multiple regression model
model_multi = LinearRegression()
model_multi.fit(X_train_scaled, y_train)

# Making predictions on the test set
y_pred_multi = model_multi.predict(X_test_scaled)

# Evaluating the model
rmse_multi = np.sqrt(mean_squared_error(y_test, y_pred_multi))  # RMSE is the square root of the MSE
print(f"Root Mean Squared Error (Multiple Regression): {rmse_multi}")

As we can see, the RMSE for this model is 145.77, while the simple linear regression model had an RMSE of 160.59! This model seems to do a better job of making predictions for our data. However, this brings us to an important point to be cautious of when performing OLS regression: the bias-variance tradeoff.

#### Feature Importances
Usually, when we are testing a model developed from social science theory, we want to evaluate the causal importance of the features (independent variables). Since we have standardized the features, we cannot easily evaluate them in terms of their all-else-equal effect on the outcome in the original units (although, if we wanted, we could use the unstandardized features and merely rescale them to be comparable). But we can investigate their relative importance using sklearn. **Think especially about the `weathersit` and `season` variables in the dataset. What kind of variables are they? Should we be using them here?**

In [None]:
# get the standardized regression coefficients for the features in our multiple regression model
# and the columns that made up X
# and put them into a df
standardized_coeffs = pd.DataFrame...
standardized_coeffs

<span style="color:red">**Question:**</span><p>
_**Does the list of standardized coefficients for the features in the model make sense? Should we have used `weathersit` and `season` in our model like this?** Explain here_ <p>
...


### Bias-Variance Tradeoff

The bias-variance tradeoff is a significant consideration when creating models for regression (or any type of value prediction task). As discussed above, overfitting is the situation in which a model performs very well on the training data but poorly on unseen data. Model overfitting can happen in two major ways:

1. **Not splitting the data into training and testing sets.**
    * The model that we create is the one that performed the best on the training set. If we use all of our data to train the model, it will become very good at formulating predictions for the data that it saw and was trained on, but it may not be able to make accurate predictions for any unseen data, and we will have no held-out data to check. Essentially, the model will not be generalizable. <br> <br>
    
2. **Utilizing too many features as independent variables in the model.**
    * The tradeoff between having high accuracy and being generalizable is called the bias-variance tradeoff. Let's look at an example of this next.

Let's try creating a multiple linear regression model using 11 of the independent variables in our dataset. Along with this, in order to show the issue with using this many features, we'll change the sizes of our training and testing split between our data: in this case, we'll train on only a bit of the data and use a lot of it for testing.

Below, we'll perform the exact same process that we did before, but we'll keep all of the code in one big cell.

In [None]:
# Choosing a subset of variables for the multiple regression model
features_of = ['temp', 'hum', 'windspeed', 'season', 'weathersit', 'holiday', 'workingday', 'mnth', 'hr', 'weekday', 'atemp']
X_multi_of = bike[features_of]
y_multi_of = bike['cnt']

# Splitting the data into training and testing sets
X_train_of, X_test_of, y_train_of, y_test_of = train_test_split(X_multi_of, y_multi_of, test_size=0.75, random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train_scaled_of = scaler.fit_transform(X_train_of)
X_test_scaled_of = scaler.transform(X_test_of)

# Fitting the multiple regression model
model_multi_of = LinearRegression()
model_multi_of.fit(X_train_scaled_of, y_train_of)

# Making predictions on the test set
y_pred_multi_of = model_multi_of.predict(X_test_scaled_of)

Now that we have our new model, **let's compare the RMSEs for this model and the previous multiple LR model for the training data.**

In [None]:
# Evaluating the model
rmse_multi = np.sqrt(mean_squared_error(y_train, model_multi.predict(X_train_scaled)))
rmse_multi_of = np.sqrt(mean_squared_error(y_train_of, model_multi_of.predict(X_train_scaled_of)))
print(f"Root Mean Squared Error (Multiple Regression): {rmse_multi}")
print(f"Root Mean Squared Error (Multiple Regression Overfit): {rmse_multi_of}")

When we compare the RMSE of the two models on their training sets, we see that the model with more features does a bit better than the one with fewer features! But now, let's take a look at the RMSEs of the two models on the testing set.

In [None]:
# Evaluating the model
rmse_multi = np.sqrt(mean_squared_error(y_test, model_multi.predict(X_test_scaled)))
rmse_multi_of = np.sqrt(mean_squared_error(y_test_of, model_multi_of.predict(X_test_scaled_of)))
print(f"Root Mean Squared Error (Multiple Regression): {rmse_multi}")
print(f"Root Mean Squared Error (Multiple Regression Overfit): {rmse_multi_of}")

Here, we see the issue with using this many features for prediction: because we saved a lot of our data for testing rather than training, we could see how our model became overfit on the training data due to all of the features that we used. Be sure to keep this tradeoff in mind as you work on regression problems in the future!
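
The same pattern can be reproduced on simulated data where we know the truth. In the sketch below (all names and numbers are hypothetical), the outcome depends on one real predictor, and the larger model also receives ten pure-noise features. On the training rows the larger model can never fit worse. On fresh rows it typically does, although the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 20, 200

def make_data(n):
    """Simulate one real predictor, ten noise features, and an outcome."""
    x = rng.uniform(0, 1, size=n)
    noise_features = rng.normal(size=(n, 10))  # unrelated to y
    y = 100 + 300 * x + rng.normal(scale=50, size=n)
    return x, noise_features, y

def design(x, extra=None):
    """Build a design matrix: intercept, x, and optional extra columns."""
    cols = [np.ones_like(x), x]
    if extra is not None:
        cols.extend(extra.T)
    return np.column_stack(cols)

def rmse(A, coef, y):
    return np.sqrt(np.mean((y - A @ coef) ** 2))

x_tr, junk_tr, y_tr = make_data(n_train)
x_te, junk_te, y_te = make_data(n_test)

A_small_tr, A_small_te = design(x_tr), design(x_te)
A_big_tr, A_big_te = design(x_tr, junk_tr), design(x_te, junk_te)

# Ordinary least squares fits on the training rows only
coef_small, *_ = np.linalg.lstsq(A_small_tr, y_tr, rcond=None)
coef_big, *_ = np.linalg.lstsq(A_big_tr, y_tr, rcond=None)

train_small = rmse(A_small_tr, coef_small, y_tr)
train_big = rmse(A_big_tr, coef_big, y_tr)
test_small = rmse(A_small_te, coef_small, y_te)
test_big = rmse(A_big_te, coef_big, y_te)

print(f"Train RMSE  small: {train_small:.1f}   big: {train_big:.1f}")
print(f"Test RMSE   small: {test_small:.1f}   big: {test_big:.1f}")
```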

<hr style="border: 1px solid #fdb515;" />

## ✅ Question 1: Try Your Own Model on the Dataset!

Now that we've run through a simple linear regression example and a multiple linear regression example, try one of your own! Come up with an explanatory model for the bike rentals dataset or a relationship you are interested in exploring using linear regression. **In the following text cell, outline the model including the theory of human behavior (however basic) that underlies your model.** Then, implement your model in the code cell below. Finally, in another text cell, describe what you found with your model.

**Note:** Feel free to add additional code cells if you'd prefer to split up the steps!

*Discuss your model in words here!*

In [None]:
# Implement your model here, commenting on the steps you take to clean the data and make the variables
# ready to use in an OLS regression model.

*Here, describe what you found with your model (remember that a negative finding is still a finding!).*

<hr style="border: 1px solid #fdb515;" />

## Some Parting Notes

### Predicting Continuous vs. Discrete Variables

In the examples in this lab, we predicted a numeric outcome that we treated as continuous -- the count of bike rentals. However, if our problem instead wanted to predict a categorical outcome, such as whether a particular individual rented a bike (1) or not (0), that would require a different approach than regression: **classification**, or the prediction task of classifying units into some group or another. We'll explore this a bit more in future labs, so stay tuned!

### Using OLS Regression to Make an Inference from Data

In the social sciences, regression is often used to support a causal argument based on some kind of theory of human action.
In effect, what it represents mathematically is, "What is the effect of a unit change in 𝑥 (the cause of interest) on 𝑦 (the outcome of interest), holding everything else equal?"
Another way of thinking about it is as closely related to the partial correlation between a single variable 𝑥 and an outcome 𝑦. Note that when we use `StandardScaler` (which re-expresses each variable in units of standard deviations from its mean), the coefficients no longer describe the effect of a one-unit change in 𝑥, in its original units, on 𝑦. When you read social science articles, you will often see variables rescaled to run from 0 to 1 so that the reader can more easily interpret what a change in 𝑥 will do to 𝑦.
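
To see how the choice of scale changes what a coefficient means, here is a sketch on simulated data (the variables and numbers are hypothetical). It fits the same one-predictor regression three times: on the raw predictor, on the standardized predictor, and on the predictor rescaled to run from 0 to 1.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 40, size=200)                  # hypothetical raw predictor
y = 20 + 5 * x + rng.normal(scale=10, size=200)   # hypothetical outcome

def fit_slope(pred, outcome):
    """Least-squares slope of outcome on a single predictor."""
    slope, intercept = np.polyfit(pred, outcome, 1)
    return slope

# Effect of a 1-unit change in x, in its original units
raw_slope = fit_slope(x, y)

# Effect of a 1-standard-deviation change in x
z = (x - x.mean()) / x.std()
std_slope = fit_slope(z, y)

# Effect of moving from the minimum of x to the maximum of x
x01 = (x - x.min()) / (x.max() - x.min())
minmax_slope = fit_slope(x01, y)

print(f"Raw slope:          {raw_slope:.2f}")
print(f"Standardized slope: {std_slope:.2f}")
print(f"0-to-1 slope:       {minmax_slope:.2f}")
```

The three fits describe the same line. The standardized slope equals the raw slope times the standard deviation of `x`, and the 0-to-1 slope equals the raw slope times the range of `x`.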

Correlation and regression are introduced quite well in the Data 8 textbook linked at the top of this notebook, so in this notebook we are trying to show how social scientists use regression to make causal arguments. We asked you to include a very simple theory of human behavior to motivate your model-building above. Social scientists draw on theory to explain the causal mechanism behind the association of the input variables and the outcome variable.

### Linear Regression Assumptions

As we discussed above, there are a number of assumptions that regression techniques require that we will keep returning to in future notebooks. Here's a list of the major ones:
* Independent variables (IVs) are quantitative or dichotomous.
* Dependent variable is quantitative, continuous, and unbounded.
* All IVs have variance not equal to 0 (i.e., there is some variation in their value).
* No perfect multicollinearity between any two IVs.
* For each set of values for the independent variables, the mean value of the error term is zero.
* Each IV is uncorrelated with the error term.
* The variance of the error term for each set of values for the IVs is constant (homoscedasticity assumption).
* Error terms for any two observations of the values of IVs are uncorrelated.
* Error terms for each set of values for the IVs are normally distributed.
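
Two of these assumptions (nonzero variance and no perfect multicollinearity) can be checked mechanically. The sketch below uses a made-up feature matrix in which one column never varies and another is an exact multiple of the first. The rank comparison is one common way to detect exact linear dependence and is offered here as an illustration, not as a complete diagnostic.

```python
import numpy as np

# Hypothetical features: the third column is exactly 2 * the first,
# and the fourth column never varies.
X = np.array([
    [1.0, 0.2, 2.0, 5.0],
    [2.0, 0.5, 4.0, 5.0],
    [3.0, 0.1, 6.0, 5.0],
    [4.0, 0.9, 8.0, 5.0],
    [5.0, 0.4, 10.0, 5.0],
])

# Columns with zero variance carry no information for the regression
variances = X.var(axis=0)
zero_variance_cols = np.where(np.isclose(variances, 0))[0]

# Drop them, add an intercept column, and compare the matrix rank with
# the number of columns. A lower rank means some column is an exact
# linear combination of the others.
X_varying = np.delete(X, zero_variance_cols, axis=1)
A = np.column_stack([np.ones(len(X_varying)), X_varying])
rank = np.linalg.matrix_rank(A)
has_perfect_collinearity = rank < A.shape[1]

print("Zero-variance columns:", zero_variance_cols)
print("Perfect multicollinearity:", has_perfect_collinearity)
```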

Some of these assumptions may not have been satisfied by the examples above, and checking them formally is out of scope for this class -- that's okay. However, in future work, it's important to keep them in mind and either satisfy them or identify where we fell short, in order to properly justify our model.