# Investigating the Impact of Learning Styles on Course Completion Rates

## Introduction

In this project, we explore whether a specific learning style, binge learning, increases the likelihood of course completion among learners on Coursera. Binge learning is defined as completing one week of a course and starting the next week on the same day. To investigate this, we use a dataset of 49,808 learners and apply regression analysis and instrumental variable (IV) techniques, relying on a randomized encouragement trial to draw causal inferences.

## Data Description

The dataset `lecture3.csv` contains the following variables:
- **id**: A unique identifier for each learner
- **paid_enroll**: Dummy variable indicating if a learner has paid for enrollment (1 = yes, 0 = no)
- **prv_wk_nbr**: The most recent course week a learner has completed
- **prv_wk_min**: The minutes a learner spent in the previous week on the platform
- **message**: Dummy variable indicating if a learner received a message encouraging binge learning (1 = yes, 0 = no)
- **binge**: Dummy variable indicating if a learner has binged (1 = yes, 0 = no)
- **complete**: Dummy variable indicating if a learner completed the next week in the course (1 = yes, 0 = no)

## Methodology

To determine the impact of binge learning on course completion, we follow these steps:
1. **Regression Analysis**: Analyze the relationship between binge learning and course completion.
2. **Instrumental Variable Analysis**: Use a randomized encouragement trial to address potential self-selection bias.

## Analysis and Results

In [6]:
# Import necessary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels import IV2SLS

# Load the dataset
data_coursera = pd.read_csv('assets/lecture3.csv')

# Display the first few rows of the dataset
data_coursera.head()

Unnamed: 0,id,paid_enroll,prv_wk_nbr,prv_wk_min,message,binge,complete
0,1,1,2,193,0,1,1
1,2,0,5,194,0,1,1
2,3,1,1,45,0,0,1
3,4,1,4,118,0,0,1
4,5,0,5,247,0,1,1


In [8]:
# Regress complete on binge using heteroskedasticity-robust (HC1) standard errors in statsmodels
# (the choice of covariance estimator does not change the point estimate)
reg = smf.ols(formula='complete ~ binge', data=data_coursera).fit(cov_type='HC1')
binge_coeff1_1 = round(reg.params['binge'], 4)
binge_coeff1_1

0.4619

In [7]:
# Run the regression one more time with additional controls: paid_enroll, prv_wk_nbr, prv_wk_min
reg2 = smf.ols(formula='complete ~ binge + paid_enroll + prv_wk_nbr + prv_wk_min', data=data_coursera).fit()
binge_coeff1_2 = round(reg2.params['binge'], 4)
binge_coeff1_2

0.3172

The coefficient on `binge` falls from 0.4619 to 0.3172, a drop of roughly 31%, once the controls are added. When the point estimate of interest changes this much with the inclusion of further covariates, it is worrisome for causal inference (recall the regression sensitivity analysis). Intuitively, the positive correlation between bingeing and completion could simply reflect self-selection: more motivated learners are both more likely to binge and inherently more likely to complete.

To overcome this problem, researchers at Coursera ran a randomized encouragement trial. They randomly split learners into two groups. The treatment group received a message immediately after completing a week of material, encouraging them to start the next week right away. The control group did not receive the message.
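The self-selection concern can be illustrated with a small simulation. This is only a sketch under invented assumptions: the `motivation` variable, the probabilities, and the true effect of 0.10 are hypothetical and are not taken from the Coursera data. It shows how an unobserved trait that drives both bingeing and completion inflates the naive gap in completion rates.

```python
import random

def simulate_naive_gap(n=20000, true_effect=0.10, seed=0):
    """Naive completion-rate gap between bingers and non-bingers (hypothetical data)."""
    rng = random.Random(seed)
    outcomes = {0: [], 1: []}
    for _ in range(n):
        motivation = rng.random()  # unobserved trait, uniform on [0, 1)
        # More motivated learners are more likely to binge (self-selection).
        binge = 1 if rng.random() < 0.2 + 0.6 * motivation else 0
        # Completion depends on motivation AND on bingeing (the true causal effect).
        p_complete = 0.3 + 0.4 * motivation + true_effect * binge
        complete = 1 if rng.random() < p_complete else 0
        outcomes[binge].append(complete)

    def rate(values):
        return sum(values) / len(values)

    return rate(outcomes[1]) - rate(outcomes[0])

naive_gap = simulate_naive_gap()
print(f"true effect: 0.10, naive gap: {naive_gap:.3f}")
```

Under these assumed parameters the naive gap is about 0.18 in expectation, well above the true effect of 0.10, because bingers are on average more motivated than non-bingers.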

### Instrumental Variable Analysis

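We will use the binary variable `message` as our instrument to investigate the impact of bingeing on completion of the following week's lecture. Because the message was randomly assigned, it is independent of learner motivation. We additionally assume that it affects completion only through bingeing (the exclusion restriction).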

In [10]:
# Using robust standard errors in the statsmodels module, regress variable binge on variable message
reg3 = smf.ols(formula='binge ~ message', data=data_coursera).fit()
robust_reg2_2a = reg3.get_robustcov_results(cov_type='HC1')
robust_reg2_2a.summary()

OLS Regression Results

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | binge | R-squared | 0.018 |
| Model | OLS | Adj. R-squared | 0.018 |
| Method | Least Squares | F-statistic | 911.9 |
| Date | Wed, 24 Jul 2024 | Prob (F-statistic) | 1.55e-198 |
| Time | 15:30:49 | Log-Likelihood | -16053.0 |
| No. Observations | 49808 | AIC | 32110.0 |
| Df Residuals | 49806 | BIC | 32130.0 |
| Df Model | 1 | | |
| Covariance Type | HC1 | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.8243 | 0.002 | 342.078 | 0.000 | 0.820 | 0.829 |
| message | 0.0903 | 0.003 | 30.198 | 0.000 | 0.084 | 0.096 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus | 18923.394 | Durbin-Watson | 2.007 |
| Prob(Omnibus) | 0.0 | Jarque-Bera (JB) | 52775.181 |
| Skew | -2.131 | Prob(JB) | 0.0 |
| Kurtosis | 5.696 | Cond. No. | 2.62 |


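**First Stage Analysis**: The first stage regression shows that receiving a message increases the probability of bingeing by about 9 percentage points (coefficient 0.0903, robust t-statistic 30.2), from a baseline of 82.43% in the control group. The F-statistic of 911.9 is far above the common rule-of-thumb threshold of 10, so the instrument (`message`) is strongly related to the endogenous variable (`binge`).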
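With a binary instrument, the first-stage slope is simply the difference in bingeing rates between the two groups. The sketch below checks this on a tiny made-up sample; the ten observations and the helper names `ols_slope` and `diff_in_means` are illustrative assumptions, not part of the original analysis.

```python
from statistics import mean

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

def diff_in_means(z, d):
    """Mean of d among z == 1 minus mean of d among z == 0."""
    treated = [di for zi, di in zip(z, d) if zi == 1]
    control = [di for zi, di in zip(z, d) if zi == 0]
    return mean(treated) - mean(control)

# Hypothetical toy sample: 4 control learners, 6 messaged learners.
message = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
binge = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]

slope = ols_slope(message, binge)
gap = diff_in_means(message, binge)
print(slope, gap)  # the two numbers agree up to floating-point error
```

The same identity explains why the intercept in the first-stage output (0.8243) is the bingeing rate in the control group.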

### Intention-to-Treat (ITT) Effect

Next, we calculate the ITT effect by running the reduced form regression.

In [11]:
# Calculate the "intention-to-treat" (ITT) effect by running the reduced form regression
reg4 = smf.ols(formula='complete ~ message', data=data_coursera).fit()
l_change2_3 = round(reg4.params['message'], 4)
l_change2_3

0.0113

**ITT Effect**: The ITT effect shows that receiving a message increases the likelihood of completing the next week by approximately 1.13 percentage points.

### Non-Compliance and Monotonicity

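The ITT does not take into account that some learners do not comply with their treatment assignment: some binge without receiving the message (always-takers), and some do not binge even after receiving it (never-takers). When treatment effects are heterogeneous, we assume monotonicity, meaning there are no "defiers" in the population, i.e., no learner who would binge only if they did not receive the message. Under this assumption, the IV estimate is the average effect of bingeing for compliers, the learners who binge because of the message.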
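In potential-outcomes notation, the assumption and the resulting estimand can be written as follows. The symbols are introduced here for illustration and do not appear in the original notebook: $Z_i$ is `message`, $D_i$ is `binge`, $Y_i$ is `complete`, and $D_i(z)$ is the bingeing decision learner $i$ would make under assignment $z$. The identity also relies on random assignment of $Z_i$ and the exclusion restriction.

```latex
% Monotonicity: the message never discourages bingeing (no defiers)
D_i(1) \ge D_i(0) \quad \text{for all } i

% Local average treatment effect (LATE) for compliers
\mathrm{LATE}
  = E\left[ Y_i(1) - Y_i(0) \mid D_i(1) > D_i(0) \right]
  = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}
         {E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}
```

The numerator is the ITT effect from the reduced form, and the denominator is the first-stage coefficient.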

In [14]:
# Calculate the share of "always-takers"
at_share2_5 = round(((data_coursera["message"] == 0) & (data_coursera["binge"] == 1)).sum() / (data_coursera["message"] == 0).sum(), 4)
at_share2_5

0.8243

In [15]:
# Calculate the share of "never-takers"
nt_share2_6 = round(((data_coursera["message"] == 1) & (data_coursera["binge"] == 0)).sum() / (data_coursera["message"] == 1).sum(), 4)
nt_share2_6

0.0854
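Under monotonicity, the remaining learners are compliers, and their share equals the first-stage coefficient. The following sketch computes all three shares from raw assignment and take-up indicators; the function name `compliance_shares` and the toy data are hypothetical. The final lines check the arithmetic using the shares reported above.

```python
def compliance_shares(z, d):
    """Return (always-taker, never-taker, complier) shares, assuming no defiers."""
    control = [di for zi, di in zip(z, d) if zi == 0]
    treated = [di for zi, di in zip(z, d) if zi == 1]
    always = sum(control) / len(control)     # take treatment without encouragement
    never = 1 - sum(treated) / len(treated)  # refuse treatment despite encouragement
    compliers = 1 - always - never           # everyone else, under monotonicity
    return always, never, compliers

# Hypothetical toy sample: 5 control learners, 5 messaged learners.
z = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
d = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
always, never, compliers = compliance_shares(z, d)
print(always, never, compliers)

# Arithmetic check with the shares reported in this analysis.
complier_share_reported = round(1 - 0.8243 - 0.0854, 4)
print(complier_share_reported)  # 0.0903, matching the first-stage coefficient
```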

In [28]:
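# Calculate the IV (Wald) estimate manually: reduced-form (ITT) coefficient divided by first-stage coefficient
# Note: both inputs are rounded to 4 decimals first, so the last digit can differ slightly from a 2SLS estimate
first_stage_coeff = round(reg3.params['message'], 4)
iv_estimate = round(l_change2_3 / first_stage_coeff, 4)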
iv_estimate

0.1251
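The manual calculation is the Wald estimator: the gap in outcomes between the two assignment groups divided by the gap in treatment take-up. As a rough sketch on made-up data (the helper names and the ten toy observations are assumptions for illustration), the ratio of group-mean gaps coincides with the covariance ratio that 2SLS computes when there is a single binary instrument and no other covariates.

```python
from statistics import mean

def group_gap(z, v):
    """Mean of v among z == 1 minus mean of v among z == 0."""
    treated = [vi for zi, vi in zip(z, v) if zi == 1]
    control = [vi for zi, vi in zip(z, v) if zi == 0]
    return mean(treated) - mean(control)

def wald_estimate(z, d, y):
    """Reduced-form gap divided by first-stage gap."""
    return group_gap(z, y) / group_gap(z, d)

def iv_cov_ratio(z, d, y):
    """cov(z, y) / cov(z, d), the just-identified IV estimator."""
    mz = mean(z)
    cov_zy = sum((zi - mz) * yi for zi, yi in zip(z, y))
    cov_zd = sum((zi - mz) * di for zi, di in zip(z, d))
    return cov_zy / cov_zd

# Hypothetical toy sample: z = message, d = binge, y = complete.
z = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
d = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
y = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]

wald = wald_estimate(z, d, y)
ratio = iv_cov_ratio(z, d, y)
print(wald, ratio)  # both are 0.5 up to floating-point error

# Same arithmetic with the rounded coefficients reported in this analysis.
reported = round(0.0113 / 0.0903, 4)
print(reported)  # 0.1251
```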

In [30]:
# Run a two-stage least squares regression using IV2SLS from the linearmodels library

# Drop rows with missing values; .copy() guards against pandas' SettingWithCopyWarning
# when the constant column is added below
data_coursera = data_coursera.dropna().copy()
data_coursera["const"] = 1  # IV2SLS does not add an intercept on its own
iv2sls2_8 = IV2SLS(
    dependent=data_coursera["complete"],
    exog=data_coursera["const"],
    endog=data_coursera["binge"],
    instruments=data_coursera["message"],
).fit(cov_type="robust")

In [31]:
print(iv2sls2_8.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:               complete   R-squared:                      0.1213
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1213
No. Observations:               49808   F-statistic:                    19.395
Date:                Wed, Jul 24 2024   P-value (F-stat)                0.0000
Time:                        16:25:31   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.7862     0.0248     31.690     0.0000      0.7376      0.8348
binge          0.1254     0.0285     4.4040     0.00

## Conclusion

Our analysis shows that a message encouraging learners to start the next week's material immediately after completing the current week raises the probability of bingeing by about 9 percentage points, and that bingeing in turn increases the likelihood of completing the following week. The two-stage least squares (2SLS) estimate of 0.1254 (robust standard error 0.0285) implies that bingeing raises the probability of completing the next week by roughly 12.5 percentage points for compliers, i.e., learners who binge because of the message. This is far smaller than the OLS estimates of 0.4619 and 0.3172, which is consistent with self-selection inflating the naive correlation.

## References

- ["Instrumental Variables & Randomized Encouragement Trials: Driving Engagement of Learners"](assets/MediumArticle.pdf)