# <center>How Does Curricular Complexity Affect Time-to-Degree?</center>
## <center> Study Design </center>

The following Julia programming language packages will be used in the analyses provided in this notebook:

In [None]:
using CurricularAnalytics, CurricularVisualization
# using CurricularOptimization  # uncomment before running the optimize_plan cells below
using Glob
using CSV
using DataFrames
using GLM
using StatsBase
using Plots
using DataStructures
using JLD2
using StatsPlots
using KernelDensity
using CausalInference
using GraphPlot
using Statistics
using LaTeXStrings
using IJulia
using WebIO

# Introduction
The goal of this study is to investigate the various factors that influence the amount of time it takes for students to earn an undergraduate degree, i.e., time-to-degree.  Clearly, there are many factors influencing time-to-degree, including a student's prior preparation, students changing majors, student employment, internships, institutional support, etc. In this notebook we investigate the influence of curricular complexity, and in particular structural complexity, on time-to-degree.  

The first portion of this study involves determining the impact of a program's structural complexity on a student's ability to graduate from that program within 4, 5, and 6 years. This portion of the study does *not* take into account student or institutional characteristics. This portion of the work will set the baseline for determining how additional factors influence the aforementioned graduation rates. Thus, if $X$ denotes the curricular (structural) complexity of a particular academic program, we'd like to determine its causal effect on the program's graduation rate, which can be represented as a causal diagram:
<p align="center">
   <img src="./XY-causal.png" width="300" height="200">
</p>

which should be read as "$X$ has a causal impact on $Y$." The variable $Y$ in this case corresponds to the time-to-degree for a given student, given that they graduated from a program with structural complexity $X$. The variable $U$ shown in this figure is meant to model unknown influences on the graduation rates (e.g., background preparation, motivation, etc.), and we will elaborate on them in subsequent work.

As mentioned above, there are likely many other "confounding" factors that impact a student's ability to graduate within a given timeframe. These will be considered in the sequel, as we build upon the initial causal model shown above.

In this model, we treat both $X$ and $U$ as direct causes of $Y$, where $U$ denotes the unmeasured factors (i.e., random exogenous causes), leading to the following initial structural causal model.

$$
 U = \{U_1, U_2, \ldots \} 
$$
$$
 V = \{Y\}
$$

## A Logistic Regression Model
We will model the relationship between curricular complexity and time-to-degree using regression techniques, with the dependent variable being whether or not a student graduated within a particular timeframe, and the independent variable being the curricular (i.e., structural) complexity. Because the dependent variables are dichotomous, i.e., a student either graduates or not within a given time period, we will apply *logistic regression* in order to deal with this nonlinearity. Later we will include other independent variables related to student demographics and preparation, instructional complexity, institutional support, etc. 

Consider first the relationship between program curricular complexity and the probability that a student will be able to graduate from that program within four years. We will use a binary outcome variable $Y$, with $Y=1$ indicating that a student was able to graduate within four years, and $Y=0$ indicating that a student was *not* able to graduate within four years. That is, $Y$ is treated as the outcome of a Bernoulli trial. Thus, we are interested in using previously collected student data to estimate:
$$
   \text{Prob}\{Y | X\}. 
$$

Logistic regression is commonly used to create a model for "explaining" how variation in a binary dependent variable relates to changes in the independent variables. Logistic regression makes use of the logistic function, expressed as: 
$$ 
\begin{equation}
 \text{Prob}\{Y=1| X\} = \left[1 + e^{-(b_0 + b_1X)}\right]^{-1}.
 %\label{logistic}
\end{equation}
$$
A graph of the function 
$$
 P = \left[1 + e^{-x}\right]^{-1}
$$
is shown in the figure below.

In [None]:
X = [i*0.1 for i in -40:40]
Y = [1/(1+exp(-x)) for x in X]
plot(X, Y, seriestype = :line, xlabel = latexstring("x"), ylabel=latexstring("P"), legend=false)

Notice that this transformation maps any $x$ value in $(-\infty, \infty)$ to a $P$ value in $(0,1)$.  Notice also that there is a region of $x$, from about -1.5 to 1.5, that is roughly linear, and where a small change in $x$ yields a large change in $P$.  This is also the behavior we see on test data when we compare graduation rates to curricular complexity.  That is, there is a roughly linear region of intermediate curricular complexity values where small changes in curricular complexity yield large changes in student success (i.e., graduation rates), but at very low and very high curricular complexity values, small changes in curricular complexity produce small changes in student success. At the extremes, almost everyone graduates within four years (low curricular complexity) or almost nobody graduates within four years (high curricular complexity).

Solving the logistic function for $x$ yields the inverse logistic function, also known as the logistic transformation (or *logit* for short). Specifically,
$$ 
 \begin{align}
 \ln P &= \ln\left({1 \over 1 + e^{-x}}\right) = \ln\left({e^x \over 1 + e^{x}}\right) \\
       &= \ln e^x - \ln(1 + e^x) \\
       &= x + \ln\left({1 \over 1 + e^{x}}\right) \\
       &= x + \ln(1-P)
 \end{align}
$$
and therefore
$$
  x = \ln\left({P \over 1-P}\right).
$$
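
The inverse relationship can be checked numerically. The following is a minimal sketch using only base Julia; the helper names `logistic` and `logit` are introduced here for illustration and are not part of the packages loaded above.

```julia
# Logistic function: maps any real x to a probability strictly between 0 and 1.
logistic(x) = 1 / (1 + exp(-x))

# Logit: the inverse of the logistic function, defined for 0 < P < 1.
logit(P) = log(P / (1 - P))

xs = -4.0:0.5:4.0

# Largest round-trip error of logit(logistic(x)) over the grid of x values.
roundtrip_error = maximum(abs(logit(logistic(x)) - x) for x in xs)
println("max |logit(logistic(x)) - x| = ", roundtrip_error)
```

The round-trip error should be at the level of floating-point rounding, which is what one expects if the two functions are inverses of each other.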

This logit equation has an interesting, and important, interpretation. Specifically, if we treat $P_i$ as the probability of an event $i$ occurring, then the quantity $P_i/(1-P_i)$ corresponds to the *odds* of an event $i$ occurring, and the *logged odds* for that event is defined as:
$$
   L_i = \log \left[{P_i \over 1-P_i} \right].
$$
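
One consequence of working with logged odds is that a one-unit increase in $X$ multiplies the odds by $e^{b_1}$, regardless of the starting value of $X$. The sketch below illustrates this with hypothetical coefficients; `b0` and `b1` are made-up values, not estimates from the data used in this notebook.

```julia
odds(p) = p / (1 - p)
logistic(x) = 1 / (1 + exp(-x))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 0.5, -0.002

# Probability of the event at a given value of x under the logistic model.
p_at(x) = logistic(b0 + b1 * x)

# The ratio of odds at x + 1 and at x equals exp(b1) for every x.
for x in (50.0, 200.0, 400.0)
    ratio = odds(p_at(x + 1)) / odds(p_at(x))
    println("x = ", x, ": odds ratio = ", ratio, ", exp(b1) = ", exp(b1))
end
```

Note that the change in *probability* for a one-unit change in $X$ is not constant; only the multiplicative change in the odds is.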

The purpose of logistic regression is to estimate the $P_i$ values (i.e., the probability of graduating within a given timeframe) from observed data. This is accomplished by determining the $b_0$ and $b_1$ coefficients in the logistic equation provided above through maximum likelihood estimation. 
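
To make the estimation step concrete, here is a minimal sketch of maximum likelihood estimation by Newton's method on synthetic data. It is not how the GLM package is implemented internally; `fit_logistic`, the generating coefficients, and the simulated data are all assumptions made for illustration.

```julia
using LinearAlgebra, Random

logistic(x) = 1 / (1 + exp(-x))

# Estimate (b0, b1) in Prob{Y=1|X} = logistic(b0 + b1*X) by Newton's method
# applied to the Bernoulli log-likelihood.
function fit_logistic(x::Vector{Float64}, y::Vector{Float64}; iters::Int = 25)
    A = hcat(ones(length(x)), x)                    # design matrix with an intercept column
    b = zeros(2)                                    # start from b0 = b1 = 0
    for _ in 1:iters
        p = logistic.(A * b)                        # current predicted probabilities
        score = A' * (y .- p)                       # gradient of the log-likelihood
        information = A' * (A .* (p .* (1 .- p)))   # negative Hessian of the log-likelihood
        b += information \ score                    # Newton update
    end
    return b
end

# Synthetic data generated from known (hypothetical) coefficients.
rng = MersenneTwister(1)
n = 5_000
x = 4 .* rand(rng, n) .- 2                          # covariate values in [-2, 2]
true_b = [0.3, -1.2]
y = Float64.(rand(rng, n) .< logistic.(true_b[1] .+ true_b[2] .* x))

b_hat = fit_logistic(x, y)
println("estimated (b0, b1) = ", b_hat, ", generating values = ", true_b)
```

With a few thousand samples the estimates should land close to the generating values, which is the behavior the sample-size discussion below relies on.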

### Data Requirements
The data that will be used to estimate these coefficients should be collected as follows. First, the analysis is backwards looking, which means the cohort should be defined as follows: Start with everyone who graduated in a given term, and then look backwards to see which first-time full-time (FTFT) cohort they belonged to. If they don’t belong to a FTFT cohort, they should be excluded from the data (i.e., the analysis should focus on full-time students).  From this you should compute how many years it took each student to graduate.  
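
The backwards-looking cohort construction can be sketched as follows. The records, field names, and the use of whole calendar years are hypothetical simplifications; real data would typically be keyed by term.

```julia
# Hypothetical graduate records: FTFT cohort entry year and graduation year.
# `nothing` marks a graduate who does not belong to any FTFT cohort.
records = [
    (cohort = 2016, graduated = 2020),
    (cohort = 2015, graduated = 2020),
    (cohort = nothing, graduated = 2020),
    (cohort = 2013, graduated = 2020),
]

# Keep only graduates who belong to a FTFT cohort.
ftft = filter(r -> r.cohort !== nothing, records)

# Years to degree, plus the binary outcome flags described in the field list.
outcomes = map(ftft) do r
    years = r.graduated - r.cohort
    (years = years, grad4 = Int(years <= 4), grad6 = Int(years <= 6))
end

foreach(println, outcomes)
```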

A sample spreadsheet with the required fields is supplied with this notebook. This spreadsheet contains the following data fields (columns):

- **CIP:** The Classification of Instructional Programs (CIP) code is a taxonomic coding scheme that categorizes fields of study and instructional programs. Developed by the National Center for Education Statistics (NCES) in the United States, CIP codes are used to facilitate the organization, collection, and reporting of fields of study and program completions across all education levels. Each CIP code provides a numerical identifier assigned to a specific area of academic study, ranging from agriculture, business, and health sciences to engineering and more. These codes are essential for educational research, policy analysis, and planning, as they provide a standardized system for tracking, assessing, and comparing educational programs and their outcomes.

- **Program_Complexity:** This is a discrete variable reflecting the complexity of each program that students attend at a given university. The complexity metric could encompass factors like the number of required courses, the depth of course content, and the sequencing of course material.

- **grad4:** A binary variable that indicates whether a student graduated within a four-year time frame (1) or took more than four years (0). This variable is essential to assess the efficiency and effectiveness of university programs.

- **grad6:** A binary variable that indicates whether a student graduated within a six-year time frame (1) or took more than six years (0). This variable is essential to assess the efficiency and effectiveness of university programs.

- **HSGPA:** High School GPA for each student, measuring academic performance prior to university enrollment. This variable is often used in educational research to control for prior academic achievement.

- **Gender:** A binary variable (0 for male and 1 for female) representing the Gender of the student. Gender has been shown to play a role in educational outcomes and choices.

- **Pell_Award:** A binary variable indicating whether a student received Pell Grant money (1) or not (0). This variable indicates students' socioeconomic status and is crucial to understanding access to educational opportunities and resources.

**A Note on Sample Size.**
It is important to collect a sufficient amount of data in order to obtain accurate estimates of the coefficients. A discussion of sufficient sample size statistics is beyond the scope of this notebook, but as a rule of thumb, the goal should be to collect on the order of thousands of samples across a large range of programs with different curricular complexity values. In practice this typically involves collecting 6-10 terms worth of graduation data across all of the undergraduate programs at a given institution.

For a semester-based institution, 6-10 terms corresponds to 3-5 years worth of data.  Given that programs often change their requirements over time, this introduces the problem of a program's curricular complexity possibly changing over the data collection period. Rather than trying to track which students graduated from particular versions of a program, it may be easiest to simply average the curricular complexity scores of a program over time, and treat this as a small amount of noise in the data.

### Example – University of Arizona
Graduation data over the past years was collected from the University of Arizona.  All undergraduate programs at the university currently in existence are included in this data set.  In order to work with this data set, we will read it into a data frame.

In [None]:
@load "/Users/daniel/Desktop/Arizona/VIP/G2KU/Assignment1/UA_student_data_01_25_24.jld2" df_binary
df = df_binary
show(df, maximum_columns_width = 8) 

## Curricular Analytics

The analyses in this notebook make use of the Curricular Analytics toolbox built using the Julia programming language and available as open source software [2]. If you would like to modify any of these analyses, you may find it useful to read the toolbox documentation, as well as the curricular analytics paper listed in the References section below [1]. The curricula associated with chemical engineering undergraduate programs at various universities were collected from the  http://CurricularAnalytics.org website. These curricula were entered by those working at the various institutions that provide these curricula.  We have *not* validated them in any way, i.e., we are using them "as is" according to how they were entered into the aforementioned web application.  That said, it is relatively straightforward to check these curricula by visiting the websites of the various universities offering these programs.  

### What is Curricular Complexity?
The curricular complexity metrics described below were derived based upon their impact on the ability of students to progress through a curriculum. Brief details of these metrics are provided below, for the sake of completeness; however, you can skip the technical details below without loss of understanding.

As a high-level summary, we model the overall complexity of a curriculum as a function of two main components: (1) the manner in which courses in the curriculum are taught and supported, and (2) the manner in which the curriculum is structured. We refer to the former as the *instructional complexity* of the curriculum, and to the latter as the *structural complexity* of the curriculum [1]. Each of these main components is a function of numerous other curriculum-related factors. In this report we focus on the structural complexity components; however, if there is an interest in investigating the impact of instructional complexity on student progress, there are simulation capabilities within the CurricularAnalytics.jl toolbox that can be used for that purpose. 

The structural complexity components we use below are:

#### Delay Factor
Many curricula, particularly those in science, technology, engineering, and math (STEM) fields, contain a set of courses that must be completed in sequential order. It is not uncommon in these programs to find prerequisite pathways consisting of seven or eight courses, spanning nearly every term in any possible degree plan. The ability to successfully navigate these long pathways without delay is critical for student success and on-time graduation.

For any curriculum $c$ we can construct a *curriculum graph*, denoted $G_c = (V,E)$, that is determined by the prerequisites in the curriculum. Specifically, each vertex $v_1, \ldots, v_n ∈ V$ represents a required course in curriculum $c$. There is a directed edge $(v_i,v_j) ∈ E$ from course $v_i$ to $v_j$ if $v_i$ is a prerequisite of $v_j$.

Based upon this definition, we define the **delay factor** associated with a given course $v_k$ in a curriculum $c$, denoted $d_c(v_k)$, as the number of vertices in the longest path in $G_c$ that passes through $v_k$:

\begin{equation}
d_c(v_k) = \max_{i,j} \{ \#(v_i\rightsquigarrow v_k \rightsquigarrow v_j)\}
\end{equation}

We define the delay factor associated with an entire curriculum $c$ as:

\begin{equation}
d(G_c)= \sum_{v_k ∈ V} d_c(v_k)
\end{equation}
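
The delay factor definitions can be sketched in a few lines of base Julia. This is an illustration on a hypothetical six-course curriculum, not the toolbox's implementation; it assumes the courses are numbered so that every prerequisite has a smaller number than the courses that require it.

```julia
# Hypothetical curriculum: successors[i] lists the courses that have course i
# as a direct prerequisite. Prerequisite chain 1→2→3→4, plus 1→5; course 6 is isolated.
successors = [[2, 5], [3], [4], Int[], Int[], Int[]]
n = length(successors)

# down[i]: number of vertices on the longest path that starts at course i.
down = ones(Int, n)
for i in n:-1:1, j in successors[i]
    down[i] = max(down[i], 1 + down[j])
end

# up[j]: number of vertices on the longest path that ends at course j.
up = ones(Int, n)
for i in 1:n, j in successors[i]
    up[j] = max(up[j], 1 + up[i])
end

# The longest path through a course joins the two halves; the course itself is counted once.
delay = up .+ down .- 1
curriculum_delay = sum(delay)

println("delay factors: ", delay)
println("curriculum delay factor: ", curriculum_delay)
```

Every course on the four-course chain receives a delay factor of 4, course 5 lies on the two-course path 1→5, and the isolated course 6 has a delay factor of 1.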

#### Blocking Factor
Another structural factor arises when one course serves as the gateway to many other courses in the curriculum. In this case, if a student is unable to pass the gateway course, they are **blocked** from attempting many of the other courses in the curriculum.

For instance, *Calculus 1* is often a foundational first-term course in a STEM curriculum that must be completed before taking other major-specific classes in subsequent terms. It is obvious that a course which is a prerequisite for a large number of other courses in a curriculum is a highly important course in that curriculum.

We will denote the situation where course $v_j$ is reachable from course $v_i$, via any prerequisite pathway, using $v_i \rightsquigarrow v_j$, and $v_i \nrightarrow v_j$ will be used if course $v_j$ is not reachable from course $v_i$. The blocking factor associated with course $v_i$ in curriculum $G_c = (V, E)$, denoted $b_c(v_i)$, is then given by: 

\begin{equation}
b_c(v_i)= \sum_{v_j ∈ V} I(v_i,v_j)
\end{equation}

where $I$ is the indicator function:

\begin{equation}
I(v_i, v_j) = \begin{cases}
1, & \text{if } v_i \rightsquigarrow v_j \\
0, & \text{if } v_i \nrightarrow v_j
\end{cases}
\end{equation}

We define the blocking factor associated with an entire curriculum $c$ as:

\begin{equation}
b(G_c)= \sum_{v_i ∈ V} b_c(v_i)
\end{equation}
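
The blocking factor can be sketched with a simple depth-first search. The six-course curriculum below is hypothetical, the `reachable` helper is introduced only for this illustration, and the sketch assumes that a course is not counted as reachable from itself.

```julia
# Hypothetical curriculum: successors[i] lists the courses that have course i
# as a direct prerequisite. Prerequisite chain 1→2→3→4, plus 1→5; course 6 is isolated.
successors = [[2, 5], [3], [4], Int[], Int[], Int[]]

# Collect every course reachable from `start` along prerequisite edges (depth-first search).
function reachable(successors, start)
    seen = Set{Int}()
    stack = copy(successors[start])
    while !isempty(stack)
        v = pop!(stack)
        if !(v in seen)
            push!(seen, v)
            append!(stack, successors[v])
        end
    end
    return seen
end

# Blocking factor of each course, and of the curriculum as a whole.
blocking = [length(reachable(successors, i)) for i in eachindex(successors)]
curriculum_blocking = sum(blocking)

println("blocking factors: ", blocking)
println("curriculum blocking factor: ", curriculum_blocking)
```

Course 1 blocks four other courses, which is why a gateway course of this kind dominates the total.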

#### Structural Complexity
After computing the blocking and delay factors for a curriculum, a unitless measure of structural complexity can be computed. Keep in mind that structural complexity explicitly relates to the likelihood that a student can complete a curriculum, as will be demonstrated below.

In order to determine this overall *structural complexity* metric, we simply add the blocking and delay factors of the entire curriculum:

\begin{equation}
\text{Structural Complexity} = b(G_c) + d(G_c)
\end{equation}
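
As a small worked example of this sum, consider a hypothetical four-course curriculum (not one of the UA programs) in which $v_1$ is a prerequisite of $v_2$, $v_2$ is a prerequisite of $v_3$, and $v_4$ neither has prerequisites nor is required by any other course. Assuming that a course is not counted as reachable from itself, the course-level factors are listed in the order $v_1, v_2, v_3, v_4$:

$$
\begin{align}
d(G_c) &= 3 + 3 + 3 + 1 = 10 \\
b(G_c) &= 2 + 1 + 0 + 0 = 3 \\
b(G_c) + d(G_c) &= 13
\end{align}
$$

Each course on the chain lies on a longest path of three vertices, while $v_4$ lies only on a path consisting of itself; $v_1$ blocks two courses and $v_2$ blocks one.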

Given these definitions, we provide an analysis of the UA's Mechanical Engineering curriculum below. A visualization of this curriculum is provided next. If you hover your mouse over the courses in this visualization, you will see the complexity metrics associated with each course in this curriculum.

**Mechanical Engineering program at the University of Arizona**<br>
Read the Mechanical Engineering program requirements at UA and employ the <span style="color:green">read_csv</span> function to construct a Curriculum object.

In [None]:
UA_ME_curric = read_csv("/Users/daniel/Desktop/Arizona/VIP/G2KU/Assignment1/BS in Mechanical Engineering.csv")

In [None]:
visualize(UA_ME_curric, notebook=true)

In [None]:
UA_Bioinfo_curric = read_csv("/Users/daniel/Desktop/Arizona/VIP/G2KU/Assignment1/BS_in_Bioinformatics.csv")

In [None]:
visualize(UA_Bioinfo_curric, notebook=true)

**Key metrics of the Mechanical Engineering curriculum at the University of Arizona**<br>
The curricular metrics for this program are as follows:

In [None]:
metrics = basic_metrics(UA_ME_curric)
println(String(take!(metrics)))

In [None]:
metrics = basic_metrics(UA_Bioinfo_curric)
println(String(take!(metrics)))

**Create a degree plan for the Mechanical Engineering program at the UA utilizing the <span style="color:green">req_distance_obj</span> objective.**<br>
The <span style="color:green">optimize_plan</span> function creates a degree plan spanning eight terms, aiming to minimize the gap between courses that are directly dependent on each other as prerequisites.

In [None]:
dp = optimize_plan(UA_ME_curric, 8, 14, 17, [req_distance_obj])

In [None]:
visualize(dp, notebook=true)

**Key metrics of the optimized Mechanical Engineering degree plan at the University of Arizona**<br>


In [None]:
metrics = basic_metrics(dp)
println(String(take!(metrics)))

**Electrical and Computer Engineering program at the University of Arizona**<br>
Read the Electrical and Computer Engineering program requirements at UA and employ the <span style="color:green">read_csv</span> function to construct a Curriculum object.

In [None]:
UA_ECE_curric = read_csv("/Users/daniel/Desktop/Arizona/VIP/G2KU/Assignment1/BS in Electrical and Computer Engineering.csv")

In [None]:
visualize(UA_ECE_curric, notebook=true)

**Create a degree plan for the Electrical and Computer Engineering program at the UA utilizing the <span style="color:green">balance_obj</span> objective.**<br>
The <span style="color:green">optimize_plan</span> function generates an eight-term degree plan, aiming to evenly distribute the credit hours across each term.

In [None]:
dp = optimize_plan(UA_ECE_curric, 8, 12, 19, balance_obj)

In [None]:
visualize(dp, notebook=true)

---

#### **Utilize the "df" DataFrame from cell 3 to address the following Questions:**

---

**Question 1**<br>
Create a code snippet that displays a histogram representing the distribution of "Program_Complexity" across all curricula at the University of Arizona.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
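# df has one row per student, so keep one row per program before counting curricula
programs = unique(df[:, [:CIP, :Program_Complexity]])
histogram(programs.Program_Complexity, xlabel="Program Complexity", ylabel="Number of Programs", legend=false, title="Histogram of Program Complexity at the UArizona")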

> **Note:** The complexity distribution of the undergraduate programs at the University of Arizona follows a power-law distribution.  There are numerous low complexity programs at the University of Arizona, and fewer high complexity programs. This power-law distribution is commonly observed when considering all of the curricula provided by an institution. 

**Question 2**<br>
Create a code snippet that displays a histogram representing the distribution of "Program_Complexity" across all engineering curricula at the University of Arizona.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
# Hint: "CIP" values for engineering curricula start with 14
#
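condition = (df."CIP" .>= 14) .& (df."CIP" .< 15)
engineering_curricula = df[condition, :]

# Keep one row per engineering program so each curriculum is counted once
engineering_programs = unique(engineering_curricula[:, [:CIP, :Program_Complexity]])
histogram(engineering_programs.Program_Complexity, xlabel="Program Complexity", ylabel="Number of Programs", title="Program Complexity for Engineering Curricula", legend=false)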

**Question 3**<br>
Create a code snippet that displays a histogram representing the distribution of "Program_Complexity" across all curricula taken by students over the years.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
histogram(df.Program_Complexity, xlabel="Program Complexity", ylabel="Count", legend=false, title="Histogram of Program Complexity Taken by UA Students")

**Question 4**<br>
Generate a kernel density estimation plot to illustrate the distribution of "Program_Complexity" across all curricula taken by students throughout the years.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
kde_estimate = kde(df.Program_Complexity)
plot(kde_estimate.x, kde_estimate.density, ribbon=false, xlabel="Program Complexity", ylabel="Density", label="All Students", title="KDE Plot of Program Complexity Taken by All Students")

**Question 5**<br>
Create kernel density estimation plots to compare the distribution of "Program_Complexity" for curricula taken by all the students at the University of Arizona over the years with those specifically pursued by engineering students.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
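condition = (df."CIP" .>= 14) .& (df."CIP" .< 15)
engineering_curricula = df[condition, :]
kde_estimate_all = kde(df.Program_Complexity)
kde_estimate_engineering = kde(engineering_curricula.Program_Complexity)
# Draw both densities in this cell so the comparison does not depend on an earlier plot
plot(kde_estimate_all.x, kde_estimate_all.density, xlabel="Program Complexity", ylabel="Density", label="All Students", title="KDE Plot of Program Complexity")
plot!(kde_estimate_engineering.x, kde_estimate_engineering.density, label="Engineering Students")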


In [None]:
histogram(engineering_curricula.Program_Complexity, xlabel="Program Complexity", ylabel="Count", legend=false, title="Histogram of Program Complexity of Engineering Students")

**Analysis:** What does the graph show? 

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>

**Question 6**<br>
Generate kernel density estimation plots for "Program_Complexity" for two distinct groups of students: (Group 1) Male students with a high school GPA between 3 and 3.5 who graduated within 4 years, and (Group 2) Male students with a high school GPA between 3 and 3.5 who did not graduate within 4 years.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
condition = (df."Gender" .== 0) .& ((df."HSGPA" .<= 3.5) .& (df."HSGPA" .>= 3)) .& (df.grad4 .== 1)
male_gpa335_grad4 = df[condition, :]
kde_estimate = kde(male_gpa335_grad4.Program_Complexity)
plot(kde_estimate.x, kde_estimate.density, ribbon=false, xlabel="Program Complexity", ylabel="Density", label="Graduated in 4 years", title="Program Complexity and Graduation of male students with HSGPA: 3-3.5", titlefontsize=10)
condition2 = (df."Gender" .== 0) .& ((df."HSGPA" .<= 3.5) .& (df."HSGPA" .>= 3)) .& (df.grad4 .== 0)
male_gpa335_notgrad4 = df[condition2, :]
kde_estimate2 = kde(male_gpa335_notgrad4.Program_Complexity)
plot!(kde_estimate2.x, kde_estimate2.density, ribbon=false, label="Did not graduate in 4 years")

**Analysis:** What does the graph show? 

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>

**Question 7**<br>
Create a graph showing the "Gender" distribution among students who received a Pell Grant (where "Pell_Award" equals 1).

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
pell_all = df.Pell_Award .== 1
pell_males = (df.Pell_Award .== 1) .& (df.Gender .== 0)
pell_females = (df.Pell_Award .== 1) .& (df.Gender .== 1)

n = sum(pell_all)
x = ["Male", "Female"]
y = [(sum(pell_males) / n), (sum(pell_females) / n)]
# display() is needed here because only the last plot in a cell is shown automatically
display(pie(x, y, title = "Gender of students who received a Pell Grant", percentformat=:percent))

bar(x, y, xlabel="Gender", ylabel="Proportion", color=["blue", "red"], legend=false, title="Gender of students who received a Pell Grant")

**Analysis:** What does the graph show? 

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>

### Creating the Logistic Regression Model
Implement logistic regression using a Generalized Linear Model (GLM) with a logit link function. The "model_grad4" and "model_grad6" models should relate curricular complexity to whether students completed their degrees within four and six years, respectively.

**Question 8**

**Model 1:**<br>
Create a code snippet to analyze the relationship between "Program_Complexity" and "grad4" using logistic regression.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
model_grad4 = glm(@formula(grad4 ~ Program_Complexity), df, Binomial(), LogitLink())

In [None]:
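b0_grad4, b1_grad4 = coef(model_grad4)   # intercept and slope on Program_Complexity
# Odds ratio for a one-point increase in Program_Complexity
# (the slope was -0.00151861 for the data set used in this notebook)
exp(b1_grad4)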

**Analysis:** 
Analyze the coefficients of this model.

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>

**Model 2:**<br>
Create a code snippet to analyze the relationship between "Program_Complexity" and "grad6" using logistic regression.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
model_grad6 = glm(@formula(grad6 ~ Program_Complexity), df, Binomial(), LogitLink())


In [None]:
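# Odds ratio for a one-point increase in Program_Complexity
# (the slope was 0.00112 for the data set used in this notebook)
exp(coef(model_grad6)[2])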

**Analysis:** 
Analyze the coefficients of this model.

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>

**Question 9**<br>
Utilize the logistic regression models developed in earlier cells to plot the probabilities of graduating in four and six years, respectively, as a function of "Program_Complexity".

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
X_values_grad4 = range(minimum(df.Program_Complexity), stop = maximum(df.Program_Complexity), length = 100)
predicted_probs_grad4 = predict(model_grad4, DataFrame(Program_Complexity = X_values_grad4))
plot(X_values_grad4, predicted_probs_grad4, xlabel="Program Complexity", ylabel="Predicted Prob of Graduating", label="Grad4", legend=:right)

X_values_grad6 = range(minimum(df.Program_Complexity), stop = maximum(df.Program_Complexity), length = 100)
predicted_probs_grad6 = predict(model_grad6, DataFrame(Program_Complexity = X_values_grad6))
plot!(X_values_grad6, predicted_probs_grad6, label="Grad6")

**Analysis:** What does the graph show? 

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>

**Question 10**<br>
How does a 40-point reduction in curricular complexity affect the four-year graduation rate?

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
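b1 = coef(model_grad4)[2]   # slope on Program_Complexity (-0.00151861 for this data set)
println("Change in log-odds for a 40-point reduction: ", b1 * -40)
exp(b1 * -40)               # factor by which the odds of graduating within four years change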

**Interpretation:** Interpret your results.

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>


Finally, let's consider the amount of variation explained by these regression models.  With linear models, the $R^2$ value is used to provide such a measure. In particular, $R^2$ describes the proportion of the variation in the dependent variable that can be attributed to the independent variables. This is accomplished by measuring how close the observed values of the dependent variable are to the predicted values provided by the linear model. Thus, we can think of $R^2$ as a measure of how much variation the model explains.

For logistic regression models, involving categorical data, a different measure must be developed for explaining variation.  One popular choice, called McFadden's $R^2$ (also known as a "pseudo $R^2$" value), is defined as:
$$ 
\begin{equation}
 R^2_{\text{McFadden}} = 1 - {\log L_c \over \log L_{null}}
\end{equation}
$$
where $L_c$ denotes the maximized likelihood of the fitted model, and $L_{null}$ denotes the maximized likelihood of the corresponding null model, i.e., the model produced using only an intercept, and no covariates. Note that $R^2_{\text{McFadden}} \in [0,1]$, with 0 indicating that the independent variables in the model explain none of the variation and 1 indicating that they explain all of it.
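
The definition can be illustrated on a toy example in base Julia. The outcomes and the "fitted" probabilities below are hypothetical values chosen by hand rather than the output of a fitted model, and the `loglik` and `mcfadden_r2` helpers are introduced only for this sketch. It relies on the fact that an intercept-only logistic model predicts the overall proportion of ones for every observation.

```julia
# Bernoulli log-likelihood of 0/1 outcomes y under predicted probabilities p.
loglik(y, p) = sum(yj == 1 ? log(pj) : log(1 - pj) for (yj, pj) in zip(y, p))

# McFadden's pseudo R^2, comparing fitted probabilities against the intercept-only model.
function mcfadden_r2(y, p_fitted)
    p_null = fill(sum(y) / length(y), length(y))   # base rate predicted for everyone
    return 1 - loglik(y, p_fitted) / loglik(y, p_null)
end

y = [1, 1, 1, 0, 1, 0, 0, 0]
p_fitted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.3, 0.2, 0.1]   # hypothetical fitted probabilities

r2 = mcfadden_r2(y, p_fitted)
r2_uninformative = mcfadden_r2(y, fill(0.5, length(y)))   # predictions equal to the base rate

println("McFadden R^2 with informative predictions: ", r2)
println("McFadden R^2 with base-rate predictions:   ", r2_uninformative)
```

Predictions that simply repeat the base rate give a value of zero, while predictions that separate the two outcomes push the value toward one.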


**Question 11**<br>
Compute the $R^2_{\text{McFadden}}$ for **Model 1**

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#
log_likelihood_model_grad4 = loglikelihood(model_grad4)
null_model = glm(@formula(grad4 ~ 1), df, Binomial(), LogitLink())
log_likelihood_null = loglikelihood(null_model)
mcfadden_r2 = 1 - (log_likelihood_model_grad4 / log_likelihood_null)

**Interpretation:** Interpret your results.

**Your Answer:**

<textarea style="width:100%;height:100px;border:1px solid #ccc; padding: 8px;">
(Type your answer here)
</textarea>

## Causal inference

Causal inference is a statistical approach that aims to determine the cause-and-effect relationship between variables. Unlike traditional correlation analysis, which can only identify associations between variables, causal inference seeks to uncover how changes in one variable directly affect another. This distinction is crucial for understanding the underlying mechanisms that drive observed outcomes, enabling researchers and decision-makers to predict the consequences of interventions accurately.

Causal inference is important for several reasons:

1. **Decision Making**: It allows policymakers, scientists, and businesses to make informed decisions by understanding the likely impact of their actions. For instance, in public health, knowing the causal relationship between lifestyle choices and health outcomes can inform effective prevention strategies.

2. **Understanding Complex Systems**: Many systems, such as economic markets, ecosystems, and human biology, are complex and interrelated. Causal inference helps disentangle these relationships, providing insights into how system components interact.

3. **Policy Evaluation**: It is essential for evaluating the effectiveness of policies and interventions. By identifying causal effects, it becomes possible to assess whether a policy achieved its intended outcomes and to optimize future interventions.

4. **Avoiding Spurious Correlations**: Not all associations imply causation. Causal inference techniques help distinguish between mere correlations, which might be coincidental or due to confounding factors, and genuine causal relationships.

5. **Innovation and Development**: In fields like drug development and technology, understanding causality can lead to breakthroughs by identifying which factors most significantly influence desired outcomes.

By focusing on causality rather than correlation, causal inference provides a more profound and actionable understanding of the world, enabling more effective interventions and smarter decision-making across a wide range of disciplines.
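
The point about spurious correlations can be made concrete with simulated data. Constraint-based methods such as the PC algorithm rely on conditional-independence tests, and for roughly Gaussian data a partial correlation is a common building block for such a test. The sketch below uses only Julia's standard libraries; the variables and the `partial_cor` helper are hypothetical and are not part of the CausalInference package API.

```julia
using LinearAlgebra, Random, Statistics

# Partial correlation of columns i and j of `data` given all remaining columns,
# read off the inverse of the correlation matrix.
function partial_cor(data::Matrix{Float64}, i::Int, j::Int)
    Q = inv(cor(data))
    return -Q[i, j] / sqrt(Q[i, i] * Q[j, j])
end

rng = MersenneTwister(42)
n = 5_000
z = randn(rng, n)            # confounder
x = z .+ randn(rng, n)       # driven by z, with no direct link to y
y = z .+ randn(rng, n)       # driven by z, with no direct link to x

marginal = cor(x, y)                              # association induced by the confounder
conditional = partial_cor(hcat(x, y, z), 1, 2)    # association after conditioning on z

println("correlation of x and y:         ", marginal)
println("partial correlation given z:    ", conditional)
```

Here `x` and `y` are clearly correlated even though neither causes the other, and the association largely disappears once the confounder `z` is conditioned on.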

**Question 12**<br>
Use the PC algorithm to infer the causal relationships between program complexity, Pell Grant status, HSGPA, gender, and four-year graduation. Plot the graph.

In [None]:
# Your Answer Here:
# ----------------------------------------
# Write your answer below. You may use multiple lines of code if necessary.
# 
# (Type your answer here)
#

## References
<a id='References'></a>

[1] Heileman, G. L., Abdallah, C.T., Slim, A., and Hickman, M. (2018). Curricular analytics: A framework for quantifying the impact of curricular reforms and pedagogical innovations. www.arXiv.org, arXiv:1811.09676 [cs.CY].

[2] Heileman, G. L., Free, H. W., Abar, O., and Thompson-Arjona, W. G. (2019). CurricularAnalytics.jl Toolbox. https://github.com/heileman/CurricularAnalytics.jl.