# Maximum Likelihood Estimation (MLE)

MLE is an estimation method commonly used for regression models. It asks a simple question: which parameter values make the data we observe most likely? Or, put differently, which model parameters are most likely to have generated the data we observe? So MLE works backward, using optimization algorithms to evaluate the likelihood of the data at candidate values of our model parameters (like the y-intercept and predictor variable slopes) and to find the values that maximize it. MLE is particularly useful when we cannot algebraically solve for the parameter values like we can for OLS. Mathematically, it is easier to:

1. Use the log of the likelihood function;
2. Take the minimum of the negative log-likelihood function.

So that is why MLE examines negative log-likelihoods. The resulting parameter estimates are the values (the intercept, the slopes, and so on) that are most likely to have given rise to the data we observe, and they are identified at the minimum of the negative log-likelihood function.
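
For the normal linear model used in the exercise that follows, the two steps above can be written out explicitly. This is a sketch that assumes independent errors $e_i \sim N(0, \sigma^2)$; the symbols $\beta_0$, $\beta_1$, and $\sigma$ are chosen here to mirror the `beta0`, `beta1`, and `sigma` arguments in the code further down.

$$
L(\beta_0, \beta_1, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)
$$

$$
-\log L(\beta_0, \beta_1, \sigma) = \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2
$$

Taking the log turns the product into a sum, which is easier to compute, and flipping the sign turns the maximization into the minimization that most optimization routines expect.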

Let's do a quick exercise using a linear model to illustrate what MLE is doing. I will ***never*** ask you to do this on your own for an exam or project, but it might help you better understand the class of models that use MLE.

In [47]:
# install packages (if needed)
# install.packages("tidyverse")
# install.packages("bbmle") # bbmle will let us program mle
# install.packages("marginaleffects") # interpret results of logit models
# install.packages("performance") # for percent correctly predicted

Load the packages and create some fake data!

In [None]:
library(tidyverse)
library(bbmle)
library(marginaleffects)
library(performance)

# set the start point for random numbers so we get the same data
set.seed(1234)

# generate our error term
e <- rnorm(1000, 0, 10) # mean of 0 and sd of 10

# generate our predictor variable
x <- runif(1000, 18, 98)

# generate y as a function of x and error.
# true values: b0 = 20, b1 = .8
y <- 20 + .8*x + e

# combine fake data into tibble dataframe
df1 <- tibble(y, x, e)

# plot
ggplot(df1, aes(x = x, y=y)) + geom_point()

# check to make sure OLS gives us correct info
ols1 <- lm(y ~ x, data = df1)
summary(ols1)

Now, let's program some MLE.

In [None]:

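# Create a function for the negative log-likelihood. Its arguments
# are the parameters, which take on specific values each time
# the function is called.

# here, we specify the model, working backward from the residuals
# our three parameters are the y-intercept, the slope of x
# and the standard deviation of the residuals 
LL <- function(beta0, beta1, sigma){
  # residuals implied by the candidate intercept and slope
  resids <- y - x*beta1 - beta0
  # log density of each residual under a normal(0, sigma) distribution;
  # warnings are suppressed because the optimizer may try invalid sigma values
  loglik <- suppressWarnings(dnorm(resids, 0, sigma, log = TRUE))
  # return the negative log-likelihood
  -sum(loglik)
}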

# use mle2 from the bbmle package to use optimization methods
# to find values of parameters. 
fit <- mle2(minuslogl = LL, # our function
            start = list(beta0 = 0, beta1 = 0, sigma = 1) # starting values for parameter search
            )
fit # call up results

How did we do? Did MLE get close to the OLS estimates? What about the original true values? Please answer using a Markdown section of your Quarto submission.

That's all fine and good, I guess. But it still is a little bit of a black box. This time, let's fix the parameter values for the standard deviation of the residuals and the y-intercept to their true values, and then graph what happens to the negative log likelihood as we cycle through values for `beta1`.

In [None]:
# now only beta1 remains as a parameter. beta0 is set to 20 and sigma to 10.
LL2 <- function(beta1){
  resids <- y - x*beta1 - 20
  loglik <- suppressWarnings(dnorm(resids, 0, 10, log = TRUE))
  -sum(loglik)
}

# create vector of integers from 0 to 200
testpars <- 0:200
# divide by 100 to scale between 0 and 2
testpars <- testpars / 100

# take each value in `testpars` and plug into LL2 function 
LL.out <- sapply(testpars, LL2)
# write result of function to dataframe named `dfLL`
dfLL <- data.frame(testpars, negLL = LL.out)

# graph result
ggplot(data = dfLL, aes(x = testpars, y = negLL)) +
    geom_line() + 
    theme_minimal()

Just to drive home the point here, remake the above graph, this time adding in a vertical dashed red line at `x = 0.8`. Add appropriate axis titles to the graph. Change the color of the line and increase the thickness of the line. You might need to review the arguments for `geom_line` in `ggplot()`. You can find out more [here](https://ggplot2.tidyverse.org/reference/geom_path.html). Do you see how MLE uses the minimum negative log likelihood to identify parameter estimates?
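
The minimum that the graph shows can also be located numerically. The block below is a minimal sketch, not the method `mle2` uses internally: it re-creates the simulated data so it runs on its own, then uses base R's one-dimensional `optimize()` over the same 0 to 2 range. The names `negLL_beta1`, `opt`, and `beta1_closed` are illustrative.

```r
# Re-create the simulated data from earlier so this block is self-contained
set.seed(1234)
e <- rnorm(1000, 0, 10)
x <- runif(1000, 18, 98)
y <- 20 + 0.8 * x + e

# Negative log-likelihood in beta1 only (beta0 fixed at 20, sigma fixed at 10)
negLL_beta1 <- function(beta1) {
  resids <- y - 20 - beta1 * x
  -sum(dnorm(resids, mean = 0, sd = 10, log = TRUE))
}

# One-dimensional search for the minimum over the plotted range
opt <- optimize(negLL_beta1, interval = c(0, 2))
opt$minimum    # value of beta1 at the minimum
opt$objective  # negative log-likelihood at that value

# With beta0 and sigma fixed, the minimizer also has a closed form
beta1_closed <- sum(x * (y - 20)) / sum(x^2)
beta1_closed
```

The two answers should agree to several decimal places, and both should sit near the bottom of the curve in the graph.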

# Logistic Regression


Logistic regression is not the only model for analyzing binary outcome variables, but it is the most common (the other very common option is probit regression). Running a logit model is not particularly difficult. You can use `glm()` (which stands for *generalized linear model*), where we specify our model of the outcome and the distribution that we assume the outcome follows. Specifying a distribution is a necessary part of any MLE approach to modeling. The way the linear model is mapped to the outcome comes from the *link function*. In logistic regression, the outcome is assumed to follow the *binomial distribution* and the link function is the *logit*.
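
Written out, the logit model and the quantity that MLE minimizes look like this. The notation is an assumption made for illustration: $x_i\beta$ stands for the linear predictor of observation $i$ (called `xb` in the simulation code below) and $p_i$ for its probability that $y_i = 1$.

$$
p_i = \Pr(y_i = 1) = \frac{1}{1 + \exp(-x_i\beta)}, \qquad \log\left(\frac{p_i}{1 - p_i}\right) = x_i\beta
$$

$$
-\log L(\beta) = -\sum_{i=1}^{n}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]
$$

The first line shows the logit link and its inverse; the second is the negative log-likelihood for binary outcomes, which has no algebraic solution for $\beta$ and so is minimized numerically.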

In [51]:
# load packages for interpreting logit models

library(marginaleffects) # for interpretation
library(performance) # for percent correctly predicted

# create simulated data
x1 <- rbinom(n = 700, size = 1, prob = 0.5) # dichotomous x variable
x2 <- round(rnorm(700, 6, 2), digits = 0) # normal x variable
x3 <- runif(700, 0, 100) # uniform x variable

xb <- -2 - 2.6*x1 + .5*x2 + .011*x3 # latent linear variable

probs <- 1/(1 + exp(-xb)) # inverse logit transformation 

# binary outcome y for 700 trials, with the probability y = 1
# set by `probs` equation in previous line
y <- rbinom(n = 700, size = 1, prob = probs)  

# combine vectors into data frame
bin.out <- tibble(x1,x2,x3,y)


Now we can estimate our model using `glm()`.

In [None]:

# estimate model
mod_logit <- glm(y ~ x1 + x2 + x3, 
                    family = binomial(link = "logit"),
                    data = bin.out)

#check model results
summary(mod_logit)


One nice statistic to check is the PCP, or the percent correctly predicted.

In [None]:
performance_pcp(mod_logit)

This output shows us that our model correctly predicts the value of $y$ in 69% of the observations. The "null" model refers to the percent correctly predicted without using any predictor variables. Clearly, knowing our $x$ variables helps us better predict $y$. The likelihood-ratio test is a significance test that compares our model to the null model. The results above show that the improvement of our model over the null is statistically significant.
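
To make the idea concrete, here is a rough sketch of one simple way to compute a percent correctly predicted by hand: classify an observation as 1 when its predicted probability exceeds 0.5 and count the matches. This threshold rule is an assumption chosen for illustration and may not be the method `performance_pcp()` uses by default, so the numbers need not match its output. The data are re-simulated with an arbitrary seed, and `dat`, `mod`, `pcp_model`, and `pcp_null` are illustrative names.

```r
# Re-simulate data with the activity's recipe; the seed is an arbitrary assumption
set.seed(2024)
n  <- 700
x1 <- rbinom(n, size = 1, prob = 0.5)
x2 <- round(rnorm(n, 6, 2), digits = 0)
x3 <- runif(n, 0, 100)
y  <- rbinom(n, size = 1, prob = plogis(-2 - 2.6 * x1 + 0.5 * x2 + 0.011 * x3))
dat <- data.frame(y, x1, x2, x3)

mod <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"), data = dat)

# Predicted probability that y = 1 for each observation
p_hat <- fitted(mod)

# Share of observations where the 0.5-threshold prediction matches y
pcp_model <- mean((p_hat > 0.5) == (dat$y == 1))

# Null benchmark: always guess the more common outcome
pcp_null <- max(mean(dat$y), 1 - mean(dat$y))

c(model = pcp_model, null = pcp_null)
```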

## Logit Postestimation and Interpretation

As we have discussed, interpreting logit models is difficult because of the non-linear nature of the inverse logit (logistic) curve and the mapping of continuous unbounded values into the probability scale. Typically, we want to know how a predictor variable is expected to change the probability of our units having a characteristic or experiencing an event. In other words, we want to know how $x$ changes $Pr(y = 1)$. We can convey this information in several ways, each with its own advantages and disadvantages.
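
A small numeric sketch may help show why this is hard. Using the true coefficients from the simulation above (not estimates), the same 0 to 100 change in $x3$ moves $Pr(y = 1)$ by different amounts depending on the value of $x1$. Holding $x2$ at 6 is an assumption made for illustration, and `pr_y` is a hypothetical helper name.

```r
# True coefficients from the simulation in this activity
b0 <- -2
b1 <- -2.6
b2 <- 0.5
b3 <- 0.011

# Probability that y = 1 for a given profile (inverse logit of the linear predictor)
pr_y <- function(x1, x2, x3) plogis(b0 + b1 * x1 + b2 * x2 + b3 * x3)

# Same change in x3 (0 to 100), holding x2 at 6, for each value of x1
change_x1_0 <- pr_y(0, 6, 100) - pr_y(0, 6, 0)
change_x1_1 <- pr_y(1, 6, 100) - pr_y(1, 6, 0)

round(c(x1_0 = change_x1_0, x1_1 = change_x1_1), 3)
```

The change is about 0.16 when $x1 = 0$ and about 0.21 when $x1 = 1$, even though the coefficient on $x3$ is identical in both cases.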

### Predicted Probabilities

**Predicted probabilities** express the $Pr(y = 1)$ for values of our predictor variable. Let's use predictions from our model to understand our model results.  

In [None]:
# to generate a predicted probability y = 1 for each observation in your
# actual data, use predictions() from the marginaleffects package

# syntax: dataframe <- predictions(model name)
preds <- predictions(mod_logit) 

# default is a probability for binary outcome models, stored in `estimate`
# notice that we also receive uncertainty estimates
head(preds)
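
To see where the values in `estimate` come from, the sketch below reproduces predicted probabilities by hand using only base R. It re-simulates data with the same recipe as above under an arbitrary seed (an assumption, so the numbers will not match the notebook's), and the names `dat`, `mod`, `p_hand`, and `p_builtin` are illustrative.

```r
# Re-simulate data with the activity's recipe; the seed is an arbitrary assumption
set.seed(2024)
n  <- 700
x1 <- rbinom(n, size = 1, prob = 0.5)
x2 <- round(rnorm(n, 6, 2), digits = 0)
x3 <- runif(n, 0, 100)
y  <- rbinom(n, size = 1, prob = plogis(-2 - 2.6 * x1 + 0.5 * x2 + 0.011 * x3))
dat <- data.frame(y, x1, x2, x3)

mod <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"), data = dat)

# Linear predictor for every observation: design matrix times coefficients
xb_hat <- as.numeric(model.matrix(mod) %*% coef(mod))

# The inverse logit maps the linear predictor onto the probability scale
p_hand <- plogis(xb_hat)

# Compare with R's built-in predictions on the response (probability) scale
p_builtin <- as.numeric(predict(mod, type = "response"))
all.equal(p_hand, p_builtin)
```

A predicted probability is just the estimated linear predictor pushed through the inverse logit; the uncertainty estimates that `predictions()` adds are not reproduced in this sketch.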

Typically, we want these predictions to be graphed by values of our predictor variables. That is, we want to show what happens to $Pr(y = 1)$ by values of $x$. We can do that with `predictions()` as well, using additional arguments to the function. Since two other variables ($x1$ and $x2$) also determine $Pr(y = 1)$, we need to fix them to specific values; otherwise we will not receive just one value of $Pr(y = 1)$ for each value of $x3$.

In [None]:
# note: `condition` is an argument for plot_predictions(), not predictions(),
# so the grid of values is set entirely through `newdata`
preds <- predictions(mod_logit,
                     newdata = datagrid(x1 = 0.5,      # fix x1 to .5
                                        x2 = mean(x2), # fix x2 to its mean
                                        x3 = 0:100))   # create table where x3 ranges
                                                       # from 0 to 100

# check results
head(preds)

Looks good! Note that our `preds` dataframe now contains observations with fixed values for $x1$ and $x2$, while $x3$ increases from 0 by 1 unit each observation. If you recall, we created $x3$ from a uniform distribution ranging from 0 to 100, so this reflects the entire range of values of the variable. Now, try your hand at graphing the predicted probability of $y$ by $x3$ using `ggplot()`. Remember to use your predicted dataframe `preds` as the data in `ggplot`. Make the graph look professional, and include the confidence interval around the prediction using the `conf.low` and `conf.high` variables and `ggplot()`'s `geom_ribbon` function. Check out the [geom description here if needed](https://ggplot2.tidyverse.org/reference/geom_ribbon.html?q=geom_ribbon#null). 

Ok, `marginaleffects` does have a way to create predicted probabilities and graph them in one function (although you perhaps lose a bit of control of the graphing process).

In [None]:
p1 <- plot_predictions(mod_logit,
						condition = "x3", # becomes x axis
						newdata = datagrid(x1 = 0.5,
										   x2 = mean(x2),
										   x3 = 0:100))

# this graph works with ggplot and we can add other layers
p1 + theme_classic() + geom_rug(data = bin.out, aes(x3))

We could also plot predicted probabilities by multiple variables. In the example below, we graph the predicted probability of $y$ by both $x3$ and $x1$. If you recall, $x1$ is a binary predictor variable. 

In [None]:
# what if we want to show two vars?
p2 <- plot_predictions(mod_logit,
						condition = list("x3", "x1"), # x3 on the x axis, separate lines by x1
						newdata = datagrid(x1 = c(0, 1), # need to specify both values of x1
										   x2 = mean(x2), # fix x2
										   x3 = 0:100)) # integers for range of x3

p2 + theme_minimal() + geom_rug(data = bin.out, aes(x3))

# if you'd like to see the specific predicted values, run the same function call 
# using predictions() instead of plot_predictions()
preds <- predictions(mod_logit, newdata = datagrid(x1 = c(0, 1 ),
													x2 = mean(x2),
													x3 = round(min(x3),0):round(max(x3),0)))

### Marginal Effects

A **marginal effect** is the change in $Pr(y = 1)$ associated with a small change in $x$; in other words, it is the slope of the predicted probability curve at a given point. We typically plot marginal effects over the full range of a predictor variable for models in which the estimated effect changes, like in the logit model or in a model that uses interaction terms.
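
For a logit model with no interaction or polynomial terms, the marginal effect of a continuous predictor has a simple form. The notation here is an assumption made for illustration: $x_k$ is the predictor of interest, $\beta_k$ its coefficient, and $p$ the predicted probability at the chosen values of all predictors.

$$
\frac{\partial \Pr(y = 1)}{\partial x_k} = \beta_k \, p\,(1 - p), \qquad p = \frac{1}{1 + \exp(-x\beta)}
$$

The coefficient $\beta_k$ is constant, but $p(1 - p)$ is not, so the marginal effect depends on where along the curve we evaluate it.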

We can use the `slopes` function to generate marginal effects and store as a dataframe or use `plot_slopes` to graph directly. 

In [None]:
# marginal effects
p3 <- plot_slopes(mod_logit, 
						variables = "x3", # this is the variable you'll use to
										  # calculate marginal effects 
						condition = "x3", # the x axis on the graph
						newdata = datagrid(x1 = 0.5, # fix x1
										   x2 = mean(x2), # fix x2
										   x3 = 0:100)) # values of x3

p3

  
# now with multiple variables
p4 <- plot_slopes(mod_logit, 
							variables = "x3",
							condition = list("x3", "x1"), # plot marginal effects
														  # of x3 by values of x3
														  # and values of x1
							newdata = datagrid(x1 = c(0,1),
											   x2 = mean(x2),
											   x3 = 0:100))

p4

Why do you think the effect of $x3$ is going down as $x3$ increases when $x1 = 0$ but going up when $x1 = 1$? After all, the variables are not interacted in the model. Give your explanation in your Quarto report. 

### First-Differences

So-called **first-differences** present changes in the $Pr(y = 1)$ over standardized unit changes of $x$. Typically, this is a change from some number of standard deviations below the mean to the same number of standard deviations above the mean on $x$. Another common first difference is a minimum to maximum change. The first difference for one variable can then be compared to the first difference for another. First differences are also useful because we specify substantively meaningful changes in $x$ and see if the predicted change in $Pr(y = 1)$ differs significantly from 0. The code below offers examples of some first difference comparisons.

In [None]:
# first difference of 1 sd below the mean to 1 sd above the mean
# which results in a 2sd change total

# setting other xs to means or modes
avg_comparisons(mod_logit, 
						variables = list(x3 = "2sd"),
						newdata = datagrid())

# setting other xs to specific values
avg_comparisons(mod_logit, 
				variables = list(x3 = "2sd"),
				newdata = datagrid(x1 = 0.5))

# over all values of x1
avg_comparisons(mod_logit, 
					variables = list(x3 = "2sd"),
					by = "x1",
					newdata = datagrid(x1 = c(0,1)))

# test whether first differences of x3 are statistically 
# different at different values of x1 (you probably would
# only do this if you were interacting x1 and x3) 
avg_comparisons(mod_logit, 
					variables = list(x3 = "2sd"),
					by = "x1",
					hypothesis = "b1 - b2 = 0",
					newdata = datagrid(x1 = c(0,1)))
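
The arithmetic behind a first difference can also be sketched by hand with base R. This is an illustration under stated assumptions rather than a reproduction of what `avg_comparisons()` does internally: the data are re-simulated with an arbitrary seed, $x1$ is fixed at 0.5 and $x2$ at its mean, and the names `dat`, `mod`, `lo`, `hi`, and `fd` are illustrative. No standard error is computed here.

```r
# Re-simulate data with the activity's recipe; the seed is an arbitrary assumption
set.seed(2024)
n  <- 700
x1 <- rbinom(n, size = 1, prob = 0.5)
x2 <- round(rnorm(n, 6, 2), digits = 0)
x3 <- runif(n, 0, 100)
y  <- rbinom(n, size = 1, prob = plogis(-2 - 2.6 * x1 + 0.5 * x2 + 0.011 * x3))
dat <- data.frame(y, x1, x2, x3)

mod <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"), data = dat)

# Two profiles that differ only in x3: one sd below vs. one sd above its mean
lo <- data.frame(x1 = 0.5, x2 = mean(dat$x2), x3 = mean(dat$x3) - sd(dat$x3))
hi <- transform(lo, x3 = mean(dat$x3) + sd(dat$x3))

# First difference: change in Pr(y = 1) between the two profiles
fd <- unname(predict(mod, newdata = hi, type = "response") -
             predict(mod, newdata = lo, type = "response"))
fd
```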

# Wrap-up

Please let me know if you have any questions about the functions we covered in this activity, and submit your PDF document rendered from Quarto to Canvas.