# Model Selection and Information Criteria

Refer to [the slides and other info](https://harvard-iacs.github.io/2018-CS109A/a-sections/a-section-2/). This is a conceptual lesson and does not involve coding. Still, I think it's wise to work through the concepts from the slides and paper here to be ready for the next section.

Let's begin.

# Maximum Likelihood Estimation

First things first: **likelihood and probability are related, but not synonymous**, in statistics.

**Likelihood** refers to how well a sample provides support for particular values of a parameter in a model.

**Probability** refers to the chance that a particular outcome occurs based on the values of parameters in a model.

**The critical distinction:**

* Probability: P(X=x|θ) treated as a function of x (θ is fixed)
* Likelihood: P(X=x|θ) treated as a function of θ (x is fixed/observed)
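
To make the distinction concrete, here is a minimal sketch using a binomial model. The model, the sample size `n = 10`, the observed count `x_observed = 7`, and the grid of θ values are all assumptions chosen for illustration, not something taken from the slides. The same function is evaluated in both views; only which argument is held fixed changes.

```python
from math import comb


def binomial_pmf(x, n, theta):
    """P(X = x | theta) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


n = 10  # number of trials (illustrative assumption)

# Probability view: theta is fixed and x varies.
# Summed over every possible outcome x, the values add up to 1.
theta_fixed = 0.5
probabilities = [binomial_pmf(x, n, theta_fixed) for x in range(n + 1)]
print("sum over x:", sum(probabilities))

# Likelihood view: x is fixed (observed) and theta varies.
# These values need not sum to 1 over theta; we only compare them.
x_observed = 7
thetas = [i / 10 for i in range(11)]
likelihoods = [binomial_pmf(x_observed, n, t) for t in thetas]
best_theta = thetas[likelihoods.index(max(likelihoods))]
print("theta with the highest likelihood on the grid:", best_theta)
```

On this grid the likelihood peaks at θ = 0.7, which matches the observed proportion 7/10.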

## Statement of the Problem

[PSU Resource](https://online.stat.psu.edu/stat415/lesson/1/1.2)

Suppose we have a random sample <math xmlns="http://www.w3.org/1998/Math/MathML">
  <msub>
    <mi>X</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mn>2</mn>
  </msub>
  <mo>,</mo>
  <mo>&#x22EF;</mo>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mi>n</mi>
  </msub>
</math> whose assumed probability distribution depends on some unknown parameter <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>&#x3B8;</mi>
</math>. Our primary goal here will be to find a point estimator <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>u</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>X</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mn>2</mn>
  </msub>
  <mo>,</mo>
  <mo>&#x22EF;</mo>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mi>n</mi>
  </msub>
  <mo stretchy="false">)</mo>
</math>, such that <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>u</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>x</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <msub>
    <mi>x</mi>
    <mn>2</mn>
  </msub>
  <mo>,</mo>
  <mo>&#x22EF;</mo>
  <mo>,</mo>
  <msub>
    <mi>x</mi>
    <mi>n</mi>
  </msub>
  <mo stretchy="false">)</mo>
</math> is a "good" point estimate of <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>&#x3B8;</mi>
</math>, where <math xmlns="http://www.w3.org/1998/Math/MathML">
  <msub>
    <mi>x</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <msub>
    <mi>x</mi>
    <mn>2</mn>
  </msub>
  <mo>,</mo>
  <mo>&#x22EF;</mo>
  <mo>,</mo>
  <msub>
    <mi>x</mi>
    <mi>n</mi>
  </msub>
</math> are the observed values of the random sample.

**Some notes from [this video](https://www.youtube.com/watch?v=k5sbE1_MDwU):** When a random variable is written as a capital letter (X), we are conveying that this object can take on different values: we haven't observed it yet, so it's still random. When we see a lowercase letter (x), we've actually observed a value of the random variable, and it is therefore no longer random.

Every random variable (X) has an associated function with
1. An input: a value (x) that the random variable could take, and
1. An output: how **likely** we are to observe that value

This function takes the form of either the probability density function (used for continuous random variables) or the probability mass function (used for discrete random variables). Both describe the random variable's **probability distribution**, which tells us about the structure in the randomness.

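A probability density function (the function we're focused on for this unit's purposes) can be evaluated at a **point value**, f(x), or used to get the probability of an **interval of values**, P(x1 <= X <= x2), by integrating the density over that interval. Note that for a continuous random variable, f(x) is a density rather than a probability: the probability of observing any single exact value is zero, and only intervals carry probability.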
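
To see the difference between a point value and an interval, here is a small sketch for a normal distribution. The choice of the normal distribution and the helper names `normal_pdf` and `normal_cdf` are assumptions for illustration. The height of the density at a point is not itself a probability (it can even exceed 1), while the probability of an interval is the area under the density, computed here as a difference of CDF values.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the Normal(mu, sigma) density at the point x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


# Point value: a density height, roughly 0.399 for the standard normal at 0.
density_at_zero = normal_pdf(0.0)

# With a small standard deviation the height exceeds 1,
# so it cannot be a probability.
narrow_density = normal_pdf(0.0, sigma=0.1)

# Interval: area under the density between -1 and 1, roughly 0.683.
prob_interval = normal_cdf(1.0) - normal_cdf(-1.0)

print(density_at_zero, narrow_density, prob_interval)
```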

The main idea here is that even though it's impossible to predict individual values of a random variable ahead of time, **the probability distribution tells us that, over many observations, the *frequency* with which each value appears will be predictable**. This is what we mean by *structure* - the predictability of how often each value appears despite the randomness!
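
A quick simulation illustrates this. It is a sketch that assumes a fair coin (probability of heads 0.5) and uses a seeded generator so that runs are repeatable; the function name `heads_frequency` is hypothetical. Individual flips are unpredictable, but the fraction of heads settles near 0.5 as the number of flips grows.

```python
import random

rng = random.Random(0)  # seeded so the simulation is repeatable


def heads_frequency(n_flips, p_heads=0.5):
    """Simulate n_flips coin flips and return the fraction that were heads."""
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips


# The frequency drifts toward p_heads as the number of flips increases.
for n_flips in (10, 100, 10_000, 100_000):
    print(n_flips, heads_frequency(n_flips))
```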

To wrap these ideas up: **probability distributions tell us which values we are likely or unlikely to observe**.

Now back to the [PSU resource](https://online.stat.psu.edu/stat415/lesson/1/1.2):

## The Basic Idea

It seems reasonable that a good estimate of the unknown parameter <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>&#x3B8;</mi>
</math> would be the value of <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>&#x3B8;</mi>
</math> that maximizes the **likelihood** of getting the data we observed. Suppose we have a random sample <math xmlns="http://www.w3.org/1998/Math/MathML">
  <msub>
    <mi>X</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mn>2</mn>
  </msub>
  <mo>,</mo>
  <mo>&#x22EF;</mo>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mi>n</mi>
  </msub>
</math> for which the PDF of each <math xmlns="http://www.w3.org/1998/Math/MathML">
  <msub>
    <mi>X</mi>
    <mi>i</mi>
  </msub>
</math> is <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>f</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>x</mi>
    <mi>i</mi>
  </msub>
  <mo>;</mo>
  <mi>&#x3B8;</mi>
  <mo stretchy="false">)</mo>
</math>. Then, the joint probability mass (or density) function of <math xmlns="http://www.w3.org/1998/Math/MathML">
  <msub>
    <mi>X</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mn>2</mn>
  </msub>
  <mo>,</mo>
  <mo>&#x22EF;</mo>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mi>n</mi>
  </msub>
</math>, which is called <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>L</mi>
  <mo stretchy="false">(</mo>
  <mi>&#x3B8;</mi>
  <mo stretchy="false">)</mo>
</math> is:

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>L</mi>
  <mo stretchy="false">(</mo>
  <mi>&#x3B8;</mi>
  <mo stretchy="false">)</mo>
  <mo>=</mo>
  <mi>P</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>X</mi>
    <mn>1</mn>
  </msub>
  <mo>=</mo>
  <msub>
    <mi>x</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mn>2</mn>
  </msub>
  <mo>=</mo>
  <msub>
    <mi>x</mi>
    <mn>2</mn>
  </msub>
  <mo>,</mo>
  <mo>&#x2026;</mo>
  <mo>,</mo>
  <msub>
    <mi>X</mi>
    <mi>n</mi>
  </msub>
  <mo>=</mo>
  <msub>
    <mi>x</mi>
    <mi>n</mi>
  </msub>
  <mo stretchy="false">)</mo>
  <mo>=</mo>
  <mi>f</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>x</mi>
    <mn>1</mn>
  </msub>
  <mo>;</mo>
  <mi>&#x3B8;</mi>
  <mo stretchy="false">)</mo>
  <mo>&#x22C5;</mo>
  <mi>f</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>x</mi>
    <mn>2</mn>
  </msub>
  <mo>;</mo>
  <mi>&#x3B8;</mi>
  <mo stretchy="false">)</mo>
  <mo>&#x22EF;</mo>
  <mi>f</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>x</mi>
    <mi>n</mi>
  </msub>
  <mo>;</mo>
  <mi>&#x3B8;</mi>
  <mo stretchy="false">)</mo>
  <mo>=</mo>
  <munderover>
    <mo data-mjx-texclass="OP" movablelimits="false">&#x220F;</mo>
    <mrow data-mjx-texclass="ORD">
      <mi>i</mi>
      <mo>=</mo>
      <mn>1</mn>
    </mrow>
    <mi>n</mi>
  </munderover>
  <mi>f</mi>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>x</mi>
    <mi>i</mi>
  </msub>
  <mo>;</mo>
  <mi>&#x3B8;</mi>
  <mo stretchy="false">)</mo>
</math>

The first equality is just the definition of the joint probability mass function. The second equality comes from the fact that we have a random sample, which implies by definition that the Xi are **independent**, so the joint function factors into the product of the individual ones.

In light of the basic idea of maximum likelihood estimation, one reasonable way to proceed is to treat the likelihood function L(θ) as a function of θ, and find the value of θ that maximizes it.
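
As a rough sketch of that procedure, the code below builds L(θ) as a product of individual probabilities and searches a grid of θ values for the maximum. The Bernoulli model, the ten-observation sample, and the grid resolution are assumptions made for illustration. A grid search is used only because it is easy to read; it is not the method from the PSU resource, which works with calculus.

```python
from math import prod


def bernoulli_pmf(x, theta):
    """f(x; theta) for a single Bernoulli observation x in {0, 1}."""
    return theta if x == 1 else 1 - theta


def likelihood(theta, sample):
    """L(theta): product of f(x_i; theta) over the independent observations."""
    return prod(bernoulli_pmf(x, theta) for x in sample)


# Observed sample (illustrative): 7 ones out of 10 observations.
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Treat L as a function of theta and pick the grid value that maximizes it.
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: likelihood(t, sample))

print("grid maximizer:", theta_hat)
print("sample proportion:", sum(sample) / len(sample))
```

For this sample the grid maximizer coincides with the sample proportion of ones, which is the known closed-form maximum likelihood estimate for a Bernoulli parameter.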

## Python Example of MLE

[Quant Econ resource](https://python.quantecon.org/mle.html)
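
Before working through that resource, here is a small standalone sketch of numerical MLE. It is not taken from the QuantEcon lecture. It assumes count data modeled as independent Poisson draws with rate λ, and the data, the function names, and the starting value are all illustrative. It maximizes the log-likelihood with Newton's method, using the first and second derivatives of the log-likelihood with respect to λ.

```python
from math import lgamma, log


def poisson_log_likelihood(lam, counts):
    """log L(lam) = sum of x*log(lam) - lam - log(x!) over the sample."""
    return sum(x * log(lam) - lam - lgamma(x + 1) for x in counts)


def fit_poisson_newton(counts, lam=1.0, tol=1e-10, max_iter=100):
    """Maximize the Poisson log-likelihood with Newton's method.

    Assumes sum(counts) > 0 and a starting value between 0 and
    twice the sample mean; outside that range the iteration can fail.
    """
    n, total = len(counts), sum(counts)
    for _ in range(max_iter):
        score = total / lam - n       # first derivative of log L
        hessian = -total / lam**2     # second derivative of log L
        step = score / hessian
        lam -= step                   # Newton update
        if abs(step) < tol:
            break
    return lam


counts = [2, 3, 4, 1, 5, 3, 2, 4]  # illustrative data with sample mean 3
lam_hat = fit_poisson_newton(counts)

print("Newton estimate:", lam_hat)
print("sample mean:", sum(counts) / len(counts))
```

The estimate agrees with the sample mean, which is the closed-form MLE for a Poisson rate. Working on the log scale turns the product into a sum, which is easier to differentiate and numerically more stable.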

