## The method of moments

We saw in lecture that the method of moments is one strategy we can use to fit a statistical model to observed data. 

The strategy starts by assuming our data were generated by a random variable with a specific distribution.
Then, if our random variable has a single parameter, we compute the sample mean $\overline{x}$, set that sample mean equal to the theoretical expected value (which will include this single parameter), and solve for our parameter in terms of the sample mean. 

For example, suppose we collected data $\mathcal{D} = (0,1,0,1,1)$ and further assumed this data was generated by $X_{1}, X_{2}, X_{3}, X_{4}, X_{5} \sim \text{Bern}(\theta)$. 
Then the method of moments says that we can estimate our parameter $\theta$ by solving 

\begin{align}
    \mathbb{E}\left(\overline{X}\right) &= \overline{x} \\
    \theta &= \overline{x} \\
    \theta &= \frac{0+1+0+1+1}{5} \\
    \theta &= \frac{3}{5}.
\end{align}

Our MoM estimate for $\theta$ is $\hat \theta = 3/5$.
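
The same calculation can be sketched in R. This is a minimal illustration using the five data points from the example; the variable names `x` and `theta_hat` are chosen here for illustration only.

```r
# Observed data D = (0, 1, 0, 1, 1), assumed to be draws from Bern(theta)
x = c(0, 1, 0, 1, 1)

# For a Bernoulli random variable E(X) = theta, so the MoM estimate
# of theta is simply the sample mean
theta_hat = mean(x)
print(theta_hat)
```

Because the data are coded as 0s and 1s, the sample mean is the fraction of ones, which is why the estimate comes out to 3/5.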


### The data 
Let's use the MoM and our knowledge of random variables to model a dataset on medical expenditures. 
This data set was derived from a cross-sectional study conducted in 1996. 
A single observation is an individual who volunteered for the survey. 
The survey collected from every volunteer: whether they report feeling healthy or not, their age in years, their gender, whether they have currently active health insurance, whether they are married, whether they are self-employed, the number of members in their family, the region of the US they live in, their ethnicity, and their highest attained education.   

### A hypothesis 

Suppose we want to use this dataset to compare the probability that an individual has health insurance between those who report feeling healthy and those who report not feeling healthy. 

We can define a random variable $H \sim \text{Bern}(\theta_{\text{healthy}})$ that represents, among those who report feeling healthy, whether an individual has health insurance.
We can define a second random variable $U \sim \text{Bern}(\theta_{\text{unhealthy}})$ that represents, among those who report **not** feeling healthy, whether an individual has health insurance.

First, let's import the data. 

In [5]:
d = read.csv("HealthInsurance.csv")
head(d)

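| | X | health | age | limit | gender | insurance | married | selfemp | family | region | ethnicity | education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 1 | yes | 31 | no | male | yes | yes | yes | 4 | south | cauc | bachelor |
| 2 | 2 | yes | 31 | no | female | yes | yes | no | 4 | south | cauc | highschool |
| 3 | 3 | yes | 54 | no | male | yes | yes | no | 5 | west | cauc | ged |
| 4 | 4 | yes | 27 | no | male | yes | no | no | 5 | west | cauc | highschool |
| 5 | 5 | yes | 39 | no | male | yes | yes | no | 5 | west | cauc | none |
| 6 | 6 | yes | 32 | no | female | no | no | no | 3 | south | afam | bachelor |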


Now let's use the method of moments to decide how we will estimate $\theta_{\text{healthy}}$ and $\theta_{\text{unhealthy}}$. 

The MoM rests on the fact that, as the number of data points increases, the sample mean $\overline{h}$ becomes a better and better estimate of the expected value $\mathbb{E}(\overline{H})$, so we set the two equal and solve for the parameter.

\begin{align}
    \mathbb{E}(\overline{H}) &= \overline{h} \\
    \theta_{\text{healthy}} &= \overline{h} \\
    \theta_{\text{healthy}} &= \frac{1}{N} \sum_{i=1}^{N} h_{i}
\end{align}

Our estimate for $\theta_{\text{healthy}}$ is just the average of the data points $h_{i}$, where $h_{i}$ equals 1 when the individual reports having insurance and 0 when they report not having insurance. 
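
To see why the sample mean is a reasonable stand-in for the expected value, here is a small simulation sketch. The "true" parameter `theta_true = 0.8` and the sample sizes are hypothetical values picked for illustration; they are not taken from the dataset.

```r
set.seed(1)          # make the simulation reproducible
theta_true = 0.8     # hypothetical "true" parameter, chosen for illustration

# Simulate Bernoulli(theta_true) draws for increasing sample sizes
# and record the sample mean of each simulated dataset
sample_sizes = c(10, 100, 1000, 100000)
sample_means = sapply(sample_sizes, function(n) {
  mean(rbinom(n, size = 1, prob = theta_true))
})

print(data.frame(N = sample_sizes, sample_mean = sample_means))
```

The sample means for the larger values of `N` should sit closer to `theta_true`, which is the behaviour the MoM relies on.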


Let's subset our data ``d`` into two data frames: (i) a data frame where all individuals report feeling healthy and (ii) a data frame where all individuals report not feeling healthy. 

In [8]:
healthy = d[d$health=="yes",]
unhealthy = d[d$health!="yes",]

Now we can compute our MoM estimates for $\theta_{\text{healthy}}$ and $\theta_{\text{unhealthy}}$. 

In [13]:
theta_healthy_mom   = sum(healthy$insurance=="yes")  /nrow(healthy)
theta_unhealthy_mom = sum(unhealthy$insurance=="yes")/nrow(unhealthy)

print(theta_healthy_mom)
print(theta_unhealthy_mom)

[1] 0.8068029
[1] 0.7281399


We can model whether a healthy individual has health insurance with the random variable 

\begin{align}
    H \sim \text{Bern}(0.81)
\end{align}

and whether an unhealthy individual has health insurance with the random variable 

\begin{align}
    U \sim \text{Bern}(0.73)
\end{align}
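
One way to work with these fitted models in R is through the binomial functions with `size = 1`, since a Bernoulli random variable is a binomial with a single trial. The sketch below uses the rounded estimates 0.81 and 0.73; the names `pmf_H` and `pmf_U` are illustrative.

```r
# A Bernoulli(theta) is a Binomial(1, theta), so dbinom with size = 1 gives its pmf
pmf_H = dbinom(c(0, 1), size = 1, prob = 0.81)  # P(H = 0), P(H = 1)
pmf_U = dbinom(c(0, 1), size = 1, prob = 0.73)  # P(U = 0), P(U = 1)

print(pmf_H)
print(pmf_U)

# Difference in the estimated probability of having insurance
print(0.81 - 0.73)
```

Under these estimates, individuals who report feeling healthy are about 8 percentage points more likely to have insurance than those who do not.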

### Assignment

Suppose a data analyst collects health insurance data by age, not by individual. 

In [20]:
freq = xtabs(~ d$age + d$health)
compact = data.frame(freq)

head(compact)

| | d.age | d.health | Freq |
|---|---|---|---|
| | `<fct>` | `<fct>` | `<int>` |
| 1 | 18 | no | 5 |
| 2 | 19 | no | 4 |
| 3 | 20 | no | 5 |
| 4 | 21 | no | 6 |
| 5 | 22 | no | 10 |
| 6 | 23 | no | 8 |


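The ``compact`` dataset has three columns: **d.age**, which reports the age of a group of individuals; **d.health**, which reports either a "yes" or a "no"; and **Freq**, which counts the number of individuals with a specific age and "yes"/"no" status. Note that the code above tabulates the ``health`` column (self-reported health). To count by health insurance status, which is what the questions below ask about, tabulate ``d$insurance`` in place of ``d$health``. 

For example, the first row of ``compact`` is (18, "no", 5), which means that 5 individuals aged 18 fall in the "no" category. 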
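
Working with a frequency table takes a little care, so here is a hedged sketch on a made-up table with the same layout as ``compact``. The data frame `toy` and every count in it are invented for illustration and are not taken from the dataset.

```r
# Made-up frequency table with the same layout as `compact` (NOT real data)
toy = data.frame(
  d.age    = factor(c(18, 19, 40, 18, 19, 40)),
  d.health = factor(c("no", "no", "no", "yes", "yes", "yes")),
  Freq     = c(5, 4, 7, 20, 25, 30)
)

# d.age is a factor, so convert via character to recover the numeric ages
# (as.numeric() on the factor directly would return the level codes 1, 2, 3)
age = as.numeric(as.character(toy$d.age))

# Each row stands for Freq individuals, so group totals are sums of Freq
under30   = age < 30
n_under30 = sum(toy$Freq[under30])
n_yes     = sum(toy$Freq[under30 & toy$d.health == "yes"])

print(n_under30)
print(n_yes)
```

The key point is that each row represents `Freq` individuals, so totals come from summing `Freq` rather than counting rows.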

1. Let us define a random variable $H \sim \text{Binomial}(N,\theta)$ that models the number of individuals under the age of 30 who report having health insurance. (i) Use the MoM to estimate $\theta$ from the ``compact`` data and (ii) plot the pmf for this random variable. 

2. The Poisson random variable is often used as a shortcut to approximate random variables with a binomial distribution. Let's suppose $J \sim \text{Pois}(\lambda)$ models the number of individuals under the age of 30 who report having health insurance. (i) Use the MoM to estimate $\lambda$ from the ``compact`` data and (ii) plot the pmf for this random variable. 

3. How do the pmfs for 1. and 2. compare?
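
As a starting point for the plotting steps, here is a minimal sketch of how binomial and Poisson pmfs can be computed and drawn in R with `dbinom` and `dpois`. The values `N = 50` and `theta = 0.1` are hypothetical and are not estimates from the data; setting `lambda = N * theta` matches the Poisson mean to the binomial mean.

```r
# Hypothetical values chosen for illustration, NOT estimated from the data
N      = 50
theta  = 0.1
lambda = N * theta    # Poisson rate that matches the binomial mean

k = 0:N
binom_pmf = dbinom(k, size = N, prob = theta)
pois_pmf  = dpois(k, lambda = lambda)

# Vertical-line plot of the binomial pmf with the Poisson pmf overlaid as points
plot(k, binom_pmf, type = "h", xlab = "k", ylab = "P(k)")
points(k, pois_pmf, col = "red")

# Largest pointwise gap between the two pmfs
print(max(abs(binom_pmf - pois_pmf)))
```

Replacing the hypothetical `N`, `theta`, and `lambda` with your own MoM estimates gives the plots the questions ask for.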