# Module 2 Practice 3 Answers - Perform a Generalized Estimating Equations analysis

In this lab we will practice running a GEE.  In this case, the dependent variable is a count (a non-negative integer) rather than a nominal category.  Recall from the lab that count data are typically modeled with the Poisson distribution, so we will use the Poisson family, and we will assume an Exchangeable working correlation structure.
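
To make the two working correlation structures used in this practice concrete, here is a minimal sketch of what each one assumes about repeated observations on the same subject. The helper names, the choice of four observations per subject, and the correlation value `alpha = 0.4` are illustrative assumptions, not values taken from the data or from statsmodels.

```python
import numpy as np

def exchangeable_corr(n_obs, alpha):
    """Working correlation where every pair of observations on a subject shares the same correlation."""
    corr = np.full((n_obs, n_obs), alpha, dtype=float)
    np.fill_diagonal(corr, 1.0)  # each observation is perfectly correlated with itself
    return corr

def independence_corr(n_obs):
    """Working correlation where observations on a subject are treated as uncorrelated."""
    return np.eye(n_obs)

# Illustrative values only: 4 observations per subject, pairwise correlation 0.4
print(exchangeable_corr(4, 0.4))
print(independence_corr(4))
```

In a fitted GEE the Exchangeable correlation is estimated from the data rather than fixed in advance; the value above is only there to show the shape of the matrix.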

Documentation for the data is [here](../resources/epil.html).

We will model the variable `y`, the count of seizures occurring during each two-week period.  We will use the following as the independent variables:

1. age
1. trt
1. base

The variable `subject` identifies the groups.
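
Written out, the model described above can be sketched as follows, assuming the log link that is the usual default for a Poisson family. The subscripts $i$ (subject) and $j$ (period) and the coefficient names $\beta_0, \dots, \beta_3$ are notation introduced here for illustration. The covariates carry only a subject index on the assumption that they do not change across periods, and $\mathrm{trt}_i$ stands for an indicator of the non-reference treatment level.

$$
\log \mathrm{E}[y_{ij}] = \beta_0 + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{trt}_i + \beta_3\,\mathrm{base}_i
$$

Because the model is linear on the log scale, each coefficient describes a multiplicative change in the expected seizure count.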

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
!{sys.executable} -m pip install --upgrade "statsmodels>=0.11"

from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.cov_struct import Exchangeable, Independence, Autoregressive
from statsmodels.genmod.families import Poisson

In [None]:
data = pd.read_csv('../resources/epil.csv', index_col=0)
display(data.head())

## Data visualization
Create a scatter plot with the response variable `y` on the y axis and `period` on the x axis, colored by the variable `trt`.

In [None]:
# Converting the strings to a categorical gives us an easy way to reference the values by text or by number
data['trt'] = pd.Categorical(data['trt'])

plt.scatter(data['period'], data['y'], c=data['trt'].cat.codes, alpha=0.25, cmap='Spectral')
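
The color argument in the scatter plot relies on the integer codes behind a categorical column. As a small sketch of how text labels and codes relate, the example below uses a toy series; the labels `'placebo'` and `'progabide'` are assumed values for illustration, so check the data documentation for the actual levels of `trt`.

```python
import pandas as pd

# Hypothetical treatment labels standing in for the `trt` column
trt = pd.Series(pd.Categorical(['placebo', 'progabide', 'placebo', 'progabide']))

labels = list(trt.cat.categories)                    # text labels, in sorted order
codes = trt.cat.codes.tolist()                       # integer code for each row
code_to_label = dict(enumerate(trt.cat.categories))  # lookup from code back to label

print(labels)
print(codes)
print(code_to_label)
```

A mapping like `code_to_label` is one way to work out which color in the plot corresponds to which treatment group.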

Create a scatter plot of `y` versus `base`.

In [None]:
plt.scatter(data['base'], data['y'], alpha=0.75)

## Perform a GEE model
See the introductory text for this practice exercise to help you decide which parameters you need.
Print the results summary, and the QIC for the model.


In [None]:
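# Poisson family for count data; observations are grouped by subject
model = GEE.from_formula('y ~ age + trt + base', 'subject', data, cov_struct=Exchangeable(), family=Poisson())
result = model.fit()
print(result.summary())
# qic() returns a pair of values: (QIC, QICu)
print(result.qic(result.scale))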

## Perform a GEE using a different correlation structure
Run the GEE model again using an Independence correlation structure. Print the results summary, and the QIC for the model.


In [None]:
model = GEE.from_formula('y ~ age + trt + base', 'subject', data, cov_struct=Independence(), family=Poisson())
result = model.fit()
print(result.summary())
print(result.qic(result.scale))
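
Since `qic` returns a pair of values, it can help to unpack them and line the two correlation structures up side by side. The sketch below does this on synthetic stand-in data, not the epil data: the data frame `toy`, the helper `fit_and_score`, and all simulation settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.cov_struct import Exchangeable, Independence
from statsmodels.genmod.families import Poisson

# Synthetic stand-in data (NOT the epil data): 40 subjects, 4 periods each
rng = np.random.default_rng(0)
n_subjects, n_periods = 40, 4
subject = np.repeat(np.arange(n_subjects), n_periods)
base = np.repeat(rng.poisson(20, n_subjects), n_periods)
# A shared per-subject effect makes counts within a subject correlated
subject_effect = np.repeat(rng.normal(0.0, 0.3, n_subjects), n_periods)
y = rng.poisson(np.exp(0.5 + 0.03 * base + subject_effect))
toy = pd.DataFrame({'subject': subject, 'base': base, 'y': y})

def fit_and_score(cov_struct):
    """Fit a Poisson GEE on the toy data and return its (QIC, QICu) pair."""
    model = GEE.from_formula('y ~ base', 'subject', toy,
                             cov_struct=cov_struct, family=Poisson())
    result = model.fit()
    qic, qicu = result.qic(result.scale)
    return qic, qicu

scores = {
    'Exchangeable': fit_and_score(Exchangeable()),
    'Independence': fit_and_score(Independence()),
}
for name, (qic, qicu) in scores.items():
    print(f'{name}: QIC={qic:.2f}, QICu={qicu:.2f}')
```

Lower QIC is preferred when comparing correlation structures for the same mean model, while QICu is intended for comparing mean structures. When the two QIC values are close, the difference should not be over-interpreted.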

## Interpretation
Interpret the results.  

Which model is better (Exchangeable or Independence)?

In the better model, which features are significant at $\alpha$ = 0.05?

Of the significant features, are their effects positive or negative?

Describe the effects in natural language.

The two models are effectively equivalent; neither correlation structure is clearly better.

The significant features are:
  * age
  * base
  
The effects of age and base are both positive.  For each one-unit increase in age or base, holding the other variables constant, the predicted count of seizures increases.

Subjects with a higher number of seizures at baseline tend to have higher seizure counts during the study.  Older subjects also tend to have higher seizure counts.
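
To put a size on these effects, coefficients from a Poisson model with a log link can be exponentiated to give rate ratios, that is, the multiplicative change in the expected count per unit increase in a predictor. The sketch below assumes the log link and uses made-up coefficient values; the helper `rate_ratio` and the numbers in `example_coefs` are hypothetical and are not output from the models above.

```python
import math

def rate_ratio(coef, delta=1.0):
    """Multiplicative change in the expected count when a predictor rises by `delta` units."""
    return math.exp(coef * delta)

# Hypothetical coefficients, for illustration only (not results from the fitted models)
example_coefs = {'age': 0.02, 'base': 0.03}

for name, coef in example_coefs.items():
    percent_change = (rate_ratio(coef) - 1.0) * 100.0
    print(f'{name}: rate ratio {rate_ratio(coef):.3f} ({percent_change:+.1f}% per unit)')
```

With a fitted model, the same idea applies to the estimated coefficients, for example via `np.exp(result.params)`.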
