# Analysis and Visualization of Complex Agro-Environmental Data
---
## More on regression: GLM and mixed models

### 1. Run Linear Mixed Models

Linear mixed-effects models are used in regression analyses for study designs in which multiple observations are made on each subject, so that observations may not be independent. As an example we will use the Iris dataset, used before as an example for simple regression analysis. In that example we found that the overall effect of `petal_width` on `sepal_width` was negative, but that when each species was considered separately the relationship was inverted.
Linear mixed models are implemented in the `mixedlm()` function of the `statsmodels` package.

In [None]:
# import the packages we are going to be using
import numpy as np # for numerical arrays
import pandas as pd # to handle data frames
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
from scipy import stats # to compute statistics
import statsmodels.api as sm
import statsmodels.formula.api as smf

# import data ('iris' dataset)
data = sns.load_dataset('iris')
print(data)

In [None]:
# Plot by group
sns.lmplot(x='petal_width', y='sepal_width', hue="species", data=data)

In [None]:
# Run simple linear regression

model = smf.ols(formula='sepal_width ~ petal_width', data=data)
modelf = model.fit()
print(modelf.summary())

In [None]:
# Run linear mixed model
# We use the default random-effects structure (a random intercept for each group, here each species).

mmodel = smf.mixedlm(formula='sepal_width ~ petal_width', data=data, groups=data["species"])
mmodelf = mmodel.fit()
print(mmodelf.summary())

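Now compare the two results above, especially the regression coefficient of `petal_width`: the simple linear regression pools all species, whereas the mixed model estimates the slope after allowing each species its own intercept.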
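
The comparison can also be made side by side in a small table. The sketch below is self-contained and uses synthetic data, not the iris measurements: the names `toy`, `group`, `x` and `y` are assumptions made for illustration. The groups are built so that the pooled slope is negative while the within-group slope is positive, the same kind of reversal described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration, not the iris measurements):
# three groups whose intercepts fall as their mean x rises, while the
# within-group slope is +1.
rng = np.random.default_rng(42)
frames = []
for name, x_mean, intercept in [("a", 0.0, 6.0), ("b", 1.0, 3.0), ("c", 2.0, 0.0)]:
    x = rng.normal(x_mean, 0.5, size=50)
    y = intercept + 1.0 * x + rng.normal(0.0, 0.3, size=50)
    frames.append(pd.DataFrame({"group": name, "x": x, "y": y}))
toy = pd.concat(frames, ignore_index=True)

# Pooled regression: ignores the grouping
ols_fit = smf.ols("y ~ x", data=toy).fit()

# Mixed model: random intercept for each group (the mixedlm default)
mixed_fit = smf.mixedlm("y ~ x", data=toy, groups=toy["group"]).fit()

# Fixed-effect coefficients side by side
terms = ["Intercept", "x"]
comparison = pd.DataFrame({"ols": ols_fit.params[terms],
                           "mixedlm": mixed_fit.params[terms]})
print(comparison)
```

In the notebook itself, the same table could be built from `modelf.params` and `mmodelf.params`, using the terms `Intercept` and `petal_width`.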

### 2. Run logistic regression

Logistic regression is a Generalized Linear Model of the binomial family. It is suitable either for binary data (yes/no) or for proportions of a given event. It uses the logit link function to relate the expected response to the linear predictor. To obtain predictions on the scale of the response (values between 0 and 1), the inverse of the link function is applied to the linear predictor. The predicted values are the probabilities that the event occurs. The parameters are usually estimated by maximum likelihood.
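
As a minimal sketch of the link function described above, the logit and its inverse can be written directly with NumPy. The helper names `logit` and `inv_logit` and the example values of the linear predictor are assumptions chosen for illustration.

```python
import numpy as np

def logit(p):
    """Link function: map a probability in (0, 1) to the real line."""
    return np.log(p / (1 - p))

def inv_logit(eta):
    """Inverse link: map a linear predictor back to a probability."""
    return 1 / (1 + np.exp(-eta))

# Hypothetical values of the linear predictor
eta = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
prob = inv_logit(eta)

print(prob)         # probabilities, all strictly between 0 and 1
print(logit(prob))  # applying the link recovers the linear predictor
```

A linear predictor of 0 corresponds to a probability of 0.5, and large negative or positive values approach 0 and 1 without reaching them.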

In the next example we will use the `EFIplus_medit` dataset (read below from `EFIplus_medit.zip`) to test whether the presence of the brown trout (`Salmo trutta fario`) is significantly influenced by the annual mean temperature (`temp_ann`).

In [6]:
# Import data table (change path to the file accordingly)
df = pd.read_csv('EFIplus_medit.zip', compression='zip', sep=";")

In [None]:
list(df)

In [None]:
# Run logistic regression
# To use a formula with statsmodels.formula.api (smf), the species column needs a name without spaces,
# so copy it to a new column with underscores
df['Salmo_trutta_fario'] = df['Salmo trutta fario']
# Binary response (presence/absence) modelled as a function of annual mean temperature
log_reg = smf.logit("Salmo_trutta_fario ~ temp_ann", data=df).fit()
log_reg.summary()
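
Once a logistic model is fitted, predicted probabilities can be obtained with `predict()`, which applies the inverse logit to the linear predictor. The sketch below illustrates this on synthetic presence/absence data, not on the EFIplus table: the names `toy`, `presence` and `temp`, and the coefficients used to simulate the data, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic presence/absence data (an assumption for illustration):
# the probability of presence decreases as temperature increases.
rng = np.random.default_rng(0)
temp = rng.uniform(5, 20, size=300)
true_prob = 1 / (1 + np.exp(-(6.0 - 0.5 * temp)))
toy = pd.DataFrame({"temp": temp, "presence": rng.binomial(1, true_prob)})

# Fit the logistic regression (disp=0 silences the optimizer output)
fit = smf.logit("presence ~ temp", data=toy).fit(disp=0)

# Predicted probabilities on a small temperature grid
grid = pd.DataFrame({"temp": np.linspace(5, 20, 4)})
grid["prob"] = fit.predict(grid)

# Same result by hand: inverse logit of the linear predictor
eta = fit.params["Intercept"] + fit.params["temp"] * grid["temp"]
grid["prob_manual"] = 1 / (1 + np.exp(-eta))

print(grid)
```

With the notebook's model, the equivalent call would be `log_reg.predict()` on a data frame containing a `temp_ann` column.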

## References

statsmodels 0.13.5: user guide. Generalized Linear Models. https://www.statsmodels.org/dev/glm.html

statsmodels 0.13.5: user guide. Linear Mixed Effects Models. https://www.statsmodels.org/stable/mixed_linear.html

statsmodels 0.13.5: examples. Linear Mixed Effects Models. https://www.statsmodels.org/stable/examples/notebooks/generated/mixed_lm_example.html