# Generalized Linear Models in Python
# Poisson Regression

<img src="https://raw.githubusercontent.com/fhernanb/fhernanb.github.io/master/docs/logo_unal_color.png" alt="drawing" width="200"/>

This notebook shows several examples of how to use Python to fit a generalized linear model.

The explanations given here are based on a YouTube video: https://www.youtube.com/watch?v=__oC5IRCFKI

The required libraries are the following:

In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import glm

Other libraries used in the examples:

In [2]:
import pandas as pd

## Data

In this activity we use the crab data presented in Chapter 1 of Agresti (2015). The response is the number $Y$ of male crabs attached to the shell of each female crab. An illustrative figure is shown below.

<img src="cangreja.jpg" alt="drawing" width="300"/>

The first step is to read the dataset.

In [3]:
file = 'http://users.stat.ufl.edu/~aa/glm/data/Crabs.dat'
datos = pd.read_csv(file, sep=r'\s+', header=0)
datos.head()

Unnamed: 0,crab,y,weight,width,color,spine
0,1,8,3.05,28.3,2,3
1,2,0,1.55,22.5,3,3
2,3,9,2.3,26.0,1,1
3,4,0,2.1,24.8,3,3
4,5,4,2.6,26.0,3,3


To check the size of the dataset:

In [4]:
datos.shape

(173, 6)

Next we convert the qualitative variables that are coded as numbers into true categorical variables using pandas.

In [5]:
# Convert color (the first category listed, 'medium light', becomes the reference level)
scale_mapper = {1:'medium light', 2:'medium', 3:'medium dark', 4:'dark'}
datos['color'] = datos['color'].replace(scale_mapper)
datos['color'] = pd.Categorical(datos['color'], categories=['medium light', 'medium', 'medium dark', 'dark'])

# Convert spine (no categories are given, so pandas sorts the levels alphabetically)
scale_mapper = {1:'both good', 2:'one worn or broken', 3:'both worn or broken'}
datos['spine'] = datos['spine'].replace(scale_mapper)
datos['spine'] = pd.Categorical(datos['spine'], ordered=True)

datos.head()

Unnamed: 0,crab,y,weight,width,color,spine
0,1,8,3.05,28.3,medium,both worn or broken
1,2,0,1.55,22.5,medium dark,both worn or broken
2,3,9,2.3,26.0,medium light,both good
3,4,0,2.1,24.8,medium dark,both worn or broken
4,5,4,2.6,26.0,medium dark,both worn or broken


## Example 1

The goal of this example is to fit the following model:

\begin{align}
Y_i &\sim Poisson(\mu_i), \\ 
\log(\mu_i) &= \beta_0 + \beta_1 Weight_i
\end{align}

To fit the model:

In [6]:
mod1 = smf.glm(formula='y ~ weight', data=datos, 
               family=sm.families.Poisson(link=sm.families.links.log()))
mod1 = mod1.fit()
mod1.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,173.0
Model:,GLM,Df Residuals:,171.0
Model Family:,Poisson,Df Model:,1.0
Link Function:,log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-458.08
Date:,"Mon, 02 May 2022",Deviance:,560.87
Time:,15:02:09,Pearson chi2:,536.0
No. Iterations:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.4284,0.179,-2.394,0.017,-0.779,-0.078
weight,0.5893,0.065,9.064,0.000,0.462,0.717


Using the results in the table above, we can write the fitted model as

\begin{align}
Y_i &\sim Poisson(\hat{\mu}_i), \\ 
\log(\hat{\mu}_i) &= -0.4284 + 0.5893 Weight_i
\end{align}
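
One way to read these estimates: because the link is the logarithm, exponentiating the linear predictor gives the estimated mean, and $e^{\hat{\beta}_1}$ is the multiplicative change in that mean for each additional kg of weight. The sketch below uses only the rounded estimates from the summary table; the helper `mu_hat` is an illustrative name and not part of statsmodels.

```python
import math

# Rounded estimates copied from the summary table of mod1
b0, b1 = -0.4284, 0.5893

def mu_hat(weight):
    """Estimated mean count for a given weight under the log link."""
    return math.exp(b0 + b1 * weight)

# Multiplicative effect on the mean of one extra unit of weight
rate_ratio = math.exp(b1)
print(round(rate_ratio, 2))

# The ratio of two means one unit apart equals exp(b1), whatever the starting weight
print(round(mu_hat(3.0) / mu_hat(2.0), 2))
```

With these estimates the ratio is about 1.80, so each additional kg of weight is associated with an estimated mean roughly 80% higher.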

## Example 2

The goal now is to fit the following model:

\begin{align}
Y_i &\sim Poisson(\mu_i), \\ 
\log(\mu_i) &= \beta_0 + \beta_1 Weight_i + \beta_2 colorMedium_i + \beta_3 colorMediumDark_i + \beta_4 colorDark_i
\end{align}



In [8]:
mod2 = smf.glm(formula='y ~ weight + color', data=datos, 
               family=sm.families.Poisson(link=sm.families.links.log()))
mod2 = mod2.fit()
mod2.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,173.0
Model:,GLM,Df Residuals:,168.0
Model Family:,Poisson,Df Model:,4.0
Link Function:,log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-453.55
Date:,"Mon, 02 May 2022",Deviance:,551.8
Time:,15:02:09,Pearson chi2:,535.0
No. Iterations:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.0498,0.233,-0.214,0.831,-0.507,0.407
color[T.medium],-0.2051,0.154,-1.334,0.182,-0.506,0.096
color[T.medium dark],-0.4498,0.176,-2.560,0.010,-0.794,-0.105
color[T.dark],-0.4520,0.208,-2.169,0.030,-0.861,-0.044
weight,0.5462,0.068,8.019,0.000,0.413,0.680


Using the results in the table above, we can write the fitted model as

\begin{align}
Y_i &\sim Poisson(\hat{\mu}_i), \\ 
\log(\hat{\mu}_i) &= -0.0498 + 0.5462 Weight_i - 0.2051 colorMedium_i - 0.4498 colorMediumDark_i - 0.4520 colorDark_i
\end{align}
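
The color terms in this equation are indicator variables: each one is 1 when the crab has that color and 0 otherwise, and medium light acts as the reference level, which is why it has no coefficient of its own in the table. The formula interface builds these columns automatically; the plain-Python sketch below only illustrates the encoding. The names `COLOR_DUMMIES` and `design_row` are hypothetical and not part of statsmodels.

```python
# Non-reference color levels, in the order used in the model equation
COLOR_DUMMIES = ['medium', 'medium dark', 'dark']

def design_row(weight, color):
    """Return [1, weight, colorMedium, colorMediumDark, colorDark] for one crab."""
    indicators = [1 if color == level else 0 for level in COLOR_DUMMIES]
    return [1, weight] + indicators

# A medium dark crab switches on only the second indicator
print(design_row(2.6, 'medium dark'))

# The reference level 'medium light' leaves all indicators at zero
print(design_row(2.3, 'medium light'))
```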

## Making predictions

What is the estimated mean number $\hat{\mu}$ of satellites (attached males) for three females with the following characteristics?

- A weight of 1.75 kg and a dark shell.
- A weight of 2.15 kg and a medium shell.
- A weight of 1.46 kg and a medium dark shell.

In [9]:
df = pd.DataFrame({'weight' : [1.75, 2.15, 1.46],
                   'color'  : ['dark', 'medium', 'medium dark']})

print(df)

yhat = mod2.predict(df)
print(yhat)

   weight        color
0    1.75         dark
1    2.15       medium
2    1.46  medium dark
0    1.574573
1    2.507769
2    1.346954
dtype: float64
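
As a cross-check, the same predictions can be reproduced by hand from the fitted equation of the second model: add the intercept, the weight term, and the color effect, then exponentiate. This is a minimal sketch using the rounded coefficients from the summary table, so small differences in the last decimals are expected; `predict_mean` is an illustrative helper, not a statsmodels function.

```python
import math

# Rounded estimates copied from the summary table of mod2
intercept = -0.0498
b_weight = 0.5462
# Effect of each color relative to the reference level 'medium light'
color_effect = {'medium light': 0.0, 'medium': -0.2051,
                'medium dark': -0.4498, 'dark': -0.4520}

def predict_mean(weight, color):
    """Estimated mean count: exp of the linear predictor."""
    eta = intercept + b_weight * weight + color_effect[color]
    return math.exp(eta)

females = [(1.75, 'dark'), (2.15, 'medium'), (1.46, 'medium dark')]
for weight, color in females:
    print(weight, color, round(predict_mean(weight, color), 2))
```

The three values agree with the output of `mod2.predict(df)` to two decimals (about 1.57, 2.51 and 1.35).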


## Example 3

The goal now is to add the variable spine to the linear predictor $\eta$ of the previous model.

In [10]:
mod3 = smf.glm(formula='y ~ weight + color + spine', data=datos, 
               family=sm.families.Poisson(link=sm.families.links.log()))
mod3 = mod3.fit()
mod3.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,173.0
Model:,GLM,Df Residuals:,166.0
Model Family:,Poisson,Df Model:,6.0
Link Function:,log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-452.5
Date:,"Mon, 02 May 2022",Deviance:,549.7
Time:,15:02:09,Pearson chi2:,533.0
No. Iterations:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.0426,0.254,-0.168,0.866,-0.540,0.454
color[T.medium],-0.2677,0.168,-1.595,0.111,-0.597,0.061
color[T.medium dark],-0.5209,0.194,-2.683,0.007,-0.901,-0.140
color[T.dark],-0.5397,0.225,-2.396,0.017,-0.981,-0.098
spine[T.both worn or broken],0.0909,0.119,0.760,0.447,-0.143,0.325
spine[T.one worn or broken],-0.1607,0.211,-0.760,0.447,-0.575,0.254
weight,0.5476,0.073,7.482,0.000,0.404,0.691


Challenge: do you think the variable spine improved the model or not?
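
One possible way to approach the question about spine is a deviance (likelihood-ratio) comparison of the two nested models, with and without spine. The sketch below uses the rounded deviances and residual degrees of freedom printed in the two summary tables. The drop in deviance is compared with a chi-square distribution whose degrees of freedom equal the number of added parameters. For 2 degrees of freedom the upper-tail probability has the closed form $e^{-x/2}$, so only the standard library is needed here; other cases would need a chi-square routine from a statistics library.

```python
import math

# Rounded deviances and residual degrees of freedom from the summary tables
dev_mod2, df_resid_mod2 = 551.8, 168   # y ~ weight + color
dev_mod3, df_resid_mod3 = 549.7, 166   # y ~ weight + color + spine

lr_stat = dev_mod2 - dev_mod3              # drop in deviance after adding spine
df_diff = df_resid_mod2 - df_resid_mod3    # number of added parameters

# Closed-form chi-square upper tail, valid only when df_diff == 2
assert df_diff == 2
p_value = math.exp(-lr_stat / 2)
print(round(lr_stat, 2), df_diff, round(p_value, 2))
```

The statistic is about 2.1 on 2 degrees of freedom, with a p-value near 0.35. On its own, that gives little evidence that spine improves the model once weight and color are included.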

## Available link functions

To see which other link functions can be used with the Poisson family, run the following instruction:

In [7]:
sm.families.family.Poisson.links

[statsmodels.genmod.families.links.log,
 statsmodels.genmod.families.links.identity,
 statsmodels.genmod.families.links.sqrt]
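
Each of these links relates the mean $\mu$ to the linear predictor $\eta$ through $g(\mu) = \eta$: $\log(\mu)$, $\mu$ itself, and $\sqrt{\mu}$ respectively. The plain-Python sketch below (not statsmodels code; the dictionary `links` and the value of `eta` are chosen only for illustration) shows the mean each link implies for the same linear predictor.

```python
import math

# Each entry maps a link name to (link g, inverse link g^{-1})
links = {
    'log':      (math.log, math.exp),
    'identity': (lambda mu: mu, lambda eta: eta),
    'sqrt':     (math.sqrt, lambda eta: eta ** 2),
}

eta = 1.2  # an arbitrary value of the linear predictor, for illustration
for name, (g, g_inv) in links.items():
    mu = g_inv(eta)
    # Applying the link to mu recovers the linear predictor
    print(name, round(mu, 3), round(g(mu), 3))
```

Note that only the log link guarantees a positive mean for every value of $\eta$, which is one reason it is the usual choice for Poisson regression.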