# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Tasklist 19: Generalized Linear Models IV. Poisson Regression. Negative Binomial Regression. 

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com). 

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

In this tasklist you'll get some more practice with Poisson and Negative Binomial Regression. Everything you need to know for this tasklist is already laid out in the Session19 notebook, so you can dive right in.

In [1]:
### --- Setup - importing the libraries

# - suppress those annoying 'FutureWarning' messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# - data
import numpy as np
import pandas as pd

# - os
import os

# - ml
from sklearn.linear_model import PoissonRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

import statsmodels.api as sm
import statsmodels.formula.api as smf

# - visualization
import matplotlib.pyplot as plt
import seaborn as sns

# - parameters
%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

# - rng
rng = np.random.default_rng(1234)

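# - plots
# set the seaborn theme first: sns.set_theme() applies its own rcParams
# (font sizes among them), so the plt.rc() overrides must come after it
sns.set_theme(style='white')
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)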

# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')

We'll take a look at the `Household Size in the Philippines` dataset, taken from the book [Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R, Paul Roback and Julie Legler](https://bookdown.org/roback/bookdown-BeyondMLR/); the dataset can be downloaded [here](https://github.com/proback/BeyondMLR), and you can also find it as the `fHH1.csv` file in the `_data` folder.

In [2]:
df = pd.read_csv(os.path.join(data_dir, "fHH1.csv"))
df

| | location | age | total | numLT5 | roof |
|---|---|---|---|---|---|
| 0 | CentralLuzon | 65 | 0 | 0 | Predominantly Strong Material |
| 1 | MetroManila | 75 | 3 | 0 | Predominantly Strong Material |
| 2 | DavaoRegion | 54 | 4 | 0 | Predominantly Strong Material |
| 3 | Visayas | 49 | 3 | 0 | Predominantly Strong Material |
| 4 | MetroManila | 74 | 3 | 0 | Predominantly Strong Material |
| ... | ... | ... | ... | ... | ... |
| 1495 | Visayas | 37 | 2 | 0 | Predominantly Strong Material |
| 1496 | MetroManila | 45 | 3 | 1 | Predominantly Strong Material |
| 1497 | MetroManila | 34 | 4 | 1 | Predominantly Strong Material |
| 1498 | IlocosRegion | 58 | 3 | 0 | Predominantly Strong Material |


We would like to predict the `total` variable, which represents the number of people living in a household (other than the head of the household), from the following covariates:

- `location`: where the house is located (a region of the Philippines; the data come from the Family Income and Expenditure Survey (FIES), which is spearheaded by the Philippine Statistics Authority (PSA));
- `age`: the age of the head of household;
- `numLT5`: the number of people in the household under 5 years of age;
- `roof`: the type of roof in the household (either Predominantly Light/Salvaged Material, or Predominantly Strong Material: stronger material can sometimes be used as a proxy for greater wealth).

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   location  1500 non-null   object
 1   age       1500 non-null   int64 
 2   total     1500 non-null   int64 
 3   numLT5    1500 non-null   int64 
 4   roof      1500 non-null   object
dtypes: int64(3), object(2)
memory usage: 58.7+ KB


As we can see, the dataset does not contain missing values, and the types of the variables are consistent with their definitions, so there is no need to clean this dataset. 

**01.** Make a countplot showing the counts of every value present in the `total` variable. What distribution does this chart resemble?

In [4]:
### your code here ###
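
As a hint for this task, here is a minimal sketch on synthetic counts rather than on `fHH1.csv`. The `demo` frame and the Poisson rate of 3.5 are assumptions made purely for illustration; on the real data you would pass `df` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# synthetic stand-in for the `total` column (assumed Poisson with rate 3.5)
rng = np.random.default_rng(1234)
demo = pd.DataFrame({"total": rng.poisson(lam=3.5, size=1500)})

# one bar per distinct value; bar height = number of rows with that value
ax = sns.countplot(data=demo, x="total", color="steelblue")
ax.set(xlabel="total", ylabel="count")

# the same counts as a table, sorted by value
counts = demo["total"].value_counts().sort_index()
print(counts)

# for a Poisson-like variable the mean and the variance are close to each other
print(demo["total"].mean(), demo["total"].var())
```

Comparing the sample mean with the sample variance is a quick numeric companion to the chart: a variance well above the mean is a first sign that a plain Poisson model may not fit.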

**02.** Perform one-hot encoding of every categorical predictor in the dataset.

In [5]:
### your code here ###
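
One possible approach, sketched on a tiny hand-made frame, is `pd.get_dummies`. The four rows below are made up and only mimic the column names of the dataset; dropping the first level of each variable is a choice made here (it avoids collinearity with an intercept), not a requirement of the task.

```python
import pandas as pd

# tiny hand-made frame with the dataset's column names (values are made up)
demo = pd.DataFrame({
    "location": ["CentralLuzon", "MetroManila", "Visayas", "MetroManila"],
    "age": [65, 75, 49, 34],
    "total": [0, 3, 3, 4],
    "numLT5": [0, 0, 0, 1],
    "roof": [
        "Predominantly Strong Material",
        "Predominantly Light/Salvaged Material",
        "Predominantly Strong Material",
        "Predominantly Strong Material",
    ],
})

# the categorical predictors are listed explicitly
categorical = ["location", "roof"]

# drop_first=True removes one (reference) level per variable;
# dtype=int yields 0/1 columns instead of booleans
encoded = pd.get_dummies(demo, columns=categorical, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The dropped levels become the reference categories: every dummy coefficient in a later model is then read relative to them.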

**03.** 

**a)** Using `statsmodels`, fit a Poisson Regression model on the given data to predict the values of the `total` variable from the rest of the variables. 

In [6]:
### your code here ###

**b)** Take the exponentials of model coefficients and interpret the values.

In [7]:
### your code here ###
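
With a log link, $\log \mu = \beta_0 + \sum_j \beta_j x_j$, so a one-unit increase in $x_j$ multiplies the expected count by $e^{\beta_j}$ while the other predictors are held fixed. The sketch below works through that arithmetic for two hypothetical coefficients; the values are illustrative and are not estimates from the dataset.

```python
import numpy as np

# hypothetical coefficients on the log scale (illustrative, not fitted values)
coefs = {"numLT5": 0.30, "age": -0.005}

rate_ratios = {}
for name, beta in coefs.items():
    rate_ratios[name] = np.exp(beta)            # multiplicative effect on the mean
    pct_change = 100 * (rate_ratios[name] - 1)  # same effect as a percentage
    print(f"{name}: exp(beta) = {rate_ratios[name]:.3f} "
          f"({pct_change:+.1f}% in the expected count per unit increase)")
```

For a fitted `statsmodels` result, `np.exp(result.params)` applies the same transformation to all coefficients at once.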

**c)** Compute the AIC of the model. 

In [8]:
### your code here ###

**d)** Is there overdispersion in the model?

In [9]:
### your code here ###

**04.** 

**a)** Use `scikit-learn` to fit a Poisson Regression model on the given data to predict the values of the `total` variable from the rest of the variables. Do not perform regularization.

In [10]:
### your code here ###
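
A minimal sketch with `PoissonRegressor` on synthetic data; the feature matrix and the true coefficients are assumptions for illustration. Note that `alpha`, the $L_2$ penalty strength, defaults to 1.0 in `scikit-learn`, so it has to be set to 0 explicitly to get an unregularized fit.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# synthetic data with known coefficients (assumed values)
rng = np.random.default_rng(1234)
n = 1500
X = rng.normal(size=(n, 2))
true_intercept, true_coef = 1.0, np.array([0.3, -0.2])
y = rng.poisson(np.exp(true_intercept + X @ true_coef))

# alpha=0 switches the L2 penalty off (plain maximum likelihood)
model = PoissonRegressor(alpha=0, max_iter=1000)
model.fit(X, y)
print(model.intercept_, model.coef_)
```

On the real data, `X` would hold the numeric and one-hot encoded predictors; a larger `max_iter` or scaled features may help the solver converge.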

**b)** What is the $D^2$ score of the model?

In [11]:
### your code here ###
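
$D^2$ is the fraction of deviance explained: one minus the ratio of the model deviance to the deviance of a null model that always predicts the mean of the target. For `PoissonRegressor` the `score` method returns this quantity; the sketch below, on synthetic data with assumed coefficients, recomputes it by hand as a cross-check.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance

# synthetic data (assumed coefficients)
rng = np.random.default_rng(1234)
X = rng.normal(size=(1500, 2))
y = rng.poisson(np.exp(1.0 + X @ np.array([0.3, -0.2])))

model = PoissonRegressor(alpha=0, max_iter=1000).fit(X, y)

# D^2 = 1 - deviance(model) / deviance(null model predicting mean(y))
dev_model = mean_poisson_deviance(y, model.predict(X))
dev_null = mean_poisson_deviance(y, np.full(len(y), y.mean()))
d2_manual = 1 - dev_model / dev_null

print(d2_manual, model.score(X, y))
```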

**05.** Using `scikit-learn` and cross-validation, fit a Poisson Regression model with $L_2$ regularization. Calculate the AIC and the $D^2$ score of the optimally regularized model.

In [12]:
### your code here ###
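
One way to set this up, sketched on synthetic data, is a `Pipeline` with scaling followed by `PoissonRegressor`, tuned over `alpha` with `GridSearchCV`. The grid of `alpha` values, the step names `scaler` and `poisson`, and the data are all assumptions. Since `scikit-learn` does not report an AIC, it is computed here from the Poisson log-likelihood; counting every coefficient as a full parameter ignores the shrinkage, so treat the value as approximate.

```python
import numpy as np
from scipy.stats import poisson
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (assumed coefficients; the third predictor has no effect)
rng = np.random.default_rng(1234)
X = rng.normal(size=(1500, 3))
y = rng.poisson(np.exp(1.0 + X @ np.array([0.3, -0.2, 0.0])))

# scale the predictors so that the L2 penalty treats them alike
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("poisson", PoissonRegressor(max_iter=1000)),
])

# candidate penalty strengths (an arbitrary illustrative grid)
param_grid = {"poisson__alpha": np.logspace(-4, 1, 6)}
cv = KFold(n_splits=5, shuffle=True, random_state=1234)

# with the default scoring, the estimator's own score (D^2) drives the selection
search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(X, y)

best = search.best_estimator_  # refit on all data with the best alpha
d2 = best.score(X, y)

# approximate AIC from the Poisson log-likelihood of the fitted means
log_lik = poisson.logpmf(y, best.predict(X)).sum()
k = best.named_steps["poisson"].coef_.size + 1  # coefficients + intercept
aic = 2 * k - 2 * log_lik

print(search.best_params_, d2, aic)
```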

**06.** Using `statsmodels`, fit a Negative Binomial Regression model. Compute the AIC and the $D^2$ score of this model. Is there a significant improvement over the Poisson Regression model? Why?

In [13]:
### your code here ###
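
As a sketch of the AIC comparison only, the example below fits both models to synthetic, deliberately overdispersed counts. The data, the shape value `size = 2.0`, and the choice of `smf.negativebinomial` (which estimates the dispersion parameter `alpha` along with the coefficients) are assumptions; whether the real data behave this way is for you to find out.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic overdispersed counts (assumed values)
rng = np.random.default_rng(1234)
n = 1500
x = rng.normal(size=n)
mu = np.exp(1.2 + 0.3 * x)
size = 2.0  # variance = mu + mu**2 / size, i.e. alpha = 1 / size = 0.5
demo = pd.DataFrame({
    "x": x,
    "y": rng.negative_binomial(size, size / (size + mu)),
})

# same mean structure, different variance assumptions
pois_fit = smf.poisson("y ~ x", data=demo).fit(disp=0)
nb_fit = smf.negativebinomial("y ~ x", data=demo).fit(disp=0, maxiter=200)

print("Poisson AIC:", pois_fit.aic)
print("NegBin AIC: ", nb_fit.aic)
print("estimated alpha:", nb_fit.params["alpha"])
```

The Negative Binomial model spends one extra parameter on the variance, so its AIC drops noticeably only when the counts are more spread out than the Poisson model allows; an estimated `alpha` close to zero points to little or no gain.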

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:hello@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>