In [1]:
import pandas as pd
import numpy as np

from statsmodels.formula.api import ols, glm
import statsmodels.api as sm

salary = pd.read_csv('../../data/salary.csv')

## Linear model, a special case of GLM
### In this exercise you will fit a linear model two ways, one using the ols() function and one using the glm() function. This will show how a linear model is a special case of a generalized linear model (GLM).

### You will use the preloaded salary dataset introduced in the video.
### Recall that the linear model in Python is fitted with: ols(formula = 'y ~ X', data = my_data).fit()
### and the generalized linear model is fitted with: glm(formula = 'y ~ X', data = my_data, family = sm.families.___).fit()

### Instructions
-    Import statsmodels.api with the common alias sm, and the ols and glm functions from statsmodels.formula.api.
-    Fit a linear model by predicting Salary with Experience using the salary dataset.

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols, glm

# Fit a linear model
model_lm = ols(formula = 'Salary ~ Experience',
               data = salary).fit()

# View model coefficients
print(model_lm.params)

-    Fit a GLM using the same formula and data as for the linear model, but this time pass the Gaussian() family as an additional argument.

In [None]:
from statsmodels.formula.api import ols, glm
import statsmodels.api as sm

# Fit a GLM
model_glm = glm(formula = 'Salary ~ Experience',
                data = salary,
                family = sm.families.Gaussian()).fit()

# View model coefficients
print(model_glm.params)

## Linear model and a binary response variable
### In the video, you saw an example of fitting a linear model to a binary response variable and how things can go wrong quickly. You learned that a straight-line fit can produce fitted values outside the interval from 0 to 1, which contradicts the logic of the problem because the response variable takes only the values 0 and 1.

### Using the preloaded crab dataset, you will study this effect by modeling y as a function of x using the GLM framework.

### Recall that the GLM model formulation is:
### glm(formula = 'y ~ X', data = my_data, family = sm.families.____).fit(), where you specify formula, data, and family.

### Also, recall that a GLM with:

-    the Gaussian family is a linear model (a special case of GLMs)
-    the Binomial family is a logistic regression model.

### Instructions
-    Using the crab dataset, define the model formula so that y is predicted by width.
-    To fit a linear model in the GLM framework, use Gaussian() for the family argument, which assumes y is continuous and approximately normally distributed.
-    To fit a logistic model in the GLM framework, use Binomial() for the family argument.
-    Fit each model using glm() with the appropriate arguments, then use print() and .summary() to view summaries of the fitted models.

In [8]:
crab = pd.read_csv("data/crab.csv")

In [9]:
# Define model formula
formula = 'y ~ width'

# Define probability distribution for the response variable for 
# the linear (LM) and logistic (GLM) model
family_LM = sm.families.Gaussian()
family_GLM = sm.families.Binomial()

# Define and fit a linear regression model
model_LM = glm(formula = formula, data = crab, family = family_LM).fit()
print(model_LM.summary())

# Define and fit a logistic regression model
model_GLM = glm(formula = formula, data = crab, family = family_GLM).fit()
print(model_GLM.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                  173
Model:                            GLM   Df Residuals:                      171
Model Family:                Gaussian   Df Model:                            1
Link Function:               Identity   Scale:                         0.19515
Method:                          IRLS   Log-Likelihood:                -103.13
Date:                Tue, 12 Mar 2024   Deviance:                       33.371
Time:                        04:04:04   Pearson chi2:                     33.4
No. Iterations:                     3   Pseudo R-squ. (CS):             0.1730
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.7655      0.421     -4.190      0.0

## Comparing predicted values
### In the previous exercise, you fitted both a linear and a GLM (logistic) regression model using the crab data, predicting y with width. In other words, you wanted to predict the probability that a female crab has a satellite crab nearby given her width.

### In this exercise, you will further examine the estimated probabilities (the output) from the two models and try to deduce if the linear fit would be suitable for the problem at hand.
### The usual practice is to test the model on new, unseen data. Such a dataset is called a test sample.
### The test sample has been created for you and loaded in the workspace. Note that you need test values for all variables present in the model, which in this example is width.
### The crab dataset has been preloaded in the workspace.

### Instructions
-    Using print(), view the test set.
-    Using the test sample, compute estimated probabilities with .predict() on the fitted linear model model_LM and save them as pred_lm. Likewise, compute estimated probabilities with .predict() on the fitted GLM (logistic) model model_GLM and save them as pred_glm.
-    Using the pandas DataFrame() constructor, combine the predictions from both models and save the result as predictions.
-    Concatenate test and predictions and save the result as all_data. View all_data using print().

In [10]:
test = pd.DataFrame({"width": [17.8, 24.6, 28.1, 32.0, 33.7], "y": [0, 0, 1, 1, 1]})

In [11]:
# View test set
print(test)

# Compute estimated probabilities for linear model: pred_lm
pred_lm = model_LM.predict(test)

# Compute estimated probabilities for GLM model: pred_glm
pred_glm = model_GLM.predict(test)

# Create dataframe of predictions for linear and GLM model: predictions
predictions = pd.DataFrame({'Pred_LM': pred_lm, 'Pred_GLM': pred_glm})

# Concatenate test sample and predictions and view the results
all_data = pd.concat([test, predictions], axis = 1)
print(all_data)

   width  y
0   17.8  0
1   24.6  0
2   28.1  1
3   32.0  1
4   33.7  1
   width  y   Pred_LM  Pred_GLM
0   17.8  0 -0.136287  0.029309
1   24.6  0  0.486122  0.470299
2   28.1  1  0.806480  0.834983
3   32.0  1  1.163450  0.972363
4   33.7  1  1.319052  0.987941
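
### To make the comparison explicit, the sketch below counts how many predictions from each model fall outside the interval from 0 to 1. The numbers are copied from the printed all_data above, so no model needs to be refitted; the helper outside_unit_interval is a name introduced here for illustration.

```python
import pandas as pd

# Values copied from the printed all_data table
all_data = pd.DataFrame({
    "width": [17.8, 24.6, 28.1, 32.0, 33.7],
    "y": [0, 0, 1, 1, 1],
    "Pred_LM": [-0.136287, 0.486122, 0.806480, 1.163450, 1.319052],
    "Pred_GLM": [0.029309, 0.470299, 0.834983, 0.972363, 0.987941],
})

def outside_unit_interval(p):
    """Flag values that cannot be probabilities (below 0 or above 1)."""
    return (p < 0) | (p > 1)

lm_invalid = outside_unit_interval(all_data["Pred_LM"])
glm_invalid = outside_unit_interval(all_data["Pred_GLM"])

n_lm_invalid = int(lm_invalid.sum())
n_glm_invalid = int(glm_invalid.sum())

# Rows where the linear model returns an impossible probability
print(all_data.loc[lm_invalid, ["width", "Pred_LM"]])
print("Invalid linear predictions:", n_lm_invalid)
print("Invalid logistic predictions:", n_glm_invalid)
```

### On this test sample the linear model returns impossible probabilities at the smallest width and at the two largest widths, while every logistic prediction stays between 0 and 1.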


## Model fitting step-by-step
### In the video lecture, you learned the key components for fitting a GLM in Python using the statsmodels package. In this exercise you will define the components of the GLM step by step and finally fit the model by calling the .fit() method.

### The dataset you will use concerns arsenic contamination of groundwater in Bangladesh. The goal is to model a household's decision to switch from its current well.
### The columns in the dataset are:
-    switch: 1 if the change of the current well occurred; 0 otherwise
-    arsenic: The level of arsenic contamination in the well
-    distance: Distance to the closest known safe well
-    education: Years of education of the head of the household

### The wells dataset has been preloaded in the workspace.

### Instructions
-    Create a regression formula where switch is predicted by distance100.

In [12]:
wells = pd.read_csv("data/wells.csv")

In [13]:
# Define the formula for the logistic model
model_formula = 'switch ~ distance100'

-    Define the link function using logit for the Binomial distribution.

In [14]:
# Define the formula for the logistic model
model_formula = 'switch ~ distance100'

# Define the correct probability distribution and the link function of the response variable
link_function = sm.families.links.logit()
model_family = sm.families.Binomial(link = link_function)



-    Using the wells dataset, define the model with glm(), passing the previously defined formula and family.
-    Fit the model by calling the .fit() method.

In [15]:
# Define the formula for the logistic model
model_formula = 'switch ~ distance100'

# Define the correct probability distribution and the link function of the response variable
link_function = sm.families.links.logit()
model_family = sm.families.Binomial(link = link_function)

# Fit the model
wells_fit = glm(formula = model_formula, 
                 data = wells, 
                 family = model_family).fit()
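
### To make the role of the link function concrete, here is a minimal NumPy sketch of the logit transform and its inverse. The helper names logit and inverse_logit are defined here for illustration and are not statsmodels functions; the probabilities in probs are arbitrary example values.

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to the log-odds scale."""
    return np.log(p / (1 - p))

def inverse_logit(eta):
    """Map any real number back to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-eta))

# Arbitrary example probabilities
probs = np.array([0.1, 0.5, 0.9])

# Log-odds: negative below 0.5, zero at 0.5, positive above 0.5
eta = logit(probs)

# Applying the inverse recovers the original probabilities
round_trip = inverse_logit(eta)

print(eta)
print(round_trip)
```

### With a logit link, the model is linear on the log-odds scale, and the inverse transform guarantees that the fitted probabilities stay strictly between 0 and 1.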

## Results of the model fit using summary()
### In the previous exercise you fitted a logistic regression model wells_fit using glm() and .fit(). The next step after fitting is to examine the results. To do this you will use the .summary() method, which reports the coefficient estimates and their standard errors along with several measures of model fit.

### In the lessons to come, you will learn how to interpret the model output and the statistical measures it reports.

### The model wells_fit has been preloaded in the workspace.

### Instructions
-    Using summary(), view the model summary from wells_fit.

In [16]:
# View the results of the wells_fit model
print(wells_fit.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                 switch   No. Observations:                 3010
Model:                            GLM   Df Residuals:                     3008
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2030.6
Date:                Wed, 13 Mar 2024   Deviance:                       4061.3
Time:                        02:21:56   Pearson chi2:                 3.01e+03
No. Iterations:                     4   Pseudo R-squ. (CS):            0.01409
Covariance Type:            nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.6108      0.060     10.104      

## Extracting parameter estimates
### Coefficient estimates are generally of main interest in a regression model. In the previous exercise you learned how to view the results of the model fit and hence the coefficient values along with their corresponding statistics. In this exercise you will learn how to extract the coefficients from the model object.

### The attribute .params contains the coefficients of the fitted model, starting with the intercept value. To compute a 95% confidence interval for the coefficients you can use the method .conf_int() of the fitted model wells_fit.

### Recall that the model you fitted was saved as wells_fit and is loaded in your workspace.

### Instructions
-    Save the coefficients as intercept and slope using the .params attribute.
-    Print the saved intercept and slope.
-    Extract and print 95% confidence intervals for the coefficients using .conf_int().

In [17]:
# Extract coefficients from the fitted model wells_fit
intercept, slope = wells_fit.params

# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)

# Extract and print confidence intervals
print(wells_fit.conf_int())

Intercept = 0.6108118803818956
Slope = -0.629080847955768
                    0         1
Intercept    0.492327  0.729297
distance100 -0.820345 -0.437816
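
### As a sanity check on the printed interval, the sketch below works only with the numbers shown above. It assumes the interval is a symmetric normal-based (Wald) interval, estimate plus or minus about 1.96 standard errors, and backs out the implied standard errors. The variable names are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients and 95% interval bounds copied from the output above
params = pd.Series({"Intercept": 0.6108118803818956,
                    "distance100": -0.629080847955768})
ci = pd.DataFrame({0: [0.492327, -0.820345],
                   1: [0.729297, -0.437816]},
                  index=["Intercept", "distance100"])

# 97.5% quantile of the standard normal distribution
z = 1.959964

# Under the Wald assumption the estimate sits at the interval midpoint
midpoint = (ci[0] + ci[1]) / 2
centered = np.allclose(midpoint, params, atol=1e-5)

# Half the interval width divided by z gives the implied standard error
implied_se = (ci[1] - ci[0]) / (2 * z)

print("Estimates sit at the interval midpoints:", centered)
print(implied_se)
```

### The implied standard error for the intercept should agree with the std err column of the summary table (0.060) up to rounding. Neither interval contains zero.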
