<a href="https://colab.research.google.com/github/francji1/01ZLMA/blob/main/code/01ZLMA_ex01_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 01 of the course 01ZLMA

## Contents

*   Recap of multivariable linear regression (FJFI course 01RAD)
*   Discussion of the course organization
*   Introduction to Google Colab notebooks (R version https://colab.to/r or Python versions)
* **Exponential type distributions**


In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
# Load the R magic extension
%load_ext rpy2.ipython

## Regression analysis


Dataset description:
https://jse.amstat.org/datasets/fishcatch.txt



### Fish Catch Dataset

159 fishes of 7 species are caught and measured. Altogether there are
8 variables.  All the fishes are caught from the same lake
(Laengelmavesi) near Tampere in Finland.

SOURCES:
Brofeldt, Pekka: Bidrag till kaennedom on fiskbestondet i vaera
        sjoear. Laengelmaevesi. T.H.Jaervi: Finlands Fiskeriet  Band 4,
        Meddelanden utgivna av fiskerifoereningen i Finland.
        Helsingfors 1917

VARIABLE DESCRIPTIONS:

1.  Obs       Observation number ranges from 1 to 159
2.  Species   (Numeric)
        Code Finnish  Swedish    English        Latin      
         1   Lahna    Braxen     Bream          Abramis brama
         2   Siika    Iiden      Whitewish      Leusiscus idus
         3   Saerki   Moerten    Roach          Leuciscus rutilus
         4   Parkki   Bjoerknan  Blicca         Abramis bjrkna
         5   Norssi   Norssen    Smelt          Osmerus eperlanus
         6   Hauki    Jaedda     Pike           Esox lucius
         7   Ahven    Abborre    Perch          Perca fluviatilis

3.  Weight      Weight of the fish (in grams)
4.  Length1     Length from the nose to the beginning of the tail (in cm)
5.  Length2     Length from the nose to the notch of the tail (in cm)
6.  Length3     Length from the nose to the end of the tail (in cm)
7.  Height%     Maximal height as % of Length3
8.  Width%      Maximal width as % of Length3
9.  Sex         1 = male 0 = female



In [None]:
# Read the dataset without column names
url = "http://jse.amstat.org/datasets/fishcatch.dat.txt"
col_names = ['Obs', 'Species', 'Weight', 'Len1', 'Len2', 'Len3', 'Height', 'Width', 'Sex']
fishcatch = pd.read_csv(url, sep=r'\s+', header=None, names=col_names)  # raw string avoids an invalid-escape warning
fishcatch

In [None]:
# Remove rows with missing Weight
fish = fishcatch.dropna(subset=['Weight']).copy()

# Replace missing values in 'Sex' with "unknown"
fish['Sex'] = fish['Sex'].fillna("unknown")

# Recode Sex: change 1 -> "male", 0 -> "female"
fish['Sex'] = fish['Sex'].replace({1: "male", 0: "female", '1': "male", '0': "female"})

# Convert 'Sex' and 'Species' to categorical types
fish['Sex'] = fish['Sex'].astype('category')
fish['Species'] = fish['Species'].astype('category')

# Drop the Obs column
fish = fish.drop(columns=['Obs'])

# Display summary statistics (all columns)
summary = fish.describe(include='all').T
print(summary)


In [None]:
fish

In [None]:
print(fish.describe(include='all').T)

In [None]:
# Define the numeric variables for the pair plot.
num_vars = ['Weight', 'Len1', 'Len2', 'Len3', 'Height', 'Width']

# Create a palette mapping for species using the same palette as PairGrid.
species_categories = fish['Species'].cat.categories
palette = dict(zip(species_categories, sns.color_palette("deep", n_colors=len(species_categories))))

def corrcoef_per_species(x, y, **kwargs):
    ax = plt.gca()
    # Retrieve the corresponding subset of the original dataframe.
    subset = fish.loc[x.index]
    groups = subset.groupby('Species')
    species_list = list(groups.groups.keys())
    N = len(species_list)
    # Define vertical positions for each species annotation.
    offsets = np.linspace(0.7, 0.3, N) if N > 1 else [0.5]
    for sp, offset in zip(species_list, offsets):
        group_mask = subset['Species'] == sp
        x_group = x[group_mask]
        y_group = y[group_mask]
        if len(x_group) > 1:
            r = np.corrcoef(x_group, y_group)[0, 1]
            ax.text(0.5, offset, f"{sp}: {r:.2f}",
                    transform=ax.transAxes,
                    ha='center', va='center',
                    fontsize=10, color=palette[sp])
    ax.set_xticks([])
    ax.set_yticks([])

# Create a PairGrid with an enlarged figure.
g = sns.PairGrid(fish, vars=num_vars, hue='Species', height=3)
g.map_lower(sns.scatterplot, alpha=0.6, s=30)
g.map_diag(sns.kdeplot)
g.map_upper(corrcoef_per_species)
g.add_legend(title="Species")
plt.suptitle("Pair Plot with Correlations by Species", y=1.02)
plt.show()


In [None]:
species_list = fish['Species'].cat.categories
num_vars = ['Weight', 'Len1', 'Len2', 'Len3', 'Height', 'Width']

n = len(species_list)
fig, axes = plt.subplots(1, n, figsize=(5 * n, 4))
if n == 1:
    axes = [axes]

for ax, sp in zip(axes, species_list):
    subset = fish[fish['Species'] == sp]
    corr = subset[num_vars].corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm", ax=ax, vmin=-1, vmax=1, cbar=False)
    ax.set_title(f"Correlation: {sp}")

plt.suptitle("Correlation Matrices per Species", y=1.05)
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(7,5))
sns.boxplot(x='Sex', y='Weight', data=fish, showfliers=False, linewidth=1)
sns.stripplot(x='Sex', y='Weight', data=fish, jitter=0.1, alpha=0.5, size=4, color='black')
plt.xlabel("Sex")
plt.ylabel("Weight")
plt.title("Weight vs Sex")
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(8,6))
# Boxplot: grouping by Species and split by Sex
sns.boxplot(x='Species', y='Weight', hue='Sex', data=fish, showfliers=False, linewidth=1)
# Jittered points overlaid
sns.stripplot(x='Species', y='Weight', data=fish, dodge=True, jitter=0.1, alpha=0.5, size=4, color='black')
plt.title("Weight vs Species\nDistinguish between Sex")
plt.xlabel("Species")
plt.ylabel("Weight")
plt.figtext(0.5, 0.01, "version: 001", ha="center", fontsize=9)
plt.legend(title="Sex", loc="lower center", bbox_to_anchor=(0.5, -0.2), ncol=3)
plt.tight_layout()
plt.show()


Load the required libraries (with the R runtime type):


In [None]:
%%R
#cat(system('sudo apt-get install -y gmp', intern=TRUE), sep = "\n")
#cat(system('sudo apt-get install -y partitions', intern=TRUE), sep = "\n")

list_of_packages <- c("tidyverse","dplyr","MASS","knitr", "GGally", "reactable","gridExtra","IRdisplay") #
missing_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
missing_packages
if(length(missing_packages)) install.packages(missing_packages)
lapply(list_of_packages, library, character.only = TRUE)

# To be safe: bind the dplyr versions explicitly (e.g. MASS also defines select)
select    <- dplyr::select;
rename    <- dplyr::rename;
mutate    <- dplyr::mutate;
summarize <- dplyr::summarize;
arrange   <- dplyr::arrange;
slice     <- dplyr::slice;
filter    <- dplyr::filter;
recode    <- dplyr::recode

In [None]:
%%R
#@title Read dataset (without colnames)
fishcatch <- read.table("http://jse.amstat.org/datasets/fishcatch.dat.txt") %>%
  rename(Obs =V1,
         Species = V2,
         Weight = V3,
         Len1 = V4,
         Len2 = V5,
         Len3 = V6,
         Height = V7,
         Width = V8,
         Sex = V9
         )

head(fishcatch)
fishcatch %>%
   summary() %>%
   kable(format = "pipe")

Your working directory is in the cloud. You can mount your Google Drive or your local hard drive; see
https://stackoverflow.com/questions/56679549/how-to-mount-google-drive-to-r-notebook-in-colab
and https://github.com/katewall/medium_tutorials/blob/main/210630_Medium_ColabwithR.ipynb.

In [None]:
%%R
R.version.string
getwd()

|   |     Obs      |   Species    |    Weight     |     Len1     |     Len2     |     Len3     |    Height    |    Width     |     Sex       |
|:--|:-------------|:-------------|:--------------|:-------------|:-------------|:-------------|:-------------|:-------------|:--------------|
|   |Min.   :  1.0 |Min.   :1.000 |Min.   :   0.0 |Min.   : 7.50 |Min.   : 8.40 |Min.   : 8.80 |Min.   :14.50 |Min.   : 8.70 |Min.   :0.0000 |
|   |1st Qu.: 40.5 |1st Qu.:2.000 |1st Qu.: 120.0 |1st Qu.:19.05 |1st Qu.:21.00 |1st Qu.:23.15 |1st Qu.:24.25 |1st Qu.:13.40 |1st Qu.:0.0000 |
|   |Median : 80.0 |Median :5.000 |Median : 272.5 |Median :25.20 |Median :27.30 |Median :29.40 |Median :27.10 |Median :14.60 |Median :0.0000 |
|   |Mean   : 80.0 |Mean   :4.497 |Mean   : 398.7 |Mean   :26.25 |Mean   :28.42 |Mean   :31.23 |Mean   :28.31 |Mean   :14.12 |Mean   :0.2361 |
|   |3rd Qu.:119.5 |3rd Qu.:7.000 |3rd Qu.: 650.0 |3rd Qu.:32.70 |3rd Qu.:35.50 |3rd Qu.:39.65 |3rd Qu.:37.60 |3rd Qu.:15.30 |3rd Qu.:0.0000 |
|   |Max.   :159.0 |Max.   :7.000 |Max.   :1650.0 |Max.   :59.00 |Max.   :63.40 |Max.   :68.00 |Max.   :44.50 |Max.   :20.90 |Max.   :1.0000 |
|   |NA            |NA            |NA's   :1      |NA            |NA            |NA            |NA            |NA            |NA's   :87     |

In [None]:
#reactable(fishcatch)

In [None]:
%%R
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

In [None]:
%%R
fish <- fishcatch %>%
  drop_na(Weight) %>%
  mutate_cond(is.na(Sex), Sex = "unknown") %>%
  mutate(Sex = as.factor(Sex)) %>%
  mutate(Sex = recode(Sex,"1" = "male", "0" = "female")) %>%
  mutate(Species = factor(Species)) %>%
  select(-Obs)
fish %>% summary() %>% kable()


In [None]:
%%R
p <- fish %>%
  ggpairs(
    columns = 2:8,
    mapping = aes(color = Species),
    upper = list(continuous = wrap("cor", size = 3)),
    lower = list(continuous = wrap("points", alpha = 0.3, size = 0.5))
  ) +
  theme(legend.position = "bottom") +
  labs(color = "Species Type")

# Save the plot with the desired dimensions (e.g., 12 inches by 8 inches)
ggsave("ggpairs_enlarged.png", p, width = 12, height = 8, dpi = 300)
p
# Display the saved image in the notebook
#IRdisplay::display_png("ggpairs_enlarged.png")

In [None]:
from IPython.display import Image, display
display(Image(filename="ggpairs_enlarged.png"))

# Regression


## What should you be able to do

* Recode Species: use the fish names instead of the numeric codes.
* Decide which variables look promising and which can be omitted.
* Edit the dataset.
* Plot Height vs. Weight, Len3 vs. Weight, and Width vs. Weight.

### In R

In [None]:
%%R
summary(fish)

In [None]:
%%R
Fish <- fish %>%
  filter(Species %in% c(1,3,7)) %>%
  mutate(Species = recode(Species,"1" = "Bream",
                                  "2" = "Whitewish",
                                  "3" = "Roach",
                                  "4" = "Blicca",
                                  "5" = "Smelt",
                                  "6" = "Pike",
                                  "7" = "Perch")) %>%
  mutate(Species = fct_drop(Species))%>%
  select(-Sex) %>%
  filter(Weight != 0)
head(Fish)
summary(Fish)

In [None]:
%%R -w 1200 -h 400
options(repr.plot.width = 15, repr.plot.height = 5, repr.plot.res = 90)

Fish <- Fish %>% mutate(Height = Height*Len3,
                        Width = Width*Len3)

Height_Weight <- ggplot(Fish,aes(x=Height,y=Weight,col=Species))+
                 geom_point(size=5,alpha=0.6)+
                 geom_smooth(col="grey40",method = "lm",se=F,lty="dashed",lwd=2,formula="y~x")

LengthV_Weight <- ggplot(Fish,aes(x=Len3,y=Weight,col=Species))+
                  geom_point(size=5,alpha=0.6)+
                  geom_smooth(col="grey40",method = "lm",se=F,lty="dashed",lwd=2,formula="y~x")

Width_Weight <- ggplot(Fish,aes(x=Width,y=Weight,col=Species))+
                 geom_point(size=5,alpha=0.6)+
                 geom_smooth(col="grey40",method = "lm",se=F,lty="dashed",lwd=2,formula="y~x")

#Height_Weight
#LengthV_Weight
#Width_Weight

grid.arrange(Height_Weight, LengthV_Weight, Width_Weight, ncol = 3)

In [None]:
%%R -w 1200 -h 400

# Each plot name now matches the variable on its x axis; stray empty arguments (",,") removed
Height_Weight_log <- ggplot(Fish,aes(x=log(Height),y=log(Weight),col=Species))+
                 geom_point(size=5,alpha=0.6)+
                 geom_smooth(col="grey40",method = "lm",se=F,lty="dashed",lwd=2,formula="y~x")

LengthV_Weight_log <- ggplot(Fish,aes(x=log(Len3),y=log(Weight),col=Species))+
                  geom_point(size=5,alpha=0.6)+
                  geom_smooth(col="grey40",method = "lm",se=F,lty="dashed",lwd=2,formula="y~x")

Width_Weight_log  <- ggplot(Fish,aes(x=log(Width),y=log(Weight),col=Species))+
                 geom_point(size=5,alpha=0.6)+
                 geom_smooth(col="grey40",method = "lm",se=F,lty="dashed",lwd=2,formula="y~x")


grid.arrange(Height_Weight_log, LengthV_Weight_log, Width_Weight_log, ncol = 3)


In [None]:
%%R
# Model with all variables and all interactions up to 2nd order ...
m0  <- lm(Weight ~ (.)^2, data = Fish)
summary(m0)
# Ufff

### In Python

In [None]:
# Display summary of the DataFrame in a table format
fish_summary = fish.describe(include='all').T
fish_summary


In [None]:
Fish = fish[fish['Species'].isin([1, 3, 7]) & (fish['Weight'] != 0)].copy()

Fish['Height'] = Fish['Height'] * Fish['Len3']
Fish['Width'] = Fish['Width'] * Fish['Len3']

# Recode Species: replace the numeric codes with fish names
species_map = {
    1: "Bream",
    2: "Whitewish",
    3: "Roach",
    4: "Blicca",
    5: "Smelt",
    6: "Pike",
    7: "Perch"
}
Fish['Species'] = Fish['Species'].map(species_map)

# Ensure Species is categorical first, then drop the categories that were filtered out
Fish['Species'] = Fish['Species'].astype('category')
Fish['Species'] = Fish['Species'].cat.remove_unused_categories()

# Drop the 'Sex' column
Fish = Fish.drop(columns=['Sex'])




In [None]:
# Display the first few rows of the DataFrame
print(Fish.head())

# Summary statistics for the DataFrame
fish_summary = Fish.describe(include='all').T
fish_summary


In [None]:
# Set a larger figure size: width 15, height 5 inches.
plt.figure(figsize=(15, 5))

# Plot 1: Height vs Weight
plt.subplot(1, 3, 1)
sns.scatterplot(data=Fish, x='Height', y='Weight', hue='Species', s=100, alpha=0.6)
sns.regplot(data=Fish, x='Height', y='Weight', scatter=False, color='grey',
            line_kws={'linestyle': '--', 'linewidth': 2})
plt.title("Height vs Weight")

# Plot 2: Len3 vs Weight
plt.subplot(1, 3, 2)
sns.scatterplot(data=Fish, x='Len3', y='Weight', hue='Species', s=100, alpha=0.6)
sns.regplot(data=Fish, x='Len3', y='Weight', scatter=False, color='grey',
            line_kws={'linestyle': '--', 'linewidth': 2})
plt.title("Len3 vs Weight")

# Plot 3: Width vs Weight
plt.subplot(1, 3, 3)
sns.scatterplot(data=Fish, x='Width', y='Weight', hue='Species', s=100, alpha=0.6)
sns.regplot(data=Fish, x='Width', y='Weight', scatter=False, color='grey',
            line_kws={'linestyle': '--', 'linewidth': 2})
plt.title("Width vs Weight")

# Adjust layout and show
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(15, 5))

# Plot 1: log(Width) vs log(Weight)
plt.subplot(1, 3, 1)
sns.scatterplot(x=np.log(Fish['Width']), y=np.log(Fish['Weight']),
                hue=Fish['Species'], s=100, alpha=0.6)
sns.regplot(x=np.log(Fish['Width']), y=np.log(Fish['Weight']),
            scatter=False, color='grey',
            line_kws={'linestyle': '--', 'linewidth': 2})
plt.title("log(Width) vs log(Weight)")

# Plot 2: log(Len3) vs log(Weight)
plt.subplot(1, 3, 2)
sns.scatterplot(x=np.log(Fish['Len3']), y=np.log(Fish['Weight']),
                hue=Fish['Species'], s=100, alpha=0.6)
sns.regplot(x=np.log(Fish['Len3']), y=np.log(Fish['Weight']),
            scatter=False, color='grey',
            line_kws={'linestyle': '--', 'linewidth': 2})
plt.title("log(Len3) vs log(Weight)")

# Plot 3: log(Height) vs log(Weight)
plt.subplot(1, 3, 3)
sns.scatterplot(x=np.log(Fish['Height']), y=np.log(Fish['Weight']),
                hue=Fish['Species'], s=100, alpha=0.6)
sns.regplot(x=np.log(Fish['Height']), y=np.log(Fish['Weight']),
            scatter=False, color='grey',
            line_kws={'linestyle': '--', 'linewidth': 2})
plt.title("log(Height) vs log(Weight)")

plt.tight_layout()
plt.show()


In [None]:
import statsmodels.formula.api as smf

# Ensure 'Species' is treated as categorical.
Fish['Species'] = Fish['Species'].astype('category')

# List predictors (all columns except 'Weight')
predictors = list(Fish.columns)
predictors.remove('Weight')

# Build the formula string, e.g. "Weight ~ (Len1 + Len2 + Len3 + Height + Width + Species)**2"
formula = "Weight ~ (" + " + ".join(predictors) + ")**2"
model = smf.ols(formula, data=Fish).fit()

print(model.summary())


### Questions:
* How can you interpret the previous result?
* Comment and discuss: how should we select a model?


In [None]:
%%R
m0_BIC  <- stepAIC(m0, k=log(dim(Fish)[1]))

m0_AIC  <- stepAIC(m0)
summary(m0_BIC)
summary(m0_AIC)

In [None]:
%%R
install.packages("leaps")
library(leaps)

In [None]:
%%R
best_subset <- regsubsets(Weight ~ (.)^2, Fish, nvmax = 20,really.big=T)
results <- summary(best_subset)
plot(best_subset)


In [None]:
%%R
# source: https://afit-r.github.io/model_selection
tibble(predictors = 1:20,
       adj_R2 = results$adjr2,
       Cp = results$cp,
       BIC = results$bic) %>%
  gather(statistic, value, -predictors) %>%
  ggplot(aes(predictors, value, color = statistic)) +
  geom_line(show.legend = F) +
  geom_point(show.legend = F) +
  facet_wrap(~ statistic, scales = "free")

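#### AIC and BIC

Here $n$ is the number of observations and $p$ the number of estimated regression parameters. The approximations hold up to an additive constant that does not depend on the model, with $\hat{\sigma}^2 = SS_{\text{residuals}}/n$.

**AIC:**
$$
AIC = -2 \log L + 2p \approx n \log\left( \frac{SS_{\text{residuals}}}{n}\right)  + 2p \approx  n \log(\hat{\sigma}^2) + 2p
$$

**BIC:**
$$
BIC = -2 \log L + p \log n \approx n \log\left( \frac{SS_{\text{residuals}}}{n}\right) + p \log n \approx  n \log(\hat{\sigma}^2) + p \log n
$$

**Mallows' $C_p$** (here $\hat{\sigma}^2$ is the residual variance estimate from the full model and $SS_{\text{residuals},p}$ belongs to the submodel with $p$ parameters):
$$
C_{p}=\frac{SS_{\text{residuals},p}}{\hat{\sigma}^2} - n + 2p
$$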
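
A minimal Python sketch of the three criteria, on synthetic data. The polynomial candidate models, the seed, and the helper names `fit_rss` and `criteria` are assumptions made for illustration and are not part of the course code. It uses the residual-sum-of-squares forms, so the additive constant of $-2\log L$ is dropped, and it takes $\hat{\sigma}^2$ in $C_p$ from the largest candidate model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)  # the true model is linear in x

def fit_rss(X, y):
    """Least-squares fit; returns the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def criteria(rss, n, p, sigma2_full):
    """AIC and BIC (up to a constant) and Mallows' Cp for a model with p parameters."""
    aic = n * np.log(rss / n) + 2 * p
    bic = n * np.log(rss / n) + p * np.log(n)
    cp = rss / sigma2_full - n + 2 * p
    return aic, bic, cp

# Candidate designs: polynomials of degree 0..4 (degree d has p = d + 1 parameters)
designs = {d: np.vander(x, d + 1, increasing=True) for d in range(5)}
rss = {d: fit_rss(X, y) for d, X in designs.items()}

p_full = 5
sigma2_full = rss[4] / (n - p_full)  # variance estimate from the largest model

table = {d: criteria(rss[d], n, d + 1, sigma2_full) for d in designs}
for d, (aic, bic, cp) in table.items():
    print(f"degree {d}: AIC={aic:9.2f}  BIC={bic:9.2f}  Cp={cp:9.2f}")
```

Since $\log n > 2$ for $n \geq 8$, BIC penalizes each extra parameter more heavily than AIC, which is why `stepAIC(m0, k = log(n))` tends to return a smaller model than the default `k = 2`.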



# Your turn 02 (run analysis in Python)

* Is there a problem with multicollinearity? If so, how can we cope with it?
* Try to find a good model (based on adjusted $R^2$) with a maximum of 7 covariates.

Hint: try to use expert knowledge.


In [None]:
%%R
m1 <- lm(Weight ~ (.), data = Fish)
summary(m1)

In [None]:
%%R
# VIF
print(1/(1-(summary(lm(Len1 ~.,data = Fish %>% select(-Weight)))$r.squared)))
print(1/(1-(summary(lm(Len2 ~.,data = Fish %>% select(-Weight)))$r.squared)))
print(1/(1-(summary(lm(Len3 ~.,data = Fish %>% select(-Weight)))$r.squared)))
print(1/(1-(summary(lm(Height ~.,data = Fish %>% select(-Weight)))$r.squared)))
print(1/(1-(summary(lm(Width  ~.,data = Fish %>% select(-Weight)))$r.squared)))
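
The same variance inflation factor, $VIF_j = 1/(1-R_j^2)$, can be sketched in Python. This is only an illustration under stated assumptions: the helper `vif` and the synthetic columns `len1`, `len2`, `height` are hypothetical stand-ins and do not reproduce the `Fish` data (the R cell above also includes `Species` among the regressors).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (columns are predictors)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
n = 200
len1 = rng.normal(25, 8, size=n)
len2 = 1.08 * len1 + rng.normal(scale=0.3, size=n)  # almost a rescaled copy of len1
height = rng.normal(10, 3, size=n)                   # unrelated to the lengths

vifs = vif(np.column_stack([len1, len2, height]))
print(dict(zip(["len1", "len2", "height"], np.round(vifs, 2))))
```

Two nearly proportional length measurements inflate each other's VIF, while the unrelated column stays close to 1.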

In [None]:
%%R
kappa(scale(data.matrix(Fish)))
kappa(scale(data.matrix(Fish %>% select(-Len1,-Len2))))

In [None]:
%%R
m1 <- lm(Weight ~ Species+Len2:Len3:Height+Len2:Len3:Width, data = Fish)
summary(m1)

In [None]:
%%R
options(repr.plot.width = 10, repr.plot.height = 5, repr.plot.res = 90)

par(mfrow = c(2, 2))
plot(m1, pch = 20, col = "blue4", lwd = 2)


In [None]:
%%R
log_m1 <- lm(log(Weight) ~ Species+Len2:Len3:Height+Len2:Len3:Width, data = Fish)
summary(log_m1)

In [None]:
%%R
par(mfrow = c(2, 2))
plot(log_m1, pch = 20, col = "blue4", lwd = 2)


In [None]:
%%R
log_m1 <- lm(log(Weight) ~ Species+Len2:Len3:Height+Len2:Len3:Width, data = Fish[-54,])
par(mfrow = c(2, 2))
plot(log_m1, pch = 20, col = "blue4", lwd = 2)


In [None]:
%%R
# Box-Cox transformation
BC_m0  <- lm(Weight ~ Species+I(Len3^2) +Len3:Height:Width, data = Fish)
summary(BC_m0)
bc     <- boxcox(m1, lambda = seq(-1,1 , 1/100))
lambda <- bc$x[which.max(bc$y)]
lambda
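
For the Python version, a minimal sketch of the Box-Cox profile log-likelihood over the same grid `seq(-1, 1, 1/100)`. The data are synthetic (constructed so that $\sqrt{y}$ is linear in `x`), and the names `boxcox_transform` and `profile_loglik` are hypothetical; this is not a port of `MASS::boxcox`, only the standard profile likelihood $-\frac{n}{2}\log(SS_{\text{residuals}}(\lambda)/n) + (\lambda-1)\sum_i \log y_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, size=n)
y = (2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)) ** 2  # sqrt(y) is linear in x
X = np.column_stack([np.ones(n), x])

def boxcox_transform(y, lam):
    """Box-Cox transform; the limit lam -> 0 is the logarithm."""
    return np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam

def profile_loglik(lam):
    """Profile log-likelihood of lam after regressing the transformed response on X."""
    z = boxcox_transform(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = np.sum((z - X @ beta) ** 2)
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

lambdas = np.linspace(-1, 1, 201)  # same grid as seq(-1, 1, 1/100)
loglik = np.array([profile_loglik(l) for l in lambdas])
lam_hat = lambdas[np.argmax(loglik)]
print("lambda maximizing the profile log-likelihood:", lam_hat)
```

A maximizer near 0.5 suggests a square-root transformation of the response; a maximizer near 0 suggests the logarithm.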

In [None]:
%%R
BC_m1 <- lm(Weight^(1/2) ~ Species+I(Len3^2) + Len3:Height+Len3:Width, data = Fish)
summary(BC_m1)

#Fish$Weight_lambda = (Fish$Weight^lambda-1)/lambda
#BC_m1 <- lm(Weight_lambda  ~ Species+I(Len3^2) + Len3:Height+Len3:Width, data = Fish)

summary(BC_m1)
par(mfrow = c(2, 2))
plot(BC_m1, pch = 20, col = "blue4", lwd = 2)


In [None]:
%%R
#install.packages("pbkrtest")
#install.packages("lme4")
#install.packages("RcppEigen")
#install.packages("car")
#library(car)

In [None]:
%%R
m_f <- lm(log(Weight) ~ Species+log(Len3)*log(Height)*log(Width), data = Fish[-54,])
summary(m_f)

#Fish$Weight_lambda = (Fish$Weight^lambda-1)/lambda
#BC_m1 <- lm(Weight_lambda  ~ Species+I(Len3^2) + Len3:Height+Len3:Width, data = Fish)

summary(m_f)
par(mfrow = c(2, 2))
plot(m_f, pch = 20, col = "blue4", lwd = 2)


Is linear regression clear?

# Let's start with GLM

##  Necessary theory recap from Lectures 01-03

Let's consider (m1):
  1. We have a sample $(y_1,\ldots,y_n)$ from independent random variables $Y_1,\ldots,Y_n$ with PDF (Probability Density Function) $f(y;\theta,\phi)$ in the exponential (one-parameter) family of probability distributions
  $$f(y;\theta,\phi) = \exp\left(\frac{y \theta - b(\theta)}{a(\phi)} - c(y,\phi)\right),$$
  subject to the usual regularity conditions (one-dimensional case, i.e. $y_i,\theta_i \in \mathbb{R}$, $a(\phi) >0$, $\phi >0$).
  2. **A regression matrix** $X$ and a vector of unknown parameters $\beta$, which define the **linear predictor**
    $$ \eta = X \beta $$
  3. **A link function** $g(\cdot)$
  $$\eta_i = g(\mu_i) = x_i^T \beta, \ \text{where} \ \mu_i = E[Y_i], \ \ i = 1,\ldots,n$$

The **dispersion** $a(\phi)$ is typically known. If not, we treat it as a **nuisance parameter**.

A link function satisfying $g(\mu_i) = \theta_i$ is called **canonical**.

For $b(\theta) \in C^2$ we showed:
$$E[Y] = b'(\theta) $$
$$V[Y] = a(\phi) b''(\theta) $$
and defined the variance function $v(\mu) = \frac{\partial \mu}{\partial \theta} = b''(\theta)$, so that  $$V[Y] = a(\phi) v(\mu)$$

Relations:

$$
\beta \xrightarrow[]{\eta_i = x_i^T\beta} \eta
\xrightarrow[]{\mu_i = g^{-1}(\eta_i)}  \mu
\xrightarrow[]{\theta_i = (b')^{-1}(\mu_i)}  \theta
$$

Inverse relations:
$$
\eta
\xleftarrow[]{\eta_i = g(\mu_i)}  \mu
\xleftarrow[]{\mu_i = b'(\theta_i)}  \theta
$$
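
The chain $\beta \to \eta \to \mu \to \theta$ can be traced numerically. The sketch below assumes a Poisson response, for which $b(\theta) = e^{\theta}$ and hence $(b')^{-1}(\mu) = \ln\mu$; the matrix `X` and the vector `beta` are made-up values. It compares the canonical log link with a non-canonical square-root link.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])      # regression matrix with an intercept column (hypothetical)
beta = np.array([0.5, 0.3])     # hypothetical parameter vector

eta = X @ beta                  # linear predictor: eta_i = x_i^T beta

# Poisson: mu = b'(theta) = exp(theta), so theta = (b')^{-1}(mu) = log(mu)
b_prime_inv = np.log

# Canonical log link: g(mu) = log(mu), g^{-1}(eta) = exp(eta)
mu_log = np.exp(eta)
theta_log = b_prime_inv(mu_log)

# Non-canonical square-root link: g(mu) = sqrt(mu), g^{-1}(eta) = eta**2
mu_sqrt = eta ** 2
theta_sqrt = b_prime_inv(mu_sqrt)

print("eta        :", eta)
print("theta (log):", theta_log)    # coincides with eta
print("theta (sqrt):", theta_sqrt)  # differs from eta
```

Only for the canonical link does the natural parameter coincide with the linear predictor, $\theta_i = \eta_i$.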


**Lemma**:
Let $Y$ have an exponential-type distribution with the density given in (m1), where $b(\theta)$ is twice continuously differentiable. Then the moment generating function $M_Y(t) = E[e^{tY}]$ is finite in a neighbourhood of 0, is twice differentiable at 0, and it holds:
* $E[Y] = b'(\theta)$
* $V[Y] = a(\phi) b''(\theta)$
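
A numerical check of the lemma, as a sketch. For the density in (m1) the moment generating function is $M_Y(t) = \exp\{[b(\theta + t\,a(\phi)) - b(\theta)]/a(\phi)\}$, so the first two derivatives of $\log M_Y(t)$ at $t=0$ give the mean and the variance. The helper names `cgf` and `mean_and_variance` and the parameter values are assumptions for illustration; derivatives are approximated by central differences.

```python
import math

def cgf(t, theta, b, a_phi=1.0):
    """log M_Y(t) for the density exp((y*theta - b(theta)) / a(phi) - c(y, phi))."""
    return (b(theta + t * a_phi) - b(theta)) / a_phi

def mean_and_variance(theta, b, a_phi=1.0, h=1e-4):
    """First and second derivative of the CGF at t = 0 by central differences."""
    k_plus = cgf(h, theta, b, a_phi)
    k_zero = cgf(0.0, theta, b, a_phi)
    k_minus = cgf(-h, theta, b, a_phi)
    mean = (k_plus - k_minus) / (2 * h)
    var = (k_plus - 2 * k_zero + k_minus) / h ** 2
    return mean, var

# Poisson(lambda): theta = ln(lambda), b(theta) = exp(theta), a(phi) = 1
lam = 3.0
pois = mean_and_variance(math.log(lam), math.exp)

# Bernoulli(p): theta = logit(p), b(theta) = ln(1 + exp(theta)), a(phi) = 1
p = 0.3
bern = mean_and_variance(math.log(p / (1 - p)), lambda th: math.log1p(math.exp(th)))

# Normal(mu, sigma^2): theta = mu, b(theta) = theta^2 / 2, a(phi) = sigma^2
mu, sigma2 = 1.5, 4.0
norm = mean_and_variance(mu, lambda th: th ** 2 / 2, a_phi=sigma2)

print("Poisson  :", pois)   # compare with (lambda, lambda)
print("Bernoulli:", bern)   # compare with (p, p * (1 - p))
print("Normal   :", norm)   # compare with (mu, sigma^2)
```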

## HW 01

Compute $E[Y]$, $V[Y]$, and $v(\mu)$ using moment generating function theory for the following so-called “big five” exponential-family distributions (Normal, Poisson, Gamma, Inverse Gaussian, Binomial):

* Normal: $N(\mu,\sigma^2)$
* Poisson: $Poi(\lambda)$
* Bernoulli: $Ber(p)$

  $f(y,p) = p^y(1-p)^{1-y}$
* Gamma: $\Gamma[a,p]$

 $ {\displaystyle f(y,a,p)={\frac {a ^{p }}{\Gamma (p)}}y^{p -1}e^{-a y}}$
* Inverse Gaussian: $IG[\mu, \lambda]$

  ${\displaystyle f(y;\mu ,\lambda )={\sqrt {\frac {\lambda }{2\pi y^{3}}}}\exp {\biggl (}-{\frac {\lambda (y-\mu )^{2}}{2\mu ^{2}y}}{\biggr )}}$


Questions:
* Which distributions can fulfill homoscedasticity and why?
* For which distribution does the variance increase with the square of the mean value?
* Does there exist a distribution where $V[Y] = k \cdot \mu$?


## Solution

####  Normal:

$N(\mu,\sigma^2)$

$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \, \mathrm{e}^{-\frac{(y- \mu)^2}{2\sigma^2}} = \mathrm{e}^{\frac{y \mu - \frac{\mu^2}{2}}{\sigma^2} - \left(  \frac{y^2}{2\sigma^2} + \frac{1}{2} \ln(2 \pi \sigma^2) \right)}$

 $\frac{y \theta - b(\theta)}{a(\phi)} = \frac{y \mu - \frac{\mu^2}{2}}{\sigma^2} \Rightarrow b(\theta) = \frac{\theta^2}{2}$

* Natural parameter: $\theta = \mu  \Rightarrow b(\theta) = \frac{\mu^2}{2}$
* Dispersion function: $\phi = \sigma^2 \Rightarrow a(\phi) = \sigma^2$
*  $E[Y] = b'(\theta) = \theta = \mu$
* $V[Y] = \sigma^2 b''(\theta)= \sigma^2 $

Constant variance function: $v(\mu) = b''(\theta) =  1$


#### Bernoulli: $Ber(p)$

  $f(y,p) = p^y(1-p)^{1-y} = \exp\left(y \ln(p) + (1-y)\ln(1-p)\right) = \exp\left(y \ln\left(\frac{p}{1-p}\right) + \ln(1-p)\right)$

* $\phi = 1$ and $\theta = \ln\left(\frac{p}{1-p}\right) \Rightarrow p = \frac{e^{\theta}}{1+e^{\theta}}$, so that $b(\theta) = -\ln(1-p) = \ln(1+e^{\theta})$
* $ E[Y] =  b'(\theta) = \frac{e^{\theta}}{1+e^{\theta}} = p$
* $ V[Y] = b''(\theta) = \frac{e^{\theta}}{1+e^{\theta}} - e^{\theta} \frac{e^{\theta}}{(1+e^{\theta})^2} = p - p^2 = p(1-p)$
* Variance function: $v(\mu) = \mu(1-\mu)$





####  Poisson: $Poi(\lambda)$

$f(y,\lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = exp(y ln(\lambda)  -\lambda -ln(y!)) $

* $\theta = ln(\lambda) \Rightarrow b(\theta) = e^{\theta}  \ \text{and} \  \phi = 1$
*  $E[Y] = b'(\theta) = e^{\theta} = \lambda$
* $V[Y] = b''(\theta)=e^{\theta} = \lambda$

Linear variance function: $v(\mu) = \mu$




#### Gamma: $\Gamma(a,p)$

- Shape $p>0$
- Rate $a>0$
- Support $y>0$

$$
f(y;\,a,p)
\;=\;
\frac{a^p}{\Gamma(p)}\,y^{\,p-1}\,e^{-\,a\,y},
\quad y>0.
$$


Solution by Rosenkracová, Rusá, Vojtášek

$f(y; \theta, \phi) = \exp \left[ \frac{y\theta - b(\theta)}{a(\phi)} - c(y, \phi) \right]$

$
f(y; p, a) = \frac{a^p}{\Gamma(p)} y^{p-1} e^{-ay} = \exp\left[ p \ln a - \ln \Gamma(p) + (p-1) \ln y - ay \right] = \exp \left[ -ay - (- p \ln a) - \ln \Gamma(p) + (p-1) \ln y \right]
$

* $\theta = - a, \quad \phi = 1$
* $ b(\theta) = - p\ln a = - p \ln (-\theta)$
* $ a(\phi) = \phi = 1$
* $ c(y, \phi) = \ln \Gamma(p) - (p-1) \ln y$ (the sign follows from the $-c(y,\phi)$ convention in the density)

*  $$E[Y] = b'(\theta) = -p \frac{1}{-\theta} (-1)= -\frac{p}{\theta} = \frac{p}{a}$$
* $$V[Y] = a(\phi) b''(\theta)= \frac{p}{\theta^2} = \frac{p}{a^2} $$


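In the standard GLM form, where the shape $p$ enters only through the dispersion, it should be:
* $\theta = - \frac{a}{p}, \quad \phi = \frac{1}{p}, \quad a(\phi) = \phi$
* $ b(\theta) = - \ln (-\theta)$
* $E[Y] = b'(\theta) = -\frac{1}{\theta} = \frac{p}{a}$ and $V[Y] = a(\phi)\, b''(\theta) = \frac{1}{p\,\theta^2} = \frac{p}{a^2}$, so that $v(\mu) = \mu^2$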
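
A short check that the parameterization $\theta = -a/p$, $\phi = 1/p$, $b(\theta) = -\ln(-\theta)$ reproduces the Gamma mean $p/a$ and variance $p/a^2$. The values of `p` and `a` are arbitrary, and the Monte Carlo part relies on `random.gammavariate(alpha, beta)` from the standard library, whose second argument is the scale $1/a$, not the rate.

```python
import random

p, a = 2.5, 0.5          # hypothetical shape and rate

theta = -a / p           # natural parameter
phi = 1.0 / p            # dispersion, with a(phi) = phi

def b_prime(th):
    """Derivative of b(theta) = -ln(-theta)."""
    return -1.0 / th

def b_second(th):
    """Second derivative of b(theta) = -ln(-theta)."""
    return 1.0 / th ** 2

mean = b_prime(theta)               # should equal p / a
variance = phi * b_second(theta)    # should equal p / a**2

# Monte Carlo cross-check: gammavariate takes (shape, scale) = (p, 1 / a)
random.seed(0)
draws = [random.gammavariate(p, 1.0 / a) for _ in range(100_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / len(draws)

print("EF form    :", mean, variance)
print("closed form:", p / a, p / a ** 2)
print("simulation :", round(mc_mean, 3), round(mc_var, 3))
```

In this form $V[Y] = \phi\,\mu^2$, so the variance function is $v(\mu) = \mu^2$ and the shape only scales it through $\phi = 1/p$.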



#### Inverse Gaussian

$$
f(y;\,\mu,\lambda)
\;=\;
\sqrt{\frac{\lambda}{2\pi\,y^3}}
\,\exp\!\Bigl[-\frac{\lambda\,(y-\mu)^2}{2\,\mu^2\,y}\Bigr],
\quad y>0.
$$



Solution by Rosenkracová, Rusá, Vojtášek:

$$
\begin{aligned}
f(y;\,\mu,\lambda)
&= \sqrt{\frac{\lambda}{2\pi\,y^3}}
\,\exp\!\Bigl[-\frac{\lambda\,(y-\mu)^2}{2\,\mu^2\,y}\Bigr]
= \exp\!\Bigl[\frac{1}{2}\ln \lambda-\frac{1}{2}\ln 2\pi-\frac{3}{2}\ln y-\frac{\lambda\,(y-\mu)^2}{2\,\mu^2\,y}\Bigr] \\
&= \exp\Bigl[\frac{1}{2}\ln \lambda-\frac{1}{2}\ln 2\pi-\frac{3}{2}\ln y-\frac{\lambda y}{2\,\mu^2}+\frac{\lambda}{\mu}- \frac{\lambda}{2\,y}\Bigr] \\
&= \exp\Bigl[\lambda\Bigl(-\frac{1}{2\,\mu^2}y - \Bigl(-\frac{1}{\mu}\Bigr)\Bigr) + \frac{1}{2}\ln \lambda-\frac{1}{2}\ln 2\pi-\frac{3}{2}\ln y- \frac{\lambda}{2\,y}\Bigr]
\end{aligned}
$$

* Natural parameter: $\theta = - \frac{1}{2\,\mu^2}, \quad b(\theta) = - \frac{1}{\mu} = - \sqrt{-2\theta}$
* Dispersion function: $\phi = \frac{1}{\lambda}, \quad a(\phi) = \phi$
*  $$E[Y] = b'(\theta) = \frac{1}{\sqrt{-2\theta}} = \mu$$
* $$V[Y] = a(\phi) b''(\theta)= \frac{(-2\theta)^{-3/2}}{\lambda}= \frac{\mu^{3}}{\lambda} $$
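
A numerical cross-check of the inverse Gaussian result, as a sketch with arbitrary values of $\mu$ and $\lambda$. It evaluates $b'(\theta)$ and $a(\phi)\,b''(\theta)$ from the formulas above and compares them with moments obtained by integrating the density with a simple midpoint rule. The integration range and step are assumptions chosen so that the neglected tail is negligible.

```python
import math

mu, lam = 2.0, 3.0       # hypothetical parameter values

def ig_pdf(y):
    """Inverse Gaussian density f(y; mu, lambda) for y > 0."""
    return math.sqrt(lam / (2 * math.pi * y ** 3)) * math.exp(
        -lam * (y - mu) ** 2 / (2 * mu ** 2 * y)
    )

# Exponential-family quantities
theta = -1.0 / (2 * mu ** 2)
phi = 1.0 / lam
mean_ef = (-2 * theta) ** -0.5            # b'(theta)
var_ef = phi * (-2 * theta) ** -1.5       # a(phi) * b''(theta)

# Midpoint rule on (0, 200); the density decays exponentially in the tail
upper, n_steps = 200.0, 200_000
step = upper / n_steps
grid = [(i + 0.5) * step for i in range(n_steps)]
weights = [ig_pdf(y) * step for y in grid]

total = sum(weights)
mean_num = sum(y * w for y, w in zip(grid, weights))
var_num = sum((y - mean_num) ** 2 * w for y, w in zip(grid, weights))

print("EF form    :", mean_ef, var_ef)
print("integration:", round(total, 6), round(mean_num, 6), round(var_num, 6))
```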

# LLM summary of Exponential Family “Big Five” + Bernoulli

Below are derivations of $E[Y]$, $\mathrm{Var}[Y]$, and the variance function $v(\mu)$ for each of the classic one-parameter exponential-family distributions:

1. **Normal** $N(\mu,\sigma^2)$  
2. **Poisson** $\mathrm{Pois}(\lambda)$  
3. **Binomial**$(n,p)$ (Bernoulli is the special case $n=1$)  
4. **Gamma**$(p,a)$  
5. **Inverse Gaussian**$(\mu,\lambda)$  
6. **Bernoulli**$(p)$ [listed separately, though it’s a special binomial]

We emphasize each distribution’s:
- **Mean** $E[Y]$,
- **Variance** $\mathrm{Var}[Y]$,
- **Variance function** $v(\mu)$ (in the GLM sense, meaning $\mathrm{Var}[Y] = \phi\,v(\mu)$).

Recall that in an exponential-family form

$$
f(y;\theta,\phi) \;=\;
\exp\!\Bigl[\frac{\,y\,\theta - b(\theta)\,}{\,a(\phi)\,} \;-\; c(y,\phi)\Bigr],
$$

we have $\,\mu = b'(\theta)\,$ and $\,\mathrm{Var}(Y)=a(\phi)\,b''(\theta).\,$ The function $\,v(\mu) = b''\bigl(\theta(\mu)\bigr)\,$ is called the **variance function**.

---

## 1. Normal Distribution $N(\mu,\sigma^2)$

**PDF**  
$$
f(y;\,\mu,\sigma^2)
= \frac{1}{\sqrt{2\pi\,\sigma^2}}
  \exp\!\Bigl[-\frac{(y-\mu)^2}{2\,\sigma^2}\Bigr].
$$

### Exponential-Family Form

In canonical form (with natural parameter $\theta = \mu$ and dispersion $a(\phi) = \sigma^2$):

$$
f(y;\theta,\sigma^2)
= \exp\!\Bigl[\frac{y\,\theta - \tfrac{\theta^2}{2}}{\sigma^2}
 - \Bigl(\tfrac{y^2}{2\,\sigma^2} + \tfrac12\ln(2\pi\,\sigma^2)\Bigr)\Bigr].
$$

Here $b(\theta) = \frac{\theta^2}{2}$, so $\mu = \theta$.

### Moments via MGF

- The MGF of a normal is $\,M(t) = \exp\!\bigl(\mu\,t + \tfrac12 \sigma^2 t^2\bigr)$.
- From standard properties, $E[Y] = \mu$ and $\mathrm{Var}[Y] = \sigma^2$.

### Key Quantities

- **Mean**: $E[Y] = \mu$  
- **Variance**: $\mathrm{Var}[Y] = \sigma^2$  
- **Variance function**: $\,v(\mu) = 1\,$ (in GLM form, $\mathrm{Var}[Y] = \sigma^2 \times 1$).

### Comments

- The **only** one-parameter EF with variance independent of the mean (i.e., homoscedastic) is the Normal.
- Commonly used for continuous data with the same variance at all $\mu$.

### Summary

$$
\boxed{E[Y] = \mu,\quad
\mathrm{Var}[Y] = \sigma^2,\quad
v(\mu)=1.}
$$


---

## 2. Poisson Distribution $\mathrm{Pois}(\lambda)$

**PMF**  
$$
P(Y = y)
= \frac{\lambda^y e^{-\lambda}}{y!},
\quad y=0,1,2,\dots.
$$

### Exponential-Family Form

$$
f(y;\lambda)
= \exp\!\Bigl[y\,\ln(\lambda) - \lambda - \ln(y!)\Bigr].
$$

Let $\theta = \ln(\lambda)$, hence $\lambda = e^\theta$.  
Then $b(\theta) = e^\theta$ and $\phi=1$.

### Moments via MGF

- The MGF of Poisson is $M(t)=\exp\bigl[\lambda(e^t-1)\bigr]$.
- From known expansions, $E[Y] = \lambda$ and $\mathrm{Var}[Y] = \lambda$.

### Key Quantities

- **Mean**: $E[Y] = \lambda$
- **Variance**: $\mathrm{Var}[Y] = \lambda$
- **Variance function**: $v(\mu) = \mu$  (so $\mathrm{Var}[Y] = \mu$ when $\phi=1$)

### Comments

- Canonical link is the **log** link: $\theta = \ln(\mu)$.
- Suitable for **count** data when mean = variance.

### Summary

$$
\boxed{E[Y] = \lambda,\quad
\mathrm{Var}[Y] = \lambda,\quad
v(\mu) = \mu.}
$$


---

## 3. Binomial Distribution $\mathrm{Binomial}(n,p)$

Although **Bernoulli**$(p)$ is the special case $n=1$, we treat the general binomial first.

**PMF**  
$$
P(Y=y)
= \binom{n}{y}\, p^y(1-p)^{n-y},
\quad y=0,1,\dots,n.
$$

### Exponential-Family Form

$$
\log f(y;\theta)
= y\,\theta
 - n\ln\bigl(1+ e^\theta\bigr)
 + \ln\binom{n}{y},
$$

where $\theta = \ln\!\bigl(\tfrac{p}{1-p}\bigr).$  
Hence $b(\theta) = n\ln\bigl(1+ e^\theta\bigr)$, $\phi=1$.

### Moments via MGF

- From standard binomial facts, $E[Y] = n\,p$, $\mathrm{Var}[Y] = n\,p\,(1-p)$.
- In EF form, $\mu = n\,p$.

### Key Quantities

- **Mean**: $E[Y] = n\,p$
- **Variance**: $\mathrm{Var}[Y] = n\,p\,(1-p)$
- **Variance function**:
  $$
    v(\mu) = \mu\Bigl(1 - \frac{\mu}{n}\Bigr).
  $$
  (Here $\mu = n\,p$.)

### Comments

- Canonical link is the **logit**: $\log(\tfrac{p}{1-p})$.
- For $n=1$, we get the **Bernoulli** distribution.

### Summary

$$
\boxed{E[Y] = n\,p,\quad
\mathrm{Var}[Y] = n\,p(1-p),\quad
v(\mu)= \mu\Bigl(1-\frac{\mu}{n}\Bigr).}
$$


---

## 4. Gamma Distribution $\Gamma(p,a)$

Sometimes parameterized by shape $p$ and rate $a$. Then $E[Y] = \tfrac{p}{a}$ and $\mathrm{Var}[Y] = \tfrac{p}{a^2}$. Another notation is $(\alpha,\beta)$.

**PDF**  
$$
f(y; p,a)
= \frac{a^p}{\Gamma(p)}\, y^{p-1} e^{-a\,y},
\quad y>0.
$$

### Exponential-Family Form (Sketch)

- **Mean**: $\,E[Y] = \tfrac{p}{a}.$
- **Variance**: $\,\mathrm{Var}[Y] = \tfrac{p}{a^2}.$
- Therefore, $\mathrm{Var}[Y] = \mu^2 / p$ if $\mu = p/a$.

In standard GLM form, it’s well known that

$$
v(\mu) = \mu^2.
$$

### MGF

$$
M(t)
= \Bigl(\frac{a}{a - t}\Bigr)^p, \quad t<a.
$$

Taking derivatives of $M(t)$ at $t=0$ confirms the mean and variance.

### Key Quantities

- **Mean**: $E[Y] = \tfrac{p}{a}$
- **Variance**: $\mathrm{Var}[Y] = \tfrac{p}{a^2}$
- **Variance function**: $v(\mu) = \mu^2$

### Comments

- Canonical link (if $p$ is known) is the **inverse** link $\eta = 1/\mu$ in many treatments.
- Used for positive, skewed data (e.g. waiting times, survival data).

### Summary

$$
\boxed{E[Y] = \frac{p}{a},\quad
\mathrm{Var}[Y] = \frac{p}{a^2},\quad
v(\mu) = \mu^2.}
$$


---

## 5. Inverse Gaussian Distribution $\mathrm{IG}(\mu,\lambda)$

**PDF**  
$$
f(y;\,\mu,\lambda)
= \sqrt{\frac{\lambda}{2\pi\,y^3}}
  \,\exp\!\Bigl[-\frac{\lambda\,(y-\mu)^2}{2\,\mu^2\,y}\Bigr],
\quad y>0.
$$

### Known Moments

- **Mean**: $E[Y] = \mu$  
- **Variance**: $\mathrm{Var}[Y] = \tfrac{\mu^3}{\lambda}.$

Hence the variance function is $v(\mu) = \mu^3$ in GLM notation.

### Exponential-Family Form (Outline)

It can be shown (somewhat more involved) that, with $\theta = -\tfrac{1}{2\mu^2}$ and $\phi = \tfrac{1}{\lambda}$, we get precisely these mean and variance relationships.

### Key Quantities

- **Mean**: $E[Y] = \mu$
- **Variance**: $\mathrm{Var}[Y] = \frac{\mu^3}{\lambda}$
- **Variance function**: $v(\mu) = \mu^3$

### Comments

- Often used for **positive** data, especially if heavier tails than Gamma or for certain “first passage time” processes.
- Canonical link is $\eta = \theta = -\tfrac{1}{2\mu^2}$; in practice it is usually taken as $1/\mu^2$, which differs only by a constant factor.

### Summary

$$
\boxed{E[Y] = \mu,\quad
\mathrm{Var}[Y] = \frac{\mu^3}{\lambda},\quad
v(\mu) = \mu^3.}
$$


---

## 6. Bernoulli Distribution $\mathrm{Ber}(p)$

Although this is just $\mathrm{Binomial}(n=1,p)$, it’s commonly listed separately.

**PMF**  
$$
P(Y=1)=p,\quad P(Y=0)=1-p,\quad Y\in\{0,1\}.
$$

### Exponential-Family Form

$$
f(y;\,p)
= p^y(1-p)^{1-y}
= \exp\!\Bigl[y\,\ln\!\bigl(\tfrac{p}{1-p}\bigr) + \ln(1-p)\Bigr].
$$

Thus $\theta=\ln\!\bigl(\tfrac{p}{1-p}\bigr)$, $\,\phi=1$, $\,b(\theta)=\ln(1+ e^\theta)$.

### Moments

- $E[Y] = p,$  
- $\mathrm{Var}[Y] = p(1-p).$

Hence the variance function is $v(\mu) = \mu\,(1-\mu)$.

### Key Quantities

- **Mean**: $E[Y] = p$
- **Variance**: $\mathrm{Var}[Y] = p(1-p)$
- **Variance function**: $v(\mu)=\mu(1-\mu)$

### Comments

- Canonical link is the **logit**: $\log\bigl(\tfrac{p}{1-p}\bigr)$.
- Simplest discrete EF on $\{0,1\}$.

### Summary

$$
\boxed{E[Y] = p,\quad
\mathrm{Var}[Y] = p(1-p),\quad
v(\mu) = \mu(1-\mu).}
$$


---

# Final Remarks

1. **MGF Approach**: For each distribution, you can compute $E[Y]$ and $\mathrm{Var}[Y]$ via
   $$
   M_Y(t) = E[e^{tY}],\quad
   E[Y] = M_Y'(0),\quad
   \mathrm{Var}[Y] = M_Y''(0) - \bigl(M_Y'(0)\bigr)^2.
   $$

2. **Variance Functions**:
   - Normal: $v(\mu)=1$  
   - Poisson: $v(\mu)=\mu$  
   - Binomial $(n,p)$: $v(\mu)=\mu\bigl(1-\mu/n\bigr)$; Bernoulli is $n=1$  
   - Gamma: $v(\mu)=\mu^2$  
   - Inverse Gaussian: $v(\mu)=\mu^3$

3. These are the **core** one-parameter EFs used in classical **GLMs**. The corresponding “canonical links” are those making $\eta = \theta$.
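
The MGF approach from item 1 can be tried numerically. The sketch below takes the three closed-form MGFs quoted in this summary (Normal, Poisson, Gamma) and approximates $M_Y'(0)$ and $M_Y''(0)$ by central differences. The helper name `moments_from_mgf`, the step `h`, and the parameter values are assumptions for illustration.

```python
import math

def moments_from_mgf(mgf, h=1e-4):
    """Approximate E[Y] = M'(0) and Var[Y] = M''(0) - M'(0)^2 by central differences."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2
    return m1, m2 - m1 ** 2

mu, sigma2 = 1.0, 2.0    # Normal parameters (hypothetical)
lam = 4.0                # Poisson rate (hypothetical)
p, a = 3.0, 2.0          # Gamma shape and rate (hypothetical)

mgfs = {
    "Normal": lambda t: math.exp(mu * t + 0.5 * sigma2 * t ** 2),
    "Poisson": lambda t: math.exp(lam * (math.exp(t) - 1.0)),
    "Gamma": lambda t: (a / (a - t)) ** p,   # valid for t < a
}

results = {name: moments_from_mgf(m) for name, m in mgfs.items()}
for name, (mean, var) in results.items():
    print(f"{name:8s} mean={mean:.4f}  variance={var:.4f}")
```

The values can be compared with $(\mu, \sigma^2)$, $(\lambda, \lambda)$ and $(p/a,\; p/a^2)$ respectively.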

