# Homework 4

Đinh Vũ Gia Hân - 22127098

---

# Import library

In [1]:
import numpy as np
from scipy import optimize
import math

# Generalization Error

## 1. 
This answer follows the approach in [1].

With probability $\geq 1 - \delta$, the VC generalization bound is given by:
$$|E_{out} - E_{in}| \leq \sqrt{\frac{8}{N} \ln \left( \frac{4 m_{\mathcal{H}}(2N)}{\delta} \right) }$$

We want 95% confidence that the generalization error is at most 0.05, so $\epsilon = 0.05$ and:
$$\delta = 1 - 0.95 = 0.05$$

Furthermore, every answer choice satisfies $N > d_{VC}$, so we can use the polynomial approximation of the growth function:
\begin{align}
m_{\mathcal{H}}(N) &\approx N^{d_{VC}} \\
\rightarrow m_{\mathcal{H}}(2N) &\approx (2N)^{d_{VC}}
\end{align}

So, the bound becomes:
$$\epsilon \leq \sqrt{\frac{8}{N} \ln \left( \frac{4 (2N)^{d_{VC}}}{\delta} \right) }$$

In [2]:
def original_vc_bound(N, dVC, delta, mH):
    """
    Calculate the original VC bound

    Parameters:
    ----------
    N : int
        Number of samples
    dVC : int
        VC dimension (unused here; kept for a uniform signature)
    delta : float
        Failure probability, i.e. 1 - confidence level
    mH : int
        Value of the growth function at 2N

    Returns:
    ----------
    float
        The original VC bound
    """
    return math.sqrt(8 * math.log(4 * mH / delta) / N)

In [3]:
Ns = [400000, 420000, 440000, 460000, 480000]
dVC = 10
delta = 0.05

for N in Ns:
    print(f'With N = {N}, omega = {original_vc_bound(N, dVC, delta, (2 * N) ** dVC)}')

With N = 400000, omega = 0.05297276596538537
With N = 420000, omega = 0.05178593269970576
With N = 440000, omega = 0.050678810077732374
With N = 460000, omega = 0.04964277890917069
With N = 480000, omega = 0.04867047569610589


The bound first drops below 0.05 between N = 440000 and N = 460000, so the closest answer choice for the sample size that the VC generalization bound predicts is 460000.

**Question 1:** So the correct answer is [d] 460,000
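
As a cross-check on the table above, the bound can also be solved for $N$ directly. The sketch below is a minimal fixed-point iteration on $N = \frac{8}{\epsilon^2} \ln \frac{4 (2N)^{d_{VC}}}{\delta}$; the function name `solve_sample_size`, the starting guess, and the tolerance are illustrative choices, not part of the original solution.

```python
import math

def solve_sample_size(eps, d_vc, delta, n0=1000.0, tol=1e-6, max_iter=1000):
    """Iterate N <- (8 / eps^2) * ln(4 * (2N)^d_vc / delta) until it stabilizes."""
    n = n0
    for _ in range(max_iter):
        # ln(4 * (2N)^d_vc / delta) expanded to avoid huge intermediate numbers
        log_term = math.log(4.0 / delta) + d_vc * math.log(2.0 * n)
        n_next = 8.0 / eps ** 2 * log_term
        if abs(n_next - n) < tol:
            return n_next
        n = n_next
    return n

n_star = solve_sample_size(eps=0.05, d_vc=10, delta=0.05)
print(f'Sample size where the bound equals 0.05: {n_star:.0f}')
```

Under the same polynomial approximation of the growth function, the iteration should settle near 453,000, which lies between the 440000 and 460000 rows printed above and is closer to 460000.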

## 2.

With $N > d_{VC}$, we can again use the approximation:
$$m_{\mathcal{H}}(N) \approx N^{d_{VC}}$$

We also have $d_{VC} = 50$, $\delta = 0.05$, and $N = 10000$.

Therefore, the inequalities in options [b], [c], and [d] become:

[b] Rademacher Penalty Bound: $\epsilon \leq \sqrt{\frac{2 \ln(2N \times N^{d_{VC}})}{N}} + \sqrt{\frac{2}{N} \ln \frac{1}{\delta}} + \frac{1}{N}$

[c] Parrondo and Van den Broek: $\epsilon \leq \sqrt{\frac{1}{N} \left( 2 \epsilon + \ln \frac{6 \times (2N)^{d_{VC}}}{\delta} \right)}$

[d] Devroye: $\epsilon \leq \sqrt{\frac{1}{2N} \left( 4\epsilon (1 + \epsilon) + \ln \frac{4 N^{2d_{VC}}}{\delta} \right)}$

In [4]:
def rademacher_penalty_bound(N, dVC, delta, mH):
    """
    Calculate the Rademacher penalty bound

    Parameters:
    ----------
    N : int
        Number of samples
    dVC : int
        VC dimension (unused here; kept for a uniform signature)
    delta : float
        Failure probability, i.e. 1 - confidence level
    mH : int
        Value of the growth function at N

    Returns:
    ----------
    float
        The Rademacher penalty bound
    """
    return math.sqrt(2 * math.log(2 * N * mH) / N) + math.sqrt(2 * math.log(1 / delta) / N) + 1 / N

In [5]:
def parrondo_van_den_broek_bound(N, dVC, delta, mH):
    """
    Calculate the Parrondo and Van den Broek bound

    Parameters:
    ----------
    N : int
        Number of samples
    dVC : int
        VC dimension (unused here; kept for a uniform signature)
    delta : float
        Failure probability, i.e. 1 - confidence level
    mH : int
        Value of the growth function at 2N

    Returns:
    ----------
    float
        The Parrondo and Van den Broek bound
    """
    tmp = lambda eps: math.sqrt((2 * eps + math.log(6 * mH / delta)) / N) - eps
    return optimize.brentq(tmp, 0, 5) 

In [6]:
def devroye(N, dVC, delta, mH):
    """
    Calculate the Devroye bound

    Parameters:
    ----------
    N : int
        Number of samples
    dVC : int
        VC dimension (unused here; kept for a uniform signature)
    delta : float
        Failure probability, i.e. 1 - confidence level
    mH : int
        Value of the growth function at N^2

    Returns:
    ----------
    float
        The Devroye bound
    """
    tmp = lambda eps: math.sqrt((4 * eps * (1 + eps) + math.log(4 / delta) + math.log(mH)) / (2 * N)) - eps
    return optimize.brentq(tmp, 0, 5) 

In [7]:
N = 10000
dVC = 50
delta = 0.05

print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Original VC bound = {original_vc_bound(N, dVC, delta, (2 * N) ** dVC)}')
print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Rademacher penalty bound = {rademacher_penalty_bound(N, dVC, delta, N ** dVC)}')
print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Parrondo and Van den Broek bound = {parrondo_van_den_broek_bound(N, dVC, delta, (2 * N) ** dVC)}')
print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Devroye bound = {devroye(N, dVC, delta, N ** (dVC * 2))}')

With N = 10000, dVC = 50, delta = 0.05, Original VC bound = 0.632174915200836
With N = 10000, dVC = 50, delta = 0.05, Rademacher penalty bound = 0.3313087859616395
With N = 10000, dVC = 50, delta = 0.05, Parrondo and Van den Broek bound = 0.22369829368078606
With N = 10000, dVC = 50, delta = 0.05, Devroye bound = 0.21522804980824667


At N = 10000, the Devroye bound is the smallest with $\epsilon \leq 0.215$

**Question 2:** So the correct answer is [d] Devroye: $\epsilon \leq \sqrt{\frac{1}{2N} \left( 4\epsilon (1 + \epsilon) + \ln \frac{4 m_{\mathcal{H}}(N^2)}{\delta} \right)}$
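
Bounds [c] and [d] are implicit in $\epsilon$, but squaring both sides turns each into a quadratic, so the positive root is available in closed form. The sketch below uses that to cross-check the `optimize.brentq` results; the helper names and the choice to pass $\ln m_{\mathcal{H}}$ instead of $m_{\mathcal{H}}$ are assumptions made for this illustration.

```python
import math

def parrondo_closed_form(N, delta, log_mH):
    """Positive root of N*eps^2 - 2*eps - L = 0, with L = ln(6 * mH / delta)."""
    L = math.log(6.0 / delta) + log_mH
    return (1.0 + math.sqrt(1.0 + N * L)) / N

def devroye_closed_form(N, delta, log_mH):
    """Positive root of (2N - 4)*eps^2 - 4*eps - L = 0, with L = ln(4 * mH / delta)."""
    L = math.log(4.0 / delta) + log_mH
    return (2.0 + math.sqrt(4.0 + (2 * N - 4) * L)) / (2 * N - 4)

N, dVC, delta = 10000, 50, 0.05
eps_c = parrondo_closed_form(N, delta, dVC * math.log(2 * N))   # ln m_H(2N)
eps_d = devroye_closed_form(N, delta, 2 * dVC * math.log(N))    # ln m_H(N^2)
print(f'Parrondo and Van den Broek (closed form) = {eps_c}')
print(f'Devroye (closed form) = {eps_d}')
```

Both values should agree with the root-finder output above (about 0.2237 and 0.2152), which suggests the bracket `[0, 5]` passed to `brentq` picks up the intended positive root.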

## 3.

For the same $d_{VC}$ and $\delta$ as in Problem 2, we compute the bounds for $N = 5$.

Since $N < d_{VC}$, the polynomial approximation $N^{d_{VC}}$ no longer applies. For any argument $n \leq d_{VC}$ the growth function is exactly $m_{\mathcal{H}}(n) = 2^n$, which covers $n = N$, $n = 2N$, and $n = N^2$ here.

In [8]:
N = 5
dVC = 50
delta = 0.05

print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Original VC bound = {original_vc_bound(N, dVC, delta, 2 ** (2 * N))}')
print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Rademacher penalty bound = {rademacher_penalty_bound(N, dVC, delta, 2 ** N)}')
print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Parrondo and Van den Broek bound = {parrondo_van_den_broek_bound(N, dVC, delta, 2 ** (2 * N))}')
print(f'With N = {N}, dVC = {dVC}, delta = {delta}, Devroye bound = {devroye(N, dVC, delta, 2 ** (N ** 2))}')

With N = 5, dVC = 50, delta = 0.05, Original VC bound = 4.254597220000659
With N = 5, dVC = 50, delta = 0.05, Rademacher penalty bound = 2.813654929686762
With N = 5, dVC = 50, delta = 0.05, Parrondo and Van den Broek bound = 1.7439535969958095
With N = 5, dVC = 50, delta = 0.05, Devroye bound = 2.264540762867992


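At N = 5, the Parrondo and Van den Broek bound is the smallest with $\epsilon \leq 1.744$. All four bounds exceed 1 at this sample size, so none of them is informative; the comparison only ranks them.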

**Question 3:** So the correct answer is [c] Parrondo and Van den Broek: $\epsilon \leq \sqrt{\frac{1}{N} \left( 2 \epsilon + \ln \frac{6 m_{\mathcal{H}}(2N)}{\delta} \right)}$
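
The switch between $2^n$ and $n^{d_{VC}}$ can be folded into one helper so the same code serves Problems 2 and 3. This is a sketch under the notebook's own simplification (exact $2^n$ up to $d_{VC}$, polynomial approximation beyond it); `log_growth` and `vc_bound` are hypothetical names, and working with logarithms is a choice made here to avoid very large integers.

```python
import math

def log_growth(n, d_vc):
    """ln m_H(n): exact 2^n for n <= d_vc, approximation n^d_vc otherwise."""
    if n <= d_vc:
        return n * math.log(2)
    return d_vc * math.log(n)

def vc_bound(N, d_vc, delta):
    """Original VC bound, evaluated with ln m_H(2N) from log_growth."""
    return math.sqrt(8.0 / N * (math.log(4.0 / delta) + log_growth(2 * N, d_vc)))

print(f'N = 5: {vc_bound(5, 50, 0.05)}')
print(f'N = 10000: {vc_bound(10000, 50, 0.05)}')
```

The two printed values should reproduce the "Original VC bound" rows for N = 5 and N = 10000 above.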

# Bias and Variance

## 4.
This answer follows the approach in [2].

We have:
\begin{align}
\bar{g}(\mathbf{x}) &\approx \frac{1}{K} \sum_{k=1}^{K} g^{(\mathcal{D}_k)}(\mathbf{x}) \\
&= \frac{1}{K} \sum_{k=1}^{K} a^{(\mathcal{D}_k)} x \\
&= x \times \frac{1}{K} \sum_{k=1}^{K} a^{(\mathcal{D}_k)} \\
&= x\hat{a} \\
&= \hat{a}x
\end{align}

So the average hypothesis $\bar{g}(\mathbf{x})$ is approximately $\hat{a}x$, where $\hat{a}$ is the mean of the fitted slopes over the $K$ datasets.

To estimate $\hat{a}$, we run the following experiment: draw two random points with $x$ uniform in [-1, 1] and $y = f(x) = \sin(\pi x)$, then fit $h(x) = ax$ by linear regression. We repeat this 1000 times and average the fitted slopes.

In [9]:
def problem4():
    """
    Calculate the average of a

    Returns:
    ----------
    a_avg: float
        The average of a
    """
    N_runs = 1000   # number of runs
    N = 2           # number of samples 
    a_total = 0     # total of a

    # repeat the experiment N_runs times
    for _ in range(N_runs):
        # define two random points
        x = np.random.uniform(-1, 1, N) # x1, x2
        y = np.sin(np.pi * x)           # y1, y2

        # train the model with linear regression
        X = np.array([x]).T
        w = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
        a = w[0]

        # add a to a_total
        a_total += a

    # calculate the average of a
    a_avg = a_total / N_runs

    return a_avg

In [10]:
# print the average of a, rounded to 2 decimal places
print(f'Average a = {np.round(problem4(), 2)}')

Average a = 1.44


None of the answer choices matches our result.

**Question 4:** So the correct answer is [e] None of the above
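
For $h(x) = ax$ the least-squares slope through the origin has the closed form $a = \frac{x_1 y_1 + x_2 y_2}{x_1^2 + x_2^2}$, so the experiment can be vectorized without a matrix inverse. The sketch below is one way to do that; the seed and the number of datasets are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
n_datasets = 100_000             # many more datasets than the 1000 used above

# each row is one dataset of two points
x = rng.uniform(-1, 1, size=(n_datasets, 2))
y = np.sin(np.pi * x)

# closed-form least-squares slope for a line through the origin
slopes = (x * y).sum(axis=1) / (x ** 2).sum(axis=1)
a_hat = slopes.mean()
print(f'Average a = {a_hat:.2f}')
```

With this many datasets the estimate should fluctuate much less from run to run than the 1000-run loop while staying near 1.4.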

## 5.

The bias at a point $x$ is given by:
$$\text{bias}(x) = {(\bar{g}(\mathbf{x}) - f(x))}^2$$

To estimate the overall bias, we create 1000 random test points with $x$ in the range [-1, 1] and $y = f(x) = \sin(\pi x)$. We then take the average slope $\hat{a}$ from 1000 training runs and compute the mean squared difference between $\hat{a}x$ and $f(x)$ over the test points.

In [11]:
N_runs_5 = 1000     # number of runs
N_test_5 = 1000     # number of test points
a_total_5 = 0       # total of a

# generate test set
x_test_5 = np.random.uniform(-1, 1, N_test_5)
y_test_5 = np.sin(np.pi * x_test_5)

# calculate the average of a
a_avg_5 = problem4()

# calculate predicted y
y_pred_5 = a_avg_5 * x_test_5

# calculate bias
bias = np.mean((y_test_5 - y_pred_5) ** 2)

# print bias
print(f'Bias = {bias}')

Bias = 0.2966333276111952


The bias calculated is closest to 0.3.

**Question 5:** So the correct answer is [b] 0.3
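
For a fixed slope $a$, the bias integral can be evaluated analytically. Using $\mathbb{E}[x^2] = \frac{1}{3}$, $\mathbb{E}[x \sin(\pi x)] = \frac{1}{\pi}$, and $\mathbb{E}[\sin^2(\pi x)] = \frac{1}{2}$ for $x$ uniform on [-1, 1] gives $\text{bias} = \frac{a^2}{3} - \frac{2a}{\pi} + \frac{1}{2}$. The sketch below compares that expression with a midpoint-rule average; the function names and the grid size are assumptions for this example.

```python
import numpy as np

def bias_closed_form(a):
    """E_x[(a*x - sin(pi*x))^2] for x uniform on [-1, 1]."""
    return a ** 2 / 3 - 2 * a / np.pi + 0.5

def bias_numeric(a, n_cells=100_000):
    """Same expectation, approximated by the midpoint rule on n_cells cells."""
    edges = np.linspace(-1, 1, n_cells + 1)
    mid = (edges[:-1] + edges[1:]) / 2
    return np.mean((a * mid - np.sin(np.pi * mid)) ** 2)

a = 1.44   # slope estimate printed in Problem 4
print(f'Closed form = {bias_closed_form(a)}, midpoint rule = {bias_numeric(a)}')
```

For $a = 1.44$ this gives roughly 0.27, so the sampled estimate above and the closed form both point to 0.3 as the nearest choice.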

## 6.

The variance is given by:
$$\text{var}(x) = {\mathbb{E}}_{\mathcal{D}}[{(g^{(\mathcal{D})}(x) - \bar{g}(\mathbf{x}))}^2]$$

To find the variance, we perform the following steps: create 1000 random test points x_test in the range [−1,1]. For each point, train the model 100 times using two random data points sampled from [−1,1]. For each run, calculate the squared difference between predictions using the model's slope and the average slope. Finally, average these squared differences across datasets and test points to compute the variance.

In [12]:
# calculate average of a
a_avg_6 = problem4()

e_X = 0             # expectation over X
N_runs_d_6 = 100    # number of runs over d
N_runs_x_6 = 1000   # number of runs over x

for _ in range(N_runs_x_6):
    N = 2
    x_test = np.random.uniform(-1, 1)
    e_D = 0        # expectation over D

    for _ in range(N_runs_d_6):
        # define two random points
        x = np.random.uniform(-1, 1, N)
        y = np.sin(np.pi * x)

        # train the model with linear regression
        X = np.array([x]).T
        w = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
        a = w[0]

        # calculate predicted y
        y_pred = a * x_test

        # calculate predicted y bar
        y_pred_bar = a_avg_6 * x_test

        # calculate expectation over D
        e_D += (y_pred - y_pred_bar) ** 2 / N_runs_d_6

    # calculate expectation over X
    e_X += e_D / N_runs_x_6

# calculate variance
variance = e_X

# print variance
print(f'Variance = {variance}')

Variance = 0.2389021765586611


The variance calculated is closest to 0.2.

**Question 6:** So the correct answer is [a] 0.2
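
Because $g^{(\mathcal{D})}(x) - \bar{g}(x) = (a - \hat{a})x$ and the test point is independent of the dataset, the variance factorizes as $\mathbb{E}[x^2] \cdot \text{Var}(a) = \text{Var}(a) / 3$. The sketch below uses this shortcut as a cross-check on the nested loops; the seed and the sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility

# 100000 datasets of two points each
x = rng.uniform(-1, 1, size=(100_000, 2))
y = np.sin(np.pi * x)

# least-squares slope through the origin for every dataset
slopes = (x * y).sum(axis=1) / (x ** 2).sum(axis=1)

# Var_D(a) * E[x^2], with E[x^2] = 1/3 for x uniform on [-1, 1]
variance = slopes.var() / 3.0
print(f'Variance = {variance}')
```

The result should land close to the looped estimate above, again nearest to 0.2 among the choices.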

## 7.

In [13]:
def problem7():
    """
    Calculate the average fitted parameters for each hypothesis set

    Returns:
    ----------
    b_avg_a: float
        The average of b for the hypothesis h(x) = b
    a_avg_b: float
        The average of a for the hypothesis h(x) = ax
    a_avg_c: float
        The average of a for the hypothesis h(x) = ax + b
    b_avg_c: float
        The average of b for the hypothesis h(x) = ax + b
    a_avg_d: float
        The average of a for the hypothesis h(x) = ax^2
    a_avg_e: float
        The average of a for the hypothesis h(x) = ax^2 + b
    b_avg_e: float
        The average of b for the hypothesis h(x) = ax^2 + b
    """
    N_runs = 1000   # number of runs
    N = 2           # number of samples 
    b_total_a = 0   # total of b for the hypothesis h(x) = b
    a_total_b = 0   # total of a for the hypothesis h(x) = ax
    a_total_c = 0   # total of a for the hypothesis h(x) = ax + b
    b_total_c = 0   # total of b for the hypothesis h(x) = ax + b
    a_total_d = 0   # total of a for the hypothesis h(x) = ax^2
    a_total_e = 0   # total of a for the hypothesis h(x) = ax^2 + b
    b_total_e = 0   # total of b for the hypothesis h(x) = ax^2 + b

    # repeat the experiment N_runs times
    for _ in range(N_runs):
        # define two random points
        x = np.random.uniform(-1, 1, N) # x1, x2
        y = np.sin(np.pi * x)           # y1, y2

        # train the model with linear regression for h(x) = b
        X_a = np.array([np.ones(N)]).T
        w_a = np.dot(np.dot(np.linalg.inv(np.dot(X_a.T, X_a)), X_a.T), y)
        b_a = w_a[0]

        # train the model with linear regression for h(x) = ax
        X_b = np.array([x]).T
        w_b = np.dot(np.dot(np.linalg.inv(np.dot(X_b.T, X_b)), X_b.T), y)
        a_b = w_b[0]

        # train the model with linear regression for h(x) = ax + b
        X_c = np.array([x, np.ones(N)]).T
        w_c = np.dot(np.dot(np.linalg.inv(np.dot(X_c.T, X_c)), X_c.T), y)
        a_c = w_c[0]
        b_c = w_c[1]

        # train the model with linear regression for h(x) = ax^2
        X_d = np.array([x * x]).T
        w_d = np.dot(np.dot(np.linalg.inv(np.dot(X_d.T, X_d)), X_d.T), y)
        a_d = w_d[0]

        # train the model with linear regression for h(x) = ax^2 + b
        X_e = np.array([x * x, np.ones(N)]).T
        w_e = np.dot(np.dot(np.linalg.inv(np.dot(X_e.T, X_e)), X_e.T), y)
        a_e = w_e[0]
        b_e = w_e[1]

        # add total
        b_total_a += b_a
        a_total_b += a_b
        a_total_c += a_c
        b_total_c += b_c
        a_total_d += a_d
        a_total_e += a_e
        b_total_e += b_e

    # calculate the average 
    b_avg_a = b_total_a / N_runs
    a_avg_b = a_total_b / N_runs
    a_avg_c = a_total_c / N_runs
    b_avg_c = b_total_c / N_runs
    a_avg_d = a_total_d / N_runs
    a_avg_e = a_total_e / N_runs
    b_avg_e = b_total_e / N_runs

    return b_avg_a, a_avg_b, a_avg_c, b_avg_c, a_avg_d, a_avg_e, b_avg_e

Compute the bias for each hypothesis set.

In [14]:
N_runs = 1000     # number of runs
N_test = 1000     # number of test points
b_total_a = 0   # total of b for the hypothesis h(x) = b
a_total_b = 0   # total of a for the hypothesis h(x) = ax
a_total_c = 0   # total of a for the hypothesis h(x) = ax + b
b_total_c = 0   # total of b for the hypothesis h(x) = ax + b
a_total_d = 0   # total of a for the hypothesis h(x) = ax^2
a_total_e = 0   # total of a for the hypothesis h(x) = ax^2 + b
b_total_e = 0   # total of b for the hypothesis h(x) = ax^2 + b

# generate test set
x_test = np.random.uniform(-1, 1, N_test)
y_test = np.sin(np.pi * x_test)

# calculate the average 
b_avg_a, a_avg_b, a_avg_c, b_avg_c, a_avg_d, a_avg_e, b_avg_e = problem7()

# calculate predicted y
y_pred_a = b_avg_a
y_pred_b = a_avg_b * x_test
y_pred_c = a_avg_c * x_test + b_avg_c
y_pred_d = a_avg_d * x_test * x_test
y_pred_e = a_avg_e * x_test * x_test + b_avg_e

# calculate bias
bias_a = np.mean((y_test - y_pred_a) ** 2)
bias_b = np.mean((y_test - y_pred_b) ** 2)
bias_c = np.mean((y_test - y_pred_c) ** 2)
bias_d = np.mean((y_test - y_pred_d) ** 2)
bias_e = np.mean((y_test - y_pred_e) ** 2)

# print bias
print(f'Bias for h(x) = b: {bias_a}')
print(f'Bias for h(x) = ax: {bias_b}')
print(f'Bias for h(x) = ax + b: {bias_c}')
print(f'Bias for h(x) = ax^2: {bias_d}')
print(f'Bias for h(x) = ax^2 + b: {bias_e}')

Bias for h(x) = b: 0.507175366246416
Bias for h(x) = ax: 0.26543395620941546
Bias for h(x) = ax + b: 0.19381904552754087
Bias for h(x) = ax^2: 0.5065178548334311
Bias for h(x) = ax^2 + b: 3.7381660216148647


Compute the variance.

In [17]:
# calculate average
b_avg_a, a_avg_b, a_avg_c, b_avg_c, a_avg_d, a_avg_e, b_avg_e = problem7()

e_X_a, e_X_b, e_X_c, e_X_d, e_X_e = 0, 0, 0, 0, 0    # expectation over X for each hypothesis
N_runs_d = 100    # number of runs over d
N_runs_x = 1000   # number of runs over x

for _ in range(N_runs_x):
    N = 2
    x_test = np.random.uniform(-1, 1)
    e_D_a, e_D_b, e_D_c, e_D_d, e_D_e = 0, 0, 0, 0, 0  # expectation over D for each hypothesis

    for _ in range(N_runs_d):
        # define two random points
        x = np.random.uniform(-1, 1, N)
        y = np.sin(np.pi * x)

        # train the model with linear regression for h(x) = b
        X_a = np.array([np.ones(N)]).T
        w_a = np.dot(np.dot(np.linalg.inv(np.dot(X_a.T, X_a)), X_a.T), y)
        b_a = w_a[0]

        # train the model with linear regression for h(x) = ax
        X_b = np.array([x]).T
        w_b = np.dot(np.dot(np.linalg.inv(np.dot(X_b.T, X_b)), X_b.T), y)
        a_b = w_b[0]

        # train the model with linear regression for h(x) = ax + b
        X_c = np.array([x, np.ones(N)]).T
        w_c = np.dot(np.dot(np.linalg.inv(np.dot(X_c.T, X_c)), X_c.T), y)
        a_c = w_c[0]
        b_c = w_c[1]

        # train the model with linear regression for h(x) = ax^2
        X_d = np.array([x * x]).T
        w_d = np.dot(np.dot(np.linalg.inv(np.dot(X_d.T, X_d)), X_d.T), y)
        a_d = w_d[0]

        # train the model with linear regression for h(x) = ax^2 + b
        X_e = np.array([x * x, np.ones(N)]).T
        w_e = np.dot(np.dot(np.linalg.inv(np.dot(X_e.T, X_e)), X_e.T), y)
        a_e = w_e[0]
        b_e = w_e[1]
        
        # calculate predicted y
        y_pred_a = b_a 
        y_pred_b = a_b * x_test
        y_pred_c = a_c * x_test + b_c
        y_pred_d = a_d * x_test * x_test
        y_pred_e = a_e * x_test * x_test + b_e

        # calculate predicted y bar
        y_pred_bar_a = b_avg_a
        y_pred_bar_b = a_avg_b * x_test
        y_pred_bar_c = a_avg_c * x_test + b_avg_c
        y_pred_bar_d = a_avg_d * x_test * x_test
        y_pred_bar_e = a_avg_e * x_test * x_test + b_avg_e

        # calculate expectation over D
        e_D_a += (y_pred_a - y_pred_bar_a) ** 2 / N_runs_d
        e_D_b += (y_pred_b - y_pred_bar_b) ** 2 / N_runs_d
        e_D_c += (y_pred_c - y_pred_bar_c) ** 2 / N_runs_d
        e_D_d += (y_pred_d - y_pred_bar_d) ** 2 / N_runs_d
        e_D_e += (y_pred_e - y_pred_bar_e) ** 2 / N_runs_d

    # calculate expectation over X
    e_X_a += e_D_a / N_runs_x
    e_X_b += e_D_b / N_runs_x
    e_X_c += e_D_c / N_runs_x
    e_X_d += e_D_d / N_runs_x
    e_X_e += e_D_e / N_runs_x

# calculate variance
variance_a = e_X_a
variance_b = e_X_b
variance_c = e_X_c
variance_d = e_X_d
variance_e = e_X_e

# print variance
print(f'Variance for h(x) = b: {variance_a}')
print(f'Variance for h(x) = ax: {variance_b}')
print(f'Variance for h(x) = ax + b: {variance_c}')
print(f'Variance for h(x) = ax^2: {variance_d}')
print(f'Variance for h(x) = ax^2 + b: {variance_e}')

Variance for h(x) = b: 0.25033156750051694
Variance for h(x) = ax: 0.23283189769471757
Variance for h(x) = ax + b: 1.6622258546160749
Variance for h(x) = ax^2: 16.480600412344224
Variance for h(x) = ax^2 + b: 12739.209005568395


In [18]:
# print expected out-of-sample error for each hypothesis
print(f'Expected out-of-sample error for h(x) = b: {bias_a + variance_a}')
print(f'Expected out-of-sample error for h(x) = ax: {bias_b + variance_b}')
print(f'Expected out-of-sample error for h(x) = ax + b: {bias_c + variance_c}')
print(f'Expected out-of-sample error for h(x) = ax^2: {bias_d + variance_d}')
print(f'Expected out-of-sample error for h(x) = ax^2 + b: {bias_e + variance_e}')

Expected out-of-sample error for h(x) = b: 0.7575069337469329
Expected out-of-sample error for h(x) = ax: 0.498265853904133
Expected out-of-sample error for h(x) = ax + b: 1.8560449001436157
Expected out-of-sample error for h(x) = ax^2: 16.987118267177653
Expected out-of-sample error for h(x) = ax^2 + b: 12742.947171590009


The hypothesis set $h(x) = ax$ has the smallest expected out-of-sample error.

**Question 7:** So the correct answer is [b] Hypotheses of the form $h(x) = ax$
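
The expected out-of-sample error can also be estimated directly as $\mathbb{E}_{\mathcal{D}} \mathbb{E}_x[(g^{(\mathcal{D})}(x) - f(x))^2]$, which equals bias plus variance and avoids estimating $\bar{g}$ first. The sketch below does this for all five hypothesis sets with `np.linalg.lstsq`; the feature-map dictionary, the seed, the evaluation grid, and the run count are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
x_grid = np.linspace(-1, 1, 1001)         # evaluation grid standing in for E_x
f_grid = np.sin(np.pi * x_grid)

# one feature map per hypothesis set; each returns a design matrix
feature_maps = {
    'b': lambda x: np.column_stack([np.ones_like(x)]),
    'ax': lambda x: np.column_stack([x]),
    'ax + b': lambda x: np.column_stack([x, np.ones_like(x)]),
    'ax^2': lambda x: np.column_stack([x ** 2]),
    'ax^2 + b': lambda x: np.column_stack([x ** 2, np.ones_like(x)]),
}

n_datasets = 2000
e_out = {name: 0.0 for name in feature_maps}
for _ in range(n_datasets):
    x = rng.uniform(-1, 1, 2)
    y = np.sin(np.pi * x)
    for name, phi in feature_maps.items():
        # least-squares fit on the two training points
        w, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
        # squared error of this fit, averaged over the grid
        e_out[name] += np.mean((phi(x_grid) @ w - f_grid) ** 2) / n_datasets

best = min(e_out, key=e_out.get)
print(e_out)
print(f'Smallest expected out-of-sample error: h(x) = {best}')
```

Estimates for the two quadratic forms can swing widely between seeds because near-degenerate datasets produce very large coefficients, but the ranking with $h(x) = ax$ lowest should be stable.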

# VC Dimension

## 8.

This answer follows the approach in [3].

According to the problem statement, we have $m_{\mathcal{H}}(1) = 2$ and the growth function satisfies:
$$m_{\mathcal{H}}(N + 1) = 2 m_{\mathcal{H}}(N) - \binom{N}{q}$$

Setting $N = d - 1$, the recurrence becomes:
\begin{align}
m_{\mathcal{H}}(d - 1 + 1) &= 2 m_{\mathcal{H}}(d - 1) - \binom{d - 1}{q} \\
\Leftrightarrow m_{\mathcal{H}}(d) &= 2 m_{\mathcal{H}}(d - 1) - \binom{d - 1}{q}
\end{align}

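The VC dimension is the largest value of $N$ for which $m_{\mathcal{H}}(N) = 2^N$. For any $d$ with $2 \leq d \leq d_{VC}$, both $m_{\mathcal{H}}(d) = 2^d$ and $m_{\mathcal{H}}(d - 1) = 2^{d - 1}$ hold, so the formula above becomes:
\begin{align}
2^d &= 2 \times 2^{d - 1} - \binom{d - 1}{q} \\
\Leftrightarrow 2^d &= 2^d - \binom{d - 1}{q} \\
\Leftrightarrow \binom{d - 1}{q} &= 0
\end{align}

But $\binom{d - 1}{q}$ is 0 only when $q > d - 1$, that is $d < q + 1$, which means $d \leq q$. Conversely, since $m_{\mathcal{H}}(1) = 2$ and the subtracted term vanishes for $N < q$, we get $m_{\mathcal{H}}(N) = 2^N$ for every $N \leq q$. For $d = q + 1$ the recurrence gives $m_{\mathcal{H}}(q + 1) = 2 \times 2^q - \binom{q}{q} = 2^{q + 1} - 1 < 2^{q + 1}$, so $q + 1$ is a break point. Since the VC dimension is the largest number of points that the hypothesis set can shatter, $d = q$ is the VC dimension of a hypothesis set whose growth function satisfies: $m_{\mathcal{H}}(N + 1) = 2 m_{\mathcal{H}}(N) - \binom{N}{q}$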

**Question 8:** So the correct answer is [c] q
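
The argument can be checked numerically by unrolling the recurrence and recording the last $N$ at which $m_{\mathcal{H}}(N) = 2^N$. The sketch below assumes an integer $q \geq 1$; the function name and the `n_max` cutoff are hypothetical.

```python
import math

def vc_dimension_from_recurrence(q, n_max=64):
    """Largest N <= n_max with m_H(N) = 2^N under the recurrence (assumes q >= 1)."""
    m = 2        # m_H(1) = 2, so one point is shattered
    d_vc = 1
    for n in range(1, n_max):
        m = 2 * m - math.comb(n, q)   # m_H(n + 1); comb returns 0 when q > n
        if m != 2 ** (n + 1):
            break                     # n + 1 is a break point
        d_vc = n + 1
    return d_vc

for q in range(1, 6):
    print(f'q = {q}, VC dimension = {vc_dimension_from_recurrence(q)}')
```

For each $q$ tried, the loop should stop at $N = q + 1$, matching the derivation.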

## 9.

For lower bound:
- The intersection $\bigcap_{k=1}^{K} H_k$ can be empty or contain a single hypothesis, and such a set has a VC dimension of 0. Since a VC dimension is never negative, 0 is the lower bound for $d_{VC}(\bigcap_{k=1}^{K} H_k)$.
- Therefore the correct answer can only be [a] or [b] or [c].

For upper bound:
- Let $H_m$ be the set with the smallest $d_{VC}$. The intersection is a subset of every $H_k$, and in particular of $H_m$, so any set of points that the intersection shatters is also shattered by $H_m$. So, $d_{VC}(H_m) = \min\{ d_{VC}(H_k) \}_{k=1}^{K}$ is an upper bound for $d_{VC}(\bigcap_{k=1}^{K} H_k)$, which is the bound in [b].

**Question 9:** So the correct answer is [b] $0 \leq d_{\text{VC}}\left( \bigcap_{k=1}^{K} \mathcal{H}_k \right) \leq \min\{ d_{\text{VC}}(\mathcal{H}_k) \}_{k=1}^{K}$
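
Both bounds can be illustrated on a tiny finite domain, where a hypothesis is just a tuple of labels and the VC dimension can be found by brute force. The sketch below is an illustration under assumptions, not a proof: the three-point domain, the example sets, and the helper `vc_dim` are hypothetical, and an empty set is assigned VC dimension 0 as in the argument above.

```python
from itertools import combinations

def vc_dim(hypotheses, n_points):
    """Brute-force VC dimension of a finite set of label tuples."""
    best = 0
    for k in range(1, n_points + 1):
        for subset in combinations(range(n_points), k):
            patterns = {tuple(h[i] for i in subset) for h in hypotheses}
            if len(patterns) == 2 ** k:
                best = k          # some subset of size k is shattered
                break
        else:
            break                 # no subset of size k is shattered; stop
    return best

# positive and negative rays restricted to three ordered points
pos_rays = {(-1, -1, -1), (-1, -1, 1), (-1, 1, 1), (1, 1, 1)}
neg_rays = {(1, 1, 1), (1, 1, -1), (1, -1, -1), (-1, -1, -1)}
single = {(1, -1, 1)}             # one hypothesis that is not a ray

both_rays = pos_rays & neg_rays   # the two constant hypotheses
disjoint = pos_rays & single      # empty intersection

print(vc_dim(pos_rays, 3), vc_dim(neg_rays, 3), vc_dim(both_rays, 3))
print(vc_dim(single, 3), vc_dim(disjoint, 3))
```

In the first case the intersection reaches the minimum of the two VC dimensions, and in the second it drops to 0, so both ends of the range in [b] can occur.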


## 10.
This answer follows the approach in [4].

Let $H_1$ be the hypothesis set containing the single hypothesis that classifies every point in the box [-1, 1] x [-1, 1] as +1, and let $H_2$ be the hypothesis set containing the single hypothesis that classifies every point as -1.

For $N = 1$, there are $2^N = 2^1 = 2$ possible dichotomies. However, $H_1$ produces only one of them because it always outputs +1, so it cannot shatter even one point. $N = 1$ is therefore a break point of $H_1$, which means that the VC dimension of $H_1$ is 0. The same argument applies to $H_2$, which always outputs -1, so the VC dimension of $H_2$ is 0 too.

For $N = 1$, the union of $H_1$ and $H_2$ can shatter one point because it produces both dichotomies, -1 and +1. For $N = 2$, it produces only (+1, +1) and (-1, -1) out of the $2^2 = 4$ dichotomies, so it cannot shatter two points. The VC dimension of $H_1 \cup H_2$ is therefore 1.

For lower bound:
- Let $H_m$ be the set with the highest $d_{VC}$. The union $\bigcup_{k=1}^{K} H_k$ contains $H_m$, so any set of points that $H_m$ shatters is also shattered by the union. So, $d_{VC}(H_m) = \max\{ d_{VC}(H_k) \}_{k=1}^{K}$ is a lower bound for $d_{VC}(\bigcup_{k=1}^{K} H_k)$
- Therefore the correct answer can only be [d] or [e].

For upper bound:
- Consider the upper bound in answer [d] with $K = 2$:
$$\sum_{k = 1}^{2} d_{VC}({\mathcal{H}}_k) =  d_{VC}({\mathcal{H}}_1) +  d_{VC}({\mathcal{H}}_2) = 0 + 0 = 0$$
$\Rightarrow$ This contradicts the VC dimension of 1 found above for the union, so [d] is ruled out.
- Consider the upper bound in answer [e] with $K = 2$:
$$2 - 1 + \sum_{k = 1}^{2} d_{VC}({\mathcal{H}}_k) = 1 + d_{VC}({\mathcal{H}}_1) +  d_{VC}({\mathcal{H}}_2) = 1 + 0 + 0 = 1$$
$\Rightarrow$ This is consistent with the value found above, so [e] is the only remaining choice.

**Question 10:** So the correct answer is [e] $\max\{ d_{\text{VC}}(\mathcal{H}_k) \}_{k=1}^{K} \leq d_{\text{VC}}\left( \bigcup_{k=1}^{K} \mathcal{H}_k \right) \leq K - 1 + \sum_{k=1}^{K} d_{\text{VC}}(\mathcal{H}_k)$
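
The counterexample can be replayed by counting the dichotomies that each set produces on one and two points. The sketch below is a small illustration; the helper `dichotomies` and the specific points are hypothetical, and any points in the box would give the same counts because both hypotheses are constant. It only rules out [d] and does not by itself prove the upper bound in [e].

```python
def dichotomies(hypotheses, points):
    """Set of distinct label patterns the hypotheses produce on the points."""
    return {tuple(h(p) for p in points) for h in hypotheses}

H1 = [lambda p: +1]      # always classifies as +1
H2 = [lambda p: -1]      # always classifies as -1
union = H1 + H2

one_point = [(0.0, 0.0)]
two_points = [(0.0, 0.0), (0.5, -0.5)]

n1 = len(dichotomies(union, one_point))    # compare with 2^1 = 2
n2 = len(dichotomies(union, two_points))   # compare with 2^2 = 4
print(f'H1 on one point: {len(dichotomies(H1, one_point))} dichotomy')
print(f'Union on one point: {n1} of 2, on two points: {n2} of 4')
```

The union reaches all 2 dichotomies on one point but only 2 of 4 on two points, so its VC dimension is 1, which exceeds the sum $0 + 0$ and equals $K - 1 + 0 + 0$ for $K = 2$.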

# References

[1]: homefish. "edX Learning From Data 2017" <i>Github</i>, 2017, https://github.com/homefish/edX_Learning_From_Data_2017/blob/master/homework_4/homework_4_problem_1_Generalization_Error.ipynb. (Accessed date: 01/12/2024)

[2]: homefish. "edX Learning From Data 2017" <i>Github</i>, 2017, https://github.com/homefish/edX_Learning_From_Data_2017/blob/master/homework_4/homework_4_problem_4_5_6_Bias_and_Variance.ipynb. (Accessed date: 01/12/2024)

[3]: homefish. "edX Learning From Data 2017" <i>Github</i>, 2017, https://github.com/homefish/edX_Learning_From_Data_2017/blob/master/homework_4/homework_4_problem_8_VC_dimension.ipynb. (Accessed date: 01/12/2024)

[4]: homefish. "edX Learning From Data 2017" <i>Github</i>, 2017, https://github.com/homefish/edX_Learning_From_Data_2017/blob/master/homework_4/homework_4_problem_9_10_VC_dimension.ipynb. (Accessed date: 01/12/2024)