# 2025 USA-NA-AIO Round 2, Problem 3 — ANSWERS

## Problem 3 (100 points)

In this problem, you are asked to study Contrastive Language-Image Pre-Training (CLIP), a powerful tool in multimodal AI.

In [None]:
# Run code in this cell

"""
DO NOT MAKE ANY CHANGE IN THIS CELL.
HINT: If something is not correctly installed, simply run this cell a few more times.
"""
!pip install datasets transformers

---

## $\color{red}{\text{WARNING !!!}}$

Beyond importing libraries/modules/classes/functions in the following cell, you are **NOT** allowed to import anything else for the following purposes:

- As a part of your final solution. For instance, if a problem asks you to build a model without using sklearn but you use it, then you will not earn points.

- Temporarily import something to assist you to get a solution. For instance, if a problem asks you to manually compute eigenvalues but you temporarily use `np.linalg.eig` to get an answer and then delete your code, then you violate the rule.

**Rule of thumb:** Each part is intentionally designed to test a specific skill. Do not attempt to find a shortcut to circumvent the rule.


In [None]:
# Run code in this cell

"""
DO NOT MAKE ANY CHANGE IN THIS CELL.
"""
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

from transformers import BertTokenizer, BertModel, ViTModel

---

We will use the Flickr30k dataset for image-language matching.


In [None]:
# Run code in this cell

"""
DO NOT MAKE ANY CHANGE IN THIS CELL.
"""
from datasets import load_dataset
dataset_train = load_dataset("USAAIO/2025-Round2-Problem3", split='train')

---

## Part 1 (5 points, coding task)

Do the following tasks to explore the properties of `dataset_train`:

1. `dataset_train` is a list-like object. Print the number of elements in it.

2. Consider index `idx = 2025`. Print the type of `dataset_train[idx]`.

3. Print all keys in `dataset_train[idx]`.

4. Name the value associated with the key `image` as `image_PIL`. Print it.

5. Convert `image_PIL` to a NumPy array object, called `image_np`. Print `image_np` and its shape.

6. Display this image by using `plt.imshow`.

7. Print the value associated with the key `alt_text`. Print its type.

In [None]:
### WRITE YOUR SOLUTION HERE ###

print(len(dataset_train))

idx = 2025
print(type(dataset_train[idx]))
print(dataset_train[idx].keys())

image_PIL = dataset_train[idx]['image']
print(image_PIL)

image_np = np.array(image_PIL)
print(image_np)
print(image_np.shape)

plt.imshow(image_np)
plt.show()

print(dataset_train[idx]['alt_text'])
print(type(dataset_train[idx]['alt_text']))

""" END OF THIS PART """

---

## Part 2 (5 points, coding task)

This dataset is too big. In our contest, we only use a small portion with 1000 samples.

To avoid introducing any bias, we will randomly select 1000 distinct samples.

Use NumPy to randomly select 1000 sample indices.
- Use the random seed number `2025` to generate the randomized indices. After the generation is completed, reset the seed number back to `None`.
- The output is called `indices`. It must be a list that contains 1000 Python `int` objects (not NumPy integers).

In [None]:
### WRITE YOUR SOLUTION HERE ###

np.random.seed(2025)
indices = np.random.permutation(len(dataset_train))[:1000]
np.random.seed()
indices = [int(idx) for idx in indices]

""" END OF THIS PART """

---

## Part 3 (5 points, coding task)

In this part, we create our image and text datasets.

- All sample indices are selected from `indices` generated in Part 2.
- All images (resp. texts) are extracted from the key `image` (resp. `alt_text`).
- The image (resp. text) dataset is called `image_list` (resp. `text_list`). The data type of both datasets is `list`.
- In `image_list`, each element is a PIL object.
- In `text_list`, each element is a string object.

In [None]:
### WRITE YOUR SOLUTION HERE ###

image_list = [dataset_train[idx]['image'] for idx in indices]
text_list = [dataset_train[idx]['alt_text'][0] for idx in indices]

""" END OF THIS PART """

---

## Part 4 (5 points, coding task)

In this part, we preprocess image data.

1. Your job is to create a tensor `images_pt` from `image_list` that has shape `(1000, 3, 224, 224)` and datatype `float32`.

2. The data range is from -1 to 1.

3. **Hint:** If `a` is a PIL object, then you can use `a.resize` to resize it.

4. Print `images_pt.shape`.

5. Print `images_pt.dtype`.

6. Print `images_pt[5]`.

In [None]:
### WRITE YOUR SOLUTION HERE ###

images_pt_list = []

for image in image_list:
    image = image.resize((224,224)) # Resize the image
    image_np = np.array(image) # Convert to numpy array
    image_pt = torch.from_numpy(image_np) # Convert to pytorch tensor
    image_pt = image_pt.permute(2,0,1) # Permute the dimension
    image_pt = image_pt / 255 # Normalize value between 0 and 1
    image_pt = image_pt * 2 - 1 # Normalize value between -1 and 1
    images_pt_list.append(image_pt)

images_pt = torch.stack(images_pt_list)

print(images_pt.shape)
print(images_pt.dtype)
print(images_pt[5])

""" END OF THIS PART """

---

## Part 5 (5 points, non-coding task)

Note that our final goal is to build a CLIP neural network. For the image data, we will use Vision Transformers (ViT) to extract image embeddings.

With the above high level information, please explain the reasons behind the following things that you did in Part 4.

1. Why is the channel dimension placed ahead of the height and width dimensions?

2. Why are all images resized to (224, 224)?

3. Why is each pixel value normalized to lie between -1 and 1?


**Answer:**

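1. The input of ViT requires the channel dimension to go ahead of the height and width dimensions. Its patch-embedding layer is a 2D convolution, and PyTorch convolutions expect inputs in the layout `(B, C, H, W)`.

2. The pretrained ViT model requires this size. It was trained on 224×224 images with 16×16 patches, so its learned position embeddings are sized for exactly 14×14 = 196 patches.

3. The pretrained ViT model expects data in this range. Feeding inputs with the same normalization that was used during pre-training keeps the input distribution consistent with what the model has seen.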

""" END OF THIS PART """


In [None]:
### WRITE YOUR SOLUTION HERE ###

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

token_id_list = tokenizer(text_list)['input_ids']

print(token_id_list)
print(type(token_id_list))
print(len(token_id_list))

print(token_id_list[5])
print(type(token_id_list[5]))
print(type(token_id_list[5][0]))

token_id_list = [torch.tensor(token_ids) for token_ids in token_id_list]
print(token_id_list[5:7])
print(token_id_list[5][0].dtype)

""" END OF THIS PART """

---

## Part 7 (5 points, non-coding task)

This part follows Part 6.

Do the following tasks.

1. Explain why token lists of all samples begin with token ID 101.
2. Explain why token lists of all samples end with token ID 102.

**Answer:**

1. Token ID 101 is the special `[CLS]` token. The BERT tokenizer prepends it to every input sequence.

2. Token ID 102 is the special `[SEP]` token. The BERT tokenizer appends it to mark the end of a sequence.

""" END OF THIS PART """


In [None]:
### WRITE YOUR SOLUTION HERE ###

class MyDataset(Dataset):
    def __init__(self, images_pt, token_id_list):
        self.images_pt = images_pt
        self.token_id_list = token_id_list

    def __len__(self):
        return len(self.token_id_list)

    def __getitem__(self, idx):
        return self.images_pt[idx], self.token_id_list[idx]

CLIP_dateset = MyDataset(images_pt, token_id_list)

""" END OF THIS PART """

---

## Part 9 (5 points, coding task)

### Part 9.1

Define your own collate function.

The function name is `my_collate_fn`.

**Padding**

For text data, let the longest sample be with K tokens.

Consider another text sample with L tokens satisfying L < K. Then, in addition to those L tokens, this sample is padded with K-L padding tokens whose values are 0.

**Outputs**

- `token_id_batch`. If the batch size is B and the longest sample in the text data has K tokens, then `token_id_batch` is a tensor with shape (B,K).

- `attention_mask_batch`. This is a tensor that has shape (B,K). If a position is occupied by a non-padding token, its value is 1. Otherwise, if it is occupied by a padding token, its value is 0. Data types are int64.

- `image_batch`. This is a tensor that has shape (B,3,224,224).

### Part 9.2

Define a DataLoader object called `CLIP_dataloader`.

- Set `batch_size = 16`.

- Set `shuffle = True`.

- Use the collate function defined in Part 9.1.


In [None]:
### WRITE YOUR SOLUTION HERE ###

# Part 9.1

def my_collate_fn(batch):
    image_batch_input, token_id_batch_input = zip(*batch)

    image_batch = torch.stack(image_batch_input)

    max_len_token_id = max([len(token_id) for token_id in token_id_batch_input])
    token_id_batch = []
    attention_mask_batch = []

    for token_id in token_id_batch_input:
        token_id_batch.append(torch.concatenate([token_id, torch.zeros(max_len_token_id - len(token_id), dtype=torch.int64)]))
        attention_mask_batch.append(torch.concatenate([torch.ones(len(token_id), dtype=torch.int64), \
                                                       torch.zeros(max_len_token_id - len(token_id), dtype=torch.int64)]))

    token_id_batch = torch.stack(token_id_batch)
    attention_mask_batch = torch.stack(attention_mask_batch)

    return image_batch, token_id_batch, attention_mask_batch

# Part 9.2

batch_size = 16

CLIP_dataloader = DataLoader(CLIP_dateset, batch_size=batch_size, shuffle=True, collate_fn=my_collate_fn)

""" END OF THIS PART """


---

## Part 10 (5 points, non-coding task)

In this part, you are asked to answer some questions about a CLIP model that you shall build in the next part.

Write your answers in the text cell below.

To get answers, you may need to run experimental code to better learn the ViT and Bert models.

We only grade your answers in the text cell.


**1. Image encoder**

- Define `model_image = ViTModel.from_pretrained('google/vit-base-patch16-224')`. We use all blocks except the last pooler layer. That is, this ViT model has two outputs: with their key names as `last_hidden_state` and `pooler_output`. You should take the value associated with the key `last_hidden_state`.

- From the last hidden state, we project from position 0 to a latent space with dimension `embedding_size` (e.g., 512). The output is called **image embedding**.


**2. Text encoder**

- Define `model_text = BertModel.from_pretrained('bert-base-uncased')`. We use all blocks except the last pooler layer. That is, this Bert model has two outputs: with their key names as `last_hidden_state` and `pooler_output`. You should take the value associated with the key `last_hidden_state`.

- From the last hidden state, we project from position 0 to a latent space with dimension `embedding_size` (e.g., 512). The output is called **text embedding**.

**Answer the following questions.** (Reasoning is required only for Question 3)

1. Let `image_batch` be with shape `(B,3,224,224)`. What is the shape of `model_image(image_batch)['last_hidden_state']`?

2. Let `token_id_batch` and `attention_mask_batch` be with shape `(B,L)`. What is the shape of `model_text(input_ids = token_id_batch, attention_mask = attention_mask_batch)['last_hidden_state']`?

3. For both the image encoder and the text encoder, we project the last hidden state from position 0 to a latent space with the same dimension `embedding_size`.

   3.1. Why do we add this additional out-projection layer?

   3.2. Why is this layer applied to position 0 only?

   3.3. Why are the output dimensions of these two encoders the same?


In [None]:
### DO YOUR EXPERIMENTAL STUDY HERE ###

image_batch, token_id_batch, attention_mask_batch = next(iter(CLIP_dataloader))

model_image = ViTModel.from_pretrained('google/vit-base-patch16-224')

print(model_image(image_batch).keys())

print(model_image(image_batch)['last_hidden_state'].shape)

model_text = BertModel.from_pretrained('bert-base-uncased')

print(model_text(input_ids=token_id_batch, attention_mask=attention_mask_batch).keys())

print(model_text(input_ids=token_id_batch, attention_mask=attention_mask_batch)['last_hidden_state'].shape)

""" END OF THIS PART """


**Answer:**

1. The shape of `model_image(image_batch)['last_hidden_state']` is **(B, 197, 768)**.
   - 197 = 1 (CLS token) + 196 (14×14 patches from 224×224 image with patch size 16)
   - 768 is the hidden dimension of ViT-base

2. The shape of `model_text(input_ids=token_id_batch, attention_mask=attention_mask_batch)['last_hidden_state']` is **(B, L, 768)**.
   - L is the sequence length (number of tokens)
   - 768 is the hidden dimension of BERT-base

3.1. We add this additional out-projection layer to map the hidden representations from both encoders to a common latent space where image and text embeddings can be compared directly using similarity metrics (e.g., cosine similarity).

3.2. Position 0 is used because:
   - For ViT, position 0 corresponds to the [CLS] token which aggregates information from all image patches through self-attention.
   - For BERT, position 0 also corresponds to the [CLS] token which serves as the sentence-level representation that captures the meaning of the entire text.

3.3. The output dimensions must be the same so that we can compute similarity (e.g., dot product or cosine similarity) between image embeddings and text embeddings in the shared latent space. This is essential for CLIP's contrastive learning objective.

""" END OF THIS PART """


---

## Part 11 (5 points, coding task)

In this part, you are asked to build your CLIP model.

- The class name is `MyCLIP`. It subclasses `nn.Module`.

- **`__init__`:**

    - It takes one input argument - the size of the final embedding of text and image data. Set its default value as 512.

    - Attribute `log_tau` is the log of temperature. It is a learnable parameter. Its initial value follows the standard normal distribution.

- **`forward`:**

    - It returns two objects: image embedding, text embedding.

In [None]:
### WRITE YOUR SOLUTION HERE ###

class MyCLIP(nn.Module):
    def __init__(self, embedding_size=512):
        super().__init__()
        self.model_image = ViTModel.from_pretrained('google/vit-base-patch16-224')
        self.model_text = BertModel.from_pretrained('bert-base-uncased')
        self.embedding_size = embedding_size
        self.last_layer_image = nn.Linear(768, self.embedding_size)
        self.last_layer_text = nn.Linear(768, self.embedding_size)
        self.log_tau = nn.Parameter(torch.randn(1))

    def encoder_image(self, image_batch):
        image_embedding = self.model_image(image_batch)['last_hidden_state'][:,0]
        image_embedding = self.last_layer_image(image_embedding)
        return image_embedding

    def encoder_text(self, token_id_batch, attention_mask_batch):
        text_embedding = self.model_text(input_ids=token_id_batch, attention_mask=attention_mask_batch)['last_hidden_state'][:,0]
        text_embedding = self.last_layer_text(text_embedding)
        return text_embedding

    def forward(self, image_batch, token_id_batch, attention_mask_batch):
        return self.encoder_image(image_batch), self.encoder_text(token_id_batch, attention_mask_batch)

""" END OF THIS PART """

---

## Part 12 (5 points, non-coding task)

Explain why we use $\log \tau$ as an attribute, not $\tau$.


**Answer:**

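$\log \tau$ can take any real value, so gradient updates can never move it outside its valid range.

However, $\tau$ must be positive. If we learned $\tau$ directly, an optimizer step could make it zero or negative, and we would always need to take care of its domain. Recovering the temperature as $\tau = \exp(\log \tau)$ guarantees positivity automatically.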

""" END OF THIS PART """


---

## Part 13 (5 points, coding task)

Do the following tasks:

1. Define your model by calling `model_CLIP = MyCLIP()`.

2. Fix all parameter values in the ViT and Bert blocks in your model. That is, you are only allowed to train:
   - Out-projection matrices in the image and text encoders.
   - Temperature.


In [None]:
### WRITE YOUR SOLUTION HERE ###

model_CLIP = MyCLIP()

for param in model_CLIP.model_text.parameters():
    param.requires_grad = False

for param in model_CLIP.model_image.parameters():
    param.requires_grad = False

""" END OF THIS PART """

---

## Part 14 (5 points, coding task)

Do the following tasks:

1. Set the learning rate as `1e-3`.

2. Choose your optimization algorithm as Adam.

3. Define an optimizer.


In [None]:
### WRITE YOUR SOLUTION HERE ###

lr = 1e-3
optimizer = optim.Adam(model_CLIP.parameters(), lr=lr)

""" END OF THIS PART """

---

## Part 15 (5 points, coding task)

In this part, you are asked to define a loss function.

Let $I_i$ and $T_j$ be image $i$'s embedding and text $j$'s embedding, respectively. Let $B$ be the batch size. Let $\tau$ be the temperature.

Then the loss function is defined as

$$\mathcal{L} = \frac{1}{2} \left( -\frac{1}{B} \sum_{i=0}^{B-1} \log \frac{\exp(\text{SIM}(I_i, T_i) / \tau)}{\sum_{j=0}^{B-1} \exp(\text{SIM}(I_i, T_j) / \tau)} - \frac{1}{B} \sum_{i=0}^{B-1} \log \frac{\exp(\text{SIM}(I_i, T_i) / \tau)}{\sum_{j=0}^{B-1} \exp(\text{SIM}(I_j, T_i) / \tau)} \right),$$

where

$$\text{SIM}(I_i, T_j) = \frac{I_i^\top T_j}{\|I_i\|_2 \|T_j\|_2}.$$


In [None]:
### WRITE YOUR SOLUTION HERE ###

def CLIP_loss_fn(image_embedding, text_embedding):
    image_embedding = image_embedding / torch.norm(image_embedding, dim=-1, keepdim=True)
    text_embedding = text_embedding / torch.norm(text_embedding, dim=-1, keepdim=True)
    sim = torch.sum(image_embedding.unsqueeze(1) * text_embedding.unsqueeze(0), dim=-1)
    loss = .5 * (-torch.mean(torch.diagonal(torch.log_softmax(sim / torch.exp(model_CLIP.log_tau), dim=0))) \
                 -torch.mean(torch.diagonal(torch.log_softmax(sim / torch.exp(model_CLIP.log_tau), dim=1))))
    return loss

""" END OF THIS PART """

---

## Part 16 (5 points, coding task)

In this part, you are asked to train your model.

1. Set the number of epochs as 100.

2. Do training on GPU.

3. For every epoch, print the average loss per sample in this epoch.

4. You may use `tqdm` to track your progress and help you manage your time.


In [None]:
### WRITE YOUR SOLUTION HERE ###

num_epochs = 100
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_CLIP.to(device)

for epoch in tqdm(range(num_epochs)):
    model_CLIP.train()
    loss_cum = 0
    for image_batch, token_id_batch, attention_mask_batch in CLIP_dataloader:
        optimizer.zero_grad() # Clear gradients for every batch so they do not accumulate across batches
        image_batch = image_batch.to(device)
        token_id_batch = token_id_batch.to(device)
        attention_mask_batch = attention_mask_batch.to(device)
        image_embedding, text_embedding = model_CLIP(image_batch, token_id_batch, attention_mask_batch)
        loss = CLIP_loss_fn(image_embedding, text_embedding)
        loss.backward()
        optimizer.step()
        loss_cum += loss.item() * image_batch.shape[0]
    loss = loss_cum / len(CLIP_dateset)
    print(f'Epoch {epoch}, Loss: {loss}')

""" END OF THIS PART """

---

So far, we have used the cosine function to measure the similarity between two vectors. Next, you are asked to do a theoretical study of its reasonableness.

Your task is to prove the following theorem.

**Theorem:**

Let $x, y \in \mathbb{R}^d$ be two independent $d$-dim vectors that follow the same multi-variate standard normal distribution $\mathcal{N}(0_d, I_{d \times d})$.

Then for any $\epsilon > 0$, when $d$ is large,

$$P\left( \frac{x^\top y}{\|x\|_2 \|y\|_2} > \epsilon \right) \leq \frac{1}{\epsilon^2 d}.$$

We prove this in multiple steps.

---

## Part 17 (5 points, non-coding task)

First, you are asked to prove the following lemma.

**Lemma 1:**

If $x \sim \mathcal{N}(0_d, I_{d \times d})$, then for any unit vector $\hat{e} \in \mathbb{R}^d$,

$$\hat{e}^\top x \sim \mathcal{N}(0, 1).$$

That is, the projection of $x$ onto $\hat{e}$ is a standard normal random variable.

**Hint:** You can directly use the result that $\hat{e}^\top x$ is normal. Therefore, you only need to prove that $\hat{e}^\top x$ has mean 0 and variance 1.


**Answer:**

First, we have

$$\mathbb{E}[\hat{e}^\top x] = \hat{e}^\top \mathbb{E}[x] = \hat{e}^\top 0_d = 0.$$

Second, we have

$$\text{Var}[\hat{e}^\top x] = \mathbb{E}[(\hat{e}^\top x)^2] - (\mathbb{E}[\hat{e}^\top x])^2 = \mathbb{E}[(\hat{e}^\top x)^2] = \mathbb{E}[\hat{e}^\top x x^\top \hat{e}] = \hat{e}^\top \mathbb{E}[x x^\top] \hat{e} = \hat{e}^\top I_{d \times d} \hat{e} = \hat{e}^\top \hat{e} = 1.$$

Therefore, $\hat{e}^\top x$ is a standard normal random variable.

""" END OF THIS PART """


---

## Part 18 (5 points, non-coding task)

Lemma 1 implies that the projection of $x$ onto any direction is a standard normal. Therefore, all directions are homogeneous.

Therefore,

$$P\left( \frac{x^\top y}{\|x\|_2 \|y\|_2} > \epsilon \right) = P\left( \frac{x^\top y}{\|x\|_2 \|y\|_2} > \epsilon \mid x = \hat{x} \right), \quad \forall \hat{x} \in \mathbb{R}^d.$$

For simplicity, we consider

$$\hat{x} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \mathbb{R}^d.$$

Therefore, we only need to bound

$$P\left( \frac{y_0}{\|y\|_2} > \epsilon \right).$$

By symmetry, it is easy to see that

$$\mathbb{E}\left[ \frac{y_0}{\|y\|_2} \right] = 0.$$

Hence, we get

$$P\left( \frac{y_0}{\|y\|_2} > \epsilon \right) \leq \frac{\text{Var}\left[ \frac{y_0}{\|y\|_2} \right]}{\epsilon^2} = \frac{\mathbb{E}\left[ \frac{y_0^2}{\|y\|_2^2} \right]}{\epsilon^2}$$

$$= \frac{1}{\epsilon^2 d} \mathbb{E}\left[ \frac{y_0^2}{\frac{1}{d} \sum_{i=0}^{d-1} y_i^2} \right]$$

where the inequality follows from Chebyshev's inequality, and the variance equals the second moment because the mean is 0.

To prove the theorem, it therefore suffices to prove the following lemma.

**Lemma 2:**

Let $y_0, \cdots, y_{d-1}$ be independent and identically distributed standard normal random variables. Then for large $d$,

$$\mathbb{E}\left[ \frac{y_0^2}{\frac{1}{d} \sum_{i=0}^{d-1} y_i^2} \right] \approx 1.$$

In this part, your task is to prove this lemma.

**Hint:** It is hard to prove this statement in an exact way. You can make any reasonable approximation.


**Answer:**

Since $y_i$ is a standard normal, $\mathbb{E}[y_i^2] = 1$ and $\text{Var}[y_i^2] = 2$.

For large $d$, the central limit theorem implies

$$\frac{1}{d} \sum_{i=0}^{d-1} y_i^2 \sim \mathcal{N}\left(1, \frac{2}{d}\right).$$

Hence, for large $d$, $\frac{1}{d} \sum_{i=0}^{d-1} y_i^2$ can be approximated as its mean value, 1.

Therefore, for large $d$,

$$\mathbb{E}\left[ \frac{y_0^2}{\frac{1}{d} \sum_{i=0}^{d-1} y_i^2} \right] \approx \mathbb{E}[y_0^2] = 1.$$

""" END OF THIS PART """


---

## Part 19 (5 points, non-coding task)

Lemmas 1 and 2 jointly imply the theorem above. Please use the result in this theorem to explain why it is reasonable to use the cosine function to measure similarity of two embedding vectors and why the latent space needs to be high dimensional (such as 512, 768, 1024).


**Answer:**

The theorem states that, in a high-dimensional space, two independently drawn random vectors are nearly orthogonal with high probability: the chance that their cosine similarity exceeds any fixed $\epsilon > 0$ is at most $\frac{1}{\epsilon^2 d}$, which shrinks as $d$ grows.

This is exactly what we want in matching images and texts. For instance, suppose there are 30k pairs of images and texts. For each image embedding vector, we want it to be aligned with only one text embedding vector, but nearly orthogonal to the other 30k-1 text embedding vectors. The theorem shows that a high-dimensional space has room for this: unrelated vectors have cosine similarity close to 0 by default, so a large cosine similarity is a meaningful signal of a true match.

Recall that a key condition of the above theorem is that the dimension must be high. Therefore, the image and text embedding vectors must be high dimensional.

""" END OF THIS PART """


---

## Part 20 (5 points, non-coding task)

In the loss function, we introduced a crucial learnable parameter $\tau$, called temperature.

Let us explore some properties of $\tau$.

Let $z_0 > z_1 > \cdots > z_{N-1}$.

Define

$$f_i = \frac{\exp(z_i / \tau)}{\sum_{j=0}^{N-1} \exp(z_j / \tau)}.$$

Do the following analysis. Reasoning is required.

1. Compute $\lim_{\tau \to 0^+} f_i$.

2. Compute $\lim_{\tau \to \infty} f_i$.


**Answer:**

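1. Multiplying the numerator and the denominator by $\exp(-z_0 / \tau)$, we have

$$\lim_{\tau \to 0^+} f_i = \lim_{\tau \to 0^+} \frac{\exp(z_i / \tau)}{\sum_{j=0}^{N-1} \exp(z_j / \tau)} = \lim_{\tau \to 0^+} \frac{\exp((z_i - z_0) / \tau)}{\sum_{j=0}^{N-1} \exp((z_j - z_0) / \tau)} = \begin{cases} 1 & \text{if } i = 0 \\ 0 & \text{if } i \neq 0 \end{cases}.$$

Here $z_j - z_0 < 0$ for every $j \neq 0$, so $\exp((z_j - z_0) / \tau) \to 0$ as $\tau \to 0^+$, while the $j = 0$ term always equals 1. Hence the denominator tends to 1, and the numerator tends to 1 if $i = 0$ and to 0 otherwise.

2. For every $j$, $z_j / \tau \to 0$ as $\tau \to \infty$, so each exponential tends to 1. Hence

$$\lim_{\tau \to \infty} f_i = \lim_{\tau \to \infty} \frac{\exp(z_i / \tau)}{\sum_{j=0}^{N-1} \exp(z_j / \tau)} = \frac{1}{\sum_{j=0}^{N-1} 1} = \frac{1}{N}.$$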

""" END OF THIS PART """
