# Interview Question Bank w/ Answers

## Machine Learning 🤖

1. Write a top-k sampling/nucleus sampling function for GPT in numpy
2. Write the transformer layer with PyTorch
3. Debug distributed numpy NN implementation
4. Write a data processing pipeline with pandas/numpy
5. Implement LR with numpy
6. Implement a model from a white paper

### 1. Write a top-k sampling/nucleus sampling function for GPT in numpy
---

First, what is top-k sampling? Here are some articles:
- [Article from Hugging Face](https://huggingface.co/blog/how-to-generate)
- [The Large Language Model Playbook](https://cyrilzakka.github.io/llm-playbook/nested/topk.html)


Here is what I got asking GPT-4 the question "What is a top-k sampling method in regards to GPT?"

Top-k sampling is a technique used in text generation models like GPT (Generative Pre-trained Transformer) to produce more coherent and contextually relevant text. This method is a way to balance randomness and determinism in the text generation process. Here's a detailed explanation:

1. **Understanding the Basics**:
    - In text generation, the model predicts the next word in a sequence based on the probability distribution over the vocabulary given the current context (the words that have already been generated).
    - This probability distribution is typically very broad, meaning many words could be potentially chosen as the next word, some of which might make sense and others which might not.

2. **How Top-k Sampling Works**:
    - Top-k sampling narrows down this choice. Instead of considering the entire vocabulary, it limits the possible next-word choices to the top 'k' most likely words as predicted by the model.
    - 'k' is a parameter that you choose. A smaller 'k' makes the text generation less random (as fewer choices are considered) and a larger 'k' makes it more random.
    - Once the top 'k' words are selected, one of them is randomly chosen based on their probability distribution. This word then becomes the next word in the generated text.

3. **Benefits**:
    - This method helps in reducing the randomness of the output by not considering highly improbable words, which leads to more coherent and contextually appropriate text.
    - It also helps in avoiding the 'garbage text' problem, where the model might output nonsensical or irrelevant text, especially in longer sequences.

4. **Drawbacks**:
    - If 'k' is set too low, the model might become too deterministic and repetitive, as it will always choose among the most likely words.
    - If 'k' is set too high, the model might still generate irrelevant or nonsensical text.

5. **Comparison with Other Sampling Methods**:
    - Top-k sampling is one approach among several used to control text generation. Another common method is 'temperature sampling', where the probability distribution is adjusted to control the randomness of the choice.
    - There's also 'nucleus sampling' or 'top-p sampling', which is a more dynamic approach. Instead of choosing a fixed number of top candidates ('k'), it chooses the smallest set of words whose cumulative probability exceeds a threshold 'p'. This method adapts to the model’s confidence in its predictions.

In summary, top-k sampling in GPT models is a technique to make the text generation process more controlled and relevant, by limiting the next-word choices to a subset of most probable words. It strikes a balance between creativity and coherence, which is crucial in applications like chatbots, story generation, and other natural language processing tasks.
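
The temperature sampling mentioned in the comparison can be illustrated with a small sketch. This is not from any particular GPT codebase: the function name `softmax_with_temperature` and the example logits are made up for illustration, and it assumes the common convention of dividing the logits by the temperature before the softmax.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature (hypothetical helper for illustration)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()  # shift so the largest value is 0 (avoids overflow)
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # made-up logits for three tokens
cold = softmax_with_temperature(logits, temperature=0.5)     # sharper, less random
neutral = softmax_with_temperature(logits, temperature=1.0)  # plain softmax
hot = softmax_with_temperature(logits, temperature=2.0)      # flatter, more random

print(cold, neutral, hot)
```

A temperature below 1 concentrates probability on the most likely token, while a temperature above 1 spreads it out. Temperature can be combined with top-k by scaling the logits before the top-k cut.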

Let's break down the following implementation of top-k sampling.

In [50]:
import numpy as np
np.random.seed(42)

def top_k_sampling(logits, k):
    top_k_indices = np.argsort(logits)[-k:]  # Get indices of top-k logits
    top_k_logits = logits[top_k_indices]  # Get the top-k logits
    top_k_probs = np.exp(top_k_logits) / np.sum(np.exp(top_k_logits))  # Convert logits to probabilities
    selected_index = np.random.choice(top_k_indices, p=top_k_probs)  # Sample from the top-k indices based on the probabilities
    return selected_index

The arguments are `logits` and `k`, where `k` is the number of candidates to sample from. What are the `logits`, though?
Logits are the raw, un-normalized outputs of the last layer of a neural network; each one can be any number in $\mathbb{R}$.
They are the values before any activation function (like softmax) is applied. The activation function transforms the logits into a more interpretable form, typically probabilities.
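
To make the logits-to-probabilities step concrete, here is a minimal softmax sketch. The helper name `softmax` and the input values are illustrative. Unlike the one-line softmax in `top_k_sampling` above, this version subtracts the maximum logit first. That leaves the result mathematically unchanged but keeps `np.exp` from overflowing on very large logits.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # largest entry becomes 0, so exp() is at most 1
    exp = np.exp(shifted)
    return exp / exp.sum()

# Large logits like these would overflow with a naive exp(); the shifted version stays finite.
probs = softmax([1000.0, 999.0, 998.0])
print(probs, probs.sum())
```

For the modest logit values used in this walkthrough, the naive and shifted versions agree, so the shift only matters as a safeguard.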


`top_k_indices = np.argsort(logits)[-k:]`
Here we get the indices of the top-k logits. First, let's understand the function `argsort`. This is [from the documentation](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html):
```
numpy.argsort(a, axis=-1, kind=None, order=None)
    Returns the indices that would sort an array.

    Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as a that index data along the given axis in sorted order.
```

Let's see what `argsort` is doing.

In [51]:
random_logits = np.random.uniform(-100.0, 100.0, (10,))
display(random_logits)
list_indicies = np.argsort(random_logits)
list_indicies

array([-25.09197623,  90.14286128,  46.39878836,  19.73169684,
       -68.79627191, -68.80109593, -88.38327757,  73.23522915,
        20.22300235,  41.61451556])

array([6, 5, 4, 0, 3, 8, 9, 2, 7, 1], dtype=int64)

Here is an example of what `[-k:]` is doing: it grabs the last `k` elements of the array. Since `argsort` returns the indices in ascending order of value, the last 3 indices point to the 3 highest values.

In [52]:
list_indicies = list_indicies[-3:]
list_indicies

array([2, 7, 1], dtype=int64)
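
As an aside, a full sort does more work than top-k selection strictly needs. The sketch below is one possible alternative using `np.argpartition`, which only guarantees that the `k` largest entries end up in the last `k` positions. The helper name `top_k_indices` and the rounded logits are illustrative, and it assumes `1 <= k <= len(logits)`.

```python
import numpy as np

def top_k_indices(logits, k):
    """Indices of the k largest logits, ordered from smallest to largest logit."""
    logits = np.asarray(logits)
    # Partial sort: the last k positions hold the k largest entries, in no particular order.
    unordered = np.argpartition(logits, -k)[-k:]
    # Sort only those k entries so the order matches np.argsort(logits)[-k:].
    return unordered[np.argsort(logits[unordered])]

# Rounded version of the random logits used in this walkthrough.
logits = np.array([-25.1, 90.1, 46.4, 19.7, -68.8, -68.9, -88.4, 73.2, 20.2, 41.6])
print(top_k_indices(logits, 3))  # [2 7 1]
```

For a 10-element array the difference is negligible. It may matter more for a GPT-sized vocabulary, where `k` is tiny compared with the number of logits.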

Next, get the top `k` logits by indexing with those indices:

In [53]:
top_k_random_logits = random_logits[list_indicies]
top_k_random_logits

array([46.39878836, 73.23522915, 90.14286128])

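Next we convert the logits to probabilities with a softmax over only the top `k` logits. Because these random logits are far apart, almost all of the probability mass ends up on the largest one.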

In [54]:
top_k_probs = np.exp(top_k_random_logits) / np.sum(np.exp(top_k_random_logits))
top_k_probs

array([1.00505654e-19, 4.54055191e-08, 9.99999955e-01])

Finally, we sample one index from the top `k` indices according to those probabilities.

In [55]:
selected_index = np.random.choice(list_indicies, p=top_k_probs)
print(f"Selected Index: {selected_index}")
print(f"Selected value: {random_logits[selected_index]}")

Selected Index: 1
Selected value: 90.14286128198324
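
The question also asks about nucleus (top-p) sampling, which the walkthrough above does not implement. The following is a minimal sketch under a few assumptions: the function name `nucleus_sampling`, the optional `rng` parameter, and the example probabilities are all illustrative, and the cutoff keeps the smallest set of tokens whose cumulative probability reaches `p` (implementations differ on whether the boundary is `>=` or `>`).

```python
import numpy as np

def nucleus_sampling(logits, p, rng=None):
    """Sample an index from the smallest set of tokens whose cumulative probability reaches p."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)

    # Stable softmax over the full vocabulary.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]       # indices from most to least likely
    cumulative = np.cumsum(probs[order])  # running total of sorted probabilities

    # First position where the running total reaches p; keep everything up to and including it.
    cutoff = min(np.searchsorted(cumulative, p) + 1, len(order))
    nucleus = order[:cutoff]

    # Renormalize inside the nucleus and sample from it.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

# Made-up distribution: token 0 has 50%, token 1 has 30%, and so on.
logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
rng = np.random.default_rng(0)
print(nucleus_sampling(logits, p=0.7, rng=rng))  # always 0 or 1 for this distribution
```

With `p=0.7` the nucleus here is tokens 0 and 1, since 0.5 alone falls short of 0.7 and 0.5 + 0.3 passes it. Unlike top-k, the size of the candidate set shrinks when the model is confident and grows when it is not.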


### 2. Write the transformer layer with PyTorch
---