**Table of contents**<a id='toc0_'></a>    
- [Playbook: Sampling Methodologies in Large Language Models](#toc1_)    
  - [Sampling Methodologies (Deep Learning: Foundations and Concepts)](#toc1_1_)    
  - [Multinomial](#toc1_2_)    
  - [Categorical Distribution Overview](#toc1_3_)    
  - [Multinomial Distribution](#toc1_4_)    
  - [Formalizing Multinomial Sampling](#toc1_5_)    
  - [Greedy vs Probabilistic sampling](#toc1_6_)    
- [How to Derive Sigmoid and Softmax Functions from Exponential Family in Machine Learning Context](#toc2_)    
  - [Bernoulli Distribution as an Exponential Family Member](#toc2_1_)    
  - [Expressing Bernoulli in Exponential Family Form](#toc2_2_)    
  - [Derivation of the Sigmoid Function](#toc2_3_)    
- [Softmax -> CrossEntropyLoss -> KL Divergence](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Playbook: Sampling Methodologies in Large Language Models](#toc0_)

The main question to answer: why does a high temperature make an LLM's output
more random and a low temperature make it more deterministic? Greedy sampling is
deterministic, while multinomial sampling is random and depends on the softmax
distribution. But softmax preserves the order of the logits at every
temperature, so where does the extra randomness come from? Order preservation is
exactly why greedy sampling stays deterministic: the argmax never changes.
Multinomial sampling preserves the order too, yet it draws from the whole
distribution, so the key lies in how temperature sharpens or dampens the softmax
distribution. If the distribution is sharp, it is usually dominated by one value
(say 0.99), so sampling from it converges to greedy sampling, i.e.
deterministic. If it is dampened, it is closer to uniform, so sampling from it
becomes more diverse. It only becomes exactly uniform in the limit
$T \rightarrow \infty$, not at any finite temperature.

-   softmax (with temperature) sharpens/dampens the distribution
-   multinomial sampling introduces randomness
-   greedy sampling is deterministic
-   multinomial sampling as $T \rightarrow 0$ converges to greedy, i.e.
    deterministic, sampling

And since the softmax function is not invariant under scaling of its inputs, we
can introduce a temperature parameter $T$ to control the entropy of the output
distribution, because temperature is effectively a scaling: the logits are
divided by $T$ before softmax is applied. The temperature is thus a way to
control the entropy of a distribution while preserving the relative ranks of
each event.
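A small numerical sketch of this point, using only the Python standard library. The logits are made-up values, and the helper names (`softmax`, `entropy`, `ranking`) are illustrative rather than taken from any library. It shows the entropy growing with $T$ while the ranking of the events stays the same.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, stabilized by subtracting the max."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ranking(values):
    """Indices ordered from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for a 4-token vocabulary

for T in (0.1, 0.5, 1.0, 2.0, 10.0):
    probs = softmax(logits, T)
    print(f"T={T:>4}: probs={[round(p, 3) for p in probs]} "
          f"entropy={entropy(probs):.3f} ranking={ranking(probs)}")
```

The printed ranking is identical on every row; only the spread of the probabilities changes.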


## <a id='toc1_1_'></a>[Sampling Methodologies (Deep Learning: Foundations and Concepts)](#toc0_)

We have seen that the output of a decoder transformer is a probability
distribution over values for the next token in the sequence, from which a
particular value for that token must be chosen to extend the sequence. There are
several options for selecting the value of the token based on the computed
probabilities (Holtzman et al., 2019). One obvious approach, called greedy
search, is simply to select the token with the highest probability. This has the
effect of making the model deterministic, in that a given input sequence always
generates the same output sequence. Note that simply choosing the highest
probability token at each stage is not the same as selecting the highest
probability sequence of tokens. To find the most probable sequence, we would
need to maximize the joint distribution over all tokens, which is given by

$$
p\left(\mathbf{y}_1, \ldots, \mathbf{y}_N\right)=\prod_{n=1}^N p\left(\mathbf{y}_n \mid \mathbf{y}_1, \ldots, \mathbf{y}_{n-1}\right)
$$

If there are $N$ steps in the sequence and the number of token values in the
dictionary is $K$ then the total number of sequences is
$\mathcal{O}\left(K^N\right)$, which grows exponentially with the length of the
sequence, and hence finding the single most probable sequence is infeasible. By
comparison, greedy search has cost $\mathcal{O}(K N)$, which is linear in the
sequence length.
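A toy sketch of the gap between greedy search and the most probable sequence. The two-token vocabulary and the conditional probabilities in `next_token_probs` are invented for illustration. Exhaustive search is only feasible here because $K^N = 4$.

```python
from itertools import product

VOCAB = ("A", "B")

def next_token_probs(prefix):
    """Hypothetical p(y_n | y_1, ..., y_{n-1}) for a toy model."""
    if len(prefix) == 0:
        return {"A": 0.6, "B": 0.4}
    if prefix[-1] == "A":
        return {"A": 0.55, "B": 0.45}
    return {"A": 0.9, "B": 0.1}

def sequence_prob(seq):
    """Joint probability as the product of the per-step conditionals."""
    prob = 1.0
    for n, token in enumerate(seq):
        prob *= next_token_probs(seq[:n])[token]
    return prob

def greedy_search(num_steps):
    """Pick the most probable token at each step: O(K N) work."""
    seq = ()
    for _ in range(num_steps):
        probs = next_token_probs(seq)
        seq += (max(probs, key=probs.get),)
    return seq

def exhaustive_search(num_steps):
    """Score all K**N sequences and keep the most probable one."""
    return max(product(VOCAB, repeat=num_steps), key=sequence_prob)

greedy = greedy_search(2)
best = exhaustive_search(2)
print("greedy:    ", greedy, sequence_prob(greedy))
print("exhaustive:", best, sequence_prob(best))
```

Greedy search commits to `A` at the first step because 0.6 > 0.4, but the sequence starting with `B` ends up with the higher joint probability.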

One technique that has the potential to generate higher probability sequences
than greedy search is called beam search. Instead of choosing the single most
probable token value at each step, we maintain a set of $B$ hypotheses, where
$B$ is called the beam width, each consisting of a sequence of token values up
to step $n$. We then feed all these sequences through the network, and for each
sequence we find the $B$ most probable token values, thereby creating $B^2$
possible hypotheses for the extended sequence. This list is then pruned by
selecting the most probable $B$ hypotheses according to the total probability of
the extended sequence. Thus, the beam search algorithm maintains $B$ alternative
sequences and keeps track of their probabilities, finally selecting the most
probable sequence amongst those considered. Because the probability of a
sequence is obtained by multiplying the probabilities at each step of the
sequence and since these probabilities are always less than or equal to one, a
long sequence will generally have a lower probability than a short one, biasing
the results towards short sequences. For this reason the sequence probabilities
are generally normalized by the corresponding lengths of the sequence before
making comparisons. Beam search has cost $\mathcal{O}(B K N)$, which is again
linear in the sequence length. However, the cost of generating a sequence is
increased by a factor of $B$, and so for very large language models, where the
cost of inference can become significant, this makes beam search much less
attractive.
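The procedure above can be sketched as follows, under simplifying assumptions. The three-token model in `next_token_probs` is hypothetical, scores are kept as log probabilities to avoid underflow, and length normalization is omitted because every hypothesis here has the same length.

```python
import math

def next_token_probs(prefix):
    """Hypothetical toy model over the vocabulary {"A", "B", "C"}."""
    if not prefix:
        return {"A": 0.5, "B": 0.4, "C": 0.1}
    if prefix[-1] == "A":
        return {"A": 0.4, "B": 0.35, "C": 0.25}
    if prefix[-1] == "B":
        return {"A": 0.9, "B": 0.06, "C": 0.04}
    return {"A": 0.3, "B": 0.1, "C": 0.6}

def beam_search(num_steps, beam_width):
    """Return the final beam as (sequence, log probability) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(num_steps):
        candidates = []
        for seq, logp in beams:
            probs = next_token_probs(seq)
            # Expand each hypothesis with its beam_width most probable tokens.
            top = sorted(probs, key=probs.get, reverse=True)[:beam_width]
            for token in top:
                candidates.append((seq + (token,), logp + math.log(probs[token])))
        # Prune the (at most) beam_width**2 candidates back to beam_width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

for width in (1, 2):
    seq, logp = beam_search(num_steps=2, beam_width=width)[0]
    print(f"B={width}: best={seq} prob={math.exp(logp):.3f}")
```

With `beam_width=1` this reduces to greedy search. With `beam_width=2` the hypothesis starting with `B` survives the first pruning step and wins at the second.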

One problem with approaches such as greedy search and beam search is that they
limit the diversity of potential outputs and can even cause the generation
process to become stuck in a loop, where the same sub-sequence of words is
repeated over and over. As can be seen in Figure 12.17 of the book, human-generated text may
have lower probability and hence be more surprising with respect to a given
model than automatically generated text.

Instead of trying to find a sequence with the highest probability, we can
instead generate successive tokens simply by sampling from the softmax
distribution at each step. However, this can lead to sequences that are
nonsensical. This arises from the typically very large size of the token
dictionary, in which there is a long tail of many token states each of which has
a very small probability but which in aggregate account for a significant
fraction of the total probability mass. This leads to the problem in which there
is a significant chance that the system will make a bad choice for the next
token.

As a balance between these extremes, we can consider only the states having the
top $K$ probabilities, for some choice of $K$, and then sample from these
according to their renormalized probabilities. A variant of this approach,
called top- $p$ sampling or nucleus sampling, calculates the cumulative
probability of the top outputs until a threshold is reached and then samples
from this restricted set of token states.
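A minimal sketch of both truncation schemes, assuming the probabilities are already available as a plain list indexed by token id. The function names and the example distribution are illustrative. In this sketch the nucleus includes the token that first pushes the cumulative probability to the threshold.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the most probable token ids until their cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample_from(filtered, rng):
    """Draw one token id according to the renormalized probabilities."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # hypothetical next-token distribution
rng = random.Random(0)

top_k = top_k_filter(probs, k=2)
top_p = top_p_filter(probs, p=0.8)
print("top-k support:", sorted(top_k), "sample:", sample_from(top_k, rng))
print("top-p support:", sorted(top_p), "sample:", sample_from(top_p, rng))
```

Top-$K$ always keeps a fixed number of states, whereas the size of the top-$p$ set adapts to how peaked the distribution is.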

A 'softer' version of top- $K$ sampling is to introduce a parameter $T$ called
temperature into the definition of the softmax function (Hinton, Vinyals, and
Dean, 2015) so that

$$
y_i=\frac{\exp \left(a_i / T\right)}{\sum_j \exp \left(a_j / T\right)}
$$

and then sample the next token from this modified distribution. When $T=0$, the
probability mass is concentrated on the most probable state, with all other
states having zero probability, and hence this becomes greedy selection. For
$T=1$, we recover the unmodified softmax distribution, and as
$T \rightarrow \infty$, the distribution becomes uniform across all states. By
choosing a value in the range $0<T<1$, the probability is concentrated towards
the higher values.
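The two limiting cases can be checked with a short worked derivation. It assumes the largest activation $a_{\max} = \max_j a_j$ is attained by a single index $i^{\star}$, and reads $T=0$ as the limit $T \rightarrow 0^{+}$, since the expression itself is undefined at $T=0$. Multiplying the numerator and denominator by $\exp(-a_{\max}/T)$ gives

$$
y_i=\frac{\exp \left(\left(a_i-a_{\max}\right) / T\right)}{\sum_j \exp \left(\left(a_j-a_{\max}\right) / T\right)}
$$

Every exponent is non-positive and equals zero only for $i^{\star}$. As $T \rightarrow 0^{+}$, the exponents of all other states tend to $-\infty$, and as $T \rightarrow \infty$ every exponent tends to $0$, so

$$
\lim_{T \rightarrow 0^{+}} y_i=\begin{cases}1 & \text{if } i=i^{\star} \\ 0 & \text{otherwise}\end{cases}, \qquad \lim_{T \rightarrow \infty} y_i=\frac{1}{K}
$$

where $K$ here denotes the number of token states in the dictionary.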

One challenge with sequence generation is that during the learning phase, the
model is trained on a human-generated input sequence, whereas when it is running
generatively, the input sequence is itself generated from the model. This means
that the model can drift away from the distribution of sequences seen during
training.


## <a id='toc1_2_'></a>[Multinomial](#toc0_)

In multinomial (or probabilistic) sampling, the model samples from the entire
probability distribution obtained after applying softmax, rather than always
picking the most probable token.

At higher temperatures, the probability distribution becomes more uniform,
increasing the likelihood of sampling less probable tokens, thereby introducing
more randomness or diversity into the selection process. At lower temperatures,
the distribution becomes sharper, concentrating most of the probability mass on
a few high-probability tokens. This makes the selection less random and more
predictable, closely aligning with the greedy selection outcome but still
allowing for some variability.

The key is to understand the role the multinomial distribution plays in the
sampling.

## <a id='toc1_3_'></a>[Categorical Distribution Overview](#toc0_)

The categorical distribution is a probability distribution that describes the
result of a random event that can take on one of $K$ possible outcomes, with
each outcome having its own probability. It is the generalization of the
Bernoulli distribution for variables with more than two states.

Let $Y$ be a discrete random variable that can take on $K$ different states. We
say $Y$ follows a categorical distribution with parameters
$\boldsymbol{\pi} = (\pi_1, \pi_2, ..., \pi_K)$ if the probability of $Y$ taking
on the value $k$ is given by:

$$\mathbb{P}(Y=k)=\pi_k \quad \text{for } k=1,2, \ldots, K$$

Here, $\pi_k$ represents the probability of the event that $Y=k$, with the
constraint that $\sum_{k=1}^K \pi_k = 1$.

The probability mass function (PMF) of $Y$, evaluated at an outcome $y$, is then
compactly defined as:

$$\mathbb{P}(Y=y)=\prod_{k=1}^K \pi_k^{I\{y=k\}}$$

where $I\{y=k\}$ is the indicator function, equal to 1 if $y=k$ and 0 otherwise.
This PMF formulation concisely represents the categorical distribution,
highlighting that each $\pi_k$ contributes to the probability mass function only
when $y=k$.
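A short sketch of the categorical distribution in code, assuming outcomes are labelled $1, \ldots, K$ as above. The PMF is written in the product-of-indicators form, and sampling uses the inverse-CDF method, which is one common way to draw from a categorical distribution. The parameter vector is made up.

```python
import random
from collections import Counter

def categorical_pmf(y, pi):
    """Product over k of pi_k raised to the indicator that the outcome equals k."""
    prob = 1.0
    for k, pi_k in enumerate(pi, start=1):
        prob *= pi_k ** (1 if y == k else 0)
    return prob

def sample_categorical(pi, rng):
    """Inverse-CDF sampling: the first k whose cumulative probability exceeds u."""
    u = rng.random()
    cumulative = 0.0
    for k, pi_k in enumerate(pi, start=1):
        cumulative += pi_k
        if u < cumulative:
            return k
    return len(pi)  # guard against floating-point round-off in the sum

pi = [0.2, 0.5, 0.3]  # hypothetical parameters, K = 3
rng = random.Random(0)
draws = [sample_categorical(pi, rng) for _ in range(10_000)]
counts = Counter(draws)

for k in range(1, len(pi) + 1):
    print(f"k={k}: pmf={categorical_pmf(k, pi):.2f} "
          f"empirical={counts[k] / len(draws):.3f}")
```

The empirical frequencies should sit close to the PMF values, with the gap shrinking as the number of draws grows.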

## <a id='toc1_4_'></a>[Multinomial Distribution](#toc0_)

The multinomial distribution generalizes the categorical distribution to
multiple independent trials, each of which results in a categorically
distributed outcome.

Consider a random experiment that results in one of $K$ possible outcomes, with
each outcome having a fixed probability $\pi_k$, where
$\sum_{k=1}^K \pi_k = 1$.
If we repeat the experiment $n$ times independently, the vector
$\mathbf{Y} = (Y_1, Y_2, ..., Y_K)$ describing the number of times each outcome
occurs follows a multinomial distribution with parameters $n$ and
$\boldsymbol{\pi}$.

The probability of observing a specific outcome vector
$\mathbf{y} = (y_1, y_2, ..., y_K)$, where $\sum_{k=1}^K y_k = n$, is given by:

$$
\mathbb{P}(\mathbf{Y}=\mathbf{y}; n, \boldsymbol{\pi}) =
\frac{n!}{y_1!y_2!\cdots y_K!} \prod_{k=1}^K \pi_k^{y_k}
$$

In this formulation:

-   $n$ is the total number of trials.
-   $y_k$ is the number of times outcome $k$ occurs in $n$ trials.
-   $\pi_k$ is the probability of outcome $k$ occurring in a single trial.
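The formula can be evaluated directly. The sketch below uses a hypothetical helper `multinomial_pmf` and made-up parameters. It also illustrates that a single trial ($n=1$) with a one-hot count vector recovers the categorical probability.

```python
import math

def multinomial_pmf(counts, pi):
    """n! / (y_1! ... y_K!) * prod_k pi_k ** y_k, with n = sum(counts)."""
    n = sum(counts)
    coefficient = math.factorial(n)
    for y_k in counts:
        coefficient //= math.factorial(y_k)  # stays an exact integer
    prob = float(coefficient)
    for y_k, pi_k in zip(counts, pi):
        prob *= pi_k ** y_k
    return prob

pi = [0.2, 0.5, 0.3]  # hypothetical outcome probabilities, K = 3

# Three trials, each outcome observed once.
print(multinomial_pmf((1, 1, 1), pi))

# One trial with a one-hot count vector: the categorical case.
print(multinomial_pmf((0, 1, 0), pi))
```

Sampling the next token in a language model corresponds to the second case: one trial, one outcome.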

---

The multinomial distribution is a generalization of the binomial distribution.
It models the probabilities of observing counts among multiple categories and is
parametrized by the number of trials $n$ and the probabilities
$\pi_1, \pi_2, \ldots, \pi_K$ of the $K$ outcomes. These probabilities must
satisfy two conditions:

1. $0 \leq \pi_k \leq 1$ for all $k$,
2. $\sum_{k=1}^{K} \pi_k = 1$.

Given a single trial, the probability of outcome $k$ occurring is $\pi_k$. When
sampling from a multinomial distribution, each trial (or draw) is independent,
and the probability of observing a specific outcome follows the distribution
defined by $\boldsymbol{\pi}$.

## <a id='toc1_5_'></a>[Formalizing Multinomial Sampling](#toc0_)

In the context of generative models, after computing the logits (or scores)
$z_i$ for each token $i$ in a vocabulary of size $n$, the softmax function at
temperature $T$ is applied to obtain probabilities:

$$p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{n} \exp(z_j / T)}$$

Here, $p_i$ represents the probability of selecting token $i$ as the next token
in the sequence, forming a probability distribution
$\pi = [p_1, p_2, \ldots, p_n]$ over the vocabulary.

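Given $\pi$, multinomial sampling draws a sample $s$ where $P(s=i) = p_i$, i.e.
a single trial of a multinomial (equivalently, one categorical draw). Repeating
this step, with each sampled token used to extend the sequence before the next
distribution is computed, generates a sequence of tokens.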
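A minimal sketch of one sampling step, assuming the logits are given as a plain list and using `random.choices` from the standard library for the draw. The logits are made up. The loop measures how often the sampled token coincides with the greedy choice at a few temperatures.

```python
import math
import random

def softmax(logits, temperature):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), stabilized by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index s with P(s = i) = p_i."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for a 4-token vocabulary
rng = random.Random(0)
greedy_token = max(range(len(logits)), key=lambda i: logits[i])

agreement = {}
for T in (0.05, 1.0, 5.0):
    draws = [sample_token(logits, T, rng) for _ in range(5_000)]
    agreement[T] = sum(d == greedy_token for d in draws) / len(draws)
    print(f"T={T}: fraction of draws equal to the greedy token = {agreement[T]:.3f}")
```

At the lowest temperature nearly every draw matches the greedy token, and the fraction drops as $T$ grows and the distribution flattens.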


## <a id='toc1_6_'></a>[Greedy vs Probabilistic sampling](#toc0_)

If your model employs a greedy strategy for selecting tokens (i.e., always
choosing the token with the highest probability), then adjusting the temperature
won't change the selected token, because dividing the logits by any $T > 0$
preserves their order. This approach is common in tasks where precision is
critical, and the aim is to reduce randomness to a minimum, such as in certain
classification tasks or when generating text where maximum coherence is desired.
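The invariance of the greedy choice follows from a short chain of equivalences, written here with logits $z_i$ and probabilities $p_i$ for any fixed $T > 0$. Division by a positive constant and the exponential are both strictly increasing, and all $p_i$ share the same positive normalizer, so

$$
z_i > z_j \iff \frac{z_i}{T} > \frac{z_j}{T} \iff \exp\left(\frac{z_i}{T}\right) > \exp\left(\frac{z_j}{T}\right) \iff p_i > p_j
$$

and therefore $\arg\max_i p_i = \arg\max_i z_i$ at every temperature $T > 0$.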

# <a id='toc2_'></a>[How to Derive Sigmoid and Softmax Functions from Exponential Family in Machine Learning Context](#toc0_)

REFERENCE:

-   6.2.2.2, 6.2.2.3 of The Deep Learning Book by Goodfellow et al.
-   3.4. of Deep Learning: Foundations and Concepts by Bishop and Bishop.

SAMPLE CONTENT BELOW (TO BE REFINED): We can derive via exponential family.

## <a id='toc2_1_'></a>[Bernoulli Distribution as an Exponential Family Member](#toc0_)

The Bernoulli distribution can be expressed in the exponential family form as
follows:

$$
p(y | \eta) = b(y) \exp(\eta^T y - A(\eta))
$$

For the Bernoulli distribution:

-   $y$ is the binary outcome (0 or 1).
-   $\eta$ (or $\theta$ in some formulations) is the natural parameter of the
    distribution.
-   The base measure $b(y) = 1$, since it doesn't affect the form of the
    Bernoulli distribution.
-   The sufficient statistic $y$ is the identity function of the outcome.

To match the Bernoulli distribution to the exponential family form, we recognize
that $p(y | \eta)$ for $y \in \{0,1\}$ is given by
$p(y = 1 | \eta) = \sigma(\eta)$ and $p(y = 0 | \eta) = 1 - \sigma(\eta)$, where
$\sigma(\eta)$ is the sigmoid function. The Bernoulli distribution can be
written as:

$$
p(y | \eta) = \sigma(\eta)^y (1 - \sigma(\eta))^{1-y}
$$

## <a id='toc2_2_'></a>[Expressing Bernoulli in Exponential Family Form](#toc0_)

Let's express the Bernoulli distribution in the form that highlights the
exponential family structure. To do this, note that the probability of success
$p = \sigma(\eta)$ where $\sigma(\eta)$ is the sigmoid function. By definition:

$$
\sigma(\eta) = \frac{1}{1 + e^{-\eta}}
$$

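So, substituting into $p(y | \eta) = \sigma(\eta)^y (1 - \sigma(\eta))^{1-y}$
and using $1 - \sigma(\eta) = \frac{1}{1 + e^{\eta}}$, we have: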

$$
p(y | \eta) = \frac{e^{\eta y}}{1 + e^{\eta}}
$$

where the natural parameter $\eta = \log\left(\frac{p}{1-p}\right)$, and
$A(\eta) = -\log(1 - p) = \log(1 + e^{\eta})$.
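For completeness, the same result can be reached starting from the mean parameter $p$ instead of assuming the sigmoid up front. This worked derivation uses only the definitions above, with $b(y) = 1$:

$$
\begin{aligned}
p^y (1-p)^{1-y}
&= \exp\big(y \log p + (1-y) \log(1-p)\big) \\
&= \exp\left(y \log\left(\frac{p}{1-p}\right) + \log(1-p)\right) \\
&= \exp\big(\eta y - A(\eta)\big)
\end{aligned}
$$

Matching terms gives $\eta = \log\left(\frac{p}{1-p}\right)$ and $A(\eta) = -\log(1-p)$. Since $1 - p = \frac{1}{1 + e^{\eta}}$, this is $A(\eta) = \log(1 + e^{\eta})$, and $\exp(\eta y - A(\eta)) = \frac{e^{\eta y}}{1 + e^{\eta}}$.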

## <a id='toc2_3_'></a>[Derivation of the Sigmoid Function](#toc0_)

The sigmoid function is derived as the transformation of the natural parameter
$\eta$ back to the probability $p$. From the natural parameter, we have
$\eta = \log\left(\frac{p}{1-p}\right)$, solving for $p$ gives us the sigmoid
function:

$$
\eta = \log\left(\frac{p}{1-p}\right) \Rightarrow e^{\eta} = \frac{p}{1-p} \Rightarrow p = \frac{e^{\eta}}{1 + e^{\eta}}
$$

Dividing the numerator and denominator by $e^{\eta}$ gives
$p = \frac{1}{1 + e^{-\eta}}$. By substituting $\eta$ with a linear combination
of features (e.g., $z = \boldsymbol{w}^\top \boldsymbol{x} + b$ in machine
learning), we obtain the sigmoid function used for binary classification:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

In this manner, the sigmoid function is derived as a special case of applying
the exponential family formulation to the Bernoulli distribution, with the
natural parameter $\eta$ serving as the link between the linear predictor and
the probabilities of the outcomes.
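A quick numerical check of the derivation, using only the standard library. The function names are illustrative. The sketch confirms that the sigmoid inverts the log-odds and that the exponential family form $\exp(\eta y - A(\eta))$ with $A(\eta) = \log(1 + e^{\eta})$ reproduces the Bernoulli PMF. The branch in `sigmoid` is a common trick to avoid overflow for large negative inputs.

```python
import math

def sigmoid(eta):
    """Sigmoid 1 / (1 + e^{-eta}), evaluated without overflowing math.exp."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def logit(p):
    """Natural parameter eta = log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def bernoulli_pmf(y, p):
    """p^y (1 - p)^(1 - y) for y in {0, 1}."""
    return p ** y * (1.0 - p) ** (1 - y)

def exp_family_pmf(y, eta):
    """exp(eta * y - A(eta)) with base measure 1 and A(eta) = log(1 + e^eta)."""
    return math.exp(eta * y - math.log1p(math.exp(eta)))

for p in (0.1, 0.5, 0.9):
    eta = logit(p)
    print(f"p={p}: eta={eta:+.4f} sigmoid(eta)={sigmoid(eta):.4f} "
          f"exp-family p(y=1)={exp_family_pmf(1, eta):.4f}")
```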


# <a id='toc3_'></a>[Softmax -> CrossEntropyLoss -> KL Divergence](#toc0_)

Building on my moderately comprehensive softmax docs, we can talk about
cross-entropy (CE) loss and KL divergence next.

# Necessary and Sufficient

1. Necessary Condition:

    - If condition A is necessary for condition B, it means that B cannot be
      true unless A is also true.
    - In other words, if B is true, then A must be true as well.
    - Example: "Being a mammal is necessary for being a cat." If an animal is a
      cat, it must be a mammal.

2. Sufficient Condition:

    - If condition A is sufficient for condition B, it means that if A is true,
      then B must be true as well.
    - However, B can still be true even if A is not true.
    - Example: "Being a cat is sufficient for being a mammal." If an animal is a
      cat, it is definitely a mammal. However, not all mammals are cats.

3. Necessary and Sufficient Condition:

    - If condition A is both necessary and sufficient for condition B, it means
      that A and B are equivalent.
    - In other words, B is true if and only if A is true.
    - Example: "Having four sides of equal length and four right angles is
      necessary and sufficient for being a square." A shape is a square if and
      only if it has four equal sides and four right angles.

4. Necessary but Not Sufficient Condition:
    - If condition A is necessary but not sufficient for condition B, it means
      that A must be true for B to be true, but A being true does not guarantee
      that B is true.
    - In other words, B cannot be true without A, but the presence of A does not
      automatically make B true.
    - Example: "Being a mammal is necessary but not sufficient for being a cat."
      All cats are mammals, but not all mammals are cats. Being a mammal is a
      requirement for being a cat, but it doesn't automatically make an animal a
      cat.