## 🔥 **Temperature Scaling in Language Models**

Temperature is a hyperparameter that controls how *deterministic* or *random* the next-token predictions are during sampling.

After the model outputs **logits**, we normally apply **softmax** to convert them into probabilities.  
With temperature scaling, we divide the logits by a temperature T > 0 before applying softmax:

$$
p_i = \text{softmax}\left(\frac{\text{logits}}{T}\right)_i
$$
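
As a minimal sketch of the formula above, the following standard-library Python computes a temperature-scaled softmax. The function name `softmax_with_temperature` and the example logits are hypothetical, chosen only for illustration. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability; the probabilities are unchanged.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 1.0, 0.1]  # hypothetical values for illustration
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, T)
    print(T, [round(p, 3) for p in probs])
```

In this sketch the probability of the top token should grow as T drops below 1 and shrink as T rises above 1.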

---

### 🧊 **Low Temperature (T < 1): More Deterministic**
Example: **T = 0.1**

- Differences between logits are *amplified*, so the largest logit dominates
- Probability distribution becomes **sharper**
- Model picks the most likely token almost every time
- Useful for:
  - Technical writing  
  - Math / reasoning  
  - Code generation  
  - Factual responses  
  - When you need consistent, reliable outputs  
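
Why a low temperature sharpens the distribution can be seen from the ratio of two token probabilities. It follows from the softmax definition, because the normalizing sum cancels:

$$
\frac{p_i}{p_j} = \exp\left(\frac{\text{logits}_i - \text{logits}_j}{T}\right)
$$

As an illustration with an assumed logit gap of 0.5: at T = 1 the ratio is $e^{0.5} \approx 1.65$, while at T = 0.1 it is $e^{5} \approx 148$, so the higher-logit token is chosen far more often.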

---

### 🔥 **High Temperature (T > 1): More Creative / Random**
Example: **T = 5**

- Differences between logits shrink, so the scaled logits become **more similar**
- Even low-probability tokens get a chance  
- Probability distribution becomes **flatter**
- Output becomes more diverse and creative
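
One way to put a number on "flatter" is the Shannon entropy of the distribution, which is largest for a uniform distribution. The sketch below is an illustration only: the helper names and the example logits are hypothetical, and entropy is just one possible measure of flatness.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

example_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical values for illustration
max_entropy = math.log(len(example_logits))  # entropy of the uniform distribution

entropies = {
    T: entropy(softmax_with_temperature(example_logits, T))
    for T in (0.1, 1.0, 5.0)
}
for T, h in entropies.items():
    print(f"T={T}: entropy={h:.3f} (max {max_entropy:.3f})")
```

The entropy should rise toward its maximum as T increases, which matches the flattening described above.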

---

### ⭐ **Temperature = 1 → Normal Sampling**
A temperature of **1** leaves the logits unchanged:

$$
\frac{\text{logits}}{1} = \text{logits}
$$

Softmax then produces the model's unmodified probabilities.

The model samples tokens **according to this unscaled probability distribution**.

---

## 🎲 **How Sampling Works (Multinomial Sampling)**

After softmax, we have probabilities.  
We sample using:

```python
next_token_id = torch.multinomial(probas, num_samples=1).item()
```

```python
# Suppose we have this small vocabulary
word2idx = {
    'I': 0,
    'am': 1,
    'learning': 2,
    'continuously': 3,
    'for': 4,
}

# Inverse mapping: token id -> word
idx2word = {i: word for word, i in word2idx.items()}
```

```python
# Create a random vector of logits, one per vocabulary entry
import torch
torch.manual_seed(42)
logits = torch.rand(5)
logits
```

```
tensor([0.8823, 0.9150, 0.3829, 0.9593, 0.3904])
```

```python
# Apply softmax to convert the logits to probabilities
probs = logits.softmax(dim=-1)
probs
```

```
tensor([0.2309, 0.2385, 0.1401, 0.2493, 0.1412])
```

```python
# From the output above, the token at index 3 is the most likely (highest probability).
# To generate text with more variety, we replace argmax with sampling from the
# probability distribution (the scores the LLM assigns to each vocabulary entry
# at each token generation step).
sampling_result = {}
for _ in range(1000):
    sampled_id = torch.multinomial(probs, num_samples=1).item()
    sampled_word = idx2word[sampled_id]

    # Update the count for the sampled word
    if sampled_word not in sampling_result:
        sampling_result[sampled_word] = 0

    sampling_result[sampled_word] += 1

print(sampling_result)
```

```
{'learning': 157, 'am': 223, 'I': 222, 'continuously': 249, 'for': 149}
```

```python
# Now let's try lowering the temperature
temp = 0.01
logits2 = logits / temp
probs2 = logits2.softmax(dim=-1)
print(probs2)

# This increases the probability of the word with the highest logit
# and reduces the others (more deterministic)

sampling_result2 = {}
for _ in range(1000):
    sampled_id = torch.multinomial(probs2, num_samples=1).item()
    sampled_word = idx2word[sampled_id]

    # Update the count for the sampled word
    if sampled_word not in sampling_result2:
        sampling_result2[sampled_word] = 0

    sampling_result2[sampled_word] += 1

print(sampling_result2)
```

```
tensor([4.4567e-04, 1.1767e-02, 9.1224e-26, 9.8779e-01, 1.9476e-25])
{'continuously': 987, 'am': 13}
```

The word at index 3 (`continuously`) is sampled almost every time.

```python
# Now let's try increasing the temperature
temp = 6
logits3 = logits / temp
probs3 = logits3.softmax(dim=-1)
print(probs3)

# This shrinks the differences between logits, so the probabilities
# move toward uniform (more random)

sampling_result3 = {}
for _ in range(1000):
    sampled_id = torch.multinomial(probs3, num_samples=1).item()
    sampled_word = idx2word[sampled_id]

    # Update the count for the sampled word
    if sampled_word not in sampling_result3:
        sampling_result3[sampled_word] = 0

    sampling_result3[sampled_word] += 1

print(sampling_result3)
```

```
tensor([0.2058, 0.2069, 0.1893, 0.2084, 0.1896])
{'I': 210, 'for': 196, 'learning': 166, 'continuously': 206, 'am': 222}
```

Every word now has a similar chance of being sampled.
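
The three experiments above repeat the same scale, softmax, sample loop. As a rough sketch, that loop could be folded into one helper. This version uses only the standard library, with `random.choices` standing in for `torch.multinomial`, so its counts will not match the PyTorch outputs shown above. The names `sample_counts` and `seed` are hypothetical, and the logits are the rounded values printed earlier.

```python
import math
import random
from collections import Counter

idx2word = {0: 'I', 1: 'am', 2: 'learning', 3: 'continuously', 4: 'for'}
logits = [0.8823, 0.9150, 0.3829, 0.9593, 0.3904]  # rounded values from the output above

def sample_counts(logits, temperature, n_samples=1000, seed=0):
    """Sample token ids from softmax(logits / temperature) and count the words."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]  # unnormalized probabilities
    rng = random.Random(seed)  # seeded generator for repeatable runs
    # random.choices normalizes the weights internally
    ids = rng.choices(range(len(logits)), weights=weights, k=n_samples)
    return Counter(idx2word[i] for i in ids)

for T in (0.01, 1.0, 6.0):
    print(T, dict(sample_counts(logits, T)))
```

With a helper like this, comparing temperatures becomes a one-line change rather than a copied loop.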