<a href="https://colab.research.google.com/github/abhithakkar1998/llm-learning-journey/blob/main/Day1_HuggingFace_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies
!pip install transformers datasets huggingface_hub -q

In [None]:
#We’ll use Hugging Face’s free model google/flan-t5-base (a small instruction-tuned model).
from transformers import pipeline

# Load a small model for text generation
generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Test query
query = "Explain neural networks in simple words."
response = generator(query, max_length=100, do_sample=True)

print("Query:", query)
print("Response:", response[0]['generated_text'])

###Let’s open the black box a bit.###

When you ran the above code it looked like magic. But under the hood, it’s really 3 steps:

1. Tokenization (Text → Numbers)
  * LLMs don’t “understand” words directly.
  * The text is split into tokens (whole words or pieces of words), and each token is mapped to an integer (token ID).
  * Example (the tokens and IDs shown are illustrative, not the real flan-t5 values): "Explain neural networks in simple words."<br/>
  ↓<br/>
  ["Explain", " neural", " networks", " in", " simple", " words", "."]<br/>
  ↓<br/>
  [1234, 5678, 91011, 2345, 6789, 1112, 42]<br/>
  These IDs are the numeric encoding of the text that the model actually processes.

2. Model Inference (Numbers → Numbers)
  * The token IDs are fed into a neural network (here T5, an encoder-decoder Transformer).
  * At each step, the model outputs a probability for every token in its vocabulary being the next token.
  * Example: If the current text is "Neural networks are", the model might predict:<br/>
  "powerful" with 0.4 probability<br/>
  "complex" with 0.3 probability<br/>
  "used" with 0.2 probability<br/>
  ...<br/>

3. Decoding (Numbers → Text)
  * The model keeps predicting the next token, one step at a time, until it emits an end-of-sequence token or hits the length limit.
  * Strategies for decoding:
    * Greedy: Always pick the highest-probability token (deterministic, but may sound robotic or repetitive).
    * Sampling: Randomly pick a token according to the probability distribution (adds variety).
    * Beam search: Keep several candidate continuations in parallel, then choose the one with the highest overall probability.
  * Finally, the generated token IDs are mapped back into text (illustrative IDs): <br/>
  [3456, 7890, 321, 654, ...] -> "Neural networks are powerful models that mimic the brain..."
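
The three steps above can be sketched end to end with a toy stand-in for the model. Everything here is made up for illustration: the small vocabulary, the IDs, the whitespace tokenizer and the `next_token_probs` function are assumptions, not flan-t5's real tokenizer or weights. The loop also simply continues a prompt, which is a simplification of T5's encoder-decoder setup.

```python
# Toy illustration only: vocabulary, IDs and probabilities are invented.
VOCAB = ["<eos>", "Neural", "networks", "are", "powerful", "complex", "used"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def tokenize(text):
    """Step 1: text -> token IDs (whitespace split stands in for subword tokenization)."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def next_token_probs(token_ids):
    """Step 2: stand-in for the model, returns P(next token | tokens so far)."""
    last = VOCAB[token_ids[-1]]
    if last == "are":
        return {"powerful": 0.4, "complex": 0.3, "used": 0.2, "<eos>": 0.1}
    return {"<eos>": 1.0}


def greedy_decode(token_ids, max_new_tokens=5):
    """Step 3: append the most likely token until <eos> or the length cap."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        probs = next_token_probs(ids)
        best = max(probs, key=probs.get)
        if best == "<eos>":
            break
        ids.append(TOKEN_TO_ID[best])
    return ids


def detokenize(token_ids):
    """Map token IDs back to text."""
    return " ".join(VOCAB[i] for i in token_ids)


prompt_ids = tokenize("Neural networks are")
output_ids = greedy_decode(prompt_ids)
print("Prompt IDs:", prompt_ids)
print("Output IDs:", output_ids)
print("Decoded:", detokenize(output_ids))
```

Replacing `max(...)` with a random draw weighted by `probs` would turn this greedy loop into sampling.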

###Pipeline = Wrapper###

The pipeline function in Hugging Face just wraps all 3 steps:
* Tokenize input.
* Run the model forward pass.
* Decode output.

So you can focus on the task (“summarize”, “translate”, “answer question”) instead of manually handling tokens.
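
As a rough sketch of what the wrapper does, the same three steps can be written by hand with `AutoTokenizer` and `AutoModelForSeq2SeqLM`. The helper name `run_three_steps` and its defaults are hypothetical, and the real pipeline also takes care of batching, device placement and post-processing that are left out here.

```python
def run_three_steps(query, model_name="google/flan-t5-base", max_new_tokens=100):
    """Do by hand roughly what pipeline("text2text-generation") does for one query."""
    # Imported inside the function so the heavy libraries load only when it is called.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # 1. Tokenize: text -> tensors of token IDs
    inputs = tokenizer(query, return_tensors="pt")

    # 2 + 3. generate() runs the model repeatedly and chooses one token per step
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Map the generated token IDs back to text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `run_three_steps("Explain neural networks in simple words.")` downloads the model on first use and should return text comparable to what `generator(query, max_new_tokens=100)` gives.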

###Why is the output vague?###
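* Model size & capability: flan-t5-base has only ~250M parameters, so it is suited to small tasks. It’s instruction-tuned, but it tends to produce short answers.
* Generation parameters: max_length and max_new_tokens are upper bounds, not targets. The model stops as soon as it emits its end-of-sequence token, so a larger limit alone does not force a longer response.
* Decoding strategy: Unless do_sample=True or num_beams > 1 is set, generation normally falls back to greedy decoding, which leans towards very safe answers. Sampling or beam search can produce richer outputs.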
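
To see how beam search can find a continuation that greedy decoding misses, here is a toy comparison. The probability table `TOY_MODEL` is invented for the example and has nothing to do with flan-t5. Only the two search procedures matter.

```python
import math

# Invented next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "neural": 0.4},
    ("a",): {"model": 0.5, "network": 0.3, "thing": 0.2},
    ("neural",): {"networks": 0.9, "nets": 0.1},
}


def next_probs(prefix):
    return TOY_MODEL[tuple(prefix)]


def greedy(steps):
    """Pick the single most likely token at every step."""
    seq, logp = [], 0.0
    for _ in range(steps):
        probs = next_probs(seq)
        token = max(probs, key=probs.get)
        seq.append(token)
        logp += math.log(probs[token])
    return seq, math.exp(logp)


def beam_search(steps, num_beams):
    """Keep the num_beams best partial sequences at every step."""
    beams = [([], 0.0)]  # (tokens so far, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in next_probs(seq).items():
                candidates.append((seq + [token], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)


greedy_seq, greedy_p = greedy(steps=2)
beam_seq, beam_p = beam_search(steps=2, num_beams=2)
print("Greedy:", greedy_seq, round(greedy_p, 2))
print("Beam:  ", beam_seq, round(beam_p, 2))
```

Greedy commits to "a" because it is the best first token, while beam search keeps "neural" alive and ends with a more probable two-token sequence. In the pipeline, the corresponding setting is `num_beams`, for example `generator(query, max_new_tokens=100, num_beams=4)`.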

###How to Improve the Output###

In [None]:
#Increase output length + use sampling
response = generator(
    query,
    max_new_tokens=100,       # allow up to 100 new tokens
    do_sample=True,           # add randomness
    temperature=0.7,          # controls creativity (0.7 is a good balance)
    top_p=0.9                 # nucleus sampling
)

print(response[0]['generated_text'])

* max_new_tokens → Caps the number of tokens (not words) the model may generate, not counting the prompt.
* do_sample → Samples the next token from the probability distribution instead of always picking the most likely one.
* temperature → Rescales the distribution before sampling: values below 1 make the output more predictable, values above 1 make it more varied.
* top_p → Restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches top_p (nucleus sampling).
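
A minimal sketch of what nucleus sampling does to a distribution, assuming the hypothetical next-token probabilities below. `top_p_filter` is an illustrative helper, not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical distributions: one fairly spread out, one sharply peaked.
spread = {"powerful": 0.5, "complex": 0.25, "used": 0.1875, "purple": 0.0625}
peaked = {"a": 0.95, "Neural": 0.03, "The": 0.02}

nucleus_spread = top_p_filter(spread, top_p=0.9)
nucleus_peaked = top_p_filter(peaked, top_p=0.9)
print("Spread:", nucleus_spread)
print("Peaked:", nucleus_peaked)
```

In the spread case the unlikely token "purple" is dropped before sampling. In the peaked case only one token survives, so sampling can no longer add variety.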

For more details -

https://huggingface.co/docs/transformers/en/main_classes/pipelines
https://huggingface.co/docs/transformers/en/main_classes/text_generation

In [None]:
#Use a slightly larger model if Colab allows
generator = pipeline("text2text-generation", model="google/flan-t5-large")
query = "Explain neural networks in simple words."
response = generator(
    query,
    max_new_tokens=100,       # allow up to 100 new tokens
    do_sample=True,           # add randomness
    temperature=0.7,          # controls creativity (0.7 is a good balance)
    top_p=0.9                 # nucleus sampling
)
print(response[0]['generated_text'])

In [None]:
#How the same query changes when we adjust temperature and top_p

# 1. Greedy (deterministic, no sampling)
resp_greedy = generator(query, max_new_tokens=60, do_sample=False)

# 2. Sampling with low temperature (more focused)
resp_temp_low = generator(query, max_new_tokens=60, do_sample=True, temperature=0.3, top_p=0.9)

# 3. Sampling with medium temperature (balanced)
resp_temp_mid = generator(query, max_new_tokens=60, do_sample=True, temperature=0.7, top_p=0.9)

# 4. Sampling with high temperature (creative, risk of nonsense)
resp_temp_high = generator(query, max_new_tokens=60, do_sample=True, temperature=1.2, top_p=0.9)

print("\n=== Greedy ===")
print(resp_greedy[0]['generated_text'])

print("\n=== Low Temperature (0.3) ===")
print(resp_temp_low[0]['generated_text'])

print("\n=== Medium Temperature (0.7) ===")
print(resp_temp_mid[0]['generated_text'])

print("\n=== High Temperature (1.2) ===")
print(resp_temp_high[0]['generated_text'])

Why are the first three identical?

* Greedy (do_sample=False): Always picks the single most likely next token.
* Sampling (do_sample=True): Picks randomly from the probability distribution of possible tokens.
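  * But… if one token has extremely high probability (say 0.98 or 0.99), then even with sampling and low/medium temperature, the model almost always picks that token. The result looks the same as greedy.
  * top_p=0.9 reinforces this: once a single token holds at least 0.9 of the probability mass, the nucleus contains only that token, so sampling has nothing else to choose from.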

* Why does temperature not change it much?
  * Temperature divides the model's raw scores (logits) by T before they are turned into probabilities, which reshapes the distribution.
  * Low temp (0.3) → makes the top choice even sharper, so variation disappears.
  * Medium temp (0.7) → is still below 1, so it also sharpens the distribution (just less than 0.3), and the dominant token still wins.
  * High temp (1.2) → flattens the distribution enough that second/third best tokens get picked, giving variation.

* What’s happening here specifically
  * At the very first step of generation:
    * The token "a" (start of "a neural network...") is overwhelmingly the top candidate.
    * Temp 0.3 and 0.7 don’t change that — "a" is still chosen.
    * Only at higher temperature (1.2) do alternative beginnings like "Neural" get some chance.
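
The effect can be reproduced with a few lines of arithmetic. The logits below are hypothetical values chosen so that "a" dominates. They are not read from the model. The check against 0.9 assumes, as in the calls above, that top_p=0.9 is applied after temperature scaling.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the first generated token.
tokens = ["a", "Neural", "The"]
logits = [5.0, 2.5, 2.0]

top_prob = {}
for temperature in (0.3, 0.7, 1.2):
    probs = softmax_with_temperature(logits, temperature)
    top_prob[temperature] = probs[0]
    only_choice = probs[0] >= 0.9  # True: "a" is the whole top_p=0.9 nucleus
    print(f"T={temperature}: P('a')={probs[0]:.3f}, only choice under top_p=0.9: {only_choice}")
```

With these numbers, "a" stays above 0.9 at temperatures 0.3 and 0.7 and only drops below it at 1.2, which is the pattern seen in the four outputs.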

SUMMARY: The first three are the same because the model is extremely confident about the first token and temperature 0.3–0.7 isn’t enough to shake it.
