### LLMs (Large Language Models)

- Characteristics of LLMs:
    - Scale: they contain millions, billions, or even more parameters
    - General capabilities: they can perform multiple tasks without task-specific training
    - In-context learning: they can learn from examples provided in the prompt
    - Emergent abilities: as these models grow in size, they can demonstrate capabilities that were not explicitly programmed

- Limitations of LLMs:
    - Hallucinations: they can generate incorrect information confidently.
    - Lack of true understanding: they lack a grounded understanding of the world and operate purely on statistical patterns.
    - Biases: they may reproduce biases present in the training data or inputs
    - Context windows: they can only process a limited number of tokens at once
    - Computational resources: they require significant computational resources


In [1]:
## Transformers library, pipeline object

from transformers import pipeline



In [2]:
classifier_object = pipeline('sentiment-analysis')
# by default, the pipeline function selects a particular pre-trained model that has been fine-tuned for sentiment analysis
# the model gets downloaded and cached on first use

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [3]:
## 3 MAJOR STEPS OF A PIPELINE:
# 1. the text is preprocessed into a format the model can understand
# 2. the preprocessed inputs are passed to the model
# 3. the model's predictions are post-processed into human-readable output

# passing string for analysis
classifier_object("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598050713539124}]
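
The three steps can be sketched without any model download. This is a toy illustration only: the vocabulary, weights, and function names below are hypothetical, and a real pipeline uses a trained tokenizer and neural network instead.

```python
import math

# Hypothetical toy vocabulary and weights (assumptions for illustration only).
VOCAB = {"i": 0, "love": 1, "hate": 2, "this": 3, "[UNK]": 4}
# One (negative, positive) weight pair per token id.
WEIGHTS = [(0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.0), (0.0, 0.0)]
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def preprocess(text):
    """Step 1: turn raw text into token ids."""
    tokens = text.lower().replace("!", "").replace(".", "").split()
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]


def forward(token_ids):
    """Step 2: run the toy 'model' to get one raw score (logit) per class."""
    neg = sum(WEIGHTS[i][0] for i in token_ids)
    pos = sum(WEIGHTS[i][1] for i in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


def toy_classifier(text):
    return postprocess(forward(preprocess(text)))


print(toy_classifier("I love this!"))
print(toy_classifier("I hate this."))
```

The output dictionary mirrors the `label`/`score` shape returned by the sentiment pipeline above, which is the result of the post-processing step.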

In [6]:
# passing a list of sentences to get predictions
inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I don't hate this so much!",
    "I hate this so much!"
]
classifier_object(inputs)

[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'POSITIVE', 'score': 0.996126115322113},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [7]:
### ZERO-SHOT CLASSIFICATION
## classify text without prior training on the specific labels
## allows us to specify which labels to use for classification

classifier_object = pipeline('zero-shot-classification')


classifier_object(
    'This is a course about the Transformers library on huggingface',
    candidate_labels = ['education', 'politics', 'sports'],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'This is a course about the Transformers library on huggingface',
 'labels': ['education', 'sports', 'politics'],
 'scores': [0.9443621039390564, 0.03222975879907608, 0.02340812422335148]}

##### Hallucination experiment

In [8]:
# misspelling a label: the model still assigns it a high score
classifier_object(
    'This is a course about the Transformers library on huggingface',
    candidate_labels = ['educion', 'politics', 'sports'],
)

{'sequence': 'This is a course about the Transformers library on huggingface',
 'labels': ['educion', 'sports', 'politics'],
 'scores': [0.9692893624305725, 0.017789985984563828, 0.012920672073960304]}

In [10]:
# adding both a near-correct label ('educations') and the misspelled one ('educion'): the misspelled label still gets the higher score
classifier_object(
    'This is a course about the Transformers library on huggingface',
    candidate_labels = ['educations', 'educion', 'politics', 'sports'],
)

{'sequence': 'This is a course about the Transformers library on huggingface',
 'labels': ['educion', 'educations', 'sports', 'politics'],
 'scores': [0.7425496578216553,
  0.23392359912395477,
  0.013628487475216389,
  0.009898222051560879]}
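
One likely reason the misspelled label scores so high: in the pipeline's default single-label mode (`multi_label=False`), the per-label entailment logits are normalized with a softmax across the candidate labels, so the scores always sum to 1. A high score then means "most plausible of the labels offered", not "this is a valid label". The sketch below uses hypothetical logits (not values taken from the model) to show how adding a second plausible label splits the probability mass.

```python
import math


def softmax(logits):
    """Normalize a list of logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_labels(entailment_logits):
    """Map each candidate label to its share of the softmax."""
    labels = list(entailment_logits)
    probs = softmax([entailment_logits[label] for label in labels])
    return dict(zip(labels, probs))


# Hypothetical entailment logits, one per candidate label (assumed values).
three_labels = {"educion": 3.0, "politics": -1.0, "sports": -0.7}
print(score_labels(three_labels))

# Adding another plausible label takes probability mass away from 'educion'.
four_labels = dict(three_labels, educations=1.8)
print(score_labels(four_labels))
```

Passing `multi_label=True` to the classifier scores each label independently instead, which is one way to check whether a label is plausible on its own.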

##### Text Generation
- given some text (the prompt), the model tries to complete it; generation stops when the model emits an end-of-sequence token or a length limit is reached
  

In [11]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to get started with the Arduino development environment… You won't need the tools needed to make an accurate, reliable and functional digital representation of your Arduino, but you will use them to design, create, produce"}]

In [17]:
## Specify the model in the pipeline call
# max_length controls the total length of the output (prompt plus generated tokens);
# num_return_sequences (not used here) sets how many different sequences the model returns
generator = pipeline('text-generation', model='HuggingFaceTB/SmolLM2-360M')
generator("In this course, we will teach you how to", max_length=40)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use the Python programming language to solve real-world problems. We will start by introducing you to the basics of Python, including its syntax, data types'}]

##### Mask filling
 - the model tries to fill in the blanks in a given text
 - the `top_k` argument controls how many candidate completions are returned
 - the special token `<mask>` marks the position where the model predicts the missing word (the exact mask token depends on the model)

In [19]:
unmasker = pipeline('fill-mask')
unmasker('This course will teach you all about <mask> models', top_k=2)


[{'score': 0.19631582498550415,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models'},
 {'score': 0.044492267072200775,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models'}]

In [23]:
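#### Named entity recognition
# aggregation_strategy='simple' groups sub-word tokens into whole entities (it replaces the deprecated grouped_entities=True)
ner = pipeline('ner', aggregation_strategy='simple')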
ner("My name is Sylvain and I work at Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': np.float32(0.9983753),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.99058795),
  'word': 'Face',
  'start': 33,
  'end': 37},
 {'entity_group': 'LOC',
  'score': np.float32(0.99158233),
  'word': 'Brooklyn',
  'start': 41,
  'end': 49}]

In [22]:
#### Question answering given a context

qna = pipeline("question-answering")
qna(question='where do i work?', context='my name is sylvain and i work at hugging face in brooklyn')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.5335550308227539, 'start': 33, 'end': 45, 'answer': 'hugging face'}

#### A bit on how transformers work
- all LLMs go through a pre-training process that uses a self-supervised method
- self-supervised method: the training objective is computed automatically from the model's inputs (no explicit labels needed)
- the model develops a statistical understanding of the language it was trained on, but it is not yet suited to specific tasks
- after pre-training, the model is generally fine-tuned on a specific task in a supervised way
- two types of pre-training can be done on text data:
    - causal language modeling: the model tries to predict the next word given all previous words
    - masked language modeling: the model tries to predict a masked word given the other words in the sentence
- The model has 2 components:
    - Encoder: receives the input and builds a representation (features) of it
    - Decoder: uses the encoder's representation along with other inputs to generate a target sequence
- Each part can be used independently, depending on the type of task we are trying to solve:
    - Encoder-only models: good for tasks that require understanding of the input, e.g. sentence classification and NER
    - Decoder-only models: good for generative tasks, e.g. text generation
    - Encoder-decoder models: good for generative tasks that require an input, such as translation or summarization
- Attention concept
    - a word by itself has meaning, but that meaning is deeply affected by its context
    - the Transformer architecture was originally designed for language translation
    - in the encoder, the attention layer can use all the words in the sentence (since we need all of them to understand the whole meaning)
    - the decoder, however, predicts words sequentially, one at a time; at each step it attends to the words it has already generated and to the encoder's representation of the input
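
The difference between encoder-style and decoder-style attention can be sketched in plain Python. This is a simplified illustration: it uses the input vectors directly as queries, keys, and values (a real layer applies learned projections first), and the toy vectors are made up.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    causal=False: every position attends to every position (encoder style).
    causal=True: position i attends only to positions <= i (decoder style).
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

_, full_weights = attention(x, x, x, causal=False)
_, causal_weights = attention(x, x, x, causal=True)
print("encoder-style weights for token 0:", full_weights[0])
print("decoder-style weights for token 0:", causal_weights[0])
```

With the causal mask, the first token can only attend to itself, while without it the same token spreads its attention over the whole sequence.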


- Context length & attention span:
    - the context length is the maximum number of tokens that the LLM can process at once
    - it depends on several factors:
        - model architecture and size
        - available computational resources
        - the complexity of the input and desired output
- Prompting:
    - when we pass information to an LLM, we structure the input so that it guides generation toward the desired output
    - since the model's primary task is to predict the next token, the wording of the prompt directly shapes the output, so careful prompt design matters



- LLM inference can be divided into two phases:
    - The prefill phase:
        - Tokenization: converting the input text into tokens
        - Embedding conversion: transforming these tokens into numerical representations that capture their meaning
        - Initial processing: running these embeddings through the model's neural network to create a representation of all tokens and their context

    - The decode phase:
        - Attention computation: looking back at all previous tokens to understand context
        - Probability calculation: determining the likelihood of each possible next token
        - Token selection: choosing the next token based on these probabilities
        - Continuation check: deciding whether to continue or stop generation
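
The two phases can be sketched as a small generation loop. The "model" here is a hypothetical bigram table that only looks at the last token, which is an assumption made to keep the example tiny; a real LLM attends over the whole context.

```python
# Hypothetical next-token probabilities given only the last token (toy values).
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.5, "<eos>": 0.5},
    "sat": {"<eos>": 0.9, "the": 0.1},
}


def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def next_token_probs(context):
    """Toy stand-in for a forward pass; unknown tokens end the sequence."""
    return BIGRAMS.get(context[-1], {"<eos>": 1.0})


def generate(prompt, max_new_tokens=10):
    # Prefill: tokenize the prompt and process it once.
    context = tokenize(prompt)
    generated = []
    # Decode: produce one token per iteration.
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)   # probability calculation
        token = max(probs, key=probs.get)   # token selection (greedy)
        if token == "<eos>":                # continuation check
            break
        generated.append(token)
        context.append(token)
    return generated


print(generate("the"))
```

Greedy selection is used here for simplicity; `max_new_tokens` and the `<eos>` check are the two stop conditions in this sketch.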



##### Sampling Strategies
- Understanding token selection: from probabilities to token choices
    - Raw logits: the raw output of the model, before any post-processing
    - Temperature control: like a creativity dial; higher settings (>1.0) make choices more random and creative, lower settings (<1.0) make them more deterministic
    - Top-p (nucleus) sampling: instead of considering all possible tokens, we only look at the most likely ones whose probabilities add up to a chosen threshold
    - Top-k filtering: an alternative approach where we only consider the k most likely next tokens
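
A minimal sketch of these three controls in plain Python follows. The logits are hypothetical, and the helper names are illustrative rather than any library's API.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]


def sample(probs, rng=random):
    """Draw one token id according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for a 4-token vocabulary

for t in (0.5, 1.0, 2.0):
    print("temperature", t, [round(p, 3) for p in softmax(logits, t)])

probs = softmax(logits)
print("top-k (k=2):", [round(p, 3) for p in top_k_filter(probs, 2)])
print("top-p (p=0.9):", [round(p, 3) for p in top_p_filter(probs, 0.9)])
print("sampled token id:", sample(top_k_filter(probs, 2)))
```

Lower temperatures concentrate probability on the top token, while top-k and top-p both cut off the unlikely tail before sampling.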

- Managing repetition: keeping output fresh
    - Presence penalty: a fixed penalty applied to any token that has appeared before, regardless of how often. It helps prevent the model from reusing the same words
    - Frequency penalty: a scaling penalty that increases with how often a token has been used. The more a word appears, the less likely it is to be chosen again
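
One common formulation subtracts both penalties from the logits before sampling. The sketch below assumes that formulation; exact details vary between inference libraries, and `apply_penalties` is a hypothetical helper name.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, presence_penalty=0.0, frequency_penalty=0.0):
    """Lower the logits of tokens that were already generated.

    presence_penalty is subtracted once for any token that has appeared;
    frequency_penalty is subtracted once per occurrence.
    """
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= presence_penalty
        adjusted[token_id] -= frequency_penalty * count
    return adjusted


# Hypothetical logits for a 3-token vocabulary; token 0 was generated twice, token 2 once.
logits = [1.0, 1.0, 1.0]
history = [0, 0, 2]
print(apply_penalties(logits, history, presence_penalty=0.5, frequency_penalty=0.25))
```

Token 1 is untouched because it has not appeared yet, token 2 pays the presence penalty plus one frequency step, and token 0 pays the most because it was used twice.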

- Controlling Generation Length:
    - Token limits: setting minimum and maximum token counts
    - Stop Sequences: Defining specific patterns that signal the end of generation
    - End-of-sequence detection: letting the model naturally conclude its response


- Beam search
    - Instead of committing to a single choice at each step, it explores multiple possible paths simultaneously, like a chess player thinking several moves ahead
    - Steps:
        - At each step, maintain multiple candidate sequences (typically 5-10)
        - For each candidate, compute probabilities for the next token
        - Keep only the most promising combinations of sequences and next tokens
        - Continue this process until reaching the desired length or a stop condition
        - Select the sequence with the highest overall probability
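
The steps above can be sketched on a toy problem. The probability table is hypothetical and chosen so that greedy decoding and beam search disagree; the sketch also omits refinements such as length normalization.

```python
import math

# Hypothetical next-token probabilities keyed by the last token (toy values).
TABLE = {
    "<bos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def next_log_probs(sequence):
    """Toy model: log-probabilities of the next token; unknown tokens end the sequence."""
    probs = TABLE.get(sequence[-1], {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(num_beams=2, max_steps=3):
    # Each beam is (sequence, cumulative log-probability).
    beams = [(["<bos>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beams carry over
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the most promising sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


greedy_seq, greedy_score = beam_search(num_beams=1)
beam_seq, beam_score = beam_search(num_beams=2)
print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam:  ", beam_seq, round(math.exp(beam_score), 2))
```

With a single beam the search commits to `a` because it looks best at the first step, while two beams keep `b` alive long enough to find the sequence with the higher overall probability.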




- Practical challenges and optimization:
    - Time to First Token (TTFT): how quickly can we get the first response? It is mostly affected by the prefill phase
    - Time Per Output Token (TPOT): how fast can we generate subsequent tokens? It determines the overall generation speed
    - Throughput: how many requests can we handle simultaneously? It affects scaling and cost efficiency
    - VRAM usage: how much GPU memory do we need?
    - Context length challenge:
        - Memory usage: the attention score matrix grows quadratically with context length in a naive implementation, while the KV cache grows linearly
        - Processing speed: per-token generation slows roughly linearly with longer contexts, since each new token attends to all previous ones
        - Resource allocation: requires careful balancing of VRAM usage
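
A back-of-the-envelope estimate makes the memory scaling concrete. The model dimensions below are hypothetical, and the formulas assume a naive implementation with 2-byte (fp16) values; optimized attention kernels avoid materializing the full score matrix.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Memory for cached keys and values; grows linearly with seq_len."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


def attention_scores_bytes(num_heads, seq_len, batch_size=1, bytes_per_value=2):
    """Memory for one layer's full seq_len x seq_len score matrices; grows quadratically."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 32 KV heads, head dimension 128.
for seq_len in (4096, 8192, 16384):
    kv = kv_cache_bytes(32, 32, 128, seq_len) / GIB
    scores = attention_scores_bytes(32, seq_len) / GIB
    print(f"seq_len={seq_len}: KV cache {kv:.1f} GiB, naive scores per layer {scores:.1f} GiB")
```

Doubling the context doubles the KV cache but quadruples the naive score matrix, which is why long contexts need careful VRAM budgeting.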

