<a href="https://colab.research.google.com/github/drpetros11111/Hands-On-Large-Language-Models/blob/drpp_study/Copy_of_Chapter_1_Introduction_to_Language_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Chapter 1 - Introduction to Language Models</h1>
<i>Exploring the exciting field of Language AI</i>


<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter01/Chapter%201%20-%20Introduction%20to%20Language%20Models.ipynb)

---

This notebook is for Chapter 1 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

In [None]:
# %%capture
# Quote the version specifiers so the shell does not treat ">" as a redirect
# !pip install "transformers>=4.40.1" "accelerate>=0.27.2"

# Phi-3

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately (although that isn't always necessary).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

Although we can now use the model and tokenizer directly, it's much easier to wrap them in a `pipeline` object:

In [None]:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)

Finally, we create our prompt as a user and give it to the model:

In [None]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

 Why did the chicken join the band? Because it had the drumsticks!
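
When a list of role/content messages is passed to the pipeline, the tokenizer's chat template turns it into a single prompt string before tokenization. The following is a rough, hand-written sketch of that step, not the template shipped with the model: the helper name `format_chat` is hypothetical, and the `<|user|>`, `<|end|>`, and `<|assistant|>` markers are assumed from the chat format described on the Phi-3 model card. In practice, `tokenizer.apply_chat_template` performs this formatting, and its output may differ in details.

```python
def format_chat(messages, add_generation_prompt=True):
    """Hypothetical helper: join chat messages into one prompt string.

    The special-token markers are an assumption based on the Phi-3
    model card; the real template lives in the tokenizer's config.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Cue the model to start writing the assistant's reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]
prompt = format_chat(messages)
print(prompt)
```

The trailing assistant marker is what tells an instruction-tuned model that it is its turn to respond.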


# Tokenization Example

In [None]:
!pip install transformers

from transformers import AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    # Load the tokenizer that matches the given model name
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # Convert the sentence into a list of token IDs
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        # Print each decoded token on a colored background,
        # cycling through colors_list
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )

# Call the function with a sentence and tokenizer name
sentence = "This is a test sentence."
tokenizer_name = "bert-base-uncased"  # Example tokenizer

show_tokens(sentence, tokenizer_name)



[0;30;48;2;102;194;165m[CLS][0m [0;30;48;2;252;141;98mthis[0m [0;30;48;2;141;160;203mis[0m [0;30;48;2;231;138;195ma[0m [0;30;48;2;166;216;84mtest[0m [0;30;48;2;255;217;47msentence[0m [0;30;48;2;102;194;165m.[0m [0;30;48;2;252;141;98m[SEP][0m 

# Tokenization example explained


---


# 1. Installing transformers:

The line `!pip install transformers` installs the transformers library.

This library is essential for working with pre-trained language models like BERT, GPT-2, etc.

It provides tools for tokenization, model loading, and inference.



---


# 2. Importing AutoTokenizer:

    from transformers import AutoTokenizer


This line imports the `AutoTokenizer` class from the transformers library.

`AutoTokenizer` looks at the model name or path passed to `from_pretrained` and loads the tokenizer that belongs to that pre-trained model, so the same code works for BERT, GPT-2, and other models.

------------------------------

# 3. Defining colors_list:

    colors_list = [
        '102;194;165', '252;141;98', '141;160;203',
        '231;138;195', '166;216;84', '255;217;47'
    ]

This creates a list called `colors_list` whose entries are RGB colors written as `R;G;B` strings, the form that ANSI 24-bit color escape codes expect.

These colors are used to highlight the tokens in the output; because there are only six, they repeat from the start for longer token sequences.

-------------------

# 4. Defining the show_tokens function:

    def show_tokens(sentence, tokenizer_name):
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        token_ids = tokenizer(sentence).input_ids
        for idx, t in enumerate(token_ids):
            # Print each decoded token on a colored background
            print(
                f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
                tokenizer.decode(t) +
                '\x1b[0m',
                end=' '
            )

This function takes a `sentence` and a `tokenizer_name` as input.

It loads the specified tokenizer using `AutoTokenizer.from_pretrained(tokenizer_name)`.

It tokenizes the sentence with the loaded tokenizer and reads the resulting token IDs from `input_ids`.

It then iterates through `token_ids` using `enumerate` to get both the index (`idx`) and the token ID (`t`).
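
The encode-then-decode round trip described above can be mimicked without downloading anything. The sketch below is a toy stand-in, not how `AutoTokenizer` works internally: `toy_vocab`, `toy_encode`, and `toy_decode` are hypothetical names, and the IDs are invented for illustration.

```python
# Toy vocabulary: the IDs are invented for illustration and do not
# match the vocabulary of bert-base-uncased or any real tokenizer.
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "this": 2, "is": 3, "a": 4,
             "test": 5, "sentence": 6, ".": 7}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_encode(sentence):
    """Lowercase, separate periods from words, and map each piece to an ID."""
    words = sentence.lower().replace(".", " .").split()
    # Wrap the sequence in start/end markers, as the BERT output shows.
    return [toy_vocab[w] for w in ["[CLS]", *words, "[SEP]"]]


def toy_decode(token_id):
    """Map a single ID back to its text, like decoding one token."""
    return id_to_token[token_id]


token_ids = toy_encode("This is a test sentence.")
print(token_ids)
for idx, t in enumerate(token_ids):
    print(idx, t, toy_decode(t))
```

A real tokenizer also splits unknown words into subword pieces, which this word-level lookup does not attempt.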

Inside the loop, it prints each token with a colored background using ANSI escape codes.

    f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m'

resets any previous styling (`0`), sets black text (`30`), and sets the background to the RGB color selected from `colors_list` (`48;2;R;G;B`). The index `idx % len(colors_list)` wraps around so the colors cycle.

    tokenizer.decode(t)

decodes the token ID back into its text representation.

    '\x1b[0m'

resets the color to the default.

    end=' '

ensures that the tokens are printed on the same line with a space separator.

In essence, this code snippet uses the transformers library to tokenize a sentence and then highlights each token with a different background color using ANSI escape codes, which makes the token boundaries easy to see.
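
To experiment with the highlighting on its own, the coloring step can be separated from tokenization. This is a minimal sketch under that assumption: `highlight` is a hypothetical helper, and the token list is typed in by hand to match the output shown earlier rather than produced by a tokenizer.

```python
colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]


def highlight(tokens):
    """Hypothetical helper: wrap each token string in ANSI color codes."""
    pieces = []
    for idx, token in enumerate(tokens):
        rgb = colors_list[idx % len(colors_list)]  # cycle through the colors
        pieces.append(f'\x1b[0;30;48;2;{rgb}m{token}\x1b[0m')
    return ' '.join(pieces)


tokens = ['[CLS]', 'this', 'is', 'a', 'test', 'sentence', '.', '[SEP]']
colored = highlight(tokens)
print(colored)        # colored backgrounds where ANSI 24-bit color is supported
print(repr(colored))  # shows the raw escape sequences
```

Returning a string instead of printing inside the loop makes the result easier to inspect or reuse; with eight tokens and six colors, the seventh token reuses the first color.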