<a href="https://colab.research.google.com/github/bforoura/GenAI/blob/main/Module8/chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **GPT-2 Chatbot Implementation**

* This is an implementation of a simple chatbot based on OpenAI's GPT-2 model.

* GPT-2 is a language model capable of generating human-like text based on a given input.

* Hugging Face offers a wide variety of pre-trained models, including GPT-2, that are available for download and use locally without the need for an **API key**.

* If you instead call **Hugging Face's Inference API**, which runs the model on **their servers** rather than on your machine, you need an API key (access token) to authenticate your requests.

* The **chatbot** is designed to interact with users, generate contextually relevant responses, and provide an engaging conversational experience.


* The code uses the `transformers` library from **Hugging Face** to load the pre-trained GPT-2 model and tokenizer, and PyTorch to run the model's tensor computations.


* The chatbot supports features like **adjustable response length**, **temperature sampling**, and **top-k** or **top-p** sampling for more diverse and creative responses.

* The **pre-trained GPT-2 model** and its corresponding tokenizer are loaded using the from_pretrained method from the transformers library.

* The tokenizer is responsible for converting human-readable text into a format (tokens) that the model can understand, and vice versa.





**Sampling Parameters**

* To make the chatbot's responses more natural and less repetitive, several sampling techniques are used:

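>>* **Temperature**: This parameter controls the randomness of text generation by rescaling the model's next-token scores before sampling. A lower temperature (e.g., 0.2) sharpens the distribution and makes the output more deterministic and focused, while a higher temperature (e.g., 1.0, which leaves the distribution unchanged, or above) allows more randomness and creativity.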

>>* **Top-k Sampling**: This restricts the model to choose the next token from the **top k most likely options**, preventing it from picking less likely words that might cause nonsensical output.

>>* **Top-p Sampling (Nucleus Sampling)**: Instead of keeping a fixed number of candidates, top-p keeps the **smallest set of tokens whose cumulative probability reaches the threshold p**. The candidate pool therefore shrinks when the model is confident and grows when it is uncertain, which keeps the choices probable yet **diverse**.
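The three parameters above can be illustrated on a toy distribution. The following is a minimal sketch in plain Python, not the internals of `model.generate`: the helper names (`softmax`, `top_k_filter`, `top_p_filter`, `sample_next`) and the four-entry `toy_logits` list are assumptions made up for this example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable entries, zero out the rest, and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top entries whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

def sample_next(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Apply temperature, then top-k, then top-p, and draw one token index."""
    probs = softmax(logits, temperature)
    probs = top_k_filter(probs, min(top_k, len(probs)))
    probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a four-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0]
print("T=1.0:", [round(p, 3) for p in softmax(toy_logits, 1.0)])
print("T=0.2:", [round(p, 3) for p in softmax(toy_logits, 0.2)])
print("top-k=2:", [round(p, 3) for p in top_k_filter(softmax(toy_logits), 2)])
print("top-p=0.9:", [round(p, 3) for p in top_p_filter(softmax(toy_logits), 0.9)])
print("sampled index:", sample_next(toy_logits, rng=random.Random(0)))
```

Lowering the temperature concentrates probability on the first token, while the two filters remove the low-probability tail before the random draw takes place.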



**Challenges and Limitations**

* **Repetition**
>* With the default greedy decoding (always picking the most probable next token), the chatbot tends to generate repetitive responses.
>* Enabling sampling and adjusting parameters like temperature, top-k, and top-p reduces the repetition and makes the responses more varied and creative, although it does not eliminate it, as the sample transcript at the end of this notebook shows.

* **Context Awareness**
>* GPT-2 can generate coherent text from a single input, but it keeps no memory between calls, and this chatbot passes only the latest user message to the model.
>* In a real-world conversation, where earlier turns matter, the model therefore cannot maintain a consistent thread.
>* This limitation can be eased by including earlier turns in the prompt (within GPT-2's 1024-token context window), by **fine-tuning the model** on domain-specific conversations, or by using a more capable model such as GPT-3 or GPT-4.

* **Response Quality**
>* While GPT-2 produces fluent, human-like text, it was trained to continue text rather than to answer questions, so the quality and relevance of its replies can be inconsistent.
>* Fine-tuning or incorporating additional techniques, such as reinforcement learning from human feedback (RLHF), can improve the quality and relevance of the responses.
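The repetition issue described above can be reproduced without GPT-2. The sketch below uses a hypothetical lookup table (`TOY_MODEL`) as a stand-in for a language model, with made-up probabilities: greedy decoding always follows the single most probable path and loops, while sampling draws from the whole distribution.

```python
import random

# Hypothetical toy "language model": next-token probabilities keyed by the current token
TOY_MODEL = {
    "the": {"red": 0.4, "yellow": 0.35, "blue": 0.25},
    "red": {"elephant": 1.0},
    "yellow": {"elephant": 1.0},
    "blue": {"elephant": 1.0},
    "elephant": {"sang": 1.0},
    "sang": {"the": 1.0},
}

def greedy_step(token):
    """Pick the single most probable next token."""
    probs = TOY_MODEL[token]
    return max(probs, key=probs.get)

def sample_step(token, rng):
    """Draw the next token at random, weighted by its probability."""
    probs = TOY_MODEL[token]
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def decode(start, steps, step_fn):
    """Generate `steps` tokens after `start` using the given step function."""
    out = [start]
    for _ in range(steps):
        out.append(step_fn(out[-1]))
    return out

greedy = decode("the", 12, greedy_step)
rng = random.Random(0)
sampled = decode("the", 12, lambda tok: sample_step(tok, rng))

print("greedy: ", " ".join(greedy))
print("sampled:", " ".join(sampled))
```

The greedy sequence only ever picks "red", because that branch has the highest probability at every visit, whereas the sampled sequence can switch between colors from one cycle to the next.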

In [1]:
!pip install transformers



In [9]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a larger GPT-2 checkpoint
model_name = "gpt2-large"  # Other options: gpt2 (smallest), gpt2-medium, gpt2-xl
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()


def generate_response(prompt, max_length=100, temperature=0.7, top_k=50, top_p=0.9):
    # Convert the prompt text into a tensor of token IDs
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Attention mask of 1s: there is no padding, so the model should attend to every prompt token
    attention_mask = torch.ones_like(input_ids)

    # Generate with sampling; gradients are not needed for inference
    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=max_length,  # Total length in tokens, prompt included
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse end-of-text (ID 50256)
            do_sample=True  # Enable sampling to use temperature, top_k, and top_p
        )

    # output[0] holds the prompt tokens followed by the generated ones,
    # so the decoded text starts with the prompt itself
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

print("Chatbot: Hi there! How can I help you?")
while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "exit":  # Ignore surrounding whitespace and case
        print("Chatbot: Goodbye!")
        break

    response = generate_response(user_input)
    print("Chatbot:", response)


Chatbot: Hi there! How can I help you?
You: How's it going?
Chatbot: How's it going? Well, we have a little bit of a problem. We have a lot of people who have been waiting in line. They've been waiting for over two hours. So I'm going to have to cut the line down. I've just got to get it down to two. I'm going to have to do that. But I'm going to have to get it down to two.

I know that there are a lot of people that are here for the first
You: The pink elephant sang a loud note.
Chatbot: The pink elephant sang a loud note.

"The green elephant sang a loud note."

"The blue elephant sang a loud note."

"The yellow elephant sang a loud note."

"The red elephant sang a loud note."

"The yellow elephant sang a loud note."

"The red elephant sang a loud note."

"The yellow elephant sang a loud note."

"The red elephant sang a loud note."

"The


KeyboardInterrupt: Interrupted by user
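
Each reply in the transcript above is generated from the latest message alone, because the chat loop calls `generate_response(user_input)` without any earlier turns. One possible way to carry context forward is to prepend recent turns to the prompt while staying inside GPT-2's 1024-token context window. The sketch below is only an illustration under stated assumptions: `build_prompt`, `count_tokens_approx`, the `User:`/`Bot:` labels, and the 900-token budget are all hypothetical, and the whitespace word count is a rough stand-in for the real tokenizer's count.

```python
def count_tokens_approx(text):
    # Assumption: whitespace-separated words stand in for real tokenizer tokens
    return len(text.split())

def build_prompt(history, user_input, budget=900, count_tokens=count_tokens_approx):
    """Join the most recent turns that fit within `budget` tokens.

    `history` is a list of (speaker, text) pairs, oldest first. The prompt
    always ends with the new user message and a cue for the bot to reply.
    """
    tail = f"User: {user_input}\nBot:"
    used = count_tokens(tail)
    kept = []
    # Walk backwards so the newest turns are kept first
    for speaker, text in reversed(history):
        line = f"{speaker}: {text}"
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(list(reversed(kept)) + [tail])

# Hypothetical conversation so far
history = [
    ("User", "hi there"),
    ("Bot", "hello friend"),
    ("User", "what is GPT-2"),
    ("Bot", "a language model"),
]

print(build_prompt(history, "thanks"))
print("---")
print(build_prompt(history, "thanks", budget=8))  # a tight budget drops the oldest turns
```

The budget is kept below 1024 to leave room for the generated reply. In the notebook, the returned string could be passed to `generate_response` in place of the raw `user_input`, with a real token count taken from `tokenizer.encode`.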