**Introduction:**
This guide walks through the first step toward building a practical AI assistant with an open-source LLM. It aims to build intuition through hands-on practice with free resources, so that readers can go on to develop more advanced assistants of their own.

**Fundamental Concepts:**

- **Tokenization:** Tokenization is the process of breaking down text into more manageable pieces or "tokens," which are the small units that LLMs understand and process. Tokens could be single words, subwords, or punctuation marks.
  
- **Quantization:** Quantization stores a model's weights at lower numerical precision, which shrinks the model, reduces memory usage, and often speeds up inference. It is particularly useful on less powerful hardware. This guide does not require it, because the LLM used here is small enough to run well in a Google Colab environment.
  
- **Intuitive GPT:** Intuitive GPT is recommended as a supplementary learning tool. It explains complex technical subjects in detail, using simple language.
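
To make the tokenization idea above concrete, here is a minimal sketch of a toy tokenizer. It is an illustration only: it splits on whole words and punctuation, whereas the tokenizer loaded later in this guide uses a learned subword vocabulary. The helper names `toy_tokenize` and `build_vocab` are made up for this example.

```python
import re


def toy_tokenize(text):
    """Split text into word and punctuation tokens (a toy stand-in for a real tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


text = "Hello, world! Hello again."
tokens = toy_tokenize(text)          # pieces of text
vocab = build_vocab(tokens)          # token -> integer ID
ids = [vocab[t] for t in tokens]     # what a model would actually consume

print(tokens)
print(ids)
```

The repeated word "Hello" maps to the same ID both times, which is the key property: a model only ever sees the integer IDs, never the raw text.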


For more detail, read the original article: [Build Your First AI Assistant with Open-Source LLM on Google Colab](https://medium.com/@skt7/build-your-first-ai-assistant-with-open-source-llm-on-google-colab-74bbb6fbca7b)
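
As a rough illustration of the quantization concept described above, the sketch below estimates how weight storage scales with precision and then round-trips a few numbers through a simple symmetric 8-bit scheme. The parameter count of about 3.8 billion for Phi-3-mini is approximate, and the symmetric scheme is one simple choice among many; real quantization libraries use more refined methods.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in q_values]


NUM_PARAMS = 3.8e9  # approximate parameter count of Phi-3-mini

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{model_size_gb(NUM_PARAMS, bits):.1f} GB")

weights = [0.5, -1.27, 0.003, 1.0]   # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 3) for r in restored])
```

At 16 bits the estimate is about 7.6 GB, which is why the half-precision model used in this guide fits on a Colab GPU without quantization. The round trip also shows the cost: small values such as 0.003 are rounded away entirely.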

**Step 1: Install the Required Libraries**

In [None]:
!pip install accelerate
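
Only `accelerate` is installed here; the later cells also import `torch` and `transformers`, which Colab images typically ship with. A minimal sketch for checking that everything is importable before continuing (the helper name `missing_packages` is made up for this example):

```python
import importlib.util

REQUIRED = ["torch", "transformers", "accelerate"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, install it with `!pip install <name>` before moving on to Step 2.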



**Step 2: Download the Model from Hugging Face**

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Specify the LLM model we'll be using
model_name = "microsoft/Phi-3-mini-4k-instruct"
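# Load the weights in half precision; device_map="auto" relies on accelerate
# to place the model on the GPU when one is available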
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
# Load the tokenizer for the chosen model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Create a pipeline object for easy text generation with the LLM
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
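
The warnings above suggest pinning a revision so that the remote code files cannot change between runs. One way to do this, sketched under assumptions, is to pass the `revision` argument of `from_pretrained` to both the model and the tokenizer. The value below is a placeholder, not a real pin, and the helper `pinned_load_kwargs` is made up for this example.

```python
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

# Placeholder only: replace with a commit hash copied from the model repository's
# history on the Hugging Face Hub. "main" is a moving branch and pins nothing.
PINNED_REVISION = "main"


def pinned_load_kwargs(revision):
    """Keyword arguments that fix the repository revision for from_pretrained."""
    return {"revision": revision, "trust_remote_code": True}


load_kwargs = pinned_load_kwargs(PINNED_REVISION)
print(load_kwargs)

# Intended use in Step 2 (not executed here, because it downloads the model):
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME, device_map="auto", torch_dtype=torch.float16, **load_kwargs
# )
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, **load_kwargs)
```

With a real commit hash in place, later runs load exactly the files that were reviewed, instead of whatever is newest in the repository.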


**Step 3: Configure LLM Generation Parameters**

In [5]:
# Parameters to control LLM's response generation
generation_args = {
    "max_new_tokens": 512,      # Maximum number of tokens to generate
    "return_full_text": False,  # Return only the new text, not the prompt
}
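
The two parameters above are enough for this guide. As a sketch of how the same dictionary could be extended, the presets below add the standard `do_sample`, `temperature`, and `top_p` generation arguments. The specific values are illustrative assumptions, not recommendations from the original article, and the names `BASE_ARGS`, `GREEDY_ARGS`, `SAMPLING_ARGS`, and `with_overrides` are made up for this example.

```python
BASE_ARGS = {
    "max_new_tokens": 512,
    "return_full_text": False,
}

# Deterministic output: always pick the most likely next token.
GREEDY_ARGS = {**BASE_ARGS, "do_sample": False}

# More varied output: sample from the distribution (values are illustrative).
SAMPLING_ARGS = {**BASE_ARGS, "do_sample": True, "temperature": 0.7, "top_p": 0.9}


def with_overrides(base, **overrides):
    """Return a copy of base with some settings replaced, leaving base untouched."""
    return {**base, **overrides}


short_args = with_overrides(GREEDY_ARGS, max_new_tokens=64)
print(short_args)
```

Any of these dictionaries can be unpacked into the pipeline call in the same way as `generation_args`.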

In [6]:
def query(messages):
  """Sends a conversation history to the AI assistant and returns the answer.

  Args:
    messages (list): A list of dictionaries, each with "role" and "content" keys.

  Returns:
    str: The answer from the AI assistant.
  """

  output = pipe(messages, **generation_args)
  return output[0]['generated_text']
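
To see the data shapes involved without loading the model, the sketch below swaps the pipeline for a hypothetical stand-in, `fake_pipe`, that returns a result in the same form: a list with one dictionary holding a `generated_text` string. The variant `query_with` is also made up for this example; it takes the pipeline as an argument so that it can be exercised on a machine without a GPU.

```python
def fake_pipe(messages, **generation_args):
    """Hypothetical stand-in for the pipeline: echoes the last user message."""
    last_user = next(m["content"] for m in reversed(messages) if m["role"] == "user")
    return [{"generated_text": f"(echo) {last_user}"}]


def query_with(pipe_fn, messages, **generation_args):
    """Same logic as query(), but with the pipeline passed in explicitly."""
    output = pipe_fn(messages, **generation_args)
    return output[0]["generated_text"]


example_messages = [
    {"role": "system", "content": "You are a helpful digital assistant."},
    {"role": "user", "content": "Hi"},
]
reply = query_with(fake_pipe, example_messages, max_new_tokens=16, return_full_text=False)
print(reply)
```

Passing the real `pipe` object instead of `fake_pipe` should behave like `query()` above, since the indexing into the result is identical.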

In [7]:
# Example: Math Problem
messages = [
    {"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"}
]
result = query(messages)
print(result)

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


 To solve the equation 2x + 3 = 7, follow these steps:

1. Subtract 3 from both sides of the equation:
   2x + 3 - 3 = 7 - 3
   2x = 4

2. Divide both sides by 2:
   2x/2 = 4/2
   x = 2

So, the solution to the equation 2x + 3 = 7 is x = 2.


**Building the Chat Function**

In [None]:
def chat():
  """Enables interactive chat sessions with the AI assistant."""

  # Initialize the conversation with instructions for the AI assistant
  conversation_history = [
      {"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."}
  ]

  # Main chat loop
  while True:
    user_input = input("You: ")

    # Check if the user wants to exit the chat (type "exit" to quit)
    if user_input.strip().lower() == "exit":
        break

    # Add user's message to the conversation history
    conversation_history.append({"role": "user", "content": user_input})

    # Get a response from the AI assistant
    try:
        response = query(conversation_history)
    except Exception as e:
        # Drop the unanswered message so a retry does not duplicate it in the history
        conversation_history.pop()
        print(f"An error occurred: {e}. Please try again.")
        continue

    print("Assistant:", response)

    # Record the AI assistant's response in the conversation history
    conversation_history.append({"role": "assistant", "content": response})

In [None]:
chat()
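
The chat loop keeps appending to the conversation history, while the model name indicates a context window of roughly 4k tokens. A long session could therefore outgrow what the model can read. The sketch below shows one possible way to trim the history before each call. It assumes the first entry is the system message, and it uses a crude estimate of one token per four characters; both `estimate_tokens` and `trim_history` are made up for this example.

```python
def estimate_tokens(message):
    """Very rough token estimate: about one token per four characters (an assumption)."""
    return max(1, len(message["content"]) // 4)


def trim_history(history, max_tokens, count=estimate_tokens):
    """Keep the system message plus the most recent turns that fit in max_tokens."""
    system, turns = history[:1], history[1:]
    budget = max_tokens - sum(count(m) for m in system)

    kept = []  # collected newest-first
    for message in reversed(turns):
        cost = count(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    # Make sure the oldest kept turn is a user message, not a dangling reply.
    while kept and kept[-1]["role"] != "user":
        kept.pop()

    return system + kept[::-1]


history = [
    {"role": "system", "content": "s" * 40},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
    {"role": "assistant", "content": "d" * 40},
]
trimmed = trim_history(history, max_tokens=35)
print([m["role"] for m in trimmed])
```

Inside `chat()`, this could be applied as `query(trim_history(conversation_history, max_tokens=3584))`, where 3584 is an assumed budget that leaves room for the 512 new tokens. For accurate counts, the `count` argument could be replaced with a function based on the real tokenizer.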
