# Module 1 of the LLM Zoomcamp by DataTalks.Club (DTC)

## LLM Zoomcamp 1.1 - Introduction to LLM and RAG

### LLM (Large Language Model)

- Language Model: A basic language (NLP) model predicts the next token/word based on the previous ones.
- LLM: A language model built as a neural network with billions of parameters and trained on huge amounts of data.

A Large Language Model (LLM) is a type of artificial intelligence model that uses deep learning to understand and generate human language. It's trained on massive amounts of text data, allowing it to learn patterns and structures in language and perform various natural language processing (NLP) tasks. 
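Next-token prediction can be illustrated with a toy bigram model. This is only a sketch of the idea: real LLMs use neural networks rather than word counts, and the corpus and function names here are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Tiny illustrative corpus (not course data)
corpus = "the course starts in january . the course is free . the homework is due soon"
bigram_model = train_bigram(corpus)
print(predict_next(bigram_model, "the"))  # "course" follows "the" twice, "homework" once
```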


Here's a more detailed explanation:

-  Deep Learning:
LLMs are based on deep learning, a subfield of machine learning that uses artificial neural networks with multiple layers to analyze data and learn complex patterns. 

- Transformer Architecture:
Many LLMs are built upon the Transformer architecture, which allows them to process relationships between words in a sentence, even if they're far apart. 

- Training Data:
LLMs are trained on vast amounts of text, such as books, articles, and websites, to learn the nuances of language and its various forms. 

- Capabilities:
LLMs can perform a wide range of NLP tasks, including:
    - Text Generation: Creating different textual formats, like poems, code, scripts, musical pieces, email, letters, etc.
    - Translation: Translating languages.
    - Question Answering: Answering questions based on provided information.
    - Summarization: Condensing large amounts of text into a shorter version.
    - Sentiment Analysis: Determining the emotional tone of a piece of text.
    - Code Generation: Writing code.

- Applications:
LLMs have a wide range of applications across various industries, including:
    - Customer Service: Providing automated customer support. 
    - Content Creation: Generating marketing copy, blog posts, and other content. 
    - Research: Analyzing large datasets of text to extract insights. 
    - Education: Helping students with writing and language learning. 

In essence, LLMs are powerful tools that can understand, generate, and manipulate human language, making them valuable in many fields.


![What is LLM](LLM-zoomcamp-whatIsLLM.drawio.png)

### RAG (Retrieval Augmented Generation)

Retrieval-Augmented Generation (RAG) is a technique in natural language processing (NLP) that combines the strengths of retrieval and generative AI models. It works by first retrieving relevant information from a knowledge base and then using a large language model (LLM) to generate a response that incorporates the retrieved data. This allows for more accurate, up-to-date, and contextually relevant outputs.

Here's a more detailed breakdown:

*Retrieval*: RAG utilizes search algorithms to query external data sources like databases, knowledge bases, or even the web. 

*Integration*: The retrieved information is then integrated with a pre-trained LLM. 

*Generation*: The LLM uses the retrieved data to generate a response, which can be a question answer, a summary, or even new text. 
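The three steps above can be sketched in a few lines. This is a minimal illustration under loud assumptions: the word-overlap scoring, the `fake_llm` placeholder, and the three-sentence knowledge base are all invented for the example and stand in for a real search engine and a real model.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Retrieval: rank documents by how many query words they share (toy scoring)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def integrate(query, docs):
    """Integration: place the retrieved text next to the question in one prompt."""
    context = "\n".join(docs)
    return f"QUESTION: {query}\nCONTEXT:\n{context}"

def generate(prompt, llm):
    """Generation: hand the augmented prompt to a language model."""
    return llm(prompt)

def fake_llm(prompt):
    # Placeholder for a real model: it simply repeats the first context line
    first_context_line = prompt.split("CONTEXT:\n", 1)[1].splitlines()[0]
    return f"Based on the context: {first_context_line}"

# Hypothetical knowledge base, for illustration only
knowledge_base = [
    "You can join the course after it has started.",
    "Kafka runs inside Docker in module six.",
    "Certificates require two submitted projects.",
]
query = "can I join the course late"
docs = retrieve(query, knowledge_base)
prompt = integrate(query, docs)
answer = generate(prompt, fake_llm)
print(answer)
```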

Benefits of RAG:
- Enhanced Accuracy and Relevance:
By accessing external knowledge, RAG can generate more precise and relevant responses. 

- Improved Contextual Understanding:
The retrieved information helps the LLM better understand the context of the user's query, leading to more fitting answers. 

- Real-time Updates:
RAG can incorporate up-to-date information from external sources, ensuring that the generated responses are current. 

- Source Attributions:
RAG can provide citations or references to the sources used to generate the response, improving trust and transparency. 

- Cost-Effective:
RAG can deliver some of the benefits of a custom LLM without the high cost of retraining or fine-tuning a new model. 


![What is RAG](LLM-zoomcamp-whatIsRAG.drawio.png)

## LLM Zoomcamp 1.2 - Configuring Your Environment 
We will use a GitHub Codespace, accessed from local VS Code via Git.

1. Install the requirements:
```bash
pip install tqdm notebook openai elasticsearch pandas scikit-learn ipywidgets
```

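2. Generate an API key in OpenAI and export it in the terminal:
```bash
export OPENAI_API_KEY="<your key>"
```

The notebook cells below read `GITHUB_TOKEN` from the environment instead, because they call the GitHub Models endpoint, so export that variable in the same way if you follow them as written.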
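Before running the notebook, it can help to check that the expected variables are actually set. The helper below is a small sketch; `missing_vars` is a hypothetical name, and the example passes an explicit dictionary so it does not depend on the real environment.

```python
import os

def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping (made-up values)
example_env = {"OPENAI_API_KEY": "sk-example", "GITHUB_TOKEN": ""}
print(missing_vars(["OPENAI_API_KEY", "GITHUB_TOKEN"], example_env))
```

Calling `missing_vars(["OPENAI_API_KEY", "GITHUB_TOKEN"])` without the second argument checks the real environment.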

### 1.2: Test the OpenAI API

In [4]:
import os
from openai import OpenAI

token = os.environ["GITHUB_TOKEN"]
endpoint = "https://models.github.ai/inference"
model = "openai/gpt-4.1"

client = OpenAI(
    base_url=endpoint,
    api_key=token,
)

response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ],
    temperature=1.0,
    top_p=1.0,
    model=model
)

print(response.choices[0].message.content)



The capital of France is Paris.


In [5]:
# completion = client.chat.completions.create(
#   model=model,
#   messages=[
#     {
#       "role": "user",
#       "content": "What is the meaning of life?"
#     }
#   ]
# )


In [6]:
# print(completion.choices[0].message.content)

In [7]:
completion = client.chat.completions.create(
  model=model,
  messages=[
    {
      "role": "user",
      "content": "Is it too late to join the course?"
    }
  ]
)

print(completion.choices[0].message.content)

Could you please specify which course you are referring to? If you provide the name or details of the course, I can give you more accurate information about enrollment deadlines.


## LLM Zoomcamp 1.3 Retrieval

### 1.3.1 Implement a Search Engine

To implement one from scratch, follow the original repo:

#### [Build Your Own Search Engine](https://github.com/alexeygrigorev/build-your-own-search-engine)

or go to the [internal folder](build_your_own_search_engine/README.md).

Instead of building a search engine, we can continue with the minimalist one built by Alexey Grigorev from DTC: [minsearch](https://github.com/alexeygrigorev/minsearch).

#### Intro to RAG

In [8]:
import minsearch

In [9]:
# load data

import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [10]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
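The flattening loop above can be tried on a small inline sample without downloading anything. This is a sketch with made-up data; unlike the loop above, it builds new dictionaries instead of modifying the raw ones, which is a design choice of the example rather than something the course requires.

```python
# Made-up sample with the same nesting as documents.json: courses containing documents
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"question": "Q1", "text": "A1"},
            {"question": "Q2", "text": "A2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [{"question": "Q3", "text": "A3"}],
    },
]

def flatten(raw):
    """Return one flat list of documents, each tagged with its course name."""
    return [
        {**doc, "course": course["course"]}
        for course in raw
        for doc in course["documents"]
    ]

flat = flatten(sample_raw)
print(len(flat))
print(flat[0])
```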

In [11]:
# Create the index: text_fields are searched, keyword_fields are used for filtering
# (the index is fitted on the documents in a later cell)
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [12]:
# Keyword fields work like an exact-match filter, similar to this SQL
# (the table name is only illustrative):
# SELECT * FROM documents WHERE course = 'data-engineering-zoomcamp';

In [13]:
q = 'the course already started, can I still enroll?'

In [14]:
index.fit(documents)

<minsearch.minsearch.Index at 0x7067ff3d4590>

In [15]:
# define the boost to identify importance of fields
# the higher the value, the more important the field is
# the default value is 1.0 for all fields
boost = {
    'question': 3.0, 'section': 0.5
}
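The effect of a boost can be seen with a toy scorer that counts shared words per field and multiplies each count by the field's weight. This is not how minsearch computes relevance (it relies on TF-IDF and cosine similarity); the function, documents, and query below are assumptions made up to show the weighting only.

```python
def score(doc, query, boost):
    """Count query words found in each field, weighted by that field's boost."""
    query_words = set(query.lower().split())
    total = 0.0
    for field, text in doc.items():
        overlap = len(query_words & set(text.lower().split()))
        # Fields without an explicit boost keep the default weight of 1.0
        total += boost.get(field, 1.0) * overlap
    return total

toy_boost = {"question": 3.0, "section": 0.5}
toy_query = "can I enroll"

# Same words, but placed in different fields
doc_a = {"question": "can I still enroll", "section": "general"}
doc_b = {"question": "general", "section": "can I still enroll"}

print(score(doc_a, toy_query, toy_boost))  # match in the boosted "question" field
print(score(doc_b, toy_query, toy_boost))  # match in the down-weighted "section" field
```

The same three matching words score higher when they appear in `question` than in `section`, which is why the question field dominates the ranking.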

In [16]:
results  = index.search(
    query=q,
    boost_dict=boost,
    num_results=5
)

In [17]:
results

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the cour

In [18]:
results  = index.search(
    query=q,
    filter_dict={'course': 'data-engineering-zoomcamp'},
    boost_dict=boost,
    num_results=5
)

In [19]:
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202
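Comparing the two result lists shows what `filter_dict` does: entries from other courses disappear. Conceptually this is an exact-match filter on keyword fields, which can be sketched as follows. The helper name and sample documents are hypothetical, and this is not minsearch's own code.

```python
def filter_docs(docs, filter_dict):
    """Keep only documents whose fields equal every value in filter_dict."""
    return [
        doc
        for doc in docs
        if all(doc.get(field) == value for field, value in filter_dict.items())
    ]

# Made-up documents that differ only in their course
sample_docs = [
    {"question": "Can I still join?", "course": "data-engineering-zoomcamp"},
    {"question": "Can I still join?", "course": "machine-learning-zoomcamp"},
]

matches = filter_docs(sample_docs, {"course": "data-engineering-zoomcamp"})
print(matches)
```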

## LLM Zoomcamp 1.4 - Generating Answers with LLM

In [20]:
import os
api_key = os.getenv("OPENAI_API_KEY")


In [22]:
import os
from openai import OpenAI

token = os.environ["GITHUB_TOKEN"]
endpoint = "https://models.github.ai/inference"
model = "openai/gpt-4.1"

client = OpenAI(
    base_url=endpoint,
    api_key=token,
)

In [23]:
q

'the course already started, can I still enroll?'

In [24]:
response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": q}
    ]
)

In [25]:
response.choices[0].message.content

'Whether you can still enroll in a course after it has started depends on a few factors:\n\n1. **Institution Policy**: Many educational institutions have an "add/drop" period—typically the first week or two of classes—during which you can still enroll. After this period, enrollment may not be allowed.\n2. **Course Type**: Some courses (like online or self-paced ones) may allow late enrollment, while others (especially those with in-person meetings or lab components) might not.\n3. **Instructor Approval**: Sometimes, you can enroll late if you get permission from the course instructor or department.\n4. **Program Requirements**: Certain programs have strict requirements and deadlines that cannot be bypassed.\n\n**What should you do?**\n- Check the course or institution’s website for add/drop deadlines and late enrollment policies.\n- Contact the course instructor or academic advisor to ask if late enrollment is possible.\n- Be prepared to catch up on any missed material if you are allow

In [26]:
prompt_template = """
PROMPT:
You are a course teaching assistant.
Your task is to help students with their questions about the course material based on the context provided from the FAQ database.
Use only the context to answer the questions. If the context does not provide enough information, respond with "I don't know" or "I don't have enough information to answer that question."
Be polite and concise in your responses.

QUESTION: {question}

CONTEXT:
{context}
"""


In [27]:
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202

In [28]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [29]:
print(context)

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related questions
question: Course - When will the course start?
answer: The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start wit

In [30]:
prompt = prompt_template.format(question=q, context=context).strip()
print(prompt)

PROMPT:
You are a course teaching assistant.
Your task is to help students with their questions about the course material based on the context provided from the FAQ database.
Use only the context to answer the questions. If the context does not provide enough information, respond with "I don't know" or "I don't have enough information to answer that question."
Be polite and concise in your responses.

QUESTION: the course already started, can I still enroll?

CONTEXT:
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the 

In [31]:
response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

In [32]:
response.choices[0].message.content

"Yes, you can still join the course after the start date. Even if you don't register, you're still eligible to submit the homeworks. Just be mindful of the deadlines for the final projects."

## LLM Zoomcamp 1.5 - The RAG Flow: Cleaning and Modularizing Code

In [33]:
def search(query):

    boost = {
        'question': 3.0,
        'section': 0.5
    }

    # Search the index with the query and boost
    # Filter by course if needed
    # For example, if you want to filter by 'data-engineering-zoomcamp'
    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )
    return results

In [34]:
def build_prompt(query, search_results):

    # The template text starts at column 0 so the prompt sent to the LLM
    # does not carry the function's indentation on every line
    prompt_template = """
PROMPT:
You are a course teaching assistant.
Your task is to help students with their questions about the course material based on the context provided from the FAQ database.
Use only the context to answer the questions. If the context does not provide enough information, respond with "I don't know" or "I don't have enough information to answer that question."
Be polite and concise in your responses.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    # Concatenate the retrieved FAQ entries into a single context string
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()

    return prompt

In [35]:
import os
from openai import OpenAI

token = os.environ["GITHUB_TOKEN"]
endpoint = "https://models.github.ai/inference"
model = "openai/gpt-4.1"

client = OpenAI(
    base_url=endpoint,
    api_key=token,
)

def llm(prompt):

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

In [36]:
query = "How do I run Kafka?"

def rag(query):
    """
    Run the RAG process: search, build prompt, and get answer from LLM.
    """
    # Search for relevant documents
    search_results = search(query)

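    # Build the prompt from the query and the retrieved documents
    prompt = build_prompt(query, search_results)

    # Ask the LLM to answer using that prompt
    answer = llm(prompt)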

    return answer

In [37]:
rag("The course already started, can I still enroll?")

'Yes, you can still enroll even after the course has started. You are eligible to submit the homeworks, but please be mindful of the deadlines for final project submissions.'
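One benefit of splitting the flow into functions is that each step can be swapped out. The sketch below passes the three steps in as arguments so the flow can be exercised offline, without an index or an API token. `rag_with_deps` and the stand-in functions are hypothetical names, not part of the course code.

```python
def rag_with_deps(query, search_fn, build_prompt_fn, llm_fn):
    """Run search, prompt building, and generation with each step passed in."""
    search_results = search_fn(query)
    prompt = build_prompt_fn(query, search_results)
    return llm_fn(prompt)

# Hypothetical stand-ins for the real search index and LLM client
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes, you can."}]

def simple_prompt(query, search_results):
    context = "\n".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT:\n{context}"

def fake_llm(prompt):
    # Reports the prompt size instead of calling a model
    return f"received {len(prompt.splitlines())} prompt lines"

print(rag_with_deps("Can I still enroll?", fake_search, simple_prompt, fake_llm))
```

With the real functions, the equivalent call would be `rag_with_deps(query, search, build_prompt, llm)`.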