# Module 1 of LLM Zoomcamp by DataTalks.Club (DTC)

## LLM Zoomcamp 1.1 - Introduction to LLM and RAG

### LLM (Large Language Model)

- Language Model (LM): A basic language (NLP) model predicts the next token or word based on the previous ones.
- LLM: A language model built as a neural network with billions of parameters and trained on huge amounts of data.
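
To make "predict the next word based on the previous ones" concrete, here is a minimal sketch of a bigram model that only counts which word follows which. It is a toy illustration, not how an LLM works internally: the function names `train_bigram` and `predict_next` and the sample corpus are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text, just for illustration
corpus = "the course starts in january and the course ends in march"
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

An LLM follows the same idea of predicting the next token, but replaces the count table with a neural network that conditions on a much longer context.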

A Large Language Model (LLM) is a type of artificial intelligence model that uses deep learning to understand and generate human language. It's trained on massive amounts of text data, allowing it to learn patterns and structures in language and perform various natural language processing (NLP) tasks. 


Here's a more detailed explanation:

-  Deep Learning:
LLMs are based on deep learning, a subfield of machine learning that uses artificial neural networks with multiple layers to analyze data and learn complex patterns. 

- Transformer Architecture:
Many LLMs are built upon the Transformer architecture, which allows them to process relationships between words in a sentence, even if they're far apart. 

- Training Data:
LLMs are trained on vast amounts of text, such as books, articles, and websites, to learn the nuances of language and its various forms. 

- Capabilities:
LLMs can perform a wide range of NLP tasks, including:
    - Text Generation: Creating different textual formats, like poems, code, scripts, musical pieces, emails, letters, etc. 
    - Translation: Translating text between languages. 
    - Question Answering: Answering questions based on provided information. 
    - Summarization: Condensing large amounts of text into a shorter version. 
    - Sentiment Analysis: Determining the emotional tone of a piece of text. 
    - Code Generation: Writing code. 

- Applications:
LLMs have a wide range of applications across various industries, including:
    - Customer Service: Providing automated customer support. 
    - Content Creation: Generating marketing copy, blog posts, and other content. 
    - Research: Analyzing large datasets of text to extract insights. 
    - Education: Helping students with writing and language learning. 

In essence, LLMs are powerful tools that can understand, generate, and manipulate human language, making them valuable in many fields. 


![What is LLM](LLM-zoomcamp-whatIsLLM.drawio.png)

### RAG (Retrieval Augmented Generation)

Retrieval-Augmented Generation (RAG) is a technique in natural language processing (NLP) that combines the strengths of retrieval and generative AI models. It works by first retrieving relevant information from a knowledge base and then using a large language model (LLM) to generate a response that incorporates the retrieved data. This allows for more accurate, up-to-date, and contextually relevant outputs. 

Here's a more detailed breakdown:

*Retrieval*: RAG utilizes search algorithms to query external data sources like databases, knowledge bases, or even the web. 

*Integration*: The retrieved information is then added to the prompt that is sent to a pre-trained LLM. 

*Generation*: The LLM uses the retrieved data to generate a response, which can be an answer to a question, a summary, or even new text. 
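
The three steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: retrieval is plain word overlap instead of a real search engine, `fake_llm` is a stand-in that just echoes the prompt instead of calling a model, and every name and document here is hypothetical.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Retrieval: rank documents by how many query words they share (toy scoring)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in knowledge_base:
        overlap = len(query_words & set(doc.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, docs):
    """Integration: put the retrieved documents into the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def rag(query, knowledge_base, llm):
    """Generation: the LLM answers from the prompt that carries the context."""
    docs = retrieve(query, knowledge_base)
    prompt = build_prompt(query, docs)
    return llm(prompt)


def fake_llm(prompt):
    # Stand-in for a real model call; it only echoes the prompt it received
    return "STUB ANSWER for prompt:\n" + prompt


# Hypothetical knowledge base, just for illustration
knowledge_base = [
    "You can enroll after the course has started.",
    "Certificates require two projects.",
    "Homework deadlines are announced in Slack.",
]

print(rag("can I enroll after the course started", knowledge_base, fake_llm))
```

Passing the LLM in as a function keeps the flow testable without an API key; a real client call can be swapped in later.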

Benefits of RAG:
- Enhanced Accuracy and Relevance:
By accessing external knowledge, RAG can generate more precise and relevant responses. 

- Improved Contextual Understanding:
The retrieved information helps the LLM better understand the context of the user's query, leading to more fitting answers. 

- Real-time Updates:
RAG can incorporate up-to-date information from external sources, ensuring that the generated responses are current. 

- Source Attributions:
RAG can provide citations or references to the sources used to generate the response, improving trust and transparency. 

- Cost-Effective:
RAG can deliver some of the benefits of a custom LLM without the high cost of retraining or fine-tuning a new model. 


![What is RAG](LLM-zoomcamp-whatIsRAG.drawio.png)

## LLM Zoomcamp 1.2 - Configuring Your Environment

We will be using a codespace opened in local VS Code via Git.

1. Install the requirements:
```bash
pip install tqdm notebook openai elasticsearch pandas scikit-learn ipywidgets
```

2. Generate an API key in OpenAI (or in OpenRouter, which the cells below use as `base_url`) and export it in the terminal:
```bash
export OPENAI_API_KEY="<your key>"
```
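
Before calling the API it can help to check that the exported variable is actually visible to Python. The helper below is a small sketch; the name `load_api_key` is made up for this example and is not part of the `openai` package.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it in the terminal before starting the notebook"
        )
    return key
```

In a notebook, `api_key = load_api_key()` then either returns the key or stops with a clear message, which is easier to debug than an authentication error from the API.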

### 1.2: Test the OpenAI API

In [None]:
import os
# Read the key from the environment variable exported above; never hardcode a real key in the notebook
api_key = os.getenv("OPENAI_API_KEY")  # works in notebook

In [2]:
from openai import OpenAI
import os

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=api_key
)

completion = client.chat.completions.create(
  model="meta-llama/llama-3.3-8b-instruct:free",
  messages=[
    {
      "role": "user",
      "content": "What is the meaning of life?"
    }
  ]
)


In [3]:
print(completion.choices[0].message.content)

A question that has puzzled philosophers, theologians, scientists, and thinkers for centuries! The meaning of life is a complex and subjective concept that has been debated and explored in various ways. Here are some possible perspectives:

1. **Biological and Physical Perspective**: From a biological and physical standpoint, the meaning of life is to survive and reproduce, ensuring the continuation of the species. This perspective suggests that life is driven by instinct, necessity, and the pursuit of self-preservation.
2. **Existentialist Perspective**: Existentialists believe that life has no inherent meaning, and it's up to individuals to create their own purpose and meaning. This perspective emphasizes freedom, choice, and personal responsibility.
3. **Humanistic Perspective**: Humanists argue that the meaning of life is to seek happiness, fulfillment, and self-actualization. This perspective focuses on the cultivation of human potential, personal growth, and the pursuit of well-b

In [4]:
completion = client.chat.completions.create(
  model="meta-llama/llama-3.3-8b-instruct:free",
  messages=[
    {
      "role": "user",
      "content": "Is it toolate to join the course?"
    }
  ]
)

print(completion.choices[0].message.content)

It seems like you're inquiring about joining a course, but I'd need more details to provide a helpful response. Could you please provide more information about the course you're interested in, such as its nature (online, physical, study program, etc.), the deadline for enrollment, or any other relevant details? This will help me give you a more accurate answer regarding whether it's too late to join.


## LLM Zoomcamp 1.3 - Retrieval

### 1.3.1 Implement a Search Engine

To implement the search engine, follow the original repo:

#### [Build Your Own Search Engine](https://github.com/alexeygrigorev/build-your-own-search-engine)

or go to the [internal folder](build_your_own_search_engine/README.md).

Instead of building a search engine from scratch, we can continue with minsearch, a minimalist one built by Alexey from DTC. [Link](https://github.com/alexeygrigorev/minsearch)

#### Intro to RAG

In [5]:
import minsearch

In [6]:
# load data

import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [7]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [8]:
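# Create the index (it is fitted on the documents in a later cell)
# text_fields are searched by the query; keyword_fields are used for exact-match filtering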
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [9]:
# A keyword field works like an exact-match filter, similar to this SQL condition:
# SELECT * WHERE course = 'data-engineering-zoomcamp';

In [10]:
q = 'the course already started, can I still enroll?'

In [11]:
index.fit(documents)

<minsearch.minsearch.Index at 0x7767292712b0>

In [12]:
# define the boost to identify importance of fields
# the higher the value, the more important the field is
# the default value is 1.0 for all fields
boost = {
    'question': 3.0, 'section': 0.5
}

In [13]:
results  = index.search(
    query=q,
    boost_dict=boost,
    num_results=5
)

In [14]:
results

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the cour

In [15]:
results  = index.search(
    query=q,
    filter_dict={'course': 'data-engineering-zoomcamp'},
    boost_dict=boost,
    num_results=5
)

In [16]:
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202
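
To show what `boost_dict` and `filter_dict` do conceptually, here is a minimal pure-Python sketch. It is not minsearch's implementation: as far as I understand, minsearch scores text fields with TF-IDF and cosine similarity, while this example swaps in simple word overlap. The names `field_score` and `toy_search` and the sample documents are made up for illustration.

```python
def field_score(query, text):
    """Fraction of query words that appear in the field text (toy relevance score)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def toy_search(docs, query, text_fields, boost=None, filters=None, num_results=5):
    boost = boost or {}
    filters = filters or {}
    scored = []
    for doc in docs:
        # Keyword filters are exact matches, like a SQL WHERE condition
        if any(doc.get(field) != value for field, value in filters.items()):
            continue
        # Each text field contributes its score multiplied by its boost (default 1.0)
        score = sum(
            boost.get(field, 1.0) * field_score(query, doc.get(field, ""))
            for field in text_fields
        )
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical documents, just for illustration
docs = [
    {"question": "Can I still join the course?", "text": "Yes, you can.", "course": "de"},
    {"question": "How do I install Docker?", "text": "You can still join the course channel for help.", "course": "de"},
    {"question": "Can I still join the course?", "text": "Yes.", "course": "ml"},
]

results_de = toy_search(
    docs,
    query="can i still join",
    text_fields=["question", "text"],
    boost={"question": 3.0},
    filters={"course": "de"},
)
for doc in results_de:
    print(doc["course"], "|", doc["question"])
```

With the boost on `question`, a document whose question matches the query ranks above one that only matches in the answer text, and the filter removes documents from other courses before ranking.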

## LLM Zoomcamp 1.4 - Generating Answers with LLM

In [21]:
import os
# Read the key from the environment; never hardcode a real key as the fallback value
api_key = os.getenv("OPENAI_API_KEY")

In [22]:
from openai import OpenAI

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=api_key
)

In [23]:
q

'the course already started, can I still enroll?'

In [25]:
response = client.chat.completions.create(
    model="meta-llama/llama-3.3-8b-instruct:free",
    messages=[
        {"role": "user", "content": q}
    ]
)

In [28]:
response.choices[0].message.content

"Whether you can still enroll in a course that has already started depends on the specific policies of the course provider, such as a college, online learning platform, or training organization. Here are some general factors to consider:\n\n1. **Check the course provider's policy**: Look for information on their website, in the course details, or contact their customer support or enrollment team directly. Some providers might allow late enrollment, while others might have strict deadlines.\n\n2. **Late Enrollment Fees**: If late enrollment is permitted, there might be additional fees. These can vary widely, so it's essential to ask about any extra costs.\n\n3. **catch-up process**: If you enroll late, you might need to catch up on the material that has already been covered. This could mean additional work on your part to stay on track with the rest of the class.\n\n4. **Alternative start points**: In some cases, you might be allowed to start at a later point in the course, depending on

In [41]:
prompt_template = """
PROMPT:
You are a course teaching assistant.
Your task is to help students with their questions about the course material based on the context provided from the FAQ database.
Use only the context to answer the questions. If the context does not provide enough information, respond with "I don't know" or "I don't have enough information to answer that question."
Be polite and concise in your responses.

QUESTION: {question}

CONTEXT:
{context}
"""

In [42]:
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202

In [43]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [44]:
print(context)

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related questions
question: Course - When will the course start?
answer: The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start wit
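
The context-building loop and the prompt template can be folded into two small helpers so that they are reusable for any query. This is a sketch: `build_context`, `build_prompt`, and the shortened `PROMPT_TEMPLATE` are names and text chosen for this example, and the sample search result is hypothetical.

```python
# Shortened stand-in for the notebook's prompt template
PROMPT_TEMPLATE = """
You are a course teaching assistant.
Answer the QUESTION using only the CONTEXT from the FAQ database.
If the CONTEXT does not contain the answer, say "I don't know".

QUESTION: {question}

CONTEXT:
{context}
""".strip()


def build_context(search_results):
    """Turn search results into one text block, one entry per document."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    return "\n\n".join(entries)


def build_prompt(question, search_results, template=PROMPT_TEMPLATE):
    """Fill the template with the user question and the formatted context."""
    return template.format(question=question, context=build_context(search_results))


# Hypothetical search result, shaped like the documents returned by the index
sample_results = [
    {
        "section": "General course-related questions",
        "question": "Course - Can I still join the course after the start date?",
        "text": "Yes, you can still join.",
    }
]

print(build_prompt("can I still enroll?", sample_results))
```

Joining the entries with `"\n\n".join(...)` avoids the trailing blank lines that repeated string concatenation leaves at the end of the context.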

In [45]:
prompt = prompt_template.format(question=q, context=context).strip()
print(prompt)

PROMPT:
You are a course teaching assistant.
Your task is to help students with their questions about the course material based on the context provided from the FAQ database.
Use only the context to answer the questions. If the context does not provide enough information, respond with "I don't know" or "I don't have enough information to answer that question."
Be polite and concise in your responses.

QUESTION: the course already started, can I still enroll?

CONTEXT:
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the 

In [46]:
response = client.chat.completions.create(
    model="meta-llama/llama-3.3-8b-instruct:free",
    messages=[
        {"role": "user", "content": prompt}
    ]
)

In [47]:
response.choices[0].message.content

"Yes, you can still enroll in the course even though it has already started. You'll still be eligible to submit homeworks, but be aware of the deadlines for the final projects."