# **NLP Assignment 4: LLM Applications**

Welcome to this assignment! In this task, you will learn how to build **LLM applications from the ground up** using the **LangChain framework** — a widely adopted and industry-standard toolkit for developing AI-driven applications.

---

## Preparation

Before starting the assignment, it is **strongly recommended** that you:

- Review the provided manual thoroughly.
- Attempt the preliminary exercises to become familiar with LangChain’s concepts and syntax.

Working through these exercises independently will help you build a **solid foundation** in LLM application development and prepare you for real-world use cases in research and industry.

---

## Guidelines

- <span style="color:green">**LCEL Syntax:**</span> You must use LangChain’s **LCEL syntax** for every task.  
- <span style="color:green">**Library Consistency:**</span> Use the same libraries and functions as demonstrated in the manual for Parts 1 and 2. While alternatives exist, using them may indicate a lack of understanding of the core material.  
- <span style="color:green">**Context Engineering:**</span> Starting from Part 1, Task 1, all chains must follow the principles of **Context Engineering using System Messages**, as detailed in the manual. Failure to adhere will result in a loss of marks.  
- <span style="color:green">**Reflective Questions:**</span> All reflective questions must be completed independently. The use of AI tools for these questions is strictly prohibited.

# **Part 1: Building LLM pipelines**

![LangChain Logo](figs/LangChain_logo.png)

# **Introduction to LangChain**

### Setting up Gemini's API key

Let's start by connecting to Gemini. You’ll need an API key from Google for this.

In [28]:
import numpy as np
import torch
import glob
import os, getpass
from dotenv import load_dotenv
import pandas as pd
import json
import pypdf
from pprint import pprint

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import BertTokenizer, BertModel
from langchain_chroma import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser

load_dotenv()
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google AI API key: ")

MODEL_NAME = "gemini-2.5-flash"  # or, feel free to use any LLM here
llm = ChatGoogleGenerativeAI(model=MODEL_NAME, temperature=0)

In [3]:
# From the manual, test the code samples here. Write code on your own to grasp the core concept.
prompt = ChatPromptTemplate.from_template("Tell me a funny joke. The joke must be related to {pakistani_politician}. It's okay to make fun of politicians! Make it quick, snappy and witty.")

llm = ChatGoogleGenerativeAI(model=MODEL_NAME, temperature=0)
parser = StrOutputParser()
chain = prompt | llm | parser

In [None]:
## Invoke a chain that prompts the LLM to write a funny joke. The top 3 funniest jokes in the class would get a free cookie from me :)
message = chain.invoke({"pakistani_politician": "Nawaz Sharif"})

print(message)

What's Nawaz Sharif's favorite cure for political pressure?

A first-class ticket to London for "medical treatment"!


## Task 1: Prompt Engineering



##### Converting Unstructured Data to a Structured Format
We'll tackle a problem that almost every data scientist faces: how to convert unstructured data into a structured form that is easy to process. LLMs can do this by converting free text into a JSON object.

Use the data from this Kaggle dataset for this task:
[Resume Dataset](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset)

Your goal is to convert each resume in this dataset to a JSON object, which contains the following information:

- candidate_name 
- candidate_skills (a list of relevant skills)
- candidate_experience (candidate experience in years)
- candidate_profession

You must do this using **Zero-Shot, One-Shot and Few-Shot Prompting**.

For this problem:
- The Sample Inputs can be found under `datasets/part1/task1/resumes`
- The Sample Outputs are in the JSON file `datasets/part1/task1/resume_outputs.json`
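
The three prompting styles differ only in how many worked examples precede the real input. The sketch below illustrates that idea with plain Python; the helper names `build_messages` and `escape_braces`, the wording of the system message, and the `"assistant"` role for example outputs are assumptions made for illustration, not something prescribed by the manual.

```python
import json

FIELDS = [
    "candidate_name",
    "candidate_skills",
    "candidate_experience",
    "candidate_profession",
]


def build_messages(resume_text, examples=()):
    """Return (role, content) pairs.

    `examples` is a sequence of (resume_text, expected_dict) pairs:
    zero examples gives a zero-shot prompt, one gives one-shot,
    several give few-shot.
    """
    system = (
        "Extract a JSON object with exactly these keys from the resume: "
        + ", ".join(FIELDS)
        + ". candidate_experience is a number of years."
    )
    messages = [("system", system)]
    for example_resume, expected in examples:
        messages.append(("user", example_resume))
        # Serialize with json.dumps so the example is valid JSON
        # (double quotes), not a Python dict repr.
        messages.append(("assistant", json.dumps(expected)))
    messages.append(("user", resume_text))
    return messages


def escape_braces(text):
    """Double literal braces so a template engine does not read them as variables."""
    return text.replace("{", "{{").replace("}", "}}")


zero_shot = build_messages("Jane Doe, nurse, 3 years of experience")
one_shot = build_messages(
    "Jane Doe, nurse, 3 years of experience",
    examples=[("Alex Chen, engineer, 10 years", {"candidate_name": "Alex Chen"})],
)
```

If messages like these are written directly into a `ChatPromptTemplate`, literal `{` and `}` in the serialized JSON need to be doubled (as `escape_braces` does), because the default template format treats single braces as input variables. Passing the example through an input variable, as the cells below do, avoids the issue.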

In [4]:
# load the sample inputs
RESUMES_PATH = "datasets/part1/task1/resumes"

resume_files = glob.glob(f"{RESUMES_PATH}/*.txt")
sample_resumes = []
for file_path in resume_files:
    with open(file_path, "r", encoding="utf-8") as f:
        sample_resumes.append(f.read())

In [None]:
sample_resumes

['**Resume 4: Data Analyst**\n\n**Jessica Park**\nChicago, IL | (555) 987-6543 | jessica.park@email.com | linkedin.com/in/jessicapark\n\n**Summary**\nAnalytical and detail-oriented Data Analyst with 2 years of experience transforming complex datasets into actionable insights. Proficient in SQL, data visualization, and statistical analysis. Passionate about using data to drive business strategy and improve operational efficiency.\n\n**Professional Experience**\n\n*Data Analyst*, Retail Insights Corp., Chicago, IL | 2022 – Present\n*   Wrote complex SQL queries to extract and analyze sales, customer, and inventory data from a centralized data warehouse.\n*   Created interactive Tableau dashboards for the executive team, tracking KPIs and identifying sales trends, which contributed to a 15% reduction in excess inventory.\n*   Performed A/B test analysis for marketing campaigns, providing recommendations that improved conversion rates by 8%.\n*   Assisted in building and maintaining ETL pi

In [4]:
# load dataset from Kaggle
KAGGLE_RESUME_PATH = "datasets/part1/task1/kaggle_resumes/UpdatedResumeDataSet.csv"

resumes_df = pd.read_csv(KAGGLE_RESUME_PATH)

resumes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  962 non-null    object
 1   Resume    962 non-null    object
dtypes: object(2)
memory usage: 15.2+ KB


In [None]:
resumes_df

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
957,Testing,Computer Skills: â¢ Proficient in MS office (...
958,Testing,â Willingness to accept the challenges. â ...
959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...


In [None]:
# LLM generated batch iterator
# using batches to send fewer requests
resumes_df_small = resumes_df.sample(n=100, random_state=42)

def batch_iterator(df, batch_size):
    """Yield batches of rows from a DataFrame."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start:start + batch_size]

# Example usage:
batch_size = 10
for batch in batch_iterator(resumes_df, batch_size):
    # Process each batch (e.g., send to LLM, print, etc.)
    print(batch)
    break

       Category                                             Resume
0  Data Science  Skills * Programming Languages: Python (pandas...
1  Data Science  Education Details \r\nMay 2013 to May 2017 B.E...
2  Data Science  Areas of Interest Deep Learning, Control Syste...
3  Data Science  Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4  Data Science  Education Details \r\n MCA   YMCAUST,  Faridab...
5  Data Science  SKILLS C Basics, IOT, Python, MATLAB, Data Sci...
6  Data Science  Skills â¢ Python â¢ Tableau â¢ Data Visuali...
7  Data Science  Education Details \r\n B.Tech   Rayat and Bahr...
8  Data Science  Personal Skills â¢ Ability to quickly grasp t...
9  Data Science  Expertise â Data and Quantitative Analysis â...


### Solve the task using:

### 1. Zero-Shot Prompting

In [None]:
zeroshot_system_prompt = """
    You are a master data analyst and annotator, especially skilled in creating JSON objects from rows of a dataframe.
    You will be provided a dataframe batch, and your job is to convert them into JSON objects for each row.
    Only Categories: "candidate_name", "candidate_skills", "candidate_experience", "candidate_profession"
"""

zeroshot_user_prompt = """
Convert these rows into JSON:

{batch}

"""
zero_shot_prompt_template = ChatPromptTemplate.from_messages([
    ("system", zeroshot_system_prompt),
    ("user", zeroshot_user_prompt),
])

json_parser = JsonOutputParser()

zeroshot_chain = zero_shot_prompt_template | llm | json_parser

zeroshot_jsons = []

batch_size = 4
for i, batch in enumerate(batch_iterator(resumes_df_small, batch_size)):
    print(f"Batch #{i+1}")
    batch_text = "\n---\n".join(batch['Resume'].astype(str).tolist())
    message = zeroshot_chain.invoke({"batch": batch_text})
    zeroshot_jsons.append(message)

Batch #1
Batch #2
Batch #3
Batch #4
Batch #5
Batch #6
Batch #7
Batch #8
Batch #9
Batch #10
Batch #11
Batch #12
Batch #13
Batch #14
Batch #15
Batch #16
Batch #17
Batch #18
Batch #19
Batch #20
Batch #21
Batch #22
Batch #23
Batch #24
Batch #25


In [None]:
zeroshot_jsons[:2]

[[{'candidate_name': None,
   'candidate_skills': ['Java',
    'Servlet',
    'JSP',
    'Spring Boot',
    'HTML5',
    'CSS3',
    'Bootstrap',
    'JavaScript',
    'JQuery',
    'Ajax',
    'AngularJs',
    'MySQL',
    'Eclipse',
    'Spring Tool Suit',
    'Net beans',
    'Sublime Text',
    'Atom',
    'Windows XP',
    'Windows 7',
    'Windows 8',
    'Windows 10',
    'Spring',
    'Hibernate',
    'KendoUI',
    'Core Java'],
   'candidate_experience': ['Css: Less than 1 year',
    'Ajax: Less than 1 year',
    'Servlet: Less than 1 year',
    'Html5: Less than 1 year',
    'Spring: Less than 1 year',
    'Java: Less than 1 year',
    'Jquery: Less than 1 year',
    'Jsp: Less than 1 year',
    'Javascript: Less than 1 year',
    'Bootstrap: Less than 1 year',
    'Spring Boot: Less than 1 year',
    'Java developer at Salcluster technologies',
    'OmegaSoft Technologies pvt.ltd: 5 months',
    'Internship Project (Employment Times): 4 months',
    'Project (GST And Sales 

### 2. One-Shot Prompting

In [5]:
SAMPLE_RESUME_PATH = "datasets/part1/task1/resumes/resume1.txt"
SAMPLE_RESUME_2_PATH = "datasets/part1/task1/resumes/resume2.txt"
SAMPLE_RESUME_3_PATH = "datasets/part1/task1/resumes/resume3.txt"
SAMPLE_JSON_PATH = "datasets/part1/task1/resume_outputs.json"

sample_resume = None
sample_json = None

with open(SAMPLE_RESUME_PATH, "r") as file:
    sample_resume = file.read()

with open(SAMPLE_JSON_PATH, "r") as f:
    sample_json = json.load(f)['resume1']

In [None]:
sample_resume

'**Resume 1: Senior Software Engineer**\n\n**Alex Chen**\nSan Francisco, CA | (123) 456-7890 | alex.chen@email.com | linkedin.com/in/alexchen\n\n**Summary**\nA seasoned Senior Software Engineer with over 10 years of experience in designing, developing, and scaling high-performance web applications. Proven ability to lead technical projects, mentor junior engineers, and drive the adoption of modern software architecture and DevOps practices. Deep expertise in the JavaScript/TypeScript ecosystem and cloud platforms.\n\n**Professional Experience**\n\n*Senior Software Engineer*, TechFlow Inc., San Francisco, CA | 2018 – Present\n*   Led the redesign of a monolithic payment processing service into a microservices architecture, improving system uptime from 99.9% to 99.99% and reducing latency by 40%.\n*   Architected and implemented a real-time data dashboard using React, Node.js, and WebSockets, serving over 1 million daily active users.\n*   Mentored 4 junior and mid-level engineers, condu

In [None]:
sample_json

{'candidate_name': 'Alex Chen',
 'candidate_skills': ['JavaScript',
  'TypeScript',
  'React',
  'Node.js',
  'Microservices',
  'AWS',
  'Docker',
  'Kubernetes',
  'CI/CD',
  'System Architecture',
  'Mentoring'],
 'candidate_experience': 10,
 'candidate_profession': 'Senior Software Engineer'}

In [None]:
oneshot_system_prompt = """ 
    You are a master data analyst and annotator, especially skilled in creating JSON objects from rows of a dataframe.
    You will be provided a dataframe batch, and your job is to convert them into JSON objects for each row.
    Only Categories: "candidate_name", "candidate_skills", "candidate_experience", "candidate_profession"
"""

oneshot_user_prompt = """
    Given a row of data, your job is to convert this row into a JSON object. Analyse the following example for reference...
    
    Sample row:
    {sample_row}
    
    Sample JSON:
    {sample_json}
    
    Now, convert these rows into JSON:
    {batch}
    
"""

oneshot_prompt_template = ChatPromptTemplate.from_messages([
    ("system", oneshot_system_prompt),
    ("user", oneshot_user_prompt),
])

json_parser = JsonOutputParser()

oneshot_chain = oneshot_prompt_template | llm | json_parser

oneshot_jsons = []

batch_size = 4
for i, batch in enumerate(batch_iterator(resumes_df_small, batch_size)):
    print(f"Batch #{i+1}")
    batch_text = "\n---\n".join(batch['Resume'].astype(str).tolist())
    message = oneshot_chain.invoke(
    {
        "sample_row": sample_resume, 
        "sample_json": sample_json, 
        "batch": batch_text,
    })
    
    oneshot_jsons.append(message)

Batch #1
Batch #2
Batch #3
Batch #4
Batch #5
Batch #6
Batch #7
Batch #8
Batch #9
Batch #10
Batch #11
Batch #12
Batch #13
Batch #14
Batch #15
Batch #16
Batch #17
Batch #18
Batch #19
Batch #20
Batch #21
Batch #22
Batch #23
Batch #24
Batch #25


In [None]:
oneshot_jsons

[[{'candidate_name': None,
   'candidate_skills': ['Java',
    'Servlet',
    'JSP',
    'Spring Boot',
    'HTML5',
    'CSS3',
    'Bootstrap',
    'JavaScript',
    'jQuery',
    'Ajax',
    'AngularJS',
    'MySQL',
    'Eclipse',
    'Spring Tool Suite',
    'NetBeans',
    'Sublime Text',
    'Atom',
    'Spring',
    'Hibernate',
    'KendoUI',
    'MVC Architecture'],
   'candidate_experience': 0,
   'candidate_profession': 'Full Stack Java Developer'},
  {'candidate_name': None,
   'candidate_skills': ['Spring MVC',
    'Hibernate',
    'JDBC',
    'Java',
    'J2EE',
    'Azure Web Services',
    'JSP',
    'Struts',
    'Servlet',
    'REST API',
    'JavaScript',
    'AJAX',
    'HTML',
    'JSON',
    'PHP',
    'MS SQL',
    'MySQL',
    'Oracle',
    'Apache Tomcat',
    'OneSignal',
    'Ionic',
    'AngularJS',
    'Linux',
    'Mac OS',
    'Windows Server'],
   'candidate_experience': 2,
   'candidate_profession': 'Java Developer'},
  {'candidate_name': None,
   'can

### 3. Few-Shot Prompting

In [6]:
with open(SAMPLE_JSON_PATH, "r") as f:
    sample_json = json.load(f)

# limit to first 3 examples
sample_json = list(sample_json.items())[:3]

In [7]:
sample_resume_2 = None
sample_resume_3 = None

with open(SAMPLE_RESUME_2_PATH, "r") as file:
    sample_resume_2 = file.read()
    
with open(SAMPLE_RESUME_3_PATH, "r") as file:
    sample_resume_3 = file.read()

In [8]:
output_path = "datasets/part1/task1/kaggle_outputs.json"
fewshot_jsons = []

fewshot_system_prompt = """ 
    You are a master data analyst and annotator, especially skilled in creating JSON objects from rows of a dataframe.
    You will be provided a dataframe batch, and your job is to convert them into JSON objects for each row.
    Only Categories: "candidate_name", "candidate_skills", "candidate_experience", "candidate_profession"
"""

fewshot_user_prompt = """
    Given a row of data, your job is to convert this row into a JSON object. Analyse the following example for reference...
    
    Sample row 1:
    {sample_row_1}
    
    Sample JSON 1:
    {sample_json_1}
    
    Sample row 2:
    {sample_row_2}
    
    Sample JSON 2:
    {sample_json_2}
    
    Sample row 3:
    {sample_row_3}
    
    Sample JSON 3:
    {sample_json_3}
    
    Now, convert these rows into JSON:
    {batch}
    
"""

fewshot_prompt_template = ChatPromptTemplate.from_messages([
    ("system", fewshot_system_prompt),
    ("user", fewshot_user_prompt),
])

json_parser = JsonOutputParser()

fewshot_chain = fewshot_prompt_template | llm | json_parser


In [None]:
# LLM generated code to process batches of full dataset, saving every batch

# Load previous results if file exists
if os.path.exists(output_path):
    with open(output_path, "r", encoding="utf-8") as file:
        fewshot_jsons = json.load(file)
start_batch = len(fewshot_jsons)
batch_size = 4

for i, batch in enumerate(batch_iterator(resumes_df, batch_size)):
    if i < start_batch:
        continue  # Skip already processed batches

    print(f"Batch #{i+1}")
    batch_text = "\n---\n".join(batch['Resume'].astype(str).tolist())
    try:
        message = fewshot_chain.invoke({
            "sample_row_1": sample_resume, 
            "sample_json_1": sample_json[0][1],
            
            "sample_row_2": sample_resume_2, 
            "sample_json_2": sample_json[1][1], 
            
            "sample_row_3": sample_resume_3, 
            "sample_json_3": sample_json[2][1], 
            
            "batch": batch_text,
        })
        
        fewshot_jsons.append(message)
        
        with open(output_path, "w", encoding="utf-8") as file:
            json.dump(fewshot_jsons, file, ensure_ascii=False, indent=2)
            
    except Exception as e:
        print(f"Error occurred: {e}")
        print("Quota likely exhausted or API key expired. Progress saved. Please switch API key and rerun.")
        break

Batch #173
Batch #174
Batch #175
Batch #176
Batch #177
Batch #178
Batch #179
Batch #180
Batch #181
Batch #182
Batch #183
Batch #184
Batch #185
Batch #186
Batch #187
Batch #188
Batch #189
Batch #190
Batch #191
Batch #192
Batch #193
Batch #194
Batch #195
Batch #196
Batch #197
Batch #198
Batch #199
Batch #200
Batch #201
Batch #202
Batch #203
Batch #204
Batch #205
Batch #206
Batch #207
Batch #208
Batch #209
Batch #210
Batch #211
Batch #212
Batch #213
Batch #214
Batch #215
Batch #216
Batch #217
Batch #218

In [None]:
# Save the results for this in the datasets/task1/kaggle_outputs.json file
# I'm assuming this means the few-shot only
with open("datasets/part1/task1/kaggle_outputs.json", "w+", encoding="utf-8") as file:
    json.dump(fewshot_jsons, file, ensure_ascii=False, indent=2)


### **Reflective Question:** Comment on the quality of LLM responses under each prompting technique. Which one is better in this case, and in what scenario(s) would we choose one technique over the other?

**Answer:**

All three strategies convert the resumes into the specified JSON format, but they differ in consistency.

Zero-shot prompting receives no description of what each field should contain, so the model tends to pack as much as it can into every field. For example, `candidate_experience` came back as a list of per-skill strings rather than a single number of years.

One-shot and few-shot prompting let the model infer the field definitions from the examples, so their outputs are more consistent.

In this case, one-shot prompting is the better choice: it consumes fewer input tokens than few-shot prompting and still produces consistent output.

If input tokens were less of a constraint, few-shot prompting would be a reasonable option; otherwise, one-shot prompting is sufficient here.
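
One way to make this comparison measurable is to check every returned object against the expected schema and report the fraction that pass. The sketch below is a minimal version of that idea; the allowed types in `REQUIRED_TYPES` (for instance, permitting `None` for a missing name) and the helper names are assumptions, not part of the assignment.

```python
# Assumed schema: names and professions may be missing (None),
# skills must be a list, experience must be a single number of years.
REQUIRED_TYPES = {
    "candidate_name": (str, type(None)),
    "candidate_skills": list,
    "candidate_experience": (int, float),
    "candidate_profession": (str, type(None)),
}


def schema_problems(record):
    """Return a list of human-readable schema violations for one JSON object."""
    problems = []
    for key, expected in REQUIRED_TYPES.items():
        if key not in record:
            problems.append(f"missing {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[key], bool) or not isinstance(record[key], expected):
            problems.append(f"{key} has type {type(record[key]).__name__}")
    extra = set(record) - set(REQUIRED_TYPES)
    problems.extend(f"unexpected {key}" for key in sorted(extra))
    return problems


def valid_fraction(batches):
    """Fraction of objects, across a list of batches, with no schema violations."""
    records = [record for batch in batches for record in batch]
    if not records:
        return 0.0
    return sum(not schema_problems(r) for r in records) / len(records)


good = {
    "candidate_name": None,
    "candidate_skills": ["Java"],
    "candidate_experience": 2,
    "candidate_profession": "Java Developer",
}
bad = dict(good, candidate_experience=["Java: Less than 1 year"])
```

Running `valid_fraction` over `zeroshot_jsons`, `oneshot_jsons`, and `fewshot_jsons` would give one comparable number per technique, assuming each is a list of batches as in the cells above.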

## Task 2: Building a Job Hiring Assistant

You are going to build an AI assistant that helps employers find the best candidates for a job.

- Employers share job descriptions.  
- Job seekers share their resumes.  
- Your assistant's tasks are to:
  1. Pull out important candidate details from the resume (like name, skills, and experience).  
  2. Summarize the job description in a few clear sentences.  
  3. Compare the candidate with the job and give a fit score, a recommended role, and notes on why they are a good fit.

---

### How the assistant works

1. Extract candidate details from the resume.  
   - Output should be a JSON with the same fields as in Task 1.

2. Summarize the job description.  
   - Make it short and easy to understand (2-3 sentences).  

3. Evaluate the candidate.  
   - Take the candidate info from step 1 and the job summary from step 2.  
   - Decide how well the candidate fits the job.

---

### Flow of the pipeline

1. Chain 1: Resume → Candidate Info (JSON)  
2. Chain 2: Job Description → Summary (plain text)  
3. Chain 3: Candidate Info + Job Summary → Fit Assessment (fit score, recommended role, alignment notes)

---

### Requirements

- Chains 1 and 2 should run at the same time (in parallel).  
- Both Chains 1 and 2 should use **few-shot prompting**.
- Evaluate this on the resumes from the Kaggle dataset.
- Use the job description provided below.

In [None]:
# # getting the first 100 JSON entries for this
# # kaggle_outputs.json has batches of 4, so I need 25 entries from the array
# json_obj_arrays = None

# with open(output_path) as file:
#     json_obj_arrays = json.load(file)
    
# # Flatten the list of lists to get individual resume JSONs
# resume_jsons = [item for batch in json_obj_arrays[:25] for item in batch]

# len(resume_jsons)

99

In [None]:
job_description = """
We're hiring a full-stack developer to maintain backend services, build REST APIs,
and collaborate on deploying scalable applications using Docker and PostgreSQL.
"""

job_description_system_prompt = """
You are a linguistics expert, proficient in generating thoughtful and insightful summaries which encode meaning.
"""

job_description_user_prompt = """
Summarize the following job description in 2-3 short sentences:

{job_description}
"""


resume_chain = ...
job_description_chain = ...
evaluation_chain = ...

# Task 3: Retrieval-Augmented Generation (RAG)

In this task, you will implement a complete **RAG (Retrieval-Augmented Generation)** pipeline using LangChain.  
The goal is to build a system that retrieves relevant document chunks from a knowledge base and uses them to answer questions accurately.

The following resources have also been attached for your reference:

[RAG Resource](https://medium.com/@callumjmac/implementing-rag-in-langchain-with-chroma-a-step-by-step-guide-16fc21815339)

[Chroma Resource](https://docs.langchain.com/oss/python/integrations/vectorstores/chroma)

---

### Step 1: Set Up a Document Loader

You first need to load your dataset (e.g., a PDF document) and convert it into text.  
Use LangChain's `PyPDFLoader` for this purpose.

In [9]:
BOOK_PATH = "datasets/part1/task1/Speech and Language Processing.pdf"

loader = PyPDFLoader(BOOK_PATH)

In [10]:
docs = loader.load()

In [11]:
print(docs[2].page_content)

Contents
I Large Language Models 1
1 Introduction 3
2 Words and Tokens 4
2.1 Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Morphemes: Parts of Words . . . . . . . . . . . . . . . . . . . . . 8
2.3 Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Subword Tokenization: Byte-Pair Encoding . . . . . . . . . . . . 13
2.5 Rule-based tokenization . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Simple Unix Tools for Word Tokenization . . . . . . . . . . . . . 28
2.9 Minimum Edit Distance . . . . . . . . . . . . . . . . . . . . . . . 29
2.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 N-gram

In [12]:
len(docs)

622

### Step 2: Text Splitter – Making the Data Digestible

Because an LLM's context window is limited, the model cannot process the entire text at once.
We therefore divide the text into smaller chunks. Note that `RecursiveCharacterTextSplitter` measures `chunk_size` and `chunk_overlap` in characters by default, not tokens.

To preserve context, we add some overlap between chunks so that each piece retains part of its surrounding context.
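
The effect of overlap can be seen with a simplified fixed-width splitter. This is only an illustration of the sliding-window idea: `split_with_overlap` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it does not try to break on separators such as paragraphs or spaces.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so each window
    starts `chunk_size - chunk_overlap` characters after the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Here every chunk repeats the last two characters of the previous one. Under this simplified model, a larger overlap means more chunks to embed and store, which is the trade-off to keep in mind when tuning the two parameters.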

In [13]:
# LLM generated
def text_splitter(document, chunk_size, chunk_overlap):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_documents(document)

DEFAULT_CHUNK_SIZE = 128
DEFAULT_CHUNK_OVERLAP = 50

In [None]:
text_chunks = text_splitter(docs, DEFAULT_CHUNK_SIZE, DEFAULT_CHUNK_OVERLAP)

### Step 3: Converting Chunks to Embeddings

Next, use a pre-trained BERT model from HuggingFace to convert the text chunks into embeddings.
These embeddings serve as numerical representations of the document text and will be stored for retrieval later.

In [30]:
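# Prefer Apple MPS, then CUDA, and fall back to CPU so the cell runs on any machine
device = torch.device(
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)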
BERT_MODEL_PATH = "google-bert/bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_PATH)
model = BertModel.from_pretrained(BERT_MODEL_PATH).to(device)

# def get_bert_embeddings(text_chunks):
#     embeddings = []
    
#     # iterate through each chunk
#     for doc in text_chunks:
#         text = doc.page_content
        
#         # tokenize the text
#         inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        
#         with torch.no_grad():
#             # get embeddings
#             outputs = model(**inputs)
            
#             # Use the [CLS] token embedding as the chunk embedding
#             cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
            
#             embeddings.append(cls_embedding)

#     return embeddings

# chunk_embeddings = get_bert_embeddings(text_chunks)

### Step 4: Store the embeddings in a Vector Store

Once the embeddings are generated, store them in a Vector Store for efficient retrieval.
The retriever will query this vector store to find relevant chunks based on semantic similarity.

For this task, use the **Chroma Vector Store**, configured to run locally (in-memory).

In [31]:
# get BERT embeddings from the docs
def get_bert_embeddings(text_chunks):
    embeddings = []
    
    print(f"Total chunks: {len(text_chunks)}\n")
    
    # iterate through each chunk
    for i, doc in enumerate(text_chunks):
        text = doc.page_content
        print(f"Embedding chunk {i+1}")
        
        # tokenize the text
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        with torch.no_grad():
            # get embeddings
            outputs = model(**inputs)
            
            # Use the [CLS] token embedding as the chunk embedding
            cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy()
            
            embeddings.append(cls_embedding)

    return embeddings

chunk_embeddings = get_bert_embeddings(text_chunks)

Total chunks: 39214

Embedding chunk 1
Embedding chunk 2
Embedding chunk 3
Embedding chunk 4
Embedding chunk 5
Embedding chunk 6
Embedding chunk 7
Embedding chunk 8
Embedding chunk 9
Embedding chunk 10
Embedding chunk 11
Embedding chunk 12
Embedding chunk 13
Embedding chunk 14
Embedding chunk 15
Embedding chunk 16
Embedding chunk 17
Embedding chunk 18
Embedding chunk 19
Embedding chunk 20
Embedding chunk 21
Embedding chunk 22
Embedding chunk 23
Embedding chunk 24
Embedding chunk 25
Embedding chunk 26
Embedding chunk 27
Embedding chunk 28
Embeddin

In [None]:
# create a vector store
vector_store = Chroma(
    embedding_function=None,
    embedding_data=chunk_embeddings,
    collection_name="Part1_Task3_RAG_VecStore",
    persist_directory="./vectorstores/Part1_Task3_RAG_VecStore/"
)

TypeError: Chroma.__init__() got an unexpected keyword argument 'embedding_data'

### Step 5: Build a RAG chain

Now, connect all components into a complete RAG chain:
- Define a prompt template that integrates the retrieved context and the user query.
- Connect this chain to your chosen LLM.

The LLM will generate answers using both the retrieved context and its internal knowledge.

### Step 6: Query the System

Query your RAG system using the following prompts.
Your retriever should return the top-k elements from the vector store as context, where k = 5.

For each query, store:
1.	The average cosine similarity between the query vector and each of the top-5 retrieved document chunks (averaged across all queries).
2.	The LLM-generated response for each query.
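
For reference, cosine similarity between a query vector and each chunk vector can be computed directly from the embeddings. The sketch below uses plain Python lists and hypothetical helper names; with real BERT vectors a NumPy version would be faster. Computing the value from the vectors is a cautious choice, since scores returned by a vector store may be distances rather than cosine similarities, depending on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return the k best (similarity, chunk_index) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def mean_similarity(hits):
    """Average similarity over a list of (similarity, chunk_index) pairs."""
    return sum(score for score, _ in hits) / len(hits) if hits else 0.0


hits = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
```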

In [None]:
queries = [
    "What section is BPE training from? Also list the examples used for this algorithm in the book.",
    "What formula does the book use for the Minimum Edit Distance algorithm?",
    # page 42
    "What problems are highlighted in the book when dealing with scale in large n-gram models?",
    # page 47
    "What technique of visualizing a language model was proposed by Shannon in 1948? Furthermore, list the section which contains content relevant to this topic.",
    # page 404
    "What examples does the book use to demonstrate syntactic constituency?"    
]

### **Reflective Question:** Did you get the expected results for each query? If not, then why do you think this may be so? What parameter(s) can we tune in this setup to improve performance?

**Answer:**

---

## Retrieval Analysis

After querying, analyze the retrieval quality of your RAG system.

Create a plot where:
- The x-axis represents the retrieval ranking position (Chunk 1, Chunk 2, …, Chunk 5)
- The y-axis shows the average cosine similarity score for chunks at that ranking position.
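
The values for the y-axis can be obtained by averaging column-wise over the queries. A minimal sketch, assuming the similarity scores are stored as one list per query, ordered by rank (the function name is hypothetical):

```python
def average_by_rank(scores_per_query):
    """scores_per_query[q][r] is the similarity of the chunk at rank r for query q.

    Returns one average per rank position, taken across all queries.
    """
    if not scores_per_query:
        return []
    # Use the shortest row so a query with fewer results does not raise.
    k = min(len(row) for row in scores_per_query)
    n = len(scores_per_query)
    return [sum(row[rank] for row in scores_per_query) / n for rank in range(k)]


averages = average_by_rank([[1.0, 0.5], [0.5, 0.25]])
labels = [f"Chunk {rank + 1}" for rank in range(len(averages))]
```

`labels` and `averages` can then be handed to any plotting library as the x and y values.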

### **Reflective Question:** Comment on the trend observed. Is it what we're expecting? Why or why not?
**Answer:**

---

### Additional Experiment

Now, invoke the same RAG chain for the same queries, but vary the number of retrieved chunks (k) using the following values:

[1, 10, 20, 50]

### **Reflective Question:** How does changing the value of k affect the overall quality of results?

**Answer:**


# Task 4: AI-as-a-Judge

In this task, you will design an **LLM-based evaluation and reflection loop** for a Retrieval-Augmented Generation (RAG) system. Your goal is to implement a **Critic–Reflector mechanism** that evaluates, scores, and iteratively improves the performance of your RAG pipeline using the following dataset:

> **Dataset:** [Neural Bridge RAG Dataset](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000)

---

## Objectives:

You will:
1. Implement a **Critic** agent that assesses your RAG model’s answers.  
2. Use a **Reflector** agent to analyze the Critic’s feedback and optimize the RAG Generator’s performance across multiple iterations.  
3. Visualize how the **reward (Critic rating)** evolves as your system self-tunes.

---

## System Components

| Component | Model | Role |
|------------|--------|------|
| **Generator** | Gemini 2.0 Flash | Produces answers using your RAG pipeline |
| **Critic** | Gemini 2.5 Pro | Evaluates Generator’s output and provides rewards |
| **Reflector** | Gemini 2.5 Pro | Adjusts the Generator’s configuration to maximize reward |

---

## Part 1 — Setting Up the Critic

1. Configure an **LLM Critic** using the **Gemini 2.5 Pro** model (or a higher version than your RAG Generator).  
   - Set the **temperature** parameter to `0` for deterministic evaluation.  

2. The Critic should take the following as input:
   - The user query (question)  
   - The retrieved context/documents  
   - The Generator’s response  

3. The Critic must output:
   - An **expected (ideal) answer**
   - A **numerical rating (out of 10)** indicating how well the Generator’s response aligns with the expected output  

4. Use **few-shot prompting** to guide the Critic’s evaluations.  
   - Collect several sample query–retrieval–response–rating pairs by manually evaluating model outputs from the dataset.  
   - Provide these examples in your Critic’s prompt to calibrate its scoring behavior.

> 💡 *Tip:* This few-shot calibration step should be clearly shown in your code before running the evaluation loop.
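
Because the Critic's rating feeds the rest of the loop, it helps to validate its reply before using it. The sketch below assumes the Critic is prompted to answer with a JSON object containing the keys `expected_answer` and `rating`; those key names and both helper functions are illustrative assumptions.

```python
import json


def parse_critic_output(raw):
    """Parse the Critic's JSON reply into (expected_answer, rating)."""
    data = json.loads(raw)
    expected_answer = str(data["expected_answer"])
    rating = float(data["rating"])
    if not 0 <= rating <= 10:
        raise ValueError(f"rating {rating} is outside the 0-10 scale")
    return expected_answer, rating


def format_critic_example(query, context, response, expected_answer, rating):
    """Render one manually rated sample as text for the Critic's few-shot prompt."""
    verdict = json.dumps({"expected_answer": expected_answer, "rating": rating})
    return (
        f"Query: {query}\n"
        f"Retrieved context: {context}\n"
        f"Generator response: {response}\n"
        f"Evaluation: {verdict}"
    )


expected, rating = parse_critic_output('{"expected_answer": "Paris", "rating": 8}')
```

Rejecting out-of-range or malformed replies early keeps a single bad Critic output from distorting the reward curve.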

In [None]:
# TODO
# 1. Initialize the Critic
# 2. Run the Critic with a few examples from the dataset, and evaluate the outputs. 
#    You should cherry-pick the best outputs from the bunch, and use them as your few-shot samples for the following task.

## Part 2 — Designing the Reflector

The **Reflector** takes as input:
- The original user query  
- The Generator’s response  
- The Critic’s expected response and rating  
- Any other hyperparameters or metrics you deem useful  

Using this information, the Reflector should:
- Modify or tune the Generator’s configuration to **maximize the Critic’s rating (reward)** over time.  
- The Reflector may adjust:
  - The Generator’s **system or user messages** (prompt engineering)  
  - The **text splitter parameters** (e.g., `chunk_size`, `chunk_overlap`)  
  - Any other relevant **retriever or generator hyperparameters** that can improve the RAG pipeline’s accuracy or coherence.

In [None]:
# TODO: Initialize the Reflector

## Part 3 — Iterative Optimization Loop

1. Run your **Critic–Reflector feedback loop** for **30 iterations**.  
2. After each iteration:
   - Re-run the Generator with the updated configuration.  
   - Re-evaluate using the Critic.  
   - Record the **reward score** assigned by the Critic.  

3. Plot the **reward progression** over the 30 iterations to visualize how your RAG system improves (or regresses) with reflection-driven optimization.
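
The loop described above can be sketched independently of any particular model. In this skeleton the Generator, Critic, and Reflector are passed in as plain callables, and the stub implementations at the bottom exist only so the code runs without an API key; all names, signatures, and the `config` dictionary layout are assumptions.

```python
def run_reflection_loop(generate, critique, reflect, config, queries, iterations=30):
    """Run the Critic-Reflector loop and return (rewards, final_config).

    generate(query, config)            -> (context, response)
    critique(query, context, response) -> (expected_answer, rating)
    reflect(config, feedback)          -> new config
    """
    rewards = []
    for _ in range(iterations):
        feedback = []
        for query in queries:
            context, response = generate(query, config)
            expected, rating = critique(query, context, response)
            feedback.append(
                {"query": query, "response": response, "expected": expected, "rating": rating}
            )
        # One reward per iteration: the mean Critic rating over all queries.
        rewards.append(sum(item["rating"] for item in feedback) / len(feedback))
        config = reflect(config, feedback)
    return rewards, config


# Stubs standing in for the real LLM-backed components.
def stub_generate(query, config):
    return "context", f"answer with k={config['k']}"


def stub_critique(query, context, response):
    k = int(response.split("=")[1])
    return "ideal answer", float(min(k, 10))


def stub_reflect(config, feedback):
    return {**config, "k": config["k"] + 1}


rewards, final_config = run_reflection_loop(
    stub_generate, stub_critique, stub_reflect, {"k": 1}, ["q1", "q2"], iterations=3
)
```

The returned `rewards` list has one entry per iteration and is what the reward-progression plot would show.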

In [None]:
# TODO: Integrate the Critic and Reflector with the RAG system, and run the loop

#### **Reflective Questions:**

##### 1. Did you see any improvements in the pipeline's performance? Explain why or why not, and describe the general trend of reward progression.

##### 2. Are LLMs a reliable source for scoring purposes? Explain why we set the temperature to 0 for these tasks.

##### 3. Which hyperparameters had the most significant impact in optimizing the RAG performance?