# Retrieval-Augmented Generation Lab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/S24-CS143AI/blob/main/rag_lab.ipynb)

In this lab, we'll combine code from our previous two labs on language models and semantic search to perform Retrieval-Augmented Generation - a strategy for providing large language models with information necessary to respond to user queries.

First, we need to install some libraries.

In [None]:
import sys
!{sys.executable} -m pip install transformers sentence_transformers requests

## A variation on the chat bot from last week

Last time, we used the *conversational* pipeline, though I recently learned that this is being deprecated. Here's a similar way you can make a chatbot using the `text2text-generation` pipeline.

In [2]:
from transformers import pipeline

chatbot = pipeline("text2text-generation", model="facebook/blenderbot-400M-distill")



In [3]:
conversation = "User: What is computer science?\n"
response = chatbot(conversation)
print(response)

[{'generated_text': ' Computer science is a branch of mathematics that deals with computation.'}]


In [4]:
# end each turn with a newline so the speakers' turns don't run together
conversation += "Assistant: "+response[0]["generated_text"]+"\n"
conversation += "User: Does it draw from disciplines other than mathematics?\n"

In [5]:
response = chatbot(conversation)
print(response)

[{'generated_text': ' Yes, it does. It is the study of algorithms and how they work.'}]


## Combining with Semantic Search

Now we are going to combine the chat bot code with the semantic search code from last time. 

Here's the strategy:
* Perform semantic search on the user's query
* Collect the most relevant documents from the semantic search
* Prompt the chatbot with the question and the relevant documents

### Getting the data

The data we're using comes in JSON format with information about course names and descriptions.

In [6]:
import requests
import json
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import numpy as np

def euclidean_distance(vec1, vec2):
    """Calculate the Euclidean distance between two vectors."""
    return np.linalg.norm(vec1 - vec2)

course_descriptions_url = 'https://raw.githubusercontent.com/ericmanley/S24-CS143AI/main/data/course_descriptions.json'
course_descriptions = requests.get(course_descriptions_url).json()

print("Here's a few of the course descriptions from our data:")
course_descriptions[:5]

Here's a few of the course descriptions from our data:


[{'course_code': 'ACCT 0--',
  'course_name': 'ACCT-LOWER DIVISION',
  'description': 'No course description is available.'},
 {'course_code': 'ACCT 041',
  'course_name': 'INTRODUCTION TO FINANCIAL ACCOUNTING',
  'description': 'The elements of the financial statements, accounting for deferrals, the double-entry accounting system, internal control and cash, receivables and payables, inventory, operational assets, long-term debt, equity transactions, income measurement, and comprehensive treatment of the balance sheet, the income statement and the statement of cash flows.  Financial statement analysis will be integrated throughout the course.  Prereq.:  None.'},
 {'course_code': 'ACCT 042',
  'course_name': 'INTRODUCTION TO MANAGERIAL ACCOUNTING',
  'description': 'Explaining manufacturing and nonmanufacturing costs and how they are reported in the financial statements, computing the cost of providing a service or manufacturing a product, determining cost behavior as activity levels ch

### Preparing the data for the language model

We need to be able to feed text to the language models, so we'll concatenate the important parts of each record in the JSON data. For RAG, we usually refer to each chunk of text like this (each course) as a **document** even though it isn't a file - just a unit of text to consider as a whole.

In [7]:
course_description_concat = []

for course in course_descriptions:
    curr_entry = course["course_code"]+course["course_name"]+course["description"]
    course_description_concat.append(curr_entry)
    
print("Here's what they will look like when we feed them to a language model:")
course_description_concat[:5]

Here's what they will look like when we feed them to a language model:


['ACCT 0--ACCT-LOWER DIVISIONNo course description is available.',
 'ACCT 041INTRODUCTION TO FINANCIAL ACCOUNTINGThe elements of the financial statements, accounting for deferrals, the double-entry accounting system, internal control and cash, receivables and payables, inventory, operational assets, long-term debt, equity transactions, income measurement, and comprehensive treatment of the balance sheet, the income statement and the statement of cash flows.  Financial statement analysis will be integrated throughout the course.  Prereq.:  None.',
 'ACCT 042INTRODUCTION TO MANAGERIAL ACCOUNTINGExplaining manufacturing and nonmanufacturing costs and how they are reported in the financial statements, computing the cost of providing a service or manufacturing a product, determining cost behavior as activity levels change, accumulating and presenting relevant data for decision-making, profit planning and budgeting, capital expenditure decisions and financial statement analysis. Prereq.:  AC
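
Notice in the output above that the course code, name, and description run together (for example `ACCT 041INTRODUCTION`). Here is a minimal sketch of one way to keep the fields distinguishable by joining them with explicit separators. The helper name `format_course` and the `: ` / `. ` separators are assumptions for illustration, not part of the lab code.

```python
# Hypothetical helper (not part of the lab): join the fields with explicit
# separators so the course code, name, and description stay distinguishable.
def format_course(course):
    return f'{course["course_code"]}: {course["course_name"]}. {course["description"]}'

# One record copied from the data shown earlier
sample_course = {
    "course_code": "ACCT 0--",
    "course_name": "ACCT-LOWER DIVISION",
    "description": "No course description is available.",
}

formatted = format_course(sample_course)
print(formatted)
```

Whether separators change the search results is something you would need to check by regenerating the embeddings and rerunning the queries.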

### Creating the embeddings for each entry

Now we will create all of the embeddings for our course descriptions - this is the same as last time. 

Typically, you would only do this once, when you're setting up the chatbot to begin with. You should save these to a file that the chatbot has access to - typically called an **index** or **vector database**.

In [8]:

model = SentenceTransformer('all-MiniLM-L6-v2')
    
course_descriptions_embeddings = model.encode(course_description_concat)



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
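
To illustrate the idea of saving the embeddings once and reusing them, here is a minimal sketch that writes documents and their embeddings to a JSON file and reads them back. The function names `save_index` and `load_index`, the file name, and the toy vectors are all assumptions for illustration; a real index would hold the output of `model.encode`, and larger collections would normally use a dedicated vector database instead of a JSON file.

```python
import json
import os
import tempfile

def save_index(path, documents, embeddings):
    """Write each document next to its embedding as a JSON record."""
    records = [
        # float() turns each component into a plain Python float for JSON
        {"document": doc, "embedding": [float(x) for x in emb]}
        for doc, emb in zip(documents, embeddings)
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

def load_index(path):
    """Read the records back into parallel lists."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    documents = [r["document"] for r in records]
    embeddings = [r["embedding"] for r in records]
    return documents, embeddings

# Toy stand-ins for the course documents and their embeddings
toy_documents = ["doc a", "doc b"]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

index_path = os.path.join(tempfile.mkdtemp(), "course_index.json")
save_index(index_path, toy_documents, toy_embeddings)
loaded_documents, loaded_embeddings = load_index(index_path)
print(loaded_documents, loaded_embeddings)
```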


### Performing the semantic search

This code just turns what we worked on last time into a function.

We'll pass the query, the course info documents, the embeddings of those documents, and the number of items we want returned.

It returns one big string with all the relevant documents concatenated.

In [9]:
def get_relevant_documents(query,documents,document_embeddings,n):
    
    # get the embedding for the query
    query_embedding = model.encode(query)

    # a list for sorting the document indices by distance to the query embedding
    distance_index_pairs = []

    for i in range(len(documents)):

        curr_dist = euclidean_distance(query_embedding,document_embeddings[i])
        distance_index_pairs.append( (curr_dist,i) )

    distance_index_pairs.sort()
    
    relevant_documents = ""
    for j in range(n):
        relevant_documents += documents[ distance_index_pairs[j][1]  ]
        
    return relevant_documents

query = "Which courses cover neural networks?"
relevant_courses = get_relevant_documents(query,course_description_concat,course_descriptions_embeddings,5)
print("Here's context that will be helpful in answering the question:",relevant_courses)

Here's context that will be helpful in answering the question: CS 167MACHINE LEARNINGThis course introduces approaches to developing computer programs that learn from data.  Both foundational and contemporary machine learning algorithms will be covered in the context of a variety of data and problem types.  Specific topics will vary but may include artificial neural networks, decision trees, instance-based learning, Bayesian learning, support vector machines, hidden Markov models, reinforcement learning, and natural language processing.  Students will develop their own implementations of the algorithms as well as utilize modern machine learning software and programming libraries.  Pre-requisite: CS 065 and (CS 066 or STAT 040).NSCI 128FUNCTIONAL NEUROANATOMYThis course is focused on understanding the anatomy and function of the mammalian central nervous system. Course topics include the gross and ultrastructure of the brain, with an emphasis on neural systems that underlie motor contro
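
As a variation on the function above, here is a sketch that selects the closest documents with `heapq.nsmallest` instead of sorting every distance, and returns them as a list rather than one concatenated string. The name `top_n_documents` and the two-dimensional toy embeddings are assumptions for illustration; real embeddings from `all-MiniLM-L6-v2` have many more dimensions.

```python
import heapq
import math

def top_n_documents(query_embedding, documents, document_embeddings, n):
    """Return the n documents whose embeddings are closest to the query."""
    # Pair each Euclidean distance with the index of its document
    scored = (
        (math.dist(query_embedding, emb), i)
        for i, emb in enumerate(document_embeddings)
    )
    # nsmallest returns the n smallest pairs, ordered from closest to farthest
    closest = heapq.nsmallest(n, scored)
    return [documents[i] for _, i in closest]

# Toy stand-ins for the documents, their embeddings, and the query embedding
toy_docs = ["neural networks", "accounting", "machine learning"]
toy_embs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.4]]
toy_query = [0.9, 0.1]

top_docs = top_n_documents(toy_query, toy_docs, toy_embs, 2)
print("\n".join(top_docs))
```

Returning a list leaves the choice of separator to the caller, and `nsmallest` simply returns fewer items when `n` is larger than the number of documents, whereas the `range(n)` loop would raise an `IndexError`.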

### Including the search results in the chat prompt

Now we can take the information we found with semantic search and include it in the prompt for the chat.

In [10]:
instructions = "Answer the user's question using this context: "+relevant_courses
question = "User: "+query
conversation = instructions+question
response = chatbot(conversation,max_new_tokens=200)
display(response)


Token indices sequence length is longer than the specified maximum sequence length for this model (527 > 128). Running this sequence through the model will result in indexing errors


IndexError: index out of range in self
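
Separately from the length error, the prompt in the cell above joins the instructions, the context, and the question with no separators. Here is a minimal sketch of assembling the prompt with each part on its own line. The function name `build_prompt`, the bullet layout, and the trailing `Assistant:` label are assumptions for illustration, and this layout has not been tested against the model used in the lab.

```python
def build_prompt(context_documents, question):
    """Assemble instructions, context, and question on separate lines."""
    # One bullet per retrieved document
    context = "\n".join(f"- {doc}" for doc in context_documents)
    return (
        "Answer the user's question using this context:\n"
        f"{context}\n"
        f"User: {question}\n"
        "Assistant:"
    )

# Shortened stand-ins for the retrieved course documents
prompt = build_prompt(
    ["CS 167: MACHINE LEARNING", "NSCI 128: FUNCTIONAL NEUROANATOMY"],
    "Which courses cover neural networks?",
)
print(prompt)
```

This only changes the layout of the prompt. It does not make the prompt shorter, so it does not fix the error above.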

## Discussion

We passed 527 tokens to the model, but it has an input limit of 128 tokens. (Tokens are pieces of words, so the token count is usually higher than the word count.) What are some things we can do to try to fix this problem? Come up with at least two strategies and enter them on the shared slide for your group.



## Exercise: Try another model

Try this with the following model, which allows for larger input but is still small enough to run on Colab: https://huggingface.co/google/flan-t5-base

## Exercise: Put this in a loop

Build the chat bot into a loop that will allow the user to continually ask follow-up questions. Try testing it with questions like "What other topics does that course cover?"

## Exercise: Experiment with the temperature

You can set the `do_sample` and `temperature` parameters to affect how random the output is. Setting `do_sample=True` lets the model sample from its predicted token probabilities rather than decoding deterministically. The `temperature` controls how much randomness is allowed: values below 1 favor the most likely tokens, and values above 1 give unlikely tokens a better chance. Experiment with different temperature values and determine which value you're happiest with.

In [11]:
conversation = "User: What is computer science?"
response = chatbot(conversation,do_sample=True,temperature=2.5)
display(response)

[{'generated_text': ' Customer Service and Computer Science is a science that is used to solve problems.'}]
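
To build some intuition for why a temperature of 2.5 produces odd output like this, here is a small sketch of how temperature is commonly applied: the model's scores (logits) are divided by the temperature before being turned into probabilities with a softmax. The function name and the toy logits are assumptions for illustration, not values taken from the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores for three candidate tokens
toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature increases, the probabilities move closer together, so sampling picks lower-scoring tokens more often.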