<a href="https://colab.research.google.com/github/davidphamle/Projects/blob/main/TeachingBot_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Teaching Bot

## Install & Import Libraries

In [None]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken
!pip install datasets transformers[sentencepiece]
!pip install nltk rouge_score
#!pip install pickle5
#!pip install transformers

In [None]:
from PyPDF2 import PdfReader, PdfWriter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
import sys
import os
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn import metrics
import matplotlib.pyplot as plt
from datasets import load_metric
# import pickle5 as pickle

In [None]:
# Set your OpenAI API key. You will need to create an OpenAI account first.
# Here is the link to get the keys: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = "********************************************"

In [None]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/MyDrive/FYP"


Mounted at /content/gdrive


In [None]:
# Merge every PDF in root_dir into one file, then extract its text.
# reader = PdfReader('/content/gdrive/My Drive/Inputs/example1.pdf')

merger = PdfWriter()

# Keep the input handles open until the merged file has been written
pdf_files = []
for filename in os.listdir(root_dir):
    if filename.endswith('.pdf'):
        pdf_file = open(os.path.join(root_dir, filename), 'rb')
        pdf_files.append(pdf_file)
        merger.append(pdf_file)

# Write the merged PDF; the with-block closes it before it is read back
with open("document-output.pdf", "wb") as output:
    merger.write(output)

for pdf_file in pdf_files:
    pdf_file.close()

reader = PdfReader("document-output.pdf")

# Read the text of every page into a single string called raw_text
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

In [None]:
# split loaded text into smaller chunks so that during information retrieval we don't hit the token size limits.

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
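
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries to break on the listed separators, which this sketch ignores. The function name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: separators are ignored entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.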

In [None]:
len(texts)

344

In [None]:
texts[0]

'Lecture 08: C, Input/Output  \nFunctions  \nAs a second -year subject, we assume that you are at least somewhat familiar with functions. Thus, \nmost of this should be revision.  \nArguments  \nFunctions take arguments . It is likely at this stage that you have a basic understanding on how to use \nfunctions including argument s. So why am I repeating myself. Well, this is a good opportunity to do \nsome revision with a little bit more depth.  \nSo lets begin with an example:  \nint sum(int i1, int i2) \n{ \n int local = i1 + i2;  \n return local; \n} \n \nSum is a function (well duh) which takes two integers, applies some logic, and gives the caller back \nanother integer. In this case, the name of the function and its return  value match wel l (sum the'

In [None]:
texts[1]

'another integer. In this case, the name of the function and its return  value match wel l (sum the \nintegers given).  In terms of argument s, the above is an example of pass by copy. What does this mean?  \nPass by Copy  \nConsider the following ‘troll’ function and its corresponding ‘main’ call.  \nint sum_troll(int i1, int i2) \n{ \n i1 = -7; \n int local = i1 +  i2; \n return local; \n} \n \nint main() \n{ \n int a = 1; \n int b = 2; \n int c = sum_troll(a, b);  \n // Print everything  \n} \n \nAs we haven’t learned how to print in C, let’s just assume there is some code at the end of main  which \nprints the values of ‘a’, ‘b’ and ‘c’. Now sum_troll  is an awful function. Instead of doing something \nuseful, it returns somethings fairly useless (i2 – 7). In our call we might expect that ‘c’ should now'

## FAISS Similarity Search

In [None]:
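# Embed the chunks with OpenAI's text-embedding-ada-002 model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Create the FAISS vector store from the chunks
db = FAISS.from_texts(texts, embeddings)

# Question-answering chain; "stuff" puts all retrieved chunks into one prompt
chain = load_qa_chain(OpenAI(), chain_type="stuff")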
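
`similarity_search` returns the chunks whose embeddings lie closest to the embedding of the query. The sketch below shows the idea with a brute-force search, assuming L2 (Euclidean) distance, over toy 3-dimensional vectors. The vectors, the texts, and the `top_k_l2` helper are made up for illustration; real embeddings have far more dimensions, and FAISS performs this search much more efficiently.

```python
import math


def top_k_l2(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors closest to query_vec by L2 distance."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Hypothetical chunks and hand-written "embeddings"
toy_texts = ["pipes and redirection", "malloc and realloc", "unix shell commands"]
toy_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded query

nearest = top_k_l2(query_vec, toy_vecs, k=2)
print([toy_texts[i] for i in nearest])
```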

In [None]:
# db.save_local("/content/gdrive/MyDrive/FAISS_TeachingBotIndex", "TeachingBotIndex")

# Save the vector store for later reuse
# with open("/content/gdrive/MyDrive/FAISS_TeachingBotIndex/db.pkl", "wb") as f:
#     pickle.dump(db, f)

# Embedding Saved.

In [None]:
print(db)

<langchain.vectorstores.faiss.FAISS object at 0x7c0f01836d10>


In [None]:
query = "Teach me about UNIX"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Unix is an operating system that is more friendly to programmers and advanced users. It has a design philosophy that "everything is a file" and it has a few common commands that can be used to accomplish tasks, such as "ls" to list the contents of the current directory, "cp" to copy a file, "mv" to move a file, "rm" to delete files, "pwd" to show the current working directory, "cd" to change the current directory, "less" to show a bit of a specified file, and "cat" to concatenate two files or show what is in one file. It is important to learn how to read the manual pages ("man") to understand how to use these commands. It also has a command "chmod" that can have catastrophic consequences if used incorrectly.'
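
With `chain_type="stuff"`, the retrieved chunks are placed together in a single prompt along with the question. The sketch below illustrates only that idea. The prompt wording, the example chunks, and the `build_stuff_prompt` helper are assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one prompt (illustrative wording)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks
example_chunks = [
    "Unix treats everything as a file.",
    "ls lists the contents of the current directory.",
]
prompt = build_stuff_prompt(example_chunks, "Teach me about UNIX")
print(prompt)
```

Because every chunk goes into one prompt, the combined length of the retrieved chunks has to fit within the model's context limit, which is why the text was split into small chunks earlier.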

In [None]:
docs

[Document(page_content='malloc  (it is faster).  \nTo that end, it explains the existence of realloc . The signature is:  \nvoid * realloc ( void * ptr, size_t size); \n \nIt takes a void*  which is a bit curious. The name is a bit of a give -away as to what it does. \nUnsurprisingly, it stands for realloc ate. The pointer you give it will be the memory you wish to \nreallocate. In essence, th e idea of realloc  is change the size of some memory you already have. If you \ncan imagine that you have an array of size 10 and you decide (perhaps based on something you \nlearned/will learn in ADSA) you now need the array to be size 20. Well, deallocating 1 0 integers and \nthen reallocating the same 10 integers plus another 20 integers is a complete waste of time. Well,'),
 Document(page_content='up in nice, neat little packages (objects). The interactions of these packages can then be happily \ndiscussed (cars drive on roads, users make withdrawals from bank accounts). C… is lower level tha

In [None]:
def safe_divide(a, b):
    return a / b if b != 0 else 0.0

In [None]:
def evaluate_metrics(queries, true_answers):
    all_precision = []
    all_recall = []
    all_f1 = []

    for query, true_answer in zip(queries, true_answers):
        # Run similarity search on query
        docs = db.similarity_search(query)

        # Get model's answer
        model_answer = chain.run(input_documents=docs, question=query)
        # print(model_answer)
        # print(true_answer)

        # Split each answer on ", " into fragments. These are phrases rather than
        # words, so a fragment only counts as a match if it is identical in both answers.
        true_tokens = set(true_answer.split(", "))
        model_tokens = set(model_answer.split(", "))

        # Convert sets to binary label format for sklearn metrics
        all_tokens = list(true_tokens | model_tokens)
        t_labels = [1 if token in true_tokens else 0 for token in all_tokens]
        m_labels = [1 if token in model_tokens else 0 for token in all_tokens]
        # print(all_tokens)
        # print(t_labels)
        # print(m_labels)

        # Compute metrics
        precision = precision_score(t_labels, m_labels)
        recall = recall_score(t_labels, m_labels)
        f1 = f1_score(t_labels, m_labels)

        all_precision.append(precision)
        all_recall.append(recall)
        all_f1.append(f1)

        # confusion_matrix = metrics.confusion_matrix(t_labels, m_labels)
        # cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = [False, True])
        # cm_display.plot()
        # plt.show()

        # true_positive = confusion_matrix[1,1]
        # true_negative = confusion_matrix[0,0]
        # false_positive = confusion_matrix[0,1]
        # false_negative = confusion_matrix[1,0]

        # positive_Precision = safe_divide(true_positive, (true_positive + false_positive))
        # positive_Recall = safe_divide(true_positive, (true_positive + false_negative))

        # positive_F1 = safe_divide(2 * (positive_Precision * positive_Recall), (positive_Precision + positive_Recall))

        # negative_Precision = safe_divide(true_negative, (true_negative + false_negative))
        # negative_Recall = safe_divide(true_negative, (true_negative + false_positive))

        # negative_F1 = safe_divide(2 * (negative_Precision * negative_Recall), (negative_Precision + negative_Recall))

        # print(f"Positive_F1: {positive_F1}")
        # print(f"Negative_F1: {negative_F1}")

    return all_precision, all_recall, all_f1

# Queries and true answers
queries = [
    "What is the pipe operator?",
    "What does makefile do?",
    "What is the structure of a minimalist makefile?",
    "What is a race condition?",
    "What is a signal handler?",
    "What is pthread_cond_wait?",
    "What happens when you establish a connection to a computer via a network?",
    "Teach me about UNIX.",
    "What are Processes?",
    "Explain INodes to me",
    "What's the difference between Asynchronous and Synchronous IO in the lectures given?",
    "What are sockets used for?",
    "Explain pipes like I’m 5.",
    "How to implement pipes in C?",
    "What is the difference between a thread and a process?",
    "How does one glob?",
    "How do you use realloc in C?"
]
true_answers = [
    "Pipes serve as a form of redirection that exists between programs. Instead of reading from a file (which is essentially a finite process once the file runs out) or writing to a file, pipes allow one process to dynamically read the output of a second process and use it as input.",
    "Make is all about date-stamps. A date-stamp tells you when a file was last modified and (usually) an unmodified file does not need to be recompiled. Actually, this isn’t strictly true. There are two reasons to recompile a file. Number one is that it changed (hence date stamps). Number two is that something it relies upon (a dependency) changed. So, the game of Make is to keep track of what files have changed (which is an easy operation using the operating system) and which files depend upon which files… a less easy operation.",
    "A Minimalist Makefile Looks like this: my_exe: my_code.c my_code.h gcc my_code.c -o my_exe This reads: • Make a file called my_exe from my_code.c (the gcc command) • Do this if someone types: make my_exe • Only do something if my_exe is older than my_code.c or my_code.h. Otherwise… do nothing.",
    "Simply put, a race condition is when two independent ‘tasks’ which are interdependent produce different results depending on the timing of how those two tasks are implemented.",
    "Signal handling in C is very straight forward and consists of two main parts (actually, quite similar to threads). There is the ‘setup’ part where you specify when in the code you want the signal handler to become active. Then there is the function which is called by the signal.",
    "In C, the way to implement this ‘signal-based’ method of notifying threads is: pthread_cond_wait(&cond, &lock) The idea here is that the thread which is to read from shared memory needs to wait for the data to be populated. That data is associated with a mutex (&lock) and a ‘pthread_cond_t’ (&cond) which is a kind of condition. In order to call pthread_cond_wait the waiting thread (consumer) must first have the mutex. But wait? If the consumer locks the mutex, how will the producer deposit the data into shared memory? Doesn’t it need the mutex first. Fortunately (i.e. by design), calling pthread_cond_wait unlocks the mutex. Now the producer is free to do its thing, lock the mutex, change some data.",
    "The traditional way of connecting to another computer across a network is via an IP address. Briefly, an IP address is an address similar to a house address. It tells you where the information is meant to go to/come from. Along with the IP address is the port number. The port number is used to determine what is being communicated, a bit like a radio frequency. So… IP Address: • Unique for a computer • Several processes can communicate using the same IP address (i.e. same computer) Port Number: • Unique for an application • Several computers can communicate using the same Port number (i.e. same application).",
    "Unix is a multiuser, multitasking operating system (OS) designed for flexibility and adaptability",
    "The term process (Job) refers to program code that has been loaded into a computer's memory so that it can be executed by the central processing unit (CPU).",
    "By definition, an inode is an index node. It serves as a unique identifier for a specific piece of metadata on a given filesystem. Each piece of metadata describes what we think of as a file.",
    "Synchronous input/output (I/O) occurs while you wait. Applications processing cannot continue until the I/O operation is complete. In contrast, asynchronous I/O (AIO) operations run in the background and do not block user applications.",
    "Mainly for client and server interaction.",
    "A pipe simply refers to a temporary software connection between two programs or commands. An area of the main memory is treated like a virtual file to temporarily hold data and pass it from one process to another in a single direction.",
    "To create a simple pipe with C, we make use of the pipe() system call. It takes a single argument, which is an array of two integers, and if successful, the array will contain two new file descriptors to be used for the pipeline.",
    "Process is the program under action whereas a thread is the smallest segment of instructions that can be handled independently by a scheduler.",
    "In computer programming, glob (/ɡlɒb/) patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *. txt textfiles/ moves ( mv ) all files with names ending in . txt from the current directory to the directory textfiles.",
    "ptr = realloc (ptr,newsize); The above statement allocates a new memory space with a specified size in the variable newsize. After executing the function, the pointer will be returned to the first byte of the memory block. The new size can be larger or smaller than the previous memory."
]

# Run the evaluation
all_precision, all_recall, all_f1 = evaluate_metrics(queries, true_answers)
print(f'Precision: {all_precision}, Recall: {all_recall}, F1-measure: {all_f1}')


Precision: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Recall: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], F1-measure: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
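
The all-zero scores are consistent with how the answers are split: splitting on `", "` yields long comma-delimited fragments, and a fragment of the model's answer almost never matches a fragment of the reference exactly. A minimal sketch of a word-level alternative follows. The `token_prf` helper, its lowercase regex tokenization, and the example sentences are assumptions for illustration and are not part of the evaluation above.

```python
import re


def token_prf(true_answer, model_answer):
    """Set-based precision, recall and F1 over lowercase word tokens."""
    true_tokens = set(re.findall(r"\w+", true_answer.lower()))
    model_tokens = set(re.findall(r"\w+", model_answer.lower()))
    overlap = len(true_tokens & model_tokens)
    precision = overlap / len(model_tokens) if model_tokens else 0.0
    recall = overlap / len(true_tokens) if true_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical reference and model answers
p, r, f1 = token_prf(
    "Sockets are used mainly for client and server interaction.",
    "Sockets are used for communication between a client and a server.",
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Matching on words rather than fragments gives partial credit whenever the two answers share vocabulary, at the cost of ignoring word order.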


In [None]:
rouge = load_metric("rouge")

def evaluate_ROGUE_metrics(queries, true_answers):
    all_rouge = []

    for query, true_answer in zip(queries, true_answers):
        # Retrieval-only baseline: score the top retrieved chunk against the
        # reference answer (the QA chain is not called here)
        docs = db.similarity_search(query)
        model_answer = docs[0].page_content  # ROUGE expects plain strings

        # Compute ROUGE scores
        rouge_scores = rouge.compute(predictions=[model_answer], references=[true_answer])
        all_rouge.append(rouge_scores)

    return all_rouge

# Run the evaluation
all_rouge = evaluate_ROGUE_metrics(queries, true_answers)
print(f'ROUGE Scores: {all_rouge}')

ROUGE Scores: [{'rouge1': AggregateScore(low=Score(precision=0.28688524590163933, recall=0.6862745098039216, fmeasure=0.4046242774566474), mid=Score(precision=0.28688524590163933, recall=0.6862745098039216, fmeasure=0.4046242774566474), high=Score(precision=0.28688524590163933, recall=0.6862745098039216, fmeasure=0.4046242774566474)), 'rouge2': AggregateScore(low=Score(precision=0.15702479338842976, recall=0.38, fmeasure=0.2222222222222222), mid=Score(precision=0.15702479338842976, recall=0.38, fmeasure=0.2222222222222222), high=Score(precision=0.15702479338842976, recall=0.38, fmeasure=0.2222222222222222)), 'rougeL': AggregateScore(low=Score(precision=0.14754098360655737, recall=0.35294117647058826, fmeasure=0.20809248554913296), mid=Score(precision=0.14754098360655737, recall=0.35294117647058826, fmeasure=0.20809248554913296), high=Score(precision=0.14754098360655737, recall=0.35294117647058826, fmeasure=0.20809248554913296)), 'rougeLsum': AggregateScore(low=Score(precision=0.1475409

In [None]:
def average_rouge(all_rouge):
    # Keys as returned by the Hugging Face "rouge" metric. Each value is an
    # AggregateScore whose .mid Score holds precision, recall and fmeasure.
    metric_names = ['rouge1', 'rouge2', 'rougeL']
    fields = {'f': 'fmeasure', 'p': 'precision', 'r': 'recall'}
    avg_rouge = {metric: {key: 0.0 for key in fields} for metric in metric_names}

    n = len(all_rouge)
    if n == 0:
        return avg_rouge

    for scores in all_rouge:
        for metric in metric_names:
            if metric in scores:
                for key, attr in fields.items():
                    avg_rouge[metric][key] += getattr(scores[metric].mid, attr)

    for metric in metric_names:
        for key in fields:
            avg_rouge[metric][key] /= n

    return avg_rouge



In [None]:
average_rouge_scores = average_rouge(all_rouge)


In [None]:
import pandas as pd

def flatten_rouge_scores(all_rouge):
    flattened_scores = []
    for entry in all_rouge:
        flat_entry = {}
        for metric, aggregate_score in entry.items():
            flat_entry[f"{metric}_F1"] = aggregate_score.mid.fmeasure
            flat_entry[f"{metric}_Precision"] = aggregate_score.mid.precision
            flat_entry[f"{metric}_Recall"] = aggregate_score.mid.recall
        flattened_scores.append(flat_entry)
    return flattened_scores

flattened_scores = flatten_rouge_scores(all_rouge)
df = pd.DataFrame(flattened_scores)
#df
average_values = df.mean()
average_values

rouge1_F1              0.247084
rouge1_Precision       0.186782
rouge1_Recall          0.467267
rouge2_F1              0.128844
rouge2_Precision       0.102316
rouge2_Recall          0.214127
rougeL_F1              0.187235
rougeL_Precision       0.142005
rougeL_Recall          0.363029
rougeLsum_F1           0.187235
rougeLsum_Precision    0.142005
rougeLsum_Recall       0.363029
dtype: float64
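
For intuition about the `rougeL` rows, ROUGE-L is based on the longest common subsequence (LCS) of tokens shared by the prediction and the reference. The sketch below computes it for one pair using whitespace tokenization. It is a simplification: the `rouge_score` package applies its own tokenization, so its numbers can differ. The helper names and example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, prediction):
    """ROUGE-L precision, recall and F1 for a single reference/prediction pair."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    precision = lcs / len(pred_tokens) if pred_tokens else 0.0
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


# Hypothetical reference and prediction
p, r, f1 = rouge_l(
    "a pipe connects two processes",
    "a pipe is a connection between two processes",
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Precision is divided by the prediction length and recall by the reference length, so a prediction that is longer than the reference tends to score lower on precision than on recall, as in the averages above.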