In [1]:
import os
import chromadb
import numpy as np
from typing import List
from dotenv import load_dotenv

load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY', '')

from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.pydantic_v1 import BaseModel, Field, validator
from langchain.schema import Document
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains.summarize import load_summarize_chain

## Loading a long PDF document (e.g., a book)

In [2]:
# Load the document
document_location = "~/Documents/DTU/5th_semester/deep-learn/neural_networks_and_deep_learning.pdf"
loader = PyPDFLoader(document_location)
pages = loader.load()

# Skip the first five pages (front matter)
pages = pages[5:]

# Combine the pages into one string and replace tabs with spaces
text = "".join(page.page_content for page in pages)
text = text.replace('\t', ' ')

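We split the text into overlapping chunks of at most 8,000 characters each, so that every chunk is small enough to be passed to the LLM as context. The 1,500-character overlap reduces the chance that a passage is cut off at a chunk boundary.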

In [3]:
# Tabs were replaced with spaces above, so use a space (not "\t") as the last-resort separator
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " "],
                                               chunk_size=8000,
                                               chunk_overlap=1500)

docs = text_splitter.create_documents([text])
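
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on the listed separators first); the function name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
chunks = chunk_with_overlap(sample, chunk_size=4, chunk_overlap=1)
print(chunks)
```

With these toy settings, the last character of each chunk is repeated as the first character of the next one, which is the same idea as the 1,500-character overlap used above, only at a smaller scale.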

After splitting the text, we embed each chunk as a vector so that the chunks most relevant to a question can be retrieved quickly by similarity search.

In [4]:
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

vectors = embeddings.embed_documents([x.page_content for x in docs])
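
As a rough illustration of what similarity search over embeddings means, the sketch below ranks a few toy 2-dimensional vectors against a query using cosine similarity. The helper names (`cosine_similarity`, `top_k`) and the toy vectors are hypothetical, and cosine similarity is an illustrative choice here: vector stores may use a different metric by default.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          candidates: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k candidates most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for real embeddings, which have many more dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
best = top_k([1.0, 0.1], toy_vectors, k=2)
print(best)
```

In the notebook, the vector store performs this ranking step over the real chunk embeddings.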

### Creating a Chat Model

In [5]:
# initialize chat model
llm = ChatOpenAI(temperature=0,
                 openai_api_key=openai_api_key,
                 max_tokens=3000,
                 model='gpt-3.5-turbo-1106',
                 request_timeout=120
                 )

### Generate questions about some text

In [6]:
# Ask the chat model for comprehension questions about a passage
def generate_comprehension_questions(text):
    # Define the prompt
    prompt = f"Please provide 3 short comprehension questions about the following technical text: {text}"

    # Use the model to generate questions
    response = llm.invoke(prompt)

    # Keep the first three non-empty lines, in case the model
    # separates the questions with blank lines
    lines = [line.strip() for line in response.content.split("\n")]
    questions = [line for line in lines if line][:3]

    return questions
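
Splitting the response on newlines assumes the model returns exactly one question per line with no preamble. A slightly more defensive parser might look like the sketch below. `parse_questions` and the sample response are hypothetical, and unlike the function above this version strips the leading numbering.

```python
import re
from typing import List


def parse_questions(raw: str, limit: int = 3) -> List[str]:
    """Extract up to `limit` questions from a free-form model response."""
    questions = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop list markers such as "1.", "2)", "-", "*" or "•"
        cleaned = re.sub(r"^(\d+[.)]|[-*•])\s*", "", line)
        # Keep only lines that look like questions
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions[:limit]


raw_response = (
    "Here are the questions:\n"
    "\n"
    "1. What is a neuron?\n"
    "2) Why use ReLU?\n"
    "- How is loss computed?\n"
)
parsed = parse_questions(raw_response)
print(parsed)
```

The question-mark filter is a heuristic: it discards preambles such as "Here are the questions:", but it would also discard a question the model forgot to punctuate.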

In [7]:
# Example usage
text = docs[17].page_content
questions = generate_comprehension_questions(text)

In [8]:
text = """
Nineteen Eighty-Four (also published as 1984) is a dystopian novel and cautionary tale by English writer George Orwell. It was published on 8 June 1949 by Secker & Warburg as Orwell's ninth and final book completed in his lifetime. Thematically, it centres on the consequences of totalitarianism, mass surveillance and repressive regimentation of people and behaviours within society.[2][3] Orwell, a democratic socialist, modelled the authoritarian state in the novel on the Soviet Union in the era of Stalinism, and Nazi Germany.[4] More broadly, the novel examines the role of truth and facts within societies and the ways in which they can be manipulated.

The story takes place in an imagined future in an unspecified year believed to be 1984, when much of the world is in perpetual war. Great Britain, now known as Airstrip One, has become a province of the totalitarian superstate Oceania, which is led by Big Brother, a dictatorial leader supported by an intense cult of personality manufactured by the Party's Thought Police. The Party engages in omnipresent government surveillance and, through the Ministry of Truth, historical negationism and constant propaganda to persecute individuality and independent thinking.[5]

The protagonist, Winston Smith, is a diligent mid-level worker at the Ministry of Truth who secretly hates the Party and dreams of rebellion. Smith keeps a forbidden diary. He begins a relationship with a colleague, Julia, and they learn about a shadowy resistance group called the Brotherhood. However, their contact within the Brotherhood turns out to be a Party agent, and Smith and Julia are arrested. He is subjected to months of psychological manipulation and torture by the Ministry of Love and is released once he has come to love Big Brother.

Nineteen Eighty-Four has become a classic literary example of political and dystopian fiction. It also popularised the term "Orwellian" as an adjective, with many terms used in the novel entering common usage, including "Big Brother", "doublethink", "Thought Police", "thoughtcrime", "Newspeak", and "2 + 2 = 5". Parallels have been drawn between the novel's subject matter and real life instances of totalitarianism, mass surveillance, and violations of freedom of expression among other themes.[6][7][8] Orwell described his book as a "satire",[9] and a display of the "perversions to which a centralised economy is liable," while also stating he believed "that something resembling it could arrive."[9] Time included the novel on its list of the 100 best English-language novels published from 1923 to 2005,[10] and it was placed on the Modern Library's 100 Best Novels list, reaching number 13 on the editors' list and number 6 on the readers' list.[11] In 2003, it was listed at number eight on The Big Read survey by the BBC.
"""
questions = generate_comprehension_questions(text)

In [9]:
for q in questions:
    print(q)

1. Who is the protagonist of Nineteen Eighty-Four and what is his job at the Ministry of Truth?
2. What are some of the terms popularized by the novel that have entered common usage?
3. What are some of the themes and real-life instances that have been compared to the subject matter of Nineteen Eighty-Four?


## Use the generated questions to generate answers

In [11]:
chroma_client = chromadb.Client()

#chroma_client.delete_collection("my_collection")
collection = chroma_client.get_or_create_collection(name="my_collection")

identifiers = [str(i) for i in range(len(vectors))]
str_docs = [doc.page_content for doc in docs]

collection.add(
    embeddings=vectors,
    documents=str_docs,
    #metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=identifiers
)

In [12]:
def generate_answers(questions, context=None):
    # Build the prompt and chain once; they are the same for every question
    answers_prompt = ChatPromptTemplate.from_template("{context}\n{question} Please provide a short and concise answer.")
    answers_chain = answers_prompt | llm

    answers = []

    for question in questions:
        if context is None:
            # No context supplied: retrieve the two most similar chunks for this question
            qe = embeddings.embed_documents([question])
            results = collection.query(query_embeddings=qe,
                                       n_results=2)

            # Concatenate the retrieved chunks. A separate variable is used so that
            # `context` stays None and every question gets its own retrieval.
            question_context = ""
            for idx in results['ids'][0]:
                question_context += str_docs[int(idx)]
        else:
            question_context = context

        answer = answers_chain.invoke({'context': question_context, 'question': question}).content

        answers.append(answer)

    return answers
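
The retrieve-then-answer loop can be checked without any API calls if the retriever and the model are passed in as plain callables. The sketch below shows one way to do that. `answer_questions`, `retrieve`, and `ask` are hypothetical names, and the dictionary and lambda are stand-ins for the vector store and the chat model.

```python
from typing import Callable, List, Optional


def answer_questions(questions: List[str],
                     retrieve: Callable[[str], str],
                     ask: Callable[[str, str], str],
                     context: Optional[str] = None) -> List[str]:
    """Answer each question, looking up context per question unless one is supplied."""
    answers = []
    for question in questions:
        # Use the fixed context if given; otherwise retrieve one for this question
        question_context = context if context is not None else retrieve(question)
        answers.append(ask(question_context, question))
    return answers


# Stand-ins for the vector store lookup and the LLM call
fake_store = {"q1": "ctx1", "q2": "ctx2"}
fake_answers = answer_questions(
    ["q1", "q2"],
    retrieve=lambda q: fake_store[q],
    ask=lambda ctx, q: f"{q} answered from {ctx}",
)
print(fake_answers)
```

Each question should be paired with its own retrieved context; if both answers mentioned `ctx1`, the context from the first question would be leaking into the second.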

In [13]:
answers = generate_answers(questions, text)

In [14]:
for q, a in zip(questions, answers):
    print(q)
    print(a)
    print()

1. Who is the protagonist of Nineteen Eighty-Four and what is his job at the Ministry of Truth?
The protagonist of Nineteen Eighty-Four is Winston Smith, and he is a mid-level worker at the Ministry of Truth.

2. What are some of the terms popularized by the novel that have entered common usage?
Some terms popularized by the novel include "Big Brother", "doublethink", "Thought Police", "thoughtcrime", "Newspeak", and "2 + 2 = 5".

3. What are some of the themes and real-life instances that have been compared to the subject matter of Nineteen Eighty-Four?
Some of the themes and real-life instances that have been compared to the subject matter of Nineteen Eighty-Four include totalitarianism, mass surveillance, violations of freedom of expression, and the manipulation of truth and facts within societies.



### Grade user answers

In [21]:
def grade_user_answer(questions, answers):
    # Build the grading prompt and chain once; they are the same for every question
    prompt = ChatPromptTemplate.from_template("A user has been tasked to answer a question. The correct answer is: {system_answer}\n The user answered the following: {user_answer} Please rate the correctness of the user answer.")
    grade_chain = prompt | llm

    for q, a in zip(questions, answers):
        # Show the question and read the user's answer from standard input
        print(q)
        user_answer = input()

        # Ask the model to compare the user's answer with the reference answer
        grade = grade_chain.invoke({'system_answer': a, 'user_answer': user_answer})
        print(grade.content)
        print()
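
The grading prompt above returns free text, which is hard to aggregate across questions. One possible extension, sketched here as an assumption and not as part of the notebook, is to ask the model to end its reply with a line such as `Score: 3/5` and parse that line. `GRADING_SUFFIX` and `extract_score` are hypothetical names, and the model is not guaranteed to follow the requested format, so the parser returns `None` when no valid score is found.

```python
import re
from typing import Optional

# Hypothetical text that could be appended to the grading prompt
GRADING_SUFFIX = "End your reply with a line of the form 'Score: N/5', where N is an integer from 0 to 5."


def extract_score(reply: str, max_score: int = 5) -> Optional[int]:
    """Return the integer score from a 'Score: N/max_score' line, or None if absent or invalid."""
    match = re.search(r"Score:\s*(\d+)\s*/\s*(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    score, scale = int(match.group(1)), int(match.group(2))
    # Reject replies that use a different scale or an out-of-range score
    if scale != max_score or score > max_score:
        return None
    return score


example_reply = "The answer names the protagonist but omits his job.\nScore: 3/5"
print(extract_score(example_reply))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or fall back to showing the free-text feedback.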

In [20]:
grade_user_answer(questions, answers)

1. Who is the protagonist of Nineteen Eighty-Four and what is his job at the Ministry of Truth?
Winston Smith
The user's answer is correct. The protagonist of Nineteen Eighty-Four is indeed Winston Smith. However, the user did not provide the additional information that he is a mid-level worker at the Ministry of Truth. Therefore, the correctness of the user's answer is partially correct.
2. What are some of the terms popularized by the novel that have entered common usage?
"Newspeak", "Big Brother" and "pepperoni bros"
The user's answer is partially correct. They correctly identified "Newspeak" and "Big Brother" as terms popularized by the novel, but "pepperoni bros" is not a term associated with the novel. Therefore, the user's answer is partially correct.
3. What are some of the themes and real-life instances that have been compared to the subject matter of Nineteen Eighty-Four?
Something about the state doing a lot of monitoring. Also, the first amendment of US law is not upheld. L

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=120.0).


The user's answer is mostly correct, as they mentioned the state doing a lot of monitoring, violations of freedom of expression, and the manipulation of truth and facts within societies. However, they did not specifically mention totalitarianism, which is a key theme in Nineteen Eighty-Four. Additionally, they mentioned the first amendment of US law, which is not directly related to the themes of the book. Overall, the user's answer is partially correct but could be more specific and focused on the themes and real-life instances compared to the subject matter of Nineteen Eighty-Four.
