Thought Process
While researching a solution for this task, I found three approaches that the LangChain library provides out of the box:
1) Stuff Chain
2) Map-Reduce
3) Refine

Based on my research, all of these approaches work well for documents smaller than the book provided. Given the number of tokens in the book, Stuff Chain would not work: the context window of GPT-4o (the model I used for both solutions) is 128,000 tokens, which is less than the number of tokens in our book.
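
As a quick sanity check on the claim above, the sketch below compares the book's token count (267,185, as measured later in this notebook) with the 128,000-token context window. The constant names are my own, and the result is only a lower bound because it ignores the tokens needed for the prompt and the model's output.

```python
import math

BOOK_TOKENS = 267_185      # token count measured later in this notebook
CONTEXT_WINDOW = 128_000   # GPT-4o context window quoted above

# Stuff Chain needs the whole book in one call
fits_in_one_call = BOOK_TOKENS <= CONTEXT_WINDOW

# Lower bound on the number of calls needed to read the whole book once
min_calls = math.ceil(BOOK_TOKENS / CONTEXT_WINDOW)

print(f"Fits in a single call: {fits_in_one_call}")
print(f"Minimum number of context windows needed: {min_calls}")
```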
For Map-Reduce, my research indicated that it can work. However, neither Map-Reduce nor Refine is very cost-effective, because both send every chunk of the book to the LLM. To keep the task as close to real-world conditions as possible, I opted for another strategy, briefly described in the following steps:
1) Split the document into separate chunks
2) Generate vector embeddings of those chunks
3) Perform clustering over the vectors
4) Select the vectors that are closest to the centroids of the clusters
5) Perform the map step on the selected vectors, i.e. pass the text corresponding to each vector to the LLM and instruct it to summarise that text
6) Perform the reduce step: pass all the summaries to the LLM and ask it to derive a final summary as per the requirements
7) Save the final summary to a PDF file
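
Steps 3 and 4 can be illustrated on toy data. The sketch below is not the code used in this notebook (which relies on scikit-learn and NumPy further down); it is a minimal pure-Python version, and the function name, the 2-D vectors and the centroids are all made up for illustration.

```python
import math

def closest_to_centroids(vectors, centroids):
    """Return, in document order, the index of the vector nearest each centroid."""
    chosen = []
    for centroid in centroids:
        # Euclidean distance from every vector to this centroid
        distances = [math.dist(vector, centroid) for vector in vectors]
        chosen.append(distances.index(min(distances)))
    # Sorting keeps the selected chunks in the order they appear in the book
    return sorted(chosen)

# Toy 2-D "embeddings": two tight groups of chunks (made-up values)
toy_vectors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1), (5.0, 5.1), (5.2, 4.9), (5.1, 5.0)]
# Centroids as a clustering step might return them (also made up)
toy_centroids = [(5.1, 5.0), (0.1, 0.0667)]

print(closest_to_centroids(toy_vectors, toy_centroids))  # [2, 5]
```

Only the chunks at the returned indices are sent to the LLM, which is where the cost saving over a full Map-Reduce comes from.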

Post Script: These two links helped me navigate and reach this approach: 
1) https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/5%20Levels%20Of%20Summarization%20-%20Novice%20To%20Expert.ipynb
2) https://pashpashpash.substack.com/p/tackling-the-challenge-of-document 

In [2]:
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains.summarize import load_summarize_chain
import numpy as np
from sklearn.cluster import KMeans
import os
from langchain_core.messages import HumanMessage
from langchain_openai import AzureChatOpenAI

In [3]:
import os

os.environ["AZURE_OPENAI_API_KEY"] = ""
os.environ["AZURE_OPENAI_ENDPOINT"] = ""
os.environ["AZURE_OPENAI_API_VERSION"] = ""
os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"] = ""

In [4]:
# initializing my model
model = AzureChatOpenAI(
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"],
)

In [5]:
# a sample request to smoke-test the model
message = HumanMessage(
    content="Translate this sentence from English to French. I love programming."
)
model.invoke([message])

AIMessage(content="J'adore programmer.", response_metadata={'token_usage': {'completion_tokens': 4, 'prompt_tokens': 19, 'total_tokens': 23}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_abc28019ad', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}, id='run-f1654f2b-caf2-472e-8fa1-f3c5dcf32fef-0', usage_metadata={'input_tokens': 19, 'output_tokens': 4, 'total_tokens': 23})

In [6]:
# function used later on to clean the text extracted from the PDF of the book
import re

def clean_text(text):
    # Remove the phrase 'Free eBooks at Planet eBook.com' and surrounding whitespace
    cleaned_text = re.sub(r'\s*Free eBooks at Planet eBook\.com\s*', '', text, flags=re.DOTALL)
    # Replace runs of newline characters with a single space
    cleaned_text = re.sub(r'\n+', ' ', cleaned_text)
    # Collapse multiple spaces into a single space
    cleaned_text = re.sub(r' +', ' ', cleaned_text)
    # Remove control characters, optionally preceded by 'Crime and Punishment '
    cleaned_text = re.sub(r'(Crime and Punishment )?[\x00-\x1F]', '', cleaned_text)
    # Remove whitespace around hyphens, keeping the hyphen itself
    cleaned_text = re.sub(r'\s*-\s*', '-', cleaned_text)
    # Remove leading and trailing whitespace
    cleaned_text = cleaned_text.strip()
    return cleaned_text


In [7]:
from langchain.document_loaders import PyPDFLoader

# Load the book
loader = PyPDFLoader("C:/Users/stech/Downloads/crime-and-punishment.pdf")
pages = loader.load()

# Cut out the irrelevant open and closing parts
pages = pages[6:743]

# Combine the pages, and replace the tabs with spaces
text = ""

for page in pages:
    text += page.page_content
    
text = text.replace('\t', ' ')
# store the result under a new name so the clean_text function is not overwritten
book_text = clean_text(text)

In [8]:
# count the tokens in the book so that a cost-informed choice can be made between the map_reduce approach and the current approach
num_tokens = model.get_num_tokens(book_text)

print(f"This book has {num_tokens} tokens in it")

This book has 267185 tokens in it


In [None]:
# split the book into chunks of 20,000 characters with an overlap of 5,000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=20000, chunk_overlap=5000)
# prepare documents from the cleaned text
docs = text_splitter.create_documents([book_text])

In [None]:
#check how many documents are generated
num_documents = len(docs)
print (f"Now our book is split up into {num_documents} documents")
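
For a rough idea of how many documents to expect, the arithmetic can be sketched as follows. This is only an estimate under a simplifying assumption: it treats every chunk as exactly `chunk_size` characters with exactly `chunk_overlap` characters shared with its neighbour, whereas RecursiveCharacterTextSplitter splits on separators and so produces chunks of varying length. The function name and the example text length are hypothetical.

```python
import math

def estimate_chunk_count(text_length, chunk_size=20_000, chunk_overlap=5_000):
    """Rough number of chunks for a text of `text_length` characters."""
    if text_length <= chunk_size:
        return 1
    # Each chunk after the first contributes this many new characters
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / stride)

# Hypothetical text length, chosen only to illustrate the arithmetic
print(estimate_chunk_count(1_000_000))  # 67
```

A larger overlap shrinks the stride, so it increases both the number of chunks and the number of embedding calls.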

In [None]:
#initializing the embedding model
from langchain_openai import AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="ada-002",
    openai_api_version="2024-06-01",
)

In [None]:
# generate embeddings of the documents in batches of 24, pausing between batches: a single call for all documents exceeded the Azure OpenAI rate limit and raised an error
import time

batch_size = 24
vectors = []
for start in range(0, len(docs), batch_size):
    batch = docs[start:start + batch_size]
    vectors += embeddings.embed_documents([x.page_content for x in batch])
    time.sleep(5)
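
A fixed pause works here, but it can still fail if the rate limit is hit anyway. One possible alternative, sketched below, is to retry a failed batch with an exponentially growing delay. This is an assumption on my part rather than what the notebook does: `embed_in_batches` and its parameters are hypothetical, `embed_fn` stands in for `embeddings.embed_documents`, and real code should catch the client's specific rate-limit exception instead of a bare `Exception`.

```python
import time

def embed_in_batches(texts, embed_fn, batch_size=24, max_retries=3,
                     base_delay=5.0, sleep=time.sleep):
    """Embed `texts` batch by batch, retrying a failed batch with backoff."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:  # placeholder: catch the rate-limit error class in practice
                if attempt == max_retries:
                    raise
                # Wait 5s, 10s, 20s, ... before trying the same batch again
                sleep(base_delay * 2 ** attempt)
    return vectors

# Stand-in embedding function that fails once, to exercise the retry path
calls = {"count": 0}

def flaky_embed(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated rate limit")
    return [[float(len(text))] for text in batch]

result = embed_in_batches(["a", "bb", "ccc"], flaky_embed, batch_size=2,
                          sleep=lambda seconds: None)
print(result)  # [[1.0], [2.0], [3.0]]
```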

In [None]:
len(docs)

In [None]:
len(vectors)

In [None]:
# 'vectors' is a list of 1536-dimensional embeddings, one per chunk
# Choose the number of clusters. Each cluster groups similar chunks, and the chunk nearest each cluster centre will represent that part of the book when we derive the summary.

num_clusters = 15

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)

In [None]:
kmeans.labels_

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import warnings

# Filter out FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Perform t-SNE and reduce to 2 dimensions
vectorsarray = np.array(vectors)
n_samples = vectorsarray.shape[0]
# Perplexity must be smaller than the number of samples
perplexity = min(30, n_samples - 1)
tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)
reduced_data_tsne = tsne.fit_transform(vectorsarray)

# Plot the reduced data
plt.scatter(reduced_data_tsne[:, 0], reduced_data_tsne[:, 1], c=kmeans.labels_)
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Book Embeddings Clustered')
plt.show()

In [None]:
# Find the chunk embedding closest to each cluster centroid
vectors_array = np.asarray(vectors)

# Will hold the index of the closest chunk for each cluster
closest_indices = []

# For each cluster, keep only the vector that is closest to its centroid
for i in range(num_clusters):

    # Euclidean distance from every vector to this cluster centre
    distances = np.linalg.norm(vectors_array - kmeans.cluster_centers_[i], axis=1)

    # Position of the smallest distance
    closest_index = np.argmin(distances)

    # Append that position to the list of closest indices
    closest_indices.append(closest_index)

In [None]:
# sort the indices so the selected chunks are summarised in book order
selected_indices = sorted(closest_indices)
selected_indices

In [None]:
from langchain_core.prompts import PromptTemplate
# prompt to generate a summary of each selected chunk
map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [None]:
map_chain = load_summarize_chain(llm=model,
                             chain_type="stuff",
                             prompt=map_prompt_template)

In [None]:
selected_docs = [docs[doc] for doc in selected_indices]

In [None]:
# Make an empty list to hold your summaries
summary_list = []

# Loop through the selected docs
for i, doc in enumerate(selected_docs):
    
    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])
    
    # Append that summary to your list
    summary_list.append(chunk_summary)
    
    print (f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]} \n")

In [None]:
summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

print (f"Your total summary has {model.get_num_tokens(summaries.page_content)} tokens")

In [None]:
# prompt to generate the final summary from the chunk summaries produced in the previous step
combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of approximately 6000 words of what happened in the story.
The reader should be able to grasp what happened in the book.

```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

In [None]:
reduce_chain = load_summarize_chain(llm=model,
                             chain_type="stuff",
                             prompt=combine_prompt_template,
                            )

In [None]:
output = reduce_chain.run([summaries])

In [None]:
print (output)

In [None]:
from fpdf import FPDF

# write the final summary to a PDF file
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
# the built-in fonts only cover latin-1, so characters outside it are replaced with '?'
pdf.multi_cell(0, 10, output.encode('latin-1', 'replace').decode('latin-1'))
pdf_output_path = "C:/Users/stech/Downloads/Summary_15.pdf"
pdf.output(pdf_output_path)
print(f"PDF saved successfully at {pdf_output_path}")
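
One side effect of encoding with `'replace'` is that curly quotes and dashes, which LLM output often contains, come out as `?` in the PDF. The sketch below shows that behaviour and one possible workaround: mapping common typographic characters to ASCII before encoding. The helper `to_latin1_safe` and its replacement table are my own illustration, not part of fpdf, and the table is not exhaustive.

```python
def to_latin1_safe(text):
    """Map common typographic characters to ASCII, then drop anything else non-latin-1."""
    replacements = {
        "\u2018": "'", "\u2019": "'",   # single curly quotes
        "\u201c": '"', "\u201d": '"',   # double curly quotes
        "\u2013": "-", "\u2014": "-",   # en and em dashes
        "\u2026": "...",                # ellipsis
    }
    for original, ascii_equivalent in replacements.items():
        text = text.replace(original, ascii_equivalent)
    # Anything still outside latin-1 becomes '?'
    return text.encode("latin-1", "replace").decode("latin-1")

sample = "\u201cCrime\u201d \u2013 Dostoevsky\u2019s novel"
plain_replace = sample.encode("latin-1", "replace").decode("latin-1")

print(plain_replace)           # ?Crime? ? Dostoevsky?s novel
print(to_latin1_safe(sample))  # "Crime" - Dostoevsky's novel
```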
