<a href="https://colab.research.google.com/github/ekerintaiwoa/MediaApp/blob/master/ofline_aiagent2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# prompt: create an ai agents that allows users to upload a book, ask question about the book,Summarize Chapters or Sections,Study and Review

# This code requires several libraries that might not be installed by default
# on a standard Colab instance. We'll install them first.

# Install necessary libraries
!pip install PyMuPDF  # For handling PDF files
!pip install nltk  # For natural language processing tasks like tokenization
!pip install scikit-learn  # For text vectorization (TF-IDF)
!pip install faiss-cpu  # For efficient similarity search (vector database)
!pip install transformers  # For using powerful language models
!pip install torch  # PyTorch is a dependency for transformers

import fitz  # PyMuPDF
import nltk
import re
import faiss
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Download necessary NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except nltk.downloader.DownloadError:
    nltk.download('punkt')

class BookQA:
    def __init__(self):
        self.book_text = ""
        self.chapter_sections = {} # To store chapter/section titles and their text
        self.vectorizer = None
        self.tfidf_matrix = None
        self.index = None # FAISS index for vector search
        self.qa_pipeline = None
        self.summarizer_pipeline = None
        self.tokenizer = None
        self.summarizer_model = None

        # Initialize NLP pipelines and models
        print("Initializing NLP models...")
        self.initialize_models()
        print("Models initialized.")

    def initialize_models(self):
        # Initialize Question Answering pipeline
        # Using a smaller model for demonstration; consider a larger one for better results
        try:
            self.qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
            print("Question Answering model loaded.")
        except Exception as e:
            print(f"Error loading QA model: {e}")
            print("Please ensure you have sufficient resources and internet connection.")
            self.qa_pipeline = None

        # Initialize Summarization model and tokenizer
        # Using a smaller T5 model; consider t5-large or bart-large-cnn for better summaries
        try:
            self.tokenizer = AutoTokenizer.from_pretrained("t5-small")
            self.summarizer_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
            self.summarizer_pipeline = pipeline("summarization", model=self.summarizer_model, tokenizer=self.tokenizer)
            print("Summarization model loaded.")
        except Exception as e:
            print(f"Error loading Summarization model: {e}")
            print("Please ensure you have sufficient resources and internet connection.")
            self.tokenizer = None
            self.summarizer_model = None
            self.summarizer_pipeline = None


    def upload_book(self, file_path):
        """Reads content from a PDF file."""
        try:
            doc = fitz.open(file_path)
            self.book_text = ""
            print(f"Reading {doc.page_count} pages from {file_path}...")
            for page_num in range(doc.page_count):
                page = doc.load_page(page_num)
                self.book_text += page.get_text()
            print("Finished reading book.")
            self.process_book_text()
        except Exception as e:
            print(f"Error reading file: {e}")
            self.book_text = ""
            self.chapter_sections = {}

    def process_book_text(self):
        """Splits book text into chapters/sections and prepares for search."""
        if not self.book_text:
            print("No book text loaded to process.")
            return

        print("Processing book text...")
        # Simple approach to split into sections based on common patterns (e.g., "Chapter X", "Section Y")
        # This is a basic implementation; a more robust parser might be needed for complex books.
        sections = re.split(r'(Chapter \d+|Section \d+)', self.book_text, flags=re.IGNORECASE)

        current_title = "Introduction/Beginning"
        current_text = ""
        for i, part in enumerate(sections):
            if i % 2 == 1: # This part is likely a title
                if current_text.strip():
                    self.chapter_sections[current_title] = current_text.strip()
                current_title = part.strip()
                current_text = ""
            else: # This part is the content
                current_text += part

        if current_text.strip(): # Add the last section
             self.chapter_sections[current_title] = current_text.strip()

        # Handle cases where no clear chapters/sections are found (e.g., a simple document)
        if not self.chapter_sections and self.book_text:
             # Split into chunks for search
             chunk_size = 2000 # characters
             chunks = [self.book_text[i:i + chunk_size] for i in range(0, len(self.book_text), chunk_size)]
             self.chapter_sections = {f"Chunk {i+1}": chunk for i, chunk in enumerate(chunks)}
             print(f"Book text split into {len(chunks)} chunks.")

        print(f"Identified {len(self.chapter_sections)} chapters/sections or chunks.")

        # Prepare for vector search
        self.prepare_for_search()

    def prepare_for_search(self):
        """Vectorizes the text and creates a FAISS index."""
        if not self.chapter_sections:
            print("No sections to vectorize.")
            return

        section_texts = list(self.chapter_sections.values())
        print(f"Vectorizing {len(section_texts)} sections...")

        # Using TF-IDF for simplicity; consider Sentence Transformers for better semantic search
        self.vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
        self.tfidf_matrix = self.vectorizer.fit_transform(section_texts)

        # Create FAISS index
        dimension = self.tfidf_matrix.shape[1]
        self.index = faiss.IndexFlatL2(dimension) # Using L2 distance (Euclidean)
        self.index.add(self.tfidf_matrix.astype('float32')) # FAISS requires float32
        print("FAISS index created.")


    def ask_question(self, question):
        """Answers a question based on the book content."""
        if not self.book_text:
            return "Please upload a book first."
        if not self.qa_pipeline:
             return "Question Answering model is not loaded. Please check initialization."

        print(f"Searching for answer to: '{question}'")

        # Find the most relevant section using the FAISS index
        question_vec = self.vectorizer.transform([question]).toarray().astype('float32')
        D, I = self.index.search(question_vec, 1) # Search for the top 1 most similar section

        if I.size == 0 or I[0][0] == -1:
            return "Could not find a relevant section in the book."

        relevant_section_index = I[0][0]
        section_titles = list(self.chapter_sections.keys())
        relevant_section_title = section_titles[relevant_section_index]
        relevant_section_text = self.chapter_sections[relevant_section_title]

        print(f"Most relevant section: '{relevant_section_title}'")

        # Use the QA pipeline on the relevant section
        try:
            # The QA pipeline has context length limitations. We might need to
            # further refine the relevant text chunk for the QA model.
            # For simplicity, using the whole section here, which might fail for very long sections.
            # A better approach would be to split the section into smaller paragraphs.
            # For this example, let's limit the context length for the pipeline
            max_context_length = self.qa_pipeline.model.config.max_position_embeddings if hasattr(self.qa_pipeline.model.config, 'max_position_embeddings') else 512
            # Let's roughly estimate tokens by characters for now (approx 4 chars per token)
            max_chars = max_context_length * 4
            context = relevant_section_text[:max_chars]


            answer = self.qa_pipeline(question=question, context=context)
            return f"Answer: {answer['answer']} (Source: {relevant_section_title}, Score: {answer['score']:.2f})"
        except Exception as e:
            print(f"Error during QA pipeline: {e}")
            # Fallback: Return the relevant section
            return f"Could not generate a specific answer. Relevant section:\n{relevant_section_text[:500]}..." # Show beginning of section

    def list_sections(self):
        """Lists the identified chapters/sections."""
        if not self.chapter_sections:
            return "No chapters or sections identified yet. Please upload a book."
        print("Chapters/Sections:")
        for i, title in enumerate(self.chapter_sections.keys()):
            print(f"{i+1}. {title}")

    def summarize_section(self, section_identifier):
        """Summarizes a specified chapter or section."""
        if not self.chapter_sections:
            return "No chapters or sections available to summarize. Please upload a book."
        if not self.summarizer_pipeline:
            return "Summarization model is not loaded. Please check initialization."

        section_titles = list(self.chapter_sections.keys())
        section_text = None
        section_title = None

        try:
            # Try to find by number (1-based index)
            section_index = int(section_identifier) - 1
            if 0 <= section_index < len(section_titles):
                section_title = section_titles[section_index]
                section_text = self.chapter_sections[section_title]
        except ValueError:
            # If not a number, try to find by partial title match
            for title in section_titles:
                if section_identifier.lower() in title.lower():
                    section_title = title
                    section_text = self.chapter_sections[title]
                    break

        if section_text is None:
            return f"Could not find a section matching '{section_identifier}'. Use `list_sections()` to see available sections."

        print(f"Summarizing section: '{section_title}'")

        # Summarization models have input length limits. We need to handle long sections.
        # A common approach is to split the text into smaller chunks, summarize each chunk,
        # and then optionally summarize the summaries.
        # For simplicity here, we'll just truncate or split the text for the summarizer input.

        # Let's split the text into smaller chunks and summarize each chunk
        chunk_size = 1000 # characters per chunk for summarization input
        chunks = [section_text[i:i + chunk_size] for i in range(0, len(section_text), chunk_size)]

        all_summaries = []
        print(f"Splitting section into {len(chunks)} chunks for summarization...")
        for i, chunk in enumerate(chunks):
             # Need to handle potential tokenizer limits more robustly
             # A simple approach: if a chunk is too long after tokenization, summarize a smaller part
             input_tokens = self.tokenizer(chunk, return_tensors="pt", max_length=self.tokenizer.model_max_length, truncation=True).input_ids
             if input_tokens.shape[1] > self.tokenizer.model_max_length:
                 print(f"Warning: Chunk {i+1} is too long, truncating for summarization.")
                 # Further split or refine truncation if necessary
                 # For this example, just use the truncated version by the tokenizer

             try:
                summary = self.summarizer_pipeline(chunk, max_length=150, min_length=30, do_sample=False)
                all_summaries.append(summary[0]['summary_text'])
                print(f"Summarized chunk {i+1}/{len(chunks)}")
             except Exception as e:
                 print(f"Error summarizing chunk {i+1}: {e}")
                 all_summaries.append(f"[Error summarizing this part: {str(e)[:100]}...]") # Add error indicator

        # Combine the summaries
        if all_summaries:
            combined_summary = "\n".join(all_summaries)
            return f"Summary of '{section_title}':\n{combined_summary}"
        else:
            return f"Could not generate a summary for '{section_title}'. No valid chunks were processed."


    def study_review(self):
        """Provides options for studying and reviewing."""
        if not self.chapter_sections:
            return "No book loaded for study/review."

        print("\n--- Study & Review Options ---")
        print("1. List Chapters/Sections")
        print("2. Summarize a specific Chapter/Section")
        print("3. Ask a question about the book")
        print("4. Exit Study/Review")

        while True:
            choice = input("Enter your choice: ")
            if choice == '1':
                self.list_sections()
            elif choice == '2':
                section_input = input("Enter the chapter/section number or name to summarize: ")
                summary = self.summarize_section(section_input)
                print(summary)
            elif choice == '3':
                question = input("What is your question about the book? ")
                answer = self.ask_question(question)
                print(answer)
            elif choice == '4':
                print("Exiting Study/Review.")
                break
            else:
                print("Invalid choice. Please try again.")

# --- How to use the agent in Colab ---

# 1. Mount Google Drive to access your files (optional, but common)
from google.colab import drive
drive.mount('/content/drive')

# 2. Create an instance of the BookQA agent
agent = BookQA()

# 3. Upload a book (replace '/content/drive/My Drive/your_book.pdf' with your file path)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/My Drive/your_book.pdf' # <--- **CHANGE THIS TO YOUR BOOK PATH**
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)

# --- Now you can interact with the agent ---

# Example 1: Ask a question
if agent.book_text:
    question1 = "What is the main topic of the introduction?"
    answer1 = agent.ask_question(question1)
    print(f"\nQuestion: {question1}")
    print(answer1)

    question2 = "What is the author's perspective on [some topic mentioned in your book]?"
    answer2 = agent.ask_question(question2)
    print(f"\nQuestion: {question2}")
    print(answer2)

    # Example 2: List sections
    print("\nListing sections:")
    agent.list_sections()

    # Example 3: Summarize a section (replace '1' with the actual section number or part of the title)
    section_to_summarize = "1" # <--- **CHANGE THIS TO A SECTION NUMBER OR TITLE FROM `list_sections()`**
    print(f"\nAttempting to summarize section: {section_to_summarize}")
    summary = agent.summarize_section(section_to_summarize)
    print(summary)

    # Example 4: Enter Study/Review mode
    print("\nEntering Study/Review mode:")
    # agent.study_review() # Uncomment this line to start the interactive study mode

else:
    print("\nBook was not loaded successfully. Please check the file path and format.")



In [1]:

# Install necessary libraries
!pip install PyMuPDF  # For handling PDF files
!pip install nltk  # For natural language processing tasks like tokenization
!pip install scikit-learn  # For text vectorization (TF-IDF)
!pip install faiss-cpu  # For efficient similarity search (vector database)
!pip install transformers  # For using powerful language models
!pip install torch  # PyTorch is a dependency for transformers

Collecting PyMuPDF
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.1/24.1 MB[0m [31m71.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.3
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m44.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvid

In [2]:
import fitz  # PyMuPDF
import nltk
import re
import faiss
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

In [35]:
# Download necessary NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

import fitz  # PyMuPDF
import nltk
import re
import faiss
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

class BookQA:
    def __init__(self):
        self.book_text = ""
        self.chapter_sections = {} # To store chapter/section titles and their text
        self.vectorizer = None
        self.tfidf_matrix = None
        self.index = None # FAISS index for vector search
        self.qa_pipeline = None
        self.summarizer_pipeline = None
        self.tokenizer = None
        self.summarizer_model = None

        # Initialize NLP pipelines and models
        print("Initializing NLP models...")
        self.initialize_models()
        print("Models initialized.")

    def initialize_models(self):
        # Initialize Question Answering pipeline
        # Using a smaller model for demonstration; consider a larger one for better results
        try:
            self.qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
            print("Question Answering model loaded.")
        except Exception as e:
            print(f"Error loading QA model: {e}")
            print("Please ensure you have sufficient resources and internet connection.")
            self.qa_pipeline = None

        # Initialize Summarization model and tokenizer
        # Using a smaller T5 model; consider t5-large or bart-large-cnn for better summaries
        try:
            self.tokenizer = AutoTokenizer.from_pretrained("t5-small")
            self.summarizer_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
            self.summarizer_pipeline = pipeline("summarization", model=self.summarizer_model, tokenizer=self.tokenizer)
            print("Summarization model loaded.")
        except Exception as e:
            print(f"Error loading Summarization model: {e}")
            print("Please ensure you have sufficient resources and internet connection.")
            self.tokenizer = None
            self.summarizer_model = None
            self.summarizer_pipeline = None

    def clean_text(self, text):
        """Basic text cleaning."""
        text = text.lower() # Convert to lowercase
        text = re.sub(r'[^a-z0-9\s]', '', text) # Remove non-alphanumeric characters except spaces
        text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
        return text

    def upload_book(self, file_path):
        """Reads content from a PDF file."""
        try:
            doc = fitz.open(file_path)
            self.book_text = ""
            print(f"Reading {doc.page_count} pages from {file_path}...")
            for page_num in range(doc.page_count):
                page = doc.load_page(page_num)
                self.book_text += page.get_text()
            print("Finished reading book.")
            self.process_book_text()
        except Exception as e:
            print(f"Error reading file: {e}")
            self.book_text = ""
            self.chapter_sections = {}

    def process_book_text(self):
        """Splits book text into chapters/sections and prepares for search."""
        if not self.book_text:
            print("No book text loaded to process.")
            return

        print("Processing book text...")
        # Simple approach to split into sections based on common patterns (e.g., "Chapter X", "Section Y")
        # This is a basic implementation; a more robust parser might be needed for complex books.
        sections = re.split(r'(Chapter \d+|Section \d+)', self.book_text, flags=re.IGNORECASE)

        current_title = "Introduction/Beginning"
        current_text = ""
        self.chapter_sections = {} # Clear existing sections

        for i, part in enumerate(sections):
            if i % 2 == 1: # This part is likely a title
                if current_text.strip():
                    self.chapter_sections[current_title] = current_text.strip()
                current_title = part.strip()
                current_text = ""
            else: # This part is the content
                current_text += part

        if current_text.strip(): # Add the last section
             self.chapter_sections[current_title] = current_text.strip()

        # Handle cases where no clear chapters/sections are found (e.g., a simple document)
        if not self.chapter_sections and self.book_text:
             # Split into chunks for search
             chunk_size = 2000 # characters
             chunks = [self.book_text[i:i + chunk_size] for i in range(0, len(self.book_text), chunk_size)]
             self.chapter_sections = {f"Chunk {i+1}": chunk for i, chunk in enumerate(chunks)}
             print(f"Book text split into {len(chunks)} chunks.")

        print(f"Identified {len(self.chapter_sections)} chapters/sections or chunks.")

        # Prepare for vector search
        self.prepare_for_search()

    def prepare_for_search(self):
        """Vectorizes the text and creates a FAISS index."""
        if not self.chapter_sections:
            print("No sections to vectorize.")
            self.vectorizer = None
            self.tfidf_matrix = None
            self.index = None
            return

        # Ensure all values in chapter_sections are strings, clean them, and filter empty ones
        valid_section_texts = []
        valid_section_titles = []
        print("Cleaning and filtering sections for vectorization:")
        for title, text in self.chapter_sections.items():
            if isinstance(text, str) and text.strip():
                cleaned_text = self.clean_text(text) # Clean the text
                if cleaned_text: # Only add if cleaned text is not empty
                    valid_section_texts.append(cleaned_text)
                    valid_section_titles.append(title) # Keep track of titles for valid sections
                else:
                    print(f"Warning: Section '{title}' became empty after cleaning. Skipping for vectorization.")
            else:
                print(f"Warning: Section '{title}' is not a valid string or is empty. Skipping for vectorization.")

        print(f"Found {len(valid_section_texts)} valid sections after cleaning and filtering.")

        if not valid_section_texts:
            print("No valid sections to vectorize after filtering.")
            self.vectorizer = None
            self.tfidf_matrix = None
            self.index = None
            return

        print(f"Vectorizing {len(valid_section_texts)} valid sections...")

        # Using TF-IDF for simplicity; consider Sentence Transformers for better semantic search
        self.vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
        try:
            self.tfidf_matrix = self.vectorizer.fit_transform(valid_section_texts)

            # Create FAISS index
            dimension = self.tfidf_matrix.shape[1]
            if dimension > 0:
                self.index = faiss.IndexFlatL2(dimension) # Using L2 distance (Euclidean)
                self.index.add(self.tfidf_matrix.astype('float32')) # FAISS requires float32
                print("FAISS index created.")
                # Store the titles corresponding to the vectorized texts for later lookup
                self._vectorized_section_titles = valid_section_titles
            else:
                print("TF-IDF matrix has no features after vectorization. Cannot create FAISS index.")
                self.index = None
                self._vectorized_section_titles = []

        except Exception as e:
            print(f"Error during TF-IDF vectorization: {e}")
            self.vectorizer = None
            self.tfidf_matrix = None
            self.index = None
            self._vectorized_section_titles = []


    def ask_question(self, question):
        """Answers a question based on the book content."""
        if not self.book_text:
            return "Please upload a book first."
        if not self.qa_pipeline:
             return "Question Answering model is not loaded. Please check initialization."
        if not self.vectorizer or not self.index or not hasattr(self, '_vectorized_section_titles') or not self._vectorized_section_titles:
             return "Book content not vectorized or index not created. Cannot answer questions."


        print(f"Searching for answer to: '{question}'")

        # Find the most relevant section using the FAISS index
        try:
            # Clean the question before vectorizing for search
            cleaned_question = self.clean_text(question)
            if not cleaned_question:
                 return "Please provide a valid question after cleaning."

            question_vec = self.vectorizer.transform([cleaned_question]).toarray().astype('float32')
            D, I = self.index.search(question_vec, 1) # Search for the top 1 most similar section
        except Exception as e:
            print(f"Error transforming question for search: {e}")
            return "Could not process the question for searching."


        if I.size == 0 or I[0][0] == -1:
            return "Could not find a relevant section in the book."

        relevant_section_index = I[0][0]
        # Use the stored vectorized section titles to get the correct title
        relevant_section_title = self._vectorized_section_titles[relevant_section_index]
        relevant_section_text = self.chapter_sections[relevant_section_title] # Use original text for QA


        print(f"Most relevant section: '{relevant_section_title}'")

        # Use the QA pipeline on the relevant section
        try:
            # The QA pipeline has context length limitations. We might need to
            # further refine the relevant text chunk for the QA model.
            # For simplicity, using the whole section here, which might fail for very long sections.
            # A better approach would be to split the section into smaller paragraphs.
            # For this example, let's limit the context length for the pipeline
            max_context_length = self.qa_pipeline.model.config.max_position_embeddings if hasattr(self.qa_pipeline.model.config, 'max_position_embeddings') else 512
            # Let's roughly estimate tokens by characters for now (approx 4 chars per token)
            max_chars = max_context_length * 4
            context = relevant_section_text[:max_chars]


            answer = self.qa_pipeline(question=question, context=context) # Use original question for QA pipeline
            return f"Answer: {answer['answer']} (Source: '{relevant_section_title}', Score: {answer['score']:.2f})"
        except Exception as e:
            print(f"Error during QA pipeline: {e}")
            # Fallback: Return the relevant section
            return f"Could not generate a specific answer. Relevant section from '{relevant_section_title}':\n{relevant_section_text[:500]}..." # Show beginning of section


    def list_sections(self):
        """Lists the identified chapters/sections."""
        if not self.chapter_sections:
            return "No chapters or sections identified yet. Please upload a book."
        print("Chapters/Sections:")
        for i, title in enumerate(self.chapter_sections.keys()):
            print(f"{i+1}. {title}")

    def summarize_section(self, section_identifier):
        """Summarizes a specified chapter or section."""
        if not self.chapter_sections:
            return "No chapters or sections available to summarize. Please upload a book."
        if not self.summarizer_pipeline:
            return "Summarization model is not loaded. Please check initialization."

        section_titles = list(self.chapter_sections.keys())
        section_text = None
        section_title = None

        try:
            # Try to find by number (1-based index)
            section_index = int(section_identifier) - 1
            if 0 <= section_index < len(section_titles):
                section_title = section_titles[section_index]
                section_text = self.chapter_sections[section_title]
        except ValueError:
            # If not a number, try to find by partial title match
            for title in section_titles:
                if section_identifier.lower() in title.lower():
                    section_title = title
                    section_text = self.chapter_sections[title]
                    break

        if section_text is None:
            return f"Could not find a section matching '{section_identifier}'. Use `list_sections()` to see available sections."

        print(f"Summarizing section: '{section_title}'")

        # Summarization models have input length limits. We need to handle long sections.
        # A common approach is to split the text into smaller chunks, summarize each chunk,
        # and then optionally summarize the summaries.
        # For simplicity here, we'll just truncate or split the text for the summarizer input.

        # Let's split the text into smaller chunks and summarize each chunk
        chunk_size = 1000 # characters per chunk for summarization input
        chunks = [section_text[i:i + chunk_size] for i in range(0, len(section_text), chunk_size)]

        all_summaries = []
        print(f"Splitting section into {len(chunks)} chunks for summarization...")
        for i, chunk in enumerate(chunks):
             # Need to handle potential tokenizer limits more robustly
             # A simple approach: if a chunk is too long after tokenization, summarize a smaller part
             input_tokens = self.tokenizer(chunk, return_tensors="pt", max_length=self.tokenizer.model_max_length, truncation=True).input_ids
             if input_tokens.shape[1] > self.tokenizer.model_max_length:
                 print(f"Warning: Chunk {i+1} is too long, truncating for summarization.")
                 # Further split or refine truncation if necessary
                 # For this example, just use the truncated version by the tokenizer

             try:
                summary = self.summarizer_pipeline(chunk, max_length=150, min_length=30, do_sample=False)
                all_summaries.append(summary[0]['summary_text'])
                print(f"Summarized chunk {i+1}/{len(chunks)}")
             except Exception as e:
                 print(f"Error summarizing chunk {i+1}: {e}")
                 all_summaries.append(f"[Error summarizing this part: {str(e)[:100]}...]") # Add error indicator

        # Combine the summaries
        if all_summaries:
            combined_summary = "\n".join(all_summaries)
            return f"Summary of '{section_title}':\n{combined_summary}"
        else:
            return f"Could not generate a summary for '{section_title}'. No valid chunks were processed."


    def study_review(self):
        """Provides options for studying and reviewing."""
        if not self.chapter_sections:
            return "No book loaded for study/review."

        print("\n--- Study & Review Options ---")
        print("1. List Chapters/Sections")
        print("2. Summarize a specific Chapter/Section")
        print("3. Ask a question about the book")
        print("4. Exit Study/Review")

        while True:
            choice = input("Enter your choice: ")
            if choice == '1':
                self.list_sections()
            elif choice == '2':
                section_input = input("Enter the chapter/section number or name to summarize: ")
                summary = self.summarize_section(section_input)
                print(summary)
            elif choice == '3':
                question = input("What is your question about the book? ")
                answer = self.ask_question(question)
                print(answer)
            elif choice == '4':
                print("Exiting Study/Review.")
                break
            else:
                print("Invalid choice. Please try again.")

In [29]:

# --- How to use the agent in Colab ---

# 1. Mount Google Drive to access your files (optional, but common)
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
# 2. Create an instance of the BookQA agent
agent = BookQA()


Initializing NLP models...


Device set to use cuda:0


Question Answering model loaded.


Device set to use cuda:0


Summarization model loaded.
Models initialized.


In [31]:

# 3. Upload a book (replace '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' with your file path)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' # <--- **CHANGE THIS TO YOUR BOOK PATH**
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)


Attempting to upload book from: /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf
Reading 89 pages from /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf...
Finished reading book.
Processing book text...
Identified 1 chapters/sections or chunks.
Vectorizing 1 valid sections...
Error during TF-IDF vectorization: setting an array element with a sequence.


In [14]:
# --- Now you can interact with the agent ---

# Example 1: Ask a question
if agent.book_text:
    question1 = "BRIAN TRACY IDEAS TO LIVE BY"
    answer1 = agent.ask_question(question1)
    print(f"\nQuestion: {question1}")
    print(answer1)


In [18]:
# 2. Create an instance of the BookQA agent
agent = BookQA()

Initializing NLP models...


Device set to use cuda:0


Question Answering model loaded.


Device set to use cuda:0


Summarization model loaded.
Models initialized.


In [20]:
# 3. Upload a book (replace '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' with your file path)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' # <--- **CHANGE THIS TO YOUR BOOK PATH**
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)


Attempting to upload book from: /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf
Reading 89 pages from /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf...
Finished reading book.
Processing book text...
Identified 1 chapters/sections or chunks.
Vectorizing 1 sections...
Error reading file: setting an array element with a sequence.


In [26]:
# 3. Upload a book (replace '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' with your file path)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/MyDrive/627260606-chatbotdoc (1).pdf' # <--- **CHANGE THIS TO YOUR BOOK PATH**
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)


Attempting to upload book from: /content/drive/MyDrive/627260606-chatbotdoc (1).pdf
Reading 97 pages from /content/drive/MyDrive/627260606-chatbotdoc (1).pdf...
Finished reading book.
Processing book text...
Identified 2 chapters/sections or chunks.
Vectorizing 2 sections...
Error reading file: setting an array element with a sequence.


In [25]:
# --- Now you can interact with the agent ---

# Example 1: Ask a question
if agent.book_text:
    question1 = "What is the main topic of the introduction?"
    answer1 = agent.ask_question(question1)
    print(f"\nQuestion: {question1}")
    print(answer1)

    question2 = "What are the key steps for setting goals?" # Example question related to the book title
    answer2 = agent.ask_question(question2)
    print(f"\nQuestion: {question2}")
    print(answer2)

    # Example 2: List sections
    print("\nListing sections:")
    agent.list_sections()

    # Example 3: Summarize a section (replace '1' with the actual section number or part of the title)
    # Check the output of list_sections() to find a section to summarize.
    # For now, let's try summarizing the first identified section (assuming there's at least one)
    if agent.chapter_sections:
        first_section_key = list(agent.chapter_sections.keys())[0]
        print(f"\nAttempting to summarize section: {first_section_key}")
        summary = agent.summarize_section(first_section_key)
        print(summary)
    else:
        print("\nNo sections available to summarize.")


    # Example 4: Enter Study/Review mode
    print("\nEntering Study/Review mode:")
    # agent.study_review() # Uncomment this line to start the interactive study mode

else:
    print("\nBook was not loaded successfully. Please check the file path and format.")


Book was not loaded successfully. Please check the file path and format.


In [28]:
# 3. Upload a book (replace '/content/drive/MyDrive/627260606-chatbotdoc (1).pdf' with your file path)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/MyDrive/627260606-chatbotdoc (1).pdf' # <--- **CHANGE THIS TO YOUR BOOK PATH**
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)


Attempting to upload book from: /content/drive/MyDrive/627260606-chatbotdoc (1).pdf
Reading 97 pages from /content/drive/MyDrive/627260606-chatbotdoc (1).pdf...
Finished reading book.
Processing book text...
Identified 2 chapters/sections or chunks.
Vectorizing 2 sections...
Error reading file: setting an array element with a sequence.


In [33]:
# 3. Upload a book (replace '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' with your file path)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' # <--- **CHANGE THIS TO YOUR BOOK PATH**
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)


Attempting to upload book from: /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf
Reading 89 pages from /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf...
Finished reading book.
Processing book text...
Identified 1 chapters/sections or chunks.
Vectorizing 1 valid sections...
Error during TF-IDF vectorization: setting an array element with a sequence.


In [36]:
# 3. Upload a book (replace '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' with your file path)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' # <--- **CHANGE THIS TO YOUR BOOK PATH**
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)


Attempting to upload book from: /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf
Reading 89 pages from /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf...
Finished reading book.
Processing book text...
Identified 1 chapters/sections or chunks.
Vectorizing 1 valid sections...
Error during TF-IDF vectorization: setting an array element with a sequence.


# Task
Explain the error "setting an array element with a sequence" encountered during TF-IDF vectorization when processing PDF files "/content/drive/MyDrive/627260606-chatbotdoc (1).pdf" and "/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf". If possible, fix the error in the provided code and incorporate the changes. Otherwise, diagnose the error.

## Install sentence transformers

### Subtask:
Add the necessary library for Sentence Transformers.


**Reasoning**:
The subtask is to add the necessary library for Sentence Transformers. This requires installing the `sentence-transformers` package using pip. This needs to be in a separate code cell as per the instructions.



In [37]:
!pip install sentence-transformers



## Update `initialize models`

### Subtask:
Load a Sentence Transformer model instead of relying solely on TF-IDF.


**Reasoning**:
Import the necessary class for Sentence Transformers and initialize the Sentence Transformer model in the `initialize_models` method, replacing the TF-IDF vectorizer initialization.



In [38]:
from sentence_transformers import SentenceTransformer

class BookQA:
    def __init__(self):
        self.book_text = ""
        self.chapter_sections = {} # To store chapter/section titles and their text
        self.vectorizer = None # This will be replaced by the embedding model
        self.tfidf_matrix = None # This will be replaced by the embeddings
        self.index = None # FAISS index for vector search
        self.qa_pipeline = None
        self.summarizer_pipeline = None
        self.tokenizer = None
        self.summarizer_model = None
        self.embedding_model = None # New attribute for Sentence Transformer model


        # Initialize NLP pipelines and models
        print("Initializing NLP models...")
        self.initialize_models()
        print("Models initialized.")

    def initialize_models(self):
        # Initialize Question Answering pipeline
        # Using a smaller model for demonstration; consider a larger one for better results
        try:
            self.qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
            print("Question Answering model loaded.")
        except Exception as e:
            print(f"Error loading QA model: {e}")
            print("Please ensure you have sufficient resources and internet connection.")
            self.qa_pipeline = None

        # Initialize Summarization model and tokenizer
        # Using a smaller T5 model; consider t5-large or bart-large-cnn for better summaries
        try:
            self.tokenizer = AutoTokenizer.from_pretrained("t5-small")
            self.summarizer_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
            self.summarizer_pipeline = pipeline("summarization", model=self.summarizer_model, tokenizer=self.tokenizer)
            print("Summarization model loaded.")
        except Exception as e:
            print(f"Error loading Summarization model: {e}")
            print("Please ensure you have sufficient resources and internet connection.")
            self.tokenizer = None
            self.summarizer_model = None
            self.summarizer_pipeline = None

        # Initialize Sentence Transformer model for embeddings
        try:
            self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
            print("Sentence Transformer model loaded.")
        except Exception as e:
            print(f"Error loading Sentence Transformer model: {e}")
            print("Please ensure you have sufficient resources and internet connection.")
            self.embedding_model = None


    def clean_text(self, text):
        """Basic text cleaning."""
        text = text.lower() # Convert to lowercase
        text = re.sub(r'[^a-z0-9\s]', '', text) # Remove non-alphanumeric characters except spaces
        text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
        return text

    def upload_book(self, file_path):
        """Reads content from a PDF file."""
        try:
            doc = fitz.open(file_path)
            self.book_text = ""
            print(f"Reading {doc.page_count} pages from {file_path}...")
            for page_num in range(doc.page_count):
                page = doc.load_page(page_num)
                self.book_text += page.get_text()
            print("Finished reading book.")
            self.process_book_text()
        except Exception as e:
            print(f"Error reading file: {e}")
            self.book_text = ""
            self.chapter_sections = {}

    def process_book_text(self):
        """Splits book text into chapters/sections and prepares for search."""
        if not self.book_text:
            print("No book text loaded to process.")
            return

        print("Processing book text...")
        # Simple approach to split into sections based on common patterns (e.g., "Chapter X", "Section Y")
        # This is a basic implementation; a more robust parser might be needed for complex books.
        sections = re.split(r'(Chapter \d+|Section \d+)', self.book_text, flags=re.IGNORECASE)

        current_title = "Introduction/Beginning"
        current_text = ""
        self.chapter_sections = {} # Clear existing sections

        for i, part in enumerate(sections):
            if i % 2 == 1: # This part is likely a title
                if current_text.strip():
                    self.chapter_sections[current_title] = current_text.strip()
                current_title = part.strip()
                current_text = ""
            else: # This part is the content
                current_text += part

        if current_text.strip(): # Add the last section
             self.chapter_sections[current_title] = current_text.strip()

        # Handle cases where no clear chapters/sections are found (e.g., a simple document)
        if not self.chapter_sections and self.book_text:
             # Split into chunks for search
             chunk_size = 2000 # characters
             chunks = [self.book_text[i:i + chunk_size] for i in range(0, len(self.book_text), chunk_size)]
             self.chapter_sections = {f"Chunk {i+1}": chunk for i, chunk in enumerate(chunks)}
             print(f"Book text split into {len(chunks)} chunks.")

        print(f"Identified {len(self.chapter_sections)} chapters/sections or chunks.")

        # Prepare for vector search
        self.prepare_for_search()

    def prepare_for_search(self):
        """Vectorizes the text and creates a FAISS index using Sentence Transformers."""
        if not self.chapter_sections:
            print("No sections to vectorize.")
            self.tfidf_matrix = None # Renamed to embeddings
            self.index = None
            return
        if not self.embedding_model:
            print("Sentence Transformer model not loaded. Cannot vectorize sections.")
            self.tfidf_matrix = None # Renamed to embeddings
            self.index = None
            return


        section_texts = list(self.chapter_sections.values())
        print(f"Creating embeddings for {len(section_texts)} sections...")

        try:
            # Create embeddings using the Sentence Transformer model
            embeddings = self.embedding_model.encode(section_texts, convert_to_numpy=True)
            self.tfidf_matrix = embeddings # Store embeddings (renamed from tfidf_matrix)

            # Create FAISS index
            dimension = embeddings.shape[1]
            if dimension > 0:
                self.index = faiss.IndexFlatL2(dimension) # Using L2 distance (Euclidean)
                self.index.add(embeddings.astype('float32')) # FAISS requires float32
                print("FAISS index created.")
                # Store the titles corresponding to the vectorized texts for later lookup
                self._vectorized_section_titles = list(self.chapter_sections.keys()) # Store original titles
            else:
                print("Embeddings have no dimensions after vectorization. Cannot create FAISS index.")
                self.index = None
                self._vectorized_section_titles = []

        except Exception as e:
            print(f"Error during embedding creation or FAISS index creation: {e}")
            self.tfidf_matrix = None # Renamed to embeddings
            self.index = None
            self._vectorized_section_titles = []


    def ask_question(self, question):
        """Answers a question based on the book content."""
        if not self.book_text:
            return "Please upload a book first."
        if not self.qa_pipeline:
             return "Question Answering model is not loaded. Please check initialization."
        if not self.embedding_model or not self.index or not hasattr(self, '_vectorized_section_titles') or not self._vectorized_section_titles:
             return "Book content not vectorized or index not created. Cannot answer questions."


        print(f"Searching for answer to: '{question}'")

        # Find the most relevant section using the FAISS index
        try:
            # Create embedding for the question
            question_vec = self.embedding_model.encode([question], convert_to_numpy=True).astype('float32')
            D, I = self.index.search(question_vec, 1) # Search for the top 1 most similar section
        except Exception as e:
            print(f"Error creating question embedding or searching index: {e}")
            return "Could not process the question for searching."


        if I.size == 0 or I[0][0] == -1:
            return "Could not find a relevant section in the book."

        relevant_section_index = I[0][0]
        # Use the stored vectorized section titles to get the correct title
        relevant_section_title = self._vectorized_section_titles[relevant_section_index]
        relevant_section_text = self.chapter_sections[relevant_section_title] # Use original text for QA


        print(f"Most relevant section: '{relevant_section_title}'")

        # Use the QA pipeline on the relevant section
        try:
            # The QA pipeline has context length limitations. We might need to
            # further refine the relevant text chunk for the QA model.
            # For simplicity, using the whole section here, which might fail for very long sections.
            # A better approach would be to split the section into smaller paragraphs.
            # For this example, let's limit the context length for the pipeline
            max_context_length = self.qa_pipeline.model.config.max_position_embeddings if hasattr(self.qa_pipeline.model.config, 'max_position_embeddings') else 512
            # Let's roughly estimate tokens by characters for now (approx 4 chars per token)
            max_chars = max_context_length * 4
            context = relevant_section_text[:max_chars]


            answer = self.qa_pipeline(question=question, context=context) # Use original question for QA pipeline
            return f"Answer: {answer['answer']} (Source: '{relevant_section_title}', Score: {answer['score']:.2f})"
        except Exception as e:
            print(f"Error during QA pipeline: {e}")
            # Fallback: Return the relevant section
            return f"Could not generate a specific answer. Relevant section from '{relevant_section_title}':\n{relevant_section_text[:500]}..." # Show beginning of section


    def list_sections(self):
        """Lists the identified chapters/sections."""
        if not self.chapter_sections:
            return "No chapters or sections identified yet. Please upload a book."
        print("Chapters/Sections:")
        for i, title in enumerate(self.chapter_sections.keys()):
            print(f"{i+1}. {title}")

    def summarize_section(self, section_identifier):
        """Summarizes a specified chapter or section."""
        if not self.chapter_sections:
            return "No chapters or sections available to summarize. Please upload a book."
        if not self.summarizer_pipeline:
            return "Summarization model is not loaded. Please check initialization."

        section_titles = list(self.chapter_sections.keys())
        section_text = None
        section_title = None

        try:
            # Try to find by number (1-based index)
            section_index = int(section_identifier) - 1
            if 0 <= section_index < len(section_titles):
                section_title = section_titles[section_index]
                section_text = self.chapter_sections[section_title]
        except ValueError:
            # If not a number, try to find by partial title match
            for title in section_titles:
                if section_identifier.lower() in title.lower():
                    section_title = title
                    section_text = self.chapter_sections[title]
                    break

        if section_text is None:
            return f"Could not find a section matching '{section_identifier}'. Use `list_sections()` to see available sections."

        print(f"Summarizing section: '{section_title}'")

        # Summarization models have input length limits. We need to handle long sections.
        # A common approach is to split the text into smaller chunks, summarize each chunk,
        # and then optionally summarize the summaries.
        # For simplicity here, we'll just truncate or split the text for the summarizer input.

        # Let's split the text into smaller chunks and summarize each chunk
        chunk_size = 1000 # characters per chunk for summarization input
        chunks = [section_text[i:i + chunk_size] for i in range(0, len(section_text), chunk_size)]

        all_summaries = []
        print(f"Splitting section into {len(chunks)} chunks for summarization...")
        for i, chunk in enumerate(chunks):
             # Need to handle potential tokenizer limits more robustly
             # A simple approach: if a chunk is too long after tokenization, summarize a smaller part
             input_tokens = self.tokenizer(chunk, return_tensors="pt", max_length=self.tokenizer.model_max_length, truncation=True).input_ids
             if input_tokens.shape[1] > self.tokenizer.model_max_length:
                 print(f"Warning: Chunk {i+1} is too long, truncating for summarization.")
                 # Further split or refine truncation if necessary
                 # For this example, just use the truncated version by the tokenizer

             try:
                summary = self.summarizer_pipeline(chunk, max_length=150, min_length=30, do_sample=False)
                all_summaries.append(summary[0]['summary_text'])
                print(f"Summarized chunk {i+1}/{len(chunks)}")
             except Exception as e:
                 print(f"Error summarizing chunk {i+1}: {e}")
                 all_summaries.append(f"[Error summarizing this part: {str(e)[:100]}...]") # Add error indicator

        # Combine the summaries
        if all_summaries:
            combined_summary = "\n".join(all_summaries)
            return f"Summary of '{section_title}':\n{combined_summary}"
        else:
            return f"Could not generate a summary for '{section_title}'. No valid chunks were processed."


    def study_review(self):
        """Provides options for studying and reviewing."""
        if not self.chapter_sections:
            return "No book loaded for study/review."

        print("\n--- Study & Review Options ---")
        print("1. List Chapters/Sections")
        print("2. Summarize a specific Chapter/Section")
        print("3. Ask a question about the book")
        print("4. Exit Study/Review")

        while True:
            choice = input("Enter your choice: ")
            if choice == '1':
                self.list_sections()
            elif choice == '2':
                section_input = input("Enter the chapter/section number or name to summarize: ")
                summary = self.summarize_section(section_input)
                print(summary)
            elif choice == '3':
                question = input("What is your question about the book? ")
                answer = self.ask_question(question)
                print(answer)
            elif choice == '4':
                print("Exiting Study/Review.")
                break
            else:
                print("Invalid choice. Please try again.")

## Review and refine

### Subtask:
Test the updated agent with the problematic PDF and potentially other documents. Adjust text splitting, model choice, or search parameters as needed.


**Reasoning**:
Re-instantiate the BookQA agent to load the updated code and then attempt to upload one of the problematic PDF files to test if the error is resolved.



In [39]:
# 2. Create an instance of the BookQA agent to load the updated code
agent = BookQA()

# 3. Upload a book (replace the path with one of the problematic PDF files)
# Make sure the PDF is in your Google Drive or uploaded directly to the Colab runtime.
book_path = '/content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf' # Using one of the problematic files
print(f"\nAttempting to upload book from: {book_path}")
agent.upload_book(book_path)

# If upload is successful, list sections to confirm processing
if agent.chapter_sections:
    print("\nBook processed successfully. Listing sections:")
    agent.list_sections()
else:
    print("\nBook processing failed.")

Initializing NLP models...


Device set to use cuda:0


Question Answering model loaded.


Device set to use cuda:0


Summarization model loaded.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Sentence Transformer model loaded.
Models initialized.

Attempting to upload book from: /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf
Reading 89 pages from /content/drive/MyDrive/229250892-Brian-Tracy-Goal-Planner.pdf...
Finished reading book.
Processing book text...
Identified 1 chapters/sections or chunks.
Creating embeddings for 1 sections...
FAISS index created.

Book processed successfully. Listing sections:
Chapters/Sections:
1. Introduction/Beginning


**Reasoning**:
The previous step successfully loaded the book and created embeddings. Now, test the `ask_question`, `list_sections`, and `summarize_section` methods to ensure the agent is functional with the loaded book.



In [40]:
# Test the ask_question method
question1 = "What are the main ideas presented in the book?"
answer1 = agent.ask_question(question1)
print(f"\nQuestion: {question1}")
print(answer1)

question2 = "How to set goals?"
answer2 = agent.ask_question(question2)
print(f"\nQuestion: {question2}")
print(answer2)

# Test the list_sections method (already done in the previous step, but can be repeated)
print("\nListing sections again:")
agent.list_sections()

# Test the summarize_section method
# Assuming there's at least one section, summarize the first one.
if agent.chapter_sections:
    first_section_key = list(agent.chapter_sections.keys())[0]
    print(f"\nAttempting to summarize section: {first_section_key}")
    summary = agent.summarize_section(first_section_key)
    print(summary)
else:
    print("\nNo sections available to summarize.")


Searching for answer to: 'What are the main ideas presented in the book?'
Most relevant section: 'Introduction/Beginning'

Question: What are the main ideas presented in the book?
Answer: The more reasons you have for 
achieving your goal (Source: 'Introduction/Beginning', Score: 0.03)
Searching for answer to: 'How to set goals?'
Most relevant section: 'Introduction/Beginning'

Question: How to set goals?
Answer: act as if it were impossible to fail (Source: 'Introduction/Beginning', Score: 0.01)

Listing sections again:
Chapters/Sections:
1. Introduction/Beginning

Attempting to summarize section: Introduction/Beginning
Summarizing section: 'Introduction/Beginning'
Splitting section into 68 chunks for summarization...


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 1/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 2/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 3/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 4/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 5/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 6/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 7/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 8/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 9/68


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 10/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 11/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 12/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 13/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 14/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 15/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 16/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 17/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 18/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 19/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 20/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 21/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 22/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 23/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 24/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 25/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 26/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 27/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 28/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 29/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 30/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 31/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 32/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 33/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 34/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 35/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 36/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 37/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 38/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 39/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 40/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 41/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 42/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 43/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 44/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 45/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 46/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 47/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 48/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 49/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 50/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 51/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 52/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 53/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 54/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 55/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 56/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 57/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 58/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 59/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 60/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 61/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 62/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 63/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 64/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 65/68


Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 66/68


Your max_length is set to 150, but your input_length is only 131. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=65)
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summarized chunk 67/68
Summarized chunk 68/68
Summary of 'Introduction/Beginning':
BRIAN TRACY GOAL PLANNER M A X I M U M A C H I E V E M E N T G O A L P L A N N N E R . the great secret of success is that there are no secrets of success .
there are no limits on what you can achieve with your life, except the limits you accept in your own mind . you are a potential genius; there is no problem you cannot solve, and no answer you cannot fi nd some- where .
Brian Tracy has consulted for more than 1,000 companies and addressed more than 3,000,000 people in 4,000 talks and seminars . he has studied, researched, written and spoken for 30 years in the fi elds of eco- nomics, history, business, philosophy and psychology .
nguages speaks to corporate and public audiences on the subjects of Person- al and Professional Development . talks and seminars on Leadership, Selling, Self-Esteem, Goals, Strategy, Creativity and Success Psychology bring about immediate changes .
Brian is the president of t

In [41]:
# Test the ask_question method
question1 = "What are the main ideas presented in the book?"
answer1 = agent.ask_question(question1)
print(f"\nQuestion: {question1}")
print(answer1)

question2 = "How to set goals?"
answer2 = agent.ask_question(question2)
print(f"\nQuestion: {question2}")
print(answer2)

Searching for answer to: 'What are the main ideas presented in the book?'
Most relevant section: 'Introduction/Beginning'

Question: What are the main ideas presented in the book?
Answer: The more reasons you have for 
achieving your goal (Source: 'Introduction/Beginning', Score: 0.03)
Searching for answer to: 'How to set goals?'
Most relevant section: 'Introduction/Beginning'

Question: How to set goals?
Answer: act as if it were impossible to fail (Source: 'Introduction/Beginning', Score: 0.01)


## Summary:

### Data Analysis Key Findings

*   The initial error "setting an array element with a sequence" occurred during TF-IDF vectorization when processing the provided PDF files.
*   This error was resolved by replacing the TF-IDF vectorization approach with Sentence Transformer embeddings for creating text representations.
*   The updated `BookQA` class successfully initialized a Sentence Transformer model (`all-MiniLM-L6-v2`) and used it to generate embeddings for the book sections.
*   A FAISS index was created using the generated embeddings, enabling efficient semantic search.
*   The agent successfully processed the previously problematic PDF file using the new embedding approach.
*   Core functionalities, including listing sections, asking questions, and summarizing sections, worked after the update.
*   The summarization pipeline produced warnings regarding `max_new_tokens` and `max_length` parameters but completed the task.

### Insights or Next Steps

*   The switch from TF-IDF to Sentence Transformer embeddings effectively addressed the array dimension error, suggesting that the issue was related to how TF-IDF handled sequences within the text data extracted from the PDFs.
*   Future work could involve addressing the summarization warnings for potentially better control over summary length and exploring more sophisticated text splitting techniques or larger language models for improved QA and summarization quality.
