In [1]:
!pip install langchain faiss-cpu sentence-transformers transformers pypdf langchain-community





In [2]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import pipeline
import re

# --- Step 1: Load and Split PDF ---
def extract_chunks_from_pdf(pdf_path):
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200
    )
    chunks = splitter.split_documents(pages)
    return chunks

# --- Step 2: Embed and Index Chunks ---
def create_vector_store(chunks):
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    return vectorstore

# --- Step 3: Retrieve "Introduction" Chunks ---
def retrieve_introduction(vectorstore, top_k=5):
    retriever = vectorstore.as_retriever(search_type="similarity", k=top_k)
    results = retriever.get_relevant_documents("Extract the Introduction section")
    intro_text = "\n".join([doc.page_content for doc in results])
    return intro_text

# --- Step 4: Summarize the Introduction ---
def summarize_text(text, model_name="google/pegasus-xsum"):
    summarizer = pipeline("summarization", model=model_name, tokenizer=model_name)
    chunks = [text[i:i+1024] for i in range(0, len(text), 1024)]
    summaries = summarizer(chunks, max_length=120, min_length=30, do_sample=False)
    final_summary = " ".join([s['summary_text'] for s in summaries])
    return final_summary

# --- Run the Full Pipeline ---
def extract_and_summarize_intro(pdf_path):
    chunks = extract_chunks_from_pdf(pdf_path)
    vectorstore = create_vector_store(chunks)
    intro = retrieve_introduction(vectorstore)

    print("\n🧾 --- Extracted Introduction ---\n")
    print(intro)

    summary = summarize_text(intro)
    print("\n📝 --- Summarized Introduction ---\n")
    print(summary)

# Example usage:
pdf_file_path = "/content/file.pdf"  # 🔁 Replace with your PDF path
extract_and_summarize_intro(pdf_file_path)


  embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

  results = retriever.get_relevant_documents("Extract the Introduction section")



🧾 --- Extracted Introduction ---

used to evaluatelanguagemodels,and we describea method for
constructinglatticessuch thattheseartiﬁcialword-error ratescor-
relatewellwithword-errorratescalculatedon genuinelattices.In
addition,thelatticesconstructedarevery narrow ,so thatartiﬁcial
1Itisunclearhow to counthow oftena word occursin each bucket;
e.g.,duringspeechrecognition,languagemodel probabilitiesfora word may
be estimatedmultipletimesateach positionintheutterancewithdifferent
histories.For thepurposesof thiscalculation,we pretendthata totalof
I JKI
words “occur”ateach word positionin an utterancewhere
J
isthe
vocabularyused,and normalizeaccordingly.
1.1.PreviousW ork
Iyeretal.[2]investigatethepredictionofspeechrecognitionperfor-
mance forlanguagemodels intheSwitchboarddomain,fortrigram
models builton differingamounts ofin-domainand out-of-domain
trainingdata.Ov erthetenmodels they constructed,they ﬁnd that
perplexitypredictsword-errorratewellwhen onlyin-domaintrain-
ingdataisused,but

config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Device set to use cpu
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
Your max_length is set to 120, but your input_length is only 96. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)



📝 --- Summarized Introduction ---

In this paper, we investigate the prediction of word-error rates during speechrecognition and speech-to-text training by constructing trigram models. This paper presents the results of a study on the impact of four new language models on the performance of English as a second language (ESL) students. The performance of speech recognition systems is affected by a number offactors, some of which have not been previously reported in any literature on speech recognition. The Janusrecognizer, a speech recognition system based on sparse data, has been developed by a team of researchers, including Slava Katz, IvicaRogina, and Alex Wibel.


In [12]:
!pip install --quiet pymupdf

In [13]:
import os
import re
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import pipeline
from pathlib import Path

# --- PDF Loading and Chunking ---
def extract_chunks_from_pdf(pdf_path):
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(pages)
    return chunks

# --- Embedding and Vector Store Creation ---
def create_vector_store(chunks):
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    return vectorstore

# --- Retrieve Introduction Chunks ---
def retrieve_introduction(vectorstore, top_k=5):
    retriever = vectorstore.as_retriever(search_type="similarity", k=top_k)
    results = retriever.get_relevant_documents("Extract the Introduction section")
    intro_text = "\n".join([doc.page_content for doc in results])
    return intro_text

# --- Summarization ---
def summarize_text(text, model_name="google/pegasus-xsum"):
    summarizer = pipeline("summarization", model=model_name, tokenizer=model_name)
    chunks = [text[i:i+1024] for i in range(0, len(text), 1024)]
    summaries = summarizer(chunks, max_length=120, min_length=30, do_sample=False)
    final_summary = " ".join([s['summary_text'] for s in summaries])
    return final_summary

# --- Process Single PDF ---
def process_pdf(pdf_path, output_dir):
    chunks = extract_chunks_from_pdf(pdf_path)
    vectorstore = create_vector_store(chunks)
    intro = retrieve_introduction(vectorstore)
    summary = summarize_text(intro)

    # File name processing
    pdf_name = Path(pdf_path).stem

    # Save raw intro
    intro_file = os.path.join(output_dir, f"{pdf_name}_introduction.txt")
    with open(intro_file, "w", encoding="utf-8") as f:
        f.write(intro)

    # Save summary
    summary_file = os.path.join(output_dir, f"{pdf_name}_summary.txt")
    with open(summary_file, "w", encoding="utf-8") as f:
        f.write(summary)

    print(f"✅ Processed: {pdf_name}\n🧾 Saved intro: {intro_file}\n📝 Saved summary: {summary_file}\n")

# --- Batch Process Folder of PDFs ---
def process_all_pdfs(folder_path):
    pdf_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
    for pdf_path in pdf_files:
        process_pdf(pdf_path, folder_path)

# --- Set your path ---
# Replace this with your actual folder in Drive
folder_path = "/content"  # 🔁 Change this

process_all_pdfs(folder_path)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


✅ Processed: Speech and Language Processing 2e (SLP2e)
🧾 Saved intro: /content/Speech and Language Processing 2e (SLP2e)_introduction.txt
📝 Saved summary: /content/Speech and Language Processing 2e (SLP2e)_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 23. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=11)


✅ Processed: 2010.12309v3
🧾 Saved intro: /content/2010.12309v3_introduction.txt
📝 Saved summary: /content/2010.12309v3_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 96. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)


✅ Processed: file
🧾 Saved intro: /content/file_introduction.txt
📝 Saved summary: /content/file_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 12. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=6)


✅ Processed: 1-s2.0-S1386505624003368-main
🧾 Saved intro: /content/1-s2.0-S1386505624003368-main_introduction.txt
📝 Saved summary: /content/1-s2.0-S1386505624003368-main_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


✅ Processed: 2407.21330v1
🧾 Saved intro: /content/2407.21330v1_introduction.txt
📝 Saved summary: /content/2407.21330v1_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


✅ Processed: Sinhala_NLP_Tools_Survey_V.5.4.0
🧾 Saved intro: /content/Sinhala_NLP_Tools_Survey_V.5.4.0_introduction.txt
📝 Saved summary: /content/Sinhala_NLP_Tools_Survey_V.5.4.0_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 94. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)
Your max_length is set to 120, but your input_length is only 32. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


✅ Processed: Evaluating Domain Specific LLM Performance Within Economics Using
🧾 Saved intro: /content/Evaluating Domain Specific LLM Performance Within Economics Using_introduction.txt
📝 Saved summary: /content/Evaluating Domain Specific LLM Performance Within Economics Using_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 106. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=53)


✅ Processed: 2503.12051v3
🧾 Saved intro: /content/2503.12051v3_introduction.txt
📝 Saved summary: /content/2503.12051v3_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 83. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)


✅ Processed: 2312.16845v1
🧾 Saved intro: /content/2312.16845v1_introduction.txt
📝 Saved summary: /content/2312.16845v1_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 12. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=6)


✅ Processed: 2406.10421v3
🧾 Saved intro: /content/2406.10421v3_introduction.txt
📝 Saved summary: /content/2406.10421v3_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


✅ Processed: 2306.05179v2
🧾 Saved intro: /content/2306.05179v2_introduction.txt
📝 Saved summary: /content/2306.05179v2_summary.txt



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Your max_length is set to 120, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)


✅ Processed: 2305.12474v3
🧾 Saved intro: /content/2305.12474v3_introduction.txt
📝 Saved summary: /content/2305.12474v3_summary.txt

