<a href="https://colab.research.google.com/github/Vipul251/Insightful-Review-A-PyMuPDF-and-PyPDFLLM-Exploration/blob/main/PDF_Extraction_A_PyMuPDF_and_PyPDFLLM_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Insight Automation: A PyMuPDF and PyMuPDF4LLM Approach for Informed Investing

In [40]:
!pip install pymupdf transformers spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 48.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Title: Automated Financial Document Analysis for Investors Using PyMuPDF and NLP Pipelines
Description
This project helps investors by automatically extracting, categorizing, summarizing, and analyzing key information from financial documents. It uses PyMuPDF for PDF text extraction, regular expressions for keyword-based categorization, and Hugging Face Transformers pipelines for summarization and sentiment analysis. The goal is to surface concise insights on future growth prospects, business changes, key triggers, and potential financial impacts, to support quick investment evaluation.


In [41]:
import fitz
import re
from transformers import pipeline

# Initialize NLP models for summarization and sentiment analysis
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
sentiment_analysis = pipeline("sentiment-analysis")

def extract_text_from_pdf(pdf_path):
    """Extracts all text from each page of the PDF."""
    # The context manager closes the document once extraction is done
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

def categorize_text(text, categories):
    """Categorize text based on defined keyword categories."""
    categorized_data = {}

    for category, keywords in categories.items():
        # Build a whole-word regex from the escaped keywords
        pattern = r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b"
        # Note: findall returns only the matched keywords, not their surrounding text
        matches = re.findall(pattern, text, flags=re.IGNORECASE)

        if matches:
            categorized_data[category] = " ".join(matches)

    return categorized_data

def analyze_and_summarize(data):
    """Generates summaries and sentiment analysis for each category."""
    analysis_results = {}

    for category, text in data.items():
        if text:
            # Summarize relevant text
            summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
            # Analyze sentiment
            sentiment = sentiment_analysis(text)
            # Store results with a clear label
            analysis_results[category] = {
                "summary": summary[0]['summary_text'],
                "sentiment": sentiment[0]
            }

    return analysis_results

def main(pdf_path):
    # Define key categories for the investor's interests
    categories = {
        "Future Growth Prospects": ["growth", "expansion", "market", "prospects"],
        "Business Changes": ["restructure", "acquisition", "change", "merger"],
        "Key Triggers": ["trigger", "factor", "risk", "opportunity"],
        "Material Financial Impact": ["earnings", "revenue", "forecast", "profit"]
    }

    # Extract and categorize text
    text = extract_text_from_pdf(pdf_path)
    categorized_data = categorize_text(text, categories)
    analysis_results = analyze_and_summarize(categorized_data)

    # Output structured results
    for category, result in analysis_results.items():
        print(f"\nCategory: {category}")
        print(f"Summary: {result['summary']}")
        print(f"Sentiment: {result['sentiment']['label']} (Score: {result['sentiment']['score']:.2f})")

# Run the main function on your document
pdf_path = "/content/SJS Transcript Call.pdf"  # Update with your file path
main(pdf_path)




No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your max_length is set to 50, but your input_length is only 25. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)
Your max_length is set to 50, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
Your max_length is set to 50, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually


Category: Future Growth Prospects
Summary: Growth growth market growth growth growth market market. Growth growth in the U.S. has been outpacing the rest of the world.
Sentiment: POSITIVE (Score: 1.00)

Category: Business Changes
Summary: Change is the only way forward for the U.S. economy, says President Barack Obama. Obama has pledged $1.5 billion in stimulus funds to help rebuild the economy.
Sentiment: NEGATIVE (Score: 0.74)

Category: Key Triggers
Summary:  opportunity opportunity. opportunity opportunity opportunity opportunities opportunity opportunity Opportunity opportunity opportunity possibility opportunity opportunity chance opportunity. Opportunity opportunity.
Sentiment: POSITIVE (Score: 1.00)

Category: Material Financial Impact
Summary:  Earnings Earnings earnings per share: $0.05. Earnings per share for the quarter: $1.07. Revenue: $2.01. Revenue for the year: $3.02.
Sentiment: POSITIVE (Score: 0.88)
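
The summaries above consist largely of repeated keywords because `categorize_text` passes only the matched keywords to the summarizer, which leaves the model very little real content to condense. One possible alternative, sketched below, keeps every sentence that contains a keyword. The function name `categorize_sentences`, the naive sentence splitter, and the sample text are assumptions made for illustration and are not part of the original notebook.

```python
import re

def categorize_sentences(text, categories):
    """Map each category to the sentences that mention one of its keywords."""
    # Naive splitter: break after ., ! or ? followed by whitespace (an assumption;
    # a proper sentence tokenizer would handle abbreviations better)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    categorized = {}
    for category, keywords in categories.items():
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
            flags=re.IGNORECASE,
        )
        hits = [s for s in sentences if pattern.search(s)]
        if hits:
            categorized[category] = " ".join(hits)
    return categorized

# Invented sample text, used only to exercise the function
sample = "Revenue grew 12% this quarter. The weather was mild. We see strong growth ahead."
result = categorize_sentences(sample, {
    "Future Growth Prospects": ["growth"],
    "Material Financial Impact": ["revenue"],
})
print(result)
```

Feeding full sentences to `analyze_and_summarize` would give both the summarizer and the sentiment model actual statements to work with, although long categories may then need to be split to fit the model's input limit.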


# PyMuPDF4LLM for Data Extraction: Building Better, More Efficient RAG


In [39]:


!pip install pymupdf4llm



In [7]:
import pymupdf4llm

In [8]:
md_text = pymupdf4llm.to_markdown("/content/SJS Transcript Call.pdf")

Processing /content/SJS Transcript Call.pdf...


In [9]:
print(md_text)

#### August 03, 2023

 To,

 National Stock Exchange of India Limited BSE Limited Exchange Plaza, 5[th] Floor, Corporate Relationship Department, Plot No. C/1, G Block, 2[nd] Floor, New Trading Wing, Bandra – Kurla Complex, Rotunda Building, P.J. Towers, Bandra (E), Mumbai -400 051 Dalal Street, Mumbai – 400 001
 Symbol: SJS Scrip Code: 543387

 ISIN: INE284S01014
 Dear Sir/Madam,

 Subject: Transcripts of Analysts/Investor Meet/ Earnings Call of the Company pertaining to Q1 of FY 2023-24

 Please find enclosed the transcripts of the Analysts/Investor Meet/ Earnings Call of Q1 FY 2023-24 held on July 27, 2023.

### You are requested to kindly take the same on record.
 Thanking you. Yours faithfully, For S.J.S. Enterprises Limited

Digitally signed by

##### THABRAZ THABRAZ HUSHAIN HUSHAIN WAJID AHMED WAJID AHMED Date: 2023.08.03

11:16:51 +05'30'

### _______________________ Thabraz Hushain W.  Company Secretary and Compliance Officer Membership No.: A51119
 Encl: As above

|National S

In [12]:
import pathlib

In [13]:

pathlib.Path("output.md").write_bytes(md_text.encode())


52487
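
`write_bytes` returns the number of bytes written, which is the value echoed above. That count can differ from `len(md_text)` when the text contains non-ASCII characters, such as the en dash in the address block. A small sketch, using a temporary directory and a short stand-in string rather than the notebook's data:

```python
import pathlib
import tempfile

sample_md = "Bandra – Kurla Complex"  # stand-in text containing a non-ASCII en dash

with tempfile.TemporaryDirectory() as tmp:
    out_path = pathlib.Path(tmp) / "output.md"
    # Encode explicitly so the file encoding does not depend on platform defaults
    n_bytes = out_path.write_bytes(sample_md.encode("utf-8"))
    round_trip = out_path.read_text(encoding="utf-8")

# Character count and byte count differ because the en dash needs 3 bytes in UTF-8
print(len(sample_md), n_bytes)
```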

Extracting specific pages


In [10]:
# Page numbers are 0-based: this selects the second and third pages
md_text_pages = pymupdf4llm.to_markdown("/content/SJS Transcript Call.pdf", pages=[1, 2])


Processing /content/SJS Transcript Call.pdf...
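
Because `pages` expects 0-based page numbers, numbers read off a PDF viewer need to be shifted by one before they are passed in. A small helper can do the conversion and catch out-of-range requests; the name `to_page_indices` is hypothetical, and the page count of 19 is taken from this document's metadata.

```python
def to_page_indices(page_numbers, page_count):
    """Convert 1-based page numbers to sorted, de-duplicated 0-based indices."""
    indices = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        indices.append(number - 1)
    return sorted(set(indices))

pages = to_page_indices([2, 3], page_count=19)
print(pages)
# The result could then be passed on, e.g. pymupdf4llm.to_markdown(path, pages=pages)
```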


Extracting documents for LlamaIndex

In [14]:
!pip install llama_index



In [15]:
llama_reader = pymupdf4llm.LlamaMarkdownReader()

Successfully imported LlamaIndex


In [16]:

llama_docs = llama_reader.load_data("/content/SJS Transcript Call.pdf")

Processing /content/SJS Transcript Call.pdf...


In [27]:

print(f"LlamaIndex Compatible Data: {len(llama_docs)}")

LlamaIndex Compatible Data: 19


In [28]:
llama_docs[0].text[:500]

'#### August 03, 2023\n\n To,\n\n National Stock Exchange of India Limited BSE Limited Exchange Plaza, 5[th] Floor, Corporate Relationship Department, Plot No. C/1, G Block, 2[nd] Floor, New Trading Wing, Bandra – Kurla Complex, Rotunda Building, P.J. Towers, Bandra (E), Mumbai -400 051 Dalal Street, Mumbai – 400 001\n Symbol: SJS Scrip Code: 543387\n\n ISIN: INE284S01014\n Dear Sir/Madam,\n\n Subject: Transcripts of Analysts/Investor Meet/ Earnings Call of the Company pertaining to Q1 of FY 2023-24\n\n Plea'

Chunking data and extracting it with metadata


In [34]:
md_text_chunks = pymupdf4llm.to_markdown(
    doc="/content/SJS Transcript Call.pdf",
    pages=[0, 1, 2],
    page_chunks=True
)

Processing /content/SJS Transcript Call.pdf...


In [30]:

print(md_text_chunks[0])

{'metadata': {'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'iLovePDF', 'creationDate': "D:20230803111651+05'30'", 'modDate': "D:20230803111651+05'30'", 'trapped': '', 'encryption': None, 'file_path': '/content/SJS Transcript Call.pdf', 'page_count': 19, 'page': 1}, 'toc_items': [], 'tables': [{'bbox': (71.8800048828125, 183.59999084472656, 534.239990234375, 278.2400207519531), 'rows': 1, 'columns': 2}], 'images': [{'number': 1, 'bbox': Rect(71.76000213623047, 50.0, 540.239990234375, 119.2800064086914), 'transform': (468.4800109863281, 0.0, -0.0, 83.52000427246094, 71.76000213623047, 35.76000213623047), 'width': 1952, 'height': 348, 'colorspace': 3, 'cs-name': 'DeviceRGB', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 44593}], 'graphics': [], 'text': "#### August 03, 2023\n\n To,\n\n National Stock Exchange of India Limited BSE Limited Exchange Plaza, 5[th] Floor, Corporate Relationship Department, Plot No. C/1, G Block, 2[nd] Flo
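
As the output above shows, each page chunk is a dictionary with keys such as `metadata`, `tables`, `images`, and `text`. A quick overview per page can be built from those keys. In the sketch below, the `summarize_chunks` helper is hypothetical and the sample chunk is a hand-abbreviated version of the output above, not real extractor output.

```python
def summarize_chunks(chunks):
    """Return one summary row per page chunk."""
    rows = []
    for chunk in chunks:
        rows.append({
            "page": chunk["metadata"]["page"],
            "tables": len(chunk.get("tables", [])),
            "images": len(chunk.get("images", [])),
            "chars": len(chunk.get("text", "")),
        })
    return rows

# Abbreviated stand-in for md_text_chunks
sample_chunks = [{
    "metadata": {"file_path": "/content/SJS Transcript Call.pdf", "page_count": 19, "page": 1},
    "tables": [{"rows": 1, "columns": 2}],
    "images": [{"number": 1, "width": 1952, "height": 348}],
    "text": "#### August 03, 2023",
}]
rows = summarize_chunks(sample_chunks)
for row in rows:
    print(row)
```

In the notebook itself, `summarize_chunks(md_text_chunks)` would give one row for each of the three requested pages.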

Detailed word-by-word extraction
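
One possible approach to word-level extraction is sketched here. It assumes a PyMuPDF4LLM version whose `to_markdown` accepts `extract_words=True` and then adds a `words` list to each page chunk, with entries shaped like PyMuPDF's `page.get_text("words")` tuples: `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. Both that option and the helper names below should be checked against the installed version; the sample data is hand-made with invented coordinates.

```python
def words_with_positions(chunks):
    """Flatten page chunks into (page, word, bbox) records."""
    records = []
    for chunk in chunks:
        page = chunk["metadata"]["page"]
        for x0, y0, x1, y1, word, *_ in chunk.get("words", []):
            records.append((page, word, (x0, y0, x1, y1)))
    return records

def extract_words_from_pdf(pdf_path):
    """Assumes a pymupdf4llm version whose to_markdown accepts extract_words."""
    import pymupdf4llm  # imported lazily so the helper above works without it
    chunks = pymupdf4llm.to_markdown(pdf_path, page_chunks=True, extract_words=True)
    return words_with_positions(chunks)

# Hand-made sample in the assumed format (coordinates are invented)
sample = [{
    "metadata": {"page": 1},
    "words": [
        (72.0, 50.0, 110.0, 62.0, "August", 0, 0, 0),
        (114.0, 50.0, 128.0, 62.0, "03,", 0, 0, 1),
    ],
}]
records = words_with_positions(sample)
print(records)
```

Keeping the bounding box next to each word makes it possible to locate a quoted figure on the page later, for example to highlight it.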


Extracting tables neatly


In [38]:
md_text_tables = pymupdf4llm.to_markdown(
    doc="/content/SJS Transcript Call.pdf"
)

Processing /content/SJS Transcript Call.pdf...
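
`to_markdown` renders detected tables as Markdown pipe tables inside the returned string; the printed output earlier in this notebook shows one starting with `|National S`. A sketch of pulling such tables out of the text with the standard library follows. The `find_markdown_tables` helper and the sample string are illustrative and not part of PyMuPDF4LLM.

```python
def find_markdown_tables(md_text):
    """Return each pipe table as a list of rows, each row a list of cell strings."""
    tables, current = [], []
    for line in md_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # Skip the header separator row such as |---|---|
            if not all(c and set(c) <= set("-: ") for c in cells):
                current.append(cells)
        elif current:
            # A non-table line ends the table collected so far
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables

# Invented sample; real input would be the string returned by to_markdown
sample_md = "Intro text\n\n|Exchange|Code|\n|---|---|\n|NSE|SJS|\n|BSE|543387|\n\nMore text"
tables = find_markdown_tables(sample_md)
for table in tables:
    print(table)
```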


In [33]:
md_text_tables

"#### August 03, 2023\n\n To,\n\n National Stock Exchange of India Limited BSE Limited Exchange Plaza, 5[th] Floor, Corporate Relationship Department, Plot No. C/1, G Block, 2[nd] Floor, New Trading Wing, Bandra – Kurla Complex, Rotunda Building, P.J. Towers, Bandra (E), Mumbai -400 051 Dalal Street, Mumbai – 400 001\n Symbol: SJS Scrip Code: 543387\n\n ISIN: INE284S01014\n Dear Sir/Madam,\n\n Subject: Transcripts of Analysts/Investor Meet/ Earnings Call of the Company pertaining to Q1 of FY 2023-24\n\n Please find enclosed the transcripts of the Analysts/Investor Meet/ Earnings Call of Q1 FY 2023-24 held on July 27, 2023.\n\n### You are requested to kindly take the same on record.\n Thanking you. Yours faithfully, For S.J.S. Enterprises Limited\n\nDigitally signed by\n\n##### THABRAZ THABRAZ HUSHAIN HUSHAIN WAJID AHMED WAJID AHMED Date: 2023.08.03\n\n11:16:51 +05'30'\n\n### _______________________ Thabraz Hushain W.  Company Secretary and Compliance Officer Membership No.: A51119\n En