<a href="https://colab.research.google.com/github/chendu029/Text-Summarization/blob/main/python_coding_assignment_text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import fitz  # PyMuPDF for PDF parsing
from transformers import pipeline
import re
import torch

In [None]:

# Check if GPU is available
device = 0 if torch.cuda.is_available() else -1  # Use GPU if available, else use CPU

def extract_text_from_pdf(pdf_path):
    """
    Extract text from each page of a PDF document.

    :param pdf_path: Path to the PDF file
    :return: Extracted text from the PDF
    """
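    # Open the document in a context manager so the file handle is always released
    with fitz.open(pdf_path) as doc:
        # Collect the plain text of every page
        text = "".join(page.get_text("text") for page in doc)

    if not text.strip():  # Check if the text is empty or just whitespace
        raise ValueError("No text extracted from the PDF. Please check the document.")

    return text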

In [None]:
def summarize_text_with_huggingface(text):
    """
    Use Hugging Face's pre-trained summarization model to extract key insights.

    :param text: Text extracted from PDF
    :return: Summary text formatted with one sentence per line
    """
    # Load a summarization pipeline from Hugging Face
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)

    # BART accepts at most 1024 tokens per input. Word counts are only a rough
    # proxy for token counts, so chunk on word boundaries with a conservative budget.
    max_words_per_chunk = 500
    words = text.split()

    # Split into word-aligned chunks (short texts yield a single chunk)
    chunks = [
        " ".join(words[i:i + max_words_per_chunk])
        for i in range(0, len(words), max_words_per_chunk)
    ]

    # Process each chunk separately
    summaries = []
    for chunk in chunks:
        # truncation=True guards against a chunk that exceeds the model's input limit
        summary = summarizer(chunk, max_length=500, min_length=200, do_sample=False, truncation=True)
        summaries.append(summary[0]['summary_text'])

    # Combine the individual summaries into one
    full_summary = " ".join(summaries)

    # Format the summary for readability (split by sentences)
    formatted_summary = format_summary_readable(full_summary)

    return formatted_summary
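
The chunking step above works on raw text, while the 1024 limit applies to model tokens. A minimal sketch of token-based chunking follows. The helper name `chunk_by_tokens` and the 900-token default are assumptions for illustration, not part of the original notebook; the `tokenizer` argument is assumed to expose `encode(text, add_special_tokens=False)` and `decode(token_ids)`, as Hugging Face tokenizers do.

```python
def chunk_by_tokens(text, tokenizer, max_tokens=900):
    """Split text into pieces of roughly max_tokens tokens each.

    Hypothetical helper: `tokenizer` is assumed to provide
    encode(text, add_special_tokens=False) and decode(token_ids).
    The default of 900 leaves headroom below BART's 1024-token limit,
    because decoding and re-encoding a slice can shift the count slightly.
    """
    # Tokenize the whole text once, without special start/end tokens
    token_ids = tokenizer.encode(text, add_special_tokens=False)

    # Decode consecutive slices of token ids back into text chunks
    return [
        tokenizer.decode(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
```

One way to use it would be to load the matching tokenizer with `AutoTokenizer.from_pretrained("facebook/bart-large-cnn")` from `transformers` and pass it in as `tokenizer`. Encoding a very long text this way may print a sequence-length warning, which is expected because the slices, not the full text, are what reach the model.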

In [None]:


def format_summary_readable(summary):
    """
    Format the summary to make it more readable by splitting it into individual sentences.

    :param summary: The summary text to format
    :return: Formatted summary with each sentence on a new line
    """
    # Split summary into sentences using a regular expression (you can adjust the regex to match your needs)
    sentences = re.split(r'(?<=\.)\s+', summary.strip())  # Split at periods followed by spaces

    # Join sentences with line breaks
    readable_summary = "\n".join(sentences)

    return readable_summary
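
Splitting at every period also breaks after abbreviations and initials such as "Mr." or "S.J.S.". The sketch below is one possible heuristic that re-joins those pieces. The names `split_sentences` and `ABBREVIATIONS`, and the contents of the abbreviation list, are assumptions for illustration; a sentence that genuinely ends in a single initial would be merged with the next one.

```python
import re

# Hypothetical, deliberately short abbreviation list; extend it for your documents.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "ltd.", "no."}


def _ends_with_abbreviation(sentence):
    """Return True if the last word is a known abbreviation or dotted initials."""
    words = sentence.split()
    if not words:
        return False
    last_word = words[-1].lower()
    # Matches initials such as "w." or "s.j.s."
    is_initials = re.fullmatch(r'(?:[a-z]\.)+', last_word) is not None
    return last_word in ABBREVIATIONS or is_initials


def split_sentences(summary):
    """Split on sentence-ending punctuation, then re-join pieces cut at abbreviations."""
    pieces = re.split(r'(?<=[.!?])\s+', summary.strip())
    sentences = []
    for piece in pieces:
        if sentences and _ends_with_abbreviation(sentences[-1]):
            # The previous piece ended in an abbreviation, so this is the same sentence
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences
```

Joining the result with `"\n".join(...)` gives the same one-sentence-per-line layout as `format_summary_readable`.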

In [None]:

def process_pdf_for_investor(pdf_path):
    """
    Extract text from a PDF and summarize key insights for an investor.

    :param pdf_path: Path to the PDF document
    :return: Formatted summary text, or an error message string if no text could be extracted
    """
    try:
        # Extract text from the PDF
        text = extract_text_from_pdf(pdf_path)

        # Use Hugging Face summarizer to extract relevant information for an investor
        summary = summarize_text_with_huggingface(text)

        return summary

    except ValueError as e:
        # extract_text_from_pdf raises ValueError when the PDF yields no text
        return f"Error: {e}"


In [None]:

pdf_path = '/content/SJS Transcript Call.pdf'  # Replace with the path to your PDF file
summary = process_pdf_for_investor(pdf_path)


Your max_length is set to 500, but your input_length is only 389. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=194)
Your max_length is set to 500, but your input_length is only 361. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=180)
Your max_length is set to 500, but your input_length is only 283. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=141)
Your max_length is set to 500, but your input_length is only 261. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1
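
The warnings above appear because `max_length=500` is larger than each chunk's token count. A minimal sketch of scaling both bounds to the input follows. The helper name `summary_length_bounds` and its defaults are assumptions for illustration; the halving rule mirrors the values the warnings themselves suggest (for example 194 for an input of 389 tokens).

```python
def summary_length_bounds(input_tokens, max_cap=500, min_floor=200):
    """Return (min_length, max_length) scaled to the size of the input.

    Hypothetical helper: max_length is half the input length, capped at
    max_cap; min_length is at most half of max_length, so it never
    exceeds the upper bound.
    """
    max_length = min(max_cap, max(input_tokens // 2, 1))
    min_length = min(min_floor, max_length // 2)
    return min_length, max_length
```

Inside the chunk loop, the token count could be taken from the pipeline's own tokenizer, for instance `len(summarizer.tokenizer.encode(chunk))`, and the returned pair passed as `min_length` and `max_length` to `summarizer(...)`.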

In [None]:
print(summary)

Transcripts of Analysts/Investor Meet/ Earnings Call of the Company pertaining to Q1 of FY 2023-24 held on July 27, 2023.
S.J.S.
Enterprises Limited.
The call was held at 11:16:51 +05'30' at the National Stock Exchange of India (NSE) in Mumbai.
The transcript of the call has been sent to Mr.
Thabraz Hushain W.
Ahmed, Company Secretary and Compliance Officer, NSE.
For confidential support call the Samaritans on 08457 90 90 90 or visit a local Samaritans branch, see www.samaritans.org for details.
In the U.S., call the National Suicide Prevention Line on 1-800-273-8255 or visit http://www.suicidepreventionlifeline.org/.
For confidential.
support in the UK, call the 08457 909090 or visit the Samaritans’ website: http: www.
Samaritans.co.uk.
The conference call is being hosted by JM Financial.
All participant lines will be in the listen-only mode.
Should you need assistance, please signal an operator by pressing ‘*’ then ‘0’ on a touchtone phone.
The conference is being recorded.
The call 

In [None]:
from google.colab import drive
drive.mount('/content/drive')