In [5]:
import re
from transformers import pipeline
from pypdf import PdfReader
from google.colab import files

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def extract_text(pdf_path):
    with open(pdf_path, "rb") as file:
        reader = PdfReader(file)
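        # Join pages with a newline so words at page boundaries do not fuse,
        # and fall back to "" for pages that yield no text.
        return "\n".join(page.extract_text() or "" for page in reader.pages)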


def clean_text(text):
    text = re.sub(r'\n+', '\n', text)
    text = re.sub(r'\bModerator\b.*?(?=\n)', '', text)
    text = re.sub(r'\bThank you\b.*?(?=\n)', '', text)
    return text


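def find_sections(text):
    # Non-capturing groups (?:...) make re.findall return the full match.
    # With a capturing group it returns only the keyword itself.
    # Because of re.DOTALL, each match runs from the first keyword hit
    # to the end of the transcript.
    patterns = {
        "future_growth": r"(?:strategic outlook|future growth|next level of growth).*",
        "business_changes": r"(?:acquisition|inorganic growth|portfolio concentration).*",
        "financial_highlights": r"(?:financial highlights|margins|cash flow|debt).*",
        "material_effects": r"(?:next year's earnings|revenue growth|PAT growth|market position).*"
    }
    return {key: " ".join(re.findall(pattern, text, re.IGNORECASE | re.DOTALL)) for key, pattern in patterns.items()}


def summarize_sections(sections):
    # truncation=True clips inputs that exceed the model's maximum input length.
    return {key: summarizer(content, max_length=100, min_length=50, do_sample=False, truncation=True)[0]['summary_text']
            for key, content in sections.items() if content}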


def main():
    uploaded = files.upload()
    pdf_path = next(iter(uploaded))  # name of the first uploaded file
    text = extract_text(pdf_path)
    cleaned_text = clean_text(text)
    sections = find_sections(cleaned_text)
    summary = summarize_sections(sections)

    for key, content in summary.items():
        print(f"\n{key.upper()}:\n{content}\n")


main()




Saving SJS_Transcript_Call.pdf to SJS_Transcript_Call.pdf


Your max_length is set to 100, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)
Your max_length is set to 100, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)
Your max_length is set to 100, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)
Your max_length is set to 100, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)
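
Note on the warnings above: an input_length of 3 or 4 tokens is what one would expect if each section held only a matched keyword rather than transcript text. That is consistent with how re.findall behaves when a pattern contains a single capturing group: it returns the group, not the full match. The sketch below illustrates this on a made-up two-line sample (the sample text and variable names are assumptions, not taken from the transcript). It omits re.DOTALL, so ".*" stops at the end of each line.

```python
import re

# Made-up sample text, for illustration only.
sample = "We see future growth in exports.\nDebt fell this year."

# One capturing group: findall returns only the group, i.e. the keyword.
with_group = re.findall(r"(future growth|debt).*", sample, re.IGNORECASE)

# Non-capturing group: findall returns the full match,
# from the keyword to the end of the line.
without_group = re.findall(r"(?:future growth|debt).*", sample, re.IGNORECASE)

print(with_group)
print(without_group)
```

With only a few keyword tokens as input, a summarization model has almost nothing to condense, which may explain why the summaries below read like unrelated news copy.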



FUTURE_GROWTH:
The U.S. strategic outlook was released on Tuesday. The report was based on a survey of more than 2,000 people. It was the first report of its kind. The U.N. released a statement on the report on Wednesday.


BUSINESS_CHANGES:
CNN.com will feature iReporter photos in a weekly Travel Snapshots gallery. Please submit your best shots of the U.S. for next week. Visit CNN.com/Travel next Wednesday for a new gallery of snapshots from around the world.


FINANCIAL_HIGHLIGHTS:
U.S. financial highlights from the past week include the U.N. General Assembly's vote on climate change. The vote was followed by the European Parliament's vote against the European Union's plan to introduce a carbon trading scheme. The European Parliament voted in favor of a plan that would allow companies to sell shares in Europe.


MATERIAL_EFFECTS:
U.S. revenue growth has been slowing in recent years, according to a new report. The report says the decline is due to a decline in the number of companies