# **Problem Statement**

We need an algorithm that extracts key information from the PDF transcript of a company's earnings call to support an investor's evaluation. The extracted information should focus on:

1. **Future Growth Prospects**
2. **Key Changes in the Business**
3. **Key Triggers**
4. **Material Information Impacting Future Earnings and Growth**

The goal is to process the PDF, filter relevant content, and summarize key insights into investor-relevant categories.


# **Algorithm Approach**

1. **Install and Import Libraries**: Set up the necessary Python libraries.
2. **Define Configuration and Upload PDF**: Specify file paths and allow for PDF upload.
3. **Extract Text from PDF**: Use `pdfminer.six` to read the PDF content.
4. **Text Processing and Summarization**: Use `spaCy` for NLP processing and a Hugging Face summarization model.
5. **Dynamic Topic Detection**: Use zero-shot classification with batch processing to detect relevant topics without hard-coding keywords.
6. **Organize and Save Results**: Summarize classified sentences by topic, save the output, and present it in a readable format.
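
As a rough sketch of how these stages hand data to one another, the snippet below chains four stand-in callables. `run_pipeline` and the lambda stages are hypothetical placeholders for illustration only; the real stages are defined in the steps that follow.

```python
from typing import Callable, Dict, List


def run_pipeline(
    pdf_path: str,
    extract: Callable[[str], str],
    split: Callable[[str], List[str]],
    classify: Callable[[List[str]], Dict[str, List[str]]],
    summarize: Callable[[Dict[str, List[str]]], Dict[str, str]],
) -> Dict[str, str]:
    """Chain the stages: path -> raw text -> sentences -> topic buckets -> summaries."""
    text = extract(pdf_path)
    sentences = split(text)
    buckets = classify(sentences)
    return summarize(buckets)


# Toy stand-ins (not the real models) that only show the shape of the data
demo = run_pipeline(
    "call.pdf",
    extract=lambda path: "Revenue grew. We opened a plant.",
    split=lambda text: [s.strip() + "." for s in text.split(".") if s.strip()],
    classify=lambda sents: {"growth prospects": sents},
    summarize=lambda buckets: {k: " ".join(v) for k, v in buckets.items()},
)
print(demo)
```

Each stage takes the previous stage's output, so any one of them can be swapped without touching the others.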


# **Step 1: Install Dependencies**

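This step installs the required libraries:
- `pdfminer.six` for extracting text from PDFs.
- `transformers` for using Hugging Face's summarization and classification models.

`spaCy` and its `en_core_web_sm` model are used in later steps but are not installed here; the notebook assumes the runtime already provides them.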


In [16]:
!pip install pdfminer.six transformers




# **Step 2: Define Configuration and Upload the PDF**

Upload the PDF through Colab's file picker, record its path automatically, and define the path for the output JSON file.


In [17]:
from google.colab import files

# Upload the PDF file
uploaded = files.upload()
PDF_PATH = next(iter(uploaded))  # File name of the first uploaded file (a key of the returned dict)
OUTPUT_PATH = "/content/output_summary.json"  # Path to save the summary


Saving SJS Transcript Call.pdf to SJS Transcript Call (2).pdf
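
The upload cell depends on `google.colab`, which is only importable inside Colab. A minimal sketch of a fallback for other environments is shown below; `resolve_pdf_path` and the `transcript.pdf` default are hypothetical names, not part of the original notebook.

```python
import os


def resolve_pdf_path(fallback_path: str) -> str:
    """Return the uploaded file name in Colab, else a local fallback path."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return fallback_path
    uploaded = files.upload()  # dict mapping file name -> file contents
    if not uploaded:
        raise ValueError("No file was uploaded.")
    return next(iter(uploaded))


# Assumed local defaults; adjust to wherever the transcript actually lives
PDF_PATH = resolve_pdf_path("transcript.pdf")
OUTPUT_PATH = os.path.join(os.getcwd(), "output_summary.json")
```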


# **Step 3: Import Libraries**

Import the necessary libraries for PDF text extraction, NLP processing, and text summarization.


In [18]:
import re
import json
from pdfminer.high_level import extract_text
from transformers import pipeline
import spacy

# Load spaCy model for NLP processing
nlp = spacy.load("en_core_web_sm")


# **Step 4: Define Functions for PDF Extraction and Sentence Splitting**

Define helper functions for:
- Extracting text from the PDF.
- Cleaning the text and splitting it into sentences.


In [19]:
# Extract text from PDF
def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text

# Clean and tokenize text into sentences
def clean_and_tokenize(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse whitespace runs (including line breaks) into single spaces
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    return sentences
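
The helpers above clean and split the text but keep every sentence, including call logistics such as operator instructions. One possible pre-filter is sketched below. `filter_sentences`, `BOILERPLATE_PATTERNS`, and the `min_words` threshold are assumptions made for illustration and would need tuning for each transcript.

```python
import re

# Hypothetical patterns for call logistics; these are illustrative, not exhaustive
BOILERPLATE_PATTERNS = [
    re.compile(r"listen-only mode", re.IGNORECASE),
    re.compile(r"signal an operator", re.IGNORECASE),
    re.compile(r"duration of this call", re.IGNORECASE),
]


def filter_sentences(sentences, min_words=6):
    """Keep sentences that are long enough and match no boilerplate pattern."""
    kept = []
    for sentence in sentences:
        if len(sentence.split()) < min_words:
            continue  # too short to carry investor-relevant content
        if any(p.search(sentence) for p in BOILERPLATE_PATTERNS):
            continue  # call logistics, not business content
        kept.append(sentence)
    return kept


sample = [
    "All participant lines will be in the listen-only mode.",
    "Thank you, Ronak.",
    "SJS Q1 witnessed a strong revenue growth of 21% YoY.",
]
print(filter_sentences(sample))
```

If used, a filter like this would run on the output of `clean_and_tokenize` before classification, so fewer sentences reach the slower model stages.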


# **Step 5: Dynamic Topic Detection Using Zero-Shot Classification with Batch Processing**

Use zero-shot classification to assign each sentence to one of a list of candidate topics. The topics themselves are a fixed list, but no hard-coded keyword matching is involved.
Sentences are passed to the classifier in groups of `batch_size`, with a progress message printed after each group.


In [20]:
from transformers import pipeline

# Initialize zero-shot classification pipeline with a smaller model
classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-6")

# List of topics to dynamically classify sentences
dynamic_categories = ["growth prospects", "business changes", "triggers", "earnings impact", "financial performance"]

# Function to classify sentences by topics with batch processing
def classify_sentences(sentences, batch_size=8):
    classified_sentences = {topic: [] for topic in dynamic_categories}

    # Ceiling division, so a sentence count that is an exact multiple of
    # batch_size does not report one batch too many
    total_batches = (len(sentences) + batch_size - 1) // batch_size

    # Process sentences in batches
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        classifications = classifier(batch, dynamic_categories)

        # Print progress to monitor
        print(f"Processing batch {i // batch_size + 1} of {total_batches}")

        for j, sentence in enumerate(batch):
            # Labels come back sorted by descending score, so index 0 is the best topic
            best_topic = classifications[j]["labels"][0]
            classified_sentences[best_topic].append(sentence)

    return classified_sentences
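
`classify_sentences` assigns every sentence to its top-scoring topic, even when that score is low. A minimal sketch of a confidence threshold follows, assuming the classifier output has the shape used above (one dict per sentence with `sequence`, `labels`, and `scores`, sorted by descending score). `group_by_topic`, `min_score`, and the mock scores are made up for illustration.

```python
def group_by_topic(results, topics, min_score=0.5):
    """Group sequences under their top label, skipping low-confidence results."""
    grouped = {topic: [] for topic in topics}
    for result in results:
        top_label = result["labels"][0]  # labels are sorted best-first
        top_score = result["scores"][0]
        if top_score >= min_score:
            grouped[top_label].append(result["sequence"])
    return grouped


# Mock classifier output with invented scores (no model is called here)
mock_results = [
    {
        "sequence": "Revenue grew 21% YoY.",
        "labels": ["financial performance", "triggers"],
        "scores": [0.82, 0.18],
    },
    {
        "sequence": "Thank you, operator.",
        "labels": ["triggers", "financial performance"],
        "scores": [0.52, 0.48],
    },
]
topics = ["financial performance", "triggers"]
print(group_by_topic(mock_results, topics, min_score=0.6))
```

With a threshold, sentences that fit no topic well are dropped instead of being forced into the nearest category.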



# **Step 6: Summarize Each Topic Using Hugging Face Summarization Model**

Summarize each set of sentences by topic using a pre-trained summarization model.


In [23]:
from transformers import pipeline

# Load summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize sentences for each topic
def summarize_classified_text(classified_sentences):
    summarized_info = {}
    for category, sentences in classified_sentences.items():
        joined_text = " ".join(sentences)
        if not joined_text.strip():
            summarized_info[category] = "No relevant information found."
            continue

        # Keep only the first 1024 characters of the joined text. This is a
        # character limit, not a token limit; truncation=True below separately
        # caps the input at the model's maximum token length.
        max_input_length = 1024
        joined_text = joined_text[:max_input_length]

        # Summarize with BART using deterministic decoding (no sampling)
        summary = summarizer(
            joined_text,
            max_length=100,
            min_length=30,
            do_sample=False,
            truncation=True  # Truncate inputs that exceed the model's token limit
        )
        summarized_info[category] = summary[0]["summary_text"]
    return summarized_info
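
Because the function above summarizes only the first 1024 characters of each category, most of a long transcript never reaches the summarizer. One possible alternative is to pack whole sentences into chunks and summarize each chunk in turn. The sketch below covers only the chunking; `chunk_sentences` and its character budget are assumptions, and the summarizer call is left as a comment.

```python
def chunk_sentences(sentences, max_chars=1024):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        extra = len(sentence) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [sentence], len(sentence)
        else:
            current.append(sentence)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_sentences(["aaaa", "bbbb", "cc"], max_chars=9))
# Each chunk could then be passed to `summarizer(...)` and the partial
# summaries joined, at the cost of one model call per chunk.
```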



# **Step 7: Process, Summarize, and Save Results**

This step:
1. Extracts text from the PDF.
2. Dynamically classifies sentences into topics.
3. Summarizes each category and saves the results to a JSON file.


In [25]:
# Main processing function
def main():
    # Extract and preprocess text
    print("Extracting text from PDF...")
    pdf_text = extract_text_from_pdf(PDF_PATH)
    sentences = clean_and_tokenize(pdf_text)

    # Dynamically classify sentences based on topics
    print("Classifying sentences by topic...")
    classified_sentences = classify_sentences(sentences)

    # Summarize each classified category and save results
    print("Summarizing each category...")
    summarized_info = summarize_classified_text(classified_sentences)

    # Save output to JSON file
    with open(OUTPUT_PATH, "w") as outfile:
        json.dump(summarized_info, outfile, indent=4)

    print(f"Summary saved to {OUTPUT_PATH}")

# Run the main function
main()


Extracting text from PDF...
Classifying sentences by topic...
Processing batch 1 of 65
Processing batch 2 of 65
Processing batch 3 of 65
...

# **Step 8: Display Summarized Results**

Display the summarized output from the JSON file in a readable format.


In [26]:
# Display the summary from the output JSON file
with open(OUTPUT_PATH, "r") as infile:
    summary_data = json.load(infile)

# Display summary in a readable format
for category, summary in summary_data.items():
    print(f"**{category.title()}**:\n{summary}\n")


**Growth Prospects**:
Devanshi Dhruva: Thank you, Ronak. Sanjay will take you all through the industry view, our business performance and also give a strategic outlook for the future growth of the company at the end.

**Business Changes**:
Mr. K A Joseph – Managing Director and Co-Founder; Mr. Sanjay Thapar - CEO and Executive Director. Mr. Mahendra Naredi - Chief Financial Officer and Ms. Devanshi Dhruva – Head, Investor Relations.

**Triggers**:
All participant lines will be in the listen-only mode. Should you need assistance during the conference call, please signal an operator by pressing ‘*’ then ‘0’ The duration of this call will be around 60 minutes.

**Earnings Impact**:
SJS Enterprises Limited Q1 FY2024 Earnings Conference Call July 27, 2023. Scrip Code: 543387. Transcripts of Analysts/Investor Meet/ Earnings Call of the Company pertaining to Q1 of FY 2023-24.

**Financial Performance**:
SJS Q1 witnessed a strong revenue growth of 21% YoY and a robust margin performance with E