# Working with Research Papers (Chest X-ray Analysis)

This notebook demonstrates how to process a research paper PDF, summarize its content, and ask specific questions to extract key insights. We focus on a hypothetical paper about the ChestMNIST dataset, "Advances in Chest X-ray Classification Using ChestMNIST." The notebook uses Optical Character Recognition (OCR) to read PDFs, defines prompt templates that ask Large Language Models (LLMs) for summaries and answers, simulates the LLM responses with regular-expression functions, and includes exercises that apply these techniques to new papers.

**Author**: Mohammad Rezapourian
<br>
**Date**: May 13, 2025
<br>
**License**: Apache-2.0

## Table of Contents
1. [Initial Setup](#initial-setup)
   - [Setup for Google Colab](#setup-for-google-colab)
   - [Setup for Offline Use](#setup-for-offline-use)
2. [PDF Reading via OCR](#pdf-reading-via-ocr)
   - [Loading and Processing the Research Paper](#loading-and-processing-the-research-paper)
3. [Paper Summarization](#paper-summarization)
   - [Prompting for Summarization](#prompting-for-summarization)
   - [Exercise: Summarize a New Paper](#exercise-summarize-a-new-paper)
4. [Asking Specific Questions](#asking-specific-questions)
   - [Prompting for Question-Answering](#prompting-for-question-answering)
   - [Storing Answers in a Table](#storing-answers-in-a-table)
   - [Exercise: Ask Questions on a New Paper](#exercise-ask-questions-on-a-new-paper)
5. [Conclusion](#conclusion)
   - [References](#references)

## Initial Setup

Set up the environment for running the notebook in Google Colab or locally. We’ll install libraries for PDF processing, OCR, and data analysis.

### Setup for Google Colab
<u>Execute these code blocks only in Google Colab!</u>

In [None]:
!apt-get install -y -q poppler-utils tesseract-ocr
!pip install -q pdf2image pytesseract pandas numpy matplotlib seaborn

In [None]:
import os
import sys
from google.colab import output
output.enable_custom_widget_manager()
%matplotlib inline
import pdf2image
import pytesseract
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import re

### Setup for Offline Use

Ensure `poppler-utils` and `tesseract-ocr` are installed. For example, on Ubuntu:
```bash
sudo apt-get install poppler-utils tesseract-ocr
```
On macOS:
```bash
brew install poppler tesseract
```
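
Before running the OCR cells locally, it can help to confirm that both command-line tools are visible to Python. The sketch below assumes the usual binary names: `tesseract` (called by `pytesseract`) and `pdftoppm` (the Poppler tool that `pdf2image` calls). The helper name `check_ocr_tools` is hypothetical and not part of either library.

```python
import shutil

def check_ocr_tools(tools=("tesseract", "pdftoppm")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

# Report which tools were found; a missing tool usually means it is not
# installed or its directory is not on PATH.
for tool, path in check_ocr_tools().items():
    status = path if path else "NOT FOUND (install it or add it to PATH)"
    print(f"{tool}: {status}")
```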

In [None]:
%matplotlib inline
import pdf2image
import pytesseract
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import re

## PDF Reading via OCR

We’ll read the research paper PDF using OCR to extract text, assuming it contains details about chest X-ray classification using the ChestMNIST dataset.

### Loading and Processing the Research Paper

We’ll convert the PDF to images and apply OCR to extract text. For demonstration, assume a sample PDF `paper1.pdf` in `./Data/Papers/`. Users should replace it with their own research paper PDF.

In [None]:
# Function to read PDF and extract text
def read_pdf_ocr(pdf_path):
    try:
        images = pdf2image.convert_from_path(pdf_path)
        text = ''
        for img in images:
            text += pytesseract.image_to_string(img) + '\n'
        return text
    except Exception as e:
        print(f'Error reading {pdf_path}: {e}')
        return None

# Example PDF
pdf_path = './Data/Papers/paper1.pdf'
text = read_pdf_ocr(pdf_path)
if text:
    print('Extracted text (first 500 characters):')
    print(text[:500])
else:
    print('Failed to extract text. Please check the PDF path and OCR setup.')

**Note**: Replace `paper1.pdf` with your research paper PDF. Ensure the `./Data/Papers/` directory exists. If no PDF is available, use the sample text below for testing:
```
Title: Advances in Chest X-ray Classification Using ChestMNIST
Authors: Jane Smith, John Lee
Abstract: This paper presents a novel deep learning model for multi-label classification of chest X-rays using the ChestMNIST dataset. The dataset includes 112,120 images with 14 disease labels. Our convolutional neural network (CNN) achieves an accuracy of 92% on the test set, outperforming baseline models.
Introduction: Chest X-rays are critical for diagnosing conditions like pneumonia and effusion. The ChestMNIST dataset provides a standardized benchmark.
Methods: We used a CNN with ResNet-50 architecture, trained on 78,468 training images.
Results: The model achieved high precision for pneumonia (0.95) and effusion (0.93).
Conclusion: Our approach demonstrates the potential of deep learning for automated diagnosis.
```
Save this as `sample_paper.txt`. If no PDF is available, read that file into `text` instead of calling `read_pdf_ocr`.
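
One possible way to wire in the sample text is a small fallback loader. This is a minimal sketch: the helper name `load_paper_text` and the default location of `sample_paper.txt` (the current working directory) are assumptions, not part of the original notebook.

```python
from pathlib import Path

def load_paper_text(pdf_text, sample_path="sample_paper.txt"):
    """Return the OCR text if present, else the sample file's contents, else None."""
    if pdf_text:
        return pdf_text
    sample_file = Path(sample_path)
    if sample_file.is_file():
        return sample_file.read_text(encoding="utf-8")
    return None

# None stands in for a failed read_pdf_ocr() call
demo_text = load_paper_text(None)
print("Text loaded." if demo_text else "No PDF text and no sample file found.")
```

In the notebook, calling `text = load_paper_text(text)` right after the OCR cell would let the later cells run on the sample text when the PDF is missing.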

## Paper Summarization

We’ll prompt an LLM to summarize the research paper, extracting key details like the title, authors, objectives, methods, and findings.

### Prompting for Summarization

**Prompt Template**:
```
Summarize the following research paper. Extract the following details:
- Title
- Authors
- Objectives
- Methods
- Key Findings
Format the output as:
- Title: [value]
- Authors: [value]
- Objectives: [value]
- Methods: [value]
- Key Findings: [value]

Paper text:
[insert extracted text]
```
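
The template above can be filled in programmatically before it is sent to an LLM. The following is a sketch only: `build_summary_prompt`, `SUMMARY_FIELDS`, and the `max_chars` cutoff (an arbitrary guard against very long OCR output) are hypothetical names and values, and no particular LLM API is assumed.

```python
SUMMARY_FIELDS = ["Title", "Authors", "Objectives", "Methods", "Key Findings"]

def build_summary_prompt(paper_text, fields=SUMMARY_FIELDS, max_chars=12000):
    """Fill the summarization template with the field list and the paper text."""
    field_list = "\n".join(f"- {name}" for name in fields)
    format_list = "\n".join(f"- {name}: [value]" for name in fields)
    return (
        "Summarize the following research paper. Extract the following details:\n"
        f"{field_list}\n"
        "Format the output as:\n"
        f"{format_list}\n\n"
        "Paper text:\n"
        f"{paper_text[:max_chars]}"  # truncate overly long text (assumed limit)
    )

prompt = build_summary_prompt(
    "Title: Advances in Chest X-ray Classification Using ChestMNIST"
)
print(prompt)
```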

**Simulated LLM Response** (for `sample_paper.txt`):
```plaintext
- Title: Advances in Chest X-ray Classification Using ChestMNIST
- Authors: Jane Smith, John Lee
- Objectives: Develop a deep learning model for multi-label classification of chest X-rays using the ChestMNIST dataset.
- Methods: Utilized a convolutional neural network with ResNet-50 architecture, trained on 78,468 images.
- Key Findings: Achieved 92% test accuracy, with high precision for pneumonia (0.95) and effusion (0.93).
```
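
If a real LLM returns text in the `- Field: value` format shown above, it can be turned back into a dictionary. This is a minimal sketch under that assumption; `parse_summary_response` is a hypothetical helper, and a real response may deviate from the requested format.

```python
import re

def parse_summary_response(response):
    """Parse lines of the form '- Field: value' into a {field: value} dict."""
    summary = {}
    for line in response.splitlines():
        # Split on the first colon so values may themselves contain colons
        match = re.match(r"\s*-\s*([^:]+):\s*(.*)", line)
        if match:
            summary[match.group(1).strip()] = match.group(2).strip()
    return summary

response = (
    "- Title: Advances in Chest X-ray Classification Using ChestMNIST\n"
    "- Authors: Jane Smith, John Lee\n"
)
for field, value in parse_summary_response(response).items():
    print(f"{field} -> {value}")
```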

In [None]:
# Function to simulate LLM summarization
def summarize_paper(text):
    summary = {'Title': None, 'Authors': None, 'Objectives': None, 'Methods': None, 'Key Findings': None}
    # Title
    title_match = re.search(r'Title:\s*(.+)', text, re.IGNORECASE)
    if title_match:
        summary['Title'] = title_match.group(1).strip()
    # Authors
    authors_match = re.search(r'Authors:\s*(.+)', text, re.IGNORECASE)
    if authors_match:
        summary['Authors'] = authors_match.group(1).strip()
    # Objectives (approximated by the abstract text, up to the Introduction heading)
    obj_match = re.search(r'Abstract:\s*(.+?)(?:Introduction|$)', text, re.IGNORECASE | re.DOTALL)
    if obj_match:
        summary['Objectives'] = obj_match.group(1).strip()
    # Methods
    methods_match = re.search(r'Methods:\s*(.+?)(?:Results|$)', text, re.IGNORECASE | re.DOTALL)
    if methods_match:
        summary['Methods'] = methods_match.group(1).strip()
    # Key Findings
    findings_match = re.search(r'Results:\s*(.+?)(?:Conclusion|$)', text, re.IGNORECASE | re.DOTALL)
    if findings_match:
        summary['Key Findings'] = findings_match.group(1).strip()
    return summary

# Summarize sample paper
if text:
    summary = summarize_paper(text)
    print('Summary:')
    for key, value in summary.items():
        print(f'{key}: {value}')

### Exercise: Summarize a New Paper

**Task**: Apply the summarization function to a new paper (`paper2.pdf`) and verify the output.

**Code**:


In [None]:
# Your code here
new_pdf_path = './Data/Papers/paper2.pdf'
new_text = read_pdf_ocr(new_pdf_path)
if new_text:
    new_summary = summarize_paper(new_text)
    print('Summary of paper2.pdf:')
    for key, value in new_summary.items():
        print(f'{key}: {value}')
else:
    print('Failed to read paper2.pdf.')

**Solution**: The code reads a new PDF, extracts text via OCR, and summarizes it. Verify the output by checking the extracted fields against the paper’s content. If no PDF is available, test with a new sample text.
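
A quick first check before comparing against the paper by hand is to list the fields the summarizer could not fill. The helper below is a sketch with a hypothetical name, `missing_fields`; it only reports empty fields and does not judge whether a filled field is correct.

```python
def missing_fields(summary):
    """Return the names of summary fields that are None or empty."""
    return [field for field, value in summary.items() if not value]

# Illustrative summary with two fields the regexes failed to extract
example_summary = {
    "Title": "Advances in Chest X-ray Classification Using ChestMNIST",
    "Authors": None,
    "Objectives": None,
}
print("Fields to check manually:", missing_fields(example_summary))
```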

## Asking Specific Questions

We’ll ask targeted questions about the paper to extract specific insights, such as dataset details, model performance, and limitations.

### Prompting for Question-Answering

**Questions**:
1. What is the size of the ChestMNIST dataset used in the study?
2. What is the accuracy of the proposed model on the test set?
3. What are the main limitations of the study?

**Prompt Template**:
```
Answer the following questions based on the research paper. Provide concise answers in the format:
- Question: [question]
- Answer: [answer]

Questions:
1. What is the size of the ChestMNIST dataset used in the study?
2. What is the accuracy of the proposed model on the test set?
3. What are the main limitations of the study?

Paper text:
[insert extracted text]
```

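**Simulated LLM Response** (the limitations answer is illustrative only; the sample text contains no limitations statement, so for that text the regex-based function in the next cell returns `None` for this question):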
```plaintext
- Question: What is the size of the ChestMNIST dataset used in the study?
- Answer: The ChestMNIST dataset includes 112,120 images.
- Question: What is the accuracy of the proposed model on the test set?
- Answer: The model achieves an accuracy of 92% on the test set.
- Question: What are the main limitations of the study?
- Answer: The study does not address class imbalance and lacks generalizability to other datasets.
```
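
A response in the `- Question:` / `- Answer:` format above can be parsed into rows that fit the DataFrame used later in this notebook. This is a sketch that assumes each answer line directly follows its question line; `parse_qa_response` is a hypothetical helper.

```python
def parse_qa_response(response):
    """Pair each '- Question:' line with the '- Answer:' line that follows it."""
    pairs = []
    question = None
    for line in response.splitlines():
        line = line.strip().lstrip("-").strip()  # drop the leading list dash
        if line.lower().startswith("question:"):
            question = line[len("question:"):].strip()
        elif line.lower().startswith("answer:") and question is not None:
            answer = line[len("answer:"):].strip()
            pairs.append({"Question": question, "Answer": answer})
            question = None  # wait for the next question
    return pairs

response = (
    "- Question: What is the accuracy of the proposed model on the test set?\n"
    "- Answer: The model achieves an accuracy of 92% on the test set.\n"
)
print(parse_qa_response(response))
```

The returned list of dictionaries can be passed straight to `pd.DataFrame(...)`.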

In [None]:
# Function to simulate question-answering
def answer_questions(text):
    answers = {
        'What is the size of the ChestMNIST dataset used in the study?': None,
        'What is the accuracy of the proposed model on the test set?': None,
        'What are the main limitations of the study?': None
    }
    # Dataset size
    size_match = re.search(r'ChestMNIST dataset.*?(\d+,\d+)', text, re.IGNORECASE)
    if size_match:
        answers['What is the size of the ChestMNIST dataset used in the study?'] = f'The ChestMNIST dataset includes {size_match.group(1)} images.'
    # Accuracy
    acc_match = re.search(r'accuracy of\s*(\d+%)', text, re.IGNORECASE)
    if acc_match:
        answers['What is the accuracy of the proposed model on the test set?'] = f'The model achieves an accuracy of {acc_match.group(1)} on the test set.'
    # Limitations
    lim_match = re.search(r'(limitation|challenge).*?((?:not address|lack).*?)(?:\.|$)', text, re.IGNORECASE | re.DOTALL)
    if lim_match:
        answers['What are the main limitations of the study?'] = lim_match.group(2).strip()
    return answers

# Answer questions
if text:
    answers = answer_questions(text)
    print('Answers to specific questions:')
    for question, answer in answers.items():
        print(f'Question: {question}')
        print(f'Answer: {answer}')

### Storing Answers in a Table

Store the questions and answers in a pandas DataFrame for further analysis.

In [None]:
# Create DataFrame
if text:
    qa_df = pd.DataFrame([
        {'Question': q, 'Answer': a} for q, a in answers.items()
    ])
    print('Question-Answer table:')
    print(qa_df)
else:
    qa_df = pd.DataFrame(columns=['Question', 'Answer'])

### Exercise: Ask Questions on a New Paper

**Task**: Apply the question-answering function to a new paper (`paper2.pdf`) and add the answers to the DataFrame.

**Code**:


In [None]:
# Your code here
if new_text:
    # answer_questions expects a string, so skip it when OCR failed (new_text is None)
    new_answers = answer_questions(new_text)
    new_qa_df = pd.DataFrame([
        {'Question': q, 'Answer': a} for q, a in new_answers.items()
    ])
    qa_df = pd.concat([qa_df, new_qa_df], ignore_index=True)
    print('Updated question-answer table:')
    print(qa_df)
else:
    print('Failed to answer questions for paper2.pdf.')

**Solution**: The code extracts answers from the new paper and appends them to the DataFrame. Verify the answers against the paper’s content to ensure accuracy.

## Conclusion

This notebook demonstrated how to process a research paper PDF using OCR, summarize its content, and answer specific questions to extract key insights. We focused on a paper related to the ChestMNIST dataset, summarizing its objectives, methods, and findings, and addressing targeted questions about dataset size, model performance, and limitations. The DataFrame enables further analysis, such as comparing multiple papers.
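
As one possible way to compare multiple papers, the question-answer rows can carry a `Paper` column and be pivoted so that each paper becomes a column. This is a sketch: the `Paper` column and the `compare_papers` helper are not in the notebook's DataFrame as written, the shortened questions are illustrative, and the `paper2.pdf` entries are placeholders rather than real results.

```python
import pandas as pd

def compare_papers(qa_long):
    """Pivot long-format Q&A rows into one row per question, one column per paper."""
    return qa_long.pivot(index="Question", columns="Paper", values="Answer")

qa_long = pd.DataFrame([
    {"Paper": "paper1.pdf", "Question": "Dataset size?", "Answer": "112,120 images"},
    {"Paper": "paper1.pdf", "Question": "Test accuracy?", "Answer": "92%"},
    # Placeholder rows: no second paper ships with this notebook
    {"Paper": "paper2.pdf", "Question": "Dataset size?", "Answer": "not found"},
    {"Paper": "paper2.pdf", "Question": "Test accuracy?", "Answer": "not found"},
])
comparison = compare_papers(qa_long)
print(comparison)
```

`DataFrame.pivot` raises an error if a paper has two rows for the same question, so each (paper, question) pair should appear only once.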

### References
- Tesseract OCR: https://github.com/tesseract-ocr/tesseract
- pdf2image: https://github.com/Belval/pdf2image
- Pandas Documentation: https://pandas.pydata.org/docs/
- NIH Chest X-ray Dataset (the source data for ChestMNIST): https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community