<a href="https://colab.research.google.com/github/amishra15/Paper-Summarization-with-LLMs-and-ArXiv/blob/main/Scientific_Paper_Summarization_with_LLMs_and_ArXiv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scientific Paper Summarization with LLMs and ArXiv

### Importing Libraries

In [1]:
import os
!pip install transformers arxiv ipywidgets

import arxiv
from transformers import pipeline
import ipywidgets as widgets
from IPython.display import display, clear_output

Collecting arxiv
  Downloading arxiv-2.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting feedparser~=6.0.10 (from arxiv)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Using cached jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)
Downloading arxiv-2.1.3-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.3/81.3 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[?25hUsing cached jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... [?25l[?25hdone
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6047 sha256=31fc6f295fbee96e1781e786b359a31942c88d059efd7dc3f9b4

### Loading the Pipeline

In [2]:
# Load summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



### Defining Functions

In [3]:
# Define function to fetch papers from ArXiv
def fetch_papers(query, max_results=5):
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance
    )
    return list(search.results())

# Define function to summarize text
def summarize_text(text):
    # Set max length to be adaptive based on input length
    input_length = len(text)
    max_length = min(400, input_length if input_length < 800 else input_length // 2)
    min_length = min(100, input_length // 3)
    summary = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
    return summary[0]['summary_text']

# Define function to display paper details and summaries
def display_paper_details_and_summaries(papers):
    clear_output()
    for idx, paper in enumerate(papers):
        print(f"{idx + 1}. {paper.title}\n")
        print(f"Authors: {', '.join(author.name for author in paper.authors)}\n")
        summary = summarize_text(paper.summary)
        print("Summary:\n")
        print(summary)
        print("\nLink to paper: ", paper.pdf_url)
        print("\n" + "-"*80 + "\n")

In [4]:
# Define main function to run the summarization system
def run_summarization_system():
    query_input = widgets.Text(description='Query:', placeholder='Enter search query')
    num_papers_input = widgets.IntText(description='Number of Papers:', value=5, min=1, max=20)
    search_button = widgets.Button(description='Search')

    display(query_input, num_papers_input, search_button)

    def on_search_button_click(b):
        query = query_input.value
        num_papers = num_papers_input.value
        if query:
            papers = fetch_papers(query, max_results=num_papers)
            display_paper_details_and_summaries(papers)

    search_button.on_click(on_search_button_click)



### Result

In [7]:
# Run the UI
run_summarization_system()

Your max_length is set to 400, but your input_length is only 265. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=132)


1. Wavelet Convolutional Neural Networks

Authors: Shin Fujieda, Kohei Takayama, Toshiya Hachisuka



Your max_length is set to 400, but your input_length is only 266. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=133)


Summary:

Spatial and spectral approaches are two major approaches for image processing. We propose a novel CNN architecture, wavelet CNNs, whichcombines a multiresolution analysis and CNNs into one model. Wavelet CNN's allow us to utilize spectral information which ismostly lost in conventional CNNs but useful in most image processing tasks. We evaluate the practical performance of wavelet. CNNs on texture classification and image annotation. The experiments show that wave let CNNs can achieve better.accuracy in both tasks than existing models while having significantly fewer.parameters than conventional CNN's.

Link to paper:  http://arxiv.org/pdf/1805.08620v1

--------------------------------------------------------------------------------

2. Fine-grained Recognition in the Noisy Wild: Sensitivity Analysis of Convolutional Neural Networks Approaches

Authors: Erik Rodner, Marcel Simon, Robert B. Fisher, Joachim Denzler



Your max_length is set to 400, but your input_length is only 261. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=130)


Summary:

In this paper, we study the sensitivity of CNN outputs with respect to imagetransformations and noise in the area of fine-grained recognition. We provide an extensive empirical sensitivity analysis of commonly used CNNarchitectures (AlexNet, VGG19, GoogleNet) across various types of image degradation. We demonstrate that input layer dropout or pre-filtering during test time only reduces CNN sensitivity for high levels of degradation. This allows for predicting CNN performance for new domainscomprised by images of lower quality or captured from a different viewpoint.

Link to paper:  http://arxiv.org/pdf/1610.06756v1

--------------------------------------------------------------------------------

3. Iris super-resolution using CNNs: is photo-realism important to iris recognition?

Authors: Eduardo Ribeiro, Andreas Uhl, Fernando Alonso-Fernandez



Your max_length is set to 400, but your input_length is only 255. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=127)


Summary:

The authors explore single image super-resolution using CNNs for iris recognition. They test different CNN architectures and use different training databases. They also use quality assessment, visual results and recognition experiments. The results show that using deeper architectures trained with texture databases that provide a balance between edge preservation and the smoothness of the method can lead to good results in the irisrecognition process. The work was published in the open-access journal, Theoretical and Computational Image Processing (ICP)

Link to paper:  http://arxiv.org/pdf/2210.13125v1

--------------------------------------------------------------------------------

4. Progressive Minimal Path Method with Embedded CNN

Authors: Wei Liao

Summary:

Path-CNN is a method for the segmentation of centerlines of tubular structures. It embeds convolutional neural networks (CNNs) into the Progressive minimal path method. CNNs employ learned image features to improv