# Introduction

In this notebook, I build a paper summariser that calls the LLM API directly instead of relying on the chains provided by frameworks such as LangChain and LlamaIndex. LangChain is used only to load and split the PDF.

In [1]:
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI


In [2]:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
GOOGLE_SEARCH_API_KEY = os.environ["GOOGLE_SEARCH_API_KEY"]

In [3]:
# load document
loader = PyPDFLoader("../../example_paper1.pdf")
documents = loader.load()
print(len(documents))

33


In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=6000, chunk_overlap=0)
text = text_splitter.split_documents(documents)
print(len(text))

33


In [5]:

instruction_prompt = """
    Here is a piece of text from a research paper. 
    Article:
    {text_chunk}
   ----
    
    Summarise the given text by following the Guidance below.
    
    Guidance:
    - Summarise the key conclusions in a single short paragraph of approx. 100 words.
    - The summary should be highly dense and concise yet self-contained.
    - List relevant findings, observations and supporting arguments for the key conclusions in 3-5 bullet points.
    - Each finding should be 1-2 sentences long and stay as close to the original text as possible.
    - Describe the methods in 3-5 short sentences.

    <format>
    Summary: <conclusion>
    Findings: <findings>
    Methods: <methods>
    </format>

    
"""
role_prompt = """You are a research assistant tasked with conducting literature reviews. You never make up any information that isn't in the literature."""
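
Because the prompt asks for a fixed `Summary / Findings / Methods` layout, a reply can be split back into its three parts. The cell below is a minimal sketch of that idea, not part of the original pipeline: `parse_summary` and `example_reply` are hypothetical names, the example reply is made up for illustration, and the sketch assumes the model puts each label at the start of a line.

```python
import re

# Section labels taken from the <format> block of the instruction prompt.
SECTION_LABELS = ("Summary", "Findings", "Methods")

# Each label must start a line; the body runs until the next label or the end of the text.
SECTION_PATTERN = re.compile(
    r"^(Summary|Findings|Methods):[ \t]*(.*?)(?=^(?:Summary|Findings|Methods):|\Z)",
    re.DOTALL | re.MULTILINE,
)


def parse_summary(response_text):
    """Split a 'Summary: ... Findings: ... Methods: ...' reply into a dict."""
    sections = {label.lower(): "" for label in SECTION_LABELS}
    for label, body in SECTION_PATTERN.findall(response_text):
        sections[label.lower()] = body.strip()
    return sections


# Made-up reply, used only to show the expected shape.
example_reply = (
    "Summary: A short conclusion.\n\n"
    "Findings:\n- Finding one.\n- Finding two.\n\n"
    "Methods:\n- Method one."
)
parsed = parse_summary(example_reply)
print(parsed["findings"])
```

Sections the model leaves out come back as empty strings, which makes a malformed reply easy to detect before it is passed on.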


In [6]:
text_chunk_summary = []
client = OpenAI()  # create the client once and reuse it for every chunk
# Hard-coded cutoff: only the first 7 chunks are summarised, which drops the references and appendix
for i in range(7):
    text_chunk = text[i].page_content
    prompt = instruction_prompt.format(text_chunk=text_chunk)
    messages = [{"role": "system", "content": role_prompt},
                {"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        messages=messages,
        model="gpt-4",
    )

    text_chunk_summary.append(response.choices[0].message.content)
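
The cutoff of 7 above is specific to this paper. One way to avoid hard-coding it is to look for the page on which the reference list starts. The cell below is only a sketch under a stated assumption: the reference list begins on a line that contains nothing but a heading such as "References" or "Bibliography", optionally numbered. `find_reference_cutoff` is a hypothetical helper, not something the notebook uses.

```python
import re

# Assumption: the reference list starts on a line that consists only of a
# heading such as "References" or "Bibliography" (optionally numbered).
REFERENCES_HEADING = re.compile(
    r"^\s*(?:\d+\.?\s*)?(?:references|bibliography)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_reference_cutoff(page_texts):
    """Return the index of the first page containing a references heading,
    or len(page_texts) when no such heading is found."""
    for index, page_text in enumerate(page_texts):
        if REFERENCES_HEADING.search(page_text):
            return index
    return len(page_texts)


# Made-up pages, used only to show the behaviour.
pages = [
    "Introduction\nWe study ...",
    "Results\nSee the references in the text.",
    "Conclusion\n...\nReferences\n[1] ...",
]
print(find_reference_cutoff(pages))
```

With the notebook's variables this could be called as `find_reference_cutoff([doc.page_content for doc in text])`. The returned page is excluded when the result is used as `range(cutoff)`, so any body text that shares a page with the heading is dropped as well; use `cutoff + 1` to keep it.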

In [7]:
text_chunk_summary

['Summary: The study explores the ongoing superiority of tree-based models (like XGBoost and Random Forests) over deep learning methods in handling medium-sized tabular data, even with no consideration for their superior speed. The authors propose three challenge areas for researchers developing tabular-specific Neural Networks - robustness against uninformative features, data orientation preservation, and capacity to learn irregular functions. The paper also presents a benchmark and baseline data feasibility for future research in tabular data architecture.\n\nFindings: \n- Deep learning shows exceptional capabilities with image, language, and audio data sets, but its superiority with tabular data remains uncertain.\n- Tree-based models continue to outperform deep learning methods on medium-sized tabular data.\n- There exist pivotal differences in the inductive biases of tree-based models and Neural Networks.\n- Propagation of robustness against uninformative features, preservation of

In [8]:
print(text_chunk_summary[6])

Summary: The research identifies that Neural Networks (NNs) struggle to fit irregular functions in datasets compared to tree-based models. They find that Multi-Layer Perceptron (MLP) architectures are less robust to uninformative features common in tabular datasets and that data, which are non-rotationally invariant, can impact the learning procedures. Regularization and careful optimization, however, may allow NNs to learn these patterns. 

Findings: 
- NNs struggle to fit irregular functions, which are prevalent in the studied datasets, compared to tree-based models. Tree-based models are unbiased towards low-frequency functions unlike NNs.
- MLP-like architectures face difficulty in handling uninformative features common in tabular datasets, leading to their decreased performance when such features exist. 
- Rotational invariance of data tends to pose a challenge for MLPs, suggesting that learning procedures should not be rotationally invariant. 

Methods: 
- The study employed remo

In [9]:

# Reduce step: join the per-chunk summaries and summarise them again with the same prompt
text_chunk = "\n".join(text_chunk_summary)

prompt = instruction_prompt.format(text_chunk=text_chunk)
messages = [{"role": "system", "content": role_prompt},
            {"role": "user", "content": prompt}]
client = OpenAI()
response = client.chat.completions.create(
    messages=messages,
    model="gpt-4",
)

In [10]:
print(response.choices[0].message.content)

Summary: The studies indicate that tree-based models consistently outperform Neural Networks (NNs) in handling tabular data. This performance gap persists even with standard hyperparameters tuning and when only numerical features are used. The unusual performance advantage of tree-based models can be attributed to their use of decision trees within ensemble methods and their superior robustness to uninformative features common in tabular datasets.

Findings: 
- Tree-based models maintain a higher level of performance over NNs across various random search budgets, even with hyperparameters tuning. 
- The superiority of tree-based models endures even when the datasets are limited to numerical variables only, going against the belief that NNs primarily struggle with categorical variables in tabular data.
- Tree-based models handle uninformative features and irregular functions better than MLP-like Neural Network architectures.
- Bagging or boosting methods that utilize decision trees as w
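
Cells 6 and 9 together follow a map-reduce pattern: each chunk is summarised on its own, and the joined chunk summaries are then summarised once more. The sketch below isolates that control flow from the API call. `map_reduce_summarise` and `fake_summarise` are hypothetical names, and `fake_summarise` is a stand-in that keeps the first sentence so the cell runs without an API key.

```python
from typing import Callable, List


def map_reduce_summarise(chunks: List[str], summarise: Callable[[str], str]) -> str:
    """Summarise each chunk, then summarise the joined chunk summaries."""
    chunk_summaries = [summarise(chunk) for chunk in chunks]  # map step
    combined = "\n".join(chunk_summaries)
    return summarise(combined)  # reduce step


calls = []


def fake_summarise(text: str) -> str:
    """Stand-in for an LLM call: record the input and keep its first sentence."""
    calls.append(text)
    return text.split(".")[0].strip() + "."


result = map_reduce_summarise(["A first. A second.", "B first. B second."], fake_summarise)
print(result)
```

In the notebook, `summarise` would wrap the `client.chat.completions.create` call with `instruction_prompt` and `role_prompt`. Passing it in as a function keeps the pattern testable offline.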

## Put everything together

In [11]:
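# import os
# from typing import Any
# from serpapi import GoogleSearch
# import regex as re
# import prompts
# from openai import OpenAI
# from langchain.document_loaders import PyPDFLoader
# from langchain.text_splitter import RecursiveCharacterTextSplitter
#
# GOOGLE_SEARCH_API_KEY = os.environ["GOOGLE_SEARCH_API_KEY"]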


# class PaperSummariser:

#     def __init__(self) -> None:
#         self.stop_at = 7
#         self.PROMPTS = prompts

#     def _summarise_content(self, text) -> Any:
        
#         text_chunk_summary = []
#         for i in range(self.stop_at):
#             text_chunk = text[i].page_content
#             page_summary = self._summarise_page(text_chunk)
#             # append inside the loop so that every page summary is kept, not only the last one
#             text_chunk_summary.append(page_summary)

#         all_page_summary ="\n".join(text_chunk_summary)
#         final_summary = self._summarise_page(all_page_summary)

#         return final_summary

#     def _summarise_page(self, text_chunk):
        
#         prompt = self.PROMPTS.INSTRUCTION_PROMPT.format(text_chunk=text_chunk)
#         messages = [{"role": "system", "content": self.PROMPTS.ROLE_PROMPT},
#                 {"role": "user", "content": prompt}]
#         client = OpenAI()
#         response = client.chat.completions.create(
#                     messages=messages,
#                     model = "gpt-4", 
#         )
#         return response.choices[0].message.content
    
#     def get_paper_info(self, text_chunk):
#         import ast  # local import keeps this method self-contained

#         prompt = self.PROMPTS.EXTRACTION_PROMPT.format(text_chunk=text_chunk)
#         messages = [{"role": "system", "content": self.PROMPTS.ROLE_PROMPT},
#                 {"role": "user", "content": prompt}]
#         client = OpenAI()
#         response = client.chat.completions.create(
#                 messages=messages,
#                 model = "gpt-4", 
#         )
#         # literal_eval parses dict/list literals without executing arbitrary code, unlike eval
#         info = ast.literal_eval(response.choices[0].message.content)

#         return info
    
#     @staticmethod
#     def google_scholar_search(query):
#         params = {
#         "engine": "google_scholar",
#         "q": query,
#         "api_key": GOOGLE_SEARCH_API_KEY
#     }
#         search = GoogleSearch(params)
#         results = search.get_dict()
#         summary = results["organic_results"][0]["publication_info"]["summary"]
#         publish_year = re.search(r'(19|20)\d\d', summary).group(0)
#         cited_times = results["organic_results"][0]["inline_links"]["cited_by"]["total"]

#         return publish_year, cited_times

#     def preprocess(self, filepath):

#         loader = PyPDFLoader(filepath)
#         documents = loader.load()
#         text_splitter = RecursiveCharacterTextSplitter(chunk_size=6000, chunk_overlap=0)
#         text = text_splitter.split_documents(documents)

#         return text
    
#     def summarise(self, paper_path):
#         summary = {}
#         text = self.preprocess(paper_path)

#         front_page = text[0].page_content  # pass the page text, not the Document object
#         info = self.get_paper_info(front_page)
#         title = info["title"]
#         summary.update(info)

#         publish_year, cited_times = PaperSummariser.google_scholar_search(query=title)
#         summary.update({"publish_year": publish_year, "cited_times": cited_times})

#         final_summary = self._summarise_content(text)
#         summary.update({"summary": final_summary})

#         return summary
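
In `google_scholar_search` above, `re.search(...).group(0)` raises an `AttributeError` when the summary string contains no year, and the pattern `(19|20)\d\d` can also match the first four digits of an arXiv identifier. The cell below is a sketch of a more defensive extraction. `extract_publish_year` is a hypothetical helper, and the example strings are made up; they only assume a summary of the form "authors - venue, year - site".

```python
import re
from typing import Optional

# A four-digit year starting with 19 or 20 that is not glued to other digits
# or dots, so an identifier such as "2007.12345" is skipped.
YEAR_PATTERN = re.compile(r"(?<![\d.])(?:19|20)\d{2}(?![\d.])")


def extract_publish_year(summary: str) -> Optional[str]:
    """Return the first plausible publication year in the string, or None."""
    match = YEAR_PATTERN.search(summary)
    return match.group(0) if match else None


# Made-up summary strings, used only to show the behaviour.
print(extract_publish_year("A Author, B Author - Some Journal, 2021 - example.org"))
print(extract_publish_year("A Author - arXiv preprint arXiv:2007.12345, 2020 - example.org"))
print(extract_publish_year("A Author - Some Workshop - example.org"))
```

Returning `None` instead of raising lets the caller decide whether a missing year should stop the summary or simply be left blank.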

    


In [12]:
from summariser import PaperSummariser
automsum = PaperSummariser()


In [13]:
summary = automsum.summarise("../../example_paper3.pdf")

In [14]:
summary

{'title': 'SummEval: Re-evaluating Summarization Evaluation',
 'authors': ['Alexander R. Fabbri',
  'Wojciech Kry ´sci´nski',
  'Bryan McCann',
  'Caiming Xiong',
  'Richard Socher',
  'Dragomir Radev'],
 'publish_year': '2021',
 'cited_times': 416,
 'summary': 'Summary: The research paper presents a versatile evaluation toolkit featuring 14 automatic evaluation metrics intended to simplify multi-metric evaluation and results processing. Accompanying this toolkit is a human evaluation system that utilizes a variety of neural summarization models. The evaluation process, \nencompassing both expert and crowd-sourced judges, emphasizes on four dimensions: coherence, consistency, fluency, and relevance.\n\nFindings:\n- Fourteen automatic evaluation metrics are consolidated in the proposed toolkit, intending to streamline the evaluation procedure.\n- The evaluation process includes using crowd-sourced evaluators and experts to assess summaries from neural models.\n- The evaluation process t

In [19]:
print(summary["summary"])

Summary: The research paper discusses a Python package unifying 14 automatic evaluation metrics under an evaluation toolkit. It utilized human annotations from a pool of crowd-sourced judges and experts to evaluate summaries based on coherence, consistency, fluency, and relevance parameters leading to the evaluation of 16 neural summarization models.

Findings:
- The evaluation toolkit consists of 14 consolidated metrics with batch and example-level evaluation features.
- It offers a consistent interface for multi-metric evaluations and customization options for each metric.
- Over 12,800 human annotations were collected to provide insights into different summary parameters.
- The collected annotations were applied to evaluate 16 different neural summarization models.

Methods:
- The research utilized a Python package-based evaluation toolkit with several automatic metrics.
- It supported both batch and individual evaluations for each metric with custom configuration options.
- Summari

In [None]:
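# Join the chunk summaries into one string; formatting the list directly would embed its Python repr in the prompt
article = "\n".join(text_chunk_summary)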

system_prompt = """
You are an expert in writing rich and dense summaries in broad domains.
"""

prompt = f"""
Article:

{article}

----

You will generate increasingly concise, entity-dense summaries of the above
Article.

Repeat the following 2 steps 5 times.

- Step 1: Identify 1-3 informative Entities from the Article
which are missing from the previously generated summary and are the most
relevant.

- Step 2: Write a new, denser summary of identical length which covers
every entity and detail from the previous summary plus the missing entities

A Missing Entity is:

- Relevant: to the main story
- Specific: descriptive yet concise (5 words or fewer)
- Novel: not in the previous summary
- Faithful: present in the Article
- Anywhere: located anywhere in the Article

Guidelines:
- The first summary should be long (4-5 sentences, approx. 80 words) yet
highly non-specific, containing little information beyond the entities
marked as missing.

- Use overly verbose language and fillers (e.g. "this article discusses") to
reach approx. 80 words.

- Make every word count: re-write the previous summary to improve flow and
make space for additional entities.

- Make space with fusion, compression, and removal of uninformative phrases
like "the article discusses"

- The summaries should become highly dense and concise yet self-contained,
e.g., easily understood without the Article.

- Missing entities can appear anywhere in the new summary.

- Never drop entities from the previous summary. If space cannot be made,
add fewer new entities.

> Remember to use the exact same number of words for each summary.
Answer in JSON.

> The JSON in `summaries_per_step` should be a list (length 5) of
dictionaries whose keys are "missing_entities" and "denser_summary".

"""


messages = [{"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}]
client = OpenAI()
response = client.chat.completions.create(
            messages=messages,
            model = "gpt-4",   
)



In [None]:
print(response.choices[0].message.content)
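
The prompt above asks for JSON with a list of five steps, each holding `missing_entities` and `denser_summary`, so the final (densest) summary is the last entry of that list. The cell below is a sketch of how the reply could be unpacked. `final_dense_summary` is a hypothetical helper, and since the prompt does not pin down whether the list is wrapped in a `summaries_per_step` key, the sketch accepts both shapes. It also assumes the reply is valid JSON, which the model does not guarantee.

```python
import json


def final_dense_summary(reply: str) -> str:
    """Return the last 'denser_summary' from a Chain-of-Density style JSON reply."""
    data = json.loads(reply)
    # Accept either {"summaries_per_step": [...]} or a bare list of steps.
    steps = data["summaries_per_step"] if isinstance(data, dict) else data
    if not steps:
        raise ValueError("reply contains no summaries")
    return steps[-1]["denser_summary"]


# Made-up reply with two steps, used only to show the expected shape.
example_reply = json.dumps({
    "summaries_per_step": [
        {"missing_entities": "entity a", "denser_summary": "first draft"},
        {"missing_entities": "entity b", "denser_summary": "denser draft"},
    ]
})
print(final_dense_summary(example_reply))
```

With the notebook's variables this would be called as `final_dense_summary(response.choices[0].message.content)`.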