# Introduction

In this notebook, I experiment with different prompting techniques for summarisation tasks. The aims are to:

- better understand how they work
- craft appropriate prompts for summarising a research paper

# Load packages

In [2]:
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from openai import OpenAI
import pandas as pd

In [3]:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] 

# Prepare data

I will be using a paper (in PDF) that I recently read, and I leverage LangChain to load and split it. As the paper is long, I will mainly use the first page, but I have also tested the prompts on other pages (chunks).


In [6]:
# load document
loader = PyPDFLoader("../../example_paper1.pdf")
documents = loader.load()
print(f"This paper has {len(documents)} pages")

This paper has 33 pages


In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=6000, chunk_overlap=0)
text = text_splitter.split_documents(documents)
text_chunk = text[0].page_content
print(text_chunk)

Why do tree-based models still outperform deep
learning on tabular data?
Léo Grinsztajn
Soda, Inria Saclay
leo.grinsztajn@inria.frEdouard Oyallon
ISIR, CNRS, Sorbonne UniversityGaël Varoquaux
Soda, Inria Saclay
Abstract
While deep learning has enabled tremendous progress on text and image datasets,
its superiority on tabular data is not clear. We contribute extensive benchmarks of
standard and novel deep learning methods as well as tree-based models such as
XGBoost and Random Forests, across a large number of datasets and hyperparame-
ter combinations. We deﬁne a standard set of 45 datasets from varied domains with
clear characteristics of tabular data and a benchmarking methodology accounting
for both ﬁtting models and ﬁnding good hyperparameters. Results show that tree-
based models remain state-of-the-art on medium-sized data ( ∼10K samples) even
without accounting for their superior speed. To understand this gap, we conduct an
empirical investigation into the differing inductive bi
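
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width splitter. This is only a sketch: `split_fixed` is a hypothetical helper, not LangChain's algorithm. `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, so its chunks generally end at natural boundaries rather than at an exact character count.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With `chunk_overlap=0`, as used in this notebook, the windows simply sit side by side and no text is repeated between chunks.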

## Simple prompt

In [17]:
basic_prompt = f"""summarise the text in {text_chunk}"""
role_prompt = """You are a research assistant with the task to do literature review. You never make up any information that isn't in the literature."""

messages = [{"role": "system", "content": role_prompt},
            {"role": "user", "content": basic_prompt}]
client = OpenAI()
response = client.chat.completions.create(
    messages=messages,
    model="gpt-3.5-turbo",
)

In [12]:
print(response.choices[0].message.content)

The text discusses the performance of tree-based models compared to deep learning models on tabular data. While deep learning models have been successful in text and image datasets, their superiority in tabular data is not clear. The authors conduct extensive benchmarks on various datasets and hyperparameter combinations, comparing deep learning methods and tree-based models such as XGBoost and Random Forests. The results show that tree-based models are still state-of-the-art on medium-sized data, even without considering their faster speed. The authors then investigate the differences in the inductive biases of tree-based models and neural networks, highlighting the challenges in building tabular-specific neural networks. To stimulate research in this area, the authors provide a standard benchmark and raw data for baselines.
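
The same request pattern (a system message, a user message, one completion call) recurs throughout this notebook, so it could be wrapped in a small helper. The sketch below is one possible shape, not something the notebook actually uses: `build_messages` and `summarise` are hypothetical names, and the stand-in client exists only so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Optional


def build_messages(user_prompt: str, role_prompt: Optional[str] = None) -> list:
    """Return a chat message list, with an optional system message first."""
    messages = []
    if role_prompt:
        messages.append({"role": "system", "content": role_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def summarise(client, user_prompt, role_prompt=None, model="gpt-3.5-turbo"):
    """Send one chat completion request and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(user_prompt, role_prompt),
    )
    return response.choices[0].message.content


class _EchoCompletions:
    """Offline stand-in that mimics the shape of a chat completion response."""

    def create(self, model, messages):
        text = f"[{model}] received {len(messages)} message(s)"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=_EchoCompletions()))
reply = summarise(fake_client, "summarise the text in ...", role_prompt="You are a research assistant.")
print(reply)
```

In the notebook itself, `OpenAI()` would be passed in place of `fake_client`.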


## Chain of thought (COT)

The idea of COT is to enable complex reasoning capabilities through intermediate reasoning steps.

Zero-shot COT can be achieved by simply adding "Let's think step by step" to the original prompt. This is more likely to help with math problems and other symbolic reasoning problems, but I still want to give it a try to see how it works for this task.
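
Because zero-shot COT only appends a trigger phrase, it can be applied to any existing prompt with a one-line helper. A minimal sketch, where `with_zero_shot_cot` is a hypothetical name and the placeholder text stands in for a real chunk:

```python
COT_TRIGGER = "Let's think step by step."


def with_zero_shot_cot(prompt: str, trigger: str = COT_TRIGGER) -> str:
    """Append the zero-shot COT trigger phrase as its own paragraph."""
    return f"{prompt.rstrip()}\n\n{trigger}"


example = with_zero_shot_cot("Summarise the following text: <text goes here>\n")
print(example)
```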


In [130]:
text_chunk = text[0].page_content
instruction_prompt = f"""Here is a piece of document from a research paper you will summarise: {text_chunk}

Keep the summary short.

Let's think step by step.

"""

role_prompt = """You are a research assistant with the task to do literature review. You never make up any information that isn't in the literature."""

messages = [{"role": "system", "content": role_prompt},
            {"role": "user", "content": instruction_prompt}]
client = OpenAI()
response = client.chat.completions.create(
    messages=messages,
    model="gpt-3.5-turbo",
)

In [131]:
print(response.choices[0].message.content)

The paper titled "Why do tree-based models still outperform deep learning on tabular data?" by Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux discusses the performance of deep learning models versus tree-based models on tabular data.

The authors conduct extensive benchmarks on standard and novel deep learning methods, as well as tree-based models like XGBoost and Random Forests. They use a standard set of 45 datasets with clear characteristics of tabular data and a benchmarking methodology that accounts for both model fitting and finding good hyperparameters.

The results show that tree-based models continue to be state-of-the-art on medium-sized data, even without considering their superior speed. The authors also investigate the differing inductive biases of tree-based models and neural networks (NNs) through an empirical investigation.

Based on their findings, the authors highlight the challenges that researchers aiming to build tabular-specific NNs should address, including 

## COT (few-shot)

Another technique often mentioned along with zero-shot is few-shot, which involves providing a few examples in the prompt to "teach" the model how to reason. However, I found it less practical for a summarisation task: unlike math calculations or answers to specific questions, good example summaries are hard to come up with, and it is also difficult to define what "good" means in this context. Hence I skipped this approach.
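
For reference, few-shot prompting with a chat model is commonly expressed as alternating user and assistant messages placed before the real request. The sketch below shows only that message structure. `build_few_shot_messages` is a hypothetical helper, and the example pairs are placeholders rather than real summaries, which is exactly the part that is hard to supply for this task.

```python
def build_few_shot_messages(role_prompt, examples, new_text):
    """Build chat messages in which each example is a user/assistant pair."""
    messages = [{"role": "system", "content": role_prompt}]
    for source_text, reference_summary in examples:
        messages.append({"role": "user", "content": f"Summarise the text: {source_text}"})
        messages.append({"role": "assistant", "content": reference_summary})
    # The real request comes last, phrased the same way as the examples.
    messages.append({"role": "user", "content": f"Summarise the text: {new_text}"})
    return messages


# Placeholder pairs (hypothetical): in practice these would be hand-written summaries.
examples = [
    ("<excerpt from paper A>", "<hand-written summary of excerpt A>"),
    ("<excerpt from paper B>", "<hand-written summary of excerpt B>"),
]
few_shot_messages = build_few_shot_messages("You are a research assistant.", examples, "<new excerpt>")
print([m["role"] for m in few_shot_messages])
```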

## COT (steps + formatting)

Since the idea of COT is to break the reasoning process down into steps, I tried explicitly writing down what the steps should be and added an example to specify the format I want.


In [18]:
text_chunk = text[0].page_content
instruction_prompt = f"""Here is a piece of document from a research paper you will summarise: {text_chunk}

Take the following steps.

First, find the authors of the paper and the title of the paper. If no information can be found, write "not known" instead.

Then, summarise the key findings and the techniques used to derive the findings in 3-5 sentences. 

Thus, the format of the response should be in JSON as shown in the <example></example> tags. Make sure to follow the formatting exactly.

<example>
{{
    "author": author,
    "title": title,
    "summary": summary
}}
</example>

"""
role_prompt = """You are a research assistant with the task to do literature review. You never make up any information that isn't in the literature."""

messages = [{"role": "system", "content": role_prompt},
            {"role": "user", "content": instruction_prompt}]
client = OpenAI()
response = client.chat.completions.create(
    messages=messages,
    model="gpt-3.5-turbo",
)

In [19]:
print(response.choices[0].message.content)

{
    "author": "Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux",
    "title": "Why do tree-based models still outperform deep learning on tabular data?",
    "summary": "The authors conducted extensive benchmarks comparing tree-based models such as XGBoost and Random Forests with standard and novel deep learning methods on a set of 45 tabular datasets. They found that tree-based models remained state-of-the-art on medium-sized data even without accounting for their superior speed. The authors conducted an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs) and identified challenges for building tabular-specific NNs. They contributed a standard benchmark and raw data for baselines to stimulate research on tabular architectures."
}
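
Since the reply is requested as JSON, it can be parsed and checked before any further use. A minimal sketch, assuming the reply contains a single JSON object that may be surrounded by extra text. `parse_summary_json` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json

REQUIRED_KEYS = {"author", "title", "summary"}


def parse_summary_json(raw: str) -> dict:
    """Parse the model reply and check that the requested keys are present."""
    # Keep only the span between the first "{" and the last "}",
    # in case the model adds text around the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in the reply")
    data = json.loads(raw[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data


sample_reply = 'Here is the result:\n{"author": "not known", "title": "not known", "summary": "..."}'
parsed = parse_summary_json(sample_reply)
print(parsed["title"])
```

A failed parse is a useful signal in itself: it means the formatting instruction in the prompt was not followed and the request may need to be retried.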


## Self-consistency COT

Another technique I encountered is self-consistency. The idea is to sample multiple, diverse reasoning paths through few-shot COT and use the generations to select the most consistent answer (majority vote). It has been shown to boost the performance of COT on tasks involving arithmetic and commonsense reasoning.

Since I decided not to use few-shot, my implementation of self-consistency asks the model to follow the given steps n times (here, as three simulated assistants within a single prompt).
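
For comparison, the voting step of self-consistency can be sketched separately from the prompting. Free-text summaries rarely match word for word, so this sketch votes on a short extracted field such as the title. It assumes the answers were collected from several independent samples (for instance by repeating the request, or via the `n` parameter of `client.chat.completions.create`). `majority_vote` is a hypothetical helper and the sampled titles are illustrative values, not real model outputs.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of samples agreeing with it."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalise whitespace and case so trivial differences do not split the vote.
    normalised = [" ".join(a.split()).lower() for a in answers]
    winner, count = Counter(normalised).most_common(1)[0]
    # Report the first original answer whose normalised form won.
    original = next(a for a, n in zip(answers, normalised) if n == winner)
    return original, count / len(answers)


# Illustrative values only (hypothetical samples).
sampled_titles = [
    "Why do tree-based models still outperform deep learning on tabular data?",
    "why do tree-based models still outperform  deep learning on tabular data?",
    "not known",
]
title, agreement = majority_vote(sampled_titles)
print(title, round(agreement, 2))
```

A low agreement share suggests the samples disagree and the answer deserves a manual check.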


In [34]:
text_chunk = text[0].page_content
self_consistency_prompt = f"""
Imagine three completely independent research assistants who will summarise a given piece of paper {text_chunk}.

Each assistant follows the below 2 steps. 

1. find the authors of the paper and the title of the paper. If no information can be found, write "not known" instead.

2. summarise the key findings and the techniques used in a short paragraph. 

Each assistant should provide response in JSON following the format given in the <example></example> tags. 

<example>
{{
    "author": author,
    "title": title,
    "summary": <the author> <key findings> <supportive arguments>
}}
</example>

"""

# Note: role_prompt is defined here but is not passed as a system message in this cell.
role_prompt = """You are a research assistant with the task to do literature review. You never make up any information that isn't in the literature."""

messages = [
            {"role": "user", "content": self_consistency_prompt}]
client = OpenAI()
response = client.chat.completions.create(
    messages=messages,
    model="gpt-3.5-turbo",
)

In [35]:
print(response.choices[0].message.content)

{
    "assistant1": {
        "author": "Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux",
        "title": "Why do tree-based models still outperform deep learning on tabular data?",
        "summary": "The authors conducted extensive benchmarks of deep learning methods and tree-based models on a variety of tabular datasets. They found that tree-based models, such as XGBoost and Random Forests, remain state-of-the-art on medium-sized data, even without considering their superior speed. The authors also investigated the differing inductive biases of tree-based models and Neural Networks (NNs) and identified challenges for building tabular-specific NNs. They provide a standard benchmark and raw data for baseline comparisons."
    },
    "assistant2": {
        "author": "Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux",
        "title": "Why do tree-based models still outperform deep learning on tabular data?",
        "summary": "The paper by Léo Grinsztajn, Edouard Oyallon, and Gaël V

## Chain of density (COD)

Chain of density prompting is a recently developed technique designed specifically for summarisation tasks. Summaries are generated by iteratively incorporating missing salient entities from the source text without increasing the length. The prompt is taken from the original paper (listed under Useful materials). I also used gpt-4, as suggested in the paper.
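
Because each step is supposed to add entities while keeping the length fixed, it can be worth checking the word count of every summary once the JSON reply comes back. A minimal sketch, assuming the reply is valid JSON with the `summaries_per_step` and `denser_summary` keys that the prompt asks for. `cod_word_counts` is a hypothetical helper and the example reply is made up.

```python
import json


def cod_word_counts(raw_json: str) -> list:
    """Return the word count of each `denser_summary`, in step order."""
    data = json.loads(raw_json)
    return [len(step["denser_summary"].split()) for step in data["summaries_per_step"]]


# Made-up two-step reply, for illustration only.
example_reply = json.dumps({
    "summaries_per_step": [
        {"missing_entities": ["entity A"], "denser_summary": "one two three four"},
        {"missing_entities": ["entity B"], "denser_summary": "five six seven eight nine"},
    ]
})
counts = cod_word_counts(example_reply)
print(counts, "spread:", max(counts) - min(counts))
```

A large spread between the shortest and longest summary indicates that the model did not respect the fixed-length instruction.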

In [36]:
# text_chunk = ''
# for i in range(6):
#     text_chunk += text[i].page_content
text_chunk = text[0].page_content
article = text_chunk


system_prompt = """
You are an expert in writing rich and dense summaries in broad domains.
"""

prompt = f"""
Article:

{article}

----

You will generate increasingly concise, entity-dense summaries of the above
Article.

Repeat the following 2 steps 5 times.

- Step 1: Identify 1-3 informative Entities from the Article
which are missing from the previously generated summary and are the most
relevant.

- Step 2: Write a new, denser summary of identical length which covers
every entity and detail from the previous summary plus the missing entities

A Missing Entity is:

- Relevant: to the main story
- Specific: descriptive yet concise (5 words or fewer)
- Novel: not in the previous summary
- Faithful: present in the Article
- Anywhere: located anywhere in the Article

Guidelines:
- The first summary should be long (4-5 sentences, approx. 80 words) yet
highly non-specific, containing little information beyond the entities
marked as missing.

- Use overly verbose language and fillers (e.g. "this article discusses") to
reach approx. 80 words.

- Make every word count: re-write the previous summary to improve flow and
make space for additional entities.

- Make space with fusion, compression, and removal of uninformative phrases
like "the article discusses"

- The summaries should become highly dense and concise yet self-contained,
e.g., easily understood without the Article.

- Missing entities can appear anywhere in the new summary.

- Never drop entities from the previous summary. If space cannot be made,
add fewer new entities.

> Remember to use the exact same number of words for each summary.
Answer in JSON.

> The JSON in `summaries_per_step` should be a list (length 5) of
dictionaries whose keys are "missing_entities" and "denser_summary".

"""


messages = [{"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}]
client = OpenAI()
response = client.chat.completions.create(
    messages=messages,
    model="gpt-4",
)


In [37]:
print(response.choices[0].message.content)


{
"summaries_per_step": [
{
"missing_entities": ["tabular data", "tree-based models", "deep learning"],
"denser_summary": "The article discusses the use of deep learning and tree-based models like XGBoost and Random Forests on tabular data. It highlights that despite the advancements in deep learning, it is still not evidently superior to tree-based models when it comes to tabular data. The researchers also present benchmarks and challenges linked with the use of deep learning methods on tabular data, implying that the space still has plenty of room for research and development."
},
{
"missing_entities": ["medium-sized data", "inductive biases", "Neural Networks (NNs)"],
"denser_summary": "The article emphasizes that tree-based models, such as XGBoost and Random Forests, continue to outperform deep learning, notably Neural Networks (NNs), on tabular data, especially with medium-sized samples. The discussion also centers around the inductive biases of tree-based models versus NNs and p

## Final selection

In [38]:
text_chunk = text[6].page_content
instruction_prompt = f"""
    Here is a piece of text from a research paper. 
    Article:
    {text_chunk}
   ----
    
    Summarise the given text by following the Guidance below.
    
    Guidance:
    - Summarise the key conclusions in a single short paragraph of approx. 100 words.
    - The summary should be highly dense and concise yet self-contained. 
    - List relevant findings, observations and supportive arguments for the key conclusions in 3-5 bullet points. 
    - Each finding should have 1-2 sentences and be as close to the original text as possible.
    - Describe the methods in 3-5 short sentences.

    <format>
    Summary: <conclusion>
    Findings: <findings>
    Methods: <methods>
    </format>

    
"""
role_prompt = """You are a research assistant with the task to do literature review. You never make up any information that isn't in the literature."""


messages = [{"role": "system", "content": role_prompt},
            {"role": "user", "content": instruction_prompt}]
client = OpenAI()
response = client.chat.completions.create(
    messages=messages,
    model="gpt-3.5-turbo",
)



In [39]:
print(response.choices[0].message.content)

Summary: The research paper explores the performance of neural networks (NNs) compared to tree-based models on tabular datasets. The study finds that NNs struggle to fit irregular functions compared to tree-based models, which learn piece-wise constant functions. The paper also highlights that uninformative features affect MLP-like NNs more than other models. Additionally, the research shows that data are non-invariant by rotation, highlighting the importance of learning procedures that are not rotationally invariant.

Findings:
- NNs struggle to fit irregular functions compared to tree-based models.
- Uninformative features have a greater impact on MLP-like NNs compared to other models.
- Data are non-invariant by rotation, suggesting the need for learning procedures that are not rotationally invariant.

Methods:
- The study uses tabular datasets and compares the performance of NNs and tree-based models.
- Feature importance, ranked by a Random Forest, is used to drop uninformative fe
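
The `<format>` block in the prompt asks for three labelled sections, which makes the reply easy to split for later use (for example, one column per section in a table). A minimal sketch, assuming each label starts a line and is followed by a colon. `split_sections` is a hypothetical helper and the sample reply is made up.

```python
import re

SECTION_LABELS = ("Summary", "Findings", "Methods")


def split_sections(reply: str) -> dict:
    """Split a reply into the labelled sections requested in the prompt."""
    labels = "|".join(SECTION_LABELS)
    pattern = re.compile(rf"^({labels}):[ \t]*", flags=re.MULTILINE)
    matches = list(pattern.finditer(reply))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(reply)
        sections[match.group(1)] = reply[match.end():end].strip()
    return sections


sample = "Summary: A short conclusion.\n\nFindings:\n- first point\n- second point\n\nMethods:\n- one method"
sections = split_sections(sample)
print(sections["Findings"].splitlines())
```

Sections that the model omits are simply absent from the returned dictionary, so a missing key indicates that the format was not followed.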

# Takeaways

1. Prompting is important. I once read a post on LinkedIn saying prompt engineering is "dead" given the rise of RAG and more capable models. Although this is debatable, and may yet prove true given how rapidly the technology evolves, I feel prompting still has an important role to play. 

2. The performance of different prompting techniques depends on the problem at hand. No single off-the-shelf prompt template works well across all applications. 

3. A good prompt (or a good prompting approach) requires logical thinking and creativity. 

# Useful materials

1. [Anthropic documentation](https://docs.anthropic.com/claude/docs/optimizing-your-prompt)
2. [Anthropic cookbook](https://github.com/anthropics/anthropic-cookbook/tree/main/long_context)
3. [Summarising Best Practices for Prompt Engineering](https://towardsdatascience.com/summarising-best-practices-for-prompt-engineering-c5e86c483af4)
4. [12 Prompt Engineering Techniques](https://cobusgreyling.medium.com/12-prompt-engineering-techniques-644481c857aa)
5. [Prompt Engineering Guide](https://www.promptingguide.ai/)
6. [From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting](https://arxiv.org/abs/2309.04269)

