## Generation of table summaries with LLM
Ask Mistral 7B Instruct v0.3 to read a table and generate a summary of it. <br>
This notebook supports generating table summaries in bulk.
Once a table is processed, its summary is saved to file.

In [None]:
# !pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121
# !pip install --upgrade sympy
# !pip uninstall -y transformers
# !pip install transformers
# !pip install bitsandbytes

Running on Colab with a T4 GPU is suggested: 15 GB of GPU memory is required.

In [None]:
from google.colab import drive
import numpy as np
from huggingface_hub import notebook_login
import torch
import time
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import os
import re
import warnings
warnings.filterwarnings('ignore')

In [3]:
notebook_login()


In [None]:
drive.mount('/content/gdrive')
%cd gdrive/MyDrive/PW_NLP

Mounted at /content/gdrive
/content/gdrive/MyDrive/PW_NLP


In [None]:
import torch
print(torch.version.cuda)
print(torch.cuda.is_available())

!nvidia-smi

12.4
True
Sat Feb  1 09:05:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8             12W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                      

In [None]:
model_card = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_card)
tokenizer.pad_token = tokenizer.eos_token

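# NOTE: "<|eot_id|>" is a Llama 3 token and is not expected to be in the Mistral
# vocabulary, so the second entry likely resolves to the unknown-token ID.
# `terminators` is not used below; generation stops on tokenizer.eos_token_id.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]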

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
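
As a rough sanity check on the memory budget, the weight footprint at different precisions can be estimated with simple arithmetic. The sketch below assumes about 7.25 billion parameters for the model and ignores quantization constants, activations, and the KV cache, so real usage will be higher. `estimate_weight_gib` is a hypothetical helper, not part of any library.

```python
def estimate_weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GiB for a given precision."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: roughly 7.25e9 parameters for a 7B-class model.
N_PARAMS = 7.25e9

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {estimate_weight_gib(N_PARAMS, bits):.1f} GiB")
```

The estimate covers weights only; the prompt length and `max_new_tokens` also affect peak memory during generation.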

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_card,
    return_dict=True,
    quantization_config=bnb_config,
    device_map='auto'
)

generation_config = model.generation_config
generation_config.max_new_tokens = 1000
generation_config.eos_token_id = tokenizer.eos_token_id
generation_config.pad_token_id = tokenizer.eos_token_id
# Unless do_sample is enabled, generate() decodes greedily and the temperature value has no effect
generation_config.temperature = 0.0
generation_config.num_return_sequences = 1

In [None]:
def call_model(model, input_ids_prompt, generation_config):
    """Generate a completion and return (decoded text, elapsed seconds).

    The decoded text includes the prompt, since output[0] holds the prompt
    tokens followed by the new tokens. Relies on the global `tokenizer`.
    """
    start_time = time.time()
    with torch.no_grad():
        output = model.generate(
            **input_ids_prompt,
            max_new_tokens=generation_config.max_new_tokens,
            eos_token_id=generation_config.eos_token_id,
            pad_token_id=generation_config.pad_token_id,
            temperature=generation_config.temperature,
            num_return_sequences=generation_config.num_return_sequences
        )
    end_time = time.time()

    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text, round(end_time - start_time, 2)
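
Because `output[0]` contains the prompt tokens followed by the generated ones, the prompt can also be removed at the token level rather than from the decoded string. The following is a minimal sketch, not the notebook's method: `decode_new_tokens` and `EchoTokenizer` are hypothetical names, and the stand-in tokenizer exists only so the example runs without downloading a model.

```python
def decode_new_tokens(tokenizer, output_ids, prompt_len):
    """Decode only the tokens that follow the first `prompt_len` prompt tokens."""
    new_ids = output_ids[prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


class EchoTokenizer:
    """Hypothetical stand-in: maps each integer ID to a word for demonstration."""
    vocab = {0: "<s>", 1: "Summarize", 2: "table", 3: "Revenue", 4: "grew", 5: "</s>"}
    special = {0, 5}

    def decode(self, ids, skip_special_tokens=False):
        kept = [i for i in ids if not (skip_special_tokens and i in self.special)]
        return " ".join(self.vocab[i] for i in kept)


prompt_ids = [0, 1, 2]               # tokens of the prompt
output_ids = prompt_ids + [3, 4, 5]  # generate() returns prompt + continuation
print(decode_new_tokens(EchoTokenizer(), output_ids, len(prompt_ids)))
```

With the real objects, the call would look like `decode_new_tokens(tokenizer, output[0], input_ids_prompt["input_ids"].shape[1])`.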

In [None]:
def parse_response(text):
    # Define regex patterns for extracting the title, description, and keywords
    title_pattern = r"'title':\s*'(.*?)'"
    description_pattern = r"'description':\s*'(.*?)'"
    keywords_pattern = r"'keywords':\s*\[(.*?)\]"

    # Extract title, description, and keywords using regex
    title_match = re.search(title_pattern, text)
    description_match = re.search(description_pattern, text)
    keywords_match = re.search(keywords_pattern, text)

    # Check if all parts were found
    if title_match and description_match and keywords_match:
        title = title_match.group(1)
        description = description_match.group(1)
        # Clean up the keywords: strip whitespace first, then the surrounding quotes
        keywords = ', '.join([kw.strip().strip("'") for kw in keywords_match.group(1).split(',')])

        # Return the formatted string
        return f"{title}. {description} {keywords}"
    else:
        return "Error: Could not extract information"
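
The regular expressions above stop at the first single quote, so a description containing an apostrophe would be cut short. When the model returns a syntactically valid Python dictionary, `ast.literal_eval` can parse it directly. The sketch below is one possible alternative under that assumption. `parse_response_dict` is a hypothetical helper that returns `None` when no valid dictionary is found, so the caller could fall back to the regex approach.

```python
import ast
import re


def parse_response_dict(text):
    """Try to read a dict literal from `text`; return a formatted string or None."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)  # first "{" to last "}"
    if match is None:
        return None
    try:
        data = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    if not isinstance(data, dict):
        return None
    if not all(k in data for k in ("title", "description", "keywords")):
        return None
    keywords = ", ".join(str(kw).strip() for kw in data["keywords"])
    return f"{data['title']}. {data['description']} {keywords}"


sample = """{'title': 'Revenue by Year',
 'description': "The table's revenue figures rise each year.",
 'keywords': ['revenue', 'growth']}"""
print(parse_response_dict(sample))
```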

In [None]:
template_short = """
You are a finance expert specialized in analyzing tables.

### Task ###
Summarize the table in a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of relevant keywords.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###
"""

template_long = """
You are a finance expert specialized in analyzing financial tables.

### Task ###
Analyze the table and transcribe its contents into a well-structured summary. Generate a detailed paragraph that captures all relevant information, including numerical data, trends, and observations.
If the table contains acronyms, provide their concise definitions where possible.
Return the output in the form of a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of the most relevant keywords of the table.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###
"""

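The templates are plain strings to which the table is appended. Instruction-tuned models are usually trained on a specific chat format, so one option is to wrap the prompt as a user message and let the tokenizer's chat template add the instruction markers. This is a minimal sketch, assuming the prompt is sent as a single user turn. `build_messages` is a hypothetical helper, and the `apply_chat_template` call is shown only as a comment because it needs the loaded tokenizer.

```python
def build_messages(template: str, table: str) -> list[dict]:
    """Wrap template + table as a single user turn in chat-message form."""
    return [{"role": "user", "content": template.strip() + "\n" + table}]


messages = build_messages(
    "### Task ###\nSummarize the table.\n\n### Table ###",
    "| Year | Revenue |",
)
print(messages[0]["content"])

# With the tokenizer loaded in this notebook, the formatted prompt could be obtained with:
# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

If the chat template is used, the decoded output may no longer start with the exact prompt string, so the string slicing used later to strip the prompt would need to be revisited.
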
In [None]:
def create_summaries(dataset, template, template_name, max_new_tokens=1000):
    generation_config.max_new_tokens = max_new_tokens
    print(f"CREATING SUMMARIES FOR {dataset}, with template: {template}\n\n")

    # If result files already exist, load them so processing resumes where it stopped
    if os.path.exists(f'table_summaries_{dataset}_{template_name}.npy'):
        summaries_npy = np.load(f'table_summaries_{dataset}_{template_name}.npy', allow_pickle=True).item()
        print(f"table_summaries_{dataset}_{template_name}.npy already created!\t It contains: ", len(summaries_npy), " summary tables ", list(summaries_npy.keys()))
    else:
        summaries_npy = {}
    if os.path.exists(f'Timing_{dataset}_{template_name}.npy'):
        timing_npy = np.load(f'Timing_{dataset}_{template_name}.npy', allow_pickle=True).item()
        print(f"Timing_{dataset}_{template_name}.npy already created!\t It contains: ", len(timing_npy), " summary tables ")
    else:
        timing_npy = {}

    # Load the extracted tables to compute the summaries
    table_dict = np.load(f'extracted_tables_{dataset}.npy', allow_pickle=True).item()
    print(f"\nLoaded extracted_tables_{dataset}, it has {len(table_dict)} tables to process")

    count = 0
    print("\n\n")
    for key, value in table_dict.items():
        count += 1
        print(f"\t{count} - Key: {key}")
        # Check if the table has already been processed
        if key in summaries_npy:
            print(f"\t\tSkipping {key}, already processed ...\n")
            continue
        if value == []:
            print(f"\t\tSkipping {key}, no table found ...\n")
            continue

        table = value[0]
        prompt = template + table
        # Tokenize the prompt
        input_ids_prompt = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Call the model
        model_response, total_time = call_model(model, input_ids_prompt, generation_config)
        # Parse the response
        response_text = model_response[len(prompt):]
        parsed_resp = parse_response(response_text)
        print(parsed_resp)
        print(f"Time: {total_time} seconds")
        print("\n\n")

        summaries_npy[key] = parsed_resp
        np.save(f'table_summaries_{dataset}_{template_name}.npy', summaries_npy)
        timing_npy[key] = total_time
        np.save(f'Timing_{dataset}_{template_name}.npy', timing_npy)
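
The timing dictionary saved above maps each table key to its generation time in seconds, which makes it easy to summarize throughput afterwards. This is a small sketch using only the standard library. `summarize_timings` is a hypothetical helper and the sample values are made up for illustration.

```python
import statistics


def summarize_timings(timings: dict) -> dict:
    """Return count, total, mean, and median of per-table generation times (seconds)."""
    values = list(timings.values())
    if not values:
        return {"count": 0, "total": 0.0, "mean": 0.0, "median": 0.0}
    return {
        "count": len(values),
        "total": round(sum(values), 2),
        "mean": round(statistics.mean(values), 2),
        "median": round(statistics.median(values), 2),
    }


# Made-up timings keyed by hypothetical table IDs
example = {"table_a": 10.0, "table_b": 20.0, "table_c": 15.0}
print(summarize_timings(example))
```

On real data it could be called as `summarize_timings(np.load('Timing_FinQA_short.npy', allow_pickle=True).item())`.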

-------------------------------------------------

In [None]:
# create_summaries("FinQA", template_short, "short")

CREATING SUMMARIES FOR FinQA, with template: 
You are a finance expert who specializes in analyzing tables.

### Task ###
Summarize the table in a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of relevant keywords.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###



table_summaries_FinQA_short.npy already created!	 It contains:  160  summary tables  ['d6167ace8', 'd61650380', 'd6166eb32', 'd616785a6', 'd61655808', 'd616540de', 'd616760d0', 'd61660f64', 'd61650524', 'd6165a4d4', 'd6168290c', 'd6165f736', 'd6165cdec', 'd61651758', 'd61658206', 'd61665460', 'd616556b4', 'd6166041a', 'd616749e2', 'd6166c440', 'd61666450', 'd61652ff4', 'd616950b6', 'd61655132', 'd61656fc8', 'd616619d2', 'd6165e656', 'd6165385a', 'd61652536', 'd6167bf94', 'd6165e3e0', 'd61679b40', 'd61657de2', 'd61665db6', 'd6166b2c0', 'd616

In [None]:
dataset = "FinQA"
table_dict = np.load(f'table_summaries_{dataset}_short.npy', allow_pickle=True).item()
print(len(table_dict))

247


------------------------------------------------------------------------------------------------------------------

In [None]:
# create_summaries("MultiHeritt", template_short, "short", max_new_tokens=2000)

CREATING SUMMARIES FOR MultiHeritt, with template: 
You are a finance expert who specializes in analyzing tables.

### Task ###
Summarize the table in a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of relevant keywords.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###



Timing_MultiHeritt_short.npy already created!	 It contains:  1  summary tables 

Loaded extracted_tables_MultiHeritt, it has 876 tables to process



	1 - Key: d8327a646
Total Contractual Obligations by Time Horizon. This table shows the total contractual obligations for various financial items, categorized by time horizon. The table highlights the significant long-term debt and purchase obligations not recorded on the Consolidated Statements of Financial Position. debt,  'obligations,  'purchase,  'contractual,  'time horizon,  'fina

In [None]:
dataset = "MultiHeritt"
table_dict = np.load(f'table_summaries_{dataset}_short.npy', allow_pickle=True).item()
len(table_dict)

551
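
The number of stored summaries can be lower than the number of extracted tables, for instance because entries with an empty table list are skipped, and a stored entry may also hold the parser's error string. The sketch below tallies these cases. `progress_report` is a hypothetical helper and the dictionaries are toy data, not the notebook's results.

```python
# String returned by parse_response when extraction fails
ERROR_TEXT = "Error: Could not extract information"


def progress_report(tables: dict, summaries: dict) -> dict:
    """Count tables that are empty, summarized, failed to parse, or still pending."""
    report = {"empty": 0, "summarized": 0, "failed": 0, "pending": 0}
    for key, value in tables.items():
        if key in summaries:
            status = "failed" if summaries[key] == ERROR_TEXT else "summarized"
        elif not value:
            status = "empty"
        else:
            status = "pending"
        report[status] += 1
    return report


tables = {"t1": ["| a | b |"], "t2": [], "t3": ["| c | d |"], "t4": ["| e | f |"]}
summaries = {"t1": "Title. Description. kw", "t3": ERROR_TEXT}
print(progress_report(tables, summaries))
```

Keys reported as failed are never retried by the loop, since any key already present in the summaries file is skipped.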

----------------------------------------------------------------------------

In [None]:
# create_summaries("TATQA", template_short, "short")

CREATING SUMMARIES FOR TATQA, with template: 
You are a finance expert who specializes in analyzing tables.

### Task ###
Summarize the table in a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of relevant keywords.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###




Loaded extracted_tables_TATQA, it has 248 tables to process



	1 - Key: d1a71456c
Financial Analysis of Trade Payables and Payments. This table provides a comparison of trade payables, payments received on account, other taxes and social security costs, other payables, accruals, government grants, and non-current payables for the years 2018 and 2019. trade payables,  'payments received on account,  'other taxes,  'other payables,  'accruals,  'government grants,  'non-current payables,  '2018,  '2019
Time: 15.78 seconds



	2 - Key: d1a71

In [None]:
dataset = "TATQA"
table_dict = np.load(f'table_summaries_{dataset}_short.npy', allow_pickle=True).item()
len(table_dict)

248

----------------------------------------------------------------------------

In [None]:
# create_summaries("FinQA", template_long, "long")

CREATING SUMMARIES FOR FinQA, with template: 
You are a finance expert specialized in analyzing financial tables.

### Task ###
Analyze the table and transcribe its contents into a well-structured summary. Generate a detailed paragraph that captures all relevant information, including numerical data, trends, and observations.
If the table contains acronyms, provide their concise definitions where possible.
Return the output in the form of a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of the most relevant keywords of the table.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###




Loaded extracted_tables_FinQA, it has 247 tables to process



	1 - Key: d6167ace8
Revenue (in millions) for the years ending December 31, 2018-2022. The table displays the revenue figures (in millions) for the years 2018 to 

In [None]:
dataset = "FinQA"
table_dict = np.load(f'table_summaries_{dataset}_long.npy', allow_pickle=True).item()
len(table_dict)

247

-----------------------------------------------------------------------------------

In [None]:
create_summaries("MultiHeritt", template_long, "long")

CREATING SUMMARIES FOR MultiHeritt, with template: 
You are a finance expert specialized in analyzing financial tables.

### Task ###
Analyze the table and transcribe its contents into a well-structured summary. Generate a detailed paragraph that captures all relevant information, including numerical data, trends, and observations.
If the table contains acronyms, provide their concise definitions where possible.
Return the output in the form of a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of the most relevant keywords of the table.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###




Loaded extracted_tables_MultiHeritt, it has 876 tables to process



	1 - Key: d8327a646
Total Contractual Obligations of a Company. The table presents the total contractual obligations of a company in millions of dolla

In [None]:
dataset = "MultiHeritt"
table_dict = np.load(f'table_summaries_{dataset}_long.npy', allow_pickle=True).item()
len(table_dict)

551

----------------------------------------------------------------------------

In [None]:
create_summaries("TATQA", template_long, "long")

CREATING SUMMARIES FOR TATQA, with template: 
You are a finance expert specialized in analyzing financial tables.

### Task ###
Analyze the table and transcribe its contents into a well-structured summary. Generate a detailed paragraph that captures all relevant information, including numerical data, trends, and observations.
If the table contains acronyms, provide their concise definitions where possible.
Return the output in the form of a dictionary with the following keys:
- 'title': A concise title for the table.
- 'description': A brief description of the table's contents and key trends.
- 'keywords': A list of the most relevant keywords of the table.

**Note**: Do not return the table, prompt, or any extra text. Return only the dictionary.

### Table ###



table_summaries_TATQA_long.npy already created!	 It contains:  80  summary tables  ['d1a71456c', 'd1a719738', 'd1a725a2e', 'd1a718d56', 'd1a717e60', 'd1a72b690', 'd1a71726c', 'd1a7303b6', 'd1a7244a8', 'd1a71b272', 'd1a73a852',

In [None]:
dataset = "TATQA"
table_dict = np.load(f'table_summaries_{dataset}_long.npy', allow_pickle=True).item()
len(table_dict)

248

----------------------------------------------------------------------------