We need to convert our data to the SQA format and save it as a csv/tsv file for fine-tuning. The file needs the following columns:

id: optional, id of the table-question pair, for bookkeeping purposes.

annotator: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.

position: integer indicating if the question is the first, second, third,… related to the table. Only required in case of conversational setup (SQA). You don’t need this column in case you’re going for WTQ/WikiSQL-supervised.

question: string

table_file: string, name of a csv file containing the tabular data

answer_coordinates: list of one or more tuples (each tuple being a cell coordinate, i.e. row, column pair that is part of the answer)

answer_text: list of one or more strings (each string being a cell value that is part of the answer)

aggregation_label: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)

float_answer: the float answer to the question, if there is one (np.nan if there isn’t). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)

The tables referred to in the table_file column should be saved as csv files in a folder.
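As a rough sketch of the layout described above, the snippet below saves one table into a folder and writes a single SQA-style row to a tsv file. The folder name `tables`, the file names, and the toy data are all assumptions made for illustration, not part of the dataset.

```python
import os
import tempfile

import pandas as pd

# Toy table; names and values are illustrative assumptions.
table = pd.DataFrame({"name": ["A", "B"], "age": ["30", "19"]})

out_dir = tempfile.mkdtemp()
tables_dir = os.path.join(out_dir, "tables")  # hypothetical folder name
os.makedirs(tables_dir, exist_ok=True)

# Save the table as a csv file inside the tables folder.
table_file = "table_0.csv"
table.to_csv(os.path.join(tables_dir, table_file), index=False)

# One row per table-question pair, using the column names listed above.
sqa_rows = pd.DataFrame(
    [
        {
            "id": "q-0",
            "annotator": 0,
            "position": 0,
            "question": "What is the age of the youngest person?",
            "table_file": table_file,
            "answer_coordinates": [(1, 1)],  # row 1, column 1 holds "19"
            "answer_text": ["19"],
            "float_answer": 19.0,
        }
    ]
)

# Lists are written as their string representation in the tsv file.
tsv_path = os.path.join(out_dir, "train.tsv")
sqa_rows.to_csv(tsv_path, sep="\t", index=False)
```

Because the list-valued columns end up as strings in the file, they need to be parsed back (for example with `ast.literal_eval`) when the file is read again.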

In [79]:
from datasets import load_dataset
import pandas as pd
from transformers import TapasTokenizer

In [14]:
# Load in all qa (train and dev)
semeval_train_qa = load_dataset("cardiffnlp/databench", name="semeval", split="train")
semeval_dev_qa = load_dataset("cardiffnlp/databench", name="semeval", split="dev")

In [24]:
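# get all unique dataset names (sorted so the order is stable across runs)
df_ids = sorted(set(semeval_train_qa["dataset"]))

In [94]:
# load in the forbes dataframe (pandas dataframes)
# select by name: indexing into an unordered set can return a different dataset on each run
forbes_id = "001_Forbes"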
forbes_df = pd.read_parquet(f"hf://datasets/cardiffnlp/databench/data/{forbes_id}/all.parquet")
forbes_qa = pd.read_parquet(f"hf://datasets/cardiffnlp/databench/data/{forbes_id}/qa.parquet")
forbes_sample_df = pd.read_parquet(f"hf://datasets/cardiffnlp/databench/data/{forbes_id}/sample.parquet")

In [95]:
# filter to only questions that have a numerical answer
forbes_qa_num = forbes_qa[forbes_qa['type'] == 'number']
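# local index convention: 0=SUM, 1=COUNT, 2=AVERAGE, 3=NONE
# (the fine-tuned TAPAS WTQ checkpoints appear to use 0=NONE, 1=SUM, 2=AVERAGE, 3=COUNT, so check the model config before training)
aggregation_ops = ['SUM', 'COUNT', 'AVERAGE', 'NONE']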

In [96]:
# inspect the numerical questions that still need answer coordinates
forbes_qa_num

Unnamed: 0,question,answer,type,columns_used,column_types,sample_answer,dataset
5,What is the age of the youngest billionaire?,19.0,number,['age'],['number[UInt8]'],32.0,001_Forbes
6,How many billionaires are there from the 'Tech...,343.0,number,['category'],['category'],0.0,001_Forbes
7,What's the total worth of billionaires in the ...,583600.0,number,"['category', 'finalWorth']","['category', 'number[uint32]']",0.0,001_Forbes
8,How many billionaires have a philanthropy scor...,25.0,number,['philanthropyScore'],['number[UInt8]'],0.0,001_Forbes
9,What's the rank of the wealthiest non-self-mad...,3.0,number,"['selfMade', 'rank']","['boolean', 'number[uint16]']",288.0,001_Forbes


In [97]:
# find all of the answer coordinates

# youngest billionaire
row_index = forbes_df['age'].idxmin()
col_index = forbes_df.columns.get_loc('age')
location_one = [(row_index, col_index)]

# number of tech billionaires
row_index = forbes_df.index[forbes_df['category'] == 'Technology'].tolist()
col_index = forbes_df.columns.get_loc('category')
location_two = [(row, col_index) for row in row_index]

# total worth of billionaires in the Automotive category:
# select the finalWorth cells of the matching rows
# (the label 'Automotive' is an assumption; check forbes_df['category'].unique())
row_index = forbes_df.index[forbes_df['category'] == 'Automotive'].tolist()
col_index = forbes_df.columns.get_loc('finalWorth')
location_three = [(row, col_index) for row in row_index]

# number of billionaires with philanthropy score over 3
row_index = forbes_df.index[forbes_df['philanthropyScore'] > 3].tolist()
col_index = forbes_df.columns.get_loc('philanthropyScore')
location_four = [(row, col_index) for row in row_index]

# rank of wealthiest non-self-made billionaire
row_index = forbes_df.loc[~forbes_df['selfMade'], 'finalWorth'].idxmax()
col_index = forbes_df.columns.get_loc('rank')
location_five = [(row_index, col_index)]

answer_coords = [location_one, location_two, location_three, location_four, location_five]
# take an explicit copy so the assignments below do not trigger SettingWithCopyWarning
forbes_qa_num = forbes_qa_num.copy()
forbes_qa_num['answer_coords'] = answer_coords

# indices into ['SUM', 'COUNT', 'AVERAGE', 'NONE']; the cell lookups (first and last question) use NONE
agg_ops = [3, 1, 0, 1, 3]
forbes_qa_num['agg_ops'] = agg_ops
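The coordinate lookups in this cell all follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible version; `coords_for_mask` is a hypothetical name and the toy table is made up. It returns positional row numbers instead of index labels, on the assumption that the coordinates should refer to positions in the table. The two only coincide when the DataFrame has a default index.

```python
import numpy as np
import pandas as pd


def coords_for_mask(df, mask, column):
    """Return (row, column) positions for every row where mask is True."""
    col_index = df.columns.get_loc(column)
    # flatnonzero yields positional row numbers, independent of the index labels
    rows = np.flatnonzero(np.asarray(mask))
    return [(int(row), col_index) for row in rows]


# Toy table with non-default index labels, to show that positions are used.
toy = pd.DataFrame(
    {
        "category": ["Technology", "Automotive", "Technology"],
        "finalWorth": [10, 20, 30],
    },
    index=[100, 101, 102],
)

# Cells that a COUNT over the Technology rows would select.
tech_coords = coords_for_mask(toy, toy["category"] == "Technology", "category")

# Cells that a SUM over the Automotive rows' worth would select.
auto_worth_coords = coords_for_mask(toy, toy["category"] == "Automotive", "finalWorth")
```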


In [104]:
# cast only the text columns to str; casting the whole frame would also turn the answer_coords lists into strings
forbes_qa_num = forbes_qa_num.assign(
    question=forbes_qa_num['question'].astype(str),
    answer=forbes_qa_num['answer'].astype(str),
)
# the table itself still holds non-string dtypes
forbes_df.dtypes

rank                              uint16
personName                      category
age                              float64
finalWorth                        uint32
category                        category
source                          category
country                         category
state                           category
city                            category
organization                    category
selfMade                            bool
gender                          category
birthDate            datetime64[us, UTC]
title                           category
philanthropyScore                float64
bio                               object
about                             object
dtype: object

In [102]:
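# example of how to load in the tokenizer and then format the data
model_name = "google/tapas-base"
tokenizer = TapasTokenizer.from_pretrained(model_name)

# TapasTokenizer expects every table cell to be a string
table = forbes_df.astype(str)
# passing a pandas Series as queries raises:
#   ValueError: queries input must of type `str` (single example), `List[str]` (batch or single pretokenized example).
# so convert each column to a plain Python list
queries = forbes_qa_num['question'].tolist()
answer_coords = forbes_qa_num['answer_coords'].tolist()
# one list of answer strings per question
answer_text = [[str(answer)] for answer in forbes_qa_num['answer']]

# note: the full table may exceed the model's token limit, in which case it has to be reduced first
inputs = tokenizer(
    table=table,
    queries=queries,
    answer_coordinates=answer_coords,
    answer_text=answer_text,
    padding="max_length",
    return_tensors="pt",
)

inputs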
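The ValueError quoted in this cell shows that the tokenizer wants plain Python lists, not pandas Series. A minimal sketch of that conversion is below. `to_tokenizer_args` is a hypothetical helper, and the column names `question`, `answer` and `answer_coords` are assumed to match the QA frame built earlier. It only reshapes the data and does not call the tokenizer.

```python
import pandas as pd


def to_tokenizer_args(qa):
    """Convert QA DataFrame columns into plain lists of Python objects."""
    return {
        # List[str], one question per example
        "queries": [str(q) for q in qa["question"]],
        # one list of (row, column) tuples per question
        "answer_coordinates": [
            [tuple(coord) for coord in coords] for coords in qa["answer_coords"]
        ],
        # one list of answer strings per question
        "answer_text": [[str(a)] for a in qa["answer"]],
    }


# Toy QA frame; the values are made up for illustration.
toy_qa = pd.DataFrame(
    {
        "question": ["What is the lowest age?"],
        "answer": [19.0],
        "answer_coords": [[(1, 1)]],
    }
)

args = to_tokenizer_args(toy_qa)
```

The resulting dictionary could then be unpacked into the tokenizer call together with the table, for example `tokenizer(table=table, **args)`.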