We need to convert our data to the SQA format and save it as a csv/tsv file for fine-tuning. The file needs the following columns:

id: optional, id of the table-question pair, for bookkeeping purposes.

annotator: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.

position: integer indicating if the question is the first, second, third,… related to the table. Only required in case of conversational setup (SQA). You don’t need this column in case you’re going for WTQ/WikiSQL-supervised.

question: string

table_file: string, name of a csv file containing the tabular data
answer_coordinates: list of one or more tuples (each tuple being a cell coordinate, i.e. row, column pair that is part of the answer)

answer_text: list of one or more strings (each string being a cell value that is part of the answer)
aggregation_label: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)

float_answer: the float answer to the question, if there is one (np.nan if there isn’t). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)

The tables referred to in the table_file column should be saved as csv files in a folder.
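
As a rough illustration of the layout described above, the sketch below writes one SQA-style row to a tab-separated buffer using only the standard library and reads it back. The column names follow the list above; the id `example-0`, the coordinate `(0, 11)`, and the file name in the comment are placeholders, not values taken from the dataset.

```python
import ast
import csv
import io

# Column names follow the list above; the row values are illustrative only.
FIELDS = ["id", "annotator", "position", "question", "table_file",
          "answer_coordinates", "answer_text"]

row = {
    "id": "example-0",  # hypothetical id
    "annotator": "0",
    "position": "0",
    "question": "What's the most frequent eye color?",
    "table_file": "003_Love.csv",
    # Lists are stored as Python literals so they can be parsed back later.
    "answer_coordinates": str([(0, 11)]),  # placeholder (row, column) pair
    "answer_text": str(["Brown"]),
}

# Swap the buffer for open("sqa_train.tsv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t")
writer.writeheader()
writer.writerow(row)

# Read the row back and turn the list columns into Python objects again.
buffer.seek(0)
loaded = next(csv.DictReader(buffer, delimiter="\t"))
coords = ast.literal_eval(loaded["answer_coordinates"])
texts = ast.literal_eval(loaded["answer_text"])
```

Storing the list columns as literals keeps the file human-readable, at the cost of one `ast.literal_eval` call per list column when loading.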

In [1]:
import os
import pandas as pd
from datasets import load_dataset
from transformers import TapasTokenizer

In [2]:
# Load in all qa (train and dev)
semeval_train_qa = load_dataset("cardiffnlp/databench", name="semeval", split="train")
semeval_dev_qa = load_dataset("cardiffnlp/databench", name="semeval", split="dev")

In [3]:
# RERUN THIS LATER TO REMOVE INDEX, JUST NEED FOR MANUAL ANSWER COORDS

##### load the selected tables (pandas dataframes) and their qa pairs #####
df_ids = ['003_Love',
        '025_Data',
        '034_World', 
        '042_Predict',
        '054_Joe',
        '056_Emoji',
        '064_Clustering'
        ] # these are the ids of the dataframes that have under 512 rows

qa_dict = {} # dict to store all qa 
output_folder = os.getcwd()
for table in df_ids:
    print('Processing: ', table)
    csv_file_path = os.path.join(output_folder, f"{table}.csv")

    try:
        # Save the all.parquet dataframe as CSV unless it is already on disk
        if os.path.exists(csv_file_path):
            print(f"CSV for ID {table} already exists. Skipping download...")
        else:
            df = pd.read_parquet(f"hf://datasets/cardiffnlp/databench/data/{table}/all.parquet")
            df.to_csv(csv_file_path, index=False)
            print(f"Saved CSV for ID {table} at {csv_file_path}.")

        # Always load the qa.parquet dataframe, otherwise qa_dict stays empty
        # on reruns and the later qa_dict[...] lookups raise KeyError
        qa = pd.read_parquet(f"hf://datasets/cardiffnlp/databench/data/{table}/qa.parquet")
        qa_dict[table] = qa

    except Exception as e:
        print(f"Error processing ID {table}: {e}")

Processing:  003_Love
Saved CSV for ID 003_Love at /Users/carterlouchheim/Desktop/CS375/final/Tabular_Data_QA/data/003_Love.csv.
Processing:  025_Data
Saved CSV for ID 025_Data at /Users/carterlouchheim/Desktop/CS375/final/Tabular_Data_QA/data/025_Data.csv.
Processing:  034_World
Saved CSV for ID 034_World at /Users/carterlouchheim/Desktop/CS375/final/Tabular_Data_QA/data/034_World.csv.
Processing:  042_Predict
Saved CSV for ID 042_Predict at /Users/carterlouchheim/Desktop/CS375/final/Tabular_Data_QA/data/042_Predict.csv.
Processing:  054_Joe
Saved CSV for ID 054_Joe at /Users/carterlouchheim/Desktop/CS375/final/Tabular_Data_QA/data/054_Joe.csv.
Processing:  056_Emoji
Saved CSV for ID 056_Emoji at /Users/carterlouchheim/Desktop/CS375/final/Tabular_Data_QA/data/056_Emoji.csv.
Processing:  064_Clustering
Saved CSV for ID 064_Clustering at /Users/carterlouchheim/Desktop/CS375/final/Tabular_Data_QA/data/064_Clustering.csv.


In [4]:
# aggregation operators
aggregation_ops = ['SUM', 'COUNT', 'AVERAGE', 'NONE'] # adding new operators would be hard
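
Since aggregation_label is an index into the operator list, the order of that list matters. The sketch below builds a name-to-index lookup for the list as written in the cell above. It also shows the ordering that, to my knowledge, the fine-tuned WTQ and WikiSQL TAPAS checkpoints use; that second list is an assumption and should be checked against `model.config.aggregation_labels` before relying on it.

```python
# Operator names in the order used in the cell above.
aggregation_ops = ["SUM", "COUNT", "AVERAGE", "NONE"]

# aggregation_label is the position of the operator in the chosen ordering.
op_to_label = {op: idx for idx, op in enumerate(aggregation_ops)}

# Assumption: ordering used by the fine-tuned WTQ/WikiSQL checkpoints.
# Verify against model.config.aggregation_labels for the checkpoint in use.
assumed_checkpoint_ops = ["NONE", "SUM", "AVERAGE", "COUNT"]
checkpoint_label = {op: idx for idx, op in enumerate(assumed_checkpoint_ops)}

# The same operator can map to different indices under the two orderings.
mismatched = [op for op in aggregation_ops if op_to_label[op] != checkpoint_label[op]]
```

If the labels are meant for a pretrained aggregation head, the ordering has to match that head; for a freshly initialized head, any fixed ordering works as long as it is used consistently.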

In [48]:
# assign all of the qa tables
# for each one, the answer coordinates must be assigned manually to every qa row
love_qa = qa_dict[df_ids[0]] # DONE
data_qa = qa_dict[df_ids[1]]
world_qa = qa_dict[df_ids[2]]
predict_qa = qa_dict[df_ids[3]]
joe_qa = qa_dict[df_ids[4]]
emoji_qa = qa_dict[df_ids[5]]
clustering_qa = qa_dict[df_ids[6]]

In [50]:
# keep only the number and category questions; .copy() avoids SettingWithCopyWarning later
excluded_types = ['boolean', 'list[category]', 'list[number]']
love_qa = love_qa[~love_qa['type'].isin(excluded_types)].copy()
love_qa

Unnamed: 0,question,answer,type,columns_used,column_types,sample_answer,dataset,answer_coords
4,How many unique nationalities are present in t...,13,number,"[What's your nationality?""]""",['category'],1,003_Love,"[(2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5..."
5,What is the average gross annual salary?,56332.81720430108,number,['Gross annual salary (in euros) 💸'],['number[UInt32]'],62710.0,003_Love,"[(7, 0), (7, 1), (7, 2), (7, 3), (7, 4), (7, 5..."
6,How many respondents wear glasses all the time?,98,number,['How often do you wear glasses? 👓'],['category'],5,003_Love,"[(16, 0), (16, 2), (16, 3), (16, 5), (16, 6), ..."
7,What's the median age of the respondents?,33.0,number,['What is your age? 👶🏻👵🏻'],['number[uint8]'],32.5,003_Love,"[(1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5..."
8,What is the most common level of studies achie...,Master,category,['What is the maximum level of studies you hav...,['category'],Master,003_Love,"[(1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5..."
9,Which body complexity has the least number of ...,Very thin,category,['What is your body complexity? 🏋️'],['category'],Obese,003_Love,"[(10, 0), (10, 1), (10, 2), (10, 3), (10, 4), ..."
10,What's the most frequent eye color?,Brown,category,['What is your eye color? 👁️'],['category'],Brown,003_Love,"[(11, 0), (11, 1), (11, 2), (11, 3), (11, 4), ..."
11,Which sexual orientation has the highest repre...,Heterosexual,category,['What's your sexual orientation?'],['category'],Heterosexual,003_Love,"[(4, 0), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5..."


In [51]:
love_df = pd.read_csv(f"{df_ids[0]}.csv")
all_rows = list(range(len(love_df))) # use when the answer spans every row of the table
love_ans_coords = []
love_float_ans = [] # maybe have to add this

'''# QA 0
qa_0 = [(row, 1) for row in all_rows]
love_ans_coords.append(qa_0)

# QA 1
qa_1 = [(row, 3) for row in all_rows]
love_ans_coords.append(qa_1)

# QA 2
qa_2 = [(row, 8) for row in all_rows]
love_ans_coords.append(qa_2)

# QA 3
qa_3 = [(row, 12) for row in all_rows]
love_ans_coords.append(qa_3)'''

# QA 4
qa_4 = [(row, 2) for row in all_rows]
love_ans_coords.append(qa_4)

# QA 5
qa_5 = [(row, 7) for row in all_rows]
love_ans_coords.append(qa_5)

# QA 6
row_indices = love_df.index[love_df['How often do you wear glasses? 👓'] == 'Constantly'].tolist()
qa_6 = [(row, 16) for row in row_indices]
love_ans_coords.append(qa_6)

# QA 7
qa_7 = [(row, 1) for row in all_rows]
love_ans_coords.append(qa_7)

# QA 8
qa_8 = [(row, 1) for row in all_rows]
love_ans_coords.append(qa_8)

# QA 9
qa_9 = [(row, 10) for row in all_rows]
love_ans_coords.append(qa_9)

# QA 10
qa_10 = [(row, 11) for row in all_rows]
love_ans_coords.append(qa_10)

# QA 11
qa_11 = [(row, 4) for row in all_rows]
love_ans_coords.append(qa_11)

'''# QA 12
qa_12 = [(row, 33) for row in all_rows]
love_ans_coords.append(qa_12)

# QA 13
qa_13 = [(row, 14) for row in all_rows]
love_ans_coords.append(qa_13)

# QA 14
qa_14 = [(row, 3) for row in all_rows]
love_ans_coords.append(qa_14)

# QA 15
qa_15 = [(row, 12) for row in all_rows]
love_ans_coords.append(qa_15)

# QA 16
row_indices = love_df['Gross annual salary (in euros) 💸'].nlargest(4).index.tolist()
qa_16 = [(row, 7) for row in row_indices]
love_ans_coords.append(qa_16)

# QA 17
row_indices = love_df['Happiness scale'].nsmallest(3).index.tolist()
qa_17 = [(row, 32) for row in row_indices]
love_ans_coords.append(qa_17)

# QA 18
row_indices = love_df['What is your age? 👶🏻👵🏻'].nlargest(5).index.tolist()
qa_18 = [(row, 1) for row in row_indices]
love_ans_coords.append(qa_18)

# QA 19
row_indices = love_df['What is your skin tone?'].nlargest(5).index.tolist()
qa_19 = [(row, 13) for row in row_indices]
love_ans_coords.append(qa_19)'''

# add the coordinates as a new column; assign returns a new frame, which avoids SettingWithCopyWarning
love_qa = love_qa.assign(answer_coords=love_ans_coords)
qa_dict[df_ids[0]] = love_qa
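
The hard-coded column numbers above are easy to get wrong. A small helper that looks the column up by name may be less error-prone. The sketch below is one possible shape for it; `column_coords` is a hypothetical helper, not part of the notebook, and the three-column table is a toy stand-in.

```python
def column_coords(columns, n_rows, column_name, rows=None):
    """Return (row, column) tuples for one column, over all rows or a chosen subset."""
    col = list(columns).index(column_name)  # raises ValueError if the name is missing
    selected = range(n_rows) if rows is None else rows
    return [(row, col) for row in selected]

# Toy stand-in for a table: three columns, four rows.
columns = ["id", "age", "glasses"]
glasses = ["Constantly", "Rarely", "Constantly", "Never"]

# Answer spans the whole "age" column.
all_age = column_coords(columns, 4, "age")

# Answer spans only the rows where the value is "Constantly".
constant_rows = [i for i, value in enumerate(glasses) if value == "Constantly"]
constant = column_coords(columns, 4, "glasses", rows=constant_rows)
```

With pandas, the same call would take `love_df.columns` and `len(love_df)`, so a renamed or reordered column fails loudly instead of silently pointing at the wrong cells.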

In [None]:
# data qa manual
all_rows = list(range(len(pd.read_csv(f"{df_ids[1]}.csv")))) # use when the answer spans every row of the table
# QA 0
# QA 1
# QA 2
# QA 3
# QA 4
# QA 5
# QA 6
# QA 7
# QA 8
# QA 9
# QA 10
# QA 11
# QA 12
# QA 13
# QA 14
# QA 15
# QA 16
# QA 17
# QA 18
# QA 19


In [34]:
# world qa manual
all_rows = list(range(len(pd.read_csv(f"{df_ids[2]}.csv")))) # use when the answer spans every row of the table
# QA 0
# QA 1
# QA 2
# QA 3
# QA 4
# QA 5
# QA 6
# QA 7
# QA 8
# QA 9
# QA 10
# QA 11
# QA 12
# QA 13
# QA 14
# QA 15
# QA 16
# QA 17
# QA 18
# QA 19

In [35]:
# predict qa manual
all_rows = list(range(len(pd.read_csv(f"{df_ids[3]}.csv")))) # use when the answer spans every row of the table
# QA 0
# QA 1
# QA 2
# QA 3
# QA 4
# QA 5
# QA 6
# QA 7
# QA 8
# QA 9
# QA 10
# QA 11
# QA 12
# QA 13
# QA 14
# QA 15
# QA 16
# QA 17
# QA 18
# QA 19

In [52]:
# example of how to load in model and then format the data
model_name = "google/tapas-base"
tokenizer = TapasTokenizer.from_pretrained(model_name)

test = qa_dict[df_ids[0]]

table = love_df.astype(str)
queries = list(test['question'])
answer_coords = list(test['answer_coords'])
answer_text = [[str(a)] for a in test['answer']] # batched input expects one list of strings per question



In [53]:
inputs = tokenizer(
    table = table,
    queries = queries,
    answer_coordinates = answer_coords,
    answer_text = answer_text,
    padding = "max_length",  # padding takes a bool or a strategy name, not an integer
    truncation=True,  
    return_tensors = "pt",
    return_overflowing_tokens=True,
)

print(inputs.keys())

  text = normalize_for_match(row[col_index].text)
  cell = row[col_index]


ValueError: Couldn't find all answers
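
One plausible cause of this error, which the notebook does not confirm, is that truncation drops table rows to fit the model's input length, so coordinates pointing at dropped rows can no longer be found. The sketch below separates coordinates that fall inside a given number of kept rows from those that do not. `split_coords` is a hypothetical helper, and the row counts are made up.

```python
def split_coords(coords, n_rows, n_cols):
    """Separate coordinates inside an n_rows x n_cols table from those outside it."""
    inside, outside = [], []
    for row, col in coords:
        in_bounds = 0 <= row < n_rows and 0 <= col < n_cols
        (inside if in_bounds else outside).append((row, col))
    return inside, outside

# Hypothetical case: only the first 3 rows of a 5-row, 4-column table were kept.
coords = [(row, 2) for row in range(5)]
kept, dropped = split_coords(coords, n_rows=3, n_cols=4)
```

If `dropped` is non-empty for the number of rows that actually survive tokenization, shrinking the table or restricting the coordinates to the kept rows would be worth trying.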

In [55]:
from transformers import TapasTokenizer

class DebugTapasTokenizer(TapasTokenizer):
    """TapasTokenizer that reports which answer coordinates could not be located."""

    def _get_all_answer_ids(self, column_ids, row_ids, answer_coordinates):
        # Delegate to the parent so the return value keeps the shape the tokenizer expects
        answer_ids, missing_count = super()._get_all_answer_ids(
            column_ids, row_ids, answer_coordinates
        )

        if missing_count > 0:
            # Assumption: table tokens carry 1-based row/column ids (id 0 is used for
            # question tokens), while answer_coordinates are 0-based (row, column) pairs
            present = {(row_id - 1, col_id - 1) for row_id, col_id in zip(row_ids, column_ids)}
            missing_answers = [
                tuple(coord) for coord in answer_coordinates if tuple(coord) not in present
            ]
            print(f"Missing answers: {missing_answers}")

        return answer_ids, missing_count

In [57]:
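tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")
# DebugTapasTokenizer(tokenizer) fails with the TypeError recorded below, because the
# constructor expects a vocab file path; load the subclass with from_pretrained instead
debug_tokenizer = DebugTapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")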

TypeError: stat: path should be string, bytes, os.PathLike or integer, not TapasTokenizer

In [None]:
# things to try
# fudge the answer coords for the majority-type questions and just point to a single cell that holds the majority value
# ensure that the answer coords are in the right form and point to the right value
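
Both ideas can be sketched on a plain list-of-rows table. The helpers below are hypothetical and not part of the notebook: `majority_cell` picks the first cell holding a column's most common value, and `coords_match_text` checks that every coordinate is in range and that its cell text appears in the answer text.

```python
from collections import Counter

def majority_cell(rows, col):
    """Return ((row, col), value) for the first cell holding the column's most common value."""
    values = [str(r[col]) for r in rows]
    top_value, _ = Counter(values).most_common(1)[0]
    return (values.index(top_value), col), top_value

def coords_match_text(rows, coords, answer_text):
    """True when every coordinate is in range and its cell text appears in answer_text."""
    wanted = {str(text) for text in answer_text}
    for row, col in coords:
        if not (0 <= row < len(rows) and 0 <= col < len(rows[row])):
            return False
        if str(rows[row][col]) not in wanted:
            return False
    return True

# Toy table: the second column has "Brown" as its most common value.
rows = [["a", "Brown"], ["b", "Blue"], ["c", "Brown"]]
coord, value = majority_cell(rows, col=1)
ok = coords_match_text(rows, [coord], [value])
bad = coords_match_text(rows, [(1, 1)], [value])
```

With pandas, `love_df.astype(str).values.tolist()` would give the list-of-rows form used here. Pointing at a single matching cell also keeps the coordinate inside the first rows of the table more often, which may help if truncation is the cause of the missing answers.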