## Introduction: TAPAS

* Original TAPAS paper (ACL 2020): https://www.aclweb.org/anthology/2020.acl-main.398/
* Follow-up paper on intermediate pre-training (EMMNLP Findings 2020): https://www.aclweb.org/anthology/2020.findings-emnlp.27/
* Original Github repository: https://github.com/google-research/tapas
* Blog post: https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html

TAPAS is an algorithm that (among other tasks) can answer questions about tabular data. It is essentially a BERT model with relative position embeddings and additional token type ids that encode tabular structure, and 2 classification heads on top: one for **cell selection** and one for (optionally) performing an **aggregation** among selected cells (such as summing or counting).

Similar to BERT, the base `TapasModel` is pre-trained using the masked language modeling (MLM) objective on a large collection of tables from Wikipedia and associated texts. In addition, the authors further pre-trained the model on an second task (table entailment) to increase the numerical reasoning capabilities of TAPAS (as explained in the follow-up paper), which further improves performance on downstream tasks. 

In this notebook, we are going to fine-tune `TapasForQuestionAnswering` on [Sequential Question Answering (SQA)](https://www.microsoft.com/en-us/research/publication/search-based-neural-structured-learning-sequential-question-answering/), a dataset built by Microsoft Research which deals with asking questions related to a table in a **conversational set-up**. We are going to do so as in the original paper, by adding a randomly initialized cell selection head on top of the pre-trained base model (note that SQA does not have questions that involve aggregation and hence no aggregation head), and then fine-tuning them altogether.

First, we install both the Transformers library as well as the dependency on [`torch-scatter`](https://github.com/rusty1s/pytorch_scatter), which the model requires.

In [1]:
! rm -r transformers
! git clone https://github.com/huggingface/transformers.git
! cd transformers
! pip install ./transformers

override r--r--r--  hannahcatri/staff for transformers/.git/objects/pack/pack-598f941cc1528133314cb220793d93512a7689d9.idx? ^C
fatal: destination path 'transformers' already exists and is not an empty directory.
[31mERROR: Directory './transformers' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.[0m


In [2]:
! pip install torch-scatter==latest+cu101 -f https://pytorch-geometric.com/whl/torch-1.7.0.html

Looking in links: https://pytorch-geometric.com/whl/torch-1.7.0.html
[31mERROR: Could not find a version that satisfies the requirement torch-scatter==latest+cu101[0m
[31mERROR: No matching distribution found for torch-scatter==latest+cu101[0m


We also install a small portion from the SQA training dataset, for demonstration purposes. This is a TSV file containing table-question pairs. Besides this, we also download the `table_csv` directory, which contains the actual tabular data.

Note that you can download the entire SQA dataset on the [official website](https://www.microsoft.com/en-us/download/details.aspx?id=54253).

## Prepare the data 

Let's look at the first few rows of the dataset:

In [1]:
import pandas as pd

data = pd.read_excel("sqa_train_set_sfcfs.xlsx")
data.head()

Unnamed: 0,id,annotator,position,question,table_file,answer_coordinates,answer_text
0,sf-2,0,0,which incidents are in mission bay?,table_csv/2022_0117_locations.csv,"['(0,0)', '(11,0)', '(37,0)', '(94,0)', '(283,...","['220170001', '220170053', '220170263', '22017..."
1,sf-2,0,1,"of those, which one has a Berry St address?",table_csv/2022_0117_locations.csv,"[ '(283,0)']",[ '220172434' ]
2,sf-3,1,0,which are the incidents in station area 31?,table_csv/2022_0117_locations.csv,"['(56,0)', '(114,0)', '(141,0)', '(170,0)', '(...","['220170355', '220170807', '220171052', '22017..."
3,sf-3,1,1,"of these, which neighbourhoods are they in?",table_csv/2022_0117_locations.csv,"['(56,3)', '(114,3)', '(141,3)', '(170,3)', '(...","['Golden Gate Park', 'Inner Richmond', 'Inner ..."
4,sf-3,1,2,"and of those, which neighborhood has the highe...",table_csv/2022_0117_locations.csv,"['(345,3)']",['Outer Richmond']


As you can see, each row corresponds to a question related to a table. 
* The `position` column identifies whether the question is the first, second, ... in a sequence of questions related to a table. 
* The `table_file` column identifies the name of the table file, which refers to a CSV file in the `table_csv` directory.
* The `answer_coordinates` and `answer_text` columns indicate the answer to the question. The `answer_coordinates` is a list of tuples, each tuple being a (row_index, column_index) pair. The `answer_text` column is a list of strings, indicating the cell values.

However, the `answer_coordinates` and `answer_text` columns are currently not recognized as real Python lists of Python tuples and strings respectively. Let's do that first using the `.literal_eval()`function of the `ast` module:

In [2]:
import ast

def _parse_answer_coordinates(answer_coordinate_str):
  """Parses the answer_coordinates of a question.
  Args:
    answer_coordinate_str: A string representation of a Python list of tuple
      strings.
      For example: "['(1, 4)','(1, 3)', ...]"
  """

  try:
    answer_coordinates = []
    # make a list of strings
    coords = ast.literal_eval(answer_coordinate_str)
    # parse each string as a tuple
    for row_index, column_index in sorted(
        ast.literal_eval(coord) for coord in coords):
      answer_coordinates.append((row_index, column_index))
  except SyntaxError:
    raise ValueError('Unable to evaluate %s' % answer_coordinate_str)
  
  return answer_coordinates


def _parse_answer_text(answer_text):
  """Populates the answer_texts field of `answer` by parsing `answer_text`.
  Args:
    answer_text: A string representation of a Python list of strings.
      For example: "[u'test', u'hello', ...]"
    answer: an Answer object.
  """
  try:
    answer = []
    for value in ast.literal_eval(answer_text):
      answer.append(value)
  except SyntaxError:
    raise ValueError('Unable to evaluate %s' % answer_text)

  return answer

data['answer_coordinates'] = data['answer_coordinates'].apply(lambda coords_str: _parse_answer_coordinates(coords_str))
data['answer_text'] = data['answer_text'].apply(lambda txt: _parse_answer_text(txt))

data.head(10)

Unnamed: 0,id,annotator,position,question,table_file,answer_coordinates,answer_text
0,sf-2,0,0,which incidents are in mission bay?,table_csv/2022_0117_locations.csv,"[(0, 0), (11, 0), (37, 0), (94, 0), (283, 0)]","[220170001, 220170053, 220170263, 220170622, 2..."
1,sf-2,0,1,"of those, which one has a Berry St address?",table_csv/2022_0117_locations.csv,"[(283, 0)]",[220172434]
2,sf-3,1,0,which are the incidents in station area 31?,table_csv/2022_0117_locations.csv,"[(56, 0), (114, 0), (141, 0), (170, 0), (286, ...","[220170355, 220170807, 220171052, 220171247, 2..."
3,sf-3,1,1,"of these, which neighbourhoods are they in?",table_csv/2022_0117_locations.csv,"[(56, 3), (114, 3), (141, 3), (170, 3), (286, ...","[Golden Gate Park, Inner Richmond, Inner Richm..."
4,sf-3,1,2,"and of those, which neighborhood has the highe...",table_csv/2022_0117_locations.csv,"[(345, 3)]",[Outer Richmond]
5,sf-4,2,0,what is the address of incident 220170143?,table_csv/2022_0117_locations.csv,"[(18, 1)]",[18TH ST/CASTRO ST]


Let's create a new dataframe that groups questions which are asked in a sequence related to the table. We can do this by adding a `sequence_id` column, which is a combination of the `id` and `annotator` columns:

In [3]:
def get_sequence_id(example_id, annotator):
  if "-" in str(annotator):
    raise ValueError('"-" not allowed in annotator.')
  return f"{example_id}-{annotator}"

data['sequence_id'] = data.apply(lambda x: get_sequence_id(x.id, x.annotator), axis=1)
data.head()

Unnamed: 0,id,annotator,position,question,table_file,answer_coordinates,answer_text,sequence_id
0,sf-2,0,0,which incidents are in mission bay?,table_csv/2022_0117_locations.csv,"[(0, 0), (11, 0), (37, 0), (94, 0), (283, 0)]","[220170001, 220170053, 220170263, 220170622, 2...",sf-2-0
1,sf-2,0,1,"of those, which one has a Berry St address?",table_csv/2022_0117_locations.csv,"[(283, 0)]",[220172434],sf-2-0
2,sf-3,1,0,which are the incidents in station area 31?,table_csv/2022_0117_locations.csv,"[(56, 0), (114, 0), (141, 0), (170, 0), (286, ...","[220170355, 220170807, 220171052, 220171247, 2...",sf-3-1
3,sf-3,1,1,"of these, which neighbourhoods are they in?",table_csv/2022_0117_locations.csv,"[(56, 3), (114, 3), (141, 3), (170, 3), (286, ...","[Golden Gate Park, Inner Richmond, Inner Richm...",sf-3-1
4,sf-3,1,2,"and of those, which neighborhood has the highe...",table_csv/2022_0117_locations.csv,"[(345, 3)]",[Outer Richmond],sf-3-1


In [4]:
# let's group table-question pairs by sequence id, and remove some columns we don't need 
grouped = data.groupby(by='sequence_id').agg(lambda x: x.tolist())
grouped = grouped.drop(columns=['id', 'annotator', 'position'])
grouped['table_file'] = grouped['table_file'].apply(lambda x: x[0])
grouped.head(10)

Unnamed: 0_level_0,question,table_file,answer_coordinates,answer_text
sequence_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
sf-2-0,"[which incidents are in mission bay?, of those...",table_csv/2022_0117_locations.csv,"[[(0, 0), (11, 0), (37, 0), (94, 0), (283, 0)]...","[[220170001, 220170053, 220170263, 220170622, ..."
sf-3-1,"[which are the incidents in station area 31?, ...",table_csv/2022_0117_locations.csv,"[[(56, 0), (114, 0), (141, 0), (170, 0), (286,...","[[220170355, 220170807, 220171052, 220171247, ..."
sf-4-2,[what is the address of incident 220170143?],table_csv/2022_0117_locations.csv,"[[(18, 1)]]",[[18TH ST/CASTRO ST]]


Each row in the dataframe above now consists of a **table and one or more questions** which are asked in a **sequence**. Let's visualize the first row, i.e. a table, together with its queries:

In [5]:
# path to the directory containing all csv files
table_csv_path = "table_csv"

item = grouped.iloc[0]
table = pd.read_csv(table_csv_path + item.table_file[9:]).astype(str) 

display(table)
print("")
print(item.question)

Unnamed: 0,call_number,address,zipcode,neighborhood,station_area
0,220170001,700 Block of 4TH ST,94107.0,Mission Bay,8
1,220170004,1000 Block of POTRERO AVE,94110.0,Potrero Hill,37
2,220170012,500 Block of GREEN ST,94133.0,North Beach,28
3,220170017,1000 Block of MARKET ST,94102.0,Tenderloin,1
4,220170020,0 Block of BLK MOLIMO DR,94127.0,West of Twin Peaks,39
...,...,...,...,...,...
384,220173312,1200 Block of MARKET ST,94103.0,South of Market,36
385,220173322,100 Block of CASELLI AV,94114.0,Castro/Upper Market,24
386,220173329,700 Block of LA PLAYA,94121.0,Outer Richmond,34
387,220173330,WALLER ST/MASONIC AV,94117.0,Haight Ashbury,21



['which incidents are in mission bay?', 'of those, which one has a Berry St address?']


In [6]:
table.iloc[0, 0]

'220170001'

We can see that there are 3 sequential questions asked related to the contents of the table. 

We can now use `TapasTokenizer` to batch encode this, as follows:

In [7]:
import torch
from transformers import TapasTokenizer

# initialize the tokenizer
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")

In [7]:
for index, row in grouped.iterrows():
    for x in row['answer_coordinates']:
        for y in x:
            #print(y)
            #y=y.strip("()")
            print(table.iloc[y])
    print(row['answer_text'])

220170001
220170053
220170263
220170622
220172434
220172434
[['220170001', '220170053', '220170263', '220170622', '220172434'], ['220172434']]
220170355
220170807
220171052
220171247
220172473
220172577
220172991
Golden Gate Park
Inner Richmond
Inner Richmond
Outer Richmond
Inner Richmond
Inner Richmond
Outer Richmond
Outer Richmond
[['220170355', '220170807', '220171052', '220171247', '220172473', '220172577', '220172991'], ['Golden Gate Park', 'Inner Richmond', 'Inner Richmond', 'Outer Richmond', 'Inner Richmond', 'Inner Richmond', 'Outer Richmond'], ['Outer Richmond']]
18TH ST/CASTRO ST
[['18TH ST/CASTRO ST']]


In [9]:
import torch
from transformers import TapasTokenizer

# initialize the tokenizer
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")


encoding = tokenizer(table=table, queries=item.question, answer_coordinates=item.answer_coordinates, answer_text=item.answer_text,
                     truncation=True, padding="max_length", return_tensors="pt")
encoding.keys()

ValueError: Couldn't find all answers

TAPAS basically flattens every table-question pair before feeding it into a BERT like model:

In [10]:
tokenizer.decode(encoding["input_ids"][0])

NameError: name 'encoding' is not defined

The `token_type_ids` created here will be of shape (batch_size, sequence_length, 7), as TAPAS uses 7 different token types to encode tabular structure. Let's verify this:

In [12]:
assert encoding["token_type_ids"].shape == (3, 512, 7)



One thing we can verify is whether the `prev_label` token type ids are created correctly. These indicate which tokens were (part of) an answer to the previous table-question pair. 

The prev_label token type ids of the first example in a batch must always be zero (since there's no previous table-question pair). Let's verify this:

In [13]:
assert encoding["token_type_ids"][0][:,3].sum() == 0

However, the `prev_label` token type ids of the second table-question pair in the batch must be set to 1 for the tokens which were an answer to the previous (i.e. the first) table question pair in the batch. The answers to the first table-question pair are the following:

In [14]:
print(item.answer_text[0])

['Tommy Green', 'Janis Dalins', 'Ugo Frigerio', 'Karl Hahnel', 'Ettore Rivolta', 'Paul Sievert', 'Henri Quintric', 'Ernie Crosbie', 'Bill Chisholm', 'Alfred Maasik', 'Henry Cieman', 'John Moralis', 'Francesco Pretti', 'Arthur Tell Schwab', 'Harry Hinkel']


So let's now verify whether the `prev_label` ids of the second table-question pair are set correctly:

In [15]:
for id, prev_label in zip (encoding["input_ids"][1], encoding["token_type_ids"][1][:,3]):
  if id != 0: # we skip padding tokens
    print(tokenizer.decode([id]), prev_label.item())

[CLS] 0
where 0
are 0
they 0
from 0
? 0
[SEP] 0
rank 0
name 0
nationality 0
time 0
( 0
hand 0
) 0
notes 0
[EMPTY] 0
tommy 1
green 1
great 0
britain 0
4 0
: 0
50 0
: 0
10 0
or 0
[EMPTY] 0
jan 1
##is 1
dali 1
##ns 1
latvia 0
4 0
: 0
57 0
: 0
20 0
[EMPTY] 0
[EMPTY] 0
u 1
##go 1
fr 1
##iger 1
##io 1
italy 0
4 0
: 0
59 0
: 0
06 0
[EMPTY] 0
4 0
. 0
0 0
karl 1
hahn 1
##el 1
germany 0
5 0
: 0
06 0
: 0
06 0
[EMPTY] 0
5 0
. 0
0 0
et 1
##tore 1
ri 1
##vo 1
##lta 1
italy 0
5 0
: 0
07 0
: 0
39 0
[EMPTY] 0
6 0
. 0
0 0
paul 1
si 1
##ever 1
##t 1
germany 0
5 0
: 0
16 0
: 0
41 0
[EMPTY] 0
7 0
. 0
0 0
henri 1
qui 1
##nt 1
##ric 1
france 0
5 0
: 0
27 0
: 0
25 0
[EMPTY] 0
8 0
. 0
0 0
ernie 1
cr 1
##os 1
##bie 1
united 0
states 0
5 0
: 0
28 0
: 0
02 0
[EMPTY] 0
9 0
. 0
0 0
bill 1
chi 1
##sho 1
##lm 1
united 0
states 0
5 0
: 0
51 0
: 0
00 0
[EMPTY] 0
10 0
. 0
0 0
alfred 1
ma 1
##asi 1
##k 1
estonia 0
6 0
: 0
19 0
: 0
00 0
[EMPTY] 0
[EMPTY] 0
henry 1
ci 1
##eman 1
canada 0
[EMPTY] 0
d 0
##n 0
##f 0
[EMPTY] 0

This looks OK! Be sure to check this, because the token type ids are critical for the performance of TAPAS.

Let's create a PyTorch dataset and corresponding dataloader. Note the __getitem__ method here: in order to properly set the prev_labels token types, we must check whether a table-question pair is the first in a sequence or not. In case it is, we can just encode it. In case it isn't, we need to encode it together with the previous table-question pair.

Note that this is not the most efficient approach, because we're effectively tokenizing each table-question pair twice when applied on the entire dataset (feel free to ping me a more efficient solution).

In [16]:
class TableDataset(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer):
        self.df = df
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        item = self.df.iloc[idx]
        table = pd.read_csv(table_csv_path + item.table_file[9:]).astype(str) # TapasTokenizer expects the table data to be text only
        if item.position != 0:
          # use the previous table-question pair to correctly set the prev_labels token type ids
          previous_item = self.df.iloc[idx-1]
          encoding = self.tokenizer(table=table, 
                                    queries=[previous_item.question, item.question], 
                                    answer_coordinates=[previous_item.answer_coordinates, item.answer_coordinates], 
                                    answer_text=[previous_item.answer_text, item.answer_text],
                                    padding="max_length",
                                    truncation=True,
                                    return_tensors="pt"
          )
          # use encodings of second table-question pair in the batch
          encoding = {key: val[-1] for key, val in encoding.items()}
        else:
          # this means it's the first table-question pair in a sequence
          encoding = self.tokenizer(table=table, 
                                    queries=item.question, 
                                    answer_coordinates=item.answer_coordinates, 
                                    answer_text=item.answer_text,
                                    padding="max_length",
                                    truncation=True,
                                    return_tensors="pt"
          )
          # remove the batch dimension which the tokenizer adds 
          encoding = {key: val.squeeze(0) for key, val in encoding.items()}
        return encoding

    def __len__(self):
        return len(self.df)

train_dataset = TableDataset(df=data, tokenizer=tokenizer)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=2)

In [17]:
train_dataset[0]["token_type_ids"].shape

torch.Size([512, 7])

In [18]:
train_dataset[1]["input_ids"].shape

torch.Size([512])

In [19]:
batch = next(iter(train_dataloader))

In [20]:
batch["input_ids"].shape

torch.Size([2, 512])

In [21]:
batch["token_type_ids"].shape

torch.Size([2, 512, 7])

Let's decode the first table-question pair:

In [22]:
tokenizer.decode(batch["input_ids"][0])

'[CLS] where are the players from? [SEP] pick player team position school 1 ben mcdonald baltimore orioles rhp louisiana state university 2 tyler houston atlanta braves c valley hs ( las vegas, nv ) 3 roger salkeld seattle mariners rhp saugus ( ca ) hs 4 jeff jackson philadelphia phillies of simeon hs ( chicago, il ) 5 donald harris texas rangers of texas tech university 6 paul coleman saint louis cardinals of frankston ( tx ) hs 7 frank thomas chicago white sox 1b auburn university 8 earl cunningham chicago cubs of lancaster ( sc ) hs 9 kyle abbott california angels lhp long beach state university 10 charles johnson montreal expos c westwood hs ( fort pierce, fl ) 11 calvin murray cleveland indians 3b w. t. white high school ( dallas, tx ) 12 jeff juden houston astros rhp salem ( ma ) hs 13 brent mayne kansas city royals c cal state fullerton 14 steve hosey san francisco giants of fresno state university 15 kiki jones los angeles dodgers rhp hillsborough hs ( tampa, fl ) 16 greg bloss

In [23]:
#first example should not have any prev_labels set
assert batch["token_type_ids"][0][:,3].sum() == 0

Let's decode the second table-question pair and verify some more:

In [24]:
tokenizer.decode(batch["input_ids"][1])

'[CLS] which player went to louisiana state university? [SEP] pick player team position school 1 ben mcdonald baltimore orioles rhp louisiana state university 2 tyler houston atlanta braves c valley hs ( las vegas, nv ) 3 roger salkeld seattle mariners rhp saugus ( ca ) hs 4 jeff jackson philadelphia phillies of simeon hs ( chicago, il ) 5 donald harris texas rangers of texas tech university 6 paul coleman saint louis cardinals of frankston ( tx ) hs 7 frank thomas chicago white sox 1b auburn university 8 earl cunningham chicago cubs of lancaster ( sc ) hs 9 kyle abbott california angels lhp long beach state university 10 charles johnson montreal expos c westwood hs ( fort pierce, fl ) 11 calvin murray cleveland indians 3b w. t. white high school ( dallas, tx ) 12 jeff juden houston astros rhp salem ( ma ) hs 13 brent mayne kansas city royals c cal state fullerton 14 steve hosey san francisco giants of fresno state university 15 kiki jones los angeles dodgers rhp hillsborough hs ( tamp

In [25]:
assert batch["labels"][0].sum() == batch["token_type_ids"][1][:,3].sum()
print(batch["token_type_ids"][1][:,3].sum())

tensor(132)


In [26]:
for id, prev_label in zip(batch["input_ids"][1], batch["token_type_ids"][1][:,3]):
  if id != 0:
    print(tokenizer.decode([id]), prev_label.item())

[CLS] 0
which 0
player 0
went 0
to 0
louisiana 0
state 0
university 0
? 0
[SEP] 0
pick 0
player 0
team 0
position 0
school 0
1 0
ben 0
mcdonald 0
baltimore 0
orioles 0
r 0
##hp 0
louisiana 1
state 1
university 1
2 0
tyler 0
houston 0
atlanta 0
braves 0
c 0
valley 1
hs 1
( 1
las 1
vegas 1
, 1
n 1
##v 1
) 1
3 0
roger 0
sal 0
##kel 0
##d 0
seattle 0
mariners 0
r 0
##hp 0
sa 1
##ug 1
##us 1
( 1
ca 1
) 1
hs 1
4 0
jeff 0
jackson 0
philadelphia 0
phillies 0
of 0
simeon 1
hs 1
( 1
chicago 1
, 1
il 1
) 1
5 0
donald 0
harris 0
texas 0
rangers 0
of 0
texas 1
tech 1
university 1
6 0
paul 0
coleman 0
saint 0
louis 0
cardinals 0
of 0
franks 1
##ton 1
( 1
tx 1
) 1
hs 1
7 0
frank 0
thomas 0
chicago 0
white 0
sox 0
1b 0
auburn 1
university 1
8 0
earl 0
cunningham 0
chicago 0
cubs 0
of 0
lancaster 1
( 1
sc 1
) 1
hs 1
9 0
kyle 0
abbott 0
california 0
angels 0
l 0
##hp 0
long 1
beach 1
state 1
university 1
10 0
charles 0
johnson 0
montreal 0
expo 0
##s 0
c 0
westwood 1
hs 1
( 1
fort 1
pierce 1
, 1
fl 1
) 

## Define the model

Here we initialize the model with a pre-trained base and randomly initialized cell selection head, and move it to the GPU (if available).

Note that the `google/tapas-base` checkpoint has (by default) an SQA configuration, so we don't need to specify any additional hyperparameters.

In [27]:
from transformers import TapasForQuestionAnswering

model = TapasForQuestionAnswering.from_pretrained("google/tapas-base")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

Some weights of TapasForQuestionAnswering were not initialized from the model checkpoint at google/tapas-base and are newly initialized: ['output_bias', 'output_weights', 'column_output_weights', 'column_output_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TapasForQuestionAnswering(
  (tapas): TapasModel(
    (embeddings): TapasEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(1024, 768)
      (token_type_embeddings_0): Embedding(3, 768)
      (token_type_embeddings_1): Embedding(256, 768)
      (token_type_embeddings_2): Embedding(256, 768)
      (token_type_embeddings_3): Embedding(2, 768)
      (token_type_embeddings_4): Embedding(256, 768)
      (token_type_embeddings_5): Embedding(256, 768)
      (token_type_embeddings_6): Embedding(10, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.07, inplace=False)
    )
    (encoder): TapasEncoder(
      (layer): ModuleList(
        (0): TapasLayer(
          (attention): TapasAttention(
            (self): TapasSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)


## Training the model

Let's fine-tune the model in well-known PyTorch fashion:

In [28]:
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(10):  # loop over the dataset multiple times
   print("Epoch:", epoch)
   for idx, batch in enumerate(train_dataloader):
        # get the inputs;
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        labels = batch["labels"].to(device)
        
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                       labels=labels)
        loss = outputs.loss
        print("Loss:", loss.item())
        loss.backward()
        optimizer.step()



Epoch: 0
Loss: 2.2924113273620605
Loss: 1.7504932880401611
Loss: 1.40304696559906
Loss: 1.6494226455688477
Loss: 1.6415494680404663
Loss: 2.8653769493103027
Loss: 2.4297642707824707
Loss: 2.773843288421631
Loss: 3.057562828063965
Loss: 2.458859443664551
Loss: 1.8117353916168213
Loss: 1.8899388313293457
Loss: 1.8997710943222046
Loss: 1.2621345520019531
Epoch: 1
Loss: 2.557004690170288
Loss: 1.0541355609893799
Loss: 1.0075969696044922
Loss: 1.716690182685852
Loss: 1.8415734767913818
Loss: 1.937148928642273
Loss: 1.89765202999115
Loss: 1.9327514171600342
Loss: 2.027282476425171
Loss: 1.077110767364502
Loss: 1.1339160203933716
Loss: 1.259394884109497
Loss: 1.6109447479248047
Loss: 0.9077013731002808
Epoch: 2
Loss: 2.0399744510650635
Loss: 0.5073123574256897
Loss: 0.6933571100234985
Loss: 0.8221110105514526
Loss: 1.2853317260742188
Loss: 1.0606557130813599
Loss: 1.1429767608642578
Loss: 1.0728784799575806
Loss: 1.4115822315216064
Loss: 0.8682160377502441
Loss: 0.9180620312690735
Loss: 0.942

## Inference

As SQA is a bit different due to its conversational nature, we need to run every training example of the a batch one by one through the model (sequentially), overwriting the `prev_labels` token types (which were created by the tokenizer) by the answer predicted by the model. It is based on the [following code](https://github.com/google-research/tapas/blob/f458b6624b8aa75961a0ab78e9847355022940d3/tapas/experiments/prediction_utils.py#L92) from the official implementation:

In [29]:
import collections
import numpy as np

def compute_prediction_sequence(model, data, device):
  """Computes predictions using model's answers to the previous questions."""
  
  # prepare data
  input_ids = data["input_ids"].to(device)
  attention_mask = data["attention_mask"].to(device)
  token_type_ids = data["token_type_ids"].to(device)

  all_logits = []
  prev_answers = None

  num_batch = data["input_ids"].shape[0]
  
  for idx in range(num_batch):
    
    if prev_answers is not None:
        coords_to_answer = prev_answers[idx]
        # Next, set the label ids predicted by the model
        prev_label_ids_example = token_type_ids_example[:,3] # shape (seq_len,)
        model_label_ids = np.zeros_like(prev_label_ids_example.cpu().numpy()) # shape (seq_len,)

        # for each token in the sequence:
        token_type_ids_example = token_type_ids[idx] # shape (seq_len, 7)
        for i in range(model_label_ids.shape[0]):
          segment_id = token_type_ids_example[:,0].tolist()[i]
          col_id = token_type_ids_example[:,1].tolist()[i] - 1
          row_id = token_type_ids_example[:,2].tolist()[i] - 1
          if row_id >= 0 and col_id >= 0 and segment_id == 1:
            model_label_ids[i] = int(coords_to_answer[(col_id, row_id)])

        # set the prev label ids of the example (shape (1, seq_len) )
        token_type_ids_example[:,3] = torch.from_numpy(model_label_ids).type(torch.long).to(device)   

    prev_answers = {}
    # get the example
    input_ids_example = input_ids[idx] # shape (seq_len,)
    attention_mask_example = attention_mask[idx] # shape (seq_len,)
    token_type_ids_example = token_type_ids[idx] # shape (seq_len, 7)
    # forward pass to obtain the logits
    outputs = model(input_ids=input_ids_example.unsqueeze(0), 
                    attention_mask=attention_mask_example.unsqueeze(0), 
                    token_type_ids=token_type_ids_example.unsqueeze(0))
    logits = outputs.logits
    all_logits.append(logits)

    # convert logits to probabilities (which are of shape (1, seq_len))
    dist_per_token = torch.distributions.Bernoulli(logits=logits)
    probabilities = dist_per_token.probs * attention_mask_example.type(torch.float32).to(dist_per_token.probs.device) 

    # Compute average probability per cell, aggregating over tokens.
    # Dictionary maps coordinates to a list of one or more probabilities
    coords_to_probs = collections.defaultdict(list)
    prev_answers = {}
    for i, p in enumerate(probabilities.squeeze().tolist()):
      segment_id = token_type_ids_example[:,0].tolist()[i]
      col = token_type_ids_example[:,1].tolist()[i] - 1
      row = token_type_ids_example[:,2].tolist()[i] - 1
      if col >= 0 and row >= 0 and segment_id == 1:
        coords_to_probs[(col, row)].append(p)

    # Next, map cell coordinates to 1 or 0 (depending on whether the mean prob of all cell tokens is > 0.5)
    coords_to_answer = {}
    for key in coords_to_probs:
      coords_to_answer[key] = np.array(coords_to_probs[key]).mean() > 0.5
    prev_answers[idx+1] = coords_to_answer
    
  logits_batch = torch.cat(tuple(all_logits), 0)
  
  return logits_batch

In [30]:
data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["7 february 1967", "10 june 1996", "28 november 1967"]}
queries = ["How many movies has George Clooney played in?", "How old is he?", "What's his date of birth?"]

table = pd.DataFrame.from_dict(data)

inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
logits = compute_prediction_sequence(model, inputs, device)

In [31]:
data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["7 february 1967", "10 june 1996", "28 november 1967"]}
queries = ["How many movies has George Clooney played in?", "How old is he?", "What's his date of birth?"]

table = pd.DataFrame.from_dict(data)
table

Unnamed: 0,Actors,Age,Number of movies,Date of birth
0,Brad Pitt,56,87,7 february 1967
1,Leonardo Di Caprio,45,53,10 june 1996
2,George Clooney,59,69,28 november 1967


In [32]:
inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
logits = compute_prediction_sequence(model, inputs, device)

Finally, we can use the handy `convert_logits_to_predictions` function of `TapasTokenizer` to convert the logits into predicted coordinates, and print out the result:

In [33]:
predicted_answer_coordinates, = tokenizer.convert_logits_to_predictions(inputs, logits.cpu().detach())

In [34]:
# handy helper function in case inference on Pandas dataframe
answers = []
for coordinates in predicted_answer_coordinates:
  if len(coordinates) == 1:
    # only a single cell:
    answers.append(table.iat[coordinates[0]])
  else:
    # multiple cells
    cell_values = []
    for coordinate in coordinates:
      cell_values.append(table.iat[coordinate])
    answers.append(", ".join(cell_values))

display(table)
print("")
for query, answer in zip(queries, answers):
  print(query)
  print("Predicted answer: " + answer)

Unnamed: 0,Actors,Age,Number of movies,Date of birth
0,Brad Pitt,56,87,7 february 1967
1,Leonardo Di Caprio,45,53,10 june 1996
2,George Clooney,59,69,28 november 1967



How many movies has George Clooney played in?
Predicted answer: Brad Pitt, Leonardo Di Caprio, George Clooney
How old is he?
Predicted answer: Brad Pitt, Leonardo Di Caprio, George Clooney
What's his date of birth?
Predicted answer: 7 february 1967, 10 june 1996, 28 november 1967


Note that the results here are not correct, that's obvious since we only trained on 28 examples and tested it on an entire different example. In reality, you should train on the entire dataset. The result of this is the `google/tapas-base-finetuned-sqa` checkpoint.

In [35]:
import pandas as pd
import datetime
# set 10 day lookback window
periodend = datetime.datetime.now().isoformat()
periodstart = (datetime.datetime.now()- datetime.timedelta(days=10)).isoformat()
#Get count of incidents to determine limit to retrieve
num=pd.read_json("https://data.sfgov.org/resource/RowID.json?$select=COUNT(incident_number)&$where=call_date%20between%20%27"
                 +str(periodstart)+"%27%20AND%20%27"+str(periodend)+"%27")
maxNumber=num.iloc[0,0]
print('The number of incidents in the set is',maxNumber)
# Use record to write query call to API to get all incidents needed/bypass API default of 1000
query_str=("https://data.sfgov.org/resource/RowID.json?$where=call_date%20between%20%27"
           +str(periodstart)+"%27%20AND%20%27"
           +str(periodend)+"%27&$limit="+str(maxNumber))
cfs_data=pd.read_json(query_str)
print(cfs_data.shape)
cfs_data.sample(5)

The number of incidents in the set is 7562
(7562, 34)


Unnamed: 0,call_number,unit_id,incident_number,call_type,call_date,watch_date,received_dttm,entry_dttm,dispatch_dttm,response_dttm,...,number_of_alarms,unit_type,unit_sequence_in_call_dispatch,fire_prevention_district,supervisor_district,neighborhoods_analysis_boundaries,rowid,case_location,transport_dttm,hospital_dttm
7199,220312124,B03,22014641,Alarms,2022-01-31T00:00:00.000,2022-01-31T00:00:00.000,2022-01-31T15:57:07.000,2022-01-31T15:58:35.000,2022-01-31T15:58:41.000,2022-01-31T15:59:38.000,...,1,CHIEF,1,3.0,6,Financial District/South Beach,220312124-B03,"{'type': 'Point', 'coordinates': [-122.3890405...",,
4335,220283288,E14,22013457,Medical Incident,2022-01-28T00:00:00.000,2022-01-28T00:00:00.000,2022-01-28T22:06:51.000,2022-01-28T22:08:06.000,2022-01-28T22:08:18.000,2022-01-28T22:09:48.000,...,1,ENGINE,1,7.0,1,Outer Richmond,220283288-E14,"{'type': 'Point', 'coordinates': [-122.4876801...",,
337,220232062,B08,22010996,Alarms,2022-01-23T00:00:00.000,2022-01-23T00:00:00.000,2022-01-23T16:32:31.000,2022-01-23T16:33:47.000,2022-01-23T16:33:55.000,2022-01-23T16:36:32.000,...,1,CHIEF,1,8.0,7,West of Twin Peaks,220232062-B08,"{'type': 'Point', 'coordinates': [-122.4628171...",,
2466,220250347,E18,22011675,Medical Incident,2022-01-25T00:00:00.000,2022-01-24T00:00:00.000,2022-01-25T05:49:07.000,2022-01-25T05:51:29.000,2022-01-25T05:51:53.000,2022-01-25T05:53:26.000,...,1,ENGINE,1,8.0,4,Sunset/Parkside,220250347-E18,"{'type': 'Point', 'coordinates': [-122.4829669...",,
151,220232887,58,22011108,Medical Incident,2022-01-23T00:00:00.000,2022-01-23T00:00:00.000,2022-01-23T21:26:04.000,2022-01-23T21:28:52.000,2022-01-23T21:29:01.000,2022-01-23T21:31:01.000,...,1,MEDIC,1,3.0,6,Tenderloin,220232887-58,"{'type': 'Point', 'coordinates': [-122.4126203...",,


In [36]:
cfs_data=cfs_data.rename(columns={"call_type":"incident","neighborhoods_analysis_boundaries" : "neighbourhood" })
cfs_data

Unnamed: 0,call_number,unit_id,incident_number,incident,call_date,watch_date,received_dttm,entry_dttm,dispatch_dttm,response_dttm,...,number_of_alarms,unit_type,unit_sequence_in_call_dispatch,fire_prevention_district,supervisor_district,neighbourhood,rowid,case_location,transport_dttm,hospital_dttm
0,220230253,SCRT6,22010754,Medical Incident,2022-01-23T00:00:00.000,2022-01-22T00:00:00.000,2022-01-23T02:07:33.000,2022-01-23T02:14:09.000,2022-01-23T02:20:00.000,2022-01-23T02:20:00.000,...,1,SUPPORT,1,2,6,Mission,220230253-SCRT6,"{'type': 'Point', 'coordinates': [-122.4190886...",,
1,220230206,63,22010747,Medical Incident,2022-01-23T00:00:00.000,2022-01-22T00:00:00.000,2022-01-23T01:48:31.000,2022-01-23T01:49:49.000,2022-01-23T01:51:07.000,2022-01-23T01:51:10.000,...,1,MEDIC,2,2,6,Tenderloin,220230206-63,"{'type': 'Point', 'coordinates': [-122.4173655...",2022-01-23T02:16:10.000,2022-01-23T02:26:22.000
2,220230288,E03,22010764,Medical Incident,2022-01-23T00:00:00.000,2022-01-22T00:00:00.000,2022-01-23T02:35:05.000,2022-01-23T02:37:31.000,2022-01-23T02:40:50.000,2022-01-23T02:40:50.000,...,1,ENGINE,4,2,6,Tenderloin,220230288-E03,"{'type': 'Point', 'coordinates': [-122.4162597...",,
3,220230165,T17,22010739,Medical Incident,2022-01-23T00:00:00.000,2022-01-22T00:00:00.000,2022-01-23T01:27:16.000,2022-01-23T01:28:11.000,2022-01-23T01:28:27.000,2022-01-23T01:30:39.000,...,1,TRUCK,2,10.0,10,Bayview Hunters Point,220230165-T17,"{'type': 'Point', 'coordinates': [-122.3982982...",,
4,220230127,E03,22010728,Medical Incident,2022-01-23T00:00:00.000,2022-01-22T00:00:00.000,2022-01-23T01:05:18.000,2022-01-23T01:06:04.000,2022-01-23T01:06:42.000,2022-01-23T01:09:18.000,...,1,ENGINE,1,4,6,Tenderloin,220230127-E03,"{'type': 'Point', 'coordinates': [-122.4172576...",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7557,220320063,76,22014828,Medical Incident,2022-02-01T00:00:00.000,2022-01-31T00:00:00.000,2022-02-01T00:56:50.000,2022-02-01T00:59:22.000,2022-02-01T01:00:13.000,2022-02-01T01:00:19.000,...,1,MEDIC,2,9,11,Outer Mission,220320063-76,"{'type': 'Point', 'coordinates': [-122.4491688...",2022-02-01T01:21:52.000,2022-02-01T01:41:49.000
7558,220320084,58,22014833,Medical Incident,2022-02-01T00:00:00.000,2022-01-31T00:00:00.000,2022-02-01T01:18:57.000,2022-02-01T01:20:55.000,2022-02-01T01:21:09.000,2022-02-01T01:21:12.000,...,1,MEDIC,2,8,7,Lakeshore,220320084-58,"{'type': 'Point', 'coordinates': [-122.4788651...",2022-02-01T01:48:18.000,2022-02-01T02:24:51.000
7559,220320127,AM120,22014839,Medical Incident,2022-02-01T00:00:00.000,2022-01-31T00:00:00.000,2022-02-01T02:17:27.000,2022-02-01T02:19:34.000,2022-02-01T02:20:10.000,2022-02-01T02:27:52.000,...,1,PRIVATE,1,5,5,Western Addition,220320127-AM120,"{'type': 'Point', 'coordinates': [-122.4408636...",,
7560,220320152,KM104,22014846,Medical Incident,2022-02-01T00:00:00.000,2022-01-31T00:00:00.000,2022-02-01T02:51:17.000,2022-02-01T02:52:18.000,2022-02-01T02:52:46.000,2022-02-01T02:56:09.000,...,1,PRIVATE,2,8,4,Sunset/Parkside,220320152-KM104,"{'type': 'Point', 'coordinates': [-122.4976191...",,


In [37]:
cfs_data=cfs_data.dropna()
cfs_data= cfs_data.astype(str)

In [38]:
#cfs_small=cfs_data[['incident','neighbourhood','unit_type']].sample(5)
cfs_small=cfs_data[['neighbourhood','unit_type']].sample(5)
cfs_small

Unnamed: 0,neighbourhood,unit_type
675,Tenderloin,PRIVATE
4012,Golden Gate Park,MEDIC
6883,Lone Mountain/USF,MEDIC
2894,Bayview Hunters Point,MEDIC
1661,South of Market,MEDIC


In [39]:
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-tabfact", drop_rows_to_fit=True)

In [40]:
table = pd.DataFrame.from_dict(cfs_small)
table

Unnamed: 0,neighbourhood,unit_type
675,Tenderloin,PRIVATE
4012,Golden Gate Park,MEDIC
6883,Lone Mountain/USF,MEDIC
2894,Bayview Hunters Point,MEDIC
1661,South of Market,MEDIC


In [41]:
inputs = tokenizer(table=table, queries=queries,padding=True, truncation=True, return_tensors="pt")

IndexError: iloc cannot enlarge its target object

In [None]:
queries = ["What neighbourhoods are incidents in?", "Which neighbourhood is the CHIEF unit type in?", "What neighourhoods are the medics in?"]



inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
logits = compute_prediction_sequence(model, inputs, device)

In [None]:
predicted_answer_coordinates, = tokenizer.convert_logits_to_predictions(inputs, logits.cpu().detach())

In [None]:
# handy helper function in case inference on Pandas dataframe
answers = []
for coordinates in predicted_answer_coordinates:
  if len(coordinates) == 1:
    # only a single cell:
    answers.append(table.iat[coordinates[0]])
  else:
    # multiple cells
    cell_values = []
    for coordinate in coordinates:
      cell_values.append(table.iat[coordinate])
    answers.append(", ".join(cell_values))

display(table)
print("")
for query, answer in zip(queries, answers):
  print(query)
  print("Predicted answer: " + answer)