<a href="https://colab.research.google.com/github/im-anukalp/ami/blob/master/Copy_of_sqa_predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/google-research/tapas/blob/master/notebooks/sqa_predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2020 The Google AI Language Team Authors

Licensed under the Apache License, Version 2.0 (the "License");

In [None]:
# Copyright 2019 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Running a Tapas fine-tuned checkpoint
---
This notebook shows how to load a fine-tuned TAPAS model and make predictions with it. TAPAS was introduced in the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349).

# Clone and install the repository


First, let's fetch the code from the GitHub repository and install it.

In [6]:
! git clone https://github.com/google-research/tapas.git

Cloning into 'tapas'...
remote: Enumerating objects: 236, done.
remote: Total 236 (delta 0), reused 0 (delta 0), pack-reused 236
Receiving objects: 100% (236/236), 212.54 KiB | 9.24 MiB/s, done.
Resolving deltas: 100% (123/123), done.


In [7]:
! pip install ./tapas

Processing ./tapas
Collecting apache-beam[gcp]==2.20.0
  Downloading https://files.pythonhosted.org/packages/4b/0d/0979ad626578a52887f7df60492ac6759089a9da261ac4c88b112b3f6a5a/apache_beam-2.20.0-cp36-cp36m-manylinux1_x86_64.whl (3.5MB)
     |████████████████████████████████| 3.5MB 4.7MB/s
Collecting frozendict==1.2
  Downloading https://files.pythonhosted.org/packages/4e/55/a12ded2c426a4d2bee73f88304c9c08ebbdbadb82569ebdd6a0c007cfd08/frozendict-1.2.tar.gz
Collecting tensorflow~=2.2.0
  Downloading https://files.pythonhosted.org/packages/3d/be/679ce5254a8c8d07470efb4a4c00345fae91f766e64f1c2aece8796d7218/tensorflow-2.2.0-cp36-cp36m-manylinux2010_x86_64.whl (516.2MB)
     |████████████████████████████████| 516.2MB 24kB/s
Collecting tf-models-official~=2.2.0
  Downloading https://files.pythonhosted.org/packages/99/8e/6db83bab2f86475fa69289848379f642746314131527d8a4ced47a6396af/tf_models_official-2.2.2-py2.py3-none-any.whl (711kB)
     |█████████████

# Fetch models from Google Storage

Next, we can get a pretrained checkpoint from Google Storage. For the sake of speed, this is a base-sized model trained on [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253). Note that the best results in the paper were obtained with a large model, which has 24 layers instead of 12.

In [1]:
! gsutil cp gs://tapas_models/2020_04_21/tapas_sqa_base.zip . && unzip tapas_sqa_base.zip

Copying gs://tapas_models/2020_04_21/tapas_sqa_base.zip...
/ [1 files][  1.0 GiB/  1.0 GiB]   62.9 MiB/s                                   
Operation completed over 1 objects/1.0 GiB.                                      
Archive:  tapas_sqa_base.zip
replace tapas_sqa_base/model.ckpt.data-00000-of-00001? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace tapas_sqa_base/model.ckpt.index? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace tapas_sqa_base/README.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: tapas_sqa_base/README.txt  
replace tapas_sqa_base/vocab.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: tapas_sqa_base/vocab.txt  
  inflating: tapas_sqa_base/bert_config.json  
  inflating: tapas_sqa_base/model.ckpt.meta  
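
The unzip log above shows that some files were skipped at the overwrite prompt, so it can be worth confirming that the extracted directory is complete before moving on. The following is a minimal sketch of such a check; `missing_checkpoint_files` and `EXPECTED_FILES` are hypothetical names introduced here (not part of TAPAS), and the file list is simply the set of names that appear in the unzip output and in the later cells.

```python
from pathlib import Path

# File names taken from the unzip output and the cells that follow.
# This list is an assumption about what a complete extraction contains.
EXPECTED_FILES = (
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.txt",
    "bert_config.json",
)


def missing_checkpoint_files(model_dir, expected=EXPECTED_FILES):
    """Returns the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Report what is missing from the directory created by the unzip step.
missing = missing_checkpoint_files("tapas_sqa_base")
print("missing files:", missing if missing else "none")
```

If the list is non-empty, re-running the download and answering `A` (all) at the overwrite prompt is one way to restore the directory.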


# Imports

In [2]:
import tensorflow.compat.v1 as tf
import os 
import shutil
import csv
import pandas as pd
import IPython

tf.get_logger().setLevel('ERROR')

In [3]:
from tapas.utils import tf_example_utils
from tapas.protos import interaction_pb2
from tapas.utils import number_annotation_utils
from tapas.scripts import prediction_utils

# Load checkpoint for prediction

Here's the prediction code, which will create an `interaction_pb2.Interaction` protobuf object, the data structure we use to store examples, and then call the prediction script.

In [4]:
os.makedirs('results/sqa/tf_examples', exist_ok=True)
os.makedirs('results/sqa/model', exist_ok=True)
with open('results/sqa/model/checkpoint', 'w') as f:
  f.write('model_checkpoint_path: "model.ckpt-0"')
for suffix in ['.data-00000-of-00001', '.index', '.meta']:
  shutil.copyfile(f'tapas_sqa_base/model.ckpt{suffix}', f'results/sqa/model/model.ckpt-0{suffix}')

In [5]:
df=pd.read_csv("data.csv")

In [6]:
# Cast every cell to str: the cells are later assigned to protobuf text fields.
df = df.astype(str)

In [7]:
df.head(10)

Unnamed: 0,Pos,Player,Team,Span,Innings,Runs,Highest Score,Average,Strike Rate
0,1,Sachin Tendulkar,India,1989-2012,452,18426,200,44.83,86.23
1,2,Kumar Sangakkara,Sri Lanka,2000-2015,380,14234,169,41.98,78.86
2,3,Ricky Ponting,Australia,1995-2012,365,13704,164,42.03,80.39
3,4,Sanath Jayasuriya,Sri Lanka,1989-2011,433,13430,189,32.36,91.2
4,5,Mahela Jayawardene,Sri Lanka,1998-2015,418,12650,144,33.37,78.96
5,6,Virat Kohli,India,2008-2020,236,11867,183,59.85,93.39
6,7,Inzamam-ul-Haq,Pakistan,1991-2007,350,11739,137,39.52,74.24
7,8,Jacques Kallis,South Africa,1996-2014,314,11579,139,44.36,72.89
8,9,Saurav Ganguly,India,1992-2007,300,11363,183,41.02,73.7
9,10,Rahul Dravid,India,1996-2011,318,10889,153,39.16,71.24


In [10]:
list_of_list

[['Pos',
  'Player',
  'Team',
  'Span',
  'Innings',
  'Runs',
  'Highest Score',
  'Average',
  'Strike Rate'],
 ['1',
  'Sachin Tendulkar',
  'India',
  '1989-2012',
  '452',
  '18426',
  '200',
  '44.83',
  '86.23'],
 ['2',
  'Kumar Sangakkara',
  'Sri Lanka',
  '2000-2015',
  '380',
  '14234',
  '169',
  '41.98',
  '78.86'],
 ['3',
  'Ricky Ponting',
  'Australia',
  '1995-2012',
  '365',
  '13704',
  '164',
  '42.03',
  '80.39'],
 ['4',
  'Sanath Jayasuriya',
  'Sri Lanka',
  '1989-2011',
  '433',
  '13430',
  '189',
  '32.36',
  '91.2'],
 ['5',
  'Mahela Jayawardene',
  'Sri Lanka',
  '1998-2015',
  '418',
  '12650',
  '144',
  '33.37',
  '78.96'],
 ['6',
  'Virat Kohli',
  'India',
  '2008-2020',
  '236',
  '11867',
  '183',
  '59.85',
  '93.39'],
 ['7',
  'Inzamam-ul-Haq',
  'Pakistan',
  '1991-2007',
  '350',
  '11739',
  '137',
  '39.52',
  '74.24'],
 ['8',
  'Jacques Kallis',
  'South Africa',
  '1996-2014',
  '314',
  '11579',
  '139',
  '44.36',
  '72.89'],
 ['9',
  'Saur

In [9]:
# Build the table as a list of lists: the header row first,
# then one list of string cells per data row.
list_of_list = [list(df.columns)] + df.values.tolist()
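
The `df.head(10)` output includes an `Unnamed: 0` column, while the printed `list_of_list` does not, so that column was presumably dropped in a step that is not shown in this notebook. As a hedged alternative, the sketch below builds the same header-plus-rows structure directly from CSV text with the standard `csv` module, drops a leading unnamed index column if one is present, and checks that every row has the same width. `csv_text_to_table` is a hypothetical helper written for illustration, not part of TAPAS or pandas.

```python
import csv
import io


def csv_text_to_table(csv_text, drop_unnamed_index=True):
    """Parses CSV text into [header, row, row, ...] with every cell a str."""
    # Skip completely empty lines; csv.reader yields [] for them.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("CSV contains no rows")
    # A DataFrame saved with its index gets a first column whose header is
    # empty; pandas.read_csv then reports that column as "Unnamed: 0".
    if drop_unnamed_index and rows[0][0] in ("", "Unnamed: 0"):
        rows = [row[1:] for row in rows]
    width = len(rows[0])
    for index, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(
                f"row {index} has {len(row)} cells, expected {width}")
    return rows


# A two-row excerpt in the same shape as the table used in this notebook.
sample = (
    ",Pos,Player,Team\n"
    "0,1,Sachin Tendulkar,India\n"
    "1,2,Kumar Sangakkara,Sri Lanka\n"
)
table = csv_text_to_table(sample)
print(table[0])  # prints ['Pos', 'Player', 'Team']
```

Because `csv.reader` returns every field as a string, no separate cast is needed with this route.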

In [12]:
max_seq_length = 512
vocab_file = "tapas_sqa_base/vocab.txt"
config = tf_example_utils.ClassifierConversionConfig(
    vocab_file=vocab_file,
    max_seq_length=max_seq_length,
    max_column_id=max_seq_length,
    max_row_id=max_seq_length,
    strip_column_names=False,
    add_aggregation_candidates=False,
)
converter = tf_example_utils.ToClassifierTensorflowExample(config)

def convert_interactions_to_examples(tables_and_queries):
  """Calls Tapas converter to convert interaction to example."""
  for idx, (table, queries) in enumerate(tables_and_queries):
    interaction = interaction_pb2.Interaction()
    for position, query in enumerate(queries):
      question = interaction.questions.add()
      question.original_text = query
      question.id = f"{idx}-0_{position}"
    for header in table[0]:
      interaction.table.columns.add().text = header
    for line in table[1:]:
      row = interaction.table.rows.add()
      for cell in line:
        row.cells.add().text = cell
    number_annotation_utils.add_numeric_values(interaction)
    for i in range(len(interaction.questions)):
      try:
        yield converter.convert(interaction, i)
      except ValueError as e:
        print(f"Can't convert interaction: {interaction.id} error: {e}")
        
def write_tf_example(filename, examples):
  with tf.io.TFRecordWriter(filename) as writer:
    for example in examples:
      writer.write(example.SerializeToString())

def predict(table_data, queries):
  table = table_data
  examples = convert_interactions_to_examples([(table, queries)])
  write_tf_example("results/sqa/tf_examples/test.tfrecord", examples)
  write_tf_example("results/sqa/tf_examples/random-split-1-dev.tfrecord", [])
  
  ! python tapas/tapas/run_task_main.py \
    --task="SQA" \
    --output_dir="results" \
    --noloop_predict \
    --test_batch_size={len(queries)} \
    --tapas_verbosity="ERROR" \
    --compression_type= \
    --init_checkpoint="tapas_sqa_base/model.ckpt" \
    --bert_config_file="tapas_sqa_base/bert_config.json" \
    --mode="predict" 2> error


  results_path = "results/sqa/model/test_sequence.tsv"
  all_coordinates = []
  df = pd.DataFrame(table[1:], columns=table[0])
  display(IPython.display.HTML(df.to_html(index=False)))
  print()
  with open(results_path) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    for result_row in reader:
      coordinates = prediction_utils.parse_coordinates(result_row["answer_coordinates"])
      all_coordinates.append(coordinates)
      # Coordinates index data rows, so add 1 to skip the header row in `table`.
      answers = ', '.join([table[r + 1][c] for r, c in coordinates])
      position = int(result_row['position'])
      print(">", queries[position])
      print(answers)
  return all_coordinates
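
The last step of `predict` turns predicted answer coordinates back into cell text. The sketch below isolates that lookup in plain Python, assuming the coordinates have already been parsed into `(row, column)` integer pairs, as the loop in `predict` treats them, and that row 0 refers to the first data row rather than the header. `answers_from_coordinates` is a hypothetical helper name used only for illustration.

```python
def answers_from_coordinates(table, coordinates):
    """Maps (row, column) answer coordinates to the text of each cell.

    table: list of lists, with the header at table[0].
    coordinates: iterable of (row, column) pairs where row 0 is the
      first data row, so the header has to be skipped with a +1 offset.
    """
    return [table[row + 1][col] for row, col in coordinates]


table = [
    ["Pos", "Player", "Team"],
    ["1", "Sachin Tendulkar", "India"],
    ["2", "Kumar Sangakkara", "Sri Lanka"],
]
print(", ".join(answers_from_coordinates(table, [(0, 1), (1, 1)])))
# prints: Sachin Tendulkar, Kumar Sangakkara
```

The same helper could be applied to each entry of the list returned by `predict` to recover the answers without re-reading the results file.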

# Predict

In [13]:
result=predict(list_of_list,["what were the players names?",
                             "of these,which team did Sachin Tendulkar play for?",
                             "what is his highest score?",
                             "how many runs has Virat Kohli scored?"])

is_built_with_cuda: True
is_gpu_available: False
GPUs: []
Training or predicting ...
Evaluation finished after training step 0.


Pos,Player,Team,Span,Innings,Runs,Highest Score,Average,Strike Rate
1,Sachin Tendulkar,India,1989-2012,452,18426,200,44.83,86.23
2,Kumar Sangakkara,Sri Lanka,2000-2015,380,14234,169,41.98,78.86
3,Ricky Ponting,Australia,1995-2012,365,13704,164,42.03,80.39
4,Sanath Jayasuriya,Sri Lanka,1989-2011,433,13430,189,32.36,91.2
5,Mahela Jayawardene,Sri Lanka,1998-2015,418,12650,144,33.37,78.96
6,Virat Kohli,India,2008-2020,236,11867,183,59.85,93.39
7,Inzamam-ul-Haq,Pakistan,1991-2007,350,11739,137,39.52,74.24
8,Jacques Kallis,South Africa,1996-2014,314,11579,139,44.36,72.89
9,Saurav Ganguly,India,1992-2007,300,11363,183,41.02,73.7
10,Rahul Dravid,India,1996-2011,318,10889,153,39.16,71.24



> what were the players names?
Sachin Tendulkar, Rahul Dravid, Jacques Kallis, Saurav Ganguly, Inzamam-ul-Haq, Sanath Jayasuriya, Ricky Ponting, Virat Kohli, Mahela Jayawardene, Kumar Sangakkara
> of these,which team did Sachin Tendulkar play for?
India
> what is his highest score?
200
> how many runs has Virat Kohli scored?
11867
