# Sentence classification - BERT finetuning - TPU

<table class="tfo-notebook-buttons" align="left" >
 <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>


**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a new method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

This Colab demonstrates using a free Colab Cloud TPU to fine-tune sentence and sentence-pair classification tasks built on top of pretrained BERT models.

**Note:** You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket for this Colab to run.

Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket. You have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

Once you finish the setup, let's start!

**Firstly**, we need to set up the Colab TPU runtime environment, verify that a TPU device is successfully connected, and upload credentials to the TPU for GCS bucket access.

In [1]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.104.37.58:8470


W0802 07:51:22.676072 140257710847872 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 9859838020152805601),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 11801699439757522170),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 17509246291820947546),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 2814200446972778295),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 9251293896013119029),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 5181952564651276190),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 15803278180543265295),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 16716905167322649165),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 15631505610979133869),
 _DeviceAttributes(/job:tpu_w
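
The connection check in the cell above can be factored into a small helper. The following is only a sketch: `tpu_address_from_env` is a hypothetical name that does not exist in the notebook, and the address in the example is made up. It assumes nothing beyond what the cell already relies on, namely that Colab exposes the TPU endpoint in the `COLAB_TPU_ADDR` environment variable.

```python
import os


def tpu_address_from_env(env=None):
    """Return the gRPC address of the Colab TPU, or raise if none is attached.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    addr = env.get('COLAB_TPU_ADDR')
    if not addr:
        raise RuntimeError(
            'Not connected to a TPU runtime; select a TPU accelerator '
            'in the Colab runtime settings.')
    return 'grpc://' + addr


# Example with a made-up address of the same shape as the output above.
print(tpu_address_from_env({'COLAB_TPU_ADDR': '10.0.0.2:8470'}))
```

Raising an exception instead of using `assert` keeps the check active even when Python runs with assertions disabled.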

**Secondly**, prepare and import BERT modules.

In [2]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']

Cloning into 'bert_repo'...
remote: Enumerating objects: 333, done.
remote: Total 333 (delta 0), reused 0 (delta 0), pack-reused 333
Receiving objects: 100% (333/333), 279.30 KiB | 3.83 MiB/s, done.
Resolving deltas: 100% (183/183), done.


**Thirdly**, prepare for training:

*  Specify the task and download training data.
*  Specify the BERT pretrained model.
*  Specify the GCS bucket and create an output directory for model checkpoints and eval results.



In [3]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')
!ls "/content/drive/My Drive"

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive
 19072019-Alaeddin.gdoc
 20180801_theHindu_AgreedByAll.gsheet
 20180901_theIndianExpress_AgreedByAll.xlsx
 20190319_fd_Error_Analysis_on_India_Data.gsheet
 26072019-Alaeddin.gdoc
 AlaeddinSelcukGurelReport.gdoc
 allbertpredictions.csv
 Assel_Guncel_Program.gsheet
 assignment1-comp429.gdoc
 assignment1-comp429.pdf
'batch1 (1).json'
 batch1.json
 batch1_updated.json
'batch2 (1).json'
 batch2.json
 batch2_updated.json
'batch3 (1).json'
 batch3.json
 batch3_update

In [4]:
TASK = 'sentence_classification' #@param {type:"string"}
#assert TASK in ('MRPC', 'CoLA'), 'Only (MRPC, CoLA) are demonstrated here.'
# Download glue data.
#! test -d download_glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo
#!python download_glue_repo/download_glue_data.py --data_dir='glue_data' --tasks=$TASK
TASK_DATA_DIR = "/content/drive/'My Drive'/samplesforbert" #@param {type:"string"}
print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
#!ls /content/drive/'My Drive'/thesis/data
!ls $TASK_DATA_DIR

# Available pretrained model checkpoints:
#   uncased_L-12_H-768_A-12: uncased BERT base model
#   uncased_L-24_H-1024_A-16: uncased BERT large model
#   cased_L-12_H-768_A-12: cased BERT base model
BERT_MODEL = 'uncased_L-12_H-768_A-12' #@param {type:"string"}
BERT_PRETRAINED_DIR = 'gs://cloud-tpu-checkpoints/bert/' + BERT_MODEL
print('***** BERT pretrained directory: {} *****'.format(BERT_PRETRAINED_DIR))
!gsutil ls $BERT_PRETRAINED_DIR

BUCKET = 'alaeddinselcukgurel' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/bert/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))


***** Task data directory: /content/drive/'My Drive'/samplesforbert *****
'guardian_bert (1).gsheet'   guardian_sample_with_annotations.csv
 guardian_bert.csv	     sentence_predictions.csv
 guardian_bert.gsheet
***** BERT pretrained directory: gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12 *****
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_config.json
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.index
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.meta
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/checkpoint
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/vocab.txt
***** Model output directory: gs://alaeddinselcukgurel/bert/models/sentence_classification *****
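
The checkpoint names listed in the cell above encode the model shape: `L` is the number of Transformer layers, `H` the hidden size, and `A` the number of attention heads. As a rough illustration, the sketch below parses such a name. `describe_bert_model` is a hypothetical helper, not part of the BERT repository, and the pattern only covers the `cased`/`uncased` names mentioned in this notebook.

```python
import re

# Matches names such as 'uncased_L-12_H-768_A-12'.
_MODEL_NAME = re.compile(r'^(uncased|cased)_L-(\d+)_H-(\d+)_A-(\d+)$')


def describe_bert_model(name):
    """Split a BERT checkpoint name into casing and architecture sizes."""
    match = _MODEL_NAME.match(name)
    if match is None:
        raise ValueError('Unrecognised BERT model name: {!r}'.format(name))
    casing, layers, hidden, heads = match.groups()
    return {
        'do_lower_case': casing == 'uncased',
        'num_layers': int(layers),
        'hidden_size': int(hidden),
        'num_attention_heads': int(heads),
    }


info = describe_bert_model('uncased_L-12_H-768_A-12')
print(info)
```

The `do_lower_case` entry mirrors the `BERT_MODEL.startswith('uncased')` check used later when the tokenizer is created.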


**Now, let's play!**

In [5]:
# Setup task specific model and TPU running config.

import modeling
import optimization
import run_classifier
import tokenization


# Model Hyper Parameters
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1
MAX_SEQ_LENGTH = 256
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
ITERATIONS_PER_LOOP = 1000
NUM_TPU_CORES = 8
VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

#processors = {
 # "cola": run_classifier.ColaProcessor,
 # "mnli": run_classifier.MnliProcessor,
 # "mrpc": run_classifier.MrpcProcessor,
#}
#processor = processors[TASK.lower()]()
#label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=OUTPUT_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    keep_checkpoint_max=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))



W0802 07:52:17.407781 140257710847872 deprecation_wrapper.py:119] From bert_repo/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0802 07:52:17.425853 140257710847872 deprecation_wrapper.py:119] From bert_repo/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
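
`MAX_SEQ_LENGTH` and the tokenizer set up above determine the fixed-size inputs that `run_classifier.convert_examples_to_features` builds later. The sketch below illustrates that layout for a single sentence: truncate to leave room for `[CLS]` and `[SEP]`, then pad with zeros. It is a simplified stand-in, not the repository code. The function name, the toy vocabulary and its ids are invented for illustration, and the real pipeline uses WordPiece tokens from `vocab.txt`.

```python
def build_single_sentence_features(tokens, vocab, max_seq_length):
    """Return (input_ids, input_mask, segment_ids), each of length max_seq_length."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[:max_seq_length - 2]
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    input_ids = [vocab[token] for token in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens, 0 marks padding
    segment_ids = [0] * len(input_ids)  # a single sentence uses segment 0 throughout
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


# Toy vocabulary with invented ids (assumption: not the real vocab.txt).
toy_vocab = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2,
             'police': 3, 'reached': 4, 'the': 5, 'spot': 6}
ids, mask, segments = build_single_sentence_features(
    ['police', 'reached', 'the', 'spot'], toy_vocab, max_seq_length=8)
print(ids)
print(mask)
print(segments)
```

Sentences longer than `MAX_SEQ_LENGTH - 2` tokens lose their tail, so a larger value trades memory and speed for less truncation.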



**CSV file reader**

In [6]:
batch_no = '5'
import pandas as pd
TASK_DATA_DIR_EDITED = '/content/drive/My Drive/thesis/emwdocumentbert'
#get training data
#data_file_name = "Newindianexpress_sentences_adjudicated_train.csv"
data_file_name = "/Batch_" + batch_no + "/train_" + batch_no + ".csv"
train = pd.read_csv(TASK_DATA_DIR_EDITED + '/' + data_file_name, header=0)

import numpy as np
list(train)
nrow,ncol=train.shape

#get validation data
#data_file_name = "Newindianexpress_sentences_adjudicated_val.csv"
data_file_name = "/Batch_" + batch_no + "/validation_" + batch_no + ".csv"
val = pd.read_csv(TASK_DATA_DIR_EDITED + '/' + data_file_name, header=0)
list(val)
vnrow,vncol=val.shape
vnrow

#get eval (test) data
#data_file_name = "Newindianexpress_sentences_adjudicated_test.csv"
data_file_name = "/Batch_" + batch_no + "/test_" + batch_no + ".csv"
eval = pd.read_csv(TASK_DATA_DIR_EDITED + '/' + data_file_name, header=0)
list(eval)
enrow,encol=eval.shape
enrow

list(train)
train_text = train["text"]
print(len(train_text))

none_protest = train[train["label"] == 0]
none_protest_text = none_protest["text"]
print(len(none_protest_text))

happened = train[train["label"] == 1]
happened_text = happened["text"]
print(len(happened_text))

planned = train[train["label"] == 2]
planned_text = planned["text"]
print(len(planned_text))

assert len(planned_text) + len(happened_text) + len(none_protest_text) == len(train_text)

3579
2723
856
0
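
The counts printed above (3579 training rows, 2723 with label 0, 856 with label 1, none with label 2) show an imbalanced binary task. A small sketch of turning such counts into class shares follows. `label_shares` is a hypothetical helper, and the label list is rebuilt from the printed counts rather than read from the CSV.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of examples per label, ordered by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Rebuild a label list from the counts printed above.
labels = [0] * 2723 + [1] * 856
shares = label_shares(labels)
for label, share in shares.items():
    print('label {}: {:.1%}'.format(label, share))
```

With roughly three quarters of the rows in class 0, plain accuracy can look good for a classifier that rarely predicts class 1, which is one reason to also report macro F1 and MCC.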


In [7]:
len(train.loc[train.label == 2])

0

In [8]:
# used to calculate metrics (in bert)
val.shape

(457, 5)

In [9]:
# used to predict and calculate custom metrics (f1 and mcc)
eval.shape

(687, 5)

In [0]:
DATA_COLUMN = 'text'
LABEL_COLUMN = 'label'
# label_list is the list of labels, i.e. True, False or 0, 1 or 'dog', 'cat'
label_list = [0, 1]


In [0]:
#train_examples = processor.get_train_examples(TASK_DATA_DIR)
# customize get_train_examples:
# Use the InputExample class from BERT's run_classifier code to create examples from the data
train_examples = train.apply(lambda x: run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

val_examples = val.apply(lambda x: run_classifier.InputExample(guid=None, 
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

In [12]:
num_train_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    use_one_hot_embeddings=True)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=EVAL_BATCH_SIZE)

W0802 07:52:22.537038 140257710847872 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f90038f1ea0>) includes params argument, but params are not passed to Estimator.
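
As a worked example of the step arithmetic in the cell above, the sketch below plugs in the values used in this notebook: 3579 training examples, `TRAIN_BATCH_SIZE = 32`, `NUM_TRAIN_EPOCHS = 3.0`, and `WARMUP_PROPORTION = 0.1`. `schedule_steps` is a hypothetical wrapper around the same two expressions.

```python
def schedule_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Mirror the notebook's num_train_steps / num_warmup_steps computation."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


steps, warmup = schedule_steps(3579, 32, 3.0, 0.1)
print(steps, warmup)  # 335 33
```

These values shape the learning-rate schedule inside `model_fn`. The training cell passes `max_steps=500` to `estimator.train`, so it is worth checking that the schedule length and the actual number of training steps agree.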


In [14]:
# Train the model.
#print('MRPC/CoLA on BERT base model normally takes about 2-3 minutes. Please wait...')
train_features = run_classifier.convert_examples_to_features(
    train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('***** Started training at {} *****'.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(train_examples)))
print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
tf.compat.v1.logging.info("  Num steps = %d", num_train_steps)
train_input_fn = run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=500)
print('***** Finished training at {} *****'.format(datetime.datetime.now()))

***** Started training at 2019-08-02 07:52:55.883094 *****
  Num examples = 3579
  Batch size = 32
***** Finished training at 2019-08-02 07:52:57.155583 *****


In [15]:
# Eval the model.
eval_features = run_classifier.convert_examples_to_features(val_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
#eval_examples = processor.get_dev_examples(TASK_DATA_DIR)
#eval_features = run_classifier.convert_examples_to_features(
#    val_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(val_examples)))
print('  Batch size = {}'.format(EVAL_BATCH_SIZE))
# Eval will be slightly WRONG on the TPU because it will truncate
# the last batch.
eval_steps = int(len(val_examples) / EVAL_BATCH_SIZE)
eval_input_fn = run_classifier.input_fn_builder(
    features=eval_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=True)

''' 
#This is not needed. Evaluation already restores directly from the training checkpoint.
TRAINING_CHECKPOINT = os.path.join(OUTPUT_DIR, 'model.ckpt-500')
model_fn2 = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=TRAINING_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    use_one_hot_embeddings=True)

estimator2 = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn2,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=EVAL_BATCH_SIZE)
'''

result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
output_eval_file = os.path.join(OUTPUT_DIR, "eval_results.txt")
with tf.io.gfile.GFile(output_eval_file, "w") as writer:
  print("***** Eval results *****")
  for key in sorted(result.keys()):
    print('  {} = {}'.format(key, str(result[key])))
    writer.write("%s = %s\n" % (key, str(result[key])))

***** Started evaluation at 2019-08-02 07:53:00.763223 *****
  Num examples = 457
  Batch size = 8


W0802 07:53:01.491776 140257710847872 deprecation_wrapper.py:119] From bert_repo/modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0802 07:53:01.497095 140257710847872 deprecation_wrapper.py:119] From bert_repo/modeling.py:409: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W0802 07:53:01.553837 140257710847872 deprecation_wrapper.py:119] From bert_repo/modeling.py:490: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

W0802 07:53:01.624007 140257710847872 deprecation.py:323] From bert_repo/modeling.py:671: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
W0802 07:53:05.157678 140257710847872 deprecation_wrapper.py:119] From bert_repo/run_classifier.py:647: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable

***** Finished evaluation at 2019-08-02 07:53:42.388128 *****
***** Eval results *****
  eval_accuracy = 0.9232456
  eval_loss = 0.37256804
  global_step = 500
  loss = 0.10518332
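
The comment in the evaluation cell notes that results are slightly off because the last batch is truncated. With the numbers from this run (457 validation examples, `EVAL_BATCH_SIZE = 8`), the effect can be quantified with a short sketch. `eval_coverage` is a hypothetical helper that repeats the `eval_steps` arithmetic.

```python
def eval_coverage(num_examples, batch_size):
    """Return (steps, examples evaluated, examples dropped) for full batches only."""
    steps = num_examples // batch_size
    evaluated = steps * batch_size
    return steps, evaluated, num_examples - evaluated


steps, evaluated, dropped = eval_coverage(457, 8)
print(steps, evaluated, dropped)  # 57 456 1
```

Here only one example is left out, so the reported `eval_accuracy` is computed over 456 of the 457 rows.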


In [0]:
PREDICT_BATCH_SIZE = 8
def getPrediction(in_sentences):
  labels = ["0", "1"]
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is a dummy guid and 0 is a dummy label
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=True)
  predictions = estimator.predict(predict_input_fn)
  return predictions
 
  #return [(sentence, prediction['probabilities'], labels[prediction['labels']]) for sentence, prediction in zip(in_sentences, predictions)]

In [0]:
test_sentences = eval['text']
result = getPrediction(test_sentences)

In [26]:
eval.columns

Index(['Unnamed: 0', 'id', 'label', 'text', 'url'], dtype='object')

In [0]:

num_actual_predict_examples = len(test_sentences)
output_predict_file = os.path.join(OUTPUT_DIR, "test_results.tsv")
pred_labels = []
pred_probs = []
with tf.io.gfile.GFile(output_predict_file, "w") as writer:
  num_written_lines = 0
  tf.logging.info("***** Predict results *****")
  for (i, prediction) in enumerate(result):
    # Stop before recording anything beyond the real inputs.
    if i >= num_actual_predict_examples:
      break
    probabilities = prediction["probabilities"]
    probabilitieslist = list(probabilities)
    pred_probs.append(probabilitieslist)
    label = str(probabilitieslist.index(max(probabilities))) # index of the most probable class: 0 or 1
    pred_labels.append(label)
    output_line = label + "\t" + "\t".join(
        str(class_probability)
        for class_probability in probabilities) + "\t"+ test_sentences[i] + "\n" 
    writer.write(output_line)
    if label == "2": print(output_line)
    num_written_lines += 1
#assert num_written_lines == num_actual_predict_examples
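
The loop above picks the class with the highest probability and writes one tab-separated line per sentence. A minimal sketch of those two steps in isolation is given here. `predicted_label` and `format_prediction` are hypothetical names, and the probabilities and sentence are invented.

```python
def predicted_label(probabilities):
    """Return the index of the largest probability (first index on ties)."""
    probs = list(probabilities)
    return probs.index(max(probs))


def format_prediction(probabilities, sentence):
    """Build one TSV line: label, per-class probabilities, then the sentence."""
    label = predicted_label(probabilities)
    fields = [str(label)] + [str(p) for p in probabilities] + [sentence]
    return '\t'.join(fields) + '\n'


line = format_prediction([0.25, 0.75], 'Workers marched to the town hall.')
print(repr(line))
```

Because sentences can themselves contain tab characters, a real export may need escaping or a CSV writer; that is left out of this sketch.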

In [0]:
test_labels = eval[LABEL_COLUMN].tolist()
pred_labels_int = [int(i) for i in pred_labels]
pred_probs_float = [[float(i) for i in a] for a in pred_probs]

assert len(test_labels) == len(pred_labels_int)

In [0]:
test_labels

In [30]:
import sklearn.metrics  # import the submodule explicitly; a bare "import sklearn" does not guarantee it is loaded
f1_macro = sklearn.metrics.f1_score(test_labels, pred_labels_int, average = "macro")
mcc = sklearn.metrics.matthews_corrcoef(test_labels, pred_labels_int)
accuracy = sklearn.metrics.accuracy_score(test_labels, pred_labels_int)
#log_loss = sklearn.metrics.log_loss(test_labels, pred_probs_float)
print(f1_macro)
print(mcc)
print(accuracy)
print(sklearn.metrics.precision_score(test_labels, pred_labels_int))
print(sklearn.metrics.recall_score(test_labels, pred_labels_int))

#print(log_loss)

0.9082135131696145
0.8191498477728089
0.9388646288209607
0.9117647058823529
0.8051948051948052
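
To make the printed metrics easier to interpret, the sketch below recomputes them from confusion-matrix counts using the standard formulas. The counts are not printed by the notebook. They are inferred here as integer counts consistent with the 687 test rows and the precision, recall, and accuracy shown above, and should be treated as an assumption.

```python
import math

# Inferred counts for the positive class (label 1); an assumption, see above.
tp, fp, fn, tn = 124, 12, 30, 521

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Per-class F1, then the unweighted (macro) mean over both classes.
f1_pos = 2 * tp / (2 * tp + fp + fn)
f1_neg = 2 * tn / (2 * tn + fn + fp)
f1_macro = (f1_pos + f1_neg) / 2

# Matthews correlation coefficient for the binary case.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f1_macro, mcc, accuracy, precision, recall)
```

Under these counts, the model misses 30 of 154 positive sentences and raises 12 false alarms, which is why recall is noticeably lower than precision.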


In [0]:
import nltk
nltk.download('popular')

In [34]:
#INDIA EVALUATION TO SENTENCES
rows = []  # collect rows first; appending to a DataFrame inside a loop copies it every time
for idx, doc in eval.iterrows():
  url = doc.url
  for elem in nltk.sent_tokenize(doc.text):
    rows.append({"url": url, "sentence": elem})
texts_and_urls = pd.DataFrame(rows, columns=["url", "sentence"])
# Alternative (unused): predict per document instead of over all sentences at once.
#   tokenized_sentences = nltk.sent_tokenize(doc)
#   tokenized_results = getPrediction(tokenized_sentences)
#   result_list.append(tokenized_results)

In [35]:
len(texts_and_urls)

9575
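
The cell above turns each document into one row per sentence while keeping the source URL. The sketch below shows the same pattern without pandas or NLTK. `naive_sent_split` is a deliberately crude regular-expression stand-in for `nltk.sent_tokenize`, `explode_documents` is a hypothetical name, and the URLs and texts are invented.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (stand-in for NLTK)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def explode_documents(docs):
    """Return one {'url', 'sentence'} row per sentence of every document."""
    rows = []
    for doc in docs:
        for sentence in naive_sent_split(doc['text']):
            rows.append({'url': doc['url'], 'sentence': sentence})
    return rows


docs = [
    {'url': 'https://example.com/a',
     'text': 'Workers went on strike. Police arrived later.'},
    {'url': 'https://example.com/b',
     'text': 'The match was postponed.'},
]
rows = explode_documents(docs)
for row in rows:
    print(row)
```

Collecting rows in a list and building the table once at the end avoids repeatedly copying a growing table, which matters at the scale of thousands of sentences seen here.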

In [0]:
#INDIA EVALUATION SENTENCES PREDICTION RESULT
result = getPrediction(texts_and_urls.sentence)

In [0]:
#INDIA SENTENCE PREDICTION
num_actual_predict_examples = len(texts_and_urls)
output_predict_file = os.path.join(OUTPUT_DIR, "test_results.tsv")
pred_labels = []
pred_probs = []
with tf.io.gfile.GFile(output_predict_file, "w") as writer:
  num_written_lines = 0
  tf.logging.info("***** Predict results *****")
  for (i, prediction) in enumerate(result):
    # Stop before recording anything beyond the real inputs.
    if i >= num_actual_predict_examples:
      break
    probabilities = prediction["probabilities"]
    probabilitieslist = list(probabilities)
    pred_probs.append(probabilitieslist)
    label = str(probabilitieslist.index(max(probabilities))) # index of the most probable class: 0 or 1
    pred_labels.append(label)
#   output_line = label + "\t" + "\t".join(
#       str(class_probability)
#       for class_probability in probabilities) + "\t"+ test_sent[i] + "\n" 
#   writer.write(output_line)
#   if label == "2": print(output_line)
    num_written_lines += 1
#assert num_written_lines == num_actual_predict_examples

In [38]:
len(pred_labels)

9575

In [0]:
#INDIA ADD PREDICTIONS TO DF
texts_and_urls['bert_sentence_pred'] = pred_labels

In [40]:
texts_and_urls.head()

Unnamed: 0,url,sentence,bert_sentence_pred
0,https://timesofindia.indiatimes.com/city/benga...,"HDMC man's wife chained, assaulted for dowry\n...",0
1,https://timesofindia.indiatimes.com/city/benga...,the woman was admitted to the kims hospital in...,0
2,https://timesofindia.indiatimes.com/city/benga...,"the victim, identified as lakshmi anantha taga...",0
3,https://timesofindia.indiatimes.com/city/benga...,"on a tip-off, the police reached the spot and ...",0
4,https://timesofindia.indiatimes.com/city/benga...,the police source revealed that lakshmi was ma...,0


In [0]:
PREDICT_BATCH_SIZE = 8
def getPrediction(in_sentences):
  labels = ["0", "1"]
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is a dummy guid and 0 is a dummy label
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=True)
  predictions = estimator.predict(predict_input_fn)
  return predictions

In [26]:
num_train_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    use_one_hot_embeddings=True)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=EVAL_BATCH_SIZE)

W0731 12:23:53.909987 139773629339520 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f1f48cb6c80>) includes params argument, but params are not passed to Estimator.


In [19]:
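# Checkpoint prefix written by the training run above (a prefix, not a directory).
checkpoint_dir = "gs://alaeddinselcukgurel/bert/models/sentence_classification/model.ckpt-500"

# tf.train.latest_checkpoint expects the directory that holds the checkpoints.
latest = tf.train.latest_checkpoint(os.path.dirname(checkpoint_dir))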



num_train_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=checkpoint_dir,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    use_one_hot_embeddings=True)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=EVAL_BATCH_SIZE)


W0731 20:03:38.271130 140432068097920 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fb89a850510>) includes params argument, but params are not passed to Estimator.
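
The cell above passes a checkpoint prefix ending in `model.ckpt-500` as `init_checkpoint`, whereas `tf.train.latest_checkpoint` takes the directory that contains the checkpoints. The sketch below separates the two parts of such a path with the standard library only. `split_checkpoint_prefix` is a hypothetical helper and `my-bucket` is a placeholder bucket name.

```python
import posixpath
import re


def split_checkpoint_prefix(prefix):
    """Split '<dir>/model.ckpt-<step>' into (directory, global step)."""
    directory, basename = posixpath.split(prefix)
    match = re.match(r'^model\.ckpt-(\d+)$', basename)
    if match is None:
        raise ValueError('Not a model.ckpt-<step> prefix: {!r}'.format(prefix))
    return directory, int(match.group(1))


directory, step = split_checkpoint_prefix(
    'gs://my-bucket/bert/models/sentence_classification/model.ckpt-500')
print(directory)
print(step)
```

The directory part is what a latest-checkpoint lookup would scan, and the step matches the `global_step = 500` reported by the evaluation cell.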


In [0]:
import pandas as pd
import numpy as np

EVALUATION_SENTENCES_DIR = "/content/drive/My Drive/samplesforbert/guardian_sample_with_annotations.csv"

evaluation_sentences = pd.read_csv(EVALUATION_SENTENCES_DIR, header=0)

In [0]:
PREDICT_BATCH_SIZE = 8
def getPrediction(in_sentences):
  labels = ["0", "1"]
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is a dummy guid and 0 is a dummy label
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=True)
  predictions = estimator.predict(predict_input_fn)
  return predictions

In [0]:
test_sent = evaluation_sentences['text']
test_sent = test_sent.dropna()
result = getPrediction(test_sent)

In [0]:
#SENTENCE PREDICTION
import nltk
nltk.download("popular")

In [64]:
len(evaluation_sentences)

500

In [65]:
test_sent = evaluation_sentences['text']
test_sent = test_sent.dropna()
#evaluation_sentences = evaluation_sentences.dropna()
#result_list = []
rows = []  # collect rows first; appending to a DataFrame inside a loop copies it every time
for idx, doc in evaluation_sentences.iterrows():
  url = doc.url
  for elem in nltk.sent_tokenize(doc.text):
    rows.append({"url": url, "sentence": elem})
texts_and_urls = pd.DataFrame(rows, columns=["url", "sentence"])
# Alternative (unused): predict per document instead of over all sentences at once.
#   tokenized_sentences = nltk.sent_tokenize(doc)
#   tokenized_results = getPrediction(tokenized_sentences)
#   result_list.append(tokenized_results)

In [0]:
result = getPrediction(texts_and_urls.sentence)

In [79]:
len(texts_and_urls.sentence)

15033

In [0]:
#SENTENCE PREDICTION
num_actual_predict_examples = len(texts_and_urls)
output_predict_file = os.path.join(OUTPUT_DIR, "test_results.tsv")
pred_labels = []
pred_probs = []
with tf.io.gfile.GFile(output_predict_file, "w") as writer:
  num_written_lines = 0
  tf.logging.info("***** Predict results *****")
  for (i, prediction) in enumerate(result):
    # Stop before recording anything beyond the real inputs.
    if i >= num_actual_predict_examples:
      break
    probabilities = prediction["probabilities"]
    probabilitieslist = list(probabilities)
    pred_probs.append(probabilitieslist)
    label = str(probabilitieslist.index(max(probabilities))) # index of the most probable class: 0 or 1
    pred_labels.append(label)
#   output_line = label + "\t" + "\t".join(
#       str(class_probability)
#       for class_probability in probabilities) + "\t"+ test_sent[i] + "\n" 
#   writer.write(output_line)
#   if label == "2": print(output_line)
    num_written_lines += 1
#assert num_written_lines == num_actual_predict_examples

In [0]:
texts_and_urls['bert_sentence_pred'] = pred_labels

In [83]:
texts_and_urls

Unnamed: 0,url,sentence,bert_sentence_pred
0,https://content.guardianapis.com/society/2008/...,The annual Mills Regeneration Conference has b...,0
1,https://content.guardianapis.com/society/2008/...,"In the early 1990s, the first conference was h...",0
2,https://content.guardianapis.com/society/2008/...,"As the decade progressed, more and more mills ...",0
3,https://content.guardianapis.com/society/2008/...,The conferences after Hebden Bridge visited mi...,0
4,https://content.guardianapis.com/society/2008/...,"Now in 2008, with the impact of the credit cru...",0
5,https://content.guardianapis.com/society/2008/...,The north's top 10 1 Salts Mill Salts Mill in ...,0
6,https://content.guardianapis.com/society/2008/...,A vast complex of more than one million square...,0
7,https://content.guardianapis.com/society/2008/...,"The late Jonathan Silver, a Bradford born entr...",0
8,https://content.guardianapis.com/society/2008/...,"Mills hadn't been used as galleries, but the t...",0
9,https://content.guardianapis.com/society/2008/...,Layers of dull cream and blue paint was remove...,0


In [0]:
for (i, prediction) in enumerate(result):
  probabilities = prediction["probabilities"]
  probabilitieslist = list(probabilities)
  pred_probs.append(probabilitieslist)
  if i >= num_actual_predict_examples:
    break
  label = str(probabilitieslist.index(max(probabilities))) # 0 1 or 2
  print(label)

In [51]:
len(result_list)

4765

In [140]:

num_actual_predict_examples = len(test_sent)
output_predict_file = os.path.join(OUTPUT_DIR, "test_results.tsv")
pred_labels = []
pred_probs = []
with tf.io.gfile.GFile(output_predict_file, "w") as writer:
  num_written_lines = 0
  tf.logging.info("***** Predict results *****")
  for (i, prediction) in enumerate(result):
    probabilities = prediction["probabilities"]
    probabilitieslist = list(probabilities)
    pred_probs.append(probabilitieslist)
    if i >= num_actual_predict_examples:
      break
    label = str(probabilitieslist.index(max(probabilities))) # 0 1 or 2
    pred_labels.append(label)
#   output_line = label + "\t" + "\t".join(
#       str(class_probability)
#       for class_probability in probabilities) + "\t"+ test_sent[i] + "\n" 
#   writer.write(output_line)
#   if label == "2": print(output_line)
    num_written_lines += 1
#assert num_written_lines == num_actual_predict_examples

TypeError: ignored

In [139]:
pred_labels

[]

In [0]:
for (i, prediction) in enumerate(result):
    probabilities = prediction["probabilities"]
    print(probabilities)

In [0]:
pred_labels_int = [int(i) for i in pred_labels]



In [0]:
df = pd.DataFrame()
df['text'] = test_sent
df['label'] = pred_labels_int


In [35]:
len(test_sent)

4765

In [0]:
df.to_csv("bertdocumentpredictions.csv")

In [0]:
from google.colab import drive
drive.mount('drive')

df.to_csv('allbertpredictions.csv')
!cp allbertpredictions.csv drive/My\ Drive/

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [0]:
print("files.download('df.csv')")

files.download('df.csv')


In [0]:
new_df = pd.DataFrame()
new_df['url'] = evaluation_sentences['url']
new_df['text'] = evaluation_sentences['text']


In [0]:
new_df = new_df.dropna()

In [0]:
new_df

In [0]:
exporting_df = pd.DataFrame()
exporting_df['url'] = new_df['url']
exporting_df['text'] = new_df['text']
exporting_df['label'] = pred_labels_int

In [0]:
from google.colab import drive
drive.mount('drive')

exporting_df.to_csv('bertdocumentpredictionsurl.csv')
!cp bertdocumentpredictionsurl.csv drive/My\ Drive/

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [62]:
import pickle

with open('result_lists.pkl', 'wb') as f:
  pickle.dump(result_list, f)


TypeError: ignored

In [61]:
type(result_list)

list

In [85]:
from google.colab import drive
drive.mount('drive')

texts_and_urls.to_csv('sentence_predictions.csv')
!cp sentence_predictions.csv drive/My\ Drive/

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [43]:
from google.colab import drive
drive.mount('drive')

texts_and_urls.to_csv('india_sentence_predictions.csv')
!cp india_sentence_predictions.csv drive/My\ Drive/

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).
