<a href="https://colab.research.google.com/github/Yaninast/NLP-projects/blob/master/BERT_News_Articles_classification_and_topic_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#BERT 
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

This notebook uses a free Colab Cloud TPU to fine-tune a multiclass classifier built on top of a pretrained BERT model, and then runs predictions with the fine-tuned model.

##Set up your TPU environment

*   Set up a Colab TPU running environment.
*   Verify that you are connected to a TPU device.
*   Upload your credentials to the TPU so that it can access your GCS bucket.

In [0]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

##Prepare and import BERT modules
With your environment configured, you can now prepare and import the BERT modules. The following step clones the source code from GitHub and imports the modules from the source. Alternatively, you can install BERT using pip (!pip install bert-tensorflow).

In [0]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']

# import python modules defined by BERT
import modeling
import optimization
import run_classifier
import run_classifier_with_tfhub
import tokenization

# import tfhub 
import tensorflow_hub as hub

Cloning into 'bert_repo'...
remote: Enumerating objects: 336, done.
remote: Total 336 (delta 0), reused 0 (delta 0), pack-reused 336
Receiving objects: 100% (336/336), 290.62 KiB | 3.87 MiB/s, done.
Resolving deltas: 100% (184/184), done.





##Prepare for training
This next section of code performs the following tasks:

*   Specify the task and download the training data.
*   Specify the pretrained BERT model.
*   Specify the GCS bucket, and create the output directory for model checkpoints and eval results.

In [0]:
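TASK = 'CoLA' #@param {type:"string"}
# The trailing comma makes this a one-element tuple; without it, `in` would do a substring test on a plain string.
assert TASK in ('CoLA',), 'Only CoLA is used.'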

# Download glue data.
#! test -d download_glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo
#!python train.tsv --data_dir='train_data' --tasks=$TASK

#TASK_DATA_DIR = 'glue_data/' + TASK
#print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
#!ls $TASK_DATA_DIR

BUCKET = 'bert_yanina' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/bert-tfhub/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

# Pretrained model checkpoints:uncased_L-12_H-768_A-12: uncased BERT base model
BERT_MODEL = 'uncased_L-12_H-768_A-12' #@param {type:"string"}
BERT_MODEL_HUB = 'https://tfhub.dev/google/bert_' + BERT_MODEL + '/1'

The data has been labeled and converted into tsv format to feed into BERT.
BERT expects the data in a tsv file with a specific format, given below (four columns and no header row).


*   Column 0: An ID for the row
*   Column 1: The label for the row (should be an int)
*   Column 2: A column of the same letter for all rows. BERT wants this so we’ll give it, but we don’t have a use for it.
*   Column 3: The text for the row
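
As a minimal sketch of this layout, the snippet below writes two rows in the four-column format using Python's standard `csv` module and reads them back. The sample rows and variable names are made up for illustration; only the column order comes from the list above.

```python
import csv
import io

# Hypothetical sample rows: (row id, integer label, filler letter, text).
sample_rows = [
    (0, 6, "a", "First example article text."),
    (1, 2, "a", "Second example article text."),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for row in sample_rows:
    writer.writerow(row)  # no header row is written

tsv_text = buffer.getvalue()
print(tsv_text)

# Reading the text back yields four string fields per line.
parsed_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed_rows[0])
```

Every field comes back as a string, so the label has to be compared as `"6"` rather than `6` when it is matched against a label list.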










In [0]:
from google.colab import files # upload files 
uploaded = files.upload()

Saving corpus_test_bert.csv to corpus_test_bert.csv
Saving corpus_train_bert_2.csv to corpus_train_bert_2.csv
Saving corpus_validation_bert_2.csv to corpus_validation_bert_2.csv


In [0]:
import pandas as pd
# Read the uploaded CSV files. The train and validation files are read
# without a header row; the test file has 'ID' and 'Text' columns.
df_train = pd.read_csv('corpus_train_bert_2.csv', header = None)
df_eval = pd.read_csv('corpus_validation_bert_2.csv', header = None)
df_test = pd.read_csv('corpus_test_bert.csv')

# Make sure the input data is in tsv format:
# no header for train and dev, a header for test.
#df_train.to_csv('/content/train.tsv', sep='\t', index=False, header=False)
#df_eval.to_csv('/content/dev.tsv', sep='\t', index=False, header=False)
#df_test.to_csv('/content/test.tsv', sep='\t', index=False, header=True)

df_bert_train = pd.DataFrame({'0':df_train[0],
                  '1':df_train[1],
                  '2':df_train[2],             
                  '3':df_train[3].replace(r'\n',' ',regex=True)})

df_bert_eval = pd.DataFrame({'0':df_eval[0],
                  '1':df_eval[1],
                  '2':df_eval[2],             
                  '3':df_eval[3].replace(r'\n',' ',regex=True)})

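# The test set has no labels: keep only the ID and the text, with newlines
# flattened. (A later reassignment from df_test used to overwrite this frame.)
df_bert_test = pd.DataFrame({'0':df_test['ID'],
                  '1':df_test['Text'].replace(r'\n',' ',regex=True)})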

df_bert_train.to_csv('/content/train.tsv', sep='\t', index=False, header=False, encoding="UTF-8")
df_bert_eval.to_csv('/content/dev.tsv', sep='\t', index=False, header=False, encoding="UTF-8")
df_bert_test.to_csv('/content/test.tsv', sep='\t', index=False, header=True, encoding="UTF-8")


!ls

Unnamed: 0,0,1,2,3
0,0,6,a,Nato at 70: Europe fears tensions will outlast...
1,1,2,a,How Immigration became real problem – Immigra...
2,2,2,a,Trump's Oval Office address shows he's hit a w...
3,3,4,a,"In this paper we conclude that, based upon pub..."
4,4,5,a,"Trump promises ‘epic’ trade deal with China, b..."
5,5,3,a,Has the Trump administration begun to dismantl...
6,6,4,a,Special counsel Robert Mueller did not find th...
7,7,4,a,"Natalia V. Veselnitskaya, the Russian lawyer w..."
8,8,2,a,The crisis on the southern border has been dri...
9,9,5,a,President Donald Trump declared the U.S. was “...


In [0]:
from __future__ import absolute_import, division, print_function

import csv
import os
import sys
import logging

logger = logging.getLogger()
csv.field_size_limit(2147483647) # Increase the CSV reader's field limit in case we have long text.


class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs an InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()
        
    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the test set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
            return lines


class MultiClassificationProcessor(DataProcessor):

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
      
    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["1", "2", "3", "4", "5", "6"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the train, dev, and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            # Only the test set has a header
            if set_type == "test" and i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            if set_type == "test":
                text_a = tokenization.convert_to_unicode(line[1])
                label = "1"  # placeholder; test rows carry no label
            else:
                text_a = tokenization.convert_to_unicode(line[3])
                label = tokenization.convert_to_unicode(line[1])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
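
To make the column choices in `_create_examples` easier to follow, here is a rough standalone sketch of the same mapping. The `Example` tuple, the `rows_to_examples` function, and the sample rows are hypothetical stand-ins; the sketch skips `tokenization.convert_to_unicode` so that it runs without the BERT repo.

```python
from collections import namedtuple

# Stand-in for the notebook's InputExample class (hypothetical, for illustration).
Example = namedtuple("Example", ["guid", "text_a", "label"])

def rows_to_examples(rows, set_type):
    """Mirror the column choices made in _create_examples."""
    examples = []
    for i, row in enumerate(rows):
        if set_type == "test" and i == 0:
            continue  # skip the test-set header row
        if set_type == "test":
            text, label = row[1], "1"  # placeholder label; test rows carry none
        else:
            text, label = row[3], row[1]
        examples.append(Example("%s-%s" % (set_type, i), text, label))
    return examples

# Made-up rows in the train layout (id, label, filler, text) and test layout (id, text).
demo_train_rows = [["0", "6", "a", "some article text"]]
demo_test_rows = [["ID", "Text"], ["0", "another article text"]]

demo_train_examples = rows_to_examples(demo_train_rows, "train")
demo_test_examples = rows_to_examples(demo_test_rows, "test")
print(demo_train_examples[0])
print(demo_test_examples[0])
```

Because the header row is skipped but still counted by `enumerate`, the first test example gets the guid `test-1` rather than `test-0`.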
 

In [0]:
TASK_DATA_DIR = '/content'

TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
PREDICT_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 20.0
MAX_SEQ_LENGTH = 500
# Warmup is a period of time where the learning rate
# is small and gradually increases; it usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
SAVE_SUMMARY_STEPS = 500


processor = MultiClassificationProcessor()
label_list = processor.get_labels()

# Compute number of train and warmup steps from batch size
train_examples = processor.get_train_examples(TASK_DATA_DIR)
num_train_steps = int(len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

train_examples_len = len(train_examples)
num_labels = len(label_list)

!ls
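
The step arithmetic in the cell above can be tried on its own. The sketch below repeats it in plain Python and adds a learning-rate function that assumes linear warmup followed by linear decay to zero. That schedule shape is an assumption about the optimizer, not something this notebook states, and the dataset size of 1,600 examples is made up.

```python
def count_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Same arithmetic as the notebook cell: total steps, then warmup steps."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps

def learning_rate_at(step, base_lr, num_train_steps, num_warmup_steps):
    """Assumed schedule: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr * (1.0 - step / num_train_steps)

# 1,600 examples is a hypothetical dataset size; the other values match the cell above.
demo_total_steps, demo_warmup_steps = count_steps(1600, 32, 20.0, 0.1)
print(demo_total_steps, demo_warmup_steps)

for demo_step in (0, 50, 100, 500, 1000):
    print(demo_step, learning_rate_at(demo_step, 2e-5, demo_total_steps, demo_warmup_steps))
```

With these numbers, training runs for 1,000 steps, of which the first 100 are warmup.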

In [0]:
# Setup TPU related config
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
NUM_TPU_CORES = 8
ITERATIONS_PER_LOOP = 1000

def get_run_config(output_dir):
  return tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=output_dir,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))


# Force TF Hub writes to the GS bucket we provide.
os.environ['TFHUB_CACHE_DIR'] = OUTPUT_DIR

model_fn = run_classifier_with_tfhub.model_fn_builder(
  num_labels=len(label_list),
  learning_rate=LEARNING_RATE,
  num_train_steps=num_train_steps,
  num_warmup_steps=num_warmup_steps,
  use_tpu=True,
  bert_hub_module_handle=BERT_MODEL_HUB
)

estimator_from_tfhub = tf.contrib.tpu.TPUEstimator(
  use_tpu=True,
  model_fn=model_fn,
  config=get_run_config(OUTPUT_DIR),
  train_batch_size=TRAIN_BATCH_SIZE,
  eval_batch_size=EVAL_BATCH_SIZE,
  predict_batch_size=PREDICT_BATCH_SIZE,
)


In [0]:
tokenizer = run_classifier_with_tfhub.create_tokenizer_from_hub_module(BERT_MODEL_HUB)
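
Before training, `run_classifier.convert_examples_to_features` turns each example into features of length `MAX_SEQ_LENGTH`. The sketch below illustrates the idea for a single-sentence example, assuming the usual BERT layout of `[CLS]` + tokens + `[SEP]` followed by padding. The whitespace split and the name `build_tokens_and_mask` are stand-ins; this is not the WordPiece tokenizer or the `run_classifier` implementation.

```python
def build_tokens_and_mask(text, max_seq_length):
    """Truncate, add special tokens, and pad to a fixed length."""
    tokens = text.split()  # stand-in for WordPiece tokenization
    # Reserve two positions for [CLS] and [SEP].
    tokens = tokens[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 1 marks a real token, 0 marks padding.
    input_mask = [1] * len(tokens)
    padding = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * padding
    input_mask += [0] * padding
    return tokens, input_mask

# A short text is padded; a long one is truncated.
demo_tokens, demo_mask = build_tokens_and_mask("trade deal with china", 8)
print(demo_tokens, demo_mask)

long_tokens, long_mask = build_tokens_and_mask("a b c d e f g h i", 5)
print(long_tokens, long_mask)
```

Text beyond the length limit is simply cut off, so with long news articles only the beginning of each article reaches the model.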

## Train Model

In [0]:
##At this point, you can now fine-tune the model, evaluate it, and run predictions on it.

# Train the model
def model_train(estimator):
  train_features = run_classifier.convert_examples_to_features(
      train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  print('***** Started training at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(train_examples)))
  print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
  tf.logging.info("  Num steps = %d", num_train_steps)
  train_input_fn = run_classifier.input_fn_builder(
      features=train_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=True,
      drop_remainder=True)
  estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  print('***** Finished training at {} *****'.format(datetime.datetime.now()))

model_train(estimator_from_tfhub)


##Evaluate Model

In [0]:
def model_eval(estimator):
  # Eval the model.
  eval_examples = processor.get_dev_examples(TASK_DATA_DIR)
  eval_features = run_classifier.convert_examples_to_features(
      eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(eval_examples)))
  print('  Batch size = {}'.format(EVAL_BATCH_SIZE))

  # Eval will be slightly WRONG on the TPU because it will truncate
  # the last batch.
  eval_steps = int(len(eval_examples) / EVAL_BATCH_SIZE)
  eval_input_fn = run_classifier.input_fn_builder(
      features=eval_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=False,
      drop_remainder=True)
  result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
  print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
  output_eval_file = os.path.join(OUTPUT_DIR, "eval_results.txt")
  with tf.gfile.GFile(output_eval_file, "w") as writer:
    print("***** Eval results *****")
    for key in sorted(result.keys()):
      print('  {} = {}'.format(key, str(result[key])))
      writer.write("%s = %s\n" % (key, str(result[key])))


model_eval(estimator_from_tfhub)
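
The comment in `model_eval` warns that evaluation is slightly off because the last batch is truncated. A small sketch of that arithmetic follows; the dev-set size of 203 is made up and `eval_coverage` is a hypothetical helper name.

```python
def eval_coverage(num_examples, batch_size):
    """Return (steps, evaluated, dropped) when the last partial batch is dropped."""
    steps = int(num_examples / batch_size)
    evaluated = steps * batch_size
    return steps, evaluated, num_examples - evaluated

# 203 is a hypothetical dev-set size; 8 matches EVAL_BATCH_SIZE in this notebook.
demo_steps, demo_evaluated, demo_dropped = eval_coverage(203, 8)
print(demo_steps, demo_evaluated, demo_dropped)
```

In this case 3 of the 203 examples never reach the metric. Nothing is dropped when the dev-set size is a multiple of the batch size.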



##Predict

In [0]:
def model_predict(estimator):
  # Make predictions on the test examples.
  prediction_examples = processor.get_test_examples(TASK_DATA_DIR)
  input_features = run_classifier.convert_examples_to_features(prediction_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  # drop_remainder=True drops a final partial batch, so the last few test
  # examples may get no prediction.
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=True)
  # estimator.predict returns a generator; materialize it so the results can
  # be both printed and written to the output file.
  predictions = list(estimator.predict(predict_input_fn))

  for example, prediction in zip(prediction_examples, predictions):
    print('text_a: %s\ntext_b: %s\nlabel:%s\nprediction:%s\n' % (example.text_a, example.text_b, str(example.label), prediction['probabilities']))

  output_test_file = os.path.join(OUTPUT_DIR, "test_results.txt")
  with tf.gfile.GFile(output_test_file, "w") as writer:
    print("***** Predict results *****")
    for example, prediction in zip(prediction_examples, predictions):
      writer.write('text_a: %s\ntext_b: %s\nlabel:%s\nprediction:%s\n' % (example.text_a, example.text_b, str(example.label), prediction['probabilities']))


model_predict(estimator_from_tfhub) 
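
The values in `prediction['probabilities']` are per-class scores, not a label. One way to turn them into a predicted topic is sketched below. It assumes that position `i` in the probability vector corresponds to position `i` in the list returned by `get_labels()`. The probabilities shown are made up and `top_label` is a hypothetical helper.

```python
def top_label(probabilities, labels):
    """Return the label with the highest probability, and that probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index], probabilities[best_index]

demo_labels = ["1", "2", "3", "4", "5", "6"]  # same order as get_labels()
demo_probabilities = [0.05, 0.10, 0.02, 0.70, 0.08, 0.05]  # made-up values

predicted_label, predicted_score = top_label(demo_probabilities, demo_labels)
print(predicted_label, predicted_score)
```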
