**Function 1**

  - Takes raw data (a DataFrame with `text` and `sentiment` columns) as input and returns the predicted text span for each row

In [1]:
!pip install transformers



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf

In [4]:
def get_predictions(input):
  """Return the predicted text span for each row of the given input DataFrame."""
  from transformers import TFRobertaForQuestionAnswering
  roberta = TFRobertaForQuestionAnswering.from_pretrained('/content/drive/My Drive/tweet-sentiment-extraction/mymodel_pretrained')
  print('*'*50)
  print('Loaded Pretrained TFRobertaForQuestionAnswering model')
  print('*'*50)
  MAX_LEN = 128  # maximum token length of the encoded (text, sentiment) pair
  import tensorflow as tf
  from tensorflow.keras.models import Model
  # Only the layers used below are imported
  from tensorflow.keras.layers import Input, Activation, Dropout, Flatten


  input1 = Input(shape=(MAX_LEN,),name='input_id',dtype=tf.int32)
  input2 = Input(shape=(MAX_LEN,),name='attention_mask',dtype=tf.int32)
  # The QA model returns per-token start and end logits, each of shape (batch, MAX_LEN)
  start_scores,end_scores = roberta(input1,attention_mask = input2)
  drop1 = Dropout(0.1)(start_scores)
  drop1  = tf.expand_dims(drop1,axis=-1)
  layer1 = tf.keras.layers.Conv1D(1,1)(drop1)
  layer1= Flatten()(layer1)
  softmax1 = Activation('softmax')(layer1)
  
  drop2 = Dropout(0.1)(end_scores)
  drop2  = tf.expand_dims(drop2,axis=-1)
  layer2 = tf.keras.layers.Conv1D(1,1)(drop2)
  layer2 = Flatten()(layer2)
  softmax2 = Activation('softmax')(layer2)

  model = Model(inputs=[input1,input2],outputs=[softmax1,softmax2])
  model.load_weights('/content/drive/My Drive/tweet-sentiment-extraction/checkpt1/model_roberta.hdf5')
  print('Loaded trained model')
  print('*'*50)

  print('Preprocessing input data')
  print('*'*50)
  input['text'] = input['text'].apply(lambda x : str(x).lower())

  count = input.shape[0]
  input_ids = np.zeros((count,MAX_LEN),dtype='int32')
  attention_mask = np.zeros((count,MAX_LEN),dtype='int32')

  ip_data = input[['text','sentiment']]
  print('Loading Pretrained Tokenizer for TfRoberta model')
  print('*'*50)
  from transformers import RobertaTokenizer
  tokenizer = RobertaTokenizer.from_pretrained('/content/drive/My Drive/tweet-sentiment-extraction/roberta_tokenizer',add_prefix_space=True)

  print('Getting input_ids and attention_mask for the input')
  print('*'*50)
  from tqdm import tqdm
  for i,each in tqdm(enumerate(ip_data.values)):

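    # Encode the (text, sentiment) pair, padded/truncated to MAX_LEN tokens.
    # padding='max_length' replaces the deprecated pad_to_max_length=True.
    val = tokenizer.encode_plus(each[0],each[1],add_special_tokens=True,max_length=MAX_LEN,
                                padding='max_length',truncation=True,
                                return_attention_mask=True,return_tensors='tf',verbose=False)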
    input_ids[i] = val['input_ids']
    attention_mask[i] = val['attention_mask']

  print('Shape of input id and attention mask:',input_ids.shape,attention_mask.shape)
  print('*'*50)
  print('Predicting o/p value')
  print('*'*50)
  # Pass the two inputs as a list, in the same order as the model's Input layers
  start_pred, end_pred = model.predict([input_ids, attention_mask])


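  pred_text = []
  for i in range(input_ids.shape[0]):
    # Most probable start and end token positions for this row
    a = np.argmax(start_pred[i])
    b = np.argmax(end_pred[i])
    # Re-encode the whitespace-normalized text and decode the predicted token span
    text1 = " " + " ".join(ip_data['text'].values[i].split())
    enc = tokenizer.encode(text1)
    # Note: if a > b, the slice is empty and st is an empty string
    st = tokenizer.decode(enc[a:b+1])
    pred_text.append(st)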

  print('Successfully ran!!!!!')
  print('*'*50)
  return pred_text
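
The decoding loop in `get_predictions` takes the most probable start and end positions independently, so the predicted start can land after the predicted end, and the slice is then empty. A minimal sketch of one possible guard, assuming we fall back to the whole token range in that case. `select_span` and the toy probabilities are hypothetical and not part of the notebook:

```python
def select_span(start_probs, end_probs, n_tokens):
    """Pick (start, end) token indices; fall back to the full range if inverted.

    n_tokens is the number of real (non-padding) tokens. The fallback is an
    assumption made for illustration, not the notebook's behavior.
    """
    # Index of the largest value, equivalent to np.argmax for a 1-D sequence
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    if start > end:
        # Inverted span: use every token instead of returning an empty slice
        return 0, n_tokens - 1
    return start, end

# Toy example with 5 tokens
print(select_span([0.1, 0.6, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.6, 0.1], 5))  # (1, 3)
print(select_span([0.1, 0.1, 0.1, 0.6, 0.1], [0.1, 0.6, 0.1, 0.1, 0.1], 5))  # (0, 4)
```

If this were wired into the notebook, the fallback range would also need to skip the special tokens that the tokenizer adds around the text.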

In [5]:
import pandas as pd
test_df = pd.read_csv('/content/drive/My Drive/tweet-sentiment-extraction/test.csv')
test_df.shape

(3534, 3)

In [6]:
import time
from prettytable import PrettyTable 
start = time.time()
ip = test_df.sample(10)
op = get_predictions(ip)
print('*'*50)
myTable = PrettyTable(["Sentiment", "Input text", "Output text"]) 
for i in range(len(op)):
  myTable.add_row([ip['sentiment'].values[i],ip['text'].values[i],op[i]])

print(myTable)  
print('Time taken:',time.time()-start)

All model checkpoint layers were used when initializing TFRobertaForQuestionAnswering.

All the layers of TFRobertaForQuestionAnswering were initialized from the model checkpoint at /content/drive/My Drive/tweet-sentiment-extraction/mymodel_pretrained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForQuestionAnswering for predictions without further training.


**************************************************
Loaded Pretrained TFRobertaForQuestionAnswering model
**************************************************


10it [00:00, 680.77it/s]

Loaded trained model
**************************************************
Preprocessing input data
**************************************************
Loading Pretrained Tokenizer for TfRoberta model
**************************************************
Getting input_ids and attention_mask for the input
**************************************************
Shape of input id and attention mask: (10, 128) (10, 128)
**************************************************
Predicting o/p value
**************************************************





Successfully ran!!!!!
**************************************************
**************************************************
+-----------+-------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| Sentiment |                                                              Input text                                                             |                             Output text                              |
+-----------+-------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|  negative |                        wow, its hot and miserable. people are probably killing themselves right about now...                        |                      wow, its hot and miserable.                     |


**Function 2**

  - Takes raw data (including the ground-truth `selected_text` column) as input and returns the performance metric (mean Jaccard score)

  

In [7]:
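def jaccard(str1, str2):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercased, whitespace-split tokens."""
    a = set(str(str1).lower().split())
    b = set(str(str2).lower().split())
    c = a.intersection(b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return float(len(c)) / (len(a) + len(b) - len(c))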

def get_performance_metric(input):
  """Return the performance metric (mean Jaccard score) for the given input DataFrame."""
  from transformers import TFRobertaForQuestionAnswering
  roberta = TFRobertaForQuestionAnswering.from_pretrained('/content/drive/My Drive/tweet-sentiment-extraction/mymodel_pretrained')
  print('*'*50)
  print('Loaded Pretrained TFRobertaForQuestionAnswering model')
  print('*'*50)
  MAX_LEN = 128  # maximum token length of the encoded (text, sentiment) pair
  import tensorflow as tf
  from tensorflow.keras.models import Model
  # Only the layers used below are imported
  from tensorflow.keras.layers import Input, Activation, Dropout, Flatten


  input1 = Input(shape=(MAX_LEN,),name='input_id',dtype=tf.int32)
  input2 = Input(shape=(MAX_LEN,),name='attention_mask',dtype=tf.int32)
  # The QA model returns per-token start and end logits, each of shape (batch, MAX_LEN)
  start_scores,end_scores = roberta(input1,attention_mask = input2)
  drop1 = Dropout(0.1)(start_scores)
  drop1  = tf.expand_dims(drop1,axis=-1)
  layer1 = tf.keras.layers.Conv1D(1,1)(drop1)
  layer1= Flatten()(layer1)
  softmax1 = Activation('softmax')(layer1)
  
  drop2 = Dropout(0.1)(end_scores)
  drop2  = tf.expand_dims(drop2,axis=-1)
  layer2 = tf.keras.layers.Conv1D(1,1)(drop2)
  layer2 = Flatten()(layer2)
  softmax2 = Activation('softmax')(layer2)

  model = Model(inputs=[input1,input2],outputs=[softmax1,softmax2])
  model.load_weights('/content/drive/My Drive/tweet-sentiment-extraction/checkpt1/model_roberta.hdf5')
  print('Loaded trained model')
  print('*'*50)

  print('Preprocessing input data')
  print('*'*50)
  input['text'] = input['text'].apply(lambda x : str(x).lower())
  input['selected_text'] = input['selected_text'].apply(lambda x : str(x).lower())

  count = input.shape[0]
  input_ids = np.zeros((count,MAX_LEN),dtype='int32')
  attention_mask = np.zeros((count,MAX_LEN),dtype='int32')

  ip_data = input[['text','sentiment']]
  print('Loading Pretrained Tokenizer for TfRoberta model')
  print('*'*50)
  from transformers import RobertaTokenizer
  tokenizer = RobertaTokenizer.from_pretrained('/content/drive/My Drive/tweet-sentiment-extraction/roberta_tokenizer',add_prefix_space=True)

  print('Getting input_ids and attention_mask for the input')
  print('*'*50)
  from tqdm import tqdm
  for i,each in tqdm(enumerate(ip_data.values)):

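    # Encode the (text, sentiment) pair, padded/truncated to MAX_LEN tokens.
    # padding='max_length' replaces the deprecated pad_to_max_length=True.
    val = tokenizer.encode_plus(each[0],each[1],add_special_tokens=True,max_length=MAX_LEN,
                                padding='max_length',truncation=True,
                                return_attention_mask=True,return_tensors='tf',verbose=False)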
    input_ids[i] = val['input_ids']
    attention_mask[i] = val['attention_mask']

  print('Shape of input id and attention mask:',input_ids.shape,attention_mask.shape)
  print('*'*50)
  print('Predicting o/p value')
  print('*'*50)
  # Pass the two inputs as a list, in the same order as the model's Input layers
  start_pred, end_pred = model.predict([input_ids, attention_mask])


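  pred_text = []
  for i in range(input_ids.shape[0]):
    # Most probable start and end token positions for this row
    a = np.argmax(start_pred[i])
    b = np.argmax(end_pred[i])
    # Re-encode the whitespace-normalized text and decode the predicted token span
    text1 = " " + " ".join(ip_data['text'].values[i].split())
    enc = tokenizer.encode(text1)
    # Note: if a > b, the slice is empty and st is an empty string
    st = tokenizer.decode(enc[a:b+1])
    pred_text.append(st)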



  print('Calculating the performance metric')
  print('*'*50)
  actual_text = input['selected_text'].values
  # Score each predicted span against its ground-truth span
  scores = [jaccard(pred, actual) for pred, actual in zip(pred_text, actual_text)]

  res = np.mean(scores)
  print('Successfully ran!!!!!')
  print('*'*50)
  return res  
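
A small worked example may help make the word-level Jaccard score concrete. The sketch below mirrors `jaccard` but adds a guard for the case where both strings are empty, which would otherwise divide by zero. The name `jaccard_safe` and the choice of returning 0.0 in that case are assumptions for illustration, not part of the notebook:

```python
def jaccard_safe(str1, str2):
    """Word-level Jaccard score with a guard for two empty strings."""
    a = set(str(str1).lower().split())
    b = set(str(str2).lower().split())
    if not a and not b:
        # Assumed convention: no words on either side scores 0.0
        return 0.0
    c = a & b
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return len(c) / (len(a) + len(b) - len(c))

# Prediction keeps 5 distinct words, the reference 3, and all 3 are shared:
# intersection = 3, union = 5, score = 3 / 5
print(jaccard_safe("wow, its hot and miserable.", "hot and miserable."))  # 0.6
```

Because the split is on whitespace only, punctuation stays attached to words, so `miserable.` and `miserable` count as different tokens.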

In [8]:
import time

start = time.time()
train_df = pd.read_csv('/content/drive/My Drive/tweet-sentiment-extraction/train.csv')
print(train_df.shape)
ip = train_df.sample(20)
op = get_performance_metric(ip)
print('*'*50)
print('*'*50)
print("JACCARD SCORE FOR GIVEN DATA:",op)
print('*'*50)
print('*'*50)
print('Time taken:',time.time()-start)

(27481, 4)


All model checkpoint layers were used when initializing TFRobertaForQuestionAnswering.

All the layers of TFRobertaForQuestionAnswering were initialized from the model checkpoint at /content/drive/My Drive/tweet-sentiment-extraction/mymodel_pretrained.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForQuestionAnswering for predictions without further training.


**************************************************
Loaded Pretrained TFRobertaForQuestionAnswering model
**************************************************


20it [00:00, 1050.23it/s]

Loaded trained model
**************************************************
Preprocessing input data
**************************************************
Loading Pretrained Tokenizer for TfRoberta model
**************************************************
Getting input_ids and attention_mask for the input
**************************************************
Shape of input id and attention mask: (20, 128) (20, 128)
**************************************************
Predicting o/p value
**************************************************





Calculating the performance metric
**************************************************
Successfully ran!!!!!
**************************************************
**************************************************
**************************************************
JACCARD SCORE FOR GIVEN DATA: 0.7272556446821153
**************************************************
**************************************************
Time taken: 18.24516248703003
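
Both functions rebuild the model and reload the weights on every call, which likely accounts for part of the elapsed time (an assumption; the notebook does not break the timing down). A minimal sketch of loading once and reusing the result with `functools.lru_cache`. Here `get_model` and its placeholder body are hypothetical stand-ins for the RoBERTa loading and Keras head construction:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """Build the model once and reuse it on later calls.

    The body is a stand-in: in the notebook it would hold the model
    loading and head construction from the two functions.
    """
    time.sleep(0.1)  # placeholder for the expensive load step
    return {"name": "placeholder-model"}  # hypothetical model object

first = get_model()
second = get_model()  # served from the cache, no reload
print(first is second)  # True
```

With this pattern, `get_predictions` and `get_performance_metric` could both call the cached loader instead of repeating the construction code.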
