In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_hub as hub
import random
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification,DataCollatorWithPadding
from datasets import Dataset
from tqdm import tqdm
tqdm.pandas()
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)




# Introduction and Intuition


Nowadays, large language models are everywhere. That is no surprise, since they perform really well, especially on big datasets. I worked on this competition for a while and wanted to share my experience. I do believe that hands-on experience is at least as important as theoretical knowledge.


To this end, we will look at four models today: two baseline models (Naive Bayes and Logistic Regression), one feature extraction model, and one fine-tuned transformer model. I submitted the predictions of each one to the competition and will share the resulting scores with you as well.

A couple notes:

* This notebook contains only one transformer and one feature extraction model, but I actually tried many of them in different configurations. My results ranged from 0.78 to 0.83.

* I have also tried some prompt engineering, which gave an F1 score of about 0.79. I believe this could be improved, since I'm a novice in that area. If I manage to improve the score, I'll update this notebook.

* If you can only pick one baseline model, go with logistic regression (for binary classification, of course). The naive independence assumption does not go hand in hand with bigger datasets. (This is more of a side note.)
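
To make the "naive" assumption concrete, here is a minimal, dependency-free sketch of multinomial Naive Bayes with Laplace smoothing. It is not how `MultinomialNB` is implemented internally; the toy tweets, the function names `train_nb` and `predict_nb`, and the use of raw word counts instead of TF-IDF weights are all assumptions made for illustration. The point to notice is that a class score is just the log prior plus a sum of per-word log-likelihoods, so every word is treated as independent of the others given the class.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial Naive Bayes model on whitespace-tokenized text."""
    vocab = {w for d in docs for w in d.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        # Laplace smoothing: add alpha to every word count of the class.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_lik


def predict_nb(doc, log_prior, log_lik):
    """Pick the class with the highest log prior + summed word log-likelihoods."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Made-up toy data: 1 = disaster, 0 = not a disaster.
toy_docs = [
    "forest fire near town",
    "earthquake hits city",
    "i love this song",
    "what a great day",
]
toy_labels = [1, 1, 0, 0]

log_prior, log_lik = train_nb(toy_docs, toy_labels)
pred_disaster = predict_nb("fire in the city", log_prior, log_lik)
pred_other = predict_nb("love this day", log_prior, log_lik)
print(pred_disaster, pred_other)
```

Because the per-word terms are simply added up, correlated words (for example, hashtags that always appear together) get counted as separate pieces of evidence, which is one reason logistic regression can be the safer baseline.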

Here are a few takeaways from my experience:

* Do not underestimate the power of baseline models. Always be aware of this tradeoff: as a machine learning model gets bigger, the required computational resources and training time also increase, which may not be feasible for many applications. (What I experienced does not have to hold in every case, but it is always good to start with baseline models.)

* There is a method in which you first fine-tune the model with masked language modeling and then use that same model for sequence classification. The intention is to let the model represent the text in question better. It is worth trying, but be aware of the risk of overfitting. I did it for `DistilBert`, but it did not improve the results (0.79).

* `MirroredStrategy` speeds up the fine-tuning process a lot. On the other hand, using TPU nodes can be tricky, as described on the [HuggingFace website](https://huggingface.co/docs/transformers/perf_train_tpu_tf). You can also try XLA with TensorFlow, but do not try to run XLA and `MirroredStrategy` together (1), and do not try XLA with TPU [(2)](https://huggingface.co/docs/transformers/perf_train_tpu_tf).
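
For the masked language modeling step mentioned in the list above, the training data is built by hiding some tokens and asking the model to recover them. Below is a simplified sketch of that masking step in plain Python. The function name `mask_tokens`, the whitespace tokenization and the 30% masking rate are assumptions for illustration; real pipelines typically work on subword token ids, and BERT-style masking also sometimes keeps or randomizes a selected token instead of always replacing it with `[MASK]`.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for a simplified MLM objective.

    labels holds the original token at masked positions and None elsewhere,
    so the loss would only be computed where something was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "forest fire near la ronge sask canada".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=42)
print(masked)
print(labels)
```

In the Hugging Face ecosystem, `DataCollatorForLanguageModeling` (with its `mlm_probability` argument) takes care of this step, so the sketch is only meant to show what happens to the text.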


I hope you find something useful in this work. Please share your suggestions if you have any. In the end, the goal is to learn from each other.

In [2]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


# Baseline Model 1: Naive Bayes

In [4]:
#tokenize and vectorize the datasets using TF-IDF
vectorizer = TfidfVectorizer()
vectorizer.fit(train.text)

train_text = vectorizer.transform(train.text)
test_text = vectorizer.transform(test.text)

#build,train and get predictions of the model
model = MultinomialNB()
model.fit(train_text,train.target)
preds = model.predict(test_text)

#submission = pd.DataFrame()
#submission['id'] = test.id
#submission['target'] = preds

#submission.set_index('id').to_csv('NaiveBayes.csv')
#leaderboard score: 0.793
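
To show what the TF-IDF step produces, here is a small, dependency-free sketch of the weighting. It follows what I understand to be the documented `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row), but it is an illustration, not the library's implementation. The helper names `tokenize`, `fit_idf` and `transform` are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def transform(doc, idf):
    """Return an L2-normalized {term: weight} vector for one document."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


docs = [
    "Forest fire near La Ronge",
    "Forest fire at spot pond",
    "Typhoon kills 28 in China",
]
idf = fit_idf(docs)
vec = transform(docs[0], idf)
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in fewer documents (such as "near") end up with a larger weight than terms shared across documents (such as "forest"), which is the behavior the baselines rely on.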

# Baseline Model 2: Logistic Regression

In [5]:
#build,train and get predictions of the model
model = LogisticRegression()
model.fit(train_text,train.target)
preds = model.predict(test_text)

#submission = pd.DataFrame()
#submission['id'] = test.id
#submission['target'] = preds

#submission.set_index('id').to_csv('LogisticRegression.csv')

#leaderboard score: 0.793
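
Before uploading a submission, it can help to estimate the score locally on a held-out part of the training data. The sketch below computes a binary F1 score by hand, assuming the leaderboard metric is F1 (as the F1 score mentioned in the introduction suggests). The function name `f1_binary` and the label lists are made up for illustration.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: two real disasters, only one of them detected.
y_true = [1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_binary(y_true, y_pred)
print(f"accuracy={accuracy:.3f}  f1={f1:.3f}")
```

Accuracy and F1 can differ quite a bit when positives are missed, so it is worth comparing models on the same metric the competition uses.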

# Big Model 1: Neural Network with Transfer Learning

In [6]:
#get feature extraction layer
#feature_extraction_layer = hub.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder/4')

In [7]:
#build a nn model using USE

#inputs = layers.Input(shape = [], dtype = tf.string)
#x = feature_extraction_layer(inputs,training = False)
#x = layers.Dense(128,activation = 'swish')(x)
#outputs = layers.Dense(1,activation = 'sigmoid')(x)

#model = tf.keras.Model(inputs,outputs)

#model.compile(optimizer = 'adam',
#             loss = 'binary_crossentropy',
#             metrics = ['accuracy'])

#history = model.fit(train.text,train.target,epochs = 50)

In [8]:
#get predictions and submit 
#preds = model.predict(test.text)
#preds_labels = np.round(preds)

#submission = pd.DataFrame()
#submission['id'] = test.id
#submission['target'] = preds_labels.ravel().astype(np.int64)
#submission.set_index('id').to_csv('NN.csv')

#ranges from 78% to 81% for different model configurations

# Big Model 2: Transformers

In [9]:
tokenizer = AutoTokenizer.from_pretrained('hkayesh/twitter-disaster-nlp')
def tokenize_data(example):
    return tokenizer(example,
                     truncation = True,
                     padding = 'max_length',
                     max_length = 256,
                     return_tensors = 'np')

In [11]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
token_data = train['text'].progress_apply(tokenize_data)
train['attention_mask'] = token_data.progress_apply(lambda x: x['attention_mask'][0])
train['input_ids'] = token_data.progress_apply(lambda x: x['input_ids'][0])
tokenized_train_dataset = Dataset.from_pandas(train)

test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
tokenized_test = test['text'].progress_apply(tokenize_data)
test['input_ids'] = tokenized_test.progress_apply(lambda x: x['input_ids'][0])
test['attention_mask'] = tokenized_test.progress_apply(lambda x: x['attention_mask'][0])
tokenized_test_dataset = Dataset.from_pandas(test)

100%|██████████| 7613/7613 [00:02<00:00, 3577.04it/s]
100%|██████████| 7613/7613 [00:00<00:00, 314426.18it/s]
100%|██████████| 7613/7613 [00:00<00:00, 328977.73it/s]
100%|██████████| 3263/3263 [00:00<00:00, 3563.35it/s]
100%|██████████| 3263/3263 [00:00<00:00, 293106.33it/s]
100%|██████████| 3263/3263 [00:00<00:00, 326955.11it/s]
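
The tokenizer call above pads or truncates every tweet to a fixed length and returns an attention mask alongside the ids. The following sketch mimics that behavior on plain lists to show what the two columns contain. The function name `pad_and_truncate`, the pad id `0` and the token ids are made up for illustration, and unlike a real tokenizer this sketch does not preserve special tokens when truncating.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad a list of token ids to exactly max_length.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


# Hypothetical token ids for a short and a long input.
short_ids, short_mask = pad_and_truncate([101, 7592, 2088, 102], max_length=6)
long_ids, long_mask = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The model ignores positions where the mask is 0, so padding only costs compute, not accuracy. A smaller `max_length` may therefore be worth trying if most tweets are short.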


In [12]:
tf_train = tokenized_train_dataset.to_tf_dataset(columns = ['input_ids','attention_mask'],
                                    batch_size=8,
                                    shuffle=True,
                                    collate_fn=DataCollatorWithPadding(tokenizer),
                                    label_cols = ['target'])

tf_test = tokenized_test_dataset.to_tf_dataset(columns = ['input_ids','attention_mask'],
                                              batch_size = 8,
                                              shuffle = False,
                                              collate_fn = DataCollatorWithPadding(tokenizer),
                                              )


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [13]:
model = TFAutoModelForSequenceClassification.from_pretrained('hkayesh/twitter-disaster-nlp',num_labels = 2)
model.compile(optimizer = tf.keras.optimizers.Adam(3e-5),
             jit_compile = True)
history = model.fit(tf_train,epochs = 3)


#strategy = tf.distribute.MirroredStrategy()
#with strategy.scope():
#    model = TFAutoModelForSequenceClassification.from_pretrained('hkayesh/twitter-disaster-nlp',num_labels = 2)
#    model.compile(optimizer = tf.keras.optimizers.Adam(3e-5))
#    history = model.fit(tf_train,epochs = 3)

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at hkayesh/twitter-disaster-nlp.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Epoch 1/3
Cause: for/else statement not yet supported
Epoch 2/3
Epoch 3/3


In [14]:
preds = model.predict(tf_test)
logits = preds.logits
probas = tf.nn.softmax(logits,axis = 1)
labels = tf.argmax(probas,axis = 1)

submission = pd.DataFrame()
submission['id'] = test.id
submission['target'] = labels.numpy()
submission.set_index('id').to_csv('hugging.csv')
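
A small note on the prediction step: softmax preserves the ordering of values within each row, so taking the argmax of the raw logits gives the same labels as taking it on the probabilities. The softmax is only needed when the probabilities themselves are of interest. The sketch below checks this on made-up logits using plain Python; the helper names `softmax` and `argmax` are written here for illustration and are not the TensorFlow functions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits for three tweets and two classes.
logits = [[-1.2, 0.7], [2.3, -0.4], [0.1, 0.3]]

probas = [softmax(row) for row in logits]
labels_from_probas = [argmax(row) for row in probas]
labels_from_logits = [argmax(row) for row in logits]

print(labels_from_probas, labels_from_logits)
```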

