# Fake Tweets Detector

This notebook is an attempt at the [Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started/) Kaggle competition.

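We will be using Hugging Face Transformers and TensorFlow for text classification with DistilBERT, a smaller distilled variant of BERT (the full BERT model is left as a commented-out alternative).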

### Imports

In [None]:
import numpy as np 
import pandas as pd
from transformers import (
    TFBertForSequenceClassification,
    BertTokenizer,
    TFDistilBertForSequenceClassification,
    DistilBertTokenizer,
    glue_convert_examples_to_features,
)
import tensorflow as tf
from livelossplot import PlotLossesKeras
from sklearn.model_selection import train_test_split
from tensorflow import keras

## Get data and tokenize

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

In [None]:
train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")

Now that the data is loaded, let's take a quick look at the training set. We can drop all columns besides the `text` and `target` columns for now.

In [None]:
train_data

There is a significant number of NaN values in the `keyword` and `location` columns, so for the moment let's just drop them, along with the `id` column.

In [None]:
train_text_df = train_data[['text', 'target']]
train_text_df = train_text_df.dropna()
df_X = train_text_df['text']
df_y = train_text_df['target'].to_numpy()

Now let's split it into training and validation data.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df_X, df_y, test_size=0.2, random_state=1234)
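
To make the split above concrete, here is a minimal sketch of a seeded 80/20 shuffle-and-split in plain Python. It is a simplified stand-in for illustration, not scikit-learn's implementation; the `simple_split` helper and the toy data are hypothetical.

```python
import random

def simple_split(items, labels, test_size=0.2, seed=1234):
    """Shuffle indices with a fixed seed, then hold out the first `test_size` fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split reproducible
    n_val = int(round(len(items) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(items, train_idx), pick(items, val_idx), pick(labels, train_idx), pick(labels, val_idx)

# Toy data (hypothetical): ten short "tweets" with binary labels
texts = [f"tweet {i}" for i in range(10)]
targets = [i % 2 for i in range(10)]

X_tr, X_va, y_tr, y_va = simple_split(texts, targets)
print(len(X_tr), len(X_va))  # 8 2
```

Calling the helper again with the same seed returns the same partition, which is what `random_state=1234` is meant to guarantee in the cell above.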

Now we have the training set ready.

Let's tokenize the data.
We will use the DistilBERT tokenizer from Hugging Face. The `batch_encode_plus` method tokenizes all the texts in one call.

We also ask it to return TensorFlow 2 tensors, to make our life easier.

In [None]:
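# batch_encode_plus expects a list of strings, so convert the pandas Series first.
# Note: newer transformers releases deprecate pad_to_max_length in favor of padding=True.
tokenized_X_train = tokenizer.batch_encode_plus(X_train.tolist(), pad_to_max_length=True, return_tensors="tf")
tokenized_X_val = tokenizer.batch_encode_plus(X_val.tolist(), pad_to_max_length=True, return_tensors="tf")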

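Let's look at the returned object now. As you can see, we get a dictionary-like object back with `input_ids` and `attention_mask`. To keep things simple we pass only the `input_ids` to the model; be aware that without the attention mask the model also attends to padding tokens, which may hurt accuracy.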

In [None]:
tokenized_X_train

In [None]:
tokenized_X_train['input_ids']
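
The relationship between `input_ids` and `attention_mask` can be sketched without the tokenizer. The snippet below is a simplified illustration, assuming a pad id of 0 and toy token ids; `pad_batch` is a hypothetical helper, not part of the `transformers` API.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, chosen only for illustration
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask marks which positions hold real tokens, so a model that receives it can ignore the padded positions.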

## Model + Training
Now the fun part: setting up and training the Keras model.

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
bce = tf.keras.losses.BinaryCrossentropy()
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# hyper-parameters
epochs = 50
batch_size = 256
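
As a rough illustration of what `SparseCategoricalCrossentropy(from_logits=True)` computes for a single example, here is a plain-Python sketch: the negative log of the softmax probability assigned to the true class. The function name and the logits are hypothetical; Keras additionally averages this value over the batch by default.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for the two classes (0 = not a disaster, 1 = disaster)
print(round(sparse_ce_from_logits([0.0, 0.0], 1), 4))  # 0.6931
# A confident, correct prediction gives a much smaller loss than a confident, wrong one
print(sparse_ce_from_logits([-2.0, 3.0], 1) < sparse_ce_from_logits([-2.0, 3.0], 0))  # True
```

Because the model outputs raw logits and the labels are integers rather than one-hot vectors, the "sparse" loss with `from_logits=True` matches the setup in the cell above.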

In [None]:
cb=[PlotLossesKeras()]
model.fit(
    x=tokenized_X_train['input_ids'], y=y_train, 
    validation_data=(tokenized_X_val['input_ids'], y_val),
    epochs=epochs, 
    batch_size=batch_size,
    callbacks=cb, 
    verbose=1)

### Generate submission from test data

In [None]:
test_x = test_data['text'].to_numpy()

In [None]:
test_x = tokenizer.batch_encode_plus(test_x.tolist(), pad_to_max_length=True, return_tensors="tf")

In [None]:
predictions = model.predict(test_x['input_ids'])

In [None]:
# predictions[0] holds the logits; pick the highest-scoring class for each tweet
predictions_label = [np.argmax(x) for x in predictions[0]]
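
The argmax step can be sketched on its own with made-up numbers. This is a plain-Python illustration with a hypothetical helper name and hypothetical logits, not output from the model above.

```python
def logits_to_labels(logits_batch):
    """Return the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_batch]

# Hypothetical logits for three tweets: column 0 = not a disaster, column 1 = disaster
example_logits = [[1.2, -0.3], [-0.8, 2.1], [0.1, 0.4]]
print(logits_to_labels(example_logits))  # [0, 1, 1]
```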

In [None]:
submission = pd.DataFrame({'id': test_data['id'], 'target': predictions_label})
submission['target'] = submission['target'].astype('int')
submission.to_csv('submission.csv', index=False)
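
Before uploading, it may be worth checking that the file has the expected shape. The sketch below assumes the competition expects exactly the columns `id` and `target` with 0/1 labels; `check_submission` is a hypothetical helper and the CSV strings are invented for illustration.

```python
import csv
import io

def check_submission(csv_text):
    """Return True if the CSV has exactly the columns id,target and only 0/1 targets."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        return False
    return all(row["target"] in {"0", "1"} for row in reader)

# Hypothetical file contents
print(check_submission("id,target\n0,1\n2,0\n"))  # True
print(check_submission("id,target\n0,0.7\n"))     # False
```

To run it on the real file, one option is `check_submission(open('submission.csv').read())`.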