# NLP Project : Disaster Tweets

This notebook will document our NLP learning process.

---

### *Data*

- train.csv : the training set
- test.csv : the test set
- sample_submission.csv : a sample submission example


### *Columns*

* *id* : a unique identifier for each tweet
* *text* : the text of the tweet
* *location* : the location the tweet was sent from (may be blank)
* *keyword* : a particular keyword from the tweet (may be blank)
* *target* : in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

### *Goal*

**Here, we are predicting whether a given tweet is about a real disaster or not. This is a binary classification problem.**


---

## References

- https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert
- https://www.kaggle.com/faressayah/natural-language-processing-nlp-for-beginners
- https://www.kaggle.com/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk/notebook
- https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove
- https://www.kaggle.com/frankmollard/nlp-a-gentle-introduction-lstm-word2vec-bert
- https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub/
---

## Libraries import

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from wordcloud import WordCloud

import missingno as msno

from tqdm import tqdm

# Nltk libraries
import nltk
from nltk import ngrams

import re
import string

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import SGDClassifier

from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, classification_report

## Load data

### Train dataset

In [None]:
df_train = pd.read_csv('../input/nlp-getting-started/train.csv')
df_train.head()

### Test dataset

In [None]:
df_test = pd.read_csv('../input/nlp-getting-started/test.csv')
df_test.head()

Target values are not available in Kaggle's test set.

> **Gunes Evitan ([@gunesevitan](https://www.kaggle.com/gunesevitan)) wrote:**
> Test set labels can be found on [this](https://www.figure-eight.com/data-for-everyone/) website. Dataset is named **Disasters on social media**. This is how people are submitting perfect scores. Other "Getting Started" competitions also have their test labels available. The main point of "Getting Started" competitions is **learning and sharing**, and perfect score doesn't mean anything. 
> According to [@philculliton](https://www.kaggle.com/philculliton) from Kaggle Team, competitors who use test set labels in any way are not eligible to win AutoML prize. There are no other penalties for using them.


In [None]:
df_leak = pd.read_csv('../input/diastertweets/socialmedia-disaster-tweets-DFE.csv', encoding ='ISO-8859-1')[["choose_one", "text"]]

df_leak['target'] = (df_leak['choose_one'] == 'Relevant').astype(np.int8)
df_leak.drop(columns=['choose_one'], inplace=True)

df_test = df_test.merge(df_leak, on=['text'])
df_test.head()
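
An inner merge on `text` can multiply rows whenever the same tweet text appears more than once on either side, which is one plausible source of duplicated rows after this step. The sketch below uses two tiny made-up frames (the names `test`, `leak` and their contents are illustrative assumptions, not the competition data) to show the effect and the clean-up.

```python
import pandas as pd

# Toy frames: "fire downtown" appears twice on both sides
test = pd.DataFrame({"id": [1, 2, 3],
                     "text": ["fire downtown", "nice day", "fire downtown"]})
leak = pd.DataFrame({"text": ["fire downtown", "fire downtown", "nice day"],
                     "target": [1, 1, 0]})

# Inner join on text: every matching pair of rows is kept
merged = test.merge(leak, on="text")
n_merged = len(merged)

# Fully identical rows created by the join
n_duplicates = int(merged.duplicated().sum())

# Same clean-up as in the notebook: drop exact duplicates
deduped = merged.drop_duplicates().reset_index(drop=True)

print(n_merged, n_duplicates, len(deduped))
```

Three input rows become five after the merge here, and dropping exact duplicates brings the frame back to three rows.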

## Dataframes Information

In [None]:
print(df_train.info())
print(df_test.info())

## Check for duplicates

In [None]:
print(df_train.duplicated().sum())
print(df_test.duplicated().sum())

There are no duplicates in the train set, but 193 duplicates in the test set.

## Drop duplicates in test set

In [None]:
df_test = df_test.drop_duplicates().reset_index(drop=True)

## Check for missing values

In [None]:
df_train.isna().mean()*100

In [None]:
df_test.isna().mean()*100

For the train set :
- $33.27\%$ of location and
- $0.80\%$ of keyword are missing.

For the test set :
- $34.36\%$ of location and
- $0.80\%$ of keyword are missing.

## Proportion of missing data Visualization

### Train set

In [None]:
msno.matrix(df_train)

### Test set

In [None]:
msno.matrix(df_test)

## Deal with missing values

- When a location is missing, it is going to be replaced by "unknown-location"
- When a keyword is missing, it is going to be replaced by "unknown-keyword"

In [None]:
df_train.location = df_train.location.fillna('unknown-location')
df_train.keyword = df_train.keyword.fillna('unknown-keyword')
df_test.location = df_test.location.fillna('unknown-location')
df_test.keyword = df_test.keyword.fillna('unknown-keyword')

# EDA

## Tweets exploration by target

In [None]:
df_train.iloc[:,1:].groupby(['target'])['text'].describe()

In [None]:
df_test.iloc[:,1:].groupby(['target'])['text'].describe()

## Target distribution in dataset

### Countplot

In [None]:
fig, ax = plt.subplots(1,2, figsize=(16,8))
sns.countplot(x='target', data=df_train, palette='hls', ax=ax[0])
ax[0].set_title('Target distribution in train set')
sns.countplot(x='target', data=df_test, palette='hls', ax=ax[1])
ax[1].set_title('Target distribution in test set')
plt.tight_layout()

### Pie chart

In [None]:
# Sort by class so that the counts line up with the labels [0, 1]
v1 = df_train['target'].value_counts().sort_index()
v2 = df_test['target'].value_counts().sort_index()
labels = [0, 1]


fig = plt.figure(figsize=(16,8))

ax1 = plt.subplot2grid((1,2),(0,0))
plt.pie(v1, labels=labels, colors = ['grey','red'], autopct='%.0f%%')
plt.title('Train set')

ax2 = plt.subplot2grid((1,2),(0, 1))
plt.pie(v2, labels=labels, colors = ['grey','red'], autopct='%.0f%%')
plt.title('Test set')

plt.show()

## Location variation in dataset

In [None]:
train_1 = df_train[df_train['target']==1].reset_index(drop=True)
train_0 = df_train[df_train['target']==0].reset_index(drop=True)
test_1 = df_test[df_test['target']==1].reset_index(drop=True)
test_0 = df_test[df_test['target']==0].reset_index(drop=True)

In [None]:
fig1 = make_subplots(rows=1, cols=2)
fig2 = make_subplots(rows=1, cols=2)

trace1 = go.Histogram(x=train_1.location,
                      xbins=dict(
                      start=0,
                      end=15), name='Train set', marker_color='#a63048')
trace2 = go.Histogram(x=test_1.location,
                      xbins=dict(
                      start=0,
                      end=15), name='Test set', marker_color='#d4687e')

trace3 = go.Histogram(x=train_0.location,
                      xbins=dict(
                      start=0,
                      end=15), name='Train set', marker_color='#52d298')
trace4 = go.Histogram(x=test_0.location,
                      xbins=dict(
                      start=0,
                      end=15), name='Test set', marker_color='#c1f8d8')

fig1.add_trace(trace1, 1, 1)
fig1.add_trace(trace2, 1, 2)
fig1.update_layout(title_text='Location variation in dataset for target = 1')
fig1.show()

fig2.add_trace(trace3, 1, 1)
fig2.add_trace(trace4, 1, 2)
fig2.update_layout(title_text='Location variation in dataset for target = 0')
fig2.show()

## Keywords variation in dataset

In [None]:
fig1 = make_subplots(rows=1, cols=2)
fig2 = make_subplots(rows=1, cols=2)

trace1 = go.Histogram(x=train_1.keyword,
                      xbins=dict(
                      start=0,
                      end=15), name='Train set', marker_color='#a63048')
trace2 = go.Histogram(x=test_1.keyword,
                      xbins=dict(
                      start=0,
                      end=15), name='Test set', marker_color='#d4687e')

trace3 = go.Histogram(x=train_0.keyword,
                      xbins=dict(
                      start=0,
                      end=15), name='Train set', marker_color='#52d298')
trace4 = go.Histogram(x=test_0.keyword,
                      xbins=dict(
                      start=0,
                      end=15), name='Test set', marker_color='#c1f8d8')

fig1.add_trace(trace1, 1, 1)
fig1.add_trace(trace2, 1, 2)
fig1.update_layout(title_text='Keywords variation in dataset for target = 1')
fig1.show()

fig2.add_trace(trace3, 1, 1)
fig2.add_trace(trace4, 1, 2)
fig2.update_layout(title_text='Keywords variation in dataset for target = 0')
fig2.show()

## Meta features visualization
The visualizations below were heavily inspired by [@gunesevitan](https://www.kaggle.com/gunesevitan)'s submission.

In [None]:
token = nltk.tokenize.RegexpTokenizer(r"\w+")
stopW = nltk.corpus.stopwords.words('english')

# word_count
df_train['word_count'] = df_train['text'].apply(lambda x: len(token.tokenize(str(x))))
df_test['word_count'] = df_test['text'].apply(lambda x: len(token.tokenize(str(x))))

# unique_word_count
df_train['unique_word_count'] = df_train['text'].apply(lambda x: len(set(token.tokenize(str(x)))))
df_test['unique_word_count'] = df_test['text'].apply(lambda x: len(set(token.tokenize(str(x)))))

# stop_word_count
df_train['stop_word_count'] = df_train['text'].apply(lambda x: len([w for w in str(x).lower().split() if w in stopW]))
df_test['stop_word_count'] = df_test['text'].apply(lambda x: len([w for w in str(x).lower().split() if w in stopW]))

# url_count
df_train['url_count'] = df_train['text'].apply(lambda x: len([w for w in str(x).lower().split() if 'http' in w]))
df_test['url_count'] = df_test['text'].apply(lambda x: len([w for w in str(x).lower().split() if 'http' in w]))

# mean_word_length
df_train['mean_word_length'] = df_train['text'].apply(lambda x: np.mean([len(w) for w in token.tokenize(str(x))]))
df_test['mean_word_length'] = df_test['text'].apply(lambda x: np.mean([len(w) for w in token.tokenize(str(x))]))

# char_count
df_train['char_count'] = df_train['text'].apply(lambda x: len(str(x)))
df_test['char_count'] = df_test['text'].apply(lambda x: len(str(x)))

# punctuation_count
df_train['punctuation_count'] = df_train['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
df_test['punctuation_count'] = df_test['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

# hashtag_count
df_train['hashtag_count'] = df_train['text'].apply(lambda x: len([c for c in str(x) if c == '#']))
df_test['hashtag_count'] = df_test['text'].apply(lambda x: len([c for c in str(x) if c == '#']))

# mention_count
df_train['mention_count'] = df_train['text'].apply(lambda x: len([c for c in str(x) if c == '@']))
df_test['mention_count'] = df_test['text'].apply(lambda x: len([c for c in str(x) if c == '@']))
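
To sanity-check the meta features, the same kind of counts can be computed by hand for a single tweet. The sketch below uses only the standard library; the example tweet and the helper name `meta_features` are made up for illustration, and `re.findall(r"\w+", ...)` stands in for the `RegexpTokenizer(r"\w+")` used above.

```python
import re
import string

def meta_features(text):
    """Compute a few of the meta features for one tweet (illustrative helper)."""
    words = re.findall(r"\w+", text)  # same pattern as the RegexpTokenizer above
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "url_count": sum(1 for w in text.lower().split() if "http" in w),
        "char_count": len(text),
        "punctuation_count": sum(1 for c in text if c in string.punctuation),
        "hashtag_count": text.count("#"),
        "mention_count": text.count("@"),
    }

# Hypothetical tweet, not taken from the dataset
tweet = "Flood warning in #Houston! Stay safe @NWS https://t.co/abc #flood"
features = meta_features(tweet)
print(features)
```

Note that the URL is split into several tokens by the `\w+` pattern, so it contributes to `word_count` as well as to `url_count`.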

In [None]:
METAFEATURES = ['word_count', 'unique_word_count', 'stop_word_count', 'url_count', 'mean_word_length',
                'char_count', 'punctuation_count', 'hashtag_count', 'mention_count']
DISASTER_TWEETS = df_train['target'] == 1

fig, axes = plt.subplots(ncols=2, nrows=len(METAFEATURES), figsize=(20, 50), dpi=100)

for i, feature in enumerate(METAFEATURES):
    sns.histplot(df_train.loc[~DISASTER_TWEETS][feature], label='Not Disaster', ax=axes[i][0], color='grey')
    sns.histplot(df_train.loc[DISASTER_TWEETS][feature], label='Disaster', ax=axes[i][0], color='#972139')

    sns.histplot(df_train[feature], label='Training', ax=axes[i][1], color="#217497")
    sns.histplot(df_test[feature], label='Test', ax=axes[i][1], color="#dcf6d5")
    
    for j in range(2):
        axes[i][j].set_xlabel('')
        axes[i][j].tick_params(axis='x', labelsize=12)
        axes[i][j].tick_params(axis='y', labelsize=12)
        axes[i][j].legend()
    
    axes[i][0].set_title(f'{feature} Target Distribution in Training Set', fontsize=13)
    axes[i][1].set_title(f'{feature} Training & Test Set Distribution', fontsize=13)

plt.show()

In [None]:
# Deleting columns after visualization
df_train = df_train.drop(['word_count', 'unique_word_count', 'stop_word_count', 'url_count', 
               'mean_word_length','char_count', 'punctuation_count', 'hashtag_count', 
               'mention_count'], axis = 1)

df_test = df_test.drop(['word_count', 'unique_word_count', 'stop_word_count', 'url_count', 
               'mean_word_length','char_count', 'punctuation_count', 'hashtag_count', 
               'mention_count'], axis = 1)

---

# Data Cleaning

These steps of cleaning were applied :
1. **Normalization**
2. **Tokenization**
3. **Remove stopwords**
4. **Remove punctuation**
5. **Lemmatization**
6. **Remove digits**
7. **Remove single letters**
8. **Remove symbols**

This cleaning was done by using the function *preprocess* defined below.
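
Before the full implementation, here is a minimal standard-library sketch of the same idea applied to a single tweet. It is a simplification under stated assumptions: the stopword set `STOPWORDS` is a tiny hand-written stand-in for NLTK's English list, lemmatization and symbol removal are skipped, and `clean_tweet` is a name chosen for illustration.

```python
import re

# Tiny illustrative stopword set (assumption: the notebook itself uses NLTK's English list)
STOPWORDS = {"the", "is", "in", "a", "of", "and", "to", "at"}

def clean_tweet(text):
    # Normalization + tokenization: lower-case, keep runs of word characters
    tokens = re.findall(r"\w+", text.lower())
    # Remove stopwords (punctuation is already gone after tokenization)
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Remove tokens that contain a digit
    tokens = [t for t in tokens if not re.search(r"\d", t)]
    # Remove single letters
    tokens = [t for t in tokens if len(t) > 1]
    return " ".join(tokens)

cleaned = clean_tweet("The fire is spreading in 3 districts of LA!")
print(cleaned)  # fire spreading districts la
```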

## Preprocess functions

In [None]:
# Build the stopword set once instead of on every call; a set also makes lookups faster
STOPWORDS = set(nltk.corpus.stopwords.words('english')) | set(string.punctuation)

def remove_stopwords(sent):
    return [word for word in sent if word not in STOPWORDS]

def lemmatize(sent, join=False):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(lemmatizer.lemmatize(lemmatizer.lemmatize(w,'v'),'n'),'a') for w in sent]
    if join:
        return ' '.join(tokens)
    else:
        return tokens
    
def remove_digits(sent):
    return [word for word in sent if not re.match(r"\S*\d+\S*", word)]

def remove_single_letters(sent):
    return [word for word in sent if len(word) > 1]

def remove_noise(sent):
    typos = ["û_", "amp", "ûª", "http", "https", "co", "rt", 
             "ûªs", "@", "...", "ûªs", "ûò", "åè", "ìñ1"]
    return [word for word in sent if word not in typos]

def join_tweets(sent):
    # Join tokens with a single space
    return " ".join(sent)

def preprocess(df):
    # Lower casing
    df['text_cleaned'] = df['text'].apply(lambda sent: sent.lower())
    
    # Tokenize
    token = nltk.tokenize.RegexpTokenizer(r"\w+")
    df['text_cleaned'] = df['text_cleaned'].apply(token.tokenize)
    
    # New column : tweet lengths before cleaning
    df['text_length_before'] = df.text_cleaned.apply(len)
    
    # Remove stopwords and punctuation
    df['text_cleaned'] = df['text_cleaned'].apply(lambda sent: remove_stopwords(sent)).reset_index(drop=True)
    
    # Lemmatize
    df['text_cleaned'] = df['text_cleaned'].apply(lambda sent: lemmatize(sent))
    
    # Remove digits
    df['text_cleaned'] = df['text_cleaned'].apply(remove_digits).reset_index(drop=True)
    
    # Remove single letter
    df['text_cleaned'] = df['text_cleaned'].apply(remove_single_letters).reset_index(drop=True)
    
    # Remove weird symbols
    df['text_cleaned'] = df['text_cleaned'].apply(remove_noise).reset_index(drop=True)
    
    # New column : tweet lengths after cleaning
    df['text_length_after'] = df.text_cleaned.apply(len)
    
    # Join words into one str
    df['text_cleaned'] = df['text_cleaned'].apply(join_tweets)
    
    return df

## Applying preprocess on dataset

In [None]:
df_train = preprocess(df_train)
df_test = preprocess(df_test)

## Tweets length distribution by target before and after cleaning

In [None]:
fig, ax = plt.subplots(1,2, figsize=(16,8))
sns.histplot(data = df_train, x = "text_length_before", element = "step", color='grey', ax=ax[0], label='before cleaning')
sns.histplot(data = df_train, x = "text_length_after", element = "step", color='yellow', ax=ax[0], label='after cleaning')
ax[0].set_title('Tweets length distribution in train set')
ax[0].legend()
sns.histplot(data = df_test, x = "text_length_before", element = "step", color='grey', ax=ax[1], label='before cleaning')
sns.histplot(data = df_test, x = "text_length_after", element = "step", color='yellow', ax=ax[1], label='after cleaning')
ax[1].set_title('Tweets length distribution in test set')
ax[1].legend()

# Wordclouds

## Wordclouds preparation

In [None]:
train_1 = df_train[df_train['target']==1].reset_index(drop=True)
train_0 = df_train[df_train['target']==0].reset_index(drop=True)
test_1 = df_test[df_test['target']==1].reset_index(drop=True)
test_0 = df_test[df_test['target']==0].reset_index(drop=True)

# Train for target 1
texts_train_1 = []
for i in range(0, train_1.shape[0]):
    texts_train_1.append(train_1['text_cleaned'][i])
wordcloud1 = WordCloud(background_color='white', colormap="hot").generate(" ".join(texts_train_1))

# Train for target 0
texts_train_0 = []
for i in range(0, train_0.shape[0]):
    texts_train_0.append(train_0['text_cleaned'][i])
wordcloud2 = WordCloud(background_color='white', colormap="cividis").generate(" ".join(texts_train_0))

# Test for target 1
texts_test_1 = []
for i in range(0, test_1.shape[0]):
    texts_test_1.append(test_1['text_cleaned'][i])
wordcloud3 = WordCloud(background_color='white', colormap="hot").generate(" ".join(texts_test_1))

# Test for target 0
texts_test_0 = []
for i in range(0, test_0.shape[0]):
    texts_test_0.append(test_0['text_cleaned'][i])
wordcloud4 = WordCloud(background_color='white', colormap="cividis").generate(" ".join(texts_test_0))

In [None]:
plt.subplots(figsize = (24, 14))

plt.subplot(2,2,1)
plt.imshow(wordcloud1)
plt.axis('off')
plt.title("Wordcloud train set target = 1")

plt.subplot(2,2,2)
plt.imshow(wordcloud2)
plt.axis('off')
plt.title("Wordcloud train set target = 0")

plt.subplot(2,2,3)
plt.imshow(wordcloud3)
plt.axis('off')
plt.title("Wordcloud test set target = 1")

plt.subplot(2,2,4)
plt.imshow(wordcloud4)
plt.axis('off')
plt.title("Wordcloud test set target = 0")

plt.show()

# Classification preparation

We will be using both machine learning and deep learning models. Let us first split the data.

## Splitting data

In [None]:
X = np.array(df_train.text_cleaned)
y = np.array(df_train.target)

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, shuffle=True, test_size=0.20)
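
The `stratify=y` argument asks for a split that keeps the class proportions roughly equal in both parts. The sketch below illustrates the idea in plain Python; it is not scikit-learn's implementation, and the helper name `stratified_split`, the toy labels and the rounding rule are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # share of this class sent to validation
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 60 negatives and 40 positives
labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(labels)
val_positive_share = sum(labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_positive_share)
```

Passing a fixed `random_state` to `train_test_split` plays the same role as `seed` here and makes the split reproducible between runs.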

In [None]:
X_test = np.array(df_test.text_cleaned)
y_test = np.array(df_test.target)

## Vectorization

# Classification with Machine Learning

A few models were tested using GridSearchCV with different vectorizations.

**Models :**
* MultinomialNB
* Logistic Regression
* SGD
* XGBoost
* SVM classifier


**Vectorizers :**
- CountVectorizer()
- HashingVectorizer()
- TfidfVectorizer(stop_words='english',analyzer='word',ngram_range=(1,2))
- TfidfVectorizer(stop_words='english',analyzer='word',ngram_range=(2,3))
- TfidfVectorizer(stop_words='english',analyzer='word',ngram_range=(1,3))

Performance, measured by recall, was close across all combinations, but the SGD classifier performed best. We will be using the best-performing combination of model and vectorizer.
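
To make the `ngram_range` parameter concrete, the sketch below enumerates the word n-grams that a range of `(1,3)` covers for one short sentence. It is a plain-Python illustration, not the vectorizer's own code; `word_ngrams` is a hypothetical helper and splitting on whitespace is a simplifying assumption.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """List all word n-grams with min_n <= n <= max_n (illustrative helper)."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("forest fire near town")
print(grams)
```

Four words give 4 unigrams, 3 bigrams and 2 trigrams, so the feature space grows quickly with the upper bound of the range.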

In [None]:
vect = TfidfVectorizer(analyzer='word',ngram_range=(1,3))

vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_val_dtm = vect.transform(X_val)
X_test_dtm = vect.transform(X_test)

In [None]:
clf = SGDClassifier()

In [None]:
%time clf.fit(X_train_dtm, y_train)

## Predict validation set

In [None]:
%time y_pred_val = clf.predict(X_val_dtm)

### Accuracy score for validation set

In [None]:
accuracy_score(y_val, y_pred_val)

### Recall score for validation set

In [None]:
recall_score(y_val, y_pred_val)

### F1 Score for validation set

In [None]:
f1_score(y_val, y_pred_val)

### Confusion matrix for validation set

In [None]:
sns.heatmap(confusion_matrix(y_val, y_pred_val), annot=True, fmt='.4g', cmap='Reds')

### Classification report for validation set

In [None]:
print(classification_report(y_val, y_pred_val))
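
The accuracy, recall and F1 scores reported above all derive from the four counts of the confusion matrix. As a reminder of how they relate, here is a small standard-library sketch on made-up labels; `binary_metrics` is a name chosen for illustration and the toy vectors are not model output.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = disaster)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

Recall is the share of real disaster tweets that the model catches, which is why it was used to compare the models.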

## Predict test set

In [None]:
%time y_pred_test = clf.predict(X_test_dtm)

### Accuracy score for test set

In [None]:
accuracy_score(y_test, y_pred_test)

### Recall score for test set

In [None]:
recall_score(y_test, y_pred_test)

### F1 Score for test set

In [None]:
f1_score(y_test, y_pred_test)

### Confusion matrix for test set

In [None]:
sns.heatmap(confusion_matrix(y_test, y_pred_test), annot=True, fmt='.4g', cmap='Reds')

### Classification report for test set

In [None]:
print(classification_report(y_test, y_pred_test))

# Classification with Deep Learning : GloVe + LSTM

In [None]:
# Define x and y values for the deep learning model
x = df_train['text_cleaned'].values
y = df_train['target'].values

In [None]:
# Calculate the length of our vocabulary in train tweets
from keras.preprocessing.text import Tokenizer

word_tokenizer= Tokenizer()
word_tokenizer.fit_on_texts(x)

vocab_length = len(word_tokenizer.word_index) + 1
vocab_length

In [None]:
# This cell loads the GloVe dictionary that we use as a reference to map tokens to their embedding vectors

# Load GloVe 100D embeddings
embeddings_dictionary = dict()
embedding_dim = 100

# Read the file as UTF-8 and iterate lazily instead of loading every line into memory
with open('/kaggle/input/glove6b100dtxt/glove.6B.100d.txt', encoding='utf-8') as fp:
    for line in fp:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

In [None]:
# Use the GloVe dictionary to fill the embedding matrix for the tokens of our tweets. Tokens that are missing from GloVe keep an all-zero vector.
embedding_matrix = np.zeros((vocab_length, embedding_dim))
for word, index in word_tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
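
The two cells above can be followed on a toy example. The sketch below parses GloVe-style lines and fills a small matrix with plain Python lists; the three-dimensional vectors, the word index and all variable names are made up for illustration and do not come from the real GloVe file.

```python
# GloVe-style lines: a word followed by its vector components (toy values)
glove_lines = ["fire 0.1 0.2 0.3", "flood 0.4 0.5 0.6"]

toy_embeddings = {}
for line in glove_lines:
    word, *values = line.split()
    toy_embeddings[word] = [float(v) for v in values]

# Toy tokenizer index; index 0 is left unused, as it is reserved for padding
toy_word_index = {"fire": 1, "zzz": 2, "flood": 3}
dim = 3

# One row per index, initialised to zeros
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, index in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:  # words unknown to GloVe keep their zero row
        toy_matrix[index] = vector

print(toy_matrix)
```

Row 0 (padding) and row 2 (the out-of-vocabulary token `zzz`) stay at zero, while the other rows receive their pretrained vectors.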

In [None]:
# Here we convert the tokens into numeric sequences that the model can take as input. Then every sequence is padded with zeros so that all vectors have the same length.
from nltk.tokenize import word_tokenize
from keras.preprocessing.sequence import pad_sequences

def embed(corpus): 
    return word_tokenizer.texts_to_sequences(corpus)

longest_sent = max(x, key=lambda sentence: len(word_tokenize(sentence)))
sent_max_len = len(word_tokenize(longest_sent))

train_pad_sent = pad_sequences(embed(x), sent_max_len, padding='post')
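
What `padding='post'` does can be shown with a few lines of plain Python. This is only an illustration of the idea: `pad_post` is a hypothetical helper, and unlike `pad_sequences` (which, to our understanding, truncates from the front by default) it cuts overly long sequences at the end.

```python
def pad_post(sequences, max_len):
    """Right-pad each integer sequence with zeros up to max_len (illustrative helper)."""
    padded = []
    for seq in sequences:
        seq = seq[:max_len]  # simplification: cut from the end if too long
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded

padded_demo = pad_post([[4, 2], [7], [1, 2, 3]], max_len=3)
print(padded_demo)  # [[4, 2, 0], [7, 0, 0], [1, 2, 3]]
```

Index 0 is never assigned to a real word by the tokenizer, so it can safely serve as the padding value.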

In [None]:
# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_pad_sent, y, test_size=0.20)

In [None]:
# Imports needed to define and evaluate the model
from keras.models import Sequential
from keras.initializers import Constant
from keras.layers import (LSTM, 
                          Embedding, 
                          BatchNormalization,
                          Dense, 
                          TimeDistributed, 
                          Dropout, 
                          Bidirectional,
                          Flatten, 
                          GlobalMaxPool1D)
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from sklearn.metrics import (
    precision_score, 
    recall_score, 
    f1_score, 
    classification_report,
    accuracy_score
)

In [None]:
# Let's define the model 
def glove_lstm():
    model = Sequential()
    
    model.add(Embedding(
        input_dim=embedding_matrix.shape[0], 
        output_dim=embedding_matrix.shape[1], 
        weights = [embedding_matrix], 
        input_length= sent_max_len
    ))
    
    model.add(Bidirectional(LSTM(
        sent_max_len, 
        return_sequences = True, 
        recurrent_dropout=0.2
    )))
    
    model.add(GlobalMaxPool1D())
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(sent_max_len, activation = "relu"))
    model.add(Dropout(0.5))
    model.add(Dense(sent_max_len, activation = "relu"))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

model = glove_lstm()
model.summary()

In [None]:
checkpoint = ModelCheckpoint(
    'model.h5', 
    monitor = 'val_loss', 
    verbose = 1, 
    save_best_only = True
)
reduce_lr = ReduceLROnPlateau(
    monitor = 'val_loss', 
    factor = 0.2, 
    verbose = 1, 
    patience = 5,                        
    min_lr = 0.001
)
history = model.fit(
    X_train, 
    y_train, 
    epochs = 7,
    batch_size = 32,
    validation_data = (X_val, y_val),
    verbose = 1,
    callbacks = [reduce_lr, checkpoint]
)
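
A note on the `ReduceLROnPlateau` settings: as far as we understand the callback, when the monitored value stops improving it multiplies the learning rate by `factor` and never goes below `min_lr`. The sketch below reproduces only that update rule in plain Python (`reduced_lr` is our own name, not a Keras function). Assuming the optimizer starts at RMSprop's default learning rate of 0.001, a `min_lr` of 0.001 leaves no room for a reduction, so a smaller value such as `1e-5` may be worth trying.

```python
def reduced_lr(current_lr, factor, min_lr):
    """Learning rate after one plateau step: scaled by factor, floored at min_lr."""
    return max(current_lr * factor, min_lr)

# Settings used above, assuming a starting learning rate of 0.001
lr_with_current_floor = reduced_lr(0.001, factor=0.2, min_lr=0.001)

# Same step with a lower floor (hypothetical alternative setting)
lr_with_lower_floor = reduced_lr(0.001, factor=0.2, min_lr=1e-5)

print(lr_with_current_floor, lr_with_lower_floor)
```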

### Let's evaluate the results!

In [None]:
history.history.keys()

In [None]:
glstm_scores = [['loss', 'val_loss'],['accuracy', 'val_accuracy']]
# Visualize the loss
fig, ax = plt.subplots(1, 2, figsize=(20, 5))
for idx in range(2):
    ax[idx].plot(history.history[glstm_scores[idx][0]])
    ax[idx].plot(history.history[glstm_scores[idx][1]])
    ax[idx].legend([glstm_scores[idx][0], glstm_scores[idx][1]],fontsize=18)
    ax[idx].set_xlabel('Epochs ',fontsize=16)
    ax[idx].set_ylabel('Score',fontsize=16)
    ax[idx].set_title(glstm_scores[idx][0])

In [None]:
y_preds = (model.predict(X_val) > 0.5).astype("int32")
sns.heatmap(confusion_matrix(y_val, y_preds), annot=True, fmt='.4g', cmap='Reds')

# Classification : BERT

The steps below were heavily inspired by [@xhlulu](https://www.kaggle.com/xhlulu)'s submission.

In [None]:
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

import tokenization

In [None]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
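
To see what `bert_encode` produces for one tweet, the sketch below replays the same steps on an already tokenized sentence with a toy vocabulary. The vocabulary, its ids and the helper name `encode_one` are assumptions made for the example; the real ids come from the BERT vocabulary file loaded further down.

```python
def encode_one(tokens, vocab, max_len):
    """Build token ids, padding mask and segment ids for one tokenized text."""
    tokens = tokens[:max_len - 2]                # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)

    ids = [vocab[t] for t in sequence] + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len   # 1 = real token, 0 = padding
    segments = [0] * max_len                     # single-sentence input
    return ids, mask, segments

# Toy vocabulary with made-up ids
toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "fire": 3, "near": 4, "town": 5}

ids, mask, segments = encode_one(["fire", "near", "town"], toy_vocab, max_len=6)
print(ids)       # [101, 3, 4, 5, 102, 0]
print(mask)      # [1, 1, 1, 1, 1, 0]
print(segments)  # [0, 0, 0, 0, 0, 0]
```

With a smaller `max_len` the text is truncated first, so the special tokens always fit.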

In [None]:
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    optimizer = SGD(learning_rate=1e-5, momentum=0.8)
    model.compile(optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
train_input = bert_encode(df_train.text.values, tokenizer, max_len=160)
test_input = bert_encode(df_test.text.values, tokenizer, max_len=160)
train_labels = df_train.target.values
test_labels = df_test.target.values

In [None]:
model = build_model(bert_layer, max_len=160)
model.summary()

In [None]:
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

In [None]:
%time train_history = model.fit(train_input, train_labels, validation_split=0.2, epochs=12, verbose = 1, callbacks=[checkpoint], batch_size=12)

In [None]:
train_history.history

fig = go.Figure()

fig.add_trace(go.Scatter(x = train_history.epoch, y = train_history.history['loss'], name="Train loss"))

fig.add_trace(go.Scatter(x = train_history.epoch, y = train_history.history['val_loss'], name="Validation Loss"))

fig.update_layout(
    title="Bert performance",
    xaxis_title="Epochs",
    yaxis_title="Loss")

fig.show()

In [None]:
model.load_weights('model.h5')
%time test_pred = model.predict(test_input)

## Confusion Matrix

In [None]:
test_pred = (test_pred > 0.5).astype("int32")
sns.heatmap(confusion_matrix(y_test, test_pred), annot=True, fmt='.4g', cmap='Reds')
plt.title('Bert for test set')

## Accuracy Score

In [None]:
accuracy_score(y_test, test_pred)

## Recall Score

In [None]:
recall_score(y_test, test_pred)

## F1 Score

In [None]:
f1_score(y_test, test_pred)

## Classification report

In [None]:
print(classification_report(y_test, test_pred))