We use the dataset from the Kaggle competition at https://www.kaggle.com/competitions/nlp-getting-started

Each entry in the training and test sets contains the following elements:

- **Text**: The content of the tweet.
- **Keyword**: A specific keyword from the tweet (may be missing).
- **Location**: The location the tweet was sent from (may also be missing).

The goal of this competition is to predict whether a given tweet is about a real disaster: predict 1 if it is, and 0 if it is not.

Here are the details of each field in the dataset:

- **id**: A unique identifier assigned to each tweet.
- **text**: The actual text content of the tweet.
- **location**: The geographical location from where the tweet was sent (this field may be blank).
- **keyword**: A specific keyword from the tweet (this field may also be blank).
- **target**: This field is only present in the train.csv file. It indicates whether a tweet is about a real disaster (1) or not (0).
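
Because `keyword` and `location` may be blank, it can help to count the missing values before modeling. The sketch below uses a small hand-built frame (`toy_df`, a hypothetical stand-in with made-up rows) rather than the real `train.csv`; the same `isna().sum()` call should apply to `train_df` once it is loaded.

```python
import pandas as pd

# Hypothetical rows that mimic the train.csv schema; not real competition data.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [None, "flood", "fire"],
    "location": [None, None, "Paris"],
    "text": ["example tweet one", "example tweet two", "example tweet three"],
    "target": [1, 1, 0],
})

# Count blank (NaN) entries in each optional column.
missing_counts = toy_df[["keyword", "location"]].isna().sum()
print(missing_counts)
```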

In [None]:
!unzip nlp-getting-started.zip

In [None]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

In [None]:
train_df["text"][0]

In [None]:
train_df_shuffled = train_df.sample(frac=1, random_state=101)
train_df_shuffled.head()

In [None]:
train_df.target.value_counts()

In [None]:
test_df.head()

In [None]:
len(train_df), len(test_df)

In [None]:
from sklearn.model_selection import train_test_split

train_tweets, val_tweets, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                      train_df_shuffled["target"].to_numpy(),
                                                                      test_size=0.1,
                                                                      random_state=101)

In [None]:
len(train_tweets), len(val_tweets)

In [None]:
train_tweets[:10], train_labels[:10]

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

In [None]:
text_vectorizer = TextVectorization(max_tokens=20000,
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    output_mode="int",
                                    output_sequence_length=15)

In [None]:
text_vectorizer.adapt(train_tweets)

In [None]:
sample_tweet = "Just happened a terrible car crash"
text_vectorizer([sample_tweet])

In [None]:
words_in_vocab = text_vectorizer.get_vocabulary()

top_5_words, bottom_5_words = words_in_vocab[:5], words_in_vocab[-5:]

print(len(words_in_vocab))
print(top_5_words)
print(bottom_5_words)

In [None]:
from tensorflow.keras import layers

In [None]:
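# input_dim matches max_tokens in the vectorizer; input_length matches its
# output_sequence_length. Note: input_length is deprecated in Keras 3 and
# can simply be dropped there.
embedding = layers.Embedding(input_dim=20000,
                             output_dim=128,
                             input_length=15)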

In [None]:
sample_embed = embedding(text_vectorizer([sample_tweet]))
sample_embed

In [None]:
sample_embed[0][0]

In [None]:
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Dropout(0.5)(x)  # apply dropout to the embeddings (the layer was previously created but never called)

# Add multiple Conv1D layers with different kernel sizes
x = layers.Conv1D(filters=128, kernel_size=5, strides=1, activation="relu", padding="same")(x)
x = layers.MaxPooling1D(pool_size=2)(x)

x = layers.Conv1D(filters=64, kernel_size=3, strides=1, activation="relu", padding="same")(x)
x = layers.MaxPooling1D(pool_size=2)(x)

x = layers.Conv1D(filters=32, kernel_size=3, strides=1, activation="relu", padding="same")(x)
x = layers.MaxPooling1D(pool_size=2)(x)

# Flatten the output from the Conv layers before feeding it into Dense layer
x = layers.Flatten()(x)

# Add Dense layers before the output
x = layers.Dense(32, activation='relu', kernel_initializer="he_normal")(x)
x = layers.Dropout(0.5)(x)

x = layers.Dense(32, activation='relu', kernel_initializer="he_normal")(x)
x = layers.Dropout(0.5)(x)

outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs, name="model_multi_conv1d")

model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

model.summary()
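
As a rough check on the shapes that `model.summary()` reports, the sequence length can be traced by hand. This sketch assumes the Keras defaults for `MaxPooling1D` (`strides` equal to `pool_size`, `padding="valid"`) and relies on the `Conv1D` layers preserving length because they use `padding="same"` with `strides=1`. The helper `pooled_length` is a hypothetical name used only for illustration.

```python
def pooled_length(length, pool_size=2):
    # Output length of a "valid" pooling layer whose stride equals pool_size.
    return (length - pool_size) // pool_size + 1

seq_len = 15  # output_sequence_length of the text vectorizer
lengths = [seq_len]
for _ in range(3):  # three Conv1D + MaxPooling1D stages
    lengths.append(pooled_length(lengths[-1]))

# The last Conv1D has 32 filters, so Flatten yields final_length * 32 features.
flattened_features = lengths[-1] * 32

print(lengths)             # [15, 7, 3, 1]
print(flattened_features)  # 32
```

With only 15 tokens per tweet, three pooling stages reduce the sequence to a single time step, so a longer `output_sequence_length` or fewer pooling layers may be worth trying.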

In [None]:
model_history = model.fit(train_tweets,
                          train_labels,
                          epochs=10,
                          validation_data=(val_tweets, val_labels))
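
Since the output layer is a sigmoid, the model returns probabilities rather than the 0/1 labels the competition expects. A minimal sketch of the conversion follows, assuming a 0.5 threshold; `val_probs` and `val_true` are hypothetical values standing in for `model.predict(val_tweets)` and `val_labels`.

```python
import numpy as np

# Hypothetical sigmoid outputs with the (n, 1) shape that model.predict returns.
val_probs = np.array([[0.91], [0.12], [0.50], [0.67]])
# Hypothetical ground-truth labels for the same four tweets.
val_true = np.array([1, 0, 0, 1])

# Drop the trailing axis, then label a tweet 1 when its probability exceeds 0.5.
val_preds = (val_probs.squeeze(axis=1) > 0.5).astype(int)

# Fraction of predictions that match the labels.
accuracy = (val_preds == val_true).mean()

print(val_preds)
print(accuracy)
```

The 0.5 cutoff is a convention, not a requirement; a different threshold may trade precision against recall.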