# Natural Language Processing with Disaster Tweets
Predict which tweets are about real disasters and which are not.
The dataset comes from [Kaggle](https://www.kaggle.com).

This is my submission to the ongoing [competition](https://www.kaggle.com/competitions/nlp-getting-started/data) organised by the Kaggle team.

In [None]:
import pandas as pd
import tensorflow as tf

In [None]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [None]:
train_df[train_df['target'] == 0]['text'].values[1]

In [None]:
train_df[train_df['target'] == 1]['text'].values[1]

In [None]:
train_df.head()

## Preprocessing the data
The idea behind the model in this notebook is simple: the words in a tweet are a good indicator of whether it is about a real disaster. This is not entirely true, since the same word can be used literally or figuratively, but it's a good place to start.
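
To make that assumption concrete, here is a minimal sketch that counts words per class. The tweets in `toy_tweets` are made up for illustration and are not taken from `train.csv`; the helper name `word_counts_by_class` is also my own.

```python
from collections import Counter

# Hypothetical toy examples, NOT rows from the competition data
toy_tweets = [
    ("forest fire near the town", 1),
    ("flood warning issued for the river", 1),
    ("this new album is fire", 0),
    ("i love the summer", 0),
]

def word_counts_by_class(rows):
    """Count how often each word appears in each target class."""
    counts = {0: Counter(), 1: Counter()}
    for text, target in rows:
        counts[target].update(text.lower().split())
    return counts

counts = word_counts_by_class(toy_tweets)
print("disaster:", counts[1].most_common(3))
print("not disaster:", counts[0].most_common(3))
```

In this toy data "flood" only shows up in the disaster class, while "fire" shows up in both, which is the kind of case where word identity alone is not enough.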


In [None]:
# Keep only the columns the model needs; .copy() avoids pandas' SettingWithCopyWarning
train_df = train_df[['text', 'target']].copy()
# Lowercase every tweet
train_df['text'] = train_df['text'].str.lower()
texts = train_df['text'].tolist()

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

max_words = 10000 # max number of words to use in the vocabulary
max_len = 100 # max length of each text (in terms of number of words)
embedding_dim = 100 # dimension of word embeddings
lstm_units = 64 # number of units in the LSTM layer

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

In [None]:
texts[0]  # the same tweet whose integer sequence is shown in the next cell

In [None]:
sequences[0]
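
The two cells above show a tweet and its integer sequence. As a rough sketch of what the `Tokenizer` is doing, the pure-Python version below builds a frequency-ranked word index and maps texts to integers. It is a simplification under my own assumptions: it only splits on whitespace (the Keras tokenizer also strips punctuation), and the helper names `fit_word_index` and `to_sequences` are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank: 1 = most frequent word, 2 = next, and so on."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count, ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index; drop unknown words and ranks >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["fire in the forest", "the fire spread"]  # made-up examples
word_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, word_index, num_words=10000)
print(word_index)
print(toy_sequences)  # [[1, 3, 2, 4], [2, 1, 5]]
```

Index 0 is never assigned to a word, which leaves it free to be used as the padding value later.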

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

X = pad_sequences(sequences, maxlen=max_len)

y = train_df['target'].values
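
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, drops tokens from the front as well. The sketch below mimics that default behaviour for a single sequence in plain Python; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

print(pad_pre([1, 3, 2, 4], maxlen=6))           # [0, 0, 1, 3, 2, 4]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Tweets are short, so with `max_len = 100` most rows of `X` will be mostly zeros followed by the tweet's word indices.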

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(max_words, embedding_dim, input_length=max_len))
model.add(tf.keras.layers.LSTM(lstm_units))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
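
The last layer turns the LSTM output into a probability with a sigmoid, and `binary_crossentropy` scores that probability against the 0/1 label. A minimal sketch of both formulas for a single example follows; the `eps` clipping is my own addition to avoid `log(0)`, and the function names are illustrative.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example: small when p is close to the true label."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

print(sigmoid(0.0))                   # 0.5
print(binary_crossentropy(1, 0.9))    # confident and right: low loss
print(binary_crossentropy(1, 0.1))    # confident and wrong: high loss
```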

In [None]:
history = model.fit(X, y, batch_size=32, epochs=10, verbose=1)
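
Because `fit` is given every row, the accuracy it prints is training accuracy and says little about unseen tweets. One option, not used in this notebook, would be to hold out part of the rows for validation. The sketch below only produces the shuffled index split; `holdout_indices`, the 20% fraction, and the seed are my own assumptions.

```python
import random

def holdout_indices(n_rows, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx): two disjoint, shuffled lists of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    n_val = int(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_indices(10)
print(len(train_idx), len(val_idx))  # 8 2
```

The index lists could then be used as `X[train_idx]`, `y[train_idx]` and passed alongside `validation_data=(X[val_idx], y[val_idx])`. Keras also accepts `validation_split=0.2` in `fit`, which takes the last 20% of rows without shuffling first.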

In [None]:
test_df = test_df[['id', 'text']].copy()  # .copy() so the later column assignment is safe
test_df.head()

In [None]:
# Apply the same lowercasing that was used for the training tweets
test_df['text'] = test_df['text'].str.lower()
test_texts = test_df['text'].tolist()
print(test_texts[20])


We reuse the tokenizer that was fitted on the training data, so test words map to the same integer indices the model was trained on.

In [None]:
test_sequences = tokenizer.texts_to_sequences(test_texts)
test_sequences = pad_sequences(test_sequences, maxlen=max_len)
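
One consequence of reusing the training tokenizer is that words it never saw are silently dropped from the test sequences, since no `oov_token` was set. A rough way to see how much text is lost is to measure the share of unseen tokens. The sketch below does this with plain whitespace splitting on made-up examples; `oov_rate` is a hypothetical helper.

```python
def oov_rate(train_texts, new_texts):
    """Fraction of tokens in new_texts that never occur in train_texts."""
    vocab = {w for text in train_texts for w in text.lower().split()}
    tokens = [w for text in new_texts for w in text.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for w in tokens if w not in vocab)
    return unseen / len(tokens)

# Made-up examples: "flood" and "fast" are not in the training vocabulary
rate = oov_rate(["fire in the forest", "the fire spread"], ["the flood spread fast"])
print(rate)  # 0.5
```

Calling it as `oov_rate(texts, test_texts)` would give an approximate figure for this notebook's data.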

In [None]:
predictions = model.predict(test_sequences)

In [None]:
# model.predict returns shape (n, 1); flatten it and threshold the probabilities at 0.5
predictions = (predictions.ravel() >= 0.5).astype(int)

In [None]:
submit_df = pd.DataFrame({'id': test_df['id'], 'target': predictions})
submit_df.head()

In [None]:
submit_df.to_csv('submission.csv', index=False)