<a href="https://colab.research.google.com/github/farheenfathimaa/NLP-with-Disaster-Tweets/blob/main/Natural_Language_Processing_with_Disaster_Tweets_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# mounting drive
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


In [2]:
# Unzip the uploaded data into Google Drive
#!unzip "/content/drive/MyDrive/nlp-getting-started.zip" -d "/content/drive/MyDrive/nlp-tweets"

# Natural Language Processing with Disaster Tweets

This notebook uses several Python machine learning and data science libraries to build a model that predicts whether a given tweet is about a real disaster. If it is, the model should predict 1; if not, it should predict 0.

We're going to take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

The competition and its dataset are available on Kaggle: [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/overview)



In [3]:
#!pip install tensorflow

The code below is a copy of [this Kaggle notebook](https://www.kaggle.com/code/nabeelparuk/nlp-disaster-tweet-sentiment-analysis).

It is an attempt to understand how an NLP problem is worked through in practice.

## Importing modules

In [4]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization, Embedding
import tensorflow_hub as hub
#import tensorflow_text as text

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import random
import datetime
import matplotlib.pyplot as plt
import io
from IPython.display import FileLink

import warnings
warnings.filterwarnings('ignore')
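
The four metrics imported from `sklearn.metrics` above can also be written out by hand for the binary case, which makes it clearer what each one measures. The block below is a minimal sketch, not the notebook's evaluation code: `binary_metrics` and the example labels are hypothetical, and returning `0.0` when a denominator is zero is an assumption of this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = real disaster)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disasters correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed disasters
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-disasters correctly ignored

    accuracy = (tp + tn) / len(pairs)
    # Assumption: fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels, for illustration only.
example_true = [1, 0, 1, 1, 0, 0]
example_pred = [1, 0, 0, 1, 1, 0]
example_scores = binary_metrics(example_true, example_pred)
print(example_scores)
```

Precision answers "of the tweets flagged as disasters, how many were real?", while recall answers "of the real disasters, how many were flagged?". F1 is the harmonic mean of the two.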

## Exploratory Data Analysis
### Import data

In [10]:
train_df = pd.read_csv("/content/drive/MyDrive/nlp-tweets/train.csv")
test_df = pd.read_csv("/content/drive/MyDrive/nlp-tweets/test.csv")
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [11]:
train_df_shuffled = train_df.sample(frac=1, random_state=1)
train_df_shuffled.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 3228 | 4632 | emergency%20services | Sydney, New South Wales | Goulburn man Henry Van Bilsen missing: Emergen... | 1 |
| 3706 | 5271 | fear | | The things we fear most in organizations--fluc... | 0 |
| 6957 | 9982 | tsunami | Land Of The Kings | @tsunami_esh ?? hey Esh | 0 |
| 2887 | 4149 | drown | | @POTUS you until you drown by water entering t... | 0 |
| 7464 | 10680 | wounds | cody, austin follows ?*? | Crawling in my skin\nThese wounds they will no... | 1 |


In [12]:
train_df.isna().sum()

| | 0 |
|---|---|
| id | 0 |
| keyword | 61 |
| location | 2533 |
| text | 0 |
| target | 0 |


In [14]:
train_df["target"].value_counts()

| target | count |
|---|---|
| 0 | 4342 |
| 1 | 3271 |
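
The class counts above show a mild imbalance, which sets a floor for any model: always predicting the majority class already scores a known accuracy. The block below is a small sketch of that calculation, with the counts copied from the `value_counts()` output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"target={label}: {share:.1%}")
print(f"Always predicting {majority_label} gives accuracy {baseline_accuracy:.1%}")
```

Any trained model should beat this baseline on accuracy. Because the trivial model never predicts class 1, its recall for real disasters is zero, which is one reason to look at precision, recall and F1 as well.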


In [16]:
# Visualise a few random samples
random_index = random.randint(0, len(train_df)-5)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target>0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 0 (not real disaster)
Text:
My ears are bleeding  https://t.co/k5KnNwugwT

---

Target: 0 (not real disaster)
Text:
Ted Cruz fires back at Jeb &amp; Bush: ÛÏWe lose because of Republicans like Jeb &amp; Mitt.Û [Video] -  http://t.co/bFtiaPF35F

---

Target: 0 (not real disaster)
Text:
First impressions: glad hat man is leaving in lieu of more interesting ladies. Hope mudslide lady triumphs next week.

---

Target: 0 (not real disaster)
Text:
The Flash And The Thunder by WC Quick on Amazon Kindle and soon in PRINT at Amazon Books via Create... http://t.co/oS1WjRvx5c via @weebly

---

Target: 0 (not real disaster)
Text:
@hellotybeeren cue the flood of people 'ironically' calling you that

---



## Preprocessing
### Split into training and validation data

In [17]:
# Set target and predictors
X = train_df_shuffled["text"].to_numpy()
y = train_df_shuffled["target"].to_numpy()

# Split
X_train, X_val, y_train, y_val = train_test_split(X,
                                                  y,
                                                  random_state=42,
                                                  test_size=0.2)
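
As a quick sanity check on the split, the expected set sizes can be worked out by hand. The sketch below assumes the 7,613 rows implied by the class counts shown earlier, and relies on scikit-learn rounding a fractional `test_size` up and assigning the remaining rows to training.

```python
import math

n_samples = 4342 + 3271   # rows in train.csv, from the class counts shown earlier
test_size = 0.2

# scikit-learn rounds the validation share up and gives the remainder to training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val
print(f"train: {n_train}, validation: {n_val}")
```

Comparing these numbers with `len(X_train)` and `len(X_val)` is a cheap way to confirm the split did what was intended.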

### Text vectorization

In [19]:
# Find average number of tokens in tweets
round(sum([len(i.split()) for i in X_train])/len(X_train))

15

In [21]:
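# Set vectorizer parameters
max_vocab_length = 10000  # maximum vocabulary size (including the padding and [UNK] tokens)
max_length = 15           # tokens per tweet after padding/truncation (the average found above)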

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_sequence_length=max_length)
text_vectorizer.adapt(X_train)

In [24]:
# Test on random sentence
random_sentence = random.choice(X_train)
print(f"Text:\n{random_sentence}\n\nAfter vectorization:\n{text_vectorizer([random_sentence])}")

Text:
Seriously do we have to do a tactical riot against the headquarters of Disney and Marvel...

After vectorization:
[[1639   67   46   24    5   67    3 7204  467  351    2 5037    6 2791
     8]]
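
To make the vectorization step less of a black box, here is a simplified pure-Python stand-in for what the layer does with its default settings: lowercase the text, strip punctuation, split on whitespace, map each token to an integer ID, then truncate or zero-pad to a fixed length. This is a sketch only, not the Keras implementation. `tokenize`, `build_vocabulary` and `vectorize` are hypothetical helpers, the punctuation regex is an approximation, and the corpus is made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, drop punctuation (approximation), split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocabulary(texts, max_tokens):
    """Frequency-ordered vocabulary; index 0 is padding and index 1 is unknown."""
    counts = Counter(token for text in texts for token in tokenize(text))
    # Reserve two slots for the padding and out-of-vocabulary tokens.
    most_frequent = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_frequent


def vectorize(text, vocabulary, output_sequence_length):
    """Map tokens to integer IDs, then truncate or zero-pad to a fixed length."""
    token_to_id = {token: index for index, token in enumerate(vocabulary)}
    ids = [token_to_id.get(token, 1) for token in tokenize(text)]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))


# Made-up corpus, for illustration only.
toy_corpus = ["Forest fire near the town", "The fire is spreading", "the flood"]
toy_vocab = build_vocabulary(toy_corpus, max_tokens=6)
toy_ids = vectorize("The fire reached the harbour!", toy_vocab, output_sequence_length=8)
print(toy_vocab)
print(toy_ids)
```

Words that did not make it into the vocabulary map to ID 1 (`[UNK]`), and short sentences are filled with ID 0 on the right. With `max_length = 15`, tweets longer than 15 tokens lose their tail.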


Let's look at the most and least common words in the vocabulary.

In [27]:
# Get the vocabulary, ordered from most to least frequent token
vocab_words = text_vectorizer.get_vocabulary()

# Get the least and the most common words
most_common = vocab_words[:5]
least_common = vocab_words[-5:]
print(f"Most common: {most_common}\nLeast_common: {least_common}")

Most common: ['', '[UNK]', 'the', 'a', 'in']
Least_common: ['mildmannered', 'milc5040h', 'mil', 'mikecroninwmur', 'mihirssharma']


## Embedding
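An embedding layer maps each integer token ID to a dense vector whose values are learned during training, so the vectorized text becomes a trainable representation.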

In [29]:
# Create embedding layer
embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             input_length=max_length,
                             name="embedding_1")

In [31]:
# Choose random sentence
random_sentence = random.choice(X_train)
print("Original sentence:", random_sentence)

# Embed the sentence: the embedding layer expects token IDs, so vectorize the raw text first
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original sentence: USFS an acronym for United States Fire Service. http://t.co/8NAdrGr4xC


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.0034367 , -0.04932019, -0.0442805 , ...,  0.0141158 ,
         -0.00573764, -0.02702069],
        [-0.00029341, -0.02373687,  0.02341169, ..., -0.03347812,
         -0.01269273, -0.00145557],
        [-0.03247808, -0.02974534, -0.04184685, ..., -0.03741069,
         -0.01111143,  0.04754398],
        ...,
        [-0.04945989, -0.02760971, -0.04315114, ..., -0.00434431,
         -0.00817271,  0.02983383],
        [-0.04945989, -0.02760971, -0.04315114, ..., -0.00434431,
         -0.00817271,  0.02983383],
        [-0.04945989, -0.02760971, -0.04315114, ..., -0.00434431,
         -0.00817271,  0.02983383]]], dtype=float32)>
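
Conceptually, an embedding layer is a lookup table with one row per token ID. The sketch below illustrates that idea in plain Python with tiny, made-up sizes; it is not how Keras implements the layer. `embedding_table`, `embed` and the sample IDs are hypothetical, and the uniform range of ±0.05 is chosen to match the magnitude of the values printed above.

```python
import random

vocab_size, embedding_dim = 10, 4   # tiny stand-ins for 10000 and 128

rng = random.Random(0)
# One trainable row per token ID, initialised with small random values.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up the embedding row for each token ID in a single sequence."""
    return [embedding_table[token_id] for token_id in token_ids]


# Hypothetical vectorized tweet: three real tokens followed by zero padding.
sample_ids = [7, 3, 5, 0, 0, 0]
sample_vectors = embed(sample_ids)

print(len(sample_vectors), len(sample_vectors[0]))   # sequence length, embedding size
print(sample_vectors[-1] == sample_vectors[-2])      # padding positions share row 0
```

This also suggests why the last rows of the tensor above are identical: the sample tweet has fewer than 15 tokens, so its trailing positions are most likely padding (ID 0), and every padding position looks up the same row.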