# **Predict which Tweets are about real disasters**


Anik Chakraborty (waytoanik@outlook.com)

You can also find this notebook on Kaggle: https://www.kaggle.com/anik424/nlp-predict-tweets-about-real-disasters-82

# Installing Required Packages

In [1]:
!pip install ktrain

Collecting ktrain
  Downloading ktrain-0.25.4.tar.gz (25.3 MB)
Collecting langdetect
  Downloading langdetect-1.0.8.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok
  Downloading syntok-1.3.1.tar.gz (23 kB)
Collecting seqeval==0.0.19
  Downloading seqeval-0.0.19.tar.gz (30 kB)
Collecting transformers<4.0,>=3.1.0
  Downloading transformers-3.5.1-py3-none-any.whl (1.3 MB)
Collecting keras_bert>=0.86.0
  Downloading keras-bert-0.86.0.tar.gz (26 kB)
Collecting whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)
Collecting keras-transformer>=0.38.0
  Downloading keras-transf

In [2]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.0.48-py2.py3-none-any.whl (6.4 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.1.7-py3-none-any.whl (260 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.1.tar.gz (321 kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.1-cp37-cp37m-linux_x86_64.whl size=102851 sha256=18d86012f8c2e54066f2ffab90b29f10b8722f97554b8921bbfb1582eefd8f15
  Stored in directory: /root/.cache/pip/wheels/fe/ea/e6/38b0d734be6936b783e916a0d8d670313fb1b2f74c5889d4fe
Successfully built pyahocorasick
Installing collected packages: pyahocorasick, anyascii, textsearch, co

# Importing Required Packages

In [3]:
import pandas as pd
import numpy as np
import sys  
import re
import string
import contractions
from sklearn.model_selection import train_test_split
import ktrain
import tensorflow as tf
from ktrain import text

# Data Preparation

In [4]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


In [5]:
df_train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
df_train

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | NaN | NaN | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | NaN | NaN | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | NaN | NaN | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | NaN | NaN | Police investigating after an e-bike collided ... | 1 |


In [6]:
df_train.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

In [7]:
df_val = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
df_val

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
| ... | ... | ... | ... | ... |
| 3258 | 10861 | NaN | NaN | EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE... |
| 3259 | 10865 | NaN | NaN | Storm in RI worse than last hurricane. My city... |
| 3260 | 10868 | NaN | NaN | Green Line derailment in Chicago http://t.co/U... |
| 3261 | 10874 | NaN | NaN | MEG issues Hazardous Weather Outlook (HWO) htt... |


In [8]:
df_train['target'].value_counts(normalize=True)

0    0.57034
1    0.42966
Name: target, dtype: float64

In [9]:
sum(df_train.keyword.isna())

61

In [10]:
sum(df_train.location.isna())

2533

**Dropping the keyword, location, and id columns**

In [11]:
df_train.drop(columns=['keyword', 'location' ,'id'], inplace=True)
df_train

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... |
| 7608 | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | Police investigating after an e-bike collided ... | 1 |


# **Initial Text Pre-Processing**
**We remove only hashtags (#example), @usernames, and links (URLs such as http:// or https://). Note that the whole hashtag is removed, including its word. Because we are going to use BERT, we keep emoticons, since they can help the model's predictions. The text is pre-processed again later, when it is tokenized for BERT.**

In [12]:
def pre_process(tweet):
    # Remove @usernames and whole #hashtags, then collapse repeated whitespace
    tweet = ' '.join(re.sub(r"(@[A-Za-z0-9_]+)|(#[A-Za-z0-9]+)", " ", tweet).split())
    # Remove URLs (anything of the form scheme://...), then collapse whitespace
    tweet = ' '.join(re.sub(r"(\w+://\S+)", " ", tweet).split())
    return tweet

In [13]:
def pre_process1(tweet):
    # URL-only variant: keeps hashtags and @usernames (not used below)
    tweet = ' '.join(re.sub(r"(\w+://\S+)", " ", tweet).split())
    return tweet
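
To see what this cleaning step does to a tweet, here is a small self-contained sketch. The sample tweet is made up for illustration. It also shows one possible alternative, a hypothetical `strip_hash_symbol` helper (not part of the original notebook) that drops only the `#` character and keeps the hashtag's word, which may preserve useful signal such as "earthquake".

```python
import re

def pre_process(tweet):
    # Same regexes as the notebook: drop @usernames and whole #hashtags, then URLs
    tweet = ' '.join(re.sub(r"(@[A-Za-z0-9_]+)|(#[A-Za-z0-9]+)", " ", tweet).split())
    tweet = ' '.join(re.sub(r"(\w+://\S+)", " ", tweet).split())
    return tweet

def strip_hash_symbol(tweet):
    # Hypothetical alternative: keep the hashtag word, remove only the '#'
    tweet = re.sub(r"@[A-Za-z0-9_]+", " ", tweet)
    tweet = re.sub(r"#(\w+)", r"\1", tweet)
    tweet = re.sub(r"\w+://\S+", " ", tweet)
    return ' '.join(tweet.split())

sample = "@user Forest fire near #LaRonge http://t.co/abc123 stay safe"
print(pre_process(sample))        # Forest fire near stay safe
print(strip_hash_symbol(sample))  # Forest fire near LaRonge stay safe
```

Which variant works better is an empirical question; the notebook's results were obtained with the first one.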

**Handling contractions**: The function below expands contractions (e.g. wouldn't becomes would not).

In [14]:
def fn_contractions(tweet):
    expanded_words = []
    for word in tweet.split():
        expanded_words.append(contractions.fix(word))
    return(' '.join(expanded_words))
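
The word-by-word expansion can be illustrated without the `contractions` package. The sketch below uses a tiny hand-written mapping, `TOY_CONTRACTIONS`, which is an assumption made purely for illustration; the real library ships a much larger table and handles more cases.

```python
# Toy stand-in for the contractions library: a tiny, hand-written mapping
TOY_CONTRACTIONS = {
    "wouldn't": "would not",
    "i'm": "I am",
    "it's": "it is",
    "can't": "cannot",
}

def expand_toy(tweet):
    # Split on whitespace, expand each known contraction, and rejoin
    expanded_words = []
    for word in tweet.split():
        expanded_words.append(TOY_CONTRACTIONS.get(word.lower(), word))
    return ' '.join(expanded_words)

print(expand_toy("I'm sure it wouldn't flood"))  # I am sure it would not flood
```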

In [15]:
df_train['text'] = df_train['text'].apply(pre_process)
df_train

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this May ALLAH For... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive evacuation orders in Cal... | 1 |
| 4 | Just got sent this photo from Ruby as smoke fr... | 1 |
| ... | ... | ... |
| 7608 | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | The out of control wild fires in California ev... | 1 |
| 7610 | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. | 1 |
| 7611 | Police investigating after an e-bike collided ... | 1 |


In [16]:
df_train['text'] = df_train['text'].apply(fn_contractions)
df_train

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this May ALLAH For... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive evacuation orders in Cal... | 1 |
| 4 | Just got sent this photo from Ruby as smoke fr... | 1 |
| ... | ... | ... |
| 7608 | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | The out of control wild fires in California ev... | 1 |
| 7610 | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. | 1 |
| 7611 | Police investigating after an e-bike collided ... | 1 |


In [17]:
df_val['text'] = df_val['text'].apply(pre_process)
df_val['text'] = df_val['text'].apply(fn_contractions)
df_val

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about is different cities, stay safe eve... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
| ... | ... | ... | ... | ... |
| 3258 | 10861 | NaN | NaN | EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE... |
| 3259 | 10865 | NaN | NaN | Storm in RI worse than last hurricane. My city... |
| 3260 | 10868 | NaN | NaN | Green Line derailment in Chicago |
| 3261 | 10874 | NaN | NaN | MEG issues Hazardous Weather Outlook (HWO) |


# Splitting Data into Train and Test Sets

In [18]:
train, test = train_test_split(df_train, test_size=0.2)
X_train = train.text.tolist()
X_test = test.text.tolist()
y_train = train.target.tolist()
y_test = test.target.tolist()
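
The split above is random, so the exact rows in each set change between runs. scikit-learn's `train_test_split` accepts `random_state` for reproducibility and `stratify` to preserve the class ratio. The idea behind a seeded, stratified split can be sketched in plain Python. `stratified_split` is a hypothetical helper written for illustration, not the scikit-learn implementation, and the toy data only mimics the roughly 57/43 class balance seen earlier.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split so that each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are not grouped together
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)

    def pick(idxs):
        return [texts[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(test_idx)

# Toy data: 43 positive and 57 negative examples
texts = [f"tweet {i}" for i in range(100)]
labels = [1] * 43 + [0] * 57
(X_tr, y_tr), (X_te, y_te) = stratified_split(texts, labels)
print(len(X_tr), len(X_te), sum(y_te) / len(y_te))  # 80 20 0.45
```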

In [19]:
X_train

['let us try to do our best to prevent another outbreak of violence by talking to each other both the people and the politics',
 'Training grains of wheat to bare gold in the August heat of their anger I am the no trespass lest you seek danger.',
 'Woman\x89Ûªs GPS app guides rescuers to injured biker in Marin County',
 '12News: UPDATE: A family of 3 has been displaced after fired damaged housed near 90th and Osborn. Fire extinguished no i\x89Û_',
 'One Direction Is my pick for Fan Army x1411',
 '2 great new recipes; mudslide cake and so sorry stew!',
 'Correction: Tent Collapse Story',
 'Brunette teen Giselle Locke teases at home View and download video',
 "The twins pitcher's ego is now WRECKED",
 "Wars doomed to destruction loss money must invest in Iran's inside that should not go outside",
 'Listening to Blowers and Tuffers on the Aussie batting collapse at Trent Bridge reminds me why I love ! Wonderful stuff!',
 'I am blazing rn and there is nothing you can do to stop me',
 'No k

In [20]:
y_train

[1, 0, 1, 1, 0, 0, 1, 0, 0, 1, ...]


In [21]:
print(len(X_train),len(X_test),len(y_train),len(y_test))

6090 1523 6090 1523


# Model building using BERT

We use the bert-base-uncased model, but any other model can be substituted. The maximum tokenized sequence length (maxlen) is set to 512, the largest value BERT supports.
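
The `maxlen` choice can also be guided by the data. The preprocessing output later in this notebook reports a 99th-percentile sequence length of 28, far below 512, so a smaller value would likely train faster. Below is a minimal sketch of picking a length from the data, assuming whitespace word counts as a rough proxy (BERT's WordPiece tokenizer usually produces more tokens than words, hence the margin). `suggest_maxlen` and its `margin` parameter are hypothetical, not part of ktrain.

```python
def suggest_maxlen(texts, percentile=99, margin=8):
    """Return the given percentile of word counts (nearest-rank method) plus a safety margin."""
    lengths = sorted(len(t.split()) for t in texts)
    # Integer ceiling of percentile/100 * n, avoiding floating-point rounding issues
    rank = (percentile * len(lengths) + 99) // 100
    return lengths[max(rank, 1) - 1] + margin

# Toy corpus: 99 short texts and one long outlier
toy_texts = ["a b c"] * 99 + ["word " * 30]
print(suggest_maxlen(toy_texts))                  # 11
print(suggest_maxlen(toy_texts, percentile=100))  # 38
```

The result could then be passed as `maxlen` when constructing the `Transformer` preprocessor, as long as it does not exceed 512.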

In [22]:
model_arch = 'bert-base-uncased'
factors = [0, 1]  # two target classes: 0 = not a disaster, 1 = real disaster
MAXLEN = 512
trans = text.Transformer(model_arch, maxlen=MAXLEN, class_names=factors)


In [23]:
train_data = trans.preprocess_train(X_train,y_train)
test_data = trans.preprocess_test(X_test,y_test)

preprocessing train...
language: en
train sequence lengths:
	mean : 14
	95percentile : 24
	99percentile : 28


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 14
	95percentile : 24
	99percentile : 28


In [24]:
model = trans.get_classifier()


In [25]:
learner = ktrain.get_learner(model, train_data=train_data, val_data=test_data, batch_size=10)

In [26]:
#learner.lr_find(show_plot=True, max_epochs=10) #finding optimal learning rate

In [27]:
learner.fit_onecycle(3e-5, 4)



begin training using onecycle policy with max lr of 3e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f1315941450>

In [28]:
learner.validate(val_data=test_data, class_names=factors)

              precision    recall  f1-score   support

           0       0.83      0.88      0.85       863
           1       0.83      0.76      0.80       660

    accuracy                           0.83      1523
   macro avg       0.83      0.82      0.82      1523
weighted avg       0.83      0.83      0.83      1523



array([[760, 103],
       [156, 504]])
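
The figures in the classification report can be reproduced from the confusion matrix by hand. The sketch below assumes rows are true classes and columns are predicted classes, which is consistent with the row sums (863 and 660) matching the support column. `per_class_metrics` is a helper written for this illustration.

```python
# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = [[760, 103],
      [156, 504]]

def per_class_metrics(cm, k):
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for k in range(len(cm)):
    p, r, f1 = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
# class 0: precision=0.83 recall=0.88 f1=0.85
# class 1: precision=0.83 recall=0.76 f1=0.80
# accuracy=0.83
```

The lower recall for class 1 comes from the 156 disaster tweets that the model labeled as non-disasters.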

In [29]:
predictor = ktrain.get_predictor(learner.model, preproc=trans)

# Prediction

In [30]:
df_val['target'] = predictor.predict(df_val.text.tolist())
df_val

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash | 1 |
| 1 | 2 | NaN | NaN | Heard about is different cities, stay safe eve... | 1 |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... | 1 |
| 3 | 9 | NaN | NaN | Apocalypse lighting. | 0 |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan | 1 |
| ... | ... | ... | ... | ... | ... |
| 3258 | 10861 | NaN | NaN | EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE... | 1 |
| 3259 | 10865 | NaN | NaN | Storm in RI worse than last hurricane. My city... | 1 |
| 3260 | 10868 | NaN | NaN | Green Line derailment in Chicago | 1 |
| 3261 | 10874 | NaN | NaN | MEG issues Hazardous Weather Outlook (HWO) | 1 |


In [31]:
df_val.to_csv('/kaggle/working/test_result_final.csv', index=False)

In [32]:
df_submission = df_val[['id','target']]

In [33]:
df_submission.to_csv('/kaggle/working/submission5.csv', index=False)