## Using AutoKeras 1.0.16.post1 to solve [Natural Language Processing with Disaster Tweets Kaggle Competition](https://www.kaggle.com/c/nlp-getting-started)

In [None]:
!pip3 install autokeras==1.0.16.post1 nltk

In [None]:
import autokeras as ak
import tensorflow as tf
import pandas as pd

### Unzip the datasets and load them

In [None]:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

train

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |


### Text preprocessing

* Ignore the "keyword" and "location" columns
* Convert the text to lower case
* Use a regular expression to strip hyperlinks and every character that is not an ASCII letter, digit, space, or tab
* Filter out English stop words (common words)
* Shuffle the rows so the train/validation split is not biased by the original order

In [None]:
train.text = train.text.str.lower()
test.text = test.text.str.lower()

In [None]:
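import re

# Branches, in order: a literal '@[A-Za-z0-9]' sequence (the '\[' escapes the
# bracket, so ordinary @handles are not matched and only their '@' is dropped by
# the next branch); any character that is not an ASCII letter, digit, space, or
# tab; URLs with a scheme; a leading 'rt'; and 'http' plus one character.
rule = r'(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?'
f = lambda t: re.sub(rule, '', t)

# Apply the same cleaning to both splits.
train.text = train.text.apply(f)
test.text = test.text.apply(f)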
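
The shuffled training output further down still contains handle text such as `katiekatcubs`, because the first branch of the pattern never matches an ordinary @handle. The following is a minimal sketch of a variant that removes whole handles, assuming that is the intent. `clean_tweet` and the three compiled patterns are hypothetical names, and this variant was not used to produce the results in this notebook.

```python
import re

# Hypothetical variant of the cleaning rule (an assumption, not the pattern
# that produced the outputs and score shown in this notebook).
URL = re.compile(r'\w+://\S+')          # URLs with a scheme, e.g. http://...
HANDLE = re.compile(r'@\w+')            # whole @handles, including underscores
NON_ALNUM = re.compile(r'[^0-9a-z \t]') # anything but lowercase ASCII letters, digits, space, tab

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, handles, a leading 'rt', and punctuation."""
    text = text.lower()
    text = URL.sub('', text)       # remove URLs before punctuation is stripped
    text = HANDLE.sub('', text)    # remove handles while the '@' is still present
    text = re.sub(r'^rt\s+', '', text.strip())
    text = NON_ALNUM.sub('', text)
    return ' '.join(text.split())  # collapse the whitespace left behind

print(clean_tweet('@KatieKatCubs you already know: http://t.co/abc #Armageddon!'))
```

The order matters: URLs and handles are removed before the punctuation pass, since that pass would otherwise delete the `://` and `@` that identify them.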

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# A set makes the per-word membership test constant-time.
stop = set(stopwords.words('english'))

# Keep only the words that are not in the stop-word set.
f = lambda t: ' '.join(word for word in t.split() if word not in stop)

train.text = train.text.apply(f)
test.text = test.text.apply(f)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
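
Because apostrophes were removed by the regular expression step, contractions such as "dont" no longer match stop words stored as "don't", which is why they survive in the cleaned tweets. The following is a minimal sketch of one way to handle this, assuming the stop words should be normalized the same way as the text. The short `stop` list is a hand-written stand-in for the NLTK list, and `drop_stop_words` is a hypothetical helper.

```python
# Tiny stand-in for NLTK's English stop-word list (assumption: the real list
# stores contractions with their apostrophes, e.g. "don't").
stop = ["i", "you", "don't", "the", "is"]

# Strip apostrophes from the stop words so they match the cleaned tweet text.
stop_set = {word.replace("'", "") for word in stop}

def drop_stop_words(text):
    """Remove every whitespace-separated token found in stop_set."""
    return ' '.join(word for word in text.split() if word not in stop_set)

print(drop_stop_words('i dont like the cold'))
```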


In [None]:
from sklearn.utils import shuffle

train = shuffle(train, random_state=0)
train

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 311 | 454 | armageddon | Wrigley Field | katiekatcubs already know shit goes world seri... | 0 |
| 4970 | 7086 | meltdown | Two Up Two Down | lemairelee danharmon people near meltdown comi... | 0 |
| 527 | 762 | avalanche | Score Team Goals Buying @ | 16 tix calgary flames vs col avalanche preseas... | 0 |
| 6362 | 9094 | suicide%20bomb | Roadside | ever think running choices life rembr theres k... | 0 |
| 800 | 1160 | blight | Laventillemoorings | dotish blight car go right ahead mine | 0 |
| ... | ... | ... | ... | ... | ... |
| 4931 | 7025 | mayhem | Manavadar, Gujarat | real heroes rip brave hearts | 0 |
| 3264 | 4689 | engulfed | USA | car engulfed flames backs traffic parleys summit | 1 |
| 1653 | 2388 | collapsed | Alexandria, Egypt. | great british bake offs back dorrets chocolate... | 1 |
| 2607 | 3742 | destroyed | USA | black eye 9 space battle occurred star o784 in... | 0 |


### Train a BERT model

Training ran on a GTX 1660 Ti. The batch size has to be 4 or lower to avoid running out of memory.

In [None]:
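# Text input -> BERT block -> classification head
input_node = ak.TextInput()
output_node = ak.BertBlock()(input_node)
output_node = ak.ClassificationHead()(output_node)

# Try up to 20 trials and overwrite any previous search results.
clf = ak.AutoModel(
    inputs=input_node, outputs=output_node,
    max_trials=20, overwrite=True)

# Stop a trial early once val_loss has not improved for 5 epochs.
clf.fit(
    train.text.to_numpy(),
    train.target.to_numpy(),
    batch_size=4,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)])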

Trial 12 Complete [00h 14m 06s]
val_loss: 0.6860448718070984

Best val_loss So Far: 0.42517581582069397
Total elapsed time: 04h 12m 10s
INFO:tensorflow:Oracle triggered exit
Epoch 1/2
Epoch 2/2

INFO:tensorflow:Assets written to: .\auto_model\best_model\assets

<tensorflow.python.keras.callbacks.History at 0x1e990ac4d00>

### Predict test labels

In [None]:
predicted = clf.predict(test.text.to_numpy()).flatten().astype('uint8')
predicted

array([1, 1, 1, ..., 1, 1, 1], dtype=uint8)

### Print first 50 test tweets and their predicted labels

In [None]:
labels = ('NOT disaster', 'REAL disaster')

# Walk the first 50 tweets and their predictions in parallel.
for tweet, label in zip(test.text[:50], predicted[:50]):
    print('Test:', tweet)
    print('Predict:', labels[label])
    print('')

Test: happened terrible car crash
Predict: REAL disaster

Test: heard earthquake different cities stay safe everyone
Predict: REAL disaster

Test: forest fire spot pond geese fleeing across street cannot save
Predict: REAL disaster

Test: apocalypse lighting spokane wildfires
Predict: REAL disaster

Test: typhoon soudelor kills 28 china taiwan
Predict: REAL disaster

Test: shakingits earthquake
Predict: REAL disaster

Test: theyd probably still show life arsenal yesterday eh eh
Predict: NOT disaster

Test: hey
Predict: NOT disaster

Test: nice hat
Predict: NOT disaster

Test: fuck
Predict: NOT disaster

Test: dont like cold
Predict: NOT disaster

Test: nooooooooo dont
Predict: NOT disaster

Test: dont tell
Predict: NOT disaster

Test: 
Predict: NOT disaster

Test: awesome
Predict: NOT disaster

Test: birmingham wholesale market ablaze bbc news fire breaks birminghams wholesale market
Predict: REAL disaster

Test: sunkxssedharry wear shorts race ablaze
Predict: NOT disaster

Test: previ

### Generate Kaggle submission

In [None]:
# Assign the array directly so values are matched by position, not by index label.
test['target'] = predicted

submission = test[['id', 'target']]
submission.to_csv('./submission.csv', index=False)
submission

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |
| ... | ... | ... |
| 3258 | 10861 | 1 |
| 3259 | 10865 | 1 |
| 3260 | 10868 | 1 |
| 3261 | 10874 | 1 |


* Kaggle F1 score: **0.82163**
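
For reference, the F1 score is the harmonic mean of precision and recall on the positive ("real disaster") class. The following is a minimal sketch of that computation in plain Python. `f1_score_binary` is a hypothetical helper, and the labels are made-up illustrative values, not data from the competition.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions: precision/recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels only (not taken from the dataset).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f1_score_binary(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.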