# Natural Language Processing with Disaster Tweets

## Overview

### Description

Welcome to one of our "Getting Started" competitions 👋
This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle

### Competition Description

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

![Screenshot of the example tweet](/Users/amith/Downloads/tweet_screenshot.png)

The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

### Acknowledgments
This dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website.

Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480

## Dataset Description

**What files do I need?**
You'll need train.csv, test.csv and sample_submission.csv.

**What should I expect the data format to be?**
Each sample in the train and test set has the following information:

- The text of a tweet
- A keyword from that tweet (although this may be blank!)
- The location the tweet was sent from (may also be blank)

**What am I predicting?**
You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

### Files

- **train.csv** - the training set
- **test.csv** - the test set
- **sample_submission.csv** - a sample submission file in the correct format

### Columns

- **id** - a unique identifier for each tweet
- **text** - the text of the tweet
- **location** - the location the tweet was sent from (may be blank)
- **keyword** - a particular keyword from the tweet (may be blank)
- **target** - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
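
As a rough sketch of what a prediction file could look like, the snippet below pairs each tweet `id` with a 0/1 `target`. It assumes the submission uses exactly those two columns, which the column descriptions suggest but sample_submission.csv should confirm. The ids and predictions are made up, and `make_submission` is a hypothetical helper, not part of the competition kit.

```python
import pandas as pd

def make_submission(ids, predictions):
    """Pair each tweet id with a 0/1 prediction (hypothetical helper)."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    preds = [int(p) for p in predictions]
    if any(p not in (0, 1) for p in preds):
        raise ValueError("predictions must be 0 or 1")
    # Assumed layout: one row per test tweet, columns "id" and "target"
    return pd.DataFrame({"id": list(ids), "target": preds})

# Made-up ids and predictions, for illustration only
submission = make_submission([0, 2, 3], [1, 0, 1])
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of printing the text would write the file to disk.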

In [240]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import nltk
import warnings
warnings.filterwarnings("ignore")

In [241]:
df_train = pd.read_csv("nlp-getting-started/train.csv")

In [242]:
df_train

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [243]:
df_train.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [244]:
# Keep only the rows that have a keyword
df_train_1 = df_train[df_train["keyword"].notna()].reset_index(drop=True)
df_train_1

Unnamed: 0,id,keyword,location,text,target
0,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
1,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
2,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
3,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
4,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0
...,...,...,...,...,...
7547,10830,wrecked,,@jt_ruff23 @cameronhacker and I wrecked you both,0
7548,10831,wrecked,"Vancouver, Canada",Three days off from work and they've pretty mu...,0
7549,10832,wrecked,London,#FX #forex #trading Cramer: Iger's 3 words tha...,0
7550,10833,wrecked,Lincoln,@engineshed Great atmosphere at the British Li...,0


In [245]:
df_train_1["target"].value_counts()

target
0    4323
1    3229
Name: count, dtype: int64

In [246]:
df_train_1["target"].value_counts()*100/len(df_train_1["target"])

target
0    57.243114
1    42.756886
Name: count, dtype: float64

In [247]:
np.abs(np.diff(df_train_1["target"].value_counts()))

array([1094])

In [248]:
df_train_1.isna().sum()

id             0
keyword        0
location    2472
text           0
target         0
dtype: int64

In [250]:
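# Rows of the majority class (target 0) that have no location; one combined mask avoids index realignment
df_target_no_location = df_train_1[df_train_1["location"].isna() & (df_train_1["target"] == 0)]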
df_target_no_location

Unnamed: 0,id,keyword,location,text,target
10,61,ablaze,,on the outside you're ablaze and alive\nbut yo...,0
12,63,ablaze,,SOOOO PUMPED FOR ABLAZE ???? @southridgelife,0
13,64,ablaze,,I wanted to set Chicago ablaze with my preachi...,0
14,65,ablaze,,I gained 3 followers in the last week. You? Kn...,0
16,67,ablaze,,Building the perfect tracklist to life leave t...,0
...,...,...,...,...,...
7534,10814,wrecked,,Wrecked tired but not gonna be asleep before 3??,0
7537,10818,wrecked,,The Riddler would be the best early-exit prima...,0
7545,10827,wrecked,,He just wrecked all of you http://t.co/y46isyZkC8,0
7547,10830,wrecked,,@jt_ruff23 @cameronhacker and I wrecked you both,0


In [340]:
per_to_remove = np.abs(np.diff(df_train_1["target"].value_counts()))*100/len(df_target_no_location)
per_to_remove

array([76.02501737])
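
The percentage above can be checked by hand from the counts already printed. The sketch below uses only those numbers. The count of target-0 rows without a location is not displayed in the notebook, so it is inferred here from the printed percentage and should be read as an assumption.

```python
# Counts taken from the value_counts output above
n_not_disaster = 4323
n_disaster = 3229

# Surplus of the majority class; matches the array([1094]) output
surplus = n_not_disaster - n_disaster

# Assumption: inferred from the printed 76.02501737, not shown in the notebook
n_no_location = round(surplus * 100 / 76.02501737)

# Share of the location-less target-0 rows to drop to even out the classes
per_to_remove = surplus * 100 / n_no_location
print(surplus, n_no_location, round(per_to_remove, 4))
```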

In [341]:
df_target_no_location["keyword"].value_counts()

keyword
body%20bags              18
detonation               18
blizzard                 17
twister                  17
army                     17
                         ..
buildings%20on%20fire     1
nuclear%20disaster        1
razed                     1
radiation%20emergency     1
suicide%20bombing         1
Name: count, Length: 211, dtype: int64
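
Several keywords in the output above contain `%20`, which is the URL percent-encoding of a space. If readable keywords are wanted later on, a minimal sketch using the standard library's `unquote` could look like this. The sample list is copied from the output above, and decoding is an optional step the notebook itself does not perform.

```python
from urllib.parse import unquote

# Keywords copied from the value_counts output above
keywords = ["body%20bags", "nuclear%20disaster", "razed"]

# Percent-decoding turns %20 back into a space and leaves plain words untouched
decoded = [unquote(k) for k in keywords]
print(decoded)  # ['body bags', 'nuclear disaster', 'razed']
```

On the DataFrame, the same function could be applied with `df_train_1["keyword"].map(unquote)`.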

In [342]:
key_group = df_target_no_location.groupby("keyword")
key_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x29309dd90>

In [343]:
keyword_groups = key_group.groups
keyword_groups

{'ablaze': [10, 12, 13, 14, 16, 29], 'accident': [44, 49, 51, 59], 'aftershock': [79, 81, 85, 91, 92, 99, 100, 101, 104], 'ambulance': [154, 167, 174, 175], 'annihilated': [182, 184, 188, 191, 200, 205, 209], 'annihilation': [213, 214, 215, 219, 230, 231, 236, 237], 'apocalypse': [243, 244, 248, 252, 253, 255, 257, 258, 264, 265, 266, 272], 'armageddon': [276, 277, 279, 281, 287, 296, 301, 306, 310], 'army': [315, 316, 317, 318, 319, 320, 324, 329, 333, 334, 336, 339, 340, 341, 342, 346, 347], 'arson': [349], 'arsonist': [398, 402, 407, 411, 413], 'attack': [433, 446, 448], 'attacked': [466, 476], 'avalanche': [489, 493, 497, 500, 503], 'battle': [518, 522, 523, 527, 533], 'bioterror': [544, 560, 563, 575], 'bioterrorism': [582, 587, 593, 596, 598, 603, 605], 'blaze': [610, 611, 612, 615, 619, 620, 621, 625, 627, 631, 640, 642, 645], 'blazing': [648, 651, 653, 654, 655, 657, 660, 663, 664, 666, 667, 669, 670, 679], 'bleeding': [681, 683, 685, 686, 689, 695, 698, 699, 700, 703, 705, 709

In [344]:
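key_group_1 = dict()
ele_group_1 = np.array([])
# per_to_remove holds a single percentage; turn it into a plain float
pct = float(np.squeeze(per_to_remove))
for i in keyword_groups:
    ele = keyword_groups[i].values
    # Take the first rows of each keyword group, proportional to the group size (rounded down)
    key_group_1[i] = list(ele[0:int(len(ele) * pct / 100)])
    ele_group_1 = np.append(ele_group_1, key_group_1[i])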

In [345]:
ele_group_1

array([  10.,   12.,   13.,   14.,   44.,   49.,   51.,   79.,   81.,
         85.,   91.,   92.,   99.,  154.,  167.,  174.,  182.,  184.,
        188.,  191.,  200.,  213.,  214.,  215.,  219.,  230.,  231.,
        243.,  244.,  248.,  252.,  253.,  255.,  257.,  258.,  264.,
        276.,  277.,  279.,  281.,  287.,  296.,  315.,  316.,  317.,
        318.,  319.,  320.,  324.,  329.,  333.,  334.,  336.,  339.,
        398.,  402.,  407.,  433.,  446.,  466.,  489.,  493.,  497.,
        518.,  522.,  523.,  544.,  560.,  563.,  582.,  587.,  593.,
        596.,  598.,  610.,  611.,  612.,  615.,  619.,  620.,  621.,
        625.,  627.,  648.,  651.,  653.,  654.,  655.,  657.,  660.,
        663.,  664.,  666.,  681.,  683.,  685.,  686.,  689.,  695.,
        698.,  699.,  700.,  716.,  718.,  723.,  724.,  731.,  732.,
        733.,  752.,  760.,  764.,  768.,  781.,  782.,  789.,  791.,
        793.,  795.,  796.,  797.,  799.,  800.,  807.,  808.,  818.,
        819.,  821.,

In [346]:
df_target_location_index = np.setdiff1d(df_train_1.index.values,ele_group_1)

In [356]:
# Keep every row whose index label was not selected for removal
df_train_2 = df_train_1.loc[df_target_location_index]
df_train_2 = df_train_2.reset_index(drop=True)
df_train_2

Unnamed: 0,id,keyword,location,text,target
0,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
1,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
2,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
3,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
4,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0
...,...,...,...,...,...
6555,10830,wrecked,,@jt_ruff23 @cameronhacker and I wrecked you both,0
6556,10831,wrecked,"Vancouver, Canada",Three days off from work and they've pretty mu...,0
6557,10832,wrecked,London,#FX #forex #trading Cramer: Iger's 3 words tha...,0
6558,10833,wrecked,Lincoln,@engineshed Great atmosphere at the British Li...,0


In [357]:
df_train_2["target"].value_counts()*100/len(df_train_2)

target
0    50.777439
1    49.222561
Name: count, dtype: float64
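
The balancing steps above can be condensed into a small sketch on toy data: pick the majority-class rows without a location, work out what share of them to drop, and remove that share from the front of each keyword group. The toy frame and its values are made up for illustration. The logic mirrors the cells above but is not a drop-in replacement for them.

```python
import pandas as pd

# Toy data standing in for df_train_1; values are made up for illustration
toy = pd.DataFrame({
    "keyword":  ["ablaze"] * 4 + ["wrecked"] * 4,
    "location": [None, None, "London", None, None, None, None, "Leeds"],
    "target":   [0, 0, 0, 1, 0, 0, 0, 1],
})

# Candidate rows: majority class (target 0) with no location
candidates = toy[toy["location"].isna() & (toy["target"] == 0)]

# Percentage of candidates to drop so that the two classes would match in size
counts = toy["target"].value_counts()
pct = abs(counts.loc[0] - counts.loc[1]) * 100 / len(candidates)

# From each keyword group, take the first floor(size * pct / 100) row labels
to_drop = []
for _, idx in candidates.groupby("keyword").groups.items():
    to_drop.extend(idx[: int(len(idx) * pct / 100)])

balanced = toy.drop(index=to_drop).reset_index(drop=True)
print(balanced["target"].value_counts().to_dict())
```

Because the per-group count is rounded down, fewer rows are dropped than the class surplus. The same effect shows in the notebook: the frame shrinks from 7552 to 6560 rows (992 dropped rather than 1094), which is why the split ends at roughly 50.8% / 49.2% instead of exactly even. Taking the first rows of each group is also not random; `DataFrame.sample` would be one alternative if random removal is preferred.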

In [360]:
df_train_3 = df_train_2.drop("id",axis=1,inplace=False)
df_train_3

Unnamed: 0,keyword,location,text,target
0,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
1,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
2,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
3,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
4,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0
...,...,...,...,...
6555,wrecked,,@jt_ruff23 @cameronhacker and I wrecked you both,0
6556,wrecked,"Vancouver, Canada",Three days off from work and they've pretty mu...,0
6557,wrecked,London,#FX #forex #trading Cramer: Iger's 3 words tha...,0
6558,wrecked,Lincoln,@engineshed Great atmosphere at the British Li...,0


In [361]:
df_train_4 = df_train_3.fillna("Not Specified")
df_train_4

Unnamed: 0,keyword,location,text,target
0,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
1,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
2,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
3,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
4,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0
...,...,...,...,...
6555,wrecked,Not Specified,@jt_ruff23 @cameronhacker and I wrecked you both,0
6556,wrecked,"Vancouver, Canada",Three days off from work and they've pretty mu...,0
6557,wrecked,London,#FX #forex #trading Cramer: Iger's 3 words tha...,0
6558,wrecked,Lincoln,@engineshed Great atmosphere at the British Li...,0


In [362]:
df_train_4.isna().sum()

keyword     0
location    0
text        0
target      0
dtype: int64

In [363]:
df_train_4["text"]

0       @bbcmtd Wholesale Markets ablaze http://t.co/l...
1       We always try to bring the heavy. #metal #RT h...
2       #AFRICANBAZE: Breaking news:Nigeria flag set a...
3                      Crying out for more! Set me ablaze
4       On plus side LOOK AT THE SKY LAST NIGHT IT WAS...
                              ...                        
6555     @jt_ruff23 @cameronhacker and I wrecked you both
6556    Three days off from work and they've pretty mu...
6557    #FX #forex #trading Cramer: Iger's 3 words tha...
6558    @engineshed Great atmosphere at the British Li...
6559    Cramer: Iger's 3 words that wrecked Disney's s...
Name: text, Length: 6560, dtype: object

In [380]:
df_train_4.loc[0,"text"]

'@bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C'

In [381]:
nltk.regexp_tokenize(df_train_4.loc[0,"text"],r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))")

[('http://t.co/lHYXEOHY6C', '', '', '', '')]
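
The tuple in the output above appears to come from the capturing groups in the pattern: with `findall`-style matching, a pattern containing capturing parentheses yields one tuple of group contents per match instead of a plain string. A minimal sketch with the standard `re` module shows the non-capturing alternative. The pattern here is a deliberately simplified stand-in, not the full URL expression used above, and will miss some URL forms.

```python
import re

# Simplified, illustrative pattern: a scheme (or "www.") followed by non-whitespace.
# The (?:...) group is non-capturing, so findall returns plain strings.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

tweet = "@bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C"

# Extract the URLs, then remove them from the text
urls = URL_PATTERN.findall(tweet)
text_without_urls = URL_PATTERN.sub("", tweet).strip()

print(urls)
print(text_without_urls)
```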

In [375]:
wl = nltk.stem.WordNetLemmatizer()
wl.lemmatize("co")

'co'

In [378]:
nltk.corpus.wordnet.lemmas("t")

[Lemma('thymine.n.01.T'),
 Lemma('deoxythymidine_monophosphate.n.01.T'),
 Lemma('metric_ton.n.01.t'),
 Lemma('t.n.04.T'),
 Lemma('t.n.04.t'),
 Lemma('triiodothyronine.n.01.T'),
 Lemma('thyroxine.n.01.T')]
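
The two probes above look at `co` and `t`, which are presumably the pieces of the `t.co` short-link domain (an assumption about the intent). One way to keep such fragments away from the lemmatizer is to strip URLs before splitting the text into words. The sketch below does that with the standard library only. Dropping @mentions and the simple token pattern are illustrative choices, and `clean_tokens` is a hypothetical helper.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)  # simplified URL pattern
MENTION_RE = re.compile(r"@\w+")                               # @username handles
TOKEN_RE = re.compile(r"[a-z0-9']+")                           # very simple word tokens

def clean_tokens(text):
    """Drop URLs and @mentions, lowercase, and split into simple word tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())

tokens = clean_tokens("@bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C")
print(tokens)
```

The resulting tokens could then be passed one by one to `wl.lemmatize`.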