# Natural Language Processing with Disaster Tweets
Predict which Tweets are about real disasters and which ones are not.

### Dataset:
**train** - the training set\
**test** - the test set

### Columns:
**id** - a unique identifier for each tweet\
**text** - the text of the tweet\
**location** - the location the tweet was sent from\
**keyword** - a particular keyword from the tweet\
**target** - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

### Objective:
To predict whether a given tweet is about a real disaster (labeled 1) or not (labeled 0).

### List of Contents:
1. Dataset and EDA
2. Text Pre-processing
3. Model Training\
    Vectorization\
    TF-IDF (term frequency-inverse document frequency)\
    Classifier Algorithm (Naive Bayes)\
    Training a model (Using Pipeline)
4. Results

#### 1. Dataset and EDA:

In [1]:
import pandas as pd

In [2]:
train=pd.read_csv('nlp-getting-started/train.csv')
test=pd.read_csv('nlp-getting-started/test.csv')

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [4]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


In [5]:
print(f"real disaster tweets(as 1) and tweets for not an actual disaster(as 0)\n{train['target'].value_counts()}")

real disaster tweets(as 1) and tweets for not an actual disaster(as 0)
0    4342
1    3271
Name: target, dtype: int64
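
The counts above can be turned into class shares to see how balanced the training set is. The snippet below is a small sketch that hard-codes the two counts from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}
total = sum(counts.values())  # 7613 rows in train.csv

# Share of each class as a fraction of all training tweets.
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"target={label}: {share:.1%}")
```

Roughly 57% of the training tweets are labeled 0, so a model that always predicts 0 would already be right about 57% of the time on this set. Any accuracy figure should be read against that baseline.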


In [6]:
train[train.target==1][['id','text']].head() # Sample showing tweet related to disasters

|   | id | text |
|---|----|------|
| 0 | 1 | Our Deeds are the Reason of this #earthquake M... |
| 1 | 4 | Forest fire near La Ronge Sask. Canada |
| 2 | 5 | All residents asked to 'shelter in place' are ... |
| 3 | 6 | 13,000 people receive #wildfires evacuation or... |
| 4 | 7 | Just got sent this photo from Ruby #Alaska as ... |


In [7]:
train[train.target==0][['id','text']].head() # Sample showing tweet not related to disasters

|    | id | text |
|----|----|------|
| 15 | 23 | What's up man? |
| 16 | 24 | I love fruits |
| 17 | 25 | Summer is lovely |
| 18 | 26 | My car is so fast |
| 19 | 28 | What a goooooooaaaaaal!!!!!! |


#### 2. Text Pre-processing:
Before the messages (sequences of characters) can be converted into vectors (sequences of numbers) with the **bag-of-words** approach, each tweet is stripped of punctuation and stop words and split into tokens.
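
As a rough illustration of what bag-of-words means, the sketch below counts words in two made-up messages using only the standard library. The sentences and the helper name `to_vector` are assumptions for the example, not part of the dataset or of scikit-learn.

```python
from collections import Counter

# Two made-up messages (illustrative only, not taken from the dataset).
docs = ["fire near the forest", "forest fire spreads near town"]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]
print(vocab)    # ['fire', 'forest', 'near', 'spreads', 'the', 'town']
print(vectors)  # [[1, 1, 1, 0, 1, 0], [1, 1, 1, 1, 0, 1]]
```

Word order is discarded: each message becomes a fixed-length vector with one count per vocabulary word.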

In [8]:
import string
from nltk.corpus import stopwords

In [9]:
STOPWORDS = set(stopwords.words('english')) # build the stop-word set once instead of reloading the list for every word

def text_process(text):
    nopunc = ''.join(char for char in text if char not in string.punctuation) # drop punctuation characters and rebuild the string
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS] # split on whitespace and remove stop words

In [10]:
train['text'].head(5).apply(text_process) # Tokenization

0    [Deeds, Reason, earthquake, May, ALLAH, Forgiv...
1        [Forest, fire, near, La, Ronge, Sask, Canada]
2    [residents, asked, shelter, place, notified, o...
3    [13000, people, receive, wildfires, evacuation...
4    [got, sent, photo, Ruby, Alaska, smoke, wildfi...
Name: text, dtype: object

#### 3. Model Training:

#### Vectorization
Now we'll convert each message, represented above as a list of tokens, into a numeric vector that machine learning models can work with. Note that the tokens are not lemmatized: `text_process` only removes punctuation and stop words.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

#### TF-IDF (term frequency-inverse document frequency)

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
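
TF-IDF down-weights words that occur in many documents and keeps the weight of words that are specific to a few. The following is a minimal pure-Python sketch, assuming the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is how scikit-learn documents the default behavior of `TfidfTransformer`. The count matrix is made up for illustration.

```python
import math

# Toy count matrix: rows are documents, columns are vocabulary terms.
# (Made-up numbers for illustration.)
counts = [
    [1, 1, 0],
    [1, 0, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm if norm else 0.0 for x in weighted]

tfidf = [tfidf_row(row) for row in counts]
print(idf)
print(tfidf)
```

The first term appears in both documents and gets the lowest idf (exactly 1 here), while the other two terms appear in one document each and are weighted higher.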

#### Classifier Algorithm (Naive Bayes)

In [13]:
from sklearn.naive_bayes import MultinomialNB
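
To give an intuition for what a multinomial Naive Bayes classifier computes, here is a simplified standard-library sketch with Laplace smoothing (`alpha=1.0`). It works on raw word counts from a tiny made-up training set, whereas the notebook feeds TF-IDF weights into `MultinomialNB`; the helper names `log_posterior` and `predict` are hypothetical and this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set: (tokens, label). Illustrative only.
train_docs = [
    (["forest", "fire", "evacuation"], 1),
    (["earthquake", "damage"], 1),
    (["love", "summer"], 0),
    (["fast", "car", "love"], 0),
]

class_counts = Counter(label for _, label in train_docs)
word_counts = defaultdict(Counter)
for tokens, label in train_docs:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train_docs for w in tokens}

def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log P(label | tokens) with Laplace smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train_docs))  # log prior
    for w in tokens:
        if w in vocab:  # words never seen in training are ignored
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
    return score

def predict(tokens):
    """Return the label with the highest log posterior."""
    return max(class_counts, key=lambda label: log_posterior(tokens, label))

print(predict(["fire", "damage"]))  # 1
print(predict(["love", "car"]))     # 0
```

Each class is scored by its prior plus the summed log-probabilities of the words in the message, and the class with the higher score wins.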

#### Training a model (Using Pipeline)

In [14]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors with Naive Bayes classifier
])

In [15]:
pipeline.fit(train['text'],train['target']) # Training the model using train dataset

Pipeline(steps=[('bow',
                 CountVectorizer(analyzer=<function text_process at 0x000001E15348E700>)),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultinomialNB())])
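
Because `test.csv` has no `target` column, the predictions made on it cannot be scored locally. One common way to get an estimate is to hold out part of the labeled data. The sketch below shows the idea on a made-up stand-in corpus; the texts, labels, split size, and choice of F1 as the metric are all assumptions for illustration, and it uses the default `CountVectorizer` analyzer rather than `text_process` so that it runs without NLTK.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up stand-in data; in the notebook this would be
# train['text'] and train['target'].
texts = [
    "forest fire near the town", "earthquake damages many homes",
    "flood forces evacuation of residents", "wildfire smoke covers the city",
    "storm destroys houses on the coast", "bombing kills several people",
    "i love fruits", "summer is lovely", "my car is so fast",
    "what a nice hat", "how are you today", "this movie is great",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out 4 labeled examples, keeping the class ratio (stratify).
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=4, stratify=labels, random_state=0
)

model = Pipeline([
    ('bow', CountVectorizer()),       # default analyzer, not text_process
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
model.fit(X_train, y_train)

# Score the held-out examples, which the model has not seen.
val_pred = model.predict(X_val)
val_f1 = f1_score(y_val, val_pred)
print(f"validation F1: {val_f1:.3f}")
```

With the real data, the same split applied to `train['text']` and `train['target']` would give a score on tweets the model did not see during fitting.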

In [16]:
predictions = pipeline.predict(test['text']) # predicting the output using test dataset

In [17]:
predictions

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

#### 4. Results:
**New Dataframe to check prediction result**

In [18]:
target=pd.DataFrame(predictions,columns=['target']) # new df 'target' created from the model's prediction output

In [19]:
test_prediction=test.join(target) # 'target' & 'test' dataset are joined
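
`DataFrame.join` matches rows by index label, so the join above lines up only because both `test` and `target` carry the default 0..n-1 index. The sketch below, on a made-up frame with hypothetical predictions, shows what happens when the index differs and how `assign` attaches values by position instead.

```python
import pandas as pd

# Made-up frame with a non-default index, e.g. after filtering rows.
df = pd.DataFrame({'id': [10, 11, 12], 'text': ['a', 'b', 'c']}, index=[5, 6, 7])
preds = [1, 0, 1]  # hypothetical predictions, one per row, in order

# join() aligns on index labels: labels 5, 6, 7 have no match in 0, 1, 2.
joined = df.join(pd.DataFrame(preds, columns=['target']))

# assign() with a plain list attaches the values by position.
assigned = df.assign(target=preds)

print(joined['target'].isna().all())  # True
print(assigned['target'].tolist())    # [1, 0, 1]
```

For the notebook's `test` frame either approach gives the same result; `assign` is simply more robust if rows are ever filtered or reordered before predicting.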

In [22]:
test_prediction[test_prediction.target==1][['id','text','target']] # showing tweets with a predicted target value of 1

|      | id | text | target |
|------|----|------|--------|
| 0 | 0 | Just happened a terrible car crash | 1 |
| 1 | 2 | Heard about #earthquake is different cities, s... | 1 |
| 2 | 3 | there is a forest fire at spot pond, geese are... | 1 |
| 3 | 9 | Apocalypse lighting. #Spokane #wildfires | 1 |
| 4 | 11 | Typhoon Soudelor kills 28 in China and Taiwan | 1 |
| ... | ... | ... | ... |
| 3257 | 10858 | The death toll in a #IS-suicide car bombing on... | 1 |
| 3259 | 10865 | Storm in RI worse than last hurricane. My city... | 1 |
| 3260 | 10868 | Green Line derailment in Chicago http://t.co/U... | 1 |
| 3261 | 10874 | MEG issues Hazardous Weather Outlook (HWO) htt... | 1 |


In [23]:
test_prediction[test_prediction.target==0][['id','text','target']] # showing tweets with a predicted target value of 0

|      | id | text | target |
|------|----|------|--------|
| 6 | 21 | They'd probably still show more life than Arse... | 0 |
| 7 | 22 | Hey! How are you? | 0 |
| 8 | 27 | What a nice hat? | 0 |
| 9 | 29 | Fuck off! | 0 |
| 10 | 30 | No I don't like cold! | 0 |
| ... | ... | ... | ... |
| 3249 | 10816 | @thrillhho jsyk I haven't stopped thinking abt... | 0 |
| 3250 | 10820 | @stighefootball Begovic has been garbage. He g... | 0 |
| 3251 | 10828 | Wrecked today got my hattrick ???? | 0 |
| 3256 | 10857 | To conference attendees! The blue line from th... | 0 |
