# <a id="top_section"></a>

<div align='center'><font size="5" color="#000000"><b>NLP with Disaster Tweets: starter modelling, data cleaning and explanation <br>(~80% accuracy)</b></font></div>
<hr>
<div align='center'><font size="5" color="#000000">About the problem</font></div>
<hr>

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.<br>
I have two notebooks on this competition: the first uses a basic Naive Bayes model, whereas the second uses a pre-trained BERT model. If you're a beginner, I highly recommend starting with this notebook! After that, if you want to improve your accuracy and read about how to implement this model using BERT, check out the second notebook here: <br><br>
<a class="nav-link active"  style="background-color:; color:Blue"  href="https://lh3.googleusercontent.com/proxy/fbKFMqzpD5rqh-R4wh4bsiACiX4b6PUs2kzMSs61V36aWWxZd8y0I_ZHur3NEOXcLJ83BJKy7tZF4-Wflp9mtGWnaXkc3Cs1MmKWYmAAPOgt4Qudk1qi_hqLoePakMfmTN-A8146oiXMgKg07aQrYWrxM70" role="tab">NLP with disaster tweets!-Data-cleaning and Bert (Explained)</a>

<br>
<a href="https://ibb.co/nm4kTk1"><img src="https://i.ibb.co/54Ccdcj/Aquamarine-and-Orange-Pixel-Games-Collection-You-Tube-Icon.png" alt="Aquamarine-and-Orange-Pixel-Games-Collection-You-Tube-Icon" border="0" height=300 width=300></a>


### Here are the things I will try to cover in this Notebook:

- Basic EDA of the text data.
- Data cleaning (basic)
- Data Cleaning (advanced)
- Transforming text into vectors
- Building our model 

### If you like this kernel feel free to upvote and leave feedback, thanks!

<a id="toc_section"></a>
<div class="list-group" id="list-tab" role="tablist">

<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home"> Table of Contents</h3>

* [Introduction](#top_section)
* [Importing the Required Libraries and Data](#sec1)
* [Exploring the Data](#sec2)
    - [Visualizing given dataset](#sec3)
* [Text Pre-processing](#sec4)
    - [Basic data cleaning](#sec5)
    - [Advanced data cleaning](#sec6)
    - [Using NLP processing](#sec7)
    - [Stemming](#sec8)
    - [Frequent words using WordCloud](#sec9)
* [Transform tokens into vectors](#sec10)
    - [Bag of words](#sec11)
* [Modelling](#sec13)
* [Submission & Some Last Words](#sectionlst)
* [References](#sec14)


<a id="sec1"></a>
## Importing the required libraries and data


Let us start by importing all the required libraries! We will use the basic libraries for working with data (numpy, pandas, etc.), some text-related libraries (re, string, nltk, etc.) and various model libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import re
import string
import nltk
from nltk.corpus import stopwords
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from wordcloud import WordCloud,STOPWORDS
from sklearn.model_selection import train_test_split
from nltk.stem.snowball import SnowballStemmer

Now let's import our datasets, both train and test.

In [None]:
train=pd.read_csv('../input/nlp-getting-started/train.csv')
test=pd.read_csv('../input/nlp-getting-started/test.csv')
dataset=pd.concat([train,test])
print(f'train:{train.shape}\ntest:{test.shape}\ndataset:{dataset.shape}')

<a id="sec2"></a>
## Exploring the data


Let's take a sneak peek at our dataset! ;)

<img src='https://media1.tenor.com/images/41597f32f2989333d14515fb1b7a9b4f/tenor.gif?itemid=13480143'>

In [None]:
train.head()

In [None]:
test.head()

Let's see how much of our data is missing!

In [None]:
(train.isnull().sum()[train.isnull().sum()>0]/len(train))*100

In [None]:
pd.DataFrame({'Test Data Missing':(test.isnull().mean()*100).sort_values(ascending=False)})
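
The two cells above compute the same quantity in two different ways. As a minimal sketch, the check could be wrapped in one small helper and reused for both frames. The `missing_percent` name and the toy frame below are my own illustrative assumptions, not part of the competition data:

```python
import pandas as pd

def missing_percent(df):
    """Return the percentage of missing values per column, highest first."""
    pct = df.isnull().mean() * 100
    # Keep only the columns that actually have something missing.
    return pct[pct > 0].sort_values(ascending=False)

# Toy stand-in for train/test (hypothetical values).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": ["fire", None, "flood", "quake"],
    "location": [None, None, "Tokyo", None],
    "text": ["a", "b", "c", "d"],
})

report = missing_percent(toy)
print(report)
```

On the real data you would call `missing_percent(train)` and `missing_percent(test)`.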

We will deal with the missing data a bit later. But first let's look at some examples of disaster and non-disaster tweets!

In [None]:
non_dis = train[train.target==0]['text']
non_dis.values[7]

In [None]:
dis=train[train.target==1]['text']
dis.values[7]

Let's see how many disaster and non-disaster tweets there actually are in our data!

In [None]:
train.target.value_counts()

<a id="sec3"></a>
## Visualizing the data!


Now that we have seen what our data looks like, how much of it is missing, and some counts, let's visualize it so that we can explore it further and make better use of it!

First let's see the count of disaster and non-disaster tweets!

In [None]:
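plt.figure(figsize=(6,6))
target_counts = train.target.value_counts()
# Pass x and y by keyword: recent seaborn versions no longer accept them positionally.
sns.barplot(x=target_counts.index, y=target_counts.values)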

Now let's see how many of the keywords are actually unique! We will use the nunique function of pandas for this!

In [None]:
train.keyword.nunique()

Now let's see the 15 most used keywords! Maybe we can get some insights from this!

In [None]:
plt.figure(figsize=(12,12))
sns.barplot(y=train.keyword.value_counts().index[:15],x=train.keyword.value_counts()[:15])

So some highly used keywords are fatalities, sinking, harm, damage, etc., which can actually be very helpful in deciding whether a given tweet is disaster-related or not!
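
One way to check how helpful a keyword really is: look at the share of its tweets that are labelled as real disasters. Here is a minimal sketch on a made-up frame (the rows and the `keyword_stats` name are assumptions for illustration; the real `train` frame has the same two columns):

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical rows).
toy = pd.DataFrame({
    "keyword": ["wreckage", "wreckage", "harm", "harm", "harm", "sinking"],
    "target":  [1, 1, 0, 1, 0, 0],
})

# For each keyword: how many tweets use it, and what share are real disasters.
keyword_stats = (
    toy.groupby("keyword")["target"]
       .agg(tweets="count", disaster_rate="mean")
       .sort_values("disaster_rate", ascending=False)
)
print(keyword_stats)
```

A keyword with a rate close to 0 or 1 separates the classes well, while one near 0.5 tells the model very little on its own.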

Now let's see how many unique locations the tweets in our dataset were sent from!

In [None]:
print(train.location.nunique())

Let's see the top 15 locations where the most tweets come from!

In [None]:
plt.figure(figsize=(12,12))
sns.barplot(y=train.location.value_counts().index[:15],x=train.location.value_counts()[:15])

Well, which places were the fewest tweets sent from? Let's find out!

In [None]:
plt.figure(figsize=(12,12))
sns.barplot(y=train.location.value_counts().index[-10:],x=train.location.value_counts()[-10:])

So we have seen that some locations have very high tweeting activity whereas others have very little, that a lot of keywords are heavily used, and that many of them hint strongly at the nature of the tweet (i.e. disaster or non-disaster).

<a id="sec4"></a>
## Text Pre-processing

Now comes one of the most important parts of any Natural Language Processing problem! Let's clean our data!

<img src='https://media.tenor.com/images/0bf00f08e5e5cce9bb1ec5899cbc046b/tenor.gif'>

<a id="sec5"></a>
### Basic Data cleaning

We will start by cleaning basic text noise such as URLs, email IDs, punctuation, etc.

All the functions are below and quite basic!

In [None]:
def lowercase_text(text):
    return text.lower()

train.text=train.text.apply(lambda x: lowercase_text(x))
test.text=test.text.apply(lambda x: lowercase_text(x))

In [None]:
train.text.head(5)

In [None]:
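def remove_noise(text):
    # Raw strings keep regex escapes such as \[ and \S from being read as string escapes.
    text = re.sub(r'\[.*?\]', '', text)                                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                  # URLs
    text = re.sub(r'<.*?>+', '', text)                                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)    # punctuation
    text = re.sub(r'\n', '', text)                                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)                               # words containing digits
    return text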
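
To see what each substitution does, here is a small sketch that runs a copy of the function on a made-up, already lowercased tweet. The sample text is an assumption for illustration only:

```python
import re
import string

def remove_noise(text):
    text = re.sub(r'\[.*?\]', '', text)                               # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # URLs
    text = re.sub(r'<.*?>+', '', text)                                # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # punctuation
    text = re.sub(r'\n', '', text)                                    # newlines
    text = re.sub(r'\w*\d\w*', '', text)                              # words containing digits
    return text

sample = "fire near [la] https://t.co/abc123 <b>now</b>! call 911 asap #help"
cleaned = remove_noise(sample)

# The substitutions leave runs of spaces behind, so collapse them for display.
print(" ".join(cleaned.split()))
```

Two things are worth noticing: any word that contains a digit disappears completely (so "911" is gone), and newlines are replaced with nothing, which can glue together two words that were only separated by a line break.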

In [None]:
train.text=train.text.apply(lambda x: remove_noise(x))
test.text=test.text.apply(lambda x: remove_noise(x))

In [None]:
train.text.head(5)

<a id="sec7"></a>
### Using NLP processing

Now we will use the nlppreprocess package to process our data! This actually gave me better results, so let's use it!

In [None]:
!pip install nlppreprocess
from nlppreprocess import NLP

nlp = NLP()

train['text'] = train['text'].apply(nlp.process)
test['text'] = test['text'].apply(nlp.process)  

In [None]:
train.text.sample(10)

<a id="sec8"></a>
### Stemming

Now we have to stem our text. We will be using SnowballStemmer, as it is quite good for the job! So let's get to the code!

In [None]:
stemmer = SnowballStemmer("english")

def stemming(text):
    text = [stemmer.stem(word) for word in text.split()]
    return ' '.join(text)

train['text'] = train['text'].apply(stemming)
test['text'] = test['text'].apply(stemming)
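
If you have not seen stemming before, here is a quick sketch of what the stemmer does to a few individual words. The sample sentence is made up for illustration:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stemming(text):
    # Stem every whitespace-separated word and glue the sentence back together.
    return " ".join(stemmer.stem(word) for word in text.split())

for word in ["fires", "running", "flooded"]:
    print(word, "->", stemmer.stem(word))

print(stemming("fires are spreading and buildings flooded"))
```

Different forms of the same word collapse to a single token, which keeps the vocabulary smaller. Keep in mind that stems are not always dictionary words.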

<a id="sec9"></a>
### Frequent words using wordcloud

This is just a fun part. I loved this thing I found in one of the notebooks, so I added it to mine! <br>
This is a wordcloud of the frequent words in our text and it's actually quite cool to look at!

In [None]:
from wordcloud import WordCloud
fig , ax1 = plt.subplots(1,figsize=(12,12))
wordcloud=WordCloud(background_color='white',width=600,height=600).generate(" ".join(train.text))
ax1.imshow(wordcloud)
ax1.axis('off')
ax1.set_title('Frequent Words',fontsize=24)
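
A wordcloud is nice to look at, but if you want the actual numbers behind it, a plain word count does the job. A minimal sketch using only the standard library, on made-up texts (with the real data you would pass `train.text` instead of `toy_texts`):

```python
from collections import Counter

# Toy stand-in for the cleaned tweets (hypothetical).
toy_texts = ["fire forest fire", "flood street", "fire truck street"]

# Join everything into one long string, split on whitespace and count the words.
word_counts = Counter(" ".join(toy_texts).split())
top_words = word_counts.most_common(2)
print(top_words)
```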

<img src='https://i.gifer.com/EP97.gif'>

<a id="sec10"></a>
## Transform tokens into vectors

Up until now, we have done all the processing on the text, but you and I both know that our system cannot really read any language (English in this case), so how do we train it on this data?

Simple: we will convert the text data into numerical vectors! ;) <br>
For this we can use two approaches, the first being Bag-of-Words and the second being TF-IDF.<br>
For this model I will be using bag of words!

<a id="sec11"></a>
### Using Bag of words

So let's create our bag of words then! If you do not know about bag of words, you can read about it here:
[BAG OF WORDS](https://machinelearningmastery.com/gentle-introduction-bag-words-model/#:~:text=A%20bag%2Dof%2Dwords%20is,the%20presence%20of%20known%20words.)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer=CountVectorizer(analyzer='word',binary=True)
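
# fit_transform learns the vocabulary from the training text and encodes it in one step;
# the test text is then encoded with that same vocabulary.
train_vec = count_vectorizer.fit_transform(train.text)
test_vec = count_vectorizer.transform(test.text)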

print(train_vec[7].todense())
print(test_vec[7].todense())
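
A dense row is mostly zeros and hard to read. As a sketch, `inverse_transform` can turn a row back into the words it contains, shown here on a made-up two-tweet corpus (the corpus and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["forest fire near town", "love this sunny town"]

vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(toy_corpus)

# One row per tweet, one column per word in the learned vocabulary.
print(matrix.shape)

# inverse_transform returns, for each row, the words whose entries are non-zero.
tokens_in_first_row = sorted(str(t) for t in vectorizer.inverse_transform(matrix[0])[0])
print(tokens_in_first_row)
```

With the notebook's own objects, the equivalent call would be `count_vectorizer.inverse_transform(train_vec[7])`.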

<a id="sec13"></a>
## Modelling

Now we have pre-processed our data and converted it so that our machine can actually process and use it! So here comes the final step: let's get our model ready!

First we will store the target data in the variable y!

In [None]:
y=train.target

We will use a multinomial Naive Bayes model for this notebook! You can go ahead and choose your own model, or play with this model's parameters to increase its accuracy. For me this gave an accuracy of around 79.6%. Note that the cross-validation cell below reports the F1 score, since `scoring='f1'` is set.

In [None]:
from sklearn import model_selection
model =MultinomialNB(alpha=1)
scores= model_selection.cross_val_score(model,train_vec,y,cv=6,scoring='f1')
scores
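
Since accuracy and F1 are easy to mix up, here is a hedged sketch that reports both from the same cross-validation run. It uses a made-up mini corpus and puts the vectorizer and the model in one pipeline, so the vocabulary is learned only from the training part of each fold. All texts, labels and names below are assumptions for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up tweets: 1 = real disaster, 0 = not a disaster.
toy_texts = [
    "forest fire spreading near town", "flood water rising in street",
    "earthquake damaged buildings downtown", "storm destroyed homes on coast",
    "wildfire evacuation order issued", "train crash injured many people",
    "love this sunny day", "my phone battery is on fire lol",
    "new song is the bomb", "traffic was a disaster today haha",
    "this pizza is killing me", "flooding my inbox with memes",
]
toy_labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

pipeline = make_pipeline(CountVectorizer(binary=True), MultinomialNB(alpha=1))
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = cross_validate(pipeline, toy_texts, toy_labels,
                         cv=folds, scoring=["accuracy", "f1"])
print("accuracy per fold:", results["test_accuracy"])
print("f1 per fold:      ", results["test_f1"])
```

The numbers on twelve toy tweets mean nothing by themselves. The point is only that one call can return both metrics side by side.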

Now let's train our model!

In [None]:
model.fit(train_vec,y)
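
If you want to play with the model's parameters, the smoothing value `alpha` is the obvious one to try. Here is a minimal sketch of a grid search over a few candidate values, again on a made-up corpus. The candidate list and the toy data are assumptions, not tuned recommendations:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Made-up tweets: 1 = real disaster, 0 = not a disaster.
toy_texts = [
    "fire burning near homes", "flood hits the city", "quake shakes the coast",
    "storm floods the city", "fire crews fight blaze", "bridge collapse in city",
    "great day at the beach", "this mixtape is fire", "love my new shoes",
    "so hungry right now", "the city looks great", "my team got destroyed lol",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Like the notebook, vectorize first and then cross-validate on the matrix.
toy_vec = CountVectorizer(binary=True).fit_transform(toy_texts)

alphas = [0.1, 0.5, 1.0, 2.0]
search = GridSearchCV(MultinomialNB(), param_grid={"alpha": alphas},
                      scoring="f1", cv=3)
search.fit(toy_vec, toy_labels)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data you would fit the search on `train_vec` and `y`, then use `search.best_estimator_` to predict.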

<a id="sectionlst"></a>
#  Submission

<a href="#toc_section" class="btn btn-primary" style="color:white;" >Back to Table of Contents</a>

Now we will use the sample_submission csv file as a reference and fill the target column with our predictions!

In [None]:
sample_submission=pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

Let's fill the target column!

In [None]:
sample_submission.target= model.predict(test_vec)

Mind taking a sneak peek? :P

In [None]:
sample_submission.head()

Finally, let's save our predictions to a .csv file and submit it!

In [None]:
sample_submission.to_csv('submission.csv',index=False)
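
Before uploading, it can save a failed submission to sanity-check the frame. Here is a small sketch, assuming the submission needs exactly an `id` and a `target` column with 0/1 values. The `check_submission` helper and the toy frames are hypothetical:

```python
import pandas as pd

def check_submission(submission, test_ids):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(submission.columns) == ["id", "target"], "expected columns: id, target"
    assert len(submission) == len(test_ids), "expected one row per test tweet"
    assert submission["id"].tolist() == list(test_ids), "ids must match the test set order"
    assert set(submission["target"].unique()) <= {0, 1}, "targets must be 0 or 1"
    return True

# Toy stand-ins for test.id and the filled-in sample_submission (hypothetical values).
toy_ids = [0, 2, 3, 9]
toy_submission = pd.DataFrame({"id": toy_ids, "target": [1, 0, 0, 1]})

print(check_submission(toy_submission, toy_ids))
```

With the notebook's objects the call would be `check_submission(sample_submission, test.id)`.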

Now, do you want to increase your accuracy? Do you want to know how to get to 84-85% accuracy? Do you want to know how BERT can help attain that accuracy? Do you want to know if it is possible to get to 100% accuracy? If yes, then check out my other notebook on the same problem here:
<a class="nav-link active"  style="background-color:; color:Blue"  href="https://lh3.googleusercontent.com/proxy/fbKFMqzpD5rqh-R4wh4bsiACiX4b6PUs2kzMSs61V36aWWxZd8y0I_ZHur3NEOXcLJ83BJKy7tZF4-Wflp9mtGWnaXkc3Cs1MmKWYmAAPOgt4Qudk1qi_hqLoePakMfmTN-A8146oiXMgKg07aQrYWrxM70" role="tab">NLP with disaster tweets!-Data-cleaning and Bert (Explained)</a>

<a id="sec14"></a>
#  References

- [Basic EDA,Cleaning and GloVe](https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove)
- [NLP with Disaster Tweets - EDA, Cleaning and BERT](https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert)
- [Disaster NLP: Keras BERT using TFHub](https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub)


<img src='https://i.pinimg.com/originals/2f/08/84/2f088410e696203853ecf91a3fbcd0f4.gif'>

# Some last words:

Thank you for reading! I'm still a beginner and want to improve myself in every way I can. So if you have any ideas or feedback, please let me know in the comments section!


<div align='center'><font size="3" color="#000000"><b>And again please star if you liked this notebook so it can reach more people, Thanks!</b></font></div>

<img src="https://media1.giphy.com/media/j2ersR5s9rDnUpMDBI/giphy.gif" alt="Thank you!" width="500" height="600">