**Importing the Datasets**

In [1]:
import pandas as pd
import numpy as np

In [2]:
sample_submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")
test = pd.read_csv("../input/nlp-getting-started/test.csv")
train = pd.read_csv("../input/nlp-getting-started/train.csv")

**A Glimpse of The Data**

In [3]:
train.head()

   id  keyword  location  text                                               target
0   1  NaN      NaN       Our Deeds are the Reason of this #earthquake M...  1
1   4  NaN      NaN       Forest fire near La Ronge Sask. Canada             1
2   5  NaN      NaN       All residents asked to 'shelter in place' are ...  1
3   6  NaN      NaN       13,000 people receive #wildfires evacuation or...  1
4   7  NaN      NaN       Just got sent this photo from Ruby #Alaska as ...  1


In [4]:
test.head()

   id  keyword  location  text
0   0  NaN      NaN       Just happened a terrible car crash
1   2  NaN      NaN       Heard about #earthquake is different cities, s...
2   3  NaN      NaN       there is a forest fire at spot pond, geese are...
3   9  NaN      NaN       Apocalypse lighting. #Spokane #wildfires
4  11  NaN      NaN       Typhoon Soudelor kills 28 in China and Taiwan


Let's see how many disaster (1) and non-disaster (0) tweets the train dataset contains.

In [5]:
train['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
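
The raw counts are easier to compare as proportions. A minimal sketch, rebuilding the counts printed above into a stand-in Series (the name `target_demo` is hypothetical; in this notebook the same call could be made on `train['target']` directly):

```python
import pandas as pd

# Stand-in for train['target'], rebuilt from the counts printed above
target_demo = pd.Series([0] * 4342 + [1] * 3271, name="target")

# normalize=True returns class proportions instead of raw counts
proportions = target_demo.value_counts(normalize=True)
print(proportions.round(3))
```

About 57% of the tweets are non-disaster and 43% are disaster, so the classes are only mildly imbalanced.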

**Splitting the Train Dataset**

To estimate how well the model performs on tweets it has not seen, I split the train dataset randomly with train_test_split: one part to fit the model, the other to evaluate it.

In [6]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [7]:
# Splitting the train dataset into two separate parts (one to fit the model, the other to evaluate it)
X_train, X_test, y_train, y_test = train_test_split(train['text'], train['target'])
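
With its default settings, train_test_split holds out 25% of the rows and shuffles them differently on every run, so the scores further down will vary slightly between runs. As a sketch of one possible variation (an assumption for illustration, not the call used in this notebook), passing `random_state` and `stratify` makes the split repeatable and keeps the class ratio the same in both parts. The toy data and the `*_demo` names are made up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the train dataframe (made-up rows)
demo = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(20)],
    "target": [0] * 12 + [1] * 8,
})

X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    demo["text"], demo["target"],
    test_size=0.25,           # same as the default hold-out fraction
    random_state=42,          # fixed seed so the split is repeatable
    stratify=demo["target"],  # preserve the 0/1 ratio in both parts
)

print(len(X_tr_demo), len(X_val_demo))
print(y_val_demo.value_counts().to_dict())
```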

**Vectorizing the Train Datasets**

I used TfidfVectorizer to learn a vocabulary from the training part of the split and to convert both the training part and the held-out part into TF-IDF matrices.

In [8]:
# Initializing the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()

In [9]:
# Learning the vocabulary from the training part only, then transforming both parts into TF-IDF matrices
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)
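
To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on a made-up corpus (the sentences and the `*_demo` names are illustrative only). The vocabulary is learned from the fitting texts alone, so a word that appears only in later texts is ignored at transform time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts: the first list plays the role of X_train, the second of X_test
fit_texts_demo = ["forest fire near the town", "the fire is spreading"]
new_texts_demo = ["earthquake near the town"]

vect_demo = TfidfVectorizer()
fit_matrix_demo = vect_demo.fit_transform(fit_texts_demo)  # learns vocabulary and IDF weights
new_matrix_demo = vect_demo.transform(new_texts_demo)      # reuses them, learns nothing new

print(sorted(vect_demo.vocabulary_))
print(fit_matrix_demo.shape, new_matrix_demo.shape)
print("earthquake" in vect_demo.vocabulary_)
```

Both matrices have the same number of columns, which is what allows a model fitted on one to predict on the other.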

**Classifying Each Tweet through Machine Learning**

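For the model, I chose the Multinomial Naive Bayes classifier, which is suited to classification with discrete features such as word counts. It is designed for integer counts, but in practice it also works with fractional TF-IDF weights like the ones produced above.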

In [10]:
# Importing Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

In [11]:
# Fitting the model on the vectorized training part
model.fit(X_train_vect, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [12]:
# Predicting the target (0 for non-disaster tweet, 1 for disaster tweet)
y_predict = model.predict(X_test_vect)
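
The vectorize-then-classify steps above can also be bundled into a single scikit-learn pipeline, so that the same fitted vectorizer is always applied before the classifier. The sketch below uses made-up tweets and hypothetical `*_demo` names; it is an alternative arrangement, not the steps run in this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training tweets: 1 = disaster, 0 = non-disaster
texts_demo = [
    "forest fire destroys homes",
    "earthquake hits the city",
    "flood waters rising fast",
    "i love this sunny day",
    "great pizza for dinner",
    "watching a movie tonight",
]
labels_demo = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then classifies it in one call
pipe_demo = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe_demo.fit(texts_demo, labels_demo)

pred_demo = pipe_demo.predict(["huge fire in the forest", "pizza and a movie"])
print(pred_demo)
```

With a pipeline, raw strings go straight into `fit` and `predict`, and there is no separate matrix variable to keep in sync.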

In [13]:
# Importing the metrics used to evaluate the model
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [14]:
# Classification report
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.77      0.93      0.84      1091
           1       0.87      0.63      0.73       813

    accuracy                           0.80      1904
   macro avg       0.82      0.78      0.79      1904
weighted avg       0.81      0.80      0.79      1904



In [15]:
# Confusion matrix
print(confusion_matrix(y_test, y_predict))

[[1014   77]
 [ 303  510]]
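
The figures in the classification report can be recovered from this confusion matrix by hand. In scikit-learn's layout, rows are the true classes and columns are the predicted classes. The sketch below (variable names are illustrative) recomputes precision and recall for the disaster class, plus overall accuracy, from the four counts printed above:

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = np.array([[1014, 77],
               [303, 510]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)     # of the tweets predicted as disaster, how many really are
recall_1 = tp / (tp + fn)        # of the real disaster tweets, how many were found
accuracy = (tp + tn) / cm.sum()  # share of all tweets classified correctly

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 4))
```

The 303 disaster tweets predicted as non-disaster are what pull the recall for class 1 down to 0.63.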


In [16]:
# Accuracy score
print(accuracy_score(y_test, y_predict))

0.8004201680672269


**Inference: Predicting the Target for the Test Dataset**

First, let's see the first five rows of the test dataset.

In [17]:
test.head()

   id  keyword  location  text
0   0  NaN      NaN       Just happened a terrible car crash
1   2  NaN      NaN       Heard about #earthquake is different cities, s...
2   3  NaN      NaN       there is a forest fire at spot pond, geese are...
3   9  NaN      NaN       Apocalypse lighting. #Spokane #wildfires
4  11  NaN      NaN       Typhoon Soudelor kills 28 in China and Taiwan


In [18]:
# Extracting the tweets from the test dataset
text_test = test['text']

In [19]:
# Transforming the tweets into a TF-IDF matrix with the already-fitted vectorizer
text_test_trans = vect.transform(text_test)

In [20]:
# Predicting the target for each test tweet
result = model.predict(text_test_trans)

In [21]:
# Putting the result into the submission's dataframe
sample_submission['target'] = result

Here is the submission dataframe after the predicted target was added to it.

In [22]:
sample_submission.head()

   id  target
0   0  1
1   2  0
2   3  1
3   9  1
4  11  1
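
Assigning `result` directly to `sample_submission['target']` relies on test.csv and sample_submission.csv listing the same ids in the same order. A quick sanity check along the following lines can confirm that assumption before submitting. The two small dataframes are stand-ins built from the ids and targets shown above, and the `*_demo` names are hypothetical:

```python
import pandas as pd

# Stand-ins for the first rows of test and sample_submission
test_demo = pd.DataFrame({"id": [0, 2, 3, 9, 11]})
submission_demo = pd.DataFrame({"id": [0, 2, 3, 9, 11], "target": [1, 0, 1, 1, 1]})

# True only if both id columns hold the same values in the same order
ids_match = test_demo["id"].equals(submission_demo["id"])
print(ids_match)
```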


Finally, write the dataframe to a CSV file and submit it.

In [23]:
sample_submission.to_csv('submission.csv', index=False)