# **Project Task 1**

*The goal of this project is to predict which tweets are about real disasters and which are not. You have been provided with three data sets (train, test, and sample submission). Use the train and test sets accordingly.
Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).
But it’s not always clear whether a person’s words are actually announcing a disaster.
Take this example: "on Plus Side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE"
The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.
In this task, you’re challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.*

***Importing Modules***

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import re
import string

In [3]:
#Text blob for language processing

import textblob
from textblob import TextBlob
from textblob import Word

In [4]:
#nltk file import for nlp

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [5]:
import warnings
warnings.filterwarnings('ignore')

***Importing Dataset***

In [6]:
#importing train csv file

train = pd.read_csv('../input/project-given/train.csv')


#importing test csv file

test = pd.read_csv('../input/project-given/test.csv')

***Loading Dataset***

In [7]:
#checking train csv

train 

In [8]:
#checking test csv

test

In [9]:
train.head()

In [10]:
test.head()

In [11]:
train.tail()

In [12]:
test.tail()

In [13]:
train.describe()

In [14]:
train.info()

In [15]:
train.shape

***Plotting Histogram***

In [16]:
length_train = train['text'].str.len()
length_test = test['text'].str.len()

plt.hist(length_train, bins=20, label="train")
plt.hist(length_test, bins=20, label="test")
plt.legend()

#plotting graph
plt.show()

***Checking Word count, Character count, etc***

In [17]:
#word count

train['word_count'] = train['text'].apply(lambda x: len(str(x).split(" ")))
train[['text','word_count']].head()

In [18]:
#character count

train['char_count'] = train['text'].str.len()
train[['text','char_count']].head()

In [19]:
#average word

def avg_word(sentence):
  words = sentence.split()
  # Guard against empty or whitespace-only text to avoid dividing by zero
  if not words:
    return 0
  return sum(len(word) for word in words) / len(words)

train['avg_word'] = train['text'].apply(lambda x: avg_word(x))
train[['text','avg_word']].head()

In [20]:
#stopword count

stop = stopwords.words('english')

train['stopwords'] = train['text'].apply(lambda x: len([x for x in x.split() if x in stop]))
train[['text','stopwords']].head()

In [21]:
#hashtag count

train['hastags'] = train['text'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['text','hastags']].head()

In [22]:
#checking count of disaster and non-disaster tweets 

train['target'].value_counts()

***Removing Extra labels***

In [23]:
#dropping extra columns from file

train = train.drop('word_count', axis=1)
train = train.drop('char_count',axis=1)
train = train.drop('avg_word',axis=1)
train = train.drop('hastags', axis=1)
train = train.drop('stopwords',axis=1)

***Preprocessing Dataset***

In [24]:
train.head()

In [25]:
#lowercasing all words in text

train['text'] = train['text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train['text'].head()

In [26]:
#remove special characters from text

train['text'] = train['text'].str.replace(r"[^a-zA-Z#]", " ", regex=True)  # regex=True is required in recent pandas
train['text'].head()

In [27]:
#remove punctuations

train['text'] = train['text'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required in recent pandas
train['text'].head()

In [28]:
#remove stopwords

stop = stopwords.words('english')
train['text'] = train['text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train['text'].head()

In [29]:
train.head()

In [30]:
#common/frequent words

freq = pd.Series(' '.join(train['text']).split()).value_counts()[:10]
freq

In [31]:
#removing common words

freq = list(freq.index)
train['text'] = train['text'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['text'].head()

In [32]:
#rare words

freq = pd.Series(' '.join(train['text']).split()).value_counts()[-10:]
freq

In [33]:
#removing rare words

freq = list(freq.index)
train['text'] = train['text'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['text'].head()

In [34]:
#text/spelling correction

train['text'][:5].apply(lambda x: str(TextBlob(x).correct()))

In [35]:
# removing short words, i.e. words of length 2 or less

train['text'] = train['text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
train.head()

In [36]:
#Removing spaces at the beginning and at the end of the string

train['text'] = train['text'].str.strip()
train.head()
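The cleaning steps above are applied one cell at a time to `train` only. As a minimal sketch, the core steps (lowercasing, keeping letters, dropping stopwords and short words) can be collected into one function so the same cleaning can be reused on any text. The function name `clean_tweet` and the tiny `STOPWORDS` set are assumptions made for illustration; the notebook itself uses NLTK's English stopword list, and the frequent/rare word removal is left out here.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "is", "it", "at", "on", "in", "was", "of", "and", "to"}

def clean_tweet(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase, keep letters only, drop stopwords and words shorter than min_len."""
    text = text.lower()
    text = re.sub(r"[^a-z#]", " ", text)   # replace everything except letters and '#'
    text = re.sub(r"[^\w\s]", "", text)    # drop remaining punctuation, including '#'
    words = [w for w in text.split() if w not in stopwords and len(w) >= min_len]
    return " ".join(words)

print(clean_tweet("LOOK AT THE SKY LAST NIGHT, it was #ABLAZE!!"))
```

With a function like this, the same cleaning could be applied to both data sets, for example with `train['text'].apply(clean_tweet)` and `test['text'].apply(clean_tweet)`.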

***Stemming***

In [37]:
st = PorterStemmer()
train['text'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

***Lemmatization***

In [38]:
train['text'] = train['text'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['text'].head()

***TFIDF***

*Syntax*


*from sklearn.feature_extraction.text import TfidfVectorizer*


*tfidf = TfidfVectorizer(max_features=20000, lowercase=True, analyzer='word',
stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['text'])*


*train_vect*
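To make the TF-IDF weights less of a black box, here is a minimal pure-Python sketch of the calculation on a toy corpus. It assumes whitespace tokenization, raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document, which is my understanding of the `TfidfVectorizer` defaults. The function name `tfidf` and the three example documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

docs = ["forest fire near town", "fire drill school", "love new song"]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "fire" gets a lower weight than "forest" because it also appears in the second document, which is the point of the inverse document frequency term.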

***Bag of Words***

*Syntax*


*from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=20000, lowercase=True, ngram_range=(1,1),analyzer = "word")*


*train_bow = bow.fit_transform(train['text'])*


*train_bow*
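A bag-of-words matrix simply counts how often each vocabulary term occurs in each document. The sketch below shows the idea with `collections.Counter` on two toy documents; the helper name `bag_of_words` and the example texts are assumptions for illustration, and real vectorizers add options such as `max_features` and stopword filtering on top of this.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    vocabulary = sorted({term for doc in docs for term in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocab, matrix = bag_of_words(["fire fire forest", "forest walk"])
print(vocab)
print(matrix)
```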

***Sentiment Analysis***

In [39]:
train['text'][:5].apply(lambda x: TextBlob(x).sentiment)
train['sentiment'] = train['text'].apply(lambda x: TextBlob(x).sentiment[0] )
train[['text','sentiment']].head()

***Wordcloud***

In [40]:
#Complete Wordcloud

all_words = ' '.join([text for text in train['text']]) 

from wordcloud import WordCloud

wordcloud = WordCloud(width=800,height=500,random_state=21,max_font_size=110).generate(all_words)

plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')

plt.show()

In [41]:
#Disaster Wordcloud

disaster_words = ' '.join([text for text in train['text'][train['target'] == 1]])

wordcloud = WordCloud(width=800,height=500,random_state=21,max_font_size = 110).generate(disaster_words)

plt.figure(figsize=(10,7))
plt.imshow(wordcloud,interpolation ='bilinear')
plt.axis('off')

plt.show()

In [42]:
#Non-Disaster Wordcloud

normal_words = ' '.join([text for text in train['text'][train['target'] == 0]])

wordcloud = WordCloud(width=800,height=500,random_state=21,max_font_size = 110).generate(normal_words)

plt.figure(figsize=(10,7))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')

plt.show()

***Checking Column with Null value***

In [43]:
train.isnull().sum()

In [44]:
#Dropping Columns with null value

train = train.drop([ 'keyword', 'location'], axis = 1)

In [45]:
train.head()

In [46]:
#Checking null

train.isnull().sum()

In [47]:
train

***Splitting for Naive Bayes Algorithm***

In [48]:
#Importing Library

from sklearn.model_selection import train_test_split

In [49]:
#Defining X and y

X = train['text']
y = train['target']

X_train, X_value, y_train, y_value = train_test_split(X, y, test_size = 0.25, random_state=42)

X_train.head()

***NLTK's Word tokenization import***

In [50]:
from nltk.tokenize import word_tokenize 

In [51]:
X_token = X_train.apply(word_tokenize)
X_value_token = X_value.apply(word_tokenize)

In [52]:
X_token 

In [53]:
X_value_token

In [54]:
def listToString(tokens):
    # Join the tokens back into one space-separated string
    return " ".join(tokens)

In [55]:
X_token_str = X_token.apply(listToString)
X_value_token_str = X_value_token.apply(listToString)

In [56]:
from sklearn.feature_extraction.text import CountVectorizer 

vec = CountVectorizer(max_df=0.90 ,min_df=2 , max_features=1000,stop_words='english')
X_train = vec.fit_transform(X_token_str).toarray()
X_value_test = vec.transform(X_value_token_str).toarray()

In [57]:
#Printing value of Variables

print("X_train_shape : ",X_train.shape)
print("X_value_test_shape : ",X_value_test.shape)
print("y_train_shape : ",y_train.shape)
print("y_value_shape : ",y_value.shape)

In [58]:
# Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB  

model_naive = MultinomialNB().fit(X_train, y_train) 
predicted_naive = model_naive.predict(X_value_test)
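The vectorize-then-fit steps above can also be bundled into a single scikit-learn pipeline, so that the same vocabulary is applied automatically at prediction time. This is a minimal sketch, not the notebook's method: the toy tweets and labels are invented for illustration, and the vectorizer is left at its defaults rather than the `max_df`, `min_df`, and `max_features` settings used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration; 1 = disaster, 0 = not a disaster.
texts = [
    "forest fire near town",
    "earthquake damage reported downtown",
    "flood evacuation ordered for residents",
    "love this new song",
    "great movie tonight with friends",
    "sky looks beautiful this evening",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Raw strings go in; the pipeline vectorizes them before predicting.
preds = model.predict(["fire evacuation ordered near town", "new song tonight"])
print(preds)
```

One design advantage is that there is no separate `transform` call to forget or to apply with the wrong vectorizer when scoring new text.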

***Printing Confusion Matrix***

***True Positive:***
*Interpretation: You predicted positive and it’s true.*


***True Negative:***
*Interpretation: You predicted negative and it’s true.*


***False Positive: (Type 1 Error)***
*Interpretation: You predicted positive and it’s false.*


***False Negative: (Type 2 Error)***
*Interpretation: You predicted negative and it’s false.*
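The four definitions above can be sketched as a small counting function. The name `confusion_counts` and the example labels are made up for illustration, with 1 treated as the positive (disaster) class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # predicted positive and it is true
        elif pred != positive and truth != positive:
            tn += 1   # predicted negative and it is true
        elif pred == positive:
            fp += 1   # predicted positive and it is false (Type 1 error)
        else:
            fn += 1   # predicted negative and it is false (Type 2 error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```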

In [59]:
from sklearn.metrics import confusion_matrix

plt.figure(dpi=600)
mat = confusion_matrix(y_value, predicted_naive)
sns.heatmap(mat.T, annot=True, fmt='d', cbar=False)

plt.title('Confusion Matrix')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.show()

***Calculating Accuracy***

In [60]:
#Calculating Accuracy with model

from sklearn.metrics import accuracy_score

score_naive = accuracy_score(y_value, predicted_naive)  # argument order is (y_true, y_pred)
print("Accuracy with Naive-bayes model for disaster prediction: ",score_naive)

***Accuracy with Naive-bayes model:*** 

# 0.7773109243697479
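Accuracy is only one summary of the confusion matrix. As a sketch, the function below derives accuracy, precision, recall, and F1 from the four cell counts. The counts passed in are invented for illustration and are not this notebook's results.

```python
def summarize(tp, tn, fp, fn):
    """Derive common metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted disasters, how many were real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of real disasters, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the model above.
metrics = summarize(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall are worth checking here because a missed real disaster (false negative) and a false alarm (false positive) are different kinds of mistakes.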

***Prediction on given test set***

In [61]:
ID = test['id']
test = test['text']

test = test.apply(word_tokenize)
# Join tokens with spaces so the vectorizer still sees separate words
test = test.apply(" ".join)

test_vector = vec.transform(test).toarray()

prediction = model_naive.predict(test_vector)

In [62]:
prediction

***Printing Prediction in Submission csv***

In [63]:
prediction = pd.Series(prediction)
ids = pd.Series(ID)

predict_csv = pd.concat([ids, prediction], keys = ['id', 'target'], axis = 1)

predict_csv.to_csv('submission_deepak.csv',index = False)

***Checking Submission csv***

In [64]:
#importing submission_deepak csv file

submission = pd.read_csv('submission_deepak.csv')

In [65]:
submission

In [66]:
submission.head()
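Beyond eyeballing the file, a few automatic checks can catch a malformed submission. The sketch below uses only the standard library and assumes the format written above: a header of `id,target` and a target of 0 or 1 on every row. The function name `check_submission` and the sample strings are made up for illustration.

```python
import csv
import io

def check_submission(csv_text):
    """Return a list of problems found in a submission; an empty list means it looks fine."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        return [f"unexpected columns: {reader.fieldnames}"]
    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["id"] in seen_ids:
            problems.append(f"line {line_no}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
        if row["target"] not in {"0", "1"}:
            problems.append(f"line {line_no}: target must be 0 or 1, got {row['target']!r}")
    return problems

good = "id,target\n0,1\n2,0\n3,1\n"
bad = "id,target\n0,1\n0,2\n"
print(check_submission(good))
print(check_submission(bad))
```

To run it on the real file, the text could be read with `open('submission_deepak.csv').read()` and passed to the same function.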