<a href="https://colab.research.google.com/github/albinjohn366/Disaster_Message_Classification/blob/Disaster_News_Classification_Vectorization/Disaster_News_SVM_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Disaster Related News Classification
This notebook uses a machine learning model to categorize Twitter news messages as disaster related or not disaster related. A labelled dataset is used to train the model, and a Support Vector Machine (SVM) performs the classification.

In [None]:
import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm, naive_bayes
from sklearn.metrics import accuracy_score

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stopwords = stopwords.words('english')
data = pd.read_csv('/disaster_response_messages_training.csv')

In [277]:
# Dropping rows whose 'related' label is neither 0 nor 1
index = data[~data['related'].isin([0, 1])].index
data.drop(index=index, inplace=True)

In [278]:
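# Function to map the first letter of a POS tag to a WordNet word type
from nltk.corpus import wordnet

lemmatizer_dict = {'V': wordnet.VERB,
                   'J': wordnet.ADJ,
                   'R': wordnet.ADV}
def find_type(string):
  # Anything that is not a verb, adjective or adverb is treated as a noun
  return lemmatizer_dict.get(string, wordnet.NOUN)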
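
As an illustration of the tag mapping above, the sketch below maps a few Penn Treebank tags (the tag set `nltk.pos_tag` returns by default) to WordNet part-of-speech codes by their first letter. To stay runnable without NLTK data, it hard-codes the single-letter values that the WordNet constants stand for (`'v'`, `'a'`, `'r'`, `'n'`); the name `WORDNET_POS` and the sample tags are assumptions made for this example.

```python
# Single-letter codes behind NLTK's WordNet constants
# (VERB='v', ADJ='a', ADV='r', NOUN='n'); hard-coded here for illustration.
WORDNET_POS = {'V': 'v', 'J': 'a', 'R': 'r'}

def find_type(tag_initial):
    # Any tag that is not a verb, adjective or adverb falls back to noun
    return WORDNET_POS.get(tag_initial, 'n')

# Hypothetical sample of Penn Treebank tags
sample_tags = ['VBD', 'JJ', 'RB', 'NNS', 'DT']
mapped = {tag: find_type(tag[0]) for tag in sample_tags}
print(mapped)
```

Only the first letter of the tag is inspected, so every verb tag (`VB`, `VBD`, `VBG`, and so on) shares one WordNet type.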


In [280]:
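# Function to lemmatize a message
def lemattize(message):
  lemmatizer = WordNetLemmatizer()
  # Tag every token with its part of speech so the right word type is used
  tagged = nltk.pos_tag(nltk.word_tokenize(message))
  # Keep only alphabetic tokens that are not stopwords
  return [lemmatizer.lemmatize(token, find_type(typ[0])) for (token, typ) in tagged
    if (token.isalpha() and token not in stopwords)]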

In [281]:
# Function to convert strings to useful format
words = []
def convert_string(messages):
  temp = []
  for message in messages:
    result = lemattize(message)
    words.extend(result)
    temp.append(str(result))
  return temp
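
`convert_string` stores each message as the string form of its token list (for example `"['boat', 'sail']"`). A minimal sketch of why this still works with `TfidfVectorizer`: assuming the vectorizer keeps its default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`), the brackets, quotes and commas are discarded and only words of two or more characters survive. The sample tokens below are made up.

```python
import re

# Default token pattern of scikit-learn's text vectorizers (assumed unchanged)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokens = ['Boat', 'owner', 'find', 'safe', 'anchorage', 'a']
as_string = str(tokens)  # the list rendered as text, brackets and quotes included

# Mimic the default preprocessing: lowercase first, then extract tokens
recovered = TOKEN_PATTERN.findall(as_string.lower())
print(as_string)
print(recovered)
```

Under the same assumption, joining the tokens with `' '.join(result)` would hand the vectorizer the same words.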

In [282]:
# Creating train and test data
x_train, x_test, y_train, y_test = train_test_split(data['message'], data['related'], test_size=0.3)
x_train = convert_string(x_train)
x_test = convert_string(x_test)

In [283]:
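# Using term frequency-inverse document frequency (TF-IDF)
vectorizer = TfidfVectorizer()
# words holds the tokens of both the train and test messages;
# each word is treated as one document when the vocabulary and IDF are fitted
vectorizer.fit(words)
x_train_vc = vectorizer.transform(x_train)
x_test_vc = vectorizer.transform(x_test)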
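
To show what a TF-IDF weight is, here is a plain-Python sketch of the weighting scheme, assuming the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). It is an illustration on three made-up documents, not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized
docs = [
    ['boat', 'sail', 'boat'],
    ['flood', 'boat'],
    ['flood', 'water', 'water'],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}

def tfidf(doc):
    counts = Counter(doc)
    raw = {term: counts[term] * idf[term] for term in counts}
    # L2-normalize so that every document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

A term that occurs in fewer documents gets a larger IDF, while repeating a term inside one message raises its term count; the final weight combines both.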

In [269]:
# Printing the vectorized data
print(x_train[0])
print(x_train_vc[0])

['Based', 'on', 'the', 'current', 'predication', 'local', 'should', 'forbid', 'boat', 'to', 'sail', 'out', 'and', 'should', 'assist', 'boat', 'owner', 'on', 'find', 'safe', 'anchorage']
  (0, 21476)	0.09576920996174329
  (0, 21201)	0.08268331897972223
  (0, 19319)	0.35574019762712417
  (0, 18420)	0.25409920261744434
  (0, 18394)	0.19061792273210237
  (0, 16544)	0.28347557283537433
  (0, 15377)	0.24735098595798832
  (0, 15260)	0.1612415525141724
  (0, 15019)	0.2529563067824314
  (0, 12335)	0.16971801979200532
  (0, 7826)	0.28347557283537433
  (0, 7571)	0.15329951477392242
  (0, 4905)	0.2018887435777399
  (0, 2395)	0.4007334303820894
  (0, 1840)	0.2516033127416185
  (0, 1340)	0.19584902894210332
  (0, 834)	0.09477993091118236
  (0, 832)	0.28347557283537433


In [285]:
# Creating the classifier (gamma has no effect with the linear kernel)
clf = svm.SVC(kernel='linear', gamma='auto', C=1).fit(x_train_vc, y_train)

In [286]:
# Predicting for the test elements and printing the accuracy
predictions = clf.predict(x_test_vc)
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(y_test, predictions))

0.8245530012771393
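
The accuracy printed above is the fraction of test messages whose predicted label equals the true label. A minimal sketch of that computation, using made-up labels and a hypothetical helper named `accuracy`:

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the predicted label equals the true label
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_true = [1, 0, 1, 1, 0]  # made-up true labels
y_pred = [1, 0, 0, 1, 1]  # made-up predictions
print(accuracy(y_true, y_pred))  # 3 of 5 match
```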


### Result from SVM model
It is observed that the linear kernel gives the best results. Its accuracy was 10% higher than that of the poly and rbf kernels.

*Kernel*

The kernel parameter selects the type of hyperplane used to separate the data. ‘linear’ uses a linear hyperplane (a line in the case of 2D data), while ‘rbf’ and ‘poly’ produce a non-linear decision boundary.

*gamma*

gamma is the kernel coefficient for the non-linear kernels and is ignored by the linear kernel. The higher the gamma value, the more closely the model tries to fit the training data set.
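
To make the role of gamma concrete, the sketch below writes out the linear kernel (a dot product) and the RBF kernel, `exp(-gamma * ||x - y||^2)`, in plain Python. It is an illustration with two made-up points, not scikit-learn's implementation. As gamma grows, the similarity between two distinct points drops toward zero, so each training point only influences its close neighbourhood and the boundary can bend tightly around the training data.

```python
import math

def linear_kernel(x, y):
    # Plain dot product: no gamma involved
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma):
    # Similarity decays with squared distance, scaled by gamma
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = [1.0, 0.0]
y = [0.0, 1.0]  # squared distance to x is 2

print('linear:', linear_kernel(x, y))
gammas = (0.1, 1.0, 10.0)
similarities = [rbf_kernel(x, y, g) for g in gammas]
for g, s in zip(gammas, similarities):
    print('rbf, gamma =', g, '->', s)
```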

*C*

C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly. Increasing C may lead to overfitting the training data. In this case, increasing C reduced accuracy because of overfitting, and reducing C also reduced accuracy because more training errors were tolerated. C = 1 worked best.
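
The trade-off can be sketched with the standard soft-margin objective, `0.5 * ||w||^2 + C * sum(hinge losses)`: the first term favours a wide, smooth margin and the second penalises training errors. The plain-Python example below evaluates that objective for a fixed, hand-picked `w` and `b` on three made-up points; it does not train anything and is not how `SVC`'s solver works internally.

```python
def hinge_loss(w, b, x, y):
    # y is +1 or -1; the loss is zero once the point is beyond the margin
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def svm_objective(w, b, points, labels, C):
    margin_term = 0.5 * sum(wi * wi for wi in w)
    error_term = sum(hinge_loss(w, b, x, y) for x, y in zip(points, labels))
    return margin_term + C * error_term

points = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, -1]  # the last point lies on the wrong side of the boundary
w, b = [1.0, 0.0], 0.0  # hand-picked boundary, for illustration only

for C in (0.1, 1.0, 10.0):
    print('C =', C, '-> objective =', svm_objective(w, b, points, labels, C))
```

With a large C the single misclassified point dominates the objective, which pushes an optimiser to bend the boundary around it; with a small C the margin term dominates and the error is tolerated.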

In [284]:
# Trying the same with a Naive Bayes model
naive = naive_bayes.MultinomialNB()
naive.fit(x_train_vc, y_train)
predictions_for_naive = naive.predict(x_test_vc)
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(y_test, predictions_for_naive))

0.7755427841634738


### Result from Naive Bayes Model
The Naive Bayes model took less time to fit the data, but the SVM achieved higher accuracy (about 0.82 versus 0.78 in the runs above).
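
One reason a Naive Bayes fit is fast is that training a multinomial model amounts to a single pass of counting, with no iterative optimisation. The sketch below is a plain-Python illustration on raw word counts with Laplace smoothing (`alpha=1.0`, the same default value `MultinomialNB` uses); the notebook itself passes TF-IDF weights to scikit-learn's `MultinomialNB`. The toy documents and the function names `fit_multinomial_nb` and `predict` are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    # Fitting is a single counting pass over the training tokens
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    vocab = {term for doc in docs for term in doc}
    return class_counts, term_counts, vocab, alpha

def predict(model, doc):
    class_counts, term_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n / total_docs)  # log prior of the class
        for term in doc:
            if term not in vocab:
                continue  # ignore words never seen in training
            p = (term_counts[label][term] + alpha) / (total_terms + alpha * len(vocab))
            score += math.log(p)  # smoothed log likelihood of the term
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training messages: 1 = disaster related, 0 = not related
docs = [['flood', 'water', 'rescue'], ['earthquake', 'rescue'],
        ['football', 'match'], ['movie', 'night', 'match']]
labels = [1, 1, 0, 0]
model = fit_multinomial_nb(docs, labels)
print(predict(model, ['flood', 'rescue']))
print(predict(model, ['football', 'night']))
```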