**Xing Yi Chan**

**R00183768**

## **Part 2**

Select the most corresponding reason why this statement is against common sense.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

### **Preparing Training Data**

In [2]:
train_data = pd.read_csv('/content/drive/My Drive/NLP/dataset2/traindata/subtaskB_data_all.csv')
train_label = pd.read_csv('/content/drive/My Drive/NLP/dataset2/traindata/subtaskB_answers_all.csv', names=['id', 'ans'])

# extract positive sentences
train_pos_sent = train_data[train_label['ans'] == 'A']['OptionA']
train_pos_sent = train_pos_sent.append(train_data[train_label['ans'] == 'B']['OptionB'])
train_pos_sent = train_pos_sent.append(train_data[train_label['ans'] == 'C']['OptionC'])

# extract negative sentences
train_neg_sent = train_data[train_label['ans'] == 'A']['OptionB']
train_neg_sent = train_neg_sent.append(train_data[train_label['ans'] == 'A']['OptionC'])
train_neg_sent = train_neg_sent.append(train_data[train_label['ans'] == 'B']['OptionA'])
train_neg_sent = train_neg_sent.append(train_data[train_label['ans'] == 'B']['OptionC'])
train_neg_sent = train_neg_sent.append(train_data[train_label['ans'] == 'C']['OptionA'])
train_neg_sent = train_neg_sent.append(train_data[train_label['ans'] == 'C']['OptionB'])

#combining both positive and negative sentences
   # sensible sentences --> 0
   # nonsense sentences --> 1
trainX = train_pos_sent.append(train_neg_sent)
# create labels for the sentences
trainY = pd.Series([0]*10000).append(pd.Series([1]*20000))

# reset index and avoid old index being added mistakenly as a column
trainX = trainX.reset_index(drop=True)
trainY = trainY.reset_index(drop=True)

### **Preparing Testing Data**

In [3]:
test_data = pd.read_csv('/content/drive/My Drive/NLP/dataset2/testdata/subtaskB_trial_data.csv')
test_label = pd.read_csv('/content/drive/My Drive/NLP/dataset2/testdata/subtaskB_answers.csv', names=['id', 'ans'])

# extract positive sentences
test_pos_sent = test_data[test_label['ans'] == 'A']['OptionA']
test_pos_sent = test_pos_sent.append(test_data[test_label['ans'] == 'B']['OptionB'])
test_pos_sent = test_pos_sent.append(test_data[test_label['ans'] == 'C']['OptionC'])

# extract negative sentences
test_neg_sent = test_data[test_label['ans'] == 'A']['OptionB']
test_neg_sent = test_neg_sent.append(test_data[test_label['ans'] == 'A']['OptionC'])
test_neg_sent = test_neg_sent.append(test_data[test_label['ans'] == 'B']['OptionA'])
test_neg_sent = test_neg_sent.append(test_data[test_label['ans'] == 'B']['OptionC'])
test_neg_sent = test_neg_sent.append(test_data[test_label['ans'] == 'C']['OptionA'])
test_neg_sent = test_neg_sent.append(test_data[test_label['ans'] == 'C']['OptionB'])

#combining both positive and negative sentences
   # sensible sentences --> 0
   # nonsense sentences --> 1
testX = test_pos_sent.append(test_neg_sent)
# create labels for the sentences
testY = pd.Series([0]*2021).append(pd.Series([1]*4042))

# reset index and avoid old index being added mistakenly as a column
testX = testX.reset_index(drop=True)
testY = testY.reset_index(drop=True)

### **Remove non-values in both training and testing data**

In [4]:
#remove nan values in train data
for x in np.where(trainX.isnull())[0]:
    trainX = trainX.drop(index=x)
    trainY = trainY.drop(index=x)

# remove nan values in test data
for y in np.where(testX.isnull())[0]:
    testX = testX.drop(index=y)
    testY = testY.drop(index=y)

### **Create tf-idf vectors for both training and testing data**

In [5]:
# transfrom words into integers
vectorizer = TfidfVectorizer(max_features=2000, analyzer='word', ngram_range=(1, 4))
trainX_tfidf = vectorizer.fit_transform(trainX)
testX_tfidf = vectorizer.fit_transform(testX)

### **Performs classification to get accuracy score**

In [6]:
# naive bayes classifier
nb = MultinomialNB()
nb.fit(trainX_tfidf, trainY)
print('Accuracy for naive bayes:', nb.score(testX_tfidf, testY)*100)

# k-nearest neighbour classifier     
knn = KNeighborsClassifier()
knn.fit(trainX_tfidf, trainY)
print('Accuracy for k-nn:', knn.score(testX_tfidf, testY)*100)

# random forest classifier
rf = RandomForestClassifier()
rf.fit(trainX_tfidf, trainY)
print('Accuracy for random forest:', rf.score(testX_tfidf, testY)*100)

# svm classifier
svm = SVC()
svm.fit(trainX_tfidf, trainY)
print('Accuracy svm:', svm.score(testX_tfidf, testY)*100)

Accuracy for naive bayes: 62.70207852193995
Accuracy for k-nn: 64.84658528538436
Accuracy for random forest: 60.722533817222036
Accuracy svm: 65.0940283734741


### **Conclusion**

For classification, 4 different type of classification methods had been used. They include Multinomial Naive Bayes Classifier, k-Nearest Neighbour Classifier, Random Forest Classifier and Support Vector Machine Classifier. After comparing the accuracy of these classifiers, it can be concluded that all these classifiers will give an accuracy score between 50%. 

Among all the 4 classifiers, Support Vector Machine classifier performs best with the accuracy of 65.09%.