## Import Packages

In [1]:
import pandas as pd
import gensim
pathToBinVectors = 'GoogleNews-vectors-negative300.bin'

print ("Loading the data file... Please wait...")
model1 = gensim.models.KeyedVectors.load_word2vec_format(pathToBinVectors, binary=True)
print ("Successfully loaded 3.6 G bin file!")

import numpy as np
import math
from scipy.spatial import distance

from random import sample
import sys
from nltk.corpus import stopwords

Loading the data file... Please wait...
Successfully loaded 3.6 G bin file!


## Load Training and Test Data

In [2]:
test_data = pd.read_csv('test_data.csv')
train_data = pd.read_csv('singtel_qna.csv',header=None)

In [3]:
# Load Singtel's FAQ data as Training data, FAQ question as input and FAQ answers as labels
train_data.columns = ['Question','Answer']
train_data.head()

Unnamed: 0,Question,Answer
0,What is Singtel Fibre Broadband?,Singtel Fibre Broadband is Singtel's high spee...
1,What is Fixed Line Number Retention Service?,Fixed Line Number Retention is a number portab...
2,What are the benefits I can enjoy with Singtel...,A. Consistently fast speeds IDA findings show ...
3,I'm interested in Singtel Fibre Broadband serv...,You can visit any Singtel shop check out our w...
4,What service plans are available if I want to ...,Please refer to the available Broadband plans ...


In [4]:
train_data.shape

(194, 2)

In [5]:
# Load similar questions as Testing Data with the tagged FAQ answer as the labels
test_data.head()

Unnamed: 0,Questions,FAQ Question,Answers
0,Can I subscribe for DataMore?,Am I eligible for DataMore?,Yes if you are on Singtel Combo Mobile plans.
1,Can I keep my current ID if I change my servic...,Can I maintain my existing user ID if I am to ...,For Dial-up customers you will be able to reta...
2,Can I use my own equipment instead of the one ...,Can I purchase my own equipment instead of usi...,All necessary equipment to enable your Singte...
3,Can I get a microSIM card with MobileShare Sup...,Can I request for the microSIM card when I sig...,Yes you can.
4,Can I postpone the installation of Fibre?,Can I reschedule my Singtel Fibre installation?,Yes. You can reschedule your Singtel Fibre ins...


In [6]:
test_data.shape

(99, 3)

In [7]:
X_train = train_data['Question']
y_train = train_data['Answer']
X_test = test_data['Questions']
y_test = test_data['Answers']

## Method for Evaluation

In [8]:
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect, X_train, X_test, y_train, y_test, model):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features: ', X_train_dtm.shape[1])
    # fit the supplied classifier (passed in as `model`) and predict
    nb = model
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    
    # Get the training accuracy
    print('Training Accuracy: ', metrics.accuracy_score(y_train, nb.predict(X_train_dtm)))
    # print the accuracy of its predictions
    print('Test Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

# 2 Main Approaches: ML and Distance Similarity

# ML Method
## Model #1 CountVectorizer (Stopwords, Lower)

In [15]:
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
tokenize_test(vectorizer, X_train, X_test, y_train, y_test, MultinomialNB())

Features:  349
Training Accuracy:  0.8041237113402062
Test Accuracy:  0.21212121212121213


## Model #2 TF-IDF (Stopwords, Lower)

In [16]:
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
tokenize_test(vectorizer, X_train, X_test, y_train, y_test, MultinomialNB())

Features:  349
Training Accuracy:  0.13917525773195877
Test Accuracy:  0.010101010101010102


Both training and test accuracy dropped drastically after switching from raw counts to TF-IDF with its default settings. We will attempt tuning.
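One setting worth checking when tuning is `norm`. By default `TfidfVectorizer` scales every row to unit L2 length, so the feature values become small fractions, while `norm=None` keeps the raw tf-idf weights. One possible reason for the drop, offered here as an assumption rather than something this notebook verifies, is that such small values are swamped by the additive smoothing in `MultinomialNB`. A minimal sketch on a made-up two-question corpus (the questions are illustrative, not taken from the FAQ data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical questions, for illustration only)
docs = ["how do I activate fibre broadband",
        "how do I check my broadband balance"]

# Default setting: each row is scaled to unit L2 length
l2_rows = TfidfVectorizer(stop_words='english').fit_transform(docs).toarray()
# norm=None: raw tf-idf weights, no row scaling
raw_rows = TfidfVectorizer(stop_words='english', norm=None).fit_transform(docs).toarray()

l2_norms = np.linalg.norm(l2_rows, axis=1)
raw_norms = np.linalg.norm(raw_rows, axis=1)
print(l2_norms)   # length of each normalised row
print(raw_norms)  # length of each unnormalised row
```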

## Model #3 TF-IDF (Stopwords, Lower, Norm=None)

In [17]:
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, norm=None)
tokenize_test(vectorizer, X_train, X_test, y_train, y_test, MultinomialNB())

Features:  349
Training Accuracy:  0.979381443298969
Test Accuracy:  0.37373737373737376


Compared with CountVectorizer, training accuracy improved by about 17 percentage points (80.4% to 97.9%) and test accuracy by about 16 percentage points (21.2% to 37.4%). Next, we explore binary features with BernoulliNB. Recording only the presence (1) or absence (0) of a word makes sense here because questions are usually short, about one sentence long, so ignoring repetition could help to remove noise.
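To make the difference between counts and binary features concrete, here is a small sketch with a hypothetical question in which one word is repeated. It only illustrates the `binary=True` flag of `CountVectorizer` and is not part of the experiment itself:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical question with a repeated word ("plan")
docs = ["can I change my plan to another plan"]

counts = CountVectorizer().fit(docs)
binary = CountVectorizer(binary=True).fit(docs)

# Look up the column for "plan" in each vectorizer's vocabulary
count_value = counts.transform(docs).toarray()[0, counts.vocabulary_["plan"]]
binary_value = binary.transform(docs).toarray()[0, binary.vocabulary_["plan"]]

print(count_value, binary_value)  # count of "plan" vs. presence flag
```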

## Model #4 CountVectorizer (Stopwords, Lower, Binary) using Bernoulli

In [18]:
vectorizer = CountVectorizer(stop_words='english', lowercase=True, binary=True)
tokenize_test(vectorizer, X_train, X_test, y_train, y_test, BernoulliNB())

Features:  349
Training Accuracy:  0.0979381443298969
Test Accuracy:  0.010101010101010102


Test accuracy falls to 1.0% and training accuracy to 9.8%, so binary features with BernoulliNB are not robust here. Hence, we will focus on using TfidfVectorizer.

## Model #5 TfidfVectorizer (Stopwords, Lower, Norm=None, Uni & Bi-gram)

In [20]:
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True,norm=None,ngram_range = (1,2))
tokenize_test(vectorizer, X_train, X_test, y_train, y_test, MultinomialNB())

Features:  978
Training Accuracy:  0.9845360824742269
Test Accuracy:  0.35353535353535354


Adding bi-grams to the uni-grams improved training accuracy slightly (97.9% to 98.5%) but reduced test accuracy (37.4% to 35.4%). As our main focus is a robust model, we will not include bi-grams.
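The jump in feature count from 349 to 978 comes from every pair of adjacent tokens becoming a feature of its own. A minimal sketch with one hypothetical question shows how `ngram_range=(1, 2)` expands the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["reschedule fibre installation"]  # hypothetical question

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))  # single words only
print(sorted(uni_bi.vocabulary_))    # single words plus adjacent word pairs
```

With so few training questions, most bi-grams occur only once, which may be why they help on the training set but not on paraphrased test questions.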

## Model #6 TfidfVectorizer (Stopwords, Lower, Norm=None, Stemming)

In [21]:
from nltk.stem import PorterStemmer

In [22]:
porter = PorterStemmer()

In [23]:
X_train_lemma = [' '.join(list(map(lambda x: porter.stem(x), sent.split()))) for sent in X_train]
X_test_lemma = [' '.join(list(map(lambda x: porter.stem(x), sent.split()))) for sent in X_test]

In [24]:
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, norm=None)
tokenize_test(vectorizer, X_train_lemma, X_test_lemma, y_train, y_test, MultinomialNB())

Features:  355
Training Accuracy:  0.9845360824742269
Test Accuracy:  0.3838383838383838


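Stemming with the Porter stemmer (the `_lemma` variables hold stemmed text, not lemmas) raises test accuracy slightly, from 37.4% to 38.4%, so it appears to help with robustness. Next, we can experiment with POS tagging to keep only nouns and adjectives.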
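The preprocessing in this section uses NLTK's `PorterStemmer`, which strips suffixes by rule rather than looking words up in a dictionary, so the resulting stems need not be real words. A small sketch with hypothetical example words shows how inflected forms collapse into a single feature:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Inflected forms that should end up sharing one stem (example words only)
words = ["installed", "installing", "plans", "plan"]
stems = [porter.stem(w) for w in words]

print(dict(zip(words, stems)))
```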

## Model #7 TfidfVectorizer (Stopwords, Lower, Norm=None, Stemming, POS Tagging)

In [25]:
import nltk

In [26]:
def posRemoval(sentence):
    tokens = nltk.word_tokenize(sentence)
    pos_tokens = nltk.pos_tag(tokens)
    # Return selected words if is Noun or Adj
    filt_words = list(filter(lambda x: x[1] in ['NN','NNS','NNPS','NNP','JJ','JJR','JJS'], pos_tokens))
    rem_sent = " ".join([word[0] for word in filt_words])
    return rem_sent

In [27]:
X_train_pos = [posRemoval(sent) for sent in X_train_lemma]
X_test_pos = [posRemoval(sent) for sent in X_test_lemma]

In [28]:
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, norm=None)
tokenize_test(vectorizer, X_train_pos, X_test_pos, y_train, y_test, MultinomialNB())

Features:  277
Training Accuracy:  0.9329896907216495
Test Accuracy:  0.2727272727272727


POS tagging lowers both training accuracy (98.5% to 93.3%) and test accuracy (38.4% to 27.3%), so we will not use it. Next, we try a different classifier.
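One way to see why the noun and adjective filter can hurt is that it discards the verbs that distinguish otherwise similar questions. The sketch below applies the same tag filter as `posRemoval` to hand-tagged tokens; the tags are an assumption made for illustration (`nltk.pos_tag` may tag differently), and `keep_nouns_adjs` is a hypothetical helper, not a function from the notebook:

```python
# Penn Treebank tags kept by posRemoval: nouns and adjectives
KEEP_TAGS = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS'}

def keep_nouns_adjs(tagged_tokens):
    """Join the words whose POS tag is a noun or adjective tag."""
    return " ".join(word for word, tag in tagged_tokens if tag in KEEP_TAGS)

# Hand-tagged tokens (assumed tags, for illustration only)
q1 = [('Can', 'MD'), ('I', 'PRP'), ('reschedule', 'VB'), ('my', 'PRP$'),
      ('installation', 'NN'), ('?', '.')]
q2 = [('Can', 'MD'), ('I', 'PRP'), ('cancel', 'VB'), ('my', 'PRP$'),
      ('installation', 'NN'), ('?', '.')]

print(keep_nouns_adjs(q1))
print(keep_nouns_adjs(q2))
```

Both questions reduce to the same text once the verbs are gone, so no classifier can tell them apart afterwards.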

## Model #8 SVC

In [29]:
from sklearn.svm import LinearSVC

In [30]:
# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test_svm(vect, X_train, X_test, y_train, y_test):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features: ', X_train_dtm.shape[1])
    
    # use SVM to predict
    svm = LinearSVC()
    svm.fit(X_train_dtm, y_train)
    y_pred_class = svm.predict(X_test_dtm)
    
    # Get the training accuracy
    print('Training Accuracy: ', metrics.accuracy_score(y_train, svm.predict(X_train_dtm)))
    # print the accuracy of its predictions
    print('Test Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [31]:
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, norm=None)
tokenize_test_svm(vectorizer, X_train_lemma, X_test_lemma, y_train, y_test)

Features:  355
Training Accuracy:  0.9896907216494846
Test Accuracy:  0.30303030303030304


Switching the classifier to LinearSVC does not improve on Multinomial Naive Bayes: test accuracy falls from 38.4% to 30.3%. We therefore conclude that, among the ML models, Multinomial Naive Bayes with stopword removal, lowercasing, `norm=None` and stemming gives the best test accuracy, 38.4%.

# Distance Method
As we are using a different method, our input will be questions similar to the FAQ questions and our label will be the matching FAQ question. Evaluation is done by checking whether the FAQ question returned for each input question is the one tagged as its match.

In [118]:
X = test_data['Questions']
faq_qns = pd.DataFrame({'FAQ Question':train_data['Question']})
y =test_data['FAQ Question']

In [119]:
# Convert sentence into 300-d vector
class PhraseVector:
    def __init__(self, phrase):
        self.vector = self.PhraseToVec(phrase)
        
    def ConvertVectorSetToVecAverageBased(self, vectorSet, ignore = []):
        if len(ignore) == 0: 
            return np.mean(vectorSet, axis = 0)
        else: 
            return np.dot(np.transpose(vectorSet),ignore)/sum(ignore)

    def PhraseToVec(self, phrase):
        cachedStopWords = stopwords.words("english")
        phrase = phrase.lower()
        wordsInPhrase = [word for word in phrase.split() if word not in cachedStopWords]
        vectorSet = []
        for aWord in wordsInPhrase:
            try:
                wordVector=model1[aWord]
                vectorSet.append(wordVector)
            except KeyError:
                # skip words that are not in the word2vec vocabulary
                pass
        return self.ConvertVectorSetToVecAverageBased(vectorSet)
    
    # Cosine Similarity to determine how close two questions are
    def CosineSimilarity(self, otherPhraseVec):
        cosine_similarity = np.dot(self.vector, otherPhraseVec) / (np.linalg.norm(self.vector) * np.linalg.norm(otherPhraseVec))
        try:
            if math.isnan(cosine_similarity):
                cosine_similarity=0
        except:
            cosine_similarity=0
        return cosine_similarity
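
The idea behind `PhraseVector` can be sketched on a toy scale: average the word vectors of a phrase, then compare two phrases by the cosine of the angle between their averages. The 3-d vectors below are invented stand-ins for the 300-d word2vec vectors, and `phrase_vector` and `cosine` are hypothetical helper names. When none of a phrase's words are in the vocabulary, the class above ends up with a score of 0; the sketch returns 0 directly in that case.

```python
import numpy as np

# Hypothetical 3-d "embeddings" standing in for the 300-d word2vec vectors
toy_vectors = {
    "reschedule":   np.array([0.9, 0.1, 0.0]),
    "postpone":     np.array([0.8, 0.2, 0.1]),
    "installation": np.array([0.1, 0.9, 0.2]),
    "voucher":      np.array([0.0, 0.1, 0.9]),
}

def phrase_vector(words):
    """Average the vectors of the words that are in the vocabulary."""
    found = [toy_vectors[w] for w in words if w in toy_vectors]
    return np.mean(found, axis=0) if found else np.zeros(3)

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector has zero length."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

query = phrase_vector(["postpone", "installation"])
close = phrase_vector(["reschedule", "installation"])
far = phrase_vector(["voucher"])
unknown = phrase_vector(["datamore"])  # out-of-vocabulary words only

print(cosine(query, close), cosine(query, far), cosine(query, unknown))
```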

In [120]:
# Get closest FAQ question based on input
def simQuestion(text, faq_qna):
    qn_df = faq_qna.copy()
    text_qn = PhraseVector(text)
    qn_df['scoring'] = [text_qn.CosineSimilarity(PhraseVector(qn).vector) for qn in qn_df['FAQ Question']]
    max_row = qn_df.sort_values(by=['scoring'], ascending=False).iloc[0,:]
    return max_row['FAQ Question']
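
`simQuestion` rebuilds a `PhraseVector` for every FAQ question on every call. A possible alternative, sketched here under the assumption that the FAQ phrase vectors have been computed once and stacked into a matrix, is to score all FAQ questions with a single matrix product. `best_match` and the toy 3-d vectors are hypothetical, not part of the notebook:

```python
import numpy as np

def best_match(query_vec, faq_matrix):
    """Return the row index of the FAQ vector most similar to query_vec."""
    # Normalise rows and the query; guard against zero-length vectors
    row_norms = np.linalg.norm(faq_matrix, axis=1, keepdims=True)
    row_norms[row_norms == 0] = 1.0
    q_norm = np.linalg.norm(query_vec) or 1.0
    scores = (faq_matrix / row_norms) @ (query_vec / q_norm)
    return int(np.argmax(scores))

# Hypothetical 3-d phrase vectors, one row per FAQ question
faq_matrix = np.array([[0.5, 0.5, 0.1],
                       [0.0, 0.1, 0.9],
                       [0.9, 0.0, 0.1]])
query_vec = np.array([0.45, 0.55, 0.15])

print(best_match(query_vec, faq_matrix))  # index of the closest FAQ row
```

The returned index can then be used to look up the FAQ question text, for example with `faq_qns['FAQ Question'].iloc[index]`.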

In [121]:
def evalData(input_qns, actual_match_qns, faq_qns):
    y_pred_class = [simQuestion(qn, faq_qns) for qn in input_qns]
    print('Accuracy: ', metrics.accuracy_score(actual_match_qns, y_pred_class))
    return pd.DataFrame({'Input':input_qns, 'Output':y_pred_class,'Actual':actual_match_qns})

## Model #8 Sentence Similarity (Google 300-dimension Vector)

In [122]:
output_df = evalData(X, y, faq_qns)

Accuracy:  0.41414141414141414


In [123]:
#Look at wrongly classified groups
output_df[output_df['Output']!=output_df['Actual']].head(30)

Unnamed: 0,Input,Output,Actual
0,Can I subscribe for DataMore?,Who can subscribe for MobileShare Supplementar...,Am I eligible for DataMore?
1,Can I keep my current ID if I change my servic...,What should I do if I change my phone?,Can I maintain my existing user ID if I am to ...
4,Can I postpone the installation of Fibre?,When can I request for installation to take pl...,Can I reschedule my Singtel Fibre installation?
5,Can I install the Fibre myself?,How can I change my installation appointment?,Can I self-install my Singtel Fibre Broadband ...
7,Can I call with my Mobileshare line while bein...,Why am I unable to call the WatchPhone?,Can I use my MobileShare line when I travel ov...
8,Can I subscripbe for DataMore in addition to m...,What do I do with my Trackimo Device when I fi...,Can my MobileShare supplementary line sign up ...
9,Can I keep my current ID if I change my servic...,I am currently on the 300M Fibre Entertainment...,Do I need to change out my current SNBB UserID...
10,How to activate my Singtel service?,How can I activate my hi!DataRoam plan?,How can I activate my Singtel IDD services?
11,Can I call when I travel abroad with my prepai...,How do I make an overseas call from Singapore ...,How can I make calls from my prepaid card whil...
12,How can I see my Singtel balance?,How do I know if I have any Singtel Vouchers?,How do I check my outstanding balance for my S...


This model gives an accuracy of 41.4%, which is decent. However, we observe that some of the errors are quite close to the actual labels. We can also try the POS tagging approach to eliminate words that are not nouns or adjectives.

## Model #9 Sentence Similarity (Google 300-dimension Vector + POS Tagging)

In [124]:
def evalDataPos(input_qns, actual_match_qns, faq_qns):
    input_qns_pos = [posRemoval(sent) for sent in input_qns]
    faq_qns_pos = faq_qns.copy()
    faq_qns_pos['FAQ Question'] = [posRemoval(sent) for sent in faq_qns['FAQ Question']]
    y_pred_class = [simQuestion(qn, faq_qns_pos) for qn in input_qns_pos]
    actual_match_qns_pos = [posRemoval(sent) for sent in actual_match_qns]
    print('Accuracy: ', metrics.accuracy_score(actual_match_qns_pos, y_pred_class))


In [125]:
evalDataPos(X, y, faq_qns)

Accuracy:  0.36363636363636365


The lower accuracy (36.4% against 41.4%) is understandable: some questions share similar nouns and adjectives but differ in meaning, so removing the other words results in misclassification in some cases.

## Model Selection
All in all, we chose Model #8, Sentence Similarity (Google 300-dimension Vector), as it provides the best accuracy on test data (41.4%). This is critical for our application, as we strive to handle paraphrasing of FAQ questions.
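
Because the test set has only 99 questions, it can help to read the reported accuracies as counts of correct predictions. The sketch below converts the test accuracies printed in the sections above into counts; the model names in the dictionary are shortened labels chosen here for readability. Note that the ML models are scored against FAQ answers and the sentence-similarity models against FAQ questions, so the two groups are not measured on exactly the same labels.

```python
N_TEST = 99  # rows in test_data, from test_data.shape

# Test accuracies as printed in the sections above
test_accuracy = {
    "CountVectorizer + MultinomialNB": 0.21212121212121213,
    "TF-IDF (norm=None) + MultinomialNB": 0.37373737373737376,
    "TF-IDF (norm=None, Porter-stemmed) + MultinomialNB": 0.3838383838383838,
    "TF-IDF (norm=None, Porter-stemmed) + LinearSVC": 0.30303030303030304,
    "Sentence similarity (word2vec)": 0.41414141414141414,
    "Sentence similarity + POS filter": 0.36363636363636365,
}

# Convert each accuracy back to a whole number of correct predictions
correct = {name: round(acc * N_TEST) for name, acc in test_accuracy.items()}
for name, n in correct.items():
    print(f"{name}: {n}/{N_TEST}")
```

Seen this way, the gap between the best ML model and the sentence-similarity model is three test questions.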