# A Forum-based Chatbot for Parents of Autistic Children

**[Khalil Mrini](https://goo.gl/7MCYvq)**

# Table of Contents

### [1. Scraping the Online Forum](#1)
[1.1. Functions](#11) <br/>
[1.2. Collected Data](#12) <br/>
[1.2.1. Parents' Discussion](#121) <br/>
[1.2.2. General Autism Discussion](#122) <br/>
[1.3. Filtering Data](#13) <br/>

### [2. Amazon Mechanical Turk](#2)
[2.1. Sampling Threads](#21) <br/>
[2.2. Using sent2vec to Cluster Samples](#22) <br/>
[2.3. Checking Amazon Mechanical Turk Results](#23) <br/>
[2.4. Using NLTK to Separate Text Data into Sentences](#24) <br/>
[2.4.1. For the Samples](#241) <br/>
[2.4.2. For All Collected Data](#242) <br/>

### [3. Machine Learning Features](#3)
[3.1. MTurk DataFrame](#31) <br/>
[3.2. Adding Reply Vectors](#32) <br/>
[3.3. Adding First Post Vectors](#33) <br/>
[3.4. Adding Title Vectors](#34) <br/>
[3.5. Computing sent2vec Similarity](#35) <br/>
[3.6. Adding Features from Forum Threads](#36) <br/>
[3.7. Standardization of Features](#37) <br/>
[3.8. Train and Test Data Sets](#38) <br/>
[3.9. PCA Visualization](#39) <br/>

### [4. Ensemble Classifiers for Multi-Class Labels](#4)
[4.1. Random Forest Classifier](#41) <br/>
[4.2. Bagging Classifier](#42) <br/>
[4.3. Ada Boost Classifier](#43) <br/>
[4.4. Extra Trees Classifier](#44) <br/>
[4.5. Gradient Boosting Classifier](#45) <br/>
[4.6. SVM](#46) <br/>
[4.7. Decision Tree Classifier](#47) <br/>
[4.8. K-Neighbours Classifier](#48) <br/>
[4.9. Radius Neighbours Classifier](#49) <br/>

### [5. Building the Chatbot](#5)
[5.1. Applying the Model to All Unlabelled Replies](#51) <br/>
[5.2. Precomputing word2vec](#52) <br/>
[5.3. Dot Product Similarity Functions](#53) <br/>
[5.4. Chatbot Functions](#54) <br/>

# 1. Scraping the Online Forum <a class="anchor" id="1"></a>

In this section, we scrape the online forum [Wrong Planet](https://wrongplanet.net/forums/) for posts in the *General Autism Discussion* and the *Parents' Discussion*.

In [1]:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

In [2]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import nltk
import gensim

## 1.1. Functions <a class="anchor" id="11"></a>

In [3]:
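def find_next(soup, index):
    """
    Returns the relative URL of the next page, or an empty string on the last page.
    """
    try:
        return soup.find('a', class_="guiArw sprite-pageNext")['href'][1:]
    except (TypeError, KeyError):
        return ""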
            

def find_all_rows(soup, type_name, class_name):
    try:
        #print(soup.prettify())
        return soup.find_all(type_name, {'class': class_name})
    except:
        return []

def convert_to_int(text):
    return int(''.join([char for char in text if char.isdigit()]))

def transform_datetime(datetime):
    """
    Removes mentions of Yesterday and Today.
    """
    try:
        return datetime.replace('Yesterday', '24 Oct 2017').replace('Today', '25 Oct 2017')
    except:
        return datetime

def collect_forum_data(soup_row, link=""): 
    data = {}
    try:
        title_soup = soup_row.find('b').find('a')
        data['Title'] = title_soup.text.strip()
        data['Link'] = title_soup['href'][1:]
        info_soup = soup_row.find('td', {'class': 'reply rowentry '})
        data['Replies'] = convert_to_int(info_soup.text.strip())     
    except:
        pass
    return data

def collect_recursively(data, url, type_name, class_name, data_function, index=""):
    try:
        if index:
            print(index, end='\r', flush=True)      
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        soup_rows = find_all_rows(soup, type_name, class_name)
        data.extend([data_function(soup_row, link=url) for soup_row in soup_rows])
        next_url = find_next(soup, index)
        if next_url:
            if index:
                return collect_recursively(data, PREFIX_URL + next_url, type_name, class_name, data_function, index+1)
            else:
                return collect_recursively(data, PREFIX_URL + next_url, type_name, class_name, data_function)
        else:
            return data
    except:
        return data

def get_forum_dataframe(url, type_name, class_name, data_function):
    data = collect_recursively([], url, type_name, class_name, data_function, 1)
    #return pd.DataFrame(data[1:])
    return pd.DataFrame(data).dropna()

def collect_post_data(soup_row, link=""):
    """
    Extracts the link, username, timestamp, message and user level of one post.
    """
    data = {}
    try:
        data['Link'] = link.replace(PREFIX_URL, '')
        # Get the username
        data['Username'] = soup_row.find('div', class_='username').find('a').text.strip()
        # Get the timestamp
        data['Timestamp'] = transform_datetime(soup_row.find('div', class_='postDate').text.strip())
        # Get the message by concatenating all paragraphs of the post
        data['Message'] = ''.join(msg.text.strip() for msg in soup_row.find_all('p'))
    except Exception:
        pass

    try:
        # Get the user level, derived from the digits in the badge image URL
        data['User Level'] = convert_to_int(soup_row.find('div', class_='levelBadge').find('img')['src']) - 22000
    except Exception:
        data['User Level'] = 0

    return data

def get_thread_dataframe(forum_df, type_name, class_name, data_function):
    data = []
    total = len(forum_df['Link'])
    index = 0
    for url in forum_df['Link']:     
        index += 1
        print('{} out of {}'.format(index, total), end='\r', flush=True)
        data.extend(collect_recursively([], PREFIX_URL + url, type_name, class_name, data_function))
    return pd.DataFrame(data)
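
`collect_recursively` uses one stack frame per forum page, so a very long forum could hit Python's default recursion limit of 1000. The sketch below shows the same "follow the next link until there is none" idea as a loop. It is only an illustration: `collect_iteratively`, `fetch_page` and `FAKE_FORUM` are hypothetical names that do not appear in the notebook, and the pages are simulated with a dictionary instead of HTTP requests.

```python
def collect_iteratively(start_url, fetch_page, max_pages=1000):
    """Follow 'next page' links with a loop instead of recursion.

    fetch_page is a hypothetical callable that returns (rows, next_url)
    for a URL; next_url is an empty string on the last page.
    """
    data = []
    url = start_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        rows, next_url = fetch_page(url)
        data.extend(rows)
        url = next_url
        pages_seen += 1
    return data


# Offline stand-in for the forum: three pages linked by "next" URLs.
FAKE_FORUM = {
    'page1': (['thread A', 'thread B'], 'page2'),
    'page2': (['thread C'], 'page3'),
    'page3': (['thread D'], ''),
}

threads = collect_iteratively('page1', FAKE_FORUM.__getitem__)
print(threads)
```

In a real run, `fetch_page` would wrap the `requests` and `BeautifulSoup` calls together with `find_all_rows` and `find_next`; the `max_pages` cap is an extra safeguard against a pagination loop.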

In [4]:
PREFIX_URL = 'https://www.tripadvisor.co.uk/'
START_URL = PREFIX_URL + 'ShowForum-g1-i12334-Holiday_Travel.html'
forum_df = get_forum_dataframe(START_URL, 'tr', '', collect_forum_data)
forum_df.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/trip_advisor.json')

257

In [5]:
thread_df = get_thread_dataframe(forum_df, 'div', 'post', collect_post_data)

4874 out of 4874

In [6]:
merged_df = pd.merge(thread_df, forum_df, on='Link')
merged_df.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/trip_advisor_threads.json')

## 1.3. Filtering Data <a class="anchor" id="13"></a>

In this subsection, we filter the threads to only keep those having questions as titles.

In [42]:
forum_threads = pd.read_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/trip_advisor_threads.json')
forum_subjects = pd.read_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/trip_advisor.json')

forum_threads.Link = forum_threads.Link.apply(lambda row: row.split('ShowTopic-g1-i12334-')[1])
#forum_threads.Message = forum_threads.Message.apply(lambda msg: msg.split('_________________')[0])
forum_subjects.Link = forum_subjects.Link.apply(lambda row: row.split('ShowTopic-g1-i12334-')[1])
forum_subjects = forum_subjects.drop_duplicates(subset=['Link'], keep='first')
forum_threads = forum_threads.drop_duplicates(subset=['Link', 'Message'], keep='first')
#forum_threads.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/forum_threads.json')
#forum_subjects.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/forum_subjects.json')

The total number of posts is:

In [8]:
len(forum_threads)

22689

The total number of threads is:

In [9]:
len(forum_subjects)

4874

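The following function defines what counts as a question: the title must end with a question mark, start with a modal, a verb or a wh-word, and contain at least one verb.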

In [10]:
def is_question(sentence):
    """
    Returns True if a sentence is a question.
    :param sentence: string, the raw (untokenized) sentence
    :return: boolean
    """
    # Keep only alphanumerics and basic punctuation; treat '/' as a word separator.
    new_sentence = (''.join([c for c in sentence if c.isalnum() or c in '?!./ "\''])).replace('/', ' ')
    tokens = nltk.word_tokenize(new_sentence)
    # Lowercase the first letter of each token before POS tagging.
    tokens = [word[:1].lower() + word[1:] for word in tokens]
    try:
        question_mark_index = tokens.index('?')
        tags = nltk.pos_tag(tokens[:question_mark_index + 1])
        # Only accept sentences whose last token is the question mark.
        if question_mark_index == len(tokens) - 1:
            # The first word must be a modal, a verb or a wh-word (but not "am").
            if tags[0][1] in 'MD VB VBD VBN VBP VBZ WRB WP WDT'.split(' ') and tags[0][0] not in 'am'.split(' '):
                # The sentence must also contain at least one verb.
                for tag in tags:
                    if tag[1].startswith('VB'):
                        return True
    except ValueError:
        # No question mark among the tokens.
        pass
    return False
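
To make the rule easier to experiment with, here is a simplified, dependency-free approximation. It is not equivalent to `is_question`: instead of POS tags it checks the first word against a hand-written list, and it does not verify that the sentence contains a verb. `QUESTION_STARTERS` and `looks_like_question` are hypothetical names introduced for this sketch.

```python
QUESTION_STARTERS = {
    # modal and auxiliary verbs
    'can', 'could', 'will', 'would', 'should', 'may', 'might', 'must',
    'do', 'does', 'did', 'is', 'are', 'was', 'were', 'has', 'have', 'had',
    # wh-words
    'who', 'what', 'which', 'when', 'where', 'why', 'how',
}


def looks_like_question(title):
    """Rough check: ends with '?' and starts with a typical question word."""
    stripped = title.strip()
    words = stripped.split()
    if not words or not stripped.endswith('?'):
        return False
    return words[0].lower() in QUESTION_STARTERS


for title in ['Is Cox and Kings worth it?',
              'Has anyone used Deluxe Breaks?',
              'Never you cox and kings tour']:
    print(title, '->', looks_like_question(title))
```

A word list like this misses questions that open with any other verb, which is one reason the notebook relies on POS tagging instead.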

In [11]:
filtered_forum_subjects = forum_subjects[forum_subjects['Title'].map(lambda x: is_question(x))]
len(filtered_forum_subjects)

34

In [12]:
filtered_forum_subjects.head()

Unnamed: 0,Link,Replies,Title
117,k12010239-Can_kiwi_com_asked_10x_the_price_if_...,3,Can kiwi.com asked 10x the price if name was i...
164,k11986484-Can_a_hotel_charge_cancellation_fee_...,7,Can a hotel charge cancellation fee I was not ...
169,k9329714-Has_anyone_used_Deluxe_Breaks-Holiday...,42,Has anyone used Deluxe Breaks?
186,k11459095-Is_Cox_and_Kings_worth_it-Holiday_Tr...,16,Is Cox and Kings worth it?
192,k11155487-Has_anyone_heard_of_cushybnb_com-Hol...,95,Has anyone heard of cushybnb.com?


In [13]:
merged_forum_threads = pd.merge(forum_threads.drop(['Replies', 'Title'], axis=1), forum_subjects, 
                                on='Link', how='inner')
merged_forum_threads.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/merged_forum_threads_ta.json')
merged_forum_threads.head()

Unnamed: 0,Link,Message,Timestamp,User Level,Username,Replies,Title
0,k7867029-See_TOP_QUESTIONS_before_posting-Holi...,HOW TO USE THE HOLIDAY TRAVEL FORUM!It is wort...,2014-10-11 01:00:00,6,BradJill,2,See TOP QUESTIONS before posting!
1,k7867029-See_TOP_QUESTIONS_before_posting-Holi...,Great advice BradJill.....any idea how to acce...,2014-10-11 12:37:00,6,Eden7,2,See TOP QUESTIONS before posting!
2,k7867029-See_TOP_QUESTIONS_before_posting-Holi...,Now that is a good question. I know this is mu...,2014-10-11 13:16:00,6,BradJill,2,See TOP QUESTIONS before posting!
3,k11858152-Never_you_cox_and_kings_tour-Holiday...,I booked American bonanza tour from cox and ki...,2018-08-15 04:36:00,0,Hyder A,6,Never you cox and kings tour
4,k11858152-Never_you_cox_and_kings_tour-Holiday...,Whilst it was understandably annoying that you...,2018-08-15 07:11:00,6,Travel_Undercover,6,Never you cox and kings tour


In [14]:
len(merged_forum_threads)

22689

### 2.4.2. For All Collected Data <a class="anchor" id="242"></a>

In [1]:
import nltk.data
import subprocess

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [2]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import nltk
import gensim



In [3]:
df = pd.read_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/merged_forum_threads_ta.json')
#print(len(df))
df = df[df['Message'].map(lambda x: x is not None)]
#print(len(df))
df.head()

Unnamed: 0,Link,Message,Replies,Timestamp,Title,User Level,Username
0,k7867029-See_TOP_QUESTIONS_before_posting-Holi...,HOW TO USE THE HOLIDAY TRAVEL FORUM!It is wort...,2,2014-10-11 01:00:00,See TOP QUESTIONS before posting!,6,BradJill
1,k7867029-See_TOP_QUESTIONS_before_posting-Holi...,Great advice BradJill.....any idea how to acce...,2,2014-10-11 12:37:00,See TOP QUESTIONS before posting!,6,Eden7
10,k11355776-Bliss_PV_Real_Estate_Is_this_a_scam_...,"It's actually an old thread, brought back to l...",12,2018-11-13 02:51:00,Bliss PV Real Estate - Is this a scam company?,6,MarlySF
100,k8909331-Booking_in_advance_vs_last_minute-Hol...,>>>Is it something that's quite safe (even if ...,9,2015-10-06 10:05:00,Booking in advance vs last-minute (Closed topic),6,RojBlake
1000,k7411109-Golden_ticket_travel_company-Holiday_...,If you type 'golden ticket travel' into the ma...,6,2014-04-29 16:58:00,Golden ticket travel company (Closed topic),6,BradJill


In [4]:
titles_with_first_post = df.groupby('Link').first().reset_index()[['Link', 'Title', 'Username', 'Message', 'Replies']]
titles_with_first_post.columns = ['Link', 'Title', 'Seeker', 'First_Post', 'Replies']
titles_with_first_post.head()

Unnamed: 0,Link,Title,Seeker,First_Post,Replies
0,k10001274-World_wide_travel_and_money_currency...,World wide travel and money/currency (Closed t...,joychild,Am hoping to plan a world trip to several coun...,6
1,k10003103-January_warm_hot_quiet_relaxing_fish...,"january warm/hot, quiet, relaxing, fishing vil...",DRMF1066,I want to go somewhere that is warm/hot in the...,16
2,k10007687-Travel_insurance_is_1cover_good-Holi...,travel insurance: is 1cover good? (Closed topic),21traveller,Hi theremy husband and i are travelling to can...,4
3,k10008965-Hotel_Cancellation-Holiday_Travel.html,Hotel Cancellation (Closed topic),Bonjours,Unfortunately the research should have been do...,3
4,k10010521-Airport_hotels-Holiday_Travel.html,Airport hotels (Closed topic),tracybideford,HiWe are flying to Canada in maw from Heathrow...,6


In [5]:
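def tokenize_properly(text):
    """
    Splits a text into sentences, dropping fragments that have fewer than
    two characters besides periods and spaces.
    """
    if text is None:
        return []
    return [sent for sent in tokenizer.tokenize(text.replace('\n', '. '))
            if len(sent.replace('.', '').replace(' ', '')) >= 2]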

titles_with_first_post.Title = titles_with_first_post.Title.apply(tokenize_properly)
#titles_with_first_post.First_Post = titles_with_first_post.First_Post.apply(tokenize_properly)
titles_with_first_post.head()

Unnamed: 0,Link,Title,Seeker,First_Post,Replies
0,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,6
1,k10003103-January_warm_hot_quiet_relaxing_fish...,"[january warm/hot, quiet, relaxing, fishing vi...",DRMF1066,I want to go somewhere that is warm/hot in the...,16
2,k10007687-Travel_insurance_is_1cover_good-Holi...,"[travel insurance: is 1cover good?, (Closed to...",21traveller,Hi theremy husband and i are travelling to can...,4
3,k10008965-Hotel_Cancellation-Holiday_Travel.html,[Hotel Cancellation (Closed topic)],Bonjours,Unfortunately the research should have been do...,3
4,k10010521-Airport_hotels-Holiday_Travel.html,[Airport hotels (Closed topic)],tracybideford,HiWe are flying to Canada in maw from Heathrow...,6


In [6]:
titles_with_first_post['Title_sent_count'] = titles_with_first_post.Title.apply(len)
#titles_with_first_post['FP_sent_count'] = titles_with_first_post.First_Post.apply(len)
titles_with_first_post.head()

Unnamed: 0,Link,Title,Seeker,First_Post,Replies,Title_sent_count
0,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,6,1
1,k10003103-January_warm_hot_quiet_relaxing_fish...,"[january warm/hot, quiet, relaxing, fishing vi...",DRMF1066,I want to go somewhere that is warm/hot in the...,16,2
2,k10007687-Travel_insurance_is_1cover_good-Holi...,"[travel insurance: is 1cover good?, (Closed to...",21traveller,Hi theremy husband and i are travelling to can...,4,2
3,k10008965-Hotel_Cancellation-Holiday_Travel.html,[Hotel Cancellation (Closed topic)],Bonjours,Unfortunately the research should have been do...,3,1
4,k10010521-Airport_hotels-Holiday_Travel.html,[Airport hotels (Closed topic)],tracybideford,HiWe are flying to Canada in maw from Heathrow...,6,1


In [10]:
titles_with_messages = pd.merge(titles_with_first_post.drop(['Replies'], axis=1), 
                                df[['Username', 'Message', 'Link', 'User Level']], on='Link')
titles_with_messages.rename(columns={'Username':'Replier'}, inplace=True)
titles_with_messages = titles_with_messages[titles_with_messages.apply(
    lambda row: not row['First_Post'] == row['Message'] and not row['Seeker'] == row['Replier'], axis=1)]
titles_with_messages.Message = titles_with_messages.Message.apply(tokenize_properly)
titles_with_messages.rename(columns={'Message': 'Reply'}, inplace=True)
titles_with_messages['Reply_sent_count'] = titles_with_messages.Reply.apply(len)
titles_with_messages.head()

Unnamed: 0,Link,Title,Seeker,First_Post,Title_sent_count,Replier,Reply,User Level,Reply_sent_count
1,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,1,bestcornishcatdc,[Use your card in local ATMs preferably using ...,5,1
3,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,1,Eden7,[Any ATM you use in any country will dispense ...,6,1
4,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,1,Bonjours,[For such a trip and so long I would bring 2 c...,6,2
5,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,1,RojBlake,[You are wise not to want to take a pile of ca...,6,5
6,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,1,Tracey F,[CaroleRecently went to europe and loaded euro...,2,4


In [12]:
titles_with_messages.columns

Index(['Link', 'Title', 'Seeker', 'First_Post', 'Title_sent_count', 'Replier',
       'Reply', 'User Level', 'Reply_sent_count'],
      dtype='object')

In [13]:
len(titles_with_messages)

11105

In [14]:
CHOSEN_COLUMNS = ['Link', 'Reply']
titles_with_messages[CHOSEN_COLUMNS].head()

Unnamed: 0,Link,Reply
1,k10001274-World_wide_travel_and_money_currency...,[Use your card in local ATMs preferably using ...
3,k10001274-World_wide_travel_and_money_currency...,[Any ATM you use in any country will dispense ...
4,k10001274-World_wide_travel_and_money_currency...,[For such a trip and so long I would bring 2 c...
5,k10001274-World_wide_travel_and_money_currency...,[You are wise not to want to take a pile of ca...
6,k10001274-World_wide_travel_and_money_currency...,[CaroleRecently went to europe and loaded euro...


In [15]:
# Copy the slice so that adding columns later does not trigger SettingWithCopyWarning.
msg_df = titles_with_messages[CHOSEN_COLUMNS].copy()
msg_df.head()

Unnamed: 0,Link,Reply
1,k10001274-World_wide_travel_and_money_currency...,[Use your card in local ATMs preferably using ...
3,k10001274-World_wide_travel_and_money_currency...,[Any ATM you use in any country will dispense ...
4,k10001274-World_wide_travel_and_money_currency...,[For such a trip and so long I would bring 2 c...
5,k10001274-World_wide_travel_and_money_currency...,[You are wise not to want to take a pile of ca...
6,k10001274-World_wide_travel_and_money_currency...,[CaroleRecently went to europe and loaded euro...


In [16]:
titles_with_first_post.head()

Unnamed: 0,Link,Title,Seeker,First_Post,Replies,Title_sent_count
0,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,joychild,Am hoping to plan a world trip to several coun...,6,1
1,k10003103-January_warm_hot_quiet_relaxing_fish...,"[january warm/hot, quiet, relaxing, fishing vi...",DRMF1066,I want to go somewhere that is warm/hot in the...,16,2
2,k10007687-Travel_insurance_is_1cover_good-Holi...,"[travel insurance: is 1cover good?, (Closed to...",21traveller,Hi theremy husband and i are travelling to can...,4,2
3,k10008965-Hotel_Cancellation-Holiday_Travel.html,[Hotel Cancellation (Closed topic)],Bonjours,Unfortunately the research should have been do...,3,1
4,k10010521-Airport_hotels-Holiday_Travel.html,[Airport hotels (Closed topic)],tracybideford,HiWe are flying to Canada in maw from Heathrow...,6,1


In [17]:
titles_with_first_post[['Link', 'Title', 'First_Post']].to_csv('all_titles_fp.csv')

In [18]:
# Copy the slice so that adding columns later does not trigger SettingWithCopyWarning.
tfp_df = titles_with_first_post[['Link', 'Title', 'First_Post']].copy()
tfp_df.head()

Unnamed: 0,Link,Title,First_Post
0,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,Am hoping to plan a world trip to several coun...
1,k10003103-January_warm_hot_quiet_relaxing_fish...,"[january warm/hot, quiet, relaxing, fishing vi...",I want to go somewhere that is warm/hot in the...
2,k10007687-Travel_insurance_is_1cover_good-Holi...,"[travel insurance: is 1cover good?, (Closed to...",Hi theremy husband and i are travelling to can...
3,k10008965-Hotel_Cancellation-Holiday_Travel.html,[Hotel Cancellation (Closed topic)],Unfortunately the research should have been do...
4,k10010521-Airport_hotels-Holiday_Travel.html,[Airport hotels (Closed topic)],HiWe are flying to Canada in maw from Heathrow...


## 5.2. Precomputing word2vec <a class="anchor" id="52"></a>

In [19]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/GoogleNews-vectors-negative300.bin', binary=True)
from nltk.corpus import stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))

def get_sentence_vector(sentence):
    tokens = [token for token in nltk.word_tokenize(sentence) if token not in stopwords]
    vectors = []
    for token in tokens:
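        try:
            # Index the KeyedVectors directly; the .wv alias is deprecated.
            word_vec = model[token]
            vectors.append(word_vec)
        except KeyError:
            # Skip tokens missing from the word2vec vocabulary.
            pass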
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return []
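
The averaging step can be tried without loading the GoogleNews model. The sketch below is a toy version under stated assumptions: `TOY_VECTORS`, `TOY_STOPWORDS` and `toy_sentence_vector` are made-up stand-ins for the word2vec model, the NLTK stop word list and `get_sentence_vector`, and tokens are split on whitespace rather than with `nltk.word_tokenize`.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" standing in for the word2vec model.
TOY_VECTORS = {
    'cash': np.array([1.0, 0.0, 0.0]),
    'card': np.array([0.0, 1.0, 0.0]),
    'hotel': np.array([0.0, 0.0, 1.0]),
}
TOY_STOPWORDS = {'or', 'a', 'the'}


def toy_sentence_vector(sentence):
    """Average the vectors of all known, non-stop-word tokens."""
    tokens = [t for t in sentence.lower().split() if t not in TOY_STOPWORDS]
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if vectors:
        return np.mean(vectors, axis=0)
    # No known word: return an empty array, similar to the empty list above.
    return np.array([])


print(toy_sentence_vector('cash or card'))
print(toy_sentence_vector('unknown words'))
```

Here "cash or card" averages the vectors of "cash" and "card" only, because "or" is treated as a stop word.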

In [20]:
tfp_df['Title_word2vec'] = tfp_df.Title.apply(lambda sents: [get_sentence_vector(sent) for sent in sents])
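# Note: First_Post is still a raw string here (its tokenization is commented out above),
# so this loop iterates over characters rather than sentences.
tfp_df['First_Post_word2vec'] = tfp_df.First_Post.apply(lambda sents: [get_sentence_vector(sent) for sent in sents])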
tfp_df.head()


Unnamed: 0,Link,Title,First_Post,Title_word2vec,First_Post_word2vec
0,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,Am hoping to plan a world trip to several coun...,"[[0.020874023, 0.04030762, -0.00045166016, 0.1...","[[-0.10595703, 0.21386719, 0.118652344, -0.031..."
1,k10003103-January_warm_hot_quiet_relaxing_fish...,"[january warm/hot, quiet, relaxing, fishing vi...",I want to go somewhere that is warm/hot in the...,"[[0.0685791, 0.07075195, -0.008129883, 0.05454...","[[0.07910156, -0.0050354004, 0.111816406, 0.21..."
2,k10007687-Travel_insurance_is_1cover_good-Holi...,"[travel insurance: is 1cover good?, (Closed to...",Hi theremy husband and i are travelling to can...,"[[0.090413414, 0.1077474, -0.018676758, 0.1013...","[[-0.28125, 0.027954102, 0.023071289, -0.03112..."
3,k10008965-Hotel_Cancellation-Holiday_Travel.html,[Hotel Cancellation (Closed topic)],Unfortunately the research should have been do...,"[[0.09375, 0.029388428, 0.008453369, 0.1171264...","[[-0.29882812, 0.13964844, 0.29492188, 0.07617..."
4,k10010521-Airport_hotels-Holiday_Travel.html,[Airport hotels (Closed topic)],HiWe are flying to Canada in maw from Heathrow...,"[[0.09472656, 0.009857178, 0.008491516, 0.2236...","[[-0.28125, 0.027954102, 0.023071289, -0.03112..."


In [21]:
msg_df['Reply_word2vec'] = msg_df.Reply.apply(lambda sents: [get_sentence_vector(sent) for sent in sents])
msg_df.head()


Unnamed: 0,Link,Reply,Reply_word2vec
1,k10001274-World_wide_travel_and_money_currency...,[Use your card in local ATMs preferably using ...,"[[0.009467231, 0.049945407, 0.027411567, 0.145..."
3,k10001274-World_wide_travel_and_money_currency...,[Any ATM you use in any country will dispense ...,"[[-0.025487265, 0.04998053, -0.00014822824, 0...."
4,k10001274-World_wide_travel_and_money_currency...,[For such a trip and so long I would bring 2 c...,"[[0.0423473, 0.05445168, 0.07248757, 0.0971568..."
5,k10001274-World_wide_travel_and_money_currency...,[You are wise not to want to take a pile of ca...,"[[0.09565226, 0.12756348, 0.069244385, 0.15706..."
6,k10001274-World_wide_travel_and_money_currency...,[CaroleRecently went to europe and loaded euro...,"[[0.005925959, -0.024559714, -0.07071755, 0.10..."


In [22]:
import pickle

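# Context managers close the files, which guarantees the pickles are flushed to disk.
with open('tfp_df_COPY_ta.pkl', 'wb') as output1:
    pickle.dump(tfp_df, output1)

with open('msg_df_COPY_ta.pkl', 'wb') as output2:
    pickle.dump(msg_df, output2)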

## 5.3. Dot Product Similarity Functions <a class="anchor" id="53"></a>

In [23]:
import pandas as pd
from os import listdir
import gensim
import numpy as np
import nltk
from nltk.corpus import stopwords
import ast
stopwords = set(nltk.corpus.stopwords.words('english'))

In [24]:
model = gensim.models.KeyedVectors.load_word2vec_format('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/GoogleNews-vectors-negative300.bin', binary=True)

In [43]:
tfp_df = pd.read_pickle('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/tfp_df_COPY_ta.pkl')
tfp_df.head()

Unnamed: 0,Link,Title,First_Post,Title_word2vec,First_Post_word2vec
0,k10001274-World_wide_travel_and_money_currency...,[World wide travel and money/currency (Closed ...,Am hoping to plan a world trip to several coun...,"[[0.020874023, 0.04030762, -0.00045166016, 0.1...","[[-0.10595703, 0.21386719, 0.118652344, -0.031..."
1,k10003103-January_warm_hot_quiet_relaxing_fish...,"[january warm/hot, quiet, relaxing, fishing vi...",I want to go somewhere that is warm/hot in the...,"[[0.0685791, 0.07075195, -0.008129883, 0.05454...","[[0.07910156, -0.0050354004, 0.111816406, 0.21..."
2,k10007687-Travel_insurance_is_1cover_good-Holi...,"[travel insurance: is 1cover good?, (Closed to...",Hi theremy husband and i are travelling to can...,"[[0.090413414, 0.1077474, -0.018676758, 0.1013...","[[-0.28125, 0.027954102, 0.023071289, -0.03112..."
3,k10008965-Hotel_Cancellation-Holiday_Travel.html,[Hotel Cancellation (Closed topic)],Unfortunately the research should have been do...,"[[0.09375, 0.029388428, 0.008453369, 0.1171264...","[[-0.29882812, 0.13964844, 0.29492188, 0.07617..."
4,k10010521-Airport_hotels-Holiday_Travel.html,[Airport hotels (Closed topic)],HiWe are flying to Canada in maw from Heathrow...,"[[0.09472656, 0.009857178, 0.008491516, 0.2236...","[[-0.28125, 0.027954102, 0.023071289, -0.03112..."


In [45]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2)/(np.linalg.norm(vec1) * np.linalg.norm(vec2))

def get_sentence_vector(sentence):
    tokens = [token for token in nltk.word_tokenize(sentence) if token not in stopwords]
    vectors = []
    for token in tokens:
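        try:
            # Index the KeyedVectors directly; the .wv alias is deprecated.
            word_vec = model[token]
            vectors.append(word_vec)
        except KeyError:
            # Skip tokens missing from the word2vec vocabulary.
            pass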
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return []

def is_not_null(sent_vec):
    for element in sent_vec:
        if not element == 0.0:
            return True
    return False

def sent_to_text_similarity(sent_vec, text_vec):
    similarities = []
    for vec in text_vec:
        if is_not_null(vec):
            similarities.append(np.dot(sent_vec, vec)/(np.linalg.norm(sent_vec) * np.linalg.norm(vec)))
    if similarities:
        return np.mean(similarities)
    else:
        return np.nan

def text_to_text_similarity(sent_vecs1, sent_vecs2):
    similarities = []
    for v1 in sent_vecs1:
        if is_not_null(v1):
            similarity = sent_to_text_similarity(v1, sent_vecs2)
            if not np.isnan(similarity):
                similarities.append(similarity)
    if similarities:
        return np.mean(similarities)
    else:
        return np.nan

def text_to_corpus_similarity(text, corpus):
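    # Split each text into sentences and embed every sentence
    # (assumes the NLTK sentence `tokenizer` loaded in this notebook).
    sent_vecs = [get_sentence_vector(sent) for sent in tokenizer.tokenize(text)]
    corpus_vecs = [[get_sentence_vector(sent) for sent in tokenizer.tokenize(other_text)]
                   for other_text in corpus]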
    max_sim = 0
    index = -1
    for text_index in range(len(corpus_vecs)):
        similarity = text_to_text_similarity(sent_vecs, corpus_vecs[text_index])
        if not np.isnan(similarity) and max_sim < similarity:
            max_sim = similarity
            index = text_index
    if index >= 0:
        return corpus[index]
    else:
        return None
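
The similarity used throughout this subsection is the cosine similarity, that is, the dot product of two vectors divided by the product of their norms. The sketch below replays `sent_to_text_similarity` on hand-made 2-dimensional vectors so the result can be checked by hand. `cosine`, `mean_cosine` and the example vectors are hypothetical names and values chosen for illustration.

```python
import numpy as np


def cosine(u, v):
    """Dot product of u and v, normalized by their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_cosine(sent_vec, text_vecs):
    """Average cosine similarity between one sentence and each sentence of a text.

    All-zero (or empty) vectors are skipped, as is_not_null does above.
    """
    sims = [cosine(sent_vec, v) for v in text_vecs if np.any(v)]
    return float(np.mean(sims)) if sims else float('nan')


query = np.array([1.0, 0.0])
text = [
    np.array([2.0, 0.0]),  # same direction as the query: similarity 1
    np.array([0.0, 3.0]),  # orthogonal to the query: similarity 0
    np.array([0.0, 0.0]),  # null vector: ignored
]
print(mean_cosine(query, text))
```

Because of the normalization, the length of a vector does not matter, only its direction; the null vector is skipped to avoid a division by zero.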

## 5.4. Chatbot Functions <a class="anchor" id="54"></a>

In [46]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [51]:
def compute_similarity(row, sent_vec):
    title_sim = 0
    title_word2vec = row['Title_word2vec']
    if len(title_word2vec) > 0:
        if len(title_word2vec[0]) > 0:
            title_sim = cosine_similarity(sent_vec, title_word2vec[0])
    return title_sim

def compute_separate_similarity(row, sent_vecs):
    title_sim = 0
    title_word2vec = row['Title_word2vec']
    if len(title_word2vec) > 0:
        if len(title_word2vec[0]) > 0:
            title_sim = np.dot(sent_vecs[0], title_word2vec[0])/(np.linalg.norm(sent_vecs[0])*np.linalg.norm(title_word2vec[0]))
    fp_sim = text_to_text_similarity(sent_vecs[1:], row['First_Post_word2vec'])
    return title_sim + fp_sim

def compute_separate_similarity_no_question(row, sent_vecs):
    fp_sim = text_to_text_similarity(sent_vecs, row['First_Post_word2vec'])
    return fp_sim

def get_most_similar_title(sentences, sent_vecs):
    if len(sentences) == 0:
        raise ValueError('Write something!')
    elif len(sentences) == 1:
        title_fp_sim = tfp_df.apply(lambda row: compute_similarity(row, sent_vecs[0]), axis=1)
    elif sentences[0].endswith('?'):
        title_fp_sim = tfp_df.apply(lambda row: compute_separate_similarity(row, sent_vecs), axis=1)
    else:
        title_fp_sim = tfp_df.apply(lambda row: compute_separate_similarity_no_question(row, sent_vecs), axis=1)
    return tfp_df.loc[title_fp_sim.idxmax()]

def get_response_sentences(sentences, sent_vecs, link, max_sentences):
    answer_df = pd.read_pickle('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/msg_df_COPY_ta.pkl')
    answer_df = answer_df[answer_df['Link'].map(lambda x: x == link)]
    if answer_df.empty:
        s = 'I did not find a matching sentence'
        return s
    
    best_answer = answer_df.loc[answer_df['Reply_word2vec'].apply(lambda other_vecs: 
                                                     text_to_text_similarity(sent_vecs, other_vecs)).idxmax()]
    
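    # Score every reply sentence; sentences without a vector get -inf so that
    # the indices stay aligned with best_answer.Reply.
    best_sentence_idx = np.argmax([sent_to_text_similarity(sent_vec, sent_vecs) if len(sent_vec) else -np.inf
                                   for sent_vec in best_answer.Reply_word2vec])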
    reply_sentences = best_answer.Reply
    if max_sentences <= 1:
        return reply_sentences[best_sentence_idx]
    else:
        context_sent_count = int((max_sentences - 1)/2)
        sent_count = len(reply_sentences)
        lower_bound = best_sentence_idx - context_sent_count
        upper_bound = best_sentence_idx + context_sent_count + 1
        return ' '.join(reply_sentences[max(0, lower_bound - max(0, upper_bound - sent_count)): 
                                        min(upper_bound + max(0, 0 - lower_bound) + ((max_sentences - 1) % 2), sent_count)])

def chatbot_answer(question, max_sentences=1):
    sentences = tokenizer.tokenize(question)
    sent_vecs = [get_sentence_vector(sent) for sent in sentences]
    most_similar_title = get_most_similar_title(sentences, sent_vecs)
    return get_response_sentences(sentences, sent_vecs, most_similar_title.Link, max_sentences)
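
The slice at the end of `get_response_sentences` is dense, so the sketch below isolates its index arithmetic in a small function that returns the `(start, end)` bounds of the sentences to return. `context_window` is a hypothetical helper written for this illustration; it is meant to mirror the expression above, not to replace it.

```python
def context_window(best_idx, sent_count, max_sentences):
    """Return (start, end) slice bounds around the best sentence.

    The window is centred on best_idx and shifted when it would run past
    either end of the reply.
    """
    if max_sentences <= 1:
        return best_idx, best_idx + 1
    half = (max_sentences - 1) // 2
    lower = best_idx - half
    upper = best_idx + half + 1
    # Extend to the left by whatever overflows on the right ...
    start = max(0, lower - max(0, upper - sent_count))
    # ... and to the right by whatever overflows on the left. For an even
    # max_sentences, the extra sentence is taken after the best one.
    end = min(upper + max(0, -lower) + (max_sentences - 1) % 2, sent_count)
    return start, end


print(context_window(2, 10, 3))  # best sentence in the middle of the reply
print(context_window(0, 10, 3))  # best sentence is the first one
print(context_window(9, 10, 3))  # best sentence is the last one
```

With `max_sentences=2`, the window holds the best sentence and the one after it, except when the best sentence is the last one of the reply, in which case only that sentence is returned.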
                        

An example query and the chatbot's answer:

In [54]:
chatbot_answer('should i use cash or card', max_sentences=2)

"I'd just take cash up to your insurance limits, and then use a card like a Clarity. Failing that travellers cheques are still widely accepted in the States."