# A Forum-based Chatbot for Parents of Autistic Children

**[Khalil Mrini](https://goo.gl/7MCYvq)**

# Table of Contents

### [1. Scraping the Online Forum](#1)
[1.1. Functions](#11) <br/>
[1.2. Collected Data](#12) <br/>
[1.2.1. Parents' Discussion](#121) <br/>
[1.2.2. General Autism Discussion](#122) <br/>
[1.3. Filtering Data](#13) <br/>

### [2. Amazon Mechanical Turk](#2)
[2.1. Sampling Threads](#21) <br/>
[2.2. Using sent2vec to Cluster Samples](#22) <br/>
[2.3. Checking Amazon Mechanical Turk Results](#23) <br/>
[2.4. Using NLTK to Separate Text Data into Sentences](#24) <br/>
[2.4.1. For the Samples](#241) <br/>
[2.4.2. For All Collected Data](#242) <br/>

### [3. Machine Learning Features](#3)
[3.1. MTurk DataFrame](#31) <br/>
[3.2. Adding Reply Vectors](#32) <br/>
[3.3. Adding First Post Vectors](#33) <br/>
[3.4. Adding Title Vectors](#34) <br/>
[3.5. Computing sent2vec Similarity](#35) <br/>
[3.6. Adding Features from Forum Threads](#36) <br/>
[3.7. Standardization of Features](#37) <br/>
[3.8. Train and Test Data Sets](#38) <br/>
[3.9. PCA Visualization](#39) <br/>

### [4. Ensemble Classifiers for Multi-Class Labels](#4)
[4.1. Random Forest Classifier](#41) <br/>
[4.2. Bagging Classifier](#42) <br/>
[4.3. Ada Boost Classifier](#43) <br/>
[4.4. Extra Trees Classifier](#44) <br/>
[4.5. Gradient Boosting Classifier](#45) <br/>
[4.6. SVM](#46) <br/>
[4.7. Decision Tree Classifier](#47) <br/>
[4.8. K-Neighbours Classifier](#48) <br/>
[4.9. Radius Neighbours Classifier](#49) <br/>

### [5. Building the Chatbot](#5)
[5.1. Applying the Model to All Unlabelled Replies](#51) <br/>
[5.2. Precomputing word2vec](#52) <br/>
[5.3. Dot Product Similarity Functions](#53) <br/>
[5.4. Chatbot Functions](#54) <br/>

# 1. Scraping the Online Forum <a class="anchor" id="1"></a>

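In this section, we scrape an online forum page by page. The project targets the *General Autism Discussion* and the *Parents' Discussion* of [Wrong Planet](https://wrongplanet.net/forums/); the cells below run the same pipeline on the Holiday Truths America and Canada discussion forum, as set in `PREFIX_URL` and `START_URL`.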

In [1]:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import nltk
import gensim



## 1.1. Functions <a class="anchor" id="11"></a>

In [3]:
def find_next(soup, index):
    """
    Returns the relative link behind the 'Next' pagination button, or an empty string.
    The index argument is unused; it is kept so that existing callers keep working.
    """
    try:
        for item in soup.find_all('li'):
            if item.text.strip() == 'Next':
                return item.find('a')['href'][1:]
    except (AttributeError, KeyError, TypeError):
        # Malformed pagination markup: treat it as the last page
        pass
    return ""

def find_all_rows(soup, type_name, class_name, fct_number):
    try:
        if fct_number == 1:
            return soup.find_all(type_name, {'class': class_name})
        else:
            return soup.find_all(type_name, {'class': 'op-post'}) + soup.find_all(type_name, {'class': class_name})
        
    except:
        return []

def convert_to_int(text):
    return int(''.join([char for char in text if char.isdigit()]))

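def transform_datetime(datetime):
    """
    Replaces the relative dates 'Yesterday' and 'Today' with hard-coded absolute dates.
    Values that are not strings are returned unchanged.
    """
    try:
        return datetime.replace('Yesterday', '24 Oct 2017').replace('Today', '25 Oct 2017')
    except (AttributeError, TypeError):
        return datetime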

def collect_forum_data(soup_row, link=""): 
    data = {}
    try:
        title_soup = soup_row.find('span', {'class': 'title'})
        data['Title'] = title_soup.text.strip()
        data['Link'] = title_soup.find('a')['href'][1:]
        info_soup = soup_row.find('div', {'class': 'labels'})
        data['Replies'] = convert_to_int(info_soup.text.strip())     
    except:
        pass
    return data

def collect_recursively(data, url, type_name, class_name, fct_number, data_function, index=""):
    try:
        if index:
            print(index, end='\r', flush=True)      
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        soup_rows = find_all_rows(soup, type_name, class_name, fct_number)
        data.extend([data_function(soup_row, link=url) for soup_row in soup_rows])
        next_url = find_next(soup, index)
        if next_url:
            if index:
                return collect_recursively(data, PREFIX_URL + next_url, type_name, class_name, fct_number, data_function, index+1)
            else:
                return collect_recursively(data, PREFIX_URL + next_url, type_name, class_name, fct_number, data_function)
        else:
            return data
    except:
        return data

def get_forum_dataframe(url, type_name, class_name, fct_number, data_function):
    data = collect_recursively([], url, type_name, class_name, fct_number, data_function, 1)
    return pd.DataFrame(data)

def collect_post_data(soup_row, link=""):
    """
    Extracts the username, timestamp and message of one post.
    Fields that cannot be parsed are left out of the returned dict.
    """
    data = {}
    try:
        data['Link'] = link.replace(PREFIX_URL, '')
        # Username of the poster
        data['Username'] = soup_row.find('div', class_='user-name').find('a').text.strip()
        # Timestamp, with relative dates replaced
        data['Timestamp'] = transform_datetime(soup_row.find('div', class_='op-date').text.strip())
        # Message body
        data['Message'] = soup_row.find('div', class_='col-md-10').text.strip()
    except AttributeError:
        # A find() call returned None: keep whatever was collected so far
        pass
    return data

def get_thread_dataframe(forum_df, type_name, class_name, fct_number, data_function):
    data = []
    total = len(forum_df['Link'])
    index = 0
    for url in forum_df['Link']:     
        index += 1
        print('{} out of {}'.format(index, total), end='\r', flush=True)
        data.extend(collect_recursively([], PREFIX_URL + url, type_name, class_name, fct_number, data_function))
    return pd.DataFrame(data)
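
`collect_recursively` follows the *Next* link by calling itself once per page, so the recursion depth grows with the number of pages. As a hedged alternative, the same traversal can be written as a loop. The sketch below is not the notebook's code: `crawl_pages`, `fetch`, `parse_rows` and `find_next_link` are hypothetical names, and `FAKE_SITE` is an in-memory stand-in for real HTTP responses so that the example runs offline.

```python
def crawl_pages(start_url, fetch, parse_rows, find_next_link, max_pages=10000):
    """Collect rows from start_url and every page reachable through 'next' links.

    fetch(url) -> page, parse_rows(page) -> list, find_next_link(page) -> url or ''.
    All three callables are hypothetical stand-ins for the requests/BeautifulSoup
    calls used in the notebook.
    """
    rows, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)                  # guards against pagination loops
        page = fetch(url)
        rows.extend(parse_rows(page))
        url = find_next_link(page)
    return rows


# Offline stand-in for a three-page forum listing (invented data)
FAKE_SITE = {
    'page-1': {'rows': ['a', 'b'], 'next': 'page-2'},
    'page-2': {'rows': ['c'], 'next': 'page-3'},
    'page-3': {'rows': ['d', 'e'], 'next': ''},
}

collected = crawl_pages(
    'page-1',
    fetch=FAKE_SITE.__getitem__,
    parse_rows=lambda page: page['rows'],
    find_next_link=lambda page: page['next'],
)
print(collected)  # ['a', 'b', 'c', 'd', 'e']
```

Passing the fetch and parse steps as arguments keeps the loop testable without network access; in the notebook they would wrap `requests.get`, `find_all_rows` and `find_next`.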

In [4]:
PREFIX_URL = 'https://www.holidaytruths.co.uk/'
START_URL = PREFIX_URL + 'forum/america-canada-discussion-forum-f2-0.html'
forum_df = get_forum_dataframe(START_URL, 'div', 'row review', 1, collect_forum_data)
forum_df.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/holiday_truths.json')

257

In [5]:
thread_df = get_thread_dataframe(forum_df, 'div', 'topic-item', 2, collect_post_data)

4874 out of 4874

In [6]:
merged_df = pd.merge(thread_df, forum_df, on='Link')
merged_df.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/holiday_truths_threads.json')
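
`transform_datetime` in section 1.1 replaces 'Yesterday' and 'Today' with fixed dates, which are only correct when the scrape runs on that particular day. A minimal sketch of deriving them from the scrape date instead: `resolve_relative_dates` and its `scraped_on` parameter are hypothetical names, the `'%d %b %Y'` format is assumed from the literals used in the notebook, and the sample timestamp is invented.

```python
from datetime import date, timedelta


def resolve_relative_dates(timestamp, scraped_on=None):
    """Replace 'Yesterday' and 'Today' with dates relative to the scrape date."""
    if not isinstance(timestamp, str):
        return timestamp  # mirror the notebook: leave non-text values untouched
    scraped_on = scraped_on or date.today()
    fmt = '%d %b %Y'  # assumed from '24 Oct 2017' in transform_datetime
    today = scraped_on.strftime(fmt)
    yesterday = (scraped_on - timedelta(days=1)).strftime(fmt)
    return timestamp.replace('Yesterday', yesterday).replace('Today', today)


print(resolve_relative_dates('Yesterday, 10:38', scraped_on=date(2017, 10, 25)))
```

Note that `%b` is locale-dependent; the month abbreviation is English only under the default locale.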

## 1.3. Filtering Data <a class="anchor" id="13"></a>

In this subsection, we clean the scraped links and messages, drop duplicates, and then keep only the threads whose title is phrased as a question.

In [3]:
forum_threads = pd.read_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/holiday_truths_threads.json')
forum_subjects = pd.read_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/holiday_truths.json')

forum_threads.Link = forum_threads.Link.apply(lambda row: row.split('forum/')[1].split('.html')[0])
forum_threads.Message = forum_threads.Message.apply(lambda msg: msg.split('_________________')[0])
forum_subjects.Link = forum_subjects.Link.apply(lambda row: row.split('forum/')[1].split('.html')[0])
forum_subjects = forum_subjects.drop_duplicates(subset=['Link'], keep='first')
forum_threads = forum_threads.drop_duplicates(subset=['Link', 'Message'], keep='first')
forum_threads.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/forum_threads_holiday_truths.json')
forum_subjects.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/forum_subjects_holiday_truths.json')
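
The two `split` chains above raise an `IndexError` when a link does not contain `forum/`. A defensive variant, as a sketch only: `thread_slug` and `strip_signature` are hypothetical helper names, and the second and third example inputs are made up.

```python
SIGNATURE_SEPARATOR = '_________________'  # separator used in the cell above


def thread_slug(link):
    """Return the part of a link between 'forum/' and '.html', or the link unchanged."""
    _, found, tail = link.partition('forum/')
    if not found:
        return link
    return tail.split('.html')[0]


def strip_signature(message):
    """Drop everything after the first signature separator."""
    return message.split(SIGNATURE_SEPARATOR)[0]


print(thread_slug('forum/esta-question-on-employment-t172445.html'))
print(thread_slug('no-prefix-here'))
print(repr(strip_signature('See you there!\n' + SIGNATURE_SEPARATOR + '\nMy signature')))
```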

The total number of posts is:

In [6]:
len(forum_threads)

11054

The total number of threads is:

In [7]:
len(forum_subjects)

1434

In [14]:
len(merged_forum_threads)

22689

In [8]:
def is_question(sentence):
    """
    Returns True if a sentence is a question.
    :param sentence: string, the raw (untokenized) sentence
    :return: boolean
    """
    # Keep letters, digits and basic punctuation; treat '/' as a word separator
    new_sentence = (''.join([c for c in sentence if c.isalnum() or c in '?!./ "\''])).replace('/', ' ')
    tokens = nltk.word_tokenize(new_sentence)
    # Lowercase the first letter of every token before tagging
    tokens = [word[:1].lower() + word[1:] for word in tokens]
    try:
        question_mark_index = tokens.index('?')
    except ValueError:
        # No question mark in the sentence
        return False
    # The first question mark must be the last token
    if question_mark_index != len(tokens) - 1:
        return False
    tags = nltk.pos_tag(tokens[:question_mark_index + 1])
    # The first word must be tagged as a modal, a verb or a wh-word (but not be 'am')
    if tags[0][1] in ('MD', 'VB', 'VBD', 'VBN', 'VBP', 'VBZ', 'WRB', 'WP', 'WDT') and tags[0][0] != 'am':
        # The sentence must also contain at least one verb
        return any(tag[1].startswith('VB') for tag in tags)
    return False

In [9]:
filtered_forum_subjects = forum_subjects[forum_subjects['Title'].map(lambda x: is_question(x))]
len(filtered_forum_subjects)

36

In [10]:
filtered_forum_subjects.head()

| | Link | Replies | Title |
|---|---|---|---|
| 1059 | miami-what-do-t107527 | 6.0 | Miami - what to do? |
| 1071 | is-this-good-price-for-flight-new-york-t106955 | 6.0 | Is this a good price for a flight to new york? |
| 1079 | what-is-there-do-in-las-vegas-t105579 | 11.0 | what is there to do in Las Vegas? |
| 1180 | what-pack-for-florida-t98837 | 14.0 | What to pack for Florida? |
| 1232 | when-do-flights-for-april-2009-come-on-sale-fl... | 7.0 | when do flights for april 2009 come on sale to... |
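
`is_question` relies on NLTK part-of-speech tags. To make the idea easy to try without NLTK models, here is a much cruder stand-in that only looks at the final character and the first word. It is not equivalent to the function above (for instance, it rejects 'Miami - what to do?', which the tagger-based filter kept); `looks_like_question` and the `QUESTION_STARTERS` word list are assumptions made for this illustration.

```python
# Hypothetical, hand-picked list of words that often open a question
QUESTION_STARTERS = {
    'what', 'when', 'where', 'which', 'who', 'why', 'how',
    'is', 'are', 'was', 'were', 'do', 'does', 'did',
    'can', 'could', 'should', 'would', 'will',
}


def looks_like_question(title):
    """Crude check: the title ends with '?' and starts with a question word."""
    stripped = title.strip()
    words = stripped.lower().split()
    if not words or not stripped.endswith('?'):
        return False
    return words[0] in QUESTION_STARTERS


for title in ['what is there to do in Las Vegas?', 'Miami - what to do?', '$20 Dollar trick']:
    print(title, '->', looks_like_question(title))
```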


In [11]:
merged_forum_threads = pd.merge(forum_threads.drop(['Replies', 'Title'], axis=1), forum_subjects, 
                                on='Link', how='inner')
merged_forum_threads.to_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/merged_forum_threads_ht.json')
merged_forum_threads.head()

| | Link | Message | Timestamp | Username | Replies | Title |
|---|---|---|---|---|---|---|
| 0 | esta-question-on-employment-t172445 | Hi, I am brand new and hopefully I have put th... | 2018-01-15 09:54:14 | AnnaM | 12.0 | ESTA question on employment |
| 1 | esta-question-on-employment-t172445 | Hi Anita & \nIf you are retired Anna then I'... | 2018-01-15 10:38:14 | Glynis HT Admin | 12.0 | ESTA question on employment |
| 2 | esta-question-on-employment-t172445 | You will love it!\n\n \n\n ... | 2018-01-16 19:19:00 | Fiona | 12.0 | ESTA question on employment |
| 3 | esta-question-on-employment-t172445 | Keep your ESTA reference number\n\n\n ... | 2018-10-13 12:44:47 | James Fletcher | 12.0 | ESTA question on employment |
| 4 | esta-question-on-employment-t172445 | Thanks for your input, My worry is if I put No... | 2018-01-15 11:18:45 | AnnaM | 12.0 | ESTA question on employment |


In [12]:
len(merged_forum_threads)

11054

### 2.4.2. For All Collected Data <a class="anchor" id="242"></a>

In [2]:
import nltk.data
import subprocess

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [3]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import nltk
import gensim

In [4]:
df = pd.read_json('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/merged_forum_threads_ht.json')
df = df[df['Message'].map(lambda x: x is not None)]  # check if useful for other sites than TripAdvisor
df.head()

| | Link | Message | Replies | Timestamp | Title | Username |
|---|---|---|---|---|---|---|
| 0 | esta-question-on-employment-t172445 | Hi, I am brand new and hopefully I have put th... | 12.0 | 2018-01-15 09:54:14 | ESTA question on employment | AnnaM |
| 1 | esta-question-on-employment-t172445 | Hi Anita & \nIf you are retired Anna then I'... | 12.0 | 2018-01-15 10:38:14 | ESTA question on employment | Glynis HT Admin |
| 10 | esta-question-on-employment-t172445 | Is that the Cosmos one? Haven't been to LA but... | 12.0 | 2018-01-15 13:56:02 | ESTA question on employment | Fiona |
| 100 | when-do-oct-08-brrochures-go-on-sale-t93309 | Reviews HERE.\n\nPippy \n\n Reply | 6.0 | 2007-02-24 00:20:29 | When do Oct 08 Brrochures go on sale? | Pippy |
| 1000 | trip-report-vegas-ti-bellagio-7-15-sept-long-t... | Not too long at all, Lesley. You seemed to pac... | 18.0 | 2005-09-22 14:56:51 | Trip Report - Vegas TI/Bellagio 7-15 sept - long! | pebbles |


In [5]:
titles_with_first_post = df.groupby('Link').first().reset_index()[['Link', 'Title', 'Username', 'Message', 'Replies']]
titles_with_first_post.columns = ['Link', 'Title', 'Seeker', 'First_Post', 'Replies']
titles_with_first_post.head()

| | Link | Title | Seeker | First_Post | Replies |
|---|---|---|---|---|---|
| 0 | -20-dollar-trick-t110905 | \$20 Dollar trick | jac47 | just returned from our first trip to vegas.\nI... | 13.0 |
| 1 | -toronto-halal-t102178 | --Toronto- halal-- | Just_a_tourist | Hi! Does anyone know good places where they se... | 1.0 |
| 2 | 1st-time-florida-help-t141308 | 1st Time to Florida HELP!!! | Lelly | Hello, we are a family of 5 going to Florida f... | 5.0 |
| 3 | 1st-time-florida-whats-nearby-t153058 | 1st time florida - whats nearby | cart583 | hi, never been to florida before but have book... | 2.0 |
| 4 | 1st-time-new-york-booking-advice-t126856 | 1st time New York / booking advice | seagull | Starting to look into booking a break in New Y... | 3.0 |


In [6]:
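def tokenize_properly(text):
    """
    Splits a text into sentences, treating line breaks as sentence boundaries.
    Fragments with fewer than 2 characters besides periods and spaces are dropped.
    """
    if text is None:
        # Return an empty list so that len() still works on the result
        return []
    return [sent for sent in tokenizer.tokenize(text.replace('\n', '. '))
            if len(sent.replace('.', '').replace(' ', '')) >= 2]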

titles_with_first_post.Title = titles_with_first_post.Title.apply(tokenize_properly)
titles_with_first_post.First_Post = titles_with_first_post.First_Post.apply(tokenize_properly)
titles_with_first_post.head()

| | Link | Title | Seeker | First_Post | Replies |
|---|---|---|---|---|---|
| 0 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 13.0 |
| 1 | -toronto-halal-t102178 | [--Toronto- halal--] | Just_a_tourist | [Hi!, Does anyone know good places where they ... | 1.0 |
| 2 | 1st-time-florida-help-t141308 | [1st Time to Florida HELP!!] | Lelly | [Hello, we are a family of 5 going to Florida ... | 5.0 |
| 3 | 1st-time-florida-whats-nearby-t153058 | [1st time florida - whats nearby] | cart583 | [hi, never been to florida before but have boo... | 2.0 |
| 4 | 1st-time-new-york-booking-advice-t126856 | [1st time New York / booking advice] | seagull | [Starting to look into booking a break in New ... | 3.0 |
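
Replacing every line break with `'. '` lets the tokenizer treat lines as sentences, but it also produces artifacts such as `vegas..` when a line already ends with punctuation, as visible in the `First_Post` column above. One possible refinement, sketched with the standard library only: `join_lines` is a hypothetical helper, and the sample text is invented.

```python
import re


def join_lines(text):
    """Turn line breaks into sentence boundaries without doubling punctuation."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    joined = []
    for line in lines:
        # Add a period only when the line does not already end a sentence
        joined.append(line if re.search(r'[.!?]$', line) else line + '.')
    return ' '.join(joined)


print(join_lines('just returned from our first trip to vegas.\nIt was great\n\nThanks!'))
# just returned from our first trip to vegas. It was great. Thanks!
```

The result could then be passed to `tokenizer.tokenize` in place of `text.replace('\n', '. ')`.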


In [7]:
titles_with_first_post['Title_sent_count'] = titles_with_first_post.Title.apply(len)
titles_with_first_post['FP_sent_count'] = titles_with_first_post.First_Post.apply(len)
titles_with_first_post.head()

| | Link | Title | Seeker | First_Post | Replies | Title_sent_count | FP_sent_count |
|---|---|---|---|---|---|---|---|
| 0 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 13.0 | 1 | 10 |
| 1 | -toronto-halal-t102178 | [--Toronto- halal--] | Just_a_tourist | [Hi!, Does anyone know good places where they ... | 1.0 | 1 | 3 |
| 2 | 1st-time-florida-help-t141308 | [1st Time to Florida HELP!!] | Lelly | [Hello, we are a family of 5 going to Florida ... | 5.0 | 1 | 8 |
| 3 | 1st-time-florida-whats-nearby-t153058 | [1st time florida - whats nearby] | cart583 | [hi, never been to florida before but have boo... | 2.0 | 1 | 7 |
| 4 | 1st-time-new-york-booking-advice-t126856 | [1st time New York / booking advice] | seagull | [Starting to look into booking a break in New ... | 3.0 | 1 | 1 |


In [8]:
titles_with_messages = pd.merge(titles_with_first_post.drop(['Replies'], axis=1), 
                                df[['Username', 'Message', 'Link']], on='Link')
titles_with_messages.rename(columns={'Username':'Replier'}, inplace=True)
titles_with_messages = titles_with_messages[titles_with_messages.apply(
    lambda row: not row['First_Post'] == row['Message'] and not row['Seeker'] == row['Replier'], axis=1)]
titles_with_messages.Message = titles_with_messages.Message.apply(tokenize_properly)
titles_with_messages.rename(columns={'Message': 'Reply'}, inplace=True)
titles_with_messages['Reply_sent_count'] = titles_with_messages.Reply.apply(len)
titles_with_messages.head()

| | Link | Title | Seeker | First_Post | Title_sent_count | FP_sent_count | Replier | Reply | Reply_sent_count |
|---|---|---|---|---|---|---|---|---|---|
| 1 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 1 | 10 | luci HT Mod | [Re: \$20 Dollar trick., Excellent news jac!, I... | 5 |
| 2 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 1 | 10 | lesley74 | [Re: \$20 Dollar trick., I love the \$20 trick b... | 3 |
| 3 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 1 | 10 | luci HT Mod | [Re: \$20 Dollar trick., Tried it at the Bellag... | 8 |
| 4 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 1 | 10 | kiershay | [Re: \$20 Dollar trick., does this only work in... | 3 |
| 5 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 1 | 10 | luci HT Mod | [Re: \$20 Dollar trick., The suggested method i... | 8 |


In [10]:
titles_with_messages.columns

Index(['Link', 'Title', 'Seeker', 'First_Post', 'Title_sent_count',
       'FP_sent_count', 'Replier', 'Reply', 'Reply_sent_count'],
      dtype='object')

In [11]:
len(titles_with_messages)

7643

In [12]:
CHOSEN_COLUMNS = ['Link', 'Reply']
titles_with_messages[CHOSEN_COLUMNS].head()

| | Link | Reply |
|---|---|---|
| 1 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Excellent news jac!, I... |
| 2 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., I love the \$20 trick b... |
| 3 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Tried it at the Bellag... |
| 4 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., does this only work in... |
| 5 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., The suggested method i... |


In [14]:
msg_df = titles_with_messages[CHOSEN_COLUMNS].copy()  # copy, so that columns can be added later without a SettingWithCopyWarning
msg_df.head()

| | Link | Reply |
|---|---|---|
| 1 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Excellent news jac!, I... |
| 2 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., I love the \$20 trick b... |
| 3 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Tried it at the Bellag... |
| 4 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., does this only work in... |
| 5 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., The suggested method i... |


In [15]:
titles_with_first_post.head()

| | Link | Title | Seeker | First_Post | Replies | Title_sent_count | FP_sent_count |
|---|---|---|---|---|---|---|---|
| 0 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | jac47 | [just returned from our first trip to vegas.., ... | 13.0 | 1 | 10 |
| 1 | -toronto-halal-t102178 | [--Toronto- halal--] | Just_a_tourist | [Hi!, Does anyone know good places where they ... | 1.0 | 1 | 3 |
| 2 | 1st-time-florida-help-t141308 | [1st Time to Florida HELP!!] | Lelly | [Hello, we are a family of 5 going to Florida ... | 5.0 | 1 | 8 |
| 3 | 1st-time-florida-whats-nearby-t153058 | [1st time florida - whats nearby] | cart583 | [hi, never been to florida before but have boo... | 2.0 | 1 | 7 |
| 4 | 1st-time-new-york-booking-advice-t126856 | [1st time New York / booking advice] | seagull | [Starting to look into booking a break in New ... | 3.0 | 1 | 1 |


In [16]:
titles_with_first_post[['Link', 'Title', 'First_Post']].to_csv('all_titles_fp.csv')

In [17]:
tfp_df = titles_with_first_post[['Link', 'Title', 'First_Post']].copy()  # copy, so that columns can be added later without a SettingWithCopyWarning
tfp_df.head()

| | Link | Title | First_Post |
|---|---|---|---|
| 0 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | [just returned from our first trip to vegas.., ... |
| 1 | -toronto-halal-t102178 | [--Toronto- halal--] | [Hi!, Does anyone know good places where they ... |
| 2 | 1st-time-florida-help-t141308 | [1st Time to Florida HELP!!] | [Hello, we are a family of 5 going to Florida ... |
| 3 | 1st-time-florida-whats-nearby-t153058 | [1st time florida - whats nearby] | [hi, never been to florida before but have boo... |
| 4 | 1st-time-new-york-booking-advice-t126856 | [1st time New York / booking advice] | [Starting to look into booking a break in New ... |


## 5.2. Precomputing word2vec <a class="anchor" id="52"></a>

In [18]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/GoogleNews-vectors-negative300.bin', binary=True)
from nltk.corpus import stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))

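def get_sentence_vector(sentence):
    """
    Returns the mean word2vec vector of the non-stopword tokens of a sentence,
    or an empty list if none of the tokens is in the model vocabulary.
    """
    tokens = [token for token in nltk.word_tokenize(sentence) if token not in stopwords]
    vectors = []
    for token in tokens:
        try:
            # Index the KeyedVectors directly; the .wv alias on KeyedVectors is deprecated
            vectors.append(model[token])
        except KeyError:
            # Out-of-vocabulary token
            pass
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return []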
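
To see what `get_sentence_vector` computes without loading the large GoogleNews model, the sketch below averages vectors from a tiny hand-made embedding table. `TOY_VECTORS`, `TOY_STOPWORDS` and `mean_vector` are invented for this illustration, and whitespace splitting stands in for `nltk.word_tokenize`.

```python
import numpy as np

# Invented 3-dimensional "embeddings", for illustration only
TOY_VECTORS = {
    'cash': np.array([1.0, 0.0, 0.0]),
    'card': np.array([0.0, 1.0, 0.0]),
    'use': np.array([0.0, 0.0, 1.0]),
}
TOY_STOPWORDS = {'should', 'i', 'or'}


def mean_vector(sentence):
    """Average the vectors of known, non-stopword tokens; empty array if none."""
    tokens = [t for t in sentence.lower().split() if t not in TOY_STOPWORDS]
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.array([])


print(mean_vector('should i use cash or card'))  # each component equals 1/3
print(mean_vector('hello there').size)           # 0: no known token
```

Averaging discards word order, so 'cash or card' and 'card or cash' map to the same vector; this is a property of the approach, not of the toy data.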

In [23]:
tfp_df['Title_word2vec'] = tfp_df.Title.apply(lambda sents: [get_sentence_vector(sent) for sent in sents])
tfp_df['First_Post_word2vec'] = tfp_df.First_Post.apply(lambda sents: [get_sentence_vector(sent) for sent in sents])
tfp_df.head()

| | Link | Title | First_Post | Title_word2vec | First_Post_word2vec |
|---|---|---|---|---|---|
| 0 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | [just returned from our first trip to vegas.., ... | [[0.053548176, -0.06933594, -0.022298178, 0.05... | [[0.09472656, 0.1303711, -0.07922363, 0.002766... |
| 1 | -toronto-halal-t102178 | [--Toronto- halal--] | [Hi!, Does anyone know good places where they ... | [[-0.4375, -0.36914062, 0.21484375, 0.14941406... | [[-0.087402344, 0.095703125, 0.27539062, -0.01... |
| 2 | 1st-time-florida-help-t141308 | [1st Time to Florida HELP!!] | [Hello, we are a family of 5 going to Florida ... | [[0.09753418, 0.0126953125, 0.038024902, 0.224... | [[0.039695047, 0.022238992, 0.029807352, 0.102... |
| 3 | 1st-time-florida-whats-nearby-t153058 | [1st time florida - whats nearby] | [hi, never been to florida before but have boo... | [[0.030883789, -0.016217042, 0.0061157225, 0.2... | [[-0.010480608, 0.021902902, -0.032854352, 0.0... |
| 4 | 1st-time-new-york-booking-advice-t126856 | [1st time New York / booking advice] | [Starting to look into booking a break in New ... | [[-0.012568156, 0.072255455, -0.041554768, 0.0... | [[0.042194713, 0.096147016, -0.065665506, 0.10... |


In [24]:
msg_df['Reply_word2vec'] = msg_df.Reply.apply(lambda sents: [get_sentence_vector(sent) for sent in sents])
msg_df.head()

| | Link | Reply | Reply_word2vec |
|---|---|---|---|
| 1 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Excellent news jac!, I... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 2 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., I love the \$20 trick b... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 3 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Tried it at the Bellag... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 4 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., does this only work in... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 5 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., The suggested method i... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |


In [22]:
import pickle

# Write the two DataFrames to pickle files; the with-blocks close the files
with open('tfp_df_COPY_ht.pkl', 'wb') as output1:
    pickle.dump(tfp_df, output1)

with open('msg_df_COPY_ht.pkl', 'wb') as output2:
    pickle.dump(msg_df, output2)

## 5.3. Dot Product Similarity Functions <a class="anchor" id="53"></a>

In [19]:
import pandas as pd
from os import listdir
import gensim
import numpy as np
import nltk
from nltk.corpus import stopwords
import ast
stopwords = set(nltk.corpus.stopwords.words('english'))

In [22]:
model = gensim.models.KeyedVectors.load_word2vec_format('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/Forum Data/GoogleNews-vectors-negative300.bin', binary=True)

In [38]:
tfp_df = pd.read_pickle('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/tfp_df_COPY_ht.pkl')
tfp_df.head()

| | Link | Title | First_Post | Title_word2vec | First_Post_word2vec |
|---|---|---|---|---|---|
| 0 | -20-dollar-trick-t110905 | [\$20 Dollar trick] | [just returned from our first trip to vegas.., ... | [[0.053548176, -0.06933594, -0.022298178, 0.05... | [[0.09472656, 0.1303711, -0.07922363, 0.002766... |
| 1 | -toronto-halal-t102178 | [--Toronto- halal--] | [Hi!, Does anyone know good places where they ... | [[-0.4375, -0.36914062, 0.21484375, 0.14941406... | [[-0.087402344, 0.095703125, 0.27539062, -0.01... |
| 2 | 1st-time-florida-help-t141308 | [1st Time to Florida HELP!!] | [Hello, we are a family of 5 going to Florida ... | [[0.09753418, 0.0126953125, 0.038024902, 0.224... | [[0.039695047, 0.022238992, 0.029807352, 0.102... |
| 3 | 1st-time-florida-whats-nearby-t153058 | [1st time florida - whats nearby] | [hi, never been to florida before but have boo... | [[0.030883789, -0.016217042, 0.0061157225, 0.2... | [[-0.010480608, 0.021902902, -0.032854352, 0.0... |
| 4 | 1st-time-new-york-booking-advice-t126856 | [1st time New York / booking advice] | [Starting to look into booking a break in New ... | [[-0.012568156, 0.072255455, -0.041554768, 0.0... | [[0.042194713, 0.096147016, -0.065665506, 0.10... |


In [39]:
msg_df = pd.read_pickle('C:/Users/Meret/Documents/EPFL/3Annee/Semestre_5/Projet/Forum_Chatbot/msg_df_COPY_ht.pkl')
msg_df.head()

| | Link | Reply | Reply_word2vec |
|---|---|---|---|
| 1 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Excellent news jac!, I... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 2 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., I love the \$20 trick b... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 3 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., Tried it at the Bellag... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 4 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., does this only work in... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |
| 5 | -20-dollar-trick-t110905 | [Re: \$20 Dollar trick., The suggested method i... | [[0.026672363, 0.012939453, 0.06286621, 0.0266... |


In [71]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2)/(np.linalg.norm(vec1) * np.linalg.norm(vec2))

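def get_sentence_vector(sentence):
    """
    Returns the mean word2vec vector of the non-stopword tokens of a sentence,
    or an empty list if none of the tokens is in the model vocabulary.
    """
    tokens = [token for token in nltk.word_tokenize(sentence) if token not in stopwords]
    vectors = []
    for token in tokens:
        try:
            # Index the KeyedVectors directly; the .wv alias on KeyedVectors is deprecated
            vectors.append(model[token])
        except KeyError:
            # Out-of-vocabulary token
            pass
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return []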

def is_not_null(sent_vec):
    for element in sent_vec:
        if not element == 0.0:
            return True
    return False

def sent_to_text_similarity(sent_vec, text_vec):
    similarities = []
    for vec in text_vec:
        if is_not_null(vec):
            similarities.append(np.dot(sent_vec, vec)/(np.linalg.norm(sent_vec) * np.linalg.norm(vec)))
    if similarities:
        return np.mean(similarities)
    else:
        return np.nan

def text_to_text_similarity(sent_vecs1, sent_vecs2):
    similarities = []
    for v1 in sent_vecs1:
        if is_not_null(v1):
            similarity = sent_to_text_similarity(v1, sent_vecs2)
            if not np.isnan(similarity):
                similarities.append(similarity)
    if similarities:
        return np.mean(similarities)
    else:
        return np.nan

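def text_to_sent_vec(text):
    # Assumed helper: one sentence vector per sentence of the text,
    # with sentences split by nltk.sent_tokenize
    return [get_sentence_vector(sent) for sent in nltk.sent_tokenize(text)]

def text_to_corpus_similarity(text, corpus):
    """
    Returns the text of the corpus that is most similar to the given text,
    or None if no text has a positive similarity.
    """
    sent_vecs = text_to_sent_vec(text)
    corpus_vecs = [text_to_sent_vec(other_text) for other_text in corpus]
    max_sim = 0
    index = -1
    for text_index in range(len(corpus_vecs)):
        similarity = text_to_text_similarity(sent_vecs, corpus_vecs[text_index])
        if not np.isnan(similarity) and max_sim < similarity:
            max_sim = similarity
            index = text_index
    if index >= 0:
        return corpus[index]
    else:
        return None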
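
`cosine_similarity` divides by the product of the two norms, so an all-zero vector leads to a division by zero; the notebook sidesteps this by testing `is_not_null` before each call. The sketch below folds that check into the function and reproduces the averaging done by `sent_to_text_similarity` on made-up 2-D vectors. `safe_cosine` and `mean_similarity` are hypothetical names.

```python
import numpy as np


def safe_cosine(vec1, vec2):
    """Cosine similarity, or NaN when either vector is empty or all zeros."""
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    norm = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if vec1.size == 0 or vec2.size == 0 or norm == 0.0:
        return np.nan
    return float(np.dot(vec1, vec2) / norm)


def mean_similarity(sent_vec, text_vecs):
    """Mean similarity between one sentence vector and each usable vector of a text."""
    sims = [safe_cosine(sent_vec, vec) for vec in text_vecs]
    sims = [s for s in sims if not np.isnan(s)]
    return float(np.mean(sims)) if sims else np.nan


# One identical, one orthogonal, one zero and one empty sentence vector
print(mean_similarity([1, 0], [[1, 0], [0, 1], [0, 0], []]))  # 0.5
```

Returning NaN rather than 0 keeps "no usable vector" distinguishable from "orthogonal", which matches the `np.nan` convention of the functions above.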

## 5.4. Chatbot Functions <a class="anchor" id="54"></a>

In [72]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [73]:
def compute_similarity(row, sent_vec):
    """
    Cosine similarity between one input sentence and the thread title.
    """
    title_sim = 0
    title_word2vec = row['Title_word2vec']
    if len(title_word2vec) > 0 and len(title_word2vec[0]) > 0:
        title_sim = cosine_similarity(sent_vec, title_word2vec[0])
    return title_sim

def compute_separate_similarity(row, sent_vecs):
    """
    Title similarity of the first input sentence plus first-post similarity of the others.
    """
    title_sim = compute_similarity(row, sent_vecs[0])
    fp_sim = text_to_text_similarity(sent_vecs[1:], row['First_Post_word2vec'])
    return title_sim + fp_sim

def compute_separate_similarity_no_question(row, sent_vecs):
    """
    First-post similarity of all input sentences.
    """
    return text_to_text_similarity(sent_vecs, row['First_Post_word2vec'])

def get_most_similar_title(sentences, sent_vecs):
    if len(sentences) == 0:
        raise ValueError('Write something!')
    elif len(sentences) == 1:
        title_fp_sim = tfp_df.apply(lambda row: compute_similarity(row, sent_vecs[0]), axis=1)
    elif sentences[0].endswith('?'):
        title_fp_sim = tfp_df.apply(lambda row: compute_separate_similarity(row, sent_vecs), axis=1)
    else:
        title_fp_sim = tfp_df.apply(lambda row: compute_separate_similarity_no_question(row, sent_vecs), axis=1)
    return tfp_df.loc[title_fp_sim.idxmax()]

def get_response_sentences(sentences, sent_vecs, link, max_sentences):
    # Replies of the matched thread; msg_df was loaded from msg_df_COPY_ht.pkl above
    answer_df = msg_df[msg_df['Link'] == link]
    if answer_df.empty:
        return 'I did not find a matching sentence'

    # Reply that is most similar to the input as a whole
    reply_sims = answer_df['Reply_word2vec'].apply(
        lambda other_vecs: text_to_text_similarity(sent_vecs, other_vecs))
    if reply_sims.isna().all():
        return 'I did not find a matching sentence'
    best_answer = answer_df.loc[reply_sims.idxmax()]

    # Score every reply sentence; sentences without a usable vector get -inf,
    # so that the chosen index still lines up with best_answer.Reply
    sentence_sims = []
    for sent_vec in best_answer.Reply_word2vec:
        sim = sent_to_text_similarity(sent_vec, sent_vecs) if len(sent_vec) else np.nan
        sentence_sims.append(-np.inf if np.isnan(sim) else sim)
    best_sentence_idx = int(np.argmax(sentence_sims))
    reply_sentences = best_answer.Reply
    if max_sentences <= 1:
        return reply_sentences[best_sentence_idx]
    context_sent_count = (max_sentences - 1) // 2
    sent_count = len(reply_sentences)
    lower_bound = best_sentence_idx - context_sent_count
    upper_bound = best_sentence_idx + context_sent_count + 1
    # Shift the window when it would run past either end of the reply
    start = max(0, lower_bound - max(0, upper_bound - sent_count))
    end = min(upper_bound + max(0, -lower_bound) + (max_sentences - 1) % 2, sent_count)
    return ' '.join(reply_sentences[start:end])

def chatbot_answer(question, max_sentences=1):
    sentences = tokenizer.tokenize(question)
    sent_vecs = [get_sentence_vector(sent) for sent in sentences]
    most_similar_title = get_most_similar_title(sentences, sent_vecs)
    return get_response_sentences(sentences, sent_vecs, most_similar_title.Link, max_sentences)
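
The slice arithmetic at the end of `get_response_sentences` picks a window of at most `max_sentences` sentences around the best match and shifts it when it would overrun either end of the reply. The standalone sketch below applies the same arithmetic to a plain list so the behaviour can be checked in isolation; `context_window` is a hypothetical name and the sentence labels are placeholders.

```python
def context_window(sentences, best_idx, max_sentences):
    """Return up to max_sentences items around best_idx, shifted to stay in range."""
    if max_sentences <= 1:
        return sentences[best_idx:best_idx + 1]
    context = (max_sentences - 1) // 2
    count = len(sentences)
    lower = best_idx - context
    upper = best_idx + context + 1
    # Extend on one side by whatever was cut off on the other side
    start = max(0, lower - max(0, upper - count))
    end = min(upper + max(0, -lower) + (max_sentences - 1) % 2, count)
    return sentences[start:end]


reply = ['s0', 's1', 's2', 's3', 's4']
print(context_window(reply, 2, 3))  # ['s1', 's2', 's3']
print(context_window(reply, 0, 3))  # ['s0', 's1', 's2']
print(context_window(reply, 4, 3))  # ['s2', 's3', 's4']
print(context_window(reply, 1, 2))  # ['s1', 's2']
```

With an even `max_sentences`, the extra sentence is taken after the best match; when the best match is the last sentence of the reply, the window is not extended backwards, so `context_window(reply, 4, 2)` returns only `['s4']`.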
                        

An example query, with the answer limited to two sentences:

In [81]:
chatbot_answer('Should I use cash or card?', max_sentences=2)

'Therefore, we recommend using a regular credit or debit card to "guarantee" transactions and only use the FairFX card when you actually settle an account.. . ATM withdrawal Limits.'