# Comparative question answering demo

External resources required:

In [1]:
!python3 -m nltk.downloader stopwords
!python3 -m nltk.downloader universal_tagset
!python3 -m spacy download en

Downloading the [trained models](https://drive.google.com/open?id=1JTtQ6HuSddqAxgXueFQunoaTuknSN7p6) from Google Drive:

In [9]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1JTtQ6HuSddqAxgXueFQunoaTuknSN7p6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1JTtQ6HuSddqAxgXueFQunoaTuknSN7p6" -O model_files.zip && rm -rf /tmp/cookies.txt

--2020-05-08 17:33:31--  https://docs.google.com/uc?export=download&confirm=geV6&id=1JTtQ6HuSddqAxgXueFQunoaTuknSN7p6
Resolving docs.google.com (docs.google.com)... 173.194.73.194, 2a00:1450:4010:c03::c2
Connecting to docs.google.com (docs.google.com)|173.194.73.194|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0s-5s-docs.googleusercontent.com/docs/securesc/7ia27re84neq8nk3h6q4aciftspfnri0/65nk291gg3bpupt14iqk2q5s9o5somp7/1588948350000/13476312262238289650/10998875876866526155Z/1JTtQ6HuSddqAxgXueFQunoaTuknSN7p6?e=download [following]
--2020-05-08 17:33:31--  https://doc-0s-5s-docs.googleusercontent.com/docs/securesc/7ia27re84neq8nk3h6q4aciftspfnri0/65nk291gg3bpupt14iqk2q5s9o5somp7/1588948350000/13476312262238289650/10998875876866526155Z/1JTtQ6HuSddqAxgXueFQunoaTuknSN7p6?e=download
Resolving doc-0s-5s-docs.googleusercontent.com (doc-0s-5s-docs.googleusercontent.com)... 64.233.163.132, 2a00:1450:4010:c0b::84
Connecting to doc-0s-5s

Unzipping the archive and moving the files to their proper places:

In [10]:
!unzip model_files.zip
!mv model_files/bert/pytorch_model.bin bert/pytorch_model.bin
!mv model_files/elmo/elmo+linreg.bin elmo/elmo+linreg.bin
!mv model_files/infersent/infersent+xgboost.bin infersent/infersent+xgboost.bin
!mv model_files/seq_lab/BERT_seq_lab/pytorch_model.bin seq_lab/BERT_seq_lab/pytorch_model.bin
!mv model_files/seq_lab/LSTMCRF_seq_lab/pytorch_model.bin seq_lab/LSTMCRF_seq_lab/pytorch_model.bin
!mv model_files/seq_lab/LSTMCRF_seq_lab/vocab.pkl seq_lab/LSTMCRF_seq_lab/vocab.pkl
!mv model_files/w2v/asp_clf.pkl w2v/asp_clf.pkl

Archive:  model_files.zip
   creating: model_files/
   creating: model_files/bert/
  inflating: model_files/bert/pytorch_model.bin  
   creating: model_files/elmo/
  inflating: model_files/elmo/elmo+linreg.bin  
   creating: model_files/infersent/
  inflating: model_files/infersent/infersent+xgboost.bin  
   creating: model_files/seq_lab/
   creating: model_files/seq_lab/BERT_seq_lab/
  inflating: model_files/seq_lab/BERT_seq_lab/pytorch_model.bin  
   creating: model_files/seq_lab/LSTMCRF_seq_lab/
  inflating: model_files/seq_lab/LSTMCRF_seq_lab/pytorch_model.bin  
  inflating: model_files/seq_lab/LSTMCRF_seq_lab/vocab.pkl  
   creating: model_files/w2v/
  inflating: model_files/w2v/asp_clf.pkl  


## The whole pipeline in three cells, as an example

In [1]:
from Pipeline import Pipeline
import warnings
warnings.simplefilter("ignore")

In [2]:
# possible args for comp_sent_clf_name: 'bow+xgboost', 'infersent+xgboost', 'elmo+linreg', 'bert'
# possible args for seq_labeller_name: None, 'lstmcrftagger', 'berttagger'
# if seq_labeller_name is None, keywords approach will be used
pl = Pipeline(comp_sent_clf_name='infersent+xgboost', seq_labeller_name='berttagger')

Initializing comparative sentences classifier
Initializing aspect classifier


In [4]:
obj_a, obj_b, obj_a_aspects, obj_b_aspects = pl.get_structured_answer("python", "java")
print("Result:")
print(f"{obj_a}: {obj_a_aspects}")
print(f"{obj_b}: {obj_b_aspects}")

Requesting Elasticsearch
Preparing sentences
Classifying comparative sentences
Sequence labelling
Result:
python: ['older', 'easier to read', 'easier to learn', 'easier', 'simpler', 'complex', 'easier to program in', 'quicker to write code']
java: ['faster', 'stronger']


In [4]:
obj_a, obj_b, obj_a_aspects, obj_b_aspects = pl.get_structured_answer("xbox", "play station")
print("Result:")
print(f"{obj_a}: {obj_a_aspects}")
print(f"{obj_b}: {obj_b_aspects}")

Requesting Elasticsearch
Preparing sentences
Classifying comparative sentences
Looking for keyphrases
Predicting good aspects
Result:
xbox: ['play', 'control', 'graphics', 'comparison', 'better situation', 'greater sales', 'form', 'powerful', 'console sales', 'offense', 'much healty']
play station: ['much', 'smart design', 'better target', 'free', 'reliable', 'much video games', 'superior machine', 'good old play', 'speed', 'free games', 'better graphics', 'cheaper', 'better deal', 'touch screen', 'liberation', 'resistance bs', 'bad system', 'fun', 'overall']


## Demo step by step

In [1]:
import pke
from pke.unsupervised import MultipartiteRank
import requests
from requests.auth import HTTPBasicAuth
import gensim
import gensim.downloader as api
import numpy as np
import pandas as pd
import pickle
import re
import joblib
import torch
from CompSentClf import CompSentClf
from seq_lab.SeqLabeller import SeqLabeller

### Input two objects to compare

In [2]:
obj_a = "python"
obj_b = "java"

In [3]:
obj_a = obj_a.lower().strip()
obj_b = obj_b.lower().strip()

## Look for sentences containing the requested objects in Elasticsearch

Fill in the user name and password.

In [4]:
def request_elasticsearch(obj_a, obj_b, user, password):
    url = 'http://ltdemos.informatik.uni-hamburg.de/depcc-index/_search?q='
    url += 'text:\"{}\"%20AND%20\"{}\"'.format(obj_a, obj_b)

    size = 10000
    
    url += '&from=0&size={}'.format(size)
    response = requests.get(url, auth=HTTPBasicAuth(user, password))
    return response
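
The query string above is assembled by hand, and only the spaces around `AND` are escaped explicitly. A minimal sketch of the same URL with the whole query percent-encoded through the standard library; `build_search_url` is a hypothetical helper and not part of the demo code:

```python
from urllib.parse import quote

# Base URL copied from request_elasticsearch above.
BASE_URL = 'http://ltdemos.informatik.uni-hamburg.de/depcc-index/_search'

def build_search_url(obj_a, obj_b, size=10000):
    """Hypothetical helper: build the same query with percent-encoding."""
    # Same query as above: text:"obj_a" AND "obj_b"
    query = 'text:"{}" AND "{}"'.format(obj_a, obj_b)
    return '{}?q={}&from=0&size={}'.format(BASE_URL, quote(query), size)

print(build_search_url('play station', 'xbox'))
```

This keeps multi-word objects such as "play station" intact inside the quoted phrase.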

In [5]:
# User name and password for Elasticsearch
name = "reader"
password = "reader"

In [6]:
json_compl = request_elasticsearch(obj_a, obj_b, name, password)

## Preparing sentences for the classifier

In [7]:
def extract_sentences(es_json, aggregate_duplicates=False):
    try:
        hits = es_json.json()['hits']['hits']
    except KeyError:
        return []
    sentences = []
    seen_sentences = set()
    for hit in hits:
        source = hit['_source']
        text = source['text']

        if not aggregate_duplicates:
            if (text.lower()) not in seen_sentences:
                seen_sentences.add(text.lower())
                sentences.append(text)
        else:
            sentences.append(text)

    return sentences

def remove_questions(sentences):
    # Drop every sentence containing a question mark; the list is modified in place
    sentences[:] = [sentence for sentence in sentences if '?' not in sentence]
    return sentences

def get_regEx(sequence):
    return re.compile('\\b{}\\b|\\b{}\\b'.format(re.escape(sequence), re.sub('[^a-zA-Z0-9 ]', ' ', sequence)), re.IGNORECASE)

def find_pos_in_sentence(sequence, sentence):
    regEx = get_regEx(sequence)
    match = regEx.search(sentence)
    if match is None:
        # Retry with punctuation replaced by spaces and runs of spaces collapsed
        match = regEx.search(re.sub(' +', ' ', re.sub('[^a-zA-Z0-9 ]', ' ', sentence)))
    return match.start() if match is not None else -1

def prepare_sentence_DF(sentences, obj_a, obj_b):
    # Order the two objects by where they first appear in each sentence
    temp_list = []
    for sentence in sentences:
        pos_a = find_pos_in_sentence(obj_a, sentence)
        pos_b = find_pos_in_sentence(obj_b, sentence)
        if pos_a < pos_b:
            temp_list.append([obj_a, obj_b, sentence])
        else:
            temp_list.append([obj_b, obj_a, sentence])
    sentence_df = pd.DataFrame.from_records(temp_list, columns=['object_a', 'object_b', 'sentence'])

    return sentence_df

In [8]:
all_sentences = extract_sentences(json_compl)
remove_questions(all_sentences)
prepared_sentences = prepare_sentence_DF(all_sentences, obj_a, obj_b)

In [9]:
prepared_sentences.head()

| | object_a | object_b | sentence |
|---|---|---|---|
| 0 | java | python | Software Terms: Java, Programming, Source Cod... |
| 1 | java | python | Java , Ruby, Python, Java, |
| 2 | java | python | As Java devours Python, Python also devours Java. |
| 3 | java | python | compare Java to Python, not Java + JVM to Python. |
| 4 | python | java | Python is not Java, and Java is not Python. |


## Using the classifier of comparative sentences
The classifier is used in the CAM system:  
Paper: https://arxiv.org/abs/1901.05041  
GitHub: https://github.com/uhh-lt/cam/  

The classifier takes the two compared objects and a sentence as input.
The output is one of 3 classes:
- NONE - the sentence contains no comparison
- BETTER - the first object in the sentence is better than the second
- WORSE - the first object in the sentence is worse than the second  

Paper: https://arxiv.org/abs/1809.06152

Models from the paper:
- BOW+XGBoost
- InferSent+XGBoost

The other options available:
- ELMo+LinReg
- BERT
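
Because BETTER and WORSE are defined relative to the order of the objects in the sentence, the label has to be read together with that order. A small sketch of this reading; `preferred_object` is a hypothetical helper for illustration, not a function of `CompSentClf`:

```python
def preferred_object(object_a, object_b, label):
    """Map a class label to the object the sentence favours (None if no comparison)."""
    if label == 'BETTER':
        return object_a   # the first object is better than the second
    if label == 'WORSE':
        return object_b   # the first object is worse, so the second is favoured
    return None           # 'NONE': the sentence contains no comparison

print(preferred_object('python', 'java', 'BETTER'))
print(preferred_object('python', 'java', 'WORSE'))
print(preferred_object('python', 'java', 'NONE'))
```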

In [10]:
model_name = "bow+xgboost" # 'bow+xgboost', 'infersent+xgboost', 'elmo+linreg', 'bert'

In [12]:
clf = CompSentClf(model_name)

In [13]:
classification_results = clf.classify_sentences(prepared_sentences)

We don't need the sentences without a comparison, so we filter them out:

In [18]:
prepared_sentences[classification_results['max'] != 'NONE']

| | object_a | object_b | sentence |
|---|---|---|---|
| 81 | python | java | Python isn't Java. |
| 104 | python | java | But Python isn't Java. |
| 144 | python | java | python than for java. |
| 221 | python | java | Python isn't Java and you shouldn't try to wri... |
| 296 | python | java | Python, instead of Java). |
| ... | ... | ... | ... |
| 5232 | java | python | (At this point, I know more Java than Python.) |
| 5274 | python | java | throw a Python RuntimeError instead of a Java ... |
| 5330 | java | python | Java is a whole lot more predictible than Python. |
| 5385 | java | python | net.degreedays.api in Java is equivalent to de... |


Combining the comparative sentences and the classification results into one dataframe:

In [19]:
comparative_sentences = prepared_sentences[classification_results['max'] != 'NONE'].copy()

In [20]:
# comparative_sentences is an explicit copy, so this assignment does not raise pandas' SettingWithCopyWarning
comparative_sentences['max'] = classification_results[classification_results['max'] != 'NONE']['max']


In [21]:
comparative_sentences

| | object_a | object_b | sentence | max |
|---|---|---|---|---|
| 81 | python | java | Python isn't Java. | BETTER |
| 104 | python | java | But Python isn't Java. | BETTER |
| 144 | python | java | python than for java. | BETTER |
| 221 | python | java | Python isn't Java and you shouldn't try to wri... | BETTER |
| 296 | python | java | Python, instead of Java). | BETTER |
| ... | ... | ... | ... | ... |
| 5232 | java | python | (At this point, I know more Java than Python.) | BETTER |
| 5274 | python | java | throw a Python RuntimeError instead of a Java ... | BETTER |
| 5330 | java | python | Java is a whole lot more predictible than Python. | BETTER |
| 5385 | java | python | net.degreedays.api in Java is equivalent to de... | BETTER |


## Getting aspects from gathered sentences

### Keywords approach
We join the comparative sentences into a single document and extract keyphrases from that document using PKE (MultipartiteRank).

In [23]:
text = prepared_sentences[classification_results['max'] != 'NONE']['sentence'].str.cat(sep=' ')

In [24]:
extractor = MultipartiteRank()
extractor.load_document(input=text, language="en", normalization='stemming')

extractor.candidate_selection(pos={'NOUN', 'PROPN', 'ADJ'})

extractor.candidate_weighting()

keyphrases = extractor.get_n_best(n=-1, stemming=False)

Here are our keyphrases with their scores:

In [25]:
keyphrases

[('python', 0.21973674959244296),
 ('java', 0.2042764442818003),
 ('faster', 0.014932241840394536),
 ('easier', 0.01261451070763693),
 ('slower', 0.009933512569054453),
 ('better', 0.009375573341228096),
 ('equivalent', 0.008914757023569454),
 ('java programs', 0.00867398313325556),
 ('many ways', 0.008535753833617968),
 ('popular', 0.008259124756198923),
 ('python code', 0.007928614854966968),
 ('language', 0.00783928454007348),
 ('closer', 0.007038364945303477),
 ('scala performance', 0.006716068876988277),
 ('older', 0.006564064343323661),
 ('syntax', 0.006443130202768127),
 ('ruby', 0.006339968885611265),
 ('times', 0.005936204106738509),
 ('hells', 0.00588300428189334),
 ('shorter', 0.005729272006915879),
 ('interpreted', 0.0056786424202054445),
 ('cool', 0.005491690978292079),
 ('much', 0.005415352088287819),
 ('longer time', 0.004801770093156908),
 ('harder', 0.0046633036637127165),
 ('java implementation', 0.004645191497805674),
 ('simpler syntax', 0.004635334947390272),
 ('fin

Most of the keyphrases don't look like aspects we need. To extract the needed aspects we use a classifier which is trained to find good aspects.

### Aspect classifier

Loading and preprocessing sentences for training the classifier

In [26]:
names = ["object_a", "object_b", "aspect", "most_frequent_rating", "sentence"]
df_train = pd.read_csv("classification_fine_grained/train_clf_fine_grained.csv", header=None, names=names)
df_test = pd.read_csv("classification_fine_grained/test_clf_fine_grained.csv", header=None, names=names)

In [27]:
def get_output_for_binary(data):
    return (data['most_frequent_rating'] != 'BAD').astype('float32').to_numpy()

y_train = get_output_for_binary(df_train)
y_test = get_output_for_binary(df_test)

To vectorise the sentences we use Word2Vec.  
The input of the classifier is a concatenation of 4 embeddings:
- object a embedding
- object b embedding
- aspect embedding
- sentence embedding  

For the sentence embedding we use the mean of the embeddings of its words.  
Since each Word2Vec vector has 300 dimensions, the classifier input is a vector of size 4 × 300 = 1200.
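
The concatenation can be sketched with toy 3-dimensional vectors in place of the Word2Vec model. Everything here is illustrative: `toy_vectors`, `mean_embedding` and `build_features` are hypothetical names, and averaging multi-word phrases or using zeros for unknown words are assumptions that may differ from what `W2VFeature` does.

```python
DIM = 3  # toy dimensionality; the Word2Vec vectors in the notebook have 300

# Hypothetical toy embeddings standing in for the Word2Vec model.
toy_vectors = {
    'python': [1.0, 0.0, 0.0],
    'java':   [0.0, 1.0, 0.0],
    'faster': [0.0, 0.0, 1.0],
    'is':     [0.5, 0.5, 0.0],
    'than':   [0.0, 0.5, 0.5],
}

def mean_embedding(text):
    """Average the vectors of the known words in `text` (zeros if none are known)."""
    vectors = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not vectors:
        return [0.0] * DIM
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def build_features(object_a, object_b, aspect, sentence):
    """Concatenate the object a, object b, aspect and sentence embeddings."""
    return (mean_embedding(object_a) + mean_embedding(object_b)
            + mean_embedding(aspect) + mean_embedding(sentence))

features = build_features('python', 'java', 'faster', 'Java is faster than Python')
print(len(features))  # 4 * DIM
```

With 300-dimensional vectors the same concatenation yields the 1200 features mentioned above.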

In [28]:
from w2v.w2v_feature import initialize_w2v, W2VFeature
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

In [29]:
# w2v_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
w2v_model = initialize_w2v()

In [31]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

def report_scores(model, X, y):
    y_pred = model.predict(X)
    acc = accuracy_score(y, y_pred)
    pr = precision_score(y, y_pred, average='weighted')
    # named `rec` so that it does not shadow the `re` module imported earlier
    rec = recall_score(y, y_pred, average='weighted')
    f1 = f1_score(y, y_pred, average='weighted')
    # per-class F1 in sorted label order: 0.0 (BAD) first, then 1.0 (GOOD)
    f1_bad, f1_good = f1_score(y, y_pred, average=None)
    print("Accuracy: {:.2f}".format(acc * 100))
    print("Precision: {:.2f}".format(pr * 100))
    print("Recall: {:.2f}".format(rec * 100))
    print("F1: {:.2f}".format(f1 * 100))
    print("F1 GOOD: {:.2f}".format(f1_good * 100))
    print("F1 BAD: {:.2f}".format(f1_bad * 100))

In [32]:
pl = make_pipeline(W2VFeature(w2v_model), SVC(kernel='linear', gamma='auto'))
fitted = pl.fit(df_train, y_train)
report_scores(fitted, df_test, y_test)

Accuracy: 81.91
Precision: 81.99
Recall: 81.91
F1: 81.89
F1 GOOD: 82.43
F1 BAD: 81.36


The classifier trained above is a Support Vector Classifier with a linear kernel. We save the fitted classifier step to a file and load it back:

In [33]:
filename = 'w2v/asp_clf.pkl'

In [48]:
joblib.dump(fitted[1], filename)

['w2v/asp_clf.pkl']

In [49]:
loaded_model = make_pipeline(W2VFeature(w2v_model), joblib.load(filename))

In [50]:
# Evaluation
print("Test")
report_scores(loaded_model, df_test, y_test)

Test
Accuracy: 81.91
Precision: 81.99
Recall: 81.91
F1: 81.89
F1 GOOD: 82.43
F1 BAD: 81.36


Now that we have a trained classifier, we can process our keyphrases.

To process the keyphrases we need a separate dataframe with sentences for each aspect (keyphrase).

In [51]:
forbidden_phrases = [obj_a, obj_b, 'better', 'worse']

# Collect one row per (sentence, keyphrase) pair and build the dataframe once;
# DataFrame.append was removed in pandas 2.0
asp_rows = []
for index, row in comparative_sentences.iterrows():
    sentence = row['sentence']
    for (keyphrase, score) in keyphrases:
        # Skip the compared objects themselves and generic comparison words
        if keyphrase in forbidden_phrases:
            continue
        if keyphrase in sentence:
            asp_rows.append(
                {'object_a': row['object_a'],
                 'object_b': row['object_b'],
                 'aspect': keyphrase,
                 'sentence': sentence,
                 'max': row['max'],
                })

asp_df = pd.DataFrame(asp_rows, columns=['object_a', 'object_b', 'aspect', 'sentence', 'max'])

Applying the classifier:

In [52]:
y_pred = loaded_model.predict(asp_df)

The aspects that remain after classification:

In [54]:
aspects = asp_df.iloc[np.nonzero(y_pred)[0].tolist()]['aspect'].unique()

In [54]:
aspects

array(['faster', 'easier', 'apps', 'syntax', 'simpler syntax', 'simpler',
       'performance', 'slower', 'easier ways'], dtype=object)

### Sequence labelling approach

In [22]:
sentences = prepared_sentences[classification_results['max'] != 'NONE']['sentence'].tolist()

In [29]:
seq_lab_name = 'berttagger'
seq_labeller = SeqLabeller(seq_lab_name)

In [30]:
words, preds = seq_labeller.get_labels(sentences)

In [33]:
asp_rows = []
aspects = set()

for i, sent in enumerate(words):
    for j, word in enumerate(sent):
        if preds[i][j] == 'B-PREDFULL':
            # Extend the aspect over the I-PREDFULL tokens that follow
            cur_asp = word
            for k in range(j + 1, len(sent)):
                if preds[i][k] == 'I-PREDFULL':
                    cur_asp = cur_asp + ' ' + sent[k]
                else:
                    break
            # Lowercase once so that the dataframe matches the entries in `aspects`
            cur_asp = cur_asp.lower()
            aspects.add(cur_asp)
            row = comparative_sentences.iloc[i]
            asp_rows.append(
                    {'object_a': row['object_a'],
                     'object_b': row['object_b'],
                     'aspect': cur_asp,
                     'sentence': row['sentence'],
                     'max': row['max'],
                    })

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
asp_df = pd.DataFrame(asp_rows, columns=['object_a', 'object_b', 'aspect', 'sentence', 'max'])
aspects = list(aspects)

In [34]:
aspects

['simpler',
 'easier',
 'easier to program in',
 'complex',
 'easier to learn',
 'faster',
 'older',
 'stronger',
 'easier to read',
 'quicker to write code']
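
The span collection used in this approach (a `B-PREDFULL` token followed by any number of `I-PREDFULL` tokens) can be sketched as a standalone function. `extract_spans` is a hypothetical helper, and the `B-OBJ` and `O` tags in the example are assumed for illustration; only the `PREDFULL` tags appear in the demo code.

```python
def extract_spans(tokens, tags, label='PREDFULL'):
    """Collect token spans tagged B-<label> followed by any number of I-<label>."""
    spans = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == 'B-' + label:
            # A new span starts; close the previous one if it is still open
            if current:
                spans.append(' '.join(current))
            current = [token]
        elif tag == 'I-' + label and current:
            current.append(token)
        else:
            # Any other tag ends the open span
            if current:
                spans.append(' '.join(current))
            current = []
    if current:
        spans.append(' '.join(current))
    return spans

tokens = ['Python', 'is', 'easier', 'to', 'learn', 'than', 'Java']
tags = ['B-OBJ', 'O', 'B-PREDFULL', 'I-PREDFULL', 'I-PREDFULL', 'O', 'B-OBJ']
print(extract_spans(tokens, tags))  # ['easier to learn']
```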

## Determining winner

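First, we need to specify which aspects belong to which object. An aspect is assigned to the object that appears first in the first sentence containing that aspect.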

In [35]:
obj_a_aspects = []
obj_b_aspects = []
for aspect in aspects:
    rows = asp_df[asp_df['aspect']==aspect]
    if obj_a == rows.iloc[0]['object_a']:
        obj_a_aspects.append(aspect)
    else:
        obj_b_aspects.append(aspect)

In [36]:
obj_a_aspects

['simpler',
 'easier',
 'easier to program in',
 'complex',
 'easier to learn',
 'older',
 'easier to read',
 'quicker to write code']

In [37]:
obj_b_aspects

['faster', 'stronger']

The winner of the comparison is the object with more aspects. If both objects have the same number of aspects, the code below picks the second object.

In [98]:
comparing_pair = {}

In [99]:
if len(obj_a_aspects) > len(obj_b_aspects):
    comparing_pair['winner_aspects'] = obj_a_aspects
    comparing_pair['loser_aspects'] = obj_b_aspects
    comparing_pair['winner'] = obj_a
    comparing_pair['loser'] = obj_b
else:
    comparing_pair['winner_aspects'] = obj_b_aspects
    comparing_pair['loser_aspects'] = obj_a_aspects
    comparing_pair['winner'] = obj_b
    comparing_pair['loser'] = obj_a
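
The same decision can be wrapped in a function so that it is easy to reuse and check. This is a sketch only: `pick_winner` is a hypothetical name, and it mirrors the cell above, including the tie going to the second object.

```python
def pick_winner(obj_a, obj_a_aspects, obj_b, obj_b_aspects):
    """Return the comparison result; ties go to obj_b, as in the cell above."""
    if len(obj_a_aspects) > len(obj_b_aspects):
        winner, loser = (obj_a, obj_a_aspects), (obj_b, obj_b_aspects)
    else:
        winner, loser = (obj_b, obj_b_aspects), (obj_a, obj_a_aspects)
    return {
        'winner': winner[0], 'winner_aspects': winner[1],
        'loser': loser[0], 'loser_aspects': loser[1],
    }

result = pick_winner('python', ['simpler', 'easier', 'older'], 'java', ['faster', 'stronger'])
print(result['winner'], result['winner_aspects'])
```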

## Generating response

Using templates

In [62]:
from template_generation.template_generation import generate_template

In [63]:
generate_template(comparing_pair, mode="extended")

'after much thought, I realized that  python is better, because: compile, better language, easier, syntax, simpler syntax, apps, simpler, quicker, performance, slower, easier ways. But you should know that java is: faster, comparable, productivity, bad'

Getting a brief summary using TextRank.

In [64]:
# gensim.summarization was removed in gensim 4.0, so these imports need gensim < 4.0
from gensim.summarization.textcleaner import split_sentences
from gensim.summarization.summarizer import summarize

In [65]:
rows = asp_df[asp_df.aspect.isin(aspects)]

In [66]:
sentences = ""
# Iterate over the filtered rows (not over asp_df) and skip sentences that were already added
for sentence in rows['sentence']:
    sentence = sentence + " "
    if sentence not in sentences:
        sentences += sentence

In [67]:
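if len(split_sentences(sentences)) > 10:
    summary = str(summarize(sentences, split=False, word_count=30))
else:
    # Too few sentences to summarize: fall back to the sentences themselves
    # so that `summary` is always defined
    summary = sentences.strip()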

In [68]:
print(summary)

Python runs slower than Java .
Python isn't just Java without the compile .
much slower in python than in java.
much slower in python than in java.


In [55]:
from Demo import one_liner
import gensim.downloader as api



You can test the demo with the one-liner below (it only requires the w2v model).  
response - a sentence listing the aspects of the compared objects, generated using templates  
summary - a brief summary of the sentences gathered from Elasticsearch

In [56]:
# w2v_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# w2v_model = api.load('word2vec-google-news-300')

In [None]:
obj_a = "play station"
obj_b = "xbox"
user = "reader" # username in Elasticsearch
password = "reader" # password in Elasticsearch

response, summary = one_liner(obj_a, obj_b, user, password, w2v_model)

In [49]:
response

'i came to the conclusion that play station is better, because: much, ill, fun abilities, useful, fun, smart design, better target, free, reliable, much video games, bumpers, rubbish, price tag, price, investors, current market dominance, apprehension, order, powerful consoles, candy, stocking, solder toys, bit bigger, x2 inches, numbers, works, cheaper, free games, better deal, better graphics, touch screen, coarse flipping, resistance bs, liberation, bad system, overall. But it will be useful for you to know that xbox is: play, control, graphics, comparison, better situation, greater sales, form, sale, last, bluray drive, disk, trading games, powerful, console sales, size, cheap, best, hard, secure'

In [50]:
summary

"One great feature on this then, is the free play station network, which is also much more reliable than Xbox LIVE.\nPersonally I prefer the Xbox controller and I've heard that live is a more complete online experience than what play station offers."