# Comparative question answering demo

To run the demo, the Word2Vec embeddings file (`GoogleNews-vectors-negative300.bin`) must be in the same folder as this notebook.

External resources required:

In [None]:
!python3 -m nltk.downloader stopwords
!python3 -m nltk.downloader universal_tagset
!python3 -m nltk.downloader wordnet
!python3 -m spacy download en

In [1]:
from utils.sentence_clearer import clear_sentences, remove_questions
from ml_approach.sentence_preparation_ML import prepare_sentence_DF
from ml_approach.classify import classify_sentences
from utils.es_requester import extract_sentences
from utils.objects import Argument

import pke
from pke.unsupervised import MultipartiteRank
import requests
from requests.auth import HTTPBasicAuth
import gensim
import numpy as np



### Input two objects to compare

In [2]:
obj_a = "python"
obj_b = "java"

In [3]:
obj_a = Argument(obj_a.lower().strip())
obj_b = Argument(obj_b.lower().strip())

## Look for sentences containing the requested objects in Elasticsearch

Fill in `user` and `password` in the function below.

In [4]:
def request_elasticsearch(obj_a, obj_b):
    user = ''
    password = ''
    url = 'http://ltdemos.informatik.uni-hamburg.de/depcc-index/_search?q='
    url += 'text:\"{}\"%20AND%20\"{}\"'.format(obj_a.name, obj_b.name)

    size = 10000
    
    url += '&from=0&size={}'.format(size)
    response = requests.get(url, auth=HTTPBasicAuth(user, password))
    return response
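
The query string above is assembled by hand with manual `%20` escapes. Below is a minimal sketch of the same URL built with the standard library's `urlencode`, which takes care of the escaping. The helper name `build_search_url` is hypothetical and not part of the demo's code; the sketch only builds the URL and sends no request.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'http://ltdemos.informatik.uni-hamburg.de/depcc-index/_search'


def build_search_url(name_a, name_b, size=10000):
    """Return the search URL for sentences that mention both object names."""
    query = 'text:"{}" AND "{}"'.format(name_a, name_b)
    params = {'q': query, 'from': 0, 'size': size}
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, uses '+')
    return BASE_URL + '?' + urlencode(params, quote_via=quote)


print(build_search_url('python', 'java', size=10))
```

Unlike the hand-built string, this version also percent-encodes the colon and the double quotes of the query.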

In [5]:
json_compl = request_elasticsearch(obj_a, obj_b)

## Preparing sentences for the classifier

In [6]:
all_sentences = extract_sentences(json_compl)
remove_questions(all_sentences)
prepared_sentences = prepare_sentence_DF(all_sentences, obj_a, obj_b)

In [7]:
prepared_sentences.head()

| | object_a | object_b | sentence |
|---|---|---|---|
| 0 | java | python | Software Terms: Java, Programming, Source Cod... |
| 1 | java | python | Java , Ruby, Python, Java, |
| 2 | java | python | As Java devours Python, Python also devours Java. |
| 3 | java | python | compare Java to Python, not Java + JVM to Python. |
| 4 | python | java | Python is not Java, and Java is not Python. |


## Using the classifier of comparative sentences
The classifier is used in the CAM system:  
Paper: https://arxiv.org/abs/1901.05041  
GitHub: https://github.com/uhh-lt/cam/  

The classifier takes the 2 compared objects and a sentence as input.
The output is one of 3 classes:
- NONE - the sentence contains no comparison
- BETTER - the first object in the sentence is better than the second
- WORSE - the first object in the sentence is worse than the second  

Paper: https://arxiv.org/abs/1809.06152
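
As a rough illustration of how a per-class probability row maps to one of the three labels, here is a minimal sketch. It assumes that the `max` column produced by `classify_sentences` holds the most probable class, which is an assumption about the helper rather than something confirmed here. `predicted_label` is a hypothetical name.

```python
def predicted_label(probabilities):
    """Return the class name with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Row 144 uses probabilities from the python/java run in this notebook;
# row 7 is a made-up example of a non-comparative sentence.
scores = {
    144: {'BETTER': 0.677115, 'NONE': 0.246707, 'WORSE': 0.076178},
    7: {'BETTER': 0.10, 'NONE': 0.85, 'WORSE': 0.05},
}

labels = {index: predicted_label(probs) for index, probs in scores.items()}
# keep only the sentences that contain a comparison
comparative = {index: label for index, label in labels.items() if label != 'NONE'}
print(comparative)
```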


In [8]:
classification_results = classify_sentences(prepared_sentences, 'bow')



We don't need the sentences without a comparison, so we keep only the rows whose class is not NONE.

In [9]:
classification_results[classification_results['max'] != 'NONE']

| | BETTER | NONE | WORSE | max |
|---|---|---|---|---|
| 144 | 0.677115 | 0.246707 | 0.076178 | BETTER |
| 296 | 0.626316 | 0.319415 | 0.054270 | BETTER |
| 459 | 0.562257 | 0.225554 | 0.212190 | BETTER |
| 464 | 0.850893 | 0.119348 | 0.029760 | BETTER |
| 481 | 0.850893 | 0.119348 | 0.029760 | BETTER |
| ... | ... | ... | ... | ... |
| 5232 | 0.635913 | 0.297547 | 0.066540 | BETTER |
| 5274 | 0.626316 | 0.319415 | 0.054270 | BETTER |
| 5309 | 0.626316 | 0.319415 | 0.054270 | BETTER |
| 5328 | 0.902701 | 0.057767 | 0.039532 | BETTER |


In [10]:
prepared_sentences[classification_results['max'] != 'NONE']

| | object_a | object_b | sentence |
|---|---|---|---|
| 144 | python | java | python than for java. |
| 296 | python | java | Python, instead of Java). |
| 459 | java | python | Java was compared to Python. |
| 464 | java | python | Java 8X Faster than Python |
| 481 | java | python | Java 5.3X Faster than Python |
| ... | ... | ... | ... |
| 5232 | java | python | (At this point, I know more Java than Python.) |
| 5274 | python | java | throw a Python RuntimeError instead of a Java ... |
| 5309 | java | python | I'm writing this project in Java instead of Py... |
| 5328 | java | python | Java is a whole lot more predictible than Python. |


Combining the comparative sentences and the classification results into one DataFrame

In [11]:
# .copy() gives an independent DataFrame, so adding a column below does not
# trigger pandas' SettingWithCopyWarning
comparative_sentences = prepared_sentences[classification_results['max'] != 'NONE'].copy()

In [12]:
comparative_sentences['max'] = classification_results.loc[classification_results['max'] != 'NONE', 'max']


In [13]:
comparative_sentences

| | object_a | object_b | sentence | max |
|---|---|---|---|---|
| 144 | python | java | python than for java. | BETTER |
| 296 | python | java | Python, instead of Java). | BETTER |
| 459 | java | python | Java was compared to Python. | BETTER |
| 464 | java | python | Java 8X Faster than Python | BETTER |
| 481 | java | python | Java 5.3X Faster than Python | BETTER |
| ... | ... | ... | ... | ... |
| 5232 | java | python | (At this point, I know more Java than Python.) | BETTER |
| 5274 | python | java | throw a Python RuntimeError instead of a Java ... | BETTER |
| 5309 | java | python | I'm writing this project in Java instead of Py... | BETTER |
| 5328 | java | python | Java is a whole lot more predictible than Python. | BETTER |


## Getting aspects from gathered sentences

### Keywords approach
We join the comparative sentences into a single document and extract keyphrases from that document using PKE (MultipartiteRank).

In [14]:
text = prepared_sentences[classification_results['max'] != 'NONE']['sentence'].str.cat(sep=' ')

In [15]:
extractor = MultipartiteRank()
extractor.load_document(input=text, language="en", normalization='stemming')

extractor.candidate_selection(pos={'NOUN', 'PROPN', 'ADJ'})

extractor.candidate_weighting()

keyphrases = extractor.get_n_best(n=-1, stemming=False)

Here are the extracted keyphrases with their scores:

In [16]:
keyphrases

[('python', 0.21740087945740835),
 ('java', 0.20172378798161344),
 ('faster', 0.02199726877605206),
 ('easier', 0.015883211094384012),
 ('way', 0.013088053043287619),
 ('python language evolution', 0.010611863513849734),
 ('slower', 0.010320595937476522),
 ('ruby', 0.010157464142227231),
 ('better', 0.010148856677091601),
 ('java programs', 0.009231769250017077),
 ('closer', 0.00888508329537028),
 ('older', 0.008628973281021765),
 ('scala performance', 0.008472943343755849),
 ('apps', 0.007675049092989409),
 ('shorter', 0.007367731815362034),
 ('simpler syntax', 0.0061869071317006245),
 ('longer time', 0.006179509018187798),
 ('concurrency', 0.005954445648699867),
 ('market share', 0.005928308780750451),
 ('popular', 0.005912019937246259),
 ('syntax', 0.005815970756635444),
 ('old python book', 0.005602751898790517),
 ('times', 0.005590342170612456),
 ('hell', 0.0055837438352847255),
 ('rapid application development', 0.0055411411414709975),
 ('worse', 0.005518487496112234),
 ('code sa [output truncated]

Most of the keyphrases don't look like the aspects we need. To pick out suitable aspects, we use a classifier trained to recognize good aspects.

## Aspect classifier

Loading and preprocessing sentences for training the classifier

In [17]:
import pandas as pd

names = ["OBJECT A", "OBJECT B", "ASPECT", "MOST FREQUENT RATING", "SENTENCE"]
df_train = pd.read_csv("classification_fine_grained/train_clf_fine_grained.csv", header=None, names=names)
df_test = pd.read_csv("classification_fine_grained/test_clf_fine_grained.csv", header=None, names=names)
df_dev = pd.read_csv("classification_fine_grained/dev_clf_fine_grained.csv", header=None, names=names)

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def get_list_of_tokens(df_texts):
    """Tokenize each SENTENCE: strip punctuation, drop stop words, lemmatize."""
    stop_words = set(stopwords.words('english'))
    wordnet_lemmatizer = WordNetLemmatizer()
    # translation table that maps every punctuation character to a space
    punct_to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    tokens = []
    for row in df_texts["SENTENCE"].values:
        # split() already collapses runs of whitespace
        words = row.translate(punct_to_space).split()
        # remove stop words (case-sensitive match), then lemmatize the rest
        tokens.append([wordnet_lemmatizer.lemmatize(word)
                       for word in words if word not in stop_words])
    return tokens

tokens_test = get_list_of_tokens(df_test)
tokens_train = get_list_of_tokens(df_train)
tokens_dev = get_list_of_tokens(df_dev)

df_train['TOKENS'] = pd.Series(tokens_train)
df_dev['TOKENS'] = pd.Series(tokens_dev)
df_test['TOKENS'] = pd.Series(tokens_test)

To vectorise the sentences we use Word2Vec.  
The input of the classifier is a concatenation of 4 embeddings:
- object a embedding
- object b embedding
- aspect embedding
- sentence embedding  

Each embedding is the mean of the Word2Vec vectors of its words; words missing from the vocabulary are skipped.  
Since each Word2Vec vector has 300 dimensions, the classifier input is a vector of size 4 × 300 = 1200.
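
The size arithmetic can be checked on a toy example. The sketch below uses plain Python lists and a made-up 3-dimensional vocabulary instead of the real 300-dimensional Word2Vec model, so every name and value in it is illustrative only.

```python
def mean_vector(vectors, dim):
    """Average a list of equal-length vectors; return zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed(words, vocab, dim):
    """Mean embedding of the words found in vocab (unknown words are skipped)."""
    return mean_vector([vocab[word] for word in words if word in vocab], dim)


DIM = 3  # toy dimensionality; the real Word2Vec vectors have 300
toy_vocab = {
    'python': [1.0, 0.0, 0.0],
    'java': [0.0, 1.0, 0.0],
    'faster': [0.0, 0.0, 1.0],
    'is': [1.0, 1.0, 1.0],
}

# list concatenation: object a + object b + aspect + sentence
features = (embed(['python'], toy_vocab, DIM)
            + embed(['java'], toy_vocab, DIM)
            + embed(['faster'], toy_vocab, DIM)
            + embed(['java', 'is', 'faster', 'than', 'python'], toy_vocab, DIM))
print(len(features))  # 4 * DIM
```

With 300-dimensional vectors the same concatenation yields 4 × 300 = 1200 features per row.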

In [18]:
w2v_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [19]:
def create_sentence_embeddings(model, words_list):
    sentence_embedding = []
    for word in words_list:
        try:
            sentence_embedding.append(model[word])
        except KeyError:
            continue
#             print(word + " is not in the vocabulary, skipping...")
    if len(sentence_embedding) == 0:
        sentence_embedding.append(np.zeros(300))
    return np.array(sentence_embedding)

def to_w2v_matrix(df_data, model):
    sent_embs = np.zeros([df_data.shape[0], 300 * 4], dtype='float32')
    for i in range(df_data.shape[0]):
        object_a_embedding = create_sentence_embeddings(model, df_data["OBJECT A"][i].split()).mean(axis=0)
        object_b_embedding = create_sentence_embeddings(model, df_data["OBJECT B"][i].split()).mean(axis=0)
        aspect_embedding = create_sentence_embeddings(model, df_data["ASPECT"][i].split()).mean(axis=0)
        sentence_embedding = create_sentence_embeddings(model, df_data["TOKENS"][i]).mean(axis=0)
        sent_embs[i, :] = np.concatenate((object_a_embedding, object_b_embedding, aspect_embedding, sentence_embedding), axis=0)
    return sent_embs

X_train = to_w2v_matrix(df_train, w2v_model)
X_dev = to_w2v_matrix(df_dev, w2v_model)
X_test = to_w2v_matrix(df_test, w2v_model)

In [20]:
def get_output_for_binary(data):
    return (data['MOST FREQUENT RATING'] != 'BAD').astype('float32').to_numpy()

y_train = get_output_for_binary(df_train)
y_dev = get_output_for_binary(df_dev)
y_test = get_output_for_binary(df_test)

In [21]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

def report_scores(model, X, y):
    y_pred = model.predict(X)
    acc = accuracy_score(y, y_pred)
    pr = precision_score(y, y_pred, average='weighted')
    re = recall_score(y, y_pred, average='weighted')
    f1 = f1_score(y, y_pred, average='weighted')
    f1_bad, f1_good = f1_score(y, y_pred, average=None)
    print("Accuracy: {:.2f}".format(acc * 100))
    print("Precision: {:.2f}".format(pr * 100))
    print("Recall: {:.2f}".format(re * 100))
    print("F1: {:.2f}".format(f1 * 100))
    print("F1 GOOD: {:.2f}".format(f1_good * 100))
    print("F1 BAD: {:.2f}".format(f1_bad * 100))
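
To make the reported numbers easier to interpret, here is a minimal pure-Python sketch of what the per-class and weighted F1 scores measure. The labels and predictions are made up for illustration, and the helper names are hypothetical. In `report_scores`, `f1_score(..., average=None)` returns one score per class in sorted label order, so with labels 0.0 (BAD) and 1.0 (GOOD) the first value belongs to BAD.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, computed from its TP/FP/FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    labels = sorted(set(y_true))
    total = sum(f1_for_class(y_true, y_pred, label) * y_true.count(label)
                for label in labels)
    return total / len(y_true)


# made-up example: 1 = GOOD, 0 = BAD
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
print(f1_for_class(y_true, y_pred, 0), f1_for_class(y_true, y_pred, 1))
print(weighted_f1(y_true, y_pred))
```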

Training the classifier: Support Vector Classifier with a linear kernel.

In [22]:
from sklearn.svm import SVC

# gamma is not used by the linear kernel, so it is omitted here
model = SVC(kernel='linear')

print("start of fit")
model.fit(X_train, y_train)

# Evaluation
print("Train")
report_scores(model, X_train, y_train)

print("Dev")
report_scores(model, X_dev, y_dev)

start of fit
Train
Accuracy: 96.41
Precision: 96.44
Recall: 96.41
F1: 96.41
F1 GOOD: 96.21
F1 BAD: 96.59
Dev
Accuracy: 78.74
Precision: 78.72
Recall: 78.74
F1: 78.72
F1 GOOD: 80.48
F1 BAD: 76.67


In [23]:
# Evaluation
print("Test")
report_scores(model, X_test, y_test)

Test
Accuracy: 81.91
Precision: 81.99
Recall: 81.91
F1: 81.89
F1 GOOD: 82.43
F1 BAD: 81.36


Now we have trained a classifier and are going to process our keyphrases.

To process the keyphrases we need a separate dataframe with sentences for each aspect (keyphrase).

In [24]:
# object names and generic comparison words are not useful as aspects
forbidden_phrases = {obj_a.name, obj_b.name, 'better', 'worse'}

# One row per (comparative sentence, keyphrase) pair where the keyphrase occurs
# in the sentence. Rows are collected in a list and converted once at the end,
# because DataFrame.append was removed in pandas 2.0.
asp_rows = []
for index, row in comparative_sentences.iterrows():
    sentence = row['sentence']
    for (keyphrase, score) in keyphrases:
        if keyphrase in forbidden_phrases:
            continue
        # plain, case-sensitive substring test
        if keyphrase in sentence:
            asp_rows.append(
                {'OBJECT A': row['object_a'],
                 'OBJECT B': row['object_b'],
                 'ASPECT': keyphrase,
                 'SENTENCE': row['sentence'],
                 'max': row['max'],
                })

asp_df = pd.DataFrame(asp_rows, columns=['OBJECT A', 'OBJECT B', 'ASPECT', 'SENTENCE', 'max'])
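
Note that `keyphrase in sentence` is a case-sensitive substring test: it matches `way` inside `always` and misses `Faster` when the keyphrase is `faster`. The sketch below shows one possible alternative, a case-insensitive whole-word match. It is not what the notebook does, and switching to it would change which rows end up in `asp_df`. The function name `contains_phrase` and the first example sentence are made up for illustration.

```python
import re


def contains_phrase(sentence, phrase):
    """True if phrase occurs in sentence as whole words, ignoring case."""
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


# substring test vs. whole-word test
print('way' in 'Java is always verbose')                      # substring hit
print(contains_phrase('Java is always verbose', 'way'))       # no whole-word hit
print('faster' in 'Java 8X Faster than Python')               # missed: case differs
print(contains_phrase('Java 8X Faster than Python', 'faster'))
```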

In [25]:
asp_df['TOKENS'] = pd.Series(get_list_of_tokens(asp_df))

In [26]:
X_asp = to_w2v_matrix(asp_df, w2v_model)

Applying the classifier

In [27]:
y_pred = model.predict(X_asp)

The aspects that remain after classification

In [28]:
aspects = asp_df.iloc[np.nonzero(y_pred)[0].tolist()]['ASPECT'].unique()

In [29]:
aspects

array(['faster', 'easier', 'apps', 'simpler syntax', 'syntax', 'simpler',
       'performance', 'slower', 'easier ways'], dtype=object)

Top 10 keyphrases for comparison

In [30]:
keyphrases[:10]

[('python', 0.21740087945740835),
 ('java', 0.20172378798161344),
 ('faster', 0.02199726877605206),
 ('easier', 0.015883211094384012),
 ('way', 0.013088053043287619),
 ('python language evolution', 0.010611863513849734),
 ('slower', 0.010320595937476522),
 ('ruby', 0.010157464142227231),
 ('better', 0.010148856677091601),
 ('java programs', 0.009231769250017077)]

## Determining winner

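First, we need to decide which aspects belong to which object. An aspect is assigned to the first object if that object is OBJECT A in the first sentence containing the aspect; otherwise it is assigned to the second object.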

In [31]:
obj_a_aspects = []
obj_b_aspects = []
for aspect in aspects:
    rows = asp_df[asp_df['ASPECT']==aspect]
    if obj_a.name == rows.iloc[0]['OBJECT A']:
        obj_a_aspects.append(aspect)
    else:
        obj_b_aspects.append(aspect)

In [32]:
obj_a_aspects

['easier',
 'apps',
 'simpler syntax',
 'syntax',
 'simpler',
 'performance',
 'slower',
 'easier ways']

In [33]:
obj_b_aspects

['faster']

The winner of the comparison is the object with more aspects. If both objects have the same number of aspects, the second object wins.

In [34]:
comparing_pair = {}

In [35]:
if len(obj_a_aspects) > len(obj_b_aspects):
    comparing_pair['winner_aspects'] = obj_a_aspects
    comparing_pair['loser_aspects'] = obj_b_aspects
    comparing_pair['winner'] = obj_a.name
    comparing_pair['loser'] = obj_b.name
else:
    comparing_pair['winner_aspects'] = obj_b_aspects
    comparing_pair['loser_aspects'] = obj_a_aspects
    comparing_pair['winner'] = obj_b.name
    comparing_pair['loser'] = obj_a.name
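
The same rule can be wrapped in a small function, which makes the tie behaviour easy to check. This is a sketch only: `pick_winner` is a hypothetical helper, not part of the demo's code, and the aspect lists are the ones obtained above for python and java.

```python
def pick_winner(name_a, aspects_a, name_b, aspects_b):
    """Return the comparison dict; the object with more aspects wins."""
    if len(aspects_a) > len(aspects_b):
        winner, loser = (name_a, aspects_a), (name_b, aspects_b)
    else:
        # ties fall through to the second object, as in the cell above
        winner, loser = (name_b, aspects_b), (name_a, aspects_a)
    return {
        'winner': winner[0],
        'winner_aspects': winner[1],
        'loser': loser[0],
        'loser_aspects': loser[1],
    }


python_aspects = ['easier', 'apps', 'simpler syntax', 'syntax', 'simpler',
                  'performance', 'slower', 'easier ways']
java_aspects = ['faster']
result = pick_winner('python', python_aspects, 'java', java_aspects)
print(result['winner'], len(result['winner_aspects']))
```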

## Generating response

Using templates

In [36]:
from template_generation.template_generation import generate_template

In [37]:
generate_template(comparing_pair, mode="extended")

'after much thought, I realized that  python is better, because: first, easier, second, apps, third, simpler syntax, fourth, syntax, fifth, simpler, sixth, performance, seventh, slower, eighth, easier ways. But it will be useful for you to know that java is: faster'

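Getting a brief summary using TextRank. This step relies on gensim's `summarization` module, which was removed in gensim 4.0, so it needs gensim 3.x.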

In [38]:
from gensim.summarization.textcleaner import split_sentences
from gensim.summarization.summarizer import summarize

In [39]:
rows = asp_df[asp_df.ASPECT.isin(aspects)]

In [40]:
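sentences = ""
for row in range(rows.shape[0]):
    # NOTE: this reads from asp_df, not from the filtered `rows`, so sentences of
    # rejected aspects can end up in the text; use rows.iloc[row] to restrict it
    sentence = asp_df.iloc[row]['SENTENCE'] + " "
    # skip sentences that are already part of the accumulated text
    if sentence not in sentences:
        sentences += sentence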
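
The loop above removes duplicates with a substring test on the accumulated string, which also drops any sentence that happens to be contained in earlier text. A minimal sketch of an alternative that compares whole sentences and keeps their first-seen order is shown below. `join_unique` is a hypothetical helper, and the input sentences are taken from this notebook's summary output.

```python
def join_unique(sentences):
    """Join sentences with spaces, keeping only the first copy of each."""
    # dict.fromkeys preserves insertion order (Python 3.7+)
    return ' '.join(dict.fromkeys(sentences))


collected = [
    'Python runs slower than Java .',
    'Python is actually older than Java.',
    'Python runs slower than Java .',
]
print(join_unique(collected))
```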

In [41]:
if len(split_sentences(sentences)) > 10:
    summary = str(summarize(sentences, split=False, word_count=30))
else:
    # too few sentences to summarize; fall back to the collected text
    summary = sentences

In [42]:
print(summary)

Python runs slower than Java .
Java has way more market penetration than Python.
Python is actually older than Java.
Java is faster, while python is .
