# Search

__Question-answering system:__
* Extract the words from the question.
* Look up the words in the keyword database.
* Extract similar keywords.
* For all of these keywords, extract the sentence IDs. These are the candidates.
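
The steps above can be sketched with an in-memory stand-in for the keyword database. The `keyword_index` and `similar_keywords` dictionaries, the function name `gather_candidates` and the sentence IDs are made up for illustration; they are not the actual MongoDB collections. The point to notice is that a sentence ID is listed once per matching keyword, which the ranking step relies on.

```python
from collections import Counter

# Hypothetical stand-ins for the keyword database (assumed data, not the real collection).
keyword_index = {
    'topic': ['s1', 's2'],
    'focus': ['s2', 's3'],
    'theme': ['s4'],
}
similar_keywords = {'topic': ['theme']}


def gather_candidates(words):
    """Return sentence IDs for the words and their similar keywords, with repeats."""
    expanded = list(words)
    for w in words:
        expanded += similar_keywords.get(w, [])
    candidates = []
    for w in expanded:
        # Words missing from the index simply contribute nothing.
        candidates += keyword_index.get(w, [])
    return candidates


candidates = gather_candidates(['topic', 'focus'])
# 's2' matches both 'topic' and 'focus', so it is listed twice.
print(Counter(candidates).most_common(1))
```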


Ranking:
* If a sentence occurs more than once in the candidate list, it contains more than one keyword, so start with the sentence that occurs most often. If no sentence is repeated, or if there is a tie, start with the sentence with the highest score. Note: the score has to be at least 0.5; otherwise the sentence is too likely to be uninformative.
* Eliminate all candidates that are too similar to the start sentence.
* Find the sentence with the highest score among the remaining candidates.
* Repeat until the desired length is reached or there are no more candidates left.
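
A minimal sketch of this greedy selection, assuming the scores and the "too similar" relation are already available in memory. The `scores` and `too_similar` dictionaries and the function name `rank` are hypothetical; as a simplification, the sketch applies the same criterion in every round (keyword count first, score as tie-breaker).

```python
from collections import Counter

# Hypothetical data: sentence scores and pairs considered "too similar".
scores = {'s1': 0.9, 's2': 0.6, 's3': 0.8, 's4': 0.4}
too_similar = {'s2': {'s3'}, 's3': {'s2'}}


def rank(candidates, max_sentences=3, min_score=0.5):
    """Greedily pick sentences by keyword count, then score, skipping near-duplicates."""
    # Sentences below the score threshold are never considered.
    counts = Counter(c for c in candidates if scores[c] >= min_score)
    selected = []
    while counts and len(selected) < max_sentences:
        # Most keyword hits first; the score breaks ties.
        best = max(counts, key=lambda c: (counts[c], scores[c]))
        selected.append(best)
        # Remove the chosen sentence and everything too similar to it.
        for c in {best} | too_similar.get(best, set()):
            counts.pop(c, None)
    return selected


print(rank(['s1', 's2', 's2', 's3', 's4']))
```

Here `s2` is picked first because it matches two keywords, `s3` is then dropped as too similar to it, and `s4` never enters because its score is below 0.5.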


Out-of-vocabulary (OOV) words in the question:
* Look up the 10 most similar words in the fastText model and check whether these words are in the keyword database.
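
The fallback can be illustrated without loading a model. In the sketch below, the toy two-dimensional `vectors`, the `known_keywords` set and the function names are assumptions for illustration only; in the notebook the nearest neighbours would come from the fastText model rather than from this hand-written cosine ranking.

```python
import math

# Toy 2-d vectors standing in for the fastText model (assumed values).
vectors = {
    'verb': (1.0, 0.1),
    'verbs': (0.9, 0.2),
    'noun': (0.1, 1.0),
}
known_keywords = {'verb', 'noun'}  # hypothetical keyword database contents


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def fallback_keywords(oov_vector, topn=10):
    """Return the topn nearest words that also occur in the keyword database."""
    nearest = sorted(vectors, key=lambda w: cosine(oov_vector, vectors[w]), reverse=True)
    return [w for w in nearest[:topn] if w in known_keywords]


# A vector that a subword model might produce for an unseen word form.
print(fallback_keywords((0.95, 0.15), topn=2))
```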

Special types of questions:
* Who wrote X/about X?
* What does X say about Y?

The first kind has to return authors + title + 'ling.auf.net/' + url. Look up X in the keyword list in the papers database. A possible problem: my keyword list is raw, so exact matches will be needed. One idea for getting better results is to run the keyword list through is_english_sentence.  
The second kind has to look up the author in the papers database, identify the papers with the right keywords, and run the search on the sentences from those papers only.

In [1]:
import re
import string
from gensim.models.wrappers import FastText
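import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from pymongo import MongoClient
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# The stop_words module is no longer public; the list is exposed via feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS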
from nltk.corpus import stopwords
from pyspark.mllib.linalg.distributed import RowMatrix, IndexedRow, IndexedRowMatrix
from pyspark import SparkContext
from collections import defaultdict
from multiprocessing import Pool

In [2]:
# Question words are kept out of the stopword list because they decide how a question is handled.
question_words = {'who', 'what', 'which', 'when', 'where', 'how', 'why'}
standard_stopwords = (set(ENGLISH_STOP_WORDS) | set(stopwords.words('english'))) - question_words

In [3]:
for w in ['who', 'what', 'which', 'when', 'where', 'how', 'why']:
    if w in standard_stopwords:
        print(w)

In [3]:
authors = joblib.load('authors')
#fasttext = FastText.load_fasttext_format('../fastText/wiki.en.bin')
bigrams = joblib.load('bigrams_model')

In [4]:
featureVec = joblib.load('featurevec')
id_s = joblib.load('sentence_ids')

In [5]:
#df = pd.DataFrame({k:v for k, v in zip(id_s, featureVec)})
df = pd.DataFrame(featureVec, index = id_s)

In [7]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
59a85acdb18b146ddb84ff2b,0.016841,-0.064066,-0.003312,0.042494,-0.070648,0.006753,0.012448,-0.035684,0.024778,0.13314,...,-0.017461,0.024403,0.08463,-0.00647,0.001895,-0.08075,-0.051471,-0.020571,-0.035888,0.032818
59a85aceb18b146ddb84ff2c,-0.005425,-0.018609,-0.018961,0.05141,-0.066882,0.033543,0.042023,-0.053016,0.002348,0.04285,...,-0.031156,-0.029165,0.022447,0.00497,-0.008506,-0.059322,-0.005609,0.03346,-0.006075,-0.038163
59a85aceb18b146ddc84ff2b,-0.001685,0.008268,-0.025039,0.0372,-0.02401,-0.006751,-0.016207,-0.033398,-0.00848,-0.013157,...,-0.01806,-0.011136,0.006381,-0.002854,-0.01202,-0.060105,-0.014063,0.047626,0.01063,-0.005372
59a85acfb18b146dda84ff2b,-0.024034,-0.005111,-0.034685,0.05453,-0.000676,0.011178,0.030094,-0.043227,0.000945,0.027879,...,-0.030182,-0.015533,0.004545,0.008071,0.01158,-0.02527,-0.019287,0.046426,0.034357,-0.00551
59a85acfb18b146ddb84ff2d,-0.011807,-0.010629,-0.053721,0.076901,-0.04725,0.013618,0.006365,-0.056048,-0.002637,0.019908,...,-0.037469,-0.004816,0.004341,-0.033402,0.007919,-0.059306,-0.014925,0.034908,0.027748,0.023625


In [6]:
client = MongoClient()
db = client.lingbuzz
papers = db.get_collection('papers')

In [7]:
keywords = db.get_collection('keywords')
sentences = db.get_collection('sentences')

In [8]:
# debugged
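def parse_question(sent):
    """Split a question into author names and lower-cased keyword tokens.

    Returns (author, tokens): the author fragments found in the question and
    the remaining ASCII, non-stopword tokens after the bigram model is applied.
    """
    sentence = []
    author = []
    for w in str(sent).split():
        if w.lower() in authors:
            # The first letter is dropped, which lets the case-sensitive regex in
            # restrict_search match capitalised names ('reco' matches 'Greco').
            author.append(w[1:])
        else:
            try:
                # Skip tokens that contain non-ASCII characters.
                w.encode(encoding='utf-8').decode('ascii')
            except UnicodeDecodeError:
                continue
            word = re.sub('[%s]' % re.escape(string.punctuation), '', w).lower()
            # Lower-case before the stopword test so that 'The' is filtered like 'the'.
            if word and word not in standard_stopwords:
                sentence.append(word)
    return author, bigrams[sentence]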

In [9]:
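def evaluate_question(question):
    a, q = parse_question(question)
    # q[0] is the question word itself, so only q[1:] is passed on as keywords.
    # Note: ('which' and 'author') in q would only test for 'author'.
    if 'who' in q or ('which' in q and 'author' in q):
        return request_reference(q[1:])
    elif len(a) != 0:
        return create_summary(q[1:], a, restrict=True)
    else:
        return create_summary(q[1:])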

In [10]:
# debugged
# def request_reference(q):
#     """question is a list of strings"""
#     # for this to work, the keyword entry has to be englishified. 
#     # candidates = []
#     print('These are some papers you might want to read:')
#     for candidate in papers.find({'updated_keywords': [q]}):
#         print(candidate['title'] + ', by '+ ', '.join(c for c in candidate['authors']))
#         print('You can download the paper here: ling.auf.net/' + candidate['url'] + '\n')
        
def request_reference(q):
    """Return titles, authors and URLs of the papers tagged with any keyword in q.

    q is a list of strings.
    """
    # for this to work, the keyword entry has to be englishified. 
    answer = 'These are some papers you might want to read: \n\n'
    for w in q:
        for candidate in papers.find({'updated_keywords': w}):
            answer += candidate['title'] + ', by ' + ', '.join(candidate['authors']) + '\n'
            answer += 'You can download the paper here: ling.auf.net/' + candidate['url'] + '\n\n'
    return answer

# debugged
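def restrict_search(a, q):
    """Return the IDs of papers by any author in a that have a keyword from q."""
    #(?=.*word1)(?=.*word2)(?=.*word3)
    # Alternation: a paper matches if any one of the authors matches.
    regex = '|'.join(re.escape(name) for name in a)
    query = {'authors': {'$regex': regex}, 'updated_keywords': {'$in': q}}
    return [candidate['_id'] for candidate in papers.find(query)]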

# add the paperIDs restriction

In [11]:
a, q = parse_question('what do greco and haegeman say about verb second')

In [12]:
restrict_search(a, q)

[ObjectId('598b44c407d7df0771938487')]

In [13]:
from collections import Counter

def most_Common(lst):
    data = Counter(lst)
    return data.most_common()

In [14]:
# only do this if there are candidates. Otherwise: print I do not have enough information on that. 

# def create_summary(q, a = None, restrict = False):
#     candidates = find_candidates(q)
#     out = str()
#     if restrict:
#         ids = restrict_search(a, q)
#         candidates = restrict_candidates(candidates, ids)
#         out = '\nThese are some papers you might want to read: \n\n'
#         for candidate in papers.find({'_id': {'$in': ids}}):
#             out += candidate['title'] + ', by '+ ', '.join(c for c in candidate['authors']) + '\n'
#             out += 'You can download the paper here: ling.auf.net/' + candidate['url'] + '\n\n'
#     else:
#         candidates = restrict_candidates(candidates)
#     stack = []
#     if len(candidates) == 0 and len(out) == 0:
#         print('I do not have enough information to answer that question.')
#     if len(candidates) > 0:
#         sentence_similarities = calculate_similarities(candidates)
#         to_eliminate = []
#         while (len(stack) < 5 and len(candidates)>1):
#             freqs = most_Common(candidates)
#             if freqs[0][1] != freqs[1][1]:
#                 stack.append(sentences.find_one({'_id': freqs[0][0]})['sentence'])
#                 to_eliminate = [freqs[0][0]] + sentence_similarities[freqs[0][0]]
#                 candidates = list(filter(lambda a: a not in to_eliminate, candidates))
#             else: 
#                 sub_candidates = [_id[0] for _id in freqs if _id[1] == freqs[0][1]]
#                 sents = sentences.find({'_id': {'$in': sub_candidates}}).sort([('score',-1)]).limit(1)
#                 for sent in sents:
#                     stack.append(sent['sentence'])
#                     to_eliminate+= [sent['_id']] + sentence_similarities[sent['_id']]
#                     candidates = list(filter(lambda a: a not in to_eliminate, candidates))
#         print(' '.join(stack))
#     if len(candidates) ==1:
#         #stack.append(sentences.find_one({'_id': candidates[0]})['sentence'])
#         print(sentences.find_one({'_id': candidates[0]})['sentence'])
#     print(out)
#       
# def create_summary(q, a = None, restrict = False):
#     candidates = find_candidates(q)
#     out = str()
#     if restrict:
#         ids = restrict_search(a, q)
#         candidates = restrict_candidates(candidates, ids)
#         out = '\nThese are some papers you might want to read: \n\n'
#         for candidate in papers.find({'_id': {'$in': ids}}):
#             out += candidate['title'] + ', by '+ ', '.join(c for c in candidate['authors']) + '\n'
#             out += 'You can download the paper here: ling.auf.net/' + candidate['url'] + '\n\n'
#     else:
#         candidates = restrict_candidates(candidates)
#     stack = []
#     if len(candidates) == 0 and len(out) == 0:
#         return 'I do not have enough information to answer that question.'
#     if len(candidates) > 0:
#         sentence_similarities = calculate_similarities(candidates)
#         to_eliminate = []
#         while (len(stack) < 5 and len(candidates)>1):
#             freqs = most_Common(candidates)
#             if freqs[0][1] != freqs[1][1]:
#                 stack.append(sentences.find_one({'_id': freqs[0][0]})['sentence'])
#                 to_eliminate = [freqs[0][0]] + sentence_similarities[freqs[0][0]]
#                 candidates = list(filter(lambda a: a not in to_eliminate, candidates))
#             else: 
#                 sub_candidates = [_id[0] for _id in freqs if _id[1] == freqs[0][1]]
#                 sents = sentences.find({'_id': {'$in': sub_candidates}}).sort([('score',-1)]).limit(1)
#                 for sent in sents:
#                     stack.append(sent['sentence'])
#                     to_eliminate+= [sent['_id']] + sentence_similarities[sent['_id']]
#                     candidates = list(filter(lambda a: a not in to_eliminate, candidates))
#         out = ' '.join(stack)+'\n\n'+out
#     if len(candidates) ==1:
#         #stack.append(sentences.find_one({'_id': candidates[0]})['sentence'])
#         out += sentences.find_one({'_id': candidates[0]})['sentence']
#     return out 

def create_summary(q, a = None, restrict = False):
    candidates = find_candidates(q)
    out = str()
    refs = str()
    if restrict:
        ids = restrict_search(a, q)
        candidates = restrict_candidates(candidates, ids)
        refs = '\nThese are some papers you might want to read: \n\n'
        for candidate in papers.find({'_id': {'$in': ids}}):
            refs += candidate['title'] + ', by '+ ', '.join(c for c in candidate['authors']) + '\n'
            refs += 'You can download the paper here: ling.auf.net/' + candidate['url'] + '\n\n'
    else:
        candidates = restrict_candidates(candidates)
    stack = []
    # if len(candidates) == 0 and len(out) == 0:
    #     return 'I do not have enough information to answer that question.'
    if len(candidates) > 0:
        sentence_similarities = calculate_similarities(candidates)
        to_eliminate = []
        while (len(stack) < 5 and len(candidates)>1):
            freqs = most_Common(candidates)
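            # freqs has a single entry when all remaining candidates share one ID,
            # so check its length before comparing the top two counts.
            if len(freqs) == 1 or freqs[0][1] != freqs[1][1]: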
                to_append = clean_sentence(sentences.find_one({'_id': freqs[0][0]})['sentence'])
                if len(to_append.split()) > 3:
                    stack.append(to_append)
                    to_eliminate = [freqs[0][0]] + sentence_similarities[freqs[0][0]]
                else: 
                    to_eliminate = [freqs[0][0]]
                candidates = list(filter(lambda a: a not in to_eliminate, candidates))
            else: 
                sub_candidates = [_id[0] for _id in freqs if _id[1] == freqs[0][1]]
                sents = sentences.find({'_id': {'$in': sub_candidates}}).sort([('score',-1)]).limit(1)
                for sent in sents:
                    to_append = clean_sentence(sent['sentence'])
                    if len(to_append.split()) > 3:
                        stack.append(to_append)
                        to_eliminate+= [sent['_id']] + sentence_similarities[sent['_id']]
                    else:
                        to_eliminate+= [sent['_id']]
                    candidates = list(filter(lambda a: a not in to_eliminate, candidates))
        out = ' '.join(stack)+'\n'
    if len(candidates) ==1:
        out += clean_sentence(sentences.find_one({'_id': candidates[0]})['sentence']) 
    # out may hold only a newline when every candidate was discarded as too short.
    if len(candidates) == 0 and len(out.strip()) == 0:
        out = 'I do not have enough information to answer that question.'
    out += '\n\n'+ refs
    return out 

In [15]:
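def calculate_similarities(candidates):
    """Map each candidate sentence ID to the candidate IDs whose feature vector
    has a cosine similarity above 0.92 with it (the sentence itself included)."""
    # Deduplicate: a sentence ID appears once per matching keyword.
    sub_df = df.filter(items=list(set(candidates)), axis=0)
    cos_sim = cosine_similarity(sub_df.values)
    df_sim = pd.DataFrame(cos_sim, index=sub_df.index, columns=sub_df.index)
    similar_sent = {}
    # Visit each ID that actually has a feature vector exactly once.
    for k in sub_df.index:
        similar_sent[k] = list(df_sim.index[df_sim[k] > 0.92])
    return similar_sent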

In [16]:
# debugged

# def find_candidates(q):
#     candidates = []
#     #similar_words = []
#     for w in q:
#         try: 
#             candidates+=keywords.find_one({'word': w})['sentenceIDs']      
#             #similar_words+=keywords.find_one({'word': w})['similar_words']
#         except: 
#             pass
#     #for w in similar_words:
#     #    candidates+=keywords.find_one({'_id': w})['sentenceIDs']
#     return candidates


def find_candidates(q):
    """Collect the sentence IDs of every keyword in q; an ID repeats once per matching keyword."""
    candidates = []
    #similar_words = []
    for candidate in keywords.find({'word': {'$in': q}}):
        #similar_words+=keywords.find_one({'word': w})['similar_words']
        candidates += candidate['sentenceIDs']
    #for w in similar_words:
    #    candidates+=keywords.find_one({'_id': w})['sentenceIDs']
    return candidates

# def find_candidates(q):
#     candidates = []
#     similar_words = []
#     frequencies = {}
#     for w in q:
#         frequencies[w]= keywords.find_one({'word': w})['frequency']
#     least_frequent = min(frequencies, key=frequencies.get)
#     q.remove(least_frequent)
#     try: 
#         candidates+=keywords.find_one({'word': least_frequent})['sentenceIDs']      
#         similar_words+=keywords.find_one({'word': w})['similar_words']
#     except: 
#         pass
#     for w in q:
#         try: 
#             for _id in keywords.find_one({'word': least_frequent})['sentenceIDs']:
#                 if _id in candidates: 
#                     candidates.append(_id)   
#                 similar_words+=keywords.find_one({'word': w})['similar_words']
#         except: 
#             pass
#     for w in similar_words:
#         try: 
#             for _id in keywords.find_one({'word': least_frequent})['sentenceIDs']:
#                 if _id in candidates: 
#                     candidates.append(_id)   
#                 similar_words+=keywords.find_one({'word': w})['similar_words']
#         except: 
#             pass
#     return candidates
# 
# gets rid of the sentences with a low score and takes into account restricted search
# def restrict_candidates(candidates, ids = None):
#     print('restricting candidates')
#     print(len(candidates))
#     for c in candidates:
#         if ids:
#             try: 
#                 if sentences.find_one({'_id': c})['paperID'] not in ids:
#                     candidates.remove(c)
#                 if sentences.find_one({'_id': c})['score'] <= 0.7:
#                     candidates.remove(c)
#             except:
#                 pass
#         else:
#             try: 
#                 if sentences.find_one({'_id': c})['score'] <= 0.7:
#                     candidates.remove(c)
#             except:
#                 pass
#     print(len(candidates))
#     return candidates

In [17]:
# This still takes too long. A cursor can be turned into a list with list(cursor),
# but that loops over every returned document as well. Collecting the IDs to drop
# in a set and filtering once avoids rebuilding the candidate list per document.

def restrict_candidates(candidates, ids=None):
    """Drop candidates outside the requested papers and candidates with a score below 0.5."""
    print('found %s candidates' % len(candidates))
    if ids:
        wrong_paper = {doc['_id'] for doc in sentences.find(
            {'_id': {'$in': list(set(candidates))}, 'paperID': {'$nin': ids}}, {'_id': 1})}
        candidates = [c for c in candidates if c not in wrong_paper]
        print('restricted to %s candidates based on ids' % len(candidates))
    low_score = {doc['_id'] for doc in sentences.find(
        {'_id': {'$in': list(set(candidates))}, 'score': {'$lt': 0.5}}, {'_id': 1})}
    candidates = [c for c in candidates if c not in low_score]
    print('restricted to %s candidates' % len(candidates))
    return candidates

In [18]:
print(evaluate_question('What is balooba?'))

found 0 candidates
restricted to 0 candidates
I do not have enough information to answer that question.




In [19]:
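def clean_sentence(sent):
    """Keep the longest capitalised fragment plus its lower-case continuations.

    Returns '' unless the result ends in sentence-final punctuation.
    """
    out = ''
    # Fragments are separated by double spaces in the extracted text.
    for s in sent.split('  '):
        s = s.lstrip('0123456789.- ')
        if not s:
            continue
        if s[0].isupper():
            if len(s) > len(out):
                out = s.strip()
        elif out and out[-1] != '.':
            # A lower-case fragment continues an unfinished sentence.
            out += ' ' + s.strip()
    if out and out[-1] not in '.?!':
        out = ''
    return out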


def clean_sentence1(sent):
    """helper function to clean sentences up a bit"""
    out = ''
    for s in sent.split('   '):
        # Also rejoin words hyphenated across a line break ('pat- terns').
        s = s.lstrip('0123456789.- ').replace('- ', '')
        if not s:
            continue
        if s[0].isupper():
            if len(s) > len(out):
                out = s
        elif out and out[-1] != '.':
            out += ' ' + s
    return out

# def clean_sentence(sent):
#     """helper function to clean sentences up a bit"""
#     return sent

In [22]:
s = 'zan  ’t   I’ll   it      zeggen.  tell   1  In this paper, we discuss a specific set of Verb Third phenomena in West Flemish, the  dialect spoken in the Belgian province of West Flanders, exploring the microvariation  with Standard Dutch and the ramifications that these patterns have for the interaction  between discourse and the syntax of V2 and for the interaction between discourse and  the narrow syntax in general.   '

In [23]:
s3 = 'Tense sufﬁxes come in many forms, but an important generalization exists that helps to elucidate the pat- terns found'

In [24]:
s2 = 'A can occur either in the first or in the second position  A can only occur in the second position  A linearly precedes B  phi-features (person/number/gender)  focus feature  operator feature  question feature  negation  existential quantification  first person  second person  third person  accusative  affirmative particle  complementizer  interrogative complementizer (Japanese and Irish)  complementizer agreement  clitic pronoun  conditional mood (Finnish)  dative  demonstrative pronoun  emphatic form  imperative  infinitive  negative clitic (Dutch dialects)  negative auxiliary (Finnish)  nominative  object  past tense  dummy Case preposition (Romanian)  plural  present tense  particle  preverb (Hungarian)  relative pronoun  singular  strong pronoun  subject  topic marker (Japanese)  weak pronoun x  SG  STRONG  SUBJ  TOP  WEAK a.'

In [20]:
s3 = 'Tense sufﬁxes come in many forms, but an important generalization exists that helps to elucidate the pat- terns found.'

In [21]:
clean_sentence1(s3)

'Tense sufﬁxes come in many forms, but an important generalization exists that helps to elucidate the patterns found.'

In [26]:
evaluate_question('what does Haegeman write about verb second?')

found 18874 candidates
restricted to 36 candidates based on ids
restricted to 8 candidates


'Verb second, the split CP and null subjects in early Dutch finite from fragments and ellipsis. The nature of Old Spanish Verb Second reconsidered. The second goal is to investigate the role of syntax in these patterns. Verbs and Verb phrases. In this paper, we discuss a specific set of Verb Third phenomena in West Flemish, the dialect spoken in the Belgian province of West Flanders, exploring the microvariation with Standard Dutch and the ramifications that these patterns have for the interaction between discourse and the syntax of V2 and for the interaction between discourse and the narrow syntax in general.\nAssuming the context given in (14), let us consider examples in which the attitude verb willen (‘to want’) takes an infinitival complement.\n\n\nThese are some papers you might want to read: \n\nFramesetters and the micro-variation of subject-initial V2, by Ciro Greco, Liliane Haegeman\nYou can download the paper here: ling.auf.net//lingbuzz/003226\n\n'

In [36]:
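# A set makes the membership tests in the keyword update loop fast;
# with about 400,000 IDs a list lookup per sentence would be far too slow.
too_low = {candidate['_id'] for candidate in sentences.find({'score': {'$lt': 0.5}}, {'_id': 1})}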

In [37]:
len(too_low)

407306

In [None]:
for word in keywords.find():
    informative = [x for x in word['sentenceIDs'] if x not in too_low]
    keywords.update_one({'_id': word['_id']}, {'$set': {'informative_sents': informative}})

In [42]:
list(filter(lambda x: x not in [1,2,3], [1,2,3,4]))

[4]

In [44]:
out[0]

ObjectId('59a85f7cb18b14085d6c5eac')

The keywords of each paper have to be split into a list of words for `request_reference` to work properly.

In [None]:
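def clean_keywords(raw_keywords):
    """Turn a paper's raw keyword entry into a list of lower-cased alphabetic tokens."""
    # The parameter is not called `keywords`, to avoid shadowing the collection of that name.
    sentence = []
    for w in raw_keywords[0].split():
        word = re.sub('[%s]' % re.escape(string.punctuation), '', str(w).lower())
        if word.isalpha():
            sentence.append(word)
    return bigrams[sentence]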

In [None]:
for doc in papers.find({'updated_keywords':{'$exists': True}}):
    try: 
        papers.update_one({'_id': doc['_id']},{'$set':{'updated_keywords': clean_keywords(doc['keywords'])}})
    except:
        pass

In [None]:
papers.count_documents({'keywords': {'$exists': True}})

In [None]:
papers.count_documents({'updated_keywords': {'$exists': True}})

Ideas to make the search faster:
* Start by looking for sentences that contain the least frequent keyword. Then, within that set, find the sentences that contain the other keywords as well. The problem is that some keywords are more important than others. For instance, in the question 'What is the difference between topic and focus?', we want sentences with 'topic' and 'focus', while 'difference' is the least frequent word. We would need some keyword relevance ranking as well.
* Every time sentence similarities are computed, store them in the db. After a while, all similarities should be in there.
* Store Q-A pairs in a db to avoid recomputing the answer to a question that has already been asked.
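
The first idea, and its drawback, can be sketched on toy data. The `keyword_index` dictionary, the `ignore` parameter and the function name are hypothetical; the sketch only shows how intersecting from the rarest keyword shrinks the working set, and how an unimportant rare word such as 'difference' can hide useful sentences.

```python
# Hypothetical keyword -> sentence IDs index (assumed data).
keyword_index = {
    'difference': {'s1', 's7'},
    'topic': {'s1', 's2', 's3', 's4'},
    'focus': {'s1', 's2', 's3', 's5'},
}


def intersect_rarest_first(words, ignore=()):
    """Intersect the sentence sets, starting from the least frequent keyword."""
    words = [w for w in words if w in keyword_index and w not in ignore]
    if not words:
        return set()
    # Sorting by frequency keeps the working set as small as possible.
    words.sort(key=lambda w: len(keyword_index[w]))
    result = set(keyword_index[words[0]])
    for w in words[1:]:
        result &= keyword_index[w]
        if not result:
            break
    return result


question = ['difference', 'topic', 'focus']
print(intersect_rarest_first(question))
print(sorted(intersect_rarest_first(question, ignore={'difference'})))
```

With 'difference' as the anchor only `s1` survives; leaving it out keeps every sentence that mentions both 'topic' and 'focus', which is why some relevance ranking of keywords would be needed.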


Ideas to make the general answers better:
* In English, the scope of the question is usually where default intonation falls, i.e., the last constituent of the clause. Is there a way to incorporate this?
* Text simplification: the answers currently contain a lot of semantic connectors that do not connect to anything. It would be better to get rid of them.
* Make the author search better.
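
One possible starting point for the text simplification idea is to strip a sentence-initial connector before the sentence is added to the answer. The `CONNECTORS` list and the function name below are assumptions for illustration; a usable list would have to be curated, and connectors in the middle of a sentence are not handled.

```python
import re

# Hypothetical connector list; a real one would need to be curated.
CONNECTORS = ('however', 'therefore', 'moreover', 'thus', 'furthermore', 'in addition')

_pattern = re.compile(r'^(?:%s)\b,?\s*' % '|'.join(re.escape(c) for c in CONNECTORS),
                      flags=re.IGNORECASE)


def strip_leading_connector(sentence):
    """Drop a sentence-initial connector and re-capitalise what follows."""
    stripped = _pattern.sub('', sentence, count=1)
    # Leave the sentence alone if nothing matched or nothing would remain.
    if stripped == sentence or not stripped:
        return sentence
    return stripped[0].upper() + stripped[1:]


print(strip_leading_connector('However, verb second is not obligatory here.'))
```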

In [None]:
for doc in keywords.find():
    keywords.update_one({'_id': doc['_id']}, {'$set': {'frequency': len(doc['sentenceIDs'])} })

In [None]:
keywords.find_one({'word': 'the'})['frequency']

In [None]:
questions = ['What is the difference between topic and focus?', 'Who wrote about focus?', 
             'What did Schlenker write about sign language?']