<h3>0. Exploratory analysis</h3>

In [1]:
import json
import pandas as pd

#df_training=pd.read_json('project_files/training.json', encoding = 'utf8')
df_devel=pd.read_json('project_files/devel.json')
df_docs=pd.read_json('project_files/documents.json')
df_testing=pd.read_json('project_files/testing.json')

df_training=pd.read_pickle('project_files/df_training.pkl')
question_learning_dataset = df_training[df_training.answer_type.notnull()]

1. Find keywords
2. Classify answer types using an answer type taxonomy
3. Formulate the query from the keywords
4. For each document, check the frequency distribution of words and keep the document if at least one query word is present. Rank the documents by that score
5. Find the paragraphs: discard irrelevant paragraphs using named entities (NE), keywords, and the longest exact keyword sequence. Weight these equally for now and compute a score for each paragraph, then rank the paragraphs within the document. Use the original answer to match the answer type
6. Find candidate answers using a supervised ML method
7. Merge candidate answers using NER
8. Pick the best answer using logistic regression

<h3>1. Question processing</h3>

Configuring Stanford CoreNLP: see https://blog.manash.me/configuring-stanford-parser-and-stanford-ner-tagger-with-nltk-in-python-on-windows-f685483c374a

In [2]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tag.stanford import CoreNLPNERTagger
from itertools import groupby

stopwords = set(nltk.corpus.stopwords.words('english')) 


def get_Name_Entity_NLTK(data):
    results=[]
    for sentence in data:
        ne_chunked_sents = ne_chunk(pos_tag(word_tokenize(sentence)))
        result = []

        for tagged_tree in ne_chunked_sents:

            if hasattr(tagged_tree, 'label'):
                entity_name = ' '.join(c[0] for c in tagged_tree.leaves()) # join the leaf tokens into the entity string
                entity_type = tagged_tree.label() # get NE category
                result.append((entity_name, entity_type))
        results.append(result)

    return results

def get_Name_Entity_Sentence(sentence):
    st = CoreNLPNERTagger(url='http://localhost:9000')
    tokenized_text = nltk.word_tokenize(sentence)
    classified_text = st.tag(tokenized_text)
    result = []
    
    for tag, chunk in groupby(classified_text, lambda x:x[1]):
        if tag != "O":
            word = " ".join(w for w, t in chunk)
            result.append((word.lower(), tag))
    
    return result

def get_Name_Entity_paragraph(paragraph):
    st = CoreNLPNERTagger(url='http://localhost:9000')
    entity_para=[]
    
    tokenized_sentence=nltk.sent_tokenize(paragraph)
    #print(tokenized_sentence)
    for sentence in tokenized_sentence:
        #print(sentence)
        tokenized_text = nltk.word_tokenize(sentence)
        classified_text = st.tag(tokenized_text)
        #result = {}
        entity_sent=[]
        for tag, chunk in groupby(classified_text, lambda x:x[1]):
            if tag != "O":
                word = " ".join(w for w, t in chunk)
                #result[word.lower()] = tag
                entity_sent.append((word.lower(),tag))
        #print(entity_sent)     
        entity_para.append(entity_sent)
        
    return entity_para

def get_Name_Entity_StanfordCoreNLP(data):
    st = CoreNLPNERTagger(url='http://localhost:9000')
    results=[]
    for sentence in data:
        tokenized_text = nltk.word_tokenize(sentence)
        classified_text = st.tag(tokenized_text)
        result = []
        for tag, chunk in groupby(classified_text, lambda x:x[1]):
            if tag != "O":
                word = " ".join(w for w, t in chunk)
                result.append((word.lower(),tag))
       
        results.append(result)
        
    return results

def addNameEntity(df,feature,func):
    if 'NE'+"_"+feature in df:
        df = df.drop('NE'+"_"+feature, axis=1)
    df["NE"+"_"+feature] = func(df[feature])
    
    return df
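
The three CoreNLP helpers above share one step: consecutive tokens that carry the same non-`O` tag are merged into a single entity. Below is a minimal sketch of that step on its own. The function name `merge_tagged_tokens` and the tagged list are made up for illustration, shaped like the `(token, tag)` pairs that `st.tag()` returns, so no CoreNLP server is needed to run it.

```python
from itertools import groupby

def merge_tagged_tokens(tagged_tokens):
    """Merge runs of adjacent tokens sharing a non-'O' tag into (entity, tag) pairs."""
    entities = []
    for tag, chunk in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(word for word, _ in chunk).lower(), tag))
    return entities

# Hypothetical tagger output (not produced by a real CoreNLP call)
tagged = [("Niels", "PERSON"), ("Bohr", "PERSON"), ("introduced", "O"),
          ("the", "O"), ("model", "O"), ("in", "O"), ("1913", "DATE")]
print(merge_tagged_tokens(tagged))  # [('niels bohr', 'PERSON'), ('1913', 'DATE')]
```

One consequence of grouping by tag alone: two different entities of the same type that sit next to each other are merged into one string.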

Get Keywords

In [3]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')
POS = set(["NN","NNS","NNP","NNPS","CD","JJ","VB","VBD","VBG","VBN","VBP","VBZ"]) 

stopwords = set(nltk.corpus.stopwords.words('english')) 


def lemmatize(word):
    lemma = lemmatizer.lemmatize(word,wn.NOUN)
    if (lemma == word):
        lemma = lemmatizer.lemmatize(word,wn.VERB)
        
    return lemma

def get_keyword(data):
    result = []
    sentence=data
    tokenized_text = tokenizer.tokenize(sentence)
    tagged = nltk.pos_tag(tokenized_text)
    for text,pos in tagged:
        text = lemmatize(text.lower())
        if text not in stopwords:
            if pos in POS:
                result.append(text)
                
    return result

def get_keyword_paragraph(data):
    results=[]
    tokenized_sentence = nltk.sent_tokenize(data)
    for sentence in tokenized_sentence:
        result = get_keyword(sentence)
        results.append(result)
        
    return results

def get_keyword_all(data):
    results=[]
    for sentence in data:
        result = get_keyword(sentence)
        results.append(result)
        
    return results

def add_keywords(df,feature):
    if 'keywords'+"_"+feature in df:
        df = df.drop('keywords'+"_"+feature, axis=1)
    df['keywords'+"_"+feature]=get_keyword_all(df[feature])
    return df

def get_number_of_common_kewyords(question_keywords,answer_sentence_keywords):
    sum_keywords=0
    for qkey in question_keywords:
        if qkey in answer_sentence_keywords:
            sum_keywords+=1
    
    return sum_keywords

<h4>Train a classifier</h4>

In [4]:
import nltk
import difflib

def isEqual(answer,sentence):
    answer_tokens=nltk.word_tokenize(answer)
    sentence_tokens=set(nltk.word_tokenize(sentence))
    for a_token in answer_tokens:
        if a_token not in sentence_tokens:
            return False
    return True
                
def get_answer_features(paragraph,answer,ner_answer,ner_paragraph,answer_found):
    dict_answer_ner={}
    
    for ner in ner_answer:
        if len(ner)<2:
            print ('NER list',ner)
        dict_answer_ner[ner[0]]=ner[1]
    
    dict_answer_sentence_ner={}
    ner_paragraph_list=[]
    for ner_list in ner_paragraph:
        # build a separate entity -> tag dict for each sentence of the paragraph,
        # so the answer's own entities in dict_answer_ner are not overwritten
        sentence_ner={}
        for ner in ner_list:
            sentence_ner[ner[0]]=ner[1]
        ner_paragraph_list.append(sentence_ner)

    
    
    sents_passage = nltk.sent_tokenize(paragraph)
    answer_sentence_ner={'UNKNOWN':'UNKNOWN'}

    answer_sentence_keywords=[]
    common_entities=tuple()
    
    
    common_entities = set(dict_answer_sentence_ner.items()) & set(dict_answer_ner.items())
    
    
    for sentence_index in range(len(sents_passage)):
        #if answer.lower() in sents_passage[sentence_index].lower():
        if (isEqual(answer.lower(),sents_passage[sentence_index].lower())):
            answer_found=sents_passage[sentence_index]
            dict_answer_sentence_ner=ner_paragraph_list[sentence_index]
            common_entities = set(dict_answer_sentence_ner.items()) & set(dict_answer_ner.items())
            
            break
    
    return answer_found,dict_answer_sentence_ner,common_entities



In [5]:
# BOW extraction for passages and questions
def get_passages_bow(passages):
    passage_bow={}
    for passage in passages:
        for token in nltk.word_tokenize(passage):
            if token not in stopwords: 
                word=lemmatize(token.lower())
                passage_bow[word] = passage_bow.get(word, 0) +  1
    
    return passage_bow

def get_sentences_bow(sentences):
    sentence_bow={}
    
    for sentence in sentences:
        for token in nltk.word_tokenize(sentence):
            if token not in stopwords:
                word=lemmatize(token.lower())
                sentence_bow[word] = sentence_bow.get(word, 0) +  1
    
    return sentence_bow

def get_question_bow(question):
    question_bow={}
    for token in nltk.word_tokenize(question):
        if token not in stopwords: 
            word=lemmatize(token.lower())
            question_bow[word] = question_bow.get(word, 0) +  1
                
    return question_bow

def get_training_question_bow(question,keywords,qt):
    question_bow={}
    question_bow[qt]=1
    for token in nltk.word_tokenize(question):
        if token not in stopwords: 
            word=lemmatize(token.lower())
            if word in keywords:
                question_bow[word] = question_bow.get(word, 0) +  1
                
    return question_bow

In [6]:
def get_feature_questions(questions, keywords,qt):
    qs = []
    for i,question in enumerate(questions):
        q_bow = get_training_question_bow(question,keywords,qt[i])
        qs.append(q_bow)
        
    return qs

In [7]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def check_results(predictions, classifications):
    print("Accuracy:")
    print(accuracy_score(classifications,predictions))
    print(classification_report(classifications,predictions))

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer


    
# get the words from answer sentences (we can tweak this for paragraphs)
answer_sentences_bow=get_sentences_bow(question_learning_dataset[question_learning_dataset['answer_found'].notnull()]['answer_found'])
answer_keywords = set([word for word, count in answer_sentences_bow.items()])

#qs_training=get_feature_questions(questions,answer_keywords)
qs_training=get_feature_questions(list(question_learning_dataset.question),answer_keywords,list(question_learning_dataset.question_type))




In [15]:
from sklearn.ensemble import RandomForestClassifier

if (len(qs_training)>0 and len(list(question_learning_dataset.question_type))>0):
    # fit vectorizer
    vectorizer = DictVectorizer()
    
    X_train_dtm = vectorizer.fit_transform(qs_training)
    
    

    model=RandomForestClassifier(n_estimators = 300, max_depth = 60, criterion = 'entropy')
    
    # alternative classifier tried earlier
    #model = MultinomialNB(2, False, None)

    # train the random forest on X_train_dtm to predict the answer type
    model.fit(X_train_dtm, list(question_learning_dataset.answer_type))
    
    # NOTE: predictions are made on the training data, so the report below is a training score
    y_predicted_class = model.predict(X_train_dtm)
    
    check_results(y_predicted_class,list(question_learning_dataset.answer_type))

Accuracy:
0.8439676567128332
                   precision    recall  f1-score   support

   CAUSE_OF_DEATH       1.00      0.50      0.66       327
             CITY       1.00      0.08      0.15        12
          COUNTRY       0.94      0.55      0.69      1058
  CRIMINAL_CHARGE       1.00      0.42      0.59        64
             DATE       0.78      0.99      0.87      5801
         DURATION       0.97      0.67      0.79       464
         IDEOLOGY       0.99      0.58      0.73       232
         LOCATION       0.79      0.91      0.85      1738
             MISC       1.00      0.54      0.70       133
            MONEY       1.00      0.82      0.90       462
      NATIONALITY       0.99      0.44      0.61       858
           NUMBER       0.93      0.91      0.92      4644
          ORDINAL       1.00      0.70      0.82       406
     ORGANIZATION       1.00      0.53      0.69       496
          PERCENT       0.97      0.87      0.92       751
           PERSON       0.

In [16]:
X_train_dtm.shape

(24982, 13286)

<h3>2. Candidate answer generation</h3>

<h4> Get a score for the passage to filter the most relevant passages</h4>


In [17]:
## features relevant to this part
# number of named entities of the right type in the passage
# number of question keywords in the passage
# the longest exact sequence of question keywords
# rank of the document where the passage was extracted
# proximity of the keywords from the original query
# ngram overlap between the passage and the question
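
Two of the features listed above, the longest exact sequence of question keywords and the n-gram overlap, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the inputs are assumed to be keyword lists like those returned by `get_keyword`.

```python
def longest_common_keyword_run(question_keywords, passage_keywords):
    """Length of the longest contiguous keyword sequence shared by both lists."""
    best = 0
    # previous[j] = length of the common run ending at the previous question
    # keyword and at passage_keywords[j - 1]
    previous = [0] * (len(passage_keywords) + 1)
    for q_word in question_keywords:
        current = [0] * (len(passage_keywords) + 1)
        for j, p_word in enumerate(passage_keywords, start=1):
            if q_word == p_word:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best

def ngram_overlap(question_tokens, passage_tokens, n=2):
    """Number of distinct n-grams that occur in both token lists."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams(question_tokens) & ngrams(passage_tokens))

# Made-up keyword lists for illustration
q = ["first", "quantize", "model", "atom", "introduce"]
p = ["niels", "bohr", "introduce", "first", "quantize", "model", "atom", "1913"]
print(longest_common_keyword_run(q, p))  # 4
print(ngram_overlap(q, p))               # 3
```

Unlike a position-by-position comparison, the run length here does not depend on where the shared sequence starts in either list.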

First, we will set up useful functions to extract term frequencies to build the vector space model

In [309]:
import nltk
from collections import defaultdict
from collections import Counter
from math import log

stopwords = set(nltk.corpus.stopwords.words('english')) # a set gives fast membership tests


# get the terms for a passage
def get_terms(passage):
    terms = set()
    for token in nltk.word_tokenize(passage):
        if token not in stopwords: 
            terms.add(lemmatize(token.lower()))
    return terms
    
# map each docid to {passage_id: set of terms in that passage}
def get_document_term_passsages(ds_documents):
    document_term={}
    passageID=0
    for index, row in ds_documents.iterrows():
        passageID=0
        terms={}
        # every row is a document
        list_of_passages=row['text']
        for passage in list_of_passages:
            terms[passageID]=get_terms(passage)
            passageID+=1
            
        document_term[row['docid']]=terms
    return document_term

# get the term frequencies (note: when doc is the set returned by get_terms, every count is 1)
def extract_term_freqs(doc):
    tfs = Counter()
    for token in doc:
        if token not in stopwords: 
            tfs[lemmatize(token.lower())] += 1
    return tfs
        
# compute document frequencies: for each document, the number of passages containing each term
def compute_doc_freqs(doc_term_freqs):
    doc_dic = {}
    for key, value in doc_term_freqs.items():
        dfs = Counter()
        for passage_id,tfs in value.items():
            for term in tfs.keys():
                dfs[term] += 1
        doc_dic[key] = dfs
        
    return doc_dic
    

In [310]:
# create a document-term matrix
docs=get_document_term_passsages(df_docs)
#docs

In [311]:
# to create a vector space model we need to define a score function
# as a first version, use tf-idf
doc_term_freqs = {}
for docid,dic_passages in docs.items():
    passage_dic = {}
    for passage_id, terms in dic_passages.items():
        term_freqs = extract_term_freqs(terms)
        passage_dic[passage_id] = term_freqs
    doc_term_freqs[docid] = passage_dic

doc_freqs = compute_doc_freqs(doc_term_freqs)


In [312]:
#doc_term_freqs

<b>Improvement:</b> Use BM25

Create an inverted index for query processing. The inverted index does not change from query to query. The weight stored in each term's posting list tuple (docid, weight) is the part that could be improved here.
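
As a rough illustration of the BM25 improvement mentioned above, here is a minimal sketch of Okapi BM25 scoring over the passages of a single document. It assumes the same layout as `doc_term_freqs[docid]`, that is `{passage_id: Counter of term counts}`. The function name `bm25_scores` and the defaults `k1=1.5`, `b=0.75` are assumptions made for this sketch, not values taken from the notebook.

```python
from collections import Counter
from math import log

def bm25_scores(query_terms, passage_term_freqs, k1=1.5, b=0.75):
    """Score each passage of one document against the query with Okapi BM25."""
    n_passages = len(passage_term_freqs)
    lengths = {pid: sum(tfs.values()) for pid, tfs in passage_term_freqs.items()}
    avg_length = sum(lengths.values()) / n_passages if n_passages else 0.0

    # number of passages that contain each term
    passage_freqs = Counter()
    for tfs in passage_term_freqs.values():
        passage_freqs.update(tfs.keys())

    scores = Counter()
    for term in set(query_terms):
        df = passage_freqs.get(term, 0)
        if df == 0:
            continue  # term does not occur in this document
        idf = log(1 + (n_passages - df + 0.5) / (df + 0.5))
        for pid, tfs in passage_term_freqs.items():
            tf = tfs.get(term, 0)
            if tf == 0:
                continue
            # length normalisation: long passages are penalised when b > 0
            norm = 1 - b + b * lengths[pid] / avg_length
            scores[pid] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return scores

# Made-up term counts for three passages of one document
toy = {
    0: Counter({"planck": 2, "constant": 1}),
    1: Counter({"kilogram": 1, "mass": 2}),
    2: Counter({"planck": 1, "kilogram": 1, "unit": 1}),
}
scores = bm25_scores(["planck", "constant"], toy)
print(scores.most_common(1)[0][0])  # 0
```

The returned `Counter` supports `most_common(k)`, so it could stand in for the accumulator used when ranking passages, at the cost of scanning term counts at query time instead of reading precomputed posting lists.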

In [313]:
def count_words(freqs):
    p_count=0
    for counter in freqs.values():
        p_count+=sum(counter.values())
    
    #print(p_count)
    return p_count

In [314]:
## Code from WSTA_N16_information_retrieval
vsm_inverted_index_all = defaultdict()
for docid, passage_freqs in doc_term_freqs.items():
    vsm_inverted_index = defaultdict(list)
    
    #N = sum(passage_freqs.values())
    N = count_words(passage_freqs)
    #print(N,passage_freqs)
    for passage_id, term_freqs in passage_freqs.items():
        length = 0
        # find tf*idf values and accumulate sum of squares 
        tfidf_values = []
        M = len(passage_freqs)
        for term, count in term_freqs.items():
            tfidf = float(count) / N * log(M / float(doc_freqs[docid][term])) # should be number of documents (paragraphs) with term
            tfidf_values.append((term, tfidf))
            length += tfidf ** 2

        # normalise documents by length and insert into index
        length = length ** 0.5
        for term, tfidf in tfidf_values:
            # note the inversion of the indexing, to be term -> (passage_id, score)
            # guard against a zero-length vector (every term occurs in all passages)
            weight = tfidf / length if length > 0 else 0.0
            vsm_inverted_index[term].append([passage_id, weight])
    vsm_inverted_index_all[docid] = vsm_inverted_index

# ensure posting lists are in sorted order (less important here cf above)
for key, value in vsm_inverted_index_all.items():
    for term, docids in value.items():
        docids.sort()


In [18]:
import pickle

def save_obj(obj, name ):
    with open('obj/'+ name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name ):
    with open('obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)
    


In [21]:
vsm_inverted_index_all=load_obj('vsm_inverted_index_corpus')

Query the VSM, creating a score for each document (passage) and returning the top k.

In [23]:
vsm_inverted_index_all[0]['addition']

[[19, 0.15652088017686194]]

In [27]:
from collections import Counter
# get a list of paragraphs ordered by relevance on the question
def query_vsm(query, index):
    accumulator = Counter()
    for term in query:
        postings = index[term]
        for docid, weight in postings:
            accumulator[docid] += weight
    return accumulator

## end copied code
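
To make the ranking step concrete, the sketch below runs the same accumulate-then-rank idea on a tiny hand-made index. `query_index` is a hypothetical variant of `query_vsm` that looks terms up with `.get`, and the index contents are invented for illustration.

```python
from collections import Counter

def query_index(query_terms, index):
    """Sum the posting weights of every query term, per passage."""
    accumulator = Counter()
    for term in query_terms:
        # .get avoids a KeyError on plain dicts and avoids adding
        # empty posting lists to a defaultdict for unseen terms
        for passage_id, weight in index.get(term, []):
            accumulator[passage_id] += weight
    return accumulator

# Hypothetical index for one document: term -> [[passage_id, weight], ...]
toy_index = {
    "planck": [[1, 0.5], [4, 0.25]],
    "constant": [[1, 0.25]],
    "kilogram": [[4, 0.5], [7, 0.125]],
}
ranked = query_index(["planck", "kilogram", "photon"], toy_index).most_common(2)
print(ranked)  # [(4, 0.75), (1, 0.5)]
```

Passage 4 ranks first because it matches two query terms, and the unseen term "photon" contributes nothing.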

<h3>3. Candidate answer scoring</h3>

In [25]:
df_result_devel=pd.DataFrame(columns=['id','answer'])

In [442]:

for index, row in df_devel.iterrows():
    question=row['question']
    docid=row['docid']
    expected_answer=row['text']
    
    question_keywords=get_keyword(question)
    
    # get the most relevant documents for the question
    results = query_vsm(question_keywords, vsm_inverted_index_all[docid])
    documents_ranked=results.most_common(10) 
    
    # extract a set of potential answers
    
    q_bow=get_question_bow(question)
    x = vectorizer.transform(q_bow)
    answer_type=model.predict(x)
    #print('Predicted answer type: ',answer_type)
    
    candidate_passages={}
    list_of_passages=[]
    answer=''
    if len(documents_ranked)>0:
        for document in documents_ranked:
            # perform a paragraph segmentation
            paragraph=df_docs.iloc[docid]['text'][document[0]]
            passages = nltk.sent_tokenize(paragraph)
            for passage in passages:
                list_of_passages.append(passage)



        ## PARAMETERS TO GET FROM TESTING DATASET AND USE A MODEL TO GET THE ANSWER PASSAGE CANDIDATES. 
        #question= df_training.loc[(df_training["docid"] == docid_query) & (df_training["answer_paragraph"] ==document[0] ),"question"][0]
        #answer_type=df_training.loc[(df_training["docid"] == docid_query) & (df_training["answer_paragraph"] ==document[0] ),"answer_type"][0]
        #print(question)
        #print(answer_type) 
        #print(sorted(get_keyword(question)))
        ###

        ## FOR NOW USING KEYWORDS AND GET JUST ONE DEFINITE ANSWER PASSAGE CANDIDATE
        indexPassage=0
        for indexPassage in range(len(list_of_passages)):
            NER_passage=get_Name_Entity_Sentence(list_of_passages[indexPassage])
            for entity in NER_passage:  # get_Name_Entity_Sentence returns a list of (entity, tag) tuples
                if (entity[1]==answer_type):
                    candidate_passages[indexPassage]=get_number_of_common_kewyords(get_keyword(question),get_keyword(list_of_passages[indexPassage]))
                    break


        if len(candidate_passages)>0:
            best_candidate_passage=list_of_passages[max(candidate_passages, key=candidate_passages.get)]
        else:
            if len(list_of_passages)>0:
                best_candidate_passage=list_of_passages[0]
        #print("Candidate Passage Answer:")
        #print(best_candidate_passage)

        NER_answer_passage=get_Name_Entity_Sentence(best_candidate_passage)
        for entity in NER_answer_passage:  # list of (entity, tag) tuples, not a dict
                if (entity[1]==answer_type):
                    answer=entity[0]

        #print('Predicted answer:',answer)
        df_result_devel.loc[len(df_result_devel)]=[index,answer]
    
    

    
    

NameError: name 'df_result_devel' is not defined

<h3>Testing Dataset</h3>

In [None]:

df_result=pd.DataFrame(columns=['id','answer'])
df_testing=pd.read_json('project_files/testing.json')

for index, row in df_testing.iterrows():
    question=row['question']
    docid=row['docid']
    ida=row['id']
    
    #print('Question: ',question)
    #print('Expected Answer:',expected_answer)
    #print('Docid:',docid)
    question_keywords=get_keyword(question)
    
    # get the most relevant documents for the question
    results = query_vsm(question_keywords, vsm_inverted_index_all[docid])
    documents_ranked=results.most_common(10) 
    #print('Top 10 paragraphs: ',documents_ranked)
    q_bow=get_question_bow(question)
    x = vectorizer.transform(q_bow)
    answer_type=model.predict(x)
    #print('Predicted answer type: ',answer_type)
    
    candidate_passages={}
    list_of_passages=[]
    answer=''
    if len(documents_ranked)>0:
        for document in documents_ranked:
            # perform a paragraph segmentation
            paragraph=df_docs.iloc[docid]['text'][document[0]]
            passages = nltk.sent_tokenize(paragraph)
            
            for passage in passages:
                list_of_passages.append(passage)



        ## PARAMETERS TO GET FROM TESTING DATASET AND USE A MODEL TO GET THE ANSWER PASSAGE CANDIDATES. 
        #question= df_training.loc[(df_training["docid"] == docid_query) & (df_training["answer_paragraph"] ==document[0] ),"question"][0]
        #answer_type=df_training.loc[(df_training["docid"] == docid_query) & (df_training["answer_paragraph"] ==document[0] ),"answer_type"][0]
        #print(question)
        #print(answer_type) 
        #print(sorted(get_keyword(question)))
        ###

        ## FOR NOW USING KEYWORDS AND GET JUST ONE DEFINITE ANSWER PASSAGE CANDIDATE
        indexPassage=0
        for indexPassage in range(len(list_of_passages)):
            NER_passage=get_Name_Entity_Sentence(list_of_passages[indexPassage])
            for entity in NER_passage:
                if (entity[1]==answer_type):
                    candidate_passages[indexPassage]=get_number_of_common_kewyords(get_keyword(question),get_keyword(list_of_passages[indexPassage]))
                    break


        if len(candidate_passages)>0:
            best_candidate_passage=list_of_passages[max(candidate_passages, key=candidate_passages.get)]
        else:
            if len(list_of_passages)>0:
                best_candidate_passage=list_of_passages[0]
        #print("Candidate Passage Answer:")
        #print(best_candidate_passage)

       
        NER_answer_passage=get_Name_Entity_Sentence(best_candidate_passage)
        for entity in NER_answer_passage:
                if (entity[1]==answer_type):
                    answer=entity[0]
    
    #print('Predicted answer:',answer)
    print(ida)
    
    df_result.loc[len(df_result)]=[ida,answer]
    
    

    


In [31]:
df_result.to_csv('prediction/output.csv',index=False)

In [495]:
import numpy as np

def balanced_subsample(x,y,subsample_size=1.0):
    # downsample every class to the size of the smallest class
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems*subsample_size)

    xs = []
    ys = []

    for ci,this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs,ys

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 

df_devel=pd.read_json('project_files/devel.json')

    
def get_answer_rank_features(dataset):
    X=[]
    Y=[]
    for index, row in dataset.iterrows():
        question=row['question']
        raw_answer=row['text']

        paragraph=df_docs.iloc[row['docid']]['text'][row['answer_paragraph']]
        
        # 'UNKNOWN' is the fallback value returned when no sentence contains the answer
        answer_found,dict_answer_sentence_ner,common_entities=get_answer_features(paragraph,raw_answer,row['NE_text'],row['NE_paragraph'],'UNKNOWN')

        # number of named entities in the passage
        num_entities=len(common_entities)

        # number of question keywords in the passage
        question_keywords=get_keyword(question)
        answer_passage_keywords=get_keyword(answer_found)
        qk_passage=[]
        for qk in question_keywords:
            if qk in answer_passage_keywords:
                qk_passage.append(qk)
        num_qkp=len(qk_passage)   

        # longest exact sequence of keywords
        longest_exact_sequence=0

        for i in range(len(question_keywords)):
            if i < len(answer_passage_keywords):
                if question_keywords[i] in answer_passage_keywords[i]:
                    longest_exact_sequence+=1

        # rank of the paragraph where the answer sentence was extracted
        results = query_vsm(question_keywords, vsm_inverted_index_all[row['docid']])
        documents_ranked=results.most_common(10) 
        rank_of_paragraph=0
        for document in documents_ranked:
            if (document[0]==row['answer_paragraph']):
                break
            else:
                rank_of_paragraph+=1

        #print('Question:',question)
        #print('answer:',answer_found)

        #print (num_entities,num_qkp,longest_exact_sequence,rank_of_paragraph)

        tokenized_sentence = nltk.sent_tokenize(paragraph)
        for sentence in tokenized_sentence:
            X.append([num_entities,num_qkp,longest_exact_sequence,rank_of_paragraph])
            #print(sentence)
            if (sentence==answer_found):
                Y.append(1)
            else:
                Y.append(0)

        #print(Y_train)
    
    return X,Y


    
def get_answer_rank_features_devel(dataset):
    X=[]
    Y=[]
    for index, row in dataset.iterrows():
        question=row['question']
        raw_answer=row['text']
        print(index)
        paragraph=df_docs.iloc[row['docid']]['text'][row['answer_paragraph']]
        
        NE_answer=get_Name_Entity_Sentence(raw_answer)
        NE_paragraph=get_Name_Entity_paragraph(paragraph)
        #print(NE_answer)
        # get_answer_features takes a fifth argument: the fallback value for answer_found
        answer_found,dict_answer_sentence_ner,common_entities=get_answer_features(paragraph,raw_answer,NE_answer,NE_paragraph,'UNKNOWN')

        # number of named entities in the passage
        num_entities=len(common_entities)

        # number of question keywords in the passage
        question_keywords=get_keyword(question)
        answer_passage_keywords=get_keyword(answer_found)
        qk_passage=[]
        for qk in question_keywords:
            if qk in answer_passage_keywords:
                qk_passage.append(qk)
        num_qkp=len(qk_passage)   

        # longest exact sequence of keywords
        longest_exact_sequence=0

        for i in range(len(question_keywords)):
            if i < len(answer_passage_keywords):
                if question_keywords[i] in answer_passage_keywords[i]:
                    longest_exact_sequence+=1

        # rank of the paragraph where the answer sentence was extracted
        results = query_vsm(question_keywords, vsm_inverted_index_all[row['docid']])
        documents_ranked=results.most_common(10) 
        rank_of_paragraph=0
        for document in documents_ranked:
            if (document[0]==row['answer_paragraph']):
                break
            else:
                rank_of_paragraph+=1

        #print('Question:',question)
        #print('answer:',answer_found)

        #print (num_entities,num_qkp,longest_exact_sequence,rank_of_paragraph)

        tokenized_sentence = nltk.sent_tokenize(paragraph)
        for sentence in tokenized_sentence:
            
            X.append([num_entities,num_qkp,longest_exact_sequence,rank_of_paragraph])
            if (answer_found in sentence):
                Y.append(1)
            else:
                Y.append(0)

        #print(Y_train)
        
    return X,Y    

LogReg = LogisticRegression()

X_train,Y_train=get_answer_rank_features(df_training)
#X_train, X_test, Y_train, Y_test = train_test_split(X_train, Y_train, stratify=Y_train, test_size=0.2)


#print(X_train)
#print(Y_train)
LogReg.fit(X_train, Y_train)
print('done training')
X,Y=get_answer_rank_features_devel(df_devel)


y_predicted_class = LogReg.predict(X)


classifications=Y
predictions=y_predicted_class

print("Accuracy:")
print(accuracy_score(classifications,predictions))
print(classification_report(classifications,predictions))

In [504]:
def get_passage_features(passage,question,answer_type):
    # number of named entities in the passage
        num_entities=len(common_entities)

        # number of question keywords in the passage
        question_keywords=get_keyword(question)
        answer_passage_keywords=get_keyword(answer_found)
        qk_passage=[]
        for qk in question_keywords:
            if qk in answer_passage_keywords:
                qk_passage.append(qk)
        num_qkp=len(qk_passage)   

        # longest exact sequence of keywords
        longest_exact_sequence=0

        for i in range(len(question_keywords)):
            if i < len(answer_passage_keywords):
                if question_keywords[i] in answer_passage_keywords[i]:
                    longest_exact_sequence+=1

        # rank of the paragraph where the answer sentence was extracted
        results = query_vsm(question_keywords, vsm_inverted_index_all[row['docid']])
        documents_ranked=results.most_common(10) 
        rank_of_paragraph=0
        for document in documents_ranked:
            if (document[0]==row['answer_paragraph']):
                break
            else:
                rank_of_paragraph+=1

        return [num_entities,num_qkp,longest_exact_sequence,rank_of_paragraph]
    

5396

In [None]:
print(sum(Y))

In [394]:
df_devel.head()

Unnamed: 0,answer_paragraph,docid,question,text
0,5,380,On what date did the companies that became the Computing-Tabulating-Recording Company get consolidated?,"june 16 , 1911"
1,22,380,What percentage of its desktop PCs does IBM plan to install Open Client on to?,5 %
2,16,380,What year did IBM hire its first black salesman?,1946
3,4,380,"IBM made an acquisition in 2009, name it.",spss
4,2,380,"This IBM invention is known by the acronym UPC, what is the full name?",universal product code


In [426]:
a={'a':1,'b':2,'c':3}

In [427]:
a[0]

KeyError: 0

In [428]:
df_training.head()

Unnamed: 0,answer_paragraph,docid,question,text,NE_question,NE_text,NE_paragraph,answer_type,keywords_question,question_type,POS_questions,answer_found
0,23,0,A kilogram could be definined as having a Planck constant of what value?,6966662606895999999♠6.62606896×10−34 j⋅s,[],"[(6966662606895999999 ♠ 6.62606896, NUMBER), (10 − 34, NUMBER)]","[[(general, TITLE), (2011, DATE)], [], [(one, NUMBER)], [(7050135639273999999 ♠ 135639274 ×, NUMBER), (1042, DATE), (6966662606895999999 ♠ 6.62606896, NUMBER), (10 − 34, NUMBER), (⋅, NUMBER)]]",NUMBER,"[kilogram, definined, planck, constant, value]",what,"[NN, VBN, NNP, NN, NN]","Possible new definitions include ""the mass of a body at rest whose equivalent energy equals the energy of photons whose frequencies sum to 7050135639273999999♠135639274×1042 Hz"", or simply ""the kilogram is defined so that the Planck constant equals 6966662606895999999♠6.62606896×10−34 J⋅s""."
1,22,0,What is the shape of the object that establishes the base unit of the kilogram?,cylinder,[],[],"[[], [], [(1889, DATE), (paris, CITY)], [(1889, DATE), (1, NUMBER), (one, NUMBER), (million, NUMBER)], [(one, NUMBER), (current, DATE), (planck, LOCATION)]]",,"[shape, object, establish, base, unit, kilogram]",what,"[NN, NN, VBZ, JJ, NN, NN]","The most urgent unit on the list for redefinition is the kilogram, whose value has been fixed for all science (since 1889) by the mass of a small cylinder of platinum–iridium alloy kept in a vault just outside Paris."
2,12,0,What example is given as another paired relationship of uncertainly related to standard deviation?,time vs. energy,[],[],"[[], [], [(one, NUMBER)], [(fourier, LOCATION)]]",,"[example, give, pair, relationship, relate, standard, deviation]",what,"[NN, VBN, JJ, NN, VBN, JJ, NN]",One example is time vs. energy.
3,1,0,What does the Planck Constant refer to?,quantum of action,[],[],"[[], [(planck, PERSON)], [(now, DATE)], [], []]",,"[doe, planck, constant, refer]",what,"[VBZ, NNP, NNP, NN]","Instead, it must be some multiple of a very small quantity, the ""quantum of action"", now called the Planck constant."
4,10,0,When was the first quantized model of the atom introduced?,1913,"[(first, ORDINAL), (model, TITLE)]","[(1913, DATE)]","[[(niels bohr, PERSON), (first, ORDINAL), (model, TITLE), (1913, DATE), (rutherford, PERSON), (model, TITLE)], [], [], [(bohr, PERSON), (planck, PERSON), (bohr, PERSON)]]",DATE,"[wa, first, quantize, model, atom, introduce]",when,"[VBD, JJ, JJ, NN, NN, VBD]","Niels Bohr introduced the first quantized model of the atom in 1913, in an attempt to overcome a major shortcoming of Rutherford's classical model."


In [534]:
len(df_training[df_training.answer_found=='UNKNOWN'])

2743

In [542]:
df_training.iloc[3105]

answer_paragraph     19
docid                28

In [538]:
a='The winner of the 2014 Nobel Prize in Literature, Patrick Modiano–who lives in Paris–, based most of his literary work on the depiction of the city during World War II and the 1960s-1970s.'

if 'patrick modiano' in a.lower():
    print(True)

True
