# Project PRI

The goal of this project is to implement an Information Search & Extraction System for
the analysis of political discourse. Your system will have access to a large set of documents containing the electoral manifestos
of several political parties from different countries in the world. Using this data,
the system should be able to provide the following functionalities.

a) Ad hoc search on the collection of documents

Given a query, represented by a set of keywords, the system should return all manifestos containing such keywords, ordered according to their relevance to the query.

1- create an inverted index/dictionary for all documents in the document collection

2- from query given in command line, transform it, to compare with document dictionary and retrieve all that are relevant

3- ranking of documents (based on relevance to the query)


we can surpress common words, or consider the different conjugations of the same verb the same term, and so on!

In [1]:
#read csv file 
import pandas as pd
import numpy as np

data = pd.read_csv("en_docs_clean.csv")


#creates a frame with columns text, id, party, date and title
print(data.shape[0])
print(data.shape[1])
#isinstance(s, str)
#print(isinstance(data.loc[0,"text"], str))
print(data.iloc[16725])
#print(data.iloc[21]) #to access first row

#devo comprimir num só texto os que possuem o mesmo manifesto_id?
#https://whoosh.readthedocs.io/en/latest/indexing.html#indexing-documents

16726
5
text            Table 1 (continued) All figures in £bn TAX AND...
manifesto_id                                         51951_201505
party                           United Kingdom Independence Party
date                                                       201505
title                     Believe in Britain. UKIP Manifesto 2015
Name: 16725, dtype: object


In [4]:
dic = {}

dic = {}

for i in range(data.shape[0]):
    key = data.loc[i, "manifesto_id"]
    if key not in dic:
        dic[key] = []
        #append text
        dic[key].append(data.loc[i,"text"])
        #manifesto_id
        dic[key].append(key)
        #party
        #dic[key][2] = data.loc[i,"party"]
        dic[key].append(data.loc[i,"party"])
        #data
        dic[key].append(data.loc[i,"date"])
        #title
        dic[key].append(data.loc[i,"title"])
        #print("this should be 5 but is:", len(dic[key]))
    else:
        dic[key][0] += data.loc[i,"text"]
        #for each manifesto, same party, date and title

#create DataFrame from dict
 #columns=['text', 'manifesto_id', 'party', 'date', 'title']
data_proc = pd.DataFrame.from_dict(dic, orient='index')
#data_proc = pd.DataFrame.from_dict(list(dic.items()))
print(data_proc.shape[1])
print(data_proc)
#print(pd.DataFrame.from_dict(list(dic.items())))

5
                                                              0             1  \
51320_196410  "THE NEW BRITAIN"  The world wants it and woul...  51320_196410   
51620_196410  "PROSPERITY WITH A PURPOSE"  Foreward  by Sir ...  51620_196410   
51320_196603  Time for Decision  PREFACE: TIME FOR DECISION ...  51320_196603   
51620_196603  Action Not Words: The New Conservative Program...  51620_196603   
51320_197006  NOW BRITAIN'S STRONG - LET'S MAKE IT GREAT TO ...  51320_197006   
51620_197006  A Better Tomorrow  FOREWORD  This Manifesto se...  51620_197006   
51320_197402  Let us work together - Labour's way out of the...  51320_197402   
51620_197402  'Firm action for a fair Britain'  Today we fac...  51620_197402   
51320_197410  BRITAIN WILL WIN WITH LABOR  FOREWORD BY  THE ...  51320_197410   
51620_197410  Putting Britain First  A national policy  The ...  51620_197410   
51320_197905  'The Labour Way is the Better Way'  FOREWARD: ...  51320_197905   
51620_197905  FOREWORD  FO

In [5]:
#now we want to create an index in order to ease the access and search
import nltk
import os.path
import shutil
from whoosh.fields import *
from whoosh.index import create_in, open_dir
from whoosh.query import Every
from whoosh.qparser import MultifieldParser, OrGroup
from whoosh.analysis import StemmingAnalyzer
from whoosh.formats import Frequency
from whoosh import scoring

#define the index's schema, that lists the fields in the index

#a field is a piece of information for each document in the index,
#such as its title or text content. It can be searched and/or stored
#(meaning the value that fets indexed is returned with the results)

#ndexing of a field means it can be searched and it is also returned 
#with results if defined as argument (stored=True) in schema.

# in our data, we have the text,manifesto_id,party,date,title

def createIndexComplete(data):
    
    #composes a RegexTokenizer (class implements a customizable, regular-expression-based tokenizer that extracts words
    #and ignores whitespace and punctuation) + LowerCaseFilter + StopWordsFilter + stemming filter(verbs converted to infinitive)
    analyzer = StemmingAnalyzer() 
    
    vector_format = Frequency() #Stores the number of times each term appears in each document.
    
    schema = Schema(text=TEXT(analyzer=analyzer, vector=vector_format), manifesto_id=ID(stored=True, sortable=True), party=TEXT(stored=True, vector=vector_format, sortable=True), date=NUMERIC, title=TEXT(stored=True, analyzer=analyzer, vector=vector_format))
    print(schema)
    
    if os.path.isdir("index"):
        shutil.rmtree("index")

    if not os.path.exists("index"):
        os.mkdir("index")
    
    index = create_in("index", schema)
    #The main index is an inverted index. It maps terms to the documents they appear in.
    
    #create an index writer to add documents
    writer = index.writer()
    for ind,row in data.iterrows():
        print(row[1])
       # print(data.loc[i, 'manifesto_id'])
        #writer.add_document(text=data.at[i,'text'], manifesto_id=data.at[i, "manifesto_id"], party=data.at[i, "party"], date=data.iat[i, "date"], title=data.iat[i, "title"])
        writer.add_document(text=row[0], manifesto_id=row[1], party=row[2], date=row[3], title=row[4])

    print("Going to commit")
    writer.commit()
    return index
    
    
index = createIndexComplete(data_proc)


<Schema: ['date', 'manifesto_id', 'party', 'text', 'title']>
51320_196410
51620_196410
51320_196603
51620_196603
51320_197006
51620_197006
51320_197402
51620_197402
51320_197410
51620_197410
51320_197905
51620_197905
51320_198306
51620_198306
51320_198706
51620_198706
51320_199204
51421_199204
51620_199204
51320_199705
51421_199705
51620_199705
51902_199705
51320_200106
51421_200106
51620_200106
51902_200106
51951_200106
51320_200505
51421_200505
51620_200505
51110_201505
51210_201505
51320_201505
51340_201505
51421_201505
51620_201505
51621_201505
51901_201505
51902_201505
51903_201505
51951_201505
Going to commit


In [53]:
#By default, Whoosh returns the results ordered using the BM25 similarity

def showIndex(index):
    with index.searcher() as searcher:
        # Match any documents with something in the "text" field
        q = Every("text")
        results = searcher.search(q, limit=None)
        print("Number of total documents:", searcher.doc_count())
        for result in results:
            print ("Id: %s Party: %s" % (result['manifesto_id'], result['party']))
            print ("Text:")
            print (result['title'])
            #print("Score:", result.score)
        
       # freq = searcher.frequency("content", "wobble")
        
showIndex(index)

Number of total documents: 42
Id: 51320_196410 Party: Labour Party
Text:
Let’s Go With Labour for the New Britain
Id: 51620_196410 Party: Conservative Party
Text:
‘Prosperity with a Purpose’, Conservative and Unionist Party’s Policy
Id: 51320_196603 Party: Labour Party
Text:
Time for Decision
Id: 51620_196603 Party: Conservative Party
Text:
Action not Words: New Conservative Programme
Id: 51320_197006 Party: Labour Party
Text:
Now Britain’s Strong - Let’s Make it Great to Live In
Id: 51620_197006 Party: Conservative Party
Text:
A Better Tomorrow
Id: 51320_197402 Party: Labour Party
Text:
Let us Work Together - Labour’s Way Out of the Crisis
Id: 51620_197402 Party: Conservative Party
Text:
Firm Action for a Fair Britain
Id: 51320_197410 Party: Labour Party
Text:
Britain Will Win With Labour
Id: 51620_197410 Party: Conservative Party
Text:
Putting Britain First
Id: 51320_197905 Party: Labour Party
Text:
The Labour Way is the Better Way
Id: 51620_197905 Party: Conservative Party
Text:
The

In [57]:
#experimentar vários critérios de score!!! ranking

def searchQuery(arg, index, w):
    #arg = unicode(arg, "utf-8") #convert to unicode to be processed by Whoosh
    #arg = arg.decode(encoding = 'UTF-8',errors = 'strict')
    ix = open_dir("index")
    with ix.searcher(weighting = w) as searcher:
        #print("Using the scoring criteria:", w)
        #search() takes query object and returns result object
        query_parser = MultifieldParser(["title","text", "party"], schema=ix.schema, group=OrGroup) #search 
        # OrGroup -> so that any of the terms may be present for a document to match
        query = query_parser.parse(arg)
        results = searcher.search(query, limit=None) 
        #By default the results contains at most the first 10 matching documents; limit=None all results
        print ("Number of results:", results.scored_length())
        #print("Documents that match with query:", list(query.docs(searcher)).sort())
        docs = []
        for result in results:
           # print(result)
            print("Score:", result.score)
            print("Document:", result.docnum)
            docs.append(result.docnum)
            #By default, Whoosh returns the results ordered using the BM25 similarity.
            #Consider not only the term frequency and inverse document
            #frequency heuristics, but also the document length as a
            #normalization factor for the term frequency
        print("Documents that match with query ordered by score:", docs)
        
print("Using BM25F as scoring criteria with k1=1,5")        
w = scoring.BM25F(B=0.75, content_B=1.0, K1=1.5) #default parameters for BM25 ??
searchQuery("world school primary", index, w)
print("Using BM25F as scoring criteria with k1=1,2")
w = scoring.BM25F(B=0.75, content_B=1.0, K1=1.2)
searchQuery("world school primary", index, w)
print("Using TF-IDF as scoring criteria")
w = scoring.TF_IDF()
searchQuery("world school primary", index, w)
print("Using Frequency as scoring criteria")
w = scoring.Frequency()
#check how it does frequency -> sum??
searchQuery("world school primary", index, w)
print("Using BM25 as scoring criteria")
w= scoring.BM25()
searchQuery("world school primary", index, w)

#they will have different orders! of displaying the docs

Using BM25F as scoring criteria with k1=1,5
Number of results: 42
Score: 7.276704613083039
Document: 23
Score: 7.155324483584744
Document: 28
Score: 7.070756285871019
Document: 36
Score: 7.070252426742156
Document: 19
Score: 6.979598183425653
Document: 21
Score: 6.97839564111772
Document: 15
Score: 6.950441405145526
Document: 24
Score: 6.940584406946761
Document: 26
Score: 6.895800920457827
Document: 41
Score: 6.873633094548244
Document: 3
Score: 6.834574940352524
Document: 29
Score: 6.831417797474144
Document: 31
Score: 6.785409238519012
Document: 33
Score: 6.7465644333903905
Document: 37
Score: 6.6428825105633456
Document: 30
Score: 6.636902242294807
Document: 25
Score: 6.595154614069755
Document: 27
Score: 6.519963831247696
Document: 20
Score: 6.503868394594681
Document: 7
Score: 6.4985395037086455
Document: 12
Score: 6.490179746390241
Document: 0
Score: 6.484890935449877
Document: 35
Score: 6.375696580068961
Document: 38
Score: 6.372924812131904
Document: 16
Score: 6.29754213593674

AttributeError: module 'whoosh.scoring' has no attribute 'BM25'

In [9]:
#For each party, how many manifestos are in the results returned
def manifestos_per_party(index):
    ix = open_dir("index")
    with ix.searcher() as searcher:
        q = Every("manifesto_id")
        #results = searcher.search(q, groupedby="party")
        results = searcher.search(q, sortedby="party", limit=None)
        print ("Number of results:", results.scored_length())
       # print("Size of groups in results:", results.groups("party"))
        for result in results:
            print(result)
            
        #too complicated

def manif_per_party(data):
    #parties = list(data.groupby(['party']).groups.keys())
    #print("List of the different parties:", parties)
    number_manif = data[2].value_counts()
    
    print(number_manif)
    
def number_docs_per_party(data):
    n = data['party'].value_counts()
    print("Number of parties:", len(n))
    print("Number of documents for each party:")
    print(n)

def number_manifestos(data):
    n = data['manifesto_id'].nunique()
    print("Number of manifestos:", n)

def number_parties(data):
    return len(data['party'].value_counts())
    
    
manif_per_party(data_proc)
#number_docs_per_party(data)
#number_manifestos(data)
#TODO

Conservative Party                    13
Labour Party                          13
Liberal Democrats                      5
Scottish National Party                3
United Kingdom Independence Party      2
We Ourselves                           1
Democratic Unionist Party              1
Social Democratic and Labour Party     1
Green Party of England and Wales       1
The Party of Wales                     1
Ulster Unionist Party                  1
Name: 2, dtype: int64


In [18]:
#How many times each party mentions each keyword
def keyword_times_party(index, arg):
    ix = open_dir("index")
    with index.searcher(weighting = scoring.Frequency()) as searcher:
        query_parser = MultifieldParser(["title","text"], schema=ix.schema, group=OrGroup)
        number_times = {}
        for word in nltk.word_tokenize(arg):
            number_times[word] = []
            #dictionary to store number of times word appears for a certain party
            #key = party value=number_times
            query = query_parser.parse(word)
            #print(query)
            results = searcher.search(query, limit=None)
            dic_aux = {}
            for result in results:
                #print(result)
                #print("Score:", result.score)
                if result['party'] not in dic_aux:
                    dic_aux[result['party']] = result.score
                else:
                    dic_aux[result['party']] += result.score
            #we have dic_aux with frequency for word for each party as a total for all results
            for key,value in dic_aux.items():
                print("Party:", key)
                print("number of times:", value)
                number_times[word].append([key,value])
        print_dict(number_times)

def print_dict(dictionary):
    for key, value in dictionary.items():
        print("Number of times per party for keyword:", key)
        #print(value)
        for i in range(len(value)):
            print("%s - %s" % (value[i][0], value[i][1]))
        #print("%s: %s" % (keyword, int(value)))
        
keyword_times_party(index, "world school primary")        

Party: Conservative Party
number of times: 322.0
Party: Green Party of England and Wales
number of times: 55.0
Party: Labour Party
number of times: 302.0
Party: Liberal Democrats
number of times: 126.0
Party: United Kingdom Independence Party
number of times: 55.0
Party: Scottish National Party
number of times: 39.0
Party: The Party of Wales
number of times: 15.0
Party: Ulster Unionist Party
number of times: 13.0
Party: Democratic Unionist Party
number of times: 6.0
Party: Social Democratic and Labour Party
number of times: 3.0
Party: Labour Party
number of times: 457.0
Party: Conservative Party
number of times: 444.0
Party: Liberal Democrats
number of times: 256.0
Party: Green Party of England and Wales
number of times: 45.0
Party: United Kingdom Independence Party
number of times: 63.0
Party: The Party of Wales
number of times: 35.0
Party: Scottish National Party
number of times: 42.0
Party: Social Democratic and Labour Party
number of times: 11.0
Party: Ulster Unionist Party
number 