# Project PRI

The goal of this project is to implement an Information Search & Extraction System for
the analysis of political discourse. Your system will have access to a large set of documents containing the electoral manifestos
of several political parties from different countries in the world. Using this data,
the system should be able to provide the following functionalities.

a) Ad hoc search on the collection of documents

Given a query, represented by a set of keywords, the system should return all manifestos containing such keywords, ordered according to their relevance to the query.

1- Create an inverted index (dictionary) over all documents in the collection.

2- Take the query from the command line and transform it the same way as the documents, so that it can be compared against the index dictionary to retrieve every relevant document.

3- Rank the retrieved documents by their relevance to the query.
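
The three steps above can be sketched in plain Python before bringing in a library. This is only an illustration under simple assumptions: whitespace tokenization, a match requires every keyword, and the score is the summed term frequency. The names `tokenize`, `build_inverted_index`, `search` and the toy documents are hypothetical and do not come from the manifesto data.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split on whitespace (deliberately naive)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Step 1: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index

def search(index, query):
    """Return the ids of documents containing every query term, best first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Step 2: keep only the documents that contain all the keywords.
    matching = set(postings[0]).intersection(*postings[1:])
    # Step 3: rank by the summed frequency of the query terms (ties by id).
    scores = {d: sum(p[d] for p in postings) for d in matching}
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection, not taken from the manifesto data.
toy_docs = {
    "d1": "tax cuts and more tax relief",
    "d2": "health care and tax reform",
    "d3": "education and health",
}
toy_index = build_inverted_index(toy_docs)
print(search(toy_index, "tax"))         # d1 mentions "tax" twice, d2 once
print(search(toy_index, "health tax"))  # only d2 contains both keywords
```

Summed term frequency favours long documents, which is one reason a weighting scheme such as TF-IDF or BM25 is normally used for the ranking step.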


We can suppress common words, treat the different conjugations of the same verb as a single term, and so on.
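
A minimal sketch of that kind of normalization, assuming a tiny hand-picked stop list and a crude suffix stripper. Both are stand-ins chosen for illustration (`STOP_WORDS`, `SUFFIXES`, `crude_stem` and `normalize` are hypothetical names); a real stemmer such as the one behind Whoosh's `StemmingAnalyzer` handles far more cases.

```python
import re

# Assumption: a tiny hand-picked stop list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "we"}

# Assumption: crude suffix stripping, only to illustrate the idea.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, keep runs of letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize("We invested in the schools"))
print(normalize("The party is investing in a school"))
```

Both sentences end up sharing the terms `invest` and `school`, so a query using one verb form can match a document using another. The same function has to be applied to documents and queries alike.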

In [9]:
#read csv file
import pandas as pd

data = pd.read_csv("en_docs_clean.csv")

#creates a DataFrame with columns text, manifesto_id, party, date and title
print(data.shape[0])    #number of rows
print(data.iloc[16725]) #last row (position 16725 of 16726 rows)
print(data.iloc[21])    #row at position 21

#should the rows that share the same manifesto_id be merged into a single text?
#https://whoosh.readthedocs.io/en/latest/indexing.html#indexing-documents

16726
text            Table 1 (continued) All figures in £bn TAX AND...
manifesto_id                                         51951_201505
party                           United Kingdom Independence Party
date                                                       201505
title                     Believe in Britain. UKIP Manifesto 2015
Name: 16725, dtype: object
text            Our aim: To make Britain the world's foremost ...
manifesto_id                                         51421_199705
party                                           Liberal Democrats
date                                                       199705
title                                         Make the Difference
Name: 21, dtype: object
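
The cell above leaves open whether rows that share a `manifesto_id` should be merged into one text. If that turns out to be desirable, the merge could look like the sketch below. It uses plain dictionaries instead of the DataFrame so that it runs on its own; `merge_by_manifesto` and the sample rows are hypothetical, and the sample text is invented.

```python
from collections import defaultdict

def merge_by_manifesto(rows):
    """Concatenate the text of rows sharing a manifesto_id, keeping row order."""
    chunks = defaultdict(list)
    meta = {}
    for row in rows:
        key = row["manifesto_id"]
        chunks[key].append(row["text"])
        # Keep the metadata of the first row seen for each manifesto.
        meta.setdefault(key, {k: v for k, v in row.items() if k != "text"})
    return [dict(meta[key], text=" ".join(parts)) for key, parts in chunks.items()]

# Hypothetical rows mimicking the CSV columns; the values are invented.
rows = [
    {"manifesto_id": "A_1", "party": "Party A", "date": 200001,
     "title": "Title A", "text": "first part"},
    {"manifesto_id": "B_1", "party": "Party B", "date": 200001,
     "title": "Title B", "text": "other"},
    {"manifesto_id": "A_1", "party": "Party A", "date": 200001,
     "title": "Title A", "text": "second part"},
]
merged = merge_by_manifesto(rows)
print(len(merged))
print(merged[0]["text"])
```

With pandas, a roughly equivalent one-liner would be `data.groupby("manifesto_id", sort=False)["text"].agg(" ".join)`. Merging yields one hit per manifesto, while keeping the rows separate lets a search point at the specific passage, so the choice depends on what the results should show.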


In [14]:
#now we create an index to make access and search easier
import nltk
import os.path
import shutil
from whoosh.fields import Schema, TEXT, ID, NUMERIC
from whoosh.index import create_in
from whoosh.query import Every
from whoosh.analysis import StemmingAnalyzer
from whoosh.formats import Frequency

#define the index's schema, that lists the fields in the index

#a field is a piece of information for each document in the index,
#such as its title or text content. It can be searched and/or stored
#(meaning the value that gets indexed is returned with the results)

#Indexing a field means it can be searched; it is also returned
#with the results if stored=True is passed for it in the schema.

# in our data, we have the text,manifesto_id,party,date,title

def createIndexComplete(data):
    
    #composes a RegexTokenizer (a customizable, regular-expression-based tokenizer that extracts words
    #and ignores whitespace and punctuation) with a LowercaseFilter, a StopFilter (drops common words)
    #and a StemFilter (reduces words to their stem, so inflected forms map to the same term)
    analyzer = StemmingAnalyzer() 
    
    vector_format = Frequency() #Stores the number of times each term appears in each document.
    
    schema = Schema(
        text=TEXT(analyzer=analyzer, vector=vector_format),
        manifesto_id=ID(stored=True),
        party=TEXT(stored=True, vector=vector_format),
        date=NUMERIC,
        title=TEXT(stored=True, analyzer=analyzer, vector=vector_format),
    )
    print(schema)
    
    if os.path.isdir("index"):
        shutil.rmtree("index")

    if not os.path.exists("index"):
        os.mkdir("index")
    
    index = create_in("index", schema)
    #The main index is an inverted index. It maps terms to the documents they appear in.
    
    #create an index writer to add documents
    writer = index.writer()
    
    for i in range(data.shape[0]):
        #print(i)
        #print(data.loc[i, "manifesto_id"])
        writer.add_document(
            text=data.loc[i, "text"],
            manifesto_id=data.loc[i, "manifesto_id"],
            party=data.loc[i, "party"],
            date=data.loc[i, "date"],
            title=data.loc[i, "title"],
        )
        #print("One added")
    print("Going to commit")
    writer.commit()
    return index
    
    
index = createIndexComplete(data)


<Schema: ['date', 'manifesto_id', 'party', 'text', 'title']>
Going to commit
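
Whoosh scores hits with BM25F by default, as the next cell notes. To make that ranking less of a black box, here is a sketch of plain single-field BM25 on a toy collection. It is not Whoosh's implementation: `bm25_scores` and `toy_tokens` are hypothetical names, the smoothed idf is one common variant, and `k1 = 1.2`, `b = 0.75` are commonly used parameter values assumed here.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document against the query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed idf that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalised through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores

# Hypothetical, already tokenized toy collection.
toy_tokens = {
    "d1": "tax cuts and more tax relief".split(),
    "d2": "health care and tax reform".split(),
    "d3": "education and health".split(),
}
scores = bm25_scores(["tax"], toy_tokens)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The term-frequency part saturates as `tf` grows, so a document repeating a keyword many times gains less and less from each repetition, unlike a raw frequency count.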


In [20]:
#By default, Whoosh returns the results ordered by their BM25F score

def showIndex(index):
    with index.searcher() as searcher:
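        # Match every document that has something in the "text" field
        # (search() returns only the top 10 hits unless limit is changed)
        results = searcher.search(Every('text'))
        for result in results:
            print ("Id: %s Party: %s" % (result['manifesto_id'], result['party']))
            print ("Text:")
            # "text" is not stored in the schema, so the Hit shows only the stored fields
            print (result)
            print("Score:", result.score)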
        
showIndex(index)

Id: 51320_196410 Party: Labour Party
Text:
<Hit {'manifesto_id': '51320_196410', 'party': 'Labour Party', 'title': 'Let’s Go With Labour for the New Britain'}>
Id: 51620_196410 Party: Conservative Party
Text:
<Hit {'manifesto_id': '51620_196410', 'party': 'Conservative Party', 'title': '‘Prosperity with a Purpose’, Conservative and Unionist Party’s Policy'}>
Id: 51320_196603 Party: Labour Party
Text:
<Hit {'manifesto_id': '51320_196603', 'party': 'Labour Party', 'title': 'Time for Decision'}>
Id: 51620_196603 Party: Conservative Party
Text:
<Hit {'manifesto_id': '51620_196603', 'party': 'Conservative Party', 'title': 'Action not Words: New Conservative Programme'}>
Id: 51320_197006 Party: Labour Party
Text:
<Hit {'manifesto_id': '51320_197006', 'party': 'Labour Party', 'title': 'Now Britain’s Strong - Let’s Make it Great to Live In'}>
Id: 51620_197006 Party: Conservative Party
Text:
<Hit {'manifesto_id': '51620_197006', 'party': 'Conservative Party', 'title': 'A Better Tomorrow'}>
Id: 