# Indexing the Wikileaks cable archive

In this exercise we will index the entire Wikileaks cable archive with Elasticsearch and test some of its features as an information retrieval system.

To follow this exercise, you need to download the following file:
https://archive.org/details/wikileaks-cables-csv

You also need to install the cablemap library: https://github.com/heuer/cablemap


In [38]:
from cablemap.core import cables_from_source
from elasticsearch import Elasticsearch as ES
from elasticsearch import helpers
import datetime

First we have to locate the file containing the cables.

In [39]:
fname = "../../../../Downloads/wikileaks_cables/cables.csv"
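
Because the path above is relative, it only resolves when the notebook runs from the expected directory. The following is a minimal sketch of a sanity check before iterating over the file; `describe_file` is a hypothetical helper, not part of cablemap.

```python
import os

def describe_file(path):
    """Return (exists, size_in_bytes); size is None when the file is missing."""
    if os.path.isfile(path):
        return True, os.path.getsize(path)
    return False, None

# Same relative path as in the notebook; adjust it to your own download location.
exists, size = describe_file("../../../../Downloads/wikileaks_cables/cables.csv")
print("exists={} size={}".format(exists, size))
```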

In [40]:
i = 0
for cable in cables_from_source(fname):
    print(cable.subject, cable.created)
    if i>5:
        break
    i += 1

(u'EXTENDED NATIONAL JURISDICTIONS OVER HIGH SEAS', u'1966-12-28 18:48')
(u'ACCELERATION OF F-4ES FOR IRAN', u'1972-02-25 09:30')
(u'TRIALS/EXECUTIONS OF ANTI-GOVERNMENT ELEMENTS: STUDENTS DEMONSTRATE AND SHAH LASHES OUT AT FOREIGN CRITICS', u'1972-03-09 05:40')
(u'CONTINUING TERRORIST ACTIVITIES IN IRAN', u'1972-08-10 04:00')
(u'CONTINUING TERRORIST VIOLENCE', u'1972-08-22 09:27')
(u'AUDIENCE WITH SHAH APRIL 5', u'1973-04-02 08:34')
(u'ASSASSINATION/KIDNAP PLOT AGAINST SHAH REVEALED', u'1973-10-02 14:00')
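
The manual counter used above can also be written with `itertools.islice`, which takes the first few items of any iterator without reading the rest. This is only a sketch: `fake_cables` is a hypothetical stand-in for `cables_from_source(fname)` so that the example runs without the CSV file.

```python
from itertools import islice

def preview(iterable, n=7):
    """Return the first n items of an iterable as a list."""
    return list(islice(iterable, n))

def fake_cables():
    # Hypothetical stand-in for cables_from_source(fname):
    # an endless stream of (subject, created) pairs.
    i = 0
    while True:
        yield ("SUBJECT %d" % i, "1972-01-%02d 09:30" % (i + 1))
        i += 1

first = preview(fake_cables(), 3)
for subject, created in first:
    print("{} | {}".format(subject, created))
```

With the real data, `preview(cables_from_source(fname))` would return the first seven cable objects.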


Each cable has several attributes, so we need to build a JSON document containing all of them.

We will ignore two attributes because of bugs in the cablemap library.

In [41]:
atributos = [i for i in dir(cable) if not i.startswith('_')]
print atributos
# Remove the two problematic attributes by name rather than by list position
atributos.remove('classification_categories')
atributos.remove('comment')
print atributos

['cabledrum_uri', 'canonical_id', 'classification', 'classification_categories', 'classified_by', 'comment', 'content', 'created', 'header', 'info_recipients', 'is_partial', 'nondisclosure_deadline', 'origin', 'plusd_canonical_id', 'plusd_uri', 'recipients', 'reference_id', 'references', 'released', 'signed_by', 'subject', 'summary', 'tags', 'transmission_id', 'wl_uris']
['cabledrum_uri', 'canonical_id', 'classification', 'classified_by', 'content', 'created', 'header', 'info_recipients', 'is_partial', 'nondisclosure_deadline', 'origin', 'plusd_canonical_id', 'plusd_uri', 'recipients', 'reference_id', 'references', 'released', 'signed_by', 'subject', 'summary', 'tags', 'transmission_id', 'wl_uris']
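
Selecting attributes by name is less fragile than relying on their position in the list, since positions shift if the library adds or renames an attribute. Here is a minimal sketch of that idea; `public_attributes` and `FakeCable` are hypothetical names used for illustration, with `FakeCable` standing in for a real cablemap cable object.

```python
def public_attributes(obj, exclude=()):
    """List the public attribute names of obj, skipping the names in exclude."""
    skip = set(exclude)
    return [name for name in dir(obj)
            if not name.startswith('_') and name not in skip]

class FakeCable(object):
    # Hypothetical stand-in for a cablemap cable object.
    subject = "EXAMPLE"
    created = "1973-10-02 14:00"
    comment = None
    classification_categories = None

names = public_attributes(FakeCable(),
                          exclude=("classification_categories", "comment"))
print(names)
```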


In [42]:
getattr(cable,'cabledrum_uri')

u'http://www.cabledrum.net/cables/73TEHRAN7005'
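
`getattr` also accepts a third argument that is returned when the lookup raises `AttributeError`, including when the error comes from inside a property. The sketch below illustrates this with a hypothetical `FlakyCable` class whose `references` property always fails; it is not the cablemap implementation.

```python
class FlakyCable(object):
    # Hypothetical object with one working attribute and one failing property.
    subject = "EXAMPLE"

    @property
    def references(self):
        raise AttributeError("simulated failure inside a property")

def safe_get(obj, name, default=""):
    """Read an attribute, falling back to default on AttributeError."""
    return getattr(obj, name, default)

cable_like = FlakyCable()
print(repr(safe_get(cable_like, "subject")))
print(repr(safe_get(cable_like, "references")))
```

The trade-off is that the fallback is silent, so any failure goes unreported unless you log it yourself.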

Now we have to configure Elasticsearch to receive our collection:

In [43]:
es = ES()
es.indices.create(index='wikileaks', ignore=400)

{u'error': u'IndexAlreadyExistsException[[wikileaks] already exists]',
 u'status': 400}

In [None]:
def build_doc(cable):
    """Build a dictionary with the selected attributes of a cable."""
    doc = {}
    for a in atributos:
        try:
            if a == 'created':
                # Parse the timestamp string, e.g. "1973-10-02 14:00"
                doc[a] = datetime.datetime.strptime(getattr(cable, a).strip(), "%Y-%m-%d %H:%M")
            else:
                doc[a] = getattr(cable, a)
        except AttributeError as e:
            # Some attributes fail on certain cables; report it and store an empty string
            print e
            doc[a] = ""
    return doc

def cable_doc_gen():
    """
    Generator function that iterates over cables.csv,
    yielding one cable at a time wrapped in a dictionary compatible with Elasticsearch.
    """
    for j,cable in enumerate(cables_from_source(fname)):
        doc = build_doc(cable)

        action = {
            "_index": "wikileaks",
            "_type": "telegramas",
            "_id": j,
            "doc": doc
            }
        if j%1000 == 0:
            # Progress message, in Portuguese: "Indexing cable number N"
            print "Indexando telegrama número {}".format(j)
        yield action

helpers.bulk(es, cable_doc_gen(), chunk_size=1000)

FM SECSTATE WASHDC
INFO AMEMBASSY ATHENS
"
FM SECSTATE WASHDC
INFO USDEL SECRETARY IMMEDIATE
"
FM SECSTATE WASHDC
INFO USMISSION GENEVA IMMEDIATE
AMEMBASSY HELSINKI IMMEDIATE
"
FM SECSTATE WASHDC
INFO ALL AMERICAN REPUBLIC DIPLOMATIC POSTS IMMEDIATE
"
FM SECSTATE WASHDC
INFO AMEMBASSY KUWAIT IMMEDIATE
"
FM SECSTATE WASHDC
INFO USDEL SECRETARY IMMEDIATE
"
FM SECSTATE WASHDC
INFO USDEL SECRETARY IMMEDIATE
"
FM SECSTATE WASHDC
INFO USDEL SECRETARY IMMEDIATE 
"


Indexando telegrama número 0
'Reference' object has no attribute 'name'
Indexando telegrama número 1000

FM SECSTATE WASHDC
INFO ALL DIPLOMATIC AND CONSULAR POSTS
SPECIAL EMBASSY PROGRAM
USOFFICE PRISTINA 
AMEMBASSY DUSHANBE 
AMEMBASSY BELGRADE 
AMEMBASSY FREETOWN "



'Reference' object has no attribute 'name'
'Reference' object has no attribute 'name'
'Reference' object has no attribute 'name'

FM SECSTATE WASHDC
INFO ALL DIPLOMATIC AND CONSULAR POSTS
SPECIAL EMBASSY PROGRAM
AMEMBASSY FREETOWN 
USOFFICE PRISTINA 
AMEMBASSY DUSHANBE 
AMEMBASSY BELGRADE "



'Reference' object has no attribute 'name'
'Reference' object has no attribute 'name'
Indexando telegrama número 2000
'Reference' object has no attribute 'name'
'Reference' object has no attribute 'name'
Indexando telegrama número 3000
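
To see the shape of the actions passed to `helpers.bulk` without a running Elasticsearch server, the sketch below builds them from plain dictionaries. The `parse_created` and `actions_from` helpers are hypothetical, and the two sample records reuse subjects and dates printed earlier in this notebook.

```python
import datetime

def parse_created(text):
    """Parse a cable timestamp such as '1973-10-02 14:00' into a datetime."""
    return datetime.datetime.strptime(text.strip(), "%Y-%m-%d %H:%M")

def actions_from(records, index="wikileaks", doc_type="telegramas"):
    """Yield one bulk action per record, using the enumeration index as the id."""
    for j, record in enumerate(records):
        doc = dict(record)  # copy, so the input record is not modified
        doc["created"] = parse_created(doc["created"])
        yield {"_index": index, "_type": doc_type, "_id": j, "doc": doc}

sample = [
    {"subject": "AUDIENCE WITH SHAH APRIL 5", "created": "1973-04-02 08:34"},
    {"subject": "ASSASSINATION/KIDNAP PLOT AGAINST SHAH REVEALED",
     "created": "1973-10-02 14:00 "},
]
actions = list(actions_from(sample))
print("{} {}".format(actions[1]["_id"], actions[1]["doc"]["created"].isoformat()))
```

Because the generator produces actions lazily, the full archive never has to fit in memory; the bulk helper consumes it in chunks of `chunk_size`.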

u"UNCLAS TEHRAN 7005 \n \nDeclassified/Released US Department of State EO Systematic Review 30 JUN 2005 \n \nE.O. 11652: N/A \nTAGS: PINS IR \nSUBJECT: ASSASSINATION/KIDNAP PLOT AGAINST SHAH REVEALED \n \nSUMMARY: GOI ANNOUNCED TODAY ARREST OF TWELVE PERSONS \nINCLUDING TWO WOMEN FOR PLOTTING TO KIDNAP OR KILL SHAH, \nEMPRESS AND OTHER MEMBERS OF IMPERIAL FAMILY. PLOTTERS \nSAID TO BELONG TO WING OF OUTLAWED TUDEH (COMMUNIST) PARTY \nAND ARE SAID TO HAVE MADE CONFESSIONS. END SUMMARY. \n \n1. MINISTRY OF INFORMATION ANNOUNCED OCTOBER 2 THE ARREST OF \nTWELVE PERSONS INCLUDING TWO WOMEN ON CHARGES OF PLOTTING TO \nKIDNAP OR KILL MEMBERS OF THE IMPERIAL FAMILY. ACCORDING TO \nOFFICIAL ANNOUNCEMENT, GROUP, WHICH INCLUDED FILMMAKERS, CAMERAMEN \nAND NEWSPAPERMEN, HAD RECONNOITERED SHAH'S CASPIAN PALACE \nAT NOWSHAHR AS WELL AS RESIDENCE OF HIM'S YOUNGER SISTER \nPRINCESS FATEMEH. PLAN WAS TO KILL SHAH, EMPRESS, CROWN \nPRINCE AND POSSIBLY OTHERS, PERHAPS INCLUDING UNNAMED FOREIGN \nAMBASSA