Mapping Wikileaks' Cablegate thematics using Python, MongoDB and Gephi


Talk proposal for FOSDEM's data dev room, Brussels, Feb 5 2011.

Speakers

We are two software engineers at Centre National de la Recherche Scientifique (France) working on the TINA project.

  • Julian Bilcke : contributor to the Gephi project. Follow me at @flngr.
  • Elias Showk : his key areas of work are text mining with Python and building data application engines with non-relational databases and customized HTTP servers. He also codes JavaScript/jQuery/HTML5 web interfaces and, less recently, Perl/Moose/Catalyst modules. Follow me at @elishowk.

Audience

  • intermediate or beginner

Abstract

We propose to present a complete workflow of textual data analysis, from data acquisition to the visual exploration of a complex network. Through the presentation of a simple program developed specifically for this talk, we will cover a set of productive and widely used tools and libraries for text analysis. We will then introduce some features of Gephi, an open-source network visualization and analysis tool, using the data collected and transformed with cablegate-semnet.

Data and methodology

The presentation will focus on Wikileaks' Cablegate data, and especially on the full text of all diplomatic cables published so far. The goal is to produce a weighted network. This network will contain two categories of nodes :

  • thematic nodes, automatically extracted from the full text and linked to each other by co-occurrences
  • leaked cable nodes, linked to each other by a custom similarity index (an adaptation of the Jaccard similarity index).

The two categories will be linked to each other by occurrences : a cable is connected to each thematic that occurs in its text.
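
The network described above can be sketched in a few lines of Python. This is a minimal illustration, not the cablegate-semnet implementation : it uses the plain Jaccard index rather than the custom adaptation mentioned above, and the function names (`jaccard`, `build_network`) and the toy cables are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Plain Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(cable_themes):
    """Return (theme_edges, cable_edges, occurrence_edges) as weighted edge dicts.

    cable_themes maps a cable identifier to the set of thematics found in it.
    """
    theme_edges = Counter()
    occurrence_edges = {}
    for cable_id, themes in cable_themes.items():
        # Thematic-thematic edges: count the cables in which both thematics appear.
        for pair in combinations(sorted(themes), 2):
            theme_edges[pair] += 1
        # Cable-thematic edges: one edge per thematic occurring in the cable.
        for theme in themes:
            occurrence_edges[(cable_id, theme)] = 1
    # Cable-cable edges: similarity of the two thematic sets, dropped when zero.
    cable_edges = {}
    for c1, c2 in combinations(sorted(cable_themes), 2):
        weight = jaccard(cable_themes[c1], cable_themes[c2])
        if weight > 0:
            cable_edges[(c1, c2)] = weight
    return dict(theme_edges), cable_edges, occurrence_edges


# Hypothetical toy data: cable identifiers mapped to their extracted thematics.
cables = {
    "cable-1": {"nuclear program", "sanctions", "security council"},
    "cable-2": {"sanctions", "security council"},
    "cable-3": {"oil exports"},
}
theme_edges, cable_edges, occurrence_edges = build_network(cables)
```

The three dictionaries correspond to the three kinds of edges in the network; comparing every pair of cables is quadratic in the number of cables, which is acceptable for a sketch but would need pruning on the full corpus.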

1st part : Information extraction, internals of a simple Python program

  • speaker : Elias

This software illustrates common text-mining methods, taking advantage of Python's built-in functions as well as some well-known external libraries (NLTK, BeautifulSoup). It also demonstrates the simplicity and power of MongoDB in tasks like document indexing and information extraction.
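
As a rough idea of what such an extraction step involves, here is a sketch that uses only the Python standard library. It is a stand-in, not the project's code : `html.parser` replaces BeautifulSoup, a regular expression and a stopword filter replace NLTK, and the stopword list, the function names and the sample page are all made up for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption, not the project's stopwords.py).
STOPWORDS = {"the", "a", "an", "of", "on", "in", "and", "to", "is"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def candidate_terms(text, n=2):
    """Yield n-grams of lowercase word tokens that contain no stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if not any(token in STOPWORDS for token in gram):
            yield " ".join(gram)


# Hypothetical cable page, invented for this example.
page = (
    "<html><body><h1>Cable</h1>"
    "<p>The Security Council discussed sanctions on oil exports, "
    "and the <b>Security Council</b> agreed.</p></body></html>"
)
counts = Counter(candidate_terms(html_to_text(page)))
```

Counting stopword-free bigrams is a deliberately naive way of selecting thematics; a part-of-speech tagger would give cleaner noun phrases at the cost of more setup.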

The talk will focus on the following topics :

2nd part : Network visualization, Gephi demonstration

  • speaker : Julian

Cablegate-semnet selects text thematics automatically and rather naively, so it produces a network of thousands of nodes that contains some noise. In addition, having two types of nodes implies three types of edges (thematic-thematic, cable-cable and cable-thematic), so we can expect a dense graph. The result is a weighted network that is rich in information. The aim of this second part is therefore to demonstrate Gephi's network post-processing features, with a focus on :

  • How to import a network data file
  • Overview of basic visualization features
  • How to remove meaningless content using the data table, sorting and filtering
  • How to highlight meaningful elements using cluster detection, ranking, coloration
  • How to customize the graph appearance, and export the map to PDF and the web
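
For the import step, Gephi can open GEXF files. The sketch below writes a small weighted, undirected graph in GEXF 1.2 using only the standard library. The `write_gexf` helper, the toy nodes and the output file name are assumptions made for illustration, and the export code actually used by cablegate-semnet may differ.

```python
import xml.etree.ElementTree as ET


def write_gexf(path, nodes, edges):
    """Write an undirected weighted graph as a GEXF 1.2 file.

    nodes: dict mapping node id -> label
    edges: dict mapping (source id, target id) -> weight
    """
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, "node", id=node_id, label=label)
    edges_el = ET.SubElement(graph, "edges")
    for index, ((source, target), weight) in enumerate(edges.items()):
        ET.SubElement(
            edges_el, "edge",
            id=str(index), source=source, target=target, weight=str(weight),
        )
    ET.ElementTree(gexf).write(path, encoding="utf-8", xml_declaration=True)


# Hypothetical toy graph: two thematic nodes and one cable node.
nodes = {"t1": "sanctions", "t2": "security council", "c1": "cable-1"}
edges = {("t1", "t2"): 2, ("c1", "t1"): 1, ("c1", "t2"): 1}

if __name__ == "__main__":
    write_gexf("cablegate_toy.gexf", nodes, edges)
```

The resulting file can be opened from Gephi's File > Open dialog; edge weights are then available for filtering and ranking.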