finishing readme
elishowk committed Dec 24, 2010
1 parent 991abcc commit 0c73cc7
Showing 1 changed file with 12 additions and 10 deletions.
22 changes: 12 additions & 10 deletions README.md
@@ -5,11 +5,11 @@ Talk proposal for [FOSDEM's data dev room](http://datadevroom.couch.it/), Brusse
##Speakers

- Julian Bilcke : software developer for the [TINA project](http://tinasoft.eu), contributor for the [Gephi project](http://gephi.org). Follow me at [@flngr](http://twitter.com/flngr).
- (Elias Showk)[http://github.com/elishowk/] : software engineer at Centre National de la Recherche Scientifique (France). Developer for the [TINA project](http://tinasoft.eu). Its key areas of work are text-mining, building data applications engines with non-relational databases and customized HTTP servers. Its current main programming language is Python, but also occasionally codes Javascript/JQuery or Perl. Follow me [@elishowk](http://identi.ca/elishowk)
- [Elias Showk](http://github.com/elishowk) : software engineer at Centre National de la Recherche Scientifique (France). Developer for the [TINA project](http://tinasoft.eu). His key areas of work are text mining and building data application engines with non-relational databases and customized HTTP servers. His main programming language is currently Python, but he also occasionally codes in JavaScript/jQuery or Perl. Follow me at [@elishowk](http://identi.ca/elishowk).

##Audience

- intermediate (to confirm)
- intermediate or beginner

##Abstract

@@ -19,30 +19,32 @@ We propose to present a complete work-flow of textual data analysis, from acquis

The presentation will focus on Wikileaks' cablegate data, and especially on the full text of all diplomatic cables published so far. The goal is to produce a weighted network.

This networks will contents to parties :
This network will contain two categories of nodes :

- text thematics nodes linked by co-occurrences,
- leaked cables nodes linked by a custom similarity index (adaptation of [Jaccard similarity index](http://en.wikipedia.org/wiki/Jaccard_index)).


Both categories will be linked by occurrences.
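
As an illustration of the cable-to-cable weighting, here is a minimal sketch of the plain Jaccard index applied to the sets of thematics found in each cable. The proposal mentions an *adaptation* of this index, which is not specified here, so the unmodified formula is used instead. The cable identifiers, the toy thematics and the function names `jaccard` and `similarity_edges` are assumptions made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Plain Jaccard index |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_edges(cables, threshold=0.0):
    """Return {(cable_a, cable_b): weight} for pairs scoring above threshold."""
    edges = {}
    for (id_a, topics_a), (id_b, topics_b) in combinations(sorted(cables.items()), 2):
        weight = jaccard(topics_a, topics_b)
        if weight > threshold:
            edges[(id_a, id_b)] = weight
    return edges


# Hypothetical toy data: cable id -> thematics extracted from its text.
cables = {
    "cable-1": {"nuclear program", "sanctions", "embassy"},
    "cable-2": {"sanctions", "embassy", "trade"},
    "cable-3": {"fisheries"},
}

edges = similarity_edges(cables)
```

Pairs with no thematic in common get a weight of zero and are dropped, which keeps the cable-to-cable part of the network from becoming a complete graph.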

###1st Part : Cablegate-semnet Python software internals demonstration


This software illustrates common text-mining methods, taking advantage of Python built-in functions as well as some well-known external libraries (NLTK, BeautifulSoup).
It also demonstrates the simplicity and power of [Mongo DB](http://mongodb.org) in tasks like document indexing and information extraction.

- Reads the local copy of the cablegate site, using built-in OS file handling and some Regular Expressions
The talk will focus on the following topics :

- Parses cables with [NLTK](http://nltk.org)'s HTML cleaning feature, [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)'s HTML parser and [Python's regular expressions](http://docs.python.org/library/re.html)
- Inserts cables into Mongo DB, using [Python Mongo DB driver](http://api.mongodb.org/python/1.9%2B/index.html)
- Automatically extracts relevant words NGram with NLTK and updating Mongo DB collections using
- Processes the network with [Mongo DB's map/reduce](http://www.mongodb.org/display/DOCS/MapReduce) integration
- Automatically extracts relevant word n-grams with NLTK and updates Mongo DB collections
- Processes the network with [Mongo DB's map/reduce](http://www.mongodb.org/display/DOCS/MapReduce) to get the relationships between mapped entities.
- Exports the network in a Gephi compatible format ([GEXF](http://gexf.net)) using [Tenjin template engine](http://www.kuwata-lab.com/tenjin/)
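
The n-gram extraction and co-occurrence steps listed above can be sketched with the standard library alone. This is not the Cablegate-semnet code: a simple regular-expression tokenizer stands in for NLTK, and an in-memory `Counter` stands in for the Mongo DB map/reduce job. The function names `ngrams` and `cooccurrences` and the sample texts are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def ngrams(text, n_max=2):
    """Return the set of word n-grams (1 <= n <= n_max) found in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def cooccurrences(documents, n_max=2):
    """Count, for each pair of n-grams, the documents containing both."""
    counts = Counter()
    for text in documents:
        # Sorting gives each pair a canonical (smaller, larger) order.
        for pair in combinations(sorted(ngrams(text, n_max)), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample texts standing in for cleaned cable bodies.
documents = ["Trade sanctions", "Sanctions trade"]
counts = cooccurrences(documents)
```

Each count can then serve as the weight of an edge between two thematic nodes. A real run would also need the filtering of irrelevant n-grams that the list above attributes to NLTK.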

This sofware has a quite naive automatic selection of text thematics and produces a network containing some noise. The aim of the second part is to demonstrate Gephi's features in network post-processing.


###2nd part : Gephi demonstration

Cablegate-semnet has a fairly naive automatic selection of text thematics and produces a network containing some noise. In addition, the two types of nodes imply three types of edges, so we can expect a [dense graph](http://en.wikipedia.org/wiki/Dense_graph). The aim of this second part is to demonstrate Gephi's features in network post-processing, with a focus on :

- How to import a network data file
- Overview of basic visualization features
- How to remove meaningless content using the data table, sorting and filtering
