Thanks for using Tinasoft Pytextminer
Pytextminer is part of a larger piece of software, Tinasoft Desktop, which you can find at http://github.com/moma/tinasoft.desktop/
It is a text-mining Python 2.6 module that produces bottom-up semantic networks.
It uses :
- NLTK, the natural language processing toolkit (http://www.nltk.org/),
- SQLite3, the embedded database library (http://sqlite.org),
- Twisted web server (http://twistedmatrix) with jsonpickle as a serializer (http://jsonpickle.github.com),
- Numpy for n-dimensional array processing (http://numpy.scipy.org/),
- pyTenjin for exporting graphs to GEXF files (http://www.kuwata-lab.com/tenjin/)
Typical tasks are :
- multiple kinds of source file support
- extraction of key-phrases (ngrams) using various simple Natural Language Processing methods (stopwords, part-of-speech tagging, stemming, etc)
- creation of document/corpus/ngram graph databases
- calculation of key-phrase cooccurrences across a corpus
- production of graphs of multiple entities and multiple relations (hybrid storage into GEXF files, http://gexf.net, and into an SQLite database)
- an HTTP server exposing the API and returning JSON results
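The key-phrase extraction and cooccurrence tasks listed above can be illustrated with a minimal sketch. This is not Pytextminer's code: the function names, the tiny stopword list and the whitespace tokenizer are assumptions made for the example, and it targets a current Python 3 interpreter rather than Python 2.6. The real module relies on NLTK for tokenizing, tagging and stemming.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (illustration only)
STOPWORDS = {"the", "of", "a", "and", "in", "is", "on"}


def extract_ngrams(text, min_size=1, max_size=2):
    """Return the set of n-grams in text with min_size <= n <= max_size."""
    # Naive tokenizer: lowercase, split on whitespace, keep alphabetic tokens
    tokens = [t for t in text.lower().split() if t.isalpha()]
    ngrams = set()
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # Drop n-grams that start or end with a stopword
            if window[0] in STOPWORDS or window[-1] in STOPWORDS:
                continue
            ngrams.add(" ".join(window))
    return ngrams


def cooccurrences(documents, min_size=1, max_size=2):
    """Count in how many documents each pair of n-grams appears together."""
    counts = Counter()
    for text in documents:
        ngrams = sorted(extract_ngrams(text, min_size, max_size))
        # Each unordered pair is counted once per document
        for pair in combinations(ngrams, 2):
            counts[pair] += 1
    return counts


corpus = [
    "semantic network of science",
    "semantic network mining",
]
pairs = cooccurrences(corpus)
print(pairs[("network", "semantic")])
```

Pairs are stored as alphabetically sorted tuples, so each unordered pair has a single key in the counter.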
This software is part of TINA, a European Union FP7 coordination action - FP7-ICT-2009-C :
- http://tinasoft.eu/
The software implements scientific results by David Chavalarias (CREA lab; CNRS/Ecole Polytechnique UMR 7656, http://chavalarias.com) and Jean-Philippe Cointet (INRA SENS, http://jph.cointet.free.fr).
SOURCE CODE REPOSITORY
https://forge.iscpif.fr/projects/tinasoft-pytextminer
http://github.com/moma/TinasoftPytextminer
AUTHORS
- Researchers and engineers at CREA lab (UMR 7656, CNRS, Ecole Polytechnique, France)
julian bilcke <julian.bilcke (at) iscpif (dot) fr>
david chavalarias <david.chavalarias (at) polytechnique (dot) edu>
jean philippe cointet <jphcoi (at) yahoo (dot) fr>
elias showk <elishowk (at) nonutc (dot) fr>
MAINTAINER
elias showk <elishowk (at) nonutc (dot) fr>
DOCUMENTATION, SUPPORT AND FEEDBACK
http://tinasoft.eu/ (project homepage)
https://forge.iscpif.fr/projects/tinasoft-pytextminer (software development)
PYTEXTMINER AS A USER
Download standalone packages from http://tinasoft.eu
USER DOCUMENTATION
http://tina.csregistry.org/tinauserdoc
PYTEXTMINER AS A DEVELOPER
* we provide an HTTP server exposing the main API of the TinaApp class
* alternatively, the apitests.py script provides examples of how to use the TinaApp class methods
- get the source code :
https://forge.iscpif.fr/projects/tinasoft-pytextminer/repository
OR
git clone https://sources.iscpif.fr/tinasoft.pytextminer.git
PYTHON : you'll need a Python 2.6 interpreter : http://python.org/
INSTALL THE PYTHON PACKAGE
$ sudo python setup.py install
or
$ sudo python setup.py develop
Dependencies should be checked : numpy, nltk, twisted, jsonpickle, tenjin, pyyaml
OTHERWISE MANUALLY INSTALL PYTHON DEPENDENCIES
- they're listed in setup.py
DOWNLOAD NLTK DATA
You'll need to manually install the required NLTK corpus data
$ export NLTK_DATA="your/path/to/TinasoftPytextminer/shared/nltk_data"
$ python
> import nltk
> nltk.download()
Downloader> d punkt
Downloader> d brown
Downloader> d conll2000
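To confirm that the three packages landed where NLTK_DATA points, a small stdlib-only check along these lines may help. It is a sketch, not part of Pytextminer: the helper name is hypothetical, it assumes NLTK_DATA holds a single directory, and it assumes the usual downloader layout (tokenizers/punkt, corpora/brown, corpora/conll2000).

```python
import os

# Relative locations the NLTK downloader uses for the three packages above
REQUIRED = {
    "punkt": os.path.join("tokenizers", "punkt"),
    "brown": os.path.join("corpora", "brown"),
    "conll2000": os.path.join("corpora", "conll2000"),
}


def missing_nltk_packages(data_dir):
    """Return the names of required packages not found under data_dir."""
    missing = []
    for name, rel in sorted(REQUIRED.items()):
        base = os.path.join(data_dir, rel)
        # A package may be present as a directory, a .zip archive, or both
        if not (os.path.isdir(base) or os.path.isfile(base + ".zip")):
            missing.append(name)
    return missing


# Assumption: NLTK_DATA contains one directory, not a list of paths
data_dir = os.environ.get("NLTK_DATA", os.path.join("shared", "nltk_data"))
print(missing_nltk_packages(data_dir))
```

An empty list means all three packages were found.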
on MS WINDOWS:
$ set NLTK_DATA=TinasoftPytextminer\shared\nltk_data
$ PATH C:\Python26;%PATH%
$ python apitests.py ... (see usage)
- finally open your web browser at http://localhost:8888 (no internet connection needed)
GNU/LINUX (and probably other UNIX-like systems)
- use the standalone frozen httpserver software
$ export NLTK_DATA=shared/nltk_data
$ python apitests.py ... (see usage)
DEVELOPER DOCUMENTATION
http://tina.csregistry.org/tinadevdoc
CONFIGURATION
The config_*.yaml files are YAML configuration files.
The main application (TinaApp class) loads one during init ; its path is a required parameter
GUIDELINES
- declare each column name of your csv file into the corresponding field name of the configuration file
- undeclared columns are ignored by the software
- the required and optional entries are :
#### REQUIRED
titleField: document title
contentField: document content
authorField: document acronym
corpusNumberField: corpus number
docNumberField: document number
#### OPTIONAL
index1Field: document index 1
index2Field: document index 2
dateField: document publication date
keywordsField: document keywords
- check the format of your csv file (encoding, delimiter, quoting character) and write these values into the fields "locale", "delimiter" and "quotechar"
- "minSize" and "maxSize" set the minimum and maximum length of the extracted n-grams
- all other fields configure the scripts or hold default values for testing purposes
WARNING : YAML indentation must use spaces, not tabs, and all string values must be quoted (eg : 'prop_title'). Further information at http://en.wikipedia.org/wiki/YAML
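The column-to-field mapping described in these guidelines can be sketched as follows. This is an illustration only, written for a current Python 3 interpreter: the "fields" grouping, the helper name and every column name except 'prop_title' are assumptions, and the dictionary stands in for whatever a YAML loader would return for a config_*.yaml file.

```python
import csv
import io

# Hypothetical parsed configuration; the real config_*.yaml layout may differ
config = {
    "delimiter": ",",
    "quotechar": '"',
    "fields": {
        "titleField": "prop_title",
        "contentField": "prop_abstract",   # assumed column name
        "authorField": "prop_acronym",     # assumed column name
        "corpusNumberField": "corp_num",   # assumed column name
        "docNumberField": "doc_num",       # assumed column name
    },
}


def read_documents(stream, config):
    """Yield one dict per CSV row, keyed by configuration field name."""
    reader = csv.DictReader(
        stream,
        delimiter=config["delimiter"],
        quotechar=config["quotechar"],
    )
    for row in reader:
        # Only declared columns are kept; any other column is dropped
        yield {
            field: row[column]
            for field, column in config["fields"].items()
            if column in row
        }


sample = io.StringIO(
    'prop_title,prop_abstract,prop_acronym,corp_num,doc_num,unused\n'
    '"Mining, text",Some content,TINA,1,42,ignored\n'
)
docs = list(read_documents(sample, config))
print(docs[0]["titleField"])
```

The quoted title shows why "delimiter" and "quotechar" must match the file: with the wrong quoting character, the comma inside the title would split it into two columns.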
SOURCE FILES DIRECTORY
- "source_files" is dedicated to the storage of your source files
- these files are used during the indexing and extraction steps of the workflow
- given an existing file name in this directory, the software will be able to read it
TESTED OPERATING SYSTEMS
Tinasoft Pytextminer was tested on the following platforms:
GNU/Linux (amd64, i386) with Python 2.6
Windows XP (32bit) with Python 2.6
Mac OS X >= 10.6
COPYRIGHT AND LICENSE
Copyright (C) 2009-2011 CREA Lab, CNRS/Ecole Polytechnique UMR 7656 (Fr)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/gpl.html>.