A python text-mining module producing semantic network graphs
shared
source_files
tests
tinasoft
.gitignore
GPL-LICENSE.txt
LICENSE
MANIFEST
README
config_unix.yaml
config_win.yaml
freeze_linux.py
freeze_mac.py
freeze_win.py
httpserver.py
setup.py
user_stopwords.csv

README

Thanks for using Tinasoft Pytextminer

Pytextminer is part of a larger piece of software, Tinasoft Desktop, which you can find at http://github.com/jbilcke/tinasoft.desktop/

A text-mining Python module for bottom-up n-gram detection and mapping. It uses NLTK, the natural language processing toolkit (http://www.nltk.org/); bsddb, the Berkeley embeddable DB connector, for storage (http://www.jcea.es/programacion/pybsddb_doc/); and whoosh, the indexing engine (http://whoosh.ca). It provides:
- a database of document/corpus/n-gram graphs
- part-of-speech tagging
- an NLP-based tokenizer
- support for multiple stopword sources
- document text indexing
- co-occurrence matrix processing
- a customizable graph generator (GEXF format)
- and soon a stemmer and a lemmatizer
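
To make the "co-occurrence matrix processing" item above more concrete, here is a minimal sketch of the general idea: counting how many documents contain each pair of terms. It is not Pytextminer's implementation. The function name, the sample data and the sparse dictionary representation are assumptions made for illustration, and the code targets a current Python 3 rather than the Python 2.6 this project requires.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(documents):
    """Count, for each unordered pair of terms, how many documents contain both."""
    counts = Counter()
    for terms in documents:
        # Deduplicate and sort so each pair is counted once per document
        # and always appears in the same order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Hypothetical documents, each already reduced to a list of n-grams.
docs = [
    ["text mining", "semantic network", "graph"],
    ["text mining", "graph"],
    ["semantic network", "graph"],
]
matrix = cooccurrences(docs)
```

Each key of `matrix` is a pair of terms and each value is a possible edge weight between two n-gram nodes of a graph. A real pipeline would filter and normalize these counts before exporting anything.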

This work is part of TINA, a European Union FP7 project - FP7-ICT-2009-C : http://tinasoft.eu/
The software implements scientific work by David Chavalarias (CREA lab; CNRS/Ecole Polytechnique UMR 7656, http://chavalarias.com) and Jean-Philippe Cointet (INRA, http://jph.cointet.free.fr). You can read their publications here:

http://arxiv.org/abs/0904.3154v1
http://www.springerlink.com/content/v57686u275653nt4/

COPYRIGHT AND LICENSE

Copyright (C) 2009-2011 CREA Lab, CNRS/Ecole Polytechnique UMR 7656 (Fr)

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

SOURCE CODE REPOSITORY

    http://github.com/elishowk/TinasoftPytextminer

AUTHORS

- Researchers and engineers at CREA lab (UMR 7656, CNRS, Ecole Polytechnique, France)

    david chavalarias <david.chavalarias@polytechnique.edu>
    elias showk <elishowk@nonutc.fr>
    julian bilcke <julian.bilcke@iscpif.fr>

DOCUMENTATION, SUPPORT AND FEEDBACK

    http://tinasoft.eu/
    https://forge.iscpif.fr/projects/tinasoft-pytextminer
    http://github.com/moma/TinasoftPytextminer/

INSTALL TINASOFT PYTEXTMINER AS A STANDALONE SOFTWARE

    DOWNLOAD available standalone packages from http://tinasoft.eu

    MANUAL INSTALL on MS WINDOWS:
        - download the dependencies' installer at http://tina.iscpif.fr/htdocs/repository/tinasoftpytextminer_win32_deps_installer.zip
        - then, in a command line :

            $ set NLTK_DATA="TinasoftPytextminer\shared\nltk_data"
            $ PATH C:\Python26;%PATH%
            $ python httpserver.exe config_win.yaml

        - finally open your web browser at http://localhost:8888 (no internet connection needed)

    GNU/LINUX (and probably other UNIX-like systems)
        - use the standalone frozen httpserver program

            $ export NLTK_DATA=shared/nltk_data
            $ python httpserver config_unix.yaml

        - open your browser at http://localhost:8888

INSTALL TINASOFT PYTEXTMINER AS A LIBRARY

    * we provide an HTTP server that tries to be RESTful and exposes a stable API

    DEPENDENCIES

    BERKELEY DB : install it on your system from http://www.oracle.com/technology/software/products/berkeley-db/index.html

    PYTHON : you'll need the Python 2.6 interpreter : http://python.org/

        $ sudo python setup.py install
        or
        $ sudo python setup.py develop

    Check that the dependencies are installed : numpy, nltk, twisted, etc. (see setup.py)

    DOWNLOAD NLTK DATA

    You'll need to manually install the required NLTK corpus data
        $ export NLTK_DATA="/path/to/TinasoftPytextminer/shared/nltk_data"
        $ python
        > import nltk
        > nltk.download()
        Downloader> d punkt
        Downloader> d brown
        Downloader> d conll2000

CONFIGURATION

    config_*.yaml are YAML configuration files.
    The main application (TinaApp class) loads one of them during init; its path is a required parameter

    GUIDELINES

    - declare each column name of your csv file under the corresponding field name in the configuration file
    - undeclared columns will be ignored by the software
    - here are the required and optional entries :

        #### REQUIRED
        titleField: document title
        contentField: document content
        authorField: document acronym
        corpusNumberField: corpus number
        docNumberField: document number
        #### OPTIONAL
        index1Field: document index 1
        index2Field: document index 2
        dateField: document publication date
        keywordsField: document keywords

    - check the format of your csv file (encoding, delimiter, quoting character) and write these values into the fields "locale", "delimiter" and "quotechar"
    - "minSize" and "maxSize" set the minimum and maximum length of the extracted n-grams
    - all other fields configure the script or provide default values for testing purposes

    WARNING : YAML indentation must use spaces, not tabs, and all string values should be quoted (e.g. 'prop_title'). Further information at http://en.wikipedia.org/wiki/YAML
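
    The column mapping described in the guidelines can be pictured with the following sketch, which reads a csv file and keeps only the declared columns. It uses Python's standard csv module, not Pytextminer's own importer. All column names except 'prop_title' are invented for the example, as are the function name and the sample rows, and the code targets a current Python 3.

```python
import csv
import io

# Hypothetical mapping mirroring the "...Field" entries of the configuration:
# configuration key -> column name in the csv file.
FIELDS = {
    "titleField": "prop_title",
    "contentField": "abstract",
    "authorField": "acronym",
    "corpusNumberField": "corpus_id",
    "docNumberField": "doc_id",
}


def read_documents(handle, fields, delimiter=",", quotechar='"'):
    """Read csv rows and keep only the declared columns, keyed by field name."""
    reader = csv.DictReader(handle, delimiter=delimiter, quotechar=quotechar)
    documents = []
    for row in reader:
        # Columns missing from `fields` (here "internal_note") are dropped.
        documents.append({key: row[column] for key, column in fields.items()})
    return documents


# In-memory stand-in for a file stored in the source files directory.
sample = io.StringIO(
    'doc_id;corpus_id;prop_title;abstract;acronym;internal_note\n'
    '1;2009;"Graphs; a survey";"Some text";TINA;ignored\n'
)
documents = read_documents(sample, FIELDS, delimiter=";")
```

    The quoted title contains the delimiter character, which is why the quoting character has to match the one actually used in the file.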

SOURCE FILES DIRECTORY

    - the "source_files" directory is where you store your source files
    - these files are used during the indexing and extraction steps of the workflow
    - the software can read any file in this directory when given its file name

SUPPORTED OPERATING SYSTEMS

    Tinasoft Pytextminer was tested on the following platforms:

        GNU/Linux (amd64, i386) with Python 2.6
        Windows XP (32bit) with Python 2.6
