A python text-mining module producing semantic network graphs
License
elishowk/TinasoftPytextminer
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
master
Could not load branches
Nothing to show
Could not load tags
Nothing to show
{{ refName }}
default
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code
-
Clone
Use Git or checkout with SVN using the web URL.
Work fast with our official CLI. Learn more about the CLI.
- Open with GitHub Desktop
- Download ZIP
Sign In Required
Please sign in to use Codespaces.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching Xcode
If nothing happens, download Xcode and try again.
Launching Visual Studio Code
Your codespace will open once ready.
There was a problem preparing your codespace, please try again.
Latest commit
Git stats
Files
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
Thanks for using Tinasoft Pytextminer Pytextminer is a part of a larger software : Tinasoft Desktop you can find it at http://github.com/moma/tinasoft.desktop/ A text-mining python 2.6 module producing bottom-up semantic network production. It uses : - NLTK, the natural language processing toolkit (http://www.nltk.org/), - SQLite3, the embedded database library (http://sqlite.org), - Twisted web server (http://twistedmatrix) with jsonpickle as a serializer (http://jsonpickle.github.com), - Numpy for n-dimensionnal arrays processing (http://numpy.scipy.org/), - pyTenjin for graph gexf files export (http://www.kuwata-lab.com/tenjin/) Classical task are : - multiple kinds of source file support - extraction of key-phrases (ngrams) using various simple Natural Language Processing methods (stopwords, part-of-speech tagging, stemming, etc) - creation of document/corpus/ngram graphs databases - key-phrases cooccurrences calculation on a corpus basis - production of graphs of multiple entities and multiple relations (hybrid storage into GEXF files, http://gexf.net, and into sqlite database) - an httpserver exposing the API, sending json results This software is part of TINA, an European Union FP7 coordination action - FP7-ICT-2009-C : - http://tinasoft.eu/ The software implements scientific results by David Chavalarias (CREA lab; CNRS/Ecole Polytechnique UMR 7656, http://chavalarias.com) and Jean-Philippe Cointet (INRA SENS, http://jph.cointet.free.fr). SOURCE CODE REPOSITORY https://forge.iscpif.fr/projects/tinasoft-pytextminer http://github.com/moma/TinasoftPytextminer AUTHORS - Researchers and engineers at CREA lab (UMR 7656, CNRS, Ecole Polytechnique, France) julian bilcke <julian.bilcke (at) iscpif (dot) fr> david chavalarias <david.chavalarias (at) polytechnique (dot) edu> jean philippe cointet <jphcoi (at) yahoo (dot) fr> elias showk <elishowk (at) nonutc (dot) fr> MAINTAINER elias showk <elishowk (at) nonutc (dot) fr> DOCUMENTATION, SUPPORT AND FEEDBACK http://tinasoft.eu/ (project homepage) https://forge.iscpif.fr/projects/tinasoft-pytextminer (software development) PYTEXTMINER AS A USER Download standalone packages from http://tinasoft.eu DEVELOPER DOCUMENTATION http://tina.csregistry.org/tinauserdoc PYTEXTMINER AS A DEVELOPER * we provide a http server exposing the main API from the TinaApp class * alternatively, the apitests.py script provides examples to properly use the TinaApp class methods - get the source code : https://forge.iscpif.fr/projects/tinasoft-pytextminer/repository OR git clone https://sources.iscpif.fr/tinasoft.pytextminer.git PYTHON : you'll need Python 2.6 interpreter : http://python.org/ INSTALL THE PYTHON PACKAGE $ sudo python setup.py install or $ sudo python setup.py develop Dependencies should be checked : numpy, nltk, twisted, jsonpickle, tenjin, pyyaml OTHERWISE MANUALLY INSTALL PYTHON DEPENDENCIES - they're listed in setup.py DOWNLOAD NLTK DATA You'll need to install manually required nltk corpus data $ export NLTK_DATA="your/path/to/TinasoftPytextminer/shared/nltk_data" $ python > import nltk.download() Downloader> d punkt Downloader> d brown Downloader> d conll2000 on MS WINDOWS: $ set NLTK_DATA="TinasoftPytextminer\shared\nltk_data" $ PATH C:\Python26;%PATH% $ python apitests.py ... (see usage) - finally open your web browser at http://localhost:8888 (no internet connection needed) GNU/LINUX (and probable UNIX-like systems) - use the standalone freezed httpserver software $ export NLTK_DATA=shared/nltk_data $ python apitests.py ... (see usage) DEVELOPER DOCUMENTATION http://tina.csregistry.org/tinadevdoc CONFIGURATION config_*.yaml are a YAML configuration files. The main application (TinaApp class) searches it during init, its path is a required parameter GUIDELINES - declare each column name of your csv file into the corresponding field name of the configuration file - not declared columns will be ignored by the software - here are possible required and optional entries : #### REQUIRED titleField: document title contentField: document content authorField: document acronyme corpusNumberField: corpus number docNumberField: document number ##### optional index1Field: document index 1 index2Field: document index 2 dateField: document publication date keywordsField: document keywords - check out the format of your csv file (encoding, delimiter, quoting character) and write them into fields "locale", "delimiter" and "quotechar" - "minSize", and "maxSize" means the length of n-grams extracted - all other fields are the script configuration, or the default values for testing purpose WARNING : in YAML all tabulations are spaces, all string values must be quoted (eg : 'prop_title'). Further information at http://en.wikipedia.org/wiki/YAML SOURCE FILES DIRECTORY - "source_files" is dedicated to the storage of your source files - these files are used during indexation and extraction steps of the workflow - given an existing file name in this directory, the software will be able to read it TESTED OPERATING SYSTEMS Tinasoft Pytextminer was tested on the following platforms: GNU/Linux (amd4, i386) with Python 2.6 Windows XP (32bit) with Python 2.6 Mac OS X >= 10.6 COPYRIGHT AND LICENSE Copyright (C) 2009-2011 CREA Lab, CNRS/Ecole Polytechnique UMR 7656 (Fr) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/gpl.html>.
About
A python text-mining module producing semantic network graphs
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published