A Python text-mining module producing semantic network graphs
TINASOFT

A text-mining Python module producing bottom-up n-gram detection and mapping. Built on NLTK, the natural language processing toolkit (http://www.nltk.org/), bsddb, the embeddable Berkeley DB connector used for storage (http://www.jcea.es/programacion/pybsddb_doc/), and Whoosh, the indexing engine (http://whoosh.ca), it provides:

- document/corpus/n-gram graphs
- part-of-speech tagging
- NLP-based n-gram extraction
- stopword integration
- full-text indexing
- co-occurrence calculation, with more types of graph proximity to come
- a stemmer and a lemmatizer (planned)

This work is part of the European Union FP7 project TINA - FP7-ICT-2009-C: http://tina.csregistry.org/

The methodology and analysis are based on the following articles by David Chavalarias (CREA, CNRS UMR 7656) and Jean-Philippe Cointet (INRA): http://arxiv.org/abs/0904.3154v1 and http://www.springerlink.com/content/v57686u275653nt4/

COPYRIGHT AND LICENSE

Copyright (C) 2009-2011 European Commission FP7 project TINA - FP7-ICT-2009-C / CNRS UMR 7656 CREA (fr); project number 245412

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
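To give an idea of the n-gram extraction step listed above, here is a minimal, illustrative sliding-window extractor with boundary stopword filtering. It is not PyTextMiner's actual code: the function name, the toy stopword list, and the plain whitespace tokenizer (a real pipeline would use an NLP tokenizer such as NLTK's) are all assumptions for illustration.

```python
# Illustrative sketch of n-gram extraction with stopword filtering;
# hypothetical, not PyTextMiner's actual implementation.

STOPWORDS = {"the", "a", "of", "in", "was"}  # toy sample list

def extract_ngrams(text, min_size=1, max_size=3):
    """Return every n-gram (min_size <= n <= max_size) of the
    tokenized text that does not start or end with a stopword."""
    tokens = text.lower().split()  # naive tokenizer for illustration
    ngrams = []
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # drop n-grams whose boundary words are stopwords
            if window[0] in STOPWORDS or window[-1] in STOPWORDS:
                continue
            ngrams.append(" ".join(window))
    return ngrams

print(extract_ngrams("a case of blood vessels was detected"))
```

The boundary check keeps interior stopwords ("case of blood") while discarding n-grams that merely start or end with one, a common heuristic in multi-word term extraction.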
SOURCE CODE REPOSITORY

http://github.com/elishowk/TinasoftPytextminer

AUTHORS

Research engineers at CREA lab (UMR 7656, CNRS):
- elias showk <email@example.com>
- julian bilcke <firstname.lastname@example.org>

SUPPORT AND FEEDBACK

http://github.com/elishowk/TinasoftPytextminer/issues

INSTALL

First of all, you will need the Python 2.6 interpreter: http://python.org/

Then install PyTextMiner and its dependencies from source by typing:

$ sudo python setup.py install

or

$ sudo python setup.py develop

The following dependencies will be checked: numpy, nltk, tinasoft.data (for data storage).

CONFIGURATION

The configuration file is config.yaml. Declare each column name of your CSV file in the corresponding field of the configuration file; otherwise the columns will be ignored by the software:

titleField: document title
contentField: document content
authorField: document author acronym
corpusNumberField: corpus number
docNumberField: document number
index1Field: document index 1
index2Field: document index 2
dateStartField: corpus start date
dateEndField: corpus end date
dateField: document publication date
keywordsField: document keywords

Check the format of your CSV file (encoding, delimiter, quoting character) and write it into the "locale", "delimiter" and "quotechar" fields. "minSize" and "maxSize" set the length bounds of the n-grams extracted. All other fields are the script configuration, or default values for testing purposes.

WARNING: config.yaml is written in YAML format. As a consequence, all indentation must use spaces (never tabs), and all string values must be quoted (e.g. 'prop_title').
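As a reference point, a field mapping in config.yaml might look like the fragment below. The keys are the ones listed above; the quoted values on the right are hypothetical CSV column names, chosen only to illustrate the quoting and space-indentation rules:

```yaml
# Hypothetical excerpt of config.yaml: map each key to the matching
# column name of your CSV file (these column names are examples).
titleField: 'TITLE'
contentField: 'ABSTRACT'
corpusNumberField: 'YEAR'
docNumberField: 'PMID'
dateField: 'PUBDATE'
keywordsField: 'KEYWORDS'
# CSV format description (example values)
locale: 'en_US.UTF-8'
delimiter: ','
quotechar: '"'
# n-gram length bounds
minSize: 1
maxSize: 4
```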
Further information at http://en.wikipedia.org/wiki/YAML

USAGE

You can try analysing a CSV file into the database using tinaExtract.py.

Command-line help:

$ python tinaExtract.py --help

First check the field values in config.yaml, then type the following command line:

$ cd PyTextMiner/
$ python tinaExtract.py -i path_to/myfile.csv

This generates a statistics.zip file containing the contents of the output directory. Without an input file specified, the software processes the default input (src/t/pubmed_tina_test.csv).

TESTED PLATFORMS

PyTextMiner was tested on the following Linux platforms:

- Linux 64 bits (AMD64) with Python 2.5
- Linux 64 bits (AMD64) with Python 2.6
- Linux 32 bits (i686) with Python 2.5
- Linux 32 bits (i686) with Python 2.6

However, if you experience problems, see the "MANUAL INSTALL" section.

OUTPUTS

An output directory

Results of the process are stored in a new directory, 'output', created by default in "$WORKDIR/PyTextMiner". This output contains three different kinds of files:

- "out.db": the relational database with n-grams and project ids.
- "documentSample.csv": a 10-project sample, in an easy-to-read format, of the kind of information stored in "out.db". You can thus check on a sample that this information is compatible with your needs.
- A set of CSV files named "corpus_id-ngramOccPerCorpus.csv" providing statistics for each batch.

Here is an example of the content of the sample CSV file:
document id,ngrams extracted
"16710584","lesions displayed, which indicates, associated with, was detected, to valproate after, a case of, systemic, her skin,[etc]"

And here is an example of the per-corpus statistics, listing the n-grams extracted with their occurrence count within the batch and the part-of-speech-tagged version of each n-gram:

"ngram","documents","POS tagged"
"defects","9","NNS_defects"
"defence","3","NN_defence"
"staphylococcus","2","NNP_Staphylococcus"
"cd t cells","2","NNP_CD NNP_T VBZ_cells"
"blood vessels","4","NN_blood NNS_vessels"

A zipped output directory

All the above files are zipped into a single file that can be sent directly to your coworkers. This file is named 'statistics.zip' and can be found in the current $WORKDIR/PyTextMiner directory.
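The "documents" column above is a document-frequency count. A minimal sketch of how such per-corpus occurrence counts, together with pairwise co-occurrences, can be computed (assuming each document has already been reduced to its set of extracted n-grams; the variable names and toy data are illustrative, not PyTextMiner's actual code):

```python
# Minimal sketch of per-corpus occurrence and co-occurrence counting;
# illustrative only, not PyTextMiner's actual implementation.
from collections import Counter
from itertools import combinations

# each document reduced to its set of extracted n-grams (toy data)
documents = [
    {"blood vessels", "defects"},
    {"blood vessels", "defence", "defects"},
    {"defects"},
]

occurrences = Counter()    # n-gram -> number of documents containing it
cooccurrences = Counter()  # (ngram_a, ngram_b) -> shared-document count

for ngrams in documents:
    occurrences.update(ngrams)
    # every unordered pair of n-grams in the same document co-occurs once
    for pair in combinations(sorted(ngrams), 2):
        cooccurrences[pair] += 1

print(occurrences["defects"])                       # 3
print(cooccurrences[("blood vessels", "defects")])  # 2
```

Sorting each document's n-grams before pairing keeps every pair in a canonical order, so ("blood vessels", "defects") and its reverse are counted under a single key.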