eark-project/dm-nlp

Note: this repository was archived by the owner on December 18, 2019 and is read-only.

Text mining in Python

The code in this repository is meant to run in a cluster environment, together with ToMaR.

This is currently work in progress (some scripts have not been converted yet), so it might not work in your environment at all.

Installation of 'entities':

Install nltk:

pip install nltk

TODO: Configure the nltk_data directory.
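The TODO above is still open here; as a minimal sketch (the directory is a placeholder, not a path taken from this repository), nltk can be pointed at a custom nltk_data directory from Python, which has the same effect as exporting NLTK_DATA in start.sh further below:

# Sketch only: make nltk search a custom nltk_data directory first.
import nltk

nltk.data.path.insert(0, "/path/to/nltk_data")  # placeholder path

# After 'punkt' has been downloaded (next step), this should resolve without error:
nltk.data.find("tokenizers/punkt")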

Install the punkt tokenizer (it includes the German sentence models):

python (or: /opt/anaconda3/bin/python)
>>> import nltk
>>> nltk.download()
Downloader> d
Identifier> punkt
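The same download can also be scripted instead of going through the interactive prompt. A quick sanity check, using an example sentence and a placeholder download directory of our own choosing:

# Non-interactive download plus a short test of the German punkt model.
import nltk
nltk.download("punkt", download_dir="/path/to/nltk_data")  # placeholder directory

from nltk.tokenize import sent_tokenize, word_tokenize
text = "Das ist ein Satz. Das ist noch einer."
print(sent_tokenize(text, language="german"))
# ['Das ist ein Satz.', 'Das ist noch einer.']
print(word_tokenize(text, language="german"))
# ['Das', 'ist', 'ein', 'Satz', '.', 'Das', 'ist', 'noch', 'einer', '.']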

Edit some parameters:

In start.sh:

export NLTK_DATA='/path/to/nltk_data'

Also add the correct path to the input file and to the --hadoop-streaming-jar.

In mr_ner.py:

java_path = "/path/to/java_8_JRE/bin/java"
model_ger = "/path/to/ner/model"
stanford_jar = "/path/to/stanford-ner.jar"
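mr_ner.py itself is not reproduced here, so the following is only a minimal sketch of how these three parameters typically feed into nltk's Stanford NER wrapper inside an mrjob job (the class name, mapper logic and output format are assumptions, not taken from this repository):

# Hypothetical illustration, not the actual mr_ner.py.
from mrjob.job import MRJob
from nltk.internals import config_java
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import word_tokenize

java_path = "/path/to/java_8_JRE/bin/java"
model_ger = "/path/to/ner/model"
stanford_jar = "/path/to/stanford-ner.jar"

class MRNer(MRJob):

    def mapper_init(self):
        # Tell nltk which Java binary to use, then load the German NER model.
        config_java(bin=java_path)
        self.tagger = StanfordNERTagger(model_ger, stanford_jar)

    def mapper(self, _, line):
        # Tag each input line and emit only tokens recognised as entities.
        tokens = word_tokenize(line, language="german")
        for token, tag in self.tagger.tag(tokens):
            if tag != "O":
                yield tag, token

if __name__ == "__main__":
    MRNer.run()

Under the same assumption, start.sh would launch the job through mrjob's Hadoop runner (python mr_ner.py -r hadoop ...), which is where the input file path and the --hadoop-streaming-jar path from above come in.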

About

The data mining and natural language processing modules used in a cluster environment (Hadoop).
