This is essentially a MapReduce job that performs text classification on a Hadoop cluster.
It uses the Python library [scikit-learn](http://scikit-learn.org/stable/index.html). Input is taken from Lily, output is written back to Lily and can be searched using Solr. It currently works for text and PDF files.
- verify that it works on the "real" cluster (currently tested on earkdev)
  - mainly the communication between Python and the Java/MapReduce jobs
  - also find an easy way to provide the Python libraries (see below)
- change the paths in the Mapper (-> Python script locations)
The mapper needs an input list with one entry per line, in the format
<path>,<contentType>
according to the Solr query results. This makes it possible to create customized queries and classify only certain files.
This file must be uploaded to the Hadoop file system, into /user/<username that executes hadoop commands>.
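For illustration, an input list could look like the following; the file paths and content types are placeholders:
/data/documents/report.pdf,application/pdf
/data/documents/article.txt,text/plain
Assuming the list is saved locally as input.txt, it can then be uploaded with:
hadoop fs -put input.txt /user/<username that executes hadoop commands>/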
Launch the MapReduce job with
./start.sh
Adapt the script if necessary:
-i <input file name on HDFS>
-c <path to the classifier script (local)>
-m <path to the model (local)>
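Assuming start.sh passes these options straight through to the job (check the script, the values may need to be adapted inside it instead), an invocation could look like this, with all paths being placeholders:
./start.sh -i input.txt -c /home/hadoop/pyscripts/classifier.py -m /home/hadoop/pyscripts/model.pkl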
The provided model was trained on German/Austrian newspaper articles, so expect poor performance on out-of-domain data.
However, feel free to train your own models according to the [scikit-learn documentation](http://scikit-learn.org/stable/documentation.html).
An example script that was used to generate the newspaper model can be found in the /pyscript subfolder.
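For orientation only, a minimal training sketch could look like the following. This is not the script from /pyscript; the pipeline choice, sample texts, labels and the pickle file name are all assumptions:

```python
# Minimal sketch: train a TF-IDF + Naive Bayes text classifier and pickle it.
# The sample texts, labels and output file name are placeholders.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Replace with your own training corpus and labels.
texts = ["Der Kurs der Aktie stieg deutlich.", "Das Team gewann das Spiel."]
labels = ["economy", "sports"]

# A pipeline keeps the vectorizer and the classifier together in one object.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
model.fit(texts, labels)

# Persist the trained pipeline so the classifier script can load it later.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Quick sanity check on an unseen sentence.
print(model.predict(["Die Mannschaft verlor das Finale."]))
```

Pickling the whole pipeline has the advantage that the same text preprocessing is applied automatically at prediction time.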
To run the Python script, the following packages need to be installed: numpy, scipy and scikit-learn:
Requirements:
sudo apt-get install gfortran
sudo apt-get install liblapack-dev
Packages:
pip install numpy
pip install scipy
pip install -U scikit-learn
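The installation can be verified by checking that all three packages import without errors:
python -c "import numpy, scipy, sklearn"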
Create a folder:
/tmp/clfin
and set its permissions (as the user that will execute the hadoop command):
mkdir /tmp/clfin
sudo chmod 1777 /tmp/clfin
The Python script and the model need to be placed according to the paths inside the Mapper (adapt them if needed), and the script needs execute permissions:
chmod 755 classifier.py