This is essentially a MapReduce Job to perform text classification on a Hadoop cluster.
It uses the Python library [scikit-learn](http://scikit-learn.org/stable/index.html). Input is taken from Lily, and output is written back to Lily, where it can be searched using Solr. It currently works for text and PDF files.
- verify that it works on the "real" cluster (currently tested on earkdev)
  - mainly the communication between Python and the Java/MapReduce jobs
  - also find an easy way to provide the Python libraries (see below)
- change the paths in the Mapper (to the Python script locations)
The mapper needs an input list, in the format of
according to the Solr query results. This makes it possible to create customized queries and classify only certain files.
This file must be uploaded to the Hadoop file system, in
/user/<username that executes hadoop commands>.
Launch the MapReduce Job with
Adapt the script if necessary:

    -i <input file name on HDFS>
    -c <path to the classifier script (local)>
    -m <path to the model (local)>
The provided model was trained on German/Austrian newspaper articles, so expect poor performance on out-of-domain data.
However, feel free to train your own models, following the [scikit-learn documentation](http://scikit-learn.org/stable/documentation.html).
An example script that was used to generate the newspaper model can be found in the /pyscript subfolder.
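
As a rough illustration of what training your own model could look like, here is a minimal sketch. It is not the script from /pyscript: the toy documents, the labels, the choice of `TfidfVectorizer` with `LinearSVC`, and the use of `pickle` for serialization are all assumptions made for this example. Check the provided classifier script for the model format it actually expects.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; replace with your own labelled documents.
texts = [
    "the football team scored a late goal in the match",
    "the striker missed a penalty in the cup final",
    "parliament voted on the new budget law",
    "the minister announced a tax reform in parliament",
]
labels = ["sport", "sport", "politics", "politics"]


def train_model(train_texts, train_labels):
    """Fit a TF-IDF + linear SVM pipeline on the given documents."""
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),  # turn raw text into TF-IDF vectors
        ("clf", LinearSVC()),          # linear classifier on those vectors
    ])
    model.fit(train_texts, train_labels)
    return model


model = train_model(texts, labels)

# Serialize and restore the whole pipeline (assumed format: pickle).
serialized = pickle.dumps(model)
restored = pickle.loads(serialized)

print(restored.predict(["the team won the cup"]))
```

Keeping the vectorizer and the classifier in one pipeline means a single file holds everything needed to classify raw text later.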
To run the Python script, the following packages need to be installed:

    sudo apt-get install gfortran
    sudo apt-get install liblapack-dev

    pip install numpy
    pip install scipy
    pip install -U scikit-learn
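
To confirm that the installation worked on a given node, a small check like the following may help. It is only a sketch: the helper name `missing_packages` is made up for this example, and it only tests whether the modules can be found, not whether they work. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages installed above (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn"]


def missing_packages(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are available.")
```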
Create a folder:
and set the permissions (while logged in as the user that will execute the hadoop command):
sudo chmod 1777 /tmp/clfin
The Python script and the model need to be placed at the paths configured in the Mapper (adapt these paths as needed). The script needs execute permissions:
chmod 755 classifier.py
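
The permission steps above can be double-checked with a short script. This is a sketch under assumptions: the function name `check_setup` is hypothetical, and the paths passed to it must match whatever is configured in your Mapper.

```python
import os
import stat


def check_setup(script_path, work_dir):
    """Return a list of problems found; an empty list means the setup looks fine."""
    problems = []

    if not os.path.isfile(script_path):
        problems.append("script not found: " + script_path)
    elif stat.S_IMODE(os.stat(script_path).st_mode) & 0o111 == 0:
        # No execute bit set for user, group, or others.
        problems.append("script is not executable: " + script_path)

    if not os.path.isdir(work_dir):
        problems.append("folder not found: " + work_dir)
    elif stat.S_IMODE(os.stat(work_dir).st_mode) != 0o1777:
        # Expected mode: world-writable with the sticky bit, as set by chmod 1777.
        problems.append("folder does not have mode 1777: " + work_dir)

    return problems


if __name__ == "__main__":
    # Example paths; adapt them to your installation.
    for problem in check_setup("classifier.py", "/tmp/clfin"):
        print(problem)
```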