This package allows users to construct classifiers and filters with Python scripts for WEKA, given that the script conforms to an expected structure. Get started with your first classifier here!
Download the latest version of PyScript here!
This package requires the following:
- The latest and greatest version of WEKA. The nightly developer snapshot can be downloaded here.
  - Extract weka.jar from developer-branch.zip and add it to the $CLASSPATH variable.
  - To run WEKA, simply run java weka.gui.GUIChooser.
- The wekaPython package written by Mark Hall. This package is actually a wrapper for Scikit-Learn, but it has code that makes it possible to interact with Python scripts.
  - You can install this using the WEKA package manager in the GUI chooser (Tools > Package Manager).
  - Ensure that wekaPython.jar is in your $CLASSPATH variable as well. This .jar can be found in the $WEKA_HOME/packages/wekaPython/ directory.
- An installation of Python 2.7 with libraries such as NumPy and Pandas. The easiest (and safest) way to get these is to download the Anaconda distribution, since it comes with many essential packages preloaded.
- Ant, to build the package.
- Java 8, although Java 7 will probably work too.
- (Optional) Theano to be able to run the linear regression example.
Now, download this Git repo, cd into the directory and run the following:
ant clean # if you have built the package previously
ant make_package -Dpackage=pyScript
cd dist
java weka.core.WekaPackageManager -install-package pyScript.zip
If the package installed successfully, you should now be able to run it from WEKA, either from the command-line or the GUI. A quick way to check if the classifier can be invoked is to simply run
java weka.Run .PyScriptClassifier
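and see if WEKA recognises it. You should get an error like "Weka exception: No training file and no object input file given.". This error is expected, since no dataset was supplied, and it indicates that WEKA found the classifier.
Also make sure to install the pyscript Python module by running: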
python setup.py install
Run a linear regressor on the diabetes dataset.
java weka.Run .PyScriptClassifier \
-script scripts/linear-reg.py \
-standardize \
-t datasets/diabetes_numeric.arff -c last -no-cv
Custom arguments can be passed to a script with -args. This script accepts two, each overriding a default value: alpha (the learning rate) and epsilon (the early stopping criterion).
java weka.Run .PyScriptClassifier \
-script scripts/linear-reg.py \
-standardize \
-args "alpha=0.001;epsilon=1e-6" \
-t datasets/diabetes_numeric.arff -c last -no-cv
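The -args string above looks like a semicolon-separated list of key=value pairs. As a rough illustration of how such a string could be turned into a dictionary that overrides default values, here is a minimal sketch. It is not PyScript's own parsing code, which is not shown in this README. The function name parse_args and the default values are hypothetical.

```python
def parse_args(arg_string):
    """Parse 'key=value;key=value' into a dict (hypothetical helper,
    not part of PyScript). Numeric values become int or float."""
    params = {}
    for pair in arg_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, raw = pair.partition("=")
        raw = raw.strip()
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw  # leave non-numeric values as strings
        params[key.strip()] = value
    return params


# Made-up defaults for illustration; the real script's defaults may differ.
defaults = {"alpha": 0.01, "epsilon": 1e-4}
overrides = parse_args("alpha=0.001;epsilon=1e-6")
settings = dict(defaults, **overrides)  # overrides win over defaults
print(settings)
```

Any argument left out of the string keeps its default, which matches the "override the default values" behaviour described above.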
We can also run ZeroR on a nominal dataset such as Iris.
java weka.Run .PyScriptClassifier \
-script scripts/zeror.py \
-t datasets/iris.arff -c last -no-cv
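For readers unfamiliar with ZeroR: for a nominal class, it ignores every attribute and always predicts the most frequent class seen in training. The sketch below shows that idea in plain Python. It is not the contents of scripts/zeror.py and does not follow the structure PyScript expects. The class name, its methods, and the toy labels are all made up for illustration.

```python
from collections import Counter


class ZeroR(object):
    """Majority-class baseline for a nominal class (illustrative only)."""

    def fit(self, labels):
        counts = Counter(labels)
        total = float(len(labels))
        # Class distribution observed in the training labels.
        self.distribution = {label: n / total for label, n in counts.items()}
        # The single most frequent label becomes the constant prediction.
        self.prediction = counts.most_common(1)[0][0]
        return self

    def predict(self, n_rows):
        # Every row receives the same prediction, regardless of its attributes.
        return [self.prediction] * n_rows


# Toy labels (not the real Iris class counts, which are balanced).
labels = ["setosa"] * 3 + ["versicolor"] * 5 + ["virginica"] * 2
model = ZeroR().fit(labels)
predictions = model.predict(4)
print(predictions)
```

A baseline like this is mainly useful as a sanity check: any real classifier should beat it.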
A Scikit-Learn random forest can be trained by passing in the argument num_trees, which specifies how many trees the ensemble should use (this argument is required). To do a 10-fold cross-validation on iris.arff using 30 trees, we run:
java weka.Run .PyScriptClassifier \
-script scripts/scikit-rf.py \
-args "num_trees=30" \
-t datasets/iris.arff
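To make the evaluation above concrete, here is a minimal sketch of how row indices can be split into k folds, with each fold held out once for testing. It is a simplification written for illustration, not WEKA's implementation: it does not shuffle or stratify by class. The function name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train, test) index lists for a plain k-fold split
    (no shuffling or stratification; illustrative only)."""
    # Row i goes to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# iris.arff has 150 rows, so each of the 10 test folds holds 15 rows.
splits = list(k_fold_indices(150, k=10))
for train, test in splits:
    assert len(train) + len(test) == 150
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each row is tested exactly once, and the reported performance is computed over all held-out predictions.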
We can also write Python scripts that act as filters. Here, we apply zero-mean unit-variance (ZMUV) standardisation to all numeric attributes in the data:
java weka.Run .PyScriptFilter \
-script scripts/standardise.py \
-i datasets/diabetes_numeric.arff \
-c last
By default, the standardisation is not applied to the class attribute. To standardise the class attribute as well, use the -ignore-class flag:
java weka.Run .PyScriptFilter \
-script scripts/standardise.py \
-i datasets/diabetes_numeric.arff \
-ignore-class \
-c last
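The ZMUV transformation itself is simple: subtract each attribute's mean and divide by its standard deviation. The sketch below shows it in plain Python, with an optional class column that is left untouched. It is not the contents of scripts/standardise.py. The function name and the class_index parameter are hypothetical, and it assumes every column is numeric and uses the population standard deviation (dividing by n), which may differ from what the real script does.

```python
import math


def standardise(rows, class_index=None):
    """Return a copy of rows with each column scaled to zero mean and
    unit variance, skipping class_index if given (illustrative only)."""
    out = [list(row) for row in rows]
    for j in range(len(rows[0])):
        if j == class_index:
            continue  # leave the class attribute as it is
        column = [float(row[j]) for row in rows]
        mean = sum(column) / len(column)
        # Population variance (divide by n); an assumption of this sketch.
        variance = sum((x - mean) ** 2 for x in column) / len(column)
        std = math.sqrt(variance)
        for i, x in enumerate(column):
            # A constant column has std 0; map it to 0.0 to avoid dividing by zero.
            out[i][j] = (x - mean) / std if std > 0 else 0.0
    return out


# Toy data: the last column plays the role of the class attribute.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
class_untouched = standardise(data, class_index=1)
class_included = standardise(data)
print(class_untouched)
print(class_included)
```

Passing class_index mirrors the default behaviour described above, while omitting it corresponds to also processing the class attribute.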