# RCS Weka Data Mining Software 

![Weka](weka.jpg)
Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rule mining, and visualization.

https://www.cs.waikato.ac.nz/~ml/weka/

# Data Mining vs Machine Learning

Data Mining is a cross-disciplinary field that focuses on discovering properties of data sets.

There are different approaches to discovering properties of data sets. Machine Learning is one of them. 

Machine Learning, on the other hand, is a sub-field of data science that focuses on designing algorithms that can learn from data and make predictions on it. Machine learning includes Supervised Learning and Unsupervised Learning methods. Unsupervised methods start from unlabeled data sets, so, in a way, they are directly related to discovering unknown properties in them.

Most likely data mining will simply merge into data science.

Source: a Quora answer by a Netflix VP, https://medium.com/@xamat/what-s-the-relationship-between-machine-learning-and-data-mining-8c8675966615

# Manual
http://prdownloads.sourceforge.net/weka/WekaManual-3-9-3.pdf?download (also included in Weka distribution)

### Book on using Weka 
Sadly, the 4th edition has a lot of typos and reference errors.

https://www.amazon.com/exec/obidos/ASIN/0128042915/departmofcompute

## Datasets for Weka
https://www.cs.waikato.ac.nz/~ml/weka/datasets.html

# WEKA Interface

1. **Preprocess.** Choose and modify the data being acted on.
2. **Classify.** Train and test learning schemes that classify or perform regression.
3. **Cluster.** Learn clusters for the data.
4. **Associate.** Learn association rules for the data.
5. **Select attributes.** Select the most relevant attributes in the data.
6. **Visualize.** View an interactive 2D plot of the data.

# 4 different styles of learning

* **Classification learning**: the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
* **Association learning**: any association among features is sought, not just ones that predict a particular class value.
* **Clustering**: groups of examples that belong together are sought.
* **Numeric prediction**: the outcome to be predicted is not a discrete class but a numeric quantity.

Regardless of the type of learning involved, we call the thing to be learned the concept and the output produced by a learning scheme the concept description.

# Iris dataset
The Iris flower dataset is a famous dataset from statistics that is widely used by machine learning researchers. It contains 150 instances (rows), each with 4 attributes (columns) plus a class attribute for the species of iris flower (one of setosa, versicolor, and virginica).

https://en.wikipedia.org/wiki/Iris_flower_data_set

Sample datasets are included in the data folder of the Weka installation:

* Windows: C:\Program Files\Weka-3-8\data
* macOS:
* Linux:

Click the “Classify” tab. This is the area for running algorithms against a loaded dataset in Weka.

You will note that the “ZeroR” algorithm is selected by default.

Click the “Start” button to run this algorithm.

### ZeroR algorithm - useful baseline but boring
http://chem-eng.utoronto.ca/~datamining/dmc/zeror.htm
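
To make the baseline concrete, here is a minimal sketch of the ZeroR idea in plain Python: ignore all attributes and always predict the most frequent class. The function names are made up for this illustration and are not part of Weka, and the accuracy is measured on the same labels rather than by cross-validation as Weka does.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent class label (ties go to the first one seen)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted_label, labels):
    """Fraction of instances whose true label equals the single prediction."""
    return sum(1 for y in labels if y == predicted_label) / len(labels)


# Class column of the Iris data: 50 instances of each species.
iris_labels = (
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
)

majority = zero_r_fit(iris_labels)
baseline = accuracy(majority, iris_labels)
print(f"ZeroR always predicts {majority!r}; accuracy = {baseline:.1%}")
```

Because the three classes are perfectly balanced, the baseline lands at about one third. Any real classifier should beat that comfortably.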

# HoeffdingTree
http://weka.sourceforge.net/doc.stable-3-8/weka/classifiers/trees/HoeffdingTree.html


# Review Results: how is the accuracy?

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         143               95.3333 %
Incorrectly Classified Instances         7                4.6667 %
Kappa statistic                          0.93  
Mean absolute error                      0.0373
Root mean squared error                  0.1543
Relative absolute error                  8.3884 %
Root relative squared error             32.735  %
Total Number of Instances              150     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    0.000    1.000      1.000    1.000      1.000    1.000     1.000     Iris-setosa
                 0.940    0.040    0.922      0.940    0.931      0.896    0.991     0.982     Iris-versicolor
                 0.920    0.030    0.939      0.920    0.929      0.895    0.991     0.985     Iris-virginica
Weighted Avg.    0.953    0.023    0.953      0.953    0.953      0.930    0.994     0.989     

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
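
The summary figures above can be recomputed from the confusion matrix alone. The following is a small sketch in plain Python (variable names are chosen for this example), using the standard definitions of accuracy, Cohen's kappa, precision and recall. It is not Weka's own code.

```python
# Rows = actual class, columns = predicted class (same layout as Weka's output).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 4, 46],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

row_totals = [sum(row) for row in matrix]
col_totals = [sum(row[j] for row in matrix) for j in range(len(labels))]

# Agreement expected by chance, from the row and column marginals.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.2f}")
for i, name in enumerate(labels):
    precision = matrix[i][i] / col_totals[i]
    recall = matrix[i][i] / row_totals[i]
    print(f"{name}: precision = {precision:.3f}, recall = {recall:.3f}")
```

The diagonal holds the 143 correct predictions. The 7 off-diagonal entries are all confusions between versicolor and virginica, which is where any improvement has to come from.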


##### Exercise: attempt to get better than 95%
## How would one go about it?


# C4.8 classifier algorithm
Click the “Choose” button in the “Classifier” section and click on “trees” and click on the “J48” algorithm.

This is an implementation of the C4.8 algorithm in Java (“J” for Java, 48 for C4.8, hence the J48 name) and is a minor extension to the famous C4.5 algorithm.
http://en.wikipedia.org/wiki/C4.5_algorithm

# Review Results

First, note the classification accuracy. The model classified 144/150 instances correctly (96%), which is a lot better than the ZeroR baseline of about 33%.

Second, look at the confusion matrix, a table of actual classes compared to predicted classes. There was 1 error where an Iris-setosa was classified as an Iris-versicolor, 2 cases where an Iris-virginica was classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was classified as an Iris-virginica (a total of 6 errors). This table can help to explain the accuracy achieved by the algorithm.

### You can inspect the log for what you've done so far
11:32:38: Started weka.classifiers.rules.ZeroR

11:32:38: Command: weka.classifiers.rules.ZeroR 

11:32:39: Finished weka.classifiers.rules.ZeroR

11:33:02: Started weka.classifiers.trees.J48

11:33:02: Command: weka.classifiers.trees.J48 -C 0.25 -M 2

11:33:02: Finished weka.classifiers.trees.J48

## Saving Unlabeled Instances

The testing set (the one you want labeled) needs to have the SAME attributes, including the class attribute, as the training set.

The missing class values need to be written as question marks (?).

https://list.waikato.ac.nz/pipermail/wekalist/2013-April/057868.html

https://stackoverflow.com/questions/10072540/how-to-clasify-an-unlabelled-dataset-with-a-newly-trained-naivebayes-classifier
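
The two rules above can be sketched in Python as follows: build an ARFF file that declares the same attributes as the training data and puts `?` in the class column of every row. The relation name, the attribute names and the measurements are assumptions for illustration. Copy the exact `@attribute` lines from your own training file.

```python
def unlabeled_arff(rows):
    """Build ARFF text whose class value is '?' (missing) for every row."""
    header = [
        "@relation iris-unlabeled",
        # Assumed attribute names; they must match the training set exactly.
        "@attribute sepallength numeric",
        "@attribute sepalwidth numeric",
        "@attribute petallength numeric",
        "@attribute petalwidth numeric",
        # Same nominal class declaration as in the training set.
        "@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}",
        "@data",
    ]
    data = [",".join(str(value) for value in row) + ",?" for row in rows]
    return "\n".join(header + data) + "\n"


# Hypothetical measurements, for illustration only.
new_flowers = [(5.0, 3.4, 1.5, 0.2), (6.1, 2.9, 4.7, 1.4)]
arff_text = unlabeled_arff(new_flowers)
print(arff_text)
```

The resulting text can be saved with a `.arff` extension and supplied to Weka as the test set.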



# TRAINING AND TESTING

# CROSS-VALIDATION

Cross-validation is a technique for assessing how the results of a statistical analysis generalize to an independent data set. It is mostly used in settings where the goal is prediction and it is necessary to estimate how accurately a predictive model will perform. The prime reason for using cross-validation rather than conventional validation is that there is not enough data available to partition it into separate training and test sets (as in conventional validation).

![Cross-Validation](K-fold_cross_validation_EN.jpg)
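
As a rough illustration of the splitting step, here is a minimal sketch of stratified k-fold partitioning in plain Python. The function name and the seed are assumptions for this example. It shows the idea only and is not Weka's exact procedure.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=1):
    """Assign each instance index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so that every fold gets a
        # near-equal share of each class.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
folds = stratified_folds(labels, k=10)

for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    # Here one would train on `train`, evaluate on `test_fold`,
    # and finally average the k scores.
    assert len(train) + len(test_fold) == len(labels)

print([len(fold) for fold in folds])
```

Every instance is used for testing exactly once and for training k - 1 times, which is why the method suits small data sets such as Iris.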

# MDL principle

Occam's Razor - minimize model as much as possible, but no more

KISS

# Choosing an algorithm
![ML Algo](ml_map.png)

# Saving Results

### Right-click the result you want in the Result list of the Classify tab
### Many options are available

## Saving Images
It is possible to generate image files from a number of panels in Weka's GUI. Just **hold down shift and alt and left-click** on the panel that you want to save. Available formats include BMP, JPEG, PNG and PostScript.

### VS Warning: use the EPS option, then convert to other formats later with an EPS viewer program (JPG, BMP and PNG give you a black background)

It is also possible to save the visualization data out to an ARFF file - just use the "Save" button. You can then load it back into the Explorer and use any of Weka's filters to manipulate it.


# Python3 Wrapper for Weka 

https://pypi.org/project/python-weka-wrapper3/

Installation is a bit complicated (it also needs Javabridge).

http://fracpete.github.io/python-weka-wrapper3/install.html


# Major Python-based alternative
http://scikit-learn.org/stable/