# Introduction

The `LibsvmDataset` class helps users download datasets from the [LIBSVM Data repository](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) and perform basic data cleaning tasks.

The current implementation supports downloading datasets for two tasks: `binary` classification and `regression`.

The API documentation can be found [here](https://roth.rbind.io/resources/software/libsvmdata-downloader/).

# Downloading

`LibsvmDataset` provides a `getAndClean` method to download a particular dataset (the supported cleaning tasks are described in the Cleaning section). Under the hood, this method downloads the dataset from a URL.
This URL can either be provided by the user through the `download_url` argument or generated internally from the combination of the `task` and `dataset` arguments.

The following code snippets demonstrate these two options.

## Option 1: Through the `download_url` argument 

For example, assume we want to download the `a1a` dataset for the `binary` classification task. Go to https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a1a and find the download link for `a1a` under the "File" tag.
Right-click the link and choose "copy link". The download link is https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a.
Pass this link to the `getAndClean` method of `LibsvmDataset`.

In [1]:
from LibsvmDataset import LibsvmDataset
# Create an instance.
# download_dir: a string specifying where to store the downloaded raw dataset.
# cleand_dir: a string specifying where to store the cleaned dataset.
libclass = LibsvmDataset(download_dir="./raw", 
                         cleand_dir="./clean")
libclass.getAndClean(download_url='https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a')

You choose to use the url option.
Parsed task: [binary] | Parsed dataset: [a1a]
Parsed download_url:[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a]


100% (114818 of 114818) |################| Elapsed Time: 0:00:00 Time:  0:00:00


dataset [a1a] is downloaded at [./raw/binary].
Original y-label range: { -1.0, 1.0 } -> New y-label range: { -1.0, 1.0 }
Perform normalization:feat-11
Success: File saved at [./clean/binary]!
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
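
The log above shows that the task and the dataset name are parsed from the URL. How `LibsvmDataset` does this internally is not shown here; the following is only a minimal sketch, assuming the URL path ends in `<task>/<dataset>`. The helper name `parse_libsvm_url` is hypothetical and is not part of the class.

```python
from urllib.parse import urlparse


def parse_libsvm_url(download_url):
    """Split a LIBSVM download URL into (task, dataset).

    Hypothetical helper: assumes the URL path ends with <task>/<dataset>,
    as in the a1a example above.
    """
    parts = [p for p in urlparse(download_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"cannot parse task and dataset from {download_url!r}")
    return parts[-2], parts[-1]


url = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a"
print(parse_libsvm_url(url))  # ('binary', 'a1a')
```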


## Option 2: Through the combination of `task` and `dataset` arguments

In [3]:
# assume we want to download a2a dataset
libclass.getAndClean(task='binary', dataset='a2a')

You choose to use the task+dataset option.
Parsed task: [binary] | Parsed dataset: [a2a]
Parsed download_url:[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a2a]


100% (162053 of 162053) |################| Elapsed Time: 0:00:00 Time:  0:00:00


dataset [a2a] is downloaded at [./raw/binary].
Original y-label range: { -1.0, 1.0 } -> New y-label range: { -1.0, 1.0 }
Perform normalization:feat-11
Success: File saved at [./clean/binary]!
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
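
The parsed URL in the log suggests that the link is built as base URL, then task, then dataset name. The sketch below illustrates that idea under stated assumptions: `resolve_download_url`, `BASE_URL`, and the tiny `AVAILABLE` registry are hypothetical names made up for illustration, and the real built-in list is much longer.

```python
BASE_URL = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets"

# Tiny illustrative registry (assumption); the class ships a longer list.
AVAILABLE = {
    "binary": {"a1a", "a2a", "australian"},
    "regression": {"abalone", "bodyfat"},
}


def resolve_download_url(task=None, dataset=None, download_url=None):
    """Return the URL to download, or None if it cannot be determined."""
    if download_url is not None:
        # An explicit URL always wins (Option 1).
        return download_url
    if dataset in AVAILABLE.get(task, set()):
        # Known task/dataset pair (Option 2).
        return f"{BASE_URL}/{task}/{dataset}"
    # Unknown dataset, or dataset not registered under this task.
    return None


print(resolve_download_url(task="binary", dataset="a2a"))
print(resolve_download_url(task="regression", dataset="a1a"))  # None
```

Returning `None` for an unknown pair is a choice made for this sketch; the class itself reports the problem by printing an error message.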


**Note:** To use Option 2, users must provide valid values for the `task` and `dataset` arguments. The current implementation includes all small and moderate-size datasets for both the `binary` and `regression` tasks. To see all available (built-in) combinations of values for the `task` and `dataset` arguments, use the `getAvailable()` method.

In [4]:
libclass.getAvailable()

Current supported tasks are:
 ['binary'] ['regression']
For task:['binary'], available datasets are:
----------------------------------------------------
 'a1a', 'a2a', 'a3a', 'a4a', 'a5a',

 'a6a', 'a7a', 'a8a', 'a9a', 'a1a.t',

 'a2a.t', 'a3a.t', 'a4a.t', 'a5a.t', 'a6a.t',

 'a7a.t', 'a8a.t', 'a9a.t', 'australian', 'breast-cancer',

 'cod-rna', 'cod-rna.t', 'cod-rna.r', 'colon-cancer.bz2', 'covtype.libsvm.binary.bz2',

 'diabetes', 'duke.bz2', 'fourclass', 'german.numer', 'gisette_scale.bz2',

 'gisette_scale.t.bz2', 'heart', 'ijcnn1.bz2', 'ionosphere_scale', 'leu.bz2',

 'leu.bz2.t', 'liver-disorders', 'liver-disorders.t', 'mushrooms', 'phishing',

 'skin_nonskin', 'splice', 'splice.t', 'sonar_scale', 'svmguide1',

 'svmguide1.t', 'svmguide3', 'svmguide3.t', 'w1a', 'w2a',

 'w3a', 'w4a', 'w5a', 'w6a', 'w7a',

 'w8a', 'w1a.t', 'w2a.t', 'w3a.t', 'w4a.t',

 'w5a.t', 'w6a.t', 'w7a.t', 'w8a.t',

For task:['regression'], available datasets are:
--------------------------------------------

What happens if we try to download the dataset `avazu-app.bz2`, which is not in the above list? An error message is printed. (In this case, we recommend Option 1, i.e., providing a download URL.)

In [5]:
# An error happens. It matches the second cause.
libclass.getAndClean(task='binary', dataset='avazu-app.bz2')

You choose to use the task+dataset option.
Error occurs!
  1.Either the input dataset:[avazu-app.bz2] is not intended for the task:[binary].
  2.Or the input dataset:[avazu-app.bz2] is not in the built-in database.
If you are sure the latter case happens, you can provide an url pointing to the desired dataset.
Fail to generate download url, please check your inputs.


Likewise, if a user gives a `dataset` name from the above list but specifies the wrong `task`, the same error message is printed; this matches the first cause.

In [6]:
# An error happens. It matches the first cause.
libclass.getAndClean(task='regression', dataset='a1a')

You choose to use the task+dataset option.
Error occurs!
  1.Either the input dataset:[a1a] is not intended for the task:[regression].
  2.Or the input dataset:[a1a] is not in the built-in database.
If you are sure the latter case happens, you can provide an url pointing to the desired dataset.
Fail to generate download url, please check your inputs.


# Cleaning

Currently supported cleaning tasks include:

1. Change the labels for binary classification. The labels can be set to {-1,1} by setting the `binary_label` argument to '{-1,1}', or to {0,1} by setting it to '{0,1}'.

2. Feature-wise normalization. Users can normalize each feature to the range [-1,1] or [0,1] by setting the `normalization` argument to `feat-11` or `feat01`, respectively.
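
For intuition, both cleaning steps can be sketched in plain Python. This is not the library's code: the function names `remap_binary_labels` and `scale_features` are hypothetical, the data is assumed to be dense lists rather than the sparse LIBSVM format, and the scaling is standard min-max scaling, x' = lo + (hi - lo)(x - min)/(max - min). The sketch rescales every non-constant column; the actual implementation may select columns differently.

```python
LABEL_TARGETS = {"{-1,1}": (-1.0, 1.0), "{0,1}": (0.0, 1.0)}
SCALE_RANGES = {"feat-11": (-1.0, 1.0), "feat01": (0.0, 1.0)}


def remap_binary_labels(labels, target="{-1,1}"):
    """Map a two-valued label list onto the target label set.

    Assumption: the smaller original label becomes the lower target value.
    """
    low, high = LABEL_TARGETS[target]
    values = sorted(set(labels))
    if len(values) != 2:
        raise ValueError("expected exactly two distinct labels")
    return [low if y == values[0] else high for y in labels]


def scale_features(rows, mode="feat-11"):
    """Min-max scale each column of a dense, row-major table."""
    lo, hi = SCALE_RANGES[mode]
    scaled_columns = []
    for col in zip(*rows):
        cmin, cmax = min(col), max(col)
        if cmax == cmin:
            # Constant column: keep as-is to avoid dividing by zero.
            scaled_columns.append(list(col))
        else:
            scaled_columns.append(
                [lo + (hi - lo) * (x - cmin) / (cmax - cmin) for x in col]
            )
    # Transpose back from columns to rows.
    return [list(row) for row in zip(*scaled_columns)]


print(remap_binary_labels([1, 2, 2, 1], target="{0,1}"))
print(scale_features([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]))
```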

# Advanced Usage

Users might want to download a list of datasets all at once. This can be done with the `getAndCleanFromFile` method.
In this case, users need to provide a `txt` file in which each row is a dataset name. Two sample `txt` files, `libsvm_binary_small.txt` and `libsvm_regression.txt`, are provided in this repo for demonstration.

**Note**: Users can put a "#" in front of a dataset name, in which case the `getAndCleanFromFile` method skips that dataset. As before, if a dataset specified in the txt file is not available, an error message is printed (but the code does not crash). **When choosing the available datasets, we intentionally excluded the large ones. First, they take a long time to download, so we prefer that users download them one by one through the `getAndClean` method. Second, these large datasets usually have multiple versions, and we want users to decide which one they want and obtain the correct download URL on their own.**
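
The file format described above (one dataset name per row, "#" to skip a row) can be read with a few lines of Python. The sketch below is illustrative only: `read_dataset_names` is a hypothetical helper, and skipping blank lines is an assumption rather than documented behavior.

```python
import os
import tempfile


def read_dataset_names(file_path):
    """Return dataset names from a text file, one name per line.

    Lines starting with "#" are skipped. Skipping blank lines is an
    assumption of this sketch.
    """
    names = []
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            name = line.strip()
            if not name or name.startswith("#"):
                continue
            names.append(name)
    return names


# Demo with a throwaway file; the "#cadata" row is commented out and skipped.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "datasets.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("abalone\n#cadata\nbodyfat\n")
    names = read_dataset_names(path)

print(names)  # ['abalone', 'bodyfat']
```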

In [7]:
file_path = './libsvm_regression.txt'
libclass.getAndCleanFromFile(file_path,
                             task='regression', 
                             binary_lable='{-1,1}', 
                             normalization='feat-11')

Parsed 2 datasets from ./libsvm_regression.txt.


100% (258705 of 258705) |################| Elapsed Time: 0:00:01 Time:  0:00:01


dataset [abalone] is downloaded at [./raw/regression].
Perform normalization:feat-11
  col:0: max:3.000e+00 | min:1.000e+00
  Apply feature-wise [-1,1] scaling...
  col:3: max:1.130e+00 | min:0.000e+00
  Apply feature-wise [-1,1] scaling...
  col:4: max:2.825e+00 | min:2.000e-03
  Apply feature-wise [-1,1] scaling...
  col:5: max:1.488e+00 | min:1.000e-03
  Apply feature-wise [-1,1] scaling...
  col:7: max:1.005e+00 | min:1.500e-03
  Apply feature-wise [-1,1] scaling...
Success: File saved at [./clean/regression]!
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*


100% (28223 of 28223) |##################| Elapsed Time: 0:00:00 Time:  0:00:00


dataset [bodyfat] is downloaded at [./raw/regression].
Perform normalization:feat-11
  col:0: max:4.750e+01 | min:0.000e+00
  Apply feature-wise [-1,1] scaling...
  col:1: max:8.100e+01 | min:2.200e+01
  Apply feature-wise [-1,1] scaling...
  col:2: max:3.631e+02 | min:1.185e+02
  Apply feature-wise [-1,1] scaling...
  col:3: max:7.775e+01 | min:2.950e+01
  Apply feature-wise [-1,1] scaling...
  col:4: max:5.120e+01 | min:3.110e+01
  Apply feature-wise [-1,1] scaling...
  col:5: max:1.362e+02 | min:7.930e+01
  Apply feature-wise [-1,1] scaling...
  col:6: max:1.481e+02 | min:6.940e+01
  Apply feature-wise [-1,1] scaling...
  col:7: max:1.477e+02 | min:8.500e+01
  Apply feature-wise [-1,1] scaling...
  col:8: max:8.730e+01 | min:4.720e+01
  Apply feature-wise [-1,1] scaling...
  col:9: max:4.910e+01 | min:3.300e+01
  Apply feature-wise [-1,1] scaling...
  col:10: max:3.390e+01 | min:1.910e+01
  Apply feature-wise [-1,1] scaling...
  col:11: max:4.500e+01 | min:2.480e+01
  Apply

# Final Remarks

By default, the `LibsvmDataset` class checks whether the dataset has already been downloaded and cleaned. If so, no further action is taken.
To override this, set `force_download` and `force_clean` to `True`; the dataset is then downloaded and cleaned again.
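
The skip-unless-forced behavior can be pictured as a simple file-existence check. The sketch below is an assumption about the general pattern, not the class's actual code; `should_run` is a hypothetical helper, and where exactly `force_download` and `force_clean` are passed is not shown here.

```python
import os
import tempfile


def should_run(output_path, force=False):
    """Return True when a step must run: it is forced, or its output is missing."""
    return force or not os.path.exists(output_path)


with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "a1a")
    first = should_run(raw_path)               # file missing: must download
    open(raw_path, "w").close()                # pretend the download happened
    second = should_run(raw_path)              # file present: skip
    forced = should_run(raw_path, force=True)  # forced: download again

print(first, second, forced)  # True False True
```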