ikodaSparse maintains sparse data along with its meaningful text values.
Libsvm format data (and the analogous LabeledPoint Scala class) do not maintain meaningful text values for columns or rows. They are purely numeric. In contrast, ikodaSparse maintains the text values for features/columns and text category names for the labels/targets.
For example, natural language word frequency data can be processed in libsvm format without losing the word and category names needed for reporting and for visualizing the results of the analysis.
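To make the contrast concrete, the following is a minimal sketch in plain Scala (no Spark required) of how a purely numeric libsvm line can be turned back into readable text once name maps are kept next to it. The object `LibsvmNamesSketch`, its methods, and the sample names are all hypothetical and chosen for illustration; this is not ikodaSparse's API.

```scala
// Hypothetical illustration only; none of these names come from ikodaSparse.
object LibsvmNamesSketch {

  // Parse one libsvm line of the form "<label> <index>:<value> <index>:<value> ...".
  def parseLine(line: String): (Double, List[(Int, Double)]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toList.map { token =>
      val sep = token.indexOf(':')
      (token.substring(0, sep).toInt, token.substring(sep + 1).toDouble)
    }
    (label, features)
  }

  // Made-up name maps: column index -> word, numeric label -> category name.
  val columnNames: Map[Int, String] = Map(1 -> "spark", 2 -> "scala", 5 -> "sparse")
  val labelNames: Map[Double, String] = Map(0.0 -> "sports", 1.0 -> "technology")

  // Render a numeric libsvm line using the text names; unknown ids fall back to numbers.
  def describe(line: String): String = {
    val (label, features) = parseLine(line)
    val named = features.map { case (index, value) =>
      s"${columnNames.getOrElse(index, s"col$index")}=$value"
    }
    s"${labelNames.getOrElse(label, label.toString)}: ${named.mkString(", ")}"
  }

  def main(args: Array[String]): Unit =
    println(describe("1 1:3.0 5:1.0")) // prints "technology: spark=3.0, sparse=1.0"
}
```

Without the two maps, the same line could only be reported as "label 1, column 1 = 3.0, column 5 = 1.0".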
ikodaSparse is a Scala tool designed to run as part of a data pipeline on Spark.
The core of the tool is an RDD[org.apache.spark.ml.feature.LabeledPoint] with a mapping of text names to each column and to each label/target. ikodaSparse can also convert the data to a DataFrame or to an RDD[org.apache.spark.mllib.regression.LabeledPoint] if required.
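The structure described above can be sketched roughly as follows. This is an assumption-laden illustration, not ikodaSparse's actual classes: `NamedSparseData`, `toMllib`, and `toDataFrame` are hypothetical names, and the code assumes Spark (spark-core, spark-sql, spark-mllib, version 2.0 or later) is on the classpath. Only standard Spark calls are used: `Vectors.fromML` to convert an ml vector to an mllib vector, and `SparkSession.createDataFrame` to build a DataFrame from an RDD of case classes.

```scala
import org.apache.spark.ml.feature.{LabeledPoint => MlLabeledPoint}
import org.apache.spark.ml.linalg.{Vectors => MlVectors}
import org.apache.spark.mllib.linalg.{Vectors => MllibVectors}
import org.apache.spark.mllib.regression.{LabeledPoint => MllibLabeledPoint}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical container: numeric sparse rows plus the text names they stand for.
final case class NamedSparseData(
    rows: RDD[MlLabeledPoint],
    columnNames: Map[Int, String],   // column index -> feature name
    labelNames: Map[Double, String]  // numeric label -> category name
) {
  // Convert to the older mllib LabeledPoint type.
  def toMllib: RDD[MllibLabeledPoint] =
    rows.map(p => MllibLabeledPoint(p.label, MllibVectors.fromML(p.features)))

  // Convert to a DataFrame with "label" and "features" columns.
  def toDataFrame(spark: SparkSession): DataFrame =
    spark.createDataFrame(rows)
}

object NamedSparseDataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()

    // One row with 6 columns, of which only columns 1 and 5 are non-zero.
    val rows = spark.sparkContext.parallelize(Seq(
      MlLabeledPoint(1.0, MlVectors.sparse(6, Array(1, 5), Array(3.0, 1.0)))
    ))
    val data = NamedSparseData(
      rows,
      columnNames = Map(1 -> "spark", 5 -> "sparse"),
      labelNames = Map(1.0 -> "technology")
    )

    data.toDataFrame(spark).printSchema()
    println(data.toMllib.count())

    spark.stop()
  }
}
```

In this sketch the name maps are small in-memory maps held beside the distributed numeric data, so the rows themselves stay purely numeric and remain usable by Spark ML.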
The main function of ikodaSparse is to manipulate large sparse data sets. It can:
- Maintain a map from numeric feature identifiers to text names
- Maintain a map from numeric labels/targets to text labels
- Maintain a UUID for each row
- Remove columns/features
- Reorder columns/features
- Add columns
- Remove rows by label/target
- Perform mathematical operations, both row wise and column wise
- Provide data directly to Scala ML functions
- Merge labels/targets
- Merge data schemas (i.e., convert one data set to match the column and target numbers of another)
- Merge sparse data from two sources
- Dichotomize labels/targets (i.e., each row is labeled either as target A or as OTHER)
- Identify and remove duplicate rows
- Return rows containing a particular column
- Load and save data on a local file system
- Load and save data on Hadoop
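
A few of the operations listed above can be sketched on a small in-memory stand-in for sparse rows. This is plain Scala without Spark, and every name in it (`SparseOpsSketch`, `SparseRow`, and the functions) is hypothetical; it illustrates what the operations mean, not how ikodaSparse implements them or what its API looks like.

```scala
// Hypothetical in-memory stand-in; not ikodaSparse's API or implementation.
object SparseOpsSketch {

  // One sparse row: a numeric label plus only the non-zero (column -> value) entries.
  final case class SparseRow(label: Double, values: Map[Int, Double])

  // Remove columns/features: drop the given column indices from every row.
  def removeColumns(rows: Seq[SparseRow], drop: Set[Int]): Seq[SparseRow] =
    rows.map(r => r.copy(values = r.values.filter { case (col, _) => !drop.contains(col) }))

  // Remove rows by label/target.
  def removeRowsByLabel(rows: Seq[SparseRow], label: Double): Seq[SparseRow] =
    rows.filterNot(_.label == label)

  // Dichotomize: the chosen target becomes 1.0, every other label becomes 0.0 (OTHER).
  def dichotomize(rows: Seq[SparseRow], target: Double): Seq[SparseRow] =
    rows.map(r => r.copy(label = if (r.label == target) 1.0 else 0.0))

  // Identify and remove duplicate rows (same label and same non-zero entries).
  def dropDuplicates(rows: Seq[SparseRow]): Seq[SparseRow] = rows.distinct

  // Return rows that have a non-zero entry in a particular column.
  def rowsWithColumn(rows: Seq[SparseRow], column: Int): Seq[SparseRow] =
    rows.filter(_.values.contains(column))

  val sample: Seq[SparseRow] = Seq(
    SparseRow(0.0, Map(1 -> 2.0, 3 -> 1.0)),
    SparseRow(1.0, Map(2 -> 4.0)),
    SparseRow(2.0, Map(1 -> 1.0)),
    SparseRow(1.0, Map(2 -> 4.0)) // duplicate of the second row
  )

  def main(args: Array[String]): Unit = {
    println(dropDuplicates(sample).size)               // prints 3
    println(rowsWithColumn(sample, 1).size)            // prints 2
    println(dichotomize(sample, 1.0).map(_.label))     // prints List(0.0, 1.0, 0.0, 1.0)
    println(removeColumns(sample, Set(1)).head.values) // prints Map(3 -> 1.0)
  }
}
```

On an RDD the same ideas would typically be expressed with `map`, `filter`, and `distinct`, with the column and label name maps updated alongside the numeric data.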