amerywu/ikodaSparse

ikodaSparse

ikodaSparse maintains sparse data along with its meaningful text values.

LIBSVM-format data (and the analogous Scala LabeledPoint class) carries no text values for columns or rows: it is purely numeric. In contrast, ikodaSparse keeps the text names of the features/columns and the text category names of the labels/targets.

For example, natural-language word-frequency data can be processed in LIBSVM format without losing the text information needed to report on and visualize the results of the analysis.
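To make the difference concrete, the sketch below parses one LIBSVM line and renders it with text names. It is plain Scala with no Spark dependency. The object name, the helper functions, and the sample words and categories are all hypothetical and chosen for illustration; none of them come from the ikodaSparse API.

```scala
// Hypothetical illustration only: none of these names come from ikodaSparse.
object LibsvmNamesSketch {
  // Invented sample maps: column index -> word, numeric label -> category name.
  val columnNames: Map[Int, String] =
    Map(1 -> "salary", 2 -> "interview", 3 -> "resume")
  val labelNames: Map[Double, String] =
    Map(0.0 -> "CAREER_ADVICE", 1.0 -> "JOB_POSTING")

  // Parse one LIBSVM line such as "1 1:3 3:2" into (label, index -> value).
  def parseLine(line: String): (Double, Map[Int, Double]) = {
    val tokens = line.trim.split("\\s+")
    val features = tokens.tail.map { token =>
      token.split(":") match {
        case Array(index, value) => index.toInt -> value.toDouble
      }
    }.toMap
    (tokens.head.toDouble, features)
  }

  // Render the same row with text names in place of bare numbers.
  // Unknown indices or labels fall back to a numeric placeholder.
  def describe(line: String): String = {
    val (label, features) = parseLine(line)
    val named = features.toSeq.sortBy(_._1).map { case (index, value) =>
      s"${columnNames.getOrElse(index, s"col$index")}=$value"
    }
    s"${labelNames.getOrElse(label, label.toString)}: ${named.mkString(", ")}"
  }
}
```

On its own, the line `1 1:3 3:2` says nothing about which words or which category it encodes. With the two maps kept alongside the data, the same row can be reported in readable form.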

ikodaSparse is a Scala tool designed to run as part of a data pipeline on Spark.

The core of the tool is an RDD[org.apache.spark.ml.feature.LabeledPoint], together with a mapping from each column index to a text name and from each numeric label/target to a text name.
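As a rough mental model of that structure, the sketch below pairs a collection of sparse rows with the two name maps. It assumes a plain Scala `Seq` as a stand-in for the RDD and a `Map[Int, Double]` as a stand-in for a sparse vector. The types `Row` and `NamedSparse` and their methods are hypothetical and are not the classes ikodaSparse actually defines.

```scala
import java.util.UUID

// Hypothetical stand-in types: not the actual ikodaSparse classes.
object NamedSparseSketch {
  // One sparse row: only non-zero (column index -> value) entries are stored.
  final case class Row(id: UUID, label: Double, features: Map[Int, Double])

  // The rows plus the two maps that give the numbers their meaning.
  final case class NamedSparse(
      rows: Seq[Row],
      columnNames: Map[Int, String],
      labelNames: Map[Double, String]) {

    // Text name of a row's label, if one is registered.
    def labelNameOf(row: Row): Option[String] = labelNames.get(row.label)

    // A row's features keyed by column name instead of column index.
    def namedFeatures(row: Row): Map[String, Double] =
      row.features.collect {
        case (index, value) if columnNames.contains(index) =>
          columnNames(index) -> value
      }
  }

  // Invented sample data.
  val sample: NamedSparse = NamedSparse(
    rows = Seq(
      Row(UUID.randomUUID(), 0.0, Map(0 -> 2.0, 2 -> 1.0)),
      Row(UUID.randomUUID(), 1.0, Map(1 -> 4.0))),
    columnNames = Map(0 -> "salary", 1 -> "interview", 2 -> "resume"),
    labelNames = Map(0.0 -> "CAREER_ADVICE", 1.0 -> "JOB_POSTING"))
}
```

Keeping the names in separate maps, rather than on every row, leaves the row data purely numeric and therefore still usable wherever a LabeledPoint is expected.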

ikodaSparse can also convert the data to a DataFrame or to an RDD[org.apache.spark.mllib.regression.LabeledPoint] if required.

The main function of ikodaSparse is to manipulate large sparse data.

ikodaSparse can:

  1. Maintain a map from numeric feature identifiers to text names
  2. Maintain a map from numeric labels/targets to text labels
  3. Maintain a UUID for each row
  4. Remove columns/features
  5. Reorder columns/features
  6. Add columns
  7. Remove rows by label/target
  8. Perform mathematical operations, both row-wise and column-wise
  9. Provide data directly to Scala ML functions
  10. Merge labels/targets
  11. Merge data schemas (i.e., convert one data set to match the column and target numbers of another)
  12. Merge sparse data from two sources
  13. Dichotomize labels/targets (i.e., each row is either target A or OTHER)
  14. Identify and remove duplicate rows
  15. Return rows containing a particular column
  16. Load and save data on a local file system
  17. Load and save data on Hadoop
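A few of the operations listed above can be sketched in plain Scala to show why the name maps must change together with the numeric data. This is an illustration under stated assumptions, not the ikodaSparse implementation: the `Row` and `Data` types and every function name are hypothetical, a `Seq` stands in for the RDD, and the choice to drop columns that the reference schema does not know is an assumption made here.

```scala
// Hypothetical sketch: names and behavior are assumptions, not the ikodaSparse API.
object SparseOpsSketch {
  final case class Row(label: Double, features: Map[Int, Double])
  final case class Data(
      rows: Seq[Row],
      columnNames: Map[Int, String],
      labelNames: Map[Double, String])

  // Item 4: drop columns from every row and from the column-name map.
  def removeColumns(d: Data, drop: Set[Int]): Data =
    d.copy(
      rows = d.rows.map(r => r.copy(features = r.features.filter { case (i, _) => !drop(i) })),
      columnNames = d.columnNames.filter { case (i, _) => !drop(i) })

  // Item 7: remove all rows carrying a given label, and forget its name.
  def removeRowsByLabel(d: Data, label: Double): Data =
    d.copy(rows = d.rows.filterNot(_.label == label), labelNames = d.labelNames - label)

  // Item 13: the chosen target becomes 1.0, every other label becomes 0.0 ("OTHER").
  def dichotomize(d: Data, target: Double): Data = {
    val targetName = d.labelNames.getOrElse(target, target.toString)
    d.copy(
      rows = d.rows.map(r => r.copy(label = if (r.label == target) 1.0 else 0.0)),
      labelNames = Map(1.0 -> targetName, 0.0 -> "OTHER"))
  }

  // Item 14: remove rows that are identical in label and features.
  def removeDuplicates(d: Data): Data = d.copy(rows = d.rows.distinct)

  // Item 15: rows that have a non-zero entry in the given column.
  def rowsContaining(d: Data, column: Int): Seq[Row] =
    d.rows.filter(_.features.contains(column))

  // Item 11 (one possible reading): re-index columns by name so they match a
  // reference schema. Columns unknown to the reference are dropped (assumption).
  def alignColumns(d: Data, reference: Map[Int, String]): Data = {
    val indexByName = reference.map(_.swap)
    val remap: Map[Int, Int] = d.columnNames.collect {
      case (i, name) if indexByName.contains(name) => i -> indexByName(name)
    }
    d.copy(
      rows = d.rows.map(r => r.copy(features = r.features.collect {
        case (i, v) if remap.contains(i) => remap(i) -> v
      })),
      columnNames = reference)
  }

  // Invented sample data; the last row duplicates the second.
  val sample: Data = Data(
    rows = Seq(
      Row(0.0, Map(0 -> 2.0, 2 -> 1.0)),
      Row(1.0, Map(1 -> 4.0)),
      Row(2.0, Map(0 -> 1.0)),
      Row(1.0, Map(1 -> 4.0))),
    columnNames = Map(0 -> "salary", 1 -> "interview", 2 -> "resume"),
    labelNames = Map(0.0 -> "CAREER_ADVICE", 1.0 -> "JOB_POSTING", 2.0 -> "FORUM_POST"))
}
```

In this sketch the rows carry no UUID, so duplicates are detected by content alone. Aligning by name rather than by index is what makes two independently built data sets comparable, since the same word may have received a different column number in each.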
