
Facets Overview Spark

  • This code is intended to be given back to the open-source community; it is hosted here only temporarily for testing.

Google open-sourced the Facets project in 2017 (https://github.com/PAIR-code/facets) to help data scientists better understand their data sets, under the slogan "Better data leads to better models".

The Google "Facets" includes two sub-projects: Facets-overview and Facets-dive.

"Facets Overview "takes input feature data from any number of datasets, analyzes them feature by feature and visualizes the analysis.

The following description is from the Facets GitHub page (https://github.com/PAIR-code/facets):

Overview gives a high-level view of one or more data sets. It produces a visual feature-by-feature statistical analysis, and can also be used to compare statistics across two or more data sets. The tool can process both numeric and string features, including multiple instances of a number or string per feature.
Overview can help uncover issues with datasets, including the following:

* Unexpected feature values
* Missing feature values for a large number of examples
* Training/serving skew
* Training/test/validation set skew
Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets. Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red. Features can be sorted by values of interest such as the number of missing values or the skew between the different datasets.

Facets Overview consists of:

  • Feature Statistics Protocol Buffer
  • Feature Statistics Generation
  • Visualization

The Feature Statistics Generation component has two implementations: Python (used by Jupyter notebooks) and JavaScript (used in the browser).

Overview visualization

The current implementations are written in Python (built on Numpy) or JavaScript, which means statistics generation is limited to a single machine or a single browser.

This project provides an additional implementation of statistics generation: it uses Spark so that the statistics can be computed with distributed computing.

Overview visualization with Spark

Design Considerations

  • We need a Scala Protobuf code generator.
  • The Python implementation mutates its data structures in loops; we rearrange the logic to use immutable containers and collection-style operations.
  • The Numpy implementation assumes all data sits in an array on one machine; we need a way to transform the data without collecting it into a local array, and we need Spark equivalents of Numpy functions such as avg, mean, stddev, and histogram.
  • The Overview implementation also applies to TensorFlow records (tf.Example, tf.SequenceExample), where a feature value can be an array, or an array of arrays. In this Scala + Spark implementation, TensorFlow record support leverages tensorflow/ecosystem/spark-tensorflow-connector to load TFRecords into a Spark DataFrame; from there on the implementation is no different (see the sketch after this list).
  • We use a DataFrame to represent each feature, the equivalent of the feature array used in the Numpy implementation. Efficiency is not the major concern in this first version of the design; the data may be processed in multiple passes.
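
As a hedged illustration (not code from this repository), loading TFRecords into a DataFrame with spark-tensorflow-connector might look like the following; the "tfrecords" format name and the "recordType" option come from the connector, and the input path is a placeholder.

// A minimal sketch, assuming spark-tensorflow-connector is on the classpath.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("facets-overview-spark").getOrCreate()

val tfDF: DataFrame = spark.read
  .format("tfrecords")                  // data source registered by spark-tensorflow-connector
  .option("recordType", "Example")      // or "SequenceExample" for sequence data
  .load("path/to/records.tfrecord")     // placeholder path

// From here the TFRecord data is an ordinary DataFrame and can be wrapped in a
// NamedDataFrame just like CSV-based data.
val tfNamed = NamedDataFrame(name = "tf-train", data = tfDF)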

Data Structures

Based on the feature_statistics Protobuf definitions:

  • Each dataset has a DatasetFeatureStatistics message, which holds dataset-level metadata (the dataset name and the number of examples) together with a Seq[FeatureNameStatistics]. (A sketch of traversing this structure follows the list.)
  • Each FeatureNameStatistics holds the statistics of one feature in the given dataset: the feature name, its data type, and one of NumericStatistics, StringStatistics, or binary statistics.
  • NumericStatistics consists of CommonStatistics plus std_dev, mean, median, min, max, and a histogram.
  • StringStatistics consists of CommonStatistics plus the number of unique values, {value, frequency} pairs, top_values, avg_length, and a rank_histogram.
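
As a hedged illustration of this nesting (assuming the ScalaPB-generated classes for feature_statistics.proto with the usual camelCase field names; this is not code from the project), traversing a DatasetFeatureStatisticsList looks like:

// Minimal sketch: walk the generated proto and print its structure.
def describe(proto: DatasetFeatureStatisticsList): Unit =
  proto.datasets.foreach { ds =>                 // one DatasetFeatureStatistics per dataset
    println(s"dataset ${ds.name}: ${ds.numExamples} examples")
    ds.features.foreach { f =>                   // one FeatureNameStatistics per feature
      println(s"  feature ${f.name} type=${f.`type`}")
    }
  }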

Google's Python implementation mainly uses dictionaries to hold its data structures. In this implementation we define several additional data structures to help organize the data.

  • Each dataset is represented by a NamedDataFrame.
  • Each dataset can be split by feature, so it can also be converted to a DataEntrySet.
  • Each DataEntry represents one feature: its basic information plus the DataFrame holding the feature's values. (The featLens field is only used for TensorFlow records.)
  • Each feature is associated with the FeatureNameStatistics defined above; we use BasicNumStats and BasicStringStats to capture the basic statistics (a Spark-based sketch of filling BasicNumStats follows the case classes below).

case class NamedDataFrame(name:String, data: DataFrame)

case class DataEntrySet(name     : String,
                        size     : Long,
                        entries  : Array[DataEntry])

case class DataEntry(featureName : String,
                     `type`      : ProtoDataType,
                     values      : DataFrame,
                     counts      : DataFrame,
                     missing     : Long,
                     featLens    : Option[DataFrame] = None)

case class BasicNumStats(name: String,
                         numCount  : Long = 0L,
                         numNan    : Long = 0L,
                         numZeros  : Long = 0L,
                         numPosinf : Long = 0,
                         numNeginf : Long = 0,
                         stddev    : Double = 0.0,
                         mean      : Double = 0.0,
                         min       : Double = 0.0,
                         median    : Double = 0.0,
                         max       : Double = 0.0,
                         histogram : (Array[Double], Array[Long]) )

case class BasicStringStats(name     : String,
                            numCount : Long = 0,
                            numNan   : Long = 0L)
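
To make the BasicNumStats fields concrete, here is a hedged sketch (not the project's actual generator code) of filling them for a single numeric feature with standard Spark SQL aggregations and RDD.histogram, which stand in for the Numpy calls of the Python implementation:

// Minimal sketch: compute BasicNumStats for one numeric feature.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

def basicNumStats(df: DataFrame, feature: String): BasicNumStats = {
  val c       = col(feature).cast("double")
  val doubles = df.select(c.as("v")).na.drop().cache()
  val values  = doubles.rdd.map(_.getDouble(0))

  // count / zeros / stddev / mean / min / max in a single aggregation pass
  val Row(cnt: Long, zeros: Long, sd: Double, mu: Double, mn: Double, mx: Double) =
    df.agg(count(c), count(when(c === 0.0, true)),
           stddev_pop(c), avg(c), min(c), max(c)).first()

  BasicNumStats(
    name      = feature,
    numCount  = cnt,
    numNan    = values.filter(_.isNaN).count(),
    numZeros  = zeros,
    stddev    = sd,
    mean      = mu,
    min       = mn,
    median    = doubles.stat.approxQuantile("v", Array(0.5), 0.001).head,
    max       = mx,
    // RDD.histogram returns (bucket edges, counts), matching the case class field
    histogram = values.filter(d => !d.isNaN && !d.isInfinite).histogram(10)
  )
}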

 

Main class


class FeatureStatsGenerator(datasetProto: DatasetFeatureStatisticsList) {

  import FeatureStatsGenerator._

  /**
    * Creates a feature statistics proto from a set of Spark DataFrames.
    *
    * @param dataFrames     A list of NamedDataFrames, one per dataset; each carries the
    *                       DataFrame with the data and a name identifying the dataset in the proto.
    * @param features       An optional set of feature names; when non-empty, statistics are
    *                       generated only for these features.
    * @param catHistgmLevel Controls the maximum number of levels to display in histograms
    *                       for categorical features. Useful to prevent code/ID features
    *                       from bloating the stats object. Defaults to None.
    * @return The feature statistics proto for the provided datasets.
    */
  def protoFromDataFrames(dataFrames     : List[NamedDataFrame],
                          features       : Set[String] = Set.empty[String],
                          catHistgmLevel : Option[Int] = None): DatasetFeatureStatisticsList = {
    genDatasetFeatureStats(toDataEntries(dataFrames), features, catHistgmLevel)
  }

  // ... (statistics generation helpers elided)
}
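
For example, with the generator and the NamedDataFrames from the usage sample below, the optional parameters can be supplied explicitly (the feature names and the level cap here are placeholders, not values used by the project):

// Hedged usage of the optional parameters shown above.
val proto = generator.protoFromDataFrames(
  dataframes,
  features       = Set("Age", "Workclass"),    // placeholder feature subset
  catHistgmLevel = Some(10)                    // placeholder cap on categorical histogram levels
)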


 

FeatureStatsGenerator

Usage Sample


 
  test("integration") {
    val features = Array("Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status",
                         "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
                         "Hours per week", "Country", "Target")

    val trainData: DataFrame = loadCSVFile("src/test/resources/data/adult.data.csv")
    val testData = loadCSVFile("src/test/resources/data/adult.test.txt")

    val train = trainData.toDF(features: _*)
    val test = testData.toDF(features: _*)

    val dataframes = List(NamedDataFrame(name = "train", train),
                          NamedDataFrame(name = "test", test))

    val proto = generator.protoFromDataFrames(dataframes)
    persistProto(proto, base64Encode = false, new File("src/test/resources/data/stats.pb"))
    persistProto(proto, base64Encode = true, new File("src/test/resources/data/stats.txt"))
  }

  
  private def loadCSVFile(filePath: String) : DataFrame = {
    val spark = sqlContext.sparkSession
    spark.read
      .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
      .option("header", "false")        // the CSV files carry no header row; column names are set via toDF above
      .option("mode", "DROPMALFORMED")  // drop rows that fail to parse
      .option("inferSchema", "true")    // infer column types from the data
      .load(filePath)
  }


  def writeToFile(fileName:String, content:String): Unit = {
    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}
    Files.write(Paths.get(fileName), content.getBytes(StandardCharsets.UTF_8))
  }

  private def toJson(proto: DatasetFeatureStatisticsList) : String = {
    import scalapb.json4s.JsonFormat
    JsonFormat.toJsonString(proto)
  }
  
  
  private[spark] def persistProto(proto: DatasetFeatureStatisticsList, base64Encode: Boolean, file: File): Unit = {
    import java.nio.file.{Files, Paths}

    if (base64Encode) {
      import java.util.Base64
      // Base64 output is plain ASCII, so the encoded bytes can be written directly
      val b = Base64.getEncoder.encode(proto.toByteArray)
      Files.write(Paths.get(file.getPath), b)
    } else {
      Files.write(Paths.get(file.getPath), proto.toByteArray)
    }
  }
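
The persisted statistics can be read back with the ScalaPB-generated companion object; a hedged round-trip check (parseFrom and Base64 decoding are standard APIs, the file names mirror the test above):

// Hedged round-trip check: decode both persisted forms and compare.
import java.nio.file.{Files, Paths}
import java.util.Base64

val raw = DatasetFeatureStatisticsList.parseFrom(
  Files.readAllBytes(Paths.get("src/test/resources/data/stats.pb")))

val fromBase64 = DatasetFeatureStatisticsList.parseFrom(
  Base64.getDecoder.decode(Files.readAllBytes(Paths.get("src/test/resources/data/stats.txt"))))

assert(raw == fromBase64)   // both encodings decode to the same statistics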

