When you read a DataFrame with Spark and then apply filter conditions on it, Spark first lists all the files present in the base path and only then starts applying the filters.

With big data, that listing can cover a huge number of files. So why not hand Spark only the relevant paths, the ones we are certain contain the data we need? That way Spark can spend its distributed computing power on processing the data in parallel instead of on optimizing the read itself.
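As a rough illustration in plain Spark (the base path, partition layout, and column names below are hypothetical), handing the reader only the matching partition directory avoids listing every file under the base path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("path-pruning-sketch").getOrCreate()

// Reading the whole base path: Spark lists every file under base_path
// before the partition filter can prune anything.
val filteredAfterFullListing = spark.read
  .option("basePath", "base_path")
  .orc("base_path")
  .filter(col("partn_col1") === "val1")

// Reading only the relevant partition directory: the other partitions
// are never listed, so the read itself stays cheap.
val filteredByPath = spark.read
  .option("basePath", "base_path")
  .orc("base_path/partn_col1=val1")
```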
- A simple utility built to help optimize Spark reads.
- It is particularly helpful when filter clauses are applied on partition columns.
Say, for example, you have a DataFrame to which you're applying filters:
```scala
val dfAfterApplyingFilters = df
  .filter(col("partn_col1") === "val1"
    && col("partn_col2") =!= 123.5
    && col("non_partn_col").isin(234.6, 345.7))
```
This can be optimized as follows:
```scala
val basePath: String = "base_path" // the base path from which you want to read

val predicates: Map[Column, List[Predicate]] = Map(
  Column("partn_col1") -> List(equal("val1")),
  Column("partn_col2") -> List(notEqual(123.5)))

// Keep only the partition paths under basePath that satisfy the predicates.
val filteredPaths: List[Path] = FilterPathsBasedOnPredicate.filter(spark, List(basePath), predicates)

// Read just the filtered paths instead of the whole base path.
val dataWithFilteredPaths = ORCReader.read(spark,
  Map("basePath" -> basePath),
  filteredPaths.map(_.toString): _*)

// Only the non-partition filter is left to apply on the data itself.
val dfAfterApplyingFilters = dataWithFilteredPaths
  .filter(col("non_partn_col").isin(234.6, 345.7))
```
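To make the effect concrete, here is a hypothetical hive-style layout under `base_path` (the directory names are invented for illustration) and what the path filtering above would leave behind:

```scala
// Hypothetical layout (illustration only):
//   base_path/partn_col1=val1/partn_col2=100.0/part-00000.orc
//   base_path/partn_col1=val1/partn_col2=123.5/part-00001.orc
//   base_path/partn_col1=val2/partn_col2=100.0/part-00002.orc
//
// With partn_col1 === "val1" and partn_col2 =!= 123.5, roughly only the first
// partition directory survives, so the other two are never listed or read.
filteredPaths.map(_.toString).foreach(println)
```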
More examples can be found in How to use FilterPathsBasedOnPredicate.
Supported predicates:
- For Numeric values (Int, Long, Float, Double only): `<`, `<=`, `>`, `>=`, `between`, `equal`, `notEqual`, `in`
- For String values: `equal`, `notEqual`, `in`
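For instance, a predicate map mixing string and numeric conditions might look like the sketch below. Only `equal` and `notEqual` appear in the example above, so the call shapes used here for `in` and `between` are assumptions rather than confirmed API:

```scala
// Sketch only: equal/notEqual are taken from the example above; the in(...) and
// between(...) shapes below are assumed, not confirmed API.
val rangeAndSetPredicates: Map[Column, List[Predicate]] = Map(
  Column("partn_col1") -> List(in("val1", "val2")),    // assumed varargs shape
  Column("partn_col2") -> List(between(100.0, 200.0))) // assumed (lower, upper) shape
```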
Apart from the above utility, there is a read utility as well. As of now, it supports:
- ORCReader
- ParquetReader
- JSONReader
- CSVReader
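As a sketch, assuming the other readers mirror the `ORCReader.read` signature shown earlier (an assumption, not confirmed here), reading Parquet data over the filtered paths would look like:

```scala
// Assumption: ParquetReader.read mirrors the ORCReader.read signature used above.
val parquetData = ParquetReader.read(spark,
  Map("basePath" -> basePath),
  filteredPaths.map(_.toString): _*)
```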
"io.github.saurabh975" %% "layers" % "1.0.0"