This repository contains various code snippets that I have tried out using Spark (Apache Spark 2.2.1, Scala 2.11).

  1. SparkJsonReadnFormat
  2. WordCountAgain
  3. Map_and_MapValues
  4. MoreWindowFunctions
  5. BroadcastJoin_SortMergeJoin (see DAG_Broadcast_SortMerge.png)

  1. SparkJsonReadnFormat

  1. Read Json File
  2. Data Frame Aggregate Functions
  3. Mapping RDD to case class
  4. Reformatting the Json structure
  5. Write output as JSON files

Input File Structure

 |-- a_id: long (nullable = true)
 |-- b_sum: double (nullable = true)
 |-- m_cd: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- td_cnt: array (nullable = true)
 |    |-- element: double (containsNull = true)

Output File Structure

 |-- a_id: long (nullable = false)
 |-- b_sum: double (nullable = false)
 |-- td_cnt: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: double (valueContainsNull = false)
 |-- tdcnt_sum: double (nullable = false)
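
The two schemas above suggest that each `m_cd` entry is paired with the `td_cnt` value at the same position, and that `tdcnt_sum` is the total of `td_cnt`. The following is a minimal sketch under that assumption, not the repository's actual code. The case class names, the object name, and the file paths are hypothetical, it uses a typed Dataset rather than an RDD, and it assumes the input contains no null fields.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case classes mirroring the input and output schemas.
case class InputRec(a_id: Long, b_sum: Double, m_cd: Seq[String], td_cnt: Seq[Double])
case class OutputRec(a_id: Long, b_sum: Double, td_cnt: Seq[Map[String, Double]], tdcnt_sum: Double)

object JsonReformatSketch {
  // Pair each code with the count at the same index and total the counts.
  // If the two arrays differ in length, zip silently drops the extra entries.
  def reformat(in: InputRec): OutputRec =
    OutputRec(
      in.a_id,
      in.b_sum,
      in.m_cd.zip(in.td_cnt).map { case (code, cnt) => Map(code -> cnt) },
      in.td_cnt.sum)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonReformatSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // "input.json" and "output" are placeholder paths.
    val input = spark.read.json("input.json").as[InputRec]
    val output = input.map(reformat)

    output.printSchema()
    output.write.mode("overwrite").json("output")
    spark.stop()
  }
}
```

Keeping `reformat` as a plain function means the reshaping logic can be checked without starting a Spark session.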

Databricks URL for the notebook:

  2. WordCountAgain

WordCount with Filter and Sorting
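
A word count with a filter and a sort could look roughly like the sketch below. The README does not say what is filtered or how the result is sorted, so the minimum word length, the minimum count, the descending sort by count, and the path `input.txt` are all assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Lower-case the line, split on anything that is not a-z, drop short tokens.
  def tokens(line: String, minLen: Int): Seq[String] =
    line.toLowerCase.split("[^a-z]+").filter(_.length >= minLen).toSeq

  // Count words, keep those seen at least minCount times, most frequent first.
  def countWords(lines: RDD[String], minLen: Int, minCount: Int): RDD[(String, Int)] =
    lines
      .flatMap(line => tokens(line, minLen))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .filter(_._2 >= minCount)
      .sortBy(_._2, ascending = false)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .master("local[*]")
      .getOrCreate()

    // "input.txt" is a placeholder path.
    val counts = countWords(spark.sparkContext.textFile("input.txt"), minLen = 3, minCount = 2)
    counts.take(20).foreach(println)
    spark.stop()
  }
}
```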

  3. Map_and_MapValues

Uses the map, group by, and mapValues functions to count the occurrences of each element in a list.

Input File

inf1    inf1, inf2, inf3,inf1, inf2, inf3
inf2    inf1, inf2, inf3,inf1, inf2, inf3
inf3    inf3, inf1, inf4
inf4    inf1, inf2, inf3,inf1, inf2, inf3,inf3, inf1, inf4
inf5    inf3, inf1, inf4

Output Data

(inf1,ArrayBuffer(( inf3,2), (inf1,2), ( inf2,2)))
(inf2,ArrayBuffer(( inf3,2), (inf1,2), ( inf2,2)))
(inf3,ArrayBuffer(( inf1,1), (inf3,1), ( inf4,1)))
(inf4,ArrayBuffer(( inf3,2), ( inf2,2), ( inf1,1), (inf3,1), (inf1,2), ( inf4,1)))
(inf5,ArrayBuffer(( inf1,1), (inf3,1), ( inf4,1)))
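
In the output above, ` inf3` (with a leading space) and `inf3` are counted as separate elements, which suggests the values were split on commas without trimming. The sketch below shows one way to combine `groupBy` and `mapValues` for the same kind of count; it is an illustration rather than the repository's code, the object and method names are hypothetical, and unlike the output above it trims each element, so its counts for a row such as `inf4` differ.

```scala
import org.apache.spark.sql.SparkSession

object MapValuesSketch {
  // Count how often each comma-separated element occurs, ignoring surrounding spaces.
  def countElements(values: String): Seq[(String, Int)] =
    values.split(",").map(_.trim).filter(_.nonEmpty)
      .groupBy(identity)      // element -> all of its occurrences
      .mapValues(_.length)    // element -> number of occurrences
      .toSeq
      .sortBy(_._1)           // stable order for display

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MapValuesSketch")
      .master("local[*]")
      .getOrCreate()

    // Two rows taken from the input shown above, as (key, values) pairs.
    val rows = spark.sparkContext.parallelize(Seq(
      "inf1" -> "inf1, inf2, inf3,inf1, inf2, inf3",
      "inf3" -> "inf3, inf1, inf4"))

    // mapValues on a pair RDD transforms the value and leaves the key untouched.
    rows.mapValues(countElements).collect().foreach(println)
    spark.stop()
  }
}
```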

  4. MoreWindowFunctions

This sample exercises further functions such as Window.partitionBy, Window.orderBy, sum over a partition, rank, count, filter, where, and ordering.
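
As a rough illustration of how these functions fit together, the sketch below computes a per-partition sum, a per-partition count, and a rank, then filters on the rank. The data, column names, and object name are invented for the example and are not taken from the repository.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, rank, sum}

object WindowSketch {
  // Keep the n highest-amount rows per region, with region-level aggregates attached.
  def topPerRegion(sales: DataFrame, n: Int): DataFrame = {
    val byRegion = Window.partitionBy("region")
    val byAmount = byRegion.orderBy(col("amount").desc)

    sales
      .withColumn("region_total", sum("amount").over(byRegion))  // sum over the partition
      .withColumn("region_rows", count("item").over(byRegion))   // row count per partition
      .withColumn("rank_in_region", rank().over(byAmount))
      .where(col("rank_in_region") <= n)
      .orderBy(col("region"), col("rank_in_region"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val sales = Seq(
      ("east", "a", 10.0), ("east", "b", 30.0),
      ("west", "c", 20.0), ("west", "d", 5.0), ("west", "e", 15.0)
    ).toDF("region", "item", "amount")

    topPerRegion(sales, 2).show()
    spark.stop()
  }
}
```

Because the aggregates are computed before the `where`, `region_total` still reflects every row in the region, including rows that the rank filter later removes.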

  5. BroadcastJoin_SortMergeJoin

Sample showing how a join behaves when the tables are bucketed and how a broadcast join works.
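
One way to compare the two join strategies is to inspect the physical plan. The sketch below covers only the broadcast versus sort-merge part, not bucketing. It disables automatic broadcasting so that the unhinted join is expected to plan as a sort-merge join, then applies the `broadcast` hint to the small side. The table contents, column names, and object name are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object JoinSketch {
  // Physical plan as text, so the chosen join operator can be inspected.
  def plan(df: DataFrame): String = df.queryExecution.executedPlan.toString

  // Hypothetical data: many orders, few customers.
  def orders(spark: SparkSession): DataFrame =
    spark.range(0L, 100000L).selectExpr("id AS order_id", "id % 100 AS cust_id")

  def customers(spark: SparkSession): DataFrame =
    spark.range(0L, 100L).selectExpr("id AS cust_id", "concat('c', id) AS name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinSketch")
      .master("local[*]")
      .getOrCreate()

    // Turn off size-based automatic broadcasting so the hint is the only trigger.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // No hint: the plan is expected to contain SortMergeJoin.
    orders(spark).join(customers(spark), "cust_id").explain()

    // Broadcast hint on the small side: the plan is expected to contain BroadcastHashJoin.
    orders(spark).join(broadcast(customers(spark)), "cust_id").explain()

    spark.stop()
  }
}
```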

