# Spark Learning Note - Basic Structured Operations

Jia Geng | gjia0214@gmail.com

**NOTE:** 
- Some functions in Spark 2.4 are not well supported by Java 11; use Java 8!

<a id='directory'></a>

## Directory 

- [Data Source](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/flight-data/csv)
- [0. Spark Session](#sec0)
- [1. Read File into DataFrame](#sec1)
- [2. Load File with Manual Schema](#sec2)
- [3. Column, Row, and Create DataFrame from Scratch](#sec3)
- [4. Select & SelectExpr](#sec4)
- [5. Column Manipulation](#sec5)
- [6. Row Manipulation](#sec6)
- [7. Partitions and Collect](#sec7)


## DataFrame vs Dataset

- DataFrames are untyped at compile time: column types are only checked at runtime
- Datasets are typed: their types are checked at compile time
- Usually it is enough to work with DataFrames. When strict compile-time checking is needed, Datasets are preferred.
- Because Python is a dynamically typed language, PySpark does not support Datasets (at least in Spark 2)

## Structured API Execution Steps

*NOTE: DataFrames are immutable! Every transformation returns a new DataFrame!*

- Write DataFrame/Dataset/SQL code
- Spark converts the code to a logical plan (analyzes and **optimizes** the logical plan)
- Spark transforms the logical plan into a physical plan (**optimizes** the physical plan, i.e. how to execute it on the cluster)
- Spark executes the physical plan on the cluster

In [1]:
# check java version 
# use sudo update-alternatives --config java to switch java version if needed.
!java -version

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~19.10-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


In [2]:
from pyspark.sql import SparkSession
data_example_path = '/home/jgeng/Documents/Git/SparkLearning/data/2015flight.csv'

### 0. Spark Session <a id='sec0'></a>
The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder:


[back to top](#directory)

In [3]:
# build a spark session locally
spark = SparkSession.builder.appName('Spark Example').getOrCreate()

# display the session summary
# (to set the number of local worker threads, add .master('local[n]') to the builder)
spark

### 1. Read File into DataFrame  <a id='sec1'></a>

Spark session can read files of different formats.
- Use `.format()` to specify the file format
- `.option()` provides many configurations for reading the data, such as whether to read the header
- The DataFrame object has a `printSchema()` method for inspecting the resulting schema

Data types on read:
- a `json` file carries information about the type of the data (but not its precision). 
- data in a `csv` file is read as string unless specified otherwise
- **the** `option('inferSchema', True)` **enables schema inference when reading the csv file**

For faster access to the data, use `df.cache()` to keep the data in memory.

When to use caching: it is recommended in the following situations:

- RDD re-use in iterative machine learning applications
- RDD re-use in standalone Spark applications
- When RDD computation is expensive, caching can help reduce the cost of recovery if an executor fails

**`df.cache()` is a lazy operation: it does not cache the data until an action uses it!**

[*back to top*](#directory)

In [4]:
# read from csv file
df = spark.read.format('csv').load(data_example_path)
df.count()

257

In [5]:
# very handy method
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)



In [6]:
# when the header option is not set, spark treats every line (including the header line) as a data record and generates column names
df.show(3)
df.printSchema() # count is in StringType!

+-----------------+-------------------+-----+
|              _c0|                _c1|  _c2|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 3 rows

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)



In [7]:
# use option to read header
df = spark.read.format('csv').option('header', True).option('inferSchema', True).load(data_example_path)
df.show(3)
df.printSchema()  # now count is in IntegerType!

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+
only showing top 3 rows

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [15]:
# cache the dataframe
df.cache()  # cache it in memory if the table will be accessed frequently
df.createOrReplaceTempView('dfTable')  # this is for running the sql code on it

### 2. Load File with Manual Schema <a id='sec2'></a>

When reading data from a file without specifying a schema, Spark creates a default schema on read. **This can cause precision issues if the inferred type does not match the precision of the data in the file. In production, it is usually recommended to set up the schema manually.**

`pyspark.sql.types` contains all supported data types. 

For a list of all: https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/ch04.html

A Spark DataFrame schema is defined by a StructType that holds a list of StructFields.

StructField:
- name: name of the field/column
- type: data type
- nullable: whether null values are allowed
- metadata: a way of storing information about this column (used, for example, in machine learning)

[back to top](#directory)

In [10]:
# df has a schema attribute
print(df.schema)

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,IntegerType,true)))


In [11]:
# specify a schema manually - e.g. what if count is long rather than integer
from pyspark.sql.types import StructType, StructField, StringType, LongType

# field name, data type, nullable
field_1 = StructField('DEST_COUNTRY_NAME', StringType(), True)
field_2 = StructField('ORIGIN_COUNTRY_NAME', StringType(), True)
field_3 = StructField('count', LongType(), False)

manualSchema = StructType([field_1, field_2, field_3])

In [12]:
df = spark.read.format('csv').option('header', True).schema(manualSchema).load(data_example_path)
df.show(3) 

# now count is long! 
# but nullable is still true, even though the manual schema says False
# this is because the CSV format has no way to declare data constraints, 
# so the reader cannot assume that the input contains no nulls and keeps the column nullable.
df.printSchema()  

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+
only showing top 3 rows

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



### 3. Column, Row, and Create DataFrame from Scratch  <a id='sec3'></a>

`col` creates a column reference that can be used in expressions, for example inside the `select()` method.

A `Row` is a single data record.

[back to top](#directory)

In [13]:
from pyspark.sql.functions import col

# create an unattached column
new_col = col('new_col')
new_col

Column<b'new_col'>

In [14]:
# get a column from df
df['count']

Column<b'count'>

In [15]:
# df.columns is the list of all column names; here we take the first one
df.columns[0]

'DEST_COUNTRY_NAME'

In [16]:
from pyspark.sql import Row

# create a new data record
newRow = Row('Hi', None, 2, True)

# access a value from the data record
newRow[0]  # values are accessed by position and keep their Python type

'Hi'

In [17]:
# create a DataFrame from Scratch
# field name, data type, nullable
field_1 = StructField('DEST_COUNTRY_NAME', StringType(), True)
field_2 = StructField('ORIGIN_COUNTRY_NAME', StringType(), True)
field_3 = StructField('count', LongType(), False)

manualSchema = StructType([field_1, field_2, field_3])

newRow = Row('NYC', 'MIA', 2)

newDF = spark.createDataFrame([newRow], manualSchema)
newDF.show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|              NYC|                MIA|    2|
+-----------------+-------------------+-----+



### 4. Select & SelectExpr <a id='sec4'></a>
 
Flexible expressions on column data.
- use `expr()` in methods such as `select()` and `withColumn()`
- use `selectExpr()` as a shorthand for `select(expr(..), expr(..), ..)`

[back to top](#directory)

In [18]:
df.show(4)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
+-----------------+-------------------+-----+
only showing top 4 rows



In [19]:
# by default the spark session is case insensitive
spark.sql('set spark.sql.caseSensitive=false')
df.select('deST_COUNTRY_NAME').show(4)  # case insensitive!

# change the current session to be case sensitive
spark.sql('set spark.sql.caseSensitive=true')
df.select('DEST_COUNTRY_NAME').show(4)

# change it back
spark.sql('set spark.sql.caseSensitive=false')

+-----------------+
|deST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
|    United States|
|            Egypt|
+-----------------+
only showing top 4 rows

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
|    United States|
|            Egypt|
+-----------------+
only showing top 4 rows



DataFrame[key: string, value: string]

In [20]:
from pyspark.sql.functions import expr

# use AS to rename the column name
# the returned new DataFrame will have a new column name
newdf = df.select(expr("DEST_COUNTRY_NAME as destination"), 
          expr('ORIGIN_COUNTRY_NAME as departure'))
newdf.show(4)

+-------------+-------------+
|  destination|    departure|
+-------------+-------------+
|United States|      Romania|
|United States|      Croatia|
|United States|      Ireland|
|        Egypt|United States|
+-------------+-------------+
only showing top 4 rows



In [21]:
# use * to select all columns
# expr can also combine columns to create a new column
newdf.select('*', expr('destination=departure as invalid')).show(4)

+-------------+-------------+-------+
|  destination|    departure|invalid|
+-------------+-------------+-------+
|United States|      Romania|  false|
|United States|      Croatia|  false|
|United States|      Ireland|  false|
|        Egypt|United States|  false|
+-------------+-------------+-------+
only showing top 4 rows



In [22]:
# use selectExpr to do it all in one line
# it takes expressions as input, and it can also take a literal as input
df.selectExpr('*', 'DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME as invalid', '1 as one', 'True as bool').show(4)

+-----------------+-------------------+-----+-------+---+----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|invalid|one|bool|
+-----------------+-------------------+-----+-------+---+----+
|    United States|            Romania|   15|  false|  1|true|
|    United States|            Croatia|    1|  false|  1|true|
|    United States|            Ireland|  344|  false|  1|true|
|            Egypt|      United States|   15|  false|  1|true|
+-----------------+-------------------+-----+-------+---+----+
only showing top 4 rows



### 5. Column Manipulation <a id='sec5'></a>
For manipulating columns
- use `df.withColumn()` to add a column or change a column's type.
    - this method takes two params: `column name`, `expression`
    - if the column name does not exist, it will append a new column
    - if the column name exists, it will replace the column with the result of the new expression
- use `df.withColumnRenamed()` to rename a column
- column names should avoid reserved characters and keywords such as `as`. If needed, escape the name with backticks: \`...\`.
- use `df.drop()` to drop a column

Spark session is by default case insensitive.
- use `spark_session.sql('set spark.sql.caseSensitive=true')` to change to case sensitive

Add an id column:
- `monotonically_increasing_id()` returns a column of ids that are unique and monotonically increasing, but not necessarily consecutive

[back to top](#directory)

In [23]:
from pyspark.sql.functions import lit

# a more formal way to add a column is to use withColumn
# the second argument must be a Column, so wrap the literal 1 in lit(); a bare 1 raises an error
df.withColumn('one', lit(1)).show(2)  # this is NOT in place: Spark never modifies a DataFrame in place!

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|one|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows



In [24]:
# withColumn also supports expr
df.withColumn('invalid', expr('DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME')).show(2)

+-----------------+-------------------+-----+-------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|invalid|
+-----------------+-------------------+-----+-------+
|    United States|            Romania|   15|  false|
|    United States|            Croatia|    1|  false|
+-----------------+-------------------+-----+-------+
only showing top 2 rows



In [25]:
# when the column name exists, it will replace the column
df.withColumn('DEST_COUNTRY_NAME', expr('DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME')).show(2)

# withColumn can also be used to copy a column under a new name
df.withColumn('destination', expr('DEST_COUNTRY_NAME')).show(2)

# withColumn can also be used to change the type of a column
# use col('count') to reference the column and cast() to cast it to a new type
df.withColumn('count', col('count').cast('long')).printSchema()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|            false|            Romania|   15|
|            false|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|  destination|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|United States|
|    United States|            Croatia|    1|United States|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [26]:
# a more straightforward way to rename a column is withColumnRenamed
df.withColumnRenamed('DEST_COUNTRY_NAME', 'dest').show(2)

+-------------+-------------------+-----+
|         dest|ORIGIN_COUNTRY_NAME|count|
+-------------+-------------------+-----+
|United States|            Romania|   15|
|United States|            Croatia|    1|
+-------------+-------------------+-----+
only showing top 2 rows



In [27]:
# to drop a column
df.drop('count').show(2)

# drop multiple columns at a time
df.drop('count', 'DEST_COUNTRY_NAME').show(2)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows

+-------------------+
|ORIGIN_COUNTRY_NAME|
+-------------------+
|            Romania|
|            Croatia|
+-------------------+
only showing top 2 rows



### 6. Row Manipulation <a id='sec6'></a>

Basic manipulations include:
- Filtering: keep the rows for which a true/false expression is true: `df.filter()` or `df.where()`
    - note that `df.select(condition)` returns a table with a single column of true/false and does not filter.
- Get Unique: `df.distinct()`
- Random Samples: `df.sample(withReplacement=, fraction=, seed=)`
- Random Splits: `df.randomSplit(weights=, seed=)`
- Concat and Append: `df.union(other_df)`
- Sort: `df.orderBy(expr or col or col_name)`
- Limit: `df.limit(n)`

[back to top](#directory)

In [28]:
df = spark.read.format('csv').option('header', True).option('inferSchema', True).load(data_example_path)
df.show(3)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+
only showing top 3 rows



In [32]:
# this creates a new table with a bool column indicating whether count > 100
# it does not filter any rows!
df.selectExpr('count>100').show(3)

+-------------+
|(count > 100)|
+-------------+
|        false|
|        false|
|         true|
+-------------+
only showing top 3 rows



In [35]:
# the four statements below are equivalent
df.where('count>100 and count<200').show(2)
df.where('count>100').where('count<200').show(2)
df.filter('count>100').filter('count<200').show(2)
df.filter('count>100 and count<200').show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|             Russia|  161|
|          Iceland|      United States|  181|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|             Russia|  161|
|          Iceland|      United States|  181|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|             Russia|  161|
|          Iceland|      United States|  181|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+----

In [36]:
# unique
df.where('count>200').select('dest_country_name').distinct().show(3)

# count the number of unique destinations that have count > 200
df.where('count>200').select('dest_country_name').distinct().count() 

+--------------------+
|   dest_country_name|
+--------------------+
|             Germany|
|Turks and Caicos ...|
|              France|
+--------------------+
only showing top 3 rows



39

In [37]:
# random sample 
withReplacement, seed, fraction = True, 5, 0.7
print(df.count())
print(df.sample(withReplacement, fraction, seed).count())

256
174


In [38]:
# random split
seed = 5
par = [0.7, 0.3]  # if the weights do not add up to 1, they are normalized so that they do!
subdfs = df.randomSplit(par, seed)  # it returns a list of dfs!
print(subdfs[0].count())
print(subdfs[1].count())

172
84


In [41]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, IntegerType
# Add new rows into df
# usually, you may want to register each newly created df as a temp view
# so you can access it more dynamically. 

# prepare the new rows
rows = [Row('NYC', 'EWR', 4),
        Row('MIA', 'EWR', 201),
        Row('HNL', 'MIA', 2)]

# parallelize the rows
# works on a list of Row objects
parallelizedRows = spark.sparkContext.parallelize(rows)

# create a schema for it
field_1 = StructField('DEST_COUNTRY_NAME', StringType(), True)
field_2 = StructField('ORIGIN_COUNTRY_NAME', StringType(), True)
field_3 = StructField('count', IntegerType(), False)

schema = StructType([field_1, field_2, field_3])

# create new DF
newDF = spark.createDataFrame(parallelizedRows, schema)

# append
newDF = df.union(newDF)
newDF.where(col('DEST_COUNTRY_NAME') == 'NYC').show()  # show only the appended NYC row

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|              NYC|                EWR|    4|
+-----------------+-------------------+-----+



In [42]:
# sort - by default is ascending
df.sort('count').show(2)  
df.orderBy('count').show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



### 7. Partitions and Collect <a id='sec7'></a>

Partitioning can improve the performance of operations when the dataset is large. You can call repartition on the whole df, or by one column of the df if that column is frequently filtered or operated on.

- Repartition and Coalesce 
    - `df.repartition(n)`: use when n is larger than the current number of partitions
    - `df.repartition(n, col(col_name))`: use to repartition by a specific column
    - `df.rdd.getNumPartitions()` to check the current number of partitions
    - `df.coalesce(n)` to combine partitions
    - **`repartition` incurs a full shuffle of the entire DataFrame, which can be slow! `coalesce` merges existing partitions and avoids a full shuffle.**
 
- Collect

For performance, it is recommended to sort within each partition before another set of operations
- `df.sortWithinPartitions(expr or col or col_name)`


Sometimes you might want to pull the entire dataset from the cluster (executors) to the driver (master). 
Functions such as **`show()` pull a small portion of the data (20 rows by default, or as many as the `show()` parameter specifies) from the executors to the driver memory** so that you can print it. 

Three functions that transfer data to the driver memory:
- `df.show(n)` takes n rows to the driver and prints them
- `df.take(n)` returns the first n rows to the driver as a list of `Row` objects
- `df.collect()` collects the entire df to the driver as a list of `Row` objects

[back to top](#directory)

In [154]:
from pyspark.sql.functions import desc, asc

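# sort - specify order
df.orderBy(col('count').desc()).show(2)
# caution: expr('count desc') is parsed as the column `count` aliased to `desc`,
# so this sorts ascending (see the second output below); prefer desc('count')
df.orderBy(expr('count desc')).show(2)

# sort - multiple
df.orderBy(col('Dest_country_name').asc(), col('count').desc()).show(3)
df.orderBy(col('count').desc(), col('Dest_country_name').asc()).show(3)  # different lexicographic order!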

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
+-----------------+-------------------+------+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Moldova|      United States|    1|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|          Algeria|      United States|    4|
|           Angola|      United States|   15|
|         Anguilla|      United States|   41|
+-----------------+-------------------+-----+
only showing top 3 rows

+-----------------+-------------------+------

In [155]:
# limit(k) returns the first k rows as a new df
df.limit(5).show()  # different from df.show(5), which only prints and returns None!
df.limit(5).count()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+



5

In [156]:
# sort within partitions - for better performance
# this actually sorts the whole df
# because we only have one partition
df.sortWithinPartitions('count').show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
|            Malta|      United States|    1|
|    United States|          Gibraltar|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [157]:
# check number of partitions
df.rdd.getNumPartitions()

1

In [158]:
# repartition on df
df.repartition(5).rdd.getNumPartitions()

5

In [159]:
# repartition on column
df.repartition(col('count')).rdd.getNumPartitions()  

200

In [160]:
# the default number of partitions is 200 (set by spark.sql.shuffle.partitions); it can also be specified!
df.repartition(5, col('count')).rdd.getNumPartitions()

5

In [161]:
# use coalesce to combine the partitions
partitioned = df.repartition(100)
partitioned.coalesce(2).rdd.getNumPartitions()

2

In [162]:
# the three operations below all transfer data to the driver memory

df.take(2)  # brings two rows to the driver as a list of Row objects
df.show(2)  # brings two rows to the driver and prints them
df.limit(2).collect()  # first creates a 2-row DataFrame, then collects it to the driver

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1)]