# Chapter 7. Loading and Saving Your Data

In this chapter, we discuss strategies to organize data such as bucketing and partitioning data for storage, compression schemes, splittable and non-splittable files, and Parquet files.

Both engineers and data scientists will find parts of this chapter useful, as they evaluate what storage format is best suited for downstream consumption for future Spark jobs using the saved data.

# Setup

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext('local', 'Ch7')

# create a SparkSession
spark = (SparkSession
    .builder
    .appName("ch7 example")
    .getOrCreate())

import os
from pyspark.sql.functions import col, desc

data_dir = '/Users/bartev/dev/gitpie/LearningSparkV2/databricks-datasets/learning-spark-v2/'

# Text Files


## Reading text files

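Reading a text file is simple: call `spark.read.text()`, which uses the session's `DataFrameReader` to load the file into a DataFrame in Python (a Dataset in Scala). The result has a single string column named `value`, with one row per line of the file.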



In [3]:
lines_df = (spark.read
               .text('./dirty-data.csv'))

lines_df.show(n=10, truncate=False)

+-----------------------+
|value                  |
+-----------------------+
|name,age,person        |
|jack,10,True           |
|jill,9,True            |
|humpty dumpty,egg,False|
|red,12,True            |
+-----------------------+



In [6]:
(spark
 .read
 .text('derby.log')
 .show(n=5, truncate=False))

+---------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|----------------------------------------------------------------                                                                             |
|Wed Jun 24 16:19:04 MST 2020:                                                                                                                |
|Booting Derby version The Apache Software Foundation - Apache Derby - 10.12.1.1 - (1704137): instance a816c00e-0172-e8a0-afbb-000008ac2aa0   |
|on database directory /Users/bartev/dev/github-bv/san-tan/lrn-spark/metastore_db with class loader sun.misc.Launcher$AppClassLoader@153

In [30]:
lines_df = (spark
 .read
 .text('spark-bartev-log.log')
 )

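# `'Spark' or 'spark'` evaluates to just 'Spark' in Python,
# so match either spelling by combining two conditions with `|`
(lines_df
 .filter(col('value').contains('Spark') | col('value').contains('spark'))
 .show(n=10, truncate=False))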

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Spark Command: /Library/Java/JavaVirtualMachines/applejdk-8.jdk/Contents/Home/bin/java -cp /Users/bartev/dev/spark-2.4.5-bin-hadoop2.7/

## Load an Apache log

In [47]:
(spark
 .read
 .text(os.path.join(data_dir, 'SPARK_README.md'))
 .filter(col('value').contains('Spark'))
 .show(n=5, truncate=False))


+------------------------------------------------------------------------------+
|value                                                                         |
+------------------------------------------------------------------------------+
|# Apache Spark                                                                |
|Spark is a fast and general cluster computing system for Big Data. It provides|
|rich set of higher-level tools including Spark SQL for SQL and DataFrames,    |
|and Spark Streaming for stream processing.                                    |
|You can find the latest Spark documentation, including a programming          |
+------------------------------------------------------------------------------+
only showing top 5 rows



In [38]:
lines_df = (spark
           .read
           .text(os.path.join(data_dir, 'web1_access_log_20190715-064001.log')))


In [48]:
lines_df.show(n=10, truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                     |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|161.239.128.119 - - [15/Jul/2019:06:44:31 -0700] "DELETE /apps/cart.jsp?appID=3419 HTTP/1.0" 200 4976 "http://baker-kennedy.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/5340 (KHTML, like Gecko) Chrome/13.0.849.0 Safari/

In [49]:
lines_df.first()

Row(value='161.239.128.119 - - [15/Jul/2019:06:44:31 -0700] "DELETE /apps/cart.jsp?appID=3419 HTTP/1.0" 200 4976 "http://baker-kennedy.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/5340 (KHTML, like Gecko) Chrome/13.0.849.0 Safari/5340"')

## How do I parse this log using PySpark?
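
One possible approach, sketched here under assumptions rather than taken from the original notebook, is to treat each line as Apache combined log format and pull the leading fields out with `regexp_extract`. The names `LOG_PATTERN`, `LOG_FIELDS`, `parse_line`, and `parse_access_log` are hypothetical, and the pattern is written against the single sample row shown above, so other log layouts may need adjusting.

```python
import re

# Hypothetical pattern for the leading fields of one access-log line:
# host, identity, user, [timestamp], "method endpoint protocol", status, size
LOG_PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)'
LOG_FIELDS = ['host', 'timestamp', 'method', 'endpoint', 'protocol', 'status', 'size']


def parse_line(line):
    """Check the pattern on one line with plain Python; returns a dict or None."""
    m = re.match(LOG_PATTERN, line)
    return dict(zip(LOG_FIELDS, m.groups())) if m else None


def parse_access_log(lines_df):
    """Return a DataFrame with one string column per name in LOG_FIELDS.

    Expects a DataFrame with a single `value` column, as produced by
    spark.read.text(). Rows that do not match yield empty strings.
    """
    # Imported here so the pattern can be checked with `re` alone, without Spark.
    from pyspark.sql.functions import col, regexp_extract
    return lines_df.select(*[
        regexp_extract(col('value'), LOG_PATTERN, idx).alias(name)
        for idx, name in enumerate(LOG_FIELDS, start=1)
    ])
```

With the `lines_df` loaded above, `parse_access_log(lines_df).show(n=5, truncate=False)` would display the extracted columns. All columns come back as strings, so `status` and `size` would still need a cast before any numeric work.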

## Writing text files

Saving text files is just as simple as reading them. Continuing the README example above, we keep only the lines of `SPARK_README.md` that contain "Spark" and save the result. Note that Spark writes a directory of part files at the given path, not a single file.

In [56]:
lines_readme = (spark
 .read
 .text(os.path.join(data_dir, 'SPARK_README.md'))
)

(lines_readme
 .filter(col('value').contains('Spark'))
 .show(n=30, truncate=False))

+------------------------------------------------------------------------------------+
|value                                                                               |
+------------------------------------------------------------------------------------+
|# Apache Spark                                                                      |
|Spark is a fast and general cluster computing system for Big Data. It provides      |
|rich set of higher-level tools including Spark SQL for SQL and DataFrames,          |
|and Spark Streaming for stream processing.                                          |
|You can find the latest Spark documentation, including a programming                |
|## Building Spark                                                                   |
|Spark is built using [Apache Maven](http://maven.apache.org/).                      |
|To build Spark and its example programs, run:                                       |
|["Building Spark"](http://spark.apache.org

In [52]:
(lines_readme
 .filter(col('value').contains('Spark'))
 .write.text('./storage/SPARK_README.md'))

In [57]:
spark.read.text('./storage/SPARK_README.md/part-*.txt').show(truncate=False)

+------------------------------------------------------------------------------------+
|value                                                                               |
+------------------------------------------------------------------------------------+
|# Apache Spark                                                                      |
|Spark is a fast and general cluster computing system for Big Data. It provides      |
|rich set of higher-level tools including Spark SQL for SQL and DataFrames,          |
|and Spark Streaming for stream processing.                                          |
|You can find the latest Spark documentation, including a programming                |
|## Building Spark                                                                   |
|Spark is built using [Apache Maven](http://maven.apache.org/).                      |
|To build Spark and its example programs, run:                                       |
|["Building Spark"](http://spark.apache.org

## Partitioning

In [60]:
(lines_readme
 # no filter here: write every line, redistributed across 4 partitions
 .repartition(4)
 .write
 .format('parquet')
 .mode('overwrite')
 .save('./storage/SPARK_README.parquet')
)

In [61]:
(spark
 .read
 .parquet('./storage/SPARK_README.parquet/part-*.parquet')
 .show())

+--------------------+
|               value|
+--------------------+
|    build/mvn -Ds...|
|    scala> sc.par...|
|To run one of the...|
|supports general ...|
|Try the following...|
|MLlib for machine...|
|Spark uses the Ha...|
|<http://spark.apa...|
|can also use an a...|
|This README file ...|
|                    |
|                    |
|                    |
|                    |
|                    |
|                    |
|                    |
|                    |
|    ./bin/spark-s...|
|high-level APIs i...|
+--------------------+
only showing top 20 rows



### Partition by a given field or column

In [67]:
names_df = (
    spark
    .read
    .csv('./dirty-data.csv', header=True))

names_df.show()

+-------------+---+------+
|         name|age|person|
+-------------+---+------+
|         jack| 10|  True|
|         jill|  9|  True|
|humpty dumpty|egg| False|
|          red| 12|  True|
|          joe|  9| False|
|          red| 10|  True|
+-------------+---+------+



In [68]:
(names_df
    .write
    .mode('overwrite')
    .format('parquet')
    .partitionBy('name')
    .save('./storage/people.parquet'))

In [69]:
(names_df
    .write
    .mode('overwrite')
    .format('parquet')
    .partitionBy('age')
    .save('./storage/people.parquet'))

In [70]:
(names_df
    .write
    .mode('overwrite')
    .format('parquet')
    .partitionBy('age', 'name')
    .save('./storage/people.parquet'))
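
Each of the three writes above uses `mode('overwrite')` on the same path, so only the layout from the last one (partitioned by `age`, then `name`) remains on disk. `partitionBy` lays the data out as nested `column=value` directories. As a way to inspect that layout, here is a small sketch; `list_partition_dirs` is a hypothetical helper, not part of PySpark, and it assumes the output path is a directory on the local filesystem.

```python
import os


def list_partition_dirs(root):
    """Return sorted relative paths of leaf partition directories under root.

    A leaf partition directory is one whose path components all look like
    `column=value` and that directly contains at least one file.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == '.':
            continue  # skip the root itself (it holds files such as _SUCCESS)
        parts = rel.split(os.sep)
        if filenames and all('=' in part for part in parts):
            found.append('/'.join(parts))
    return sorted(found)
```

After the last write above, `list_partition_dirs('./storage/people.parquet')` should include entries such as `age=9/name=jill`. Because the partition values live in the directory names, Spark restores `age` and `name` as columns when the directory is read back.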

## Bucketing

### Create a managed table

In [73]:
(names_df
 .groupBy('name')
 .count()
 .orderBy(desc('count'))
 .write.format('parquet')
 .bucketBy(3, 'count', 'name')
 .saveAsTable('names_tbl')
)

### Examine 'bucketed' properties of the bucketed table

In [84]:
(spark
 .sql('describe formatted names_tbl')
 .show(truncate=False))

+----------------------------+----------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                   |comment|
+----------------------------+----------------------------------------------------------------------------+-------+
|name                        |string                                                                      |null   |
|count                       |bigint                                                                      |null   |
|                            |                                                                            |       |
|# Detailed Table Information|                                                                            |       |
|Database                    |default                                                                     |       |
|Table                       |names_tbl                                 

In [83]:
(spark
 .sql('describe names_tbl')
 .show(truncate=False))

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name    |string   |null   |
|count   |bigint   |null   |
+--------+---------+-------+



In [80]:
peeps_df = (spark.read
 .format('parquet')
 .load('./storage/people.parquet'))

peeps_df.show()

+------+---+-------------+
|person|age|         name|
+------+---+-------------+
| False|egg|humpty dumpty|
| False|  9|          joe|
|  True|  9|         jill|
|  True| 10|          red|
|  True| 12|          red|
|  True| 10|         jack|
+------+---+-------------+



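The write below fails. The `delta` format is provided by the separate Delta Lake package, which is not on this session's classpath, so Spark cannot find the data source.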

In [82]:
(peeps_df
    .write
    .format('delta')
    .save('./storage/peeps.delta'))

Py4JJavaError: An error occurred while calling o630.save.
: java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:245)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
	... 13 more
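
The `ClassNotFoundException` above indicates that no Delta Lake JAR is available to the session. One way to supply it, sketched here as an assumption rather than something the original notebook ran, is to set `spark.jars.packages` before the session starts. The helper names and default versions below are hypothetical. The coordinates must match the Scala and Spark versions in use (the defaults assume a Scala 2.11 build of Spark 2.4.x), and the setting only takes effect in a fresh JVM, so an already running session has to be stopped, or the kernel restarted, first.

```python
def delta_package(scala_version='2.11', delta_version='0.6.1'):
    """Build Maven coordinates for the Delta Lake core artifact.

    The default versions are assumptions for a Spark 2.4.x / Scala 2.11 setup.
    """
    return 'io.delta:delta-core_{}:{}'.format(scala_version, delta_version)


def build_delta_session(app_name='ch7 example'):
    """Create a SparkSession that asks Spark to fetch the Delta Lake package."""
    # Imported here so delta_package() can be used without Spark installed.
    from pyspark.sql import SparkSession
    return (SparkSession
            .builder
            .appName(app_name)
            .config('spark.jars.packages', delta_package())
            .getOrCreate())
```

With a session created this way, the `format('delta')` write above would be expected to find the data source, provided the package can be downloaded and the versions are compatible.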
