# File Formats

In this notebook, we cover how different file formats affect the performance of your Spark jobs.

Spark Summit 2016: [Why You Should Care about Data Layout in the Filesystem](https://databricks.com/session/why-you-should-care-about-data-layout-in-the-filesystem)

Let's read in a colon-delimited file.

In [3]:
%fs ls /databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt

In [4]:
%fs head --maxBytes=1000 /databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt

In [5]:
csvDF = (spark
         .read
         .csv("/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt", header="true", sep=":"))

Are these data types correct? All of them are string types.

We need to tell Spark to infer the schema.

In [7]:
csvDF = (spark
         .read
         .csv("/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt", header="true", sep=":", inferSchema="true"))

Wow, that took a long time just to figure out the schema for this file! 
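To see why inference is expensive, here is a minimal pure-Python sketch of the idea. It is not Spark's actual implementation; the sample rows and the `infer_type` / `infer_schema` helpers are hypothetical. The point it illustrates is that every value in a column has to be inspected, because a single late row can widen the type.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of int, double, string that fits every value."""
    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


def infer_schema(text, sep=":"):
    """Read ALL rows of a delimited text with a header and infer a type per column."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Hypothetical sample data, not taken from the people dataset.
sample = "id:firstName:salary\n1:Ann:51000.5\n2:Bob:48000\n"
schema = infer_schema(sample)
print(schema)

# One non-numeric value at the very end changes the inferred type of `id`,
# which is why the whole file must be scanned before the schema is known.
widened = infer_schema(sample + "n/a:Cy:1\n")
print(widened)
```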

Now let's try the same thing with compressed files (Gzip and Snappy formats).

Notice that the gzip file is the most compact - we will see if it is the fastest to operate on.

In [9]:
%fs ls /databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt

In [10]:
%fs ls /databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt.gz

In [11]:
%fs ls /databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt.snappy

Read in the Gzip compression format file.

In [13]:
csvDFgz = (spark
           .read
           .csv("/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt.gz", header="true", sep=":", inferSchema="true"))

Although the uncompressed format took up more space than the Gzip format, it was significantly faster to operate on than the Gzip format.

In [15]:
csvDFsnappy = (spark
               .read
               .csv("/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt.snappy", header="true", sep=":", inferSchema="true"))

Wait, I thought Snappy was supposed to be splittable - why was only one slot reading in the file?

Regular CSV files that are compressed with Snappy are not splittable, so a single task has to read the whole file. If you want to work with compressed, non-columnar formats, you should use `bzip2`, which is splittable (Snappy is great for Parquet, which we'll see later).
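Splittability can be illustrated outside Spark with the standard library. The sketch below uses gzip (which is also not splittable) because Snappy is not in the Python standard library; the sample data and the `can_decode_from` helper are hypothetical. It shows that a reader dropped into the middle of a gzip stream cannot decode it, so the work cannot be divided across tasks.

```python
import bz2
import gzip
import zlib

# Hypothetical colon-delimited sample, repeated so the codecs have something to compress.
raw = ("id:firstName:salary\n" + "1:Ann:51000\n2:Bob:48000\n" * 5000).encode("utf-8")

gz = gzip.compress(raw)
bz = bz2.compress(raw)
print("raw:", len(raw), "gzip:", len(gz), "bzip2:", len(bz))


def can_decode_from(data, offset):
    """Try to decompress a gzip stream starting at an arbitrary byte offset."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Decoding from the start works; decoding from the midpoint does not,
# which is the property that forces a single reader per gzip file.
print(can_decode_from(gz, 0))
print(can_decode_from(gz, len(gz) // 2))
```

bzip2 avoids this limitation because its stream is organized into independently decodable blocks, which is what lets Spark assign different parts of one file to different tasks.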

In [17]:
%fs ls /databricks-datasets/learning-spark-v2/people/people-with-header-10m.csv.bzip

Wow! The bzip file actually takes up less space than the snappy or gzip file. Let's read it in.

In [19]:
csvBzip = (spark
           .read
           .csv("/databricks-datasets/learning-spark-v2/people/people-with-header-10m.csv.bzip", header=True, sep=":", inferSchema=True))

Look at how much faster that was! Note how many partitions it has now.

Let's dig into compression schemes and `inferSchema`...

How can we avoid this painful schema inference step?

In [21]:
csvDF.schema.json()

In [22]:
dbutils.fs.put("/tmp/myschema.json", csvDF.schema.json(), True)

from pyspark.sql.types import StructType
import json

schema_json = dbutils.fs.head("/tmp/myschema.json", 1024)
knownSchema = StructType.fromJson(json.loads(schema_json))

In [23]:
csvDFgz = (spark
          .read
          .csv("/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt.gz", 
               header="true", sep=":", schema=knownSchema))

Much better, we loaded it in less than a second!

Now let's compare this CSV file to Parquet.

In [25]:
%fs ls /databricks-datasets/learning-spark-v2/people/people-10m.parquet/

In [26]:
%python
size = [i.size for i in dbutils.fs.ls("/databricks-datasets/learning-spark-v2/people/people-10m.parquet/") if i.name.endswith(".parquet")]
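# Use Python's built-in sum explicitly in case `sum` has been shadowed (e.g. by pyspark.sql.functions.sum).
# On Python 3 the module is named `builtins` (`__builtin__` is Python 2 only).
import builtins
builtins.sum(size)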

In addition to taking up less than half of the space required to store the uncompressed text file, the Parquet file also encodes the column names and their associated data types.

***BONUS*** - Why did we go from 1 CSV file to 8 Parquet files??

In [28]:
parquetDF = spark.read.parquet("/databricks-datasets/learning-spark-v2/people/people-10m.parquet/")

Lastly, it is much faster to operate on Parquet files than CSV files (especially when we are filtering or selecting a subset of columns). 

Look at the difference in times below! `%timeit` is an IPython magic command (not a built-in Python function), so we are going to create temporary views to access the data from the Python cells.

In [30]:
parquetDF.createOrReplaceTempView("parquetDF")
csvDF.createOrReplaceTempView("csvDF")
csvDFgz.createOrReplaceTempView("csvDFgz")

In [31]:
%python
%timeit -n1 -r1 spark.table("parquetDF").select("gender", "salary").where("salary > 10000").count()

If you're running on Databricks, subsequent calls to this Parquet file will be faster due to automatic caching!

In [33]:
%python
%timeit -n1 -r1 spark.table("parquetDF").select("gender", "salary").where("salary > 10000").count()

In [34]:
%python
%timeit -n1 -r1 spark.table("csvDF").select("gender", "salary").where("salary > 10000").count()

In [35]:
%python
%timeit -n1 -r1 spark.table("csvDFgz").select("gender", "salary").where("salary > 10000").count()


## Comparison
| Type    | <span style="white-space:nowrap">Inference Type</span> | <span style="white-space:nowrap">Inference Speed</span> | Reason                                          | <span style="white-space:nowrap">Should Supply Schema?</span> |
|---------|--------------------------------------------------------|---------------------------------------------------------|----------------------------------------------------|:--------------:|
| <b>CSV</b>     | <span style="white-space:nowrap">Full-Data-Read</span> | <span style="white-space:nowrap">Slow</span>            | <span style="white-space:nowrap">File size</span>  | Yes            |
| <b>Parquet</b> | <span style="white-space:nowrap">Metadata-Read</span>  | <span style="white-space:nowrap">Fast/Medium</span>     | <span style="white-space:nowrap">Number of Partitions</span> | No (most cases)             |
| <b>Tables</b>  | <span style="white-space:nowrap">n/a</span>            | <span style="white-space:nowrap">n/a</span>            | <span style="white-space:nowrap">Predefined</span> | n/a            |
| <b>JSON</b>    | <span style="white-space:nowrap">Full-Data-Read</span> | <span style="white-space:nowrap">Slow</span>            | <span style="white-space:nowrap">File size</span>  | Yes            |
| <b>Text</b>    | <span style="white-space:nowrap">Dictated</span>       | <span style="white-space:nowrap">Zero</span>            | <span style="white-space:nowrap">Only 1 Column</span>   | Never          |
| <b>JDBC</b>    | <span style="white-space:nowrap">DB-Read</span>        | <span style="white-space:nowrap">Fast</span>            | <span style="white-space:nowrap">DB Schema</span>  | No             |

## Reading CSV
- `spark.read.csv(..)`
- There are a large number of options when reading CSV files including headers, column separator, escaping, etc.
- We can allow Spark to infer the schema at the cost of first reading in the entire file.
- Large CSV files should always have a schema pre-defined.

## Reading Parquet
- `spark.read.parquet(..)`
- Parquet is the preferred file format for big data.
- It is a columnar file format.
- It is a splittable file format.
- It offers a lot of performance benefits over other formats including predicate pushdown.
- Unlike CSV, the schema is read in, not inferred.
- Reading the schema from Parquet's metadata can be extremely efficient.
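The predicate pushdown benefit mentioned above can be sketched conceptually in plain Python. This is not how Parquet or Spark is implemented; the `row_groups` structure and the `scan` helper are hypothetical. The idea is that per-chunk min/max statistics let a reader skip chunks that cannot contain matching rows.

```python
# Each hypothetical "row group" stores per-column (min, max) statistics next to its rows.
row_groups = [
    {"stats": {"salary": (1000, 9000)},
     "rows": [{"gender": "F", "salary": 1000}, {"gender": "M", "salary": 9000}]},
    {"stats": {"salary": (8000, 20000)},
     "rows": [{"gender": "M", "salary": 8000}, {"gender": "F", "salary": 20000}]},
    {"stats": {"salary": (30000, 45000)},
     "rows": [{"gender": "F", "salary": 30000}, {"gender": "M", "salary": 45000}]},
]


def scan(groups, column, threshold):
    """Return rows where column > threshold, skipping groups whose max rules them out."""
    matches, groups_read = [], 0
    for group in groups:
        _, col_max = group["stats"][column]
        if col_max <= threshold:
            continue  # no row in this group can match, so it is never read
        groups_read += 1
        matches.extend(r for r in group["rows"] if r[column] > threshold)
    return matches, groups_read


matches, groups_read = scan(row_groups, "salary", 10000)
print(len(matches), "matching rows from", groups_read, "of", len(row_groups), "groups")
```

A row-oriented text format such as CSV carries no such statistics, so every row has to be parsed before the filter can be applied.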

## Reading Tables
- `spark.read.table(..)`
- The Databricks platform allows us to register a huge variety of data sources as tables via the Databricks UI.
- Any `DataFrame` (from CSV, Parquet, whatever) can be registered as a temporary view.
- Tables/Views can be loaded via the `DataFrameReader` to produce a `DataFrame`.
- Tables/Views can be used directly in SQL statements.

## Reading JSON
- `spark.read.json(..)`
- JSON can represent complex (nested) data types, unlike CSV's flat format.
- It has many of the same limitations as CSV (such as needing to read the entire file to infer the schema).
- Like CSV, it has many options that control date formats, escaping, single-line vs. multiline JSON, etc.

## Reading Text
- `spark.read.text(..)`
- Reads each line of text as a row with a single column named `value`.
- Is the basis for more complex file formats such as fixed-width text files.
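As a rough illustration of the fixed-width case, the sketch below slices each raw line by character position in plain Python. The column layout and sample lines are hypothetical; in Spark, the analogous step would be applying substring expressions to the `value` column.

```python
# Hypothetical layout: column name -> (start, end) character offsets within each line.
LAYOUT = {"id": (0, 5), "name": (5, 15), "state": (15, 17)}


def parse_fixed_width(line, layout):
    """Slice one fixed-width line into named, whitespace-trimmed fields."""
    return {name: line[start:end].strip() for name, (start, end) in layout.items()}


# Build sample lines with explicit padding so the widths match the layout.
lines = [
    f"{'00001':<5}{'Ann':<10}{'CA':<2}",
    f"{'00002':<5}{'Bob':<10}{'NY':<2}",
]

records = [parse_fixed_width(line, LAYOUT) for line in lines]
for record in records:
    print(record)
```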

## Reading JDBC
- `spark.read.jdbc(..)`
- Requires one database connection per partition.
- Has the potential to overwhelm the database.
- Requires a partition column with lower and upper bounds and a partition count (from which a stride is derived) to properly balance partitions.
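To make the stride idea concrete, here is a simplified sketch of how a numeric column range might be divided into one predicate per partition. It approximates the behavior of Spark's JDBC partitioning rather than reproducing it exactly; the `partition_predicates` helper and the `id` column are hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into num_partitions WHERE clauses using a fixed stride.

    The first and last partitions are left open-ended so that rows outside the
    bounds are still read; the bounds only shape the split, they do not filter.
    """
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if lower is None and upper is None:
            predicates.append("1 = 1")  # single partition: no restriction
        elif lower is None:
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


predicates = partition_predicates("id", 0, 100, 4)
for p in predicates:
    print(p)
```

Each predicate corresponds to one partition and therefore one database connection, which is why a large partition count can overwhelm the database and why skewed values in the chosen column lead to unbalanced partitions.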