# simple test of parquet vs. json with sparksql queries

Looking at relative load and query times for the same data sets stored as JSON and as Parquet files.  This is in no way intended to be a meaningful benchmark; it's just a way to get a feeling for what to expect.

In [2]:
sqlContext

<pyspark.sql.context.HiveContext at 0x1118e0710>

## #LahoreBlast tweets

A set of 5,601 tweets mentioning "LahoreBlast" from a few days back, comprising ~31M of data.  First, load it from the Twitter source JSON format, and infer its schema.

In [58]:
!ls -lh lahore.json

-rw-r--r--  1 dchud  staff    31M Apr  3 18:41 lahore.json


In [59]:
!wc -l lahore.json

    5601 lahore.json


In [26]:
%timeit lahore_json = sqlContext.read.json('lahore.json')

1 loop, best of 3: 897 ms per loop


In [7]:
lahore_json.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
...

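Writing out to Parquet files is as simple as it gets (a single `write.save` call, as used for the second data set further down), and reading them back in is fast: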

In [25]:
%timeit lahore_parquet = sqlContext.read.load("lahore.parquet")

10 loops, best of 3: 115 ms per loop
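The write step itself does not appear in the cell above. As a minimal sketch, assuming the Spark 1.4+ `DataFrameReader`/`DataFrameWriter` API and a hypothetical helper name `json_to_parquet` (not part of this notebook), the full round trip could look something like this:

```python
def json_to_parquet(sql_context, json_path, parquet_path):
    """Load line-delimited JSON, write it out as Parquet, and read it back.

    `sql_context` is assumed to be a SQLContext/HiveContext like the
    `sqlContext` used in this notebook; the function name is hypothetical.
    """
    # The schema is inferred from the JSON records.
    df = sql_context.read.json(json_path)
    # Write a Parquet directory. With the default save mode this fails if
    # the path already exists; pass mode="overwrite" to replace it.
    df.write.parquet(parquet_path)
    # Read the Parquet directory back as a new DataFrame.
    return sql_context.read.parquet(parquet_path)
```

With the files used here, that would be called as `lahore_parquet = json_to_parquet(sqlContext, 'lahore.json', 'lahore.parquet')`.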


Parquet data is stored in a directory with binary metadata and compressed column chunk files.  See the [Parquet docs](https://parquet.apache.org/documentation/latest/) for more details.  With this relatively small data set, it generated two chunk files.

In [60]:
!ls -lhR lahore.parquet

total 5776
-rw-r--r--  1 dchud  staff     0B Apr  3 18:46 _SUCCESS
-rw-r--r--  1 dchud  staff    55K Apr  3 18:46 _common_metadata
-rw-r--r--  1 dchud  staff   220K Apr  3 18:46 _metadata
-rw-r--r--  1 dchud  staff   1.3M Apr  3 18:46 part-r-00000-0df57774-59ea-4336-b237-03824327b060.gz.parquet
-rw-r--r--  1 dchud  staff   1.3M Apr  3 18:46 part-r-00001-0df57774-59ea-4336-b237-03824327b060.gz.parquet


Just under 3M total.  Compared to our original 31M file, that's pretty good.  Just to be fair, though, let's compress the original JSON and see how much we save:

In [61]:
!gzip lahore.json

In [62]:
!ls -lh lahore.json.gz

-rw-r--r--  1 dchud  staff   3.8M Apr  3 18:41 lahore.json.gz
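To put the sizes side by side, here is a quick bit of arithmetic using the rounded figures from the `ls -lh` listings above. It assumes `ls` is reporting base-2 units, and because the inputs are rounded the ratios are only approximate.

```python
# Sizes as printed by `ls -lh` above, converted to KiB (rounded inputs).
KIB_PER_MIB = 1024

parquet_files_kib = {
    "_common_metadata": 55,
    "_metadata": 220,
    "part-r-00000": 1.3 * KIB_PER_MIB,
    "part-r-00001": 1.3 * KIB_PER_MIB,
}
json_kib = 31 * KIB_PER_MIB
gzip_json_kib = 3.8 * KIB_PER_MIB

# Total size of the Parquet directory and the two compression ratios.
parquet_total_kib = sum(parquet_files_kib.values())
parquet_ratio = json_kib / parquet_total_kib
gzip_ratio = json_kib / gzip_json_kib

print("Parquet total:   %.2f MiB" % (parquet_total_kib / KIB_PER_MIB))
print("JSON -> Parquet: %.1fx smaller" % parquet_ratio)
print("JSON -> gzip:    %.1fx smaller" % gzip_ratio)
```

That works out to roughly 10.8x for Parquet against roughly 8.2x for plain gzip.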


Okay, so it's not that big an improvement compression-wise.  But the real gains don't come from compression; they come from the column-oriented structure.  To see those, you have to run queries that focus on individual columns, like SQL aggregation functions.
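As a toy illustration of why column orientation helps aggregations, here is the same data laid out both ways in plain Python. This is only a sketch of the idea, not how Parquet is actually implemented, and the sample records are made up.

```python
# Row-oriented: one record per tweet, as in line-delimited JSON.
rows = [
    {"id": 1, "text": "first tweet", "followers_count": 10},
    {"id": 2, "text": "second tweet", "followers_count": 250},
    {"id": 3, "text": "third tweet", "followers_count": 0},
    {"id": 4, "text": "fourth tweet", "followers_count": 40},
]

# Column-oriented: one list of values per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def aggregate(values):
    """Return (sum, average, min, max) for an iterable of numbers."""
    values = list(values)
    return sum(values), sum(values) / float(len(values)), min(values), max(values)


# Row layout: every whole record is visited just to pull out one field.
row_result = aggregate(row["followers_count"] for row in rows)

# Column layout: only the one column the query needs is read.
col_result = aggregate(columns["followers_count"])

print(row_result)
print(col_result)
```

Both layouts give the same answer; the difference is how much unrelated data (here `id` and `text`) has to be read along the way.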

In [63]:
!gunzip lahore.json.gz

## timing simple queries

This is Twitter data, so we should be able to run simple counts on the records and aggregate functions on attributes with straight-up (Spark) SQL.

In [9]:
lahore_json.registerTempTable("lj")

In [99]:
%time count = sqlContext.sql("SELECT COUNT(*) AS the_count FROM lj")

CPU times: user 1.38 ms, sys: 1.31 ms, total: 2.69 ms
Wall time: 31.6 ms


In [100]:
%timeit count.show()

+---------+
|the_count|
+---------+
|     5601|
+---------+

+---------+
|the_count|
+---------+
|     5601|
+---------+

+---------+
|the_count|
+---------+
|     5601|
+---------+

+---------+
|the_count|
+---------+
|     5601|
+---------+

The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 752 ms per loop


The count is correct, and we're getting sub-second response time, though it's skewed by caching.  Let's compare this timing against the data loaded from Parquet.

In [14]:
lahore_parquet.registerTempTable("lp")

In [101]:
%time count = sqlContext.sql("SELECT COUNT(*) AS the_count FROM lp")

CPU times: user 1.58 ms, sys: 4.88 ms, total: 6.46 ms
Wall time: 79.5 ms


In [104]:
%timeit count.show()

+---------+
|the_count|
+---------+
|     5601|
+---------+

+---------+
|the_count|
+---------+
|     5601|
+---------+

+---------+
|the_count|
+---------+
|     5601|
+---------+

+---------+
|the_count|
+---------+
|     5601|
+---------+

1 loop, best of 3: 288 ms per loop


Quite a bit faster, but it's hard to be certain without being explicit about the caching step.
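One way to be explicit about the caching step is to time the same action before and after caching. The following is a minimal sketch, assuming `df` is a Spark DataFrame (or anything exposing `cache()`, `count()` and `unpersist()`); the helper names `time_action` and `cold_vs_warm` are hypothetical and not part of this notebook.

```python
import time


def time_action(action, repeats=3):
    """Call a zero-argument function `repeats` times; return seconds per call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


def cold_vs_warm(df, repeats=3):
    """Return (best uncached time, best cached time) for df.count()."""
    # Drop any existing cache so the first measurements read from the source.
    df.unpersist()
    cold = time_action(df.count, repeats)
    # cache() is lazy: the first action afterwards fills the cache.
    df.cache()
    df.count()
    warm = time_action(df.count, repeats)
    return min(cold), min(warm)
```

Running `cold_vs_warm(lahore_json)` and `cold_vs_warm(lahore_parquet)` would then separate the effect of the file format from the effect of caching.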

Note that we do get the exact same schema of course.

In [17]:
lahore_parquet.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
...

Now a query that should take advantage of the column orientation.  First the JSON as a baseline.

In [108]:
jfollows = sqlContext.sql("""
SELECT SUM(user.followers_count), AVG(user.followers_count), MIN(user.followers_count), MAX(user.followers_count)
FROM lj
""")
%timeit jfollows.show()

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

1 loop, best of 3: 373 ms per loop


In [109]:
pfollows = sqlContext.sql("""
SELECT SUM(user.followers_count), AVG(user.followers_count), MIN(user.followers_count), MAX(user.followers_count)
FROM lp
""")
%timeit pfollows.show()

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

+--------+-----------------+---+-------+
|     _c0|              _c1|_c2|    _c3|
+--------+-----------------+---+-------+
|31515647|5626.789323335119|  0|5680333|
+--------+-----------------+---+-------+

1 loop, best of 3: 300 ms per loop


Hmm... pretty much the same that time.

Let's try a slightly bigger data set, this time a set of 17,654 tweets (so >3x the size of the LahoreBlast set) mentioning a racist misogynist pig running for office.

In [112]:
!ls -lh trump.json

-rw-r--r--  1 dchud  staff    87M Apr  3 19:10 trump.json


In [113]:
!wc -l trump.json

   17654 trump.json


In [35]:
%time trump_json = sqlContext.read.json('trump.json')

CPU times: user 2.3 ms, sys: 2.74 ms, total: 5.05 ms
Wall time: 3.01 s


In [114]:
%timeit trump_json = sqlContext.read.json('trump.json')

1 loop, best of 3: 3.2 s per loop


If we trust that timing (the best of three runs, not really an average), that's about 5.5K tweets loaded from JSON per second.

In [37]:
%time trump_json.write.save('trump.parquet')

CPU times: user 3.5 ms, sys: 8.79 ms, total: 12.3 ms
Wall time: 8.76 s


In [40]:
%timeit trump_parquet = sqlContext.read.load('trump.parquet')

10 loops, best of 3: 102 ms per loop


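Ah! That's better.  If we trust this timing, that's about 173K tweets loaded from Parquet per second.  It isn't quite an apples-to-apples comparison, though: Spark reads lazily, so loading Parquet only needs the schema stored in the file metadata, while loading JSON has to scan every record to infer the schema.  Impressive all the same.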
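The per-second figures quoted for the two loads come from simple division. Here it is spelled out, using the line count from `wc -l` and the best-of-three times reported by `%timeit`:

```python
def tweets_per_second(n_tweets, seconds):
    """Throughput implied by loading `n_tweets` records in `seconds`."""
    return n_tweets / float(seconds)


n_tweets = 17654  # from `wc -l trump.json`

json_rate = tweets_per_second(n_tweets, 3.2)       # JSON load: 3.2 s
parquet_rate = tweets_per_second(n_tweets, 0.102)  # Parquet load: 102 ms

print("JSON:    %.0f tweets/s" % json_rate)
print("Parquet: %.0f tweets/s" % parquet_rate)
print("Ratio:   %.1fx" % (parquet_rate / json_rate))
```

These are load times only; no query has been run against either DataFrame at this point.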

In [115]:
trump_json.registerTempTable('tj')

In [120]:
count = sqlContext.sql("SELECT COUNT(*) AS the_count FROM tj")
%timeit count.show()

+---------+
|the_count|
+---------+
|    17655|
+---------+

+---------+
|the_count|
+---------+
|    17655|
+---------+

+---------+
|the_count|
+---------+
|    17655|
+---------+

+---------+
|the_count|
+---------+
|    17655|
+---------+

1 loop, best of 3: 875 ms per loop
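The count here is 17,655, one more than `wc -l` reported. One possible explanation (an assumption, not verified against the file) is that the last record lacks a trailing newline: `wc -l` counts newline characters rather than records. A small sketch of the difference, using made-up sample data:

```python
def newline_count(data):
    """Number of newline characters, which is what `wc -l` reports."""
    return data.count(b"\n")


def record_count(data):
    """Number of non-empty lines, i.e. line-delimited JSON records."""
    return sum(1 for line in data.splitlines() if line.strip())


# Two records, with and without a newline after the last one.
with_trailing = b'{"id": 1}\n{"id": 2}\n'
without_trailing = b'{"id": 1}\n{"id": 2}'

print(newline_count(with_trailing), record_count(with_trailing))
print(newline_count(without_trailing), record_count(without_trailing))
```

In the second case the newline count comes out one lower than the record count, which would match the 17,654 versus 17,655 seen here.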


In [123]:
tfollows = sqlContext.sql("""
SELECT MIN(user.followers_count), MAX(user.followers_count)
FROM tj
""")
%timeit tfollows.show()

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

1 loop, best of 3: 731 ms per loop


In [122]:
tfollows = sqlContext.sql("""
SELECT SUM(user.followers_count), AVG(user.followers_count)
FROM tj
""")
%timeit tfollows.show()

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

1 loop, best of 3: 797 ms per loop


In [117]:
trump_parquet.registerTempTable("tp")

In [118]:
count = sqlContext.sql("SELECT COUNT(*) AS the_count FROM tp")
%timeit count.show()

+---------+
|the_count|
+---------+
|    17655|
+---------+

+---------+
|the_count|
+---------+
|    17655|
+---------+

+---------+
|the_count|
+---------+
|    17655|
+---------+

+---------+
|the_count|
+---------+
|    17655|
+---------+

1 loop, best of 3: 216 ms per loop


In [124]:
tfollows = sqlContext.sql("""
SELECT MIN(user.followers_count), MAX(user.followers_count)
FROM tp
""")
%timeit tfollows.show()

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

+---+--------+
|_c0|     _c1|
+---+--------+
|  0|13650626|
+---+--------+

1 loop, best of 3: 307 ms per loop


In [125]:
tfollows = sqlContext.sql("""
SELECT SUM(user.followers_count), AVG(user.followers_count)
FROM tp
""")
%timeit tfollows.show()

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

+---------+-----------------+
|      _c0|              _c1|
+---------+-----------------+
|106247853|6018.003568394222|
+---------+-----------------+

1 loop, best of 3: 227 ms per loop


These comparisons suggest a speedup of roughly 2-4x with Parquet over JSON for these three queries on this data set.
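The speedup range can be read straight off the `%timeit` results above. This just divides the JSON time by the Parquet time for each of the three queries on the larger data set:

```python
# (JSON ms, Parquet ms) as reported by %timeit for tables tj and tp.
timings_ms = {
    "COUNT(*)": (875, 216),
    "MIN/MAX": (731, 307),
    "SUM/AVG": (797, 227),
}

# Speedup of Parquet over JSON for each query.
speedups = {
    query: json_ms / float(parquet_ms)
    for query, (json_ms, parquet_ms) in timings_ms.items()
}

for query in sorted(speedups):
    print("%-8s %.1fx" % (query, speedups[query]))
```

Each figure is a single best-of-three timing, so these ratios are a rough indication rather than a benchmark result.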