# Understanding repartitions in Spark
Link: https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

First, let's read a dataset stored as 401 Parquet files:

In [20]:
!aws s3 ls s3://spineds-testing/vendor-comscore/web-panel/traffic/dt=2020-04-08/ | wc -l

401


In [6]:
p = "s3a://spineds-testing/vendor-comscore/web-panel/traffic/dt=2020-04-08/"
#df.unpersist()
df = spark.read.parquet(p)

#.limit(1000000).cache()

print(f"{df.count():,}")

592,707,174


Into how many partitions is the data divided?

In [11]:
df.rdd.getNumPartitions()

414

Let's select a small portion of the data. Like limit, sample reduces the number of rows, but it keeps rows at random, so the size of the result is only approximate.

In [16]:
df2 = df.sample(True, 1/10000)

414

How many rows do we have now?

In [38]:
print(f"{df2.count():,}")

5,951


How many partitions?

In [39]:
df2.rdd.getNumPartitions()

414
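One way to see why the partition count stays at 414: sampling is applied to each partition independently, so partitions shrink but none of them are merged. The snippet below is a plain-Python model of that idea, not Spark's implementation. The helper `sample_partitions` and the toy sizes are made up for illustration, and it samples without replacement for simplicity, whereas the call above passes `True` for `withReplacement`.

```python
import random

def sample_partitions(partitions, fraction, seed=42):
    """Keep each row with probability `fraction`, one partition at a time."""
    rng = random.Random(seed)
    return [[row for row in part if rng.random() < fraction] for part in partitions]

# 8 partitions of 1,000 rows each (toy numbers, not the dataset above).
partitions = [list(range(i * 1000, (i + 1) * 1000)) for i in range(8)]
sampled = sample_partitions(partitions, fraction=0.01)

print(len(partitions), len(sampled))   # the number of partitions is unchanged
print(sum(len(p) for p in sampled))    # but far fewer rows remain overall
```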

Wow!
* Original data: 592,707,174 rows in 414 partitions => about 1,431,660 rows per partition
* Sampled data: 5,951 rows in 414 partitions => about 14 rows per partition!

Something is not right here!
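The per-partition averages quoted above can be checked with a few lines of arithmetic. The variable names are only for illustration; the numbers are the counts printed earlier.

```python
total_rows = 592_707_174    # df.count()
sampled_rows = 5_951        # df2.count()
num_partitions = 414        # getNumPartitions() for both DataFrames

rows_per_partition_full = total_rows / num_partitions
rows_per_partition_sample = sampled_rows / num_partitions

print(f"{rows_per_partition_full:,.0f}")    # 1,431,660
print(f"{rows_per_partition_sample:.1f}")   # 14.4
```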

Let's see what happens if we store this data without repartitioning.

In [36]:
df2.write \
      .mode('overwrite') \
      .parquet("/home/user/repa/")

In [37]:
!ls -l /home/user/repa/ | wc -l

416


In [40]:
!ls -lh /home/user/repa/ | tail -n 5

-rw-r--r-- 1 user user 7.6K Apr 24 21:43 part-00409-e3e31258-fa2c-4633-ae5d-db1a5cfcfc7d-c000.snappy.parquet
-rw-r--r-- 1 user user 7.8K Apr 24 21:43 part-00410-e3e31258-fa2c-4633-ae5d-db1a5cfcfc7d-c000.snappy.parquet
-rw-r--r-- 1 user user 6.9K Apr 24 21:43 part-00411-e3e31258-fa2c-4633-ae5d-db1a5cfcfc7d-c000.snappy.parquet
-rw-r--r-- 1 user user 6.9K Apr 24 21:43 part-00412-e3e31258-fa2c-4633-ae5d-db1a5cfcfc7d-c000.snappy.parquet
-rw-r--r-- 1 user user 5.2K Apr 24 21:43 part-00413-e3e31258-fa2c-4633-ae5d-db1a5cfcfc7d-c000.snappy.parquet


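We have 414 part files (the count of 416 presumably also includes the "total" line printed by ls -l and a _SUCCESS marker file), each only a few KB, when a reasonable file size would be closer to 128 MB. It doesn't make sense.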
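Note that `ls -l | wc -l` also counts lines that are not data files, such as the `total` header and any marker file in the directory. A more direct check is to count only the `part-*.parquet` files and sum their sizes. The helper below is a small sketch using only the standard library; the name `summarize_part_files` is made up for this example.

```python
from pathlib import Path

def summarize_part_files(directory):
    """Return (file_count, total_bytes) for the Spark part files in `directory`."""
    sizes = [p.stat().st_size for p in Path(directory).glob("part-*.parquet")]
    return len(sizes), sum(sizes)

# Usage, assuming the output directory written above exists:
# count, total = summarize_part_files("/home/user/repa/")
# print(f"{count} files, {total / count / 1024:.1f} KB per file on average")
```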

Let's see what's in one file.

In [41]:
local_file = "/home/user/repa/part-00001-e3e31258-fa2c-4633-ae5d-db1a5cfcfc7d-c000.snappy.parquet"
df3 = spark.read.parquet(local_file)
print(f"{df3.count():,}")

16


This is highly inefficient!

Since we have fewer than 6,000 rows, we can put all the data in a single partition.

To go from 414 partitions to 1, we can use coalesce:

In [43]:
df2.rdd.getNumPartitions()

414

In [44]:
df2 = df2.coalesce(1)

In [45]:
df2.rdd.getNumPartitions()

1
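The choice of "1" here was made by eye. As a rough sketch, the same decision can be derived from the row count and a target number of rows per partition. Both helper names and the target of 1,000,000 rows are assumptions for illustration, not a Spark recommendation. `resize_partitions` relies only on the DataFrame methods `rdd.getNumPartitions()`, `coalesce` and `repartition`.

```python
import math

def target_partitions(row_count, rows_per_partition):
    """Smallest partition count that keeps each partition at or under the target."""
    return max(1, math.ceil(row_count / rows_per_partition))

def resize_partitions(df, n):
    """Shrink with coalesce (no full shuffle), grow with repartition (full shuffle)."""
    current = df.rdd.getNumPartitions()
    if n < current:
        return df.coalesce(n)
    if n > current:
        return df.repartition(n)
    return df

print(target_partitions(5_951, 1_000_000))         # 1
print(target_partitions(592_707_174, 1_000_000))   # 593

# Usage with the sampled DataFrame from above:
# df2 = resize_partitions(df2, target_partitions(df2.count(), 1_000_000))
```

One caveat: a drastic coalesce such as `coalesce(1)` can also reduce the parallelism of the upstream computation. When that becomes a problem, `repartition(1)` is the usual alternative, at the cost of a shuffle.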

In [None]:
df2.write \
      .mode('overwrite') \
      .parquet("/home/user/repa2/")

In [None]:
!ls -lh /home/user/repa2/