# numSlices vs. partitionBy comparison
https://stackoverflow.com/questions/24269495/what-are-the-differences-between-slices-and-partitions-of-rdds/24311863
* The behavior differs depending on whether the input file is compressed.
* When reading a compressed (.gz) file, the minPartitions argument has no effect: the RDD gets a single partition.

From the linked answer:

I believe slices and partitions are the same thing in Apache Spark.

However, there is a subtle but potentially significant difference between the two pieces of code you posted.

This code will attempt to load demo.txt directly into 100 partitions using 100 concurrent tasks:

rdd = sc.textFile('demo.txt', 100)

For uncompressed text, it will work as expected. But if instead of demo.txt you had a demo.gz, you will end up with an RDD with only 1 partition. Reads against gzipped files cannot be parallelized.

On the other hand, the following code will first open demo.txt into an RDD with the default number of partitions, then it will explicitly repartition the data into 100 partitions that are roughly equal in size.

rdd = sc.textFile('demo.txt')
rdd = rdd.repartition(100)

## 1. Use a gzipped file
* Unzip the .gz file in a Mac terminal (-k keeps the original .gz file):
* gunzip -k file.gz

In [6]:
data_file = "kddcup.data_10_percent.gz"
raw_data_gz = sc.textFile(data_file)
raw_data_gz.getNumPartitions()

1

### 1.1 Cannot add partitions when reading a compressed file

In [1]:
data_file = "kddcup.data_10_percent.gz"
raw_data_gz = sc.textFile(data_file,minPartitions=8)
raw_data_gz.getNumPartitions(),raw_data_gz.count()

(1, 494021)
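
The single partition is consistent with how gzip works: a .gz file is one continuous compressed stream, so a reader cannot start decoding at an arbitrary byte offset the way it can with plain text. Below is a minimal pure-Python sketch of that property, with no Spark involved. The sample records and the helper `try_decode_from` are made up for illustration.

```python
import gzip
import zlib

# Hypothetical sample data standing in for a text file of records.
lines = [f"{i},tcp,http,SF,{i * 7 % 1000}\n" for i in range(5000)]
plain = "".join(lines).encode("utf-8")
packed = gzip.compress(plain)

# Plain text: a reader can jump to any byte offset and only has to skip
# forward to the next newline to find a complete record.
mid = len(plain) // 2
tail = plain[mid:]
first_full_record = tail[tail.index(b"\n") + 1:].split(b"\n", 1)[0]


def try_decode_from(blob: bytes, offset: int):
    """Try to decompress `blob` starting at `offset`; return None on failure."""
    try:
        return gzip.decompress(blob[offset:])
    except (OSError, EOFError, zlib.error):
        return None


# Gzip: decoding works from the start of the stream, but a slice taken from
# the middle has neither the gzip header nor the decoder state built up from
# the preceding bytes, so it cannot be turned back into the original text.
from_start = try_decode_from(packed, 0)
from_middle = try_decode_from(packed, len(packed) // 2)

print(first_full_record.decode())
print(from_start == plain)  # True
print(from_middle)
```

This is why a gzipped input ends up as one partition no matter what minPartitions asks for: only a single task can read the stream from its beginning.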

In [2]:
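# numSlices defaults to sc.defaultParallelism (the number of logical cores in local mode).
# collect() pulls every record to the driver first, so this only suits small data.
raw_data_cl = sc.parallelize(raw_data_gz.collect())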
raw_data_cl.getNumPartitions(),raw_data_cl.count()

(16, 494021)

In [2]:
raw_data_cl = sc.parallelize(raw_data_gz.collect(),numSlices=8)
raw_data_cl.getNumPartitions(),raw_data_cl.count()

(8, 494021)

In [3]:
raw_data_cl = raw_data_cl.repartition(4)
raw_data_cl.getNumPartitions(),raw_data_cl.count()

(4, 494021)

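### 1.2 Use partitionBy() or repartition() to change the partition count

Note: partitionBy() expects an RDD of (key, value) pairs. On this RDD of plain text lines, getNumPartitions() still reports the new count because it does not run a job, but an action such as count() fails (see 3.2).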

In [50]:
raw_data_gz = raw_data_gz.partitionBy(8)

In [56]:
raw_data_gz.getNumPartitions()

8

In [17]:
raw_data_gz = raw_data_gz.repartition(16)
raw_data_gz.getNumPartitions()

16

In [None]:
del raw_data_gz

## 2. Use an unzipped file

In [33]:
data_file = "kddcup.data_10_percent"
raw_data_txt = sc.textFile(data_file)
raw_data_txt.getNumPartitions()

3

In [34]:
raw_data_txt.count()

494021

### 2.1 Set the minimum number of partitions

In [48]:
raw_data_txt = sc.textFile(data_file,minPartitions=1)
raw_data_txt.getNumPartitions(),raw_data_txt.count()

(3, 494021)

In [38]:
raw_data_txt = sc.textFile(data_file,minPartitions=8)
raw_data_txt.getNumPartitions(),raw_data_txt.count()

(8, 494021)
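
The counts above (3 by default and with minPartitions=1, 8 with minPartitions=8, and always 1 for the .gz file) are consistent with minPartitions being a lower-bound hint rather than an exact setting. The sketch below is a simplified pure-Python model of the split-size rule in Hadoop's FileInputFormat, which textFile relies on. `count_splits` is a hypothetical helper, not a Spark API, and the 75 MB file size and 32 MB local block size are assumptions chosen for illustration, not values measured in this notebook.

```python
def count_splits(file_size, min_partitions, block_size=32 * 1024 * 1024,
                 min_split_size=1, splittable=True, slop=1.1):
    """Simplified model of how many input splits one file of file_size > 0 bytes yields."""
    if not splittable:
        # Non-splittable codecs such as gzip always produce a single split.
        return 1
    # The requested minimum sets a target size per split ...
    goal_size = file_size // max(min_partitions, 1)
    # ... but a split is never larger than one block.
    split_size = max(min_split_size, min(goal_size, block_size))
    splits, remaining = 0, file_size
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining:
        splits += 1
    return splits


FILE_SIZE = 75_000_000  # assumed size of the uncompressed file, in bytes

print(count_splits(FILE_SIZE, min_partitions=2))  # 3 (textFile's default minimum is at most 2)
print(count_splits(FILE_SIZE, min_partitions=1))  # 3
print(count_splits(FILE_SIZE, min_partitions=8))  # 8
print(count_splits(FILE_SIZE, min_partitions=8, splittable=False))  # 1
```

Under these assumptions, asking for fewer partitions than the block size already forces has no effect, while asking for more shrinks the split size until the requested minimum is met.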

## 3. repartition() and partitionBy() do not modify the RDD in place

In [39]:
raw_data_txt.repartition(16)
raw_data_txt.getNumPartitions(),raw_data_txt.count()

(8, 494021)

### 3.1 The result of repartition() can be assigned back without errors on actions

In [45]:
raw_data_txt = raw_data_txt.repartition(16)
raw_data_txt.getNumPartitions(),raw_data_txt.count()

(16, 494021)

### 3.2 partitionBy() is different: actions fail if its result is assigned back

In [47]:
raw_data_txt = raw_data_txt.partitionBy(16)
# count() below raises Py4JJavaError: partitionBy() expects (key, value) pairs,
# but this RDD holds plain text lines
# raw_data_txt.getNumPartitions(),raw_data_txt.count()

In [30]:
raw_data_txt.partitionBy(16)
raw_data_txt.getNumPartitions()

3

In [31]:
raw_data_txt.count()

494021
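
The difference between 3.1 and 3.2 can be explained by what each method needs from the records. partitionBy() routes each record by its key, so every record must be a (key, value) pair. repartition() works on any record because, roughly speaking, it attaches its own synthetic key before shuffling and drops it afterwards. The sketch below is a pure-Python model of that idea, not Spark's implementation. The function names, the shortened sample lines, and the choice of field 1 as the key are all assumptions made for illustration.

```python
def partition_by(records, num_partitions, partition_func=hash):
    """Model of partitionBy(): route each (key, value) pair to a bucket by its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:  # fails unless every record unpacks into two items
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets


def repartition(records, num_partitions):
    """Model of repartition(): add a round-robin key, partition by it, drop the key."""
    keyed = [(i, record) for i, record in enumerate(records)]
    return [[value for _, value in bucket]
            for bucket in partition_by(keyed, num_partitions)]


# Made-up lines shaped like comma-separated connection records.
lines = [
    "0,tcp,http,SF,181",
    "0,udp,domain_u,SF,105",
    "0,icmp,ecr_i,SF,1032",
    "0,tcp,smtp,SF,1684",
]

# repartition() works on plain lines.
parts = repartition(lines, 2)

# partition_by() on plain lines fails: a string cannot be unpacked as (key, value).
try:
    partition_by(lines, 2)
    error = None
except ValueError as exc:
    error = str(exc)

# Keying each line first makes it work. len is used as the partition function
# only to keep this example deterministic across Python processes.
keyed_lines = [(line.split(",")[1], line) for line in lines]
by_protocol = partition_by(keyed_lines, 2, partition_func=len)

print(parts)
print(error)  # too many values to unpack (expected 2)
print([[key for key, _ in bucket] for bucket in by_protocol])  # [['icmp'], ['tcp', 'udp', 'tcp']]
```

In the notebook, the equivalent fix would be to build pairs before partitioning, for example raw_data_txt.map(lambda line: (line.split(",")[1], line)).partitionBy(16), assuming field 1 is a sensible key for the data.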