# Writing Spark configurations

- Now that you've reviewed some of the Spark configurations on your cluster, you want to modify some of the settings to tune Spark to your needs. You'll import some data to verify that your changes have affected the cluster.

- The `spark.sql.shuffle.partitions` configuration is initially set to its default value of 200 partitions.

- The `spark` object is available for use. A file named `departures.txt.gz` is available for import. An initial DataFrame containing the distinct rows from `departures.txt.gz` is available as `departures_df`.

## Instructions

- Store the number of partitions in `departures_df` in the variable `before`.
- Change the `spark.sql.shuffle.partitions` configuration to 500 partitions.
- Recreate the `departures_df` DataFrame reading the distinct rows from the departures file.
- Print the number of partitions from before and after the configuration change.
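
The steps above can be sketched as a small helper. This is a minimal sketch rather than the exercise's reference solution: the function name `compare_partition_counts` and its parameters are hypothetical, and `spark` is assumed to be an active `SparkSession`. It relies on `spark.conf.set` to change the runtime setting and on `DataFrame.rdd.getNumPartitions()` to read the partition count.

```python
def compare_partition_counts(spark, path, new_shuffle_partitions):
    """Return (before, after) partition counts for the distinct rows in `path`.

    `spark` is assumed to be an active SparkSession and `path` a CSV file
    that spark.read.csv can open. Both names are illustrative.
    """
    # distinct() triggers a shuffle, so the partition count of the result
    # follows spark.sql.shuffle.partitions at the time the query is planned.
    before = spark.read.csv(path).distinct().rdd.getNumPartitions()

    # Runtime SQL settings can be changed on a live session.
    spark.conf.set("spark.sql.shuffle.partitions", new_shuffle_partitions)

    # Re-read the file so the new DataFrame is planned with the new setting.
    after = spark.read.csv(path).distinct().rdd.getNumPartitions()
    return before, after
```

Calling `compare_partition_counts(spark, path, 500)` on a session that still uses the default setting would be expected to return 200 and 500 on the Spark 2.4 setup used here. On Spark 3.x with adaptive query execution enabled, shuffle partitions may be coalesced, so the observed counts can be lower.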

In [4]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
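
The `PYSPARK_SUBMIT_ARGS` lines in the cell above all follow the same pattern: `--packages`, a comma-separated list of Maven coordinates, then `pyspark-shell`. As a minimal sketch, the value can be assembled from a list so that packages are easier to add or remove. The helper name `build_submit_args` is hypothetical. The coordinates are the ones already used in this notebook.

```python
import os


def build_submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value."""
    # Coordinates are comma-separated with no spaces; "pyspark-shell"
    # must come last so PySpark launches its shell gateway.
    return "--packages " + ",".join(packages) + " pyspark-shell"


packages = [
    "com.databricks:spark-xml_2.11:0.6.0",
    "org.apache.spark:spark-avro_2.11:2.4.3",
]

# Set this before the SparkSession is created; it is read at JVM launch.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(packages)
```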

In [None]:
# Entry point (Spark 2.x)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On YARN, specify .master("yarn"):
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()

sc = spark.sparkContext

In [None]:
departures_df = spark.read.csv('file://<pwd>/Dataset/departures.txt.gz').distinct()
departures_df = departures_df.repartition(200)

# Store the number of partitions in a variable
before = departures_df.____

# Configure Spark to use 500 partitions
____('spark.sql.shuffle.partitions', ____)

# Recreate the DataFrame using the departures data file
departures_df = spark.read.csv('file://<pwd>/Dataset/departures.txt.gz').____

# Print the number of partitions for each instance
print("Partition count before change: %d" % ____)
print("Partition count after change: %d" % ____)