## 1. Introduction

This notebook shows how to connect a Jupyter notebook to a Spark cluster, read a local CSV file, and store it in Hadoop (HDFS) as partitioned Parquet files.

## 2. Connection to Spark Cluster

To connect to the Spark cluster, create a SparkSession object with the following parameters:

+ **appName:** application name displayed in the [Spark Master Web UI](http://localhost:8080/);
+ **master:** Spark Master URL, the same one used by the Spark Workers;
+ **spark.executor.memory:** must be less than or equal to the SPARK_WORKER_MEMORY setting in the docker compose configuration.
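
The memory constraint in the last bullet can be checked before the session is created. The following is a minimal sketch, not part of the original notebook: `parse_memory` and `executor_fits_worker` are hypothetical helper names, and the worker value `"2G"` is an assumption because the docker compose file is not shown here.

```python
import re

# Multipliers, in bytes, for the size suffixes used in values such as "2G" or "512m".
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_memory(size: str) -> int:
    """Convert a size string such as "2G", "2g" or "512mb" to bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", size.strip().lower())
    if match is None:
        raise ValueError("Unrecognised memory size: {!r}".format(size))
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def executor_fits_worker(executor_memory: str, worker_memory: str) -> bool:
    """Return True when the executor request does not exceed the worker limit."""
    return parse_memory(executor_memory) <= parse_memory(worker_memory)


# Assumed values: "2G" for both spark.executor.memory and SPARK_WORKER_MEMORY.
print(executor_fits_worker("2G", "2G"))  # executor request equals the limit
print(executor_fits_worker("4G", "2G"))  # executor request exceeds the limit
```

If the check returns `False`, the executor request is larger than what a worker offers, so either the request or SPARK_WORKER_MEMORY probably needs adjusting.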

In [1]:
import pyspark
from pyspark.sql import SparkSession

# Build (or reuse) a session connected to the standalone Spark master
spark = (
    SparkSession.builder
    .appName("pyspark-notebook")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "2G")
    .getOrCreate()
)

22/11/26 03:40:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## 3. Load and Store Data
We will now load data from a local CSV file and store it in Hadoop, partitioned by a column.
Afterward, you can explore the saved Parquet files in the Hadoop UI at 'http://localhost:9870' (Utilities -> Browse the file system).

In [2]:
import os
import time

import pandas
from pyspark.sql.types import *
from pyspark.sql import functions as F

# Unix timestamp, used later to give the output path a unique name
epochNow = int(time.time())

In [3]:
# Walk ./data/ and load the sales CSV into a pandas DataFrame.
# If several file names contain "salesRecord", the last one found wins.
for path, subdirs, files in os.walk('./data/'):
    for name in files:
        if "salesRecord" in name:
            csvName = name
            csvPath = os.path.join(path, name)
            print("Loading data from csv {}".format(csvPath))
            salesDfPandas = pandas.read_csv(csvPath)

Loading data from csv ./data/salesRecord.csv
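
The file search above leaves `csvName` and `csvPath` undefined when nothing matches. As an alternative, here is a minimal sketch using `pathlib` that makes the "not found" case explicit. `find_first_csv` is a hypothetical helper, not part of the original notebook, and unlike the loop above it only considers `*.csv` files and returns the first match in sorted order.

```python
from pathlib import Path
from typing import Optional


def find_first_csv(root: str, keyword: str) -> Optional[Path]:
    """Return the first CSV under root whose file name contains keyword, else None."""
    # rglob searches root and all of its subdirectories; sorting makes the
    # choice deterministic when more than one file matches.
    for candidate in sorted(Path(root).rglob("*.csv")):
        if keyword in candidate.name:
            return candidate
    return None


csv_file = find_first_csv("./data", "salesRecord")
if csv_file is None:
    print("No matching CSV found")
else:
    print("Loading data from csv {}".format(csv_file))
```

The returned `Path` can be passed straight to `pandas.read_csv`, and `csv_file.name` plays the role of `csvName`.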


In [4]:
# Create a PySpark DataFrame from the pandas DataFrame
salesDfSpark = spark.createDataFrame(salesDfPandas)



In [5]:
# Replace spaces in column names with underscores
salesDfSpark = salesDfSpark.select([F.col(col).alias(col.replace(' ', '_')) for col in salesDfSpark.columns])
print("Sales Dataframe created with schema : ")
salesDfSpark.printSchema()

Sales Dataframe created with schema : 
root
 |-- Region: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Item_Type: string (nullable = true)
 |-- Sales_Channel: string (nullable = true)
 |-- Order_Priority: string (nullable = true)
 |-- Order_Date: string (nullable = true)
 |-- Order_ID: long (nullable = true)
 |-- Ship_Date: string (nullable = true)
 |-- Units_Sold: long (nullable = true)
 |-- Unit_Price: double (nullable = true)
 |-- Unit_Cost: double (nullable = true)
 |-- Total_Revenue: double (nullable = true)
 |-- Total_Cost: double (nullable = true)
 |-- Total_Profit: double (nullable = true)



In [6]:
# Write the DataFrame to HDFS as Parquet files,
# partitioned by the "Country" column (one subdirectory per country).
# The CSV-only "header" option is dropped because Parquet ignores it.
salesDfSpark.write \
        .partitionBy("Country") \
        .mode("overwrite") \
        .parquet("hdfs://hadoop-namenode:9000/sales/{}_{}.parquet".format(csvName, epochNow))
print("Sales Dataframe stored in Hadoop.")

Sales Dataframe stored in Hadoop.
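
Because `csvName` still carries its `.csv` extension, the output directory ends up named like `salesRecord.csv_<epoch>.parquet`. Inside it, `partitionBy("Country")` creates one subdirectory per value, such as `Country=Bahrain`. If a cleaner name is preferred, the path can be built from the file stem instead. The following is a minimal sketch: `build_output_path` is a hypothetical helper and the epoch value is illustrative only.

```python
from pathlib import PurePosixPath


def build_output_path(base_uri: str, csv_name: str, epoch: int) -> str:
    """Build '<base_uri>/<csv stem>_<epoch>.parquet' without the '.csv' suffix."""
    stem = PurePosixPath(csv_name).stem  # "salesRecord.csv" -> "salesRecord"
    return "{}/{}_{}.parquet".format(base_uri.rstrip("/"), stem, epoch)


# Illustrative epoch value; the notebook uses int(time.time()).
print(build_output_path("hdfs://hadoop-namenode:9000/sales", "salesRecord.csv", 1669434041))
# → hdfs://hadoop-namenode:9000/sales/salesRecord_1669434041.parquet
```

The same helper would have to be used for both the write and the read-back step so that they point at the same location.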


In [7]:
# Read back from HDFS to confirm the data was stored successfully
df_load = spark.read.parquet("hdfs://hadoop-namenode:9000/sales/{}_{}.parquet".format(csvName, epochNow))
print("Sales Dataframe read from Hadoop : ")
df_load.show()

Sales Dataframe read from Hadoop : 
+--------------------+---------------+-------------+--------------+----------+---------+----------+----------+----------+---------+-------------+----------+------------+-------+
|              Region|      Item_Type|Sales_Channel|Order_Priority|Order_Date| Order_ID| Ship_Date|Units_Sold|Unit_Price|Unit_Cost|Total_Revenue|Total_Cost|Total_Profit|Country|
+--------------------+---------------+-------------+--------------+----------+---------+----------+----------+----------+---------+-------------+----------+------------+-------+
|Middle East and N...|      Household|       Online|             H| 6/27/2012|927666509| 7/17/2012|      5990|    668.27|   502.54|    4002937.3| 3010214.6|    992722.7|Bahrain|
|Middle East and N...|      Baby Food|       Online|             M| 2/21/2011|195833718|  4/7/2011|       404|    255.28|   159.42|    103133.12|  64405.68|    38727.44|Bahrain|
|Middle East and N...|         Fruits|       Online|             L|11/20/2

22/11/26 03:45:27 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: Master removed our application: KILLED
22/11/26 03:45:27 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
	at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:716)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:152)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:258)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:168)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$Mess