# Spark sample showing read/write methods
In this sample notebook, we read CSV files from HDFS, write them out as Parquet and ORC files, and save the corresponding Hive table definitions.

In [2]:
# Read the clickstream CSV file(s) into a Spark DataFrame, then print the schema and the top rows
results = spark.read.option("inferSchema", "true").csv('/clickstream_data').toDF(
            "wcs_click_date_sk", "wcs_click_time_sk", "wcs_sales_sk", "wcs_item_sk", "wcs_web_page_sk", "wcs_user_sk"
            )
results.printSchema()
results.show()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1555189187089_0001,pyspark3,idle,Link,Link,✔


SparkSession available as 'spark'.


root
 |-- wcs_click_date_sk: integer (nullable = true)
 |-- wcs_click_time_sk: integer (nullable = true)
 |-- wcs_sales_sk: integer (nullable = true)
 |-- wcs_item_sk: integer (nullable = true)
 |-- wcs_web_page_sk: integer (nullable = true)
 |-- wcs_user_sk: integer (nullable = true)

+-----------------+-----------------+------------+-----------+---------------+-----------+
|wcs_click_date_sk|wcs_click_time_sk|wcs_sales_sk|wcs_item_sk|wcs_web_page_sk|wcs_user_sk|
+-----------------+-----------------+------------+-----------+---------------+-----------+
|            36890|            40052|        null|       4379|             34|       null|
|            36890|            41285|        null|       6245|             34|       null|
|            36890|            23115|        null|      13852|             34|       null|
|            36890|            17702|        null|      15975|             34|       null|
|            36890|            62676|        null|       2119|             3
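
The read above relies on `inferSchema`, which makes Spark take an extra pass over the data to work out the column types. As a minimal sketch of an alternative, assuming all six columns remain integers as in the printed schema, the schema can be declared up front. The names `CLICKSTREAM_COLUMNS`, `clickstream_schema` and `read_clickstream` are hypothetical helpers introduced here for illustration; they are not part of the original notebook.

```python
# Column names copied from the toDF(...) call in the cell above.
CLICKSTREAM_COLUMNS = [
    "wcs_click_date_sk",
    "wcs_click_time_sk",
    "wcs_sales_sk",
    "wcs_item_sk",
    "wcs_web_page_sk",
    "wcs_user_sk",
]


def clickstream_schema():
    """Build an all-integer, nullable schema (assumption: matches printSchema output)."""
    # Imported here so the definitions load even where PySpark is not importable.
    from pyspark.sql.types import IntegerType, StructField, StructType

    return StructType(
        [StructField(name, IntegerType(), nullable=True) for name in CLICKSTREAM_COLUMNS]
    )


def read_clickstream(spark, path="/clickstream_data"):
    """Read the headerless CSV with the declared schema instead of inferring it."""
    return spark.read.schema(clickstream_schema()).csv(path)
```

In the notebook session this would be called as `results = read_clickstream(spark)`. Because the schema already carries the column names, the separate `toDF(...)` rename is not needed in this variant.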

In [3]:
# Disable writing the _SUCCESS marker file
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

# Print the current warehouse directory where the Parquet and ORC files will be stored
print(spark.conf.get("spark.sql.warehouse.dir"))

# Save the results as Parquet and ORC files and create the Hive tables
results.write.format("parquet").mode("overwrite").saveAsTable("web_clickstreams")
results.write.format("orc").mode("overwrite").saveAsTable("web_clickstreams_orc")

hdfs:///user/hive/warehouse
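
After the two `saveAsTable` calls, one quick sanity check is to read both tables back through the catalog and compare their row counts. The sketch below assumes the tables were written from the same DataFrame; `row_counts` and `tables_match` are hypothetical helper names, not part of the original notebook.

```python
def row_counts(spark, table_names):
    """Return {table name: row count} by reading each table through the catalog."""
    return {name: spark.table(name).count() for name in table_names}


def tables_match(spark, parquet_table, orc_table):
    """True when the Parquet-backed and ORC-backed tables hold the same number of rows."""
    counts = row_counts(spark, [parquet_table, orc_table])
    return counts[parquet_table] == counts[orc_table]
```

In the notebook session this could be used as `tables_match(spark, "web_clickstreams", "web_clickstreams_orc")`. Matching counts do not prove the contents are identical, but they catch a write that failed partway.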

In [5]:
# Read the product reviews CSV files into a Spark DataFrame, then print the schema and the top rows
results = spark.read.option("inferSchema", "true").csv('/product_review_data').toDF(
            "pr_review_sk", "pr_review_content"
            )
results.printSchema()
results.show()

root
 |-- pr_review_sk: integer (nullable = true)
 |-- pr_review_content: string (nullable = true)

+------------+--------------------+
|pr_review_sk|   pr_review_content|
+------------+--------------------+
|       72621|Works fine. Easy ...|
|       89334|great product to ...|
|       89335|Next time will go...|
|       84259|Great Gift Great ...|
|       84398|After trip to Par...|
|       66434|Simply the best t...|
|       66501|This is the exact...|
|       66587|Not super magnet;...|
|       66680|Installed as bath...|
|       66694|Our home was buil...|
|       84489|Hi ;We are runnin...|
|       79052|Terra cotta is th...|
|       73034|One of my fingern...|
|       73298|We installed thes...|
|       66810|needed silicone c...|
|       66912|Great Gift Great ...|
|       67028|Laguiole knives a...|
|       89770|Good sound timers...|
|       84679|AWESOME FEEDBACK ...|
|       84953|love the retro gl...|
+------------+--------------------+
only showing top 20 rows

In [6]:
# Save the results in Parquet and ORC formats and create the Hive tables
results.write.format("parquet").mode("overwrite").saveAsTable("product_reviews")
results.write.format("orc").mode("overwrite").saveAsTable("product_reviews_orc")
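
The same pair of writes appears for both datasets, so it could be folded into one helper. This is only a sketch under the notebook's naming convention (the ORC table takes the `_orc` suffix); `table_targets` and `save_in_both_formats` are hypothetical names introduced for illustration.

```python
def table_targets(base_name):
    """Return (format, table name) pairs following the notebook's naming convention."""
    return [("parquet", base_name), ("orc", base_name + "_orc")]


def save_in_both_formats(df, base_name, mode="overwrite"):
    """Write df once per format and register each result as a Hive table."""
    for fmt, table in table_targets(base_name):
        df.write.format(fmt).mode(mode).saveAsTable(table)
```

With this in place, the cell above would reduce to `save_in_both_formats(results, "product_reviews")`.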