# Lab: Use Spark Dataframes for ETL

In this lab, you will use Spark dataframes to process some data.

--> 1. This lab uses the zip archive webpage.zip from Blackboard. Either use the reusable notebook you wrote
in a previous lab to unzip it and place the resulting webpage directory in DBFS's /FileStore/, or follow
the steps from a previous lab to do this manually.

--> 2. View the beginning of the part-00000 file. Don't forget that the splitlines() method splits a string into a
list at its line breaks. You can therefore make your output much more presentable by looping over this
list, using something similar to:

In [0]:
dbutils.fs.head("dbfs:/FileStore/tables/extracts/webpage/part-m-00000")
#print(dbutils.fs.head("dbfs:/FileStore/tables/extracts/webpage/part-m-00000"))

Use dbutils.fs.head(<filename>).splitlines() to show the same content, printed one line at a time:

In [0]:
for line in dbutils.fs.head("dbfs:/FileStore/tables/extracts/webpage/part-m-00000").splitlines():
  print(line)
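
To see what splitlines() does without touching DBFS, here is a small plain-Python sketch. The sample string is made up for illustration and only mimics the tab-separated layout of the file; it is not taken from the real data.

```python
# Hypothetical sample standing in for the string returned by dbutils.fs.head()
sample = "1\tindex.html\tlogo.png,style.css\n2\tabout.html\tteam.jpg\n"

# splitlines() breaks the string at line breaks and drops the line terminators
lines = sample.splitlines()

# Printing each element separately gives one record per output line
for line in lines:
    print(line)
```

Without the loop, the notebook would display the whole string with its escape sequences (\t, \n) shown literally, which is much harder to read.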

## Observations

The columns are id, an integer, followed by two string columns: webpage and associated files. Note that the data in
the last column, the associated files, is a comma-delimited string. We're going to extract the data in that column, split the
string and create a new dataset containing each webpage and its associated files in separate rows.

## Create a dataframe

You will now explore ways to make dataframes: both from input text and from other dataframes.

--> 3. Create a new dataframe called webpages from the webpage data. You should try creating this in three different ways:

## (a) Using spark.createDataFrame()
Create a schema and an RDD, then use these to create the dataframe.

In [0]:
from pyspark.sql.types import *

mySchema = StructType( [
                        StructField("id", IntegerType()),
                        StructField("webpage", StringType()),
                        StructField("associated_files", StringType())
                          ] )

webRDD = sc.textFile("dbfs:/FileStore/tables/extracts/webpage/*"). \
         map( lambda line : line.split()).  \
         map( lambda values :[int(values[0]),values[1],values[2]])

webDF = spark.createDataFrame(webRDD, mySchema)
webDF.show(5)

# Point to note - sc.textFile() already yields one record per line;
# .split() with no arguments then splits each line on any whitespace, including the tab delimiter
# webRDD.take(5)
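
The choice of separator matters when a field can contain spaces. The following plain-Python sketch compares split() with split("\t") on a made-up record (the line below is hypothetical, not taken from the lab data), assuming the file is tab-delimited as the sep="\t" setting used with spark.read.csv() suggests.

```python
# Hypothetical record whose second field contains a space
line = "7\tmy page.html\ta.png,b.css"

# With no argument, split() breaks on any run of whitespace (spaces, tabs, ...)
by_whitespace = line.split()

# With an explicit tab, only the column delimiter is used
by_tab = line.split("\t")

print(by_whitespace)
print(by_tab)
```

If the webpage names in your data never contain whitespace, both calls give the same three fields; otherwise split("\t") is the safer choice inside the map.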


## (b) Using spark.read.csv()
Don't forget that you will need to change the default delimiter.

In [0]:
webDF_read = spark.read.csv("dbfs:/FileStore/tables/extracts/webpage/part-m-*", sep="\t")
webDF_read.printSchema()

In [0]:
webDF_read.show(5)
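
Read this way, the columns get default names (_c0, _c1, _c2) and are all strings, so they still need renaming and casting. One possible alternative is to supply the schema at read time. The sketch below is only an illustration: read_webpages and WEBPAGE_SCHEMA_DDL are hypothetical names, and the Spark session is passed in as a parameter so the function does not depend on the notebook's global spark variable.

```python
# DDL-style schema string; the column names match those used in mySchema
WEBPAGE_SCHEMA_DDL = "id INT, webpage STRING, associated_files STRING"


def read_webpages(spark, path):
    """Read the tab-separated webpage files with an explicit schema.

    `spark` is expected to be a SparkSession (in Databricks, the built-in `spark`).
    """
    return spark.read.csv(path, sep="\t", schema=WEBPAGE_SCHEMA_DDL)
```

In a notebook cell this could be called as read_webpages(spark, "dbfs:/FileStore/tables/extracts/webpage/part-m-*"), followed by printSchema() to check the result.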

## (c) Using the .toDF() method of an RDD

In [0]:
webDF_Method = webRDD.toDF(schema=mySchema)
webDF_Method.show(5)

In [0]:
webDF_Method_NoSch = webRDD.toDF()
webDF_Method_NoSch.show(5)

--> 4. Examine the schema of the new DataFrame by calling webpages.printSchema(). If your columns don't have the
expected names, rename them (e.g. using withColumnRenamed). Also check that the first
column has an integer type. If it does not, you need to change this: one way to do it is using withColumn and cast.

In [0]:
webDF_Method.printSchema()

In [0]:
webDF_Method.withColumnRenamed("id", "ID")

In [0]:
webDF_Method.printSchema()

In [0]:
webDF_Method_NoSch.printSchema()

In [0]:
webDF_Method_NoSch["_1"].cast("integer")

In [0]:
webDF_Method_NoSch.withColumn("_1", webDF_Method_NoSch["_1"].cast("Integer")).withColumnRenamed("_1", "ID").withColumnRenamed("_2", "Webpage")

In [0]:
webDF_Method_NoSch.printSchema()

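DataFrames, like the RDDs they are built on, are immutable: withColumn and withColumnRenamed return a new DataFrame and leave the original unchanged, which is why the schema printed above is the same as before.

--> To keep the changes, assign the result back to a variable using the = operator.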

In [0]:
webDF_Method_NoSch = webDF_Method_NoSch.withColumn("_1", webDF_Method_NoSch["_1"].cast("Integer")).withColumnRenamed("_1", "ID").withColumnRenamed("_2", "Webpage")

--> 5. Check that your revised webpage dataframes all have the same schema.

In [0]:
webDF_Method_NoSch.printSchema()

--> 6. Create a new DataFrame by selecting the webpage and associated files columns from the existing DataFrame.

In [0]:
webDF_sliced = webDF_Method.select("webpage", "associated_files")
webDF_sliced.show(5)

--> 7. In order to manipulate the data using Spark, convert the DataFrame into a Pair RDD using the map method. The
input into the map method is a Row object. The key is the webpage value, and the value is the associated files
string.

In [0]:
webRDD_sliced = webDF_sliced.rdd.map(lambda line: (line.webpage, line.associated_files))
webRDD_sliced.take(5)

--> 8. Now that you have an RDD, you can use the familiar flatMapValues transformation to split and extract the filenames
in the associated files column.

In [0]:
Files_RDD = webRDD_sliced.flatMapValues(lambda line: line.split(","))
Files_RDD.take(15)
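
To make the effect of flatMapValues concrete, here is a plain-Python analogue that runs without Spark. The helper flat_map_values and the sample pairs are hypothetical and exist only for illustration; the real transformation is the RDD method used in the cell above.

```python
def flat_map_values(pairs, func):
    """Plain-Python analogue of RDD.flatMapValues.

    Keeps each key and emits one (key, item) pair per element of func(value).
    """
    return [(key, item) for key, value in pairs for item in func(value)]


# Hypothetical (webpage, associated_files) pairs
pairs = [
    ("index.html", "logo.png,style.css"),
    ("about.html", "team.jpg"),
]

# Split each comma-delimited value; every filename becomes its own row
result = flat_map_values(pairs, lambda files: files.split(","))

for pair in result:
    print(pair)
```

Each webpage appears once per associated file, which is the one-row-per-file layout the lab is aiming for.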

--> 9. Create a new DataFrame from the RDD.

In [0]:
Files_DF = Files_RDD.toDF()
Files_DF.show(15)

--> 10. Call printSchema on the new DataFrame.

In [0]:
Files_DF.printSchema()

--> 11. If you ended up with generic names for columns, such as _1 and _2, create a new DataFrame in which the two columns
are renamed to reflect the data they hold, using the withColumnRenamed method.

In [0]:
Files_DF = Files_DF.withColumnRenamed("_1", "webpage").withColumnRenamed("_2", "associated_files")


--> 12. Call printSchema to confirm that the new DataFrame has the correct column names.

In [0]:
Files_DF.printSchema()

--> 13. Your final DataFrame contains the processed data, so save it in Parquet format (the default) in /FileStore/webpage_files.

In [0]:
# Parquet is the default format for write.save(), so no format needs to be specified

Files_DF.write.save("/FileStore/webpage_files/")
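
Note that with the default save mode, the write raises an error if the target path already exists, so re-running the cell fails. A minimal sketch of one way around this is to set the mode explicitly; save_as_parquet is a hypothetical helper name, and overwriting is an assumption about what you want when re-running the lab.

```python
def save_as_parquet(df, path, mode="overwrite"):
    """Write the DataFrame `df` to `path` as Parquet.

    By default any existing output at `path` is replaced; pass a different
    mode (for example "append") to change that behaviour.
    """
    df.write.mode(mode).parquet(path)
```

In the notebook this could be called as save_as_parquet(Files_DF, "/FileStore/webpage_files/").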

--> 14. Check the files have been created.

In [0]:
dbutils.fs.ls("/FileStore/webpage_files")