# Lab: process data files with Spark

In this exercise you will parse a set of activation records in JSON format to extract the account numbers and model names.
Spark is commonly used for ETL (Extract/Transform/Load) operations. Sometimes data is stored in line-oriented records,
like the web logs in the previous exercise, but sometimes the data is in a multi-line format that must be processed as a whole
file. In this exercise you will practice working with file-based rather than line-based formats.

## Put your data on Databricks

Download the activations.tgz file from Blackboard and unpack with tar xvzf activations.tgz. Review the data. Each
JSON file contains data for all the devices activated by customers during a specific month.

--> 1. Download the json_activations.zip file from Blackboard. This will download it to the machine you're using. To get
the zip file onto Databricks, we'll use the Databricks user interface (UI).

--> 2. Log into Databricks so you can see the "Welcome to databricks" screen. (If you're already logged in, you can get back
to this screen by clicking the topmost icon on the left.)

--> 3. There should be three columns, the middle of which is titled "Import & Explore Data". Click on the "click to browse"
link above this and select the json_activations.zip file you downloaded to your system. Wait for the file to upload,
but do not click on anything else on that page after the upload finishes.

--> 4. By default, the UI uploads to /FileStore/tables/. Back in a notebook, use the dbutils.fs.ls command to check
that the file has been uploaded.

In [0]:
dbutils.fs.ls("dbfs:/FileStore/tables/")

--> 5. We need to extract this zip archive. Since the dbutils toolkit doesn't provide an unzip command, you need to copy the
file to the driver node, extract it there using a shell command, and put the extracted contents back into DBFS. So, first
copy json_activations.zip from DBFS to the /tmp directory on your driver node using dbutils and verify that the
file has been copied.

In [0]:
dbutils.fs.cp("dbfs:/FileStore/tables/json_activations.zip", "file:/tmp")

In [0]:
dbutils.fs.ls("file:/tmp")

--> 6. Use the UNIX command unzip to extract the contents. That is, assuming step 5 succeeded, you should be able to run
unzip -d /tmp/ /tmp/json_activations.zip, which should report the creation of an activations directory and that various
files have been inflated into it. The -d option tells unzip to perform the extraction into /tmp rather than your current
working directory. Confirm this has worked with an ls of the /tmp/ and /tmp/activations directories.

In [0]:
%sh

unzip -d /tmp/ /tmp/json_activations.zip

In [0]:
%sh

ls /tmp/activations
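
If you would rather stay in Python than switch to a %sh cell, the extraction step can also be sketched with the standard-library zipfile module. This is only an illustrative alternative, not part of the lab's required steps; extract_zip is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper: a Python stand-in for `unzip -d dest_dir zip_path`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

On the driver node, a call such as extract_zip("/tmp/json_activations.zip", "/tmp") would play the same role as the unzip cell above, and the returned list shows which files were extracted.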

--> 7. Create a DBFS directory activations in the /FileStore/ directory using the dbutils.fs.mkdirs command.

In [0]:
dbutils.fs.mkdirs("dbfs:/FileStore/activations/")
dbutils.fs.ls("dbfs:/FileStore/")

--> 8. Now you need to move (using the dbutils.fs.mv command) the contents of the local /tmp/activations directory
into your newly created DBFS /FileStore/activations directory so you can perform some Spark processing. If you
get an error when trying to use this command, you should look up its details so you can address the problem. Check
the contents of /FileStore/activations and confirm it contains the expected files.

In [0]:
dbutils.fs.mv("file:/tmp/activations/", "dbfs:/FileStore/activations/", True)

In [0]:
# Check the contents of the new DBFS directory

dbutils.fs.ls("dbfs:/FileStore/activations/")

--> 9. Use the dbutils.fs.head command to view the format of one of the files. The output may not be neatly formatted,
but it shows the JSON structure of the activation records.

In [0]:
v_head = dbutils.fs.head("dbfs:/FileStore/activations/2008-11.json", 1000)
print(v_head)

## The task

Your code should process a set of activation JSON files, extract the account number and device model for each activation,
and save the list to a file formatted as account-number:model

--> 10. Use wholeTextFiles to create an RDD from the activations dataset. The resulting RDD will consist of tuples, in
which the first value is the name of the file, and the second value is the contents of the file (JSON) as a string.

In [0]:
AC_RDD = sc.wholeTextFiles("dbfs:/FileStore/activations/*.json")
AC_RDD.take(5)
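
To picture the shape of the data that wholeTextFiles produces, here is a minimal plain-Python analogue that runs without Spark. It is not the Spark API: whole_text_files is a hypothetical helper that reads local files and returns the same kind of (file name, file contents) pairs.

```python
from pathlib import Path


def whole_text_files(directory, pattern="*.json"):
    """Return a list of (path, content) pairs, one pair per matching file.

    Hypothetical local stand-in for the (name, contents) tuples in the RDD.
    """
    return [
        (str(path), path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob(pattern))
    ]
```

Each pair holds an entire file as a single string, which is why the next step has to parse that string as a whole rather than line by line.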

--> 11. Each JSON file can contain many activation records; map the contents of each file to a collection of JSON records.
Take each JSON string, parse it, and return a collection of JSON records; map each record to a separate RDD element.

In [0]:
import json 

AC_RDD2 = AC_RDD.map(lambda s: json.loads(s[1]))            # parse each file's JSON string (s[1]) into a Python dict

AC_RDD2.take(1)

In [0]:
for record in AC_RDD2.take(1):
  internal_record = record["activations"]["activation"][:]
  for sub_rec in internal_record:
    print(sub_rec["account-number"] + ":" + sub_rec["model"])
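
The parsing step can be tried outside Spark on a single string. The sketch below assumes the nesting used in the cells above ("activations", then "activation", then a list of records); the sample values are invented for illustration, and extract_records is a hypothetical helper name.

```python
import json

# Hypothetical sample: the key names mirror those used in this lab,
# but the account numbers and model names are made up.
sample_file_content = """
{"activations": {"activation": [
    {"account-number": "1001", "model": "Model A"},
    {"account-number": "1002", "model": "Model B"}
]}}
"""


def extract_records(json_text):
    """Parse one file's JSON text and return its list of activation records."""
    parsed = json.loads(json_text)
    records = parsed["activations"]["activation"]
    # Assumption (not confirmed by the lab data): a file holding a single
    # activation might store a dict instead of a list, so normalise to a list.
    if isinstance(records, dict):
        records = [records]
    return records


records = extract_records(sample_file_content)
print(len(records))  # 2
```

A function like this could be passed to a Spark transformation in place of an inline lambda, which keeps the parsing logic testable on its own.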

--> 12. Map each activation record to a string in the format account-number:model

In [0]:
# A single map does not give the desired result: each parsed file is a dict whose
# record["activations"]["activation"] entry is a list of activation dicts, so map
# yields one list per file rather than one element per activation.
#ACC_MODEL_RDD = AC_RDD2.map(lambda record : record["activations"]["activation"]).map(lambda sub : sub[0]["account-number"])

# flatMap flattens those lists into one RDD element per activation record - try it
#ACC_MODEL_RDD = AC_RDD2.flatMap(lambda record : record["activations"]["activation"])

# A second map then formats each record as account-number:model
ACC_MODEL_RDD = AC_RDD2.flatMap(lambda record : record["activations"]["activation"]).map(lambda subrec : subrec["account-number"] + ":" + subrec["model"])
ACC_MODEL_RDD.take(15)
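
The difference between map and flatMap in this step can be seen with ordinary Python lists. The following sketch uses invented sample records and list operations that only imitate the two transformations; it does not call Spark.

```python
from itertools import chain

# Hypothetical parsed files: two dicts with the nesting used in this lab.
parsed_files = [
    {"activations": {"activation": [
        {"account-number": "1001", "model": "Model A"},
        {"account-number": "1002", "model": "Model B"},
    ]}},
    {"activations": {"activation": [
        {"account-number": "1003", "model": "Model C"},
    ]}},
]


def get_activations(record):
    """Return the list of activation records held in one parsed file."""
    return record["activations"]["activation"]


# map-like: one output element per file, and each element is itself a list
mapped = [get_activations(f) for f in parsed_files]

# flatMap-like: the per-file lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(get_activations(f) for f in parsed_files))

# second map: format each activation as account-number:model
formatted = [rec["account-number"] + ":" + rec["model"] for rec in flat_mapped]

print(len(mapped), len(flat_mapped))  # 2 3
print(formatted)  # ['1001:Model A', '1002:Model B', '1003:Model C']
```

With map alone there are as many elements as files; only after flattening is there one element per activation, which is what the final formatting step needs.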

--> 13. Save the formatted strings to a text file in the DBFS directory /FileStore/account-models

In [0]:
try:
  # Task 19: exception handling so the notebook can always be re-run;
  # saveAsTextFile fails if the output directory already exists
  ACC_MODEL_RDD.saveAsTextFile("dbfs:/FileStore/account-models/")
except Exception:
  print("Save failed, most likely because dbfs:/FileStore/account-models already exists; deleting it recursively and saving again")
  dbutils.fs.rm("dbfs:/FileStore/account-models", True)
  ACC_MODEL_RDD.saveAsTextFile("dbfs:/FileStore/account-models")

In [0]:
ACC_MODEL_RDD.count()

In [0]:
print(dbutils.fs.head("dbfs:/FileStore/account-models/part-00000"))
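
As a quick sanity check on the saved output, each line can be split back into its two fields. This is a minimal sketch under the assumption that account numbers never contain a colon; split_account_model is a hypothetical helper, not something the lab provides.

```python
def split_account_model(line):
    """Split one 'account-number:model' line into an (account, model) tuple."""
    # partition splits at the first colon only, so a colon inside the
    # model name (if one ever occurred) would stay in the model field.
    account, sep, model = line.rstrip("\n").partition(":")
    if not sep or not account or not model:
        raise ValueError(f"not in account-number:model format: {line!r}")
    return account, model
```

Applying it to the lines printed from part-00000 would raise a ValueError on any malformed line, which makes formatting mistakes in step 12 easy to spot.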