# Working with data stored in WASB

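Azure Blob storage (WASB) is the storage associated with an HDInsight cluster. A cluster has one default storage account and can have additional storage accounts. In the URLs below, `wasb[s]` means either `wasb` or `wasbs`; the latter uses an encrypted connection. The fully qualified URL for data in any storage account associated with the cluster is: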

    wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>
    
Data in the default storage can also be addressed with the shorter form:

    wasb[s]:///<path>
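
The two URL forms can be assembled from their parts. The following is a minimal sketch only: the `WasbPaths` object and its `wasbUrl` and `defaultWasbUrl` helpers are hypothetical names introduced for illustration and are not part of Spark or HDInsight.

```scala
object WasbPaths {
  /* Builds the fully qualified form for a container in a named storage account.
     secure = true selects the wasbs scheme instead of wasb. */
  def wasbUrl(container: String, account: String, path: String,
              secure: Boolean = false): String = {
    val scheme = if (secure) "wasbs" else "wasb"
    s"${scheme}://${container}@${account}.blob.core.windows.net/${path.stripPrefix("/")}"
  }

  /* Builds the short form that refers to the cluster's default storage. */
  def defaultWasbUrl(path: String, secure: Boolean = false): String = {
    val scheme = if (secure) "wasbs" else "wasb"
    s"${scheme}:///${path.stripPrefix("/")}"
  }
}
```

For example, `WasbPaths.defaultWasbUrl("example/data/people.json")` gives `wasb:///example/data/people.json`.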

This notebook provides examples of how to read data from WASB into a Spark context and then perform operations on that data. The notebook also provides examples of how to write the output of Spark jobs directly into a WASB location.


-------
## Read data from WASB into Spark

The examples below read from the default storage account associated with your Spark cluster so the URL used in the examples is `wasb:///<path>`. However, you can also read from an additional storage account with the following syntax:

    wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

----------
## Notebook setup

When using Spark kernel notebooks on HDInsight, there is no need to create a SparkContext or a HiveContext. Both are created for you automatically when you run the first code cell, and you'll be able to see the progress printed. The contexts are available under the following variable names:
- SparkContext (sc)
- HiveContext (sqlContext)

To run the cells below, place the cursor in the cell and then press **SHIFT + ENTER**.

### Create an RDD of strings

In [None]:
/* textLines is an RDD of strings */
val textLines = sc.textFile("wasb:///example/data/gutenberg/ulysses.txt")

### Create an RDD of key-value pairs

In [None]:
/* seqFile is an RDD of key-value pairs */
val seqFile = sc.sequenceFile[String, Int]("wasb:///example/data/people.seq")

### Create a dataframe from parquet files

Create a dataframe from an input parquet file. For more information about parquet files, see [Parquet Files](http://spark.apache.org/docs/1.6.2/sql-programming-guide.html#parquet-files) in the Spark SQL programming guide.

In [None]:
/* parquetFile is a dataframe that matches the schema of the input parquet file */
val parquetFile = sqlContext.read.parquet("wasb:///example/data/people.parquet")

### Create a dataframe from JSON document

Create a dataframe that matches the schema of the input JSON document.

In [None]:
/* jsonFile is a dataframe that matches the schema of the input JSON file */
val jsonFile = sqlContext.read.json("wasb:///example/data/people.json")

### Create an RDD from CSV files

CSV is not a first-class input data format supported by Spark, so working with a CSV file takes an extra step. For simple CSV files (no quoted fields and no commas inside values), this approach will work: first, load the CSV file as a text file, and then apply a map operation to split each line into its fields.

In [None]:
/* csvFile is an RDD of string arrays, each array holding the fields of one line in the CSV file */
val csvFile = (sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/building/building.csv")
                 .map((line) => line.split(",")))
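
One detail of `split(",")` is worth knowing: it drops trailing empty fields, so lines with missing values at the end produce shorter arrays. The sketch below keeps them by passing a limit of `-1`. The `CsvParsing` object and `parseLine` are hypothetical names used for illustration, and the sketch still assumes simple CSV without quoted fields.

```scala
object CsvParsing {
  /* A limit of -1 keeps trailing empty fields, which split(",") would drop. */
  def parseLine(line: String): Array[String] = line.split(",", -1)
}
```

In the cell above, this could replace the inline lambda: `.map(CsvParsing.parseLine)`.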

------
## Write data from Spark to WASB in different formats

The examples below show you how to write output data from Spark directly into the storage accounts associated with your Spark cluster. If you are writing to the default storage account, you can provide the output path like this:

    wasb[s]:///<path>

If you are writing to additional storage accounts associated with the cluster, you must provide the output path like this:

    wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>

### Save an RDD as text files

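If you have an RDD, you can save it as text with `saveAsTextFile`. Spark creates a directory at the given path and writes one part file per partition into it. The output path must not already exist.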

In [None]:
/* Save the textLines RDD as text files under the given output path */
textLines.saveAsTextFile("wasb:///example/data/gutenberg/ulysses2sc.txt")

### Save a dataframe as text files

If you have a dataframe that you want to save as a text file, you must first convert it to an RDD and then save that RDD as a text file.

In [None]:
/* Convert the parquetFile dataframe into an RDD of rows, then save that RDD as text files */
val parquetRDD = parquetFile.rdd
parquetRDD.saveAsTextFile("wasb:///example/data/peoplesc.txt")
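
Saving the row RDD directly writes each row in its default string form, which wraps the fields in square brackets. If a plain delimited line is preferred, the fields can be joined explicitly first. This is a sketch under assumptions: `RowText` and `toLine` are hypothetical names, and a plain `Seq[Any]` stands in for a Spark row so that the helper runs without a cluster.

```scala
object RowText {
  /* Joins the fields of one record with a separator.
     Null fields are written as empty strings. */
  def toLine(fields: Seq[Any], sep: String = "\t"): String =
    fields.map(f => if (f == null) "" else f.toString).mkString(sep)
}
```

On the cluster, it could be applied as `parquetFile.rdd.map(row => RowText.toLine(row.toSeq))` before calling `saveAsTextFile`.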

### Save a dataframe as parquet or JSON

If you have a dataframe, you can save it to Parquet or JSON with the `.write.parquet()` and `.write.json()` methods respectively.

In [None]:
parquetFile.write.parquet("wasb:///example/data/people2sc.parquet")
jsonFile.write.json("wasb:///example/data/people2sc.json")

Note that since both parquetFile and jsonFile are dataframes, we can save them in any format, regardless of the input format.


In [None]:
parquetFile.write.json("wasb:///example/data/people3sc.json")
jsonFile.write.parquet("wasb:///example/data/people3sc.parquet")

If you have an RDD and want to save it as a parquet file or JSON file, you'll have to convert it to a dataframe. See [Interoperating with RDDs](http://spark.apache.org/docs/1.6.2/sql-programming-guide.html#interoperating-with-rdds) for more information.
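
One way to do the conversion is to map each element to a case class, from which Spark can infer a schema. The sketch below shows only the parsing half, so it runs without a cluster. The `PeopleParsing` object, the `Person` type, and the `name,age` line layout are assumptions made for illustration, not something defined by the sample data.

```scala
import scala.util.Try

object PeopleParsing {
  /* Hypothetical record type; its field names would become the column names. */
  case class Person(name: String, age: Int)

  /* Parses a "name,age" line. Returns None when the line does not have
     exactly two fields or the age is not an integer. */
  def parsePerson(line: String): Option[Person] =
    line.split(",").map(_.trim) match {
      case Array(name, age) => Try(Person(name, age.toInt)).toOption
      case _                => None
    }
}
```

On the cluster, after `import sqlContext.implicits._`, a text RDD of such lines could be turned into a dataframe with `.flatMap(line => PeopleParsing.parsePerson(line).toList).toDF()` and then saved with `.write.parquet(...)` or `.write.json(...)`.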

### Save an RDD of key-value pairs as a sequence file

In [None]:
/* saveAsSequenceFile is only available on RDDs of key-value pairs; calling it on any other RDD fails to compile. */
seqFile.saveAsSequenceFile("wasb:///example/data/people2sc.seq")

### Save an RDD as a CSV file

Because CSV is not natively supported by Spark, there is no built-in way to write an RDD to a CSV file. As a workaround, join the fields of each record into a comma-separated string and save the result as a text file.

In [None]:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
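
Joining with `mkString(",")` is only safe when no field contains a comma, a double quote, or a line break. The following sketch quotes such fields using the common CSV convention of doubling embedded quotes. `CsvWriting`, `csvField`, and `csvLine` are hypothetical names, and the quoting convention is an assumption about what the consumer of the file expects.

```scala
object CsvWriting {
  /* Quotes a field only when it contains a comma, a double quote, or a
     line break; embedded double quotes are doubled. */
  def csvField(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  /* Joins already-stringified fields into one CSV line. */
  def csvLine(fields: Seq[String]): String = fields.map(csvField).mkString(",")
}
```

It could be used as `csvFile.map(fields => CsvWriting.csvLine(fields))` before `saveAsTextFile`. Note that the `split(",")` approach used for reading earlier in this notebook does not understand quoted fields.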