# Lab: Explore RDDs using Databricks

In this lab, you will use Spark to explore some Databricks datasets.

## Explore a text file
The file we'll be using is on DBFS and is called /databricks-datasets/songs/data-001/part-00001.

--> 1. Create a new notebook on Databricks running a Python kernel.

--> 2. Use the dbutils.fs.head command to view the first few lines of the file. Don't forget that '\t' represents a tab, while
'\n' represents a new line.

In [0]:
dbutils.fs.head("/databricks-datasets/songs/data-001/part-00001")

In [0]:
text = dbutils.fs.head("/databricks-datasets/songs/data-001/part-00001")
split_text = text.split("\n")

for item in split_text:
  print(item)
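
To see how the tab and newline characters shape each record, here is a small plain-Python sketch that runs without a cluster. The sample string and its field values are made up for illustration and are not taken from the songs file, whose columns may differ.

```python
# Hypothetical two-record sample: fields separated by tabs, records by newlines.
sample = "id1\tTitle One\t1999\nid2\tTitle Two\t2005\n"

# Split into lines, skip the empty string left by the trailing newline,
# then split each line into its tab-separated fields.
records = [line.split("\t") for line in sample.split("\n") if line]

for fields in records:
    print(fields)
```

Splitting on "\n" alone, as in the cell above, leaves one empty string at the end when the text finishes with a newline, which is why the sketch filters out empty lines.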

--> 3. Define an RDD to be created by reading the file in.

In [0]:
mydata = sc.textFile("/databricks-datasets/songs/data-001/part-00001")

--> 4. Spark has not yet read the file. It will not do so until you perform an operation on the RDD. Try counting the number
of lines in the dataset.

In [0]:
mydata.count()
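
The deferred read in steps 3 and 4 can be imitated in plain Python with a generator. This is only an analogy, not how Spark is implemented: read_lines and the three-line input are hypothetical stand-ins for sc.textFile and the data file.

```python
def read_lines(lines, log):
    """Hypothetical stand-in for a lazily evaluated dataset."""
    for line in lines:
        log.append(line)  # record that this line was actually read
        yield line

reads = []
lazy = read_lines(["a", "b", "c"], reads)  # defining the dataset reads nothing
reads_before = len(reads)

n = sum(1 for _ in lazy)                   # counting forces every line to be read
reads_after = len(reads)

print(reads_before, n, reads_after)
```

Nothing is appended to reads until the counting line runs, in the same way that Spark does not touch the file until count() is called.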

--> 5. Try executing the collect operation to display the data in the RDD. Note that collect returns the entire dataset to
the driver and displays it. This is convenient for very small RDDs like this one, but be careful using collect on more
typical large datasets, which may not fit in the driver's memory.

In [0]:
mydata.collect()
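
As a rough plain-Python analogy for why fetching a few records is cheaper than fetching all of them, the sketch below pulls three items from a generator that could produce a million. The numbered_lines helper is hypothetical, and Spark's take works on partitions rather than through itertools, so treat this as an illustration of the idea only.

```python
from itertools import islice

def numbered_lines(n, produced):
    """Hypothetical large dataset that records how many items it generated."""
    for i in range(n):
        produced.append(i)
        yield f"line {i}"

produced = []
# Take only the first three items; the remaining ones are never generated.
first_three = list(islice(numbered_lines(1_000_000, produced), 3))

print(first_three, len(produced))
```

On an RDD, the comparable choice is calling take with a small number instead of collect when you only want to inspect a sample.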

## Explore log files

The Databricks datasets also include some log files in DBFS under /databricks-datasets/sample_logs/. We'll start off using
just the first of these files, /databricks-datasets/sample_logs/part-00000.

--> 6. Set a variable for the data file so you do not have to retype it each time.

In [0]:
logfile = "/databricks-datasets/sample_logs/part-00000"

--> 7. Use the dbutils.fs.head command to view the file so you get an idea of the structure.

In [0]:
dbutils.fs.head(logfile)

--> 8. Create an RDD from the data file. (Don't forget to use the variable you defined earlier!)

In [0]:
RDD_logfile = sc.textFile(logfile)
RDD_logfile.count()

--> 9. Create an RDD containing only those lines that correspond to 401 errors.
--> 10. View the first 10 lines of the data using take.

In [0]:
# Keep only the lines that contain the text "401" anywhere in the line
RDD_lf_401 = RDD_logfile.filter(lambda line: "401" in line)
RDD_lf_401.take(10)
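
A substring test keeps any line that contains "401", including lines where those digits appear in a URL or a byte count. The sketch below shows a stricter check in plain Python. It assumes an Apache-style layout in which the status code is the first field after the quoted request, so confirm that against the output of step 7 before relying on it. The has_status helper and both sample lines are made up for illustration.

```python
def has_status(line, code):
    """Return True if the first field after the quoted request equals `code`.

    Assumes an Apache-style line: ... "METHOD /path PROTOCOL" STATUS BYTES
    """
    parts = line.split('"')
    if len(parts) < 3:
        return False  # no quoted request found, so no status to compare
    after_request = parts[2].split()
    return bool(after_request) and after_request[0] == code

# Hypothetical sample lines, not taken from the lab's log file.
samples = [
    '10.0.0.1 - - [01/Jan/2014:00:00:00 +0000] "GET /index.html HTTP/1.1" 401 512',
    '10.0.0.2 - - [01/Jan/2014:00:00:01 +0000] "GET /page401.html HTTP/1.1" 200 1401',
]

substring_hits = ["401" in line for line in samples]
status_hits = [has_status(line, "401") for line in samples]
print(substring_hits, status_hits)
```

If the log lines do follow this layout, the same helper could be passed to the RDD as RDD_logfile.filter(lambda line: has_status(line, "401")).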

--> 11. Sometimes you do not need to store intermediate objects in a variable, in which case you can combine the steps into a
single line of code. Combine the previous commands into a single command to count the number of 401 errors.

In [0]:
RDD_logfile.filter(lambda line: "401" in line).count()

--> 12. Now try using the map function to define a new RDD. Start with a simple map that returns the length of each line in
the log file. Taking the first five elements will produce a list of five integers, one for each of the first five lines
in the file.

In [0]:
RDD_lf_linelen = RDD_logfile.map(lambda line: len(line))
RDD_lf_linelen.take(5)

--> 13. That is not very useful. Instead, try mapping each line to a list of its words. This time, you will print out five
lists, each containing the words in the corresponding log file line.

In [0]:
RDD_logfile.map(lambda line: line.split(' ')).take(5)
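
The two maps in steps 12 and 13 can be tried on an ordinary Python list first, which makes it easy to see what each lambda returns for a single line. The two request strings below are made up for illustration. Python's built-in map is used here only as an analogy for the RDD map.

```python
# Hypothetical sample lines, not taken from the lab's log file.
lines = ["GET /a HTTP/1.1", "POST /b/c HTTP/1.1"]

# One integer per line, as in step 12.
lengths = list(map(lambda line: len(line), lines))

# One list of words per line, as in step 13.
words = list(map(lambda line: line.split(" "), lines))

print(lengths)
print(words)

# split(" ") keeps an empty string between consecutive spaces,
# while split() with no argument collapses runs of whitespace.
with_empty = "a  b".split(" ")
collapsed = "a  b".split()
print(with_empty, collapsed)
```

If the log lines contain runs of spaces, line.split() with no argument may give cleaner word lists than line.split(' ').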