# Lab: Data storage and Python

In this lab, you will explore the various data storage options available and do some (sequential) file manipulations using
Python. You should carry out the following in a Databricks notebook.

In [0]:
dbutils.fs.help()

## Use a shell script to explore some files

--> 1. Using a shell command, view the first 2 lines of the local file /databricks/driver/logs/usage.json

In [0]:
%sh 

head -2 /databricks/driver/logs/usage.json

--> 2. Use the dbutils.fs utility to view the first 961 bytes of the DBFS file /databricks-datasets/sample_logs/part-00000.

In [0]:
#%fs head("/databricks-datasets/sample_logs/part-00000")    #    ---> Test - gives 65536 bytes max by default
#dbutils.fs.head("dbfs:/databricks-datasets/sample_logs/part-00000", "maxBytes: int(100)")     # --> doesn't work
dbutils.fs.head("dbfs:/databricks-datasets/sample_logs/part-00000",961)
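For a file on the local file system, a similar "first N bytes" preview can be sketched in plain Python. This is only an illustration: the helper name head_bytes and the temporary sample file are assumptions made so the example runs outside Databricks, and it is not how dbutils.fs.head is implemented.

```python
import os
import tempfile


def head_bytes(path, max_bytes):
    """Return up to max_bytes from the start of a local file as text (hypothetical helper)."""
    with open(path, "rb") as fp:
        # read(n) returns at most n bytes; undecodable bytes are replaced, not raised
        return fp.read(max_bytes).decode("utf-8", errors="replace")


# Made-up sample file (100 bytes), created only so the sketch is runnable anywhere.
tmp = tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False)
tmp.write(b"0123456789" * 10)
tmp.close()

preview = head_bytes(tmp.name, 25)
print(preview)

os.remove(tmp.name)
```

Reading in binary mode and decoding afterwards keeps the limit in bytes rather than characters, which matches the way the exercise states its sizes.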

--> 3. In your browser, you can explore the details of the S3 settings of a publicly available bucket hosted at
https://landsat-pds.s3.amazonaws.com.

--> 4. Now we'll put a file from this dataset (from S3) in your DBFS on Databricks. Use dbutils.fs.cp to copy the file at
s3a://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_MTL.txt
to the DBFS path /FileStore/landsat_data.txt.

In [0]:
dbutils.fs.cp("s3a://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_MTL.txt", "dbfs:/FileStore/landsat_data.txt")

--> 5. View the first 2997 bytes of your newly downloaded file.

In [0]:
dbutils.fs.head("dbfs:/FileStore/landsat_data.txt", 2997)

## Count number of lines using Python

For small datasets, and a simple processing task, we can read and process a file locally line by line.

--> 6. The following code opens the local logs/usage.json file and reads it line by line. Since the file path is a string, it
must be enclosed in quotes.

fp = open("/databricks/driver/logs/usage.json", "r")
for line in fp:
  print(line)
fp.close()

This opens your file for reading ("r") and creates a file pointer (fp) that you can use to access the file's contents. It
uses a for loop to go through all the lines in the file and when finished, it closes the file.

In [0]:
fp = open("/databricks/driver/logs/usage.json", "r")
for line in fp:
  print(line)
fp.close()
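The same open, loop and close pattern can also be written with a with block, which closes the file automatically when the block ends, even if an error occurs part-way through. A minimal sketch, assuming a small throwaway file created with tempfile (the file and its contents are made up for illustration and are not part of the lab data):

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch runs anywhere.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("first line\nsecond line\n")
tmp.close()

lines_seen = []

# The with block closes fp for us, so no explicit fp.close() is needed.
with open(tmp.name, "r") as fp:
    for line in fp:
        # Each line still ends with "\n"; strip it to avoid blank lines in the output.
        text = line.rstrip("\n")
        lines_seen.append(text)
        print(text)

os.remove(tmp.name)
```

Stripping the newline is optional; without it, print adds a second line break after every line, which is why the plain loop shows blank lines between entries.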

--> 7. To introduce a bit of reusability, you can first declare a cell which defines a variable containing your file path, and then
modify the cell defined in point 6 to use the variable. You should therefore have one cell, preceding the cell you created
in point 6, containing:

filepath = "/databricks/driver/logs/usage.json"

And the modified code should now contain:

In [0]:
filepath = "/databricks/driver/logs/usage.json"

In [0]:
# and now --> 

fp = open(filepath, "r")
for line in fp:
  print(line)
fp.close()

--> 8. To demonstrate the reusability, copy the sample_logs/part-00000 file from DBFS to your local file space (so you can
run your local data based function):

In [0]:
dbutils.fs.cp("/databricks-datasets/sample_logs/part-00000", "file:/tmp/sample_log.txt")

--> 9. Now change the filepath variable to correspond to "/tmp/sample_log.txt" and run the print lines cell again. Did the
output change?

In [0]:
filepath = "/tmp/sample_log.txt"

fp = open(filepath, "r")
for line in fp:
  print(line)
fp.close()

--> 10. Python can count the lines in the file using a counter variable. Add another cell which does this - it is a slightly
modified version of the previous pass through the file:

In [0]:
fp = open(filepath, "r")
count = 0

for line in fp:
  count += 1
  
fp.close()
print("Number of lines: ", count)

Changing the value of the filepath variable and running both the cells which pass through the file will now run them
both on the changed file!

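--> 11. (Optional) Is the number you found the same if you use the wc shell command? (Hint: it may not be - why do you
think this is?)   --> answer - wc -l counts newline characters, so if the last line of the file does not end with a newline,
wc reports one fewer line than the Python loop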

In [0]:
%sh 
echo "Number of lines: " 
cat "/tmp/sample_log.txt" | wc -l
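One reason the two counts can differ is that wc -l counts newline characters, while a Python loop over a file also yields a final line that has no trailing newline. The sketch below reproduces both ways of counting in Python on a made-up three-line file whose last line is deliberately left unterminated; the file and variable names are assumptions for illustration only.

```python
import os
import tempfile

# Made-up sample file: three lines, but the last one has no trailing newline.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("a\nb\nc")
tmp.close()

# Count the way the lab's loop does: one per line yielded by the file object.
with open(tmp.name, "r") as fp:
    python_count = sum(1 for _ in fp)

# Count the way wc -l does: one per newline character in the raw bytes.
with open(tmp.name, "rb") as fp:
    newline_count = fp.read().count(b"\n")

print("Lines seen by the Python loop:", python_count)
print("Newline characters (wc -l style):", newline_count)

os.remove(tmp.name)
```

If both methods agree on your file, its last line ends with a newline.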

## Using functions for code reuse

Another option for reusing the same bit of code is to define a function, which is passed the path to a file and prints the
number of lines in that file.

--> 12. Insert the following function definition in a new cell. Note that executing this cell doesn't produce any output.

In [0]:
def print_num_lines_in_file(filename):
  fp = open(filename, "r")

  counter = 0

  for line in fp:
    counter += 1

  # close the file once the whole file has been read
  fp.close()

  print("Number of lines in", filename, "is", counter)

--> 13. You need to invoke the function that you have defined. Try it on the file you've already explored:

In [0]:
filepath = "/tmp/sample_log.txt"
print_num_lines_in_file(filepath)

In [0]:
filepath2 = "/databricks/driver/logs/usage.json"
print_num_lines_in_file(filepath2)
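A function that only prints its result cannot feed that result into later code. As a hedged variation on the idea above, the sketch below returns the count instead, so the caller decides what to do with it. The name count_lines_in_file and the temporary sample file are hypothetical and exist only to make the example runnable outside Databricks.

```python
import os
import tempfile


def count_lines_in_file(filename):
    """Return the number of lines in a text file (hypothetical helper)."""
    with open(filename, "r") as fp:
        return sum(1 for _ in fp)


# Made-up sample file with three lines.
tmp = tempfile.NamedTemporaryFile("w", suffix=".log", delete=False)
tmp.write("alpha\nbeta\ngamma\n")
tmp.close()

num_lines = count_lines_in_file(tmp.name)
print("Number of lines in", tmp.name, "is", num_lines)

os.remove(tmp.name)
```

Returning the value keeps the counting logic separate from the presentation, so the same function could be used to compare or sum the sizes of several files.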

--> 14. (Optional) Can you create a new function which takes a file path and prints the number of words (separated by
whitespace) on the first line of the file? (Hint: an option would be to use readline, split and len functions). You
should run this function on /tmp/sample_log.txt.

In [0]:
def first_line_words(filename):
  fp = open(filename, "r")
  read_text = fp.readline()
  fp.close()

  # split() with no argument splits on any whitespace and drops the trailing newline
  words = read_text.split()

  print(read_text)
  print(words)
  print("\nNumber of words in this line : ", len(words))


filepath = "/tmp/sample_log.txt"
first_line_words(filepath)
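When counting whitespace-separated words, the argument passed to split matters: split(" ") cuts at every single space, so repeated spaces produce empty strings and the trailing newline stays attached to the last item, whereas split() with no argument treats any run of whitespace as one separator. A small sketch on a made-up line (not taken from the lab's log file):

```python
# Hypothetical line with a double space and a trailing newline.
line = "GET  /index.html 200\n"

by_single_space = line.split(" ")
by_whitespace = line.split()

print(by_single_space, len(by_single_space))
print(by_whitespace, len(by_whitespace))
```

Here the single-space split reports four items, one of them empty, while the whitespace split reports the three actual words.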
