In [0]:
dbutils.fs.ls("dbfs:/FileStore/tables/")

## Explore web log files

--> In this exercise you will be reducing and joining large datasets, which can take some time. You may wish to perform the
labs below using a smaller dataset, consisting of only a few of the web log files, rather than all of them. Remember that you
can specify a wildcard; e.g. textFile("/FileStore/logs/*6") would include only filenames ending with the digit 6, while
textFile("/FileStore/logs/*6*") would match filenames containing the digit 6.
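
To make the difference between the two wildcards concrete, here is a small local sketch that needs no Spark. It uses Python's `fnmatch` as a stand-in for the path globbing Spark performs (an approximation that holds for simple `*` patterns), and the file names are hypothetical, chosen only for illustration.

```python
from fnmatch import fnmatch

# Hypothetical file names, for illustration only
names = ["2013-06-01", "2013-09-16", "2013-10-26", "2013-11-12"]

# "*6" matches names that END with the digit 6
ends_with_6 = [n for n in names if fnmatch(n, "*6")]

# "*6*" matches names that CONTAIN the digit 6 anywhere
contains_6 = [n for n in names if fnmatch(n, "*6*")]

print(ends_with_6)
print(contains_6)
```

If the real file names carry an extension such as `.log`, the extension has to be part of the pattern (for example `*6.log`), otherwise `*6` matches nothing.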

--> 2. Using map-reduce, count the number of requests from each user.

(a) Use map to create a Pair RDD with the user ID as the key, and the integer 1 as the value. (The user ID is the
third field in each line.) Your data will look something like this:

In [0]:
dbutils.fs.ls('dbfs:/FileStore/tables/logs/')

In [0]:
# Just to check what kind of data is present in each log file
print(dbutils.fs.head('dbfs:/FileStore/tables/logs/2013-10-01.log', 1000))

In [0]:
# Create an RDD from a subset of the log files
log_RDD = sc.textFile('dbfs:/FileStore/tables/logs/*01.log')                  # Only files whose names end in 01.log (the 1st of each month)

#log_RDD.count()    # 36,851 records
log_RDD.take(2)

In [0]:
log_RDD_M1 = log_RDD.map(lambda string : string.split()[2]).map(lambda user: (user, 1))      # log_RDD_M1 holds (userID, 1) pairs, one per request
log_RDD_M1.take(5)

(b) Use reduce to sum the values for each user ID.

In [0]:
log_RDD_R1 = log_RDD_M1.reduceByKey(lambda x, y : x+y)     # log_RDD_R1 holds (userID, hit count) pairs for the selected log files
log_RDD_R1.take(5)
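
As a way to check the logic of steps (a) and (b) without a cluster, the following plain-Python sketch mimics what `map` followed by `reduceByKey` computes. The log lines are hypothetical and shortened, and `reduce_by_key` is an illustrative helper, not a Spark API.

```python
# Hypothetical, shortened log lines: IP is the first field, user ID the third
lines = [
    "3.94.78.5 - 69827 GET /a.html",
    "19.38.140.62 - 21475 GET /b.html",
    "3.94.78.5 - 69827 GET /c.html",
    "129.133.56.105 - 2489 GET /a.html",
    "3.94.78.5 - 69827 GET /d.html",
]

# Step (a): one (userID, 1) pair per request
pairs = [(line.split()[2], 1) for line in lines]


def reduce_by_key(pairs, func):
    """Illustrative helper: combine the values of equal keys with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


# Step (b): sum the 1s for each user ID
hits = reduce_by_key(pairs, lambda x, y: x + y)
print(hits)
```

Spark applies the same combining function, but does so per partition first and then across partitions, which is why the function should be associative and commutative.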

--> 3. Use countByKey to determine how many users visited the site for each frequency. That is, how many users visited
once, twice, three times and so on.

(a) Use map to reverse the key and value, like this: (5, userid),(7, userid),(9, userid)...

In [0]:
log_RDD_MRev = log_RDD_R1.map(lambda tupl : (tupl[1],tupl[0])).sortByKey()

log_RDD_MRev.take(5)

(b) Use the countByKey action to return a collection of frequency:user-count pairs.

In [0]:
log_RDD_MRev.countByKey()                    # countByKey() is an action: it returns a dictionary of {frequency: number of users} to the driver.
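
The frequency count in step 3 can be sketched locally as follows. The (userID, hit count) pairs are hypothetical stand-ins for the contents of `log_RDD_R1`, and `Counter` plays the role of `countByKey`.

```python
from collections import Counter

# Hypothetical (userID, hit count) pairs standing in for log_RDD_R1
hits = [("69827", 3), ("21475", 1), ("2489", 1), ("4712", 2), ("101", 3)]

# Step (a): swap key and value so the hit count becomes the key
swapped = [(count, user) for user, count in hits]

# Step (b): count how many users share each hit count
freq = Counter(count for count, _ in swapped)
print(dict(freq))
```

Because the result is a dictionary collected on the driver, sorting the RDD beforehand is not needed for correctness; sort the dictionary's items afterwards if an ordered display is wanted.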

--> 4. Create an RDD where the user ID is the key, and the value is the list of all the IP addresses that user has connected
from. (IP address is the first field in each request line.)

Initially --> (userID, 20.1.34.55)

Finally --> (userID, [20.1.34.55, 74.125.12.32, ...])

In [0]:
# Map 1: keep fields 0 and 2 (IP address and user ID). Map 2: swap them into a (userID, IP) tuple. distinct() then drops duplicate pairs.

log_RDD_M2 = log_RDD.map(lambda line : line.split()[0:3:2]).map(lambda fields: (fields[1], fields[0])).distinct()
log_RDD_M2.take(10)

In [0]:
# groupByKey() collects the IP addresses for each key (user ID); sortByKey(ascending=False) then orders the keys in descending order

log_RDD_R2 = log_RDD_M2.groupByKey().sortByKey(ascending=False)

log_RDD_R2.take(10)                         # Each value is an iterable of IP addresses; the next cell prints its contents.

In [0]:
for tupl in log_RDD_R2.take(10):                 # log_RDD_R2 is already sorted by key in descending order
    print(tupl[0], ":", list(tupl[1]))
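
The shape of the result in step 4 can be illustrated with a local sketch. The log lines are hypothetical, a set stands in for `distinct()`, and a `defaultdict` stands in for `groupByKey()`.

```python
from collections import defaultdict

# Hypothetical, shortened log lines: IP is the first field, user ID the third
lines = [
    "3.94.78.5 - 69827 GET /a.html",
    "3.94.78.5 - 69827 GET /b.html",      # same user, same IP: a duplicate pair
    "74.125.12.32 - 69827 GET /c.html",   # same user, second IP
    "19.38.140.62 - 21475 GET /a.html",
]

# Build distinct (userID, IP) pairs; the set removes duplicates
pairs = {(f[2], f[0]) for f in (line.split() for line in lines)}

# Group the IPs under each user ID (sorted only to make the output stable)
ips_by_user = defaultdict(list)
for user, ip in sorted(pairs):
    ips_by_user[user].append(ip)

for user, ips in ips_by_user.items():
    print(user, ":", ips)
```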

## Join web log data with account data

In the Sqoop exercise you completed earlier, you imported data files containing Loudacre's customer account data from
MySQL to HDFS. Review that data now (located in /loudacre/accounts). The first field in each line is the user ID, which
corresponds to the user ID in the web server logs. The other fields include account details such as creation date, first and
last name and so on.

In [0]:
print(dbutils.fs.head("dbfs:/FileStore/tables/accounts/part-m-00001", 1000))

--> 5. Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains the user account
information and the number of website hits for that user.

(a) Create an RDD based on the accounts data consisting of key/value-array pairs: (userid,[values...])

In [0]:
acc_RDD = sc.textFile("dbfs:/FileStore/tables/accounts/part*")                  # Load every accounts part file into acc_RDD

acc_RDD.count()

In [0]:
acc_RDD.take(4)

In [0]:
# keyBy(f) builds (f(line), line) pairs: here the key is the user ID and the value is the full, unsplit line

acc_RDD_M1 = acc_RDD.keyBy(lambda line : line.split(',')[0])     

acc_RDD_M1.take(5)

(b) Join the Pair RDD with the set of user-id/hit-count pairs calculated in the first step.

In [0]:
acc_RDD_J1 = acc_RDD_M1.join(log_RDD_R1)

acc_RDD_J1.take(5)

(c) Display the user ID, hit count, and first name (3rd value) and last name (4th value) for the first 5 elements, e.g.:

In [0]:
acc_RDD_M2 = acc_RDD_J1.map(lambda line: (line[0],line[1][1],line[1][0]))
acc_RDD_M2.take(5)

In [0]:
acc_RDD_M3 = acc_RDD_M2.map(lambda line : (line[0], line[1], line[2].split(',')[3], line[2].split(',')[4]))
acc_RDD_M3.take(5)                                # First 5 elements having User ID, hit-count, First Name and Last Name
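
The join in step 5 can be traced end to end with a small local sketch. The account lines are hypothetical and shortened to five fields (user ID, creation date, a placeholder, first name, last name), and the hit counts are made up. A dictionary is used for the lookup, which assumes each user ID appears once per side.

```python
# Hypothetical, shortened account lines: userID,created,placeholder,first,last
accounts = [
    "69827,2013-01-05,N,Ada,Lopez",
    "21475,2012-11-30,N,Ben,Okafor",
    "555,2011-07-19,N,Cy,Young",      # no hits: dropped by the inner join
]

# Hypothetical (userID -> hit count) results; 9999 has no account
hits = {"69827": 3, "21475": 1, "9999": 4}

# (a) key each account line by its user ID, as keyBy does
keyed = {line.split(",")[0]: line for line in accounts}

# (b) inner join: keep only the keys present on both sides
joined = {k: (keyed[k], hits[k]) for k in keyed if k in hits}

# (c) pick out user ID, hit count, first name and last name
report = []
for user, (line, count) in joined.items():
    fields = line.split(",")
    report.append((user, count, fields[3], fields[4]))

print(report)
```

Like this sketch, the RDD `join` is an inner join, so users without any hits in the selected log files do not appear in the output.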

## Challenges

A few extra challenges to practise working with Spark:

Challenge 1: Use keyBy to create an RDD of account data with the postal code (9th field in the CSV file, 5 digits) as
the key.
- Tip: Assign this new RDD to a variable for use in the next challenge

In [0]:
acc_RDD_M4 = acc_RDD.keyBy(lambda line : line.split(',')[8])               # Postal code as the key into the new RDD
acc_RDD_M4.take(5)

Challenge 2: Create a pair RDD with postal code as the key and a list of names (Last Name,First Name) in that postal code
as the value.
- Hint: First name and last name are the 4th and 5th fields respectively
- Optional: Try using the mapValues operation

In [0]:
acc_RDD_M5 = acc_RDD_M4.map(lambda pair : (pair[0], (pair[1].split(",")[4], pair[1].split(",")[3]))).sortByKey()     # value is (Last Name, First Name)
acc_RDD_M5.take(15)

In [0]:
# Optional - Using mapValues() operation; the value becomes (Last Name, First Name)

acc_RDD_Opt = acc_RDD_M4.mapValues(lambda line : (line.split(",")[4], line.split(",")[3])).sortByKey()
acc_RDD_Opt.take(15)

Challenge 3: Sort the data by postal code, then for the first five postal codes, display the code and list the names in
that postal zone, e.g.

In [0]:
# groupByKey() shuffles the data and does not preserve the earlier sort, so sort again after grouping.

acc_RDD_Grp = acc_RDD_Opt.groupByKey().sortByKey()
acc_RDD_Grp.take(5)

In [0]:
# Displaying the data using a for loop

for item in acc_RDD_Grp.take(5):
  print("Postal Code: ", item[0])
  print(" Names:", [val for val in item[1] ], "\n")

In [0]:
# Changing the format : 

for item in acc_RDD_Grp.take(5):
  print("--- ", item[0])
  for val in item[1]:
    print("\t", val[0], ",", val[1])
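
For reference, the three challenges can be traced on a handful of records with plain Python. The account lines below are hypothetical; only their field positions (first name 4th, last name 5th, postal code 9th) follow the description above. Splitting on commas assumes no field contains a comma.

```python
from collections import defaultdict

# Hypothetical account lines; postal code is the 9th field
accounts = [
    "1,2013-01-05,N,Ada,Lopez,1 Main St,Reno,NV,89501",
    "2,2012-11-30,N,Ben,Okafor,2 Oak Ave,Austin,TX,73301",
    "3,2011-07-19,N,Cy,Young,3 Elm Rd,Reno,NV,89501",
]

# Challenges 1 and 2: postal code -> list of (Last Name, First Name)
names_by_zip = defaultdict(list)
for line in accounts:
    f = line.split(",")
    names_by_zip[f[8]].append((f[4], f[3]))

# Challenge 3: sort by postal code and keep the first five codes
result = sorted(names_by_zip.items())[:5]

for code, names in result:
    print("---", code)
    for last, first in names:
        print("\t" + last + "," + first)
```

The sort is applied after grouping here, which mirrors the point that grouping in Spark involves a shuffle and should not be relied on to keep an earlier ordering.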