# Lab: S3 and MapReduce

We'll write mapper and reducer programs in Python to do a word count calculation using multiprocessing. The (optional)
extension applies this to a piece of text hosted in your Amazon S3 bucket. The initial parts of the lab show you how to
create an Amazon S3 bucket; once that is done, we'll move on to the Python programs.

## AWS Academy Data Analytics Lab 1

The instructions for this lab are on the AWS Academy Canvas page. Follow the link to Modules and then Lab 1.
--> 1. This lab shows you how to store data in Amazon S3. It should take you no more than 30 minutes.

In [0]:
# Finished on 10/Mar/2021

## Creating MapReduce functions using Python

We will write Python (not pyspark) code which returns the frequency of words within a given piece of text. As seen in the
MapReduce lecture and demo, we need to create two Python files, a mapper and a reducer, for Hadoop jobs. The mapper
will split a chunk of text on white space and count the number of occurrences of a word in the chunk. The reducer will
combine the mapped word counts into a single value for each word. Keep in mind that each reducer task is guaranteed to
see all values for the same key, but may also see multiple keys. In this lab, we'll only run the code on the Databricks driver
node.
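
As a rough illustration of the key guarantee described above, the sketch below groups mapper output by word before reducing, so every count for a given word ends up in the same place. It is a simplified variant in which the mapper emits a (word, 1) pair per word rather than per-chunk totals. The function names (`map_chunk`, `shuffle`, `reduce_key`) and the sample text are made up for this illustration; in a real Hadoop job the framework performs the grouping step.

```python
from collections import defaultdict

def map_chunk(chunk):
    """Emit a (word, 1) pair for every whitespace-separated word in the chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Group values by key, so that all counts for one word sit together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_key(key, values):
    """Combine all values seen for a single key into one total."""
    return key, sum(values)

# Hypothetical input: three small chunks of text
chunks_of_text = ["the cat sat", "the dog sat", "the end"]
pairs = [pair for chunk in chunks_of_text for pair in map_chunk(chunk)]
totals = dict(reduce_key(k, v) for k, v in shuffle(pairs).items())
print(totals)
```

Here a single call to `reduce_key` sees every value for its word, while the loop as a whole handles several different keys, which mirrors the "all values for the same key, but possibly multiple keys" behaviour.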

## MapReduce in Python

This builds on the demo, the code for which is available from Blackboard as a Databricks notebook. You'll implement a
simple version of the mapper and reducer functions for wordcount. Note that the functions are written in Python (not
pyspark) and therefore do not use the SparkContext.

--> 2. Define a variable text which contains some text. Feel free to include punctuation, extra spaces etc.

In [0]:
dummy_text = "    This is line 1\nThis is line 2\nThis is line 3\nThis is line 4\nUh oh, this is the end of the Dummy_text variable   "
print(dummy_text)

--> 3. Find a Python method which removes any white space from the start and the end of an input string. Test that it
works as expected by applying it to your text string (you may need to adjust its contents so you can find out whether
your application of the method works).

In [0]:
dummy_text = dummy_text.strip()
print(dummy_text)

--> 4. Divide the text cleaned up in Point 3 into a list of individual words.

In [0]:
dummy_text = dummy_text.replace(",", " ,")
dummy_text = dummy_text.replace("\n", " ")      # Necessary to remove linebreaks
dummy_list = dummy_text.split(" ")
print(dummy_list)
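
When the text contains repeated spaces or line breaks, the argument passed to `split` matters. The small sketch below (the `sample` string is invented for illustration) compares splitting on a single space with splitting on any run of whitespace.

```python
sample = "one  two\nthree"   # note the double space and the line break

by_space = sample.split(" ")   # splits on every single space character
by_any = sample.split()        # splits on runs of any whitespace

print(by_space)
print(by_any)
```

Splitting on `" "` leaves an empty string where the two spaces meet and keeps the line break inside a token, whereas `split()` with no argument returns only the three words.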

--> 5. Turn the list of words into a dictionary, where the key is each word and the value is its frequency in the text output
by Point 4. You should use Counter for this purpose. An example of use is below:

In [0]:
from collections import Counter

dummy_dict = Counter(dummy_list)
print(dummy_dict)

--> 6. Use the points above to define the mapper function: this function should take a piece of text as an argument:

In [0]:
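def mapper(text):
  # Normalise the text: trim the ends, lower-case it, and separate commas from words
  text = text.strip().lower()
  text = text.replace(",", " ,")
  # split() with no argument splits on any run of whitespace (spaces, line breaks),
  # so blank lines and repeated spaces do not produce empty-string "words"
  words = text.split()
  return Counter(words)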

In [0]:
#and it should return a Counter dictionary of words with their frequencies in text.
mapper(dummy_text)

--> 7. Now for the reducer function. This should take two arguments, both Counter dictionaries, and use the update method
to update the values in cnt1 with the values in cnt2. It should return the updated Counter dictionary. I.e., its header
should look like:

In [0]:
# Dictionary - update() method
def reducer(cnt1, cnt2):
  cnt1.update(cnt2)
  return cnt1
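
It may help to see why `update` on a Counter is the right tool here. The sketch below uses made-up counts to show that `Counter.update` adds the incoming counts to the existing ones, whereas `update` on a plain dictionary overwrites them.

```python
from collections import Counter

cnt1 = Counter({"this": 2, "is": 1})
cnt2 = Counter({"this": 1, "line": 4})

# Counter.update adds counts for matching keys and inserts new keys
cnt1.update(cnt2)
print(cnt1)

# A plain dict replaces the old value instead of adding to it
plain = {"this": 2}
plain.update({"this": 1})
print(plain)
```

Note that `update` changes `cnt1` in place, so the reducer returns the same object it was given as its first argument.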

--> 8. You can use readlines to read the contents of a file into a list.

(a) Copy your pride_and_prejudice.txt file from /FileStore/ to your local /tmp directory.

In [0]:
dbutils.fs.cp("dbfs:/FileStore/pride_and_prejudice.txt", "file:/tmp/tables/pride_and_prejudice.txt")

In [0]:
dbutils.fs.ls("file:/tmp/tables/")

In [0]:
pnp_file = "file:/tmp/tables/pride_and_prejudice.txt"
dbutils.fs.head(pnp_file, 1000)

(b) Modify the following code to read the local file you've just created into a list called data.

In [0]:
# The with statement closes the file automatically, even if reading fails
with open("/tmp/tables/pride_and_prejudice.txt", "r") as file:
  data = file.readlines()

--> 9. Next you can copy and paste the chunks and chunk_mapper functions, which divide the input into chunks and run
the mapper and reducer over each chunk. If you have a compatibility issue (i.e. mapper or reducer expect more /
fewer inputs) when you try to run the chunk mapper, double check that all mapper and reducer arguments are as
expected.

In [0]:
from math import ceil
from functools import reduce

def chunks(l, n):
  """Yield at most n successive chunks from l."""
  list_len = len(l)
  chunk_length = ceil(list_len/n)
  for i in range(0, list_len, chunk_length):
    yield l[i:i + chunk_length]

def chunk_mapper(chunk):
  mapped = map(mapper, chunk)
  reduced = reduce(reducer, mapped)
  return reduced
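
To see what the chunking does, the sketch below runs the same logic on a small made-up list. The `chunks` function is repeated here only so that the example runs on its own; `lines` is illustrative data.

```python
from math import ceil

def chunks(l, n):
    """Same logic as the lab's chunks function, repeated so this runs on its own."""
    list_len = len(l)
    chunk_length = ceil(list_len / n)
    for i in range(0, list_len, chunk_length):
        yield l[i:i + chunk_length]

lines = ["line %d" % i for i in range(10)]

four = list(chunks(lines, 4))   # chunk length is ceil(10 / 4) = 3
six = list(chunks(lines, 6))    # chunk length is ceil(10 / 6) = 2

print([len(c) for c in four])
print(len(six))
```

Because the chunk length is rounded up, the last chunk can be shorter than the others, and asking for 6 chunks of 10 items produces only 5. For the lab this does not affect the final word counts, only how the work is split.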

--> 10. Copy the pride_and_prejudice.txt file we created for one of the homeworks from the /FileStore to local /tmp/.

In [0]:
# Already done

--> 11. Read its contents into a variable called data using readlines.

In [0]:
# Already done above

print(data)

--> 12. Set up data_chunks so it divides your data into 32 chunks.

In [0]:
data_chunks = chunks(data, 32)

--> 13. Now you can use multiprocessing to run your code in parallel!

In [0]:
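from multiprocessing import Pool

# The with block shuts the worker processes down once the map has finished
with Pool(8) as pool:
  # step 1: run chunk_mapper on each chunk in parallel
  mapped = pool.map(chunk_mapper, data_chunks)
# step 2: combine the per-chunk counters into a single counter
reduced = reduce(reducer, mapped)
print(reduced)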
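
One way to sanity-check the parallel result is to compare it with a single-process count over the same lines. The sketch below is a minimal version of that idea: `serial_wordcount` and `sample_lines` are hypothetical names, and the `mapper` shown is a simplified stand-in rather than the lab's exact function.

```python
from collections import Counter
from functools import reduce

def mapper(text):
    """Word counts for one line (a simplified stand-in for the lab's mapper)."""
    return Counter(text.lower().split())

def reducer(cnt1, cnt2):
    """Add the counts in cnt2 to cnt1 and return cnt1."""
    cnt1.update(cnt2)
    return cnt1

def serial_wordcount(lines):
    """Single-process reference: map every line, then fold the results together."""
    return reduce(reducer, map(mapper, lines), Counter())

sample_lines = ["It is a truth", "universally acknowledged", "that it is"]
expected = serial_wordcount(sample_lines)

# most_common(k) lists the k words with the highest counts
print(expected.most_common(2))
```

In the notebook, with your own mapper and reducer in place of the stand-ins, `serial_wordcount(data) == reduced` should hold, since adding counts gives the same totals however the lines are grouped into chunks.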

## Optional: run your MapReduce code on data you have stored in your S3 bucket
### AWS Educate: Create a (lasting) S3 bucket

The AWS Academy environment is set up for you, which is very helpful for getting started quickly. It also means that
everything is torn down at the end of the lab, so you don't need to worry about running out of credits. However, various
firewall rules are applied, which means you can't always do everything you may want with your resources. In
this section, you will create an S3 bucket that you can use from other programs, though you will pay for any objects you
upload, download or store in it (in credits in your AWS Educate account; it will be within the free tier limits in AWS
Free Tier). You will get to create an S3 bucket next week as well, so you will be able to practise these skills
again.

--> 14. Log into your AWS Educate account and go to classroom (BDTT 2020-2021; you may have to follow a new invite).

--> 15. Click on AWS console (you may need to click on Continue first and allow pop-ups from this source).

--> 16. Under AWS services, find and select S3.

--> 17. The first time you use this service, the bucket overview page will show no buckets. Click Create bucket.

--> 18. Enter Bucket name: follow the rules for naming. Ensure that you make a note of your bucket name.

--> 19. Leave AWS region set to us-east-1 (only us-east-1, Northern Virginia, is supported by AWS Educate Account).

--> 20. Leave all other selections, including "Block all public access", as they are.

--> 21. Click Create bucket. This creates your bucket and returns you to the Buckets summary page.

--> 22. Select the bucket you have just created from the list of buckets on the Buckets summary page.

--> 23. You will see there are no objects in it. Click Upload to upload your data. Remember to organize your storage well
(i.e. consider using folders when that would make your life easier!) Select the file you wish to upload from your local
machine.

--> 24. Leave other settings on defaults. This includes the storage class (standard access) and access control lists giving only
you permission to read.

--> 25. Download the ebook (Alice's Adventures in Wonderland) from https://www.gutenberg.org/files/11/11-0.txt to
your local machine.

--> 26. Click on Upload, select the file you've just downloaded and wait for the upload to finish (the status will change to
Succeeded).

--> 27. You can view the contents of the bucket by clicking on the bucket name from the Buckets summary page.

## Read the data from your S3 bucket in your Python notebook
The following Python code reads the file <my key> from the bucket <my bucket> in your Python notebook (in the cell below, the bucket is ambyaws and the key is pride_and_prejudice.txt).

In [0]:
# importing the boto3 library
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# connect to S3 using a boto3 client that sends unsigned (anonymous) requests
s3_client = boto3.client(service_name='s3', config=Config(signature_version=UNSIGNED))

# get the S3 object
result = s3_client.get_object(Bucket='ambyaws', Key='pride_and_prejudice.txt')

# read the text file line by line using splitlines
for line in result["Body"].read().splitlines():
  each_line = line.decode('utf-8')
  print(each_line)

--> If you run this (even after replacing <my bucket> and <my key> with your values) on Databricks Community Edition,
it'll fail with an Access Denied error. Databricks Community Edition doesn't appear to allow you to store your credentials,
so to read from your own S3 bucket you'll need to make the object public. Warning: You need to ensure that
you do not leave the object with public permissions after the lab is over.
  
--> For the above script to work, the S3 bucket must allow public access, and each object you want to read must also be made public individually.

--> 28. Click on the bucket name.

--> 29. Select the Permissions tab.

--> 30. Click Edit under Block public access.

--> 31. Untick Block all public access and click on Save changes.

--> 32. Type conrm in the Edit Block public access box and click conrm.

--> 33. Return to the bucket and select the object you want to make public and choose Make public under Actions.

Now you can read your data in using the boto3 code above and it'll work! You should run the MapReduce program now
using the second part of this lab.
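
To reuse the MapReduce code from the second part of the lab, the bytes read from S3 need to become a list of lines like `data`. The sketch below shows one possible way to do that; `body_to_lines` is a hypothetical helper, and the `raw` bytes are a stand-in for what `result["Body"].read()` would return.

```python
def body_to_lines(raw_bytes, encoding="utf-8-sig"):
    """Decode the raw bytes of a text object and split them into lines.

    "utf-8-sig" behaves like UTF-8 but also drops a leading byte-order mark
    if the file has one.
    """
    return raw_bytes.decode(encoding).splitlines()

# Stand-in for result["Body"].read(); the real bytes would come from S3
raw = b"\xef\xbb\xbfAlice was beginning\r\nto get very tired\r\n"
data = body_to_lines(raw)
print(data)
```

In the notebook, something like `data = body_to_lines(result["Body"].read())` should give a list that can be passed to `chunks(data, 32)` as before. The body is a stream, so call `read()` once and keep the result.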

## Stop public access to your objects
Even if you don't delete them, you need to stop the objects in your bucket being public. If you leave the items
public and someone finds them, they could run your account out of credit without your knowledge.

--> 34. Navigate to the bucket.

--> 35. Click on Permissions.

--> 36. Click Edit under Block public access.

--> 37. Tick Block all public access and click on Save changes.

--> 38. Type confirm in the Edit Block public access box and click confirm.

## Delete your S3 bucket
Keeping data in S3 costs money (or credits), so unless you plan on reusing the data in Standard storage, you should delete
the bucket.

--> 39. Click the radio button next to the bucket you want to delete and click on Delete.

--> 40. If the bucket has contents, you will need to empty it first; the UI guides you through the steps for this.

(a) Click on the link in the box that's appeared containing Buckets must be empty before they can be deleted.
To delete all objects in the bucket, use the empty bucket configuration.

(b) Type the required phrase in the confirm box and click Empty.

--> 41. Return to the Bucket summary page, select the bucket and click on Delete.

--> 42. Type the required phrase in the confirm box and click Delete.

You should now not be incurring any charges in your AWS Educate account.