# SI 330: Homework 4: APIs on AWS


## Due: Friday, February 9, 2018,  11:59:00pm

### Submission instructions
After completing this homework, you will turn in two files via Canvas -> Assignments -> HW 4:
* your notebook, named si330-hw4-YOUR_UNIQUE_NAME.ipynb, and
* the HTML file, named si330-hw4-YOUR_UNIQUE_NAME.html.

### Name:  Dingan Chen
### Uniqname: dinganc
### People you worked with: I worked by myself

## Top-Level Goal
To create a microservice that returns the counts of all bigrams in a text passage.



## Learning Objectives
After completing this Lab, you should know how to:
* create an AWS Lambda function that takes a string and returns the counts of all bigrams in that text
* write an AWS API Gateway integration which allows both GET and POST requests to access an AWS Lambda
* write documentation for the microservice that you've created

### Note: See end of notebook for notes about going "Above and Beyond"

### Outline of Steps For Analysis
Here's an overview of the steps that you'll need to do to complete this lab.
1. Upload data to an S3 bucket
2. Create an AWS Lambda function that normalizes, tokenizes, and creates and counts bigrams from text, both via a POST request with the text and via a GET request to a URL that returns the text (e.g. an S3 bucket)
3. Create a Python code block in this notebook to demonstrate the functionality of your microservice

Each of these steps is detailed below.
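
As a rough orientation, the normalize, tokenize, and count stages named in the outline can be sketched in a few lines of plain Python. This is a minimal sketch, not the required solution: the normalization rules (lowercasing and dropping punctuation), the choice to split passages on blank lines, and the function names `normalize`, `tokenize`, and `count_bigrams` are all assumptions made for illustration.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and drop punctuation (assumed normalization rules)."""
    return re.sub(r"[^\w\s]", "", text.lower())

def tokenize(text):
    """Split the text into passages on blank lines, then each passage into words."""
    passages = re.split(r"\n\s*\n", text)
    return [p.split() for p in passages if p.strip()]

def count_bigrams(token_lists):
    """Count adjacent word pairs; pairs never span two passages."""
    counts = Counter()
    for words in token_lists:
        counts.update(zip(words, words[1:]))
    return counts

# Hypothetical sample text, used only to exercise the three stages
sample = "It is a truth. It is a fact!\n\nA truth it is."
counts = count_bigrams(tokenize(normalize(sample)))
print(counts[("it", "is")])
```

Keeping the three stages as separate functions makes each one easy to test on its own before the code is pasted into the Lambda editor.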

## Step 1: Upload data to an S3 bucket
To get ready to test the POST functionality of the code you generate in the next step, you should upload a text file that is **500 or fewer lines** to an S3 bucket.  See the description of CORS for an explanation of why we want to put the data in the same domain (amazonaws.com) as the Lambda.

Follow the same approach that we used in the lab to upload a small text file to your S3 bucket, ensuring that the permissions are set to allow public access.

### <font color="magenta">Q1: Enter the URL of your text file</font>

 https://s3.us-east-2.amazonaws.com/bucket-for-si330-hmwk4-dinganc/PrideAndPrejudice.txt

## Step 2: Create an AWS Lambda function that normalizes, tokenizes, and creates and counts bigrams from text

Similar to what we did in the lab, you're going to create a microservice that consists of two parts: an AWS Lambda and an API Gateway.  You can use exactly the same technique that we did in the lab to get started.

You will need to modify the code in the Lambda to handle two types of requests:
1. A GET request with a queryStringParameter of url=http://some.url.goes.here/text.txt, which specifies the location of the text to be processed and
2. A POST request with the text to be processed included as the "text" value in the body payload.
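
The two request types above can be told apart from the event that API Gateway passes to the Lambda. The following is a minimal sketch, assuming the event carries the `httpMethod`, `queryStringParameters`, and `body` keys that the starter code in this notebook reads; the helper name `extract_input` and the sample events are hypothetical.

```python
import json

def extract_input(event):
    """Return ("url", value) for a GET, ("text", value) for a POST, else (None, None)."""
    method = event.get("httpMethod")
    if method == "GET":
        # queryStringParameters may be None when no query string is supplied
        params = event.get("queryStringParameters") or {}
        if "url" in params:
            return "url", params["url"]
    elif method == "POST":
        # The body arrives as a JSON string and must be parsed first
        body = json.loads(event.get("body") or "{}")
        if "text" in body:
            return "text", body["text"]
    return None, None

# Hand-built sample events (assumed shapes, for illustration only)
get_event = {"httpMethod": "GET",
             "queryStringParameters": {"url": "http://some.url.goes.here/text.txt"}}
post_event = {"httpMethod": "POST", "body": json.dumps({"text": "a b c"})}

print(extract_input(get_event))
print(extract_input(post_event))
```

Returning the kind of input alongside its value lets the rest of the handler decide whether it still needs to fetch the text from a URL.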

### The following code block is a reasonable starting point for creating your Lambda.  Note that this code should not be run in this notebook but rather serve as the starting point for your work in the Lambda editor.

**NOTE** Please see https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python for hints about how to create bigrams without NLTK.

In [16]:
"""
PUT SOME DOCUMENTATION HERE
"""
import json
import re
from botocore.vendored import requests # This line has been added. 
# You'll need to figure out how to use this requests, 
# but it works the same way as the requests module (called using ```import requests```) in python.

def lambda_handler(event, context):
    method = event['httpMethod']
    text = ""
    d = {"text": ""}
    # Handle GET method
    if method == 'GET':
        params = event['queryStringParameters']
        if params:
            url = ... # retrieve the text from the URL
    if method == 'POST':
        body = json.loads(event['body'])
        if 'text' in body:
            pass
    # normalize
    # tokenize
    # find bigrams
    # NOTE: see https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python
    #       for hints about how to create bigrams
    # count bigrams
    
    # Note the strict format of the return dictionary
    # It must contain these three elements, and the body
    # must be a stringified JSON object (i.e. you have to call 
    # json.dumps on the JSON structure you're returning)
    return { 
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(d),
    }

### <font color="magenta">Q2a: Enter the URL of your Lambda</font>

https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/FromTXTToBigrams

### <font color="magenta">Q2b: Copy your final Lambda code into the following code block (but do not run it here)</font>

In [17]:
"""
This API takes the URL of a text file and returns the bigram counts as a JSON-formatted string
"""
import json
import re
from botocore.vendored import requests
from collections import defaultdict
# normalize: fetch the text, convert it to lowercase, and strip the BOM and punctuation
def get_text_and_normalize(url):
    r=requests.get(url)
    r.encoding='utf-8'
    text=r.text.lower().strip()
    text=re.sub(r'\ufeff','',text)
    text=re.sub(r'[^\w\s]','',text)
    return text
# tokenize: split the text into passages on blank lines (\r\n\r\n) and join the
# lines of each passage with spaces; the words are split later, in bigram_count()
def tokenize(text):
    sentences = [re.sub(r'\r\n', ' ', i) for i in re.split(r'\r\n\r\n', text) if i != '']
    return sentences

# find bigrams
# NOTE: it's very difficult to set up NLTK on Lambda, so you'll need to find bigrams "manually"
# NOTE: see https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python
#       for hints about how to create bigrams
    
# count bigrams
    
# Note the strict format of the return dictionary
# It must contain these three elements, and the body
# must be a stringified JSON object (i.e. you have to call 
# json.dumps on the JSON structure you're returning)

def bigram_count(sentences):
    # count adjacent word pairs within each sentence
    bigram_k = defaultdict(int)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - 1):
            bigram_k[(words[i], words[i + 1])] += 1
    # most frequent bigrams first
    return sorted(bigram_k.items(), key=lambda x: x[1], reverse=True)

def lambda_handler(event, context):
    method = event['httpMethod']
    d = {"bigrams": ""}
    # GET: the URL of the text file arrives as the "url" query string parameter
    if method == 'GET':
        params = event['queryStringParameters']
        if params and 'url' in params:
            url = params['url']
            try:
                d['bigrams'] = bigram_count(tokenize(get_text_and_normalize(url)))
            except Exception:
                d['error'] = 'error in parsing linked text file'
    # POST: the URL of the text file arrives as the "url" value in the JSON body
    if method == 'POST':
        body = json.loads(event['body'])
        if 'url' in body:
            url = body['url']
            try:
                d['bigrams'] = bigram_count(tokenize(get_text_and_normalize(url)))
            except Exception:
                d['error'] = 'error in parsing linked text file'
    # the response must contain statusCode, headers, and a stringified JSON body
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(d),
    }
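
The handlers in this notebook return status 200 even when an `error` entry is placed in the body. One possible refinement is sketched below, under the assumption that a 400 status is an acceptable signal for a request that could not be processed. The helper names `make_response` and `respond` are hypothetical and not part of the deployed Lambda.

```python
import json

def make_response(status_code, payload):
    """Build the three-element dictionary the Lambda must return."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # the body must be a JSON string, not a dict
    }

def respond(compute, *args):
    """Run compute(*args); report success as 200 and any failure as 400 (assumed choice)."""
    try:
        return make_response(200, {"bigrams": compute(*args)})
    except Exception as exc:
        return make_response(400, {"error": str(exc)})

# A stand-in computation that succeeds, and one that fails
ok = respond(lambda: [[["my", "dear"], 4]])
bad = respond(lambda: 1 / 0)
print(ok["statusCode"], bad["statusCode"])
```

A client can then check `response.status_code` instead of looking for an `error` key in the decoded body.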

## Step 3: Demonstrate the GET and POST functionality of your Lambda

### <font color="magenta">Q3: Create a code block that uses `requests` to demonstrate the functionality of your Lambda.  You can modify the template below or create your own.</font>

In [18]:
import requests
import json

lambdaURL = 'https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/FromTXTToBigrams' # change this URL
textURL = 'https://s3.us-east-2.amazonaws.com/bucket-for-si330-hmwk4-dinganc/PrideAndPrejudice.txt' # change this URL

# Demonstrate the GET functionality by passing the URL of your text file in S3 to your Lambda as a GET request
response = requests.get(lambdaURL + '?url=' + textURL)
bigrams = json.loads(response.text)

print ('printing top 5, with GET method')

for pair in bigrams['bigrams'][:5]:
    print('\n')
    print("(",pair[0][0],",",pair[0][1],"), which appears",pair[1]," times")# you should make this print something nicer

# Demonstrate the POST functionality by passing the URL of the text file as a JSON parameter to requests.post()
# note that this Lambda expects a "url" key in the POST body and retrieves the text from S3 itself

d = {"url": textURL}
response = requests.post(lambdaURL, json = d)
bigrams = json.loads(response.text)

print ('\nprinting top 5, with POST method')
for pair in bigrams['bigrams'][:5]:
    print('\n')
    print("(",pair[0][0],",",pair[0][1],"), which appears",pair[1]," times") # you should make this print something nicer

printing top 5, with GET method


( that , he ), which appears 5  times


( my , dear ), which appears 4  times


( mr , bennet ), which appears 4  times


( it , is ), which appears 2  times


( a , single ), which appears 2  times

printing top 5, with POST method


( that , he ), which appears 5  times


( my , dear ), which appears 4  times


( mr , bennet ), which appears 4  times


( it , is ), which appears 2  times


( a , single ), which appears 2  times
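
The GET demonstration above builds the query string by plain concatenation, which works for this text URL but could break if the URL itself contained characters such as `&` or `#`. A minimal sketch of an alternative that percent-encodes the parameter with the standard library follows; the helper name `build_get_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_get_url(lambda_url, text_url):
    """Append text_url as a percent-encoded 'url' query parameter."""
    return lambda_url + "?" + urlencode({"url": text_url})

lambdaURL = 'https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/FromTXTToBigrams'
textURL = 'https://s3.us-east-2.amazonaws.com/bucket-for-si330-hmwk4-dinganc/PrideAndPrejudice.txt'

print(build_get_url(lambdaURL, textURL))
```

Passing `params={'url': textURL}` to `requests.get()` is another way to have the query string encoded automatically.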


## Save your notebook, download it as HTML and submit both the .ipynb and .html files to Canvas

## Notes about going "Above and Beyond"

There are ample opportunities for extending this homework assignment.  You might, for example, decide to break the microservice into three separate ones (normalizing, tokenizing, and creating bigrams).  Alternatively, you might invest time into getting NLTK data into Lambda so you can use its functionality (see https://stackoverflow.com/questions/42394335/paths-in-aws-lambda-with-python-nltk).  Another interesting investigation might be to use the addition of a data file to an S3 bucket as a trigger to run the bigram analysis, perhaps writing the results to another (public) bucket.
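
For the S3-trigger idea, the Lambda would receive an S3 event notification rather than an API Gateway request. The sketch below shows one way to pull the bucket name and object key out of such an event, assuming the standard `Records` layout of S3 notifications. The helper name `s3_objects_from_event`, the bucket name, and the sample event are hypothetical.

```python
from urllib.parse import unquote_plus

def s3_objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # skip records that are not S3 notifications
        bucket = s3["bucket"]["name"]
        # Object keys are URL-encoded in notifications, so decode them
        key = unquote_plus(s3["object"]["key"])
        yield bucket, key

# Hand-built sample event containing only the fields used above
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "texts/Pride+And+Prejudice.txt"}}}
    ]
}
print(list(s3_objects_from_event(sample_event)))
```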

**IF YOU CHOOSE TO GO ABOVE AND BEYOND, YOU _MUST_ CHANGE THE FOLLOWING MARKDOWN BLOCK**

## Above and Beyond

Indicate here why you believe that your work should be considered "above and beyond".

I separated normalizing, tokenizing, and creating bigrams into three different microservices.

text retrieval:
    invoke: https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/texttrtrival
    parameter:
        url: the URL of the text file
            example: 'https://s3.us-east-2.amazonaws.com/bucket-for-si330-hmwk4-dinganc/PrideAndPrejudice.txt'
    response:
        a JSON string, '{"text": "text content of the file"}'

tokenization:
    invoke: https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/tokenization
    parameter:
        text: a JSON string containing the text to tokenize
            example: '{"text": "however little known the feelings or views of such a man may be on his first ent it is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife"}'
    response:
        a JSON string, '{"sentences": a list of tokenized sentences}'

generate bigrams:
    invoke: https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/genbigram
    parameter:
        sentences: a JSON string containing the list of sentences
            example: '{"sentences": ["however little known the feelings or views of such a man", "may be on his first ent it is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife"]}'
    response:
        a JSON string, '{"bigrams": a list of [bigram, count] pairs, sorted by descending count}'
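
The three services are meant to be chained: the raw JSON response of one call becomes the parameter of the next. The sketch below illustrates that hand-off offline. The function `run_pipeline` and the stand-in `fake_post` are hypothetical; `fake_post` only imitates the request and response shapes documented above so the chaining can be checked without network access.

```python
import json

def run_pipeline(text_url, post):
    """Chain the three services; post(endpoint, payload) returns the raw JSON response string."""
    retrieved = post("texttrtrival", {"url": text_url})
    tokenized = post("tokenization", {"text": retrieved})
    counted = post("genbigram", {"sentences": tokenized})
    return json.loads(counted)["bigrams"]

def fake_post(endpoint, payload):
    """Offline stand-in that imitates the three services (illustration only)."""
    if endpoint == "texttrtrival":
        return json.dumps({"text": "a b a b\r\n\r\nb a"})
    if endpoint == "tokenization":
        # The parameter is itself a JSON string, so it is decoded a second time
        text = json.loads(payload["text"])["text"]
        return json.dumps({"sentences": [s for s in text.split("\r\n\r\n") if s]})
    if endpoint == "genbigram":
        sentences = json.loads(payload["sentences"])["sentences"]
        counts = {}
        for sentence in sentences:
            words = sentence.split()
            for pair in zip(words, words[1:]):
                counts[pair] = counts.get(pair, 0) + 1
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        return json.dumps({"bigrams": ranked})
    raise ValueError("unknown endpoint: " + endpoint)

result = run_pipeline("https://example.com/text.txt", fake_post)
print(result)
```

Replacing `fake_post` with a function that calls `requests.post()` on the corresponding invoke URL and returns `response.text` would exercise the deployed services instead.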

In [19]:
# Code for text retrieval
"""
This API takes the URL of a text file and returns the normalized text content as a JSON-formatted string
"""
import json
import re
from botocore.vendored import requests
from collections import defaultdict
# normalize: fetch the text, convert it to lowercase, and strip the BOM and punctuation
def get_text_and_normalize(url):
    r=requests.get(url)
    r.encoding='utf-8'
    text=r.text.lower().strip()
    text=re.sub(r'\ufeff','',text)
    text=re.sub(r'[^\w\s]','',text)
    return text


def lambda_handler(event, context):
    method = event['httpMethod']
    d = {"text": ""}
    # GET: the URL of the text file arrives as the "url" query string parameter
    if method == 'GET':
        params = event['queryStringParameters']
        if params and 'url' in params:
            url = params['url']
            try:
                d['text'] = get_text_and_normalize(url)
            except Exception:
                d['error'] = 'error in retrieving linked text file'
    # POST: the URL of the text file arrives as the "url" value in the JSON body
    if method == 'POST':
        body = json.loads(event['body'])
        if 'url' in body:
            url = body['url']
            try:
                d['text'] = get_text_and_normalize(url)
            except Exception:
                d['error'] = 'error in retrieving linked text file'
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(d),
    }

In [20]:
import requests
import json

lambdaURL = 'https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/texttrtrival' 
textURL = 'https://s3.us-east-2.amazonaws.com/bucket-for-si330-hmwk4-dinganc/PrideAndPrejudice.txt'

# Demonstrate the GET functionality by passing the URL of your text file in S3 to your Lambda as a GET request
response = requests.get(lambdaURL + '?url=' + textURL)
text = json.loads(response.text)

print (text['text'][:200])


# Demonstrate the POST functionality by passing the text as a JSON parameter to requests.post()

d = {"url": textURL}
response = requests.post(lambdaURL, json = d)
text = json.loads(response.text)

print (text['text'][:200])

rtr_text=response.text

it is a truth universally acknowledged that a single man in possession
of a good fortune must be in want of a wife

however little known the feelings or views of such a man may be on his
first ent
it is a truth universally acknowledged that a single man in possession
of a good fortune must be in want of a wife

however little known the feelings or views of such a man may be on his
first ent


In [21]:
# Code for tokenization
"""
This API takes the JSON string returned by the text retrieval call above and returns the tokenized sentences as a JSON-formatted string
"""
import json
import re
from botocore.vendored import requests
from collections import defaultdict

# tokenize: split the text into passages on blank lines (\r\n\r\n) and join the
# lines of each passage with spaces
def tokenize(text):
    sentences = [re.sub(r'\r\n', ' ', i) for i in re.split(r'\r\n\r\n', text) if i != '']
    return sentences

def lambda_handler(event, context):
    method = event['httpMethod']
    d = {"sentences": ""}
    # GET: the "text" query string parameter holds the JSON string from the text retrieval service
    if method == 'GET':
        params = event['queryStringParameters']
        if params and 'text' in params:
            try:
                # decoding happens inside the try so malformed input is reported, not raised
                text = json.loads(params['text'])['text']
                d['sentences'] = tokenize(text)
            except Exception:
                d['error'] = 'error in tokenizing provided text'
    # POST: the "text" value in the JSON body holds the same JSON string
    if method == 'POST':
        body = json.loads(event['body'])
        if 'text' in body:
            try:
                text = json.loads(body['text'])['text']
                d['sentences'] = tokenize(text)
            except Exception:
                d['error'] = 'error in tokenizing provided text'
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(d),
    }

In [22]:
import requests
import json

lambdaURL = 'https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/tokenization'
response = requests.get(lambdaURL + '?text=' + rtr_text)

print (response.text[:200])

d = {'text':rtr_text}
response = requests.post(lambdaURL, json = d)

print (response.text[:200])
rtr_sent=response.text
# you should make this print something nicer

{"sentences": ["it is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife", "however little known the feelings or views of such a man may be on
{"sentences": ["it is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife", "however little known the feelings or views of such a man may be on


In [23]:
# Code for generating bigram count
"""
This API takes a JSON-formatted list of sentences and returns the bigram counts as a JSON-formatted string
"""
import json
import re
from botocore.vendored import requests
from collections import defaultdict

# find bigrams
# NOTE: it's very difficult to set up NLTK on Lambda, so you'll need to find bigrams "manually"
# NOTE: see https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python
#       for hints about how to create bigrams
    
# count bigrams
    
# Note the strict format of the return dictionary
# It must contain these three elements, and the body
# must be a stringified JSON object (i.e. you have to call 
# json.dumps on the JSON structure you're returning)

def bigram_count(sentences):
    # count adjacent word pairs within each sentence
    bigram_k = defaultdict(int)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - 1):
            bigram_k[(words[i], words[i + 1])] += 1
    # most frequent bigrams first
    return sorted(bigram_k.items(), key=lambda x: x[1], reverse=True)

def lambda_handler(event, context):
    method = event['httpMethod']
    d = {"bigrams": ""}
    # GET: the "sentences" query string parameter holds the JSON string from the tokenization service
    if method == 'GET':
        params = event['queryStringParameters']
        if params and 'sentences' in params:
            try:
                # decoding happens inside the try so malformed input is reported, not raised
                sentences = json.loads(params['sentences'])['sentences']
                d['bigrams'] = bigram_count(sentences)
            except Exception:
                d['error'] = 'error in generating bigram from provided token list'
    # POST: the "sentences" value in the JSON body holds the same JSON string
    if method == 'POST':
        body = json.loads(event['body'])
        if 'sentences' in body:
            try:
                sentences = json.loads(body['sentences'])['sentences']
                d['bigrams'] = bigram_count(sentences)
            except Exception:
                d['error'] = 'error in generating bigram from provided token list'
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(d),
    }

In [24]:
import requests
import json

lambdaURL = 'https://vnbny82y4k.execute-api.us-east-1.amazonaws.com/prod/genbigram' # change this URL
response = requests.get(lambdaURL + '?sentences=' + rtr_sent)
bigrams = json.loads(response.text)

print ('\nprinting top 5, with GET method')
for pair in bigrams['bigrams'][:5]:
    print('\n')
    print("(",pair[0][0],",",pair[0][1],"), which appears",pair[1]," times") 

d = {'sentences':rtr_sent}
response = requests.post(lambdaURL, json = d)
bigrams = json.loads(response.text)

print ('\nprinting top 5, with POST method')
for pair in bigrams['bigrams'][:5]:
    print('\n')
    print("(",pair[0][0],",",pair[0][1],"), which appears",pair[1]," times") 


printing top 5, with GET method


( that , he ), which appears 5  times


( my , dear ), which appears 4  times


( mr , bennet ), which appears 4  times


( it , is ), which appears 2  times


( a , single ), which appears 2  times

printing top 5, with POST method


( that , he ), which appears 5  times


( my , dear ), which appears 4  times


( mr , bennet ), which appears 4  times


( it , is ), which appears 2  times


( a , single ), which appears 2  times
