# Lab 3.1: Extracting Text from Webpages and Images

In this lab, you will use Beautiful Soup to extract text from a webpage and load the results into a pandas dataframe.

In the second part of the lab, you will experiment with Amazon Textract to extract text from images.


## Lab steps

To complete this lab, you will follow these steps:

1. [Extracting information from a webpage](#1.-Extracting-information-from-a-webpage)
2. [Extracting text from images](#2.-Extracting-text-from-images)
    


In [None]:
#Upgrade dependencies
!pip install --upgrade pip
!pip install --upgrade sagemaker
!pip install --upgrade beautifulsoup4
!pip install --upgrade html5lib
!pip install --upgrade requests
!pip install --upgrade textract-trp

## 1. Extracting information from a webpage
([Go to top](#Lab-3.1:-Extracting-Text-from-Webpages-and-Images))

In this section, you will use Beautiful Soup to extract the titles, authors, summaries, publish dates, and hyperlinks from blog posts. The extracted text could then be used in a downstream NLP task, such as topic extraction, sentiment analysis, text-to-speech, or translation.

Start by importing both the **Beautiful Soup** and **requests** packages.

In [None]:
from bs4 import BeautifulSoup
import requests

The page you will parse is the [AWS Machine Learning blog](https://aws.amazon.com/blogs/machine-learning/) at https://aws.amazon.com/blogs/machine-learning/.

Using your web browser, open the AWS Machine Learning blog page.

Use the browser's *inspector mode* to discover the structure of the page. In Mozilla Firefox and Google Chrome, you can open the inspector by pressing CTRL+SHIFT+C. If you use a different browser, consult the browser documentation.

View the different elements of the webpage by moving your pointer over the page. Move the pointer over the following elements, and see whether you can find the tags that are used to identify the information:

* Title of the blog post
* Author
* Date published
* Text summary
* Hyperlink to the blog post

Don't worry if you can't find all the tags. The following code walkthrough will help you find tags.


First, use the **requests** library to load the webpage. Before you proceed, confirm that the HTTP status code is *200*.

In [None]:
page = requests.get('https://aws.amazon.com/blogs/machine-learning/')
page.status_code
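
Outside a notebook, the status check can be made explicit in code rather than inspected by eye. The following is a minimal sketch, not part of the original lab: `fetch_html` is a hypothetical helper name, the 10-second timeout is an arbitrary assumption, and the `get` parameter exists only so that the function can be tried without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as bytes, or raise on an HTTP error status.

    Hypothetical helper for illustration. By default it calls requests.get;
    a different callable can be passed in for offline experiments.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.content
```

With such a helper, the result of `fetch_html('https://aws.amazon.com/blogs/machine-learning/')` could be passed to `BeautifulSoup` in place of `page.content`.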

Load the **content** from the page into a **soup** object.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')

View the entire page by using the `soup.prettify()` function.

**Note:** The content from the AWS Blogs page might be lengthy. To move to the next task, scroll down in this notebook.

In [None]:
print(soup.prettify())

All the elements on the page can be accessed using dot (.) notation. Thus, to view the title, you could use `soup.title`. If you want only the text, use the `text` property as follows:

In [None]:
print(soup.title.text)
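
Dot notation returns only the first matching tag, and it returns `None` when no such tag exists. The following minimal sketch illustrates this behavior on a small, made-up HTML string (not the AWS Blogs page), so it runs without network access. The names `sample_html` and `sample_soup` are chosen for this example only.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; it is not taken from the AWS Blogs page.
sample_html = """
<html>
  <head><title>Sample Blog</title></head>
  <body>
    <h2>First heading</h2>
    <h2>Second heading</h2>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Dot notation returns the first matching tag in the tree.
print(sample_soup.title.text)
print(sample_soup.h2.text)

# find_all() returns every match, not just the first one.
all_headings = [h2.text for h2 in sample_soup.find_all('h2')]
print(all_headings)

# A tag that does not exist comes back as None, so check before using .text.
print(sample_soup.article is None)
```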

When you used the inspector to search for tags on the AWS Blogs page, you might have found that blog-post content is marked with `<article>` tags, which indicate a self-contained unit of content.

In [None]:
print(soup.article.prettify())

Review the output. Can you find the title?

The title can be found at `soup.article.h2.span`:

In [None]:
print(soup.article.h2.span.prettify())

To display only the text, use the `text` property:

In [None]:
print(soup.article.h2.span.text)

Find the publish date of the article:

In [None]:
print(soup.article.time.text)

Next, extract the article summary:

In [None]:
print(soup.article.section.p.text)

The author name is in the footer. A blog post can have multiple authors. However, for now, retrieve only the *first author*:

In [None]:
print(soup.article.footer.span.prettify())

The hyperlink to the full article text is the last piece of information that you must find:

In [None]:
print(soup.article.section.a['href'])

You have now identified all the relevant elements. You can find all the articles by using the `find_all()` function. You can then loop through the results and output information about the blog post, such as the title, author, and so on.

For example, the following code uses `find_all()` to find every article, and then uses `find_all()` again to loop through the authors of each article:

In [None]:
for article in soup.find_all('article'):
    print('==========================================')
    print(article.h2.span.text)
    authors = article.footer.find_all('span', {"property":"author"})
    print('by', end=' ')
    for author in authors:
        if author.span is not None:
            print(author.span.text, end=', ')
    print(f'on {article.time.text}')
    print(article.section.p.text)
    print(article.section.a['href'])
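
The loop above assumes that every `<article>` contains all the expected tags. If a tag is missing, dot notation returns `None` and the next attribute access raises an `AttributeError`. The following sketch shows one more defensive way to collect the same fields. It runs on a small made-up HTML snippet that only imitates the structure used in this walkthrough (the real page may differ), and `text_or_none` and `parse_article` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure used in the walkthrough.
sample_html = """
<article>
  <h2><span>First post</span></h2>
  <footer>
    <span property="author"><span>Ana</span></span>
    <span property="author"><span>Ben</span></span>
    <time>05 MAR 2021</time>
  </footer>
  <section>
    <p>Summary one.</p>
    <a href="https://example.com/post-1">Read more</a>
  </section>
</article>
<article>
  <h2><span>Second post</span></h2>
  <footer>
    <span property="author"><span>Cy</span></span>
    <time>06 MAR 2021</time>
  </footer>
  <section><p>Summary two.</p></section>
</article>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_article(article):
    """Collect the fields of one <article> tag into a dict, tolerating gaps."""
    heading = article.h2
    footer = article.footer
    section = article.section

    author_tags = []
    if footer is not None:
        author_tags = footer.find_all('span', {'property': 'author'})
    authors = [tag.span.get_text(strip=True)
               for tag in author_tags if tag.span is not None]

    link_tag = section.a if section is not None else None
    return {
        'title': text_or_none(heading.span if heading is not None else None),
        # join() avoids the trailing comma that print(..., end=', ') leaves.
        'authors': ', '.join(authors),
        'published': text_or_none(article.time),
        'summary': text_or_none(section.p if section is not None else None),
        'link': link_tag.get('href') if link_tag is not None else None,
    }


sample_soup = BeautifulSoup(sample_html, 'html.parser')
posts = [parse_article(article) for article in sample_soup.find_all('article')]
for post in posts:
    print(post)
```

In this sample, the second article has no hyperlink, so its `link` value is `None` instead of an exception being raised.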
    

After you figure out the data format, you can add the results to a list:

In [None]:
blog_posts = []
for article in soup.find_all('article'):
    authors = article.footer.find_all('span', {"property":"author"})
    author_text = []
    for author in authors:
        if author.span is not None:
            author_text.append(author.span.text)
    blog_posts.append([article.h2.span.text, ', '.join(author_text), article.time.text, article.section.p.text, article.section.a['href'] ])
    

Next, load the list into a pandas dataframe:

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(blog_posts, columns=['title','authors','published','summary','link'])

You must convert the **published** column to a `datetime` value.

In [None]:
df['published'] = pd.to_datetime(df['published'])
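
Once the column holds `datetime` values, you can sort and filter the posts by date. The following sketch illustrates this on made-up rows. The date strings such as `05 MAR 2021` and the matching format `%d %b %Y` are assumptions for this example, and the real page may format dates differently. The sketch also assumes English month abbreviations.

```python
import pandas as pd

# Made-up rows for illustration; the third row has an unparseable date.
sample_posts = [
    ['First post', 'Ana, Ben', '05 MAR 2021', 'Summary one.', 'https://example.com/post-1'],
    ['Second post', 'Cy', '17 FEB 2021', 'Summary two.', 'https://example.com/post-2'],
    ['Third post', 'Di', 'unknown', 'Summary three.', 'https://example.com/post-3'],
]
sample_df = pd.DataFrame(
    sample_posts, columns=['title', 'authors', 'published', 'summary', 'link'])

# An explicit format avoids guessing; errors='coerce' turns values that
# do not match the format into NaT instead of raising an exception.
sample_df['published'] = pd.to_datetime(
    sample_df['published'], format='%d %b %Y', errors='coerce')

# Sort newest first; rows with NaT are placed last by default.
newest_first = sample_df.sort_values('published', ascending=False)
print(newest_first[['title', 'published']])
```

Passing `errors='coerce'` is a design choice: it keeps the rest of the dataframe usable when one row has an unexpected date, at the cost of silently producing missing values that should be checked afterward.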

Adjust the column width for pandas, and display the first five rows of the dataframe:

In [None]:
# Show the full contents of each column instead of truncating long strings
pd.set_option('display.max_colwidth', None)
df.head()

Now that the data is in a pandas dataframe, you can use this data in downstream NLP tasks. You will come back to this data in Module 5.

## 2. Extracting text from images
([Go to top](#Lab-3.1:-Extracting-Text-from-Webpages-and-Images))

In this section, you will extract the text from an image by using Amazon Textract.

For this exercise, you will use the following simple image. This file was loaded into Amazon Simple Storage Service (Amazon S3) when you started the lab.

![Image of a simple document](../s3/simple-document-image.jpg)

Start by importing the library for the AWS SDK for Python (Boto3).

In [None]:
import boto3

Set up the variables for the bucket and document name.

In [None]:
# Document
s3BucketName = "c51302a798363l1767466t1w753256443787-labbucket-vzr6xg8irt81"
documentName = "lab31/simple-document-image.jpg"

Extract text from the image by calling the Amazon Textract application programming interface (API).

In [None]:
# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

print(response)
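
Before you parse the response, it can help to see which kinds of blocks it contains. The following sketch counts block types in a hand-written, heavily trimmed stand-in for a response. The `sample_response` dictionary is an assumption for illustration: a real response contains many more fields for each block (for example, geometry information).

```python
from collections import Counter

# Hand-written, trimmed stand-in for a text-detection response.
# A real response includes additional fields for each block.
sample_response = {
    'DocumentMetadata': {'Pages': 1},
    'Blocks': [
        {'BlockType': 'PAGE', 'Id': 'p1'},
        {'BlockType': 'LINE', 'Id': 'l1', 'Text': 'Hello world', 'Confidence': 99.1},
        {'BlockType': 'WORD', 'Id': 'w1', 'Text': 'Hello', 'Confidence': 99.3},
        {'BlockType': 'WORD', 'Id': 'w2', 'Text': 'world', 'Confidence': 98.9},
    ],
}

# Count how many blocks of each type the response contains.
block_counts = Counter(block['BlockType'] for block in sample_response['Blocks'])
for block_type, count in block_counts.items():
    print(f'{block_type}: {count}')
```

The same counting expression should work on the `response` variable from the cell above, because it relies only on the `Blocks` list and the `BlockType` key.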

The response looks unformatted, but the **Blocks** list contains the key information that you need. 

Extract this information from the **Blocks** list:

In [None]:
# Print text
print("\nText\n========")
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')  # ANSI escape codes print the line in blue
        text = text + " " + item["Text"]
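
Each block also carries a `Confidence` score between 0 and 100, which can be used to drop text that was detected with low certainty. The following sketch wraps the extraction in a function and applies a threshold. `extract_lines` is a hypothetical helper name, the threshold of 90 is an arbitrary assumption, and `sample_response` is a hand-written stand-in for a real response.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of each LINE block at or above a confidence threshold."""
    return [
        block['Text']
        for block in response.get('Blocks', [])
        if block['BlockType'] == 'LINE'
        and block.get('Confidence', 0.0) >= min_confidence
    ]


# Hand-written, trimmed stand-in for a text-detection response.
sample_response = {
    'Blocks': [
        {'BlockType': 'PAGE', 'Id': 'p1'},
        {'BlockType': 'LINE', 'Id': 'l1', 'Text': 'Hello world', 'Confidence': 99.2},
        {'BlockType': 'LINE', 'Id': 'l2', 'Text': 'smudged text', 'Confidence': 41.5},
        {'BlockType': 'WORD', 'Id': 'w1', 'Text': 'Hello', 'Confidence': 99.3},
    ],
}

all_lines = extract_lines(sample_response)
confident_lines = extract_lines(sample_response, min_confidence=90)

# Joining with ' '.join() avoids the leading space that repeated
# string concatenation with a separator produces.
print(' '.join(all_lines))
print(' '.join(confident_lines))
```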

You have now extracted the text from the image. You can use this text in a downstream NLP task.

You will now experiment with one additional image. This image contains *tables* of text.

![Image of Employment Application](../s3/employmentapp.png)

Set the new document name:

In [None]:
# Document
documentName = "lab31/employmentapp.png"

Call the Amazon Textract API again. However, this time, specify the **TABLES** feature type:

In [None]:
# Call Amazon Textract again, reusing the client that you created earlier
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])


Parse the table by using the Amazon Textract results parser (**textract-trp**).

**Note:** You installed the Amazon Textract results parser when you ran the `pip install --upgrade textract-trp` command at the start of this notebook.

In [None]:
from trp import Document
doc = Document(response)

for page in doc.pages:
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))
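
To see roughly what a results parser has to do, the following sketch rebuilds a table directly from raw blocks. In a response, a `TABLE` block lists its `CELL` blocks as children, each cell has a 1-based `RowIndex` and `ColumnIndex`, and each cell lists its `WORD` blocks as children. The helper names `cell_text` and `tables_from_response` are hypothetical, `sample_response` is a hand-written and trimmed stand-in, and the sketch ignores merged cells and selection elements.

```python
def cell_text(cell, blocks_by_id):
    """Join the text of the WORD blocks that a CELL block references."""
    words = []
    for relationship in cell.get('Relationships', []):
        if relationship['Type'] != 'CHILD':
            continue
        for child_id in relationship['Ids']:
            child = blocks_by_id[child_id]
            if child['BlockType'] == 'WORD':
                words.append(child['Text'])
    return ' '.join(words)


def tables_from_response(response):
    """Return each table as a list of rows, where a row is a list of strings."""
    blocks_by_id = {block['Id']: block for block in response['Blocks']}
    tables = []
    for block in response['Blocks']:
        if block['BlockType'] != 'TABLE':
            continue
        cells = {}  # maps (row, column) to the cell text
        for relationship in block.get('Relationships', []):
            if relationship['Type'] != 'CHILD':
                continue
            for cell_id in relationship['Ids']:
                cell = blocks_by_id[cell_id]
                if cell['BlockType'] == 'CELL':
                    key = (cell['RowIndex'], cell['ColumnIndex'])
                    cells[key] = cell_text(cell, blocks_by_id)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        # RowIndex and ColumnIndex start at 1, so the ranges start at 1 too.
        tables.append([[cells.get((row, col), '') for col in range(1, n_cols + 1)]
                       for row in range(1, n_rows + 1)])
    return tables


# Hand-written, trimmed stand-in for a response that contains one 2x2 table.
sample_response = {
    'Blocks': [
        {'BlockType': 'TABLE', 'Id': 't1',
         'Relationships': [{'Type': 'CHILD', 'Ids': ['c11', 'c12', 'c21', 'c22']}]},
        {'BlockType': 'CELL', 'Id': 'c11', 'RowIndex': 1, 'ColumnIndex': 1,
         'Relationships': [{'Type': 'CHILD', 'Ids': ['w1', 'w2']}]},
        {'BlockType': 'CELL', 'Id': 'c12', 'RowIndex': 1, 'ColumnIndex': 2,
         'Relationships': [{'Type': 'CHILD', 'Ids': ['w3']}]},
        {'BlockType': 'CELL', 'Id': 'c21', 'RowIndex': 2, 'ColumnIndex': 1,
         'Relationships': [{'Type': 'CHILD', 'Ids': ['w4']}]},
        {'BlockType': 'CELL', 'Id': 'c22', 'RowIndex': 2, 'ColumnIndex': 2},
        {'BlockType': 'WORD', 'Id': 'w1', 'Text': 'Start'},
        {'BlockType': 'WORD', 'Id': 'w2', 'Text': 'Date'},
        {'BlockType': 'WORD', 'Id': 'w3', 'Text': 'Employer'},
        {'BlockType': 'WORD', 'Id': 'w4', 'Text': '1/15/2009'},
    ],
}

sample_tables = tables_from_response(sample_response)
for row in sample_tables[0]:
    print(row)
```

A list of rows in this shape can be passed to `pd.DataFrame()` if you want to continue working with the table in pandas.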

You have now extracted the text from a different image, and you could continue to process it further, if needed.

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.

*©2021 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*