# Speech Understanding 
# Lecture 13: Speech-enabled web browser

### Mark Hasegawa-Johnson, KCGI

1. <a href="#section1">How to open a web page using python</a>
1. <a href="#section2">How HTML works</a>
1. <a href="#section3">Using BeautifulSoup to extract the content you want</a>
1. <a href="#section4">Using BeautifulSoup to explore a real-world web page</a>
1. <a href="#section5">Automatic news announcer</a>
1. <a href="#homework">Homework</a>


<a id='section1'></a>

## 1. How to open a web page using python

Python has three main packages for dealing with web pages:

* `webbrowser` interacts with your system's default web browser
* `requests` downloads the text of a web page into python
* `BeautifulSoup` parses the text of the web page

First, let's try using `webbrowser` to tell our web browser to do something:

In [3]:
import webbrowser
webbrowser.open("http://wsj.com")

True

Now let's use the speech recognizer to choose which web page to open:

In [2]:
import speech_recognition as sr
import webbrowser
speech = sr.Recognizer()

while True:
    print('Python is listening...')
    with sr.Microphone() as source:
        speech.adjust_for_ambient_noise(source)
        try:
            audio = speech.listen(source)
            inp = speech.recognize_google(audio)
        except sr.UnknownValueError:
            continue
        except sr.RequestError:
            continue
        except sr.WaitTimeoutError:
            continue
        else:
            break
print('You just said',inp,'.')
# str.replace returns a new string, so the result must be assigned back to inp
inp = inp.replace('browser ', '')
webbrowser.open("http://" + inp)


Python is listening...
You just said play wsj.com .


True
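
In the run above, the recognizer returned `play wsj.com`, so the string passed to `webbrowser.open` still begins with a command word. The following is a minimal sketch of one way to clean the transcript before opening it. The helper name `spoken_to_url` and the list of command words are assumptions made for illustration; they are not part of `speech_recognition` or `webbrowser`.

```python
# Hypothetical helper: turn a recognized phrase into a URL string.
COMMAND_WORDS = ("browser", "play", "open")  # assumed command words

def spoken_to_url(transcript):
    words = transcript.strip().lower().split()
    # Drop a leading command word, if there is one.
    if words and words[0] in COMMAND_WORDS:
        words = words[1:]
    # A transcript may contain spaces inside a domain name, so join the pieces.
    return "http://" + "".join(words)

print(spoken_to_url("play wsj.com"))
```

The returned string can then be passed to `webbrowser.open` in place of `"http://" + inp`.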

Finally, let's use speech recognition to perform a web search.  To do that, all we need is to replace this line:

```webbrowser.open("http://" + inp)```

...with this one:

```webbrowser.open("http://google.com/search?q=" + inp)```

In [3]:
import speech_recognition as sr
import webbrowser
speech = sr.Recognizer()

while True:
    print('Python is listening...')
    with sr.Microphone() as source:
        speech.adjust_for_ambient_noise(source)
        try:
            audio = speech.listen(source)
            inp = speech.recognize_google(audio)
        except sr.UnknownValueError:
            continue
        except sr.RequestError:
            continue
        except sr.WaitTimeoutError:
            continue
        else:
            break
print('You just said',inp,'.')
# str.replace returns a new string, so the result must be assigned back to inp
inp = inp.replace('browser ', '')
webbrowser.open("http://google.com/search?q=" + inp)


Python is listening...
You just said play wsj.com .


True
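
A spoken search phrase usually contains spaces, and may contain characters such as `&` that have a special meaning inside a URL. One way to handle this is to encode the phrase with `urllib.parse.quote_plus` from the standard library before appending it to the search URL. The sketch below assumes a helper named `build_search_url`, which is introduced here only for illustration.

```python
from urllib.parse import quote_plus

def build_search_url(query):
    # quote_plus replaces spaces with "+" and percent-encodes reserved characters
    return "http://google.com/search?q=" + quote_plus(query)

print(build_search_url("speech recognition in python"))
print(build_search_url("AT&T news"))
```

The result of `build_search_url(inp)` can be passed directly to `webbrowser.open`.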

<a id='section2'></a>

## 2. How HTML works

Web pages are written using the **hypertext markup language (HTML)**.  You can write HTML using special tools, but you can also write it using any plaintext editor.

For example, consider the following text:

```
<html>
    <head>
        <title>Example Web Page</title>
        <style>
            .bluetext { color: blue; }
            .leftmargin { margin-left: 10px; }
        </style>
    </head>
    <body>
        <h1>Test web page</h1>
        <p>
        Web pages are written using the hypertext markup language (HTML).  You can write HTML using special tools, but you can also write it using any plaintext editor.   The markup in an HTML file is done using tags.  Each tag either opens an envelope, or closes an envelope.  The text in between the opening tag and the closing tag is called the content of the envelope.  Envelopes can be nested, one inside another, as this <b>p</b> tag is nested inside the <b>body</b> tag.
        </p>
        <p>
        "Hypertext" is text that includes links.  For example, here are some links:
        </p>
        <p class="leftmargin">
            <a class="bluetext" href="https://wikipedia.org">Wikipedia</a>
        </p>
        <p class="leftmargin">
            <a class="bluetext" href="https://www.npr.org">NPR</a>
        </p>
    </body>
</html>
```

This shows the content of the file "testpage.html."  If you click on this file, it will render as formatted text.  But if you open it in a text editor, you will see the content shown above.

### Tags, Envelopes, and Nesting

The formatting commands in HTML come in the form of tags.  There are two types of tags:
* An opening tag, such as `<p>`, opens an envelope (in this case, a paragraph)
* A closing tag, such as `</p>`, closes the envelope

Every envelope has **content**.  Some envelopes also have **attributes**.
* The **content** of the envelope is the text between the opening-tag and closing-tag.
* The **attributes** of the envelope are defined inside the tag.  For example, the text `<a class="bluetext" href="https://www.npr.org">` means that the `<a>` tag has 2 attributes: `class="bluetext"` and `href="https://www.npr.org"`.
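
To see attributes from Python's point of view, here is a small sketch that uses the standard-library `html.parser.HTMLParser` class (not BeautifulSoup, which is introduced later). The class name `AttributePrinter` is an assumption made for this example. The parser calls `handle_starttag` once for each opening tag and passes the attributes as a list of (name, value) pairs.

```python
from html.parser import HTMLParser

class AttributePrinter(HTMLParser):
    """Collect every opening tag that has at least one attribute."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "bluetext")]
        if attrs:
            self.found.append((tag, dict(attrs)))

parser = AttributePrinter()
parser.feed('<p class="leftmargin"><a class="bluetext" href="https://www.npr.org">NPR</a></p>')
for tag, attributes in parser.found:
    print(tag, attributes)
```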

Envelopes can be nested.  In the example file above, the envelopes are nested like this:


```
<html>
  ├─ <head>
  │   ├─ <title>
  │   └─ <style>
  └─ <body>
      ├─ <h1>
      ├─ <p>
      │  ├─ <b>
      │  └─ <b>
      ├─ <p>
      ├─ <p>
      │  └─ <a>
      └─ <p>
         └─ <a>
```
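
A nesting outline like the one above can be produced programmatically. The sketch below uses the standard-library `html.parser.HTMLParser` and a depth counter: the counter increases at each opening tag and decreases at each closing tag. The class name `OutlinePrinter` and the shortened HTML string are assumptions for illustration, and the simple counter only works when every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser

class OutlinePrinter(HTMLParser):
    """Record each opening tag, indented by its nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1   # everything until the closing tag is inside this envelope

    def handle_endtag(self, tag):
        self.depth -= 1   # the envelope is closed

# A shortened version of the example page
html_text = """<html><head><title>Example Web Page</title></head>
<body><h1>Test web page</h1>
<p class="leftmargin"><a class="bluetext" href="https://wikipedia.org">Wikipedia</a></p>
</body></html>"""

outline = OutlinePrinter()
outline.feed(html_text)
print("\n".join(outline.lines))
```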

### Types of HTML tags

There are many different tags.  A complete listing is here: https://html.spec.whatwg.org/multipage/

The tags used in the example above include:

| Tag name | Description |
| :- | :- |
| \<html> | A file marked up using HTML |
| \<head> | Header: information that's not visible in the web page |
| \<title> | Title of the page |
| \<style> | Formatting class definitions |
| \<body> | Body: the part that's visible in the web page |
| \<h1> | A top-level heading (\<h2>, \<h3>, and \<h4> are lower-level headings) |
| \<p> | A paragraph |
| \<b> | Boldface text |
| \<a> | A hypertext link |
 


#### Real web pages

A real web page is just like the one above, but more complicated.  To see a useful example, go to <a href="https://www.npr.org/">https://www.npr.org/</a>.  In your browser menu, find the option that says **View Page Source** (in Firefox, that's inside the **Tools** menu), and click on it.

Notice that the top of the file is a very long header, including `<script>` and `<style>` tags that will be used later in the page.

After the very long header you will find a body, with lists formatted using `<ul>` and `<li>` tags, and with news content in plaintext between the tags.

<a id='section3'></a>

## 3. Using BeautifulSoup to extract the content you want

In [4]:
!pip install bs4



<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> is a Python package that makes it relatively easy to extract content from web pages.

For example, suppose we want to find all of the **p** tags in a document (i.e., all of the paragraphs).  This is done as follows:

1. First, parse the whole document using `bs4.BeautifulSoup`
1. Second, use `findAll("p")` to return a list of all the p tags

That's all!

In [4]:
import bs4

with open("testpage.html") as f:
    example_soup = bs4.BeautifulSoup(f, "html.parser")

ptags = example_soup.findAll("p")
print("There are", len(ptags), "paragraphs in the document.\n")
print("The first one is:\n")
print(ptags[0], "\n")
print("The third one is:\n")
print(ptags[2], "\n")

print("The children of the third paragraph are:\n")
print(ptags[2].contents)

There are 4 paragraphs in the document.

The first one is:

<p>
        Web pages are written using the hypertext markup language (HTML).  You can write HTML using special tools, but you can also write it using any plaintext editor.   The markup in an HTML file is done using tags.  Each tag either opens an envelope, or closes an envelope.  The text in between the opening tag and the closing tag is called the content of the envelope.  Envelopes can be nested, one inside another, as this <b>p</b> tag is nested inside the <b>body</b> tag.
        </p> 

The third one is:

<p class="leftmargin">
<a class="bluetext" href="https://wikipedia.org">Wikipedia</a>
</p> 

The children of the third paragraph are:

['\n', <a class="bluetext" href="https://wikipedia.org">Wikipedia</a>, '\n']


Suppose you know that the third paragraph contains a hyperlink.  You can use `find("a")` to return the first **a** tag inside the corresponding paragraph:

In [5]:
atag = ptags[2].find("a")

print("The href attribute of the hyperlink in paragraph 3 is:",atag['href'])

print("The text content of that hyperlink is:", atag.text)


The href attribute of the hyperlink in paragraph 3 is: https://wikipedia.org
The text content of that hyperlink is: Wikipedia
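
The same idea extends from one link to every link in a document. The sketch below parses a short HTML string (copied from the link paragraphs of the example page, so that no file is needed) and collects the text and `href` attribute of each **a** tag. It uses `find_all`, the current spelling of `findAll`, and `get("href")`, which returns `None` instead of raising an error when a tag has no `href` attribute.

```python
import bs4

html_text = """
<p class="leftmargin"><a class="bluetext" href="https://wikipedia.org">Wikipedia</a></p>
<p class="leftmargin"><a class="bluetext" href="https://www.npr.org">NPR</a></p>
"""

soup = bs4.BeautifulSoup(html_text, "html.parser")

# One (text, href) pair for every hyperlink in the document
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
print(links)
```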


<a id='section4'></a>

## 4. Using BeautifulSoup to explore a real-world web page

Now let's use BeautifulSoup to explore the NPR web page.

In [6]:
import bs4, requests
webpage = requests.get("https://npr.org")
npr_soup = bs4.BeautifulSoup(webpage.text, "html.parser")

ptags = npr_soup.findAll("p")
print("There are", len(ptags), "paragraphs in the document.\n")
print("The first one is:\n")
print(ptags[0], "\n")
print("The third one is:\n")
print(ptags[2], "\n")


There are 83 paragraphs in the document.

The first one is:

<p>
                Vice President Kamala Harris speaks at her campaign headquarters in Wilmington, Del., on Monday.
                <b aria-label="Image credit" class="credit">
                    
                    Erin Schaff/AFP via Getty Images
                    
                </b>
<b class="hide-caption"><b>hide caption</b></b>
</p> 

The third one is:

<p>
                In this image taken from body camera video released by Illinois State Police, Sonya Massey, left, talks with former Sangamon County Sheriff’s Deputy Sean Grayson outside her home in Springfield, Ill., July 6, 2024. Footage released Monday, July 22, by a prosecutor reveals a chaotic scene in which Massey, who called 911 for help, is shot in the face in her home by Grayson. (Illinois State Police via AP)
                <b aria-label="Image credit" class="credit">
                    
                    AP/Illinois State Police
                  

The news items in the NPR web page are stored in `<div>` envelopes with a special class: they are called `<div class="story-text">`.  Let's list those.

In [7]:
div_tags = npr_soup.find_all('div', 'story-text')

print("There are", len(div_tags), "story text sections.\n")
print("The first one is:\n")
print(div_tags[0],"\n")
print("The third one is:\n")
print(div_tags[2],"\n")

There are 34 story text sections.

The first one is:

<div class="story-text">
<div class="slug-wrap">
<h2 class="slug">
<a "="" data-metrics-ga4='{"action":"homepage_curation_click","category":"recirculation","clickPosition":"1.1.1-L","clickType":"section slug","clickUrl":"https:\/\/www.npr.org\/sections\/politics\/"}' href="https://www.npr.org/sections/politics/">
                        Politics
                        </a>
</h2>
</div>
<a "="" data-metrics-ga4='{"action":"homepage_curation_click","category":"recirculation","clickPosition":"1.1.1-L","clickType":"curated story","clickUrl":"https:\/\/www.npr.org\/2024\/07\/22\/g-s1-12690\/democrats-rally-behind-vice-president-harris"}' href="https://www.npr.org/2024/07/22/g-s1-12690/democrats-rally-behind-vice-president-harris">
<h3 class="title">Harris says, as a former prosecutor, 'I know Donald Trump's type'</h3>
</a>
<a data-metrics-ga4='{"action":"homepage_curation_click","category":"recirculation","clickPosition":"1.1.1-L","clic

If you look through those `story-text` sections, you can see that there are only two parts that might sound good if spoken out loud:

* Each of them has a title, called `<h3 class="title">`
* One of them also has a teaser, called `<p class="teaser">`.

Let's write a function that extracts a list of story-texts from the NPR web page, and returns a tuple containing the title and (if it exists) the teaser for each of them.

In [9]:
def get_stories(soup):
    stories = []
    for div_tag in soup.find_all('div', 'story-text'):
        titletag = div_tag.find('h3', 'title')
        teasertag = div_tag.find('p', 'teaser')
        
        if teasertag is not None:
            stories.append((titletag.text, teasertag.text))
        else:
            stories.append((titletag.text, ""))
    return stories
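
A function like this can be tried without network access by running it on a small hand-written snippet. The sketch below is a slightly more defensive variant, named `get_stories_safe` here only for illustration: it skips any `story-text` section that has no title, and it strips surrounding whitespace from the text. The HTML string is invented for this example and is far simpler than the real NPR page.

```python
import bs4

# Invented HTML for illustration; the real NPR page is far more complex.
sample_html = """
<div class="story-text">
  <h3 class="title">First headline</h3>
  <p class="teaser">A short teaser.</p>
</div>
<div class="story-text">
  <h3 class="title">Second headline</h3>
</div>
<div class="story-text">
  <p class="teaser">A teaser with no title.</p>
</div>
"""

def get_stories_safe(soup):
    stories = []
    # class_="story-text" is the keyword form of find_all('div', 'story-text')
    for div_tag in soup.find_all("div", class_="story-text"):
        titletag = div_tag.find("h3", class_="title")
        if titletag is None:
            continue  # no title to read aloud, so skip this section
        teasertag = div_tag.find("p", class_="teaser")
        teaser = teasertag.get_text(strip=True) if teasertag is not None else ""
        stories.append((titletag.get_text(strip=True), teaser))
    return stories

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")
sample_stories = get_stories_safe(sample_soup)
print(sample_stories)
```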

In [8]:
stories = get_stories(npr_soup)
print("There are", len(stories), "stories.\n")
for n in range(5):
    print("Story number %d:"%(n))
    print(stories[n], "\n")



<a id='section5'></a>

## 5. Automatic news announcer

Let's use **gtts** to make an automatic news announcer.  We want our announcer to

1. Print all of the stories on a webpage, showing only the title of each.
1. Ask the user to enter a number.
1. Read out loud the user's desired story, including both title and teaser.

Let's create this as two functions: `read_story_list` and `read_nth_story`.

In [None]:
import gtts

def read_story_list(stories, filename):
    text = "Here is a list of today's top stories.\n"
    for n, story in enumerate(stories):
        text += "Story Number %d: %s.\n"%(n, story[0])
    gtts.gTTS(text=text, lang="en").save(filename)
        
def read_nth_story(stories, n, filename):
    gtts.gTTS(text=stories[n][0]+" "+stories[n][1], lang="en").save(filename)
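
Because `gtts` needs a network connection to synthesize speech, it can help to build the announcement text in a separate function, so that the text can be checked before any audio is generated. The sketch below shows one way to do that; the function name `build_story_list_text` and the example stories are assumptions made for illustration.

```python
def build_story_list_text(stories):
    # Each story is a (title, teaser) tuple; only the title is announced here.
    lines = ["Here is a list of today's top stories."]
    for n, (title, teaser) in enumerate(stories):
        lines.append("Story number %d: %s." % (n, title))
    return "\n".join(lines)

example_stories = [("First headline", "A short teaser"), ("Second headline", "")]
print(build_story_list_text(example_stories))
```

The returned string could then be passed to `gtts.gTTS(text=..., lang="en")` in the same way as in `read_story_list`.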

In [None]:
import librosa, IPython

read_story_list(stories, 'test.mp3')
x, fs = librosa.load('test.mp3')
IPython.display.Audio(data=x, rate=fs)

In [None]:
choice = input('Which story should I read?')
read_nth_story(stories, int(choice), 'test.mp3')
x, fs = librosa.load('test.mp3')
IPython.display.Audio(data=x, rate=fs)
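
The cell above calls `int(choice)` directly, so it raises an error if the user types something that is not a number, or a number that is out of range. A minimal sketch of a validation step is shown below; the helper name `parse_choice` is an assumption made for illustration.

```python
def parse_choice(text, n_stories):
    """Return the chosen story index, or None if the input is not usable."""
    try:
        n = int(text.strip())
    except ValueError:
        return None          # not a whole number
    if 0 <= n < n_stories:
        return n
    return None              # number is outside the list of stories

print(parse_choice("2", 5))
print(parse_choice("two", 5))
print(parse_choice("7", 5))
```

With this helper, the announcer can ask again whenever `parse_choice(choice, len(stories))` returns `None`.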


<a id='homework'></a>

## Homework

This directory contains a file called `homework13.py` that contains two functions for you to complete: `extract_stories_from_NPR_text` and `read_nth_story`.



### Homework 13.1: Extract stories from NPR text

In [None]:
import homework13, importlib
importlib.reload(homework13)
help(homework13.extract_stories_from_NPR_text)

Notice that `extract_stories_from_NPR_text` is almost the same as `get_stories`, but it starts from the web page text, not from the soup.  So you need to run `soup = bs4.BeautifulSoup(text, "html.parser")`, and then re-use the rest of the code from `get_stories`.



### Homework 13.2: read_nth_story

In [None]:
importlib.reload(homework13)
help(homework13.read_nth_story)

Once you've replaced all of the `raise RuntimeError` lines with code that works,

1. Try your code in the following code block.  Once you get it to work here, then
1. Run the grading block at the very end.  Once that grading block shows a score of 100%, then
1. Commit your changed notebook, and push it to github.com.

In [None]:
import homework13, requests, IPython, importlib, librosa
importlib.reload(homework13)

webpage = requests.get("https://npr.org")
stories = homework13.extract_stories_from_NPR_text(webpage.text)
homework13.read_nth_story(stories, 0, 'test.mp3')

x, fs = librosa.load('test.mp3')
IPython.display.Audio(data=x, rate=fs)


### Receiving your grade

In order to receive a grade for your homework, you need to:

1. Run the following code block on your machine.  The result may list some errors, and then in the very last line, it will show a score.  That score (between 0% and 100%) is the grade you have earned so far.  If you want to earn a higher grade, please continue editing `homework13.py`, and then run this code block again.
1. When you are happy with your score (e.g., when it reaches 100%), choose `File` $\Rightarrow$ `Save and Checkpoint`.  Then use `GitHub Desktop` to commit and push your changes.
1. Make sure that the 100% shows on your github repo on github.com.  If it doesn't, you will not receive credit.

In [None]:
import importlib, grade
importlib.reload(grade)