* Overview of Web Scraping
* Getting Started with BeautifulSoup
* Overview of HTML
* Process HTML using BeautifulSoup
* Extract URLs from HTML
* Extract Data from Web Pages
* Use requests to read HTML Page
* Parse and Process Web Page using BeautifulSoup
* Exercise and Solution

* Overview of Web Scraping
  * BeautifulSoup
  * Scrapy
  * Pandas (`pd.read_html`, for unsecured pages that contain tables)

* Getting Started with BeautifulSoup

Run `pip install beautifulsoup4` to install BeautifulSoup. Restart the notebook kernel afterwards so that the newly installed package can be imported.
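
One quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal check, and it assumes the package was installed into the same environment that the notebook kernel uses.

```python
# Confirm that the bs4 package is importable and report its version.
import bs4

print(bs4.__version__)
```

If the import fails with `ModuleNotFoundError`, the package was most likely installed into a different Python environment than the one the kernel runs in.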

* Overview of HTML

```html
<html>
    <body>
        <table>
            <tbody>
                <tr>
                    <th>Details</th>
                    <th>URL</th>
                </tr>
                <tr>
                    <td>Video Content</td>
                    <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
                </tr>
                <tr>
                    <td>Reference Material</td>
                    <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
                </tr>
            </tbody>
        </table>
    </body>
</html>
```

In [None]:
%%html

<html>
    <body>
        <table>
            <tbody>
                <tr>
                    <th>Details</th>
                    <th>URL</th>
                </tr>
                <tr>
                    <td>Video Content</td>
                    <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
                </tr>
                <tr>
                    <td>Reference Material</td>
                    <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
                </tr>
            </tbody>
        </table>
    </body>
</html>
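
The markup above is a tree of nested tags: `table` contains `tr` rows, and each row contains `th` or `td` cells. As an illustration only, the sketch below uses Python's standard library `html.parser` module (not BeautifulSoup) to print each opening tag indented by its nesting depth. The class name `TagOutline` and the shortened `sample` string are assumptions made for this example.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Collect each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then descend one level.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up one level.
        self.depth -= 1


# Shortened version of the table shown above (hypothetical sample).
sample = "<table><tr><th>Details</th><th>URL</th></tr></table>"

parser = TagOutline()
parser.feed(sample)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```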

* Process HTML using BeautifulSoup
  * Create a string object with HTML
  * Import BeautifulSoup from bs4
  * Create BeautifulSoup object
  * Extract information from HTML String (such as tag attributes, tag text, etc)

In [None]:
html_str = """<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
            </td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
            </td>
        </tr>
    </tbody>
</table>"""

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(html_str, 'html.parser')
print(soup.prettify())

In [None]:
soup.table.tbody.tr
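
Beyond dotted navigation, each tag object exposes its name, attributes, and text. The sketch below shows a few of these accessors on a shortened copy of the table. The variable names `first_link` and `first_cell` are introduced here purely for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table used in this section.
html_str = """<table>
    <tr>
        <td>Video Content</td>
        <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
    </tr>
</table>"""

soup = BeautifulSoup(html_str, "html.parser")

first_link = soup.find("a")        # first <a> tag, or None if there is none
print(first_link.name)             # tag name: a
print(first_link.attrs)            # all attributes as a dict
print(first_link["href"])          # a single attribute value
print(first_link.get_text())       # text inside the tag: YouTube Channel

first_cell = soup.find("td")
print(first_cell.get_text(strip=True))  # Video Content
```

Note that `tag["href"]` raises `KeyError` when the attribute is missing, whereas `tag.get("href")` returns `None`.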

* Extract URLs from HTML
  * Get all the anchor tags (using `find_all`)
  * Read the `href` attribute of each anchor tag. Its value is the URL the link points to

In [None]:
soup.find_all('a')

In [None]:
for item in soup.find_all('a'):
    print(item['href'])

In [None]:
for item in soup.find_all('a'):
    print(item.text)

In [None]:
[(item.text, item['href']) for item in soup.find_all('a')]
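
The same idea extends to whole table rows. As a rough sketch, the code below pairs the text in the first column with the link in the second column and collects the result as a list of dictionaries. The key names `details`, `label`, and `url` are assumptions chosen for this example.

```python
from bs4 import BeautifulSoup

html_str = """<table><tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td>
    <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>Reference Material</td>
    <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</tbody></table>"""

soup = BeautifulSoup(html_str, "html.parser")

records = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    link = row.find("a")
    # The header row uses <th> cells and has no link, so it is skipped.
    if len(cells) < 2 or link is None:
        continue
    records.append({
        "details": cells[0].get_text(strip=True),
        "label": link.get_text(strip=True),
        "url": link.get("href"),
    })

for record in records:
    print(record)
```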

* Extract Data from Web Pages
  * URL - https://en.wikipedia.org/wiki/Python_(programming_language)
  * Use `requests` to get the content of the web page
  * Parse and process HTML content using BeautifulSoup

* Use requests to read HTML Page

In [None]:
import requests

In [None]:
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

In [None]:
html_content = requests.get(url).content
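
The one-line call above does not check whether the request succeeded. A slightly more defensive variant might look like the sketch below, which sets a timeout, sends a descriptive `User-Agent` header, and raises an exception on HTTP errors. The helper name `fetch_html`, the `session` parameter, and the `USER_AGENT` string are all assumptions made for this example, not part of `requests` itself.

```python
import requests

# Hypothetical User-Agent string; replace the contact details with your own.
USER_AGENT = "scraping-tutorial/0.1 (contact: you@example.com)"


def fetch_html(url, timeout=10, session=None):
    """Return the raw bytes of the page at `url`, raising on HTTP errors."""
    # A requests.Session may be passed in to reuse connections.
    http = requests if session is None else session
    response = http.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content


# Example call (performs a real network request, so it is left commented out):
# html_content = fetch_html('https://en.wikipedia.org/wiki/Python_(programming_language)')
```

Without a timeout, `requests.get` can wait indefinitely on an unresponsive server, which is why the sketch sets one by default.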

* Parse and Process Web Page using BeautifulSoup

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(html_content, 'html.parser')
# print(soup.prettify())

In [None]:
soup.find_all('a')

In [None]:
http_urls = []
for item in soup.find_all('a'):
    if item.get('href') and item.get('href').startswith('http'):
        http_urls.append(item['href'])

In [None]:
http_urls

In [None]:
sorted(set(http_urls))
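
The filter, de-duplication, and sort steps above can also be combined into a single expression. The sketch below does this on a small inline sample (the HTML string is made up for illustration) so that it runs without network access. Passing `href=True` to `find_all` keeps only the anchor tags that actually have an `href` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with absolute, relative, and missing hrefs.
html_str = """
<a href="https://www.python.org/">Python</a>
<a href="/wiki/Guido_van_Rossum">Guido van Rossum</a>
<a name="top">anchor without href</a>
<a href="https://www.python.org/">Python again</a>
<a href="http://example.com/">Example</a>
"""

soup = BeautifulSoup(html_str, "html.parser")

# The set comprehension removes duplicates; sorted() returns an ordered list.
http_urls = sorted({
    a["href"]
    for a in soup.find_all("a", href=True)
    if a["href"].startswith("http")
})
print(http_urls)
```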

* Exercise - Get Wiki Page URLs from [NFL Wiki Page](https://en.wikipedia.org/wiki/National_Football_League)
1. Read the entire HTML content of the NFL Wiki page (https://en.wikipedia.org/wiki/National_Football_League).
2. Get the URLs that start with **/wiki/**.
3. Prefix each URL with **https://en.wikipedia.org** (e.g., https://en.wikipedia.org/wiki/Buffalo_Bills).
4. Return the unique URLs, sorted in ascending order.

* Solution - Get Wiki Page URLs from [NFL Wiki Page](https://en.wikipedia.org/wiki/National_Football_League)

In [None]:
import requests

In [None]:
from bs4 import BeautifulSoup

In [None]:
url = 'https://en.wikipedia.org/wiki/National_Football_League'

In [None]:
html_content = requests.get(url).content

In [None]:
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
wiki_urls = []
for item in soup.find_all('a'):
    if item.get('href') and item.get('href').startswith('/wiki/'):
        wiki_urls.append(f"https://en.wikipedia.org{item.get('href')}")

In [None]:
sorted(set(wiki_urls))
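
As an alternative to building the prefix by hand with an f-string, the standard library function `urllib.parse.urljoin` can resolve a relative link against the URL of the page it came from. The sketch below uses a hard-coded list of `href` values (made up for illustration) instead of a live request.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/National_Football_League"

# Hypothetical href values of the kind found on the page.
hrefs = [
    "/wiki/Buffalo_Bills",
    "/wiki/Miami_Dolphins",
    "/wiki/Buffalo_Bills",   # duplicate
    "#History",              # in-page anchor, ignored
    "https://www.nfl.com/",  # external link, ignored
]

# Keep only /wiki/ links, resolve them against the page URL, then de-duplicate and sort.
wiki_urls = sorted({
    urljoin(page_url, href)
    for href in hrefs
    if href.startswith("/wiki/")
})
print(wiki_urls)
```

Because `urljoin` takes the scheme and host from `page_url`, the same code would keep working if the page were served from a different Wikipedia language domain.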