# Example Code for Extracting Links from Webpages

__Description__: PageRank and other graph-mining methods for the Web presuppose a graph that represents the Web's link structure. How might one collect such a representation? Fortunately, many tools exist for web scraping and web crawling.

In this notebook, I provide some examples of how you might build such data using simple Python libraries.

In [1]:
%matplotlib inline

In [2]:
import tldextract # Useful for extracting domains from URLs

def getDomain(url):
    """
    Use the tldextract package to get the domain name for a given URL
    """
    
    extracted_domain = tldextract.extract(url)
    return extracted_domain.fqdn

## Parsing News Articles with Newspaper3k

A common task is to extract content from news articles. News pages often link to other articles, both within the same publication and at other news sources (e.g., NYTimes linking to Washington Post, or your local news affiliate linking to a newswire).

The `Newspaper3k` package makes this task fairly easy to automate for well-structured pages.

Documentation for the `Newspaper3k` package is available here: https://newspaper.readthedocs.io/en/latest/

In [3]:
import newspaper # Import the newspaper package

In [4]:
# Specify a URL to a News Article
url = 'https://www.nytimes.com/2021/02/01/travel/greece-firewalking-ritual.html'

# Create an article pointing to this URL
article = newspaper.Article(url)


In [5]:
# Tell Newspaper to download the URL's content
article.download()


In [6]:
# Parse the content of the article
#. This call is necessary prior to any Newspaper analysis
article.parse()


In [7]:
# Print some standard stats from the article
print("Publish Date:", article.publish_date)
print("Authors:", article.authors)
print("Image:", article.top_image)
print("Text:", article.text)

Publish Date: 2021-02-01 00:00:00
Authors: ['Demetrios Ioannou', 'Photographs']
Image: https://static01.nyt.com/images/2021/02/02/travel/01travel-greece-promo/01travel-greece-promo-facebookJumbo.jpg
Text: The room was dimly lit, illuminated only by a weak yellow light bulb and the flames from the fireplace. A small group of men and women, clutching the holy icons of Greek Orthodox saints, was dancing and twirling around the floor under the sound of the instruments: a Thracian lyra, a gaida, a tambourine. The dancers, surrendering to the music, had their eyes closed.

Everyone sang together:

Constantine the little one, little Constantine,

His mother had him, she took care of him while he was very young,

A message came for him to go to war,

He saddles and horseshoes his horse in the night,

He puts silver petals, golden nails and a pearl on the saddle.

Their voices carried outside into the rainy streets. A while later, in a kind of an ecstasy, they began walking barefoot on burning 

In [8]:
# Set up a Content Extractor to get links from an article's HTML
extractor = newspaper.extractors.ContentExtractor(newspaper.Config())


In [9]:
print("Article's Domain:", getDomain(article.canonical_link))

# Pass the Article's HTML to the extractor to get all the links
for link in extractor.get_urls(article.html):
    
    print("\t", "Raw Link:", link)
    
    if link.startswith("http"):
        # Absolute URL: may point off-site or to another host of the same site
        print("\t", "Off-site Link:", getDomain(link))
    elif link.startswith("//"): # Protocol-relative link: the scheme is omitted
        # Prepend a scheme so the URL can be parsed
        print("\t", "Off-site Link:", getDomain("http:" + link))
    else: # Relative path or fragment (e.g., "/section/travel", "#site-index"); skipped here
        pass

Article's Domain: www.nytimes.com
	 Raw Link: #site-content
	 Raw Link: #site-index
	 Raw Link: /
	 Raw Link: https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
	 Off-site Link: myaccount.nytimes.com
	 Raw Link: /section/travel
	 Raw Link: /
	 Raw Link: /
	 Raw Link: https://www.facebook.com/dialog/feed?app_id=9869919170&link=https%3A%2F%2Fwww.nytimes.com%2F2021%2F02%2F01%2Ftravel%2Fgreece-firewalking-ritual.html%3Fsmid%3Dfb-share&name=Glimpses%20of%20an%20Ancient%20Fire-Walking%20Ritual%20in%20Northern%20Greece&redirect_uri=https%3A%2F%2Fwww.facebook.com%2F
	 Off-site Link: www.facebook.com
	 Raw Link: https://api.whatsapp.com/send?text=Glimpses%20of%20an%20Ancient%20Fire-Walking%20Ritual%20in%20Northern%20Greece%20https%3A%2F%2Fwww.nytimes.com%2F2021%2F02%2F01%2Ftravel%2Fgreece-firewalking-ritual.html%3Fsmid%3Dwa-share
	 Off-site Link: api.whatsapp.com
	 Raw Link: https://twitter.com/intent/tweet?url=https%3A%2F%2Fwww.nytimes.com%2F2021%2F02%2F01%2Ftravel%2Fgr
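
The loop above skips relative links such as `/section/travel` and handles `//` links by prepending a scheme by hand. As a minimal sketch of an alternative, the standard library's `urllib.parse` can resolve every kind of link against the page URL. The helper names `resolve_link` and `is_same_host` and the sample links are my own illustrations, not part of `Newspaper3k`.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

def resolve_link(page_url, href):
    """Resolve href against page_url; return None for non-web links."""
    # urljoin handles relative paths, fragments, and protocol-relative links
    absolute = urljoin(page_url, href)
    # Drop any "#fragment", since it points within the same document
    absolute = urldefrag(absolute).url
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None  # e.g., mailto: or javascript: links
    return absolute

def is_same_host(page_url, link_url):
    """Compare hostnames only (a stricter test than a shared registered domain)."""
    return urlsplit(page_url).hostname == urlsplit(link_url).hostname

PAGE_URL = "https://www.nytimes.com/2021/02/01/travel/greece-firewalking-ritual.html"

# Hypothetical sample of raw links, similar in shape to the output above
for href in ["#site-content", "/section/travel", "//example.org/a",
             "https://twitter.com/intent/tweet", "mailto:someone@example.org"]:
    resolved = resolve_link(PAGE_URL, href)
    if resolved is None:
        print("skip    ", href)
    elif is_same_host(PAGE_URL, resolved):
        print("on-site ", resolved)
    else:
        print("off-site", resolved)
```

Comparing hostnames treats `myaccount.nytimes.com` as off-site relative to `www.nytimes.com`. If subdomains should count as the same site, comparing registered domains (for example via `tldextract`) would be one option.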

## Parsing With BeautifulSoup

The `Newspaper3k` package works well for news sites, but many sites on the Web aren't news sites (e.g., Wikipedia, Amazon, etc.). How can we easily map the Web and include these domains?

Because this task is so common, a well-known Python package, `BeautifulSoup`, exists for this purpose. I demonstrate it below.

Relevant documentation is available here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [10]:
from bs4 import BeautifulSoup # BeautifulSoup package
import re  # Regular expression engine, for string matching

### HTTP and Requests

`BeautifulSoup` is particularly good at parsing the HTML of web pages, but how do we get the HTML of a page? Enter the Python `requests` package, which you can find here:

https://requests.readthedocs.io/en/master/

In [11]:
import requests


In [12]:
# To demonstrate getting HTML of a page, let's pull the Wikipedia homepage
r = requests.get('http://wikipedia.org') # Run an HTTP GET request

# Check for success.
#. If successful, we'll get a 200 status code 
#. HTTP has many such codes, which you can reference online (e.g., 404, 500)
print(r.status_code) 

200
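
To make the status codes mentioned in the comments concrete, here is a small sketch that looks them up with the standard library's `http.HTTPStatus` enum rather than an online reference. The helper name `describe_status` is hypothetical and not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes outside the standard registry are not in the enum
        return f"{code} (unrecognized status code)"
    return f"{status.value} {status.phrase}"

for code in (200, 404, 500):
    print(describe_status(code))
```

In a crawler, a check like `200 <= r.status_code < 300` before parsing is one simple way to avoid treating error pages as content.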


In [13]:
# Use the encoding reported by the server response
#. to convert the content to a string, which should
#. produce the HTML we want to analyze
#. (r.encoding can be None if no charset is sent, so fall back to UTF-8)
article_html = r.content.decode(r.encoding or 'utf-8')
print("Length:", len(article_html))

print(article_html[:256])

Length: 68913
<!DOCTYPE html>
<html lang="mul" class="no-js">
<head>
<meta charset="utf-8">
<title>Wikipedia</title>
<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia 


### BeautifulSoup and HTML

Now that we have the HTML for this page, we can process it with `BeautifulSoup`.

In [14]:
soup = BeautifulSoup(article_html, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="mul">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia
  </title>
  <meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
  </script>
  <meta content="initial-scale=1,user-scalable=yes" name="viewport"/>
  <link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
  <link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>
  <link href="//creativecommons.org/licenses/by-sa/3.0/" rel="license"/>
  <style>
   .sprite{background-image:url(portal/wikipedia.org/assets/img/sprite-46c49284.png);background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-46c49284.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.

In [15]:
# BS lets us find all HTML tags of a particular type
#. For links, we care about <a href...> style 
#. (though we could also look for img links or anything that
#. links to content)

# Here, we use .find_all() to find all <a href...> tags
#. (.findAll() is an older alias for the same method)
for link in soup.find_all('a'):
    if link.has_attr('href'):
        link_url = link.get('href')
        
        print("\t", "Raw Link:", link_url)

        if link_url.startswith("http"):
            # Absolute URL: may point off-site or to another host of the same site
            print("\t", "Off-site Link:", getDomain(link_url))
        elif link_url.startswith("//"): # Protocol-relative link: the scheme is omitted
            # Prepend a scheme so the URL can be parsed
            print("\t", "Off-site Link:", getDomain("http:" + link_url))
        else: # Relative path or fragment; skipped here
            pass

	 Raw Link: //en.wikipedia.org/
	 Off-site Link: en.wikipedia.org
	 Raw Link: //ja.wikipedia.org/
	 Off-site Link: ja.wikipedia.org
	 Raw Link: //de.wikipedia.org/
	 Off-site Link: de.wikipedia.org
	 Raw Link: //es.wikipedia.org/
	 Off-site Link: es.wikipedia.org
	 Raw Link: //ru.wikipedia.org/
	 Off-site Link: ru.wikipedia.org
	 Raw Link: //fr.wikipedia.org/
	 Off-site Link: fr.wikipedia.org
	 Raw Link: //it.wikipedia.org/
	 Off-site Link: it.wikipedia.org
	 Raw Link: //zh.wikipedia.org/
	 Off-site Link: zh.wikipedia.org
	 Raw Link: //pl.wikipedia.org/
	 Off-site Link: pl.wikipedia.org
	 Raw Link: //pt.wikipedia.org/
	 Off-site Link: pt.wikipedia.org
	 Raw Link: //ar.wikipedia.org/
	 Off-site Link: ar.wikipedia.org
	 Raw Link: //de.wikipedia.org/
	 Off-site Link: de.wikipedia.org
	 Raw Link: //en.wikipedia.org/
	 Off-site Link: en.wikipedia.org
	 Raw Link: //es.wikipedia.org/
	 Off-site Link: es.wikipedia.org
	 Raw Link: //fr.wikipedia.org/
	 Off-site Link: fr.wikipedia.org
	 ...

	 Raw Link: //rn.wikipedia.org/
	 Off-site Link: rn.wikipedia.org
	 Raw Link: //sm.wikipedia.org/
	 Off-site Link: sm.wikipedia.org
	 Raw Link: //sg.wikipedia.org/
	 Off-site Link: sg.wikipedia.org
	 Raw Link: //st.wikipedia.org/
	 Off-site Link: st.wikipedia.org
	 Raw Link: //tn.wikipedia.org/
	 Off-site Link: tn.wikipedia.org
	 Raw Link: //cu.wikipedia.org/
	 Off-site Link: cu.wikipedia.org
	 Raw Link: //ss.wikipedia.org/
	 Off-site Link: ss.wikipedia.org
	 Raw Link: //chr.wikipedia.org/
	 Off-site Link: chr.wikipedia.org
	 Raw Link: //chy.wikipedia.org/
	 Off-site Link: chy.wikipedia.org
	 Raw Link: //ve.wikipedia.org/
	 Off-site Link: ve.wikipedia.org
	 Raw Link: //ts.wikipedia.org/
	 Off-site Link: ts.wikipedia.org
	 Raw Link: //tum.wikipedia.org/
	 Off-site Link: tum.wikipedia.org
	 Raw Link: //tw.wikipedia.org/
	 Off-site Link: tw.wikipedia.org
	 Raw Link: //ti.wikipedia.org/
	 Off-site Link: ti.wikipedia.org
	 Raw Link: //nqo.wikipedia.org/
	 Off-site Link: nqo.wikipedia.org
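
The output above repeats several hosts (for example, `en.wikipedia.org` appears more than once). Since the goal is a graph of the Web, one possible next step is to collapse the links into weighted host-to-host edges. The sketch below uses only the standard library. The function name `host_edges` and the short `sample_hrefs` list are illustrative assumptions, not output captured from the page.

```python
from collections import Counter
from urllib.parse import urljoin, urlsplit

def host_edges(page_url, hrefs):
    """Count links from the page's host to each other host it links to."""
    source = urlsplit(page_url).hostname
    edges = Counter()
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL
        target = urlsplit(urljoin(page_url, href)).hostname
        # Skip non-web links (no hostname) and links back to the same host
        if target and target != source:
            edges[(source, target)] += 1
    return edges

# Hypothetical sample of hrefs in the same shape as the links printed above
sample_hrefs = ["//en.wikipedia.org/", "//ja.wikipedia.org/",
                "//de.wikipedia.org/", "//en.wikipedia.org/", "#top"]

edges = host_edges("http://wikipedia.org", sample_hrefs)
for (src, dst), weight in edges.most_common():
    print(src, "->", dst, "weight:", weight)
```

Each `(source, target)` pair with its count can then be loaded as a weighted directed edge into whichever graph library is used for PageRank.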
	 