**PySDS Week 03 Day 03 v.1 - Exercise - Webcrawlers**

Modifying a scraper to suit needs



In [None]:
# Exercise 1. 
# How pervasive is the notion of a social network at the OII? 
# Modify the crawler that we showed in class. You can either use a new scraper 
# from Scrapy, BeautifulSoup, or MechanicalSoup, or modify the code we have. 
#
# Part 1. Write pseudocode from the following instructions:
#
# Use the department's homepage (http://www.oii.ox.ac.uk) as your seed. 
# Navigate to all links that you find that also have www.oii.ox.ac.uk in them.
# If a page includes the word "network" or "networks", then mark it in the 
# "has network" pile. Otherwise mark it in the "not mentioned" pile. 
# When you run out of links, return the number in each pile. 
#
# NOTE: Please exempt any page under http://www.oii.ox.ac.uk/study. See the updated code snippet. 

################################
# Answer below here 


'''
Take seed pages and add to set of pages to visit

Repeat the following using while loop until out of pages or user defined page limit exceeded:

    Parse next webpage in the set of pages to visit,
    returning the page data and all links on the webpage that are within oii (but not study section)

    for all returned links
        if the link is not in our set of all pages seen
            add it to the set of all pages seen and to the set of pages to visit

    if any of the stopwords are in the page data
        add the url to the pages with words list
    else
        add the url to the pages without words list
        
    sleep for defined period, and increment page count

return lists of pages with the words, and pages without the words
'''
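
The bookkeeping in the pseudocode (a queue of pages to visit, a set of pages already seen, and two piles) can be checked without touching the network. The following is a minimal sketch, assuming a hypothetical in-memory site `FAKE_SITE` and a hypothetical `crawl` function that takes a `fetch` callable; neither is part of the lecture code, and the sleep between requests is left out because nothing is actually fetched.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
FAKE_SITE = {
    "a": ("We study social networks.", ["b", "c"]),
    "b": ("Nothing relevant here.", ["a", "c"]),
    "c": ("A network of researchers.", ["d"]),
    "d": ("Contact details.", []),
}


def crawl(seeds, keywords, max_pages, fetch):
    """Breadth-first crawl; fetch(url) must return (text, links)."""
    seen = set(seeds)          # every URL ever queued, so none is queued twice
    to_visit = deque(seeds)    # pages still waiting to be parsed
    with_words, without_words = [], []
    visited = 0
    while to_visit and visited < max_pages:
        url = to_visit.popleft()
        visited += 1
        text, links = fetch(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
        # Sort the page into one of the two piles
        if any(word in text for word in keywords):
            with_words.append(url)
        else:
            without_words.append(url)
    return with_words, without_words


with_words, without_words = crawl(["a"], ["network"], 10, FAKE_SITE.__getitem__)
print(len(with_words), len(without_words))
```

Passing `fetch` as an argument is a design choice made for this sketch: the same loop can then be run against a real page parser or against a dictionary in a test.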


################################
# Peer review comments below here 




In [54]:
# Part 2. Creating a scraper that looks for oii links. 

# If you are using a subclass of HTMLParser, consider reviewing the documentation 
# https://docs.python.org/3/library/html.parser.html
#
# Write a subclass of HTMLParser (or another means) 
# that returns links if they have 'www.oii.ox.ac.uk'
# in the full path. You should think about how you are going to test this.

################################
# Answer below here 

from html.parser import HTMLParser 
import urllib.error
from urllib.request import urlopen  
from urllib import parse
import time

# Very much based on the parser / spider commands from the lecture

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.saintlad.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.saintlad.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links, but only if it is
                    # on www.oii.ox.ac.uk and not in the 'study' section:
                    parts = parse.urlparse(newUrl)
                    if parts.netloc == 'www.oii.ox.ac.uk' and not parts.path.startswith('/study'):
                        self.links.append(newUrl)

    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example)
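        # BH: I changed this to text/html in rather than == text/html, since 
        #     some pages have text/html; encoding=utf-8.
        # A response may have no Content-Type header at all, so fall back to ''
        contentType = response.getheader('Content-Type') or ''
        if 'text/html' in contentType:
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            # Assume UTF-8; replace undecodable bytes rather than crash
            htmlString = htmlBytes.decode("utf-8", errors="replace")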
            self.feed(htmlString)

            return htmlString, self.links
        else:
            return "",[]
        
def spider(seed_set,stop_words,max_pages,sleep=0.2): 
    page_count = 0
    pages_with_words = []
    pages_without_words = []
    # Reject a bare string: set() would split it into single characters
    if isinstance(seed_set, str):
        raise TypeError("Spider expects a collection of URLs as first argument.")
    all_pages = set(seed_set)
    pages_to_visit = list(all_pages)

    parser = LinkParser()
    
    while pages_to_visit and page_count < max_pages:
        url = pages_to_visit.pop(0) # take the next page off the front of the queue
        print(url) # progress indicator
        page_count += 1
        try:
            data, links = parser.getLinks(url) # get page text and links
        except urllib.error.URLError as err: # HTTPError (e.g. a 404) is a subclass of URLError
            print('############# Error ################', url, err)
            continue
            
        for i in links: # add links to all pages / pages to visit if not already
            if i not in all_pages:
                all_pages.add(i)
                pages_to_visit.append(i)

        # Compare in lower case so that 'Network' counts as well as 'network'
        if any(word.lower() in data.lower() for word in stop_words):
            pages_with_words.append(url)            
        else: 
            pages_without_words.append(url)

        time.sleep(sleep)
    return (pages_with_words,pages_without_words)
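
The exercise asks how the link filter could be tested. One option is to feed a parser a fixed HTML string, so that no request is made and the expected links are known in advance. The following is a minimal sketch under that assumption; `OiiLinkFilter` and `SAMPLE_HTML` are hypothetical names introduced here, and the class is a stand-alone variant of the filter rather than the lecture code. It compares the parsed host name instead of slicing the URL string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class OiiLinkFilter(HTMLParser):
    """Hypothetical stand-alone link filter, written so it can be tested offline."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value:
                # Resolve relative links against the page they were found on
                url = urljoin(self.base_url, value)
                parts = urlparse(url)
                on_site = parts.netloc == "www.oii.ox.ac.uk"
                in_study = parts.path.startswith("/study")
                if on_site and not in_study:
                    self.links.append(url)


# Hypothetical test page covering the cases the filter should handle.
SAMPLE_HTML = """
<a href="/research/">relative link, kept</a>
<a href="https://www.oii.ox.ac.uk/people/">absolute link, kept</a>
<a href="/study/msc/">study section, dropped</a>
<a href="https://www.ox.ac.uk/">other host, dropped</a>
<a href="mailto:someone@example.org">not a web page, dropped</a>
<p href="/news/">not an anchor tag, dropped</p>
"""

link_filter = OiiLinkFilter("https://www.oii.ox.ac.uk/")
link_filter.feed(SAMPLE_HTML)
print(link_filter.links)
```

Each line of the sample page exercises one rule, so a failing case points directly at the condition that needs fixing.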

################################
# Peer review comments below here 




In [57]:
# Part 3. Translate your pseudocode spider to working code 

# Here we will want to run a crawler.
# Call your getLinks() method from within your working code. 
# I assume it will be an extension of my code in the lecture.
#
# Please note, you should use a set or other form of counter to ensure
# that you do not visit the same link twice. I have warned IT about today. 
# But still... let's try not to DDoS the department webpage. 
# Note 1. Please exempt any page with /study. See updated code snippet. 
# Note 2. Max links = 100
# Note 3. time.sleep(0.2)

################################
# Answer below here 

withwordlist, withoutwordlist = spider(["https://www.oii.ox.ac.uk/"],["network", "networks"],100)
print()
print('Number of pages with any of the stopwords:', len(withwordlist))
print('Number of pages without any of the stopwords:', len(withoutwordlist))
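
Which pile a page lands in depends on what counts as a mention. A substring test on raw HTML also matches markup, such as a CSS class name that happens to contain "network". The following is a minimal sketch, assuming that whole-word, case-insensitive matches on the text outside `<script>` and `<style>` are what is wanted; `TextExtractor` and `mentions` are hypothetical helpers, not part of the lecture code.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects page text outside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def mentions(html, words):
    """Return True if any of the words appears as a whole word in the page text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks)
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(mentions('<p>Social Networks research</p>', ["network", "networks"]))
print(mentions('<div class="network-nav">Home</div>', ["network", "networks"]))
```

With whole-word matching both "network" and "networks" are needed in the word list, whereas with a plain substring test "network" alone already covers "networks".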

################################
# Peer review comments below here 




https://www.oii.ox.ac.uk/
https://www.oii.ox.ac.uk/research/
https://www.oii.ox.ac.uk/research/projects/
https://www.oii.ox.ac.uk/research/publications/
https://www.oii.ox.ac.uk/research/policy/
https://www.oii.ox.ac.uk/research/ref/
https://www.oii.ox.ac.uk/people/
https://www.oii.ox.ac.uk/people/new-positions/
https://www.oii.ox.ac.uk/events/
https://www.oii.ox.ac.uk/events/series/
https://www.oii.ox.ac.uk/events/past-events/
https://www.oii.ox.ac.uk/videos/
https://www.oii.ox.ac.uk/videos/playlists/
https://www.oii.ox.ac.uk/news/
https://www.oii.ox.ac.uk/news/releases/
https://www.oii.ox.ac.uk/news/coverage/
https://www.oii.ox.ac.uk/about/
https://www.oii.ox.ac.uk/about/giving/
https://www.oii.ox.ac.uk/oxford-internet-institute-awards/
https://www.oii.ox.ac.uk/about/library/
https://www.oii.ox.ac.uk/about/find-us/
https://www.oii.ox.ac.uk/blog/
https://www.oii.ox.ac.uk/follow-us/
https://www.oii.ox.ac.uk/alumni/
https://www.oii.ox.ac.uk/news/releases/junk-news-dominating-coverage-of