
# Functions Module #

*Objective: This module is intended to scrape specific websites.*



***Structure:***
*   **Defining class Content**:

    This class represents a single page through its
    URL, title, and body.


*   ****

*   **Defining function to request page and return html**:

    getPage() requests a URL and parses the response into a
    BeautifulSoup object, which represents
    the HTML source of that specific URL.

*   ****

*  **Defining functions to scrape based on Content instance:**

   Now that the Content class is defined, each scraping function
   extracts the elements that characterize a Content instance.

   Each function contains the site-specific lines needed to extract
   those elements (title and body) from one website.

   Each function also calls getPage() to request the page
   as a BeautifulSoup object.

   Every parsing function does the same:

   * Selects the title element and extracts the text for the title.
   * Selects the main content of the article.
   * Selects other content items as needed.
   * Returns a Content object instantiated with the strings found previously.

   * ****
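
The four steps above can be sketched on an inline HTML string, so no network request is needed. The markup, the `scrapeExample` function, and the `article-body` class name are made up for illustration; real sites need their own selectors.

```python
from bs4 import BeautifulSoup


class Content:
    """Holds the URL, title and body of a single page."""
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


# Hypothetical markup standing in for a downloaded page
SAMPLE_HTML = """
<html><body>
  <h1>Example headline</h1>
  <div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def scrapeExample(url, html):
    """Parse one page following the same steps as the scraping functions."""
    bs = BeautifulSoup(html, 'html.parser')
    # 1. Select the title element and extract its text
    title = bs.find('h1').get_text()
    # 2. Select the main content of the article
    paragraphs = bs.select('div.article-body p')
    # 3. Join the paragraph texts into a single body string
    body = '\n'.join(p.get_text() for p in paragraphs)
    # 4. Return a Content object built from the strings found
    return Content(url, title, body)


content = scrapeExample('https://example.com/article', SAMPLE_HTML)
print(content.title)
print(content.body)
```

Passing the HTML in as a string keeps the parsing logic separate from the request, which makes it easy to check selectors before pointing the function at a live page.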




In [None]:
################### importing required libraries #############
from bs4 import BeautifulSoup
import requests

##############################################################
# Content class: each instance is characterized by its URL,  #
# title and body                                             #
#                                                            #  
# Contains info from a single url                            # 
#                                                            #
##############################################################

class Content:
  def __init__(self, url, title, body):
    self.url=url
    self.title=title
    self.body=body


##############################################################
# Defining functions to scrape                               #
#  argument= URL                                             #
#                                                            #
# getPage()                                                  # 
# scrapeNYTimes()                                            #
# scrapeBrookings()                                          #
##############################################################


###############################################################
# This function requests a URL and returns the page's HTML    #
# parsed as a BeautifulSoup object                            #
###############################################################

def getPage(url):
  # request page #
  req = requests.get(url)

  # return a beautifulsoup object (html source page) #
  return BeautifulSoup(req.text, 'html.parser')


###############################################################
# This function requests a URL, finds the page title and the  #
# body lines, and returns a Content instance                  #
###############################################################

def scrapeNYTimes(url):

  # request page #
  bs=getPage(url)

  # extract url title #
  title=bs.find('h1').text

  # extract lines #
  lines=bs.select('div.StoryBodyCompanionColumn div p')

  # extract set of lines as a report body #
  body='\n'.join([line.text for line in lines])

  # return Content instance #
  return Content(url, title, body)


#################################################################
# This function requests a page, gets the page title and        #
# the page body, and returns a Content instance                 #
#################################################################

def scrapeBrookings(url):

  # request page #
  bs=getPage(url)

  # extract page title #
  title=bs.find('h1').text

  # extract page body #
  body=bs.find('div',{'class': 'post-body'}).text

  # return Content instance #
  return Content(url, title, body)

##################################################################


# Scrape Websites Module #

*Objective: In this module we apply the Functions Module to NYT and Brookings URLs.*

In [None]:
#####################################
# Brookings URL                     #
#                                   #
#                                   #
#####################################

###### defining url #######
url='https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'

###### get content instance (url, title and body) ######
content=scrapeBrookings(url)

###### print content instance characteristic: title ####
print('Title: {}'.format(content.title))

print('Title printed')
###### print content instance characteristic: url ######
print('URL: {}\n'.format(content.url))

print('url printed')
###### print content instance characteristic: body ####
print(content.body)

print('body printed')

#####################################
# NYT URL                           #   
#                                   #
#                                   #
#####################################
url='https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'

content=scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)


To make things more convenient, rather than dealing with all of these
tag arguments and key/value pairs, you can use the `BeautifulSoup` `select`
function with a single CSS selector string for each piece of information
you want to collect, and put all of these selectors in a dictionary object.
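
As a minimal sketch of this idea, the selectors for one site can live in a dictionary and be applied in a loop. The HTML snippet, the dictionary keys, and the `extract` helper are assumptions for illustration, not part of the original module.

```python
from bs4 import BeautifulSoup

# Hypothetical page markup used only for this sketch
SAMPLE_HTML = """
<html><body>
  <h1>Example headline</h1>
  <div class="post-body"><p>Body text.</p></div>
</body></html>
"""

# One CSS selector string per piece of information to collect
selectors = {
    'title': 'h1',
    'body': 'div.post-body',
}


def extract(html, selectors):
    """Return a dict mapping each field name to the text matched by its selector."""
    bs = BeautifulSoup(html, 'html.parser')
    result = {}
    for field, selector in selectors.items():
        # select() returns a list of matches; an empty list means no match
        elems = bs.select(selector)
        result[field] = '\n'.join(e.get_text().strip() for e in elems)
    return result


fields = extract(SAMPLE_HTML, selectors)
print(fields['title'])
print(fields['body'])
```

Because a selector that matches nothing yields an empty list, a missing field comes back as an empty string instead of raising an error.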


In [None]:
##############################################
#  Defining classes instances                #  
#                                            #
##############################################


class Content:
  """
  Common base class for all articles/pages
  """
  def __init__(self, url, title, body):
    self.url=url
    self.title=title
    self.body=body

  def print(self):
    """
    Flexible printing function controls output

    """
    print('URL: {}'.format(self.url))
    print('TITLE: {}'.format(self.title))
    print('BODY: {}'.format(self.body))


##### collects information about how to collect data ####
##### the info here pertains to an entire website #######

class Website:
  """
  Contains information about website structure
  """

  def __init__(self, name, url, titleTag, bodyTag):
    self.name=name
    self.url=url
    self.titleTag=titleTag # CSS selector (e.g. 'h1') that indicates where the titles can be found
    self.bodyTag=bodyTag   # CSS selector that indicates where the body text can be found



The `Crawler` below scrapes the title and content of any URL
provided for a page of a given website.

In [None]:
class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

##################### initializing ###################

crawler=Crawler()


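# siteData is not defined before this point. The rows below are an assumption,
# assembled from the selectors used elsewhere in this notebook; they may need updating.
# Each row: name, url, titleTag, bodyTag
siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu', 'h1', 'div.post-body'],
    ['New York Times', 'https://www.nytimes.com', 'h1', 'div.StoryBodyCompanionColumn div p']
]

websites=[]

for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))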


crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
crawler.parse(
    websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(
    websites[2],
    'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(
    websites[3], 
    'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')


# Structuring Crawlers #



*   Crawling Sites Through Search
*   Crawling Sites Through Links
*   Crawling Multiple Page Types



# Crawling Sites Through Search #

* Most sites retrieve a list of search results for a particular
  topic by passing that topic as a string through a parameter
  in the URL.

  For example: `http://....//search=MyTopic`

  The first part of this URL can be saved as a property of the `Website` object, and the topic can simply be appended to it.

* After you've located and normalized the URLs on the search page, each result page can be scraped as in the previous examples. The `Content` class is much the same as before; it now also stores the search topic, alongside the URL that keeps track of where the content was found:
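
A minimal sketch of both steps, building the search URL and normalizing the result links, using only the standard library. The base URLs, the `buildSearchUrl` and `normalizeUrl` helpers, and the sample links are hypothetical. `quote_plus` is used here to encode spaces in the topic, which is an addition to plain string concatenation.

```python
from urllib.parse import quote_plus, urljoin


def buildSearchUrl(searchUrl, topic):
    """Append the URL-encoded topic to the site's search URL prefix."""
    return searchUrl + quote_plus(topic)


def normalizeUrl(siteUrl, link, absoluteUrl):
    """Return an absolute URL for a link found on the search page."""
    if absoluteUrl:
        # The site already lists full URLs in its search results
        return link
    # urljoin resolves a relative link against the site's base URL
    return urljoin(siteUrl, link)


print(buildSearchUrl('https://example.com/search?q=', 'data science'))
print(normalizeUrl('https://example.com', '/articles/1', False))
print(normalizeUrl('https://example.com', 'https://example.com/articles/2', True))
```

Using `urljoin` rather than string concatenation avoids doubled or missing slashes when the base URL and the relative link are combined.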









In [None]:
class Content:
  """ Common base class for all article/pages"""
  def __init__(self, topic, url, title, body):
    self.topic=topic
    self.title=title
    self.body=body
    self.url=url
  
  def print(self):
    """
    Flexible printing function controls output
    """
    print('New Article found for topic: {}'.format(self.topic))
    print('URL: {}'.format(self.url))
    print('TITLE: {}'.format(self.title))
    print('BODY:\n{}'.format(self.body))


# searchUrl defines where to get search results once the topic
# you are looking for is appended to it.

# resultListing defines the "box" that holds information about each result.

# resultUrl defines the tag inside this box that gives the exact URL
# for the result.

# absoluteUrl is a boolean that tells you whether these search
# results are absolute or relative URLs.


class Website:
  """ Contains information about website structure """

  def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
    self.name=name
    self.url=url
    self.searchUrl=searchUrl
    self.resultListing=resultListing
    self.resultUrl=resultUrl  # selector for the link inside each result
    self.absoluteUrl=absoluteUrl
    self.titleTag=titleTag
    self.bodyTag=bodyTag




In [None]:
class Crawler:
  def getPage(self, url):
    try:
      req = requests.get(url)
    except requests.exceptions.RequestException:
      return None
    return BeautifulSoup(req.text, 'html.parser')

  def safeGet(self, pageObj, selector):
    # Return the text of the first element matching the selector, or '' if none
    childObj = pageObj.select(selector)
    if childObj is not None and len(childObj) > 0:
      return childObj[0].get_text()
    return ''

  def search(self, topic, site):
    """
    Searches a given website for a given topic and records
    all pages found
    """
    bs = self.getPage(site.searchUrl + topic)
    if bs is None:
      # The search page itself could not be retrieved
      print('Something was wrong with the search page. Skipping!')
      return
    searchResults = bs.select(site.resultListing)
    for result in searchResults:
      url = result.select(site.resultUrl)[0].attrs['href']
      # Check to see whether it's a relative or an absolute URL
      if site.absoluteUrl:
        bs = self.getPage(url)
      else:
        bs = self.getPage(site.url + url)
      if bs is None:
        print('Something was wrong with that page or URL. Skipping!')
        # Move on to the next result instead of abandoning the search
        continue
      title = self.safeGet(bs, site.titleTag)
      body = self.safeGet(bs, site.bodyTag)
      if title != '' and body != '':
        # Argument order matches Content.__init__(topic, url, title, body)
        content = Content(topic, url, title, body)
        content.print()

crawler=Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'https://ssearch.oreilly.com/?q=',
        'article.product-result', 'p.title a', True, 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'http://www.reuters.com/search/news?blob=', 'div.search-result-content',
        'h3.search-result-title a', False, 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu', 'https://www.brookings.edu/search/?s=',
        'div.list-content article', 'h4.title a', True, 'h1', 'div.post-body']
]


sites = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2],
                         row[3], row[4], row[5], row[6], row[7]))

topics = ['python', 'data science']
for topic in topics:
    print('GETTING INFO ABOUT: ' + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)