In [1]:
# Web Crawling Models
import requests
from bs4 import BeautifulSoup


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find("h1").text
    lines = bs.find_all("p", {"class":"story-content"})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)


def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find("h1").text
    body = bs.find("div",{"class":"post-body"}).text
    return Content(url, title, body)


url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'

content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

Title: Delivering inclusive urban access: 3 uncomfortable truths
URL: https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/


The past few decades have been filled with a deep optimism about the role of cities and suburbs across the world. These engines of economic growth host a majority of world population, are major drivers of economic innovation, and have created pathways to opportunities for untold amounts of people.	
Authors






Jeffrey Gutman
Nonresident Senior Fellow - Global Economy and Development







Adie Tomer
Fellow - Metropolitan Policy Program

 Twitter
AdieTomer






But all is not well within our so-called Urban Century. Rapid urbanization, rising gentrification, concentrated poverty, and shortages of basic infrastructure have combined to create spatial inequity in cities and suburbs across the globe. The challenges of housing, moving, and employing so many people have led to longer travel times, ris

Title: The Men Who Want to Live Forever
URL: https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html




In [2]:
import requests
from bs4 import BeautifulSoup


class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
        
    def print_it(self):
        """
        Flexible printing function controls output
        """
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))


class WebsiteTags:
    """
    Contains information about website structure
    """
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
        
        

class Crawler:
    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a
        Beautiful Soup object and a selector. Returns an empty
        string if no object is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print_it()
                


crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu', 'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com', 'h1', 'p.story-content']
    ]

websites = []
for row in siteData:
    websites.append(WebsiteTags(row[0], row[1], row[2], row[3]))


crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')


URL: http://shop.oreilly.com/product/0636920028154.do
TITLE: Learning Python, 5th Edition 
BODY:

Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages. 

Complete with quizzes, exercises, and helpful illustrations,  this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X  and 2.X lines—plus all other releases in common use today. You’ll also learn some advanced language features that recently have become more common in Python code.

Explore Python’s major built-in object types such as numbers, lists, and dictionaries 
Create and process objects with Python statements, and learn Python’s general syntax model
Use functions

In [3]:
for web in websites:
    print(f'{web.name}    {web.url}    {web.titleTag}    {web.bodyTag}')


O'Reilly Media    http://oreilly.com    h1    section#product-description
Reuters    http://reuters.com    h1    div.StandardArticleBody_body_1gnLA
Brookings    http://www.brookings.edu    h1    div.post-body
New York Times    http://nytimes.com    h1    p.story-content


In [4]:
import requests
from bs4 import BeautifulSoup


class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.url = url
        self.title = title
        self.body = body
        
    def print_it(self):
        """
        Flexible printing function controls output
        """
        print("New article found for topic: {}".format(self.topic))
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))
        

class Website:
    """Contains information about website structure"""
    
    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl  # defines where you should go to get search results if you append the topic you are looking for.
        self.resultListing = resultListing  # defines the “box” that holds information about each result.
        self.resultUrl = resultUrl  # defines the tag inside this box that will give you the exact URL for the result.
        self.absoluteUrl=absoluteUrl  # This property is a boolean that tells you whether these search results are absolute or relative URLs.
        self.titleTag = titleTag
        self.bodyTag = bodyTag
        
    

class Crawler:
    
    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')
    
    def safeGet(self, pageObj, selector):
        childObj = pageObj.select(selector)
        if childObj is not None and len(childObj) > 0:
            return childObj[0].get_text()
        return ""
    
    def search(self, topic, site):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = self.getPage(site.searchUrl + topic)
        searchResults = bs.select(site.resultListing)
        for result in searchResults:
            url = result.select(site.resultUrl)[0].attrs["href"]
            # Check to see whether it's a relative or an absolute URL
            if(site.absoluteUrl):
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print("Something was wrong with that page or URL. Skipping!")
                return
        title = self.safeGet(bs, site.titleTag)
        body = self.safeGet(bs, site.bodyTag)
        if title != '' and body != '':
            content = Content(topic, title, body, url)
            content.print_it()



crawler = Crawler()
siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'https://ssearch.oreilly.com/?q=','article.product-result', 'p.title a', True, 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'http://www.reuters.com/search/news?blob=', 'div.search-result-content','h3.search-result-title a', False, 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu', 'https://www.brookings.edu/search/?s=', 'div.list-content article', 'h4.title a', True, 'h1', 'div.post-body']
]


sites = []

for row in siteData:
    sites.append(Website(row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7]))

topics = ['python', 'data science']

for topic in topics:
    print("GETTING INFO ABOUT: " + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)



GETTING INFO ABOUT: python
New article found for topic: python
URL: Appointments Apocalypse
TITLE: 

BODY:
https://www.brookings.edu/opinions/appointments-apocalypse/
GETTING INFO ABOUT: data science
New article found for topic: data science
URL: Data Science For Dummies
TITLE: 
Learn to:  Deduce, discover, and communicate valuable insights from structured, semi-structured, and unstructured data sources Use meaningful visualizations to display and interpret data Take advantage of data processing tools like Hadoop® and MapReduce Turn your organization's data into a competitive advantage  Gain in-depth insight into your business with data science—this book makes it easy! Big data is a big deal. This book helps you harness its power and give your business that all-important competitive edge. You'll learn to manage large amounts of data within hardware and software limitations, merge data sources, ensure consistent reporting, and interpret the data to tell your business story in a way that

==========================================
# Crawling Sites Through Links
==========================================

In [5]:
import requests
from bs4 import BeautifulSoup
import re


class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
        
    def print_it(self):
        """
        Flexible printing function controls output
        """
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))
        

class Website:
    """Contains information about website structure"""
    
    def __init__(self, name, url, targetPattern, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.targetPattern = targetPattern
        self.absoluteUrl=absoluteUrl  # This property is a boolean that tells you whether these search results are absolute or relative URLs.
        self.titleTag = titleTag
        self.bodyTag = bodyTag
        
    

class Crawler:
    
    def __init__(self, site):
        self.site = site
        self.visited = []
    
    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')
    
    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a
        Beautiful Soup object and a selector. Returns an empty
        string if no object is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''
    
    def parse(self, url):
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, self.site.titleTag)
            body = self.safeGet(bs, self.site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print_it()
    
    def crawl(self):
        """
        Get pages from website home page
        """
        bs = self.getPage(self.site.url)
        targetPages = bs.findAll('a', href = re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            targetPage = targetPage.attrs['href']
            if targetPage not in self.visited:
                self.visited.append(targetPage)
                if not self.site.absoluteUrl:
                    targetPage = '{}{}'.format(self.site.url, targetPage)
                self.parse(targetPage)


reuters = Website('Reuters', 'https://www.reuters.com', '^(/article/)', False, 'h1', 'div.StandardArticleBody_body')
crawler = Crawler(reuters)
crawler.crawl()


URL: https://www.reuters.com/article/deutschland-cdu-maassen-idDEKCN1V80EX
TITLE: CDU debattiert über Parteiausschluss von Maaßen
BODY:
German Defence Minister Annegret Kramp-Karrenbauer attends the weekly cabinet meeting in Berlin, Germany, August 14, 2019.   REUTERS/Fabrizio BenschBerlin (Reuters) - In der CDU wird über ein Parteiausschlussverfahren gegen den früheren Verfassungsschutzchef Hans-Georg Maaßen diskutiert.  CDU-Chefin Annegret Kramp-Karrenbauer sowie CDU-Generalsekretär Paul Ziemiak dementierten zwar, dass sie einen Ausschluss gefordert habe. Maaßen selbst reagierte aber enttäuscht und forderte im Gegenzug eine Abgrenzung der Sachsen-CDU von der Bundes-CDU. Ostdeutsche CDU-Politiker lehnten einen Rauswurf des Mitglieds der rechten Unions-Gruppierung WerteUnion ab, Sachsens Ministerpräsident Michael Kretschmer kritisierte die Bundes-CDU. “Diese neuerlichen Personaldebatten sind überhaupt nicht hilfreich”, sagte Thüringens Landeschef Mike Mohring dem ZDF.   Auslöser war ei

URL: https://www.reuters.com/article/deutschland-tesla-stornierung-idDEKCN1V61NX
TITLE: Elektroauto-Vermieter will keine Tesla mehr - "Qualitätsmängel"
BODY:
Tesla Superchargers are shown in Mojave, California, U.S., March 11, 2019. REUTERS/Mike BlakeFrankfurt (Reuters) - Der US-Elektroauto-Hersteller Tesla hat einen Millionenauftrag aus Deutschland verloren.  Der Elektroauto-Vermieter Nextmove aus Leipzig hatte Ende des vergangenen Jahres 100 “Tesla 3” bestellt, geriet sich aber nach der Lieferung der ersten 15 Autos im Frühjahr wegen Qualitätsmängeln mit dem Hersteller in die Haare. Nach Angaben von Tesla hat Nextmove die rund fünf Millionen Euro schwere Lieferung der restlichen 85 Modelle storniert. Nextmove erklärte, man habe die Auslieferung weiterer Autos “wegen schwerer Qualitäts- und Sicherheitsmängel” gestoppt. Doch statt diese zu beheben, habe Tesla ein 24-Stunden-Ultimatum gestellt und anschließend die Bestellung der noch nicht gelieferten Autos storniert. “Nur jedes vierte 

URL: https://www.reuters.com/article/deutschland-finanzen-idDEKCN1V61FL
TITLE: Medien - Schwarze Null wackelt im Falle einer Rezession
BODY:
German Chancellor Angela Merkel and Finance Minister Olaf Scholz make a statement before a meeting with representatives from German unions and industry at the government's guest house, Schloss Meseberg in Meseberg, Germany, June 17, 2019. REUTERS/Hannibal HanschkeBerlin (Reuters) - Bundeskanzlerin Angela Merkel und Finanzminister Olaf Scholz sind laut “Spiegel Online” bereit, im Falle einer Rezession das Ziel eines ausgeglichenen Haushalts aufzugeben. “Niemand hat die Absicht, einer Krise hinterherzusparen”, zitierte das Magazin einen Konjunkturexperten der Regierung. Merkel und Scholz hatten sich zuletzt allerdings zur Schwarzen Null im Bundeshaushalt bekannt.  Das Finanzministerium spielt Reuters-Informationen zufolge bereits wegen der voraussichtlich sehr kostspieligen Maßnahmen zum Klimaschutz eine Abkehr von der bisherigen Linie durch, ohne n

URL: https://www.reuters.com/article/m-rkte-idDEKCN1V61LB
TITLE: Börsen erholen sich - Anleger hoffen auf Geldsegen der Notenbanken
BODY:
Frankfurt (Reuters) - In Erwartung weiterer Geldspritzen der Notenbanken und weiterer Konjunkturhilfen haben Anleger sich zum Wochenschluss wieder aus der Deckung getraut.  The bull, symbol for successful trading, is seen in front of the German stock exchange (Deutsche Boerse) in Frankfurt, Germany, February 12, 2019.  REUTERS/Kai PfaffenbachDer Dax legte 1,3 Prozent auf 11.562,74 Punkte zu, der EuroStoxx50 gewann 1,4 Prozent auf 3329,08 Zähler. An den US-Börsen ging es ebenfalls aufwärts. An den Märkten[FEDWATCH] wird die Wahrscheinlichkeit, dass die US-Notenbank (Fed) auf ihrer Sitzung im September den Leitzins gleich um einen halben Prozentpunkt nach unten schraubt, inzwischen auf ein Drittel beziffert. Auch von der Europäischen Zentralbank (EZB) wird angesichts von Handelsstreit und Bremsspuren in der Wirtschaft ein Schritt nach unten erwartet. P

URL: https://www.reuters.com/article/mrkte-aktien-europa-wochenbilanz-idDEL8N25C40V
TITLE: TABELLE-Europäische Kursgewinner und -verlierer der Woche
BODY:
    Frankfurt, 16. Aug (Reuters) - Konjunktursorgen haben die
Anleger in der abgelaufenen Börsenwoche umgetrieben. Zum
Wochenschluss stützte die Hoffnung auf einen anhaltenden
Geldsegen der Notenbanken. Der Dax gab auf Wochensicht
mehr als ein Prozent ab. 
    
 Indizes                   +/-    Stand   Stand 
                            in   16.08.1   Vorwoc
                            %       9       he
 Dax                       -1,2  11.562,  11.693,
                                      74       80
 EuroStoxx50               -0,1  3.329,0  3.333,7
                                       8        4
 Stoxx50                   +0,1  3.063,3  3.060,5
                                       4        6
 EuroStoxx-Autoindex<.SXA  -3,0   416,98   429,82
 E>                                       
                                            

URL: https://www.reuters.com/article/deutschland-cdu-senftleben-idDEKCN1V80H4
TITLE: Brandenburgs CDU-Chef teilt Kramp-Karrenbauer-Kritik an Maaßen
BODY:
German Defence Minister Annegret Kramp-Karrenbauer attends the weekly cabinet meeting in Berlin, Germany, August 14, 2019.   REUTERS/Fabrizio BenschBerlin (Reuters) - Die CDU-Vorsitzende Annegret Kramp-Karrenbauer hat vom CDU-Landesvorsitzenden in Brandenburg demonstrative Unterstützung für ihre Kritik am früheren Verfassungsschutzpräsidenten Hans-Georg Maaßen erhalten. “Wenn jemand ständig gegen die Gemeinschaft Foul spielt, muss er sich nicht wundern, dass die Team-Managerin klare Worte findet”, sagte Ingo Senftleben am Sonntag der Nachrichtenagentur Reuters. In Brandenburg finden am 1. September wie in Sachsen Landtagswahlen statt.  Senftleben warf Maaßen, der Mitglied der konservativen Gruppierung WerteUnion innerhalb der Union ist, andauernde Spaltungsversuche der CDU vor. “Gerade CDU-Wähler sind aber sehr sensibel, weil ihnen Ge

============================================
# Crawling Multiple Page Types
============================================