***
Welcome! 

In this notebook we explore how to scrape some well-known web pages using the requests and BeautifulSoup libraries, as well as libraries built for specific websites.
***

### Index:

[1.1 - Scraping Web Pages - Wikipedia and Yahoo Finance](#1.1---Scraping-Web-Pages---Wikipedia-and-Yahoo-Finance)
<br>
[1.2 - Scraping Web Pages - LinkedIn](#1.2---Scraping-Web-Pages---LinkedIn)
<br>
[1.3 - Scraping using Specific Libraries](#1.3---Scraping-using-Specific-Libraries)

In [2]:
import requests
from bs4 import BeautifulSoup

### 1.1 - Scraping Web Pages - Wikipedia and Yahoo Finance

There are multiple ways to scrape information from web pages. One of the most common is to combine the requests library, which sends HTTP requests to a web page, with the BeautifulSoup library, which parses the HTML that comes back.
<br>
<br>
Let's see how!

In this example, we are going to scrape the Wikipedia page for the novel Nineteen Eighty-Four: https://en.wikipedia.org/wiki/Nineteen_Eighty-Four

In [3]:
response = requests.get(
    url="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four",
)

A 200 response is the HTTP OK status: it means that the page was found and retrieved successfully.

In [4]:
response

<Response [200]>

If we try to retrieve a page that does not exist, we get a different status code in the response:

In [5]:
response = requests.get(
    url="https://en.wikipedia.org/wiki/Nineteen_Eighty-Fourrrr",
)

print(response)

<Response [404]>
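Instead of reading the status off the printed response, the check can also be done in code. The sketch below is one possible pattern rather than part of the original notebook: `fetch_page` is a hypothetical helper name and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the response for `url`, or None if the request failed.

    `fetch_page` and the 10-second default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes (e.g. 404).
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers HTTP errors, timeouts, connection problems and malformed URLs.
        print(f"Request failed: {error}")
        return None
    return response
```

With this helper, a request for the misspelled URL above should print the error and return `None` instead of handing back the 404 page.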


Returning to our example:

In [6]:
response = requests.get(
    url="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four",
)

And seeing the content of the web page:

In [7]:
response.content

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Nineteen Eighty-Four - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"26e9ac1a-be9f-44ed-b775-209fab76c8e4","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Nineteen_Eighty-Four","wgTitle":"Nineteen Eighty-Four","wgCurRevisionId":1017503947,"wgRevisionId":1017503947,"wgArticleId":23454753,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: missing periodical","Webarchive template wayback links","All articles with dead external links","Articles with dead external l

Yikes! A bit confusing - beautifulsoup helps us to get some meaning out of this:

In [8]:
soup = BeautifulSoup(response.content, 'html.parser')

In [9]:
soup.find_all("p")[11]

<p>In the year 1984, civilization has been damaged by world war, civil conflict, and revolution. Airstrip One (formerly known as Great Britain) is a province of <a class="mw-redirect" href="/wiki/Nations_of_Nineteen_Eighty-Four" title="Nations of Nineteen Eighty-Four">Oceania</a>, one of the three <a href="/wiki/Totalitarianism" title="Totalitarianism">totalitarian</a> super-states that rule the world. It is ruled by the "Party" under the ideology of "<a href="/wiki/Ingsoc" title="Ingsoc">Ingsoc</a>" (a Newspeak shortening of "English Socialism") and the mysterious leader <a href="/wiki/Big_Brother_(Nineteen_Eighty-Four)" title="Big Brother (Nineteen Eighty-Four)">Big Brother</a>, who has an intense <a href="/wiki/Cult_of_personality" title="Cult of personality">cult of personality</a>. The Party brutally purges out anyone who does not fully conform to their regime using the <a href="/wiki/Thought_Police" title="Thought Police">Thought Police</a> and constant surveillance through <a hr
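A paragraph like the one above mixes visible text with `<a>` tags. As a minimal sketch, assuming a small hand-written snippet in place of the real page, the text and the link targets can be pulled apart like this (`sample_soup`, `plain_text` and `links` are names made up for the example):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# A small hand-written snippet standing in for a downloaded page (an assumption
# for illustration; the real Wikipedia markup is much larger).
html = """
<p>Airstrip One is a province of
<a href="/wiki/Nations_of_Nineteen_Eighty-Four" title="Nations">Oceania</a>,
ruled under <a href="/wiki/Ingsoc" title="Ingsoc">Ingsoc</a>.</p>
"""

sample_soup = BeautifulSoup(html, "html.parser")
paragraph = sample_soup.find("p")

# get_text() drops the tags; split/join collapses the line breaks into spaces.
plain_text = " ".join(paragraph.get_text().split())

# Each <a> tag exposes its attributes like a dictionary; urljoin turns the
# relative href into an absolute URL.
base_url = "https://en.wikipedia.org"
links = [urljoin(base_url, a["href"]) for a in paragraph.find_all("a")]

print(plain_text)
print(links)
```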

In [10]:
# get_text() is an easy way to extract all of the text from the parsed page
print(soup.get_text())





Nineteen Eighty-Four - Wikipedia

































Nineteen Eighty-Four

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
1949 dystopian novel by George Orwell
This article is about the 1949 novel by George Orwell. For the year, see 1984. For other uses, see 1984 (disambiguation).
Not to be confused with 1Q84.


Nineteen Eighty-Four: A Novel First edition coverAuthorGeorge OrwellCover artistMichael KennarCountryUnited KingdomLanguageEnglishGenreDystopian, political fiction, social science fictionSet inLondon, Airstrip One, OceaniaPublisherSecker & WarburgPublication date8 June 1949 (1949-06-08)Media typePrint (hardback and paperback)Pages328Awards
NPR Top 100 Science Fiction and Fantasy Books
OCLC470015866Dewey Decimal823.912[1]Preceded byAnimal Farm 
Nineteen Eighty-Four: A Novel, often referred to as 1984, is a dystopian social science fiction novel by English novelist George Orwell. It was published on 8 June 1949 by Secker & Warburg as O

From here, we can apply the common text techniques we have learned so far, such as word_tokenize, lower() and FreqDist:

In [11]:
import nltk
nltk.FreqDist(
    nltk.word_tokenize(soup.get_text().lower())
).most_common(20)

# Notice how most of the tokens below are stop words or punctuation!

[('the', 990),
 (',', 853),
 ('.', 795),
 ('of', 489),
 ('and', 380),
 ("''", 288),
 ('in', 280),
 ('(', 245),
 (')', 245),
 ('``', 240),
 ('a', 228),
 ('to', 217),
 ('[', 177),
 (']', 177),
 ('is', 163),
 ("'s", 161),
 (':', 137),
 ('that', 137),
 ('^', 120),
 ('orwell', 115)]
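Since stop words and punctuation dominate the counts above, a natural next step is to filter them out. The sketch below uses only the standard library so that it runs without any NLTK downloads; the stop list is a tiny hand-picked set for illustration, and `top_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# A tiny hand-picked stop list for illustration only; a real analysis would use
# a fuller list such as the one shipped with NLTK.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "that", "by", "it", "as"}


def top_words(text, n=5):
    """Count lowercase alphabetic tokens, skipping stop words and punctuation."""
    # The regex keeps runs of letters only, so punctuation and digits drop out.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS).most_common(n)


sample = (
    "Nineteen Eighty-Four is a dystopian novel by George Orwell. "
    "The novel is set in Oceania, and Orwell published the novel in 1949."
)
print(top_words(sample, n=2))  # [('novel', 3), ('orwell', 2)]
```

The same function could be applied to `soup.get_text()`, where content words such as "orwell" should then rise towards the top of the list.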

As you can see, the text can be a bit harder to extract because of all the HTML tags in the middle of it (we'll look at a simple Wikipedia extractor later, in section 1.3). Some websites are easier than others. As an example, let's try to extract a news article from Yahoo Finance:
- https://finance.yahoo.com/news/jury-tells-apple-pay-308-034612898.html

In [12]:
yahoo_finance_url = 'https://finance.yahoo.com/news/jury-tells-apple-pay-308-034612898.html'
yf = requests.get(
    url=yahoo_finance_url,
)

soup_yf = BeautifulSoup(yf.content, 'html.parser')

In [13]:
yf

<Response [200]>

In [14]:
soup_yf

<!DOCTYPE html>
<html class="Fz(62.5%) Pos(r) desktop bktfinance-US-en-US-def ua-undefined ua-undefined" id="atomic" lang="en-US"><head><script>
        window.performance.mark('PageStart');
        document.documentElement.className += ' JsEnabled jsenabled';
        /**
        * Empty darlaOnready method, to avoid JS error.
        * This can happen when Async Darla JS file is loaded earlier than Darla Proxy JS.
        * This method will be overridden by Darla Proxy
        */
        window.darlaOnready = function() {};
        </script><title>U.S. jury tells Apple to pay $308.5 million for patent infringement</title><meta content="text/html, charset=utf-8" http-equiv="content-type"/><meta content="on" http-equiv="x-dns-prefetch-control"/><meta content="chrome=1" http-equiv="X-UA-Compatible"/><meta content="guce.yahoo.com" name="oath:guce:consent-host"/><meta content="Apple Inc, Personalized Media Communications LLC, digital rights management, Alphabet Inc, Netflix Inc" name="news

In [15]:
# find_all searches by HTML tag and returns a list with every matching
# element; here we pick the second <p>
soup_yf.find_all("p")[1]

# find returns only the first matching element
# (as the last expression in the cell, this is the value displayed below)
soup_yf.find("p")

<p class="Fz(14px) LineClamp(1,1.3em) M(0)">Yahoo Finance hosts all-star panel to preview the Berkshire Hathaway meeting on April 26th at 12 p.m. ET. </p>

In [16]:
soup_yf.find(id="Header")

<header class="Pos(r) T(0) W(100%)" id="Header"><div class="wafer-rapid-module" id="module-header"><div><style>/*! Copyright 2017 Yahoo Holdings, Inc. All rights reserved. */template{display:none}[dir=ltr] ._yb_6egfr{text-align:left}[dir=rtl] ._yb_6egfr{text-align:right}._yb_6egfr{font-family:Helvetica Neue,Helvetica,Tahoma,Geneva,Arial,sans-serif;font-weight:400;font-stretch:normal;direction:ltr;display:block;box-sizing:border-box;-webkit-font-smoothing:antialiased;z-index:1000;overflow-anchor:none}.ybar-ytheme-fuji2._yb_6egfr,.ybar-ytheme-oneyahoo._yb_6egfr{font-family:YahooSans VF,YahooSans,Helvetica Neue,Helvetica,Tahoma,Geneva,Arial,sans-serif}#ybar._yb_15iq9{margin:0 auto}._yb_6egfr ._yb_1hfbb{display:flex;flex-direction:column}.ybar-dark .ybar-property-homepage._yb_fvyn8 ._yb_1hfbb{background-image:linear-gradient(to top right,#7282fb,#755bf9,#7934f7);background-color:#7934f7}.ybar-light .ybar-property-homepage._yb_fvyn8 ._yb_1hfbb{background-color:#fff}._yb_1jxu7{display:flex;j

In [17]:
soup_yf.find(id="postArticle")

<div class="Pos(r) Pstart(15%) Pend(32%) Mb($gridMargin) Mt(-20px)" id="postArticle"></div>

In [18]:
# You can also search by class in HTML tags
soup_yf.find(class_="_yb_ydmoo")

You can check the ids and classes of the elements on a web page by right-clicking on the page in your browser and choosing *Inspect*.
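To make the difference between these lookups concrete, here is a small sketch on hand-written markup. The ids and class names are made up for the example and do not come from Yahoo Finance.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration; the ids and classes are made up.
html = """
<header id="Header"><p class="teaser">Markets preview</p></header>
<div id="article"><p class="body">First paragraph.</p><p class="body">Second paragraph.</p></div>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, find_all() returns a list of every match.
first_p = sample_soup.find("p")
all_p = sample_soup.find_all("p")

# Searching by id or by class; class_ has a trailing underscore because
# `class` is a reserved word in Python.
article = sample_soup.find(id="article")
body_paragraphs = article.find_all("p", class_="body")

# find() returns None when nothing matches, so guard before using the result.
missing = sample_soup.find(class_="does-not-exist")

print(first_p.get_text())                        # Markets preview
print(len(all_p))                                # 3
print([p.get_text() for p in body_paragraphs])   # ['First paragraph.', 'Second paragraph.']
print(missing is None)                           # True
```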

In [19]:
# Using get_text() on the Yahoo Finance page
soup_yf.get_text()

'U.S. jury tells Apple to pay $308.5 million for patent infringement        HOME    MAIL    NEWS    FINANCE    SPORTS    ENTERTAINMENT    LIFE    SHOPPING    YAHOO PLUS    MORE...           Yahoo Finance Sign in    Mail Sign in to view your mail    Finance  Finance   Watchlists  Watchlists   My Portfolio  My Portfolio  Screeners  Screeners   Saved ScreenersSaved Screeners Equity ScreenerEquity Screener Mutual Fund ScreenerMutual Fund Screener ETF ScreenerETF Screener Future ScreenerFuture Screener Index ScreenerIndex Screener    Yahoo Finance Plus  Yahoo Finance Plus   DashboardDashboard Research ReportsResearch Reports Investment IdeasInvestment Ideas BlogBlog    Markets  Markets   CryptocurrenciesCryptocurrencies CalendarsCalendars Trending TickersTrending Tickers Stocks: Most ActivesStocks: Most Actives Stocks: GainersStocks: Gainers Stocks: LosersStocks: Losers Top ETFsTop ETFs FuturesFutures World IndicesWorld Indices CurrenciesCurrencies Top Mutual FundsTop Mutual Funds Options: 
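The output above is dominated by navigation menus, and words from adjacent tags run together ("ScreenersScreeners"). One way to get cleaner text, sketched here on hand-written markup rather than the real Yahoo Finance page, is to drop non-content tags, restrict extraction to the container that holds the story, and pass a separator to `get_text`. The `<article>` container is an assumption; the right tag or id has to be found by inspecting the actual page.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration; real pages use different tags and ids.
html = """
<html><body>
<nav><a href="/">HOME</a><a href="/mail">MAIL</a></nav>
<script>window.tracking = true;</script>
<article><h1>Example headline</h1><p>First sentence of the story.</p></article>
</body></html>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# Remove elements that never hold article text.
for tag in sample_soup.find_all(["nav", "script", "style"]):
    tag.decompose()

# Restrict extraction to the container that holds the story, and put a space
# between text fragments so words from adjacent tags do not run together.
article = sample_soup.find("article")
article_text = article.get_text(separator=" ", strip=True)

print(article_text)  # Example headline First sentence of the story.
```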

Scraping the web is not that simple, and whether it is permitted depends on the legislation of your country. Most common scraping applications are not illegal in Europe and the US (as of 2020), and there are businesses that rely on scraping other web pages. See the following articles on this topic:
- https://www.imperva.com/blog/is-web-scraping-illegal/
- https://scrapediary.com/is-web-scraping-legal/
<br>
<br>
This does not mean that anything goes: abusive scraping, such as collecting data from behind login pages or generating a request load that degrades the website for other users, is discouraged and unethical.
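One practical way to keep the load on a site low is to respect its robots.txt rules and pause between requests. The sketch below uses the standard library's `urllib.robotparser`; the rules, the example.com URLs and the one-second pause are all assumptions made for illustration, and following robots.txt is a courtesy, not legal advice.

```python
import time
from urllib.robotparser import RobotFileParser

# Example robots.txt rules written for this sketch; a real site publishes its
# own file at https://<site>/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)


def allowed_urls(urls, user_agent="*"):
    """Keep only the URLs that the parsed robots.txt rules permit."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/news/article-1",
    "https://example.com/private/profile",
]

for url in allowed_urls(candidates):
    print(f"OK to request: {url}")
    time.sleep(1)  # pause between requests to keep the load on the site low
```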

### 1.2 - Scraping Web Pages - LinkedIn

Some websites may block you after a certain number of requests, or put some kind of gatekeeper in place that prevents you from scraping the page. One famous example is LinkedIn profiles:

In [20]:
linkedin_ivo = 'https://pt.linkedin.com/in/ivobernardo'
linkedin = requests.get(
    url=linkedin_ivo,
)


In [21]:
soup_linkedin = BeautifulSoup(linkedin.content, 'html.parser')

In [22]:
soup_linkedin

<html><head>
<script type="text/javascript">
window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  // treat .cn similar to .com
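The page returned above consists of a redirect script rather than profile content. A rough way to notice this in code, assuming that a real content page carries a reasonable amount of visible text, is to measure the text left once scripts and styles are removed. `looks_like_gatekeeper` and its 200-character threshold are made up for this sketch and are only a heuristic.

```python
from bs4 import BeautifulSoup


def looks_like_gatekeeper(html, min_text_chars=200):
    """Heuristic: a page with almost no visible text is probably not the real content.

    The function name and the 200-character threshold are assumptions made
    for this sketch, not an official detection method.
    """
    page = BeautifulSoup(html, "html.parser")
    # Script and style bodies are code, not content a reader would see.
    for tag in page.find_all(["script", "style"]):
        tag.decompose()
    visible_text = page.get_text(separator=" ", strip=True)
    return len(visible_text) < min_text_chars


# A stand-in for a script-only redirect page like the one shown above.
redirect_page = (
    "<html><head><script>window.onload = function() {};</script></head>"
    "<body></body></html>"
)
print(looks_like_gatekeeper(redirect_page))  # True
```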

### 1.3 - Scraping using Specific Libraries

Sometimes a dedicated library is more convenient when we want to scrape a specific website. As an example, let's experiment with the wikipedia library.

In [23]:
import wikipedia
summary_orwell = wikipedia.summary("Nineteen_Eighty-Four")  

In [24]:
summary_orwell

'Nineteen Eighty-Four: A Novel, often referred to as 1984, is a dystopian social science fiction novel by English novelist George Orwell. It was published on 8 June 1949 by Secker & Warburg as Orwell\'s ninth and final book completed in his lifetime. Thematically, Nineteen Eighty-Four centres on the consequences of totalitarianism, mass surveillance, and repressive regimentation of persons and behaviours within society. Orwell, himself a democratic socialist, modelled the authoritarian government in the novel after Stalinist Russia. More broadly, the novel examines the role of truth and facts within politics and the ways in which they are manipulated.\nThe story takes place in an imagined future, the year 1984, when much of the world has fallen victim to perpetual war, omnipresent government surveillance, historical negationism, and propaganda. Great Britain, known as Airstrip One, has become a province of a totalitarian superstate named Oceania that is ruled by the Party who employ th

In [26]:
fullpage_orwell = wikipedia.page("Nineteen_Eighty-Four")

In [27]:
# and to extract the full text from the page
fullpage_orwell.content

'Nineteen Eighty-Four: A Novel, often referred to as 1984, is a dystopian social science fiction novel by English novelist George Orwell. It was published on 8 June 1949 by Secker & Warburg as Orwell\'s ninth and final book completed in his lifetime. Thematically, Nineteen Eighty-Four centres on the consequences of totalitarianism, mass surveillance, and repressive regimentation of persons and behaviours within society. Orwell, himself a democratic socialist, modelled the authoritarian government in the novel after Stalinist Russia. More broadly, the novel examines the role of truth and facts within politics and the ways in which they are manipulated.\nThe story takes place in an imagined future, the year 1984, when much of the world has fallen victim to perpetual war, omnipresent government surveillance, historical negationism, and propaganda. Great Britain, known as Airstrip One, has become a province of a totalitarian superstate named Oceania that is ruled by the Party who employ th