# Web Scraping

### Load HTML from URL

In [6]:
from bs4 import BeautifulSoup
import requests
url = "http://www.packtpub.com/books"
page = requests.get(url)
soup_packtpage = BeautifulSoup(page.content, 'html.parser')
soup_packtpage

 <!DOCTYPE doctype html>

<html lang="en">
<head>
<script>
    var BASE_URL = 'https://www.packtpub.com/';
    var require = {
        "baseUrl": "https://www.packtpub.com/static/version1600690935/frontend/Packt/default/en_GB"
    };
</script>
<meta charset="utf-8"/><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"NRJS-0f4d86b78cc0c8047b9",applicationID:"475968873"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(c(arguments)),n?null:this,t),n?void 0:this}}var o=e("handle"),a=e(5),c=e(6),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewNam
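
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, assuming a requests-style response object; the helper name `soup_from_response` is hypothetical and not part of requests or Beautiful Soup.

```python
from bs4 import BeautifulSoup


def soup_from_response(response, parser="html.parser"):
    """Parse a requests-style response, failing early on HTTP errors."""
    # requests.Response.raise_for_status() raises HTTPError for 4xx/5xx codes.
    response.raise_for_status()
    # response.content holds the raw bytes; Beautiful Soup detects the encoding.
    return BeautifulSoup(response.content, parser)
```

With requests this would typically be called as `soup_from_response(requests.get(url, timeout=10))`.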

### Load HTML from Local

In [10]:
with open("data/foo.html", "r") as foo_file:
    soup_foo = BeautifulSoup(foo_file, 'html.parser')
    
soup_foo

<p>Hello World</p>

In [12]:
print(soup_foo)

<p>Hello World</p>
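
The same local-file pattern can be tried without an existing `data/foo.html`. This sketch writes a temporary file first (the temporary path is an assumption made only so the example runs anywhere) and passes the encoding explicitly rather than relying on the platform default.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.html")
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write("<p>Hello World</p>")

    # Beautiful Soup reads the whole file handle when the object is built.
    with open(path, "r", encoding="utf-8") as foo_file:
        soup_local = BeautifulSoup(foo_file, "html.parser")

print(soup_local)
```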



### Parsing XML file

In [15]:
helloworld = "<p>Hello World</p>"
soup_string = BeautifulSoup(helloworld, 'lxml')
soup_string

<html><body><p>Hello World</p></body></html>

In [17]:
soup_xml = BeautifulSoup(helloworld, features="xml")
soup_xml

<?xml version="1.0" encoding="utf-8"?>
<p>Hello World</p>

In [18]:
invalid_html = '<a invalid content'
soup_invalid_html = BeautifulSoup(invalid_html,'lxml')
soup_invalid_html

<html><body><a content="" invalid=""></a></body></html>

In [19]:
soup_invalid_html = BeautifulSoup(invalid_html,'html5lib')
soup_invalid_html

<html><head></head><body></body></html>

In [20]:
soup_invalid_html = BeautifulSoup(invalid_html,'html.parser')
soup_invalid_html
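
Since `lxml` and `html5lib` are optional installs, a comparison like the one above may fail on another machine. A hedged sketch of running the same markup through each parser and skipping the ones that are missing; the helper name `parse_with_each` is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree}, or None where a parser is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is unavailable.
            results[name] = None
    return results


results = parse_with_each("<p>Hello World</p>")
for name, tree in results.items():
    print(name, "->", tree)
```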



### Tags

In [35]:
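# Note: each href below opens with " but closes with ', so the parser reads
# both links as one <a> tag whose href swallows the text in between.
html_atag = """<html><body><p>Test html a tag example</p>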
<a href="http://www.packtpub.com'>Home</a>
<a href="http;//www.packtpub.com/books'>Books</a>
</body>
</html>"""
soup = BeautifulSoup(html_atag,'lxml')
atag = soup.a
print(atag)

<a href="http://www.packtpub.com'&gt;Home&lt;/a&gt;
&lt;a href=" http="">Books</a>


In [36]:
type(atag)

bs4.element.Tag

In [37]:
atag.name

'a'

In [38]:
# Change the name of the tag
atag.name = 'p'
print(soup)
atag.name = 'a'

<html><body><p>Test html a tag example</p>
<p href="http://www.packtpub.com'&gt;Home&lt;/a&gt;
&lt;a href=" http="">Books</p>
</body>
</html>


In [41]:
atag = soup.a
print(atag['href'])

http://www.packtpub.com'>Home</a>
<a href=


In [42]:
print(atag.attrs)

{'href': "http://www.packtpub.com'>Home</a>\n<a href=", 'http': ''}
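
A sketch of the same attribute lookups on markup whose `href` values are quoted consistently (an assumption about what the example intended). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_links = """<html><body><p>Test html a tag example</p>
<a href="http://www.packtpub.com">Home</a>
<a href="http://www.packtpub.com/books">Books</a>
</body></html>"""
soup_links = BeautifulSoup(html_links, "html.parser")

first_link = soup_links.a               # first <a> in document order
first_href = first_link["href"]         # dictionary-style access, KeyError if absent
missing = first_link.get("title")       # .get() returns None for a missing attribute

# Collect the href of every link on the page.
all_hrefs = [a.get("href") for a in soup_links.find_all("a")]
print(all_hrefs)
```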


### The NavigableString Object

In [46]:
first_a_string = atag.string
first_a_string

'Books'
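
The value returned by `.string` is a `NavigableString`, a `str` subclass that still knows where it sits in the tree. A small sketch on made-up markup, which also shows that `.string` is `None` when a tag has more than one child:

```python
from bs4 import BeautifulSoup, NavigableString

soup_ns = BeautifulSoup(
    "<a href='#'>Books</a><div><b>one</b><i>two</i></div>", "html.parser"
)

link_text = soup_ns.a.string
is_navigable = isinstance(link_text, NavigableString)
plain_text = str(link_text)            # detach from the tree as a plain str
parent_name = link_text.parent.name    # the string remembers its enclosing tag

# <div> has two children, so there is no single string to return.
ambiguous = soup_ns.div.string
```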

### Formatted Printing

In [47]:
html_markup = """<p class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""

soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify())

<html>
 <body>
  <p class="ecopyramid">
  </p>
  <ul id="producers">
   <li class="producerlist">
    <div class="name">
     plants
    </div>
    <div class="number">
     100000
    </div>
   </li>
   <li class="producerlist">
    <div class="name">
     algae
    </div>
    <div class="number">
     100000
    </div>
   </li>
  </ul>
 </body>
</html>


In [51]:
producer_entry = soup.ul
print(producer_entry.prettify())

<ul id="producers">
 <li class="producerlist">
  <div class="name">
   plants
  </div>
  <div class="number">
   100000
  </div>
 </li>
 <li class="producerlist">
  <div class="name">
   algae
  </div>
  <div class="number">
   100000
  </div>
 </li>
</ul>


### Unformatted Print

In [53]:
print(str(soup))

<html><body><p class="ecopyramid">
</p><ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul></body></html>


In [56]:
print(soup.decode())

<html><body><p class="ecopyramid">
</p><ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul></body></html>
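
`decode` is a method, so it has to be called to produce the markup. A short sketch contrasting `decode()`, which returns a `str`, with `encode()`, which returns `bytes`; the sample markup is made up.

```python
from bs4 import BeautifulSoup

soup_enc = BeautifulSoup("<p>café</p>", "html.parser")

as_text = soup_enc.decode()            # unicode string, same content as str(soup_enc)
as_bytes = soup_enc.encode("utf-8")    # byte string in the requested encoding

print(type(as_text).__name__, type(as_bytes).__name__)
```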


### Output formatters in Beautiful Soup

In [57]:
html_markup = """<html>
<body>& &amp; ampersand
¢ &cent; cent
© &copy; copyright
÷ &divide; divide
> &gt; greater than
</body>
</html>
"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify())

<html>
 <body>
  &amp; &amp; ampersand
¢ ¢ cent
© © copyright
÷ ÷ divide
&gt; &gt; greater than
 </body>
</html>



In [58]:
print(soup.prettify(formatter="html"))

<html>
 <body>
  &amp; &amp; ampersand
&cent; &cent; cent
&copy; &copy; copyright
&divide; &divide; divide
&gt; &gt; greater than
 </body>
</html>



In [59]:
print(soup.prettify(formatter=None))

<html>
 <body>
  & & ampersand
¢ ¢ cent
© © copyright
÷ ÷ divide
> > greater than
 </body>
</html>



In [61]:
def remove_chara(markup):
    return markup.replace("a","")

In [62]:
print(soup.prettify(formatter=remove_chara))

<html>
 <body>
  & & mpersnd
¢ ¢ cent
© © copyright
÷ ÷ divide
> > greter thn
 </body>
</html>
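
The four formatter styles can be compared side by side on one short string. This is a sketch on made-up markup; a callable formatter is applied to each string and attribute value, and no entity substitution is done for it.

```python
from bs4 import BeautifulSoup

soup_fmt = BeautifulSoup("<p>Tom &amp; Jerry &copy; 2020</p>", "html.parser")

minimal = soup_fmt.decode(formatter="minimal")   # default: escape only &, <, >
as_html = soup_fmt.decode(formatter="html")      # use named HTML entities where possible
raw = soup_fmt.decode(formatter=None)            # no escaping at all
shout = soup_fmt.decode(formatter=lambda text: text.upper())  # custom callable

for label, value in [("minimal", minimal), ("html", as_html),
                     ("None", raw), ("callable", shout)]:
    print(label, "->", value)
```

Output produced with `formatter=None` or with a callable may no longer be valid HTML, because characters such as `&` are written unescaped.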



### `get_text()`

In [63]:
html_markup = """<p class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.get_text())




plants
100000


algae
100000
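
The blank lines in this output come from the whitespace between tags. A sketch of two ways to get cleaner text, using the `separator` and `strip` arguments of `get_text()` and the `stripped_strings` generator on a shortened copy of the markup:

```python
from bs4 import BeautifulSoup

markup = """<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""
soup_list = BeautifulSoup(markup, "html.parser")

# Join the text pieces with a space and drop whitespace-only strings.
one_line = soup_list.get_text(separator=" ", strip=True)

# The same pieces as a list, one entry per non-empty string.
pieces = list(soup_list.stripped_strings)
print(one_line)
```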




In [65]:
soup_packtpage = BeautifulSoup(page.content,"lxml")
print(soup_packtpage.get_text())




    var BASE_URL = 'https://www.packtpub.com/';
    var require = {
        "baseUrl": "https://www.packtpub.com/static/version1600690935/frontend/Packt/default/en_GB"
    };

(window.NREUM||(NREUM={})).loader_config={licenseKey:"NRJS-0f4d86b78cc0c8047b9",applicationID:"475968873"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(c(arguments)),n?null:this,t),n?void 0:this}}var o=e("handle"),a=e(5),c=e(6),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],p="api-",l=p+"ixn-";a(d,fun

In [66]:
[x.extract() for x in soup_packtpage.find_all('script')]

[<script>
     var BASE_URL = 'https://www.packtpub.com/';
     var require = {
         "baseUrl": "https://www.packtpub.com/static/version1600690935/frontend/Packt/default/en_GB"
     };
 </script>,
 <script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"NRJS-0f4d86b78cc0c8047b9",applicationID:"475968873"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(c(arguments)),n?null:this,t),n?void 0:this}}var o=e("handle"),a=e(5),c=e(6),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace"
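
`extract()` removes a tag and returns it, which is why the cell above echoes every script. When the removed tags are not needed, `decompose()` discards them instead. A minimal sketch on made-up markup that strips scripts and styles before reading the text:

```python
from bs4 import BeautifulSoup

html_page = (
    "<html><head><script>var BASE_URL = 'x';</script>"
    "<style>p {color: red}</style></head>"
    "<body><p>Visible text</p></body></html>"
)
soup_page = BeautifulSoup(html_page, "html.parser")

# find_all returns a list, so removing tags while looping over it is safe.
for tag in soup_page.find_all(["script", "style"]):
    tag.decompose()

visible = soup_page.get_text(strip=True)
print(visible)
```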

### Creating a Web Scraper

In [81]:
import requests
import re
from bs4 import BeautifulSoup
packtpub_url = "http://www.packtpub.com/"

In [99]:
def get_bookurls(url):
    page = requests.get(url)
    soup_packtpage = BeautifulSoup(page.content, "lxml")
    page.close()
    next_page_li = soup_packtpage.find("li", class_="pager-next last")
    if next_page_li is None:
        next_page_url = None
    else:
        next_page_url = packtpub_url+next_page_li.a.get('href')
    return next_page_url

In [100]:
start_url = "http://www.packtpub.com/books"
get_bookurls(start_url)
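
Concatenating `packtpub_url` with an `href` only works for one shape of link. A sketch using the standard library's `urljoin`, which resolves a link against the page it came from; the `href` values below are hypothetical examples, not taken from the site.

```python
from urllib.parse import urljoin

base_url = "http://www.packtpub.com/"

# Plain concatenation doubles the slash when the href is root-relative.
concatenated = base_url + "/books?page=1"

# urljoin handles root-relative, query-only, and absolute links.
joined_root = urljoin(base_url, "/books?page=1")
joined_query = urljoin("http://www.packtpub.com/books", "?page=2")
joined_full = urljoin(base_url, "https://example.com/other")

print(concatenated)
print(joined_root)
```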

In [102]:
start_url = "http://www.packtpub.com/books"
continue_scraping = True
books_url = [start_url]
while continue_scraping:
    next_page_url = get_bookurls(start_url)
    if next_page_url is None:
        # No "next" link on this page, so stop before requesting None.
        continue_scraping = False
    else:
        books_url.append(next_page_url)
        start_url = next_page_url
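
A pagination loop needs to stop as soon as no next link is found, and it should never pass `None` to the HTTP client. The sketch below shows one way to structure that, using a hypothetical in-memory "site" (`FAKE_PAGES`) in place of real requests so that it runs offline; the function names are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical pages standing in for HTTP responses.
FAKE_PAGES = {
    "http://example.com/books":
        '<li class="pager-next last"><a href="/books?page=1">next</a></li>',
    "http://example.com/books?page=1":
        '<li class="pager-next last"><a href="/books?page=2">next</a></li>',
    "http://example.com/books?page=2": "<p>last page</p>",
}


def find_next_url(current_url, html):
    """Return the absolute URL of the next page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    next_li = soup.find("li", class_="pager-next")
    if next_li is None or next_li.a is None:
        return None
    return urljoin(current_url, next_li.a.get("href"))


def collect_page_urls(start_url, pages, max_pages=50):
    """Follow next links until they run out, repeat, or max_pages is reached."""
    urls, seen = [], set()
    url = start_url
    while url is not None and url not in seen and len(urls) < max_pages:
        seen.add(url)
        urls.append(url)
        url = find_next_url(url, pages[url])
    return urls


print(collect_page_urls("http://example.com/books", FAKE_PAGES))
```

The `seen` set and the `max_pages` cap guard against a site whose next links loop back on themselves.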

In [105]:
p_dict = {'action': 'query',
          'prop': 'revisions',
          'titles': 'University_of_Virginia',
          'rvslots': '*',
          'rvprop': 'content',
          'formatversion': '2',
          'format': 'json'}
r = requests.get("https://en.wikipedia.org/w/api.php", params=p_dict)
output = r.text
output

'{"batchcomplete":true,"query":{"normalized":[{"fromencoded":false,"from":"University_of_Virginia","to":"University of Virginia"}],"pages":[{"pageid":59801,"ns":0,"title":"University of Virginia","revisions":[{"slots":{"main":{"contentmodel":"wikitext","contentformat":"text/x-wiki","content":"{{redirect|UVa||Uva (disambiguation){{!}}Uva}}\\n{{short description|University in Charlottesville, Virginia, United States}}\\n\\n{{use American English|date = April 2019}}\\n{{use mdy dates|date=March 2019}}\\n\\n{{Infobox university\\n| name = University of Virginia\\n| image_name = University of Virginia seal.svg \\n| image_upright = 0.7\\n| founder = [[Thomas Jefferson]]\\n| established = {{start date and age|January 25, 1819}}<ref>{{cite web |title = History |url = https://bicentennial.virginia.edu/history |website = University of Virginia Bicentennial |publisher = Rector and Visitors of the University of Virginia |access-date = March 3, 2020}}</ref>\\n| type = [[Public university#United Sta

In [106]:
from bs4 import BeautifulSoup
output=BeautifulSoup(r.text, 'html.parser')
output

{"batchcomplete":true,"query":{"normalized":[{"fromencoded":false,"from":"University_of_Virginia","to":"University of Virginia"}],"pages":[{"pageid":59801,"ns":0,"title":"University of Virginia","revisions":[{"slots":{"main":{"contentmodel":"wikitext","contentformat":"text/x-wiki","content":"{{redirect|UVa||Uva (disambiguation){{!}}Uva}}\n{{short description|University in Charlottesville, Virginia, United States}}\n\n{{use American English|date = April 2019}}\n{{use mdy dates|date=March 2019}}\n\n{{Infobox university\n| name = University of Virginia\n| image_name = University of Virginia seal.svg \n| image_upright = 0.7\n| founder = [[Thomas Jefferson]]\n| established = {{start date and age|January 25, 1819}}<ref>{{cite web |title = History |url = https://bicentennial.virginia.edu/history |website = University of Virginia Bicentennial |publisher = Rector and Visitors of the University of Virginia |access-date = March 3, 2020}}</ref>\n| type = [[Public university#United States|Public]] 
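
The API response is JSON rather than HTML, so an HTML parser leaves it as one undifferentiated string. A sketch of reading it with the standard `json` module instead; `sample_text` is a shortened, made-up stand-in that mirrors the structure of the response shown above.

```python
import json

# Shortened stand-in for r.text; the real response has the same nesting.
sample_text = (
    '{"batchcomplete": true, "query": {"pages": [{"pageid": 59801, '
    '"title": "University of Virginia", "revisions": [{"slots": {"main": '
    '{"contentmodel": "wikitext", "content": "{{Infobox university}}"}}}]}]}}'
)

data = json.loads(sample_text)
wiki_page = data["query"]["pages"][0]
wikitext = wiki_page["revisions"][0]["slots"]["main"]["content"]

print(wiki_page["title"])
print(wikitext)
```

With requests, `r.json()` returns the same kind of dictionary directly from the response.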

In [107]:
BeautifulSoup("http://www.google.com")



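UserWarning: No parser was explicitly specified [...] To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

UserWarning: "http://www.google.com" looks like a URL. Beautiful Soup is not an HTTP client. [...]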


<html><body><p>http://www.google.com</p></body></html>
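
Beautiful Soup treats its first argument as markup and never fetches anything, so a URL string is parsed as plain text. A small sketch that makes this visible; the warning is silenced here only to keep the output clean.

```python
import warnings

from bs4 import BeautifulSoup

with warnings.catch_warnings():
    # Beautiful Soup warns when the markup looks like a URL.
    warnings.simplefilter("ignore")
    soup_url = BeautifulSoup("http://www.google.com", "html.parser")

text_only = soup_url.get_text()          # the URL itself, not the page behind it
has_links = soup_url.find("a") is not None

print(repr(text_only), has_links)
```

To parse the page itself, download it first (for example with `requests.get`) and pass the response body to `BeautifulSoup`.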

In [113]:
html_string = """<p href="http://www.jonswebpage.com">My webpage</p>"""
soup = BeautifulSoup(html_string,'lxml')
soup.p

<p href="http://www.jonswebpage.com">My webpage</p>

In [118]:
soup.string

'My webpage'

In [120]:
url = 'https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser") 
soup

<!DOCTYPE doctype html>
<html lang="en"><head><script>!function(c,f){var t,o,i,e=[],r={passive:!0,capture:!0},n=new Date,a="pointerup",u="pointercancel";function p(n,e){t||(t=e,o=n,i=new Date,w(f),s())}function s(){0<=o&&o<i-n&&(e.forEach(function(n){n(o,t)}),e=[])}function l(n){if(n.cancelable){var e=(1e12<n.timeStamp?new Date:performance.now())-n.timeStamp;"pointerdown"==n.type?function(n,e){function t(){p(n,e),i()}function o(){i()}function i(){f(a,t,r),f(u,o,r)}c(a,t,r),c(u,o,r)}(e,n):p(e,n)}}function w(e){["click","mousedown","keydown","touchstart","pointerdown"].forEach(function(n){e(n,l,r)})}w(c),self.perfMetrics=self.perfMetrics||{},self.perfMetrics.onFirstInputDelay=function(n){e.push(n),s()}}(addEventListener,removeEventListener)</script><script defer="" src="https://cdn.optimizely.com/js/16180790160.js"></script><title data-rh="true">Ethics in Web Scraping. We all scrape web data. Well, those of… | by James Densmore | Towards Data Science</title><meta charset="utf-8" data-rh=

In [127]:
soup.find("p")

<p class="gi gj dp gk b gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf dh eh" id="f9b0">We all scrape web data. Well, those of us who work with data do. Data scientists, marketers, data journalists, and the data curious alike. Lately, I’ve been thinking more about the ethics of the practice and have been dissatisfied by the lack of consensus on the topic.</p>

In [128]:
soup.find_all("p")

[<p class="gi gj dp gk b gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf dh eh" id="f9b0">We all scrape web data. Well, those of us who work with data do. Data scientists, marketers, data journalists, and the data curious alike. Lately, I’ve been thinking more about the ethics of the practice and have been dissatisfied by the lack of consensus on the topic.</p>,
 <p class="gi gj dp gk b gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf dh eh" id="7541">Let me be clear that I’m talking <strong class="gk hg">ethics</strong> not the law. The law in regards to scraping web data is complex, fuzzy and ripe for reform, but that’s another matter. It’s not that no one is thinking, or writing, about the ethics in scraping but rather that both those scraping and those being scraped can’t agree on basic principles.</p>,
 <p class="gi gj dp gk b gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf dh eh" id="c989">I’ve been on both sides. I scape data mostly fo

In [129]:
soup.find_all(class_"p")  # missing "=": the keyword form is class_="p"

SyntaxError: invalid syntax (<ipython-input-129-4df101a6067e>, line 1)

In [130]:
soup.find(text="p")
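
`find(text="p")` looks for a string that is exactly `p`, not for `<p>` tags, which is why it returns nothing. A sketch of the common filter forms on made-up markup: tag name, `class_`, `id`, a regular expression passed as `string`, and `limit`.

```python
import re

from bs4 import BeautifulSoup

html_filters = """<div id="main">
<p class="intro lead">Ethics first.</p>
<p class="body">The law is another matter.</p>
<p class="body">Set some ground rules.</p>
</div>"""
soup_filters = BeautifulSoup(html_filters, "html.parser")

by_tag = soup_filters.find_all("p")                       # every <p>
by_class = soup_filters.find_all("p", class_="body")      # matches any one class
by_id = soup_filters.find(id="main")                      # attribute keyword
by_string = soup_filters.find(string=re.compile("law"))   # text containing "law"
limited = soup_filters.find_all("p", limit=1)             # stop after one match

print(len(by_tag), len(by_class), by_id.name, by_string)
```

A string search returns the matching `NavigableString`; its `.parent` is the enclosing tag.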

In [132]:
article = soup.find_all("p")
article[0] 

<p class="gi gj dp gk b gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf dh eh" id="f9b0">We all scrape web data. Well, those of us who work with data do. Data scientists, marketers, data journalists, and the data curious alike. Lately, I’ve been thinking more about the ethics of the practice and have been dissatisfied by the lack of consensus on the topic.</p>

In [134]:
article[0].find_next_sibling() 

<p class="gi gj dp gk b gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf dh eh" id="7541">Let me be clear that I’m talking <strong class="gk hg">ethics</strong> not the law. The law in regards to scraping web data is complex, fuzzy and ripe for reform, but that’s another matter. It’s not that no one is thinking, or writing, about the ethics in scraping but rather that both those scraping and those being scraped can’t agree on basic principles.</p>

In [141]:
article[1]

<p class="gi gj dp gk b gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf dh eh" id="7541">Let me be clear that I’m talking <strong class="gk hg">ethics</strong> not the law. The law in regards to scraping web data is complex, fuzzy and ripe for reform, but that’s another matter. It’s not that no one is thinking, or writing, about the ethics in scraping but rather that both those scraping and those being scraped can’t agree on basic principles.</p>

In [136]:
article[0].find_parents("<p>")  # tag names are given without angle brackets, e.g. "div"

[]
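
The navigation methods take bare tag names such as `"div"`, so a filter like `"<p>"` matches nothing. A sketch of sibling and parent lookups on made-up markup:

```python
from bs4 import BeautifulSoup

html_nav = (
    '<div class="post"><section>'
    '<p id="first">one</p><p id="second">two</p>'
    "</section></div>"
)
soup_nav = BeautifulSoup(html_nav, "html.parser")

first_p = soup_nav.find("p")
next_p = first_p.find_next_sibling("p")              # next <p> at the same level
direct_parent = first_p.parent.name                  # immediate enclosing tag
div_parents = [tag.name for tag in first_p.find_parents("div")]  # ancestors named div
no_match = first_p.find_parents("<p>")               # no tag is named "<p>"

print(next_p["id"], direct_parent, div_parents)
```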

In [140]:
article.get_text()

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

In [138]:
[i.get_text() for i in article]

['We all scrape web data. Well, those of us who work with data do. Data scientists, marketers, data journalists, and the data curious alike. Lately, I’ve been thinking more about the ethics of the practice and have been dissatisfied by the lack of consensus on the topic.',
 'Let me be clear that I’m talking ethics not the law. The law in regards to scraping web data is complex, fuzzy and ripe for reform, but that’s another matter. It’s not that no one is thinking, or writing, about the ethics in scraping but rather that both those scraping and those being scraped can’t agree on basic principles.',
 'I’ve been on both sides. I scape data mostly for personal projects, but I’ve employed it as a form of data collection on the job as well. On the other side, I’ve wrestled over how to filter out “bots” from my own or my employer’s web logs and analytics in order to focus on real customers. It’s been a reality of life for years now, and rather than fighting it let’s just set some ground rules
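
Once each paragraph has been reduced to text, the pieces can be joined into one article string. A minimal sketch on made-up markup; text inside nested tags such as `<strong>` is included by `get_text()`.

```python
from bs4 import BeautifulSoup

soup_article = BeautifulSoup(
    "<article><p>First <strong>bold</strong> paragraph.</p>"
    "<p>Second paragraph.</p></article>",
    "html.parser",
)

# find_all returns a ResultSet, so get_text() is called per element.
paragraphs = [p.get_text() for p in soup_article.find_all("p")]

# Separate paragraphs with a blank line in the combined text.
article_text = "\n\n".join(paragraphs)
print(article_text)
```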