# Chapter 1: Your first web scraper

## Connecting

In [1]:
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


## An Introduction to BeautifulSoup

### Installing

### Running BeautifulSoup

The most commonly used object in the BeautifulSoup library is, appropriately, the
BeautifulSoup object.

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
# Calling .read() is not strictly necessary; BeautifulSoup also accepts the
# file-like object returned by urlopen directly.
# The second argument specifies the parser that BeautifulSoup uses to
# create the object: 'html.parser', 'lxml', or 'html5lib'.

In [7]:
print(bs)
print(bs.h1)  # equivalent: bs.html.body.h1, bs.body.h1, bs.html.h1

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

<h1>An Interesting Title</h1>


This HTML content is then transformed into a BeautifulSoup object.

### Connecting Reliably and Handling Exceptions

**HTTPError**

In [17]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    print("It worked!")



**HTTPError and URLError**

In [20]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')

The server could not be found!
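One detail in the cell above is easy to miss: HTTPError is a subclass of URLError, so its except clause has to come first or it would never be reached. The sketch below wraps that ordering in a small helper. The name `fetch` and its `opener` parameter are assumptions made for illustration (the parameter lets the error paths be exercised without a network connection); neither is part of urllib.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url, opener=urlopen):
    """Return (response, None) on success or (None, reason) on failure.

    `opener` is a hypothetical hook: any callable that takes a URL and
    either returns a response or raises HTTPError/URLError.
    """
    try:
        return opener(url), None
    except HTTPError as e:
        # Must be listed before URLError, because HTTPError subclasses it
        return None, 'HTTP error {}'.format(e.code)
    except URLError as e:
        return None, 'server not reachable: {}'.format(e.reason)


print(issubclass(HTTPError, URLError))
```

Returning a reason string instead of printing inside the helper leaves the "Plan B" decision to the caller.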


**AttributeError**

If you attempt to access a tag that does not exist, BeautifulSoup will return a
None object. The problem is, attempting to access a tag on a None object itself will
result in an AttributeError being thrown.

In [31]:
print(bs.nonExistentTag)
print(bs.nonExistentTag.someTag)

None




AttributeError: 'NoneType' object has no attribute 'someTag'

In [26]:
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)

Tag was not found
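When several tags are chained, the same check has to be repeated at every step. One possible way to centralize it is sketched below; `find_nested` is a hypothetical helper written for these notes, not a BeautifulSoup function, and the HTML string is a shortened stand-in for page1.html.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow a chain of tag names; return None as soon as one is missing."""
    node = root
    for name in tag_names:
        if node is None:
            return None
        node = node.find(name)
    return node


bs = BeautifulSoup('<html><body><h1>An Interesting Title</h1></body></html>',
                   'html.parser')
print(find_nested(bs, 'body', 'h1'))
print(find_nested(bs, 'nonExistentTag', 'someTag'))  # None, no exception raised
```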




**Errors**

In [30]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title
title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>


# Chapter 2: Advanced HTML Parsing

You'll take a look at parsing complicated HTML pages in order to extract only the information you're looking for.

## You don't always need a Hammer

## I.2.2 Another Serving of BeautifulSoup

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [3]:
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(html.read(), "html.parser")

Using this BeautifulSoup object, you can use the find_all function to extract a
Python list of proper nouns found by selecting only the text within <span
class="green"></span> tags (find_all is an extremely flexible function you’ll be
using a lot later in this book):

In [11]:
nameList = bs.find_all('span', {'class':'green'})
print(nameList[0])
for name in nameList:
    print(name.get_text())

<span class="green">Anna
Pavlovna Scherer</span>
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


### find() and find_all() with BeautifulSoup

**find_all(tag, attributes, recursive, text, limit, keywords)**<br>
**find(tag, attributes, recursive, text, keywords)**

.find_all(['h1','h2','h3','h4','h5','h6'])<br>
.find_all('span', {'class':{'green', 'red'}})<br>
nameList = bs.find_all(text='the prince')<br>
bs.find_all(id='title', class_='text')<br>
bs.find_all('', {'id':'text'})<br>
bs.find_all(class_='green')<br>
bs.find_all('', {'class':'green'})

Recall that passing a list of tags to .find_all() via the attributes list acts as an “or”
filter (it selects a list of all tags that have tag1, tag2, or tag3...). If you have a lengthy
list of tags, you can end up with a lot of stuff you don’t want. The keyword argument
allows you to add an additional “and” filter to this.
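The difference between the "or" and "and" filters can be seen on a small document. This is a minimal sketch; the HTML string is invented for illustration and is not taken from any of the pages scraped in these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only to illustrate the filters
html = '''
<h1 class="green">Heading one</h1>
<h2 class="red">Heading two</h2>
<span class="green">A name</span>
<p class="green">A paragraph</p>
'''
bs = BeautifulSoup(html, 'html.parser')

# "Or" filter: any tag whose name is h1 or h2
headings = bs.find_all(['h1', 'h2'])
print([tag.name for tag in headings])

# "And" filter: the name must be in the list AND the class must be green
green_only = bs.find_all(['h1', 'h2', 'span'], class_='green')
print([tag.get_text() for tag in green_only])
```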

### Other BeautifulSoup Objects

* BeautifulSoup objects
* Tag objects
* NavigableString objects
* Comment objects
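All four object types can show up in a single parse. The sketch below uses an invented one-line document to print the class of each piece.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a text node and an HTML comment
bs = BeautifulSoup('<p>Hello<!-- a note --></p>', 'html.parser')
p = bs.p
text_node, comment_node = p.contents

print(type(bs).__name__)            # the whole document
print(type(p).__name__)             # a single tag
print(type(text_node).__name__)     # text inside the tag
print(type(comment_node).__name__)  # the HTML comment

# Comment is a subclass of NavigableString, so test for it first
print(isinstance(comment_node, NavigableString))
```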

### Navigating Trees

In [18]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

Much like in a human family tree, children
are always exactly one tag below a parent, whereas descendants can be at any level in
the tree below a parent. For example, the tr tags are children of the table tag,
whereas tr, th, td, img, and span are all descendants of the table tag (at least in our
example page). All children are descendants, but not all descendants are children.
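A minimal sketch of the distinction, using a tiny invented table rather than page3.html. Text nodes are filtered out with an isinstance check so that only tags are listed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table written without whitespace between tags
html = '<table><tr><td><span>x</span></td></tr><tr><td>y</td></tr></table>'
table = BeautifulSoup(html, 'html.parser').table

# .children yields only the nodes exactly one level below the table
children = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks the whole subtree, depth first
descendants = [node.name for node in table.descendants if isinstance(node, Tag)]

print(children)
print(descendants)
```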

For example, bs.div.find_all('img') will find the first div tag in the document, and
then retrieve a list of all img tags that are descendants of that div tag.
If you want to find only descendants that are children, you can use the .children
attribute:

In [21]:
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


In [22]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

So, by selecting the title row
and calling next_siblings, you can select all the rows in the table, without selecting
the title row itself.

In [25]:
print(bs.find('img',{'src':'../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())


$15.00
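The chain in the cell above can be read one step at a time: find the image, go up to the cell that contains it, then move sideways to the cell before it. The sketch below replays those steps on an invented one-row table. It also uses find_previous_sibling('td'), which skips over whitespace text nodes that may sit between tags on a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical row modeled on the gift table, without whitespace between tags
html = ('<table><tr><td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"/></td></tr></table>')
bs = BeautifulSoup(html, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/img1.jpg'})  # 1. the image tag
image_cell = img.parent                                 # 2. the td that holds it
price_cell = image_cell.previous_sibling                # 3. the td just before it
print(price_cell.get_text())

# Same result, but robust to whitespace-only text nodes between the cells
print(image_cell.find_previous_sibling('td').get_text())
```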



## I.2.3 Regular expressions

## I.2.4 Regular Expressions and BeautifulSoup

Certainly, you can’t count
on the only images on the page being product images.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

In [2]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img',
                     {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


A regular expression can be inserted as any argument in a BeautifulSoup expression,
allowing you a great deal of flexibility in finding target elements.
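To see what the pattern keeps and what it rejects, here is a small offline sketch. The four img tags are invented for illustration; only the two whose src follows the product-image path should match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a logo, two product images, and a banner
html = ('<img src="../img/gifts/logo.jpg"/>'
        '<img src="../img/gifts/img1.jpg"/>'
        '<img src="../img/gifts/img2.jpg"/>'
        '<img src="/static/banner.png"/>')
bs = BeautifulSoup(html, 'html.parser')

# A raw string keeps the backslashes intact for the regex engine
product_src = re.compile(r'\.\./img/gifts/img.*\.jpg')
matches = [img['src'] for img in bs.find_all('img', {'src': product_src})]
print(matches)
```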

## I.2.5 Accessing Attributes 

Often in web scraping you’re not looking for the content of a tag; you’re
looking for its attributes.

**NOTE**: The HTML anchor element creates a link to other web pages, files, locations within the same page, email addresses, or any other URL.

In [9]:
# Import Beautiful Soup
from bs4 import BeautifulSoup
  
# Initialize the object with a HTML page
soup = BeautifulSoup('''
    <html>
        <h2 class="hello"> Heading 1 </h2>
        <h1> Heading 2 </h1>
    </html>
    ''', "lxml")
  
# Get the whole h2 tag
tag = soup.h2
  
# Get the attribute
attribute = tag.attrs
  
# Print the output
print(attribute)

print(attribute["class"])

{'class': ['hello']}
['hello']


In [7]:
# Initialize the object with a HTML page
soup = BeautifulSoup('''
	<html>
		<h2 class="hello"> Heading 1 </h2>
		<h1> Heading 2 </h1>
	</html>
	''', "lxml")

# Get the whole h2 tag
tag = soup.h2

# Get the attribute
attribute = tag['class']

# Print the output
print(attribute)


['hello']


myImgTag.attrs['src']
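Square-bracket access raises a KeyError when the attribute is missing, which is common for anchors without an href. A minimal sketch of the safer .get() lookup, on invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical anchors: the second one has no href attribute
html = '<a href="/wiki/Kevin_Bacon">Kevin Bacon</a><a name="top">anchor</a>'
links = BeautifulSoup(html, 'html.parser').find_all('a')

print(links[0]['href'])      # works: the attribute exists
print(links[1].get('href'))  # None instead of a KeyError

# Collect every href, leaving None where the attribute is absent
hrefs = [link.get('href') for link in links]
print(hrefs)
```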

## I.2.6 Lambda expressions

In [11]:
html = '<html><body><a href="foo.org">bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', href=lambda x: ".org" in x)
links

[<a href="foo.org">bar</a>]

In [13]:
html = '<html><body><a>bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', href=lambda x: x is not None and ".org" in x)
links

[]
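The two cells above pass a lambda for a single attribute value. find_all also accepts a function that receives the whole tag and returns True or False. The sketch below, on invented markup, selects tags that carry exactly two attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tags with two attributes, one tag with one
html = ('<div class="body" id="content">a</div>'
        '<span class="title" style="color:red">b</span>'
        '<p class="only">c</p>')
bs = BeautifulSoup(html, 'html.parser')

# The lambda receives each Tag object and decides whether to keep it
two_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)
print([tag.name for tag in two_attrs])
```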

# Chapter 3: Writing Web Crawlers

You'll start looking at real-world problems, with scrapers traversing multiple pages and even multiple sites.

## I.3.1 Traversing a single domain

Six Degrees of Wikipedia: the goal is to link two unlikely subjects (here, Wikipedia articles that link to each other) by a chain containing no more than six articles in total.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [4]:
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, "html.parser")

In [8]:
for link in bs.find_all("a"):
    if "href" in link.attrs:
        print(link.attrs["href"])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#searchInput
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia,_Pennsylvania
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Wikipedia:Citation_needed
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(1990_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/X-Men:_First_Class
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chan

study Wikipedia API

In [12]:
import re
for link in bs.find('div', {'id':'bodyContent'}).find_all(
'a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia,_Pennsylvania
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(1990_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/Streaming_television
/wiki/I_Love_Dick_(TV_series)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or
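The pattern in the cell above does two things: it requires the link to start with /wiki/, and the negative lookahead (?!:) rejects any link that contains a colon, which is how namespace pages such as File: or Wikipedia: are excluded. The sketch below checks it against a few hrefs copied from the earlier unfiltered output, using only the re module.

```python
import re

# Same expression as in the cell above
article_link = re.compile('^(/wiki/)((?!:).)*$')

# Sample hrefs taken from the unfiltered link listing
candidates = [
    '/wiki/Kevin_Bacon_filmography',
    '/wiki/File:Kevin_Bacon_SDCC_2014.jpg',
    '/wiki/Wikipedia:Citation_needed',
    '#cite_note-1',
    'http://baconbros.com/',
]

kept = [href for href in candidates if article_link.search(href)]
print(kept)
```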

In [13]:
import random 
import datetime

In [24]:
datetime.datetime.now()

datetime.datetime(2022, 5, 12, 15, 13, 43, 961900)

In [25]:
random.seed(2022) # seed() does not accept the datetime object above as an argument
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a',href=re.compile('^(/wiki/)((?!:).)*$'))

In [26]:
links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

/wiki/Internet_Broadway_Database
/wiki/Theatre
/wiki/Satyr_play
/wiki/Odysseus
/wiki/Phaeacians
/wiki/Laodamas
/wiki/Thessaly
/wiki/Volos
/wiki/Heraklion_(regional_unit)
/wiki/Chania_(regional_unit)
/wiki/Chania
/wiki/Fortezza_of_Rethymno
/wiki/Exomvourgo
/wiki/Sophia_the_Martyr
/wiki/List_of_Last_Exile_characters#Sophia_Forrester
/wiki/TV_Tokyo
/wiki/NHK_General_TV
/wiki/Miyazaki_Prefecture
/wiki/ISBN_(identifier)
/wiki/Ruby_(programming_language)
/wiki/Eiffel_(programming_language)
/wiki/ALGOL
/wiki/Strong_and_weak_typing
/wiki/C_(programming_language)
/wiki/Space_(punctuation)
/wiki/ISBN_(identifier)
/wiki/Prime_number
/wiki/Achilles_number
/wiki/Hexagonal_number
/wiki/Prime_omega_function
/wiki/Stieltjes_constants
/wiki/Euler%E2%80%93Mascheroni_constant
/wiki/Germany
/wiki/France
/wiki/Somalia
/wiki/List_of_cities_in_Somalia_by_population
/wiki/Bari,_Somalia
/wiki/Somalia
/wiki/British_Somaliland
/wiki/Somalia
/wiki/Walashma_dynasty
/wiki/Sabr_ad-Din_II
/wiki/Walashma_dynasty
/wiki

KeyboardInterrupt: 

Autonomous production code requires far more exception handling than can fit into this book.
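As a small step in that direction, the walk can at least be bounded and made to stop cleanly when a page cannot be fetched. The sketch below is not the book's code: `LINKS`, `get_links`, and `random_walk` are hypothetical names, and an in-memory dictionary stands in for live Wikipedia pages so the example runs offline.

```python
import random

# Hypothetical link graph standing in for live pages
LINKS = {
    '/wiki/Kevin_Bacon': ['/wiki/Footloose_(1984_film)', '/wiki/HBO'],
    '/wiki/Footloose_(1984_film)': ['/wiki/Kevin_Bacon'],
    '/wiki/HBO': [],
}


def get_links(article_url):
    # A missing key plays the role of a failed page fetch
    return LINKS[article_url]


def random_walk(start, max_steps=10, rng=random):
    """Follow random links for at most max_steps, stopping on errors."""
    path = [start]
    current = start
    for _ in range(max_steps):
        try:
            links = get_links(current)
        except KeyError:
            break              # page could not be fetched: stop the walk
        if not links:
            break              # dead end: no article links on this page
        current = rng.choice(links)
        path.append(current)
    return path


print(random_walk('/wiki/Kevin_Bacon', max_steps=5, rng=random.Random(2022)))
```

With a real fetcher, the except clause would catch HTTPError and URLError instead of KeyError.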

## I.3.2 Crawling an Entire Site

But what if you need to systematically catalog or search every page on a site?

The dark and deep webs

When might crawling an entire website be useful, and when might it be harmful?

* Generating a site map
* Gathering data

The general approach to an exhaustive site crawl is to start with a top-level page (such as the home page), and search for a list of all internal links on that page. Every one of those links is then crawled, and additional lists of links are found on each of them, triggering another round of crawling.

To avoid crawling the same page twice, use a set. A set is similar to a list, but its elements have no specific order, and only unique elements are stored.
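A minimal sketch of that behavior, using a few hrefs in the style of the crawler output. Adding a value that is already present leaves the set unchanged, and the membership test is what lets the crawler skip pages it has seen.

```python
pages = set()
found = ['/wiki/Wikipedia', '/wiki/Help:Category', '/wiki/Wikipedia']

new_pages = []
for href in found:
    if href not in pages:      # fast membership test
        pages.add(href)
        new_pages.append(href)

print(len(pages))
print(new_pages)
```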

In [27]:
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
# The empty URL is translated as "the front page of Wikipedia" as soon as
# it is prepended with http://en.wikipedia.org inside the function.
getLinks('')

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Category:Administrative_backlog
/wiki/Help:Category
/wiki/Wikipedia:Categorization
/wiki/Wikipedia:Contents/Categories
/wiki/Wikipedia:Contents
/wiki/Wikipedia:Contents/Overviews
/wiki/Wikipedia:Contents/Outlines
/wiki/Wikipedia:Contents/Lists
/wiki/Wikipedia:Contents/Portals
/wiki/Wikipedia:Contents/Glossaries
/wiki/Wikipedia:Contents/Indices
/wiki/Wikipedia:Contents/A%E2%80%93Z_index
/wiki/Wikipedia:Contents/Reference
/wiki/Wikipedia:Contents/Culture_and_the_arts
/wiki/Wikipedia:Contents/Geography_and_places
/wiki/Special:Nearby
/wiki/Special:MyTalk
/wiki/Special:NewSection/User_talk:190.102.134.97
/wiki/File:User-info.svg
/wiki/File:Full-protection-shackle.svg
/wiki/Scalable_Vector_Graphics


KeyboardInterrupt: 

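A warning regarding recursion: if left running long enough, the preceding program will almost certainly crash. Python has a default recursion limit of 1,000 calls, and getLinks calls itself once for every new page it discovers.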
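One common way around the recursion limit is to replace the recursive calls with an explicit queue. The sketch below is an alternative to the recursive crawler, not the book's method; `crawl` and the `SITE` dictionary are hypothetical, and the dictionary stands in for fetching and parsing real pages.

```python
import sys
from collections import deque

print(sys.getrecursionlimit())  # the limit the recursive crawler runs into

# Hypothetical site: each page maps to the internal links found on it
SITE = {
    '': ['/wiki/A', '/wiki/B'],
    '/wiki/A': ['/wiki/B', ''],
    '/wiki/B': ['/wiki/C'],
    '/wiki/C': [],
}


def crawl(start, get_links):
    """Visit every reachable page once, breadth first, without recursion."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:   # the set prevents repeat visits
                seen.add(link)
                queue.append(link)
    return order


print(crawl('', SITE.__getitem__))
```

Because pending pages live in the queue rather than on the call stack, the depth of the site no longer matters.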

### Collecting Data Across an Entire Site

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)>

## I.3.3 Crawling across the internet

In [None]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

In [None]:
pages = set()
random.seed()  # seed from system entropy; Python 3.11+ rejects a datetime object as the seed

# Retrieves a list of all internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme,
                                  urlparse(includeUrl).netloc)
    internalLinks = []
    # Finds all links that begin with a "/" or contain the site's own URL
    for link in bs.find_all('a',
                            href=re.compile('^(/|.*' + includeUrl + ')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith('/'):
                    internalLinks.append(includeUrl + link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

# Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" or "www" and do
    # not contain the current URL
    for link in bs.find_all('a',
                            href=re.compile('^(http|www)((?!' + excludeUrl + ').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme,
                                  urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(
            internalLinks[random.randint(0, len(internalLinks) - 1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks) - 1)]

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)

followExternalOnly('http://oreilly.com')
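The internal/external split above is done with regular expressions built from the site's URL. A possible alternative is to compare the network location of each link with that of the page it was found on, using urlparse and urljoin from the standard library. The helper name `is_internal` is an assumption for illustration, and the example links are invented.

```python
from urllib.parse import urljoin, urlparse


def is_internal(link, page_url):
    """True if `link`, resolved against `page_url`, stays on the same host."""
    absolute = urljoin(page_url, link)   # turns '/about' into a full URL
    return urlparse(absolute).netloc == urlparse(page_url).netloc


page = 'http://oreilly.com/index.html'
for href in ['/about', 'http://oreilly.com/radar', 'http://example.com/x']:
    print(href, is_internal(href, page))
```

This comparison is strict: `www.oreilly.com` and `oreilly.com` count as different hosts, so a real crawler may want to normalize the host first.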

# Chapter 4: Web Crawling Models