# Web Scraping with Python

## Intro

Web scrapers can access data from databases that span millions of pages, unlike browsers, which are generally good at executing JavaScript.<br/>
A Google search for “cheapest flights to Boston” will result in a slew of advertisements
and popular flight search sites. Google only knows what these websites say on their
content pages, not the exact results of various queries entered into a flight search
application. However, a well-developed web scraper can chart the cost of a flight to
Boston over time, across a variety of websites, and tell you the best time to buy your
ticket.

You might be thinking, “Isn’t data gathering what APIs are for?” (If you’re unfamiliar
with APIs, see Chapter 4.) Well, APIs can be fantastic, if you find one that suits your
purposes. They can provide a convenient stream of well-formatted data from one
server to another. However, there are several reasons why an API might not exist or might not be enough:<br/>
• You are gathering data across a collection of sites that do not have a cohesive API.<br/>
• The data you want is a fairly small, finite set that the webmaster did not think
warranted an API.<br/>
• The source does not have the infrastructure or technical ability to create an API.<br/>


<i>A web browser can tell the processor to send some data to the application that handles your wireless (or wired) interface, but many languages have libraries that can do that just as well.</i>

In [11]:
import requests
html = requests.get("http://pythonscraping.com/pages/page1.html")
type(html)

requests.models.Response

In [12]:
string_html=html.text
type(string_html)

str

#### An Introduction to BeautifulSoup

“Beautiful Soup, so rich and green,<br/>
Waiting in a hot tureen!<br/>
Who for such dainties would not stoop?<br/>
Soup of the evening, beautiful Soup!”<br/>

In [14]:
from bs4 import BeautifulSoup

In [20]:
bsObj = BeautifulSoup(string_html,'html.parser')

In [21]:
bsObj

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [22]:
bsObj.h1

<h1>An Interesting Title</h1>

The h1 tag we extracted from the page was nested two layers deep
into our BeautifulSoup object structure (html → body → h1). However, when we
actually fetched it from the object, we called the h1 tag directly:<br/>
bsObj.h1<br/>
In fact, any of the following expressions would produce the same output:<br/>
bsObj.html.body.h1<br/>
bsObj.body.h1<br/>
bsObj.html.h1<br/>
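
The equivalence of these paths can be checked offline. The sketch below is only an illustration: `sample_html` is an inline stand-in for the fetched page (an assumption, so that no network request is needed), parsed with the same `html.parser` used above.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page (assumption: lets the example run offline)
sample_html = """
<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet.</div>
</body>
</html>
"""

bsObj = BeautifulSoup(sample_html, "html.parser")

# Each expression walks a different route through the tree to the same tag
paths = [bsObj.h1, bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1]
for tag in paths:
    print(tag)

# True when every route ends at an identical h1 tag
all_same = all(tag == bsObj.h1 for tag in paths)
print(all_same)
```

The shorter forms are convenient, but the fully spelled-out path makes it clearer which part of the document you expect the tag to live in.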

Virtually any information can be extracted from any HTML
(or XML) file, as long as it has some identifying tag surrounding it, or near it. In
Chapter 3, we’ll delve more deeply into some more-complex BeautifulSoup function
calls, as well as take a look at regular expressions and how they can be used with
BeautifulSoup in order to extract information from websites.

In [None]:
html = requests.get("http://www.pythonscraping.com/pages/page1.html")

Here, to anticipate a server error or a page-not-found error, we wrap the request in try and except and add other checks to handle the ways the scraping can go wrong.


In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass

In [None]:
if html is None:
    print("URL is not found")
else:
    # program continues
    pass

In [26]:
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = requests.get(url)
        html.raise_for_status()        # raise an HTTPError for 4xx/5xx responses
        string_html = html.text
    except RequestException as e:      # page not found, server error,
                                       # or the server could not be reached
        return None
    try:
        bsObj = BeautifulSoup(string_html, 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:        # the page has no body tag
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title can't be found")
else:
    print(title)


<h1>An Interesting Title</h1>


You’ll also likely want to heavily reuse code. Having generic functions such as getSiteHTML and getTitle (complete with thorough exception handling) makes it easy to quickly—and reliably—scrape the web.
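
As a rough illustration of such a reusable helper, the sketch below separates the parsing step from the fetching step so that it can be exercised offline. The name `get_title_from_html` and both sample pages are hypothetical, introduced only for this example. It assumes `html.parser`, which does not add a missing body tag, so the second page triggers the AttributeError path.

```python
from bs4 import BeautifulSoup

def get_title_from_html(html_text):
    """Return the first h1 inside body, or None if either tag is missing."""
    bsObj = BeautifulSoup(html_text, "html.parser")
    try:
        # bsObj.body is None when the page has no body tag; asking None for
        # .h1 raises AttributeError, which we treat as "title not found".
        return bsObj.body.h1
    except AttributeError:
        return None

# Hypothetical sample pages, used instead of a live request
good_page = "<html><body><h1>An Interesting Title</h1></body></html>"
bad_page = "<html><head><title>No body here</title></head></html>"

print(get_title_from_html(good_page))
print(get_title_from_html(bad_page))
```

Keeping the parsing logic in its own function means it can be tested against saved HTML, independently of whatever code downloads the page.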

## Chapter 2
### Advanced HTML parsing

There are many techniques to chip away the content that doesn’t look like the content that we’re searching for, until we arrive at the information we’re seeking. In this chapter, we’ll take a look at parsing complicated HTML pages in order to extract only the information we’re looking for.

When the HTML is badly formatted, consider these alternatives before parsing it:<br/>
• Look for a “print this page” link, or a mobile version of the site, which often has better-formatted HTML (you can sometimes get the mobile version by presenting yourself as a mobile device).<br/>
• Look for the information hidden in a JavaScript file. Remember, you might need
to examine the imported JavaScript files in order to do this. For example, I once
collected street addresses (along with latitude and longitude) off a website in a
neatly formatted array by looking at the JavaScript for the embedded Google Map
that displayed a pinpoint over each address.<br/>
• This is more common for page titles, but the information might be available in
the URL of the page itself.<br/>
• If the information you are looking for is unique to this website for some reason,
you’re out of luck. If not, try to think of other sources you could get this
information from. Is there another website with the same data? Is this website displaying
data that it scraped or aggregated from another website?

##### Especially when faced with buried or poorly formatted data, it’s important not to just start digging. Take a deep breath and think of alternatives.

In this section, we’ll discuss searching for tags by
attributes, working with lists of tags, and parse tree navigation.

In [27]:
html = requests.get("http://www.pythonscraping.com/pages/warandpeace.html")
string_html=html.text

bsObj = BeautifulSoup(string_html,'html.parser')

Previously, we used<br/>
bsObj.tagName<br/>
in order to get the first occurrence of that tag on the page. Now, we’re calling
bsObj.findAll(tagName, tagAttributes) in order to get a list of all of the matching tags on
the page, rather than just the first.

In [28]:
nameList = bsObj.findAll("span", {"class":"green"})

In [29]:
nameList #it should list all the proper nouns in the text, in the order they appear in 
         #War and Peace.

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri

##### find and findAll

findAll(tag, attributes, recursive, text, limit, keywords)<br/>
find(tag, attributes, recursive, text, keywords)

###### tag
We have already seen the tag argument (as in bsObj.body.h1):<br/>
you can pass a string name of a tag
or even a Python list of string tag names. For example, the following will return a list
of all the header tags in a document:

In [36]:
bsObj.findAll(['h1','h2','h3','h4','h5','h6'])  # finds every header tag in the HTML,
                                                # from h1 through h6

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

###### attributes
The attributes argument takes a Python dictionary of attributes and matches tags
that contain any one of those attributes. For example, the following would
return both the green and red span tags in the HTML document:

In [None]:
bsObj.findAll("span", {"class": ["green", "red"]})
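
A note on matching several values of one attribute: a Python dictionary cannot hold the same key twice, so writing "class" twice in a dictionary literal silently keeps only the last value. Passing a list as the value is one way to match either class. The sketch below illustrates this on a small inline document (`sample_html` is made up for the example).

```python
from bs4 import BeautifulSoup

# Made-up markup with three differently classed spans
sample_html = """
<p><span class="green">Anna Pavlovna</span> said
<span class="red">Well, Prince</span> and
<span class="blue">something else</span></p>
"""
bsObj = BeautifulSoup(sample_html, "html.parser")

# A dict literal with a repeated key keeps only the last value
duplicate_key = {"class": "green", "class": "red"}
print(duplicate_key)  # {'class': 'red'}

# A list value matches tags whose class is any of the listed names
both = bsObj.findAll("span", {"class": ["green", "red"]})
print([span["class"] for span in both])  # [['green'], ['red']]
```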

###### recursive

The recursive argument is a boolean. How deeply into the document do you want to
go? If recursive is set to True, the findAll function looks into children, and
children’s children, for tags that match your parameters. If it is False, it will look only at
the top-level tags in your document. By default, findAll works recursively (recursive
is set to True); it’s generally a good idea to leave this as is, unless you really know
what you need to do and performance is an issue.
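
The difference is easy to see on a small tree. In this minimal sketch (the markup and the `outer` variable are invented for the example), findAll is called on a div tag, so "top-level" means the direct children of that div.

```python
from bs4 import BeautifulSoup

# One paragraph directly inside the div, one nested a level deeper
sample_html = "<div id='outer'><p>top-level</p><section><p>nested</p></section></div>"
bsObj = BeautifulSoup(sample_html, "html.parser")
outer = bsObj.find("div", {"id": "outer"})

# Default (recursive=True): searches children, grandchildren, and so on
all_paragraphs = outer.findAll("p")
# recursive=False: considers only the direct children of the tag being searched
direct_paragraphs = outer.findAll("p", recursive=False)

print(len(all_paragraphs))     # 2
print(len(direct_paragraphs))  # 1
```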

###### text:
The text argument is unusual in that it matches based on the text content of the tags,
rather than properties of the tags themselves.


In [39]:
namelist=bsObj.findAll(text='the prince')

In [40]:
len(namelist)

7

###### limit
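The limit argument applies only to findAll; find is equivalent to findAll with a limit of 1. Set it when you want only the first few matches, in the order they occur on the page.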
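
A short sketch of limit in action, using made-up markup modeled on the green spans above:

```python
from bs4 import BeautifulSoup

# Three matching spans, of which we only want the first two
sample_html = """
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
<span class="green">Prince Vasili</span>
"""
bsObj = BeautifulSoup(sample_html, "html.parser")

# Stop after the first two matches, in document order
first_two = bsObj.findAll("span", {"class": "green"}, limit=2)
names = [span.get_text() for span in first_two]
print(names)  # ['Anna Pavlovna', 'the prince']

# find returns the tag itself; findAll with limit=1 returns a one-item list
first = bsObj.find("span", {"class": "green"})
same_tag = first == bsObj.findAll("span", {"class": "green"}, limit=1)[0]
print(same_tag)  # True
```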

###### keyword
The keyword argument allows you to select tags that contain a particular attribute.
For example:

In [None]:
allText = bsObj.findAll(id="text")
print(allText[0].get_text())      # prints the full text inside the tag with id="text",
                                  # the part of the page you intended to scrape

Everything that can be done with a keyword argument can also be done with the attributes dictionary.<br/>
The call above can be written as:<br/>
bsObj.findAll(attrs={'id':'text'})
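
The sketch below compares the keyword form with the dictionary form on a tiny inline document (invented for the example). It also shows `class_`, the spelling BeautifulSoup uses for the class keyword because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Made-up markup: one tag with an id, one with a class
sample_html = '<div id="text">Chapter text</div><div class="note">A note</div>'
bsObj = BeautifulSoup(sample_html, "html.parser")

# Keyword form and attributes-dictionary form select the same tags
by_keyword = bsObj.findAll(id="text")
by_dict = bsObj.findAll(attrs={"id": "text"})
print(by_keyword == by_dict)  # True

# "class" is reserved in Python, so the keyword is written class_
notes = bsObj.findAll(class_="note")
print(notes[0].get_text())  # A note
```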

#### Other BeautifulSoup Objects:
So far we have seen two types of BeautifulSoup objects:<br/>
1) BeautifulSoup objects, such as bsObj<br/>
2) Tag objects, returned by find and findAll<br/>
The other two are:<br/>
3) NavigableString objects:
used to represent text within tags, rather than the tags themselves (some
functions operate on, and produce, NavigableStrings, rather than tag objects).<br/>
4) The Comment object:
used to find HTML comments in comment tags, &lt;!--like this one--&gt;<br/>
These four objects are the only objects you will ever encounter (as of the time of this
writing) in the BeautifulSoup library.
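
To make the four object types concrete, this small sketch parses a one-line document (made up for the example) and prints the type of each piece:

```python
from bs4 import BeautifulSoup

# A paragraph holding one text node and one HTML comment
sample_html = "<p>Some text<!--like this one--></p>"
bsObj = BeautifulSoup(sample_html, "html.parser")

paragraph = bsObj.p
# The paragraph contains exactly two nodes: the text and the comment
text_node, comment_node = paragraph.contents

type_names = [type(obj).__name__ for obj in (bsObj, paragraph, text_node, comment_node)]
print(type_names)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```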

#### Navigating Trees:


In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")

bssObj = BeautifulSoup(html, 'html.parser')

##### children and descendants
bsObj.body.h1 selects the first h1 tag that is a
descendant of the body tag. It will not find tags located outside of the body.<br/>

bsObj.div.findAll("img") will find the first div tag in the document,
then retrieve a list of all img tags that are descendants of that div tag.<br/>


<i>If you want to find only descendants that are children, you can use the .children attribute:</i>

In [None]:
for child in bssObj.find("table",{"id":"giftList"}).children:
    print(child)

<i>This code prints out the list of product rows in the giftList table. If you were to
write it using the descendants attribute instead of the children attribute, about
two dozen tags would be found within the table and printed, including img tags, span
tags, and individual td tags. It’s definitely important to differentiate between children
and descendants!</i>
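
The contrast can be reproduced offline on a cut-down table. The markup below is a made-up miniature of the giftList table, not the real page. Only tag objects are counted, since both attributes also yield text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up miniature of the giftList table (two rows, two cells each)
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "</table>"
)
bsObj = BeautifulSoup(sample_html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

# .children yields only the nodes directly inside the table
child_tags = [c.name for c in table.children if isinstance(c, Tag)]
# .descendants yields every node at any depth below the table
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)       # ['tr', 'tr']
print(descendant_tags)  # ['tr', 'th', 'th', 'tr', 'td', 'td']
```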

In [83]:
#siblings
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'html.parser')
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr
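
To see what next_siblings returns without fetching the page, here is a minimal offline sketch. The table is a made-up miniature of giftList. Note that the starting row (the header row) is not part of its own siblings, which is why the output above begins with the first gift.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up miniature of the giftList table: a header row and two gift rows
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item Title</th></tr>"
    '<tr id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
    "</table>"
)
bsObj = BeautifulSoup(sample_html, "html.parser")

# .tr selects the first row of the table, i.e. the header row
header_row = bsObj.find("table", {"id": "giftList"}).tr

# next_siblings yields everything after the row, never the row itself
following_ids = [row["id"] for row in header_row.next_siblings if isinstance(row, Tag)]
print(following_ids)  # ['gift1', 'gift2']
```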





In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'html.parser')
# Match only the relative image paths that start with ../img/gifts/img and end in .jpg
# (a raw string keeps the backslashes intact for the regular expression)
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
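
The same filter can be tried offline. In this sketch the img tags are invented for illustration; two of them deliberately fall outside the pattern so the effect of the regular expression is visible.

```python
import re
from bs4 import BeautifulSoup

# Invented img tags: only the middle two match the pattern
sample_html = """
<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/other/img3.jpg"/>
"""
bsObj = BeautifulSoup(sample_html, "html.parser")

# Raw string: the backslashes reach the regex engine unchanged
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
images = bsObj.findAll("img", {"src": pattern})

matched_paths = [image["src"] for image in images]
print(matched_paths)  # ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```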

<i>To get all the attributes of a tag object, we use:</i><br/>
myTag.attrs<br/>
This returns a Python dictionary, so the source location for an
image, for example, can be found using the following line:<br/>
myImgTag.attrs['src']
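
A brief sketch of attribute access on a single made-up img tag. It also shows `get`, which returns None for an attribute the tag does not have instead of raising a KeyError.

```python
from bs4 import BeautifulSoup

# A single made-up img tag with two attributes
sample_html = '<img src="../img/gifts/img1.jpg" alt="Vegetable Basket"/>'
bsObj = BeautifulSoup(sample_html, "html.parser")
myImgTag = bsObj.img

print(myImgTag.attrs)          # mapping of attribute names to values
print(myImgTag.attrs["src"])   # ../img/gifts/img1.jpg
print(myImgTag["src"])         # shorthand for the line above
print(myImgTag.get("title"))   # None: the tag has no title attribute
```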



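BeautifulSoup also lets you pass a function, such as a lambda expression, as an argument to findAll.
The function must take a tag object as its argument and return a boolean.
Every tag object that BeautifulSoup encounters is
evaluated in this function, and tags that evaluate to “true” are returned while the rest
are discarded.
For example, the following retrieves all tags that have exactly two attributes: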


In [None]:
bsObj.findAll(lambda tag: len(tag.attrs) == 2)
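
As a quick offline check of this filter, the sketch below runs it on three made-up tags with two, one, and three attributes respectively:

```python
from bs4 import BeautifulSoup

# Made-up tags with two, one, and three attributes
sample_html = """
<div class="a" id="one"></div>
<span class="b"></span>
<p class="c" id="two" title="x"></p>
"""
bsObj = BeautifulSoup(sample_html, "html.parser")

# The lambda is called once per tag; tags for which it returns True are kept
two_attrs = bsObj.findAll(lambda tag: len(tag.attrs) == 2)
matched_names = [tag.name for tag in two_attrs]
print(matched_names)  # ['div']
```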