
# Web Scraper
In this example, we’ll cover how to format and interpret data without the help of a browser.
This example starts with the basics of sending a GET request (a request to fetch, or “get,” the content of a web page) to a web server for a specific page, reading the HTML output from that page, and doing some simple data extraction in order to isolate the content that you are looking for.
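
As a small illustration of the GET request mentioned above, the sketch below builds a request object with urllib and inspects it without contacting the server. The variable name page_request is made up for this example; a request created without a data argument uses the GET method.

```python
from urllib.request import Request

# Build (but do not send) a request for the page used in this notebook.
page_request = Request('http://pythonscraping.com/pages/page1.html')

# With no data attached, urllib uses the GET method for the request.
print(page_request.get_method())  # GET
print(page_request.full_url)      # http://pythonscraping.com/pages/page1.html
```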



# Connecting
A web browser can tell the processor to send data to the application that handles your wireless (or wired) interface, but you can do the same thing in Python with just three lines of code:

In [1]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


This command outputs the complete HTML code for page1 located at the URL http://pythonscraping.com/pages/page1.html.

More accurately, this outputs the HTML file page1.html, found in the directory <web root>/pages, on the server located at the domain name pythonscraping.com.
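
The split into server, directory, and file can be checked with the standard library. This is a minimal sketch using urllib.parse and posixpath; the variable names are chosen for illustration only.

```python
import posixpath
from urllib.parse import urlparse

url = 'http://pythonscraping.com/pages/page1.html'
parts = urlparse(url)

# The network location is the server's domain name.
domain = parts.netloc
# The path splits into a directory (relative to the web root) and a file name.
directory, filename = posixpath.split(parts.path)

print(domain)     # pythonscraping.com
print(directory)  # /pages
print(filename)   # page1.html
```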
Why is it important to start thinking of these addresses as “files” rather than “pages”? Most modern web pages have many resource files associated with them. These could be image files, JavaScript files, CSS files, or any other content that the page you are requesting is linked to. When a web browser hits a tag such as <img src="cuteKitten.jpg">, the browser knows that it needs to make another request to the server to get the data at the file cuteKitten.jpg in order to fully render the page for the user.
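
To make the idea of linked resource files concrete, here is a rough sketch that lists the files a page refers to, using only the standard library. The ResourceCollector class, the sample HTML, and the file names style.css and app.js are all invented for this example; only cuteKitten.jpg comes from the text above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect the URLs of files that a page links to (hypothetical helper)."""

    # For each tag of interest, the attribute that names the linked file.
    RESOURCE_ATTRS = {'img': 'src', 'script': 'src', 'link': 'href'}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page's own URL.
                self.resources.append(urljoin(self.base_url, value))


page_url = 'http://pythonscraping.com/pages/page1.html'
sample_html = '''<html><head>
<link rel="stylesheet" href="style.css">
<script src="/js/app.js"></script>
</head><body><img src="cuteKitten.jpg"></body></html>'''

collector = ResourceCollector(page_url)
collector.feed(sample_html)
for resource in collector.resources:
    print(resource)
```

Each printed URL is a separate file that a browser would have to request in order to render the page fully.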

# Running BeautifulSoup
The most commonly used object in the BeautifulSoup library is, appropriately, the BeautifulSoup object. The cell below creates one from the HTML of page1.html and prints the page’s h1 tag.
Note that bs.h1 returns only the first instance of the h1 tag found on the page. By convention, only one h1 tag should be used on a single page, but conventions are often broken on the web, so you should be aware that this will retrieve the first instance of the tag only, and not necessarily the one that you’re looking for.
As in the previous example, you import the urlopen function and call html.read() to get the HTML content of the page. BeautifulSoup can also use the file object returned by urlopen directly, without needing to call .read() first, so BeautifulSoup(html, 'html.parser') would work equally well.


In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>
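
The two points above (only the first h1 is returned, and a file object can be passed without calling .read()) can be tried without a network connection. This sketch assumes an invented page with two h1 tags and uses io.BytesIO as a stand-in for the object returned by urlopen; find_all is a BeautifulSoup method not otherwise used in this section.

```python
import io
from bs4 import BeautifulSoup

# Invented sample page that breaks the one-h1-per-page convention.
SAMPLE_HTML = b"""<html><body>
<h1>First Title</h1>
<p>Some text.</p>
<h1>Second Title</h1>
</body></html>"""

# A file-like object standing in for the response returned by urlopen.
file_like = io.BytesIO(SAMPLE_HTML)

# BeautifulSoup reads from the file-like object itself; no .read() call needed.
bs = BeautifulSoup(file_like, 'html.parser')

first_h1 = bs.h1
all_h1_text = [tag.get_text() for tag in bs.find_all('h1')]

print(first_h1)     # <h1>First Title</h1>
print(all_h1_text)  # ['First Title', 'Second Title']
```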


# Connecting Reliably and Handling Exceptions
The web is messy. Data is poorly formatted, websites go down, and closing tags go missing. 
Let’s take a look at the first line of our scraper, after the import statements, and figure out how to handle any exceptions this might throw:

html = urlopen('http://www.pythonscraping.com/pages/page1.html')

Two main things can go wrong in this line:

•	The page is not found on the server (or there was an error in retrieving it).
•	The server is not found.
In the first situation, an HTTP error will be returned. This HTTP error may be “404 Page Not Found,” “500 Internal Server Error,” and so forth. In all of these cases, the urlopen function throws the generic exception HTTPError. A scraper can catch this exception, print the error, and skip the code that depends on the page, which is placed under an else statement.
If the server is not found at all (if, say, http://www.pythonscraping.com is down, or the URL is mistyped), urlopen throws a URLError. This indicates that no server could be reached at all. Because the remote server is responsible for returning HTTP status codes, an HTTPError cannot be thrown, and the more serious URLError must be caught. You can add a check to see whether this is the case:




In [3]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

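try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    # The server was reached but returned an error status (404, 500, ...).
    print("The server returned an HTTP error")
except URLError as e:
    # No server could be reached (server down or mistyped URL).
    print("The server could not be found!")
else:
    # Runs only if urlopen raised no exception.
    print(html.read())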

The server could not be found!
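
In the cell above, the except HTTPError clause comes before except URLError. The order matters because HTTPError is a subclass of URLError, so a URLError clause placed first would catch both. The sketch below shows this without a network connection by constructing the exceptions by hand; the classify helper, the example URL, and the error messages are invented for illustration.

```python
from urllib.error import HTTPError, URLError


def classify(exc):
    """Describe an exception of the kind urlopen raises (hypothetical helper)."""
    try:
        raise exc
    except HTTPError as e:
        # Must come first: HTTPError is also a URLError.
        return f"HTTP error {e.code}"
    except URLError as e:
        return f"server not found: {e.reason}"


# Hand-built exceptions standing in for real failures.
simulated_404 = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
simulated_dns = URLError("Name or service not known")

print(issubclass(HTTPError, URLError))  # True
print(classify(simulated_404))          # HTTP error 404
print(classify(simulated_dns))          # server not found: Name or service not known
```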


Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being what you expected. 

Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually exists. 

If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is, attempting to access a tag on a None object itself will result in an AttributeError being thrown.

The following line (where nonExistentTag is a made-up tag, not the name of a real BeautifulSoup function)

print(bs.nonExistentTag)

returns a None object. This object is perfectly reasonable to handle and check for. 

The trouble comes if you don’t check for it, but instead go on and try to call another function on the None object, as illustrated in the following:

print(bs.nonExistentTag.someTag)

This returns an exception:

AttributeError: 'NoneType' object has no attribute 'someTag'

So how can you guard against these two situations?

The easiest way is to explicitly check for both situations:


In [4]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


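def getTitle(url):
    # Return the page's first h1 tag, or None if the page or tag is unavailable.
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server returned an error status, so there is no page to parse.
        return None
    try:
        # The "lxml" parser requires the lxml package to be installed.
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body was None, so looking up .h1 on it failed.
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)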

<h1>An Interesting Title</h1>
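
The two tag-related failure modes that getTitle guards against (a missing tag returning None, and a lookup on None raising AttributeError) can be reproduced offline. This is a sketch under stated assumptions: the helper get_first_h1 and the three HTML strings are invented, and the built-in 'html.parser' is used so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup


def get_first_h1(markup):
    """Return the first h1 inside body, or None if either is missing (hypothetical helper)."""
    bs = BeautifulSoup(markup, 'html.parser')
    try:
        # If there is no body tag, bs.body is None and .h1 raises AttributeError.
        return bs.body.h1
    except AttributeError:
        return None


with_title = '<html><body><h1>An Interesting Title</h1></body></html>'
no_h1 = '<html><body><p>No heading here</p></body></html>'
no_body = '<html><head><title>Empty</title></head></html>'

for markup in (with_title, no_h1, no_body):
    title = get_first_h1(markup)
    if title is None:
        print("Title could not be found")
    else:
        print(title)
```

The first string yields the h1 tag. The second returns None because body has no h1, and the third returns None because the AttributeError is caught.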
