# **Python Urllib Module** 
***urllib — URL handling modules***

Python is used extensively for web programming. When we browse a website, we use its web address, also known as a *uniform resource locator* (URL). Python has built-in modules that can make requests to a URL and process the response that comes back. This is done with a package named **urllib**. We will look at the functions in this package that help in getting the result from a URL.

Its *urlopen* function can fetch URLs using a variety of different protocols. urllib is a package that collects several modules for working with URLs:

* urllib.request for opening and reading URLs
* urllib.parse for parsing URLs
* urllib.error for the exceptions raised by urllib.request
* urllib.robotparser for parsing robots.txt files
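
As a quick orientation, the sketch below imports the four submodules and prints one well-known name from each. Nothing here touches the network; the dictionary name `entry_points` is only an illustrative choice.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# One representative function or class from each submodule
entry_points = {
    "urllib.request": urllib.request.urlopen,
    "urllib.parse": urllib.parse.urlparse,
    "urllib.error": urllib.error.URLError,
    "urllib.robotparser": urllib.robotparser.RobotFileParser,
}

for module_name, obj in entry_points.items():
    print(f"{module_name}: {obj.__name__}")
```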



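***Installing urllib:***

urllib is part of the Python 3 standard library, so it does not need to be installed with pip. It is available as soon as Python itself is installed and only has to be imported.

`import urllib.request`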


Let's take a deeper look at the modules mentioned above:
* **urllib.request**

This module defines functions and classes that open URLs (mostly HTTP) and fetch their content into the Python environment.


```
import urllib.request

# Open the URL and print the raw response body (a bytes object)
with urllib.request.urlopen('https://www.youtube.com/') as address:
    print(address.read())
```
This will display the HTML source of the page at the URL, here YouTube's home page, as a bytes object. Try it yourself!
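
Because `read()` returns bytes, a common next step is to decode them into text. The following is a minimal sketch of one way to do that; the helper name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration, not part of urllib.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Open url and return its body decoded to a string (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header when present;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL is handled by urlopen without any network access,
# which makes it convenient for trying the helper offline.
print(fetch_text("data:text/plain;charset=utf-8,Hello%20urllib"))
```

The same call works with an `https://` address such as the one in the example above, provided a network connection is available.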

* **urllib.parse**

This module defines functions to manipulate URLs and their component parts, either to build them or to break them apart. We can parse a URL to check whether it has the parts we expect. The module mostly focuses on splitting a URL into small components, or joining different URL components into a URL string.


```
from urllib.parse import urlparse, urlunparse

# Split the URL into its components
parse_url = urlparse('https://www.geeksforgeeks.org/python-langtons-ant/')
print(parse_url)
print("\n")

# Join the components back into a URL string
unparse_url = urlunparse(parse_url)
print(unparse_url)
```
**Note**: The different components of the URL are separated and then joined again. Try using some other URLs for better understanding.
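
Checking whether a URL looks valid can be sketched with `urlparse` as well. `urlparse` is lenient and accepts almost any string, so the check below is only a heuristic: it assumes that "valid" means an http or https scheme together with a non-empty host. The helper name `looks_like_http_url` is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_http_url(candidate):
    """Heuristic check: http(s) scheme and a non-empty network location."""
    try:
        parts = urlparse(candidate)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # for example an unbalanced IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(looks_like_http_url("https://www.geeksforgeeks.org/python-langtons-ant/"))
print(looks_like_http_url("not a url"))
```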

The main functions of urllib.parse are:

**urllib.parse.urlparse:** Separates the different components of a URL

**urllib.parse.urlunparse:** Joins the different components of a URL

**urllib.parse.urlsplit:** Similar to urlparse() but doesn’t split the params from the path

**urllib.parse.urlunsplit:** Combines the tuple elements returned by urlsplit() to form a URL

**urllib.parse.urldefrag:** If the URL contains a fragment, returns the URL with the fragment removed, together with the fragment itself.
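
The differences between these functions are easiest to see on a URL that has both params and a fragment. The sketch below uses a made-up `example.com` address chosen only because it contains every component.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit, urldefrag

# Illustrative URL (an assumption) with params (;v=2), a query and a fragment
url = "https://example.com/docs/page;v=2?lang=en#intro"

# urlparse separates the params from the path
parsed = urlparse(url)
print(parsed.path, parsed.params)   # /docs/page v=2

# urlsplit leaves the params attached to the path
split = urlsplit(url)
print(split.path)                   # /docs/page;v=2

# urlunsplit rebuilds the original string from the split result
print(urlunsplit(split) == url)     # True

# urldefrag returns the URL without the fragment, plus the fragment
clean_url, fragment = urldefrag(url)
print(clean_url)                    # https://example.com/docs/page;v=2?lang=en
print(fragment)                     # intro
```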

* **urllib.error**

This module defines the exception classes raised by urllib.request. Whenever there is an error in fetching a URL, urllib.request raises one of these exceptions. The following exceptions are raised:

1. URLError – It is raised for errors in URLs, or errors while fetching the URL due to connectivity, and has a ‘reason’ attribute that tells the user the reason for the error.
2. HTTPError – It is raised for HTTP errors reported by the server, such as authentication request errors. It is a subclass of URLError. Typical errors include ‘404’ (page not found), ‘403’ (request forbidden), and ‘401’ (authentication required).


```
# URL Error

import urllib.request
import urllib.error

# trying to read the URL but with no internet connectivity
try:
    x = urllib.request.urlopen('https://www.google.com')
    print(x.read())

# Catching the exception raised when the URL cannot be reached
except urllib.error.URLError as e:
    print("URL Error:", e.reason)
```
Output (on Windows with no connectivity; the exact message varies by platform):

```
URL Error: [Errno 11001] getaddrinfo failed
```


```
# HTTP Error

import urllib.request
import urllib.error

# trying to read a URL that the server refuses to serve
try:
    x = urllib.request.urlopen('https://www.google.com/search?q=test')
    print(x.read())

# Catching the HTTP error returned by the server
except urllib.error.HTTPError as e:
    print(e)
```

Output:

```
HTTP Error 403: Forbidden
```
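
Since HTTPError is a subclass of URLError, a handler that deals with both has to catch HTTPError first; otherwise the URLError clause would catch it too. The sketch below combines the two cases in one hypothetical helper, `fetch_status`; the name, the 10-second timeout and the returned messages are assumptions for illustration.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return a short description of what happened when opening url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK: read {len(response.read())} bytes"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404
        return f"HTTP error {err.code}: {err.reason}"
    except urllib.error.URLError as err:
        # The request never produced an HTTP response (DNS failure, missing file, ...)
        return f"URL error: {err.reason}"


# Confirms the class relationship that makes the clause order matter
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

# A file: URL that does not exist triggers URLError without needing a network
print(fetch_status("file:///no/such/file-urllib-demo"))
```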

* **urllib.robotparser**

This module contains a single class, RobotFileParser. This class answers questions about whether or not a particular user agent can fetch a URL on a website that publishes a robots.txt file. robots.txt is a text file that webmasters create to instruct web robots how to crawl pages on their website. The robots.txt file tells a web scraper which parts of the server should not be accessed.



```
# importing the robot parser module
import urllib.robotparser as rb

bot = rb.RobotFileParser()

# set the URL where the website's robots.txt file resides
# (set_url returns None)
x = bot.set_url('https://www.geeksforgeeks.org/robots.txt')
print(x)

# fetch and parse the file (read also returns None)
y = bot.read()
print(y)

# we can crawl the main site
z = bot.can_fetch('*', 'https://www.geeksforgeeks.org/')
print(z)

# but cannot crawl the disallowed URL
w = bot.can_fetch('*', 'https://www.geeksforgeeks.org/wp-admin/')
print(w)
```

Output (the last two values depend on the site's current robots.txt file):

```
None
None
True
False
```
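
The example above depends on a live website. To try the same logic offline, RobotFileParser can also be given the lines of a robots.txt file directly through its `parse()` method. The rules and the `example.com` URLs below are made up for illustration.

```python
import urllib.robotparser

# A tiny, made-up robots.txt: every user agent is kept out of /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)  # parse lines directly instead of fetching them with read()

# Paths outside /private/ are allowed
print(parser.can_fetch("*", "https://example.com/index.html"))         # True

# Paths under /private/ are disallowed
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # False
```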

