# What is urllib?
The urllib package is Python's URL handling library. It is used to fetch URLs (Uniform Resource Locators) through the urlopen function, which supports a variety of protocols. It also provides tools for building, loading, and parsing URLs.

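## Installation

urllib is part of the Python 3 standard library, so it is available as soon as Python is installed. No pip install step is required.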

## Import urllib

In [1]:
import urllib.request

##### urllib is a package that collects several modules for working with URLs:

## urllib.request
This module defines functions and classes for opening URLs (mostly HTTP). One of the simplest ways to open a URL is urllib.request.urlopen(url).

In [2]:
request_url = urllib.request.urlopen('https://www.google.com')
print(request_url.read())  # prints the HTML source of the Google home page as a bytes object

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="R1teH6BYlJlcCzwb+SAACA==">(function(){window.google={kEI:\'GDDIYIKUD9etoATWvovIBQ\',kEXPI:\'0,772215,1,530320,56873,954,5105,206,4804,2316,383,246,5,1354,5250,16232,10,1106274,1197719,533,31,328984,51224,16111,28687,17572,4859,1361,9291,3028,3889,13691,4020,978,13228,2676,1171,4192,6430,1141,7512,5875,234,4282,2779,918,2855,2226,889,704,1279,2212,239,291,149,1103,840,1986,210,4101,4120,2024,2296,1704,12966,3227,2845,7,5599,6755,5096,7876,3748,1181,108,3407,908,2,941,2614,2399,7468,3275,3,346,230,1014,1,5445,148,5990,5333,2652,4,1253,275,2304,1236,5803,74,1717,266,2627,2014,4067,7434,2110,1714,3050,2658,4243,518,2596,30,3854,1810,7964,1592,713,638,1494,617,4969,7266,3269,665,5800,2557,2046,2048,3138,6,6
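A slightly more defensive pattern, sketched here under a few assumptions, opens the response in a with block so it is closed automatically and decodes the bytes using the charset the response announces. The helper name fetch_text, the timeout value, and the data: URL are illustrative choices and not part of the notebook above. A data: URL is used only so that the sketch runs without network access; an ordinary https URL can be passed in the same way.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Open a URL and return its body decoded to str (illustrative helper)."""
    # The with block closes the response even if reading fails.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL carries its own payload, so no network connection is needed.
text = fetch_text("data:text/plain;charset=utf-8,hello%20urllib")
print(text)  # hello urllib
```

Decoding the body gives a str rather than the b'...' bytes literal shown in the output above, which is usually easier to search or parse.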

## urllib.parse
This module defines functions for manipulating URLs and their component parts, either to build them or to break them apart. It mostly deals with splitting a URL into its components or joining components back into a URL string.

In [3]:
from urllib.parse import urlparse, urlunparse
parse_url = urlparse('https://www.google.com/python')
print(parse_url)
print("\n")
unparse_url = urlunparse(parse_url)
print(unparse_url)

ParseResult(scheme='https', netloc='www.google.com', path='/python', params='', query='', fragment='')


https://www.google.com/python


### Other functions in urllib.parse
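A few of the other helpers in urllib.parse can be sketched as follows. The selection (urlsplit, parse_qs, urlencode, urljoin, quote) and the www.example.com URL are assumptions made for illustration. This is not a complete list of the module's functions.

```python
from urllib.parse import parse_qs, quote, urlencode, urljoin, urlsplit

# Hypothetical URL used only for demonstration.
url = "https://www.example.com/search?q=python+urllib&page=2#results"

# urlsplit is like urlparse but does not split out the rarely used params field.
parts = urlsplit(url)
print(parts.netloc)    # www.example.com
print(parts.path)      # /search
print(parts.fragment)  # results

# parse_qs turns a query string into a dict mapping each key to a list of values.
query = parse_qs(parts.query)
print(query)  # {'q': ['python urllib'], 'page': ['2']}

# urlencode goes the other way: it builds a query string from a dict.
encoded = urlencode({"q": "python urllib", "page": 2})
print(encoded)  # q=python+urllib&page=2

# urljoin resolves a relative reference against a base URL.
joined = urljoin("https://www.example.com/docs/", "tutorial.html")
print(joined)  # https://www.example.com/docs/tutorial.html

# quote percent-encodes characters that are unsafe in a URL path.
escaped = quote("a b/c")
print(escaped)  # a%20b/c
```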

## urllib.error
This module defines the exception classes raised by urllib.request. Whenever fetching a URL fails, one of the following exceptions is raised:

URLError – Raised when a URL cannot be handled or fetched, for example because of a connectivity problem. It has a ‘reason’ attribute that describes the cause of the error.

HTTPError – Raised when the server responds with an HTTP error status, such as an authentication failure. It is a subclass of URLError and exposes the status in its ‘code’ attribute. Typical errors include ‘404’ (page not found), ‘403’ (request forbidden), and ‘401’ (authentication required).

In [5]:
import urllib.request
import urllib.error

# Try to read the URL with no internet connectivity
try:
    x = urllib.request.urlopen('https://www.google.com')
    print(x.read())

# Catch the exception raised by urlopen
except urllib.error.URLError as e:
    print(str(e))

<urlopen error [Errno 11001] getaddrinfo failed>
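Because HTTPError is a subclass of URLError, code that wants to tell the two apart has to check for HTTPError first. The sketch below is one way to do that. The helper names describe_error and safe_fetch, the message wording, and the hand-built exception objects are assumptions for illustration. The exceptions are constructed directly only so that the sketch runs without a network connection.

```python
import email.message
import urllib.error
import urllib.request


def describe_error(exc):
    """Return a short message for a urllib exception (illustrative helper)."""
    # Check the subclass first; every HTTPError is also a URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    return f"Could not reach server: {exc.reason}"


def safe_fetch(url, timeout=10):
    """Return the response body as bytes, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        print(describe_error(exc))
        return None


# Hand-built exceptions stand in for real failures in this offline demo.
not_found = urllib.error.HTTPError(
    "https://www.example.com/missing", 404, "Not Found",
    email.message.Message(), None,
)
offline = urllib.error.URLError("getaddrinfo failed")

print(describe_error(not_found))  # HTTP error 404: Not Found
print(describe_error(offline))    # Could not reach server: getaddrinfo failed
```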


## urllib.robotparser
This module provides a single class, RobotFileParser, which answers whether a particular user agent may fetch a URL on the website that published the robots.txt file.

The robots.txt file format is a simple text-based access control system for programs that automatically access web resources.

In [10]:
from urllib import parse
from urllib import robotparser

AGENT_NAME = 'PyMOTW'
URL_BASE = 'https://pymotw.com/'
parser = robotparser.RobotFileParser()
parser.set_url(parse.urljoin(URL_BASE, 'robots.txt'))
# Read the robots.txt URL and feed it to the parser.
parser.read()

PATHS = [
    '/',
    '/PyMOTW/',
    '/admin/',
    '/downloads/PyMOTW-1.92.tar.gz',
]

for path in PATHS:
    # can_fetch() returns True if the user agent is allowed to fetch the URL
    # according to the rules in the parsed robots.txt file.
    print(parser.can_fetch(AGENT_NAME, path), path)
    url = parse.urljoin(URL_BASE, path)
    print(parser.can_fetch(AGENT_NAME, url), url)
    print()

True /
True https://pymotw.com/

True /PyMOTW/
True https://pymotw.com/PyMOTW/

False /admin/
False https://pymotw.com/admin/

False /downloads/PyMOTW-1.92.tar.gz
False https://pymotw.com/downloads/PyMOTW-1.92.tar.gz
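The same checks can be tried without a network connection by handing the rules to the parser directly through its parse() method. The robots.txt content below is a made-up example chosen to resemble the results above. It is not the actual file served by pymotw.com.

```python
from urllib import robotparser

# Hypothetical robots.txt rules, written inline for the demonstration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /downloads/
"""

offline_parser = robotparser.RobotFileParser()
# parse() accepts the lines of a robots.txt file instead of fetching a URL.
offline_parser.parse(ROBOTS_TXT.splitlines())

for path in ["/", "/PyMOTW/", "/admin/", "/downloads/PyMOTW-1.92.tar.gz"]:
    print(offline_parser.can_fetch("PyMOTW", path), path)
```

Paths under /admin/ and /downloads/ are reported as not fetchable, while the others are allowed.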

