Ensure that you have both Beautiful Soup and Requests installed.
The "lxml" parser used later in this notebook is a separate package:
#   pip install beautifulsoup4
#   pip install requests
#   pip install lxml


In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [3]:
import requests 
from bs4 import BeautifulSoup

### The requests module provides a "get" function that
### fetches the webpage at the URL passed as its argument:

In [5]:
result = requests.get("https://www.google.com/")

To confirm that the request succeeded, we can check that
the response carries a 200 OK status code, which indicates
that the server returned the page:

In [6]:
print(result.status_code)

200
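
A request can also fail before any status code comes back (a timeout,
a DNS error, a malformed URL). The following is a minimal sketch, not
part of the original notebook, of a hypothetical helper named "fetch"
that wraps requests.get with a timeout and uses raise_for_status() to
treat any 4xx or 5xx response as a failure. The helper name and the
10-second default are assumptions made for illustration.

```python
import requests


def fetch(url, timeout=10.0):
    """Return the Response for `url`, or None if the request fails.

    `fetch` is a hypothetical helper, not part of the requests library.
    """
    try:
        # Without a timeout, requests.get can wait indefinitely.
        response = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx and 5xx responses.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP errors raised above.
        print(f"Request failed: {exc}")
        return None
    return response
```

Calling fetch("https://www.google.com/") should return the same kind of
response object as the cell above, or None if anything went wrong.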


For other potential status codes you may encounter,
consult the following Wikipedia page:
   https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
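
As a small offline companion to that list, Python's standard library
already knows the registered status codes. The sketch below uses
http.HTTPStatus to turn a numeric code into its standard phrase; the
function name "describe_status" is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (unrecognized status code)"
    return f"{code} {status.phrase}"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With the response from earlier, describe_status(result.status_code)
would give the label for whatever code the server returned.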

We can also inspect the HTTP response headers, which describe
the response (server, content type, cookies, and so on) and help
verify that we have indeed reached the intended page:

In [8]:
print(result.headers)

{'Date': 'Sat, 05 Mar 2022 12:25:27 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2022-03-05-12; expires=Mon, 04-Apr-2022 12:25:27 GMT; path=/; domain=.google.com; Secure, NID=511=gbs__K7DFQOY2W7jhs59GLsKobo2WPyroFPn4TBXVCWKqzJIbJcI-9syaTCzA8wXJ_wBksEqCUPgnAfrNopSQr3ucOay4rCH6z9OraUzOZ0XgiPLdB_bjQGHW6MZr4ULBMI5ZO5szNXA6PucX6kAxAiS5J_6lzsiByZedyzOPcU; expires=Sun, 04-Sep-2022 12:25:27 GMT; path=/; domain=.google.com; HttpOnly', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', 'Transfer-Encoding': 'chunked'}


For more information on HTTP headers and what they convey,
consult:
   https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
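
Individual headers can be read by name; result.headers behaves like a
dictionary with case-insensitive keys, so result.headers["content-type"]
works as well as "Content-Type". As a hedged illustration of what a
single header carries, the sketch below splits a Content-Type value
into its media type and charset using only the standard library. The
function name "parse_content_type" is an assumption for this example,
and the sample value is copied from the output above.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header value into (media_type, charset)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lowercase,
    # or None when the header does not declare one.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=ISO-8859-1")
print(media_type, charset)
```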

Now, let us store the content of the page returned by
requests in a variable:

In [10]:
src = result.content
print(src)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="c0r87ZnDNA3TzhiAiDQDQg==">(function(){window.google={kEI:\'t1YjYubsJOaVr7wPrKyMsAE\',kEXPI:\'0,1302536,56873,6059,206,4804,925,1391,383,246,5,1354,4013,1237,1122516,1197768,626,29,380068,16114,17444,11240,17572,4858,1362,9290,3026,17583,4020,978,13228,3847,4192,6430,22112,629,5081,887,707,1278,2742,149,562,541,840,6297,108,3406,606,2023,1777,520,14670,3228,2844,7,5599,11851,15768,552,1851,2614,3784,9358,3,576,1014,1,5445,148,11323,966,1686,4,1528,2304,6462,577,6345,13964,1714,3050,2658,7356,31,13628,13795,7428,5818,2539,4094,4052,3,3541,1,11943,2320,2544,38,25309,2,14022,1931,784,255,3278,1272,743,5853,10463,1160,5679,1021,2380,2718,18297,2,2,5,7718,4568,2587,3671,2984,3739,16695,1252,4606,2,2,5,1220,1
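
The leading b in the output shows that "content" holds raw bytes, not
text. requests also exposes a decoded string as result.text. To show
what that decoding step amounts to, here is a minimal offline sketch;
the helper "decode_body" and the sample markup are invented for this
example and do not come from the page above.

```python
def decode_body(raw, charset="utf-8"):
    """Decode a response body, replacing bytes invalid in `charset`."""
    return raw.decode(charset, errors="replace")


# Sample bytes standing in for a response body (made up for the example).
raw = "<title>Café</title>".encode("iso-8859-1")

print(type(raw).__name__)               # the body before decoding
print(decode_body(raw, "iso-8859-1"))   # the same body as text
```

BeautifulSoup accepts either form, so passing the bytes in "src"
directly, as the next cell does, is fine.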

Now that we have the page source stored, we will use
BeautifulSoup to parse it. To do so, we create a
BeautifulSoup object from the "src" variable created above.
The second argument names the parser; "lxml" requires the
lxml package to be installed:

In [11]:
soup = BeautifulSoup(src, 'lxml')
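
If lxml is not installed, the call above raises bs4.FeatureNotFound.
One possible way to cope, sketched here under the assumption that
Python's built-in "html.parser" is an acceptable substitute, is to
fall back to it. The helper name "make_soup" is hypothetical. The two
parsers can build slightly different trees from malformed HTML, so
results may not be identical on messy pages.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


demo = make_soup("<html><head><title>Demo</title></head></html>")
print(demo.title.string)
```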

Now that the page source has been parsed by BeautifulSoup,
we can extract specific information directly from it. For instance,
say we want to see a list of all of the links on the page:

In [12]:
links = soup.find_all("a")
print(links)
print("\n")

[<a class="gb1" href="https://www.google.co.in/imghp?hl=en&amp;tab=wi">Images</a>, <a class="gb1" href="https://maps.google.co.in/maps?hl=en&amp;tab=wl">Maps</a>, <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a>, <a class="gb1" href="https://www.youtube.com/?gl=IN&amp;tab=w1">YouTube</a>, <a class="gb1" href="https://news.google.com/?tab=wn">News</a>, <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>, <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>, <a class="gb1" href="https://www.google.co.in/intl/en/about/products?tab=wh" style="text-decoration:none"><u>More</u> »</a>, <a class="gb4" href="http://www.google.co.in/history/optout?hl=en">Web History</a>, <a class="gb4" href="/preferences?hl=en">Settings</a>, <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;passive=true&amp;continue=https://www.google.com/&amp;ec=GAZAAQ" id="gb_70" target="_top">Sign in</a>, <a href="/advanced_search?hl=en-IN&amp;authuser
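
Printing whole tags is noisy. A common next step is to keep only each
link's text and target, optionally restricted to one CSS class. The
sketch below runs on a small inline sample modeled on three of the
links in the output above; the function name "link_pairs" is an
assumption for illustration, and "html.parser" is used only so the
example has no dependency on lxml.

```python
from bs4 import BeautifulSoup

# A small stand-in for the page, modeled on links in the output above.
SAMPLE_HTML = """
<a class="gb1" href="https://news.google.com/?tab=wn">News</a>
<a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>
<a class="gb4" href="/preferences?hl=en">Settings</a>
"""


def link_pairs(html, css_class=None):
    """Return (text, href) for each <a>, optionally filtered by class."""
    soup = BeautifulSoup(html, "html.parser")
    if css_class:
        anchors = soup.find_all("a", class_=css_class)
    else:
        anchors = soup.find_all("a")
    # .get("href") returns None instead of raising if href is missing.
    return [(a.get_text(strip=True), a.get("href")) for a in anchors]


for text, href in link_pairs(SAMPLE_HTML):
    print(text, href)
```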

Perhaps we want only the link whose text contains "About"
rather than every link on the page. Each tag exposes a "text"
attribute that holds the text content between its <a> and </a>
tags:

In [13]:
for link in links:
    if "About" in link.text:
        print(link)
        print(link.attrs['href'])

<a href="/intl/en/about.html">About Google</a>
/intl/en/about.html
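
The href printed above is relative to the site root, so it cannot be
requested as-is. A minimal sketch of one way to handle this, assuming
the page was fetched from https://www.google.com/, is to resolve each
match with urllib.parse.urljoin. The function name
"find_links_by_text" and the sample markup are invented for this
example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up sample markup; the last anchor deliberately has no href.
SAMPLE_HTML = """
<a href="/intl/en/ads/">Advertising</a>
<a href="/intl/en/about.html">About Google</a>
<a>About nothing</a>
"""


def find_links_by_text(html, needle, base_url):
    """Return absolute URLs of links whose text contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    # href=True skips anchors without an href attribute, which would
    # otherwise raise KeyError on a["href"].
    for a in soup.find_all("a", href=True):
        if needle in a.get_text():
            matches.append(urljoin(base_url, a["href"]))
    return matches


print(find_links_by_text(SAMPLE_HTML, "About", "https://www.google.com/"))
```

The resulting absolute URL can be passed straight to requests.get to
follow the link.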
