# **Web scraping using Beautiful Soup**

This notebook demonstrates web scraping: it takes a website URL as input and extracts the information listed below from that webpage.


1. Extracting specific HTML tags along with the title and meta description
2. Extracting specific tags and all heading tags (h1-h6) along with the title and meta description
3. Extracting ALT tags
4. Counting words inside a webpage
5. Inspecting broken links inside a webpage
6. Extracting the source code of the webpage in Google Colab
7. Extracting all URLs from a website without duplication
8. Measuring the frontend and backend performance of a website






In [3]:
!pip install beautifulsoup4



**1. For scraping specific HTML tags along with titles and meta description**

In [4]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

[Target URL](https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education)

In [5]:
# Getting input for website from user
urlinput = input("Enter url :") # https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
print(" This is the website link that you entered", urlinput)

# For extracting specific tags from webpage
def getTags(tag):
  s = ur.urlopen(urlinput)
  # Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
  soup = BeautifulSoup(s.read(), "html.parser")
  return soup.find_all(tag)

# For extracting specific title & meta description from webpage
def titleandmetaTags():
    s = ur.urlopen(urlinput)
    soup = BeautifulSoup(s.read(), "html.parser")
    #----- Extracting Title from website ------#
    title = soup.title.string
    print ('Website Title is :', title)
    #-----  Extracting Meta description from website (disabled: kept inside a string literal) ------#
    '''
    meta_description = soup.find_all('meta')
    for tag in meta_description:
        if 'name' in tag.attrs.keys() and tag.attrs['name'].strip().lower() in ['description', 'keywords']:
            print("---------------------------------------")
            print ('NAME    :',tag.attrs['name'].lower())
            print("---------------------------------------")
            print ('CONTENT :',tag.attrs['content'])
            print("----------------*****-----------------------") '''

#------------- Main ---------------#
if __name__ == '__main__':
  titleandmetaTags()
  tags = getTags('h1')
  for tag in tags:
     print("tag:")
     print(tag) # display tags
     print("tag content:")
     print(tag.contents) # display contents of the tags


Enter url :https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
 This is the website link that you entered https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
Website Title is : Manipal Academy of Higher Education - Wikipedia
tag:
<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Manipal Academy of Higher Education</span></h1>
tag content:
[<span class="mw-page-title-main">Manipal Academy of Higher Education</span>]
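
The meta-description step is wrapped in a string literal in the cell above, so it never runs. A minimal sketch of that extraction is given here, assuming an inline HTML string instead of a live page so that it needs no network access. The sample markup and the helper name `meta_by_name` are made up for illustration and are not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; any HTML string can be passed instead.
SAMPLE_HTML = """
<html><head>
  <title>Sample page</title>
  <meta name="Description" content="A short summary of the page.">
  <meta name="keywords" content="scraping, python">
  <meta name="viewport" content="width=device-width">
</head><body><h1>Hello</h1></body></html>
"""

def meta_by_name(html, wanted=("description", "keywords")):
    """Return {name: content} for <meta> tags whose name is listed in `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for tag in soup.find_all("meta"):
        # Names are compared case-insensitively, as in the notebook's disabled code
        name = (tag.get("name") or "").strip().lower()
        if name in wanted:
            # .get() avoids a KeyError when a tag has no content attribute
            found[name] = tag.get("content", "")
    return found

meta = meta_by_name(SAMPLE_HTML)
print(meta)
```

Using `tag.get(...)` rather than `tag.attrs[...]` keeps the loop from failing on `<meta>` tags that lack a `name` or `content` attribute.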


**2. For extracting specific tags, all heading tags from h1-h6 along with titles and meta description**

In [6]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

In [7]:
# Getting input for website from user
url_input = input("Enter url :")
print(" This is the website link that you entered", url_input)
print("---------------------------------------")


# For extracting specific tags from webpage
def getTags(tag):
  s = ur.urlopen(url_input)
  soup = BeautifulSoup(s.read(), "html.parser")
  return soup.find_all(tag)

# For extracting all h1-h6 heading tags from webpage
def headingTags():
  h = ur.urlopen(url_input)
  soup = BeautifulSoup(h.read(), "html.parser")
  print("List of headings from headingtags function h1, h2, h3, h4, h5, h6 :")
  print("---------------------------------------")
  for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    print(heading.name + ' ' + heading.text.strip())
    print("---------------------------------------")

# For extracting specific title & meta description from webpage
def titleandmetaTags():
    # Uses url_input from this cell (not urlinput from section 1)
    s = ur.urlopen(url_input)
    soup = BeautifulSoup(s.read(), "html.parser")
    #----- Extracting Title from website ------#
    title = soup.title.string
    print ('Website Title is :', title)
    print("---------------------------------------")
    '''
    #-----  Extracting Meta description from website ------#
    meta_description = soup.find_all('meta')
    for tag in meta_description:
        if 'name' in tag.attrs.keys() and tag.attrs['name'].strip().lower() in ['description', 'keywords']:
            #print ('NAME    :',tag.attrs['name'].lower())
            print ('CONTENT :',tag.attrs['content'])
'''


#------------- Main ---------------#
if __name__ == '__main__':
  titleandmetaTags()
  tags = getTags('p')
  headingTags()  # prints the headings; it returns nothing
  for tag in tags:
     print(" Here are the tags from getTags function:", tag.contents)





Enter url :https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
 This is the website link that you entered https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
---------------------------------------
Website Title is : Manipal Academy of Higher Education - Wikipedia
---------------------------------------
List of headings from headingtags function h1, h2, h3, h4, h5, h6 :
---------------------------------------
h2 Contents
---------------------------------------
h1 Manipal Academy of Higher Education
---------------------------------------
h2 Governance
---------------------------------------
h2 History
---------------------------------------
h2 Academics
---------------------------------------
h3 Rankings
---------------------------------------
h3 Libraries
---------------------------------------
h2 Research
---------------------------------------
h3 Manipal Advanced Research Group
---------------------------------------
h2 Notable alumni
--------------------
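
The flat heading list above can also be turned into an indented outline with a count per level. The following is a minimal sketch, assuming a small inline HTML sample rather than the live page; `heading_outline` and `SAMPLE_HTML` are illustrative names, not part of the notebook.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the headings printed above.
SAMPLE_HTML = """
<h1>Title</h1>
<h2>History</h2>
<h2>Academics</h2>
<h3>Rankings</h3>
<h3>Libraries</h3>
<h2>Research</h2>
"""

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]

def heading_outline(html):
    """Return (tag name, text) pairs for every heading, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [(h.name, h.get_text(strip=True)) for h in soup.find_all(HEADING_TAGS)]

outline = heading_outline(SAMPLE_HTML)
level_counts = Counter(name for name, _ in outline)

for name, text in outline:
    # Indent by heading depth: h1 gets no indent, h2 two spaces, and so on
    indent = "  " * (int(name[1]) - 1)
    print(f"{indent}{name} {text}")
print(dict(level_counts))
```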

**3. For extracting ALT tags (image alternative text)**

In [None]:
from bs4 import BeautifulSoup
import urllib.request as ur

url_input = input("Enter url :")
print("The website link that you entered is:", url_input)

def alt_tag():
  url =  ur.urlopen(url_input)
  htmlSource = url.read()
  url.close()
  soup = BeautifulSoup(htmlSource, "html.parser")
  print('\n The alt tag along with the text in the web page')
  # alt=True matches every <img> that has an alt attribute, including empty ones (alt="")
  print(soup.find_all('img', alt=True))



#------------- Main ---------------#
if __name__ == '__main__':
  alt_tag()


Enter url :https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
The website link that you entered is: https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education

 The alt tag along with the text in the web page
[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>, <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>, <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>, <img alt="" class="mw-file-element" data-file-height="1376" data-file-width="1024" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/18px-Commons-logo.svg.png 1

In [None]:
# For reviewing alt tags on separate lines
# (needs a global `soup`; the one created inside alt_tag() is local to that function)
soup.find_all('img', alt=True)

[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>,
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>,
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>,
 <img alt="" class="mw-file-element" data-file-height="1376" data-file-width="1024" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/12px-Commons-logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/18px-Commons-logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/24px-Commons-logo.svg.png 2x" width="12"/>,
 <img alt="Edit this at Wikidata" class="mw-file-element" data-file-height="20" data-file-width="20" decoding="asyn
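
As the output shows, `alt=True` also returns images whose alt text is empty (`alt=""`). A minimal sketch of a stricter check follows, assuming an inline HTML sample; `audit_alt_text` is an illustrative helper, not part of the notebook. It separates images with descriptive alt text from those with an empty or missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup covering the three cases.
SAMPLE_HTML = """
<img src="logo.png" alt="Company logo">
<img src="spacer.png" alt="">
<img src="chart.png">
"""

def audit_alt_text(html):
    """Group image sources by the state of their alt attribute."""
    soup = BeautifulSoup(html, "html.parser")
    report = {"described": [], "empty": [], "missing": []}
    for img in soup.find_all("img"):
        alt = img.get("alt")          # None when the attribute is absent
        src = img.get("src", "")
        if alt is None:
            report["missing"].append(src)
        elif alt.strip() == "":
            report["empty"].append(src)
        else:
            report["described"].append((src, alt))
    return report

report = audit_alt_text(SAMPLE_HTML)
print(report)
```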

**4. For counting words inside a web page**

In [None]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Getting content from web page
r = requests.get("https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education")
soup = BeautifulSoup(r.content, "html.parser")

# For getting words within paragraphs
text_paragraph = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
count_paragraph = Counter((x.rstrip(punctuation).lower() for y in text_paragraph for x in y.split()))

# For getting words inside div tags
# NOTE: text in nested <div>/<p> elements is visited more than once, and inline <style>
# rules are counted as words, so the counts overstate the visible text.
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
count_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))

# Adding the two counters to get a list of word counts (from most to least common)
total = count_div + count_paragraph
list_most_common_words = total.most_common()

In [None]:
# Number of distinct words on the webpage (len() of a Counter counts unique keys)
len(total)

1107

In [None]:
# List of common words
list_most_common_words

[('university', 594),
 ('of', 558),
 ('the', 358),
 ('.hlist', 343),
 ('', 335),
 ('in', 309),
 ('manipal', 288),
 ('and', 269),
 ('.mw-parser-output', 234),
 ('education', 173),
 ('a', 144),
 ('higher', 143),
 ('retrieved', 118),
 ('academy', 115),
 ('online', 112),
 ('from', 106),
 ('research', 98),
 ('karnataka', 95),
 ('to', 92),
 ('by', 92),
 ('2023', 91),
 ('sciences', 85),
 ('2022', 84),
 ('july', 80),
 ('institutional', 77),
 ('ranking', 77),
 ('framework', 77),
 ('1', 70),
 ('institute', 65),
 ('on', 65),
 ('b', 64),
 ('india', 63),
 ('2021', 63),
 ('li', 63),
 ('is', 61),
 ('technology', 61),
 ('—', 60),
 ('state', 60),
 ('was', 58),
 ('rankings', 56),
 ('business', 56),
 ('.navbox', 56),
 ('as', 54),
 ('with', 51),
 ('science', 51),
 ('universities', 49),
 ('times', 49),
 ('national', 49),
 ('archived', 49),
 ('original', 49),
 ('august', 49),
 ('dl,.mw-parser-output', 49),
 ('ol,.mw-parser-output', 49),
 ('dd', 49),
 ('dt', 49),
 ('bangalore', 47),
 ('5', 42),
 ('mahe', 42)
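
Entries such as `.hlist` and `.mw-parser-output` in the list above are CSS selectors from inline `<style>` blocks, and words inside nested tags are counted more than once. One possible way around both effects is sketched here, assuming an inline HTML sample. It is not the notebook's method: `count_words` is an illustrative helper that drops `<script>` and `<style>` elements and reads the page text once.

```python
import re
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical sample page with a style block, a nested paragraph and a script.
SAMPLE_HTML = """
<html><head><style>.hlist dl { margin: 0 }</style></head>
<body>
  <div><p>Manipal is a university. The university is in Karnataka.</p></div>
  <script>var x = 1;</script>
</body></html>
"""

def count_words(html):
    """Count words in the visible text, visiting each text node exactly once."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose text is code, not page content
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Lowercase and keep runs of letters, digits and apostrophes as words
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)

counts = count_words(SAMPLE_HTML)
print(counts.most_common(3))
print("distinct:", len(counts), "total:", sum(counts.values()))
```

`len(counts)` gives the number of distinct words, while `sum(counts.values())` gives the total number of words.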

**5. For inspecting Broken links inside a webpage**

A page that is available returns the response code 200; a page that cannot be found returns the response code 404.

In [None]:
# Importing libraries
from bs4 import BeautifulSoup, SoupStrainer
import requests

# Getting URL from user
url = input("Enter your url: ")

def broken_page():
  # For making request to get the URL
  user_req_page = requests.get(url)

  # For getting the response code of given URL
  response_code = str(user_req_page.status_code)

  # Response body (the page HTML) as a string
  data = user_req_page.text

  # For using BeautifulSoup to access the built-in methods
  soup = BeautifulSoup(data, "html.parser")

  # List every link found on the page. The status code printed next to each link is that of
  # the page entered above (200 if it is available, 404 for PAGE NOT FOUND); the links
  # themselves are not requested here.
  for link in soup.find_all('a'):
    print(f"Url: {link.get('href')} | Status Code: {response_code}")


#----- NOTE ------#
# --------- TO VERIFY PAGE NOT FOUND 404 ERROR, enter below web link as a input URL --------#
#https://roine.github.com/p1

#------------- Main ---------------#
if __name__ == '__main__':
  broken_page()

Enter your url: https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
Url: #bodyContent | Status Code: 200
Url: /wiki/Main_Page | Status Code: 200
Url: /wiki/Wikipedia:Contents | Status Code: 200
Url: /wiki/Portal:Current_events | Status Code: 200
Url: /wiki/Special:Random | Status Code: 200
Url: /wiki/Wikipedia:About | Status Code: 200
Url: //en.wikipedia.org/wiki/Wikipedia:Contact_us | Status Code: 200
Url: https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en | Status Code: 200
Url: /wiki/Help:Contents | Status Code: 200
Url: /wiki/Help:Introduction | Status Code: 200
Url: /wiki/Wikipedia:Community_portal | Status Code: 200
Url: /wiki/Special:RecentChanges | Status Code: 200
Url: /wiki/Wikipedia:File_upload_wizard | Status Code: 200
Url: /wiki/Main_Page | Status Code: 200
Url: /wiki/Special:Search | Status Code: 200
Url: /w/index.php?title=Special:CreateAccount&returnto=Manipal+Aca
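
The output above repeats the status code of the entered page for every link. To test the links themselves, each `href` first has to be resolved to an absolute URL and then requested. A minimal sketch of that idea follows. The names `resolve_links`, `find_broken` and `http_status` are illustrative assumptions, and the demonstration uses a stand-in status function with made-up data so that it runs offline.

```python
from urllib.parse import urljoin, urldefrag

def resolve_links(base_url, hrefs):
    """Turn raw href values into unique absolute http(s) URLs, keeping order."""
    seen, resolved = set(), []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        # urljoin handles relative and protocol-relative links; urldefrag drops "#..."
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved

def find_broken(urls, get_status):
    """Return (url, status) pairs with status >= 400, or None for a failed request."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken

def http_status(url):
    """Network-backed checker: status code of a HEAD request, or None on error."""
    import requests
    try:
        return requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        return None

# Offline demonstration with hypothetical hrefs and a stand-in status lookup.
base = "https://example.com/wiki/Page"
hrefs = ["#top", "/wiki/Main_Page", "//example.com/about", "missing", None,
         "mailto:a@example.com"]
fake_statuses = {"https://example.com/wiki/missing": 404}

urls = resolve_links(base, hrefs)
broken = find_broken(urls, lambda u: fake_statuses.get(u, 200))
print(urls)
print(broken)
```

With a live page, `find_broken(urls, http_status)` would request every link. Some servers do not answer `HEAD` requests, so switching to `requests.get` may be needed in practice.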

**6. For getting the source code of the webpage**

Here, we use Selenium's `page_source` attribute, which returns the source of the page that the browser currently has loaded.

*NOTE: The page source is the HTML markup behind a webpage.*

In [None]:
# install chromium, its driver, and selenium
!apt update
!apt install chromium-chromedriver
!pip install selenium

# run Chrome headless; the two extra flags are commonly needed inside containers such as Colab
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

In [None]:
#------------- FOR DISPLAYING SOURCE CODE OF THE WEBPAGE -------------#

# open it, go to a website, and get results
wd = webdriver.Chrome(options=options)

# Prompt user to enter the URL
url = input("Enter your url: ")

# For making request to get the URL
wd.get(url)

# To display code results
print(wd.page_source)

Enter your url: https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
<html class="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available vector-animations-ready vector-feature-appearance-pinned-clientpref-0 ve-available" lang="en" dir="ltr"><head>
<meta charset="UTF-8">
<title>Manipal Academy of Higher Education - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-pag

**7. Extraction of all URLs from a website without duplication**

In [None]:
#---- Importing libraries ----#
import re
import requests
from bs4 import BeautifulSoup

all_links = set() #------ Creating a unique set of links ------#

# One request is enough: the URL has no placeholder to format, so a loop would fetch the same page again
r = requests.get("https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education")
soup = BeautifulSoup(r.content, "html.parser")
for a in soup.find_all("a", href=re.compile('/')):
    link = a.get('href')
    #----- Removing duplicate URLs: a link is printed and added to the set only the first time it is seen ------#
    if link not in all_links:
        print(link)
        all_links.add(link)

/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Manipal+Academy+of+Higher+Education
/w/index.php?title=Special:UserLogin&returnto=Manipal+Academy+of+Higher+Education
/wiki/Special:MyContributions
/wiki/Special:MyTalk
https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A4%A3%E0%A4%BF%E0%A4%AA%E0%A4%BE%E0%A4%B2_%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%8D%E0%A4%B5%E0%A4%B5%E0%A4%BF%E0%A4%A6%E0%A5%8D%E0%A4%AF%E0%A4%BE%E0%A4%B2%E0%A4%AF
https://kn.wikipedia.org/wiki/%E0%B2%AE%E0%B2%A3%E0%B2%BF%E0%B2%AA%E0%B2%BE%E0%B2%B2_%E0%B2%85%E0%B2%95%E0%B2%BE%E0
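
The list above mixes relative links (`/wiki/...`), protocol-relative links (`//en.wikipedia.org/...`) and absolute links. A minimal sketch of one way to normalise them follows, assuming the hrefs have already been collected: every href is resolved against the page URL and the unique results are split by host. `split_links` is an illustrative helper, and the donation link is shortened for the example.

```python
from urllib.parse import urljoin, urlparse

def split_links(page_url, hrefs):
    """Resolve hrefs against page_url; return sorted unique (internal, external) URLs."""
    page_host = urlparse(page_url).netloc
    internal, external = set(), set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        host = urlparse(absolute).netloc
        # Sets drop duplicates, exactly as all_links does in the cell above
        (internal if host == page_host else external).add(absolute)
    return sorted(internal), sorted(external)

page = "https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education"
hrefs = [
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "/wiki/Main_Page",  # duplicate on purpose
    "https://donate.wikimedia.org/wiki/Special:FundraiserRedirector",
]

internal, external = split_links(page, hrefs)
print(internal)
print(external)
```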

**8. Measuring the frontend and backend performance of a website**

In [None]:
#----- Installation of selenium and chromedriver in google colab -----#
!pip install selenium
!apt-get update
!apt install chromium-chromedriver


In [None]:
#---- Importing libraries ----#
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
import os.path

In [None]:
#---- Configuring headless Chrome in google colab ----#
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# No driver is started here: the measurement loop creates a fresh driver for each iteration

In [None]:
#----- Creating csv file to write the calculated performance of the website
csv_path = "performance.csv"
file = open(csv_path, 'w', newline='')
writer = csv.writer(file)
writer.writerow(["backendPerformance_calc","frontendPerformance_calc"])


#----- Getting input for website from user
url = input("Enter url :")
print("This is the website link that you entered:", url)

#----- Setting iterations for testing the performance
iterations = 10
for i in range(iterations):
    driver =webdriver.Chrome(options=options)
    driver.get(url) #-- Passing url as parameter in Selenium method (driver.get)

    #-- Using the Navigation Timing API: driver.execute_script synchronously runs JavaScript in the
    #-- current window or frame and returns its value, here three timestamps in milliseconds.
    navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
    responseStart = driver.execute_script("return window.performance.timing.responseStart")
    domComplete = driver.execute_script("return window.performance.timing.domComplete")

    backendPerformance_calc = responseStart - navigationStart
    frontendPerformance_calc = domComplete - responseStart

    #--This will print iteration wise backend and front end performance for website
    print("Iteration no:", i)
    print("Back End performance in MS: %s" % backendPerformance_calc)
    print("Front End performance in MS: %s" % frontendPerformance_calc)
    print("------------------------")

    #-- Writing row wise data in the file
    writer.writerow([backendPerformance_calc,frontendPerformance_calc])
    driver.close()




Enter url :https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
This is the website link that you entered: https://en.wikipedia.org/wiki/Manipal_Academy_of_Higher_Education
Iteration no: 0
Back End performance in MS: 268
Front End performance in MS: 369
------------------------
Iteration no: 1
Back End performance in MS: 192
Front End performance in MS: 484
------------------------
Iteration no: 2
Back End performance in MS: 205
Front End performance in MS: 456
------------------------
Iteration no: 3
Back End performance in MS: 188
Front End performance in MS: 482
------------------------
Iteration no: 4
Back End performance in MS: 199
Front End performance in MS: 723
------------------------
Iteration no: 5
Back End performance in MS: 183
Front End performance in MS: 455
------------------------
Iteration no: 6
Back End performance in MS: 180
Front End performance in MS: 327
------------------------
Iteration no: 7
Back End performance in MS: 172
Front End performance in



* Backend Performance: The time taken from the start of the navigation to the start of the server response.

* Frontend Performance: The time taken from the start of the server response to the complete loading of the document.

* The script iterates 10 times, performing these calculations and logging the results in a CSV file to analyze the performance metrics for the given URL.
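
The two definitions above reduce to two subtractions on Navigation Timing timestamps. A minimal sketch, assuming the three values are held in a plain dictionary: `split_timings` is an illustrative helper and the timestamps are made-up values chosen to reproduce iteration 0.

```python
def split_timings(timing):
    """Return (backend_ms, frontend_ms) from Navigation Timing timestamps in ms."""
    # Backend: navigation start until the first byte of the response arrives
    backend = timing["responseStart"] - timing["navigationStart"]
    # Frontend: first byte of the response until the document has finished loading
    frontend = timing["domComplete"] - timing["responseStart"]
    return backend, frontend

# Hypothetical timestamps (milliseconds since the epoch)
sample = {
    "navigationStart": 1_700_000_000_000,
    "responseStart": 1_700_000_000_268,
    "domComplete": 1_700_000_000_637,
}

backend_ms, frontend_ms = split_timings(sample)
print("Back End performance in MS:", backend_ms)
print("Front End performance in MS:", frontend_ms)
```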

In [None]:
#---- For closing the CSV file and the WebDriver ----#
driver.quit()
file.close()


In [None]:
#---- To view performance in a dataframe ----#
import pandas as pd
df=pd.read_csv("performance.csv")

In [None]:
#----- Displaying DataFrames output ------#
df

|   | backendPerformance_calc | frontendPerformance_calc |
|---|---|---|
| 0 | 268 | 369 |
| 1 | 192 | 484 |
| 2 | 205 | 456 |
| 3 | 188 | 482 |
| 4 | 199 | 723 |
| 5 | 183 | 455 |
| 6 | 180 | 327 |
| 7 | 172 | 310 |
| 8 | 172 | 492 |
| 9 | 183 | 516 |
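
Ten runs are easier to compare once they are reduced to a few summary figures. A minimal sketch using only the standard library follows, assuming the CSV content matches the values shown above; `summarise` is an illustrative helper, and the text is embedded inline so the example does not depend on `performance.csv` being present.

```python
import csv
import io
import statistics

# Same rows as the DataFrame above, embedded as CSV text.
CSV_TEXT = """backendPerformance_calc,frontendPerformance_calc
268,369
192,484
205,456
188,482
199,723
183,455
180,327
172,310
172,492
183,516
"""

def summarise(csv_text):
    """Return {column: {"mean", "median", "max"}} for every column, in ms."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [int(row[column]) for row in rows]
        summary[column] = {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "max": max(values),
        }
    return summary

summary = summarise(CSV_TEXT)
for column, stats in summary.items():
    print(column, stats)
```

The median is less sensitive than the mean to a single slow run, such as the 723 ms frontend measurement in iteration 4.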
