## Web Scraping Tutorial With Beautiful Soup
In this tutorial I will scrape the <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia Main Page</a>.

In [2]:
# Importing Required Modules/Packages
import requests
from bs4 import BeautifulSoup

# URL to be scraped
_url = "https://en.wikipedia.org/wiki/Main_Page"

In [3]:
# Getting the html content
response = requests.get(_url)
htmlContent = response.content
# print(htmlContent)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YDPL@whpFZRGPMIZFRKs@gAAAMs","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1004593520,"wgRevisionId":1004593520,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":!1,"wgRelevantPageIs
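The request above assumes the server answered successfully. A slightly more defensive fetch could look like the sketch below; the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes for `url` (hypothetical helper)."""
    # A timeout keeps the call from waiting indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # .content is bytes; .text would give the decoded string instead.
    return response.content
```

With this helper, the cell above would become `htmlContent = fetch_html(_url)`.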

In [4]:
# After getting the HTML content, we parse it so that we can navigate and read it
soupObj = BeautifulSoup(htmlContent, 'html.parser')
# print(soupObj)
print("----------------*********************-------------------")
print("Printing prettified Soup Content")
# print(soupObj.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YDPL@whpFZRGPMIZFRKs@gAAAMs","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1004593520,"wgRevisionId":1004593520,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbabl

#### With the prettified HTML content in hand, the next task is to traverse the HTML tree and find the elements we want, for example all the text, all the anchor tags, or all the images present on the page.
To illustrate what HTML tree traversal means, I have attached an image below.

<img src="https://raw.githubusercontent.com/dev-sandarbh/assets/master/HTML-tree.png"/>
<a href="http://web.simmons.edu/~grabiner/comm244/weekfour/document-tree.html" target="_blank">Image Credit and For Further Reading About HTML Tree</a>
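To make the tree idea concrete, here is a small sketch that walks up, down, and sideways in a parsed document. The `sample_html` string is a made-up stand-in for a real page, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page used only for illustration.
sample_html = (
    "<html><head><title>Sample page</title></head>"
    '<body><div id="main"><p>First</p><p>Second</p></div></body></html>'
)
soup = BeautifulSoup(sample_html, "html.parser")

first_p = soup.find("p")

# Going up: the parent of the first <p> is the <div>.
parent_name = first_p.parent.name

# Going down: direct children of the <div> (text nodes have name None, so skip them).
child_names = [child.name for child in soup.find("div").children if child.name]

# Going sideways: the next <p> on the same level.
next_text = first_p.find_next_sibling("p").get_text()

print(parent_name)
print(child_names)
print(next_text)
```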

In [5]:
# Traversing the HTML tree of our scraped URL
# Printing the title of the page
print("Title: ", soupObj.title)

print("Printing all the Paragraph Tags..........")
# Printing all the paragraph tags 
paras = soupObj.find_all('p')
# print("Paragraph tags present in our scraped page: ", paras, end=" ")
for ps in paras:
    print(ps)

print("Printing all the Anchor Tags..........")
# Printing all the anchor tags 
anchors = soupObj.find_all('a')
for anc in anchors:
    print(anc)
# print("Anchor tags present in our scraped page: ", anchors, end=" ")

Title:  <title>Wikipedia, the free encyclopedia</title>
Printing all the Paragraph Tags..........
<p>The <b><a href="/wiki/Siege_of_Lilybaeum_(250%E2%80%93241_BC)" title="Siege of Lilybaeum (250–241 BC)">Siege of Lilybaeum</a></b> lasted from 250 to 241 BC, as the <a href="/wiki/Roman_Republic" title="Roman Republic">Roman</a> army laid siege to the <a href="/wiki/Ancient_Carthage" title="Ancient Carthage">Carthaginian</a>-held <a href="/wiki/Sicily" title="Sicily">Sicilian</a> city of Lilybaeum (modern <a href="/wiki/Marsala" title="Marsala">Marsala</a>; <i>reconstruction pictured</i><span style="padding-left:0.15em;">)</span> during the <a href="/wiki/First_Punic_War" title="First Punic War">First Punic War</a>. Lilybaeum was well-fortified and situated on the coast, where it could be supplied and reinforced by sea. In mid–250 BC the Romans <a href="/wiki/Siege" title="Siege">besieged</a> the city with more than 100,000 men. They made a concerted effort to take it by assault, but wer

In [6]:
# Finding first p tag and first a tag
print("First p tag: \n", soupObj.find('p').prettify())
print("First a tag: \n", soupObj.find('a').prettify())

First p tag: 
 <p>
 The
 <b>
  <a href="/wiki/Siege_of_Lilybaeum_(250%E2%80%93241_BC)" title="Siege of Lilybaeum (250–241 BC)">
   Siege of Lilybaeum
  </a>
 </b>
 lasted from 250 to 241 BC, as the
 <a href="/wiki/Roman_Republic" title="Roman Republic">
  Roman
 </a>
 army laid siege to the
 <a href="/wiki/Ancient_Carthage" title="Ancient Carthage">
  Carthaginian
 </a>
 -held
 <a href="/wiki/Sicily" title="Sicily">
  Sicilian
 </a>
 city of Lilybaeum (modern
 <a href="/wiki/Marsala" title="Marsala">
  Marsala
 </a>
 ;
 <i>
  reconstruction pictured
 </i>
 <span style="padding-left:0.15em;">
  )
 </span>
 during the
 <a href="/wiki/First_Punic_War" title="First Punic War">
  First Punic War
 </a>
 . Lilybaeum was well-fortified and situated on the coast, where it could be supplied and reinforced by sea. In mid–250 BC the Romans
 <a href="/wiki/Siege" title="Siege">
  besieged
 </a>
 the city with more than 100,000 men. They made a concerted effort to take it by assault, but were unsucc
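The difference between `find` and `find_all` is easy to miss: `find` returns a single tag, or `None` when nothing matches, while `find_all` always returns a list. The sketch below shows this on a made-up `sample_html` string (an assumption for illustration), together with `.string` for reading the title text without its tag.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page used only for illustration.
sample_html = """
<html><head><title>Sample page</title></head>
<body>
<p>One <a href="/a">A</a></p>
<p>Two <a href="/b">B</a> <a href="/c">C</a></p>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

title_text = soup.title.string                 # text only, without the <title> tag
first_link = soup.find("a")                    # a single Tag
first_two_links = soup.find_all("a", limit=2)  # a list with at most two Tags
missing = soup.find("table")                   # None, because there is no <table>

print(title_text)
print(first_link["href"])
print([a["href"] for a in first_two_links])
print(missing)
```

Because `find` can return `None`, a call such as `soup.find('table').prettify()` would raise `AttributeError` on this page, so it is worth checking the result first.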

In [7]:
# finding first li tag
print("First Li Tag: ", soupObj.find('li').prettify())

# reading the class attribute of the first li tag
print("First Li Tag with Class: ", soupObj.find('li')['class']) # this li tag has no class attribute, so indexing with ['class'] raises a KeyError

First Li Tag:  <li>
 <a href="/wiki/Portal:The_arts" title="Portal:The arts">
  The arts
 </a>
</li>



KeyError: 'class'
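The `KeyError` above can be avoided by reading attributes with `.get()`, which returns `None` when the attribute is missing. A minimal sketch on a made-up `sample_html` snippet (an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <li> has no class, the second one does.
sample_html = '<ul><li>plain</li><li class="nav active">styled</li></ul>'
soup = BeautifulSoup(sample_html, "html.parser")

first_li = soup.find("li")
print(first_li.get("class"))       # None instead of a KeyError
print(first_li.has_attr("class"))  # False

# class_=True matches only tags that have a class attribute at all.
classed_li = soup.find("li", class_=True)
print(classed_li["class"])         # class is multi-valued, so this is a list
```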

In [8]:
# To find all the ul tags with a given class
print("All the ul tags having class=hlist = ", soupObj.find_all('ul', class_="hlist"))

All the ul tags having class=hlist =  [<ul class="wikipedia-languages-langs hlist hlist-separated inline">
<li><a class="external text" href="https://ar.wikipedia.org/wiki/"><span class="autonym" lang="ar" title="Arabic (ar:)">العربية</span></a></li>
<li><a class="external text" href="https://de.wikipedia.org/wiki/"><span class="autonym" lang="de" title="German (de:)">Deutsch</span></a></li>
<li><a class="external text" href="https://es.wikipedia.org/wiki/"><span class="autonym" lang="es" title="Spanish (es:)">Español</span></a></li>
<li><a class="external text" href="https://fr.wikipedia.org/wiki/"><span class="autonym" lang="fr" title="French (fr:)">Français</span></a></li>
<li><a class="external text" href="https://it.wikipedia.org/wiki/"><span class="autonym" lang="it" title="Italian (it:)">Italiano</span></a></li>
<li><a class="external text" href="https://nl.wikipedia.org/wiki/"><span class="autonym" lang="nl" title="Dutch (nl:)">Nederlands</span></a></li>
<li><a class="external 
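Note that `class_="hlist"` matches a tag when `hlist` is any one of its classes, which is why a `<ul>` with several classes shows up in the result. The sketch below demonstrates this on a made-up `sample_html` string (an assumption for illustration) and also shows a CSS selector via `select` as one possible way to go straight to the links inside such a list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled loosely on the language list above.
sample_html = """
<ul class="langs hlist inline">
  <li><a href="https://de.wikipedia.org/wiki/">Deutsch</a></li>
  <li><a href="https://fr.wikipedia.org/wiki/">Français</a></li>
</ul>
<ul class="other"><li><a href="/x">X</a></li></ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Matches the first <ul> only: "hlist" is one of its three classes.
hlists = soup.find_all("ul", class_="hlist")

# CSS selector: every <a> inside an <li> inside a <ul> that has class hlist.
lang_links = [a["href"] for a in soup.select("ul.hlist li a")]

print(len(hlists))
print(lang_links)
```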

In [9]:
# Similarly, we can find out the particular element with the help of its 'ID'
print("H2 Tag with id=mp-lang :", soupObj.find_all(id="mp-lang"))

H2 Tag with id=mp-lang : [<h2 class="mp-h2" id="mp-lang"><span class="mw-headline" id="Wikipedia_languages">Wikipedia languages</span></h2>]


In [10]:
# Getting text within a tag
print("Text within tag = ", soupObj.find(id="mp-lang").get_text())

Text within tag =  Wikipedia languages


In [11]:
# Or we can just print the text from the whole soup object
print("Text within our Soup Object: ", soupObj.get_text())

Text within our Soup Object:  


Wikipedia, the free encyclopedia
































Main Page

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search



Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
6,255,397 articles in English


The arts
Biography
Geography
History
Mathematics
Science
Society
Technology
All portals





From today's featured article


Reconstruction of Lilybaeum

The Siege of Lilybaeum lasted from 250 to 241 BC, as the Roman army laid siege to the Carthaginian-held Sicilian city of Lilybaeum (modern Marsala; reconstruction pictured) during the First Punic War. Lilybaeum was well-fortified and situated on the coast, where it could be supplied and reinforced by sea. In mid–250 BC the Romans besieged the city with more than 100,000 men. They made a concerted effort to take it by assault, but were unsuccessful. The Romans then attacked the Carthaginian fleet, but their fleet was itself destroyed in the naval battles o
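The long runs of blank lines in the output above come from whitespace between tags in the page source. One way to get a more compact result is to pass `separator` and `strip=True` to `get_text`, or to iterate over `stripped_strings`. A minimal sketch on a made-up `sample_html` string (an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the kind of stray whitespace real pages contain.
sample_html = """
<div>
  <h1> Main Page </h1>


  <p>Welcome to <b>Wikipedia</b></p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# strip=True trims every text fragment and drops the empty ones;
# separator decides what is placed between the remaining fragments.
compact_text = soup.get_text(separator="\n", strip=True)

# stripped_strings yields the same fragments one at a time.
lines = list(soup.stripped_strings)

print(compact_text)
print(lines)
```

A side effect to be aware of: text inside inline tags such as `<b>` becomes its own fragment, so a sentence may be split across several lines.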

### Time to do something good, we're done with the basics:)
## Task: Scrape all the links present on the page and store them in a list/set

In [12]:
# an empty set to hold all our links
link_container = set()

# storing all the anchor tags in a variable named '_anchors'
_anchors = soupObj.find_all('a')

# looping through all the anchor tags and storing their href values
for link in _anchors:
    if link.get('href') != '#':    # because we don't want a bare '#' in our set
        link_container.add(link.get('href'))
        print(link.get('href'))

None
#mw-head
#searchInput
/wiki/Wikipedia
/wiki/Free_content
/wiki/Encyclopedia
/wiki/Help:Introduction_to_Wikipedia
/wiki/Special:Statistics
/wiki/English_language
/wiki/Portal:The_arts
/wiki/Portal:Biography
/wiki/Portal:Geography
/wiki/Portal:History
/wiki/Portal:Mathematics
/wiki/Portal:Science
/wiki/Portal:Society
/wiki/Portal:Technology
/wiki/Wikipedia:Contents/Portals
/wiki/File:Reconstruction_of_Lilybaeum_(cropped).jpg
/wiki/Siege_of_Lilybaeum_(250%E2%80%93241_BC)
/wiki/Roman_Republic
/wiki/Ancient_Carthage
/wiki/Sicily
/wiki/Marsala
/wiki/First_Punic_War
/wiki/Siege
/wiki/Battle_of_Drepana
/wiki/Battle_of_Phintias
/wiki/Battle_of_the_Aegates
/wiki/Suing_for_peace
/wiki/Treaty_of_Lutatius
/wiki/Siege_of_Lilybaeum_(250%E2%80%93241_BC)
/wiki/2015_Formula_One_World_Championship
/wiki/Evelyn_Mase
/wiki/SS_Mauna_Loa
/wiki/Wikipedia:Today%27s_featured_article/February_2021
https://lists.wikimedia.org/mailman/listinfo/daily-article-l
/wiki/Wikipedia:Featured_articles
/wiki/File:GW-pa
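The output above still contains `None` (an anchor without an `href`), in-page fragments such as `#mw-head`, and relative paths such as `/wiki/Wikipedia`. One possible clean-up, sketched below on a made-up `sample_html` string, is to skip anchors without an `href`, drop fragment-only links, and resolve relative paths against the page URL with `urljoin` from the standard library. The variable names here are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Main_Page"

# Hypothetical snippet reproducing the kinds of links seen in the output above.
sample_html = """
<a id="top"></a>
<a href="#mw-head">Jump to navigation</a>
<a href="/wiki/Wikipedia">Wikipedia</a>
<a href="/wiki/Wikipedia">Wikipedia again</a>
<a href="https://lists.wikimedia.org/mailman/listinfo/daily-article-l">Mailing list</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

absolute_links = set()
for anchor in soup.find_all("a", href=True):  # href=True skips anchors without href
    href = anchor["href"]
    if href.startswith("#"):                  # in-page fragment, not a separate page
        continue
    # Relative paths are resolved against base_url; absolute URLs pass through unchanged.
    absolute_links.add(urljoin(base_url, href))

for url in sorted(absolute_links):
    print(url)
```

Using a set means the duplicated `/wiki/Wikipedia` link is stored only once.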