# Web scraping from Wikipedia
Requests: An HTTP library for fetching web pages.
urllib3: A lower-level HTTP client for retrieving data from URLs; Requests is built on top of it.
Selenium: An open-source browser automation suite, often used for testing web applications across different browsers and platforms.

# Requests library

In [5]:
# import required modules 
import requests 
  
# get URL 
page = requests.get("https://en.wikipedia.org/wiki/Main_Page") 
  
# display status code 
print(page.status_code) 
  
# display scraped data (raw bytes)
print(page.content) 

200
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YDSLsjCwx5LN2EV7sY3VpwAAAMU","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1004593520,"wgRevisionId":1004593520,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":!1,"wgRelevantPa
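
The cell above prints the raw bytes without checking whether the request succeeded. A minimal sketch of a more defensive fetch might look like the following. The helper name fetch_html is hypothetical (it is not part of Requests), and the User-Agent string and the 10-second timeout are illustrative assumptions.

```python
import requests


# Hypothetical helper (not part of Requests): fetch a page and return its HTML as text.
def fetch_html(url, timeout=10):
    # The User-Agent string is an illustrative placeholder.
    headers = {"User-Agent": "wiki-scraping-tutorial/0.1 (example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # .text is the decoded string; .content is the raw bytes.
    return response.text
```

Calling fetch_html("https://en.wikipedia.org/wiki/Main_Page") should return the page source as a str, which can be passed to a parser in place of page.content.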

# Beautiful Soup for page parsing

In [6]:
# import required modules 
from bs4 import BeautifulSoup 
import requests 
  
# get URL 
page = requests.get("https://en.wikipedia.org/wiki/Main_Page") 
  
# scrape webpage 
soup = BeautifulSoup(page.content, 'html.parser') 
  
# display scraped data
print(soup.prettify()) 

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia, the free encyclopedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YDSLsjCwx5LN2EV7sY3VpwAAAMU","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1004593520,"wgRevisionId":1004593520,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":!1,"wgRel
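
Once the page is parsed, individual elements can be read directly instead of printing the whole tree. The sketch below uses a small stand-in document modelled on the head of the output above, so it runs without a network connection; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in document modelled on the head of the prettified output.
html = """<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>Wikipedia, the free encyclopedia</title>
 </head>
 <body></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag.
title_text = soup.title.string
# Attribute lookup on a tag works like a dictionary.
page_lang = soup.html["lang"]
# .get() returns None instead of raising if the attribute is missing.
charset = soup.find("meta").get("charset")

print(title_text)
print(page_lang, charset)
```

The same attribute-style access (soup.title, soup.html) works on the soup object built from the live page.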

# Digging deeper into Beautiful Soup

In [7]:
from bs4 import BeautifulSoup 
import requests 
  
# get URL 
page = requests.get("https://en.wikipedia.org/wiki/Main_Page") 
  
# scrape webpage 
soup = BeautifulSoup(page.content, 'html.parser') 
  
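# list the top-level nodes of the document
# (not displayed, because the value is neither printed nor last in the cell)
top_level = list(soup.children)

# find all occurrences of <p> in the HTML
# the result includes the HTML tags
print(soup.find_all('p'))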
  
print('\n\n') 
  
# return only text 
# does not include HTML tags 
print(soup.find_all('p')[0].get_text()) 

[<p><b><a href="/wiki/Margaret_(singer)" title="Margaret (singer)">Margaret</a></b> (born 1991) is a Polish singer and songwriter. Before her mainstream debut, she performed with <a href="/wiki/Underground_music" title="Underground music">underground</a> bands, recorded soundtracks for television commercials and produced a fashion blog. Through her blogging, she was discovered by music manager Sławomir Berdowski and signed by the record label <a href="/wiki/Extensive_Music" title="Extensive Music">Extensive Music</a>. Margaret gained international recognition with her singles "<a href="/wiki/Thank_You_Very_Much_(Margaret_song)" title="Thank You Very Much (Margaret song)">Thank You Very Much</a>" (2013) and "<a href="/wiki/Cool_Me_Down" title="Cool Me Down">Cool Me Down</a>" (2016), the first of which was included on her first <a href="/wiki/Extended_play" title="Extended play">extended play</a> (EP) <i><a href="/wiki/All_I_Need_(Margaret_EP)" title="All I Need (Margaret EP)">All I Need
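
The cell above reads the text of the first paragraph only. A sketch of collecting the text and the links of every paragraph follows. The first paragraph of the stand-in HTML is shortened from the output above, the second is a made-up placeholder, and the base URL passed to urljoin is assumed to be the page that was fetched.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Stand-in HTML: first paragraph shortened from the output above,
# second paragraph is a made-up placeholder.
html = (
    '<p><b><a href="/wiki/Margaret_(singer)" title="Margaret (singer)">Margaret</a></b>'
    " (born 1991) is a Polish singer and songwriter.</p>"
    "<p>Placeholder paragraph without links.</p>"
)
base_url = "https://en.wikipedia.org/wiki/Main_Page"

soup = BeautifulSoup(html, "html.parser")

# Text of every paragraph, with tags removed and whitespace normalized.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

# Absolute URL of every link that appears inside a paragraph.
links = [
    urljoin(base_url, a["href"])
    for p in soup.find_all("p")
    for a in p.find_all("a", href=True)
]

print(paragraphs)
print(links)
```

Wikipedia uses site-relative hrefs such as /wiki/..., so urljoin is used here to turn them into full URLs.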

# Exploring page structure
On the Main Page, the element with the id mp-left is a parent container, and the headings nested inside it carry the class mp-h2. The cell below finds that container, collects its mp-h2 descendants, and prints the first one with the prettify() function.

In [8]:
# import required modules 
from bs4 import BeautifulSoup 
import requests 
  
# get URL 
page = requests.get("https://en.wikipedia.org/wiki/Main_Page") 
  
# scrape webpage 
soup = BeautifulSoup(page.content, 'html.parser') 
  
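# find the parent element by its id
# (named container so the built-in name object is not shadowed)
container = soup.find(id="mp-left")

# find nested tags by class
items = container.find_all(class_="mp-h2")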
result = items[0] 
  
# display tags 
print(result.prettify()) 

<h2 class="mp-h2" id="mp-tfa-h2">
 <span id="From_today.27s_featured_article">
 </span>
 <span class="mw-headline" id="From_today's_featured_article">
  From today's featured article
 </span>
</h2>
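
The lookup above fails with an AttributeError if the page layout changes and find() returns None. A sketch of a guarded version that returns the text of every matching heading is given below. The helper name heading_texts is hypothetical, the h2 element is copied from the output above, and wrapping it in a div with id mp-left is an assumption based on the description of the page.

```python
from bs4 import BeautifulSoup

# Stand-in HTML: the <h2> is copied from the output above; the surrounding
# <div id="mp-left"> is an assumption about the page structure.
html = """
<div id="mp-left">
 <h2 class="mp-h2" id="mp-tfa-h2">
  <span id="From_today.27s_featured_article"></span>
  <span class="mw-headline" id="From_today's_featured_article">From today's featured article</span>
 </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


# Hypothetical helper: text of every element with css_class inside the element with parent_id.
def heading_texts(doc, parent_id, css_class):
    container = doc.find(id=parent_id)
    if container is None:
        # find() returns None when nothing matches, so return an empty result.
        return []
    return [tag.get_text(strip=True) for tag in container.find_all(class_=css_class)]


print(heading_texts(soup, "mp-left", "mp-h2"))
print(heading_texts(soup, "no-such-id", "mp-h2"))
```

Returning an empty list for a missing container keeps a scraper running when the live page no longer matches the expected structure.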

