In [2]:
# Import requests library
import requests

<span style='color:steelblue'> **1: Use requests to get a response from the Wikipedia home page** </span>

In [3]:
# First assign the URL of Wikipedia home page to a string 
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

In [4]:
# Use the 'get' method from requests library to get a response
response = requests.get(wiki_home)

In [5]:
# What is this 'response' object anyway?
type(response)

requests.models.Response

<span style='color:steelblue'> **2: Write a small function to check the status of the web request**

***Small helper/utility functions like this one are very useful in complex projects.***

***Start building a habit of writing small functions to accomplish small modular tasks, instead of writing long scripts, which are hard to debug and track.*** </span>

In [6]:
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [7]:
status_check(response)

Success!


1
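A minimal sketch of the same check that returns a boolean instead of 1/-1, which reads more naturally inside an `if` statement. The name `status_ok` and the stand-in objects are assumptions made for illustration (they are not part of the notebook); the stand-ins only mimic the `status_code` attribute so the sketch runs without a network call.

```python
from types import SimpleNamespace

def status_ok(r):
    """Return True when the response carries HTTP status 200, else False."""
    return r.status_code == 200

# Hypothetical stand-ins for real Response objects (only status_code is mimicked)
fake_ok = SimpleNamespace(status_code=200)
fake_missing = SimpleNamespace(status_code=404)

for r in (fake_ok, fake_missing):
    if status_ok(r):
        print("Success!")
    else:
        print("Failed!")
```

With a real `response` from `requests.get`, the call would simply be `status_ok(response)`.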

<span style='color:steelblue'> **3: Write a small function to check the encoding of the web page** </span>

In [8]:
def encoding_check(r):
    return (r.encoding)

In [9]:
encoding_check(response)

'UTF-8'

<span style='color:steelblue'> **4: Write a small function to decode the contents of the response** </span>

In [10]:
def decode_content(r,encoding):
    return (r.content.decode(encoding))

In [11]:
contents = decode_content(response,encoding_check(response))
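The decoding step above boils down to calling `bytes.decode` with the encoding reported for the page. A rough sketch of that idea on plain bytes follows, with one assumption added: if no encoding is reported (the value is `None`), it falls back to UTF-8 and replaces undecodable bytes instead of raising. The name `decode_bytes` and the sample string are made up for illustration.

```python
def decode_bytes(raw, encoding=None, fallback="utf-8"):
    """Decode raw bytes; use the fallback codec when no encoding is given."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw.decode(encoding or fallback, errors="replace")

# Sample bytes standing in for response.content (an assumption for this sketch)
sample = "Wikipédia".encode("utf-8")

print(decode_bytes(sample, "UTF-8"))   # encoding known
print(decode_bytes(sample, None))      # encoding missing, fallback used
```

With the notebook's objects, the equivalent call would be `decode_bytes(response.content, response.encoding)`.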

<span style='color:steelblue'> **What is the type of the contents?** </span>

In [12]:
type(contents)

str

***We finally have a string object. Reading text from a popular webpage like Wikipedia turns out to be quite easy.***

<span style='color:steelblue'> **5: Check the length of the text you got back and print some selected portions** </span>

In [13]:
len(contents)

84315

In [14]:
print(contents[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"9d454b50-4184-46be-b3af-c9733eca4a8d","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1004593520,"wgRevisionId":1004593520,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":!1,"wgRelevantPage

In [15]:
print(contents[15000:16000])

 in 2017. (<b><a href="/wiki/South_Park:_The_Stick_of_Truth" title="South Park: The Stick of Truth">Full&#160;article...</a></b>)
</p>
<div class="tfa-recent" style="text-align: right;">
Recently featured: <div class="hlist hlist-separated inline">
<ul><li><a href="/wiki/York_County,_Maine,_Tercentenary_half_dollar" title="York County, Maine, Tercentenary half dollar">York County, Maine, Tercentenary half dollar</a></li>
<li><a href="/wiki/Cai_Lun" title="Cai Lun">Cai Lun</a></li>
<li><a href="/wiki/Moorgate_tube_crash" title="Moorgate tube crash">Moorgate tube crash</a></li></ul>
</div></div>
<div class="tfa-footer hlist hlist-separated noprint" style="text-align: right;">
<ul><li><b><a href="/wiki/Wikipedia:Today%27s_featured_article/August_2021" title="Wikipedia:Today&#39;s featured article/August 2021">Archive</a></b></li>
<li><b><a href="https://lists.wikimedia.org/mailman/listinfo/daily-article-l" class="extiw" title="mail:daily-article-l">By email</a></b></li>
<li><b><a href="/w

<span style='color:steelblue'> **6: Use the BeautifulSoup package to parse the raw HTML text more meaningfully and search for a particular text** </span>

In [16]:
from bs4 import BeautifulSoup

In [17]:
soup = BeautifulSoup(contents, 'html.parser')

**What is this new soup object?**

In [18]:
type(soup)

bs4.BeautifulSoup

<span style='color:steelblue'> **7: Can we somehow read intelligible text from this soup object?** </span>

In [19]:
txt_dump=soup.text

In [20]:
type(txt_dump)

str

In [21]:
len(txt_dump)

9752

In [22]:
print(txt_dump[7000:8500])

epository



MediaWikiWiki software development



Meta-WikiWikimedia project coordination



WikibooksFree textbooks and manuals



WikidataFree knowledge base



WikinewsFree-content news



WikiquoteCollection of quotations



WikisourceFree-content library



WikispeciesDirectory of species



WikiversityFree learning tools



WikivoyageFree travel guide



WiktionaryDictionary and thesaurus



Wikipedia languages


This Wikipedia is written in English. Many other Wikipedias are available; some of the largest are listed below.





1,000,000+ articles



العربية
Deutsch
Español
Français
Italiano
Nederlands
日本語
Polski
Português
Русский
Svenska
Українська
Tiếng Việt
中文





250,000+ articles



Bahasa Indonesia
Bahasa Melayu
Bân-lâm-gú
Български
Català
Čeština
Dansk
Esperanto
Euskara
فارسی‎
עברית
한국어
Magyar
Norsk Bokmål
Română
Srpski
Srpskohrvatski
Suomi
Türkçe





50,000+ articles



Asturianu
Bosanski
Eesti
Ελληνικά
Simple English
Galego
Hrvatski
Latviešu
Lietuvių
മലയാളം
Македонск
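In the dump above, text from neighbouring tags runs together (for example "WikibooksFree textbooks and manuals") because `.text` concatenates the strings with nothing in between. A small sketch of one way around this, using `get_text` with a separator; the HTML snippet is a made-up stand-in, not the real Main Page markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating two adjacent tags with no whitespace between them
html = "<ul><li><b>Wikibooks</b><span>Free textbooks</span></li></ul>"
soup_demo = BeautifulSoup(html, "html.parser")

plain = soup_demo.text                                  # strings joined directly
spaced = soup_demo.get_text(separator=" ", strip=True)  # strings joined by a space

print(plain)
print(spaced)
```

On the notebook's `soup` object the same call would be `soup.get_text(separator=" ", strip=True)`; whether a space or a newline is the better separator depends on what you do with the text next.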

<span style='color:steelblue'> **8: Extract the text from the section 'From today's featured article'** </span>

In [23]:
# First extract the starting and end indices of the text we are interested in
idx1=txt_dump.find("From today's featured article")
idx2=txt_dump.find("Recently featured")

In [24]:
idx1, idx2

(352, 1407)

In [25]:
print(txt_dump[idx1+len("From today's featured article"):idx2])





Trey Parker and Matt Stone, creators of South Park

South Park: The Stick of Truth is a 2014 role-playing video game developed by Obsidian Entertainment in collaboration with South Park Digital Studios, and published by Ubisoft for Microsoft Windows, PlayStation 3, and Xbox 360. Based on the American adult animated television series South Park, the game features whimsical fantasy role-playing. As the New Kid, the player can freely explore the town of South Park with a supporting party of characters, fighting aliens, Nazi zombies, and gnomes. The visuals replicate the aesthetic of the television series. South Park creators Trey Parker and Matt Stone (both pictured) wrote the game's script, consulted on the design and voiced many of the characters, as in the television program. Reviewers praised the comedic script and authentic visual style, but some faulted the game over technical issues and a lack of challenging combat. A sequel, South Park: The Fractured but Whole, was released in 
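One caveat with this approach: `str.find` returns -1 when the marker is missing, and slicing with -1 silently gives the wrong text instead of an error. A hedged sketch of a helper that checks for this; the name `text_between` and the sample string are assumptions for illustration only.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)           # skip past the start marker itself
    end = text.find(end_marker, start)   # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()

# Made-up sample standing in for txt_dump
sample = "Welcome. From today's featured article Some summary text. Recently featured: A, B"
print(text_between(sample, "From today's featured article", "Recently featured"))
```

With the notebook's data the call would be `text_between(txt_dump, "From today's featured article", "Recently featured")`.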

<span style='color:steelblue'> **9: Try to extract the important historical events that happened on today's date...** </span>

In [26]:
idx3=txt_dump.find("On this day")

In [27]:
print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])




August 6



Johnson signing the Voting Rights Act into law

1861 – Under the threat of military bombardment, Dosunmu, Oba of Lagos, ceded the island of Lagos to British forces.
1890 – At Auburn Prison in the U.S. state of New York, William Kemmler became the first person to be executed by electric chair.
1965 – U.S. president Lyndon B. Johnson signed the Voting Rights Act into law (pictured), outlawing literacy tests and other discriminatory voting practices that had been responsible for the widespread disfranchisement of African Americans.
1991 – British computer programmer Tim Berners-Lee posted a public invitation to collaborate on a system of interlinked, hypertext documents accessible via the Internet, known as the World Wide Web.
Saint Dominic  (d. 1221)Bix Beiderbecke  (d. 1931)Shapour Bakhtiar  (d. 1991)

More anniversaries: 
August 5
August 6
August 7


Archive
By email
List of days of the year




From today's featured list

Phil Younghusband

Former professional associatio

<span style='color:steelblue'> **10: Use a more advanced BS4 technique to extract relevant text without guessing or hard-coding where to look** </span>

In [28]:
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            print(i.text)

1861 – Under the threat of military bombardment, Dosunmu, Oba of Lagos, ceded the island of Lagos to British forces.
1890 – At Auburn Prison in the U.S. state of New York, William Kemmler became the first person to be executed by electric chair.
1965 – U.S. president Lyndon B. Johnson signed the Voting Rights Act into law (pictured), outlawing literacy tests and other discriminatory voting practices that had been responsible for the widespread disfranchisement of African Americans.
1991 – British computer programmer Tim Berners-Lee posted a public invitation to collaborate on a system of interlinked, hypertext documents accessible via the Internet, known as the World Wide Web.
Saint Dominic  (d. 1221)Bix Beiderbecke  (d. 1931)Shapour Bakhtiar  (d. 1991)
August 5
August 6
August 7
Archive
By email
List of days of the year


In [29]:
text_list=[]
for d in soup.find_all('div'):
    if d.get('id') == 'mp-otd':
        for i in d.find_all('ul'):
            text_list.append(i.text)

In [30]:
len(text_list)

4

In [31]:
for i in text_list:
    print(i)
    print('-'*100)

1861 – Under the threat of military bombardment, Dosunmu, Oba of Lagos, ceded the island of Lagos to British forces.
1890 – At Auburn Prison in the U.S. state of New York, William Kemmler became the first person to be executed by electric chair.
1965 – U.S. president Lyndon B. Johnson signed the Voting Rights Act into law (pictured), outlawing literacy tests and other discriminatory voting practices that had been responsible for the widespread disfranchisement of African Americans.
1991 – British computer programmer Tim Berners-Lee posted a public invitation to collaborate on a system of interlinked, hypertext documents accessible via the Internet, known as the World Wide Web.
----------------------------------------------------------------------------------------------------
Saint Dominic  (d. 1221)Bix Beiderbecke  (d. 1931)Shapour Bakhtiar  (d. 1991)
----------------------------------------------------------------------------------------------------
August 5
August 6
August 7
-------

In [36]:
print(text_list[0])

1861 – Under the threat of military bombardment, Dosunmu, Oba of Lagos, ceded the island of Lagos to British forces.
1890 – At Auburn Prison in the U.S. state of New York, William Kemmler became the first person to be executed by electric chair.
1965 – U.S. president Lyndon B. Johnson signed the Voting Rights Act into law (pictured), outlawing literacy tests and other discriminatory voting practices that had been responsible for the widespread disfranchisement of African Americans.
1991 – British computer programmer Tim Berners-Lee posted a public invitation to collaborate on a system of interlinked, hypertext documents accessible via the Internet, known as the World Wide Web.
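The loop over every `div` can be shortened by asking BeautifulSoup for the element with the id directly. The sketch below does that and returns the individual list items rather than one block of text. The function name `on_this_day_items` and the sample HTML are assumptions for illustration; only the id `mp-otd` comes from the page used above.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the Main Page markup
sample_html = """
<div id="mp-tfa"><ul><li>Recently featured item</li></ul></div>
<div id="mp-otd">
  <ul><li>1861 – Event A</li><li>1890 – Event B</li></ul>
  <ul><li>August 5</li><li>August 6</li></ul>
</div>
"""

def on_this_day_items(html):
    """Return the entries of the first <ul> inside the 'mp-otd' element."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find(id="mp-otd")      # None when the id is absent
    if section is None:
        return []
    first_list = section.find("ul")       # None when the section has no <ul>
    if first_list is None:
        return []
    return [li.get_text(strip=True) for li in first_list.find_all("li")]

events = on_this_day_items(sample_html)
for event in events:
    print(event)
```

Returning an empty list when the section is missing avoids an `IndexError` later on, at the cost of the caller having to check for an empty result.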


<span style='color:steelblue'> **Functionalize this process, i.e., write a compact function to extract the "On this day" text from the Wikipedia Home Page** </span>

In [37]:

def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extracts the text corresponding to the "On this day" section on the Wikipedia Home Page.
    Accepts the Wikipedia Home Page URL as a string; a default URL is provided.
    Returns -1 if the page cannot be reached or the section is not found.
    """
    import requests
    from bs4 import BeautifulSoup
    
    wiki_home = str(url)
    response = requests.get(wiki_home)
    
    # Anything other than HTTP 200 is treated as a failure
    if response.status_code != 200:
        print("Sorry could not reach the web page!")
        return -1
    
    # Reuses the decode_content and encoding_check helpers defined earlier
    contents = decode_content(response, encoding_check(response))
    soup = BeautifulSoup(contents, 'html.parser')
    text_list = []
    
    # Collect the text of every <ul> inside the 'mp-otd' div
    for d in soup.find_all('div'):
        if d.get('id') == 'mp-otd':
            for i in d.find_all('ul'):
                text_list.append(i.text)
    
    # Guard against a page that has no "On this day" section
    if not text_list:
        print("Sorry could not find the 'On this day' section!")
        return -1
    
    return text_list[0]

In [38]:
print(wiki_on_this_day())

1861 – Under the threat of military bombardment, Dosunmu, Oba of Lagos, ceded the island of Lagos to British forces.
1890 – At Auburn Prison in the U.S. state of New York, William Kemmler became the first person to be executed by electric chair.
1965 – U.S. president Lyndon B. Johnson signed the Voting Rights Act into law (pictured), outlawing literacy tests and other discriminatory voting practices that had been responsible for the widespread disfranchisement of African Americans.
1991 – British computer programmer Tim Berners-Lee posted a public invitation to collaborate on a system of interlinked, hypertext documents accessible via the Internet, known as the World Wide Web.


***A wrong URL produces an error message as expected***

In [39]:
print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry could not reach the web page!
-1
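The status check only covers the case where a response actually comes back. If the network is down or the request times out, `requests.get` raises an exception instead. A sketch of one way to handle both cases in one place follows; the name `fetch_html`, the 10-second timeout, and the injectable `get` parameter (there so the function can be tried without a network) are all assumptions for illustration.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the page text, or None if the request fails for any reason."""
    try:
        r = get(url, timeout=timeout)
        r.raise_for_status()   # raises HTTPError on 4xx/5xx responses
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, and bad HTTP statuses alike
        print(f"Sorry could not reach the web page! ({exc})")
        return None
    return r.text

# Offline demonstration with a fake getter that always fails (hypothetical)
def always_offline(url, timeout):
    raise requests.exceptions.ConnectionError("no network")

result = fetch_html("https://en.wikipedia.org/wiki/Main_Page", get=always_offline)
print(result)
```

Returning `None` rather than -1 keeps the return value from being mistaken for data, though this is a design preference rather than a requirement.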
