In this notebook, we show how to scrape data from web pages using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python library.

In [1]:
#making the necessary imports
from pprint import pprint
from bs4 import BeautifulSoup
from urllib.request import urlopen 

In [2]:
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python" #specify the url
html = urlopen(myurl).read() # fetch the page; the response body is the raw HTML
soupified = BeautifulSoup(html, 'html.parser') # parse the HTML in the 'html' variable into a BeautifulSoup object
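
Some sites reject requests that do not carry a browser-like `User-Agent` header, and `urlopen` has no timeout unless one is passed. The following is a minimal sketch of a more defensive fetch, not part of the original notebook: the helper names `build_request` and `fetch_html` and the `USER_AGENT` string are hypothetical, chosen only for illustration.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen

# Illustrative value (an assumption); use a string that identifies your own script.
USER_AGENT = "notebook-scraper-example/0.1"

def build_request(url):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None on a URL or HTTP error."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except URLError as err:  # HTTPError is a subclass of URLError
        print(f"Could not fetch {url}: {err}")
        return None
```

With this in place, `html = fetch_html(myurl)` could stand in for `urlopen(myurl).read()`, provided the result is checked for `None` before it is passed to `BeautifulSoup`.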

In [3]:
pprint(soupified.prettify()) #to get an idea of the html structure of the webpage

Streaming output truncated to the last 5000 lines.
 '                 <span class="v-visible-sr">\n'
 '                  93 silver badges\n'
 '                 </span>\n'
 '                 <span aria-hidden="true" title="123 bronze badges">\n'
 '                  <span class="badge3">\n'
 '                  </span>\n'
 '                  <span class="badgecount">\n'
 '                   123\n'
 '                  </span>\n'
 '                 </span>\n'
 '                 <span class="v-visible-sr">\n'
 '                  123 bronze badges\n'
 '                 </span>\n'
 '                </div>\n'
 '               </div>\n'
 '              </div>\n'
 '             </div>\n'
 '             <div class="post-signature grid--cell fl0">\n'
 '              <div class="user-info user-hover">\n'
 '               <div class="user-action-time">\n'
 '                answered\n'
 '                <span class="relativetime" title="2018-03-01 11:34:55Z">\n'
 "                 Mar 1 

In [4]:
soupified.title #to get the title of the web page 

<title>datetime - How to get the current time in Python - Stack Overflow</title>
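
`soupified.title` returns the whole tag, markup included. When only the text is needed, the tag's `.string` attribute gives it directly. A small sketch on a stand-in page (the `sample_html` string is an assumption, used so the cell runs without network access):

```python
from bs4 import BeautifulSoup

# A tiny stand-in page (an assumption) so the example runs offline.
sample_html = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title          # a Tag object, printed as <title>...</title>
title_text = sample_soup.title.string  # only the text inside the tag
print(title_tag)
print(title_text)
```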

In [5]:
question = soupified.find("div", {"class": "question"}) #find the necessary tag and the class it belongs to
questiontext = question.find("div", {"class": "s-prose js-post-body"})
print("Question: \n", questiontext.get_text().strip())

answer = soupified.find("div", {"class": "answer"}) #find the first tag with the "answer" class
answertext = answer.find("div", {"class": "s-prose js-post-body"})
print("Best answer: \n", answertext.get_text().strip())

Question: 
 What is the module/method used to get the current time?
Best answer: 
 Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.
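
The lookups above depend on the page's current markup. `find` returns `None` when nothing matches, so a chained `.find(...)` raises `AttributeError` if the site changes its layout. In addition, a class value containing a space, such as `"s-prose js-post-body"`, is matched against the exact attribute string, so it can miss a tag whose classes appear in a different order. Below is a hedged sketch of a more tolerant version on stand-in markup. The helper `extract_post_text` and the `sample_html` string are hypothetical, and the sample deliberately lists the two classes in reverse order.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption) that mimics the class names used above.
sample_html = """
<div class="question">
  <div class="js-post-body s-prose">What is the module/method used to get the current time?</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

def extract_post_text(soup, container_class):
    """Return the post text inside the first div with container_class, or None."""
    container = soup.find("div", class_=container_class)
    if container is None:
        return None
    # A CSS selector matches both classes regardless of their order in the attribute.
    body = container.select_one("div.s-prose.js-post-body")
    return body.get_text().strip() if body is not None else None

print(extract_post_text(sample_soup, "question"))
print(extract_post_text(sample_soup, "answer"))  # no such div in the sample
```

Returning `None` instead of raising lets the caller decide how to report a missing question or answer.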


Beautiful Soup is one of many libraries that allow us to scrape web pages. Depending on your needs, you can choose among the available options, such as Beautiful Soup, Scrapy, and Selenium.