# Session 06

Today's session was devoted to getting some more practice defining our own functions and learning the basics of obtaining data from the Web via web scraping. In particular, we covered:

- (re)organizing our code using functions
- having a function return multiple values
- retrieving a web page from a server using requests
- extracting information from HTML using BeautifulSoup

In [None]:
# exercise: let us copy the simple contacts app we created last time 
# and let us reorganize the code using functions

command = ''

# create dictionary here!

contacts = {}

# below we define 4 different functions. to avoid confusion with the global variable
# 'contacts', we decided we would name the argument these functions receive 'contacts_dict'
# (for "contacts_dictionary").
# we also decided that the functions that modify that dictionary will return the modified
# version of the dictionary. the code that calls these functions will then store the new
# version of the dictionary that it passed to these functions in its original location.

# first function: add_contact

def add_contact(contacts_dict):
    name = input("Name: ")
    number = input("Number: ")
    contacts_dict[name] = number
    return contacts_dict # return modified version of the argument we received

# second function: update_contact

def update_contact(contacts_dict):
    name = input("Name of contact to update: ")
    if name in contacts_dict:
        newnumber = input("New number for %s: " % name)
        contacts_dict[name] = newnumber
    else:
        print("Sorry, dude, wake up, I don't have a contact by that name!")
    return contacts_dict # return modified version of the argument we received

        
# third function: delete_contact

def delete_contact(contacts_dict):
    name = input("Name of contact to delete: ")
    if name in contacts_dict.keys():
        del contacts_dict[name]
    else:
        print("Sorry, mate, I couldn't find a contact with that name (%s)" % name)
    return contacts_dict # return modified version of the argument we received


# fourth: search_for_contact

def search_for_contact(contacts_dict):
    name = input("Name to look up: ")
    if name in contacts_dict.keys():
        print("%s's number is %s" % (name, contacts_dict[name]) )
    else:
        print("You wish!!!! You don't have their number, you are dreaming! ;) ")
    return # unlike the functions above, this function has nothing to return!


# now we simply replace the code blocks that we moved into our functions above with
# calls to each of the functions. notice how much clearer and better organized our
# program becomes as a result!

while command != 'q': # 'q' is for quitting the program
    
    if command == 'a': # add contact
        
        # pass dictionary to the function, the function returns a modified version of the 
        # dictionary and we save that modified version back in contacts
        contacts = add_contact(contacts)
        
    elif command == 'u': # update contact
        
        # pass dictionary to the function, the function returns a modified version of the 
        # dictionary and we save that modified version back in contacts
        contacts = update_contact(contacts)
        
    elif command == 'd': # delete contact

        # pass dictionary to the function, the function returns a modified version of the 
        # dictionary and we save that modified version back in contacts
        contacts = delete_contact(contacts)
        
    elif command == 's': # search for contact
        
        # this function doesn't modify contacts!
        search_for_contact(contacts)
        
    else: # any other key pressed
        print("Commands accepted: A, U, D, S or Q.")
        
    command = input("What do you want to do? [A]dd, [U]pdate, [D]elete, [S]earch or [Q]uit: ")
    
    command = command.lower()


The approach we adopted above is the generally correct one: a function receives the information it needs through arguments and returns any new information.

Unfortunately, Python treats function arguments in a way that can look inconsistent at first. Arguments that are strings or numbers cannot be modified by the functions that receive them, because those types are immutable. However, collections and other more complicated types (including dictionaries, as in our code above) *can* be modified in place by the function that receives them as arguments. This is rather confusing, so I recommend always using the approach we used above. (But please be aware that, when the arguments passed to a function are (e.g.) a list or a dictionary, any changes made inside the function to those arguments will outlast the execution of the function.)
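
To make the difference concrete, here is a minimal sketch. The function and variable names are invented for this illustration and are not part of the contacts app. Rebinding a string inside a function leaves the caller's variable untouched, while changing a dictionary in place is visible to the caller even though nothing is returned.

```python
def try_to_change_string(text):
    # this only rebinds the local name 'text'; the caller's string is untouched
    text = text + "!"

def add_entry(some_dict):
    # this changes the dictionary object itself, so the caller sees the change
    some_dict["Ana"] = "555-0100"

greeting = "hello"
try_to_change_string(greeting)
print(greeting)    # still 'hello'

phone_book = {}
add_entry(phone_book)
print(phone_book)  # now contains the entry for 'Ana'
```

Returning the dictionary, as we did in the contacts app, is therefore not strictly required, but it makes it obvious to the reader which functions change it.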

In [1]:
# what if a function needs to return multiple values? easy!

# this function takes 2 numbers as arguments and returns 4 values:
# - sum of those numbers
# - subtraction of those numbers
# - multiplication
# - division

def silly_arithmetic(x, y):
    sum_xy = x + y
    sub_xy = x - y
    mult_xy = x * y
    div_xy = x / y
    
    return sum_xy, sub_xy, mult_xy, div_xy



In [5]:
# how does the code that calls the function use/receive those returned values? 
# there are two approaches.

# First approach:
n1 = float( input("Number 1: ") )
n2 = float( input("Number 2: ") )

results = silly_arithmetic(n1, n2)
print(results)
print("The sum of %f and %f is %f" % (n1, n2, results[0] ) )

Number 1: 40
Number 2: 30
(70.0, 10.0, 1200.0, 1.3333333333333333)
The sum of 40.000000 and 30.000000 is 70.000000


In [6]:
# second approach:

n1 = float( input("Number 1: ") )
n2 = float( input("Number 2: ") )

# wow, we can assign to 4 variables in a single statement!
# for this to work, the number of values returned by the function on the RHS must
# exactly match the number of variables on the LHS of the = operator
sum_n1n2, sub_n1n2, mul_n1n2, div_n1n2 = silly_arithmetic(n1, n2)
print("The product of %f and %f is %f" % (n1, n2, mul_n1n2 ) )

Number 1: 40
Number 2: 20
The product of 40.000000 and 20.000000 is 800.000000
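
As a small sketch of what happens when the counts do not match: Python raises a `ValueError`. The function is repeated here in shortened form so that the example runs on its own. If only some of the values are needed, a common convention (not something we covered in class) is to assign the others to the throwaway name `_`.

```python
def silly_arithmetic(x, y):
    return x + y, x - y, x * y, x / y

# the function really returns a single tuple holding the 4 values
results = silly_arithmetic(40.0, 20.0)
print(type(results), len(results))

# too few variables on the left-hand side: Python refuses to unpack
try:
    a, b = silly_arithmetic(40.0, 20.0)
except ValueError as error:
    print("Unpacking failed:", error)

# keep only the product; '_' is just a conventional name for values we ignore
_, _, product, _ = silly_arithmetic(40.0, 20.0)
print(product)
```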


## Intro to web scraping

In [10]:

# we will always use the requests module, which we need to import
import requests

# retrieve a certain page by specifying its URL
response = requests.get("http://pages.stern.nyu.edu/~marriaga")

# response object returned by requests.get contains two values we are interested in

# the first is the .status_code, which should be 200 (meaning "success")
print(response.status_code)

200


In [11]:
# the other is the content of the page that was retrieved
print(response.text)

<HTML>
<HEAD>
<TITLE>Manuel Arriaga</TITLE>


<meta charset="UTF-8">

<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">

<meta name="author" CONTENT="Manuel Arriaga">
<meta name="keywords" CONTENT="Manuel Arriaga">
<meta name="description" CONTENT="Learn more about Manuel Arriaga's work and projects">




<meta property="og:title"              content="Manuel Arriaga" />
<meta property="og:url"                content="http://pages.stern.nyu.edu/~marriaga" />
<meta property="og:description"        content="Manuel Arriaga" />
<meta property="og:image"              content="http://pages.stern.nyu.edu/~marriaga/manuel-arriaga.jpg"/>
<meta property="og:type"               content="website" />
<meta property="og:locale"             content="en_US"/>

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">

<link href="https://fonts.googleapis.com/css?family=Comfortaa" rel="stylesheet">


<link type="text/css" hr

In [13]:
# let us use the name 'html' instead, because that is what response.text contains
html = response.text

In [14]:
# to make sense of HTML and more easily extract information from it, we use the bs4 
# (beautiful soup version 4) module

# must import it first
import bs4

# then we create a "soup" object using that module's BeautifulSoup() function. It takes
# two arguments: 
# - the HTML of the page we want to extract information from
# - the string "html.parser", telling BeautifulSoup that this is indeed an HTML page and it should be interpreted as such
soup = bs4.BeautifulSoup(html, "html.parser")

In [55]:
# it returns an object of a special type that supports a number of useful functions
print( type(soup) )

<class 'bs4.BeautifulSoup'>


In [15]:
# The most important functions in BeautifulSoup are .find() and .find_all()

# they both allow us to locate HTML elements on the page

# if we want just the first matching element, we can use .find()
print( soup.find('title') ) # gets the <TITLE></TITLE> element

<title>Manuel Arriaga</title>


In [16]:
# what if we want just the inner text inside the element? easy, ask for its .text:
print( soup.find('title').text )

Manuel Arriaga


In [18]:
# if we want all matching elements, we use .find_all()

# here we get all 2nd level headings on the page (<H2></H2>)
headings = soup.find_all('h2')
# let us loop over the list returned by BeautifulSoup and print each of them
for heading in headings:
    print(heading)

<h2 style="background-color: maroon;">Academia</h2>
<h2 style="background-color: green;">Writing</h2>
<h2 style="background-color: darkblue;">Activism</h2>
<h2 style="background-color: #E67451;">Software</h2>
<h2 style="background-color: purple;">Tonic Tapes</h2>
<h2 style="background-color: #BC8F8F;">Underlined Press</h2>
<h2 style="background-color: #307D7E;">Email</h2>


In [54]:
# notice that each object returned by .find() and .find_all() is not just a string: instead,
# it is a special BeautifulSoup object that can itself be queried/searched for the text it 
# contains, for its attributes and for other HTML elements nested inside it.

print( type(headings[0]) ) # not a string!

<class 'bs4.element.Tag'>


In [19]:
# let us print just the text of each of these headings
headings = soup.find_all('h2')
for heading in headings:
    print(heading.text)

Academia
Writing
Activism
Software
Tonic Tapes
Underlined Press
Email


In [24]:
# example: let us get the URLs for all links on the page

# first let us get all links, which in HTML are <a></a> elements

links = soup.find_all('a')
for l in links: print(l)

<a href="http://www.stern.nyu.edu/" target="_blank">NYU Stern</a>
<a href="http://www.cambridgedigitalinnovation.com/" target="_blank">Center for
      Digital Innovation</a>
<a href="http://www.amazon.com/Rebooting-Democracy-Citizens-Reinventing-Politics/dp/191019817X">Rebooting
      Democracy: A Citizen's Guide to Reinventing Politics</a>
<a href="http://www.forumdoscidadaos.pt/en">Fórum dos Cidadãos</a>
<a href="http://www.policyjurygroup.org/" target="_blank">Policy
      Jury Group</a>
<a href="software">this page</a>
<a href="http://pages.stern.nyu.edu/~marriaga/beyond-the-hfs.pdf">alternative
      way to organize personal documents</a>
<a href="http://tonictapes.org/">this small tribute to the historic
      music club Tonic</a>
<a href="http://underlinedpress.com/">print a beautiful custom-made
      book containing your favorite poems, prose fragments and other
      texts</a>


In [25]:
# let us take a look just at the 1st
print(links[0])

<a href="http://www.stern.nyu.edu/" target="_blank">NYU Stern</a>


In [26]:
# this is the clickable "anchor text" that gets displayed on the page
print(links[0].text)

NYU Stern


In [28]:
# what about the actual URL? We can get it by asking for the value of the 'href' (meaning
# 'hypertext reference') attribute of the <a> element
print( links[0].get("href") )

http://www.stern.nyu.edu/


In [30]:
# exercise: print the URLs for all links on my page

links = soup.find_all('a')
for l in links:
    url = l.get("href") 
    
    # sometimes URLs are "relative" to the page or site we are crawling. in such cases,
    # we need to prefix them with what is missing for them to be valid URLs
    
    if not url.startswith(('http://', 'https://')): # it is not a fully qualified URL!
        if url.startswith('/'): # relative url pointing to another page on this server
            url = "http://pages.stern.nyu.edu" + url
        else: # relative url pointing to a subfolder beneath the current page
            url = "http://pages.stern.nyu.edu/~marriaga/" + url 
    print(url) # success, these are now all valid URLs

http://www.stern.nyu.edu/
http://www.cambridgedigitalinnovation.com/
http://www.amazon.com/Rebooting-Democracy-Citizens-Reinventing-Politics/dp/191019817X
http://www.forumdoscidadaos.pt/en
http://www.policyjurygroup.org/
http://pages.stern.nyu.edu/~marriaga/software
http://pages.stern.nyu.edu/~marriaga/beyond-the-hfs.pdf
http://tonictapes.org/
http://underlinedpress.com/
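
The prefixing done by hand above can also be delegated to the standard library. The sketch below uses `urllib.parse.urljoin`, which we did not use in class. It assumes the base URL is written with a trailing slash so that relative links resolve beneath `~marriaga/`, and the third `href` value is a made-up path included only to show the leading-slash case.

```python
from urllib.parse import urljoin

base = "http://pages.stern.nyu.edu/~marriaga/"

# a few href values of the kinds we might encounter on a page
# ("/some-other-page" is hypothetical, for illustration only)
hrefs = ["http://tonictapes.org/", "software", "/some-other-page"]

for href in hrefs:
    # absolute URLs come back unchanged; relative ones are resolved against base
    print(urljoin(base, href))
```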


In [32]:
# Now let us try a more complicated page: let us crawl Stern news

# first, retrieve page and create soup object

response = requests.get("https://www.stern.nyu.edu/experience-stern/news-events")

if response.status_code != 200:
    print("Trouble! The server returned error code %d" % response.status_code)
else: # OK!
    print("Success retrieving the page!")

Success retrieving the page!


In [33]:
html = response.text

# as above
soup = bs4.BeautifulSoup(html, "html.parser")

In [36]:
# through visual inspection of the page (using the "View page source" option by right-clicking
# on the background of the page in Firefox), we saw that each news item is a <li> element
# (<li> standing for "list item") and has class "item". 
#
# HTML elements very often have specific classes and that makes it very easy to locate what we
# want. So much so that BeautifulSoup's .find() and .find_all() accept an optional keyword
# argument called class_ (notice the underscore, it must be there, because 'class' is a
# reserved word in Python!) that lets you specify you are only interested in elements
# with that class

items = soup.find_all("li", class_="item")
print(items)

[<li class="item">
<div class="header">
<div class="kicker">School News</div>
<h2 class="title"><a class="news-events" href="/experience-stern/news-events/isser-gallogly-associate-dean-mba-admissions-and-program-innovation-highlights-stern-s-ms-accounting">Isser Gallogly, associate dean, MBA Admissions and Program Innovation, highlights Stern's MS in Accounting program in a feature story on specialized master's programs at New York City-based b-schools</a></h2>
<div class="date">— <span class="date-display-single">July 22, 2019</span> </div>
</div>
<div class="pic"><img alt="MiM Guide logo " height="144" src="https://www.stern.nyu.edu/sites/default/files/assets/images/mimguide192x144.jpg" title="MiM Guide logo " width="192"/></div>
<h2 class="title-bottom"><a class="news-events" href="/experience-stern/news-events/isser-gallogly-associate-dean-mba-admissions-and-program-innovation-highlights-stern-s-ms-accounting">Isser Gallogly, associate dean, MBA Admissions and Program Innovation, h

In [39]:
# great! let us look at the first news item on the page!
print(items[0])

<li class="item">
<div class="header">
<div class="kicker">School News</div>
<h2 class="title"><a class="news-events" href="/experience-stern/news-events/isser-gallogly-associate-dean-mba-admissions-and-program-innovation-highlights-stern-s-ms-accounting">Isser Gallogly, associate dean, MBA Admissions and Program Innovation, highlights Stern's MS in Accounting program in a feature story on specialized master's programs at New York City-based b-schools</a></h2>
<div class="date">— <span class="date-display-single">July 22, 2019</span> </div>
</div>
<div class="pic"><img alt="MiM Guide logo " height="144" src="https://www.stern.nyu.edu/sites/default/files/assets/images/mimguide192x144.jpg" title="MiM Guide logo " width="192"/></div>
<h2 class="title-bottom"><a class="news-events" href="/experience-stern/news-events/isser-gallogly-associate-dean-mba-admissions-and-program-innovation-highlights-stern-s-ms-accounting">Isser Gallogly, associate dean, MBA Admissions and Program Innovation, hi

In the cells that follow, we will extract specific information from this news item:

- title?
- category?
- date?
- abstract?
- URL to the complete news item?

In [48]:
# let us extract the title of this news item
title = items[0].find("h2" ).text
print( title )

Isser Gallogly, associate dean, MBA Admissions and Program Innovation, highlights Stern's MS in Accounting program in a feature story on specialized master's programs at New York City-based b-schools


In [43]:
# let us extract the category of this news item

category = items[0].find("div", class_="kicker").text
print( category )

School News


In [45]:
# let us extract the date of this news item

date = items[0].find("div", class_="date").text[2:]
print( date )

July 22, 2019 
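
A short sketch of what the `[2:]` slice is doing, run on a hard-coded string rather than the live page. The text of the date `<div>` starts with a dash and a space, and (as the output above shows) ends with a stray space that `.strip()` can remove. Converting the result to a date object with `datetime.strptime` is an optional extra step that was not part of the session, and it assumes English month names.

```python
from datetime import datetime

# what the .text of the date <div> looks like (\u2014 is the long dash)
raw = "\u2014 July 22, 2019 "

# [2:] drops the leading dash and space; .strip() also removes the trailing space
date_text = raw[2:].strip()
print(repr(date_text))

# optional: turn the string into a date object (month names assumed to be in English)
parsed = datetime.strptime(date_text, "%B %d, %Y").date()
print(parsed.year, parsed.month, parsed.day)
```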


In [46]:
# let us extract the abstract of this news item

abstract = items[0].find("div", class_="abstract").text
print( abstract)

Excerpt from MiM Guide -- "'New York City offers unique advantages. After graduating from our program, many students begin their careers working in one of "The Big Four" large public accounting firms — Deloitte, PricewaterhouseCoopers, Ernst & Young, and KPMG,' says Isser Gallogly, associate dean at NYU Stern, referring to the school’s one-year MS in Accounting course."


In [53]:
# let us extract the URL of this news item (and complete it so that we have a valid URL)
url = "http://www.stern.nyu.edu" + items[0].find("h2").find('a').get('href')
print( url )

http://www.stern.nyu.edu/experience-stern/news-events/isser-gallogly-associate-dean-mba-admissions-and-program-innovation-highlights-stern-s-ms-accounting
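
One caveat when repeating these steps for more than one item: `.find()` returns `None` when nothing matches, so asking for `.text` on the result fails with an `AttributeError`. The sketch below runs on a tiny invented HTML snippet (not the real Stern page), and `text_or_default` is a hypothetical helper name chosen for this example.

```python
import bs4

# invented sample: the second item has no abstract
sample_html = """
<ul>
  <li class="item"><div class="kicker">School News</div><div class="abstract">First abstract</div></li>
  <li class="item"><div class="kicker">Research</div></li>
</ul>
"""

def text_or_default(element, tag, class_name, default=""):
    # .find() returns None when no matching element exists
    match = element.find(tag, class_=class_name)
    if match is None:
        return default
    return match.text.strip()

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")
sample_items = sample_soup.find_all("li", class_="item")
for item in sample_items:
    category = text_or_default(item, "div", "kicker")
    abstract = text_or_default(item, "div", "abstract", "(no abstract)")
    print(category, "|", abstract)
```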


## Exercise to do at home:

Based on the code above, write a short program that creates a dictionary called 'news'.
This dictionary should be structured as follows:

key: title of a news article
value: URL of that news article

Your code should extract this information for *all* news articles on the Stern News and Events page and insert it into the dictionary.

Then, write a simple program that prints all titles currently on the Stern News and Events page and asks the user to indicate which article they want to see the URL for (so that they can click on the link and open it in a browser).

Extension/more advanced:

If you want a bigger challenge, notice that at the bottom of the page there are links to other pages with additional (older) news articles. They show up in the familiar format: 1 2 3 ... . Each of these numbers is a link to another page with another batch of news articles. Try to have your code automatically visit those other pages and extract news articles from them as well.