# Python Open Lab

## Beautiful Soup

$*$ Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

$*$ Check out the official documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [31]:
# Example of use

import bs4
import urllib.request

# URL of the page to scrape
loc = 'https://library.columbia.edu/index.html'

# Open the URL with urllib; this returns a file-like response object
page = urllib.request.urlopen(loc)

# Create the soup object, parsing with Python's built-in HTML parser
soup = bs4.BeautifulSoup(page, 'html.parser')

In [2]:
# Now let's check out some methods
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <!-- <meta http-equiv="keywords" content=""> -->
  <!-- <meta http-equiv="description" content=""> -->
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0" name="viewport"/>
  <title>
   Libraries Home | Columbia University Libraries
  </title>
  <script>
   var CUL = {};
var LDPD = {};
  </script>
  <link href="//cdn.cul.columbia.edu/ldpd-toolkit/build/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="//cdn.cul.columbia.edu/ldpd-toolkit/build/css/bootstrap-responsive.min.css" rel="stylesheet"/>
  <link href="//cdn.cul.columbia.edu/ldpd-toolkit/build/css/ldpd-toolkit.min.css" rel="stylesheet"/>
  <link href="/etc/clientlibs/shared/shared-v2.css" rel="stylesheet" type="text/css"/>
  <link href="/etc/clientlibs/shared/cul-base-v1.css" rel="stylesheet" type="text/css"/>
  <link href="/etc/designs/libraryweb/librar

In [3]:
# find and find_all: find returns the first match, find_all returns every match
print(soup.find('title').get_text())

Libraries Home | Columbia University Libraries


In [4]:
for link in soup.find_all('a'):
    print(link.get('href'))  # get each link's href attribute (None if the tag has none)

https://library.columbia.edu/index.html
https://library.columbia.edu/research/askalibrarian.html
None
https://library.columbia.edu/index.html
https://library.columbia.edu/locations.html
https://hours.library.columbia.edu
https://library.columbia.edu/locations/map.html
https://library.columbia.edu/locations/avery.html
https://library.barnard.edu/
https://library.columbia.edu/locations/burke.html
https://library.columbia.edu/locations/business.html
https://library.columbia.edu/locations/butler.html
https://library.columbia.edu/locations/chrdr.html
https://ctl.columbia.edu/
https://library.columbia.edu/locations/ccoh.html
https://library.columbia.edu/locations/cuarchives.html
https://library.columbia.edu/locations/dhc.html
https://library.columbia.edu/locations/music/music-lab.html
https://library.columbia.edu/locations/dsc.html
https://library.columbia.edu/locations/dssc.html
https://library.columbia.edu/locations/eastasian.html
https://library.columbia.edu/locations/global.html
http://l
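
The `None` in the output above comes from an `<a>` tag that has no `href` attribute. Here is a minimal sketch of skipping such tags and resolving relative links. It runs on a small made-up HTML snippet, so `sample_html`, `base_url`, and the other names are illustrative and not taken from the Columbia page:

```python
import bs4
from urllib.parse import urljoin

# Hypothetical snippet standing in for a downloaded page
sample_html = """
<html><body>
  <a href="/locations.html">Locations</a>
  <a name="top">Anchor without an href</a>
  <a href="https://hours.library.columbia.edu">Hours</a>
</body></html>
"""
base_url = 'https://library.columbia.edu/index.html'

sample_soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only the <a> tags that actually carry an href attribute;
# urljoin turns relative paths into absolute URLs and leaves absolute ones alone
links = [urljoin(base_url, a['href']) for a in sample_soup.find_all('a', href=True)]

for link in links:
    print(link)
```

Filtering with `href=True` avoids having to test each result for `None` inside the loop.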

In [47]:
# Same as before, but with a different URL
loc2 = 'http://www.bloomberg.com/quote/SPX:IND'
page2 = urllib.request.urlopen(loc2)
soup2 = bs4.BeautifulSoup(page2, 'html.parser')

print(soup2.prettify())


<!DOCTYPE doctype html>
<html>
 <head>
  <title>
   Bloomberg - Are you a robot?
  </title>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://assets.bwbx.io/font-service/css/BWHaasGrotesk-55Roman-Web,BWHaasGrotesk-75Bold-Web,BW%20Haas%20Text%20Mono%20A-55%20Roman/font-face.css" rel="stylesheet" type="text/css"/>
  <style rel="stylesheet" type="text/css">
   html, body, div, span, applet, object, iframe,
        h1, h2, h3, h4, h5, h6, p, blockquote, pre,
        a, abbr, acronym, address, big, cite, code,
        del, dfn, em, img, ins, kbd, q, s, samp,
        small, strike, strong, sub, sup, tt, var,
        b, u, i, center,
        dl, dt, dd, ol, ul, li,
        fieldset, form, label, legend,
        table, caption, tbody, tfoot, thead, tr, th, td,
        article, aside, canvas, details, embed,
        figure, figcaption, footer, header, hgroup,
        menu, nav, output, ruby, section, summary,
        time, mark, audio, video {
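
The `<title>` in the output above reads "Are you a robot?", which suggests the site served a bot-check page rather than the quote page. Below is a rough sketch of detecting that case before scraping any further. `looks_like_bot_check` and `blocked_html` are hypothetical names introduced here, and the title test is only a heuristic:

```python
import bs4

def looks_like_bot_check(soup):
    """Hypothetical helper: True if the page title mentions a robot check."""
    title = soup.find('title')
    return title is not None and 'robot' in title.get_text().lower()

# Stand-in for the response shown above (only the title matters here)
blocked_html = '<html><head><title>Bloomberg - Are you a robot?</title></head></html>'
blocked_soup = bs4.BeautifulSoup(blocked_html, 'html.parser')

if looks_like_bot_check(blocked_soup):
    print('Got a bot-check page; there is no quote data to parse.')
```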
       

## Positional or Keyword Parameters

In [43]:
def f(a,b):
    print('a is: ',a)
    print('b is: ',b)
    return

f(2,3)

a is:  2
b is:  3


As expected, the first argument was passed into the parameter a and the second argument into b. Now observe what happens if we invoke the function this way:

In [44]:
f(b=10,a=0)

a is:  0
b is:  10


What just happened? This time we specified the arguments by keyword, not by position. Of course we need to know the formal parameter names, but as long as we specify the arguments by name, we don't have to worry about order. The only rule is that positional arguments must come before keyword arguments in the function call. So, for example, the first call below is okay, while the second is a syntax error:

In [45]:
f(6,b=12)

a is:  6
b is:  12


In [46]:
f(b=12,6)

SyntaxError: positional argument follows keyword argument (<ipython-input-46-a7f8a95efa30>, line 1)

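What about f(6,a=12)? Try it! You'll see that this doesn't work: the argument 6 is bound to the parameter a by position, and the keyword argument a=12 then tries to supply a second value for a. Python raises a TypeError (f() got multiple values for argument 'a').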
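
Here is a small sketch of that failing call, wrapped in try/except so the error message can be inspected without stopping the notebook. The definition of `f` simply repeats the one above, and `message` is a name introduced here:

```python
def f(a, b):
    print('a is: ', a)
    print('b is: ', b)

try:
    # 6 is bound to a by position, so a=12 would be a second value for a
    f(6, a=12)
except TypeError as err:
    message = str(err)
    print('TypeError:', message)
```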


## Variable-Length-Positional Parameter
In a function definition, this parameter must come after the positional-or-keyword parameters.

In [None]:
def f(a,b=0,*args):
    print('a is: ',a)
    print('b is: ',b)
    print('args is: ',args)
    return

f(3,2,4,5,6)

The * in front of the parameter args makes it a variable-length-positional parameter. This means it absorbs any extra positional arguments and places them into a tuple called args. Specifying a parameter like this in a function definition gives the caller the option of providing as many positional arguments as they want without breaking the function.
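
To make the absorbing behavior concrete, here is a short sketch that returns the bindings instead of printing them. `show_args` is a hypothetical stand-in with the same signature as `f` above:

```python
def show_args(a, b=0, *args):
    # Return the bindings so they are easy to inspect
    return a, b, args

# 3 -> a, 2 -> b, everything else lands in the args tuple
print(show_args(3, 2, 4, 5, 6))   # (3, 2, (4, 5, 6))

# With no extra positional arguments, args is an empty tuple
print(show_args(3))               # (3, 0, ())

# A list can be unpacked into positional arguments with * at the call site
extras = [7, 8, 9]
print(show_args(1, 2, *extras))   # (1, 2, (7, 8, 9))
```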

## Keyword-Only Parameters
We can specify additional parameters after the variable-length-positional parameter in a function definition. These parameters may only be passed by keyword, never positionally. For this reason we call them keyword-only parameters. For example:

In [None]:
def g(a,b=0,*args,c=1,d=2):
    print('a is: ',a)
    print('b is: ',b)
    print('args is: ',args)
    print('c is: ',c)
    print('d is: ',d)
    return

g(1,2,3,4,5,d=100)
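
Here is a sketch of how the call above binds its arguments, using variants that return the values instead of printing them. `g_values` and `h` are hypothetical names. The bare `*` in `h` is an extra illustration: it makes the parameters after it keyword-only without collecting extra positional arguments.

```python
def g_values(a, b=0, *args, c=1, d=2):
    # Same signature as g above, but returns the bindings
    return a, b, args, c, d

# 1 -> a, 2 -> b, (3, 4, 5) -> args, c keeps its default, d is set by keyword
print(g_values(1, 2, 3, 4, 5, d=100))   # (1, 2, (3, 4, 5), 1, 100)

def h(a, *, c=1):
    # The bare * means c can only be passed by keyword
    return a, c

print(h(1, c=5))   # (1, 5)

try:
    h(1, 5)        # 5 cannot be bound to c positionally
except TypeError as err:
    print('TypeError:', err)
```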