# BeautifulSoup
- great 'screen scraping' package
- tons of interesting data on webpages designed for people, not programs
- makes it easy to extract information from complex web pages and XML documents
- soup reads in the page of interest, then you can query it
- often can figure out what to do by playing interactively
- works in Unicode
- new code should use BeautifulSoup version 4 (the `bs4` package)
- usually used on web pages, but can operate on any string
- [doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
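
Because soup can operate on any string, a small inline document is enough to try queries interactively. The sketch below uses a made-up HTML snippet and Python's built-in `'html.parser'`, so it runs without a network connection or lxml; the snippet and the variable names are illustrative assumptions, not part of the examples that follow.

```python
import bs4

# a made-up page; any string of markup works
html = """<html><head><title>Demo page</title></head>
<body>
  <p id="intro">Hello, <b>world</b>!</p>
  <a href="https://example.com/a">first</a>
  <a href="https://example.com/b">second</a>
</body></html>"""

# 'html.parser' ships with Python, so no extra install is needed
soup = bs4.BeautifulSoup(html, 'html.parser')

title = soup.title.string                       # text of the <title> element
intro = soup.find('p', id='intro').get_text()   # text of one element, nested tags flattened
links = [a['href'] for a in soup.find_all('a')] # attribute lookup works like a dict

print(title)
print(intro)
print(links)
```

Typing expressions like `soup.title` or `soup.find_all('a')` one at a time in an interactive session is usually the quickest way to discover which query returns the data of interest.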

# Example: find all the headlines on the front page of the [New York Times](http://nyt.com)
- key point: I don't want to work very hard!
    - look at the webpage source: the HTML structure is quite complex, and I'm not interested in understanding it
    - this would be very difficult with the text tools we have seen so far, like str.find() and regular expressions

In [1]:
# 'lxml' is an XML parser (it parses HTML too)
# must tell soup which encoding to use when decoding the page's bytes

import urllib.request
import bs4
import lxml

In [2]:
nf2 = urllib.request.urlopen('http://nyt.com')
lines = nf2.readlines()
print(len(lines))
lines[:10]

320


[b'<!DOCTYPE html>\n',
 b'<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">\n',
 b'  <head>\n',
 b'    <title data-rh="true">The New York Times - Breaking News, World News & Multimedia</title>\n',
 b'    <meta data-rh="true" itemprop="inLanguage" content="en-US"/><meta data-rh="true" name="robots" content="noarchive,noodp,noydir"/><meta data-rh="true" name="application-name" content="The New York Times"/><meta data-rh="true" name="msapplication-starturl" content="https://www.nytimes.com"/><meta data-rh="true" name="msapplication-task" content="name=Search;action-uri=https://www.nytimes.com/search/?src=iepin;icon-uri=https://static01.nyt.com/images/icons/search.ico"/><meta data-rh="true" name="msapplication-task" content="name=Most Popular;action-uri=https://www.nytimes.com/gst/mostpopular.html?src=iepin;icon-uri=https://static01.nyt.com/images/icons/mostpopular.ico"/><meta data-rh="true" name="msapplication-task" content="name=Video;action-uri=https://video.nytimes.com/?

In [3]:
nf2 = urllib.request.urlopen('http://nyt.com')
soup = bs4.BeautifulSoup(nf2, 'lxml', from_encoding='utf-8')
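
The `from_encoding` argument matters because the page arrives as bytes, and soup has to decode them into Unicode. The following is a minimal offline sketch, assuming a made-up byte string in place of the downloaded page and the built-in `'html.parser'` in place of lxml.

```python
import bs4

# bytes as they might arrive from a network read (made-up content)
raw = '<h2>Café ‘Daily’</h2>'.encode('utf-8')

# from_encoding applies to bytes input; a str is already decoded
soup = bs4.BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')

text = soup.h2.get_text()
print(text)
print(soup.original_encoding)   # the encoding soup used to decode the bytes
```

If `from_encoding` is omitted, soup tries to detect the encoding on its own, which usually works but can guess wrong on short or unusual documents.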

In [4]:
# headlines seem to be contained in 'h2' elements

h2s = soup.find_all('h2')
h2s
      

[<h2 class="css-km70tz esl82me0"> Listen to ‘The Daily’</h2>,
 <h2 class="css-km70tz esl82me0">In the ‘In Her Words’ Newsletter</h2>,
 <h2 class="css-km70tz esl82me0">Sign Up: ‘Coronavirus Briefing’</h2>,
 <h2 class="css-1qwxefa esl82me0"><span>Stocks and Bond Yields Fall Sharply: Latest Updates</span></h2>,
 <h2 class="css-n2blzn esl82me0">Global Health Crisis 1, Economic Policymakers 0</h2>,
 <h2 class="css-n2blzn esl82me0">The Fed has no tools for an outbreak, but it acted anyway. Here’s why.</h2>,
 <h2 class="css-1qwxefa esl82me0"><span>Coronavirus Death Toll Rises to 9 in Washington State: Updates</span></h2>,
 <h2 class="css-n2blzn esl82me0">Trump Administration Sends Mixed Signals on Coronavirus Testing</h2>,
 <h2 class="css-n2blzn esl82me0">A second case in New York may force a hospital to quarantine its staff.</h2>,
 <h2 class="css-1qwxefa esl82me0"><span>Democrats Head to the Polls for Super Tuesday: Latest Updates</span></h2>,
 <h2 class="css-n2blzn esl82me0">Sanders Campaig

In [5]:
# pull out the contents of the h2 elements

contents = [h2.contents for h2 in h2s]
contents

[[' Listen to ‘The Daily’'],
 ['In the ‘In Her Words’ Newsletter'],
 ['Sign Up: ‘Coronavirus Briefing’'],
 [<span>Stocks and Bond Yields Fall Sharply: Latest Updates</span>],
 ['Global Health Crisis 1, Economic Policymakers 0'],
 ['The Fed has no tools for an outbreak, but it acted anyway. Here’s why.'],
 [<span>Coronavirus Death Toll Rises to 9 in Washington State: Updates</span>],
 ['Trump Administration Sends Mixed Signals on Coronavirus Testing'],
 ['A second case in New York may force a hospital to quarantine its staff.'],
 [<span>Democrats Head to the Polls for Super Tuesday: Latest Updates</span>],
 ['Sanders Campaign Was Caught Off Guard by Quick Massing of Opposition'],
 ['Some of the questions surrounding Super Tuesday already seem to have answers.'],
 [<span>Why She’s Prof. Warren From Harvard, Not Betsy From Oklahoma</span>],
 ['The Latino Vote: The ‘Sleeping Giant’ Awakens'],
 ['The lasting effects of Michael Bloomberg’s stop-and-frisk policy on New York City.'],
 [<span>2

In [6]:
# pull the string out of each list, descending into the <span> tag when present
# note use of the conditional expression ('ternary if')

[content[0] if isinstance(content[0], str) else content[0].contents[0]
 for content in contents]

[' Listen to ‘The Daily’',
 'In the ‘In Her Words’ Newsletter',
 'Sign Up: ‘Coronavirus Briefing’',
 'Stocks and Bond Yields Fall Sharply: Latest Updates',
 'Global Health Crisis 1, Economic Policymakers 0',
 'The Fed has no tools for an outbreak, but it acted anyway. Here’s why.',
 'Coronavirus Death Toll Rises to 9 in Washington State: Updates',
 'Trump Administration Sends Mixed Signals on Coronavirus Testing',
 'A second case in New York may force a hospital to quarantine its staff.',
 'Democrats Head to the Polls for Super Tuesday: Latest Updates',
 'Sanders Campaign Was Caught Off Guard by Quick Massing of Opposition',
 'Some of the questions surrounding Super Tuesday already seem to have answers.',
 'Why She’s Prof. Warren From Harvard, Not Betsy From Oklahoma',
 'The Latino Vote: The ‘Sleeping Giant’ Awakens',
 'The lasting effects of Michael Bloomberg’s stop-and-frisk policy on New York City.',
 '22 Dead After Tornadoes Lash Tennessee: Latest Updates',
 'Hideo Kojima’s Strange
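
The two steps above (take `.contents`, then unwrap the `<span>` by hand) can often be collapsed into a single `get_text()` call, which gathers all text below an element regardless of nesting. This is a sketch on a made-up fragment shaped like the headlines above, not the live page.

```python
import bs4

# made-up fragment: one plain headline, one wrapped in a <span>
html = """
<h2 class="a"> Listen to the show</h2>
<h2 class="b"><span>Stocks Fall Sharply</span></h2>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

# get_text() concatenates all text below the element, however deeply nested;
# strip=True trims leading and trailing whitespace from each piece
headlines = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]
print(headlines)
```

Unlike indexing into `.contents[0]`, this does not depend on how many levels of tags wrap the text, so it keeps working if the page adds another wrapper element.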

# Example: [Citizen Kane page at Rotten Tomatoes](https://www.rottentomatoes.com/m/citizen_kane)


In [7]:
nf2 = urllib.request.urlopen('https://www.rottentomatoes.com/m/citizen_kane')
soup = bs4.BeautifulSoup(nf2, 'lxml', from_encoding='utf-8')

In [8]:
# synopsis

div = soup.find('div', id="movieSynopsis")
print(div.get_text().strip())

This is the labyrinthine study of the life of a newspaper tycoon.
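
Sites change their layout often, and `find()` returns `None` when nothing matches, so a lookup like the one above fails with an `AttributeError` once the `id` disappears. The sketch below wraps the lookup in a small guard; `text_by_id` is a hypothetical helper name, and the HTML is a made-up stand-in for the real page.

```python
import bs4

html = '<div id="movieSynopsis">  A made-up synopsis.  </div>'
soup = bs4.BeautifulSoup(html, 'html.parser')

def text_by_id(soup, tag, element_id, default=''):
    """Return the stripped text of <tag id=element_id>, or default if absent."""
    element = soup.find(tag, id=element_id)
    if element is None:          # find() returns None when nothing matches
        return default
    return element.get_text().strip()

synopsis = text_by_id(soup, 'div', 'movieSynopsis')
missing = text_by_id(soup, 'div', 'noSuchId', default='(not found)')
print(synopsis)
print(missing)
```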


In [9]:
# movie info 

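for li in div.parent.find_all('li'):
    # each <li> holds two <div>s: a label and a value
    key, val = li.find_all('div')

    # get rid of trailing ': '
    print(key.get_text()[:-2], end=' - ')

    children = val.contents
    if len(children) == 1:
        # a single text node: print it as-is
        print(children[0])
    else:
        # mixed content: print only the text of <a> and <time> elements
        for child in children:
            if child.name in ['a', 'time']:
                print(child.get_text().strip(), end=' ')
        print()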

Rating - PG
Genre - Classics Drama Mystery & Suspense 
Directed By - Orson Welles 
Written By - Herman J. Mankiewicz Orson Welles 
In Theaters - May 1, 1941 
On Disc/Streaming - Sep 25, 2001 
Runtime - 119 minutes 
Studio - 
                        RKO Radio Pictures
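
The Studio value above keeps the whitespace that surrounds it in the page source, because a lone text node is printed as-is. One way to get uniform output is the `stripped_strings` generator, which yields every text piece with its surrounding whitespace removed. The fragment below is a made-up imitation of the label/value layout, not the real page markup.

```python
import bs4

# made-up fragment imitating the label/value layout of the movie info list
html = """
<ul>
  <li><div>Rating: </div><div>PG</div></li>
  <li><div>Studio: </div><div>
        RKO Radio Pictures
  </div></li>
  <li><div>Written By: </div><div><a>Herman J. Mankiewicz</a>, <a>Orson Welles</a></div></li>
</ul>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

info = {}
for li in soup.find_all('li'):
    key, val = li.find_all('div')
    label = key.get_text().strip().rstrip(':')
    # stripped_strings yields each text piece with surrounding whitespace removed;
    # the bare ',' separators between links are dropped
    pieces = [s for s in val.stripped_strings if s != ',']
    info[label] = ' '.join(pieces)

print(info)
```

Collecting the values in a dict rather than printing them also makes the result easier to reuse later.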
                    


In [10]:
# actors

cast = soup.find('div', class_ = 'castSection')

fields = [s['title'] for s in cast.find_all('span', title=True)]
actor_role = [[fields[j], fields[j+1]]
              for j in range(0, len(fields), 2)]

actor_role

[['Orson Welles', 'Charles Foster Kane'],
 ['Dorothy Comingore', 'Susan Alexander'],
 ['Joseph Cotten', 'Jedediah Leland'],
 ['Everett Sloane', 'Bernstein'],
 ['George Coulouris', 'Walter Parks Thatcher'],
 ['Agnes Moorehead', 'Mrs. Mary Kane'],
 ['Ruth Warrick', 'Emily Norton Kane'],
 ['Harry Shannon', 'Kane Sr.'],
 ['Ray Collins', 'Boss James W. Gettys'],
 ['Sonny Bupp', 'Kane III'],
 ['Erskine Sanford', 'Herbert Carter'],
 ['William Alland', 'Jerry Thompson'],
 ['Fortunio Bonanova', 'Matisti'],
 ['Paul Stewart', 'Raymond'],
 ['Gus Schilling', 'Head Waiter'],
 ['Buddy Swan', 'Young Charles Foster Kane'],
 ['Philip Van Zandt', 'Mr. Rawlston'],
 ['Georgia Backus', 'Miss Anderson'],
 ['Alan Ladd', 'Reporter'],
 ['Pedro de Cordoba', 'Kane senior'],
 ['Charles Bennett', 'Entertainer'],
 ["Arthur O'Connell", 'Reporter'],
 ['Joan Blair', 'Georgia'],
 ['Edmund Cobb', 'Enquirer Reporter'],
 ['Eddie Coke', 'Reporter'],
 ['Gino Corrado', 'Gino'],
 ['Herbert Corthell', 'City Editor'],
 ['Louise 
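
The pairing step relies on the `title` attributes alternating between actor and role. A slightly more compact way to express the same idea is to slice the list into even and odd positions and `zip` them. The sketch below assumes a made-up cast fragment with that alternating layout.

```python
import bs4

# made-up fragment: title attributes alternate actor, role, actor, role
html = """
<div class="castSection">
  <span title="Orson Welles">Orson Welles</span>
  <span title="Charles Foster Kane">Charles Foster Kane</span>
  <span title="Joseph Cotten">Joseph Cotten</span>
  <span title="Jedediah Leland">Jedediah Leland</span>
  <span>no title attribute, so skipped</span>
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
cast = soup.find('div', class_='castSection')

# title=True matches only spans that have a title attribute
fields = [s['title'] for s in cast.find_all('span', title=True)]

# even positions are actors, odd positions are roles; zip pairs them up
actor_role = list(zip(fields[::2], fields[1::2]))
print(actor_role)
```

One difference worth knowing: if the list has an odd number of entries, `zip` silently drops the unpaired last item, whereas indexing with `fields[j+1]` raises an `IndexError`.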