# Lesson 9 - Requests + BeautifulSoup

First, the requests library. The requests library gives you a convenient set of functions for sending requests over the internet. What are requests? They are the messages your web browser sends to a website's server to ask for data. One of the most common types is the GET request, which asks the server to send back a resource such as a web page.

In [9]:
import requests

response = requests.get("https://pypi.org/project/beautifulsoup4/")
html_str = response.text
status = response.status_code
print(html_str)
print(status) # 200 status means successful request





<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <meta name="defaultLanguage" content="en">
    <meta name="availableLanguages" content="en, es, fr, ja, pt_BR, uk, el, de, zh_Hans, zh_Hant, ru, he, eo">


    <title>beautifulsoup4 · PyPI</title>
    <meta name="description" content="Screen-scraping library">

    <link rel="stylesheet" href="/static/css/warehouse-ltr.99b3104d.css">
    <link rel="stylesheet" href="/static/css/fontawesome.b50b476c.css">
    <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Source+Sans+3:400,400italic,600,600italic,700,700italic%7CSource+Code+Pro:500">
    <noscript>
      <link rel="stylesheet" href="/static/css/noscript.0673c9ea.css">
    </noscript>


    <link rel="icon" href="/static/images/favicon.35549fe8.ico" type="image/x-icon">

    <link rel="alternate" type
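
The comment in the cell above notes that a 200 status means the request succeeded. Here is a minimal sketch of how you might act on that before using the HTML. `raise_for_status()` and `ok` are part of requests; the helper name `html_or_error` and the hand-built `Response` object (used only so the example runs without a network connection) are assumptions for illustration.

```python
import requests


def html_or_error(response):
    """Return the page text, or raise requests.HTTPError for a 4xx/5xx status."""
    response.raise_for_status()
    return response.text


# Hand-built response standing in for a failed request (illustration only;
# real code would get this object back from requests.get).
fake_response = requests.Response()
fake_response.status_code = 404

try:
    html_or_error(fake_response)
    outcome = "success"
except requests.HTTPError:
    outcome = "failed"

print(fake_response.ok)  # False, because the status code is 400 or above
print(outcome)
```

With a real request you would call `html_or_error(requests.get(url))` and let the exception stop the program early instead of parsing an error page.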

In [12]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str, "html.parser")
help(soup)

Help on BeautifulSoup in module bs4 object:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(n
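
The help text says that most of the methods you call on a `BeautifulSoup` object are inherited from `Tag`. A quick way to see this for yourself, using a tiny made-up HTML string (`demo_soup` is a name chosen for this sketch, not part of the lesson's data):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo_soup = BeautifulSoup(
    "<html><head><title>Demo page</title></head></html>", "html.parser"
)

# The whole document behaves like one big Tag, so Tag methods work on it.
print(isinstance(demo_soup, Tag))
print(demo_soup.title.get_text())
```

This is why the same calls (`find_all`, `get_text`, attribute access like `.div`) work both on the full soup and on any tag you pull out of it.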

In [15]:
print(soup.get_text())










beautifulsoup4 · PyPI










 

















Skip to main content

Switch to mobile version    








Some features may not work without JavaScript. Please try enabling it if you encounter problems.


 












Search PyPI



Search



 


Help
Sponsors
Log in
Register




Menu      




Help
Sponsors
Log in
Register



 




Search PyPI



Search








        beautifulsoup4 4.12.3
      


pip install beautifulsoup4


Copy PIP instructions






Latest version


Released: 
  Jan 17, 2024
 





 
Screen-scraping library
 







Navigation





Project description                




Release history                




Download files                





Project links



Download
      



Homepage
      




Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery 


Meta
License: MIT License (MIT License)
Author: Leonard Richardson


Tags

      HTML,    

      XML,    

      parse,    

      soup    
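
The output above contains long runs of blank lines because `get_text()` keeps all the whitespace between tags. One possible way to tidy it up is to pass the `separator` and `strip` arguments, sketched here on a small made-up document rather than the PyPI page:

```python
from bs4 import BeautifulSoup

sample_html = """
<html><head><title>Demo</title></head>
<body>

<p> Hello </p>

<p>World</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# strip=True trims each piece of text and drops the pieces that are only
# whitespace; separator is placed between the pieces that remain.
clean_text = sample_soup.get_text(separator="\n", strip=True)
print(clean_text)
```

The same call on the lesson's `soup` should give one piece of text per line without the empty gaps.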

In [25]:
from pprint import pprint
my_div_tags = list(soup.find_all("div"))
my_div_tags[0]
# print(len(my_div_tags))

<div class="stick-to-top js-stick-to-top" id="sticky-notifications">
<!--[if IE]>
        <span class="notification-bar__icon">
          <i class="fa fa-exclamation-triangle" aria-hidden="true"></i>
        </span>
        <span class="notification-bar__message">You are using an unsupported browser, upgrade to a newer version.</span>
      </div>
      <![endif]-->
<noscript>
<span class="notification-bar__icon">
<i aria-hidden="true" class="fa fa-exclamation-triangle"></i>
</span>
<span class="notification-bar__message">Some features may not work without JavaScript. Please try enabling it if you encounter problems.</span>
</div>
</noscript>
<div data-html-include="/_includes/notification-banners/"></div> </div>

In [27]:
first_div = my_div_tags[0]
first_div.find_all("div")


 <span class="notification-bar__icon">
 <i aria-hidden="true" class="fa fa-exclamation-triangle"></i>
 </span>
 <span class="notification-bar__message">Some features may not work without JavaScript. Please try enabling it if you encounter problems.</span>
 </div>,
 <div data-html-include="/_includes/notification-banners/"></div>]

In [31]:
soup.div

<div class="stick-to-top js-stick-to-top" id="sticky-notifications">
<!--[if IE]>
        <span class="notification-bar__icon">
          <i class="fa fa-exclamation-triangle" aria-hidden="true"></i>
        </span>
        <span class="notification-bar__message">You are using an unsupported browser, upgrade to a newer version.</span>
      </div>
      <![endif]-->
<noscript>
<span class="notification-bar__icon">
<i aria-hidden="true" class="fa fa-exclamation-triangle"></i>
</span>
<span class="notification-bar__message">Some features may not work without JavaScript. Please try enabling it if you encounter problems.</span>
</div>
</noscript>
<div data-html-include="/_includes/notification-banners/"></div> </div>

In [32]:
soup.div.div


<span class="notification-bar__icon">
<i aria-hidden="true" class="fa fa-exclamation-triangle"></i>
</span>
<span class="notification-bar__message">Some features may not work without JavaScript. Please try enabling it if you encounter problems.</span>
</div>
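
The last few cells show two ways of reaching tags: `find_all("div")` and attribute access such as `soup.div`. As a rough summary, attribute access returns the first matching tag, and calling `find_all` on a tag only searches inside that tag. The HTML below is invented for this sketch so the results are easy to check by eye:

```python
from bs4 import BeautifulSoup

nested_html = (
    '<div id="outer"><p>text</p><div id="inner">nested</div></div>'
    '<div id="sibling"></div>'
)
nested_soup = BeautifulSoup(nested_html, "html.parser")

print(len(nested_soup.find_all("div")))      # every div in the document
print(nested_soup.div["id"])                 # first div in the document
print(nested_soup.div.div["id"])             # first div inside that div
print(len(nested_soup.div.find_all("div")))  # only divs nested in the first div
```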

In [46]:
pokemon_wiki_html = requests.get("https://en.wikipedia.org/wiki/Pok%C3%A9mon").text
pokemon_soup = BeautifulSoup(pokemon_wiki_html, "html.parser")
section_headers = pokemon_soup.find_all("h2")

print("Initial Search")
pprint(section_headers)
print()

print("Filtered Search")
section_headers = [tag.find_all(class_="mw-headline") for tag in section_headers]
pprint(section_headers)
print()

print("Restructuring the List")
section_headers = [tag_list[0] for tag_list in section_headers if len(tag_list) > 0]
pprint(section_headers)
print()

print("Extract the Section Names")
section_names = [tag.get_text() for tag in section_headers]
pprint(section_names)

Initial Search
[<h2 class="vector-pinnable-header-label">Contents</h2>,
 <h2><span class="mw-headline" id="Name">Name</span></h2>,
 <h2><span class="mw-headline" id="General_concept">General concept</span></h2>,
 <h2><span class="mw-headline" id="History">History</span></h2>,
 <h2><span class="mw-headline" id="Media">Media</span></h2>,
 <h2><span id="Reaction_to_Pok.C3.A9mania_.281999.E2.80.932000.29"></span><span class="mw-headline" id="Reaction_to_Pokémania_(1999–2000)">Reaction to Pokémania (1999–2000)</span></h2>,
 <h2><span class="mw-headline" id="Legacy_and_influences">Legacy and influences</span></h2>,
 <h2><span class="mw-headline" id="Footnotes">Footnotes</span></h2>,
 <h2><span class="mw-headline" id="References">References</span></h2>,
 <h2><span class="mw-headline" id="External_links">External links</span></h2>]

Filtered Search
[[],
 [<span class="mw-headline" id="Name">Name</span>],
 [<span class="mw-headline" id="General_concept">General concept</span>],
 [<span class="m
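
The cell above gets to the section names in three steps: search, filter, then restructure the nested lists. A shorter variant, sketched here as one option rather than the required approach, uses `find`, which returns a single tag or `None`, so there are no empty lists to clean up. The HTML is a shortened stand-in for the `<h2>` tags in the output:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the <h2> tags shown in the output above.
sample_html = """
<h2 class="vector-pinnable-header-label">Contents</h2>
<h2><span class="mw-headline" id="Name">Name</span></h2>
<h2><span class="mw-headline" id="History">History</span></h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

section_names = []
for h2_tag in sample_soup.find_all("h2"):
    headline = h2_tag.find(class_="mw-headline")  # a Tag, or None if absent
    if headline is not None:
        section_names.append(headline.get_text())

print(section_names)
```

The "Contents" header is skipped because it has no `mw-headline` span inside it.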

In [47]:
# Aside: prettyprint
my_dict = {"afjadkfasdlfajldskfjlasdkfj" : 1, "aisdfjasdjfaldfjsalkfb" : 2, "ajsdhfaskfjajdkfahjsdkfakjsdhfc" : 3}

# Normal print
print(my_dict)

# Pretty Print

pprint(my_dict)

{'afjadkfasdlfajldskfjlasdkfj': 1, 'aisdfjasdjfaldfjsalkfb': 2, 'ajsdhfaskfjajdkfahjsdkfakjsdhfc': 3}
{'afjadkfasdlfajldskfjlasdkfj': 1,
 'aisdfjasdjfaldfjsalkfb': 2,
 'ajsdhfaskfjajdkfahjsdkfakjsdhfc': 3}
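
Why did `pprint` split the dictionary over several lines? It only does so when the one-line form would be wider than its `width` setting (80 characters by default). A small sketch using `pformat`, which returns the formatted string instead of printing it; the dictionary here is made up:

```python
from pprint import pformat

small_dict = {"a": 1, "b": 2}

wide = pformat(small_dict)              # fits within the default width
narrow = pformat(small_dict, width=10)  # too wide for 10 characters, so it wraps

print(wide)
print(narrow)
```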


In [48]:
# To avoid rate limit errors, make sure to pause the program a bit between requests
import time

for i in range(10):
    response = requests.get("https://pypi.org/project/beautifulsoup4/")
    html_str = response.text

    time.sleep(1) # Let the program pause for 1 second in between loop iterations / between requests
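
The pause-between-requests pattern can be wrapped in a small helper. This is a sketch under assumptions: `fetch_all`, `fake_fetch`, and the `delay_seconds` parameter are names invented here, and the fake fetch function stands in for a real download so the example runs offline.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # no pause needed before the first request
        results.append(fetch(url))
    return results


def fake_fetch(url):
    # Stand-in for a real download, used only for this illustration.
    return "page for " + url


pages = fetch_all(["a", "b"], fake_fetch, delay_seconds=0.01)
print(pages)
```

With the real library you could pass `lambda url: requests.get(url).text` as `fetch` and keep the one-second default.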

In [90]:
a = [2, 2]
[x for x in set(a)][0]

2

In [None]:
climate_change_wiki_html = requests.get("https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20190629002413%7C903949211&limit=500").text

In [81]:
html_soup = BeautifulSoup(climate_change_wiki_html, "html.parser")
html_a_tag = html_soup.find_all("a")
# print("find all a")
# pprint(html_a_tag)
# print()

next_500 = html_soup.find_all(class_="mw-nextlink")
print("find all class nextlink")
pprint(list(set(next_500)))
print()

next_500_href = next_500[0].get('href')
print("get href")
print(next_500_href)
print()

find all class nextlink
[<a class="mw-nextlink" href="/w/index.php?title=Climate_change&amp;action=history&amp;offset=20190415130509%7C892572587&amp;limit=500" rel="next">older 500</a>]

get href
/w/index.php?title=Climate_change&action=history&offset=20190415130509%7C892572587&limit=500
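
Notice that the printed tag shows `&amp;` in the link while the value from `.get('href')` shows a plain `&`. The tag is printed as HTML markup, where ampersands are escaped, but the attribute value itself has already been decoded. A minimal sketch on a shortened version of the link:

```python
from bs4 import BeautifulSoup

link_html = (
    '<a class="mw-nextlink" '
    'href="/w/index.php?title=Climate_change&amp;limit=500">older 500</a>'
)
link_tag = BeautifulSoup(link_html, "html.parser").find("a")

print(str(link_tag))         # markup form, ampersand written as &amp;
print(link_tag.get("href"))  # attribute value, ampersand decoded
```

So the string from `.get('href')` is the one to use when building a URL for the next request.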



# Function that extracts the link to the next 500 revisions


In [91]:
def next_500_link(cur_link):
    get_html = requests.get(cur_link).text
    html_soup = BeautifulSoup(get_html, "html.parser")
    next_500 = html_soup.find_all(class_="mw-nextlink")
    next_500_href = next_500[0].get('href')
    return "https://en.wikipedia.org" + next_500_href

a = next_500_link("https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=&limit=500")
print(a)

https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20230422143521%7C1151198409&limit=500
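
The function above builds the full URL by gluing the site address onto the relative `href`. The standard library's `urllib.parse.urljoin` is one alternative that works out the site address from the page you are currently on. The two strings below are taken from this lesson; the variable names are chosen for this sketch.

```python
from urllib.parse import urljoin

page_url = (
    "https://en.wikipedia.org/w/index.php"
    "?title=Climate_change&action=history&offset=&limit=500"
)
relative_href = (
    "/w/index.php?title=Climate_change&action=history"
    "&offset=20230422143521%7C1151198409&limit=500"
)

# urljoin keeps the scheme and host of page_url and swaps in the new path and query.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

This gives the same result as the string concatenation here, and it also copes with links that are already absolute.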


# Experiment: following the links from the very first 500-revision history page to the next ones

In [94]:
import time

first_500 = "https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=&limit=500"
next_500_link_list = [first_500]
for i in range(5):
    next_500_link_list.append(next_500_link(next_500_link_list[i]))
    time.sleep(1)
print(next_500_link_list)

['https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20230422143521%7C1151198409&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20220309105655%7C1076103725&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20210719074058%7C1034331110&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20210103201733%7C998097208&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20200922074022%7C979696839&limit=500']


In [96]:
import time

first_500 = "https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=&limit=500"
next_500_link_list = [first_500]
current_link = first_500
for i in range(5):
    next_link = next_500_link(current_link)
    next_500_link_list.append(next_link)
    current_link = next_link
    time.sleep(1)
print(next_500_link_list)

['https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20230422143521%7C1151198409&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20220309105655%7C1076103725&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20210719074058%7C1034331110&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20210103201733%7C998097208&limit=500', 'https://en.wikipedia.org/w/index.php?title=Climate_change&action=history&offset=20200922074022%7C979696839&limit=500']
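
The loops above always take exactly five steps, and `next_500[0]` would fail on the last history page, where there is no "older 500" link. Below is a sketch of one way to stop when the link is missing. Everything named here (`collect_page_links`, `FAKE_PAGES`, the `example.org` URLs) is made up so the example runs without touching Wikipedia; the `fetch` argument stands in for a function that downloads a page's HTML.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_page_links(first_url, fetch, max_pages=5):
    """Follow "mw-nextlink" links from first_url, stopping when none is left."""
    links = [first_url]
    current = first_url
    for _ in range(max_pages):
        page_soup = BeautifulSoup(fetch(current), "html.parser")
        next_tag = page_soup.find(class_="mw-nextlink")
        if next_tag is None:  # last page: nothing older to follow
            break
        current = urljoin(current, next_tag.get("href"))
        links.append(current)
    return links


# Invented pages standing in for real history pages.
FAKE_PAGES = {
    "https://example.org/history?page=1":
        '<a class="mw-nextlink" href="/history?page=2">older</a>',
    "https://example.org/history?page=2":
        '<a class="mw-nextlink" href="/history?page=3">older</a>',
    "https://example.org/history?page=3": "<p>no older revisions</p>",
}

links = collect_page_links("https://example.org/history?page=1", FAKE_PAGES.__getitem__)
print(links)
```

A real `fetch` would call `requests.get(url).text` and should still pause between requests, as in the loops above.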


In [68]:

climate_soup = BeautifulSoup(climate_change_wiki_html, "html.parser")
climate_a_tag = climate_soup.find_all("a")
# pprint(climate_a_tag)
climate_next_link = climate_soup.find_all(class_="mw-nextlink")
pprint(climate_next_link)
next_link_href = climate_next_link[0].get('href')
# print(next_link_href)


[<a class="mw-nextlink" href="/w/index.php?title=Climate_change&amp;action=history&amp;offset=20190415130509%7C892572587&amp;limit=500" rel="next">older 500</a>,
 <a class="mw-nextlink" href="/w/index.php?title=Climate_change&amp;action=history&amp;offset=20190415130509%7C892572587&amp;limit=500" rel="next">older 500</a>]


AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
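
The error message points at the difference between `find_all()` and `find()`: `find_all()` returns a list-like `ResultSet`, which has no `.get`, while `find()` returns a single tag. A small sketch with two identical links, mimicking the duplicated "older 500" link in the output above (the `/older` href is a placeholder):

```python
from bs4 import BeautifulSoup

links_html = (
    '<a class="mw-nextlink" href="/older">older 500</a>'
    '<a class="mw-nextlink" href="/older">older 500</a>'
)
link_soup = BeautifulSoup(links_html, "html.parser")

all_matches = link_soup.find_all(class_="mw-nextlink")  # list of tags
first_match = link_soup.find(class_="mw-nextlink")      # one tag (or None)

try:
    all_matches.get("href")  # wrong: asking the whole list for an attribute
    result_set_has_get = True
except AttributeError:
    result_set_has_get = False

print(result_set_has_get)
print(first_match.get("href"))     # works: a single tag
print(all_matches[0].get("href"))  # also works: index into the list first
```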

In [None]:
climate_change_wiki_html = requests.get("https://en.wikipedia.org" + next_link_href).text  # .text gives the HTML string, not the Response object