## Starting Off

With a partner, answer the following question:

Is it legal to scrape data from websites?

# Advanced Webscraping: How to make sure you don't get blocked.

## Aims:

- Write scripts that can handle errors and minimize the likelihood of your IP address getting blocked.


## Agenda

- Talk about the legality of scraping
- Practice scraping
- Look at ways to programmatically avoid getting banned
- Set up the selenium webdriver
- Learn how to use Selenium

## 1. Check 200 status code
Check the HTTP status code as early as possible and proceed accordingly.

This is good:

~~~
if response.status_code == 200:
   #Proceed further
~~~

This is better, since bailing out early keeps the rest of the function free of extra nesting:

~~~
if response.status_code != 200:
  return False
~~~

In [None]:
for url in urls:
    page = requests.get(url)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
        # skip this url instead of processing a failed response
        continue
    
    # more code to process the results
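
Not every non-200 response deserves the same reaction. The sketch below is one possible way to decide what to do with a status code. The function name `classify_status` and the three labels are assumptions made for illustration; they are not part of `requests`.

```python
def classify_status(status_code):
    """Return 'ok', 'retry', or 'skip' for an HTTP status code."""
    if status_code == 200:
        return "ok"
    # 429 (Too Many Requests) and 5xx responses are often temporary
    if status_code == 429 or 500 <= status_code < 600:
        return "retry"
    # Anything else (404, 403, ...) is unlikely to succeed on a second try
    return "skip"


for code in (200, 404, 429, 503):
    print(code, classify_status(code))
```

Inside the loop you would call it as `classify_status(page.status_code)` and only parse the page when it returns `"ok"`.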

## 2. Never Trust HTML

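Especially if you can’t control it. Web scraping depends on the HTML DOM, so a simple change in an element or class name could break your entire script. The best way to deal with this is to check whether your selector actually returned anything: `select()` returns an empty list and `select_one()` returns `None` when nothing matches.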

~~~
page_count = soup.select('.pager-pages > li > a')
if page_count:
    # do your stuff
    pass
else:
    # ALERT!! Send notification to Admin
    pass

Here I am checking whether the CSS selector returned something legitimate. If it did, we proceed further.

In [22]:
for url in urls:
    page = requests.get(url)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
        # skip this url instead of parsing a failed response
        continue
    
    # more code to process the results
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")


## 3. Set headers

`requests` does not force you to send any particular headers, but some websites will not return anything useful unless certain headers are set. I once ran into a site where the HTML I saw in the browser was different from what my script received. So it is always good to make your requests look as legitimate as you can. The least you should do is set a User-Agent.

~~~
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

response = requests.get(url, headers=headers, timeout=5)

~~~

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers = headers)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
        # skip this url instead of parsing a failed response
        continue
    
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 4. Set timeout

One issue with `requests` is that, if you don’t set a **timeout**, a request can wait indefinitely for a server that never responds. That might be acceptable in a few cases, but not in most. Therefore, it’s always good to set a timeout value for each request. Here I am setting the timeout to 5 seconds.

~~~
response = requests.get(url, headers=headers, timeout=5)
~~~

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers = headers, timeout=5)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
        # skip this url instead of parsing a failed response
        continue
    
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 5. Exception handling

It is always good to implement exception handling. It not only keeps the script from exiting unexpectedly but also lets you log errors and informational messages. When using Python `requests` I prefer to catch exceptions like this:

~~~
try:
    # your logic is here

except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program") 
~~~

Check the very last one. It tells the program that if someone terminates it with Ctrl+C, it should wrap things up first and then exit. This is useful if you are collecting information that you want to dump to a file at exit time.
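
The "wrap things up first" idea can be sketched with `try`/`finally`, which runs on a normal exit, on Ctrl+C, and on any other exception. The names `scrape_all` and `fake_fetch` and the JSON output file are assumptions for illustration; `fake_fetch` stands in for a real request and simulates Ctrl+C on the third URL.

```python
import json
import os
import tempfile


def scrape_all(urls, fetch, output_path):
    """Collect fetch(url) results and always dump what was collected so far."""
    collected = []
    try:
        for url in urls:
            collected.append(fetch(url))
    except KeyboardInterrupt:
        print("Interrupted, saving partial results")
    finally:
        # Runs on normal exit, on Ctrl+C, and on any other exception
        with open(output_path, "w") as f:
            json.dump(collected, f)
    return collected


def fake_fetch(url):
    # Stand-in for a real request; pretends the user hit Ctrl+C on the third URL
    if url.endswith("3"):
        raise KeyboardInterrupt
    return {"url": url}


demo_urls = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]
demo_path = os.path.join(tempfile.mkdtemp(), "partial.json")
results = scrape_all(demo_urls, fake_fetch, demo_path)
```

The two pages fetched before the interruption end up in the file, so the work done so far is not lost.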

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    try:
        page = requests.get(url, headers=headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(page.status_code)
            continue
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
        continue
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
        continue
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
        continue
    except KeyboardInterrupt:
        print("Someone closed the program")
        break

    # the request succeeded, so parse the page contents into the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')

    # check to make sure we have items of that class
    if items:
        # continue processing the data
        pass
    else:
        print("Data is coming back blank")

This code is starting to get long and hard to read. So let's start to modularize it.  

In [1]:
def get_page(url, headers=None):
    # returns the response, or None if the request itself failed
    page = None
    try:
        page = requests.get(url, headers=headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(page.status_code)
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program")

    return page
    

We can now replace a chunk of our loop with this function:

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    #use our new function to process each url
    page = get_page(url)
    if page is None:
        # the request failed, so move on to the next url
        continue

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 6. Regulate your request pace

Many websites limit how many requests you can make within a minute, hour, or day. You want to be aware of that limit and adjust your script to account for it.

One option is the `sleep()` function from the `time` module, which pauses your script for a set amount of time.

~~~
import time
 
 
## Start loop ##
for url in urls:

    # try to make the request here

    #### Delay for 1 second ####
    time.sleep(1)
        
~~~
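
Some sites publish their preferred pace in `robots.txt`. The sketch below reads a `Crawl-delay` and checks whether a path may be fetched, using the standard library's `urllib.robotparser`. The `ROBOTS_TXT` content and the `example.com` URLs are made up for illustration; a real script would download the file from the site it is scraping.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Fall back to a one-second pause when the site does not state a delay
delay = parser.crawl_delay("*") or 1
allowed = parser.can_fetch("*", "https://example.com/products")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(delay, allowed, blocked)
```

The resulting `delay` can then be passed to `time.sleep()` in the loop in place of a hard-coded value.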

In [None]:
import time
 
 
## Start loop ##
for url in urls:
    print("Current date & time " + time.strftime("%c"))

    #use our new function to process each url
    page = get_page(url)
    if page is None:
        # the request failed, so move on to the next url
        continue

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")
    
    time.sleep(1)
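
A fixed one-second pause is the simplest option. When a request fails, a common refinement is to wait longer after each failed attempt (exponential backoff) and add a little randomness so the requests do not arrive on a perfectly regular beat. This is a minimal sketch, assuming a `fetch` callable that returns `None` on failure; `fetch_with_backoff`, its parameters, and the `flaky` stand-in are hypothetical names.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after each failure wait base_delay * 2**attempt plus jitter."""
    for attempt in range(retries):
        result = fetch(url)
        if result is not None:
            return result
        # Double the wait each time and add up to 0.5s of random jitter
        sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return None


# Demo with a stand-in fetch that fails twice before succeeding.
# The waits are recorded instead of slept so the demo finishes instantly.
attempts = []
waits = []


def flaky(url):
    attempts.append(url)
    return "page" if len(attempts) == 3 else None


result = fetch_with_backoff(flaky, "http://example.com", retries=5, sleep=waits.append)
print(result, len(waits))
```

In a real run you would leave `sleep` at its default so the script actually pauses between attempts.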

## 7. Save as you go

You might run into an issue halfway through your scrape and your script breaks. So you want to make sure you are saving your data as you go.  

~~~ 
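import csv
import os
...
# open() does not expand "~", so resolve the home directory explicitly
with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.writer(f)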

    # collected_items = [
    #   ["Product #1", "10", "http://example.com/product-1"],
    #   ["Product #2", "25", "http://example.com/product-2"],
    #   ...
    # ]

    for item_property_list in collected_items:
        writer.writerow(item_property_list)
~~~
~~~
import csv
import os
...
field_names = ["Product Name", "Price", "Detail URL"]
# open() does not expand "~", so resolve the home directory explicitly
with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, field_names)

    # collected_items = [
    #   {
    #       "Product Name": "Product #1",
    #       "Price": "10",
    #       "Detail URL": "http://example.com/product-1"
    #   },
    #   ...
    # ]

    # Write a header row
    writer.writeheader()

    for item_property_dict in collected_items:
        writer.writerow(item_property_dict)
~~~
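
Both snippets above write everything at the end, once `collected_items` is complete. To save as you go, one option is to append each row as soon as it is scraped and write the header only when the file is new. This is a sketch under that assumption; `append_row` is a hypothetical helper, and the temporary output path is used only so the example runs anywhere.

```python
import csv
import os
import tempfile

FIELD_NAMES = ["Product Name", "Price", "Detail URL"]


def append_row(path, row):
    """Append one dict to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, FIELD_NAMES)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


csv_path = os.path.join(tempfile.mkdtemp(), "output.csv")
append_row(csv_path, {"Product Name": "Product #1", "Price": "10",
                      "Detail URL": "http://example.com/product-1"})
append_row(csv_path, {"Product Name": "Product #2", "Price": "25",
                      "Detail URL": "http://example.com/product-2"})
```

Because the file is reopened in append mode for every row, whatever was written before a crash stays on disk.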

In [20]:
import csv
import os
import time

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

# open() does not expand "~", so resolve the home directory once
output_path = os.path.expanduser("~/Desktop/output.csv")

for url in urls:
    print("Current date & time " + time.strftime("%c"))

    #use our new function to process each url
    page = get_page(url)
    if page is None:
        # the request failed, so move on to the next url
        continue

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")
    
    #Saving your data as you go
    
    # Option 1: append the data to a csv file ("a" keeps the rows from earlier urls)
    with open(output_path, "a", newline="") as f:
        writer = csv.writer(f)
        for item in items:
            writer.writerow([item.get_text(strip=True)])
        
    # Option 2: Inserting the data into a DB
    # This code uses a theoretical module, sql_helpers.
    # The functions below are examples and will not run, so they are commented out.
    # import sql_helpers as sql
    #
    # db = sql.create_connection()
    # for item in items:
    #     data = item
    #     query = "INSERT INTO table_name VALUES (%s,%s,%s,%s)"
    #     sql.insert_data(db, query, data)
        
    #Taking a one second pause to help slow down your requests 
    time.sleep(1)


## More Resources 
- [More advanced issues](https://blog.hartleybrody.com/web-scraping-cheat-sheet/)
- [Request Advanced Usage](http://docs.python-requests.org/en/master/user/advanced/#)

Web scraping with Python often requires no more than the use of the Beautiful Soup module to reach the goal. Beautiful Soup is a popular Python library that makes web scraping by traversing the DOM (document object model) easier to implement.

## Applied: Scraping Amazon's Best Sellers list:


Amazon keeps track of the best sellers for 41 different categories of products. We want to grab that data from Amazon so that we can keep track of which products are on that list and stock our mom and pop store with them.  


Deliverable: a file that contains all of the products on Amazon's best seller list. 

```
[{'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'},
 {'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'}]
```

In [2]:
import requests
from bs4 import BeautifulSoup as BS
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

First we start by grabbing the page where all of the best seller lists are located.

In [3]:

url="https://www.amazon.com/Best-Sellers/zgbs"

#let's use the function we already created
page = get_page(url)
page

<Response [200]>

In [4]:
soup = BS(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr">
 <head>
  <link href="https://images-na.ssl-images-amazon.com/images/I/21doGy6C0kL._RC|01KD4yyr5LL.css_.css?AUIClients/ZeitgeistPageAssets-zeitgeistHome" rel="stylesheet"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/41gCbfiTdaL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01Alnvtt1zL.css,21mOLw+nYYL.css,01L8Y-JFEhL.css_.css?AUIClients/AmazonUI#us.not-trident.206347-T1" rel="stylesheet"/>
  <script>
   (function(g,h,R,z){function G(a){x&&x.tag&&x.tag(q(":","aui",a))}function v(a,b){x&&x.count&&x.count("aui:"+a,0===b?0:b||(x.count("aui:"+a)||0)+1)}function n(a){try{return a.test(navigator.userAgent)}catch(b){return

Now that we have this page, we want to find the urls of all the other pages to scrape those.  

In [5]:
#using the select statement to find the elements containing each url
urls = soup.select('ul#zg_browseRoot a')
# print(len(urls))
# urls
print(urls[0].text, '\n',urls[0]['href'])

Amazon Devices & Accessories 
 https://www.amazon.com/Best-Sellers/zgbs/amazon-devices


In [8]:
#list of all best seller urls
#run this cell only once: after it runs, urls holds plain strings,
#and re-running it raises "TypeError: string indices must be integers"
urls = [url['href'] for url in urls]

Select a URL (product category) that you want to investigate, and let's build our script to parse one page. Then we can apply it to all of the pages.

In [9]:
urls[3]

'https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps'

In [54]:
url=urls[13]

apps = get_page(url)
apps

<Response [200]>

In [56]:
app_soup = BS(apps.content, 'html.parser')
print(app_soup.prettify())

<!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo">
 <head>
  <script>
   var aPageStart = (new Date()).getTime();
  </script>
  <meta charset="utf-8"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/21doGy6C0kL._RC|01WTbMujHuL.css_.css?AUIClients/ZeitgeistPageAssets-zeitgeistList" rel="stylesheet"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/41gCbfiTdaL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01Alnvtt1zL.css,21mOLw+nYYL.css,01L8Y-JFEhL.css_.css?AUIClients/AmazonUI#us.not-trident.206347-T1" rel="stylesheet"/>
  <script>
   (function(g,h,R,z){function G(a){x&&x.tag&&x.tag(q(":","aui",a))}function v(a,b){x&&x

In [57]:
app_soup.select('title')

[<title dir="ltr">Amazon Best Sellers: Best Clothing, Shoes &amp; Jewelry</title>]

In [19]:
def do_everything(url):
    page = get_page(url)
    soup = BS(page.content,'html.parser')
    print(soup.select('title')[0].text)
    ind = retrieve_prod_number(soup)
    names = retrieve_prod_names(soup)
    ratings = retrieve_prod_ratings(soup)
    num_ratings = retrieve_prod_num_ratings(soup)
    prices = retrieve_prod_price_range(soup)
    return make_dataframe(ind,names,ratings,num_ratings,prices)

In [22]:
do_everything(urls[7])

Amazon Best Sellers: Best Baby


| | Rank | Product Name | Rating | Number of Ratings | Low Price | High Price |
|---|---|---|---|---|---|---|
| 0 | 1 | Pampers Sensitive Water-Based Baby Diaper Wipe... | 4.1 | 2160 | $14.59 | $14.59 |
| 1 | 2 | Nuby Ice Gel Teether Keys | 4.4 | 3816 | $3.88 | $3.88 |
| 2 | 3 | Mommy's Helper Outlet Plugs, 36 Count | 4.5 | 3820 | $2.99 | $2.99 |
| 3 | 4 | Haakaa Manual Breast Pump 4oz/100ml,2019 New S... | 4.4 | 1268 | $12.98 | $12.98 |
| 4 | 5 | The First Years Stack Up Cups | 4.8 | 3715 | $3.99 | $3.99 |
| 5 | 6 | Dr. Brown's Bottle Brush, Blue | 3.9 | 3163 | $3.99 | $3.99 |
| 6 | 7 | Evenflo Position and Lock Wood Gate | 3.3 | 5591 | $9.98 | $9.98 |
| 7 | 8 | Pampers Sensitive Water-Based Baby Diaper Wipe... | 4.1 | 2160 | $12.93 | $12.93 |
| 8 | 9 | Munchkin Miracle 360 Trainer Cup, Green/Blue, ... | 4.4 | 4909 | $10.22 | $10.22 |
| 9 | 10 | ChoiceRefill Compatible with Diaper Genie Pail... | 4.4 | 1613 | $14.99 | $14.99 |


In [10]:
import pandas as pd

In [16]:
def make_dataframe(ind,names,ratings,num_ratings,prices):
    # build the frame in one call; DataFrame.append was removed in pandas 2.0
    # note: all of the lists must have the same length
    return pd.DataFrame({'Rank': ind,
                         'Product Name': names,
                         'Rating': ratings,
                         'Number of Ratings': num_ratings,
                         'Low Price': [low for low, high in prices],
                         'High Price': [high for low, high in prices]})

In [210]:
make_dataframe(ind,names,ratings,num_ratings,prices)

| | Rank | Product Name | Rating | Number of Ratings | Low Price | High Price |
|---|---|---|---|---|---|---|
| 0 | 1 | VIFUUR Water Sports Shoes Barefoot Quick-Dry A... | 4.0 | 3585 | $9.88 | $13.68 |
| 1 | 2 | ASICS Men's GEL Venture 5 Running Shoe | 4.4 | 10066 | $54.50 | $69.95 |
| 2 | 3 | Womens and Mens Kids Water Shoes Barefoot Quic... | 4.0 | 2648 | $6.99 | $13.58 |
| 3 | 4 | iGENJUN Women's Summer Sleeveless Pleated Back... | 4.3 | 1398 | $9.99 | $15.99 |
| 4 | 5 | ASICS Women's GEL-Venture 5 Running Shoe | 4.3 | 6534 | $59.95 | $129.99 |
| 5 | 6 | DANVOUY Womens T Shirt Casual Cotton Short Sle... | 4.1 | 387 | $11.99 | $11.99 |
| 6 | 7 | GRECERELLE Women's Casual Loose Pocket Long Dr... | 4.2 | 4159 | $18.99 | $23.99 |
| 7 | 8 | Custer's Night High Waist Out Pocket Yoga Shor... | 4.3 | 404 | $13.59 | $17.99 |
| 8 | 9 | SIMARI Womens and Mens Water Shoes Quick-Dry A... | 4.3 | 1605 | $11.89 | $13.98 |
| 9 | 10 | TIJN Blue Light Blocking Glasses Square Nerd E... | 4.0 | 2173 | $16.99 | $16.99 |


Inspect the actual webpage to determine the data you want and the corresponding elements you want to parse out. Then use that element tag or class to pull those elements out of the page. 

In [11]:
# your code here
def retrieve_prod_number(soup):
    return [prod.text[1:] for prod in soup.select('.zg-badge-text')]

In [201]:
ind = retrieve_prod_number(app_soup)

In [12]:
def retrieve_prod_names(soup):
    product_names = []
    all_text = [prod.text.strip() for prod in soup.select('.a-link-normal')]
    for i in range(len(all_text)-6):
        if i%4 == 0:
            product_names.append(all_text[i])
    return product_names

In [203]:
names = retrieve_prod_names(app_soup)
# names
# soup.select('.a-link-normal')

In [13]:
def retrieve_prod_ratings(soup):
    product_ratings = []
    all_text = [prod.text.strip() for prod in soup.select('.a-link-normal')]
    for i in range(len(all_text)-6):
        if i%4 == 1:
            product_ratings.append(float(all_text[i].split()[0]))
    return product_ratings

In [205]:
ratings = retrieve_prod_ratings(app_soup)
len(ratings)

50

In [14]:
def retrieve_prod_num_ratings(soup):
    product_num_ratings = []
    all_text = [prod.text.strip() for prod in soup.select('.a-link-normal')]
    for i in range(len(all_text)-6):
        if i%4 == 2:
            product_num_ratings.append(int(all_text[i].replace(',','')))
    return product_num_ratings

In [207]:
num_ratings = retrieve_prod_num_ratings(app_soup)
len(num_ratings)

50

In [15]:
def retrieve_prod_price_range(soup):
    product_price_range = []
    all_text = [prod.text.strip() for prod in soup.select('.a-link-normal')]
    for i in range(len(all_text)-6):
        if i%4 == 3:
            prices = all_text[i].split(' - ')
            if len(prices) == 1:
                prices.append(prices[0])
            product_price_range.append((prices[0],prices[1]))
    return product_price_range

In [209]:
prices = retrieve_prod_price_range(app_soup)
# prices
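
The price tuples above are still strings such as `'$9.88'`. If you want to sort or average them, they need to become numbers. This is a minimal sketch, assuming US-style prices with a leading `$` and optional thousands commas; `price_to_float` and `numeric_price_range` are hypothetical helper names.

```python
def price_to_float(price_text):
    """Convert a string such as '$1,234.50' to a float."""
    return float(price_text.replace("$", "").replace(",", ""))


def numeric_price_range(price_range):
    """Convert a (low, high) tuple of price strings to a tuple of floats."""
    low, high = price_range
    return price_to_float(low), price_to_float(high)


print(numeric_price_range(("$9.88", "$13.68")))
```

It could be applied to the whole list with `[numeric_price_range(p) for p in prices]`.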

Now that you can access all the data you need, let's put this into a loop so that we can process all of the products and create one list with all of the data.

In [81]:
# your code here

['VIFUUR Water Sports Shoes Barefoot Quick-Dry Aqua Yoga Socks Slip-on for Men Women Kids',
 '4.0 out of 5 stars',
 '3,585',
 '$9.88 - $13.68',
 "ASICS Men's GEL Venture 5 Running Shoe",
 '4.4 out of 5 stars',
 '10,066',
 '$54.50 - $69.95',
 'Womens and Mens Kids Water Shoes Barefoot Quick-Dry Aqua Socks for Beach Swim Surf Yoga Exercise',
 '4.0 out of 5 stars',
 '2,648',
 '$6.99 - $13.58',
 "iGENJUN Women's Summer Sleeveless Pleated Back Closure Casual Tank Tops",
 '4.3 out of 5 stars',
 '1,398',
 '$9.99 - $15.99',
 "ASICS Women's GEL-Venture 5 Running Shoe",
 '4.3 out of 5 stars',
 '6,534',
 '$59.95 - $129.99',
 'DANVOUY Womens T Shirt Casual Cotton Short Sleeve V-Neck Graphic T-Shirt Tops Tees',
 '4.1 out of 5 stars',
 '387',
 '$11.99',
 "GRECERELLE Women's Casual Loose Pocket Long Dress Short Sleeve Split Maxi Dresses",
 '4.2 out of 5 stars',
 '4,159',
 '$18.99 - $23.99',
 "Custer's Night High Waist Out Pocket Yoga Short Tummy Control Workout Running 4 Way Stretch Yoga Leggings",
 

Now that we have each individual part working, let's wrap this all up in a function that we can run for each product category.


In [None]:
def parse_bestseller_cat(___):
    #your code here
    
    return ___
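
One possible building block for that function is a helper that turns the flat list of link texts into one dictionary per product. This is only a sketch: it relies on the same assumption as the `i%4` checks above, namely that every product contributes exactly four strings in a fixed order. `group_products` and its field names are hypothetical.

```python
def group_products(all_text, fields=("name", "rating", "num_ratings", "price")):
    """Split a flat list of strings into one dict per product."""
    size = len(fields)
    # Ignore a trailing partial group so every dict has all of the fields
    usable = len(all_text) - len(all_text) % size
    return [dict(zip(fields, all_text[i:i + size])) for i in range(0, usable, size)]


sample = [
    "ASICS Men's GEL Venture 5 Running Shoe", "4.4 out of 5 stars", "10,066", "$54.50 - $69.95",
    "ASICS Women's GEL-Venture 5 Running Shoe", "4.3 out of 5 stars", "6,534", "$59.95 - $129.99",
    "leftover text",
]
products = group_products(sample)
print(products[0]["name"], products[0]["price"])
```

If a product is missing one of its four strings, every later group shifts by one, so it is worth checking a few rows by hand.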

The next step is to add this function to the larger script from above.

## Selenium

The Selenium package automates web browser interaction from Python, which helps with pages that require logging in or clicking before the data you want appears.

In [None]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [None]:
driver = webdriver.Chrome()
driver.get("https://www.instagram.com/accounts/login/")


In [None]:
username = ''
pw = ''

In [None]:
from selenium.common.exceptions import TimeoutException

#the find_element_by_* helpers were removed in Selenium 4.3, so use find_element(By...., ...) instead

#find the element where you input your email
email = driver.find_elements(By.CSS_SELECTOR, 'form input')[0]

#find the element where you input your password
password = driver.find_elements(By.CSS_SELECTOR, 'form input')[1]

#send your keys to those elements
email.send_keys(username)
password.send_keys(pw)

#find the button to login
login = driver.find_element(By.XPATH, '//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[3]/button')

#have the program 'click' on the login button
login.click()


#looking for an interstitial page
try: 
    not_now = WebDriverWait(driver, 15).until(
        lambda d: d.find_element(By.XPATH, '//button[text()="Not Now"]')
    )
    not_now.click()
except TimeoutException: 
    #no interstitial page appeared within 15 seconds, so carry on
    pass

#now you are logged in, navigate to a new page
driver.get("https://www.instagram.com/foodandprobability")

### Transitioning to Beautiful Soup
Beautiful Soup remains the best way to traverse the DOM and scrape the data. After using Selenium to handle the interactive parts, it is time to ask Beautiful Soup to grab the data that you need.