<img src="images\logo_datai.png" width="400" style="float: right;">

https://www.unav.edu/web/instituto-de-ciencia-de-los-datos-e-inteligencia-artificial<br>
Author: Pablo Urruchi Mohino

# Web scraping "firewalls"

__If you try to scrape Amazon, you'll get this error message:__<br>
"To discuss automated access to Amazon data please contact api-services-support@amazon.com."

In this notebook we will try to work around this restriction.

In [1]:
import pandas as pd

In [2]:
import requests
r = requests.get('https://www.amazon.com/AmazonBasics-Performance-Alkaline-Batteries-Count/dp/B00LH3DMUO/ref=sxin_3_ac_d_rm?ac_md=0-0-YW1hem9uYmFzaWNz-ac_d_rm&cv_ct_cx=amazonbasics&dchild=1&keywords=amazonbasics&pd_rd_i=B00LH3DMUO&pd_rd_r=ef8a3adc-4ea7-4599-a8bc-ccece0e64df3&pd_rd_w=ixItm&pd_rd_wg=K5Fe8&pf_rd_p=9349ffb9-3aaa-476f-8532-6a4a5c3da3e7&pf_rd_r=KDBKF2Q1Y984249CR30N&qid=1607012986&sr=1-1-12d4272d-8adb-4121-8624-135149aa9081&th=1')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

In [5]:
# The request goes through, but Amazon returns this block message instead of the product page
'''
To discuss automated access to Amazon data please contact 
api-services-support@amazon.com. For information about migrating to our 
APIs refer to our Marketplace APIs at 
https://developer.amazonservices.com/ref=rm_c_sv, or our Product 
Advertising API at 
https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac
for advertising use cases.
'''
# soup

'\nTo discuss automated access to Amazon data please contact \napi-services-support@amazon.com. For information about migrating to our \nAPIs refer to our Marketplace APIs at \nhttps://developer.amazonservices.com/ref=rm_c_sv, or our Product \nAdvertising API at \nhttps://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac\nfor advertising use cases.\n'
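
Because the block page comes back as ordinary text, it can be detected before any parsing is attempted. The following is a minimal sketch, assuming the notice always contains the sentence quoted above; `BLOCK_MARKER` and `looks_blocked` are hypothetical names introduced only for illustration.

```python
# Hypothetical marker: the opening words of the notice Amazon returned above.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(html: str) -> bool:
    """Return True if the response text contains the automated-access notice."""
    return BLOCK_MARKER in html


# Two made-up response bodies, used only to exercise the helper.
blocked_page = "To discuss automated access to Amazon data please contact api-services-support@amazon.com."
normal_page = "<html><body><div data-hook='review'>A review</div></body></html>"

print(looks_blocked(blocked_page))
print(looks_blocked(normal_page))
```

In the notebook this check would be applied to `r.text` right after the request, so that an empty result is not mistaken for a page without reviews.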

__By including proper headers (which are not always easy to find) you can actually bypass the restriction and scrape Amazon.__

Note: These headers are for Amazon.com. To scrape Amazon.es you need to change the `Host` header accordingly. The same applies to the language (`Accept-Language`).
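
The per-domain adjustment mentioned in the note can be sketched as a small helper that takes the host and the language as arguments. This is only an illustration: `build_headers` is a hypothetical name, the dictionary is reduced to three fields, and the Spanish `Accept-Language` value is an assumption rather than something verified against Amazon.es.

```python
def build_headers(host: str, accept_language: str) -> dict:
    """Build a reduced set of browser-like headers for one Amazon domain."""
    return {
        'Host': host,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'Accept-Language': accept_language,
    }


headers_com = build_headers('www.amazon.com', 'en-US,en;q=0.5')
# Assumed value for the Spanish site; adjust after inspecting a real browser request.
headers_es = build_headers('www.amazon.es', 'es-ES,es;q=0.8,en;q=0.5')

print(headers_es['Host'])
```

Keeping the domain-specific values as parameters avoids maintaining one hand-edited dictionary per site.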

In [6]:
import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.amazon.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}


In [7]:
r = requests.get('https://www.amazon.com/AmazonBasics-Performance-Alkaline-Batteries-Count/dp/B00LH3DMUO/ref=sxin_3_ac_d_rm?ac_md=0-0-YW1hem9uYmFzaWNz-ac_d_rm&cv_ct_cx=amazonbasics&dchild=1&keywords=amazonbasics&pd_rd_i=B00LH3DMUO&pd_rd_r=ef8a3adc-4ea7-4599-a8bc-ccece0e64df3&pd_rd_w=ixItm&pd_rd_wg=K5Fe8&pf_rd_p=9349ffb9-3aaa-476f-8532-6a4a5c3da3e7&pf_rd_r=KDBKF2Q1Y984249CR30N&qid=1607012986&sr=1-1-12d4272d-8adb-4121-8624-135149aa9081&th=1', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

__This is the "soup" object, in which you will be able to find the selectors needed to place each element in a pandas DataFrame.__

In [15]:
# soup

### Goal

Get a dataframe with the following structure:
    
    column 1 = review stars
    column 2 = review title
    column 3 = review date
    column 4 = review text
    column 5 = verified purchase
    
<img src="images\result.png" width="550" style="float: left;">

In [22]:
review_rating = soup.find_all('div',attrs = {"data-hook":"review"})


In [24]:
review_rating[0].find('span',attrs = {"data-hook":"review-date"}).text

'Reviewed in the United States on January 21, 2016'

In [26]:
review_rating[0].find('i',attrs = {"data-hook":"review-star-rating"}).text

'4.0 out of 5 stars'

In [28]:
review_rating[0].find('span',attrs = {"data-hook":"format-strip-linkless"}).text

'Size: 8 Count (Pack of 1)'

In [29]:
review_rating[0].find('span',attrs = {"data-hook":"avp-badge-linkless"}).text

'Verified Purchase'

In [7]:
review_rating = soup.find_all(class_="review-rating")
print(review_rating[0].prettify())

<i class="a-icon a-icon-star a-star-4 review-rating" data-hook="review-star-rating">
 <span class="a-icon-alt">
  4.0 out of 5 stars
 </span>
</i>


In [8]:
review_rating[0].get_text()

'4.0 out of 5 stars'

In [9]:
review_title = soup.find_all(class_="review-title")
print(review_title[0].prettify())

<a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/RC2VPIP8BBP8O/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&amp;ASIN=B00LH3DMUO">
 <span>
  7 brands, 1 test: results
 </span>
</a>


In [10]:
review_title_def=(review_title[0].get_text()).replace('\n','')
review_title_def

'7 brands, 1 test: results'

In [11]:
review_date = soup.find_all(class_="review-date")
print(review_date[0].prettify())

<span class="a-size-base a-color-secondary review-date" data-hook="review-date">
 Reviewed in the United States on January 21, 2016
</span>



In [12]:
review_date_def=(review_date[0].get_text()).replace('Reviewed in the United States on ','')
review_date_def

'January 21, 2016'
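
For later sorting or filtering it can help to turn the date string into a real date object. A minimal sketch, assuming the text always ends with " on <Month> <day>, <year>" and that month names are in English (`strptime` with `%B` depends on the active locale); `parse_review_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_review_date(raw: str) -> date:
    """Extract the trailing 'Month day, year' part and convert it to a date."""
    # Keep only what follows the last ' on ', whatever country is named before it.
    date_part = raw.strip().rsplit(' on ', 1)[-1]
    return datetime.strptime(date_part, '%B %d, %Y').date()


parsed = parse_review_date('Reviewed in the United States on January 21, 2016')
print(parsed.isoformat())  # 2016-01-21
```

Splitting on the last " on " rather than replacing a fixed prefix means the helper does not depend on the country name in the sentence.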

In [13]:
review_text = soup.find_all(class_="reviewText")
print(review_text[0].prettify())

<div aria-expanded="false" class="a-expander-content reviewText review-text-content a-expander-partial-collapse-content" data-hook="review-collapsed">
 <span>
  AmazonBasics batteries are quite good in terms of capacity -- pretty much tied for the top spot compared to the other 6 brands I've tested, but other brands have the edge in capacity per dollar. When I computed value by dividing capacity by the cost per battery of the cheapest package size, they took a respectable third place, and they have the advantage of being a trusted name compared to the value leader.  In the images I have attached a graph and a table summarizing my test results for the 7 types I've tested, but if you'd like to know more about how I test ... on to the in-dept test &amp; review.
  <br>
   <br/>
   I've been on a bit of a quest to test all of the top-selling aaa batteries on Amazon in a repeatable, precise way. This means the same equipment, same environmental conditions, and same slots in the equipment wil

In [14]:
review_text_def = ((review_text[0].get_text()).replace('\n\n  ','')).replace('\n\n','')
review_text_def

"\nAmazonBasics batteries are quite good in terms of capacity -- pretty much tied for the top spot compared to the other 6 brands I've tested, but other brands have the edge in capacity per dollar. When I computed value by dividing capacity by the cost per battery of the cheapest package size, they took a respectable third place, and they have the advantage of being a trusted name compared to the value leader.  In the images I have attached a graph and a table summarizing my test results for the 7 types I've tested, but if you'd like to know more about how I test ... on to the in-dept test & review.I've been on a bit of a quest to test all of the top-selling aaa batteries on Amazon in a repeatable, precise way. This means the same equipment, same environmental conditions, and same slots in the equipment will be used for each test. For each test, I fully discharge 3 batteries in my Opus BT-C2000 battery analyzer at rates of 100 ma, 200 ma, and 400 ma (discharge rate affects usable capac

In [15]:
review_rating = [rr.get_text() for rr in review_rating]
review_ratings=[]
for rrat in review_rating:    
    rrat = rrat[:(rrat.find(' out'))]
    review_ratings.append(rrat)
review_ratings

['4.0',
 '5.0',
 '5.0',
 '5.0',
 '5.0',
 '5.0',
 '5.0',
 '5.0',
 '5.0',
 '4.0',
 '5.0',
 '4.0',
 '5.0']
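
The ratings above are still strings. If numeric values are needed (for example to compute an average), a conversion along these lines could be used. It is a sketch that assumes every label has the form "X out of 5 stars"; `rating_to_float` is a hypothetical helper name.

```python
def rating_to_float(label: str) -> float:
    """Convert a label such as '4.0 out of 5 stars' to the number 4.0."""
    # float() raises ValueError if the label does not start with a number,
    # which makes unexpected markup visible instead of silently truncating text.
    return float(label.split(' out of ')[0])


labels = ['4.0 out of 5 stars', '5.0 out of 5 stars']
numeric_ratings = [rating_to_float(label) for label in labels]
print(numeric_ratings)
```

Note that `str.find` returns -1 when the substring is missing, so slicing with its result would quietly drop the last character; splitting and converting fails loudly instead.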

In [16]:
review_titles=[]
for rt in review_title:    
    rt = rt.get_text().replace('\n','')
    review_titles.append(rt)
review_titles

['7 brands, 1 test: results',
 'Good value for money and not too far behind other reputed brands in terms of longevity',
 'Lasting by actual count 3 times as long as Eveready or Duracell',
 'good value on basic batteries',
 'Great product and long lasting!',
 'They’re batteries…',
 'Battery please!',
 'Best choice for longevity, low price of all competitors',
 'Great Value for Money',
 'Reasonable value AAA batteries',
 'Muy duraderas',
 'pleased with items so far.',
 'Perfect']

In [17]:
review_date1=[]
for rdate1 in review_date:    
    rdate1 = rdate1.get_text()
    review_date1.append(rdate1)
review_date1

review_dates=[]
for rdate2 in review_date1:    
    rdate2 = rdate2[(rdate2.find('on ')+3):]
    review_dates.append(rdate2)
review_dates

['January 21, 2016',
 'November 4, 2022',
 'October 18, 2022',
 'October 24, 2022',
 'November 11, 2022',
 'November 13, 2022',
 'November 7, 2022',
 'October 29, 2022',
 'October 26, 2018',
 'June 5, 2019',
 'August 13, 2019',
 'April 17, 2019',
 'August 8, 2019']

In [18]:
review_texts=[]
for rtext in review_text:    
    rtext = rtext.get_text().replace('\n','').lstrip()
    review_texts.append(rtext)
review_texts

["AmazonBasics batteries are quite good in terms of capacity -- pretty much tied for the top spot compared to the other 6 brands I've tested, but other brands have the edge in capacity per dollar. When I computed value by dividing capacity by the cost per battery of the cheapest package size, they took a respectable third place, and they have the advantage of being a trusted name compared to the value leader.  In the images I have attached a graph and a table summarizing my test results for the 7 types I've tested, but if you'd like to know more about how I test ... on to the in-dept test & review.I've been on a bit of a quest to test all of the top-selling aaa batteries on Amazon in a repeatable, precise way. This means the same equipment, same environmental conditions, and same slots in the equipment will be used for each test. For each test, I fully discharge 3 batteries in my Opus BT-C2000 battery analyzer at rates of 100 ma, 200 ma, and 400 ma (discharge rate affects usable capaci
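
Removing newlines outright can glue neighbouring sentences together. An alternative, shown here as a sketch on a made-up string, is to collapse every run of whitespace into a single space; `normalize_whitespace` is a hypothetical helper name.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()


# Made-up sample imitating the layout of text returned by get_text().
sample = "\n  First paragraph.\n\n   Second paragraph.  \n"
cleaned = normalize_whitespace(sample)
print(cleaned)  # First paragraph. Second paragraph.
```

This only helps where the original text has whitespace between sentences; sentences that are already adjacent in the HTML stay adjacent.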

In [19]:
review_ver_purchase = soup.find_all(class_="a-size-mini a-color-state a-text-bold")
review_ver_purchases = [vp.get_text() for vp in review_ver_purchase]
review_ver_purchases

['Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase',
 'Verified Purchase']
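
Collecting each field in a separate page-wide list only lines up if every review has every field. A review without the "Verified Purchase" badge would make this list shorter than the others. One way around that, sketched here on a small hand-written HTML sample rather than the real page, is to loop over each review container and look up the fields inside it. The `data-hook` values are the ones used earlier in this notebook; `sample_html`, `sample_soup` and `rows` are names introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample: the second review deliberately has no badge.
sample_html = """
<div data-hook="review">
  <i data-hook="review-star-rating"><span>4.0 out of 5 stars</span></i>
  <span data-hook="avp-badge-linkless">Verified Purchase</span>
</div>
<div data-hook="review">
  <i data-hook="review-star-rating"><span>5.0 out of 5 stars</span></i>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
for block in sample_soup.find_all('div', attrs={'data-hook': 'review'}):
    stars = block.find('i', attrs={'data-hook': 'review-star-rating'})
    badge = block.find('span', attrs={'data-hook': 'avp-badge-linkless'})
    rows.append({
        # find() returns None when the element is missing, so guard each lookup.
        'review_stars': stars.get_text(strip=True) if stars else None,
        'verified_purchase': badge is not None,
    })

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, and every row keeps its fields aligned even when one of them is missing.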

__Question:__ Describe a scenario in which this script is useful for NLP.

In [20]:
review = pd.DataFrame({
        "review_stars": review_ratings, 
        "review_title": review_titles, 
        "review_date": review_dates, 
        "review_text": review_texts,
        "verified_purchase_id": review_ver_purchases    
    })
review

| | review_stars | review_title | review_date | review_text | verified_purchase_id |
|---|---|---|---|---|---|
| 0 | 4.0 | 7 brands, 1 test: results | January 21, 2016 | AmazonBasics batteries are quite good in terms... | Verified Purchase |
| 1 | 5.0 | Good value for money and not too far behind ot... | November 4, 2022 | Speaking from practical experience of using th... | Verified Purchase |
| 2 | 5.0 | Lasting by actual count 3 times as long as Eve... | October 18, 2022 | I was skeptical but figured they couldn't be a... | Verified Purchase |
| 3 | 5.0 | good value on basic batteries | October 24, 2022 | I don't do experiments with batteries to see w... | Verified Purchase |
| 4 | 5.0 | Great product and long lasting! | November 11, 2022 | Thank you to Amazon for getting these batterie... | Verified Purchase |
| 5 | 5.0 | They’re batteries… | November 13, 2022 | They do what they’re supposed to- they power t... | Verified Purchase |
| 6 | 5.0 | Battery please! | November 7, 2022 | Batteries are great value for the price. Good ... | Verified Purchase |
| 7 | 5.0 | Best choice for longevity, low price of all co... | October 29, 2022 | Several sites on Youtube did comparison of Ama... | Verified Purchase |
| 8 | 5.0 | Great Value for Money | October 26, 2018 | Now, you can spend quite a bit on batteries th... | Verified Purchase |
| 9 | 4.0 | Reasonable value AAA batteries | June 5, 2019 | These are good alkaline AAA batteries that ala... | Verified Purchase |


## robots.txt
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites.

When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file doesn't exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

__Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot.__
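
A compliant crawler can check these rules itself before fetching a page. The sketch below uses Python's standard `urllib.robotparser` on a small made-up robots.txt given as a string, so it runs without network access. The rules, the `my-crawler` user agent and the example.com URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules, not taken from any real site.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
# parse() accepts the file content as an iterable of lines.
parser.parse(robots_txt.splitlines())

allowed_public = parser.can_fetch('my-crawler', 'https://www.example.com/public/page.html')
allowed_private = parser.can_fetch('my-crawler', 'https://www.example.com/private/page.html')

print(allowed_public)
print(allowed_private)
```

For a live site, the same parser can load the file with `set_url(...)` followed by `read()` instead of `parse(...)`.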

https://developers.google.com/search/docs/advanced/robots/intro

example: <br>
https://www.amazon.es/robots.txt

__Exercise:__<br>
Find out how to get past the following firewall.

In [42]:
headers = {
        'authority': 'www.bloomberg.com' ,
  'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" ,
  'accept-language': "en-GB,en-US;q=0.9,en;q=0.8" ,
  'cache-control': "max-age=0",
  'cookie': """_schn = _151wtbh; _scid=f25088bf-2958-4b98-99c8-80851d59dc2a; __sppvid=de8e7992-179b-4981-b663-3ed029bbba42; seen_uk=1; exp_pref=EUR; pxcts=a0873f3d-6771-11ed-9bb2-6b7345777a43; _pxvid=a0873191-6771-11ed-9bb2-6b7345777a43; _pxff_rf=1; _pxff_fp=1; _sp_v1_uid=1:289:106ad2af-4882-4e69-af49-22c87ab69f19; _sp_v1_ss=1:H4sIAAAAAAAAAItWqo5RKimOUbLKK83J0YlRSkVil4AlqmtrlXSGk7JoIhl5IIZBbSwuI-khoRQLAP7hQqiiAQAA; sampledUser=false; agent_id=93ea9758-830b-4852-aed0-db0eb5031dcb; session_id=29e49ebb-c261-4d1a-acd7-2da4432e53a5; session_key=6d0b5bce3f6117a62c17f3ed32c7a4b892b92647; gatehouse_id=86ed1466-4467-4bc3-b09b-8fa2f68e62a9; geo_info=%7B%22countryCode%22%3A%22ES%22%2C%22country%22%3A%22ES%22%2C%22field_d%22%3A%22unav.es%22%2C%22field_n%22%3A%22hf%22%2C%22trackingRegion%22%3A%22Europe%22%2C%22cacheExpiredTime%22%3A1669402146089%2C%22region%22%3A%22Europe%22%2C%22fieldN%22%3A%22hf%22%2C%22fieldD%22%3A%22unav.es%22%7D%7C1669402146089; geo_info={%22country%22:%22ES%22%2C%22region%22:%22Europe%22%2C%22fieldD%22:%22unav.es%22%2C%22fieldN%22:%22hf%22}|1669402146127; _px3=442a2694b949cbb6dcb0ff378b0215cd3639191102ce8ebff241c88b50b4ca71:x29Ayo11ab1ixCvjFh4PifGxWKCupmip+RH+A9Zp+DR2NLzfzGS73ueFP3VyO/o9vNThiQelswb+kzLCeshoag==:1000:BFxVBsyU2bNksBcVGcNx//oqBqpChfudXnMew5NgyNe+p4Wyzr5ygC+iZLIUs6t+BwKDVhGR/EaSznDF5ez3geNRKrPxY7YKrlTSxWABHuOwUIkD0qRzFrNkPyWc7gWumX2OemUC46w7ipdntMnjbHP1SQPHDIXcIQBpHADA0xNXCF/RYFz8Xbwv6y8ns6YPNJE/tS7bOtd6jM/XWt0Fwg==; _px2=eyJ1IjoiYWQzNzk1ZTAtNjc3MS0xMWVkLWEzMTYtMDc0OTdiMTgwNDdiIiwidiI6ImEwODczMTkxLTY3NzEtMTFlZC05YmIyLTZiNzM0NTc3N2E0MyIsInQiOjE2Njg3OTc2NDYzNzEsImgiOiI0ZWUwYmRjODUzZjY5YWE4NGU0MWQ0NTBiMmM4YjI2ZWE3NGY2ZTk1NDczNjExYWFkODc1NDUzNDI4MDYzMTUyIn0=; _ga_GQ1PBLXZCT=GS1.1.1668797323.1.1.1668797346.0.0.0; ccpaUUID=edcb842a-f814-4618-99fb-f99121219574; dnsDisplayed=true; ccpaApplies=true; signedLspa=false; _tb_sess_r=https%3A//www.bloomberg.com/tosv2.html%3Fvid%3D%26uuid%3Da0009844-6771-11ed-9f67-4d4664645246%26url%3DL3F1b3RlL1NQWDpJTkQ%3D; 
_tb_t_ppg=https%3A//www.bloomberg.com/quote/SPX%3AIND; _ga=GA1.2.1890042947.1668797324; _gid=GA1.2.1981318809.1668797347; _reg-csrf-token=OpO685bn-qSu95W8P2JUDRh1SM-Ji-UNm7Rs; _reg-csrf=s%3AbOkSZkdWnCdcQZANh4rTyN-0.pcM%2F8xoIZG7mIr%2Fgqz6j1tZsV27VWPs8Rr%2B0J4zOFlc; _user-data=%7B%22status%22%3A%22anonymous%22%7D; _last-refresh=2022-11-18%2018%3A49; _sp_krux=true; euconsent-v2=CPio-4APio-4AAGABCENCrCgAP_AAGLAAAiQH9oB9CpGCTFDKGh4AIsAEAQXwBAEAOAAAAABAAAAAAgQAIwCAEASAACAAAACAAAAIAIAAAAAEAAAAEAAQAAAAAFAAAAEAAAAAAAAAAAAAAAAAAAAAAEAAAAAAUAAAFAAgEAAABIAQAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgYtAPAAcADkAH4AjgCBwEHAQgAiIBFgC6gGBANeAdQBZQC8wGLAHDIAoATABHAEcAXIAvMRABAXIEgKgALAAqABkADgAIAAZAA0gCIAIoATAAngB-AEIAI4AUoA7wB7AEcAJSAcQBcgDJAgAIAH4Cmw0AEBcgYACApsVAFACYAI4AjgC5AF5joDIACwAKgAZAA4ACAAFwAMgAaQBEAEUAJgATwAxAB-AEwAKMAUoAygB3gD2AI4ASkA4gB1AFyAMkHAAwALgCOAU2QgHAALAAyAC4AJgAYgBHAClAGUAO8AjgBKQDqALKAXISgGgALAAyABwAIgATAAxACOAFGAKUAZQA7wCOAHUJAAQALlICgACwAKgAZAA4ACAAGQANIAiACKAEwAJ4AYgA_ACjAFKAMoAd4BHACUgLkAZIUABgAXAIOApsAAA.YAAAAAAAAAAA; consentUUID=3faeda70-11cc-4396-8fde-0c6e9a4fbc3b_13; _sp_v1_data=2:517482:1668797230:0:1:-1:1:0:0:_:-1; _sp_v1_opt=1:login|true:last_id|11:; _sp_v1_csv=; _sp_v1_lt=1:; bbgconsentstring=req1fun1pad1; bdfpc=004.5449857892.1668797348805; _gcl_au=1.1.2020409233.1668797349; _uetsid=af682690677111ed9b8a2f79b2ded8db; _uetvid=af681a60677111ed9dfba506c147cf12; _rdt_uuid=1668797349028.d904da97-ba2a-4687-8603-f7ccf6e68e0a; _pxde=318d9f34d1bd94831ed7e054399fb5e569b063caebd88f603057c830578ed5fb:eyJ0aW1lc3RhbXAiOjE2Njg3OTczNDkyNzAsImZfa2IiOjAsImlwY19pZCI6W119; _fbp=fb.1.1668797349134.1268640029; ln_or=d; lotame_domain_check=bloomberg.com; _cc_id=13ffcee8cee7e406e96c8ff64e289cce; panoramaId_expiry=1669402149885; panoramaId=30affb3e90a38d900a609e5742dbe32246b0fcc52f78d971876f7aa67ed31189; afUserId=7349a45b-4e82-4a64-b04a-9574715742ea-p; AF_SYNC=1668797349880; __gads=ID=cdbb6794b3ebe32a:T=1668797349:S=ALNI_Ma1od9YmykEFq6ejAmuCO-UiNWpmw; 
__gpi=UID=00000b2266c78ea3:T=1668797349:RT=1668797349:S=ALNI_MZK5ogMuXiBbPdd1LlFVXiCDSt_kA""",
  'if-none-match':"""W/"4b1b1-ubZ0T8amBLQcESE87vMdouwVTnw"'""",
  'sec-ch-ua': """Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v=\"24\"""",
  'sec-ch-ua-mobile': '?0',
  'sec-ch-ua-platform': "macOS",
  'sec-fetch-dest': 'document',
  'sec-fetch-mode': 'navigate',
  'sec-fetch-site': 'none',
  'sec-fetch-user': '?1',
  'upgrade-insecure-requests': '1',
  'user-agent': "No Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    }

In [43]:
import requests



r = requests.get('http://www.bloomberg.com/quote/SPX:IND', headers=headers)

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

soup

<!DOCTYPE html>

<html>
<head>
<title>Bloomberg - Are you a robot?</title>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="noindex" name="robots"/>
<link href="https://assets.bwbx.io/font-service/css/BWHaasGrotesk-55Roman-Web,BWHaasGrotesk-75Bold-Web,BW%20Haas%20Text%20Mono%20A-55%20Roman/font-face.css" rel="stylesheet" type="text/css"/>
<style rel="stylesheet" type="text/css">
        html, body, div, span, applet, object, iframe,
        h1, h2, h3, h4, h5, h6, p, blockquote, pre,
        a, abbr, acronym, address, big, cite, code,
        del, dfn, em, img, ins, kbd, q, s, samp,
        small, strike, strong, sub, sup, tt, var,
        b, u, i, center,
        dl, dt, dd, ol, ul, li,
        fieldset, form, label, legend,
        table, caption, tbody, tfoot, thead, tr, th, td,
        article, aside, canvas, details, embed,
        figure, figcaption, footer, header, hgroup,
        menu, nav, output, ruby, section, summary,
        time, mark, 