
# Looking Like a Human #
> Adjust Your Headers: HTTP Headers are lists of attributes, or preferences, sent by you every time you make a request to a web server.

In [None]:
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Two of the header fields that major browsers send with every request
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)'\
           'AppleWebKit 537.36 (KHTML, like Gecko) Chrome',
           'Accept':'text/html,application/xhtml+xml,application/xml;'\
           'q=0.9,image/webp,*/*;q=0.8'}

# Note: some sites serve a simpler, easier-to-scrape page to mobile browsers,
# so a mobile User-Agent such as this one is worth trying as an option:

# User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X)
# AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257
# Safari/9537.53

url = 'https://www.whatismybrowser.com/'\
'developers/what-http-headers-is-my-browser-sending'
req = session.get(url, headers=headers)

bs = BeautifulSoup(req.text, 'html.parser')
# get_text must be called (with parentheses); otherwise print() shows the bound method
table = bs.find('table', {'class': 'table-striped'})
print(table.get_text('\n', strip=True))

ACCEPT
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
ACCEPT-ENCODING
gzip, deflate
CONNECTION
keep-alive
HOST
www.whatismybrowser.com
USER-AGENT
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome
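
To see what custom headers change, it can help to compare them with what `requests` sends by default. The sketch below builds a request without sending it; the `https://example.com/` URL is a placeholder, not a site used in this notebook.

```python
import requests

# Headers that requests attaches when you supply none of your own
default_headers = requests.utils.default_headers()
print('Default User-Agent:', default_headers['User-Agent'])

browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)'
                  'AppleWebKit 537.36 (KHTML, like Gecko) Chrome',
    'Accept': 'text/html,application/xhtml+xml,application/xml;'
              'q=0.9,image/webp,*/*;q=0.8',
}

session = requests.Session()
# prepare_request() merges the session defaults with the per-request headers
# and returns the final request without touching the network
request = requests.Request('GET', 'https://example.com/', headers=browser_headers)
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print('{}: {}'.format(name, value))
```

The default `User-Agent` identifies the client as `python-requests`, which is an easy signal for a site to filter on; the per-request headers override it while the other session defaults (such as `Accept-Encoding`) are kept.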


## Handling Cookies with JavaScript ##

> View Cookies: see what data is being collected by the website you visit.

In [None]:
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

!pip install selenium
!pip install webdriver-manager

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

import time



In [None]:
options = webdriver.ChromeOptions()
options.add_argument('--headless')   # run webdriver in the background
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

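# chromedriver was copied to /usr/bin above, so Selenium finds it on the PATH;
# recent Selenium releases no longer accept the driver path as a positional argument
driver = webdriver.Chrome(options=options)

driver.get('https://elfinanciero.com.mx/')
driver.implicitly_wait(1)
print(driver.get_cookies())

# To manipulate cookies you can call
# delete_cookie(), add_cookie() and delete_all_cookies()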


[{'domain': '.elfinanciero.com.mx', 'expiry': 1647906376, 'httpOnly': False, 'name': 'nvg54307', 'path': '/', 'secure': False, 'value': 'deb47e5304520cfa82ea3138b09|0_81'}, {'domain': '.elfinanciero.com.mx', 'expiry': 1624146376, 'httpOnly': False, 'name': '_fbp', 'path': '/', 'secure': False, 'value': 'fb.2.1616370376010.1775802263'}, {'domain': 'elfinanciero.com.mx', 'expiry': 1616372176, 'httpOnly': False, 'name': '_tb_sess_r', 'path': '/', 'secure': False, 'value': ''}, {'domain': '.elfinanciero.com.mx', 'expiry': 1616370435, 'httpOnly': False, 'name': '_gat_UA-425878-1', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.elfinanciero.com.mx', 'expiry': 1616370435, 'httpOnly': False, 'name': '_gat_UA-112838768-1', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.elfinanciero.com.mx', 'expiry': 1616456775, 'httpOnly': False, 'name': '_gid', 'path': '/', 'secure': False, 'value': 'GA1.3.1301999003.1616370376'}, {'domain': '.elfinanciero.com.mx', 'expiry': 1679442375, 
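
The raw list of cookie dictionaries is hard to read. A minimal sketch of a summary helper follows; `summarize_cookies` is a hypothetical name, the first two sample entries are copied from the output above, and the third is an invented session cookie. It assumes that session cookies come back without an `expiry` key.

```python
from datetime import datetime, timezone


def summarize_cookies(cookies):
    """Return (name, domain, expires) tuples; expires is None for session cookies."""
    summary = []
    for cookie in cookies:
        expiry = cookie.get('expiry')  # assumed to be absent for session cookies
        expires = None
        if expiry is not None:
            # expiry is a Unix timestamp in seconds
            expires = datetime.fromtimestamp(expiry, tz=timezone.utc)
        summary.append((cookie['name'], cookie['domain'], expires))
    return summary


sample_cookies = [
    {'domain': '.elfinanciero.com.mx', 'expiry': 1616456775, 'httpOnly': False,
     'name': '_gid', 'path': '/', 'secure': False,
     'value': 'GA1.3.1301999003.1616370376'},
    {'domain': '.elfinanciero.com.mx', 'expiry': 1624146376, 'httpOnly': False,
     'name': '_fbp', 'path': '/', 'secure': False,
     'value': 'fb.2.1616370376010.1775802263'},
    # Hypothetical session cookie, not taken from the real site
    {'domain': 'example.com', 'httpOnly': True, 'name': 'session_id',
     'path': '/', 'secure': True, 'value': 'abc123'},
]

for name, domain, expires in summarize_cookies(sample_cookies):
    label = expires.isoformat() if expires else 'session cookie'
    print('{:<12} {:<24} {}'.format(name, domain, label))
```

With a live driver, the same helper could be called as `summarize_cookies(driver.get_cookies())`.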

* Always try to space page loads a few seconds apart; this makes it less likely that the site will block you.

* Sometimes it is useful to slow down to go fast. 


In [None]:
import time

time.sleep(3)
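
A fixed three-second pause is itself a regular pattern. One possible variation, sketched here, draws the delay at random from a range; `polite_pause` and the page names are hypothetical, and the short delays only keep the demonstration quick.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Hypothetical page names; replace the print with the real page load
for page in ['page-1', 'page-2']:
    print('fetching', page)
    waited = polite_pause(0.1, 0.3)  # use a few seconds when scraping for real
    print('waited {:.2f} s'.format(waited))
```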

* Selenium renders the pages it visits, so it can distinguish between elements that are visually present on the page and those that aren't.

* The following code retrieves a page and looks for hidden links and form input fields.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Reuses the headless `options` configured in the earlier cell
driver = webdriver.Chrome(options=options)

driver.get('https://elfinanciero.com.mx/')

# find_elements(By.TAG_NAME, ...) replaces find_elements_by_tag_name,
# which was removed in recent Selenium 4 releases
links = driver.find_elements(By.TAG_NAME, 'a')

for link in links:
    if not link.is_displayed():
        print('The link {} is a trap'.format(link.get_attribute('href')))

fields = driver.find_elements(By.TAG_NAME, 'input')
for field in fields:
    if not field.is_displayed():
        print('Do not change value of {}'.format(field.get_attribute('name')))



The link None is a trap
The link https://www.facebook.com/ElFinancieroMx is a trap
The link https://twitter.com/ElFinanciero_Mx is a trap
The link https://www.linkedin.com/company/90169/ is a trap
The link https://www.instagram.com/elfinanciero_mx/ is a trap
The link https://elfinanciero.com.mx/economia is a trap
The link https://elfinanciero.com.mx/mercados is a trap
The link https://elfinanciero.com.mx/opinion is a trap
The link https://elfinanciero.com.mx/nacional is a trap
The link https://elfinanciero.com.mx/estados is a trap
The link https://elfinanciero.com.mx/tv is a trap
The link None is a trap
The link https://elfinanciero.com.mx/economia is a trap
The link https://elfinanciero.com.mx/empresas is a trap
The link https://elfinanciero.com.mx/mercados is a trap
The link https://elfinanciero.com.mx/nacional is a trap
The link https://elfinanciero.com.mx/estados is a trap
The link https://elfinanciero.com.mx/opinion is a trap
The link https://elfinanciero.com.mx/tv is a trap
The l
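
The check can also be inverted to collect only the links that are safe to follow. The sketch below is one way to do that, not part of the original notebook: `visible_hrefs` is a hypothetical helper, and `FakeLink` is a stand-in for a Selenium element so the example runs without a browser.

```python
def visible_hrefs(elements):
    """Return the unique hrefs of displayed elements, in page order."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute('href')
        # Skip anchors without an href and anything not rendered on the page
        if href is None or not element.is_displayed():
            continue
        if href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


class FakeLink:
    """Stand-in exposing the two element methods the helper relies on."""

    def __init__(self, href, displayed):
        self._href = href
        self._displayed = displayed

    def get_attribute(self, name):
        return self._href if name == 'href' else None

    def is_displayed(self):
        return self._displayed


links = [
    FakeLink('https://elfinanciero.com.mx/economia', True),
    FakeLink('https://elfinanciero.com.mx/tv', False),        # hidden
    FakeLink(None, False),                                    # no href
    FakeLink('https://elfinanciero.com.mx/economia', True),   # duplicate
]
print(visible_hrefs(links))
```

With a real driver, the elements returned by `driver.find_elements(By.TAG_NAME, 'a')` could be passed in directly. Note that hidden links are not always traps; the output above includes ordinary menu links that are simply collapsed.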