# Web Scraping - Pro Tips

### User Agents
"In HTTP, the ___User-Agent string___ is often used for content negotiation, where the origin server selects suitable content or operating parameters for the response. For example, the User-Agent string might be used by a web server to choose variants based on the known capabilities of a particular version of client software. The concept of content tailoring is built into the HTTP standard in RFC 1945 "for the sake of tailoring responses to avoid particular user agent limitations.”

The User-Agent string is _one of the criteria by which Web crawlers may be excluded from accessing certain parts of a Web site_ using the Robots Exclusion Standard (robots.txt file).

As with many other HTTP request headers, _the information in the "User-Agent" string contributes to the information that the client sends to the server_, since the string can vary considerably from user to user."

Source: https://en.wikipedia.org/wiki/User_agent
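
The robots.txt point above can be checked from Python before scraping. The following is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content, the bot names (`BadBot`, `MyScraper`) and the helper `is_allowed` are hypothetical and only serve the illustration. A real scraper would load the site's actual robots.txt instead of an inlined string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# BadBot is excluded from the whole site by its own rule block.
print(is_allowed("BadBot", "http://example.org/index.html"))
# Any other agent falls back to the "*" rules.
print(is_allowed("MyScraper", "http://example.org/index.html"))
print(is_allowed("MyScraper", "http://example.org/private/data.html"))
```

Matching is done on the agent name, so the same URL can be allowed for one User-Agent and refused for another.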

A useful website for looking up example user-agent strings: http://www.useragentstring.com/pages/useragentstring.php
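
One reason to set this header explicitly is that HTTP libraries announce themselves by default, which makes a scraper easy to spot. The sketch below uses the standard library's `urllib.request` (not `requests`) so it runs without extra packages and without sending anything over the network; `requests` behaves similarly and sends a `python-requests/<version>` string unless told otherwise. The target URL is a placeholder.

```python
import urllib.request

# The standard-library opener carries its own default User-Agent header.
default_headers = dict(urllib.request.build_opener().addheaders)
default_user_agent = default_headers["User-agent"]
print(default_user_agent)  # starts with "Python-urllib/"; the version varies

# Build (but do not send) a request with a browser-like User-Agent.
browser_user_agent = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) "
    "Gecko/20100101 Firefox/32.0"
)
request = urllib.request.Request(
    "http://example.org",  # placeholder URL
    headers={"User-Agent": browser_user_agent},
)

# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```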

In practice:

In [163]:
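# .. in requests
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent instead of the library default
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'}
req = requests.get('http://www.northeastern.edu/', headers=header)
html = req.text

# Parse the response (the 'lxml' parser requires the lxml package)
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())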

<!DOCTYPE html>
<html lang="en-us" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta charset="utf-8"/>
  <title>
   Northeastern University: a leader in global experiential learning in Boston, MA
  </title>
  <meta content="Northeastern is a global, experiential, research university built on a tradition of engagement with the world, creating a distinctive approach to education" name="description"/>
  <meta content="index,follow" name="robots"/>
  <meta content="width=device-width, minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="7B1AF967D1B59429E1BDAA6A84702528" name="msvalidate.01"/>
  <!--[if lt IE 9]>
    <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
 </head>
 <body>
  <br/>
  <link href="https://www.northeastern.edu/css/html5styles.css" media="screen" rel="stylesheet" type="text/css"/>
  <link href="https://fast.fonts.com/cssapi/cac43e8c-6965-44df-b8ca-9784

In [161]:
# .. in Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

opts = Options()
user_agent_string = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0"
opts.add_argument("user-agent=%s" % user_agent_string)
# Selenium 4 takes the driver path via Service and the options via `options=`
driver = webdriver.Chrome(service=Service('./chromedriver_mac'), options=opts)


More info at: http://stackoverflow.com/questions/29916054/change-user-agent-for-selenium-driver

### Use Proxies

In [None]:
# .. in requests
import requests

# Map each URL scheme to the proxy that should handle it
proxies = {
  'http': 'http://10.10.1.10:3128',
  'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)




In [None]:
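# .. in selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

opts = Options()
# Replace 'yourproxyserver' with the host (and port) of your proxy
opts.add_argument("--proxy-server=http://%s" % 'yourproxyserver')
driver = webdriver.Chrome(service=Service('./chromedriver_linux'), options=opts)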


### Headless Selenium
More info at:
http://scraping.pro/use-headless-firefox-scraping-linux/