# What is Web Scraping?

Web scraping, also called web harvesting or web data extraction, is the automated extraction of data from websites.

## Let's get Started

# Requests Library

Requests is an Apache 2.0 licensed HTTP library written in Python. It is designed to make HTTP requests easy for humans to write. This means you don’t have to manually add query strings to URLs or form-encode your POST data. Don’t worry if that makes no sense yet; it will in due time.

What can Requests do?

Requests allows you to send HTTP/1.1 requests from Python. With it, you can add content such as headers, form data, multipart files, and parameters using simple Python data structures such as dictionaries, and you can access the response data in the same way.
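To make the point about query strings and form data concrete, here is a minimal sketch. It only prepares the requests locally and sends nothing over the network. The `page` value, the form fields, and `https://example.com/login` are illustrative placeholders, not part of the original walkthrough.

```python
import requests

# Build (but do not send) a GET request.
# Requests encodes the params dictionary into the query string for us.
get_request = requests.Request(
    "GET",
    "https://www.goodreads.com/quotes/tag/love",
    params={"page": 2},  # illustrative value
).prepare()
print(get_request.url)

# Build (but do not send) a POST request.
# Requests form-encodes the data dictionary for us.
# The URL and field names are placeholders, not a real endpoint.
post_request = requests.Request(
    "POST",
    "https://example.com/login",
    data={"user": "alice", "note": "hello world"},
).prepare()
print(post_request.body)
print(post_request.headers["Content-Type"])
```

In everyday use, `requests.get(url, params=...)` and `requests.post(url, data=...)` apply the same encoding and send the request in one step.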

In [1]:
import requests

In [2]:
tag='love'
pageno=1
link = "https://www.goodreads.com/quotes/tag/"+tag+"?page="+str(pageno)

result = requests.get(link)
print(result.text)

<!DOCTYPE html>
<html class="desktop pageskin
">


<head>
  <title>
Quotes About Love (63851 quotes)
</title>


    <script type="text/javascript"> var ue_t0=window.ue_t0||+new Date();
 </script>
  <script type="text/javascript">
    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}function b(m,l,h,j,i){var k={m:m,f:l,l:h,c:""+j,err:i,fromOnError:1,args:arguments};c.ueLogError(k);return false}b.skipTrace=1;e.onerror=b;function f(){c.uex("ld")}if(e.addEventListener){e.addEventListener("load",f,false)}else{if(e.attachEvent){e.attachEvent("onload",f)}}a.tag=d("tag");a.log=d("log");a.reset=d("rst");c.ue_csm=c;c.ue=a;c.ueLogError=d("err");c.ues=d("ues");c.uet=d("uet");c.uex=d("uex");c.uet("ue")})(window);(function(e,d){var a=e.ue||{};function c(g){if(!g){return}var f=d.head||d.getElementsByTagName("head")[0]||d.documentElement

#### Status Codes

HTTP status codes are standard response codes returned by web servers. They tell the client whether a request succeeded and, when a web page or other resource does not load properly, help identify the cause of the problem. They are sometimes called browser error codes or internet error codes.

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [3]:
print(result.status_code)

200
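
Before parsing a response, it helps to know which class its status code falls into. The helper below is a minimal sketch that uses only the Python 3 standard library. `describe_status` is a hypothetical name chosen for illustration and is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a (category, phrase) pair for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered standard codes.
        phrase = "Unknown"

    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "non-standard"
    return category, phrase


for code in (200, 404, 429, 503):
    print(code, describe_status(code))
```

With Requests itself, `result.ok` is true for codes below 400, and `result.raise_for_status()` raises an exception for 4xx and 5xx responses.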


In [4]:
print(result.headers['content-type'])

text/html; charset=utf-8


In [5]:
print(result.encoding)

utf-8


# Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [6]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(result.text,'lxml')
print(soup.text)




Quotes About Love (63851 quotes)

 var ue_t0=window.ue_t0||+new Date();
 

    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}function b(m,l,h,j,i){var k={m:m,f:l,l:h,c:""+j,err:i,fromOnError:1,args:arguments};c.ueLogError(k);return false}b.skipTrace=1;e.onerror=b;function f(){c.uex("ld")}if(e.addEventListener){e.addEventListener("load",f,false)}else{if(e.attachEvent){e.attachEvent("onload",f)}}a.tag=d("tag");a.log=d("log");a.reset=d("rst");c.ue_csm=c;c.ue=a;c.ueLogError=d("err");c.ues=d("ues");c.uet=d("uet");c.uex=d("uex");c.uet("ue")})(window);(function(e,d){var a=e.ue||{};function c(g){if(!g){return}var f=d.head||d.getElementsByTagName("head")[0]||d.documentElement,h=d.createElement("script");h.async="async";h.src=g;f.insertBefore(h,f.firstChild)}function b(){var k=e.ue_cdn||"z-ecx.images-amazon.com",g=e.ue_cdns||

#### How do you get just what you want?

In [7]:
samples = soup.find_all("div", "quoteText")
print(samples[1].text)


      “You've gotta dance like there's nobody watching,Love like you'll never be hurt,Sing like there's nobody listening,And live like it's heaven on earth.”
    ―
    William W. Purkey



#### But...

In [8]:
for quote in samples:
    print(quote.text)


      “I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”
    ―
    Marilyn Monroe


      “You've gotta dance like there's nobody watching,Love like you'll never be hurt,Sing like there's nobody listening,And live like it's heaven on earth.”
    ―
    William W. Purkey


      “You know you're in love when you can't fall asleep because reality is finally better than your dreams.”
    ―
    Dr. Seuss


      “A friend is someone who knows all about you and still loves you.”
    ―
    Elbert Hubbard


      “Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that.”
    ―
    Martin Luther King Jr.,
    
A Testament of Hope: The Essential Writings and Speeches


//<![CDATA[  

  function submitShelfLink(unique_id, book_id, shelf_id, shelf_name, submit_form, exclusive) {
    var check

#### The output includes script content. What do we do?

In [9]:
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()

In [10]:
samples = soup.find_all("div", "quoteText")
for quote in samples:
    print(quote.text)


      “I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”
    ―
    Marilyn Monroe


      “You've gotta dance like there's nobody watching,Love like you'll never be hurt,Sing like there's nobody listening,And live like it's heaven on earth.”
    ―
    William W. Purkey


      “You know you're in love when you can't fall asleep because reality is finally better than your dreams.”
    ―
    Dr. Seuss


      “A friend is someone who knows all about you and still loves you.”
    ―
    Elbert Hubbard


      “Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that.”
    ―
    Martin Luther King Jr.,
    
A Testament of Hope: The Essential Writings and Speeches





      “We accept the love we think we deserve.”
    ―
    Stephen Chbosky,
    
The Perks of Being a Wallflower
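
The extracted text still carries the page's indentation and line breaks, with the quote and its author separated by a `―` line. As a minimal sketch (Python 3, standard library only), the helper below tidies one block of text into a quote and an author. `split_quote` is a hypothetical name, and the approach assumes every block follows the layout shown in the output above.

```python
def split_quote(raw_text):
    """Split the text of one quoteText block into (quote, author)."""
    # Everything before the horizontal bar is the quote; the rest is the source.
    quote_part, _, author_part = raw_text.partition("―")
    # Collapse runs of spaces and newlines, then drop the curly quote marks.
    quote = " ".join(quote_part.split()).strip("“”")
    author = " ".join(author_part.split())
    return quote, author


raw = """
      “A friend is someone who knows all about you and still loves you.”
    ―
    Elbert Hubbard
"""
quote, author = split_quote(raw)
print(quote)
print(author)
```

In the notebook, the same helper could be applied to each `quote.text` in `samples`.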




    

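#### Let's write it to a file.

In [11]:
import io

# Write the quotes to a UTF-8 encoded text file.
# io.open accepts an encoding on both Python 2 and Python 3.
with io.open('quotefile.txt', 'w', encoding='utf-8') as outfile:
    for quote in samples:
        outfile.write(quote.text)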

Seems pretty simple, right? But this is not always the case. Some sites don't like being scraped.

## Sites don't like being scraped. Why?

1. Uncontrolled scraping, in the form of an overwhelming number of requests at a time, may lead to a denial of service (DoS) situation, where the server and all services hosted on it become unresponsive.
2. Scraping can cost the site owner a competitive advantage and, therefore, revenue.
3. Scraping may lead to content being duplicated elsewhere, which costs the original source credibility.
4. Even below DoS levels, scraping puts extra load on the server, slowing it down and inflating hosting bills.


## How can websites detect web scraping?

1. Unusual traffic or a high download rate, especially from a single client or IP address within a short time span.
2. Repetitive tasks performed on the website, based on the assumption that a human user won’t perform the same repetitive tasks all the time.
3. Detection through honeypots. These are usually links that aren’t visible to a normal user but are visible to a spider. When a scraper or spider tries to access such a link, the alarm is tripped.
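
To illustrate the first point from the site's side, here is a minimal sketch of a sliding-window request counter per IP address. The class name, the thresholds, and the sample IP address are illustrative assumptions. Real detection systems combine many more signals.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Flag clients that send too many requests within a time window."""

    def __init__(self, max_requests=5, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def record(self, client_ip, timestamp):
        """Record one request; return True if the client exceeds the limit."""
        hits = self._hits[client_ip]
        hits.append(timestamp)
        # Discard requests that have fallen out of the time window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# One request per second from the same address, with a limit of 3 per 10 s.
monitor = RequestRateMonitor(max_requests=3, window_seconds=10.0)
for second in range(5):
    flagged = monitor.record("203.0.113.7", float(second))
    print(second, flagged)
```

A scraper that spaces out its requests stays under such a threshold, which is one reason to crawl slowly.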

## Easiest way to find out whether a site doesn’t want its data scraped

The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned.

Check out https://in.pinterest.com/robots.txt
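
Python's standard library can interpret these rules. The sketch below (Python 3) parses a small made-up robots.txt from a string, so it runs without network access. The rules, the `my-scraper` user agent, and the `example.com` URLs are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("my-scraper", "https://example.com/quotes/tag/love"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))
```

For a live site, `set_url()` followed by `read()` downloads and parses the real file instead of a string.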

## How to avoid getting blacklisted or blocked?

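### IP-based blocking

Sites often block or throttle an IP address that sends too many requests. One way around this is to route requests through Tor, so the site sees a different exit IP address each time the identity is reset.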

In [12]:
from torrequest import TorRequest
import requests

response = requests.get('http://ipecho.net/plain')
print("My Original IP Address: " + response.text)

# Reset the Tor identity to get a new exit IP address.
# Assumes a local Tor service is running and that 'pass' is its control password.
tr = TorRequest(password='pass')

tr.reset_identity()
response = tr.get('http://ipecho.net/plain')
print("New Ip Address " + response.text)

tr.reset_identity()
response = tr.get('http://ipecho.net/plain')
print("New Ip Address " + response.text)

My Original IP Address: 202.88.225.162
New Ip Address 51.15.81.183
New Ip Address 62.210.13.58


To learn how Tor really works, see:
https://hackernoon.com/how-does-tor-really-work-c3242844e11f?gi=d854a296323a


In [13]:
# Request through Tor

response=tr.get('https://www.goodreads.com/quotes/tag/bbc?page=99')
print(response.text)

<!DOCTYPE html>
<html class="desktop pageskin
">


<head>
  <title>
Quotes About Bbc (26 quotes)
</title>


    <script type="text/javascript"> var ue_t0=window.ue_t0||+new Date();
 </script>
  <script type="text/javascript">
    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}function b(m,l,h,j,i){var k={m:m,f:l,l:h,c:""+j,err:i,fromOnError:1,args:arguments};c.ueLogError(k);return false}b.skipTrace=1;e.onerror=b;function f(){c.uex("ld")}if(e.addEventListener){e.addEventListener("load",f,false)}else{if(e.attachEvent){e.attachEvent("onload",f)}}a.tag=d("tag");a.log=d("log");a.reset=d("rst");c.ue_csm=c;c.ue=a;c.ueLogError=d("err");c.ues=d("ues");c.uet=d("uet");c.uex=d("uex");c.uet("ue")})(window);(function(e,d){var a=e.ue||{};function c(g){if(!g){return}var f=d.head||d.getElementsByTagName("head")[0]||d.documentElement,h=d

In [14]:
# Request through Tor to a site that still rate-limits us (see the output below)

response=tr.get('http://dogs.petbreeds.com/l/95')
print(response.text)

<!DOCTYPE html>
<html>
	<head>
		<title>Rate Limited</title>
		<meta name="robots" content="noindex, nofollow">
		<link href="/sites/default/files/distil/rate-limit.css" rel="stylesheet" type="text/css" />
	</head>
	<body>
		<div id="rate-limit-wrap">
			<h1><img class="ftb-logo" src="/sites/default/files/distil/logo-darkblue.svg"/></h1>
			<div id="rate-limit-captcha">
				<img class="rate-limit-icon" src="/sites/default/files/distil/rate-limit-icon.png"/>
				<div class="rate-limit-content">

					<div class="if-not-submitted active">
						<h2>Woah! You've been rate-limited.</h2>
						<p>Our servers have seen too many requests from you recently.</p>
						<p>If you feel this block is in error, please contact us using the form below.</p>
					</div>

					        <form id="dqxxxrvyafwqzqrvddac" method="POST" action="azsyrwwvwfzfbtxyvvbu.html" style="display:none"><label>Ignore: <input type="text" name="first_name" /></label><label>Ignore: <input type="text" name="last_name" /></labe

## Block by User Agent

A user agent is a “string” – that is, a line of text – identifying the browser and operating system to the web server.

When your browser connects to a website, it includes a User-Agent field in its HTTP header. The contents of the user agent field vary from browser to browser. Each browser has its own, distinctive user agent. Essentially, a user agent is a way for a browser to say “Hi, I’m Mozilla Firefox on Windows” or “Hi, I’m Safari on an iPhone” to a web server.

https://www.whatismybrowser.com/detect/what-is-my-user-agent

#### How to overcome this?

Spoof the User Agent by creating a list of user agents and picking a random one for each request.
Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the default user-agent (such as wget/version or urllib/version). 

In [15]:
from fake_useragent import UserAgent
ua = UserAgent()
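# Each access to ua.chrome returns a randomly chosen Chrome user agent,
# so the string printed here can differ from the one stored in the header.
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}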
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

No handlers could be found for logger "root"


Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36
{'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1464.0 Safari/537.36'}
<Response [200]>


##### Randomise a few


In [16]:
import random

headers = [ua.ie, ua.msie, ua.opera, ua.google, ua.chrome, ua.firefox, ua.ff, ua.safari, ua['Internet Explorer'], ua['google chrome']]
header = random.sample(headers, 1)
print(header[0])

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1


##### Now pass the header to your request

In [17]:
header = {'User-Agent': header[0]}
resp = requests.get(link, timeout=5, headers=header)
print(resp)

<Response [200]>


## Beware of Honey Pot Traps

Honeypots are usually links that a normal user can’t see but a spider can.
When following links, always check that the link is properly visible and has no nofollow tag. Some honeypot links meant to detect spiders have the CSS style display:none or are colored to blend in with the page’s background.
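
The checks described here can be sketched with the standard library's HTML parser (Python 3). `VisibleLinkCollector` is a hypothetical helper. It only inspects inline `style` and `rel` attributes on the link itself, so links hidden through an external stylesheet, a hidden parent element, or background-matching colors would need further checks.

```python
from html.parser import HTMLParser


class VisibleLinkCollector(HTMLParser):
    """Collect links, setting aside ones that look like honeypots."""

    def __init__(self):
        super().__init__()
        self.safe_links = []
        self.skipped_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        rel_values = (attributes.get("rel") or "").lower().split()
        if "display:none" in style or "nofollow" in rel_values:
            self.skipped_links.append(href)
        else:
            self.safe_links.append(href)


sample_html = (
    '<a href="/quotes?page=2">Next</a>'
    '<a href="/trap" style="display: none">hidden</a>'
    '<a href="/login" rel="nofollow">Login</a>'
)
collector = VisibleLinkCollector()
collector.feed(sample_html)
print(collector.safe_links)
print(collector.skipped_links)
```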

## Do not follow the same crawling pattern
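
Requesting pages 1, 2, 3, and so on in strict sequence is an easy pattern to spot. One simple variation, sketched below, is to visit the pages in a random order. `shuffled_page_urls` is a hypothetical helper, and the URL format reuses the Goodreads tag listing from earlier.

```python
import random


def shuffled_page_urls(tag, first_page, last_page, rng=random):
    """Return the listing URLs for a tag in a random visiting order."""
    pages = list(range(first_page, last_page + 1))
    rng.shuffle(pages)
    return [
        "https://www.goodreads.com/quotes/tag/{}?page={}".format(tag, page)
        for page in pages
    ]


for url in shuffled_page_urls("love", 1, 5):
    print(url)
```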

## Make the crawling slower, do not slam the server 
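
A random pause between requests keeps the load on the server low and avoids a fixed, machine-like rhythm. The sketch below is one possible way to do this. `fetch_politely` and its delay bounds are illustrative assumptions, and the demonstration uses stand-ins so that nothing is actually downloaded or delayed.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is any callable that takes a URL (for example `requests.get`).
    `sleep` is injectable so the pause can be skipped in tests.
    """
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait a random number of seconds before every request but the first.
            sleep(random.uniform(min_delay, max_delay))
        responses.append(fetch(url))
    return responses


# Demonstration with stand-ins: pauses are recorded instead of slept.
pauses = []
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: "contents of " + url,
    sleep=pauses.append,
)
print(pages)
print(len(pauses))
```

In a real crawl, `fetch` would be something like `requests.get` and the default `time.sleep` would be left in place.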

## Is it Legal to Scrape?


Scraping may not be illegal, since the scraper is only extracting information that is already available through a browser, unless the webmaster specifically forbids it in the website's terms and conditions. This is a gray area, where ethics and morality come into play.
 
For instance, Medium’s terms of service contain the following line:

    Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited
    
For Distil Networks' view on the question, see:
 
 https://resources.distilnetworks.com/travel/is-web-scraping-illegal-depends-on-what-the-meaning-of-the-word-is-is
    

## Resources

1. https://blog.jscrambler.com/protect-your-site-against-web-scraping/
2. https://www.scrapehero.com/scalable-do-it-yourself-scraping-how-to-build-and-run-scrapers-on-a-large-scale/

