# Website Spider

## Introduction

This notebook presents a web scraping tool.

### Goals

1. To retrieve web pages that contain a given string.
2. To convert web pages into plain ASCII text.

## Requirements

- [WebDriver](https://www.w3.org/TR/webdriver1/)
- [Crawler](https://en.wikipedia.org/wiki/Web_crawler)

### Imports

#### Standard

In [1]:
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2 fallback
import encodings.aliases  # provides the alias-to-codec table used below
import random

#### Third-party

- [scrapy](https://scrapy.org/)
- [html2text](http://alir3z4.github.io/html2text/)
- [selenium](https://www.seleniumhq.org/)

In [2]:
import scrapy 
import html2text
from selenium import webdriver

## Arguments

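1. `websites`: path to a text file listing the URLs of the websites to crawl, one per line.
2. `query`: the string to look for in the retrieved web pages.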

In [3]:
websites = 'websites.txt'
query = 'vanves'

## Parameters

In [4]:
start_urls = []
allowed_domains = []
counter = 0
charsets = set(encodings.aliases.aliases.values())
driver = webdriver.Chrome()
ignored_urls = []
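
The `charsets` set above collects every codec name that Python's alias table points to. A minimal sketch of how such a set could be used to decode raw page bytes is shown below, assuming Python 3. The helper `decode_with_fallback`, the `preferred` ordering and the sample bytes are hypothetical and are not part of the notebook.

```python
import encodings.aliases

# Same construction as the notebook's `charsets` parameter.
charsets = set(encodings.aliases.aliases.values())


def decode_with_fallback(raw, candidates):
    """Return (text, charset) for the first candidate that decodes `raw`.

    Hypothetical helper for illustration, not part of the notebook.
    """
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # Wrong charset, unknown codec, or a codec that is not a text
            # encoding (for example base64_codec): try the next candidate.
            continue
    raise ValueError('no candidate charset could decode the input')


# Assumption: try the most likely charsets first, then the rest in a stable order.
preferred = ['utf_8', 'latin_1']
candidates = preferred + sorted(charsets - set(preferred))

raw = 'Vanves, \u00cele-de-France'.encode('utf-8')
text, used = decode_with_fallback(raw, candidates)
print(used)
```

A successful decode does not prove that the charset is the right one: `latin_1` accepts any byte sequence, so in this ordering no candidate after it is ever tried.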

## Classes

## Methods

## Program

### Allowed domains

Prevent the spider from following URLs that do not belong to the allowed domains.

In [5]:
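# Read one URL per line, skipping blank lines.
with open(websites) as url_file:
    for line in url_file:
        if not line.isspace():
            parsed = urlparse(line.strip())
            start_urls.append(parsed.geturl())
            # The network location (host) is what Scrapy matches links against.
            allowed_domains.append(parsed.netloc)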
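
To make the effect of the allowed domains concrete, here is a small sketch of a domain check, assuming Python 3. The helper `is_allowed` and the `example.com` URLs are hypothetical; Scrapy applies its own offsite filtering, which this only approximates by accepting a listed domain and its subdomains.

```python
from urllib.parse import urljoin, urlparse


def is_allowed(url, allowed_domains):
    """Return True when the host of `url` is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration; Scrapy performs its own filtering.
    """
    host = urlparse(url).hostname or ''
    return any(host == domain or host.endswith('.' + domain)
               for domain in allowed_domains)


# Example values, not taken from the notebook's websites.txt.
allowed = [urlparse('https://www.example.com/index.html').netloc]
print(allowed)

# Relative links are resolved against the page they were found on.
link = urljoin('https://www.example.com/news/', '../contact.html')
print(link, is_allowed(link, allowed))
print(is_allowed('https://other.example.org/', allowed))
```

Note that `netloc` keeps any port or credentials present in the URL, while `hostname` drops them, so entries containing a port would need to be normalized before a comparison like this one.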

### Web Spider

The class that Scrapy uses to scrape information from a single website or a group of websites.

In [6]:
class SpiderScraper(scrapy.Spider):
    '''
        The class that Scrapy uses to scrape information from a single website or a group of websites.

        Parameters
        ----------
        name: string, optional (default = query)
            a string that defines the name of this spider.
        start_urls: array, optional (default = start_urls)
            an array of strings containing the URLs where the spider will begin to crawl from.
        allowed_domains: array, optional (default = allowed_domains)
            an array of strings containing the domains that this spider is allowed to crawl.
        query: string, optional (default = query)
            a string to be found in web pages.
        counter: int, optional (default = counter)
            an int that holds the number of retrieved web pages.
        charsets: set, optional (default = charsets)
            a set of strings containing the charsets that this spider will use to decode web pages.
        driver: selenium WebDriver, optional (default = driver)
            the WebDriver instance used to load each page and read its rendered HTML.
        ignored_urls: list, optional (default = ignored_urls)
            a list of URLs whose pages did not contain the query; responses for these URLs are skipped.
    '''
    name = query
    start_urls = start_urls
    allowed_domains = allowed_domains    
    query = query 
    counter = counter
    charsets = charsets
    driver = driver
    ignored_urls = ignored_urls
    
    def parse(self, response):
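        '''
            The method that parses the response.

            Parameters
            ----------
            response: scrapy.http.Response
                the response downloaded by Scrapy for the requested URL.
        '''

        '''
            Skipping URLs whose pages were already found not to contain the query.
        '''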
        if response.request.url in SpiderScraper.ignored_urls:
            return
        
        '''
            Loading the page in the browser to get the rendered HTML.
        '''
        self.driver.get(response.request.url)

        html = self.driver.page_source
          
        '''
        for c in SpiderScraper.charsets:
            try:                
                break
            except UnicodeDecodeError as u:
                continue
            except Exception as e: 
                print(response)
                print(type(e))   
                print(e.args)
                print(e)          
        '''

        '''
            Converting the web pages into plain ASCII text.
        '''
        text = html2text.html2text(html)
        
        '''
            Persisting the web pages if they contain the string to be found (query).
        '''
        if SpiderScraper.query in text:
            # Binary mode, because the text is encoded to UTF-8 bytes before writing.
            with open(str(SpiderScraper.counter) + '.txt', 'wb') as f:
                f.write(text.encode('utf-8'))
                SpiderScraper.counter += 1
        else:
            SpiderScraper.ignored_urls.append(response.request.url)
        
        '''
            Searching for URLs using XPath.
        '''
        for link in scrapy.Selector(text = html, type = 'html').xpath('*//a/@href').extract():
            yield response.follow(link, self.parse)          
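
The match-and-persist step of `parse` can be tried in isolation, without Scrapy or a browser. The sketch below assumes Python 3; the helper `save_if_match`, the sample page texts and the temporary directory are hypothetical and only mirror the numbered-file naming used above.

```python
import os
import tempfile


def save_if_match(text, query, counter, directory):
    """Write `text` to '<counter>.txt' when it contains `query`.

    Returns the next counter value. Hypothetical helper, not the notebook's code.
    """
    if query not in text:
        return counter
    path = os.path.join(directory, '{}.txt'.format(counter))
    # Open in text mode with an explicit encoding so the file is always UTF-8.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return counter + 1


with tempfile.TemporaryDirectory() as directory:
    counter = 0
    # Made-up page texts for illustration.
    pages = ['Welcome to Vanves town hall', 'Unrelated page', 'vanves market hours']
    for page in pages:
        counter = save_if_match(page, 'vanves', counter, directory)
    saved = sorted(os.listdir(directory))
    print(counter, saved)
```

Only the third page is saved here, because the `in` test is case-sensitive: `'vanves'` does not match `'Vanves'`. The same holds for the `SpiderScraper.query in text` check in `parse`.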

## References

- [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text)
- [How can I render JavaScript HTML to HTML in python?](https://stackoverflow.com/questions/29404856/how-can-i-render-javascript-html-to-html-in-python)
- [How to get html with javascript rendered sourcecode by using selenium](https://stackoverflow.com/questions/22739514/how-to-get-html-with-javascript-rendered-sourcecode-by-using-selenium)
- [Rendered HTML to plain text using Python](https://stackoverflow.com/questions/13337528/rendered-html-to-plain-text-using-python)
- [Python - html2text write to file](https://stackoverflow.com/questions/28602868/python-html2text-write-to-file)
- [Scrapy](https://scrapy.org/)
- [Get protocol + host name from URL](https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url)
- [How to extract raw html from a Scrapy selector?](https://stackoverflow.com/questions/34887730/how-to-extract-raw-html-from-a-scrapy-selector)