# Intermediate Python Programming Workshop
## Scrapy and 'borrowing' content from the web

This notebook will introduce you to Scrapy, an open source and collaborative framework for extracting the data you need from websites (https://scrapy.org).

### 1. Install scrapy
This code cell installs the Scrapy package using 'pip', Python's package management system.

**Note:** You only need to run this once. After that, the package stays installed and you will not need to run this cell again.

In [None]:
!pip install --user scrapy
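
If you are unsure whether the install worked, a small check like the one below can tell you. This is only a sketch using Python's standard library; `is_installed` is a helper name made up for this example, not part of Scrapy or pip.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find a package with this name."""
    return importlib.util.find_spec(package_name) is not None


print("scrapy installed:", is_installed("scrapy"))
```

If this prints `False` right after a successful install, restarting the notebook kernel may help, since a running kernel does not always pick up newly installed packages.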

### 2. A basic example
Here we will create a basic example using Scrapy. We need to define a few terms. 
 * A **spider** is the piece of software that 'crawls the web'. For us, it is the Python code that fetches pages from a website so that we can scrape their content.
 * A **parse** function is the function that runs our scraping code on each page the spider fetches.
 * A **class** is a term from object-oriented programming (OOP), a style of programming that Python supports. A class describes a kind of thing, such as a 'dog', which can have attributes such as a 'name' and functionality such as 'bark'. If you are interested in a deeper explanation you can visit the following page: [Python OOP](https://www.datacamp.com/community/tutorials/python-oop-tutorial).
 * A **URL** stands for Uniform Resource Locator, which is a web address.
 * The keyword **yield** hands a value back from a function without ending it. In Scrapy, we use it to hand back page requests and the data we are interested in scraping out of a website.
 * A **response** is what the website sends back to our spider: the page in the form of HTML code.
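
Before we meet the spider, here is a minimal sketch of two of the terms above, **class** and **yield**, in plain Python with no Scrapy involved. The names `Dog`, `bark`, and `count_up_to` are made up for illustration only.

```python
class Dog:
    """A class describes a kind of thing; each Dog object is one dog."""

    def __init__(self, name):
        self.name = name  # an attribute: data stored on the object

    def bark(self):
        # a method: functionality that belongs to the object
        return f"{self.name} says woof!"


def count_up_to(limit):
    """A generator function: each yield hands one value back to the caller."""
    for number in range(1, limit + 1):
        yield number


rex = Dog("Rex")
print(rex.bark())            # Rex says woof!
print(list(count_up_to(3)))  # [1, 2, 3]
```

Our spider follows the same pattern: it is a class with a `name` attribute, and its functions use `yield` to hand things back to Scrapy one at a time.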
 
The code cell below defines our spider. It does not run it.

In [None]:
import scrapy


# This creates our own spider class called 'MySpider'
class MySpider(scrapy.Spider):
    # This names our spider. The name should be unique among your spiders; otherwise any name will do.
    name = 'myspider'
    
    def start_requests(self):
        
        # This is the list of URLs you plan to scrape
        start_urls = [
            'https://en.wikipedia.org/wiki/Social_science',
            'https://en.wikipedia.org/wiki/Geography'
        ]
        for start_url in start_urls:
            yield scrapy.Request(url=start_url, callback=self.parse)
        

    # This is the function where we will scrape each website
    def parse(self, response):
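        # First open our file for writing
        # Get the page name from the end of the URL (Social_science or Geography)
        page = response.url.split("/")[-1]
        filename = f'my-scrapy-data-{page}.txt'
        with open(filename, 'w', encoding='utf-8') as f:

            # Now figure out what we are going to write:
            # here, the target (href) of every link on the page
            links = response.xpath("//a/@href").getall()
            for link in links:
                # Write one link per line
                f.write(link + '\n')

            # To save the whole page instead, write its HTML text:
            #f.write(response.text)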
        self.log(f'Saved file {filename}')
        
        yield None # Just do nothing here
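
Two details of `parse` are easy to try outside Scrapy. The sketch below uses only Python's standard library to show how the page name is cut out of the URL, and how relative links such as `/wiki/Geography` could be turned into full web addresses. The `sample_links` list is a hypothetical stand-in for what `response.xpath("//a/@href").getall()` might return; it was not scraped from a real page.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Social_science"

# Same expression as in parse(): the text after the last "/" names the page.
page = page_url.split("/")[-1]
filename = f"my-scrapy-data-{page}.txt"
print(filename)

# Hypothetical sample of link targets; real pages mix relative and absolute links.
sample_links = ["/wiki/Geography", "#History", "https://scrapy.org"]

# urljoin resolves each link against the address of the page it was found on.
absolute_links = [urljoin(page_url, link) for link in sample_links]
for link in absolute_links:
    print(link)
```

Inside a spider, Scrapy's response object offers a similar helper, `response.urljoin(link)`, which resolves a link against the URL of the page that was fetched.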


Now we will run our spider.

Here we import a class called CrawlerProcess, which will run our spider. For more information, see the Scrapy documentation on [running Scrapy from a script](https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script). We also import the multiprocessing module, which lets us run the crawler more than once from the same notebook. The technical details are not important, but if you are interested you can read more in this [blog post](https://wiseodd.github.io/techblog/2016/06/10/scrapy-long-running/).

In [None]:
# Okay, hand waving time.
# This code does a few advanced things to make the Scrapy crawler work inside of a notebook.
# mp.Process launches a new computing process (that is separate from the notebook itself)
# then you call the _getscraping function that you created inside that new process
# That is where the crawler will call your parse function above

# You don't need to understand this code to use scrapy. You just need to change the parse function above.

from scrapy.crawler import CrawlerProcess
import multiprocessing as mp

def getscraping(spider):
    def _getscraping(spider):
        # Create a crawler process that identifies itself to websites as a web browser
        # (this user-agent string is the one sent by Internet Explorer 7 on Windows XP).
        process = CrawlerProcess({
          'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })

        # Run our spider. Warning: you will see a lot of red log output.
        process.crawl(spider)
        process.start()  # This waits until the crawl has finished

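    # Launch the crawl in a separate process and wait for it to finish.
    # Note: using a nested function as the target relies on the 'fork' start method;
    # it fails under 'spawn', the default on Windows and macOS.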
    p = mp.Process(target=_getscraping, args=(spider,))
    p.start()
    p.join()
    return None

getscraping(MySpider)

print("Done")
    

In [None]:
# Useful tip: You can turn off warnings ('ignore') or only see each one 'once'.
# Pick one of the two calls: if both run, the most recent call (here 'once') wins.
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='once')
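
To see what these filter settings do, the sketch below emits the same warning three times under different filters and counts how many are actually shown. It uses only the standard `warnings` module; `count_shown` is a helper name made up for this example, and the filters it sets are undone when each `with` block ends.

```python
import warnings


def count_shown(*actions, calls=3):
    """Apply the filter actions in order, emit the same warning `calls`
    times, and return how many of them were actually shown."""
    label = "+".join(actions)
    with warnings.catch_warnings(record=True) as shown:
        warnings.resetwarnings()  # start from an empty filter list
        for action in actions:
            warnings.filterwarnings(action)
        for _ in range(calls):
            warnings.warn(f"demo warning ({label})", UserWarning)
    return len(shown)


counts = {
    "ignore": count_shown("ignore"),
    "once": count_shown("once"),
    "ignore, then once": count_shown("ignore", "once"),
}
print(counts)
```

With 'ignore' nothing is shown, and with 'once' only the first of the three warnings is shown. Setting 'ignore' and then 'once' behaves like 'once' alone, because the filter added most recently is checked first.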