<a href="https://colab.research.google.com/github/dylanwalker/MGSC496/blob/main/MGSC496_L05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is the in-class notebook for MGSC496 Lecture 5.

First let's go around the room and look at our solutions to the last exercise of R03 and the two scraping exercises of L04.

# Potential Topic: Faking User Agents in Scrapy


By default, scrapy tells every website you scrape that you are using scrapy. It does this on each request that it sends, through the `User-Agent` field of the request header.

For example, here is the header of a request that I sent from a Scrapy shell session:
```python
{b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 b'Accept-Language': b'en',
 b'User-Agent': b'Scrapy/2.8.0 (+https://scrapy.org)',
 b'Accept-Encoding': b'gzip, deflate'}
``` 
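One common way to change this is the `scrapy-fake-useragent` middleware (installed with pip). A minimal sketch of the relevant entries in scrapy's `settings.py`, assuming that package and the middleware and provider names from its documentation (check them against the version you install), might look like:

```python
# settings.py (sketch; assumes the scrapy-fake-useragent package is installed)

DOWNLOADER_MIDDLEWARES = {
    # Turn off scrapy's built-in user agent and retry middlewares...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # ...and let the fake user agent middlewares take their place
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

# Providers are tried in order until one returns a user agent string
FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',
    'scrapy_fake_useragent.providers.FakerProvider',
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]

# Used by the fixed provider when the other providers fail
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
)
```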



The last line, where `USER_AGENT` is defined, is the fallback option if none of the other fake user agent services work. So you might want to plug in the user agent of your favorite browser. For example, if you [go to this link](https://www.whatismybrowser.com/detect/what-is-my-user-agent/) you can see the user agent your browser reports. Mine is:
>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36

You should note that, even though the websites you scrape will see these fake user agents, they still know what IP address you are accessing them from. Many sites have automated monitoring tools that look for scraper-like patterns in requests from the same IP address and may block your IP from accessing their site. To get around this, you can use proxies, which route your requests through many other proxy machines. There are [middleware libraries](https://github.com/rejoiceinhope/scrapy-proxy-pool) that you can install (with pip) and configure (in scrapy's `settings.py`) and there are both [free and paid proxy services](https://youtu.be/qHahcxoGfpc) that you can use with this middleware.
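As a rough sketch, the linked `scrapy-proxy-pool` middleware is switched on through `settings.py`. The entries below are based on that project's README; treat the exact names and priority numbers as assumptions to verify against the version you install:

```python
# settings.py (sketch; assumes the scrapy-proxy-pool package is installed)

# Enable the proxy pool
PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # Picks a proxy from the pool for each outgoing request
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    # Watches responses for signs that a proxy has been banned
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
}
```

If you already use other downloader middlewares, merge these entries into your existing `DOWNLOADER_MIDDLEWARES` dictionary rather than defining it twice, since a second definition replaces the first.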


# Potential Topic: Items in Scrapy

In the reading, we exported data from our scraper by `yield`ing structured data items as a dictionary:

```python
import scrapy


class ToySpider(scrapy.Spider):
    name = 'toyspider'
    allowed_domains = ['cool.toys.com']
    start_urls = ['http://cool.toys.com/new']

    def parse(self, response):
        # Select every toy entry, then extract fields relative to each one
        sel_list = response.xpath('/xpath/to/toyitem')
        for toy_sel in sel_list:
            toyname = toy_sel.xpath('continuedpath/to/toyname/text()').get()
            toyprice = toy_sel.xpath('continuedpath/to/toyprice/text()').get()
            yield {'toyname': toyname, 'toyprice': toyprice}
```

When we yield the dictionary, we are telling scrapy we want that data to go out through the data pipeline. In addition to yielding data as a dictionary, we can also [formally define our own data item objects](https://docs.scrapy.org/en/latest/topics/items.html) and yield them when we scrape. We can define one or more custom item classes in `items.py`. For example:

```python
# Inside of the file "/content/toyscraper/toyscraper/items.py"
import scrapy


class ToyscraperItem(scrapy.Item):
    toyname = scrapy.Field()
    toyprice = scrapy.Field()
```

If we do this, then instead of yielding a dictionary, we can yield an item of this type:

```python
import scrapy

from toyscraper.items import ToyscraperItem


class ToySpider(scrapy.Spider):
    name = 'toyspider'
    allowed_domains = ['cool.toys.com']
    start_urls = ['http://cool.toys.com/new']

    def parse(self, response):
        # Select every toy entry, then extract fields relative to each one
        sel_list = response.xpath('/xpath/to/toyitem')
        for toy_sel in sel_list:
            toyname = toy_sel.xpath('continuedpath/to/toyname/text()').get()
            toyprice = toy_sel.xpath('continuedpath/to/toyprice/text()').get()
            # Yield a declared item instead of a free-form dictionary
            yield ToyscraperItem(toyname=toyname, toyprice=toyprice)
```

This approach allows us to:
* specify the structure of the data we are going to extract (if we yield a dictionary, we can use any keys we want, whereas an item only accepts the fields it declares)
* do more advanced things with scrapy's item pipelines and feed exports, such as:
  * validate our data (ensure that we are extracting what we think we are extracting)
  * ensure that we are not duplicating data
  * store our data
  * export our data

It also allows more flexibility in how we extract and preprocess data (because, for example, we can write methods of our item class to clean and sanitize the data).
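
The validation and duplicate-filtering ideas listed above can be sketched as an item pipeline. This is only an illustration: the class name `ToyValidationPipeline` and the cleaning rules are assumptions, not part of the reading, and the `try`/`except` around the import is there only so the sketch also runs where Scrapy is not installed.

```python
# Inside of a file such as "/content/toyscraper/toyscraper/pipelines.py" (sketch)
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy; in a real project use Scrapy's DropItem
    class DropItem(Exception):
        pass


class ToyValidationPipeline:
    """Drop incomplete or duplicate toys and tidy up the remaining ones."""

    def __init__(self):
        self.seen_names = set()  # toy names already passed through the pipeline

    def process_item(self, item, spider):
        name = item.get('toyname')
        price = item.get('toyprice')

        # Validate: both fields must have been extracted
        if not name or not price:
            raise DropItem(f'missing field in {item!r}')

        # De-duplicate on the cleaned toy name
        name = name.strip()
        if name in self.seen_names:
            raise DropItem(f'duplicate toy: {name}')
        self.seen_names.add(name)

        # Store the cleaned values back on the item
        item['toyname'] = name
        item['toyprice'] = price.strip()
        return item
```

A pipeline like this only runs once it is registered in `settings.py`, for example with `ITEM_PIPELINES = {'toyscraper.pipelines.ToyValidationPipeline': 300}`.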


# Exercise: Scrape country info

Let's scrape Country Data from [this site](https://www.scrapethissite.com/pages/simple/). You will notice that all of the data is on a single page, so our webscraper will not need to follow any links.

First run the two code cells below:


In [None]:
%%capture
!pip install parsel

In [None]:
from parsel import Selector
import requests
res = requests.get('https://www.scrapethissite.com/pages/simple/') # grab the page using requests lib
doc = res.text # store the html of the page in the variable doc
selector = Selector(doc) # make a selector from doc

The page lists data for 250 different countries, including each country's:
* name
* capital
* population
* area

Use your browser inspector to inspect the html of the page and play with xpaths using `selector.xpath(...)` in the code area below to find xpaths that extract all the data described above:

In [None]:
# WRITE YOUR CODE TO PLAY WITH XPATHS HERE

Once you have figured this out, make a new colab notebook called `CountryScraper.ipynb` and write all of the necessary code to:
* install scrapy on colab
* create a Scrapy project
* write spider code to a file (using `%%writefile`) [this will make use of the xpaths you discovered above]
* run the scrapy project

# Exercise: Scrape NHL data 

Let's scrape NHL data from [this site](https://www.scrapethissite.com/pages/forms/). The data is spread out across multiple pages, so your scraper will have to: 1) get all of the data from the current page; and 2) follow a link to the next page. You will not need more than one `parse` function to do this.
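
Links to a next page are often relative, so they have to be resolved against the URL of the current page before they can be requested. The sketch below shows that resolution with the standard library; the helper name `next_page_url` and the example `href` value are hypothetical. Inside a spider, `response.urljoin(next_href)` performs the same resolution, and `response.follow(next_href, callback=self.parse)` builds the follow-up request directly.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Return the absolute URL of the next page, or None when there is no next link."""
    if not next_href:
        # No "next" link was found on the page, so pagination stops here
        return None
    return urljoin(current_url, next_href)


current_page = 'https://www.scrapethissite.com/pages/forms/'
next_href = '/pages/forms/?page_num=2'  # hypothetical href taken from a "next" link

print(next_page_url(current_page, next_href))
print(next_page_url(current_page, None))
```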

First run the two code cells below:

In [None]:
%%capture
!pip install parsel

In [None]:
from parsel import Selector
import requests
res = requests.get('https://www.scrapethissite.com/pages/forms/') # grab the page using requests lib
doc = res.text # store the html of the page in the variable doc
selector = Selector(doc) # make a selector from doc

There is data on each NHL team's performance for a given year, including:
* team name
* year
* wins
* losses
* overtime losses
* win percentage
* goals for
* goals against
* difference (goals for - goals against)

Use your browser inspector to inspect the html of the page and play with xpaths using `selector.xpath(...)` in the code area below to find xpaths that extract all the data described above:

In [None]:
# WRITE YOUR CODE TO PLAY WITH XPATHS HERE

Once you have figured this out, make a new colab notebook called `NHLScraper.ipynb` and write all of the necessary code to:
* install scrapy on colab
* create a Scrapy project
* write spider code to a file (using `%%writefile`) [this will make use of the xpaths you discovered above]
* run the scrapy project

# R03 Exercise: Scrape quote and author from a single page of `quotes.toscrape.com`

You should now know enough to write your own very simple scraper. Treat the boxes below as if you are writing in a completely blank colab notebook. What do you need to do to write and run your own scraper? We will be focusing on `quotes.toscrape.com`. As you can see from browsing the site, each quote has content, an author, and tags; each author also has an about page. For now, your job is to write a scraper with a spider that scrapes just the content (the body of the quote itself) and the author.


1. Make sure scrapy is installed in colab:

In [None]:
# ENTER CODE HERE

2. Create a new scrapy project:

In [None]:
# ENTER CODE HERE

3. Write your spider code to a file in the appropriate directory:

In [None]:
# ENTER CODE HERE

4. Tell scrapy to start crawling:

In [None]:
# ENTER CODE HERE

You can test out your scraper by making a new colab notebook, copy/pasting your code in the cells above and running it.

<hr/>

# Exercise: Extend your Quote Scraper to scrape multiple pages




Starting with the Quote Scraper that you wrote in the last exercise of the reading R03:
* extend it to follow multiple pages.

In [None]:
# You will probably want to write this in a separate notebook

# Exercise: Extend your Quote Scraper to scrape Author about pages




* Now add the ability for your scraper to gather data from each author's about page. There are many fields in the about page. Use your browser inspector and play in a scrapy shell in xterm to get the right xpaths for each field.
* Try incorporating the fake user agent middleware.
* Use logging to write the user agent, which can be found in `response.request.headers`, to the log.
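
As a hint for the logging step, here is a minimal sketch of reading the user agent out of a header mapping and writing it to a log. The helper name `log_user_agent` and the logger name are hypothetical; header values are bytes, as in the Scrapy shell output earlier in this notebook, so they are decoded first. Inside a spider you would more likely pass `response.request.headers` to something like this, or call `self.logger.info(...)` directly from `parse`.

```python
import logging

logger = logging.getLogger('quotespider')  # hypothetical logger name


def log_user_agent(headers):
    """Log and return the User-Agent found in a header mapping with bytes keys."""
    raw = headers.get(b'User-Agent', b'')
    user_agent = raw.decode('utf-8', errors='replace')
    logger.info('User-Agent: %s', user_agent)
    return user_agent


# Example with a plain dictionary shaped like the request header shown earlier
example_headers = {
    b'Accept-Language': b'en',
    b'User-Agent': b'Scrapy/2.8.0 (+https://scrapy.org)',
}
print(log_user_agent(example_headers))
```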



In [None]:
# You will probably want to write this in a separate notebook