# Assignment 2
## Part I - Web Scraping for Dynamic Webpages
## Part II - Applying your Webscraping skills

Here we will learn how to do web scraping in Python using Scrapy and Selenium for dynamic webpages, and then apply those skills to collect data for your potential project in waste management.

The topics covered include scraping dynamic webpages, link following, and time-based retrieval problems. You are strongly encouraged to use the [Scrapy](https://docs.scrapy.org) and [Selenium](https://selenium-python.readthedocs.io/index.html) documentation and other online resources while working on the problems. We hope the resources in this notebook will help you with data collection and analysis in the future.

**Before getting started:**
1.   Make sure to complete and understand the in-class web scraping activity (week 2).
2.   Please make a copy of this notebook to your own Google Drive by going to "File" and "Save a copy in Drive".

# Setup

Install the pip dependencies and download the Chrome binary and driver (needed for dynamic web scraping).

In [None]:
!pip install scrapy selenium beautifulsoup4
!wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chrome-linux64.zip
!wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chromedriver-linux64.zip
!unzip chrome-linux64.zip
!unzip chromedriver-linux64.zip
!rm *.zip

In [None]:
import re
import os
import sys
import json
import time
import scrapy
import warnings
from datetime import datetime
from urllib.parse import urljoin
warnings.filterwarnings("ignore")

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from multiprocessing import Process, Queue
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.binary_location = './chrome-linux64/chrome'
options.add_argument('--no-sandbox')
options.add_argument('--headless')
service = Service(executable_path='./chromedriver-linux64/chromedriver')


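# this function runs the spider, collects the items yielded by its callback
# functions, and saves them to out_filename (in the format out_filetype).
# The crawl runs in a separate process because the Twisted reactor cannot be
# restarted, which lets you re-run a spider cell without restarting the kernel.
# (no changes necessary to this function)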
def run_spider(spider, out_filename, out_filetype):
    if os.path.exists(out_filename):
        os.remove(out_filename)

    def f(q):
        try:
            runner = CrawlerRunner(settings={
                "FEEDS": {
                    out_filename: {
                        "format": out_filetype},
                },
            })
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if result is not None:
        raise result


# Part I - Scraping dynamic websites

Many modern sites are dynamic in nature, meaning their content is generated on the fly. Generally, for a dynamic website, the server returns an HTML skeleton together with JavaScript, and your browser runs that JavaScript to build the page elements in the DOM. Because Scrapy does not execute JavaScript, we cannot use native Scrapy to scrape dynamic pages: the spider only sees the raw response and will be unable to find the rendered elements.

Instead, we will use a helper package, Selenium, to scrape dynamic websites. Selenium drives a real browser, so the page's JavaScript runs and the spider can read the rendered page. To do this, we need a browser binary and driver. These have already been installed for you in the Setup section and are stored in the directories `./chrome-linux64` and `./chromedriver-linux64` respectively. We will use Selenium to scrape https://quotes.toscrape.com/js/, a dynamic version of the quotes site, as well as https://reddit.com.
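
To make the difference concrete, here is a minimal sketch using only the standard library. The `RAW_HTML` string is a hypothetical, shortened stand-in for what a plain HTTP request might receive from a JavaScript-driven page, and `QuoteDivCounter` is an illustrative helper, not part of Scrapy or Selenium. It shows that the static markup contains no `div.quote` elements, even though the data is present inside a `<script>` tag.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for the raw HTML of a JavaScript-driven page.
RAW_HTML = """
<html><body>
  <div class="container"></div>
  <script>
    var data = [
      {"tags": ["change", "world"], "author": {"name": "Albert Einstein"}, "text": "A sample quote."}
    ];
  </script>
</body></html>
"""


class QuoteDivCounter(HTMLParser):
    """Counts <div class="quote"> elements present in the static markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


# Without running JavaScript, no quote elements exist in the markup.
counter = QuoteDivCounter()
counter.feed(RAW_HTML)
print("div.quote elements in raw HTML:", counter.count)

# The data is still there, but as a JavaScript array literal inside <script>.
match = re.search(r"var data = (\[.*?\]);", RAW_HTML, re.DOTALL)
embedded = json.loads(match.group(1)) if match else []
print("quotes embedded in the script:", len(embedded))
```

A browser would run the script and turn each entry into a `div.quote` element; a tool that does not execute JavaScript never sees those elements.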

### 1.1 Analyzing the structure of dynamic websites

For this section, copy your code from problem 1.2 (in-class activity) and change the initial request to scrape the dynamic webpage https://quotes.toscrape.com/js/. Does your spider correctly extract all the quotes from the site? If not, what is the behavior of your spider? What does the HTTP request return to the spider?

In [None]:
class MyAllQuotesJSSpider(scrapy.Spider):
    # Simply copy the completed code from in-class exercise 1.2 activity here
    # MAKE SURE to scrape data/quotes from https://quotes.toscrape.com/js/
    name = "myspider-1.2" # name your spider

    # list the url(s) where to start reading
    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url=url, callback=self.parse)


    # the callback function (called when the webpage is fetched)
    def parse(self, response):
        # print(response.css("html").get())   # prints the contents of the webpage on the console

        # write the code for extracting all quotes, author and tags from the webpage (hint use a loop!)
        # see the documentation here: https://docs.scrapy.org/en/latest/topics/selectors.html

        for i, quote in enumerate(response.css("div.quote")):
            text = quote.css("span.text::text").get()[1:-1]
            author = quote.css("small.author::text").get()
            tags = quote.css("div.tags a.tag::text").getall()

            yield {
                "text": text,
                "author": author,
                "tags": tags,
            }

        # once all the quotes are extracted from this page, extract the next page
        next_page = response.css("li.next a::attr(href)").get()    # extract the link to the next webpage
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

run_spider(MyAllQuotesJSSpider, "all_quotes_js.json", "json")


<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    
<script src="/static/jquery.js"></script>
<script>
    var data = [
    {
        "tags": [
            "change",
            "deep-thoughts",
            "thinking",
            "world"
        ],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Alber

Observations:

- **Describe your observations here (double click here to edit and answer)**
- What does the HTTP request return?

### 1.2 Scraping dynamic websites

In this section, we will scrape a dynamic webpage. There are many viable methods, but we will do this by creating a custom class called `SeleniumRequest`, which extends the `scrapy.Request` class. Each `SeleniumRequest` object defines its own separate driver (this is important since spiders work asynchronously) and submits a GET request to a specified URL. It also handles callback functions. We have already implemented the `SeleniumRequest` class below.

For this problem, create a spider to scrape all quotes from all pages on https://quotes.toscrape.com/js/

In [None]:
# this creates a new request type, that can help emulate a browser
class SeleniumRequest(scrapy.Request):
    def __init__(self, url, **kwargs):
        super().__init__(url, **kwargs)
        self.driver = webdriver.Chrome(service=service, options=options)
        self.driver.get(url)



class MyDynamicPageSpider(scrapy.Spider):

    name = "MyDynamicPageSpider"

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/page/1/"
        yield SeleniumRequest(url=url, callback=self.parse)   # note: we are making a specialized SeleniumRequest with "parse" callback

    def parse(self, response):
        driver = response.request.driver  # we first extract the driver which will act like a browser

        # we can find various html elements using "find_elements" method,
        # here is a detailed documentation https://selenium-python.readthedocs.io/locating-elements.html

        quotes = driver.find_elements(By.CSS_SELECTOR, "div.quote")   # this extracts the quote element from the webpage

        for i, quote in enumerate(quotes):
            text = quote.find_elements(By.CSS_SELECTOR,"span.text")[0].text
            author = quote.find_elements(By.CSS_SELECTOR,"small.author")[0].text
            tags = []
            for t in quote.find_elements(By.CSS_SELECTOR,"div.tags a.tag"):
              tags.append(t.text)
            item = {
                "text": text,
                "author": author,
                "tags": tags,
            }
            if len(item["tags"]) >= 2:
                yield item   # this yields only quotes with two or more tags

        next_page = None     # write code to extract the link to the next page
        if next_page is not None:
            yield SeleniumRequest(url=next_page, callback=self.parse)

run_spider(MyDynamicPageSpider, "selected_quotes_dynamic.json", "json")
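
When following pagination links, the `href` you read from the page may be relative (for example `/js/page/2/`) or already absolute, and it may be missing on the last page. The sketch below shows one way to normalize it before building the next request. `resolve_next_page` is a hypothetical helper name, and how you obtain `href` from the Selenium driver is left to you.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Return an absolute URL for the next page, or None when there is no link."""
    if not href:
        return None
    # urljoin leaves absolute hrefs unchanged and resolves relative ones
    return urljoin(current_url, href)


print(resolve_next_page("https://quotes.toscrape.com/js/page/1/", "/js/page/2/"))
print(resolve_next_page("https://quotes.toscrape.com/js/page/10/", None))
```

Returning `None` for a missing link fits the `if next_page is not None:` check used in the spider above.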

### 1.3 Scraping more complex dynamic websites

For your next problem, you will choose a subreddit to scrape from the popular social media website [Reddit](https://www.reddit.com/). Your subreddit must have at least 10 posts within the last 12 hours. For this problem, create a spider to retrieve the top 10 highest voted posts within the past 24 hours on your subreddit (you can do this in your browser by first sorting by "Top" posts and then choosing "Today"). For each post, you should save the title, author, number of upvotes and comments, the best (first) comment in the thread, date and time posted in the UTC+0 time zone (why?), as well as any other information you may deem important. Note that Reddit will load more posts as you scroll down, so if your spider gets fewer than 10 posts, you will have to mimic the browser scrolling down in your Selenium driver.
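
As a small illustration of the UTC requirement, the sketch below converts a timestamp with a numeric offset to UTC+0 using the standard library. The example string and its exact format are assumptions; check what the `created-timestamp` attribute actually contains before relying on this pattern. `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a numeric offset and convert it to UTC."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    return parsed.astimezone(timezone.utc)


# hypothetical value; the real attribute format may differ
example = "2023-09-19T10:15:30.000000-0400"
utc_time = to_utc(example)
print(utc_time.strftime("%Y-%m-%d %H:%M:%S"))
```

Normalizing to a single time zone keeps timestamps comparable regardless of where the scraper or the poster is located.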

In [None]:
class MyRedditSpider(scrapy.Spider):

    name = "MyRedditSpider"

    def start_requests(self):
        url = "https://www.reddit.com/r/books/top/?t=day"   # include your favorite subreddit here
        yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        driver = response.request.driver
        posts = driver.find_elements(By.CSS_SELECTOR, "shreddit-post")
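        # scroll until at least 10 posts are loaded; cap the attempts (20 is an
        # arbitrary limit) so the loop cannot run forever on a quiet subreddit
        attempts = 0
        while len(posts) < 10 and attempts < 20:
            driver.execute_script("window.scrollTo(0, window.scrollY + 100000)")   # mimics the scrolling behavior on a browser
            time.sleep(1)   # give the page time to load more posts
            posts = driver.find_elements(By.CSS_SELECTOR, "shreddit-post")
            attempts += 1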

        # posts contains all the loaded posts from the subreddit
        print(posts)
        posts = posts[:10]
        for i, post in enumerate(posts):
            title = post.get_attribute('post-title')
            author = post.get_attribute('author')
            score = int(post.get_attribute('score'))
            comment_count = int(post.get_attribute('comment-count'))
            timestamp = post.get_attribute('created-timestamp')
            date = timestamp[:10]
            url = response.urljoin(post.get_attribute('permalink'))
            item = {
                "title": title,
                "author": author,
                "score": score,
                "comments": comment_count,
                "date": date,
                "url": url,
            }
            yield SeleniumRequest(url, callback=self.parse_comments, cb_kwargs={'item': item})

    def parse_comments(self, response, item):
        driver = response.request.driver
        try:
            best_comment = driver.find_elements(By.CSS_SELECTOR, "shreddit-comment")[0]
            best_comment_text = best_comment.find_element(By.CSS_SELECTOR, "div#-post-rtjson-content").text    # text of the top comment
        except Exception:
            # no comment was found (e.g. the thread is empty or the page layout changed)
            best_comment_text = ""
        item["best_comment"] = best_comment_text
        yield item

run_spider(MyRedditSpider, "most_voted_posts.json", "json")


### 1.4 Time-based retrieval (optional, extra-credit!)

Reddit is able to sort posts by number of upvotes, but what if we wanted to get the posts with the highest number of comments? In this section, you should create a spider to get the top 10 posts with the most comments today. You do not have to save the best (first) comment in each thread, but you should still save all other information in the previous problem. Occasionally, Google Colab may hang while running this spider. If your code is taking longer than five minutes to run, try restarting the kernel and running again.
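
The selection step itself can be sketched independently of Selenium: keep only posts inside the time window, then take the ones with the most comments. The sample data, the `top_commented` helper, and the dictionary keys below are all hypothetical, chosen for illustration only.

```python
import heapq
from datetime import datetime, timedelta, timezone


def top_commented(posts, now, window=timedelta(hours=24), n=10):
    """Return up to n posts created within `window` of `now`, most commented first."""
    cutoff = now - window
    recent = [p for p in posts if p["created"] >= cutoff]
    return heapq.nlargest(n, recent, key=lambda p: p["comments"])


now = datetime(2023, 9, 19, 12, 0, tzinfo=timezone.utc)
sample = [
    {"title": "a", "comments": 5, "created": now - timedelta(hours=1)},
    {"title": "b", "comments": 50, "created": now - timedelta(hours=30)},  # outside the window
    {"title": "c", "comments": 12, "created": now - timedelta(hours=23)},
]

top = top_commented(sample, now, n=2)
print([p["title"] for p in top])
```

Filtering by time before ranking matters here: post "b" has the most comments but is older than the window, so it is excluded.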

In [None]:
class MyMostCommentsRedditSpider(scrapy.Spider):

    name = "MyMostCommentsRedditSpider"

    def start_requests(self):
        url = "https://www.reddit.com/r/books/new/"      # include your favorite subreddit here
        yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        driver = response.request.driver
        posts = driver.find_elements(By.CSS_SELECTOR, "shreddit-post")
        # Posts created before this date are ignored; set it to today's date (UTC)
        threshold_datetime = datetime(2023, 9, 19)

        print('Collecting posts...')
        while True:
            oldest_post = posts[-1]
            oldest_timestamp = oldest_post.get_attribute('created-timestamp')
            oldest_date = oldest_timestamp[:oldest_timestamp.index('T')]
            oldest_datetime = datetime.strptime(oldest_date, '%Y-%m-%d')
            if threshold_datetime <= oldest_datetime:
                driver.execute_script("window.scrollTo(0, window.scrollY + 100000)")
                time.sleep(1)
                posts = driver.find_elements(By.CSS_SELECTOR, "shreddit-post")
            else:
                break
        print(f'Collected {len(posts)} posts')

        # posts contains all the loaded posts from the subreddit

        ordering = []
        for i in range(len(posts)):
          comment_count = int(posts[i].get_attribute('comment-count'))
          ordering.append((comment_count,i))

        ordering.sort(reverse=True)

        print('Getting data from top 10 most commented posts')
        yielded_posts = 0
        for _, i in ordering:
            if yielded_posts == 10:
                break
            post = posts[i]
            title = post.get_attribute('post-title')
            author = post.get_attribute('author')
            score = int(post.get_attribute('score'))
            comment_count = int(post.get_attribute('comment-count'))
            timestamp = post.get_attribute('created-timestamp')
            date = timestamp[:10]

            post_datetime = datetime.strptime(date, '%Y-%m-%d')
            # keep only posts created on or after the threshold date
            if post_datetime >= threshold_datetime:
                yielded_posts += 1
                url = response.urljoin(post.get_attribute('permalink'))          # write code to extract the post url
                item = {
                    "title": title,
                    "author": author,
                    "score": score,
                    "comments": comment_count,
                    "date": date,
                    "url": url,
                }
                yield SeleniumRequest(url, callback=self.parse_comments, cb_kwargs={'item': item})

    def parse_comments(self, response, item):
        driver = response.request.driver
        try:
            best_comment = driver.find_elements(By.CSS_SELECTOR, "shreddit-comment")[0]
            best_comment_text = best_comment.find_element(By.CSS_SELECTOR, "div#-post-rtjson-content").text    # text of the top comment
        except Exception:
            # no comment was found (e.g. the thread is empty or the page layout changed)
            best_comment_text = ""
        item["best_comment"] = best_comment_text
        yield item

run_spider(MyMostCommentsRedditSpider, "most_commented_posts.json", "json")

Collecting posts...
Collected 53 posts
Getting data from top 10 most commented posts


# Part II - Applying your Webscraping skills in Waste Management

Now that you understand the basics of automated webscraping, let's apply it to collect relevant data for your particular *target environment* and potential project in waste management.  

## 2.1 Identify data sources

For your selected target environment in waste management and recycling, identify at least one website that may have relevant data for each of the following key research questions:


* **What are the existing barriers to waste management/recycling?**
  * website1 url: (double click to edit)
  * description of relevant data:
  * justification:
 ---
  * website2 url:
  * description of relevant data:
  * justification:

* **What may be untapped opportunities?**
  * website1 url: (double click to edit)
  * description of relevant data:
  * justification:
 ---
  * website2 url:
  * description of relevant data:
  * justification:

* **What are the best practices from other spaces?**
  * website1 url: (double click to edit)
  * description of relevant data:
  * justification:
 ---
  * website2 url:
  * description of relevant data:
  * justification:

* **Other relevant websites:**
  * website1 url: (double click to edit)
  * description of relevant data:
  * justification:
 ---
  * website2 url:
  * description of relevant data:
  * justification:






## 2.2 Develop your WebScraper
For at least two of the websites you identified in 2.1 as relevant to your project, write your own data scraping script to extract data and save it to a file in JSON format.

(feel free to attend TA office hours for help!)



In [None]:
class MyDataSpider_Website1(scrapy.Spider):
  # write your code here
  pass

run_spider(MyDataSpider_Website1, "data_from_website1.json", "json")


In [None]:
class MyDataSpider_Website2(scrapy.Spider):
  # write your code here
  pass

run_spider(MyDataSpider_Website2, "data_from_website2.json", "json")
