# Web Scraping in Python

***

#### Course Description

> The ability to build tools that retrieve and parse information stored across the internet has been, and continues to be, valuable in many areas of data science. In this course, you will learn to navigate and parse HTML code and build tools to crawl websites automatically. Although our scraping will be conducted using the versatile Python library scrapy, many of the techniques you learn in this course can be applied to other popular Python libraries as well, including BeautifulSoup and Selenium. Upon completion of this course, you will have a strong mental model of HTML structure, will be able to build tools to parse HTML code and access desired information, and will be able to create simple scrapy spiders to crawl the web at scale.

***

## Introduction to HTML

> Learn the structure of HTML. We begin by explaining why web scraping can be a valuable addition to your data science toolbox and then delve into some basics of HTML. We end the chapter with a brief introduction to XPath notation, which is used to navigate the elements within HTML code.

### HyperText Markup Language

### Attributes

### Crash Course in XPath
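
As a rough illustration of the notation, the sketch below evaluates a few XPath-style expressions with the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup. The sample markup mirrors the snippet used later in these notes; the `id` attribute is an assumption added for the example. ElementTree paths are relative to the element being searched, so `.//p` here plays the role of `//p`.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; the id attribute is added for illustration only
html = """
<html>
    <body>
        <div class="hello datacamp" id="greeting">
            <p>Hello World!</p>
        </div>
        <p>Enjoy DataCamp!</p>
    </body>
</html>
"""

root = ET.fromstring(html.strip())

# .//p : every p element at any depth below the root
all_p = [p.text for p in root.findall(".//p")]

# ./body/p : only p elements that are direct children of body
body_p = [p.text for p in root.findall("./body/p")]

# .//div[@id='greeting']/p : p children of the div with that id attribute
div_p = [p.text for p in root.findall(".//div[@id='greeting']/p")]

print(all_p)
print(body_p)
print(div_p)
```

The single slash steps to direct children only, while the double slash searches all descendants, which is why the first two results differ.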

***

## XPaths and Selectors

### Xpathology

### Off the Beaten XPath

### Introduction to the scrapy Selector

In [1]:
from scrapy import Selector

In [2]:
html = '''
<html>
    <body>
        <div class="hello datacamp">
            <p>Hello World!</p>
        </div>
        <p>Enjoy DataCamp!</p>
    </body>
</html>
'''

In [3]:
# Create a scrapy Selector object using a string with the html code
sel = Selector(text=html)

In [4]:
# Select all p
sel.xpath("//p")

[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

In [5]:
sel.xpath("//p").extract()

['<p>Hello World!</p>', '<p>Enjoy DataCamp!</p>']

In [6]:
sel.xpath("//p").extract_first()

'<p>Hello World!</p>'

In [8]:
ps = sel.xpath('//p')
second_p = ps[1]
second_p.extract()

'<p>Enjoy DataCamp!</p>'

### The Source of the Source

In [11]:
from scrapy import Selector

In [14]:
import requests
url = "https://en.wikipedia.org/wiki/Web_scraping"
html = requests.get(url).content

In [15]:
sel = Selector(text=html)

In [16]:
print(sel)

<Selector xpath=None data='<html class="client-nojs vector-featu...'>


***

## CSS Locators, Chaining, and Responses

### From XPath to CSS

In [1]:
html = '''
<html>
    <body>
        <div class="hello datacamp">
            <p>Hello World!</p>
        </div>
        <p>Enjoy DataCamp!</p>
    </body>
</html>
'''

In [2]:
from scrapy import Selector

sel = Selector(text=html)

In [3]:
sel.css("div > p")

[<Selector xpath='descendant-or-self::div/p' data='<p>Hello World!</p>'>]

In [4]:
sel.css("div > p").extract()

['<p>Hello World!</p>']

### CSS Attributes and Text Selection

### Respond Please

### Survey

In [None]:
url = 'https://www.datacamp.com/courses/all'
course_divs = response.css('div.course-block')

***

## Spiders

### Your First Spider

In [6]:
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderClassName(scrapy.Spider):
    name = "spider_name"
    # the code of your spider
    ...

# Initiate a CrawlerProcess
process = CrawlerProcess()

# Tell the process which spider to use
process.crawl(SpiderClassName)

# Start the crawling process
process.start()

2023-09-04 10:18:32 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2023-09-04 10:18:32 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1q  5 Jul 2022), cryptography 37.0.1, Platform Windows-10-10.0.22621-SP0
2023-09-04 10:18:32 [scrapy.crawler] INFO: Overridden settings:
{}
2023-09-04 10:18:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-09-04 10:18:32 [scrapy.extensions.telnet] INFO: Telnet Password: 1fa75098d0dc7446
2023-09-04 10:18:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2023-09-04 10:18:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scr

In [None]:
class DCspider(scrapy.Spider):
    
    name = 'dc_spider'
    
    def start_requests(self):
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        # simple example: write out the html
        html_file = "DC_courses.html"
        with open(html_file, 'wb') as fout:
            fout.write(response.body)

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider(scrapy.Spider):
  name = "your_spider"
  # start_requests method
  def start_requests(self):
    pass
  # parse method
  def parse(self, response):
    pass
  
# Inspect Your Class
inspect_class(YourSpider)

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    urls = ["https://www.datacamp.com", "https://scrapy.org"]
    for url in urls:
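      # Scrapy expects Request objects from start_requests, not bare URL strings
      yield scrapy.Request( url=url, callback=self.parse )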
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

### Start Requests

In [None]:
def start_requests(self):
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse) 

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    self.print_msg( "Hello World!" )
  # parse method
  def parse( self, response ):
    pass
  # print_msg method
  def print_msg( self, msg ):
    print( "Calling start_requests in YourSpider prints out:", msg )
  
# Inspect Your Class
inspect_class( YourSpider )

In [None]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url="https://www.datacamp.com", callback=self.parse )
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

### Parse and Crawl

In [None]:
def parse(self, response):
    # input parsing code with response that you already know!
    # output to a file, or...
    # crawl the web
    pass

In [None]:
def parse(self, response):
    links = response.css('div.course-block > a::attr(href)').extract()
    filepath = 'DC_links.csv'
    with open(filepath, 'w') as f:
        f.writelines([f"{link}\n" for link in links])
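
`writelines` does not append line separators, so each link needs an explicit `\n` to land on its own line. A minimal sketch outside of Scrapy, with hypothetical link values and a temporary file standing in for `DC_links.csv`:

```python
import os
import tempfile

# Hypothetical links standing in for the extracted href values
links = ["/courses/course-a", "/courses/course-b"]

# Write one link per line; writelines adds no separators on its own
filepath = os.path.join(tempfile.mkdtemp(), "DC_links.csv")
with open(filepath, "w") as f:
    f.writelines(f"{link}\n" for link in links)

# Read the file back to confirm each link sits on its own line
with open(filepath) as f:
    lines = f.read().splitlines()

print(lines)
```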

In [None]:
# Create the spider class
class DCspider( scrapy.Spider ):
  name = "dcspider"
  # start_requests method
  def start_requests( self ):
    urls = ['https://www.datacamp.com/courses/all']
    for url in urls:
        yield scrapy.Request( url=url, callback=self.parse )
  # parse method
  def parse( self, response ):
    links = response.css('div.course-block > a::attr(href)').extract()
    for link in links:
        yield response.follow(url=link, callback=self.parse2)
        
  def parse2( self, response ):
    # parse the course sites here
    pass

In [None]:
# Import the scrapy library
import scrapy

# Create the Spider class
class DCspider( scrapy.Spider ):
  name = 'dcspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  # parse method
  def parse( self, response ):
    # Create an extracted list of course author names
    author_names = response.css('p.course-block__author-name::text').extract()
    # Here we will just return the list of Authors
    return author_names
  
# Inspect the spider
inspect_spider( DCspider )

In [None]:
# Import the scrapy library
import scrapy

# Create the Spider class
class DCdescr( scrapy.Spider ):
  name = 'dcdescr'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  
  # First parse method
  def parse( self, response ):
    links = response.css( 'div.course-block > a::attr(href)' ).extract()
    # Follow each of the extracted links
    for link in links:
      yield response.follow(url=link, callback=self.parse_descr)
      
  # Second parsing method
  def parse_descr( self, response ):
    # Extract course description
    course_descr = response.css( 'p.course__description::text' ).extract_first()
    # For now, just yield the course description
    yield course_descr


# Inspect the spider
inspect_spider( DCdescr )

### Capstone

In [None]:
def parse_front(self, response):
    # Narrow in on the course blocks
    course_blocks = response.css('div.course-block')
    # Direct to the course links
    course_links = course_blocks.xpath('./a/@href')
    # Extract the links (as a list of strings)
    links_to_follow = course_links.extract()
    # Follow the links to the next parser
    for url in links_to_follow:
        yield response.follow(url=url, callback=self.parse_pages)
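
`response.follow` accepts relative URLs and resolves them against the URL of the page being parsed, which is why the extracted `href` values can be passed in directly. The resolution is comparable to `urllib.parse.urljoin` from the standard library; in the sketch below the course path is a hypothetical value.

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from
base_url = "https://www.datacamp.com/courses/all"

# Hypothetical relative href as it might appear in a course block
relative_link = "/courses/example-course"

# Resolve the relative link against the page URL
absolute_link = urljoin(base_url, relative_link)

print(absolute_link)
```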

In [None]:
def parse_pages(self, response):
    # Direct to the course title text
    crs_title = response.xpath('//h1[contains(@class, "title")]/text()')
    # Extract and clean the course title text
    crs_title_ext = crs_title.extract_first().strip()
    # Direct to the chapter titles text
    ch_titles = response.css('h4.chapter__title::text')
    # Extract and clean the chapter titles text
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[crs_title_ext] = ch_titles_ext
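
`extract_first()` returns `None` when the selector matches nothing, so calling `.strip()` directly on its result raises `AttributeError` for pages that lack the expected element. One possible guard is sketched below; `clean_text` is a hypothetical helper, not part of Scrapy, and the sample values stand in for `extract_first()` results.

```python
from typing import Optional


def clean_text(value: Optional[str], default: str = "") -> str:
    """Strip surrounding whitespace, falling back to a default for None."""
    if value is None:
        return default
    return value.strip()


# Stand-ins for what extract_first() might return
found = clean_text("\n  Example Course Title  \n")
missing = clean_text(None, default="(no title)")

print(repr(found))
print(repr(missing))
```

Inside `parse_pages`, this would be called as `clean_text(crs_title.extract_first())`. Passing a `default` argument to `extract_first` is another way to avoid the `None` case.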
    

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ch_titles_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
  name = "dc_description_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    # Create a SelectorList of the course titles text
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    # Extract the text and strip it clean
    crs_title_ext = crs_title.extract_first().strip()
    # Create a SelectorList of course descriptions text
    crs_descr = response.css( 'p.course__description::text' )
    # Extract the text and strip it clean
    crs_descr_ext = crs_descr.extract_first().strip()
    # Fill in the dictionary
    dc_dict[crs_title_ext] = crs_descr_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)