# Web Scraping using Python

## Simple parsing with HTMLParser

In this notebook you will practice one of the workflows for using `HTMLParser` effectively. As you already know, `HTMLParser` is a streaming parser: data arrives in chunks, and each chunk is delimited by markup such as tags.

It might feel a bit complicated to have some methods that handle tags and others that process data. This is one of the trade-offs of using a streaming parser.

For this exercise, you will use predefined variables that hold raw HTML content ready to be parsed. Instead of requesting the data from the web, the content is already defined and available to be processed. Parsing it works the same way as parsing HTML fetched from a live page.

In [1]:
content = """
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>1992 World Junior Championships in Athletics – Men's high jump - Wikipedia</title>
"""

Now that the data is available, import `HTMLParser` from the `html.parser` module so that you can write the class next. The class needs an `__init__()` method that calls the parent initializer and sets an instance attribute.

In [2]:
from html.parser import HTMLParser

class Parser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.recording = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.recording = True
        else:
            self.recording = False
            
    def handle_data(self, data):
        if self.recording:
            print(f"Found data for tag: {data}")
            

In [3]:
p = Parser()
p.feed(content)

Found data for tag: 1992 World Junior Championships in Athletics – Men's high jump - Wikipedia
Found data for tag: 



Why is `handle_data()` printing twice? The second line appears to contain _empty_ data. Here is one way to find out: update the `handle_data()` method so that it displays the string with the `repr()` built-in function:

```python
    def handle_data(self, data):
        if self.recording:
            print(f"Found data for tag: {repr(data)}")
```

Re-run the cell that defines the class, then re-run the Parser cell to see if you can spot the problem.

In [4]:
# repr() helps when there are hidden characters that `print()` wouldn't show. 
empty = ""
print(f"A string with an empty string var wouldn't show the variable: {empty}")
print(f"A string with an empty string var wouldn't show the variable: {repr(empty)}")

A string with an empty string var wouldn't show the variable: 
A string with an empty string var wouldn't show the variable: ''


Think about what changes you could make to prevent two lines from showing in the output. There are several approaches you could take to improve the quality of the data gathering, and the previous cells showed one. But what if you are also dealing with newline characters, or other non-visible characters? An alternative you could try is to append each piece of data to a list instead of printing it and, once parsing is complete, join the pieces.

Here is how that would look with some example data.

In [5]:
captured_data = ["1992 World Junior Championships in Athletics – Men's high jump", "\n", "\n", "Wikipedia"]
print(''.join(captured_data))

1992 World Junior Championships in Athletics – Men's high jump

Wikipedia
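
The list-based approach can also be built into the parser itself. The following is a minimal sketch, not the notebook's reference solution: the class name `TitleCollector`, the `chunks` attribute, the `title()` helper, and the `sample` string are all illustrative assumptions. It stops recording when the closing `</title>` tag arrives, and it collapses newlines and repeated spaces when joining.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text inside <title> in a list instead of printing it."""

    def __init__(self):
        super().__init__()
        self.recording = False
        self.chunks = []  # hypothetical attribute holding the captured pieces

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.recording = True

    def handle_endtag(self, tag):
        # Stop recording as soon as the title closes, so the newline
        # that follows </title> is never captured.
        if tag == "title":
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            self.chunks.append(data)

    def title(self):
        # Join the pieces, then collapse any run of whitespace
        # (including newlines) into a single space.
        return " ".join("".join(self.chunks).split())


# Made-up sample markup with a newline inside and after the title
sample = "<html><head><title>High jump\n - Wikipedia</title>\n</head></html>"

collector = TitleCollector()
collector.feed(sample)
collector.close()
print(repr(collector.title()))
```

Printing with `repr()` makes it easy to confirm that no hidden whitespace survived the cleanup.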


## Using Scrapy and XPath

In [None]:
# create a virtual environment
python3 -m venv venv

# activate the venv
source venv/bin/activate

# install scrapy
pip install scrapy

# start a new project
scrapy startproject vulnerabilities

# enter the new project directory
cd vulnerabilities

# genspider needs two arguments: name of the spider and the domain
scrapy genspider cve cve.mitre.org

### Parsing Data with XPath and Scrapy Shell

In [None]:
scrapy shell http://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html

response.url

response.css

response.xpath('//table')

len(response.xpath('//table'))

response.css('table')

len(response.css('table'))

len(response.xpath('//table').xpath('tr'))

response.xpath('//table')[0]

len(response.xpath('//table')[0].xpath('tr'))

len(response.xpath('//table')[3].xpath('tr'))

table = response.xpath('//table')[3]

len(table.xpath('tr'))

In [None]:
table.xpath('//tr')  # '//' starts at the document root, so this finds every row on the page, not just the rows in this table

child = table.xpath('//tr')[10]

child.xpath('td//text()')

In [None]:
child.xpath('td//text()')[0].extract()

for row in table.xpath('//tr'):
    try:
        print(row.xpath('td//text()')[0].extract())
    except IndexError:
        pass
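
The difference between a path scoped to the selected table and one that searches the whole document is easy to miss in the shell session above. The following is a minimal sketch of the same idea using the standard library's `xml.etree.ElementTree` rather than Scrapy selectors, so it runs without Scrapy or a network connection. The markup and the `row-1`, `row-2` values are made-up placeholders, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup: a small navigation table and a data table
doc = ET.fromstring("""<html><body>
  <table id="nav"><tr><td>menu</td></tr></table>
  <table id="data">
    <tr><td>row-1</td><td>x</td><td>value-1</td></tr>
    <tr><td>row-2</td><td>x</td><td>value-2</td></tr>
  </table>
</body></html>""")

tables = doc.findall(".//table")
data_table = tables[1]

# Searching from the document root finds the rows of every table
all_rows = doc.findall(".//tr")

# Searching relative to one table finds only that table's rows
table_rows = data_table.findall("tr")

# First cell of each row in the data table
first_cells = [row.find("td").text for row in table_rows]

print(len(all_rows), len(table_rows), first_cells)
```

In Scrapy, `table.xpath('tr')` or `table.xpath('.//tr')` gives the scoped result, while `table.xpath('//tr')` searches the whole page even when it is called on a sub-selector.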

### Using Scrapy Spider for Web Scraping

In [None]:
vim cve.py

In [None]:
import scrapy


class CveSpider(scrapy.Spider):
    name = 'cve'
    allowed_domains = ['cve.mitre.org']
    start_urls = ['http://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html']

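    def parse(self, response):
        # pick the first table with more than 100 direct rows: the data table
        table = None
        for child in response.xpath('//table'):
            if len(child.xpath('tr')) > 100:
                table = child
                break
        if table is None:
            return
        # 'tr' is relative to the table; '//tr' would search the whole page
        for row in table.xpath('tr'):
            try:
                print(row.xpath('td//text()')[0].extract())
            except IndexError:
                pass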


In [None]:
scrapy crawl cve

## Persistence and Efficiency with Scraping

### Scraping locally

In [None]:
import scrapy


class CveSpider(scrapy.Spider):
    name = 'cve'
    allowed_domains = ['cve.mitre.org']
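    # download this HTML once so that later runs can read a local copy
    start_urls = ['http://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html']

    def parse(self, response):
        # pick the first table with more than 100 direct rows: the data table
        table = None
        for child in response.xpath('//table'):
            if len(child.xpath('tr')) > 100:
                table = child
                break
        if table is None:
            return
        # 'tr' is relative to the table; '//tr' would search the whole page
        for row in table.xpath('tr'):
            try:
                print(row.xpath('td//text()')[0].extract())
            except IndexError:
                pass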
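
The spider above still fetches the page over HTTP. To scrape a saved copy instead, `start_urls` can point at a `file://` URL. The following is a minimal sketch, assuming the page was saved as `source-EXPLOIT-DB.html` in the current directory; the variable names `local_copy` and `local_url` are illustrative. `pathlib.Path.as_uri()` builds the URL from an absolute path and escapes characters such as spaces.

```python
from pathlib import Path

# Assumption: the HTML was saved in the current directory under this name.
# Inside a spider module, Path(__file__).parent would replace Path.cwd().
local_copy = Path.cwd() / "source-EXPLOIT-DB.html"

# as_uri() requires an absolute path and returns a file:// URL
local_url = local_copy.as_uri()

start_urls = [local_url]
print(start_urls)
```

Working from a local copy avoids repeated requests to the site while you iterate on the parsing logic.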

### Persisting data in CSV and JSON formats

In [None]:
import json
import os

import scrapy

current_dir = os.path.dirname(__file__)
url = os.path.join(current_dir, 'source-EXPLOIT-DB.html')


class CveSpider(scrapy.Spider):
    name = 'cve4'
    allowed_domains = ['cve.mitre.org']
    # start_urls = ['http://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html']
    # read the local copy saved next to this spider
    start_urls = [f"file://{url}"]

    def parse(self, response):
        # pick the first table with more than 100 direct rows: the data table
        table = None
        for child in response.xpath('//table'):
            if len(child.xpath('tr')) > 100:
                table = child
                break
        if table is None:
            return
        count = 0
        data = {}

        # 'tr' is relative to the table; '//tr' would search the whole page
        for row in table.xpath('tr'):
            # keep the sample small: stop once more than 100 rows are captured
            if count > 100:
                break
            try:
                exploit_id = row.xpath('td//text()')[0].extract()
                cve_id = row.xpath('td//text()')[2].extract()
                data[exploit_id] = cve_id
                count += 1
            except IndexError:
                pass

        # the context manager closes the file even if json.dump() raises
        with open('vulnerabilities.json', 'w') as json_file:
            json.dump(data, json_file)
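
The spider above writes JSON only. The same mapping could be saved as CSV with the standard library's `csv` module. The following is a minimal sketch under stated assumptions: the `write_csv` helper, the column names, and the placeholder values are illustrative, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Placeholder mapping in the same shape as the spider's `data` dict
data = {"exploit-1": "cve-a", "exploit-2": "cve-b"}


def write_csv(mapping, handle):
    """Write a header row, then one (exploit, cve) row per dict entry."""
    writer = csv.writer(handle)
    writer.writerow(["exploit", "cve"])
    writer.writerows(mapping.items())


# An in-memory buffer stands in for a file here. For a real file, open it
# with newline="" so the csv module controls the line endings, e.g.
# open("vulnerabilities.csv", "w", newline="") (hypothetical file name).
buffer = io.StringIO()
write_csv(data, buffer)
print(buffer.getvalue())
```

CSV suits tabular, two-column data like this, while JSON keeps the key-to-value mapping explicit.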


### Persisting data to a SQLite database

In [None]:
import os
import sqlite3

import scrapy

current_dir = os.path.dirname(__file__)
url = os.path.join(current_dir, 'source-EXPLOIT-DB.html')


class CveSpider(scrapy.Spider):
    name = 'cve5'
    allowed_domains = ['cve.mitre.org']
    # start_urls = ['http://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html']
    start_urls = [f"file://{url}"]

    def parse(self, response):
        connection = sqlite3.connect('vuln.db')
        cursor = connection.cursor()
        # IF NOT EXISTS lets the spider run again against an existing vuln.db
        cursor.execute('CREATE TABLE IF NOT EXISTS vulns (exploit TEXT, cve TEXT)')
        connection.commit()

        # pick the first table with more than 100 direct rows: the data table
        table = None
        for child in response.xpath('//table'):
            if len(child.xpath('tr')) > 100:
                table = child
                break
        if table is None:
            connection.close()
            return
        count = 0

        # 'tr' is relative to the table; '//tr' would search the whole page
        for row in table.xpath('tr'):
            if count > 100:
                break
            try:
                exploit_id = row.xpath('td//text()')[0].extract()
                cve_id = row.xpath('td//text()')[2].extract()
                # the values tuple is the second argument to execute()
                cursor.execute('INSERT INTO vulns (exploit, cve) VALUES(?, ?)', (exploit_id, cve_id))
                count += 1
            except IndexError:
                pass
        # commit once after the loop, then release the database file
        connection.commit()
        connection.close()
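
The SQLite calls can be tried on their own before running the spider. The following is a minimal sketch that uses an in-memory database and made-up placeholder rows instead of scraped data. It shows the parameterized `INSERT`, where the values are passed as a separate argument, and reads the rows back to confirm they were stored.

```python
import sqlite3

# Placeholder rows in the same (exploit, cve) shape the spider stores
rows = [("exploit-1", "cve-a"), ("exploit-2", "cve-b")]

# ":memory:" creates a throwaway database that leaves no file behind
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE IF NOT EXISTS vulns (exploit TEXT, cve TEXT)")

# The ? placeholders are filled from each tuple; the values are passed
# as a separate argument rather than formatted into the SQL string.
connection.executemany("INSERT INTO vulns (exploit, cve) VALUES (?, ?)", rows)
connection.commit()

stored = connection.execute(
    "SELECT exploit, cve FROM vulns ORDER BY exploit"
).fetchall()
connection.close()

print(stored)
```

Passing values through placeholders lets `sqlite3` handle quoting, which matters when scraped text contains quotes or other special characters.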
