<hr>
<div style="background-color: lightgray; padding: 20px; color: black;">
<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/Coursera-Logo_600x600.svg/1024px-Coursera-Logo_600x600.svg.png" style="float: right; margin-right: 30px;" width="120"/> 
<font size="6.5" color="#0056D2"><b>Web Scraping using Python </b></font> <br>
<font size="5.0" color="#0056D2"><b>Scripting with Python and SQL for Data Engineering </b></font> 
</div>
<div style="text-align: left">  <br>
Edison David Serrano Cárdenas. <br>
MSc in Applied Mathematics <br>
CIMAT - Sede Guanajuato <br>
</div>

</div>
<hr>

#  <font color="#0056D2" >**Objectives**</font> 
In this module, you will learn how to extract data from the web efficiently. You will use a scraping library to read data from websites, then identify and extract specific values from them.

*Load Libraries:*

In [1]:
import numpy as np
from html.parser import HTMLParser
from typing import List, Tuple, Dict, Any, Optional

# <font color="#0056D2" >**Introduction to Web Scraping using Python**</font> 

<font color="#0056D2" >**Keywords:**</font> Web Scraping, HTML parsing (BeautifulSoup), Unstructured data, JSON, XML, CSV.


<font color="#0056D2" >**Parsing Techniques with HTMLParser in Python**</font> 


For this exercise, you will use predefined HTML variables with raw content that can be parsed. Instead of requesting the data from the web, the content is already defined and available to be processed.

In [6]:
content = """
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>1992 World Junior Championships in Athletics – Men's high jump - Wikipedia</title>
"""

In [13]:
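class Parser(HTMLParser):
    """Print the text found inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.recording = False  # True while inside <title>

    def handle_starttag(self, tag, attrs):
        # Record only the text that follows a <title> start tag.
        self.recording = tag == "title"

    def handle_endtag(self, tag):
        # Stop recording once any tag closes.
        self.recording = False

    def handle_data(self, data):
        if self.recording:
            print(f"Found data for tag: {data}")
            self.recording = False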

In [None]:
p = Parser()
p.feed(content)

Found data for tag: 1992 World Junior Championships in Athletics – Men's high jump - Wikipedia
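
The same idea can collect the text instead of printing it, which is usually more convenient for later processing. The following is a minimal sketch, assuming a hypothetical `TitleCollector` class and a shortened HTML snippet; neither is part of the course material.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical variant of the parser above: stores <title> text in a list."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True only between <title> and </title>
        self.titles = []        # collected title strings

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


# Illustrative snippet, shortened from the `content` variable used earlier.
sample = (
    "<html><head><title>Men's high jump - Wikipedia</title></head>"
    "<body><p>ignored</p></body></html>"
)

collector = TitleCollector()
collector.feed(sample)
print(collector.titles)
```

Keeping the results in `collector.titles` separates parsing from output, so the values can be filtered or saved afterwards.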


#  <font color="#0056D2" >**Using Scrapy and XPath in Python**</font> 

<font color="#0056D2" >**Creating a Web Scraping Project with Scrapy in Python**</font> 

Create a virtual environment and install Scrapy:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install scrapy
```

Start the project and generate a spider:

```bash
scrapy startproject vulnerabilities
cd vulnerabilities
scrapy genspider cve cve.mitre.org
```


Listing the contents of `vulnerabilities/spiders` shows a file named `cve.py`. It contains a very basic spider class with a single method called `parse`.

<font color="#0056D2" >**Parsing Data with XPath and Scrapy Shell**</font> 



The CVE website provides a reference map in which every CVE listed has an Exploit-DB ID associated with it. The business requirement here is to parse that page and extract this data.

**Link:** https://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html

- The XPath expression `//table` means "give me every table found in this document". The response is a list with several items.

```bash
cd vul
scrapy shell https://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html
>   response
>   response.xpath('//table')
>   response.xpath('//table')[0]    # First table
>   len(response.xpath('//table')[0].xpath('tr')) # Number of rows in the first table
>   table = response.xpath('//table')[3]
>   len(table.xpath('tr'))          # Returns 10719
>   child = table.xpath('//tr')[10] # Note: '//tr' searches the whole document, not only `table`
>   child.xpath('td//text()')[0].extract()  # Returns 'EXPLOIT-DB:10180'
>   child.xpath('td//text()')[3].extract()  # Returns 'CVE-2009-4091'
>   for row in table.xpath('//tr'):
>...    try:
>...        print(row.xpath('td//text()')[0].extract())
>...    except IndexError:
>...        pass
>   exit()

scrapy crawl cve
```
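
In the shell session above, `table.xpath('tr')` is relative to the selected table, while `table.xpath('//tr')` starts again from the document root and matches rows from every table. The difference can be tried offline with the sketch below. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for Scrapy's selectors. The miniature document and its IDs are placeholders, not data from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed miniature of the reference-map page (not real data).
doc = ET.fromstring("""
<html><body>
  <table><tr><td>navigation</td></tr></table>
  <table>
    <tr><th>Reference</th><th>CVE</th></tr>
    <tr><td>EXPLOIT-DB:1</td><td>CVE-0000-0001</td></tr>
    <tr><td>EXPLOIT-DB:2</td><td>CVE-0000-0002</td></tr>
  </table>
</body></html>
""")

tables = doc.findall(".//table")       # every table in the document
table = tables[1]                      # the table holding the mapping

rows_in_table = table.findall("tr")    # relative: rows of this table only
rows_in_doc = doc.findall(".//tr")     # whole document: rows of every table

first_cells = []
for row in rows_in_table:
    cells = row.findall("td")
    if cells:                          # the header row has only <th> cells
        first_cells.append(cells[0].text)

print(len(rows_in_table), len(rows_in_doc))
print(first_cells)
```

The relative query returns fewer rows than the document-wide one, which is why a loop meant for a single table should use a relative path.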

Python script for the spider (`cve.py`), run with `scrapy crawl cve`:

```python
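import scrapy


class CveSpider(scrapy.Spider):
    name = "cve"
    allowed_domains = ["cve.mitre.org"]
    start_urls = ["https://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html"]

    def parse(self, response):
        # Pick the first table with more than 100 rows: the reference map.
        table = None
        for child in response.xpath('//table'):
            if len(child.xpath('tr')) > 100:
                table = child
                break
        if table is None:
            return  # no large table found, nothing to parse

        # Use a relative path ('tr'); '//tr' would match rows in every table.
        for row in table.xpath('tr'):
            try:
                # First text node of the row, e.g. the Exploit-DB ID.
                print(row.xpath('td//text()')[0].extract())
            except IndexError:
                pass  # header or empty rows have no <td> text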
```
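
Once the cell texts are extracted, they can be turned into structured records. The sketch below is one possible way to do it, assuming each row is a list of cell texts whose first entry starts with `EXPLOIT-DB:` and whose CVE entries start with `CVE-`. The helper `pair_ids`, the column names, and the sample rows are hypothetical placeholders, not part of the course code.

```python
import csv
import io


def pair_ids(rows):
    """Yield (exploit_db_id, cve_id) pairs from rows of cell texts."""
    for cells in rows:
        if not cells or not cells[0].strip().startswith("EXPLOIT-DB:"):
            continue                   # skip header and empty rows
        exploit_id = cells[0].strip()
        for text in cells[1:]:
            text = text.strip()
            if text.startswith("CVE-"):
                yield exploit_id, text


# Placeholder rows; in the spider each list could come from
# row.xpath('td//text()').extract()
sample_rows = [
    [],
    ["Reference", "CVE"],
    ["EXPLOIT-DB:1", " ", "CVE-0000-0001", "CVE-0000-0002"],
    ["EXPLOIT-DB:2", "CVE-0000-0003"],
]

buffer = io.StringIO()                 # in-memory stand-in for a CSV file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["exploit_db_id", "cve_id"])
writer.writerows(pair_ids(sample_rows))
print(buffer.getvalue())
```

Writing to `io.StringIO` keeps the example free of side effects; swapping it for an opened file would produce a CSV on disk.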

# <font color="#0056D2" >**Introduction to Scrapy and XPath in Python**</font> 




<font color="#0056D2" >**Keywords:**</font>
- BeautifulSoup: A Python library for parsing HTML and extracting data.
- CSS selectors: Patterns used to select HTML elements for scraping.
- Scrapy: A Python web scraping framework.
- XPath: A query language for selecting elements in XML and HTML documents.

<font color="#0056D2" >**Creating a Web Scraping Project with Scrapy in Python**</font> 
