# Covid: Using Scrapy, regex, and XPath

## References
1. https://www.jitsejan.com/using-scrapy-in-jupyter-notebook
1. https://docs.python.org/3/library/re.html

## Regex

### Method/Attribute

```match()```
Determine if the RE matches at the beginning of the string.

```search()```
Scan through a string, looking for any location where this RE matches.

```findall()```
Find all substrings where the RE matches, and return them as a list.

```finditer()```
Find all substrings where the RE matches, and return them as an iterator.

There are more, e.g. ```sub()```.
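
The cells further down exercise ```match()```, ```search()``` and ```findall()```. As a minimal sketch of the other two, ```finditer()``` and ```sub()``` can be tried on the same kind of string. The sample string `html` and the names `time_re`, `spans` and `redacted` are made up for illustration.

```python
import re

# Hypothetical sample with two times embedded in HTML-like text.
html = '<p>12:05 PM<br>1:30 am</p>'
time_re = re.compile(r'([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm])')

# finditer() yields Match objects lazily, so span() and group() stay available.
spans = [(m.group(0), m.span()) for m in time_re.finditer(html)]
print(spans)

# sub() replaces every non-overlapping match with the replacement string.
redacted = time_re.sub('HH:MM', html)
print(redacted)
```

Unlike ```findall()```, which returns plain strings or tuples, ```finditer()``` keeps the position of each match.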

```
prog = re.compile(pattern)
result = prog.match(string)
```
is equivalent to

```
result = re.match(pattern, string)
```
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
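
A small sketch of that reuse, assuming a handful of made-up strings (`cells`) standing in for scraped table cells: the pattern is compiled once and applied to each string in turn.

```python
import re

# Compile once, then reuse the pattern object for every string.
time_re = re.compile(r'([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm])')

# Hypothetical cell contents, standing in for scraped table cells.
cells = ['<p class="">12:05 PM<br>', '<p>no time here</p>', '<td>9:45 am</td>']

times = []
for cell in cells:
    m = time_re.search(cell)
    # search() returns None when nothing matches, so guard before group().
    times.append(m.group(0) if m else None)

print(times)
```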


In [None]:
import re
s1 = '<p class="">12:05 PM<br>'


In [None]:
s1
type(s1)

In [None]:
p = re.compile('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))')

In [None]:
# The pattern does not match at the start of the string, so match() returns None.
p.match(s1)

In [None]:
p.search(s1)

In [None]:
p.findall(s1)

In [None]:
# The pattern matches at the start of this string, so match() returns a Match object.
# group(0) is the whole match; groups 1 to 4 are the parenthesised sub-patterns.
m1 = re.match('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','12:05 PM' )
print(m1)
print(m1.span())
print(m1.group())
print(m1.string)
print(type(m1))
print(isinstance(m1,re.Match))
for i in range(5):
    print(i,m1.group(i))
 

In [None]:
# The pattern occurs inside the string, so search() returns a Match object for the first occurrence.
m2 = re.search('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','<p class="">12:05 PM<br>' )
print(m2)
print(m2.span())
print(m2.string)
print(m2.group())
print(type(m2))
print(isinstance(m2,re.Match))
for i in range(5):
    print(i,m2.group(i))
 

In [None]:
# findall() returns a list; with several groups in the pattern, each element is a tuple of the groups.
m3 = re.findall('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','<p class="">12:05 PM<br>' )
print(m3)
# print(m3.span())    # 'list' object has no attribute 'span'
# print(m3.string)    # 'list' object has no attribute 'string'
# print(m3.match())     #'list' object has no attribute 'match'
print(type(m3))
print(isinstance(m3,list))
print(m3[0][0])       # first group of the first match (the full time string)

for i in range(4):
    print(i,m3[0][i])


In [None]:
re.match('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','12:05 PM' ).groups()

In [None]:
re.match('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','12:05 PM' ).group() # or group(0), group(1)

In [None]:
re.search('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','<p class="">12:05 PM<br>' ).groups()

In [None]:
re.search('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','<p class="">12:05 PM<br>' ).group()  # or group(0), group(1)

### Compiled pattern

In [None]:
p.match(s1)

In [None]:
p.search(s1)

In [None]:
p.findall(s1)

In [None]:
re.search('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','<p class="">12:05 PM<br>' ).groups()

In [None]:
re.search('(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))','<p class="">12:05 PM<br>' ).group(2)
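
Counting parentheses to find the right group number gets error-prone. As a hedged alternative, the same pattern can be written with named groups. The names `hour`, `minute` and `period` are chosen here for illustration and do not appear in the original pattern.

```python
import re

# Same time pattern, with names instead of positional group numbers.
named_time = re.compile(
    r'(?P<hour>[01]?[0-9]):(?P<minute>[0-5][0-9]) (?P<period>[AaPp][Mm])'
)

m = named_time.search('<p class="">12:05 PM<br>')

# groupdict() maps each group name to the text it captured.
parts = m.groupdict()
print(parts)
print(m.group('hour'), m.group('period'))
```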

In [None]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()


In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess


## XPath
The web pages for public and school exposures differ in HTML structure, and neither is internally consistent, so the XPath selectors differ.
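
To see why one selector cannot serve both pages, here is a rough sketch on a made-up fragment. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, so it runs without a crawl. The markup in `page` is hypothetical and only loosely modelled on the element ids used by the spiders below.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the real pages are larger and messier.
page = """
<html><body>
  <div id="809"><div><div><p><span>School A</span></p><p><span>School B</span></p></div></div></div>
  <div id="9184"><div><span style="font-size:14px;">Venue X</span></div></div>
</body></html>
"""

root = ET.fromstring(page)

# ElementTree supports only a subset of XPath: attribute predicates and the
# descendant axis (//) work, but text() does not, so .text is read instead.
schools = [s.text for s in root.findall(".//*[@id='809']/div/div//span")]
public = [s.text for s in root.findall(".//*[@id='9184']/div/span")]

print(schools)
print(public)
```

Scrapy's own selectors accept full XPath 1.0, including `text()`, which is what the spiders below rely on.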

### Running scrapy in notebook
There are two ways to run a spider from Jupyter:

1. Under the Files tab, open a new terminal (New > Terminal), then run your spider: ```scrapy crawl [options] <spider>```
2. Create a new notebook and use the ```CrawlerProcess``` or ```CrawlerRunner``` class to run the spider in a cell:



In [None]:
import scrapy
# scrape public exposures
!scrapy crawl covid01


In [None]:
# scrape school exposures
!scrapy crawl covid02


## School exposures output to JSON
This class creates a simple pipeline that writes every scraped item to a JSON Lines file (```schoolsresult.jl```), with one JSON object per line.

In [None]:
# Settings for notebook
# restart kernel
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()

In [None]:
# try:
import scrapy
# except:
#     !pip install scrapy
#     import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('schoolsresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


In [None]:
"""
from covid03_spider.py
parses local copy of VCH COVID-19 exposures
school exposures
outputs to json file
"""
import scrapy
import logging
import re

def remove_html_tags(text):
    """Remove html tags from a string"""
    # re is already imported at the top of the cell
    clean = re.compile(r'<.*?>')
    return clean.sub('', text)
    
# class QuotesSpider(scrapy.Spider):
class covid_schools(scrapy.Spider):
    name = "covid03"
    start_urls = [
        'http://localhost/schools_exposures.html',
#         'http://www.vch.ca/covid-19/school-exposures',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # Used for pipeline 1
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
        'FEED_FORMAT': 'json',                                 # Used for pipeline 2
        'FEED_URI': 'schoolsresult.json'                        # Used for pipeline 2
    }

    def parse(self, response):
        """
        use xpath response
        """
        # scrapy regex outputs all match groups as strings
        for myMatch in response.xpath('//*[@id="809"]/div/div//span/text()').getall():
        # for myMatch in response.css('div.table-responsive > table > tbody > tr > td:nth-child(1) > p::text').re(r'(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))'):
            print(myMatch)
            yield {
                'school': myMatch,
            }


In [None]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# process.crawl(QuotesSpider)
process.crawl(covid_schools)
process.start()

In [None]:
# Check the output files

In [None]:
ll *.jl

In [None]:
ll *.json

In [None]:
ll schoolsresult*.*

In [None]:
!tail -n 2 schoolsresult.jl


In [None]:
!tail -n 2 schoolsresult.json

In [None]:
!more schoolsresult.jl

In [None]:
!more schoolsresult.json
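
The two output files differ in shape: the pipeline writes one JSON object per line, while the feed export writes a single JSON array. A minimal sketch of reading both back, using temporary files and made-up items rather than the real `schoolsresult.*` files:

```python
import json
import os
import tempfile

# Hypothetical items standing in for what the spider yields.
items = [{"school": "School A"}, {"school": "School B"}]

with tempfile.TemporaryDirectory() as tmp:
    jl_path = os.path.join(tmp, "sample.jl")
    json_path = os.path.join(tmp, "sample.json")

    # JSON Lines: one object per line, as JsonWriterPipeline writes it.
    with open(jl_path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")

    # Plain JSON: a single array, comparable to the feed export.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(items, f)

    # A .jl file is parsed line by line; a .json file is parsed in one call.
    with open(jl_path, encoding="utf-8") as f:
        from_jl = [json.loads(line) for line in f if line.strip()]
    with open(json_path, encoding="utf-8") as f:
        from_json = json.load(f)

print(from_jl == from_json)
```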

## Public exposures output to JSON
This class creates a simple pipeline that writes every scraped item to a JSON Lines file (```publicresult.jl```), with one JSON object per line.

In [1]:
# Settings for notebook
# restart kernel
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()

'3.7.8'

In [2]:
# try:
import scrapy
# except:
#     !pip install scrapy
#     import scrapy
from scrapy.crawler import CrawlerProcess

In [3]:
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('publicresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


In [4]:
"""
from covid04_spider.py
parses local copy of VCH COVID-19 exposures
public exposures
outputs to json file
"""
import scrapy
import logging
import re

def remove_html_tags(text):
    """Remove html tags from a string"""
    # re is already imported at the top of the cell
    clean = re.compile(r'<.*?>')
    return clean.sub('', text)

# class QuotesSpider(scrapy.Spider):
class covid_public(scrapy.Spider):
    name = "covid04"
    start_urls = [
        'http://localhost/public_exposures.html',
#         'http://www.vch.ca/covid-19/public-exposures',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # Used for pipeline 1
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
        'FEED_FORMAT': 'json',                                 # Used for pipeline 2
        'FEED_URI': 'publicresult.json'                        # Used for pipeline 2
    }

    def parse(self, response):
        """
        use xpath response
        """
        # scrapy regex outputs all match groups as strings
        for myMatch in response.xpath('//*[@id="9184"]/div').re(r'<span style="font-size:14px;">(.*?)</span>'):
        # for myMatch in response.xpath('//*[@id="809"]/div/div//span/text()').getall():
        # for myMatch in response.css('div.table-responsive > table > tbody > tr > td:nth-child(1) > p::text').re(r'(([01]?[0-9]):([0-5][0-9]) ([AaPp][Mm]))'):
            # the captured text can still contain nested tags, so strip them
            myMatch1 = remove_html_tags(myMatch)
            print(myMatch1)
            yield {
                'public': myMatch1,
            }


'\nfrom covid04_spider.py\nparses local copy of VCH COVID-19 exposures\npublic exposures\noutputs to json file\n'

In [5]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# process.crawl(QuotesSpider)
process.crawl(covid_public)
process.start()

2020-10-02 21:35:41 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-02 21:35:41 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.8 (default, Aug 17 2020, 15:18:11) - [GCC 5.4.0 20160609], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.1, Platform Linux-4.15.0-118-generic-x86_64-with-debian-stretch-sid
2020-10-02 21:35:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-02 21:35:41 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
  exporter = cls(crawler)



<Deferred at 0x7f9d31889a90>

Park Drive
Address: 1815 Commercial Drive, Vancouver, BC
Potential exposure date(s): 
September 26
Potential exposure time: Between 6:00 p.m. and 10:00 p.m.

Abruzzo Cappuccino Bar
Address: 1321 Commercial Drive, Vancouver, BC
Potential exposure date(s): 
September 23 to 26
Potential exposure time: Between 1:00 p.m. and 3:00 p.m.

Wreck Beach
Address: Southwest Marine Drive, Vancouver, BC
Potential exposure date(s): 
September 7
Potential exposure time: Between 1:00 p.m. and 8:30 p.m.


The King’s Head Public House
Address: 1618 Yew Street, Vancouver, BC
Potential exposure date(s): 
September 4 to September 7
Potential exposure 


Athens Cultural Club
Address: 114 West Broadway, Vancouver, BC
Potential exposure date(s): 
August 26 to September 8
Potential exposure 


The West Pub
Address: 488 Carrall Street, Vancouver
Potential exposure date(s): 
August 20 to September 8
Potential exposure 


Flying Beaver Bar and Grill
Address: 4760 Inglis Drive, Richmond
Potential exposure date(s): 


In [6]:
# Check the output files

In [7]:
ll *.jl

-rw-rw-r-- 1 alastair 1900 Oct  2 21:35 publicresult.jl
-rw-rw-r-- 1 alastair 2581 Oct  2 21:34 schoolsresult.jl


In [8]:
ll *.json

-rw-rw-r-- 1 alastair 1949 Oct  2 21:35 publicresult.json
-rw-rw-r-- 1 alastair 2632 Oct  2 21:34 schoolsresult.json


In [9]:
ll publicresult*.*

-rw-rw-r-- 1 alastair 1900 Oct  2 21:35 publicresult.jl
-rw-rw-r-- 1 alastair 1949 Oct  2 21:35 publicresult.json


In [10]:
!tail -n 2 publicresult.jl


{"public": "\u200e"}
{"public": "*Locations will be removed from the list one month after the last exposure date, and then archived."}


In [11]:
!tail -n 2 publicresult.json

{"public": "*Locations will be removed from the list one month after the last exposure date, and then archived."}
]
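
Some items in `publicresult.jl` are empty or hold only an invisible mark (`\u200e`), and some values contain non-breaking spaces (`\u00a0`). One possible post-processing step, sketched under the assumption that such items carry no information, is shown below. `clean_text` and the `raw` list are illustrative names and data, not part of the spider.

```python
import re

def remove_html_tags(text):
    """Remove HTML tags from a string (same approach as in the spider)."""
    return re.sub(r'<.*?>', '', text)

def clean_text(text):
    """Strip tags, turn non-breaking spaces into spaces, drop U+200E marks."""
    text = remove_html_tags(text)
    text = text.replace('\u00a0', ' ').replace('\u200e', '')
    return text.strip()

# Hypothetical raw matches, shaped like the values seen in publicresult.jl.
raw = ['Address:\u00a01618 Yew Street, Vancouver, BC', '\u200e', '', '<b>Park Drive</b>']

# Keep only items that still contain text after cleaning.
cleaned = [c for c in (clean_text(r) for r in raw) if c]
print(cleaned)
```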

In [None]:
!more publicresult.jl

{"public": "Park Drive"}
{"public": "Address: 1815\u00a0Commercial Drive, Vancouver, BC"}
{"public": "Potential exposure date(s):\u00a0"}
{"public": "September 26"}
{"public": "Potential exposure time: Between 6:00 p.m. and 10:00 p.m."}
{"public": ""}
{"public": "Abruzzo Cappuccino Bar"}
{"public": "Address: 1321 Commercial Drive, Vancouver, BC"}
{"public": "Potential exposure date(s):\u00a0"}
{"public": "September 23 to 26"}
{"public": "Potential exposure time: Between 1:00 p.m. and 3:00 p.m."}
{"public": ""}
{"public": "Wreck Beach"}
{"public": "Address: Southwest Marine Drive, Vancouver, BC"}
{"public": "Potential exposure date(s):\u00a0"}
{"public": "September 7"}
{"public": "Potential exposure time: Between 1:00 p.m. and 8:30 p.m."}
{"public": ""}
{"public": ""}
{"public": "The King\u2019s Head Public House"}
{"public": "Address:\u00a01618 Yew Street, Vancouver, BC"}
{"public": "Potential exposure date(s):\u00a0"}
{"public": "September 4 to September 7"}
[7

In [None]:
!more publicresult.json
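
The spider emits one flat item per line of text, so the fields of each venue are spread over several items. In the sample output, empty values appear between venues. Assuming that holds for the whole page (it is only inferred from the sample), a rough sketch for regrouping the items could look like this. `group_items` is a hypothetical helper and the `items` list is abbreviated from the output shown earlier.

```python
# Items in the order the spider yields them; empty strings separate venues.
items = [
    {"public": "Park Drive"},
    {"public": "Address: 1815 Commercial Drive, Vancouver, BC"},
    {"public": "Potential exposure date(s): "},
    {"public": "September 26"},
    {"public": ""},
    {"public": "Abruzzo Cappuccino Bar"},
    {"public": "Address: 1321 Commercial Drive, Vancouver, BC"},
    {"public": ""},
    {"public": ""},
]

def group_items(items, key="public"):
    """Split a flat item list into per-venue lists on empty values."""
    groups, current = [], []
    for item in items:
        value = item[key].strip()
        if value:
            current.append(value)
        elif current:
            # An empty value closes the venue collected so far.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

venues = group_items(items)
for venue in venues:
    print(venue[0], '->', venue[1:])
```

Consecutive empty items are treated as a single separator, which matches the doubled empty lines seen before some venues.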