# Scrapy

A short example of a crawler built with the Scrapy library: it scrapes quotes from a website, saves them in two file formats, and loads the results into pandas DataFrames.

![Scrapy Crawler](http://gabrielelanaro.github.io/public/post_resources/part1_scraping/spider.png)

### Import Scrapy

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()

'3.9.11'

In [2]:
try:
    import scrapy
except ImportError:
    !pip install scrapy
    import scrapy
from scrapy.crawler import CrawlerProcess

Collecting scrapy
  Downloading Scrapy-2.7.1-py2.py3-none-any.whl (271 kB)
     -------------------------------------- 271.5/271.5 kB 8.2 MB/s eta 0:00:00
Collecting lxml>=4.3.0
  Downloading lxml-4.9.1-cp39-cp39-win_amd64.whl (3.6 MB)
     ---------------------------------------- 3.6/3.6 MB 21.0 MB/s eta 0:00:00
Collecting service-identity>=18.1.0
  Downloading service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting protego>=0.1.15
  Downloading Protego-0.2.1-py2.py3-none-any.whl (8.2 kB)
Collecting pyOpenSSL>=21.0.0
  Downloading pyOpenSSL-22.1.0-py3-none-any.whl (57 kB)
     ---------------------------------------- 57.0/57.0 kB 2.9 MB/s eta 0:00:00
Collecting tldextract
  Downloading tldextract-3.4.0-py3-none-any.whl (93 kB)
     ---------------------------------------- 93.9/93.9 kB ? eta 0:00:00
Collecting zope.interface>=5.1.0
  Downloading zope.interface-5.5.2-cp39-cp39-win_amd64.whl (211 kB)
     ---------------------------------------- 211.8/211.8 kB ? eta 0:00:00
Colle

  DEPRECATION: PyDispatcher is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559


### Set up a pipeline

In [3]:
import json

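class JsonWriterPipeline:
    """Item pipeline that writes each scraped item as one JSON line."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item and pass it on to any later pipeline.
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item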
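
To see what the pipeline writes without running a crawl, its three hooks can be called by hand. The sketch below uses a stand-alone copy of the class above. The `path` constructor argument, the `run_pipeline` helper, the temporary directory, and the two sample items are assumptions added for illustration, and `spider=None` stands in for the spider object that Scrapy would normally pass.

```python
import json
import os
import tempfile


class JsonWriterPipeline:
    """Stand-alone copy of the pipeline; `path` is an illustrative addition."""

    def __init__(self, path="quoteresult.jl"):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


def run_pipeline(items, path):
    """Call the hooks in the usual order: open, process each item, close."""
    pipeline = JsonWriterPipeline(path)
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    # Read the file back: one JSON document per line.
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]


# Made-up sample items with the same keys the spider yields.
sample_items = [
    {"text": "first quote", "author": "Author A", "tags": ["one", "two"]},
    {"text": "second quote", "author": "Author B", "tags": []},
]

with tempfile.TemporaryDirectory() as tmp:
    written = run_pipeline(sample_items, os.path.join(tmp, "quoteresult.jl"))

print(written == sample_items)
```

Each item ends up on its own line, which is why the file is later read with `lines=True`.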

### Define the spider

In [4]:
import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/'
    ]
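    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # Output 1: the custom item pipeline above writes quoteresult.jl
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
        # Output 2: Scrapy's built-in feed export writes quoteresult.json
        # (FEEDS replaces FEED_FORMAT/FEED_URI, deprecated since Scrapy 2.1)
        'FEEDS': {'quoteresult.json': {'format': 'json'}},
    }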
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
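
The two start URLs differ only in the page number, so the list could also be generated instead of typed out. A minimal sketch, assuming the `/page/N/` pattern seen above holds for the pages requested; `page_urls` and its parameters are hypothetical names, not part of Scrapy.

```python
def page_urls(first, last, base="http://quotes.toscrape.com/page/{}/"):
    """Return the listing URLs for pages first..last (inclusive)."""
    return [base.format(n) for n in range(first, last + 1)]


urls = page_urls(1, 2)
print(urls)
```

The result could be assigned to `start_urls` in the spider class.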

### Start the crawler

In [5]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()

2022-12-12 17:32:50 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: scrapybot)
2022-12-12 17:32:50 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.9.11 (tags/v3.9.11:2de452f, Mar 16 2022, 14:33:45) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19044-SP0
2022-12-12 17:32:50 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.

<Deferred at 0x1cebef033d0>

### Check the files

In [6]:
!dir quoteresult.*

 Volume in drive C is Windows
 Volume Serial Number is F6E8-F02F

 Directory of C:\Users\didimitrov\U_N_I\IR_REPO\information_retrieval_fmi

12/12/2022  05:32 PM             5,571 quoteresult.jl
12/12/2022  05:32 PM             5,573 quoteresult.json
               2 File(s)         11,144 bytes
               0 Dir(s)  384,233,652,224 bytes free
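
`!dir` only works in a Windows shell. As a portable alternative, the same check could be done with `pathlib`. In the sketch below, `list_outputs` is a hypothetical helper, and the demo runs in a temporary directory with dummy files rather than against the real crawl output.

```python
import tempfile
from pathlib import Path


def list_outputs(directory, pattern="quoteresult.*"):
    """Return sorted (file name, size in bytes) pairs for matching files."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(directory).glob(pattern)
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    # Dummy stand-ins for the files written by the crawl.
    Path(tmp, "quoteresult.jl").write_bytes(b'{"a": 1}\n')
    Path(tmp, "quoteresult.json").write_bytes(b'[{"a": 1}]')
    Path(tmp, "other.txt").write_bytes(b"ignored")
    listing = list_outputs(tmp)

for name, size in listing:
    print(f"{size:>8} {name}")
```

In the notebook, `list_outputs('.')` would inspect the current working directory.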


### Create dataframes

In [7]:
import pandas as pd
dfjson = pd.read_json('quoteresult.json')
dfjson

|   | author | tags | text |
|---|--------|------|------|
| 0 | Albert Einstein | [change, deep-thoughts, thinking, world] | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | [abilities, choices] | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | [inspirational, life, live, miracle, miracles] | “There are only two ways to live your life. On... |
| 3 | Jane Austen | [aliteracy, books, classic, humor] | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | [be-yourself, inspirational] | “Imperfection is beauty, madness is genius and... |
| 5 | Albert Einstein | [adulthood, success, value] | “Try not to become a man of success. Rather be... |
| 6 | André Gide | [life, love] | “It is better to be hated for what you are tha... |
| 7 | Marilyn Monroe | [friends, heartbreak, inspirational, life, lov... | “This life is what you make it. No matter what... |
| 8 | J.K. Rowling | [courage, friends] | “It takes a great deal of bravery to stand up ... |
| 9 | Albert Einstein | [simplicity, understand] | “If you can't explain it to a six year old, yo... |


In [8]:
dfjl = pd.read_json('quoteresult.jl', lines=True)
dfjl

|   | author | tags | text |
|---|--------|------|------|
| 0 | Albert Einstein | [change, deep-thoughts, thinking, world] | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | [abilities, choices] | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | [inspirational, life, live, miracle, miracles] | “There are only two ways to live your life. On... |
| 3 | Jane Austen | [aliteracy, books, classic, humor] | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | [be-yourself, inspirational] | “Imperfection is beauty, madness is genius and... |
| 5 | Albert Einstein | [adulthood, success, value] | “Try not to become a man of success. Rather be... |
| 6 | André Gide | [life, love] | “It is better to be hated for what you are tha... |
| 7 | Marilyn Monroe | [friends, heartbreak, inspirational, life, lov... | “This life is what you make it. No matter what... |
| 8 | J.K. Rowling | [courage, friends] | “It takes a great deal of bravery to stand up ... |
| 9 | Albert Einstein | [simplicity, understand] | “If you can't explain it to a six year old, yo... |
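
The two files hold the same records in different layouts: the feed export writes a single JSON array, while the pipeline writes one JSON object per line, which is why the second `read_json` call needs `lines=True`. A small standard-library sketch of the difference, using two made-up records instead of the crawl output; `load_json` and `load_jsonl` are hypothetical helper names.

```python
import json

# Made-up records for illustration only.
records = [
    {"author": "Author A", "tags": ["one"], "text": "first quote"},
    {"author": "Author B", "tags": [], "text": "second quote"},
]

# JSON: the whole file is a single array.
as_json = json.dumps(records)

# JSON Lines: each line is a complete JSON document on its own.
as_jsonl = "".join(json.dumps(r) + "\n" for r in records)


def load_json(text):
    """Parse a document that holds one JSON array."""
    return json.loads(text)


def load_jsonl(text):
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


print(load_json(as_json) == load_jsonl(as_jsonl) == records)
```

Because every line of a JSON Lines file is valid on its own, records can be appended or read one at a time, whereas the array form has to be parsed as a whole.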


In [9]:
dfjson.to_pickle('quotejson.pickle')
dfjl.to_pickle('quotejl.pickle')

In [10]:
!dir *pickle

 Volume in drive C has no label.
 Volume Serial Number is 08C1-44E6

 Directory of C:\Users\Boris Velichkov\Desktop\IR\Crawlers

10.12.2018 г.  19:43             5 472 quotejl.pickle
10.12.2018 г.  19:43             5 472 quotejson.pickle
               2 File(s)         10 944 bytes
               0 Dir(s)  18 069 590 016 bytes free
