<h2 align='center'><b> Simple web scraping (scrapy package) </b></h2> 

---
---

Following JJ's [Using Scrapy in Jupyter notebook](https://www.jitsejan.com/using-scrapy-in-jupyter-notebook) example and the Scrapy [getting started tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html), this notebook builds a simple scraper with the scrapy package. It scrapes quotes from [Quotes To Scrape](https://quotes.toscrape.com/) and saves them as a .json file and a .jl (JSON Lines) file so they can be loaded and analyzed later.


**Note:** Scrapy runs on the Twisted reactor, which cannot be restarted within the same process. To scrape the data again, restart the notebook kernel and run all the cells.

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()
# Reactor restart
#from crochet import setup, wait_for
#setup()

'3.9.7'

In [2]:
# Import Scrapy
try:
    import scrapy
except ImportError:
    # Install Scrapy on first use (-y skips conda's confirmation prompt)
    !conda install -y scrapy
    import scrapy
from scrapy.crawler import CrawlerProcess

* **Set up a pipeline**: writes every scraped item to a JSON Lines file (`quoteresult.jl`), one JSON object per line

In [3]:
import json

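class JsonWriterPipeline:
    """Item pipeline that writes each scraped item as one JSON object per line."""

    def open_spider(self, spider):
        # Called once when the spider starts: open (and truncate) the output file
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item  # pass the item on to any later pipeline or feed export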
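
The JSON Lines layout that the pipeline produces can be tried out without running a crawl. The following is a minimal sketch using only the standard library; the helper names `write_json_lines` and `read_json_lines`, the sample items, and the in-memory buffer standing in for `quoteresult.jl` are all assumptions made for illustration.

```python
import io
import json


def write_json_lines(items, stream):
    """Write each item as one JSON document per line (the .jl layout)."""
    for item in items:
        stream.write(json.dumps(dict(item)) + "\n")


def read_json_lines(stream):
    """Parse a JSON Lines stream back into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical sample items shaped like the scraped quotes
sample_items = [
    {"text": "first quote", "author": "Author A", "tags": ["x", "y"]},
    {"text": "second quote", "author": "Author B", "tags": []},
]

buffer = io.StringIO()  # in-memory stand-in for quoteresult.jl
write_json_lines(sample_items, buffer)
buffer.seek(0)
restored = read_json_lines(buffer)
print(len(restored), restored[0]["author"])
```

Because every line is a complete JSON document, the file can be appended to or read back one record at a time, whereas a regular .json file holds a single array that has to be parsed as a whole.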

* **Define the spider**

In [4]:
import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
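    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # Output 1: custom pipeline that writes JSON Lines to quoteresult.jl
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
        # Output 2: built-in feed export that writes quoteresult.json
        # (FEEDS replaces FEED_FORMAT / FEED_URI, deprecated since Scrapy 2.1;
        #  the 'overwrite' option requires Scrapy 2.4 or later)
        'FEEDS': {'quoteresult.json': {'format': 'json', 'overwrite': True}},
    }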
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
            
        # Follow the pagination: read the href of the <a> tag inside li.next
        # and, if it exists, request the next page with the same callback.
        next_page = response.css('li.next a::attr(href)').get()
        
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
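
The pagination step relies on `response.follow` resolving the relative `href` against the URL of the current page, and on the crawl stopping once no "next" link is found. Below is a minimal sketch of that resolution and stop condition using only the standard library. The `fake_site` mapping and the `crawl_order` helper are hypothetical stand-ins for the real site and for Scrapy's scheduling, and the `href` values are assumed.

```python
from urllib.parse import urljoin

# Hypothetical site map: page URL -> href of its "next" link (None on the last page)
fake_site = {
    "http://quotes.toscrape.com/page/1/": "/page/2/",
    "http://quotes.toscrape.com/page/2/": "/page/3/",
    "http://quotes.toscrape.com/page/3/": None,
}


def crawl_order(start_url, site):
    """Return the URLs visited when following "next" links until none is left."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        next_href = site[url]
        # Resolve the relative href against the current page URL
        url = urljoin(url, next_href) if next_href is not None else None
    return visited


pages = crawl_order("http://quotes.toscrape.com/page/1/", fake_site)
print(pages)
```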

In [5]:
#@wait_for(10)
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()

2022-03-21 20:20:58 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-21 20:20:58 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.19044-SP0
2022-03-21 20:20:58 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
  exporter = cls(crawler)



<Deferred at 0x1397ff2a910>

* **Check the files**

In [6]:
import pandas as pd

# Using .json file
dfjson = pd.read_json('quoteresult.json', lines = False)
dfjson

| | text | author | tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| ... | ... | ... | ... |
| 95 | “You never really understand a person until yo... | Harper Lee | [better-life-empathy] |
| 96 | “You have to write the book that wants to be w... | Madeleine L'Engle | [books, children, difficult, grown-ups, write,... |
| 97 | “Never tell the truth to people who are not wo... | Mark Twain | [truth] |
| 98 | “A person's a person, no matter how small.” | Dr. Seuss | [inspirational] |


In [8]:
# using .jl (json line) file
dfjl = pd.read_json('quoteresult.jl', lines=True)
dfjl

| | text | author | tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| ... | ... | ... | ... |
| 95 | “You never really understand a person until yo... | Harper Lee | [better-life-empathy] |
| 96 | “You have to write the book that wants to be w... | Madeleine L'Engle | [books, children, difficult, grown-ups, write,... |
| 97 | “Never tell the truth to people who are not wo... | Mark Twain | [truth] |
| 98 | “A person's a person, no matter how small.” | Dr. Seuss | [inspirational] |


In [20]:
from collections import Counter

tags = dfjl.loc[:,'tags']

tags

0              [change, deep-thoughts, thinking, world]
1                                  [abilities, choices]
2        [inspirational, life, live, miracle, miracles]
3                    [aliteracy, books, classic, humor]
4                          [be-yourself, inspirational]
                            ...                        
95                                [better-life-empathy]
96    [books, children, difficult, grown-ups, write,...
97                                              [truth]
98                                      [inspirational]
99                                        [books, mind]
Name: tags, Length: 100, dtype: object

In [34]:
freq = {}

for row_tag in tags:
    if len(row_tag) != 0:
        for tag in row_tag:
            if tag in freq:
                freq[tag] += 1
            else:
                freq[tag] = 1
                
freq

{'change': 1,
 'deep-thoughts': 1,
 'thinking': 2,
 'world': 1,
 'abilities': 1,
 'choices': 1,
 'inspirational': 13,
 'life': 13,
 'live': 1,
 'miracle': 1,
 'miracles': 1,
 'aliteracy': 1,
 'books': 11,
 'classic': 2,
 'humor': 12,
 'be-yourself': 1,
 'adulthood': 1,
 'success': 1,
 'value': 1,
 'love': 14,
 'edison': 1,
 'failure': 1,
 'paraphrased': 2,
 'misattributed-eleanor-roosevelt': 1,
 'obvious': 1,
 'simile': 3,
 'friends': 4,
 'heartbreak': 1,
 'sisters': 1,
 'courage': 2,
 'simplicity': 1,
 'understand': 1,
 'fantasy': 1,
 'navigation': 1,
 'activism': 1,
 'apathy': 1,
 'hate': 1,
 'indifference': 1,
 'opposite': 1,
 'philosophy': 2,
 'friendship': 5,
 'lack-of-friendship': 1,
 'lack-of-love': 1,
 'marriage': 1,
 'unhappy-marriage': 1,
 'contentment': 1,
 'fate': 1,
 'misattributed-john-lennon': 1,
 'planning': 1,
 'plans': 1,
 'poetry': 1,
 'happiness': 1,
 'attributed-no-source': 3,
 'religion': 2,
 'comedy': 1,
 'yourself': 2,
 'children': 2,
 'fairy-tales': 1,
 'imagin
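
The same tally can be produced with `collections.Counter`, which the notebook already imports. This is a minimal sketch, assuming the tags arrive as one list per quote; `sample_tags` is a hypothetical stand-in for the `tags` column.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the 'tags' column: one list of tags per quote
sample_tags = [
    ["love", "life"],
    ["love"],
    [],  # quotes without tags contribute nothing
    ["books", "love", "life"],
]

# Flatten the lists into one stream of tags and count each occurrence
tag_counts = Counter(chain.from_iterable(sample_tags))
print(tag_counts.most_common(2))  # → [('love', 3), ('life', 2)]
```

With the notebook's data, `Counter(chain.from_iterable(tags))` should give the same counts as the manual loop, and `most_common(n)` returns the `n` most frequent tags directly.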

In [38]:
def display_table(freq_table, topShow=10):
    '''
    Print the entries of a frequency table (a dict mapping each key to its
    count, such as `freq` above) sorted by count in descending order.
    topShow limits how many rows are printed; the default is 10, and values
    below 1 are treated as 1.
    '''
    # Build (count, key) tuples so that sorting orders by count first
    table_display = [(count, key) for key, count in freq_table.items()]
    table_sorted = sorted(table_display, reverse=True)

    # Guard against zero or negative values of topShow
    rows_to_show = max(topShow, 1)

    for count, key in table_sorted[:rows_to_show]:
        print(key, ':', count)

In [42]:
display_table(freq)

love : 14
life : 13
inspirational : 13
humor : 12
books : 11
reading : 7
friendship : 5
truth : 4
friends : 4
writing : 3
