# Scrapy Library using a Python Script

The Scrapy command-line tool, often called the 'Scrapy tool', is used to control Scrapy. It provides a set of commands, each of which accepts its own arguments and options.

## Folder Structure

```
scrapy.cfg                - Deploy configuration file
project_name/             - The project's Python module
   __init__.py
   items.py               - Project items file
   middlewares.py         - Project middlewares file
   pipelines.py           - Project pipelines file
   settings.py            - Project settings file
   spiders/               - Spiders directory
      __init__.py
      spider_name.py
      . . .
```

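* *Spiders*: Classes that define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from the pages (i.e. scrape items).
* *Items*: Containers for the scraped data. A spider yields them from its callbacks.
* *Middlewares*: Hooks into Scrapy's request/response processing (downloader middlewares) and into the input and output of spiders (spider middlewares).
* *Pipelines*: Components that process each item after the spider yields it, for example to clean, validate, or store it.
* *Settings*: Configuration of the project's behaviour, such as delays, cookies, and caching.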
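
The three areas named under *Settings* correspond to built-in Scrapy settings. The fragment below is a minimal sketch of how they might appear in `settings.py`; the values are illustrative assumptions, not recommendations from this tutorial.

```python
# settings.py (fragment) with illustrative values

# Delays: base delay, in seconds, between consecutive requests to the same site
DOWNLOAD_DELAY = 2

# Cookies: disable the cookies middleware so cookies are neither stored nor sent
COOKIES_ENABLED = False

# Caching: keep downloaded responses on disk and reuse them on later runs
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # cached responses older than an hour are re-downloaded
```

The same keys can also be set per spider through the `custom_settings` dictionary.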

Note the difference between the project settings and the `scrapy.cfg` file.

The `scrapy.cfg` file sits in the project root directory. It names the project and points to the project's settings module. For instance:

```
[settings] 
default = [name of the project].settings  

[deploy] 
#url = http://localhost:6800/ 
project = [name of the project] 
```

Scrapy looks for its configuration settings in the `scrapy.cfg` file. For example, the configuration file of the most recent project is located at:

```/Users/kaveri/Documents/Research/Computer Scripting II/Scripting2-Spring24-25-IIIT-H/Lecture 2/2_Scrapy/Quotes/scrapy.cfg```
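
Because `scrapy.cfg` uses INI syntax, its contents can be inspected with Python's standard `configparser` module. The sketch below parses an in-memory example; the project name `quotes` is an assumption made for illustration.

```python
import configparser

# Hypothetical scrapy.cfg content for a project named "quotes" (assumed name)
CFG_TEXT = """
[settings]
default = quotes.settings

[deploy]
#url = http://localhost:6800/
project = quotes
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

settings_module = parser.get("settings", "default")   # module holding the project settings
deploy_project = parser.get("deploy", "project")      # project name used for deployment
has_deploy_url = parser.has_option("deploy", "url")   # False: the url line is commented out

print(settings_module, deploy_project, has_deploy_url)
```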


## Usage

The Scrapy tool prints its usage and the available commands when run without arguments:

```
Scrapy X.Y - no active project

Usage:
   scrapy <command> [options] [arguments]

Available commands:
   crawl      Run a spider to crawl data
   fetch      Fetch the response from the given URL
```

# Creating a Project

Use the following command to create a Scrapy project:

```scrapy startproject project_name```

This creates a directory called `project_name` that contains the project. Next, move into the newly created project directory with the following command:

```cd  project_name```


# Controlling Projects

You can control and manage the project with the Scrapy tool. For example, create a new spider with the following command:

```scrapy genspider mydomain mydomain.com```

Some commands, such as `crawl`, must be run from inside a Scrapy project. The coming section shows which commands have this requirement.

Try:

```scrapy -h```


## What are spiders? 

Spiders are the classes where you define the custom behaviour for crawling and parsing pages 
for a particular site (or, in some cases, a group of sites).

<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*rsdM_Iy7Wga3aBuozhgKqw.png" height=400px>

For spiders, the scraping cycle goes through something like this:

1.    You start by generating the initial ```Requests``` to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

2.    The first requests to perform are obtained by calling the ```start_requests()``` method, which (by default) generates a ```Request``` for each URL listed in ```start_urls```, with the ```parse``` method as the callback function.

3.    In the callback function, you parse the response (web page) and return item objects, ```Request``` objects, or an iterable of these objects. Those Requests also carry a callback (possibly the same one); Scrapy downloads them and passes each response to the specified callback.

4.    In callback functions, you parse the page contents, typically using ```Selectors```, Scrapy's built-in mechanism for extracting data from web pages (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer), and generate items with the parsed data.

5.    Finally, the items returned from the spider are typically persisted to a database (in some Item Pipeline) or written to a file using ```Feed exports```.


Scrapy generates feed exports in formats such as JSON, CSV, and XML.

# Running Scrapy from a Jupyter notebook

In [24]:
#imports

# scrape webpage
import scrapy
from scrapy.crawler import CrawlerProcess
# text cleaning
import re

## 💬 Quotes from Wikiquote


Before we start, let's have a look at the webpage and the elements matched by `div.mw-parser-output > ul > li`.
Use the Inspect Element tool in Firefox.

A div contains the quotes for each header.
Quotes are in an unordered list.
Each element of the list is a quote.

So we select the div by its class name `mw-parser-output`, then get its child `ul` using the `>` (child) combinator.
Then we get each `li` child of that `ul`, again using the `>` combinator.

In [25]:
#define a spider for scraping the website
class QuotesToCsv(scrapy.Spider):
    """scrape first line of quotes from the `wikiquote` page of
    Mahatma Gandhi and save to csv file"""
    name = "MJKQuotesToCsv"
    start_urls = [
        'https://en.wikiquote.org/wiki/Mahatma_Gandhi', #1. url to scrape
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractFirstLine': 1 #2. what to run to extract data after response object is successfully returned from the website
        },
        'FEEDS': {
            'kaveri_QUOTES.csv': { #3. where to save the extracted data
                'format': 'csv',   #3. format of data. other formats like json and xml are also supported
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        """parse data from urls"""
        for quote in response.css('div.mw-parser-output > ul > li'):
            yield {'quote': quote.extract()}

#define extraction logic
class ExtractFirstLine(object):
    """item pipeline: keep only the first line of each quote, without html tags"""
    html_tags = re.compile('<.*?>') #compiled once and shared by all items

    def process_item(self, item, spider):
        """text processing"""
        lines = dict(item)["quote"].splitlines()
        first_line = self._remove_html_tags(lines[0])

        return {'quote': first_line}

    def _remove_html_tags(self, text):
        """remove html tags from string"""
        return self.html_tags.sub('', text)
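
The text processing done by the pipeline can be tried on its own, without running a crawl. The sketch below applies the same regular expression to a made-up HTML fragment; the `html.unescape` step and the empty-input check are additions for illustration and are not part of the pipeline above.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")  # same non-greedy tag pattern as the pipeline


def first_line_text(raw_html):
    """Return the first line of an HTML fragment with tags removed."""
    lines = raw_html.splitlines()
    if not lines:
        return ""  # nothing to extract from an empty fragment
    text = TAG_PATTERN.sub("", lines[0])
    # Added step (assumption): turn entities such as &amp; back into characters
    return html.unescape(text).strip()


# Made-up list item with a nested list on the second line
sample = "<li>First line with <b>bold</b> &amp; an entity.\n<ul><li>second line</li></ul></li>"
print(first_line_text(sample))
```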

In [26]:
#execute the crawler

process = CrawlerProcess() #define the crawler
process.crawl(QuotesToCsv) #attach the spider to the crawler
process.start()

2025-01-08 21:18:26 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-08 21:18:26 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-08 21:18:26 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-08 21:18:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-08 21:18:26 [scrapy.extensions.telnet] INFO: Telnet Password: 842116b14b51d53a
2025-01-08 21:18:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-08 21:18:26 [scrapy.crawler] INFO: Overridden 

ReactorNotRestartable: 
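
The `ReactorNotRestartable` error above is raised by Twisted when `process.start()` is called a second time in the same Python process: the reactor can only be started once, so re-running a crawl cell requires restarting the kernel. One possible workaround, sketched here under assumptions, is to save the spider to a standalone file and run it in a fresh process each time. The helper names and the file name `quotes_spider.py` are hypothetical.

```python
import subprocess
import sys


def build_runspider_command(spider_file):
    """Build the command that runs one self-contained spider file in a new process."""
    # `scrapy runspider <file>` runs a spider without needing a Scrapy project
    return [sys.executable, "-m", "scrapy", "runspider", spider_file]


def run_spider(spider_file):
    """Run the spider in a child process, so every call gets a fresh Twisted reactor."""
    return subprocess.run(build_runspider_command(spider_file), check=True)


# Hypothetical file that would contain the spider class
print(build_runspider_command("quotes_spider.py"))
```

If the spider is moved into a file this way, the `'__main__.ExtractFirstLine'` entry in `ITEM_PIPELINES` would most likely need to point at the module where the pipeline class then lives.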

## 🎂 Cakes from Karachi Bakery

Before we start, let's have a look at the webpage and its elements.
The span is present in every element and provides the zoom effect.

Scroll upwards in the inspector, though, and note that each cake is an `<a class="fancybox">` element.
Use the Inspect Element tool in Firefox.

Each element has a title and an image source. Let's make a two-column CSV with the title and the image source.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

#define a spider for scraping the website
class CakesToCsv(scrapy.Spider):
    """scrape title and image link of each cake from the
    Karachi Bakery birthday cakes page and save to csv file"""
    name = "CakesToCsv"
    start_urls = [
        'https://www.karachibakery.com/birthday-cakes1?pg=1', #1. url to scrape
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractFirstLine': 1 #2. what to run to extract data after response object is successfully returned from the website
        },
        'FEEDS': {
            'kaveri_CAKES.csv': { #3. where to save the extracted data
                'format': 'csv',   #3. format of data. other formats like json and xml are also supported
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        """parse data from urls"""
        for cake in response.css('a.fancybox'):
            anchor_html = cake.extract() #raw html of the anchor tag; the pipeline extracts the fields from it
            yield {'cake_title': anchor_html}

#define extraction logic
class ExtractFirstLine(object):
    def process_item(self, item, spider): #create columns for csv file
        """text processing"""
        anchor_html = dict(item)["cake_title"] #search the whole tag, not only its first line
        title = self._get_cake_title(anchor_html)
        img = self._get_cake_img_link(anchor_html)

        return {'cake_title': title, 'cake_img': img}

    def _get_cake_title(self, text):
        """get title of the anchor tag (None if there is no title)"""
        match = re.search(r'title="(.*?)"', text)
        return match[1] if match else None

    def _get_cake_img_link(self, text):
        """get the image link of the anchor tag (None if there is no image)"""
        match = re.search(r'img src="(.*?)"', text)
        return match[1] if match else None

In [None]:
#execute the crawler

process = CrawlerProcess() #define the crawler
process.crawl(CakesToCsv) #attach the spider to the crawler
process.start()

2025-01-07 13:18:03 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-07 13:18:03 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-07 13:18:03 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-07 13:18:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-07 13:18:03 [scrapy.extensions.telnet] INFO: Telnet Password: 9a7408c6beb60171
2025-01-07 13:18:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-07 13:18:03 [scrapy.crawler] INFO: Overridden 

## Getting multiple pages

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

#define a spider for scraping the website
class QuotesToCsv(scrapy.Spider):
    """scrape first line of quotes from several `wikiquote`
    pages and save to csv file"""
    name = "MJKQuotesToCsv"
    start_urls = [
        'https://en.wikiquote.org/wiki/Indira_Gandhi', #1. url to scrape
        'https://en.wikiquote.org/wiki/Sachin_Tendulkar', #1. url to scrape
        'https://en.wikiquote.org/wiki/Bhagat_Singh', #1. url to scrape
        'https://en.wikiquote.org/wiki/Subhas_Chandra_Bose', #1. url to scrape
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractFirstLine': 1 #2. what to run to extract data after response object is successfully returned from the website
        },
        'FEEDS': {
            'kaveri_QUOTES.csv': { #3. where to save the extracted data
                'format': 'csv',   #3. format of data. other formats like json and xml are also supported
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        """parse data from urls"""
        for quote in response.css('div.mw-parser-output > ul > li'):
            yield {'quote': quote.extract()}

#define extraction logic
class ExtractFirstLine(object):
    """item pipeline: keep only the first line of each quote, without html tags"""
    html_tags = re.compile('<.*?>') #compiled once and shared by all items

    def process_item(self, item, spider):
        """text processing"""
        lines = dict(item)["quote"].splitlines()
        first_line = self._remove_html_tags(lines[0])

        return {'quote': first_line}

    def _remove_html_tags(self, text):
        """remove html tags from string"""
        return self.html_tags.sub('', text)

#execute the crawler

process = CrawlerProcess() #define the crawler
process.crawl(QuotesToCsv) #attach the spider to the crawler
process.start()

2025-01-07 16:14:44 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-07 16:14:44 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-07 16:14:44 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-07 16:14:44 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-07 16:14:44 [scrapy.extensions.telnet] INFO: Telnet Password: 12caba2ca4cc5278
2025-01-07 16:14:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-07 16:14:44 [scrapy.crawler] INFO: Overridden 

## 📚 Try scraping multiple pages of the Karachi Bakery


### Try writing a for loop to create a list of urls first

In [None]:
start_urls_custom = []

for i in range(1,4):
    url = "https://www.karachibakery.com/birthday-cakes"+str(i)+"?pg="+str(i)
    start_urls_custom.append(url)
print(start_urls_custom)

['https://www.karachibakery.com/birthday-cakes1?pg=1', 'https://www.karachibakery.com/birthday-cakes2?pg=2', 'https://www.karachibakery.com/birthday-cakes3?pg=3']
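
The same list can be built more compactly with an f-string inside a list comprehension. This is only a sketch: the helper name `page_urls` is hypothetical, and the URL pattern is copied from the loop above rather than checked against the site.

```python
BASE = "https://www.karachibakery.com/birthday-cakes"


def page_urls(first_page, last_page):
    """Return the listing URLs for pages first_page..last_page (inclusive)."""
    return [f"{BASE}{page}?pg={page}" for page in range(first_page, last_page + 1)]


start_urls_custom = page_urls(1, 3)
print(start_urls_custom)
```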


Modify `start_urls` so that the spider uses this list of URLs.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

#define a spider for scraping the website
class CakesToCsv(scrapy.Spider):
    """scrape title and image link of each cake from the
    Karachi Bakery birthday cakes pages and save to csv file"""
    name = "CakesToCsv"
    start_urls = start_urls_custom #1. urls to scrape (the list built in the previous cell)
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractFirstLine': 1 #2. what to run to extract data after response object is successfully returned from the website
        },
        'FEEDS': {
            'kaveri_CAKES.csv': { #3. where to save the extracted data
                'format': 'csv',   #3. format of data. other formats like json and xml are also supported
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        """parse data from urls"""
        for cake in response.css('a.fancybox'):
            anchor_html = cake.extract() #raw html of the anchor tag; the pipeline extracts the fields from it
            yield {'cake_title': anchor_html}

#define extraction logic
class ExtractFirstLine(object):
    def process_item(self, item, spider): #create columns for csv file
        """text processing"""
        anchor_html = dict(item)["cake_title"] #search the whole tag, not only its first line
        title = self._get_cake_title(anchor_html)
        img = self._get_cake_img_link(anchor_html)

        return {'cake_title': title, 'cake_img': img}

    def _get_cake_title(self, text):
        """get title of the anchor tag (None if there is no title)"""
        match = re.search(r'title="(.*?)"', text)
        return match[1] if match else None

    def _get_cake_img_link(self, text):
        """get the image link of the anchor tag (None if there is no image)"""
        match = re.search(r'img src="(.*?)"', text)
        return match[1] if match else None

#execute the crawler

process = CrawlerProcess() #define the crawler
process.crawl(CakesToCsv) #attach the spider to the crawler
process.start()

2025-01-07 16:07:32 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-07 16:07:32 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-07 16:07:32 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-07 16:07:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-07 16:07:32 [scrapy.extensions.telnet] INFO: Telnet Password: 5ca322b08bf71435
2025-01-07 16:07:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-07 16:07:32 [scrapy.crawler] INFO: Overridden 

## 🎂 Scraping the entire bakery using BeautifulSoup

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

crawled_websites = set()   # Set to keep track of already crawled URLs
website_queue = []         # Queue for URLs to crawl

def is_url(link):
    # Checks using regex if 'link' is a valid url
    link = str(link)
    url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*/\\,() ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
    return " ".join(url) == link

def read_anchors(site):
    # Downloads the page and extracts all anchor tags
    response = requests.get(site, timeout=10)  # timeout so a slow page cannot hang the crawl
    soup = BeautifulSoup(response.text, 'lxml')
    return soup.find_all('a')

def link_from_anchor(anchor, base_url):
    # Extracts href from anchor and converts relative links to absolute
    link = anchor.get('href')  # None if the anchor has no href attribute
    if link and link.startswith('/'):
        # Convert relative link to absolute using base_url
        link = urljoin(base_url, link)
    return link

def get_links(site, filename):
    # Main crawling function
    # Adds the initial site to the queue if not already crawled or queued
    if (site not in crawled_websites) and (site not in website_queue):
        website_queue.append(site)
    with open(filename, 'a') as f:
        while website_queue:
            if len(crawled_websites) >= 100:
                # Limit crawl to 100 pages for safety
                break
            current_site = website_queue.pop(0)
            if current_site in crawled_websites:
                continue
            try:
                anchors = read_anchors(current_site)  # Get all links from the page
            except Exception:
                continue  # Skip pages that cannot be downloaded or parsed
            for anchor in anchors:
                link = link_from_anchor(anchor, current_site)  # Extract and normalize link
                # Only add valid, uncrawled, in-domain links to the queue and file
                if link and is_url(link) and ('karachibakery.com' in link) and (link not in crawled_websites):
                    website_queue.append(link)
                    f.write(link + "\n")
            crawled_websites.add(current_site)  # Mark as crawled

get_links("https://www.karachibakery.com/", "kaveri_ALLCAKES.txt")
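
The crawler decides whether a link is in-domain by checking that the text `karachibakery.com` occurs somewhere in the URL, which would also accept an unrelated site that merely mentions the name in its path or query string. A stricter check, sketched below with the standard `urllib.parse` module, compares the host name instead. The helper names and the `/contact` path are assumptions for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(base_url, href):
    """Resolve an href (absolute or relative) against the page it was found on."""
    return urljoin(base_url, href)


def is_in_domain(url, domain):
    """True if the URL's host is `domain` itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


base = "https://www.karachibakery.com/birthday-cakes1?pg=1"
print(resolve_link(base, "/contact"))
print(is_in_domain("https://order.karachibakery.com/shop", "karachibakery.com"))
print(is_in_domain("https://example.com/?ref=karachibakery.com", "karachibakery.com"))
```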

In [2]:
crawled_websites #print some sample crawled websites

{'http://www.karachibakery.com/virtualtour/karachi-virtualtour.html',
 'https://order.karachibakery.com',
 'https://order.karachibakery.com/',
 'https://order.karachibakery.com/pages/contact',
 'https://order.karachibakery.com/pages/terms',
 'https://order.karachibakery.com/shop',
 'https://order.karachibakery.com/shop/account',
 'https://order.karachibakery.com/shop/account/favourites',
 'https://order.karachibakery.com/shop/c',
 'https://order.karachibakery.com/shop/c/200g-packs_6055',
 'https://order.karachibakery.com/shop/c/assorted-biscuits-pack_5344',
 'https://order.karachibakery.com/shop/c/biscotti_5998',
 'https://order.karachibakery.com/shop/c/buy-1-get-1-free_4942',
 'https://order.karachibakery.com/shop/c/chocolate-biscuits_6000',
 'https://order.karachibakery.com/shop/c/christmas-special-cakes_5262',
 'https://order.karachibakery.com/shop/c/christmas-specials_8441',
 'https://order.karachibakery.com/shop/c/cocoatini-exc-chocolate-collection_4943',
 'https://order.karachiba