# Week 4: Day 1 PM // EDA

## Basic Web Scraping

Scrapy can be installed with either Anaconda (conda) or pip.

`conda install -c conda-forge scrapy`

or

`pip install Scrapy`

### Creating new Scrapy Project

Open a terminal (macOS/Linux) or a command prompt (Windows) and run:

`scrapy startproject h8scrapy`

Once the project is created, you will find a new project folder containing a configuration file (`scrapy.cfg`) and a package folder.

The package folder collects the different components of the crawler that we will create later.

Once you enter the project folder, you can see the project structure and supporting files.

Let’s have a look at them in detail:

| File | Description |
|---|---|
| `spiders/` | This directory contains all the spiders, each defined as a Python class. When Scrapy is asked to run a spider, it looks for it in this folder. |
| `items.py` | Defines the containers (items) that will hold the scraped data. |
| `middlewares.py` | Contains the middleware that hooks into how requests and responses are processed. |
| `pipelines.py` | Contains a set of Python classes to process scraped data further. |
| `settings.py` | Holds the project settings; any customized settings can be added to this file. |

### Creating Spider

A Spider is a class that defines how to scrape and extract data from a given site. In other words, it determines how to perform the crawl.
 
In order to create a Spider, we can use the command below.

`scrapy genspider <spidername> <your-link-here>`

For spidername, you can give any name for your spider and for the link, you can give the URL of the site or domain that you are going to scrape data from. In this session, we will extract manga info from manganato.com.

We will call our spider `reviewspider`.

`scrapy genspider reviewspider https://manganato.com/genre-all`



### Scrapy Shell

The Scrapy shell is an interactive shell, similar to the Python shell, in which you can try out and debug your scraping code. With it, you can test your XPath and CSS expressions and verify the data they extract without having to run your spider, which makes it a fast and valuable tool for developing and debugging.

The Scrapy shell can be launched using the command below.

`scrapy shell <url>`

### Identifying the HTML structure

Before we start coding our spider we need to analyze how the web page is structured and identify its patterns. In order to view the HTML structure of the page, we can right-click and go to Inspect, or we can view it through the Browser’s Developer Tools.

When we further expand each manga division, we can see there are separate blocks for each component of the manga. Our focus will be on the manga_name, rating and the link.

```html
<a rel="nofollow" class="genres-item-name text-nowrap a-h" href="https://readmanganato.com/manga-il986046" title="Pumpkin Time">Pumpkin Time</a>
```
 
Here, manga ratings are identified by the class `genres-item-rate`.

```html
<em class="genres-item-rate">4.6</em>
```

The review text is identified by the class `genres-item-description`. Next, we will use this information to define our Spider.

### Defining Scrapy Parser

```py
import scrapy


class ReviewspiderSpider(scrapy.Spider):
    name = 'reviewspider'
    allowed_domains = ['manganato.com']  # domain names only, not full URLs
    start_urls = ['https://manganato.com/genre-all']

    def parse(self, response):
        pass

```

This is the basic template of the spider. `allowed_domains` and `start_urls` are filled in from the link we provided when we created the spider.
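Note that Scrapy expects the entries of `allowed_domains` to be bare domain names rather than full URLs, and it warns about entries that look like URLs. As a minimal sketch, the domain can be derived from a link with the standard library's `urllib.parse.urlparse`. The variable names below are illustrative and not part of the generated template.

```python
from urllib.parse import urlparse

# The link passed to `scrapy genspider` earlier in this section.
start_url = 'https://manganato.com/genre-all'

# netloc is the host part of the URL, without the scheme or the path.
domain = urlparse(start_url).netloc

allowed_domains = [domain]
print(allowed_domains)
```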



The logic for extracting our data is written in the `parse` method, which Scrapy calls with the response for each page listed in `start_urls`.

Scrapy can crawl multiple URLs concurrently. To build the additional URLs, identify the base URL and the part of each other URL that has to be joined to it, then combine them with `urljoin()` (available in Scrapy as `response.urljoin()`). In this example, however, we will use only the base URL.
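The joining step can be tried outside of Scrapy. `response.urljoin()` is a wrapper around the standard library's `urllib.parse.urljoin`, so the sketch below uses that function directly. The relative paths are hypothetical examples, not values taken from the site.

```python
from urllib.parse import urljoin

base_url = 'https://manganato.com/genre-all'

# Hypothetical relative links, as they might appear in href attributes.
relative_paths = ['/genre-all/2', '/genre-all/3']

# Resolve each relative path against the base URL.
absolute_urls = [urljoin(base_url, path) for path in relative_paths]
print(absolute_urls)
```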

Below is the code written in the Scrapy parser to scrape the name, view count, rating, and link of each manga.

```py
def parse(self, response):
    manga_name = response.xpath('//a[@class="genres-item-name text-nowrap a-h"]/text()').extract()
    viewed = response.xpath('//span[@class="genres-item-view"]/text()').extract()
    rating = response.xpath('//em[@class="genres-item-rate"]/text()').extract()
    link = response.xpath('//a[@class="genres-item-name text-nowrap a-h"]/@href').extract()

    for item in zip(manga_name, viewed, rating, link):
        scraped_data = {
            'MangaName': item[0],
            'Viewed': item[1],
            'Rating': item[2],
            'Link': item[3]
        }

        # yield or give the scraped info to scrapy
        yield scraped_data
```

Scrapy comes with its own mechanism, called Selectors, to extract data. Selectors use XPath and CSS expressions to select different elements in HTML documents. The code above uses XPath selectors.

```py
viewed=response.xpath('//span[@class="genres-item-view"]/text()').extract()
```

In the above code line, Scrapy uses XPath to select every matching node in the response, and `.extract()` returns their text as a list of strings.
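To see what such an expression matches without fetching the page, here is a rough sketch that runs similar paths against the two HTML snippets shown earlier. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors: ElementTree supports only a subset of XPath (no `/text()` or `/@href` steps) and needs well-formed markup, so the wrapping `<div>` is an assumption added for this illustration.

```python
import xml.etree.ElementTree as ET

# Fragment assembled from the two HTML snippets shown earlier; the wrapping
# <div> is an assumption so that the fragment has a single root element.
fragment = (
    '<div>'
    '<a rel="nofollow" class="genres-item-name text-nowrap a-h" '
    'href="https://readmanganato.com/manga-il986046" '
    'title="Pumpkin Time">Pumpkin Time</a>'
    '<em class="genres-item-rate">4.6</em>'
    '</div>'
)
root = ET.fromstring(fragment)

# Select the elements by their exact class attribute value.
link = root.find(".//a[@class='genres-item-name text-nowrap a-h']")
rating = root.find(".//em[@class='genres-item-rate']")

# ElementTree has no /text() or /@href steps, so read them from the element.
manga_name = link.text
href = link.get('href')
print(manga_name, rating.text, href)
```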

```py
for item in zip(manga_name, viewed, rating, link):
    scraped_data = {
        'MangaName': item[0],
        'Viewed': item[1],
        'Rating': item[2],
        'Link': item[3]
    }
```

In the above code, `zip()` groups the values for each manga into a tuple, and we store each tuple's values in a Python dictionary.
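One caveat worth keeping in mind: `zip()` stops at the shortest input, so if an element is missing for one manga on the page, rows can be dropped or values can end up paired with the wrong title. The sketch below illustrates this with small hand-written lists; the view counts are made-up values and one rating is left out on purpose.

```python
manga_name = ['Pumpkin Time', 'Sensou Gekijou']
viewed = ['1,000', '2,000']  # made-up values for illustration
rating = ['4.6']             # the second rating is missing on purpose
link = [
    'https://readmanganato.com/manga-il986046',
    'https://manganato.com/manga-kc961759',
]

# zip() yields tuples only while every list still has a value left.
rows = [
    {'MangaName': n, 'Viewed': v, 'Rating': r, 'Link': l}
    for n, v, r, l in zip(manga_name, viewed, rating, link)
]
print(len(rows))
```

Comparing the lengths of the extracted lists before zipping them is a simple way to notice this situation.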

`yield scraped_data`
 
The yield statement returns the scraped data for Scrapy to process and store.

### Running Spider

`scrapy runspider h8scrapy/spiders/reviewspider.py -o scraped_data.csv`

The `runspider` command takes `reviewspider.py` as the input file and produces the CSV file `scraped_data.csv`, which contains the collected results.
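To get a feel for how the yielded dictionaries turn into CSV rows, here is a small sketch that uses the standard library's `csv` module as a stand-in for Scrapy's exporter. The item is hand-written and its view count is a made-up value; the exact column order and quoting that Scrapy produces may differ.

```python
import csv
import io

# One hand-written item shaped like the dictionaries yielded by parse().
items = [{
    'MangaName': 'Pumpkin Time',
    'Viewed': '1,000',  # made-up value for illustration
    'Rating': '4.6',
    'Link': 'https://readmanganato.com/manga-il986046',
}]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()))
writer.writeheader()      # the dictionary keys become the header row
writer.writerows(items)   # each dictionary becomes one data row

csv_text = buffer.getvalue()
print(csv_text)
```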

### Scrapy Feed Exports

Scrapy provides the Feed Export option to store the extracted data in different formats or serialization methods. It supports formats such as CSV, XML, and JSON.

For example, if you want your output in CSV format, go to the `settings.py` file and add the lines below.

```py
FEED_FORMAT="csv"

FEED_URI="scraped_data.csv"
```

Save this file and rerun the spider. You will then see the CSV file created under your project directory.

If you want a timestamp or the name of the spider in your file name, you can add `%(time)s` or `%(name)s` to your `FEED_URI`. For example:

`FEED_URI = "scraped_data_%(time)s.csv"`
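In Scrapy 2.1 and later, `FEED_FORMAT` and `FEED_URI` are deprecated in favor of a single `FEEDS` setting that maps each output URI to its options. Assuming such a version is installed, an equivalent configuration in `settings.py` might look like the sketch below.

```python
# settings.py
# Each key is an output URI; its value holds the options for that feed.
FEEDS = {
    'scraped_data_%(time)s.csv': {'format': 'csv'},
}
```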

### Handling Multiple Pages

To scrape more than the first page, we need to find out how the site links to the next page so that we can add the additional URLs to the `start_urls` list.

The second start url is: `https://manganato.com/genre-all/2`

The code below will be used in the spider. All it does is extend the list of `start_urls`.

```py
# Placed in the spider class body, right after start_urls is defined.
for i in range(2, 1000):
    start_urls.append(f'https://manganato.com/genre-all/{i}')
```
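As a standalone sketch, the same list can be built and inspected outside the spider class. The variable `base_url` is introduced here for illustration only, and whether the site really has this many pages is not checked.

```python
base_url = 'https://manganato.com/genre-all'

# The first page has no number; pages 2 to 999 append their number.
start_urls = [base_url] + [f'{base_url}/{i}' for i in range(2, 1000)]

print(len(start_urls))
print(start_urls[:3])
```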

### Handling Individual Page

The best way to learn how to extract data with Scrapy is using the Scrapy shell. We will use XPaths which can be used to select elements from HTML documents.

The first XPaths we will work out are those for the individual manga links. First, we inspect the page to see roughly where the manga entries are in the HTML, then we open the page in the Scrapy shell and test the expression.

`<a class="genres-item-name text-nowrap a-h" href="https://manganato.com/manga-kc961759" title="Sensou Gekijou">Sensou Gekijou</a>`



`scrapy shell "https://manganato.com/genre-all"`

`response.xpath('//a[@class="genres-item-name text-nowrap a-h"]/@href').extract()`

```py
for href in response.xpath('//a[@class="genres-item-name text-nowrap a-h"]/@href'):
    url = href.extract()
```

### Inspect Individual Manga

Next, we go to an individual manga page, such as `https://manganato.com/manga-kc961759`. Using the same process, we extract everything we want.

### Scrapy Items

The main goal in scraping is to extract structured data from unstructured sources, typically web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders. Scrapy's `Item` class addresses this by declaring the available fields up front. The class below goes in `items.py`.

```py
import scrapy


class H8ScrapyItem(scrapy.Item):
    mangaName = scrapy.Field()
    author = scrapy.Field()
    genre = scrapy.Field()
    view = scrapy.Field()
    rating = scrapy.Field()
    desc = scrapy.Field()

```

### Final Spider

```py
import scrapy
from h8scrapy.items import H8ScrapyItem


class ReviewspiderSpider(scrapy.Spider):
    name = 'reviewspiderMultiple'
    start_urls = ['https://manganato.com/genre-all']

    # Add listing pages 2 to 99 to the crawl.
    for i in range(2, 100):
        start_urls.append(f'https://manganato.com/genre-all/{i}')

    def parse(self, response):
        # Follow the link to each manga's detail page.
        for href in response.xpath('//a[@class="genres-item-name text-nowrap a-h"]/@href'):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = H8ScrapyItem()

        item['mangaName'] = response.xpath('//div[@class="story-info-right"]/h1/text()').extract()
        item['author'] = response.xpath('//td[@class="table-value"]/a/text()').extract()[0]
        item['genre'] = response.xpath('//td[@class="table-value"]/a/text()').extract()[1:]
        item['view'] = response.xpath('//span[@class="stre-value"]/text()').extract()[1]
        item['rating'] = response.xpath('//em[@property="v:average"]/text()').extract()
        item['desc'] = response.xpath('//div[@class="panel-story-info-description"]/text()').extract()[1].strip()

        yield item
```

Run the spider from the project's root directory with `scrapy crawl reviewspiderMultiple`.
