# Run Spider


In this section, we will write a spider and run it in the terminal.

### Generate the spider

In [1]:
pwd

'/Users/floyd/Projects/scrapy/pubcrawler/tutorial/6 Run Spider'

**Change the directory to the spider folder**

In [2]:
cd /Users/floyd/Projects/scrapy/pubcrawler/tutorial/6\ Run\ Spider/fintime50/fintime50/spiders

/Users/floyd/Projects/scrapy/pubcrawler/tutorial/6 Run Spider/fintime50/fintime50/spiders


Use the `scrapy` command to generate the spider. Run it from inside the `spiders` directory.

```bash
scrapy genspider dss www.journals.elsevier.com
```




---

`scrapy`: the command

`genspider`: the subcommand that generates a new spider from a template

`dss`: the spider's name

`www.journals.elsevier.com`: the domain you want to scrape

---


![](genspider.png)




## Spider Contents

A `dss.py` file has been generated in the `spiders` directory.

Open it and you will find:

---

```python
# -*- coding: utf-8 -*-
import scrapy


class DssSpider(scrapy.Spider):
    name = "dss"
    allowed_domains = ["www.journals.elsevier.com"]
    start_urls = (
        'http://www.www.journals.elsevier.com/',
    )

    def parse(self, response):
        pass
```
---
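The doubled `www.www.` in `start_urls` is not a typo in this tutorial. It appears to come from how the spider template builds the URL: the generated line looks like `http://www.` followed by whatever domain was given on the command line. The sketch below reproduces that behavior with `string.Template`. The template string is an assumption inferred from the generated file above, not copied from Scrapy's source, and `render_start_url` is a hypothetical helper.

```python
from string import Template

# Assumed shape of the template line, inferred from the generated dss.py;
# this is not taken from Scrapy's source code.
START_URL_TEMPLATE = Template("http://www.$domain/")


def render_start_url(domain):
    """Fill the assumed template with the domain passed to genspider."""
    return START_URL_TEMPLATE.substitute(domain=domain)


with_www = render_start_url("www.journals.elsevier.com")
without_www = render_start_url("journals.elsevier.com")

print(with_www)     # doubled prefix, as in the generated file
print(without_www)  # single prefix
```

In practice this matters little here, because `start_urls` is replaced by hand in the final spider later in this section.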



This file defines a single class, `DssSpider`, which inherits from `scrapy.Spider`.

**Its fields and methods are:**

`name`: string; `"dss"` comes from the command line.

`allowed_domains`: list; the domains this spider is allowed to scrape.

`start_urls`: tuple; when you run the spider, Scrapy requests each of these URLs and, by default, passes each resulting `response` to the spider's `parse()` method.

`parse()`: the default callback; it takes a `response` as its parameter.

**The `response` comes from a request to one of the `start_urls`. This is an important point to remember.**
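The flow from `start_urls` to `parse()` can be pictured with a toy model. This is only a simplified sketch that runs offline: `ToyResponse`, `ToySpider`, `toy_crawl` and `fake_fetch` are hypothetical names, and the real Scrapy engine is asynchronous and considerably more involved.

```python
class ToyResponse:
    """Hypothetical stand-in for a Scrapy response: just a URL and a body."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    name = "dss"
    start_urls = (
        "http://www.journals.elsevier.com/decision-support-systems/",
    )

    def parse(self, response):
        # Default callback: receives one response per start URL.
        yield {"home_url": response.url, "size": len(response.body)}


def toy_crawl(spider, fetch):
    """Request every start URL and hand each response to spider.parse."""
    items = []
    for url in spider.start_urls:
        response = fetch(url)  # Scrapy would download the page here
        items.extend(spider.parse(response))
    return items


def fake_fetch(url):
    # No network access: return a canned body so the sketch runs offline.
    return ToyResponse(url, "<html>placeholder</html>")


items = toy_crawl(ToySpider(), fake_fetch)
print(items)
```

The point to take away is that `parse()` never fetches anything itself; it only receives what was downloaded for it.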

## Our first spider `DssSpider` 

#### These helpers are used in the spider

`cleanhtml()`

`TakeFirst()`

`ItemLoader`


These helpers are explained in detail in the `ItemLoader` section.
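As a quick reminder of what these pieces do, here is a rough pure-Python sketch. `join_values` and `take_first` are hypothetical stand-ins that imitate the `Join()` and `TakeFirst()` processors as I understand them; they are not Scrapy's own code, and the sample strings are made up.

```python
import re


def cleanhtml(raw_html):
    """Strip anything that looks like an HTML tag."""
    return re.sub(r"<.*?>", "", raw_html)


def join_values(values, separator=" "):
    """Stand-in for Join(): concatenate the extracted values."""
    return separator.join(values)


def take_first(values):
    """Stand-in for TakeFirst(): first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Hypothetical extraction result: two <p> elements as raw HTML strings.
paragraphs = ["<p>First <b>paragraph</b>.</p>", "<p>Second paragraph.</p>"]

# Same order as in the spider: join first, then strip the tags.
description = cleanhtml(join_values(paragraphs))
first = take_first(["", None, "0000-0000"])

print(description)
print(first)
```

An XPath query returns a list of matches, which is why a joining or first-value step is needed before the field holds a single string.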

In [6]:
from scrapy import Item, Field
class SourceItem(Item):
    publication_title = Field()
    chief_editor = Field()
    issn = Field()
    description = Field()
    home_url = Field()
    coverimage = Field()
    title = Field()

In [7]:
from scrapy.loader import ItemLoader


# we need headers to disguise our bot as a browser

headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-TW;q=0.2",
}


import requests
from scrapy.http import TextResponse

r = requests.get('http://www.journals.elsevier.com/decision-support-systems/', 
                 headers = headers)

response = TextResponse(r.url, body = r.text, encoding = 'utf-8')

# there is a response we need to handle
response

<200 https://www.journals.elsevier.com/decision-support-systems/>

In [8]:
# Initialization with Item and response
# Item (here, SourceItem()) is the container the ItemLoader fills
# response is the raw material the ItemLoader extracts data from
l = ItemLoader(item = SourceItem(), response = response)
type(l)

scrapy.loader.ItemLoader

In [3]:
# here are the xpaths for the items
issn_xpath = '//*[@class="issn keyword"]/span/text()'
chief_editor_xpath = '//*[@id="Title"]//span[@class="nowrap"]/text()'
title_xpath = '//*[@id="Title"]//h1[@itemprop="name"]/text()'
description_xpath = '//*[@class="publication-description"]//p'
coverimage_xpath = '//*[@id="Title"]//img[@class="cover-img"]/@src'
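To see how expressions of this shape select nodes, the sketch below runs two similar queries against a tiny made-up snippet using only the standard library. The markup and its values are hypothetical and far simpler than the real journal page. `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` step, so the element is selected and its `.text` is read instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real page is larger and its
# structure is not reproduced here.
snippet = """
<html>
  <body>
    <div class="issn keyword">ISSN: <span>0000-0000</span></div>
    <div id="Title"><h1 itemprop="name">Example Journal</h1></div>
  </body>
</html>
""".strip()

root = ET.fromstring(snippet)

# Select the elements, then read their text content.
issn_nodes = root.findall(".//*[@class='issn keyword']/span")
title_nodes = root.findall(".//*[@id='Title']//h1[@itemprop='name']")

issn = issn_nodes[0].text
title = title_nodes[0].text

print(issn)
print(title)
```

Inside the spider, Scrapy's own selectors evaluate the full expressions, including `text()` and `@src`, so no such workaround is needed there.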

In [13]:
# based on this logic, we can get a clean item here.
import re
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, TakeFirst


def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
    
    
l = ItemLoader(item = SourceItem(), response = response)
l.default_output_processor = TakeFirst()

# issn
l.add_xpath('issn', issn_xpath)

# chief_editor
l.add_xpath('chief_editor', chief_editor_xpath)

# coverimage
l.add_xpath('coverimage', coverimage_xpath)

# description
# note the Join() and cleanhtml processors here;
# see the sections above.
l.add_xpath('description', description_xpath, Join(), cleanhtml)

# publication_title
publication_title = l.get_xpath(title_xpath)
l.add_value('publication_title', publication_title)

# home_url
l.add_value('home_url', response.url)
l.load_item()


# relatively cleaner now!

{'chief_editor': 'James R. Marsden',
 'coverimage': 'https://www.elsevier.com/__data/cover_img/505540.gif',
 'description': 'The common thread of articles published in Decision Support '
                'Systems is their relevance to theoretical and technical '
                'issues in the support of enhanced decision making. The areas '
                'addressed may include foundations, functionality, interfaces, '
                'implementation, impacts, and evaluation of decision support '
                'systems... The common thread of articles published in '
                'Decision Support Systems is their relevance to theoretical '
                'and technical issues in the support of enhanced decision '
                'making. The areas addressed may include foundations, '
                'functionality, interfaces, implementation, impacts, and '
                'evaluation of decision support systems (DSSs). Manuscripts '
                'may draw from diverse methods

### Contents of the `dss.py` Python file


---

```python
# -*- coding: utf-8 -*-

# Spider for the Decision Support Systems journal home page
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spiders.html

import re
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, TakeFirst
from scrapy import Spider
from dateutil.parser import parse
from fintime50.items import SourceItem


class DSSSpider(Spider):
    name = 'dss'
    start_urls = (
            "http://www.journals.elsevier.com/decision-support-systems/",
            )
    base_url = "http://www.sciencedirect.com/science/journal/01679236/"


    def cleanhtml(self, raw_html):
        cleanr = re.compile('<.*?>')
        cleantext = re.sub(cleanr, '', raw_html)
        return cleantext


    def parse(self, response):
        issn_xpath = '//*[@class="issn keyword"]/span/text()'
        chief_editor_xpath = '//*[@id="Title"]//span[@class="nowrap"]/text()'
        title_xpath = '//*[@id="Title"]//h1[@itemprop="name"]/text()'
        description_xpath = '//*[@class="publication-description"]//p'
        coverimage_xpath = '//*[@id="Title"]//img[@class="cover-img"]/@src'

        l = ItemLoader(item = SourceItem(), response = response)
        l.default_output_processor = TakeFirst()
        l.add_xpath('issn', issn_xpath)
        l.add_xpath('chief_editor', chief_editor_xpath)
        l.add_xpath('coverimage', coverimage_xpath)
        l.add_xpath('description', description_xpath, Join(), self.cleanhtml)
        l.add_value('home_url', response.url)
        publication_title = l.get_xpath(title_xpath)
        l.add_value('publication_title', publication_title)

        yield l.load_item()
```

---

### Where to run the project

In [14]:
# show the current working directory
!pwd

/Users/floyd/Projects/scrapy/pubcrawler/tutorial/6 Run Spider/fintime50/fintime50/spiders


In [15]:
cd /Users/floyd/Projects/scrapy/pubcrawler/tutorial/6\ Run\ Spider/fintime50/fintime50

/Users/floyd/Projects/scrapy/pubcrawler/tutorial/6 Run Spider/fintime50/fintime50


In [16]:
!tree

.
├── __init__.py
├── __pycache__
│   ├── __init__.cpython-35.pyc
│   └── settings.cpython-35.pyc
├── items.py
├── pipelines.py
├── settings.py
└── spiders
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-35.pyc
    │   └── dss.cpython-35.pyc
    └── dss.py

3 directories, 10 files


In [17]:
# the spider can be run from this directory: Scrapy looks for scrapy.cfg here and in the parent directories
!pwd

/Users/floyd/Projects/scrapy/pubcrawler/tutorial/6 Run Spider/fintime50/fintime50


![](rightdirectory.png)
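This directory works because, as far as I know, Scrapy locates a project by looking for a `scrapy.cfg` file in the current directory and then in each parent directory. The sketch below imitates that search on a throwaway directory tree. `find_project_cfg` is a hypothetical helper, not Scrapy's function, and the layout assumes the standard `scrapy startproject` structure with `scrapy.cfg` one level above the package directory shown by `tree`.

```python
import tempfile
from pathlib import Path


def find_project_cfg(start):
    """Walk from `start` up to the filesystem root looking for scrapy.cfg."""
    current = Path(start).resolve()
    for directory in [current, *current.parents]:
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


# Build a temporary layout that mirrors the project used in this tutorial.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "fintime50"
    spiders = root / "fintime50" / "spiders"
    spiders.mkdir(parents=True)
    (root / "scrapy.cfg").write_text("[settings]\ndefault = fintime50.settings\n")

    expected = root / "scrapy.cfg"
    found_from_package = find_project_cfg(root / "fintime50")
    found_from_spiders = find_project_cfg(spiders)

    print(found_from_package == expected)
    print(found_from_spiders == expected)
```

If no `scrapy.cfg` is found on the way up, `scrapy crawl` has no project to work with, which is the usual reason the command fails outside the project tree.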

In [22]:
# run it to see the results.
!scrapy crawl dss

2017-01-11 20:39:16 [scrapy] INFO: Scrapy 1.1.1 started (bot: fintime50)
2017-01-11 20:39:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'fintime50.spiders', 'BOT_NAME': 'fintime50', 'SPIDER_MODULES': ['fintime50.spiders']}
2017-01-11 20:39:16 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-01-11 20:39:16 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMi

### In the terminal, it looks like this

---

![](console1.png)

![](console2.png)
