# Web Scraping with Scrapy

## Step 1: Start with a command shell

Install Scrapy

```
pip install scrapy
```

Open a new command shell and start the interactive Scrapy shell:

```
scrapy shell
```

Fetch a website

```
fetch("https://www.spiegel.de")
```

View the response

```
print(response.text)
```

Access specific elements of the DOM tree using XPath expressions:

```
response.xpath("/html").extract()

response.xpath("//article/@aria-label").extract()

response.xpath("//div[@data-area='article_teaser']").extract()

response.xpath("//div[@data-area='article_teaser>news-l-compayt']").extract()

response.xpath("//div[@data-area='article_teaser>news-l-compayt']//span[@class='align-middle']/text()").extract()

```

## Step 2: Write a spider for web crawling

You can run the following script in a Jupyter notebook, but it is better to run it from a command shell: `process.start()` starts the Twisted reactor, which cannot be restarted within the same Python process, so in Jupyter you have to restart the kernel before every new run.

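```
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = "MySpider"
    # The crawl starts from these URLs
    start_urls = [
        'https://www.spiegel.de',
    ]

    # Called once for every downloaded page
    def parse(self, response):
        # Yield the aria-label of every article teaser as one item
        for t in response.xpath("//article/@aria-label").extract():
            yield {'text': t}

process = CrawlerProcess()
process.crawl(MySpider)
process.start()  # blocks until the crawl has finished
```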
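One way to avoid restarting the kernel is to launch the crawl in a child process, so that every run gets a fresh reactor. The sketch below is an assumption-laden illustration rather than the only approach: it assumes the spider class has been saved on its own (without the three `process` lines) to a file called `myspider.py`, and both that file name and the output file `items.jl` are placeholders.

```python
import subprocess
import sys


def runspider_command(spider_file, output_file):
    """Build the command line for `scrapy runspider`; no crawl is started here."""
    return [
        sys.executable, "-m", "scrapy",  # run Scrapy with the current interpreter
        "runspider", spider_file,        # run a spider defined in a single file
        "-o", output_file,               # append the scraped items to this file
    ]


def run_spider(spider_file="myspider.py", output_file="items.jl"):
    """Run the spider in a child process and return its exit code."""
    # Both default file names are placeholders for this sketch.
    completed = subprocess.run(runspider_command(spider_file, output_file))
    return completed.returncode
```

Calling `run_spider()` from a notebook cell can then be repeated as often as needed. The `.jl` extension selects the JSON Lines format, which stays valid when several runs append to the same file.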



## Step 3: Crawl the webpage recursively

```
import scrapy
from scrapy.crawler import CrawlerProcess

# Module-level lists collect the results so they can be printed after the crawl
text = []
authors = []

class MySpider(scrapy.Spider):
    name = "MySpider"
    start_urls = [
        'https://www.spiegel.de',
    ]

    # Dangerous: parse() follows every article link it finds and handles each
    # followed page the same way, so nothing limits the depth of the crawl.
    def parse(self, response):
        for t in response.xpath("//article/@aria-label").extract():
            text.append({'text': t})
            yield {'text': t}

        author = response.xpath("//div[@class='font-sansUI lg:text-base md:text-base sm:text-s text-shade-dark mb-4']//a/text()").extract()
        # extract() always returns a list (never None), so test for emptiness
        if author:
            print("Author: " + ", ".join(author))
            authors.append({'author': author})
            yield {'author': author}

        # Follow the link of every article teaser and parse that page as well
        next_page = response.xpath("//article//h2/a/@href").extract()
        for n in next_page:
            yield scrapy.Request(response.urljoin(n), callback=self.parse)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

print(text)
print(authors)
```