## Scrapy

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.


To debug a spider from the Scrapy shell, call `inspect_response` inside a callback:


```python
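from scrapy.shell import inspect_response

def parse(self, response):  # callback method of your spider class
    inspect_response(response, self)  # opens a shell with this response loaded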
```


## Dictionary:
1. [Web scraping](https://en.wikipedia.org/wiki/Web_scraping)
2. XPath - query language for selecting nodes.
3. CSS - Cascading Style Sheets.
4. HTML - HyperText Markup Language.



---
## XPath

XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).



[Tutorial](http://zvon.org/comp/r/tut-XPath_1.html)

---
## Quick guide:

**Example:**

```XML
<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
        <edition language="French">fr.wikipedia.org</edition>
        <edition language="Polish">pl.wikipedia.org</edition>
        <edition language="Spanish">es.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
        <edition language="Vietnamese">vi.wiktionary.org</edition>
        <edition language="Turkish">tr.wiktionary.org</edition>
        <edition language="Spanish">es.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>
```

1. `/Wikimedia/projects/project/@name` - selects the name attributes of all projects
2. `/Wikimedia//editions` - selects all editions of all projects
3. `/Wikimedia/projects/project/editions/edition[@language='English']/text()` - selects the addresses of all English Wikimedia projects
4. `/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()` - selects the addresses of all Wikipedia editions
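The four expressions above can be tried out in Python. This is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath: paths are written relative to the root element, and attribute values and text are read in Python instead of with `@name` or `text()`. The XML is a trimmed copy of the example (two editions per project), and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (two editions per project).
xml_doc = """
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>
"""
root = ET.fromstring(xml_doc)  # root is the <Wikimedia> element

# /Wikimedia/projects/project/@name
names = [p.get("name") for p in root.findall("projects/project")]

# /Wikimedia//editions
editions = root.findall(".//editions")

# /Wikimedia/projects/project/editions/edition[@language='English']/text()
english = [
    e.text
    for e in root.findall("projects/project/editions/edition[@language='English']")
]

# /Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()
wikipedia = [
    e.text
    for e in root.findall("projects/project[@name='Wikipedia']/editions/edition")
]

print(names)          # ['Wikipedia', 'Wiktionary']
print(len(editions))  # 2
print(english)        # ['en.wikipedia.org', 'en.wiktionary.org']
print(wikipedia)      # ['en.wikipedia.org', 'de.wikipedia.org']
```

Inside a Scrapy callback the full expressions can be passed as written to `response.xpath(...)`.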


**XPath Automation, get XPath:**

1. Open devtools in your browser
2. Select the element using the Selecting tool
3. Right-click the HTML element inside the Elements tab
4. Choose Copy from the menu, then Copy XPath


**XPath Automation, check XPath:**

1. Go to devtools in your browser
2. Press Ctrl + F/Cmd + F inside the Elements tab
3. Paste your XPath
4. Check that the found element is correct.

**Cons:**
If the structure of the HTML DOM or XML document changes, the XPath may stop matching.
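A small sketch of this weakness, again with the standard library's `xml.etree.ElementTree`. The two documents are hypothetical: the second one wraps `<projects>` in an extra `<catalog>` element, which is an assumption made only for illustration. A path that spells out every level stops matching, while a descendant search keyed on an attribute still works in this case.

```python
import xml.etree.ElementTree as ET

# Hypothetical "before" and "after" versions of the same document.
before = ET.fromstring(
    "<Wikimedia><projects><project name='Wikipedia'><editions>"
    "<edition language='English'>en.wikipedia.org</edition>"
    "</editions></project></projects></Wikimedia>"
)
# The "after" version wraps <projects> in a new <catalog> element.
after = ET.fromstring(
    "<Wikimedia><catalog><projects><project name='Wikipedia'><editions>"
    "<edition language='English'>en.wikipedia.org</edition>"
    "</editions></project></projects></catalog></Wikimedia>"
)

strict_path = "projects/project/editions/edition"  # names every level
loose_path = ".//edition[@language='English']"     # searches all descendants

strict_hits = [len(doc.findall(strict_path)) for doc in (before, after)]
loose_hits = [len(doc.findall(loose_path)) for doc in (before, after)]

print(strict_hits)  # [1, 0]
print(loose_hits)   # [1, 1]
```

Shorter, attribute-based paths reduce this risk but do not remove it: renaming the `edition` tag or the `language` attribute would break the second path as well.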


---
## CSS

CSS - Cascading Style Sheets. CSS is a language used to describe the presentation of documents in web development.

You can select any element of a website's HTML (HyperText Markup Language) using CSS selectors ([Selectors API](https://www.w3.org/TR/selectors-api/))

Example:
```CSS
/* go to wykop.pl and find this element (logo anchor) */
#nav > div > ul.clearfix.mainnav > li.active > a

```

---
## Quick guide:


**HTML**

```HTML
<!DOCTYPE html>
<html> <!-- root of the DOM - Document Object Model -->
    <head>
        <!-- head -> meta information for browsers, robots and third-party apps -->
    </head>
    <body> <!-- body -> content of website/application -->

        <h1>My First Heading</h1> <!-- some tags (h1 - headline level 1) -->
        <p id="intro">My first paragraph.</p> <!-- p - paragraph -->
        
        <div class="wrapper-box default-element">
            <div class="inside-box">
                <a href="#" class="link" target="_blank">Link to better world</a>
            </div>
        </div>
    
    </body>
</html>

```

HTML Rules:
1. There is one root element, `<html>`
2. Almost all tags have an opening tag `<tag>` and a closing tag `</tag>`
3. Tags can be nested
4. A tag opened inside another tag must also be closed inside that tag
5. HTML looks like a tree (each tag is a node, and the tags inside a tag form a node list)
6. In the example, `html` has 2 children (`head`, `body`); `body` has one parent (`html`), 3 children (`h1`, `p`, `div`) and one sibling (`head`)
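The parent, child and sibling relationships in the example page can be checked with a short script. This is a sketch under one assumption: the page is written as well-formed XML (no DOCTYPE, every tag closed), so the strict parser in the standard library's `xml.etree.ElementTree` accepts it. Real-world HTML is often not well-formed and needs a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the example page (DOCTYPE and comments removed).
page = """
<html>
  <head></head>
  <body>
    <h1>My First Heading</h1>
    <p id="intro">My first paragraph.</p>
    <div class="wrapper-box default-element">
      <div class="inside-box">
        <a href="#" class="link" target="_blank">Link to better world</a>
      </div>
    </div>
  </body>
</html>
"""
root = ET.fromstring(page)  # the single root element, <html>

# Iterating over an element yields its direct children, in document order.
html_children = [child.tag for child in root]

body = root.find("body")
body_children = [child.tag for child in body]

# Siblings of <body> are the other children of its parent, <html>.
body_siblings = [child.tag for child in root if child is not body]

print(root.tag)       # html
print(html_children)  # ['head', 'body']
print(body_children)  # ['h1', 'p', 'div']
print(body_siblings)  # ['head']
```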


CSS Selectors simple:
1. `div` - get elements by tag name
2. `.wrapper-box` - get elements by class name
3. `#intro` - get the element by identifier (each identifier should be used for only one tag on a webpage)
4. `a[target="_blank"]` - get elements by attribute
5. and many more

CSS Selectors complex:
1. `.wrapper-box .link` - get all elements with class link that have an ancestor, at any level, with class wrapper-box
2. `.wrapper-box.default-element` - get all elements that have both classes (wrapper-box and default-element)
3. `body a` - get all a elements anywhere inside the body element
4. `.inside-box > a` - get all a elements that are direct children of an element with class inside-box
5. `body div` - get all div elements anywhere inside body
6. `body > div` - get all div elements that are direct children of the body element
7. `div + div` - get all div elements whose immediately preceding sibling is a div element

CSS Automation, get CSS Path:
1. Open devtools in your browser
2. Select the element using the Selecting tool
3. Right-click the HTML element inside the Elements tab
4. Choose Copy from the menu, then Copy Selector

CSS Automation, check CSS Path:
1. Go to devtools in your browser
2. Press Ctrl + F/Cmd + F inside the Elements tab
3. Paste your CSS path
4. Check that the found element is correct.


## Pros:
1. Easy to learn
2. Easy to use

## Cons:
1. If the HTML DOM changes, the CSS path may stop matching.
2. Testing is really slow.