# Scrapy

Resources: 

https://docs.scrapy.org/en/latest/

https://www.youtube.com/watch?v=mBoX_JCKZTE

https://thepythonscrapyplaybook.com/freecodecamp-beginner-course/

`pip install scrapy`

## Part 1: Initialize your project

`scrapy startproject *projectname*` Create a new Scrapy project, then `cd` into the project directory it generates

You will see the following files:

`*project name*` folder<br>
<ul>
<li>`spiders` folder</li>
<ul>
<li>`__init__.py`</li>
</ul>
<li>`settings.py`</li>
<li>`pipelines.py`</li>
<li>`middlewares.py`</li>
<li>`items.py`</li>
<li>`__init__.py`</li>
</ul>
`scrapy.cfg`
<br>
`scrapy genspider *name of spider* *link of website*` Generate a spider file in the `spiders` folder (run the command inside the `*project name*` folder)

## Part 2: Building spider

#### **IPython:**

- a Python shell where you can test your selectors

`pip install ipython`  Install IPython, a shell that is easier to read (then add `shell = ipython` under `[settings]` in scrapy.cfg)

`scrapy shell` Open the Scrapy shell in the terminal


`fetch('*url*')` download the page and store the result in the `response` variable

`response.css('article.product_pod')` return a list of selectors, one for each \<article\> tag with class="product_pod"

`book.css('h3 a').attrib['href']` return the link (`href`) of the \<a> tag under the h3 tag, where `book` is one selector from the list above

#### **Execution**

`scrapy crawl *nameofspider*` run the spider (run it from inside the project, e.g. the directory where you can see the `spiders` folder)

## Part 3: Item Pipeline

Similar to a pipeline in machine learning, an item pipeline is an organized, systematic place to process every scraped item; it is mostly used for data cleaning

Remember to **uncomment** `ITEM_PIPELINES = {'bookscraper.pipelines.BookscraperPipeline': 300}` inside settings.py if you are using pipelines.py
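To make the data-cleaning idea concrete, here is a rough sketch of what a cleaning pipeline in pipelines.py could look like. It assumes items are plain dicts with hypothetical `title` and `price` fields and a price formatted like `£51.77`; the real item fields may differ.

```python
class BookscraperPipeline:
    """Clean each scraped item before it is exported (illustrative sketch)."""

    def process_item(self, item, spider):
        # Scrapy calls this once per item; the returned item is handed to
        # the next pipeline, or to the feed export if this is the last one.

        # Hypothetical field: strip stray whitespace around the title.
        item["title"] = item["title"].strip()

        # Hypothetical field: turn a string such as "£51.77" into a float.
        item["price"] = float(item["price"].replace("£", ""))

        return item
```

The class is only used once it is listed in `ITEM_PIPELINES`, which is why that setting has to be uncommented.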

## Part 4: Saving Data to Files & Databases

#### **Saving files with terminal**
`scrapy crawl *spidername* -O *filename*` Save the scraped items to *filename* in the current directory; the format is taken from the file extension (write mode, overwrites existing data)

`scrapy crawl *spidername* -o *filename*` Same, but in append mode: new items are added to the existing file (appending to a `.json` file leaves it invalid, so prefer `.jsonl` or `.csv` when appending)

#### **Saving file in settings.py**

Alternatively, you can define the output in settings.py instead of passing -O *filename*
<br>
`FEEDS = {'booksdata.json': {'format': 'json'}}`
<br>
If you want a specific filename/type for a particular spider, you can define  <br>
`custom_settings = {'FEEDS': {'booksdata.json': {'format': 'json'}}}` <br>
as a class attribute at the top of the spider class in your bookspider.py
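As a sketch of how the `-O` / `-o` behaviour maps onto settings, each feed in `FEEDS` can carry its own `overwrite` option. The second file name below is made up for illustration.

```python
# settings.py fragment (illustrative; file names are examples)
FEEDS = {
    "booksdata.json": {
        "format": "json",
        "overwrite": True,   # replace the file on every run, like -O
    },
    "booksdata.csv": {
        "format": "csv",
        "overwrite": False,  # keep existing rows and append, like -o
    },
}
```

With more than one entry, a single crawl writes every listed file.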

## **MySQL**

Ok, hear me out: if you've never used the terminal before, this is going to drive you nuts. Fortunately, here are a few easy steps to get MySQL installed on macOS.

*Full tutorial: https://www.youtube.com/watch?v=37nyT3U6hFI*

1. Install MySQL from https://dev.mysql.com/downloads/mysql/

2. Open the terminal. If you type `mysql`, you will get an error like **zsh: command not found: mysql**, because the MySQL binaries are not on your `$PATH` yet (check with `echo $PATH`)

3. In the terminal, type `vi ~/.zshrc` to open your zsh config in the vim editor

4. Press 'i' to enter insert mode, go to a new line and type `export PATH=${PATH}:/usr/local/mysql/bin`, then press 'esc' and type `:wq` followed by Enter, which writes the file and quits vim

5. Now type `source ~/.zshrc` in the terminal; if there are no errors you should be good to go

6. To use MySQL, type `mysql -u root -p` and enter your password


To connect MySQL to Python you need to run **pip install mysql mysql-connector-python**

Here are some useful MySQL statements to initialize the database:

`show databases;` (show all the databases) <br>
`create database *database name*;`<br>
`use *database name*;` (activate that database) 

-------------------------------------------
I ran into a WHOLE BUNCH OF ERRORS when trying to run **pip install mysql mysql-connector-python**; here are some solutions that might be useful:

**If you downloaded MySQL from their sketchy official website and hit errors such as sql.h not found, run `brew install mysql` in your terminal**

If you run into weird errors like pkg not found or something like that, try setting the following before running pip again: <br>
`export MYSQLCLIENT_CFLAGS="-I/usr/local/include/mysql"` <br>
`export MYSQLCLIENT_LDFLAGS="-L/usr/local/lib -lmysqlclient"`

--------

#### Using mysql

To use MySQL, you will need to create a new class in **pipelines.py** *(see the `SaveToMySQLPipeline` class in pipelines.py for more)*

Also remember that every time you add a class to pipelines.py, you must add it to `ITEM_PIPELINES` in **settings.py** as well (give it a higher number than the previous pipelines so it runs after them)
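For orientation, a MySQL-saving pipeline might be shaped roughly like the sketch below. This is not the project's actual `SaveToMySQLPipeline`: the connection details, the `books` table name and the `build_insert` helper are all assumptions, the table is expected to exist already, and items are assumed to be plain dicts whose keys match the column names.

```python
def build_insert(table, item):
    """Hypothetical helper: build a parameterised INSERT for a dict item."""
    # Column and table names are interpolated directly, so they must come
    # from trusted code, never from scraped data.
    columns = ", ".join(f"`{key}`" for key in item)
    placeholders = ", ".join(["%s"] * len(item))
    sql = f"INSERT INTO `{table}` ({columns}) VALUES ({placeholders})"
    return sql, tuple(item.values())


class SaveToMySQLPipeline:
    def __init__(self, table="books"):
        self.table = table  # assumed table name
        self.conn = None
        self.cur = None

    def open_spider(self, spider):
        # Imported here so the module still loads if the connector is missing.
        import mysql.connector

        # Placeholder credentials; replace with your own.
        self.conn = mysql.connector.connect(
            host="localhost", user="root", password="", database="books"
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        sql, values = build_insert(self.table, item)
        # Values are passed separately so the driver escapes them.
        self.cur.execute(sql, values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```

To run it after the cleaning pipeline, it would be registered with a larger number, for example `'bookscraper.pipelines.SaveToMySQLPipeline': 400` next to the existing entry at 300.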


## Part 5: Bypassing Blocks

**General rule of thumb**: If the data is publicly available and you do not need to log in, then it is most likely OK to scrape the website


**User-Agent** (in the browser: Inspect > Network > Doc > click a document > Headers > user-agent)

Visit https://useragentstring.com/ for detailed information about your user-agent

Websites mainly track who you are through your **IP, cookies, flags/counters, user-agent** and, in general, the **request headers** (everything you send to the server when you make a request; the user-agent is one of the request headers).

The idea of bypassing blocks is to change your request headers on every request (e.g. the user-agent) to trick the anti-scraping bot into thinking you are a 'different' user. (Note that you can set a single fixed user-agent in settings.py, i.e. `USER_AGENT = ''`)
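One common way to vary the user-agent per request is a small downloader middleware. The sketch below is only an illustration: the class name is made up, and the user-agent strings are sample values that you would replace with your own list.

```python
import random

# Sample user-agent strings for illustration; use your own, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RotateUserAgentMiddleware:
    """Hypothetical middleware: pick a random user-agent for every request."""

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

If it lives in middlewares.py, it would be enabled in settings.py through `DOWNLOADER_MIDDLEWARES`, for example `{'bookscraper.middlewares.RotateUserAgentMiddleware': 400}` (the module path and the number are assumptions).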