# Introduction to Web Scraping with Python
## Event: Campus Party Natal 2019
## Speaker: Breno Santana Santos
***

## 1. Getting data with Web Scraping

The steps needed to get and extract data from websites are:
1. Determine your extraction's goal;
2. Check whether your extraction is legal;
3. Determine the data source or the target website;
4. Get the HTML source of the target page;
5. Choose the HTML elements that will be extracted;
6. Extract the data with your preferred Web Scraping tool;
7. Store the data, if necessary.
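
As a partial aid for step 2, a site's `robots.txt` file states which paths automated clients are asked to avoid. The sketch below parses a hypothetical `robots.txt` text (the `ROBOTS_TXT` content, the `is_allowed` helper, and the `example.com` URLs are assumptions made for illustration) with the standard `urllib.robotparser` module. It is only one signal: the site's terms of use and the applicable law still need to be checked separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules without any network access
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://example.com/events/"))         # True
print(is_allowed(ROBOTS_TXT, "https://example.com/private/report"))  # False
```

For a real site, `RobotFileParser.set_url` followed by `read` downloads the file instead of parsing a local string.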
***

## 2. Required Background

To perform Web Scraping activities, you need to:
* Understand the structure (hierarchy) and elements of HTML (HyperText Markup Language);
* Understand the syntax of XPath and/or CSS Selector;
    * They are used to select/extract the data contained in HTML elements.
* Know how to use the needed tools (in this case, Python tools for Web Scraping).
***

## 3. HTML hierarchy and elements

<img src="img/tags_html.jpg" />

An example of HTML hierarchy:
<img src="img/structure_html.png" />

In addition, two attributes are especially important: class and id.
 * class: determines the CSS class (or classes) of the HTML element.
 * id: unique identifier of the HTML element.

If you don't have a solid understanding of HTML, I suggest you take the [W3Schools' HTML tutorial](https://www.w3schools.com/html/).
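
To make the role of `class` and `id` concrete, here is a minimal sketch that lists them for every opening tag of a small page. The HTML snippet and the `AttrCollector` class are hypothetical, written only for this illustration, and the parsing relies on the standard `html.parser` module rather than on a scraping library.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet for illustration
HTML = """
<html>
  <body>
    <div id="main" class="container wide">
      <p class="intro">Hello</p>
      <p>World</p>
    </div>
  </body>
</html>
"""


class AttrCollector(HTMLParser):
    """Record (tag, id, classes) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # an element may carry several classes separated by whitespace
        classes = (attrs.get("class") or "").split()
        self.found.append((tag, attrs.get("id"), classes))


collector = AttrCollector()
collector.feed(HTML)
for tag, elem_id, classes in collector.found:
    print(tag, elem_id, classes)
```

The `div` element is reported with the id `main` and the two classes `container` and `wide`, which is why a class can match many elements while an id should identify a single one.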
***

## 4. Quick Review of XPath and CSS Selector

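Both notations are used to navigate through the elements and attributes of HTML pages. XPath describes a path through the document tree, while a CSS Selector selects elements with the same patterns that style sheets use to target them (tag names, classes, ids, and attributes).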

### 4.1. XPath

The basic syntax to select the nodes/elements in an HTML document is:

| Expression | Description |
|------------|-------------|
| nodename | selects all nodes with the name "nodename". |
| / | selects from the root node. |
| // | selects all nodes no matter where they are in the document. |
| . | selects the current node. |
| .. | selects the parent of the current node. |
| @ | selects attributes. |
| parent/nodename[n] | selects the n-th "nodename" element that is the child of the "parent" element. The value of "n" starts from one. |
| //nodename[@attr_name] | selects all the "nodename" elements that have an attribute named "attr_name". |
| //nodename[@attr_name='attr_value'] | selects all the "nodename" elements that have a "attr_name" attribute with a value of "attr_value". |
| * | matches any element node. |
| @* | matches any attribute node.|
| nodename/@attr_name | selects the value of the "attr_name" attribute for "nodename" element. |
| nodename/text() | extracts the textual information of the "nodename" element. |

For more information about XPath, I suggest you take [W3Schools' XPath tutorial](https://www.w3schools.com/xml/xpath_intro.asp).
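
A few of the expressions above can be tried without any extra installation. The sketch below uses the standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath (for instance, it cannot select attributes or `text()` directly) and requires well-formed markup. The document in `DOC` is a hypothetical snippet written for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet (ElementTree needs valid XML)
DOC = """
<html>
  <body>
    <div id="first"><p class="note">A</p></div>
    <div id="second">
      <p class="note">B</p>
      <p>C</p>
      <a href="/events/1">link</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

# //p[@class='note'] is written as .//p[@class='note'] relative to the root
notes = [p.text for p in root.findall(".//p[@class='note']")]

# /html/body/div[2]/p: positions start from one
second_div_paragraphs = [p.text for p in root.findall("./body/div[2]/p")]

# //a/@href: ElementTree cannot select attributes, so they are read from the element
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

# //*[@id='second']
second = root.find(".//*[@id='second']")

print(notes)                  # ['A', 'B']
print(second_div_paragraphs)  # ['B', 'C']
print(hrefs)                  # ['/events/1']
print(second.tag)             # div
```

Real web pages are rarely well-formed XML, so this module is only a way to experiment with the syntax. Libraries aimed at scraping accept the full expressions from the table.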

### 4.2. CSS Selector

The basic syntax to select the nodes/elements in an HTML document is:

| Selector | Description |
|----------|-------------|
| .class_node | selects all elements with class="class_node". |
|.class1.class2 | selects all elements with both "class1" and "class2" set within its class attribute. |
| #id_node | selects the element with id="id_node". |
| * | selects all elements. |
| tagname | selects all "tagname" elements. |
| tagname1,tagname2 | selects all "tagname1" elements and all "tagname2" elements. |
| tagname1 tagname2 | selects all "tagname2" elements inside "tagname1" elements. |
| tagname1 > tagname2 | selects all "tagname2" elements where the parent is a "tagname1" element. |
| [attr_name] | selects all elements with a "attr_name" attribute. |
| [attr_name=attr_value] | selects all elements with attr_name="attr_value". |
| [attr_name~=attr_value] | selects all elements whose "attr_name" attribute contains "attr_value" as one of its whitespace-separated words. |
| [attr_name*=attr_value] | selects every element whose "attr_name" attribute value contains the substring "attr_value". |
| nodename:nth-of-type(n) | selects every "nodename" element that is the n-th "nodename" child of its parent (only siblings of the same type are counted). The value of "n" starts from one. |
| nodename::attr(attr_name) | selects the value of the "attr_name" attribute for "nodename" element (Scrapy extension, not standard CSS). |
| nodename::text | extracts the textual information of the "nodename" element (Scrapy extension, not standard CSS). |

For more information about CSS Selector, see [W3Schools' CSS Selector Reference](https://www.w3schools.com/cssref/css_selectors.asp).

### 4.3. Examples

| XPath | CSS Selector | Explanation |
|-------|--------------|-------------|
| /html/body/div | html > body > div | selects all "div" elements that are children of "body" tag. |
| //table | table | selects all "table" elements of HTML page. |
| /html/body/div[2]//table | html > body > div:nth-of-type(2) table | selects all "table" elements contained in the second "div" element of "body" tag. |
| /html/body/* | html > body > * | selects all children of "body" tag. |
| //p[@class="class-1"] | p.class-1 | selects all "p" elements with the "class-1" class. The XPath form matches only when the "class" attribute is exactly "class-1", while the CSS form also matches elements that have additional classes. |
| //*[@id="uid"] | *#uid | selects every element, whatever its tag, whose "id" attribute is equal to the "uid" value. |

***

## 5. Visualizing/Inspecting the HTML pages

An important aid for Web Scraping is the inspection of the elements of HTML pages, for example with the inspection option of the browser. It allows you to visualize the elements and their attributes, which makes the data extraction process easier.

<img src="img/inspect.png" />

***

## 6. Getting the HTML pages' source

One way of getting the source of an HTML page is the Python __requests__ module. The __get__ function performs an HTTP request and the __content__ attribute of the response holds the page's content as bytes. You should then decode the content using the UTF-8 encoding.

Another way of getting the source code is the __request__ module of the Python __urllib__ library. The __urlopen__ function returns an HTTPResponse object, which represents the response to an HTTP request. To obtain the content received by the request, use the __read__ method, which returns it as bytes. As in the previous approach, you should decode the content using the UTF-8 encoding.

In [None]:
# determining the URL of target page
url = "https://campuse.ro/events/campus-party-natal-2019"

In [None]:
# importing the requests module and the request module of urllib library
import requests
from urllib import request

In [None]:
# creating the function that gets source of HTML page
def getPage(url, is_urlopen=False):
    """This function gets the HTML page's source and decodes it as UTF-8."""
    html = request.urlopen(url).read() if is_urlopen else requests.get(url).content  # getting the HTML source as bytes
    html = html.decode("utf-8")  # decoding the bytes as UTF-8 text
    return html

In [None]:
# getting the page using the requests module
html = getPage(url)

In [None]:
# showing the first 200 characters
html[:200]

In [None]:
# getting the page using the urlopen method
html = getPage(url, True)

In [None]:
# showing the first 200 characters
html[:200]
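
The function above assumes that every page is encoded in UTF-8. A minimal sketch of a more tolerant decoding step follows. The `decode_html` helper and its `declared_charset` parameter are hypothetical names introduced for this illustration, and the fallback policy is just one possible choice. With __urllib__, the declared charset could come from `response.headers.get_content_charset()`. With __requests__, the `text` attribute of the response already applies a similar guess.

```python
def decode_html(raw, declared_charset=None):
    """Decode page bytes: try the declared charset, then UTF-8, then UTF-8 with replacement."""
    for charset in (declared_charset, "utf-8"):
        if charset is None:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # last resort: keep the readable part and mark undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")


print(decode_html("São Paulo".encode("utf-8")))
print(decode_html("São Paulo".encode("latin-1"), "iso-8859-1"))
```

Both calls print `São Paulo`. Without the declared charset, the second input would still be decoded, but with a replacement character in place of the accented letter.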

***

## 7. Extracting data with scrapy Selector

The __Selector__ object from the __scrapy__ library __selects and extracts data__ using XPath or CSS Selector notation. Its constructor receives the HTML source through the __text__ argument. The Selector object has two methods to select HTML elements: __xpath__ and __css__. Both return a __SelectorList__, i.e., a list of __Selector__ objects, which can be queried again to narrow the selection to specific pieces of the HTML code. The __extract__ method returns a list with all the extracted data, while __extract_first__ returns only the first item (or None when nothing matches).

To get absolute URLs from relative ones, you can use the __parse__ module of the __urllib__ library. Its __urljoin__ function performs this task, i.e., it builds an absolute URL from a base URL and a relative one.
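
The behaviour of __urljoin__ depends on how the relative link is written, so a short sketch may help. The base URL is the one used in this notebook, while the relative links are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin

base = "https://campuse.ro/events/campus-party-natal-2019"

# a link starting with "/" replaces the whole path of the base URL
print(urljoin(base, "/events/campus-party-natal-2019/workshop/"))
# https://campuse.ro/events/campus-party-natal-2019/workshop/

# a link without a leading "/" replaces only the last path segment of the base URL
print(urljoin(base, "workshop/?page=2"))
# https://campuse.ro/events/workshop/?page=2

# a link made only of a query string keeps the path of the base URL
print(urljoin(base, "?page=2"))
# https://campuse.ro/events/campus-party-natal-2019?page=2

# an absolute link is returned unchanged
print(urljoin(base, "https://example.com/other"))
# https://example.com/other
```

The second case shows why the presence of a trailing slash in the base URL matters: with `base + "/"`, the same relative link would be appended instead of replacing the last segment.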

In [None]:
# For installing the scrapy library, uncomment the below line.
#%pip install scrapy

In [None]:
# importing the scrapy Selector and the urljoin method
from scrapy import Selector
from urllib.parse import urljoin

In [None]:
# creating the function that gets the links contained in a page
def getLinks(url, locator_links, is_css=True, locator_pagination=None):
    """This function gets the page's links based on an XPath/CSS locator."""

    # getting the absolute URLs contained in an HTML page, using XPath or CSS Selector
    html = getPage(url)
    sel = Selector(text=html)
    links = sel.css(locator_links).extract() if is_css else sel.xpath(locator_links).extract()  # extracting the relative links
    links = [urljoin(url, item) for item in links]  # generating the absolute URLs

    # getting the pagination of absolute URLs
    if locator_pagination:
        for link in list(links):
            links_pagination = getLinks(link, locator_pagination)
            if links_pagination and not set(links_pagination).issubset(set(links)):
                links.remove(link)
                links += links_pagination

    return links
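
The pagination step of the function above is easier to follow without network access. The sketch below reproduces only its replacement logic: the `expand_pagination` helper and the `pagination_of` dictionary, which stands in for downloading each page and extracting its pagination links, are hypothetical and exist just for this illustration.

```python
def expand_pagination(links, pagination_of):
    """Replace each link by its pagination links when the page exposes new ones."""
    result = list(links)
    for link in links:
        pages = pagination_of.get(link, [])  # stands in for getLinks(link, locator_pagination)
        if pages and not set(pages).issubset(result):
            result.remove(link)  # the unpaginated link is dropped...
            result += pages      # ...and its paginated versions are appended
    return result


# hypothetical category links: only the first one has pagination
links = ["/workshop/", "/talks/"]
pagination_of = {"/workshop/": ["/workshop/?page=1", "/workshop/?page=2"]}

print(expand_pagination(links, pagination_of))
# ['/talks/', '/workshop/?page=1', '/workshop/?page=2']
```

A link without pagination stays in the list, and a link whose pagination links are all already known is left untouched, which avoids adding duplicates.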

### 7.1. Getting the absolute URLs of CPNatal 2019's event categories

In [None]:
# getting the absolute URLs of the CPNatal 2019's event categories (Workshops and Conferences)
xpath = "//div[@class='small-12 medium-6 large-3 columns pink-bg']/a/@href"
links = getLinks(url, xpath, False)
print(links)

css = "div[class='small-12 medium-6 large-3 columns pink-bg'] > a::attr(href)"
links = getLinks(url, css)
print(links)

### 7.2. Getting the absolute URLs with pagination for CPNatal 2019's event categories

In [None]:
# getting the absolute URLs with pagination for CPNatal 2019's event categories (Workshops and Conferences)
css_pagination = "span.step-links > span.list > a::attr(href)"
links = getLinks(url, css, locator_pagination=css_pagination)

In [None]:
# printing the result
print(links)

### 7.3. Getting the subevents' URLs for each event category of CPNatal 2019

In [None]:
# getting the absolute URLs of CPNatal 2019's Workshops and Conferences
css_categories = "div.header > a::attr(href)"
links_subevents = [link_subevent for link_cat in links for link_subevent in getLinks(link_cat, css_categories)]

In [None]:
# printing the total of URLs
print("Total of links:", len(links_subevents))

### 7.4. Extracting data of the CPNatal 2019's subevents

In [None]:
# creating the function that extracts the data based on a list of CSS Selectors
def getData(url, css_list):
    """This function gets the content and returns it as a list of data. It also cleans the text."""
    html = getPage(url)
    sel = Selector(text=html)

    row = []
    for item_css in css_list:
        value = sel.css(item_css).extract_first()  # None when the selector matches nothing
        row.append(value.replace("\n", "").replace(";", ",") if value is not None else "")
    return row

In [None]:
# defining the list of CSS Selectors to extract data
css = ["div.left > ul.filter-tab-menu > li:nth-of-type(3) > a::text", # type
       "div.blue-box > span.title::text", # title
       "div.metadata > span:nth-of-type(1)::text", # start date
       "div.metadata > span:nth-of-type(2)::text", # end date
       "div.metadata > span:nth-of-type(3)::text", # location
       "div.ident-25 > p::text" # description
]

In [None]:
# getting and extracting the data
data = [["Type","Title","Start Date","End Date","Location","Description"]]  # header
data.extend([getData(link, css) for link in links_subevents])

In [None]:
# printing the header plus the first four records and the number of extracted subevents
print(data[:5])
print("Number of extracted subevents:", len(data) - 1)  # the header row is not a subevent

### 7.5. Storing the data in a CSV file

To store the data in a CSV file, you can use the __csv__ module. Every file manipulation requires opening the file for reading or writing. For this, you can call the built-in __open__ function, which returns a __TextIOWrapper__ object used to __manipulate__ the __selected file__.

The __writer__ function of the __csv__ module creates a __writer__ object that stores content in the file. The __writerow__ method writes a single row, while the __writerows__ method writes a set of rows. The former receives one list (a row) and the latter receives a list of rows.

In [None]:
# importing the csv module
import csv

In [None]:
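# opening the file and creating the writer object
# (newline="" is recommended by the csv module to avoid extra blank lines on some platforms)
csv_file = open("data_cp_natal_2019.csv", "w", newline="", encoding="utf-8")
file_writer = csv.writer(csv_file, delimiter="|")

In [None]:
# inserting the data within the file and closing it so the content is flushed to disk
file_writer.writerows(data)
csv_file.close()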
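
The difference between __writerow__ and __writerows__, and the effect of the `|` delimiter, can be checked without touching the disk. The sketch below writes to an in-memory buffer. The rows are hypothetical values, not data extracted from the event site.

```python
import csv
import io

buffer = io.StringIO(newline="")  # in-memory text file
writer = csv.writer(buffer, delimiter="|")

# writerow receives a single row
writer.writerow(["Type", "Title"])
# writerows receives a list of rows
writer.writerows([["Workshop", "Intro to Web Scraping"],
                  ["Talk", "Data | Pipes"]])

print(buffer.getvalue())

# reading the content back recovers the original rows
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline=""), delimiter="|"))
print(rows)
```

The title that contains the delimiter is written between double quotes, so it is read back as a single field. This is why no manual replacement of the delimiter character is needed when the __csv__ module is used for both writing and reading.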

***

## 8. Crawling with scrapy Response objects

**Response** objects offer the **same** selection functionality as **Selector** objects: they provide the **xpath** and **css** methods, whose results can be extracted with **extract** and **extract_first**. In addition, a Response **keeps track of the URL** the HTML code was loaded from, in its **url** attribute. This **helps us move from one page to another**, so that we can "crawl" the web while scraping. The **follow** method of the Response creates a request for a new link, resolving relative URLs against the URL of the response.

The **Spider** class determines how to perform the **crawl** (follow links) and how to **extract data** from the pages (scraping items). To create your own Spider, define a subclass of **scrapy.Spider** with a **start_requests** method and **at least one parse method** (callback). The **former** defines the **starting point** of the Spider, while the **latter** is responsible for **handling the HTML** code. A **scrapy.Request** object describes a page to be downloaded; Scrapy fetches it and passes the resulting response to a callback. The **url** argument indicates which page to request. The **callback** argument indicates which method receives the response for processing. The **cb_kwargs** argument is a dictionary of additional keyword arguments (names and values) passed to the **callback** function. The **follow** method accepts the same arguments as **scrapy.Request**.

Finally, a **CrawlerProcess** object from the **scrapy.crawler** module is responsible for configuring and running the Spider. The **crawl** method receives the **Spider subclass** that will be used to extract data. The **start** method starts the execution of the configured crawl.
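
The way **cb_kwargs** reaches the callback can be modelled in plain Python. The sketch below is a simplified model, not Scrapy's actual implementation: `FakeResponse` and `deliver` are hypothetical names that only imitate the hand-off between a finished download and the callback.

```python
class FakeResponse:
    """Stand-in for a downloaded page; this is not a Scrapy class."""

    def __init__(self, url):
        self.url = url


def deliver(response, callback, cb_kwargs=None):
    """Call the callback with the response plus the cb_kwargs entries as keyword arguments."""
    return callback(response, **(cb_kwargs or {}))


def parse_links(response, css):
    # the "css" parameter name must match the key used in cb_kwargs
    return f"parse {response.url} with selector {css!r}"


result = deliver(FakeResponse("https://example.com/"), parse_links,
                 cb_kwargs={"css": "a::attr(href)"})
print(result)
# parse https://example.com/ with selector 'a::attr(href)'
```

Each key of the dictionary becomes a keyword argument, so a callback must either declare parameters with the same names or accept `**kwargs`.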

In [None]:
# Required imports
import scrapy
from scrapy.crawler import CrawlerProcess
import csv

In [None]:
# determining the URL of target page
url = "https://campuse.ro/events/campus-party-natal-2019"

In [None]:
# Definition of Spider class
class SpiderCPNatal(scrapy.Spider):
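    name = "cp_natal_2019"
    csv_file = None
    file_writer = None

    # Called by Scrapy when the spider finishes: closing the file flushes all rows to disk
    def closed(self, reason):
        if self.csv_file:
            self.csv_file.close()

    # Start point to run the spider
    def start_requests(self):
        # creating the file and saving the header within the file
        self.csv_file = open("data_cp_natal_2019_2.csv", "w", newline="", encoding="utf-8")
        self.file_writer = csv.writer(self.csv_file, delimiter="|")
        self.file_writer.writerow(["Type","Title","Start Date","End Date","Location","Description"])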

        # getting the relative URLs of the CPNatal 2019's event categories (Workshops and Conferences)
        args = dict(css = "div[class='small-12 medium-6 large-3 columns pink-bg'] > a::attr(href)")
        yield scrapy.Request(url = url, callback=self.parse_links, cb_kwargs=args)

    def parse_links(self, response, css):
        # extracting the relative URLs
        links = response.css(css).extract()

        # getting the relative URLs with pagination for CPNatal 2019's event categories (Workshops and Conferences)
        args = dict(css_pag = "span.step-links > span.list > a::attr(href)")
        for link in links:
            yield response.follow(url = link, callback=self.parse_pagination, cb_kwargs=args)

    def parse_pagination(self, response, css_pag):
        # extracting the relative URLs with pagination
        links = response.css(css_pag).extract()

        # getting the relative URLs of CPNatal 2019's Workshops and Conferences
        for link in links:
            args = dict(css_cat = "div.header > a::attr(href)")
            yield response.follow(url = link, callback=self.parse_links_subevents, cb_kwargs=args)

    def parse_links_subevents(self, response, css_cat):
        # extracting the relative URLs of CPNatal 2019's Workshops and Conferences
        links = response.css(css_cat).extract()

        # getting the data
        for link in links:
            args = {"type": "div.left > ul.filter-tab-menu > li:nth-of-type(3) > a::text", # type
                    "title": "div.blue-box > span.title::text", # title
                    "start_date": "div.metadata > span:nth-of-type(1)::text", # start date
                    "end_date": "div.metadata > span:nth-of-type(2)::text", # end date
                    "location": "div.metadata > span:nth-of-type(3)::text", # location
                    "description": "div.ident-25 > p::text" # description
            }
            yield response.follow(url = link, callback=self.parse_data, cb_kwargs=args)

    def parse_data(self, response, **kwargs):
        # extracting the data and saving them in the CSV file
        row = []
        for item_css in kwargs.values():
            value = response.css(item_css).extract_first()  # None when the selector matches nothing
            row.append(value.replace("\n", "").replace(";", ",") if value is not None else "")
        self.file_writer.writerow(row)


In [None]:
# Execution Process to run the spider
process = CrawlerProcess()  # initiate a CrawlerProcess
process.crawl(SpiderCPNatal)  # tell the process which spider to use
process.start()  # start the crawling process; this call blocks until the crawl finishes and can run only once per kernel session

***

## 9. Visualizing the extracted data with Pandas

The **pandas** library is one of the main libraries used for Data Science with Python. It allows you to work with data in a **tabular format** through the **DataFrame** object. The **read_csv** function creates a **DataFrame** object from a **CSV file**. The **info** method shows summary information about the data (columns, non-null counts, and data types). The **head** method shows the first five records of your DataFrame.

In [None]:
# importing the pandas library
import pandas as pd

In [None]:
# creating the DataFrame object from the extracted data (CSV file)
data_df = pd.read_csv("data_cp_natal_2019_2.csv", delimiter="|", header=0, index_col=None)

In [None]:
# visualizing summary information about the DataFrame object
data_df.info()

In [None]:
# visualizing the first five records
data_df.head()