# Introduction to Web Scraping with Python
## Speaker: Breno Santana Santos
***

## 1. Getting data with Web Scraping

The steps needed to get and extract data from websites are:
1. Determine your extraction's goal;
2. Check whether your extraction is legal;
3. Determine the data source or the target website;
4. Get the HTML source of the target page;
5. Choose the HTML elements that will be extracted;
6. Extract the data with your preferred Web Scraping tool;
7. Store the data, if necessary.
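
For step 2, one input you can check programmatically is the site's `robots.txt` file. The sketch below uses the standard library's `urllib.robotparser` on an inline, made-up `robots.txt` (the rules, the `my-scraper` user agent and the `example.com` URLs are assumptions for illustration). It only covers crawler rules; the site's terms of use and the applicable law still have to be checked separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

# Ask whether a given user agent may fetch a given URL.
allowed = rp.can_fetch("my-scraper", "https://example.com/chart/top/")
blocked = rp.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed, blocked)
```

For a real site, `rp.set_url(...)` followed by `rp.read()` downloads the file instead of parsing an inline string.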
***

## 2. Required Background

Web Scraping requires the following background:
* Understand the structure (hierarchy) and elements of HTML (HyperText Markup Language);
* Understand the syntax of XPath and/or CSS Selectors;
    * They are used to select/extract the data contained in HTML elements.
* Know how to use the needed tools (in this case, Python tools for Web Scraping).
***

## 3. HTML hierarchy and elements

<img src="img/tags_html.jpg" />

An example of HTML hierarchy:
<img src="img/structure_html.png" />

In addition, two attributes are especially important: class and id.
 * class: determines the CSS class (or classes) of an HTML element.
 * id: the unique identifier of an HTML element.

If you don't have a solid understanding of HTML, I suggest you take the [W3Schools' HTML tutorial](https://www.w3schools.com/html/).
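
To make the role of these two attributes concrete, here is a minimal sketch that walks a small HTML fragment with the standard library's `html.parser` and records the `id` and `class` of each element that has one. The `AttrCollector` class and the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects (tag, id, list of classes) for elements with an id or a class."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "id" in attrs or "class" in attrs:
            # An element can carry several space-separated classes.
            classes = (attrs.get("class") or "").split()
            self.found.append((tag, attrs.get("id"), classes))


parser = AttrCollector()
parser.feed('<div id="main" class="box wide"><p class="note">Hi</p><span>x</span></div>')
print(parser.found)
# [('div', 'main', ['box', 'wide']), ('p', None, ['note'])]
```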
***

## 4. Quick Review of XPath and CSS Selector

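Both are used to navigate through elements and attributes in HTML pages. XPath addresses nodes by their position in the document tree, while CSS Selectors pick elements with the same notation that style sheets use to target them (tag names, classes, ids and attributes).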

### 4.1. XPath

The basic XPath syntax to select nodes/elements in an HTML document is:

| Expression                                     | Description                                                                                                             |
|------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| nodename                                       | selects all nodes with the name "nodename".                                                                             |
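| /                                              | selects from the root node (at the start of an expression) or separates the steps of a path.                           |
| //                                             | selects matching nodes at any depth below the current node, no matter where they are in the document.                  |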
| .                                              | selects the current node.                                                                                               |
| ..                                             | selects the parent of the current node.                                                                                 |
| @                                              | selects attributes.                                                                                                     |
| parent/nodename[n]                             | selects the n-th "nodename" element that is the child of the "parent" element. The value of "n" starts from one.        |
| //nodename[@attr_name]                         | selects all the "nodename" elements that have an attribute named "attr_name".                                           |
| //nodename[@attr_name='attr_value']            | selects all the "nodename" elements that have an "attr_name" attribute with a value of "attr_value".                    |
| //nodename[contains(@attr_name, 'attr_value')] | selects all the "nodename" elements whose "attr_name" attribute contains the substring "attr_value". |
| *                                              | matches any element node. |
| @*                                             | matches any attribute node.|
| nodename/@attr_name                            | selects the value of the "attr_name" attribute of the "nodename" element. |
| nodename/text()                                | selects the text nodes that are direct children of the "nodename" element. |

For more information about XPath, I suggest you take [W3Schools' XPath tutorial](https://www.w3schools.com/xml/xpath_intro.asp).

### 4.2. CSS Selector

The basic CSS Selector syntax to select nodes/elements in an HTML document is:

| Selector | Description |
|----------|-------------|
| .class_node | selects all elements with class="class_node". |
| .class1.class2 | selects all elements with both "class1" and "class2" set within their class attribute. |
| #id_node | selects the element with id="id_node". |
| * | selects all elements. |
| tagname | selects all "tagname" elements. |
| tagname1,tagname2 | selects all "tagname1" elements and all "tagname2" elements. |
| tagname1 tagname2 | selects all "tagname2" elements inside "tagname1" elements. |
| tagname1 > tagname2 | selects all "tagname2" elements where the parent is a "tagname1" element. |
| [attr_name] | selects all elements with an "attr_name" attribute. |
| [attr_name=attr_value] | selects all elements with attr_name="attr_value". |
| [attr_name~=attr_value] | selects all elements whose "attr_name" attribute contains "attr_value" as a whole, space-separated word. |
| [attr_name*=attr_value] | selects every element whose "attr_name" attribute value contains the substring "attr_value". |
| nodename:nth-of-type(n) | selects every "nodename" element that is the n-th "nodename" child of its parent. The value of "n" starts from one. |
| nodename::attr(attr_name) | selects the value of the "attr_name" attribute of the "nodename" element (a Scrapy extension, not standard CSS). |
| nodename::text | extracts the text directly inside the "nodename" element (a Scrapy extension, not standard CSS). |

For more information about CSS Selector, see [W3Schools' CSS Selector Reference](https://www.w3schools.com/cssref/css_selectors.asp).

### 4.3. Examples

| XPath | CSS Selector | Explanation |
|-------|--------------|-------------|
| /html/body/div | html > body > div | selects all "div" elements that are children of the "body" element. |
| //table | table | selects all "table" elements of the HTML page. |
| /html/body/div[2]//table | html > body > div:nth-of-type(2) table | selects all "table" elements contained in the second "div" child of the "body" element. |
| /html/body/* | html > body > * | selects all children of the "body" element. |
| //p[@class="class-1"] | p.class-1 | selects all "p" elements with the "class-1" class. The XPath form matches only when the whole "class" attribute equals "class-1", while the CSS form also matches elements that carry additional classes. |
| //*[@id="uid"] | *#uid | selects the element, whatever its tag, whose "id" attribute is equal to "uid". |
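
Some of the XPath expressions above can be tried without installing anything. The sketch below runs them with the standard library's `xml.etree.ElementTree`, which is an assumption made for this illustration: it supports only a subset of XPath (no `contains()`, `text()` or `/@attr`), needs well-formed markup, and expects paths relative to the root element, so `/html/body/div` is written as `./body/div`. The sample page is made up.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample page (hypothetical content).
PAGE = """
<html>
  <body>
    <div><p class="class-1">first</p></div>
    <div>
      <table id="uid"><tr><td>cell</td></tr></table>
      <p class="class-1 extra">second</p>
    </div>
  </body>
</html>
"""
root = ET.fromstring(PAGE.strip())  # root is the <html> element

divs = root.findall("./body/div")              # like /html/body/div
tables = root.findall("./body/div[2]//table")  # like /html/body/div[2]//table
exact = root.findall(".//p[@class='class-1']") # like //p[@class="class-1"]
by_id = root.findall(".//*[@id='uid']")        # like //*[@id="uid"]

print(len(divs), len(tables), [p.text for p in exact], by_id[0].tag)
# 2 1 ['first'] table
```

The second paragraph is not returned by the `@class='class-1'` test because its class attribute is "class-1 extra"; a CSS selector such as `p.class-1` would match both paragraphs.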

***

## 5. Visualizing/Inspecting the HTML pages

An important aid for Web Scraping is the browser's element inspector. It lets you visualize the elements of an HTML page and their attributes, which makes it easier to decide what to extract and how to locate it.

<img src="img/inspect.png" />

***

## 6. Getting the HTML pages' source

One way of getting the source of HTML pages is the Python __requests__ module. The __get__ function performs an HTTP request and returns a response object whose __content__ attribute holds the page's body as bytes. You should then decode these bytes, here using the UTF-8 encoding.

Another way of getting the source code is the __request__ module of Python's __urllib__ library. The __urlopen__ function returns an HTTPResponse object, which represents the response to an HTTP request. To obtain the received content, call the __read__ method, which returns the body as bytes. As before, you should decode the bytes using the UTF-8 encoding.
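
Both approaches return bytes, so the decoding step is the same. As a minimal sketch of that step, the hypothetical helper `decode_body` below takes the raw bytes and an optional `Content-Type` header value, uses the declared charset when there is one, and falls back to UTF-8 otherwise. The helper name and the fallback policy are assumptions for illustration; this notebook simply decodes as UTF-8.

```python
from email.message import Message


def decode_body(raw: bytes, content_type: str = "") -> str:
    """Decodes a response body, preferring the charset declared in Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    charset = msg.get_content_charset() or "utf-8"  # None when no charset is declared
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset or invalid bytes: decode as UTF-8, replacing bad bytes.
        return raw.decode("utf-8", errors="replace")


utf8_text = decode_body("café".encode("utf-8"))
latin_text = decode_body("café".encode("latin-1"), "text/html; charset=ISO-8859-1")
print(utf8_text, latin_text)
```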

In [None]:
# Determining the URL of target page.
url = "https://www.imdb.com/chart/top/"

In [None]:
# Importing the requests module and the request module of urllib library.
import requests, urllib.request
from urllib.request import Request

In [None]:
# Creating the function that gets the source of an HTML page.
def get_page(url, is_urlopen=False):
    """ This function gets the HTML page's source decoded as UTF-8. """
    header = {"User-Agent": "Root Scraping"}
    if is_urlopen:
        # urllib: building a Request that carries the header and reading the raw bytes.
        req = Request(url=url, headers=header)
        html = urllib.request.urlopen(req).read()
    else:
        # requests: the content attribute holds the raw bytes of the response body.
        html = requests.get(url, headers=header).content
    # Decoding the bytes of the HTML page as UTF-8.
    return html.decode("utf-8")

In [None]:
# Getting the page using the requests module.
html = get_page(url)

In [None]:
# Showing the first 200 characters.
html[:200]

In [None]:
# Getting the page using the urlopen method.
html = get_page(url, True)

In [None]:
# Showing the first 200 characters.
html[:200]

***

## 7. Extracting data with scrapy Selector

The __Selector__ object from the __scrapy__ library __selects and extracts data__ using XPath or CSS Selector notation. Pass the HTML source to the __text__ argument of its constructor. The Selector object has two methods to select HTML elements: __xpath__ and __css__. Both return a __SelectorList__, which is a list of __Selector__ objects, so you can call __xpath__ or __css__ again on the result to narrow the selection to specific pieces of the HTML code. The __extract__ method returns a list with all the extracted data, while __extract_first__ returns only the first item of that list. The __getall__ and __get__ methods are the newer names for __extract__ and __extract_first__, respectively.

To turn relative URLs into absolute ones, you can use the __parse__ module of the __urllib__ library. Its __urljoin__ function builds the absolute URL from the URL of the page where the link was found (the base URL) and the relative URL.
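
A short sketch of how urljoin resolves different kinds of links against a base URL. The relative paths below are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://www.imdb.com/chart/top/"

# A link starting with "/" is resolved against the site root.
root_relative = urljoin(base, "/title/tt0111161/")
# A link without a leading "/" is resolved against the base URL's directory.
page_relative = urljoin(base, "page2.html")
# An absolute link is returned unchanged.
absolute = urljoin(base, "https://example.com/other/")

print(root_relative)  # https://www.imdb.com/title/tt0111161/
print(page_relative)  # https://www.imdb.com/chart/top/page2.html
print(absolute)       # https://example.com/other/
```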

In [None]:
# To install the scrapy library, uncomment the line below.
# %pip install scrapy

In [None]:
# Importing the scrapy Selector and the urljoin method.
from scrapy import Selector
from urllib.parse import urljoin

In [None]:
# Creating the function that gets the links contained in a page.
def get_links(url, locator_links, is_css=True, locator_pagination=None):
    """ This function gets the page's links based on an XPath/CSS locator. """

    # Getting the absolute URLs contained in an HTML page, using XPath or CSS Selector.
    html = get_page(url)
    sel = Selector(text=html)

    # Extracting the relative links.
    links = sel.css(locator_links).extract() if is_css else sel.xpath(locator_links).extract()

    # Generating the absolute URLs.
    links = [urljoin(url, item) for item in links]

    # Getting the pagination of absolute URLs.
    if locator_pagination:
        for link in list(links):
            links_pagination = get_links(link, locator_pagination)
            if links_pagination and not set(links_pagination).issubset(set(links)):
                links.remove(link)
                links += links_pagination

    return links

### 7.1. Getting the absolute URLs of the IMDb Top 250 Movies

In [None]:
# Getting the absolute URLs of the IMDb Top 250 Movies.
xpath = "//div[@data-testid='chart-layout-main-column']/ul/li/div[2]/div/div/div[contains(@class, 'ipc-title')]"
xpath += "/a[@class='ipc-title-link-wrapper']/@href"
links = get_links(url, xpath, False)
print(len(links))

css = "div[data-testid='chart-layout-main-column'] > ul > li > div:nth-child(2) > div > div > "
css += "div[class*='ipc-title'] > a[class='ipc-title-link-wrapper']::attr(href)"
links = get_links(url, css)
print(len(links))

### 7.2. Extracting data of the IMDb Top 250 Movies

In [None]:
# Creating the function that extracts the data based on a list of CSS Selectors.
def get_data(url, css_list):
    """ This function gets the content matched by each CSS Selector and returns it as a list of tuples. """
    html = get_page(url)
    sel = Selector(text=html)
    # Each selector yields a tuple with its matches, or "" when nothing matches.
    return [tuple(sel.css(item_css).getall()) or "" for item_css in css_list]

In [None]:
# Defining the list of CSS Selectors to extract data.
css = [
    # Title
    "section.ipc-page-section > div:nth-child(2) > div:nth-child(1) > h1[data-testid='hero__pageTitle'] > span::text",
    # Year
    "section.ipc-page-section > div:nth-child(2) > div:nth-child(1) > ul[class*='ipc-inline-list'] > li:nth-child(1) > a::text",
    # Duration
    "section.ipc-page-section > div:nth-child(2) > div:nth-child(1) > ul[class*='ipc-inline-list'] > li:nth-child(3)::text",
    # Grade
    "section.ipc-page-section > div:nth-child(2) > div:nth-child(2) > div > div:nth-child(1) > a > span > div > div:nth-child(2) > div:nth-child(1) > ::text",
    # Popularity
    "section.ipc-page-section > div:nth-child(2) > div:nth-child(2) > div > div:nth-child(3) > a > span > div > div:nth-child(2) > div:nth-child(1)::text",
    # Category
    "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(1) > div:nth-child(2) > a > span::text",
    # Description
    "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > p > span:nth-child(1)::text",
    # Direction
    "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(3) > div > ul > li:nth-child(1) > div > ul > li > a::text",
    # Screenwriters
    "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(3) > div > ul > li:nth-child(2) > div > ul > li > a::text",
    # Main Actors
    "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(3) > div > ul > li:nth-child(3) > div > ul > li > a::text",
    # Main Actors Cast
    "section[data-testid='title-cast'] > div:nth-child(2) > div:nth-child(2) > div[data-testid='title-cast-item'] > div:nth-child(2) > a::text"
]

In [None]:
# Getting and extracting the data.
data = [["Title","Year","Duration","Grade","Popularity","Categories","Description","Directors","Screenwriters","Main_Actors","Main_Cast"]]
data.extend([get_data(link, css) for link in links])

In [None]:
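# Normalizing the data.
multi_valued = [3, 5, 7, 8, 9, 10]  # indexes of the fields that may hold several values
for idx, movie in enumerate(data[1:], start=1):
    # Empty fields become None; single-valued fields keep only their first match.
    movie = [None if len(value) == 0 else (value if i in multi_valued else value[0])
             for i, value in enumerate(movie)]
    # Joining the grade fragments; the field stays None when nothing was extracted.
    movie[3] = "".join(movie[3]) if movie[3] else None
    data[idx] = movie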

In [None]:
# Printing the header plus the first five records and the number of extracted movies.
print(data[:6])
print("Number of extracted movies:", len(data) - 1)

### 7.3. Storing the data in a CSV file

To store the data in a CSV file, you can use the __csv__ module. Before reading or writing any content, the file must be opened. For this, call the built-in __open__ function, which (in text mode) returns a __TextIOWrapper__ object that is used to __manipulate__ the __selected file__.

The __writer__ function of the __csv__ module creates a __writer__ object that writes content to the file. The __writerow__ method writes a single row to the selected file, while the __writerows__ method writes several rows at once. The former receives one row (a list of values) and the latter receives a list of rows.
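
A minimal sketch of the csv approach, using the same "|" separator and quoting as the rest of this notebook. The two sample rows are placeholders, and an in-memory `io.StringIO` buffer stands in for the file so the example leaves nothing on disk; with a real file you would use `open("movies.csv", "w", newline="", encoding="utf-8")` instead (the file name is hypothetical).

```python
import csv
import io

header = ["Title", "Year"]
rows = [["The Godfather", "1972"], ["12 Angry Men", "1957"]]  # sample rows

buffer = io.StringIO()  # stands in for an opened text file
writer = csv.writer(buffer, delimiter="|", quoting=csv.QUOTE_ALL)
writer.writerow(header)  # one row
writer.writerows(rows)   # several rows at once

lines = buffer.getvalue().splitlines()
print(lines)
```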

Another approach is to use the **pandas** library, one of the main libraries for Data Science with Python. It allows you to work with data in a **tabular format** through the **DataFrame** object. The **to_csv** method saves the data of a **DataFrame** object into a **CSV file**.

In [None]:
# Importing the required libraries.
import pandas as pd, csv

In [None]:
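# Saving the collected data. The first row of "data" already holds the column names,
# so the DataFrame's own (numeric) header is not written.
pd.DataFrame(data).to_csv("data_top_250_movies_imdb.csv", sep="|", header=False, index=False, quoting=csv.QUOTE_ALL)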

***

## 8. Crawling with scrapy Response objects

**Response** objects offer the **same** selection functionality as **Selector** ones: the **xpath** and **css** methods, whose results provide **extract** and **extract_first** (or **getall** and **get**). In addition, the Response object **keeps track of the URL** the HTML code was loaded from, in its **url** attribute. This **helps us move from one page to another**, so that we can "crawl" the web while scraping. With the **follow** method, the Response lets us "follow" a new link.

The **Spider** class determines how to perform the **crawl** (follow links) and how to **extract data** from the pages (scraping items). To create your Spider, write a subclass of **scrapy.Spider**. The Spider in this notebook defines the **start_requests** method and the parsing methods. The **former** defines the **starting point** of the Spider. The **latter** are responsible for **handling the HTML** code, and you need **at least one parse function**. A **scrapy.Request** object describes a page to download; Scrapy fetches it and passes the resulting response to the callback. The **url** argument tells Scrapy which page to request. The **callback** argument tells Scrapy which function receives the response for processing. The **cb_kwargs** argument receives a dictionary with the extra keyword arguments, and their values, that are passed to the **callback** function. The **follow** method accepts the same arguments as **scrapy.Request** and also accepts relative URLs.

Finally, the **CrawlerProcess** objects of the **scrapy.crawler** module are responsible for setting up and executing the Spider. The **crawl** method receives the **Spider's subclass** that will be used to extract data. The **start** method starts the execution of the defined Spider's subclass.

In [None]:
# Importing the required libraries.
import scrapy, pandas as pd, csv
from scrapy.crawler import CrawlerProcess

In [None]:
# Determining the URL of target page.
url = "https://www.imdb.com/chart/top/"

In [None]:
# Definition of Spider class
class SpiderIMDbMovies(scrapy.Spider):
    name = "imdb_top_250_movies"
    user_agent = name

    # Start point to run the spider.
    def start_requests(self):
        # Getting the relative URLs of the IMDb Top 250 Movies.
        css = "div[data-testid='chart-layout-main-column'] > ul > li > div:nth-child(2) > div > div > " \
              "div[class*='ipc-title'] > a[class='ipc-title-link-wrapper']::attr(href)"
        args = dict(css=css)
        yield scrapy.Request(url=url, callback=self.parse_links, cb_kwargs=args)

    def parse_links(self, response, css):
        # Extracting the relative URLs.
        links = response.css(css).getall()

        # Defining the CSS Selectors.
        css = {
            "title": "section.ipc-page-section > div:nth-child(2) > div:nth-child(1) > h1[data-testid='hero__pageTitle'] > span::text",
            "year": "section.ipc-page-section > div:nth-child(2) > div:nth-child(1) > ul[class*='ipc-inline-list'] > li:nth-child(1) > a::text",
            "duration": "section.ipc-page-section > div:nth-child(2) > div:nth-child(1) > ul[class*='ipc-inline-list'] > li:nth-child(3)::text",
            "grade": "section.ipc-page-section > div:nth-child(2) > div:nth-child(2) > div > div:nth-child(1) > a > span > div > div:nth-child(2) > div:nth-child(1) > ::text",
            "popularity": "section.ipc-page-section > div:nth-child(2) > div:nth-child(2) > div > div:nth-child(3) > a > span > div > div:nth-child(2) > div:nth-child(1)::text",
            "category": "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(1) > div:nth-child(2) > a > span::text",
            "description": "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > p > span:nth-child(1)::text",
            "direction": "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(3) > "
                         "div > ul > li:nth-child(1) > div > ul > li > a::text",
            "screenwriters": "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(3) > "
                             "div > ul > li:nth-child(2) > div > ul > li > a::text",
            "main_actors": "section.ipc-page-section > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > section > div:nth-child(3) > "
                           "div > ul > li:nth-child(3) > div > ul > li > a::text",
            "main_actors_cast": "section[data-testid='title-cast'] > div:nth-child(2) > div:nth-child(2) > div[data-testid='title-cast-item'] > "
                                "div:nth-child(2) > a::text"
        }

        # Getting the data.
        args = dict(css=css)
        for link in links:
            yield response.follow(url=link, callback=self.parse_data, cb_kwargs=args)

    def parse_data(self, response, css):
        # Extracting the information: one tuple of matches per field, or "" when nothing matches.
        row = [tuple(response.css(item_css).getall()) or "" for item_css in css.values()]

        # Normalizing the information: empty fields become None and single-valued fields keep only their first match.
        multi_valued = [3, 5, 7, 8, 9, 10]
        row = [None if len(value) == 0 else (value if i in multi_valued else value[0])
               for i, value in enumerate(row)]
        # Joining the grade fragments; the field stays None when nothing was extracted.
        row[3] = "".join(row[3]) if row[3] else None
        # Inserting the record into the list "data".
        data.append(row)

In [None]:
# Creating the list "data" and inserting the header within it.
data = [["Title","Year","Duration","Grade","Popularity","Categories","Description","Directors","Screenwriters","Main_Actors","Main_Cast"]]

In [None]:
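# Execution Process to run the spider.
# Note: start() blocks until the crawl finishes and can be called only once per kernel session;
# restart the kernel before running this cell again.
process = CrawlerProcess()  # initiating a CrawlerProcess.
process.crawl(SpiderIMDbMovies)  # telling the Process which Spider to use.
process.start()  # starting the Crawling Process.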

In [None]:
# Saving the data into a CSV file. The first row of "data" already holds the column names.
pd.DataFrame(data).to_csv("data_top_250_movies_imdb_2.csv", sep="|", header=False, index=False, quoting=csv.QUOTE_ALL)

## 9. Visualizing the extracted data with Pandas

The **pandas** library is one of the main libraries for Data Science with Python. It allows you to work with data in a **tabular format** through the **DataFrame** object. The **read_csv** function creates a **DataFrame** object from the **CSV file**. The **info** method shows a summary of the data, such as the columns, their types and the number of non-null values. The **head** method shows the first five records of your DataFrame.

In [None]:
# creating the DataFrame object from the extracted data (CSV file)
data_df = pd.read_csv("data_top_250_movies_imdb_2.csv", delimiter="|", header=0, index_col=None)

In [None]:
# visualizing some information with respect to the DataFrame object
data_df.info()

In [None]:
# visualizing the first five records
data_df.head()

## 10. Creating a Word Cloud with NLP

In [None]:
# Installation of NLTK and WordCloud libraries.
# %pip install nltk
# %pip install wordcloud

In [None]:
# Getting the column "Description", skipping the movies without a description.
descriptions = data_df["Description"].dropna()

In [None]:
# Importing the required libraries.
import nltk
import string
import re
from wordcloud import WordCloud
from nltk import word_tokenize
from nltk.corpus import stopwords
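# Downloading the tokenizer models and the stopword lists.
# Recent NLTK releases look for the "punkt_tab" resource instead of "punkt";
# if word_tokenize raises a LookupError, download that resource as well.
nltk.download("punkt")
nltk.download("stopwords")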

In [None]:
# Getting the english stopwords.
stop_words = stopwords.words("english")
stop_words.extend(["one", "two", "three", "four", "five", "six", "seven",
                   "eight", "nine", "ten", "ii"])

In [None]:
# Executing some text cleaning tasks.
# re.escape keeps every punctuation character literal inside the character class.
punctuation = re.compile("[“”`'…" + re.escape(string.punctuation) + "]+")
words = [punctuation.sub("", word.lower().strip())
         for desc in descriptions for word in word_tokenize(desc)]
words = [word for word in words if word != "" and word not in stop_words]

In [None]:
# Setting the word cloud.
wordcloud = WordCloud(max_font_size=300, max_words=100, width=1500,
                      height=900, stopwords=stop_words, background_color="white").generate(" ".join(words))

# Saving the generated image.
wordcloud.to_file("tag_cloud.png")