# Web Scraping in Python

## Course Description
The ability to build tools capable of retrieving and parsing information stored across the internet has been and continues to be valuable in many veins of data science. In this course, you will learn to navigate and parse HTML code and to build tools that crawl websites automatically. Although our scraping will be conducted using the versatile Python library scrapy, many of the techniques you learn in this course can be applied to other popular Python libraries as well, including BeautifulSoup and Selenium. Upon completion of this course, you will have a strong mental model of HTML structure, be able to build tools to parse HTML code and access desired information, and be able to create simple scrapy spiders to crawl the web at scale.

In [1]:
import numpy as np
import scrapy
from scrapy import Selector

# 1. Introduction to HTML

Learn the structure of HTML. We begin by explaining why web scraping can be a valuable addition to your data science toolbox and then delve into some basics of HTML. We end the chapter with a brief introduction to XPath notation, which is used to navigate the elements within HTML code.

### From Tree to HTML
Here you are given the chance to create your own bit of HTML code (as a python string). More specifically, below is an HTML tree image and you will finish the missing code within the string html which produces this HTML tree.

![](html_tree_exercise_resize.png)

To note:

- We have started the string html for you, to help nudge you in the correct direction.
- The spacing we use in the portion of the string we provide (e.g., indenting \<head> two spaces more than \<html>) isn't necessary, but we did so just to make it easier to read.

#### Instructions
- Fill in the blank within the string variable html so that the HTML code matches its tree representation (including the text within the two paragraph children).

In [2]:
html = '''
<html>
  <head>
    <title>Intro HTML</title>
  </head>
  <body>
    <p>Hello World!</p>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

### Keep it Classy
In this two-part exercise, you will have a chance to show off what you've learned about attributes; in this case, we focus on the class attribute.

#### Instructions
- Fill in the blank in the HTML code string html to assign a class attribute to the second div element which has the value "you-are-classy".


In [3]:
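def whats_my_class( html ):
    sel = Selector( text = html )
    try:
        # Select the second div element, then read its class attribute
        div_class = sel.xpath( '//div' )[1].xpath( './@class' )[0].extract()
        print( "The class you assigned to the second div element is:", div_class )
    except IndexError:
        # Raised when there is no second div or it has no class attribute
        print( "No second div element class found!" )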

In [4]:
# HTML code string
html = '''
<html>
  <body>
    <div class="class1" id="div1">
      <p class="class2">Visit DataCamp!</p>
    </div>
    <div class = "you-are-classy">
      <p class="class2">Keep up the good work!</p>
    </div>
  </body>
</html>
'''
# Print out the class of the second div element
whats_my_class( html )

The class you assigned to the second div element is: you-are-classy


### Where am I?
In this exercise, you will navigate to a specific element using your new knowledge of XPath notation.

Consider the HTML code:

```Html
<html>
  <body>
    <div>
      <p>Good Luck!</p>
      <p>Not here...</p>
    </div>
    <div>
      <p>Where am I?</p>
    </div>
  </body>
</html>
```

Your job will be to create an XPath string using single forward-slashes and brackets which navigates to the paragraph p element which contains the text "Where am I?".

#### Instructions

- Using only single forward-slashes to move between generations, and brackets to select the correct element, assign a string to the variable xpath that directs to the paragraph element containing "Where am I?".
- Do not include any blank spaces in the string you assign to xpath.

In [5]:
# Fill in the blank
xpath = '/html/body/div[2]/p'

### It's Time to P
In the lecture, we learned how to use double forward-slashes to navigate to all future generations. In this exercise, you will select all paragraph p elements within the HTML. Because we want you to navigate to all paragraph elements, it is not important that you know what the HTML code is, since the task can be accomplished with a simple XPath string using the double forward-slash notation you have learned.

#### Instructions

- Using double forward-slash notation, assign to the variable xpath a simple XPath string navigating to all paragraph p elements within any HTML code.

In [6]:
# Fill in the blank
xpath = '//p'

### A classy span
Although we haven't yet gone deep into XPath, one thing we can do is select elements by their attributes using an XPath. For example, if we want to direct to the div element within the HTML document whose id attribute is "uid", then we could write the XPath string '//div[@id="uid"]'. The first part of this string, //div, first looks at all div elements in the HTML document. Then, using the brackets, we specify that we want only the div element with a specific id attribute (in this case uid). To note, the phrase @id="uid" in the brackets would be read as "attribute id equals uid".

In this exercise, you will select all span elements whose class attribute equals "span-class". (Note: span is just another possible tag-name).

#### Instructions

- Assign to the variable xpath an XPath string which will select all span elements whose class attribute equals "span-class". You do not need to see the actual HTML code to do this!

In [7]:
# Fill in the blank
xpath = '//span[@class="span-class"]'

# 2. XPaths and Selectors

Leverage XPath syntax to explore scrapy selectors. Both of these concepts will move you towards being able to scrape an HTML document.

### Body Appendages
We have loaded the HTML from a secret website and have used it to create a function how_many_elements(). The way this function works is that you pass it an XPath string and it will print out the number of elements the XPath you wrote has selected. For example, running the code how_many_elements('//*') in the console will print out the total number of elements the HTML document has (try it!).

Your job in this exercise is to create an XPath string which can be used to direct to all child elements of the body (regardless of tag type). To note, you can first test your solution with how_many_elements() to find the total number of children in the body element if you wish.

Note that the exercises in this chapter may take some time to load.

#### Instructions

- Assign to the variable xpath an XPath string which directs to all child elements of the body element. There is only one body element in this HTML document and it is a child of the root html element.


In [8]:
html = open('data/Body_appendages.html','r').read()

In [9]:
def how_many_elements( xpath ):
    sel = Selector( text = html )
    print( len(sel.xpath( xpath )) )

In [10]:
# Create an XPath string to direct to children of body element
xpath = '//body/*'

# Print out the number of elements selected
how_many_elements( xpath )


22


### Choose DataCamp!
In this exercise, we want to give you the opportunity to create your own XPath string to achieve a certain task; the task is to select the paragraph element containing the text "Choose DataCamp!".

Consider the following HTML:

```Html
<html>
  <body>
    <div>
      <p>Hello World!</p>
      <div>
        <p>Choose DataCamp!</p>
      </div>
    </div>
    <div>
      <p>Thanks for Watching!</p>
    </div>
  </body>
</html>
```


We have created the function print_element_text() for you, which will print the text contained in your element (if it contains any). Feel free to use this function to check if your solution is correct!

#### Instructions

- Assign to the variable xpath an XPath string to direct to the paragraph element containing the phrase: "Choose DataCamp!".

In [11]:
html = '''
<html>
  <body>
    <div>
      <p>Hello World!</p>
      <div>
        <p>Choose DataCamp!</p>
      </div>
    </div>
    <div>
      <p>Thanks for Watching!</p>
    </div>
  </body>
</html>
'''

In [12]:
def print_element_text( xpath ):
    sel = Selector( text = html )
    text = ' '.join( sel.xpath( xpath ).xpath( './text()' ).extract() )
    print( text )

In [13]:
# Create an XPath string to the desired paragraph element
xpath = '//body/div[1]/div/p'

# Print out the element text
print_element_text( xpath )

Choose DataCamp!


### Where it's @
In this exercise, you'll begin to write an XPath string using attributes to achieve a certain task; that task is to select the paragraph element containing the text "Thanks for Watching!". We've already created most of the XPath string for you.

Consider the following HTML:

```Html
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>
```

We have created the function print_element_text() for you, which will print any text contained in your element.

#### Instructions

- Fill in the blanks in the XPath string to select the paragraph element containing the phrase: "Thanks for Watching!".

In [14]:
# Create an Xpath string to select desired p element
xpath = '//*[@id="div3"]/p'

# Print out selection text
print_element_text( xpath )




### Check your Class
This exercise is to emphasize that when you use an XPath to select an element by its class attribute without using the contains() function, you match the class exactly. Your job is to fill in the blank below and finish the variable xpath directing to the specified element.

Consider the following HTML:

```Html
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>
```

#### Instructions

- Fill in the blanks in the xpath below to select the paragraph element containing the phrase: "Hello World!".

In [15]:
# Create an XPath string to select p element by class
xpath = '//p[@class="class-1 class-2"]'

# Print out select text
print_element_text( xpath )




### Hyper(link) Active
One of the most important attributes to extract for "web-crawling" is the hyperlink url (href attribute) within an a tag. Here, you will extract such a hyperlink! We have created the function print_attribute to print out the data extracted from your XPath, so you can test your XPath strings in the console, if you like.

#### Instructions

- Fill in the blanks to complete the variable xpath below to select the href attribute value from the DataCamp hyperlink.

In [16]:
html = '''
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose 
            <a href="http://datacamp.com">DataCamp!</a>!
        </p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>
'''

In [17]:
def print_attribute( xpath ):
    sel = Selector(text = html)
    print( "You have selected:" )
    for i,el in enumerate(sel.xpath( xpath ).extract()):
        print( "%d) %s" % (i+1, el) )

In [18]:
# Create an xpath to the href attribute
xpath = '//p[@id="p2"]/a/@href'

# Print out the selection(s); there should be only one
print_attribute( xpath )

You have selected:
1) http://datacamp.com


### Secret Links
We have loaded the HTML from a secret website and have used it to create the functions how_many_elements() and preview(). The function how_many_elements() allows you to pass in an XPath string and it will print out the number of elements the XPath you wrote has selected. The function preview() allows you to pass in an XPath string and it will print out the first few elements you've selected.

Your job in this exercise is to create an XPath which directs to all href attribute values of the hyperlink a elements whose class attributes contain the string "course-block". If you do it correctly, you should find that you have selected 40 elements with your XPath string and that it previews links (with some repetition).

#### Instructions

- Fill in the blanks below to assign an XPath string to the variable xpath which directs to all href attribute values of the hyperlink a elements whose class attributes contain the string "course-block". Remember that we use the contains call within the XPath string to check if an attribute value contains a particular string.

In [19]:
html = open('data/Body_appendages.html','r').read()

In [20]:
def preview( xpath ):
    sel = Selector(text = html)
    els = sel.xpath( xpath ).extract()
    n = len(els)
    for i,el in enumerate( els[:min(4,n)]):
        print( "Element %d: %s" % (i+1,el) )

In [21]:
# Create an xpath to the href attributes
xpath = '//a[contains(@class,"course-block")]/@href'

# Print out how many elements are selected
how_many_elements( xpath )
# Preview the selected elements
preview( xpath )

40
Element 1: /courses/intro-to-python-for-data-science
Element 2: /courses/intro-to-python-for-data-science
Element 3: /courses/free-introduction-to-r
Element 4: /courses/free-introduction-to-r


### XPath Chaining
Selector and SelectorList objects allow for chaining when using the xpath method. What this means is that you can call the xpath method again on the result of a previous xpath call. For example, if sel is the name of our Selector, then

`sel.xpath('/html/body/div[2]')`

is the same as

`sel.xpath('/html').xpath('./body/div[2]')`

or is the same as

`sel.xpath('/html').xpath('./body').xpath('./div[2]')`

The only catch is that you need to "glue together" the XPath pieces by using a period at the start of each subsequent XPath string (notice the periods we added to the XPath strings in our examples).

#### Instructions

- Fill in the blank below to chain together two xpath calls which result in the same selection as
`sel.xpath('//div/span/p[3]')`

In [22]:
html = '\n<html>\n<body>\n<div>HELLO</div>\n<div><p>GOODBYE</p></div>\n<div><span><p>NOPE</p><p>ALMOST</p><p>YOU GOT IT!</p></span></div>\n</body>\n</html>\n'

In [23]:
sel = Selector(text = html)
# Chain together xpath methods to select desired p element
sel.xpath( '//div' ).xpath( './span/p[3]' )

[<Selector xpath='./span/p[3]' data='<p>YOU GOT IT!</p>'>]

### Divvy Up This Exercise
We have pre-loaded an HTML into the string variable html. In this two part problem you will use this html variable as the HTML document to set up a Selector object with, and create a SelectorList which selects all div elements; then, you will check your understanding of what happens within the SelectorList.

#### Instructions

- Set up the Selector object sel with the html variable passed as the text argument.
- Assign to the variable divs a SelectorList of all div elements within the HTML document.

In [24]:
html =  '\n<html>\n<body>\n<div>Div 1: <p>paragraph 1</p></div>\n<div>Div 2: <p>paragraph 2</p> <p>paragraph 3</p> </div>\n<div>Div 3: <p>paragraph 4</p> <p>paragraph 5</p> <p>paragraph 6</p></div>\n<div>Div 4: <p>paragraph 7</p></div>\n<div>Div 5: <p>paragraph 8</p></div>\n</body>\n</html>\n'

In [25]:
# Create a Selector selecting html as the HTML document
sel = Selector( text = html )

# Create a SelectorList of all div elements in the HTML document
divs = sel.xpath( '//div' )
divs

[<Selector xpath='//div' data='<div>Div 1: <p>paragraph 1</p></div>'>,
 <Selector xpath='//div' data='<div>Div 2: <p>paragraph 2</p> <p>par...'>,
 <Selector xpath='//div' data='<div>Div 3: <p>paragraph 4</p> <p>par...'>,
 <Selector xpath='//div' data='<div>Div 4: <p>paragraph 7</p></div>'>,
 <Selector xpath='//div' data='<div>Div 5: <p>paragraph 8</p></div>'>]

### Requesting a Selector
We have pre-loaded the URL for a particular website in the string variable url and use the requests library to put the content from the website into the string variable html. Your task is to create a Selector object sel using the HTML source code stored in html.

#### Instructions

- Fill in the two blanks below to create the Selector object sel, which uses the string html as its text input.

In [26]:
url = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

In [27]:
# Import requests
import requests

# Create the string html containing the HTML source
html = requests.get( url ).text

# Create the Selector object sel from html
sel = Selector( text = html )

# Print out the number of elements in the HTML document
print( "There are 1020 elements in the HTML document.")
print( "You have found: ", len( sel.xpath('//*') ) )

There are 1020 elements in the HTML document.
You have found:  1020


# 3. CSS Locators, Chaining, and Responses

Learn CSS Locator syntax and begin playing with the idea of chaining together CSS Locators with XPath. We also introduce Response objects, which behave like Selectors but give us extra tools to mobilize our scraping efforts across multiple websites.

### The (X)Path to CSS Locators
Many people prefer using CSS Locator notation to XPath notation. As we will see later, it often makes attribute selection very easy. To help get you more comfortable going back and forth between XPath and CSS Locator strings, we give you a chance in this exercise to do some direct "translation" between the two.

Note that the exercises in this chapter may take some time to load.

#### Instructions 

- Assign to the variable css_locator a CSS Locator string which is equivalent to the XPath string given.
- Assign to the variable xpath an XPath string which is equivalent to the CSS Locator string given.

In [28]:
# Create the XPath string equivalent to the CSS Locator 
xpath = '/html/body/span[1]//a'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'html > body > span:nth-of-type(1) a'

# Create the XPath string equivalent to the CSS Locator 
xpath = '//div[@id="uid"]/span//h4'

# Create the CSS Locator string equivalent to the XPath
css_locator = 'div#uid > span h4'

### Get an "a" in this Course
We have loaded the HTML from a secret website which you will use to set up a Selector object and the function how_many_elements(). When passing this function a CSS Locator string, it will print out the number of elements that the CSS Locator you wrote has selected.

In the second part of this problem, we want you to create a CSS Locator string which will select a certain collection of elements as described here: Select the hyperlink (a element) children of all div elements belonging to the class "course-block" (that is, any div element with a class attribute such that "course-block" is one of the classes assigned). The number of such elements is 11, so you can check your solution with how_many_elements if you choose.

#### Instructions 
- Fill in the blank below to create the Selector object sel using the string html as the text input.
- Assign the variable css_locator a CSS Locator string which directs to the hyperlink (a element) children of all div elements belonging to the class "course-block".

In [29]:
def open_html_raw(filename):
    return open(filename,'r').read()

def how_many_elements_css( css ):
  sel = Selector( text = html )
  print( len(sel.css( css )) )

In [30]:
html = open_html_raw('data/getAnA.html')
html1 = html

In [31]:
# Create a selector from the html (of a secret website)
sel = Selector( text = html )

# Fill in the blank: a children of div elements in class "course-block"
css_locator = 'div.course-block > a'

# Print the number of selected elements.
how_many_elements_css( css_locator )

11


### The CSS Wildcard
You can use the wildcard * in CSS Locators too! In fact, we can use it in a similar way, when we want to ignore the tag type. For example:

- The CSS Locator string '*' selects all elements in the HTML document.
- The CSS Locator string '*.class-1' selects all elements which belong to class-1, but this is unnecessary since the string '.class-1' will also do the same job.
- The CSS Locator string '*#uid' selects the element with id attribute equal to uid, but this is unnecessary since the string '#uid' will also do the same job.

In this exercise, we want you to work by analogy with the wildcard character you know from XPath notation to discover how to select all the children of a certain element in CSS Locator notation.

#### Instructions

- Assign to the variable css_locator a CSS Locator string which will select all children (regardless of tag-type) of the unique element in the HTML document that has its id attribute equal to uid.


In [32]:
# Create the CSS Locator to all children of the element whose id is uid
css_locator = '#uid > *'

### You've been 'href'ed
In a previous exercise, you created a CSS Locator string to select the hyperlink (a element) children of all div elements belonging to the class "course-block". Here we have created a SelectorList called course_as having selected those hyperlink children.

Now, we want you to fill in the blank below to extract the href attribute values from these elements. This is another example of chaining, as we've seen in a previous exercise.

The point here is that we can chain together calls to the methods css and xpath, and combine them! We help nudge you in the correct direction by giving you the solution if we chain with another call to the css method.

#### Instructions

- Set up the Selector object sel using the string html as the text input.
- Assign to the variable hrefs_from_xpath the href attribute values from the elements in course_as. Your solution should match hrefs_from_css!

In [33]:
# Create a selector object from a secret website
sel = Selector( text = html )

# Select all hyperlinks of div elements belonging to class "course-block"
course_as = sel.css( 'div.course-block > a' )

# Selecting all href attributes chaining with css
hrefs_from_css = course_as.css( '::attr(href)' )

# Selecting all href attributes chaining with xpath
hrefs_from_xpath = course_as.xpath( './@href' )

### Top Level Text
This exercise will have you write an XPath string and a CSS Locator string to direct to the text of a specific paragraph p element. The p element in the HTML is uniquely defined by its id attribute, which is "p3". With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable html with a string containing the HTML to which this element belongs, if you want to peruse it.

In this exercise, you will only be selecting the text within the element, which does not include the text in future generations of the element. We have created a function print_results for you to compare which elements your strings direct to.

#### Instructions

- Assign to the variable xpath an XPath string directing to the text within the paragraph p element with id equal to p3, which does not include the text of future generations of this p element.
- Assign to the variable css_locator a CSS Locator string directing to this same text.

In [34]:
from bs4 import BeautifulSoup
# soup = BeautifulSoup(html, 'html.parser')

In [35]:
html = '<html><body><div id="this-div"><p id="p1" class="class-1">This is not the element you are looking for</p><p id="p2" class="class-12"><a href="https://www.google.com">Google</a> is linked to here, but this isn\'t the link you are looking for. </p><p id="p3" class="class-1 class-12">Here is the <a href="https://www.datacamp.com" id="a-exercise">DataCamp</a> link you want!</p></div></body></html>'
# print(soup.prettify())
def our_xpath( xpath ):
  xextr = sel.xpath( xpath ).extract()
  return xextr

def our_css( css ):
  cextr = sel.css( css ).extract()
  return cextr

def print_results( xpath, css_locator ):
  print( "Your XPath extracts to following:")
  print( our_xpath(xpath) )
  print("_________________\n")
  print( "Your CSS Locator extracts the following:")
  print( our_css(css_locator) )
  return None

In [36]:
sel = Selector(text = html)
# Create an XPath string to the desired text.
xpath = '//p[@id = "p3"]/text()  '

# Create a CSS Locator string to the desired text.
css_locator = 'p#p3::text'

# Print the text from our selections
print_results( xpath, css_locator )

Your XPath extracts to following:
['Here is the ', ' link you want!']
_________________

Your CSS Locator extracts the following:
['Here is the ', ' link you want!']


### All Level Text
This exercise is similar to the previous, but differs in that you will be selecting text from multiple generations of a given element.

You will write an XPath string and a CSS Locator string to direct to the text of a specific paragraph p element. The p element in the HTML is uniquely defined by its id attribute, which is "p3". With this small piece of information, you should be able to create the desired strings; however, we have preloaded the variable html with a string containing the HTML to which this element belongs, if you want to peruse it.

In this exercise, you will select the text within the element together with all the text within its future generations. We have created a function print_results for you to compare which elements your strings direct to.

#### Instructions

- Assign to the variable xpath an XPath string directing to the text within the paragraph p element with id equal to p3, which includes the text of future generations of this p element.
- Assign to the variable css_locator a CSS Locator string directing to this same text.

In [37]:
# Create an XPath string to the desired text.
xpath = '//p[@id = "p3"]//text() '

# Create a CSS Locator string to the desired text.
css_locator = 'p#p3 ::text'

# Print the text from our selections
print_results( xpath, css_locator )

Your XPath extracts to following:
['Here is the ', 'DataCamp', ' link you want!']
_________________

Your CSS Locator extracts the following:
['Here is the ', 'DataCamp', ' link you want!']


### Reveal By Response
We have pre-loaded a Response object, named response, with the content from a secret website. Your job is to figure out the URL and the title of the website using the response variable. You learned how to find the URL in the last lesson. To find the website title, what you need to know is:

- The title is the text from the title element
- The title element is a child of the head element, which is a child of the html root element.

To note: the html root element only has one child head element, and the head element only has one child title element.

#### Instructions

- Assign to the variable this_url the URL used to load the response variable.
- Assign to the variable this_title the title of the website used to load the response variable. Since we only want the text from the single element we will select, we use the extract_first() method to extract the text.
- Regardless of whether you use xpath or css, make sure that you are selecting the text within the title element, and not just the title itself.


In [38]:
from scrapy.http import TextResponse
def print_url_title( url, title ):
  print( "Here is what you found:" )
  print( "\t-URL: %s" % url )
  print( "\t-Title: %s" % title )

In [39]:
response = TextResponse('https://www.datacamp.com/courses/all')

In [40]:
# Get the URL to the website loaded in response
this_url = response.url

# Get the title of the website loaded in response
this_title = response.xpath('//head/title/text()').extract_first()

# Print out our findings
print_url_title( this_url, this_title )

Here is what you found:
	-URL: https://www.datacamp.com/courses/all
	-Title: None


### Responding with Selectors
Something that we should emphasize at this point about the relationship between Selector and Response objects is that both return a SelectorList when using the xpath or css methods to direct to elements. In this exercise, we'll prove it to you, by having you find all hyperlink elements belonging to the class course-block__link (notice the double underscore!) and looking at the object that is produced when doing so.

Recall that to find an element by class, you can use a period (.). For example, div.class-2 selects all div elements belonging to class-2.

We have pre-loaded both a Response object named response and a Selector object named sel with the content from the same "secret" website. Once you complete the task of creating a CSS Locator, you will compare the output from response.css and sel.css to see that they are effectively the same!

#### Instructions

- Assign to the variable css_locator a CSS Locator string which directs to all hyperlink a elements belonging to the class course-block__link.
- Assign to the variable response_as the output of passing the css_locator variable to the css method in response.
- Assign to the variable sel_as the output of passing the css_locator variable to the css method in sel.

In [41]:
sel = Selector(text = html1)

# Create a CSS Locator string to the desired hyperlink elements
css_locator = 'a.course-block__link'

# Select the hyperlink elements from response and sel
response_as = response.css( css_locator )
sel_as = sel.css( css_locator )

# Examine similarity
nr = len( response_as )
ns = len( sel_as )
for i in range( min(nr, ns, 2) ):
  print( "Element %d from response: %s" % (i+1, response_as[i]) )
  print( "Element %d from sel: %s" % (i+1, sel_as[i]) )
  print( "" )

### Selecting from a Selection
In this exercise, you will find the text from an h4 element within a particular div element. It will occur in steps where the first step is selecting a family of div elements, and the second step is narrowing in on the first one, from which we will grab the h4 element text. This process of progressively narrowing in on elements (e.g., first to the div elements, then to the h4 element) is another example of "chaining", even if it doesn't look exactly the same as we've seen it before.

Along the way in this exercise, there is a variable first_div set up for you to use. Think carefully about what type of object first_div is!

#### Instructions 

- Assign to the variable divs a SelectorList which selects all div elements belonging to the class course-block.
- Assign to the variable h4_text the text from the only h4 element within the content selected in first_div. Since we only want the text from the single element we will select, we use the extract_first() method to extract the text.

### Titular
Similar to the work given in the previous lesson, we will have you use a pre-loaded Response object, named response, to scrape the course titles from the (shortened version of the) DataCamp course directory https://www.datacamp.com/courses/all. To successfully do so, you only need to know the following:

- The course titles are the text from all the h4 elements within the HTML document.

We ask you to extract these course titles here.

#### Instructions

- Using response, assign to the variable crs_title_els a SelectorList of the selected course titles.
- Assign to the variable crs_titles a list created by extracting the course titles from crs_title_els.


In [42]:
# Create a SelectorList of the course titles
crs_title_els = response.xpath( '//h4/text()' )

# Extract the course titles 
crs_titles = crs_title_els.extract()

# Print out the course titles 
for el in crs_titles:
  print( ">>", el )

### Scraping with Children
We did a cute trick in the lesson to calculate how many children there were of one of the div elements belonging to the class course-block. Here we ask you to find the number of children of a mystery element (already stored within a Selector object, so you can use the xpath or css method).

To be explicit, we have created the Selector object mystery in the following way:

- We first loaded a Response variable using a secret website as the input.
- Then we used a call to the xpath method to create a SelectorList of elements (but we won't say which ones).
- Finally, we let mystery be the first Selector object of this SelectorList.

#### Instructions

- Fill in the blank below to chain on a call to xpath so that we can calculate the number of children of the mystery element; we assign this number to the variable how_many_kids.

- Remember, if you use xpath, this really is an instance of chaining, so don't forget to use a period (.) as glue.

In [43]:
sel = Selector(text = html)
mystery = sel.xpath('//body')

In [44]:
# Calculate the number of children of the mystery element
how_many_kids = len( mystery.xpath( './*' ) )

# Print out the number
print( "The number of elements you selected was:", how_many_kids )

The number of elements you selected was: 1


# 4. Spiders

Learn to create web crawlers with scrapy. These scrapy spiders will crawl the web through multiple pages, following links to scrape each of those pages automatically according to the procedures we've learned in the previous chapters.

In [45]:
from scrapy.crawler import CrawlerProcess

### Inheriting the Spider
When learning about scrapy spiders, we saw that the main portion of the code for us to adjust is the class for the spider. To help build some familiarity with the class, you will finish a short piece of code that defines a toy model of the spider class. We've omitted the code that would actually run the spider, only including the pieces necessary to create the class.

As mentioned in the lesson, a class is roughly a collection of related variables and functions housed together. Sometimes one class likes to use methods from another class, and so we will inherit methods from a different class. That's what we do in the spider class.

We wrote the function inspect_class to look at your class once you're done, if you'd like to test your solution!

#### Instructions

- Pass scrapy.Spider as an argument to the class YourSpider; this will make it so that YourSpider inherits the methods from scrapy.Spider.

In [46]:
def inspect_class(c):
  newc = c()
  meths = dir(newc)
  if 'name' in meths:
    print("Your spider class name is:", newc.name)
  if 'from_crawler' in meths:
    print("It seems you have inherited methods from scrapy.Spider -- NICE!")
  else:
    print("Oh no! It doesn't seem that you are inheriting the methods from scrapy.Spider!!")

In [47]:

# Create the spider class
class YourSpider(scrapy.Spider):
  name = "your_spider"
  # start_requests method
  def start_requests(self):
    pass
  # parse method
  def parse(self, response):
    pass
  
# Inspect Your Class
inspect_class(YourSpider)

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!
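
Inheritance itself has nothing to do with scrapy, so it may help to see it in isolation. In this sketch, `Base` and `Child` are hypothetical names, with `Base` playing the role that scrapy.Spider plays above.

```python
class Base:
    """Stand-in for a parent class such as scrapy.Spider."""
    def greet(self):
        return "hello from " + self.name

class Child(Base):
    # Child defines only a name; the greet method comes from Base
    name = "child"

c = Child()
print(c.greet())
# The same kind of check inspect_class performs with dir()
print('greet' in dir(c))
```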


### Hurl the URLs
In the next lesson we will talk about the start_requests method within the spider class. In this quick exercise, we ask you to change around a variable within the start_requests method which foreshadows some of what we will be learning in the next lesson. Basically, we want you to start becoming comfortable turning some of the wheels within a spider class; in this case, making a list of urls within the start_requests method.

We've written a function inspect_class which will print out the list of elements you have in the urls variable within the start_requests method.

Note: in the next several exercises, you will write code to complete your spider class, but the code does not yet include the pieces to actually run the spider; that will come at the end.

#### Instructions

- Fill in the blank within the start_requests method to assign the variable urls a list with the two strings: "https://www.datacamp.com" and "https://scrapy.org".

In [48]:
# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    urls = ["https://www.datacamp.com","https://scrapy.org"]
    for url in urls:
      yield url
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!


### Self Referencing is Classy
You probably have noticed that within the spider class, we always input the argument self in the start_requests and parse methods (just look in the sample code in this exercise!). This allows us to reference between methods within the class. That is, if we want to refer to the method parse within the start_requests method, we would need to write self.parse rather than just parse; what writing self does is tell the code: "Look in the same class as start_requests for a method called parse to use."

In this exercise you will get a chance to play with this "self referencing".

#### Instructions

- Fill in the required scrapy object into the class YourSpider needed to create the scrapy spider.
- Pass the string argument "Hello World!" to fill in the blank in the start_requests method to use the print_msg method.

In [49]:
def square(self):
    return self**2

In [50]:
# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    self.print_msg( "Hello World!" )
  # parse method
  def parse( self, response ):
    pass
  # print_msg method
  def print_msg( self, msg ):
    print( "Calling start_requests in YourSpider prints out:", msg )
  
# Inspect Your Class
inspect_class( YourSpider )

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!
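
To see what goes wrong without self, here is a small sketch outside of scrapy. The class `Greeter` and its method names `with_self` and `without_self` are hypothetical; only `print_msg` mirrors the exercise.

```python
class Greeter:
    def with_self(self):
        # Looks up print_msg on this instance, so the method is found
        return self.print_msg("Hello World!")

    def without_self(self):
        # A bare name is not looked up in the class, so this raises NameError
        return print_msg("Hello World!")

    def print_msg(self, msg):
        return "message: " + msg

g = Greeter()
print(g.with_self())
try:
    g.without_self()
except NameError as err:
    print("NameError:", err)
```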


### Starting with Start Requests
In the last lesson we learned about setting up the start_requests method within a scrapy spider. Here we have another toy-model spider which doesn't actually scrape anything, but gives you a chance to play with the start_requests method. What we want is for you to start becoming familiar with the arguments you pass into the scrapy.Request call within start_requests.

As before, we have created the function inspect_class to examine what you are yielding in start_requests.

#### Instructions

- Fill in the required scrapy object into the class YourSpider needed to create the scrapy spider.
- Fill in the blank in the yielded scrapy.Request call within the start_requests method so that the URL this spider would start scraping is "https://www.datacamp.com" and would use the parse method (within the YourSpider class) as the method to parse the website.

In [51]:
# Create the spider class
class YourSpider( scrapy.Spider ):
  name = "your_spider"
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = "https://www.datacamp.com", callback = self.parse )
  # parse method
  def parse( self, response ):
    pass
  
# Inspect Your Class
inspect_class( YourSpider )

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!


### Pen Names
In this exercise, we have set up a spider class which, when finished, will retrieve the author names from a shortened version of the DataCamp course directory. The URL for the shortened version is stored in the variable url_short. Your job will be to create the list of extracted author names in the parse method of the spider.

Two things you should know:

- You will be using the response object and the css method here.
- The course author names are defined by the text within the paragraph p elements belonging to the class course-block__author-name.

You can inspect the spider using the function inspect_spider() that we built for you -- it will print out the author names you find!

Note that this and the remaining exercises in this chapter may take some time to load.

#### Instructions

- Fill in the required arguments to the parse method so that it will work as required when called in the start_requests method.
- Within the parse method, create a variable author_names, which is a list of strings created by extracting the text from the paragraph elements belonging to the class course-block__author-name.

In [52]:
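# Needed to download the page and wrap it in a scrapy response
import requests
from scrapy.http import TextResponse

def inspect_spider( s ):
  news = s()
  try:
    # Take the first Request yielded by the spider and fetch its URL by hand
    req = list( news.start_requests() )[0]
    url = req.url
    html = requests.get( url ).content
    response = TextResponse( url = url, body = html, encoding = 'utf-8' )
    # Call the spider's own callback (its parse method) on the response
    author_names = req.callback( response )
    print( 'You have collected the author names:')
    for a in author_names:
      print('\t-', a )
  except Exception:
    print( 'Oh no! Something went wrong with the code. Keep trying!')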

In [53]:
url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'
# Create the Spider class
class DCspider( scrapy.Spider ):
  name = 'dcspider'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  # parse method
  def parse( self, response ):
    # Create an extracted list of course author names
    author_names = response.css( 'p.course-block__author-name::text' ).extract()
    # Here we will just return the list of Authors
    return author_names
  
# Inspect the spider
inspect_spider( DCspider )

You have collected the author names:
	- Jonathan Cornelissen
	- Matt Dowle
	- Garrett Grolemund
	- Garrett Grolemund
	- Garrett Grolemund
	- Filip Schouwenaars
	- Gilles Inghelbrecht
	- Nick Carchedi
	- Filip Schouwenaars
	- Filip Schouwenaars
	- Mark Peterson


### Crawler Time
This will be your first chance to play with a spider which will crawl between sites (by first collecting links from one site, and following those links to parse new sites). This spider starts at the shortened DataCamp course directory, then extracts the links of the courses in the parse method; from there, it will follow those links to extract the course descriptions from each course page in the parse_descr method, and put these descriptions into the list course_descrs. Your job is to complete the code so that the spider runs as desired!

We have created a function inspect_spider which will print out one of the course descriptions you scrape (if done correctly)!

#### Instructions

- Fill in the two blanks below (one in each of the parsing methods) with the appropriate entries so that the spider can move from the first parsing method to the second correctly.

In [54]:
def inspect_spider( s ):
  news = s()
  try:
    req1 = list( news.start_requests() )[0]
    html1 = requests.get( req1.url ).content
    response1 = TextResponse( url = req1.url, body = html1, encoding = 'utf-8' )
    req2 = list( news.parse( response1 ) )[0]
    html2 = requests.get( req2.url ).content
    response2 = TextResponse( url = req2.url, body = html2, encoding = 'utf-8' )
    for d in news.parse_descr( response2 ):
      print("One course description you found is:", d )
      break
  except:
    print("Oh no! Something is wrong with the code. Keep trying!")

In [55]:
# Create the Spider class
class DCdescr( scrapy.Spider ):
  name = 'dcdescr'
  # start_requests method
  def start_requests( self ):
    yield scrapy.Request( url = url_short, callback = self.parse )
  
  # First parse method
  def parse( self, response ):
    links = response.css( 'div.course-block > a::attr(href)' ).extract()
    # Follow each of the extracted links
    for link in links:
      yield response.follow(url = link, callback = self.parse_descr)
      
  # Second parsing method
  def parse_descr( self, response ):
    # Extract course description
    course_descr = response.css( 'p.course__description::text' ).extract_first()
    # For now, just yield the course description
    yield course_descr


# Inspect the spider
inspect_spider( DCdescr )

One course description you found is: In this introduction to R, you will master the basics of this beautiful open source language, including factors, lists and data frames. With the knowledge gained in this course, you will be ready to undertake your first very own data analysis. With over 2 million users worldwide R is rapidly becoming the leading programming language in statistics and data science. Every year, the number of R users grows by 40% and an increasing number of organizations are using it in their day-to-day activities. Leverage the power of R by completing this free R online course today!


### Time to Run
In the last lesson, we went through creating an entire web-crawler to access course information from each course in the DataCamp course directory. However, the lesson seemed to stop without a climax, because we didn't play with the code after finishing the parsing methods.

The point of this exercise is to remedy that!

The code we give you to look at in this and the next exercise is long, because it's the entire spider that took us the lesson to create! However, don't be intimidated! The point of these two exercises is to give you a very easy task to complete, with the hope that you will look at and run the code for this spider. That way, even though it is long, you will have a grasp of it!

#### Instructions

- Fill in the one blank at the end of the parse_pages method to assign the chapter titles to the dictionary whose key is the corresponding course title.
- NOTE: If you hit Run Code, you must Reset to Sample Code to successfully use Run Code again!!



In [56]:
import warnings
warnings.filterwarnings('ignore')

def previewCourses( dc_dict, n = 3 ):
  crs_titles = list( dc_dict.keys() )
  print( "A preview of DataCamp Courses:")
  print("---------------------------------------\n")
  for t in crs_titles[:n]:
    print( "TITLE: %s" % t)
    for i,ct in enumerate(dc_dict[t]):
      print("\tChapter %d: %s" % (i+1,ct) )
    print("")
url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

In [57]:
import scrapy
# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ch_titles_ext

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

2020-05-03 14:26:39 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-05-03 14:26:39 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-05-03 14:26:39 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-03 14:26:39 [scrapy.crawler] INFO: Overridden settings:
{}
2020-05-03 14:26:39 [scrapy.extensions.telnet] INFO: Telnet Password: ccd59ad3f28cd7ca
2020-05-03 14:26:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-05-03 14:26:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrap

A preview of DataCamp Courses:
---------------------------------------

TITLE: Introduction to R
	Chapter 1: Intro to basics
	Chapter 2: Vectors
	Chapter 3: Matrices
	Chapter 4: Factors
	Chapter 5: Data frames
	Chapter 6: Lists

TITLE: Reporting with R Markdown
	Chapter 1: Authoring R Markdown Reports
	Chapter 2: Embedding Code
	Chapter 3: Compiling Reports
	Chapter 4: Configuring R Markdown (optional)

TITLE: Intermediate R - Practice
	Chapter 1: Conditionals and Control Flow
	Chapter 2: Loops
	Chapter 3: Functions
	Chapter 4: The apply family
	Chapter 5: Utilities



### DataCamp Descriptions
Like the previous exercise, the code here is long since you are working with an entire web-crawling spider! But again, don't let the amount of code intimidate you. You have a handle on how spiders work now, and you are perfectly capable of completing the easy task set for you here!

As in the previous exercise, we have created a function previewCourses which lets you preview the output of the spider, but you can always just explore the dictionary dc_dict too after you run the code.

In this exercise, you are asked to create a CSS Locator string that directs to the text of the course description. All you need to know is that from the course page, the course description text is within a paragraph p element which belongs to the class course__description (two underscores).

#### Instructions

- Fill in the one blank below in the parse_pages method with a CSS Locator string which directs to the text within the paragraph p element which belongs to the class course__description.


### Capstone Crawler
This exercise gives you a chance to show off what you've learned! In this exercise, you will write the parse function for a spider and then fill in a few blanks to finish off the spider. On the course directory page of DataCamp, each listed course has a title and a short course description. This spider will be used to scrape the course directory to extract the course titles and short course descriptions. You will not need to follow any links this time. Everything you need to know is:

- The course titles are defined by the text within an h4 element whose class contains the string block__title (double underscore).
- The short course descriptions are defined by the text within a paragraph p element whose class contains the string block__description (double underscore).

#### Instructions

- Fill in the four blanks below with the necessary entries to complete your spider.


