# Scraping Craigslist

### Introduction

In this lesson, we'll learn how to use Selenium to scrape Craigslist. As we'll see, Selenium lets us perform many (if not all) of the operations that we perform when navigating the web by hand. Let's get started.

> For the code below to work, we need Firefox installed on our computer. If Selenium complains that the geckodriver executable needs to be in PATH, [this Stack Overflow question](https://stackoverflow.com/questions/40208051/selenium-using-python-geckodriver-executable-needs-to-be-in-path) may help.

### Selecting Elements

Now if we visit the Craigslist page, we can see that there is a lot of useful information contained inside the `result-info` box.

> <img src="./result-info-pic.png" width="60%">

That is the box with text, and we can see that it contains the description, the price, and the number of bedrooms.

As promising as the information on the webpage looks, if we inspect the HTML, it actually contains more information.

> <img src="./info-html.png" width="80%">

Looking above, we can see the number of bedrooms, the square footage, the neighborhood, and the price, each in a separate tag.

So let's work to select our `result-info` boxes from the page.

### Loading up Selenium

In [6]:
from selenium import webdriver

driver = webdriver.Firefox()
craigslist_url = "https://newyork.craigslist.org/search/apa"
driver.get(craigslist_url)

> The main tool that we're using above is the webdriver. We initialize a Firefox webdriver object, and then make the request to Craigslist. This opens the Firefox browser to that page, if Firefox is installed.

After navigating to the webpage, we can select elements with the `find_elements` method. Passing `By.CSS_SELECTOR` as the first argument lets us select elements using our knowledge of CSS.

Taking another look at our result info box, we can see that we want to select each element on the page that has a `result-info` class.

> <img src="./result-info-pic.png" width="60%">

And we can select each of these elements with the following.

In [14]:
from selenium.webdriver.common.by import By
driver.implicitly_wait(1)

infos = driver.find_elements(By.CSS_SELECTOR, ".result-info")

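So above, we first set an implicit wait of one second, which tells the driver to keep retrying for up to one second when an element is not found right away, and then we find the `result-info` boxes.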
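To make the idea of an implicit wait more concrete, here is a rough sketch of the retry behavior in plain Python. It is not Selenium's actual implementation: the names `find_with_retry`, `lookup`, and `poll_interval` are made up for illustration, and a stand-in function takes the place of a real page lookup.

```python
import time


def find_with_retry(lookup, timeout=1.0, poll_interval=0.1):
    """Call lookup() repeatedly until it returns a non-empty list or timeout passes.

    Hypothetical helper: it only mimics the idea behind an implicit wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        results = lookup()
        if results:  # something was found, so stop waiting
            return results
        if time.monotonic() >= deadline:  # give up and return the empty list
            return results
        time.sleep(poll_interval)


# Stand-in for a page whose elements only appear on the third lookup.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return ["result-info box"] if attempts["count"] >= 3 else []


found = find_with_retry(fake_lookup, timeout=1.0, poll_interval=0.05)
print(found, attempts["count"])
```

The lookup returns as soon as something is found, so a generous timeout only costs time when the elements never show up.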

> We can see that infos now contains a list of WebElements.

In [16]:
infos[:2]

[<selenium.webdriver.remote.webelement.WebElement (session="868bb8c0-6599-414b-9078-df79cd094bec", element="35d15aa1-9ffe-494e-bc9a-7e7db1d644c0")>,
 <selenium.webdriver.remote.webelement.WebElement (session="868bb8c0-6599-414b-9078-df79cd094bec", element="39417e0c-b769-4b03-b2e1-af89c87eef31")>]

And each element contains some key information about the apartment listing.

> So let's just select the first WebElement.

In [18]:
first_info = infos[0]

We can see that this represents the first of our info boxes, and gives us access to the text, or the entire HTML, of that box.

> We can get the text with the following:

In [24]:
first_info.text

'Jan 4 *BRAND NEW 1BR APT*ASTORIA*3 BLK N!QUEEN LR!QUEEN BR!KIT!1/1/22 $1,800 1br - 650ft2 - (ASTORIA queens )'

> Or if we would like to see all of the HTML of just that element, we can do so by uncommenting the following.

> It's a lot of text, so comment it back out when you're done.

In [23]:
# first_info.get_attribute('innerHTML')

### Digging Deeper

Ok, so now that we have found the HTML elements that contain valuable information, we can extract the specific information that we would like.

For example, let's take another look at the HTML of each info-box.

<img src="./info-html.png" width="80%">

Let's try to just extract the information about the price.  Here's how we can do so.

In [28]:
price_info = first_info.find_element(By.CSS_SELECTOR, '.result-price')
price_info.text

'$1,800'

So just inside of our `first_info` box, we find the `result-price` element, and then read its `text` attribute to get the text.
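The price comes back as a string, so if we want to compare or sort prices we need a number. Here is a minimal sketch of one way to convert it, assuming prices always look like `'$1,800'`. The function name `parse_price` is our own and not part of Selenium.

```python
def parse_price(price_text):
    """Turn a price string such as '$1,800' into an integer number of dollars."""
    # Assumption: the text is a dollar sign followed by digits and commas.
    cleaned = price_text.replace('$', '').replace(',', '').strip()
    return int(cleaned)


price = parse_price('$1,800')
print(price)  # 1800
```

With a real element, we would call this as `parse_price(price_info.text)`.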

### Looping Through Elements

Once we know how to select information from one listing, we can loop through and select the information from multiple listings.

Here is all of the code.

In [31]:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
craigslist_url = "https://newyork.craigslist.org/search/apa"
driver.get(craigslist_url)

driver.implicitly_wait(1)  # retry element lookups for up to one second

infos = driver.find_elements(By.CSS_SELECTOR, ".result-info")
listings = []
for info in infos:
    date_el = info.find_element(By.CSS_SELECTOR, '.result-date')
    title_el = info.find_element(By.CSS_SELECTOR, '.result-title')
    hood_text = ''  # not every listing has a neighborhood or housing tag
    housing_text = ''
    hood_els = info.find_elements(By.CSS_SELECTOR, '.result-hood')
    housing_els = info.find_elements(By.CSS_SELECTOR, '.housing')
    if len(hood_els) > 0:
        hood_text = hood_els[0].text
    if len(housing_els) > 0:
        housing_text = housing_els[0].text
    listing_ob = {'date': date_el.text, 'title': title_el.text,
                  'hood': hood_text, 'href': title_el.get_property('href'),
                  'housing': housing_text}
    listings.append(listing_ob)
driver.quit()  # close the browser and end the session

From there, we can see the information that we collected.

In [32]:
listings[:2]

[{'date': 'Jan 4',
  'title': 'SPACIOUS LAYOUT--FREE CABLE--LUXURY LIVING--',
  'hood': '(Financial District)',
  'href': 'https://newyork.craigslist.org/mnh/apa/d/new-york-spacious-layout-free-cable/7428564709.html',
  'housing': '1br - 650ft2 -'},
 {'date': 'Jan 4',
  'title': 'RENT STABILIZED--SWIMMING POOL --LUXURY LIVING --NO BROKER FEE',
  'hood': '(Financial District)',
  'href': 'https://newyork.craigslist.org/mnh/apa/d/new-york-rent-stabilized-swimming-pool/7428564207.html',
  'housing': '3br -'}]
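The `housing` value still packs the bedroom count and square footage into one string. Below is a small sketch of how we might split it apart with regular expressions, assuming the text follows the `'1br - 650ft2 -'` pattern seen above. The helper name `parse_housing` and the dictionary keys are our own choices.

```python
import re


def parse_housing(housing_text):
    """Extract bedrooms and square feet from text like '1br - 650ft2 -'.

    Returns a dict; a value is None when that piece is missing.
    """
    bedrooms = re.search(r'(\d+)\s*br', housing_text)
    square_feet = re.search(r'(\d+)\s*ft2', housing_text)
    return {
        'bedrooms': int(bedrooms.group(1)) if bedrooms else None,
        'square_feet': int(square_feet.group(1)) if square_feet else None,
    }


print(parse_housing('1br - 650ft2 -'))  # {'bedrooms': 1, 'square_feet': 650}
print(parse_housing('3br -'))           # {'bedrooms': 3, 'square_feet': None}
```

Returning `None` for a missing piece keeps listings without square footage from raising an error, at the cost of having to check for `None` later.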

### Bonus Points 

Here are some other things that we can do with Selenium. We can click on a selected element with the `.click()` method.

We can even fill out a form on a webpage by selecting elements and then using the `send_keys` method.

In [190]:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
craigslist_url = "https://newyork.craigslist.org/search/apa"
driver.get(craigslist_url)

query_box = driver.find_element(By.CLASS_NAME, 'querybox')  # the old find_elements_by_* helpers were removed in Selenium 4.3
search_box = query_box.find_element(By.TAG_NAME, 'input')
search_box.send_keys('3 br')  # type the query into the search box

query_box.find_element(By.CLASS_NAME, 'searchbtn').click()  # submit the search

### Summary

In this lesson, we saw how to use Selenium to scrape information from webpages.  We started by navigating to a webpage with the following: 
```python
driver = webdriver.Firefox()
driver.get("https://newyork.craigslist.org/search/apa")
```

Then we selected elements with a call to:
    
```python
infos = driver.find_elements(By.CSS_SELECTOR, ".result-info")
```

From there, we saw that we could select child elements by calling `find_element` or `find_elements` on a selected element, as in `first_info.find_element(By.CSS_SELECTOR, '.result-price')`.

Finally, we saw that we can fill out forms using Selenium with calls to `send_keys('text')` and `selected_element.click()`.

[Geckodriver](https://stackoverflow.com/questions/40208051/selenium-using-python-geckodriver-executable-needs-to-be-in-path)