## Challenge

Do a little scraping or API-calling of your own.  Pick a new website and see what you can get out of it.  Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.  

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  

Once you have your data, compute some statistical summaries and/or visualizations that give you new insights into your scraping topic of interest.  Write up a report that covers everything from the scraping code to the summary, and share it with your mentor.

In [1]:
# I will be scraping Portland's Craigslist for housing prices.

# Importing in each cell because of the kernel restarts.
import scrapy
from scrapy.crawler import CrawlerProcess


class CLSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "CL"
    
    # URL(s) to start with.
    allowed_domains = ["portland.craigslist.org"]
    start_urls = [
        'https://portland.craigslist.org/search/hhh?lang=en&cc=us&query=palermo&availabilityMode=0&sale_date=all+dates',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every <p> element on the page.
        for posting in response.xpath('//p'):
            
            # Yield a dictionary with the values we want.
            yield {
                # XPath expressions, relative to the <p> element, for the fields to extract.
                # Modify these expressions to pull other information from the site.
                'title': posting.xpath('a[@class="result-title hdrlnk"]/text()').extract_first(),
                'date': posting.xpath('time[@class="result-date"]/text()').extract_first(),
                'price': posting.xpath('span/span[@class="result-price"]/text()').extract_first()
            }
        
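        # Follow the "next" link, if there is one, so that every results page is scraped.
        next_page_relative_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        if next_page_relative_url:
            next_page_absolute_url = response.urljoin(next_page_relative_url)
            yield scrapy.Request(next_page_absolute_url, callback=self.parse)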
        

# Tell the script how to run the crawler by passing in settings.
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'clpdx.json',  # Name our storage file.
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'ThinkfulDataScienceBootcamp (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,           # Turn off logging for now.
    # We use CLOSESPIDER_PAGECOUNT to limit our scraper to the first 10 pages.
    'CLOSESPIDER_PAGECOUNT' : 10
})

# Start the crawler with our spider.
process.crawl(CLSpider)
process.start()
print('Success!')

Success!
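
The XPath expressions used in the spider can be tried offline before running a crawl. The sketch below is only an approximation: it uses the standard library's `xml.etree.ElementTree`, which supports a small subset of XPath, as a stand-in for Scrapy's selectors, and the `SAMPLE` markup is a hypothetical snippet modeled on the class names in the spider, not a copy of the real page. `extract_rows` is likewise a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup that mimics the structure the spider expects.
SAMPLE = """
<ul>
  <li class="result-row">
    <p class="result-info">
      <time class="result-date">Apr 12</time>
      <a class="result-title hdrlnk" href="#">Listing A</a>
      <span class="result-meta"><span class="result-price">$1403</span></span>
    </p>
  </li>
  <li class="result-row">
    <p class="result-info">
      <time class="result-date">Apr 17</time>
      <a class="result-title hdrlnk" href="#">Listing B</a>
      <span class="result-meta"></span>
    </p>
  </li>
</ul>
"""


def extract_rows(markup):
    """Return one dict per <p> element, using the same relative paths as the spider."""
    root = ET.fromstring(markup)
    rows = []
    for posting in root.iter('p'):
        rows.append({
            # findtext returns None when the path matches nothing,
            # much like extract_first() does in Scrapy.
            'title': posting.findtext("a[@class='result-title hdrlnk']"),
            'date': posting.findtext("time[@class='result-date']"),
            'price': posting.findtext("span/span[@class='result-price']"),
        })
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

The second listing has no price element, so its `price` comes back as `None`. This is why the later cleaning step has to handle missing values.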


Now we will create a dataframe to analyze the data:

In [2]:
import pandas as pd

# Turning JSON into Data Frame
portland = pd.read_json('clpdx.json')
print(portland.shape)
portland.head()

(7, 3)


|   | date | price | title |
|---|------|-------|-------|
| 0 | Apr 12 | $1403 | Carve Out a Great Life At Palermo This Spring! |
| 1 | Apr 17 | $1451 | Beautiful One Bedroom with Storage, Full Size ... |
| 2 | Apr 17 | $1619 | Reserved Covered Parking, Master Suite, 24-hou... |
| 3 | Apr 16 | $1413 | Fire Up The Grill On The Patio Of Your New Home! |
| 4 | Apr 16 | $1602 | Live The Lakeland Hills Lifestyle |


For some reason I can't get it to scrape more than 7 entries. The `query=palermo` filter in the start URL probably narrows the search to just these listings.
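
When a scrape returns fewer rows than expected, it helps to check which filters the search URL carries. The sketch below parses the query string of the start URL with the standard library and builds a variant without the `query` parameter. `without_param` is a hypothetical helper, and whether the broader URL actually returns more listings is an assumption that would need to be checked against the site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# The start URL used by the spider.
START_URL = ('https://portland.craigslist.org/search/hhh'
             '?lang=en&cc=us&query=palermo&availabilityMode=0&sale_date=all+dates')


def without_param(url, name):
    """Return `url` with the query parameter `name` removed."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    params.pop(name, None)
    # doseq=True writes each list value back as a separate key=value pair.
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))


params = parse_qs(urlsplit(START_URL).query)
print(params['query'])                       # the search term restricting the results
broader_url = without_param(START_URL, 'query')
print(broader_url)
```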

In [4]:
import re

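# Strip the dollar sign (and any thousands separator) and convert to int; keep missing prices as None.
portland.price = portland.price.map(lambda x: None if x is None else int(re.sub(r'[$,]', '', str(x))))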
portland.head()

|   | date | price | title |
|---|------|-------|-------|
| 0 | Apr 12 | 1403 | Carve Out a Great Life At Palermo This Spring! |
| 1 | Apr 17 | 1451 | Beautiful One Bedroom with Storage, Full Size ... |
| 2 | Apr 17 | 1619 | Reserved Covered Parking, Master Suite, 24-hou... |
| 3 | Apr 16 | 1413 | Fire Up The Grill On The Patio Of Your New Home! |
| 4 | Apr 16 | 1602 | Live The Lakeland Hills Lifestyle |
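
The challenge also asks for statistical summaries. As a minimal sketch, the block below cleans and summarizes only the five prices shown in the table above, using the standard library so that it runs without the scraped file. The trailing `None` is a hypothetical listing without a price, and `parse_price` is an illustrative helper name. On the full DataFrame, `portland.price.describe()` should give a comparable summary.

```python
import re
import statistics

# Price strings from the five rows displayed above; None stands in for a
# hypothetical listing that has no price.
raw_prices = ['$1403', '$1451', '$1619', '$1413', '$1602', None]


def parse_price(value):
    """Convert a string such as '$1,403' to an int; pass None through unchanged."""
    if value is None:
        return None
    return int(re.sub(r'[$,]', '', value))


# Drop missing prices before computing the statistics.
prices = [p for p in map(parse_price, raw_prices) if p is not None]

summary = {
    'count': len(prices),
    'min': min(prices),
    'median': statistics.median(prices),
    'mean': statistics.mean(prices),
    'max': max(prices),
}
print(summary)
```

With only five rows this says little about Portland rents in general. It mainly confirms that the cleaned prices are numeric and fall in a narrow range, from 1403 to 1619.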
