## Challenge

Do a little scraping or API-calling of your own.  Pick a new website and see what you can get out of it.  Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.  

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  

Once you have your data, compute statistical summaries and/or visualizations that give you new insight into your topic of interest.  Write up a report that runs from the scraping code through to the summary, and share it with your mentor.
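As a minimal sketch of the summary step, the snippet below turns price strings into numbers and computes a median price and a mean price per square metre. The helper name `parse_brl` and the three sample listings are illustrative assumptions, and the parsing assumes prices formatted like `R$299.000`, with `.` as a thousands separator and no centavos.

```python
import re
import statistics

def parse_brl(text):
    """Convert a price string such as 'R$299.000' to an integer number of reais.

    Assumes '.' is a thousands separator and that no centavos are present.
    Returns None when the value is missing or contains no digits.
    """
    if text is None:
        return None
    digits = re.sub(r"\D", "", text)  # keep digits only
    return int(digits) if digits else None

# Illustrative (cost, area in m²) pairs in the style a listing scraper might return.
listings = [("R$299.000", "35"), ("R$379.000", "50"), ("R$795.000", "96")]

prices = [parse_brl(cost) for cost, _ in listings]
price_per_m2 = [parse_brl(cost) / int(area) for cost, area in listings]

median_price = statistics.median(prices)
mean_price_per_m2 = statistics.mean(price_per_m2)
print(median_price, round(mean_price_per_m2, 2))
```

The same helper could be applied to a DataFrame column with `Series.map` before calling `describe()` or plotting.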

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import pandas as pd
pd.set_option('display.max_colwidth', 1000)

import scrapy
import re
from scrapy.crawler import CrawlerProcess

In [2]:
class VRSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "VRS"
    
    # URL(s) to start with.
    start_urls = [
        'https://www.vivareal.com.br/venda/sp/sao-paulo/centro/santa-cecilia/apartamento_residencial/#area-desde=100&onde=BR-Sao_Paulo-NULL-Sao_Paulo-Centro-Santa_Cecilia&preco-ate=500000&preco-desde=300000&tipos=apartamento_residencial',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every <article> element (one listing card) on the page.
        for card in response.xpath('//article'):
            
            # Yield a dictionary with the values we want.
            yield {
                'title': card.xpath('h2/a[@class="property-card__title js-cardLink js-card-title"]/text()').extract_first(),
                'cost': card.xpath('section[@class="property-card__values"]/div/text()').extract_first(),
                'condo': card.xpath('section/footer/div/strong[@class="js-condo-price"]/text()').extract_first(),
                'area': card.xpath('ul/li[@class="property-card__detail-item property-card__detail-area"]/span/text()').extract_first(),
                'rooms': card.xpath('ul/li[@class="property-card__detail-item property-card__detail-room js-property-detail-rooms"]/span/text()').extract_first(),
                'suites': card.xpath('ul/li[@class="property-card__detail-item property-card__detail-item-extra js-property-detail-suites"]/span/text()').extract_first(),
                'bathrooms': card.xpath('ul/li[@class="property-card__detail-item property-card__detail-bathroom js-property-detail-bathroom"]/span/text()').extract_first(),
                'garages': card.xpath('ul/li[@class="property-card__detail-item property-card__detail-garage js-property-detail-garages"]/span/text()').extract_first(),
                'link': card.xpath('h2/a/@href').extract_first(),
            }
        # Get the URL of the next page, if there is one.
        next_page = response.xpath('//li/a[@title="Próxima página"]/@href').extract_first()
        
        # The last page has no "next" link, so stop here rather than build a
        # request from a missing link.
        if next_page is not None:
            # Resolve a relative link against the current page's URL.
            next_page = response.urljoin(next_page)
            
            # Request the next page and parse it the same way we did above.
            yield scrapy.Request(next_page, callback=self.parse)

# Tell the script how to run the crawler by passing in settings.
# The new settings have to do with scraping etiquette.          
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'data.json',       # Name our storage file.
    'LOG_ENABLED': False,          # Turn off logging for now.
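    'ROBOTSTXT_OBEY': True,        # Respect the site's robots.txt rules.
    'USER_AGENT': 'bada',          # Identify our crawler to the server.
    'AUTOTHROTTLE_ENABLED': True,  # Adjust crawl speed to server response times.
    'HTTPCACHE_ENABLED': False     # Fetch fresh pages rather than cached copies.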
})

# Start the crawler with our spider.
process.crawl(VRSpider)
process.start()
print('Success!')

Success!
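One thing worth checking in the start URL is that the search filters (`area-desde`, `preco-desde`, `preco-ate`) sit after a `#`. That part of a URL is the fragment, and it is not sent to the server as part of an HTTP request. If the site applies those filters in client-side JavaScript, the spider would receive unfiltered listings, which may be why the rows printed further down include areas under 100 m² and a price of R$795.000. The sketch below uses only the standard library; the URL is a shortened copy of the one in `start_urls`.

```python
from urllib.parse import urldefrag, parse_qs

# Shortened copy of the spider's start URL (some fragment parameters omitted).
url = (
    "https://www.vivareal.com.br/venda/sp/sao-paulo/centro/santa-cecilia/"
    "apartamento_residencial/#area-desde=100&preco-ate=500000&preco-desde=300000"
)

# Split the URL into the part that is requested and the fragment that is not.
requested, fragment = urldefrag(url)

# The fragment happens to look like a query string, so parse it the same way.
client_side_filters = parse_qs(fragment)

print(requested)
print(client_side_filters)
```

If the filters are indeed ignored by the server, one option is to apply the same bounds to the scraped data afterwards.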


In [3]:
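# Load the scraped records; the shape printed below shows whether we got data from all pages.
VRSdf = pd.read_json('data.json', orient='records')

# Remove unnecessary newlines and whitespace
VRSdf.replace(r'\n', '', regex=True, inplace=True)
# Strip whitespace from every column except the last one (title in the output below).
# Assign the result back: inplace=True on an .iloc slice only changes a temporary copy.
VRSdf.iloc[:, :-1] = VRSdf.iloc[:, :-1].replace(r'\s', '', regex=True)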

print(VRSdf.shape)
VRSdf.head()

(72, 9)


|   | area | bathrooms | condo | cost | garages | link | rooms | suites | title |
|---|------|-----------|-------|------|---------|------|-------|--------|-------|
| 0 | 35 | 1 | R$450 | R$299.000 | 1 | /imovel/apartamento-1-quartos-santa-cecilia-centro-sao-paulo-com-garagem-35m2-venda-RS299000-id-93569152/ | 1 | -- | Apartamento com Quarto à Venda, 35m² |
| 1 | 374-605 | 4-6 |  |  | 4 | /imoveis-lancamento/itacolomi-445-4795/ | 4 |  | Itacolomi 445 |
| 2 | 35 | 1 | R$523 | R$299.000 | 1 | /imovel/apartamento-1-quartos-santa-cecilia-centro-sao-paulo-com-garagem-35m2-venda-RS299000-id-93668578/ | 1 | 1 | Apartamento com Quarto à Venda, 35m² |
| 3 | 50 | 1 | R$585 | R$379.000 | 1 | /imovel/apartamento-1-quartos-santa-cecilia-centro-sao-paulo-com-garagem-50m2-venda-RS379000-id-1038599412/ | 1 | 1 | Apartamento com Quarto à Venda, 50m² |
| 4 | 96 | 1 | R$787 | R$795.000 | -- | /imovel/apartamento-2-quartos-santa-cecilia-centro-sao-paulo-96m2-venda-RS795000-id-95047049/ | 2 | 1 | Apartamento com 2 Quartos à Venda, 96m² |
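The sample rows show three kinds of value in the numeric columns: plain numbers (`35`), ranges for new developments (`374-605`), and a `--` placeholder. Before computing summaries, these need a common representation. The sketch below is one possible approach, assuming the values arrive as strings or `None`; the helper name `parse_range` is hypothetical, and the sample values are copied from the `area` and `garages` columns above.

```python
def parse_range(text):
    """Return (low, high) integers for values like '35' or '374-605'.

    Returns None for missing values: None, an empty string, or '--'.
    Assumes the input is a string or None.
    """
    if text is None:
        return None
    text = str(text).strip()
    if text in ("", "--"):
        return None
    # partition() yields an empty 'high' when there is no hyphen.
    low, _, high = text.partition("-")
    return (int(low), int(high or low))

# Values taken from the sample output above.
areas = ["35", "374-605", "35", "50", "96"]
garages = ["1", "4", "1", "1", "--"]

parsed_areas = [parse_range(a) for a in areas]
parsed_garages = [parse_range(g) for g in garages]

# Keep only listings whose smallest unit is at least 100 m².
large = [r for r in parsed_areas if r is not None and r[0] >= 100]
print(parsed_areas, parsed_garages, large)
```

Representing every value as a `(low, high)` pair keeps single listings and developments comparable; whether to filter on the lower or upper bound is a judgment call for the analysis.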
