# Big idea: scraping Redfin for house price data
## Inputs from search page:
- Neighborhood - str, categorical
- **Price (Target)** - int
- Beds - float-ish
- Baths -float-ish
- SqFt - int
- How long on Redfin - int
## Inputs from individual listing pages:
- Style e.g. townhouse duplex condo etc. - str, categorical
- Lot size - int
- Year built - int
- Buyer's Brokerage Compensation - float percent
## Output:
Given the totally fixed (neighborhood, style, lot size, year built) and mostly fixed (beds, baths, sqft), what price should you ask and what brokerage comp should you set?
There are really two short-term useage cases (what should you set for the fastest sale and what should you set for highest price?) and longer-term useage cases (what's your ROI on various improvements given the unchangeables?)

# Step one: scraping the search page.

In [2]:
from bs4 import BeautifulSoup
import requests
import random
import re
from fake_useragent import UserAgent
import time, os
import sys
import pickle
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
chromedriver = '/Applications/chromedriver'
os.environ["webdriver.chrome.driver"] = chromedriver

In [None]:
searchpage_url = 'https://www.redfin.com/city/16163/WA/Seattle'
response = requests.get(searchpage_url)
response.status_code

We get a 403 (Forbidden) response code. Since I don't know what part of my process is ticking off Redfin, I'm going to Google the problem ('scraping Redfin') and see what's out there - basically a lot of ads for getting somebody else to do the scraping and a few basic blog posts on specifying fake useragents.

So we're going to need to at a minimum set up useragents and try again.

In [None]:
user_agent = {'User-agent': 'Mozilla/5.0'}
response  = requests.get(searchpage_url, headers = user_agent)

In [None]:
response.status_code

OK, this WORKED! Which means that at least in theory I can use headers and pauses to keep Redfin from locking me out.

Next step: can I scrape from the searchpage the things I want to scrape? Will need to add the urls for individual listings as well; address should work as unique identifier?

In [None]:
searchpage_soup = BeautifulSoup(response.content, 'html.parser')

Was consistently getting "TypeError: object of type 'Response' has no len()" errors; changing 'response' to "response.content, 'html.parser'" fixed it for unknown reasons.
According to https://stackoverflow.com/a/50324754/2880512, the Response object is neither a string nor a filename, so it's necessary to specify what gets passed. .text, .content, and .raw may all potentially work.

Better answer may be that you need to pass BeautifulSoup the content, not the request, e.g. soup = bs4.BeautifulSoup(html.text); this obviously does not work in the current configuration because of the need for useragents.

In [None]:
print(searchpage_soup.prettify())

OK, what was ladled up was not what I was seeing in Chrome. Seems like this is one of those dynamically generated pages that I need Selenium to handle.

In [8]:
# Setup part the second (BS set up in first cell)
import time, os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
chromedriver = '/Applications/chromedriver'
os.environ["webdriver.chrome.driver"] = chromedriver

In [None]:
searchpage_url = 'https://www.redfin.com/city/16163/WA/Seattle'
driver = webdriver.Chrome(chromedriver)
driver.get(searchpage_url)
searchpage_soup = BeautifulSoup(driver.page_source, 'html.parser')

# Parsing the searchpage

In [None]:
print(searchpage_soup.prettify())

In [None]:
# This is getting us the list of RELATIVE links to listings on the page, 20 at a go
for link in searchpage_soup.find_all('a', class_ = 'slider-item hidden'):
    print(link.get('href'))

In [None]:
# Gets home price from searchpage as int by stripping out leading $ and internal comma(s)
int(searchpage_soup.find('span', class_ = 'homecardV2Price').text[1:].replace(',', ''))

In [None]:
sp_bed_bath_sqft = searchpage_soup.find('div', class_ = 'HomeStatsV2 font-size-small').text.split()

In [None]:
# gets beds from HomeStatsV2
float(sp_bed_bath_sqft[0])
# gets baths
float(sp_bed_bath_sqft[1][4:])
# gets sqft
int(sp_bed_bath_sqft[2][5:].replace(',', ''))

In [None]:
searchpage_soup.find('div', class_ = 'homeAddressV2').text

# Main searchpage isn't cutting it; is it easier to scrape the individual pages?

Putting a pin in it: I'm going to have to figure out scrolling via selenium so I can get a full list of URLs; grabbing these at semi-random intervals overnight may be a use case for fake_useragent, but I'm going to shoot for having it ready to run by Thursday night without so I can try to figure out install if needed.

Looking for:
- Address (can strip directly from URL or 
``` <span data-rf-test-id="abp-streetLine" class="street-address" title="1119 N 85th St Unit A">1119 N 85th St Unit A </span>)```
- Neighborhood
```<div class="keyDetail font-weight-roman font-size-base"><span class="header font-color-gray-light inline-block">Community</span><span class="content text-right">Green Lake</span></div>```
- Price
```<div class="info-block price" data-rf-test-id="abp-price"><div class="statsValue"><div><span>$</span><span>445,000</span></div></div><span class="statsLabel">Price</span></div>)```
- Beds
```<div class="info-block" data-rf-test-id="abp-beds"><div class="statsValue">2</div><span class="statsLabel">Beds</span></div>```
- Baths
```<div class="info-block" data-rf-test-id="abp-baths"><div class="statsValue">1.75</div><span class="statsLabel">Baths</span></div>```
- Sq ft
```<div class="info-block sqft" data-rf-test-id="abp-sqFt"><span><span class="statsValue">1,000</span> <span class="sqft-label">Sq. Ft.</span><div class="statsLabel" data-rf-test-id="abp-priceperft">$445 / Sq. Ft.</div></span></div>```
- How long on Redfin
```<span><span class="label">On Redfin: </span><span class="value">6 days</span></span>```
- Type
```<div class="keyDetail font-weight-roman font-size-base"><span class="header font-color-gray-light inline-block">Style</span><span class="content text-right">Townhouse</span></div>```
- Lot size
```<div class="keyDetail font-weight-roman font-size-base"><span class="header font-color-gray-light inline-block">Lot Size</span><span class="content text-right">1,000 Sq. Ft.</span><div class="keyDetail font-weight-roman font-size-base"><span class="header font-color-gray-light inline-block">Style</span><span class="content text-right">Townhouse</span></div></div>```
- Year built
```<div class="keyDetail font-weight-roman font-size-base"><span class="header font-color-gray-light inline-block">Year Built</span><span class="content text-right">2009</span></div>```
- Buyer's brokerage compensation
```<div class="keyDetail font-weight-roman font-size-base"><span class="header font-color-gray-light inline-block">Buyer's Brokerage Compensation<div class="isDesktop inline-block"><div class="DefinitionFlyout definition-flyout-container react inline-block"><span class="DefinitionFlyoutLink inline-block"><div class="definition-icon label-info" tabindex="0"><svg class="SvgIcon label-info"><svg viewBox="0 0 24 24"><path fill-rule="evenodd" clip-rule="evenodd" d="M12 0c6.617 0 12 5.383 12 12s-5.383 12-12 12S0 18.617 0 12 5.383 0 12 0zm1 16v-5.75a.25.25 0 0 0-.25-.25h-2.5a.25.25 0 0 0-.25.25V12h1v4h-1v1.75c0 .138.112.25.25.25h3.5a.25.25 0 0 0 .25-.25V16h-1zm-.25-8h-1.5a.25.25 0 0 1-.25-.25v-1.5a.25.25 0 0 1 .25-.25h1.5a.25.25 0 0 1 .25.25v1.5a.25.25 0 0 1-.25.25z"></path></svg></svg></div></span></div></div></span><span class="content text-right">3.0%</span></div>```
    
Address, price, beds, baths, and sqft are all uniquely addressable by ID. Remainder are going to need to involve a text search/next combination.

In [None]:
listingpage_url = 'https://www.redfin.com/WA/Seattle/1119-N-85th-St-98103/unit-A/home/21883460'
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(listingpage_url, headers = user_agent)
response.status_code

In [None]:
listingpage_soup = BeautifulSoup(response.text)

In [None]:
listingpage_soup.prettify()

In [None]:
listingpage_soup.find('span', class_ = 'street-address').text

In [None]:
# Getting from '$###,###Price' to integer price
int(listingpage_soup.find('div', class_ = 'info-block price').text[1:-5].replace(',', ''))

In [None]:
listingpage_soup.find_all('div', class_ = 'info-block')

In [None]:
# Getting square footage from there
int(listingpage_soup.find('div', class_ = 'info-block sqft').find('span', class_ = 'statsValue').text.replace(',', ''))

In [None]:
for child in listingpage_soup.find('div', class_ = 'HomeMainStats home-info inline-block float-right').children:
    print(child)

In [None]:
# Getting this infoblock into an addressable and subscriptable format
infoblock_list = list(listingpage_soup.find('div', class_ = 'HomeMainStats home-info inline-block float-right').children)

In [None]:
# Getting beds
float(infoblock_list[1].find('div', class_ = 'statsValue').text)

In [None]:
# Getting beds
float(infoblock_list[2].find('div', class_ = 'statsValue').text)

In [None]:
# Even better way to get baths and beds
float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-baths"}).find('div', class_ = 'statsValue').text)
float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-beds"}).find('div', class_ = 'statsValue').text)

In [None]:
# Getting buyer's brokerage compensation
float(listingpage_soup.find(string = "Buyer's Brokerage Compensation").find_next('span', class_ = 'content text-right').text[:-1])

In [None]:
# Getting neighborhood; seems relatively safe that this will be the first occurence of 'community' but...
listingpage_soup.find(string = 'Community').find_next('span', class_ = 'content text-right').text

In [None]:
# Number of days on Redfin
int(listingpage_soup.find(string = 'On Redfin: ').find_next().text.split()[0])

In [None]:
# Type of dwelling - somewhat worrisome because there were two 'Style' string results
listingpage_soup.find(string = 'Style').find_next().text

In [None]:
list(listingpage_soup.find_all(string = 'Style'))[1].find_next()
# When exploring, both pointed to townhouse, on in a span, the other in a div.

In [None]:
# As with type of dwelling, two occurences both of which have what I'm looking for as next value
int(listingpage_soup.find(string = 'Lot Size').find_next().text.split()[0].replace(',', ''))

In [None]:
int(listingpage_soup.find(string = 'Year Built').find_next().text)

In [None]:
# Getting zip code
listingpage_soup.find('span', class_ = 'postal-code').text

Chat with Tara Ziegler 1 Oct 1600:

1. May be best to use sold price
2. Depending on how active the market is, listings that are "sold" (Zillow shows within last 3 years) may be for sale at the moment; need to capture status as a feature for filtering.

```<span class="status-container" data-rf-test-id="abp-status"><span><span class="label">Status: </span><span class="value"><div class="DefinitionFlyout definition-flyout-container react inline-block"><span class="DefinitionFlyoutLink inline-block underline clickable" tabindex="0">Active</span></div></span></span></span>```

3. Grabbing webpages, soupifying them, and parsing them are separate tasks! Don't try to do everything at once (per Brian).

In [None]:
listingpage_soup.find(attrs = {'data-rf-test-id': 'abp-status'}).find('span', class_ = 'value').text

We're going to want to make a dict for the various parts we're scraping.

|ID|What|Type|Code|
|---|---|---|---|
|address|<|str|```listingpage_soup.find('span', class_ = 'street-address').text```|
|ZIP|<|str|```listingpage_soup.find('span', class_ = 'postal-code').text```|
|comm|neighborhood|str|```listingpage_soup.find(string = 'Community').find_next('span', class_ = 'content text-right').text```|
|price|<|int|```int(listingpage_soup.find('div', class_ = 'info-block price').text[1:-5].replace(',', ''))```|
|beds|<|discrete|```float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-beds"}).find('div', class_ = 'statsValue').text)```|
|baths|<|discrete|```float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-baths"}).find('div', class_ = 'statsValue').text)```|
|size|house sqft|int|```int(listingpage_soup.find('div', class_ = 'info-block sqft').find('span', class_ = 'statsValue').text.replace(',', ''))```|
|dur|how long on RF|int|```int(listingpage_soup.find(string = 'On Redfin: ').find_next().text.split()[0])```|
|style|dwelling type|str|```listingpage_soup.find(string = 'Style').find_next().text```|
|lot|lot sqft|int|```int(listingpage_soup.find(string = 'Lot Size').find_next().text.split()[0].replace(',', ''))```|
|age|yr built|int|```int(listingpage_soup.find(string = 'Year Built').find_next().text)```|
|brok|buy. brok. comp.|float|```float(listingpage_soup.find(string = "Buyer's Brokerage Compensation").find_next('span', class_ = 'content text-right').text[:-1])```|
|status|<|str|```listingpage_soup.find(attrs = {'data-rf-test-id': 'abp-status'}).find('span', class_ = 'value').text```|

# Getting URLs from the "sold" search page

Link at end of first searchpage is in 
```<button class="clickable buttonControl button-text" data-rf-test-id="react-data-paginate-next"><svg class="SvgIcon slide-next"><svg viewBox="0 0 24 24"><g fill-rule="evenodd"><path d="M7.134 23.134l-1.06-1.06a.25.25 0 0 1 0-.355L16.19 11.603 6.074 1.488a.25.25 0 0 1 0-.355l1.06-1.06a.25.25 0 0 1 .354 0L18.84 11.427a.25.25 0 0 1 0 .353L7.488 23.134a.25.25 0 0 1-.354 0"></path></g></svg></svg></button>```

However, I'm really struggling to find the next link using this, so it probably makes sense to just work with the fact that the page naming is consistent: https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-3yr/page-2
(sold-3yr/page-17 is the max).

In [None]:
soldpage_url = 'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-3yr'
soldpages_scraped = []
ua = UserAgent()
for i in range(17):
    if i > 1:
        target_url = (soldpage_url + '/page-' + str(i))
    else:
        target_url = soldpage_url
    user_agent = {'User-agent': ua.random}
    response  = requests.get(target_url, headers = user_agent)
    time.sleep(random.random()*33)
    soup = BeautifulSoup(response.content, 'html.parser')
    soldpages_scraped.append(soup)
    
print(len(soldpages_scraped))

In [None]:
soldpages_scraped[3]

In [None]:
link_list = []
for scrape in soldpages_scraped:
    for link in scrape.find_all('a', class_ = 'slider-item hidden'):
        link_list.append(link.get('href'))

In [None]:
link_list

In [None]:
base_url = 'https://www.redfin.com'
listpages_scraped = []
listpages_missed = []
ua = UserAgent()
for i in link_list:
    if i%37 == 0:
        time.sleep(random.random()*500)
    try:
        target_url = (base_url + i)
        user_agent = {'User-agent': ua.random}
        response  = requests.get(target_url, headers = user_agent)
        time.sleep(random.random()*11)
        soup = BeautifulSoup(response.content, 'html.parser')
        listpages_scraped.append(soup)
    except:
        listpages_missed.append(i)
pickle.dump(listpages_scraped, last3months_sold_listings)
print(len(listpages_scraped))


In [None]:
len(listpages_scraped), len(link_list)

Original code just ran "for i in link_list: target. . . " but errored out on 91st entry of 326. Added line for random longer sleep every 37 tries and put scraping inside a try/except.

In [None]:
soldpage_url = 'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-3yr'
soldpages_scraped = []
soldpages_missed = []
error_list = []
ua = UserAgent()
# Redfin only gives 17 possible pages BUT there are 33k possible houses
for i in range(1800):
    # Occasional long sleeps to throw off bot detection
    if 1%41 == 0:
        time.sleep(random.random()*300)
    if i > 1:
        target_url = (soldpage_url + '/page-' + str(i))
    else:
        target_url = soldpage_url
    user_agent = {'User-agent': ua.random}
    try:
        response  = requests.get(target_url, headers = user_agent)
        time.sleep(2)
        soup = BeautifulSoup(response.content, 'html.parser')
        soldpages_scraped.append(soup)
    except (ConnectionError, MaxRetryError, NewConnectionError, gaierror):
        soldpages_missed.append(i)
    # Are we seeing other errors that need to be dealt with?
    except:
        error_list.append(sys.exc_info()[0])
    time.sleep(random.random()*11)
    
print(len(soldpages_scraped))

Turns out that any page number after 17 gets redirected to 17, so this strategy won't work like I want it to. Need to subset by area (target on map) and then count up to 17 for each.

Also, want to pickle.dump for each area.

Chunks according to Redfin Seattle map:
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.74409:47.69068:-122.30638:-122.40011 (Northgate, Bitter Lake)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.73611:47.68268:-122.2574:-122.35113 (Lake City)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.70017:47.67345:-122.31544:-122.3623 (Greenlake, Greenwood)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.7098:47.65635:-122.34018:-122.43391 (Ballard)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.6748:47.62132:-122.30136:-122.39509 (Fremont)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.69562:47.64215:-122.24063:-122.33435 (Ravenna)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.63781:47.61105:-122.31995:-122.36682 (Downtown)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.67059:47.56356:-122.21313:-122.40058 (Capitol Hill PLUS - looks likely to pick up lots of duplicates)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.59167:47.56489:-122.37725:-122.42411 (Alki)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.5792:47.52562:-122.32803:-122.42176 (W. Seattle)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.54585:47.49223:-122.31099:-122.40472 (Highland Park)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.59386:47.5403:-122.26363:-122.35736 (Rainier)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.5791:47.52552:-122.22798:-122.32171 (Columbia City)
- https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.53334:47.47971:-122.20693:-122.30066 (Rainier Beach)

Flow:
1. Set these links as a list
2. Step through using scraping as previous with addition of list loop surrounding for i in range(17) loop.
3. Pickle the resulting list for future use
4. Remove duplicates (using set()?)
5. Search for any that don't include '/Seattle/' in the link text and remove if present.
6. Send cleaned list to individual page scraping code.

In [None]:
sold_by_area_links = [
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.74409:47.69068:-122.30638:-122.40011',
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.73611:47.68268:-122.2574:-122.35113',
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.70017:47.67345:-122.31544:-122.3623', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.7098:47.65635:-122.34018:-122.43391', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.6748:47.62132:-122.30136:-122.39509', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.69562:47.64215:-122.24063:-122.33435', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.63781:47.61105:-122.31995:-122.36682', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.67059:47.56356:-122.21313:-122.40058', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.59167:47.56489:-122.37725:-122.42411', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.5792:47.52562:-122.32803:-122.42176', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.54585:47.49223:-122.31099:-122.40472',
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.59386:47.5403:-122.26363:-122.35736', 
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.5791:47.52552:-122.22798:-122.32171',
    'https://www.redfin.com/city/16163/WA/Seattle/filter/include=sold-5yr,viewport=47.53334:47.47971:-122.20693:-122.30066']

soldpages_scraped = []
soldpages_missed = []
soldpages_completed = []
error_list = []
ua = UserAgent()
for link in sold_by_area_links:
    for i in range(17):
        if i > 1:
            target_url = (link + '/page-' + str(i))
        else:
            target_url = link
        user_agent = {'User-agent': ua.random}
        try:
            response  = requests.get(target_url, headers = user_agent)
            time.sleep(2)
            soup = BeautifulSoup(response.content, 'html.parser')
            soldpages_scraped.append(soup)
#        except (ConnectionError, MaxRetryError, NewConnectionError, gaierror):
#            soldpages_missed.append(i)
        # Are we seeing other errors that need to be dealt with?
        except:
            error_list.append(sys.exc_info()[0])
            error_list.append(response.status_code)
        time.sleep(random.random()*11)
    soldpages_completed.append(link)
pickle.dump(soldpages_scraped, 'soldpages_scraped_2020_10_02_1500')
print(len(soldpages_scraped))

In [None]:
# Can't seem to pickle the list of scraped pages because of recursion depth, so. . .
link_list = []
for scrape in soldpages_scraped:
    for link in scrape.find_all('a', class_ = 'slider-item hidden'):
        link_list.append(link.get('href'))

In [None]:
link_set = list(set(link_list))
len(link_list), len(link_set)

In [None]:
error_list

In [3]:
#with open("link_set.txt", "wb") as fp:
#    pickle.dump(link_set, fp)
with open("link_set.txt", "rb") as fp:
    link_set = pickle.load(fp)
link_set
# Confirmed I can pickle and unpickle, then!

['/WA/Seattle/1723-S-Forest-St-98144/home/172186946',
 '/WA/Seattle/3610-1st-Ave-NW-98107/unit-B/home/69353943',
 '/WA/Seattle/3052-23rd-Ave-W-98199/home/2064490',
 '/WA/Seattle/924-N-87th-St-98103/home/99838',
 '/WA/Seattle/8706-Phinney-Ave-N-98103/home/145709542',
 '/WA/Seattle/762-Hayes-St-98109/unit-31/home/103951529',
 '/WA/Seattle/137-S-107th-St-98168/home/18660720',
 '/WA/Seattle/4042-Martin-Luther-King-Jr-Way-S-98108/home/109969582',
 '/WA/Seattle/1709-18th-Ave-98122/unit-202/home/16085',
 '/WA/Seattle/2557-11th-Ave-W-98119/home/130160',
 '/WA/Seattle/3141-34th-Ave-S-98144/home/171688',
 '/WA/Seattle/211-23rd-Ave-98122/unit-B/home/143714',
 '/WA/Seattle/81-Clay-St-98121/unit-526/home/12537355',
 '/WA/Seattle/1530-NW-Market-St-98107/unit-811/home/17381614',
 '/WA/Seattle/647-NW-51st-St-98107/home/302455',
 '/WA/Seattle/3912-S-Orcas-St-98118/home/172785',
 '/WA/Seattle/2757-SW-Sylvan-Heights-Dr-98106/home/12090512',
 '/WA/Seattle/7310-7th-Ave-SW-98106/home/475885',
 '/WA/Seattle/

This did pick up a lot of duplicates, so I'm really glad I deduplicated it here! Went from 4645 links to 3111 links.

error_list is empty and I've confirmed that I can pickle things, so awesome!

Pull down the code to scrape individual pages, add lines to pickle it and reset the list every so often.

In [None]:
base_url = 'https://www.redfin.com'
# Should yield a list of BeautifulSoup objects
listpages_scraped = []
# Should yield a list of links
listpages_missed = []
# Should yield a list of errors and response codes for debugging
error_list = []
# Making loading up and using the dumps easier on myself
dump_count = 1
# For randomly generated headers in the request
ua = UserAgent()
for i in range(0, len(link_list)):
    # Occasional long sleep to throw bot detection off the scent
    if i%37 == 0:
        time.sleep(random.randrange(10, 120))
    try:
        target_url = (base_url + link_list[i])
        user_agent = {'User-agent': ua.random}
        response  = requests.get(target_url, headers = user_agent)
        time.sleep(0.5 + random.random() * 19.5)
        soup = BeautifulSoup(response.content, 'html.parser')
        listpages_scraped.append(soup)
    except:
        listpages_missed.append(link_list[i])
        error_list.append(sys.exc_info()[0])
        error_list.append(response.status_code)
        time.sleep(0.5 * random.random() * 19.5)
    #Making sure to dump results regularly (so they're pickleable and so they're not )
    if len(listpages_scraped) > 24:
        dump_name = 'list_scrapes' + str(dump_count) + '.txt'
        dump_count += 1
        with open(dump_name, "wb") as fp:
            pickle.dump(listpages_scraped, fp)
        listpages_scraped = []
print(dump_count)

# Parsing the content
Attempts to pickle the content keep running into RuntimeError maximum recursion depth reached.

So let's just parse this content as I go.

Questions to answer: 
1. Does the parsing that worked on for sale houses also work on the sold houses?
2. How do I parse sale date? ```<div class="sold-row row PropertyHistoryEventRow" id="propertyHistory-0"><div class="col-4"><p>Mar 30, 2020</p><p class="subtext">Date</p></div><div class="description-col col-4"><div>Sold (Public Records)</div><div></div><p class="subtext">Public Records</p></div><div class="col-4"><div class="price-col number">$760,000<span class="number positive empty"> (9.8%/yr)</span></div><p class="subtext">Price</p></div></div>``` (from manual inspection of page)

```<div class="Pill Pill--red padding-vert-smallest padding-horiz-smaller font-size-smaller font-weight-bold font-color-white HomeSash margin-top-smallest margin-right-smaller">``` (from search for sale date on scrape)

```\"lastSaleDate\":\"MAR 30, 2020``` (from search for sale date on scrape)

And I'm going to need to use selenium - the page is dynamically generated and the only place I seem to be able to scrape the sold date from is several scrolls down.

In [8]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

In [23]:
listingpage_url = 'https://www.redfin.com/WA/Seattle/10024-63rd-Ave-S-98178/home/177013'
driver = webdriver.Chrome(chromedriver)
driver.get(listingpage_url)
for i in range(12):
    #Scroll
    driver.execute_script("window.scrollBy({top: 700,left: 0,behavior: 'smooth'});")
    time.sleep(1.5)
listingpage_soup = BeautifulSoup(response.content, 'html.parser')
driver.close()

In [32]:
# Getting sale date from little red pill on first photo (pills after first on related listings)
listingpage_soup.find('div', class_ = "Pill Pill--red padding-vert-smallest padding-horiz-smaller font-size-smaller font-weight-bold font-color-white HomeSash margin-top-smallest margin-right-smaller").text.replace('SOLD BY REDFIN ', '')

'MAR 30, 2020'

In [50]:
# <h3 class="title font-color-gray-dark font-weight-bold propertyDetailsHeader">Parking Information</h3>
listingpage_soup.find(string = 'Exterior')

|ID|What|Type|Code|
|---|---|---|---|
|address|<|str|```listingpage_soup.find('span', class_ = 'street-address').text```|
|ZIP|<|str|```listingpage_soup.find('span', class_ = 'postal-code').text```|
|comm|neighborhood|str|```listingpage_soup.find(string = 'Community').find_next('span', class_ = 'content text-right').text```|
|price|<|int|```int(listingpage_soup.find('div', class_ = 'info-block price').text[1:-15].replace(',', ''))```|
|beds|<|discrete|```float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-beds"}).find('div', class_ = 'statsValue').text)```|
|baths|<|discrete|```float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-baths"}).find('div', class_ = 'statsValue').text)```|
|size|house sqft|int|```int(listingpage_soup.find('div', class_ = 'info-block sqft').find('span', class_ = 'statsValue').text.replace(',', ''))```|
|style|dwelling type|str|```listingpage_soup.find(string = 'Style').find_next().text```|
|lot|lot sqft|int|```int(listingpage_soup.find(string = 'Lot Size').find_next().text.split()[0].replace(',', ''))```|
|age|yr built|int|```int(listingpage_soup.find(string = 'Year Built').find_next().text)```|
|status|<|str|```listingpage_soup.find(attrs = {'data-rf-test-id': 'abp-status'}).find('span', class_ = 'value').text```|
|sold|last sale date|str|```listingpage_soup.find('div', class_ = "Pill Pill--red padding-vert-smallest padding-horiz-smaller font-size-smaller font-weight-bold font-color-white HomeSash margin-top-smallest margin-right-smaller").text.replace('SOLD BY REDFIN ', '')```|
|park|parking|str|```listingpage_soup.find(string = 'Parking Information').find_next().text```|
|brok|buy. brok. comp.|float|```float(listingpage_soup.find(string = "Buyer's Brokerage Compensation").find_next('span', class_ = 'content text-right').text[:-1])```|


In [71]:
def parseSoldPage(soup, url):
    """
    Grabs desired information from Redfin page for a single sold property using BeatifulSoup.
    Returns a dictionary of the desired information.
    """
    property_dict = {}
    property_dict['address'] = listingpage_soup.find('span', class_ = 'street-address').text or False
    property_dict['ZIP'] = listingpage_soup.find('span', class_ = 'postal-code').text or False
    property_dict['comm'] = listingpage_soup.find(string = 'Community').find_next('span', class_ = 'content text-right').text or False
    property_dict['price'] = int(listingpage_soup.find('div', class_ = 'info-block price').text[1:-15].replace(',', '')) or False
    property_dict['beds'] = float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-beds"}).find('div', class_ = 'statsValue').text) or False
    property_dict['baths'] = float(listingpage_soup.find(attrs = {'data-rf-test-id': "abp-baths"}).find('div', class_ = 'statsValue').text) or False
    property_dict['size'] = int(listingpage_soup.find('div', class_ = 'info-block sqft').find('span', class_ = 'statsValue').text.replace(',', '')) or False
    property_dict['style'] = listingpage_soup.find(string = 'Style').find_next().text or False
    property_dict['lot'] = int(listingpage_soup.find(string = 'Lot Size').find_next().text.split()[0].replace(',', '')) or False
    property_dict['age'] = int(listingpage_soup.find(string = 'Year Built').find_next().text) or False
    property_dict['status'] = listingpage_soup.find(attrs = {'data-rf-test-id': 'abp-status'}).find('span', class_ = 'value').text or False
    property_dict['sold'] = listingpage_soup.find('div', class_ = "Pill Pill--red padding-vert-smallest padding-horiz-smaller font-size-smaller font-weight-bold font-color-white HomeSash margin-top-smallest margin-right-smaller").text.replace('SOLD BY REDFIN ', '') or False
    property_dict['park'] = listingpage_soup.find(string = 'Parking Information').find_next().text or False
    property_dict['brok'] = float(listingpage_soup.find(string = "Buyer's Brokerage Compensation").find_next('span', class_ = 'content text-right').text[:-1]) or False
    property_dict['url'] = url
    return property_dict

In [52]:
parseSoldPage(listingpage_soup)

{'address': '10024 63rd Ave S ',
 'ZIP': '98178',
 'comm': 'Upper Rainier Beach',
 'price': 760000,
 'beds': 4.0,
 'baths': 2.75,
 'size': 2670,
 'style': '2 Stories with Basement, Tudor',
 'lot': 6232,
 'age': 1921,
 'status': 'Sold',
 'sold': 'MAR 30, 2020',
 'park': 'Off-Street Parking',
 'brok': 3.0}

In [1]:
with open("link_set.txt", "rb") as fp:
    link_set = pickle.load(fp)
link_set

NameError: name 'pickle' is not defined

In [54]:
link_set_test = link_set[0:10]

In [55]:
link_set_remainder = link_set[10:]

In [64]:
base_url = 'https://www.redfin.com'
listingpage_scrapes = []
missed_urls = []
driver = webdriver.Chrome(chromedriver)
for link in link_set_test:
    listing_url = base_url + link
    driver.get(listing_url)
    for i in range(12):
        #Scroll
        driver.execute_script("window.scrollBy({top: 700,left: 0,behavior: 'smooth'});")
        time.sleep(0.5 + random.random())
    listingpage_soup = BeautifulSoup(response.content, 'html.parser')
    listingpage_scrapes.append(parseSoldPage(listingpage_soup))
listingpage_scrapes

[{'address': '10024 63rd Ave S ',
  'ZIP': '98178',
  'comm': 'Upper Rainier Beach',
  'price': 760000,
  'beds': 4.0,
  'baths': 2.75,
  'size': 2670,
  'style': '2 Stories with Basement, Tudor',
  'lot': 6232,
  'age': 1921,
  'status': 'Sold',
  'sold': 'MAR 30, 2020',
  'park': 'Off-Street Parking',
  'brok': 3.0},
 {'address': '10024 63rd Ave S ',
  'ZIP': '98178',
  'comm': 'Upper Rainier Beach',
  'price': 760000,
  'beds': 4.0,
  'baths': 2.75,
  'size': 2670,
  'style': '2 Stories with Basement, Tudor',
  'lot': 6232,
  'age': 1921,
  'status': 'Sold',
  'sold': 'MAR 30, 2020',
  'park': 'Off-Street Parking',
  'brok': 3.0},
 {'address': '10024 63rd Ave S ',
  'ZIP': '98178',
  'comm': 'Upper Rainier Beach',
  'price': 760000,
  'beds': 4.0,
  'baths': 2.75,
  'size': 2670,
  'style': '2 Stories with Basement, Tudor',
  'lot': 6232,
  'age': 1921,
  'status': 'Sold',
  'sold': 'MAR 30, 2020',
  'park': 'Off-Street Parking',
  'brok': 3.0},
 {'address': '10024 63rd Ave S ',
  '

In [85]:
listingpage_scrapes = []
for link in link_set:
    listing_url = base_url + link
    driver.get(listing_url)
    for i in range(14):
        #Scroll
        driver.execute_script("window.scrollBy({top: 700,left: 0,behavior: 'smooth'});")
        time.sleep(1.5)
    listingpage_soup = BeautifulSoup(response.content, 'html.parser')
    listingpage_scrapes.append(parseSoldPage(listingpage_soup, listing_url))
with open('listingpage_scrapes.txt', "wb") as fp:
    pickle.dump(listingpage_scrapes, fp)

In [86]:
len(listingpage_scrapes)

3111

# Learnings
1. Haste makes waste (someday this will sink in).
2. If you aren't clear on what form the data outputs are and what forms the inputs need to take, you will run into nigh-endless errors (someday this will sink in).
3. If you need to scrape fast, you'll need some ways to parallelize and spoof in order to avoid hitting CATCHPAs.
4. You can't pickle a BeautifulSoup directly unless you're on a more powerful machine than my 2014 MacBook Air; casting to string and pickling that works.
5. Just pickle the scraped pages as you go in small batches; the ability to re-scrape pages later could be a lifesaver for some projects.

# Future work
Cleaning, feature engineering, and model work, obvs.
1. Rewrite scrape codes to be better (include regular pickle dumps of listingpage_scrapes)
2. Spin up and run scraping of for sale pages to get model cross-comparison set (essentially, how well is my model predicting the LIST praces using the SOLD prices; are LIST prices that are substantially in excess of what the model would predict overreprsented in the older for sale listings?
3. Functionalize scripts used; store human-generated lists for future use.

In [88]:
import pandas as pd
listingpages = pd.DataFrame(listingpage_scrapes)
listingpages.head()

Unnamed: 0,address,ZIP,comm,price,beds,baths,size,style,lot,age,status,sold,park,brok,url
0,10024 63rd Ave S,98178,Upper Rainier Beach,760000,4.0,2.75,2670,"2 Stories with Basement, Tudor",6232,1921,Sold,"MAR 30, 2020",Off-Street Parking,3.0,https://www.redfin.com/WA/Seattle/1723-S-Fores...
1,10024 63rd Ave S,98178,Upper Rainier Beach,760000,4.0,2.75,2670,"2 Stories with Basement, Tudor",6232,1921,Sold,"MAR 30, 2020",Off-Street Parking,3.0,https://www.redfin.com/WA/Seattle/3610-1st-Ave...
2,10024 63rd Ave S,98178,Upper Rainier Beach,760000,4.0,2.75,2670,"2 Stories with Basement, Tudor",6232,1921,Sold,"MAR 30, 2020",Off-Street Parking,3.0,https://www.redfin.com/WA/Seattle/3052-23rd-Av...
3,10024 63rd Ave S,98178,Upper Rainier Beach,760000,4.0,2.75,2670,"2 Stories with Basement, Tudor",6232,1921,Sold,"MAR 30, 2020",Off-Street Parking,3.0,https://www.redfin.com/WA/Seattle/924-N-87th-S...
4,10024 63rd Ave S,98178,Upper Rainier Beach,760000,4.0,2.75,2670,"2 Stories with Basement, Tudor",6232,1921,Sold,"MAR 30, 2020",Off-Street Parking,3.0,https://www.redfin.com/WA/Seattle/8706-Phinney...


In [89]:
with open("link_set.txt", "rb") as fp:   # Unpickling
    link_set = pickle.load(fp)

Variable "response" was never actually set in scraping code, so parsed same thing over and over and over. Bloody hell.

In [93]:
link_set_test10 = link_set[0:10]
link_set_test100 = link_set[10:110]

In [113]:
base_url = 'https://www.redfin.com'
listingpage_scrapes = []
driver = webdriver.Chrome(chromedriver)
counter = 1
# Iterating through list of links
for link in link_set_test10[5:]:
    listing_url = base_url + link
    driver.get(listing_url)
    for i in range(14):
        #Scroll
        driver.execute_script("window.scrollBy({top: 700,left: 0,behavior: 'smooth'});")
        time.sleep(random.random()*2)    
    # Getting the page source and soupifying it
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    listingpage_scrapes.append(soup)
    listingpage_scrapes.append(link)
    counter += 1
    # Intermittent pickling of raw html scrapes
    if counter%5 == 0:
        dump_list = [str(x) for x in listingpage_scrapes]
        with open(('listingpage_scrapes_dump' + str(counter) + '.txt'), 'wb') as fp:
            pickle.dump(dump_list, fp)
        listingpage_scrapes = []

TypeError: dump() missing required argument 'file' (pos 2)

In [114]:
# Iterating through list of links
for link in link_set[5:]:
    listing_url = base_url + link
    driver.get(listing_url)
    for i in range(14):
        #Scroll
        driver.execute_script("window.scrollBy({top: 700,left: 0,behavior: 'smooth'});")
        time.sleep(random.random()*2)    
    # Pause for loading, getting the page source, and soupifying it
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    listingpage_scrapes.append(soup)
    listingpage_scrapes.append(link)
    counter += 1
    # Intermittent saving of results and freeing up a little memory
    if counter%100 == 0:
        # Casting soup as string so it can be pickled
        dump_list = [str(x) for x in listingpage_scrapes]
        with open(('listingpage_scrapes_dump' + str(counter) + '.txt'), 'wb') as fp:
            pickle.dump(dump_list, fp)
        listingpage_scrapes = []

WebDriverException: Message: disconnected: not connected to DevTools
  (Session info: chrome=85.0.4183.121)


/WA/Seattle/583-Battery-St-98121/unit-3003N/home/113201980

This is the last page scraped and pickled before the cats chewed my charging cord to pieces. Index = 1199

In [5]:
link_set.index('/WA/Seattle/762-Hayes-St-98109/unit-31/home/103951529')

5

In [6]:
unscraped_links = link_set[0:5] + link_set[1200:]

In [9]:
base_url = 'https://www.redfin.com'
listingpage_scrapes = []
driver = webdriver.Chrome(chromedriver)
counter = 1201
# Iterating through list of links
for link in unscraped_links:
    listing_url = base_url + link
    driver.get(listing_url)
    for i in range(14):
        #Scroll
        driver.execute_script("window.scrollBy({top: 700,left: 0,behavior: 'smooth'});")
        time.sleep(random.random()*2)    
    # Pause for loading, getting the page source, and soupifying it
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    listingpage_scrapes.append(soup)
    listingpage_scrapes.append(link)
    counter += 1
    # Intermittent saving of results and freeing up a little memory
    if counter%100 == 0:
        # Casting soup as string so it can be pickled
        dump_list = [str(x) for x in listingpage_scrapes]
        with open(('listingpage_scrapes_dump' + str(counter) + '.txt'), 'wb') as fp:
            pickle.dump(dump_list, fp)
        listingpage_scrapes = []

TimeoutException: Message: timeout: Timed out receiving message from renderer: 58.353
  (Session info: chrome=85.0.4183.121)


In [4]:
# Timed out because of an infinite loop; one last run at this!
base_url = 'https://www.redfin.com'
listingpage_scrapes = []
driver = webdriver.Chrome(chromedriver)
counter = 2901
# Iterating through list of links
for link in link_set[2900:]:
    listing_url = base_url + link
    driver.get(listing_url)
    for i in range(14):
        #Scroll
        driver.execute_script("window.scrollBy({top: 700,left: 0,behavior: 'smooth'});")
        time.sleep(0.5 + random.random()*1.5)    
    # Pause for loading, getting the page source, and soupifying it
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    listingpage_scrapes.append(soup)
    listingpage_scrapes.append(link)
    counter += 1
# Casting soup as string so it can be pickled
dump_list = [str(x) for x in listingpage_scrapes]
with open(('listingpage_scrapes_dump' + str(counter) + '.txt'), 'wb') as fp:
    pickle.dump(dump_list, fp)
