# How to do Web Scraping for Real Estate Data
Extracting data from the web to use in decision-making

### New Jersey Map by County

<img src="https://www.njfuture.org/wp-content/uploads/2012/05/new-jersey-county-map.png" style="height:600px;">

### What comes to your mind when you think of New Jersey?

<img src="https://townsquare.media/site/394/files/2016/03/six-flags.jpg?w=980&q=75" style="width:500px">
<img src="https://tvseriesfinale.com/wp-content/uploads/2020/01/jersey-shore-family-vacation-e1580134194375.jpg" style="width:500px">
<img src="https://advancelocal-adapter-image-uploads.s3.amazonaws.com/image.lehighvalleylive.com/home/lvlive-media/width2048/img/breaking-news_impact/photo/garden-state-parkway-49b4ef47daea1aad.jpg" style="width:500px">

### Did you know New Jersey has incredible real estate too?

<img src="https://images.beachhouse.com/files/hal_62581838_0.jpg" style="height:600px;">

## Overview
We are <b>long distance</b> real estate investors looking to invest in the state of New Jersey.

### Problem
We have limited resources since we do not live in New Jersey. We cannot rely on word-of-mouth or local conversation to find great deals.

### Goal
We want to use <b>BIG DATA</b> technologies to collect local town data in an easy-to-use format (a CSV file), so that we can make better investment decisions.

### Solution
Use <b>WEB SCRAPING</b> to build a dataset of New Jersey towns, then use this data to identify trends and areas of opportunity.

## What is web scraping?
Web scraping is a technique for automatically collecting new or updated data from web pages.

<img src="https://drive.google.com/uc?id=1xTMvXBH8cnegL3VbBuA1pf9XcWiByxep" style="width:1000px;">
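
To make the idea concrete, here is a minimal sketch that pulls list items out of a small HTML snippet. It uses only Python's standard-library `html.parser` so it runs without extra packages; the notebook itself uses BeautifulSoup for the same job. The snippet `SAMPLE_HTML` and the names `ListItemParser` and `extract_list_items` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded page.
SAMPLE_HTML = """
<h2>Middlesex</h2>
<ul>
  <li>East Brunswick</li>
  <li>Sayreville</li>
</ul>
"""


class ListItemParser(HTMLParser):
    """Collects the text of every <li> element it sees."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Only keep text that appears inside an open <li> element.
        if self.in_item:
            self.items[-1] += data


def extract_list_items(html):
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


print(extract_list_items(SAMPLE_HTML))
```

A real scraper does the same thing at scale: download the page, locate the tags that hold the data, and keep only their text.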

### Benefits
1. Collect data for market research
2. Extract contact info
3. Track prices for multiple markets

## Widgets

In [11]:
dbutils.widgets.removeAll()

In [12]:
# create widgets
dbutils.widgets.text("websiteUrl", "https://www.njstatelib.org/research_library/new_jersey_resources/highlights/municipalities_by_county/", "01) Website URL")
dbutils.widgets.text("newsUrl", "https://patch.com/new-jersey/eastbrunswick/real-estate", "02) News URL")

In [13]:
websiteUrl = dbutils.widgets.get("websiteUrl")
newsUrl = dbutils.widgets.get("newsUrl")

## Imports

In [15]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Functions

In [17]:
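def get_url_contents(url):
  # get page contents; fail fast on timeouts and HTTP errors
  page = requests.get(url, timeout=30)
  page.raise_for_status()
  # parse the html page (the 'html5lib' parser must be installed)
  soup = BeautifulSoup(page.text, 'html5lib')
  print('Extracted website contents!')
  return soup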

## Data Exploration

### 1) City & County Data

In [20]:
# parse text for nj state lib contents
soup = get_url_contents(websiteUrl)

In [21]:
# find the html table
table = soup.find('table', attrs={"class": "omsc-custom-table omsc-style-1"})
table_body = table.find('tbody')
print(table_body.prettify())

In [22]:
table_dict = {}
# list all counties from header tags
counties = [county.text for county in table_body.findAll('h2')]

city_list = []
# get a list of all cities within unordered list tag
for city_elem in table_body.findAll('ul'):
    cities = [city_name.text for city_name in city_elem.findAll('li')]
    city_list.append(cities)
# create a dictionary that maps county -> city
for i in range(len(counties)):
    table_dict[counties[i]] = city_list[i]
table_dict
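
The pairing above relies on position: the i-th `<h2>` is assumed to belong to the i-th `<ul>`. A minimal sketch of the same mapping on made-up lists, with a length check added as a guard against a page layout change, might look like this. The helper name `map_counties_to_cities` and the sample values are illustrative, not part of the notebook.

```python
def map_counties_to_cities(counties, city_lists):
    """Pair each county with the city list found at the same position."""
    if len(counties) != len(city_lists):
        raise ValueError(
            "found {} counties but {} city lists".format(len(counties), len(city_lists))
        )
    return dict(zip(counties, city_lists))


# Illustrative sample data, not the scraped result.
sample_counties = ["Atlantic", "Middlesex"]
sample_city_lists = [["Absecon", "Brigantine"], ["East Brunswick"]]

mapping = map_counties_to_cities(sample_counties, sample_city_lists)
print(mapping)
```

Raising an error on a mismatch is one possible design choice; it surfaces a changed page structure immediately instead of silently attaching cities to the wrong county.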

In [23]:
# create a flat list of [county, city] pairs
flatten = [[county, city] for county, cities in table_dict.items() for city in cities]
flatten[:5]

In [24]:
# view table
df = pd.DataFrame(flatten, columns=['County', 'City'])
display(df.head())

County,City
Atlantic,Absecon
Atlantic,Atlantic City
Atlantic,Brigantine
Atlantic,Buena
Atlantic,Buena Vista


In [25]:
# view target area
display(df.loc[df['City'] == 'East Brunswick'])

County,City
Middlesex,East Brunswick


### 2) News Data

In [27]:
# parse nj news content
soup = get_url_contents(newsUrl)

In [28]:
# view top 10 most recent blog titles
blog_titles = soup.find_all('a', attrs={"class": "near-black-link"})
blog_titles_list = [title.text for title in blog_titles]
for i, title in enumerate(blog_titles_list[:10], start=1):
  print('Blog title {}: {}'.format(i, title))

In [29]:
# view top 10 most recent blog descriptions
blog_descs = soup.find_all('p', attrs={"class": "d-none d-sm-block lineheight-1-5 m-0 py-2 text-secondary text-serif text-xs"})
blog_descs_list = [desc.text.replace("\n  ", "") for desc in blog_descs]
for i, desc in enumerate(blog_descs_list[:10], start=1):
  print('Blog description {}: {}'.format(i, desc))

In [30]:
# readable format
df_news = pd.DataFrame([[*range(1, 11)], blog_titles_list[:10], blog_descs_list[:10]]).T.rename(
  columns={0: 'Blog Num', 1: 'Blog Title', 2: 'Blog Desc'})
df_news['City'] = 'East Brunswick'
display(df_news.head())

Blog Num,Blog Title,Blog Desc,City
1,5 New Open Houses In The East Brunswick Area,Here are the most recent properties to hit the local open-house circuit.,East Brunswick
2,First Look At Dramatic New Rt. 18 Development In East Brunswick,"On Tuesday, East Brunswick released never-before-seen renderings of the dramatic —​ and somewhat controversial —​ Rt. 18 redevelopment plan.",East Brunswick
3,East Brunswick Plans Massive Development Project Along Rt. 18,"A new bus stop, public parking, a hotel, residential and commercial units, a ""Tech Center"" and more are planned for this stretch of Rt. 18.",East Brunswick
4,Sayreville: Massive Luxury Apt. Complex 'Riverton' May Be Coming,"If built, this project promises to dramatically re-shape Sayreville and the Raritan riverfront in this part of Central New Jersey.",East Brunswick
5,The Most Expensive Home In Middlesex County ....,"... is this one, located in ....",East Brunswick


In [31]:
# join both datasets
df_join = df.merge(df_news, how='left', on=['City'])
display(df_join.loc[df_join['City'] == 'East Brunswick'])

County,City,Blog Num,Blog Title,Blog Desc
Middlesex,East Brunswick,1,5 New Open Houses In The East Brunswick Area,Here are the most recent properties to hit the local open-house circuit.
Middlesex,East Brunswick,2,First Look At Dramatic New Rt. 18 Development In East Brunswick,"On Tuesday, East Brunswick released never-before-seen renderings of the dramatic —​ and somewhat controversial —​ Rt. 18 redevelopment plan."
Middlesex,East Brunswick,3,East Brunswick Plans Massive Development Project Along Rt. 18,"A new bus stop, public parking, a hotel, residential and commercial units, a ""Tech Center"" and more are planned for this stretch of Rt. 18."
Middlesex,East Brunswick,4,Sayreville: Massive Luxury Apt. Complex 'Riverton' May Be Coming,"If built, this project promises to dramatically re-shape Sayreville and the Raritan riverfront in this part of Central New Jersey."
Middlesex,East Brunswick,5,The Most Expensive Home In Middlesex County ....,"... is this one, located in ...."
Middlesex,East Brunswick,6,The Cheapest Home In Middlesex County Is ...,"But there are some very good reasons it has a $4,900 price tag."
Middlesex,East Brunswick,7,2018 Best Cities For Middle-Class Buyers — And The Worst,Sky-high prices got you down? Here are the top middle-class housing meccas.
Middlesex,East Brunswick,8,Top Home In Highly-Rated East Brunswick School District,WOW House: Enjoy access to East Brunswick's blue-ribbon schools in this stunning villa with a pool and tennis courts near a peach orchard.
Middlesex,East Brunswick,9,Most Expensive Home On The Market In East Brunswick,WOW House: Explore this five-bedroom with a pool on 3.5 acres.
Middlesex,East Brunswick,10,Explore This Lakefront Home In East Brunswick,"WOW House: Asking $2.1 million in East Brunswick Twp., this property overlooks Farrington Lake."


### What do we have so far?
Two datasets merged into one DataFrame, which can be used to find trends per city.

Dataset <b>features</b> (columns): County, City, Blog Num, Blog Title, Blog Desc

### How do we make this a process?
Label each row as <i>important</i> (1) or <i>not important</i> (0). 
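
As a rough sketch of what labeling a row means in practice, the snippet below adds a `Label` value to each row. The keyword rule is only a placeholder for a human judgment: `IMPORTANT_KEYWORDS` and `label_title` are hypothetical, and which terms count as important is an assumption made for illustration.

```python
# Hypothetical keywords; which terms matter is an assumption for illustration only.
IMPORTANT_KEYWORDS = ("development", "redevelopment", "complex")


def label_title(title, keywords=IMPORTANT_KEYWORDS):
    """Return 1 (important) if the title mentions any keyword, else 0."""
    lowered = title.lower()
    return int(any(keyword in lowered for keyword in keywords))


rows = [
    {"City": "East Brunswick",
     "Blog Title": "East Brunswick Plans Massive Development Project Along Rt. 18"},
    {"City": "East Brunswick",
     "Blog Title": "5 New Open Houses In The East Brunswick Area"},
]

for row in rows:
    row["Label"] = label_title(row["Blog Title"])
    print(row["Label"], row["Blog Title"])
```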

### How do we automate this process?
1. Hire someone to manually label each new blog entry
  - <b>Problem</b>: not scalable, cost of training someone, risk of mislabeling
  - "Staying busy and being productive are NOT the same thing"
2. Develop a machine learning model that automatically tags each blog entry as important or not important. Send email alerts only when an entry labeled "important" is processed.
  - <b>Benefit</b>: scalable, low-cost, low error (if trained with correctly labeled data)
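
The alerting idea in option 2 could be sketched as follows, assuming each entry already carries a `Label` of 1 or 0. The function `build_alert` and the example addresses are hypothetical; the sketch only builds the message with the standard-library `email.message.EmailMessage` and leaves sending (for example through an SMTP server) out of scope.

```python
from email.message import EmailMessage


def build_alert(entries, sender, recipient):
    """Return an EmailMessage listing important entries, or None if there are none."""
    important = [entry for entry in entries if entry["Label"] == 1]
    if not important:
        return None
    message = EmailMessage()
    message["Subject"] = "{} important real estate update(s)".format(len(important))
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(entry["Blog Title"] for entry in important))
    return message


# Labels here are illustrative, as if produced by a trained model.
labeled_entries = [
    {"Blog Title": "East Brunswick Plans Massive Development Project Along Rt. 18", "Label": 1},
    {"Blog Title": "5 New Open Houses In The East Brunswick Area", "Label": 0},
]

alert = build_alert(labeled_entries, "alerts@example.com", "investor@example.com")
print(alert["Subject"])
print(alert.get_content())
```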
  
<img src="https://avo-translations.co.uk/wp-content/uploads/2019/05/1.png" style="height:300px;">

## How do we train a model to label what's important and what's not?

1. NLP to extract features from the text
2. Clustering to group similar titles
3. Label each cluster
4. Train the model
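
The four steps can be sketched end to end with the standard library alone. This is a deliberately simple stand-in, not the intended production approach: word-set tokenization takes the place of real NLP features, a greedy grouping by Jaccard similarity takes the place of a proper clustering algorithm, and "nearest cluster" takes the place of a trained classifier. The function names, the `STOPWORDS` set, the `threshold` value, and the hand-assigned cluster labels are all assumptions for illustration.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"in", "the", "a", "of", "on", "for", "and", "is", "at", "to"}


def tokenize(title):
    """Step 1 (NLP stand-in): lowercase the title and keep non-stopword tokens."""
    return {word for word in re.findall(r"[a-z0-9]+", title.lower())
            if word not in STOPWORDS}


def jaccard(a, b):
    """Share of tokens that two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_titles(titles, threshold=0.3):
    """Step 2: greedily put each title into the first cluster it resembles."""
    clusters = []  # each cluster holds its combined token set and member titles
    for title in titles:
        tokens = tokenize(title)
        for cluster in clusters:
            if jaccard(tokens, cluster["tokens"]) >= threshold:
                cluster["titles"].append(title)
                cluster["tokens"] |= tokens
                break
        else:
            clusters.append({"tokens": set(tokens), "titles": [title]})
    return clusters


def predict(title, clusters):
    """Step 4 stand-in: give a new title the label of its most similar cluster."""
    tokens = tokenize(title)
    best = max(clusters, key=lambda cluster: jaccard(tokens, cluster["tokens"]))
    return best["label"]


titles = [
    "First Look At Dramatic New Rt. 18 Development In East Brunswick",
    "East Brunswick Plans Massive Development Project Along Rt. 18",
    "Most Expensive Home On The Market In East Brunswick",
    "Explore This Lakefront Home In East Brunswick",
]
clusters = cluster_titles(titles)

# Step 3: a person labels each cluster once (these labels are hypothetical).
clusters[0]["label"] = 1  # development news: important
clusters[1]["label"] = 0  # individual listings: not important

for cluster in clusters:
    print(cluster["label"], cluster["titles"])

new_title = "Sayreville: Massive Luxury Apt. Complex 'Riverton' May Be Coming"
print(predict(new_title, clusters))
```

The point of clustering first is that a person labels a handful of clusters instead of every individual title, and those labels then become training data.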

### NLP example
<img src="https://www.researchgate.net/profile/Pyry_Kettunen/publication/277248905/figure/fig4/AS:614337920106508@1523480982617/A-schematic-example-of-the-natural-language-processing-analysis-of-the-think-aloud.png" style="height:600px;">

### Train Model

<img src="https://drive.google.com/uc?id=1cVv3brTMrvrsoPYaulogpRDaP5uthRUo" style="height:600px;">

## Is there an easier way to get data?
### APIs (Application Programming Interfaces)
<img src="https://img.deusm.com/darkreading/MarilynCohodas/TraditionalVModern.jpg" style="height:400px;">
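
An API is easier to work with because it returns structured data, so no HTML parsing is needed. The sketch below parses a made-up JSON body; the `SAMPLE_RESPONSE` payload, its field names, its values, and the helper `parse_listings` are all hypothetical, since every real API defines its own schema.

```python
import json

# Hypothetical JSON body standing in for what a listings API might return.
# Field names and values are invented for illustration only.
SAMPLE_RESPONSE = """
{
  "listings": [
    {"city": "East Brunswick", "county": "Middlesex", "price": 450000},
    {"city": "Sayreville", "county": "Middlesex", "price": 380000}
  ]
}
"""


def parse_listings(body):
    """Turn the JSON text into a list of (county, city, price) tuples."""
    payload = json.loads(body)
    return [(item["county"], item["city"], item["price"])
            for item in payload["listings"]]


listings = parse_listings(SAMPLE_RESPONSE)
for county, city, price in listings:
    print(county, city, price)
```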

### Web tools with UI (User Interface)

<img src="https://www.import.io/wp-content/uploads/2017/04/import-io-selezionare-attributo-3-1.png" style="height:400px;">
<img src="https://www.import.io/wp-content/uploads/2017/04/import-io-pagina-tabellata-automaticamente-2-e1493125570920-1.png" style="height:400px;">

<a href="https://www.import.io/post/importio-extract-pricing-data-web-page/" target="_blank">Import IO Pricing Data Example</a>

## Future
- Other sources to extract data from:
  - municipality websites (e.g. permits)
  - local forums (e.g. Facebook local community groups)
  
  <img src="https://drive.google.com/uc?id=1pGdcp0zeNxpBgqdT9ToxdEWqrtq5Vd99" style="height:600px;">

# End Notebook