# Web Scraping Real Estate Data


## Goal of the notebook:
Collect data on houses listed for sale on Realtor.com.

The longer-term aim is to make predictions from this data.

## TO DO:

1) Add proper markdown cells to the notebook

2) Improve variable names and code structure

3) Check whether the data is accurate, using regex, manually, or visually

4) Try the techniques from the links listed below (go through all of them)

5) Follow an ML pipeline for prediction and build a model

6) Make a verdict

7) Write a report

8) Deploy the model (maybe using Flask or Django, or some other way)



### Sources Used and Unused:

###### The links lead to tutorials on web scraping (different methods) or on searching with regex

https://www.realtor.com/realestateandhomes-search/Palo-Alto_CA

https://www.youtube.com/watch?v=RvCBzhhydNk

https://medium.com/@raiyanquaium/how-to-web-scrape-using-beautiful-soup-in-python-without-running-into-http-error-403-554875e5abed#:~:text=This%20will%20result%20in%20a,security%20features%20to%20prevent%20bots.

https://youtu.be/iESyyogOkY0

https://www.geeksforgeeks.org/python-extract-words-from-given-string/

https://medium.com/quantrium-tech/extracting-words-from-a-string-in-python-using-regex-dac4b385c1b8

https://www.guru99.com/python-regular-expressions-complete-tutorial.html

https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/

https://youtu.be/Z7LEv7nHqqk

https://youtu.be/a3Cuq2csLWk

https://youtu.be/pzptMqULnyE

https://youtu.be/dRcvJRmqFHQ

https://youtu.be/FJLpUsRFT00

https://www.youtube.com/watch?v=CwMei_VhHHQ


### Web Scraping Code

In [1]:
from bs4 import BeautifulSoup  # web scraping package
from urllib.request import Request, urlopen  # send a request to a web page and read the HTML that comes back

In [2]:
url = "https://www.realtor.com/realestateandhomes-search/Palo-Alto_CA"


In [3]:
# a browser-like User-Agent header helps avoid the HTTP 403 error described in the Medium link above
request = Request(url, headers={'User-Agent':'Mozilla/5.0'})
#print(request)

In [4]:
webpage = urlopen(request).read()
#print(webpage)
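
The request above fails with an exception if the site refuses it or cannot be reached. A minimal sketch of the same fetch with error handling follows. The function name `fetch_html`, its `opener` parameter, and the 10 second timeout are assumptions made for illustration and are not part of the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, opener=urlopen, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    `opener` defaults to urlopen; it is a parameter only so that the
    function can be tried out without touching the network.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # No response at all (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

With this in place, `webpage = fetch_html(url)` would return `None` instead of raising, so the following cells can check for a missing page before parsing it.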

In [5]:
# create a BeautifulSoup object from the downloaded HTML
soup = BeautifulSoup(webpage, "html.parser")
#print(soup)

In [6]:
# find content in the soup by HTML tag and CSS class
# the keyword argument is "class_" rather than "class" because "class" is a reserved word in Python
find_price_content = soup.find_all("span", class_="rui__x3geed-0 kitA-dS") 
find_address_content = soup.find_all("div", class_="jsx-1489967104 address ellipsis srp-page-address srp-address-redesign")
find_space_content = soup.find_all("ul", class_="jsx-946479843 property-meta list-unstyled property-meta-srpPage")  # the text of the child elements comes out joined, with no spaces
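
Class names such as `jsx-946479843` look auto-generated, so the full class strings used above may stop matching when the site changes. As a hedged alternative, the sketch below selects on a single class with a CSS selector and keeps the child texts apart with a separator. The sample HTML is made up for this example and the real page structure may differ; the names `sample_html`, `sample_soup`, and `space_sample` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch; the real page may differ.
sample_html = """
<ul class="jsx-946479843 property-meta list-unstyled property-meta-srpPage">
  <li>2bed</li><li>1bath</li><li>660sqft</li><li>2,325sqft lot</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A CSS selector matches on one class, whatever other classes are present.
cards = sample_soup.select("ul.property-meta")

# A separator keeps the text of each child element apart
# instead of joining everything into one string.
space_sample = [card.get_text(separator="|", strip=True) for card in cards]
print(space_sample)  # ['2bed|1bath|660sqft|2,325sqft lot']
```

If the real markup resembles the sample, splitting on the separator would give the individual fields without any regex work.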


In [7]:
# collect the text of every matched element
price = [tag.text for tag in find_price_content]

address = [tag.text for tag in find_address_content]

space = [tag.text for tag in find_space_content]

In [8]:
# problem in space: the number of beds, the number of bathrooms, and the areas are joined together in one string
print(space)

['2bed1bath660sqft2,325sqft lot', '2bed1bath865sqft875sqft lot', '2bed2.5bath1,230sqft630sqft lot', '2bed2bath1,440sqft', '3bed1bath1,004sqft7,748sqft lot', '2bed3bath1,490sqft961sqft lot', '2bed1bath943sqft', '5bed5bath4,003sqft9,523sqft lot', '4bed2bath1,554sqft6,292sqft lot', '2bed2.5bath1,295sqft630sqft lot', '4bed2bath2,066sqft5,849sqft lot', '1bed1bath885sqft', '3bed2bath1,710sqft9,400sqft lot', '2bed2.5bath1,468sqft', '2bed1bath1,091sqft', '4bed3.5+bath4,540sqft1.02acre lot', '5bed4.5+bath5,042sqft8,755sqft lot', '3bed2bath1,583sqft6,500sqft lot', '1bed1bath876sqft', '2bed1bath946sqft', '2bed2bath773sqft2,000sqft lot', '6bed5bath3,456sqft7,670sqft lot', '5bed4bath2,871sqft5,000sqft lot', '5bed4bath3,072sqft7,605sqft lot', '6bed5.5bath3,853sqft6,382sqft lot', '4bed2bath1,639sqft7,084sqft lot', '4bed2bath1,664sqft6,504sqft lot', '3bed2.5bath1,494sqft', '4bed2.5bath2,697sqft8,508sqft lot', '4bed3bath2,410sqft6,380sqft lot', '2bed2bath1,015sqft', '2bed2.5bath1,968sqft0.24acre lot', 

In [9]:
def regex_check_for_space(values, match, original_string):
    """
    Append the matched text to values, or original_string when there is no match.
    The function exists mainly to avoid repeating this if/else for every field.
    """
    if match is None:
        values.append(original_string)
    else:
        values.append(match.group())
    return values

In [10]:
# cleaning the space data
# split each joined string into separate fields

# Regular Expression package
import re

beds = []
bath = []
area = []
area_lot = []
for entry in space:
    search_bed = re.search(pattern=r"[0-9]+bed", string=entry)
    regex_check_for_space(beds, search_bed, entry)

    # known limitation: a fractional value such as "2.5bath" is captured as "5bath",
    # and "3.5+bath" does not match at all
    search_bath = re.search(pattern=r"[0-9]+bath", string=entry)
    regex_check_for_space(bath, search_bath, entry)

    # the lot area is checked first because both the area and the lot area contain the word "sqft"
    # known limitation: lots given in acres do not match, so the whole string is stored instead
    search_area_lot = re.search(pattern=r"[0-9]+sqft\slot|[0-9]+,+[0-9]+sqft\slot", string=entry)
    regex_check_for_space(area_lot, search_area_lot, entry)

    search_area = re.search(pattern=r"[0-9]+sqft|[0-9]+,+[0-9]+sqft", string=entry)
    regex_check_for_space(area, search_area, entry)
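
The separate searches above run into trouble with strings like `2bed2.5bath1,230sqft630sqft lot` and `4bed3.5+bath4,540sqft1.02acre lot` from the printed `space` list. One possible alternative is a single pattern with named groups, sketched below. The pattern, the names `NUMBER`, `SPACE_PATTERN`, and `parse_space`, and the choice to return `None` for missing fields are all assumptions based only on the strings shown in this notebook, not a tested parser for the site.

```python
import re

# A number with optional thousands separators and decimals, e.g. 2, 2.5, 1,230
NUMBER = r"\d[\d,]*(?:\.\d+)?"

# Assumed layout: beds, baths, optional area, optional lot (sqft or acre).
SPACE_PATTERN = re.compile(
    rf"(?P<beds>{NUMBER})bed"
    rf"(?P<baths>{NUMBER}\+?)bath"
    rf"(?:(?P<area>{NUMBER})sqft)?"
    rf"(?:(?P<lot>{NUMBER})(?P<lot_unit>sqft|acre) lot)?"
)


def parse_space(text):
    """Return a dict of fields, or None if the string has an unexpected layout.

    Fields that are absent from the string (for example the lot) are None.
    """
    match = SPACE_PATTERN.fullmatch(text)
    return match.groupdict() if match else None


print(parse_space("2bed2.5bath1,230sqft630sqft lot"))
print(parse_space("4bed3.5+bath4,540sqft1.02acre lot"))
print(parse_space("2bed2bath1,440sqft"))
```

Using `fullmatch` means a string with an unexpected layout yields `None` rather than a partial result, which makes odd listings easy to spot.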

In [11]:
# printing them together
for i in zip(price, address, space, beds, bath, area, area_lot):
    print(i)

('$1,599,000', '736 Homer Ave, Palo Alto, CA 94301', '2bed1bath660sqft2,325sqft lot', '2bed', '1bath', '660sqft', '2,325sqft lot')
('$1,098,000', '280 Waverley St, Palo Alto, CA 94301', '2bed1bath865sqft875sqft lot', '2bed', '1bath', '865sqft', '875sqft lot')
('$1,380,000', '2585 Park Blvd Apt Z206, Palo Alto, CA 94306', '2bed2.5bath1,230sqft630sqft lot', '2bed', '5bath', '1,230sqft', '630sqft lot')
('$1,790,000', '101 Alma St Apt 805, Palo Alto, CA 94301', '2bed2bath1,440sqft', '2bed', '2bath', '1,440sqft', '2bed2bath1,440sqft')
('$1,795,000', '3109 Maddux Dr, Palo Alto, CA 94303', '3bed1bath1,004sqft7,748sqft lot', '3bed', '1bath', '1,004sqft', '7,748sqft lot')
('$1,850,000', '685 High St Apt 5F, Palo Alto, CA 94301', '2bed3bath1,490sqft961sqft lot', '2bed', '3bath', '1,490sqft', '961sqft lot')
('$949,000', '4250 El Camino Real Apt A307, Palo Alto, CA 94306', '2bed1bath943sqft', '2bed', '1bath', '943sqft', '2bed1bath943sqft')
('$9,500,000', '2111 Barbara Dr, Palo Alto, CA 94303', '5b

In [12]:
# load the lists into a pandas DataFrame
import pandas as pd 

df = pd.DataFrame(list(zip(price, address, space, beds, bath, area, area_lot)))
print(df.head())

            0                                             1  \
0  $1,599,000            736 Homer Ave, Palo Alto, CA 94301   
1  $1,098,000          280 Waverley St, Palo Alto, CA 94301   
2  $1,380,000  2585 Park Blvd Apt Z206, Palo Alto, CA 94306   
3  $1,790,000      101 Alma St Apt 805, Palo Alto, CA 94301   
4  $1,795,000           3109 Maddux Dr, Palo Alto, CA 94303   

                                 2     3      4          5                   6  
0    2bed1bath660sqft2,325sqft lot  2bed  1bath    660sqft       2,325sqft lot  
1      2bed1bath865sqft875sqft lot  2bed  1bath    865sqft         875sqft lot  
2  2bed2.5bath1,230sqft630sqft lot  2bed  5bath  1,230sqft         630sqft lot  
3               2bed2bath1,440sqft  2bed  2bath  1,440sqft  2bed2bath1,440sqft  
4  3bed1bath1,004sqft7,748sqft lot  3bed  1bath  1,004sqft       7,748sqft lot  
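
All columns in the frame above are still strings, and row 3 shows the whole `space` string in the lot column because that listing has no lot. Before any modelling, the values would need to become numbers. The sketch below shows one way to do that; the helper name `to_number` and its `suffix` parameter are hypothetical, and it assumes the formats printed above.

```python
import re


def to_number(text, suffix=""):
    """Convert strings like '$1,599,000' or '660sqft' to a number.

    Returns None when the text is not exactly an optional '$', a number,
    and the expected suffix, so fallback values are treated as missing.
    """
    pattern = rf"\$?(\d[\d,]*(?:\.\d+)?){re.escape(suffix)}"
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return float(digits) if "." in digits else int(digits)


print(to_number("$1,599,000"))                      # 1599000
print(to_number("660sqft", "sqft"))                 # 660
print(to_number("2,325sqft lot", "sqft lot"))       # 2325
print(to_number("2bed2bath1,440sqft", "sqft lot"))  # None
```

In pandas, a helper like this could be applied to a column with `Series.map`, for example on column `0` for the price.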


### Export to CSV

In [13]:
# export to a file; sep='\t' makes it tab-separated despite the .csv extension
df.to_csv('Palo Alto houses form realtor.csv', sep='\t')
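
Because the file is written with `sep='\t'`, anything that reads it back has to use a tab delimiter too, and the columns are only numbered `0` to `6`. The sketch below shows the same kind of export with the standard `csv` module and a header row. The column names are hypothetical, and an in-memory buffer stands in for a real file so the example has no side effects.

```python
import csv
import io

# Hypothetical column names; the notebook's DataFrame uses the default 0..6.
columns = ["price", "address", "space", "beds", "bath", "area", "area_lot"]

# One row copied from the output printed earlier in the notebook.
rows = [
    ("$1,599,000", "736 Homer Ave, Palo Alto, CA 94301",
     "2bed1bath660sqft2,325sqft lot", "2bed", "1bath", "660sqft", "2,325sqft lot"),
]

# io.StringIO stands in for open("some_file.tsv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(columns)
writer.writerows(rows)

# Reading the data back needs the same delimiter.
buffer.seek(0)
records = list(csv.DictReader(buffer, delimiter="\t"))
print(records[0]["price"], "|", records[0]["address"])
```

A tab delimiter is convenient here because the addresses and prices contain commas.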