# The Python Mega Course: Build 10 Real World Applications
---

This notebook is a summary of [The Python Mega Course: Build 10 Real World Applciations](https://www.udemy.com/the-python-mega-course), a comprehensive online Python course taught by Ardit Sulce. Each lecture name is clickable and takes you to the video lecture in the course.

# Section 19: Application 7: Scrape Real Estate Property Data from the Web
***

**Lecture:** [Program Demonstration](https://www.udemy.com/the-python-mega-course/learn/v4/t/lecture/9439078?start=0)
---

This video lecture shows the finished version of the website running on a browser.

**Lecture:** [Loading the Webpage in Python](https://www.udemy.com/the-python-mega-course/learn/v4/t/lecture/9439078?start=0)
---

This code loads the webpage source code into Python ready for extracting information from it.

In [4]:
import requests, re
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/", headers=headers)
c = r.content

soup=BeautifulSoup(c, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<!-- saved from url=(0110)http://web.archive.org/web/20160127020422/http://www.century21.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS -->
<html lang="en" style="margin: 0px;overflow:hidden">
 <script async="" src="./LCWYROCKSPRINGS1_files/beacon.js">
 </script>
 <script src="chrome-extension://pkljnnogdmlajgaoodihioopfdkpgjgg/Kernel.js?0.3685073930846756">
 </script>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <script src="./LCWYROCKSPRINGS1_files/analytics.js" type="text/javascript">
  </script>
  <script type="text/javascript">
   archive_analytics.values.server_name="wwwb-app17.us.archive.org";archive_analytics.values.server_ms=227;
  </script>
  <link href="./LCWYROCKSPRINGS1_files/banner-styles.css" rel="stylesheet" type="text/css"/>
  <title>
   Rock Springs Real Estate | Find Houses &amp; Homes for Sale in Rock Springs, WY
  </title>
  <meta content="Rock Springs Real Estate | Find Houses &amp; Homes for Sale in Rock Spring

**Lecture:** [Extracting "div" Tags](https://www.udemy.com/the-python-mega-course/learn/v4/t/lecture/9439078?start=0)
---

We start extracting HTML tags starting from `div` tags.

In [5]:
import requests, re
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/", headers=headers)
c = r.content

soup = BeautifulSoup(c,"html.parser")

all = soup.find_all("div", {"class":"propertyRow"})
all[0].find("h4", {"class":"propPrice"}).text.replace("\n", "").replace(" ", "")

'$725,000'

**Lecture:** [Extracting Addresses and Property Details](https://www.udemy.com/the-python-mega-course/learn/v4/t/lecture/9439078?start=0)
---

Most of the data are stored inside `span` tags so we extract those data in this code.

In [6]:
import requests, re
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/", headers=headers)
c = r.content

soup = BeautifulSoup(c,"html.parser")

all = soup.find_all("div", {"class":"propertyRow"})
all[0].find("h4", {"class":"propPrice"}).text.replace("\n", "").replace(" ", "")

'$725,000'

In [7]:
for item in all:
    print(item.find("h4", {"class", "propPrice"}).text.replace("\n","").replace(" ", ""))
    print(item.find_all("span", {"class","propAddressCollapse"})[0].text)
    print(item.find_all("span", {"class","propAddressCollapse"})[1].text)

    try:
        print(item.find("span", {"class", "infoBed"}).find("b").text)
    except:
        print(None)

    try:
        print(item.find("span", {"class", "infoSqFt"}).find("b").text)
    except:
        print(None)

    try:
        print(item.find("span", {"class", "infoValueFullBath"}).find("b").text)
    except:
        print(None)

    try:
        print(item.find("span", {"class", "infoValueHalfBath"}).find("b").text)
    except:
        print(None)
        
    print(" ")

$725,000
0 Gateway
Rock Springs, WY 82901
None
None
None
None
 
$452,900
1003 Winchester Blvd.
Rock Springs, WY 82901
4
None
4
None
 
$396,900
600 Talladega
Rock Springs, WY 82901
5
3,154
3
None
 
$389,900
3239 Spearhead Way
Rock Springs, WY 82901
4
3,076
3
1
 
$254,000
522 Emerald Street
Rock Springs, WY 82901
3
1,172
3
None
 
$252,900
1302 Veteran's Drive
Rock Springs, WY 82901
4
1,932
2
None
 
$210,000
1021 Cypress Cir
Rock Springs, WY 82901
4
1,676
3
None
 
$209,000
913 Madison Dr
Rock Springs, WY 82901
3
1,344
2
None
 
$199,900
1344 Teton Street
Rock Springs, WY 82901
3
1,920
2
None
 
$196,900
4 Minnies Lane
Rock Springs, WY 82901
3
1,664
2
None
 


**Lecture:** [Extracting Elements without Unique Identifiers](https://www.udemy.com/the-python-mega-course/learn/v4/t/lecture/9439078?start=0)
---

Here we extract some more elements.

In [17]:
for item in all:
    print(item.find("h4", {"class", "propPrice"}).text.replace("\n","").replace(" ", ""))
    print(item.find_all("span", {"class","propAddressCollapse"})[0].text)
    print(item.find_all("span", {"class","propAddressCollapse"})[1].text)

    try:
        print(item.find("span", {"class", "infoBed"}).find("b").text)
    except:
        print(None)

    try:
        print(item.find("span", {"class", "infoSqFt"}).find("b").text)
    except:
        print(None)

    try:
        print(item.find("span", {"class", "infoValueFullBath"}).find("b").text)
    except:
        print(None)

    try:
        print(item.find("span", {"class", "infoValueHalfBath"}).find("b").text)
    except:
        print(None)
        
    for column_group in item.find_all("div", {"class":"columnGroup"}):
        for feature_group, feature_name in zip(column_group.find_all("span", {"class":"featureGroup"}), column_group.find_all("span", {"class":"featureName"})):
            if "Lot Size" in feature_group.text:
                print(feature_name.text)

    print(" ")

**Lecture:** [Saving the Extracted Data in CSV Files](https://www.udemy.com/the-python-mega-course/learn/v4/t/lecture/9439078?start=0)
---

Finally, we save the extracted data into a CSV file.

In [18]:
import requests, re
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/", headers=headers)
c = r.content


soup = BeautifulSoup(c,"html.parser")

all = soup.find_all("div",{"class":"propertyRow"})

all[0].find("h4", {"class":"propPrice"}).text.replace("\n", "").replace(" ", "")

'$725,000'

In [19]:
l = []
for item in all:
    d = {}
    df["Address"] = item.find_all("span", {"class", "propAddressCollapse"})[0].text
    df["Locality"] = item.find_all("span", {"class", "propAddressCollapse"})[1].text
    df["Price"] = item.find("h4", {"class", "propPrice"}).text.replace("\n","").replace(" ", "")
    
    try:
        d["Beds"] = item.find("span", {"class", "infoBed"}).find("b").text
    except:
        d["Beds"] = None

    try:
        d["Area"] = item.find("span", {"class", "infoSqFt"}).find("b").text
    except:
        d["Area"] = None

    try:
        d["Full Baths"] = item.find("span", {"class", "infoValueFullBath"}).find("b").text
    except:
        d["Full Baths"] = None

    try:
        d["Half Baths"] = item.find("span", {"class", "infoValueHalfBath"}).find("b").text
    except:
        d["Half Baths"] = None

    for column_group in item.find_all("div", {"class":"columnGroup"}):
        for feature_group, feature_name in zip(column_group.find_all("span", {"class":"featureGroup"}), column_group.find_all("span", {"class":"featureName"})):
            if "Lot Size" in feature_group.text:
                print(feature_name.text)
                d["Lot Size"] = feature_name.text
    l.append(d)

0.21 Acres
Under 1/2 Acre, 
Under 1/2 Acre, 
0.27 Acres
Under 1/2 Acre, 
Under 1/2 Acre, 
Under 1/2 Acre, 
2.02 Acres


In [20]:
import pandas
df = pandas.DataFrame(l)
df

Unnamed: 0,Area,Beds,Full Baths,Half Baths,Lot Size
0,,,,,
1,,4.0,4.0,,0.21 Acres
2,3154.0,5.0,3.0,,
3,3076.0,4.0,3.0,1.0,"Under 1/2 Acre,"
4,1172.0,3.0,3.0,,"Under 1/2 Acre,"
5,1932.0,4.0,2.0,,0.27 Acres
6,1676.0,4.0,3.0,,"Under 1/2 Acre,"
7,1344.0,3.0,2.0,,"Under 1/2 Acre,"
8,1920.0,3.0,2.0,,"Under 1/2 Acre,"
9,1664.0,3.0,2.0,,2.02 Acres


In [13]:
df.to_csv("Output.csv")

**Lecture:** [Crawling Through Webpages](https://www.udemy.com/the-python-mega-course/learn/v4/t/lecture/9439078?start=0)
---

In case you need to extract data from multiple pages, here is how to do it.

In [38]:
import requests, re
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/", headers=headers)
c = r.content


soup = BeautifulSoup(c,"html.parser")

all = soup.find_all("div",{"class":"propertyRow"})

all[0].find("h4", {"class":"propPrice"}).text.replace("\n", "").replace(" ", "")

page_nr = soup.find_all("a",{"class":"Page"})[-1].text
print(page_nr, "number of pages were found")

l = []
base_url = "http://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/t=0&s="
for page in range(0, int(page_nr)*10, 10):
    print( )
    r = requests.get(base_url + str(page) + ".html", headers=headers)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all("div", {"class":"propertyRow"})
    for item in all:
        d = {}
        d["Address"] = item.find_all("span", {"class","propAddressCollapse"})[0].text
        
        try:
            d["Locality"] = item.find_all("span",{"class","propAddressCollapse"})[1].text
        except:
            d["Locality"] = None
        d["Price"] = item.find("h4", {"class", "propPrice"}).text.replace("\n","").replace(" ", "")
        
        try:
            d["Beds"] = item.find("span", {"class", "infoBed"}).find("b").text
        except:
            d["Beds"] = None

        try:
            d["Area"] = item.find("span", {"class", "infoSqFt"}).find("b").text
        except:
            d["Area"] = None
    
        try:
            d["Full Baths"] = item.find("span", {"class", "infoValueFullBath"}).find("b").text
        except:
            d["Full Baths"] = None

        try:
            d["Half Baths"] = item.find("span", {"class", "infoValueHalfBath"}).find("b").text
        except:
            d["Half Baths"] = None
        
        for column_group in item.find_all("div", {"class":"columnGroup"}):
            for feature_group, feature_name in zip(column_group.find_all("span", {"class":"featureGroup"}), column_group.find_all("span", {"class":"featureName"})):
                if "Lot Size" in feature_group.text:
                    print(feature_name.text)
                    d["Lot Size"] = feature_name.text
        l.append(d)

3 number of pages were found

0.21 Acres
Under 1/2 Acre, 
Under 1/2 Acre, 
0.27 Acres
Under 1/2 Acre, 
Under 1/2 Acre, 
Under 1/2 Acre, 
2.02 Acres


5 Acres
0.7 Acres
3 Acres
Under 1/2 Acre
2.35 Acres
2.05 Acres
0.73 Acres
0.31 Acres


**Lecture:** [Final Code of Application 7]()
---

This is the final code. It accesses a webpage and it extracts data from that webpage and save those data in a CSV file.

**Note**: You need internet connection for the code to work.

In [43]:
import requests, re
from bs4 import BeautifulSoup
import pandas

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/", headers=headers)
c = r.content

soup = BeautifulSoup(c,"html.parser")

all = soup.find_all("div",{"class":"propertyRow"})

all[0].find("h4", {"class":"propPrice"}).text.replace("\n", "").replace(" ", "")

page_nr = soup.find_all("a",{"class":"Page"})[-1].text
print(page_nr, "number of pages were found")

l = []
base_url = "http://www.pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/t=0&s="
for page in range(0, int(page_nr)*10, 10):
    print( )
    r = requests.get(base_url + str(page) + ".html", headers=headers)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find_all("div", {"class":"propertyRow"})
    for item in all:
        d = {}
        d["Address"] = item.find_all("span", {"class","propAddressCollapse"})[0].text
        
        try:
            d["Locality"] = item.find_all("span",{"class","propAddressCollapse"})[1].text
        except:
            d["Locality"] = None
        d["Price"] = item.find("h4", {"class", "propPrice"}).text.replace("\n","").replace(" ", "")
        
        try:
            d["Beds"] = item.find("span", {"class", "infoBed"}).find("b").text
        except:
            d["Beds"] = None

        try:
            d["Area"] = item.find("span", {"class", "infoSqFt"}).find("b").text
        except:
            d["Area"] = None
    
        try:
            d["Full Baths"] = item.find("span", {"class", "infoValueFullBath"}).find("b").text
        except:
            d["Full Baths"] = None

        try:
            d["Half Baths"] = item.find("span", {"class", "infoValueHalfBath"}).find("b").text
        except:
            d["Half Baths"] = None
        
        for column_group in item.find_all("div", {"class":"columnGroup"}):
            for feature_group, feature_name in zip(column_group.find_all("span", {"class":"featureGroup"}), column_group.find_all("span", {"class":"featureName"})):
                if "Lot Size" in feature_group.text:
                    print(feature_name.text)
                    d["Lot Size"] = feature_name.text
        l.append(d)
        
df = pandas.DataFrame(l)
df.to_csv("Output.csv")

3 number of pages were found

0.21 Acres
Under 1/2 Acre, 
Under 1/2 Acre, 
0.27 Acres
Under 1/2 Acre, 
Under 1/2 Acre, 
Under 1/2 Acre, 
2.02 Acres


5 Acres
0.7 Acres
3 Acres
Under 1/2 Acre
2.35 Acres
2.05 Acres
0.73 Acres
0.31 Acres


In [44]:
!open Output.csv