# Data Collection

## Introduction

This notebook shows how data on properties in Lagos were collected. The resulting datasets are divided into two groups: sales and rents.

Specifically, the following steps will be taken:

1. **Exploring the website** - inspecting the page structure to locate the listing details.
2. **Extracting the data** - downloading the listings and turning them into pandas DataFrames.
3. **Organizing the data into CSV files**

#### Exploring the website

In [1]:
import requests
from bs4 import BeautifulSoup as soup
import numpy as np
import pandas as pd

In [2]:
rent_link = "https://www.privateproperty.com.ng/flats-apartments-for-rent/lagos/serviced"
sales_link = "https://www.privateproperty.com.ng/flats-apartments-for-sale/lagos/serviced"

To understand the page structure, we start with the first page from rent_link.

In [3]:

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}
# send browser-like headers so the request looks like it comes from a web browser
with requests.get(rent_link, headers=header) as property_html:  # HTTP response for the page
    property_page = soup(property_html.content, "html.parser")  # parse the returned HTML


Using the browser's inspect tool, one finds that the details of each property are contained in a div tag whose class attribute is "item-body table-cell".

In [4]:
# container for details
containers = property_page.findAll("div", {"class": "item-body table-cell"})

In [5]:
print(containers[0].prettify())

<div class="item-body table-cell">
 <div class="amenities-grid align-items-start">
  <div class="body-left table-cell">
   <div class="info-row calc d-lg-none d-md-none">
   </div>
   <div class="info-row price">
    <a class="item-price" href="/listings/1-bed-mini-serviced-flat-apartment-for-rent-ikeja-gra-off-mobolaji-bank-anthony-ikeja-lagos-ikeja-g-r-a-tosp97236">
     N1,300,000 per year
    </a>
   </div>
  </div>
  <div class="body-right table-cell">
   <div class="info-row amenities">
    <p>
     <span class="h-beds">
      <i class="fa fa-bed">
      </i>
      1
     </span>
     <span class="h-baths">
      <i class="fa fa-bathtub">
      </i>
      1
     </span>
    </p>
   </div>
  </div>
  <ul class="actions favourite-action">
   <li>
    <span class="favourite-none listing-favourite btn-favourite" data-t-listing_category="Flat &amp; Apartment" data-t-listing_id="97236" data-t-listing_priority="gold" data-t-listing_type="For Rent" onclick="NewFrontEnd.ProcessFavourite(9

The HTML above shows where each detail sits, so individual fields can be pulled out by tag and class:

In [6]:
try:
    # price
    print("price :"+(containers[1].find("div",{"class":'body-left table-cell'}).text.strip()))
except:
    print("price : Unknown")
try:
    # property description
    print("Property description :" +containers[1].h2.find('a').text.strip())
except:
    print("Property Description : Unknown")
    
try:
    # number of bed rooms
    print("number of bed rooms :" +containers[1].find('span', {'class':'h-beds'}).text.strip())
except:
    print("number of bed rooms : Unknown")

try:
    # number of bathrooms
    print("number of bathrooms :" +containers[1].find('span', {'class':'h-baths'}).text.strip())
except:
    print("number of bathrooms : Unknown")

try:
    #Location
    print("Location :"+containers[1].find("div", {"class":"property-location"}).text)
except:
    print("Location : Unknown")
# try:
#     #Property adddress 
#     print("Property adddress  :"+containers[0].find("address", {"class":"property-address d-none d-sm-block"}).text.strip())
# except:
#     print("Property adddress  : Unknown")
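try:
    # estate agency (note: this is read from containers[0], the first listing,
    # while the fields above are read from containers[1])
    print("estate agency  :"+containers[0].find("span", {"class":"estate-agency-logo"}).a.find("img")["alt"])
except (AttributeError, TypeError, KeyError):
    print("estate agency  : Unknown")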

price :N25,000,000 per year
Property description :Exquisite And Well Maintained 3 & 4 Bedroom Apartment With Swimming Pool
number of bed rooms :3
number of bathrooms :5
Location :  Ocean Parade, Banana Island, Banana Island, Ikoyi
estate agency  :Tos Property Services
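
The repeated try/except blocks above all follow one pattern: attempt a lookup and fall back to a default when a tag is missing. A minimal sketch of that pattern as a reusable helper is shown below. The name `safe_extract` and the plain-dict `listing` are hypothetical stand-ins for illustration and are not part of the notebook; with BeautifulSoup the lambda would wrap a call such as the `find(...)` lookups used above.

```python
def safe_extract(getter, default="Unknown"):
    """Run getter() and return its result, or default if the lookup fails."""
    try:
        return getter()
    except (AttributeError, TypeError, KeyError, IndexError):
        # AttributeError: an attribute such as .text was read from None
        # TypeError: None was indexed; KeyError/IndexError: key or position absent
        return default


# Hypothetical stand-in for one parsed listing (a plain dict, not a BeautifulSoup tag)
listing = {"price": "  N1,300,000 per year  ", "agency": None}

price = safe_extract(lambda: listing["price"].strip())
agency = safe_extract(lambda: listing["agency"].strip())      # None has no .strip()
location = safe_extract(lambda: listing["location"].strip())  # key is missing

print("price :", price)
print("estate agency :", agency)
print("Location :", location)
```

Catching only the lookup-related exceptions, rather than using a bare except, keeps unrelated problems (for example a typo in a variable name) visible.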


Listing pages on a website are usually standardized and consistent, so a for loop can collect the same information from every container.

In [7]:
price=[]
property_description=[]
number_of_bedrooms=[]
number_of_bathrooms=[]
Location=[]
estate_agency=[]


for i in range(len(containers)):
    # price
    try:
        price.append(containers[i].find("div",{"class":'body-left table-cell'}).text.strip())
    except AttributeError:
        price.append(np.nan)
    
    # property description
    try:
        property_description.append(containers[i].h2.find('a').text.strip())
    except AttributeError:
        property_description.append(np.nan)
    
    # number of bedrooms
    try:
       
        number_of_bedrooms.append(containers[i].find('span', {'class':'h-beds'}).text.strip())
    except AttributeError:
        number_of_bedrooms.append(np.nan)
   
    #number of bathrooms
    try:
        number_of_bathrooms.append(containers[i].find('span', {'class':'h-baths'}).text.strip())
    except AttributeError:
        number_of_bathrooms.append(np.nan)


    #Location
    try:
        Location.append(containers[i].find("div", {"class":"property-location"}).text.strip())
    except AttributeError:
        Location.append(np.nan)
    
    
 
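    #estate_agency
    # a missing <img> makes find() return None (TypeError when indexed);
    # a missing alt attribute raises KeyError
    try:
        estate_agency.append(containers[i].find("span", {"class":"estate-agency-logo"}).a.find("img")["alt"].strip())
    except (AttributeError, TypeError, KeyError):
        estate_agency.append(np.nan)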
    
    
df=pd.DataFrame({"price":price,"property description":property_description, "number of bedrooms":number_of_bedrooms, 
                    "number of bathrooms": number_of_bathrooms,"Location":Location, "estate agency":estate_agency})
df.head() 

Unnamed: 0,price,property description,number of bedrooms,number of bathrooms,Location,estate agency
0,"N1,300,000 per year",Serviced 1 Bedroom Flat (Miniflat) with Swimmi...,1,1,"Ikeja Gra, Off Mobolaji Bank Anthony, Ikeja, L...",Tos Property Services
1,"N25,000,000 per year",Exquisite And Well Maintained 3 & 4 Bedroom Ap...,3,5,"Ocean Parade, Banana Island, Banana Island, Ikoyi",Ubosi Eleh And Co
2,"N7,000,000 per year",Furnished One Bedroom Apartment,1,1,"Banana Island, Ikoyi",Ifeanyi Igwebike
3,"N15,000,000 per year",3 bedroom flat with a bq,3,3,"Banana Island, Ikoyi",Verified properties Limited
4,"N8,000,000 per year",20 units 3 bedroom service and partially furni...,3,3,"Isaac John Street Gra Ikeja, Ikeja G.R.A, Ikeja",Loyalty Property


A quick view as a DataFrame gives a feel for what the output will look like once the other pages are combined with the first page.

The second page is "https://www.privateproperty.com.ng/flats-apartments-for-rent/lagos/serviced?page=2".
The only difference between the first page and subsequent pages is the "?page=page_no" suffix, so we can write a function to capture this.
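
As a small illustration of this URL pattern, the sketch below builds the list of page links for a given number of pages. The helper name `build_page_urls` is hypothetical and is not used elsewhere in this notebook; it only assumes that the first page is the bare link and that later pages append "?page=<n>", as observed above.

```python
def build_page_urls(base_url, no_of_pages):
    """Return the listing URLs for pages 1..no_of_pages."""
    if no_of_pages < 1:
        raise ValueError("no_of_pages must be at least 1")
    # The first page is the bare link; later pages append "?page=<n>".
    return [base_url] + [f"{base_url}?page={n}" for n in range(2, no_of_pages + 1)]


rent_link = "https://www.privateproperty.com.ng/flats-apartments-for-rent/lagos/serviced"
page_urls = build_page_urls(rent_link, 3)
for url in page_urls:
    print(url)
```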

In [8]:
def get_pivatepropertywebpage (url):
    """
    Download one privateproperty.com.ng listing page and return the
    div containers that hold the details of each property.
    """
    from bs4 import BeautifulSoup as soup
    import requests
    
    
    header = {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
      "X-Requested-With": "XMLHttpRequest"
    }
   
    # send browser-like headers so the request looks like it comes from a web browser
    with requests.get(url, headers=header) as page_htmlclient:  # HTTP response for the page
        page_html = soup(page_htmlclient.content, "html.parser")  # parse the returned HTML
    
    # container for details
    property_container = page_html.findAll("div", {"class": "item-body table-cell"})
    return property_container

In [9]:
def get_properties_info(property_container):
    price=[]
    property_description=[]
    number_of_bedrooms=[]
    number_of_bathrooms=[]
    Location=[]
    estate_agency=[]

    for i in range(len(property_container)):
        # price
        try:
            price.append(property_container[i].find("div",{"class":'body-left table-cell'}).text.strip())
        except AttributeError:
            price.append(np.nan)

        # property description
        try:
            property_description.append(property_container[i].h2.find('a').text.strip())
        except AttributeError:
            property_description.append(np.nan)

        # number of bedrooms
        try:

            number_of_bedrooms.append(property_container[i].find('span', {'class':'h-beds'}).text.strip())
        except AttributeError:
            number_of_bedrooms.append(np.nan)

        #number of bathrooms
        try:
            number_of_bathrooms.append(property_container[i].find('span', {'class':'h-baths'}).text.strip())
        except AttributeError:
            number_of_bathrooms.append(np.nan)


        #Location
        try:
            Location.append(property_container[i].find("div", {"class":"property-location"}).text.strip())
        except AttributeError:
            Location.append(np.nan)



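        #estate_agency
        # a missing <img> makes find() return None (TypeError when indexed);
        # a missing alt attribute raises KeyError
        try:
            estate_agency.append(property_container[i].find("span", {"class":"estate-agency-logo"}).a.find("img")["alt"].strip())
        except (AttributeError, TypeError, KeyError):
            estate_agency.append(np.nan)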
    
    #convert info to dictionary
    property_info= {"price":price,"property description":property_description, "number of bedrooms":number_of_bedrooms, 
                     "number of bathrooms": number_of_bathrooms,"Location":Location, "estate agency":estate_agency}
    return property_info

In [10]:
def get_property_info_from_multiple_pages(page_link, no_of_pages=1):
    import time

    assert no_of_pages >= 1
    first_page = get_pivatepropertywebpage(page_link)
    property_dict = get_properties_info(first_page)

    # pages 2..no_of_pages differ from the first page only by the "?page=<n>" suffix
    for i in range(2, no_of_pages+1):
        time.sleep(1)  # pause between requests
        page = get_pivatepropertywebpage(page_link + f'?page={i}')
        page_dict = get_properties_info(page)
        for key, value in page_dict.items():
            property_dict[key].extend(value)
    return property_dict
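
The combining step in the function above extends each column list with the values from the next page. A minimal, network-free sketch of that step is shown below. The helper name `merge_page_dicts` is hypothetical, and the two small dictionaries are cut-down samples built from values shown earlier in this notebook.

```python
def merge_page_dicts(page_dicts):
    """Combine per-page {column: [values]} dictionaries into one dictionary."""
    merged = {}
    for page_dict in page_dicts:
        for key, values in page_dict.items():
            # create the column on first sight, then append this page's values
            merged.setdefault(key, []).extend(values)
    return merged


page_1 = {"price": ["N1,300,000 per year", "N25,000,000 per year"],
          "number of bedrooms": ["1", "3"]}
page_2 = {"price": ["N7,000,000 per year"],
          "number of bedrooms": ["1"]}

combined = merge_page_dicts([page_1, page_2])
print(combined["price"])
print(combined["number of bedrooms"])
```

Unlike the function above, which extends the first page's dictionary in place, this sketch builds a new dictionary and leaves the per-page inputs unchanged.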

In [11]:
rent=get_property_info_from_multiple_pages(rent_link,35)

In [12]:
sales = get_property_info_from_multiple_pages(sales_link,37)

In [13]:
rent_df=pd.DataFrame(rent)
sales_df=pd.DataFrame(sales)
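
Building a DataFrame from a dictionary of lists only works when every list has the same length, which is why each try/except in get_properties_info appends exactly one value (or NaN) per listing. The sketch below is one possible sanity check to run before the conversion; the helper names `column_lengths` and `is_rectangular` are hypothetical and the sample dictionaries are illustrative only.

```python
def column_lengths(columns):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in columns.items()}


def is_rectangular(columns):
    """True when all columns hold the same number of values."""
    return len(set(column_lengths(columns).values())) <= 1


good = {"price": ["N26,500,000", "N65,000,000"], "number of bedrooms": ["4", "3"]}
bad = {"price": ["N26,500,000", "N65,000,000"], "number of bedrooms": ["4"]}

print(is_rectangular(good), column_lengths(good))
print(is_rectangular(bad), column_lengths(bad))
```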

In [19]:
rent_df.head()

Unnamed: 0,price,property description,number of bedrooms,number of bathrooms,Location,estate agency
0,"N1,300,000 per year",Serviced 1 Bedroom Flat (Miniflat) with Swimmi...,1,1,"Ikeja Gra, Off Mobolaji Bank Anthony, Ikeja, L...",Tos Property Services
1,"N25,000,000 per year",Exquisite And Well Maintained 3 & 4 Bedroom Ap...,3,5,"Ocean Parade, Banana Island, Banana Island, Ikoyi",Ubosi Eleh And Co
2,"N7,000,000 per year",Furnished One Bedroom Apartment,1,1,"Banana Island, Ikoyi",Ifeanyi Igwebike
3,"N15,000,000 per year",3 bedroom flat with a bq,3,3,"Banana Island, Ikoyi",Verified properties Limited
4,"N8,000,000 per year",20 units 3 bedroom service and partially furni...,3,3,"Isaac John Street Gra Ikeja, Ikeja G.R.A, Ikeja",Loyalty Property


In [20]:
sales_df.head()

Unnamed: 0,price,property description,number of bedrooms,number of bathrooms,Location,estate agency
0,"N26,500,000",Newly Built 2 & 3 Bedroom Flat Apartments in S...,4,4,"Off Brown Road, Aguda , Surulere",Choice Property
1,"N65,000,000","Luxury 3 Bedroom Flat with Maid's Room, AC, Fi...",3,4,Along Water Corporation Drive Off Ligali Ayori...,Deluxe Residences Ltd
2,"N25,000,000",Newly Built 2 Bedroom Apartments,2,2,"Orchid Road, Chevron Drive, Lekki",JOA Homes
3,"N60,000,000",3 Bedroom Apartment With Bq,3,4,"The Dream Place, Lekki Phase 1, Lekki",
4,"N48,000,000",A 3 bedroom apartment,3,3,"Ikate, Ikate Elegushi, Lekki",Underwood Homes


In [24]:
sales_df.to_csv('Lagos Properties for sale.csv', index=False)
# to_csv writes plain CSV text, so the file gets a .csv extension rather than .xls
rent_df.to_csv('Lagos Properties for Rent.csv', index=False)
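
Prices and locations in these datasets contain commas, so the fields must be quoted for the file to read back correctly; CSV writers, including pandas' to_csv with its default settings, quote such fields. The sketch below illustrates the round trip with Python's standard csv module rather than pandas, using a temporary file and two sample rows taken from the sales output above; the file name "sample.csv" is arbitrary.

```python
import csv
import os
import tempfile

rows = [
    {"price": "N26,500,000", "number of bedrooms": "4",
     "Location": "Off Brown Road, Aguda , Surulere"},
    {"price": "N25,000,000", "number of bedrooms": "2",
     "Location": "Orchid Road, Chevron Drive, Lekki"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")

    # write: fields containing commas are quoted automatically
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # read back: every value returns as a string
    with open(path, newline="", encoding="utf-8") as handle:
        read_back = [dict(row) for row in csv.DictReader(handle)]

print(read_back == rows)
```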

The next notebook will focus on cleaning the generated datasets and on some exploratory data analysis (EDA).

#### Thank You