## Web Scraping and Data Analysis Project

In this Jupyter Notebook, we scrape listings of properties for sale and rent in Egypt from the Dubizzle website and save them so the data can be analyzed for insights into the Egyptian real estate market. The project is part of my TMG internship.


In [2]:
# Import necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

In [3]:
# Headers for the HTTP request
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}


<br>

### Define Functions for Data Extraction

We will define functions to extract various property details from the web page. Each function takes a BeautifulSoup object representing a web page and returns a specific property detail.
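Most of the detail functions in this section follow one pattern: find the row whose label matches, then strip the label from the row text. The sketch below shows that pattern in isolation. The helper name `get_detail` and the sample HTML are assumptions made for illustration; only the `b44ca0b3` class name comes from this notebook, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one detail row per <div>, a label <span> followed by a value <span>
SAMPLE_HTML = """
<div class="b44ca0b3"><span>Type</span><span>Apartment</span></div>
<div class="b44ca0b3"><span>Area (m²)</span><span>150</span></div>
<div class="b44ca0b3"><span>Bedrooms</span><span>3</span></div>
"""

def get_detail(soup, label, default=None):
    """Return the text that follows `label` in a detail row, or `default` if absent."""
    for row in soup.find_all("div", attrs={"class": "b44ca0b3"}):
        span = row.find("span")
        # Skip rows without a label span instead of raising AttributeError
        if span is not None and span.text == label:
            return row.text[len(label):]
    return default

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(get_detail(sample_soup, "Area (m²)"))  # prints 150
print(get_detail(sample_soup, "Level"))      # prints None
```

Slicing by `len(label)` avoids hard-coding a separate offset for every field, which is one possible way to keep the per-field functions short.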


In [4]:
# Function to get property type
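def get_type(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Type':
            return detail.text[4:]  # drop the 4-character label prefix
    return np.nan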

In [5]:
# Function to get property area (in square meters)
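def get_area(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Area (m²)':
            return detail.text[9:]  # drop the 9-character label prefix
    return None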

In [6]:
# Function to get the number of bedrooms
def get_bedrooms(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Bedrooms':
            return detail.text[8:]  # drop the 8-character label prefix
    return None

In [7]:
# Function to get the number of bathrooms
def get_bathrooms(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Bathrooms':
            return detail.text[9:]  # drop the 9-character label prefix
    return None

In [8]:
# Function to get the property price
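def get_price(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Price':
            return detail.text[5:]  # drop the 5-character label prefix
    return None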

In [9]:
# Function to get the down payment amount
def get_DownPayment(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Down Payment':
            return detail.text[12:]  # drop the 12-character label prefix
    return None

In [10]:
# Function to get the payment option (e.g., Cash or Installment)
def get_PaymentOption(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Payment Option':
            return detail.text[14:]  # drop the 14-character label prefix
    return None

In [11]:
# Function to get the delivery term
def get_DeliveryTerm(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Delivery Term':
            return detail.text[13:]  # drop the 13-character label prefix
    return None

In [12]:
# Function to get the property level
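def get_level(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Level':
            return detail.text[5:]  # drop the 5-character label prefix
    return None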

In [13]:
# Function to get the delivery date
def get_DeliveryDate(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Delivery Date':
            return detail.text[13:]  # drop the 13-character label prefix
    return None

In [14]:
# Function to get the compound information
def get_Compound(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Compound':
            return detail.text[8:]  # drop the 8-character label prefix
    return None

In [15]:
# Function to get the furnished status (Yes or No)
def get_furnished(soup):
    # Each detail row is a div whose first <span> holds the field label
    for detail in soup.find_all("div", attrs={'class': 'b44ca0b3'}):
        label = detail.find('span')
        if label is not None and label.text == 'Furnished':
            return detail.text[9:]  # drop the 9-character label prefix
    return None

In [16]:
# Function to get the real estate developer's name
def get_Developers(soup):
    c = soup.find("span", attrs={'class': '_6d5b4928 be13fe44'})
    if c is None:
        return np.nan
    return c.text

In [17]:
# Function to get the location of the property
def get_location(soup):
    c = soup.find("span", attrs={'aria-label': 'Location'})
    if c is None:
        return np.nan
    return c.text

In [18]:
# Function to get the property description
def get_description(soup):
    c = soup.find("h1", attrs={'class': 'a38b8112'})
    if c is None:
        return np.nan
    return c.text

In [19]:
# Function to get the property category
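def get_category(soup):
    crumbs = soup.find_all("a", attrs={'data-testid': 'breadcrumbSearchLink'})
    # The second breadcrumb holds the category; return NaN if the breadcrumbs are missing
    return crumbs[1].text if len(crumbs) > 1 else np.nan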

In [20]:
# Function to check if the property is featured (Yes or No)
def get_featured(soup):
    c = soup.find("span", attrs={'class': '_5e159053 be13fe44'})
    if c is None:
        return 'No'
    return 'Yes'


<br>


### Web Scraping Process

We'll perform the web scraping in several steps:
1. Fetch property links from multiple pages on Dubizzle.
2. Iterate through these links to extract property data.
3. Store the data in a dictionary.
4. Create a DataFrame from the dictionary.
5. Save the DataFrame as a CSV file.


In [None]:
# Base URL for property listings
URL_base = 'https://www.dubizzle.com.eg/en/properties/?page='

In [None]:
# Initialize an empty list to store links
links = []

# Visit listing pages 1 through 199
for n in range(1, 200):
    URL = URL_base + str(n)
    MainPage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(MainPage.content, "html.parser")

    # Each listing card on the page is a div with this class
    div_elements = soup.find_all("div", attrs={'class': 'a52608cc'})

    # Build the absolute URL of every card that contains a link
    for div_element in div_elements:
        anchor = div_element.find('a')
        if anchor is not None and anchor.get("href"):
            links.append('https://www.dubizzle.com.eg' + anchor.get("href"))
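The requests in this notebook are sent without a timeout, a status check, or a pause between pages. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html` and its parameters are assumptions, not part of the original notebook; `get` stands for any callable that behaves like `requests.get`.

```python
import time

def fetch_html(url, get, retries=3, delay=1.0, timeout=10):
    """Return the response body for `url`, or None if every attempt fails.

    `get` is any callable that accepts (url, timeout=...) and returns an
    object with `status_code` and `content`, as requests.get does.
    """
    for attempt in range(retries):
        try:
            response = get(url, timeout=timeout)
            if response.status_code == 200:
                return response.content
        except OSError:
            # requests' exceptions derive from OSError; count this as a failed attempt
            pass
        if attempt < retries - 1:
            time.sleep(delay)  # pause before the next attempt
    return None
```

In the notebook it could be called as `fetch_html(URL, functools.partial(requests.get, headers=HEADERS))`, skipping the page when the result is `None`.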


In [204]:
len(links)

8955

### Step 2: Extract and Store Property Data

Next we extract the details of each collected listing with the functions defined earlier and store them in a dictionary. We start by initializing that dictionary with one empty list per property field.


In [21]:
# Initialize a dictionary to store property data
property_dict = {
    "Type": [],
    "Category": [],
    "Price": [],
    "Real_Estate_Developer": [],
    "Location": [],
    "Compound": [],
    "Area": [],
    "Bedrooms": [],
    "Bathrooms": [],
    "Level": [],
    "Furnished": [],
    "Payment_Option": [],
    "Down_Payment": [],
    "Delivery_Date": [],
    "Delivery_Term": [],
    "Description": [],
    "Featured": []
}


<br>


### Step 3: Loop Through Property Links

We'll iterate through the property links, extract data using our defined functions, and append it to the dictionary.


In [226]:
# Loop through property links to extract property data
for link in links:
    MainPage = requests.get(link, headers=HEADERS)
    soup = BeautifulSoup(MainPage.content, "html.parser")
    
    # Extract property data and append to the dictionary
    property_dict['Price'].append(get_price(soup))
    property_dict['Area'].append(get_area(soup))
    property_dict['Type'].append(get_type(soup))
    property_dict['Bedrooms'].append(get_bedrooms(soup))
    property_dict['Bathrooms'].append(get_bathrooms(soup))
    property_dict['Down_Payment'].append(get_DownPayment(soup))
    property_dict['Payment_Option'].append(get_PaymentOption(soup))
    property_dict['Delivery_Term'].append(get_DeliveryTerm(soup))
    property_dict['Level'].append(get_level(soup))
    property_dict['Delivery_Date'].append(get_DeliveryDate(soup))
    property_dict['Compound'].append(get_Compound(soup))
    property_dict['Furnished'].append(get_furnished(soup))
    property_dict['Real_Estate_Developer'].append(get_Developers(soup))
    property_dict['Location'].append(get_location(soup))
    property_dict['Category'].append(get_category(soup))
    property_dict['Featured'].append(get_featured(soup))
    property_dict['Description'].append(get_description(soup))
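Appending to seventeen parallel lists works as long as every function returns a value for every listing; a single exception leaves the lists with unequal lengths. One alternative, sketched here under assumptions, builds one dictionary per listing from a column-to-function mapping. The name `scrape_record` is hypothetical, and plain dictionaries stand in for parsed pages so the example runs without network access.

```python
def scrape_record(soup, extractors):
    """Apply every extractor to one parsed page and return a single row dict."""
    record = {}
    for column, extract in extractors.items():
        try:
            record[column] = extract(soup)
        except (AttributeError, IndexError):
            # The markup this extractor expected was absent on the page
            record[column] = None
    return record

# Stand-in extractors; in the notebook these would be get_price, get_category, and so on
demo_extractors = {
    "Price": lambda page: page["price"],
    "Category": lambda page: page["crumbs"][1],
}
demo_pages = [
    {"price": "38500", "crumbs": ["Properties", "Commercial for Rent"]},
    {"price": "410000", "crumbs": []},  # breadcrumb missing on this page
]

rows = [scrape_record(page, demo_extractors) for page in demo_pages]
print(rows)
```

A list of row dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.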


<br>

### Step 4: Create a DataFrame

We'll create a DataFrame from the collected property data stored in the dictionary.


In [227]:
# Create a DataFrame from the property dictionary
df = pd.DataFrame.from_dict(property_dict)
df

Unnamed: 0,Type,Category,Price,Real_Estate_Developer,Location,Compound,Area,Bedrooms,Bathrooms,Level,Furnished,Payment_Option,Down_Payment,Delivery_Date,Delivery_Term,Description,Featured
0,Office Space,Commercial for Rent,38500,Abrag Two,"Sheikh Zayed, Giza",,55,,,,No,,38500,,,مكتب للايجار 55 م كابيتال بيزنيس بارك موقع ممي...,No
1,Apartment,Apartments & Duplex for Sale,9402000,Dlleni,"O West, 6th of October",,150,3,2,,Yes,Cash or Installment,470100,Ready to move,,"""Discover the Luxurious Lifestyle at O West Or...",No
2,Apartment,Apartments & Duplex for Sale,410000,New point,"Taj City, New Cairo",,156,3,2,,,Cash or Installment,,,,شقة للبيع 156م علي طريق السويس مباشرة في كمبون...,No
3,Apartment,Apartments & Duplex for Sale,1850000,اسلام حسني,"Sheikh Zayed, Giza",,70,2,1,1,No,Cash,1850000,Ready to move,Finished,شقة للبيع في كمبوند روضة زايد دور اول باسنسير ...,No
4,Town House,Vacation Homes for Sale,750000,A.B.G Real Estate,"Telal Sokhna, Ain Sukhna",,150,4,3,Ground,,Cash or Installment,750000,,Finished,تاون هاوس150م للبيع في تلال العين السخنه Telal,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8951,Apartment,Apartments & Duplex for Sale,1600000,شركه أمازون للتشطيبات والديكور والتسويق العقاري,"Madinaty, Cairo",,142,3,2,3,No,Installment,,,Finished,شقة للبيع اقساط في مدينتي 142م بمقدم مليون و60...,No
8952,Town House,Villas For Sale,6600000,Middlemen,"L’Avenir, Mostakbal City",,271,4,4,,No,Cash,6600000,Ready to move,Semi Finished,تاون هاوس كورنر بحري 271م استلام فوري كمبوند l...,No
8953,Apartment,Apartments & Duplex for Sale,4900000,Cayan Egypt,"Bloomfields, Mostakbal City",,167,3,3,2,No,Cash or Installment,10,soon,Semi Finished,شقة للبيع في بلوم فيلدز فرصة بمقدم 490الف شقة ...,No
8954,Apartment,Apartments & Duplex for Sale,2550000,Smart Step,"L’Avenir, Mostakbal City",,160,3,3,2,No,Cash,,Ready to move,Semi Finished,شقه مميزه للبيع 160m في lavenir بمدينة المستقبل,No
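Scraped values arrive as text, so numeric analysis needs a conversion step first. The sketch below assumes the values are strings that may contain thousands separators; the sample rows and the helper name `to_number` are made up for illustration.

```python
import pandas as pd

# Made-up rows; the real columns come from the scraped DataFrame
raw = pd.DataFrame({
    "Price": ["9,402,000", "410000", None],
    "Area": ["150", "156", "70"],
    "Bedrooms": ["3", None, "2"],
})

def to_number(column):
    """Drop thousands separators and parse as numbers; unparseable values become NaN."""
    cleaned = column.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

numeric = raw.apply(to_number)
print(numeric.dtypes)
```

Because `errors="coerce"` turns any non-numeric text into NaN, this should only be applied to columns that are meant to be numeric; a column such as `Level`, which contains values like `Ground`, would lose information.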


<br>

### Step 5: Save Data as a CSV File

Finally, we'll save the DataFrame as a CSV file for further analysis and use.

In [228]:
# Write the DataFrame to disk; the row index is saved as the first, unnamed column
df.to_csv('property_data_egypt.csv')
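By default `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column when the file is read back. The sketch below shows one way to avoid that, using a small made-up DataFrame and a temporary file. The `utf-8-sig` encoding is an optional choice that can help spreadsheet programs detect UTF-8 in the Arabic descriptions.

```python
import os
import tempfile

import pandas as pd

# Made-up sample with Arabic text, mirroring two of the scraped columns
sample = pd.DataFrame({
    "Type": ["Apartment", "Town House"],
    "Description": ["شقة للبيع", "تاون هاوس للبيع"],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")
    # index=False omits the row index; utf-8-sig prepends a byte-order mark
    sample.to_csv(path, index=False, encoding="utf-8-sig")
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(list(restored.columns))  # prints ['Type', 'Description']
```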
