# Daniel M Miranda

## Introduction
### Scraping an apartment rental website for Toronto to create a dataset including information such as 'Rental Price', 'Address' and 'Number of bedrooms'.
### The resulting dataset will be used in a final project whose objective is to decide the best apartment for a foreign student.

In [3]:
# Importing Libraries

from bs4 import BeautifulSoup
import pandas as pd
import requests

#### Testing a way to scrape information from the website

In [1]:
# Scraping data from an apartment rental website for Toronto (https://toronto.craigslist.org/search/apa)

url = 'https://toronto.craigslist.org/search/apa'

In [6]:
# Using BeautifulSoup as a tool to collect data from the website

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

In [8]:
result = soup.find('ul', class_='rows')
result

<ul class="rows">
<li class="result-row" data-pid="6905100603" data-repost-of="5674544352">
<a class="result-image gallery" data-ids="1:00m0m_hLIpTdfSmG6,1:00p0p_1iJMTqr7dNT,1:00T0T_fpnyDoLXCu5,1:00h0h_cJaMtU2DmNc,1:00R0R_3hJsuSvJgiZ,1:00707_8kfzAxwtIaR,1:00X0X_5RWefJVt4ek,1:00000_h4x2wltUuTr,1:00u0u_eZByvj8EouY,1:00l0l_lO2o3wipT3I,1:00303_f8nWsBPQ5Jf,1:00R0R_cBEkyd20TPK" href="https://toronto.craigslist.org/tor/apa/d/toronto-2-bedrooms-apt-available-on/6905100603.html">
<span class="result-price">$1800</span>
</a>
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2019-06-05 09:59" title="Wed 05 Jun 09:59:45 AM">Jun  5</time>
<a class="result-title hdrlnk" data-id="6905100603" href="https://toronto.craigslist.org/tor/apa/d/toronto-2-bedrooms-apt-available-on/6905100603.html">2 Bedrooms Apt Available on July 1st (Gerrard &amp; Broadview)</a>
<span class="result-meta">

In [None]:
# List of properties for rent. Includes number of bedrooms, rent price, address
rents = result.find_all('li',class_='result-row')

In [123]:
# rent price
rents[0].find('span', class_='result-price').text[1:]

'2200'

In [74]:
# Number of bedrooms
rents[0].find('span', class_='housing').text.strip('\n').strip(' ')[0:3]

'1br'
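
The slice `[0:3]` only works while the bedroom count is a single digit and sits at the very start of the text (a `10br` listing would come back as `10b`). A minimal sketch of a regex-based alternative, assuming the `housing` text looks like `1br - 700ft2 -`; the helper name `parse_bedrooms` and the sample strings are hypothetical:

```python
import re

def parse_bedrooms(housing_text):
    """Return the bedroom count as an int, or None when no 'Nbr' token is found."""
    match = re.search(r'(\d+)\s*br\b', housing_text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

# Made-up strings shaped like the text of the 'housing' span
print(parse_bedrooms('\n  1br -\n  700ft2 -\n'))
print(parse_bedrooms('10br - 3000ft2 -'))
print(parse_bedrooms('700ft2 -'))
```

Returning `None` instead of a string placeholder keeps the column numeric once it is loaded into pandas.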

In [83]:
# address
rents[4].find('span', class_='result-hood').text.strip(' (').strip(')')

'Lambton Baby Point Toronto'
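
The three lookups tested above can be gathered into one helper so that a missing span never raises an `AttributeError`. This is a sketch, not the notebook's actual code: the function name `parse_row` is hypothetical, and the HTML sample is made up to resemble the `result-row` markup shown earlier.

```python
from bs4 import BeautifulSoup

def parse_row(row):
    """Extract (address, price, bedrooms) from one result row, using 'NA' for missing fields."""
    hood = row.find('span', class_='result-hood')
    price = row.find('span', class_='result-price')
    housing = row.find('span', class_='housing')

    address = hood.text.strip(' (').strip(')') if hood is not None else 'NA'
    rent = price.text[1:] if price is not None else 'NA'   # drop the leading '$'
    bedrooms = housing.text.strip()[0:3] if housing is not None else 'NA'
    return address, rent, bedrooms

# Made-up rows: the second one has neither a neighbourhood nor housing details
sample = '''
<li class="result-row">
  <span class="result-price">$1800</span>
  <span class="housing"> 2br - </span>
  <span class="result-hood"> (Gerrard &amp; Broadview)</span>
</li>
<li class="result-row">
  <span class="result-price">$2200</span>
</li>
'''
rows = BeautifulSoup(sample, 'html.parser').find_all('li', class_='result-row')
parsed = [parse_row(row) for row in rows]
print(parsed)
```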

#### After working out how to scrape the important information from the website, it's time to scrape several pages to collect more rent data.
#### I'll be scraping 7 pages from a rental website (https://toronto.craigslist.org/search/apa)

In [10]:
# Pages to scrape
base_url = 'https://toronto.craigslist.org/search/apa'
page_results = [0,120,240,360,480,600,720]
# The first page has no offset; later pages pass the offset through the ?s= parameter
urls = [base_url if offset == 0 else base_url + '?s=' + str(offset) for offset in page_results]
urls


['https://toronto.craigslist.org/search/apa',
 'https://toronto.craigslist.org/search/apa?s=120',
 'https://toronto.craigslist.org/search/apa?s=240',
 'https://toronto.craigslist.org/search/apa?s=360',
 'https://toronto.craigslist.org/search/apa?s=480',
 'https://toronto.craigslist.org/search/apa?s=600',
 'https://toronto.craigslist.org/search/apa?s=720']

#### Collecting rental data from the pages

In [12]:
# Creating a list with all the information collected from the websites

list_rents = []

for site in urls:
    r = requests.get(site)
    soup = BeautifulSoup(r.text,'html.parser')
    result = soup.find('ul', class_='rows')
    rents = result.find_all('li',class_='result-row')
    
    for rent in rents:
        if rent.find('span', class_='result-hood') is None:
            address = 'NA'
        else:
            address = rent.find('span', class_='result-hood').text.strip(' (').strip(')')
        price = rent.find('span', class_='result-price').text[1:]
        if rent.find('span', class_='housing') is None:
            bdr = 'NA'
        else:
            bdr = rent.find('span', class_='housing').text.strip('\n').strip(' ')[0:3]
        list_rents.append((address,price,bdr))
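
The loop above sends its requests back to back and assumes every response contains a `ul` with class `rows`. A minimal sketch of a more defensive variant follows. The names `fetch_html` and `collect_rows`, the 10-second timeout, and the one-second pause are all hypothetical choices rather than part of the original run.

```python
import time

import requests
from bs4 import BeautifulSoup

def fetch_html(url, timeout=10):
    """Download one page; raises requests.HTTPError on a 4xx/5xx response."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text

def collect_rows(urls, fetch=fetch_html, delay=1.0):
    """Return every 'result-row' element found across the given pages."""
    rows = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between requests
        soup = BeautifulSoup(fetch(url), 'html.parser')
        result = soup.find('ul', class_='rows')
        if result is None:
            continue  # page without a listing block: skip it instead of crashing
        rows.extend(result.find_all('li', class_='result-row'))
    return rows
```

Passing the download function in as `fetch` lets the parsing be tried on saved HTML without touching the network, for example `collect_rows(['page1'], fetch=lambda url: saved_html, delay=0)`.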

In [13]:
print('Number of rents found: {}'.format(len(list_rents)))

Number of rents found: 840


In [17]:
# Creating a Dataframe from the information collected

columns_name = ['Address','Rent Price','Number of Bedrooms']
rent_list = pd.DataFrame(data = list_rents, columns = columns_name)
rent_list.tail(3)

| | Address | Rent Price | Number of Bedrooms |
|---|---|---|---|
| 837 | 1547 Dupont St | 2500 | 3br |
| 838 | 1080 Bay St | 3550 | 2br |
| 839 | 4655 Glen Erin Dr | 2500 | 2br |


In [18]:
#Checking dataframe types

rent_list.dtypes

Address               object
Rent Price            object
Number of Bedrooms    object
dtype: object

In [19]:
#Changing price from object to float

rent_list['Rent Price']= rent_list['Rent Price'].astype(float)
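
`astype(float)` raises a `ValueError` as soon as one value is not a plain number, for instance a price with a thousands separator. A hedged sketch of a more tolerant conversion with `pd.to_numeric`, shown on made-up values rather than the scraped data:

```python
import pandas as pd

# Made-up price strings, including a thousands separator and a placeholder
prices = pd.Series(['1800', '2,400', 'NA'])

# Remove '$' and ',' first, then turn anything still unparseable into NaN
cleaned = prices.str.replace(r'[$,]', '', regex=True)
numeric = pd.to_numeric(cleaned, errors='coerce')
print(numeric)
```

With `errors='coerce'` the bad rows survive as `NaN` and can be inspected or dropped afterwards with `dropna()`.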

In [20]:
rent_list.head(5)

| | Address | Rent Price | Number of Bedrooms |
|---|---|---|---|
| 0 | | 1800.0 | 2br |
| 1 | Wellington & Portland | 2400.0 | 1br |
| 2 | King West & Portland | 2350.0 | 1br |
| 3 | 35 Mariner Terr | 2250.0 | 1br |
| 4 | Toronto | 2550.0 | 2br |


In [21]:
# Removing rows without Number of bedrooms

df = rent_list[rent_list['Number of Bedrooms']!='NA']

In [22]:
# Removing rows without Address

df = df[df['Address']!='NA']
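
Both conditions can also be applied in a single step with boolean masks, which avoids one filter silently overwriting the other. A minimal sketch on made-up rows (the frame `toy` and the mask names are hypothetical); it also treats an empty address as missing, which is an assumption on my part:

```python
import pandas as pd

# Made-up rows using the same column names as rent_list
toy = pd.DataFrame({
    'Address': ['1080 Bay St', 'NA', '', 'Toronto'],
    'Rent Price': [3550.0, 2000.0, 1800.0, 2550.0],
    'Number of Bedrooms': ['2br', '1br', '2br', 'NA'],
})

has_address = ~toy['Address'].isin(['NA', ''])       # 'NA' or empty counts as missing
has_bedrooms = toy['Number of Bedrooms'] != 'NA'

# Keep only the rows that pass both checks
clean = toy[has_address & has_bedrooms].reset_index(drop=True)
print(clean)
```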

In [23]:
# Skipping the first row (index 0), whose address is empty
dataset = rent_list.iloc[1:]
dataset.head(5)

| | Address | Rent Price | Number of Bedrooms |
|---|---|---|---|
| 1 | Wellington & Portland | 2400.0 | 1br |
| 2 | King West & Portland | 2350.0 | 1br |
| 3 | 35 Mariner Terr | 2250.0 | 1br |
| 4 | Toronto | 2550.0 | 2br |
| 5 | 228 Queens Quay West At Lower Simcoe Street M5... | 2200.0 | 1br |


In [24]:
dataset = dataset[dataset['Address']!='NA']

In [25]:
dataset.tail(5)

| | Address | Rent Price | Number of Bedrooms |
|---|---|---|---|
| 835 | 8 Fieldway Rd | 2800.0 | 2br |
| 836 | 102 Auburn Ave | 2895.0 | 4br |
| 837 | 1547 Dupont St | 2500.0 | 3br |
| 838 | 1080 Bay St | 3550.0 | 2br |
| 839 | 4655 Glen Erin Dr | 2500.0 | 2br |


In [27]:
dataset.describe()

| | Rent Price |
|---|---|
| count | 711.0 |
| mean | 2693.361463 |
| std | 1451.598905 |
| min | 1.0 |
| 25% | 2000.0 |
| 50% | 2489.0 |
| 75% | 3100.0 |
| max | 15995.0 |


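#### Observation:
Possible problem: listings on the website do not follow a consistent pattern, which affects the consistency of the collected data. For example, the minimum rent price is 1.0, and some addresses are only a city name such as 'Toronto'.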
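
One way to limit the effect of such entries is to keep only prices inside a plausible range. The sketch below uses values taken from the `describe()` output above; the bounds `LOW` and `HIGH` are hypothetical and would need to be chosen for the real project.

```python
import pandas as pd

# Values from the describe() summary: min, 25%, 50% and max
toy = pd.DataFrame({'Rent Price': [1.0, 2000.0, 2489.0, 15995.0]})

LOW, HIGH = 500.0, 10000.0  # hypothetical bounds for a plausible monthly rent

# between() is inclusive on both ends by default
plausible = toy[toy['Rent Price'].between(LOW, HIGH)]
print(plausible)
```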