<img src="https://i.ibb.co/wShF3QK/toronto-154805.jpg"/>

# Ontario house rent prediction
***
### Can you suggest house rents in Ontario cities to newcomers?

**Finding a house to rent in Ontario, especially in the GTA, keeps getting harder**, particularly for new immigrants. Based on their preferences, let's try to predict rents in different cities.

**Kijiji** is an online classified advertising service that operates as a centralized network of online communities, organized by city and urban region, for posting local advertisements.

In this project I scraped data from Kijiji to build a dataset for training my ML model, which will help newcomers get an overall idea of the rent in a city based on their input. For example, if someone wants to move to Mississauga into a private or shared room, the model will predict the expected range of rent in that city.

**Recommendation**: a city recommendation feature will also be developed within this project to suggest the cheapest nearby cities to look for house rentals. -> Future Work :))


### Dataset Features

Since the dataset is scraped from the Kijiji website, I kept only the features that I think are relevant to this project and tried to keep the set as minimal as possible.

- **Title**: Title of the advertisement
- **Description**: Description of the advertisement
- **Features**: Features of the advertisement (furnished or not, pet friendly or not, etc.)
- **Location**: Location of the property
- **Price**: Rent in CAD
- **Ddate Posted**: Date the advertisement was posted (the column name is misspelled in the raw data)
- **URL**: Advertisement URL


# Import Packages

In [1]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from bs4 import BeautifulSoup
import warnings
import re
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
dataset = pd.read_csv('kijiji.csv')

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3729 entries, 0 to 3728
Data columns (total 7 columns):
Title           3728 non-null object
Price           3729 non-null object
Description     3729 non-null object
Location        3728 non-null object
Ddate Posted    3688 non-null object
Features        3727 non-null object
URL             0 non-null float64
dtypes: float64(1), object(6)
memory usage: 204.0+ KB


In [4]:
dataset.columns

Index(['Title', 'Price', 'Description', 'Location', 'Ddate Posted', 'Features',
       'URL'],
      dtype='object')

In [5]:
dataset.head(1).style

Unnamed: 0,Title,Price,Description,Location,Ddate Posted,Features,URL
0,One Bedroom Suites Cherryhill Village for Rent - 301 Oxford...,"$1,145.00","[DescriptionProximate to Western University, dining options, and the mall, with a pool and health club, Cherryhill Village is located at Oxford Street and Platts Lane in the University Heights neighbourhood.Pets:Cherryhill Village is 100% pet friendly Neighbourhood Features: Walk Score: 72, walkable Transit Score: 56, public transit nearby Bike: very bikeableInquire today or visit minto.com to learn more.*Prices and specifications subject to change without notice. Errors and Omissions excepted. Amenities 24 Hour Emergency ServiceBicycle Storage Carpeted floorsElevatorsFamily Friendly Fitness roomFridgeGardening PlotsGreen SpaceHot tubIn-Building LaundryIndoor poolLaundry facilitiesNear Clinics or Hospitals Near Parks and TrailsNewly RenovatedOnsite PharmacyOutdoor BBQ Area Outdoor Exercise StationsOutdoor parkingOutdoor poolParks nearbyPedestrian FriendlyPet Friendly Picnic AreaPlaygroundPublic transitResident LoungeSaunaSchools nearbySecurity Controlled AccessShopping nearbyShuffle BoardSocial roomStorage SpaceStoveTennis CourtWalk-in Closets (Select Suites)Wheelchair accessWoodworking Shop]","301 Oxford St. W, London, ON, N6H 1S6",about 2 hours ago,Unit TypeApartment : Bedrooms1 : Bathrooms1 : Pet FriendlyYes : Size (sqft)0 : FurnishedNo :,


## Data Cleaning
***

Since this is raw data, it has to be processed to remove unwanted characters and HTML tags from each column. We will do that first.


In [6]:
dataset.head()

Unnamed: 0,Title,Price,Description,Location,Ddate Posted,Features,URL
0,One Bedroom Suites Cherryhill Village for Ren...,"$1,145.00","[<div class=""descriptionContainer-3544745383"">...","<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-04T17:54:00.000Z"" titl...",Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,
1,Upgraded 2 bedroom apartment for rent in Owen ...,"$1,625.00","[<div class=""descriptionContainer-3544745383"">...","<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-08-29T00:09:19.000Z"" titl...",Unit TypeApartment : Bedrooms2 : Bathrooms1 : ...,
2,Well furnished room in a super clean house in ...,$800.00,"[<div class=""descriptionContainer-3544745383"">...","<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-01T19:26:36.000Z"" titl...",Furnished : YesPet Friendly : No,
3,"Cambridge - Secure Outdoor Parking (Vehicles, ...",$55.00,"[<div class=""descriptionContainer-3544745383"">...","<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-02T22:55:50.000Z"" titl...",More Info : Parking,
4,MAIN FLOOR 1 BEDROOM APT IN AN 11 PLEX,$850.00,"[<div class=""descriptionContainer-3544745383"">...","<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-01T00:55:07.000Z"" titl...",Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,


In [7]:
#let's check missing values in the dataset
dataset.isnull().sum()

Title              1
Price              0
Description        0
Location           1
Ddate Posted      41
Features           2
URL             3729
dtype: int64

### Price column

In [8]:
# Let's remove the $ symbol and thousands separators from the Price column.
# Any advertisement priced as "Please Contact" is replaced with 0.

def fixPriceColumn(price):
    try:
        rent = re.sub('[^0-9.]+', '', price)
        return int(float(rent))
    except (TypeError, ValueError):
        # non-string value, or no digits left after cleaning
        return 0

In [9]:
dataset['Price'] = dataset['Price'].apply(fixPriceColumn)

In [10]:
dataset[dataset['Price'] == 0].shape

(410, 7)

In [11]:
# We have 410 items with a rent of $0 out of 3729 advertisements. Let's take them out.
dataset = dataset[dataset['Price'] != 0]
dataset.shape

(3319, 7)
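
To make the behaviour of `fixPriceColumn` concrete, here is a minimal sketch that re-declares an equivalent helper and runs it on a few inputs. The name `parse_price` and the sample values are made up for illustration. Anything that cannot be read as a number, such as a "Please Contact" listing or a missing value, comes back as 0, which is why those rows can be filtered out afterwards.

```python
import re

def parse_price(price):
    """Hypothetical stand-in for fixPriceColumn: keep digits and dots, return whole dollars."""
    try:
        return int(float(re.sub(r'[^0-9.]+', '', price)))
    except (TypeError, ValueError):
        # TypeError: the value is not a string (e.g. NaN)
        # ValueError: nothing numeric was left after cleaning
        return 0

# Illustrative inputs only, not rows taken from the dataset
samples = ["$1,145.00", "$800.00", "Please Contact", float("nan")]
prices = [parse_price(s) for s in samples]
print(prices)
```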

### Description column

We have 3 kinds of listing descriptions: <br>
<pre>1. One with a separate list of amenities as an unordered list.</pre> 
<pre>2. One with the description in a paragraph tag.</pre> 
<pre>3. One with the description directly within the div tag.</pre>
We will have to consider these while extracting the text data from the HTML tags.<br>
We will be using the BeautifulSoup library to make it easy.<br>

In [12]:
# a function to get the text data out of the html tags
def cleanDescription(description):
    soup = BeautifulSoup(description, "html.parser")
    desc = []
    try:
        paragraphs = [tag.get_text() for tag in soup.select("div > p")]
        amenities = [tag.get_text() for tag in soup.select("ul > li")]
        if amenities:
            # case 1: paragraphs plus a separate amenities list
            desc = paragraphs + amenities
            return removeSymbols(str(desc))
        elif paragraphs:
            # case 2: description in paragraph tags only
            desc = paragraphs
            return removeSymbols(str(desc))
        else:
            # case 3: text sits directly in the div
            desc = [tag.get_text() for tag in soup.select("div")]
            # str(desc) starts with "['Description"; drop those 13 characters
            return removeSymbols(str(desc)[13:])
    except Exception:
        return removeSymbols(str(desc))

In [13]:
def removeSymbols(text):
    return re.sub('[^A-Za-z0-9]+', ' ', text)
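
The two string tricks used above (calling `str()` on the list of fragments, then slicing off 13 characters) are easier to follow on a small example. The sketch below uses made-up fragments and a hypothetical `remove_symbols` helper with the same pattern as `removeSymbols`. It shows that the list brackets and quotes disappear along with the punctuation, and that the first 13 characters of the stringified list are exactly `['Description`.

```python
import re

def remove_symbols(text):
    # same pattern as removeSymbols: each run of non-alphanumeric characters becomes one space
    return re.sub('[^A-Za-z0-9]+', ' ', text)

# Made-up fragments standing in for the text pulled out of the HTML tags
fragments = ["DescriptionPet-friendly suite, near transit!", "In-Building Laundry"]
as_text = str(fragments)  # "['DescriptionPet-friendly suite, ...', 'In-Building Laundry']"

prefix = as_text[:13]                    # the part that the [13:] slice throws away
cleaned = remove_symbols(as_text[13:])   # brackets, quotes and punctuation become spaces
print(repr(prefix))
print(repr(cleaned))
```

Note that hyphenated words such as "Pet-friendly" are split into two tokens by this cleanup.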

In [14]:
dataset['Description'] = dataset['Description'].apply(cleanDescription)

In [15]:
dataset.head()

Unnamed: 0,Title,Price,Description,Location,Ddate Posted,Features,URL
0,One Bedroom Suites Cherryhill Village for Ren...,1145,Proximate to Western University dining option...,"<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-04T17:54:00.000Z"" titl...",Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,
1,Upgraded 2 bedroom apartment for rent in Owen ...,1625,Fantastic 2 bedroom apartment GREAT VALUE Uti...,"<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-08-29T00:09:19.000Z"" titl...",Unit TypeApartment : Bedrooms2 : Bathrooms1 : ...,
2,Well furnished room in a super clean house in ...,800,I have one bed room available in my house not...,"<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-01T19:26:36.000Z"" titl...",Furnished : YesPet Friendly : No,
3,"Cambridge - Secure Outdoor Parking (Vehicles, ...",55,Canam Self Storage has a secure state of the ...,"<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-02T22:55:50.000Z"" titl...",More Info : Parking,
4,MAIN FLOOR 1 BEDROOM APT IN AN 11 PLEX,850,Available for Oct 1st is a 1 bedroom apt in a...,"<span class=""address-3617944557"" itemprop=""add...","<time datetime=""2019-09-01T00:55:07.000Z"" titl...",Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,


In [16]:
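#let's check if we have any empty descriptions
#caveat: splitting an empty string on " " yields [''] (length 1), so this filter cannot match blank descriptions
dataset[dataset['Description'].str.split(" ").str.len() == 0]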

Unnamed: 0,Title,Price,Description,Location,Ddate Posted,Features,URL
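
An empty result here does not prove that every description has text, because splitting an empty string on a space still returns one (empty) token. The sketch below shows this on plain Python strings with made-up values, next to a stricter blank check. On the DataFrame, the equivalent stricter filter would be `dataset['Description'].str.strip() == ''`; whether it finds any rows in this dataset is not known from the output above.

```python
# Made-up descriptions: one real, one empty, one whitespace-only
descriptions = ["Bright one bedroom", "", "   "]

# "".split(" ") returns [''] rather than [], so the token count never reaches 0
token_counts = [len(d.split(" ")) for d in descriptions]

# Stripping whitespace first detects both kinds of blank description
blank_flags = [d.strip() == "" for d in descriptions]

print(token_counts)
print(blank_flags)
```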


### Location column

In [17]:
# this is the raw data; we have to extract the address and postal code from it
dataset['Location'][0]

'<span class="address-3617944557" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">301 Oxford St. W, London, ON, N6H 1S6</span>'

In [18]:
def getAddress(address):
    soup = BeautifulSoup(address, "html.parser")
    # Canadian postal code: letter-digit-letter-digit-letter-digit (matched after spaces are removed)
    regex = '[A-Za-z][0-9][A-Za-z][0-9][A-Za-z][0-9]'
    givenAddress = str([link.get_text() for link in soup.select("span")])
    givenAddress = removeSymbols(givenAddress)
    postalAddress = givenAddress.replace(" ", "")
    postal = re.search(regex, postalAddress)
    # postal is None when the address has no postal code
    return givenAddress, postal.group(0) if postal else None

In [19]:
dataset['Location'],dataset['Postal Code'] = zip(*dataset['Location'].apply(getAddress))
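
Because `getAddress` removes every space before searching, unrelated neighbouring tokens could in principle be glued into something that looks like a postal code. As a hedged alternative, the sketch below matches on the cleaned address with word boundaries and an optional space in the middle. `POSTAL_RE` and `find_postal_code` are hypothetical names; the first two addresses are taken from the output below and the third is made up.

```python
import re

# Letter-digit-letter, optional space, digit-letter-digit, bounded on both sides
POSTAL_RE = re.compile(r'\b([A-Za-z]\d[A-Za-z])\s?(\d[A-Za-z]\d)\b')

def find_postal_code(address):
    """Return the postal code without spaces, or None when the address has none."""
    match = POSTAL_RE.search(address)
    return (match.group(1) + match.group(2)).upper() if match else None

addresses = [
    "301 Oxford St W London ON N6H 1S6",    # code written with a space
    "507 St Raphael st Sudbury P3G1M3 ON",  # code written without a space
    "Main Street Toronto ON",               # made-up address with no code
]
codes = [find_postal_code(a) for a in addresses]
print(codes)
```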

In [20]:
dataset.head()

Unnamed: 0,Title,Price,Description,Location,Ddate Posted,Features,URL,Postal Code
0,One Bedroom Suites Cherryhill Village for Ren...,1145,Proximate to Western University dining option...,301 Oxford St W London ON N6H 1S6,"<time datetime=""2019-09-04T17:54:00.000Z"" titl...",Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,,N6H1S6
1,Upgraded 2 bedroom apartment for rent in Owen ...,1625,Fantastic 2 bedroom apartment GREAT VALUE Uti...,755 Tenth Street West Owen Sound ON N4K 6J7,"<time datetime=""2019-08-29T00:09:19.000Z"" titl...",Unit TypeApartment : Bedrooms2 : Bathrooms1 : ...,,N4K6J7
2,Well furnished room in a super clean house in ...,800,I have one bed room available in my house not...,31 Portstewart Crescent Brampton ON L6X 0R6 C...,"<time datetime=""2019-09-01T19:26:36.000Z"" titl...",Furnished : YesPet Friendly : No,,L6X0R6
3,"Cambridge - Secure Outdoor Parking (Vehicles, ...",55,Canam Self Storage has a secure state of the ...,111 Savage Dr Cambridge ON N1T 1S5 Canada,"<time datetime=""2019-09-02T22:55:50.000Z"" titl...",More Info : Parking,,N1T1S5
4,MAIN FLOOR 1 BEDROOM APT IN AN 11 PLEX,850,Available for Oct 1st is a 1 bedroom apt in a...,507 St Raphael st Sudbury P3G1M3 ON,"<time datetime=""2019-09-01T00:55:07.000Z"" titl...",Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,,P3G1M3


### Date Posted Column

In [21]:
## we have around 30 missing values in the date posted column due to some exceptions while scraping
## we will be filling those values with the date I scraped the site, i.e. Sep 4 2019
dataset['Ddate Posted'] = dataset['Ddate Posted'].fillna('2019-09-04')

In [22]:
dataset['Ddate Posted'][550]

'<time datetime="2019-08-19T15:27:44.000Z" title="August 19, 2019 3:27 PM">16 days ago</time>'

In [23]:
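def getDate(datePosted):
    # scraped values are <time datetime="YYYY-MM-DDT..."> tags, while filled-in values are
    # plain YYYY-MM-DD strings; a fixed [16:26] slice would return '' for the latter
    match = re.search(r'\d{4}-\d{2}-\d{2}', str(datePosted))
    if match:
        return match.group(0)
    print(datePosted)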

In [25]:
dataset['Date Posted'] = dataset['Ddate Posted'].apply(getDate)
dataset.drop('Ddate Posted',inplace=True,axis=1)
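
If real date objects are needed later (for example to compute how old a listing is), the `datetime` attribute can be read by name rather than by position. This is a minimal sketch using only the standard library; `TIME_ATTR_RE` and `parse_posted` are hypothetical names, and the timestamp format is assumed from the sample tag shown above.

```python
import re
from datetime import datetime

# Captures the value of the datetime="..." attribute
TIME_ATTR_RE = re.compile(r'datetime="([^"]+)"')

def parse_posted(value):
    """Return the posting date from a <time> tag or from a plain YYYY-MM-DD string."""
    match = TIME_ATTR_RE.search(value)
    if match:
        # e.g. 2019-08-19T15:27:44.000Z (timestamp with milliseconds and a trailing Z)
        return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ").date()
    # fallback for values that were filled in as a plain date
    return datetime.strptime(value, "%Y-%m-%d").date()

tag = '<time datetime="2019-08-19T15:27:44.000Z" title="August 19, 2019 3:27 PM">16 days ago</time>'
dates = [parse_posted(tag), parse_posted("2019-09-04")]
print(dates)
```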

In [26]:
dataset.head()

Unnamed: 0,Title,Price,Description,Location,Features,URL,Postal Code,Date Posted
0,One Bedroom Suites Cherryhill Village for Ren...,1145,Proximate to Western University dining option...,301 Oxford St W London ON N6H 1S6,Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,,N6H1S6,2019-09-04
1,Upgraded 2 bedroom apartment for rent in Owen ...,1625,Fantastic 2 bedroom apartment GREAT VALUE Uti...,755 Tenth Street West Owen Sound ON N4K 6J7,Unit TypeApartment : Bedrooms2 : Bathrooms1 : ...,,N4K6J7,2019-08-29
2,Well furnished room in a super clean house in ...,800,I have one bed room available in my house not...,31 Portstewart Crescent Brampton ON L6X 0R6 C...,Furnished : YesPet Friendly : No,,L6X0R6,2019-09-01
3,"Cambridge - Secure Outdoor Parking (Vehicles, ...",55,Canam Self Storage has a secure state of the ...,111 Savage Dr Cambridge ON N1T 1S5 Canada,More Info : Parking,,N1T1S5,2019-09-02
4,MAIN FLOOR 1 BEDROOM APT IN AN 11 PLEX,850,Available for Oct 1st is a 1 bedroom apt in a...,507 St Raphael st Sudbury P3G1M3 ON,Unit TypeApartment : Bedrooms1 : Bathrooms1 : ...,,P3G1M3,2019-09-01


### Features Column

In [27]:
dataset['Features'][0]

'Unit TypeApartment : Bedrooms1 : Bathrooms1 : Pet FriendlyYes : Size (sqft)0 : FurnishedNo : '

In [28]:
dataset['Features'][50]

'Unit TypeHouse : Bedrooms4 : Bathrooms3.5 : Parking Included3+ : Agreement Type1 Year : Move-In DateOctober 1, 2019 : Pet FriendlyLimited : Size (sqft)2,000 : FurnishedNo : Air ConditioningYes : Smoking PermittedNo : '

In [29]:
dataset['Features'][500]

'Furnished : YesPet Friendly : No'

In [30]:
dataset['Features'][550]

'Bedrooms : 2Bathrooms : 2Furnished : YesPet Friendly : No'
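
The samples above show two layouts: one where the label is fused with its value and pairs are separated by ` : ` (`Bedrooms1 : Bathrooms1`), and one where label and value are separated by ` : ` and the pairs are fused (`Bedrooms : 2Bathrooms : 2`). One possible way to handle both, sketched here under the assumption that the set of labels is known in advance, is to locate the labels and treat whatever lies between two labels as the value. `KEYS` only contains the labels visible in these four samples, and `parse_features` is a hypothetical helper.

```python
import re

# Labels seen in the samples above; the full dataset may contain more (assumption)
KEYS = ["Unit Type", "Bedrooms", "Bathrooms", "Parking Included", "Agreement Type",
        "Move-In Date", "Pet Friendly", "Size (sqft)", "Furnished",
        "Air Conditioning", "Smoking Permitted"]
# Longest labels first so that a longer label is never shadowed by a shorter one
KEY_RE = re.compile("|".join(re.escape(k) for k in sorted(KEYS, key=len, reverse=True)))

def parse_features(text):
    """Split a Features string into a {label: value} dict."""
    matches = list(KEY_RE.finditer(text))
    features = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        # the value is whatever sits between this label and the next one,
        # with the " : " separators trimmed from both ends
        features[current.group(0)] = text[current.end():end].strip(" :")
    return features

fused = parse_features('Unit TypeApartment : Bedrooms1 : Bathrooms1 : Pet FriendlyYes : Size (sqft)0 : FurnishedNo : ')
spaced = parse_features('Bedrooms : 2Bathrooms : 2Furnished : YesPet Friendly : No')
print(fused)
print(spaced)
```

A value that happens to contain one of the labels verbatim would be split incorrectly, so this is only a starting point.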