# Scraping rental data for central districts of Tokyo

This is a small project I worked on while living in Japan to learn how to scrape useful data from web pages. It scrapes search results from a Japanese real estate website, extracts the relevant fields for each house, and stores them in a pandas DataFrame.

In [1]:
import re

import requests
from bs4 import BeautifulSoup
import pandas
import database  # the "database" script included with this project

### Determining number of pages based on example search

In [2]:
# this is the URL generated after choosing specific search criteria on the website (e.g. location, house type, price range)
search_url = "http://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13101&sc=13102&sc=13103&sc=13104&sc=13105&sc=13113&cb=0.0&ct=9999999&et=9999999&cn=9999999&mb=0&mt=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&fw2="

# obtaining all content from pre-defined URL
r = requests.get(search_url)
c = r.content

# we use beautifulsoup to make sense of the data
soup = BeautifulSoup(c,"html.parser")

# inspecting the page's HTML shows that each house sits inside a div with the class "cassetteitem"
listings = soup.find_all("div", {"class": "cassetteitem"})

# the search results are split into pages; the last number in the last "pagination-parts" list is the total page count
pagination_text = soup.find_all("ol", {"class": "pagination-parts"})[-1].text
page_numbers = [int(s) for s in pagination_text.split() if s.isdigit()]
page_nr = page_numbers[-1]

print(page_nr,"pages were found")

505 pages were found
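
The page-count step above can be pulled into a small helper that is easy to test without hitting the website. This is only a sketch: the function name `last_page_number` and the fallback to 1 when no digits are found are my own assumptions, not part of the original notebook.

```python
import re


def last_page_number(pagination_text):
    """Return the last integer found in the pagination text.

    Hypothetical helper: falls back to 1 when the text contains no digits,
    e.g. when the search returns a single page with no pagination list.
    """
    numbers = [int(n) for n in re.findall(r"\d+", pagination_text)]
    return numbers[-1] if numbers else 1


# example with text shaped like a pagination list
print(last_page_number("1\n2\n3\n...\n505"), "pages were found")
```

With a helper like this, the cell would reduce to calling `last_page_number(...)` on the `.text` of the last "pagination-parts" element.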


### Extracting the data of interest from each page, building a pandas DataFrame and saving it as a CSV file

In [3]:
# iterating through each page by appending the page number to the search URL
l = []
for page in range(1, 2):  # only the first page is scraped here to keep the example fast; use range(1, page_nr + 1) for all pages
    r = requests.get(search_url + '&pn=' + str(page))
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # "cassetteitem" is the class for each house; match it exactly so that divs carrying extra classes are skipped
    listings = soup.find_all(lambda tag: tag.name == 'div' and
                                        tag.get('class') == ['cassetteitem'])

    # for each house found, collect the title, locality, number of rooms, floor area and price
    # and put the information into a dictionary
    for item in listings:
        d = {}
        d["Title"] = item.find("div", {"class": "cassetteitem_content-title"}).text
        d["Locality"] = item.find("li", {"class": "cassetteitem_detail-col1"}).text
        d["Price"] = item.find("span", {"class": "cassetteitem_other-emphasis ui-text--bold"}).text.replace("\n", "").replace(" ", "")

        # the number of rooms has to be decoded from the layout category (e.g. ワンルーム is a one-room flat, 1K is one room plus kitchen)
        rooms = item.find("table", {"class": "cassetteitem_other"}).text
        if 'ワンルーム' in rooms:
            d["Rooms"] = 1
        elif '1K' in rooms:
            d["Rooms"] = 1
        elif '2K' in rooms:
            d["Rooms"] = 2
        else:
            d["Rooms"] = "Unknown"
        
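        # the floor area has to be dug out of the table: take every number in its HTML and pick the sixth from the end
        # (this position depends on the layout of the page, so it breaks if the site changes)
        table = item.find_all("table", {"class": "cassetteitem_other"})[0]
        nums = re.findall(r'\d+(?:\.\d+)?', str(table))
        d["Size"] = nums[-6] + " m2"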
        
#         d["Link"] = item.find("td",{"class","ui-text--midium ui-text--bold"}).text
        
        l.append(d)

# finally we create a dataframe, put the columns in the desired order and save it as a CSV file
df = pandas.DataFrame(l)
df = df[['Title', 'Locality', 'Size', 'Rooms', 'Price']]
df.to_csv("Output.csv")
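
The room-count branch above only recognises ワンルーム, 1K and 2K, so anything else ends up as "Unknown". A more general decoder could read the leading digit of the usual Japanese layout codes (1K, 1DK, 2LDK, 3SLDK and so on). This is a rough sketch rather than what the notebook actually runs: `count_rooms` is a hypothetical name, and it assumes it is given a short layout string, because on the full table text the pattern could also pick up unrelated digits followed by K or R.

```python
import re


def count_rooms(layout):
    """Decode a layout code such as '1K' or '2LDK' into a room count.

    Hypothetical helper: returns None when no layout code is recognised.
    """
    if "ワンルーム" in layout:  # "one room" studio flat
        return 1
    # a number, an optional S (storage room), then the kitchen/dining/living suffix
    match = re.search(r"(\d+)S?(?:LDK|DK|LK|K|R)", layout)
    return int(match.group(1)) if match else None


for code in ["ワンルーム", "1K", "2LDK", "3SLDK"]:
    print(code, count_rooms(code))
```

Returning `None` instead of the string "Unknown" keeps the column numeric once it reaches the dataframe, which makes later filtering and sorting simpler.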

In [4]:
df.head()

| | Title | Locality | Size | Rooms | Price |
|---|---|---|---|---|---|
| 0 | コンフォリア春日富坂 | 東京都文京区春日１ | 64.85 m2 | 1 | 9.1万円 |
| 1 | 恵比寿レジデンス弐番館 | 東京都渋谷区東２ | 42.65 m2 | Unknown | 18.4万円 |
| 2 | ガーデン原宿 | 東京都渋谷区神宮前４ | 47.54 m2 | 1 | 16万円 |
| 3 | エスレジデンス月島 | 東京都中央区月島３ | 44.67 m2 | 1 | 10.4万円 |
| 4 | プラザ勝どき | 東京都中央区勝どき１ | 99.34 m2 | 1 | 10.7万円 |
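
Size and Price are stored as strings here ("64.85 m2", "9.1万円"), which is fine for display but awkward for sorting or averaging. Below is a small sketch of how they could be converted to numbers, assuming every value follows the formats shown in the output above. The helper names `price_to_yen` and `size_to_m2` are hypothetical. 万 means 10,000, so "9.1万円" is 91,000 yen; `Decimal` is used to avoid floating-point rounding in that multiplication.

```python
import re
from decimal import Decimal


def price_to_yen(price):
    """Convert a price like '9.1万円' to an integer number of yen (万 = 10,000)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)万円\s*", price)
    if match is None:
        return None  # unexpected format, leave it to the caller to handle
    return int(Decimal(match.group(1)) * 10000)


def size_to_m2(size):
    """Convert a size like '64.85 m2' to a float number of square metres."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*m2", size)
    return float(match.group(1)) if match else None


print(price_to_yen("9.1万円"), size_to_m2("64.85 m2"))
```

Applied to the dataframe, this could look like `df["Price"].map(price_to_yen)` and `df["Size"].map(size_to_m2)`.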


### Storing the data in an SQL database by importing and calling the included "database" script

In [7]:
# create the table, then go through the list of dictionaries and insert each house into the database
database.create_table()
for d in l:
    database.insert(d["Title"],d["Locality"],d["Size"],d["Rooms"],d["Price"])
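
The "database" script itself is not shown in this notebook. For readers who want to follow along, here is a minimal sketch of what such a module might look like, assuming SQLite through Python's standard `sqlite3` module. Only the names `create_table` and `insert` and the order of the five values come from the cell above; the file name `rentals.db`, the table name `rentals`, the optional `db_path` parameter and the `view` helper are all assumptions of mine.

```python
import sqlite3
from contextlib import closing

DB_PATH = "rentals.db"  # hypothetical file name


def create_table(db_path=DB_PATH):
    """Create the rentals table if it does not exist yet."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS rentals "
                "(title TEXT, locality TEXT, size TEXT, rooms INTEGER, price TEXT)"
            )


def insert(title, locality, size, rooms, price, db_path=DB_PATH):
    """Insert one house; placeholders keep the scraped text from being run as SQL."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.execute(
                "INSERT INTO rentals VALUES (?, ?, ?, ?, ?)",
                (title, locality, size, rooms, price),
            )


def view(db_path=DB_PATH):
    """Return all stored rows as a list of tuples."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(
            "SELECT title, locality, size, rooms, price FROM rentals"
        ).fetchall()
```

Opening a new connection for every insert is slow over hundreds of pages; passing one connection around or using `executemany` would be the obvious next step if speed matters.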