# Uniplaces

## Context

We want to extract data regarding listings of private rooms in Lisbon from Uniplaces.

Uniplaces renders its pages with JavaScript, so approaches that only parse the static HTML, such as BeautifulSoup on its own, do not work. 
We will use Selenium to scrape the data from the first 20 pages of listings.

## Conclusions 
(also at the end)

This was my first proper experience using Selenium. I think I learned the basics and maybe a little more: Uniplaces relies heavily on JavaScript, and even in cases where I could have avoided Selenium and used something like BeautifulSoup or string functions, I made an effort to use it. 

I was able to consistently extract all the data from the listings.

Still, there are some things that I would have done differently:

1. **Learn about WebDriverWait:** When I started, I was not aware of WebDriverWait or why it is needed. As a result, my code looks a bit "sewn" together. Had I known, I would have created generic functions earlier.
2. **Learn about location and exceptions:** The same goes for the location functions, which I found are only needed for some listings, and for exceptions and, in particular, expected conditions.
3. **Do generic functions sooner:** The cards of the listings are not that different. I would have created more generic functions. 
4. **Scrape random listings:** I noticed that the scraping could lead to a large number of variables and used some techniques to reduce it. However, I was not very efficient, as I was not aware, for example, that some variables may contain a "No photos" substring. I could have scraped random listings first to identify possible problems like this. Still, it is easier and faster to clean the dataframe than to scrape the data again.
5. **Learn to bypass request limits:** Fortunately, Uniplaces does not seem to limit the number of requests made. If I do this again, I will look into how to bypass such limits (while not doing anything illegal).
6. **Prepare for removed listings:** I would have prepared for the case where a listing was removed (this only happened once in all 20 pages). 
7. **Include name of card in key:** For example, "student" can refer to the tenant or the landlord, and I did not account for that.

The extraction of almost all the data from the listings led to a very complete, perhaps too complete, dataset. This is not uncommon, given that I did not have a problem/question in mind while acquiring the data. As such, I scraped as much data as possible, which led to an overcomplicated dataset. 
But that is okay, because I get to show that I can clean a dataset!

Overall, this notebook was a success.

### Import libraries

In [1]:
import pandas as pd
import re

#Selenium
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager # we will use Chrome
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

import time

In [None]:
#driver.quit()

## Open main page

One thing we need to be careful about, particularly when using the "Run All Cells" option, is waiting between calls. 
As a result, the functions that interact with the page take a `wait` argument.
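
The idea behind waiting can be sketched without Selenium: poll a condition until it holds or a timeout expires, instead of sleeping for a fixed time. This is roughly what `WebDriverWait(...).until(...)` does for us later on. The helper `wait_until` and the `ready` condition below are hypothetical names used only for illustration; they are not part of Selenium.

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)

# Example: a condition that only becomes true on the third check.
attempts = []

def ready():
    attempts.append(1)
    return len(attempts) >= 3

wait_until(ready, timeout=2, poll=0.01)
```

Compared with a fixed `time.sleep`, polling returns as soon as the page is ready and fails loudly when it never is.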

In [2]:
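from selenium.webdriver.chrome.service import Service

def open_url (url):
    # recent Selenium 4 releases expect the driver path wrapped in a Service object
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(url)
    driver.maximize_window()
    return driver, driver.window_handles[0]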

In [3]:
url = 'https://www.uniplaces.com/'
driver, parent_handle = open_url (url)



Current google-chrome version is 97.0.4692
Get LATEST chromedriver version for 97.0.4692 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_mac64.zip
Driver has been saved in cache [/Users/frosa/.wdm/drivers/chromedriver/mac64/97.0.4692.71]


## Find listings

### Use search box

The search box and button elements were found by inspecting the page (right-click on the search box -> Inspect).
Both can be located by class name, as is usually the case, so the function assumes a class-name locator.
The class names of the box and button may change over time, possibly precisely to hinder web scraping.
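
Since class names like these can change, one possible mitigation is to match only the stable-looking part of the name. The sketch below assumes that the chunk after the last `__` (for example `1vHez`) is a build-generated suffix, which is a guess on my part and not something Uniplaces documents. `stable_class_selector` is a hypothetical helper.

```python
def stable_class_selector(class_name, sep="__"):
    """Drop the trailing hash-like chunk and match on the remaining prefix."""
    prefix, _, _ = class_name.rpartition(sep)
    if not prefix:  # no separator found: fall back to an exact class match
        return "." + class_name
    return "[class*='{}']".format(prefix)

selector = stable_class_selector("SearchBar_search-bar__input__1vHez")
print(selector)  # [class*='SearchBar_search-bar__input']
```

The resulting string could be passed to `By.CSS_SELECTOR` instead of using `By.CLASS_NAME`. It only makes sense for names that follow this pattern; for a name such as "styles__Button-sc-1qpfiqz-1" the prefix would be far too broad.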

In [4]:
def write_search (driver, place, button = "styles__Button-sc-1qpfiqz-1", 
                  element="SearchBar_search-bar__input__1vHez", wait=5):
    
    search_box = WebDriverWait(driver, wait).until(EC.presence_of_element_located((By.CLASS_NAME, element)))
    search_box.send_keys(place)
    
    search_button = WebDriverWait(driver, wait).until(EC.element_to_be_clickable((By.CLASS_NAME, button)))
    time.sleep(wait) # this is needed
    search_button.click()  

In [5]:
write_search (driver, 'Lisbon')

### Check filters

We want to check only the filter 'Private bedrooms'. To check it in the browser we can click anywhere in its tile, including the label, and inspect it. Then XPATH is, as far as I know, the most generic way to proceed. In this case CSS_SELECTOR would also work.

In [6]:
def click_button_xpath (driver, xpath, wait=10):
    # wait until the element is clickable, for at most `wait` seconds
    button = WebDriverWait(driver, wait).until(EC.element_to_be_clickable((By.XPATH, xpath)))
    button.click()

In [7]:
xpath = "//div[label/@for='rent-type-unit']"
click_button_xpath (driver, xpath)

### Individual listings

By inspecting the elements in the page, we see that we can easily get the links of each listing. However, this is not always the case, particularly in pages that rely on JavaScript.
There are two ways to proceed:

1. We get the links of each individual listing, and use the driver.get() method as before.
2. We click on a given link, extract the data we want, go back to the previous page (showing all the listings), and do the same with the next link.

We will use approach 2, but here is how approach 1 would work anyway.

#### Fetch links (way 1)

In [10]:
# approach 1 - this is just for the current page
def fetch_urls (driver, element="sc-dqBHgY", xpath = "a[@href]", attribute="href", wait=5, window=None):
    urls = []
    if window is not None: # e.g. parent_handle, to make sure we are on the results tab
        driver.switch_to.window(window)

    clas = WebDriverWait(driver, wait).until(EC.presence_of_element_located((By.CLASS_NAME, element)))
    hrefs = clas.find_elements(By.XPATH, xpath)

    for href in hrefs:
        urls.append(href.get_attribute(attribute))

    return urls # this is a list
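
For completeness, here is a minimal sketch of how approach 1 could continue once the links are known: visit each URL with `driver.get()` and run an extraction function on the loaded page. `scrape_by_urls`, `extract` and `FakeDriver` are hypothetical names; the fake driver only exists so that the sketch runs without a browser.

```python
import time

def scrape_by_urls(driver, urls, extract, pause=0.0):
    """Visit each URL with driver.get() and collect extract(driver) results."""
    results = []
    for url in urls:
        driver.get(url)                  # load the listing in the current tab
        results.append(extract(driver))
        time.sleep(pause)                # wait between requests
    return results

class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""
    def __init__(self):
        self.current_url = None

    def get(self, url):
        self.current_url = url

pages = scrape_by_urls(
    FakeDriver(),
    ["https://example.com/a", "https://example.com/b"],
    extract=lambda d: d.current_url,
)
```

With a real driver, `extract` would be a function that reads the listing page, and `pause` would be set to a few seconds.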

#### Fetch listings (way 2)

In [9]:
# approach 2 - this is just for the current page
def fetch_listings (driver, element="sc-dqBHgY", tag = "a", wait=5, window = parent_handle):
    driver.switch_to.window(window)
    clas = WebDriverWait(driver, wait).until(EC.presence_of_element_located((By.CLASS_NAME, element)))
    return clas.find_elements(By.TAG_NAME, tag) # this is a list

#### Open listing 

The following opens a given ad in a new tab (this is how Uniplaces is set up) and returns the handle. The handle will be useful to make sure we close the correct window. 

In [11]:
def open_ad (driver, listing):
    listing.location_once_scrolled_into_view
    listing.click()
    ad_handle = driver.window_handles[-1]
    driver.switch_to.window(ad_handle)
    return ad_handle

In [12]:
# listings from current page
listings = fetch_listings(driver)

In [13]:
ad_handle = open_ad (driver, listings[0])

### Generic functions

Let's create some generic functions along the way (we should probably have done it before, but live and learn).

In [14]:
def find_by_class_el(section, el):
    try:
        element = section.find_element(By.CLASS_NAME, el)
    except NoSuchElementException:
        return "Does not exist"
    return element

In [15]:
def find_by_class_els(section, el):
    try:
        element = section.find_elements(By.CLASS_NAME, el)
    except NoSuchElementException:
        return "Does not exist"
    return element

In [16]:
def find_by_xpath(section, el):
    try:
        element = section.find_elements(By.XPATH, el)
    except NoSuchElementException:
        return "Does not exist"
    return element

In [17]:
def find_by_tag(section, el):
    try:
        element = section.find_elements(By.TAG_NAME, el)
    except NoSuchElementException:
        return "Does not exist"
    return element

In [18]:
def find_by_id(section, el):
    try:
        element = section.find_element(By.ID, el)
    except NoSuchElementException:
        return "Does not exist"
    return element
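
The five helpers above differ only in the locator strategy and in whether they return one element or a list, so they could be folded into a single function. The sketch below is one possible shape, not what the notebook uses: `find_or_default` and `FakeSection` are hypothetical, and the `ImportError` fallback is there only so that the example runs where Selenium is not installed. Unlike the helpers above, it returns the default when `find_elements` comes back empty (Selenium returns an empty list in that case rather than raising).

```python
try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:  # lets the sketch run where Selenium is not installed
    class NoSuchElementException(Exception):
        pass

def find_or_default(section, by, value, many=False, default="Does not exist"):
    """Look up element(s) under `section`; return `default` when nothing is found."""
    if many:
        found = section.find_elements(by, value)  # empty list when nothing matches
        return found if found else default
    try:
        return section.find_element(by, value)
    except NoSuchElementException:
        return default

class FakeSection:
    """Minimal stand-in mimicking the two Selenium lookup methods used above."""
    def __init__(self, items):
        self.items = items

    def find_element(self, by, value):
        if (by, value) not in self.items:
            raise NoSuchElementException(value)
        return self.items[(by, value)][0]

    def find_elements(self, by, value):
        return self.items.get((by, value), [])

# "class name" is the string value behind By.CLASS_NAME
section = FakeSection({("class name", "header__title"): ["<title element>"]})
title = find_or_default(section, "class name", "header__title")
missing = find_or_default(section, "id", "not-there")
```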

In [19]:
def make_dict_attributes (lst, attr1, attr2):
    dct = {}
    for item in lst:
        dct[item.get_attribute(attr1)] = item.get_attribute(attr2)
    return dct

### Gather data from listing (step-by-step)

In the listing page, the information about the offer is inside the class "offer-page__container".
It has several div children, each of which seems to correspond to a specific section of the page.

In [20]:
offer = find_by_class_el (driver, "offer-page__container")
tags = find_by_xpath(offer, "div") #this way we get only direct children

tags_ids = [tag.get_attribute("id") for tag in tags]

print("We have {} div tags:".format(len(tags_ids)))

tags_mean_keep = {tags_ids[0]: ["Main photo, thumbnails, review data, header data and buttons", 
                            "Main photo url, number of thumbnails, review data, header data"],
              tags_ids[1]: ["Price, availability and url", 
                            "Price, availability and url"],
              tags_ids[2]: ["Rules card", 
                            "Aspects listed, existing/missing"], 
              tags_ids[3]: ["Landlord card", 
                            "Aspects listed, existing/missing"], 
              tags_ids[4]: ["Bedroom card", 
                            "Aspects listed, existing/missing"],    
              tags_ids[5]: ["Place card", 
                            "Description, aspects listed, existing/missing"],              
              tags_ids[6]: ["Rental card", 
                            "Aspects listed"],  
              tags_ids[7]: ["About the landlord card", 
                            "Description, Aspects listed"],               
              tags_ids[8]: ["Reviews card", 
                            "Description"],               
              tags_ids[9]: ["Neighborhood", 
                            "Name"], 
              tags_ids[10]: ["Similar places", 
                            "Number"]            
}

print("Tag: [What it is, What we want]")
tags_mean_keep

We have 11 div tags:
Tag: [What it is, What we want]


{'ember37': ['Main photo, thumbnails, review data, header data and buttons',
  'Main photo url, number of thumbnails, review data, header data'],
 'ember39': ['Price, availability and url', 'Price, availability and url'],
 'ember41': ['Rules card', 'Aspects listed, existing/missing'],
 'ember43': ['Landlord card', 'Aspects listed, existing/missing'],
 'ember45': ['Bedroom card', 'Aspects listed, existing/missing'],
 'ember47': ['Place card', 'Description, aspects listed, existing/missing'],
 'ember49': ['Rental card', 'Aspects listed'],
 'ember51': ['About the landlord card', 'Description, Aspects listed'],
 'ember53': ['Reviews card', 'Description'],
 'ember55': ['Neighborhood', 'Name'],
 'ember57': ['Similar places', 'Number']}

#### Title, main photo url, nr of thumbnails, overall rating, and header data

In [21]:
def get_title (section, el = "header__title"):
    section.location_once_scrolled_into_view
    title = find_by_class_el (section, el)
    return title.text 

def get_main_photo (section, el = "header__main__photo"):
    section.location_once_scrolled_into_view
    main_photo = find_by_class_el (section, el)
    return main_photo.get_attribute("style").split('"')[1].split('"')[0]   

def get_nr_thumbnails (section, el = "header__thumbnail__label"):
    section.location_once_scrolled_into_view
    nr_thumbnails = find_by_class_el (section, el)
    if nr_thumbnails == "Does not exist":
        return "2 or less"
    return nr_thumbnails.text.split(" ")[0]

def get_header_data (section, el = "header__icon__label"):
    section.location_once_scrolled_into_view
    headers = find_by_class_els (section, el)
    return [x.text for x in headers]

def get_rating_count (section, el = "header__description__reviews__text", tag = 'path',
                     yellow = '#F6A623'):
    section.location_once_scrolled_into_view
    n = find_by_class_el(section, el)

    # count the stars unless the label only reads "Reference"
    if type(n) == str or n.text != "Reference":
        paths = find_by_tag(section, tag)
        count = 0
        for path in paths:
            if path.get_attribute('fill') == yellow:
                count = count + 1
        if len(paths) == 6: #(the last is a half star)
            count = count - 0.5
    else:
        count = 0
        n = 0
    
    # note: n is returned as found, i.e. a WebElement (or "Does not exist"), not its text
    return count, n

def get_overview (section):
    
    count, n = get_rating_count (section)
    header_keys = ["type", "n_people", "n_bedrooms", "n_bathrooms"]
    header_values = get_header_data (section)
    header_dict = dict(zip(header_keys, header_values))
    main_dict = {"title": get_title (section), "photo_url": get_main_photo(section), "n_thumbnails": get_nr_thumbnails(section),
                 "rating": count, "n_reviews": n}
    return {**main_dict, **header_dict}

In [22]:
card = find_by_id (offer, tags_ids[0])
get_overview(card)

{'title': 'Cosy double bedroom in Intendente',
 'photo_url': 'https://cdn-static-new.uniplaces.com/property-photos/8034137a169586fa7302d0cfb0ac9d29f26d756a41077c8d8212dd0fd08178b5/x-large.jpg',
 'n_thumbnails': '23',
 'rating': 5,
 'n_reviews': <selenium.webdriver.remote.webelement.WebElement (session="9b63afb80fabae276be1c48ec05fc230", element="672c70b4-8ca2-4456-b60f-030770a0c0f7")>,
 'type': 'Double bedroom',
 'n_people': '1 person',
 'n_bedrooms': '3-bedroom apartment',
 'n_bathrooms': '1 Bathroom'}

#### Price, availability, and url

In [23]:
def get_price_avail_data (driver, el="meta", attr1='itemprop', attr2='content'):
    price_avail_tags = find_by_tag (driver, el)
    return make_dict_attributes (price_avail_tags, attr1, attr2)

In [24]:
card = find_by_id (offer, tags_ids[1])

get_price_avail_data (card)

{'price': '380',
 'priceCurrency': 'EUR',
 'availabilityStarts': 'Sun Jan 16 2022 00:00:00 GMT+0000',
 'availability': 'OnlineOnly',
 'url': 'https://www.uniplaces.com/accommodation/lisbon/65494'}

#### Rules, Landlord, Bedroom cards

These cards are comprised of a title div and a card div.

We will enter the card div, "display-card__card". As a result, there will be multiple classes named "display-card__card" within the page. That is why we search only inside the intended id.

Moreover, cards can also have a More/Less button. We need to click it if it has the word "More".

The non-existing/forbidden aspects are dimmed, and the word "dimmed" appears in their class.

In [25]:
def click_if (driver, section, clas, wait=1):
    button = find_by_class_el (section, clas)
    if type(button) != str:
        if 'more' in button.text.lower() or '+' in button.text:
            time.sleep(wait)
            # it has to be like this because with chrome the button is not always where it should be
            driver.execute_script("arguments[0].click();", button)

In [26]:
def get_items (driver, section, clas= "display-card__card", button="display-card__button", 
               tag = "div", keyword = 'dimmed', attr = "class"):
    
    section.location_once_scrolled_into_view

    click_if (driver, section, button)
    
    items_clas = find_by_class_el (section, clas)
    if type(items_clas) == str:
        return {}

    items_tags = find_by_tag (items_clas, tag)

    items_exist = [keyword not in item.get_attribute(attr) for item in items_tags]
    items_text = [x.text for x in items_tags]

    items_exist_dct = dict(zip(items_text, items_exist))
    # remove empty keys 
    items_exist_dct = {k: v for k, v in items_exist_dct.items() if k}
    # remove if it contains \n (seems to happen if button exists)
    items_exist_dct = {k: v for k, v in items_exist_dct.items() if '\n' not in k}
    #remove entry "- Less features" (seems to happen if button exists)
    items_exist_dct = {k: v for k, v in items_exist_dct.items() if k != '- Less features'}
    
    return items_exist_dct
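
The three filtering steps at the end of `get_items` can also be done in a single pass, and the same pass is a natural place to prefix each key with the card name so that identical labels from different cards do not collide. This is only a sketch: `clean_items` and the `card` prefix format are hypothetical, and the sample pairs are made up.

```python
def clean_items(pairs, card=None, drop=("- Less features",)):
    """Build {label: exists} from (label, exists) pairs, skipping noisy labels."""
    cleaned = {}
    for label, exists in pairs:
        # skip empty labels, multi-line labels and the button text
        if not label or "\n" in label or label in drop:
            continue
        key = "{}: {}".format(card, label) if card else label
        cleaned[key] = exists
    return cleaned

raw = [("Smoking", True), ("", True), ("Pets", False),
       ("Smoking\nPets", True), ("- Less features", True)]
rules = clean_items(raw, card="rules")
print(rules)  # {'rules: Smoking': True, 'rules: Pets': False}
```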

##### Rules card

In [27]:
card = find_by_id (offer, tags_ids[2])
items = get_items (driver, card)
items

{'Smoking': True, 'Occasional overnight guests': True, 'Pets': False}

##### Landlord card

In [28]:
card = find_by_id (offer, tags_ids[3])
items = get_items (driver, card)
items

{'Female, 40+ years old': True, 'Professional': True, 'Has pets': True}

##### Bedroom card

In [29]:
card = find_by_id (offer, tags_ids[4])
items = get_items (driver, card)
items

{'1 Double bed': True,
 'Bedroom area 15m 2': True,
 'Wardrobe': True,
 'Chest of drawers': True,
 'Desk': True,
 'Chairs': True,
 'Towels': True,
 'Bed linen': True,
 'Window': True,
 'Sofa': False,
 'Sofa bed': False,
 'Balcony': False,
 'Tv': False,
 'Door lock': False}

#### Place card 

The description is usually in Portuguese. We will first click on the translate button and also check for a "see more" button.

Then we have to find which button is selected (ideally 'Apartment'), extract the icon labels and whether they are 'off' (the same as 'dimmed') or not, and click on the arrows until we complete a lap.

We need to do the following:
1. Find the card field. It is a class named 'unit-navigation-wrapper'.
2. Find the active button and extract its text. It is the second one from class 'unit-navigation__labels'.
3. Click the + button if it exists.
4. Find the feature field. It is a class named 'unit-navigation__features'.
5. Get the icon labels and check if they have 'off'. Every third div tag contains both aspects.
6. Put all features in a dictionary per icon label. Be careful, because some labels may be repeated.
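
The last step, keeping repeated labels apart, can be sketched on its own: each label gets a running index, so a second "WC" becomes a distinct key. `number_labels` is a hypothetical helper and the label list is made up; the `label_index` format is the same one used in the feature dictionary further down.

```python
from collections import Counter

def number_labels(labels):
    """Append a running index to each label so repeated rooms get distinct keys."""
    seen = Counter()
    keys = []
    for label in labels:
        keys.append("{}_{}".format(label, seen[label]))
        seen[label] += 1
    return keys

keys = number_labels(["Apartment", "WC", "Bathroom", "WC"])
print(keys)  # ['Apartment_0', 'WC_0', 'Bathroom_0', 'WC_1']
```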

In [30]:
card = find_by_id (offer, tags_ids[5])

##### Description

In [31]:
def get_place_descr (driver, section, button_translate = 'toggle-description',
                     button_more = "truncate-multiline--button", 
                     clas = 'apartment-details__description--truncate'):
    
    section.location_once_scrolled_into_view

    button = find_by_class_el (section, button_translate)
    if type(button) != str and 'translated' in button.text.lower(): # not in english
        button.click()
    
    click_if (driver, section, button_more)
    time.sleep(1)
    description = find_by_class_el (section, clas)
    
    return {"place_description" : description.text.replace('\nSee less', '')}

In [32]:
description = get_place_descr (driver, card)
description

{'place_description': "If you're looking for an apartment in Lisbon thats in a residential area, but well-connected and in the city centre, then check out this flat located in Intendente. The accommodation has a total of three bedrooms, a living room, a kitchen and two common bathrooms. The apartment is also close to Instituto Superior Técnico If you want to know more about this neighbourhood please click on the Discover more about Intendente in the map section below!"}

##### Features 

In [33]:
def find_active_item (lst, attr):
    for item in lst:
        if "active" in item.get_attribute(attr):
            return item

def get_active_place_label (driver, section,  el = 'unit-navigation__labels', xpath = 'div',
                       attr = 'class', button_more = 'unit-navigation__features__show-more'):
    
    click_if (driver, section, button_more)

    labels = find_by_class_el (section, el)
    lst_labels = find_by_xpath (labels, xpath)
    active_item = find_active_item (lst_labels, attr)    
    
    return active_item.text

def get_features (section, el = 'unit-navigation__features', keyword='off', attr='class'):
    
    clas_features = find_by_class_el (section, el)
    lst_features = find_by_tag(clas_features, 'div')
    lst_features = lst_features[2::3]
    
    items_exist = [keyword not in item.get_attribute(attr) for item in lst_features]
    items_text = [x.text for x in lst_features]

    items_exist_dct = dict(zip(items_text, items_exist))
    
    return items_exist_dct

In [34]:
def get_all_features (driver, section, start = 'Apartment', 
                      max_it = 20, button_arrow = "unit-navigation__arrow"):
    
    section.location_once_scrolled_into_view

    all_features = {}
    actives = []
    
    #we want to start with "Apartment" as active
    active = get_active_place_label (driver, section)
    it = 0
        
    while start not in active:
        button = find_by_class_el (section, button_arrow)
        print(button)
        if button == "Does not exist":
            return {}        
        driver.execute_script("arguments[0].click();", button)
        active = get_active_place_label (driver, section)
        it = it + 1
        if it == max_it:
            return {}
    active_features = get_features (section)
    actives.append(active)
    active = active + "_" + str(actives.count(active)-1)
    all_features[active] = active_features
    
    button = find_by_class_el (section, button_arrow)
    if button == "Does not exist":
        return all_features

    driver.execute_script("arguments[0].click();", button)
    
    active = get_active_place_label (driver, section)
    
    while start not in active:
        
        active_features = get_features (section)
        actives.append(active)
        active = active + "_" + str(actives.count(active)-1)
        all_features[active] = active_features    
        
        button = find_by_class_el (section, button_arrow)
        driver.execute_script("arguments[0].click();", button)
        
        active = get_active_place_label (driver, section)

    return all_features   

The previous function is not ideal because it calls other functions with their default arguments and gives no option to change them. However, we will keep it like this for now.

In [35]:
all_features = get_all_features (driver, card)
all_features

{'Apartment_0': {'Floor plan - 180 m2': True,
  'Wi-Fi': True,
  'Cable Tv': True,
  'Towels & bed linen': True,
  'Accessibility': False,
  'Central heating': False,
  'Air conditioning': False,
  'Outdoor area': False,
  'Elevator': False},
 'WC_0': {'Toilet': True, 'Sink': True, 'Window': False},
 'Bathroom_0': {'Window': True,
  'Toilet': True,
  'Sink': True,
  'Bathtub': True,
  'Shower': False},
 'Kitchen_0': {'Floor plan - 20 m2': True,
  'Chairs': True,
  'Window': True,
  'Fridge': True,
  'Freezer': True,
  'Stove': True,
  'Oven': True,
  'Washing machine': True,
  'Dishes & cutlery': True,
  'Pots & pans': True,
  'Table': True,
  'Balcony': False,
  'Microwave': False,
  'Dryer': False,
  'Dishwasher': False},
 'Living room_0': {'Floor plan - 48 m2': True,
  'Chairs': True,
  'Sofa': True,
  'Window': True,
  'Coffee table': True,
  'Tv': True,
  'Desk': False,
  'Sofa bed': False,
  'Balcony': False,
  'Table': False}}

#### Rental card 

Rental data is inside "rental-conditions__row" classes (one for each row).

In [36]:
def get_rental_features (section, el = "rental-conditions__row", xpath = 'div',
                         split = "\n"):
    
    section.location_once_scrolled_into_view
    lst_clas = find_by_class_els (section, el)

    dictionary = {}

    for clas in lst_clas:
        features = find_by_xpath(clas, xpath) # only direct children
        for feature in features:
            f = feature.text.split(split)
            dictionary[f[0]] = f[1]

    return dictionary

In [37]:
card = find_by_id (offer, tags_ids[6])
rental_features = get_rental_features (card)
rental_features

{'Contract': 'Fortnightly',
 'Bills': 'Some included',
 'Cancellation policy': 'Moderate',
 'Security deposit': 'Equal to first rent',
 'Cleaning Frequency': 'None',
 'Minimum stay': '27 nights'}
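
Each rental entry arrives as a label and a value separated by a newline, and `f[1]` would raise an IndexError for an entry with no newline. A more forgiving way to split such text is `str.partition`, sketched below. `parse_rental_row` is a hypothetical helper and the third sample row (a label without a value) is invented to show the fallback.

```python
def parse_rental_row(text, sep="\n"):
    """Split 'Label<sep>Value' into (label, value); value is None if absent."""
    label, found, value = text.partition(sep)
    return label.strip(), (value.strip() if found else None)

rows = ["Contract\nFortnightly", "Bills\nSome included", "Minimum stay"]
rental = dict(parse_rental_row(r) for r in rows)
print(rental)
# {'Contract': 'Fortnightly', 'Bills': 'Some included', 'Minimum stay': None}
```

Note that `partition` splits on the first separator only, so a value that itself contains a newline is kept whole.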

#### About the landlord card

In [38]:
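def about_landlord_card (section, el = "about-landlord__item"):
    
    section.location_once_scrolled_into_view
    lst_items = find_by_class_els (section, el)

    tmp = [x.text.split("\n") for x in lst_items[:1]][0]
    
    # a third line in the first item is taken to mean the landlord is trusted;
    # check the length before indexing so that shorter items do not raise IndexError
    trusted = len(tmp) >= 3
    # without the extra line, fall back to the last line (an assumption)
    join_date = tmp[2] if trusted else tmp[-1]
    
    values = [tmp[0], join_date] + [x.text for x in lst_items[1:]] + [trusted]
    keys = ["type_landlord", "join_date", "reply_in", "reply_rate", "n_hosted", "trusted"]
    
    return dict(zip(keys, values))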

In [39]:
card = find_by_id (offer, tags_ids[7])
about_landlord_card (card)

{'type_landlord': 'Homeowner',
 'join_date': 'Joined Uniplaces on May, 2017',
 'reply_in': 'Usually replies in a few minutes',
 'reply_rate': 'Response rate is 100%',
 'n_hosted': 'Hosted 11 people',
 'trusted': True}

#### Reviews card

We will do the following to get the rating:
1. Find classes 'reviews__main__item' and get their 'reviews__main__item__title'
2. Find tags path
3. Count the number of fill='#F6A623' in tags (these are yellow stars)
4. Count the number of path tags: if five, there are no half stars; if six, the last is a half star.

Then for the details:
1. Use click_if function
2. In each 'reviews__main__item' find all 'reviews-details__item' and get their text
3. Find tags path
4. Count the number of fill='#F6A623' in tags (these are yellow stars)
5. Count the number of path tags: if five, there are no half stars; if six, the last is a half star.

We will not use written reviews.
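
The counting rule in the steps above can be sketched on plain lists of `fill` values, without a browser. `stars_from_fills` is a hypothetical helper, and the grey colour "#D8D8D8" is made up for the example; only the yellow value comes from the page.

```python
def stars_from_fills(fills, yellow="#F6A623"):
    """Apply the counting rule above to a list of `fill` attribute values."""
    count = sum(1 for fill in fills if fill == yellow)
    if len(fills) == 6:  # six paths: the last star is drawn as a half star
        count -= 0.5
    return count

full = stars_from_fills(["#F6A623"] * 5)                # five paths, all yellow
half = stars_from_fills(["#F6A623"] * 5 + ["#D8D8D8"])  # six paths
print(full, half)  # 5 4.5
```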


In [40]:
def count_stars (lst_items, tag = 'path', yellow = '#F6A623'):
    counts = []
    names = []
    for item in lst_items:
        names.append(re.split(r'( \d+)', item.text)[0])
        paths = find_by_tag(item, tag)
        count = 0
        for path in paths:
            if path.get_attribute('fill') == yellow:
                count = count + 1
        if len(paths) == 6: #(the last is a half star)
            count = count - 0.5
        counts.append(count)
        
    return dict(zip(names, counts))

def rating_details (section, el_main = 'reviews__main__item', 
                    el_det = 'reviews-details__item', button = 'reviews__main__button'):
    
    section.location_once_scrolled_into_view
    lst_main = find_by_class_els(section, el_main) # search inside the section passed in
    dict_main = count_stars (lst_main)
    click_if(driver, section, button) # relies on the global driver
    lst_det = find_by_class_els(section, el_det)
    
    return {**dict_main, **count_stars (lst_det)}

In [41]:
card = find_by_id (offer, tags_ids[8])
reviews_details = rating_details (card)
reviews_details

{'This room': 5,
 'This place': 5,
 'The landlord': 5,
 'Room': 5,
 'Listing accuracy': 5,
 'Value for money': 5,
 'Location': 5,
 'Common areas': 5,
 'Communication': 5,
 'Availability': 5}

#### Neighborhood card

In [42]:
def get_neighborhood (section, el = "neighbourhood__image__content__title"):
    section.location_once_scrolled_into_view
    value = find_by_class_el (section, el)
    if value == "Does not exist":
        return {'neighboorhood' : 'Unknown'}
    return {'neighboorhood' : value.text}

In [43]:
card = find_by_id (offer, tags_ids[9])
get_neighborhood (card)

{'neighboorhood': 'Intendente'}

#### Similar places card

In [44]:
def get_n_similar (section, el="recommendations__wrapper", tag="a"):
    section.location_once_scrolled_into_view
    wrapper = find_by_class_el (section, el)
    return {"n_similar" : len(find_by_tag(wrapper, tag))}

In [45]:
card = find_by_id (offer, tags_ids[10])
get_n_similar (card)

{'n_similar': 5}

## Gather data from listing (All steps)

We need to create a function to extract data from all data fields. This could also be done with pipelines, but I find functions more readable.

We will not get the information in the "about the landlord", "reviews" and "n_similar" because it was taking too long, and this data is not particularly interesting.

In [46]:
def get_all_listing (driver, el="offer-page__container", xpath = "div", attr="id"):
    
    offer = find_by_class_el (driver, el)
    tags = find_by_xpath(offer, xpath) #this way we get only direct children

    tags_ids = [tag.get_attribute(attr) for tag in tags]
    
    dictionary = {}
    
    # overview
    card = find_by_id (offer, tags_ids[0])
    dictionary = {**dictionary, **get_overview(card)}
    print("Got field 1.")
    
    # price, availability, and url
    card = find_by_id (offer, tags_ids[1])
    dictionary = {**dictionary, **get_price_avail_data(card)}
    print("Got field 2.")

    # rules
    card = find_by_id (offer, tags_ids[2])
    dictionary = {**dictionary, **get_items(driver, card)}
    print("Got field 3.")
    
    # landlord aspects
    card = find_by_id (offer, tags_ids[3])
    dictionary = {**dictionary, **get_items(driver, card)}
    print("Got field 4.")
    
    # bedroom aspects
    card = find_by_id (offer, tags_ids[4])
    dictionary = {**dictionary, **get_items(driver, card)}
    print("Got field 5.")
    
    # place 
    card = find_by_id (offer, tags_ids[5])
    #dictionary = {**dictionary, **get_place_descr(driver, card)} #it takes too long
    place_features = get_all_features (driver, card)
    print("Got field 6.")

    # rental 
    card = find_by_id (offer, tags_ids[6])
    dictionary = {**dictionary, **get_rental_features(card)}
    print("Got field 7.")
    
    # about the landlord
    #card = find_by_id (offer, tags_ids[7])
    #dictionary = {**dictionary, **about_landlord_card(card)}
    
    # reviews
    #card = find_by_id (offer, tags_ids[8])
    #dictionary = {**dictionary, **rating_details(card)}

    # neighborhood
    card = find_by_id (offer, tags_ids[9])
    dictionary = {**dictionary, **get_neighborhood(card)}
    
    # n similar places
    #card = find_by_id (offer, tags_ids[10])
    #dictionary = {**dictionary, **get_n_similar(card)}
    
    return dictionary, place_features

### Join data

In [47]:
def get_freq (lst):
    freq = {}
    for item in lst:
        item = item.split("_")[0] + "_" + item.split("_")[1]
        if (item in freq):
            freq[item] = freq[item] + 1
        else:
            freq[item] = 1
    return freq
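
The room counting that `get_freq` supports can also be sketched with `collections.Counter`, going straight from feature keys such as 'WC_0' to per-room counts. `count_rooms` is a hypothetical helper and the key list is made up; the `n_<room>s` naming follows the convention used in the next cell.

```python
from collections import Counter

def count_rooms(feature_keys, skip="partment"):
    """Count rooms per type from keys such as 'WC_0', ignoring the apartment itself."""
    names = [key.rsplit("_", 1)[0] for key in feature_keys if skip not in key]
    return {"n_{}s".format(name): n for name, n in Counter(names).items()}

rooms = count_rooms(["Apartment_0", "WC_0", "Bathroom_0", "Bathroom_1", "Kitchen_0"])
print(rooms)  # {'n_WCs': 1, 'n_Bathrooms': 2, 'n_Kitchens': 1}
```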

In [48]:
def dicts_join (dict_main, dict_features):
    
    df = pd.DataFrame(dict_features)
    
    # drop Apartment and count rooms
    col_rooms = ["n_" + col.split("_")[0] + "s" for col in df.columns if "partment" not in col]
    # join data
    dict_main = {**dict_main, **get_freq (col_rooms)}
   
    df["sum"] = df.sum(axis=1)
    
    cols_to_drop = [col for col in df.index if "plan" in col]
    if cols_to_drop: # a non-empty list is truthy; comparing it with True is always False
        floor_plan = max([int(col.split("- ")[1].split(" m2")[0]) for col in cols_to_drop])
        df = df.drop(cols_to_drop)
        df.loc["Floor plan m2", "sum"] = floor_plan # write into "sum", the only column kept below
        
    df = df["sum"].astype(int)
    return {**dict_main, **df.to_dict()}

## Gather data from all listings from page

In [49]:
lst_dicts = []

def all_from_page (driver, listings):
    c = 0
    for listing in listings:
        print("Listing {}".format(c))
        ad_handle = open_ad (driver, listing)
        print(driver.title)
        time.sleep(3)
        dict_main, dict_features = get_all_listing (driver)

        lst_dicts.append(dicts_join (dict_main, dict_features))

        driver.close() 
        driver.switch_to.window(parent_handle)
        c += 1
    #return lst_dicts
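
As written, one listing that fails to parse (for example one that was removed) stops the whole loop. One way to keep going is to catch the error per listing and record it, as in the sketch below. `scrape_safely`, `fake_scrape` and the sample items are hypothetical; with Selenium, `on_error` could be a function that closes the tab and switches back to the parent window.

```python
def scrape_safely(scrape, items, on_error=None):
    """Run scrape(item) for each item; record failures instead of stopping."""
    results, failures = [], []
    for index, item in enumerate(items):
        try:
            results.append(scrape(item))
        except Exception as exc:  # broad on purpose: any failure skips this item
            failures.append((index, repr(exc)))
            if on_error is not None:
                on_error(item)
    return results, failures

def fake_scrape(listing):
    """Stand-in scraper that fails for a removed listing."""
    if listing == "removed":
        raise ValueError("listing no longer available")
    return {"title": listing}

results, failures = scrape_safely(fake_scrape, ["room A", "removed", "room B"])
```

The `failures` list keeps the position of each skipped listing, which makes it easy to revisit them afterwards.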

#### Go to next page

In [50]:
def go_next_page (driver, el = "sc-hrWEMg", tag = "a", attr1 = "rel", attr2 = "href"):
    
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") #to bottom
    clas = find_by_class_el(driver, el)    
    tags = find_by_tag(clas, tag)
    
    for tag in tags:
        if tag.get_attribute(attr1) == "next":
            if "null" not in tag.get_attribute(attr2):
                tag.click()
                return True
            else:
                return False # if it is the last

## Get data from all listings

We will take the listings from the first 20 pages.

In [None]:
go_to_next = True

c = 0
n_pages = 20
while go_to_next == True:
    if c >= n_pages: # stop after n_pages pages (c starts at 0); comment if needed
        break # comment if needed
    print("Page {}".format(c))
    listings = fetch_listings(driver)
    all_from_page (driver, listings)
    go_to_next = go_next_page (driver)
    time.sleep(3)
    c = c + 1

In [None]:
#df = pd.DataFrame(lst_dicts)
#df.to_csv("scraped_data/all_scraping.csv")

## Conclusions 

This was my first proper experience using Selenium. I think I learned the basics and maybe a little more: Uniplaces relies heavily on JavaScript, and even in cases where I could have avoided Selenium and used something like BeautifulSoup or string functions, I made an effort to use it. 

I was able to consistently extract all the data from the listings.

Still, there are some things that I would have done differently:

1. **Learn about WebDriverWait:** When I started, I was not aware of WebDriverWait or why it is needed. As a result, my code looks a bit "sewn" together. Had I known, I would have created generic functions earlier.
2. **Learn about location and exceptions:** The same goes for the location functions, which I found are only needed for some listings, and for exceptions and, in particular, expected conditions.
3. **Do generic functions sooner:** The cards of the listings are not that different. I would have created more generic functions. 
4. **Scrape random listings:** I noticed that the scraping could lead to a large number of variables and used some techniques to reduce it. However, I was not very efficient, as I was not aware, for example, that some variables may contain a "No photos" substring. I could have scraped random listings first to identify possible problems like this. Still, it is easier and faster to clean the dataframe than to scrape the data again.
5. **Learn to bypass request limits:** Fortunately, Uniplaces does not seem to limit the number of requests made. If I do this again, I will look into how to bypass such limits (while not doing anything illegal).
6. **Prepare for removed listings:** I would have prepared for the case where a listing was removed (this only happened once in all 20 pages). 
7. **Include name of card in key:** For example, "student" can refer to the tenant or the landlord, and I did not account for that.

The extraction of almost all the data from the listings led to a very complete, perhaps too complete, dataset. This is not uncommon, given that I did not have a problem/question in mind while acquiring the data. As such, I scraped as much data as possible, which led to an overcomplicated dataset. 
But that is okay, because I get to show that I can clean a dataset!

Overall, this notebook was a success.