# Web Scraping

### Web Scraping Best Practices:

- Never scrape more frequently than you need to.
- Consider caching the content you scrape so that it’s only downloaded once.
- Build pauses into your code using functions like time.sleep() to keep from overwhelming servers with too many requests too quickly.
- Video by [neuefische](https://www.youtube.com/watch?v=HMSe8WTNmFg)
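The caching and pausing advice above can be combined in one small helper. This is only a sketch: `make_polite_fetcher` is a hypothetical name, and the download function is passed in as a parameter so that the helper works with `requests` or anything else.

```python
import time


def make_polite_fetcher(fetch, delay_seconds=3.0, sleep=time.sleep):
    """Wrap a download function with an in-memory cache and a pause.

    `fetch` is any callable that takes a URL and returns the page content,
    for example `lambda url: requests.get(url, timeout=10).content`.
    """
    cache = {}

    def get(url):
        # Serve repeated requests from the cache: no download, no pause.
        if url in cache:
            return cache[url]
        # Pause before every real download to keep the request rate low.
        sleep(delay_seconds)
        cache[url] = fetch(url)
        return cache[url]

    return get
```

Calling the returned `get(url)` twice with the same URL downloads the page only once. The cache lives in memory, so it is lost when the kernel restarts.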

## BeautifulSoup

The library we will use today to collect apartment listings is [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It is a library for extracting data from HTML and XML files.

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library.

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.

## Import libraries

In [2]:
import time
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## The Blueground - robots.txt
User-agent: *  <br>
Disallow: /book <br>
Disallow: /book-failed<br>
Disallow: /book-thankyou<br>
Disallow: /expired<br>
Disallow: /feedback<br>
Disallow: /guests<br>
Disallow: /nps<br>
Disallow: /offers<br>
Disallow: /payment-failed<br>
Disallow: /payment-thankyou<br>
Disallow: /payments<br>
Disallow: /rating<br>
Disallow: /users<br>
Sitemap: https://www.theblueground.com/sitemap.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/ist.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/lon.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/par.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/vie.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/dxb.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/mia.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/nyc.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/sfo.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/lax.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/bos.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/wdc.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/chi.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/sea.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/den.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/atx.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/zrh.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/ber.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/mad.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/bcn.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/lis.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/bsl.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/hkg.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/cph.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/lux.xml<br>
Sitemap: https://www.theblueground.com/sitemap-images/sgp.xml<br>
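
Rules like these can also be checked in code before scraping. The sketch below uses `urllib.robotparser` from the standard library on a shortened copy of the rules above (the copy is pasted in by hand so that the example does not need a network connection).

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules listed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /book
Disallow: /guests
Disallow: /payments
Disallow: /users
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.theblueground.com"
# The listing overview is not covered by any Disallow rule.
listing_allowed = parser.can_fetch("*", BASE + "/furnished-apartments-london-uk")
# The booking pages are explicitly disallowed.
booking_allowed = parser.can_fetch("*", BASE + "/book")
print(listing_allowed, booking_allowed)
```

Against the live site, `parser.set_url(BASE + "/robots.txt")` followed by `parser.read()` would load the current rules instead of the pasted copy.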

## get the content of the website

In [3]:
# get the content of the website
# Blueground - London
# https://www.theblueground.com/furnished-apartments-london-uk
# page = requests.get("https://www.theblueground.com/furnished-apartments-london-uk")
# html = page.content

In [4]:
weblink = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'
pagesite = 10 # we take number 10 to test the code
page = requests.get(weblink +  f'offset={ pagesite }&items=18')
html = page.content
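
The URL in the cell above is glued together from strings. As an alternative sketch, the query string can be built with `urllib.parse.urlencode`, which avoids misplaced `&` or `/` characters. `build_page_url` is a hypothetical helper name; the parameter names are the ones from the link above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.theblueground.com/furnished-apartments-london-uk"


def build_page_url(offset, items=18, currency="GBP", language="en"):
    """Return the listing URL for one page of results."""
    query = urlencode(
        {"currency": currency, "language": language, "offset": offset, "items": items}
    )
    return f"{BASE_URL}?{query}"
```

`requests.get(build_page_url(10))` would then request the same page as the cell above.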

We can use the BeautifulSoup library to parse this document, and extract the information from it.

We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

In [5]:
# parse the html and save it into a BeautifulSoup instance
bs = BeautifulSoup(html, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.

In [6]:
#print(bs.prettify())

But what if we have more than one element with the same tag? Then we can just use the ```.find_all()``` method of BeautifulSoup:
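
To see the difference between `.find()` and `.find_all()` without touching the real page, here is a minimal sketch on a tiny, made-up HTML snippet. The markup is hypothetical; only the class name `listing-name` is taken from the cells below.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modelled on a listing page.
SAMPLE_HTML = """
<div class="listing"><span class="listing-name"> Harbour Wy., </span></div>
<div class="listing"><span class="listing-name"> Bateman St, </span></div>
"""

sample = BeautifulSoup(SAMPLE_HTML, "html.parser")

first = sample.find(class_="listing-name")          # first match only
all_names = sample.find_all(class_="listing-name")  # every match, as a list

# get_text(strip=True) removes the surrounding whitespace in one step
names = [tag.get_text(strip=True) for tag in all_names]
```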

### Searching for the Apartment/Studio Name

In [7]:
# get the list of all the apartments
object_names = bs.find_all(class_="listing-name")
object_names_lst = (object_name.get_text() for object_name in object_names)
object_names_lst = [object_name.strip() for object_name in object_names_lst]
object_names_lst[:5]


['Harbour Wy.,',
 'Bateman St,',
 'Green St,',
 'Tottenham Court Rd,',
 'St George Wharf,']

Looking for the neighborhood

In [8]:
neighborhood_names = bs.find_all("div", {"class":"name-place"})
neighborhood_names_lst = (neighborhood_name.get_text() for neighborhood_name in neighborhood_names)
neighborhood_names_lst = [neighborhood_name.strip() for neighborhood_name in neighborhood_names_lst]
neighborhood_names_lst[:5]

['Harbour Wy., Canary Wharf  - 165',
 'Bateman St, Soho  - 88',
 'Green St, Mayfair  - 94',
 'Tottenham Court Rd, Fitzrovia  - 63',
 'St George Wharf, Vauxhall  - 148']

In [9]:
neighborhood_names = bs.find_all("div", {"class":"name-place"})
neighborhood_names_lst = (neighborhood_name.get_text() for neighborhood_name in neighborhood_names)
neighborhood_names_lst = [neighborhood_name.strip() for neighborhood_name in neighborhood_names_lst]
neighborhood_names_lst = [i.rsplit(',', 1)[-1] for i in neighborhood_names_lst]
neighborhood_names_lst = [i.rsplit('-', 1)[0] for i in neighborhood_names_lst]
neighborhood_names_lst = [i.strip() for i in neighborhood_names_lst]

neighborhood_names_lst[:5]

['Canary Wharf', 'Soho', 'Mayfair', 'Fitzrovia', 'Vauxhall']
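
The chain of `rsplit` calls above can be wrapped in one function that returns street, neighborhood and listing number together. This is a sketch with a hypothetical function name; it assumes every string has the form `street, neighborhood - number`, as in the output shown above.

```python
def split_name_place(text):
    """Split 'Street, Neighborhood  - 165' into its three parts."""
    # The listing number follows the last hyphen.
    place, _, number = text.rpartition("-")
    # The neighborhood follows the last comma; the street may contain commas itself.
    street, _, neighborhood = place.rpartition(",")
    return street.strip(), neighborhood.strip(), number.strip()
```

Splitting from the right matters for entries such as `'Marsh Wall, S Quay Square, Canary Wharf  - 139'`, where the street part contains a comma of its own.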

In [10]:
soup = bs
result = soup.find_all("div", {"class":"name-place"})

for res in result:
    print(res.text)

Harbour Wy., Canary Wharf  - 165
Bateman St, Soho  - 88
Green St, Mayfair  - 94
Tottenham Court Rd, Fitzrovia  - 63
St George Wharf, Vauxhall  - 148
City Rd, Old Street  - 125
Baltimore Wharf, Canary Wharf  - 133
Marylebone Rd, Marylebone  - 62
Dock St, Whitechapel/Brick Lane  - 182
Haymarket, Piccadilly  - 36
Marsh Wall, S Quay Square, Canary Wharf  - 139
Sussex St, Pimlico  - 39
St George Wharf, Vauxhall  - 130
Fitzroy Square, Fitzrovia  - 68
Macklin St, Covent Garden  - 211
Lower Marsh, Waterloo  - 104


### Available tag

In [11]:
# get the "available" tag
available_tags = bs.find_all(class_="availability__available")
available_tag_lst = (available_tag.get_text() for available_tag in available_tags)
available_tag_lst = [available_tag.strip() for available_tag in available_tag_lst]
available_tag_lst[:5]


['Available', 'Available', 'Available', 'Available', 'Available']

### available from

In [12]:
# get the "available from" date
availables = bs.find_all(class_="availability__date")
available_lst = (available.get_text() for available in availables)
available_lst = [available.strip() for available in available_lst]
available_lst[:5]


['22 May 2023', '30 May 2023', '01 Jun 2023', '02 Jun 2023', '02 Jun 2023']
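
The dates come back as plain strings. If we want to sort or filter by them later, they can be converted with `datetime.strptime`. A small sketch with a hypothetical helper name, assuming English month abbreviations and the default locale:

```python
from datetime import datetime


def parse_available_date(text):
    """Convert a string such as '22 May 2023' into a datetime.date."""
    # %d = day of month, %b = abbreviated month name, %Y = four-digit year
    return datetime.strptime(text.strip(), "%d %b %Y").date()
```

Applied to the list above: `[parse_available_date(d) for d in available_lst]`.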

### price - currency

In [13]:
# get the price__currency
price_currencys = bs.find_all(class_= "price__currency")
price_currencys_lst = [price_currency.get_text() for price_currency in price_currencys]
price_currencys_lst = [price_currency.strip() for price_currency in price_currencys_lst]
price_currencys_lst[:5]

['£', '£', '£', '£', '£']

### price - amount

In [14]:
# get the price amount
prices = bs.find_all(class_= "price__amount")
prices_lst = [price.get_text() for price in prices]
prices_lst = [price.strip() for price in prices_lst]
prices_lst[:5]

['5,170', '4,050', '3,180', '4,060', '4,300']
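
The amounts are strings with a thousands separator, so they cannot be used for calculations yet. A minimal sketch for the conversion (hypothetical helper name, assuming whole amounts with `,` as the thousands separator as in the output above):

```python
def parse_price(amount, currency="£"):
    """Turn a scraped amount such as '5,170' into an integer plus its currency."""
    # Drop the thousands separator before converting.
    return int(amount.replace(",", "").strip()), currency.strip()
```

With the lists from the cells above this could be used as `[parse_price(a, c) for a, c in zip(prices_lst, price_currencys_lst)]`.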

### price per month etc

In [15]:
# get the price per month tag
prices_month = bs.find_all(class_= "monthly-price__suffix monthly-price__suffix--mobile")
prices_month_lst = [price_month.get_text() for price_month in prices_month]
prices_month_lst = [price_month.strip() for price_month in prices_month_lst]
prices_month_lst = [price_month.replace('/', '') for price_month in prices_month_lst]
prices_month_lst = [price_month.replace('mo', 'month') for price_month in prices_month_lst]
prices_month_lst[:5]

['month', 'month', 'month', 'month', 'month']

### Complete description of the apartment

In [16]:
# get the description of the apartments
descriptions= bs.find_all(class_="listing-amenities")
descriptions_lst = [description.get_text() for description in descriptions]
descriptions_lst = [description.strip() for description in descriptions_lst]
descriptions_lst[:8]

['2 Bedroom2 Bath 30th Floor | Gym | Doorman',
 '1 Bedroom1 Bath 3rd Floor',
 'Studio1 Bath Lower Ground Floor',
 '2 Bedroom1.5 Bath City view | 1st, 2nd Floor | Pets allowed',
 '2 Bedroom2 Bath 3rd Floor | Doorman | Balcony',
 '1 Bedroom1 Bath City view | 2nd Floor | Pool',
 '2 Bedroom2 Bath City view | 15th Floor | Gym',
 '2 Bedroom2 Bath Ground Floor | Gym | Doorman']

### look for property_type

### main-amenities of the apartment

In [17]:
bedrooms = bs.find_all(class_="main-amenities")
bedrooms_lst = [bedroom.get_text() for bedroom in bedrooms]
bedrooms_lst = [bedroom.strip() for bedroom in bedrooms_lst]
bedrooms_lst = [i.split('o', 1)[0] for i in bedrooms_lst]
bedrooms_lst = [i.replace('Bedr', 'Bedroom') for i in bedrooms_lst]
bedrooms_lst = [i.replace('Studi', 'Studio') for i in bedrooms_lst]
bedrooms_lst = [i.rsplit(' ', 1)[-1] for i in bedrooms_lst]
#bedrooms_lst = [i.replace('Bedroom', 'Apartment') for i in bedrooms_lst]
bedrooms_lst[:8]

['Bedroom',
 'Bedroom',
 'Studio',
 'Bedroom',
 'Bedroom',
 'Bedroom',
 'Bedroom',
 'Bedroom']

bedroom

In [18]:
# get the main-amenities of the apartments
main_amenities= bs.find_all(class_="main-amenities")
main_amenities_lst = [main_amenitie.get_text() for main_amenitie in main_amenities]
main_amenities_lst = [main_amenitie.strip() for main_amenitie in main_amenities_lst]
main_amenities_lst = [i.split('o', 1)[0] for i in main_amenities_lst]
main_amenities_lst = [i.replace('Bedr', 'Bedroom') for i in main_amenities_lst]
main_amenities_lst[:8]

['2 Bedroom',
 '1 Bedroom',
 'Studi',
 '2 Bedroom',
 '2 Bedroom',
 '1 Bedroom',
 '2 Bedroom',
 '2 Bedroom']

bathroom

In [19]:
# get the main-amenities of the apartments
main_amenities= bs.find_all(class_="main-amenities")
main_amenities_lst = [main_amenitie.get_text() for main_amenitie in main_amenities]
main_amenities_lst = [main_amenitie.strip() for main_amenitie in main_amenities_lst]
main_amenities_lst = [i.rsplit('o', 1)[-1] for i in main_amenities_lst]
main_amenities_lst = [i.replace('m', '') for i in main_amenities_lst]
main_amenities_lst[:7]

['2 Bath', '1 Bath', '1 Bath', '1.5 Bath', '2 Bath', '1 Bath', '2 Bath']
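
Splitting on the letter `o` works for the current data but is fragile (note the `'Studi'` in the output further up). As an alternative sketch, a regular expression can pull the property type, the number of bedrooms and the number of bathrooms out of the same text in one go. The function name and the returned keys are hypothetical; the text format is assumed to be the one shown in the description output above.

```python
import re

# Either "<n> Bedroom" or "Studio", followed by "<n> Bath" (n may be 1.5).
_MAIN_AMENITIES = re.compile(
    r"(?:(?P<beds>\d+)\s*Bedrooms?|(?P<studio>Studio))\s*(?P<baths>\d+(?:\.\d+)?)\s*Bath"
)


def parse_main_amenities(text):
    """Return property type, bedrooms and bathrooms, or None if nothing matches."""
    match = _MAIN_AMENITIES.search(text)
    if match is None:
        return None
    is_studio = match.group("studio") is not None
    return {
        "property_type": "Studio" if is_studio else "Bedroom",
        "bedrooms": 0 if is_studio else int(match.group("beds")),
        "bathrooms": float(match.group("baths")),
    }
```

Counting a studio as 0 bedrooms is a design choice made for this sketch, not something the website states.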

### main_amenities_amenity separated

In [20]:
# get the main-amenities of the apartments separated
main_amenities_amenitys = bs.find_all(class_="main-amenities__amenity")
main_amenities_amenity_lst = [main_amenitie_amenity.get_text() for main_amenitie_amenity in main_amenities_amenitys]
main_amenities_amenity_lst = [main_amenitie_amenity.strip() for main_amenitie_amenity in main_amenities_amenity_lst]
main_amenities_amenity_lst[:5]

['2 Bedroom', '2 Bath', '1 Bedroom', '1 Bath', 'Studio']

In [21]:
# main_amenities_amenity_lst = [main_amenitie.strip() for main_amenitie in main_amenities_amenity_lst]
# main_amenities_amenity_lst[:5]

### rest_amenities of apartment

In [22]:
# get the rest of the amenities
rest_amenities = bs.find_all(class_="rest-amenities")
rest_amenities_lst = [rest_amenity.get_text() for rest_amenity in rest_amenities]
rest_amenities_lst = [rest_amenity.strip() for rest_amenity in rest_amenities_lst]
rest_amenities_lst[:5]

['30th Floor | Gym | Doorman',
 '3rd Floor',
 'Lower Ground Floor',
 'City view | 1st, 2nd Floor | Pets allowed',
 '3rd Floor | Doorman | Balcony']
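
Each entry is still one string with `|` between the amenities. A short sketch (hypothetical helper name) that turns such a string into a list, which is easier to search later:

```python
def split_rest_amenities(text):
    """Split 'City view | 1st, 2nd Floor | Pets allowed' into a list of amenities."""
    # Split on the pipe only: commas can be part of an amenity ('1st, 2nd Floor').
    return [part.strip() for part in text.split("|") if part.strip()]
```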

### Get the hyperlink from the website to the detail page

In [23]:
# we will get all the elements of the class "ui-image-carousel"
results = bs.find_all(class_="ui-image-carousel")
# we will look for the element a
find_a= results[0].find_all('a')
print(find_a)


[<a aria-label="property" href="/furnished-apartments-london-uk/london-canary-wharf-165" rel="nofollow" target="_blank"><div class="flicking-viewport ui-image-carousel__photos"><div class="flicking-camera" style=""><!--[--><div class="ui-image-carousel__photo-container" style=""><img alt="2 bedroom furnished apartment in Harbour Wy. 165, Canary Wharf, London, photo 1" class="ui-image-carousel__photo" fetchpriority="high" loading="eager" sizes="(max-width: 360px) 368px, 736px" src="https://photos2.theblueground.com/736/pg20360-o-27ff2fcd-4730-617f-019a-526dc3b3b2e9.jpg" srcset="https://photos2.theblueground.com/368/pg20360-o-27ff2fcd-4730-617f-019a-526dc3b3b2e9.jpg 368w, https://photos2.theblueground.com/736/pg20360-o-27ff2fcd-4730-617f-019a-526dc3b3b2e9.jpg 736w" title="2 bedroom furnished apartment in Harbour Wy. 165, Canary Wharf, London, photo 1"/> <!--v-if--></div><!--]--><!--[--><div class="ui-image-carousel__photo-container" style=""><!--v-if--> <div class="ui-placeholder ui-plac

In [24]:
# this will give us all the links of the website
soup = bs

for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])

Found the URL: /
Found the URL: /furnished-apartments-london-uk/london-canary-wharf-165
Found the URL: /furnished-apartments-london-uk/london-canary-wharf-165
Found the URL: /furnished-apartments-london-uk/london-soho-088
Found the URL: /furnished-apartments-london-uk/london-soho-088
Found the URL: /furnished-apartments-london-uk/london-mayfair-094
Found the URL: /furnished-apartments-london-uk/london-mayfair-094
Found the URL: /furnished-apartments-london-uk/london-fitzrovia-063
Found the URL: /furnished-apartments-london-uk/london-fitzrovia-063
Found the URL: https://www.theblueground.com/blueground-pass
Found the URL: /furnished-apartments-london-uk/london-vauxhall-148
Found the URL: /furnished-apartments-london-uk/london-vauxhall-148
Found the URL: /furnished-apartments-london-uk/london-old-street-125
Found the URL: /furnished-apartments-london-uk/london-old-street-125
Found the URL: /furnished-apartments-london-uk/london-canary-wharf-133
Found the URL: /furnished-apartments-london
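
The output shows three things to clean up: navigation links that are not listings, every listing link appearing twice, and relative paths. A sketch of a filter for this, using `urljoin` from the standard library. The function name is hypothetical, and the prefix is taken from the links printed above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.theblueground.com"
LISTING_PREFIX = "/furnished-apartments-london-uk/"


def detail_urls(hrefs):
    """Return absolute detail-page URLs, without duplicates, in page order."""
    seen = set()
    urls = []
    for href in hrefs:
        # Skip navigation links and anything that is not a listing detail page.
        if not href.startswith(LISTING_PREFIX) or href in seen:
            continue
        seen.add(href)
        urls.append(urljoin(BASE_URL, href))
    return urls
```

With the soup from above it could be called as `detail_urls(a['href'] for a in bs.find_all('a', href=True))`.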

### how to find the link

In [53]:
# this will get us the link to the detail page
class_with_link = bs.find_all(class_="ui-image-carousel")
# with the [] we can select the element we want to get
for a in class_with_link[0].find_all('a', href=True):
    url = a['href']
    #print("Found the URL:", a['href'])
    print(url)

/furnished-apartments-london-uk/london-soho-090


In [26]:
url_lst = []
# this will get us the link to the detail page
class_with_link = bs.find_all(class_="ui-image-carousel")
# with the [] we can select the element we want to get
for a in class_with_link[0].find_all('a', href=True):
    url_lst.append(a['href'])
    #print("Found the URL:", a['href'])
print(url_lst)

['/furnished-apartments-london-uk/london-canary-wharf-165']


In [27]:
url_lst = []

# this will get us the link to the detail page
class_with_link = bs.find_all(class_="ui-image-carousel")
# df is only created further below, so we take the index of the last carousel instead of df.index.max()
# (assumption: the page has one image carousel per listing)
count = len(class_with_link) - 1
# with the [] we can select the element we want to get
for a in class_with_link[count].find_all('a', href=True):
    url_lst.append(a['href'])
    #print("Found the URL:", a['href'])
print(url_lst)

In [None]:
page = requests.get('https://www.spotahome.com/s/london--uk/for-rent:apartments/for-rent:studios/bedrooms:3?features[]=pets&noDeposit=1')
html = page.content
bs = BeautifulSoup(html, 'html.parser')

ids = bs.find_all(class_ = 'l-list__item')
ids_lst = [id.get('data-homecard-scroll') for id in ids]

ids_lst[:5]

['619180', '806735', '641966', '641929']

In [52]:
url_lst = []
# this will get us the links to the detail pages (assuming one image carousel per listing)
class_with_link = bs.find_all(class_="ui-image-carousel")
# loop over the carousels directly instead of counting up to df.index.max() (df is only created further below)
for carousel in class_with_link:
    for a in carousel.find_all('a', href=True):
        url_lst.append(a['href'])
print(url_lst)

# Creating a DataFrame

In [None]:
df = pd.DataFrame()
df

In [None]:
df['object_name'] = pd.Series(object_names_lst)
df['available_tag'] = pd.Series(available_tag_lst)
#df['available'] = pd.Series(available_lst)
#df['description'] = pd.Series(descriptions_lst)
#df['main_amenities'] = pd.Series(main_amenities_lst)
#df['main_amenities_amenity'] =pd.Series( main_amenities_amenity_lst)
#df['rest_amenities'] = pd.Series(rest_amenities_lst)
#df['price_currencys'] = pd.Series(price_currencys_lst)
#df['prices'] = pd.Series(prices_lst)
#df['prices_month'] = pd.Series(prices_month_lst)
df['detail_links'] = pd.Series(url_lst)
display(df)


| | object_name | available_tag | detail_links |
|---|---|---|---|
| 0 | Harbour Wy., | Available | |
| 1 | Bateman St, | Available | |
| 2 | Green St, | Available | |
| 3 | Tottenham Court Rd, | Available | |
| 4 | St George Wharf, | Available | |
| 5 | City Rd, | Available | |
| 6 | Baltimore Wharf, | Available | |
| 7 | Marylebone Rd, | Available | |
| 8 | Dock St, | Available | |
| 9 | Haymarket, | Available | |


In [None]:
index = df.index
print(index.max())
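
Building the DataFrame from separate lists relies on every list having the same length and the same order. If one listing has no price or no link, the columns shift against each other. A possible alternative is to walk through the page one listing at a time and collect one record per listing. The sketch below runs on made-up HTML: the container class `listing-card` is an assumption and would have to be replaced by the real container class of the page, and the function names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the container class "listing-card" is an assumption.
SAMPLE_HTML = """
<div class="listing-card">
  <span class="listing-name">Harbour Wy.,</span>
  <span class="price__amount">5,170</span>
  <a href="/furnished-apartments-london-uk/london-canary-wharf-165">details</a>
</div>
<div class="listing-card">
  <span class="listing-name">Bateman St,</span>
  <a href="/furnished-apartments-london-uk/london-soho-088">details</a>
</div>
"""


def text_or_none(card, class_name):
    """Return the stripped text of the first matching element, or None."""
    tag = card.find(class_=class_name)
    return tag.get_text(strip=True) if tag else None


def extract_records(soup):
    """Collect one dict per listing so that the fields always stay together."""
    records = []
    for card in soup.find_all(class_="listing-card"):
        link = card.find("a", href=True)
        records.append({
            "object_name": text_or_none(card, "listing-name"),
            "price": text_or_none(card, "price__amount"),
            "detail_link": link["href"] if link else None,
        })
    return records


records = extract_records(BeautifulSoup(SAMPLE_HTML, "html.parser"))
```

`pd.DataFrame(records)` then gives one row per listing, with an empty cell where a field was missing.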

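### Save the current DataFrame

In [None]:
import datetime as dt # the cell uses dt.datetime, which is not imported anywhere above

today = dt.datetime.today().strftime('%Y-%m-%d_%H-%M') # date for the csv filename (no space or colon, so it is a valid filename on every system)
df.to_csv('blueground_{}.csv'.format(today), sep='\t')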

# Information to get data in a loop

Look for `class="blank-slate__criteria"`: this element appears once we have gone past the last page of the infinite scroll.

In [None]:
# link = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'
# for i in range(7):
#     time.sleep(3)
#     if i == 0:
#         page = requests.get(link + 'offset=1&items=18')
#         html = page.content
#     else:
#         print(link + f'offset={i}&items=18')
#         page = requests.get(link + f'offset={i}&items=18')
#         html = page.content

In [None]:
weblink = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'
pagesite = 1

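# find_all returns a list, so "not ..." is True as long as the blank-slate element is missing
# (comparing the list with False never matches, so the loop body would never run)
while not bs.find_all(class_="blank-slate__criteria"):
    time.sleep(5)
    url = weblink + f'offset={ pagesite }&items=18'
    page = requests.get(url)
    html = page.content
    bs = BeautifulSoup(html, 'html.parser')
    print(url)
    pagesite += 1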

This prints the correct URLs without sending any requests:

In [None]:
weblink = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'
pagesite = 1

while pagesite < 7:
    #time.sleep(5)
    #page = requests.get(weblink +  f'/offset={ pagesite }&items=18')
    #html = page.content
    #bs = BeautifulSoup(html, 'html.parser')
    print(weblink + f'offset={ pagesite }&items=18')
    pagesite += 1

### Website not found:

https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&offset=18&items=18

In [None]:
# get the content of the website
# Blueground - London
# https://www.theblueground.com/furnished-apartments-london-uk
page_not_found = requests.get("https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&offset=15&items=18")
html_not_found = page_not_found.content

In [None]:
# parse the html and save it into a BeautifulSoup instance
bs_not_found = BeautifulSoup(html_not_found, 'html.parser')

In [None]:
blank_slates = bs_not_found.find_all(class_="blank-slate__criteria")

blank_slates_lst = (blank_slate.get_text() for blank_slate in blank_slates)
blank_slates_lst

In [None]:
blank_slates_lst = [blank_slate.strip() for blank_slate in blank_slates_lst]
blank_slates_lst

### We try to stop the loop with 'class="blank-slate__criteria"' to see if we have reached the end of the infinite scroll

Working with an infinite-scroll website

https://medium.com/@harshvb7/scraping-from-a-website-with-infinite-scrolling-7e080ea8768e

https://stackoverflow.com/questions/69046183/how-do-i-scrape-a-website-with-an-infinite-scroller

https://stackoverflow.com/questions/64527791/scraping-an-infinite-scroll-page

https://stackoverflow.com/questions/12519074/scrape-websites-with-infinite-scrolling

In [None]:
weblink = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'
pagesite = 12

# blank_slates = bs_not_found.find_all(class_="blank-slate__criteria")
# blank_slates_lst = (blank_slate.get_text() for blank_slate in blank_slates)
# blank_slates_lst
# blank_slates_lst = [blank_slate.strip() for blank_slate in blank_slates_lst]
blank_slates_lst = []
stop_loop = "We’re sorry! We can’t seem to find any apartments that match your search."
print("Startlink:", weblink + f'offset={ pagesite }&items=18')
print("We are looking for:", stop_loop)
print(type(stop_loop))
print("We currently have:", blank_slates_lst)
print(type(blank_slates_lst))


# https://flexiple.com/python/check-if-list-is-empty-python/
# Solution 3: Using len() function
# The len() function returns the number of items in a list. If the list is empty, it returns 0.
while len(blank_slates_lst) == 0:
    time.sleep(5)
    page = requests.get(weblink +  f'offset={ pagesite }&items=18')
    html = page.content
    bs_loop = BeautifulSoup(html, 'html.parser')
    print(weblink + f'offset={ pagesite }&items=18')
    pagesite += 1

    blank_slates = bs_loop.find_all(class_="blank-slate__criteria")
    blank_slates_lst = (blank_slate.get_text() for blank_slate in blank_slates)
    print(blank_slates_lst)
    blank_slates_lst = [blank_slate.strip() for blank_slate in blank_slates_lst]
    print(blank_slates_lst)
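
The loop above stops as soon as the blank-slate element is found, but it has no upper limit: if the class name ever changes, it keeps requesting pages forever. A sketch of the same idea with a safety limit follows. The function name and its parameters are hypothetical, and the download and the end-of-results check are passed in as functions so that the loop itself can be tried without a network connection.

```python
def iterate_pages(fetch_page, is_last_page, start=1, max_pages=50):
    """Yield (offset, page) pairs until the end marker shows up.

    `fetch_page(offset)` downloads and parses one page;
    `is_last_page(page)` returns True once the 'no results' page is reached.
    `max_pages` is a safety limit so the loop cannot run forever.
    """
    for offset in range(start, start + max_pages):
        page = fetch_page(offset)
        if is_last_page(page):
            break
        yield offset, page
```

With the names from this notebook, `fetch_page` could sleep, request `weblink + f'offset={offset}&items=18'` and return the parsed soup, and `is_last_page` could be `lambda soup: bool(soup.find_all(class_="blank-slate__criteria"))`.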


# Pseudo Loop

In [None]:
import time
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

df_full = pd.DataFrame() # collects the rows of all pages
df_object = pd.DataFrame() # holds the rows of the current page



weblink = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'
pagesite = 12
page = requests.get(weblink +  f'offset={ pagesite }&items=18')
html = page.content

# parse the html and save it into a BeautifulSoup instance
bs = BeautifulSoup(html, 'html.parser')

# get the list of all the apartments
object_names = bs.find_all(class_="listing-name")
object_names_lst = (object_name.get_text() for object_name in object_names)
object_names_lst = [object_name.strip() for object_name in object_names_lst]
print(object_names_lst[:5])

df_object['object_name'] = pd.Series(object_names_lst)
df_full = pd.concat([df_full, df_object], axis=0, ignore_index=True)
display(df_full)
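
Calling `pd.concat` once per page works, but it copies the growing DataFrame every time. A common alternative is to collect one small DataFrame per page in a list and concatenate once at the end. The sketch below uses made-up page contents instead of real requests, and `df_all` is a hypothetical name.

```python
import pandas as pd

# Stand-ins for the names scraped from three result pages (the last page repeats a listing).
pages = [
    ["Harbour Wy.,", "Bateman St,"],
    ["Green St,"],
    ["Bateman St,"],
]

frames = []
for names in pages:
    # One small DataFrame per page.
    frames.append(pd.DataFrame({"object_name": names}))

# Concatenate once at the end, then drop listings that appeared on more than one page.
df_all = (
    pd.concat(frames, ignore_index=True)
    .drop_duplicates(subset=["object_name"], keep="first")
    .reset_index(drop=True)
)
```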


# Combining the Loop with a run

## This code runs, do not edit it

In [None]:
# import all the libraries

import time # to pause the code
import requests # to get the content of the website
from bs4 import BeautifulSoup # to parse the html
import re # to use regular expressions
import pandas as pd # to use pandas
import numpy as np # to use numpy

In [29]:
# create an empty dataframe
df_full = pd.DataFrame()
df_object = pd.DataFrame()
df_search = pd.DataFrame()

In [30]:
# set up the link to the website

weblink = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'

# set up the first page to scrape
pagesite = 10 # we set it to 10 to test the code

# create an empty list to store the blank slates
blank_slates_lst = [] 

# set the stop condition
stop_loop = "We’re sorry! We can’t seem to find any apartments that match your search." 


In [31]:
# https://flexiple.com/python/check-if-list-is-empty-python/
# Solution 3: Using len() function
# The len() function returns the number of items in a list. If the list is empty, it returns 0.
while len(blank_slates_lst) == 0: # run until the "no results" message has been found
    # pause the loop for 3 seconds to reduce the load on the server
    time.sleep(3)


    # get the content of the website
    page = requests.get(weblink +  f'offset={ pagesite }&items=18')
    # keep the raw HTML of the response
    html = page.content

    # parse the html and save it into a BeautifulSoup instance
    bs = BeautifulSoup(html, 'html.parser')

    # get the list of all the apartments
    object_names = bs.find_all(class_="listing-name")
    object_names_lst = (object_name.get_text() for object_name in object_names)
    object_names_lst = [object_name.strip() for object_name in object_names_lst]
    #print(object_names_lst[:5])

    # create an empty dataframe to store the object names
    df_object = pd.DataFrame()

    # store the object names in the dataframe
    df_object['object_name'] = pd.Series(object_names_lst)
    df_full = pd.concat([df_full, df_object], axis=0, ignore_index=True)

    # drop duplicates
    df_full.drop_duplicates(subset=['object_name'], keep='first', inplace=True)

    # set the number of rows to display to maximum
    pd.set_option('display.max_rows', None)

    # display the dataframe from the loop
    display(df_full)


    # check if we reached the end of the pages
    blank_slates = bs.find_all(class_="blank-slate__criteria")
    blank_slates_lst = (blank_slate.get_text() for blank_slate in blank_slates)
    blank_slates_lst = [blank_slate.strip() for blank_slate in blank_slates_lst]
    # print the list to make sure it works
    #print(blank_slates_lst)

    # increase the pagesite by 1
    pagesite += 1



| | object_name |
|---|---|
| 0 | Harbour Wy., |
| 1 | Bateman St, |
| 2 | Green St, |
| 3 | Tottenham Court Rd, |
| 4 | St George Wharf, |
| 5 | City Rd, |
| 6 | Baltimore Wharf, |
| 7 | Marylebone Rd, |
| 8 | Dock St, |
| 9 | Haymarket, |



## Here you can start editing again

# We now try to create functions to work with the loop

#### by Markus

In [48]:
# parse the html and save it into a BeautifulSoup instance
#bs = BeautifulSoup(html, 'html.parser')

def get_object_name(bs):
    # get the names of all the apartments
    lst_name = []
    object_names = bs.find_all(class_="listing-name")
    for object_name in object_names:
        lst_name.append(
            object_name.get_text()
                .strip()
        )
    return lst_name

In [49]:
def get_url_to_detail_page(bs, maximus):
    url_lst = []
    # this will get us the elements that wrap the link to the detail page
    class_with_link = bs.find_all(class_="ui-image-carousel")
    # walk through the carousels by index, up to and including maximus
    for count in range(int(maximus) + 1):
        # with the [] we can select the element we want to get
        for a in class_with_link[count].find_all('a', href=True):
            url_lst.append(a['href'])
    return url_lst

In [51]:
# https://flexiple.com/python/check-if-list-is-empty-python/
# Solution 3: Using len() function
# The len() function returns the number of items in a list. If the list is empty, it returns 0.
while len(blank_slates_lst) == 0: # run until the "no results" message has been found
    # pause the loop for 3 seconds to reduce the load on the server
    time.sleep(3)


    # get the content of the website
    page = requests.get(weblink +  f'offset={ pagesite }&items=18')
    # keep the raw HTML, then parse it into a BeautifulSoup instance
    html = page.content
    bs = BeautifulSoup(html, 'html.parser')

    # create a pandas dataframe for the names and prices
    blueground_dict = {
        'object_names': get_object_name(bs),
        }

    # we now have a dataframe, we can use this to get a counter for the URL
    df_page = pd.DataFrame(blueground_dict)

    # we create a variable to store the number of rows in the dataframe
    maximus = df_page.index.max() # we give it the max value of the index


    if np.isnan(maximus):
        break
    else:
        df_page['get_url_to_detail_page'] = pd.Series(get_url_to_detail_page(bs, maximus))
    # we can now add the page dataframe to the full dataframe
    # (DataFrame.append was removed in pandas 2.0, pd.concat does the same job)
    df_search = pd.concat([df_search, df_page], ignore_index=True)
    


    # # store the object names in the dataframe
    # df_object['object_name'] = pd.Series(object_names_lst)
    # df_full = pd.concat([df_full, df_object], axis=0, ignore_index=True)

    # # drop duplicates
    # df_full.drop_duplicates(subset=['object_name'], keep='first', inplace=True)

    # # set the number of rows to display to maximum
    # pd.set_option('display.max_rows', None)

    # # display the dataframe from the loop
    # display(df_full)


    # check if we reached the end of the pages
    blank_slates = bs.find_all(class_="blank-slate__criteria")
    blank_slates_lst = (blank_slate.get_text() for blank_slate in blank_slates)
    blank_slates_lst = [blank_slate.strip() for blank_slate in blank_slates_lst]
    # print the list to make sure it works
    #print(blank_slates_lst)

    # increase the pagesite by 1
    pagesite += 1
 
pd.set_option('display.max_colwidth', None) # set_option needs a value; None shows the full column content
display(df_search)

In [None]:
weblink = 'https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&'
pagesite = 12
page = requests.get(weblink +  f'offset={ pagesite }&items=18')
html = page.content
print(weblink +  f'offset={ pagesite }&items=18')

https://www.theblueground.com/furnished-apartments-london-uk?currency=GBP&language=en&offset=12&items=18


In [None]:
# parse the html and save it into a BeautifulSoup instance
bs = BeautifulSoup(html, 'html.parser')

-----

# Now we have to work with the details

In [None]:
# import all the libraries

import time # to pause the code
import requests # to get the content of the website
from bs4 import BeautifulSoup # to parse the html
import re # to use regular expressions
import pandas as pd # to use pandas
import numpy as np # to use numpy

In [None]:
# set up the link for the detail-page
weblink_detail = 'https://www.theblueground.com'
pagesite_detail = "/furnished-apartments-london-uk/london-bayswater-046"
#pagesite_with_df = df_search.loc[0, 'get_url_to_detail_page']

print(weblink_detail + pagesite_detail)

https://www.theblueground.com/furnished-apartments-london-uk/london-bayswater-046


In [None]:
detail_url = weblink_detail +  pagesite_detail # do not name this variable pd, that would overwrite the pandas alias
print(detail_url)
page_details = requests.get(detail_url)
# parse the html and save it into a BeautifulSoup instance
html_details = page_details.content
bs_details = BeautifulSoup(html_details, 'html.parser')
print(bs_details.prettify())

https://www.theblueground.com/furnished-apartments-london-uk/london-bayswater-046
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge, chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="noindex" name="robots"/>
  <script>
   var dataLayer = [{
      country: 'GBR',
      city: 'LON',
      pageType: 'propertyDetails',
      propertyId: '16523',
      propertyCode:'LON-46',
      neighborhood: 'Bayswater',
      numberOfBedrooms: 1 }];
  </script>
  <title>
   Inverness Terrace - Apartment for Rent in Bayswater, London | Blueground
  </title>
  <meta content="Rent a fully furnished apartment in Bayswater, London. This serviced apartment for rent has property ID 46 and is suitable for long-term accommodation." name="description"/>
  <meta content="604800" property="og:ttl"/>
  <meta content="https://www.theblueground.com/furnished-apartments-london-uk/london-bayswater-046" property="og:url"/>
  <meta content="website" property="og:type"/>
  

property__amenities-list-item

property-code

In [None]:
bs_details.find_all('div', id='app')

[<div id="app"> <lt-main :configuration="configuration" :theme="activeTheme"> <template #header-left=""><cities-dropdown-header :city-code="property.cityCode" event-label="propertyPage"></cities-dropdown-header></template> <div> <transition name="fade"> <div class="loader" v-if="showLoader"></div> </transition> <div class="wrap wrap__top-spacer"> <property-carousel :on-photo-click="onCarouselPhotoClick" :photos="carouselPhotos" @mounted="onCarouselLoaded"></property-carousel> <div class="wrap__photo-sticker" v-if="hasSticker(property)"> <photo-sticker> <div class="wrap__photo-sticker-container"> <div class="wrap__photo-sticker-title">{{sticker.title}}</div> <div class="wrap__photo-sticker-info u-hidden--xs u-hidden--sm">{{sticker.info}}</div> <div class="wrap__photo-sticker-prompt u-hidden--xs u-hidden--sm">{{sticker.prompt}}</div> </div> </photo-sticker> </div> <div class="gallery-actions"> <carousel-button :on-click="onAllPhotosClick" :text="showAllPhotosText" class="u-hidden--xs u-h

In [None]:
# look for the booking bar of the apartment
test = bs_details.find_all(class_="property-booking-bar")
print(test)

[]


In [None]:
# We will search for the property code
property_codes = bs_details.find_all(class_= 'property-code')

property_codes_lst = [property_code.get_text() for property_code in property_codes]
property_codes_lst

[]

In [None]:
test2 = bs_details.find_all('div', class_='property__amenities-container')
test2

[]

In [None]:
neighborhood_names = bs_details.find_all('span', {'class':'property-code'})
neighborhood_names

[]

In [None]:
soup = bs_details
# find_all is the current name of the method, findAll is only kept as an old alias
spans = soup.find_all('span', attrs = {'class' : 'property-code'}) # span by class name
# spans = soup.find_all('span', attrs = {'title' : '000 Plus Minimum RAM Requirement'}) # or span with a title
for span in spans:
    print(span.text)

In [None]:
from bs4 import BeautifulSoup
import requests

soup = bs_details
spans=soup.find_all('span',"property-code")
for span in spans:
  print(span.text)
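
All class-based searches on the detail page come back empty. One possible explanation, which is an assumption and not verified here, is that the visible content is filled in by JavaScript in the browser: the `div id="app"` above only contains templates such as `{{sticker.title}}`. Some facts are still present in the static HTML, for example in the `dataLayer` script printed at the top of the page. The sketch below reads that script text with a regular expression. The function name is hypothetical and the pattern only covers the simple `key: 'value'` and `key: number` pairs seen in that snippet.

```python
import re

# The snippet is copied from the prettified detail page printed above.
SCRIPT_TEXT = """
var dataLayer = [{
   country: 'GBR',
   city: 'LON',
   pageType: 'propertyDetails',
   propertyId: '16523',
   propertyCode:'LON-46',
   neighborhood: 'Bayswater',
   numberOfBedrooms: 1 }];
"""

# Matches   key: 'quoted value'   or   key: 123
PAIR = re.compile(r"(\w+)\s*:\s*(?:'([^']*)'|(\d+))")


def parse_data_layer(script_text):
    """Collect the simple key/value pairs of the dataLayer object into a dict."""
    result = {}
    for key, text_value, number_value in PAIR.findall(script_text):
        # Exactly one of the two value groups is filled for every match.
        result[key] = int(number_value) if number_value else text_value
    return result


details = parse_data_layer(SCRIPT_TEXT)
```

On the real page the function should only be given the text of that one script tag, not the whole HTML, because other scripts may contain similar-looking pairs.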

# Now we have to use the URL from the first dataframe to get some details

In [None]:
# set up the link for the detail-page


weblink_detail = 'https://www.theblueground.com'
#pagesite_detail = "/furnished-apartments-london-uk/london-bayswater-046"
pagesite_with_df = df_search.loc[0, 'get_url_to_detail_page']

print(weblink_detail + pagesite_with_df)

KeyError: 'get_url_to_detail_page'

In [None]:
page = requests.get(weblink_detail +  f'{ pagesite_with_df }')
# parse the html and save it into a BeautifulSoup instance
html = page.content
bs = BeautifulSoup(html, 'html.parser')

In [None]:
df_search.loc[0, 'get_url_to_detail_page']