# Small web scraping project

This notebook demonstrates how to use Selenium to extract content from a web page.

### Prerequisites:
- `webdriver-manager` must be installed.
- A browser-specific driver (e.g. geckodriver for Firefox) must be downloaded and accessible to `webdriver-manager`.
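As a rough illustration of how the two prerequisites fit together, the sketch below lets `webdriver-manager` fetch geckodriver and hands its path to Selenium. The helper name `make_firefox_driver` is hypothetical, and the `executable_path` argument assumes Selenium 3.x (the version this notebook reports); Selenium 4 expects a `Service` object instead.

```python
def make_firefox_driver():
    """Start Firefox with a geckodriver fetched by webdriver-manager.

    Assumes Selenium 3.x; Selenium 4 expects a Service object
    instead of the executable_path argument.
    """
    # Imports are kept inside the function so the helper can be defined
    # even before the two packages are installed.
    from selenium import webdriver
    from webdriver_manager.firefox import GeckoDriverManager

    # install() downloads the driver if needed and returns its local path
    driver_path = GeckoDriverManager().install()
    return webdriver.Firefox(executable_path=driver_path)
```

Calling `make_firefox_driver()` needs network access and a local Firefox installation, so it is only defined here, not executed.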

## Importing libraries

In [1]:
import sys
import csv
import string
import time

import numpy as np
import pandas as pd
import selenium
from selenium import webdriver

In [2]:
selenium.__version__

'3.141.0'

## Settings

In [3]:
# Airbnb home page
url = "https://www.airbnb.com/"

## Loading driver with url

In [4]:
# Start a Firefox session and open the page
driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)  # give the page time to finish loading
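A fixed `time.sleep(5)` waits too long on a fast connection and not long enough on a slow one. Selenium provides `WebDriverWait` together with `expected_conditions` to poll a live browser instead. The sketch below shows the same polling idea in plain Python so that it runs without a browser; `wait_until` and `page_ready` are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical stand-in for "is the page ready?": succeeds on the third call
attempts = []


def page_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(page_ready, timeout=2.0, poll=0.01)
print(len(attempts))  # 3
```

With a real driver, the condition would typically check that the `site-content` container is present.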

## (1) Extract the home page info and write it to a CSV

In [5]:
from selenium.webdriver.common.by import By

container = driver.find_element(By.ID, "site-content")

# Fetch the div tags whose data-testid is "listing-card-title" (they hold the card titles)
titles = container.find_elements(By.CSS_SELECTOR, "div[data-testid='listing-card-title']")
tmp_subtitles = container.find_elements(By.CSS_SELECTOR, "div[data-testid='listing-card-subtitle']")

# Each card has two subtitles (distance, dates): reshape the flat list into rows of 2
subtitles = np.reshape(tmp_subtitles, (-1, 2))

# Fetch the price elements (the price text is nested inside and is extracted below)
price_elements = container.find_elements(By.CSS_SELECTOR, "div[data-testid='price-availability-row']")

price_list = []
for p in price_elements:
    # retrieve price as text in the price element
    price_text = p.find_element(By.CLASS_NAME, "_1y74zjx").text
    price_list.append(price_text)
    

print(f'Size titles : {len(titles)}')
print(f"Size prices : {len(price_list)}")



Size titles : 20
Size prices : 20
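The reshape to `(-1, 2)` relies on every card contributing exactly two subtitles, in the order distance then dates. A minimal sketch of the same grouping in plain Python, with an explicit check for an odd count, could look like this. The function `pair_up` is a hypothetical helper, and the strings are simplified sample values.

```python
def pair_up(items):
    """Group a flat list into consecutive pairs: [a, b, c, d] -> [(a, b), (c, d)]."""
    if len(items) % 2 != 0:
        raise ValueError(f"expected an even number of items, got {len(items)}")
    return [(items[i], items[i + 1]) for i in range(0, len(items), 2)]


# Simplified subtitle texts, two per listing card
flat = ["6 kilometers away", "May 1 – 6", "22 kilometers away", "May 1 – 6"]
pairs = pair_up(flat)
distances = [first for first, _ in pairs]
dates = [second for _, second in pairs]
print(distances)  # ['6 kilometers away', '22 kilometers away']
```

If a card ever carries one or three subtitles, the error makes the mismatch visible instead of silently shifting the columns.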


In [6]:
data = {'Title': titles,
        'Price': price_list,
        'Distance' : subtitles[:,0],
        'Dates' : subtitles[:,1]}

df = pd.DataFrame(data)
df['Title'] =  df['Title'].transform(lambda x: x.text)
df['Distance'] =  df['Distance'].transform(lambda x: x.text)
df['Dates'] =  df['Dates'].transform(lambda x: x.text)

In [7]:
driver.quit()

In [8]:
df.head()

Unnamed: 0,Title,Price,Distance,Dates
0,"Claix, France",€ 168,6 kilometers away\n6 kilometers away,May 1 – 6\nMay 1 – 6
1,"Lumbin, France",€ 78,22 kilometers away\n22 kilometers away,May 1 – 6\nMay 1 – 6
2,"Charvieu-Chavagneux, France",€ 162,77 kilometers away\n77 kilometers away,May 5 – 10\nMay 5 – 10
3,"Saint-Alban-Leysse, France",€ 560,50 kilometers away\n50 kilometers away,May 10 – 15\nMay 10 – 15
4,"Sévrier, France",€ 317,83 kilometers away\n83 kilometers away,May 1 – 6\nMay 1 – 6
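Element handles only work while the browser session is alive, which is why the `.text` values are read before `driver.quit()`. One way to make this explicit is to convert the elements to plain strings right away and build the table from those. The sketch below uses a hypothetical `FakeElement` class in place of Selenium's elements so it can run without a browser; `to_records` is also a made-up helper name.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium element; only .text is modelled."""
    text: str


def to_records(titles, prices, subtitle_pairs):
    """Build one plain dict per listing, reading .text from each element."""
    return [
        {"Title": t.text, "Price": p, "Distance": dist.text, "Dates": dates.text}
        for t, p, (dist, dates) in zip(titles, prices, subtitle_pairs)
    ]


records = to_records(
    [FakeElement("Claix, France")],
    ["€ 168"],
    [(FakeElement("6 kilometers away"), FakeElement("May 1 – 6"))],
)
print(records[0]["Title"])  # Claix, France
```

A list of dicts like `records` can be passed directly to `pd.DataFrame`.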


### Formatting the columns

We want to:
- Clean the `Distance` column,
- Clean the `Dates` column and split it into 2 columns: `Date_checkin`, `Date_checkout`.

#### Formatting `Distance`

In [10]:
df['Distance_km'] = df['Distance'].str.replace(r' .*\n.*', '', regex=True)  # only keep the number


In [11]:
df.head(30)

Unnamed: 0,Title,Price,Distance,Dates,Distance_km
0,"Claix, France",€ 168,6 kilometers away\n6 kilometers away,May 1 – 6\nMay 1 – 6,6
1,"Lumbin, France",€ 78,22 kilometers away\n22 kilometers away,May 1 – 6\nMay 1 – 6,22
2,"Charvieu-Chavagneux, France",€ 162,77 kilometers away\n77 kilometers away,May 5 – 10\nMay 5 – 10,77
3,"Saint-Alban-Leysse, France",€ 560,50 kilometers away\n50 kilometers away,May 10 – 15\nMay 10 – 15,50
4,"Sévrier, France",€ 317,83 kilometers away\n83 kilometers away,May 1 – 6\nMay 1 – 6,83
5,"Lathuile, France",€ 651,80 kilometers away\n80 kilometers away,May 1 – 6\nMay 1 – 6,80
6,"Verrens-Arvey, France",€ 161,71 kilometers away\n71 kilometers away,Dec 1 – 6\nDec 1 – 6,71
7,"Novalaise, France",€ 236,44 kilometers away\n44 kilometers away,Jun 30 – Jul 5\nJun 30 – Jul 5,44
8,"Saint-Jean-de-Moirans, France",€ 63,22 kilometers away\n22 kilometers away,Sep 1 – 6\nSep 1 – 6,22
9,"Saint-Jean-de-Moirans, France",€ 96,22 kilometers away\n22 kilometers away,Sep 16 – 21\nSep 16 – 21,22
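The `Distance` clean-up keeps the leading number as a string. As an illustration of the same extraction on plain strings, the sketch below returns an integer and tolerates values without a number. `parse_distance_km` is a hypothetical helper, and the `"1,250 ..."` and `"Nearby"` inputs are invented test values, not data from the page.

```python
import re


def parse_distance_km(text):
    """Return the leading integer of a string like '6 kilometers away', or None."""
    match = re.match(r"\s*(\d[\d,]*)", text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return int(match.group(1).replace(",", ""))


print(parse_distance_km("6 kilometers away\n6 kilometers away"))  # 6
print(parse_distance_km("1,250 kilometers away"))                 # 1250
print(parse_distance_km("Nearby"))                                # None
```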


#### Formatting `Dates`

In [12]:
df['Dates'] = df['Dates'].str.replace(r'\n.*', '', regex=True)  # remove the duplicated line



In [13]:
df['Dates'].head()

0      May 1 – 6
1      May 1 – 6
2     May 5 – 10
3    May 10 – 15
4      May 1 – 6
Name: Dates, dtype: object

In [14]:
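# Split on the en dash and drop the whitespace around it, so no trailing space is left in Date_checkin
df[['Date_checkin', 'Date_checkout']] = df['Dates'].str.split(r'\s*–\s*', expand=True)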

In [15]:
df.head(15)

Unnamed: 0,Title,Price,Distance,Dates,Distance_km,Date_checkin,Date_checkout
0,"Claix, France",€ 168,6 kilometers away\n6 kilometers away,May 1 – 6,6,May 1,6
1,"Lumbin, France",€ 78,22 kilometers away\n22 kilometers away,May 1 – 6,22,May 1,6
2,"Charvieu-Chavagneux, France",€ 162,77 kilometers away\n77 kilometers away,May 5 – 10,77,May 5,10
3,"Saint-Alban-Leysse, France",€ 560,50 kilometers away\n50 kilometers away,May 10 – 15,50,May 10,15
4,"Sévrier, France",€ 317,83 kilometers away\n83 kilometers away,May 1 – 6,83,May 1,6
5,"Lathuile, France",€ 651,80 kilometers away\n80 kilometers away,May 1 – 6,80,May 1,6
6,"Verrens-Arvey, France",€ 161,71 kilometers away\n71 kilometers away,Dec 1 – 6,71,Dec 1,6
7,"Novalaise, France",€ 236,44 kilometers away\n44 kilometers away,Jun 30 – Jul 5,44,Jun 30,Jul 5
8,"Saint-Jean-de-Moirans, France",€ 63,22 kilometers away\n22 kilometers away,Sep 1 – 6,22,Sep 1,6
9,"Saint-Jean-de-Moirans, France",€ 96,22 kilometers away\n22 kilometers away,Sep 16 – 21,22,Sep 16,21


In [16]:
def add_month_if_not_found(row):
    # A check-out such as "6" has no month: reuse the month of the check-in
    if row['Date_checkout'].startswith(tuple(string.digits)):
        month = row['Date_checkin'].split(' ')[0]
        return f"{month} {row['Date_checkout']}"
    return row['Date_checkout']

In [17]:
# Add the month from `Date_checkin` if `Date_checkout` does not already have one
df['Date_checkout'] = df.apply(add_month_if_not_found, axis=1)

In [18]:
df.head()

Unnamed: 0,Title,Price,Distance,Dates,Distance_km,Date_checkin,Date_checkout
0,"Claix, France",€ 168,6 kilometers away\n6 kilometers away,May 1 – 6,6,May 1,May 6
1,"Lumbin, France",€ 78,22 kilometers away\n22 kilometers away,May 1 – 6,22,May 1,May 6
2,"Charvieu-Chavagneux, France",€ 162,77 kilometers away\n77 kilometers away,May 5 – 10,77,May 5,May 10
3,"Saint-Alban-Leysse, France",€ 560,50 kilometers away\n50 kilometers away,May 10 – 15,50,May 10,May 15
4,"Sévrier, France",€ 317,83 kilometers away\n83 kilometers away,May 1 – 6,83,May 1,May 6
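The split and the month completion above can also be sketched as a single function on one string, which makes the two cases (`May 1 – 6` and `Jun 30 – Jul 5`) easy to check in isolation. `split_stay` is a hypothetical helper name; it raises `ValueError` if the string contains no en dash.

```python
import re


def split_stay(dates):
    """Split 'May 1 – 6' or 'Jun 30 – Jul 5' into (check-in, check-out)."""
    # Split on the en dash, tolerating any whitespace around it
    parts = re.split(r"\s*–\s*", dates.strip(), maxsplit=1)
    checkin, checkout = parts
    # A check-out that starts with a digit has no month: reuse the check-in month
    if checkout[:1].isdigit():
        month = checkin.split()[0]
        checkout = f"{month} {checkout}"
    return checkin, checkout


print(split_stay("May 1 – 6"))       # ('May 1', 'May 6')
print(split_stay("Jun 30 – Jul 5"))  # ('Jun 30', 'Jul 5')
```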


### Writing CSV

In [19]:
df = df.drop(columns=['Distance', 'Dates'])

In [20]:
df.reset_index(inplace=True)

In [21]:
df.to_csv("data.txt", index=False)
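Titles such as `Claix, France` contain a comma, so the writer has to quote them for the file to be read back correctly. `to_csv` does this by default; the sketch below shows the same behaviour with the standard `csv` module on two sample rows, writing to an in-memory buffer instead of a file.

```python
import csv
import io

rows = [
    {"Title": "Claix, France", "Price": "€ 168", "Distance_km": "6"},
    {"Title": "Lumbin, France", "Price": "€ 78", "Distance_km": "22"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Price", "Distance_km"])
writer.writeheader()
writer.writerows(rows)

# The title contains a comma, so the writer wraps it in quotes
print(buffer.getvalue().splitlines()[1])  # "Claix, France",€ 168,6

# Reading the text back recovers the original values
buffer.seek(0)
read_back = [dict(r) for r in csv.DictReader(buffer)]
```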


0      May 6 2024
1      May 6 2024
2     May 10 2024
3     May 15 2024
4      May 6 2024
5      May 6 2024
6      Dec 6 2024
7      Jul 5 2024
8      Sep 6 2024
9     Sep 21 2024
10    Jun 21 2024
11    Sep 20 2024
12    May 17 2024
13    May 18 2024
14     Jul 5 2024
15    May 31 2024
16    May 27 2024
17    May 25 2024
18    May 17 2024
19     May 6 2024
Name: Date_checkout, dtype: object

In [39]:
# The scraped dates carry no year, so 2024 is appended before parsing
df['Date_checkout'] = pd.to_datetime(df['Date_checkout'].str.strip() + " 2024", format='%b %d %Y')
df['Date_checkin'] = pd.to_datetime(df['Date_checkin'].str.strip() + " 2024", format='%b %d %Y')
df.head()

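Appending one fixed year to both dates gives a check-out earlier than the check-in for a stay that crosses New Year (for example `Dec 30 – Jan 4`). A minimal sketch of one way to handle this with the standard library is shown below. `parse_stay` is a hypothetical helper, the year is an assumption supplied by the caller, and `%b` assumes English month abbreviations.

```python
from datetime import datetime


def parse_stay(checkin, checkout, year):
    """Parse 'Mon D' strings, assuming the stay starts in the given year."""
    start = datetime.strptime(f"{checkin} {year}", "%b %d %Y")
    end = datetime.strptime(f"{checkout} {year}", "%b %d %Y")
    # A check-out before the check-in means the stay crosses New Year
    if end < start:
        end = end.replace(year=year + 1)
    return start, end


start, end = parse_stay("Dec 30", "Jan 4", 2024)
print((end - start).days)  # 5
```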