# Download Data from Airbnb (Inside Airbnb) and Make Predictions Based on It

# Introduction
This Jupyter notebook guides you through downloading Airbnb data and using it to make predictions. The data comes from the Inside Airbnb project and is available at the following URL: [Inside Airbnb Data](http://insideairbnb.com/get-the-data.html). In addition, you can use the accompanying functions to look up the latitude and longitude of an address through the Google Geocoding API.

The functions needed to scrape and process the data, as well as to make predictions, are all stored in the Python file "scrape_and_predict_functions.py". This notebook uses all of the functions from that file, but depending on your requirements you may need only some of them. For the imports to work, the "scrape_and_predict_functions.py" file must be in the same directory as this notebook.

## Note

Running this notebook will create several folders in your current working directory. Be aware of this to prevent any unwanted clutter or overwriting of existing folders or files.

## Table of Contents

* [1. Scraping the Data](#1.-Scraping-the-Data)
* [2. Clean and Merge the Data](#2.-Clean-and-Merge-the-Data)
* [3. Prediction](#3.-Prediction)
* [4. Get Input for Prediction](#4.-Get-Input-for-Prediction)


In [1]:
# import the needed packages
import scrape_and_predict_functions as spf
import os

## 1. Scraping the Data
First we need to scrape the data from the website and save it in the desired location so that we can use it later for predictions. To do so, we extract the relevant links from the website, use these links to create download requests, and store the downloaded files in a folder. Keep in mind that the downloaded files are in .gz format, which is a compressed CSV file. To keep the computational and download resources low, we restrict the scraped data to the cities of Amsterdam and Antwerp.

In [2]:
# get all links on the website that lead to the data files (roughly 120 links)
# requires an internet connection because it sends requests to the website
csv_links = spf.get_links()


# filter out only the cities you want to scrape
# define which cities you want to download
cities = ['amsterdam', 'antwerp']

# get the link for the cities you want to scrape. They are from the scraped list of links csv_links
csv_links = spf.get_links_certain_city(csv_links, cities)
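The internals of `spf.get_links` and `spf.get_links_certain_city` are not shown in this notebook. As a rough sketch of the idea, assuming the page exposes plain `<a href>` links and that each city name appears as a path segment of the URL, collecting and filtering could look like the following. `LinkCollector`, `filter_links_by_city`, and the sample HTML with its `example.com` URLs are hypothetical and only illustrate the technique.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href of an <a> tag that ends with the given suffix."""

    def __init__(self, suffix=".csv.gz"):
        super().__init__()
        self.suffix = suffix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(self.suffix):
                self.links.append(value)


def filter_links_by_city(links, cities):
    """Keep links whose URL contains one of the city names as a path segment."""
    wanted = {city.lower() for city in cities}
    return [link for link in links if wanted & set(link.lower().split("/"))]


# Hypothetical page fragment; the layout of the real page may differ.
sample_html = """
<a href="http://example.com/the-netherlands/north-holland/amsterdam/data/listings.csv.gz">a</a>
<a href="http://example.com/belgium/vlg/antwerp/data/listings.csv.gz">b</a>
<a href="http://example.com/france/ile-de-france/paris/data/listings.csv.gz">c</a>
<a href="http://example.com/about.html">about</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
selected = filter_links_by_city(collector.links, ["amsterdam", "antwerp"])
print(selected)
```

Matching on whole path segments rather than substrings avoids accidentally selecting a city whose name merely contains another city's name.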

In [3]:
# Create a folder to store the downloaded compressed .csv.gz files. The csv.gz contain all the scraped airbnb data
os.makedirs('csv_files_airbnb_csv_gz', exist_ok=True)

# download the files and save them in a folder with compressed .csv.gz files
spf.download_csv_save_in_folder(csv_links, 'csv_files_airbnb_csv_gz')

Downloaded: the-netherlands_north-holland_amsterdam.csv.gz
Downloaded: belgium_vlg_antwerp.csv.gz
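The file names in the output above suggest that each download is named after the country, region, and city of its link. The sketch below shows one way this could be done, assuming those three values are the first three path segments of the URL. `filename_from_url`, `download_to_folder`, and the `example.com` URL are hypothetical; the actual logic of `spf.download_csv_save_in_folder` may differ.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Build 'country_region_city.csv.gz' from the first three path segments."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    return "_".join(segments[:3]) + ".csv.gz"


def download_to_folder(urls, folder):
    """Stream each URL to disk. Requires an internet connection."""
    os.makedirs(folder, exist_ok=True)
    for url in urls:
        target = os.path.join(folder, filename_from_url(url))
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        print("Downloaded:", os.path.basename(target))


# Hypothetical URL, used only to show the naming rule (no request is sent).
example_url = (
    "http://example.com/the-netherlands/north-holland/amsterdam/"
    "2023-03-09/data/listings.csv.gz"
)
example_name = filename_from_url(example_url)
print(example_name)
```

Copying the response with `shutil.copyfileobj` writes the file in chunks, so even large archives do not have to fit in memory.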


## 2. Clean and Merge the Data
Now that we have downloaded the data files, we extract each .csv.gz file and save it as a plain .csv file. We then merge the files into a single dataframe so that we can clean, select, and transform the data and later train the prediction model on it.

In [4]:
# create a folder to store the extracted files
os.makedirs('csv_files_airbnb_extracted_csv', exist_ok=True)

# extract the files from the compressed .csv.gz files and save them in a folder with extracted .csv files
spf.extract_gz_from_to('csv_files_airbnb_csv_gz', 'csv_files_airbnb_extracted_csv')

Extracted: csv_files_airbnb_extracted_csv\belgium_vlg_antwerp.csv
Extracted: csv_files_airbnb_extracted_csv\the-netherlands_north-holland_amsterdam.csv
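For readers who want to see what such an extraction step involves, here is a minimal sketch using only the standard library. `extract_gz_folder` is a hypothetical stand-in, not the implementation of `spf.extract_gz_from_to`, and the demo file is created in a temporary directory so nothing in your working directory is touched.

```python
import gzip
import os
import shutil
import tempfile


def extract_gz_folder(source_folder, target_folder):
    """Decompress every .gz file in source_folder into target_folder."""
    os.makedirs(target_folder, exist_ok=True)
    extracted = []
    for name in sorted(os.listdir(source_folder)):
        if not name.endswith(".gz"):
            continue
        source_path = os.path.join(source_folder, name)
        # 'listings.csv.gz' becomes 'listings.csv'
        target_path = os.path.join(target_folder, name[: -len(".gz")])
        with gzip.open(source_path, "rb") as src, open(target_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target_path)
    return extracted


# Small round trip on a throwaway file to show the behaviour.
with tempfile.TemporaryDirectory() as tmp:
    gz_dir = os.path.join(tmp, "gz")
    csv_dir = os.path.join(tmp, "csv")
    os.makedirs(gz_dir)
    with gzip.open(os.path.join(gz_dir, "demo.csv.gz"), "wt", encoding="utf-8") as f:
        f.write("id,price\n1,$100.00\n")
    paths = extract_gz_folder(gz_dir, csv_dir)
    demo_name = os.path.basename(paths[0])
    with open(paths[0], encoding="utf-8") as f:
        demo_text = f.read()

print(demo_name)
print(demo_text)
```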


In [5]:
# take the data from the folder with extracted .csv files and put them into a merged dataframe
df_merged = spf.convert_csv_to_dataframe_merge('csv_files_airbnb_extracted_csv')

# apply some data cleaning such as rounding the long and lat or transforming the price into a float
df_merged = spf.data_clean(df_merged, round_long_lat=True)

# show the first 5 rows of the dataframe
df_merged.head()


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,country,region,city
0,50904,https://www.airbnb.com/rooms/50904,20230329174650,2023-03-29,city scrape,aplace/antwerp: cosy suite - fashion district,Decorated in a vintage style combined with a f...,,https://a0.muscache.com/pictures/f14b0908-cbc3...,234077,...,,0,4,2,0,0,0.03,belgium,vlg,antwerp
1,116134,https://www.airbnb.com/rooms/116134,20230329174650,2023-03-29,city scrape,Spacious apartment nearby Mas,Enjoy your stay at our 4 person apartment in t...,"The area ""`t eilandje"" is located at the old h...",https://a0.muscache.com/pictures/23732573/0708...,586942,...,,0,1,1,0,0,1.09,belgium,vlg,antwerp
2,224682,https://www.airbnb.com/rooms/224682,20230329174650,2023-03-29,city scrape,APARTMENT ROSCAM - OLD CENTRE ANTWERP,"<b>The space</b><br />Apartment ""Roscam"" is a ...",There is a paid parking lot around the corner....,https://a0.muscache.com/pictures/cc82d1b9-ec82...,1263933,...,,0,1,1,0,0,3.48,belgium,vlg,antwerp
3,345959,https://www.airbnb.com/rooms/345959,20230329174650,2023-03-29,city scrape,Marleen's home in Antwerp city,"your entire, private groundfloor 2-bedroom apa...",leuke residentiële buurt,https://a0.muscache.com/pictures/11642662/f9b6...,1754396,...,,0,1,1,0,0,0.63,belgium,vlg,antwerp
4,366252,https://www.airbnb.com/rooms/366252,20230329174650,2023-03-29,city scrape,ROOM IN FAMILY HOME near C. Station,"In the Antwerp district of Borgerhout, we live...",we live on the 5th floor on top of a bed store...,https://a0.muscache.com/pictures/miso/Hosting-...,1820186,...,,1,4,0,4,0,1.01,belgium,vlg,antwerp
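The cleaning step mentions converting the price to a float and rounding the coordinates. The sketch below illustrates both operations in plain Python. It assumes that raw prices are strings such as `$1,250.00` and that coordinates are rounded to two decimals; both are assumptions about the data and about `spf.data_clean`, and `parse_price` and `round_coordinate` are hypothetical helper names.

```python
def parse_price(raw):
    """Turn a string such as '$1,250.00' into a float; return None if empty."""
    if raw is None:
        return None
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def round_coordinate(value, digits=2):
    """Round a latitude or longitude given as a number or numeric string."""
    return round(float(value), digits)


price = parse_price("$1,250.00")
latitude = round_coordinate("52.34916")
print(price, latitude)
```

On a pandas dataframe the same idea can be applied column-wise, for example with `df['price'].map(parse_price)`.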


## 3. Prediction
The dataframe is now in the right format for machine learning. We will use a random forest model to make the prediction. You can choose the features according to your needs. In this case we predict the price per night from the location, the minimum and maximum number of nights the listing can be rented, whether the listing has availability (`has_availability`), and how many other listings the host has.

In [6]:
# for the sake of demonstration we only take a sample of the data in order to speed up the process
# only take city of amsterdam in dataframe
df_train = df_merged[df_merged['city'] == 'amsterdam']
df_train = df_train.sample(frac=0.5, replace=False)


# split the data into X (features) and y (outcome)
X = df_train[['latitude', 'longitude', 'minimum_nights', 
               'maximum_nights', 'has_availability', 'host_listings_count']]

y = df_train['price']

# train the model. Can take a while
fitmodel = spf.predict_random_forest(features = X, outcome = y, n_estimators=10, hyperparamopt=False)
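The body of `spf.predict_random_forest` is not shown in this notebook. As a hedged sketch of what a random forest regression with a held-out evaluation could look like, the example below uses scikit-learn (assumed to be installed) on synthetic stand-in data. The variable names and the synthetic data are hypothetical; in practice you would pass the `X` and `y` defined in the cell above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with six features, mirroring the six columns of X.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 6))
y_demo = 100 + 50 * X_demo[:, 0] + rng.normal(0, 5, size=200)

# Hold out 20% of the rows so the error is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# n_estimators=10 matches the small forest used in the cell above.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print("MAE on held-out data:", mae)
```

Evaluating on held-out rows gives a more honest picture of the model than predicting an observation it was trained on.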

In [7]:
# predict the price for a certain set of features; here we take one observation from the training data
# you can adjust the features to your liking, but make sure the input is a pandas dataframe
predict_object = X.sample(n=1)
print(predict_object)

# predict the price for the observation
print("the predicted price is" , fitmodel.predict(predict_object))

# show the actual price for the observation
print("the actual price is", y.loc[predict_object.index].values[0])

      latitude  longitude  minimum_nights  maximum_nights  has_availability  \
3508     52.35       4.89               3              65                 1   

      host_listings_count  
3508                    1  
the predicted price is [210.7]
the actual price is 243.0


## 4. Get Input for Prediction

If we only have the address of an Airbnb listing, we can get its latitude and longitude with the help of the Google Geocoding API. Furthermore, we can determine the neighbourhood of a listing (a variable in the Airbnb dataset) from its latitude and longitude. This section shows how to do both.

In [8]:
# get long lat with google geocoding api
# To run these steps you need a working Google Cloud project
# as well as a billing account and an enabled Geocoding API.
# More information here: https://developers.google.com/maps/documentation/elevation/cloud-setup
# please note that there could be some costs involved


# define API key for google geocoding API. Please enter your own API key
api_key = None

if api_key is not None:
    # get the long and lat of a certain address through the Google Maps API
    latitude, longitude = spf.get_geocode("Javastraat 115I, 1094 HD Amsterdam, Netherlands", api_key)
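To make the geocoding step less of a black box, the sketch below builds a request URL for the Google Geocoding API and reads the coordinates from a decoded JSON response. It sends no request. `build_geocode_url` and `parse_geocode_response` are hypothetical helpers, not the implementation of `spf.get_geocode`, and the sample payload is hand-written with illustrative coordinates.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Return the request URL with the address and key URL-encoded."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_geocode_response(payload):
    """Return (latitude, longitude) from a decoded Geocoding API response."""
    if payload.get("status") != "OK" or not payload.get("results"):
        raise ValueError("Geocoding failed with status: %s" % payload.get("status"))
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample payload; the coordinates are illustrative only.
sample_payload = json.loads(
    '{"status": "OK", "results": '
    '[{"geometry": {"location": {"lat": 52.36, "lng": 4.94}}}]}'
)

demo_url = build_geocode_url(
    "Javastraat 115I, 1094 HD Amsterdam, Netherlands", "YOUR_API_KEY"
)
demo_latitude, demo_longitude = parse_geocode_response(sample_payload)
print(demo_url)
print(demo_latitude, demo_longitude)
```

Checking the `status` field before reading `results` turns a failed lookup into an explicit error instead of an obscure `IndexError`.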

In [9]:
# find the neighbourhood of a certain city with long and lat input 
# first we need to get a .geojson file from the airbnb website
# they contain multipolygons of the neighbourhoods
geojson_links = spf.get_geojson_links()

# we again only want to scrape the geojson files for certain cities
geojson_links = spf.get_geojson_links_certain_city(geojson_links, cities)

# create a folder to store the geojson files
os.makedirs('geojson_files', exist_ok=True)

# download the geojson files
spf.download_geojson_save_in_folder(geojson_links, 'geojson_files')


Downloaded: amsterdam.geojson
Downloaded: antwerp.geojson


In [10]:

# get the neighbourhood of a certain long and lat input
# define longitude and latitude for demonstration
# select each column by name; integer keys on a label-indexed row are deprecated in recent pandas
input_longitude = df_merged['longitude'].iloc[2442]
input_latitude = df_merged['latitude'].iloc[2442]

# define the path to the geojson file
file_path = 'geojson_files/amsterdam.geojson'

# get the neighbourhood in amsterdam
neighborhood = spf.get_neighbourhood(input_longitude, input_latitude, file_path)
neighborhood

'Centrum-West'
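The lookup above boils down to a point-in-polygon test against the neighbourhood shapes in the GeoJSON file. The following is a minimal pure-Python sketch of that idea using ray casting. It is not the implementation of `spf.get_neighbourhood`, which may rely on a geometry library instead; the function names, the `neighbourhood` property key, and the two square sample areas are assumptions made for illustration.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting test for one linear ring given as [lon, lat] pairs."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Does the edge (j, i) cross the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, polygon):
    """The first ring is the outer boundary, any further rings are holes."""
    if not point_in_ring(lon, lat, polygon[0]):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in polygon[1:])


def find_neighbourhood(lon, lat, feature_collection, name_key="neighbourhood"):
    """Return the name of the first feature that contains the point, else None."""
    for feature in feature_collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] == "Polygon":
            polygons = [geometry["coordinates"]]
        elif geometry["type"] == "MultiPolygon":
            polygons = geometry["coordinates"]
        else:
            continue
        if any(point_in_polygon(lon, lat, polygon) for polygon in polygons):
            return feature["properties"].get(name_key)
    return None


# Two hypothetical square neighbourhoods side by side.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"neighbourhood": "A"},
         "geometry": {"type": "MultiPolygon",
                      "coordinates": [[[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]]}},
        {"type": "Feature", "properties": {"neighbourhood": "B"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
    ],
}

print(find_neighbourhood(0.5, 0.5, sample_geojson))
print(find_neighbourhood(1.5, 0.5, sample_geojson))
print(find_neighbourhood(5.0, 5.0, sample_geojson))
```

To try it on a downloaded file, load the file with `json.load` and pass the resulting dictionary as `feature_collection`. Note that GeoJSON stores coordinates in longitude, latitude order.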