### View raw data

In [0]:
%sql
SELECT * FROM default.listings LIMIT 10

### Load .csv and select useful feature columns

The following columns could be useful for price prediction:

| Column Name | Example Value | Description |
|-------------|---------------|-------------|
| name | "Private, quiet studio in the centre with terrace" | title of the Airbnb page, should be transformed to an embedding |
| description | "All guests agree: the apartment is perfect and the location even better. A real home away from home. Two bedrooms, a fully equipped kitchen, a living with a comfortable couch. Quiet area, next to the Museumplein with the 3 major Museums." | description on the Airbnb page, should be transformed to an embedding |
| neighborhood_overview | "Near beach, harbor and canal. From livingroom you can see boats passing by" | description of the neighborhood, should be transformed to an embedding |
| neighbourhood_cleansed | Centrum-West | label for the neighborhood, needs to be one-hot-encoded |
| property_type | Private room in guest suite | label for the property type, needs to be one-hot-encoded |
| room_type | Entire home/apt | label for the room type, needs to be one-hot-encoded |
| accommodates | 4 | the number of guests |
| bathrooms | 1 | the number of bathrooms |
| bedrooms | 2 | the number of bedrooms |
| beds | 1 | the number of beds |
| amenities | ["Central heating", "Shower gel", "Lake access"] | array of categorical values, needs to be one-hot-encoded |
| availability_365 | 247 | number of days per year the Airbnb is available |
| review_scores_value | 4.75 | review score for the value of the Airbnb |
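
Because `amenities` holds several labels per listing, its one-hot encoding is effectively a multi-hot vector: one column per distinct amenity, with a 1 wherever the listing has it. The following is a minimal sketch of that idea, assuming each cell is a JSON-style array string like the example value in the table. The sample rows and the helper `parse_amenities` are hypothetical and not part of the notebook.

```python
import json

# Hypothetical sample cells; the real column is assumed to store JSON-style array strings
amenities_raw = [
    '["Central heating", "Shower gel", "Lake access"]',
    '["Shower gel"]',
    None,  # a missing value
]

def parse_amenities(value):
    """Return the amenities in one cell as a list, or an empty list if the cell is missing."""
    if not isinstance(value, str):
        return []
    return json.loads(value)

parsed = [parse_amenities(v) for v in amenities_raw]

# One column per distinct amenity, in sorted order
vocabulary = sorted({amenity for row in parsed for amenity in row})

# One row per listing: 1 if the listing has the amenity, else 0
multi_hot = [[1 if amenity in row else 0 for amenity in vocabulary] for row in parsed]
```

If scikit-learn is available, `sklearn.preprocessing.MultiLabelBinarizer` produces the same kind of matrix from the parsed lists.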



In [0]:
# Load the CSV with Spark, then convert it to a pandas DataFrame
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .option("escape", '"')
    .option("encoding", "UTF-8")
    .option("quote", '"')
    .option("multiLine", "true")
    .load("/Volumes/workspace/airbnb/airbnb/listings.csv")
    .toPandas()
)

# Select specific columns
selected_columns = [
    "name", "description", "neighborhood_overview", "neighbourhood_cleansed",
    "property_type", "room_type", "accommodates", "bathrooms", "bedrooms",
    "beds", "amenities", "availability_365", "review_scores_value", "price"
]

# Selecting the specified columns
df = df[selected_columns]

# Filter out records without price
df = df[df['price'].notna()]


# Display the resulting DataFrame
display(df)


### Pre-process columns one by one 

1. <b>name</b>: create text embeddings to capture semantic meaning

In [0]:
%pip install sentence-transformers

In [0]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained model (optimized for sentence embeddings)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

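# Replace missing values with empty strings so the encoder only receives text
names = df['name'].fillna("").astype(str).tolist()

# Generate embeddings
name_embeddings = model.encode(names)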


2. <b>description</b>: create text embeddings to capture semantic meaning

In [0]:
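# Replace missing values with empty strings so the encoder only receives text
descriptions = df['description'].fillna("").astype(str).tolist()
description_embeddings = model.encode(descriptions)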

3. <b>neighborhood_overview</b>: create text embeddings to capture semantic meaning

In [0]:
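# Replace missing values with empty strings so the encoder only receives text
neighborhood_overviews = df['neighborhood_overview'].fillna("").astype(str).tolist()
neighborhood_overview_embeddings = model.encode(neighborhood_overviews)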

4. <b>neighbourhood_cleansed</b>: create one-hot encodings

In [0]:
from sklearn.preprocessing import OneHotEncoder

categories = np.array(df['neighbourhood_cleansed'])
encoder = OneHotEncoder()
neighborhood_ohe = encoder.fit_transform(categories.reshape(-1, 1))
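
To make the encoder's output easier to inspect, here is a small sketch on hypothetical sample labels (only "Centrum-West" appears in the data shown above). It illustrates that `fit_transform` returns a sparse matrix by default and that the column order follows `categories_`. Setting `handle_unknown="ignore"` is an optional choice made here, not something the notebook does: it encodes labels unseen during fitting as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample labels, reshaped to a single feature column
sample = np.array(["Centrum-West", "Centrum-Oost", "Centrum-West"]).reshape(-1, 1)

# handle_unknown="ignore" is an assumption for this sketch
sample_encoder = OneHotEncoder(handle_unknown="ignore")
sample_ohe = sample_encoder.fit_transform(sample)  # sparse matrix by default

# Column labels, in the (sorted) order used by the encoder
labels = list(sample_encoder.categories_[0])

# Convert to a dense array for inspection
dense = sample_ohe.toarray()

# A label not seen during fitting becomes an all-zero row
unseen = sample_encoder.transform(np.array([["Unknown"]])).toarray()
```

For a large number of neighbourhoods it is usually better to keep the sparse matrix and only call `toarray()` when inspecting a few rows.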
