# Exploring raw data from AirBnB listings from Paris and Venice
## Team: Breaking data

In this notebook we work with the raw AirBnB listings data for Venice and Paris: we explore what is available and suggest what can be cleaned and transformed

### Overview on proposed actions 

**Data cleaning**: not all columns add value for further analysis, so we suggest dropping the columns that contain URLs to images, placements, etc., as well as the scraping info (id, time, source)

**Transforming the bathrooms column**: the column is empty for Paris and partially filled for Venice, but in both cases the values can be extracted from the bathrooms_text column (suggestions are written in the notebook)

**Hidden missing values**: in this notebook we already convert N/A values to null, but the amenities and host_verifications columns also contain hidden missing values that need to be handled (the code that identifies such values is available in the notebook)



### Load data

In [0]:
# Imports
from pyspark.sql import functions as F, types as T
from pyspark.sql import DataFrame
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, BoundaryNorm
from typing import List, Tuple

In [0]:
path = "/Volumes/airbnb/airbnb_city_raw/city_data_volume/listings_paris.csv"

df_paris = (
    spark.read
    .option("header", "true")
    .option("multiLine", "true")
    .option("quote", '"')
    .option("escape", '"')
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .csv(path)
)
display(df_paris)


In [0]:
path = "/Volumes/airbnb/airbnb_city_raw/city_data_volume/listings_venice.csv"

df_venice = (
    spark.read
    .option("header", "true")
    .option("multiLine", "true")
    .option("quote", '"')
    .option("escape", '"')
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .csv(path)
)
display(df_venice)


As we can see, the dataset contains information about the host (name, description, contact info, host start date, etc.), the neighbourhood of the host and of the listing, the listing itself (bathrooms, bedrooms, beds, price, amenities, ratings, etc.) and scraping information

There are also columns that contain links to websites and images. They hold no significant value for further research, so we suggest dropping them together with the scraping information
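
The suggested column selection can be sketched in plain Python. This is only an illustration: the `_url` suffix rule, the `SCRAPING_COLUMNS` set and the example column names are assumptions and should be checked against `df_paris.columns` before use.

```python
# Assumed names of the scraping columns; verify them against df.columns.
SCRAPING_COLUMNS = {"scrape_id", "last_scraped", "source"}


def columns_to_drop(columns):
    """Return the columns that hold links or scraping metadata, in input order."""
    return [
        name for name in columns
        if name.endswith("_url") or name in SCRAPING_COLUMNS
    ]


# Hypothetical subset of column names, used only to show the behaviour.
example_columns = [
    "id", "listing_url", "scrape_id", "last_scraped", "source",
    "name", "picture_url", "host_id", "host_url", "price",
]
to_drop = columns_to_drop(example_columns)
kept = [c for c in example_columns if c not in to_drop]
print(to_drop)
print(kept)
```

On the Spark DataFrames loaded above, the result could then be passed to `df_paris.drop(*columns_to_drop(df_paris.columns))`.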

### Let's explore!

In [0]:
listings_paris = df_paris.count()
listings_venice = df_venice.count()

names = ["Paris", "Venice"]
values = [listings_paris, listings_venice]

fig, ax = plt.subplots()
bars = ax.barh(names, values, color="#3D65A5")

ax.set_title("Number of AirBnB listings in Paris and Venice")
# Horizontal bars: the counts are on the x-axis
ax.set_xlabel("Number of listings")
ax.set_xlim(xmax=max(values) * 1.15)

# Add value labels
for bar in bars:
    width = bar.get_width()
    ax.text(
        width + max(values) * 0.01,
        bar.get_y() + bar.get_height() / 2,
        f"{int(width):,}",
        ha="left", va="center", fontsize=10,
    )
plt.tight_layout()


plt.show()

As we can see, Paris has many more listings than Venice, which can be explained by the fact that Paris is a much bigger city and simply covers more territory

In [0]:
from pyspark.sql import functions as F

avg_paris = df_paris.select(F.avg("review_scores_value")).collect()[0][0]
avg_venice = df_venice.select(F.avg("review_scores_value")).collect()[0][0]

print(f"Paris avg rating:  {avg_paris:.2f}")
print(f"Venice avg rating: {avg_venice:.2f}")

names = ["Paris", "Venice"]
values = [avg_paris, avg_venice]

fig, ax = plt.subplots()
bars = ax.barh(names, values, color="#3D65A5")

ax.set_title("Average rating of AirBnB listings in Paris and Venice")
# Horizontal bars: the rating is on the x-axis
ax.set_xlabel("Average rating")
ax.set_xlim(xmax=max(values) * 1.1)

# Add value labels
for bar in bars:
    width = bar.get_width()
    ax.text(
        width + max(values) * 0.01,
        bar.get_y() + bar.get_height() / 2,
        f"{width:.2f}",
        ha="left", va="center", fontsize=10,
    )
plt.tight_layout()


plt.show()


Both cities have high review scores, but Paris is rated slightly lower

In [0]:
df_clean = df_paris.withColumn("host_since", F.to_date("host_since"))

df_clean = df_clean.withColumn("host_start_year", F.year("host_since"))

hosts_by_year = (
    df_clean
    .filter(F.col("host_start_year").isNotNull())
    .groupBy("host_start_year")
    .agg(F.countDistinct("host_id").alias("unique_hosts"))
    .orderBy("host_start_year")
)

pdf = hosts_by_year.toPandas()

plt.figure(figsize=(8, 6))
bars = plt.bar(pdf["host_start_year"].astype(str), pdf["unique_hosts"], color="#3D65A5")

# Labels and title
plt.title("Number of new hosts per year in Paris", fontsize=14)
plt.xlabel("Year host started", fontsize=12)
plt.ylabel("Unique hosts", fontsize=12)
plt.ylim(ymax=max(pdf["unique_hosts"]) * 1.15)

# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width()/2,
        height,                          
        f"{int(height)}",              
        ha="center", va="bottom", fontsize=9,
    )

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [0]:
df_clean = df_venice.withColumn("host_since", F.to_date("host_since"))

df_clean = df_clean.withColumn("host_start_year", F.year("host_since"))

hosts_by_year = (
    df_clean
    .filter(F.col("host_start_year").isNotNull())
    .groupBy("host_start_year")
    .agg(F.countDistinct("host_id").alias("unique_hosts"))
    .orderBy("host_start_year")
)

pdf = hosts_by_year.toPandas()

plt.figure(figsize=(8, 6))
bars = plt.bar(pdf["host_start_year"].astype(str), pdf["unique_hosts"], color="#3D65A5")

plt.title("Number of new hosts per year in Venice", fontsize=14)
plt.xlabel("Year host started", fontsize=12)
plt.ylabel("Unique hosts", fontsize=12)
plt.ylim(ymax=max(pdf["unique_hosts"]) * 1.15)

# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width()/2,
        height,                          
        f"{int(height)}",              
        ha="center", va="bottom", fontsize=9,
    )

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

As we can see, in both cities the number of new hosts peaked in 2015, probably because the platform became more popular, and the fewest new hosts joined in 2020, which coincides with Covid

In [0]:
from pyspark.sql import functions as F
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df_paris_clean  = df_paris.filter(F.col("host_id").isNotNull())
df_venice_clean = df_venice.filter(F.col("host_id").isNotNull())

# unique hosts
unique_hosts_paris  = df_paris_clean.select("host_id").distinct().count()
unique_hosts_venice = df_venice_clean.select("host_id").distinct().count()

# listings per host (count of listings per host_id)
paris_host_listing_counts = (
    df_paris_clean.groupBy("host_id")
    .agg(F.count("*").alias("listings_for_host"))
)
avg_listings_per_host_paris = paris_host_listing_counts.select(
    F.avg("listings_for_host")
).collect()[0][0]

venice_host_listing_counts = (
    df_venice_clean.groupBy("host_id")
    .agg(F.count("*").alias("listings_for_host"))
)
avg_listings_per_host_venice = venice_host_listing_counts.select(
    F.avg("listings_for_host")
).collect()[0][0]

cities = ["Paris", "Venice"]

unique_vals = [unique_hosts_paris, unique_hosts_venice]

# Unique hosts plot
plt.figure(figsize=(6,4))
bars1 = plt.bar(cities, unique_vals, color="#3D65A5")
plt.title("Number of unique hosts")
plt.ylabel("Unique hosts")
plt.ylim(ymax=max(unique_vals) * 1.15)

# add value labels
for bar in bars1:
    h = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width()/2,
        h,
        f"{int(h)}",
        ha="center", va="bottom"
    )

plt.tight_layout()
plt.show()

# Avg listings per host plot
avg_vals = [avg_listings_per_host_paris, avg_listings_per_host_venice]

plt.figure(figsize=(6,4))
bars2 = plt.bar(cities, avg_vals, color="#3D65A5")
plt.title("Average listings per host")
plt.ylabel("Avg listings per host")
plt.ylim(ymax=max(avg_vals) * 1.15)

# value labels
for bar in bars2:
    h = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width()/2,
        h,
        f"{h:.2f}",
        ha="center", va="bottom"
    )

plt.tight_layout()
plt.show()


As we can see, the average host in Paris has only 1 listing, while in Venice the average is 2

### Missing value analysis

Let's look at the quality of our raw data

First, let's replace N/A with null, as some columns contain it

In [0]:
# Loop through all columns and replace "N/A" with null
df_paris = df_paris.select([
    F.when(F.col(c) == "N/A", None).otherwise(F.col(c)).alias(c)
    for c in df_paris.columns
])

df_venice = df_venice.select([
    F.when(F.col(c) == "N/A", None).otherwise(F.col(c)).alias(c)
    for c in df_venice.columns
])

## Paris

In [0]:
def is_missing_expr(col_name, dtype, include_empty_str):
    c = F.col(col_name)
    base_null = c.isNull()
    if isinstance(dtype, (T.FloatType, T.DoubleType)):
        cond = base_null | F.isnan(c)
    elif isinstance(dtype, T.StringType):
        cond = base_null | (F.length(F.trim(c)) == 0 if include_empty_str else F.lit(False))
    else:
        cond = base_null
    return F.when(cond, 1).otherwise(0)

def missing_summary(df, columns=None, include_empty_str=True):
    if columns is None:
        columns = df.columns
    columns = [c for c in columns if c in df.columns]
    if not columns:
        raise ValueError("No valid columns were given")

    total_rows = df.count()
    schema_map = {f.name: f.dataType for f in df.schema.fields}

    exprs = [F.sum(is_missing_expr(c, schema_map[c], include_empty_str)).alias(c) for c in columns]
    null_counts_row = df.select(*exprs).collect()[0].asDict()

    rows = []
    for c in columns:
        nulls = int(null_counts_row[c])
        present = total_rows - nulls
        null_pct = (nulls / total_rows * 100.0) if total_rows else 0.0
        rows.append({"column": c, "present_count": present, "null_count": nulls, "null_pct": round(null_pct, 4)})

    return pd.DataFrame(rows).sort_values("null_pct", ascending=False, ignore_index=True)

def missing_matrix(df, columns=None, max_rows=5000, include_empty_str=True):
    if columns is None:
        columns = df.columns
    columns = [c for c in columns if c in df.columns]
    if not columns:
        raise ValueError("No valid columns given")

    with_id = df.select(F.monotonically_increasing_id().alias("_rid"), *columns).orderBy("_rid").limit(max_rows)

    schema_map = {f.name: f.dataType for f in df.schema.fields}
    miss_cols = [is_missing_expr(c, schema_map[c], include_empty_str).alias(c) for c in columns]

    bin_df = with_id.select("_rid", *miss_cols).orderBy("_rid").drop("_rid")
    pdf = bin_df.toPandas()
    return pdf

def visualize_missing(df, columns, max_rows = 2000, include_empty_str = True):
    """
    Visualisation that gives a basic understanding of the amount of present and missing data (with % of missing data): plots a heatmap showing where the missing data is (only columns with missing values are displayed) and a bar chart of the null % per column
    """
    summary_pdf = missing_summary(df, columns=columns, include_empty_str=include_empty_str)
    # Keeping columns that have some missing values
    cols_with_nulls = summary_pdf.loc[summary_pdf["null_count"] > 0, "column"].tolist()

    if not cols_with_nulls:
        print("No missing values detected")
        display(spark.createDataFrame(summary_pdf))
        return summary_pdf

    # Plotting heatmap: blue - data is present, red - missing data (null value)
    mat_pdf = missing_matrix(df, columns=cols_with_nulls, max_rows=max_rows, include_empty_str=include_empty_str)
    if not mat_pdf.empty:
        mat = mat_pdf.values.astype(np.int8)
        fig1, ax1 = plt.subplots(figsize=(12, 10))
        cmap = ListedColormap(["#1F449C", "#F05039"])
        norm = BoundaryNorm([-0.5, 0.5, 1.5], cmap.N)
        ax1.imshow(mat, aspect="auto", interpolation="nearest", cmap=cmap, norm=norm)
        ax1.set_title("Missingness heatmap")
        ax1.set_xlabel("Columns")
        ax1.set_ylabel("Row number")
        ax1.set_xticks(range(len(mat_pdf.columns)))
        ax1.set_xticklabels(mat_pdf.columns, rotation=90)
        plt.tight_layout()
        plt.show()

    # Plotting bar chart with null % per column
    filtered_summary = summary_pdf[summary_pdf["column"].isin(cols_with_nulls)]
    if not filtered_summary.empty:
        fig2, ax2 = plt.subplots(figsize=(12, 8))
        ax2.bar(filtered_summary["column"], filtered_summary["null_pct"])
        ax2.set_title("Null % per column")
        ax2.set_xlabel("Column")
        ax2.set_ylabel("Null percentage (%)")
        # Rotate the existing tick labels instead of overwriting them without fixed ticks
        ax2.tick_params(axis="x", labelrotation=90)
        plt.tight_layout()
        plt.show()

    display(spark.createDataFrame(summary_pdf))
    return summary_pdf

In [0]:
summary = visualize_missing(df_paris, columns=None, max_rows=15000, include_empty_str=True)

As we can see now, there are 2 columns that have no data at all: **calendar_updated** and **neighbourhood_group_cleansed**

In our case, there is not much we can do about this missing information, and it can be overlooked in general research.
Still, let's take a look at the other, partially missing data. From our previous research on similar datasets we know that the bathrooms column is connected to bathrooms_text, so we can fill in the bathrooms column by extracting information from bathrooms_text. 

The bathrooms_text column is fairly standardised, so only a few cases are expected: 
* Number and word case: the most common one, like 1 bath
* Null case: empty cell, so we will put null into bathrooms as well
* Half bath case: sometimes there is no number, only the words "half bath", which needs to be taken into account

So, in the future we need to transform the bathrooms column

Let's look into the other missing values and try to find any correlations
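
The three cases above can be sketched as a small parsing function. This is a minimal sketch, not the final transformation: the function name `parse_bathrooms`, the example strings and the choice to count a half bath as 0.5 are assumptions for illustration.

```python
import re

# Matches the first integer or decimal number in the text, e.g. "1" or "1.5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_bathrooms(text):
    """Convert a bathrooms_text value into a number of bathrooms, or None."""
    if text is None:
        return None                      # null case
    match = _NUMBER.search(text)
    if match:
        return float(match.group())      # number and word case, e.g. "1 bath"
    if "half" in text.lower():
        return 0.5                       # half bath case (assumed to mean 0.5)
    return None                          # anything else stays missing


# Hypothetical example values, used only to show the behaviour.
for value in ["1 bath", "1.5 shared baths", "Half-bath", "Shared half-bath", None]:
    print(repr(value), "->", parse_bathrooms(value))
```

In Spark such a function could be wrapped with `F.udf(parse_bathrooms, T.DoubleType())`, although a native expression would likely be faster on large data.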
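
As a small illustration of what "correlation between missing values" means here: each column is turned into a 0/1 indicator (1 = missing) and the Pearson correlation is computed between indicators. The sketch below does this for two plain Python lists; the function name and the toy data are hypothetical.

```python
from math import sqrt


def missing_indicator_corr(a, b):
    """Pearson correlation between the 0/1 missing indicators of two columns.

    Returns None when one of the indicators is constant (correlation undefined).
    """
    x = [1 if v is None else 0 for v in a]
    y = [1 if v is None else 0 for v in b]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Toy columns: price and beds are missing in exactly the same rows.
price = [None, "120", None, "80"]
beds = [None, "2", None, "1"]
same_rows = missing_indicator_corr(price, beds)
print(same_rows)
```

A value close to 1 means the two columns tend to be missing in the same rows, a value close to 0 means their missingness is unrelated.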

In [0]:
def missing_rate_expr(col: str):
    return (F.count(F.when(F.col(col).isNull(), 1)) / F.count(F.lit(1))).alias(col)

def missing_indicator(col: str):
    return F.when(F.col(col).isNull(), 1).otherwise(0).alias(col + "_miss")

def to_pandas(df, limit=None):
    return (df.limit(limit) if isinstance(limit, int) else df).toPandas()

SAMPLE_FRAC = 0.20  
SEED = 42

In [0]:
null_counts = df_paris.select([
    F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df_paris.columns
]).collect()[0].asDict()

cols_with_nulls = [c for c, n in null_counts.items() if n and n > 0]

miss_df = df_paris.select([missing_indicator(c) for c in cols_with_nulls])
miss_df_sample = miss_df.sample(withReplacement=False, fraction=SAMPLE_FRAC, seed=SEED)

pdf = miss_df_sample.toPandas()

corr = pdf.corr(numeric_only=True)
corr.index.name = "col"
corr.columns.name = "col"


In [0]:
N = 40
miss_rates = miss_df_sample.agg(*[F.mean(c).alias(c) for c in miss_df_sample.columns]).collect()[0].asDict()
top_cols = [k for k,_ in sorted(miss_rates.items(), key=lambda kv: kv[1], reverse=True)[:N]]

corr_top = corr.loc[top_cols, top_cols].to_numpy()

plt.figure(figsize=(8,7))
im = plt.imshow(corr_top, aspect='auto')
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.xticks(ticks=np.arange(len(top_cols)), labels=top_cols, rotation=90)
plt.yticks(ticks=np.arange(len(top_cols)), labels=top_cols)
plt.title("Co-missingness Correlation (Top N columns)")
plt.tight_layout()
plt.show()


As we can see from the plot, missing values in the review columns correlate with each other, as do missing values in the neighbourhood columns and in the host columns (name, start date, response time, profile, etc.). There is also a correlation between missing prices, beds, and bathrooms

For reviews, the data is probably missing because the listing has not received any reviews yet. 

The neighbourhood information was probably not filled in by the host

The correlation in the host data is also clearly not a random loss. There are several possible reasons why this data might be missing:
* Host account was deleted or anonymised
* The Airbnb scraper masked identifying information for privacy
* Listing was terminated

Now, let's take a look at correlation between price, beds and bathrooms

In [0]:
col1 = "price"
col2 = "beds"
col3= "bathrooms"
col4 = "bathrooms_text"
col5 = "last_review"
col6 = "license"


df_nulls = df_paris.filter(F.col(col1).isNull()).select(col1, col2, col3, col4, col5, col6)

display(df_nulls)

From this we can see that most listings with missing beds and prices either have no licence or received their last review years ago. We can therefore assume that when a listing is no longer active, its room and price data are not available

However, there are at least 2 columns that can contain missing values that are not displayed here 

They are: amenities and host_verifications

Looking at these columns, we can see that a cell with no information holds just [] instead of null, which also affects our research

In [0]:
cols_to_check = ["host_verifications", "amenities"]

conditions = {
    c: (
        F.col(c).isNull() |
        (F.trim(F.col(c)) == "") |
        (F.trim(F.col(c)) == "[]") |
        (F.trim(F.col(c)) == "['']") |
        (F.length(F.col(c)) <= 4)
    )
    for c in cols_to_check
}

combined_condition = None
for cond in conditions.values():
    combined_condition = cond if combined_condition is None else (combined_condition | cond)

filtered_df = df_paris.filter(combined_condition).select(*cols_to_check)

display(filtered_df)


In [0]:
cols_to_check = ["host_verifications", "amenities"]

conditions = {
    c: (
        F.col(c).isNull() |
        (F.trim(F.col(c)) == "") |
        (F.trim(F.col(c)) == "[]") |
        (F.trim(F.col(c)) == "['']") |
        (F.length(F.col(c)) <= 4)
    )
    for c in cols_to_check
}

summary_df = df_paris.agg(
    *[
        F.sum(F.when(conditions[c], 1).otherwise(0)).alias(f"{c}_empty_count")
        for c in cols_to_check
    ]
)

display(summary_df)


As we can see, there are some missing values that were not visible before. They are not always an empty list: sometimes the cell contains the word None, so we will have to clean them later
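
One possible way to clean these cells later is to parse each value as a list and treat anything empty as missing. The sketch below is only an assumption about how that could look: the function name `parse_list_cell` and the example strings are hypothetical, and it chooses to treat cells that cannot be parsed as missing too.

```python
import ast


def parse_list_cell(raw):
    """Parse a list-like cell such as "['email', 'phone']"; return None if it is empty."""
    if raw is None:
        return None
    text = raw.strip()
    if text in ("", "None", "N/A"):
        return None                       # placeholder words count as missing
    try:
        parsed = ast.literal_eval(text)   # accepts both '...' and "..." quoted items
    except (ValueError, SyntaxError):
        return None                       # assumption: unparseable cells are missing
    if not isinstance(parsed, (list, tuple)):
        return None
    # Drop None and blank entries so that [''] is treated like [].
    items = [str(item).strip() for item in parsed if item is not None and str(item).strip()]
    return items or None


# Hypothetical example values, used only to show the behaviour.
for cell in ["['email', 'phone']", '["Wifi", "Kitchen"]', "[]", "['']", "None", None]:
    print(repr(cell), "->", parse_list_cell(cell))
```

Returning `None` for every empty variant makes these cells show up in the same null counts as the other columns.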

## Venice

In [0]:
summary = visualize_missing(df_venice, columns=None, max_rows=15000, include_empty_str=True)

As we can see, the Venice dataset has fewer fully missing columns: the only one with no data at all is calendar_updated, which can be overlooked.

Here the bathrooms column is mostly filled, but on closer inspection we found that some values are missing even though bathrooms_text is filled for them. That's why in the future we will need to process this column in the same way as proposed for Paris

In [0]:
from pyspark.sql import functions as F

col1 = "bathrooms"
col2 = "bathrooms_text"

df_nulls = df_venice.filter(F.col(col1).isNull()).select(col1, col2)

display(df_nulls)


In [0]:
null_counts = df_venice.select([
    F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df_venice.columns
]).collect()[0].asDict()

cols_with_nulls = [c for c, n in null_counts.items() if n and n > 0]

miss_df = df_venice.select([missing_indicator(c) for c in cols_with_nulls])
miss_df_sample = miss_df.sample(withReplacement=False, fraction=SAMPLE_FRAC, seed=SEED)

pdf = miss_df_sample.toPandas()

corr = pdf.corr(numeric_only=True)
corr.index.name = "col"
corr.columns.name = "col"

In [0]:
N = 40
miss_rates = miss_df_sample.agg(*[F.mean(c).alias(c) for c in miss_df_sample.columns]).collect()[0].asDict()
top_cols = [k for k,_ in sorted(miss_rates.items(), key=lambda kv: kv[1], reverse=True)[:N]]

corr_top = corr.loc[top_cols, top_cols].to_numpy()

plt.figure(figsize=(8,7))
im = plt.imshow(corr_top, aspect='auto')
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.xticks(ticks=np.arange(len(top_cols)), labels=top_cols, rotation=90)
plt.yticks(ticks=np.arange(len(top_cols)), labels=top_cols)
plt.title("Co-missingness Correlation (Top N columns)")
plt.tight_layout()
plt.show()

The correlation pattern is quite similar to the Paris case, for the reasons described above. In addition, we see a correlation between missing values for price, revenue, beds and bathrooms

In [0]:
col1 = "price"
col2 = "beds"
col3= "bathrooms"
col4 = "bathrooms_text"
col5 = "last_review"
col6 = "license"


df_nulls = df_venice.filter(F.col(col1).isNull()).select(col1, col2, col3, col4, col5, col6)

display(df_nulls)

Looking at the filtered table, we can see that the bathroom information can be filled in, so that correlation will disappear once this is done. For the other columns, we can assume that the most common reason for the missing data is that the accommodation is currently unavailable or the listing was terminated 

In [0]:
from pyspark.sql import functions as F

cols_to_check = ["host_verifications", "amenities"]

conditions = {
    c: (
        F.col(c).isNull() |
        (F.trim(F.col(c)) == "") |
        (F.trim(F.col(c)) == "[]") |
        (F.trim(F.col(c)) == "['']") |
        (F.length(F.col(c)) <= 4)
    )
    for c in cols_to_check
}

summary_df = df_venice.agg(
    *[
        F.sum(F.when(conditions[c], 1).otherwise(0)).alias(f"{c}_empty_count")
        for c in cols_to_check
    ]
)

display(summary_df)


As we can see, this dataset also has some hidden missing data in host_verifications and amenities, so we will have to clean it in the future