First, let's read the data that was downloaded earlier.

In [0]:
raw_df = spark.read.table('airbnb.raw.listings')
display(raw_df)

In [0]:
from pyspark.sql import functions as F

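# Rows where each review-related column is null (kept for later inspection)
missing_reviews = raw_df.filter(F.col("reviews_per_month").isNull())
missing_scores = raw_df.filter(F.col("review_scores_value").isNull())

# Count the nulls in both columns in one pass: F.when yields 1 only for null rows,
# and F.count ignores the nulls produced for all other rows
raw_df.select([
    F.count(F.when(F.col("reviews_per_month").isNull(), 1)).alias("missing_reviews_per_month"),
    F.count(F.when(F.col("review_scores_value").isNull(), 1)).alias("missing_review_scores_value"),
]).show()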

There is a mismatch between the number of missing reviews_per_month values and the number of missing review_scores_value values. We assume the mismatch arises because some listings are too new to have received a review. This is hypothesis I.

In [0]:
new_listings = raw_df.filter(F.col("reviews_per_month").isNull()) \
  .select("host_id", "host_since", "reviews_per_month", "number_of_reviews", "last_review")
# show() returns None, so call it separately to keep the DataFrame in new_listings
new_listings.show()

Hypothesis I was incorrect: host_since shows that these hosts have already been in the database for a few years.

Hypothesis II concerns the property type: perhaps it affects how popular a listing is.

In [0]:
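properties = raw_df.groupBy("property_type").agg(
    # mean of a 0/1 indicator = share of listings with missing reviews_per_month
    F.mean(F.when(F.col("reviews_per_month").isNull(), 1).otherwise(0)).alias("missing_reviews_rate")
).orderBy(F.desc("missing_reviews_rate"))
# show() returns None, so call it separately to keep the DataFrame in properties
properties.show()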


In this table, missing_reviews_rate ranges from 0 to 1, where 1 means that 100% of the listings of that property type are missing reviews_per_month.

What we can see:
1. Some property types, for example room in aparthotel, are under 0.5.
That means most of these listings have reviews, but some are still missing them.
2. Private room in boat and similar types are exotic variants that we assume do not attract many customers, which is why their reviews are missing.
3. Only three property types are missing reviews completely; perhaps these are new listings or simply unpopular ones.

So hypothesis II is mostly, but not fully, supported.
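One caveat when reading missing_reviews_rate: a property type with only one or two listings can reach 0 or 1 very easily, so it helps to look at the group size next to the rate. The following is a minimal plain-Python sketch of that idea, not the notebook's own method; the function name `missing_rate_by_group` and the sample rows are hypothetical and are not taken from the real dataset.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: (missing_rate, group_size)} for a list of dict rows."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        stats = counts[row[group_key]]
        if row[value_key] is None:
            stats[0] += 1  # one more listing without reviews_per_month
        stats[1] += 1      # one more listing in this group
    return {g: (missing / total, total) for g, (missing, total) in counts.items()}

# Hypothetical sample rows, invented for illustration only.
sample_rows = [
    {"property_type": "Boat", "reviews_per_month": None},
    {"property_type": "Entire rental unit", "reviews_per_month": 1.2},
    {"property_type": "Entire rental unit", "reviews_per_month": None},
    {"property_type": "Entire rental unit", "reviews_per_month": 0.4},
    {"property_type": "Entire rental unit", "reviews_per_month": 2.0},
]

rates = missing_rate_by_group(sample_rows, "property_type", "reviews_per_month")
# Sort by rate, highest first, and print the group size alongside it.
for group, (rate, size) in sorted(rates.items(), key=lambda kv: -kv[1][0]):
    print(f"{group}: rate={rate:.2f}, listings={size}")
```

In the PySpark aggregation, the same information could be obtained by adding `F.count("*").alias("n_listings")` to the `agg` call, so that a rate of 1 backed by a single listing is easy to tell apart from a rate of 1 backed by many.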

Hypothesis III is about rarely booked rooms. To test it, we compare the column "availability_365" between listings with and without reviews.

In [0]:
raw_df.groupBy(F.when(F.col("reviews_per_month").isNull(), "Missing")
            .otherwise("Has reviews").alias("Review_Status")) \
  .agg(F.avg("availability_365").alias("avg_availability")) \
  .show()



Listings with missing reviews_per_month values actually appear to have slightly fewer available days (so they are booked a bit more) than those with reviews.

Hypothesis III was wrong: availability alone is not a strong predictor of missing reviews in this dataset.

Our last hypothesis, IV, is about price: does the average price differ between listings with and without reviews?

In [0]:
df = raw_df.withColumn(
    "price_clean",
    # strip the currency sign and thousands separators before casting to a number
    F.regexp_replace(F.col("price"), "[$,]", "").cast("double")
)
# df.select("price", "price_clean").show(5)
price = df.groupBy(F.when(F.col("reviews_per_month").isNull(), "Missing")
            .otherwise("Has reviews").alias("Review_Status")) \
  .agg(F.avg("price_clean").alias("avg_price"))
# show() returns None, so call it separately to keep the DataFrame in price
price.show()


Hypothesis IV was supported: listings with missing reviews are, on average, about 21 euros more expensive than listings that have reviews.
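The average price depends on how the price strings are cleaned, so it can be worth checking the cleaning rule on a few sample values. Below is a plain-Python sketch of the same `[$,]` substitution used for price_clean. The helper name `clean_price` and the sample strings are hypothetical, and returning None for unparseable input is an assumption meant to resemble a cast that yields null on failure.

```python
import re

def clean_price(raw):
    """Strip '$' and ',' and convert to float; return None if that fails."""
    if raw is None:
        return None
    stripped = re.sub(r"[$,]", "", raw).strip()
    try:
        return float(stripped)
    except ValueError:
        # Assumption: treat unparseable text as missing instead of raising.
        return None

# Hypothetical sample values, not taken from the real dataset.
samples = ["$1,250.00", "$85.00", None, "n/a"]
cleaned = [clean_price(s) for s in samples]
print(cleaned)
```

Since `F.avg` skips nulls, any price that fails to parse would silently drop out of avg_price, so counting the nulls in price_clean before comparing the two groups is a reasonable sanity check.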