We are asked to do a small study of the most popular listings: those with at least 100 reviews. We are only interested in listings of type “Entire home/apt” (an entire house or apartment) that have a license. We want to know how many listings of this kind there are in each municipality (neighbourhood), with the results sorted alphabetically by municipality name. The result will look something like this table:

```
+----------------------+-----+
|              municipi|count|
+----------------------+-----+
|                Alaior|   XX|
| Ciutadella de Menorca|   YY|
|                   ...|  ...|
```



Create a Google Colab notebook that uses PySpark and the Spark SQL library to generate this table with the requested information.

In [1]:
!pip install pyspark



In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Airbnb-Listings-Analysis") \
    .getOrCreate()

csv_url = "https://raw.githubusercontent.com/tnavarrete-iedib/bigdata-24-25/refs/heads/main/listings.csv"
!wget -O listings.csv {csv_url}

--2025-04-26 10:48:04--  https://raw.githubusercontent.com/tnavarrete-iedib/bigdata-24-25/refs/heads/main/listings.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 584697 (571K) [text/plain]
Saving to: ‘listings.csv’


2025-04-26 10:48:04 (22.4 MB/s) - ‘listings.csv’ saved [584697/584697]



In [3]:
# Read the CSV, treating the first row as the header and letting Spark infer column types
df = spark.read.csv("listings.csv", header=True, inferSchema=True)

df.show(5)

+------+--------------------+-------+---------+-------------------+--------------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+
|    id|                name|host_id|host_name|neighbourhood_group|       neighbourhood|latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|             license|
+------+--------------------+-------+---------+-------------------+--------------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+
| 44085|Villa in Addaia g...| 193043|  Manuela|               NULL|         Es Mercadal|40.00974|  4.19958|Entire home/apt|  460|             5|    

In [4]:
# Register the DataFrame as a temporary view so it can be queried with Spark SQL
df.createOrReplaceTempView("airbnb_listings")

query = """
SELECT
    neighbourhood as municipio,
    COUNT(*) as count
FROM
    airbnb_listings
WHERE
    room_type = 'Entire home/apt'
    AND license IS NOT NULL
    AND license != ''
    AND number_of_reviews >= 100
GROUP BY
    neighbourhood
ORDER BY
    neighbourhood ASC
"""


# Run the query and print up to 100 rows without truncating long values
result = spark.sql(query)
result.show(100, False)

# Same filter, grouping, and ordering expressed with the DataFrame API
result_df = df.filter(
    (col("room_type") == "Entire home/apt") &
    (col("license").isNotNull()) &
    (col("license") != "") &
    (col("number_of_reviews") >= 100)
).groupBy("neighbourhood").count().orderBy("neighbourhood")

print("\nResult from the DataFrame API:")
result_df.show(100, False)

spark.stop()

+---------------------+-----+
|municipio            |count|
+---------------------+-----+
|Alaior               |16   |
|Ciutadella de Menorca|34   |
|Es Mercadal          |11   |
|Ferreries            |1    |
|Mahón                |2    |
|Sant Lluís           |9    |
+---------------------+-----+


Result from the DataFrame API:
+---------------------+-----+
|neighbourhood        |count|
+---------------------+-----+
|Alaior               |16   |
|Ciutadella de Menorca|34   |
|Es Mercadal          |11   |
|Ferreries            |1    |
|Mahón                |2    |
|Sant Lluís           |9    |
+---------------------+-----+
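
As a sanity check on the selection rule, the same filter, grouping, and ordering can be sketched in plain Python without Spark. This is only a minimal sketch under stated assumptions: the sample rows are made up for illustration (they are not taken from listings.csv), and `SAMPLE_CSV` and `count_popular_by_municipality` are hypothetical names. Only the column names come from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows, NOT taken from listings.csv; only the column
# names match the dataset used in the notebook.
SAMPLE_CSV = """neighbourhood,room_type,number_of_reviews,license
Alaior,Entire home/apt,150,ET-0001
Alaior,Entire home/apt,99,ET-0002
Ciutadella de Menorca,Entire home/apt,120,
Ciutadella de Menorca,Private room,300,ET-0003
Ciutadella de Menorca,Entire home/apt,100,ET-0004
"""


def count_popular_by_municipality(csv_text):
    """Count licensed entire homes with at least 100 reviews per municipality."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (
            row["room_type"] == "Entire home/apt"
            and row["license"] != ""  # an empty field means no license
            and int(row["number_of_reviews"]) >= 100
        ):
            counts[row["neighbourhood"]] += 1
    # Sort alphabetically by municipality name, mirroring ORDER BY neighbourhood
    return sorted(counts.items())


for municipality, count in count_popular_by_municipality(SAMPLE_CSV):
    print(municipality, count)
```

With these sample rows, each of the two municipalities ends up with a count of 1: the listing with 99 reviews, the one with an empty license, and the private room are all excluded. Running the same function over the real file would be one way to cross-check the Spark counts, assuming the CSV parses cleanly with the standard `csv` module.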

