# Cleaning Airbnb Listings Dataset

[Inside Airbnb](http://insideairbnb.com/get-the-data.html) provides datasets containing Airbnb listings for various cities around the world. These datasets often require cleaning before they can be effectively analyzed. In this notebook, we will demonstrate how to clean an Airbnb listings dataset for Barcelona, Spain.

## Data Cleaning Steps

1. **Import Libraries**: We will use `pandas` for data manipulation and `numpy` for numerical operations.
2. **Load Dataset**: Read the CSV file containing the Airbnb listings data.
   - Sometimes, the dataset may be compressed (e.g., in gzip format), so we will handle that accordingly.
   - If the dataset is large but can fit into memory, consider loading a sample for initial exploration.
   - If the dataset is too large to fit into memory, consider using libraries like Dask, Spark, or Polars for out-of-core processing.
3. **Inspect Data**: Check the first few rows of the dataset to understand its structure.
4. **Select Relevant Columns**: Identify and retain only the columns that are necessary for analysis.
5. **Handle Missing Values**: Identify columns with missing values and decide how to handle them (e.g., dropping rows or filling with the mean, median, or mode).
6. **Data Type Conversion**: Ensure that each column has the appropriate data type (e.g., converting price columns to numeric types).
   - A common issue is that price columns may contain currency symbols or commas, which need to be removed before conversion.
   - Another common issue is date columns being stored as strings; these should be converted to datetime objects if you plan to perform date-based analyses.
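The date conversion mentioned in step 6 can be sketched as follows. This is a minimal example on a toy frame, not the full dataset: the column names `last_scraped` and `first_review` come from the listings file, while the frame `dates_demo`, its missing value, and the derived column `days_since_first_review` are assumptions made for illustration.

```python
import pandas as pd

# Toy data for illustration; in the raw CSV these columns are plain strings.
dates_demo = pd.DataFrame(
    {
        "last_scraped": ["2025-09-15", "2025-09-14"],
        "first_review": ["2013-05-27", None],
    }
)

for col in ["last_scraped", "first_review"]:
    # errors="coerce" turns unparseable values into NaT instead of raising.
    dates_demo[col] = pd.to_datetime(dates_demo[col], errors="coerce")

# Once the columns are datetimes, date arithmetic works directly.
dates_demo["days_since_first_review"] = (
    dates_demo["last_scraped"] - dates_demo["first_review"]
).dt.days
```

Rows with a missing date simply yield a missing result in the derived column.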


▶️ Import libraries.


In [1]:
import pandas as pd
import numpy as np

▶️ Display up to 100 columns in this Jupyter notebook.


In [2]:
pd.set_option("display.max_columns", 100)

:::{tip} Maximum Number of Rows
You can also set the maximum number of rows to display using the following code:

```python
pd.set_option("display.max_rows", 100)
```

If you want to display all columns without any limit, you can set it to `None`:

```python
pd.set_option("display.max_columns", None)
```

This is not recommended for very wide DataFrames, as it may clutter your output.

:::


:::{note} Other commonly used `pd.set_option()` settings

- `pd.set_option("display.width", 120)` sets the total console width.
- `pd.set_option("display.max_colwidth", 50)` sets the maximum column width for each column. This can be used to truncate long strings.
- `pd.set_option("display.precision", 2)` sets the number of decimal places for floating-point numbers.
- `pd.set_option("display.float_format", "{:.2f}".format)` formats floating-point numbers to two decimal places. This can be useful if you want to control the exact display format of floats.

:::
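If you only need a display setting for a single output, `pd.option_context` applies it temporarily instead of changing it for the whole notebook. A minimal sketch; the values 5 and 2 are arbitrary choices for illustration:

```python
import pandas as pd

before = pd.get_option("display.max_columns")

# Options passed to option_context apply only inside the with-block.
with pd.option_context("display.max_columns", 5, "display.precision", 2):
    inside = pd.get_option("display.max_columns")

# After the block, the previous setting is restored automatically.
after = pd.get_option("display.max_columns")
```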


## Load Dataset


The `compression="gzip"` argument tells pandas that the file is gzip-compressed. It is optional here: `read_csv` defaults to `compression="infer"`, which detects the compression type from the file extension, so a path ending in `.gz` is handled automatically. Specifying it explicitly is still good practice when you know the file is compressed, because a URL with query parameters after the file name no longer ends in `.gz`, and the inference then fails. If your dataset is not compressed, omit the argument.


In [3]:
df = pd.read_csv(
    "https://data.insideairbnb.com/spain/catalonia/barcelona/2025-09-14/data/listings.csv.gz",
    compression="gzip",
)
df.head(2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,18674,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,city scrape,Huge flat for 8 people close to Sagrada Familia,110m2 apartment to rent in Barcelona. Located ...,Apartment in Barcelona located in the heart of...,https://a0.muscache.com/pictures/13031453/413c...,71615,https://www.airbnb.com/users/show/71615,Mireia,2010-01-19,"Barcelona, Spain","We are Mireia (47) & Maria (49), two multiling...",within an hour,96%,91%,f,https://a0.muscache.com/im/pictures/user/User/...,https://a0.muscache.com/im/pictures/user/User/...,la Sagrada Família,41.0,46.0,"['email', 'phone']",t,t,"Barcelona, CT, Spain",la Sagrada Família,Eixample,41.40556,2.17262,Entire rental unit,Entire home/apt,8,2.0,2 baths,3.0,6.0,"[""Pets allowed"", ""30 inch TV"", ""Coffee maker"",...",$210.00,1,1125,1.0,31.0,999.0,999.0,3.0,999.0,,t,12,29,56,80,2025-09-15,51,7,0,74,5,42,8820.0,2013-05-27,2025-07-31,4.34,4.4,4.56,4.64,4.62,4.82,4.32,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34
1,23197,https://www.airbnb.com/rooms/23197,20250914152803,2025-09-14,city scrape,"Forum CCIB DeLuxe, Spacious, Large Balcony, relax",Beautiful and Spacious Apartment with Large Te...,"Strategically located in the Parc del Fòrum, a...",https://a0.muscache.com/pictures/miso/Hosting-...,90417,https://www.airbnb.com/users/show/90417,Etain (Marnie),2010-03-09,"Catalonia, Spain","Hi there,\n\nI’m marnie, originally from Austr...",within an hour,100%,96%,t,https://a0.muscache.com/im/pictures/user/44b56...,https://a0.muscache.com/im/pictures/user/44b56...,El Besòs i el Maresme,6.0,9.0,"['email', 'phone']",t,t,"Sant Adria de Besos, Barcelona, Spain",el Besòs i el Maresme,Sant Martí,41.412432,2.21975,Entire rental unit,Entire home/apt,5,2.0,2 baths,3.0,4.0,"[""Electric stove"", ""Toaster"", ""Wine glasses"", ...",$285.00,3,32,1.0,7.0,1125.0,1125.0,4.1,1125.0,,t,10,33,63,289,2025-09-14,91,12,2,82,7,72,20520.0,2011-03-15,2025-09-08,4.82,4.94,4.9,4.94,4.99,4.66,4.68,ESFCTU000008106000547162000000000000000000HUTB...,f,1,1,0,0,0.52
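For a first look at a large file, as suggested in step 2, you can read only some rows and columns. The sketch below assumes a tiny gzip-compressed CSV built in memory as a stand-in for `listings.csv.gz`; the three rows in `raw_csv` are made up for illustration. `nrows` and `usecols` are regular `read_csv` parameters.

```python
import gzip
import io

import pandas as pd

# Build a tiny gzip-compressed CSV in memory (illustrative data only).
raw_csv = 'id,name,price\n1,Flat A,$210.00\n2,Flat B,$285.00\n3,Flat C,"$1,200.00"\n'
buffer = io.BytesIO(gzip.compress(raw_csv.encode("utf-8")))

# A file-like object has no extension to infer from, so compression is explicit.
sample = pd.read_csv(
    buffer,
    compression="gzip",
    usecols=["id", "price"],  # read only the columns needed
    nrows=2,  # read only the first two data rows
)
```

The same two parameters work when reading from a path or URL.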


▶️ Print out the DataFrame's summary.


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19410 entries, 0 to 19409
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            19410 non-null  int64  
 1   listing_url                                   19410 non-null  object 
 2   scrape_id                                     19410 non-null  int64  
 3   last_scraped                                  19410 non-null  object 
 4   source                                        19410 non-null  object 
 5   name                                          19410 non-null  object 
 6   description                                   18673 non-null  object 
 7   neighborhood_overview                         8986 non-null   object 
 8   picture_url                                   19410 non-null  object 
 9   host_id                                       19410 non-null 

▶️ Although `info()` provides some information about the number of non-missing values, you can explicitly count missing values in each column using the following code:


In [5]:
df.isna().sum()
# Equivalent to:
df.isnull().sum()

id                                                 0
listing_url                                        0
scrape_id                                          0
last_scraped                                       0
source                                             0
                                                ... 
calculated_host_listings_count                     0
calculated_host_listings_count_entire_homes        0
calculated_host_listings_count_private_rooms       0
calculated_host_listings_count_shared_rooms        0
reviews_per_month                               4989
Length: 79, dtype: int64

To show only the columns that have missing values, you can use:


In [6]:
df.isna().sum()[df.isna().sum() > 0]
# Equivalent to:
df.isnull().sum()[df.isnull().sum() > 0]

description                      737
neighborhood_overview          10424
host_name                          5
host_since                         5
host_location                   4702
host_about                      7170
host_response_time              3126
host_response_rate              3126
host_acceptance_rate            2748
host_is_superhost                383
host_thumbnail_url                 5
host_picture_url                   5
host_neighbourhood             10787
host_listings_count                5
host_total_listings_count          5
host_verifications                 5
host_has_profile_pic               5
host_identity_verified             5
neighbourhood                  10424
bathrooms                       4112
bathrooms_text                    11
bedrooms                        1964
beds                            4182
price                           4134
minimum_minimum_nights             4
maximum_minimum_nights             4
minimum_maximum_nights             4
m
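Raw counts are easier to compare when expressed as a share of all rows. A minimal sketch, assuming a toy frame `toy` with made-up values rather than the full dataset:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "price": ["$210.00", None, "$285.00", None],
        "bedrooms": [3.0, 2.0, np.nan, 1.0],
        "room_type": ["Entire home/apt"] * 4,
    }
)

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean().mul(100).round(1)

# Keep only columns with missing values, most affected first.
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
```

Applied to `df`, the same two lines would rank the columns by percentage of missing values.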

### Select and rename columns


In [7]:
df_c = df[
    [
        "name",
        "neighbourhood_cleansed",
        "room_type",
        "bedrooms",
        "bathrooms",
        "accommodates",
        "minimum_nights",
        "price",
        "availability_365",
        "number_of_reviews",
        "review_scores_rating",
        "latitude",
        "longitude",
        "host_is_superhost",
    ]
].copy()

df_c.rename(
    columns={
        "neighbourhood_cleansed": "neighbourhood",
        "review_scores_rating": "review_score",
        "host_is_superhost": "is_superhost",
    },
    inplace=True,
)

df_c.head(2)

Unnamed: 0,name,neighbourhood,room_type,bedrooms,bathrooms,accommodates,minimum_nights,price,availability_365,number_of_reviews,review_score,latitude,longitude,is_superhost
0,Huge flat for 8 people close to Sagrada Familia,la Sagrada Família,Entire home/apt,3.0,2.0,8,1,$210.00,80,51,4.34,41.40556,2.17262,f
1,"Forum CCIB DeLuxe, Spacious, Large Balcony, relax",el Besòs i el Maresme,Entire home/apt,3.0,2.0,5,3,$285.00,289,91,4.82,41.412432,2.21975,t


### Parse price as `float`s


In [8]:
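# regex=False treats "$" as a literal character, not a regex end-of-string anchor.
df_c["price"] = (
    df_c["price"].str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(np.float64)
)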
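If a price column might contain values that are not numbers at all, `astype` raises an error. A more forgiving variant, sketched here on a made-up Series `prices` (the `"n/a"` entry is an assumption for illustration, not something observed in this dataset), uses `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Illustrative values, including a missing entry and an unparseable string.
prices = pd.Series(["$210.00", "$1,285.00", None, "n/a"])

cleaned = pd.to_numeric(
    # regex=False treats "$" and "," as literal characters.
    prices.str.replace("$", "", regex=False).str.replace(",", "", regex=False),
    errors="coerce",  # unparseable strings become NaN instead of raising
)
```

Coercion hides bad values, so it is worth counting the resulting NaNs before and after the conversion.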

### Convert `is_superhost` to 0s and 1s


In [9]:
# Listings marked "t" become 1; everything else, including missing values, becomes 0.
df_c["is_superhost"] = np.where(df_c["is_superhost"] == "t", 1, 0)
df_c[["name", "is_superhost"]].sample(10)

Unnamed: 0,name,is_superhost
8572,Lepanto,0
3078,Les Corts | 3-Bedroom Apartment with Terrace,0
6489,Amazing apartment in the city's center -20%OFF!!,0
17730,Habitación con estilo y calma,1
12508,Àtic amb vista panoràmica,0
18632,New Boutique Hotel Plaça Reial,0
17466,Cozy Studio in El Raval,0
14977,Apartamento con cocina a 2'del metro de badal 4B,0
11111,Central Studio Mauri,0
13051,Big Apartment @900m BEACH / BORN,1
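Note that `np.where` maps listings with a missing `host_is_superhost` value (383 in the raw data) to 0, which treats "unknown" as "not a superhost". If you would rather keep them missing, one alternative is `map` combined with the nullable `Int64` dtype. A minimal sketch on a made-up Series `flags`:

```python
import numpy as np
import pandas as pd

# Illustrative values: two superhosts, one non-superhost, one unknown.
flags = pd.Series(["t", "f", np.nan, "t"])

# map() leaves missing values as NaN; the nullable "Int64" dtype
# then stores them as <NA> alongside regular integers.
is_superhost = flags.map({"t": 1, "f": 0}).astype("Int64")
```

Which variant is appropriate depends on how the later analysis should treat unknown hosts.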


### Remove listings with missing values


In [10]:
df_c["bathrooms"].unique()

array([ 2. ,  1.5,  1. ,  3. ,  nan,  3.5,  4. ,  0. ,  2.5,  4.5,  6. ,
        5.5,  0.5,  7.5,  5. ,  8. ,  7. ,  9. , 12. , 10. , 14. , 13. ,
       11. ])

In [11]:
df_c["bedrooms"].unique()

array([ 3.,  2.,  1.,  4., nan,  0.,  7.,  5.,  6.,  8.,  9., 12., 10.,
       18., 20., 15., 24., 14., 19., 16., 11., 26., 29.])

In [12]:
df_c.dropna(subset=["bedrooms", "bathrooms", "review_score"], inplace=True)

In [13]:
df_c["room_type"].value_counts()

room_type
Entire home/apt    8503
Private room       3160
Shared room          81
Hotel room           50
Name: count, dtype: int64

### Check the cleaned DataFrame


In [14]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11794 entries, 0 to 19366
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               11794 non-null  object 
 1   neighbourhood      11794 non-null  object 
 2   room_type          11794 non-null  object 
 3   bedrooms           11794 non-null  float64
 4   bathrooms          11794 non-null  float64
 5   accommodates       11794 non-null  int64  
 6   minimum_nights     11794 non-null  int64  
 7   price              11776 non-null  float64
 8   availability_365   11794 non-null  int64  
 9   number_of_reviews  11794 non-null  int64  
 10  review_score       11794 non-null  float64
 11  latitude           11794 non-null  float64
 12  longitude          11794 non-null  float64
 13  is_superhost       11794 non-null  int64  
dtypes: float64(6), int64(5), object(3)
memory usage: 1.3+ MB
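The summary shows that `price` still has a few missing values (11776 non-null out of 11794 rows), because it was not included in the `dropna` subset. Two of the strategies listed at the top of this notebook, sketched on a toy frame with made-up values; whether to drop or fill is a judgment call that depends on the analysis:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"],
        "price": [200.0, np.nan, 60.0, 80.0],
    }
)

# Option 1: drop rows without a price.
dropped = toy.dropna(subset=["price"])

# Option 2: fill with the median price of listings of the same room type.
group_median = toy.groupby("room_type")["price"].transform("median")
filled = toy.assign(price=toy["price"].fillna(group_median))
```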


In [15]:
df_c.head()

Unnamed: 0,name,neighbourhood,room_type,bedrooms,bathrooms,accommodates,minimum_nights,price,availability_365,number_of_reviews,review_score,latitude,longitude,is_superhost
0,Huge flat for 8 people close to Sagrada Familia,la Sagrada Família,Entire home/apt,3.0,2.0,8,1,210.0,80,51,4.34,41.40556,2.17262,0
1,"Forum CCIB DeLuxe, Spacious, Large Balcony, relax",el Besòs i el Maresme,Entire home/apt,3.0,2.0,5,3,285.0,289,91,4.82,41.412432,2.21975,1
2,Sagrada Familia area - Còrsega 1,el Camp d'en Grassot i Gràcia Nova,Entire home/apt,2.0,1.5,6,1,170.0,64,152,4.46,41.40566,2.17015,0
3,Stylish Top Floor Apartment - Ramblas Plaza Real,el Barri Gòtic,Entire home/apt,1.0,1.0,2,31,110.0,333,25,4.36,41.38062,2.17517,0
4,VIDRE HOME PLAZA REAL on LAS RAMBLAS,el Barri Gòtic,Entire home/apt,4.0,3.0,9,5,333.0,335,271,4.57,41.37978,2.17623,0


In [16]:
df_c.to_csv("cleaned-output.csv", index=False)
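A quick way to check the export is to read the file back. A minimal sketch on a toy frame written to a temporary directory (the frame `toy` and its values are illustrative). It also shows the extra `Unnamed: 0` column that appears when the row index is written and then read back:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_c; values are illustrative.
toy = pd.DataFrame({"name": ["Flat A", "Flat B"], "price": [210.0, 285.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned-output.csv")

    # Without the index, the file round-trips to the same columns.
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

    # With the index, the unnamed index column comes back as "Unnamed: 0".
    toy.to_csv(path)
    with_index = pd.read_csv(path)
```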