## Why is it important?
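Poorly cleaned data leads to misleading conclusions and unreliable models: missing values, duplicates, wrong types, and extreme outliers all distort summary statistics and anything trained on them.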

## What should it contain?

- Loading data 
- Basic information (.shape, .info(), .describe())
- Handling missing values (e.g., filling in, deleting)
- Searching for and removing duplicates
- Checking types (e.g., dates → datetime, prices → float)
- Filtering out outliers (e.g., $10,000/night on Airbnb is unrealistic)
- Standardization (e.g., currencies, formats)
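
Type checking and outlier filtering from the list above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own code: the toy values, the `PRICE_CAP` threshold, and the assumption that prices arrive as strings such as `$1,250.00` are all hypothetical.

```python
import pandas as pd

# Toy data standing in for the raw listings; values are illustrative assumptions.
toy = pd.DataFrame({
    "price": ["$120.00", "$1,250.00", "$10,000.00", None],
    "last_review": ["2025-06-01", "2025-05-20", "not a date", None],
})

# Strip currency symbols and thousands separators, then convert to float.
toy["price"] = pd.to_numeric(
    toy["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Parse dates; unparseable values become NaT instead of raising an error.
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# Keep only rows whose price is below an assumed plausibility cap.
PRICE_CAP = 5000.0  # hypothetical threshold, choose one that fits the data
filtered = toy[toy["price"] < PRICE_CAP]

print(toy.dtypes)
print(filtered)
```

Using `errors="coerce"` turns bad values into missing values, so they can be handled together with the other gaps in the missing-value step.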

### Imports

In [1]:
import os
import sys

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))

if project_root not in sys.path:
    sys.path.append(project_root)

print("📂 Project root:", project_root)

📂 Project root: /Users/erikvida/PycharmProjects/airbnb-price-prediction


In [2]:
import pandas as pd
import numpy as np
from dotenv import load_dotenv

from src.db_connection import DatabaseConfig, DatabaseConnection


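dotenv_path = "/Users/erikvida/PycharmProjects/airbnb-price-prediction/.env"
load_dotenv(dotenv_path)

# Read credentials from the environment instead of hard-coding them.
# Assumes the .env file defines DB_USER and DB_PASS.
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")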


True

### 1.0 Loading data

In [3]:
amsterdams_airbnbs_raw_data = pd.read_csv("../data/raw/amsterdam_airbnbs_data.csv")
# Work on a copy so later in-place changes do not alter the raw frame.
df = amsterdams_airbnbs_raw_data.copy()
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,27886,https://www.airbnb.com/rooms/27886,20250609011745,2025-06-17,city scrape,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,"Central, quiet, safe, clean and beautiful.",https://a0.muscache.com/pictures/02c2da9d-660e...,97647,...,4.92,4.9,4.78,0363 974D 4986 7411 88D8,f,1,0,1,0,1.85
1,28871,https://www.airbnb.com/rooms/28871,20250609011745,2025-06-17,city scrape,Comfortable double room,Basic bedroom in the center of Amsterdam.,"Flower market , Leidseplein , Rembrantsplein",https://a0.muscache.com/pictures/160889/362340...,124245,...,4.94,4.94,4.84,0363 607B EA74 0BD8 2F6F,f,2,0,2,0,3.93
2,29051,https://www.airbnb.com/rooms/29051,20250609011745,2025-06-17,city scrape,Comfortable single / double room,This room can also be rented as a single or a ...,the street is quite lively especially on weeke...,https://a0.muscache.com/pictures/162009/bd6be2...,124245,...,4.92,4.87,4.79,0363 607B EA74 0BD8 2F6F,f,2,0,2,0,4.74
3,44391,https://www.airbnb.com/rooms/44391,20250609011745,2025-06-17,previous scrape,Quiet 2-bedroom Amsterdam city centre apartment,Guests greatly appreciate the unique location ...,The appartment is located in the city centre. ...,https://a0.muscache.com/pictures/97741545/3900...,194779,...,4.9,4.68,4.5,0363 E76E F06A C1DD 172C,f,1,1,0,0,0.23
4,47061,https://www.airbnb.com/rooms/47061,20250609011745,2025-06-17,city scrape,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",,https://a0.muscache.com/pictures/268343/a08ce2...,211696,...,4.9,4.85,4.63,0363 1266 8C04 4133 E6AC,f,1,1,0,0,1.13


### 2.0 Basic information

#### 2.1 Shape

In [4]:
num_rows, num_columns = df.shape
print(f" Rows: {num_rows}, Columns: {num_columns}")

 Rows: 10168, Columns: 79


#### 2.2 Info - Data types of the different columns

In [5]:
print("\n Data Types:")
df.info()


 Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10168 entries, 0 to 10167
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            10168 non-null  int64  
 1   listing_url                                   10168 non-null  object 
 2   scrape_id                                     10168 non-null  int64  
 3   last_scraped                                  10168 non-null  object 
 4   source                                        10168 non-null  object 
 5   name                                          10168 non-null  object 
 6   description                                   9859 non-null   object 
 7   neighborhood_overview                         5258 non-null   object 
 8   picture_url                                   10168 non-null  object 
 9   host_id                                       1

#### 2.3 Descriptive statistics

In [6]:
print("\n Descriptive Statistics:")
df.describe().T


 Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,10168.0,5.495778e+17,5.381949e+17,27886.0,24759680.0,6.545968e+17,1.05309e+18,1.438602e+18
scrape_id,10168.0,20250610000000.0,0.0,20250610000000.0,20250610000000.0,20250610000000.0,20250610000000.0,20250610000000.0
host_id,10168.0,130296100.0,173732500.0,1662.0,12694670.0,44423660.0,182933600.0,699642400.0
host_listings_count,10164.0,3.524105,30.57866,1.0,1.0,1.0,1.0,911.0
host_total_listings_count,10164.0,5.40368,55.06787,1.0,1.0,1.0,2.0,1621.0
neighbourhood_group_cleansed,0.0,,,,,,,
latitude,10168.0,52.36666,0.0171246,52.29028,52.35567,52.36552,52.37645,52.42512
longitude,10168.0,4.889542,0.03524221,4.75587,4.86455,4.887365,4.908866,5.026669
accommodates,10168.0,2.926928,1.292811,1.0,2.0,2.0,4.0,16.0
bathrooms,6377.0,1.250039,0.5376453,0.0,1.0,1.0,1.5,17.0
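
In the statistics above, `neighbourhood_group_cleansed` has a count of 0, meaning every value is missing. A minimal sketch of how such columns could be detected automatically, using a toy frame whose values are assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame: one column is entirely missing, one is partially missing.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_group_cleansed": [np.nan, np.nan, np.nan],
    "bathrooms": [1.0, np.nan, 1.5],
})

# Columns in which every value is missing carry no information.
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)

# Dropping them is usually safe; partially filled columns are kept.
toy_without_empty = toy.drop(columns=empty_columns)
print(toy_without_empty.columns.tolist())
```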


### 3.0 Identify and handle missing values

#### 3.1 Identify missing values

In [7]:
# Treat "?" and empty strings as missing values.
df = df.replace(["?", ""], np.nan)
df.head(5)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,27886,https://www.airbnb.com/rooms/27886,20250609011745,2025-06-17,city scrape,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,"Central, quiet, safe, clean and beautiful.",https://a0.muscache.com/pictures/02c2da9d-660e...,97647,...,4.92,4.9,4.78,0363 974D 4986 7411 88D8,f,1,0,1,0,1.85
1,28871,https://www.airbnb.com/rooms/28871,20250609011745,2025-06-17,city scrape,Comfortable double room,Basic bedroom in the center of Amsterdam.,"Flower market , Leidseplein , Rembrantsplein",https://a0.muscache.com/pictures/160889/362340...,124245,...,4.94,4.94,4.84,0363 607B EA74 0BD8 2F6F,f,2,0,2,0,3.93
2,29051,https://www.airbnb.com/rooms/29051,20250609011745,2025-06-17,city scrape,Comfortable single / double room,This room can also be rented as a single or a ...,the street is quite lively especially on weeke...,https://a0.muscache.com/pictures/162009/bd6be2...,124245,...,4.92,4.87,4.79,0363 607B EA74 0BD8 2F6F,f,2,0,2,0,4.74
3,44391,https://www.airbnb.com/rooms/44391,20250609011745,2025-06-17,previous scrape,Quiet 2-bedroom Amsterdam city centre apartment,Guests greatly appreciate the unique location ...,The appartment is located in the city centre. ...,https://a0.muscache.com/pictures/97741545/3900...,194779,...,4.9,4.68,4.5,0363 E76E F06A C1DD 172C,f,1,1,0,0,0.23
4,47061,https://www.airbnb.com/rooms/47061,20250609011745,2025-06-17,city scrape,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",,https://a0.muscache.com/pictures/268343/a08ce2...,211696,...,4.9,4.85,4.63,0363 1266 8C04 4133 E6AC,f,1,1,0,0,1.13


#### 3.2 Evaluating for Missing Data

In [8]:
missing_data = df.isnull()
missing_data

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10163,False,False,False,False,False,False,False,True,False,False,...,True,True,True,False,False,False,False,False,False,True
10164,False,False,False,False,False,False,False,True,False,False,...,True,True,True,False,False,False,False,False,False,True
10165,False,False,False,False,False,False,False,True,False,False,...,True,True,True,False,False,False,False,False,False,True
10166,False,False,False,False,False,False,False,True,False,False,...,True,True,True,False,False,False,False,False,False,True


#### 3.3 Count missing values in each column

In [9]:
for column in missing_data.columns:
    print(missing_data[column].value_counts())
    print()

id
False    10168
Name: count, dtype: int64

listing_url
False    10168
Name: count, dtype: int64

scrape_id
False    10168
Name: count, dtype: int64

last_scraped
False    10168
Name: count, dtype: int64

source
False    10168
Name: count, dtype: int64

name
False    10168
Name: count, dtype: int64

description
False    9859
True      309
Name: count, dtype: int64

neighborhood_overview
False    5258
True     4910
Name: count, dtype: int64

picture_url
False    10168
Name: count, dtype: int64

host_id
False    10168
Name: count, dtype: int64

host_url
False    10168
Name: count, dtype: int64

host_name
False    10164
True         4
Name: count, dtype: int64

host_since
False    10164
True         4
Name: count, dtype: int64

host_location
False    9024
True     1144
Name: count, dtype: int64

host_about
False    5372
True     4796
Name: count, dtype: int64

host_response_time
False    6640
True     3528
Name: count, dtype: int64

host_response_rate
False    6640
True     3528
Name: co
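
A more compact view of the same information is a single table of missing counts and percentages per column. The sketch below uses a toy frame with made-up values; `missing_summary` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Toy frame with gaps in two of its three columns.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "description": ["x", None, "z", "w"],
    "neighborhood_overview": [None, None, "n", None],
})

# One row per column: absolute and relative number of missing values.
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})

# Show only columns that have gaps, worst first.
missing_summary = (
    missing_summary[missing_summary["missing"] > 0]
    .sort_values("missing", ascending=False)
)
print(missing_summary)
```

Sorting by the number of missing values makes it easier to decide which columns to drop and which to impute.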

#### 3.4 Remove unnecessary columns with missing values

In [10]:
columns_to_drop = [
    "listing_url", "scrape_id", "last_scraped", "source",
    "neighborhood_overview", "picture_url", "host_id", "host_url",
    "host_name", "host_since", "host_location", "host_about", "host_response_time",
    "host_thumbnail_url", "host_picture_url", "host_neighbourhood",
    "host_identity_verified", "neighbourhood_group_cleansed",
    "calendar_updated", "has_availability", "availability_30", "availability_60",
    "availability_90", "availability_365", "calendar_last_scraped",
    "availability_eoy", "estimated_occupancy_l365d", "first_review", "last_review",
    "license", "calculated_host_listings_count",
    "calculated_host_listings_count_entire_homes",
    "calculated_host_listings_count_private_rooms",
    "calculated_host_listings_count_shared_rooms",
    "estimated_revenue_l365d", "reviews_per_month",
    "host_verifications", "latitude", "longitude",
]

df.drop(columns=columns_to_drop, errors="ignore", inplace=True)
df.head()

Unnamed: 0,id,name,description,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,neighbourhood,...,number_of_reviews_l30d,number_of_reviews_ly,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable
0,27886,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,100%,98%,t,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,1,26,4.92,4.9,4.94,4.95,4.92,4.9,4.78,f
1,28871,Comfortable double room,Basic bedroom in the center of Amsterdam.,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,9,96,4.88,4.9,4.87,4.94,4.94,4.94,4.84,f
2,29051,Comfortable single / double room,This room can also be rented as a single or a ...,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,7,88,4.81,4.88,4.83,4.93,4.92,4.87,4.79,f
3,44391,Quiet 2-bedroom Amsterdam city centre apartment,Guests greatly appreciate the unique location ...,,,f,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,0,0,4.71,4.68,4.49,4.95,4.9,4.68,4.5,f
4,47061,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",100%,50%,f,1.0,2.0,t,,...,1,6,4.77,4.78,4.61,4.76,4.9,4.85,4.63,f


#### 3.6 Replace missing values with the most frequent value (mode)

In [11]:
mode_columns = [
    'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
    'neighbourhood', 'neighbourhood_cleansed', 'property_type', 'room_type'
]

for col in mode_columns:
    if col in df.columns and not df[col].mode().empty:
        df[col] = df[col].fillna(df[col].mode()[0])

df.head(10)

Unnamed: 0,id,name,description,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,neighbourhood,...,number_of_reviews_l30d,number_of_reviews_ly,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable
0,27886,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,100%,98%,t,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,1,26,4.92,4.9,4.94,4.95,4.92,4.9,4.78,f
1,28871,Comfortable double room,Basic bedroom in the center of Amsterdam.,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,9,96,4.88,4.9,4.87,4.94,4.94,4.94,4.84,f
2,29051,Comfortable single / double room,This room can also be rented as a single or a ...,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,7,88,4.81,4.88,4.83,4.93,4.92,4.87,4.79,f
3,44391,Quiet 2-bedroom Amsterdam city centre apartment,Guests greatly appreciate the unique location ...,100%,100%,f,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,0,0,4.71,4.68,4.49,4.95,4.9,4.68,4.5,f
4,47061,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",100%,50%,f,1.0,2.0,t,"Amsterdam, Noord-Holland, Netherlands",...,1,6,4.77,4.78,4.61,4.76,4.9,4.85,4.63,f
5,48373,Cozy family home in Amsterdam South,Charming modern apartment in the quiet and gre...,100%,100%,f,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,0,3,5.0,5.0,5.0,5.0,5.0,4.6,5.0,f
6,49552,Multatuli Luxury Guest Suite in top location,Stylish & spacious 60m2 guest suite in Amsterd...,100%,92%,t,1.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,8,58,4.93,4.93,4.93,4.96,4.97,4.98,4.78,f
7,50263,Central de Lux 2 bedrooms (4p) apt 125 sqm,A beautiful 'De Lux' 125 sqm apartment for 4 a...,100%,91%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,2,7,4.85,4.91,4.81,4.82,4.76,4.65,4.74,f
8,50515,"Family Home (No drugs, smoking or parties)",This is a beautiful family home in a lovely pa...,100%,21%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,0,5,4.78,4.83,4.83,4.83,4.89,4.56,4.78,f
9,50523,B & B de 9 Straatjes (city center),B&B “De 9 Straatjes” – Your home in the heart ...,90%,99%,t,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,5,68,4.88,4.9,4.83,4.86,4.83,4.94,4.83,f
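
The output above shows rates stored as strings such as `100%` and flags stored as `t`/`f`. A minimal sketch of how these could be standardized into numeric and boolean types, on toy data; this conversion is a suggestion and not a step the notebook performs.

```python
import pandas as pd

# Toy data mirroring the string formats seen in the listing columns.
toy = pd.DataFrame({
    "host_response_rate": ["100%", "90%", None],
    "host_is_superhost": ["t", "f", None],
})

# "90%" -> 0.9; missing values stay NaN.
toy["host_response_rate"] = (
    toy["host_response_rate"].str.rstrip("%").astype(float) / 100
)

# "t"/"f" -> True/False; the nullable "boolean" dtype keeps missing values as <NA>.
toy["host_is_superhost"] = (
    toy["host_is_superhost"].map({"t": True, "f": False}).astype("boolean")
)

print(toy)
print(toy.dtypes)
```

Converting before imputing means a numeric fill such as the median becomes possible for the rate columns, instead of only the mode of the raw strings.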


#### 3.7 Replace missing values with the mean

In [12]:
mean_columns = [
    'review_scores_value', 'review_scores_location', 'review_scores_rating',
    'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
    'review_scores_communication'
]

for col in mean_columns:
    if col in df.columns:
        df[col] = df[col].fillna(df[col].mean())

df.head(10)


Unnamed: 0,id,name,description,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,neighbourhood,...,number_of_reviews_l30d,number_of_reviews_ly,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable
0,27886,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,100%,98%,t,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,1,26,4.92,4.9,4.94,4.95,4.92,4.9,4.78,f
1,28871,Comfortable double room,Basic bedroom in the center of Amsterdam.,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,9,96,4.88,4.9,4.87,4.94,4.94,4.94,4.84,f
2,29051,Comfortable single / double room,This room can also be rented as a single or a ...,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,7,88,4.81,4.88,4.83,4.93,4.92,4.87,4.79,f
3,44391,Quiet 2-bedroom Amsterdam city centre apartment,Guests greatly appreciate the unique location ...,100%,100%,f,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,0,0,4.71,4.68,4.49,4.95,4.9,4.68,4.5,f
4,47061,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",100%,50%,f,1.0,2.0,t,"Amsterdam, Noord-Holland, Netherlands",...,1,6,4.77,4.78,4.61,4.76,4.9,4.85,4.63,f
5,48373,Cozy family home in Amsterdam South,Charming modern apartment in the quiet and gre...,100%,100%,f,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,0,3,5.0,5.0,5.0,5.0,5.0,4.6,5.0,f
6,49552,Multatuli Luxury Guest Suite in top location,Stylish & spacious 60m2 guest suite in Amsterd...,100%,92%,t,1.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,8,58,4.93,4.93,4.93,4.96,4.97,4.98,4.78,f
7,50263,Central de Lux 2 bedrooms (4p) apt 125 sqm,A beautiful 'De Lux' 125 sqm apartment for 4 a...,100%,91%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,2,7,4.85,4.91,4.81,4.82,4.76,4.65,4.74,f
8,50515,"Family Home (No drugs, smoking or parties)",This is a beautiful family home in a lovely pa...,100%,21%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,0,5,4.78,4.83,4.83,4.83,4.89,4.56,4.78,f
9,50523,B & B de 9 Straatjes (city center),B&B “De 9 Straatjes” – Your home in the heart ...,90%,99%,t,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,5,68,4.88,4.9,4.83,4.86,4.83,4.94,4.83,f
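
Imputation hides which listings originally had no review scores. One possible variation, sketched here on toy data, records a missing-value flag before filling and uses the median instead of the mean. Both choices are alternatives to the notebook's approach, not a description of it.

```python
import numpy as np
import pandas as pd

# Toy review scores with two gaps.
toy = pd.DataFrame({"review_scores_rating": [4.9, np.nan, 4.5, np.nan, 5.0]})
col = "review_scores_rating"

# Record which rows were originally missing before overwriting them.
toy[f"{col}_missing"] = toy[col].isna()

# The median is less sensitive to skewed scores than the mean.
toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The extra flag column lets a model distinguish a listing with a real average score from one that simply has no reviews yet.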


#### 3.8 Replace missing host_response_rate values with the mode

In [13]:
if 'host_response_rate' in df.columns and not df['host_response_rate'].mode().empty:
    df['host_response_rate'] = df['host_response_rate'].fillna(df['host_response_rate'].mode()[0])

df.head(10)

Unnamed: 0,id,name,description,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,neighbourhood,...,number_of_reviews_l30d,number_of_reviews_ly,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable
0,27886,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,100%,98%,t,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,1,26,4.92,4.9,4.94,4.95,4.92,4.9,4.78,f
1,28871,Comfortable double room,Basic bedroom in the center of Amsterdam.,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,9,96,4.88,4.9,4.87,4.94,4.94,4.94,4.84,f
2,29051,Comfortable single / double room,This room can also be rented as a single or a ...,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,7,88,4.81,4.88,4.83,4.93,4.92,4.87,4.79,f
3,44391,Quiet 2-bedroom Amsterdam city centre apartment,Guests greatly appreciate the unique location ...,100%,100%,f,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,0,0,4.71,4.68,4.49,4.95,4.9,4.68,4.5,f
4,47061,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",100%,50%,f,1.0,2.0,t,"Amsterdam, Noord-Holland, Netherlands",...,1,6,4.77,4.78,4.61,4.76,4.9,4.85,4.63,f
5,48373,Cozy family home in Amsterdam South,Charming modern apartment in the quiet and gre...,100%,100%,f,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,0,3,5.0,5.0,5.0,5.0,5.0,4.6,5.0,f
6,49552,Multatuli Luxury Guest Suite in top location,Stylish & spacious 60m2 guest suite in Amsterd...,100%,92%,t,1.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,8,58,4.93,4.93,4.93,4.96,4.97,4.98,4.78,f
7,50263,Central de Lux 2 bedrooms (4p) apt 125 sqm,A beautiful 'De Lux' 125 sqm apartment for 4 a...,100%,91%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,2,7,4.85,4.91,4.81,4.82,4.76,4.65,4.74,f
8,50515,"Family Home (No drugs, smoking or parties)",This is a beautiful family home in a lovely pa...,100%,21%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,0,5,4.78,4.83,4.83,4.83,4.89,4.56,4.78,f
9,50523,B & B de 9 Straatjes (city center),B&B “De 9 Straatjes” – Your home in the heart ...,90%,99%,t,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,5,68,4.88,4.9,4.83,4.86,4.83,4.94,4.83,f


#### 3.9 Drop rows with missing values

In [14]:
essential_columns = [
    'price', 'neighbourhood', 'neighbourhood_cleansed', 'property_type',
    'bathrooms_text', 'host_has_profile_pic', 'bedrooms', 'beds'
]

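# Drop rows that lack a value in any essential column present in the frame.
present_columns = [col for col in essential_columns if col in df.columns]
df = df.dropna(subset=present_columns)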

df.head(10)

Unnamed: 0,id,name,description,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,neighbourhood,...,number_of_reviews_l30d,number_of_reviews_ly,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable
0,27886,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,100%,98%,t,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,1,26,4.92,4.9,4.94,4.95,4.92,4.9,4.78,f
1,28871,Comfortable double room,Basic bedroom in the center of Amsterdam.,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,9,96,4.88,4.9,4.87,4.94,4.94,4.94,4.84,f
2,29051,Comfortable single / double room,This room can also be rented as a single or a ...,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,7,88,4.81,4.88,4.83,4.93,4.92,4.87,4.79,f
4,47061,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",100%,50%,f,1.0,2.0,t,"Amsterdam, Noord-Holland, Netherlands",...,1,6,4.77,4.78,4.61,4.76,4.9,4.85,4.63,f
6,49552,Multatuli Luxury Guest Suite in top location,Stylish & spacious 60m2 guest suite in Amsterd...,100%,92%,t,1.0,2.0,t,"Amsterdam, North Holland, Netherlands",...,8,58,4.93,4.93,4.93,4.96,4.97,4.98,4.78,f
7,50263,Central de Lux 2 bedrooms (4p) apt 125 sqm,A beautiful 'De Lux' 125 sqm apartment for 4 a...,100%,91%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,2,7,4.85,4.91,4.81,4.82,4.76,4.65,4.74,f
8,50515,"Family Home (No drugs, smoking or parties)",This is a beautiful family home in a lovely pa...,100%,21%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,0,5,4.78,4.83,4.83,4.83,4.89,4.56,4.78,f
9,50523,B & B de 9 Straatjes (city center),B&B “De 9 Straatjes” – Your home in the heart ...,90%,99%,t,1.0,1.0,t,"Amsterdam, Noord-Holland, Netherlands",...,5,68,4.88,4.9,4.83,4.86,4.83,4.94,4.83,f
13,62015,"Charming, beautifully & sunny place",This beautiful apartment in one of the most li...,100%,0%,f,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",...,0,3,4.89,4.92,4.89,5.0,4.89,4.65,4.57,f
16,97221,Beautiful and spacious room,"Private room offered in elegant furnished, cle...",100%,98%,f,2.0,2.0,t,"Amsterdam, Noord-Holland, Netherlands",...,3,66,4.68,4.74,4.89,4.72,4.78,4.46,4.56,f
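
Dropping rows silently shrinks the dataset, so it can help to report how many rows are lost and which columns cause the loss. A minimal sketch on toy data, where the values and the shortened `essential` list are assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame: one row lacks a price, another lacks a bedroom count.
toy = pd.DataFrame({
    "price": ["$100.00", None, "$80.00", "$95.00"],
    "bedrooms": [1.0, 2.0, np.nan, 1.0],
    "beds": [1.0, 2.0, 1.0, 1.0],
})
essential = ["price", "bedrooms", "beds"]

rows_before = len(toy)

# Missing values per essential column, before anything is dropped.
missing_per_column = toy[essential].isna().sum()

cleaned = toy.dropna(subset=essential)
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```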


### 4.0 Searching for and removing duplicates

In [15]:
duplicate_rows = df[df.duplicated(keep=False)]
print(f"Number of duplicate rows: {duplicate_rows.shape[0]}")

# Keep the first occurrence of each fully identical row.
df = df.drop_duplicates()

Number of duplicate rows: 0
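
A full-row comparison only catches rows that are identical in every column. If `id` is assumed to identify a listing uniquely (an assumption, not verified here), a key-based check can also catch listings that appear twice with slightly different text. A sketch on toy data:

```python
import pandas as pd

# Toy frame: id 28871 appears twice with different names.
toy = pd.DataFrame({
    "id": [27886, 28871, 28871, 29051],
    "name": ["Houseboat", "Double room", "Double room (updated)", "Single room"],
})

# Full-row comparison finds nothing because the names differ.
exact_duplicates = int(toy.duplicated().sum())

# Key-based comparison flags the repeated id.
key_duplicates = int(toy.duplicated(subset=["id"]).sum())

# Keep the first row for each id.
deduplicated = toy.drop_duplicates(subset=["id"], keep="first").reset_index(drop=True)

print(exact_duplicates, key_duplicates)
print(deduplicated)
```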


### 5.0 Save cleaned data

#### 5.1 Save to CSV

In [16]:
CLEAN_CSV_PATH = "../data/clean/amsterdam_airbnbs_clean_data.csv"
df.to_csv(CLEAN_CSV_PATH, index=False)
print(f"Cleaned data saved to CSV: {CLEAN_CSV_PATH}")

Cleaned data saved to CSV: ../data/clean/amsterdam_airbnbs_clean_data.csv
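
`to_csv` fails if the target directory does not exist. The sketch below creates the directory first and then reloads the file as a quick round-trip check. It writes to a temporary directory, and the file name `toy_clean.csv` is a placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "price": [120.0, 95.5]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "clean" / "toy_clean.csv"

    # Create missing parent directories; do nothing if they already exist.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    toy.to_csv(out_path, index=False)

    # Reload to confirm the file contains what was written.
    reloaded = pd.read_csv(out_path)

round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```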


#### 5.2 Save to SQL database

In [17]:
config = DatabaseConfig(env_path=dotenv_path)
db = DatabaseConnection(config)

TABLE_NAME = "amsterdam_airbnbs_clean_data"

db.write_dataframe(df, TABLE_NAME, if_exists="replace")

DataFrame successfully saved to table: amsterdam_airbnbs_clean_data
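
`DatabaseConnection.write_dataframe` is a project-specific helper whose implementation is not shown here. Assuming it behaves like pandas' `DataFrame.to_sql`, the write and a row-count check could look like the sketch below, which uses an in-memory SQLite database instead of PostgreSQL so that it runs without credentials. The table name `toy_clean_data` is a placeholder.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "price": [120.0, 95.5]})

# In-memory SQLite stands in for the real database (assumption for illustration).
conn = sqlite3.connect(":memory:")
try:
    # if_exists="replace" drops and recreates the table, as in the cell above.
    toy.to_sql("toy_clean_data", conn, if_exists="replace", index=False)

    # Verify that every row arrived.
    row_count = int(
        pd.read_sql_query("SELECT COUNT(*) AS n FROM toy_clean_data", conn)["n"].iloc[0]
    )
finally:
    conn.close()

print(f"Rows written: {row_count}")
```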
