# Airbnb Open Data - Professional Data Cleaning Pipeline

## Overview
This notebook provides a comprehensive data cleaning pipeline for Airbnb Open Data. The cleaning process includes:
- Data loading and initial exploration
- Missing value analysis and treatment
- Duplicate removal
- Data type conversions
- Feature selection and engineering
- Data validation and quality checks

## Dataset Information
- **Source**: Airbnb Open Data
- **File**: `Airbnb_Open_Data.csv`
- **Purpose**: Clean and prepare data for analysis and modeling

---

## 1. Library Imports and Configuration

Setting up the necessary libraries and pandas display options for optimal data exploration.

In [1]:
# Import essential libraries
import pandas as pd
import numpy as np
import warnings
import os
from pathlib import Path

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")

✓ Libraries imported successfully
✓ Pandas version: 2.2.3
✓ NumPy version: 2.2.3


## 2. Data Loading and Initial Exploration

Loading the dataset and performing initial exploration to understand its structure and content.

In [2]:
df = pd.read_csv('Airbnb_Open_Data.csv')
df

Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,country code,instant_bookable,cancellation_policy,room type,Construction year,price,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,False,strict,Private room,2020.0,$966,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like your home to be treated. No smoking.,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,False,moderate,Entire home/apt,2007.0,$142,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the pet you are planning on bringing with you is OK. ...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.94190,United States,US,True,flexible,Private room,2005.0,$620,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and laundry facilities. There is no additional charge...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,US,True,moderate,Entire home/apt,2005.0,$368,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,False,moderate,Entire home/apt,2009.0,$204,$41,10.0,9.0,11/19/2018,0.10,3.0,1.0,289.0,"Please no smoking in the house, porch or on the property (you can go to the nearby corner). Rea...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102594,6092437,Spare room in Williamsburg,12312296767,verified,Krik,Brooklyn,Williamsburg,40.70862,-73.94651,United States,US,False,flexible,Private room,2003.0,$844,$169,1.0,0.0,,,3.0,1.0,227.0,No Smoking No Parties or Events of any kind Please take out your shoes when you arrive. Remember...,
102595,6092990,Best Location near Columbia U,77864383453,unconfirmed,Mifan,Manhattan,Morningside Heights,40.80460,-73.96545,United States,US,True,moderate,Private room,2016.0,$837,$167,1.0,1.0,7/6/2015,0.02,2.0,2.0,395.0,House rules: Guests agree to the following terms and conditions 1.Guest(s) agree to NO PARTIES a...,
102596,6093542,"Comfy, bright room in Brooklyn",69050334417,unconfirmed,Megan,Brooklyn,Park Slope,40.67505,-73.98045,United States,US,True,moderate,Private room,2009.0,$988,$198,3.0,0.0,,,5.0,1.0,342.0,,
102597,6094094,Big Studio-One Stop from Midtown,11160591270,unconfirmed,Christopher,Queens,Long Island City,40.74989,-73.93777,United States,US,True,strict,Entire home/apt,2015.0,$546,$109,2.0,5.0,10/11/2015,0.10,3.0,1.0,386.0,,
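
The next cell guards its analysis with `if df is not None`, which only helps if the loading step can actually return `None`. As written, `pd.read_csv` raises an exception when the file is missing. A minimal sketch of a guarded loader follows; the function name `load_dataset` and the variable `df_checked` are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_dataset(path: str) -> Optional[pd.DataFrame]:
    """Return the CSV at `path` as a DataFrame, or None if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        print(f"File not found: {csv_path}")
        return None
    # low_memory=False parses the file in one pass, which avoids
    # chunk-by-chunk dtype inference producing mixed-type columns.
    return pd.read_csv(csv_path, low_memory=False)


# Hypothetical usage: df_checked is None when the file is absent
df_checked = load_dataset("Airbnb_Open_Data.csv")
```

With a loader like this, the `if df is not None` check in the overview cell becomes meaningful instead of always being true.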


### 2.1 Dataset Overview

In [3]:
# Comprehensive dataset overview
if df is not None:
    print("Dataset Basic Information:")
    print("=" * 50)
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.shape[1]}")
    print(f"Rows: {df.shape[0]}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print("\nColumn names:")
    print("-" * 30)
    for i, col in enumerate(df.columns, 1):
        print(f"{i:2d}. {col}")
    
    print("\nData types:")
    print("-" * 30)
    print(df.dtypes)
    
    print("\nBasic statistics:")
    print("-" * 30)
    display(df.describe(include='all'))
    
    print("\nMissing values:")
    print("-" * 30)
    missing_values = df.isnull().sum()
    missing_percentage = (missing_values / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing Count': missing_values,
        'Missing Percentage': missing_percentage
    })
    missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
    display(missing_df)
else:
    print("Cannot perform analysis: Dataset not loaded properly.")

Dataset Basic Information:
Shape: (102599, 26)
Columns: 26
Rows: 102599
Memory usage: 104.69 MB

Column names:
------------------------------
 1. id
 2. NAME
 3. host id
 4. host_identity_verified
 5. host name
 6. neighbourhood group
 7. neighbourhood
 8. lat
 9. long
10. country
11. country code
12. instant_bookable
13. cancellation_policy
14. room type
15. Construction year
16. price
17. service fee
18. minimum nights
19. number of reviews
20. last review
21. reviews per month
22. review rate number
23. calculated host listings count
24. availability 365
25. house_rules
26. license

Data types:
------------------------------
id                                  int64
NAME                               object
host id                             int64
host_identity_verified             object
host name                          object
neighbourhood group                object
neighbourhood                      object
lat                               float64
long                        

Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,country code,instant_bookable,cancellation_policy,room type,Construction year,price,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
count,102599.0,102349,102599.0,102310,102193,102570,102583,102591.0,102591.0,102067,102468,102494,102523,102599,102385.0,102352,102326,102190.0,102416.0,86706,86720.0,102273.0,102280.0,102151.0,50468,2
unique,,61281,,2,13190,7,224,,,1,1,2,3,4,,1151,231,,,2477,,,,,1976,1
top,,Home away from home,,unconfirmed,Michael,Manhattan,Bedford-Stuyvesant,,,United States,US,False,moderate,Entire home/apt,,$206,$41,,,6/23/2019,,,,,#NAME?,41662/AL
freq,,33,,51200,881,43792,7937,,,102067,102468,51474,34343,53701,,137,526,,,2443,,,,,2712,2
mean,29146230.0,,49254110000.0,,,,,40.728094,-73.949644,,,,,,2012.487464,,,8.135845,27.483743,,1.374022,3.279106,7.936605,141.133254,,
std,16257510.0,,28539000000.0,,,,,0.055857,0.049521,,,,,,5.765556,,,30.553781,49.508954,,1.746621,1.284657,32.21878,135.435024,,
min,1001254.0,,123600500.0,,,,,40.49979,-74.24984,,,,,,2003.0,,,-1223.0,0.0,,0.01,1.0,1.0,-10.0,,
25%,15085810.0,,24583330000.0,,,,,40.68874,-73.98258,,,,,,2007.0,,,2.0,1.0,,0.22,2.0,1.0,3.0,,
50%,29136600.0,,49117740000.0,,,,,40.72229,-73.95444,,,,,,2012.0,,,3.0,7.0,,0.74,3.0,1.0,96.0,,
75%,43201200.0,,73996500000.0,,,,,40.76276,-73.93235,,,,,,2017.0,,,5.0,30.0,,2.0,4.0,2.0,269.0,,



Missing values:
------------------------------


Column,Missing Count,Missing Percentage
license,102597,99.998051
house_rules,52131,50.810437
last review,15893,15.490404
reviews per month,15879,15.476759
country,532,0.518524
availability 365,448,0.436651
minimum nights,409,0.398639
host name,406,0.395715
review rate number,326,0.317742
calculated host listings count,319,0.310919


## 3. Data Quality Assessment

This section takes a closer look at data quality, focusing on per-column non-null counts, data types, and missing values. Duplicates are checked later, in Section 5.2.

In [4]:
# Detailed data info
print("Detailed Data Information:")
print("=" * 50)
df.info()

print("\nMissing Values Summary:")
print("=" * 50)
missing_summary = df.isnull().sum()
print(missing_summary[missing_summary > 0].sort_values(ascending=False))

Detailed Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              102599 non-null  int64  
 1   NAME                            102349 non-null  object 
 2   host id                         102599 non-null  int64  
 3   host_identity_verified          102310 non-null  object 
 4   host name                       102193 non-null  object 
 5   neighbourhood group             102570 non-null  object 
 6   neighbourhood                   102583 non-null  object 
 7   lat                             102591 non-null  float64
 8   long                            102591 non-null  float64
 9   country                         102067 non-null  object 
 10  country code                    102468 non-null  object 
 11  instant_bookable                102494 non-null  ob

## 4. Feature Selection and Engineering

Based on the data exploration, we'll select relevant features and drop columns that either have a high proportion of missing values or are not needed for this analysis.

In [5]:
# Define columns to drop and keep based on analysis
columns_to_drop = [
    'id',                               # Unique identifier, not useful for analysis
    'reviews per month',                # ~15% missing
    'review rate number',               # ~0.3% missing; dropped as a modeling choice
    'calculated host listings count',   # ~0.3% missing; dropped as a modeling choice
    'availability 365',                 # ~0.4% missing; dropped as a modeling choice
    'house_rules',                      # Free-text field, ~51% missing
    'license'                           # ~100% missing
]

columns_to_keep = [
    'NAME', 'host id', 'host_identity_verified', 'host name',
    'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
    'country code', 'instant_bookable', 'cancellation_policy', 'room type',
    'Construction year', 'price', 'service fee', 'minimum nights',
    'number of reviews', 'last review'
]

print(f"Columns to keep: {len(columns_to_keep)}")
print(f"Columns to drop: {len(columns_to_drop)}")
print(f"Total original columns: {len(df.columns)}")

# Display the columns we're keeping
print("\nColumns being retained:")
for i, col in enumerate(columns_to_keep, 1):
    print(f"{i:2d}. {col}")

Columns to keep: 19
Columns to drop: 7
Total original columns: 26

Columns being retained:
 1. NAME
 2. host id
 3. host_identity_verified
 4. host name
 5. neighbourhood group
 6. neighbourhood
 7. lat
 8. long
 9. country
10. country code
11. instant_bookable
12. cancellation_policy
13. room type
14. Construction year
15. price
16. service fee
17. minimum nights
18. number of reviews
19. last review
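
The drop list above is written by hand. When the criterion is the share of missing values, the candidates can also be derived from the data. The sketch below is one way to do that; the helper name `columns_above_missing_threshold`, the toy frame, and the 0.4 threshold are all assumptions chosen for illustration.

```python
import pandas as pd


def columns_above_missing_threshold(frame: pd.DataFrame, threshold: float) -> list:
    """Return columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = frame.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


# Hypothetical toy frame: 'license' is fully missing, 'house_rules' half missing
toy = pd.DataFrame({
    "price": ["$966", "$142", "$620", "$368"],
    "house_rules": ["No smoking", None, None, "Pets ok"],
    "license": [None, None, None, None],
})

high_missing = columns_above_missing_threshold(toy, threshold=0.4)
print(high_missing)
```

Columns dropped for other reasons, such as identifiers, would still need to be listed explicitly.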


In [6]:
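# Apply feature selection (.copy() gives an independent frame, so the
# in-place operations in later cells do not act on a slice of the original)
df = df[columns_to_keep].copy()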

print(f"✓ Feature selection applied")
print(f"✓ New dataset shape: {df.shape}")
print(f"✓ Memory usage after selection: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

✓ Feature selection applied
✓ New dataset shape: (102599, 19)
✓ Memory usage after selection: 77.97 MB


## 5. Data Cleaning and Transformation

This section handles the core cleaning tasks: column renaming, duplicate removal, missing value treatment, and data type conversions.

### 5.1 Column Renaming and Initial Cleaning

In [7]:
# Rename columns to follow naming conventions
column_rename_map = {
    'NAME': 'name'  # Convert to lowercase for consistency
}

df.rename(columns=column_rename_map, inplace=True)

print("✓ Column renaming completed")
print(f"✓ Renamed columns: {list(column_rename_map.keys())}")

# Display first few rows to verify changes
display(df.head())

✓ Column renaming completed
✓ Renamed columns: ['NAME']


Unnamed: 0,name,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,country code,instant_bookable,cancellation_policy,room type,Construction year,price,service fee,minimum nights,number of reviews,last review
0,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,False,strict,Private room,2020.0,$966,$193,10.0,9.0,10/19/2021
1,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,False,moderate,Entire home/apt,2007.0,$142,$28,30.0,45.0,5/21/2022
2,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,US,True,flexible,Private room,2005.0,$620,$124,3.0,0.0,
3,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,US,True,moderate,Entire home/apt,2005.0,$368,$74,30.0,270.0,7/5/2019
4,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,False,moderate,Entire home/apt,2009.0,$204,$41,10.0,9.0,11/19/2018


### 5.2 Duplicate Removal

In [8]:
# Check for duplicates
print("Duplicate Analysis:")
print("=" * 30)
duplicate_count = df.duplicated().sum()
print(f"Total duplicates found: {duplicate_count}")
print(f"Percentage of duplicates: {(duplicate_count / len(df)) * 100:.2f}%")

# Remove duplicates
initial_shape = df.shape
df.drop_duplicates(inplace=True)
final_shape = df.shape

print(f"\n✓ Duplicates removed")
print(f"✓ Original shape: {initial_shape}")
print(f"✓ After duplicate removal: {final_shape}")
print(f"✓ Rows removed: {initial_shape[0] - final_shape[0]}")

# Reset index after dropping duplicates
df.reset_index(drop=True, inplace=True)
print("✓ Index reset completed")

Duplicate Analysis:
Total duplicates found: 541
Percentage of duplicates: 0.53%

✓ Duplicates removed
✓ Original shape: (102599, 19)
✓ After duplicate removal: (102058, 19)
✓ Rows removed: 541
✓ Index reset completed
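
Because `id` was removed during feature selection, `df.duplicated()` compares only the 19 retained columns. The number of duplicates found depends on which columns take part in the comparison. The sketch below illustrates this on a hypothetical three-row frame; it does not show what the count would have been on the real data with `id` still present.

```python
import pandas as pd

# Hypothetical listings: rows 0 and 1 differ only in 'id'
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Skylit Midtown Castle", "Skylit Midtown Castle", "BlissArtsSpace!"],
    "price": ["$142", "$142", "$1,060"],
})

# Comparing every column: the differing ids hide the repeated listing
dupes_all_columns = int(toy.duplicated().sum())

# Comparing only the descriptive columns: the repeat is detected
dupes_without_id = int(toy.duplicated(subset=["name", "price"]).sum())

# keep=False marks every member of a duplicate group, which helps inspection
duplicate_groups = toy[toy.duplicated(subset=["name", "price"], keep=False)]

print(dupes_all_columns, dupes_without_id)
print(duplicate_groups)
```

Inspecting `duplicate_groups` before calling `drop_duplicates` is a cheap way to confirm that the rows being removed are true repeats.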


### 5.3 Missing Value Treatment

In [9]:
# Analyze missing values before cleaning
print("Missing Values Analysis:")
print("=" * 40)
missing_before = df.isnull().sum()
missing_before_pct = (missing_before / len(df)) * 100

missing_analysis = pd.DataFrame({
    'Missing Count': missing_before,
    'Missing Percentage': missing_before_pct
})
missing_analysis = missing_analysis[missing_analysis['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if not missing_analysis.empty:
    display(missing_analysis)
else:
    print("No missing values found!")

# Display first few rows to understand data structure
print("\nDataset Preview:")
print("-" * 20)
display(df.head())

Missing Values Analysis:


Column,Missing Count,Missing Percentage
last review,15832,15.512748
country,532,0.521272
host name,404,0.395853
minimum nights,400,0.391934
host_identity_verified,289,0.283172
service fee,273,0.267495
name,250,0.244959
price,247,0.242019
Construction year,214,0.209685
number of reviews,183,0.17931



Dataset Preview:
--------------------


Unnamed: 0,name,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,country code,instant_bookable,cancellation_policy,room type,Construction year,price,service fee,minimum nights,number of reviews,last review
0,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,False,strict,Private room,2020.0,$966,$193,10.0,9.0,10/19/2021
1,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,False,moderate,Entire home/apt,2007.0,$142,$28,30.0,45.0,5/21/2022
2,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,US,True,flexible,Private room,2005.0,$620,$124,3.0,0.0,
3,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,US,True,moderate,Entire home/apt,2005.0,$368,$74,30.0,270.0,7/5/2019
4,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,False,moderate,Entire home/apt,2009.0,$204,$41,10.0,9.0,11/19/2018


In [10]:
# Strategy for handling missing values

# 1. Drop columns with too many missing values ('last review' is ~15.5% missing)
columns_to_drop_missing = ['last review']
for col in columns_to_drop_missing:
    if col in df.columns:
        df.drop(columns=[col], inplace=True)
        print(f"✓ Dropped column '{col}' due to high missing values")

# 2. Check remaining missing values
remaining_missing = df.isnull().sum()
print(f"\nRemaining missing values after column removal:")
print(remaining_missing[remaining_missing > 0])

# 3. Drop rows with any remaining missing values (data quality approach)
initial_rows = len(df)
df.dropna(inplace=True)
final_rows = len(df)
print(f"\n✓ Removed {initial_rows - final_rows} rows with missing values")
print(f"✓ Final dataset shape: {df.shape}")

# Verify no missing values remain
print(f"\n✓ Verification - Missing values remaining: {df.isnull().sum().sum()}")

✓ Dropped column 'last review' due to high missing values

Remaining missing values after column removal:
name                      250
host_identity_verified    289
host name                 404
neighbourhood group        29
neighbourhood              16
lat                         8
long                        8
country                   532
country code              131
instant_bookable          105
cancellation_policy        76
Construction year         214
price                     247
service fee               273
minimum nights            400
number of reviews         183
dtype: int64

✓ Removed 2716 rows with missing values
✓ Final dataset shape: (99342, 18)

✓ Verification - Missing values remaining: 0
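
The per-column counts above add up to 3,165 missing cells, yet only 2,716 rows were removed, because a single row can be missing several fields. The sketch below shows how to measure that on a hypothetical toy frame, and how `dropna(subset=...)` limits row loss to the columns that must be present. Treating only `price` as required is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical frame: row 1 is missing two fields, row 2 is missing one
toy = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "price": ["$10", None, "$30", "$40"],
    "host name": ["Ann", "Bob", None, "Dee"],
})

# Total missing cells versus rows that contain at least one missing cell
missing_cells = int(toy.isnull().sum().sum())
rows_with_any_missing = int(toy.isnull().any(axis=1).sum())

# Drop rows only when a required column is missing
required_columns = ["price"]  # assumption: only price is mandatory
subset_cleaned = toy.dropna(subset=required_columns)
fully_cleaned = toy.dropna()

print(missing_cells, rows_with_any_missing)
print(len(subset_cleaned), len(fully_cleaned))
```

Whether the stricter `dropna()` is appropriate depends on which columns the downstream analysis actually uses.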


### 5.4 Data Type Conversion and Feature Engineering

In [11]:
# Data type conversions and feature engineering

print("Data Type Conversions:")
print("=" * 30)

# 1. Convert instant_bookable to binary (0/1) with a vectorized comparison
df['instant_bookable'] = df['instant_bookable'].eq(True).astype(int)
print("✓ Converted instant_bookable to binary (0/1)")

# 2. Clean and convert price column
# regex=False makes the literal match on '$' explicit, independent of pandas version
print("\n✓ Cleaning price column...")
df['price'] = (
    df['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .str.replace(' ', '', regex=False)
    .astype(int)
)
print("✓ Price column cleaned and converted to integer")

# 3. Reset index after all transformations
df.reset_index(drop=True, inplace=True)
print("✓ Index reset after transformations")

# Verify the conversions
print(f"\n✓ Instant bookable data type: {df['instant_bookable'].dtype}")
print(f"✓ Price data type: {df['price'].dtype}")
print(f"✓ Final dataset shape: {df.shape}")

# Display sample of cleaned data
print("\nSample of cleaned data:")
display(df.head())

Data Type Conversions:
✓ Converted instant_bookable to binary (0/1)

✓ Cleaning price column...
✓ Price column cleaned and converted to integer
✓ Index reset after transformations

✓ Instant bookable data type: int64
✓ Price data type: int64
✓ Final dataset shape: (99342, 18)

Sample of cleaned data:


Unnamed: 0,name,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,country code,instant_bookable,cancellation_policy,room type,Construction year,price,service fee,minimum nights,number of reviews
0,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,0,strict,Private room,2020.0,966,$193,10.0,9.0
1,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,0,moderate,Entire home/apt,2007.0,142,$28,30.0,45.0
2,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,0,moderate,Entire home/apt,2009.0,204,$41,10.0,9.0
3,Large Cozy 1 BR Apartment In Midtown East,45498551794,verified,Michelle,Manhattan,Murray Hill,40.74767,-73.975,United States,US,1,flexible,Entire home/apt,2013.0,577,$115,3.0,74.0
4,BlissArtsSpace!,90821839709,unconfirmed,Emma,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,United States,US,0,moderate,Private room,2009.0,1060,$212,45.0,49.0
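
In the sample above, `service fee` still holds strings such as `$193`, since only `price` was converted. The same cleaning could be applied to it. The sketch below is one possible approach on hypothetical values; the helper name `currency_to_number` is an assumption, and `errors="coerce"` is a design choice that turns unparseable entries into `NaN` instead of raising.

```python
import pandas as pd


def currency_to_number(values: pd.Series) -> pd.Series:
    """Strip '$', ',' and surrounding spaces, then convert to numbers.

    Entries that cannot be parsed become NaN.
    """
    cleaned = (
        values.str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical service-fee values, including a missing and a malformed entry
fees = pd.Series(["$193", "$28", "$1,240 ", None, "n/a"], name="service fee")
fees_numeric = currency_to_number(fees)
print(fees_numeric)
```

In the notebook this would be applied as `df['service fee'] = currency_to_number(df['service fee'])`, assuming the helper is defined in an earlier cell.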


## 6. Final Data Validation and Summary

This section provides a final validation of the cleaned dataset and summary of the cleaning process.

In [12]:
# Final Data Quality Report
print("🔍 FINAL DATA QUALITY REPORT")
print("=" * 50)

# Basic information
print(f"📊 Dataset Shape: {df.shape}")
print(f"📈 Total Records: {df.shape[0]:,}")
print(f"📋 Total Features: {df.shape[1]}")
print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Data quality checks
print(f"\n✅ QUALITY CHECKS:")
print(f"   • Missing Values: {df.isnull().sum().sum()}")
print(f"   • Duplicate Rows: {df.duplicated().sum()}")
print(f"   • Data Types: {df.dtypes.nunique()} unique types")

# Column information
print(f"\n📝 FINAL COLUMNS ({len(df.columns)}):")
for i, col in enumerate(df.columns, 1):
    dtype = df[col].dtype
    non_null = df[col].count()
    print(f"   {i:2d}. {col:<25} | {str(dtype):<12} | {non_null:>6,} non-null")

# Sample statistics
print(f"\n📊 SAMPLE STATISTICS:")
print(f"   • Numeric columns: {df.select_dtypes(include=[np.number]).shape[1]}")
print(f"   • Object columns: {df.select_dtypes(include=['object']).shape[1]}")
print(f"   • Unique listing names: {df['name'].nunique():,}")

# Data integrity checks
print(f"\n🔒 DATA INTEGRITY:")
print(f"   • Price range: ${df['price'].min():,} - ${df['price'].max():,}")
print(f"   • Average price: ${df['price'].mean():.2f}")
print(f"   • Countries represented: {df['country'].nunique()}")

print(f"\n✅ Data cleaning completed successfully!")
print(f"   The dataset is now ready for analysis and modeling.")

# Display final sample
print(f"\n📋 FINAL DATASET SAMPLE:")
display(df.head(3))

🔍 FINAL DATA QUALITY REPORT
📊 Dataset Shape: (99342, 18)
📈 Total Records: 99,342
📋 Total Features: 18
💾 Memory Usage: 63.42 MB

✅ QUALITY CHECKS:
   • Missing Values: 0
   • Duplicate Rows: 0
   • Data Types: 3 unique types

📝 FINAL COLUMNS (18):
    1. name                      | object       | 99,342 non-null
    2. host id                   | int64        | 99,342 non-null
    3. host_identity_verified    | object       | 99,342 non-null
    4. host name                 | object       | 99,342 non-null
    5. neighbourhood group       | object       | 99,342 non-null
    6. neighbourhood             | object       | 99,342 non-null
    7. lat                       | float64      | 99,342 non-null
    8. long                      | float64      | 99,342 non-null
    9. country                   | object       | 99,342 non-null
   10. country code              | object       | 99,342 non-null
   11. instant_bookable          | int64        | 99,342 non-null
   12. cancellation_policy 

Unnamed: 0,name,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,country code,instant_bookable,cancellation_policy,room type,Construction year,price,service fee,minimum nights,number of reviews
0,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,0,strict,Private room,2020.0,966,$193,10.0,9.0
1,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,0,moderate,Entire home/apt,2007.0,142,$28,30.0,45.0
2,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,0,moderate,Entire home/apt,2009.0,204,$41,10.0,9.0


## 7. Export Cleaned Data (Optional)

Uncomment the export lines in the cell below to save the cleaned dataset to a CSV file.

In [13]:
# Export cleaned dataset
# Uncomment the lines below to save the cleaned data

from datetime import datetime

# Create timestamped filename
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_filename = f"Airbnb_Cleaned_{timestamp}.csv"

# Export to CSV with index=False to avoid unnecessary index column
# df.to_csv(output_filename, index=False)
# print(f"✓ Cleaned dataset exported to: {output_filename}")

# Export summary statistics
# summary_filename = f"Airbnb_Summary_{timestamp}.txt"
# with open(summary_filename, 'w') as f:
#     f.write(f"Airbnb Dataset Cleaning Summary\n")
#     f.write(f"Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
#     f.write(f"Original shape: Not tracked\n")
#     f.write(f"Final shape: {df.shape}\n")
#     f.write(f"Missing values: {df.isnull().sum().sum()}\n")
#     f.write(f"Duplicates: {df.duplicated().sum()}\n")

print("Export code ready. Uncomment the lines above to save files.")

Export code ready. Uncomment the lines above to save files.
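
The notebook imports `Path` but does not use it. One way to make the export step reusable is to wrap it in a small function and confirm that the file reads back with the same shape. The sketch below is a minimal version under stated assumptions: the function name `export_with_timestamp` is hypothetical, the DataFrame is a one-row stand-in, and a temporary directory replaces the real output location.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd


def export_with_timestamp(frame: pd.DataFrame, out_dir: Path, prefix: str) -> Path:
    """Write `frame` to <out_dir>/<prefix>_<timestamp>.csv and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_path = out_dir / f"{prefix}_{timestamp}.csv"
    frame.to_csv(out_path, index=False)  # index=False avoids an extra index column
    return out_path


# Hypothetical stand-in for the cleaned DataFrame
cleaned = pd.DataFrame({"name": ["Skylit Midtown Castle"], "price": [142]})

with tempfile.TemporaryDirectory() as tmp:
    saved_path = export_with_timestamp(cleaned, Path(tmp), "Airbnb_Cleaned")
    reloaded = pd.read_csv(saved_path)
    # Round-trip check: same number of rows and columns after reloading
    round_trip_ok = reloaded.shape == cleaned.shape

print(saved_path.name, round_trip_ok)
```

A shape comparison is only a coarse check; column dtypes can still change on reload and may need to be restored explicitly.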
