## New York City Airbnb

#### Dataset: New York City Airbnb

#### File used: New York City Airbnb.csv

#### Importing Initial Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


### Data Loading & Initial Exploration

In [3]:
data = pd.read_csv("New York City Airbnb.csv")


In [4]:
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [5]:
data.tail()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2
48894,36487245,Trendy duplex in the very heart of Hell's Kitchen,68119814,Christophe,Manhattan,Hell's Kitchen,40.76404,-73.98933,Private room,90,7,0,,,1,23


In [6]:
data.shape

(48895, 16)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

### Data Integrity Check

In [8]:
data.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


In [9]:
data.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [11]:
# Coerce to numeric; unparseable values become NaN (both columns are already int64 here)
data['price'] = pd.to_numeric(data['price'], errors='coerce')
data['minimum_nights'] = pd.to_numeric(data['minimum_nights'], errors='coerce')

In [12]:
# Enforce logical integrity: keep only listings with a positive price and a positive minimum stay
data = data[data['price'] > 0]
data = data[data['minimum_nights'] > 0]
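
Before filtering, it can help to count how many rows each rule would remove. The following is a minimal sketch on a toy frame; the values and the names `toy`, `bad_price`, `bad_nights`, and `valid` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy listings; not rows from the real dataset
toy = pd.DataFrame({
    "price": [149, 0, 225, 80],
    "minimum_nights": [1, 3, 0, 10],
})

# Boolean masks marking rows that violate each rule
bad_price = toy["price"] <= 0
bad_nights = toy["minimum_nights"] <= 0

print("rows with non-positive price:", int(bad_price.sum()))
print("rows with non-positive minimum_nights:", int(bad_nights.sum()))

# Keep only rows that pass both rules
valid = toy[~(bad_price | bad_nights)]
print("rows kept:", len(valid))
```

Counting first makes the effect of each rule visible, so an unexpectedly large drop can be investigated before the rows are gone.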

In [13]:
# Standardize formats (consistency)
if "room_type" in data.columns:
    data["room_type"] = data["room_type"].str.strip().str.lower()
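
To see why stripping and lower-casing matters, here is a small sketch on invented labels (the series `raw` is hypothetical, not data from the file). Variants that differ only in case or surrounding whitespace collapse into one category.

```python
import pandas as pd

# Hypothetical labels with inconsistent case and whitespace
raw = pd.Series(["Private room", " private room ", "Entire home/apt", "ENTIRE HOME/APT"])

# Same normalization as in the cell above: trim, then lower-case
clean = raw.str.strip().str.lower()

print("distinct labels before:", raw.nunique())
print("distinct labels after:", clean.nunique())
```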

### Handling Missing Values

In [15]:
data.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10051
reviews_per_month                 10051
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [16]:
data = data.fillna({'name': 'N/A'})

In [None]:
data = data.drop(columns=['id', 'host_name', 'last_review'])

In [23]:
data = data.fillna({'reviews_per_month': 0})
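
Filling `reviews_per_month` with 0 is reasonable only if the missing rates belong to listings that have no reviews at all, as the rows shown by `data.head()` suggest. A minimal sketch of that check, on a toy frame with invented values (`toy` and `fill_is_safe` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; not rows from the real dataset
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "reviews_per_month": [0.21, np.nan, 0.38, np.nan],
})

missing_rate = toy["reviews_per_month"].isna()
no_reviews = toy["number_of_reviews"] == 0

# True when every row with a missing rate is a listing with zero reviews
fill_is_safe = bool(no_reviews[missing_rate].all())

if fill_is_safe:
    toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)

print(fill_is_safe)
```

If the check came back `False` on the real data, a blanket fill with 0 would understate activity for some listings and another strategy would be worth considering.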

In [24]:
data.isnull().sum()

name                              0
host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

### Duplicate Removal

In [25]:
data.duplicated().sum()



np.int64(0)

In [26]:
data = data.drop_duplicates()
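
Because `id` was dropped earlier, `duplicated()` compares rows on the remaining columns only. The sketch below, on an invented toy frame, shows how one might inspect duplicate groups before removing them; `toy`, `all_copies`, and `deduped` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy listings; the first two rows are identical
toy = pd.DataFrame({
    "host_id": [2787, 2787, 2845],
    "name": ["Room A", "Room A", "Loft B"],
    "price": [149, 149, 225],
})

# Rows that repeat an earlier row (the first occurrence is not counted)
n_dupes = int(toy.duplicated().sum())

# Every member of a duplicate group, useful for manual inspection
all_copies = toy[toy.duplicated(keep=False)]

# Keep the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(all_copies), len(deduped))
```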

### Standardization (Formatting & Consistency)

In [27]:
# Standardize price (remove currency symbols if any); raw string avoids an invalid-escape warning
data['price'] = data['price'].replace(r'[\$,]', '', regex=True).astype(float)
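
The `price` column in this file is already `int64`, so the regex has nothing to strip here and the cell effectively only casts to float. The sketch below shows what the same pattern would do if prices arrived as strings; the series `raw` is invented for illustration.

```python
import pandas as pd

# Hypothetical string-typed prices with currency symbols and thousands separators
raw = pd.Series(["$1,200", "89", "$45.50"])

# Remove "$" and "," and convert the result to float
clean = raw.str.replace(r"[\$,]", "", regex=True).astype(float)

print(clean.tolist())
```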

In [28]:
# Standardize text columns
data['room_type'] = data['room_type'].str.lower().str.strip()
data['neighbourhood'] = data['neighbourhood'].str.title()

In [29]:
# Rename columns for consistency
data.columns = data.columns.str.lower().str.replace(" ", "_")

### Outlier Detection & Treatment

#### Using IQR Method

In [32]:
Q1 = data['price'].quantile(0.25)
Q3 = data['price'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

data = data[(data['price'] >= lower) & (data['price'] <= upper)]
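
The IQR rule keeps values inside the fences Q1 - 1.5·IQR and Q3 + 1.5·IQR. As a sketch, the same steps can be wrapped in a small helper; `iqr_bounds` and the toy `prices` series are hypothetical and not part of the notebook.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical prices with one obvious outlier
prices = pd.Series([60, 70, 80, 90, 100, 110, 120, 1000])

lower, upper = iqr_bounds(prices)

# between() is inclusive on both ends, matching the >= and <= used above
kept = prices[prices.between(lower, upper)]

print(lower, upper, len(kept))
```

For reference, plugging in the quartiles that `data.describe()` reported before any filtering (69 and 175) gives an IQR of 106 and fences of -90 and 334. The quartiles at this step may differ slightly because rows were removed in between.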


In [33]:
from scipy import stats

# Second pass: drop listings whose price is 3 or more standard deviations from the mean
data = data[(np.abs(stats.zscore(data['price'])) < 3)]
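
A z-score is the distance from the mean measured in standard deviations, z = (x - mean) / std. The sketch below computes it by hand with NumPy on invented prices, using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`. The array `prices` is hypothetical.

```python
import numpy as np

# Hypothetical prices: 19 ordinary listings and one extreme value
prices = np.concatenate([np.full(19, 100.0), [1000.0]])

# np.std uses ddof=0 by default, i.e. the population standard deviation
z = (prices - prices.mean()) / prices.std()

# Keep values strictly less than 3 standard deviations from the mean
kept = prices[np.abs(z) < 3]

print(round(float(z[-1]), 3), kept.size)
```

Note that with `ddof=0` a single value in a sample of n points can never have a z-score above sqrt(n - 1), so a threshold of 3 removes nothing in samples of 10 points or fewer.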

### Final Clean Dataset Check

In [34]:
data.info()
data.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
Index: 45730 entries, 0 to 48894
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   name                            45730 non-null  object 
 1   host_id                         45730 non-null  int64  
 2   neighbourhood_group             45730 non-null  object 
 3   neighbourhood                   45730 non-null  object 
 4   latitude                        45730 non-null  float64
 5   longitude                       45730 non-null  float64
 6   room_type                       45730 non-null  object 
 7   price                           45730 non-null  float64
 8   minimum_nights                  45730 non-null  int64  
 9   number_of_reviews               45730 non-null  int64  
 10  reviews_per_month               45730 non-null  float64
 11  calculated_host_listings_count  45730 non-null  int64  
 12  availability_365                45730

name                              0
host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

### Before vs After Cleaning

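| Metric         | Before                                                                            | After                                |
| -------------- | --------------------------------------------------------------------------------- | ------------------------------------ |
| Rows           | 48,895                                                                            | 45,730                               |
| Columns        | 16                                                                                | 13                                   |
| Missing values | name (16), host_name (21), last_review (10,051), reviews_per_month (10,051)       | 0                                    |
| Duplicates     | 0 found                                                                           | 0                                    |
| Price outliers | Present (max price 10,000)                                                        | Removed (IQR rule, then z-score < 3) |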
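
A comparison like the one in the table can also be computed rather than written by hand. The following is a rough sketch on an invented toy frame; `summarize`, `before`, `after`, and `report` are hypothetical names, and the cleaning chain is a shortened stand-in for the steps in this notebook.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect a few size and quality metrics for a DataFrame."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical raw data: one duplicate row, one zero price, two missing rates
before = pd.DataFrame({
    "price": [100, 100, 0, 250],
    "reviews_per_month": [1.0, 1.0, np.nan, np.nan],
})

# Shortened cleaning chain: drop zero prices, fill missing rates, drop duplicates
after = (
    before[before["price"] > 0]
    .fillna({"reviews_per_month": 0})
    .drop_duplicates()
)

# Side-by-side report with one row per metric
report = pd.DataFrame({"Before": summarize(before), "After": summarize(after)})
print(report)
```

Taking the `before` snapshot right after loading the CSV keeps the comparison honest, since every later step modifies `data`.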


### Description

Cleaned and standardized the NYC Airbnb listing data using Python (pandas, NumPy, and SciPy): validated data integrity, handled missing values, checked for duplicates (none were found), and removed price outliers. The result is an analysis-ready dataset of 45,730 listings and 13 columns with no missing values.

## Completed