## EDA & Cleaning on Airbnb NYC Listings Dataset

### Import Library

In [1]:
import pandas as pd

### Load Dataset

In [2]:
df = pd.read_csv('airbnb.csv')

### Show first 5 rows

In [3]:
print("First 5 rows:")
df.head()

First 5 rows:


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


### Shape, Info & Null Check

In [4]:
print("Shape of dataset:")
df.shape

Shape of dataset:


(48895, 16)

In [5]:
print("Basic Info:")
df.info()

Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
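 12  last_review                     38843 non-null  object 
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64  
 15  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(7), object(6)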

In [6]:
print("Null Check:")
df.isnull().sum()

Null Check:


id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
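Raw null counts are easier to judge as a share of all rows. In the output above, `last_review` and `reviews_per_month` are each missing in 10052 of 48895 rows, roughly 20.6%. The sketch below shows one way to compute that share per column. It runs on a small made-up frame (`sample`, with invented values) rather than the real `df`, so the numbers it prints are only illustrative.

```python
import pandas as pd

# Toy stand-in for the listings frame; all values are invented for illustration.
sample = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "last_review": ["2019-05-21", None, None, "2018-11-19"],
    "reviews_per_month": [0.38, None, None, 0.10],
    "price": [225, 150, 89, 80],
})

# The mean of a boolean mask is the fraction of missing entries per column.
missing_pct = sample.isnull().mean().mul(100).round(1)

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Applied to the notebook's `df`, the same two lines would replace `sample` with `df`.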

### Column Cleaning

In [7]:
# Drop identifier, free-text, and date columns that this analysis does not use
df.drop(["id", "name", "host_name", "last_review"], axis=1, inplace=True)

In [8]:
# Fill missing 'reviews_per_month' with 0 (a missing value is taken to mean the listing has no reviews)
df.fillna({"reviews_per_month": 0}, inplace=True)
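Filling with 0 assumes that a missing `reviews_per_month` always corresponds to a listing with no reviews. The matching null counts for `last_review` and `reviews_per_month` (10052 each) suggest this, but it can be checked before filling. The following is a minimal sketch on a made-up frame (`sample` and its values are invented); it fills only when the assumption holds.

```python
import pandas as pd

# Toy data: two listings have no reviews and therefore no monthly rate.
sample = pd.DataFrame({
    "number_of_reviews": [9, 45, 0, 270, 0],
    "reviews_per_month": [0.21, 0.38, None, 4.64, None],
})

# A missing rate should occur exactly where the review count is zero.
missing_rate = sample["reviews_per_month"].isnull()
no_reviews = sample["number_of_reviews"].eq(0)
consistent = bool((missing_rate == no_reviews).all())

# Only fill with 0 when the assumption is confirmed by the data.
if consistent:
    sample = sample.fillna({"reviews_per_month": 0})

print(consistent, sample["reviews_per_month"].tolist())
```

On the real data this check would need to run before `number_of_reviews` or `reviews_per_month` is modified.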

### **Some questions**

1. Which neighbourhood group has the most listings?

In [9]:
df['neighbourhood_group'].value_counts()

neighbourhood_group
Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: count, dtype: int64
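The counts read more easily as shares of all listings. The sketch below rebuilds the counts from the output above as a plain Series (so it runs without the CSV) and converts them to percentages. On the actual frame, `df['neighbourhood_group'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "Manhattan": 21661,
        "Brooklyn": 20104,
        "Queens": 5666,
        "Bronx": 1091,
        "Staten Island": 373,
    },
    name="count",
)

# Convert each count to a percentage of all listings.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

Manhattan and Brooklyn together account for about 85% of the listings.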

2. Which type of room is most common?

In [10]:
df['room_type'].value_counts()

room_type
Entire home/apt    25409
Private room       22326
Shared room         1160
Name: count, dtype: int64

3. What are the minimum, average, and maximum prices?

In [11]:
df['price'].describe()

count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
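Two things stand out in this summary: the mean (about 152.7) is well above the median (106), which points to a right-skewed distribution, and the minimum is 0, which is unlikely to be a real nightly price. The sketch below shows one way to count and set aside zero-priced rows. It uses a made-up frame (`sample`), and whether such rows should be dropped from the real data is a judgment call this notebook does not make.

```python
import pandas as pd

# Toy prices; two entries are zero to mimic the suspicious minimum.
sample = pd.DataFrame({"price": [149, 0, 225, 89, 0, 80]})

# Flag rows whose price is zero or negative.
zero_price = sample["price"].le(0)
n_zero = int(zero_price.sum())

# Keep a copy of the rows with a usable price.
valid = sample.loc[~zero_price].copy()

print(n_zero, len(valid))
```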

4. How many missing values remain in each column after cleaning?

In [12]:
df.isnull().sum()

host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

5. What is the average number of reviews per month?

In [13]:
df['reviews_per_month'].mean()

np.float64(1.0909099089886491)
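This average includes the listings whose missing `reviews_per_month` was filled with 0 earlier, so it is lower than the average among listings that have at least one review. The sketch below illustrates the difference on a few made-up rates (the Series `rates` is invented). `Series.mean()` skips missing values by default, which is why the two results differ.

```python
import pandas as pd

# Toy monthly review rates; None marks a listing without reviews.
rates = pd.Series([0.21, 0.38, None, 4.64, 0.10])

# mean() ignores NaN, so this averages only the reviewed listings.
mean_reviewed_only = rates.mean()

# After filling with 0, unreviewed listings pull the average down.
mean_with_zeros = rates.fillna(0).mean()

print(mean_reviewed_only, mean_with_zeros)
```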

6. How many listings are priced above 500 (potential outliers)?

In [14]:
df[df['price'] > 500].shape

(1044, 12)
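The 500 cutoff is a fixed choice. A common alternative is Tukey's rule, which flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies it to the quartiles reported by `describe()` earlier (25% = 69, 75% = 175). The helper `iqr_fences` is a hypothetical name introduced here for illustration.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the price summary above.
q1, q3 = 69.0, 175.0
lower_fence, upper_fence = iqr_fences(q1, q3)

print(lower_fence, upper_fence)
```

With an IQR of 106, the upper fence is 175 + 1.5 × 106 = 334, so this rule would flag more listings than the 500 threshold. The lower fence is negative and therefore has no effect on prices.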

7. Which listings have the most reviews?

In [15]:
top_reviews = df.sort_values(by="number_of_reviews", ascending=False).head()
top_reviews[['neighbourhood_group', 'room_type', 'price', 'number_of_reviews']]

| | neighbourhood_group | room_type | price | number_of_reviews |
|---|---|---|---|---|
| 11759 | Queens | Private room | 47 | 629 |
| 2031 | Manhattan | Private room | 49 | 607 |
| 2030 | Manhattan | Private room | 49 | 597 |
| 2015 | Manhattan | Private room | 49 | 594 |
| 13495 | Queens | Private room | 47 | 576 |
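Several of the top rows share the same borough, room type, and price, which raises the question of whether they belong to the same host. Since `host_id` is still in the frame, it can be included in the ranking. The sketch below uses `nlargest` as a shorter equivalent of sorting and taking the head, on a made-up frame (`sample`) whose host IDs are invented. It says nothing about the actual hosts in the dataset.

```python
import pandas as pd

# Toy data: review counts mirror the top five above, host IDs are invented.
sample = pd.DataFrame({
    "host_id": [101, 102, 102, 103, 102, 104],
    "number_of_reviews": [629, 607, 597, 12, 594, 576],
})

# nlargest returns the n rows with the highest values, already sorted.
top = sample.nlargest(5, "number_of_reviews")

# Count how many of the top listings each host owns.
listings_per_host = top["host_id"].value_counts()

print(top)
print(listings_per_host)
```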
