In [1]:
import pandas as pd
import wandb

# Exploratory Data Analysis

We run an EDA on the New York Airbnb property data to better understand the dataset. This versioning of this notebook will be uploaded to Weights and Biases - so make sure you're logged in prior to running this notebook

In [2]:
run = wandb.init(project='nyc_airbnb', group='eda', save_code=True)
local_path = run.use_artifact('sample.csv:latest').file()

[34m[1mwandb[0m: Currently logged in as: [33mashrielbrian[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.11.0 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


In [3]:
df = pd.read_csv(local_path)
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [4]:
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()

Summarize dataset:   0%|          | 0/29 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render widgets:   0%|          | 0/1 [00:00<?, ?it/s]

VBox(children=(Tab(children=(Tab(children=(GridBox(children=(VBox(children=(GridspecLayout(children=(HTML(valu…

## Simple cleaning

We remove outliers and keep only records with prices between \\$10 - \\$350. Also, the `last_review` column is transformed into `datetime` type.

In [5]:
# remove outliers
min_price = 10
max_price = 350

In [6]:
idx = df.price.between(min_price, max_price)
df['price'] = df[idx].copy()

In [7]:
df['last_review'] = pd.to_datetime(df.last_review)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              48895 non-null  int64         
 1   name                            48879 non-null  object        
 2   host_id                         48895 non-null  int64         
 3   host_name                       48874 non-null  object        
 4   neighbourhood_group             48895 non-null  object        
 5   neighbourhood                   48895 non-null  object        
 6   latitude                        48895 non-null  float64       
 7   longitude                       48895 non-null  float64       
 8   room_type                       48895 non-null  object        
 9   price                           46428 non-null  object        
 10  minimum_nights                  48895 non-null  int64         
 11  nu

Let's conclude the run here.

In [9]:
run.finish()

VBox(children=(Label(value=' 0.05MB of 0.05MB uploaded (0.01MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…