## Exploratory Data Analysis: NYC Airbnb Project

In this notebook we explore the project data. This is a key step, because it informs the data cleaning and data checks that follow.

In [1]:
# import necessary libraries
import wandb
import pandas as pd
import pandas_profiling

Let's start by creating a new W&B run and saving this notebook's code:

In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)

[34m[1mwandb[0m: Currently logged in as: [33mapolanco3225[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.12.9 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


Now let's fetch the CSV file from W&B and load it with pandas:

In [3]:
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

The dataset has the following number of rows and columns:

In [8]:
print(f"Number of rows: {df.shape[0]} and columns: {df.shape[1]}")

Number of rows: 20000 and columns: 16


Inspect the first two rows:

In [5]:
df.head(2)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,9138664,Private Lg Room 15 min to Manhattan,47594947,Iris,Queens,Sunnyside,40.74271,-73.92493,Private room,74,2,6,2019-05-26,0.13,1,5
1,31444015,TIME SQUARE CHARMING ONE BED IN HELL'S KITCHEN...,8523790,Johlex,Manhattan,Hell's Kitchen,40.76682,-73.98878,Entire home/apt,170,3,0,,,1,188


Use pandas-profiling to explore the distribution of values in each column:

In [6]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()

Summarize dataset: 100%|██████████| 29/29 [00:12<00:00,  2.40it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.51s/it]


Data cleaning steps:
1. Drop price outliers
2. Convert last_review from string to datetime

In [10]:
# remove outliers in the price column
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
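
One detail of the filter above is that `Series.between` includes both endpoints by default, so listings priced at exactly `min_price` or `max_price` are kept. The sketch below illustrates this on a toy frame; the prices are made-up values for illustration, not rows from the project dataset.

```python
import pandas as pd

# Toy data: hypothetical prices, not taken from the project dataset
toy = pd.DataFrame({"price": [5, 10, 120, 350, 400]})

min_price = 10
max_price = 350

# between() is inclusive on both ends by default, so 10 and 350 survive
mask = toy["price"].between(min_price, max_price)
filtered = toy[mask].copy()

print(filtered["price"].tolist())
```

The `.copy()` call mirrors the cell above: it makes the filtered frame independent of the original, which avoids chained-assignment warnings when columns are modified later.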

In [11]:
# convert last_review from string to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
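
Some listings have no `last_review` value (the second row shown earlier is one example). As a minimal sketch of how the conversion treats such rows, the toy frame below, which uses illustrative values only, shows that `pd.to_datetime` turns missing entries into `NaT` rather than raising an error.

```python
import pandas as pd

# Toy data: one valid date string and one missing value (illustrative only)
toy = pd.DataFrame({"last_review": ["2019-05-26", None]})

# Missing entries become NaT; valid strings become timestamps
toy["last_review"] = pd.to_datetime(toy["last_review"])

print(toy["last_review"].dtype)
print(toy["last_review"].isna().sum())
```

The column therefore keeps its missing values after conversion; whether to fill or drop them is a separate decision that this notebook does not make.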

Confirm that both issues have been resolved:

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu
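
Reading `df.info()` output is a manual check. As a possible complement, the same two conditions could be verified programmatically. The helper below is a hypothetical sketch, not part of the original notebook: `check_cleaned` is an assumed name, and the toy frame stands in for the real `df`.

```python
import pandas as pd

def check_cleaned(frame, min_price=10, max_price=350):
    """Raise ValueError if either cleaning step did not take effect.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Every price must fall inside the inclusive [min_price, max_price] range
    if not frame["price"].between(min_price, max_price).all():
        raise ValueError("price contains values outside the allowed range")
    # last_review must have been converted to a datetime dtype
    if not pd.api.types.is_datetime64_any_dtype(frame["last_review"]):
        raise ValueError("last_review is not a datetime column")

# Toy frame standing in for the cleaned df (illustrative values)
toy = pd.DataFrame(
    {
        "price": [74, 170],
        "last_review": pd.to_datetime(["2019-05-26", None]),
    }
)
check_cleaned(toy)  # passes silently when both conditions hold
```

In the notebook itself the call would presumably be `check_cleaned(df)` after the two cleaning cells.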

In [14]:
run.finish()
