# Rental Prices in NYC, Exploratory Data Analysis

In [8]:
import wandb
import pandas as pd
import pandas_profiling

Start by initiating a W&B run and fetching the dataset from W&B. The notebook will be uploaded to W&B under the current run.

In [9]:
# Start the run
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
# Fetch the dataset for EDA
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

Generate a profile summary of the dataset with [Pandas Profiling](https://pandas-profiling.ydata.ai/docs/master/index.html).

In [10]:
# Generate an EDA report
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()

Some prices are suspiciously high or low and are not credible, so we are going to drop the corresponding samples.
The `last_review` variable is stored as a string but actually contains dates, so we are going to convert it to a datetime type.
Some variables have missing values; we are going to impute them as part of the training/inference pipeline.
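The observations above can also be checked directly with pandas, without the profiling widget. The following is a minimal sketch on a small made-up frame: the `toy` data and the names `missing_counts`, `price_min`, and `price_max` are illustrative assumptions, not values taken from `sample.csv`.

```python
import pandas as pd

# Toy data standing in for the real dataset (all values are made up)
toy = pd.DataFrame(
    {
        "name": ["Cozy room", None, "Sunny loft", "Studio"],
        "price": [0, 45, 120, 9999],
        "last_review": ["2019-05-21", None, "2018-11-03", "2019-01-15"],
    }
)

# Count missing values per column
missing_counts = toy.isna().sum()

# Look at the extremes of the price column to spot implausible values
price_min = toy["price"].min()
price_max = toy["price"].max()

print(missing_counts)
print(price_min, price_max)
```

On the real data, the same calls would be applied to `df` instead of `toy`.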

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20000 non-null  int64  
 1   name                            19993 non-null  object 
 2   host_id                         20000 non-null  int64  
 3   host_name                       19992 non-null  object 
 4   neighbourhood_group             20000 non-null  object 
 5   neighbourhood                   20000 non-null  object 
 6   latitude                        20000 non-null  float64
 7   longitude                       20000 non-null  float64
 8   room_type                       20000 non-null  object 
 9   price                           20000 non-null  int64  
 10  minimum_nights                  20000 non-null  int64  
 11  number_of_reviews               20000 non-null  int64  
 12  last_review                     

In [12]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

After dropping the samples with out-of-range prices, 19001 of the initial 20000 samples remain.
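To make the behavior of the cleaning step concrete, here is a small sketch on made-up data. It illustrates two details that are easy to miss: `Series.between` includes both bounds by default, and `pd.to_datetime` turns missing entries into `NaT` instead of raising. The `toy` frame and the names `mask`, `filtered`, and `n_dropped` are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up) with prices on and just outside the bounds
toy = pd.DataFrame(
    {
        "price": [5, 10, 200, 350, 351],
        "last_review": ["2019-05-21", None, "2018-11-03", "2019-01-15", None],
    }
)

min_price, max_price = 10, 350

# between() is inclusive on both ends by default, so 10 and 350 are kept
mask = toy["price"].between(min_price, max_price)
filtered = toy[mask].copy()

# Missing entries become NaT after conversion
filtered["last_review"] = pd.to_datetime(filtered["last_review"])

n_dropped = len(toy) - len(filtered)
print(filtered)
print(n_dropped)
```

Taking a `.copy()` after filtering, as the notebook does, avoids assigning into a view of the original frame when the column is converted.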

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu

Complete the run. To make sure the run finishes successfully, after closing this notebook stop the Jupyter server from its home page with the `Close` button rather than killing the server process.

In [14]:
run.finish()  # Done with the run
