## Exploratory Data Analysis

This notebook executes the following steps to perform exploratory data analysis:

1. Initialize the run and fetch the `sample.csv` dataset from `wandb` artifact storage
2. Profile the dataset using `pandas_profiling`
3. Identify and fix issues

In [1]:
import wandb
import pandas as pd
import pandas_profiling

### Step 1: Fetch data

We fetch the `sample.csv` artifact created in the previous step from W&B and read it into a pandas DataFrame.

In [2]:
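# Start a W&B run in the "eda" group; save_code=True also uploads this notebook
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
# Download the artifact through the run and load its single file into a DataFrame
local_path = run.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)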

[34m[1mwandb[0m: Currently logged in as: [33mgbouz[0m. Use [1m`wandb login --relogin`[0m to force relogin


### Step 2: Profile data

We use `pandas-profiling` to generate a profile report of the dataset and render it as notebook widgets.

In [3]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()
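
As a lightweight complement to the full report, some of the checks a profile surfaces (dtypes, missing values, cardinality) can be approximated with plain pandas. The sketch below is an illustration only: `summarize_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count/fraction, unique count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "frac_missing": df.isna().mean(),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame(
    {
        "price": [50, 120, 9999, 80],
        "name": ["a", None, "c", "c"],
    }
)
summary = summarize_columns(toy)
print(summary)
```

In the notebook itself, the same helper could be called on `df` to get a quick tabular view next to the interactive report.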

### Step 3: Fix issues

We fix a few small problems found in the data: we drop rows whose price falls outside a plausible range and convert `last_review` to a datetime. Note that we do not impute missing values here. Imputation is done in the inference pipeline, so that missing values can also be handled in production.
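
To illustrate why imputation belongs in the inference pipeline, here is a minimal sketch of the idea, assuming a simple median fill. The fill values are learned once from training data and then reapplied unchanged to new data. The helper names and the `numeric_feature` column are hypothetical; the actual pipeline may use a different strategy or library.

```python
import pandas as pd


def fit_fill_values(train: pd.DataFrame, columns: list) -> dict:
    """Learn one fill value (the median) per column from training data only."""
    return {col: train[col].median() for col in columns}


def apply_fill_values(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Fill missing entries with the stored values; returns a new DataFrame."""
    return df.fillna(value=fill_values)


# Hypothetical training data with one missing value
train = pd.DataFrame({"numeric_feature": [1.0, 3.0, None, 2.0]})
fill_values = fit_fill_values(train, ["numeric_feature"])

# Hypothetical production data: the same fill values are reused, not recomputed
incoming = pd.DataFrame({"numeric_feature": [None, 0.5]})
filled = apply_fill_values(incoming, fill_values)
```

Because the fill values travel with the pipeline, a row arriving in production with a missing value is treated exactly like a training row, which would not be the case if the values had been filled in once during EDA.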

In [4]:
# Drop price outliers: keep rows with min_price <= price <= max_price
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
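
The behaviour of the two fixes above can be checked on a small made-up frame. This is a sketch on toy values, not the real dataset: it shows that `Series.between` keeps both bounds by default and that `pd.to_datetime` turns missing dates into `NaT` instead of raising.

```python
import pandas as pd

# Toy data (values are made up) with prices below, on, and above the bounds
toy = pd.DataFrame(
    {
        "price": [5, 10, 200, 350, 1000],
        "last_review": ["2019-05-21", None, "2018-11-02", "2019-07-01", None],
    }
)

# between() is inclusive on both ends by default, so 10 and 350 are kept
mask = toy["price"].between(10, 350)
cleaned = toy[mask].copy()  # copy so the assignment below does not touch a view

# Missing dates are converted to NaT rather than causing an error
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])
```

The missing `last_review` values remain missing after conversion, consistent with the decision not to impute at this stage.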

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu

In [6]:
# terminate the run
run.finish()
