# Imports and project initialization

In [1]:
import wandb
import pandas as pd
import pandas_profiling

run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)

wandb: Currently logged in as: reccodo (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.10 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


# Loading the data

Declare the artifact as an input to the run, download it, and read the contents of the returned local file.

In [2]:
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

# Exploring the data

This is the main part of the EDA. For each column the following statistics - if relevant for the column type - are examined:

* The type of the columns.
* The most frequent, unique, and missing values of the columns.
* Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range.
* Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.
* The distribution of the values on each column.
* The correlations between the variables.
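For a single numeric column, most of the statistics listed above can be reproduced with plain pandas. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative and not taken from the dataset); it is not meant to describe how `pandas_profiling` computes its report internally.

```python
import pandas as pd

# Toy data for illustration only; not the actual dataset.
toy = pd.DataFrame({
    "price": [50.0, 80.0, 120.0, 200.0, None],
    "minimum_nights": [1, 2, 3, 30, 2],
})

col = toy["price"]
# Quantiles ignore missing values by default.
q1, median, q3 = col.quantile([0.25, 0.5, 0.75])

summary = {
    "dtype": str(col.dtype),
    "n_missing": int(col.isna().sum()),
    "n_unique": int(col.nunique()),
    "min": col.min(),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": col.max(),
    "range": col.max() - col.min(),
    "iqr": q3 - q1,
    "mean": col.mean(),
    "std": col.std(),
    "cv": col.std() / col.mean(),   # coefficient of variation
    "skewness": col.skew(),
    "kurtosis": col.kurt(),
}

# Pairwise Pearson correlations between the numeric columns.
correlations = toy.corr()
```

The profile report covers the same ground for every column at once, which is why the notebook relies on it rather than on hand-written summaries.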

In [3]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()

# Data cleaning

The analysis shows several outliers in the `price` variable that need to be addressed. In addition, the values of `last_review` need to be converted to datetimes.

In [5]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
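Two details of the cell above are easy to miss: `Series.between` includes both bounds by default, and `pd.to_datetime` turns missing values into `NaT` instead of raising an error. A small sketch on made-up rows (the values are illustrative, not taken from the dataset) shows both behaviors:

```python
import pandas as pd

# Toy rows for illustration only; not the actual dataset.
toy = pd.DataFrame({
    "price": [5, 10, 120, 350, 9999],
    "last_review": ["2019-05-21", None, "2018-11-03", "2019-07-01", "2019-01-15"],
})

# between() is inclusive on both ends by default,
# so prices of exactly 10 and 350 are kept.
keep = toy["price"].between(10, 350)
cleaned = toy[keep].copy()

# Missing review dates become NaT; the column dtype becomes datetime64.
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])
```

Calling `.copy()` after the boolean filter, as the notebook does, avoids assigning into a view of the original DataFrame when the new column is written.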

Now we can check that the specific problems have been addressed.

In [6]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()
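In addition to inspecting the report, the two fixes can be verified programmatically. The helper below is a hypothetical sketch (`check_cleaning` is not part of the original notebook); it assumes the `price` and `last_review` column names and the bounds used in the cleaning cell.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_cleaning(df, min_price=10, max_price=350):
    """Hypothetical helper: report whether the two cleaning steps took effect."""
    return {
        # True only if every remaining price lies within the inclusive bounds.
        "price_in_range": bool(df["price"].between(min_price, max_price).all()),
        # True if last_review has been converted to a datetime dtype.
        "last_review_is_datetime": bool(is_datetime64_any_dtype(df["last_review"])),
    }


# Illustrative data only: one frame that passes and one that does not.
good = pd.DataFrame({
    "price": [10, 200, 350],
    "last_review": pd.to_datetime(["2019-01-01", None, "2019-06-30"]),
})
bad = pd.DataFrame({
    "price": [5, 200, 400],
    "last_review": ["2019-01-01", None, "2019-06-30"],
})

good_result = check_cleaning(good)
bad_result = check_cleaning(bad)
```

In the notebook itself, the equivalent call would be `check_cleaning(df)` on the cleaned DataFrame.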

In [8]:
run.finish()
