# EDA of New York Airbnb Rental Price Data

## Importing required libraries

In [1]:
import wandb
import pandas as pd
import pandas_profiling

## Getting Artifact from Weights and Biases

In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33mal-098[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Performing EDA using Pandas Profiling

In [3]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


## Takeaways

### Observations
1. There are a total of 16 columns in the dataset, with 10 numeric columns and 6 categorical columns.
1. There are no duplicate rows in the dataset.
1. The dataset covers all 5 boroughs of New York City: Manhattan, Brooklyn, Queens, Bronx, and Staten Island.
1. Manhattan and Brooklyn account for 85% of the data.
1. Entire homes and Private Rooms account for 97% of the listings.
1. The columns 'last_review' and 'reviews_per_month' have missing values.
1. 'last_review' holds dates, but the column is currently stored as strings.
1. The 'price' column has outliers, with a minimum of 0 and a maximum of 10000.
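
Several of these observations can also be checked with plain pandas, without the profiling report. The sketch below is illustrative only: the helper name `summarize` and the toy rows are assumptions made for the example, not part of the original notebook or of `sample.csv`.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect a few of the checks that the profiling report surfaces."""
    missing_counts = df.isna().sum()
    return {
        "n_columns": df.shape[1],
        "n_duplicates": int(df.duplicated().sum()),
        # Only columns with at least one missing value, mapped to their counts
        "missing": {col: int(n) for col, n in missing_counts.items() if n > 0},
        # Fraction of rows per borough
        "borough_share": df["neighbourhood_group"]
        .value_counts(normalize=True)
        .to_dict(),
        "price_min": int(df["price"].min()),
        "price_max": int(df["price"].max()),
    }


# Toy rows for illustration only; these values are not taken from sample.csv
toy = pd.DataFrame(
    {
        "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
        "price": [0, 120, 10000, 80],
        "last_review": ["2019-05-21", None, "2018-11-03", None],
        "reviews_per_month": [0.5, None, 1.2, None],
    }
)
summary = summarize(toy)
```

Calling `summarize(df)` on the real DataFrame should reproduce the column count, duplicate count, missing-value columns, and price range listed above, assuming the same column names.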

### Plan of Action
1. Handle missing values in the inference pipeline.
1. Convert 'last_review' to datetime format.
1. Only keep prices in the range of 10 to 350 dollars per night.
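
The notebook defers missing-value handling to the inference pipeline and does not say how it will be done. As a rough sketch of what such a step might look like, the hypothetical helper `impute_reviews` below fills 'reviews_per_month' with zero and records whether a review date exists. Both choices are assumptions made for illustration, not the author's method.

```python
import pandas as pd


def impute_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical imputation step; returns a new frame and leaves df unchanged."""
    out = df.copy()
    # Assumption: a listing without reviews has a review rate of zero
    out["reviews_per_month"] = out["reviews_per_month"].fillna(0.0)
    # Keep the missingness of last_review as an explicit flag
    # instead of guessing a date
    out["has_review"] = out["last_review"].notna()
    return out


# Toy rows for illustration only; not taken from sample.csv
toy = pd.DataFrame(
    {
        "last_review": ["2019-05-21", None],
        "reviews_per_month": [0.5, None],
    }
)
imputed = impute_reviews(toy)
```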

## Fixing DataFrame as per Observations

In [4]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
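
Two details of the cell above are easy to miss: `Series.between` includes both endpoints by default, and `pd.to_datetime` turns missing strings into `NaT` rather than raising. The sketch below demonstrates both on a toy frame; the rows are made up for illustration and are not taken from `sample.csv`.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame(
    {
        "price": [0, 10, 200, 350, 351, 10000],
        "last_review": [
            "2019-05-21", None, "2018-11-03", "2019-01-01", None, "2017-07-04",
        ],
    }
)

# between() is inclusive on both ends, so prices of exactly 10 and 350 are kept
kept = toy[toy["price"].between(10, 350)].copy()

# Missing review dates become NaT instead of causing an error
kept["last_review"] = pd.to_datetime(kept["last_review"])

kept_prices = kept["price"].tolist()
n_missing_dates = int(kept["last_review"].isna().sum())
```

The missing dates survive the conversion as `NaT`, which is consistent with the plan to deal with missing values later rather than in this notebook.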

## Checking on Columns to Confirm Fixes

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46428 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              46428 non-null  int64         
 1   name                            46413 non-null  object        
 2   host_id                         46428 non-null  int64         
 3   host_name                       46407 non-null  object        
 4   neighbourhood_group             46428 non-null  object        
 5   neighbourhood                   46428 non-null  object        
 6   latitude                        46428 non-null  float64       
 7   longitude                       46428 non-null  float64       
 8   room_type                       46428 non-null  object        
 9   price                           46428 non-null  int64         
 10  minimum_nights                  46428 non-null  int64         
 11  nu

## Saving Notebook and Exiting Weights and Biases Session

In [6]:
run.finish()
