## Analysis of Airbnb's NYC rental dataset
This notebook extracts relevant information from the Airbnb NYC rental dataset:
* to find outliers and remove them from the modelling pipeline
* to understand what features can be extracted, manipulated or converted
* to find out what features might be useful for creating a model pipeline

In [2]:
import wandb
import pandas as pd

# login and retrieve a sample of the data
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)


**`df` now contains the Airbnb New York City rental dataset.**

In [3]:
import pandas_profiling

from markupsafe import escape

# produces a pandas profiling report of the sample dataframe
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


# Analyzing the results of the pandas profiling report

## General

There are about 50k observations. The variables are reviewed below.

## Variables

* _name_ is a free-text description with a very high number of distinct values. It seems to contain relevant information, which makes it a good candidate for NLP analysis
* _id_ and _host_id_ are high-cardinality identifiers and should not be used as features
* _host_name_ is a high-cardinality text field that does not seem useful as a feature
* _neighbourhood_group_ and _neighbourhood_ are categorical and might be useful as features. _neighbourhood_ has a high cardinality, though, and is probably less useful
* _latitude_ and _longitude_ are numerical and might be useful as features. Listings with values near the middle of each range seem to have higher prices
* _room_type_ is categorical and influences the price
* _price_ is the target variable. It has some extreme, infrequent values that should be removed
* _minimum_nights_ and _calculated_host_listings_count_ are numerical but seem too highly skewed to have a relevant influence as features
* _number_of_reviews_, _availability_365_ and _reviews_per_month_ are numerical and seem to influence the price somewhat, but have a lot of null values
* _last_review_ is supposed to contain a date, but it is stored as a string and is therefore treated as categorical
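
The cardinality, null and skewness observations in this list can also be cross-checked directly with pandas, without the profiling widget. The following is a minimal sketch, assuming a small made-up frame (`toy`) that borrows a few column names from the dataset; the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the values are invented, only the column names
# follow the dataset.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "room_type": ["Private room", "Entire home/apt", "Private room",
                      "Shared room", "Entire home/apt"],
        "minimum_nights": [1, 1, 2, 1, 365],
        "reviews_per_month": [0.5, None, 1.2, None, 0.1],
    }
)

# One row per column: distinct values, missing values and skewness.
# Skewness is only computed for numeric columns, so text columns get NaN.
summary = pd.DataFrame(
    {
        "distinct": toy.nunique(),
        "missing": toy.isna().sum(),
        "skew": toy.skew(numeric_only=True),
    }
)
print(summary)
```

Running the same three calls on `df` should give a compact table to compare against the report.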


## Conclusions

Some of the numerical and categorical features influence the price, but they might not be enough to train a model. It is therefore necessary to include an NLP analysis of _name_ in the model pipeline.
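
As a first look before committing to a full NLP step, the words in _name_ can be split out and compared against the price. This is a minimal sketch on invented listing names, not the text processing of the actual pipeline; `toy` and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical listings: names and prices are invented for illustration.
toy = pd.DataFrame(
    {
        "name": ["Cozy room in Brooklyn", "Luxury loft in Manhattan",
                 None, "Cozy studio near park", "Luxury penthouse"],
        "price": [60, 300, 80, 90, 340],
    }
)

# Lowercase the names, treat missing names as empty, and split into words.
tokens = toy["name"].fillna("").str.lower().str.findall(r"[a-z]+")

# One row per (listing, word), keeping the listing price next to each word.
# Listings without any word produce a missing token and are dropped.
exploded = toy.assign(token=tokens).explode("token").dropna(subset=["token"])

# Mean price and number of occurrences for each word.
by_token = (
    exploded.groupby("token")["price"]
    .agg(["mean", "count"])
    .sort_values("count", ascending=False)
)
print(by_token.head())
```

Words whose mean price differs clearly from the overall mean would hint that _name_ carries signal worth modelling.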


## Interventions

* _price_ has extreme outlier values that should be removed
* _last_review_ is not useful as a categorical feature and should be converted to a date

In [7]:
# Drop outlier values for prices
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

## Second review

We review the data again after these changes.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46428 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              46428 non-null  int64         
 1   name                            46413 non-null  object        
 2   host_id                         46428 non-null  int64         
 3   host_name                       46407 non-null  object        
 4   neighbourhood_group             46428 non-null  object        
 5   neighbourhood                   46428 non-null  object        
 6   latitude                        46428 non-null  float64       
 7   longitude                       46428 non-null  float64       
 8   room_type                       46428 non-null  object        
 9   price                           46428 non-null  int64         
 10  minimum_nights                  46428 non-null  int64         
 11  nu

In [9]:
# review after the last changes
fprofile = pandas_profiling.ProfileReport(df)
fprofile.to_widgets()


## After review

* The extreme values of _price_ are now more frequent and much closer to the 5th and 95th percentiles
* _last_review_ is more usable as a date, although it is still skewed and still does not seem to influence the price. It might be useful to convert it to a numeric value, such as the number of days between the last review and today.
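
The conversion of _last_review_ to a day difference could look like the sketch below. It assumes a made-up frame (`toy`), a hypothetical column name `days_since_last_review`, and a fixed `reference_date` chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample: dates are invented, one listing has no review.
toy = pd.DataFrame({"last_review": ["2019-06-01", "2019-07-05", None]})
toy["last_review"] = pd.to_datetime(toy["last_review"])

# A fixed reference date keeps the feature reproducible between runs;
# pd.Timestamp.today() would match "today" but change from day to day.
reference_date = pd.Timestamp("2019-07-08")

# Days elapsed since the last review; listings without reviews stay NaN.
toy["days_since_last_review"] = (reference_date - toy["last_review"]).dt.days
print(toy)
```

Listings that were never reviewed keep a missing value here, so a later imputation step would still be needed before modelling.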

In [11]:
run.finish()

