## Notes:

* This notebook carries out a basic Exploratory Data Analysis (EDA) of the NYC Airbnb data to understand the nature of the data and whether any cleaning is needed.

* A version of this notebook as it was at the time of this analysis is available on the project's W&B page: [wandb.ai/kbhaskar-between-jobs/nyc_airbnb/artifacts/code/source-nyc_airbnb-EDA.ipynb](https://wandb.ai/kbhaskar-between-jobs/nyc_airbnb/artifacts/code/source-nyc_airbnb-EDA.ipynb).

* Note that the W&B artifact used in this notebook is `"sample.csv:latest"`, which at the time of running this notebook pointed to the data in `"sample1.csv"`. In a subsequent run of the project, however, the random forest model was trained on a new dataset, `"sample2.csv"`, so `"sample.csv:latest"` now refers to `"sample2.csv"`.

* To analyze the same data that was used when this notebook was created, use `"sample.csv:v0"` instead of `"sample.csv:latest"`.
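
Pinning the version can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the name `pinned_artifact_path` and its defaults are hypothetical, and `run` is assumed to be the object returned by `wandb.init(...)`.

```python
def pinned_artifact_path(run, name="sample.csv", version="v0"):
    """Fetch one specific artifact version and return its local file path.

    `run` is assumed to be the object returned by wandb.init(); the helper
    name and its default arguments are illustrative only.
    """
    # A fixed alias such as "sample.csv:v0" keeps pointing at the same data,
    # whereas "sample.csv:latest" moves whenever a new version is logged.
    artifact = run.use_artifact(f"{name}:{version}")
    return artifact.file()
```

With this helper, `pd.read_csv(pinned_artifact_path(run))` would read the data this notebook was originally run on, and passing `version="latest"` would reproduce the behaviour of the cell below.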

## Basic conclusions from this notebook:

* The dataset has 20000 rows and 16 columns. The `price` column is the target variable for the ML pipeline.

* There are missing values in `name`, `host_name`, `reviews_per_month` and `last_review`. `reviews_per_month` and `last_review` are missing in identical places, suggesting that these listings have not been reviewed yet.

* There are outliers in the `price` column.

## Basic cleaning steps suggested by the analysis in this notebook:

* We will omit instances where `price` lies outside the range \\$10 to \\$350.

* We will change the datatype of `last_review` to a pandas datetime object.

* Missing values will not be imputed here. Imputing missing values and the other steps mentioned here will be added to the basic cleaning step in the MLflow pipeline, after which the cleaned dataset will be uploaded to Weights and Biases.

In [3]:
import wandb
import pandas as pd
from ydata_profiling import ProfileReport

In [42]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

* There are 20000 rows and 16 columns in the dataset.

In [43]:
df.shape

(20000, 16)

In [44]:
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile.to_widgets()

In [45]:
df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

* The above shows that the column `last_review` is stored as strings (dtype `object`) although it contains dates:

In [46]:
df["last_review"]

0        2019-05-26
1               NaN
2        2018-09-19
3        2019-05-24
4        2019-06-23
            ...    
19995    2016-08-27
19996    2019-05-21
19997    2019-05-23
19998    2019-07-01
19999    2019-04-28
Name: last_review, Length: 20000, dtype: object
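
The effect of converting such a column can be illustrated on a toy series. This is only a sketch with made-up values standing in for `df["last_review"]`; it shows that `pd.to_datetime` parses the date strings and maps missing entries to `NaT`, so the number of missing values does not change.

```python
import pandas as pd

# Toy stand-in for df["last_review"]: date strings plus one missing value.
last_review = pd.Series(["2019-05-26", None, "2018-09-19"], dtype="object")

# Parse the strings; missing entries are converted to NaT.
converted = pd.to_datetime(last_review)

# The count of missing values is the same before and after the conversion.
n_missing_before = int(last_review.isna().sum())
n_missing_after = int(converted.isna().sum())

# The result has a datetime dtype instead of object.
is_datetime = pd.api.types.is_datetime64_any_dtype(converted)
```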

In [48]:
df["name"].value_counts()

name
Brooklyn Apartment                            7
Hillside Hotel                                7
Home away from home                           6
New york Multi-unit building                  6
Private Room                                  6
                                             ..
Private Backyard! Beautiful 1BR apartment.    1
Geometric Getaway                             1
A Hidden Gem in Queens                        1
Cozy Studio in Midtown                        1
Private Bedroom in Williamsburg Apt!          1
Name: count, Length: 19768, dtype: int64

* There are missing values in the `last_review` and `reviews_per_month` columns (4123 each), and a few in `name` (7) and `host_name` (8):

In [50]:
df.isna().sum()

id                                   0
name                                 7
host_id                              0
host_name                            8
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       4123
reviews_per_month                 4123
calculated_host_listings_count       0
availability_365                     0
dtype: int64
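
Equal counts for `last_review` and `reviews_per_month` are consistent with the two columns being missing in the same rows, but do not show it on their own. A check along the following lines could confirm it. This is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions); on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in for the three relevant columns of df (values are made up).
toy = pd.DataFrame(
    {
        "number_of_reviews": [9, 0, 3, 0],
        "last_review": ["2019-05-26", None, "2018-09-19", None],
        "reviews_per_month": [0.21, None, 0.10, None],
    }
)

missing_last = toy["last_review"].isna()
missing_rpm = toy["reviews_per_month"].isna()

# True only if both columns are missing in exactly the same rows.
same_rows = bool((missing_last == missing_rpm).all())

# True only if every row with a missing review date has zero reviews.
all_unreviewed = bool((toy.loc[missing_last, "number_of_reviews"] == 0).all())
```

If both flags are `True` on the real data, the missing values simply mark listings without reviews rather than a data quality problem.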

In [52]:
df["price"].describe()

count    20000.000000
mean       153.269050
std        243.325609
min          0.000000
25%         69.000000
50%        105.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

* The above shows that there are outliers in the `price` column (the maximum is \\$10000 while the 75th percentile is \\$175), and that prices as low as zero occur.
* Omit instances with a price outside the range \\$10 to \\$350:

In [57]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

In [58]:
# Number of rows after removing outliers:
df.shape

(19001, 16)
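
In the notebook, the filter reduces the 20000 rows to 19001, so 999 rows are dropped. Note that `Series.between` is inclusive on both ends by default, so listings priced at exactly \\$10 or \\$350 are kept. The sketch below illustrates this on a made-up list of prices.

```python
import pandas as pd

# Made-up prices around the two boundaries used in the notebook.
prices = pd.Series([0, 9, 10, 105, 350, 351, 10000])

# Series.between includes both boundary values by default.
keep = prices.between(10, 350)

kept_prices = prices[keep].tolist()
n_dropped = int((~keep).sum())
```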

In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  number_

In [70]:
run.finish()
