# Load and Access the Raw Data

Load the Airbnb dataset into a pandas DataFrame.

Examine the DataFrame and determine which of the columns belong in one of the following categories:
* Categorical: the column contains a fixed number of distinct values
* Continuous: the column contains numerical values
* Text: the column contains free-form text

Complete a basic assessment of the dataset. For each column in the dataset, you want to get a count of:
* Missing values (you can use the pandas isnull() function to find them)
* Invalid values (for example, negative values in continuous columns and latitude/longitude values that are outside of New York City)
* Distinct values, for each of the categorical columns

Save the DataFrame to a pickle file so that you can examine it again in another Python session.

In [1]:

import pandas as pd
import folium
from folium.plugins import FastMarkerCluster
import utils

CONFIG_FILE = "01-01-load-raw-data_config.yml"


In [2]:
# Load Notebook Config
config = utils.load_config(CONFIG_FILE)
config

{'general': {'load_from_scratch': False,
  'save_raw_dataframe': False,
  'save_transformed_dataframe': False,
  'remove_bad_values': True},
 'columns': {'categorical': ['neighbourhood_group',
   'neighbourhood',
   'room_type'],
  'continuous': ['minimum_nights',
   'number_of_reviews',
   'reviews_per_month',
   'calculated_host_listings_count',
   'latitude',
   'longitude'],
  'date': ['last_review'],
  'text': ['name', 'host_name'],
  'excluded': ['price', 'id']},
 'bounding_box': {'max_long': -73.70018092,
  'max_lat': 40.91617849,
  'min_long': -74.25909008,
  'min_lat': 40.47739894},
 'newark_bounding_box': {'max_long': -74.11278706,
  'max_lat': 40.67325015,
  'min_long': -74.25132408,
  'min_lat': 40.78813864},
 'geo_columns': ['latitude', 'longitude'],
 'file_names': {'input_csv': '../data/AB_NYC_2019.csv',
  'parquet_input_dataframe': '../data/AB_NYC_2019_input_1_sep_2023.parquet',
  'parquet_output_dataframe': '../data/AB_NYC_2019_output_1_sep_2023.parquet'}}

## Load Data

In [3]:
airbnb_df = pd.read_csv(config["file_names"]["input_csv"])

In [4]:
airbnb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

## Type of columns and number of categorical values

Classify each column by type. The number of unique values decides whether a non-numeric column is treated as categorical or as text.

In [5]:

utils.ColumnType.classify_columns(airbnb_df, categorical_max_count=221)

The column 'name' has 47906 unique values.
The column 'host_name' has 11453 unique values.
The column 'neighbourhood_group' has 5 unique values.
The column 'neighbourhood' has 221 unique values.
The column 'room_type' has 3 unique values.
The column 'last_review' has 1765 unique values.


{'id': <ColumnType.CONTINUOUS: 'continuous'>,
 'name': <ColumnType.TEXT: 'text'>,
 'host_id': <ColumnType.CONTINUOUS: 'continuous'>,
 'host_name': <ColumnType.TEXT: 'text'>,
 'neighbourhood_group': <ColumnType.CATEGORICAL: 'categorical'>,
 'neighbourhood': <ColumnType.CATEGORICAL: 'categorical'>,
 'latitude': <ColumnType.CONTINUOUS: 'continuous'>,
 'longitude': <ColumnType.CONTINUOUS: 'continuous'>,
 'room_type': <ColumnType.CATEGORICAL: 'categorical'>,
 'price': <ColumnType.CONTINUOUS: 'continuous'>,
 'minimum_nights': <ColumnType.CONTINUOUS: 'continuous'>,
 'number_of_reviews': <ColumnType.CONTINUOUS: 'continuous'>,
 'last_review': <ColumnType.TEXT: 'text'>,
 'reviews_per_month': <ColumnType.CONTINUOUS: 'continuous'>,
 'calculated_host_listings_count': <ColumnType.CONTINUOUS: 'continuous'>,
 'availability_365': <ColumnType.CONTINUOUS: 'continuous'>}
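
The implementation of `utils.ColumnType.classify_columns` is not shown in this notebook. The following is a minimal sketch of how such a classification could work, assuming that numeric columns are always continuous and that non-numeric columns are categorical when they have at most `categorical_max_count` distinct values. The function name `classify_columns_sketch` and the sample data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def classify_columns_sketch(df: pd.DataFrame, categorical_max_count: int) -> dict:
    """Hypothetical stand-in for utils.ColumnType.classify_columns."""
    column_types = {}
    for column in df.columns:
        if is_numeric_dtype(df[column]):
            # Assumption: numeric columns are continuous, whatever their cardinality.
            column_types[column] = "continuous"
        elif df[column].nunique() <= categorical_max_count:
            # Non-numeric columns with few distinct values are categorical.
            column_types[column] = "categorical"
        else:
            # Every other non-numeric column is treated as free-form text.
            column_types[column] = "text"
    return column_types


# Illustrative toy data, not rows from the Airbnb dataset.
sample_df = pd.DataFrame(
    {
        "room_type": ["Private room", "Shared room", "Private room", "Entire home/apt"],
        "name": ["Sunny room", "Quiet loft", "Cozy studio", "Big flat"],
        "price": [50, 30, 80, 120],
    }
)
sample_types = classify_columns_sketch(sample_df, categorical_max_count=3)
print(sample_types)
```

Under these assumptions, a date column stored as strings, such as `last_review`, ends up as text because it has many distinct values, which matches the output above.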

## Missing Price values

11 listings have a price of zero. No price values are null.

In [6]:

# Parenthesize each condition: `|` binds more tightly than `<=`.
missing_price_df = airbnb_df[(airbnb_df["price"] <= 0) | airbnb_df["price"].isnull()]
print(len(missing_price_df))
missing_price_df

11


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
23161,18750597,"Huge Brooklyn Brownstone Living, Close to it all.",8993084,Kimberly,Brooklyn,Bedford-Stuyvesant,40.69023,-73.95428,Private room,0,4,1,2018-01-06,0.05,4,28
25433,20333471,★Hostel Style Room | Ideal Traveling Buddies★,131697576,Anisha,Bronx,East Morrisania,40.83296,-73.88668,Private room,0,2,55,2019-06-24,2.56,4,127
25634,20523843,"MARTIAL LOFT 3: REDEMPTION (upstairs, 2nd room)",15787004,Martial Loft,Brooklyn,Bushwick,40.69467,-73.92433,Private room,0,2,16,2019-05-18,0.71,5,0
25753,20608117,"Sunny, Quiet Room in Greenpoint",1641537,Lauren,Brooklyn,Greenpoint,40.72462,-73.94072,Private room,0,2,12,2017-10-27,0.53,2,0
25778,20624541,Modern apartment in the heart of Williamsburg,10132166,Aymeric,Brooklyn,Williamsburg,40.70838,-73.94645,Entire home/apt,0,5,3,2018-01-02,0.15,1,73
25794,20639628,Spacious comfortable master bedroom with nice ...,86327101,Adeyemi,Brooklyn,Bedford-Stuyvesant,40.68173,-73.91342,Private room,0,1,93,2019-06-15,4.28,6,176
25795,20639792,Contemporary bedroom in brownstone with nice view,86327101,Adeyemi,Brooklyn,Bedford-Stuyvesant,40.68279,-73.9117,Private room,0,1,95,2019-06-21,4.37,6,232
25796,20639914,Cozy yet spacious private brownstone bedroom,86327101,Adeyemi,Brooklyn,Bedford-Stuyvesant,40.68258,-73.91284,Private room,0,1,95,2019-06-23,4.35,6,222
26259,20933849,the best you can find,13709292,Qiuchi,Manhattan,Murray Hill,40.75091,-73.97597,Entire home/apt,0,3,0,,,1,0
26841,21291569,Coliving in Brooklyn! Modern design / Shared room,101970559,Sergii,Brooklyn,Bushwick,40.69211,-73.9067,Shared room,0,30,2,2019-06-22,0.11,6,333


## Number of missing values by column

In [7]:
airbnb_df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
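
Invalid values can be counted in a similar way to missing values. The sketch below counts negative entries in the continuous columns that should never be negative. The column names are taken from the `continuous` list in the config; the toy DataFrame is illustrative only and stands in for `airbnb_df`.

```python
import pandas as pd

# Column names come from the 'continuous' list in the notebook config.
# latitude and longitude are left out because negative longitudes are valid.
non_negative_columns = [
    "minimum_nights",
    "number_of_reviews",
    "reviews_per_month",
    "calculated_host_listings_count",
]

# Illustrative toy data, not rows from the Airbnb dataset.
toy_df = pd.DataFrame(
    {
        "minimum_nights": [1, -2, 3],
        "number_of_reviews": [0, 5, 10],
        "reviews_per_month": [0.5, None, -0.1],
        "calculated_host_listings_count": [1, 1, 2],
    }
)

# Comparisons against NaN evaluate to False, so missing values are not counted here.
negative_counts = (toy_df[non_negative_columns] < 0).sum()
print(negative_counts)
```

Replacing `toy_df` with `airbnb_df` would give the per-column count of negative values for the real data.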

## Listing coordinates on Map

All the points appear to be located within the New York City region. To confirm this accurately, we would need detailed border information for New York City.

In [8]:

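# Center the map on the mean listing coordinates, passed as a plain [lat, lon] list.
mean_location = airbnb_df[["latitude", "longitude"]].mean().tolist()
f = folium.Figure(width=1000, height=500)
# folium.Map takes min_zoom (not zoom_min) to limit how far the map can zoom out.
m = folium.Map(location=mean_location, tiles="Stamen Toner", zoom_start=10, min_zoom=8, max_bounds=True).add_to(f)
FastMarkerCluster(data=airbnb_df[["latitude", "longitude", "name"]]).add_to(m)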

f
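
Short of a true border polygon, a coarse check is possible with the `bounding_box` values from the config. The sketch below counts points that fall outside that rectangle. It is only an approximation, since a rectangle also covers areas that are not part of the city. The toy coordinates are illustrative and stand in for `airbnb_df`.

```python
import pandas as pd

# Values copied from the 'bounding_box' entry of the notebook config.
bounding_box = {
    "max_long": -73.70018092,
    "max_lat": 40.91617849,
    "min_long": -74.25909008,
    "min_lat": 40.47739894,
}

# Illustrative toy coordinates; the second point lies north of the box.
toy_df = pd.DataFrame(
    {
        "latitude": [40.69023, 41.20000, 40.75091],
        "longitude": [-73.95428, -73.90000, -73.97597],
    }
)

# Series.between is inclusive on both ends by default.
lat_ok = toy_df["latitude"].between(bounding_box["min_lat"], bounding_box["max_lat"])
long_ok = toy_df["longitude"].between(bounding_box["min_long"], bounding_box["max_long"])
inside_box = lat_ok & long_ok

outside_count = int((~inside_box).sum())
print(outside_count)
```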

## Save Data in Parquet file

Parquet is used instead of pickle because of its speed and memory efficiency.

In [9]:
save_path = config["file_names"]["parquet_input_dataframe"]
airbnb_df.to_parquet(save_path)

print(f"Dataframe saved at: {save_path}")

Dataframe saved at: ../data/AB_NYC_2019_input_1_sep_2023.parquet
