# Instructions for Airbnb Preprocessed Features

Running the notebook "Airbnb Feature Processing" produces 3 pickle files (written in binary mode):

- airbnb_calendar_cleaned.pickle
- airbnb_listings_filters_cleaned_encoded.pickle
- airbnb_listings_filters_cleaned.pickle

In [1]:
import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing
import re
import pickle

## Load the Saved Pickle Objects

In [2]:
# Make sure that the pickle files are in the current directory.
# Note 'rb' for reading binary; each with-block closes its file after loading.

with open("airbnb_listings_filters_cleaned.pickle", "rb") as f:
    listings_filters_cleaned = pickle.load(f)
with open("airbnb_listings_filters_cleaned_encoded.pickle", "rb") as f:
    listings_filters_cleaned_encoded = pickle.load(f)
with open("airbnb_calendar_cleaned.pickle", "rb") as f:
    calendar_cleaned = pickle.load(f)

## Listings Data Description

The 0th column is the id of the listing.

Columns 1-10 (0-indexing) are numerical.

Starting from column 11, there are 3 categorical features:

- city
- state
- neighbourhood

Within each categorical feature column, all values are uppercase and have been stripped of surrounding whitespace.
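
The normalization described above could be reproduced roughly as follows. This is a minimal sketch on a toy frame, not the notebook's actual cleaning code: the `raw` DataFrame and the `normalize_categoricals` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical raw values with inconsistent case and stray whitespace
raw = pd.DataFrame({
    "city": [" New York", "brooklyn ", "NEW YORK"],
    "state": ["ny", "NY ", " Ny"],
    "neighbourhood": ["Harlem", " kensington", "MIDTOWN "],
})

def normalize_categoricals(df, columns):
    """Return a copy with the given string columns stripped and uppercased."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.strip().str.upper()
    return out

cleaned = normalize_categoricals(raw, ["city", "state", "neighbourhood"])
print(cleaned)
```

Applying the same normalization to any new data keeps its categorical values comparable with the ones stored in the pickle files.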

In [3]:
listings_filters_cleaned.columns.to_series().groupby(listings_filters_cleaned.dtypes).groups

{dtype('int64'): Index(['id', 'accommodates', 'minimum_nights', 'maximum_nights'], dtype='object'),
 dtype('float64'): Index(['price', 'weekly_price', 'monthly_price', 'bedrooms', 'beds',
        'latitude', 'longitude'],
       dtype='object'),
 dtype('O'): Index(['city', 'state', 'neighbourhood'], dtype='object')}

In [4]:
listings_filters_cleaned.head()

| | id | accommodates | minimum_nights | maximum_nights | price | weekly_price | monthly_price | bedrooms | beds | latitude | longitude | city | state | neighbourhood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2515 | 3 | 2 | 21 | 59.0 | 720.0 | 1690.0 | 1.0 | 2.0 | 40.799205 | -73.953676 | NEW YORK | NY | HARLEM |
| 1 | 2539 | 4 | 1 | 730 | 149.0 | 299.0 | 999.0 | 1.0 | 3.0 | 40.647486 | -73.97237 | BROOKLYN | NY | KENSINGTON |
| 2 | 2595 | 2 | 1 | 1125 | 225.0 | 1995.0 | NaN | 0.0 | 1.0 | 40.753621 | -73.983774 | NEW YORK | NY | MIDTOWN |
| 3 | 3330 | 2 | 5 | 730 | 70.0 | 650.0 | 1900.0 | 1.0 | 1.0 | 40.708558 | -73.942362 | BROOKLYN | NY | WILLIAMSBURG |
| 4 | 3647 | 2 | 3 | 7 | 150.0 | NaN | NaN | 1.0 | 1.0 | 40.809018 | -73.941902 | NEW YORK | NY | HARLEM |


### 2 Formats for Categorical Features

"airbnb_listings_filters_cleaned.pickle" stores categorical features directly:

In [5]:
listings_filters_cleaned.iloc[:, 11:].head()

| | city | state | neighbourhood |
|---|---|---|---|
| 0 | NEW YORK | NY | HARLEM |
| 1 | BROOKLYN | NY | KENSINGTON |
| 2 | NEW YORK | NY | MIDTOWN |
| 3 | BROOKLYN | NY | WILLIAMSBURG |
| 4 | NEW YORK | NY | HARLEM |


"airbnb_listings_filters_cleaned_encoded" stores categorical features with one-hot encoding:

- e.g. the original `city` feature contains many different categories (city names), such as NEW YORK and BROOKLYN (as we can see from the cell above); the one-hot encoded version has one column per city, e.g. `city_NEW YORK` and `city_BROOKLYN`, where a value of 1 indicates that the listing/row belongs to that city and a value of 0 indicates that it does _not_
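
To make the one-hot layout concrete, here is a minimal sketch on a toy frame. It assumes an encoding along the lines of `pd.get_dummies`; the notebook's actual encoding step may differ, and the `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy listings with a single categorical column (values are illustrative)
toy = pd.DataFrame({
    "id": [2515, 2539, 2595],
    "city": ["NEW YORK", "BROOKLYN", "NEW YORK"],
})

# One column per distinct city, named "<feature>_<category>";
# dtype=int yields 0/1 values rather than booleans
encoded = pd.get_dummies(toy, columns=["city"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 among the `city_*` columns, in the column matching its original `city` value.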

In [6]:
listings_filters_cleaned_encoded.iloc[:, 11:].head()

| | city_8425 ELMHURST AVENUE | city_ARVERNE | city_ASTORIA | city_ASTORIA - NEW YORK | city_ASTORIA NEW YORK | city_ASTORIA QUEENS | city_ASTORIA, NEW YORK | city_ASTORIA, NYC | city_ASTORIA, QUEENS | city_ASTORIA,NEW YORK | ... | neighbourhood_WESTCHESTER VILLAGE | neighbourhood_WESTERLEIGH | neighbourhood_WHITESTONE | neighbourhood_WILLIAMSBRIDGE | neighbourhood_WILLIAMSBURG | neighbourhood_WILLOWBROOK | neighbourhood_WINDSOR TERRACE | neighbourhood_WOODHAVEN | neighbourhood_WOODLAWN | neighbourhood_WOODSIDE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Important Note about Categorical Features

From the cell above, we can see that even after data cleaning, cities like Astoria can be represented in various different ways: `city_ASTORIA`, `city_ASTORIA - NEW YORK`, `city_ASTORIA NEW YORK`, `city_ASTORIA QUEENS`, `city_ASTORIA, NYC`, etc.

Thus, when checking whether a listing/row belongs to a city, e.g. Astoria, **use a grep-style substring match to find city names that CONTAIN "ASTORIA" rather than checking city name == "ASTORIA"**. This way, we account for every listing in Astoria even though the listings spell the city name differently.
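
A minimal sketch of this substring match in pandas, for both storage formats. The `toy` frame is made up for illustration, and the encoded version is built here with `pd.get_dummies`, which is an assumption about how the real encoded frame was produced.

```python
import pandas as pd

# Toy listings where the same city is spelled several ways
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["ASTORIA", "ASTORIA, NYC", "BROOKLYN", "ASTORIA QUEENS"],
})

# Raw format: literal substring match on the city column
# (regex=False disables regex parsing, na=False treats missing cities as non-matches)
in_astoria = toy["city"].str.contains("ASTORIA", regex=False, na=False)
astoria_ids = toy.loc[in_astoria, "id"].tolist()

# Encoded format: collect every city_* column whose name contains ASTORIA;
# a listing matches if any of those columns is 1
encoded = pd.get_dummies(toy, columns=["city"], dtype=int)
astoria_cols = [c for c in encoded.columns
                if c.startswith("city_") and "ASTORIA" in c]
in_astoria_encoded = encoded[astoria_cols].sum(axis=1) > 0

print(astoria_ids)
print(astoria_cols)
```

An exact comparison such as `toy["city"] == "ASTORIA"` would match only the first row here and miss the other two spellings.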

## Calendar Data Description

The cleaned calendar data stores which days a listing is available and the associated price. For each listing, we can check its availability over 1 year (365 days):

In [7]:
calendar_cleaned.groupby(['listing_id'])['date'].count().unique()

array([365])

In [8]:
calendar_cleaned.head()

| | listing_id | date | available | price |
|---|---|---|---|---|
| 0 | 2515 | 2019-08-06 | t | 89.0 |
| 1 | 2515 | 2019-08-05 | t | 89.0 |
| 2 | 2515 | 2019-08-04 | t | 89.0 |
| 3 | 2515 | 2019-08-03 | t | 89.0 |
| 4 | 2515 | 2019-08-02 | t | 89.0 |


For example, to get the available days for the listing with id 2515:

In [9]:
calendar_cleaned[(calendar_cleaned['listing_id'] == 2515) & (calendar_cleaned['available'] == 't')]  
# available 317/365 days

| | listing_id | date | available | price |
|---|---|---|---|---|
| 0 | 2515 | 2019-08-06 | t | 89.0 |
| 1 | 2515 | 2019-08-05 | t | 89.0 |
| 2 | 2515 | 2019-08-04 | t | 89.0 |
| 3 | 2515 | 2019-08-03 | t | 89.0 |
| 4 | 2515 | 2019-08-02 | t | 89.0 |
| 5 | 2515 | 2019-08-01 | t | 89.0 |
| 6 | 2515 | 2019-07-31 | t | 89.0 |
| 7 | 2515 | 2019-07-30 | t | 89.0 |
| 8 | 2515 | 2019-07-29 | t | 89.0 |
| 9 | 2515 | 2019-07-28 | t | 89.0 |
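
The same filter can be turned into a count of available days per listing. The sketch below uses a made-up `toy_calendar` with the same columns as `calendar_cleaned`. It assumes that unavailable days are marked with `'f'`, which the output above does not show.

```python
import pandas as pd

# Toy calendar with the same columns as calendar_cleaned (values are illustrative)
toy_calendar = pd.DataFrame({
    "listing_id": [2515, 2515, 2515, 2539, 2539, 2539],
    "date": ["2019-08-06", "2019-08-05", "2019-08-04"] * 2,
    "available": ["t", "t", "f", "f", "t", "f"],  # 'f' is an assumed marker
    "price": [89.0, 89.0, 89.0, 149.0, 149.0, 149.0],
})

# Boolean mask of available days, summed within each listing
is_available = toy_calendar["available"] == "t"
available_days = is_available.groupby(toy_calendar["listing_id"]).sum()

print(available_days)
```

On the real data, the same two lines applied to `calendar_cleaned` would give one count per listing out of 365 days.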
