# Contents
1. [Introduction](#intro)   
2. [Quantitative features](#quants)  
3. [Categorical features](#cats)  
    3.1 [Binary features](#binary)  
    3.2 [Ordinal features](#ordinal)  
    3.3 [Nominal features with >2 classes](#nominal)  
4. [Combining feature sets and rescaling](#combine)  

# 1. Introduction
<a id='intro'></a>

In this notebook, we extract the features we will use to predict property price, impute missing values, and rescale each feature to zero mean and unit variance.

In [23]:
import pandas as pd
import numpy as np
import pickle
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer

In [2]:
listings = pd.read_csv('data/listings_train.csv', low_memory=False)
listings.set_index('id', inplace=True)

# 2. Quantitative features
<a id='quants'></a>

Convert percent strings to floats:

In [3]:
def pct_to_float(pct_column):
    """Strip punctuation from percents and convert to floats"""
    float_pct = [float(str(pct).replace('%', '')) for pct in pct_column]
    return float_pct

In [4]:
listings.host_response_rate = pct_to_float(listings.host_response_rate)
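
The same conversion can also be written with pandas string methods instead of a list comprehension. The following is a minimal sketch on a toy column; the values and the names `rates` and `rates_float` are illustrative assumptions, not taken from the listings data.

```python
import pandas as pd

# Toy column (hypothetical values), including one missing entry.
rates = pd.Series(['95%', '100%', None, '7%'])

# Strip the trailing percent sign, then coerce to numeric.
# Missing entries stay NaN so they can be imputed later.
rates_float = pd.to_numeric(rates.str.rstrip('%'), errors='coerce')
print(rates_float.tolist())
```

Keeping missing values as NaN at this stage matters because the imputation step further down relies on them being recognisable as missing.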

Create `days_as_host` feature from `host_since` column, which is the date the host joined the site:

In [5]:
listings['days_as_host'] = (pd.to_datetime('2019-07-14') - pd.to_datetime(listings.host_since)).dt.days
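
To make the date arithmetic concrete, here is a small sketch of the same subtraction on a toy series. The dates and the names `host_since`, `reference_date` and `days` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy join dates (hypothetical), including one missing value.
host_since = pd.Series(['2019-07-13', '2018-07-14', None])

# Same fixed reference date as in the cell above.
reference_date = pd.to_datetime('2019-07-14')

# Subtracting gives a timedelta series; .dt.days extracts whole days.
# A missing join date propagates as NaN rather than raising.
days = (reference_date - pd.to_datetime(host_since)).dt.days
print(days.tolist())
```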

Impute missing values:

In [6]:
quant_features = ['days_as_host', 'host_response_rate', 'host_listings_count', 'accommodates', 'bathrooms', 'bedrooms',
                  'beds', 'guests_included', 'minimum_nights', 'number_of_reviews', 'review_scores_rating',
                  'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
                  'review_scores_communication', 'review_scores_location', 'review_scores_value']

In [7]:
mean_imp = SimpleImputer()
listings[quant_features] = mean_imp.fit_transform(listings[quant_features])
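
With its default settings, `SimpleImputer` replaces each missing value with the mean of its column. A minimal sketch on a toy array (the array `X` and the name `imputed` are illustrative assumptions):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (hypothetical data).
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# The default strategy is 'mean': each NaN is replaced by the mean
# of the non-missing values in the same column.
imputed = SimpleImputer().fit_transform(X)
print(imputed)
```

Here the first column's gap is filled with 2.0 (the mean of 1 and 3) and the second column's with 15.0 (the mean of 10 and 20).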

# 3. Categorical features
<a id='cats'></a>

## 3.1 Binary features
<a id='binary'></a>

Recode 't' and 'f' to 1 and 0, respectively:

In [8]:
binary_features = ['host_is_superhost', 'instant_bookable']
binary_recode = {'t': 1, 'f': 0}
listings[binary_features] = listings[binary_features].replace(binary_recode)

Impute missing values:

In [9]:
freq_imp = SimpleImputer(strategy='most_frequent')
listings[binary_features] = freq_imp.fit_transform(listings[binary_features])

## 3.2 Ordinal features
<a id='ordinal'></a>

Recode levels of ordinal features to preserve their order:

In [10]:
ordinal_features = ['host_response_time', 'cancellation_policy', 'room_type']

In [11]:
ordinal_recode = {'host_response_time':
                  {'within an hour': 4, 'within a few hours': 3, 'within a day': 2, 'a few days or more': 1},
                  'cancellation_policy':
                  {'super_strict_60': 1, 'super_strict_30': 2, 'strict': 3, 'strict_14_with_grace_period': 3, 'moderate': 4, 'flexible': 5},
                  'room_type': {'Entire home/apt': 3, 'Private room': 2, 'Shared room': 1}}

In [12]:
listings.replace(ordinal_recode, inplace=True)
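
One thing to keep in mind with `replace` is that any level missing from the recode dictionary is left unchanged as a string. A possible alternative, sketched below on toy data, is `Series.map`, which turns unmapped levels into NaN so that they are handled by the imputation step. The frame `df`, the dictionary `codes` and the `'Hotel room'` level are assumptions for illustration only.

```python
import pandas as pd

# Toy column with one level ('Hotel room') that the recode dict does not cover.
df = pd.DataFrame({'room_type': ['Private room', 'Shared room',
                                 'Entire home/apt', 'Hotel room']})

# Ordinal codes in the same order as in the notebook.
codes = {'Entire home/apt': 3, 'Private room': 2, 'Shared room': 1}

# map() returns NaN for any value that is not a key in the dict.
df['room_type'] = df['room_type'].map(codes)
print(df['room_type'].tolist())
```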

Impute missing values:

In [13]:
listings[ordinal_features] = freq_imp.fit_transform(listings[ordinal_features])

## 3.3 Nominal features with >2 classes
<a id='nominal'></a>

Two columns, `amenities` and `host_verifications`, contain overlapping classes (i.e., a listing can have multiple amenities and a host may have been verified in multiple ways). We'll create a binary feature for each class with the following:

In [14]:
amenity_options = []
for index, row in listings.iterrows():
    listing_amenities = row.amenities.replace('{','').replace('}','').replace('/','_')
    listing_amenities = listing_amenities.replace('\"','').replace(' ', '_').split(',')
    for amenity in listing_amenities:
        if amenity not in listings.columns:
            listings[amenity] = 0
            amenity_options.append(amenity)
        listings.at[index, amenity] = 1

In [15]:
verification_options = []
for index, row in listings.iterrows():
    # Parse this row's verifications (not the first row's) into a list of strings
    listing_verifications = row.host_verifications.replace('\'', '').replace('[', '').replace(']', '').split(', ')
    for verification in listing_verifications:
        if verification not in listings.columns:
            listings[verification] = 0
            verification_options.append(verification)
        listings.at[index, verification] = 1
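
Row-by-row loops like the ones above can be slow on large frames. As a rough sketch of a vectorized alternative, pandas' `Series.str.get_dummies` builds one indicator column per class from a delimited string. The toy strings below only imitate the `amenities` format, and the names `amenities`, `cleaned` and `amenity_dummies` are assumptions.

```python
import pandas as pd

# Toy strings in the same brace-delimited style as the amenities column.
amenities = pd.Series(['{TV,"Hot tub",Wifi}', '{Wifi}', '{}'])

# Apply the same clean-up as the loop: drop braces and quotes,
# and replace '/' and spaces with underscores.
cleaned = (amenities.str.strip('{}')
           .str.replace('"', '', regex=False)
           .str.replace('/', '_', regex=False)
           .str.replace(' ', '_', regex=False))

# One indicator column per distinct amenity; empty strings yield all zeros.
amenity_dummies = cleaned.str.get_dummies(sep=',')
print(amenity_dummies)
```

The resulting frame could then be joined back to the listings on their shared index, with its column names playing the role of `amenity_options`.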

For features with non-overlapping classes, we can use sklearn's DictVectorizer to convert each class to a binary feature:

In [16]:
nominal_features = ['neighbourhood_group_cleansed', 'property_type']

In [17]:
nom_dict = listings[nominal_features].to_dict('records')

In [18]:
v = DictVectorizer(sparse=False)
nom_onehot = v.fit_transform(nom_dict)
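
To illustrate what `DictVectorizer` produces, here is a minimal sketch on three toy records. The records and the names `records`, `vec` and `onehot` are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records in the same shape that to_dict('records') produces.
records = [{'property_type': 'House'},
           {'property_type': 'Apartment'},
           {'property_type': 'House'}]

vec = DictVectorizer(sparse=False)
onehot = vec.fit_transform(records)

# String values become 'feature=value' columns, sorted by name.
print(vec.vocabulary_)
print(onehot)
```

Each row has exactly one 1 per original feature, which is why this approach suits non-overlapping classes.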

# 4. Combining feature sets and rescaling
<a id='combine'></a>

In [19]:
df_features = quant_features + binary_features + ordinal_features + amenity_options + verification_options
X_train = np.c_[listings[df_features].values, nom_onehot]

In [20]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
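
`StandardScaler` subtracts each column's mean and divides by its (population) standard deviation. The sketch below reproduces that calculation with NumPy on a toy matrix; the array `X` and the name `X_scaled` are assumptions for illustration.

```python
import numpy as np

# Toy matrix whose columns are on very different scales (hypothetical data).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Column-wise standardization: z = (x - mean) / std, with ddof=0.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Note that this fixes the mean and variance of each column but does not change the shape of its distribution.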

Because of the issues we saw in the price data in the `listings.csv` file, we'll use price data from the `calendar.csv` file as our target instead. For now, we'll pick a fixed weekend date that isn't too far beyond when the calendars were scraped.

In [21]:
calendar = pd.read_csv('data/calendar_train_price.csv')
y_train = calendar[calendar.date == '2019-08-03'].set_index('listing_id').loc[listings.index].price.values
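
The `.loc[listings.index]` step is what keeps the target in the same row order as the feature matrix. A small sketch with toy data shows the effect; the frame contents and the names `toy_calendar`, `listing_index` and `y` are assumptions.

```python
import pandas as pd

# Toy calendar whose rows are NOT in the same order as the listings index.
toy_calendar = pd.DataFrame({
    'listing_id': [11, 10, 11, 10],
    'date': ['2019-08-03', '2019-08-03', '2019-08-04', '2019-08-04'],
    'price': [80.0, 120.0, 85.0, 125.0],
})

# Stand-in for listings.index: the order the feature rows appear in.
listing_index = pd.Index([10, 11], name='id')

# Filter to one date, index by listing, then reorder to match the features.
y = (toy_calendar[toy_calendar.date == '2019-08-03']
     .set_index('listing_id')
     .loc[listing_index]
     .price.values)
print(y)
```

If a listing had no calendar row for the chosen date, `.loc` would raise a `KeyError`, so a failure here points to missing target data rather than going unnoticed.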

In [29]:
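# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# (available from 1.0) returns an array, so convert it to a list before concatenating
feature_names = df_features + list(v.get_feature_names_out())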

In [22]:
#np.save('data/X_train.npy', X_train)
#np.save('data/y_train.npy', y_train)
#with open('data/feature_names.txt', 'wb') as fp:
#    pickle.dump(feature_names, fp)