# Tutorial 1 - Autoencoder

We will build an anomaly detector using the Airbnb data set.

**The unit of analysis is a single Airbnb listing.**

**Let's assume that the `price_gte_150 = 1` category is the "anomalous" category. We will train an autoencoder on the `price_gte_150 = 0` category only, then compare the reconstruction error on "normal" and "anomalous" data.**

I already created two files:<br>
`airbnb-anomaly.csv`: includes only the listings that are `price_gte_150 = 1`<br>
`airbnb-normal.csv`: includes only the listings that are `price_gte_150 = 0`
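
For reference, a split like this can be produced by filtering a combined listings table on `price_gte_150`. The sketch below is an assumption about how the two files could be made, not the actual script: the toy DataFrame `listings` and its values are made up, and the flag column is dropped because the column lists used later in this tutorial do not include it.

```python
import pandas as pd

# Toy stand-in for the full listings table (values are made up)
listings = pd.DataFrame({
    "bedrooms": [1, 3, 2, 4],
    "room_type": ["Private room", "Entire home/apt",
                  "Private room", "Entire home/apt"],
    "price_gte_150": [0, 1, 0, 1],
})

# Split on the flag, then drop it so it cannot leak into the features
is_anomaly = listings["price_gte_150"] == 1
toy_anomaly = listings[is_anomaly].drop(columns="price_gte_150")
toy_normal = listings[~is_anomaly].drop(columns="price_gte_150")

# Each part could then be saved, e.g. toy_normal.to_csv(..., index=False)
print(len(toy_normal), len(toy_anomaly))
```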

# Setup

In [None]:
# Common imports
import numpy as np
import pandas as pd

random_state=42

# Get the data

In [None]:
airbnb_anomaly = pd.read_csv("airbnb-anomaly.csv")

airbnb_normal = pd.read_csv("airbnb-normal.csv")


# Data Prep

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

##  Identify the numerical and categorical columns

In [None]:
airbnb_normal.dtypes

**At this stage, you can manually identify numeric, binary, and categorical columns as follows:**

`numeric_columns = ['latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'Number of amenities', 'guests_included', 'price_per_extra_person', 'minimum_nights', 'number_of_reviews', 'number_days_btw_first_last_review', 'review_scores_rating']`
 
 `binary_columns = ['host_is_superhost', 'host_identity_verified']`
 
 `categorical_columns = ['neighbourhood_cleansed', 'property_type', 'room_type', 'bed_type', 'cancellation_policy']`
 
<br>
 
**If you do not want to type these manually, you can derive most of them from the column dtypes:**

In [None]:
# Identify the numerical columns
numeric_columns = airbnb_normal.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = airbnb_normal.select_dtypes('object').columns.to_list()

In [None]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['host_is_superhost', 'host_identity_verified']

In [None]:
# Be careful: numeric_columns already includes the binary columns,
# so we need to remove the binary columns from numeric_columns.

for col in binary_columns:
    numeric_columns.remove(col)

In [None]:
binary_columns

In [None]:
numeric_columns

In [None]:
categorical_columns
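
The binary columns above are still listed by hand. A minimal sketch of detecting them automatically follows, assuming a binary column is a numeric column whose non-missing values are all 0 or 1. The helper `find_binary_columns` and the toy DataFrame are hypothetical. A count column that happens to contain only 0s and 1s would also be picked up, so the result is worth checking by eye.

```python
import numpy as np
import pandas as pd

def find_binary_columns(df):
    """Return numeric columns whose non-missing values are all 0 or 1."""
    numeric = df.select_dtypes(include=[np.number])
    return [
        col for col in numeric.columns
        if numeric[col].notna().any()                    # skip all-missing columns
        and numeric[col].dropna().isin([0, 1]).all()     # only 0/1 values present
    ]

# Toy data (made up) with one binary, one numeric, and one categorical column
toy = pd.DataFrame({
    "host_is_superhost": [0, 1, 0, np.nan],
    "bedrooms": [1, 2, 3, 1],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

toy_binary = find_binary_columns(toy)
toy_numeric = [
    col for col in toy.select_dtypes(include=[np.number]).columns
    if col not in toy_binary
]
print(toy_binary, toy_numeric)
```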

# Pipeline

In [None]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [None]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [None]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')
    
# remainder='passthrough' is optional: it keeps any column not listed above unchanged.
# The default, remainder='drop', discards such columns instead.

# Transform: fit_transform() for NORMAL data

In [None]:
# Fit the preprocessor on the normal (training) data and transform it
normal_x = preprocessor.fit_transform(airbnb_normal)

normal_x

In [None]:
normal_x.shape

# Transform: transform() for ANOMALOUS data

In [None]:
# Transform the anomalous data with the already-fitted preprocessor (no refitting)
anomaly_x = preprocessor.transform(airbnb_anomaly)

anomaly_x

In [None]:
anomaly_x.shape
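
Because the preprocessor is fitted only on the normal data, the scaling statistics and the one-hot categories all come from the normal listings. The toy sketch below (made-up data, simplified transformers) illustrates two consequences: anomalous values are scaled with the normal mean and standard deviation, and a category seen only in the anomalous data becomes an all-zero one-hot row because of `handle_unknown='ignore'`. `sparse_threshold=0` is used here only to force a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): "Entire home/apt" never appears in the normal set
toy_normal = pd.DataFrame({
    "beds": [1.0, 2.0, 3.0],
    "room_type": ["Private room", "Shared room", "Private room"],
})
toy_anomaly = pd.DataFrame({
    "beds": [4.0],
    "room_type": ["Entire home/apt"],
})

toy_preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["beds"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

toy_normal_x = toy_preprocessor.fit_transform(toy_normal)  # learns mean/std and categories
toy_anomaly_x = toy_preprocessor.transform(toy_anomaly)    # reuses them, no refitting

print(toy_normal_x.shape, toy_anomaly_x.shape)
print(toy_anomaly_x)
```

The number of columns is the same for both arrays, which is what lets the autoencoder trained on the normal data score the anomalous data.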

# Autoencoder

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
model = keras.models.Sequential()

#Encoder
# input_shape must be a tuple, hence the trailing comma
model.add(keras.layers.InputLayer(input_shape=(normal_x.shape[1],)))
model.add(keras.layers.Dense(55, activation='relu'))
model.add(keras.layers.Dense(50, activation='relu'))

#Decoder
model.add(keras.layers.Dense(55, activation='relu'))
model.add(keras.layers.Dense(normal_x.shape[1], activation=None))

model.summary()

In [None]:
# Configure the optimizer explicitly and pass the object to compile()
nadam = keras.optimizers.Nadam(learning_rate=0.001)

model.compile(loss='mse', optimizer=nadam, metrics=['mean_squared_error'])

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')

callback = [earlystop]

In [None]:
# Be careful: both the input and the target are normal_x while training the autoencoder.
# validation_data also reuses normal_x, so val_loss is measured on the training data.

model.fit(normal_x, normal_x, 
          validation_data = (normal_x, normal_x),
          epochs=100, batch_size=100, callbacks=callback)
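
Early stopping is more informative when `val_loss` is computed on rows the model has not trained on. One option, sketched here as an assumption rather than as part of the original tutorial, is to hold out part of the normal data with `train_test_split`. The array `toy_normal_x` is a random stand-in for `normal_x`, and the 80/20 split is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the preprocessed normal data (100 rows, 5 features)
rng = np.random.default_rng(42)
toy_normal_x = rng.normal(size=(100, 5))

# Hold out 20% of the normal rows for validation
train_x, val_x = train_test_split(toy_normal_x, test_size=0.2, random_state=42)
print(train_x.shape, val_x.shape)

# With the real data and the model defined above, training would then be
# (not run here):
# model.fit(train_x, train_x,
#           validation_data=(val_x, val_x),
#           epochs=100, batch_size=100, callbacks=callback)
```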

### Check the average MSE on the "normal" data

In [None]:
model.evaluate(normal_x, normal_x)

In [None]:
# evaluate() returns [loss, metric]; multiply the loss by 1000 so the small value is easier to read:

model.evaluate(normal_x, normal_x)[0]*1000

### Check the average MSE on the "anomalous" data

In [None]:
model.evaluate(anomaly_x, anomaly_x)

In [None]:
# evaluate() returns [loss, metric]; multiply the loss by 1000 so the small value is easier to read:

model.evaluate(anomaly_x, anomaly_x)[0]*1000

## Reconstruction error for the first 20 normal listings

In [None]:
from sklearn.metrics import mean_squared_error

# Reconstruction error (MSE x 1000, for readability) of each of the first 20 normal listings
for i in range(20):
    prediction = model.predict(normal_x[i:i+1], verbose=0)
    print(mean_squared_error(normal_x[i:i+1], prediction) * 1000)

## Reconstruction error for the first 20 anomalous listings


In [None]:
# Reconstruction error (MSE x 1000, for readability) of each of the first 20 anomalous listings
for i in range(20):
    prediction = model.predict(anomaly_x[i:i+1], verbose=0)
    print(mean_squared_error(anomaly_x[i:i+1], prediction) * 1000)
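
To turn these per-listing errors into an actual detector, one common approach is to compute the error of every row at once and flag rows whose error exceeds a threshold taken from the normal data. The sketch below is an illustration under assumptions, not part of the original tutorial: the helper `reconstruction_errors` is hypothetical, the arrays are random stand-ins for the data and the model's reconstructions, and the 95th percentile is an arbitrary cut-off.

```python
import numpy as np

def reconstruction_errors(x, x_hat):
    """Mean squared error of each row (one value per listing)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=1)

# Random stand-ins: "normal" rows are reconstructed well, "anomalous" rows poorly
rng = np.random.default_rng(42)
toy_normal = rng.normal(size=(200, 4))
toy_normal_hat = toy_normal + rng.normal(scale=0.1, size=toy_normal.shape)
toy_anomaly = rng.normal(size=(50, 4))
toy_anomaly_hat = toy_anomaly + rng.normal(scale=1.0, size=toy_anomaly.shape)

normal_err = reconstruction_errors(toy_normal, toy_normal_hat)
anomaly_err = reconstruction_errors(toy_anomaly, toy_anomaly_hat)

# Threshold chosen from the normal data only (95th percentile is an assumption)
threshold = np.percentile(normal_err, 95)
flagged_normal = normal_err > threshold
flagged_anomaly = anomaly_err > threshold

# Share of rows flagged in each group
print(flagged_normal.mean(), flagged_anomaly.mean())
```

With the trained model, the reconstructions would come from a single call such as `model.predict(normal_x)`. If the preprocessor returned a SciPy sparse matrix, it would need converting with `.toarray()` before being passed to this helper.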