### Author: Vidhi Kokel
# Understand the template

#### What is the experimental protocol used and how was it carried out? How did we tune hyper-parameters in the template? What is the search space and what are the criteria to determine good/bad hyper-parameters?
#### ✅The experimental protocol combines the following layers in a multi-layer sequential model:
1. Embedding Layer
2. LSTM Layer
3. GRU Layer
4. Bidirectional LSTM Layer
5. Bidirectional GRU Layer
6. Convolutional Layer
7. Max Pooling Layer
8. Fully Connected Layer

#### The search space for the experiment consists of the hyper-parameters that can affect the performance of the model, listed below.
1. Number of Hidden layers and units
2. Different types of layers
3. Weight Initialization
4. Activation functions
5. Learning Rate
6. Number of epochs
7. Batch size
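
One way to make such a search space concrete is to write it down as a grid and enumerate every combination. The sketch below is only an illustration: the parameter names and candidate values are hypothetical and are not the ones used in the template.

```python
from itertools import product

# Hypothetical candidate values; the template does not prescribe these.
search_space = {
    "hidden_units": [32, 64],
    "layer_type": ["lstm", "gru"],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [16, 32],
}

def iter_configs(space):
    """Yield one dict per combination of hyper-parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

configs = list(iter_configs(search_space))
print(len(configs))  # number of trials a full grid search would need
print(configs[0])    # first configuration in the grid
```

Each configuration would then be trained and compared on the validation set; a full grid grows multiplicatively, so in practice only a subset of combinations is usually tried.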

#### The template shows how a multi-modality (text and image inputs), multi-objective (predicting both price and type) solution can be built for the current problem: Embedding and reduce_mean layers handle the text inputs, and a convolutional layer with max pooling handles the image inputs. The loss weights have been set in the template. Whether a set of hyper-parameters is good or bad depends on how well the neural network is able to learn: it should neither under-fit nor overfit.
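
In a multi-objective model the per-task losses are combined into a single number by a weighted sum, which is what loss weights control. The sketch below shows that arithmetic in plain Python for one sample. The function names, the probabilities, and the default weights of 0.5 are assumptions chosen for illustration, not values taken from the template.

```python
import math

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for one sample given predicted class probabilities."""
    return -math.log(probs[true_index])

def combined_loss(price_probs, price_true, type_probs, type_true,
                  price_weight=0.5, type_weight=0.5):
    """Weighted sum of the price loss and the type loss (weights are hypothetical)."""
    price_loss = cross_entropy(price_probs, price_true)
    type_loss = cross_entropy(type_probs, type_true)
    return price_weight * price_loss + type_weight * type_loss

# One hypothetical listing: 3 price categories, 2 listing types.
loss = combined_loss([0.7, 0.2, 0.1], 0, [0.25, 0.75], 1)
print(loss)

# Setting one weight to 0 makes the model optimise only the other objective.
price_only = combined_loss([0.7, 0.2, 0.1], 0, [0.25, 0.75], 1,
                           price_weight=1.0, type_weight=0.0)
print(price_only)
```

Raising the weight of one task makes its errors count more in the gradient, which is a way to favour price prediction when only the price is submitted.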

# Problem Definition
#### Define the problem. What is the input? What is the output? What data mining function is required? What could be the challenges? What is the impact? What is an ideal solution?
#### ✅We are given the textual summaries and images of multiple Airbnb listings in Montreal in 2019. From these summaries and images we have to classify the type of each listing as well as the price category to which the property belongs. The following 3 sets of inputs are used across all the experiments.

- Text(summary) + Image
- Text(summary) only
- Image only

#### Moreover, the following 3 sets of outputs, representing the type and price category of the property listing, are possible in the above experiments. Since the problem asks for the price, I have used only the price prediction when submitting solutions on Kaggle.

- Price + Type
- Price only
- Type only

#### Classification is required for this problem. The challenges could be low image resolution and unclear or noisy textual summaries. The impact might be misclassification of the types and prices, leading to false positives or false negatives. The ideal solution is a classification algorithm that accurately identifies the price and type categories for a given listing.

# Theoretical Questions

#### 🌈Based on the provided template, describe the format of the input file (sdf file).
#### 🌈What are the input tensors to the neural network model (their meaning, not just symbol)? What is each of their dims and their meaning (e.g. batch_size)?
#### 🌈For each dim of gnn_out, what does it symbolize? For each dim of avg, what does it symbolize?
#### 🌈What is the difference between segment_mean and tf.reduce_mean? For each dim of pred, what does it symbolize?
#### 🌈What is the motivation/theory/idea to use multiple gcn layers compared to just one? How many layers were used in the template?
#### ✅For sequential data, a fully-connected model is a good model
#### because such models are “structure agnostic”: no special assumptions need to be made about the inputs.
##### Source: 
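
On the question about `segment_mean` and `tf.reduce_mean`: `tf.math.segment_mean` averages the rows that share a segment id and returns one row per segment, while `tf.reduce_mean` averages over a whole axis regardless of any grouping. The sketch below mimics both in plain Python rather than calling TensorFlow. The helper names and the toy values are assumptions for illustration, and the segment ids are assumed to be sorted, as `segment_mean` expects.

```python
def segment_mean(rows, segment_ids):
    """Average the rows of each segment; returns one row per segment id."""
    num_segments = max(segment_ids) + 1
    dim = len(rows[0])
    sums = [[0.0] * dim for _ in range(num_segments)]
    counts = [0] * num_segments
    for row, seg in zip(rows, segment_ids):
        counts[seg] += 1
        for j, value in enumerate(row):
            sums[seg][j] += value
    # a segment id with no rows yields zeros instead of dividing by zero
    return [[s / c if c else 0.0 for s in seg_sum]
            for seg_sum, c in zip(sums, counts)]

def reduce_mean_axis0(rows):
    """Average all rows together, ignoring which graph they belong to."""
    return [sum(column) / len(rows) for column in zip(*rows)]

# Three nodes with 2 features each; the first two belong to graph 0, the last to graph 1.
node_features = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
node_to_graph = [0, 0, 1]

per_graph = segment_mean(node_features, node_to_graph)  # shape: (num_graphs, features)
overall = reduce_mean_axis0(node_features)              # shape: (features,)
print(per_graph)
print(overall)
```

In the notebook, this per-graph averaging is what turns the node-level `gnn_out` into one vector per graph before the final `Dense` layer.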


# Trial Discussion

#### ✅✅✅✅Across all 5 approaches, the multi-modality implementations outperform those using only text inputs, which in turn outperform the implementation that uses only image inputs.

## Read SDF format data (structured-data format)

In [1]:
import numpy as np
from tqdm.notebook import tqdm

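def read_sdf(file):
    """Read an SDF file and return a list of (nodes, edges, label) tuples."""
    with open(file, 'r') as rf:
        content = rf.read()
    # each record in the file is terminated by a '$$$$' line
    samples = content.split('$$$$')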
    
    def parse_sample(s):
        lines = s.splitlines()
        links = []
        nodes = []
        label = 0
        for l in lines:
            if l.strip() == '1.0':
                label = 1
            if l.strip() == '-1.0':
                label = 0
            if l.startswith('    '):
                feature = l.split()
                node = feature[3]
                nodes.append(node)
            elif l.startswith(' '):
                lnk = l.split()
                # edge: (from, to,) (1-based index)
                if int(lnk[0]) - 1 < len(nodes):
                    links.append((
                        int(lnk[0])-1, 
                        int(lnk[1])-1, # zero-based index
                        # int(lnk[2]) ignore edge weight
                    ))
        return nodes, np.array(links), label
    
    # skip empty chunks (len(s[0]) would raise IndexError on an empty string)
    return [parse_sample(s) for s in tqdm(samples) if len(s) > 0]

In [2]:
from sklearn.model_selection import train_test_split

training_set = read_sdf('train.sdf')
training_set, validation_set = train_test_split(training_set, test_size=0.15,)

  0%|          | 0/25024 [00:00<?, ?it/s]

In [3]:
testing_set  = read_sdf('test_x.sdf')

  0%|          | 0/12326 [00:00<?, ?it/s]

In [4]:
print(training_set[1])

(['S', 'O', 'O', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'], array([[ 0, 13],
       [ 0, 14],
       [ 1, 16],
       [ 2, 17],
       [ 3, 20],
       [ 4, 21],
       [ 5, 22],
       [ 6, 23],
       [ 7, 11],
       [ 7, 12],
       [ 7, 13],
       [ 8, 13],
       [ 8, 15],
       [ 9, 15],
       [ 9, 21],
       [ 9, 26],
       [10, 20],
       [10, 21],
       [10, 27],
       [11, 14],
       [11, 16],
       [12, 15],
       [12, 20],
       [14, 17],
       [16, 18],
       [17, 19],
       [18, 19],
       [18, 22],
       [19, 23],
       [22, 24],
       [23, 25],
       [24, 25]]), 0)


## Visualizing/Inspecting a Sample

In [5]:
# # !pip install --quiet networkx
# !pip install --user --quiet decorator==4.3.0

# !pip install --user --quiet networkx==2.3
# !pip install --user --quiet matplotlib==2.2.3
# import networkx as nx
# import matplotlib.pyplot as plt
# from matplotlib import cm
# colors = cm.rainbow(np.linspace(0, 1, 50))

In [6]:
# def visualize(sample):
#     G=nx.Graph()
#     nodes = sample[0]
#     edges = sample[1]
    
#     labeldict={}
#     node_color=[]
#     for i,n in enumerate(nodes):
#         G.add_node(i)
#         labeldict[i]=n
#         node_color.append(colors[hash(n)%len(colors)])

#     # a list of nodes:
#     for e in edges:
#         G.add_edge(e[0], e[1])
        
#     nx.draw(G, labels=labeldict, with_labels = True, node_color = node_color)
#     plt.show()
    
#     return G

In [7]:
# plt.clf()
# visualize(training_set[20])

## Preprocessing:

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer

max_vocab = 500
max_len = 100


# build vocabulary from training set
all_nodes = [s[0] for s in training_set]
tokenizer = Tokenizer(num_words=max_vocab)
tokenizer.fit_on_texts(all_nodes)
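
To make the tokenization step easier to follow, here is a plain-Python sketch of the idea behind it: each atom symbol gets an integer id, with the most frequent symbol receiving 1 and 0 left free for padding. This is a simplified stand-in, not the Keras `Tokenizer` implementation (it ignores options such as lowercasing and `num_words`), and the helper names and toy molecules are hypothetical.

```python
from collections import Counter

def build_vocab(samples):
    """Map each atom symbol to an id, most frequent first, starting at 1."""
    counts = Counter(atom for nodes in samples for atom in nodes)
    return {atom: idx
            for idx, (atom, _) in enumerate(counts.most_common(), start=1)}

def encode(nodes, vocab):
    """Convert a list of atom symbols to ids, dropping unknown symbols."""
    return [vocab[atom] for atom in nodes if atom in vocab]

# Two hypothetical molecules given as lists of atom symbols.
toy_samples = [["C", "C", "O"], ["C", "N", "O", "C"]]
vocab = build_vocab(toy_samples)
encoded = encode(["N", "C", "O"], vocab)
print(vocab)
print(encoded)
```

Because carbon dominates most molecules, it ends up with id 1, which is consistent with the long runs of 1 in the batch printed further down.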

In [9]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import random
random.seed(0)

def prepare_single_batch(samples):
    sample_nodes = [s[0] for s in samples]
    sample_nodes = tokenizer.texts_to_sequences(sample_nodes)
    sample_nodes = pad_sequences(sample_nodes, padding='post')
    max_nodes_len = np.shape(sample_nodes)[1]
    edges = [s[1]+i*max_nodes_len for i,s in enumerate(samples)]
    edges = [e for e in edges if len(e) > 0]
    node_to_graph = [[i]*max_nodes_len for i in range(len(samples))]
    
    all_nodes = np.reshape(sample_nodes, -1)
    all_edges = np.concatenate(edges)

    node_to_graph = np.reshape(node_to_graph, -1)
    return {
        'data': all_nodes,
        'edges': all_edges,
        'node2grah': node_to_graph,
    }, np.array([s[2] for s in samples])



def gen_batch(dataset, batch_size=16, repeat=False, shuffle=True):
    while True:
        dataset = list(dataset)
        if shuffle:
            random.shuffle(dataset)
        l = len(dataset)
        for ndx in range(0, l, batch_size):
            batch_samples = dataset[ndx:min(ndx + batch_size, l)]
            yield prepare_single_batch(batch_samples)
        if not repeat:
            break
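
The batching trick used in `prepare_single_batch` is to treat a batch of graphs as one large disconnected graph: the padded node arrays are concatenated, and the edge indices of graph `i` are shifted by `i * max_nodes_len` so they keep pointing at the right nodes. The sketch below reproduces that idea with plain lists; `batch_graphs` and the two toy graphs are hypothetical and only mirror the logic above.

```python
def batch_graphs(graphs, pad_id=0):
    """Merge graphs given as (node_ids, edges) pairs into one flat batch.

    Edge indices are zero-based within each graph on input and are
    shifted by the graph's offset in the concatenated node list.
    """
    max_len = max(len(nodes) for nodes, _ in graphs)
    all_nodes, all_edges, node_to_graph = [], [], []
    for i, (nodes, edges) in enumerate(graphs):
        offset = i * max_len
        # pad every graph to the same number of node slots
        all_nodes.extend(nodes + [pad_id] * (max_len - len(nodes)))
        all_edges.extend((src + offset, dst + offset) for src, dst in edges)
        # padding slots are also assigned to graph i, as in the code above
        node_to_graph.extend([i] * max_len)
    return all_nodes, all_edges, node_to_graph

graph_a = ([1, 1, 2], [(0, 1), (1, 2)])  # 3 nodes, 2 edges
graph_b = ([1, 3], [(0, 1)])             # 2 nodes, 1 edge
nodes, edges, node_to_graph = batch_graphs([graph_a, graph_b])
print(nodes)
print(edges)
print(node_to_graph)
```

The edge `(0, 1)` of the second graph becomes `(3, 4)` after the shift, which is the same pattern as the edge indices jumping from one block to the next in the printed batch below.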


In [10]:
# showing one batch:
for train_batch in gen_batch(training_set, batch_size=4):
    for k,v in train_batch[0].items():
        print(k)
        print(v)
    print('label', train_batch[1])
    break

data
[2 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 4 2 2 2 2 3 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 4 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 6 2 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
edges
[[ 0 12]
 [ 0 18]
 [ 1  2]
 [ 1  8]
 [ 1 10]
 [ 2  3]
 [ 3  9]
 [ 4  8]
 [ 4 13]
 [ 5 11]
 [ 5 13]
 [ 6 11]
 [ 7 13]
 [ 8  9]
 [ 9 11]
 [10 12]
 [10 14]
 [12 15]
 [14 16]
 [15 17]
 [16 17]
 [25 33]
 [25 41]
 [26 36]
 [27 36]
 [28 30]
 [29 30]
 [30 34]
 [31 33]
 [31 35]
 [31 36]
 [32 34]
 [32 35]
 [32 37]
 [33 38]
 [34 39]
 [37 40]
 [38 42]
 [39 43]
 [40 43]
 [41 42]
 [50 54]
 [50 55]
 [51 69]
 [52 54]
 [52 56]
 [53 62]
 [53 68]
 [54 57]
 [55 56]
 [55 58]
 [56 59]
 [57 60]
 [57 61]
 [58 65]
 [59 66]
 [60 63]
 [61 64]
 [62 63]
 [62 64]
 [65 66]
 [67 68]
 [67 69]
 [67 70]
 [69 71]
 [70 72]
 [71 73]
 [72 73]
 [75 97]
 [76 82]
 [76 86]
 [77 87]
 [78 88]
 [79 82]
 [79 87]
 [80 88]
 [80 94]
 [81 83]
 [81 84]
 [81 85]
 [82 85]
 [82 91]
 [83 87]
 [83 88]
 [84 86]
 [84 89]
 [86 90]
 [

In [11]:
!pip install --quiet tf2_gnn

# https://github.com/microsoft/tf2-gnn
# https://github.com/microsoft/tf2-gnn/blob/master/tf2_gnn/layers/gnn.py

from tf2_gnn.layers.gnn import GNN, GNNInput

In [12]:
import tensorflow as tf
from tensorflow.math import segment_mean
from tensorflow import keras
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense
from tensorflow.keras.optimizers import Adam

data = keras.Input(batch_shape=(None,))

# the first dim differs from that of `data`: it is the total number of edges in this batch
edge = keras.Input(batch_shape=(None, 2), dtype=tf.int32)
node2graph = keras.Input(batch_shape=(None,), dtype=tf.int32)
embeded = Embedding(tokenizer.num_words, 20)(data)

# number of graphs (number of samples)
num_graph = tf.reduce_max(node2graph)+1

gnn_input = GNNInput(
    node_features=embeded,
    adjacency_lists=(edge,),
    node_to_graph_map=node2graph, 
    num_graphs=num_graph,
)

# https://github.com/microsoft/tf2-gnn/blob/master/tf2_gnn/layers/gnn.py
params = GNN.get_default_hyperparameters()

# Attention aggregation mechanism
# params["message_calculation_class"] = "rgat"
# params["num_heads"] = 8

# GGNN aggregation mechanism
# params["message_calculation_class"] = "ggnn"

# GNN-FiLM aggregation mechanism
params["message_calculation_class"] = "gnn_film"
params["film_parameter_MLP_hidden_layers"] = 8

params["hidden_dim"] = 32
gnn_layer = GNN(params)
gnn_out = gnn_layer(gnn_input)

print('gnn_out', gnn_out)

# https://www.tensorflow.org/api_docs/python/tf/math/segment_mean
avg = segment_mean(
    data=gnn_out,
    segment_ids=node2graph
)
print('mean:', avg)

pred = Dense(1, activation='sigmoid')(avg)
print('pred:', pred)

model = Model(
    inputs={
        'data': data,
        'edges': edge,
        'node2grah': node2graph,
    },
    outputs=pred
)
model.summary()

gnn_out KerasTensor(type_spec=TensorSpec(shape=(None, 32), dtype=tf.float32, name=None), name='gnn/StatefulPartitionedCall:0', description="created by layer 'gnn'")
mean: KerasTensor(type_spec=TensorSpec(shape=(None, 32), dtype=tf.float32, name=None), name='tf.math.segment_mean/SegmentMean:0', description="created by layer 'tf.math.segment_mean'")
pred: KerasTensor(type_spec=TensorSpec(shape=(None, 1), dtype=tf.float32, name=None), name='dense/Sigmoid:0', description="created by layer 'dense'")
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None,)]            0           []                               
                                                                                                  
 input_1 (InputLayer)           [(None,)]            0           []                      
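
A common motivation for stacking several message-passing layers is reach: one layer lets a node see only its direct neighbours, while k layers let information travel k hops. The sketch below shows this with a simple mean aggregation on a hypothetical 4-node path graph. It is a deliberately simplified stand-in and not the GNN-FiLM layer from `tf2_gnn`.

```python
def propagate(features, edges):
    """One round of mean aggregation over each node and its neighbours."""
    neighbours = {i: [i] for i in range(len(features))}  # include the node itself
    for a, b in edges:                                   # treat edges as undirected
        neighbours[a].append(b)
        neighbours[b].append(a)
    return [sum(features[j] for j in neighbours[i]) / len(neighbours[i])
            for i in range(len(features))]

edges = [(0, 1), (1, 2), (2, 3)]  # path graph 0 - 1 - 2 - 3
signal = [1.0, 0.0, 0.0, 0.0]     # only node 0 carries information at the start

reached = []  # nodes that have received some signal after each layer
hidden = signal
for layer in range(1, 4):
    hidden = propagate(hidden, edges)
    reached.append([i for i, value in enumerate(hidden) if value > 0])
    print(layer, reached[-1])
```

Node 3 only receives the signal from node 0 after the third round, so a single layer could not relate atoms that sit far apart in a molecule.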

In [13]:
model.compile(
    loss='BinaryCrossentropy',
    metrics=['AUC']
)

In [14]:
import math

batch_size = 32
num_batchs = math.ceil(len(training_set) / batch_size)
num_batchs_validation = math.ceil(len(validation_set) / batch_size)

model.fit(
    gen_batch(
        training_set, batch_size=batch_size, repeat=True
    ),
    steps_per_epoch=num_batchs,
    epochs=25,
    validation_data=gen_batch(
        validation_set, batch_size=batch_size, repeat=True
    ),
    validation_steps=num_batchs_validation,
)

Epoch 1/25




Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x22ed8e1abb0>

In [15]:
y_pred = model.predict(
    gen_batch(testing_set, batch_size=16, shuffle=False)
)
y_pred = np.reshape(y_pred, -1)

In [16]:
len(y_pred)

12326

In [17]:
import pandas as pd 
submission = pd.DataFrame({'label':y_pred})
submission.index.name = 'id'
submission.to_csv('sample_submission_rgat.csv')