# Table of Contents
* [Project Goals](#project_goals)
* [Introduction](#introduction)
* [Loading Data](#importing_data)
* [Exploratory Data Analysis](#exploratory_data_analysis)
* [Feature Engineering](#feature_engineering)
* [Modeling](#modeling)
* [Model Evaluation](#model_evaluation)
* [Predictions](#prediction)
* [Conclusion](#conclusion)

<a class="anchor" id="project_goals"></a>

# <p style="background:#00003f url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:white; text-align:center; border-radius:20px 30px; width:92%; padding:30px; font-weight:bold">Project Goals</p>

The task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board.

What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss cancer cases in their departments. If you want to know more you can read this article.

As a final output from this task I will provide a submission.csv file for the test data set with two columns: document ID and a prediction, and a Jupyter notebook with code and documentation giving answers to the following questions:

-	How did you decide to do feature engineering?
-	How did you decide which models to try (if you decide to train any models)?
-	How did you perform validation of your model?
-	What metrics did you measure?
-	How do you expect your model to perform on test data (in terms of your metrics)?
-	How fast will your algorithm perform and how could you improve its performance if you would have more time?
-	How do you think you would be able to improve your algorithm if you would have more data?
-	What potential issues do you see with your algorithm?

----------------

<a class="anchor" id="introduction"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:40%; padding:30px; font-weight:bold">Introduction</p>

As a first step in solving this problem, we load the provided CSV files using the `Pandas` library. The training CSV file contains 100 rows with three columns: `url`, `doc_id`, and `label`. The test CSV file has 48 rows with two columns: `url` and `doc_id`. The goal is to train a machine learning model that predicts a `label` for each document in the test CSV, based on the data available in the training CSV.

In [1]:
import pandas as pd

In [2]:
train_csv = pd.read_csv(filepath_or_buffer="train.csv")
print("Training set shape", train_csv.shape)
train_csv.head(10)

Training set shape (100, 3)


Unnamed: 0,url,doc_id,label
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3
5,http://krebszentrum.kreiskliniken-reutlingen.d...,8,1
6,http://krebszentrum.kreiskliniken-reutlingen.d...,9,1
7,http://krebszentrum.kreiskliniken-reutlingen.d...,10,1
8,http://krebszentrum.kreiskliniken-reutlingen.d...,11,1
9,http://krebszentrum.kreiskliniken-reutlingen.d...,12,1


In [3]:
test_csv = pd.read_csv(filepath_or_buffer="test.csv")
print("Test set shape", test_csv.shape)
test_csv.head()

Test set shape (48, 2)


Unnamed: 0,url,doc_id
0,http://chirurgie-goettingen.de/medizinische-ve...,0
1,http://evkb.de/kliniken-zentren/chirurgie/allg...,2
2,http://krebszentrum.kreiskliniken-reutlingen.d...,7
3,http://marienhospital-buer.de/mhb-av-chirurgie...,15
4,http://marienhospital-buer.de/mhb-av-chirurgie...,16


In [4]:
tumor_keywords = pd.read_csv(filepath_or_buffer="keyword2tumor_type.csv")
print("Tumor keywords set shape", tumor_keywords.shape)
tumor_keywords.head()

Tumor keywords set shape (126, 2)


Unnamed: 0,keyword,tumor_type
0,senologische,Brust
1,brustzentrum,Brust
2,breast,Brust
3,thorax,Brust
4,thorakale,Brust


We have `100` documents in the training set and `48` in the test set. Of the training documents, `32` mention no tumor board (label = 1), `59` mention a tumor board without our being certain that it is the main focus of the page (label = 2), and `9` are certainly dedicated to tumor boards (label = 3).

In [5]:
train_csv.groupby(by="label").size()

label
1    32
2    59
3     9
dtype: int64
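
To put the imbalance into numbers, here is a minimal sketch that derives the class proportions and the accuracy of a trivial "always predict the most frequent label" baseline from the counts printed above. The variable names are illustrative and not part of the original notebook.

```python
# Class counts as printed by the groupby above.
label_counts = {1: 32, 2: 59, 3: 9}

total = sum(label_counts.values())
proportions = {label: count / total for label, count in label_counts.items()}

# Accuracy of a classifier that always predicts the most frequent label.
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

print(proportions)
print(majority_label, majority_baseline)
```

Any model trained on this data should be compared against that baseline rather than against chance.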

--------------------

<a class="anchor" id="importing_data"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Loading Data</p>

In [8]:
def read_html(doc_id: int) -> str:
    """
    Reads the HTML document with the given doc_id from the local `htmls/` directory.
    """
    # latin1 can decode any byte sequence, but it garbles UTF-8 pages (e.g. "ä" becomes "Ã¤")
    with open(file=f"htmls/{doc_id}.html", mode="r", encoding="latin1") as f:
        html = f.read()
    return html

##### Read the HTML documents in the train_csv

In [9]:
train_csv["html"] = train_csv["doc_id"].apply(read_html)
train_csv.head()

Unnamed: 0,url,doc_id,label,html
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""..."
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."


##### Extract the text from the HTML

In [13]:
import warnings
from bs4 import BeautifulSoup
warnings.filterwarnings(action="ignore")

In [15]:
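def extract_html_text(html: str) -> str:
    """Returns the visible text of an HTML document, without scripts and styles."""
    bs = BeautifulSoup(markup=html, features="lxml")
    # drop <script> and <style> elements, whose content is not visible page text
    for element in bs.find_all(["script", "style"]):
        element.decompose()
    # join the remaining text nodes with single spaces
    return bs.get_text(separator=" ")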
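
For readers without `BeautifulSoup` and `lxml` installed, the same idea (collect text nodes, skip everything inside `<script>` and `<style>`) can be sketched with the standard library alone. This is an illustrative alternative, not the code used in this notebook; `TextExtractor` and `extract_text_stdlib` are hypothetical names, and the parser is less forgiving of malformed HTML than `lxml`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are outside of skipped elements
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text_stdlib(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<html><head><style>p{}</style><script>x=1</script></head>"
          "<body><p>Tumor</p> <p>Board</p></body></html>")
extracted = extract_text_stdlib(sample)
print(extracted.split())
```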

In [16]:
train_csv["html_text"] = train_csv["html"].apply(extract_html_text)
train_csv.head()

Unnamed: 0,url,doc_id,label,html,html_text
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...,\n \n \n \n \n \n Elbe-Elster Klinikum - Chiru...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""...",\n \n \n \n \n \n \n Onkologisches Zentrum - K...
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zentrum - SozialpÃ¤diatrisches Zentrum -...
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Leistung - Spezielle UnterstÃ¼tzung bei ...
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zuweiser - Tumorkonferenzen - Tumorkonfe...
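
The `Ã¤` and `Ã¼` sequences in the `html_text` column are what UTF-8 encoded umlauts look like when they are decoded as Latin-1. The sketch below reproduces the effect and shows one possible fallback strategy. It assumes the pages are mostly UTF-8, which the garbled output suggests but the data description does not confirm; `decode_html_bytes` is a hypothetical helper, not part of the notebook.

```python
def decode_html_bytes(raw: bytes) -> str:
    """Decode as UTF-8 first and fall back to Latin-1 only if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin1")


# A UTF-8 encoded page title, decoded the way read_html does it.
utf8_page = "Sozialpädiatrisches Zentrum".encode("utf-8")
garbled = utf8_page.decode("latin1")
print(garbled)

# The fallback decoder handles both encodings.
print(decode_html_bytes(utf8_page))
print(decode_html_bytes("für".encode("latin1")))
```

To use this in place of `read_html`, the file would have to be opened in binary mode (`mode="rb"`) so that the raw bytes are available.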


We observe a very large number of newline characters `\n` at the very beginning of every entry in the new `html_text` column. We therefore clean the text: we remove special characters, punctuation, and digits, collapse repeated whitespace, and then stem the words and remove stop words.

In [17]:
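from gensim.parsing import preprocessing

def preprocess_html_text(html_text: str) -> str:
    """Cleans raw page text: strips symbols and digits, stems, and removes stop words."""
    # replace non-alphanumeric characters (including punctuation) with spaces
    preprocessed_text = preprocessing.strip_non_alphanum(s=html_text)
    # collapse runs of whitespace, such as the leading newlines, into single spaces
    preprocessed_text = preprocessing.strip_multiple_whitespaces(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_punctuation(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_numeric(s=preprocessed_text)

    # note: gensim's Porter stemmer and stop-word list target English, while these pages are German
    preprocessed_text = preprocessing.stem_text(text=preprocessed_text)
    preprocessed_text = preprocessing.remove_stopwords(s=preprocessed_text)
    return preprocessed_text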



In [19]:
train_csv["preprocessed_html_text"] = train_csv["html_text"].apply(preprocess_html_text)
train_csv.head()

Unnamed: 0,url,doc_id,label,html,html_text,preprocessed_html_text
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...,\n \n \n \n \n \n Elbe-Elster Klinikum - Chiru...,elb elster klinikum chirurgi finsterwald suche...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""...",\n \n \n \n \n \n \n Onkologisches Zentrum - K...,onkologisch zentrum klinikum bayreuth aktuel ã...
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zentrum - SozialpÃ¤diatrisches Zentrum -...,zentrum sozialpã diatrisch zentrum stã dtisch ...
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Leistung - Spezielle UnterstÃ¼tzung bei ...,leistung speziel unterstã¼tzung bei der anmeld...
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<...",\n \n Zuweiser - Tumorkonferenzen - Tumorkonfe...,zuweis tumorkonferenzen tumorkonferenz gastroi...
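
To make the cleaning steps concrete, here is a rough standard-library approximation of the non-linguistic part of the pipeline (symbols, digits, whitespace, lowercasing). It is a sketch for illustration only: `simple_clean` is a hypothetical function, it does not stem or remove stop words, and it is not guaranteed to match gensim's output character for character.

```python
import re


def simple_clean(text: str) -> str:
    # replace punctuation and other non-word characters with spaces
    text = re.sub(r"[^\w\s]|_", " ", text)
    # drop digits
    text = re.sub(r"\d+", " ", text)
    # collapse runs of whitespace (including newlines) and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


cleaned = simple_clean("\n \n Zuweiser - Tumorkonferenzen 2021!")
print(cleaned)
```

Because `\w` is Unicode-aware for Python strings, correctly decoded umlauts such as `ü` survive this cleaning step.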


----------

<a class="anchor" id="exploratory_data_analysis"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Exploratory Data Analysis</p>

In [20]:
import plotly.express as px
import plotly.offline as pyo

# set notebook mode to work in offline
pyo.init_notebook_mode(connected=True)

In [21]:
train_csv['preprocessed_html_text'].apply(len)

0       8274
1      22589
2       8980
3       4053
4       4370
       ...  
95    175427
96      7044
97     13288
98     15349
99      8628
Name: preprocessed_html_text, Length: 100, dtype: int64

In [22]:
px.histogram(x=train_csv['preprocessed_html_text'].apply(len), title="Distribution of Text Length (Character Count)")

##### We observe that there is one document with 170-179K characters. All other documents have fewer than 50K characters.

In [23]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: text.split(" ")).apply(len),
             title="Distribution of Text Length (Word Count)")

##### There is one document with 27-28K words. Other documents all have < 6K words in total.


In [24]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: set(text.split(" "))).apply(len),
             title="Unique Words Count")

##### There is one document with 6500-7000 unique words. All others consist of < 2000 unique words.
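
The single very long document can also be flagged programmatically. The sketch below uses only the ten character counts visible in the truncated output earlier in this section and a hypothetical rule of thumb (more than ten times the median length); the threshold is an assumption chosen for illustration, not something tuned on the data.

```python
import statistics

# Character counts visible in the truncated output above (10 of the 100 documents).
visible_lengths = [8274, 22589, 8980, 4053, 4370, 175427, 7044, 13288, 15349, 8628]

median_length = statistics.median(visible_lengths)

# Hypothetical rule: flag documents more than 10x longer than the median.
outliers = [n for n in visible_lengths if n > 10 * median_length]

print(median_length, outliers)
```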


-----

<a class="anchor" id="modeling"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Modeling</p>

To solve our task, which falls under the umbrella of NLP, we will use a model called a Siamese network. Siamese networks can cope with class imbalance and small data sets. They are mostly used in few-shot learning tasks such as signature verification, face recognition, and object detection.

In [25]:
import random
import numpy as np
import tensorflow as tf

### Data Generators

##### While it is not crucial for this task, we would like to show how to properly use TensorFlow (and Keras) by implementing a data generator class.


In [26]:


class Pair(tf.keras.utils.Sequence):
    """Generates batches of text pairs with a target of 1 (same class) or 0 (different class)."""

    def __init__(self, dataframe: pd.Series, labels: pd.Series, n_batch: int, batch_size: int):
        super().__init__()
        self.dataframe = dataframe
        self.labels = labels
        self.n_batch = n_batch
        self.batch_size = batch_size
        self.all_classes = sorted(set(self.labels))
        # for every class, keep the documents of that class and the documents of all other classes
        self.anchor_groups = {}
        for target_class in self.all_classes:
            self.anchor_groups[target_class] = {
                "positive": self.dataframe[self.labels == target_class],
                "negative": self.dataframe[self.labels != target_class]
            }

    def __len__(self):
        return self.n_batch

    def __getitem__(self, item):
        pairs = []

        # every anchor yields one positive and one negative pair
        for _ in range(self.batch_size // 2):
            # pick the anchor class from the classes present in the data
            anchor_class = random.choice(self.all_classes)
            anchor_group = self.anchor_groups[anchor_class]["positive"]
            not_anchor_group = self.anchor_groups[anchor_class]["negative"]

            anchor = anchor_group.sample(n=1).iloc[0]
            positive = anchor_group.sample(n=1).iloc[0]
            negative = not_anchor_group.sample(n=1).iloc[0]

            pairs.append((anchor, positive, 1.0))
            pairs.append((anchor, negative, 0.0))

        random.shuffle(pairs)

        # keep texts and targets in separate arrays so that the targets stay numeric
        data_pairs = np.array([[first, second] for first, second, _ in pairs])
        targets = np.array([target for _, _, target in pairs], dtype=np.float32)

        return data_pairs, tf.convert_to_tensor(targets, dtype=tf.float32)

    def get_support_set(self, sample_size: int = 1):
        # sample `sample_size` reference documents per class
        support_set = {}
        for target_class in self.all_classes:
            support_set[target_class] = self.anchor_groups[target_class]["positive"].sample(n=sample_size)
        return support_set
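
The pair-sampling logic of the generator can be illustrated without pandas or TensorFlow. The following is a simplified sketch under stated assumptions: `sample_pairs` and the toy documents are hypothetical, and plain lists stand in for the pandas objects used above.

```python
import random


def sample_pairs(docs_by_label, n_anchors, rng):
    """Returns 2 * n_anchors tuples of (anchor, other, target)."""
    labels = sorted(docs_by_label)
    pairs = []
    for _ in range(n_anchors):
        anchor_label = rng.choice(labels)
        same_class = docs_by_label[anchor_label]
        # pool all documents that do not belong to the anchor's class
        other_classes = [doc for label in labels if label != anchor_label
                         for doc in docs_by_label[label]]

        anchor = rng.choice(same_class)
        pairs.append((anchor, rng.choice(same_class), 1.0))     # same class
        pairs.append((anchor, rng.choice(other_classes), 0.0))  # different class
    rng.shuffle(pairs)
    return pairs


# Toy documents, invented for this example.
toy_docs = {1: ["a1", "a2"], 2: ["b1", "b2", "b3"], 3: ["c1"]}
pairs = sample_pairs(toy_docs, n_anchors=4, rng=random.Random(0))
print(len(pairs))
```

As in the generator above, the positive example may coincide with the anchor itself, which is hard to avoid for classes with very few documents.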

### Model Definition

##### Here, we define our model, as a siamese network. The model is a sequence of layers, starting with a TextVectorization layer. This layer accepts natural language (text) as input, and maps it to an integer sequence. At initialization time, we should provide a vocabulary of words for it to be able to map the words at prediction time.
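
As a rough mental model of what an integer text-vectorization step does, consider the pure-Python sketch below. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, with the remaining indices assigned by descending frequency. `build_vocabulary` and `vectorize` are hypothetical helpers; the real Keras layer additionally standardizes the text and may break frequency ties differently.

```python
from collections import Counter


def build_vocabulary(corpus, max_tokens):
    counts = Counter(token for text in corpus for token in text.split())
    # two slots are reserved: 0 for padding, 1 for unknown tokens
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index for index, token in enumerate(most_common, start=2)}


def vectorize(text, vocabulary, sequence_length):
    ids = [vocabulary.get(token, 1) for token in text.split()]
    ids = ids[:sequence_length]                       # truncate long texts
    return ids + [0] * (sequence_length - len(ids))   # pad short texts


corpus = ["tumorkonferenz klinikum tumorkonferenz",
          "klinikum zentrum tumorkonferenz"]
vocabulary = build_vocabulary(corpus, max_tokens=2000)
encoded = vectorize("zentrum tumorkonferenz chirurgi", vocabulary, sequence_length=5)
print(encoded)
```

Note the truncation step: with a fixed output length, any tokens beyond that length are discarded, which matters for documents with thousands of words.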

##### Following the text vectorization layer, we stack three Dense layers with two Dropout layers in between. Lastly, we apply an L2 normalization layer, which scales every encoding to unit length so that the dot product of two encodings equals their cosine similarity.


In [27]:
class SiameseNetwork(tf.keras.Model):
    def __init__(self, corpora: pd.Series):
        super().__init__()
        # maps each text to a fixed-length sequence of integer token ids
        self.vectorizer_layer = tf.keras.layers.TextVectorization(
            max_tokens=2000,
            output_mode="int",
            output_sequence_length=512
        )
        self.vectorizer_layer.adapt(corpora.values)
        # note: the Dense layers consume the integer token ids directly (there is no Embedding layer)
        self.encoder = tf.keras.Sequential(layers=[
            self.vectorizer_layer,
            tf.keras.layers.Dense(units=256, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=128, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=64, activation=tf.keras.activations.relu),
            # scale every encoding to unit length
            tf.keras.layers.Lambda(function=lambda x: tf.math.l2_normalize(x, axis=1))
        ])
        # dot product of unit-length encodings = cosine similarity (higher means more similar)
        self.encoding_distance = tf.keras.layers.Dot(axes=1)

    def call(self, inputs, training=None):
        # inputs has shape (batch, 2): column 0 holds the anchors, column 1 the texts to compare
        anchors, supports = inputs[:, 0], inputs[:, 1]
        anchors_encoded = self.encoder(anchors, training=training)
        supports_encoded = self.encoder(supports, training=training)
        return self.encoding_distance((anchors_encoded, supports_encoded))

    def predict_with_support_set(self, entry, support_set: dict):
        # score the entry against every support text and average the scores per class
        scores = {}
        for instance_class, texts in support_set.items():
            class_scores = [self(np.array([entry, text]).reshape((-1, 2))) for text in texts]
            scores[instance_class] = tf.math.reduce_mean(class_scores)
        # the predicted class is the one whose support texts are most similar on average
        return max(scores, key=scores.get)
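
The scoring logic (L2-normalize, take the dot product, average per class, pick the best class) can be checked on toy vectors without TensorFlow. This is a minimal sketch; the helper names and the two-dimensional "encodings" are invented for illustration and are not outputs of the model.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    # dot product of the two unit-length vectors
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def predict_with_support(entry, support):
    # mean similarity between the entry and the support vectors of each class
    scores = {
        label: sum(cosine_similarity(entry, vector) for vector in vectors) / len(vectors)
        for label, vectors in support.items()
    }
    return max(scores, key=scores.get)


# Toy encodings, invented for this example.
support = {1: [[1.0, 0.0], [0.9, 0.1]], 2: [[0.0, 1.0], [0.2, 0.8]]}
similarity = cosine_similarity([3.0, 4.0], [4.0, 3.0])
predicted = predict_with_support([0.1, 0.9], support)
print(similarity, predicted)
```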

##### Below we instantiate a model object and compile it.


In [28]:
model = SiameseNetwork(corpora=train_csv["preprocessed_html_text"])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

At this point, we have our model, our data, and the data generator. We are ready to commence training. But, before we do that, let's split the data in `train_csv` into training and validation sets.

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_csv["preprocessed_html_text"], train_csv["label"],
                                                      test_size=0.2,
                                                      random_state=42, stratify=train_csv["label"])
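
With stratification, each class keeps roughly the same 80/20 proportion in both subsets. The back-of-the-envelope sketch below applies simple rounding to the class counts reported earlier; scikit-learn's exact allocation procedure is more involved, so treat the numbers as an approximation.

```python
# Class counts of the full training CSV, as reported earlier.
label_counts = {1: 32, 2: 59, 3: 9}
test_size = 0.2

# Approximate per-class validation sizes by rounding 20% of each class.
valid_counts = {label: round(count * test_size) for label, count in label_counts.items()}
train_counts = {label: label_counts[label] - valid_counts[label] for label in label_counts}

print(valid_counts)
print(train_counts)
```

Under this approximation, only about two documents with label = 3 end up in the validation set, so metrics for that class will be very noisy.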

In [30]:
# training params
BATCH_SIZE = 64
N_BATCH = 100
# we instantiate training and validation data / pair generators
TRAIN_PAIR_GENERATOR = Pair(dataframe=X_train, labels=y_train, n_batch=N_BATCH, batch_size=BATCH_SIZE)
VALID_PAIR_GENERATOR = Pair(dataframe=X_valid, labels=y_valid, n_batch=N_BATCH, batch_size=BATCH_SIZE)

##### Finally, we add an early stopping callback that stops training early if the validation loss does not improve for 3 consecutive epochs.


In [31]:
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
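
The patience rule can be expressed in a few lines of plain Python. This is a simplified sketch of the idea (no `min_delta`, no weight restoring), not the Keras implementation, and the validation losses are hypothetical values made up for the example.

```python
def epochs_until_stop(val_losses, patience):
    """Returns the number of epochs that run before early stopping triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # improvement: remember the new best loss and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: no improvement after the first epoch.
hypothetical_losses = [0.70, 0.72, 0.71, 0.73, 0.60]
stopped_after = epochs_until_stop(hypothetical_losses, patience=3)
print(stopped_after)
```

In this made-up run, the improvement in the fifth epoch is never reached because training stops after the fourth.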

In [32]:
history = model.fit(
    x=TRAIN_PAIR_GENERATOR,
    validation_data=VALID_PAIR_GENERATOR,
    epochs=10,
    callbacks=[early_stopping_callback],
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


------

<a class="anchor" id="model_evaluation"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Model Evaluation</p>

Once training has finished, we can evaluate the resulting model. All training information is stored in the `history` object returned by the `model.fit()` method. In the plot below, we show the model's training and validation accuracy over the epochs.

In [33]:
import plotly.graph_objects as go



figure = go.Figure()

figure.add_scatter(y=history.history["binary_accuracy"], name="Training Accuracy")
figure.add_scatter(y=history.history["val_binary_accuracy"], name="Validation Accuracy")

figure.update_layout(dict1={
    "title": "Model Accuracy During Training",
    "xaxis_title": "Epoch",
    "yaxis_title": "Accuracy"
}, overwrite=True)

figure.show()

Next, we make predictions on the validation set. The validation metrics are somewhat optimistic estimates of the model's performance on unseen data, because the validation set was already used during training (here, for early stopping). We would therefore expect the metrics to be lower in a production setting, though hopefully not by much.

In [34]:
y_pred = X_valid.apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(7)
))

In [35]:
# build a classification report
from sklearn.metrics import classification_report

report = classification_report(y_true=y_valid, y_pred=y_pred, zero_division=0)
print(report)

              precision    recall  f1-score   support

           1       0.40      0.33      0.36         6
           2       0.71      0.83      0.77        12
           3       0.00      0.00      0.00         2

    accuracy                           0.60        20
   macro avg       0.37      0.39      0.38        20
weighted avg       0.55      0.60      0.57        20
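
The averaged rows of the report follow directly from the per-class rows. The sketch below recomputes the F1 scores and their macro and weighted averages from the rounded precision, recall, and support values printed above, so small rounding differences from scikit-learn's internal values are possible.

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall, defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per label, copied from the report above.
report = {1: (0.40, 0.33, 6), 2: (0.71, 0.83, 12), 3: (0.00, 0.00, 2)}

f1 = {label: round(f1_score(p, r), 2) for label, (p, r, _) in report.items()}

# macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# weighted average: every class counts in proportion to its support
total_support = sum(support for _, _, support in report.values())
weighted_f1 = sum(f1[label] * support for label, (_, _, support) in report.items()) / total_support

print(f1, round(macro_f1, 2), round(weighted_f1, 2))
```

The gap between the macro average and the weighted average reflects how much the majority class (label = 2) dominates the weighted score.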



------

<a class="anchor" id="prediction"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Predictions</p>

We apply the same set of pre-processing steps as we did for the training data.

In [36]:
test_csv["html"] = test_csv["doc_id"].apply(read_html)
test_csv["html_text"] = test_csv["html"].apply(extract_html_text)
test_csv["preprocessed_html_text"] = test_csv["html_text"].apply(preprocess_html_text)
test_csv.head()

Unnamed: 0,url,doc_id,html,html_text,preprocessed_html_text
0,http://chirurgie-goettingen.de/medizinische-ve...,0,"<!doctype html>\n<html lang=""de"" dir=""ltr"">\n\...",\n \n \n \n \n \n BauchspeicheldrÃ¼se | Klinik...,bauchspeicheldrã¼s klinik fã¼r allgemein visze...
1,http://evkb.de/kliniken-zentren/chirurgie/allg...,2,"<!DOCTYPE html>\n<html lang=""de"" dir=""ltr"" cla...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...,chirurgi der bauchspeicheldrã¼s pankreaschirur...
2,http://krebszentrum.kreiskliniken-reutlingen.d...,7,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...",\n \n \n Brustzentrum Reutlingen: Behandlungsv...,brustzentrum reutlingen behandlungsverfahren k...
3,http://marienhospital-buer.de/mhb-av-chirurgie...,15,"<?xml version=""1.0"" encoding=""utf-8""?>\n<!DOCT...",\n \n \n \n \n Leistungsspektrum: Sankt Marien...,leistungsspektrum sankt marien hospit buer gmb...
4,http://marienhospital-buer.de/mhb-av-chirurgie...,16,"<?xml version=""1.0"" encoding=""utf-8""?>\n<!DOCT...",\n \n \n \n \n Leistungsspektrum: Sankt Marien...,leistungsspektrum sankt marien hospit buer gmb...


In [37]:
# do inference
test_csv["predictions"] = test_csv["preprocessed_html_text"].apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(sample_size=7)
))

##### Below, we explore the predictions on the test set.

In [38]:
test_csv["predictions"].value_counts()

2    40
1     8
Name: predictions, dtype: int64

In [39]:
test_csv[["doc_id", "predictions"]]

Unnamed: 0,doc_id,predictions
0,0,2
1,2,2
2,7,2
3,15,2
4,16,2
5,24,2
6,31,2
7,32,1
8,36,1
9,38,2


------

<a class="anchor" id="conclusion"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Conclusion</p>



In this section, we provide answers to the questions that were posed at the beginning of the assignment.

How did you decide to handle this amount of data?

    We used a data generator that samples training pairs on the fly. Because the data set is relatively small, all documents are loaded into memory up front and nothing needs to be streamed from disk.

How did you decide to do feature engineering?

    We haven't used any feature engineering techniques per se, though we have spent some effort on data pre-processing, with steps like removing punctuation, multiple whitespaces, non-alphanumerical characters, etc.

How did you decide which models to try (if you decide to train any models)?

    We decided to use a Siamese network because it is a common choice for few-shot settings like ours (natural language processing, small data set, class imbalance). The choice of layers is also common in the field: we interleave dropout layers with dense layers that have a decreasing number of units. Lastly, we L2-normalize the encodings so that the dot product of two encodings is their cosine similarity.

How did you perform validation of your model?

    Validation during training is handled by the TensorFlow library; we only pass in a validation set. The validation set was obtained by a stratified split of the provided data into training (80%) and validation (20%) sets.

What metrics did you measure?

    During training we measure binary accuracy. In the evaluation phase, we measure per-class precision, recall, and f1 scores on the validation set.

How do you expect your model to perform on test data (in terms of your metrics)?

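    We expect performance similar to that on the validation set. In the run shown above, the F1 score was 0.36 on label = 1, 0.77 on label = 2, and 0.00 on label = 3. These numbers vary between runs because the training pairs and the support set are sampled randomly. We hope for F1 > 0 on label = 3.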

How fast will your algorithm perform and how could you improve its performance if you would have more time?

    Each epoch takes around 30s to execute. This could be reduced by running the model on GPUs.

How do you think you would be able to improve your algorithm if you would have more data?

        Build a more complex model
        Try different loss functions
        Use pre-trained models

What potential issues do you see with your algorithm?

        It is very prone to overfitting, though this is almost certainly because of the small data set.
        We have zero precision and recall on the label = 3 which is concerning and should be addressed somehow.

