# Table of Contents
* [Project Goals](#project_goals)
* [Introduction](#introduction)
* [Loading Data](#importing_data)
* [Exploratory Data Analysis](#exploratory_data_analysis)
* [Feature Engineering](#feature_engineering)

<a class="anchor" id="project_goals"></a>

# <p style="background:#00003f url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:white; text-align:center; border-radius:20px 30px; width:92%; padding:30px; font-weight:bold">Project Goals</p>

The task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board.

What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss the cancer cases in their departments. If you want to know more you can read this article.

As a final output from this task I will provide a `submission.csv` file for the test data set with two columns, document ID and prediction, and a Jupyter notebook with code and documentation that answers the following questions:

- How did you decide to do feature engineering?
- How did you decide which models to try (if you decide to train any models)?
- How did you perform validation of your model?
- What metrics did you measure?
- How do you expect your model to perform on test data (in terms of your metrics)?
- How fast will your algorithm perform and how could you improve its performance if you had more time?
- How do you think you would be able to improve your algorithm if you had more data?
- What potential issues do you see with your algorithm?
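The `submission.csv` deliverable described above could be written along these lines. This is only a sketch: the column names `doc_id` and `prediction`, the helper `write_submission`, and the example predictions are assumptions made for illustration, not values taken from the task description.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_submission(rows: Iterable[Tuple[int, int]], out: TextIO) -> None:
    """Write (doc_id, prediction) pairs as a two-column CSV (hypothetical column names)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["doc_id", "prediction"])
    for doc_id, prediction in rows:
        writer.writerow([doc_id, prediction])


# Placeholder predictions, for illustration only.
example_rows = [(0, 1), (2, 2), (7, 1)]
buffer = io.StringIO()
write_submission(example_rows, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, the same helper could be passed a handle opened with `open("submission.csv", "w", newline="")`.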

<a class="anchor" id="introduction"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:40%; padding:30px; font-weight:bold">Introduction</p>

As a first step in solving this problem, we load the provided CSV files using the `Pandas` library. The training CSV file contains 100 rows with three columns: `url`, `doc_id`, and `label`. The test CSV file has 48 rows with two columns: `url` and `doc_id`. The goal is to train a machine learning model that predicts a `label` for the documents in the test CSV based on the data available in the training CSV.

In [1]:
import pandas as pd

In [2]:
train_csv = pd.read_csv(filepath_or_buffer="train.csv")
print("Training set shape", train_csv.shape)
train_csv.head(10)

Training set shape (100, 3)


|   | url | doc_id | label |
|---|-----|--------|-------|
| 0 | http://elbe-elster-klinikum.de/fachbereiche/ch... | 1 | 1 |
| 1 | http://klinikum-bayreuth.de/einrichtungen/zent... | 3 | 3 |
| 2 | http://klinikum-braunschweig.de/info.php/?id_o... | 4 | 1 |
| 3 | http://klinikum-braunschweig.de/info.php/?id_o... | 5 | 1 |
| 4 | http://klinikum-braunschweig.de/zuweiser/tumor... | 6 | 3 |
| 5 | http://krebszentrum.kreiskliniken-reutlingen.d... | 8 | 1 |
| 6 | http://krebszentrum.kreiskliniken-reutlingen.d... | 9 | 1 |
| 7 | http://krebszentrum.kreiskliniken-reutlingen.d... | 10 | 1 |
| 8 | http://krebszentrum.kreiskliniken-reutlingen.d... | 11 | 1 |
| 9 | http://krebszentrum.kreiskliniken-reutlingen.d... | 12 | 1 |


In [3]:
test_csv = pd.read_csv(filepath_or_buffer="test.csv")
print("Test set shape", test_csv.shape)
test_csv.head()

Test set shape (48, 2)


|   | url | doc_id |
|---|-----|--------|
| 0 | http://chirurgie-goettingen.de/medizinische-ve... | 0 |
| 1 | http://evkb.de/kliniken-zentren/chirurgie/allg... | 2 |
| 2 | http://krebszentrum.kreiskliniken-reutlingen.d... | 7 |
| 3 | http://marienhospital-buer.de/mhb-av-chirurgie... | 15 |
| 4 | http://marienhospital-buer.de/mhb-av-chirurgie... | 16 |


In [4]:
tumor_keywords = pd.read_csv(filepath_or_buffer="keyword2tumor_type.csv")
print("Tumor keywords set shape", tumor_keywords.shape)
tumor_keywords.head()

Tumor keywords set shape (126, 2)


|   | keyword | tumor_type |
|---|---------|------------|
| 0 | senologische | Brust |
| 1 | brustzentrum | Brust |
| 2 | breast | Brust |
| 3 | thorax | Brust |
| 4 | thorakale | Brust |


We have `100` documents in the training set and `48` in the test set. Of the training documents, `32` mention no tumor board (label = 1), `59` mention a tumor board without it clearly being the main focus of the page (label = 2), and `9` are clearly dedicated to tumor boards (label = 3).

In [5]:
train_csv.groupby(by="label").size()

label
1    32
2    59
3     9
dtype: int64
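The counts above show a clear class imbalance. As a rough sketch using only the standard library, the share of each label can be derived from those counts. The variable names are illustrative; in the notebook itself, `train_csv["label"].value_counts(normalize=True)` should give the same proportions.

```python
from collections import Counter

# Label counts as reported by the groupby above.
label_counts = Counter({1: 32, 2: 59, 3: 9})
total = sum(label_counts.values())

# Share of each class in the training set.
label_share = {label: count / total for label, count in sorted(label_counts.items())}
for label, share in label_share.items():
    print(f"label {label}: {share:.0%}")

# Accuracy of a baseline that always predicts the most frequent label.
majority_label, majority_count = label_counts.most_common(1)[0]
majority_baseline_accuracy = majority_count / total
```

With only 9 examples of label 3, per-class metrics are likely to be more informative than plain accuracy.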

<a class="anchor" id="importing_data"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Loading Data</p>

In [8]:
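def read_html(doc_id: int) -> str:
    """
    Reads the HTML document with the given doc_id from the local htmls/ directory.
    """
    # latin1 can decode any byte sequence, so this never raises, but pages that
    # are actually UTF-8 come out garbled (e.g. "ä" is read as "Ã¤").
    with open(file=f"htmls/{doc_id}.html", mode="r", encoding="latin1") as f:
        html = f.read()
    return html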
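Reading every file as `latin1` never fails, but a UTF-8 page decoded that way shows mojibake such as `SozialpÃ¤diatrisches`. A minimal sketch of one alternative, assuming most pages are UTF-8: try UTF-8 first and fall back to `latin1` only when decoding fails. The helper name `decode_html_bytes` is hypothetical and not part of the notebook.

```python
def decode_html_bytes(raw: bytes) -> str:
    """Decode raw HTML bytes, preferring UTF-8 and falling back to latin1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin1")


utf8_page = "Sozialpädiatrisches Zentrum".encode("utf-8")
latin1_page = "Unterstützung".encode("latin1")

# Decoding UTF-8 bytes as latin1 reproduces the garbling.
garbled = utf8_page.decode("latin1")
print(garbled)
print(decode_html_bytes(utf8_page))
print(decode_html_bytes(latin1_page))
```

In the notebook this would mean opening each file with `mode="rb"` and passing `f.read()` to the helper.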

##### Read the HTML documents in the train_csv

In [9]:
train_csv["html"] = train_csv["doc_id"].apply(read_html)
train_csv.head()

|   | url | doc_id | label | html |
|---|-----|--------|-------|------|
| 0 | http://elbe-elster-klinikum.de/fachbereiche/ch... | 1 | 1 | `<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...` |
| 1 | http://klinikum-bayreuth.de/einrichtungen/zent... | 3 | 3 | `<!DOCTYPE html>\n<html class="no-js" lang="de"...` |
| 2 | http://klinikum-braunschweig.de/info.php/?id_o... | 4 | 1 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` |
| 3 | http://klinikum-braunschweig.de/info.php/?id_o... | 5 | 1 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` |
| 4 | http://klinikum-braunschweig.de/zuweiser/tumor... | 6 | 3 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` |


##### Extract the text from the HTML

In [13]:
import warnings
from bs4 import BeautifulSoup
warnings.filterwarnings(action="ignore")

In [15]:
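def extract_html_text(html: str) -> str:
    """
    Returns the text of an HTML document, without script and style content.
    """
    bs = BeautifulSoup(markup=html, features="lxml")
    # Calling the soup object is shorthand for find_all().
    for tag in bs(name=["script", "style"]):
        tag.decompose()
    return bs.get_text(separator=" ")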
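To make the extraction step concrete without any third-party dependency, here is a small sketch of the same idea built on the standard library's `html.parser` instead of BeautifulSoup. It is not the notebook's implementation: the class `VisibleTextParser`, the function `visible_text`, and the sample markup are made up for illustration, and unlike the function above it also collapses whitespace.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace for readability.
    return " ".join(" ".join(parser.chunks).split())


sample_html = (
    "<html><head><title>Tumorkonferenz</title>"
    "<style>p { color: red; }</style></head>"
    "<body><script>var x = 1;</script><p>Montag 16 Uhr</p></body></html>"
)
sample_text = visible_text(sample_html)
print(sample_text)
```

The page title is kept while the script and style bodies are dropped, which is the behaviour the BeautifulSoup version aims for as well.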

In [16]:
train_csv["html_text"] = train_csv["html"].apply(extract_html_text)
train_csv.head()

|   | url | doc_id | label | html | html_text |
|---|-----|--------|-------|------|-----------|
| 0 | http://elbe-elster-klinikum.de/fachbereiche/ch... | 1 | 1 | `<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...` | `\n \n \n \n \n \n Elbe-Elster Klinikum - Chiru...` |
| 1 | http://klinikum-bayreuth.de/einrichtungen/zent... | 3 | 3 | `<!DOCTYPE html>\n<html class="no-js" lang="de"...` | `\n \n \n \n \n \n \n Onkologisches Zentrum - K...` |
| 2 | http://klinikum-braunschweig.de/info.php/?id_o... | 4 | 1 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` | `\n \n Zentrum - SozialpÃ¤diatrisches Zentrum -...` |
| 3 | http://klinikum-braunschweig.de/info.php/?id_o... | 5 | 1 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` | `\n \n Leistung - Spezielle UnterstÃ¼tzung bei ...` |
| 4 | http://klinikum-braunschweig.de/zuweiser/tumor... | 6 | 3 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` | `\n \n Zuweiser - Tumorkonferenzen - Tumorkonfe...` |


We observe a very large number of newline characters `\n` at the very beginning of every entry in the new `html_text` column. We therefore clean the text into a more readable form: we strip special characters, punctuation, digits, and redundant whitespace, then stem the words and remove stop words.

In [17]:
from gensim.parsing import preprocessing

def preprocess_html_text(html_text: str) -> str:
    """
    Cleans the extracted text: strips non-alphanumerics, punctuation and digits,
    then stems the words and removes stop words.
    """
    preprocessed_text = preprocessing.strip_non_alphanum(s=html_text)
    preprocessed_text = preprocessing.strip_multiple_whitespaces(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_punctuation(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_numeric(s=preprocessed_text)

    # Note: gensim's stemmer (Porter) and default stop word list target English,
    # so their effect on these German pages is limited.
    preprocessed_text = preprocessing.stem_text(text=preprocessed_text)
    preprocessed_text = preprocessing.remove_stopwords(s=preprocessed_text)
    return preprocessed_text
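Because the pages are German, a language-aware variant of this cleaning step may be worth trying. The sketch below uses only the standard library: it lowercases, keeps letters (including umlauts), collapses whitespace, and drops a few stop words. The function `simple_german_clean`, the tiny stop word set, and the sample string are hypothetical placeholders, not a vetted German stop word list, and no stemming is attempted.

```python
import re

# Tiny illustrative stop word set; a real list would be much longer.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "bei", "im", "in", "für"}

# Matches runs of characters that are not letters
# (digits, punctuation, whitespace, underscore).
NON_LETTERS = re.compile(r"[\W\d_]+")


def simple_german_clean(text: str) -> str:
    lowered = text.lower()
    letters_only = NON_LETTERS.sub(" ", lowered)
    tokens = [tok for tok in letters_only.split() if tok not in GERMAN_STOPWORDS]
    return " ".join(tokens)


# Made-up sample, loosely modelled on the extracted text.
raw_text = "\n \n \n Leistung - Spezielle Unterstützung bei der Anmeldung (2021)"
cleaned_text = simple_german_clean(raw_text)
print(cleaned_text)
```

This only helps if the HTML was decoded correctly in the first place; garbled umlauts would still be split into separate tokens.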



In [19]:
train_csv["preprocessed_html_text"] = train_csv["html_text"].apply(preprocess_html_text)
train_csv.head()

|   | url | doc_id | label | html | html_text | preprocessed_html_text |
|---|-----|--------|-------|------|-----------|------------------------|
| 0 | http://elbe-elster-klinikum.de/fachbereiche/ch... | 1 | 1 | `<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...` | `\n \n \n \n \n \n Elbe-Elster Klinikum - Chiru...` | elb elster klinikum chirurgi finsterwald suche... |
| 1 | http://klinikum-bayreuth.de/einrichtungen/zent... | 3 | 3 | `<!DOCTYPE html>\n<html class="no-js" lang="de"...` | `\n \n \n \n \n \n \n Onkologisches Zentrum - K...` | onkologisch zentrum klinikum bayreuth aktuel ã... |
| 2 | http://klinikum-braunschweig.de/info.php/?id_o... | 4 | 1 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` | `\n \n Zentrum - SozialpÃ¤diatrisches Zentrum -...` | zentrum sozialpã diatrisch zentrum stã dtisch ... |
| 3 | http://klinikum-braunschweig.de/info.php/?id_o... | 5 | 1 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` | `\n \n Leistung - Spezielle UnterstÃ¼tzung bei ...` | leistung speziel unterstã¼tzung bei der anmeld... |
| 4 | http://klinikum-braunschweig.de/zuweiser/tumor... | 6 | 3 | `<!doctype html>\n<html lang="de">\n<head>\n\t<...` | `\n \n Zuweiser - Tumorkonferenzen - Tumorkonfe...` | zuweis tumorkonferenzen tumorkonferenz gastroi... |


<a class="anchor" id="exploratory_data_analysis"></a>

# <p style="background:#990011 url('pylogo.jpg') no-repeat; font-family:tahoma; font-size:150%; color:#FCF6F5; text-align:center; border-radius:20px 30px; width:60%; padding:30px; font-weight:bold">Exploratory Data Analysis</p>

In [20]:
import plotly.express as px
import plotly.offline as pyo

# Set notebook mode to work offline
pyo.init_notebook_mode(connected=True)

In [21]:
train_csv['preprocessed_html_text'].apply(len)

0       8274
1      22589
2       8980
3       4053
4       4370
       ...  
95    175427
96      7044
97     13288
98     15349
99      8628
Name: preprocessed_html_text, Length: 100, dtype: int64

In [22]:
px.histogram(x=train_csv['preprocessed_html_text'].apply(len), title="Distribution of Text Length (Character Count)")

##### We observe that there is one document with 170-179K characters. All others have fewer than 50K characters.
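One way to pinpoint such an outlier programmatically is to compare each length to the median. The sketch below uses only the ten character counts visible in the output above. The factor of 10 is an arbitrary threshold chosen for illustration, and the variable names are not from the notebook.

```python
from statistics import median

# Character counts for rows 0-4 and 95-99, as printed above.
char_counts = {
    0: 8274, 1: 22589, 2: 8980, 3: 4053, 4: 4370,
    95: 175427, 96: 7044, 97: 13288, 98: 15349, 99: 8628,
}

typical_length = median(char_counts.values())
threshold = 10 * typical_length  # arbitrary cut-off, for illustration only

# Rows whose text is far longer than the typical document.
outlier_rows = [row for row, n in char_counts.items() if n > threshold]
print(typical_length, outlier_rows)
```

Row 95 is the only one flagged in this subset, which is consistent with the 170-179K document in the histogram. Whether to drop, truncate, or keep it is a separate modelling decision.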

In [23]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: text.split(" ")).apply(len),
             title="Distribution of Text Length (Word Count)")

##### There is one document with 27-28K words. All other documents have fewer than 6K words.
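A small caveat on counting words with `text.split(" ")`: splitting on a literal space yields empty strings for consecutive spaces and counts an empty text as one word, whereas `text.split()` with no argument splits on any run of whitespace. The sketch below illustrates the difference on made-up strings.

```python
# Made-up examples: normal text, a double space, and an empty string.
samples = ["zuweis tumorkonferenzen", "zentrum  klinikum", ""]

for text in samples:
    by_space = text.split(" ")    # literal single-space separator
    by_whitespace = text.split()  # any run of whitespace
    print(repr(text), len(by_space), len(by_whitespace))

space_counts = [len(text.split(" ")) for text in samples]
whitespace_counts = [len(text.split()) for text in samples]
```

Since the stemming and stop word steps rejoin tokens with single spaces, the two approaches should mostly agree on the preprocessed column. The difference matters mainly for empty or unnormalised text.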


In [24]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: set(text.split(" "))).apply(len),
             title="Unique Words Count")

##### There is one document with 6500-7000 unique words. All others contain fewer than 2000 unique words.
