# Keyword Detection on Websites

## Objective
+ Create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board

## What is a tumor board?
+ A tumor board is a consilium (panel) of doctors, usually from different disciplines, who discuss the cancer cases in their departments

### Initial Steps

+ Read in the train, test and keyword CSV files
+ Prepare to use Beautiful Soup to parse the HTML content
+ Install lxml, an HTML parser
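
The parsing step can be sketched as follows. This is a minimal illustration rather than the notebook's actual pipeline: the helper name `html_to_text` and the sample markup are assumptions, and the fallback to Python's built-in `html.parser` is only there so the sketch still runs when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document (hypothetical helper)."""
    try:
        # lxml is the parser named in the steps above
        soup = BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # assumption: fall back to the standard-library parser if lxml is missing
        soup = BeautifulSoup(html, "html.parser")
    # drop script and style elements, which carry no readable page text
    for tag in soup(["script", "style"]):
        tag.decompose()
    # join the remaining text nodes with single spaces
    return soup.get_text(separator=" ", strip=True)


# made-up markup, used only to exercise the helper
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Tumorboard</h1><p>Interdisziplinäre Konferenz</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample_html)
```

Removing `script` and `style` elements before extracting text keeps JavaScript and CSS tokens from being mistaken for page content in a later keyword search.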

In [1]:
# importing packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# reading in training csv
train_df = pd.read_csv(filepath_or_buffer = 'detection_train.csv')

# printing out 10 random samples from the dataset
train_df.sample(n = 10, random_state = 42)

Unnamed: 0,url,doc_id,label
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1
39,http://www.gesundheit-nordhessen.de/klinikum-k...,57,3
22,http://stereotaxie.uk-koeln.de/erkrankungen-th...,28,2
80,http://www.neurochirurgie.uk-erlangen.de/forsc...,120,1
10,http://krebszentrum.kreiskliniken-reutlingen.d...,13,2
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1


In [2]:
# reading in test csv
test_df = pd.read_csv('detection_test.csv')

# printing out 10 random samples from the dataset
test_df.sample(n = 10, random_state = 42)

Unnamed: 0,url,doc_id
27,http://www.josephstift-dresden.de/pressemittei...,71
40,http://www.pius-hospital.de/kliniken/gynaekolo...,123
26,http://www.interdisziplinaere-endoskopie.mri.t...,70
43,http://www.uk-augsburg.de/krebsbehandlung/diag...,134
24,http://www.hjk-muenster.de/unsere-kompetenzen/...,68
37,http://www.maria-josef-hospital.de/unsere-komp...,109
12,http://www.ctk.de/klinikum/karriere/stellenang...,46
19,http://www.gesundheitszentrum-wetterau.de/342/...,59
4,http://marienhospital-buer.de/mhb-av-chirurgie...,16
25,http://www.hufeland.de/de/abteilung-fuer-gastr...,69


In [3]:
# reading in keyword csv
keyword_df = pd.read_csv('keyword2tumor_type.csv')

# printing out 10 random samples from the dataset
keyword_df.sample(n = 10, random_state = 42)

Unnamed: 0,keyword,tumor_type
73,magen,Magen
19,malignome,Endokrine malignome
116,molekulare,Molekular
67,transit,Lymphom
94,bone,Sarkome
77,mammographie,Mamma carcinoma
31,gynäkologische,Gynäkologie
53,thyroid,Kopf-hals
117,oral,Oral
44,lympho,Hämatooncology


In [4]:
# now let's understand the distribution of labels across the training set

train_df['label'].value_counts().reset_index()

Unnamed: 0,index,label
0,2,59
1,1,32
2,3,9
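
The column names in this output depend on the pandas version: releases before 2.0 name them `index` and `label` as shown, while 2.0 and later name them `label` and `count`. Below is a minimal sketch of a form that gives the same column names in both cases and also adds each class's share. The `demo_df` frame is a stand-in rebuilt from the counts above, not the real training data.

```python
import pandas as pd

# stand-in for train_df, rebuilt from the counts shown above (59, 32, 9)
demo_df = pd.DataFrame({"label": [2] * 59 + [1] * 32 + [3] * 9})

label_counts = (
    demo_df["label"]
    .value_counts()                # counts per label, most frequent first
    .rename_axis("label")          # name the index explicitly
    .reset_index(name="count")     # name the value column explicitly
)

# fraction of the training set that each label accounts for
label_counts["share"] = label_counts["count"] / label_counts["count"].sum()
```

The shares make the class imbalance explicit: label 2 accounts for more than half of the training rows, and label 3 for fewer than one in ten.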


### Next steps

+ Load the HTML file for each document in the training set
+ Create a function that loads the corresponding HTML file from the htmls directory
+ Get the HTML contents and store them in a new column called "html"
+ Read the files as latin1 rather than the more common utf-8. German text does not in itself rule out utf-8; presumably some files are not valid utf-8, whereas latin1 can decode any byte sequence without raising an error
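
Reading every file as latin1 never fails, because latin1 maps each byte to a character, but it garbles umlauts in files that are actually UTF-8 encoded. A minimal sketch of a more careful alternative is shown below. It assumes the files are a mix of UTF-8 and latin1, which the source does not confirm, and the helper name `read_html_with_fallback` and the temporary demo files are hypothetical.

```python
import tempfile
from pathlib import Path


def read_html_with_fallback(path: str) -> str:
    """Decode as UTF-8 when possible, otherwise as latin1 (hypothetical helper)."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 accepts every byte value, so this branch cannot fail
        return raw.decode("latin1")


# demo: the same German snippet written once in each encoding
snippet = "<p>Tumorkonferenz für Brustkrebs</p>"
with tempfile.TemporaryDirectory() as tmp:
    utf8_file = Path(tmp) / "utf8.html"
    latin1_file = Path(tmp) / "latin1.html"
    utf8_file.write_bytes(snippet.encode("utf-8"))
    latin1_file.write_bytes(snippet.encode("latin1"))
    utf8_text = read_html_with_fallback(str(utf8_file))
    latin1_text = read_html_with_fallback(str(latin1_file))
```

Both calls return the original snippet with the umlaut intact. The heuristic is not airtight, since a latin1 file can by coincidence also be valid UTF-8, but for text containing umlauts that is rare.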

In [5]:
# defining a function to read an HTML file given a document ID
def read_html(doc_id: int) -> str:
    # open the HTML file with the given document ID
    with open(file=f"htmls/{doc_id}.html",
              mode="r",
              encoding="latin1") as f:
        # read the contents of the file and store it as a string
        html = f.read()
    # return the HTML string
    return html


# apply the read_html function to each value in the "doc_id" column of the train_df dataframe
# and store the resulting HTML strings in a new column called "html"
train_df["html"] = train_df["doc_id"].apply(read_html)

In [6]:
train_df

Unnamed: 0,url,doc_id,label,html
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1,<!DOCTYPE html>\n<!-- jsn_reta_pro 1.0.2 -->\n...
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""..."
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3,"<!doctype html>\n<html lang=""de"">\n<head>\n\t<..."
...,...,...,...,...
95,http://www.unicross.uni-freiburg.de/thema/unifm/,140,1,"<!DOCTYPE html>\n<html lang=""de-DE""\nprefix=""o..."
96,http://www.uniklinik-duesseldorf.de/patienten-...,141,1,"<!DOCTYPE html>\n<html class=""no-js"" lang=""de""..."
97,http://www.vivantes.de/fuer-sie-vor-ort/klinik...,144,2,"<!DOCTYPE html>\n\n<html class=""no-js"" lang=""d..."
98,http://www.vivantes.de/fuer-sie-vor-ort/klinik...,145,2,"<!DOCTYPE html>\n\n<html class=""no-js"" lang=""d..."
