# Keyword Detection on Websites



## Assignment
Your task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board. What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss cancer cases in their departments. If you want to know more, please read this article.

The expected result is a CSV file for the test data with the columns doc_id and prediction.

Bonus: if you would like to go the extra mile, try to identify the tumor board types: interdisciplinary, breast, and a third type of your choice. For these tumor boards, try to identify their schedule: day (e.g. Friday), frequency (e.g. weekly, bi-weekly, monthly), and start time.

## Data Description
You have train.csv and test.csv files and a folder with the corresponding .html files.

Files:

- train.csv contains the columns url, doc_id and label
- test.csv contains the columns url and doc_id
- htmls contains files named {doc_id}.html
- keyword2tumor_type.csv contains useful keywords for types of tumor boards

Description of tumor board labels:

- 1 (no evidence): tumor boards are not mentioned on the page
- 2 (medium confidence): tumor boards are mentioned, but the page is not completely dedicated to describing them
- 3 (high confidence): the page is completely dedicated to describing tumor board types and dates

You are asked to prepare a model using the HTML files referred to in train.csv and to make predictions for the HTML files from test.csv.

## Practicalities
You should prepare a Jupyter Notebook with the code that you used for making the predictions and the following documentation:

- How did you decide to handle this amount of data?
- How did you decide to do feature engineering?
- How did you decide which models to try (if you decide to train any models)?
- How did you perform validation of your model?
- What metrics did you measure?
- How do you expect your model to perform on test data (in terms of your metrics)?
- How fast will your algorithm perform, and how could you improve its performance if you had more time?
- How do you think you could improve your algorithm if you had more data?
- What potential issues do you see with your algorithm?

## Tips
To extract clean text from the page you can use the BeautifulSoup module like this:

    from bs4 import BeautifulSoup

    content = read_html()  # placeholder: returns the page's HTML as a string
    soup = BeautifulSoup(content, 'html.parser')
    clean_text = soup.get_text(' ')

If you decide that you don't need, for example, `<p>` tags in your document, you can do this:

    from bs4 import BeautifulSoup

    content = read_html()  # placeholder: returns the page's HTML as a string
    soup = BeautifulSoup(content, 'html.parser')

    # Remove every <p> element, including its contents, from the tree
    for tag in soup.find_all('p'):
        tag.decompose()

#### To download the dataset <a href="https://drive.google.com/drive/folders/1Qs2fLj9HmAzx2YGKmqkePCa1Acs5JY3Z?usp=sharing"> Click here </a>

# Load the necessary libraries

In [11]:

import os
import re
import spacy
import pandas as pd
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


In [12]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
# load spacy english model
nlp = spacy.load("en_core_web_sm")

In [14]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [15]:
# Load the dataset
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
keywords_df = pd.read_csv('keyword2tumor_type.csv')
keywords_df.head()

Unnamed: 0,keyword,tumor_type
0,senologische,Brust
1,brustzentrum,Brust
2,breast,Brust
3,thorax,Brust
4,thorakale,Brust


# Handling Data

In [21]:
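def preprocess_text(text):
    """Normalize extracted page text before TF-IDF vectorization."""
    # Keep only ASCII letters, digits and spaces (this also drops umlauts and the ':' in times)
    text = re.sub(r'[^a-zA-Z0-9 ]', '', text)
    text = text.lower()
    words = text.split()
    # Remove English stop words
    words = [word for word in words if word not in stop_words]
    # Stem with Porter, then lemmatize with WordNet
    words = [stemmer.stem(word) for word in words]
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)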
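
The keyword list shown earlier contains German terms (senologische, brustzentrum), so the pages are probably largely German. In that case the ASCII-only filter above turns a word like "für" into "fr". A minimal sketch of a Unicode-aware variant follows; the function name and the tiny stop-word set are hypothetical and only for illustration (NLTK's stopwords.words('german') would be a fuller list).

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "für", "im", "am", "um", "ist"}

def preprocess_text_unicode(text):
    """Lowercase, replace punctuation with spaces, keep letters such as ä, ö, ü, ß."""
    text = text.lower()
    # In Python 3, \w matches Unicode letters and digits; the underscore is removed separately.
    text = re.sub(r"[^\w\s]|_", " ", text)
    words = [w for w in text.split() if w not in GERMAN_STOPWORDS]
    return " ".join(words)

print(preprocess_text_unicode("Tumorkonferenz für Brustkrebs: jeden Dienstag"))
# tumorkonferenz brustkrebs jeden dienstag
```

Punctuation is still removed here, so a time such as "16:00" becomes "16 00"; schedule extraction is better run on the raw text.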

In [22]:
def extract_text_from_html(html_path):
    """Return the preprocessed visible text of an HTML file, or "" on failure."""
    try:
        # Files that are not valid UTF-8 raise UnicodeDecodeError here and yield ""
        with open(html_path, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        raw_text = soup.get_text(separator=" ")  # text without tags
        clean_text = re.sub(r'\s+', ' ', raw_text).strip()  # collapse whitespace
        return preprocess_text(clean_text)
    except Exception as e:
        print(f"Error processing {html_path}: {e}")
        return ""

In [24]:
def detect_tumor_board_type(text):
    doc = nlp(text)
    tumor_types = set()

    # Collect organization-like named entities as candidate board names
    # (spaCy's English models label facilities as "FAC", not "FACILITY")
    for ent in doc.ents:
        if ent.label_ in ["ORG", "GPE", "NORP", "FAC"]:
            tumor_types.add(ent.text)

    # Check for keyword-based tumor types; escape keywords so they match literally
    for _, row in keywords_df.iterrows():
        if re.search(re.escape(row['keyword']), text, re.IGNORECASE):
            tumor_types.add(row['tumor_type'])

    # Sort so the output string is deterministic
    return ", ".join(sorted(tumor_types)) if tumor_types else "Unknown"
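
The keyword check above matches a keyword anywhere inside a word and only records whether it occurred. A possible refinement, sketched below, anchors each keyword at a word start, allows inflected endings, and counts the hits per tumor type. The dictionary reproduces only the five rows shown by keywords_df.head(); the function name is hypothetical, and in the notebook the mapping would be built from keywords_df.

```python
import re
from collections import Counter

# The five rows shown by keywords_df.head(); the real file has more entries.
KEYWORD_TO_TYPE = {
    "senologische": "Brust",
    "brustzentrum": "Brust",
    "breast": "Brust",
    "thorax": "Brust",
    "thorakale": "Brust",
}

def count_tumor_type_hits(text, keyword_to_type=KEYWORD_TO_TYPE):
    """Count keyword matches per tumor type, anchored at a word start (case-insensitive)."""
    hits = Counter()
    for keyword, tumor_type in keyword_to_type.items():
        # \w* lets inflected forms such as "senologischen" match "senologische"
        pattern = r"\b" + re.escape(keyword) + r"\w*"
        n = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if n:
            hits[tumor_type] += n
    return hits

sample = "Das Brustzentrum lädt zur senologischen Konferenz; Thorax-Fälle separat."
print(count_tumor_type_hits(sample))
# Counter({'Brust': 3})
```

The counts could also serve as extra features next to TF-IDF, since a page dedicated to tumor boards is likely to mention such keywords more often.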

In [25]:
def extract_schedule(text):
    day_pattern = r"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"
    freq_pattern = r"\b(weekly|bi-weekly|monthly|quarterly|annually)\b"
    time_pattern = r"\b([0-9]{1,2}:[0-9]{2} ?[APap][Mm])\b"  # Matches 10:30 AM format

    doc = nlp(text)
    day_match = re.search(day_pattern, text, re.IGNORECASE)
    freq_match = re.search(freq_pattern, text, re.IGNORECASE)
    time_match = re.search(time_pattern, text)

    # Fallbacks from spaCy NER: the last DATE entity is used when no weekday matches,
    # and any TIME entity overrides the regex time match
    date_entity = "Unknown"
    time_entity = time_match.group(0) if time_match else "Unknown"

    for ent in doc.ents:
        if ent.label_ == "DATE":
            date_entity = ent.text
        if ent.label_ == "TIME":
            time_entity = ent.text

    return {
        "day": day_match.group(0) if day_match else date_entity,
        "frequency": freq_match.group(0) if freq_match else "Unknown",
        "time": time_entity
    }
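
The patterns above cover English weekday names and AM/PM times only. They are also applied to text that preprocess_text has already stripped of ":" characters, so the time pattern has little chance of matching. The sketch below assumes German pages and assumes it is given the raw text from get_text, before preprocessing. The function name and the word lists are illustrative assumptions, not an exhaustive vocabulary.

```python
import re

DAY_PATTERN = re.compile(
    r"\b(montags?|dienstags?|mittwochs?|donnerstags?|freitags?|samstags?|sonntags?)\b",
    re.IGNORECASE,
)
FREQ_PATTERN = re.compile(
    r"\b(wöchentlich|14-tägig|zweiwöchentlich|monatlich)\b", re.IGNORECASE
)
# 24-hour times such as "16:00 Uhr" or "7.30 Uhr"
TIME_PATTERN = re.compile(r"\b([01]?\d|2[0-3])[:.]([0-5]\d)\s*Uhr\b", re.IGNORECASE)

def extract_schedule_de(raw_text):
    """Return the first weekday, frequency and start time found in raw German text."""
    day = DAY_PATTERN.search(raw_text)
    freq = FREQ_PATTERN.search(raw_text)
    time = TIME_PATTERN.search(raw_text)
    return {
        "day": day.group(1) if day else "Unknown",
        "frequency": freq.group(1) if freq else "Unknown",
        # Normalize to HH:MM
        "time": f"{int(time.group(1)):02d}:{time.group(2)}" if time else "Unknown",
    }

print(extract_schedule_de("Die Tumorkonferenz findet wöchentlich dienstags um 16:00 Uhr statt."))
# {'day': 'dienstags', 'frequency': 'wöchentlich', 'time': '16:00'}
```

Only the first match of each kind is returned, so a page that lists several boards would need the text split per board first.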

In [28]:
# Apply text extraction, NER, and schedule detection to training and test sets
train_df['text'] = train_df['doc_id'].apply(lambda doc_id: extract_text_from_html(f'htmls/{doc_id}.html'))
train_df['tumor_board_type'] = train_df['text'].apply(detect_tumor_board_type)
train_df['schedule'] = train_df['text'].apply(extract_schedule)

test_df['text'] = test_df['doc_id'].apply(lambda doc_id: extract_text_from_html(f'htmls/{doc_id}.html'))
test_df['tumor_board_type'] = test_df['text'].apply(detect_tumor_board_type)
test_df['schedule'] = test_df['text'].apply(extract_schedule)

Error processing htmls/72.html: 'utf-8' codec can't decode byte 0xe4 in position 545: invalid continuation byte
Error processing htmls/94.html: 'utf-8' codec can't decode byte 0xe4 in position 1077: invalid continuation byte
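
The two decode errors above mean that pages 72 and 94 end up as empty strings. Byte 0xe4 is "ä" in Latin-1 and Windows-1252, so these files are most likely saved in one of those encodings. A minimal sketch of a reader with an encoding fallback follows; the function name and the order of encodings are assumptions.

```python
import tempfile
from pathlib import Path

def read_html_text(html_path, encodings=("utf-8", "cp1252")):
    """Return the file content decoded with the first encoding that works."""
    raw = Path(html_path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1")

# Demo with a temporary file that is valid Windows-1252 but not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    sample.write_bytes("<p>Gespräch</p>".encode("cp1252"))
    decoded = read_html_text(sample)
print(decoded)
# <p>Gespräch</p>
```

Another option is to open the file in binary mode and pass the bytes to BeautifulSoup, which then tries to detect the encoding itself.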




In [29]:
# Split training data for ML model
X_train, X_val, y_train, y_val = train_test_split(train_df['text'], train_df['label'], test_size=0.2, random_state=42)
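
A single 80/20 split leaves only 20 pages for validation, and without stratification the rarest label may be represented by very few of them. One alternative is stratified k-fold cross-validation, where every page is used for validation once. The sketch below uses a made-up toy corpus in place of train_df['text'] and train_df['label']; the texts and the number of folds are assumptions. For a single split, train_test_split also accepts a stratify argument.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for train_df['text'] and train_df['label'].
texts = [
    "kontakt anfahrt impressum", "sprechstunde termine ambulanz",
    "klinik team leistungen", "karriere stellenangebote pflege",
    "zentrum mit tumorboard und sprechstunde", "klinik nimmt am tumorboard teil",
    "leistungen tumorkonferenz und ambulanz", "team stellt fälle im tumorboard vor",
    "tumorboard dienstag 16 uhr wöchentlich", "tumorkonferenz brust mittwoch 15 uhr",
    "interdisziplinäres tumorboard montag 8 uhr", "tumorboard termine freitag monatlich",
]
labels = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Each fold keeps the label proportions of the full data set.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print(f"macro F1 per fold: {scores.round(2)}, mean: {scores.mean():.2f}")
```

Macro F1 is used as the score because it weights the three labels equally, which makes a model that ignores the rare label visible.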

In [30]:
# Build a classification pipeline
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
pipeline = Pipeline([
    ('tfidf', vectorizer),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

In [31]:
# Train the ML model
pipeline.fit(X_train, y_train)

In [32]:
# Evaluate on validation set
y_pred = pipeline.predict(X_val)
print("Validation Classification Report:")
print(classification_report(y_val, y_pred))


Validation Classification Report:
              precision    recall  f1-score   support

           1       0.33      0.14      0.20         7
           2       0.47      0.80      0.59        10
           3       0.00      0.00      0.00         3

    accuracy                           0.45        20
   macro avg       0.27      0.31      0.26        20
weighted avg       0.35      0.45      0.37        20





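The classification report shows that the model does not yet separate the three labels reliably. Accuracy is 0.45 on the 20 validation pages and macro-averaged F1 is 0.26. Label 2 is recalled well (0.80) but with low precision (0.47), label 1 is mostly missed (recall 0.14), and label 3 is never predicted, so its precision, recall and F1 are all 0.00. Always predicting label 2 would already reach 10/20 = 0.50 accuracy on this split, so there is little reason to expect better scores on the test data. With only 20 validation pages, these estimates are also very noisy.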
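
The averages in the report can be checked by hand. The per-class counts below are not taken from the notebook; they are inferred from the rounded precision and recall values (1 of 3 predictions correct for label 1, 8 of 17 for label 2, no predictions for label 3) and should be read as an assumption.

```python
# Per-class counts inferred from the report: (true positives, predicted, support).
# Reconstructed from the rounded precision/recall values, not read from the notebook.
counts = {1: (1, 3, 7), 2: (8, 17, 10), 3: (0, 0, 3)}

def f1(tp, predicted, support):
    """F1 = 2*TP / (predicted + support); taken as 0 when there are no true positives."""
    return 2 * tp / (predicted + support) if tp else 0.0

total = sum(support for _, _, support in counts.values())
per_class_f1 = {label: f1(*c) for label, c in counts.items()}
accuracy = sum(tp for tp, _, _ in counts.values()) / total
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weighted_f1 = sum(per_class_f1[label] * c[2] for label, c in counts.items()) / total
# Accuracy of always predicting the most frequent validation label.
majority_baseline = max(support for _, _, support in counts.values()) / total

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} "
      f"weighted_f1={weighted_f1:.2f} baseline={majority_baseline:.2f}")
# accuracy=0.45 macro_f1=0.26 weighted_f1=0.37 baseline=0.50
```

The macro average treats the three labels equally, which is why the never-predicted label 3 pulls it well below the weighted average.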

In [34]:
# Make predictions on test set
test_df['prediction'] = pipeline.predict(test_df['text'])

In [35]:
# Save results
test_df[['doc_id', 'prediction', 'tumor_board_type', 'schedule']].to_csv('test_predictions.csv', index=False)

print("Predictions saved to test_predictions.csv")

Predictions saved to test_predictions.csv




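# Algorithm Speed and Performance
## Speed

Random Forests can be computationally intensive on large datasets, but with roughly 100 labeled pages (the 20% validation split holds 20) training and prediction are quick. The more expensive steps are likely HTML parsing and spaCy: each page is run through the spaCy pipeline twice, once in detect_tumor_board_type and once in extract_schedule, so processing each page once and reusing the result would reduce that cost.

Limiting the vocabulary size (e.g., max_features=5000) keeps the feature matrix small and the training time bounded.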

## Improvements with More Time

- Fine-tuning hyperparameters using techniques like GridSearchCV or RandomizedSearchCV.
- Experimenting with different text vectorization techniques (e.g., Word2Vec, BERT).
- Exploring other classification algorithms (e.g., SVM, XGBoost).
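
As an illustration of the first point, the sketch below runs a small grid search over the TF-IDF and Random Forest settings, scored by macro F1. The toy corpus and the grid values are assumptions chosen to keep the example fast, not tuned recommendations. The grid includes class_weight="balanced" as one way to counter class imbalance.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for train_df['text'] and train_df['label'].
texts = [
    "kontakt anfahrt impressum", "klinik team leistungen", "karriere pflege stellen",
    "klinik nimmt am tumorboard teil", "team stellt fälle im tumorboard vor",
    "leistungen tumorkonferenz und ambulanz",
    "tumorboard dienstag 16 uhr wöchentlich", "tumorkonferenz brust mittwoch 15 uhr",
    "tumorboard termine freitag monatlich",
]
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Parameter names follow the "<step name>__<parameter>" convention of Pipeline.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [50, 100],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 2))
```

After fitting, search.best_estimator_ is the pipeline refit on all of the data with the best parameters and can be used for the test predictions.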

# Potential Improvements with More Data
## Model Generalization

More labeled pages would give the model more examples of each label, in particular label 3, which has only 3 pages in the validation split and is never predicted.

Ensuring diversity in the data (different tumor types, schedules, etc.) can make the model more robust.

## Potential Issues

- Imbalanced classes: the validation split contains 7, 10 and 3 pages for labels 1, 2 and 3, so the rarest label is easily ignored. Techniques like SMOTE or class weighting can help.
- Overfitting: with few training pages and up to 5000 TF-IDF features, the model can memorize the training data. Stronger regularization (for example shallower trees or fewer features) or more data can mitigate this.