# Domain Relevance Evaluation

Domain-relevant terms are extracted with the presented domain relevancy decision tree. The methods are implemented in parts/domain_relevance.py.

In [1]:
import pandas as pd
from tqdm import tqdm
from collections import Counter
from parts import collect, oie, domain_relevance, cleaning

## Initial Load of Background Domains

!!! Only needed the first time. Choose to export the data to the resource folder for faster loading in later runs !!!

collect.get_corpus is set up for three scenarios (adac, chefkoch, car), each with its own method for scraping text from the website. In each case the function takes a root link page and collects links from it and from the following pages until the limit is reached. Each collected link is then scraped and, if requested, stored in a folder.

parameters: collect.get_corpus(root_link, domain type, limit link pages, export to folder?)

In [None]:
#adac_corpus = collect.get_corpus(0,"adac",0,1)

In [None]:
#chefkoch_corpus = collect.get_corpus("https://www.chefkoch.de/forum/1,27/Haus-Garten.html","chefkoch",50,1)

In [None]:
#car_corpus = collect.get_corpus("https://www.motor-talk.de/forum/audi-80-90-100-200-v8-b158.html","car",30,1)

## Load Background Domains and Extract Terms

Simply load the scraped forum pages from the respective folders.

parameters: collect.load_domain_terms(domain, limit pages, clean?)

In [2]:
adac_domain = collect.load_domain_terms("adac", 10000, 1)

100%|██████████████████████████████████████████████████████████████████████████████| 2530/2530 [01:21<00:00, 31.02it/s]


deleted time references: 0
deleted date references: 287
deleted links: 77
deleted quotes: 63
deleted ireg expressions: 41
deleted abbreviations: 203


In [3]:
car_domain = collect.load_domain_terms("car_bmw", 10000, 1)

100%|██████████████████████████████████████████████████████████████████████████████| 1216/1216 [05:12<00:00,  3.90it/s]


deleted time references: 220
deleted date references: 198
deleted links: 689
deleted quotes: 3263
deleted ireg expressions: 2412
deleted abbreviations: 1721


In [6]:
audi_domain = collect.load_domain_terms("car", 10000, 1)

100%|██████████████████████████████████████████████████████████████████████████████| 1190/1190 [06:15<00:00,  3.17it/s]


deleted time references: 229
deleted date references: 339
deleted links: 952
deleted quotes: 3420
deleted ireg expressions: 2955
deleted abbreviations: 1772


In [4]:
chefkoch_domain = collect.load_domain_terms("chefkoch", 10000, 1)

100%|██████████████████████████████████████████████████████████████████████████████| 2548/2548 [10:32<00:00,  4.03it/s]


deleted time references: 0
deleted date references: 1455
deleted links: 1008
deleted quotes: 1648
deleted ireg expressions: 1985
deleted abbreviations: 2232


## Evaluation of Metrics and Domains

!!! Not needed for domain relevancy evaluation !!!

This section provides an overview of the frequency distributions within the scraped domains. Adapt the functions to see details about other domains.

In [None]:
import matplotlib.pyplot as plt

#### Distribution of metrics

In [10]:
flat_terms = [item for sublist in car_domain for item in sublist]
tf = Counter(flat_terms)
print("extracted terms", len(flat_terms), ", extracted concepts", len(tf))

extracted terms 403041 , extracted concepts 99667


In [None]:
### Term frequency distribution in car_domain
from collections import Counter
flat_terms = [item for sublist in car_domain for item in sublist]
tf = Counter(flat_terms)
bins= range(0,15,1)
plt.hist(tf.values(), bins=bins, edgecolor="k")
plt.xticks(bins)
print(min(tf.values()),max(tf.values()))

In [None]:
# distribution of llr, dw, lor, lor_bg values (swap in the dict of the metric to inspect; it must be computed beforehand)
bins= range(int(min(llr.values()))-1,int(min(llr.values()))+10,1)
plt.hist(llr.values(), bins=bins, edgecolor="k")
plt.xticks(bins)
print(min(llr.values()),max(llr.values()))

In [12]:
# overview of highest and lowest performing terms in metric
pd.Series(tf).sort_values(ascending = False).head(15)

bmw            7114
km             4372
auto           4362
motor          3974
problem        2458
wagen          2227
hallo          2035
werkstatt      1813
fehler         1738
1er            1665
probleme       1629
steuerkette    1490
fragen         1459
öl             1296
fahrzeug       1282
dtype: int64

In [None]:
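# candidates is not defined in this notebook; assign the terms to check first
# count the candidates that also occur in the chefkoch domain with tf > 1
chefkoch_terms = {item for sublist in chefkoch_domain for item in sublist}
counter = 0
for term in candidates:
    if term in chefkoch_terms and tf[term] > 1:
        counter += 1

counter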

## Concept Export and Import

In this section concepts are labeled and exported using the domain_relevance.label_concepts() function. By default it returns the results for the method and metric chosen in the thesis. Optionally, the other methods and metrics listed below can be used to label concepts.

methods: "dw", "llr", "lor-bg", "del"

metrics: "tf", "idf", "tdf", "tf-tdf", "tf-idf"

parameters: domain_relevance.label_concepts(target domain, background domain, contrastive domain, method, metric)
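
The internals of the methods are not shown in this notebook. As a rough illustration of what a contrastive score such as "llr" can look like, here is a minimal sketch of a simplified two-term log-likelihood (G2) statistic as it is commonly used for corpus comparison. The helper name log_likelihood_ratio and the toy counts are assumptions for illustration and are not taken from parts/domain_relevance.

```python
import math
from collections import Counter


def log_likelihood_ratio(term, target_tf, contrast_tf):
    """Simplified G2 score for one term (hypothetical helper, not part of `parts`)."""
    a = target_tf.get(term, 0)        # occurrences in the target corpus
    b = contrast_tf.get(term, 0)      # occurrences in the contrastive corpus
    n1 = sum(target_tf.values())      # target corpus size in tokens
    n2 = sum(contrast_tf.values())    # contrastive corpus size in tokens
    if a + b == 0:
        return 0.0
    # expected counts if the term were equally likely in both corpora
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


# toy term counts, made up for this example
target_demo = Counter({"motor": 8, "auto": 3, "hallo": 1})
contrast_demo = Counter({"motor": 1, "auto": 3, "hallo": 8})

scores = {t: log_likelihood_ratio(t, target_demo, contrast_demo) for t in target_demo}
```

Note that this score is unsigned: a term that is overrepresented in the contrastive corpus ("hallo" in the toy data) scores as high as one that is overrepresented in the target corpus, so a real pipeline would also need to check the direction of the difference.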

In [8]:
labels = domain_relevance.label_concepts(car_domain, adac_domain, chefkoch_domain)

100%|████████████████████████████████████████████████████████████████████████████| 1216/1216 [00:00<00:00, 5544.96it/s]
100%|████████████████████████████████████████████████████████████████████████████| 2548/2548 [00:00<00:00, 9685.58it/s]


Chosen via background domain: 2272
Chosen via metric: 3664
Chosen via tf > 1 limit: 15133


In [None]:
concepts = set()

for term in labels:
    if labels[term]:
        concepts.add(term)

concepts = list(concepts)

In [None]:
#concepts = list(set([item for sublist in adac_domain for item in sublist]))
with open("bmw_concepts.txt", "w", encoding="utf-8") as fp:
    fp.write('\n'.join(concepts))

#### only needed to re-import the concepts

In [None]:
# adjust the file name to the file that was exported (e.g. "bmw_concepts.txt")
with open("concepts.txt", "r", encoding="utf-8") as f:
    content = f.readlines()
# strip whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content] 
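
A small round-trip check can confirm that export and re-import return the same concepts, which matters for German terms with umlauts. The sketch below uses a temporary folder and a made-up file name and term list; both are assumptions for illustration.

```python
import os
import tempfile

# toy stand-in for the `concepts` list
concepts_demo = ["steuerkette", "öl", "werkstatt"]

# hypothetical file name inside a temporary folder
path = os.path.join(tempfile.mkdtemp(), "concepts_demo.txt")

# write one concept per line with an explicit encoding
with open(path, "w", encoding="utf-8") as fp:
    fp.write("\n".join(concepts_demo))

# read the file back and drop line endings and empty lines
with open(path, "r", encoding="utf-8") as fp:
    restored = [line.strip() for line in fp if line.strip()]
```

Using the same explicit encoding for writing and reading avoids platform-dependent defaults.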

## Testset comparison

Two test sets were created to evaluate the different domain relevancy methods and metrics. Load the test set and its labeled counterpart, then run the required domain relevancy method. The evaluation is provided by the sklearn library.

In [13]:
# testset_w-uniques.txt
# testset.txt

with open("testset_w-uniques.txt", "r", encoding ="utf-8") as f:
    testset = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
testset = [x.strip() for x in testset] 

In [14]:
# testset_w-unique_labeled.csv
# testset_labeled.csv

import csv
labeled = {}
with open('testset_w-unique_labeled.csv', 'r', encoding="utf-8", newline='') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        # adjust the number of placeholders to the number of empty fields in the file
        k,_,_,_,v = row
        labeled[k] = v
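
If the number of empty fields differs between files, unpacking a fixed number of values fails. A minimal sketch that avoids this, assuming the term is always in the first column and the label in the last one (the inline data is made up for illustration):

```python
import csv
import io

# toy semicolon-separated data standing in for the labeled test set file
raw = "steuerkette;;;;1\nhallo;;;;0\nölwechsel;;1\n"

labeled_demo = {}
for row in csv.reader(io.StringIO(raw), delimiter=";"):
    if not row:
        continue  # skip blank lines
    # first field is the term, last field is the label
    labeled_demo[row[0]] = row[-1]
```

The labels stay strings here, as in the cell above, and are converted later with pd.to_numeric.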

In [21]:
labels = domain_relevance.label_concepts(audi_domain, adac_domain, chefkoch_domain)

100%|████████████████████████████████████████████████████████████████████████████| 1190/1190 [00:00<00:00, 7271.59it/s]
100%|███████████████████████████████████████████████████████████████████████████| 2548/2548 [00:00<00:00, 16719.97it/s]


Chosen via background domain: 2330
Chosen via metric: 3873
Chosen via tf > 1 limit: 14983


In [22]:
predicted = {}
for candidate in testset:
    try:
        predicted[candidate] = labels[candidate]
    except KeyError:
        # candidate has no entry in labels: treat it as not relevant and print it
        predicted[candidate] = 0
        print(candidate)

zumal überbrückungsfahrzeug fast 4 is
vertraut ... alltagsauto
mal komplett unterschiedliche werte
z-diode
ersatzteillos
evtl anfängersichere anleitung
raum aachen-köln
gelben/orangen blinkern
tausche 1 dämpfer querlenker idioten
vw tl
abs ?
kurvenäußeren rad
audi v8 4,2l
gewinnbringende antworten
degenhard
dauerhaft auf'n
ansaugrohrvorw
schwungs
handelbezeichnungen
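
The terms printed above have no entry in labels and are therefore predicted as 0. When predictions are later joined with the gold labels, matching by term is more robust than relying on both collections having the same order. A minimal sketch with made-up toy data (the names labeled_demo, labels_demo and df_demo are assumptions for illustration):

```python
import pandas as pd

# toy gold labels and toy predictions; the predictions are in a different order
labeled_demo = {"steuerkette": "1", "hallo": "0", "z-diode": "1"}
labels_demo = {"hallo": 0, "steuerkette": 1}

df_demo = pd.DataFrame.from_dict(labeled_demo, orient="index", columns=["label"])
# look up each term by its index value; terms without a prediction default to 0
df_demo["predicted"] = [labels_demo.get(term, 0) for term in df_demo.index]
```

With this lookup, a reordered or deduplicated test set cannot silently shift predictions onto the wrong rows.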


In [23]:
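import pandas as pd

df = pd.DataFrame.from_dict(labeled, orient='index', columns=["label"])
# positional assignment: relies on testset and the labeled file listing the terms in the same order
df["predicted"] = list(predicted.values())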

In [24]:
from sklearn.metrics import confusion_matrix, classification_report

In [25]:
confusion_matrix(pd.to_numeric(df["label"]), pd.to_numeric(df["predicted"]))

array([[517, 229],
       [ 68, 186]], dtype=int64)

In [26]:
print(classification_report(pd.to_numeric(df["label"]), pd.to_numeric(df["predicted"])))

              precision    recall  f1-score   support

           0       0.88      0.69      0.78       746
           1       0.45      0.73      0.56       254

    accuracy                           0.70      1000
   macro avg       0.67      0.71      0.67      1000
weighted avg       0.77      0.70      0.72      1000
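
The figures for class 1 in the report can be recomputed by hand from the confusion matrix above, which makes it easier to see where the low precision comes from. A short sketch, using sklearn's convention that rows are true labels and columns are predicted labels:

```python
import numpy as np

# confusion matrix from the output above: rows = true label, columns = predicted label
cm = np.array([[517, 229],
               [68, 186]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # share of predicted-relevant terms that are relevant
recall_1 = tp / (tp + fn)      # share of relevant terms that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

The 229 false positives against 186 true positives explain the precision of 0.45 for class 1, while only 68 relevant terms are missed, which gives the recall of 0.73.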

