# Domain Relevancy Sample

Overview of the methods used to determine domain relevance. In practice, three domains are used: ADAC, Autoforum and Chefkochforum. Terms are normally cleaned before frequencies are calculated; that step is skipped here because all sample terms are already of high quality.

In [1]:
from collections import Counter
import numpy as np

## Generation of two sample datasets
Each set consists of a list of documents, and each document is a list of the terms that define it. Storing a domain as a "list of lists" makes it easier to calculate term frequency, term-document frequency and inverse document frequency in later stages.

In [2]:
doc1 = ["hallo", "auto", "problem", "grüße"]
doc2 = ["hallo", "auto", "sensor", "fehler"]
doc3 = ["hallo", "reifen", "fehler", "dank"]
doc4 = ["hi", "sensor", "kaputt", "grüße"]

target_domain = [doc1, doc2, doc3, doc4]
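The benefit of the "list of lists" layout can be seen in a minimal sketch. The mini-domain `docs` and the names `tf` and `df` are hypothetical and only serve as an illustration: keeping the document boundaries lets us count a term either once per occurrence or once per document.

```python
from collections import Counter

# Hypothetical mini-domain, only for illustration.
docs = [["auto", "auto", "sensor"], ["auto", "fehler"]]

# Term frequency: count every occurrence across all documents.
tf = Counter(term for doc in docs for term in doc)

# Document frequency: count each term at most once per document.
df = Counter(term for doc in docs for term in set(doc))

print(tf["auto"], df["auto"])
```

With a single flat list of terms, only the first of the two counts could be recovered.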

In [3]:
doc1 = ["hallo", "haus", "garten", "grüße"]
doc2 = ["hallo", "sommer", "haus", "grüße"]
doc3 = ["hi", "regen", "pflanzen", "dank"]
doc4 = ["hi", "garten", "zaun", "grüße"]

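# doc4 is defined above but not part of the contrastive domain; all outputs below are based on these three documents
contrastive_domain = [doc1, doc2, doc3]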

# Calculating Relevancy Measures
Three approaches are presented:
1. combinations of term frequency, term-document frequency and inverse term-document frequency
2. domain relevance / domain consensus measure from the "OntoLearn" system (https://ieeexplore.ieee.org/document/1179190)
3. log-likelihood and log-odds measures from the textbook (https://web.stanford.edu/~jurafsky/slp3/)

### 1a) normalized term-frequency
tf = term-frequency / max_term-frequency

**idea:** terms that appear most often are important

- basic idea: pg. 105ff of https://web.stanford.edu/~jurafsky/slp3/
- normalized idea in https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.21231

In [4]:
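def get_tf(terms, norm=0):
    # flatten the list of documents into one list of terms
    flat_terms = [item for sublist in terms for item in sublist]
    tf = Counter(flat_terms)
    max_freq = tf.most_common(1)[0][1]
    # note: n is the number of distinct terms, not the total number of tokens
    n = len(tf)
    for t in tf:
        if norm:
            # scale by the count of the most frequent term (range 0..1)
            tf[t] = tf[t] / max_freq
        else:
            tf[t] = tf[t] / n

    return tf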

In [5]:
get_tf(target_domain, 1)

Counter({'hallo': 1.0,
         'auto': 0.6666666666666666,
         'problem': 0.3333333333333333,
         'grüße': 0.6666666666666666,
         'sensor': 0.6666666666666666,
         'fehler': 0.6666666666666666,
         'reifen': 0.3333333333333333,
         'dank': 0.3333333333333333,
         'hi': 0.3333333333333333,
         'kaputt': 0.3333333333333333})
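Without `norm`, `get_tf` divides each count by the number of distinct terms rather than by the total number of tokens. The sketch below compares both denominators on the target domain; dividing by the token count is shown only as an assumed alternative, not as the method used in this notebook, and the names `by_vocab` and `by_tokens` are made up for the comparison.

```python
from collections import Counter

target_domain = [
    ["hallo", "auto", "problem", "grüße"],
    ["hallo", "auto", "sensor", "fehler"],
    ["hallo", "reifen", "fehler", "dank"],
    ["hi", "sensor", "kaputt", "grüße"],
]
tokens = [term for doc in target_domain for term in doc]
counts = Counter(tokens)

# Denominator used by get_tf when norm is falsy: number of distinct terms (10).
by_vocab = {term: count / len(counts) for term, count in counts.items()}

# Assumed alternative: total number of tokens (16), i.e. a relative frequency.
by_tokens = {term: count / len(tokens) for term, count in counts.items()}

print(round(sum(by_vocab.values()), 3), round(sum(by_tokens.values()), 3))
```

Only the second variant sums to 1 and can be read as a probability estimate, which matters for the log-odds measures in section 3.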

### 1b) normalized term-document-frequency
tdf = term-document-frequency / max_term-document-frequency

**idea:** terms that appear in many documents are important

- basic idea: pg. 105ff of https://web.stanford.edu/~jurafsky/slp3/
- normalized idea in https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.21231

In [6]:
def get_tdf(terms):
    # set(sublist) counts each term at most once per document
    flat_terms = [item for sublist in terms for item in set(sublist)]
    tdf = Counter(flat_terms)
    max_freq = tdf.most_common(1)[0][1]
    for t in tdf:
        # scale by the document count of the most widespread term
        tdf[t] = tdf[t] / max_freq

    return tdf

In [7]:
get_tdf(target_domain)

Counter({'hallo': 1.0,
         'auto': 0.6666666666666666,
         'grüße': 0.6666666666666666,
         'problem': 0.3333333333333333,
         'sensor': 0.6666666666666666,
         'fehler': 0.6666666666666666,
         'dank': 0.3333333333333333,
         'reifen': 0.3333333333333333,
         'hi': 0.3333333333333333,
         'kaputt': 0.3333333333333333})

### 1c) inverse-term-document-frequency
idf = log2( number-of-documents / term-document-frequency )

**idea:** terms that appear in too many documents are irrelevant

(commonly used to filter out words like is, have, i, you)

- basic idea: pg. 105ff of https://web.stanford.edu/~jurafsky/slp3/

In [8]:
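def get_idf(terms):
    # document frequency: each term counts at most once per document
    flat_terms = [item for sublist in terms for item in set(sublist)]
    idf = Counter(flat_terms)
    for t in idf:
        # len(terms) is the number of documents in the domain
        idf[t] = np.log2(len(terms) / idf[t])

    return idf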

In [9]:
get_idf(contrastive_domain)

Counter({'hallo': 0.5849625007211562,
         'haus': 0.5849625007211562,
         'grüße': 0.5849625007211562,
         'garten': 1.584962500721156,
         'sommer': 1.584962500721156,
         'pflanzen': 1.584962500721156,
         'hi': 1.584962500721156,
         'dank': 1.584962500721156,
         'regen': 1.584962500721156})
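As a quick check of the values above, the calculation can be written out for two terms. The symbols $N$ (number of documents) and $df_t$ (number of documents containing term $t$) are notation introduced here for illustration; the contrastive domain contains $N = 3$ documents.

$$
\mathrm{idf}_t = \log_2 \frac{N}{df_t}, \qquad
\mathrm{idf}_{\text{hallo}} = \log_2 \frac{3}{2} \approx 0.585, \qquad
\mathrm{idf}_{\text{garten}} = \log_2 \frac{3}{1} \approx 1.585
$$

A term that occurred in every document would get $\log_2 1 = 0$.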

### 2) domain relevance / domain consensus measure (Ontolearn)
DR = target-frequency / contrastive-frequency 

DC = target-tdf * log2( 1 / target-tdf )

DW = alpha * DR + (1-alpha) * DC

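**idea:** domain relevance (DR) determines whether a term is more important to the target domain or to the contrastive domain; domain consensus (DC) is meant to capture whether the term is generic (i.e. appears a lot everywhere) and is computed here from the term-document-frequency in the target domain. The final weight (DW) is a linear combination of both measures, controlled by alpha.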

from https://ieeexplore.ieee.org/document/1179190

In [10]:
def get_dr(target_domain, contrastive_domain, candidates):
    # candidates should be part of the target domain
    dr = {}

    target_tf = get_tf(target_domain)
    contrastive_tf = get_tf(contrastive_domain)

    for term in candidates:
        try:
            dr[term] = target_tf[term] / contrastive_tf[term]
        except ZeroDivisionError:
            # term does not occur in the contrastive domain; note that the
            # fallback of 0 gives such domain-specific terms the lowest DR
            dr[term] = 0

    return dr


def get_dc(target_domain, candidates):
    dc = {}
    target_tdf = get_tdf(target_domain)

    for term in candidates:
        dc[term] = target_tdf[term]*np.log2(1/target_tdf[term])

    return dc


def get_dw(target_domain, contrastive_domain, candidates, alpha):
    dw = {}
    dr = get_dr(target_domain, contrastive_domain, candidates)
    dc = get_dc(target_domain, candidates)

    for term in candidates:
        dw[term] = alpha * dr[term] + (1 - alpha) * dc[term]

    return dw

In [12]:
candidates = set([item for sublist in target_domain for item in sublist])
get_dw(target_domain, contrastive_domain, candidates, 0.5)

{'sensor': 0.1949875002403854,
 'fehler': 0.1949875002403854,
 'hi': 0.7141604167868594,
 'hallo': 0.675,
 'kaputt': 0.2641604167868593,
 'grüße': 0.6449875002403854,
 'auto': 0.1949875002403854,
 'problem': 0.2641604167868593,
 'dank': 0.7141604167868594,
 'reifen': 0.2641604167868593}
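To make the numbers above easier to follow, here is the calculation for three terms with alpha = 0.5. The frequencies are the ones `get_tf` returns without `norm` (count divided by the number of distinct terms: 10 in the target domain, 9 in the contrastive domain), and the tdf values come from `get_tdf`.

$$
\begin{aligned}
DR(\text{hallo}) &= \frac{3/10}{2/9} = 1.35, & DC(\text{hallo}) &= 1 \cdot \log_2 1 = 0, & DW(\text{hallo}) &= 0.675 \\
DR(\text{sensor}) &= 0, & DC(\text{sensor}) &= \tfrac{2}{3} \log_2 \tfrac{3}{2} \approx 0.390, & DW(\text{sensor}) &\approx 0.195 \\
DR(\text{hi}) &= \frac{1/10}{1/9} = 0.9, & DC(\text{hi}) &= \tfrac{1}{3} \log_2 3 \approx 0.528, & DW(\text{hi}) &\approx 0.714
\end{aligned}
$$

Because DR falls back to 0 for terms that are missing from the contrastive domain, a domain-specific term such as "sensor" ends up below a generic term such as "hi" in this sample.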

### 3) Log-likelihood-ratio
<img src="img/llr.png" width=50% />
Where "i" is the target domain and "j" is the contrastive domain

pg. 406 of https://web.stanford.edu/~jurafsky/slp3/
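In case the image does not render: the quantity computed in the next cell can be written as below. This is a transcription based on the code and is assumed to match the textbook formula; $P^{i}(w)$ and $P^{j}(w)$ stand for the frequency estimates of term $w$ in the two domains.

$$
\mathrm{llr}(w) = \log \frac{P^{i}(w)}{P^{j}(w)} = \log P^{i}(w) - \log P^{j}(w)
$$

Positive values indicate terms that are more typical of the target domain, negative values terms that are more typical of the contrastive domain.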

In [13]:
def get_llr(target_domain, contrastive_domain, candidates):
    # candidates should be part of the target domain
    llr = {}
    target_tf = get_tf(target_domain)
    contrastive_tf = get_tf(contrastive_domain)
    
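    for term in candidates:
        # smoothing: replace zero frequencies with a constant to avoid log(0)
        target_tf[term] = 0.1 if not target_tf[term] else target_tf[term]
        contrastive_tf[term] = 0.1 if not contrastive_tf[term] else contrastive_tf[term]

        # difference of the log frequencies in the two domains
        llr[term] = np.log(target_tf[term]) - np.log(contrastive_tf[term])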
 
    return llr

In [14]:
get_llr(target_domain, contrastive_domain, candidates)

{'sensor': 0.6931471805599452,
 'fehler': 0.6931471805599452,
 'hi': -0.10536051565782589,
 'hallo': 0.30010459245033805,
 'kaputt': 0.0,
 'grüße': -0.10536051565782611,
 'auto': 0.6931471805599452,
 'problem': 0.0,
 'dank': -0.10536051565782589,
 'reifen': 0.0}

### 3b) Log-odds-ratio

<img src="img/lor.png" width=50% />

Where "i" is the target domain and "j" is the contrastive domain

pg. 407 of https://web.stanford.edu/~jurafsky/slp3/

Note: Can also be used with background knowledge 

In [40]:
def get_lor(target_domain, contrastive_domain, candidates):
    # candidates should be part of the target domain
    lor = {}
    target_tf = get_tf(target_domain)
    contrastive_tf = get_tf(contrastive_domain)
    
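    for term in candidates:
        # smoothing: replace zero frequencies with a small constant to avoid log2(0)
        target_tf[term] = 0.0001 if not target_tf[term] else target_tf[term]
        contrastive_tf[term] = 0.0001 if not contrastive_tf[term] else contrastive_tf[term]

        # difference of the log odds in the two domains
        target_odds = target_tf[term] / (1 - target_tf[term])
        contrastive_odds = contrastive_tf[term] / (1 - contrastive_tf[term])
        lor[term] = np.log2(target_odds) - np.log2(contrastive_odds)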
 
    return lor

In [41]:
get_lor(target_domain, contrastive_domain, candidates)

{'sensor': 11.287568102831404,
 'fehler': 11.287568102831404,
 'hi': -0.16992500144231215,
 'hallo': 0.5849625007211563,
 'kaputt': 10.117643101389092,
 'grüße': -0.19264507794239583,
 'auto': 11.287568102831404,
 'problem': 10.117643101389092,
 'dank': -0.16992500144231215,
 'reifen': 10.117643101389092}
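The large scores for terms that are missing from the contrastive domain depend almost entirely on the smoothing constant. The sketch below recomputes the score of "sensor" (target frequency 0.2, absent from the contrastive domain) for a few constants; the helper `log_odds_ratio` and the list of constants are illustrative assumptions, not part of the functions above.

```python
import numpy as np

def log_odds_ratio(p_target, p_contrastive):
    """Difference of the log2 odds of two relative frequencies."""
    target_odds = p_target / (1 - p_target)
    contrastive_odds = p_contrastive / (1 - p_contrastive)
    return np.log2(target_odds) - np.log2(contrastive_odds)

# "sensor": frequency 0.2 in the target domain, smoothed value in the contrastive domain.
for eps in (0.1, 0.01, 0.0001):
    print(eps, round(float(log_odds_ratio(0.2, eps)), 3))
```

With 0.0001 the result is the value of about 11.29 shown above; larger constants pull the score of unseen terms much closer to the scores of shared terms, so the constant should be chosen deliberately.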

### 3c) Log-odds-ratio with background knowledge

<img src="img/lor_bg.png" width=50% />

from pg. 407 of https://web.stanford.edu/~jurafsky/slp3/
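For reference, the log-odds ratio with an informative prior (Monroe et al., as presented in the textbook) is commonly written as follows. Treat this transcription as a sketch and the image above as authoritative: $f_w^{i}$ is the count of term $w$ in domain $i$, $n^{i}$ the size of domain $i$, $\alpha_w$ the count of $w$ in the background corpus and $\alpha_0$ the size of the background corpus.

$$
\hat{\delta}_w^{(i-j)} = \log \frac{f_w^{i} + \alpha_w}{n^{i} + \alpha_0 - (f_w^{i} + \alpha_w)} - \log \frac{f_w^{j} + \alpha_w}{n^{j} + \alpha_0 - (f_w^{j} + \alpha_w)}
$$

$$
\sigma^2\left(\hat{\delta}_w^{(i-j)}\right) \approx \frac{1}{f_w^{i} + \alpha_w} + \frac{1}{f_w^{j} + \alpha_w}, \qquad
z_w = \frac{\hat{\delta}_w^{(i-j)}}{\sqrt{\sigma^2\left(\hat{\delta}_w^{(i-j)}\right)}}
$$

Note that each denominator is the full expression $n + \alpha_0 - (f_w + \alpha_w)$ and that the second term uses the size of the contrastive domain $n^{j}$.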

In [44]:
import math

def get_lor_bg(target_domain, contrastive_domain, candidates, normalized=0):
    # candidates should be part of the target domain
    lor_bg = {}

    # get term frequency for each domain - combining both gives background knowledge
    target_flat_terms = [item for sublist in target_domain for item in sublist]
    target_tf = Counter(target_flat_terms)

    contrastive_flat_terms = [item for sublist in contrastive_domain for item in sublist]
    contrastive_tf = Counter(contrastive_flat_terms)

    combine_flat_terms = target_flat_terms + contrastive_flat_terms
    combine_tf = Counter(combine_flat_terms)

    n_i = len(target_flat_terms)       # size of the target domain
    n_j = len(contrastive_flat_terms)  # size of the contrastive domain
    a_0 = len(combine_flat_terms)      # size of the background corpus

    for term in candidates:
        # smoothing: replace zero counts with a small constant
        f_i = target_tf[term] or 0.1
        f_j = contrastive_tf[term] or 0.1
        a_w = combine_tf[term] or 0.1  # count of the term in the background corpus

        # odds per domain; the denominator is n + a_0 - (f + a_w)
        odds_i = (f_i + a_w) / (n_i + a_0 - (f_i + a_w))
        odds_j = (f_j + a_w) / (n_j + a_0 - (f_j + a_w))
        lor_bg[term] = np.log2(odds_i) - np.log2(odds_j)
        if normalized:
            # z-score: divide by the estimated standard deviation
            sigma = 1 / (f_i + a_w) + 1 / (f_j + a_w)
            lor_bg[term] = lor_bg[term] / math.sqrt(sigma)
    return lor_bg

In [45]:
scores = get_lor_bg(target_domain, contrastive_domain, candidates, 1)
{term: round(float(score), 3) for term, score in scores.items()}

{'sensor': 1.0,
 'fehler': 1.0,
 'hi': -0.181,
 'hallo': 0.13,
 'kaputt': 0.633,
 'grüße': -0.278,
 'auto': 1.0,
 'problem': 0.633,
 'dank': -0.181,
 'reifen': 0.633}