<small><i>This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).</i></small>

# Prerequisites

* Pandas: ``pip install pandas``
* Matplotlib: ``pip install matplotlib``
* mpltools, NumPy and SciPy (imported by the code cells below)

# Linking DBPedia entity types in N-gram corpus
In this sub-tutorial, we link DBPedia entities (and subsequently their types) in an N-gram corpus. We use n-grams generated from **Wikipedia pages**, so that we can compare our linking against the original, human-made links.

Clone the **kilogram** library if you haven't done so yet:

    git clone https://github.com/dragoon/kilogram.git
    cd kilogram/mapreduce
    
Download required **DBPedia** datasets:
    
    wget http://downloads.dbpedia.org/2015-04/dbpedia_2015-04.owl.bz2
    wget -O redirects_transitive_en.nt.bz2 http://downloads.dbpedia.org/2015-04/core-i18n/en/transitive-redirects_en.nt.bz2
    wget -O instance_types_en.nt.bz2 http://downloads.dbpedia.org/2015-04/core-i18n/en/instance-types-transitive_en.nt.bz2

In [1]:
import matplotlib.pyplot as plt
from mpltools import style
import numpy as np
style.use('ggplot')
%matplotlib inline
import pandas as pd
import shelve
from collections import defaultdict

# Generate an n-gram corpus from annotated Wikipedia texts

    spark-submit --num-executors 20 --master yarn-client ./wikipedia/spark_anchors.py "/data/wikipedia2015_plaintext_annotated" "/user/roman/wikipedia_anchors_orig"
    spark-submit --num-executors 20 --master yarn-client ./wikipedia/spark_orig_ngram_counts.py "/user/roman/wikipedia_anchors_orig" "/user/roman/orig_ngram_counts"

# Organic (human) linkings
Since we want to show that we link entity types better than the original (human) annotators, we first need to extract the original linkings for comparison.

# Analysis: raw counts

First, we look at what happens with naive linking by exact string matching, using the canonical and "redirect" labels of Wikipedia pages. Each such label links to exactly one entity (at least inside Wikipedia).

We aggregate the counts of both the labels and their lowercase versions for further analysis.
First, we also generate a DBPedia entity dictionary that maps labels to entity types (used later):

    python dbpedia_dbm.py
    spark-submit --executor-memory 5g --num-executors 20 --master yarn-client ./wikipedia/spark_plain_ngrams.py "/data/wikipedia2015_plaintext_annotated" "/user/roman/wikipedia_ngrams"
    spark-submit --num-executors 20 --executor-memory 5g --master yarn-client ./wikipedia/spark_lowercase.py "/user/roman/wikipedia_ngrams" "/user/roman/ngram_counts"

Retrieve label counts:

    hdfs dfs -cat /user/roman/ngram_counts/* > dbpedia_counts_inferred
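
The matching and aggregation described above can be pictured with a small, self-contained sketch. It is not the code of `spark_plain_ngrams.py` or `spark_lowercase.py`: the function name `count_label_matches`, the `(ngram, count)` input format and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict


def count_label_matches(ngrams, labels):
    """Count exact and lowercase matches of n-grams against entity labels.

    ngrams: iterable of (ngram_text, count) pairs (hypothetical input format).
    labels: set of entity labels with underscores, e.g. 'The_Gigolos'.
    """
    # Simplification: if two labels share a lowercase form, the last one wins.
    lower_to_label = {label.lower(): label for label in labels}
    counts = defaultdict(lambda: {'upper': 0, 'lower': 0})
    for ngram, count in ngrams:
        key = ngram.replace(' ', '_')
        if key in labels:
            # The n-gram matches the label exactly as written.
            counts[key]['upper'] += count
        elif key in lower_to_label:
            # The n-gram matches only the lowercased form of a label.
            counts[lower_to_label[key]]['lower'] += count
    return dict(counts)


# Toy data, not taken from the real corpus
toy_labels = {'The_Gigolos', 'Feijo'}
toy_ngrams = [('The Gigolos', 11), ('the gigolos', 2), ('Feijo', 4), ('unrelated', 7)]
toy_counts = count_label_matches(toy_ngrams, toy_labels)
```

Each entry of `toy_counts` has the same two numbers per label that the `dbpedia_counts_inferred` file stores as `upper_count,lower_count`.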

# Construct original counts file

    spark-submit --master yarn-client --num-executors 20 ./wikipedia/spark_anchors.py "/data/wikipedia_anchors" "/user/roman/orig_ngram_counts"
    hdfs dfs -cat /user/roman/orig_ngram_counts/* > dbpedia_counts_original

In [2]:
count_dict = {}
# Inferred counts: one "label<TAB>upper_count,lower_count" record per line
with open('dbpedia_counts_inferred') as inferred_file:
    for line in inferred_file:
        label, values = line.split('\t')
        upper_count, lower_count = values.split(',')
        count_dict[label] = {'infer_upper': int(upper_count), 'infer_lower': int(lower_count), 'len': len(label.split('_')),
                             'label': label, 'organ_upper': 0, 'organ_lower': 0}
# Original (human) counts: labels missing from the inferred file are ignored
with open('dbpedia_counts_original') as original_file:
    for line in original_file:
        label, values = line.split('\t')
        if label in count_dict:
            upper_count, lower_count = values.split(',')
            count_dict[label].update({'organ_upper': int(upper_count), 'organ_lower': int(lower_count)})
counts_df = pd.DataFrame(list(count_dict.values()))
del count_dict
counts_df.head()

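|   | infer_lower | infer_upper | label | len | organ_lower | organ_upper |
|---|---|---|---|---|---|---|
| 0 | 0 | 4 | Feijo | 1 | 0 | 0 |
| 1 | 0 | 9 | Atlético_Celaya | 2 | 0 | 1 |
| 2 | 2 | 11 | The_Gigolos | 2 | 0 | 0 |
| 3 | 0 | 13 | Socialist_Peasants'_Party | 3 | 0 | 0 |
| 4 | 6 | 0 | Unadorned_rock-wallaby | 2 | 0 | 0 |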


In [4]:
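# NOTE: this cell assumes counts_df has a 'type' column holding the DBPedia type of each label;
# the cell that builds counts_df above does not create that column.
a = set(counts_df[(counts_df.infer_upper + counts_df.infer_lower > 0)]['type'])
b = set(counts_df[(counts_df.organ_upper + counts_df.organ_lower > 0)]['type'])
print 'Total number of inferred types:',  len(a)
print 'Total number of original types:',  len(b)
# Types that appear only among the inferred links
print a.difference(b)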

Total number of inferred types: 339
Total number of original types: 338
set(['SnookerWorldRanking'])


# Generate excludes by ambiguity

In [4]:
def is_title(label):
    label = label.split('_')
    if len(label) == 1:
        return False
    first_letters = [x[0] for x in label if x[0].isupper()]
    if len(first_letters) == 1:
        return False
    return True

# Write excludes file, create lowercase matchings file
dbpediadb_lower = {}
#excludes = open('../mapreduce/PARAM_plus_achors/dbpedia_uri_excludes.txt', 'w')
excludes = open('dbpedia_uri_excludes.txt', 'w')
for row in counts_df.iterrows():
    row = row[1]
    exclude = False
    label = row['label']
    
    # skip uppercase
    if label.isupper():
        continue
    # skip title-cased labels that are bigrams or longer
    if is_title(label):
        # add to lower db
        if row['organ_lower'] > 0 and row['organ_upper'] == 0:
            dbpediadb_lower[label.lower()] = label
        continue
    if row['organ_upper'] == 0:
        if row['infer_upper'] == 0:
            pass
        else:
            if row['infer_lower'] > 0:
                new_ratio = row['infer_upper']/float(row['infer_lower'])
                if new_ratio < 20:
                    exclude = True
    else:
        new_ratio = row['infer_upper']/float(row['infer_lower'] or 1)
        orig_ratio = row['organ_upper']/float(row['organ_lower'] or 1)
        if new_ratio < orig_ratio:
            exclude = True
    if exclude:
        excludes.write(label+'\n')
    elif row['organ_lower'] > 0:
        dbpediadb_lower[label.lower()] = label
excludes.close()        
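
The ratio test at the heart of the loop above can be isolated as a pure function, which makes the thresholds easier to check on individual rows. This is a sketch only: `should_exclude` and its `min_ratio` parameter are hypothetical names, and the shortcuts for all-uppercase and title-cased labels from the loop are deliberately left out.

```python
def should_exclude(infer_upper, infer_lower, organ_upper, organ_lower, min_ratio=20):
    """Return True if a label looks too ambiguous to link by exact matching.

    Mirrors the ratio logic of the loop above; min_ratio=20 is the constant
    used there, exposed here as a parameter for illustration.
    """
    if organ_upper == 0:
        # No human-made links for the capitalised form: rely on inferred counts only.
        if infer_upper > 0 and infer_lower > 0:
            return infer_upper / float(infer_lower) < min_ratio
        return False
    # Human-made links exist: exclude when the inferred upper/lower ratio
    # is lower than the ratio observed in the original links.
    new_ratio = infer_upper / float(infer_lower or 1)
    orig_ratio = organ_upper / float(organ_lower or 1)
    return new_ratio < orig_ratio
```

For example, a row with `infer_upper=382` and all other counts at zero is kept, while a hypothetical row with `infer_upper=11`, `infer_lower=2` and no original links is excluded because 11/2 is below 20.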

In [5]:
#out = open('../mapreduce/PARAM_plus_achors/dbpediadb_lower.txt', 'w')
out = open('dbpedia_lower_includes.txt', 'w')
for lower_label, label in dbpediadb_lower.items():
    out.write('%s\t%s\n' % (lower_label, label))
out.close()

# Statistical significance test

In [15]:
from scipy.stats import ttest_rel
with open('/home/roman/berkeleylm/temp.log') as a:
    simple_data = [float(x.split(';')[0]) for x in a]
with open('/home/roman/berkeleylm/temp1.log') as a:
    generic_data = [float(x.split(';')[0]) for x in a]
del generic_data[-1]
del simple_data[-1]

#print 'T-test merged vs orig:', ttest_rel(merged_wiki_data, orig_wiki_data)
#print 'T-test merged vs custom:', ttest_rel(merged_wiki_data, custom_wiki_data)
print 'T-test custom vs orig:', ttest_rel(generic_data, simple_data)

T-test custom vs orig: (664.539616654572, 0.0)
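
`ttest_rel` is SciPy's paired (related-samples) t-test, so the two lists must have the same length and be aligned item by item. To make clear what the first number in the output measures, the sketch below computes the paired t statistic by hand with the standard library only. The helper name `paired_t_statistic` and the sample values are made up for illustration, and the p-value is not computed.

```python
import math


def paired_t_statistic(sample_a, sample_b):
    """t statistic of a paired t-test: mean(d) / (std(d) / sqrt(n)),
    where d are the pairwise differences and std uses n - 1."""
    if len(sample_a) != len(sample_b):
        raise ValueError('paired samples must have the same length')
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / float(n)
    # Unbiased sample variance of the differences
    variance = sum((d - mean_diff) ** 2 for d in diffs) / float(n - 1)
    return mean_diff / math.sqrt(variance / n)


# Made-up paired measurements, not the values from the log files
t_value = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

A large absolute value of the statistic means the pairwise differences are consistently on one side of zero relative to their spread.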


In [11]:
counts_df[(counts_df.label == 'Saban')]

|   | infer_lower | infer_upper | label | len | organ_lower | organ_upper |
|---|---|---|---|---|---|---|
| 265626 | 0 | 382 | Saban | 1 | 0 | 0 |


# Annotating text and computing precision/recall

In [3]:
from __future__ import division
from kilogram.dataset.dbpedia import NgramEntityResolver
import re
ner = NgramEntityResolver("/home/roman/dbpedia/dbpedia_types.txt", "/home/roman/dbpedia/dbpedia_uri_excludes.txt", "/home/roman/dbpedia/dbpedia_lower_includes.txt", "/home/roman/dbpedia/dbpedia_redirects.txt", "/home/roman/dbpedia/dbpedia_2015-04.owl")
ENTITY_MATCH_RE = re.compile(r'<(.+?)\|(.+?)>')
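
To see what the regular expression above captures, the following sketch applies it to a single annotated line. The sample sentence is hypothetical; the assumption, taken from how the evaluation code unpacks each match, is that annotations look like `<uri|surface text>`.

```python
import re

# Same pattern as above: group 1 is the entity URI, group 2 the annotated text
ENTITY_MATCH_RE = re.compile(r'<(.+?)\|(.+?)>')

# Hypothetical annotated line, not taken from the test files
sample_line = "<Barack_Obama|Barack_Obama's> speech in <Berlin|Berlin>"
matches = ENTITY_MATCH_RE.findall(sample_line)

# Normalise the surface text: underscores to spaces, possessive "'s" dropped
surface_texts = [text.replace('_', ' ').replace("'s", "") for _, text in matches]
```

Because both groups are non-greedy, each match stops at the first `|` and the first `>`, so several annotations on one line are returned as separate tuples.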

In [20]:
test_data = ['/home/roman/language_models/wekex_test', '/home/roman/language_models/msnbc_test']
for filename in test_data:
    print filename
    tp = 0
    fp = 0
    fn = 0
    with open(filename) as test_file:
        lines = test_file.read().splitlines()
    for line in lines:
        # Each annotation is captured as a (uri, text) pair
        for uri, text in ENTITY_MATCH_RE.findall(line):
            text = text.replace('_', ' ').replace("'s", "")
            # Follow redirects to the canonical URI
            uri = ner.redirects_file.get(uri, uri)
            if uri in ner.dbpedia_types:
                # Use a separate name so the list being iterated is not shadowed
                resolved = ner.resolve_entities(text.split())
                not_valid = [x for x in resolved if x.startswith('<dbpedia:') and uri not in x]
                valid = [x for x in resolved if x.startswith('<dbpedia:') and uri in x]
                fp += len(not_valid)
                tp += len(valid)
                if not valid:
                    fn += 1
    print tp/(tp+fp)  # precision
    print tp/(tp+fn)  # recall

/home/roman/language_models/wekex_test
0.921052631579
0.645161290323
/home/roman/language_models/msnbc_test
0.905213270142
0.628289473684
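
For each file, the first number printed is precision, `tp/(tp+fp)`, and the second is recall, `tp/(tp+fn)`. The sketch below restates both as small helpers that also guard against empty denominators. The function names and the counts in the usage lines are illustrative, not values from the evaluation above.

```python
def precision(tp, fp):
    """Share of produced links that are correct; 0.0 when nothing was linked."""
    total = tp + fp
    return tp / float(total) if total else 0.0


def recall(tp, fn):
    """Share of gold annotations that were recovered; 0.0 when there are none."""
    total = tp + fn
    return tp / float(total) if total else 0.0


# Hypothetical counts for illustration
example_precision = precision(9, 1)
example_recall = recall(9, 3)
```

With these made-up counts, nine of ten produced links are correct and nine of twelve gold annotations are found.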
