Look at the structure of the references extracted by `extract_references.py`

In [2]:
import json
import random

In [3]:
hu_refs = json.load(open('../data/json/references/edoc.json'))
tu_refs = json.load(open('../data/json/references/depositonce.json'))
fu_refs = json.load(open('../data/json/references/refubium.json'))

In [4]:
hu = json.load(open('../data/json/dim/edoc/relevant_data.json'))
tu = json.load(open('../data/json/dim/depositonce/relevant_data.json'))
fu = json.load(open('../data/json/dim/refubium/relevant_data.json'))

In [5]:
len(hu_refs), len(tu_refs), len(fu_refs), len(hu_refs) + len(tu_refs) + len(fu_refs)

(577, 7433, 14449, 22459)

How many documents have references, in total and per repository?

In [6]:
print(f'TU: {len(tu_refs)} of {len(tu)} docs have references ({round(len(tu_refs)/len(tu), 2)})')
print(f'HU: {len(hu_refs)} of {len(hu)} docs have references ({round(len(hu_refs)/len(hu), 2)})')
print(f'FU: {len(fu_refs)} of {len(fu)} docs have references ({round(len(fu_refs)/len(fu), 2)})')

TU: 7433 of 7438 docs have references (1.0)
HU: 577 of 7497 docs have references (0.08)
FU: 14449 of 14464 docs have references (1.0)


In [7]:
total_refs_cnt = len(tu_refs) + len(hu_refs)+ len(fu_refs)
print(f'Total: {total_refs_cnt} of {len(tu)+len(hu)+len(fu)} docs have references ({round(total_refs_cnt/(len(tu)+len(hu)+len(fu)), 2)})')

Total: 22459 of 29399 docs have references (0.76)
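
The per-repository share can also be computed in one loop instead of three near-identical print statements. This is a minimal sketch: the helper `coverage` and the toy dictionaries are hypothetical stand-ins for the loaded JSON data, not part of the original notebook.

```python
def coverage(refs, docs):
    """Return (docs_with_refs, total_docs, share) for one repository."""
    with_refs = len(refs)
    total = len(docs)
    # Guard against an empty repository to avoid a ZeroDivisionError.
    share = with_refs / total if total else 0.0
    return with_refs, total, share

# Toy stand-ins for (refs, docs) per repository; made-up data.
repos = {
    'HU': ({'a': []}, {'a': {}, 'b': {}, 'c': {}, 'd': {}}),
    'TU': ({'x': [], 'y': []}, {'x': {}, 'y': {}}),
}
for name, (refs, docs) in repos.items():
    n, total, share = coverage(refs, docs)
    print(f'{name}: {n} of {total} docs have references ({share:.2f})')
```

Keeping the repository label in the dictionary key avoids mislabelled rows when a line is copied and edited.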


Why do edoc documents seldom have references?

In [8]:
hu_empty = set(hu.keys()) - set(hu_refs.keys())
random.sample(sorted(hu_empty), 5)  # random.sample() needs a sequence; sets raise TypeError on Python 3.11+

['oai:edoc.hu-berlin.de:18452/10114',
 'oai:edoc.hu-berlin.de:18452/22862.2',
 'oai:edoc.hu-berlin.de:18452/23133',
 'oai:edoc.hu-berlin.de:18452/20685',
 'oai:edoc.hu-berlin.de:18452/18014']

The logs seem to contain some errors: syntax errors and mismatched tags. I assume they belong to refubium. How many edoc IDs are missing from the log, i.e. were not parsed correctly?

In [9]:
log = open('../logs/extractrefs_1636620650.log').read()

In [10]:
missing_ids = []
parsed_ids = []
for id in hu:
  number = ' ' + id.split('/')[-1] + ' '
  if number not in log:
    missing_ids.append(id)
  else:
    parsed_ids.append(id)
len(missing_ids)


2374
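
The substring test above scans the whole log once per ID. A possible alternative is to split the log into whitespace-delimited tokens once and use set membership. This is only a sketch: it assumes the handle number appears in the log as its own token, the helper `split_by_log` and the toy log are hypothetical, and unlike the `' number '` check it also matches a number at the start or end of a line.

```python
def split_by_log(ids, log_text):
    """Split ids into (missing, parsed) by whether the handle number
    occurs as a whitespace-delimited token in the log text."""
    tokens = set(log_text.split())
    missing, parsed = [], []
    for doc_id in ids:
        number = doc_id.split('/')[-1]
        if number in tokens:
            parsed.append(doc_id)
        else:
            missing.append(doc_id)
    return missing, parsed

# Made-up log lines and IDs, for illustration only.
toy_log = 'parsed 10114 ok\nparsed 22862.2 ok\n'
toy_ids = [
    'oai:edoc.hu-berlin.de:18452/10114',
    'oai:edoc.hu-berlin.de:18452/22862',
    'oai:edoc.hu-berlin.de:18452/22862.2',
]
missing, parsed = split_by_log(toy_ids, toy_log)
print(len(missing), len(parsed))
```

Matching whole tokens keeps `22862` from being counted as parsed just because `22862.2` is in the log.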

What are the types of the documents that were not parsed?

In [11]:
hu_types = json.load(open('../data/json/dim/edoc/relevant_types.json'))
doc_types = {}
for id in missing_ids:
  doc_type = hu_types[id]
  if doc_type in doc_types:
    doc_types[doc_type] += 1
  else:
    doc_types[doc_type] = 1
doc_types

{'doctoralthesis': 815,
 'article': 608,
 'book': 627,
 'report': 7,
 'conferenceobject': 185,
 'bookpart': 48,
 'workingpaper': 39,
 'masterthesis': 38,
 'periodicalpart': 2,
 'bachelorthesis': 4,
 'studythesis': 1}
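
The same tally can be written with `collections.Counter` from the standard library, which also sorts by frequency. A minimal sketch with made-up IDs and types; `count_types` is a hypothetical helper name.

```python
from collections import Counter

def count_types(ids, types_by_id):
    """Count how often each document type occurs among the given ids."""
    return Counter(types_by_id[doc_id] for doc_id in ids)

# Toy mapping from document ID to type; made-up data.
toy_types = {'a': 'article', 'b': 'book', 'c': 'article'}
counts = count_types(['a', 'b', 'c'], toy_types)
print(counts.most_common())
```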

In [12]:
random.sample(missing_ids, 5)

['oai:edoc.hu-berlin.de:18452/16948',
 'oai:edoc.hu-berlin.de:18452/22301',
 'oai:edoc.hu-berlin.de:18452/22161',
 'oai:edoc.hu-berlin.de:18452/19707',
 'oai:edoc.hu-berlin.de:18452/22420']

I can't find the reason why some edoc documents were not parsed successfully. I am rerunning the extraction and will investigate further afterwards.

How many references are there on average, per repository and in total?

In [14]:
hu_total, tu_total, fu_total = 0, 0, 0
for refs in hu_refs.values():
  hu_total += len(refs)
for refs in tu_refs.values():
  tu_total += len(refs)
for refs in fu_refs.values():
  fu_total += len(refs)
print(f'HU avg.: {round(hu_total/len(hu_refs), 2)}')
print(f'TU avg.: {round(tu_total/len(tu_refs), 2)}')
print(f'FU avg.: {round(fu_total/len(fu_refs), 2)}')
total = hu_total + tu_total + fu_total
print(f'Total avg.: {round(total/(len(hu_refs)+len(tu_refs)+len(fu_refs)), 2)}')


HU avg.: 169.28
TU avg.: 151.7
FU avg.: 158.91
Total avg.: 156.79


That seems like a lot. Theses are probably to blame for these large averages.
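
Since a few very long reference lists can pull a mean upwards, the median is a useful cross-check. A minimal sketch using the standard `statistics` module; the counts are made up and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(ref_counts):
    """Return mean, median and maximum of a list of reference counts."""
    return {
        'mean': round(statistics.mean(ref_counts), 2),
        'median': statistics.median(ref_counts),
        'max': max(ref_counts),
    }

# Made-up counts: four ordinary documents and one very long thesis.
toy_counts = [12, 30, 45, 60, 900]
print(summarize(toy_counts))
```

If the median is far below the mean, as in this toy list, a small number of documents dominates the average.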

In [20]:
hu_types = json.load(open('../data/json/dim/edoc/relevant_types.json'))
tu_types = json.load(open('../data/json/dim/depositonce/relevant_types.json'))
fu_types = json.load(open('../data/json/dim/refubium/relevant_types.json'))
hu_theses, tu_theses, fu_theses = [], [], []
hu_publications, tu_publications, fu_publications = [], [], []
for id in hu_refs:
  refs = hu_refs[id]
  doc_type = hu_types[id]
  if 'thesis' in doc_type:
    hu_theses.append(len(refs))
  else:
    hu_publications.append(len(refs))
for id in tu_refs:
  refs = tu_refs[id]
  doc_type = tu_types[id]
  if 'thesis' in doc_type:
    tu_theses.append(len(refs))
  else:
    tu_publications.append(len(refs))
for id in fu_refs:
  refs = fu_refs[id]
  doc_type = fu_types[id]
  if 'thesis' in doc_type:
    fu_theses.append(len(refs))
  else:
    fu_publications.append(len(refs))
print('Theses')
print(f'HU avg.: {round(sum(hu_theses)/len(hu_theses), 2)}')
print(f'TU avg.: {round(sum(tu_theses)/len(tu_theses), 2)}')
print(f'FU avg.: {round(sum(fu_theses)/len(fu_theses), 2)}')
# The summed numerator is parenthesized so that the whole sum is divided, not only the FU term.
print(f'Total avg.: {round((sum(hu_theses)+sum(tu_theses)+sum(fu_theses))/(len(hu_theses)+len(tu_theses)+len(fu_theses)), 2)}')
print('Publications')
print(f'HU avg.: {round(sum(hu_publications)/len(hu_publications), 2)}')
print(f'TU avg.: {round(sum(tu_publications)/len(tu_publications), 2)}')
print(f'FU avg.: {round(sum(fu_publications)/len(fu_publications), 2)}')
print(f'Total avg.: {round((sum(hu_publications)+sum(tu_publications)+sum(fu_publications))/(len(hu_publications)+len(tu_publications)+len(fu_publications)), 2)}')


Theses
HU avg.: 322.94
TU avg.: 227.66
FU avg.: 234.56
Total avg.: (815030.95 as originally printed; computed without the parentheses, so not an average)
Publications
HU avg.: 77.33
TU avg.: 91.94
FU avg.: 121.16
Total avg.: (410449.48 as originally printed; computed without the parentheses, so not an average)
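
A pooled average over several repositories is the sum of all counts divided by the number of all documents, so it always lies between the smallest and the largest per-repository average. The sketch below groups the sums before dividing; `pooled_mean` is a hypothetical helper and the lists are made-up counts, not the real data.

```python
def pooled_mean(*groups):
    """Mean over all values in all groups: (sum of sums) / (sum of lengths).
    Raises ZeroDivisionError if every group is empty."""
    total = sum(sum(group) for group in groups)
    count = sum(len(group) for group in groups)
    return total / count

# Made-up reference counts per repository.
hu_toy, tu_toy, fu_toy = [300, 340], [200, 250, 240], [230]
print(round(pooled_mean(hu_toy, tu_toy, fu_toy), 2))

# Without grouping, only the last sum is divided by the document count.
ungrouped = sum(hu_toy) + sum(tu_toy) + sum(fu_toy) / 6
print(round(ungrouped, 2))
```

A result in the hundreds of thousands next to per-repository averages in the hundreds is a quick sign that the division was applied to one term only.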
