Here we analyse the top 50 subjects for each document. They were computed using the sum cosine method. The candidate subjects are the descendants of the three most similar fields of each document.

In [25]:
import json
from random import sample
from collections import Counter

In [33]:
subjects = json.load(open('../data/distances/sum/top_50_subjects.json'))
subject_info = json.load(open('../data/openalex/subjects.json'))
doc_info = json.load(open('../data/json/dim/all/improved_data.json'))
doc_tokens = json.load(open('../data/json/dim/all/data_lemmas_vocab.json'))
venue_docs = json.load(open('../data/json/dim/all/ert/venue_publications.json'))
for v_file in ['referees', 'advisors']:
  for venue, docs in json.load(open(f'../data/json/dim/all/ert/{v_file}.json')).items():
    if venue in venue_docs:
      venue_docs[venue] += docs
    else:
      venue_docs[venue] = docs
len(subjects)

281

In [16]:
def show(doc, n=5):
  print(doc_info[doc])
  print(doc_tokens[doc])
  cnt = 0
  for subject, dist in subjects[doc].items():
    subject_name = subject_info[subject]['name']
    print(subject_name, dist)
    cnt += 1
    if cnt == n:
      return


In [24]:
doc_ids = sample(list(subjects.keys()), 30)
for doc_id in doc_ids:
  show(doc_id)
  print('\n')

{'title': 'Reservoir characterisation using macromolecular petroleum compounds including asphaltenes: A case study of the Heidrun oil field in the Norwegian North Sea', 'abstract': 'The present thesis is part of the industry partnership project “BioPets Flux” between the GFZ Potsdam and the industry partners BG Group, Devon Energy, ExxonMobil, Petrobras, Repsol YPF, Shell, and Statoil. The study aims at improving predictions of reservoir alteration postfilling, and enhancing the understanding of reservoir charge history. The original composition and the volume of petroleum in reservoirs are often subjected to post-filling alteration processes, in many cases to variable degrees due to the reservoir compartmentalization. Such alteration processes, e.g. biodegradation or water washing, have strong economic consequences since they lead to a decrease in oil quality and producibility. The focus of this thesis was on the total macromolecular petroleum fraction of reservoir rocks (including as

What are the most common subjects?

In [30]:
subject_names = []
for arr in subjects.values():
  subject_names += [subject_info[s]['name'] for s in arr.keys()]
cnt = Counter(subject_names)
print(cnt.most_common(7))

[('Petrography', 71), ('Quartz', 68), ('Particle size', 68), ('Subsoil', 64), ('Contamination', 62), ('Petroleum', 61), ('Clay minerals', 59)]


What are the most recurring subjects per venue? Consider only those with a distance lower than 0.4

In [38]:
venue_subjects = {}
for doc in subjects:
  venue = None
  for v in venue_docs:  # find venue
    if doc in venue_docs[v]:
      venue = v
      break
  if venue not in venue_subjects:
    venue_subjects[venue] = Counter()
  for subject, dist in subjects[doc].items():
    if dist < .4:
      subject_name = subject_info[subject]['name']
      venue_subjects[venue][subject_name] += 1
json.dump(venue_subjects, open('venue_subjects_threshold.json', 'w'))

In [39]:
for venue, cnt in venue_subjects.items():
  print(venue, cnt.most_common(3))

Physical chemistry, chemical physics [('Protein structure', 2), ('Radical', 1), ('Quartz', 1)]
Biochemical engineering journal [('Fermentation', 1), ('Oxidative phosphorylation', 1), ('In situ', 1)]
Tränkle, Günther [('Light-emitting diode', 1), ('Thin film', 1), ('Semiconductor', 1)]
Brain–computer interfaces handbook [('Cognition', 1), ('Sustainability', 1), ('Soil classification', 1)]
Engel, Harald [('Frequency response', 1), ('Logic gate', 1), ('Finite element method', 1)]
Knorr, Andreas [('Petrography', 1), ('Radiometer', 1), ('Ranging', 1)]
Eichler, Hans Joachim [('Petrography', 1), ('Monolayer', 1), ('Protein structure', 1)]
Heinrich, Wolfgang [('Petrography', 1), ('Infrared spectroscopy', 1), ('Microstructure', 1)]
Fortschritte der Akustik - DAGA 2015: 41. Jahrestagung für Akustik, 16. - 19. März 2015 in Nürnberg [('Photography', 1), ('Systems architecture', 1), ('Software architecture', 1)]
Journal of Chemical Physics [('Surface wave', 1)]
Zeitschrift für Naturforschung B [('S

What are the most recurring subjects per venue? Consider only the five most similar subjects for each document.

In [40]:
venue_subjects = {}
for doc in subjects:
  venue = None
  for v in venue_docs:  # find venue
    if doc in venue_docs[v]:
      venue = v
      break
  if venue not in venue_subjects:
    venue_subjects[venue] = Counter()
  cnt = 0
  for subject, dist in subjects[doc].items():
    subject_name = subject_info[subject]['name']
    venue_subjects[venue][subject_name] += 1
    if cnt == 5:
      break
json.dump(venue_subjects, open('venue_subjects_top5.json', 'w'))

In [41]:
for venue, cnt in venue_subjects.items():
  print(venue, cnt.most_common(3))

Physical chemistry, chemical physics [('Protein structure', 2), ('Radical', 1), ('Quartz', 1)]
Biochemical engineering journal [('Fermentation', 1), ('Oxidative phosphorylation', 1), ('In situ', 1)]
Tränkle, Günther [('Light-emitting diode', 1), ('Thin film', 1), ('Semiconductor', 1)]
Brain–computer interfaces handbook [('Cognition', 1), ('Sustainability', 1), ('Soil classification', 1)]
Engel, Harald [('Frequency response', 1), ('Logic gate', 1), ('Finite element method', 1)]
Knorr, Andreas [('Petrography', 1), ('Radiometer', 1), ('Ranging', 1)]
Eichler, Hans Joachim [('Petrography', 1), ('Monolayer', 1), ('Protein structure', 1)]
Heinrich, Wolfgang [('Petrography', 1), ('Infrared spectroscopy', 1), ('Microstructure', 1)]
Fortschritte der Akustik - DAGA 2015: 41. Jahrestagung für Akustik, 16. - 19. März 2015 in Nürnberg [('Casting', 2), ('Photography', 1), ('Systems architecture', 1)]
Journal of Chemical Physics [('Surface wave', 1), ('Optical fiber', 1), ('Noise reduction', 1)]
Zeits