We've downloaded the 100 highest-scoring papers from MAG for each level-0 field. These papers are exemplars of their fields, so we'd expect our model to score them highly in those fields too.

In [1]:
from collections import Counter

from fos.model import FieldModel
from fos.settings import ASSETS_DIR
import pandas as pd
import numpy as np

from fos.vectors import embed_fasttext

mag_texts = pd.read_pickle(ASSETS_DIR / 'fields/example_text.pkl.gz')
meta = pd.read_pickle(ASSETS_DIR / 'fields/fos.pkl.gz')
fields = FieldModel("en")

In [48]:
# Show an example for each L0 field
for _, row in mag_texts.drop_duplicates('display_name').iterrows():
    print(f"{row['display_name']:<18}", '\t', row['text'][:90])

Art                	 the search for aesthetic meaning in the visual arts the need for the aesthetic tradition i
Biology            	 geographic distribution of the e1 family of genes and their effects on reproductive timing
Business           	 using the financial and business literature electronic resources accounting advertising af
Chemistry          	 the fate of amino acids adsorbed on mineral matter abstract we present here selected resul
Computer science   	 integrating memory consistency models and communication systems the shared memory paradigm
Economics          	 essays in economic theory preface biographical sketch alaknanda patel introduction partha 
Engineering        	 by engineers for engineers the bergeron centre for engineering excellence is more than jus
Environmental science 	 a processbased inventory model for landfill ch4 emissions inclusive of seasonal soil micro
Geography          	 the geography of manitoba its land and its people manitoba is more than one of c

We score each paper with our model, then see where the paper's MAG field ranks among our level-0 (L0) scores. A rank of 1 means our top-scoring L0 field matches MAG's.

In [None]:
ranks = []
for i, (_, doc) in enumerate(mag_texts.iterrows(), start=1):
    # Embed the document text with fastText
    doc_vector = embed_fasttext(doc['text'], fields.fasttext)
    # Score the vector against the field embeddings
    scores = pd.DataFrame({'field_id': fields.index, 'score': fields.field_fasttext[doc_vector]})
    # Attach field names and levels, then keep level-0 fields sorted best-first
    scores = pd.merge(scores, meta[['display_name', 'level']], left_on='field_id', right_index=True)
    scores = scores.loc[scores.level == 0].sort_values('score', ascending=False)
    # 1-based rank of the paper's MAG field among our level-0 scores
    rank = np.where(scores['display_name'] == doc['display_name'])[0][0] + 1
    ranks.append((doc['display_name'], rank))
    if i % 500 == 0:
        print(i)  # progress indicator; this takes a little while
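
As a rough illustration of the score-then-rank step, here is a minimal sketch on toy data. It assumes the field scores behave like cosine similarities between the document vector and each field vector, which may not be how `fields.field_fasttext` actually scores. The vectors and the helper names `cosine_scores` and `rank_of` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Toy field vectors (hypothetical values, not the real field embeddings)
field_vectors = pd.DataFrame(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]],
    index=["Physics", "Chemistry", "Materials science"],
)


def cosine_scores(doc_vector, field_vectors):
    """Cosine similarity between one document vector and every field vector."""
    mat = field_vectors.to_numpy()
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(doc_vector)
    return pd.Series(mat @ doc_vector / norms, index=field_vectors.index, name="score")


def rank_of(expected_field, scores):
    """1-based position of expected_field once scores are sorted best-first."""
    ordered = scores.sort_values(ascending=False)
    return int(np.where(ordered.index == expected_field)[0][0]) + 1


doc_vector = np.array([0.2, 0.9, 0.1])  # hypothetical document embedding
scores = cosine_scores(doc_vector, field_vectors)
toy_rank = rank_of("Physics", scores)  # Chemistry and Materials science score higher
```

With these toy values the expected field, Physics, lands at rank 3, which is the kind of case the error analysis further down looks at.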

In [3]:
ranks = pd.DataFrame(ranks)
ranks.columns = ['field', 'rank']
rank_freqs = ranks.groupby(['field'])['rank'].apply(pd.value_counts) / 100
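
Dividing by 100 turns counts into proportions because each field has exactly 100 papers. As a hedged alternative that does not depend on that count, the sketch below builds the same kind of field-by-rank table with `value_counts(normalize=True)` and `unstack`. The data and the names `toy_ranks` and `toy_freqs` are made up for illustration.

```python
import pandas as pd

# Hypothetical ranks: four papers for each of two fields
toy_ranks = pd.DataFrame({
    "field": ["Art", "Art", "Art", "Art", "Biology", "Biology", "Biology", "Biology"],
    "rank": [1, 1, 2, 3, 1, 1, 1, 2],
})

toy_freqs = (
    toy_ranks.groupby("field")["rank"]
    .value_counts(normalize=True)  # share of each field's papers at each rank
    .unstack(fill_value=0.0)       # one row per field, one column per rank
)
```

Here `toy_freqs.loc["Art", 1]` is 0.5, since two of the four toy Art papers have rank 1.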

In [20]:
rank_freqs.reset_index().pivot(index='field', columns='level_1').fillna('')

rank                       1     2     3     4     5     7     9
field
Art                     0.54  0.26   0.2
Biology                 0.99  0.01
Business                0.98  0.02
Chemistry               0.88  0.12
Computer science         1.0
Economics               0.84  0.05  0.07  0.02  0.01  0.01
Engineering             0.82  0.12  0.02  0.01  0.02        0.01
Environmental science   0.67   0.3  0.03
Geography               0.86  0.08  0.02  0.03  0.01
Geology                  1.0


This looks fairly good. For instance, our top field for all of the exemplar CS papers is CS.

Disagreement is highest in physics, environmental science, art, and sociology.

For these four fields, let's see which other fields we score higher than the MAG field when the MAG field isn't our top-ranked one.

In [52]:
errors = {}
texts = []
for field in ['Physics', 'Environmental science', 'Art', 'Sociology']:
    errors[field] = Counter()
    for _, doc in mag_texts.loc[mag_texts.display_name == field].iterrows():
        # embed with fasttext
        doc_vector = embed_fasttext(doc['text'], fields.fasttext)
        # score the vector against field embeddings
        scores = pd.DataFrame({'field_id': fields.index, 'score': fields.field_fasttext[doc_vector]})
        scores = pd.merge(scores, meta[['display_name', 'level']], left_on='field_id', right_index=True)
        scores = scores.loc[scores.level == 0].sort_values('score', ascending=False)
        i = np.where(scores['display_name'] == doc['display_name'])[0][0]
        if i != 0:
            texts.append({
                'field': field,
                'higher-scoring fields': '; '.join(scores.iloc[:i]['display_name'].values),
                'text': doc['text']
            })
            for j in range(i):
                errors[field].update([scores.iloc[j]['display_name']])

In [53]:
for field, counts in errors.items():
    print(f'{field}:')
    for other_field, n in counts.most_common(19):
        print(f'    {other_field:<18} {n}')

Physics:
    Chemistry          52
    Materials science  5
    Biology            4
    Geology            1
Environmental science:
    Geology            32
    Chemistry          2
    Materials science  2
Art:
    Philosophy         42
    History            24
Sociology:
    Political science  18
    History            8
    Psychology         5
    Philosophy         1
    Geography          1
    Art                1
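
Note that one paper can add several counts to a tally, one for every field that outranks its MAG field. For Art, the rank table shows 26 papers at rank 2 and 20 at rank 3, which gives 26 + 2 × 20 = 66 counts, matching the 42 + 24 above. A minimal sketch of that tally on toy data follows; the orderings and the helper `fields_ranked_above` are hypothetical.

```python
from collections import Counter


def fields_ranked_above(expected_field, ordered_fields):
    """Fields that precede expected_field in a best-first ordering."""
    position = ordered_fields.index(expected_field)
    return ordered_fields[:position]


# Hypothetical best-first field orderings for three Physics papers
toy_orderings = [
    ["Chemistry", "Physics", "Biology"],
    ["Physics", "Chemistry", "Biology"],
    ["Chemistry", "Materials science", "Physics"],
]

toy_errors = Counter()
for ordering in toy_orderings:
    # A paper at rank k contributes k - 1 counts, one per higher-scoring field
    toy_errors.update(fields_ranked_above("Physics", ordering))
```

In this toy case Chemistry is counted twice and Materials science once, while the correctly ranked second paper contributes nothing.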


Finally, let's take a look at the text of the papers where another field outscored the MAG field.

In [58]:
for text in texts:
    print(text['field'], '<', text['higher-scoring fields'])
    print('    ', text['text'])
    print()

Physics < Chemistry
     properties of narrowu31 based on themdiquonium 651165116511interpretation we study the properties ofu31 assuming that theu is anmdiquoniumsqbar q2q u ord state it is shown that the annihilation decay which becomes the most important for usual diquonia is forbidden foru we show there exist various reasons which makeu narrow nearu31 we expect other narrow diquonia we also compute the electromagnetic mass splitting and find thatu is the heaviest andu0 is the lightest

Physics < Chemistry
     on electromagnetic corrections in mue decay electromagnetic corrections to the angular distribution of electrons were obtained for the v a theory of mu e decay auth

Physics < Chemistry
     existence of atoms and molecules in nonrelativistic quantum electrodynamics we show that the hamiltonian describing n nonrelativistic electrons with spin interacting with the quantized radiation field and several fixed nuclei with total charge z has a ground state when n z the result hold

Sociology < Political science; History
     dynamic literacies and democracy a framework for historical literacy a stated goal of australian schooling is that all students will become active and informed citizens mceetya melbourne declaration of educational goals for young australians barton act ministerial council on education employment training and youth affairs accordingly national education policy and curriculum reforms are increasingly concerned with the attributes or qualities that may be required for an individual to be a successful citizen in the twentyfirst century research in history education has espoused the potential of studying history to help young people to prepare for the kind of reasoning and informed decision making that will be required for participatory citizenship for examples see sam wineburg why learn history when its already on your phone chicago university of chicago press keith barton agency choice and historical action how history teaching can help students