## Introduction

The titles included in the bibliography tell us a few things, for instance the language a publication is written in or the main themes (keywords, authors) that are the subject of a publication. Let's find out about them.

In [26]:
# === Imports === 

# Basics
import re 
from os.path import realpath, dirname, join
import os
from lxml import etree
from collections import Counter
import pandas as pd

# Visualization
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import plotly.express as px


# === Files and parameters === 

wdir  = join("/", "media", "christof", "Data", "Github", "christofs", "bib18")
#bib18_file = join(wdir, "data", "Bib18_test.rdf") 
bib18_file = join(wdir, "data", "Bib18.rdf") 

namespaces = {
    "foaf" : "http://xmlns.com/foaf/0.1/",
    "bib" : "http://purl.org/net/biblio#",
    "dc" : "http://purl.org/dc/elements/1.1/",
    "z" : "http://www.zotero.org/namespaces/export#",
    "rdf" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    }


## Extracting the titles

The first step is to identify the titles in the dataset. Ideally, we would exclude journal titles, but this is not yet implemented, so the count below includes them.

In [27]:
def read_rdf(bib18_file): 
    """
    Open and parse the RDF/XML version of the Bib18 dataset.
    Returns: the XML document as an lxml ElementTree object. 
    """
    return etree.parse(bib18_file)


def get_titles(bib18): 
    """
    Extract all titles from the dataset. 
    Note: journal titles are currently included as well (see TODO below). 
    Returns: list of str (the text of each dc:title element)
    """
    # Select the text of every "dc:title" element in the dataset.
    # TODO: find the XPath to exclude journal titles. 
    xpath = "//dc:title/text()"
    titles = bib18.xpath(xpath, namespaces=namespaces)
    print(f"Number of titles found: {len(titles)}.")
    return titles

# === Main === 

def main(): 
    # Make variable global (=> accessible to next cells)
    global titles 
    bib18 = read_rdf(bib18_file)
    titles = get_titles(bib18)
main()


Number of titles found: 110174.
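
One possible way to address the TODO above is to skip every title whose parent element is a journal. The sketch below is not part of the original analysis. It assumes that journals are encoded as `bib:Journal` elements with a `dc:title` child, as is usual in Zotero RDF exports, and this has not been checked against Bib18.rdf. The sample document and the helper `get_primary_titles` are made up for illustration, and the sketch uses the standard library's ElementTree so that it runs without extra dependencies. With lxml, the equivalent XPath would probably be `//dc:title[not(parent::bib:Journal)]/text()`.

```python
import xml.etree.ElementTree as ET

NS = {
    "bib": "http://purl.org/net/biblio#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical miniature document, NOT taken from Bib18.rdf.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bib="http://purl.org/net/biblio#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <bib:Article>
    <dc:title>Example article title</dc:title>
  </bib:Article>
  <bib:Journal>
    <dc:title>Example journal name</dc:title>
  </bib:Journal>
  <bib:Book>
    <dc:title>Example book title</dc:title>
  </bib:Book>
</rdf:RDF>"""


def get_primary_titles(root):
    """Collect dc:title texts whose parent element is not a bib:Journal."""
    journal_tag = "{%s}Journal" % NS["bib"]
    title_tag = "{%s}title" % NS["dc"]
    titles = []
    for parent in root.iter():
        # Assumption: journal names sit directly inside bib:Journal.
        if parent.tag == journal_tag:
            continue
        for child in parent:
            if child.tag == title_tag and child.text:
                titles.append(child.text)
    return titles


root = ET.fromstring(SAMPLE)
primary_titles = get_primary_titles(root)
print(primary_titles)
```

If the real data nests journal titles differently, for example inside another container element, the parent test would need to be adapted accordingly.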


## Distribution of languages, based on the titles

We don't have abstracts or full texts in the dataset, but the titles alone let us estimate the most likely language of each publication. Note that this is an algorithmic process (using lingua-py) based only on the titles, which are sometimes very short, so errors are to be expected. Also, these values are calculated here on the fly rather than read from the bibliographic data, where fewer languages are distinguished.

The following table shows how often each of the 10 most frequent languages occurs in the dataset.


In [33]:
from lingua import Language, LanguageDetectorBuilder

# Candidate languages the detector chooses from.
LANGUAGES = [Language.ENGLISH,
             Language.FRENCH,
             Language.GERMAN,
             Language.SPANISH,
             Language.ITALIAN,
             Language.PORTUGUESE,
             Language.ROMANIAN,
             Language.DUTCH,
             Language.SWEDISH,
             Language.NYNORSK,
             Language.BOKMAL,
             Language.DANISH,
             Language.FINNISH,
             Language.RUSSIAN,
             Language.SLOVENE,
             Language.ESTONIAN,
             Language.LATVIAN,
             Language.LITHUANIAN,
             Language.POLISH,
             Language.CZECH,
             ]

# Build the detector once; rebuilding it for every title is needlessly slow.
DETECTOR = LanguageDetectorBuilder.from_languages(*LANGUAGES).build()


def identify_language(title): 
    """
    For one title, detect the most likely language. 
    This function is called by "get_titles_langs_counts()". 
    Returns: str (language)
    """
    confidence_values = DETECTOR.compute_language_confidence_values(title)
    # Map each candidate language to its confidence value.
    confidence_values_dict = {language: value for language, value in confidence_values}
    # Keep the language with the highest confidence.
    max_language = max(confidence_values_dict, key=confidence_values_dict.get)
    return max_language.name.title()


def get_titles_langs_counts(titles): 
    """
    For each title, collect the language identified by lingua-py. 
    Returns: DataFrame (names and counts of all languages in all titles)
    """
    titles_langs = [identify_language(title) for title in titles]
    # most_common() returns (language, count) pairs sorted by descending count.
    tlc = pd.DataFrame(
        Counter(titles_langs).most_common(),
        columns=["language", "count"])

    # Display the top 10 languages nicely
    display(tlc.head(10).style.hide(axis="index"))
    return tlc


def main(titles): 
    global tlc
    tlc = get_titles_langs_counts(titles) 
main(titles)


| language | count |
|---|---|
| French | 79314 |
| English | 21496 |
| Spanish | 1948 |
| Italian | 1760 |
| German | 1523 |
| Latvian | 1067 |
| Portuguese | 580 |
| Romanian | 477 |
| Swedish | 380 |
| Dutch | 370 |
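
Because very short titles give the detector little evidence, one possible refinement, not applied in this notebook, is to accept a language only when it clearly beats the runner-up. The sketch below illustrates the idea in plain Python on made-up confidence values. The function `pick_language`, the `min_margin` threshold and the label "Undetermined" are hypothetical names introduced here. lingua-py also offers a builder option, `with_minimum_relative_distance`, that serves a similar purpose, but its effect on this dataset has not been tested.

```python
def pick_language(confidences, min_margin=0.2):
    """
    Return the most confident language, or "Undetermined" when the
    lead over the second-best candidate is smaller than min_margin.
    confidences: list of (language_name, confidence) pairs.
    """
    if not confidences:
        return "Undetermined"
    ranked = sorted(confidences, key=lambda pair: pair[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return "Undetermined"
    return ranked[0][0]


# Made-up confidence values for illustration only.
clear_case = [("French", 0.91), ("Italian", 0.05), ("Latvian", 0.04)]
ambiguous_case = [("Latvian", 0.36), ("French", 0.34), ("Italian", 0.30)]

clear_result = pick_language(clear_case)
ambiguous_result = pick_language(ambiguous_case)
print(clear_result, ambiguous_result)
```

The trade-off is that a stricter margin leaves more titles unclassified, so the threshold would need to be chosen by inspecting a sample of results.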


Let's visualize just the top 10 languages as well. 

In [34]:
def visualize_tlc(tlc): 
    """
    Create a visualization (barplot) of the languages counts. 
    """
    fig = px.bar(
        tlc.head(10), 
        x="language", 
        y="count", 
        title="Distribution of languages in the titles",
        text_auto=True)
    fig.show()


def main(tlc): 
    visualize_tlc(tlc)
main(tlc)

French clearly dominates: 79,314 of the 110,174 titles (about 72%) are identified as French, followed by English with 21,496 (about 20%).
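
The relative weight of each language can be worked out from the counts reported in the table above. The following is a minimal sketch in plain Python. It hard-codes the published counts rather than reading them from `tlc`, and the variable names are chosen here for illustration.

```python
# Counts copied from the table of the 10 most frequent languages.
total_titles = 110174
top10_counts = {
    "French": 79314,
    "English": 21496,
    "Spanish": 1948,
    "Italian": 1760,
    "German": 1523,
    "Latvian": 1067,
    "Portuguese": 580,
    "Romanian": 477,
    "Swedish": 380,
    "Dutch": 370,
}

# Share of each language among all titles, in percent.
shares = {
    language: round(100 * count / total_titles, 1)
    for language, count in top10_counts.items()
}

# Titles assigned to a language outside the top 10.
other_titles = total_titles - sum(top10_counts.values())

for language, share in shares.items():
    print(f"{language}: {share}%")
print(f"Other languages: {other_titles} titles")
```

Inside the notebook, the same shares could presumably be derived from the `count` column of `tlc` directly.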