# Links between authors

In this notebook we will perform a basic analysis of the links between **authors** included in the Sefaria dataset.


## Setup

### Imports

In [None]:
import os
import json
import pathlib
import shutil
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

### Directories

In [None]:
dataset_dirname = "../sample_dataset/"
raw_subdirname = "raw/"
raw_metadata_subdirame = "_schemas/"
links_count_fn = dataset_dirname + raw_subdirname + 'links/links_by_book_without_commentary.csv'
output_subfolder = "./links_files/"

### Constants

Names of books we will want to merge in the analysis:


In [None]:
masekhtot = ['Arakhin', 'Bekhorot', 'Chullin', 'Keritot', 'Meilah', 'Menachot', 'Tamid', 'Temurah', 'Zevachim', 'Beitzah', 'Chagigah', 'Eruvin', 'Megillah', 'Moed Katan', 'Pesachim', 'Rosh Hashanah', 'Shabbat', 'Sukkah', 'Taanit', 'Yoma', 'Gittin', 'Ketubot', 'Kiddushin', 'Nazir', 'Nedarim', 'Sotah', 'Yevamot', 'Avodah Zarah', 'Bava Batra', 'Bava Kamma', 'Bava Metzia', 'Horayot', 'Makkot', 'Sanhedrin', 'Shevuot', 'Niddah', 'Berakhot']

torah_books = ['Deuteronomy', 'Exodus', 'Genesis', 'Leviticus', 'Numbers']

Threshold parameter for display of links in graph:


In [None]:
link_count_threshold = 500

## Load Authors List

We read the metadata files from the dataset. Each book's metadata is stored in a JSON file named after the book (with spaces replaced by underscores). From these files we load only the mapping from book to author.



In [None]:
authors_dict = {}
error_count=0
metadata_dir = dataset_dirname + raw_subdirname + raw_metadata_subdirame

We loop over the JSON files and extract the author of each book.

In some cases there is no "authors" field in the JSON, for example for anonymous books (the author is unknown) or eponymous books (the name of the author is in the title of the book). In those cases we extract the author name from the book title, since in Jewish thought an author is often nicknamed after his major work.

In [None]:
for metadata_fn in os.listdir(metadata_dir):
    filename, file_extension = os.path.splitext(metadata_fn)
    if file_extension != '.json':
        continue
    with open(metadata_dir + metadata_fn, 'r', encoding="utf8") as metadata_file:
        try:
            metadata = json.load(metadata_file)
        except json.JSONDecodeError:
            # Skip files that do not contain valid JSON
            continue
    bookname = filename.replace('_', ' ')
    try:
        author = metadata['authors'][0]['en']
    except (KeyError, IndexError, TypeError):
        # No usable author information: fall back to the book title
        idx = bookname.find(" on ")
        if idx > 0:
            # Title of the form "<author> on <book>": keep the part before " on "
            author = bookname[:idx]
        else:
            author = bookname
        error_count += 1
    authors_dict[bookname] = author
print(str(error_count) + ' books out of ' + str(len(os.listdir(metadata_dir))) + ' without valid author information were corrected.')
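
The title-based fallback described above can be sketched in isolation as follows. The helper name `author_from_title` and the example titles are assumptions made for illustration; they are not part of the notebook or the dataset.

```python
def author_from_title(bookname: str) -> str:
    """Hypothetical helper: guess an author name from a book title."""
    idx = bookname.find(" on ")
    if idx > 0:
        # Title of the form "<author> on <book>": keep the part before " on "
        return bookname[:idx]
    # Otherwise use the whole title as the author name
    return bookname


# Illustrative titles only
print(author_from_title("Rashi on Genesis"))
print(author_from_title("Sefer HaChinukh"))
```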

Convert to pandas dataframe and display preview:


In [None]:
book_authors_df = pd.DataFrame.from_dict(authors_dict, columns=['Author'], orient='index')
book_authors_df.head()

## Load links count data

We load the list of all (known) links between books from a CSV file.

In [None]:
all_links_counts = pd.read_csv(links_count_fn)
all_link_counts_filtered = all_links_counts[all_links_counts['Link Count']>=0]

Let's display a preview of the data:

In [None]:
all_link_counts_filtered.head()

## Build graph

We join the link counts list with the author list and keep only the "authors" and "link count" columns:

In [None]:
df1 = all_link_counts_filtered.join(book_authors_df, on='Text 1', rsuffix='_1')
df2 = df1.join(book_authors_df, on='Text 2', rsuffix='_2')
df2.rename(columns={'Author': 'Author_1'}, inplace=True)
authors_links_count = df2.loc[:,['Author_1', 'Author_2', 'Link Count']]
authors_links_count.head()
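
To make the double join easier to follow, here is a minimal sketch on made-up data. The book and author names are placeholders, not values from the Sefaria dataset; only the column names mirror the ones used above.

```python
import pandas as pd

# Toy link counts between books (made-up values)
toy_links = pd.DataFrame({
    "Text 1": ["Book A", "Book B"],
    "Text 2": ["Book B", "Book C"],
    "Link Count": [3, 7],
})
# Toy book -> author table, indexed by book name
toy_authors = pd.DataFrame(
    {"Author": ["Author X", "Author Y", "Author Z"]},
    index=["Book A", "Book B", "Book C"],
)

# First join: look up the author of "Text 1"; the new column is named "Author"
step1 = toy_links.join(toy_authors, on="Text 1")
# Second join: "Author" already exists, so the new column becomes "Author_2"
step2 = step1.join(toy_authors, on="Text 2", rsuffix="_2")
step2 = step2.rename(columns={"Author": "Author_1"})

toy_author_links = step2[["Author_1", "Author_2", "Link Count"]]
print(toy_author_links)
```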

We merge some books which are part of one single corpus (Torah and Talmud) and remove the Jastrow dictionary, since links from the Jastrow only indicate that it is a very exhaustive dictionary:


In [None]:
authors_links_count.replace(masekhtot, 'Talmud', inplace=True)
authors_links_count.replace(torah_books, 'Torah', inplace=True)
idx2 = ~((authors_links_count['Author_2']=='Marcus Jastrow') | (authors_links_count['Author_1']=='Marcus Jastrow'))
authors_links_count = authors_links_count.loc[idx2]

We also remove self-references:


In [None]:
idx = ~authors_links_count['Author_1'].eq(authors_links_count['Author_2'])
authors_links_count = authors_links_count.loc[idx]
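
The effect of merging a corpus and then dropping self-references can be checked on a small example. This is a sketch on made-up rows; the names `toy`, `toy_torah_books`, `merged` and `no_self` are illustrative only.

```python
import pandas as pd

# Made-up rows in the same shape as authors_links_count
toy = pd.DataFrame({
    "Author_1": ["Genesis", "Rashi", "Exodus"],
    "Author_2": ["Exodus", "Genesis", "Rashi"],
    "Link Count": [10, 5, 2],
})
toy_torah_books = ["Genesis", "Exodus"]

# Every listed book name is replaced by the corpus name "Torah"
merged = toy.replace(toy_torah_books, "Torah")
# After merging, the Genesis -> Exodus row links "Torah" to itself and is dropped
no_self = merged.loc[merged["Author_1"] != merged["Author_2"]]
print(no_self)
```

Note that the order matters: merging first turns links inside one corpus into self-references, which the second step then removes.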

We aggregate rows with identical author pairs and save the result to file (the threshold is applied in the next section):

In [None]:
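authors_links_count_agg = authors_links_count.groupby(['Author_1', 'Author_2']).sum()
# Turn the group keys back into regular columns
authors_links_count_agg = authors_links_count_agg.reset_index()
# Make sure the output folder exists before writing
os.makedirs(output_subfolder, exist_ok=True)
# to_hdf requires a key; the key name used here is an assumption
authors_links_count_agg.to_hdf(output_subfolder + "whole_graph.h5", key="authors_links_count_agg")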
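
As a quick illustration of the aggregation step, the sketch below sums link counts over repeated author pairs in a made-up table. The rows and the names `toy_pairs` and `toy_agg` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: the pair (Rashi, Torah) appears twice
toy_pairs = pd.DataFrame({
    "Author_1": ["Rashi", "Rashi", "Rambam"],
    "Author_2": ["Torah", "Torah", "Talmud"],
    "Link Count": [5, 2, 4],
})

# One row per author pair, with link counts summed;
# as_index=False keeps the pair as regular columns
toy_agg = toy_pairs.groupby(["Author_1", "Author_2"], as_index=False)["Link Count"].sum()
print(toy_agg)
```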

## Analysis of connections to reference corpuses

We want to visualize major influences, in particular those from reference sources common to most authors. To this end we keep only links above our defined threshold. Moreover, in order to

In [None]:
authors_links_count_agg_morethan = authors_links_count_agg.loc[authors_links_count_agg['Link Count']>link_count_threshold]
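
One possible way to turn such a thresholded table into a graph is `nx.from_pandas_edgelist`. The sketch below uses made-up rows and a made-up threshold, and it assumes that each link can be read as directed from `Author_1` to `Author_2`; the source does not state the direction, so an undirected `nx.Graph` may be the better fit.

```python
import networkx as nx
import pandas as pd

# Made-up aggregated link counts (not taken from the dataset)
toy_edges = pd.DataFrame({
    "Author_1": ["Rashi", "Rambam", "Ramban"],
    "Author_2": ["Torah", "Talmud", "Torah"],
    "Link Count": [900, 1200, 300],
})
toy_threshold = 500

# Keep only links strictly above the threshold
strong = toy_edges.loc[toy_edges["Link Count"] > toy_threshold]

# Assumption: links are directed from Author_1 to Author_2
G = nx.from_pandas_edgelist(
    strong,
    source="Author_1",
    target="Author_2",
    edge_attr="Link Count",
    create_using=nx.DiGraph,
)
print(G.number_of_nodes(), G.number_of_edges())
```

The link count is kept as an edge attribute, so it can later drive edge width or color when drawing the graph.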