# Inscriptiones Christianae Graecae

In [None]:
!pip install folium --user

## Preliminaries

In [None]:
from citableclass.base import Citableloader
import pandas as pd
import numpy as np
import re
import ipywidgets as widget
import pickle
import nltk
import folium
from collections import OrderedDict, Counter
import matplotlib.pyplot as pl
from nltk.tokenize import word_tokenize as tokenizer

In [None]:
nltk.download('punkt')

## Loading data

Using the citableclass, we can access the collection through its DOI. 

In [None]:
cite = Citableloader('10.17171/1-8')

The collection description can be obtained using the `landingpage` function.

In [None]:
cite.landingpage()

The full collection data can be accessed via the `collection` call.

In [None]:
data = cite.collection()

## Structure of data 

To get an overview of the structure of the collection, we normalize a single entry and display its dataframe.

In [None]:
testDF = pd.json_normalize(data['68'])

In [None]:
testDF.columns
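
The dotted column names that appear later in this notebook (for example `ancientcity.name`) come from this flattening step. As a rough illustration of the idea, and not of pandas' actual implementation, the following standard-library sketch flattens a nested dictionary into dot-joined keys. The `flatten` helper and the sample entry are hypothetical and only loosely modelled on the fields used here.

```python
def flatten(nested, parent_key='', sep='.'):
    """Flatten nested dicts into a single level with dot-joined keys."""
    flat = {}
    for key, value in nested.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-dictionaries and merge their flattened keys
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical entry; the values are placeholders, not data from the collection
entry = {
    '_id': '68',
    'ancientcity': {'name': 'CityA', 'latitude': 1.0, 'longitude': 2.0},
    'dating_str': '200 - 400',
}
flat_entry = flatten(entry)
print(sorted(flat_entry))
```

Each leaf value ends up under one flat key, which is why a deeply nested collection produces a very wide dataframe.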

### Greek characters

Since the original inscriptions are in Greek, we might want to search for Greek words. For this, we need a list of Greek characters.

This is given by

In [None]:
greek_letters=[chr(code) for code in range(945,970)]

We can then construct regular expressions with these characters.

For example, all sequences of two or three characters followed by a whitespace character are matched by

In [None]:
greek_pat = '[' + ''.join(greek_letters) + r']{2,3}\s'
greek_pat

Taking the text of entry number 68, we can search in the following way.

In [None]:
greektStri = testDF.search_text.values[0]

In [None]:
re.findall(greek_pat,greektStri)
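
One property of this pattern is easy to miss: without a word boundary it also matches the last two or three letters of longer words. The sketch below shows this on a made-up, unaccented sample string (an assumption for illustration, not text from the collection) and compares it with a variant that adds `\b` in front. Letters carrying diacritics lie outside the code point range 945 to 969 and would not be matched by either pattern.

```python
import re

# Lowercase Greek letters alpha to omega (includes final sigma)
greek_letters = [chr(code) for code in range(945, 970)]
char_class = '[' + ''.join(greek_letters) + ']'

# Pattern as used above: 2-3 letters followed by whitespace
greek_pat = char_class + r'{2,3}\s'
# Variant: require a word boundary first, so only whole short words match
greek_word_pat = r'\b' + char_class + r'{2,3}\s'

sample = 'εν τω ονοματι του θεου '  # hypothetical sample text

tails_and_words = re.findall(greek_pat, sample)
whole_words = re.findall(greek_word_pat, sample)
print(tails_and_words)
print(whole_words)
```

Which variant is appropriate depends on whether word endings or complete short words are of interest.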

## Data Preparation

The data consists of a large number of objects, where each object's information is a nested list. To convert this structure into a single dataframe, we loop over each entry.

In [None]:
len(data.keys())

To display the progress, we use an `IntProgress` widget and increment its value in every loop step.

In [None]:
dfList = []
N = len(data.keys()) 
f = widget.IntProgress(min=0, max=N)
display(f)
for key in data.keys():
    f.value += 1
    tempDF = pd.json_normalize(data[key])
    dfList.append(tempDF)
print('Done!')

Since data preparation takes some time, it is possible to save the converted data as a pickle file, which can be loaded for further editing.

In [None]:
#pickle.dump(dfList, open("./data/icg_rawlist.p", "wb"))

#Load by using
#dfList = pickle.load(open("./data/icg_rawlist.p", "rb" ) )
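
The save and load steps can be tried out in isolation. The following minimal sketch performs a pickle round trip using `with` blocks, so that the file handles are closed reliably. The sample records and the temporary directory are assumptions for illustration; in the notebook the object would be `dfList` and the path would be under `./data/`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the list of converted entries
records = [
    {'_id': '1', 'transl_text': 'Beispiel'},
    {'_id': '2', 'transl_text': ''},
]

# Write to a throwaway directory so the sketch does not touch ./data
cache_path = os.path.join(tempfile.mkdtemp(), 'icg_rawlist.p')

with open(cache_path, 'wb') as fh:
    pickle.dump(records, fh)

with open(cache_path, 'rb') as fh:
    restored = pickle.load(fh)

print(restored == records)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.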

## Reducing the data

The full concatenated dataframe would have roughly 2800 rows and 30000 columns, because converting the nested JSON yields a separate column for every key. We therefore build a new dataframe with the required information only.
Rather than working with all ~30000 keys, we select the columns of interest by name:

In [None]:
listDF = [x[['_id','doi','ancientcity.name','ancientcity.info','ancientcity.latitude','ancientcity.longitude','dating_centuries','dating_str','transl_text','search_text']] for x in dfList]

Now we can build a new dataframe with only the city name, coordinates, dating, and the translated text. Using pandas `concat`, we put all data into one dataframe.

In [None]:
df = pd.concat(listDF).reset_index(drop=True)
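
The select-then-concatenate step can be sketched on two tiny dataframes. The frames, column names such as `extra.field`, and values below are hypothetical; only the pattern mirrors the cells above.

```python
import pandas as pd

# Two hypothetical single-row frames with different extra columns
frames = [
    pd.DataFrame({'_id': ['1'], 'ancientcity.name': ['CityA'], 'extra.field': ['x']}),
    pd.DataFrame({'_id': ['2'], 'ancientcity.name': ['CityB'], 'other.field': ['y']}),
]

wanted = ['_id', 'ancientcity.name']

# Keep only the wanted columns, then stack the rows and renumber the index
reduced = [frame[wanted] for frame in frames]
combined = pd.concat(reduced).reset_index(drop=True)
print(combined)
```

Selecting with `frame[wanted]` raises a `KeyError` if an entry lacks one of the columns. If that can happen, `frame.reindex(columns=wanted)` is one possible alternative, since it fills missing columns with `NaN` instead.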

In [None]:
df

# Natural language processing for surnames

In [None]:
df.shape

Using tools from natural language processing, the next step is to search through a segmented version of the German translation to find proper nouns.

For this purpose, we use a pre-trained [tagger](https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/), which is faster on German text than, e.g., the [Stanford POS tagger](http://nlp.stanford.edu/software/tagger.html). To interface it with NLTK, we make use of a [contribution](https://github.com/ptnplanet/NLTK-Contributions).

In [None]:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger

In [None]:
with open('nltk_german_classifier_data.pickle', 'rb') as f:
    tagger = pickle.load(f)

The part-of-speech tag 'NE' marks proper nouns (Eigennamen in German), as specified by http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html

In [None]:
[x[0] for x in tagger.tag(tokenizer(df['transl_text'].iloc[13])) if x[1]=='NE']

To extend the dataframe with the information on the proper nouns in the translated text, we create a new column and apply the POS tagger.

In [None]:
dfN = pd.DataFrame()
dfN['transl_text_ge'] = df.loc[:, 'transl_text']
dfN.shape

In [None]:
%%time

dfN['proper nouns'] = dfN['transl_text_ge'].apply(lambda row: [x[0] for x in tagger.tag(tokenizer(row)) if row !='' and x[1]=='NE'])
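
The filtering logic of this step can be illustrated without the trained tagger. In the sketch below, `toy_tag` is a hypothetical stand-in for `tagger.tag`, `str.split` replaces the NLTK tokenizer, and `'XX'` is a placeholder tag, so the output says nothing about the real tagger's accuracy. The helper also checks for empty or missing text before tokenizing, which is one way to avoid passing non-string values to the tokenizer.

```python
def toy_tag(tokens):
    """Hypothetical stand-in for tagger.tag: returns (token, tag) pairs."""
    known_names = {'Paulus', 'Maria'}
    return [(tok, 'NE' if tok in known_names else 'XX') for tok in tokens]


def proper_nouns(text, tag=toy_tag):
    """Return the tokens tagged 'NE', or an empty list for empty/missing text."""
    if not isinstance(text, str) or not text.strip():
        return []
    tokens = text.split()  # simplified tokenizer for this sketch
    return [token for token, pos in tag(tokens) if pos == 'NE']


print(proper_nouns('Hier ruht Paulus , Sohn der Maria'))
print(proper_nouns(''))
print(proper_nouns(None))
```

A function like this could be passed to `apply` on the text column in place of the lambda, with the real tagger supplied as the `tag` argument.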

Again, we can save the new dataframe as a pickle file.

In [None]:
# Save data as pickle
#dfN.to_pickle('./data/icg_proper_nouns.pickle')
# Create dataframe from pickle file
#dfn = pd.read_pickle('./data/icg_proper_nouns.pickle')

In [None]:
dfn = dfN.copy()

To have all information in one dataframe, we extend the previously constructed one with the new information.

In [None]:
df['proper_nouns'] = dfn['proper nouns']

In [None]:
df.dropna()

In [None]:
dfAthens = df[df['ancientcity.name']=='Athens'].reset_index(drop=True)
lAth = list(dfAthens['proper_nouns'])
nameListAthens = [item for sublist in lAth for item in sublist if len(item) > 2]

In [None]:
# Save the final version of the dataframe
#df.to_pickle('./data/icg_full_with_nouns.pickle')
# Load by
#df = pd.read_pickle('./data/icg_full_with_nouns.pickle')

In [None]:
dfNames = df.sort_values('ancientcity.name').dropna(subset=['ancientcity.latitude']).reset_index(drop=True)

In [None]:
dfNames.keys()

Every ancient city name corresponds to several inscriptions found there. We therefore collect the proper nouns and their dating into one dictionary, whose keys are the city names.

In [None]:
cityDict = {}
# Visit each city once instead of once per inscription
for city in dfNames['ancientcity.name'].unique():
    tmpdf = df[df['ancientcity.name']==city].reset_index(drop=True)
    tmpList = [[tmpdf['ancientcity.latitude'].iloc[0],tmpdf['ancientcity.longitude'].iloc[0]]]
    tmpDict = {}
    for index in tmpdf.index:
        if tmpdf['proper_nouns'].iloc[index] != []:
            tmpDict[tmpdf['dating_str'].iloc[index]] = tmpdf['proper_nouns'].iloc[index]
    # Append the dating dictionary once per city, and only if it holds any names
    if tmpDict:
        tmpList.append(tmpDict)
    cityDict[city] = tmpList

Getting the city names

In [None]:
cityDict.keys()

Getting the coordinates of a city

In [None]:
cityDict['Acmoneia'][0]

Getting the different dating periods by keys

In [None]:
cityDict['Acmoneia'][1].keys()

Getting the names of a dating period

In [None]:
cityDict['Acmoneia'][1]['200 - 400']
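
In the loop that builds `cityDict`, a later inscription with the same `dating_str` replaces the names stored earlier under that key. If all names per dating period should be kept, a variant along the following lines could be used. This is a plain-Python sketch on hypothetical records with placeholder city names, coordinates, and names; it also differs from the notebook in that every city gets a (possibly empty) dictionary at index 1.

```python
# Hypothetical records: one per inscription, with placeholder values
records = [
    {'city': 'CityA', 'lat': 1.0, 'lon': 2.0, 'dating': '200 - 400', 'names': ['Name1']},
    {'city': 'CityA', 'lat': 1.0, 'lon': 2.0, 'dating': '200 - 400', 'names': ['Name2']},
    {'city': 'CityA', 'lat': 1.0, 'lon': 2.0, 'dating': '400 - 500', 'names': []},
    {'city': 'CityB', 'lat': 3.0, 'lon': 4.0, 'dating': '300 - 400', 'names': ['Name3']},
]

city_dict = {}
for rec in records:
    # First element: coordinates; second element: names grouped by dating
    entry = city_dict.setdefault(rec['city'], [[rec['lat'], rec['lon']], {}])
    if rec['names']:
        # Extend rather than overwrite, so repeated datings accumulate names
        entry[1].setdefault(rec['dating'], []).extend(rec['names'])

print(city_dict['CityA'][0])
print(city_dict['CityA'][1])
```

Access then works as in the cells above: index 0 gives the coordinates, index 1 the dictionary keyed by dating period.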

# Histogram of name occurrence

In [None]:
%matplotlib notebook

In [None]:
allNames = [item for sublist in list(df['proper_nouns']) for item in sublist if len(item) > 2 ]
counter=Counter(allNames)
xmax = 30
f, axs = pl.subplots(1, sharex=True, sharey=True)
d = OrderedDict(sorted(counter.items(), key=lambda t: t[1],reverse=True))
X = np.arange(len(d))
pl.bar(X, list(d.values()), align='center', width=0.5)
pl.xticks(X, list(d.keys()))
ymax = max(d.values()) + 1
pl.ylim(0, ymax)
pl.xlim(0,xmax)
pl.setp(axs.xaxis.get_majorticklabels(), rotation=70, horizontalalignment='right' )
pl.tight_layout()
pl.show()
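
The cell above sorts all counts and then limits the visible range with `xlim`. If only the most frequent names are needed, `Counter.most_common` returns them directly in descending order. The sketch below uses a hypothetical list of names; the resulting `labels` and `counts` could be passed to `pl.bar` and `pl.xticks` in place of the full dictionary.

```python
from collections import Counter

# Hypothetical tagged names, including one short token that gets filtered out
names = ['Maria', 'Paulus', 'Maria', 'Io', 'Petrus', 'Maria', 'Paulus']

# Same length filter as above: ignore tokens of two characters or fewer
counter = Counter(name for name in names if len(name) > 2)

# Take the two most frequent names with their counts
top = counter.most_common(2)
labels, counts = zip(*top)
print(labels, counts)
```

This keeps the plotted data small, which may matter when the collection contains many distinct names.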

# Geographical distribution of inscriptions



To get an overview of the geographical distribution of the various proper nouns, we make use of the folium package. One can define markers with the corresponding ancient city name and the occurring proper nouns at the given coordinates.

In [None]:
from folium.plugins import MarkerCluster

icg_map = folium.Map(location=[df["ancientcity.latitude"].mean(axis=0),df["ancientcity.longitude"].mean(axis=0)], zoom_start=5)
# Note: the built-in 'Stamen Terrain' tiles may be unavailable in recent folium releases
folium.TileLayer('Stamen Terrain', name='Stamen').add_to(icg_map)
f = widget.FloatProgress(min=0, max=len(cityDict.keys()))
display(f)
# Markers added to the cluster are shown on the map the cluster belongs to
marker_cluster = MarkerCluster(name='Coordinates').add_to(icg_map)
for name in cityDict.keys():
    f.value += 1
    if len(cityDict[name]) > 1:
        popups = 'City: ' + name + ', Names: ' +  str(cityDict[name][1])
    else:
        popups = 'City: ' + name + ', no names.'

    folium.Marker(cityDict[name][0], popup=popups).add_to(marker_cluster)
folium.LayerControl().add_to(icg_map)
icg_map