# Preliminaries

In [9]:
%matplotlib notebook

In [189]:

import json
import re

import pandas as pd
import nltk.data
from nltk.tokenize import PunktSentenceTokenizer, SExprTokenizer
from textblob import TextBlob
from textblob.taggers import NLTKTagger

nltk_tagger = NLTKTagger()

In [208]:
with open('secondVersion.json') as json_data:
    PoleisRawData = json.load(json_data)

# Structure of data

The first-level keys are the regions.

In [209]:
PoleisRawData.keys()

dict_keys(['Crete', 'Sikelia', 'The Black Sea Area', 'The Aegean', 'Attika', 'Makedonia', 'Boiotia', 'Lykia', 'Spain and France (including Corsica)', 'Thrace from Axios to Strymon', 'Doris', 'Troas', 'Phokis', 'Inland Thrace', 'The Saronic Gulf', 'Epeiros', 'Aitolia', 'Lakedaimon', 'Italia and Kampania', 'Ionia', 'Achaia', 'Propontic Thrace', 'Akarnania and Adjacent Areas', 'Thracian Chersonesos', 'The Adriatic', 'Argolis', 'Elis', 'Lesbos', 'The Propontic Coast of Asia Minor', 'Triphylia', 'Thrace from Nestos to Hebros', 'Rhodos', 'Thessalia and Adjacent Regions', 'Arkadia', 'Messenia', 'West Lokris', 'East Lokris', 'Thrace from Strymon to Nestos', 'The South Coast of Asia Minor (Pamphylia Kilikia)', 'Cyprus', 'Euboia', 'Karia', 'Megaris, Korinthia, Sikyonia', 'Aiolis and South-western Mysia'])

Each region contains the city names as sub-keys.

In [211]:
PoleisRawData['Karia'].keys()

dict_keys(['Ouranion', 'Kindye', 'Pedasa', 'Kyllandos', 'Taramptos', 'Kyrbissos', 'Pladasa', 'Chios', 'Olymos', 'Bargylia', 'Euromos', 'Pyrnos', 'Bargasa', 'Alabanda', 'Pyrindos', 'Pidasa', 'Amos', 'Tralleis', 'Kalynda', 'Aulai', 'Kaunos', 'Salmakis', 'Idrias', 'Knidos', 'Telemessos', 'Mylasa', 'Telandros', 'Naryandos', 'Chersonesos', 'Bolbai', 'Halikarnassos', 'Myndos', 'Lepsimandos', 'Medmasos', 'Passanda', 'Keramos', 'Kasolaba', 'Thydonos', 'Arlissos', 'Alinda', 'with', 'Hydisos', 'Iasos', 'Naxia'])

To access the text for a particular city, use the first- and second-level keys:

In [225]:
hydisosText = PoleisRawData['Karia']['Hydisos']

In [226]:
hydisosText

'Identifier: 891. , (Hydisseus) Map  61.  Lat. 37.10,long.  27.50. Size  of  territory:  ?  Type:  B:?  The  toponym  is  ` (Steph.  Byz. 645.17).  The  earliest  attestation  of  the  toponym is  in  a  C 1  inscription  (I.Stratonikeia 508.10  (c. 81)): `, although  it  was  mentioned  by  Apollonius  Aphrodisiensis, whose may  be  dated  to  C 3  (FGrHist 740  fr. 4).  The city-ethnic  is  `  (IG i³  265.ii.51;Apollonius Aphrodisiensis  (FGrHist 740)  fr. 4  (perhaps  C 3))  or `  (I.Mylasa 401.8  (C 2\xadC1)). Hydisos  was  a  member  of  the  Delian  League,  but  is  registered  only  twice,  in  448/7  (IG i³  264.iii.21,  restored: `[~]) and 447/6  (IG i³  265.ii.51,  restored: `[~]),  paying  a  phoros  of 1  tal. At  the  site  of  Hydisos  there  are  remains  of  city  walls  and towers,  probably  of  early  Hellenistic  date  (L.  Robert  ( 1935) 339\xad40). 890.  (Hymisseis) Map 61,  unlocated,  but  possibly  situat- ed  between  Amyzon  (no. 874)  and  Mylasa  (no. 913

To concatenate the texts of all cities in a region, we can use:

In [210]:
# Join the texts of all cities in the region into one string
ioniaText = ''.join(PoleisRawData['Ionia'].values())

# Create dataframe

To keep the original information, we create a dataframe with a two-level index: the region and the city name.

In [216]:
regions = []
frames = []

# One frame per region, indexed by city name
for region, cities in PoleisRawData.items():
    regions.append(region)
    frames.append(pd.DataFrame.from_dict(cities, orient='index'))

# Stack the frames; the region becomes the outer index level
df = pd.concat(frames, keys=regions)
df.columns = ['fulltext']
df.index.rename(['region', 'city'], inplace=True)
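
The two-level index can be pictured as (region, city) tuples. The following is a minimal stdlib-only sketch, assuming a hypothetical toy dictionary (`toy_data` and its placeholder texts are made up for illustration and are not the contents of secondVersion.json). It only shows which keys the dataframe rows end up with.

```python
# Hypothetical miniature of the region -> city -> text structure
toy_data = {
    'Karia': {'Hydisos': 'text about Hydisos', 'Iasos': 'text about Iasos'},
    'Crete': {'Itanos': 'text about Itanos'},
}

# Flatten to {(region, city): text}; these tuples correspond to the
# ('region', 'city') index levels of the dataframe
flat = {
    (region, city): text
    for region, cities in toy_data.items()
    for city, text in cities.items()
}

for key in flat:
    print(key)
```

On the dataframe itself, the same kind of tuple selects a row, e.g. `df.loc[('Karia', 'Hydisos'), 'fulltext']`.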

## Get city identifier

Throughout the full text, cities are referenced by a running index. To make this information part of the dataframe, we extend it with an additional column.

In [266]:
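def cityIDFinder(text):
    # The identifier is 1-4 digits followed by a full stop, e.g. 'Identifier: 891.'
    match = re.search(r"Identifier: (\d{1,4})\.", text)
    if match:
        return match.group(1)
    return None  # no well-formed identifier in this entry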

In [267]:
df['city_id'] = df['fulltext'].apply(lambda row: cityIDFinder(row))
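
As a quick check of the pattern, the sketch below applies the same regular expression to two short excerpts of the entries. The helper name `find_city_id` is illustrative only. The second excerpt has no full stop after the number, so the pattern does not match; entries like this end up without a city_id.

```python
import re

def find_city_id(text):
    """Return the identifier digits as a string, or None if the pattern is absent."""
    match = re.search(r"Identifier: (\d{1,4})\.", text)
    return match.group(1) if match else None

print(find_city_id('Identifier: 891. , (Hydisseus) Map  61.'))  # 891
print(find_city_id('Identifier: 5 , (Lisios) Map 60.'))         # None
```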

## Collection of all citations

To collect all citations in the text for one city, we first use a tokenizer from NLTK. SExprTokenizer returns every parenthesised expression as a single token, even when it contains nested parentheses, and is much easier to use than regular expressions.

The basic assumption for citations is: they are written in parentheses, start with a capital letter, and contain at least one blank space (to separate the author's name from pages, indices, or dates).

In [217]:
def citationFinder(text):
    import string
    # Non-strict mode tolerates unbalanced parentheses instead of raising an error
    tokens = SExprTokenizer(strict=False).tokenize(text)
    # Assume: citations are in parentheses, start with a capital letter,
    # and contain at least one blank space ' '
    listCite = [x for x in tokens
                if len(x) > 1 and x[0] == '(' and x[1] in string.ascii_uppercase and ' ' in x]
    return listCite

In [263]:
df['sources'] = df['fulltext'].apply(lambda row: citationFinder(row))
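
To illustrate the citation heuristic without NLTK, here is a minimal stdlib-only sketch. It is a simplified stand-in for the tokenizer, not NLTK's implementation: `top_level_groups` (a hypothetical helper) counts parenthesis depth so that nested parentheses stay inside their enclosing group. The sample sentence is made up, modelled on the wording of the Hydisos entry.

```python
import string

def top_level_groups(text):
    """Return every top-level '( ... )' group, keeping nested parentheses inside it."""
    groups, depth, start = [], 0, None
    for pos, char in enumerate(text):
        if char == '(':
            if depth == 0:
                start = pos
            depth += 1
        elif char == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:pos + 1])
    return groups

def is_citation(group):
    """Heuristic from the text: capital letter after '(' and at least one blank."""
    return len(group) > 1 and group[1] in string.ascii_uppercase and ' ' in group

sample = ("The toponym is attested (Steph. Byz. 645.17) and in an inscription "
          "(I.Stratonikeia 508.10 (c. 81)); see also Mylasa (no. 913).")
groups = top_level_groups(sample)
citations = [g for g in groups if is_citation(g)]
print(citations)
```

The cross-reference '(no. 913)' is filtered out because it starts with a lowercase letter, while the nested date '(c. 81)' stays attached to its citation.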

## Transformation of coordinates

A simple regular expression is enough to find the coordinates in the text. They are given as degrees.minutes (e.g. 37.10 means 37 degrees 10 minutes) and are converted to decimal degrees so that they can be plotted on a map with a common projection.

In [218]:
def coordinateFinder(value, pattern):
    match = re.search(pattern, value)
    if match:
        # The match ends with '<degrees>.<minutes>', e.g. '37.10'
        degrees, minutes = re.search(r"(\d+)\.(\d+)$", match.group(0)).groups()
        return int(degrees) + int(minutes) / 60

In [220]:
df['latitude'] = df["fulltext"].apply(coordinateFinder, pattern=r"Lat\.\s*\d+\.\d+")
df['longitude'] = df["fulltext"].apply(coordinateFinder, pattern=r"long\.\s*\d+\.\d+")
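
The conversion itself is plain arithmetic: decimal degrees = degrees + minutes / 60. A minimal sketch using the Hydisos coordinates (Lat. 37.10, long. 27.50); the helper name `to_decimal_degrees` is illustrative and not part of the notebook:

```python
def to_decimal_degrees(coord):
    """Convert a 'degrees.minutes' string such as '37.10' to decimal degrees."""
    degrees, minutes = coord.split('.')
    return int(degrees) + int(minutes) / 60

lat = to_decimal_degrees('37.10')  # 37 degrees 10 minutes
lon = to_decimal_degrees('27.50')  # 27 degrees 50 minutes
print(round(lat, 4), round(lon, 4))  # 37.1667 27.8333
```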

## Proper nouns 

To generate a list of all proper nouns mentioned for each city, we use TextBlob, a text-processing library built on NLTK that includes a part-of-speech tagger. We are interested in all tokens tagged 'NNP' (proper noun, singular) that are longer than 3 letters.

This takes some time to process for the full dataframe. The behaviour can be tested on a single entry by uncommenting the line in the cell below, once namesFinder has been defined.

In [232]:
# Uncomment to test routine. 

#namesFinder(df['fulltext'].iloc[10])

In [298]:
def namesFinder(text):
    blobs = TextBlob(text)
    namesList = [x[0] for x in blobs.pos_tags if x[1] == 'NNP' and len(x[0]) > 3]
    return namesList

In [299]:
df['names'] = df['fulltext'].apply(lambda row: namesFinder(row))
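
The filter in namesFinder can be tried without running the tagger. The sketch below assumes hand-written (word, tag) pairs in the shape that `pos_tags` returns; they are not actual tagger output, and real tags for this text may differ.

```python
# Hand-written (word, tag) pairs in the shape returned by TextBlob(...).pos_tags;
# NOT actual tagger output
tagged = [('Hydisos', 'NNP'), ('was', 'VBD'), ('a', 'DT'), ('member', 'NN'),
          ('of', 'IN'), ('the', 'DT'), ('Delian', 'NNP'), ('League', 'NNP'),
          ('IG', 'NNP')]

# Keep proper nouns longer than 3 letters; 'IG' is dropped by the length filter
names = [word for word, tag in tagged if tag == 'NNP' and len(word) > 3]
print(names)  # ['Hydisos', 'Delian', 'League']
```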

## Cross links to other cities

Links to other cities are given in the full text by reference to the index, e.g. '(no. 982)'. Searching for these references yields a list of links.

In [294]:
def linksFinder(text):
    # Capture the number inside references such as '(no. 982)'
    links = re.findall(r"\(no\. (\d{1,4})\)", text)
    if links:
        return [int(link) for link in links]

In [296]:
df['linkedCities'] = df['fulltext'].apply(lambda row: linksFinder(row))
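
A short check of the link pattern on a single sentence. The sample string is hypothetical, modelled on the wording of the entries.

```python
import re

# Hypothetical sample sentence with two cross references
sample = 'possibly situated between Amyzon (no. 874) and Mylasa (no. 913)'

# The capture group keeps only the digits of each '(no. NNN)' reference
link_ids = [int(n) for n in re.findall(r"\(no\. (\d{1,4})\)", sample)]
print(link_ids)  # [874, 913]
```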

## Display dataframe

In [300]:
# Display the dataframe

df

region,city,fulltext,latitude,longitude,sources,city_id,linkedCities,names
Crete,Hierapytna,"Identifier: 963. , (Hierapytnios) Map 60.Lat....",35.000000,25.750000,[(I.Cret. iii.iii.1B (C 3l); IG xii.5840 (C...,963,"[971, 958, 974, 950, 954, 962]","[Hierapytnios, I.Cret, xii.5840, Steph, I.Cret..."
Crete,Eleutherna,"Identifier: 959. , (Eleuthernaios) Map 60.Lat...",35.333333,24.666667,"[(IG ix².117,ll. 87­88 (C 3f)), (SEG 41 742...",959,,"[Eleuthernaios, I.Cret, Skylax, Stephanos, Ste..."
Crete,Stalai,"Identifier: 990. , (Stalites) Map 60. Lat. 3...",35.083333,26.000000,"[(Steph. Byz. 585.12), (I.Cret.iii.vi.7 (C 3...",990,,"[Stalites, Steph, I.Cret.iii.vi.7, I.Cret.iii...."
Crete,Aulon,"Identifier: 952. , Map 60. Lat. 35.05,long. ...",35.083333,25.000000,"[(I.Cret. iv 64 (C 5e)), (Steph. Byz. 147.8...",952,,"[I.Cret, Steph, Stephanos, Aulon, Guarducci, A..."
Crete,Herakleion,"Identifier: 962. , (Herakleiotas) Map 60. La...",35.333333,25.166667,"[(Hellenistic period), (Strabo 10.4.7), (Mil...",962,,"[Herakleiotas, Hellenistic, Strabo, Milet, Her..."
Crete,Lisos,"Identifier: 5 , (Lisios) Map 60. Lat. 35.15,...",35.250000,23.833333,[(I.Cret. ii.xvii.1 (C 3f); BCH 45 ( 1921) ...,,,"[Lisios, I.Cret, Ps.-Skylax, Guarducci, Svoron..."
Crete,Keraia,"Identifier: 967. , (Keraïtas) Map 60.Lat.35.2...",35.416667,24.000000,"[(BCH 45 ( 1921) iii.111 (c. 230­210)), (I.M...",967,,"[Keraïtas, Polyb, Steph, Keraia, Polyb, I.Magn..."
Crete,Aptara,"Identifier: 948. , (Aptaraios) Map 60. Lat. ...",35.416667,24.166667,"[(SEG 41 731 (C 3e)), (BCH 45 ( 1921) iii....",948,,"[Aptaraios, B.The, The-city, I.Cret, I.Cret, S..."
Crete,Rhaukos,"Identifier: 987. , (Rhaukios) Map 60. Lat. 3...",35.250000,25.000000,"[(Ps.- Skylax 47;Polyb. 30.23.1), (I.Cret. i....",987,"[960, 980, 971]","[Rhaukios, Ps.-, Skylax, Polyb, Milet, Rhaukos..."
Crete,Itanos,"Identifier: 966. , (Itanios) Map 60.Lat.35.15...",35.250000,26.250000,"[(Hdt. 4.151.2), (I.Cret. iii.vii.3 (C 6?); c...",966,,"[Itanios, I.Cret, Itanos, Herodotos, Kyrene, T..."


# Mapping the cities

To generate a map from the newly extracted information, we use folium. Markers are positioned at (lat/long) and show the city name in a popup after clicking on the blue marker.
Note that cities without coordinates are dropped from the dataframe.

In [270]:
import folium
from folium import plugins
from folium.map import *

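# Drop only the rows that lack coordinates; other columns may stay empty
dfPoleisMap = df.dropna(axis=0, subset=['latitude', 'longitude'])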

In [271]:
# Centre the map on the first city that has coordinates
poleis_map = folium.Map(location=[dfPoleisMap["latitude"].iloc[0], dfPoleisMap["longitude"].iloc[0]], zoom_start=8)

# In current folium releases MarkerCluster lives in folium.plugins
marker = folium.FeatureGroup(name='Poleis')
marker_cluster = plugins.MarkerCluster().add_to(marker)

# The second index level holds the city name
for (region, city), row in dfPoleisMap.iterrows():
    folium.Marker([row['latitude'], row['longitude']], popup='Cityname: ' + str(city), icon=folium.Icon(icon='ok')).add_to(marker_cluster)

poleis_map.add_child(marker)

poleis_map.add_child(folium.LayerControl())

# Training of Tokenizers

By calling PunktSentenceTokenizer with an input text, we can train its sentence detection on that text. Sentence detection is usually a problem here, since the many citations (in parentheses), abbreviations, and special characters make it hard to detect where a sentence ends.

In [272]:
trainedTokenizer = PunktSentenceTokenizer(ioniaText)

In [274]:
for item in trainedTokenizer.tokenize(hydisosText):
    print(item)
    print("----")

Identifier: 891. , (Hydisseus) Map  61.  Lat. 37.10,long.
----
27.50. Size  of  territory:  ?
----
Type:  B:?
----
The  toponym  is  ` (Steph.
----
Byz. 645.17).
----
The  earliest  attestation  of  the  toponym is  in  a  C 1  inscription  (I.Stratonikeia 508.10  (c. 81)): `, although  it  was  mentioned  by  Apollonius  Aphrodisiensis, whose may  be  dated  to  C 3  (FGrHist 740  fr. 4).
----
The city-ethnic  is  `  (IG i³  265.ii.51;Apollonius Aphrodisiensis  (FGrHist 740)  fr. 4  (perhaps  C 3))  or `  (I.Mylasa 401.8  (C 2­C1)).
----
Hydisos  was  a  member  of  the  Delian  League,  but  is  registered  only  twice,  in  448/7  (IG i³  264.iii.21,  restored: `[~]) and 447/6  (IG i³  265.ii.51,  restored: `[~]),  paying  a  phoros  of 1  tal.
----
At  the  site  of  Hydisos  there  are  remains  of  city  walls  and towers,  probably  of  early  Hellenistic  date  (L.  Robert  ( 1935) 339­40).
----
890.
----
(Hymisseis) Map 61,  unlocated,  but  possibly  situat- ed  between  Amyz