# Lecture 5: Structures of Texts, Chunking

Regular expressions: http://www.regexe.de/hilfe.jsp
                     https://pymotw.com/2/re

Pandas: http://www.data-analysis-in-python.org/3_pandas.html
      : https://bitbucket.org/hrojas/learn-pandas
      
## Background
     
NLTK: http://www.nltk.org/book/ch05.html
    : http://www.nltk.org/book/ch07.html

Chunking: https://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf


In [1]:
import json
import pandas as pd
import re
import numpy as np
import requests

## Construct Dataframe from full Poleis data

In [2]:
PoleisDataOnline2 = requests.get('http://repository.edition-topoi.org/MISC/ReposMISC/MISC00005/secondVersion.json')
PoleisRawData2 = PoleisDataOnline2.json()
PoleisRawData2.keys()

dict_keys(['Makedonia', 'Boiotia', 'Cyprus', 'Thrace from Strymon to Nestos', 'Phokis', 'The South Coast of Asia Minor (Pamphylia Kilikia)', 'Thrace from Nestos to Hebros', 'Aiolis and South-western Mysia', 'Thessalia and Adjacent Regions', 'West Lokris', 'The Black Sea Area', 'The Propontic Coast of Asia Minor', 'The Aegean', 'Troas', 'Akarnania and Adjacent Areas', 'Elis', 'Inland Thrace', 'Propontic Thrace', 'Lesbos', 'Doris', 'Attika', 'Epeiros', 'The Saronic Gulf', 'Karia', 'Spain and France (including Corsica)', 'Achaia', 'Sikelia', 'Thracian Chersonesos', 'Rhodos', 'East Lokris', 'The Adriatic', 'Thrace from Axios to Strymon', 'Crete', 'Lakedaimon', 'Italia and Kampania', 'Ionia', 'Megaris, Korinthia, Sikyonia', 'Triphylia', 'Arkadia', 'Aitolia', 'Argolis', 'Lykia', 'Euboia', 'Messenia'])

In [3]:
# Read JSON into a normalized form; yields ~500 columns with region.city keys
# (pd.json_normalize replaces the deprecated pd.io.json.json_normalize, pandas >= 1.0)
dfPoleisGesamt = pd.json_normalize(PoleisRawData2)

# rotate and rename dataframe
dfPoleisGesamt= dfPoleisGesamt.transpose()
dfPoleisGesamt.columns=['Beschreibung']
dfPoleisGesamt.head(4)

# reset to new index, return old index as column 'index'
dfPoleisGesamt= dfPoleisGesamt.reset_index()
dfPoleisGesamt.head()

# split entries in column 'index' into region and city part
dfPoleisGesamt['indexSplit'] = dfPoleisGesamt['index'].str.split('.')

# generate new columns out of split index
dfPoleisGesamt['region'] = dfPoleisGesamt['indexSplit'].apply(lambda raw: raw[0])
dfPoleisGesamt['city'] = dfPoleisGesamt['indexSplit'].apply(lambda raw: raw[1])
dfPoleisGesamt.head()

# remove columns 'index' and 'indexSplit', since they contain redundant information
dfPoleisGesamt = dfPoleisGesamt.drop(columns=['index', 'indexSplit'])
dfPoleisGesamt.head()

Unnamed: 0,Beschreibung,region,city
0,"Identifier: 233. , (Ascheieus) Unlocated. Typ...",Achaia,Ascheion
1,"Identifier: 235. , (Bourios) Map 58. Lat. 38...",Achaia,Boura
2,"Identifier: 236. , (Helikeus) Map 58. Lat. 3...",Achaia,Helike
3,"Identifier: 237. , (Keryneus) Map 58. Lat. 3...",Achaia,Keryneia
4,"Identifier: 238. , (Leontesios) Map 58.Lat.38...",Achaia,Leontion


## Textmustersuche zur Beschreibung einer Polis

### Geographische Koordinaten

In [4]:
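def ListePattern(string, pattern):
    # Return all non-overlapping matches of pattern in string,
    # or None (implicitly) if nothing matches
    x = re.findall(pattern, string)
    if x:
        return x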

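Example for geographical coordinates:

- (?<=Lat\.\s): a group of the form (?...) is an extension group; this one is zero-width, so its text is not part of the match
- ?<=: marks the group as a positive lookbehind assertion
- Lat\.\s: the string pattern "Lat. ", with the period escaped as \. and the space written as \s
- \s?\d+\.\d+: optional whitespace (because of ?), one or more digits (because of +), an escaped period, one or more digits; the code below omits the leading \s?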

Code::

    dfPoleisGesamt['Latitude'] = dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw, r"(?<=Lat\.\s)\d+\.\d+"))
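
To see what the lookbehind does in isolation, the following minimal sketch applies the latitude pattern to a shortened sample string. The sample text and the names `sample`, `lat_pattern`, `latitudes`, and `all_decimals` are made up for illustration; only the pattern itself comes from the code above.

```python
import re

# Shortened sample in the style of a Poleis description (illustrative only)
sample = "Identifier: 236. , (Helikeus) Map  58.  Lat. 38.15,long.  22.10."

# Lookbehind: the match must be preceded by "Lat. ", but that prefix
# is not part of the returned text
lat_pattern = r"(?<=Lat\.\s)\d+\.\d+"

latitudes = re.findall(lat_pattern, sample)
print(latitudes)

# Without the lookbehind, every decimal number in the text is returned
all_decimals = re.findall(r"\d+\.\d+", sample)
print(all_decimals)
```

The lookbehind restricts the result to the number that follows "Lat. ", whereas the bare number pattern also picks up the longitude.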

### Citation references, names, years

In [5]:
dfPoleisGesamt["Beschreibung"].iloc[2] # iloc selects rows by integer position

"Identifier: 236. , (Helikeus) Map  58.  Lat. 38.15,long.  22.10.  Size of  territory: 1  or 2.  Type:  A.  Paus. 7.24.5  locates  Helike  40 stades  from  Aigion  (no. 231),  while  Strabo  8.7.2  (following Herakleides)  places  it  12  stades  from  the  sea.  This  should  put it  between  the  rivers  Selinous  and  Kerynitis  (Morgan  and Hall ( 1996)  175;  Barr.).  The  city,  which  was  overwhelmed  by a  tidal  wave  occasioned  by  an  earthquake  in  373  (Diod. 15.48.1\xad49.4;  Polyb. 2.41.7;Strabo 8.7.2;  Paus. 7.24.6;Ael.  NA 11.19),  was  normally  supposed  to  lie  under  water  (cf.  Ov. Met. 15.293\xad95),  but  sonar  investigation  suggests  that  it  may actually  lie  inland  under  massive  sedimentary  deposits  in the  vicinity  of  Nea  Keryneia  (Petropoulos  ( 1983);  cf.  Ptol. Geog. 3.14.36,  who  lists  Helike  among  the  inland  cities  of Achaia).  However,  Rizakis  ( 1995)  203\xad4  finds all candidates for  ancient  Helike  unconvincing.  The  

## Patterns for recognizing literature references

- Primary sources

(Polyb. 1.18.2)
(Diod. 13.85.4  (r 406))
(Diod. 13.108.2)
(Hdt. 7.165;  IGDS  no. 182a)
(Pind.  Pyth. 6)
(Thuc. 6.4.4: µµ  )
(Xanthos  (FGrHist 765)  fr. 33;  Arist.  fr. 865)

- Secondary sources

(Karlsson  ( 1995)  161
(Waele  ( 1971) 195;  Hinz  ( 1998)  79)

- Years (d stands for a digit)
( dddd)

### Several regular expressions are needed to find all citations

Find all expressions like those above in which a capitalized abbreviation is followed by a period and then by digit sequences of the form [digits][period][digits][period][digits].

In [6]:
dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw, r'[A-Z][a-z]{1,10}\. \d{1,3}\.\d{1,3}\.\d{1,3}'))[2]

['Paus. 7.24.5',
 'Diod. 15.48.1',
 'Polyb. 2.41.7',
 'Paus. 7.24.6',
 'Geog. 3.14.36',
 'Diod. 15.49.3',
 'Polyb. 2.41.7',
 'Diod. 15.48.3',
 'Diod. 15.49.3',
 'Polyb. 2.41.6',
 'Paus. 7.24.5']

Find all expressions like those above in which, instead of the period after the lowercase letters, up to two whitespace characters, an opening parenthesis, and four digits follow.

In [7]:
dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw, r'[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{4}\)'))[2]

['Hall ( 1996)',
 'Petropoulos  ( 1983)',
 'Rizakis  ( 1995)',
 'Katsonopoulou  ( 1999)',
 'Petropoulos  ( 1990)',
 'Hall  ( 1996)',
 'Aymard  ( 1938)',
 'Walbank ( 2000)']

Find all authors followed by up to two whitespace characters and ([digits][period][digits][period][digits]).

In [8]:
dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw, r'[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{1,2}\.\d{1,2}\.\d{1,2}\)'))[2]

As above, but without parentheses.

In [9]:
dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw, r'[A-Z][a-z]{1,15}\s{0,2}\s*\d{1,2}\.\d{1,2}\.\d{1,2}'))[2]

['Strabo  8.7.2', 'Strabo 8.7.2', 'Strabo  8.7.2', 'Strabo  6.1.13']

In [10]:
# Three abbreviated name parts, each ending in a period, followed by [digits][period][digits]
dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw, r'[A-Z][a-z]{1,15}\.\s+[A-Z][a-z]{1,10}\.\s+[A-Za-z]{1,10}\.\s+\d{1,3}\.\d{1,3}'))[2]

['Theophr.  Phys.  Op. 12.122', 'Theophr.  Phys.  op. 12.122']

In [11]:
# Two abbreviated name parts, each ending in a period, followed by [digits][period][digits]
dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw, r'[A-Z][a-z]{1,15}\.\s+[A-Z][a-z]{1,10}\.\s+\d{1,3}\.\d{1,3}'))[2]

['Ov. Met. 15.293',
 'Ptol. Geog. 3.14',
 'Hom.  Il. 2.575',
 'Phys.  Op. 12.122',
 'Hom. Il. 2.575',
 'Hom.  Il. 8.203']

In [26]:
# Combine the individual conditions into one pattern via alternation (|);
# compiling the regular expression speeds up the repeated search

pat = re.compile(r'([A-Z][a-z]{1,10}\. \d{1,3}\.\d{1,3}\.\d{1,3}|[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{4}\)|[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{1,2}\.\d{1,2}\.\d{1,2}\)|[A-Z][a-z]{1,15}\s{0,2}\s*\d{1,2}\.\d{1,2}\.\d{1,2}|[A-Z][a-z]{1,15}\.\s+[A-Z][a-z]{1,10}\.\s+[A-Z][a-z]{1,10}\.\s+\d{1,3}\.\d{1,3}|[A-Z][a-z]{1,15}\.\s+[A-Z][a-z]{1,10}\.\s+\d{1,3}\.\d{1,3})')

dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw,pat))[2]

['Paus. 7.24.5',
 'Strabo  8.7.2',
 'Hall ( 1996)',
 'Diod. 15.48.1',
 'Polyb. 2.41.7',
 'Strabo 8.7.2',
 'Paus. 7.24.6',
 'Ov. Met. 15.293',
 'Petropoulos  ( 1983)',
 'Ptol. Geog. 3.14',
 'Rizakis  ( 1995)',
 'Hom.  Il. 2.575',
 'Theophr.  Phys.  Op. 12.122',
 'Diod. 15.49.3',
 'Polyb. 2.41.7',
 'Diod. 15.48.3',
 'Diod. 15.49.3',
 'Polyb. 2.41.6',
 'Hom. Il. 2.575',
 'Katsonopoulou  ( 1999)',
 'Petropoulos  ( 1990)',
 'Hom.  Il. 8.203',
 'Strabo  8.7.2',
 'Paus. 7.24.5',
 'Hall  ( 1996)',
 'Aymard  ( 1938)',
 'Walbank ( 2000)',
 'Strabo  6.1.13']
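
A long alternation written as a single string is hard to read and to extend. The following minimal sketch shows one way to assemble such a pattern from separate parts. It uses only three of the sub-patterns from the cells above; the names `parts`, `combined`, `sample`, and `found` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Individual patterns taken from the cells above; the list and variable
# names are illustrative, not part of the original notebook
parts = [
    r"[A-Z][a-z]{1,10}\. \d{1,3}\.\d{1,3}\.\d{1,3}",         # e.g. "Paus. 7.24.5"
    r"[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{4}\)",                  # e.g. "Hall ( 1996)"
    r"[A-Z][a-z]{1,15}\s{0,2}\s*\d{1,2}\.\d{1,2}\.\d{1,2}",  # e.g. "Strabo  8.7.2"
]

# Join the alternatives with "|" and compile once for repeated use
combined = re.compile("|".join(parts))

sample = "Paus. 7.24.5  locates  Helike, while  Strabo  8.7.2  differs (Rizakis  ( 1995)  203)."
found = combined.findall(sample)
print(found)
```

Because the joined pattern contains no capturing group, `findall` returns the full matched text for each hit. At every position the alternatives are tried from left to right, so the order of the list can matter when sub-patterns overlap.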

In [39]:
# Extended pattern: the last three alternatives add citations with superscript digits or space-separated numbers
pat = re.compile(r'([A-Z][a-z]{1,10}\. \d{1,3}\.\d{1,3}\.\d{1,3}|[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{4}\)|[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{1,2}\.\d{1,2}\.\d{1,2}\)|[A-Z][a-z]{1,15}\s{0,2}\s*\d{1,2}\.\d{1,2}\.\d{1,2}|[A-Z][a-z]{1,15}\.\s+[A-Z][a-z]{1,10}\.\s+[A-Z][a-z]{1,10}\.\s+\d{1,3}\.\d{1,3}|[A-Z][a-z]{1,15}\.\s+[A-Z][a-z]{1,10}\.\s+\d{1,3}\.\d{1,3}|[A-Z][a-z]{1,10}.?\s+[¹²³]\d+\.\d+|[A-Z][a-z]{1,10}.?\s+[¹²³]\d+|[A-Z][a-z]{1,10}.?\s+\d+\s+\d+)')

dfPoleisGesamt['Quellen'] = dfPoleisGesamt['Beschreibung'].apply(lambda raw: ListePattern(raw,pat))

In [40]:
dfPoleisGesamt.head(10)

Unnamed: 0,Beschreibung,region,city,Quellen
0,"Identifier: 233. , (Ascheieus) Unlocated. Typ...",Achaia,Ascheion,
1,"Identifier: 235. , (Bourios) Map 58. Lat. 38...",Achaia,Boura,"[Paus. 7.25.8, Strabo 8.7.5, Hall ( 1996), R..."
2,"Identifier: 236. , (Helikeus) Map 58. Lat. 3...",Achaia,Helike,"[Paus. 7.24.5, Strabo 8.7.2, Hall ( 1996), Di..."
3,"Identifier: 237. , (Keryneus) Map 58. Lat. 3...",Achaia,Keryneia,"[Paus. 7.25.5, Rizakis ( 1995), Paus. 7.25.5,..."
4,"Identifier: 238. , (Leontesios) Map 58.Lat.38...",Achaia,Leontion,"[Lauffer ( 1989), Polyb. 2.41.7, Strabo 8.7...."
5,"Identifier: 241. , (Olenios) Map 58. Lat. 38...",Achaia,Olenos,"[Paus. 7.18.1, Strabo 8.7.4, Rizakis ( 1995)..."
6,"Identifier: 244. , (Pharaieus) Map 58. Lat. ...",Achaia,Pharai,"[Paus. 7.22.1, Rizakis ( 1995), Polyb. 2.41.8,..."
7,"Identifier: hall, (Tritaieus) Map 58.Lat.37.5...",Achaia,Tritaia,"[Paus. 7.22.6, Rizakis ( 1995), Polyb. 2.41.8..."
8,"Identifier: 801. , (Adramytenos) Map 56. Lat...",Aiolis and South-western Mysia,Adramyttion,"[Hdt. 7.42.1, Xen. An. 7.8, Thuc. 5.1.1, Foss..."
9,"Identifier: 802. , (Aigaieus) Map 56. Lat. 3...",Aiolis and South-western Mysia,Aigai(ai),"[Hdt. 1.149.1, Strabo 13.3.5, Xen. Hell. 4.8,..."


# Data validation

## Evaluating the model with a performance (confusion) matrix

Full description of one city. Sources are marked by hand.

In [41]:
#dfPoleisGesamt['Beschreibung'].iloc[2]

"Identifier: 236. , (Helikeus) Map  58.  Lat. 38.15,long.  22.10.  Size of  territory: 1  or 2.  Type:  A.  **Paus. 7.24.5**  locates  Helike  40 stades  from  Aigion  (no. 231),  while  **Strabo  8.7.2**  (following Herakleides)  places  it  12  stades  from  the  sea.  This  should  put it  between  the  rivers  Selinous  and  Kerynitis  (**Morgan  and Hall ( 1996)  175;  Barr.)**.  The  city,  which  was  overwhelmed  by a  tidal  wave  occasioned  by  an  earthquake  in  373  (**Diod. 15.48.1\xad49.4;  Polyb. 2.41.7;Strabo 8.7.2;  Paus. 7.24.6;Ael.  NA 11.19**),  was  normally  supposed  to  lie  under  water  (cf.  **Ov. Met. 15.293\xad95**),  but  sonar  investigation  suggests  that  it  may actually  lie  inland  under  massive  sedimentary  deposits  in the  vicinity  of  Nea  Keryneia  (**Petropoulos  ( 1983);  cf.  Ptol. Geog. 3.14.36**,  who  lists  Helike  among  the  inland  cities  of Achaia).  However,  **Rizakis  ( 1995)  203\xad4**  finds all candidates for  ancient  Helike  unconvincing.  The  toponym  is  usually **`,  (Hom.  Il. 2.575;  SEG 36  718  (C 5e);  Hdt. 1.145)  or `  (Syll. ³90.12)**,  though  **Theophr.  Phys.  Op. 12.122**  cites a  verse  which  gives  the  toponym  as  `.  The  city- ethnic  is  `  **(Diod. 15.49.3)**. Helike  is  called  a  polis  in  the  urban  sense  in  **Heraclid. Pont.  fr. 46a  (r 373)**  and  **Theophr.  Phys.  op. 12.122  (r 373)**,but  is absent  from  Ps.-Skylax's  list  of  Achaian  poleis  ( 42),  which may  suggest  that  this  chapter  was  composed  after  373. Retrospective  evidence  is  provided  by  **Polyb. 2.41.7  (rC 4)**, who  calls  it  a  polis  in  the  political  sense,  and  by  **Diod. 15.48.3 (r 373)**,  who  describes  it  as  a  polis  in  the  urban  sense.  The internal  collective  use  of  the  city-ethnic  is  probably  found (abbreviated)  on  C 4  coins  (infra),  and  the  external  collective use  is  found  in  **Diod. 15.49.3**  (r  ante 373).  
A  citizen  of  Helike served  as  Delphic  theorodokos  in  C 5l  **(Syll. ³90.12)**.According to **Polyb. 2.41.6\xad7**  (rC 4),  Helike  had  been  a  member  of  the Achaian  Confederacy. The  early  physical  existence  of  Helike  is  attested  in  **Hom. Il. 2.575**  and  in  a  C 5e  inscription  (**SEG 36  718**;  see  also  **Soter and  Katsonopoulou  ( 1999)**).  Archaeological  investigations have  revealed  the  foundations  of  two  small  temples,  one Archaic,  the  other  Classical,  at  Nea  Keryneia,  which  may possibly  be  associated  with  the  acropolis  of  ancient  **Helike (Petropoulos  ( 1990)**).  The  most  important  sanctuary  at Helike  was,  however,  that  of  Poseidon  Helikonios  **(Hom.  Il. 8.203;Diod.  15.49.2\xad3;Strabo  8.7.2;  Paus. 7.24.5\xad6)**,  and  it  is quite  likely  that  this  sanctuary  acted  as  a  common  place  of union  for  the  Achaians  prior  to  the  destruction  of Helike,  when  that  function  was  assumed  by  the  sanctuary of  Zeus  Homarios  near  Aigion  **(Morgan  and  Hall  ( 1996) 195\xad96,  contra  Aymard  ( 1938)  286\xad87,  293;  Walbank ( 2000))**. According  to  **Strabo  6.1.13**,  Is  of  Helike  was  the  founder  of Sybaris  (no. 70)  in  South  Italy.  The  reading  (  '  &lt;...  &gt; ) is,  however,  unsure,  and **Bérard  ( 1957)**  141  n. 2  proposed  either  &lt; &gt;  or &lt; &gt;. A  series  of  bronze  coins,  dating  to  C 4f,  depicts  obv.  head of  Poseidon.  Legend: (retr.).  Rev.  trident  between dolphins  in  wreath  **(Head,  HN ²414)**. 236. "

Sources found

In [42]:
dfPoleisGesamt['Quellen'].iloc[2]

['Paus. 7.24.5',
 'Strabo  8.7.2',
 'Hall ( 1996)',
 'Diod. 15.48.1',
 'Polyb. 2.41.7',
 'Strabo 8.7.2',
 'Paus. 7.24.6',
 'Ov. Met. 15.293',
 'Petropoulos  ( 1983)',
 'Ptol. Geog. 3.14',
 'Rizakis  ( 1995)',
 'Hom.  Il. 2.575',
 'Syll. ³90.12',
 'Theophr.  Phys.  Op. 12.122',
 'Diod. 15.49.3',
 'Polyb. 2.41.7',
 'Diod. 15.48.3',
 'Diod. 15.49.3',
 'Syll. ³90.12',
 'Polyb. 2.41.6',
 'Hom. Il. 2.575',
 'Katsonopoulou  ( 1999)',
 'Petropoulos  ( 1990)',
 'Hom.  Il. 8.203',
 'Strabo  8.7.2',
 'Paus. 7.24.5',
 'Hall  ( 1996)',
 'Aymard  ( 1938)',
 'Walbank ( 2000)',
 'Strabo  6.1.13']

Discussion of the performance matrix: four cases
- should match and actually matches (true positive)
- should not match but actually matches (false positive)
- should match but does not actually match (false negative)
- should not match and does not actually match (true negative)
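
The first three cases can be counted with set operations once the hand-marked citations and the pattern output are available as sets. The sketch below uses two small made-up sets, not the actual evaluation data; the names `expected` and `found` are assumptions. True negatives are left out because they require a defined set of candidate strings that could have matched but should not.

```python
# Hand-marked citations (gold standard) and citations returned by the
# pattern; both sets are small illustrative examples, not real results
expected = {"Paus. 7.24.5", "Strabo 8.7.2", "SEG 36 718", "Morgan and Hall (1996)"}
found = {"Paus. 7.24.5", "Strabo 8.7.2", "Hall (1996)"}

true_positives = expected & found    # should match and does match
false_positives = found - expected   # should not match but does match
false_negatives = expected - found   # should match but does not match

# Precision: share of found items that are correct
precision = len(true_positives) / len(found)
# Recall: share of expected items that were found
recall = len(true_positives) / len(expected)

print(sorted(true_positives), sorted(false_positives), sorted(false_negatives))
print(f"precision={precision:.2f}, recall={recall:.2f}")
```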

While 29 citations are found, 10 are missed.
Structures not yet captured are
- citations consisting of several capital letters (e.g. SEG 36  718)
- citations with special characters (e.g. Syll. ³90.12)

2 structures are captured incorrectly:
- citations with several authors are captured under one name only
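
One possible way to capture citations with several authors is to allow an optional "Name and" prefix in front of the author-year pattern. The following is only a sketch under the assumption that co-authors are joined by the word "and"; the name `multi_author` and the sample string are hypothetical, and the pattern has not been checked against the full data set.

```python
import re

# Optional "Name and " prefix (non-capturing), followed by the
# author-year pattern used in the cells above
multi_author = re.compile(
    r"(?:[A-Z][a-z]{1,15}\s+and\s+)?[A-Z][a-z]{1,15}\s{0,2}\(\s*\d{4}\)"
)

sample = "(Morgan  and Hall ( 1996)  175;  Barr.) ... Rizakis  ( 1995)  203"
found = multi_author.findall(sample)
print(found)
```

Single-author citations still match because the prefix group is optional.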

To extend the search pattern, we used

In [43]:
re.findall(r'([A-Za-z]{1,10}.?\s+[¹²³]\d+\.\d+|[A-Za-z]{1,10}.?\s+[¹²³]\d+|[A-Za-z]{1,10}.?\s+\d+\s+\d+)',dfPoleisGesamt['Beschreibung'].iloc[2])

['SEG 36  718', 'Syll. ³90.12', 'Syll. ³90.12', 'SEG 36  718', 'HN ²414']
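
Note that `.?` in the pattern above stands for any optional character, not only a period. Whether this is intended is not stated in the notebook. The sketch below contrasts it with the escaped form `\.?` on a made-up string; the names `any_char`, `only_period`, and `sample` are illustrative.

```python
import re

# ".?" accepts any single optional character after the letters
any_char = r"[A-Za-z]{1,10}.?\s+\d+\s+\d+"
# "\.?" accepts only an optional literal period
only_period = r"[A-Za-z]{1,10}\.?\s+\d+\s+\d+"

# Made-up string with a semicolon directly after the abbreviation
sample = "SEG; 36  718"

loose = re.findall(any_char, sample)
strict = re.findall(only_period, sample)
print(loose, strict)
```

With the unescaped dot the semicolon is consumed as part of the match, while the escaped variant rejects the string.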

## Value distributions, test for duplicates

Read the values of the column Quellen as a list and flatten the sublists into a single list.

In [44]:
mainList = dfPoleisGesamt['Quellen'].values.tolist()

quellenListe = []
for sublist in mainList:
    if sublist:
        for k in sublist:
            quellenListe.append(k)
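
The same flattening can be written more compactly with `itertools.chain.from_iterable`. This is a sketch on a toy list; `main_list` and `flat` are illustrative names standing in for `mainList` and `quellenListe`.

```python
from itertools import chain

# Toy stand-in for dfPoleisGesamt['Quellen'].values.tolist();
# None marks a description without any match
main_list = [None, ["Paus. 7.25.8", "Strabo  8.7.5"], ["Paus. 7.24.5"]]

# Skip empty entries, then chain the remaining sublists into one flat list
flat = list(chain.from_iterable(sub for sub in main_list if sub))
print(flat)
```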

In [45]:
quellenListe[:10]

['Paus. 7.25.8',
 'Strabo  8.7.5',
 'Hall  ( 1996)',
 'Rizakis ( 1995)',
 'Polyb. 2.41.13',
 'Tzetz.  Chil. 37.179',
 'Paus. 7.25.8',
 'Polyb. 2.41.7',
 'Diod. 15.48.3',
 'Paus. 7.25.8']

Count the frequency of each distinct source and store the result as a dictionary.

In [46]:
quellenVerteilung = {x:quellenListe.count(x) for x in quellenListe}
quellenVerteilung['Diod. 14.90.3']

2
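
Calling `list.count` inside the comprehension rescans the whole list for every element, so the run time grows quadratically with the number of citations. A minimal sketch of an alternative with `collections.Counter`, which counts in a single pass, is shown here on a toy list; `sources` and `distribution` are illustrative names.

```python
from collections import Counter

# Toy list standing in for quellenListe
sources = ["Paus. 7.25.8", "Polyb. 2.41.7", "Paus. 7.25.8", "Diod. 15.48.3"]

# Counter builds the same source -> frequency mapping in one pass
distribution = Counter(sources)
print(distribution["Paus. 7.25.8"])
print(distribution.most_common(1))
```

`Counter` is a `dict` subclass, so it can be passed on wherever the dictionary version is used.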

Create a DataFrame with a new index and column names. Sort it by the frequency of each source.

In [47]:
dfQuellenVerteilung = pd.DataFrame([quellenVerteilung]).T
dfQuellenVerteilung

Unnamed: 0,0
Abmeier ( 1990),1
Accame ( 1941),1
Achaians. Solin. 2.10,1
Adamesteanu ( 1970),3
Adamesteanu ( 1973),1
Adamesteanu ( 1974),4
Adamesteanu ( 1976),1
Adamesteanu ( 1979),1
Adamesteanu ( 1982),1
Adamesteanu ( 1986),1


In [48]:
dfQuellenVerteilung = dfQuellenVerteilung.reset_index()
dfQuellenVerteilung.head()

Unnamed: 0,index,0
0,Abmeier ( 1990),1
1,Accame ( 1941),1
2,Achaians. Solin. 2.10,1
3,Adamesteanu ( 1970),3
4,Adamesteanu ( 1973),1


In [49]:
dfQuellenVerteilung = dfQuellenVerteilung.rename(columns={'index': 'Quelle', 0:'Häufigkeit'})
dfQuellenVerteilung.head()

Unnamed: 0,Quelle,Häufigkeit
0,Abmeier ( 1990),1
1,Accame ( 1941),1
2,Achaians. Solin. 2.10,1
3,Adamesteanu ( 1970),3
4,Adamesteanu ( 1973),1


In [50]:
dfQuellenVerteilung.sort_values(by='Häufigkeit',ascending=False).head(10)

Unnamed: 0,Quelle,Häufigkeit
3169,Jost ( 1985),71
7951,Xen. Hell. 3.2,63
7950,Xen. Hell. 3.1,59
3102,Isaac ( 1986),57
6700,Svoronos ( 1890),55
5266,Rider ( 1966),55
1870,Fossey ( 1988),52
7969,Xen. Hell. 6.5,51
7960,Xen. Hell. 4.8,48
7964,Xen. Hell. 5.4,47
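
The extracted strings keep the irregular spacing of the source text, so variants such as 'Strabo  8.7.2' and 'Strabo 8.7.2' are counted as different sources. A minimal sketch of normalizing whitespace before counting follows; the function name `normalize` and the list `raw` are hypothetical and not part of the original notebook.

```python
import re
from collections import Counter

def normalize(citation):
    """Collapse runs of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", citation).strip()

# Spacing variants as they appear in the extracted lists above
raw = ["Strabo  8.7.2", "Strabo 8.7.2", "Hall ( 1996)", "Hall  ( 1996)"]

# Count after normalization so that spacing variants fall together
counts = Counter(normalize(c) for c in raw)
print(counts)
```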
