# Cleaned Sentences
Begin date: September 12, 2023

Authors: Tina, Micaela, Adam

The task for this notebook is to make a data frame of sentences from the cleaned words in words.csv.

The steps below are for one words.csv file; an index will be required to run them iteratively over all words.csv files.
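
One way to run the steps over every words.csv is to loop over the file ids listed in the catalog. The following is a minimal sketch, assuming the ids are available as a list and that each file can be read through the same `uc?id=` URL pattern used in the loading cell of this notebook. The names `drive_csv_url`, `process_all`, `clean_fn` and `reader` are hypothetical and not part of the existing code.

```python
import pandas as pd

def drive_csv_url(file_id):
    """Build the download URL for a Drive file id (same pattern as the loading cell)."""
    return f'https://drive.google.com/uc?id={file_id}'

def process_all(file_ids, clean_fn, reader=pd.read_csv):
    """Apply clean_fn to every words.csv; returns {file_id: cleaned DataFrame}."""
    results = {}
    for file_id in file_ids:
        try:
            words = reader(drive_csv_url(file_id))
        except Exception as err:  # e.g. network or permission errors
            print(f'skipping {file_id}: {err}')
            continue
        results[file_id] = clean_fn(words)
    return results
```

Passing the reader as a parameter keeps the loop testable without network access; in the notebook the default `pd.read_csv` would be used.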

## Cleaned Words, first steps (1-5):

1. Load in a words.csv for testing & development
2. Relabel the header for 'Unnamed: 0' to pd_index
3. Add a column to the right of 'word' called 'cleaned'
4. Partitioned words, e.g. para- / digme, should be joined (paradigme); we add the same joined word to both cells.
5. Identify all forms of punctuation
6. For words with punctuation:
  0. Create 2 new columns after 'cleaned', called: 'punct_L' and 'punct_R'
  1. If a cell only contains a punctuation mark, do nothing, which will leave the 'cleaned' column empty, along with 'punct_L' and 'punct_R'
  2. If the punctuation precedes the word, e.g. «faire, then it goes into 'punct_L',
  3. If the punctuation follows the word, e.g. faire»., it goes into 'punct_R'
  4. If there are multiple punctuation marks, we can include both in the appropriate 'punct_L/R' column
7. Save the cleaned_words.csv

Notes: Gensim has punctuation for multiple languages, but their tool only strips punctuation. We include in step 5 a way to identify and make a list of punctuation marks...
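
As an illustration of step 5, punctuation marks can be listed by checking each character's Unicode category. This is a sketch under the assumption that "punctuation" means the Unicode `P*` categories; the helper name `find_punctuation` is hypothetical. Symbols such as '±' fall in other categories and would not be counted here, whereas the regular expression `[^\w\s]` used later in this notebook matches them as well.

```python
import unicodedata
from collections import Counter

def find_punctuation(words):
    """Count every character whose Unicode category starts with 'P' (punctuation)."""
    counts = Counter()
    for word in words:
        for char in str(word):
            if unicodedata.category(char).startswith('P'):
                counts[char] += 1
    return counts

# Toy input built from the examples in the steps above
marks = find_punctuation(['«faire', 'faire».', 'para-', 's.v.'])
print(sorted(marks))
```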

## Next Steps (6-8): cuneiform detection & conversion

1. To detect whether a 'word' is cuneiform transliteration (i.e. hyphenated syllables '-', or syllables joined by a period '.' without spaces), we can make a conditional loop to add tokens with hyphens to the 'cleaned with dash' column.

2. When there's a 'word' that is Sumerian or Akkadian, but does not have a hyphen, we will need to detect this as well. Here are some ideas on how to do this.

  2.1. Use an existing dict of Sumerian and Akkadian tokens, which would give us a lot of coverage, but we could only find what we have in the dict.

  2.2. Use a NN that can learn what a Sumerian / Akkadian token is and find new instances that way...

3. We can convert transliteration to Unicode cuneiform signs by using the sign list (which Adam is preparing). We will check a given 'word' in a look-up table to make a conversion (Adam will provide this code).

4. We will also use a Wikidata query (Adam is working on this) to check that these tokens are labeled properly and obtain the ids for each. Once that is complete, we can be far more confident that the text is clean.

5. Sentence segmentation: use the punctuation to add [CLS] & [SEP] for sBERT embeddings, and format the words to sentences.
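
Items 1 and 2.1 above can be sketched together: a pattern test for joined syllables, with a dictionary check as a fallback for tokens without a hyphen. This is only a sketch; `TRANSLIT_RE`, `looks_like_transliteration` and `known_tokens` are hypothetical names, and the pattern is an assumption about what the transliteration looks like. Hyphenated French words such as "peut-être" also match the pattern, so it would over-detect without a further check.

```python
import re

# Syllables (letters plus optional index digits) joined by '-' or '.', no spaces,
# e.g. 'la-ba-an-ak-' or 'tur.tur'; a trailing hyphen is allowed.
TRANSLIT_RE = re.compile(r'^[^\W\d_]+\d*(?:[-.][^\W\d_]+\d*)+-?$')

def looks_like_transliteration(token, known_tokens=frozenset()):
    """True if the token matches the joined-syllable pattern or is in known_tokens."""
    token = str(token)
    return bool(TRANSLIT_RE.match(token)) or token in known_tokens

for tok in ['la-ba-an-ak-', 'tur.tur', 'du11-g', 'faire', 's.v.']:
    print(tok, looks_like_transliteration(tok))
```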

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import csv
import re
import spacy

In [3]:
#  workdir = '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/NER/geography/ocr-output' # for Adam
workdir = '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/NER/geography/ocr-output' #Tina

In [4]:
# !ls "/content/drive/MyDrive/task 1 words cleaned" # for Micaela
!ls '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/NER/geography/ocr-output/' #Adam
#!ls '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/ocr-output/' #Tina

ls: cannot open directory '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/NER/geography/ocr-output/': Input/output error


## 1 Load in the words.csv for testing and development

In [15]:
sheet_id = '1i8Ys4OR5yIkZaP2vGYZ4sQSj_oDWeIs1QGq4_BQP0kk'
sheet_name = 'Catalog_words'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
catalog = pd.read_csv(url)
catalog

Unnamed: 0,1--F6C-9ItLHwwztNaN8SN1xIpX7TU6R_,ocr-output,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,19thnx95q7XA-5cg9pUvnT1-OfdZ6dCsu,5604dc09256ebb09e90a6cef533474a4,207658,text/csv,2023-07-16T15:26:09.729Z,2023-05-26T21:28:13.396Z,ocr-output/0_Attinger - A propos de AK «faire»...,words.csv
1,1ROLgwBHRcpB8-_L1OciwTXFhM-Ba-hoe,18f68c68865cd4903e66ea8cbd2dccf2,94740,text/csv,2023-07-16T15:39:07.045Z,2023-05-27T00:27:23.598Z,ocr-output/100000_Glassner2012DevinDroiteGauche,words.csv
2,12G93qe-u_AnnW2jl6lOfspQ9Sy4ZIf4d,f04326e78e0474d503ba98e74ae53aab,105992,text/csv,2023-07-16T15:24:23.412Z,2023-05-27T00:44:30.172Z,"ocr-output/100001_BASOR 349, Mountjoy, Mycenea...",words.csv
3,19wIGV-XFe9OrHOfXCMoIaHslLC4x4tOl,62218e7e5d9bac5d5e749a400573cdac,127497,text/csv,2023-07-16T15:44:25.095Z,2023-05-26T23:25:30.380Z,ocr-output/100002_Hallo 1972 The House of Ur-M...,words.csv
4,1aT0SbjJswG5iz1BBYCLil6YJEEQAFnJI,e46eea239d034e565f003d40e0b7c92e,39064,text/csv,2023-07-16T15:15:31.639Z,2023-05-26T21:48:21.569Z,ocr-output/100003_Mizrahi_-_Hebrew_of_Jubilees...,words.csv
...,...,...,...,...,...,...,...,...
205854,10rbyA4uPjqpPBGBQlHNALQ76XhlW1bZv,95df387fc49fc49f1a0a3618485eb626,497055,text/csv,2023-07-16T15:50:30.283Z,2023-05-26T22:13:37.844Z,ocr-output/99999_Huehnergard (1983) JAOS 103,words.csv
205855,1lesRhw8TApPMg3T-ny44reGSzpupWQ9l,3369d4b88ed4ac9b5aff7ea745a54dfa,17689,text/csv,2023-07-16T15:07:24.587Z,2023-05-27T00:09:23.946Z,"ocr-output/9999_7,3 (1953) Goetze Tuttul in Ca...",words.csv
205856,1yJg0KnmIDxdoqdYgc0nQufvYXa1BKGis,17fcac0d7a84e613dea9cceb4c5d704e,129011,text/csv,2023-07-16T15:07:04.845Z,2023-05-27T01:57:17.462Z,ocr-output/999_אריאל_יודיצקי בער בעי עברית ארמית,words.csv
205857,15txP2N7xTFMvbDNAreo3Hoe-vevz9Is1,a675f54b5c5532acda87e4c9bc75667c,65193,text/csv,2023-07-16T15:49:46.753Z,2023-05-27T00:21:02.320Z,"ocr-output/99_Or NS 76, 2007, pp",words.csv


In [5]:
file_id = '19thnx95q7XA-5cg9pUvnT1-OfdZ6dCsu'
url = f'https://drive.google.com/uc?id={file_id}'
df = pd.read_csv(url)
df

Unnamed: 0.1,Unnamed: 0,identifier,word
0,0,0.0.0,46
1,1,0.0.1,Pascal
2,2,0.0.2,Attinger
3,3,0.0.3,A
4,4,0.0.4,propos
...,...,...,...
11237,11237,0.18.707,la-ba-an-ak-
11238,11238,0.18.708,e
11239,11239,0.18.709,.
11240,11240,0.18.710,à


## 2 Relabel header



In [None]:
df.rename(columns={"Unnamed: 0": "pd_index"}, inplace=True)
df

Unnamed: 0,pd_index,identifier,word
0,0,0.0.0,46
1,1,0.0.1,Pascal
2,2,0.0.2,Attinger
3,3,0.0.3,A
4,4,0.0.4,propos
...,...,...,...
11237,11237,0.18.707,la-ba-an-ak-
11238,11238,0.18.708,e
11239,11239,0.18.709,.
11240,11240,0.18.710,à


## 3 Add new 'Cleaned' column for partitioned words and identify words with punctuation

To detect whether a 'word' is cuneiform transliteration (i.e. hyphenated syllables '-', or syllables joined by a period '.' without spaces), the cell below uses a conditional loop to add tokens with hyphens to the 'cleaned with dash' column, which is named 'norm0' in the code.

In [None]:
df['cleaned'] = df['word']


df['punct_L'] = ''
df['punct_R'] = ''


punctuation_pattern = r'[^\w\s]'

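# Iterate over the DataFrame
for idx, row in df.iterrows():
    word = row['cleaned']
    # Collect every character that is neither a word character nor whitespace
    punctuation_marks = re.findall(punctuation_pattern, word)
    if punctuation_marks:
        for punct in punctuation_marks:
            # Note: the whole word (not only the mark) is appended, once per matching mark
            if word.startswith(punct):
                df.at[idx, 'punct_L'] += word
            if word.endswith(punct):
                df.at[idx, 'punct_R'] += word
    # Keep hyphenated tokens (possible transliteration) in 'norm0'
    if '-' in word:
        df.at[idx, 'norm0'] = word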

# Join words split across a line break: 'para-' + 'digme' -> 'paradigme'
for idx, row in df.iterrows():
    if row['word'].endswith('-') and idx + 1 < len(df):
        next_word = df.at[idx + 1, 'word']

        # 'cleaned': joined word without the hyphen, written to both rows
        joined_word = row['word'][:-1] + next_word
        df.loc[idx, 'cleaned'] = joined_word
        df.loc[idx + 1, 'cleaned'] = joined_word

        # 'norm0': joined word with the hyphen kept, written to the second row only
        df.loc[idx + 1, 'norm0'] = row['word'] + next_word

# Rows without a hyphenated form fall back to the 'cleaned' value
df['norm0'] = df['norm0'].fillna(df['cleaned'])



In [None]:
df

Unnamed: 0,pd_index,identifier,word,cleaned,punct_L,punct_R,norm0
0,0,0.0.0,46,46,,,46
1,1,0.0.1,Pascal,Pascal,,,Pascal
2,2,0.0.2,Attinger,Attinger,,,Attinger
3,3,0.0.3,A,A,,,A
4,4,0.0.4,propos,propos,,,propos
...,...,...,...,...,...,...,...
11237,11237,0.18.707,la-ba-an-ak-,la-ba-an-ake,,la-ba-an-ak-la-ba-an-ak-la-ba-an-ak-la-ba-an-ak-,la-ba-an-ak-
11238,11238,0.18.708,e,la-ba-an-ake,,,la-ba-an-ak-e
11239,11239,0.18.709,.,.,.,.,.
11240,11240,0.18.710,à,à,,,à
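
In the output above, 'punct_R' for row 11237 holds the whole word repeated, because the loop appends `word` once per matching mark. The following is a minimal sketch of step 6 as described at the top of the notebook, where only the marks go into 'punct_L' and 'punct_R'. `SPLIT_RE`, `split_punct` and `demo` are hypothetical names; this is an illustration on toy rows, not a replacement for the cell that produced the output above.

```python
import re
import pandas as pd

# Leading punctuation, core (lazy), trailing punctuation
SPLIT_RE = re.compile(r'^([^\w\s]*)(.*?)([^\w\s]*)$', re.DOTALL)

def split_punct(word):
    """Return (punct_L, core, punct_R); a punctuation-only token gives three empty strings."""
    left, core, right = SPLIT_RE.match(str(word)).groups()
    if not core:  # step 6.1: cell contains only punctuation, leave everything empty
        return '', '', ''
    return left, core, right

demo = pd.DataFrame({'word': ['«faire', 'faire».', '.', 'la-ba-an-ak-']})
parts = pd.DataFrame(
    [split_punct(w) for w in demo['word']],
    columns=['punct_L', 'cleaned', 'punct_R'],
    index=demo.index,
)
demo = demo.join(parts)
print(demo)
```

Abbreviations such as 's.v.' would lose their final period under this split, so abbreviation expansion would have to run before it or abbreviations would need to be excluded.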


## 4 Identify words with punctuation

In [None]:
df.head()


Unnamed: 0,pd_index,identifier,word,cleaned,punct_L,punct_R,norm0
0,0,0.0.0,46,46,,,46
1,1,0.0.1,Pascal,Pascal,,,Pascal
2,2,0.0.2,Attinger,Attinger,,,Attinger
3,3,0.0.3,A,A,,,A
4,4,0.0.4,propos,propos,,,propos


4.1 Abbreviation
Use the abbreviation dictionary to expand abbreviations to their full form.

In [None]:
#abb = pd.read_csv('/content/drive/MyDrive/Tina/AbbreviationsDictionary - Sheet1.csv')#Tina
abb = pd.read_csv('/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/people/Tina/AbbreviationsDictionary - Sheet1.csv')#Adam
abbreviation_dict = dict(zip(abb['abbreviation'], abb['fullform']))
cleaned_abbreviation_dict = {key.rstrip(): value for key, value in abbreviation_dict.items()}

In [None]:
print(cleaned_abbreviation_dict.get("s.v."))

sub voce  


In [None]:
df['fullform'] = df['cleaned'].map(cleaned_abbreviation_dict).fillna(df['cleaned'])
df

Unnamed: 0,pd_index,identifier,word,cleaned,punct_L,punct_R,norm0,fullform
0,0,0.0.0,46,46,,,46,46
1,1,0.0.1,Pascal,Pascal,,,Pascal,Pascal
2,2,0.0.2,Attinger,Attinger,,,Attinger,Attinger
3,3,0.0.3,A,A,,,A,tablets in the collections of the Oriental Ins...
4,4,0.0.4,propos,propos,,,propos,propos
...,...,...,...,...,...,...,...,...
11237,11237,0.18.707,la-ba-an-ak-,la-ba-an-ake,,la-ba-an-ak-la-ba-an-ak-la-ba-an-ak-la-ba-an-ak-,la-ba-an-ak-,la-ba-an-ake
11238,11238,0.18.708,e,la-ba-an-ake,,,la-ba-an-ak-e,la-ba-an-ake
11239,11239,0.18.709,.,.,.,.,.,.
11240,11240,0.18.710,à,à,,,à,à
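
In the output above, the bare token 'A' (row 3) is replaced by the dictionary entry for the abbreviation 'A'. If that is not wanted, one option is to expand only tokens that contain a period. This is a sketch under that assumption, with a hypothetical helper `expand_abbreviations` and a two-entry toy dictionary; dictionary entries written without a period would then stay unexpanded, so it is a trade-off rather than a fix.

```python
import pandas as pd

def expand_abbreviations(tokens, abbreviation_dict):
    """Expand only tokens that contain a period; leave everything else unchanged."""
    tokens = pd.Series(tokens, dtype='object')
    has_period = tokens.str.contains('.', regex=False, na=False)
    expanded = tokens.map(abbreviation_dict)
    # Keep the expansion only where the token has a period and a dictionary entry
    return expanded.where(has_period & expanded.notna(), tokens)

toy_dict = {
    's.v.': 'sub voce',
    'A': 'tablets in the collections of the Oriental Institute, Univ. of Chicago',
}
print(expand_abbreviations(['A', 's.v.', 'propos'], toy_dict).tolist())
```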


In [None]:
df[df['cleaned'] == 's.v.']

Unnamed: 0,pd_index,identifier,word,cleaned,punct_L,punct_R,norm0,fullform
180,180,0.0.180,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
668,668,0.1.170,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
894,894,0.1.396,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
1212,1212,0.2.236,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
1784,1784,0.3.353,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
4106,4106,0.7.416,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
5944,5944,0.10.338,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
8234,8234,0.14.157,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
8512,8512,0.14.435,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce
8524,8524,0.14.447,s.v.,s.v.,,s.v.s.v.,s.v.,sub voce


## 4.2 Use punctuation to create sentences

The 'cleaned' column will form the string for each sentence, and 'punct_R' will mark the punctuation when it occurs. Basically, we join each word in 'cleaned' until we hit a '.' in 'punct_R'. The cell below works on the abbreviation-expanded 'fullform' column and tests the end of each word directly.

| id | sentence |
| -- | -- |
|#.#(-#).(sentence_id).1-9| |

* One issue here is that this method ignores footnotes at the bottom of a page... We may want to think about how to address these separately? But at least they are a sentence on the page.
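
The join-until-period idea described above can be sketched as follows, assuming 'punct_R' holds only punctuation marks (which is the plan, not what the current output contains). `build_sentences`, `add_special_tokens` and `demo` are hypothetical names. A stand-alone mark in 'word' is also treated as a boundary, since step 6.1 leaves 'punct_R' empty for such cells.

```python
import pandas as pd

def build_sentences(words_df, end_marks=('.', '!', '?')):
    """Join 'cleaned' tokens until 'punct_R' (or a lone mark in 'word') ends the sentence."""
    sentences, current, start_id = [], [], None
    columns = zip(words_df['identifier'], words_df['word'],
                  words_df['cleaned'], words_df['punct_R'])
    for identifier, word, cleaned, punct_r in columns:
        if start_id is None:
            start_id = identifier
        if cleaned:
            current.append(cleaned)
        if any(mark in punct_r for mark in end_marks) or word in end_marks:
            sentences.append((f'{start_id}-{identifier}', ' '.join(current)))
            current, start_id = [], None
    if current:  # flush words after the last boundary
        sentences.append((f'{start_id}-{identifier}', ' '.join(current)))
    return pd.DataFrame(sentences, columns=['Identifier', 'Sentence'])

def add_special_tokens(sentence):
    """Wrap a sentence in [CLS] ... [SEP] markers."""
    return f'[CLS] {sentence} [SEP]'

demo = pd.DataFrame({
    'identifier': ['0.0.0', '0.0.1', '0.0.2', '0.0.3'],
    'word': ['Il', 'va.', 'Des', 'remarques'],
    'cleaned': ['Il', 'va', 'Des', 'remarques'],
    'punct_R': ['', '.', '', ''],
})
result = build_sentences(demo)
print(result)
```

Many BERT tokenizers add [CLS] and [SEP] themselves, so `add_special_tokens` may be unnecessary depending on the embedding library. Abbreviations and footnotes are not handled in this sketch.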


In [None]:
import re

sentences = []
current_sentence = []
words = df['fullform'].tolist()
identifiers = df['identifier'].tolist()
start_identifier = identifiers[0]
parenthesis_counter = 0

for i, current_word in enumerate(words):
    current_sentence.append(current_word)

    # Track open parentheses so that we do not split inside a parenthetical
    parenthesis_counter += current_word.count('(') - current_word.count(')')

    # Look ahead: does the next word start with an uppercase letter,
    # optionally preceded by special characters? (No next word for the last row.)
    next_word = words[i + 1] if i + 1 < len(words) else ''
    next_word_starts_with_upper = re.match(r'\W*[A-Z]', next_word) is not None

    # Check if the current word is a single-letter abbreviation or a pattern such as "A. B."
    is_single_letter_abbreviation = re.match(r'^[A-Z]\.$', current_word) is not None
    is_abbrev_pattern = re.match(r'^([A-Z]\. ){1,}[A-Z]\.$', current_word) is not None

    # End the sentence if the current word ends with ').', '//.' or '».',
    # OR if it ends with a period, the next word starts with an uppercase letter,
    # all opened parentheses are closed, and the word is not an abbreviation.
    hard_stop = current_word.endswith((').', '//.', '».'))
    soft_stop = (current_word.endswith('.') and
                 next_word_starts_with_upper and
                 parenthesis_counter == 0 and
                 not is_single_letter_abbreviation and
                 not is_abbrev_pattern)
    if hard_stop or soft_stop:
        sentences.append((start_identifier + '-' + identifiers[i], ' '.join(current_sentence)))
        current_sentence = []
        if i + 1 < len(words):
            start_identifier = identifiers[i + 1]

# Flush any words left over after the last sentence boundary
if current_sentence:
    sentences.append((start_identifier + '-' + identifiers[-1], ' '.join(current_sentence)))

df_sentences_strict = pd.DataFrame(sentences, columns=['Identifier', 'Sentence'])
df_sentences_strict

Unnamed: 0,Identifier,Sentence
0,0.0.0-0.0.23,46 Pascal Attinger tablets in the collections ...
1,0.0.24-0.0.71,Après une discussion du paradigme paradigme (2...
2,0.0.72-0.0.92,Des remarques de détail touchant les autres le...
3,0.0.93-0.0.93,1.
4,0.0.94-0.0.134,Introduction Il est bien connu qu’à côté de du...
...,...,...
221,0.18.323-0.18.339,"3.4 x sˇe sˇe 3 «faire N1 en N2», «transformer..."
222,0.18.340-0.18.478,"V. aussi 5.176 äab2/ äabx-sˇe 3 ak, 5.504 uäsˇ..."
223,0.18.479-0.18.566,3.6 V. 5.121 gi16-sa(asˇ/esˇ gi16-sa(asˇ/esˇ 2...
224,0.18.567-0.18.659,6.1 et 6.3 V. 5.186 i 3(-gˆ esˇ/nun/sˇaäa/ u d...


In [None]:
# Download and load the French spaCy model
!python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

# Parse every sentence, then collect one row of attributes per token
docs = [nlp(sentence) for sentence in df_sentences_strict['Sentence']]
rows = []
for doc in docs:
    for token in doc:
        rows.append({
            'Text': token.text,
            'Lemma': token.lemma_,
            'POS': token.pos_,
            'Tag': token.tag_,
            'Dep': token.dep_,
            'Shape': token.shape_,
            'Is_Alpha': token.is_alpha,
            'Is_Stop': token.is_stop
        })

# Note: this overwrites the word-level `df` with the token-level table
df = pd.DataFrame(rows, columns=['Text', 'Lemma', 'POS', 'Tag', 'Dep', 'Shape', 'Is_Alpha', 'Is_Stop'])
df

Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


Unnamed: 0,Text,Lemma,POS,Tag,Dep,Shape,Is_Alpha,Is_Stop
0,46,46,NUM,NUM,nummod,dd,False,False
1,Pascal,pascal,NOUN,NOUN,ROOT,Xxxxx,True,False
2,Attinger,Attinger,X,X,nummod,Xxxxx,True,False
3,tablets,tablet,NOUN,NOUN,ROOT,xxxx,True,False
4,in,in,X,X,amod,xx,True,False
...,...,...,...,...,...,...,...,...
20812,an,an,NOUN,NOUN,nmod,xx,True,False
20813,-,-,NOUN,NOUN,flat:name,-,False,False
20814,ake,ake,NOUN,NOUN,flat:name,xxx,True,False
20815,.,.,PUNCT,PUNCT,punct,.,False,False


In [None]:
# df.to_csv('/content/drive/MyDrive/Tina/catalog_words.csv')

In [None]:
!pip install nltk
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
from nltk.tokenize import word_tokenize



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:

text = "J’ai relevé encore au hasard de mes lectures: a 2 kala-ga «côtés solides (d’un panier)» (FI 21), ad «poutre, radeau» (passim; pour quelques références, confer A/III 6, 1.2), ad-t ab ur-ra «collier/laisse de chien» (MVN 1, 225:3), u r u d aalan «statue de cuivre» (Isˇme-Dagan tablets in the collections of the Oriental Institute, Univ. of Chicago 300), d a p u 2 «bord de puits» (InSˇuk. [9]4 // [264] ± // 142), d a tur.tur «…» (SP 3.87) e 2 u d u «bergerie» (UTAMI 4, 2397:8), g al z a b a r un grand récipient en bronze (MVN 14, 443:2; ou «traiter»?), g eg i l i m s u rr rr a littéralt «poteaux(?) de roseaux tordus» (Rochester (Rochester 174:9), g u «fil» (SNATBM 260 iii 24), g u kilib kilib «balle» (TIM 6, 1 vi 4´), g eg u r-d a un panier de roseau (Nabnitu VII 133 = A/III 123 lexical (texts) 53), u r u d agur x(sˇe.kin) «faucille» (MVN 18 = AulOr.-S 11, 424:2), gˆ esˇ-a-nagˆ (…), littéralt «Wassertränkholz», d’où «(hölzerne) Grabkammer» (Sallaberger) (TIM 6, 10:5 sq.; confer C. Wilcke, Mél. Vajda 254), gˆ esˇn esˇn u 2 (…) «lit (…)» (UET 3, 772:3–5)37, ib2(.)sa «…» (Peat, Journal of Cuneiform Studies (New Haven . . . Baltimore 1947 ff.) 28, 215 n° 27:4 et Kutscher, Tel-Aviv 7, 173 n° 2:3), dl a m m a dsˇu dsu dsu en «(la statue d’)un génie de Sˇ.» (Owen, ASJ 19, 161/217 n° 50:2), m a 2 «bateau» (Jean, SˇA pl. 75:11´), m a 2(-)da-g a «…» (CST 585:3), m a 2la 2la 2(a 2(a ) «radeau» (AnOr. 1, 62:7 et UTAMI 3, 1911:2), g e muru muru 12 une natte de roseau (SP 14.46; confer B. Alster, Nouvelles Assyriologiques Brèves et Utilitaires (Paris 1987 ff.); cf. Mémoires de NABU (1992 ff.), Cahiers de NABU (1990 ff.) 1999/88), n i gˆ 2-tab «Erhitzer «Erhitzer (= Ofen‹platte›?)» (GiEn. 7; trad. de Sallaberger, W. Sallaberger, Der babylonische Töpfer und seine Gefäse 105), p is agˆ im s ar «corbeille «corbeille pour tablettes écrites» (MVN 18 = AulOr.-S 11, 188:3 et Yale Oriental Series, Babylonian Texts (New Haven 1915 ff.) 
4, 168:3), sa 2d 2d u 8 «…» (passim), su k u 5-kesˇe 2 = tiqnu, une parure (pour le cou/la tête) (Kramer, AnSt. 30, 7 sq. = Cuneiform Texts from Babylonian Tablets in the British Museum (London 1896 ff.) 58, 42 l. 79), sˇa 3t 3t u k u 5 «matelas(?)»38 (CST 606:2), u r u d asˇen un chaudron en cuivre (MVN 18, 407 reverse; revised 1), sˇu n n ir «emblème» (Inana D 201)39, z a 3d 3d u 8 «jambages 37 tablets in the collections of the Oriental Institute, Univ. of Chicago distinguer de m u nu nu 2 (…) ak (ES) «préparer un lit (…)» (Inana G 50)."
# Baseline: NLTK's pre-trained Punkt tokenizer
sentences = sent_tokenize(text)

nltk.download('punkt')
nltk.download('webtext')

def custom_sent_tokenizer(text):
    # Treat a double hyphen as a sentence boundary
    text = text.replace("--", ". ")

    # Train a Punkt tokenizer on the (English) 'overheard.txt' sample from webtext
    custom_text = webtext.raw('overheard.txt')
    tokenizer = PunktSentenceTokenizer(custom_text)

    sentences = tokenizer.tokenize(text)
    return [sentence.strip() for sentence in sentences]

sentences = custom_sent_tokenizer(text)

for idx, sentence in enumerate(sentences, 1):
    print(f"Sentence {idx}: {sentence}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.


Sentence 1: J’ai relevé encore au hasard de mes lectures: a 2 kala-ga «côtés solides (d’un panier)» (FI 21), ad «poutre, radeau» (passim; pour quelques références, confer A/III 6, 1.2), ad-t ab ur-ra «collier/laisse de chien» (MVN 1, 225:3), u r u d aalan «statue de cuivre» (Isˇme-Dagan tablets in the collections of the Oriental Institute, Univ.
Sentence 2: of Chicago 300), d a p u 2 «bord de puits» (InSˇuk.
Sentence 3: [9]4 // [264] ± // 142), d a tur.tur «…» (SP 3.87) e 2 u d u «bergerie» (UTAMI 4, 2397:8), g al z a b a r un grand récipient en bronze (MVN 14, 443:2; ou «traiter»?
Sentence 4: ), g eg i l i m s u rr rr a littéralt «poteaux(?)
Sentence 5: de roseaux tordus» (Rochester (Rochester 174:9), g u «fil» (SNATBM 260 iii 24), g u kilib kilib «balle» (TIM 6, 1 vi 4´), g eg u r-d a un panier de roseau (Nabnitu VII 133 = A/III 123 lexical (texts) 53), u r u d agur x(sˇe.kin) «faucille» (MVN 18 = AulOr.-S 11, 424:2), gˆ esˇ-a-nagˆ (…), littéralt «Wassertränkholz», d’où «(hölzerne) G

In [None]:
python_path = '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/NER/geography/OpenNMT/bin/'
onmt_path = '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/NER/geography/OpenNMT/lib/python3.9/site-packages/onmt/bin'

In [None]:
!pip install spacy
import spacy
!python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

# Your sentence
sentence = "Le détail m’échappe, mais il est clair que /k/ est particulièrement fréquent après voyelle et/ou devant certains suffixes (ou certaines séquences suffixales), qui ont dû entraîner un déplacement de l’accent, par ex. {e d}, peut-être le {a} de l’impératif 6 (cf. 5 Pour de rares indices pouvant éventuellement plaider en ce sens, confer infra 3.1.1 Ur III écon./jur. sub voce -b a a a et n n aa aa ; une analyse {b/n + ak} est à mon sens plus probable que {b/n + ak + <a}, mais il est vrai que je ne puis justifier la graphie pleine consonant a-a au lieu de -C a . En ce qui concerne la distinction faite, avant l’ép. pB, entre -ak-esˇ 2 (3e pl. ä.) et -ak-ke 4 (3e sg. m.), il convient de noter que ce phénomène n’est pas limité à ak, et n’implique donc pas que la Bä. et la Bm. étaient phonétiquement distinctes. 6 En faveur d’une accentuation finale de l’impératif, confer les formes intransitives B-b i/ n i telle ge 4-b i «Retourne!», etc. (ELS 265 et 299)."

# Process the sentence
doc = nlp(sentence)

# Dependency parsing
for token in doc:
    print(f"{token.text:{12}} {token.dep_:{10}} {token.head.text:{12}}")


2023-11-16 07:16:26.304215: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-16 07:16:26.304315: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-16 07:16:26.304434: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Collecting fr-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.6.0/fr_core_news_sm-3.6.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully in

In [None]:
page = pd.read_csv(f'{workdir}/0_Attinger - A propos de AK «faire» (I) (ZA 95, 2005) 46-64/page.csv')
page_text = page["text"]

# Run the spaCy pipeline on the text of each page
page_trans = [nlp(sentence) for sentence in page_text]
page['trans'] = page_trans
page_trans

[46
 Pascal Attinger
 A propos de AK «faire»1 (I)
 par Pascal Attinger – Berne
 Ces pages sont consacrées à une étude de ak «faire». Après une discussion du para-
 digme (2) et des graphies (non-)standard et gloses de lecture de ak (3), le lecteur trouvera
 quelques ajouts et corrections au PSD A/III 70–76, ak 1–7 (4) et une liste des «compo-
 sés» de ak et des expressions idiomatiques formées avec ak (5). Des remarques de détail
 touchant les autres lexèmes inclus dans le PSD A/III (pp. 1–69 et 131–217) closent l’ar-
 ticle (6).
 1. Introduction
 Il est bien connu qu’à côté de du11-g/e/di-d «dire», ak «faire» est
 le verbe le plus productif de la langue sumérienne, et il n’est en consé-
 quence pas étonnant qu’un quart du PSD A/III (1998) lui soit consacré
 (pp. 70–131). Il va sans dire que ces pages rendront d’inestimables servi-
 ces tant pour le travail quotidien que comme base d’une recherche sys-
 tématique sur ak, elles sont toutefois entachées de deux défauts qui en
 rendent l’

In [None]:
# page_trans[0] is already a spaCy Doc, so it can be used directly
doc = page_trans[0]

# Sentence segmentation
for i, sentence in enumerate(doc.sents, 1):
    print(i, sentence.text)



1 46
Pascal Attinger
A propos de AK «faire»1 (I)
par Pascal Attinger
2 – Berne
Ces pages sont consacrées à une étude de ak «faire».
3 Après une discussion du para-
digme (2) et des graphies (non-)standard et gloses de lecture de ak (3), le lecteur trouvera
quelques ajouts et corrections au PSD A/III 70–76, ak 1–7 (4) et une liste des «compo-
sés» de ak et des expressions idiomatiques formées avec ak (5).
4 Des remarques de détail
touchant les autres lexèmes inclus dans le PSD A/III (pp. 1–69 et 131–217)
5 closent l’ar-

6 ticle (6).
1.
7 Introduction

8 Il est bien connu qu’à côté de du11-g/e/di-d «dire», ak «faire» est
le verbe le plus productif de la langue sumérienne, et il n’est en consé-
quence pas étonnant qu’un quart du PSD A/III (1998) lui soit consacré
(pp. 70–131).
9 Il va sans dire que ces pages rendront d’inestimables servi-
ces tant pour le travail quotidien que comme base d’une recherche sys-
tématique sur ak, elles sont toutefois entachées de deux défauts qui en
rendent 

## 8 Save & export
The final step is to save the sentence data frame as a CSV file, 'sentences_strict.csv', and copy it to the shared Drive folder.

In [None]:
df_sentences_strict.to_csv('sentences_strict.csv', index=False) ##change the name of your csv if you want
!cp sentences_strict.csv '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/people/Tina' #Adam