# DNBLab Jupyter Notebook Tutorial

## OCR: Data Retrieval, Delivery, and Text Analysis

Draft: This DNBLab tutorial walks through a sample query for digitized tables of contents via the SRU interface. It covers an example query, temporarily saving the tables of contents as text files, and searching the full texts for an arbitrary keyword. In the Jupyter Notebook environment, the documented code can be run and modified directly.

## Setting Up the Working Environment <a class="anchor" id="Teil1"></a>

To set up the working environment for the following steps, first import the required Python libraries. Requests to the SRU interface are sent with requests, the XML responses are parsed with BeautifulSoup https://www.crummy.com/software/BeautifulSoup/, and individual records are processed with etree https://docs.python.org/3/library/xml.etree.elementtree.html (the code imports the lxml implementation, whose API largely mirrors the standard library module). Pandas https://pandas.pydata.org/ is used to present the elements read from the MARC21 records as a table.

In [1]:
import requests
from bs4 import BeautifulSoup as soup
import unicodedata
from lxml import etree
import pandas as pd

## SRU Query with MARC21-xml Output<a class="anchor" id="Teil2"></a>

The function dnb_sru takes the "query" parameter of the SRU request and returns all results as a list of records. Each request returns at most 100 records; if there are more, the following records are retrieved with "&startRecord=101", "&startRecord=201", and so on (possible values: 1 to 99,000). Further information on the SRU interface and its functions is available at https://www.dnb.de/sru.

In [3]:
def dnb_sru(query):
    """Run an SRU query and return all matching records as a list."""
    
    base_url = "https://services.dnb.de/sru/dnb"
    params = {'recordSchema' : 'MARC21-xml',
          'operation': 'searchRetrieve',
          'version': '1.1',
          'maximumRecords': '100',
          'query': query
         }
    
    # First request: records 1 to 100
    r = requests.get(base_url, params=params)
    xml = soup(r.content)
    records = xml.find_all('record', {'type':'Bibliographic'})
    
    # A full page of 100 records means more may follow, so keep requesting
    # the next page until one comes back with fewer than 100 records
    num_results = len(records)
    i = 101
    while num_results == 100:
        
        params.update({'startRecord': i})
        r = requests.get(base_url, params=params)
        xml = soup(r.content)
        new_records = xml.find_all('record', {'type':'Bibliographic'})
        records+=new_records
        i+=100
        num_results = len(new_records)
        
    return records
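The paging scheme used by dnb_sru can be tried out without contacting the server. The following sketch only builds the request URLs for a given number of hits. The helper name `sru_page_urls` and the hit count passed to it are illustrative assumptions and not part of the DNB interface; unlike dnb_sru, it also states `startRecord=1` explicitly for the first page.

```python
from urllib.parse import urlencode

BASE_URL = "https://services.dnb.de/sru/dnb"


def sru_page_urls(query, total_hits, page_size=100):
    """Return one request URL per result page (hypothetical helper).

    Pages start at record 1, 101, 201, ... like the startRecord
    values used in dnb_sru above.
    """
    urls = []
    for start in range(1, total_hits + 1, page_size):
        params = {
            "recordSchema": "MARC21-xml",
            "operation": "searchRetrieve",
            "version": "1.1",
            "maximumRecords": str(page_size),
            "startRecord": str(start),
            "query": query,
        }
        # urlencode percent-encodes the query, e.g. "=" becomes "%3D"
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = sru_page_urls("inh=Nähmaschine", total_hits=750)
print(len(urls), "requests")
print(urls[-1])
```

Printing the URLs before sending anything makes it easy to check a query in the browser first.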

### Searching a MARC Field<a class="anchor" id="Teil3"></a>

The function parse_record takes a single record as its parameter, extracts the desired information via XPath, and returns it as a dictionary. The key-value pairs can be adapted and extended as needed. In this case, only the permalinks to the digitized tables of contents are returned under the key "link". They are read from MARC field 856, subfield u, and "/text" is appended to each link.

In [4]:
def parse_record(record):
    
    ns = {"marc":"http://www.loc.gov/MARC21/slim"}
    xml = etree.fromstring(unicodedata.normalize("NFC", str(record)))
    
    # link: MARC field 856, subfield u
    link = xml.xpath("marc:datafield[@tag = '856']/marc:subfield[@code = 'u']", namespaces=ns)
    
    try:
        link = link[0].text
    except IndexError:
        # the record has no field 856 with subfield u
        link = "unknown"
        
    meta_dict = {"link":link + '/text'}
    
    return meta_dict
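To see what the XPath expression in parse_record selects, it can help to run it against a small record by hand. The sketch below uses the standard library module `xml.etree.ElementTree` instead of lxml so that it runs without extra packages; for a simple path like this one the syntax is the same. The sample record is hand-written, its URL is a placeholder, and the helper `first_subfield` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample record; the URL is a placeholder, not a real permalink
SAMPLE_RECORD = """
<record xmlns="http://www.loc.gov/MARC21/slim" type="Bibliographic">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Sample title</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="2">
    <subfield code="u">https://example.org/123456789/04</subfield>
  </datafield>
</record>
"""


def first_subfield(record_xml, tag, code):
    """Return the text of the first matching subfield, or None."""
    root = ET.fromstring(record_xml.strip())
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = root.find(path, NS)
    return node.text if node is not None else None


print(first_subfield(SAMPLE_RECORD, "856", "u"))  # the placeholder URL
print(first_subfield(SAMPLE_RECORD, "100", "a"))  # None, field 100 is absent
```

Returning `None` for a missing field, rather than a placeholder string, lets the caller decide how to handle records without a link.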

The SRU query "dnb_sru" can be restricted via CQL https://www.dnb.de/DE/Service/Hilfe/Katalog/kataloghilfeExpertensuche.html using the various indices listed at https://services.dnb.de/sru/dnb?operation=explain&version=1.1. The following code restricts the query to the keyword "Nähmaschine" in the full-text index of the digitized tables of contents (inh). Changing the SRU query changes the result list accordingly.

In [5]:
records = dnb_sru('inh=Nähmaschine')
print(len(records), 'Ergebnisse')

750 Ergebnisse


## Sample Display for Further Processing <a class="anchor" id="Teil4"></a>

Using the Pandas library for Python, the result (the dictionary element "link") is displayed as a DataFrame.

In [6]:
output = [parse_record(record) for record in records]
df = pd.DataFrame(output)
df

| | link |
|---|---|
| 0 | http://deposit.dnb.de/cgi-bin/dokserv?id=043b0... |
| 1 | http://deposit.dnb.de/cgi-bin/dokserv?id=f6d16... |
| 2 | http://deposit.dnb.de/cgi-bin/dokserv?id=1b6e2... |
| 3 | http://deposit.dnb.de/cgi-bin/dokserv?id=1f524... |
| 4 | http://deposit.dnb.de/cgi-bin/dokserv?id=f4e7e... |
| ... | ... |
| 745 | https://d-nb.info/363621814/04/text |
| 746 | https://d-nb.info/361208944/04/text |
| 747 | https://d-nb.info/362024774/04/text |
| 748 | https://d-nb.info/580171256/04/text |
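The result list mixes two kinds of links: older deposit.dnb.de addresses and d-nb.info permalinks. Judging from the download log further down, the d-nb.info links are delivered as text/plain while the deposit.dnb.de links come back as text/html, so it can be useful to separate them. The following is a minimal sketch using only the standard library. The sample links are taken from this notebook's outputs, and the helper names `count_hosts` and `plain_text_links` are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Sample links taken from this notebook's outputs
links = [
    "http://deposit.dnb.de/cgi-bin/dokserv?id=043b027d59f04e619522dde80bd12406&prov=M&dok_var=1&dok_ext=htm/text",
    "https://d-nb.info/363621814/04/text",
    "https://d-nb.info/361208944/04/text",
]


def count_hosts(urls):
    """Count how many links point to each host name."""
    return Counter(urlparse(url).netloc for url in urls)


def plain_text_links(urls, host="d-nb.info"):
    """Keep only the links served by the given host."""
    return [url for url in urls if urlparse(url).netloc == host]


hosts = count_hosts(links)
for host, count in hosts.items():
    print(host, count)
print(plain_text_links(links))
```

In the notebook, the same helpers could be called with `df["link"]` instead of the sample list.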


Depending on your needs, the links can be output using various functions:

In [7]:
#print(df.to_string(index=False))
#from IPython.display import HTML
#HTML(df.to_html(index=False))
#document = df.to_dict(orient='list')
#print(document)

The function df.to_csv() saves the results as "links.csv" in the Jupyter directory of the start page, where the file can be downloaded.

In [8]:
df.to_csv("links.csv", index=False)
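`wget -i` expects a plain list of URLs, one per line, so a header row in the input file is treated as a URL as well. The following minimal sketch writes such a list with the standard library. The file name `links.txt` and the helper `write_url_list` are assumptions for illustration; the helper also skips entries that are not URLs, such as the "unknown/text" placeholder that parse_record produces for records without a link.

```python
import os
import tempfile


def write_url_list(links, path):
    """Write one URL per line, without a header row (hypothetical helper).

    Entries that do not look like URLs, such as the "unknown/text"
    placeholder produced by parse_record, are skipped.
    """
    urls = [link for link in links if link.startswith(("http://", "https://"))]
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
    return len(urls)


sample = [
    "https://d-nb.info/363621814/04/text",
    "unknown/text",
    "https://d-nb.info/361208944/04/text",
]
# A temporary directory keeps the demonstration from touching real files
target = os.path.join(tempfile.mkdtemp(), "links.txt")
written = write_url_list(sample, target)
print(written, "URLs written")
```

In the notebook, `write_url_list(df["link"], "links.txt")` would be the corresponding call. With pandas alone, `df.to_csv("links.csv", index=False, header=False)` also omits the header row.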

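wget then downloads all links stored in the CSV file and saves them in the temporary Jupyter directory. Files from d-nb.info are saved as text files named text, text.1, text.2, and so on; files from deposit.dnb.de are saved under names derived from the URL (dokserv?id=...). Because links.csv begins with the column header "link", wget first tries to fetch "link" as a URL and reports an error before processing the actual links.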

In [9]:
!wget -i links.csv

--2021-08-05 08:31:49--  http://link/
Resolving link (link)... failed: No address associated with hostname.
wget: unable to resolve host address ‘link’
--2021-08-05 08:31:49--  http://deposit.dnb.de/cgi-bin/dokserv?id=043b027d59f04e619522dde80bd12406&prov=M&dok_var=1&dok_ext=htm/text
Resolving deposit.dnb.de (deposit.dnb.de)... 193.175.100.44
Connecting to deposit.dnb.de (deposit.dnb.de)|193.175.100.44|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dokserv?id=043b027d59f04e619522dde80bd12406&prov=M&dok_var=1&dok_ext=htm%2Ftext’

dokserv?id=043b027d     [ <=>                ]     910  --.-KB/s    in 0s      

2021-08-05 08:31:49 (29.7 MB/s) - ‘dokserv?id=043b027d59f04e619522dde80bd12406&prov=M&dok_var=1&dok_ext=htm%2Ftext’ saved [910]

[...]

--2021-08-05 08:31:56--  https://d-nb.info/1197007741/04/text
Connecting to d-nb.info (d-nb.info)|193.175.100.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4769 (4.7K) [text/plain]
Saving to: ‘text.11’


2021-08-05 08:31:57 (103 MB/s) - ‘text.11’ saved [4769/4769]

[...]

--2021-08-05 08:35:10--  https://d-nb.info/991241649/04/text
Reusing existing connection to d-nb.info:443.
HTTP request sent, awaiting response... 200 OK
Length: 17138 (17K) [text/plain]
Saving to: ‘text.641’


2021-08-05 08:35:10 (1.05 MB/s) - ‘text.641’ saved [17138/17138]

[...]
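Because wget derives file names from the URL, the mapping from a saved file such as text.11 back to its link is only implicit. The sketch below shows an alternative written for illustration; it is not the approach used in this notebook. It downloads each link with `urllib.request` and names the files after their position in the list, returning the mapping from file name to link. The names `download_all` and `fetch_url`, the file name pattern `toc_0000.txt`, and the injectable `fetch` argument are assumptions. The demonstration uses a stand-in for the network call so that it runs offline.

```python
import os
import tempfile
import urllib.request


def fetch_url(url):
    """Download a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(links, out_dir, fetch=fetch_url):
    """Save every link as toc_<index>.txt and return {file name: link}."""
    os.makedirs(out_dir, exist_ok=True)
    saved = {}
    for index, link in enumerate(links):
        name = "toc_%04d.txt" % index
        try:
            data = fetch(link)
        except (OSError, ValueError) as err:
            # Unreachable hosts raise URLError (an OSError subclass);
            # malformed URLs such as "unknown/text" raise ValueError
            print("skipped", link, "-", err)
            continue
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(data)
        saved[name] = link
    return saved


# Offline demonstration with a stand-in for the network call
def fake_fetch(url):
    if not url.startswith("http"):
        raise ValueError("unknown url type: %r" % url)
    return ("contents of " + url).encode("utf-8")


demo_dir = tempfile.mkdtemp()
saved = download_all(
    ["https://d-nb.info/1197007741/04/text", "unknown/text"],
    demo_dir,
    fetch=fake_fetch,
)
print(saved)
```

Leaving out the `fetch` argument makes the function use the real download; the returned dictionary can then be used to look up the link behind every search hit.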

The downloaded text files can now be searched for a search term, e.g. search = "Handarbeit". The search is case-sensitive.
Hits are reported with their line number and file. The file names correspond to the text files downloaded to the directory (text, text.1, text.2, etc.).
The search term can be changed freely; re-running the code repeats the search with the new term.

In [None]:
search = 'Handarbeit'

# Search the first two downloaded files; extend the list as needed
for filename in ['text', 'text.1']:
    found = False
    with open(filename) as f:
        for num, line in enumerate(f, 1):
            if search in line:
                found = True
                print('%s - found at line in %s:' % (search, filename), num)
    # Report a miss once per file instead of once per non-matching line
    if not found:
        print('No result found in %s' % filename)
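
Since the search above is case-sensitive, a term such as "Wild" does not match "wilden". The following is a minimal sketch of a case-insensitive variant, not part of the original tutorial. The helper name `find_term` and the sample lines are assumptions made for illustration; in the notebook, the lines would come from one of the downloaded files.

```python
def find_term(lines, term, ignore_case=False):
    """Return the 1-based numbers of all lines that contain term.

    Hypothetical helper for illustration; not part of the tutorial code.
    """
    if ignore_case:
        term = term.casefold()
    hits = []
    for num, line in enumerate(lines, 1):
        # Compare against a case-folded copy of the line if requested
        haystack = line.casefold() if ignore_case else line
        if term in haystack:
            hits.append(num)
    return hits


# Invented sample lines standing in for the content of a downloaded file
sample = ["Wilde Tiere", "Handarbeit und Technik", "Die wilden Jahre"]
print(find_term(sample, "Wild"))
print(find_term(sample, "Wild", ignore_case=True))
```

To apply the helper to a real file, the open file object can be passed directly, e.g. `find_term(open('text'), 'Wild', ignore_case=True)`, because iterating over a file yields its lines.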

In [None]:
# Import os module
import os

# Ask the user for the directory, file type and search string
search_path = input("Enter directory path to search : ")
file_type = input("File Type : ")
search_str = input("Enter the search string : ")

# If path does not exist, set search path to current directory
if not os.path.exists(search_path):
    search_path = "."

# Repeat for each file in the directory
for fname in os.listdir(path=search_path):

    # Apply file type filter
    if fname.endswith(file_type):

        # Open file for reading; it is closed automatically after the block
        with open(os.path.join(search_path, fname)) as fo:

            # Loop over all lines, counting line numbers from 1
            for line_no, line in enumerate(fo, 1):

                # Search for string in line
                index = line.find(search_str)
                if index != -1:
                    print(fname, "[", line_no, ",", index, "] ", line, sep="")

## Analyzing OCR files

## Searching an OCR file for search terms

Analyzing OCR files requires the library "re".

In [1]:
import re

In this tutorial we use an OCR file from DNBLab, which is stored in the (Binder) directory. To make the file searchable, the text of the document is read into a string. To make searching easier, all characters of the document are converted to lowercase.

In [None]:
# Open the file; 'r' opens it in read mode
with open('102655246X_OCR.txt', 'r') as document_text:

    # Read the file contents into a string
    text_string = document_text.read()

# Convert the text to lowercase
text_string1 = text_string.lower()

For further processing, a list is created that holds the contents of the OCR file line by line.

In [None]:
lines = []  # Create an empty list
for line in text_string1.splitlines():
    lines.append(line)  # Append each line to the list

To look for a specific term, we use re.search to find the position of the word in the document. Because the text was converted to lowercase, the search term must be lowercase as well.

In [None]:
for element in lines:
    a = re.search('file', element)
    if a: 
        print(a)
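
The match object returned by re.search carries the position of the hit within the line. The following minimal sketch shows how line number and column could be collected together. The helper name `find_positions` and the sample lines are assumptions made for illustration; `re.escape` is used so that the term is treated as literal text rather than as a pattern.

```python
import re


def find_positions(lines, term):
    """Return (line number, start column, end column) of the first match per line.

    Hypothetical helper for illustration; not part of the tutorial code.
    """
    positions = []
    for num, line in enumerate(lines, 1):
        # re.escape makes sure special characters in term are matched literally
        match = re.search(re.escape(term), line)
        if match:
            positions.append((num, match.start(), match.end()))
    return positions


# Invented sample lines standing in for the lowercase lines of the OCR file
sample_lines = ["erste zeile", "eine file-angabe", "noch ein file"]
for num, start, end in find_positions(sample_lines, "file"):
    print("line", num, "columns", start, "to", end)
```

Line numbers are counted from 1, while the columns follow Python's 0-based string indexing.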

Alternatively, the search approach already used for the tables of contents can be applied. Here the file is read directly without lowercasing, so the search is case-sensitive.

In [None]:
search = 'krieg'

filename = '102655246X_OCR.txt'
with open(filename) as f:
    for num, line in enumerate(f, 1):
        if search in line:
            print('%s - found at line in text:' % search, num)

## Frequency analysis with regular expressions

After importing the library re, opening the OCR file, and converting it to a lowercase string, a regular expression has to be defined in the code.

In [31]:
import re 

# Open the file; 'r' opens it in read mode
with open('102655246X_OCR.txt', 'r') as document_text:

    # Read the file contents into a string
    text_string = document_text.read()

# Convert the text to lowercase
text_string1 = text_string.lower()

# Note: the part on regular expressions, with explanations of the individual steps, still needs to go here
# Placeholder: re.match only looks for the literal 'document_text' at the very start of the text,
# so the result is None unless the file begins with that string
a = re.match('document_text', text_string)

print(a)



None
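
One possible way to fill in the regular-expression step is sketched below; it is an illustration under stated assumptions, not the tutorial's final code. The pattern (lowercase words of 3 to 15 letters, extended here by the German letters ä, ö, ü and ß), the helper name `word_frequencies` and the sample text are all assumptions. `collections.Counter` from the standard library does the counting.

```python
import re
from collections import Counter


def word_frequencies(text, min_len=3, max_len=15):
    """Count words of min_len to max_len letters in the lowercased text.

    Hypothetical helper for illustration; not part of the tutorial code.
    """
    # \b marks a word boundary; the character class includes German letters (assumption)
    pattern = r'\b[a-zäöüß]{%d,%d}\b' % (min_len, max_len)
    words = re.findall(pattern, text.lower())
    return Counter(words)


# Invented sample text standing in for the content of the OCR file
sample_text = "Der Krieg und der Frieden. Krieg, Frieden und Handarbeit."
frequencies = word_frequencies(sample_text)
for word, count in frequencies.most_common(3):
    print(word, count)
```

For the OCR file itself, the string read from the file would be passed in place of `sample_text`.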


## Printing the OCR text

For further processing, the content of the OCR file can also be printed directly in the notebook. It is enough to read the content of the OCR file into a list and print it.

In [None]:
# OCR output: still to be tested

# Read the OCR file once; 'r' opens it in read mode
with open('102655246X_OCR.txt', 'r') as tf:
    text_string = tf.read()

# Lowercase copy for later searches
text_string_lower = text_string.lower()

# Split the original text into a list of lines and print it
lines = text_string.split('\n')
print(lines)



## Code snippets for regular expressions start here

In [16]:
# Find all lowercase words of 3 to 15 letters in the lowercased text
# (text_string1 comes from the cells above); re.findall takes the pattern first, then the string
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string1)

In [None]:
# Count how often each matched word occurs
frequency = {}
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

In [None]:
frequency_list = frequency.keys()

In [None]:
# Print each word together with its frequency
for words in frequency_list:
    print(words, frequency[words])

In [None]:
# Ideas:
#     A search query that finds all regular expressions
#     A search query for a specific term