This notebook provides a regular-expression-based approach to searching for letter openings in a literary text corpus. The first part of the notebook covers the preparation of the corpus. The second part extracts letter openings by looking for expressions that typically occur at the beginning of German-language letters. These expressions are collected in a list of patterns, and the algorithm loops over that list to check whether each pattern matches anywhere in the corpus texts. The list can be modified. Please note that overly generic letter openings such as r'Mein[e]' cannot be processed usefully, because they produce too many matches in the corpus texts (most of them false positives).

In [23]:
# Import 
import os
import pandas as pd
import regex as re
from pathlib import Path
from collections import Counter
import csv



## Read in the corpus

In the following, we read in the corpus. Here, we only take a subset of the corpus to shorten the processing time.

In [2]:
# Generate a corpus by loading all the txt files from the chosen directory 
# and list the names of the first 10 txt files 
corpus = os.listdir('d-prose_1870-1920_V.2.0')
corpus[:10]

['von_Wolzogen_Ernst_Vom_Peperl_und von andern_Raritaeten_Der_Raritaetenliabhaber.txt',
 'Ernst_Otto_Satiren_Die_Neunte_in_Kluetenbuettel.txt',
 'Anzengruber_Ludwig_Kalendergeschichten_Treff-Ass.txt',
 'Fontane_Theodor_Cecile.txt',
 'Scheerbart_Paul_Das_grosse_Licht_Die_Perlmutterstadt.txt',
 'Hollaender_Felix_Die_Briefe_des_Fraeulein.txt',
 'Heyse_Paul_Gegen_den_Strom.txt',
 'Federer_Heinrich_Umbrische_Reisegeschichtlein_San_Benedettos_Dornen.txt',
 'Groller_Balduin_Detektiv_Dagobert_Eine_teure_Depesche.txt',
 'von_Zobeltitz_Fedor_Das_Heiratsjahr.txt']

With the next cell, we ask for the number of corpus texts in the chosen subset. 

In [3]:
# Print how many txt files are in the corpus
corpus_length = len(corpus)
print(corpus_length)

2511


## Convert the corpus to a dataframe 

We then create an empty dictionary and add the file name and the text of the document as columns to build a dataframe of two columns.

In [7]:
# Create an empty dictionary for preparation of the conversion of the txt-file-corpus to a data frame
empty_dictionary = {}

# Loop through the folder of documents to open and read each one
for document in corpus:
    with open('d-prose_1870-1920_V.2.0/' + document, 'r', encoding = 'utf-8') as to_open:
        empty_dictionary[document] = to_open.read()

# Populate the data frame with two columns: file name and document text
d_prose_texts = (pd.DataFrame.from_dict(empty_dictionary, 
                                       orient = 'index')
                .reset_index().rename(index = str, 
                                      columns = {'index': 'file_name', 0: 'document_text'}))
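
As a minimal sketch of an alternative to `os.listdir`, the loading step could also be written with `pathlib`, which makes it easy to pick up only files ending in `.txt` and to read them in a reproducible order. The helper name `read_corpus` and the temporary demo directory are assumptions made for illustration; they are not part of the notebook.

```python
import tempfile
from pathlib import Path


def read_corpus(directory):
    """Return a dict mapping file name to text for every .txt file in `directory`."""
    texts = {}
    # glob('*.txt') skips other directory entries; sorted() makes the order reproducible
    for path in sorted(Path(directory).glob('*.txt')):
        texts[path.name] = path.read_text(encoding='utf-8')
    return texts


# Demo on a throwaway directory (hypothetical file names and contents)
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, 'b_text.txt').write_text('Titel B\n\nInhalt', encoding='utf-8')
    Path(demo_dir, 'a_text.txt').write_text('Titel A\n\nInhalt', encoding='utf-8')
    Path(demo_dir, 'notes.md').write_text('not part of the corpus', encoding='utf-8')
    demo_texts = read_corpus(demo_dir)

print(list(demo_texts))
```

The resulting dictionary has the same shape as `empty_dictionary`, so it could be passed to `pd.DataFrame.from_dict(..., orient = 'index')` in the same way.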

In the next cell, we verify the content of the first 10 lines of the dataframe.

In [8]:
# show the first 10 lines of the data frame
d_prose_texts[:10]

Unnamed: 0,file_name,document_text
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,"Der Raritätenliabhaber\n\n»Ja, grüaß Eahna Got..."
1,Ernst_Otto_Satiren_Die_Neunte_in_Kluetenbuette...,Die Neunte in Klütenbüttel\n\nIn Klütenbüttel ...
2,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,"Treff-Aß\n\nGibt es ein Buch des Schicksals, s..."
3,Fontane_Theodor_Cecile.txt,Cécile\n\nErstes Kapitel\n\n»Thale. Zweiter…«\...
4,Scheerbart_Paul_Das_grosse_Licht_Die_Perlmutte...,Die Perlmutterstadt\n\nBekanntlich weilt der u...
5,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,"Die Briefe des Fräulein Brandt\n\nIserbaude, 7..."
6,Heyse_Paul_Gegen_den_Strom.txt,Gegen den Strom\n\nErstes Kapitel.\n\nEs war z...
7,Federer_Heinrich_Umbrische_Reisegeschichtlein_...,San Benedettos Dornen und San Francescos Rosen...
8,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,Eine teure Depesche\n\nSie saßen wieder zu dri...
9,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,Das Heiratsjahr.\n\nErstes Kapitel.\n\nIn welc...


In the next cell, we extract the title of each text by taking the first line, which is followed by two line breaks. This is possible because we know the structure of the texts, which were manually prepared following that schema.

In [9]:
#extract the title of the text as a further column with metadata


# Define the regular expression pattern to extract the title followed by double line break \n\n
pattern = r'^(.*?)\n\n'

# Extract the first line and create a new 'title' column
d_prose_texts['title'] = d_prose_texts['document_text'].str.extract(pattern, flags=re.DOTALL)

# Print the DataFrame to see the results
d_prose_texts[:10]


Unnamed: 0,file_name,document_text,title
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,"Der Raritätenliabhaber\n\n»Ja, grüaß Eahna Got...",Der Raritätenliabhaber
1,Ernst_Otto_Satiren_Die_Neunte_in_Kluetenbuette...,Die Neunte in Klütenbüttel\n\nIn Klütenbüttel ...,Die Neunte in Klütenbüttel
2,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,"Treff-Aß\n\nGibt es ein Buch des Schicksals, s...",Treff-Aß
3,Fontane_Theodor_Cecile.txt,Cécile\n\nErstes Kapitel\n\n»Thale. Zweiter…«\...,Cécile
4,Scheerbart_Paul_Das_grosse_Licht_Die_Perlmutte...,Die Perlmutterstadt\n\nBekanntlich weilt der u...,Die Perlmutterstadt
5,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,"Die Briefe des Fräulein Brandt\n\nIserbaude, 7...",Die Briefe des Fräulein Brandt
6,Heyse_Paul_Gegen_den_Strom.txt,Gegen den Strom\n\nErstes Kapitel.\n\nEs war z...,Gegen den Strom
7,Federer_Heinrich_Umbrische_Reisegeschichtlein_...,San Benedettos Dornen und San Francescos Rosen...,San Benedettos Dornen und San Francescos Rosen
8,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,Eine teure Depesche\n\nSie saßen wieder zu dri...,Eine teure Depesche
9,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,Das Heiratsjahr.\n\nErstes Kapitel.\n\nIn welc...,Das Heiratsjahr.
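
The behaviour of the title pattern can be checked on a few short strings. This is a minimal sketch that uses the standard-library `re` module instead of `regex` (an assumption; for this pattern both should behave the same). The helper name `extract_title` is hypothetical.

```python
import re

# Same pattern as in the cell above: everything up to the first blank line
pattern = r'^(.*?)\n\n'


def extract_title(text):
    """Return the text before the first double line break, or None if there is none."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


title_simple = extract_title('Cécile\n\nErstes Kapitel\n\n»Thale. Zweiter…«')
title_two_lines = extract_title('Ein Titel\nüber zwei Zeilen\n\nText')
title_missing = extract_title('Text ohne Leerzeile')

print(repr(title_simple), repr(title_two_lines), repr(title_missing))
```

Because of `re.DOTALL`, a title that spans two lines is captured together with its single line break, and a text without any blank line yields no title at all.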


The document text still contains \n\n sequences; these are line breaks. They are what allows a regular expression to extract the first line as the title of the text. The file name also contains the name of the author, but that is not as easy to extract. It is better to take the author from the metadata by matching the file name against the file name recorded there.
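
A lookup of the author by file name could be sketched as follows. The metadata table is invented for illustration: the column names `filename` and `author` and the in-memory CSV are assumptions, and the real metadata of the corpus may be organised differently.

```python
import csv
import io

# Hypothetical metadata table; column names 'filename' and 'author' are assumptions
metadata_csv = io.StringIO(
    'filename,author\n'
    'Fontane_Theodor_Cecile.txt,Theodor Fontane\n'
    'Heyse_Paul_Gegen_den_Strom.txt,Paul Heyse\n'
)

# Build a lookup table from file name to author
author_by_file = {row['filename']: row['author'] for row in csv.DictReader(metadata_csv)}

# File names as they appear in the corpus directory
corpus_files = ['Fontane_Theodor_Cecile.txt', 'May_Karl_Das_Hamail.txt']

# .get() returns None for files that have no metadata entry
authors = {name: author_by_file.get(name) for name in corpus_files}
print(authors)
```

Files without a metadata entry end up with `None`, which makes missing or misspelled file names easy to spot.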

In the next cell, we do some basic text cleaning steps.

## Preprocessing

In [10]:
# Create a new column
# Use regular expressions to clean the plain text and store the result in a new column,
# so that the original version is kept as a separate layer
d_prose_texts['clean_text'] = d_prose_texts['document_text'].str.replace(r'\s+', ' ', regex=True) # collapse any run of whitespace (including line breaks) into one space
d_prose_texts['clean_text'] = d_prose_texts['clean_text'].str.replace(r'\n+', '\n', regex=True) # collapse repeated line breaks (no effect after the previous step)
d_prose_texts['clean_text'] = d_prose_texts['clean_text'].str.replace('&', 'and', regex=False) # exchange & for 'and'




In [11]:
# show the first 10 lines of the data frame
d_prose_texts[:10]

Unnamed: 0,file_name,document_text,title,clean_text
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,"Der Raritätenliabhaber\n\n»Ja, grüaß Eahna Got...",Der Raritätenliabhaber,"Der Raritätenliabhaber »Ja, grüaß Eahna Gott, ..."
1,Ernst_Otto_Satiren_Die_Neunte_in_Kluetenbuette...,Die Neunte in Klütenbüttel\n\nIn Klütenbüttel ...,Die Neunte in Klütenbüttel,Die Neunte in Klütenbüttel In Klütenbüttel sol...
2,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,"Treff-Aß\n\nGibt es ein Buch des Schicksals, s...",Treff-Aß,"Treff-Aß Gibt es ein Buch des Schicksals, so k..."
3,Fontane_Theodor_Cecile.txt,Cécile\n\nErstes Kapitel\n\n»Thale. Zweiter…«\...,Cécile,Cécile Erstes Kapitel »Thale. Zweiter…« »Letzt...
4,Scheerbart_Paul_Das_grosse_Licht_Die_Perlmutte...,Die Perlmutterstadt\n\nBekanntlich weilt der u...,Die Perlmutterstadt,Die Perlmutterstadt Bekanntlich weilt der ural...
5,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,"Die Briefe des Fräulein Brandt\n\nIserbaude, 7...",Die Briefe des Fräulein Brandt,"Die Briefe des Fräulein Brandt Iserbaude, 7. J..."
6,Heyse_Paul_Gegen_den_Strom.txt,Gegen den Strom\n\nErstes Kapitel.\n\nEs war z...,Gegen den Strom,Gegen den Strom Erstes Kapitel. Es war zu Anfa...
7,Federer_Heinrich_Umbrische_Reisegeschichtlein_...,San Benedettos Dornen und San Francescos Rosen...,San Benedettos Dornen und San Francescos Rosen,San Benedettos Dornen und San Francescos Rosen...
8,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,Eine teure Depesche\n\nSie saßen wieder zu dri...,Eine teure Depesche,Eine teure Depesche Sie saßen wieder zu dritt ...
9,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,Das Heiratsjahr.\n\nErstes Kapitel.\n\nIn welc...,Das Heiratsjahr.,Das Heiratsjahr. Erstes Kapitel. In welchem si...
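
The effect of the three cleaning steps can be checked on a short string. This is a minimal sketch in plain Python with the standard-library `re` module; the sample string is invented for illustration.

```python
import re

raw = 'Cécile\n\nErstes  Kapitel\n»Thale & Harz«'

# Step 1: collapse every run of whitespace, including line breaks, into one space
step1 = re.sub(r'\s+', ' ', raw)
# Step 2: collapse repeated line breaks; step 1 has already removed all of them
step2 = re.sub(r'\n+', '\n', step1)
# Step 3: replace the ampersand with the word 'and'
step3 = step2.replace('&', 'and')

print(step3)
```

Since `\s` also matches line breaks, the first step already removes all of them and the second step has nothing left to change. This is why the cleaned texts shown above are single lines.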


Now, we want to find out more about the texts in our corpus. 
For instance, we want to find out if there are letters in the corpus texts.

# Search for Letter Openings
The code of the next cell iterates over the corpus texts and looks for matches with the regular expressions of the list "letter_openings".

## Define the Regular Expressions to find Letter Openings

In [12]:
# List of letter openings and closings
letter_openings = [
    r'[»]?Mein[e]? [lL]iebe[rsn]?\s[A-Za-z]*[!]',
    r'[»]?Hochverehrte[rsn]?\s[A-Za-z]*[!]',
    r'[»]?Einziggeliebt[er]?\s[A-Za-z]*[.!,]?',
    r'[»]?Geehrte[rsn]?\s[A-Za-z]*[.!,]?',
    r'[»]?Sehr geehrte[srn]?\s[A-Za-z]*[.!,]?',
    r'[»]?Sehr verehrte[rs]?\s[A-Za-z]*[.!,]?',
    r'[»]?Grüss dich\s[A-Za-z]*[.!,]?',
    r'[»]?Grüß dich\s[A-Za-z]*[.!,]?',
    #r'[»]?\b(?!die\s+)Liebe\b\s[A-Za-z]*[.!,]?',
    #r'[»]?Liebe[sr]+\s[A-Za-z]*[.!,]?',
    r'»Liebe[sr]+\s[A-Z][a-z\s]*[A-Za-z!]*',
    r'»Liebste[sr]?\s[A-Za-z]*[A-Za-z!]*',
    r'[»]?Lieber Vater[!]',
    r'[»]?Liebster Vater[!]',
    r'[»]?Liebe Mutter[!]',
    r'[»]?Liebste Mutter[!]',
    r'[»]?Lieber Freund[!]',
    r'[»]?Geliebteste[r]?[!]',
    r'[»]?Geliebte[r][!]',
    r'[»]?Geliebte[!]',
    r'[»]?Einzig geliebteste[r]?[.!,]?',
    r'[»]?Einzig geliebte[r]?[.!,]?',
    r'[»]?Hochverehrter Herr[.!,]?',
    r'[»]?Hochverehrte Frau[.!,]?',
    r'[»]?Werte[rs]?\s[A-Za-z]*[!]',
    r'[»]?Mein geliebte[rs]?\s[A-Za-z]*[.!,]?',
    r'[»]?Teuerste[sr]?\s[A-Za-z]*[.!,]?',
    r'[»]?Teure[sr]?\s[A-Za-z]*[.!,]?',
    r'[»]?Mein Liebchen[.!,]?',
    r'(?:[A-ZÄÖÜa-zäöüß\s]+,\s*den\s+\d+\.\s+[A-ZÄÖÜa-zäöüß]+\s+\d{4}\.)',
    #r'\b\d{1,2}\. (Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)(?: \d{2,4})?\b'
    r'(?:[A-ZÄÖÜa-zäöüß\s]+,\s*den\s+\d+\.\s+[A-ZÄÖÜa-zäöüß]+\s+\d{4}\.)|\b\d{1,2}\. (Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)(?: \d{2,4})?\b'
    # From here on: further cases to be excluded
]
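
Before running the patterns over the whole corpus, it can help to try individual patterns on a few sample strings. The sketch below uses the first pattern of the list and shows one property worth knowing: the character class `[A-Za-z]` does not cover umlauts, so names such as "Käthe" or "Jürgen" are not matched. The variant with `[^\W\d_]` is a hypothetical alternative, not part of the notebook, and the standard-library `re` module is used here instead of `regex`.

```python
import re

# First pattern from the list above, unchanged
ascii_pattern = r'[»]?Mein[e]? [lL]iebe[rsn]?\s[A-Za-z]*[!]'
# Hypothetical variant: [^\W\d_] matches any Unicode letter, including ä, ö, ü, ß
unicode_pattern = r'[»]?Mein[e]? [lL]iebe[rsn]?\s[^\W\d_]*[!]'

samples = ['Mein lieber Karl!', 'Meine liebe Käthe!', 'Mein lieber Jürgen!']

# One boolean per sample: does the pattern match anywhere in the string?
ascii_results = [re.search(ascii_pattern, s) is not None for s in samples]
unicode_results = [re.search(unicode_pattern, s) is not None for s in samples]

for sample, ascii_hit, unicode_hit in zip(samples, ascii_results, unicode_results):
    print(sample, ascii_hit, unicode_hit)
```

Whether the broader character class is desirable depends on how many additional false positives it produces in the corpus, so it would need to be checked against the actual texts.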


## Do the matching and get the output as a dictionary

In [200]:
#This function is to only extract the letter openings matching with the regex.

# Function to extract the exact regex matches 
def extract_matches(text):
    extracted_matches = []
    for pattern in letter_openings:
        for match in re.finditer(pattern, text):
            match_text = match.group(0)  # Get the matched text
            match_start = match.start()  # Get the start index of the match
            
            # Append the detected text and the index of the first character to the list
            extracted_matches.append((match_text, match_start))
    
    return extracted_matches

# Apply the extraction function to each row of the DataFrame
d_prose_texts['extracted_matches'] = d_prose_texts['clean_text'].apply(extract_matches)

# Print the DataFrame to see the results
print(d_prose_texts[['clean_text', 'extracted_matches']])


                                            clean_text  \
0    Cécile Erstes Kapitel »Thale. Zweiter…« »Letzt...   
1    Die Briefe des Fräulein Brandt Iserbaude, 7. J...   
2    Gegen den Strom Erstes Kapitel. Es war zu Anfa...   
3    San Benedettos Dornen und San Francescos Rosen...   
4    Eine teure Depesche Sie saßen wieder zu dritt ...   
..                                                 ...   
489  Enzio Es war ein weites, bequem eingerichtetes...   
490  Dagoberts unfreiwillige Reise Andreas Grumbach...   
491  Die Liebe Gottes Wer auf fröhlichcr Sommerreis...   
492  Dort oben Ja, Kinder, das ist wohl so, das gla...   
493  Der blinde Passagier Wir hatten uns über das u...   

                                     extracted_matches  
0    [(»Lieber Pierre, 28539), (»Lieber Freund, 310...  
1    [(»Liebes Kind, 57567), (Teuerster wird, 16684...  
2    [(Geehrten nun, 349367), (»Mein geliebtes Herz...  
3                                                   []  
4                 
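
The last two patterns in the list `letter_openings` overlap: the final pattern repeats the previous one as its first alternative, so a dated line such as "Berlin, den 3. Mai 1890." is reported twice. A minimal sketch of how such duplicates could be dropped is shown below. The helper name `extract_unique_matches` and the sample sentence are assumptions for illustration, and the standard-library `re` module is used instead of `regex`.

```python
import re

# The two date patterns from the list; the second contains the first as an alternative
date_long = r'(?:[A-ZÄÖÜa-zäöüß\s]+,\s*den\s+\d+\.\s+[A-ZÄÖÜa-zäöüß]+\s+\d{4}\.)'
date_any = date_long + (
    r'|\b\d{1,2}\. (Januar|Februar|März|April|Mai|Juni|Juli|August|'
    r'September|Oktober|November|Dezember)(?: \d{2,4})?\b'
)
patterns = [date_long, date_any]


def extract_unique_matches(text, patterns):
    """Collect (matched text, start index) pairs, keeping each pair only once."""
    seen = set()
    unique = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            key = (match.group(0), match.start())
            if key not in seen:
                seen.add(key)
                unique.append(key)
    return unique


sample = 'Berlin, den 3. Mai 1890. Am 7. Juni kam die Antwort.'
raw = [(m.group(0), m.start()) for p in patterns for m in re.finditer(p, sample)]
unique = extract_unique_matches(sample, patterns)

print(len(raw), len(unique))
```

Without deduplication, the match counts of texts containing such dated lines are inflated, which matters when the counts are compared across texts.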

In [201]:
# Function to count the number of matches for each text
def count_matches(matches):
    return len(matches)

# Create a new column with the count of matches for each text
d_prose_texts['match_count'] = d_prose_texts['extracted_matches'].apply(count_matches)

# Print the DataFrame to see the results
print(d_prose_texts[['clean_text', 'extracted_matches', 'match_count']])

                                            clean_text  \
0    Cécile Erstes Kapitel »Thale. Zweiter…« »Letzt...   
1    Die Briefe des Fräulein Brandt Iserbaude, 7. J...   
2    Gegen den Strom Erstes Kapitel. Es war zu Anfa...   
3    San Benedettos Dornen und San Francescos Rosen...   
4    Eine teure Depesche Sie saßen wieder zu dritt ...   
..                                                 ...   
489  Enzio Es war ein weites, bequem eingerichtetes...   
490  Dagoberts unfreiwillige Reise Andreas Grumbach...   
491  Die Liebe Gottes Wer auf fröhlichcr Sommerreis...   
492  Dort oben Ja, Kinder, das ist wohl so, das gla...   
493  Der blinde Passagier Wir hatten uns über das u...   

                                     extracted_matches  match_count  
0    [(»Lieber Pierre, 28539), (»Lieber Freund, 310...            9  
1    [(»Liebes Kind, 57567), (Teuerster wird, 16684...           28  
2    [(Geehrten nun, 349367), (»Mein geliebtes Herz...            3  
3                      

In [13]:
# Function to extract each match together with its context (100 characters from the start of the match)

def extract_matches_and_following(text):
    extracted_matches = []
    for pattern in letter_openings:
        for match in re.finditer(pattern, text):
            match_text = match.group(0)  # Get the matched text
            match_start = match.start()  # Get the start index of the match
            
            # Extract 100 characters of context, starting at the first character of the match
            following_text = text[match_start:match_start + 100]
            
            # Append the context and the index of the first character to the list
            extracted_matches.append((following_text, match_start))
    
    return extracted_matches

# Apply the extraction function to each row of the DataFrame
d_prose_texts['extracted_matches_context'] = d_prose_texts['clean_text'].apply(extract_matches_and_following)

# Print the DataFrame to see the results
print(d_prose_texts[['clean_text', 'extracted_matches_context']])


                                             clean_text  \
0     Der Raritätenliabhaber »Ja, grüaß Eahna Gott, ...   
1     Die Neunte in Klütenbüttel In Klütenbüttel sol...   
2     Treff-Aß Gibt es ein Buch des Schicksals, so k...   
3     Cécile Erstes Kapitel »Thale. Zweiter…« »Letzt...   
4     Die Perlmutterstadt Bekanntlich weilt der ural...   
...                                                 ...   
2506  In den Ferien Es ist die große Vakanz gewesen,...   
2507  Die Einwanderer »Eine dumme Geschichte das«, d...   
2508  Noblesse oblige Erstes Buch. Erstes Kapitel. D...   
2509  Der blinde Passagier Wir hatten uns über das u...   
2510  Das Hamaïl Zwischen Bir el asuad und Ain tajib...   

                              extracted_matches_context  
0                                                    []  
1                                                    []  
2                                                    []  
3     [(»Lieber Pierre«, sagte sie dann mit sich ras...  
4

In [14]:
# Function to count the number of matches for each text
def count_matches(matches):
    return len(matches)

# Create a new column with the count of matches for each text
d_prose_texts['match_count'] = d_prose_texts['extracted_matches_context'].apply(count_matches)

# Print the DataFrame to see the results
print(d_prose_texts[['clean_text', 'extracted_matches_context', 'match_count']])

                                             clean_text  \
0     Der Raritätenliabhaber »Ja, grüaß Eahna Gott, ...   
1     Die Neunte in Klütenbüttel In Klütenbüttel sol...   
2     Treff-Aß Gibt es ein Buch des Schicksals, so k...   
3     Cécile Erstes Kapitel »Thale. Zweiter…« »Letzt...   
4     Die Perlmutterstadt Bekanntlich weilt der ural...   
...                                                 ...   
2506  In den Ferien Es ist die große Vakanz gewesen,...   
2507  Die Einwanderer »Eine dumme Geschichte das«, d...   
2508  Noblesse oblige Erstes Buch. Erstes Kapitel. D...   
2509  Der blinde Passagier Wir hatten uns über das u...   
2510  Das Hamaïl Zwischen Bir el asuad und Ain tajib...   

                              extracted_matches_context  match_count  
0                                                    []            0  
1                                                    []            0  
2                                                    []            0  
3     [

In [15]:
d_prose_texts['extracted_matches_context'][20:70]

20    [(»Liebes Fräulein!« sprach er mild und freund...
21    [(1. März 1872 angenommenes Gesetz, durch welc...
22    [(»Lieber Gott, die g'scheiten Leut kenna's, d...
23    [(Teures Blut …« – »Bah! Was denn? Nicht so sc...
24                                                   []
25                                                   []
26    [(»Liebster Herr Professor! Diese Aufregung! D...
27    [(»Mein liebes Kind!« sagte er … »…ich bitte S...
28                                                   []
29                                                   []
30    [(»Mein geliebtes Kind! Du bist ein fertiger M...
31                                                   []
32    [(»Lieber Bolz, dasselbe was Sie mir sagen, ha...
33                                                   []
34    [(»Mein liebes Kind! Möge dieses Buch Dein tre...
35                                                   []
36    [(Mein lieber Cord!« »Also, bis hierher und ni...
37                                              

## Save the dataframe to a CSV file

In [206]:
# Save the original, very large dataframe to a new CSV file

#d_prose_texts.to_csv('df_letter_openings_d-prose.csv', index=False) # Caution: the resulting file could be very large

## Postprocessing

In [32]:

# Now look into the output and find the most common false positives.
# Collect them in this list of common false positives,
# which you can then use as a postprocessing step to remove those entries from the matched letter openings.

common_false_positives = [
    r'[»]?Lieber Gott',
    r'Liebe und',
    r'Liebe zu',
    r'Liebe von',
    r'[»]?Liebe[sr]+\s[A-Za-zäöü,!]*[«]? sag[te]?',
    r'[»]?Lieber [A-Za-zäöü]*[«]?[,]? sag[te]?',
    r'Liebe [A-Za-zäöü,\s]*[«]? sag[te]?',
    r'Lieber [A-Za-zäöü,\s]*[«]? sag[te]?',
    r'Liebes [A-Za-zäöü,\s]*[«]? sag[te]?',
    r'Grüß dich Gott, [A-Za-zäöü]',
    r'Teurer [A-Za-zäöü,\s]* sag[te]?',
    r'Hochverehrte Festversammlung!',
    r'Teuerste bedeutet',
    r'Geehrten',
    r'Grüß dich Gott, ',
    r'Grüß dich Gott!« sag[te]?',
    r'Liebster Jesu!',
    r'Lieber Jesu!',
    r'Lieber Vogel,',
    r'Teuerste war',
    #r'(?:[A-ZÄÖÜa-zäöüß\s]+,\s*den\s+\d+\.\s+[A-ZÄÖÜa-zäöüß]+\s+\d{4}\.)|\b\d{1,2}\. (Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)(?: \d{2,4})?\b'
    
]

In [33]:
# Function to remove common false positives
def remove_common_false_positives(extracted_matches_column, common_false_positives):
    # Create a copy of the DataFrame column to avoid modifying the original
    cleaned_matches_column = extracted_matches_column.copy()

    # Iterate through the rows of the DataFrame column
    for index, row in cleaned_matches_column.items():
        # Create a list to store valid matches for the current row
        valid_matches = []

        for match_text, match_start in row:
            # Check if the match_text matches any of the common_false_positives patterns
            if not any(re.search(fp_pattern, match_text) for fp_pattern in common_false_positives):
                valid_matches.append((match_text, match_start))

        # Update the DataFrame column with valid matches for the current row
        cleaned_matches_column[index] = valid_matches

    return cleaned_matches_column

# Remove common false positives
cleaned_extracted_matches = remove_common_false_positives(d_prose_texts['extracted_matches_context'], common_false_positives)

# Store the cleaned matches in a new 'cleaned_matches_context' column of the DataFrame
d_prose_texts['cleaned_matches_context'] = cleaned_extracted_matches

# Print the modified DataFrame with cleaned matches
print(d_prose_texts)




                                              file_name  \
0     von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...   
1     Ernst_Otto_Satiren_Die_Neunte_in_Kluetenbuette...   
2     Anzengruber_Ludwig_Kalendergeschichten_Treff-A...   
3                            Fontane_Theodor_Cecile.txt   
4     Scheerbart_Paul_Das_grosse_Licht_Die_Perlmutte...   
...                                                 ...   
2506                     Thoma_Ludwig_In_den_Ferien.txt   
2507  Loens_Hermann_Tiergeschichten_Die_Einwanderer.txt   
2508           Spielhagen_Friedrich_Noblesse_oblige.txt   
2509  Ganghofer_Ludwig_Fliegender_Sommer_Der_blinde_...   
2510                            May_Karl_Das_Hamail.txt   

                           title  \
0         Der Raritätenliabhaber   
1     Die Neunte in Klütenbüttel   
2                       Treff-Aß   
3                         Cécile   
4            Die Perlmutterstadt   
...                          ...   
2506               In den Ferien   
250
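
One property of this filter is worth keeping in mind: `re.search` looks for a false-positive pattern anywhere in the 100-character context, not only at its beginning. A genuine opening is therefore also removed when a phrase such as "Liebe zu" happens to occur later in its context. The sketch below contrasts this with `re.match`, which only matches at the start of the string. The helper name `is_false_positive`, the shortened pattern list and the sample contexts are assumptions for illustration. The anchored variant only works if the patterns allow for an optional leading quotation mark.

```python
import re

# Shortened, hypothetical list; both patterns allow an optional leading »
false_positives = [r'[»]?Lieber Gott', r'[»]?Liebe zu']


def is_false_positive(context, patterns, anchored):
    """Check a context against the patterns, anywhere (search) or at its start (match)."""
    matcher = re.match if anchored else re.search
    return any(matcher(pattern, context) for pattern in patterns)


contexts = [
    '»Lieber Gott, die Leute reden viel',
    '»Lieber Freund! Meine Liebe zu dir ist unverändert',
]

kept_search = [c for c in contexts if not is_false_positive(c, false_positives, anchored=False)]
kept_match = [c for c in contexts if not is_false_positive(c, false_positives, anchored=True)]

print(len(kept_search), len(kept_match))
```

With `re.search`, both contexts are discarded. With `re.match`, the second context survives, because "Liebe zu" does not occur at its start.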

In [36]:
# Function to count the number of matches for each text
def count_matches(matches):
    return len(matches)

# Create a new column with the count of matches for each text
d_prose_texts['cleaned_match_count'] = d_prose_texts['cleaned_matches_context'].apply(count_matches)

# Print the DataFrame to see the results
#print(d_prose_texts[['clean_text', 'cleaned_matches_context', 'cleaned_match_count']])

## Save the dataframe to a CSV file

In [212]:
# Save the dataframe to a new CSV file

d_prose_texts.to_csv('output_cleaned_letter_openings_subset2_big.csv', index=False)

In [37]:
#show big dataframe 
d_prose_texts

Unnamed: 0,file_name,title,extracted_matches_context,match_count,cleaned_matches_context,cleaned_match_count
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,Der Raritätenliabhaber,[],0,[],0
1,Ernst_Otto_Satiren_Die_Neunte_in_Kluetenbuette...,Die Neunte in Klütenbüttel,[],0,[],0
2,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,Treff-Aß,[],0,[],0
3,Fontane_Theodor_Cecile.txt,Cécile,"[(»Lieber Pierre«, sagte sie dann mit sich ras...",9,[(»Lieber Freund. Es geht nicht so weiter. Sei...,8
4,Scheerbart_Paul_Das_grosse_Licht_Die_Perlmutte...,Die Perlmutterstadt,[(17. Mai 1910. Kaping (Mittel-China). Die Chi...,7,[(17. Mai 1910. Kaping (Mittel-China). Die Chi...,7
...,...,...,...,...,...,...
2506,Thoma_Ludwig_In_den_Ferien.txt,In den Ferien,[],0,[],0
2507,Loens_Hermann_Tiergeschichten_Die_Einwanderer.txt,Die Einwanderer,[],0,[],0
2508,Spielhagen_Friedrich_Noblesse_oblige.txt,Noblesse oblige,"[(Geliebter! uns beide! ich, die ich muß; du u...",15,"[(Geliebter! uns beide! ich, die ich muß; du u...",15
2509,Ganghofer_Ludwig_Fliegender_Sommer_Der_blinde_...,Der blinde Passagier,[],0,[],0


As the dataframe is too large to inspect comfortably, we limit it to the columns of interest by deleting the columns we do not need.

In [41]:
# Delete the columns that we do not need or that take up too much space
#del d_prose_texts['document_text']
#del d_prose_texts['clean_text']
#del d_prose_texts['extracted_matches_context']
del d_prose_texts['match_count']

In [42]:
#show small dataframe 
d_prose_texts

Unnamed: 0,file_name,title,cleaned_matches_context,cleaned_match_count
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,Der Raritätenliabhaber,[],0
1,Ernst_Otto_Satiren_Die_Neunte_in_Kluetenbuette...,Die Neunte in Klütenbüttel,[],0
2,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,Treff-Aß,[],0
3,Fontane_Theodor_Cecile.txt,Cécile,[(»Lieber Freund. Es geht nicht so weiter. Sei...,8
4,Scheerbart_Paul_Das_grosse_Licht_Die_Perlmutte...,Die Perlmutterstadt,[(17. Mai 1910. Kaping (Mittel-China). Die Chi...,7
...,...,...,...,...
2506,Thoma_Ludwig_In_den_Ferien.txt,In den Ferien,[],0
2507,Loens_Hermann_Tiergeschichten_Die_Einwanderer.txt,Die Einwanderer,[],0
2508,Spielhagen_Friedrich_Noblesse_oblige.txt,Noblesse oblige,"[(Geliebter! uns beide! ich, die ich muß; du u...",15
2509,Ganghofer_Ludwig_Fliegender_Sommer_Der_blinde_...,Der blinde Passagier,[],0


In [43]:
#save the shortened dataframe to a new csv.file

d_prose_texts.to_csv('d-prose_cleaned_letter_openings.csv', index=False)

In [45]:
d_prose_texts['cleaned_matches_context'].to_csv('d-prose_only_letter_openings.csv', index=False)
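
When the saved files are read again, the match columns no longer contain Python lists: a CSV file stores each list as its string representation. A minimal sketch of how such cells could be turned back into lists of tuples with `ast.literal_eval` is shown below. It uses the standard-library `csv` module and an in-memory buffer; the rows are invented for illustration. The same parsing step should apply to cells loaded with `pd.read_csv`.

```python
import ast
import csv
import io

# Hypothetical rows shaped like the notebook's output columns
rows = [
    {'file_name': 'a.txt', 'cleaned_matches_context': [('»Lieber Freund. Es geht nicht', 31)]},
    {'file_name': 'b.txt', 'cleaned_matches_context': []},
]

# Write to an in-memory CSV; the lists are stored as their string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['file_name', 'cleaned_matches_context'])
writer.writeheader()
writer.writerows(rows)

# Read back: every cell is a plain string until it is parsed again
buffer.seek(0)
cells = [row['cleaned_matches_context'] for row in csv.DictReader(buffer)]
restored = [ast.literal_eval(cell) for cell in cells]

print(restored)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing such cells.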

end of notebook 2