# Preprocessing of the "Briefwechsel"

This notebook:
1. Partially cleans the plain text of Husserl's correspondence contained in *HuaDok III, 1-9 - Briefwechsel (1993)*
2. Segments the letters to obtain a tabular structure
3. Generates .txt files from the cleaned text
4. Generates .csv files from the segmented text

In [1]:
######### I will use regex and pandas libraries
import re
import pandas as pd

In [3]:
############ Open the first volume
with open("Vol1.txt", encoding="utf8") as f:
    text = f.read()


# 1. OCR postcorrection: cleaning wrong characters
In this first part of the code, I use regular expressions to replace characters that the OCR misread.

In [4]:
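# Collapse runs of spaces into a single space
replace0 = re.sub(" +", " ", text)
# Restore the umlauts that the OCR misread as tilde characters
replace1 = re.sub(r"ã", "ä", replace0)
replace2 = re.sub(r"õ", "ö", replace1)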
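
The two single-character substitutions can also be expressed as one translation table. The following is a minimal sketch, assuming the misread characters are limited to the two pairs handled in this notebook; `OCR_CHAR_MAP` and `fix_characters` are hypothetical names that do not appear in the original code.

```python
import re

# Hypothetical mapping from OCR misreadings to the intended German characters.
# Only the two pairs handled in this notebook are included.
OCR_CHAR_MAP = str.maketrans({"ã": "ä", "õ": "ö"})


def fix_characters(text):
    """Collapse runs of spaces, then swap the misread characters."""
    text = re.sub(r" +", " ", text)
    return text.translate(OCR_CHAR_MAP)


# Invented sample sentence, not taken from the corpus.
sample = "Die  Erklãrung ist  mõglich"
print(fix_characters(sample))
```

Adding another pair then only requires a new dictionary entry instead of another `re.sub` call.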

In [5]:
############ The next re.sub matches words ending with a capital B (an OCR misreading of ß)
############ We need the following function to replace only the last character of the match object
############ source: https://pynative.com/python-regex-replace-re-sub/#h-regex-replacement-function

def convert_to_Eszett(match_obj):
    word = match_obj.group()
    # Drop the trailing "B" and append "ß"
    return word[:-1] + "ß"

In [6]:
replace3 = re.sub(r"\w+B\b", convert_to_Eszett, replace2)        
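
To see what this pattern does in isolation, here is a small sketch on an invented sample. It uses a capture group and a backreference instead of a replacement function; `final_B_to_eszett` is a hypothetical name, and the behavior is meant to mirror the call above, not to replace it.

```python
import re


def final_B_to_eszett(text):
    # Keep everything before the word-final capital B and append ß instead.
    return re.sub(r"(\w+)B\b", r"\1ß", text)


# Invented sample, not taken from the corpus.
print(final_B_to_eszett("Der FluB und der GruB"))

# Caveat: the pattern does not check the case of the rest of the word,
# so an all-uppercase word ending in B is changed as well.
print(final_B_to_eszett("KLUB"))
```

The second call shows a limitation worth keeping in mind: unlike the later step, this pattern does not skip words written entirely in uppercase.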

In [10]:
########## The following regular expression matches words containing capital B
########## but not words that start with B

x = re.findall(r"\b(?!B)[a-zA-Z]*B[a-zA-Z]*", replace3)
print(x)

['VERBINDUNG', 'ELISABETH', 'HERAUSGEGEBEN', 'groBe', 'muBte', 'groBen', 'groBen', 'fleiBig', 'anlaBlich', 'groBe', 'groBen', 'auBerordentlich', 'SchlieBlich', 'VerheiBungen', 'lieBe', 'MiBverdienst', 'bloBes', 'hieBe', 'groBen', 'groBe', 'groBer', 'anzuschlieBen', 'VerheiBung', 'muBten', 'groBe', 'groBer', 'groBen', 'schlieBt', 'auBerordentliche', 'verstieBen', 'maBgebend', 'miBtrauisch', 'groBen', 'ieBen', 'dahinflieBe', 'EinfluBreichen', 'EinfluBreichen', 'groBe', 'fuBt', 'auffaBt', 'SchluBkapitel', 'MeBversuchen', 'groBen', 'groBe', 'MeBkunst', 'muBte', 'entschlieBen', 'MeBkunst', 'auBerwesentlich', 'ausschlieBen', 'MaBe', 'groBen', 'MaBe', 'schlieBe', 'GroBes', 'FBrentano', 'einigennaBen', 'unvergeBlichen', 'groBe', 'muBte', 'groBe', 'verdrieBen', 'auszuschlieBen', 'muBte', 'HeiBt', 'groBen', 'venniBt', 'heiBen', 'muBte', 'FBrentano', 'FBrentano', 'groBer', 'groBe', 'MaBe', 'groBe', 'groBen', 'groBen', 'auBer', 'ausschlieBt', 'bloBen', 'groBe', 'groBe', 'GruBe', 'heiBt', 'auBer', 

In [7]:
############ As with the previous re.sub call, we need a replacement function.
############ This one replaces every capital B in the match object,
############ except in words written entirely in uppercase (e.g. 'VERBINDUNG')
############ source: https://pynative.com/python-regex-replace-re-sub/#h-regex-replacement-function

def convert_to_Eszett2(match_obj):
    word = match_obj.group()
    if all(letter.isupper() for letter in word):
        # Leave all-caps words untouched
        return word
    # The regex guarantees that the word contains at least one B
    return word.replace("B", "ß")
                    

In [8]:
final_text = re.sub(r"\b(?!B)[a-zA-Z]*B[a-zA-Z]*", convert_to_Eszett2, replace3)        
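
The list printed earlier contains 'FBrentano', which this replacement turns into 'Fßrentano'. One possible refinement, sketched here under the assumption that a misread ß always follows a lowercase letter, is to use a lookbehind. The function name `inner_B_to_eszett` is hypothetical and the rule has not been checked against the whole corpus.

```python
import re


def inner_B_to_eszett(text):
    # Assumption: a capital B directly after a lowercase letter is a misread ß.
    # All-caps words, word-initial B, and 'FBrentano' are left unchanged.
    return re.sub(r"(?<=[a-zäöü])B", "ß", text)


# Invented sample built from words that appear in the printed matches.
sample = "groBe VERBINDUNG FBrentano Brentano FluB"
print(inner_B_to_eszett(sample))
```

Under that assumption, a single pass covers both word-final and word-internal cases, at the cost of missing a misread ß that follows an uppercase letter.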

## Note
Further cleaning is still possible (footnotes, page and line numbers, line breaks), but we can already apply some computational text analysis techniques to this corpus.

For later reuse, I wrap these steps in a function.

In [15]:
def ocr_postcorrection(txt_file):
    
    with open(txt_file, newline='', encoding="utf8", errors='ignore') as f:
        text = f.read()
    
    replace0 = re.sub(' +', ' ', text)
    replace1 = re.sub(r"ã", "ä", replace0)
    replace2 = re.sub(r"õ", "ö", replace1)
    
    ############ The next re.sub matches words ending with a capital B
    ############ We need the following function to replace only the last character of the match object
    ############ source: https://pynative.com/python-regex-replace-re-sub/#h-regex-replacement-function
    def convert_to_Eszett(match_obj):
        word = match_obj.group()
        return word[:-1] + "ß"
        
    replace3 = re.sub(r"\w+B\b", convert_to_Eszett, replace2)
    
    ############ As with the previous re.sub call, we need a replacement function.
    ############ This one replaces every capital B of the match object
    ############ source: https://pynative.com/python-regex-replace-re-sub/#h-regex-replacement-function
    def convert_to_Eszett2(match_obj):
        word = match_obj.group()
        if all(letter.isupper() for letter in word): # we want to ignore words in uppercase
            return word
        return word.replace("B", "ß")
                 
    # This expression matches words containing B but not starting with B
    final_text = re.sub(r"\b(?!B)[a-zA-Z]*B[a-zA-Z]*", convert_to_Eszett2, replace3)
    
    return final_text

# 2. Creating a new structure: Tabular Data

In [10]:
############ The following expressions match the heading that opens each letter

letter_by_Husserl = re.findall(r"Husserl an.*?\d*\.*\d{4}", final_text)
letter_to_Husserl = re.findall(r"\w+ an Husserl.*?\d*\.*\d{4}", final_text)
print(letter_by_Husserl)
print("---------------------------------------------")
print(letter_to_Husserl)

['Husserl an Brentano, 29. XII. 1886', 'Husserl an Brentano, 23. X. 1891', 'Husserl an Brentano, 29. XII. 1892', 'Husserl an Brentano, ca. Anfang 1894', 'Husserl an Brentano, ca. 15. 1. 1898', 'Husserl an Brentano, 11./15. X. 1904', 'Husserl an Brentano, 27. III. 1905', 'Husserl an Brentano, 22. VIII. 1906', 'Husserl an Brentano, 5. IV. 1907', 'Husserl an Brentano, 6. V. 1907', 'Husserl an Brentano, 22. XI. 1911', 'Husserl an Brentano, 31. XII. 1913', 'Husserl an Höfler, 13. VII. 1897', 'Husserl an Höfler, 1. V. 1901', 'Husserl an Marty, 7. VII. 1901', 'Husserl an Marty, 11J13. X. 1905', 'Husserl an Masaryk, ca. 25. XII. 1902', 'Husserl an Masaryk, 3. X. 1921', 'Husserl an Masaryk, 2. III. 1922', 'Husserl an Masaryk, 3. 1. 1935', 'Husserl an Meinong, 22. V. 1891', 'Husserl an Meinong, ca. Juli 1891', 'Husserl an Meinong, 25. 1. 1892', 'Husserl an Meinong, 16. II. 1892', 'Husserl an Meinong, 14. II. 1894', 'Husserl an Meinong, 22. XI. 1894', 'Husserl an Meinong, 19. VII. 1896', 'Husserl

In [11]:
all_letters = re.findall(r"Husserl an.*?\d*\.*\d{4}|\w+ an Husserl.*?\d*\.*\d{4}", final_text)
split = re.split(r"Husserl an.*?\d*\.*\d{4}|\w+ an Husserl.*?\d*\.*\d{4}", final_text)

print(len(split), len(all_letters))

89 88
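
The two counts differ by one because `re.split` also returns the text that precedes the first match. `split[0]` therefore holds the front matter, and `split[i + 1]` is the body that follows `all_letters[i]`. A small sketch with an invented sample string (not taken from the edition) illustrates this; `HEADER`, `titles`, and `pieces` are names chosen for the example.

```python
import re

# Same heading pattern as in the cell above.
HEADER = r"Husserl an.*?\d*\.*\d{4}|\w+ an Husserl.*?\d*\.*\d{4}"

# Invented sample text with two letter headings.
sample = (
    "Front matter\n"
    "Husserl an Brentano, 29. XII. 1886\nFirst body\n"
    "Brentano an Husserl, 27. IV. 1891\nSecond body\n"
)

titles = re.findall(HEADER, sample)
pieces = re.split(HEADER, sample)

# One more piece than titles: pieces[0] is the text before the first heading.
print(len(pieces), len(titles))

# Pair each heading with the text that follows it.
for title, body in zip(titles, pieces[1:]):
    print(repr(title), repr(body))
```

Pairing `titles` with `pieces[1:]` through `zip` is an alternative to keeping a manual position counter.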


In [12]:
list_for_df = []
index = 0

for title in all_letters:
    
    ########## extracting the year
    # Every title ends in a four-digit year, so the search always finds a match
    year = re.search(r"\d{4}", title).group()
    
    ######### sender and recipient
    if title in letter_by_Husserl:
        sender = "Husserl"
        rec = title.split()
        recipient = rec[2]
    else:
        recipient = "Husserl"
        send = title.split()
        sender = send[0]
    
    ########## indicating position of the letter in the split list
    index = index + 1    
    
    letter = [title, year, sender, recipient, split[index]]
    
    list_for_df.append(letter)
    
#print(list_for_df[75])

In [13]:
df = pd.DataFrame(list_for_df, columns=["title", "year", "sender", "recipient", "letter"])
print(df)

                                    title  year      sender    recipient  \
0      Husserl an Brentano, 29. XII. 1886  1886     Husserl    Brentano,   
1       Brentano an Husserl, 27. IV. 1891  1891    Brentano      Husserl   
2       Brentano an Husserl, ca. Mai 1891  1891    Brentano      Husserl   
3        Husserl an Brentano, 23. X. 1891  1891     Husserl    Brentano,   
4        Brentano an Husserl, 26. 1. 1892  1892    Brentano      Husserl   
..                                    ...   ...         ...          ...   
83   Husserl an Twardowski, 13. VII. 1928  1928     Husserl  Twardowski,   
84  Twardowski an Husserl, 17. VIII. 1928  1928  Twardowski      Husserl   
85          Utitz an Husserl, 22. XI.1930  1930       Utitz      Husserl   
86          Husserl an Utitz, 2. XI. 1931  1931     Husserl       Utitz,   
87          Utitz an Husserl, 6. XI. 1932  1932       Utitz      Husserl   

                                               letter  
0   \n\n\n\n\n\nHochverehrter H
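
In the table above, the `recipient` column keeps the comma from the title ('Brentano,'). One way to avoid this is to parse the title with named groups. This is a sketch under the assumption that every title has the shape "X an Y, date" with single-word names, which holds for the titles printed in this notebook but has not been verified for all volumes; `TITLE` and `parse_title` are hypothetical names.

```python
import re

# Hypothetical pattern: sender, recipient, and the four-digit year that ends the title.
TITLE = re.compile(r"^(?P<sender>\w+) an (?P<recipient>\w+),?.*?(?P<year>\d{4})$")


def parse_title(title):
    """Return (sender, recipient, year), or None if the title has another shape."""
    match = TITLE.match(title)
    if match is None:
        return None
    return match.group("sender"), match.group("recipient"), match.group("year")


# Titles copied from the output printed in this notebook.
print(parse_title("Husserl an Brentano, 29. XII. 1886"))
print(parse_title("Utitz an Husserl, 22. XI.1930"))
```

Because the comma sits outside the `recipient` group, no separate stripping step is needed. Names made of several words would still be cut at the first word, as in the original splitting approach.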

## Note 2
Again, I put everything together in a function.

In [30]:
def tabular_structure(text):
    
    header = r"Husserl an.*?\d*\.*\d{4}|\w+ an Husserl.*?\d*\.*\d{4}"
    all_letters = re.findall(header, text)
    letter_by_Husserl = re.findall(r"Husserl an.*?\d*\.*\d{4}", text)
    
    # split[0] is the text before the first letter; split[i + 1] follows all_letters[i]
    split = re.split(header, text)
    
    list_for_df = []

    for index, title in enumerate(all_letters, start=1):
    
        ########## extracting the year (every title ends in a four-digit year)
        year = re.search(r"\d{4}", title).group()
    
        ######### sender and recipient
        if title in letter_by_Husserl:
            sender = "Husserl"
            # drop the comma that follows the recipient's name in the title
            recipient = title.split()[2].rstrip(",")
        else:
            recipient = "Husserl"
            sender = title.split()[0]
    
        ########## index is the position of the letter in the split list
        list_for_df.append([title, year, sender, recipient, split[index]])
        
    # Build and return the DataFrame only after all letters have been collected
    df = pd.DataFrame(list_for_df, columns=["title", "year", "sender", "recipient", "letter"])
        
    return df

# 3. Loop over all volumes: new txt files

In [24]:
########## For loop using the ocr_postcorrection function

for vol in range(1, 10):
    file = "Vol%s.txt" % vol
    new_file_name = "HuaDokIII-%s.txt" % vol
    with open(new_file_name, 'w', encoding='utf-8') as f:
        f.write(ocr_postcorrection(file))
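
The loop assumes that all nine input files are present in the working directory. A possible variation, sketched here with `pathlib`, builds the input and output paths first and skips volumes whose input file is missing. `volume_jobs` and its parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def volume_jobs(folder=".", volumes=range(1, 10)):
    """Yield (input_path, output_path) for each volume whose input file exists."""
    base = Path(folder)
    for vol in volumes:
        source = base / f"Vol{vol}.txt"
        target = base / f"HuaDokIII-{vol}.txt"
        if source.exists():
            yield source, target
        else:
            # Report the gap instead of stopping with FileNotFoundError
            print(f"skipping missing file: {source}")
```

Each yielded pair could then be passed to `ocr_postcorrection` and written out as in the loop above, since `open` accepts `Path` objects.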
    

# 4. Loop over all volumes: new csv files

In [31]:
########## For loop using the ocr_postcorrection and tabular_structure functions
########## Uncomment the commented lines to generate the txt and the csv files in the same run

for vol in range(1, 10):
    
    file = "Vol%s.txt" % vol
    #new_txt_name = "HuaDokIII-%s.txt" % vol
    name_csv = "TABULAR_HuaDokIII-%s.csv" % vol

    postcorrection = ocr_postcorrection(file)
    
    #with open(new_txt_name, 'w', encoding='utf-8') as f:
     #   f.write(postcorrection)
    
    tabular_structure(postcorrection).to_csv(name_csv)