# Normalization

There are several ways to normalize texts before collating them.

## Normalization 1. Remove and replace

Let's go back to the example of the sonnet about writing a sonnet, by Lope de Vega, in a French translation.

The output of the collation consists mostly of differences in punctuation and capitalization.

In [None]:
from collatex import *
collation = Collation()
witness_1707 = open( "../data/sonnet/Lope_soneto_FR_1707.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/Lope_soneto_FR_1822.txt", encoding='utf-8' ).read()
collation.add_plain_witness( "wit 1707", witness_1707 )
collation.add_plain_witness( "wit 1822", witness_1822 )
alignment_table = collate(collation, output='html2')

Imagine that we are not interested in punctuation and capitalization: we only want what might be called 'substantive variants'.

The "hard way" of obtaining the expected result is to **remove punctuation and lower-case all the texts**. The code below will do just that: it will
- create a new directory, inside the `data/sonnet` dir, called 'norm'
- make a normalized copy (without punctuation and all lower-case) of each file inside the new 'norm' dir

The creation of a normalized copy is safer than just normalizing the original transcriptions. If you keep the originals, you can always come back to them and perform other kinds of normalization if needed.

**Note**: the code below contains many comments, that is, text that is not executed but serves as documentation. You've seen that in XML, comments are written inside `<!-- -->`. In Python, comments are marked with the `#` sign.

In [None]:
import glob, re, os

path = '../data/sonnet/'  # put the path into a variable 

os.makedirs(path + 'norm', exist_ok=True)  # create the new folder, if it does not exist yet

files = [os.path.basename(x) for x in glob.glob(path+'*.txt')]  # take all txt files in the directory

for file in files:  # for each file in the directory
    
    ### READ THE FILE CONTENT
    with open(path+file, 'r', encoding='utf-8') as file_opened:  # open the file in mode 'r' (read); 'with' closes it afterwards
        content = file_opened.read()  # read the file content
    
    ### ALL TO LOWER CASE
    lowerContent = content.lower() 
    
    ### REMOVE PUNCTUATION 
    # replace every character that is neither a word character (\w: letters, digits, underscore) nor whitespace (\s) with a space
    noPunct_lowerContent = re.sub(r'[^\w\s]',' ',lowerContent) 
    
    ### REMOVE MULTIPLE WHITESPACES
    regularSpaces_noPunct_lowerContent = " ".join(noPunct_lowerContent.split())
    
    ### CREATE A NEW FILE
    filename = os.path.splitext(file)[0]  # file name without the '.txt' extension
    new_file = open(path+'norm/' + filename + '_norm.txt', 'w', encoding='utf-8') # open the new file in mode 'w' (write)
    
    ### WRITE THE NEW CONTENT INTO THE NEW FILE
    new_file.write(regularSpaces_noPunct_lowerContent) 
    
    ### CLOSE THE FILE
    new_file.close()
    
print('Finished! All normalized!')
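
To see what the three steps do before looking at whole files, here is a minimal sketch that applies them to a single string. The function name `normalize` and the sample line are invented for illustration; they are not part of the sonnet data.

```python
import re

def normalize(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace.

    Hypothetical helper: it repeats the three steps of the loop above
    on one string instead of on a file.
    """
    lowered = text.lower()                       # all to lower case
    no_punct = re.sub(r'[^\w\s]', ' ', lowered)  # punctuation becomes a space
    return " ".join(no_punct.split())            # collapse runs of whitespace

sample = "Un Sonnet, dit-elle ;  QUATORZE vers !"  # made-up sample line
print(normalize(sample))  # prints: un sonnet dit elle quatorze vers
```

Note that punctuation is replaced by a space rather than deleted, so a hyphenated form such as `dit-elle` ends up as two tokens. Whether that is desirable depends on what you want to count as a variant.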

Now, let's **collate the normalized copies**.

Pay attention to the new path. The output should differ from the one above: the variants in punctuation and capitalization should no longer appear.
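
Before collating, it can help to check that the normalized copies really are where the new path points. The sketch below only lists the files; the function name `list_normalized` is an assumption made for this example, and the directory is the one created by the normalization code above.

```python
import glob
import os

def list_normalized(norm_dir):
    """Return the sorted base names of the *_norm.txt files in norm_dir.

    Hypothetical helper, used here only to inspect the 'norm' folder.
    """
    pattern = os.path.join(norm_dir, '*_norm.txt')
    return sorted(os.path.basename(p) for p in glob.glob(pattern))

# An empty list means the normalization step has not been run yet
# (or the path is wrong).
print(list_normalized('../data/sonnet/norm/'))
```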


In [None]:
from collatex import *
collation = Collation()
witness_1707 = open( "../data/sonnet/norm/Lope_soneto_FR_1707_norm.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/norm/Lope_soneto_FR_1822_norm.txt", encoding='utf-8' ).read()
collation.add_plain_witness( "wit 1707", witness_1707 )
collation.add_plain_witness( "wit 1822", witness_1822 )
alignment_table = collate(collation, output='html2')

## Normalization 2. Annotate