# Translate texts (version: 2022-05-05)

_by A. Maurits van der Veen_  

_Modification history:_  
_2022-05-05 - cleaned up from longer notebook, for general use_  

This notebook provides code to translate a corpus of texts using supplied translation dictionaries.

Note that this is word-level translation, so the resulting texts will not read like good English. For downstream text analysis steps such as sentiment analysis or topic modeling, however, good English is not required: all that matters is that the individual words carrying the meaning of the text are translated correctly.

The dictionary is expected to be in CSV format, with source words in the first column and target words in the second.
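
As a rough illustration of this format, the sketch below reads two-column CSV data into a Python dict. It is not the implementation of `translation4tru.load_cleanpairs`: the helper name `load_pairs_sketch` is hypothetical, and keeping only the first translation for a repeated source word is an assumption made for this example.

```python
import csv
import io


def load_pairs_sketch(csvfile):
    """Read (source, target) word pairs from an open CSV file into a dict.

    Hypothetical helper for illustration only; not part of translation4tru.
    """
    pairs = {}
    for row in csv.reader(csvfile):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, target = row[0].strip(), row[1].strip()
        if source and source not in pairs:
            pairs[source] = target  # assumption: the first translation wins
    return pairs


# Tiny in-memory example standing in for a real dictionary file
sample = io.StringIO("Haus,house\nTreppe,stairs\nHaus,home\n")
sample_dict = load_pairs_sketch(sample)
print(sample_dict)
```

With a real file, the same function could be called on a handle opened with `open(path, newline='', encoding='utf-8')`.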


### 0. Set-up

In [None]:
projectfolder = '/Users/maurits/STAIR/'  # Adapt as needed


In [None]:
import sys
import os
import csv

sys.path.append(projectfolder + 'Code')  # The place to look for local code files
import translation4tru  # Auxiliary code file for this notebook

# Print summary version info (for fuller info, simply print sys.version)
print('You are using python version {}.'.format(sys.version.split()[0]))

### 1. Load translation dictionary


In [None]:
# Specify translation pair & direction

sourcelang = 'de'  # change as applicable
targetlang = 'en'  # change as applicable

langpair = sourcelang + '-' + targetlang


In [None]:
# Load translation dictionary (adapt pathname as needed)
translationfile = projectfolder + 'Translation/Dictionaries/' + langpair + '_transl.txt'
translationdict = translation4tru.load_cleanpairs(translationfile, reverse=False, usecsv=True)

# Show dictionary length
print('The {} -> {} word translation dictionary contains {:,} entries.'.format(sourcelang, targetlang, len(translationdict)))

### 2. Translate texts

Unknown words are flagged in the output by prepending an out-of-vocabulary marker.  
To make these words stand out, use something easily identifiable, such as `*oov*`; to copy them as is without any special marking (usually the preferred approach, as many are just proper names), use an empty string `''` as the marker.
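
To make the effect of the marker concrete, here is a minimal sketch of word-level translation with an out-of-vocabulary marker. It is not the code used by `translation4tru.translate_corpus`: the function name `translate_text_sketch` is hypothetical, and it assumes simple whitespace tokenization and exact, case-sensitive dictionary lookups.

```python
def translate_text_sketch(text, translationdict, oov_marker=''):
    """Translate a text word by word; hypothetical, not translation4tru code.

    Assumes whitespace tokenization and exact (case-sensitive) lookups.
    """
    translated = []
    for word in text.split():
        if word in translationdict:
            translated.append(translationdict[word])
        else:
            # Unknown word: copy it over, prefixed with the marker (if any)
            translated.append(oov_marker + word)
    return ' '.join(translated)


toy_dict = {'die': 'the', 'Treppe': 'stairs'}  # toy dictionary for illustration
marked = translate_text_sketch('die Treppe Berlin', toy_dict, oov_marker='*oov*')
unmarked = translate_text_sketch('die Treppe Berlin', toy_dict, oov_marker='')
print(marked)    # the stairs *oov*Berlin
print(unmarked)  # the stairs Berlin
```

With the empty marker, the proper name `Berlin` passes through unchanged, which is why that setting is usually preferable.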

In [None]:
oov_marker = ''  # use '*oov*' to have words stand out in translation


In [None]:
sourcefolder = projectfolder + 'Corpora/mycorpus/'    
corpusfilename = 'corpusname'

# Specify filename suffixes for the source and target corpus files
sourcesuffix = '.csv'
translsuffix = '_2' + targetlang + '.csv'


In [None]:
sourcefile = sourcefolder + corpusfilename + sourcesuffix
targetfile = sourcefolder + corpusfilename + translsuffix


In [None]:
nrtoshow = 10  # number of most common unknown words to show

nrtranslated, unknownwords, unknownFD = \
    translation4tru.translate_corpus(sourcefile, targetfile, translationdict, 
                                 oov_marker=oov_marker, header=False, textcol=1, keeplines=True,
                                 update_interval=20000, show_unknown=False, track_unknown=True)
print('Translated {} texts in corpus {}; encountered {} distinct unknown words.'.format(
      nrtranslated, corpusfilename, len(unknownwords)))
if len(unknownwords) > 0:
    print('{} most common unknown words:'.format(min(len(unknownwords), nrtoshow)))
    unknownitems = sorted(unknownFD.items(), key=lambda x: x[1], reverse=True)
    for unknownword, count in unknownitems[:nrtoshow]:
        print('{:24}: {} occurrences'.format(unknownword, count))
        

### 3. Examine translation

Check one or more entries to make sure the translation worked well. We can do this at random (by just picking some row numbers), or else we might check for particular keywords and look for the first N articles containing the keyword in question. The code here successively does both.
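
The keyword-based check can be sketched as follows. This is only an illustration of the idea, not the implementation of `translation4tru.display_translation_bycontent`: the helper name `first_matches_sketch` is hypothetical, and it assumes that source and translated rows are aligned one-to-one and are already available as lists of strings.

```python
def first_matches_sketch(source_rows, target_rows, searchstring, first_n):
    """Return up to first_n (row number, source, translation) triples
    whose source text contains searchstring.

    Hypothetical helper; assumes the two lists are aligned row by row.
    """
    matches = []
    for rownr, (src, tgt) in enumerate(zip(source_rows, target_rows)):
        if len(matches) >= first_n:
            break  # enough examples collected
        if searchstring in src:
            matches.append((rownr, src, tgt))
    return matches


# Toy aligned corpora for illustration
toy_source = ['die Treppe', 'das Haus', 'Treppe und Haus']
toy_target = ['the stairs', 'the house', 'stairs and house']

found = first_matches_sketch(toy_source, toy_target, 'Treppe', 3)
for rownr, src, tgt in found:
    print('{:>4}: {}  ->  {}'.format(rownr, src, tgt))
```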

In [None]:
rows2check = [0, 1000]  # List of row numbers for which to display both the original and the translation

translation4tru.display_translation_byrow(sourcefile, targetfile, rows2check)


In [None]:
source_searchstring = 'Treppe'
firstN = 3

translation4tru.display_translation_bycontent(sourcefile, targetfile, source_searchstring, firstN)