## Building the Co-occurrence (Cooc) Table

In [41]:
import pprint as pp
from collections import defaultdict
import re

In [45]:
def build_cooc_table(filepath_fr, filepath_en):
    """
    Count, for each French word, how many aligned sentence pairs it shares with each English word
    """
    cooc_table = {}

    # "with" closes both files automatically, even if an error occurs
    with open(filepath_fr, 'r') as fr, open(filepath_en, 'r') as en:
        # zip pairs up line i of the French file with line i of the English file
        for line_fr, line_en in zip(fr, en):
            line_fr = re.sub(r'[.,:;]', '', line_fr)  # simple regular expression to get rid of basic punctuation
            line_en = re.sub(r'[.,:;]', '', line_en)
            line_fr, line_en = line_fr.strip().split(), line_en.strip().split()
            # use set to remove any duplicates, so each word counts once per sentence
            for word_fr in set(line_fr):
                if word_fr in cooc_table:
                    # reuse the existing count dict if the French word has already been seen
                    # (this is a reference to the same dict, not a copy)
                    counts_en = cooc_table[word_fr]
                else:
                    # otherwise initialize a defaultdict: it provides a default value for any key that
                    # does not exist yet. "int" is the factory for that default (int() == 0), so we can
                    # increment a count directly without initializing it first
                    counts_en = defaultdict(int)

                for word_en in set(line_en):
                    counts_en[word_en] += 1

                cooc_table[word_fr] = counts_en

    return cooc_table
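
To see what the table looks like without the corpus files, here is a minimal sketch that applies the same counting logic to two in-memory sentence pairs. The helper name `build_cooc_from_pairs` and the toy sentences are made up for illustration; they are not part of the notebook or the corpus.

```python
from collections import defaultdict
import re

def build_cooc_from_pairs(pairs):
    """Hypothetical helper: same counting logic as build_cooc_table, on in-memory sentence pairs."""
    cooc_table = {}
    for sent_fr, sent_en in pairs:
        # strip basic punctuation, split on whitespace, and deduplicate with set
        words_fr = set(re.sub(r'[.,:;]', '', sent_fr).split())
        words_en = set(re.sub(r'[.,:;]', '', sent_en).split())
        for word_fr in words_fr:
            # setdefault returns the existing count dict or stores a fresh one
            counts_en = cooc_table.setdefault(word_fr, defaultdict(int))
            for word_en in words_en:
                counts_en[word_en] += 1
    return cooc_table

# toy sentence pairs (assumed data, uppercase like the corpus output)
toy_pairs = [
    ("LA LUNE", "THE MOON"),
    ("LA TERRE", "THE EARTH"),
]
toy_table = build_cooc_from_pairs(toy_pairs)
print(dict(toy_table["LA"]))    # counts: THE 2, MOON 1, EARTH 1 (key order may vary)
print(dict(toy_table["LUNE"]))  # counts: THE 1, MOON 1 (key order may vary)
```

"LA" appears in both sentence pairs, so it co-occurs twice with "THE" but only once with "MOON" and "EARTH".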

In [43]:
cooc_table = build_cooc_table('french.corpus', 'english.corpus')
cooc_table

{'TERRE': defaultdict(int,
             {'TO': 1, 'THE': 1, 'FROM': 1, 'EARTH': 1, 'MOON': 1}),
 'LUNE': defaultdict(int,
             {'TO': 1,
              'THE': 2,
              'FROM': 1,
              'EARTH': 1,
              'MOON': 2,
              'OF': 1,
              'ROMANCE': 1}),
 'DE': defaultdict(int,
             {'TO': 1,
              'THE': 7,
              'FROM': 2,
              'EARTH': 1,
              'MOON': 2,
              "'S": 1,
              'COMMUNICATION': 1,
              'PRESIDENT': 1,
              'EFFECT': 1,
              'OF': 6,
              'CAMBRIDGE': 1,
              'OBSERVATORY': 1,
              'REPLY': 1,
              'ROMANCE': 1,
              'IGNORANCE': 1,
              'AND': 1,
              'BELIEF': 1,
              'PERMISSIVE': 1,
              'UNITED': 1,
              'STATES': 1,
              'LIMITS': 1,
              'IN': 1,
              'CASTING': 1,
              'FETE': 1,
              'PASSENGER': 1,
   

## Sorting the cooc table and printing it to a file

In [48]:
def sorted_cooc(cooc_table):
    """
    Extract top co-occurrences for each French word
    """
    top_coocs = []
    for word_fr in cooc_table:
        # we sort the co-occurrences of each French word by frequency,
        # in descending order (highest to lowest)
        
        # the sorted function has a key parameter which takes a function returning the value to compare;
        # since we order the (word_en, freq) tuples by frequency, that function should return
        # the second element of each tuple (index 1)
        
        # lambda functions are a quick way of writing functions:
        # lambda cooc_tuple: cooc_tuple[1]
        # is equivalent to
        # def return_freq(cooc_tuple):
        #     return cooc_tuple[1]
        sorted_coocs = sorted(cooc_table[word_fr].items(), key=lambda x: x[1], reverse=True)  # .items() returns a view of (key, value) tuples
        
        # sorted_coocs is a list of tuples (word_en, freq) in descending order of frequency
        # we now retrieve the top-occurring tuple sorted_coocs[0] and create a new tuple with
        # the French word, the English word (element [0] of the top tuple) and the freq (element [1] of the top tuple)
        top_coocs.append((word_fr, sorted_coocs[0][0], sorted_coocs[0][1]))  # append the tuple to the list
    
    # finish by sorting the tuples in alphabetical order
    # using the default settings:
    # which element: tuples are compared element by element, starting with the first one, i.e. the French word
    # which order: ascending (from a to z)
    top_coocs.sort()
    return top_coocs
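
Since only the top entry of each sorted list is kept, the same result can probably be obtained with `max` and the same key function, which avoids sorting every list in full. The sketch below is an illustrative variant, not the notebook's method; the name `top_coocs_with_max` and the toy table are assumptions.

```python
def top_coocs_with_max(cooc_table):
    """Hypothetical variant of sorted_cooc that uses max() instead of a full sort."""
    top_coocs = []
    for word_fr, counts_en in cooc_table.items():
        # max with a key function returns the (word_en, freq) pair with the highest freq
        word_en, freq = max(counts_en.items(), key=lambda item: item[1])
        top_coocs.append((word_fr, word_en, freq))
    # default tuple ordering: alphabetical by French word
    top_coocs.sort()
    return top_coocs

# toy table (assumed data) in the same shape as cooc_table
toy_table = {
    "LUNE": {"THE": 2, "MOON": 3, "OF": 1},
    "TERRE": {"EARTH": 2, "THE": 1},
}
print(top_coocs_with_max(toy_table))
# [('LUNE', 'MOON', 3), ('TERRE', 'EARTH', 2)]
```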

In [50]:
# we can finish by writing this lexicon to a file
with open('./naive_lexicon.txt', 'w') as f:
    f.write(pp.pformat(sorted_cooc(cooc_table))) # pformat will write a "prettier" version of the list to the file
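
Because `pformat` writes the list as a valid Python literal, the file can be read back with `ast.literal_eval`. The following is a minimal sketch of that round trip; the toy lexicon and the file name `toy_lexicon.txt` are assumptions, and a temporary directory is used so nothing is left on disk.

```python
import ast
import os
import pprint as pp
import tempfile

# toy lexicon (assumed data) in the same shape as the output of sorted_cooc
lexicon = [("LUNE", "MOON", 3), ("TERRE", "EARTH", 2)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_lexicon.txt")
    # write the pretty-printed list, as in the cell above
    with open(path, "w") as f:
        f.write(pp.pformat(lexicon))
    # literal_eval safely parses Python literals (lists, tuples, strings, numbers)
    with open(path) as f:
        reloaded = ast.literal_eval(f.read())

print(reloaded == lexicon)  # True
```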

### A Quick Overview of Lambda Functions
Lambda functions are a shortcut for declaring small, single-expression anonymous functions.
They behave just like regular functions declared with the "def" keyword.
Lambdas are restricted to a single expression, so there is no return statement: the value of the expression is returned implicitly.

In practice:
They are most frequently used to write short and concise "key functions" for sorting iterables by an alternate key, as in the sorted_cooc function above.
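
For the common case of sorting by one position in a tuple, the standard library's `operator.itemgetter` can stand in for the lambda. A small sketch, with made-up (word, frequency) pairs as an assumption:

```python
from operator import itemgetter

# toy (word_en, freq) pairs (assumed data)
coocs = [("MOON", 2), ("THE", 7), ("OF", 6)]

# both key functions return the element at index 1 of each tuple
by_lambda = sorted(coocs, key=lambda pair: pair[1], reverse=True)
by_itemgetter = sorted(coocs, key=itemgetter(1), reverse=True)

print(by_lambda == by_itemgetter)  # True
print(by_itemgetter[0])            # ('THE', 7)
```

Which form reads better is mostly a matter of taste; the lambda keeps everything visible inline, while `itemgetter(1)` names the intent.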

In [2]:
# Some examples:
add = lambda x, y: x + y 
print(add(5,3))

# Can be used directly inline as an expression :
(lambda x, y: x + y)(5,3)

8


8

In [6]:
# For sorting :
tuples = [(1, 'd'), (2, 'b'), (3, 'a')]
print(sorted(tuples, key=lambda x : x[1]))

print(sorted(range(-5, 6), key=lambda x: x * x))

[(3, 'a'), (2, 'b'), (1, 'd')]
[0, -1, 1, -2, 2, -3, 3, -4, 4, -5, 5]


In [36]:
# Caveat :
# Although it can look "cool" to use lambdas whenever you can, it's not always the clearest way to write your code...
# Take a second to think if using a lambda function is really the best way to go
# If you find yourself doing something remotely complex with a lambda function, using a classic "def" function is usually a better idea

# When filtering a list for example:
print(list(filter(lambda x: x % 2 == 0, range(16)))) # not necessarily as readable

# vs.
print([x for x in range(16) if x % 2 == 0]) # usually a little clearer

# vs.

def filter_odd_numbers(nums_list):
    only_evens = []
    for x in nums_list:
        if x % 2 == 0:
            only_evens.append(x)
    return only_evens

print(filter_odd_numbers(range(16)))

[0, 2, 4, 6, 8, 10, 12, 14]
[0, 2, 4, 6, 8, 10, 12, 14]
[0, 2, 4, 6, 8, 10, 12, 14]
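
The same caveat applies to key functions. Once the sort criterion has more than one part, a named "def" function with a docstring is often easier to read than a lambda. A minimal sketch, where `by_freq_then_word` and the toy pairs are hypothetical:

```python
def by_freq_then_word(cooc_tuple):
    """Hypothetical key function: highest frequency first, ties broken alphabetically."""
    word_en, freq = cooc_tuple
    # negating the frequency makes an ascending sort put the largest counts first
    return (-freq, word_en)

# toy (word_en, freq) pairs (assumed data)
coocs = [("OF", 1), ("MOON", 2), ("THE", 2), ("EARTH", 1)]
print(sorted(coocs, key=by_freq_then_word))
# [('MOON', 2), ('THE', 2), ('EARTH', 1), ('OF', 1)]
```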


In [10]:
# The "Zen of Python" Easter egg by Tim Peters
# A set of guiding principles for Python's design (PEP 20) that you can revisit as often as you like to become a better Pythonista
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
