Some recommendations:

- *Do not try to find a clever or optimized solution; first write something that works.*
- *Please do not copy the solution from your colleagues.*

# Wordcount

- [Wikipedia](https://en.wikipedia.org/wiki/Word_count)

- The word count example reads text files and counts how often each word occurs.
- Word counts are commonly used by translators to price a translation job.
- This is the "Hello World" program of Big Data.

# Create sample text file

In [30]:
from lorem import text

with open("sample.txt", "w") as f:
    for i in range(10000):
        f.write(text())

### Exercise 1

Write a Python program that counts the number of lines, words and characters in that file.

In [14]:
%%bash
wc sample.txt
wc -l sample.txt
wc -w sample.txt
du -h sample.txt

   70036  2013573 14225998 sample.txt
70036 sample.txt
2013573 sample.txt
14M	sample.txt


- Compute number of lines

In [19]:
with open("sample.txt") as f:
    lines = list(f)
len(lines)

70037

- Compute number of words

In [20]:
nwords = 0
for line in lines:
    nwords += len(line.split())
nwords

2013573

- Compute number of characters

In [21]:
nchars = 0
for line in lines:
    nchars += len(line)
nchars

14225998
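The three loops above can be folded into a single pass over the file. The sketch below is one possible arrangement, not a reference solution; the helper name `count_file` and the tiny temporary file are assumptions made for illustration. Note that `wc -l` counts newline characters, so a file whose last line has no trailing newline reports one line fewer than `len(lines)`, which would explain 70036 versus 70037 above.

```python
import os
import tempfile


def count_file(filename):
    """Return (lines, words, characters) for a text file, reading it once."""
    nlines = nwords = nchars = 0
    with open(filename) as f:
        for line in f:
            nlines += 1
            nwords += len(line.split())
            nchars += len(line)
    return nlines, nwords, nchars


# Hypothetical two-line file; the last line has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.txt")
    with open(path, "w") as f:
        f.write("lorem ipsum dolor\nsit amet")
    counts = count_file(path)

print(counts)
```

Iterating over the file object avoids building the full `lines` list in memory, which matters more as the file grows.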

- `set` returns the unique elements of the `words` list.

In [8]:
s = set(words)
s

{'Dolor',
 'Est',
 'Ipsum',
 'Neque',
 'Numquam',
 'Tempora',
 'adipisci',
 'amet',
 'amet.',
 'dolore',
 'eius',
 'est',
 'est.',
 'ipsum',
 'labore.',
 'magnam',
 'modi',
 'neque',
 'non',
 'numquam',
 'numquam.',
 'quaerat',
 'quiquia',
 'tempora',
 'tempora.',
 'ut.',
 'velit',
 'velit.',
 'voluptatem'}
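The set above still treats `'amet'` and `'amet.'`, or `'Est'` and `'est'`, as different words. A minimal sketch of one way to normalise words before building the set, assuming that only trailing periods and letter case need to be handled (the sample sentence is made up for illustration):

```python
# Made-up sample; the real notebook reads its words from sample.txt
words = "Est ipsum amet. Neque est amet ipsum.".split()

unique_raw = set(words)

# Strip surrounding periods and lowercase each word before deduplicating
unique_clean = {w.strip(".").lower() for w in words}

print(sorted(unique_raw))
print(sorted(unique_clean))
```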

### Exercise 2

Create a function called `wordcount` that takes a file name as argument and returns a list containing all words as items.

```pytb
wordcount("sample.txt")
['labore', 'modi', 'ipsum', 'eius', 'eius', 'tempora', 'sed']
```

In [75]:
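def wordcount(filename):
    """Return the sorted list of all words found in the file."""
    words = []
    with open(filename) as f:  # the file is closed when the block exits
        for line in f:
            # Turn periods into spaces so 'amet.' and 'amet' are the same word
            words.extend(line.replace('.', ' ').split())
    words.sort()
    return words

wordcount("sample.txt")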

['Adipisci',
 'Adipisci',
 'Adipisci',
 ...

## Sorting a dictionary by value

By default, calling the `sorted` function on a `dict` sorts its keys.
To sort by value, you can use [operator](https://docs.python.org/3.6/library/operator.html).itemgetter(1).
It returns a callable object that fetches an item from its operand using the operand's `__getitem__()` method, so it can serve as the `key` argument of `sorted`.

In [7]:
import operator
fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
getcount = operator.itemgetter(1)
dict(sorted(fruits, key=getcount))

{'orange': 1, 'banana': 2, 'apple': 3, 'pear': 5}

The `sorted` function also has an optional `reverse` argument.

In [8]:
dict(sorted(fruits, key=getcount, reverse=True))

{'pear': 5, 'apple': 3, 'banana': 2, 'orange': 1}
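The cells above sort a list of pairs. The same idea applies to an actual `dict` through its `items()` view; the sketch below is an illustration with made-up counts, not part of the exercise solution.

```python
import operator

counts = {"apple": 3, "banana": 2, "pear": 5, "orange": 1}

# Iterating over a dict yields its keys, so sorted() returns sorted keys
by_key = sorted(counts)

# items() yields (key, value) pairs; itemgetter(1) selects the value
by_value = sorted(counts.items(), key=operator.itemgetter(1))

# reverse=True plus slicing gives the two largest counts
top_two = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)[:2]

print(by_key)
print(by_value)
print(top_two)
```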

### Exercise 3

Modify the function `wordcount` to reduce the list of words and return a dictionary containing all words as keys and number of occurrences as values.

```pytb
wordcount('sample.txt')
{'tempora': 2, 'non': 1, 'quisquam': 1, 'amet': 1, 'sit': 1}
```

In [76]:
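def reduce(sorted_words):
    """Count occurrences in a list of words that is already sorted."""
    res = {}
    current_word = None
    for word in sorted_words:
        if word == current_word:
            # Same word as the previous item: increment its count
            res[word] += 1
        else:
            # First occurrence of a new word in the sorted sequence
            res[word] = 1
            current_word = word
    return res
reduce(wordcount("sample.txt"))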

{'Adipisci': 12499,
 'Aliquam': 12527,
 'Amet': 12507,
 'Consectetur': 12536,
 'Dolor': 12542,
 'Dolore': 12526,
 'Dolorem': 12363,
 'Eius': 12573,
 'Est': 12439,
 'Etincidunt': 12452,
 'Ipsum': 12491,
 'Labore': 12538,
 'Magnam': 12452,
 'Modi': 12447,
 'Neque': 12482,
 'Non': 12450,
 'Numquam': 12478,
 'Porro': 12428,
 'Quaerat': 12593,
 'Quiquia': 12554,
 'Quisquam': 12586,
 'Sed': 12456,
 'Sit': 12458,
 'Tempora': 12557,
 'Ut': 12270,
 'Velit': 12481,
 'Voluptatem': 12530,
 'adipisci': 62119,
 'aliquam': 62349,
 'amet': 62650,
 'consectetur': 62408,
 'dolor': 62461,
 'dolore': 62197,
 'dolorem': 62551,
 'eius': 63160,
 'est': 62250,
 'etincidunt': 62519,
 'ipsum': 62407,
 'labore': 62438,
 'magnam': 62456,
 'modi': 62526,
 'neque': 62364,
 'non': 62790,
 'numquam': 62339,
 'porro': 62799,
 'quaerat': 61834,
 'quiquia': 62635,
 'quisquam': 62542,
 'sed': 62319,
 'sit': 62306,
 'tempora': 62567,
 'ut': 62642,
 'velit': 62297,
 'voluptatem': 62432}

- Example of a lambda function

In [39]:
f = lambda x : x*x + 1
f(3)

10

- `reduce` function using the Python `KeyError` exception

In [77]:
def reduce(words):
    res = {}
    for word in words:
        try:
            res[word] += 1
        except KeyError:
            res[word] = 1
    return res

reduce(wordcount("sample.txt"))

{'Adipisci': 12499,
 'Aliquam': 12527,
 'Amet': 12507,
 'Consectetur': 12536,
 'Dolor': 12542,
 'Dolore': 12526,
 'Dolorem': 12363,
 'Eius': 12573,
 'Est': 12439,
 'Etincidunt': 12452,
 'Ipsum': 12491,
 'Labore': 12538,
 'Magnam': 12452,
 'Modi': 12447,
 'Neque': 12482,
 'Non': 12450,
 'Numquam': 12478,
 'Porro': 12428,
 'Quaerat': 12593,
 'Quiquia': 12554,
 'Quisquam': 12586,
 'Sed': 12456,
 'Sit': 12458,
 'Tempora': 12557,
 'Ut': 12270,
 'Velit': 12481,
 'Voluptatem': 12530,
 'adipisci': 62119,
 'aliquam': 62349,
 'amet': 62650,
 'consectetur': 62408,
 'dolor': 62461,
 'dolore': 62197,
 'dolorem': 62551,
 'eius': 63160,
 'est': 62250,
 'etincidunt': 62519,
 'ipsum': 62407,
 'labore': 62438,
 'magnam': 62456,
 'modi': 62526,
 'neque': 62364,
 'non': 62790,
 'numquam': 62339,
 'porro': 62799,
 'quaerat': 61834,
 'quiquia': 62635,
 'quisquam': 62542,
 'sed': 62319,
 'sit': 62306,
 'tempora': 62567,
 'ut': 62642,
 'velit': 62297,
 'voluptatem': 62432}
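The `try`/`except KeyError` pattern can also be written with `dict.get`, which returns a default value when the key is missing. This is a sketch of that alternative; the name `reduce_with_get` is chosen here only to avoid clashing with the `reduce` function above.

```python
def reduce_with_get(words):
    """Count words without requiring the input to be sorted."""
    res = {}
    for word in words:
        # get() returns 0 the first time a word is seen
        res[word] = res.get(word, 0) + 1
    return res


result = reduce_with_get(["sed", "ut", "sed", "non", "sed"])
print(result)
```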

# Container datatypes

The `collections` module implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers, `dict`, `list`, `set`, and `tuple`.

- `defaultdict`: dict subclass that calls a factory function to supply missing values
- `Counter`: dict subclass for counting hashable objects

## defaultdict

When you implemented the `reduce` function, you probably had some trouble adding key-value pairs to your `dict`. If you try to update the value of a key that is not present in the dict, the key is not created automatically and a `KeyError` is raised.

The `defaultdict` solves this problem. This container is a `dict` subclass that calls a factory function to supply missing values.
For example, using `list` as the `default_factory`, it is easy to group a sequence of key-value pairs into a dictionary of lists:

In [40]:
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

dict(d)

{'yellow': [1, 3], 'blue': [2, 4], 'red': [1]}
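To make the difference concrete, the sketch below contrasts a plain `dict`, the `setdefault` method and a `defaultdict` on the same made-up pairs. It also shows one side effect worth knowing: merely reading a missing key from a `defaultdict` inserts it.

```python
from collections import defaultdict

pairs = [("yellow", 1), ("blue", 2), ("yellow", 3)]

# A plain dict raises KeyError when a missing key is updated in place
plain = {}
try:
    plain["yellow"].append(1)
except KeyError as err:
    missing = err.args[0]

# setdefault() inserts an empty list the first time a key is seen
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

# defaultdict calls list() automatically for missing keys
d = defaultdict(list)
for k, v in pairs:
    d[k].append(v)

had_red_before = "red" in d
_ = d["red"]  # reading a missing key creates it with an empty list
has_red_after = "red" in d

print(missing, grouped, dict(d))
```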

### Exercise 4

- Modify the reduce part of the `wordcount` function you wrote above to use a `defaultdict` with the most suitable factory.

In [78]:
from collections import defaultdict

def reduce(sorted_words):
    d = defaultdict(int)
    for w in sorted_words:
        d[w] += 1
    return dict(sorted(d.items(), key=lambda v:v[1]))

%timeit reduce(wordcount("sample.txt"))
reduce(wordcount("sample.txt"))

996 ms ± 65.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


{'Ut': 12270,
 'Dolorem': 12363,
 'Porro': 12428,
 'Est': 12439,
 'Modi': 12447,
 'Non': 12450,
 'Etincidunt': 12452,
 'Magnam': 12452,
 'Sed': 12456,
 'Sit': 12458,
 'Numquam': 12478,
 'Velit': 12481,
 'Neque': 12482,
 'Ipsum': 12491,
 'Adipisci': 12499,
 'Amet': 12507,
 'Dolore': 12526,
 'Aliquam': 12527,
 'Voluptatem': 12530,
 'Consectetur': 12536,
 'Labore': 12538,
 'Dolor': 12542,
 'Quiquia': 12554,
 'Tempora': 12557,
 'Eius': 12573,
 'Quisquam': 12586,
 'Quaerat': 12593,
 'quaerat': 61834,
 'adipisci': 62119,
 'dolore': 62197,
 'est': 62250,
 'velit': 62297,
 'sit': 62306,
 'sed': 62319,
 'numquam': 62339,
 'aliquam': 62349,
 'neque': 62364,
 'ipsum': 62407,
 'consectetur': 62408,
 'voluptatem': 62432,
 'labore': 62438,
 'magnam': 62456,
 'dolor': 62461,
 'etincidunt': 62519,
 'modi': 62526,
 'quisquam': 62542,
 'dolorem': 62551,
 'tempora': 62567,
 'quiquia': 62635,
 'ut': 62642,
 'amet': 62650,
 'non': 62790,
 'porro': 62799,
 'eius': 63160}

## Counter

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts.

Elements are counted from an iterable or initialized from another mapping (or counter):

In [14]:
from collections import Counter

violet = dict(r=23,g=13,b=23)
print(violet)
cnt = Counter(violet)  # or Counter(r=23, g=13, b=23)
print(cnt['c'])
print(cnt['r'])

{'r': 23, 'g': 13, 'b': 23}
0
23


In [15]:
print(*cnt.elements())

r r r r r r r r r r r r r r r r r r r r r r r g g g g g g g g g g g g g b b b b b b b b b b b b b b b b b b b b b b b


In [16]:
cnt.most_common(2)

[('r', 23), ('b', 23)]

In [17]:
cnt.values()

dict_values([23, 13, 23])
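The cells above initialise a `Counter` from a mapping. The sketch below illustrates the other case mentioned in the text, counting from an iterable, together with a negative count; the word list is made up for illustration.

```python
from collections import Counter

# Counting directly from an iterable of hashable items
cnt = Counter("sed ut sed non sed ut".split())

top = cnt.most_common(1)

# A missing key returns 0 instead of raising KeyError, and is not inserted
absent = cnt["porro"]
still_absent = "porro" not in cnt

# subtract() may drive a count below zero
cnt.subtract({"ut": 5})

print(top, absent, cnt["ut"])
```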

### Exercise 5

Use a `Counter` object to count word occurrences in the sample text file.

In [48]:
from collections import Counter

def wordcounter(filename):
    return Counter(wordcount(filename))

%timeit wordcounter("sample.txt")
wordcounter("sample.txt")

880 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Counter({'Adipisci': 12499,
         'Aliquam': 12527,
         'Amet': 12507,
         'Consectetur': 12536,
         'Dolor': 12542,
         'Dolore': 12526,
         'Dolorem': 12363,
         'Eius': 12573,
         'Est': 12439,
         'Etincidunt': 12452,
         'Ipsum': 12491,
         'Labore': 12538,
         'Magnam': 12452,
         'Modi': 12447,
         'Neque': 12482,
         'Non': 12450,
         'Numquam': 12478,
         'Porro': 12428,
         'Quaerat': 12593,
         'Quiquia': 12554,
         'Quisquam': 12586,
         'Sed': 12456,
         'Sit': 12458,
         'Tempora': 12557,
         'Ut': 12270,
         'Velit': 12481,
         'Voluptatem': 12530,
         'adipisci': 62119,
         'aliquam': 62349,
         'amet': 62650,
         'consectetur': 62408,
         'dolor': 62461,
         'dolore': 62197,
         'dolorem': 62551,
         'eius': 63160,
         'est': 62250,
         'etincidunt': 62519,
         'ipsum': 62407,
         'la

The Counter class is similar to bags or multisets in some Python libraries or other languages. We will see later how to use Counter-like objects in a parallel context. 

# Process two files

- Create two files containing `lorem` text named 'sample1.txt' and 'sample2.txt'.
- If you process these two files, you get two dictionaries.
- You have to loop over them to sum the occurrences and return the resulting dict. To iterate over several mappings, the Python standard library provides useful features in the `itertools` module.
- [itertools.chain(*mapped_values)](https://docs.python.org/3.6/library/itertools.html#itertools.chain) can be used to treat consecutive sequences as a single sequence.

In [19]:
import itertools, operator
fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
vegetables = [('endive', 2), ('spinach', 1), ('celery', 5), ('carrot', 4)]
getcount = operator.itemgetter(1)
dict(sorted(itertools.chain(fruits,vegetables), key=getcount))

{'orange': 1,
 'spinach': 1,
 'banana': 2,
 'endive': 2,
 'apple': 3,
 'carrot': 4,
 'pear': 5,
 'celery': 5}
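In the fruit and vegetable example no key appears twice, so `dict(...)` loses nothing. Word counts from two files share most of their keys, and the values then have to be summed. A minimal sketch of that merge, assuming two small made-up dictionaries in place of real per-file results:

```python
from collections import defaultdict
from itertools import chain

# Made-up per-file word counts
counts1 = {"sed": 3, "ut": 1}
counts2 = {"sed": 2, "non": 4}

merged = defaultdict(int)
# chain() walks the items of both dicts as one sequence of (word, n) pairs
for word, n in chain(counts1.items(), counts2.items()):
    merged[word] += n

merged = dict(merged)
print(merged)
```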

### Exercise 6

- Write a program that creates both files, processes them and uses `itertools.chain` to get the merged word count dictionary.

In [54]:
from lorem import text
for i in range(4):
    with open("sample{}.txt".format(i), "w") as f:
        for x in range(10000):
            f.write(text())

In [56]:
from glob import glob

glob("*.txt")

['sample2.txt', 'sample.txt', 'sample1.txt', 'sample3.txt', 'sample0.txt']

In [79]:
from itertools import chain

words1 = wordcount("sample1.txt")
words2 = wordcount("sample2.txt")

reduce(chain(words1,words2)) # word count on two files

{'Voluptatem': 24761,
 'Porro': 24766,
 'Quiquia': 24821,
 'Quisquam': 24845,
 'Dolor': 24848,
 'Est': 24855,
 'Velit': 24881,
 'Numquam': 24889,
 'Dolorem': 24915,
 'Amet': 24932,
 'Aliquam': 24941,
 'Modi': 24951,
 'Eius': 24959,
 'Consectetur': 24987,
 'Etincidunt': 25014,
 'Non': 25020,
 'Quaerat': 25024,
 'Magnam': 25027,
 'Sit': 25044,
 'Ipsum': 25046,
 'Tempora': 25057,
 'Sed': 25080,
 'Dolore': 25088,
 'Adipisci': 25094,
 'Neque': 25119,
 'Labore': 25148,
 'Ut': 25320,
 'voluptatem': 124296,
 'porro': 124381,
 'quisquam': 124530,
 'labore': 124688,
 'dolore': 124696,
 'tempora': 124710,
 'eius': 124724,
 'velit': 124753,
 'dolorem': 124773,
 'est': 124796,
 'consectetur': 124806,
 'ut': 124858,
 'dolor': 124882,
 'quaerat': 124893,
 'non': 124910,
 'adipisci': 124997,
 'ipsum': 125041,
 'amet': 125057,
 'neque': 125058,
 'aliquam': 125122,
 'modi': 125183,
 'numquam': 125211,
 'quiquia': 125211,
 'sit': 125254,
 'etincidunt': 125363,
 'sed': 125463,
 'magnam': 125541}

- Word count on a list of files

In [80]:
from itertools import chain
from glob import glob

reduce(chain(*[wordcount(file) for file in glob("sample*.txt")]))

{'Est': 62119,
 'Numquam': 62147,
 'Magnam': 62154,
 'Dolorem': 62245,
 'Porro': 62256,
 'Quiquia': 62267,
 'Ut': 62393,
 'Modi': 62403,
 'Amet': 62417,
 'Dolor': 62433,
 'Etincidunt': 62437,
 'Non': 62439,
 'Quisquam': 62464,
 'Consectetur': 62465,
 'Aliquam': 62487,
 'Sed': 62494,
 'Voluptatem': 62523,
 'Labore': 62528,
 'Ipsum': 62533,
 'Velit': 62606,
 'Adipisci': 62641,
 'Dolore': 62694,
 'Tempora': 62772,
 'Quaerat': 62801,
 'Neque': 62907,
 'Eius': 62937,
 'Sit': 62964,
 'quisquam': 311558,
 'dolore': 311620,
 'voluptatem': 311713,
 'quaerat': 311843,
 'labore': 311954,
 'est': 312053,
 'consectetur': 312160,
 'dolor': 312217,
 'velit': 312301,
 'aliquam': 312371,
 'sit': 312434,
 'tempora': 312501,
 'porro': 312508,
 'modi': 312546,
 'adipisci': 312573,
 'non': 312623,
 'ipsum': 312631,
 'dolorem': 312718,
 'sed': 312816,
 'amet': 312867,
 'etincidunt': 312943,
 'numquam': 313024,
 'quiquia': 313068,
 'ut': 313085,
 'neque': 313317,
 'magnam': 313354,
 'eius': 313905}
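A `Counter`-based variant of the same multi-file count is sketched below. It is an alternative to the `chain` approach, not the solution used in this notebook; the in-memory strings stand in for file contents and the helper `count_words` is hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for the contents of two files
texts = ["Sed ut sed.", "Ut non. Sed"]


def count_words(text):
    """Lowercase the text, turn periods into spaces and count the words."""
    return Counter(text.lower().replace(".", " ").split())


# Counters support '+', so sum() can merge them starting from an empty Counter
total = sum((count_words(t) for t in texts), Counter())

print(total.most_common())
```

Note that `+` on counters keeps only strictly positive counts, which is harmless for word counting.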

- Example of string formatting (f-string, `str.format` and the `%` operator)

In [27]:
from math import pi

print(f" The value of pi is {pi:07.3f}")
print(" The value of pi is {:07.3f}".format(pi))
print(" The value of pi is %07.3f " % (pi))

 The value of pi is 003.142
 The value of pi is 003.142
 The value of pi is 003.142 
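The specifier `07.3f` used above reads as: total width 7, padded with zeros, 3 digits after the decimal point, fixed-point notation. A few more specifiers are sketched below for comparison; the variable names are only illustrative.

```python
from math import pi

zero_padded = f"{pi:07.3f}"     # width 7, zero padded, 3 decimals
two_decimals = f"{pi:.2f}"      # no width, 2 decimals
right_aligned = f"{42:>6d}"     # integer right-aligned in 6 characters
percentage = f"{0.256:.1%}"     # multiplied by 100, 1 decimal, '%' appended

print(zero_padded, two_decimals, right_aligned, percentage)
```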


### Exercise 7

- Modify the `wordcount` function to accept several files as arguments and return the resulting dict.

```
wordcount(file1, file2, file3, ...)
```

[Hint: arbitrary argument lists](https://docs.python.org/3/tutorial/controlflow.html#arbitrary-argument-lists)

In [82]:
from itertools import chain
from glob import glob

def wordcount(*args): # arbitrary argument list

    # MAP
    mapped_values = []
    for filename in args:
        with open(filename) as f:
            data = f.read()
        # Replace periods with spaces (not with '') so that the last word of
        # a sentence is not glued to the first word of the next one
        words = data.lower().replace('.', ' ').split()
        mapped_values.append(sorted(words))

    # REDUCE
    return reduce(chain(*mapped_values))

wordcount(*glob("sample*.txt"))


- Example of an arbitrary argument list (`*args`) and arbitrary keyword arguments (`**kwargs`).

In [83]:
def func(*args, **kwargs):
    for arg in args:
        print(arg)

    print(kwargs)

func("3", [1, 2], "bonjour", x=4, y="y")

3
[1, 2]
bonjour
{'x': 4, 'y': 'y'}


In [33]:
it = range(10)
print(*it) # the unpacking operator '*' passes each element of the iterable as a separate argument

0 1 2 3 4 5 6 7 8 9


In [34]:
files = glob("sample*.txt")
files

['sample06.txt',
 'sample07.txt',
 'sample05.txt',
 'sample04.txt',
 'sample00.txt',
 'sample01.txt',
 'sample03.txt',
 'sample02.txt',
 'sample.txt']

In [35]:
func(*files)

sample06.txt
sample07.txt
sample05.txt
sample04.txt
sample00.txt
sample01.txt
sample03.txt
sample02.txt
sample.txt
{}
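The `*` operator unpacks a sequence into positional arguments; `**` does the same for a dictionary and keyword arguments. A short sketch, where the function `describe` and the option names are hypothetical:

```python
def describe(*args, **kwargs):
    """Return the number of positional arguments and the sorted keyword names."""
    return len(args), sorted(kwargs)


files = ["sample1.txt", "sample2.txt"]
options = {"encoding": "utf-8", "lower": True}

# Equivalent to describe("sample1.txt", "sample2.txt", encoding="utf-8", lower=True)
result = describe(*files, **options)
print(result)
```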
