Some recommendations:
- *Don't google too much; ask me or use the Python documentation through the `help` function.*
- *Do not try to find a clever or optimized solution; get something that works first.*
- *Please don't get the solution from your colleagues.*
- *Notebooks will be updated next week with solutions.*

# Wordcount

- [Wikipedia](https://en.wikipedia.org/wiki/Word_count)

- The word count example reads text files and counts how often words occur.
- Translators commonly use the word count to determine the price of a translation job.
- This is the "Hello World" program of Big Data.

# Create sample text file

In [201]:
from lorem import text

with open("sample.txt", "w") as f:
    f.write(text())

### Exercise 4.1

Write a Python program that counts the number of lines, words and characters in that file.

In [202]:
%%cmd
echo "Number of lines:"
find/v /c "" sample.txt

Microsoft Windows [version 10.0.16299.665]
(c) 2017 Microsoft Corporation. All rights reserved.

(big-data) C:\Users\bruno\Desktop\S9\Outils du Big Data\big-data\notebooks>echo "Number of lines:"
"Number of lines:"

(big-data) C:\Users\bruno\Desktop\S9\Outils du Big Data\big-data\notebooks>find/v /c "" sample.txt

---------- SAMPLE.TXT: 11

(big-data) C:\Users\bruno\Desktop\S9\Outils du Big Data\big-data\notebooks>

In [203]:
def wc(filename):
    with open(filename, 'r') as f:
        text = f.read()
        values = [filename, 1, 1, 0]
        for i in text:
            if i == '\n':
                values[1]+=1
            if i == ' ':
                values[2]+=1 # Problem: words at the start of a new line are not counted
            values[3]+=1
    print(values)

wc("sample.txt")

['sample.txt', 11, 234, 1683]
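
The comment in `wc` points at a real limitation: a newline also separates words, so counting only spaces misses every word that starts a new line. The sketch below uses a made-up string (not the content of `sample.txt`) to show the effect, and compares it with `str.split()`, which splits on any run of whitespace when called without an argument.

```python
# Hypothetical sample text, chosen only for illustration
text = "one two three\nfour five\n\nsix"

# Approach used in wc: start at 1 and add 1 for every space character
by_spaces = 1 + text.count(" ")

# Alternative: split on any whitespace (spaces, newlines, tabs)
by_split = len(text.split())

print(by_spaces, by_split)  # 4 6
```

The string contains six words, but only three spaces, so the space-based count stops at four.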


In [204]:
def wc2(filename, without_spaces=False):
    """ Take a file and print a list containing the following information:
    _ file name
    _ number of lines
    _ number of words
    _ number of characters
    """
    values = [filename, 0, 0, 0]
    with open(filename, 'r') as f:
        lines = f.readlines()
        values[1] = len(lines)
        values[2] = sum([len(line.split()) for line in lines])
        if without_spaces:
            values[3] = sum([sum([len(word) for word in line.split()]) for line in lines])
        else:
            values[3] = sum([len(line) for line in lines])
    print(values)

wc2("sample.txt")
wc2("sample.txt", True)

['sample.txt', 11, 239, 1683]
['sample.txt', 11, 239, 1440]


In [205]:
help(wc2)

Help on function wc2 in module __main__:

wc2(filename, without_spaces=False)
    Take a file and print a list containing the following information:
    _ file name
    _ number of lines
    _ number of words
    _ number of characters



### Exercise 4.2

Create a function called `wordcount` that takes a file name as argument and returns a list containing all words as items.

```pytb
wordcount("sample.txt")
['labore', 'modi', 'ipsum', 'eius', 'eius', 'tempora', 'sed']
```

In [206]:
def wordcount(filename):
    """Function that takes a file name as argument
    and returns a list containing all words as items in alphabetical order."""
    with open(filename, 'r') as f:
        text = f.read()
        # lowercase everything and drop periods before splitting on whitespace
        words = text.lower().replace('.', '').split()
    return sorted(words)
    
wordcount("sample.txt")

['adipisci',
 'adipisci',
 'adipisci',
 'adipisci',
 'adipisci',
 'adipisci',
 'adipisci',
 'adipisci',
 'adipisci',
 'adipisci',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'aliquam',
 'amet',
 'amet',
 'amet',
 'amet',
 'amet',
 'amet',
 'consectetur',
 'consectetur',
 'consectetur',
 'consectetur',
 'consectetur',
 'consectetur',
 'consectetur',
 'dolor',
 'dolor',
 'dolor',
 'dolore',
 'dolore',
 'dolore',
 'dolore',
 'dolore',
 'dolore',
 'dolorem',
 'dolorem',
 'dolorem',
 'dolorem',
 'dolorem',
 'dolorem',
 'dolorem',
 'dolorem',
 'eius',
 'eius',
 'eius',
 'eius',
 'eius',
 'eius',
 'eius',
 'eius',
 'est',
 'est',
 'est',
 'est',
 'est',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'etincidunt',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'ipsum',
 'labore

## Sorting a dictionary by value

By default, if you use the `sorted` function on a `dict`, it sorts by keys.
To sort by values, you can use [operator](https://docs.python.org/3.6/library/operator.html).itemgetter(1), which returns a callable object that fetches item 1 from its operand using the operand’s `__getitem__()` method. Passed as the `key` argument of `sorted`, it sorts (key, value) pairs by value.

In [207]:
import operator
fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
getcount = operator.itemgetter(1)
dict(sorted(fruits, key=getcount))

{'apple': 3, 'banana': 2, 'orange': 1, 'pear': 5}

The `sorted` function also has an optional `reverse` argument.

In [208]:
dict(sorted(fruits, key=getcount, reverse=True))

{'apple': 3, 'banana': 2, 'orange': 1, 'pear': 5}
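
The two outputs above look identical, apparently because the notebook displays dictionary keys in alphabetical order regardless of how the pairs were sorted. The effect of `key` and `reverse` is easier to see on the list that `sorted` returns. A minimal sketch reusing the `fruits` data from above:

```python
import operator

fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
getcount = operator.itemgetter(1)

# sorted() always returns a list, here a list of (name, count) tuples
ascending = sorted(fruits, key=getcount)
descending = sorted(fruits, key=getcount, reverse=True)

print(ascending)   # [('orange', 1), ('banana', 2), ('apple', 3), ('pear', 5)]
print(descending)  # [('pear', 5), ('apple', 3), ('banana', 2), ('orange', 1)]

# The same call works on the items of a dict
counts = dict(fruits)
by_value = sorted(counts.items(), key=getcount, reverse=True)
```

Keeping the result as a list of tuples avoids depending on how a given Python or IPython version prints a `dict`.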

### Exercise 4.3

Modify the function `wordcount` to reduce the list of words and return a dictionary containing all words as keys and number of occurrences as values.

```pytb
wordcount('sample.txt')
{'tempora': 2, 'non': 1, 'quisquam': 1, 'amet': 1, 'sit': 1}
```

In [209]:
def reduce2(word_list):
    """Function that takes a list of words IN ALPHABETIC ORDER as argument and
    returns a dictionary containing all words as keys and number of occurrences as values"""
    word_dict = {}
    last_word = None
    for word in word_list:
        if word != last_word:
            word_dict[word] = 1
            last_word = word
        else:
            word_dict[word] += 1
    return word_dict
    #return sorted(word_dict.items(), reverse=True, key=operator.itemgetter(1))
    #return sorted(word_dict.items(), reverse=True, key=lambda t:t[1])
    
    
reduce2(wordcount("sample.txt"))
#sum(reduce2(wordcount("sample.txt")).values())

{'adipisci': 10,
 'aliquam': 11,
 'amet': 6,
 'consectetur': 7,
 'dolor': 3,
 'dolore': 6,
 'dolorem': 8,
 'eius': 8,
 'est': 5,
 'etincidunt': 10,
 'ipsum': 11,
 'labore': 8,
 'magnam': 11,
 'modi': 15,
 'neque': 5,
 'non': 12,
 'numquam': 10,
 'porro': 8,
 'quaerat': 11,
 'quiquia': 10,
 'quisquam': 12,
 'sed': 8,
 'sit': 7,
 'tempora': 9,
 'ut': 8,
 'velit': 16,
 'voluptatem': 4}

In [210]:
def reduce(word_list):
    """Function that takes a list of words as argument and
    returns a dictionary containing all words as keys and number of occurrences as values"""
    word_dict = {}
    for word in word_list:
        try:
            word_dict[word] += 1
        except KeyError:
            word_dict[word] = 1
    return word_dict

reduce(wordcount("sample.txt"))

{'adipisci': 10,
 'aliquam': 11,
 'amet': 6,
 'consectetur': 7,
 'dolor': 3,
 'dolore': 6,
 'dolorem': 8,
 'eius': 8,
 'est': 5,
 'etincidunt': 10,
 'ipsum': 11,
 'labore': 8,
 'magnam': 11,
 'modi': 15,
 'neque': 5,
 'non': 12,
 'numquam': 10,
 'porro': 8,
 'quaerat': 11,
 'quiquia': 10,
 'quisquam': 12,
 'sed': 8,
 'sit': 7,
 'tempora': 9,
 'ut': 8,
 'velit': 16,
 'voluptatem': 4}
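
Another common way to write the same loop, shown here only as a sketch, relies on `dict.get` with a default of 0. It needs neither sorted input nor a `try`/`except`. The name `reduce_get` and the short word list are made up for illustration.

```python
def reduce_get(word_list):
    """Count occurrences of each word using dict.get with a default value."""
    word_dict = {}
    for word in word_list:
        # get() returns 0 when the word has not been seen yet
        word_dict[word] = word_dict.get(word, 0) + 1
    return word_dict

result = reduce_get(['sed', 'modi', 'sed', 'ut', 'sed'])
print(result)
```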

You have probably noticed that even such simple functions take some care to implement. The Python standard library provides some features that can help.

# Container datatypes

The `collections` module implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers, `dict`, `list`, `set`, and `tuple`.

- `defaultdict`: dict subclass that calls a factory function to supply missing values
- `Counter`: dict subclass for counting hashable objects

## defaultdict

When you implemented the `reduce` function, you probably had some trouble adding key-value pairs to your `dict`. If you try to update the value of a key that is not present in the dict, the key is not created automatically and a `KeyError` is raised.
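
A short sketch of that failure with a plain `dict`, using a made-up key, together with the `setdefault` workaround that is available without any import:

```python
counts = {}
try:
    counts['velit'] += 1      # reads counts['velit'] first, which does not exist
except KeyError as err:
    print("KeyError:", err)   # KeyError: 'velit'

# setdefault() inserts the key with a default value if it is missing
counts.setdefault('velit', 0)
counts['velit'] += 1
print(counts)                 # {'velit': 1}
```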

The `defaultdict` is the solution. This container is a `dict` subclass that calls a factory function to supply missing values.
For example, using `list` as the `default_factory`, it is easy to group a sequence of key-value pairs into a dictionary of lists:

In [211]:
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

dict(d)

{'blue': [2, 4], 'red': [1], 'yellow': [1, 3]}

### Exercise 4.4

- Modify the reduce part of the `wordcount` function you wrote above by using a `defaultdict` with the most suitable factory.

In [212]:
from collections import defaultdict

def reduce3(word_list):
    word_dict = defaultdict(int)
    for word in word_list:
        word_dict[word] += 1
    return dict(word_dict)

reduce3(wordcount("sample.txt"))

{'adipisci': 10,
 'aliquam': 11,
 'amet': 6,
 'consectetur': 7,
 'dolor': 3,
 'dolore': 6,
 'dolorem': 8,
 'eius': 8,
 'est': 5,
 'etincidunt': 10,
 'ipsum': 11,
 'labore': 8,
 'magnam': 11,
 'modi': 15,
 'neque': 5,
 'non': 12,
 'numquam': 10,
 'porro': 8,
 'quaerat': 11,
 'quiquia': 10,
 'quisquam': 12,
 'sed': 8,
 'sit': 7,
 'tempora': 9,
 'ut': 8,
 'velit': 16,
 'voluptatem': 4}

## Counter

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts.

Elements are counted from an iterable or initialized from another mapping (or counter):

In [213]:
from collections import Counter

violet = dict(r=23,g=13,b=23)
print(violet)
cnt = Counter(violet)  # or Counter(r=23, g=13, b=23)
print(cnt['c'])
print(cnt['r'])

{'g': 13, 'b': 23, 'r': 23}
0
23


In [214]:
print(*cnt.elements())

g g g g g g g g g g g g g b b b b b b b b b b b b b b b b b b b b b b b r r r r r r r r r r r r r r r r r r r r r r r


In [215]:
cnt.most_common(2)

[('b', 23), ('r', 23)]

In [216]:
cnt.values()

dict_values([13, 23, 23])
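
The remark that counts may be zero or negative can be checked with `Counter.subtract`. A short sketch with made-up stock data:

```python
from collections import Counter

stock = Counter(apple=3, pear=1)

# subtract() lowers counts in place and may take them to zero or below
stock.subtract(Counter(apple=1, pear=2, kiwi=1))

print(stock['apple'])  # 2
print(stock['pear'])   # -1
print(stock['kiwi'])   # -1

# Unary plus returns a new Counter holding only the positive counts
positive = +stock
print(positive)        # Counter({'apple': 2})
```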

### Exercise 4.5

Use a `Counter` object to count word occurrences in the sample text file.

In [217]:
from collections import Counter

def reduce4(word_list):
    cnt = Counter(word_list)
    return dict(cnt)

reduce4(wordcount("sample.txt"))    

{'adipisci': 10,
 'aliquam': 11,
 'amet': 6,
 'consectetur': 7,
 'dolor': 3,
 'dolore': 6,
 'dolorem': 8,
 'eius': 8,
 'est': 5,
 'etincidunt': 10,
 'ipsum': 11,
 'labore': 8,
 'magnam': 11,
 'modi': 15,
 'neque': 5,
 'non': 12,
 'numquam': 10,
 'porro': 8,
 'quaerat': 11,
 'quiquia': 10,
 'quisquam': 12,
 'sed': 8,
 'sit': 7,
 'tempora': 9,
 'ut': 8,
 'velit': 16,
 'voluptatem': 4}

The Counter class is similar to bags or multisets in some Python libraries or other languages. We will see later how to use Counter-like objects in a parallel context. 

# Process two files

- Create two files containing `lorem` text named 'sample1.txt' and 'sample2.txt'.
- If you process these two files you get two dictionaries.
- You have to loop over them to sum occurrences and return the resulting dict. To iterate over such mappings, the Python standard library provides some useful features in the `itertools` module.
- [itertools.chain(*mapped_values)](https://docs.python.org/3.6/library/itertools.html#itertools.chain) can be used to treat consecutive sequences as a single sequence.

In [218]:
import itertools, operator
fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
vegetables = [('endive', 2), ('spinach', 1), ('celery', 5), ('carrot', 4)]
getcount = operator.itemgetter(1)
dict(sorted(itertools.chain(fruits,vegetables), key=getcount))

{'apple': 3,
 'banana': 2,
 'carrot': 4,
 'celery': 5,
 'endive': 2,
 'orange': 1,
 'pear': 5,
 'spinach': 1}

### Exercise 4.6

- Write the program that creates both files, processes them, and uses `itertools.chain` to get the merged word count dictionary.

In [219]:
# Create two new text files

from lorem import text

with open('sample1.txt', 'w') as f:
    f.write(text())
with open('sample2.txt', 'w') as f:
    f.write(text())

In [220]:
from itertools import chain

def wordcount_two_files(filename1, filename2):
    """Function that takes two file names as arguments
    and returns a list containing all words of the two files in alphabetical order."""
    word_list1 = wordcount(filename1)
    word_list2 = wordcount(filename2)
    return sorted(word_list1 + word_list2)

reduce(wordcount_two_files("sample1.txt", "sample2.txt"))

{'adipisci': 13,
 'aliquam': 14,
 'amet': 14,
 'consectetur': 11,
 'dolor': 18,
 'dolore': 3,
 'dolorem': 12,
 'eius': 9,
 'est': 9,
 'etincidunt': 7,
 'ipsum': 15,
 'labore': 9,
 'magnam': 13,
 'modi': 10,
 'neque': 10,
 'non': 12,
 'numquam': 18,
 'porro': 16,
 'quaerat': 13,
 'quiquia': 9,
 'quisquam': 15,
 'sed': 13,
 'sit': 12,
 'tempora': 14,
 'ut': 14,
 'velit': 8,
 'voluptatem': 15}

In [221]:
from itertools import chain

wc1 = wordcount("sample1.txt")
wc2 = wordcount("sample2.txt")

reduce(chain(wc1, wc2))

{'adipisci': 13,
 'aliquam': 14,
 'amet': 14,
 'consectetur': 11,
 'dolor': 18,
 'dolore': 3,
 'dolorem': 12,
 'eius': 9,
 'est': 9,
 'etincidunt': 7,
 'ipsum': 15,
 'labore': 9,
 'magnam': 13,
 'modi': 10,
 'neque': 10,
 'non': 12,
 'numquam': 18,
 'porro': 16,
 'quaerat': 13,
 'quiquia': 9,
 'quisquam': 15,
 'sed': 13,
 'sit': 12,
 'tempora': 14,
 'ut': 14,
 'velit': 8,
 'voluptatem': 15}
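
Both solutions above merge the word lists before counting. The bullet points at the start of this section describe a slightly different route: count each file first, then sum the resulting dictionaries. A minimal sketch of that route, with two made-up dictionaries standing in for the per-file results and a hypothetical helper named `merge_counts`:

```python
from collections import defaultdict
from itertools import chain

def merge_counts(*count_dicts):
    """Sum several {word: count} dictionaries into one."""
    total = defaultdict(int)
    # chain() walks the (word, count) pairs of every dict as one sequence
    for word, count in chain(*(d.items() for d in count_dicts)):
        total[word] += count
    return dict(total)

# Hypothetical per-file results
counts1 = {'sed': 2, 'ut': 1}
counts2 = {'sed': 1, 'modi': 4}

merged = merge_counts(counts1, counts2)
print(merged)
```

Here `sed` ends up with a count of 3, while `ut` and `modi` keep their single-file counts.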

### Exercise 4.7

- Modify the `wordcount` function to accept several files as arguments and return the resulting dict.

```
wordcount(file1, file2, file3, ...)
```

[Hint: arbitrary argument lists](https://docs.python.org/3/tutorial/controlflow.html#arbitrary-argument-lists)

In [222]:
import lorem

def create_samples(n):
    """Create n files filled with text from the lorem package."""
    for i in range(n):
        filename = "sample" + str(i+1) + ".txt"
        with open(filename, 'w') as f:
            f.write(lorem.text())

create_samples(15)

In [223]:
def wordcount_n_files(*args):
    """Function that takes n file names as arguments
    and returns a list containing all words of the n files."""
    all_words = []
    # args is a tuple of file names; concatenate the word list of each file
    for filename in args:
        all_words += wordcount(filename)
    return all_words

reduce(wordcount_n_files("sample1.txt", "sample2.txt", "sample3.txt", "sample4.txt", "sample4.txt"))      

{'adipisci': 37,
 'aliquam': 33,
 'amet': 29,
 'consectetur': 38,
 'dolor': 51,
 'dolore': 34,
 'dolorem': 35,
 'eius': 31,
 'est': 36,
 'etincidunt': 27,
 'ipsum': 33,
 'labore': 27,
 'magnam': 41,
 'modi': 38,
 'neque': 38,
 'non': 38,
 'numquam': 36,
 'porro': 47,
 'quaerat': 38,
 'quiquia': 47,
 'quisquam': 39,
 'sed': 30,
 'sit': 35,
 'tempora': 44,
 'ut': 48,
 'velit': 36,
 'voluptatem': 48}

In [224]:
from glob import glob

glob("sample*.txt")

['sample.txt',
 'sample00.txt',
 'sample01.txt',
 'sample02.txt',
 'sample03.txt',
 'sample04.txt',
 'sample05.txt',
 'sample06.txt',
 'sample07.txt',
 'sample1.txt',
 'sample10.txt',
 'sample11.txt',
 'sample12.txt',
 'sample13.txt',
 'sample14.txt',
 'sample15.txt',
 'sample2.txt',
 'sample3.txt',
 'sample4.txt',
 'sample5.txt',
 'sample6.txt',
 'sample7.txt',
 'sample8.txt',
 'sample9.txt']

In [225]:
reduce(wordcount_n_files(*glob("sample*.txt")))

{'adipisci': 182,
 'aliquam': 207,
 'amet': 154,
 'consectetur': 169,
 'dolor': 198,
 'dolore': 162,
 'dolorem': 187,
 'eius': 186,
 'est': 187,
 'etincidunt': 168,
 'ipsum': 193,
 'labore': 183,
 'magnam': 223,
 'modi': 206,
 'neque': 179,
 'non': 194,
 'numquam': 189,
 'porro': 195,
 'quaerat': 182,
 'quiquia': 199,
 'quisquam': 201,
 'sed': 164,
 'sit': 179,
 'tempora': 217,
 'ut': 199,
 'velit': 208,
 'voluptatem': 198}

In [226]:
reduce(chain(*[wordcount(file) for file in glob("sample*.txt")]))

{'adipisci': 182,
 'aliquam': 207,
 'amet': 154,
 'consectetur': 169,
 'dolor': 198,
 'dolore': 162,
 'dolorem': 187,
 'eius': 186,
 'est': 187,
 'etincidunt': 168,
 'ipsum': 193,
 'labore': 183,
 'magnam': 223,
 'modi': 206,
 'neque': 179,
 'non': 194,
 'numquam': 189,
 'porro': 195,
 'quaerat': 182,
 'quiquia': 199,
 'quisquam': 201,
 'sed': 164,
 'sit': 179,
 'tempora': 217,
 'ut': 199,
 'velit': 208,
 'voluptatem': 198}
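
The exercise asks for a single function that returns the dictionary directly. One possible sketch combines `*args` with a `Counter`. The name `wordcount_files` and the two temporary files are assumptions made so the example runs on its own; the word splitting mirrors the `wordcount` function above.

```python
import os
import tempfile
from collections import Counter

def wordcount_files(*filenames):
    """Return a {word: count} dict for all the given files together."""
    cnt = Counter()
    for filename in filenames:
        with open(filename, 'r') as f:
            # same normalisation as wordcount: lowercase, drop periods
            cnt.update(f.read().lower().replace('.', '').split())
    return dict(cnt)

# Two small throwaway files, created only for this demonstration
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in [('a.txt', 'Sed ut sed.'), ('b.txt', 'Ut modi.\nSed')]:
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write(content)
        paths.append(path)
    result = wordcount_files(*paths)

print(result)
```

With the real sample files, the call could look like `wordcount_files(*glob("sample*.txt"))`.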