# Word count 🗄️
If you know "Hello, world!" from web development, "Word count" is its equivalent for distributed computing.

In this notebook, we will set up a MapReduce pipeline that counts the words in a document.

MapReduce is a programming model that can be implemented in many languages. For now, we will use plain Python to get familiar with this way of structuring code. In the next lectures, you'll learn how to implement the same task with Spark, which lets you scale to large documents by parallelizing some operations.

Below is a schematic representation of the MapReduce algorithm applied to word count:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/FULL_STACK_12_WEEK/M05/map_reduce_word_count_process-ed4d1e0b-1180-4609-88e1-e2b1054829e7.png" />

To implement this algorithm, we will proceed in steps:
* Reading the document
* Writing a `map` function that creates tuples in the form `('word', 1)` where each tuple is created by a word encountered in the document
* Writing a `group_by_key` function that creates a dictionary in the form `{'word': [1, 1, 1]}` where `[1, 1, 1]` means that 'word' has been encountered 3 times in the document
* Writing a `reduce` function that finally produces a dictionary in the form `{'word': 3}` where 3 is the total number of occurrences of 'word' in the document
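
The three stages listed above can be sketched end to end on a tiny in-memory document before touching the real file. This is only an illustration under assumptions: `sample_lines` and the other `sample_*` names are hypothetical and are not part of the exercise.

```python
# Toy document held in memory (hypothetical sample, not wordcount.txt)
sample_lines = ["the cat sat\n", "the cat ran\n"]

# Map: emit one ('word', 1) pair per word
sample_pairs = [(word, 1) for line in sample_lines for word in line.split()]

# Group by key: collect the 1s emitted for each word
sample_grouped = {}
for word, one in sample_pairs:
    sample_grouped.setdefault(word, []).append(one)

# Reduce: sum the 1s to get the number of occurrences of each word
sample_counts = {word: sum(ones) for word, ones in sample_grouped.items()}

print(sample_pairs)
print(sample_grouped)
print(sample_counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

Keeping the stages separate mirrors the diagram: the map step only emits pairs and never counts anything itself.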

## Reading the document

1. Use Python's `open` function and the `readlines` method to read `wordcount.txt`. The output should be a list in which each element is one line of the document, as a string.
(please refer to this [documentation](https://docs.python.org/3/tutorial/inputoutput.html))

In [4]:
with open('wordcount.txt', 'r', encoding='utf-8') as fichier:
    # Read all the lines and put them in a list
    lignes = fichier.readlines()
print(lignes)

['word count from Wikipedia the free encyclopedia\n', 'the word count is the number of words in a document or passage of text Word counting may be needed when a text\n', 'is required to stay within certain numbers of words This may particularly be the case in academia legal\n', 'proceedings journalism and advertising Word count is commonly used by translators to determine the price for\n', 'the translation job Word counts may also be used to calculate measures of readability and to measure typing\n', 'and reading speeds usually in words per minute When converting character counts to words a measure of five or\n', 'six characters to a word is generally used Contents Details and variations of definition Software In fiction\n', 'In non fiction See also References Sources External links Details and variations of definition\n', 'This section does not cite any references or sources Please help improve this section by adding citations to\n', 'reliable sources Unsourced material may be challen

In [5]:
print(type(lignes))

<class 'list'>


## The map function
2. Complete the following function:
* It takes one argument `line` that represents one line of the document, in string format
* It returns a variable `pairs` that is a list `[('word1', 1), ('word2', 1)]` containing as many elements as there are words in the line.

🤓 This [function](https://python-reference.readthedocs.io/en/latest/docs/str/split.html) might help you

In [15]:
# Try the mapping on the first line of the document only
mots = lignes[0].split()
# Emit one (word, 1) pair per word, without counting anything yet
l_res = [(mot, 1) for mot in mots]
print(l_res)

[('word', 1), ('count', 1), ('from', 1), ('Wikipedia', 1), ('the', 1), ('free', 1), ('encyclopedia', 1)]


In [16]:
def map(line):
    # Split the line on whitespace to get its words
    mots = line.split()
    # Emit one (word, 1) pair per occurrence; the counting happens in reduce
    pairs = [(mot, 1) for mot in mots]
    return pairs


3. Now, iterate over all the document's lines, and use the `map` function to create a variable named `pairs_list` that contains all the `('word', 1)` pairs encountered in the whole document:

[('word', 1), ('count', 1), ('from', 1), ('Wikipedia', 1), ('the', 1), ('free', 1), ('encyclopedia', 1), ('the', 1), ('word', 1), ('count', 1), ('is', 1), ('the', 1), ('number', 1), ('of', 1), ('words', 1), ('in', 1), ('a', 1), ('document', 1), ('or', 1), ('passage', 1), ('of', 1), ('text', 1), ('Word', 1), ('counting', 1), ('may', 1), ('be', 1), ('needed', 1), ('when', 1), ('a', 1), ('text', 1), ('is', 1), ('required', 1), ('to', 1), ('stay', 1), ('within', 1), ('certain', 1), ('numbers', 1), ('of', 1), ('words', 1), ('This', 1), ('may', 1), ('particularly', 1), ('be', 1), ('the', 1), ('case', 1), ('in', 1), ('academia', 1), ('legal', 1), ('proceedings', 1), ('journalism', 1), ('and', 1), ('advertising', 1), ('Word', 1), ('count', 1), ('is', 1), ('commonly', 1), ('used', 1), ('by', 1), ('translators', 1), ('to', 1), ('determine', 1), ('the', 1), ('price', 1), ('for', 1), ('the', 1), ('translation', 1), ('job', 1), ('Word', 1), ('counts', 1), ('may', 1), ('also', 1), ('be', 1), ('used'
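
One possible way to build `pairs_list` is a plain loop that extends a flat list with the pairs of each line. The sketch below is self-contained, so it redefines a `map` with the same contract as the exercise and runs on `sample_lines`, a hypothetical stand-in for the list returned by `readlines()`.

```python
def map(line):
    # Same contract as the exercise's `map`: one ('word', 1) pair per word.
    # Note: this name shadows Python's built-in map, as in the exercise.
    return [(word, 1) for word in line.split()]

# Hypothetical stand-in; replace with `lignes` to process the whole file
sample_lines = [
    "word count from Wikipedia\n",
    "the word count is the number of words\n",
]

pairs_list = []
for line in sample_lines:
    # extend() keeps pairs_list flat: a list of pairs, not a list of lists
    pairs_list.extend(map(line))

print(pairs_list[:5])
```

Using `extend` rather than `append` matters here: `append` would produce one nested list per line, which the grouping step would then have to flatten.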

## Group by key
4. Complete the following function:
* It takes as argument the variable `pairs_list` defined above
* It returns a dictionary in the form: `{'word': [1, 1, 1]}` where [1, 1, 1] means that the 'word' appears exactly three times in the document

🤓 For a very DRY and pythonic way of writing this piece of code, you can use [dict comprehensions](https://peps.python.org/pep-0274/)

In [10]:
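def group_by_key(pairs_list):
    # Collect the 1s emitted for each word: {'word': [1, 1, 1]}
    grouped_by_key = {}
    for word, one in pairs_list:
        grouped_by_key.setdefault(word, []).append(one)
    return grouped_by_key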

In [11]:
grouped_by_key = group_by_key(pairs_list)
grouped_by_key

{'word': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'from': [1, 1],
 'Wikipedia': [1],
 'the': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'free': [1],
 'encyclopedia': [1],
 'is': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'number': [1, 1, 1],
 'of': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'words': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'in': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'a': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'document': [1, 1],
 'or': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'passage'
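
The dict comprehension hinted at in this section could look like the sketch below. The name `group_by_key_comprehension` and the `sample_pairs` data are hypothetical. This version rescans the whole list once per distinct word, so it is compact but slower than a single-pass loop on large inputs.

```python
def group_by_key_comprehension(pairs_list):
    # Distinct words, in order of first appearance (dicts keep insertion order)
    words = dict.fromkeys(word for word, _ in pairs_list)
    # For each word, keep the 1s of every pair that carries that word
    return {word: [one for w, one in pairs_list if w == word] for word in words}

sample_pairs = [("the", 1), ("cat", 1), ("the", 1)]
print(group_by_key_comprehension(sample_pairs))  # {'the': [1, 1], 'cat': [1]}
```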

## Reduce
5. Complete the function below:
* It takes as argument the `grouped_by_key` variable that was returned by the `group_by_key` function
* It returns the final result of the MapReduce algorithm in the form `{'word': 3}`

In [13]:
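def reduce(grouped_by_key):
    # Sum the list of 1s collected for each word: {'word': 3}
    reduced = {word: sum(ones) for word, ones in grouped_by_key.items()}
    return reduced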

In [14]:
reduced = reduce(grouped_by_key)
reduced

{'word': 24,
 'count': 11,
 'from': 2,
 'Wikipedia': 1,
 'the': 38,
 'free': 1,
 'encyclopedia': 1,
 'is': 19,
 'number': 3,
 'of': 25,
 'words': 21,
 'in': 11,
 'a': 28,
 'document': 2,
 'or': 11,
 'passage': 1,
 'text': 8,
 'Word': 4,
 'counting': 6,
 'may': 8,
 'be': 8,
 'needed': 1,
 'when': 2,
 'required': 1,
 'to': 18,
 'stay': 1,
 'within': 1,
 'certain': 2,
 'numbers': 1,
 'This': 2,
 'particularly': 1,
 'case': 1,
 'academia': 1,
 'legal': 1,
 'proceedings': 1,
 'journalism': 1,
 'and': 23,
 'advertising': 1,
 'commonly': 1,
 'used': 4,
 'by': 5,
 'translators': 1,
 'determine': 1,
 'price': 1,
 'for': 10,
 'translation': 1,
 'job': 1,
 'counts': 3,
 'also': 5,
 'calculate': 1,
 'measures': 1,
 'readability': 1,
 'measure': 2,
 'typing': 1,
 'reading': 1,
 'speeds': 1,
 'usually': 3,
 'per': 3,
 'minute': 1,
 'When': 1,
 'converting': 1,
 'character': 2,
 'five': 1,
 'six': 1,
 'characters': 2,
 'generally': 2,
 'Contents': 1,
 'Details': 2,
 'variations': 2,
 'definition': 3,

## Bonus question
6. What are the 10 most frequent words?

🤓 The [sorted](https://docs.python.org/3/library/functions.html#sorted) function and its `key` argument may be helpful

[('in', 11), ('or', 11), ('to', 18), ('is', 19), ('words', 21), ('and', 23), ('word', 24), ('of', 25), ('a', 28), ('the', 38)]
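
A possible approach, sketched here on a small hypothetical dictionary rather than the real `reduced`, is to sort the `(word, count)` pairs by count and keep the last `n`. The names `sample_reduced`, `n` and `top_n` are assumptions made for this illustration.

```python
# Hypothetical counts standing in for the `reduced` dictionary
sample_reduced = {"the": 38, "a": 28, "of": 25, "word": 24, "free": 1, "job": 1}

# Sort (word, count) pairs by count, ascending, then keep the last n
n = 3
top_n = sorted(sample_reduced.items(), key=lambda pair: pair[1])[-n:]

print(top_n)  # [('of', 25), ('a', 28), ('the', 38)]
```

Because `sorted` is stable, words with equal counts keep their original dictionary order. Passing `reverse=True` and slicing `[:n]` would list the most frequent word first instead.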
