# Mandatory Assignment 1: Counting Words

**This is the first of three mandatory assignments to be handed in as part of the assessment for the course 02807 Computational Tools for Data Science at Technical University of Denmark, autumn 2019.**

#### Practical info
- **The assignment is to be done individually. You are under no circumstances allowed to collaborate with anyone on solving the exercises (cf. the full policy on this on the course website)**
- **You must hand in one Jupyter notebook (this notebook) with your solution**
- **The hand-in of the notebook is due 2019-10-13, 23:59 on DTU Inside**

#### Your solution
- **Your solution should be in Python**
- **For each question you should use exactly the cells provided for your solution**
- **You should not remove the problem statements, and you should not modify the structure of the notebook**
- **Your notebook should be runnable, i.e., clicking [>>] in Jupyter should generate the result that you want to be assessed**

---
## Introduction
In this assignment you will build data structures for counting words in a text. Suppose you are given a very large corpus of text, and from time to time you need to count how many times a word occurs. You could write a program that searches the text from start to end every time you want to make a query, but for large texts, this will be very slow. A common way to handle this is to preprocess the text into a data structure that contains exactly the information needed to answer specific queries like the frequency of a word.

Given a text, you should build data structures that can answer the following questions efficiently:
- How many times does a given word occur in the text? (exercise 2)
- How many times in total does a word starting with a given prefix occur? (exercise 3)

For each of these questions you should write a function that organizes data in a way (the data structure) that makes it possible to write an efficient function (the query) to answer the questions. A good data structure is one where the query is much faster than just searching the text, while still not using too much space.

You should not use any Python libraries (except `string`!) to solve the exercises. You may use built-ins like lists, dictionaries, `map`, `filter`, and so on.

You should provide implementations for the functions in this notebook. Do not change the names of the functions.

To test your programs, we will use the complete works of William Shakespeare. Please run the cells below to load the text and show a preview. You should be online before you do this. Note that a good solution will have to work for even larger texts than this.

In [1]:
import requests
text = requests.get('https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt').text

In [2]:
print('{} ...'.format(text[0:1000]))

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!

Shakespeare

*This Etext has certain copyright implications you should read!*

<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE
DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>

*Project Gutenberg is proud to cooperate with The World Library*
in the presentation of The Complete Works of William Shakespeare
for your reading for educatio

---
## Exercise 1
In this exercise you should create a helper function for parsing a text into an iterable (e.g., a list) of words. You will need this in the subsequent exercises.

You should make sure that:
- each element of the iterable is exactly one word,
- all words are lower case,
- words do not contain punctuation.

Write a function `text_to_words()` that takes as input a string and outputs an iterable of the words in the string.

In [3]:
import string

def text_to_words(text):
    
    # Build a translation table that deletes every punctuation character
    translator = str.maketrans('', '', string.punctuation)
    # Remove punctuation characters from the string and convert it to lower case
    text = text.translate(translator).lower()
    # Split the string on whitespace
    words = text.split()
    
    return words

In [4]:
# Should output 'this is a test'
' '.join(text_to_words('THIS is, a test!'))

'this is a test'
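
As a small illustration of how the translation table behaves on edge cases, the sketch below repeats the same steps on a made-up sentence. Because `string.punctuation` includes the apostrophe and the hyphen, contractions and hyphenated words are merged rather than split. The helper name `strip_and_split` and the sample sentence are assumptions made for this example only.

```python
import string

def strip_and_split(text):
    # Hypothetical helper that mirrors the steps of text_to_words()
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator).lower().split()

sample_words = strip_and_split("Don't stop, well-known!")
print(sample_words)  # ['dont', 'stop', 'wellknown']
```

Whether `dont` and `wellknown` are acceptable words depends on how the queries will be phrased; the same normalization should be applied to a query word before looking it up.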

In [5]:
words = text_to_words(text)

---
## Exercise 2
In this exercise you should create a data structure that can give an answer to the following query.
- How many times does a given word occur in the text?

You should do the following:
- Write a function `build_word_count_data_structure()` that takes as input an iterable of strings, and outputs a data structure for looking up how many times a string occurs.

- Write a function `get_word_count()` that uses your data structure to count the number of times a word occurs.

- Explain with words how your data is organized in your data structure, how your function constructs it, and how a query works. Why is it efficient?

In [6]:
def build_word_count_data_structure(words):
    
    # Create a dict to save the number of occurrences per word
    word_occurrences = {}
    # Check every word of the text
    for word in words:
        try:
            # Try to increment the number of occurrences for a word
            word_occurrences[word] += 1
        except KeyError:
            # A KeyError is raised if the word has not been seen yet:
            # initialize the count for the word
            word_occurrences[word] = 1
    
    return word_occurrences

In [7]:
ds = build_word_count_data_structure(words)

In [8]:
def get_word_count(ds, word):
    # Words that never occur in the text have a count of 0
    return ds.get(word, 0)

In [9]:
get_word_count(ds, 'romeo')

137

### Explanation
Two implementations were considered for counting the occurrences of each word in the text efficiently:
1. Use a dictionary to store the number of occurrences of each word. Iterate over the word list and call `text.count(word)` to count the occurrences within the text. If the word has already been analyzed (i.e., it is already stored in the dictionary), skip the counting step.
2. A similar process, but without calling `text.count(word)`. Instead, increment the count by 1 every time the word is encountered. If the word is not yet stored in the dictionary, its count is initialized to 1.

Some implementations, such as using `filter()` to find the occurrences, were discarded because they place the computational cost in `get_word_count()`, i.e., at query time, rather than up front in `build_word_count_data_structure()`. It is better to process the data once and then answer each query quickly than to process it every time a query is performed.

Other implementations were discarded because of their complexity, such as organizing the words by their letters to avoid iterating over all the words. Building complex structures to store the data also requires designing an efficient way to retrieve the words from them.

Finally, the two selected methods were tested. The first took over 370 s to finish and the second only 0.15 s, so the first was discarded. One last improvement was then tested on the final method: instead of checking whether a word is already stored in the dictionary, simply try to increment its count. If an exception is raised (because the word is not stored yet), catch it and initialize the word's count to 1. With this change, the execution time dropped to 0.11 s.

**Why is it efficient?**

In the first method, `text.count(word)` is executed once for each distinct word in the text. This function seems to work well on small strings but is slow on large ones (such as the text used in this exercise). The second method is simple and efficient because it only needs to iterate over the word list once instead of scanning the entire text for every word. A query is then a single dictionary lookup, which takes constant time on average regardless of the length of the text.
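
The counting strategies discussed above can be sketched side by side on a tiny sample. This is only an illustration: the function names and the sample sentence are hypothetical, and the first variant assumes the count is taken on the list of words (calling `count` on the raw string would count substring matches instead of whole words).

```python
def count_with_list_count(words):
    # Method 1 (assumed variant): one full scan of the list per distinct word
    counts = {}
    for word in words:
        if word not in counts:
            counts[word] = words.count(word)
    return counts

def count_with_membership_check(words):
    # Method 2: a single pass with an explicit membership test
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

def count_with_try_except(words):
    # Method 2 with the final tweak: increment first, handle the miss afterwards
    counts = {}
    for word in words:
        try:
            counts[word] += 1
        except KeyError:
            counts[word] = 1
    return counts

sample = 'to be or not to be'.split()
result = count_with_try_except(sample)
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

All three variants return the same dictionary. They differ only in build cost: the first rescans the input once per distinct word, while the other two touch each word exactly once.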

---
## Exercise 3
In this exercise you should create a data structure that can give an answer to the following query.
- How many times in total does a word starting with a given prefix occur?

For example, `bar` is a prefix of `bar`, `barracuda`, and `barrier`, and the result of the query should include the sum of the number of times each of those words occur.

You should do the following:
- Write a function `build_prefix_count_data_structure()` that takes as input a list of strings, and outputs a data structure for looking up how many times a prefix occurs in words.

- Write a function `get_prefix_count()` that uses your data structure to count the number of times a prefix occurs.

- Explain with words how your data is organized in your data structure, how your function constructs it, and how a query works. Why is it efficient?

In [10]:
def build_prefix_count_data_structure(words):
    return words

In [11]:
ds = build_prefix_count_data_structure(words)

In [12]:
def get_prefix_count(ds, prefix):
    
    # Create an integer to count the number of words with the prefix
    count = 0
    # Check every word of the text
    for word in ds:
        # Increment by 1 if the word starts with the prefix
        if word.startswith(prefix):
            count += 1
    
    return count

In [13]:
get_prefix_count(ds, 'rom')

852

### Explanation
The data structure is simply the list of words from the text, the same list that was used as input in the previous exercise (`build_prefix_count_data_structure()` returns it unchanged).

Three methods were tested for counting the words in a text that start with a given prefix:
1. A simple loop with a counter that is incremented by 1 if the word starts with the prefix, using `word.startswith(prefix)`.
2. Use `filter()` to get all the words starting with the prefix and then take the length of the result.
3. Almost the same as the first method, but checking the leading letters of each word one by one instead of using `word.startswith(prefix)`.

With the same prefix "rom", the first method took only 0.1 s to finish, while the second and third methods needed 0.2 s and 0.3 s respectively. Even though the analyzed text is very long, the performance is good for all of them; the first method is the fastest and is the one used in this notebook.
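
For reference, the three query variants compared above might look roughly like the following. The function names and the sample list are hypothetical and the code is a sketch, not the exact code that was timed.

```python
def count_startswith(words, prefix):
    # Method 1: plain loop using str.startswith
    count = 0
    for word in words:
        if word.startswith(prefix):
            count += 1
    return count

def count_filter(words, prefix):
    # Method 2: filter() followed by the length of the result
    return len(list(filter(lambda w: w.startswith(prefix), words)))

def count_char_by_char(words, prefix):
    # Method 3: compare the leading characters one by one
    count = 0
    for word in words:
        if len(word) < len(prefix):
            continue
        for i in range(len(prefix)):
            if word[i] != prefix[i]:
                break
        else:
            # The inner loop finished without a mismatch
            count += 1
    return count

sample = ['romeo', 'rome', 'roman', 'juliet', 'ro']
print(count_startswith(sample, 'rom'))  # 3
```

All three scan the whole list on every query, so their running time grows linearly with the number of words in the text.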

**Why is it efficient?**

The `.startswith()` method seems to be very well optimized and is a better option than checking the characters one by one. The `filter()` function is also very useful for handling the data but seems to be a bit slower than the loop used in the first method.

The problem here, compared to the previous exercise, is that the computational cost is incurred when retrieving the number of words with a given prefix rather than when creating the data structure. To avoid that cost in `get_prefix_count()`, it would be necessary to implement a more complex system that tokenizes or indexes the words in the list (as search engines do).
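
One possible way to move the cost to build time, shown here only as a sketch and not as the approach used in this notebook, is to store a counter for every prefix of every word in a dictionary. The names `build_prefix_table` and `lookup_prefix` are hypothetical.

```python
def build_prefix_table(words):
    # Map every prefix of every word to the number of word occurrences starting with it
    prefix_counts = {}
    for word in words:
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1
    return prefix_counts

def lookup_prefix(prefix_counts, prefix):
    # A single dictionary lookup; unseen prefixes have a count of 0
    return prefix_counts.get(prefix, 0)

sample = ['bar', 'barracuda', 'barrier', 'bar', 'baz']
table = build_prefix_table(sample)
print(lookup_prefix(table, 'bar'))  # 4
print(lookup_prefix(table, 'ba'))   # 5
```

The trade-off is memory: the table holds one entry per distinct prefix, which is considerably more than one entry per distinct word. A trie with a counter in each node is a common alternative that shares storage between prefixes.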