## **Programming with Python for Data Science**


###**Lesson: Cliff Note Generator** 

## **Notebook Preparation for Lesson in 1-2 steps:**
Each lesson will start with a similar template:  
1. **save** the notebook to your google drive (copy to drive)<br/> ![](https://drive.google.com/uc?export=view&id=1NXb8jeYRc1yNCTz_duZdEpkm8uwRjOW5)
 
2. **update** the NET_ID to be your netID (no need to include @illinois.edu)

In [2]:
LESSON_ID = 'p4ds:ds:cng3'   # keep this as is
NET_ID    = 'salonis3' # CHANGE_ME to your netID (keep the quotes)

#**Lesson Cliff Note Generator**
###**Computing**
The computation stage within the data science pipeline is where you can finally do something useful to the data that you have been carefully preparing. Although it can involve visualizations, this stage is also about building models and doing analysis including pattern discovery, machine learning, etc. This stage usually requires the most creativity and hence is the most fun.

Unfortunately for this example, the end goal is not too exciting -- it's only to build a 'table'. As we saw earlier, our goal is to produce the following table (words and numbers made up):
```
word,  count 
the,   400 
Tom,   305 
Polly, 206
```
However, once this table is built, we can easily use it for other analyses including visualizations (coming soon).

###**Collections Module**
Python's collections module provides a set of useful containers for managing data. In particular, the Counter type is a convenient way to keep track of the counts of unique items. Here is a simple example:

```
1. import collections
2. words = ['apple', 'pear', 'apple']
3. counter = collections.Counter()
4. for w in words:
5.    counter[w] += 1
6. print(counter.most_common())
```

Lines 3 to 5 can be replaced with a single line of code by passing the list directly to the Counter: ```counter = collections.Counter(words)```

```
import collections
words = ['apple', 'pear', 'apple']
counter = collections.Counter(words)
print(counter.most_common())
```

In [None]:
# type&run the above example/exercise in this cell

Read (type, and run) that code carefully and make sure you understand the power and usage of collections.Counter(). It's a common pattern. 

See the [documentation](https://docs.python.org/3.7/library/collections.html#collections.Counter) for more details.
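
As a small sketch of two Counter behaviors that come up in this lesson (the word list is made up for illustration): looking up counts by key, and asking `most_common` for only the top `n` items.

```python
import collections

# Hypothetical word list, made up for this illustration
words = ['apple', 'pear', 'apple', 'fig', 'pear', 'apple']
counter = collections.Counter(words)

# A Counter behaves like a dictionary: look up a count by key
apple_count = counter['apple']
# A key that was never counted returns 0 instead of raising KeyError
kiwi_count = counter['kiwi']
# most_common(n) returns the n highest counts as a list of (item, count) tuples
top_two = counter.most_common(2)

print(apple_count)
print(kiwi_count)
print(top_two)
```

Note that the return value is already a list of tuples, which is the shape of the word/count table described earlier.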

###**Before you go, you should know:**

* what the collections.Counter type is and why it's useful

#**Lesson Assignment**
There's not much content to this lesson other than to create the table of words and counts (which will be a list of tuples). The words are already parsed out for you (same as the previous lesson).

###**Build the following three functions:**
```
def clean(words):
  normalizes the words so that letter case is ignored
  returns an array of 'cleaned' words
```
```
def build_table(words):
  builds a dictionary of counts
  returns a Python dictionary or collections.Counter type
```
```
def top_n(table, n):
  returns the n most frequent words(keys) in table
  the return type is an array of tuples
  the tuple's first value is the word; the second value is the count
```
###**Notes:**
The function top_n does not have to worry about the order of words that have the same count. Returning tied items in a consistent order is what this lesson calls stable sorting (more discussion in the extra credit). You can use collections.Counter to help you with this lesson, but it will NOT sort ties alphabetically: since Python 3.7, `most_common()` returns tied items in the order they were first encountered.

Be sure to test your pipeline on multiple texts. Each 'run' should not affect others:
```
v1 = list(pipeline(['a','b','c'], 5))
v2 = list(pipeline(['a','b','c'], 5))
print(v1 == v2)
```
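
One way a run can affect another is shared state. The following is a minimal sketch, using hypothetical names (`shared_counter`, `leaky_pipeline`, `fresh_pipeline`) that are not part of the assignment, of what can go wrong when a counter is created outside the function:

```python
import collections

# Hypothetical module-level counter: it survives between calls (the bug)
shared_counter = collections.Counter()

def leaky_pipeline(tokens, n):
    # Counts accumulate in the shared counter across runs
    shared_counter.update(word.lower() for word in tokens)
    return shared_counter.most_common(n)

def fresh_pipeline(tokens, n):
    # A new Counter is built on every call, so runs are independent
    counter = collections.Counter(word.lower() for word in tokens)
    return counter.most_common(n)

leaky_same = leaky_pipeline(['a', 'b', 'c'], 5) == leaky_pipeline(['a', 'b', 'c'], 5)
fresh_same = fresh_pipeline(['a', 'b', 'c'], 5) == fresh_pipeline(['a', 'b', 'c'], 5)
print(leaky_same)  # False: the second run sees doubled counts
print(fresh_same)  # True
```

Creating the table inside the function, as in `fresh_pipeline`, keeps each run self-contained.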

In [27]:
data = ['YOU', "don't", 'know', 'about', 'me', 'without', 'you', 'have', 'read', 'a', 'book', 'by', 'the', 'name', 'of', 'The', 'Adventures', 'of', 'Tom', 'Sawyer', 'but', 'that', "ain't", 'no', 'matter', 'That', 'book', 'was', 'made', 'by', 'Mr', 'Mark', 'Twain', 'and', 'he', 'told', 'the', 'truth', 'mainly', 'There', 'was', 'things', 'which', 'he', 'stretched', 'but', 'mainly', 'he', 'told', 'the', 'truth', 'That', 'is', 'nothing', 'I', 'never', 'seen', 'anybody', 'but', 'lied', 'one', 'time', 'or', 'another', 'without', 'it', 'was', 'Aunt', 'Polly', 'or', 'the', 'widow', 'or', 'maybe', 'Mary', 'Aunt', 'Polly', "Tom's", 'Aunt', 'Polly', 'she', 'is', 'and', 'Mary', 'and', 'the', 'Widow', 'Douglas', 'is', 'all', 'told', 'about', 'in', 'that', 'book', 'which', 'is', 'mostly', 'a', 'true', 'book', 'with', 'some', 'stretchers', 'as', 'I', 'said', 'before']
import collections

def clean(words):
    normalize_words = [word.lower() for word in words]
    return normalize_words
  
def build_table(words):
    counter = collections.Counter(words)
    return counter

def top_n(table, n):
    return table.most_common(n)

def pipeline(tokens, n):
    return top_n(build_table(clean(tokens)), n)

###**Extra Credit:**
Solve the entire solution **without** using the collections module. Use a regular Python dictionary for all parts of the pipeline. You must create new functions but do not change the previous functions. Both versions of these functions will be evaluated separately.


In [28]:
def clean2(words):
    normalize_words = [word.lower() for word in words]
    return normalize_words
  
def build_table2(words):
    mydict = {}
    for word in words:
        if word not in mydict:
            mydict[word] = 1
        else:
            mydict[word] = mydict[word] + 1
    return mydict

def top_n2(table, n):
    # sort by count (highest first), then alphabetically by word to break ties
    return sorted(table.items(), key=lambda x: (-x[1], x[0]))[:n]

def pipeline2(tokens, n):
    return top_n2(build_table2(clean2(tokens)), n)

Starting from your previous top_n, write top_n2 so that it does the following:

* sorts the table (a dictionary) in descending order.
* the sorting criteria is the count (the 2nd value in the tuple)
* returns a subset of the sorted data (use array slicing)
* you must comment out any code that imports collections or the Counter
* you should comment out any code that uses the collections module outside of any function

To print the top 20 words, run the following (here `tokens` stands for your list of words, such as `data` above):
```
print(top_n2(build_table2(clean2(tokens)), 20))
```
```
[
('the', 6), ('book', 4), ('is', 4), 
('that', 4), ('and', 3), ('aunt', 3), 
('but', 3), ('he', 3), ('or', 3), 
('polly', 3), ('told', 3), ('was', 3), 
('a', 2), ('about', 2), ('by', 2), 
('i', 2), ('mainly', 2), ('mary', 2), 
('of', 2), ('truth', 2)
]
```

It's important to note that the result of this sort **must be stable**. In this lesson, a stable sort means one in which the items are always returned in the same order. You wouldn't want one run of a sort to return a different order than another run.

* the words with the highest counts come first
* if there is a tie (e.g. 'book' has 4 and 'is' has 4) the two are returned in alphabetical order.
* the collections.Counter type does not do this (your function is actually going to be better!)
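
To see why Counter alone is not enough here, consider this small sketch (made-up word lists, and assuming Python 3.7 or later, where `most_common()` returns ties in first-encountered order). Two inputs with identical counts come back in different orders:

```python
import collections

# Same words and counts, but a different order of first appearance
first = collections.Counter(['is', 'book', 'is', 'book'])
second = collections.Counter(['book', 'is', 'book', 'is'])

first_order = first.most_common()
second_order = second.most_common()

print(first_order)
print(second_order)
print(first == second)              # True: the counts are identical
print(first_order == second_order)  # False: the ties come back in different orders
```

An alphabetical tie-break removes this dependence on the input order.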

###**Extra Credit Hints:**
* your helper function that you will pass to sort will accept a tuple. You will need to use both parts of the tuple to determine the sort order
* the [following](https://www.peterbe.com/plog/in-python-you-sort-with-a-tuple) shows a good example of tuple sorting
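
As a minimal sketch of tuple sorting on unrelated, made-up data (the names `scores` and `by_score_then_name` are hypothetical), a key function can return a tuple so that the second part is only consulted when the first parts are equal:

```python
# Made-up (name, score) pairs for illustration
scores = [('pear', 2), ('fig', 5), ('apple', 2), ('kiwi', 5)]

def by_score_then_name(pair):
    name, score = pair
    # Negating the score puts high scores first; the name breaks ties A to Z
    return (-score, name)

ranked = sorted(scores, key=by_score_then_name)
print(ranked)
```

Negating the number avoids `reverse=True`, which would also reverse the alphabetical tie-break.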


##**Submission**

After implementing all the functions and testing them, please download the notebook as "solution.py" and submit it to Gradescope under the "Week10: DS: CNG_computing" assignment tab and to Moodle.

**NOTES**

* Be sure to use the function names and parameter names as given. 
* Do NOT use your own function or parameter names. 
* Your file MUST be named "solution.py". 
* Comment out any lines of code and/or function calls that produce errors. If your own solution has errors, you have to fix them; if the errors come from the examples/exercises, comment those lines out before submitting to Gradescope.
* Grading cannot be performed if any of these are violated.