Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). Check your output to make sure it all looks as you expected.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [1]:
NAME = "Bryan Tchakote"

---

**Here are some key points on how to use the notebook and submit your work.**
We will grade on the assumption that you have read and understood them.
    
1. **Using the Notebook to Show Your Work**: You must learn to write code in the notebook. It is a core tool for data science, and becoming good at it will make it easier to develop and document your work. Writing your code in another tool and pasting it into the notebook will probably not work well (forgetting to include code written elsewhere, messing up the spacing, or code that doesn't run as copied). You must be sure your code cells execute because we will test them. So learn how to run code in the notebook cells to double-check your work.
2. **Read Directions Carefully**: The instructions for your code are very important. If you don't follow the requirements, your program won't do what was requested. Making it work correctly is part of learning to program. If we worded something unclearly, ask the teacher.

The second set of issues is coding style:

1. **Indentation**: In Python, indentation matters and must be consistent. If you write your code in the Notebook, the **tab** key will indent properly. If you use another editor and paste into the notebook, it might not be correctly indented (when you do write code in another editor, make sure you set your tabs to indent as 4 spaces, not as a tab character). You must make sure that your pasted code runs in the notebook or it will not get a good grade. In any case, we recommend that beginners work in a Jupyter notebook for this course, whether it's this one or a draft file.
2. **Spacing**: Follow closely the spacing shown in the lessons. There should not be a space between a function name and the parentheses with the arguments. For a programmer, style is very important. If you work with programmers in the future, they will sometimes have "lint" checkers that test your code for style and reject it if it doesn't follow the appropriate spacing and blank-line rules. Think of it as a matter of politeness toward other people reading your code \ (•◡•) /
3. **Names of Variables**: In Python, there's a culture of making everything readable. Don't use ``x`` and ``y`` as your variable names... use words like ``pounds`` and ``kilograms``. It will be easier for colleagues (and yourself) to understand the code later.
4. **Error Messages**: Please use informative error messages that tell the user what they did wrong and what kind of input you expect. Imagine you are designing the user experience! Think about how to help your user. And remember **you** are the user when you debug!

We will take points off for issues of non-standard spacing, indentation, bad error messages, and bad variable names in the future.  This will continue for the entire course.

There are multiple ways to code all the answers.  Here are a few more code style tips:

1. If you do a calculation or a transformation, like ``float(pounds)`` -- do it once and save it as a variable, don't do it multiple times.  You should try not to have code that repeats itself too much.  If you repeat things, you can make mistakes like typos and it will be harder to find them. Also, it's wasting computer power.
2. Tests like "4 < test < 40" need to be saved in a variable or used in an ``if`` statement. On their own line they are evaluated and the result is thrown away, so they don't do anything useful.
3. ``try``/``except`` should be used to catch errors. (In fancier, more formal Python, there is more careful error catching where the type of error is detected and handled. We're just doing the basic try/except right now.) Anytime you have a conversion or something that could result in an error, you should wrap it in try/except. Do not allow a user to run code that results in an un-handled error.
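
The three tips above can be combined in a few lines. This is only a sketch: the function name `describe_weight`, the 4-to-40 range, and the messages are made up for illustration, and it names `ValueError` explicitly even though a plain `except` is all we expect for now.

```python
def describe_weight(raw_text):
    """Turn text typed by a user into a friendly message about the weight."""
    try:
        pounds = float(raw_text)      # tip 1: convert once and save the result
    except ValueError:                # tip 3: the conversion can fail, so wrap it
        return "Please enter a number, such as 12.5."

    in_range = 4 < pounds < 40        # tip 2: save the test in a variable...
    if in_range:                      # ...and use it in an if statement
        kilograms = pounds * 0.45359237
        return f"{pounds} pounds is {kilograms:.2f} kilograms."
    return "Please enter a weight between 4 and 40 pounds."


print(describe_weight("10"))
print(describe_weight("ten"))
print(describe_weight("50"))
```

Returning the message instead of printing it inside the function makes the function easy to check in another cell.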

---

# Using Modules in Python

So far we have been learning "built-in" basic Python concepts.

Collections of Python code, or indeed any .py file, can be used as modules in your code. Python "ships" with a bunch of useful standard modules that allow you to do things with your computer, with data files and formats, with data types, etc.  You can read more about modules and how they are used and written here: https://docs.python.org/3/tutorial/modules.html

One way to tell the difference between "built-in" functions and the standard module functions that Python includes is to look at this page: https://docs.python.org/3/library/index.html  The **built-in** functions, constants, types, and exceptions are described in the first few sections, and the rest are modules with useful tools that you have to import before you can use them. We explain how to import them below.

Any time we tell you (or a Python author tells you) that you need to "install" other Python libraries to use some function, it means you need to install modules that are available online (usually using "pip install" or "conda install" or by downloading from github).  Some data science modules that are installed with Anaconda are **numpy, scikit-learn, pandas, and nltk**.  To use those, which are not "standard", you must import them in your code.  Then you can use their functions.

## Importing A Module

The simplest way to import is to put an "import" statement at the top of your code. Usually these imports occur first in your file or notebook. Then you can use the module's functions with the "dot" notation: [module].[function].

In [2]:
# this is my import of the string module
import string

# This is how I can use a function or property defined inside the module.
# In this case, it's a list of punctuation characters. I reference the module, string, and 
# then the thing I am trying to use, the punctuation:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [3]:
import math

# using math to print the square root of 9:
print(math.sqrt(9))

3.0


In [4]:
import random

# create a random integer between 0 and 10 (randint includes both end points)
print(random.randint(0, 10))

1


If you want to see what the options on a module are, you can try it in the notebook: type "random." and then the tab key and see what options appear in a little popup menu.

In [None]:
# put your cursor right after the "." and press tab - these are all the functions you can call!
random.

If you want to see the documentation for a function, you can try that in the notebook too: type a "?" after a function and then execute the cell. A window at the bottom of your browser will tell you about the function. You can resize it. You can close it by clicking the x in the upper right corner.

In [5]:
# execute this with shift-enter
random.randint?
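
If you are working outside the notebook, where the tab key and `?` are not available, plain Python offers similar tools. A small sketch using only the standard library: `dir` lists the names in a module, and the `__doc__` attribute holds a function's documentation text. The variable name `public_names` is just for illustration.

```python
import random

# dir() lists everything defined in the module; names starting with "_" are internal.
public_names = [name for name in dir(random) if not name.startswith("_")]
print(public_names)

# The docstring is part of what "random.randint?" displays in the notebook.
print(random.randint.__doc__)

# help(random.randint) prints a longer, formatted version of the same documentation.
```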

### Import a Module and Give it a Nickname

In the data science world, there are conventions for importing modules with certain nicknames.  You will see this code all the time:

````
import numpy as np
import pandas as pd
````

If you do this nickname method, you can then call functions using the short name instead:
`np.mean([43, 45, 52, 40, 34])`
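
The `as` nickname works for any module, not only numpy and pandas. Here is a sketch using the standard-library `statistics` module, picked only because it runs without installing anything; the nickname `stats` is our own choice for this example, not an established convention.

```python
import statistics as stats   # "stats" is now a short name for the statistics module

scores = [43, 45, 52, 40, 34]
average = stats.mean(scores)  # same as statistics.mean(scores), but shorter to type
print(average)
```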


   
    

### Importing a single function from a module

Sometimes you don't want an entire module, just a single function or value from it. You can import it by name from its module, like `defaultdict`, which lives in the module `collections`:

````
from collections import defaultdict
````

Or 
````
from string import punctuation
```` 

After you do that, you don't need to use the "dot" notation to access the value of `punctuation` (or the function `defaultdict` above):

In [6]:
from string import punctuation
print(punctuation)   # not string.punctuation!

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


## defaultdict  - no key errors!  Default values!

You should read about defaultdict now: https://www.accelebrate.com/blog/using-defaultdict-python/

Dictionaries are great for keeping track of things, but if you try to read a key that doesn't exist, you get an error of type KeyError:


In [7]:
mydict = {'jan': 31, 'fev': 28, 'mar': 31, 'avr': 30}  # notice no mai key:value pair
print(mydict['mai'])

KeyError: 'mai'

Errors like this should never be seen by a user and should be handled by try/except, maybe something like this:

In [None]:
try:
    print(mydict['mai'])
except:
    print("No entry exists!")
    mydict['mai'] = 31
    print(mydict['mai'])

Even better code is to actually look for a **specific error** in the try/except, like this:

In [None]:
mydict = {'jan': 31, 'fev': 28, 'mar': 31, 'avr': 30}  # notice no mai key:value pair
try:
    print(mydict['mai'])
except KeyError:    # <-----   we name it here!  Because we want to handle it specifically.
    print("No entry exists!")
    mydict['mai'] = 31
    print(mydict['mai'])

The book recommends using `mydict.get('mai')` instead of a try/except. Rather than raising an error, it returns `None` when the key is missing, which is much friendlier and easier to test for.

In [None]:
# see, no error!
mydict = {'jan': 31, 'fev': 28, 'mar': 31, 'avr': 30}
print(mydict.get('mai'))

In [None]:
# using it in a test:
mydict = {'jan': 31, 'fev': 28, 'mar': 31, 'avr': 30}

# if None is returned, this test passes (because of the "not") and we print the message:
if not mydict.get('mai'):
    print("We are missing this key.")
else:
    print(mydict.get('mai'))
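
One caveat with the `not` test above: it also treats a stored value of `0` (or an empty string) as missing, because those values count as false in Python. Here is a sketch of two stricter checks, using a made-up `stock` dictionary and a hypothetical helper called `describe`:

```python
stock = {'apples': 0, 'pears': 4}

def describe(inventory, fruit):
    """Say how many of a fruit we have, or that we do not track it."""
    count = inventory.get(fruit)
    if count is None:   # true for a missing key (as long as None is never stored as a value)
        return "We are missing this key."
    return f"{fruit}: {count}"

print(describe(stock, 'apples'))   # the key exists, even though its value is 0
print(describe(stock, 'plums'))

# "in" asks directly whether the key exists, without looking at the value.
print('apples' in stock, 'plums' in stock)
```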

Another alternative is to use the `defaultdict` inside the module `collections`. A `defaultdict` sets a default value for any missing key when you define the dictionary. ("Default" means "use this if you don't specify another thing").

In [None]:
from collections import defaultdict
mydict = defaultdict(int)  # this says my values will be ints. The default value of an int is 0.

# no error even without any definition -- it defaults to 0.
mydict['mai']

The equivalent with the "get" function, which allows you to set a default value (here "0"), is:

In [None]:
mydict.get('mai', 0)
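
The two approaches differ in one way that is easy to miss: reading a missing key from a `defaultdict` also stores the default under that key, while `get` leaves the dictionary untouched. A short sketch (the dictionary names are illustrative only):

```python
from collections import defaultdict

plain_days = {'jan': 31}
default_days = defaultdict(int, {'jan': 31})

print(plain_days.get('mai', 0))   # get hands back the default; plain_days is unchanged
print(default_days['mai'])        # the lookup itself adds 'mai' with the default value

print('mai' in plain_days)
print('mai' in default_days)

# The default does not have to be a number: list gives an empty list,
# which is handy for grouping things.
months_by_length = defaultdict(list)
for month, days in [('jan', 31), ('fev', 28), ('mar', 31)]:
    months_by_length[days].append(month)
print(dict(months_by_length))
```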

Here is code to use a defaultdict to count words in romeo.txt:
    

In [None]:
filename = "data_files/romeo.txt"
wordcounter = defaultdict(int)

try:
    fhand = open(filename)
except OSError:  # covers a missing file, a wrong name, or a file we are not allowed to read
    print("File wasn't found or the name is wrong. It has to be in this directory with the notebook.")
else:
    # this part only runs if the file opened without an error
    for line in fhand:
        words = line.split()
        for word in words:
            wordcounter[word] += 1  # this is the same as wordcounter[word] = wordcounter[word] + 1
    fhand.close() # if you open the file without a "with open(filename) as x:" line, you need to close it manually.

    # let's print it nicely:
    for key, val in wordcounter.items():   # items returns a pair for each dict element, the key and value.
        print(key, val)
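
A `with` block closes the file for you, even if something goes wrong while reading. The sketch below wraps the counting in a hypothetical function, `count_words`, and runs it on a small temporary file containing made-up text so that it works without `romeo.txt`; the file name `sample.txt` and its contents are assumptions for the demo.

```python
import os
import tempfile
from collections import defaultdict

def count_words(filename):
    """Return a defaultdict mapping each word in the file to its count."""
    wordcounter = defaultdict(int)
    with open(filename) as fhand:      # the file is closed automatically after this block
        for line in fhand:
            for word in line.split():
                wordcounter[word] += 1
    return wordcounter

# Demo on a throwaway file, so the example does not depend on any data folder.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "w") as out:
        out.write("the cat saw the dog\nthe dog ran\n")
    sample_counts = count_words(sample_path)

for key, val in sample_counts.items():
    print(key, val)
```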

## Counter in Collections

There is another great dictionary tool in `collections` called a `Counter`. In data science, we count things a lot. You should know about this tool.

Read the section here about the Counter: https://docs.python.org/3/library/collections.html#collections.Counter

If you give a counter a list of words, it will automatically count them for you.
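
Before reading from a file, here is the idea on a tiny list. This is a sketch with a made-up sentence; the variable names are only for illustration.

```python
from collections import Counter

sample_words = "the cat saw the dog and the dog ran".split()
sample_counter = Counter(sample_words)   # counting happens as soon as the Counter is created

top_two = sample_counter.most_common(2)  # list of (word, count) pairs, highest count first
print(top_two)

# A word that never appeared has a count of 0 instead of raising a KeyError.
print(sample_counter['microwave'])
```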

In [None]:
# First, let's define this function for trying to open a file and returning the filehandle:
def open_a_file(filename):
    try:
        fhand = open(filename)
        return fhand
    except:
        return None  # we return this if there is an error opening the file.

In [None]:
from collections import Counter  # import the Counter

# define a function to take your filehandle and count the words in the file.
def get_counts(filehandle):
    # Takes a filehandle, returns a counter object
    allwords = []
    for line in filehandle:  # use the argument passed in, not a variable from outside the function
        words = line.split()
        allwords = allwords + words  # add the words to the list of all words -- this prevents embedded lists
    mycount = Counter(allwords)  # this is a list of words, remember
    return mycount

Now we can call this code using any filename, and our functions!

In [None]:
fhand = open_a_file("data_files/romeo.txt")
if fhand:  # open_a_file returns None when the file could not be opened
    counts = get_counts(fhand)
    fhand.close()
    print("All words in the counter dictionary ordered by frequency:", counts.most_common())  # might be long, beware!
    print("Top 5 words", counts.most_common(5))
else:
    print("The file could not be opened, so there is nothing to count.")

In [None]:
# Just like other dictionaries, you can get the words (the keys), but not ordered:
counts.keys()

In [None]:
counts.values()   # also, not ordered

In [8]:
counts.items()  # the usual function for getting it all, again not ordered


In [9]:
counts['microwave']  # any word not in the keys returns a 0 count, like defaultdict. Because it appeared 0 times.


**Update** on Counter -- an important feature.

A small reminder about a Counter: once it exists, you add to it with `update`, which expects a collection of items (such as a list) as its argument. If you want to update it inside a loop, for example with each row or each word of your file instead of waiting until the whole file is processed, make sure you put each word into a list. Otherwise the word is treated as a collection of characters and each letter is counted separately:

In [10]:
from collections import Counter

mycount = Counter()
mycount.update(['words', 'here', 'already'])   # I have to use update because I already declared it above.
words = ['hi', 'hi', 'there', 'fred', 'here', 'already']
for word in words:
    mycount.update([word])  # now I am adding to the counts with more words, one at a time!

In [11]:
print(mycount)

Counter({'here': 2, 'already': 2, 'hi': 2, 'words': 1, 'there': 1, 'fred': 1})
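
To see why the list matters, compare what happens when `update` receives a bare string. A minimal sketch; the variable names `by_letter` and `by_word` are made up for the comparison.

```python
from collections import Counter

by_letter = Counter()
by_letter.update('hi')      # a bare string is treated as a sequence of characters
print(by_letter)

by_word = Counter()
by_word.update(['hi'])      # a one-item list counts the whole word
by_word['hi'] += 1          # adding 1 to the entry directly also works
print(by_word)
```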


## Sorting Dictionaries

If you want to sort a dictionary by either key or value, you can use the `sorted` function. It takes an optional `key` argument that says what to sort by, which is a little cryptic at first. (This `key` is a small function, not a dictionary key.) `mydict.items()` gives us the key:value pairs as little lists (technically "tuples"), with the key as the 0th element and the value as the 1st element.

There are 2 ways to tell `sorted` which element to sort by:

In [12]:
mydict = {'jan': 1, 'fev': 5, 'mar': 20, 'avr': 10, 'sep': 14, 'juin': 25}

In [13]:
# the lambda expression is a tiny function without a name and with less code --
# here it says "use my 0th" (the first) item, the key, to sort by. It returns a list of (key, value) pairs in this order.
sorted(mydict.items(), key=lambda p: p[0])

[('avr', 10), ('fev', 5), ('jan', 1), ('juin', 25), ('mar', 20), ('sep', 14)]

In [14]:
sorted(mydict.items(), key=lambda p: p[1])  # use my second element, the value, to sort the pairs:

[('jan', 1), ('fev', 5), ('avr', 10), ('sep', 14), ('mar', 20), ('juin', 25)]

If you want to reverse the sort order, so the highest numbers are at the top, use another argument:

In [15]:
sorted(mydict.items(), key=lambda p: p[1], reverse=True)

[('juin', 25), ('mar', 20), ('sep', 14), ('avr', 10), ('fev', 5), ('jan', 1)]

### Another way, using a module that does the lambda for you:

In [16]:
from operator import itemgetter
sorted(mydict.items(), key=itemgetter(1), reverse=True)

[('juin', 25), ('mar', 20), ('sep', 14), ('avr', 10), ('fev', 5), ('jan', 1)]

In [17]:
sorted(mydict.items(), key=itemgetter(0))  # alphabetical order, sorting by the key, the 0th item

[('avr', 10), ('fev', 5), ('jan', 1), ('juin', 25), ('mar', 20), ('sep', 14)]
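
When several entries share the same value, the sort key can return a tuple so that ties are broken by a second rule. The sketch below uses a made-up `word_counts` dictionary and also shows how to turn the sorted list of pairs back into a dictionary.

```python
word_counts = {'pear': 2, 'fig': 5, 'apple': 2, 'kiwi': 5}

# Sort by count, highest first; for equal counts, fall back to alphabetical order.
# Negating the count reverses only that part of the sort.
ranked = sorted(word_counts.items(), key=lambda p: (-p[1], p[0]))
print(ranked)

# sorted() returns a list of pairs. Passing it to dict() gives a dictionary
# that keeps this order (dictionaries remember insertion order in Python 3.7+).
ranked_dict = dict(ranked)
print(list(ranked_dict))
```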