## Assignment 1: Onegram Counter
You probably know the Google Books Ngram Viewer: when you enter phrases into it, it displays a graph showing how often those phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years.

Your assignment for this course is something similar: build a Python function that takes the file `data/corpus.txt` (UTF-8 encoded) from this repo as an argument and prints a count of the 100 most frequent 1-grams (i.e., single words).

In essence, the job is to do this:

In [17]:
from collections import Counter
import os

def onegrams(file):
    with open(file, 'r') as corpus:
        text = corpus.read()
        # .casefold() is better than .lower() here
        # https://www.programiz.com/python-programming/methods/string/casefold
        normalize = text.casefold()
        words = normalize.split(' ')
        count = Counter(words) 
        return count

ngram_viewer = onegrams(r"C:\Users\anouk\OneDrive\Documenten\GitHub\InformationScience\course\data\corpus.txt")
print(ngram_viewer.most_common(100))

[('the', 11852), ('', 5952), ('of', 5768), ('and', 5264), ('to', 4027), ('a', 3980), ('in', 3548), ('that', 2336), ('his', 2061), ('it', 1517), ('as', 1490), ('i', 1488), ('with', 1460), ('he', 1448), ('is', 1400), ('was', 1393), ('for', 1337), ('but', 1319), ('all', 1148), ('at', 1116), ('this', 1063), ('by', 1042), ('from', 944), ('not', 933), ('be', 863), ('on', 850), ('so', 763), ('you', 718), ('one', 694), ('have', 658), ('had', 647), ('or', 638), ('were', 551), ('they', 547), ('are', 504), ('some', 498), ('my', 484), ('him', 480), ('which', 478), ('their', 478), ('upon', 475), ('an', 473), ('like', 470), ('when', 458), ('whale', 456), ('into', 452), ('now', 437), ('there', 415), ('no', 414), ('what', 413), ('if', 404), ('out', 397), ('up', 380), ('we', 379), ('old', 365), ('would', 350), ('more', 348), ('been', 338), ('over', 324), ('only', 322), ('then', 312), ('its', 307), ('such', 307), ('me', 307), ('other', 301), ('will', 300), ('these', 299), ('down', 270), ('any', 269), ...

However, you can't use the `collections` library. :-)

Moreover, try to think about what else may be suboptimal in this example. For instance, this code loads the entire text into memory at once (with the `read()` method). What would happen if we tried this on a really big text file?
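
One way to explore that question: a file object is itself an iterator over its lines, so looping over it keeps only the current line in memory. Below is a minimal sketch, assuming a hypothetical helper named `iter_words` and plain whitespace tokenization (no punctuation handling yet):

```python
def iter_words(path):
    """Yield casefolded, whitespace-separated tokens one at a time.

    Hypothetical helper for illustration: only the current line is held
    in memory, unlike corpus.read(), which loads the whole file at once.
    """
    with open(path, "r", encoding="utf-8") as corpus:
        for line in corpus:  # the file object yields one line per iteration
            for token in line.casefold().split():
                yield token
```

A loop such as `for word in iter_words("data/corpus.txt")` then processes the corpus word by word. Note that this only helps when the file actually contains line breaks; a file consisting of one gigantic line would still be read in one piece.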

**Most importantly, the count is wrong.** Check this, for instance, by counting a word in a text editor, and try to find out why.
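
A small experiment can help with the search. The sketch below uses an invented sample string (not taken from `corpus.txt`) to show what `split(' ')` returns when the text contains a double space, a newline, or punctuation:

```python
# invented sample, not from corpus.txt: note the double space and the newline
sample = "The whale,  the whale!\nThe ship."

tokens = sample.casefold().split(' ')
print(tokens)
# → ['the', 'whale,', '', 'the', 'whale!\nthe', 'ship.']

# "the" occurs three times in the sample, but only two tokens equal 'the'
print(tokens.count('the'))
# → 2
```

The empty string between the two spaces becomes a token of its own, punctuation stays glued to the words, and the newline joins `'whale!'` and `'the'` into a single token. Compare this with the `('', 5952)` entry in the output above.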

If this is an easy task for you, you can also think about the graphical representation of the 1-gram count.
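
For the graphical part, a plotting library such as matplotlib would be one option. As a dependency-free starting point, here is a minimal sketch of a text bar chart. The function name `text_bars` and the `width` parameter are made up for this illustration, and the input is assumed to be a list of `(word, count)` tuples like the one `most_common()` returns.

```python
def text_bars(pairs, width=40):
    """Render (word, count) pairs as horizontal bars of at most `width` characters."""
    if not pairs:
        return ""
    longest = max(len(word) for word, _ in pairs)
    # largest count sets the scale; fall back to 1 to avoid dividing by zero
    top = max(count for _, count in pairs) or 1
    lines = []
    for word, count in pairs:
        # scale each bar relative to the largest count;
        # non-zero counts always get at least one character
        bar = "#" * max(1, round(count / top * width)) if count else ""
        lines.append(f"{word:<{longest}} | {bar} {count}")
    return "\n".join(lines)

print(text_bars([("the", 14400), ("of", 6645), ("and", 6403)], width=20))
```

Because the bars are scaled against the largest count, the chart stays readable whether the counts are in the tens or in the tens of thousands.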

---

> SOLUTION:

Not quite there yet, but closer:

In [16]:
import string
from operator import itemgetter

# translation table that deletes ASCII punctuation and digits;
# built once and reused for every line
REMOVE_NOISE = str.maketrans("", "", string.punctuation + string.digits)

def onegrams(file, mostfreq=100):
    """Return the n (mostfreq argument) most frequent words in a file."""
    # initialize output variable for frequency count
    counts = {}

    with open(file, "r", encoding="utf8") as corpus:
        for line in corpus:
            # remove punctuation and digits from the line
            # this solves problems like "the" != "the,"
            cleaned = line.translate(REMOVE_NOISE)
            # split() without an argument splits on any run of whitespace
            # (spaces, tabs, newlines) and never produces empty strings
            normalized_words = cleaned.casefold().split()

            # update the frequency of each word in the dictionary
            for word in normalized_words:
                counts[word] = counts.get(word, 0) + 1

    # sort the (word, count) pairs by count, highest first,
    # and slice to keep only the most common
    return sorted(counts.items(), key=itemgetter(1), reverse=True)[:mostfreq]

onegrams(r"C:\Users\anouk\OneDrive\Documenten\GitHub\InformationScience\course\data\corpus.txt", 10)

[('the', 14400),
 ('of', 6645),
 ('and', 6403),
 ('a', 4666),
 ('to', 4642),
 ('in', 4161),
 ('that', 2950),
 ('his', 2520),
 ('it', 2388),
 ('i', 1944)]
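
One possible reason the result is still "not quite there": `string.punctuation` only contains ASCII punctuation, so typographic quotes or dashes (if the corpus contains any) would survive the cleanup, and deleting hyphens glues compounds together (a word like "sperm-whale" would become "spermwhale"). A hedged alternative, not the method used above, is to extract runs of letters with a regular expression. The name `tokenize` is hypothetical.

```python
import re

# characters that are word characters but neither digits nor underscores,
# i.e. runs of Unicode letters
WORD = re.compile(r"[^\W\d_]+")

def tokenize(line):
    """Return the casefolded letter runs in a line (hypothetical helper)."""
    return WORD.findall(line.casefold())

# invented sample with curly quotes (\u201c, \u201d) and a long dash (\u2014)
print(tokenize("\u201cThe sperm-whale,\u201d said he\u2014the whale of 1851."))
# → ['the', 'sperm', 'whale', 'said', 'he', 'the', 'whale', 'of']
```

This is a trade-off rather than a fix: hyphenated compounds are now counted as separate words, and contractions such as "don't" are split into "don" and "t". Which behavior is right depends on how you define a 1-gram.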