# DSCI 512 Lab 1

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
from collections import defaultdict, Counter
%config InlineBackend.figure_formats = ['svg']

## Instructions
rubric={mechanics:10}

Follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

## Exercise 1: time complexity

For each of the following functions, determine the time complexity as a function of the input $n$ using big-O notation and briefly justify your answer. If you get stuck, it's fair game to test things empirically and then try to understand what you observe. **Please state your assumptions if you don’t know how long some operation in Python takes.** 

The first question is done for you, as an example.

In [None]:
def example(n):
    for i in range(n):
        print(i)
        print(i**2)
        x = 9
        y = 10

In [None]:
example(5)

**Sample answer**: The time complexity of `example` is $O(n)$ because the loop runs $n$ times and performs only constant-time operations on each iteration.
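If you want to check an answer like this empirically, one option is to count loop iterations instead of measuring wall-clock time. The sketch below uses a hypothetical helper, `count_iterations`, which mirrors the loop in `example` without the printing. It is only an illustration and is not part of the lab's required code.

```python
def count_iterations(n):
    """Mirror the loop structure of `example`, counting iterations instead of printing."""
    count = 0
    for i in range(n):
        count += 1  # stands in for the constant-time work inside the loop
    return count


# Doubling n doubles the count, which is what linear, O(n), growth looks like.
for n in [10, 20, 40, 80]:
    print(n, count_iterations(n))
```

The same idea (adding a counter in the innermost loop) can be applied to the functions below if you get stuck.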

#### 1(a)
rubric={reasoning:3}

In [None]:
def loopy(n):
    for i in range(n):
        for j in range(n):
            print('i =', i, '  j =', j)

In [None]:
loopy(4)

#### 1(b)
rubric={reasoning:3}

In [None]:
def triangle(n):
    for i in range(n):
        for j in range(i):
            print("+", end='')
        print("")

In [None]:
triangle(7)

#### 1(c)
rubric={reasoning:3}

In [None]:
def foo(n):
    x = np.zeros(n)
    x = x + 1000
    return x

In [None]:
print('size of x: ', len(foo(100000)))

#### 1(d)
rubric={reasoning:1}

In [None]:
def bar(n):
    x = np.zeros(1000)
    x = x + n
    return x

In [None]:
print('size of x: ', len(bar(100000)))

#### 1(e)
rubric={reasoning:3}

In [None]:
def broken(n):
    for i in range(n**2):
        if i == n:
            break  # "break" exits the innermost loop
        print(i)

In [None]:
broken(4)

#### 1(f)
rubric={reasoning:3}

In [None]:
def cabin(n):
    i = n
    while i > 1:
        print('i = ', i)
        i = i // 2  

**Note:** the `//` operator performs integer division, meaning the result is rounded *down* to the nearest integer.
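A few quick checks of `//`, including a negative operand to show that the rounding is toward negative infinity rather than toward zero:

```python
# Floor division keeps only the integer part, rounding down.
half_of_seven = 7 // 2        # 3
half_of_2048 = 2048 // 2      # 1024
half_of_one = 1 // 2          # 0
negative_case = -7 // 2       # -4, because rounding is downward

print(half_of_seven, half_of_2048, half_of_one, negative_case)
```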

In [None]:
cabin(2048)

#### 1(g)
rubric={reasoning:3}

In [None]:
def cabin10(n):
    i = n
    while i > 1:
        print('i = ', i)
        i = i // 10

In [None]:
cabin10(2048)

#### 1(h)
rubric={reasoning:3}

For this question, answer in terms of both $m$ and $n$.

In [None]:
def blahblah(n, m):
    x = 0

    for i in range(n):
        for j in range(m):
            x = x + 1

    for i in range(n):
        x = x + 1

    for i in range(m):
        x = x + 1
        
    return x

In [None]:
blahblah(2,3)

#### 1(i)
rubric={reasoning:3}

For this question, answer in terms of both $m$ and $n$.

In [None]:
def bllllergh(n, m):
    x = 0
    for i in range(n):
        for j in range(m):
            for k in range(m):
                x = x + 1
    for i in range(n):
        for j in range(n):
            for k in range(m):
                x = x + 1
    return x

In [None]:
bllllergh(2,3)

#### 1(j)
rubric={reasoning:1}

In [None]:
def log_cabin(n):
    for i in range(n):
        print('i = ', i)
        for j in range(n//3):
            print('j = ', j)
            cabin(n)
        print('-----------')

In [None]:
log_cabin(4)

#### (challenging) 1(k)
rubric={reasoning}

In [None]:
def oh_no(n):
    i = 0
    while i < n:
        i = i - 1

## Exercise 2: space complexity

For each of the following functions, determine the space complexity as a function of the input $n$ using big-O notation and briefly justify your answer. 

#### 2(a)
rubric={reasoning:3}

In [None]:
def foo(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    total = 0
    for x_i in x:
        for y_i in y:
            total += x_i*y_i
    return total

#### 2(b)
rubric={reasoning:3}

In [None]:
def bar(n):
    x = np.zeros(1000)
    x = x + n
    return x

#### 2(c)
rubric={reasoning:3}

In [None]:
def FUNCTION(n):
    x = set()
    for i in range(n):
        for j in range(n):
            x.add(j)
    return x

## Exercise 3: Timing for searching, sorting, hashing

#### 3(a)
rubric={reasoning:3}

Is searching in a Python list faster when the element you're looking for is at the start of the list? Design and run an "experiment" comparing the runtime of Python's `in` operator for the case when the element being sought is at the start vs. the end of a list.

(Optional: It would be nice to show a plot of running time vs. different list sizes to illustrate the trend.)

Does it seem to make a difference? Briefly discuss your results.
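One possible way to time a piece of code is the standard library's `timeit` module. The sketch below defines a hypothetical helper, `time_statement`, and runs it on a placeholder workload that is unrelated to the question; the helper name, the number of repeats, and the workload are all assumptions made for illustration.

```python
import timeit


def time_statement(func, repeats=5, number=100):
    """Return the best average time (in seconds) per call of `func`."""
    # timeit.repeat calls `func` `number` times per trial, for `repeats` trials,
    # and returns the total time of each trial.
    trials = timeit.repeat(func, repeat=repeats, number=number)
    return min(trials) / number


# Placeholder workload, chosen only to show how the helper is called.
elapsed = time_statement(lambda: sum(range(10_000)))
print(f"{elapsed:.2e} seconds per call")
```

Taking the minimum over several trials is a common way to reduce noise from other processes running on the machine.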

#### 3(b)
rubric={reasoning:2}

Is sorting in Python faster when the array you're sorting is already sorted? Design and run an "experiment" comparing the runtime of numpy's `.sort()` for the case when the array is vs. isn't already sorted. Does it seem to make a difference?

#### 3(c)
rubric={reasoning:5}

We saw in class that hash tables like Python's `dict` grow when they get too full. Make a plot of the _size_ of a dictionary, as measured by `sys.getsizeof`, vs. the number of elements it contains. (The `sys` module is not imported at the top of this notebook, so you will need `import sys`.) At what sizes does the dictionary seem to grow? Discuss your results.
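As a starting point, here is a minimal sketch of calling `sys.getsizeof` on two dictionaries. The variable names and the choice of 1000 elements are arbitrary assumptions; the exact byte counts depend on your Python version and platform, so none are shown here.

```python
import sys

empty = {}
filled = {i: i for i in range(1000)}

# getsizeof reports the size in bytes of the dict object itself;
# it does not include the memory used by the keys and values it refers to.
print(sys.getsizeof(empty))
print(sys.getsizeof(filled))
```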

#### (challenging) 3(d)
rubric={reasoning}

Now do the same experiment but with a `list` instead of a `dict`. Discuss your results.

## Exercise 4: Markov Model of language

_Meta-comment 1_: this is more of a programming exercise. There was some talk of having it in DSCI 511, but we ended up putting it here. However, it's definitely good practice in _using_ Python data structures like dictionaries. There are some more challenging questions about time/space complexity at the end, which you can skip if you don't have enough time for them. Overall, this is not a perfect thematic fit with DSCI 512, but it's very good practice (and hopefully fun!).

In this exercise we will try to synthesize English text by "learning" from some input text, also known as a _corpus_. As an example, let's say the input text is the following, taken from the MDS website:

> Data is everywhere. Continuously generated and collected across every domain, it is a vast and largely untapped resource of information with the potential to reveal insights about every aspect of our lives and the world we live in. However, the ability to uncover these insights is a highly specialized skill possessed by far too few. 

Our algorithm involves a parameter, which we'll call $n$. Let me first explain the approach when $n=1$: 

- We will start with an initial character, say "y". There are 8 occurrences of "y" in the input text above. What character typically comes after "y"? It turns out (according to the input text above) that the next character is "w" the first time and " " (the space character) the other 7 times. So we estimate the conditional probability distribution of the next character, given that the current character is "y", to be "w" with probability 1/8 and " " with probability 7/8 (and probability zero for all other characters).

- To generate the next character, we generate a sample from this simple distribution. Say we pick " ", so we add a " " to our output text and it is now "y ". Now " " is our current character. To generate the next character, we'd need the probability distribution of what comes after " " so that we could sample from it. We'd repeat this until the output text reaches a pre-specified length.
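The counts quoted for "y" can be reproduced with a few lines of Python. This sketch looks at a single character purely to illustrate the $n=1$ case; the variable names are assumptions, and it is not the efficient precomputation that the exercise asks you to implement.

```python
from collections import Counter

text = (
    "Data is everywhere. Continuously generated and collected across every "
    "domain, it is a vast and largely untapped resource of information with "
    "the potential to reveal insights about every aspect of our lives and "
    "the world we live in. However, the ability to uncover these insights is "
    "a highly specialized skill possessed by far too few."
)

# Pair each character with the one that follows it, keeping pairs that start with "y".
followers = Counter(nxt for cur, nxt in zip(text, text[1:]) if cur == "y")

# Normalize the counts into an empirical conditional distribution.
total = sum(followers.values())
probabilities = {ch: count / total for ch, count in followers.items()}

print(followers)
print(probabilities)
```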

What about larger $n$? For $n=3$, we pick the next character by looking at the _preceding 3 characters_. We use the name [_n-gram_](https://en.wikipedia.org/wiki/N-gram) for a sequence of $n$ characters. Our method should work for any $n>0$.

For example, take our initial text to be the 3 characters "is ":
There are 3 occurrences of this $n$-gram in the text. In this case, the next letter is "e" once and "a" twice, so we estimate the conditional distribution to be 1/3 "e" and 2/3 "a" after "is ". So we pick randomly from this distribution. Say we pick "e". Then our output text is now "is e" but our current $n$-gram is just "s e" because we're only using $n=3$. So to pick the next character after this, we'd look at what happens after occurrences of "s e". And so on.
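The sampling and window-sliding step for this $n=3$ example might look like the sketch below. It uses the standard library's `random.choices` as a stand-in so that the snippet has no dependencies (the exercise itself asks for `np.random.choice` with `p=`), and the distribution is typed in by hand from the example above rather than computed.

```python
import random

n = 3
current_ngram = "is "
# Distribution from the worked example: after "is ", "e" occurs once and "a" twice.
next_char_probs = {"e": 1 / 3, "a": 2 / 3}

rng = random.Random(0)  # seeded only so the sketch is repeatable
chars = list(next_char_probs)
weights = list(next_char_probs.values())
next_char = rng.choices(chars, weights=weights, k=1)[0]

output = current_ngram + next_char
current_ngram = output[-n:]  # keep only the last n characters as the new context

print(repr(output), repr(current_ngram))
```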

In order to implement this idea efficiently, you will pre-compute the conditional probability distribution for every possible $n$-gram. To do that, we need to count, for every possible $n$-gram, the frequencies of the possible next characters, and then normalize them into probability distributions.

*Attribution*: this exercise is adapted with permission from Princeton COS 126, [_Markov Model of Natural Language_](http://www.cs.princeton.edu/courses/archive/fall15/cos126/assignments/markov.html). The original assignment was developed by Bob Sedgewick and Kevin Wayne. If you are interested in more background info, you can take a look at the original version. The original paper by Shannon, [A Mathematical Theory of Communication](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf), essentially created the field of information theory and is thought to be one of the best scientific papers ever written (in terms of both impact and readability).

In [None]:
# Grimms' Fairy Tales by Jacob and Wilhelm Grimm
data_url = 'http://www.gutenberg.org/files/2591/2591-0.txt'
corpus = urllib.request.urlopen(data_url).read().decode("utf-8")

# remove the first chunk of characters, which contains some header stuff
corpus = corpus[2820:]

In [None]:
print(corpus[:200])  # print out the first 200 characters

#### 4(a): implementation
rubric={accuracy:10,quality:10}

You will implement the above algorithm in a class called `MarkovModel`. Your class will have the following methods:

- `__init__`, which is already implemented for you.
- `fit`, which calculates and stores the _frequencies_ of all possible next characters given an $n$-gram. These frequencies should be stored in a `dict` of `dicts`, where the keys of the outer `dict` are the $n$-grams and the keys of the inner `dict` are the possible next characters, and the values of the inner `dict` are the frequencies (counts). Then, at the end of `fit`, normalize these frequencies into empirical probabilities and store them in `self.probabilities`.
**Note:** before starting the calculations, append the first $n$ characters of your corpus to the end of the corpus, making it "circular"; this avoids a situation where your `generate` function gets stuck after producing the $n$-gram at the very end of the corpus, which would otherwise have no next character.
- `generate`, which creates a random text of a specified length by generating one character at a time from the appropriate (discrete) probability distribution. To perform the random sampling, use the parameter `p=` of `np.random.choice`. You can start the output text with the first $n$ characters of the input text.

**Note:** you may find some of the fancy dictionaries in the [`collections`](https://docs.python.org/3.7/library/collections.html) module useful, namely `defaultdict` and/or `Counter`. However, you can also just use `dict`; either way is fine.
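To show what the `dict` of `dicts` layout looks like, here is a small sketch that uses `defaultdict(Counter)` with hand-entered observations taken from the "is " example earlier in this exercise. The variable names are assumptions, and this is one possible layout rather than the required one.

```python
from collections import Counter, defaultdict

# Outer keys are n-grams; each inner Counter maps a next character to its count.
frequencies = defaultdict(Counter)

# Hand-entered observations: "is " was followed by "e" once and "a" twice.
for ngram, next_char in [("is ", "e"), ("is ", "a"), ("is ", "a")]:
    frequencies[ngram][next_char] += 1  # missing keys start at 0 automatically

# Normalize one inner Counter into probabilities.
total = sum(frequencies["is "].values())
probs = {ch: count / total for ch, count in frequencies["is "].items()}
print(probs)
```

With a plain `dict` you would need to check whether each key already exists before incrementing; `defaultdict` and `Counter` handle that for you.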

**Hint:** if you find yourself searching for all occurrences of an $n$-gram in the text, you are approaching this incorrectly—in that case, ask us for help!

In [None]:
class MarkovModel:
    """A Markov model of languages based on character frequencies in text."""

    def __init__(self, n):
        self.n = n
        self.probabilities = None 
        self.starting_chars = None

    def fit(self, text):
        """
        Fit a Markov model and create a transition matrix.

        Parameters
        ----------
        text : str
            a corpus of text 
        """
        
        # store the first n characters of the training text, as we will use these
        # to seed our `generate` function
        self.starting_chars = text[:self.n]
        
        # make text circular so Markov chain doesn't get stuck
        circ_text = text + text[:self.n]

        # Step 1: Compute frequencies
        # FILL IN THE REST OF THE CODE HERE
        
        # Step 2: Normalize the frequencies into probabilities
        # FILL IN THE REST OF THE CODE HERE
        
    def generate(self, seq_len):
        """
        Generate a sequence of length seq_len using the Markov model learned in `fit`.

        Parameters
        ----------
        seq_len : int
            the desired length of the sequence

        Returns
        -------
        str
            the generated sequence
        """
        s = self.starting_chars
        while len(s) < seq_len:
            current_ngram = s[-self.n:]
            
            # FILL IN THE REST OF THE CODE HERE
            
        return s
    

Here are some tests that should pass if `fit` is implemented correctly (they will not pass until you fill in the code above):

In [None]:
mm = MarkovModel(n=2)
test_corpus = "2 + 2 = 4; 2 + 3 = 5; 3 + 3 = 9; 3 + 2 = 5;"
mm.fit(test_corpus)

assert mm.starting_chars == '2 '
assert mm.probabilities['2 ']['+'] == 1/2
assert mm.probabilities[' 3'][' '] == 1
assert mm.probabilities[';2'][' '] == 1

In [None]:
print(mm.generate(40))

And here we run it on our fairy tales corpus:

In [None]:
mm = MarkovModel(n=5)
mm.fit(corpus)

In [None]:
print(mm.generate(200))

#### 4(b): fun with language models
rubric={reasoning:5}

1. Explain what happens as you increase $n$ from 1 to larger and larger values. At what point does it start to look like English? At what point is your model just memorizing the input corpus?
2. Generate some random sequences using the data set of your choice. Submit your favourite randomly generated sequence as well as the link to the data you used to generate it. If you are out of ideas, you may find some text files of popular books [here](http://www.gutenberg.org/).

#### (challenging) 4(c): time complexity of `fit`
rubric={reasoning}

For the above implementation, what is the (worst case) time complexity of running `fit` in terms of:

- $n$, the length of each $n$-gram
- the length of the corpus, which we'll call $N$
- the length of the sequence to generate, `seq_len`, which we'll call $T$

You can assume `np.random.choice` takes $O(1)$ time. You can also assume $n \ll N$ and $n \ll T$.

#### (challenging) 4(d): time complexity of `generate`
rubric={reasoning}

For the above implementation, what is the (worst case) time complexity of running `generate` in terms of $n$, $N$, and $T$?

#### (challenging) 4(e): total time complexity
rubric={reasoning}

What is the total time complexity of running `fit` once and then `generate` once, in terms of $n$, $N$, and $T$?

#### (challenging) 4(f): space complexity
rubric={reasoning}

What are the space complexities of `fit` and `generate`?