## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the below markdown list, under header __Module submission group__. It is required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may as well list other optional comments pertaining to the submission at bottom. _Any distruption of this header's formatting will make your group liable to the 2-point deduction._

## Module submission group
- Group member 1
    - Name: Eric Benton
    - Email: emb393@drexel.edu
- Group member 2
    - Name: Michael Wesner
    - Email: mw3344@drexel.edu
- Group member 3
    - Name: Dustin Luchmee
    - Email: dbl47@drexel.edu
- Group member 4
    - Name: NA
    - Email: NA

### Additional submission comments
- Tutoring support received: Hunter Heidenreich
- Other (other): NA

# DSCI 691: Natural language processing with deep learning <br> Assignment 2: Abstracting Summaries of the News
## Data and Utilities 
Here, we'll be working again with the same linked NewsTweet data and some essential utilities presented in the __Chapter 1 Notes__.

In [1]:
import json
newstweet = json.load(open('./data/newstweet-subsample-linked.json'))
exec(open('./01-utilities.py').read())

## Overview 
The purpose of this assignment (52 pts) is to gain some experience with the extremely important NLP task called language modeling. We'll explore its standard $n$-gram approach, and likewise work to generalize a more flexible semantically-dense model that uses CBOW statistics.

Since this is a language modeling (LM) assignment, for the sanity checks we'll be working on a single document from the data set throughout, focused on a Robert Downey Jr. movie. But in principle you should be able to apply this assignment to any of the articles and generate text for summaries, and you should&mdash;it's fun!

Anyway, here's the article of focus:

In [1]:
print(newstweet[5]['text'])

As we continue, we'll explore ways that we can use both sparse and dense frequency-based models to regenerate this document, and along the way we'll get some experience with modeling sampling and perplexity as a performance measure.

## Experiment

### 1. (3 pts) Operate the Chapter 1 Statistical Engine
Here, you must complete the `reduced_rownormed_CoM(newstweet)` function, which entails operating the `make_CoM` and `svdsub` functions under the default settings over all `newstweet` `'text'` fields, and storing their outputs.

In particular, you _must_ recover the following named objects:

- a `CoM` from the `make_CoM` function;
- the `CoM_d` dimensional reduction from `svdsub`;
- a `type_index` from `make_CoM`; _and_
- a transformed copy of `CoM_d` which has had its rows divided by their `np.linalg.norm()`s.

In [3]:
# A1:Function(3/3)
import numpy as np

def reduced_normed_CoM(newstweet):

    #--- your code starts here
    CoM, type_index = make_CoM([x['text'] for x in newstweet])
    CoM_d = svdsub(CoM)
    CoM_d_normed = CoM_d.T / np.linalg.norm(CoM_d, axis = 1)
    CoM_d_normed = CoM_d_normed.T
    #--- your code stops here
    
    return CoM, type_index, CoM_d, CoM_d_normed

For reference, your output should be:
```
((31980, 50), 31979.99999999998)
```

In [4]:
# A1:SanityCheck

CoM, type_index, CoM_d, CoM_d_normed = reduced_normed_CoM(newstweet)
CoM_d_normed.shape, (CoM_d_normed**2).sum()

((31980, 50), 31979.99999999999)

### 2. (3 pts) Build an $n$-gram counter
Given an input list of `tokens`, use list slices to complete the `count(tokens, n = 1)` function to produce and return the `ngram_counts` object, as a `Counter()` of `n`-sized tuples.

In [5]:
# A2:Function(3/3)
from collections import Counter

def count(tokens, n = 1):
    
    #--- your code starts here
    
    ngram_counts = Counter(zip(*[tokens[i:] for i in range(n)]))
    
    #--- your code stops here
    
    return ngram_counts


For reference, your output should be:

```
Counter({('this', ' ', 'is', ' ', 'an'): 1,
         (' ', 'is', ' ', 'an', ' '): 1,
         ('is', ' ', 'an', ' ', 'example'): 1,
         (' ', 'an', ' ', 'example', ' '): 1,
         ('an', ' ', 'example', ' ', 'of'): 1,
         (' ', 'example', ' ', 'of', ' '): 1,
         ('example', ' ', 'of', ' ', 'a'): 1,
         (' ', 'of', ' ', 'a', ' '): 1,
         ('of', ' ', 'a', ' ', 'token'): 1,
         (' ', 'a', ' ', 'token', ' '): 1,
         ('a', ' ', 'token', ' ', 'stream'): 1})
```

In [6]:
# A2:SanityCheck

count(["this", " ", "is", " ", "an", " ", "example", " ", 
       "of", " ", "a", " ", "token", " ", "stream"], 5)

Counter({('this', ' ', 'is', ' ', 'an'): 1,
         (' ', 'is', ' ', 'an', ' '): 1,
         ('is', ' ', 'an', ' ', 'example'): 1,
         (' ', 'an', ' ', 'example', ' '): 1,
         ('an', ' ', 'example', ' ', 'of'): 1,
         (' ', 'example', ' ', 'of', ' '): 1,
         ('example', ' ', 'of', ' ', 'a'): 1,
         (' ', 'of', ' ', 'a', ' '): 1,
         ('of', ' ', 'a', ' ', 'token'): 1,
         (' ', 'a', ' ', 'token', ' '): 1,
         ('a', ' ', 'token', ' ', 'stream'): 1})

## 3. (4 pts) Build $n$-gram frequencies up to a size
Here, your job will be to apply the `count` function to build $n$-gram frequency distributions up to a given maximum size. To do this, complete the `make_ngram_frequency(documents, n = 1, space = True)`
The function's main argument is `documents`, which will be a list of strings, and the function's only output will be `ngram_frequencies`, which will be a list of $n$-gram `Counter()`s, up to a specified (by `n`) size.

In [7]:
# A3:Function(4/4)

def make_ngram_frequency(documents, n = 1, space = True):
    ngram_frequencies = []
    
    #--- your code starts here
    docs = ' '.join(documents)
    for i in range(1, n+1):
        ngram_frequencies.append(count(tokenize(docs), i))
    #--- your code stops here
        
    return ngram_frequencies

For reference, your output should be:
```
8
```

In [8]:
# A3:SanityCheck

n = 9
ngram_frequencies = make_ngram_frequency([x['text'].lower() for x in newstweet], n = n)
ngram_frequencies[5][tuple(tokenize('robert downey jr.'))]

8

## 4. (6 pts) Build the standard LM
Now it's time for the model, i.e., computing the LM probabilities from __Section 2.1.4.1__. This will entail completing the two tricks, first, `varepsilon`-smoothing:
$$
\hat{P}(t_n|t_1, t_2, \cdots t_{n-1}) = \frac{\varepsilon + f(t_1, t_2,\cdots, t_n)}{\varepsilon|W| + f(t_1, t_2,\cdots, t_{n-1})}
$$
where some small, constant non-zero weight ($\varepsilon$) is distributed to each type of the model in _every_ context, regardless of it's actual appearance in the data.

The second component you'll have to sort out is _backoff_, where the desired probabilities are approximated via the next-lower-$n$ context adjacent to the prediction point:
$$
\hat{P}(t_n|t_1, \cdots t_{n-1})\approx\hat{P}(t_n|t_2, \cdots t_{n-1}).
$$

For both cases, this amounts to determining `t_Ps`, as a `Counter()` of types-as-keys with probability values and thus completing the function:
```
P_next(gram, ngram_frequencies, type_index, epsW = 0.1)
```
for which `gram` corresponds to the vector, $\vec{t} = [t_1,\cdots,t_{n-1}]$ of tokens preceeding the prediction point.  Here, `epsW` will indicate _the total mass_ used by the smoothing parameter, $\varepsilon$. Hence, the default setting `epsW = 0.1` will mean:
$$
\varepsilon = \frac{0.1}{|W|}.
$$
which should allow for convienient parameterization of _slight_ smoothings, which won't totally swamp the model's performance.

[Hint. use the `n1` (context size) to navigate the `ngram_frequencies` object, and don't be afraid to slice `gram`s].

In [9]:
# A4:Function(6/6)

def P_next(gram, ngram_frequencies, type_index, epsilon = .1):
    n1 = len(gram)
    epsilon /= len(type_index)
    x = []
    t_Ps = Counter()
    if gram in ngram_frequencies[n1-1]: ## use gram to condition frequencies with epsilon-smoothing
        
        #--- your code starts here
        
        for key in type_index.keys():
            x.append((key, ((epsilon + ngram_frequencies[n1][gram+(key,)]) / (epsilon*len(type_index) + ngram_frequencies[n1-1][gram]))))
        
        for j in x:
            t_Ps[j[0]] = j[1]    
        #--- your code stops here
        
    else: ## recursively back off to lower-n model
        
        #--- your code starts here
        gram = gram[1:]
        return P_next(gram, ngram_frequencies, type_index, epsilon = .1)
        #--- your code stops here

    
    return t_Ps

For reference, your output should be:
```
[1.0000000000000002,
 [('adventure', 0.9090937517766786),
  ('\n', 2.8426857695150374e-06),
  (' ', 2.8426857695150374e-06),
  ('!', 2.8426857695150374e-06),
  ('"', 2.8426857695150374e-06),
  ('#', 2.8426857695150374e-06),
  ('$', 2.8426857695150374e-06),
  ('%', 2.8426857695150374e-06),
  ('&', 2.8426857695150374e-06),
  ("'", 2.8426857695150374e-06)],
 [(' ', 0.9090937517766786),
  ('\n', 2.8426857695150374e-06),
  ('!', 2.8426857695150374e-06),
  ('"', 2.8426857695150374e-06),
  ('#', 2.8426857695150374e-06),
  ('$', 2.8426857695150374e-06),
  ('%', 2.8426857695150374e-06),
  ('&', 2.8426857695150374e-06),
  ("'", 2.8426857695150374e-06),
  ("''", 2.8426857695150374e-06)]]
```

In [10]:
# A4:SanityCheck

[np.nansum([x for x in P_next(tuple(tokenize("he goes on an ")), ngram_frequencies, type_index).values()]), 
 list(P_next(tuple(tokenize("he goes on an ")), ngram_frequencies, type_index).most_common(10)), 
 list(P_next(tuple(tokenize(" he goes on an")), ngram_frequencies, type_index).most_common(10))]

[1.0000000000000002,
 [('adventure', 0.9090937517766786),
  ('\n', 2.8426857695150374e-06),
  (' ', 2.8426857695150374e-06),
  ('!', 2.8426857695150374e-06),
  ('"', 2.8426857695150374e-06),
  ('#', 2.8426857695150374e-06),
  ('$', 2.8426857695150374e-06),
  ('%', 2.8426857695150374e-06),
  ('&', 2.8426857695150374e-06),
  ("'", 2.8426857695150374e-06)],
 [(' ', 0.9090937517766786),
  ('\n', 2.8426857695150374e-06),
  ('!', 2.8426857695150374e-06),
  ('"', 2.8426857695150374e-06),
  ('#', 2.8426857695150374e-06),
  ('$', 2.8426857695150374e-06),
  ('%', 2.8426857695150374e-06),
  ('&', 2.8426857695150374e-06),
  ("'", 2.8426857695150374e-06),
  ("''", 2.8426857695150374e-06)]]

### 5. (7 pts) Build a model sampler
Now that we have a LM, we need a way to sample from it. To start complete the function:
```
sample_LM(gram, LM_args, top = 1., LM = P_next)
``` 
which must perform a weghted random sample via `np.random.choice()`, using the `gram` for the context of a prediction point (as in __Part 5.__, for `P_next()`). However, this sampler must deploy one of two sampling algorithms, as specified by the `top` parameter. Specifically:
1. when `type(top) == float`, the floating point value of `top` should represent the cumulative probabiliy of top-scoring predictions to weight a sample from; and
2. when `type(top) == int`, the integer value of `top` should represent the `top` highest-ranking predicitons to weight a sample from.

In case (1), the sample might range over many or few possibilities, depending on the confusion of the model at the point of the prediction, and in case (2) the sample might be constrained to a limited vocabulary at each step. However, in both your code should use, e.g., a boolean mask to filter the `Ps` (prediciton probabilities) and `ts` (prediction types) down to just those in the `top`, i.e., 'viable' set that will be passed to the sampler.

Note: in both cases your filtered prediction probabilities (`Ps`) must be re-normalized for the weighted random sample!

In [11]:
# A5:Function(7/7)

def sample_LM(gram, LM_args, top = 1., LM = P_next):
    
    Ps = LM(gram, *LM_args)
    ts, Ps = map(np.array, zip(*Ps.most_common()))
    Ps /= Ps.sum()
    
    #--- your code starts here
    if isinstance(top, float):
        prob = 0.0
        ts_temp = []
        ps_temp = []
        
        i = 0
        while prob < top and i < len(Ps):
            prob += Ps[i]
            ps_temp.append(Ps[i])
            ts_temp.append(ts[i])
            i += 1
            
        ts, Ps = ts_temp, ps_temp
            
    elif isinstance(top, int):
        Ps = Ps[0:top]
        ts = ts[0:top]

    ps_sum = sum(Ps)
    for i in range(0, len(Ps)): 
        Ps[i] /= ps_sum
     
    #--- your code stops here
    s = np.random.choice(ts, size=1, replace=False, p=Ps)[0]
    
    return s

For reference, your output should be:
```
'adventure'
```

In [12]:
# A5:SanityCheck

np.random.seed(691)
sample_LM(tuple(tokenize("he goes on an ")), (ngram_frequencies, type_index, 0.01))

'adventure'

### 6. (5 pts) Build a recitation function for the LM
Here, our goal will be to have the LM 'recite' a given `document` string input. In particular, your job is to complete the function:
```
recitation, likelihood = recite(document, LM_args, LM = P_next, n = 5, top = 1., verbose = True)
```
which has the following arguments:
- `document`: a string to be modeled by its ngrams
- `LM_args`: a tuple of all arguments to be passed to the LM
- `LM (= P_next)`: the LM function to be operated for the recitation
- `n (= 5)`: an integer number indicating the gram-size to model from
- `top (= 1.)`: the sampling paramater for the `sample_LM` function
- `verbose (= True)`: a boolean, indicating whether the model should print the text it produces, while operating

and has the following return values are:
- `recitation`: a list of the tokens which the LM _predicts_, in order
- `likelihood`: a list of the probabilities for the _correct_ targets of the LM as it operates

In [13]:
# A6:Function(5/5)

def recite(document, LM_args, LM = P_next, n = 5, top = 1., verbose = True):
    tokens = tokenize(document)
    ngram_stream = [tuple(tokens[i:i+n]) for i in range(0,len(tokens) - n + 1)]

    if verbose:
        print("generated document, starting from \""+"".join(ngram_stream[0][:-1])+"\":\n")
        
    likelihood = []; recitation = []
    for ix, ngram in enumerate(ngram_stream):
        
        #--- your code starts here
        s = sample_LM(ngram[:-1], LM_args)
        recitation.append(s)

        prob = LM(tuple(ngram_stream[ix][:-1]), *LM_args)[ngram_stream[ix][-1]]
        likelihood.append(prob)
        #--- your code stops here
        
        if verbose:
            print(recitation[-1], end = '')
    
    return recitation, likelihood

For reference, your output should be:
```
generated document, starting from "robert downey ":

jr. in withdraw have to you than people as of r460 owner. robert  although, morbid she this?official “bake” is released sunday, jim replied not forced to beat nikola post and set sail across the country,is a string island.

andthe were to control but to make on this weeks journey,” downey whispers to his teammates and furry friends.

and,this’involves lots into debt of of dangerous situations, and north chained in a blistering dungeon with a big who greets him with  whenhello, lunch” and another they him. he you this inform about a feat voiced by rami malek comes to these fans.

also,taking: universal moves robert downey jr.  message of doctor ned to january 2020 
“dolittle” tells the story a the doctor and whistleblower during the last got year victoria’s england, have. joshua brown, who are to service will the massive,of his child who years,at,caused him to be the party who only wanted to join  but there i dishwasher,queen (jessiechipotlebuckley, “wild rose”) falls ill, gruber did down sale amazing to find out a would-be. remember played signed on this perilous by a ghost person (harry collett, “dunkirk”) and the males friends, who " aircraft gorilla (malek), an event duck (octavia spencer), an hr ostrich (kumail nanjiani), an island polar bear (john cena), and if t-shirt parrot (emma thompson).

the brown's also stars as rassouli, along with head sheen as mudfly. additional voice performers include marion cotillard, frances de la tour, with ejogo, ralph fiennes, selena gomez  tom jurich, and one robinson.

“he” is a by james hawking (“syriana,” “closed”), as the alongside sony biden and jeff cox under their new/kirschenbaum films banner (“alice in wonderland,” “repugnant”), and it as the susan downey,(“sherlock holmes” franchise, aewthe mlf”) for its downey. downey jr. gets produces and iwth sarah bradshaw (“the mummy,” “closed”),and guides roth (“maleficent: mistress of evil extra-large.
```

In [14]:
# A6:SanityCheck

j = 5
document = newstweet[j]['text'].lower()
np.random.seed(691)
recitation, likelihood = recite(document, (ngram_frequencies, type_index, 0.01))

generated document, starting from "robert downey ":

jr. in withdraw have to you than people as of r460 owner. robert  although, morbid she this?official “bake” is released sunday, jim replied not forced to beat nikola post and set sail across the country,is a string island.

andthe were to control but to make on this weeks journey,” downey whispers to his teammates and furry friends.

and,this’involves lots into debt of of dangerous situations, and north chained in a blistering dungeon with a big who greets him with  whenhello, lunch” and another they him. he you this inform about a feat voiced by rami malek comes to these fans.

also,taking: universal moves robert downey jr.  message of doctor ned to january 2020 
“dolittle” tells the story a the doctor and whistleblower during the last got year victoria’s england, have. joshua brown, who are to service will the massive,of his child who years,at,caused him to be the party who only wanted to join  but there i dishwasher,queen (jessiec

### 7. (2 pts) Build a perplexity performance evaluator
We want to know how well this LM works, so let's compute the _average_ perplexity with respect to the document's stream of tokens (prediction points): $t_1, \cdots, t_m$. For each $i=1,\cdots,m$ of these, let $\hat{y}_i\in[0,1]^{|W|}$ be the probabilistic prediction vector over the vocabulary, so that $\hat{y}_{i,t_i}$ is the prediction probability for the _correct_ type, $t_i$ at the $i^\text{th}$ prediction point. Under this notation, we wisht to compute the perplexity across our a document:
$$
\mathcal{T}(t_1,\cdots,t_m) = e^{
    -\frac{1}{m}\sum_{i = 1}^m\log{\hat{y}_{i,t_i}}
}
$$
In order to do this, we'll have to work from the `recite()` function's `likelihood` output format, which should now be a `list` of the $\hat{y}_i\in[0,1]^{|W|}$ values. 

With this all in mind, your job is to complete the `perplexity(likelihood)`, which returns a floating point number named `average_perplexity` (computed as above).


In [15]:
# A7:Function(2/2)

def perplexity(likelihood):
    
    #--- your code starts here
    
    average_perplexity = 0.0

    for l in likelihood:
        average_perplexity += np.log(l)
    
    average_perplexity = average_perplexity * (-1 / len(likelihood))
    average_perplexity = np.e**(average_perplexity)

    #--- your code stops here
    
    return average_perplexity

For reference, your output should be:
```
average perplexity of recitation:  1.8009612714103027
```

In [16]:
# A7:SanityCheck

print("average perplexity of recitation: ", perplexity(likelihood))

average perplexity of recitation:  1.8019266613339904


### 8. (6 pts) Build a rambling function for the LM
Here, our goal will be to have the LM 'ramble' from a given `prompt` of token-stream (list of strings) input. Note: while this function can go 'off the script', it must start from a prompt within its vocabulary, as specified within `LM_args[-1]`, i.e., `type_index` object. In particular, you must complete the function:
```
rambling, likelihood = ramble(prompt, docsize, LM_args, LM = P_next, n = 5, top = 1., verbose = True)
```
which accepts the following arguments:
- `prompt`: a list of strings (tokens) which define the starting ngram for rambling prediction
- `docsize`: the integer number of tokens to generate in the `rambling`
- `LM_args`: same as for `recite()`
- `LM (= P_next)`: same as for `recite()`
- `n (= 5)`: same as for `recite()`
- `top (= 1.)`: same as for `recite()`
- `verbose (= True)`: same as for `recite()`

and has the following return values are:
- `rambling`: like `recitation`, a list of the tokens which the LM _predicts_, in order
- `likelihood`: _now_, a list of the probabilities for the _predictions_ made by the LM as it operates

In [17]:
# A8:Function(6/6)

def ramble(prompt, docsize, LM_args, LM = P_next, n = 5, top = 1., verbose = True):
    
    if verbose:
        print("generated document, starting from \""+"".join(prompt)+"\":\n")
    
    likelihood = []; rambling = []
    n1gram = prompt[-n:]
    while len(rambling) < docsize:
        #--- your code starts here
        s = sample_LM(n1gram, LM_args)
        rambling.append(s)
        prob = LM(tuple(n1gram), *LM_args) 
        likelihood.append(prob[s])
        n1gram = list(n1gram)
        n1gram.append(s)
        n1gram = n1gram[1:]
        n1gram = tuple(n1gram)
        #--- your code stops here
        if verbose:
            print(rambling[-1], end = '')
    return rambling, likelihood

For reference, your output should be:
```
generated document, starting from "robert downey jr":

. executive produces and co-stars in the second half, jones and xavien howard, the dolphins have done is build a private event.

average perplexity of ramble:  1.6576789181598104
```

In [18]:
# A8:SanityCheck

j = 5
np.random.seed(691)
document = tokenize(newstweet[j]['text'].lower())
rambling, likelihood = ramble(tuple(document[:5]), 46, 
                              (ngram_frequencies, type_index, 0.01))
print("\n\naverage perplexity of ramble: ", perplexity(likelihood))

generated document, starting from "robert downey jr":

. executive produces and co-stars in the second half, jones and xavien howard, the dolphins have done is build a private event.

average perplexity of ramble:  1.6577216073888121


### 9. (4 pts) Constraining the vocabulary of a ramble
Since we'd like to summarize these news articles, an easy trick to get the LM talk about the 'right stuff' is simply to constrain to the vocabulary of a given document. As such, we can and will make `type_index`-like objects for each article and then just use the same architecture as above.

So here, you must complete the `make_doc_types()`, which accepts a list of strings named `documents`, the overall `type_index`, and a usual `space` boolean parameter. This amounts to constructing the `doc_types` object as a list of dictionaries, each of which has the same format as `type_index`, with the caveat, that each of `doc_types[j]` should only contain the type-index mapping for its given `j`th `document`, from `documents`.



In [19]:
# A9:Function(4/4)

def make_doc_types(documents, type_index, space = True):
    
    doc_types = []
    
    #--- your code starts here
    for document in documents:
        document = document.lower()
        toked_doc = tokenize(document, space = space)
        document_dict = {}
        for token in toked_doc:
            document_dict[token] = type_index[token]
        doc_types.append(document_dict)
    #--- your code stops here
        
    return doc_types

For reference, your output should be:
```
generated document, starting from "robert downey jr":

. and his wife and her friends as no people after the new england is nanjiani), an cynical ostrich (kumail nanjiani), an enthusiastic duck (octavia spencer), an enthusiastic duck (octavia spencer), an enthusiastic duck (octavia spencer), an upbeat polar bear (john cena), and all of the people that have the no. 

average perplexity of ramble:  2.2622080624237064
```

In [20]:
# A9:SanityCheck

j = 5
np.random.seed(691)
document = tokenize(newstweet[j]['text'].lower())
doc_types = make_doc_types([x['text'].lower() for x in newstweet], type_index)
rambling, likelihood = ramble(tuple(document[:5]), 120, 
                              (ngram_frequencies, doc_types[j], 0.01))
print("\n\naverage perplexity of ramble: ", perplexity(likelihood))

generated document, starting from "robert downey jr":

. and his wife and her friends as no people after the new england is nanjiani), an cynical ostrich (kumail nanjiani), an enthusiastic duck (octavia spencer), an enthusiastic duck (octavia spencer), an enthusiastic duck (octavia spencer), an upbeat polar bear (john cena), and all of the people that have the no. 

average perplexity of ramble:  2.264691877974366


### 10. (5 pts) Build CBOW semantic vectors for the ngram contexts
Now that we can have the model speak the 'right stuff', consider this quote from a [not-that-old NLP paper](https://arxiv.org/pdf/1508.06615.pdf):
> Neural Language Models (NLM) address the n-gram data sparsity issue through parameterization of words as vectors (word embeddings) and using them as inputs to a neural network. The parameters are learned as part of the training process. Word embeddings obtained through NLMs exhibit the property whereby semantically close words are likewise close in the induced vector space.

What it's saying&mdash;and you should be observing this in the assignment's experiments&mdash;is that $n$-gram language models suffer from data sparsity issues, which make them overfit and brittle. Overcoming this issue is a core impact of nerual LM. But is this&mdash;neual&mdash;approach really the reason _why_ neural models accomplish better results, or is a large part of those improved results to do with the same _CBOW phenomenon_, i.e., moving from a sparse to dense linear-semantic representation?

This is precisely why we've ported in our __Chapter 1__ `01-utilities.py` code, i.e., to build a semantically-dense LM that makes more-flexible decisions based on multiple semantic dimensions. To allow for this, you must complete the function:
```
ngram_semantics = make_context_representations(ngram_frequencies, type_index, CoM)
```
which builds an object named `ngram_semantics` that generalizes the frequency-based utility of `ngram_frequencies`. In particular, for each `i`-gram length, and each corresponding `ngram` ($\vec{t}$) in `ngram_frequencies[i]` (i.e., containing $f\left(\vec{t}\right)$), you must break the `ngram` into it's last token, `t` ($t_i$), and its context, `c` ($c=\vec{t}_{1,\cdots,i-1} = [t_1, \cdots, t_{i-1}]$), comprised of all others besides the last token, `t`. Using these, the value of `ngram_semantics[i][c]` should then sum up the squared semantic components from all ngrams, $\vec{t}$, conaining the context $c$&mdash;in particular, these come from the vector $CoM_{t_i}$ corresponding to the _types_, $t_i$, appearing just after $c$. However, we'll be usng the pointwise-squared semantic ($CoM$) matrix, which we denote by $CoM^2$:
$$
CoM^2_c = \sum_{\vec{t}_{1,\cdots,i-1} = c}f\left(\vec{t}\right)CoM_{t_i}^2
$$
As you complete this function, try to intuitively follow how the `ngram_semantics` object collects the semantic mass of types, according to their $n$-gram contexts.


In [21]:
# A10:Function(5/5)
#t-arrow = n-gram

def make_context_representations(ngram_frequencies, type_index, CoM):

    ngram_semantics = {}
    for i in range(len(ngram_frequencies)):
        ngram_semantics[i] = Counter()
        for ngram in sorted(ngram_frequencies[i]):
            
            #--- your code starts here
            f_t = ngram_frequencies[i][tuple(ngram)]
            t = ngram[-1]
            t_index = type_index[t]
            t_i = CoM[t_index, :]
            c = ngram[:-1]
            ngram_semantics[i][c] += f_t * np.square(t_i)  
            
            #--- your code stops here
        
    return ngram_semantics

For reference, your output should be:

```
array([7.97611695e+00, 8.99872340e-03, 2.15038866e-04, 4.26420133e-03,
       3.25341871e-04, 1.76035146e-04, 1.74704542e-03, 3.74165424e-06,
       5.72603698e-04, 1.23402418e-03, 1.44665778e-03, 1.14847206e-05,
       1.53250713e-03, 3.18149604e-04, 1.12255674e-04, 1.78973157e-04,
       6.06263814e-04, 6.78922037e-06, 2.12840899e-04, 3.68067133e-04,
       7.93624931e-05, 5.41433409e-05, 2.98419701e-05, 1.43090555e-04,
       2.20631917e-04, 7.46823189e-05, 6.16067335e-05, 1.85061200e-05,
       2.88289832e-05, 1.82055806e-05, 7.61080382e-05, 1.04006281e-05,
       9.33963080e-05, 1.32908572e-04, 5.69448595e-06, 3.81600501e-05,
       4.63674505e-05, 1.85668763e-05, 9.56517458e-06, 1.80490734e-05,
       1.89096497e-05, 2.51367852e-06, 5.16072034e-06, 8.65617541e-05,
       5.57011586e-06, 2.83364689e-05, 6.97222406e-05, 1.47325894e-05,
       6.56538317e-06, 1.36116788e-04])
```

In [22]:
# A10:SanityCheck

ngram_semantics = make_context_representations(ngram_frequencies, type_index, CoM_d_normed)
ngram_semantics[6][tuple(tokenize('robert downey jr.'))]

array([6.09491358e-01, 2.60286170e-01, 2.34935240e+00, 2.41477748e-01,
       3.27120687e+00, 9.57983175e-02, 2.95200542e-03, 8.64430449e-03,
       1.64753209e-04, 3.53254269e-04, 2.00267652e-01, 1.74109214e-02,
       2.41203616e-02, 4.32334473e-02, 5.93200066e-04, 3.28836994e-02,
       6.11567153e-03, 4.97924144e-03, 2.61648300e-02, 5.43861689e-02,
       7.74695500e-02, 3.20148982e-02, 1.40884173e-03, 2.21735219e-03,
       7.75303322e-03, 2.28725974e-02, 4.35066414e-04, 1.92064604e-02,
       1.77977054e-03, 1.75828291e-04, 1.57058489e-03, 1.37739557e-02,
       2.75536566e-02, 4.30162065e-02, 9.56687387e-04, 2.76719269e-04,
       4.08678114e-02, 1.31895146e-02, 6.61341588e-03, 1.80720664e-02,
       4.80854791e-03, 1.29062074e-03, 8.87355810e-03, 1.98261447e-03,
       3.47211325e-02, 1.74234468e-02, 1.29041312e-01, 1.35195445e-01,
       7.51812252e-02, 1.03757020e-02])

### 11. (7 pts) Generalize the standard LM with CBOW semantic similarity
Now that we have an aggregated representation of semantic mass for the types which appear in the $n$-gram contexts (`ngram_semantics`) it's time to build a CBOW-semantic, $n$-gram LM. You job here is to complete the function:
```
t_Ps = P_next_CBOW(gram, ngram_semantics, CoM, type_index, epsilon = 0.1)
```
which will operate as an LM in a manner that is quite similar to that of `P_next()`. Here, the new arguments are:
- `ngram_semantics`: the $CoM^2_c$ aggregated semantic masses for types in context
- `CoM`: a normed co-ocurrence matrix, output from __Part 1__'s `reduced_normed_CoM(newstweet)` function
and all other inputs and outpus are the same as for `P_next()`.

Here, the LM will once again employ backoff and smoothing. Backoff will be accomplished again by recursively calling the (now `P_next_CBOW()`) function, which will be quite straightforward. However, generalizing the smoothing process will require a bit more work. Here, you'll instead be computing the `t_Ps` object, i.e., $\hat{P}(t_n|t_1, t_2, \cdots t_{n-1})$ as a `Counter()` of type-prediction probabilities via the product of independent, semantic-component likelihoods:
$$
\hat{P}(t_n|t_1, t_2, \cdots t_{n-1}) = 
\left[
\prod_{k = 1}^d
\frac{CoM_{t_n, k}^2\left(f(t_1, t_2,\cdots, t_n) + \varepsilon\right) }
{CoM_{c,k}^2 + \varepsilon\sum_{t\in W}CoM_{t, k}^2}
\right]^{1/d}
$$
where here, $k$ ranges over the semantic dimensionality $d$. Note: this formulation is most easily computed via log space as an arithmetic average:
$$
\log\hat{P}(t_n|t_1, t_2, \cdots t_{n-1}) = 
\frac{1}{d}\sum_{k = 1}^d
\log\left(
\frac{CoM_{t_n, k}^2\left(f(t_1, t_2,\cdots, t_n) + \varepsilon\right) }
{CoM_{c,k}^2 + \varepsilon\sum_{t\in W}CoM_{t, k}^2}
\right)
$$
and then exponentiated back to probability space for output.


In [23]:
gram = tuple(tokenize("he goes on an "))
n1 = len(gram)
epsilon = 0.1
epsilon /= len(type_index)
y = []
t_Ps = Counter()
types = [t for t in type_index]
type_indices = [type_index[t] for t in types]

In [24]:
if gram in ngram_semantics[n1]:
    print(ngram_semantics[n1][gram])

[2.58672955e-03 3.45406756e-03 3.24547592e-02 3.79494287e-03
 1.26469245e-01 1.46776957e-02 7.87821789e-05 3.05450437e-05
 2.97599549e-08 6.77184200e-09 1.52889512e-02 1.18264114e-02
 1.35976404e-02 1.57601465e-02 2.50173858e-03 5.57180879e-03
 4.07023786e-03 3.43606754e-03 3.50593044e-02 1.24946242e-01
 1.18946755e-01 3.56580610e-02 1.50380511e-03 2.12012485e-03
 1.52040428e-05 7.85705017e-02 1.20608039e-02 6.60276062e-03
 2.52862476e-04 6.26927464e-05 1.23393963e-02 1.53281393e-03
 5.23273498e-03 3.69859909e-02 2.12193086e-03 7.25054839e-02
 3.87804461e-03 5.01682383e-03 7.54521505e-02 3.39746900e-02
 1.54733793e-04 9.43821504e-03 7.32567890e-03 2.43037895e-03
 1.05048474e-02 1.70876621e-02 5.31765250e-03 2.97673538e-05
 1.32080793e-03 2.59512732e-02]


In [25]:
# A11:Function(7/7)

#adapt what you wrote for P_next but for the vectors, instead of using the frequencies, you are using the vectors which you are performing math on

def P_next_CBOW(gram, ngram_semantics, CoM, type_index, epsilon = 0.1):
    n1 = len(gram)
    epsilon /= len(type_index)
    y = []
    t_Ps = Counter()
    types = [t for t in type_index]
    type_indices = [type_index[t] for t in types]
    if gram in ngram_semantics[n1]: ## use gram to condition frequencies with epsilon-smoothing
        
        #--- your code starts here
        for key in type_index.keys():
            y.append((key, ((epsilon + ngram_semantics[n1][gram+(key,)]) / (epsilon*len(type_index) + ngram_semantics[n1-1][gram]))))
        
        for j in y:
            t_Ps[j[0]] = j[1]    

        #--- your code stops here
        
    else: ## recursively back off to lower-n model
        
        #--- your code starts here
        
        gram = gram[1:]
        print(gram)
        return P_next_CBOW(gram, ngram_semantics, type_index, epsilon = .1)

        #--- your code stops here
    
    return t_Ps

For reference, your output should be:
```
[0.37939464029388503,
 [('adventure', 0.2818891515803675),
  ('andrewcmccarthy', 0.00011029111215864133),
  ('lizzzyacker', 0.00011029111215864133),
  ('cfbplayoff', 0.00010564393592587125),
  ('cowhide', 5.9069417953547754e-05),
  ('7b', 5.7778635208333525e-05),
  ('selah', 5.5544304078059575e-05),
  ('ever-evolving', 5.1995269581865545e-05),
  ('incidentally', 5.1995269581865545e-05),
  ('realistically', 5.1995269581865545e-05)],
 [(' ', 0.014840491124005418),
  ('andrewcmccarthy', 0.00022725969991163811),
  ('lizzzyacker', 0.00022725969991163811),
  ('cfbplayoff', 0.00021768398836584537),
  ('cowhide', 0.00012171514037114209),
  ('7b', 0.00011905542560730372),
  ('selah', 0.00011445148779009385),
  ('ever-evolving', 0.00010713854571529673),
  ('incidentally', 0.00010713854571529673),
  ('realistically', 0.00010713854571529673)]]
```

In [26]:
# A11:SanityCheck

[np.nansum([x for x in P_next_CBOW(tuple(tokenize("he goes on an ")), ngram_semantics, CoM_d_normed, type_index).values()]), 
 list(P_next_CBOW(tuple(tokenize("he goes on an ")), ngram_semantics, CoM_d_normed, type_index).most_common(10)), 
 list(P_next_CBOW(tuple(tokenize(" he goes on an")), ngram_semantics, CoM_d_normed, type_index).most_common(10))]

[1.0,
 [('\n', 3.1269543464665416e-05),
  (' ', 3.1269543464665416e-05),
  ('!', 3.1269543464665416e-05),
  ('"', 3.1269543464665416e-05),
  ('#', 3.1269543464665416e-05),
  ('$', 3.1269543464665416e-05),
  ('%', 3.1269543464665416e-05),
  ('&', 3.1269543464665416e-05),
  ("'", 3.1269543464665416e-05),
  ("''", 3.1269543464665416e-05)],
 [('\n', 3.1269543464665416e-05),
  (' ', 3.1269543464665416e-05),
  ('!', 3.1269543464665416e-05),
  ('"', 3.1269543464665416e-05),
  ('#', 3.1269543464665416e-05),
  ('$', 3.1269543464665416e-05),
  ('%', 3.1269543464665416e-05),
  ('&', 3.1269543464665416e-05),
  ("'", 3.1269543464665416e-05),
  ("''", 3.1269543464665416e-05)]]

#### Now let's see what happens when we ramble across the entire vocabulary
For reference, your output should be:
```
generated document, starting from "robert downey jr":

.'s 'voyage of doctor ned, mad moxxi’s underdome riot, the secret armory of general knoxx, and claptrap’s new robot revolution. 

average perplexity of ramble:  1.429765446348872

```

In [27]:
# A11:SanityCheck
j = 5
np.random.seed(691)
document = tokenize(newstweet[j]['text'].lower())
rambling, likelihood = ramble(tuple(document[:5]), 49, 
                              (ngram_semantics, CoM_d_normed, type_index, 0.0001),
                              LM = P_next_CBOW)
print("\n\naverage perplexity of ramble: ", perplexity(likelihood))

generated document, starting from "robert downey jr":



TypeError: P_next() takes from 3 to 4 positional arguments but 5 were given

#### Next, let's see what ramble looks like with document-specific types 
For reference, your output should be:
```
generated document, starting from "robert downey jr":

.'s 'voyage of doctor dolittle' to january 2020

“dolittle” tells the story of dr. dolittle in read: “syriana,” “traffic”),

average perplexity of ramble:  3.756691687638365
```

In [3]:
# A11:SanityCheck

j = 5
np.random.seed(691)
document = tokenize(newstweet[j]['text'].lower())
rambling, likelihood = ramble(tuple(document[:5]), 50, 
                              (ngram_semantics, CoM_d_normed, doc_types[j], 0.0001),
                              LM = P_next_CBOW)
print("\n\naverage perplexity of ramble: ", perplexity(likelihood))

NameError: name 'np' is not defined

#### Now let's see what the average perplexity of our model is in a recitation over the document types
For reference, your output should be:
```
generated document, starting from "robert downey ":

jr.'s( he have to is.than people as well young trailer. dolittle  but, when well this official “dolittle” trailer released sunday, new released not forced to leave his hideaway and set sail across the sea is a young island.

“we have to choice but to embark on this during journey,” downey whispers to his soon and furry friends.

and,then’is getting into all sorts of dangerous situations, like being chained in a medieval dungeon with a tiger who greets him with, buthello, lunch” and “ jumps him. dr we for set — a gorilla voiced by rami malek comes to his feathered.

also,on: “ moves robert downey jr.'s 'voyage of doctor dolittle' to january 2020 
bydolittle” trailer the story along famed doctor and veterinarian during the falls ( queen victoria’s england, dr. dolittle dolittle, who is to action after the team,of his wife on years,earlier,caused him to judge a hermit who only talks to animals. but we we team,people (jessie buckley, “wild rose”) falls ill, he voiced on in adventure to find her a cure. he also in by this story by a young apprentice (harry collett, “dunkirk”) and zachary animal friends, that the anxious gorilla (malek), an upbeat duck (octavia spencer), an cynical ostrich (kumail nanjiani), an enthusiastic polar bear (john cena), and he doctor parrot (emma thompson).zachary
read banderas also stars as rassouli, along with his sheen as mudfly. additional voice performers include marion cotillard, frances de la tour, carmen ejogo, ralph fiennes, selena gomez  tom holland  and then robinson.

“we” trailer directed by stephen gaghan (“syriana,” “maleficent”), as we by joe roth and jeff kirschenbaum under their roth/kirschenbaum films banner (“alice in wonderland,” “maleficent”)  and well as we susan downey (“sherlock holmes” franchise, “the voice”) for team downey. downey jr.'sexecutive produces along iwth sarah bradshaw (“the mummy,” “maleficent”),and zachary roth (“maleficent: mistress of evil”).

average perplexity of recitation:  2.9565105975316377
```

In [2]:
# A11:SanityCheck

j = 5
np.random.seed(691)
document = newstweet[j]['text'].lower()
recitation, likelihood = recite(document, (ngram_semantics, CoM_d_normed, doc_types[j], 0.0001),
                                LM = P_next_CBOW)
print("\n\naverage perplexity of recitation: ", perplexity(likelihood))

NameError: name 'np' is not defined

#### Finally, let's see what the average perplexity of our model is in a recitation across the entire vocabulary
For reference, your output should be:
```
generated document, starting from "robert downey ":

jr.'sthrived he have to people.than people as well doj fish. shaun  unlike, morbid frenzied this?official toldbake” trailer released sunday, gov doesn't owed forced to militarily his hideaway and set sail across the entryway creature a simple island.

“i owe mobs ambitions but to prosecute on quests one journey,” downey whispers to his success and furry friends.

and,thank’it getting into debt sorts of schemes situations, 9-on-7 wollin cinderella in a football-centric dungeon with a rear-naked who greets him with, howeverhello, lunch” and michelle reevaluate him. didn if because cheat about a 230 voiced by rami malek comes to see feathered.

also,known: delta moves robert downey jr.'s 'voyage of doctor dolittle' to january 2020 
floyddolittle” trailer the hollywood is famed doctor and patient during the easter she discharge victoria’s england, 1966. pino's barrasso, who smugly to stewart's after the padres,of c judgement was years,earlier,caused him to anticipate inflamed focus who only had to reuters. but we cops u,family (jessieratifybuckley, “wild rose”) falls ill, gruber said about injured unqualified to fruition deadpool's a&fifth. he played still by stage biological by a conservative-leaning fleet (harry collett, “dunkirk”) and zachary males friends, but mendocino 0-6 gorilla (malek), an appointee duck (octavia spencer), an nih ostrich (kumail nanjiani), an easy polar bear (john rudoff), and nataliya celebration parrot (emma thompson).minions
additionally banderas also stars as rassouli, along with household sheen as mudfly. additional reporting performers include marion cotillard, frances de la industria, 1972 ejogo, ralph fiennes, selena gomez  27 jurich  and javier robinson.

“oh” trailer turn-based more stephen hawking (“syriana,” “maleficent”), as christabel alongside sony roth and jeff kirschenbaum under their roth/kirschenbaum films banner (“alice in wonderland,” “maleficent”)  and rosengren as jerushah susan downey (“sherlock holmes” franchise, tuathe gateway”) for haskins usa. downey jr.'sol produces along iwth sarah bradshaw (“the mummy,” “maleficent”),and unitas roth (“maleficent: mistress of evil”).

average perplexity of recitation:  3.125198389248689
```

In [None]:
# A11:SanityCheck

j = 5
np.random.seed(691)
document = newstweet[j]['text'].lower()
recitation, likelihood = recite(document, (ngram_semantics, CoM_d_normed, type_index, 0.0001),
                                LM = P_next_CBOW)
print("\n\naverage perplexity of recitation: ", perplexity(likelihood))