<div style="text-align: right">
    <i>
        LIN 5300<br>
        Spring 2022 <br>
        Aniello De Santo
    </i>
</div>


In [29]:
from google.colab import files
from collections import Counter
from pprint import pprint  
import re

# From counts to probabilities and bigram models

In the previous notebook we played around with ways to obtain word counts and did some basic pre-processing.
But it would still be nice to get a few other metrics, such as

1. the frequencies of word types rather than their total counts (this makes it easier to compare different texts, since 50 mentions of "buletic" in a 1000-page novel don't have the same weight as 50 mentions in a 1000-word essay),
2. the frequencies of bigrams (or n-grams in general), and
3. MLE estimates of conditional probabilities, computed from the counts.


Before we continue, though, we once again have to run all the relevant code to get counts for our texts *Hamlet* and *Princess of Mars*.

In [None]:
#Import files (this code is specific to reading files into Colab)
# Be sure to upload the file matching the variable name

hamlet = files.upload()
hamlet_full  = hamlet['hamlet_clean.txt'].decode('utf-8')


mars = files.upload()
mars_full = mars['mars_clean.txt'].decode('utf-8')



In [3]:
def tokenize(the_string):
    """Convert string to list of words"""
    return re.findall(r"\w+", the_string)


# define a word counter for each text
hamlet = Counter(tokenize(hamlet_full))
mars = Counter(tokenize(mars_full))


## Calculating frequencies

The frequency of a word type is the proportion of all tokens in a text that are tokens of that type.
For example, if a word type has 6 tokens in a text of 1000 words, then its frequency is $\frac{6}{1000} = 0.006 = 0.6\%$.
So we get the frequency of a word *w* by dividing the count of *w* by the total number of tokens in the text.
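
The definition can be checked by hand on a toy example. The sketch below assumes a short made-up token list (`toy_tokens` is hypothetical and not one of the course texts) and computes the frequency of a single word type as its count divided by the total number of tokens.

```python
from collections import Counter

# hypothetical toy text, already tokenized (10 tokens in total)
toy_tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "bird", "saw", "it"]
toy_counts = Counter(toy_tokens)

# total number of tokens = sum of the counts of all word types
total = sum(toy_counts.values())

# frequency of "the" = count of "the" / total number of tokens
the_frequency = toy_counts["the"] / total
print(the_frequency)
```

Here "the" has 3 tokens out of 10, so its frequency comes out as 30% of the text.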

We already have many of the tools that are needed to calculate frequencies:

1. a counter for each tokenized text,
1. `for`-loops,
1. the `len`-function,
1. the `sum`-function, and
1. the `[x]` notation for getting the value of a specific item `x` in a counter.

We can recombine these techniques to define a custom function for computing word frequencies.

First, we will need to determine the total number of tokens in the text.
But we already know how to do that with `sum` and `Counter.values`.
Once we know the total, we can calculate the frequency of a word type by dividing its number of tokens by `total`.
Let us put the relevant code for these steps into a function that prints the frequency of every word type.

In [12]:
def frequencies(word_counter):
    """print frequency for each word type in counter"""
    total = sum(Counter.values(word_counter))
    # calculate frequencies for all the words in the counter
    for current_word in word_counter:
        number_of_tokens = word_counter[current_word]
        frequency = number_of_tokens / total
        print(current_word, frequency)

In [None]:
# Let's test it if we want to,
# but be warned that the printout is going to be long (in terms of screen space)

frequencies(hamlet)

### Adding frequencies to the counter

The `frequencies` function is still somewhat unsatisfying in that it only prints the frequency of each word type.
Printing to screen isn't very useful most of the time, in particular with tens of thousands of words.
It would be better if we could simply replace the absolute values in the counter by frequencies.
This is actually fairly easy.
The `[x]` notation is not only useful for retrieving the value of an element, it also allows us to **specify** the value of an element.

In [None]:
# define our test counter
test = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test)

# let's change the value of "a";
# here's what it is right now
print("a's current value:", test["a"])
# and now we'll change it to 0.1
test["a"] = 0.1
print("a's new value:", test["a"])

# and now we add a new element "d" to the counter
test["d"] = 10
print("d was added with count:", test["d"])
print("The new counter is:", test)

Counter({'a': 6, 'b': 4, 'c': 2})
a's current value: 6
a's new value: 0.1
d was added with count: 10
The new counter is: Counter({'d': 10, 'b': 4, 'c': 2, 'a': 0.1})


Now we can finalize the `frequencies` function.
Instead of printing the frequency of `current_word`, the function should overwrite the value of `current_word` in `word_counter` with `frequency`.
At the end, the function returns `word_counter`.
You can test your code in the second cell.
The results should be `0.5` for `a`, `0.3333` for `b`, and `0.1666` for `c`.

In [17]:
# change and complete the code below

def frequencies(word_counter):
    """replace each count in word_counter by its frequency (in place) and return the counter"""
    total = sum(Counter.values(word_counter))
    for current_word in word_counter:
        number_of_tokens = word_counter[current_word]
        frequency = number_of_tokens / total
        # overwrite the count of the word type with its frequency
        word_counter[current_word] = frequency
    return word_counter

In [18]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
print(test_counts)

test_frequency = frequencies(test_counts)
print(test_frequency)

Counter({'a': 6, 'b': 4, 'c': 2})
Counter({'a': 0.5, 'b': 0.3333333333333333, 'c': 0.16666666666666666})


### An unintended side-effect

Let us run the test one more time, with just a minor change in the order of the `print`-statements.
Now we first compute `test_frequency` and then print `test_counts` and `test_frequency`.

In [19]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(test_counts)

print(test_counts)
print(test_frequency)

Counter({'a': 0.5, 'b': 0.3333333333333333, 'c': 0.16666666666666666})
Counter({'a': 0.5, 'b': 0.3333333333333333, 'c': 0.16666666666666666})


Uhm, what's going on here?
Why do `test_counts` and `test_frequency` look the same?
Where did the absolute word counts go?

The problem is with how we wrote the function `frequencies`.
This is a function that takes a word counter as an argument and then **overwrites** the count of each word type with its frequency.
So if we run `frequencies` over `test_counts`, all the values of `test_counts` are replaced by frequencies.
That's not really what we want.
Instead, we want to produce a copy of `test_counts` with frequencies while keeping the original version of `test_counts` untouched.
We can create a copy of a counter with the function `Counter.copy`.

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(Counter.copy(test_counts))

print(test_counts)
print(test_frequency)

Now we no longer run `frequencies` on `test_counts`, but on a dynamically created copy of `test_counts`.
Hence the values of `test_counts` remain unaltered, and we get different outputs for `print` at the end.
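
The difference between handing over the counter itself and handing over a copy can also be seen in isolation. In this minimal sketch (the names `original`, `alias`, and `independent` are made up for illustration), a plain assignment only creates a second name for the same counter, whereas `copy` creates a separate counter with the same contents.

```python
from collections import Counter

original = Counter(["a", "a", "b"])

# plain assignment does NOT copy: both names refer to the same counter
alias = original
alias["a"] = 0.5

# copy() builds a separate counter with the same contents
independent = original.copy()
independent["b"] = 99

print(original)
print(independent)
```

The change made through `alias` shows up in `original`, while the change made to `independent` does not. The same thing happens inside a function: the parameter is just another name for the counter that was passed in.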

**Python Practice.**
Copy-paste your definition of the `frequencies` function into the cell below, then change it so that it always creates a copy `temp_copy` of `word_counter` at the beginning and then carries out all operations over `temp_copy` instead of `word_counter`.
Then run the code in the next cell to verify that your new definition of `frequencies` works correctly.

In [None]:
# copy-paste your code for frequencies here, then modify it as described

In [None]:
# test your code here
test_counts = Counter(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c", "c"])
test_frequency = frequencies(test_counts)

print(test_counts)
print(test_frequency)

Ok, we have now seen how to obtain unigram frequencies out of a corpus, and compare frequencies across corpora.
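
To see why frequencies are easier to compare across texts than raw counts, consider the following rough sketch. Both toy texts are invented for illustration, and `relative_frequency` is a hypothetical helper, not a function defined earlier in the notebook. The word "to" has the same count in both texts, but it takes up a very different share of each.

```python
from collections import Counter

def relative_frequency(word, tokens):
    """Return the count of word divided by the total number of tokens."""
    return Counter(tokens)[word] / len(tokens)

# two hypothetical tokenized texts of different lengths
short_text = ["to", "be", "or", "not", "to", "be"]
long_text = ["to", "to"] + ["filler"] * 18

# same raw count of "to" in both texts ...
print(short_text.count("to"), long_text.count("to"))
# ... but very different frequencies
print(relative_frequency("to", short_text))
print(relative_frequency("to", long_text))
```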

Note that under the MLE approach, we said that counts can approximate probabilities, so that $P(A)$ can be approximated by:

<center> $\large p(A) \approx \frac{count(A)}{\sum_{w \in V}count(w)}$ </center>

So we now have a pythonic way of getting unigram probabilities out of a corpus. Can we extend this approach to n-grams?

Well, first, we must obtain the n-grams! Let's work with bigrams.
Many Python libraries can extract bigrams for us, but it is also fairly easy to do it from scratch.

In [25]:
#this function takes a list of words
#and returns a list of bigrams

def bigrammer(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append((input_list[i], input_list[i+1]))
    return bigram_list


In [26]:
hamlet_bigrams = bigrammer(tokenize(hamlet_full))
mars_bigrams = bigrammer(tokenize(mars_full))

In [30]:
#let's check what we got
pprint(hamlet_bigrams[:10])
pprint(mars_bigrams[:10])

[('enter', 'francisco'),
 ('francisco', 'and'),
 ('and', 'barnardo'),
 ('barnardo', 'two'),
 ('two', 'sentinels'),
 ('sentinels', 'barnardo'),
 ('barnardo', 'who'),
 ('who', 's'),
 ('s', 'there'),
 ('there', 'francisco')]
[('i', 'am'),
 ('am', 'a'),
 ('a', 'very'),
 ('very', 'old'),
 ('old', 'man'),
 ('man', 'how'),
 ('how', 'old'),
 ('old', 'i'),
 ('i', 'do'),
 ('do', 'not')]
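
The loop in `bigrammer` can also be written more compactly with `zip`, which pairs each word with its successor. This is only an alternative sketch; the name `bigrammer_zip` and the toy word list are made up here, and the function assumes its input is a list (so that slicing works).

```python
def bigrammer_zip(input_list):
    """Return the list of adjacent word pairs (bigrams) as tuples."""
    # zip stops at the shorter sequence, so the last word gets no pair of its own
    return list(zip(input_list, input_list[1:]))

toy_words = ["i", "am", "a", "very", "old", "man"]
print(bigrammer_zip(toy_words))
```

As with the loop version, a list of n words yields n-1 bigrams, and an empty or one-word list yields no bigrams at all.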


**Question:** Why do you think we want the bigrams as tuples rather than as single strings?

# Computing Conditional Probabilities
 
Recall that the conditional probability $P(B | A)$ tells us how likely we are to observe $B$ given that we have already seen $A$.
We can compute it using an MLE approach.

We have already seen that $p(A)$ can be approximated by:

<center> $\large p(A) \approx \frac{count(A)}{\sum_{w \in V}count(w)}$ </center>

In the previous lecture, we then saw that we can also directly approximate $P(B | A)$ as:

<center> $\large p(B \mid A) \approx \frac{count(A,B)}{count(A)}$ </center>
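
As a small sanity check of the estimate for $P(B \mid A)$, that is, the count of the bigram divided by the count of its first word, here is a sketch on a toy corpus. The token list and all variable names are invented for illustration.

```python
from collections import Counter

# hypothetical toy corpus, already tokenized
toy_tokens = ["a", "b", "a", "c", "a", "b", "b", "a"]

# unigram counts and bigram counts (bigrams as tuples of adjacent tokens)
toy_counts = Counter(toy_tokens)
toy_bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

# MLE estimate of P("b" | "a") = count("a", "b") / count("a")
p_b_given_a = toy_bigram_counts[("a", "b")] / toy_counts["a"]
print(p_b_given_a)
```

In this toy corpus "a" occurs 4 times and is followed by "b" twice, so the estimate is one half. Note that the very last token of a text has no successor, so the estimates for that word type add up to slightly less than 1; on a long text the effect is negligible.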



In [36]:
# Let's get bigram counts
hamlet_bgcounts = Counter(hamlet_bigrams)

In [34]:
def conditional(bg_counter, word_counter):
    """Obtain conditional bigram probabilities according to MLE"""
    bg_counter = Counter.copy(bg_counter)
    for current_bg in bg_counter:
        number_of_tokens = bg_counter[current_bg]
        cond_prob = number_of_tokens / word_counter[current_bg[0]]
        bg_counter[current_bg] = cond_prob
    return bg_counter

In [37]:
# Let's test it
test_conditional = conditional(hamlet_bgcounts, hamlet)

**Python practice.** Now test the function by printing the conditional probabilities for the first 15 bigrams in *Hamlet*.
