Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not work with Chrome right now; you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [1]:
NAME = "Aymane Hachcham"

---

## Text Generation with Markov Chains

In this exercise, we want to make a Donald Trump fake tweet generator using a simple Markov chain.

In [2]:
# Load the tweet data:
import gzip, re, sys
tweets = [[sys.intern(y) for y in re.split(r"\s", x.strip())] for x in gzip.open("/data/tweets_realDonaldTrump_sanitized.txt.gz", "rt")]

In [3]:
# Explore the data:

tweets[:10]

[['Thank', 'you', 'Arkansas!', '#Trump2016', '#SuperTuesday'],
 ['Thank', 'you', 'Virginia!', '#Trump2016', '#SuperTuesday'],
 ['Thank', 'you', 'Alabama!', '#Trump2016', '#SuperTuesday'],
 ['Thank', 'you', 'Tennessee!', '#Trump2016', '#SuperTuesday'],
 ['Thank', 'you', 'Massachusetts!', '#Trump2016', '#SuperTuesday'],
 ['Thank', 'you', 'Georgia!', '#SuperTuesday', '#Trump2016'],
 ['Wow!',
  'Thank',
  'you',
  'Louisville,',
  'Kentucky!',
  '',
  '#VoteTrump',
  'on',
  '3/5/2016!',
  'Lets',
  '#MakeAmericaGreatAgain!',
  'http://somelink.com/',
  'http://someotherlink.com/'],
 ['Lets',
  'go',
  'America!',
  'Get',
  'out',
  '&',
  '#VoteTrump!',
  '#Trump2016',
  '#MakeAmericaGreatAgain!',
  '#SuperTuesday',
  'http://somelink.com/',
  'http://someotherlink.com/'],
 ['MAKE', 'AMERICA', 'GREAT', 'AGAIN!'],
 ['Thank', 'you', 'Columbus,', 'Ohio!', 'http://somelink.com/']]
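
The empty string `''` in the seventh tweet above is most likely a side effect of splitting on the single-character pattern `\s`: two consecutive whitespace characters produce an empty token between them. A minimal sketch, assuming the raw line contained a double space (the raw file is not shown here, so the sample line is made up):

```python
import re

# Hypothetical raw line with two spaces after "Kentucky!" (an assumption for illustration).
line = "Wow! Thank you Louisville, Kentucky!  #VoteTrump on 3/5/2016!\n"

# Same splitting as in the loading cell: one split per whitespace character.
tokens = re.split(r"\s", line.strip())
print(tokens)

# str.split() without arguments collapses runs of whitespace instead.
collapsed = line.split()
print(collapsed)
```

The model below simply treats the empty token as one more word, so the loading cell works as it is.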

## Collect the necessary statistics for the Markov model

We need the term frequencies to predict the next word given the previous 0...order words.

Use `()` (empty tuple) as a start and stop token. Use tuples as keys in your maps.

For the 0th order, this is simply the word frequency!
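
To make the target data structure concrete, here is a small sketch on a made-up two-tweet corpus. The helper name `count_contexts` and the toy corpus are assumptions for illustration, not part of the exercise; the layout follows the description above (tuple keys, `()` as stop token).

```python
from collections import Counter, defaultdict

def count_contexts(corpus, o):
    """Count which token (or the stop token ()) follows each context of o >= 1 tokens."""
    table = defaultdict(Counter)
    for tokens in corpus:
        # Slide a window of o tokens over the tweet.
        for i in range(len(tokens) - o + 1):
            context = tuple(tokens[i:i + o])
            follower = tokens[i + o] if i + o < len(tokens) else ()
            table[context][follower] += 1
    return {k: dict(v) for k, v in table.items()}

toy = [["Thank", "you", "Ohio!"], ["Thank", "you", "Iowa!"]]
first = count_contexts(toy, 1)
second = count_contexts(toy, 2)
print(first[("Thank",)])         # {'you': 2}
print(first[("Ohio!",)])         # {(): 1}
print(second[("Thank", "you")])  # {'Ohio!': 1, 'Iowa!': 1}
```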

In [65]:
# Aggregate the data as necessary for Markov model of order 0...order
def aggregate(tweets, order):
    from collections import defaultdict, Counter
    models = []
    # As 0th order, use the first tokens only.
    models.append({(): Counter(tweet[0] for tweet in tweets)})

    for o in range(1, order+1):
        # Maps a tuple of o words -> Counter of the next word (or empty tuple).
        model = defaultdict(Counter)
        for tweet in tweets:
            # Slide a window of o words over the tweet;
            # the last window is followed by the stop token ().
            for i in range(len(tweet) - o + 1):
                key = tuple(tweet[i:i + o])
                word = tweet[i + o] if i + o < len(tweet) else ()
                model[key][word] += 1
        # Store plain dicts, so that looking up an unseen key cannot create an entry.
        models.append({k: dict(v) for k, v in model.items()})

    return models

In [66]:
#### AUTOMATIC TESTS
_tmp = aggregate(tweets[:100], order=2)
assert len(_tmp) == 3 and isinstance(_tmp, list), "Wrong result"
assert all(isinstance(x, dict) for x in _tmp), "Wrong result"
assert not () in _tmp[0][()], "0th order must not include the end token"
assert sum(_tmp[0][()].values()) == 100, "0th order incorrect."
assert sum(sum(x.values()) for x in _tmp[1].values()) == sum(len(x) for x in tweets[:100]), "1th order incomplete."
assert sum(sum(x.values()) for x in _tmp[2].values()) == sum(len(x)-1 for x in tweets[:100]), "2nd order incomplete."
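
The totals checked by the last two assertions follow from counting sliding windows: a tweet with n tokens has n windows of length 1 and n-1 windows of length 2, each followed by either a token or the stop token. A quick sketch of that arithmetic (the function name `num_transitions` is made up for illustration):

```python
def num_transitions(n_tokens, o):
    """Number of (context, follower) pairs a tweet of n_tokens contributes at order o >= 1."""
    return max(n_tokens - o + 1, 0)

print(num_transitions(5, 1))  # 5
print(num_transitions(5, 2))  # 4
print(num_transitions(1, 2))  # 0, a one-word tweet has no two-word context
```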

In [None]:
#### Additional hidden AUTOMATIC TESTS
del _tmp

### Train your model

In [67]:
%time model = aggregate(tweets, order=3)

CPU times: user 1.17 s, sys: 423 ms, total: 1.59 s
Wall time: 1.59 s


In [184]:
# Testing how greedy first-order prediction behaves on a tiny model:
import random
# Separate name, so that the trained `model` from above is not overwritten.
small_model = aggregate(tweets[:2], order=2)

first_order = small_model[1]
vocabulary = [t for tweet in tweets[:2] for t in tweet]

f_t = []
for i in range(10):
    w1 = random.choice(vocabulary)
    # Look up the followers of w1 (1-tuple key; bare-word keys are accepted as well).
    followers = first_order.get((w1,)) or first_order.get(w1)
    # Greedy: take the most frequent follower instead of sampling.
    best = max(followers, key=followers.get)
    if best == ():
        continue  # w1 most often ends a tweet
    f_t.append(w1 + ' ' + best)

f_t

['Virginia! #Trump2016',
 'you Arkansas!',
 'Thank you',
 'Virginia! #Trump2016',
 'you Arkansas!',
 'you Arkansas!',
 'Thank you',
 'Virginia! #Trump2016',
 '#Trump2016 #SuperTuesday']

## Make Trump tweet again

Let's make Trump tweet again.

Write a function "trump" that randomly generates trumpesque garbage given the above model, by randomly sampling from the appropriate distribution.
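
Sampling from the distribution means drawing a follower with probability proportional to its count, rather than always taking the most frequent one. A minimal sketch using `random.choices` from the standard library; the `counts` dictionary is made-up example data in the shape of one model entry, and `sample_follower` is a hypothetical helper name:

```python
import random

# Made-up follower counts for one context, in the shape {follower: count}.
counts = {"you": 8, "Ohio!": 1, (): 1}

def sample_follower(counts, rng=random):
    """Draw one follower with probability proportional to its count."""
    followers = list(counts)
    weights = [counts[f] for f in followers]
    return rng.choices(followers, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only to make the demo repeatable
draws = [sample_follower(counts, rng) for _ in range(1000)]
print(draws.count("you") / len(draws))  # close to 0.8
print(max(counts, key=counts.get))      # greedy choice: always 'you'
```

With the greedy choice every context has exactly one continuation, so the generator can never produce variety; weighted sampling keeps the rarer followers reachable.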

In [104]:
import random
order = 2
# randrange(order) returns a random integer from 0 to order - 1:
random.randrange(order)

1

In [199]:
def trump(model):
    """Generate Trumpesque nonsense from a trained Markov model"""
    import random
    order = len(model) - 1
    output = []
    for i in range(0, 100): # enforce a max length 100, in case your stopping does not work
        if not output:
            # 0th order: distribution of the first word of a tweet.
            counts = model[0][()]
        else:
            counts = None
            # Use the longest context available, backing off to a shorter one if it is unseen.
            for k in range(min(len(output), order), 0, -1):
                counts = model[k].get(tuple(output[-k:]))
                if counts is None and k == 1:
                    # Also accept bare-word keys in the first-order map.
                    counts = model[k].get(output[-1])
                if counts is not None:
                    break
            if counts is None:
                break
        # Sample the next word with probability proportional to its frequency.
        words = list(counts)
        next_word = random.choices(words, weights=[counts[w] for w in words])[0]
        if next_word == ():
            break # stop token: the tweet ends here
        output.append(next_word)
    return output

In [200]:
#### AUTOMATIC TESTS
_tmp = aggregate(tweets[:100], order=2)
_tmp = [trump(_tmp) for x in range(100)]
assert any('http://somelink.com/' in x for x in _tmp), "Does not work right."
assert any('#MakeAmericaGreatAgain' in x for x in _tmp), "Does not work right."
assert any(x in tweets[:100] for x in _tmp), "Some tweet must be reproduced"
assert any(x not in tweets[:100] for x in _tmp), "Some tweet must be fake"

In [None]:
#### Additional hidden AUTOMATIC TESTS

In [None]:
#### Additional hidden AUTOMATIC TESTS
del _tmp

In [None]:
#### Additional hidden AUTOMATIC TESTS

## Make Donald Trump tweet garbage again

Let's make Donald Trump tweet again. Generate some Trumpesque nonsense:

In [None]:
for i in range(10):
    print(*trump(model))