# Madlibs

## Madlibs style word substitution
Given a (multiline) f-string, match the substitutes to the targets. If the targets have no subs, put an empty string (default value).

In [44]:
# Attempt one
def g(inp:str, **subs) -> str:
    return inp.format(**subs)

In [45]:
replacements = {"x":70}
# replacements = {"x":70, "w":80, "z": 90}
# replacements = {"x":70, "w":80, }

sample = "The number is {x} {w} {z}"
try:
    test = g(sample, **replacements)
    print(test)
except KeyError as e:
    print(e)

'w'


Problem: if the dict doesnt contain it, theres a keyerror

In [46]:
# Attempt 2: use a custom formatter (PEP 3101)
from string import Formatter
from typing import Dict
class MadLibber(Formatter):
    def __init__(self, default="") -> None:
        super().__init__()
        self.default=default

    def get_value(self, key, args, kwds:Dict):
        if isinstance(key, str):
            return kwds.get(key, self.default)
        else:
            return super().get_value(key, args, kwds)


In [47]:
mL = MadLibber()
print(mL.format(sample, **replacements))

The number is 70  


Hopefully we can improve on its inelegance. Also I dont fully understand the formatter class.

In [48]:
# Attempt 3: format_map
class Default(dict):
    def __missing__(self, key):
        return '{'+key+'}'
class Default2(dict):
    def __missing__(self, key):
        return ""

In [49]:
print(sample.format_map(Default(replacements)))
print(sample.format_map(Default2(replacements)))

The number is 70 {w} {z}
The number is 70  


In [50]:
multiline ="""
Lorem Ipsum is {adj} dummy text of the printing 
and typesetting industry. Lorem Ipsum has been the 
industry's standard dummy text ever since the 1500s, 
when an unknown printer took a {noun} of type and 
scrambled it to make a type specimen book. It has 
survived not only five centuries, but also the 
leap {preposition} electronic typesetting, remaining 
essentially unchanged. It was {verb} in the 
1960s with the release of Letraset sheets containing 
Lorem Ipsum passages, and more recently with 
desktop publishing software like Aldus PageMaker 
including versions of Lorem Ipsum.
"""
subs = {"adj":"interestingly",
        "noun": "human",
        "preposition": "at"}
print(multiline.format_map(Default(subs)))
print(multiline.format_map(Default2(subs)))


Lorem Ipsum is interestingly dummy text of the printing 
and typesetting industry. Lorem Ipsum has been the 
industry's standard dummy text ever since the 1500s, 
when an unknown printer took a human of type and 
scrambled it to make a type specimen book. It has 
survived not only five centuries, but also the 
leap at electronic typesetting, remaining 
essentially unchanged. It was {verb} in the 
1960s with the release of Letraset sheets containing 
Lorem Ipsum passages, and more recently with 
desktop publishing software like Aldus PageMaker 
including versions of Lorem Ipsum.


Lorem Ipsum is interestingly dummy text of the printing 
and typesetting industry. Lorem Ipsum has been the 
industry's standard dummy text ever since the 1500s, 
when an unknown printer took a human of type and 
scrambled it to make a type specimen book. It has 
survived not only five centuries, but also the 
leap at electronic typesetting, remaining 
essentially unchanged. It was  in the 
1960s with the rel

## Getting a corpus of passages
Let's actually make a game! We want lots of different passages of similar lengths.

In [51]:
import pandas as pd
import math
import inspect
import random
import spacy
from spacy.matcher import Matcher
import time

In [52]:
# extract articles from https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail?resource=download
df = pd.read_csv("archive/cnn_dailymail/test.csv")
articles_df = df['article'].str.replace('. ','.\n', regex=False) #

In [53]:
# check the first article in a single file
with open("passage.txt", "w") as f:
    text = articles_df.iloc[0]
    print(text)
    f.write(text)
    

Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk.
They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger.
More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans.
'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee. 'It is time that the DOT and FAA take a stand for humane treatment of passengers.' But could crowding on planes lead to more serious issues than fighting for space 

In [54]:
# export to CSV
articles_df.to_csv("articles.csv")


## NLP
We want to use part-of-speech taggers to tag words to different "parts of speech": nouns, verbs, adjectives, adverbs, connectives, pronouns and prepositions, etc.
This allows us to choose certain numbers of each word type to blank out.

Let's try out spaCy on just the first article using this [guide](https://freecontent.manning.com/detecting-word-types-with-part-of-speech-tagging-part-1/).

Note: We also have to run `py -m  spacy download en_core_web_sm` to download that model before we can use it. We're just using the small model for testing purposes.

In [55]:
nlp = spacy.load("en_core_web_sm")
doc= nlp(text)

Let's look inside them using the inspect module.

In [56]:
print(type(doc))
print(inspect.getmembers(doc[0]))

<class 'spacy.tokens.doc.Doc'>
[('_', <spacy.tokens.underscore.Underscore object at 0x0000016E0A4E9D50>), ('__bytes__', <built-in method __bytes__ of spacy.tokens.token.Token object at 0x0000016E14167FB0>), ('__class__', <class 'spacy.tokens.token.Token'>), ('__delattr__', <method-wrapper '__delattr__' of spacy.tokens.token.Token object at 0x0000016E14167FB0>), ('__dir__', <built-in method __dir__ of spacy.tokens.token.Token object at 0x0000016E14167FB0>), ('__doc__', 'An individual token – i.e. a word, punctuation symbol, whitespace,\n    etc.\n\n    DOCS: https://spacy.io/api/token\n    '), ('__eq__', <method-wrapper '__eq__' of spacy.tokens.token.Token object at 0x0000016E14167FB0>), ('__format__', <built-in method __format__ of spacy.tokens.token.Token object at 0x0000016E14167FB0>), ('__ge__', <method-wrapper '__ge__' of spacy.tokens.token.Token object at 0x0000016E14167FB0>), ('__getattribute__', <method-wrapper '__getattribute__' of spacy.tokens.token.Token object at 0x0000016E1

That's not very useful. Let's try a different approach.

In [57]:
token = doc[0]
print([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop])

['Ever', 'ever', 'ADV', 'RB', 'advmod', 'Xxxx', True, True]


In [58]:
for token in doc[:len(doc)//20]: # limit output
    print([token.text, token.pos_, spacy.explain(token.pos_)]) # .pos is an int, .pos_ is a string

['Ever', 'ADV', 'adverb']
['noticed', 'VERB', 'verb']
['how', 'SCONJ', 'subordinating conjunction']
['plane', 'NOUN', 'noun']
['seats', 'NOUN', 'noun']
['appear', 'VERB', 'verb']
['to', 'PART', 'particle']
['be', 'AUX', 'auxiliary']
['getting', 'VERB', 'verb']
['smaller', 'ADJ', 'adjective']
['and', 'CCONJ', 'coordinating conjunction']
['smaller', 'ADJ', 'adjective']
['?', 'PUNCT', 'punctuation']
['With', 'ADP', 'adposition']
['increasing', 'VERB', 'verb']
['numbers', 'NOUN', 'noun']
['of', 'ADP', 'adposition']
['people', 'NOUN', 'noun']
['taking', 'VERB', 'verb']
['to', 'ADP', 'adposition']
['the', 'DET', 'determiner']


Alright, we've got POS tagging done! Now to put take out certain parts of the text.

We use spacy's matcher to look for words with certain POS tags, such as verbs, and replace them with a placeholder or suggestion.

In [59]:
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "POS" :"VERB", # match verbs
    "OP":"*"           # match 0 or more verbs
    }
    ]
matcher.add("verb", [pattern])
matches = matcher(doc)

for match_id, start, end in matches[:len(matches)//10]: # limit output
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

6360137228241296794 verb 1 2 noticed
6360137228241296794 verb 5 6 appear
6360137228241296794 verb 8 9 getting
6360137228241296794 verb 14 15 increasing


Note: A [Span](https://spacy.io/api/span) is a slice from the document.

In [60]:
# adapted from https://stackoverflow.com/questions/62785916/spacy-replace-token
def replace_word(orig_doc):
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(orig_doc):
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += orig_doc[buffer_start: match_start].text + orig_doc[match_start - 1].whitespace_
            
        token = orig_doc[match_start]
        # tense = token.morph.get('Tense')[0] if token.morph.get('Tense') else ""
        # long_pos = spacy.explain(token.pos_)
        tag = spacy.explain(token.tag_)
        replacement = f"<{tag}>"
        text += replacement + token.whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += orig_doc[buffer_start:].text
    return text

replace_word(doc)

"Ever <verb, past tense> how plane seats <verb, non-3rd person singular present> to be <verb, gerund or present participle> smaller and smaller? With <verb, gerund or present participle> numbers of people <verb, gerund or present participle> to the skies, some experts are <verb, gerund or present participle> if <verb, gerund or present participle> such <verb, past participle> out planes is <verb, gerund or present participle> passengers at risk.\nThey <verb, non-3rd person singular present> that the <verb, gerund or present participle> space on aeroplanes is not only uncomfortable - it's <verb, gerund or present participle> our health and safety in danger.\nMore than <verb, gerund or present participle> over the arm rest, <verb, gerund or present participle> space on planes <verb, gerund or present participle> our health and safety in danger? This week, a U.S consumer advisory group <verb, past participle> up by the Department of Transportation <verb, past tense> at a public hearing th

We've replaced all the verbs in a given text with a suggestion (that's a tad verbose). 

However, in a real game of Madlibs, we want to only replace a few verbs, and replace other types of speech too. 

Let's try finding the number of each POS in a text, as both absolute number and percentage of the word count.

In [61]:
stats = dict()
token_count = len(doc)
for token in doc:
    long_pos = token.pos_
    try:
        stats[long_pos] = stats[long_pos] + 1
    except KeyError:
        stats[long_pos] = 0

# tabulate in pandas dataframe
stats_df = pd.DataFrame(stats.items(), columns=["POS", "Num"]) 
stats_df["Ratio"] = round(stats_df["Num"] /token_count, 2)
stats_df["longPOS"] = stats_df["POS"].apply(lambda x:spacy.explain(x))
stats_df

Unnamed: 0,POS,Num,Ratio,longPOS
0,ADV,6,0.01,adverb
1,VERB,45,0.11,verb
2,SCONJ,12,0.03,subordinating conjunction
3,NOUN,102,0.24,noun
4,PART,5,0.01,particle
5,AUX,15,0.04,auxiliary
6,ADJ,17,0.04,adjective
7,CCONJ,11,0.03,coordinating conjunction
8,PUNCT,40,0.09,punctuation
9,ADP,55,0.13,adposition


In [62]:
stats_df.sort_values(by=["Num"], ascending=False)

Unnamed: 0,POS,Num,Ratio,longPOS
3,NOUN,102,0.24,noun
9,ADP,55,0.13,adposition
1,VERB,45,0.11,verb
8,PUNCT,40,0.09,punctuation
10,DET,34,0.08,determiner
13,PROPN,29,0.07,proper noun
6,ADJ,17,0.04,adjective
14,NUM,16,0.04,numeral
5,AUX,15,0.04,auxiliary
2,SCONJ,12,0.03,subordinating conjunction


While there's a lot of nouns (legitimate tokens we might want to swap out), there's also a large number of punctionation.

That's alright. We can just black/whitelist certain POS and specify a ratio of words per POS to swap out (say 10 % of nouns).

Let's try setting a ratio, configuring the matcher, and feeding that into the replacer.

In [63]:
# stats builder
def get_stats(orig_doc:spacy.tokens.doc.Doc):
    """
    Returns the number of unique POS tags in the doc as {POS : Num} dictionary.
    """
    stats = dict()
    token_count = len(doc)
    for token in orig_doc:
        long_pos = token.pos_ # we can't use the long form explanation any longer because we need it to map in the matcher
        try:
            stats[long_pos] = stats[long_pos] + 1
        except KeyError:
            stats[long_pos] = 0
    return stats

# pattern builder
def get_patterns(stats:dict, blacklist:[str]=[], ratio:float=1.0):
    """
    Returns a list of lists of dictionary of POS and number of desired matches
    from a given dictionary of {POS : Num}, after removing the blacklisted POS tags.
    """
    assert(ratio >= 0 and ratio <= 1.0)
    whitelist = stats
    for ban in blacklist:
        whitelist.pop(ban, None)
    
    return [[{
            "POS": pos,
            "OP" : f"{{{math.ceil(num * ratio)}}}" # each pattern is its own list
        }]
        for pos, num in whitelist.items()]


def replace_word(orig_doc, matches):
    """
    Returns a string where the matches have been replaced by the corresponding POS tag information.
    """
    text = ''
    buffer_start = 0
    for _, match_start, _ in matches:
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += orig_doc[buffer_start: match_start].text + orig_doc[match_start - 1].whitespace_
            
        token = orig_doc[match_start]
        tag = spacy.explain(token.tag_)
        replacement = f"{{{tag}}}"
        text += replacement + token.whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += orig_doc[buffer_start:].text
    return text


In [64]:
RATIO = 0.1
BLACKLIST= ["determiner", "punctuation"]
df = pd.read_csv("archive/cnn_dailymail/test.csv")
nlp = spacy.load("en_core_web_sm")
text = df.iloc[0]['article']
doc= nlp(text)
stats = get_stats(doc)
print(stats)
matcher = Matcher(nlp.vocab)
patterns = get_patterns(stats= stats, blacklist= BLACKLIST, ratio= RATIO )

matcher.add("PATTERNS", patterns)

matches = matcher(doc)

with open("blanked_passage.txt", "w") as f:
    blanked_txt = replace_word(doc, matches)
    blanked_txt = blanked_txt.replace(". ", ".\n").replace("? ", "?\n")
    f.write(blanked_txt)
    print(blanked_txt)


{'ADV': 6, 'VERB': 45, 'SCONJ': 12, 'NOUN': 102, 'PART': 5, 'AUX': 15, 'ADJ': 17, 'CCONJ': 11, 'PUNCT': 40, 'ADP': 55, 'DET': 34, 'PRON': 11, 'PROPN': 29, 'SPACE': 0, 'NUM': 16, 'SYM': 0}
{adverb} noticed how plane seats appear {infinitival "to"} be getting smaller and smaller?
With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk.
They say that the shrinking space on aeroplanes is {adverb} {adverb} uncomfortable - it's putting our health and safety in danger.
More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger?
This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing {conjunction, subordinating or preposition} while the government is happy {infinitival "to"} set standards for animals flying on planes, it does{adverb} stipulate a minimum amount of space for humans.
'In a world where animals

It's looking pretty good! Can you spot a few problems? 

Problems: 
1. We might need to blacklist that {infinitival "to"}. In fact, blacklisting might not actually be working because the long form POS arent in stats.keys()
2. We need to randomise the locations of the replacements. Right now, it's replaced in order. 
    - Instead of calculating the match ratio, we can implement it as a random chance that's rolled each time we might need to replace a token.  

In [65]:
# stats builder 2
def get_stats(orig_doc:spacy.tokens.doc.Doc):
    """
    Returns the number of unique POS tags in the doc as (POS, Num, longPOS) Dataframe.
    """
    stats = dict()
    token_count = len(doc)
    for token in orig_doc:
        try:
            stats[token.pos_] = stats[token.pos_] + 1
        except KeyError:
            stats[token.pos_] = 0
    stats_df = pd.DataFrame(stats.items(), columns=["POS", "Num"])
    stats_df["longPOS"] = stats_df["POS"].apply(lambda x:spacy.explain(x))
    return stats_df

# pattern builder 2
def get_patterns(stats_df:pd.DataFrame, blacklist:[str]=[]):
    """
    Returns a list of lists of dictionary of POS and number of desired matches
    from a given Dataframe of (POS, Num, longPOS), after removing the blacklisted POS tags.
    """
    whitelist = stats_df[~stats_df["longPOS"].isin(blacklist)]
    return [[{
            "POS": pos,
            "OP" : f"{{,{num}}}" # each pattern is its own list
        }]
        for pos, num in zip(whitelist["POS"], whitelist["Num"] )]

def get_matches(patterns, matcher, doc, ratio=1.0):
    """
    Use a for loop to get the matches for each pattern.
    Take RATIO of each matches and replace them.
    """
    matches = []
    for pattern in patterns:
        matcher.add("PATTERN", [pattern])
        match = matcher(doc)
        matches.extend(
            random.sample(match, math.ceil(ratio * len(match)))
        )
        matcher.remove("PATTERN")

    # sort matches 

    return sorted(matches, key=lambda x:x[1])

# pasted again
def replace_word(orig_doc, matches):
    """
    Returns a string where the matches have been replaced by the corresponding POS tag information.
    """
    text = ''
    buffer_start = 0
    for _, match_start, _ in matches:
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += orig_doc[buffer_start: match_start].text + orig_doc[match_start - 1].whitespace_
        
        token = orig_doc[match_start]
        tag = spacy.explain(token.tag_)
        replacement = f"{{{tag}}}"

        text += replacement + token.whitespace_ # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += orig_doc[buffer_start:].text
    return text


In [66]:
start= time.monotonic()
RATIO = 0.1
BLACKLIST= ["determiner", "punctuation", "particle"]
df = pd.read_csv("archive/cnn_dailymail/test.csv")
nlp = spacy.load("en_core_web_sm")
text = df.iloc[0]['article']
doc= nlp(text)
stats_df = get_stats(doc)
matcher = Matcher(nlp.vocab)
patterns = get_patterns(stats_df= stats_df, blacklist= BLACKLIST)
matches = get_matches(patterns, matcher, doc, ratio=RATIO)

with open("blanked_passage.txt", "w") as f:
    blanked_txt = replace_word(doc, matches)
    blanked_txt = blanked_txt.replace(". ", ".\n").replace("? ", "?\n")
    f.write(blanked_txt)
    print(blanked_txt)
print(time.monotonic()-start)


Ever noticed how plane seats appear to {verb, base form} getting smaller and smaller?
With increasing numbers of people {verb, gerund or present participle} {conjunction, subordinating or preposition} the skies, some experts are {verb, gerund or present participle} if having such packed out planes is putting passengers at risk.
They say that the shrinking space on aeroplanes is not {adverb} uncomfortable - it's {verb, gerund or present participle} our health and safety in danger.
More than squabbling over the arm {noun, singular or mass}, shrinking space on planes putting our health {conjunction, coordinating} safety in danger?
This week, a U.S consumer advisory group set {adverb, particle} by the Department of Transportation said {conjunction, subordinating or preposition} a public hearing that {conjunction, subordinating or preposition} the government is {adjective (English), other noun-modifier (Chinese)} to set standards for animals flying on planes, it doesn't stipulate a minimum 

It's a little slow, but it works.