## [Original article from Peter Norvig's notebook](http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313.ipynb)

## My Notes
- Most of Peter Norvig's puzzle solutions seem to share some common features (whether he applies them explicitly or implicitly), e.g.,
    - Have the "tests" ready as early as possible (AEAP), as a way of defining the problem
    - Define the glossary and utility functions in a domain-specific-language style
    - Approach iteratively: start with a baseline, then refine the approximation
    - Always back solutions with analysis: when they do or don't work, the pros and cons, the complexity, etc.

## Defining the problems (through tests...)
- find a regular expression that matches all of the positive strings and none of the negative strings

### Elaborating the problem usually leads to its representation and data structures
- regular expression: a Python `re` pattern
- positive/negative strings: two sets of strings
- match/no match: `re.search(pattern, s)` returns a match object on a match and `None` otherwise
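
As a minimal sketch of this representation (the pattern `a.o`, the four names, and the helper `is_match` are chosen here only for illustration):

```python
import re

positives = {"jacob", "mason"}   # strings the pattern must match
negatives = {"emma", "olivia"}   # strings the pattern must not match
pattern = "a.o"                  # hypothetical candidate regex

def is_match(pattern, s):
    """True if pattern matches anywhere in s; re.search returns None on no match."""
    return re.search(pattern, s) is not None

all_positives_match = all(is_match(pattern, s) for s in positives)
no_negative_matches = not any(is_match(pattern, s) for s in negatives)
print(all_positives_match, no_negative_matches)
```

Note that `re.search` looks for a match anywhere in the string, which is why the anchors `^` and `$` matter later on.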

### and finally, we need to define the test cases

In [9]:
import re

In [31]:
# we could define a plain yes/no test directly, but it is
# better to report some details, for debugging purposes

def mistakes(pat, positives, negatives):
    """
    - pat: compiled regular expression
    - positives: a set of strings that must match
    - negatives: a set of strings that must not match
    - returns: a set of labelled strings the pattern gets wrong
    """
    return ({"missed positive %s" % p
             for p in positives if not pat.search(p)} |
            {"missed negative %s" % n
             for n in negatives if pat.search(n)})

def verify(pat, positives, negatives):
    "Return True if pat makes no mistakes; raise AssertionError otherwise."
    if isinstance(pat, str):  # accept a pattern string as well as a compiled pattern
        pat = re.compile(pat)
    assert not mistakes(pat, positives, negatives)
    return True
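
To see why the detailed report helps with debugging, here is a small sketch that runs two deliberately flawed patterns through `mistakes`. The function is repeated so the snippet runs on its own, and the names and patterns are picked only for illustration:

```python
import re

def mistakes(pat, positives, negatives):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    return ({"missed positive %s" % p for p in positives if not pat.search(p)} |
            {"missed negative %s" % n for n in negatives if pat.search(n)})

positives = {"jacob", "mason", "noah"}
negatives = {"emma", "olivia", "mia"}

# 'a.o' is too narrow: it never matches a negative, but it misses "noah".
too_narrow = mistakes(re.compile("a.o"), positives, negatives)

# 'a' is too broad: it matches every positive, but also every negative.
too_broad = mistakes(re.compile("a"), positives, negatives)

print(sorted(too_narrow))
print(sorted(too_broad))
```

A plain boolean would report "failed" in both cases; the labelled set says which direction the pattern needs to move.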

## Strategy for Solutions

Again, Peter Norvig seems to have several "patterns" for attacking a problem, and they correlate in interesting ways with ["How to Solve It"](https://www.amazon.com/How-Solve-Mathematical-Princeton-Science/dp/069116407X/ref=sr_1_1?ie=UTF8&qid=1493345035&sr=8-1&keywords=how+to+solve+it).

- try different special cases to gain insight if you have no clue how to start
- link it to a recognized problem, e.g., Peter Norvig links it to the "set cover" problem
- analyze the solutions: pros and cons, complexity, etc.

__Read the original article for insight into Peter Norvig's solution and how he analyzed it.__

In summary, the solution goes like this:
- come up with basic elements for composing partial solutions, i.e., regexes that match at least one positive without matching any negative
- with those, it becomes a search problem, and greedy search is usually a good approximation for NP-hard problems such as set cover
- define the utility functions as close to the domain-specific language as possible
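
The greedy idea can be sketched on a plain set-cover instance before any regexes are involved. This is an illustrative toy, not Norvig's code: `greedy_cover` and the subsets `A` to `D` are made up for the example, and the heuristic here counts only newly covered elements.

```python
def greedy_cover(universe, subsets):
    """Greedily pick subset names until every element of universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset covering the most still-uncovered elements
        # (ties broken alphabetically, because the names are sorted first).
        best = max(sorted(subsets), key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

subsets = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 5},
    "C": {3, 4, 6},
    "D": {5, 6},
}
cover = greedy_cover({1, 2, 3, 4, 5, 6}, subsets)
print(cover)  # ['A', 'D']
```

In the regex version, the universe is the set of positives, each candidate regex part plays the role of a subset (the positives it matches), and the score also penalizes the length of the part.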

In [32]:
import itertools

def findregex(winners, losers, k=4):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of regex parts, then pick from them to cover winners.
    # On each iteration, add the 'best' part to 'solution',
    # remove winners covered by best, and keep in 'pool' only parts
    # that still match some winner.
    pool = regex_parts(winners, losers)
    solution = []
    def score(part): return k * len(matches(part, winners)) - len(part)
    while winners:
        best = max(pool, key=score)
        solution.append(best)
        winners = winners - matches(best, winners)
        pool = {r for r in pool if matches(r, winners)}
    return OR(solution)

def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    return {s for s in strings if re.search(regex, s)}

OR = '|'.join # Join a sequence of strings with '|' between them
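
To get a feel for the `score` heuristic, the sketch below evaluates it by hand on a few candidate parts for the boys' names used later in these notes. `matches` is repeated so the snippet is self-contained, and the three parts are picked only for illustration:

```python
import re

def matches(regex, strings):
    return {s for s in strings if re.search(regex, s)}

winners = set('jacob mason ethan noah william liam jayden michael alexander aiden'.split())

def score(part, k=4):
    # k points per winner matched, minus one point per character of the part
    return k * len(matches(part, winners)) - len(part)

scores = {part: score(part) for part in ['a.$', 'a.o', '^jacob$']}
print(scores)  # {'a.$': 13, 'a.o': 5, '^jacob$': -3}
```

A short part that covers several winners beats a long, exact one, which is what pushes the search toward compact regexes. The parameter `k` sets how many extra characters one additional covered winner is worth.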

In [33]:
def regex_parts(winners, losers):
    "Return parts that match at least one winner, but no loser."
    wholes = {'^' + w + '$'  for w in winners}
    parts = {d for w in wholes for p in subparts(w) for d in dotify(p)}
    return wholes | {p for p in parts if not matches(p, losers)}

def subparts(word, N=4):
    "Return a set of subparts of word: consecutive characters up to length N (default 4)."
    return set(word[i:i+n+1] for i in range(len(word)) for n in range(N)) 
    
def dotify(part):
    "Return all ways to replace a subset of chars in part with '.'."
    choices = map(replacements, part)
    return {cat(chars) for chars in itertools.product(*choices)}

def replacements(c): return c if c in '^$' else c + '.'

cat = ''.join
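
A small worked example makes the part generation concrete. The two helpers are restated compactly (same behavior as the cell above, with `replacements` and `cat` inlined) so the snippet runs by itself; the inputs `'^it$'` and `'it$'` are chosen just to keep the output short:

```python
import itertools

def subparts(word, N=4):
    # all substrings of length 1..N
    return {word[i:i + n + 1] for i in range(len(word)) for n in range(N)}

def dotify(part):
    # each character may stay or become '.', except the anchors '^' and '$'
    choices = [c if c in '^$' else c + '.' for c in part]
    return {''.join(chars) for chars in itertools.product(*choices)}

sub = subparts('^it$')
dots = dotify('it$')
print(sorted(sub))
print(sorted(dots))  # ['..$', '.t$', 'i.$', 'it$']
```

`dotify` doubles the number of candidates for every non-anchor character, so capping the part length at `N` is what keeps the pool manageable.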

In [48]:
def words(s):
    return set(s.split())

positives = words('jacob mason ethan noah william liam jayden michael alexander aiden')
negatives = words('sophia emma isabella olivia ava emily abigail mia madison elizabeth')
solution = findregex(positives, negatives)
verify(solution, positives, negatives)

len(solution), solution

(11, 'e.$|a.$|a.o')

In [51]:
pat = re.compile(solution)
pat.search("michelle")  # returns None: no alternative matches this unseen name

The learned regular expression can be thought of as a disjunction of different "features".

In [63]:
solution = findregex(set(["I love", "I hate"]), set(["I", "love", "hate"]))
solution

' '
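
One way to check the "features" reading on the names example is to split the learned regex into its alternatives and list which positives each one covers. Splitting on `|` is a shortcut that only works here because the parts contain no groups or escaped bars:

```python
import re

solution = 'e.$|a.$|a.o'
positives = set('jacob mason ethan noah william liam jayden michael alexander aiden'.split())

features = solution.split('|')  # naive split, valid only for flat disjunctions
covered_by = {f: {s for s in positives if re.search(f, s)} for f in features}

for feature, names in covered_by.items():
    print(feature, sorted(names))
```

Each alternative captures one regularity in the data, such as "ends in `e` plus one letter", and together the three cover all ten positives.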

## Reflections

I love the reflection part the most. Peter Norvig has a very clear understanding (or instinct) of which approaches are best for which problems. Of course, it takes a lot of experience to build that.
```
I was asked whether Randall was wrong to come up with "only" a 10-character Star Wars regex, whereas I showed there is a 9-character version. I would say that, given his role as a cartoonist, author, public speaker, educator, and entertainer, he has chosen ... wisely. He wrote a program that was good enough to allow him to make a great webcomic. A 9-character regex would not improve the comic. Randall stated that he used a genetic algorithm to find his regexes, and it has been said that genetic algorithms are often the second (or was it the third?) best method to solve any problem, and that's all he needed. But if you consider that in addition to all those roles, Randall is also still a practicing computer scientist, you could say he chose ... poorly. Genetic algorithms are good when you want to combine the structure of two solutions to yield a better solution, so they would work well if the best regexes had a complicated tree structure. But they don't! The best solutions are disjunctions of small parts. So the genetic algorithm is trying to combine the first half of one disjunction with the second half of another—but that isn't useful, because the components of a disjunction are unordered; imposing an ordering on them doesn't help.
```
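
The point that the components of a disjunction are unordered can be checked directly. The sketch below, written for these notes rather than taken from the article, tries every ordering of the three parts of the names solution and confirms that each ordering matches exactly the same strings:

```python
import itertools
import re

parts = ['e.$', 'a.$', 'a.o']
positives = set('jacob mason ethan noah william liam jayden michael alexander aiden'.split())
negatives = set('sophia emma isabella olivia ava emily abigail mia madison elizabeth'.split())
strings = positives | negatives

def matched(regex):
    return frozenset(s for s in strings if re.search(regex, s))

# One set of matched strings per ordering of the alternatives
results = {matched('|'.join(order)) for order in itertools.permutations(parts)}
print(len(results))  # 1: every ordering matches the same strings
```

Since position within the disjunction carries no information, a crossover that splices "the first half" of one solution onto "the second half" of another has nothing meaningful to preserve.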