# Natural Language Processing

Chapter 20 of [Data Science from Scratch](http://shop.oreilly.com/product/0636920033400.do). Joel's code: [natural_language_processing.py](https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/natural_language_processing.py)

In [34]:
from bs4 import BeautifulSoup
from collections import defaultdict
import random
import requests
import re

## n-gram models

In [43]:
url = 'http://radar.oreilly.com/2010/06/what-is-data-science.html'
html = requests.get(url).text

In [44]:
soup = BeautifulSoup(html, 'html5lib')
content = soup.find('div', {'class': 'article-body'})

In [45]:
regex = r"[\w']+|[\.]"
document = []
for paragraph in content("p"):
    words = re.findall(regex, paragraph.text)
    document.extend(words)
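
The tokenizing regex above can be checked on a short string. The sketch below uses invented sample sentences (an assumption, not text from the article) and a hypothetical `fix_unicode` helper. It suggests one likely reason for fragments such as `you ve` in the generated sentences: a typographic apostrophe (U+2019) is matched by neither `[\w']` nor `[\.]`, so the contraction is split in two.

```python
import re

regex = r"[\w']+|[\.]"  # runs of word characters/apostrophes, or a single period

def fix_unicode(text):
    # hypothetical helper: map the typographic apostrophe to a plain one
    return text.replace("\u2019", "'")

plain = "Data science isn't magic. It's work."
curly = "you\u2019ve seen it."

plain_tokens = re.findall(regex, plain)
curly_tokens = re.findall(regex, curly)
fixed_tokens = re.findall(regex, fix_unicode(curly))

print(plain_tokens)  # contractions with a plain apostrophe stay whole
print(curly_tokens)  # the curly apostrophe splits "you've" into two tokens
print(fixed_tokens)  # after the replacement the contraction is one token
```

If this is the cause, applying such a replacement to `paragraph.text` before `re.findall` would keep contractions intact.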

In [46]:
bigrams = zip(document, document[1:])
transitions = defaultdict(list)
for prev, current in bigrams:
    transitions[prev].append(current)

In [47]:
len(transitions)

1470
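
To see what `transitions` holds, here is a minimal sketch on a toy token list (invented for illustration, not taken from the article). Each key maps to every word that followed it, duplicates included, so `random.choice` picks a follower with probability proportional to how often it appeared.

```python
from collections import defaultdict

# toy token list, invented for illustration
toy_document = ["the", "cat", "sat", ".",
                "the", "cat", "ran", ".",
                "the", "dog", "sat", "."]

toy_transitions = defaultdict(list)
for prev, current in zip(toy_document, toy_document[1:]):
    toy_transitions[prev].append(current)

print(toy_transitions["the"])  # "cat" is listed twice, "dog" once
print(toy_transitions["."])    # words that start a sentence
print(len(toy_transitions))    # number of distinct words that have a follower
```

Here "cat" is twice as likely as "dog" to be chosen after "the", because it appears twice in the list.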

In [48]:
def generate_using_bigrams():
    # start from "." so that the first word chosen is one that begins a sentence
    current = "."
    result = []

    while True:
        next_word_candidates = transitions[current]
        current = random.choice(next_word_candidates)
        result.append(current)
        # a period ends the sentence
        if current == ".":
            return " ".join(result)

In [49]:
generate_using_bigrams()

'But the Philadelphia County by the track lengths and delivers intermediate results in near real people you ve all the grammatical structure of operations fails you have real time .'

In [51]:
trigrams = zip(document, document[1:], document[2:])
trigram_transitions = defaultdict(list)
starts = []

In [55]:
for a, b, c in trigrams:
    if a == ".":
        starts.append(b)
    trigram_transitions[(a, b)].append(c)

In [57]:
def generate_using_trigrams():
    # choose a random starting word
    # and precede it with a '.'
    current = random.choice(starts)
    prev = "."
    result = [current]

    while True:
        next_word_candidates = trigram_transitions[(prev, current)]
        next_word = random.choice(next_word_candidates)
        prev, current = current, next_word
        result.append(current)

        if current == ".":
            return " ".join(result)

In [70]:
generate_using_trigrams()

'Near real time reports on trending topics report only needs to be able to understand the grammatical structure of a flu virus through a population Making data tell its story .'
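
One way to see why trigram output tends to read better is to count how many distinct next words each context allows. The sketch below does this on a toy token list (invented for illustration); `average_choices` is a hypothetical helper, not part of the chapter's code.

```python
from collections import defaultdict

# toy token list, invented for illustration
toy_document = ["the", "cat", "sat", ".",
                "the", "cat", "ran", ".",
                "the", "dog", "sat", "."]

# distinct followers of each single-word context
bigram_next = defaultdict(set)
for a, b in zip(toy_document, toy_document[1:]):
    bigram_next[a].add(b)

# distinct followers of each two-word context
trigram_next = defaultdict(set)
for a, b, c in zip(toy_document, toy_document[1:], toy_document[2:]):
    trigram_next[(a, b)].add(c)

def average_choices(table):
    # mean number of distinct next words per context
    return sum(len(nexts) for nexts in table.values()) / len(table)

print(average_choices(bigram_next))
print(average_choices(trigram_next))
```

On this toy data the trigram average is lower, meaning each step has fewer options. With a real document the effect is usually stronger, at the cost of output that copies longer runs of the source.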

## Grammars

In [71]:
grammar = {
    "_S"  : ["_NP _VP"],
    "_NP" : ["_N",
             "_A _NP _P _A _N"],
    "_VP" : ["_V",
             "_V _NP"],
    "_N"  : ["data science", "Python", "regression"],
    "_A"  : ["big", "linear", "logistic"],
    "_P"  : ["about", "near"],
    "_V"  : ["learns", "trains", "tests", "is"]
}

In [72]:
def is_terminal(token):
    return token[0] != "_"

In [77]:
def expand(grammar, tokens):
    for i, token in enumerate(tokens):

        # ignore terminals
        if is_terminal(token):
            continue

        # choose a replacement at random
        replacement = random.choice(grammar[token])

        if is_terminal(replacement):
            # a terminal such as "data science" stays a single token
            tokens[i] = replacement
        else:
            # a production such as "_NP _VP" is split into separate tokens
            tokens = tokens[:i] + replacement.split() + tokens[(i+1):]

        # recurse on the updated token list
        return expand(grammar, tokens)

    # if we get here we had all terminals and are done
    return tokens

In [78]:
def generate_sentence(grammar):
    return expand(grammar, ["_S"])

In [79]:
generate_sentence(grammar)

['linear', 'data science', 'near', 'logistic', 'Python', 'is']
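
The expansion order is easier to follow with the randomness removed. The sketch below is an iterative variant, not the recursive `expand` above: `tiny_grammar` and `expand_with_trace` are hypothetical names, and the grammar has exactly one production per rule so the result is deterministic. It records the token list after each rewrite of the leftmost non-terminal.

```python
def is_terminal(token):
    return token[0] != "_"

# hypothetical grammar with one production per rule, so there is no random choice
tiny_grammar = {
    "_S": ["_NP _VP"],
    "_NP": ["_A _N"],
    "_VP": ["_V"],
    "_A": ["linear"],
    "_N": ["regression"],
    "_V": ["learns"],
}

def expand_with_trace(grammar, tokens):
    # rewrite the leftmost non-terminal until none remain, keeping every step
    steps = [list(tokens)]
    while True:
        for i, token in enumerate(tokens):
            if not is_terminal(token):
                replacement = grammar[token][0]  # the only production for this rule
                if is_terminal(replacement):
                    new_tokens = [replacement]
                else:
                    new_tokens = replacement.split()
                tokens = tokens[:i] + new_tokens + tokens[i + 1:]
                steps.append(list(tokens))
                break
        else:
            # no non-terminal was found, so the expansion is complete
            return steps

for step in expand_with_trace(tiny_grammar, ["_S"]):
    print(step)
```

Unlike `expand`, this version never modifies the list passed in, which makes it safer to call with a list that is reused elsewhere.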