## Basics of NLP using Bigrams and Trigrams
Here we will programmatically generate some data science text using bigrams and trigrams. First we build a corpus by scraping a data science article from oreilly.com. Next we clean and prepare the data for our bigram and trigram models.

This notebook is based on the NLP chapter of "Data Science from Scratch" by Joel Grus. For more examples or information, check out the book, which I recommend:
https://learning.oreilly.com/library/view/data-science-from/9781492041122/


Let's use the Requests and Beautiful Soup libraries to get our corpus from an essay by Mike Loukides titled "What is Data Science".

In [1]:
import re
import random
from bs4 import BeautifulSoup
import requests
from collections import defaultdict

url = "https://www.oreilly.com/ideas/what-is-data-science"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')

Looking at the HTML, the article body can be accessed by finding the div with the class "main-post-radar-content".

In [2]:
content = soup.find("div", "main-post-radar-content")   # find article-body div

### Clean and Prep the Data
As with most data sourced this way, it is far from clean, so let's do some basic cleaning. First, the apostrophes show up as the character "â", so a function called fix_apostrophes replaces them with normal apostrophes.

Next we create a sequence (list) of words, keeping each "." as its own token to mark where a sentence ends.

In [3]:
def fix_apostrophes(text: str) -> str:
    return text.replace("â", "'")

regex = r"[\w']+|[\.]"                      # matches a word or a period
document = []

for paragraph in content("p"):
    words = re.findall(regex, fix_apostrophes(paragraph.text))
    document.extend(words)
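
To see what the pattern captures, here is a minimal sketch on a made-up sentence (the sample text and variable names are assumptions for illustration, not part of the article). Words and contractions are kept, each period becomes its own token, and other punctuation is dropped.

```python
import re

# Same pattern as above: a run of word characters/apostrophes, or a period.
TOKEN_PATTERN = r"[\w']+|[\.]"

# Hypothetical sample text, used only to illustrate the tokenization.
sample = "Data is everywhere. We've all heard it, haven't we?"

tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)
```

Because the period survives as a token, the models later on can use it as a sentence boundary.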

In [4]:
document[:10]

["We'",
 've',
 'all',
 'heard',
 'it',
 'according',
 'to',
 'Hal',
 'Varian',
 'statistics']
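
The split tokens `We'` and `ve` in the output above hint at an encoding problem. One plausible explanation, which is an assumption and not something confirmed here, is that the page's UTF-8 bytes were decoded as Latin-1, so each typographic apostrophe (U+2019) became "â" followed by two invisible control characters. The sketch below reproduces that effect on a made-up sentence and shows that decoding as UTF-8 first keeps contractions whole.

```python
import re

TOKEN_PATTERN = r"[\w']+|[\.]"

# Hypothetical sentence containing a typographic apostrophe (U+2019),
# encoded as UTF-8 bytes the way a web server might send it.
raw_bytes = "We\u2019ve all heard it.".encode("utf-8")

# Assumption: the bytes get decoded as Latin-1 instead of UTF-8.
# The apostrophe turns into "\u00e2" (the "â" character) plus two control characters.
garbled = raw_bytes.decode("latin-1")
garbled_tokens = re.findall(TOKEN_PATTERN, garbled.replace("\u00e2", "'"))

# Decoding as UTF-8 and normalizing U+2019 keeps "We've" as one token.
clean = raw_bytes.decode("utf-8").replace("\u2019", "'")
clean_tokens = re.findall(TOKEN_PATTERN, clean)

print(garbled_tokens)
print(clean_tokens)
```

With Requests, the corresponding change would likely be setting `response.encoding = "utf-8"` before reading `response.text`, assuming the page really is UTF-8.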

### Now let's create a bigram model
This model picks each next word at random from the words that follow the current word somewhere in the corpus. We use a defaultdict to build a lookup from each word to the list of words that follow it, then repeatedly draw a random entry from the list for the current word. Generation starts from ".", so the first word is one that followed a period in the corpus, and it stops at the next ".", returning the generated words joined into a string.

In [5]:
transitions = defaultdict(list)
for prev, current in zip(document, document[1:]):
    transitions[prev].append(current)
    
def bigrams_model() -> str:
    current = "."
    result = []
    while True:
        next_word_candidates = transitions[current]
        current = random.choice(next_word_candidates)
        result.append(current)
        if current == ".":
            return " ".join(result)

In [6]:
bi_gram_text = bigrams_model()
print(bi_gram_text, '-Bigram Model')

Asking things change over time . -Bigram Model
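
To make the lookup concrete, here is a small sketch of the same idea on a toy corpus. The word list, the `toy_` names, and the `generate_sentence` helper are assumptions for illustration; a seeded `random.Random` is passed in only so that runs are repeatable.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus, already tokenized, with "." marking sentence ends.
toy_document = ["data", "science", "is", "fun", ".",
                "data", "is", "everywhere", "."]

# Map each word to the list of words that follow it in the corpus.
toy_transitions = defaultdict(list)
for prev, current in zip(toy_document, toy_document[1:]):
    toy_transitions[prev].append(current)

def generate_sentence(transitions, rng):
    """Walk the lookup from "." until the next "." is drawn."""
    current = "."
    result = []
    while True:
        current = rng.choice(transitions[current])
        result.append(current)
        if current == ".":
            return " ".join(result)

print(dict(toy_transitions))
toy_sentence = generate_sentence(toy_transitions, random.Random(0))
print(toy_sentence)
```

Words that occur often after a given word appear several times in its list, so `choice` picks them proportionally more often.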


### Let's try using trigrams 
Trigrams are triplets of consecutive words. Let's see if they create sentences that make a little more sense.

In [7]:
trigram_transitions = defaultdict(list)
starts = []

for prev, current, next_word in zip(document, document[1:], document[2:]):

    if prev == ".":              # if the previous "word" was a period
        starts.append(current)   # then this is a start word

    trigram_transitions[(prev, current)].append(next_word)
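
As a quick illustration of what this loop builds, the sketch below runs the same logic on a toy corpus. The word list and the `toy_` names are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical toy corpus, already tokenized, with "." marking sentence ends.
toy_document = ["data", "is", "fun", ".",
                "data", "is", "big", ".",
                "data", "wins", "."]

toy_trigram_transitions = defaultdict(list)
toy_starts = []

# Zipping three offset views of the list yields every consecutive triple.
for prev, current, following in zip(toy_document, toy_document[1:], toy_document[2:]):
    if prev == ".":
        toy_starts.append(current)  # a word right after a period starts a sentence
    toy_trigram_transitions[(prev, current)].append(following)

print(toy_starts)
print(toy_trigram_transitions[("data", "is")])
```

Keying on a pair of words instead of a single word narrows the candidate list, which is why trigram output tends to follow the source text more closely.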

In [8]:
def trigrams_model() -> str:
    current = random.choice(starts)   # choose a random starting word
    prev = "."                        # and precede it with a '.'
    result = [current]
    while True:
        next_word_candidates = trigram_transitions[(prev, current)]
        next_word = random.choice(next_word_candidates)

        prev, current = current, next_word
        result.append(current)

        if current == ".":
            return " ".join(result)
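
One edge case worth noting: if the corpus does not end with a ".", the final word pair has no recorded successor, and `random.choice` raises `IndexError` on the empty list. The sketch below is one possible guard, not the book's method. The helper names `build_trigram_tables` and `generate_sentence` and the toy word list are assumptions for illustration.

```python
import random
from collections import defaultdict

def build_trigram_tables(words):
    """Return (transitions, starts) built the same way as the loop above."""
    transitions = defaultdict(list)
    starts = []
    for prev, current, following in zip(words, words[1:], words[2:]):
        if prev == ".":
            starts.append(current)
        transitions[(prev, current)].append(following)
    return transitions, starts

def generate_sentence(transitions, starts, rng):
    """Generate until a "." is drawn or the walk reaches a dead end."""
    current = rng.choice(starts)
    prev = "."
    result = [current]
    while True:
        # .get() avoids adding empty lists to the defaultdict on a miss.
        candidates = transitions.get((prev, current), [])
        if not candidates:
            return " ".join(result)  # dead end: stop instead of raising
        prev, current = current, rng.choice(candidates)
        result.append(current)
        if current == ".":
            return " ".join(result)

# Hypothetical toy corpus whose last sentence lacks a closing period.
toy_words = ["data", "is", "fun", ".", "data", "is", "big", ".", "data", "wins"]
toy_transitions, toy_starts = build_trigram_tables(toy_words)
sentence = generate_sentence(toy_transitions, toy_starts, random.Random(0))
print(sentence)
```

Returning the partial sentence is a design choice. Appending a "." to the corpus before building the tables would be another way to avoid the dead end.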

In [19]:
tri_gram_text = trigrams_model()
tri_gram_text

"' Almost any e commerce application is a great example of data jiujitsu identifying music by analyzing an audio stream directly is a quintessential artificial intelligence with human intelligence ."