## A Low-Level Look at the Basics of NLP
#### This is a notebook based on the NLP chapter in "Data Science from Scratch" by Joel Grus. Please check out the book here:
https://learning.oreilly.com/library/view/data-science-from/9781492041122/

#### Or his blog here:
https://joelgrus.com/


We are going to use the Requests and Beautiful Soup libraries to get some data. Here we grab the text of an essay by Mike Loukides titled "What is Data Science".

In [1]:
import re
import random
from bs4 import BeautifulSoup
import requests
from collections import defaultdict

url = "https://www.oreilly.com/ideas/what-is-data-science"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')

Looking at the data we have, we need to get at the article body by finding the div with the class "main-post-radar-content".

In [None]:
content = soup.find("div", "main-post-radar-content")   # find article-body div

The data is far from clean, as is most data you will get, so let's do some basic cleaning. First, a function called fix_apostrophes replaces the apostrophes that show up as the character "â" with normal apostrophes. Next, we create a sequence (list) of words, keeping each "." as its own token so that it marks where a sentence ends.

In [3]:
def fix_apostrophes(text: str) -> str:
    return text.replace("â", "'")

regex = r"[\w']+|[\.]"   # a run of word characters/apostrophes, or a single period
document = []

for paragraph in content("p"):
    words = re.findall(regex, fix_apostrophes(paragraph.text))
    document.extend(words)

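To see what the tokenizing regex does on its own, here is a minimal sketch that runs the same pattern over a made-up sentence (the sample text is an assumption for illustration, not taken from the essay). Words keep their apostrophes, and each period comes out as a separate token.

```python
import re

# Same pattern as in the cell above.
regex = r"[\w']+|[\.]"

# Made-up sample text, used only for illustration.
sample = "Data science isn't magic. It's mostly cleaning data."

tokens = re.findall(regex, sample)
print(tokens)
# ['Data', 'science', "isn't", 'magic', '.', "It's", 'mostly', 'cleaning', 'data', '.']
```
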
In [10]:
document[:10]

["We'",
 've',
 'all',
 'heard',
 'it',
 'according',
 'to',
 'Hal',
 'Varian',
 'statistics']

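The first two tokens, "We'" and 've', hint that the cleaning step is incomplete. One plausible explanation (an assumption, not verified against the page) is that the curly apostrophe U+2019 arrived as three mis-decoded characters starting with "â", so replacing only the "â" leaves two stray characters that split the word. The sketch below uses a hypothetical helper, `fix_apostrophes_sketch`, that replaces the whole sequence instead.

```python
import re

def fix_apostrophes_sketch(text: str) -> str:
    """Hypothetical variant of fix_apostrophes; replaces the whole bad sequence."""
    # U+2019 encoded as UTF-8 but decoded with a single-byte codec (assumed cause).
    mojibake_latin1 = "\u2019".encode("utf-8").decode("latin-1")
    mojibake_cp1252 = "\u2019".encode("utf-8").decode("cp1252")
    for bad in ("\u2019", mojibake_latin1, mojibake_cp1252):
        text = text.replace(bad, "'")
    return text

# Simulate the damaged text rather than fetching the page.
damaged = "We\u2019ve all heard it".encode("utf-8").decode("latin-1")
tokens = re.findall(r"[\w']+|[\.]", fix_apostrophes_sketch(damaged))
print(tokens)   # ["We've", 'all', 'heard', 'it']
```
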
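Using a bigram model, we can start from some word and repeatedly choose, at random, one of the words that follow it in the source document. Starting from "." means the first word chosen is one that began a sentence, and we stop as soon as we generate the next ".".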

In [5]:
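# Map each word to the list of words that follow it in the document.
transitions = defaultdict(list)
for prev, current in zip(document, document[1:]):
    transitions[prev].append(current)

def generate_using_bigrams() -> str:
    current = "."   # this means the next word will start a sentence
    result = []
    while True:
        next_word_candidates = transitions[current]    # bigrams (current, _)
        current = random.choice(next_word_candidates)  # choose one at random
        result.append(current)                         # append it to the result
        if current == ".":                             # a "." ends the sentence
            return " ".join(result)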

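To make the transitions table concrete, here is a small sketch on a toy document (the toy tokens and the `toy_` names are made up for illustration). Because repeated followers stay in the list, random.choice picks each follower in proportion to how often it appeared.

```python
import random
from collections import defaultdict

# Made-up token list standing in for the real document.
toy_document = ["data", "is", "big", ".",
                "data", "is", "messy", ".",
                "big", "data", "wins", "."]

toy_transitions = defaultdict(list)
for prev, current in zip(toy_document, toy_document[1:]):
    toy_transitions[prev].append(current)

print(toy_transitions["data"])   # ['is', 'is', 'wins']
print(toy_transitions["."])      # ['data', 'big']

def generate_toy_sentence(rng: random.Random) -> str:
    """Same loop as generate_using_bigrams, but over the toy table."""
    current, result = ".", []
    while True:
        current = rng.choice(toy_transitions[current])
        result.append(current)
        if current == ".":
            return " ".join(result)

sentence = generate_toy_sentence(random.Random(0))
print(sentence)
```
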
In [6]:
bi_gram_text = generate_using_bigrams()
bi_gram_text

"It' s Dynamo and find out what can create animations that analyzed the existence of time ."

Let's try using trigrams, which are triplets of consecutive words, to see if we can create sentences that make a little more sense.

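One way to see why a longer context might help is to count how many distinct next words each context allows. The sketch below does that on a made-up toy document; `average_choices` is a hypothetical helper, not something from the book. With two words of context there are fewer options at each step, so the output stays closer to the source text.

```python
from collections import defaultdict

# Made-up token list standing in for the real document.
toy_document = ["data", "is", "big", ".",
                "data", "is", "messy", ".",
                "big", "data", "wins", "."]

def average_choices(tokens: list, context_size: int) -> float:
    """Average number of distinct next words per context of the given size."""
    followers = defaultdict(set)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        followers[context].add(tokens[i + context_size])
    return sum(len(words) for words in followers.values()) / len(followers)

bigram_avg = average_choices(toy_document, 1)    # one word of context
trigram_avg = average_choices(toy_document, 2)   # two words of context
print(f"{bigram_avg:.2f} {trigram_avg:.2f}")     # 1.67 1.11
```
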
In [7]:
trigram_transitions = defaultdict(list)
starts = []

for prev, current, next_word in zip(document, document[1:], document[2:]):

    if prev == ".":              # if the previous "word" was a period
        starts.append(current)   # then this is a start word

    # next_word avoids shadowing Python's built-in next()
    trigram_transitions[(prev, current)].append(next_word)

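Here is the same construction on a toy document, as a sketch with made-up tokens and `toy_` names. Note that the very first word of the document never lands in the starts list, because no period precedes it.

```python
from collections import defaultdict

# Made-up token list standing in for the real document.
toy_document = ["data", "is", "big", ".",
                "data", "is", "messy", ".",
                "big", "data", "wins", "."]

toy_trigram_transitions = defaultdict(list)
toy_starts = []

for prev, current, next_word in zip(toy_document, toy_document[1:], toy_document[2:]):
    if prev == ".":                  # the word after a period starts a sentence
        toy_starts.append(current)
    toy_trigram_transitions[(prev, current)].append(next_word)

print(toy_starts)                                  # ['data', 'big']
print(toy_trigram_transitions[("data", "is")])     # ['big', 'messy']
print(toy_trigram_transitions[(".", "data")])      # ['is']
```
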
In [8]:
def generate_using_trigrams() -> str:
    current = random.choice(starts)   # choose a random starting word
    prev = "."                        # and precede it with a '.'
    result = [current]
    while True:
        next_word_candidates = trigram_transitions[(prev, current)]
        next_word = random.choice(next_word_candidates)

        prev, current = current, next_word
        result.append(current)

        if current == ".":
            return " ".join(result)

In [9]:
tri_gram_text = generate_using_trigrams()
tri_gram_text

'There are many packages for plotting and presenting data .'