![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel *Moby Dick* from [Project Gutenberg](https://www.gutenberg.org/), a website that hosts a large corpus of books, using the Python `requests` package. You'll extract the text from this web data using `BeautifulSoup`, then analyze the distribution of words using the Natural Language Toolkit (`nltk`) and `Counter`.

The data science pipeline you'll build in this workspace can be reused to analyze the word frequency distribution of any novel you can find on Project Gutenberg.
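The counting steps of the pipeline (tokenize, lowercase, drop stop words, count) can be tried on a short string before downloading anything. The block below is a minimal sketch, not the notebook's own code: it swaps in the standard library's `re.findall` for `nltk`'s `RegexpTokenizer` (with the same `\w+` pattern), and `top_words`, `SAMPLE_STOP_WORDS`, and `sample` are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word set for illustration;
# the notebook itself uses nltk's English stop word list.
SAMPLE_STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def top_words(text, stop_words, n=3):
    """Return the n most common non-stop words in text."""
    # Split into runs of word characters (letters, digits, underscore)
    tokens = re.findall(r"\w+", text)
    # Lowercase so that "Whale" and "whale" are counted together
    words = [t.lower() for t in tokens]
    # Drop the stop words, then count what is left
    kept = [w for w in words if w not in stop_words]
    return Counter(kept).most_common(n)


sample = "The whale and the sea. The whale is a whale of the sea."
result = top_words(sample, SAMPLE_STOP_WORDS, n=2)
print(result)  # [('whale', 3), ('sea', 2)]
```

Passing the stop words in as a parameter keeps the function usable with any list, including the one `nltk` provides.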

In [7]:
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
from nltk.tokenize import RegexpTokenizer

nltk.download('stopwords')
from nltk.corpus import stopwords

url = "https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
r.encoding = 'utf-8'

html = r.text

# REQUIRED VARIABLE NAME
html_soup = BeautifulSoup(html, "html.parser")

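# Strip the markup and keep only the page's text
moby_text = html_soup.get_text()

# Split the text into runs of word characters (letters, digits, underscore)
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(moby_text)

# Lowercase every token so that "Whale" and "whale" are counted together
words = [w.lower() for w in tokens]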

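# A set makes the membership test below O(1) per word instead of a list scan
stop_words = set(stopwords.words('english'))
words_no_stop = [w for w in words if w not in stop_words]

# Count the remaining words and show the ten most frequent
count = Counter(words_no_stop)
top_ten = count.most_common(10)
print(top_ten)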


[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
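
The printed counts are easier to compare as bars. The block below is a minimal sketch that renders the ten pairs from the output above as a text bar chart using only the standard library; `text_bars` and its `width` parameter are hypothetical names introduced for illustration, and a plotting library could be used instead.

```python
# The ten (word, count) pairs copied from the cell output above
top_ten = [('whale', 1246), ('one', 925), ('like', 647), ('upon', 568),
           ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473),
           ('sea', 455), ('old', 452)]


def text_bars(pairs, width=40):
    """Render (word, count) pairs as bars scaled to the largest count."""
    largest = max(count for _, count in pairs)
    lines = []
    for word, count in pairs:
        # The largest count gets `width` characters; others scale linearly
        bar = "#" * round(count / largest * width)
        lines.append(f"{word:<6} {bar} {count}")
    return lines


chart = text_bars(top_ten)
print("\n".join(chart))
```

Each bar's length is proportional to its count, so "whale" fills the full width and the remaining words are drawn relative to it.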
