![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel Moby Dick from [Project Gutenberg](https://www.gutenberg.org/), a website that hosts a large corpus of books, using the Python `requests` package. You'll extract the words from this web data with `BeautifulSoup`, then analyze their distribution using the Natural Language Toolkit (`nltk`) and `Counter`.

The novel's HTML edition on Project Gutenberg: https://www.gutenberg.org/files/2701/2701-h/2701-h.htm

The data science pipeline you'll build in this workspace can be used to analyze the word frequency distribution of any novel you can find on Project Gutenberg.

In [1]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

# Get the Moby Dick HTML  
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')

# Set the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extract the HTML from the request object
html = r.text

# Print the first 2000 characters in html
print(html[0:2000])

# Create a BeautifulSoup object from the HTML
html_soup = BeautifulSoup(html, "html.parser")

# Get the text out of the soup
moby_text = html_soup.get_text()

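# Create a tokenizer that matches runs of word characters
# (raw string so that \w is passed to the regex engine unchanged)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')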

# Tokenize the text
tokens = tokenizer.tokenize(moby_text)

# Create a list called words containing all tokens transformed to lowercase
words = [token.lower() for token in tokens]

# Print out the first eight words
print(words[:8])

# Get the English stop words from nltk
stop_words = nltk.corpus.stopwords.words('english')

# Print out the first eight stop words
print(stop_words[:8])

# Create a list words_no_stop containing all words that are in words but not in stop_words
words_no_stop = [word for word in words if word not in stop_words]

# Print the first five words_no_stop to check that stop words are gone
print(words_no_stop[:5])

# Initialize a Counter object from our processed list of words
count = Counter(words_no_stop)

# Store ten most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <title>
      Moby Dick; Or the Whale, by Herman Melville
    </title>
    <style type="text/css" xml:space="preserve">

    body { background:#faebd0; color:black; margin-left:15%; margin-right:15%; text-align:justify }
    P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    .foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; m

How to approach the project
1. Request and encode the text

2. Extract the text

3. Create a BeautifulSoup object and get the text

4. Tokenize the text

5. Convert words to lowercase

6. Load in stop words

7. Remove stop words from the text

8. Count the frequency of words

Steps to complete

1. Request and encode the text
Request the Moby Dick HTML file from the following URL, assign it to r, and set the text encoding to utf-8.

https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm
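
Setting the encoding matters because `r.text` decodes the raw response bytes using whatever `r.encoding` says. The sketch below uses only the standard library to show what happens when UTF-8 bytes are decoded with the wrong codec. The sample string is made up for illustration and is not taken from the novel.

```python
# Bytes as they might arrive over the network, encoded as UTF-8.
raw_bytes = "café".encode("utf-8")

# Decoding with the right codec recovers the original text.
decoded_ok = raw_bytes.decode("utf-8")

# Decoding with the wrong codec (Latin-1 here) yields garbled text.
decoded_wrong = raw_bytes.decode("latin-1")

print(decoded_ok)     # café
print(decoded_wrong)  # cafÃ©
```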


2. Extract the text
Extract the text from r and assign it to html, then print out the first 2000 characters in html.


3. Create a BeautifulSoup object and get the text
Create a BeautifulSoup object from html using html.parser and assign it to html_soup, then get the text from the soup and save it to moby_text.
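
As a rough illustration of this step on a small scale, the sketch below parses a short, made-up HTML snippet (not the real Moby Dick page) and pulls out its text. It assumes the `bs4` package is installed. All names starting with `sample_` are hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for the real page.
sample_html = """
<html>
  <head><title>Sample</title></head>
  <body>
    <h1>Loomings</h1>
    <p>Call me <i>Ishmael</i>.</p>
  </body>
</html>
"""

# Parse with Python's built-in HTML parser, as in the project.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text() drops the tags and keeps the text between them.
sample_text = sample_soup.get_text()

print(sample_text)
```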


4. Tokenize the text
Initialize a regex tokenizer object, tokenizer, using nltk.tokenize.RegexpTokenizer, passing in a regular expression that will keep only alphanumeric text; tokenize the text to split it into individual words and assign the resulting list of words to tokens.
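
To see what the pattern `\w+` keeps and discards, here is a small sketch that uses the standard library's `re.findall` as a stand-in for the `nltk` tokenizer. That substitution is an assumption made so the example runs without `nltk`; for this pattern the two should produce the same tokens. The sample sentence is made up.

```python
import re

sample = "Call me Ishmael. Don't ask about chapter_1 or the 3 whales!"

# \w+ matches runs of letters, digits, and underscores;
# punctuation and whitespace act as separators.
sample_tokens = re.findall(r"\w+", sample)

print(sample_tokens)
# ['Call', 'me', 'Ishmael', 'Don', 't', 'ask', 'about', 'chapter_1', 'or', 'the', '3', 'whales']
```

Note that the apostrophe splits `Don't` into `Don` and `t`, and that digits and underscores count as word characters, so "alphanumeric" is only an approximate description of what the pattern keeps.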


5. Convert words to lowercase
Loop through the words in tokens, make them lowercase, and store them in a list called words, then print out the first eight words.
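
Lowercasing matters for the later counting step, because `The` and `the` would otherwise be treated as different words. A minimal sketch on a made-up token list:

```python
sample_tokens = ["The", "whale", "the", "Whale", "THE"]

# Normalize case so that variants of the same word compare equal.
sample_words = [token.lower() for token in sample_tokens]

print(sample_words)             # ['the', 'whale', 'the', 'whale', 'the']
print(len(set(sample_tokens)))  # 5 distinct tokens before lowercasing
print(len(set(sample_words)))   # 2 distinct words after lowercasing
```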


6. Load in stop words
Load in the English stop words from nltk and assign them to stop_words, then print out the first eight stop words in stop_words.


7. Remove stop words from the text
Create a new list words_no_stop with the words from Moby Dick where stop words have been removed, then print the first five words in words_no_stop.
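
The filter checks every word against the stop-word list, so on a full novel it can help to convert the list to a `set` first, which makes each membership test faster. The sketch below shows the idea with a small, hand-picked stop-word set. It is a hypothetical stand-in, not the actual `nltk` list.

```python
# Hypothetical mini stop-word set for illustration only (not nltk's list).
sample_stop_words = {"the", "of", "a", "and", "is"}

sample_words = ["the", "whale", "is", "a", "monster", "of", "the", "sea"]

# Keep only the words that are not stop words; order is preserved.
sample_no_stop = [word for word in sample_words if word not in sample_stop_words]

print(sample_no_stop)  # ['whale', 'monster', 'sea']
```

With the real data, the same effect could be had by building `set(stop_words)` once before the list comprehension.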


8. Count the frequency of words
Initialize a Counter object called count using our words_no_stop list; use the corresponding method to return the ten most common words and their counts, assign the result to top_ten, then print it.
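
As a small illustration of how `Counter` and its `most_common` method behave, here is a sketch on a made-up word list rather than the novel:

```python
from collections import Counter

sample_words = ["whale", "sea", "whale", "ahab", "sea", "whale"]

# Counter maps each distinct word to the number of times it occurs.
sample_count = Counter(sample_words)

# most_common(n) returns the n highest counts as (word, count) pairs,
# sorted from most to least frequent.
sample_top_two = sample_count.most_common(2)

print(sample_top_two)  # [('whale', 3), ('sea', 2)]
```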

