![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel _Moby-Dick_ from [Project Gutenberg](https://www.gutenberg.org/), a website that hosts a large corpus of books, using the Python `requests` package. You'll extract the text from this web data using `BeautifulSoup`, then analyze the distribution of words using the Natural Language Toolkit (`nltk`) and `Counter`.

The data science pipeline you'll build in this workspace can be used to visualize the word frequency distribution of any novel you can find on Project Gutenberg.

# 📘 Moby-Dick Text Mining & Word Frequency Analysis

### Summary
In this notebook, we scrape the novel _Moby-Dick_ from Project Gutenberg using the `requests` library, then extract the text with `BeautifulSoup`. Once the raw text is extracted, we tokenize and clean it with `nltk` and count the remaining words to identify the most frequently used words in the novel.

---

### 🔍 Project Pipeline
1. Download the HTML version of _Moby-Dick_ from Project Gutenberg  
2. Parse and extract text using BeautifulSoup 
3. Clean and tokenize the text using NLTK  
4. Remove English stopwords  
5. Count word frequencies using `collections.Counter`
6. Visualize the most common words

---

### 📦 Libraries Used
| Purpose | Library |
|--------|---------|
| Web scraping | `requests`, `BeautifulSoup` |
| NLP / tokenization | `nltk` |
| Frequency counting | `collections.Counter` |
| Visualization (optional) | `matplotlib` or `seaborn` |

---

### 🎯 Final Goal
Produce a ranked list (and optional visualization) of the most frequent words in _Moby-Dick_ after stopword removal.

---


In [None]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

# Request the page and set its text encoding
url = "https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
r.encoding = "utf-8"
html = r.text

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Extract the text from the HTML
html_soup = BeautifulSoup(html, "html.parser")
moby_text = html_soup.get_text()

In [None]:
# Tokenize the text into runs of word characters
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(moby_text)
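
To make the tokenization step concrete, here is a minimal sketch that applies the same `\w+` pattern with the standard library's `re.findall`. The sample sentences and variable names are illustrative assumptions and not part of the notebook. For a simple pattern like this one, the result should match what `RegexpTokenizer` produces: punctuation is dropped and possessives are split at the apostrophe.

```python
import re

# Illustrative sample text (an assumption; the notebook tokenizes moby_text)
sample = "Call me Ishmael. Some years ago--never mind how long precisely."

# Every maximal run of word characters becomes one token
sample_tokens = re.findall(r"\w+", sample)
print(sample_tokens)
# ['Call', 'me', 'Ishmael', 'Some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely']

# An apostrophe is not a word character, so a possessive yields two tokens
possessive_tokens = re.findall(r"\w+", "the whale's jaw")
print(possessive_tokens)
# ['the', 'whale', 's', 'jaw']
```

The stray `s` tokens produced this way are harmless here, because `s` is typically on the English stopword list and is filtered out later.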

In [None]:
# Convert every token to lowercase
words = [word.lower() for word in tokens]
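
Lowercasing matters for the steps that follow: the NLTK stopword list is lowercase, and counting is case-sensitive. The sketch below uses a tiny hypothetical token list and a two-word stand-in for the stopword list (both are assumptions for illustration) to show what could go wrong if this step were skipped.

```python
from collections import Counter

# Hypothetical miniature inputs, not taken from the notebook
sample_tokens = ["The", "whale", "and", "THE", "Whale"]
sample_stop_words = {"the", "and"}  # lowercase, like the NLTK list

# Without lowercasing, capitalized stopwords slip through the filter
kept_raw = [t for t in sample_tokens if t not in sample_stop_words]
print(kept_raw)  # ['The', 'whale', 'THE', 'Whale']

# With lowercasing first, stopwords are removed and case variants merge
kept_lower = [t.lower() for t in sample_tokens if t.lower() not in sample_stop_words]
print(Counter(kept_lower))  # Counter({'whale': 2})
```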

In [None]:
# Load the English stopwords
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
print(stop_words[:8])

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all']


In [None]:
# Remove stopwords from the text (a set makes each membership test fast)
stop_set = set(stop_words)
words_no_stop = [word for word in words if word not in stop_set]

In [None]:
# Count how often each remaining word occurs
count = Counter(words_no_stop)

# Display the ten most common words
top_ten = count.most_common(10)
print(top_ten)


[('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
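
The optional visualization step could look something like the sketch below, which assumes `matplotlib` is installed. The helper names `split_pairs` and `plot_top_words` are hypothetical and not part of the notebook. The `top_ten` list is copied from the output above so the block runs on its own.

```python
# Top-ten (word, count) pairs copied from the notebook output above
top_ten = [('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527),
           ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]


def split_pairs(pairs):
    """Split (word, count) pairs into two parallel lists for plotting."""
    labels = [word for word, _ in pairs]
    counts = [n for _, n in pairs]
    return labels, counts


def plot_top_words(pairs, title="Most common words in Moby-Dick"):
    """Draw a horizontal bar chart of word counts (hypothetical helper)."""
    # Imported here so split_pairs stays usable without matplotlib
    import matplotlib.pyplot as plt

    labels, counts = split_pairs(pairs)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(labels, counts)
    ax.invert_yaxis()  # put the most frequent word at the top
    ax.set_xlabel("Occurrences")
    ax.set_title(title)
    fig.tight_layout()
    return ax


labels, counts = split_pairs(top_ten)
```

In a notebook cell, calling `plot_top_words(top_ten)` should render the chart inline. A horizontal layout is used here so the word labels stay readable without rotation.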
