## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been the companion of many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze the text of Peter Pan to answer the question in the instruction pane below. You will work with the version available at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are tested only on your answer, not on the methods you use to arrive at it!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available in your environment.</p>

In [2]:
# Use this cell to begin your analysis, and add as many as you would like!
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from collections import Counter

# 1. Fetch Peter Pan text from Project Gutenberg
url = "https://www.gutenberg.org/files/16/16-h/16-h.htm"
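response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early on an HTTP error instead of parsing an error page
response.encoding = 'utf-8'
html = response.text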

# 2. Parse HTML and extract text
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()

# 3. Tokenize and clean the text
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text.lower())

# 4. Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

# 5. Count word frequencies
word_counts = Counter(filtered_words)
top_ten = word_counts.most_common(10)

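# 6. Identify character names in top 10
# Tokens are lowercase single words, so "Tinker Bell" can only match as 'tinker'
# and "Tiger Lily" as 'tiger' or 'lily'. Neverland is a place, so it is not listed.
known_characters = ['peter', 'wendy', 'hook', 'tinker', 'john', 'michael',
                    'smee', 'tiger', 'lily']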

protagonists = []
for word, count in top_ten:
    if word in known_characters:
        protagonists.append(word.capitalize())  # Capitalize names properly

print("Top 10 words:", top_ten)
print("Character names in top 10:", protagonists)

Top 10 words: [('peter', 409), ('wendy', 362), ('said', 358), ('would', 217), ('one', 212), ('hook', 174), ('could', 142), ('cried', 136), ('john', 133), ('time', 126)]
Character names in top 10: ['Peter', 'Wendy', 'Hook', 'John']
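
One caveat about the counts above: `soup.get_text()` returns everything on the page, so any Project Gutenberg header and license text is tokenized along with the novel. A minimal sketch of one way to trim it is shown below. It assumes the page wraps the book in `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ... ***` marker lines; the helper name `strip_gutenberg_boilerplate` and the sample string are hypothetical, and the marker wording should be checked against the downloaded file.

```python
import re

# Assumed marker wording; check it against the actual file before relying on it.
_START = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
_END = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers.

    If either marker is missing (or they appear out of order), the input
    is returned unchanged so the rest of the analysis still runs.
    """
    start = _START.search(text)
    end = _END.search(text)
    if not start or not end or end.start() <= start.end():
        return text
    return text[start.end():end.start()].strip()


# Tiny made-up sample standing in for the downloaded page text.
sample = (
    "The Project Gutenberg eBook of Peter Pan\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PETER PAN ***\n"
    "All children, except one, grow up.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PETER PAN ***\n"
    "License text mentioning Project Gutenberg many times.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

To try it in the cell above, one option is `text = strip_gutenberg_boilerplate(soup.get_text())` before tokenizing. The frequencies may then differ slightly from the recorded output, since boilerplate words would no longer be counted.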
