## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been the companion of many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze the text of Peter Pan to answer the question in the instruction pane below. You are working with the text version available at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>

In [11]:
# Use this cell to begin your analysis, and add as many as you would like!


In [12]:
#Import libraries
import requests                  # To get objects from the web
import nltk                      # To manipulate text data
from bs4 import BeautifulSoup    # To manipulate HTML code
from collections import Counter  # To count words

In [13]:
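# Get HTML
r = requests.get("https://www.gutenberg.org/files/16/16-h/16-h.htm", timeout=30)

# Stop early if the download failed (for example, a 404 or 500 response)
r.raise_for_status()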

# Set the response encoding to utf-8
r.encoding = 'utf-8'

# Get HTML code from response
html = r.text

# Printing the first 2000 characters in html
print(html[0:2000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of Peter Pan, by James M. Barrie</title>

<style type="text/css">

body { margin-left: 20%;
       margin-right: 20%;
       text-align: justify; }

h1, h2, h3, h4, h5 {text-align: center; font-style: normal; font-weight:
normal; line-height: 1.5; margin-top: .5em; margin-bottom: .5em;}

h1 {font-size: 300%;
    margin-top: 0.6em;
    margin-bottom: 0.6em;
    letter-spacing: 0.12em;
    word-spacing: 0.2em;
    text-indent: 0em;}
h2 {font-size: 150%; margin-top: 2em; margin-bottom: 1em;}
h3 {font-size: 130%; margin-top: 1em;}
h4 {font-size: 120%;}
h5 {font-size: 110%;}

.no-break {page-break-before: avoid;} /* for epubs */

div.chapter {page-

In [14]:
# Get text from HTML
# Parse the HTML, naming the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")

# Extract text
text = soup.text

# Printing out text between characters 32000 and 34000
print(text[32000:34000])

 be tied up
this instant.”


“George, George,” Mrs. Darling whispered, “remember what I
told you about that boy.”


Alas, he would not listen. He was determined to show who was master in that
house, and when commands would not draw Nana from the kennel, he lured her out
of it with honeyed words, and seizing her roughly, dragged her from the
nursery. He was ashamed of himself, and yet he did it. It was all owing to his
too affectionate nature, which craved for admiration. When he had tied her up
in the back-yard, the wretched father went and sat in the passage, with his
knuckles to his eyes.


In the meantime Mrs. Darling had put the children to bed in unwonted silence
and lit their night-lights. They could hear Nana barking, and John whimpered,
“It is because he is chaining her up in the yard,” but Wendy was
wiser.


“That is not Nana’s unhappy bark,” she said, little guessing
what was about to happen; “that is her bark when she smells
danger.”


Danger!


“Are you sure, Wendy?”


“Oh,
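
The extracted text still contains Project Gutenberg's header and licence footer, which is why tokens such as `project` and `gutenberg` show up at the very start of the token list. The sketch below shows one possible way to trim that material before tokenizing. It assumes the body of the book sits between lines beginning with `*** START OF` and `*** END OF`; the exact marker wording is an assumption and differs between Gutenberg files, and `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the project.

```python
def strip_gutenberg_boilerplate(text, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the part of `text` between the start and end marker lines.

    The marker strings are assumptions. If a marker is missing, the
    corresponding edge of the text is left untouched.
    """
    start = text.find(start_marker)
    if start != -1:
        # Skip past the rest of the marker line
        newline = text.find("\n", start)
        start = newline + 1 if newline != -1 else len(text)
    else:
        start = 0

    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)

    return text[start:end].strip()


# Tiny made-up sample standing in for the full extracted text
sample = (
    "The Project Gutenberg eBook of Peter Pan\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PETER PAN ***\n"
    "All children, except one, grow up.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PETER PAN ***\n"
    "License text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Whether trimming is worthwhile depends on the question being asked: the header and footer are short compared with the novel, so they are unlikely to change the ranking of the most frequent words.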

In [15]:
# Get words
# Create tokenizer (raw string, so that \w is not read as an invalid escape sequence)
tokenizer = nltk.tokenize.RegexpTokenizer(r"\w+")

# Tokenize text
tokens = tokenizer.tokenize(text)

# Print out the first 8 words / tokens
tokens[0:8]

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Peter', 'Pan', 'by']
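
To see what the `\w+` pattern does with punctuation, the sketch below applies it to a sentence taken from the extract printed earlier. It uses the standard library's `re.findall` as a stand-in for `RegexpTokenizer`, on the assumption that the tokenizer with its default settings simply returns the pattern's matches. The second pattern, which keeps apostrophes inside words, is an illustrative alternative and not something the project requires.

```python
import re

sample = "“That is not Nana’s unhappy bark,” she said."

# Same pattern as in the cell above: runs of letters, digits and underscores
tokens = re.findall(r"\w+", sample)
print(tokens)

# Illustrative alternative: allow a straight or curly apostrophe inside a word
tokens_keep = re.findall(r"\w+(?:['’]\w+)*", sample)
print(tokens_keep)
```

With the first pattern, the possessive "Nana’s" is split into `Nana` and `s`, so stray one-letter tokens can appear in the word list. Keeping apostrophes avoids that, but it changes the counts and means curly-apostrophe forms may no longer match entries in a stop word list.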

In [16]:
#Lowercase
# Lowercase tokens
words = [token.lower() for token in tokens]

# Printing out the first 8 words / tokens 
words[:8]

['the', 'project', 'gutenberg', 'ebook', 'of', 'peter', 'pan', 'by']

In [17]:
#Load stopwords
# Download stopwords
nltk.download('stopwords')

# Make a set of stop words (membership tests on a set are faster than on a list)
stop_words = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rtara\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
#Remove stopwords
# Remove stopwords from tokens list
words_clean = [word for word in words if word not in stop_words]

# Print the first 5 entries of words_clean to check that stop words are gone
words_clean[:5]

['project', 'gutenberg', 'ebook', 'peter', 'pan']
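
The filtering step can be tried in isolation without downloading anything. The sketch below uses a small hand-written stop word set, which is a hypothetical subset chosen for illustration and not the `nltk` list, applied to the eight lowercased tokens shown earlier.

```python
# Hypothetical subset of English stop words, for illustration only
sample_stop_words = {"the", "of", "by", "and", "a"}

# The first eight lowercased tokens from the notebook output
sample_words = ["the", "project", "gutenberg", "ebook", "of", "peter", "pan", "by"]

# Keep only the words that are not in the stop word set
sample_clean = [word for word in sample_words if word not in sample_stop_words]
print(sample_clean)
```

Storing the stop words in a set keeps each membership test at roughly constant time, which matters more as the token list grows.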

In [19]:
#Count words
# Get count dictionary
count = Counter(words_clean)

# Get top 10 most common words
top_ten = count.most_common(10)

# Print the top ten 
print(top_ten)

[('peter', 409), ('wendy', 362), ('said', 358), ('would', 217), ('one', 212), ('hook', 174), ('could', 142), ('cried', 136), ('john', 133), ('time', 126)]


In [20]:
#Declare protagonists
protagonists = ["peter", "wendy", "hook", "john"]
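
One way the `protagonists` list might be used is to look up each name in the word counts and compare them. The sketch below is self-contained: it rebuilds a `Counter` from the top-ten figures printed above rather than from the full text, and the names `protagonist_counts` and `most_mentioned` are introduced here for illustration only.

```python
from collections import Counter

# Counts copied from the top-ten output above
count = Counter({
    "peter": 409, "wendy": 362, "said": 358, "would": 217, "one": 212,
    "hook": 174, "could": 142, "cried": 136, "john": 133, "time": 126,
})

protagonists = ["peter", "wendy", "hook", "john"]

# Look up how often each protagonist's name occurs
protagonist_counts = {name: count[name] for name in protagonists}
print(protagonist_counts)

# Pick the name with the highest count
most_mentioned = max(protagonist_counts, key=protagonist_counts.get)
print(most_mentioned)
```

A `Counter` returns 0 for missing keys, so a name that never appears in the text would not raise an error here.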