## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been a companion to many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze Peter Pan's text to answer the question in the instruction pane below. You are working with the text version available here at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>

In [119]:
# Use this cell to begin your analysis, and add as many as you would like!
#import necessary libraries
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
from nltk.corpus import stopwords 

In [120]:
#getting the html from peter pan
r = requests.get('https://www.gutenberg.org/files/16/16-h/16-h.htm')

#set correct text encoding
r.encoding = 'utf-8'

#extract the html
html = r.text

#print the first 2000 characters of the html
print(html[:2000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of Peter Pan, by James M. Barrie</title>

<style type="text/css">

body { margin-left: 20%;
       margin-right: 20%;
       text-align: justify; }

h1, h2, h3, h4, h5 {text-align: center; font-style: normal; font-weight:
normal; line-height: 1.5; margin-top: .5em; margin-bottom: .5em;}

h1 {font-size: 300%;
    margin-top: 0.6em;
    margin-bottom: 0.6em;
    letter-spacing: 0.12em;
    word-spacing: 0.2em;
    text-indent: 0em;}
h2 {font-size: 150%; margin-top: 2em; margin-bottom: 1em;}
h3 {font-size: 130%; margin-top: 1em;}
h4 {font-size: 120%;}
h5 {font-size: 110%;}

.no-break {page-break-before: avoid;} /* for 

In [121]:
#now we can get the text from the html
#create beautifulSoup from html
soup = BeautifulSoup(html, 'html.parser')

# Getting the text out of the soup
text = soup.get_text()

# Printing out some sample text
print(text[3000:5000])

O YOU BELIEVE IN FAIRIES?


 Chapter XIV. THE PIRATE SHIP


 Chapter XV. “HOOK OR ME THIS TIME”


 Chapter XVI. THE RETURN HOME


 Chapter XVII. WHEN WENDY GREW UP



Chapter I.
PETER BREAKS THROUGH

All children, except one, grow up. They soon know that they will grow up, and
the way Wendy knew was this. One day when she was two years old she was playing
in a garden, and she plucked another flower and ran with it to her mother. I
suppose she must have looked rather delightful, for Mrs. Darling put her hand
to her heart and cried, “Oh, why can’t you remain like this for
ever!” This was all that passed between them on the subject, but
henceforth Wendy knew that she must grow up. You always know after you are two.
Two is the beginning of the end.


Of course they lived at 14, and until Wendy came her mother was the chief one.
She was a lovely lady, with a romantic mind and such a sweet mocking mouth. Her
romantic mind was like the tiny boxes, one within the other, that come 

In [122]:
#now we can extract the words from the text
#create tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\b\w+\b')

#tokenize the text
tokens = tokenizer.tokenize(text)

#print out first 10 words
print(tokens[:10])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Peter', 'Pan', 'by', 'James', 'M']
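
To see what the `\b\w+\b` pattern does with punctuation, here is a small sketch that applies the same pattern with the standard library's `re.findall`. It assumes `RegexpTokenizer` returns the pattern's matches in the same way, and the sample sentence is copied from the chapter text printed earlier. Note how the curly apostrophe splits "can’t" into two tokens, just as the period turned "M." into `'M'` in the output above.

```python
import re

# Same pattern as the tokenizer above, applied with the standard library.
pattern = r'\b\w+\b'

# Sample sentence taken from the chapter text printed earlier.
sample = "“Oh, why can’t you remain like this for ever!”"

sample_tokens = re.findall(pattern, sample)
print(sample_tokens)
```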


In [123]:
#to count words consistently, we create a list with all words in lowercase
words = [word.lower() for word in tokens]

#print out first 10 words in words
print(words[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'peter', 'pan', 'by', 'james', 'm']


In [124]:
# assign stopwords so that we can remove uninteresting words

sw = stopwords.words('english')

#print the first 10 stopwords
print(sw[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [125]:
#now we can remove the stopwords from the list of words
#use a set so each membership test is fast
sw_set = set(sw)
words_ns = [word for word in words if word not in sw_set]

#print first 10 words without stopwords
print(words_ns[:10])

['project', 'gutenberg', 'ebook', 'peter', 'pan', 'james', 'barrie', 'body', 'margin', 'left']
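
The filtered list above still contains `body`, `margin`, and `left`. These appear to come from the page's `<style>` block, because the text extraction keeps the contents of every tag. The sketch below shows one way to skip stylesheet and script content. It uses only the standard library's `html.parser` rather than BeautifulSoup, and `VisibleTextParser`, `visible_text`, and the sample HTML are hypothetical names and data made up for illustration. With BeautifulSoup, a similar effect is usually obtained by calling `decompose()` on the `style` and `script` tags before `get_text()`.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text while skipping anything inside <style> or <script>."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML string without CSS or JavaScript."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up miniature page that mimics the structure of the Gutenberg file.
sample_html = (
    "<html><head><title>Peter Pan</title>"
    "<style>body { margin-left: 20%; }</style></head>"
    "<body><p>All children, except one, grow up.</p></body></html>"
)
sample_text = visible_text(sample_html)
print(sample_text)
```

Applying this kind of filtering to the real page would change the word counts, so the numbers printed in this notebook should be read as counts over the unfiltered text.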


In [126]:
#now we can count the most frequent words in Peter Pan
#create counter for list of words
count = Counter(words_ns)

#create list of top 10 words with their counts
top_ten = count.most_common(10)

#print out top 10 words
print(top_ten)

[('peter', 409), ('wendy', 362), ('said', 358), ('would', 217), ('one', 212), ('hook', 174), ('could', 142), ('cried', 136), ('john', 133), ('time', 126)]
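
The steps so far (lowercase, tokenize, drop stopwords, count) can also be wrapped in one small function. This is a minimal sketch, not the notebook's method: `top_words` and `DEMO_STOPWORDS` are hypothetical names, the stop list is deliberately tiny, the sample sentence is made up, and `re.findall` stands in for the `nltk` tokenizer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration only;
# the notebook uses nltk's English stopwords instead.
DEMO_STOPWORDS = {"the", "and", "to", "a", "of", "was", "she", "her", "it", "in", "with"}


def top_words(text, stop_words, n=3):
    """Lowercase, tokenize on word characters, drop stop words, and count."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [token for token in tokens if token not in stop_words]
    return Counter(kept).most_common(n)


# Made-up sentence, not a quotation from the book.
sample = "Wendy ran to Peter, and Peter laughed. Wendy and Peter flew."
result = top_words(sample, DEMO_STOPWORDS, n=2)
print(result)
```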


In [127]:
#now we can check which of the top 10 most frequent words are character names
#create a list to store the character names
protagonists = ['peter', 'wendy', 'hook', 'john']

print(protagonists)


['peter', 'wendy', 'hook', 'john']
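
Instead of typing the names by hand, the list can be derived from the top-ten result. The sketch below assumes a hand-written set of character names (`known_characters` is a hypothetical helper, and you would extend it as needed) and copies the `top_ten` output from the cell above so that it runs on its own.

```python
# Top-ten output copied from the counting cell above.
top_ten = [('peter', 409), ('wendy', 362), ('said', 358), ('would', 217),
           ('one', 212), ('hook', 174), ('could', 142), ('cried', 136),
           ('john', 133), ('time', 126)]

# Assumed, hand-written set of character names; not produced by the analysis.
known_characters = {'peter', 'wendy', 'hook', 'john', 'michael', 'tink', 'smee'}

# Keep the frequency order of top_ten while filtering for character names.
protagonists = [word for word, _ in top_ten if word in known_characters]
print(protagonists)
```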
