## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been a companion to many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze Peter Pan's text to answer the question in the instruction pane below. You are working with the text version available here at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>

In [79]:
# Use this cell to begin your analysis, and add as many as you would like!


Student: William Ng 

# Word Frequency Analysis with Peter Pan

In [80]:
# importing all necessary libraries
from bs4 import BeautifulSoup
import requests
import nltk
from nltk.corpus import stopwords
from collections import Counter

### Fetching the HTML from Project Gutenberg and extracting the text for the Peter Pan analysis

In [81]:
# URL of the Peter Pan HTML edition on Project Gutenberg
link = "https://www.gutenberg.org/files/16/16-h/16-h.htm"

# Sending a GET request to the URL
r = requests.get(link)

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[:2000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of Peter Pan, by James M. Barrie</title>

<style type="text/css">

body { margin-left: 20%;
       margin-right: 20%;
       text-align: justify; }

h1, h2, h3, h4, h5 {text-align: center; font-style: normal; font-weight:
normal; line-height: 1.5; margin-top: .5em; margin-bottom: .5em;}

h1 {font-size: 300%;
    margin-top: 0.6em;
    margin-bottom: 0.6em;
    letter-spacing: 0.12em;
    word-spacing: 0.2em;
    text-indent: 0em;}
h2 {font-size: 150%; margin-top: 2em; margin-bottom: 1em;}
h3 {font-size: 130%; margin-top: 1em;}
h4 {font-size: 120%;}
h5 {font-size: 110%;}

.no-break {page-break-before: avoid;} /* for 

In [82]:
# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 32000 and 34000
print(text[32000:34000])

e put a little
milk into your bowl, Nana.”


Nana wagged her tail, ran to the medicine, and began lapping it. Then she gave
Mr. Darling such a look, not an angry look: she showed him the great red tear
that makes us so sorry for noble dogs, and crept into her kennel.


Mr. Darling was frightfully ashamed of himself, but he would not give in. In a
horrid silence Mrs. Darling smelt the bowl. “O George,” she said,
“it’s your medicine!”


“It was only a joke,” he roared, while she comforted her boys, and
Wendy hugged Nana. “Much good,” he said bitterly, “my wearing
myself to the bone trying to be funny in this house.”


And still Wendy hugged Nana. “That’s right,” he shouted.
“Coddle her! Nobody coddles me. Oh dear no! I am only the breadwinner,
why should I be coddled—why, why, why!”


“George,” Mrs. Darling entreated him, “not so loud; the
servants will hear you.” Somehow they had got into the way of calling
Liza the servants.


“Let them!” he answered recklessly. 

In [83]:
# Creating a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 11 words / tokens
print(tokens[:11])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Peter', 'Pan', 'by', 'James', 'M', 'Barrie']
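
The pattern `\w+` keeps runs of letters, digits, and underscores and discards everything else, so punctuation splits words apart. This includes the curly apostrophe in “it’s”. Below is a minimal sketch of that behaviour using only the standard library's `re` module. It assumes that `re.findall` with this pattern returns the same tokens as `RegexpTokenizer(r'\w+')`. The sample sentence is adapted from the excerpt printed earlier, and the names `sample` and `sample_tokens` are illustrative only.

```python
import re

# Sample sentence adapted from the excerpt above (note the curly apostrophe).
sample = "“O George,” she said, “it’s your medicine!”"

# Assumption: for the pattern \w+, re.findall yields the same tokens
# as nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(sample).
sample_tokens = re.findall(r"\w+", sample)

print(sample_tokens)
# ['O', 'George', 'she', 'said', 'it', 's', 'your', 'medicine']
```

Contractions are split into two tokens (`'it'` and `'s'`), which is worth keeping in mind when reading the word counts later on.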


In [84]:
# Make all words lowercase
words = [token.lower() for token in tokens]

# Load the English stop words from NLTK
stop_words = stopwords.words("english")

# Remove stop words
filtered_words = [word for word in words if word not in stop_words]

In [85]:
# Printing the first 5 filtered words to check that stop words are gone
print(filtered_words[0:5])

['project', 'gutenberg', 'ebook', 'peter', 'pan']
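
The two steps above (lowercasing, then filtering) can be sketched on a handful of tokens. The stop-word set here is a small hypothetical stand-in, not NLTK's English list, and the `demo_*` names are illustrative only. The sketch stores the stop words in a `set`, which is one possible design choice for faster membership checks on a long text.

```python
# Hypothetical mini stop-word list for illustration only;
# the analysis above uses NLTK's full English list instead.
demo_stop_words = {"the", "of", "by", "and", "a", "to"}

demo_tokens = ["The", "Project", "Gutenberg", "eBook", "of", "Peter", "Pan", "by"]

# Lowercase first so that "The" matches the lowercase entry "the".
demo_words = [token.lower() for token in demo_tokens]

# Set membership checks take constant time on average, unlike a list scan.
demo_filtered = [word for word in demo_words if word not in demo_stop_words]

print(demo_filtered)
# ['project', 'gutenberg', 'ebook', 'peter', 'pan']
```

The order of the two steps matters: filtering before lowercasing would let capitalised stop words such as `"The"` slip through.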


In [86]:
# Initialize a Counter object from our processed list of words
count = Counter(filtered_words)

# Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)

[('peter', 409), ('wendy', 362), ('said', 358), ('would', 217), ('one', 212), ('hook', 174), ('could', 142), ('cried', 136), ('john', 133), ('time', 126)]
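
As a small illustration of how `Counter` behaves, here is a sketch on a toy word list (the `demo_*` names and the words are made up for the example). Besides `most_common`, a `Counter` can be indexed by word directly, and a word that never occurred returns 0 rather than raising an error.

```python
from collections import Counter

# Toy word list, for illustration only.
demo_words = ["peter", "wendy", "peter", "hook", "wendy", "peter"]
demo_count = Counter(demo_words)

# The n most frequent words, as (word, count) pairs in descending order.
print(demo_count.most_common(2))  # [('peter', 3), ('wendy', 2)]

# Direct lookup of a single word's count.
print(demo_count["hook"])         # 1

# Words that never occurred have a count of 0.
print(demo_count["smee"])         # 0
```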


In [None]:
protagonists = ["hook", "john", "peter", "wendy"]
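
One possible way to finish from here is sketched below, assuming the question is which of the listed protagonists is mentioned most often. The `top_ten` values are copied from the printed output so the snippet runs on its own. The names `protagonist_counts` and `most_common_protagonist` are hypothetical and not part of the original notebook.

```python
# Counts copied from the printed top_ten output.
top_ten = [('peter', 409), ('wendy', 362), ('said', 358), ('would', 217),
           ('one', 212), ('hook', 174), ('could', 142), ('cried', 136),
           ('john', 133), ('time', 126)]

protagonists = ["hook", "john", "peter", "wendy"]

# Keep only the top-ten entries that are protagonists.
protagonist_counts = {word: n for word, n in top_ten if word in protagonists}

# Pick the protagonist with the highest count.
most_common_protagonist = max(protagonist_counts, key=protagonist_counts.get)

print(protagonist_counts)
# {'peter': 409, 'wendy': 362, 'hook': 174, 'john': 133}
print(most_common_protagonist)
# peter
```

This works only because all four names appear in `top_ten`. For names outside the top ten, looking them up in `count` directly would be the safer route.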