# Project 2. The Beatrix Potter Universe

## *Finding stories in cultural products*

This project is an example of how it is possible to find *stories* in collections of cultural products (literature, lyrics, movie and TV scripts, etc.) if we look for a newsworthy angle.

The easiest way to achieve this is to plan ahead and have the story ready to publish on a significant date: a relevant anniversary, a festival, or the end of a given period (a year, a season, etc.).

In this case we have selected the works of the author Beatrix Potter, which constitute a finite sample, since she is deceased. A preliminary goal for approaching the collection could be determining which animals she preferred as characters in her books, but we could find other elements to highlight along the way.

This could also be one of those rare cases in which using a word cloud could be appropriate.

The data could be presented in the form of an infographic.

![](https://s19.postimg.org/ify3yjplv/beatrix_characters.jpg)

## (1) Collecting the data
Initially our idea was to use this exercise to write a script that scraped and stored only the body of the stories, but the Project Gutenberg website has an anti-scraping policy, and we encountered this problem after a few attempts to get the information:

![](https://s19.postimg.org/eeitaugkj/antiscraping_error_gutenberg.jpg)
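
Before scraping any site, it can help to check what its `robots.txt` file allows. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the user agent name (`my-research-bot`) and the `example.org` URLs are made up for illustration; they are not Project Gutenberg's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used only for illustration.
sample_rules = """
User-agent: *
Disallow: /ebooks/
Allow: /about/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed_about = parser.can_fetch("my-research-bot", "https://example.org/about/")
allowed_ebooks = parser.can_fetch("my-research-bot", "https://example.org/ebooks/292")
print(allowed_about, allowed_ebooks)
```

A check like this only covers `robots.txt`; a site's terms of use may add further restrictions.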

We moved on instead to collecting the information manually in a spreadsheet. We used [this list](https://en.wikipedia.org/wiki/Beatrix_Potter#Publications) from Wikipedia as a guide to collect the 24 stories, and purposely excluded the ones listed as "Other books".

We copied and pasted the list to a spreadsheet in Excel, separated the text into two columns - one for the title, one for the year of publication - and then found each story on the [Project Gutenberg website](http://gutenberg.org/ebooks/author/292).

We were able to find all the stories except *The Tale of Kitty-in-Boots*, which was published in 2016 and is protected by copyright law. *The Tale of Little Pig Robinson* (1930) was also unavailable at gutenberg.org, but we found it on [the Canadian website](http://www.gutenberg.ca/ebooks/potter-pigrobinson/potter-pigrobinson-00-h-dir/potter-pigrobinson-00-h.html) of the project.

We used [this web application](http://www.textfixer.com) to remove line breaks and paragraph breaks, turning each text into a single block that fits in one row of the spreadsheet.
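
The same clean-up can also be done in Python. This is a minimal sketch, not the method we used: it assumes the story is already available as a string, and the helper name `collapse_whitespace` and the sample text are introduced here for illustration.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, line breaks) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# A short sample with line and paragraph breaks.
sample = "Once upon a time\nthere were four little Rabbits,\n\nand their names were--"
single_line = collapse_whitespace(sample)
print(single_line)
```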

## (2) Importing the data as a data frame

In [3]:
import pandas as pd
import numpy as np

# Reading the workbook with ExcelFile alone is not enough: calling head() on it raises an attribute error
# ('ExcelFile' object has no attribute 'head'), so the file needs to be parsed into a data frame first.
# See details in this link (along with more useful info about how to parse workbooks with more than one sheet)
# http://stackoverflow.com/questions/17063458/reading-an-excel-file-in-python-using-pandas
corpus = pd.ExcelFile("Potter tales 24.xlsx").parse()


# Now we can use the head method to get a preview of our data frame 
corpus.head()

Unnamed: 0,Title,Date published,Text
0,The Tale of Peter Rabbit,1902,Once upon a time there were four little rabbit...
1,The Tale of Squirrel Nutkin,1903,This is a Tale about a tail--a tail that belon...
2,The Tailor of Gloucester,1903,In the time of swords and periwigs and full-sk...
3,The Tale of Benjamin Bunny,1904,One morning a little rabbit sat on a bank. He ...
4,The Tale of Two Bad Mice,1904,ONCE upon a time there was a very beautiful do...


## (3) Data preparation

Now that we have imported the data, we can start the transformations necessary to make our calculations. The first step will be to add a new column to the data frame containing the text of the stories in lower case. We have called that column "Normalized Text".

In [4]:
corpus["Normalized Text"] = corpus["Text"].str.lower()
corpus.head()

Unnamed: 0,Title,Date published,Text,Normalized Text
0,The Tale of Peter Rabbit,1902,Once upon a time there were four little rabbit...,once upon a time there were four little rabbit...
1,The Tale of Squirrel Nutkin,1903,This is a Tale about a tail--a tail that belon...,this is a tale about a tail--a tail that belon...
2,The Tailor of Gloucester,1903,In the time of swords and periwigs and full-sk...,in the time of swords and periwigs and full-sk...
3,The Tale of Benjamin Bunny,1904,One morning a little rabbit sat on a bank. He ...,one morning a little rabbit sat on a bank. he ...
4,The Tale of Two Bad Mice,1904,ONCE upon a time there was a very beautiful do...,once upon a time there was a very beautiful do...


Now we can proceed to break down that text into words using the NLTK library. Since we need to apply the transformation to each row in that column of the data frame, the easiest way to do that is to write a function like the one shown below:

In [5]:
from nltk.tokenize import word_tokenize

def splitintowords(text):
    # Split a string into a list of word and punctuation tokens
    tokens = word_tokenize(text)
    return tokens

corpus["Normalized Text"].apply(splitintowords)

0     [once, upon, a, time, there, were, four, littl...
1     [this, is, a, tale, about, a, tail, --, a, tai...
2     [in, the, time, of, swords, and, periwigs, and...
3     [one, morning, a, little, rabbit, sat, on, a, ...
4     [once, upon, a, time, there, was, a, very, bea...
5     [once, upon, a, time, there, was, a, little, g...
6     [once, upon, a, time, there, was, a, pussy-cat...
7     [once, upon, a, time, there, was, a, frog, cal...
8     [this, is, a, fierce, bad, rabbit, ;, look, at...
9     [this, is, a, pussy, called, miss, moppet, ,, ...
10    [once, upon, a, time, there, were, three, litt...
11    [what, a, funny, sight, it, is, to, see, a, br...
12    [once, upon, a, time, there, was, an, old, cat...
13    [it, is, said, that, the, effect, of, eating, ...
14    [once, upon, a, time, there, was, a, village, ...
15    [once, upon, a, time, there, was, a, wood-mous...
16    [once, upon, a, time, there, was, a, little, f...
17    [i, have, made, many, books, about, well-b
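
If NLTK or its tokenizer data is not available, a regular expression can serve as a rough stand-in. The sketch below is an alternative to `word_tokenize`, not an equivalent: it keeps hyphenated words together but drops punctuation tokens such as `--` and `;`. The pattern and the function name `split_into_words_regex` are assumptions made for this example.

```python
import re

# Assumed pattern: runs of letters, optionally joined by hyphens or apostrophes.
WORD_PATTERN = re.compile(r"[a-z]+(?:[-'][a-z]+)*")

def split_into_words_regex(text):
    """Return lowercase word tokens, dropping punctuation and digits."""
    return WORD_PATTERN.findall(text.lower())

tokens = split_into_words_regex(
    "Once upon a time there was a wood-mouse; her name was Mrs. Tittlemouse."
)
print(tokens)
```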

Now we can add that information as a new column in our data frame, which we have called "Tokens".

In [6]:
corpus["Tokens"] = corpus["Normalized Text"].apply(splitintowords)
corpus.head()

Unnamed: 0,Title,Date published,Text,Normalized Text,Tokens
0,The Tale of Peter Rabbit,1902,Once upon a time there were four little rabbit...,once upon a time there were four little rabbit...,"[once, upon, a, time, there, were, four, littl..."
1,The Tale of Squirrel Nutkin,1903,This is a Tale about a tail--a tail that belon...,this is a tale about a tail--a tail that belon...,"[this, is, a, tale, about, a, tail, --, a, tai..."
2,The Tailor of Gloucester,1903,In the time of swords and periwigs and full-sk...,in the time of swords and periwigs and full-sk...,"[in, the, time, of, swords, and, periwigs, and..."
3,The Tale of Benjamin Bunny,1904,One morning a little rabbit sat on a bank. He ...,one morning a little rabbit sat on a bank. he ...,"[one, morning, a, little, rabbit, sat, on, a, ..."
4,The Tale of Two Bad Mice,1904,ONCE upon a time there was a very beautiful do...,once upon a time there was a very beautiful do...,"[once, upon, a, time, there, was, a, very, bea..."


## (4) Finding animals in the stories

### Creating an exhaustive list of animal names

One possible approach to finding the animals named in the stories is creating a list of names that provides the keywords for the search.

As was to be expected, there are different websites providing such lists. For our purposes we needed a list of *common names*, which we found at [A-Z Animals](http://a-z-animals.com/animals).

The following lines of code scrape and save the names of the animals to a list:

In [7]:
import urllib.request
from bs4 import BeautifulSoup

# Download the page and parse its HTML
response = urllib.request.urlopen("http://a-z-animals.com/animals")
soup = BeautifulSoup(response.read(), "html.parser")

# Collect the text of every <li> element on the page
animalNames = []
for li in soup.find_all('li'):
    animalNames.append(li.text)

len(animalNames)

597

Here's a preview of the content of the list:

In [8]:
animalNames[:5]

['Abyssinian',
 'Adelie Penguin',
 'Affenpinscher',
 'Afghan Hound',
 'African Bush Elephant']

### Cleaning the list

When we reviewed the list, we found that it contained elements at the end that were not animal names:

In [9]:
animalNames[585:]

['Yorkshire Terrier',
 'Zebra',
 'Zebra Shark',
 'Zebu',
 'Zonkey',
 'Zorse',
 'Contact Us',
 'About Us',
 'Contribute',
 'Donate',
 'Privacy Policy',
 '\xa0']

As you can see, we need to remove all the elements after *Zorse*. Given that there are only a few elements to erase, we could have counted them visually, but we are going to deal with the issue as if the list of unwanted items were longer.

We first determine the index number of the element where we want the list to stop (*Zorse*):

In [10]:
animalNames.index("Zorse")
# This is the reference we used for this calculation
# http://stackoverflow.com/questions/176918/finding-the-index-of-an-item-given-a-list-containing-it-in-python

590

Now we can use that index to delete the unwanted items:

In [11]:
newListofAnimals = animalNames[:591]

# In this case we have decided to create a new list (newListofAnimals) and include in it the items of interest from the
# previous list (indices 0 to 590; the end of a slice is exclusive, so we write 591)

# The following line prints the beginning and the end of the list:

print(newListofAnimals[1:10], "------->", newListofAnimals[580:]) 

['Adelie Penguin', 'Affenpinscher', 'Afghan Hound', 'African Bush Elephant', 'African Civet', 'African Clawed Frog', 'African Forest Elephant', 'African Palm Civet', 'African Penguin'] -------> ['Woolly Monkey', 'Wrasse', 'X-Ray Tetra', 'Yak', 'Yellow-Eyed Penguin', 'Yorkshire Terrier', 'Zebra', 'Zebra Shark', 'Zebu', 'Zonkey', 'Zorse']
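
Rather than typing the cut-off by hand, the slice can be computed from the index of the last wanted item. This is a small sketch on a shortened stand-in list; the variable names `scraped_items`, `cutoff` and `animals_only` are introduced here for illustration.

```python
# A short stand-in for the scraped list (the real one has 597 items).
scraped_items = ["Zebu", "Zonkey", "Zorse", "Contact Us", "About Us", "\xa0"]

last_animal = "Zorse"
# index() returns the position of the first match; +1 makes the slice inclusive.
cutoff = scraped_items.index(last_animal) + 1
animals_only = scraped_items[:cutoff]
print(animals_only)
```

Computing the cut-off this way keeps the code working if the site adds or removes animals before *Zorse*.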


Now that we have all the elements that we want, we can transform all the text to lower case: 

In [12]:
# We use a list comprehension to create a new list in lower case:
listAnimalsLC = [item.lower() for item in newListofAnimals]

# Let's get a preview:
listAnimalsLC[1:10]

['adelie penguin',
 'affenpinscher',
 'afghan hound',
 'african bush elephant',
 'african civet',
 'african clawed frog',
 'african forest elephant',
 'african palm civet',
 'african penguin']

By intersecting the elements in this list with the elements in the list of tokens, we can reduce the number of animals and keep only those that are mentioned in the stories.

The solution we used in this case was to write a function and then apply it to the *Tokens* column. The results were stored in a new list that we named *animalsUsedbyPotter*:

In [13]:
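animalsUsedbyPotter = []
def findanimal(shortstory):
    # Append each animal name found among the tokens of a story to the shared list.
    # The function returns nothing, which is why apply() shows None for every row.
    # Multi-word names (e.g. 'afghan hound') cannot match, because each token is a single word.
    for a in listAnimalsLC:
        if a in shortstory:
            animalsUsedbyPotter.append(a)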

In [14]:
corpus["Tokens"].apply(findanimal)

0     None
1     None
2     None
3     None
4     None
5     None
6     None
7     None
8     None
9     None
10    None
11    None
12    None
13    None
14    None
15    None
16    None
17    None
18    None
19    None
20    None
21    None
Name: Tokens, dtype: object

We can now print the new list to get an idea of its contents:

In [15]:
print(animalsUsedbyPotter)

['cat', 'mouse', 'rabbit', 'beetle', 'crab', 'mole', 'robin', 'squirrel', 'cat', 'cow', 'fly', 'mouse', 'snail', 'cat', 'rabbit', 'fish', 'mouse', 'cat', 'hedgehog', 'rabbit', 'robin', 'squirrel', 'dog', 'magpie', 'mouse', 'butterfly', 'fish', 'frog', 'grasshopper', 'pike', 'rat', 'tortoise', 'bird', 'rabbit', 'mouse', 'goose', 'collie', 'duck', 'cat', 'chicken', 'dog', 'mouse', 'rat', 'fly', 'mouse', 'rabbit', 'bear', 'collie', 'dog', 'dormouse', 'beetle', 'butterfly', 'fly', 'ladybird', 'mouse', 'bear', 'bird', 'chipmunk', 'squirrel', 'woodpecker', 'badger', 'bear', 'bird', 'cat', 'dog', 'ferret', 'fish', 'fox', 'mole', 'monkey', 'persian', 'pheasant', 'rabbit', 'stoat', 'wasp', 'cow', 'horse', 'pig', 'cat', 'cow', 'horse', 'mouse', 'robin', 'cat', 'cow', 'dog', 'donkey', 'duck', 'fish', 'horse', 'pig', 'sheep', 'beetle', 'bird', 'cat', 'cockroach', 'crane', 'dog', 'donkey', 'fish', 'fly', 'hedgehog', 'pig', 'sheep']
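
The intersection can also be written with Python sets, which keeps the matches separate for each story instead of appending everything to one shared list. This is a sketch on toy data, not the code we ran: `animal_names`, `stories` and `animals_in` are hypothetical names, and the two "stories" are invented token lists.

```python
# Toy stand-ins for listAnimalsLC and two tokenised stories.
animal_names = {"cat", "mouse", "rabbit", "african penguin"}
stories = {
    "Story A": ["once", "upon", "a", "time", "a", "rabbit", "met", "a", "cat"],
    "Story B": ["the", "mouse", "ran", "from", "the", "cat"],
}

def animals_in(tokens, names):
    """Return the sorted animal names that occur among a story's tokens."""
    return sorted(names & set(tokens))

animals_by_story = {
    title: animals_in(tokens, animal_names) for title, tokens in stories.items()
}
print(animals_by_story)
```

As with the function above, a multi-word name such as "african penguin" never matches, because the comparison is made against single-word tokens.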


As the following count shows, this first list contains 104 elements, some of which appear more than once:

In [16]:
len(animalsUsedbyPotter)

104
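
Because each name is appended at most once per story, the number of times a name appears in this list equals the number of stories that mention it. A quick way to tally that is `collections.Counter`. The sketch below uses only the first twelve entries of the list printed above as a sample; `sample_matches` and `story_counts` are names introduced for this example.

```python
from collections import Counter

# First few entries of the list printed above, used as a small sample.
sample_matches = ["cat", "mouse", "rabbit", "beetle", "crab", "mole",
                  "robin", "squirrel", "cat", "cow", "fly", "mouse"]

story_counts = Counter(sample_matches)
# most_common(n) returns the n most frequent items with their counts.
print(story_counts.most_common(2))
```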

We can use the `set` function to keep only the unique results, which reduces this first list to fewer than half the original entries (45). Note that a set has no guaranteed order; the alphabetical order seen below appears to come from the way the notebook displays the set, not from the set itself:

In [17]:
set(animalsUsedbyPotter)

{'badger',
 'bear',
 'beetle',
 'bird',
 'butterfly',
 'cat',
 'chicken',
 'chipmunk',
 'cockroach',
 'collie',
 'cow',
 'crab',
 'crane',
 'dog',
 'donkey',
 'dormouse',
 'duck',
 'ferret',
 'fish',
 'fly',
 'fox',
 'frog',
 'goose',
 'grasshopper',
 'hedgehog',
 'horse',
 'ladybird',
 'magpie',
 'mole',
 'monkey',
 'mouse',
 'persian',
 'pheasant',
 'pig',
 'pike',
 'rabbit',
 'rat',
 'robin',
 'sheep',
 'snail',
 'squirrel',
 'stoat',
 'tortoise',
 'wasp',
 'woodpecker'}

In [18]:
len(set(animalsUsedbyPotter))

45

`set` returns the unique elements, but it does not modify the original list. To get a new, alphabetically sorted list containing only those unique elements, we pass the set to `sorted`. We have called the result `uniqueAnimals`:

In [19]:
uniqueAnimals = sorted(set(animalsUsedbyPotter))  # sorted() already returns a list
print(uniqueAnimals)

['badger', 'bear', 'beetle', 'bird', 'butterfly', 'cat', 'chicken', 'chipmunk', 'cockroach', 'collie', 'cow', 'crab', 'crane', 'dog', 'donkey', 'dormouse', 'duck', 'ferret', 'fish', 'fly', 'fox', 'frog', 'goose', 'grasshopper', 'hedgehog', 'horse', 'ladybird', 'magpie', 'mole', 'monkey', 'mouse', 'persian', 'pheasant', 'pig', 'pike', 'rabbit', 'rat', 'robin', 'sheep', 'snail', 'squirrel', 'stoat', 'tortoise', 'wasp', 'woodpecker']


## Which animals were used in which stories?

### Creating a new data frame from scratch

Now that we have determined the names of the animals mentioned in the entire collection, we can go into more detail about which ones are featured in which stories.

To do that, we are going to create a different data frame: one that makes it easy for us to visualise and count the occurrences of those terms in each of the 22 stories.

Our first step will be creating the frame for the data, with the titles of the short stories as columns and the names of the animals as rows. The orientation did not matter in this case; we chose this layout so that the table is longer than it is wide (45 rows x 22 columns).

In [20]:
import pandas as pd
import numpy as np

# We have called the new data frame animalsInStories and have used the list uniqueAnimals (created before) as index
animalsInStories = pd.DataFrame(index=uniqueAnimals, columns = corpus["Title"])
animalsInStories.head()


Title,The Tale of Peter Rabbit,The Tale of Squirrel Nutkin,The Tailor of Gloucester,The Tale of Benjamin Bunny,The Tale of Two Bad Mice,The Tale of Mrs. Tiggy-Winkle,The Tale of the Pie and the Patty-Pan,The Tale of Mr. Jeremy Fisher,The Story of A Fierce Bad Rabbit,The Story of Miss Moppet,...,"The Tale of Samuel Whiskers or, The Roly-Poly Pudding",The Tale of the Flopsy Bunnies,The Tale of Ginger and Pickles,The Tale of Mrs. Tittlemouse,The Tale of Timmy Tiptoes,The Tale of Mr. Tod,The Tale of Pigling Bland,The Tale of Johnny Town-Mouse,The Tale of Little Pig Robinson,nan
badger,,,,,,,,,,,...,,,,,,,,,,
bear,,,,,,,,,,,...,,,,,,,,,,
beetle,,,,,,,,,,,...,,,,,,,,,,
bird,,,,,,,,,,,...,,,,,,,,,,
butterfly,,,,,,,,,,,...,,,,,,,,,,
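
The last column header in the preview is `nan`, which suggests that at least one row of the spreadsheet has no title. The sketch below shows one way to check for that on a toy frame. It is an assumption about how the issue could be handled, not a step we ran, and `toy_corpus` and its contents are invented; in the real data it would be worth inspecting the row before deciding whether to drop it or fill in its title.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `corpus`; the last row has no title.
toy_corpus = pd.DataFrame({
    "Title": ["The Tale of Peter Rabbit", "The Tale of Squirrel Nutkin", np.nan],
    "Text": ["once upon a time", "this is a tale about a tail", "untitled text"],
})

# Count the rows without a title, then keep only the titled ones.
missing = int(toy_corpus["Title"].isna().sum())
titled = toy_corpus.dropna(subset=["Title"])
print(missing, len(titled))
```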


There are two possible ways to approach the content in this case: 

(1) we could look only at occurrences (whether or not the story mentions an animal species), or 

(2) count the number of times the word appears in each story. Let's start with number 1.

We created a function that we then applied to the entire data frame.

In [27]:
# First we build an array named `animalCount`: for each animal name in `uniqueAnimals`,
# we collect the list of matches found in the text of every story.
# Note that str.findall treats the name as a case-sensitive regular expression and also
# matches inside longer words (for example, "cat" in "catch").
animalCount = np.array([corpus["Text"].str.findall(i) for i in uniqueAnimals])

# Then we create a new data frame with the animals as rows and the titles as columns
animalOccurancesBoolean = pd.DataFrame(animalCount, index=uniqueAnimals, columns=corpus["Title"])

# This function returns True when the list of matches is not empty (the animal is mentioned)
# and False when there are no matches
def occurance(i):
    return len(i) > 0

# applymap returns a new data frame, so the result has to be assigned
animalOccurancesBoolean = animalOccurancesBoolean.applymap(occurance)

# count() returns the number of non-empty cells in the row (one per story);
# sum() would instead count the stories in which the animal is mentioned
animalOccurancesBoolean.loc["magpie"].count()

22
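
Because the search above is case-sensitive and matches inside longer words, some mentions can be missed ("Rabbit") and others over-counted ("cat" inside "catch"). The following sketch shows one possible alternative using `str.contains` with word boundaries and `case=False`. It runs on two invented sentences, and the names `toy_texts`, `toy_animals` and `presence` are assumptions for this example.

```python
import re
import pandas as pd

# Toy stand-ins for the corpus texts and the animal list.
toy_texts = pd.Series(
    ["This is a fierce bad Rabbit.", "The cat tried to catch a mouse."],
    index=["Story A", "Story B"],
)
toy_animals = ["cat", "mouse", "rabbit"]

# \b marks a word boundary, so "cat" does not match inside "catch".
presence = pd.DataFrame({
    name: toy_texts.str.contains(rf"\b{re.escape(name)}\b", case=False, regex=True)
    for name in toy_animals
}).T
print(presence)
```

The result has one row per animal and one column per story, with `True` where the animal is mentioned as a whole word.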

Another way to get a cleaner visualisation would be to leave blank spaces when there are no matches and an "X" where a given animal is mentioned. We only have to make a small adjustment to our function to accomplish that:

In [23]:
animalCount = np.array([corpus["Text"].str.findall(i) for i in uniqueAnimals])
animalOccurancesX = pd.DataFrame(animalCount, index=uniqueAnimals, columns=corpus["Title"])

# Return an empty string when there are no matches and an "X" when the animal is mentioned
def verdadero(i):
    if i==[]:
        return ""
    else:
        return "X"
            
animalOccurancesX.applymap(verdadero)

Title,The Tale of Peter Rabbit,The Tale of Squirrel Nutkin,The Tailor of Gloucester,The Tale of Benjamin Bunny,The Tale of Two Bad Mice,The Tale of Mrs. Tiggy-Winkle,The Tale of the Pie and the Patty-Pan,The Tale of Mr. Jeremy Fisher,The Story of A Fierce Bad Rabbit,The Story of Miss Moppet,...,"The Tale of Samuel Whiskers or, The Roly-Poly Pudding",The Tale of the Flopsy Bunnies,The Tale of Ginger and Pickles,The Tale of Mrs. Tittlemouse,The Tale of Timmy Tiptoes,The Tale of Mr. Tod,The Tale of Pigling Bland,The Tale of Johnny Town-Mouse,The Tale of Little Pig Robinson,nan
badger,,,,,,,,,,,...,,,,,,X,,,,
bear,,,,,,,,,,,...,,,X,,X,X,,,,
beetle,,X,X,,,,,X,,,...,,,,X,,,,,,X
bird,X,,,,X,X,,X,X,,...,,,,X,X,X,,X,,X
butterfly,,,,,,,,X,,,...,,,,,,,,,,
cat,X,,X,X,,X,X,X,,,...,X,,X,,X,X,X,X,X,X
chicken,,,,,,,,,,,...,X,,,,,X,,,,
chipmunk,,,,,,,,,,,...,,,,,,,,,,
cockroach,,,,,,,,,,,...,,,,,,,,,,X
collie,,,,,,,,,,,...,,,,,,,,,X,X


This is a very good way to visualise the results. In fact, this table could be the basis for a visualisation that covers either all the characters or just the most popular ones, if fewer data points are preferred.
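
To narrow a presence table down to the most popular animals, the rows can be ranked by the number of stories that mention them. This is a sketch with invented values; `toy_presence`, `stories_per_animal` and `top_two` are hypothetical names.

```python
import pandas as pd

# Toy presence table: rows are animals, columns are stories (hypothetical values).
toy_presence = pd.DataFrame(
    [[True, True, True], [True, False, False], [False, True, True]],
    index=["cat", "badger", "mouse"],
    columns=["Story A", "Story B", "Story C"],
)

# Summing booleans across columns counts the stories that mention each animal.
stories_per_animal = toy_presence.sum(axis=1)
top_two = stories_per_animal.nlargest(2)
print(top_two)
```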

But we could also create a data frame based on the word count for each animal name. Again, we can do that by making a few changes to our function:

In [28]:
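animalCount = np.array([corpus["Text"].str.findall(i) for i in uniqueAnimals])
animalOccurances = pd.DataFrame(animalCount, index=uniqueAnimals, columns=corpus["Title"])

# This version of the function returns the number of matches (0 when the list is empty)
def verdadero(i):
    return len(i)

# applymap returns a new data frame, so we store the counts under a new name
animalCounts = animalOccurances.applymap(verdadero)

# The original frame still holds the raw matches, which we can inspect for a single animal:
animalOccurances.loc["rabbit"]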

Title
The Tale of Peter Rabbit                                                            [rabbit, rabbit, rabbit]
The Tale of Squirrel Nutkin                                                                               []
The Tailor of Gloucester                                                                                  []
The Tale of Benjamin Bunny                                 [rabbit, rabbit, rabbit, rabbit, rabbit, rabbi...
The Tale of Two Bad Mice                                                                                  []
The Tale of Mrs. Tiggy-Winkle                                                                             []
The Tale of the Pie and the Patty-Pan                                                               [rabbit]
The Tale of Mr. Jeremy Fisher                                                                             []
The Story of A Fierce Bad Rabbit                                                                          []
The Story of 

We have used "rabbit" because, as a common character in many of the stories, it was a good option to get an idea of the richness of the results, but less frequently used animals will, of course, generate fewer matches.
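
The counts can also be obtained directly with `str.count`, skipping the intermediate lists of matches. The sketch below is one possible variant on invented sentences; it ignores case and uses word boundaries, so its numbers would not necessarily agree with the case-sensitive search above. `toy_texts`, `toy_animals` and `counts` are names assumed for this example.

```python
import re
import pandas as pd

# Toy stand-ins for the corpus texts and the animal list.
toy_texts = pd.Series(
    ["A rabbit met a Rabbit near the rabbit-hole.", "No animals here."],
    index=["Story A", "Story B"],
)
toy_animals = ["cat", "rabbit"]

# A hyphen is a word boundary, so "rabbit-hole" still counts as a mention of "rabbit".
counts = pd.DataFrame({
    name: toy_texts.str.count(rf"\b{re.escape(name)}\b", flags=re.IGNORECASE)
    for name in toy_animals
}).T
print(counts)
```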

## Conclusions

One important issue brought to our attention by this project is the ethical and legal side of data collection, specifically scraping restrictions. It motivated us to do more research on the topic and to learn about the limits, the options, and possible alternatives.

We also learned that term frequencies can have special significance in relation to the nature of the collection. While a term frequently used by a politician could point to an issue that is a priority for their party, in fiction it can point to differences between characters, namely, which are the main and secondary characters based on mentions.



### References

- NLTK 3.0 Documentation (2015), *Corpus readers*, available at http://www.nltk.org/howto/corpus.html#corpus-reader-classes

- Python Software Foundation (2012), *Python v3.1.5 documentation. Chapter 5. Data Structures*, available at https://docs.python.org/3.1/tutorial/datastructures.html

- Wes McKinney & PyData Development Team (2016), *pandas: powerful Python data analysis toolkit. Release 0.18.1*, available at http://pandas.pydata.org/pandas-docs/stable/pandas.pdf

- Stack Overflow (2008), *Finding the index of an item given a list containing it in Python*, available at http://stackoverflow.com/questions/176918/finding-the-index-of-an-item-given-a-list-containing-it-in-python

- Stack Overflow (2013), *Reading an Excel file in python using pandas*, available at http://stackoverflow.com/questions/17063458/reading-an-excel-file-in-python-using-pandas

- Project Gutenberg (2004), *Books by Potter, Beatrix*, available at http://gutenberg.org/ebooks/author/292