Book 1: Twenty Thousand Leagues Under the Sea
* Author: Jules Verne
* URL: https://www.gutenberg.org/ebooks/164
* Category: Sci-Fi & Fantasy
* Ownership: Public Domain (USA)

Book 2: A Journey to the Centre of the Earth
* Author: Jules Verne
* URL: https://www.gutenberg.org/ebooks/18857
* Category: Sci-Fi & Fantasy
* Ownership: Public Domain (USA)

(I'm a fan of classic sci-fi)

Question to answer: I read somewhere that it's possible for an ML algorithm to classify literary works by author based on the number of letters they use per word. With that said, what is the average length of a word penned by Jules Verne?

In [8]:
def load_book(path):
    # read the file and strip surrounding whitespace from every line;
    # the with-block closes the file handle once reading is done
    with open(path, 'r') as f:
        return [line.strip() for line in f]

# book 1: Twenty Thousand Leagues Under the Sea
leagues = load_book('data/leagues.txt') # path is relative to this notebook

# book 2: A Journey to the Centre of the Earth
journey = load_book('data/journey.txt')

Before we tackle our main question, let's do some exploratory analysis on our books.

In [9]:
# what is the number of lines?
num_lines_leagues = len(leagues) # 12539
num_lines_journey = len(journey) # 11320

print(f" leagues has {num_lines_leagues} lines, journey has {num_lines_journey}.")

 leagues has 12539 lines, journey has 11320.


We can start by filtering out the empty lines. We don't want a word length of 0 polluting our final calculation.

In [10]:
def filter_emptylines(book):
    return [line for line in book if line != '']

leagues = filter_emptylines(leagues)
journey = filter_emptylines(journey)

Just to verify that our filter worked, let's print out the first 20 lines of each book. Before, there were definitely some empty lines in there.

In [11]:
print(leagues[:20])
print("\n")
print(journey[:20])

["\ufeffProject Gutenberg's Twenty Thousand Leagues under the Sea", '(slightly abridged), by Jules Verne', 'This eBook is for the use of anyone anywhere at no cost and with', 'almost no restrictions whatsoever.  You may copy it, give it away or', 're-use it under the terms of the Project Gutenberg License included', 'with this eBook or online at www.gutenberg.org', 'Title: Twenty Thousand Leagues under the Sea (slightly abridged)', 'Author: Jules Verne', 'Release Date: Sep 1, 1994 [EBook #164]', 'Last Updated: December 13, 2016', 'Language: English', '*** START OF THIS PROJECT GUTENBERG EBOOK 20000 LEAGUES UNDER THE SEA ***', 'This etext was done by a number of anonymous volunteers of the', 'Gutenberg Project, to whom we owe a great deal of thanks and to', 'whom we dedicate this book.', 'TWENTY THOUSAND LEAGUES UNDER THE SEA', 'by', 'JULES VERNE', 'PART ONE', 'CHAPTER I']


["\ufeffProject Gutenberg's A Journey to the Centre of the Earth, by Jules Verne", 'This eBook is for the use of 

Looks good! Now we should create a function that counts the length of each word in each line. However, we have to make sure to exclude spaces when counting word lengths, so it's not enough to simply take the length of each line.

In [12]:
# Naive implementation:
def avg_word_length(book):
    
    num_words = 0
    total_word_length = 0;
    
    # count a word every time we hit a space character
    for line in book:
        
        line_as_str = ''
        for ch in line:
            if ch == ' ':
                num_words += 1
            else:
                line_as_str += ch
                
        total_word_length += len(line_as_str)
        
    return round(total_word_length / num_words, 2)

print(f" Leagues: {avg_word_length(leagues)}, Journey: {avg_word_length(journey)}")

 Leagues: 4.98, Journey: 5.14


Both the implementation and the results seem reasonable, but there are probably some things we overlooked. Let's go back into the function definition and print some of the line_as_str strings we created:

In [13]:
# Let's see if there's any bugs...
def modified_avg_word_length(book):
    num_words = 0
    total_word_length = 0;
    # count a word every time we hit a space character
    i = 0;
    for line in book:
        line_as_str = ''
        for ch in line:
            if ch == ' ':
                num_words += 1
            else:
                line_as_str += ch
        
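        # let's only print the first 50 lines
        # (note: the printed lines skip the else branch, so they are left out
        # of the total and the returned average comes out slightly lower)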
        if i < 50:
            print(line_as_str)
            i += 1
        else:
            total_word_length += len(line_as_str)
            
    return round(total_word_length / num_words, 2)

In [14]:
modified_avg_word_length(leagues)

﻿ProjectGutenberg'sTwentyThousandLeaguesundertheSea
(slightlyabridged),byJulesVerne
ThiseBookisfortheuseofanyoneanywhereatnocostandwith
almostnorestrictionswhatsoever.Youmaycopyit,giveitawayor
re-useitunderthetermsoftheProjectGutenbergLicenseincluded
withthiseBookoronlineatwww.gutenberg.org
Title:TwentyThousandLeaguesundertheSea(slightlyabridged)
Author:JulesVerne
ReleaseDate:Sep1,1994[EBook#164]
LastUpdated:December13,2016
Language:English
***STARTOFTHISPROJECTGUTENBERGEBOOK20000LEAGUESUNDERTHESEA***
Thisetextwasdonebyanumberofanonymousvolunteersofthe
GutenbergProject,towhomweoweagreatdealofthanksandto
whomwededicatethisbook.
TWENTYTHOUSANDLEAGUESUNDERTHESEA
by
JULESVERNE
PARTONE
CHAPTERI
ASHIFTINGREEF
Theyear1866wassignalisedbyaremarkableincident,amysteriousand
puzzlingphenomenon,whichdoubtlessnoonehasyetforgotten.Notto
mentionrumourswhichagitatedthemaritimepopulationandexcitedthe
publicmind,evenintheinteriorofcontinents,seafaringmenwere
particularlyexcited.Merchants,commonsailors,ca

4.95

Immediately, there are two noticeable bugs:
* the Project Gutenberg header, author, release date, etc. are being counted
* commas are being counted

Both of these hurt the accuracy of our avg_word_length function. To get rid of the header, we can make another filter function:

In [15]:
# looking at the previous cell, the header ends with a line that looks something like:
# *** START OF THIS PROJECT GUTENBERG EBOOK {bookname} ***
# so, we can trigger the filter on the first line that contains '***'

def filter_header(book):
    line = 0
    while '***' not in book[line]:
        line += 1
    return book[line+1:]
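
The loop in filter_header raises an IndexError if no line contains '***', and it leaves the Project Gutenberg license text at the end of the file in place. Below is a minimal sketch of a more defensive variant. The name strip_gutenberg_boilerplate is hypothetical, and the marker strings ('START OF', 'END OF') are assumptions based on the header printed above, so check them against your own files.

```python
def strip_gutenberg_boilerplate(book):
    """Return the lines between the START and END markers (hypothetical helper)."""
    start, end = 0, len(book)
    for i, line in enumerate(book):
        upper = line.upper()
        # assumed marker format: '*** START OF THIS PROJECT GUTENBERG EBOOK ... ***'
        if upper.startswith('***') and 'START OF' in upper:
            start = i + 1
        elif upper.startswith('***') and 'END OF' in upper:
            end = i
            break
    # if neither marker is found, the book is returned unchanged
    return book[start:end]

sample = [
    'Title: Example',
    '*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***',
    'CHAPTER I',
    'Some text.',
    '*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***',
    'License text',
]
print(strip_gutenberg_boilerplate(sample))
```

Returning the input unchanged when no marker is found is a design choice for this sketch; raising an error may be preferable if a missing header should be treated as a problem.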

Great! Now, to handle the comma situation, we can modify the body of our avg_word_length function:

In [16]:
# store unmodified books for reference:
old_journey = journey
old_leagues = leagues

# Correct implementation:

journey = filter_header(old_journey)
leagues = filter_header(old_leagues)

def correct_avg_word_length(book):
    
    num_words = 0
    total_word_length = 0;
    
    for line in book:
        line_as_str = ''
        for ch in line:
            # count a word every time we hit a space character
            if ch == ' ':
                num_words += 1
            # pass over commas
            elif ch == ',':
                pass
            else:
                line_as_str += ch
        total_word_length += len(line_as_str)
        
    return round(total_word_length / num_words, 2)

print(f"Before bug fixes\n Leagues: {avg_word_length(leagues)}, Journey: {avg_word_length(journey)}\n")
print(f"After bug fixes\n Leagues: {correct_avg_word_length(leagues)}, Journey: {correct_avg_word_length(journey)}\n")

total_avg = round((correct_avg_word_length(leagues) + correct_avg_word_length(journey)) / 2, 2)
print(f"Jules Verne's average word length is {total_avg} ")

Before bug fixes
 Leagues: 4.97, Journey: 5.14

After bug fixes
 Leagues: 4.88, Journey: 5.06

Jules Verne's average word length is 4.97 


The difference may seem negligible, but it might make a big difference down the line in other applications that use our calculated data.
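
Two more effects are worth checking. Counting spaces finds one word fewer than each line actually contains (a line of n words has n - 1 spaces), and punctuation other than commas is still counted as letters. The sketch below is one possible alternative, not the method used above: it splits each line on whitespace and strips leading and trailing punctuation using string.punctuation. The names word_stats and pooled_avg_word_length are hypothetical. Returning totals rather than a ratio also allows a pooled average, which weights each book by its word count instead of averaging two averages.

```python
import string

def word_stats(book):
    """Return (total_letters, num_words) for a list of lines (hypothetical helper)."""
    total_letters = 0
    num_words = 0
    for line in book:
        for token in line.split():
            # drop punctuation attached to either end of the token
            word = token.strip(string.punctuation)
            if word:  # skip tokens that were only punctuation, e.g. '--'
                total_letters += len(word)
                num_words += 1
    return total_letters, num_words

def pooled_avg_word_length(*books):
    """Average word length over all given books combined."""
    letters = words = 0
    for book in books:
        n_letters, n_words = word_stats(book)
        letters += n_letters
        words += n_words
    return round(letters / words, 2) if words else 0.0

sample_a = ['The year 1866 was signalised by a remarkable incident,']
sample_b = ['One day, after passing some hours']
print(word_stats(sample_a))
print(pooled_avg_word_length(sample_a, sample_b))
```

Punctuation inside a token (the apostrophe in "uncle's", or the dashes joining "time--I") is left alone here, so this is still an approximation.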

Next, let's see if we can perform a more granular analysis and find the average word length per chapter for each book. First, we need to split each book into chapters.

In [17]:
# this function splits the book at each occurrence of the word 'chapter', which prefaces each chapter in both books
# however, there is also a table of contents in Journey to the Centre of the Earth that includes several instances
# of the word 'chapter'. Each entry in the TOC looks like:

# - CHAPTER X (chapter name) -

# The beginning of each chapter, however, just looks like this:

# - CHAPTER X - 

# Which is a lot shorter. So, by adding a length constraint, we can bypass the TOC

def split_chapters(book):
    
    new_book = {}
    current_chapter = []
    chapter_num = 0
    
    for line in book:
        
        if 'CHAPTER' in line.upper() and len(line) <= len('chapter XX'):
            
            chapter = f"Chapter {chapter_num}"
            chapter_num += 1
            new_book[chapter] = current_chapter
            current_chapter = []
            
        else:
            current_chapter.append(line)
            
        
    return new_book
        
leagues_chapters = split_chapters(leagues)
journey_chapters = split_chapters(journey)
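
One caveat with the length check: Leagues numbers its chapters with Roman numerals (the first heading printed earlier is 'CHAPTER I'), and a heading such as 'CHAPTER VIII' is 12 characters long, above the 10-character limit, so it would be treated as body text. The loop also never stores the chapter still being collected when the book ends. A minimal sketch of an alternative follows, assuming a heading is the word CHAPTER followed only by an Arabic or Roman numeral. The name split_chapters_regex and the pattern are assumptions, not taken from the books' actual formatting.

```python
import re

# assumed heading format: 'CHAPTER' + a number or Roman numeral, nothing else on the line
CHAPTER_HEADING = re.compile(r'^CHAPTER\s+(\d+|[IVXLC]+)\.?$', re.IGNORECASE)

def split_chapters_regex(book):
    chapters = {}
    current = []
    chapter_num = 0
    for line in book:
        if CHAPTER_HEADING.match(line.strip()):
            # a heading closes whatever was collected so far
            chapters[f"Chapter {chapter_num}"] = current
            chapter_num += 1
            current = []
        else:
            current.append(line)
    # store the final chapter, which has no heading after it
    chapters[f"Chapter {chapter_num}"] = current
    return chapters

sample = [
    'front matter',
    'CHAPTER 1 A TABLE OF CONTENTS ENTRY',  # has a title, so not a heading
    'CHAPTER I', 'FIRST TITLE', 'text',
    'CHAPTER VIII', 'SECOND TITLE', 'more text',
]
for name, lines in split_chapters_regex(sample).items():
    print(name, lines)
```

As in split_chapters, "Chapter 0" holds everything before the first heading, so the key numbering stays comparable.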

In [18]:
# just to confirm it works...

print(journey_chapters["Chapter 1"], "\n\n")
print(leagues_chapters["Chapter 5"])

['MY UNCLE MAKES A GREAT DISCOVERY', 'Looking back to all that has occurred to me since that eventful day, I', 'am scarcely able to believe in the reality of my adventures. They were', 'truly so wonderful that even now I am bewildered when I think of them.', "My uncle was a German, having married my mother's sister, an", 'Englishwoman. Being very much attached to his fatherless nephew, he', 'invited me to study under him in his home in the fatherland. This home', 'was in a large town, and my uncle a professor of philosophy, chemistry,', 'geology, mineralogy, and many other ologies.', 'One day, after passing some hours in the laboratory--my uncle being', 'absent at the time--I suddenly felt the necessity of renovating the', 'tissues--<i>i.e.</i>, I was hungry, and was about to rouse up our old French', 'cook, when my uncle, Professor Von Hardwigg, suddenly opened the street', 'door, and came rushing upstairs.', 'Now Professor Hardwigg, my worthy uncle, is by no means a bad sort of', 'ma

Great! That's a lot of text, though. Let's see if we can just print out the title of each chapter.

In [19]:
for k, v in leagues_chapters.items():
    print(k, v[0])

Chapter 0 This etext was done by a number of anonymous volunteers of the
Chapter 1 A SHIFTING REEF
Chapter 2 PRO AND CON
Chapter 3 NED LAND
Chapter 4 AT A VENTURE
Chapter 5 AT FULL STEAM
Chapter 6 NED LAND'S TEMPERS
Chapter 7 THE MAN OF THE SEAS
Chapter 8 ALL BY ELECTRICITY
Chapter 9 A WALK ON THE BOTTOM OF THE SEA
Chapter 10 A FEW DAYS ON LAND
Chapter 11 THE INDIAN OCEAN
Chapter 12 A NOVEL PROPOSAL OF CAPTAIN NEMO'S
Chapter 13 THE RED SEA
Chapter 14 THE ARABIAN TUNNEL
Chapter 15 THE GRECIAN ARCHIPELAGO
Chapter 16 A VANISHED CONTINENT
Chapter 17 THE SUBMARINE COAL-MINES
Chapter 18 THE SARGASSO SEA
Chapter 19 ACCIDENT OR INCIDENT?


In [20]:
for k, v in journey_chapters.items():
    print(k, v[0])

Chapter 0 Produced by Norm Wolcott
Chapter 1 MY UNCLE MAKES A GREAT DISCOVERY
Chapter 2 THE MYSTERIOUS PARCHMENT
Chapter 3 AN ASTOUNDING DISCOVERY
Chapter 4 WE START ON THE JOURNEY
Chapter 5 First Lessons in Climbing
Chapter 6 Our Voyage to Iceland
Chapter 7 Conversation and Discovery
Chapter 8 THE EIDER-DOWN HUNTER--OFF AT LAST
Chapter 9 OUR START--WE MEET WITH ADVENTURES BY THE WAY
Chapter 10 TRAVELING IN ICELAND
Chapter 11 WE REACH MOUNT SNEFFELS--THE "REYKIR"
Chapter 12 THE ASCENT OF MOUNT SNEFFELS
Chapter 13 THE SHADOW OF SCARTARIS
Chapter 14 THE REAL JOURNEY COMMENCES
Chapter 15 WE CONTINUE OUR DESCENT
Chapter 16 THE EASTERN TUNNEL
Chapter 17 DEEPER AND DEEPER--THE COAL MINE
Chapter 18 THE WRONG ROAD!
Chapter 19 THE WESTERN GALLERY--A NEW ROUTE
Chapter 20 WATER, WHERE IS IT? A BITTER DISAPPOINTMENT
Chapter 21 UNDER THE OCEAN
Chapter 22 SUNDAY BELOW GROUND
Chapter 23 ALONE
Chapter 24 LOST!
Chapter 25 THE WHISPERING GALLERY
Chapter 26 A RAPID RECOVERY
Chapter 27 THE CENTRAL SEA
Cha

Now that that's in order, let's see if we can find the average word length in chapter 1 of each book.

In [21]:
j = correct_avg_word_length(journey_chapters["Chapter 1"]) # 4.97

In [26]:
l = correct_avg_word_length(leagues_chapters["Chapter 1"]) # 5.19

In [27]:
print("Avg chapter 1 word length: ", round((j + l) / 2, 2))

Avg chapter 1 word length:  5.08
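
The same calculation extends to every chapter, which is the more granular analysis set out earlier. Here is a minimal sketch, assuming a dictionary shaped like the output of split_chapters. The names chapter_avg and avg_word_length_by_chapter and the sample data are hypothetical; the counting rule mirrors correct_avg_word_length (each space marks a word, commas are skipped).

```python
def chapter_avg(lines):
    """Same counting rule as correct_avg_word_length, guarded against empty input."""
    num_words = 0
    total = 0
    for line in lines:
        num_words += line.count(' ')
        total += len(line.replace(' ', '').replace(',', ''))
    # a chapter with no spaces at all would otherwise divide by zero
    return round(total / num_words, 2) if num_words else None

def avg_word_length_by_chapter(chapters):
    return {name: chapter_avg(lines) for name, lines in chapters.items()}

sample_chapters = {
    "Chapter 1": ['A SHIFTING REEF', 'The year 1866 was signalised'],
    "Chapter 2": ['PRO AND CON'],
}
for name, avg in avg_word_length_by_chapter(sample_chapters).items():
    print(name, avg)
```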


That concludes our analysis!