# Working With Files

Another useful feature of Python is reading files from the filesystem. Sometimes we want to analyze text data. Maybe they are log files, maybe they are HTML dumps (more on that later), or maybe they are something else. In the following example we'll be working with the text of the book Frankenstein, with special thanks to Project Gutenberg for making the text freely available.

In [1]:
# The preferred way to open a text file in Python looks like this:
# The long string is the path to the file; the 'r' means "read mode" (use 'w' for write mode).
# 'rb' and 'wb' open the file in binary ("raw bytes") mode, which you may eventually have reason
# to use, but for now we're going to ignore those options.
with open('book-texts/frankenstein-no-header-footer.txt', 'r') as franken_reader:
    # The type of franken_reader is a <class '_io.TextIOWrapper'>
    print(type(franken_reader))
    
    # This type has some interesting methods.
    ## .read() will process the whole file and we can put it in a string
    whole_text = franken_reader.read()
    print(whole_text)

    # NOTE: Once the file has been read, it cannot be read a second time unless you
    # rewind it (e.g. with seek(0)) or reopen it. Reading any part of the file "consumes"
    # that section, so if you run the following code without commenting out the call to
    # read() above, nothing will be output.

    # read(), with a parameter, reads up to that many characters (in text mode)
    first_50_chars = franken_reader.read(50)
    print(first_50_chars)
    
    # Readline can be used to read one line at a time. 
    first_line = franken_reader.readline()
    print(first_line)
    
    

<class '_io.TextIOWrapper'>
Frankenstein;

or, the Modern Prometheus

by Mary Wollstonecraft (Godwin) Shelley


 CONTENTS

 Letter 1
 Letter 2
 Letter 3
 Letter 4
 Chapter 1
 Chapter 2
 Chapter 3
 Chapter 4
 Chapter 5
 Chapter 6
 Chapter 7
 Chapter 8
 Chapter 9
 Chapter 10
 Chapter 11
 Chapter 12
 Chapter 13
 Chapter 14
 Chapter 15
 Chapter 16
 Chapter 17
 Chapter 18
 Chapter 19
 Chapter 20
 Chapter 21
 Chapter 22
 Chapter 23
 Chapter 24




Letter 1

_To Mrs. Saville, England._


St. Petersburgh, Dec. 11th, 17—.


You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you u
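
The "consuming" behaviour described in the comments above can be checked on a tiny file. This is a minimal sketch, not part of the original notebook: the temporary file and its two-line contents are made up so the example runs without the book, and `seek(0)` is shown as one way to rewind.

```python
import os
import tempfile

# Hypothetical sample file, created only so this sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('Frankenstein;\nor, the Modern Prometheus\n')
    sample_path = tmp.name

with open(sample_path, 'r', encoding='utf-8') as reader:
    whole_text = reader.read()        # consumes the entire file
    after_read = reader.read(50)      # nothing is left, so this is ''

    reader.seek(0)                    # move back to the start of the file
    first_12_chars = reader.read(12)  # reads up to 12 characters
    rest_of_line = reader.readline()  # continues from where read(12) stopped

os.remove(sample_path)

print(repr(after_read))
print(repr(first_12_chars))
print(repr(rest_of_line))
```

Note that `readline()` picks up wherever the previous read left off, so it returns only the remainder of the first line here.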

# Good Habits: Line by Line

Especially when working with large files, it's a good idea to read the file line by line if possible. This keeps memory use low because only one line is held at a time, and it is often faster for very large files. Sometimes, though, you need ALL of the data at once (this was the case with our CSV files).

For example, let's say we wanted to know how often each word occurred in this book. We can do that one line at a time:

In [2]:
# A Counter is a dictionary subclass where every missing key counts as 0.
from collections import Counter

word_counts = Counter()

with open('book-texts/frankenstein-no-header-footer.txt', 'r') as franken_reader:
    line = franken_reader.readline()
    
    # At the end of the file, readline() returns an empty string.
    # No other lines (even blank ones) will be the empty string;
    # a blank line is instead the newline character "\n".
    while line != '':
        # The split function turns a string into a list of words based on 
        # where the whitespace characters are. You can split on other characters too!
        words = line.split() 
        
        for word in words:
            word_counts[word] += 1
            
        line = franken_reader.readline()
        
        
# Now that we have the word counts... Let's check them out!
for count in word_counts.most_common():
    print(count)
    
# If you look carefully there are some odd ones that we'd need better processing
# to handle. Here are some examples:
#     ('contumely?', 1)
#     ('at,', 1)
#     ('“Fear', 1)
#     ('“Farewell!', 1)

('the', 3898)
('and', 2903)
('I', 2719)
('of', 2634)
('to', 2072)
('my', 1631)
('a', 1338)
('in', 1071)
('was', 992)
('that', 974)
('had', 679)
('with', 654)
('which', 540)
('but', 538)
('me', 529)
('his', 500)
('not', 479)
('as', 477)
('for', 463)
('he', 446)
('by', 441)
('on', 425)
('you', 400)
('from', 373)
('it', 362)
('have', 356)
('be', 339)
('her', 313)
('this', 298)
('were', 298)
('is', 296)
('at', 289)
('when', 261)
('The', 255)
('your', 237)
('an', 208)
('so', 196)
('could', 187)
('will', 185)
('been', 182)
('would', 177)
('their', 174)
('one', 174)
('all', 172)
('she', 172)
('or', 169)
('they', 166)
('are', 164)
('if', 153)
('should', 152)
('who', 150)
('more', 149)
('me,', 148)
('him', 147)
('no', 146)
('some', 136)
('these', 130)
('now', 130)
('But', 128)
('He', 126)
('into', 124)
('upon', 123)
('before', 122)
('its', 120)
('My', 120)
('only', 119)
('our', 118)
('am', 114)
('we', 114)
('did', 112)
('yet', 109)
('than', 109)
('might', 107)
('me.', 107)
('myself', 105)
('eve

('paid', 6)
('secure', 6)
('manner,', 6)
('conduct', 6)
('alarmed', 6)
('traversed', 6)
('bear', 6)
('receive', 6)
('perhaps,', 6)
('speedily', 6)
('incidents', 6)
('witnesses', 6)
('bless', 6)
('closed', 6)
('About', 6)
('attracted', 6)
('ice.', 6)
('miles', 6)
('side', 6)
('perceiving', 6)
('hearing', 6)
('fatigue', 6)
('slow', 6)
('generally', 6)
('up,', 6)
('guest', 6)
('fled', 6)
('“And', 6)
('multitude', 6)
('animated', 6)
('frame', 6)
('improved', 6)
('appears', 6)
('brother', 6)
('astonishing', 6)
('own.', 6)
('communicated', 6)
('arguments', 6)
('countenance.', 6)
('failed', 6)
('violence', 6)
('you,”', 6)
('judge', 6)
('quick', 6)
('yet,', 6)
('truth', 6)
('doubtless', 6)
('raised', 6)
('embraced', 6)
('wretchedness.', 6)
('retreat', 6)
('united', 6)
('meantime', 6)
('house.', 6)
('hold', 6)
('bed', 6)
('procured', 6)
('Several', 6)
('manner.', 6)
('grew', 6)
('affection.', 6)
('elapsed', 6)
('train', 6)
('week', 6)
('victim', 6)
('possession', 6)
('majestic', 6)
('sublime', 

('sky,', 4)
('seem', 4)
('causes', 4)
('add', 4)
('Walton,', 4)
('alter', 4)
('sting', 4)
('been.', 4)
('relation', 4)
('useful', 4)
('case', 4)
('occurrences', 4)
('deemed', 4)
('unacquainted', 4)
('nature;', 4)
('wait', 4)
('dwell', 4)
('thin', 4)
('story,', 4)
('birth', 4)
('family.', 4)
('friendship', 4)
('worthy', 4)
('out,', 4)
('entered,', 4)
('money', 4)
('leisure', 4)
('reflection,', 4)
('incapable', 4)
('attending', 4)
('chamber.', 4)
('degree,', 4)
('yield', 4)
('previous', 4)
('tour', 4)
('frame.', 4)
('lot', 4)
('angel', 4)
('peasant', 4)
('wife,', 4)
('rest.', 4)
('four', 4)
('history.', 4)
('giving', 4)
('nursed', 4)
('rude', 4)
('permission', 4)
('childish', 4)
('cousin.', 4)
('quite', 4)
('Swiss', 4)
('winter,', 4)
('scope', 4)
('magnificent', 4)
('learn', 4)
('league', 4)
('temper', 4)
('crowd', 4)
('happier', 4)
('delights', 4)
('love.', 4)
('eager', 4)
('structure', 4)
('highest', 4)
('loveliness', 4)
('doing', 4)
('steps,', 4)
('volume', 4)
('this;', 4)
('contented

('degree.', 2)
('poignant', 2)
('grief?', 2)
('speaks,', 2)
('flow', 2)
('interests', 2)
('others.', 2)
('measures', 2)
('acquirement', 2)
('acquire', 2)
('race.', 2)
('suppress', 2)
('groan', 2)
('drunk', 2)
('reveal', 2)
('dash', 2)
('weakened', 2)
('conquered', 2)
('anew.”', 2)
('calm,', 2)
('settled', 2)
('double', 2)
('circle', 2)
('folly', 2)
('Will', 2)
('wanderer?', 2)
('merits', 2)
('quality', 2)
('things,', 2)
('varied', 2)
('music.', 2)
('perceive,', 2)
('evils', 2)
('gratification', 2)
('course,', 2)
('imagine', 2)
('moral', 2)
('succeed', 2)
('usually', 2)
('internal', 2)
('gratified', 2)
('sympathy,', 2)
('event,', 2)
('interrupt', 2)
('“but', 2)
('irrevocably', 2)
('warmest', 2)
('duties,', 2)
('pleasure;', 2)
('day!', 2)
('task,', 2)
('lustrous', 2)
('lineaments', 2)
('harrowing', 2)
('gallant', 2)
('wrecked', 2)
('situations', 2)
('indefatigable', 2)
('prevented', 2)
('husband', 2)
('character,', 2)
('flourishing', 2)
('state,', 2)
('poverty.', 2)
('honourable', 2)
('r

('cries', 2)
('More', 2)
('prey', 2)
('turnkeys,', 2)
('dungeon.', 2)
('feeble', 2)
('However,', 2)
('safe', 2)
('physician', 2)
('cheeks', 2)
('me?”', 2)
('agonising', 2)
('send', 2)
('desires.', 2)
('exclamation', 2)
('guilt', 2)
('feature', 2)
('muscle', 2)
('relaxed', 2)
('momentary', 2)
('delirium,', 2)
('father,”', 2)
('exhausted', 2)
('destiny,', 2)
('distant,', 2)
('disgrace', 2)
('hateful.', 2)
('poisoned', 2)
('Rhone,', 2)
('presence,', 2)
('scared', 2)
('fiend’s', 2)
('beings,', 2)
('pride.', 2)
('offspring', 2)
('preserved', 2)
('incoherent', 2)
('writing', 2)
('what,', 2)
('desiring', 2)
('union,', 2)
('airy', 2)
('consummate', 2)
('enjoys', 2)
('balanced', 2)
('suspect', 2)
('adversary’s', 2)
('consecrate', 2)
('live.', 2)
('replace', 2)
('ceremony', 2)
('appearance.', 2)
('Evian', 2)
('enjoying', 2)
('obscured', 2)
('dimmed', 2)
('Suddenly', 2)
('shrink', 2)
('adversary', 2)
('Peace,', 2)
('momentarily', 2)
('adversary.', 2)
('cause.', 2)
('arms.', 2)
('prayed', 2)
('rel

('ardour.', 1)
('grown', 1)
('confinement.', 1)
('failed;', 1)
('realise.', 1)
('while,', 1)
('unrelaxed', 1)
('eagerness,', 1)
('dabbled', 1)
('clay?', 1)
('swim', 1)
('remembrance;', 1)
('resistless', 1)
('frantic', 1)
('trance,', 1)
('acuteness', 1)
('stimulus', 1)
('ceasing', 1)
('operate,', 1)
('habits.', 1)
('bones', 1)
('charnel-houses', 1)
('disturbed,', 1)
('profane', 1)
('fingers,', 1)
('chamber,', 1)
('cell,', 1)
('apartments', 1)
('gallery', 1)
('staircase,', 1)
('workshop', 1)
('eyeballs', 1)
('starting', 1)
('dissecting', 1)
('slaughter-house', 1)
('furnished', 1)
('materials;', 1)
('whilst,', 1)
('increased,', 1)
('season;', 1)
('harvest', 1)
('vines', 1)
('luxuriant', 1)
('vintage,', 1)
('charms', 1)
('disquieted', 1)
('father:', 1)
('interruption', 1)
('neglected.”', 1)
('wished,', 1)
('procrastinate', 1)
('completed.', 1)
('unjust', 1)
('faultiness', 1)
('conceiving', 1)
('altogether', 1)
('blame.', 1)
('perfection', 1)
('rule.', 1)
('tendency', 1)
('weaken', 1)
('all

('snow,', 1)
('roll', 1)
('above;', 1)
('produces', 1)
('concussion', 1)
('speaker.', 1)
('luxuriant,', 1)
('sombre', 1)
('wreaths', 1)
('sensibilities', 1)
('brute;', 1)
('free;', 1)
('rise;', 1)
('wand’ring', 1)
('pollutes', 1)
('conceive,', 1)
('reason;', 1)
('Embrace', 1)
('same:', 1)
('Man’s', 1)
('ne’er', 1)
('morrow;', 1)
('Nought', 1)
('mutability!', 1)
('ascent.', 1)
('overlooks', 1)
('dissipated', 1)
('glacier.', 1)
('uneven,', 1)
('descending', 1)
('low,', 1)
('rifts', 1)
('width,', 1)
('crossing', 1)
('Montanvert', 1)
('opposite,', 1)
('league;', 1)
('majesty.', 1)
('recesses.', 1)
('sunlight', 1)
('clouds.', 1)
('“Wandering', 1)
('wander,', 1)
('beds,', 1)
('superhuman', 1)
('crevices', 1)
('caution;', 1)
('faintness', 1)
('gale', 1)
('(sight', 1)
('abhorred!)', 1)
('combat.', 1)
('approached;', 1)
('bespoke', 1)
('ugliness', 1)
('utterance,', 1)
('contempt.', 1)
('“Devil,”', 1)
('“do', 1)
('wreaked', 1)
('head?', 1)
('vile', 1)
('insect!', 1)
('stay,', 1)
('dust!', 1)
('o

('unperceived', 1)
('hovel.”', 1)
('“Cursed,', 1)
('instant,', 1)
('wantonly', 1)
('bestowed?', 1)
('glutted', 1)
('shrieks', 1)
('wood;', 1)
('howlings.', 1)
('toils,', 1)
('ranging', 1)
('stag-like', 1)
('swiftness.', 1)
('passed!', 1)
('bird', 1)
('universal', 1)
('stillness.', 1)
('All,', 1)
('enjoyment;', 1)
('arch-fiend,', 1)
('unsympathised', 1)
('havoc', 1)
('myriads', 1)
('existed', 1)
('enemies?', 1)
('Accordingly', 1)
('underwood,', 1)
('determining', 1)
('tranquillity;', 1)
('conclusions.', 1)
('imprudently.', 1)
('behalf,', 1)
('fool', 1)
('familiarised', 1)
('approach.', 1)
('errors', 1)
('irretrievable,', 1)
('representations', 1)
('party.', 1)
('afternoon', 1)
('acting', 1)
('females', 1)
('appeased,', 1)
('appear.', 1)
('violently,', 1)
('apprehending', 1)
('inside', 1)
('motion;', 1)
('suspense.', 1)
('“Presently', 1)
('countrymen', 1)
('pausing', 1)
('using', 1)
('gesticulations;', 1)
('appearances.', 1)
('consider,’', 1)
('‘that', 1)
('months’', 1)
('garden?', 1)
('

('formidable', 1)
('fortnight.', 1)
('suffered!', 1)
('miserably,', 1)
('suspense;', 1)
('period,', 1)
('weigh', 1)
('Explanation!', 1)
('explain?', 1)
('doubts', 1)
('explanation;', 1)
('postpone', 1)
('absence,', 1)
('begin.', 1)
('place.', 1)
('playfellows', 1)
('older.', 1)
('entertain', 1)
('case?', 1)
('Victor.', 1)
('Answer', 1)
('truth—Do', 1)
('another?', 1)
('travelled;', 1)
('opposed', 1)
('reasoning.', 1)
('futurity', 1)
('eternally', 1)
('choice.', 1)
('cruellest', 1)
('stifle,', 1)
('_honour_,', 1)
('disinterested', 1)
('tenfold', 1)
('obstacle', 1)
('wishes.', 1)
('sincere', 1)
('supposition.', 1)
('obey', 1)
('tomorrow,', 1)
('come,', 1)
('meet,', 1)
('17—”', 1)
('forgotten,', 1)
('fiend—“_I', 1)
('wedding-night!_”', 1)
('sentence,', 1)
('glimpse', 1)
('victorious', 1)
('vanquished,', 1)
('freedom?', 1)
('massacred', 1)
('burnt,', 1)
('lands', 1)
('laid', 1)
('waste,', 1)
('adrift,', 1)
('homeless,', 1)
('penniless,', 1)
('treasure,', 1)
('alas,', 1)
('Sweet', 1)
('Eliz
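
The `readline()` loop above can also be written by iterating over the file object directly, which is the more common idiom. The following is a sketch under assumptions: `io.StringIO` stands in for the opened book so it runs anywhere, and the sample text and the name `word_counts_sketch` are invented for illustration.

```python
import io
from collections import Counter

# io.StringIO behaves like an open text file; the contents are made up.
sample_file = io.StringIO('the cat and the hat\n\nthe end\n')

word_counts_sketch = Counter()
for line in sample_file:                      # yields one line at a time, newline included
    word_counts_sketch.update(line.split())   # update() adds 1 for each word in the list

print(word_counts_sketch.most_common(1))
```

With a real file the loop would sit inside the same `with open(...) as franken_reader:` block, replacing both `readline()` calls and the `while` test.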

In [3]:
# It turns out, Python has a lot of great stuff built in, including a solution to this problem...
import string
word_counts = Counter()

with open('book-texts/frankenstein-no-header-footer.txt', 'r') as franken_reader:
    line = franken_reader.readline()
    
    # At the end of the file, readline() returns an empty string.
    # No other lines (even blank ones) will be the empty string;
    # a blank line is instead the newline character "\n".
    while line != '':
        # Let's lowercase everything so we don't count A and a separately.
        line = line.lower()
        
        # explanation of maketrans: https://www.geeksforgeeks.org/python-maketrans-translate-functions/
        # replace the hyphen and the em dash with a space
        line = line.translate(str.maketrans('-—', '  '))
        
        # remove anything in string.punctuation and the curly double quotes (“ and ”)
        line = line.translate(str.maketrans('', '', string.punctuation + '“' + '”'))
        
        # The split function turns a string into a list of words based on 
        # where the whitespace characters are. You can split on other characters too!
        words = line.split() 
        
        for word in words:
            word_counts[word] += 1
            
        line = franken_reader.readline()
        
        
# Now that we have the word counts... Let's check them out!
for count in word_counts.most_common():
    print(count)

('the', 4194)
('and', 2975)
('i', 2846)
('of', 2642)
('to', 2094)
('my', 1776)
('a', 1391)
('in', 1128)
('was', 1021)
('that', 1015)
('me', 864)
('but', 687)
('had', 686)
('with', 667)
('he', 608)
('you', 572)
('which', 558)
('it', 546)
('his', 535)
('as', 528)
('not', 510)
('for', 498)
('by', 460)
('on', 460)
('this', 402)
('from', 385)
('her', 373)
('have', 365)
('be', 360)
('when', 328)
('at', 317)
('were', 308)
('is', 307)
('she', 255)
('your', 252)
('him', 221)
('an', 211)
('so', 210)
('they', 209)
('one', 206)
('all', 200)
('could', 197)
('will', 194)
('if', 193)
('been', 190)
('their', 186)
('would', 184)
('or', 177)
('are', 175)
('we', 173)
('who', 172)
('no', 170)
('more', 165)
('these', 154)
('now', 154)
('should', 153)
('yet', 152)
('some', 147)
('before', 146)
('myself', 136)
('what', 132)
('man', 132)
('am', 126)
('upon', 126)
('our', 126)
('them', 126)
('into', 124)
('its', 123)
('only', 123)
('did', 119)
('do', 115)
('life', 114)
('father', 113)
('than', 110)
('every', 1

('conviction', 6)
('humanity', 6)
('board', 6)
('assist', 6)
('disposition', 6)
('paid', 6)
('generous', 6)
('conception', 6)
('trembling', 6)
('alarmed', 6)
('traversed', 6)
('bear', 6)
('incidents', 6)
('record', 6)
('triumph', 6)
('bless', 6)
('august', 6)
('accident', 6)
('direction', 6)
('attracted', 6)
('low', 6)
('gigantic', 6)
('rapid', 6)
('miles', 6)
('perish', 6)
('perceiving', 6)
('astonishment', 6)
('hearing', 6)
('fainted', 6)
('signs', 6)
('interesting', 6)
('generally', 6)
('questions', 6)
('multitude', 6)
('gradually', 6)
('begin', 6)
('communicated', 6)
('favour', 6)
('race', 6)
('accents', 6)
('slave', 6)
('everything', 6)
('music', 6)
('prepare', 6)
('mistaken', 6)
('embraced', 6)
('public', 6)
('merchant', 6)
('beaufort', 6)
('pride', 6)
('united', 6)
('abode', 6)
('employment', 6)
('grew', 6)
('conducted', 6)
('soft', 6)
('infant', 6)
('attached', 6)
('train', 6)
('liberty', 6)
('rude', 6)
('pretty', 6)
('majestic', 6)
('temper', 6)
('avoid', 6)
('general', 6)
('g

('airs', 3)
('considerably', 3)
('ventured', 3)
('abroad', 3)
('ruins', 3)
('vicious', 3)
('sensitive', 3)
('degradation', 3)
('clings', 3)
('deceit', 3)
('religion', 3)
('chains', 3)
('gestures', 3)
('accept', 3)
('probability', 3)
('writing', 3)
('christian', 3)
('freedom', 3)
('occupy', 3)
('leagues', 3)
('youthful', 3)
('loathed', 3)
('prolong', 3)
('distress', 3)
('alike', 3)
('admirable', 3)
('fortunately', 3)
('imaginations', 3)
('monstrous', 3)
('tumultuous', 3)
('outcast', 3)
('cherished', 3)
('barren', 3)
('fitted', 3)
('revived', 3)
('‘do', 3)
('particulars', 3)
('beast', 3)
('gush', 3)
('impatience', 3)
('fired', 3)
('scorn', 3)
('wore', 3)
('running', 3)
('me’', 3)
('pieces', 3)
('portrait', 3)
('softened', 3)
('sleeping', 3)
('stirred', 3)
('deny', 3)
('burned', 3)
('wilds', 3)
('fare', 3)
('fixing', 3)
('consume', 3)
('obtaining', 3)
('fits', 3)
('devouring', 3)
('utility', 3)
('choice', 3)
('precaution', 3)
('foe', 3)
('machinations', 3)
('pointed', 3)
('landscape', 3)


('recompensing', 1)
('exotic', 1)
('gardener', 1)
('rougher', 1)
('tend', 1)
('relinquished', 1)
('restorative', 1)
('naples', 1)
('stores', 1)
('tender', 1)
('idol', 1)
('silken', 1)
('cord', 1)
('excursion', 1)
('frontiers', 1)
('remembering', 1)
('cot', 1)
('foldings', 1)
('vale', 1)
('scanty', 1)
('babes', 1)
('eyed', 1)
('vagrants', 1)
('brightest', 1)
('gold', 1)
('despite', 1)
('clothing', 1)
('brow', 1)
('moulding', 1)
('stamp', 1)
('milanese', 1)
('nobleman', 1)
('german', 1)
('italians', 1)
('antique', 1)
('schiavi', 1)
('ognor', 1)
('frementi', 1)
('dungeons', 1)
('austria', 1)
('confiscated', 1)
('foster', 1)
('leaved', 1)
('brambles', 1)
('hall', 1)
('pictured', 1)
('cherub', 1)
('guardians', 1)
('providence', 1)
('inmate', 1)
('parents’', 1)
('reverential', 1)
('seriousness', 1)
('interpreted', 1)
('literally', 1)
('cherish', 1)
('praises', 1)
('familiarly', 1)
('disunion', 1)
('dispute', 1)
('diversity', 1)
('concentrated', 1)
('smitten', 1)
('turbulence', 1)
('summers',

('disgusting', 1)
('minutest', 1)
('‘hateful', 1)
('life’', 1)
('‘accursed', 1)
('alluring', 1)
('abhorred’', 1)
('compassionate', 1)
('importance', 1)
('plenty', 1)
('reigned', 1)
('moonshine', 1)
('frail', 1)
('fortify', 1)
('undergo', 1)
('eve', 1)
('adam’s', 1)
('supplication', 1)
('assume', 1)
('heed', 1)
('bleakness', 1)
('conformation', 1)
('endurance', 1)
('depending', 1)
('casualties', 1)
('yearned', 1)
('introducing', 1)
('mediation', 1)
('tolerated', 1)
('red', 1)
('denied', 1)
('thoughtfulness', 1)
('exerting', 1)
('planks', 1)
('knocked', 1)
('there’', 1)
('‘come', 1)
('in’', 1)
('‘pardon', 1)
('intrusion’', 1)
('oblige', 1)
('fire’', 1)
('‘enter’', 1)
('‘and', 1)
('afraid', 1)
('host', 1)
('need’', 1)
('ensued', 1)
('irresolute', 1)
('‘by', 1)
('countryman', 1)
('french’', 1)
('hopes’', 1)
('‘are', 1)
('germans’', 1)
('ever’', 1)
('brotherly', 1)
('despair’', 1)
('‘they', 1)
('monster’', 1)
('blameless', 1)
('undeceive', 1)
('them’', 1)
('terrors', 1)
('overcome’', 1)
('‘
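
The output above still contains entries such as `‘do` and `me’`, because the curly single quotes are not part of `string.punctuation`. One possible extension, sketched here as an assumption rather than as the notebook's method, adds them to the set of removed characters. The helper name `normalize` and the constant `CURLY_QUOTES` are hypothetical.

```python
import string

# Assumption: these four curly quote characters are the ones worth stripping.
CURLY_QUOTES = '“”‘’'

# Build the translation tables once instead of on every line.
dash_to_space = str.maketrans('-—', '  ')
strip_punctuation = str.maketrans('', '', string.punctuation + CURLY_QUOTES)

def normalize(line):
    """Lowercase a line, turn hyphens and em dashes into spaces, drop punctuation."""
    line = line.lower()
    line = line.translate(dash_to_space)
    return line.translate(strip_punctuation)

print(normalize('“Devil,” I exclaimed—‘do you dare approach me?’').split())
```

The trade-off is that `’` also serves as the apostrophe in this text, so a word like `adam’s` would become `adams`.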

# Counting with Context...

What if we wanted to know how many paragraph breaks there are in the book, and how many of those begin with a character speaking?

In [4]:
paragraphs = 0
starts_with_quote = 0
last_was_linebreak = True # I'm assuming the start of the book counts as a linebreak

with open('book-texts/frankenstein-no-header-footer.txt', 'r') as franken_reader:
    line = franken_reader.readline()
    
    while line != '':

        if line[0] == '“' and last_was_linebreak:
            starts_with_quote += 1
        
        last_was_linebreak = False
        if line == '\n':
            paragraphs += 1
            last_was_linebreak = True
        
        line = franken_reader.readline()
        
print(paragraphs, starts_with_quote)

936 310
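
To check the counting logic on something small enough to verify by hand, the same idea can be wrapped in a function and run on a made-up sample. This is a sketch: the function name `count_paragraph_breaks` and the sample text are assumptions, and `io.StringIO` stands in for the book file.

```python
import io

def count_paragraph_breaks(lines):
    """Return (blank_line_count, quoted_paragraph_starts) for an iterable of lines."""
    blank_lines = 0
    starts_with_quote = 0
    last_was_blank = True  # same assumption as above: the start counts as a break
    for line in lines:
        if last_was_blank and line.startswith('“'):
            starts_with_quote += 1
        last_was_blank = (line == '\n')
        if last_was_blank:
            blank_lines += 1
    return blank_lines, starts_with_quote

sample = io.StringIO('“Yes.”\n\nHe paused.\n“No,” I said.\n\n\n“Why?”\n')
print(count_paragraph_breaks(sample))
```

Like the cell above, this counts every blank line, so two blank lines in a row are counted as two paragraph breaks.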


# Regular Expressions

When working with text, especially when searching for patterns in text, regex is incredibly useful. Let's look at some of the basics.

* Regex offers some very fancy features that can be very helpful...
    * `.` is regex's version of `_`: it matches any one character (except a newline, by default).
    * `*` and `+` are quantifiers that modify the previous character:
        * `.*` means "0 or more of any character"
        * `.+` means "one or more of any character"
        * `a+` means "one or more a's in a row"
    * More specific quantifiers use `{}`
        * `a{1,3}` means between 1 and 3 a's in a row.
        * The `?` can follow any of these quantifiers to make it "lazy" (non-greedy), which means it matches the shortest possible match. By default regex is "greedy" and matches the longest possible match.
        
* Character classes are grouped in `[]`
    * A character class is a set of characters to match.
    * The `-` specifies a range of characters, ordered by their positions in the `ASCII` and `Unicode` lookup tables.
    * Put simply, `[a-z]` means all the lowercase letters from a to z. `[a-zA-Z]` means all lower and uppercase letters.
    * `[aeiou]` means any of the vowels. 
* Character classes can be combined with the quantifiers! For example, `[aeiou]{2}` means two vowels in a row.
    
* `^` matches the start of a string and `$` matches the end of a string.
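
The rules listed above can be tried out on a short string before running them on the whole book. This is a small sketch with a made-up sample string; the variable names are only for illustration.

```python
import re

text = 'aaa “one” and “two”'

runs_of_a = re.findall('a+', text)          # one or more a's in a row
short_runs = re.findall('a{1,2}', text)     # at most two a's per match
greedy = re.findall('“.*”', text)           # default: longest possible match
lazy = re.findall('“.*?”', text)            # with ?: shortest possible match
vowels = re.findall('[aeiou]', 'two')       # character class
starts_with_aaa = bool(re.search('^aaa', text))  # ^ anchors to the start
ends_with_two = bool(re.search('two”$', text))   # $ anchors to the end

print(greedy)
print(lazy)
```

The greedy pattern returns a single match running from the first opening quote to the last closing quote, while the lazy pattern returns each quotation separately.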

In [5]:
# Let's see some examples in action:
import re # this is the regular expression library built into Python

book_text = ''

with open('book-texts/frankenstein-no-header-footer.txt', 'r') as franken_reader:
    # This code reads the book line by line, and strips out newlines unless
    # they appear on a line by themselves. Essentially leaving in only line breaks
    # that are paragraph breaks!
    line = franken_reader.readline()
    while line != '':
        if line == '\n':
            book_text += line
        else:
            book_text += line.replace("\n", ' ')

        line = franken_reader.readline()
        
print(book_text)

Frankenstein; 
or, the Modern Prometheus 
by Mary Wollstonecraft (Godwin) Shelley 

 CONTENTS 
 Letter 1  Letter 2  Letter 3  Letter 4  Chapter 1  Chapter 2  Chapter 3  Chapter 4  Chapter 5  Chapter 6  Chapter 7  Chapter 8  Chapter 9  Chapter 10  Chapter 11  Chapter 12  Chapter 13  Chapter 14  Chapter 15  Chapter 16  Chapter 17  Chapter 18  Chapter 19  Chapter 20  Chapter 21  Chapter 22  Chapter 23  Chapter 24 



Letter 1 
_To Mrs. Saville, England._ 

St. Petersburgh, Dec. 11th, 17—. 

You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking. 
I am already far north of London, and as I walk in the streets of Petersburgh, I feel a cold northern breeze play upon my cheeks, which braces my nerves and fills me with delight. Do you understand this feeling? This

In [6]:
# Find substrings that start with a quote and end with a close quote or linebreak.
individual_quotation = '“.*?[”\n]'

quotes = re.findall(individual_quotation, book_text)
for q in quotes:
    print(q)

“What a noble fellow!”
“the land of mist and snow,”
“Ancient Mariner.”
“Here is our captain, and he will not allow you to perish on the open sea.”
“Before I come on board your vessel,”
“will you have the kindness to inform me whither you are bound?”
“To seek one who fled from me.”
“And did the man whom you pursued travel in the same fashion?”
“Yes.”
“Then I fancy we have seen him, for the day before we picked you up we saw some dogs drawing a sledge, with a man in it, across the ice.”
“I have, doubtless, excited your curiosity, as well as that of these good people; but you are too considerate to make inquiries.”
“Certainly; it would indeed be very impertinent and inhuman in me to trouble you with any inquisitiveness of mine.”
“And yet you rescued me from a strange and perilous situation; you have benevolently restored me to life.”
“Unhappy man! Do you share my madness? Have you drunk also of the intoxicating draught? Hear me; let me reveal my tale, and you will dash the cup from your l

In [7]:
# Find words with two vowels in a row in them (y counts as a vowel here)
dual_vowel_words = ' [a-zA-Z]*[aeiouyAEIOUY]{2}[a-z]* '

double_vowel_instances = re.findall(dual_vowel_words, book_text)
for dv in double_vowel_instances:
    print(dv)

 Prometheus 
 Shelley 
 rejoice 
 hear 
 accompanied 
 you 
 dear 
 increasing 
 already 
 streets 
 feel 
 breeze 
 you 
 regions 
 daydreams 
 vain 
 persuaded 
 seat 
 imagination 
 region 
 beauty 
 broad 
 perpetual 
 your 
 sailing 
 may 
 beauty 
 region 
 productions 
 features 
 without 
 heavenly 
 undoubtedly 
 may 
 country 
 may 
 wondrous 
 needle 
 may 
 thousand 
 observations 
 require 
 voyage 
 their 
 eccentricities 
 satiate 
 curiosity 
 may 
 foot 
 they 
 sufficient 
 conquer 
 fear 
 death 
 laborious 
 joy 
 feels 
 holiday 
 expedition 
 you 
 near 
 reach 
 ascertaining 
 reflections 
 agitation 
 feel 
 heart 
 enthusiasm 
 tranquillise 
 steady 
 point 
 soul 
 intellectual 
 expedition 
 been 
 favourite 
 early 
 read 
 ardour 
 accounts 
 various 
 been 
 Ocean 
 seas 
 surround 
 You 
 voyages 
 our 
 education 
 yet 
 passionately 
 day 
 familiarity 
 increased 
 learning 
 dying 
 seafaring 
 visions 
 poets 
 effusions 
 soul 
 poet 
 year 
 obtain

 theory 
 said 
 greatly 
 Cornelius 
 pursue 
 seemed 
 would 
 could 
 attention 
 early 
 entertained 
 greatest 
 science 
 could 
 real 
 mood 
 betook 
 appertaining 
 science 
 being 
 our 
 bound 
 look 
 seems 
 miraculous 
 inclination 
 immediate 
 guardian 
 preservation 
 ready 
 announced 
 unusual 
 soul 
 relinquishing 
 ancient 
 taught 
 associate 
 their 
 their 
 too 
 decreed 
 attained 
 seventeen 
 should 
 schools 
 thought 
 completion 
 education 
 should 
 acquainted 
 early 
 day 
 could 
 caught 
 greatest 
 been 
 persuade 
 refrain 
 yielded 
 our 
 heard 
 favourite 
 could 
 attentions 
 consequences 
 day 
 accompanied 
 looks 
 deathbed 
 joined 
 your 
 expectation 
 consolation 
 your 
 you 
 younger 
 quit 
 thoughts 
 endeavour 
 cheerfully 
 death 
 meeting 
 died 
 countenance 
 affection 
 need 
 feelings 
 dearest 
 void 
 despair 
 persuade 
 day 
 appeared 
 our 
 eye 
 been 
 sound 
 voice 
 familiar 
 dear 
 ear 
 reflections 
 reality 
 a

 trees 
 season 
 greatly 
 joy 
 affection 
 gloom 
 cheerful 
 exclaimed 
 good 
 instead 
 being 
 you 
 been 
 repay 
 feel 
 greatest 
 disappointment 
 been 
 you 
 repay 
 you 
 you 
 you 
 good 
 may 
 you 
 may 
 could 
 Could 
 said 
 mention 
 your 
 cousin 
 they 
 you 
 your 
 They 
 you 
 been 
 uneasy 
 your 
 dear 
 could 
 thought 
 dear 
 your 
 you 
 see 
 been 
 days 
 your 
 dearest 
 been 
 dear 
 sufficient 
 reassure 
 your 
 You 
 yet 
 dear 
 our 
 thought 
 each 
 would 
 persuasions 
 restrained 
 journey 
 encountering 
 inconveniences 
 yet 
 being 
 your 
 could 
 guess 
 affection 
 your 
 Yet 
 indeed 
 eagerly 
 you 
 soon 
 your 
 You 
 cheerful 
 friends 
 you 
 Your 
 health 
 see 
 you 
 cloud 
 pleased 
 would 
 our 
 sixteen 
 desirous 
 true 
 foreign 
 least 
 pleased 
 idea 
 career 
 your 
 looks 
 odious 
 fear 
 yield 
 point 
 profession 
 our 
 you 
 blue 
 our 
 our 
 hearts 
 occupations 
 exertions 
 seeing 
 around 
 you 
 our 
 you 


 trees 
 augmented 
 habitations 
 mountain 
 Soon 
 valley 
 valley 
 beautiful 
 picturesque 
 through 
 mountains 
 immediate 
 ruined 
 glaciers 
 heard 
 raised 
 surrounding 
 tremendous 
 overlooked 
 pleasure 
 perceived 
 days 
 associated 
 lighthearted 
 soothing 
 weep 
 again 
 influence 
 found 
 again 
 grief 
 weighed 
 Exhaustion 
 fatigue 
 remained 
 played 
 pursued 
 noisy 
 sounds 
 too 
 head 
 sleep 
 day 
 through 
 stood 
 sources 
 their 
 mountains 
 glacier 
 glorious 
 imperial 
 sound 
 through 
 been 
 plaything 
 their 
 greatest 
 They 
 although 
 they 
 tranquillised 
 they 
 thoughts 
 brooded 
 waited 
 They 
 round 
 unstained 
 soaring 
 round 
 they 
 clouded 
 rain 
 pouring 
 would 
 their 
 veil 
 seek 
 their 
 rain 
 brought 
 view 
 tremendous 
 glacier 
 soul 
 soar 
 indeed 
 causing 
 without 
 acquainted 
 would 
 grandeur 
 continual 
 you 
 surmount 
 thousand 
 may 
 trees 
 leaning 
 mountain 
 you 
 continually 
 speaking 
 loud 


 unacquainted 
 language 
 country 
 good 
 Italian 
 mentioned 
 they 
 death 
 house 
 they 
 took 
 Safie 
 views 
 social 
 their 
 yet 
 looked 
 qualities 
 account 
 August 
 neighbouring 
 food 
 brought 
 found 
 ground 
 leathern 
 containing 
 eagerly 
 books 
 acquired 
 they 
 possession 
 treasures 
 continually 
 friends 
 employed 
 their 
 you 
 They 
 raised 
 frequently 
 opinions 
 been 
 found 
 source 
 speculation 
 their 
 out 
 experience 
 thought 
 being 
 contained 
 disquisitions 
 death 
 suicide 
 yet 
 opinions 
 extinction 
 without 
 applied 
 feelings 
 found 
 yet 
 beings 
 read 
 conversation 
 understood 
 hideous 
 questions 
 contained 
 histories 
 founders 
 ancient 
 book 
 learned 
 imaginations 
 taught 
 heroes 
 read 
 boundless 
 unacquainted 
 been 
 school 
 studied 
 book 
 mightier 
 read 
 their 
 greatest 
 virtue 
 understood 
 signification 
 they 
 applied 
 pleasure 
 pain 
 course 
 peaceable 
 patriarchal 
 caused 
 impressio

 cause 
 join 
 Parliament 
 amiable 
 peculiar 
 they 
 days 
 feelings 
 found 
 appearance 
 yet 
 sufficient 
 obtain 
 ancient 
 streets 
 through 
 exquisite 
 spread 
 enjoyed 
 yet 
 enjoyment 
 anticipation 
 peaceful 
 youthful 
 beautiful 
 productions 
 could 
 heart 
 should 
 soon 
 pitiable 
 period 
 endeavouring 
 Our 
 voyages 
 illustrious 
 field 
 patriot 
 soul 
 fears 
 ideas 
 chains 
 look 
 free 
 eaten 
 proceeded 
 our 
 country 
 neighbourhood 
 greater 
 green 
 always 
 mountains 
 wondrous 
 curiosities 
 collections 
 pronounced 
 quit 
 journeying 
 could 
 yet 
 streams 
 familiar 
 dear 
 cheat 
 proportionably 
 found 
 greater 
 resources 
 could 
 associated 
 could 
 said 
 mountains 
 should 
 found 
 pain 
 feelings 
 quit 
 pleasure 
 again 
 various 
 conceived 
 affection 
 period 
 our 
 our 
 friend 
 feared 
 remain 
 wreak 
 vengeance 
 idea 
 waited 
 they 
 delayed 
 thousand 
 they 
 superscription 
 read 
 ascertain 
 thought 
 fiend

 agreed 
 immediately 
 our 
 should 
 our 
 days 
 beautiful 
 near 
 meantime 
 took 
 precaution 
 fiend 
 carried 
 about 
 means 
 greater 
 period 
 threat 
 marriage 
 greater 
 certainty 
 day 
 solemnisation 
 nearer 
 heard 
 continually 
 could 
 seemed 
 tranquil 
 greatly 
 day 
 thought 
 dreadful 
 reveal 
 meantime 
 niece 
 agreed 
 should 
 our 
 sleeping 
 Evian 
 continuing 
 voyage 
 day 
 our 
 enjoyed 
 feeling 
 rays 
 enjoyed 
 beauty 
 pleasant 
 surmounting 
 beautiful 
 mountains 
 vain 
 coasting 
 ambition 
 would 
 insurmountable 
 should 
 took 
 you 
 may 
 you 
 endeavour 
 quiet 
 freedom 
 despair 
 day 
 least 
 dear 
 replied 
 joy 
 painted 
 heart 
 too 
 beauty 
 Look 
 clear 
 distinguish 
 lies 
 endeavoured 
 thoughts 
 reflection 
 joy 
 continually 
 distraction 
 through 
 approached 
 amphitheatre 
 mountains 
 eastern 
 Evian 
 woods 
 surrounded 
 mountain 
 mountain 
 carried 
 air 
 caused 
 pleasant 
 trees 
 approached 
 beneath 
 t

 thou 
 yet 
 yet 
 against 
 would 
 satiated 
 thou 
 seek 
 cause 
 thou 
 ceased 
 thou 
 against 
 vengeance 
 thou 
 superior 
 cease 
 wounds 
 death 
 cried 
 feel 
 Soon 
 miseries 
 triumphantly 
 conflagration 
 sea 
 sleep 
 said 
 lay 
 soon 
 away 
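
Because the pattern above requires a literal space on both sides of a word, it skips words next to punctuation and cannot match two neighbouring words that share a single space. One alternative, offered here as a sketch and not as the notebook's approach, is the word-boundary marker `\b`. The sample sentence is made up.

```python
import re

sample = 'You hear our dear friend; beauty may fade.'

# Original style: a space is consumed on each side of the word.
spaced = re.findall(' [a-zA-Z]*[aeiouyAEIOUY]{2}[a-z]* ', sample)

# \b matches the edge of a word without consuming any characters.
bounded = re.findall(r'\b[a-zA-Z]*[aeiouyAEIOUY]{2}[a-zA-Z]*\b', sample)

print(spaced)
print(bounded)
```

In this sample the space-delimited pattern misses `You` (no leading space), `our` and `may` (their leading space was consumed by the previous match), and `friend` (followed by a semicolon), while the `\b` version finds all of them.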


In [10]:
# Finally, for completeness, let's also write back some data to a file. 
# Specifically, let's make a CSV from the word counts we computed above:
with open('book-texts/fs-word-counts.csv', 'w') as wc_csv:
    # This line is the CSV header
    line = 'word,count\n'
    wc_csv.write(line)
    
    for word, count in word_counts.items():
        # This is an f-string, Python's "string interpolation" syntax. Variables inside {}'s
        # get replaced with the value of the variable.
        line = f'{word},{count}\n'
        wc_csv.write(line)

In [None]:
# Now you could use that information anywhere that accepts CSV... yay!
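
Writing the rows by hand works here because punctuation was stripped from the words first. If a word could contain a comma or a double quote, the standard library's `csv` module would handle the quoting. The sketch below uses made-up counts and an in-memory `io.StringIO` buffer in place of a real file, both of which are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Made-up counts, including values that would break a hand-built CSV line.
sample_counts = Counter({'the': 3, 'me,': 2, 'said "no"': 1})

buffer = io.StringIO()  # stands in for open(path, 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['word', 'count'])           # header row
for word, count in sample_counts.most_common():
    writer.writerow([word, count])           # quoting is added only where needed

# Read the data back to confirm it round-trips.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated twice.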