# Strings and Lists Examples

In this presentation, we will work through a few more examples using *Emma* by Jane Austen.

In [1]:
from nltk.corpus import gutenberg

In [2]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

## **Problem 1**. Find the number of lines that start with `the`

**Note** I am going to ignore the title and chapter headings; you will need to handle these in the group work. I will remove punctuation.

**Note 2** Make sure that you have installed ``nltk`` as shown in the book/group work.

In [3]:
emma = gutenberg.raw('austen-emma.txt')

In [5]:
print(emma[:1000])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o

In [45]:
emma[-1000:]

' was safe.--\nBut Mr. John Knightley must be in London again by the end of the\nfirst week in November.\n\nThe result of this distress was, that, with a much more voluntary,\ncheerful consent than his daughter had ever presumed to hope for at\nthe moment, she was able to fix her wedding-day--and Mr. Elton was\ncalled on, within a month from the marriage of Mr. and Mrs. Robert\nMartin, to join the hands of Mr. Knightley and Miss Woodhouse.\n\nThe wedding was very much like other weddings, where the parties\nhave no taste for finery or parade; and Mrs. Elton, from the\nparticulars detailed by her husband, thought it all extremely shabby,\nand very inferior to her own.--"Very little white satin, very few\nlace veils; a most pitiful business!--Selina would stare when she\nheard of it."--But, in spite of these deficiencies, the wishes,\nthe hopes, the confidence, the predictions of the small band\nof true friends who witnessed the ceremony, were fully answered\nin the perfect happiness of 

To clean up the text, I will use the functions shown in the book.

In [46]:
from string import punctuation, whitespace

remove_punc = lambda s: "".join([ch for ch in s if ch not in punctuation])
make_lower_case = lambda s: s.lower()


In [47]:
emma_fixed = make_lower_case(remove_punc(emma))
emma_fixed[:1000]

'emma by jane austen 1816\n\nvolume i\n\nchapter i\n\n\nemma woodhouse handsome clever and rich with a comfortable home\nand happy disposition seemed to unite some of the best blessings\nof existence and had lived nearly twentyone years in the world\nwith very little to distress or vex her\n\nshe was the youngest of the two daughters of a most affectionate\nindulgent father and had in consequence of her sisters marriage\nbeen mistress of his house from a very early period  her mother\nhad died too long ago for her to have more than an indistinct\nremembrance of her caresses and her place had been supplied\nby an excellent woman as governess who had fallen little short\nof a mother in affection\n\nsixteen years had miss taylor been in mr woodhouses family\nless as a governess than a friend very fond of both daughters\nbut particularly of emma  between them it was more the intimacy\nof sisters  even before miss taylor had ceased to hold the nominal\noffice of governess the mildness of he
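To see what the cleaning step keeps and drops, here is a minimal sketch on a small made-up string (the `sample` text is hypothetical, not taken from the corpus). It repeats the two helpers so that it runs on its own. Note that `'\n'` is not in `punctuation`, so the line breaks survive, which matters for the line-based problem below.

```python
from string import punctuation

# Same helpers as above, repeated so this cell is self-contained.
remove_punc = lambda s: "".join([ch for ch in s if ch not in punctuation])
make_lower_case = lambda s: s.lower()

# Hypothetical two-line sample (not from the corpus).
sample = "The End.\nIt's twenty-one, isn't it?"

cleaned = make_lower_case(remove_punc(sample))
print(repr(cleaned))  # the newline is kept; apostrophes and hyphens are removed
```

As in the cleaned text above, hyphenated words and contractions are fused (`twentyone`, `isnt`), which is a side effect of removing every punctuation character.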

Since the problem asks about lines, we split the text into lines and keep only the ones that start with ``the``.

In [48]:
lines = [line for line in emma_fixed.split('\n') if line.startswith('the ')]
lines[:10]

['the real evils indeed of emmas situation were the power of having',
 'the event had every promise of happiness for her friend  mr weston',
 'the want of miss taylor would be felt every hour of every day',
 'the evil of the actual disparity in their ages and mr woodhouse had',
 'the carriage but james will not like to put the horses to for',
 'the backgammontable was placed but a visitor immediately afterwards',
 'the chances are that she must be a gainer',
 'the boy had with the additional softening claim of a lingering',
 'the child was given up to the care and the wealth of the churchills',
 'the next eighteen or twenty years of his life passed cheerfully away']
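The trailing space in `'the '` is doing real work in the filter above. The sketch below, on a made-up `sample` string (an assumption for illustration only), compares the filter with and without the space.

```python
# Hypothetical sample, already lower-cased and free of punctuation.
sample = "the cat sat\nthey all left\nthen it rained\nin the end\nthe end"

# With the trailing space, only lines whose first word is "the" match.
with_space = [line for line in sample.split('\n') if line.startswith('the ')]

# Without it, "they" and "then" match as well.
without_space = [line for line in sample.split('\n') if line.startswith('the')]

print(len(with_space), with_space)
print(len(without_space), without_space)
```

One consequence of this choice is that a line consisting only of the word `the` would not be counted, since it has no space after it.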

The answer to the question is obtained by using ``len`` to count the lines.

In [49]:
num_lines = len(lines)
num_lines

463

To make a general function that accomplishes this, we substitute each intermediate expression into the next, giving a single expression.

In [26]:
count_the = lambda clean_str: len([line for line in clean_str.split('\n') if line.startswith('the ')])

# The function we want is a composition of these functions.
num_start_with_the = lambda S: count_the(make_lower_case(remove_punc(S)))

num_start_with_the(emma)

463
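The same composition can also be written as a named function that takes the starting word as a parameter. This is only a sketch: the name `count_lines_starting_with` and the `sample` string are hypothetical and do not come from the book.

```python
from string import punctuation

def count_lines_starting_with(text, word):
    """Count lines that begin with `word` followed by a space,
    after removing punctuation and lower-casing the text."""
    cleaned = "".join(ch for ch in text if ch not in punctuation).lower()
    prefix = word.lower() + " "
    return len([line for line in cleaned.split('\n') if line.startswith(prefix)])

# Hypothetical sample with mixed case and punctuation.
sample = "The cat sat.\nThey left.\n\"The dog,\" she said.\nA bird sang."

print(count_lines_starting_with(sample, "the"))
print(count_lines_starting_with(sample, "a"))
```

The third line of the sample is counted because the opening quotation mark is stripped before the prefix test.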

## **Problem 2**. Find the shortest non-blank line (including punctuation and whitespace).

**Note** Again, I am naively ignoring the title and chapter headings (you can't).

**Note 2** Don't remove punctuation.

The general process is:

1. Split the text into lines.
2. Replace each line with its length.
3. Use ``min`` to get the smallest length, skipping blank lines.
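The steps above can be sketched on a small made-up string before running them on the full text. The `sample` string is hypothetical; it also shows why blank lines need to be filtered out, since otherwise the minimum length is always 0.

```python
# Hypothetical sample with two blank lines.
sample = "Chapter I\n\nA short line\nHi\n\nThe end"

lines = sample.split('\n')                 # step 1: split into lines
lengths = [len(line) for line in lines]    # step 2: replace each line with its length
non_blank = [n for n in lengths if n > 0]  # drop the zero-length (blank) lines
shortest_length = min(non_blank)           # step 3: smallest remaining length

print(lengths)
print(min(lengths), shortest_length)
```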

In [27]:
lines = emma.split('\n')
lines[:10]

['[Emma by Jane Austen 1816]',
 '',
 'VOLUME I',
 '',
 'CHAPTER I',
 '',
 '',
 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home',
 'and happy disposition, seemed to unite some of the best blessings',
 'of existence; and had lived nearly twenty-one years in the world']

In [29]:
lengths = [len(line) for line in lines]
lengths[:10]

[26, 0, 8, 0, 9, 0, 0, 67, 65, 64]

In [31]:
# I have decided to skip blank lines
non_blank = [num for num in lengths if num > 0]
non_blank[:10]

[26, 8, 9, 67, 65, 64, 40, 65, 67, 64]

In [33]:
shortest = min(non_blank)
shortest

5

Now combine them all into one expression.

In [37]:
shortest = min([len(line) for line in emma.split('\n') if len(line) > 0])
shortest

5

And package it in a function.

In [39]:
shortest = lambda S: min([len(line) for line in S.split('\n') if len(line) > 0])
shortest(emma)

5
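The function above returns the length of the shortest line, not the line itself. If the line is wanted, one possible variation is to pass `key=len` to `min`. The name `shortest_line` and the `sample` string are hypothetical, introduced only for this sketch.

```python
# Return the shortest non-blank line itself rather than its length.
shortest_line = lambda S: min([line for line in S.split('\n') if len(line) > 0], key=len)

# Hypothetical sample with two blank lines.
sample = "Chapter I\n\nA short line\nHi!\n\nThe end"

print(repr(shortest_line(sample)))
print(len(shortest_line(sample)))
```

When several lines tie for the shortest length, `min` returns the first one it encounters.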