# Quick Homework Review
First, let's clarify markdown vs. code blocks. Markdown blocks are used for prose. Code blocks are used for Python or Terminal commands.

Markdown blocks can be created by pressing `m` on a cell when your cursor is not in a code block.

Markdown blocks can be converted to code blocks by pressing `y` on a cell.

If you ever forget a shortcut, don't worry. You can use Jupyter's Commands panel, which is represented by the picture of the artist's palette on the left. You can also open it by pressing `command + shift + c`.

Doing this allows you to type just about *anything* you want Jupyter to do, and it will find the appropriate command.

## What's wrong with the code below?

In [3]:
'that' = that

SyntaxError: can't assign to literal (<ipython-input-3-ff9af23508f0>, line 1)

In [2]:
that = 'that'

In [4]:
that

'that'

In [6]:
print('this and' + that)

this andthat
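
The two problems above can be summarized in a short sketch. It assumes the intended output was "this and that" with a space; the name `combined` is introduced here only for illustration.

```python
# The variable name goes on the left of `=`, the string value on the right.
that = 'that'

# `+` joins strings exactly as written, so the space must be inside a string.
combined = 'this and ' + that
print(combined)
```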


# Doing things with strings
So far we have learned to combine strings with `+` and `print` strings using some formatting like `\n`. We can do lots of other things with them. One of the most important is `slicing`, which allows us to cut strings into pieces. For instance, let's say we wanted to get my initials from these variables:

In [7]:
first_name = 'Erik'

In [8]:
last_name = 'Fredner'

In [9]:
first_initial = first_name[0]
last_initial = last_name[0]
print(first_initial + last_initial)

EF


### How does the above block work?

The `[]` characters are used to indicate the `index` of the character that we want to access in the string. Indexes represent the positions of elements in an object. Python indexes from 0, with each subsequent character indicated by the subsequent integer. So:

In [10]:
first_name[0]

'E'

In [11]:
'Erik'[0]

'E'

In [6]:
first_name[1]

'r'

In [7]:
first_name[2]

'i'

In [8]:
first_name[3]

'k'

It helps to visualize it. The top row represents the characters in the string `first_name`. The bottom row represents the index numbers:

| E | r | i | k |
|---|---|---|---|
| 0 | 1 | 2 | 3 |

### Indexing from Zero
Many people new to Python have difficulty remembering to count from zero. It will become second nature eventually. One helpful way to think about it verbally is to refer to the element at the beginning as the "zeroth" item. People trip themselves up by referring to, say, the "first" character of `first_name` being "E". It is more Pythonic to say that the "zeroth" letter of `first_name` is "E".

Indexing enables us to manipulate and reconstitute objects. We could rearrange these characters to spell a new word:

In [9]:
first_name[2] + first_name[1] + first_name[3]

'irk'
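
If counting from zero is hard to keep straight, one possible way to check positions is to have Python list them. This sketch uses the built-in `enumerate` function and a `for` loop, neither of which has been introduced in this lesson, and the name `last_index` is chosen only for illustration.

```python
first_name = 'Erik'

# enumerate() pairs each character with its index, starting from zero.
for index, character in enumerate(first_name):
    print(index, character)

# The last valid index is always one less than the length of the string.
last_index = len(first_name) - 1
print(first_name[last_index])
```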

## Slicing
We can also slice a string using indices to take a *part* of a string:

In [12]:
last_name[0:4]

'Fred'

Slicing is very useful for the work we will be doing in this class. For example, we will sometimes want to slice novels into chunks of a specific size (say, 1000 words). We will do this using slicing and indices.

We use the `:` character to indicate a range of consecutive indices.

In [8]:
last_name[1:3]

're'

In [10]:
# equivalent:
last_name[1] + last_name[2]

're'
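
The idea of cutting a text into chunks of a fixed size, mentioned above, might look roughly like the sketch below. It is not necessarily how we will do it in class. It relies on things not yet covered (the `split` method, lists, and a `while` loop), and the sample text and `chunk_size` of 4 are hypothetical stand-ins for a novel and 1000 words.

```python
text = 'to be or not to be that is the question'
chunk_size = 4  # hypothetical size; a real project might use 1000

words = text.split()  # split() breaks a string into a list of words

chunks = []
start = 0
while start < len(words):
    # Slice from `start` up to, but not including, `start + chunk_size`.
    chunks.append(words[start:start + chunk_size])
    start = start + chunk_size

print(chunks)
```

The final chunk is shorter than the others because the text does not divide evenly; a slice that runs past the end simply stops at the end.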

### Common point of confusion about slicing
People who are new to Python are often perplexed by the way that slicing works. Let's make the same chart for `last_name`:

| F | r | e | d | n | e | r |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |

Why does `last_name[0:4]` return `Fred`? Index `4` corresponds with `n`. Shouldn't `[0:4]` return `Fredn`?

The best way to think about it is that the general form `X[I:J]` says: "give me everything in object `X` from index `I` up to *but not including* index `J`." To get the last three letters, for instance:

In [13]:
last_name[4:7]

'ner'

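Slicing works this way so that the numbers stay consistent with the length of the string. `Fredner` is 7 characters long, but because we count from zero its final character sits at index `6`. Excluding the stop index means that `[4:7]` runs through the end of the string, and that any slice `X[I:J]` contains exactly `J - I` characters.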

So:

In [14]:
len('Fredner') # len() measures the length of an object. More on this later.

7

In [15]:
len(last_name)

7

In [16]:
'Fredner'[4]

'n'

In [18]:
'Fredner'[4:6]

'ne'

In [6]:
'Fredner'[4:len(last_name)]

'ner'

In [7]:
last_name[4:len('Fredner')]

'ner'
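
The relationship between slice boundaries and length can be checked directly. This is a small sketch; the names `start`, `stop`, and `piece` are introduced here for illustration.

```python
last_name = 'Fredner'

start = 4
stop = 7

piece = last_name[start:stop]
print(piece)

# Because the stop index is excluded, the slice holds stop - start characters.
print(len(piece) == stop - start)

# The final character sits at index len(last_name) - 1, so using
# len(last_name) as the stop index runs through the end of the string.
print(last_name[len(last_name) - 1])
```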

### Indexing without explicit values
You will frequently see slices written like this:

In [19]:
last_name[0:4]

'Fred'

In [77]:
last_name[:4]

'Fred'

In [20]:
last_name[4:len(last_name)]

'ner'

In [21]:
last_name[4:]

'ner'

Leaving the left side of the colon blank indicates zero, and leaving the right side blank means the slice runs through the end of the string, final element included.

That means that this also works:

In [22]:
last_name[:]

'Fredner'
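
One consequence of these defaults is that cutting a string at any position and gluing the two pieces back together returns the original. A brief sketch, with `split_point` as a hypothetical position chosen for illustration:

```python
last_name = 'Fredner'
split_point = 4  # hypothetical position chosen for illustration

left = last_name[:split_point]   # everything before index 4
right = last_name[split_point:]  # everything from index 4 onward

print(left, right)
print(left + right == last_name)
```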

## Reverse indexing
We can also index using negative numbers. Typically, this is most useful to get the last element in a string or other object:

In [24]:
'the last letter of this string is z'[-1]

'z'

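The same chart as above, drawn for `last_name` with negative indexes, would look like this:

| F | r | e | d | n | e | r |
|---|---|---|---|---|---|---|
| -7 | -6 | -5 | -4 | -3 | -2 | -1 |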

We can also slice negatively:

In [12]:
s = 'is z'
s[-4:-1]

'is '

In [13]:
s[-4:] # get the last four characters of the string, inclusive

'is z'

Again, you will primarily use negative indexing to access the *final item(s)* in a string or other object.
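
Negative indexes can be understood as counting back from the length of the string. The sketch below checks that equivalence on the same short string; `last_two` is a name introduced for illustration.

```python
s = 'is z'

# Index -1 refers to the same character as index len(s) - 1.
print(s[-1] == s[len(s) - 1])

# In general, a negative index -n is equivalent to len(s) - n.
print(s[-4] == s[len(s) - 4])

# Slicing from a negative start with no stop takes the final characters.
last_two = s[-2:]
print(last_two)
```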

# Finding substrings
We frequently need to find specific sequences of characters in a string. For instance, let's say we wanted to find the word "be" in Hamlet's famous speech:

In [28]:
ham = """
To be, or not to be, that is the question,
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them?
"""

In [29]:
ham

"\nTo be, or not to be, that is the question,\nWhether 'tis nobler in the mind to suffer\nThe slings and arrows of outrageous fortune,\nOr to take arms against a sea of troubles,\nAnd by opposing end them?\n"

We can use the `find` method to search for the substring we're interested in:

In [31]:
ham.find('be')

4

The structure of the statement above is as follows:

*object*.**method**('`argument`')

`ham` is the variable referencing the text. `find` is the method that searches a string (in this case, `ham`) for a substring. `'be'` is the string that we ask `find` to look for in `ham`.

We might write a sentence like the following to describe what is happening:

On the object `ham`, perform the method `find` with the argument `'be'`.

In [28]:
ham.find('be')

4

What does that result mean?

Whenever we're confused about the way a function like `find` works, we can use `shift + tab` to look up how it works. The printed result is called the `docstring`, which provides an explanation of the function, its arguments, and outputs.

Type `ham.find()` below, then hit `shift+tab`.

You can also invoke Jupyter's Contextual Help menu to view the docstring. Hit `command + i` or go to the Command panel and type `contextual`. With that open, select the function you want to know more about.

In [41]:
ham.find('be')

4

In [11]:
# type here, then hit shift+tab

According to the docstring, `find` gives us the `index` where the string we pass it, `'be'`, first occurs. So:

In [84]:
ham[4:6]

'be'

Of course, the speech has multiple instances of `'be'`. We can find the second one like so:

In [42]:
ham.find('be', 5)

18

What is going on here? Now we're giving `find` two arguments, `'be'` and `5`. `be` is the substring we want. `5` represents the index at which we want `find` to begin looking for instances of `be`.

Why start at `5`? We know it found one `be` at index `4`, so it makes sense to begin looking for subsequent instances at `5`.

It finds another instance at `18`, which means that:

In [43]:
ham[18:20]

'be'

Is there another `be` in this part of Hamlet's speech? Using the same logic:

In [44]:
ham.find('be', 19)

-1

As the docstring told us, `find` returns `-1` when it does not find another instance of the substring.
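
The steps above (search, then search again starting just past the previous match, until `find` returns `-1`) can be sketched as a loop. This uses a `while` loop and a list, which have not been covered yet, and it searches only the first line of the speech, so the positions differ from those in `ham`, which begins with a `\n`.

```python
line = 'To be, or not to be, that is the question'

positions = []
index = line.find('be')  # first occurrence, or -1 if there is none

while index != -1:
    positions.append(index)
    # Resume the search one character past the match we just found.
    index = line.find('be', index + 1)

print(positions)
```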

# Measuring strings
We frequently need to know how long strings and other objects are. We can find out with the `len` function:

In [46]:
len(ham)

200

`ham` is 200 characters long.

How long do you think this expression will be?

In [None]:
len(ham * 2)

If you need a hint, check out the code below:

In [47]:
'to be or not to be ' * 10 # the same principle as above

'to be or not to be to be or not to be to be or not to be to be or not to be to be or not to be to be or not to be to be or not to be to be or not to be to be or not to be to be or not to be '

# Cleaning strings
We frequently need to reformat strings in order to accurately count them. For example, when we count words, we would intuitively understand that `The` and `the` are two instances of the same word. But computers cannot: they would count those as one instance of `The` and one of `the`. To make the computer's counting reflect human counting, we often make strings uniformly lowercase.

Lowercasing everything can create problems. Suppose a novel had a character named `River`. In that case, `River` and `river` refer to two completely different things. Lowercasing everything will over-count the number of bodies of water, and under-count the number of people named River. We will study ways to deal with these edge-cases later.
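
The over-counting problem can be seen in a small made-up example. The sentence below is invented for illustration, and it uses the `count` method, which is introduced later in this lesson.

```python
sentence = 'River sat by the river and watched the river flow.'

# Before lowercasing, the name and the common noun are counted separately.
print(sentence.count('River'))   # the character's name
print(sentence.count('river'))   # the body of water

# After lowercasing, the name is folded into the count for the common noun.
lowered = sentence.lower()
print(lowered.count('river'))
```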

For now, we're going to use this as our example:

In [48]:
nasx = '''
I\'m gonna take my horse to the Old Town Road.
I\'m gonna ride \'til I can\'t no more.
'''

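Note that we use the escape character `\` before each `'` in words like "I'm" and "can't." In a string enclosed in a single pair of `'` characters, an unescaped `'` would make Python think we were closing the string. Inside a triple-quoted string like this one, the escape is optional but harmless.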

Look how Python `print`s it:

In [49]:
print(nasx)


I'm gonna take my horse to the Old Town Road.
I'm gonna ride 'til I can't no more.
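
How apostrophes need to be handled depends on which quotation marks enclose the string. The sketch below writes the same lyric three ways; the variable names are chosen only for illustration.

```python
# Inside single quotes, an apostrophe must be escaped with a backslash.
escaped = 'I\'m gonna ride \'til I can\'t no more.'

# Inside double quotes, the apostrophe needs no escape.
double_quoted = "I'm gonna ride 'til I can't no more."

# Inside triple quotes, a lone apostrophe does not close the string either.
triple_quoted = '''I'm gonna ride 'til I can't no more.'''

print(escaped == double_quoted == triple_quoted)
```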



We can convert the case of the text using the methods `lower()` and `upper()`.

In [14]:
print(nasx.lower())


i'm gonna take my horse to the old town road.
i'm gonna ride 'til i can't no more.



In [50]:
print(nasx.upper())


I'M GONNA TAKE MY HORSE TO THE OLD TOWN ROAD.
I'M GONNA RIDE 'TIL I CAN'T NO MORE.



Of course, this would also work:

In [52]:
nasx_lower = nasx.lower()
print(nasx_lower)


i'm gonna take my horse to the old town road.
i'm gonna ride 'til i can't no more.



Strings often have whitespace characters like `\n` that we don't care about for the purpose of counting words. We can get rid of extra whitespace at the beginning or end of a string like so:

In [54]:
nasx

"\nI'm gonna take my horse to the Old Town Road.\nI'm gonna ride 'til I can't no more.\n"

In [55]:
nasx.strip()

"I'm gonna take my horse to the Old Town Road.\nI'm gonna ride 'til I can't no more."

Note that this retains the `\n` character in the middle of the string, which marks the line break. If we want to get rid of that, we have to use other methods.
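
If only one end of a string needs trimming, the related methods `lstrip` and `rstrip` may be useful. This sketch uses a shortened version of the lyric and the built-in `repr` function, which shows whitespace characters like `\n` explicitly; neither `repr` nor these two methods are covered elsewhere in this lesson.

```python
lyric = "\nI'm gonna ride 'til I can't no more.\n"

# strip() removes whitespace from both ends of the string.
print(repr(lyric.strip()))

# lstrip() removes it only from the left end, rstrip() only from the right.
print(repr(lyric.lstrip()))
print(repr(lyric.rstrip()))
```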

# Replacing substrings
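Sometimes we will need to replace portions of strings. We do this with the `replace` method, which is called much like `find` but takes two arguments: the substring to look for and the string to put in its place.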

In [57]:
print(nasx.replace('horse', 'Maserati'))


I'm gonna take my Maserati to the Old Town Road.
I'm gonna ride 'til I can't no more.



If we want to make multiple changes at one time, we can call the methods *in sequence*:

In [58]:
print(nasx.replace('horse','Maserati').replace('ride','drive'))


I'm gonna take my Maserati to the Old Town Road.
I'm gonna drive 'til I can't no more.



Whitespace characters like `\n` provide a real-world example of how this can be useful for text mining. Let's say we wanted to get rid of all `\n` in Hamlet's speech:

In [35]:
ham

"\nTo be, or not to be, that is the question,\nWhether 'tis nobler in the mind to suffer\nThe slings and arrows of outrageous fortune,\nOr to take arms against a sea of troubles,\nAnd by opposing end them?\n"

In [62]:
print(ham.replace('\n',' '))

 To be, or not to be, that is the question, Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? 


In the above example I chose to `replace` `\n` with a space.

Of course, that leaves spaces at the beginning and the end. We can get rid of those with `strip`:

In [63]:
ham.replace('\n',' ').strip()

"To be, or not to be, that is the question, Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them?"

We can call any number of methods in sequence.

Stringing this many together is not necessarily good coding practice, but it works as an example:

In [64]:
ham.replace('\n',' ').replace('.',' ').replace(',',' ').replace('  ',' ').replace('?',' ').strip().lower()

"to be or not to be that is the question whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune or to take arms against a sea of troubles and by opposing end them"

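For comparison, here is one possible way to write the same cleanup in smaller steps. It is a sketch, not the approach this course prescribes: it uses a `for` loop, a list, `split`, and `join`, none of which have been introduced yet, and the names `cleaned` and `mark` are chosen for illustration.

```python
ham = """
To be, or not to be, that is the question,
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them?
"""

cleaned = ham.lower()

# Replace each punctuation mark with a space, one mark at a time.
for mark in ['.', ',', '?']:
    cleaned = cleaned.replace(mark, ' ')

# split() with no arguments breaks on any run of whitespace (including \n),
# and ' '.join() glues the words back together with single spaces.
cleaned = ' '.join(cleaned.split())

print(cleaned)
```

Unlike `replace('  ',' ')`, which only collapses pairs of spaces, the `split` and `join` step handles runs of whitespace of any length.
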
# Counting substrings
Counting words is one of our core techniques. The American author Gertrude Stein has a famous line about roses and language that we're going to use to practice counting:

In [38]:
stein = 'Rose is a rose is a rose is a rose.'

We can combine some of our techniques and add new ones to accurately count the number of roses in the line:

In [39]:
stein = stein.lower() # lowercasing to count 'Rose' as 'rose'

In [40]:
stein

'rose is a rose is a rose is a rose.'

In [41]:
stein.count('rose')

4
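
One caveat worth keeping in mind: `count` matches sequences of characters, not whole words, and it is case-sensitive. The sentence below is invented to show both effects.

```python
line = 'A rose arose among the roses, and Rose wrote prose.'

# count() is case-sensitive, so 'Rose' is not matched by 'rose'.
print(line.count('rose'))

# After lowercasing, every occurrence of the characters r-o-s-e is counted,
# including those inside 'arose', 'roses', and 'prose'.
print(line.lower().count('rose'))
```

Only two of the words in that sentence are actually "rose" or "Rose," so a substring count overstates the number of roses here.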

# Opening Files, Counting Substrings

We can also use this as a way to get very rough statistics about a text. Let's figure out how many sentences there are in the first chapter of [*Huckleberry Finn*](https://www.gutenberg.org/files/76/76-h/76-h.htm#c1).

You need to download the chapter, a file called `huck.txt`, from today's Canvas folder.

In [42]:
filepath = '/Users/e/Downloads/huck.txt'

We use the function `open` to get Python to interact with files stored on disk.

In [None]:
huck = open(filepath)

In [None]:
huck = open('/Users/e/Downloads/huck.txt') # equivalent

In [149]:
huck

<_io.TextIOWrapper name='huck.txt' mode='r' encoding='UTF-8'>

That shows that Python has opened the file.

After using Python to `open` a file, we then need to use the `read` method in order to access the text.

In [None]:
huck = huck.read()

In [None]:
huck[:50] # to see the first fifty characters of the chapter
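
A common alternative pattern for reading files is the `with` statement, which closes the file automatically once the text has been read. The sketch below is self-contained: it first writes a small stand-in file to a temporary folder, so the file name `sample.txt`, its contents, and the variable names are all hypothetical and separate from `huck.txt`.

```python
import os
import tempfile

# Create a small stand-in file so the sketch runs on any machine.
# The file name and contents are hypothetical, not the real huck.txt.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as sample_file:
    sample_file.write('A short sample sentence. And another one!')

# `with` closes the file automatically when the indented block ends.
with open(sample_path, encoding='utf-8') as text_file:
    sample_text = text_file.read()

print(sample_text[:8])    # the first eight characters
print(text_file.closed)   # the file is already closed at this point
```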

In [152]:
len(huck) # remember, len() provides the number of characters

7073

In [153]:
periods = huck.count('.') # rough measure of sentences

In [154]:
periods

66

In [155]:
huck.count('?')

1

In [156]:
huck.count('!')

5

In [157]:
huck.count(',')

92

In [158]:
huck.count(';')

18

In [159]:
spaces = huck.strip().count(' ') # counting spaces provides a very rough measure of words

In [160]:
spaces / periods # estimates average words per sentence

22.015151515151516

We will improve on the accuracy of all of these measures over time.
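
As one small step in that direction, the sentence estimate could count every mark that ends a sentence rather than periods alone. This is only a sketch on an invented sample, not the method we will settle on; the names `sample`, `sentence_endings`, and `words` are chosen for illustration.

```python
sample = 'Call me anything. Why not? It is only a name! Names change.'

# Count every mark that can end a sentence, not just periods.
sentence_endings = sample.count('.') + sample.count('?') + sample.count('!')

# Counting spaces gives a rough word count: one more word than spaces.
words = sample.strip().count(' ') + 1

print(sentence_endings)
print(words / sentence_endings)
```

This still miscounts abbreviations like "Mr." and ellipses, which is part of why these remain rough measures.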

# Summary

1. We reviewed the homework.
2. We learned about indexing and slicing.
3. We learned how to use `find`, `lower`, `count`, and `replace` to manipulate and gather data about strings.
4. We practiced assigning variables, using integers and floats, and passing filepaths to Python.