# Review from last time

In [14]:
# opening files
huck = open('/Users/e/Library/Mobile Documents/com~apple~CloudDocs/PhD/ltm/notebooks/huck.txt')

In [15]:
# read an open file object to a string
huck = huck.read()

In [16]:
huck[:50]

'CHAPTER I.\nYOU don’t know about me without you hav'

In [17]:
# general form of a function
def hours2weeks(num_hours):
    num_days = num_hours / 24
    num_weeks = num_days / 7
    return num_weeks

In [18]:
hours2weeks(500)

2.976190476190476

Remember that functions must `return` whatever we want to use elsewhere. In the above example, it is `num_weeks`.
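To see why `return` matters, here is a small sketch comparing a function that only prints its result with one that returns it. The names `hours2days_print` and `hours2days_return` are made up for this illustration:

```python
# Two hypothetical variants of the same conversion (names invented for this example)
def hours2days_print(num_hours):
    print(num_hours / 24)    # shows the value, but hands nothing back

def hours2days_return(num_hours):
    return num_hours / 24    # hands the value back to the caller

printed = hours2days_print(48)      # prints 2.0
returned = hours2days_return(48)

print(printed)     # None: there is nothing to reuse
print(returned)    # 2.0
```

Only the value stored in `returned` can be used in later calculations.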

In [19]:
# updating (incrementing) variables with +=
a = 1
a += 1
a

2

In [100]:
# decrementing variables with -=
a = 2
a -= 1
print(a)
a -= 1
print(a)
a -= 1
print(a)

1
0
-1


## String mutability
Some objects can be changed in place; some cannot. Take a look at the following example:

In [20]:
s = 'This is my string. It has capital letters in some FunNy PLaCes.'

In [21]:
s.lower()

'this is my string. it has capital letters in some funny places.'

In [22]:
s

'This is my string. It has capital letters in some FunNy PLaCes.'

String methods like `lower` return a new string and leave the original unchanged; you have to assign the result back to the variable for the change to hold. So, for instance:

In [23]:
s = s.upper()

In [24]:
s

'THIS IS MY STRING. IT HAS CAPITAL LETTERS IN SOME FUNNY PLACES.'
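The same behavior can be seen on a shorter string. This is only a sketch; the variable name `lowered` is introduced here for illustration:

```python
s = 'FunNy PLaCes'
lowered = s.lower()    # lower() builds and returns a new string

print(s)          # FunNy PLaCes (the original is untouched)
print(lowered)    # funny places

s = s.lower()     # reassigning the name is what makes the change hold
print(s)          # funny places
```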

Look up Jupyter namespace

## Declaration of variables in functions
We don't need to declare variables inside of functions for them to work. We can create them in the function declaration, and pass them as arguments:

In [25]:
def repeat_word(word, number):
    word = 'gargoyle'
    number = 30
    return word * number

In [11]:
repeat_word()

TypeError: repeat_word() missing 2 required positional arguments: 'word' and 'number'

Why doesn't the above work? Because we need to pass the function the arguments `word` and `number`. By making them abstract, we can make the function useful in more cases:

In [26]:
def repeat_word(word, number):
    return word * number

In [27]:
repeat_word(word = 'gargoyle ', number = 30)

'gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle gargoyle '

In [28]:
repeat_word('anything',5)

'anythinganythinganythinganythinganything'

# Writing a better keywords in context (KWIC) function
We want to create functions that are short, easy to understand, and widely applicable. Our goal here is to create a function that:
1. Takes any text.
2. Finds every instance of any substring (word, phrase, whatever) in that text.
3. `returns` the context in which that substring appears.

Let's improve our current example from last class:

In [29]:
def kwic(loc):
    return huck[loc-75:loc+75]

In this basic function, we ask for a location, `loc`, and return the `75` characters on either side of that location in `huck`. This works for most cases:

In [30]:
kwic(300)

'ich he stretched, but mainly he told the truth.  That is nothing.  I never seen anybody but lied one time or another, without it was Aunt Polly, or th'

But it does not work well for all:

In [31]:
kwic(10)

''

What's going on here? We can use `print` statements to see what our function is doing:

In [32]:
def kwic(loc):
    print(loc-75)
    print(loc+75)
    return huck[loc-75:loc+75]

In [33]:
kwic(10)

-65
85


''

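So we know that the expression being `returned` is `huck[-65:85]`. That gives us an empty string, `''`, because Python does not wrap around: a negative index counts back from the end of the string, so this slice starts 65 characters before the end of `huck` and stops at character 85, which comes long before the start. Either of these would be fine: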

In [16]:
huck[:85]

'CHAPTER I.\nYOU don’t know about me without you have read a book by the name of The Ad'

In [17]:
huck[-65:]

'he trees, and, sure enough, there was Tom Sawyer waiting for me.\n'

But they cannot work in combination:

In [106]:
huck[-65:85]

''

In [107]:
huck[-1:1]

''
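A short made-up string makes it easier to see how Python reads a slice with a negative start. Roughly speaking, a negative index is treated as `len(text) + index` before the slice is taken. The string `text` below is just a stand-in for `huck`:

```python
text = 'abcdefghij'    # a 10-character stand-in for huck

print(text[-3:])                 # hij: start 3 characters from the end
print(text[:4])                  # abcd
print(repr(text[-3:4]))          # '': the start (7) is already past the stop (4)
print(repr(text[len(text) - 3:4]))    # '': the same slice written with a positive start
```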

One way to fix this problem would be by setting minimum and maximum values. I use `mn` and `mx` as variable names because `min` and `max` are the names of built-in Python functions, and reusing them as variable names would hide those functions:

In [18]:
def kwic(loc):
    mn = 0 # avoid negative indexing; always start from the beginning
    mx = len(huck) # we never need to go further than the length of the text
    return huck[loc-75:loc+76]

We know what our minimum and maximum values should be and how to represent them, but we don't know how to *test* whether our value will be above or below our min/max. This is where **Boolean operators** come in.

# Booleans
We can test conditions about our values with what are called Boolean operations. Boolean operations return `True` or `False`, which are reserved words in Python. There are a large number of ways that we can get Boolean results ("bools"). Here are a few examples:

In [19]:
2 > 1

True

In [20]:
2 < 1

False

In [34]:
a = 1

In [35]:
a

1

In [36]:
b = 2

In [37]:
b

2

In [38]:
a == b

False

In [21]:
1 == 1 # == tests for equivalency

True

In [108]:
1 == 0

False

A common mistake people new to Python make is mixing up `=` and `==`. Think of `x=y` as saying "`x` takes the value of `y`." By contrast, `x==y` asks, "is `x` equal to `y`?" and returns `True` or `False`.

In [22]:
'False' == 'False'

True

In [23]:
'false' == 'False'

False

Under the hood, `True` and `False` in Python behave like the numbers `1` and `0` respectively (the `bool` type is a special kind of integer). Therefore:

In [24]:
True > 0

True

In [25]:
False == 0

True

In [26]:
False < True

True

You would never use these in this way, but it is useful to know that they work this way in case you see unexpected behaviors.
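One place where this behavior can show up is when comparisons get added together. The following is only a sketch to make the point concrete; the variable names are invented for the example:

```python
print(isinstance(True, int))    # True: a bool is a kind of integer
print(True + True)              # 2

# Adding up comparisons counts how many of them are True
word = 'banana'
num_a = (word[1] == 'a') + (word[3] == 'a') + (word[5] == 'a')
print(num_a)                    # 3
```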

## Non-equivalence
How do we test if things are *not* equal? You could ask for a response of `False` to `==`. Or:

In [27]:
False != True

True

In [28]:
2 != 2

False

You'll also need greater than or equal to, and less than or equal to:

In [29]:
1 >= 1

True

In [30]:
2 >= 1

True

In [31]:
1 <= 2

True

In [32]:
1 <= 0

False

## Testing our values
Back in our `kwic` function, we can now ask if the numbers we pass it are going to cause problems given how string slicing in Python works:

In [39]:
def kwic(loc):
    mn = 0
    mx = len(huck)
    window = 75 # setting the number of characters we want to see on either side of loc
    print('min ok?', loc - window >= mn)
    print('max ok?', loc + window <= mx)
    return huck[loc-75:loc+76]

In [34]:
kwic(5)

min ok? False
max ok? True


''

In [35]:
kwic(75)

min ok? True
max ok? True


'CHAPTER I.\nYOU don’t know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain’t no matter.  That book was m'

Now our function will tell us when there are problems, and where they are.

But wouldn't it be better if it could fix them for us?

# Logic gates and conditional operations: `if`, `and`, `or`, `not`, `else`
This is a classic computer problem: A computer asks a human to do something, the human does something, and the computer wants to know if they did what was asked. Think of when a website checks your password to make sure it has a special character, or is of a certain length. With the material in this next section, we'll know enough to write one of those!

Python makes conditional operations easy:

In [None]:
def fun(x):
    # do something
    pass

In [40]:
if 1 < 2:
    print('that was easy')

that was easy


The general form looks like this:

In [114]:
if x:
    # do something
    pass # pass is a reserved word in Python that explicitly tells the program to do nothing, without crashing
else:
    # do something different
    pass 

Note two things:
1. Like `def function(x):`, conditionals end with a `:`
2. Also like `def function(x):`, the code that runs when the condition is `True` goes in an indented block under the declaration.

In [41]:
if 'a' == 'a':
    print('hooray')

hooray


Sometimes you want the computer to do one thing `if` a condition is `True`, and something `else` in all other cases:

In [42]:
if 'a' == 'b':
    print('everything I know is a lie')
else: # for all other cases...
    print('not everything I know is a lie')

not everything I know is a lie


Python executes one line at a time, top to bottom. Here, the first `if` statement is evaluated. Depending on its result, Python will either skip the `else`, or execute it.

In [116]:
if 'a' == 'a':
    print('a is a')
else:
    print('nothing is anything')

a is a


You can check multiple conditions in sequence using multiple successive `if` statements:

In [51]:
favorite_word = 'applesauce'

In [55]:
favorite_word = 'applesauces'

In [56]:
favorite_word = 'apples'

In [57]:
if len(favorite_word) < 10:
    print('your favorite word is too short')
if len(favorite_word) == 10:
    print('your favorite word is just right')
if len(favorite_word) > 10:
    print('your favorite word is too long')

your favorite word is too short


## Multiple conditions
Sometimes you only want to take an action if multiple conditions are simultaneously true.

You can check multiple conditions simultaneously using `and` and `or` operators:

In [61]:
favorite_word = 'art'

In [63]:
if len(favorite_word) == 10 or favorite_word[0] == 'a': # checking the length of the word and the first character
    print('you picked the correct favorite word')

you picked the correct favorite word


With `and`, both conditions must be `True` for the block to execute. With `or`, only one must be `True` for the block to execute.

You can think of `or` statements as saying "if either condition 1 `or` condition 2 is `True`, then do something."

For `and` statements: "if condition 1 `and` condition 2 are both `True`, then do something."

In [64]:
if len(favorite_word) < 10 or favorite_word[0] == 'a':
    print(len(favorite_word) < 10)
    print(favorite_word[0] == 'a')
    print('see? the block ran because at least one or condition was true')

True
True
see? the block ran because at least one or condition was true
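Returning to the password check mentioned earlier, here is a minimal sketch of how `and` and `or` might combine several tests. The function name `password_ok`, the minimum length of 8, and the two accepted special characters are all assumptions made up for this example:

```python
def password_ok(password):
    long_enough = len(password) >= 8    # assumed minimum length
    # find returns -1 when the character is missing; either character will do
    has_special = password.find('!') != -1 or password.find('?') != -1
    return long_enough and has_special  # both tests must pass

print(password_ok('applesauce'))     # False: long enough, but no special character
print(password_ok('apple!'))         # False: special character, but too short
print(password_ok('applesauce!'))    # True
```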


You can also check if things are `not` the case using multiple methods:

In [65]:
if favorite_word != 'applesauce': # != reads as "is not equal to"
    print('tsk tsk')
else:
    print('we love', favorite_word)

tsk tsk


`not` can be used in much the same way:

In [66]:
favorite_word='syzygy'

In [69]:
if not favorite_word == 'applesauce':
    print('your favorite word', favorite_word, 'should be applesauce')
else:
    print('we love', favorite_word)

your favorite word syzygy should be applesauce


In [70]:
if not 2 > 1:
    print('you are dreaming')

Finally, `not` is often used to find out if variables contain values. For example:

In [73]:
favorite_word = ''
if not favorite_word:
    print('pick a favorite word already!')

pick a favorite word already!


As seen in the above example, an empty string `''` counts as `False` when Python tests it in a condition, so `not favorite_word` is `True`.

You might read this to yourself as, "`if` it is `not` the case that `favorite_word` has a value, then (`:`)..."
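To check how Python treats a particular string in a condition, `bool` can be used to ask directly. This is a small sketch; the `message` variable is introduced only for the example:

```python
print(bool(''))              # False: an empty string counts as False
print(bool(' '))             # True: a single space is still a character
print(bool('applesauce'))    # True

favorite_word = ''
if not favorite_word:        # for strings, the same test as: favorite_word == ''
    message = 'pick a favorite word already!'
else:
    message = 'we love ' + favorite_word
print(message)               # pick a favorite word already!
```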

# Let's fix our `kwic` function
With our Boolean knowledge, we can improve our KWIC function:

In [156]:
def kwic(loc):
    mn = 0
    mx = len(huck)
    window = 75
    if loc - window < mn:
        print('loc too low')
    if loc + window > mx:
        print('loc too high')
    return huck[loc - window:loc + window]

In [157]:
kwic(1)

loc too low


''

In [158]:
kwic(10000)

loc too high


''

Great! We can catch the *reason* why the function was misbehaving. But we can go one step further and have the computer fix it for us.

Fixing it will also improve the function in general. Functions work best when they are:
1. simple
2. short
3. abstract

The advantages of the first two are pretty obvious, but the third is a little more complicated.

There are a lot of things about our current `kwic` that are specific rather than abstract.

For one, we have designed it so that the variable `loc`, which the user passes to the function, *always* goes directly to the `return` statement. We can change that so that the function can intervene when necessary:

In [74]:
def kwic(loc):
    mn = 0
    mx = len(huck)
    window = 75
    start = loc - window
    stop = loc + window
    
    if start < mn:
        start = mn
        stop = loc + window
    if stop > mx:
        start = mx - window
        stop = mx
        
    return huck[start:stop]

Here, I created a series of tests using Boolean values to make sure that the function can never request characters before the start or past the end of the text. That solves many of our problems.

In [75]:
kwic(1)

'CHAPTER I.\nYOU don’t know about me without you have read a book by the name '

In [76]:
kwic(10000)

'in among the trees, and, sure enough, there was Tom Sawyer waiting for me.\n'

# Abstracting
We can make this function more abstract, and therefore more useful in a wider variety of contexts. We want to make it possible to get the context for any word in any text.

`window` is still set to a fixed value. We can give it a `default` value in the function definition, which still allows the user to change it:

In [78]:
def kwic(loc, window = 75):
    mn = 0
    mx = len(huck)
    start = loc - window
    stop = loc + window
    
    if start < mn:
        start = mn
        stop = window
    if stop > mx:
        start = mx - window
        stop = mx
        
    return huck[start:stop]

In [79]:
kwic(loc = 300, window = 200)

' Sawyer; but that ain’t no matter.  That book was made by Mr. Mark Twain, and he told the truth, mainly.  There was things which he stretched, but mainly he told the truth.  That is nothing.  I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary.  Aunt Polly—Tom’s Aunt Polly, she is—and Mary, and the Widow Douglas is all told about in that book, '

In [80]:
kwic(loc = 300)

'ich he stretched, but mainly he told the truth.  That is nothing.  I never seen anybody but lied one time or another, without it was Aunt Polly, or th'

We can give our functions multiple arguments. We can assign default values in the functions as above, but the user can override them.
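As a smaller illustration of the same idea, here is a variant of the earlier `repeat_word` with a default value. The default of `2` is an arbitrary choice for this sketch:

```python
def repeat_word(word, number = 2):    # number defaults to 2 (arbitrary choice)
    return word * number

print(repeat_word('ha'))                # haha: the default is used
print(repeat_word('ha', 3))             # hahaha: overridden by position
print(repeat_word('ha', number = 4))    # hahahaha: overridden by keyword
```

Arguments without defaults, like `word`, must come before arguments with defaults in the definition.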

So let's abstract a little further: Why should our `text` always be `huck`?

In [81]:
def kwic(loc, window = 75, text = huck):
    mn = 0
    mx = len(text)
    start = loc - window
    stop = loc + window
    
    if start < mn:
        start = mn
        stop = window
    if stop > mx:
        start = mx - window
        stop = mx
        
    return text[start:stop]

In [82]:
kwic(loc = 15, window = 10, text = 'Now this function is getting pretty abstract!')

'his function is gett'
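As an aside, the two `if` tests can also be written with Python's built-in `min` and `max` functions. This is an alternative sketch, not the version used in the rest of these notes: the name `kwic_clamped` is hypothetical, `text` is a required argument here, and near the edges it returns a shorter slice rather than a full `window` of characters.

```python
def kwic_clamped(loc, text, window = 75):
    start = max(0, loc - window)           # never earlier than the first character
    stop = min(len(text), loc + window)    # never later than the last character
    return text[start:stop]

sample = 'Now this function is getting pretty abstract!'
print(kwic_clamped(15, sample, window = 10))    # his function is gett
print(kwic_clamped(2, sample, window = 10))     # Now this fun
print(kwic_clamped(44, sample, window = 10))    # y abstract!
```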

# How to `find` specific words?
Our KWIC function is working well with locations we give it. But we ultimately want to know things about words, not character positions.

Let's write another function to `find` locations that we can pass to `kwic`.

In [83]:
def find_next(word, text = huck, loc = 0):
    return text.find(word, loc)

In [169]:
find_next('time')

326

Note that I have set default values above, but we could theoretically use `find_next` with any word, any text, and `find` from any character position.

This is an abstract way of writing the `find` methods we were doing by hand before. Because it is abstract, we can combine this inside another function to print KWICs automatically:

In [84]:
def get_kwics(word):
    print('word:', word)
    loc = find_next(word = word) # note that here I call my custom function find_next()
    print('loc:', loc)
    print('kwic:', kwic(loc = loc))

In [173]:
get_kwics(word = 'time')

word: time
loc: 326
kwic: ly he told the truth.  That is nothing.  I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary.  A


In [174]:
get_kwics(word = 'Tom')

word: Tom
loc: 97
kwic: now about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain’t no matter.  That book was made by Mr. Mark Twain


So, `get_kwics` will now allow us to see the first instance of any word in `huck`.

How can we re-write it to show us *every* instance of any word?

# `while` loops
We know that `find` returns the location (`0` or greater) of a substring every time it finds one. When it does not find the target substring, it returns `-1`.
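A quick check on a short string shows each of these cases. The string `'banana'` is just a convenient example:

```python
text = 'banana'

print(text.find('b'))        # 0: found at the very first position
print(text.find('na'))       # 2: the first match
print(text.find('na', 3))    # 4: the first match at or after position 3
print(text.find('x'))        # -1: not found
```

Note that `0` is a valid location, which is why `-1`, and not `0`, signals "not found."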

We can use that behavior to create a condition that causes a process to run repeatedly for as long as that condition stays `True`. This is called a `while` loop:

In [85]:
x = 10
while x > 0:
    print(x)
    x -= 1 # -= works the same as +=, except for subtraction.


10
9
8
7
6
5
4
3
2
1


Why is the last item `1` and not `0`? Because `print(x)` runs *before* `x -= 1` inside the loop. On the final pass, `x = 1`, which is `>0`, so Python prints `1` and then subtracts 1, leaving `x = 0`.

Then, the next time it checks, `x = 0`. `0 > 0` returns `False`, which stops the `while` loop.

The code under the `while` statement repeats until its condition becomes `False`.

Because of this, `while` loops can repeat *infinitely* if they are never falsified. This is useful in some cases, but it can also cause problems.

The code below demonstrates an infinite loop:

In [None]:
n = 0

while 'this' != 'that':
    n += 1
    print('\r{}'.format(n), end='') # this is a trick to prevent each of these numbers from writing on their own line

## Interrupting out-of-control loops
Of course it is always `True` that `'this'` does not equal `'that'`. Nothing in our program will change either of those values. So, to stop this loop, we have to interrupt it.

To do that, you must send a `KeyboardInterrupt`. Press `i` twice (`i i`) while in command mode, or go to the `Kernel` menu and select `Interrupt Kernel`.

If you've done it correctly, you'll get an error message.
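One way to keep a loop like this from running forever is to add a second condition with `and`, so that the loop also stops once a counter reaches some cap. This is a sketch; the variable `limit` and its value of `1000` are arbitrary choices for the example:

```python
n = 0
limit = 1000    # an arbitrary safety cap chosen for this example

while 'this' != 'that' and n < limit:    # the second condition eventually becomes False
    n += 1

print(n)    # 1000
```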

# All KWICs for any word
We can use the `while` loop in combination with the above functions to print all of the KWICs for a given word. First, we'll update `get_kwics` to return a `loc` value: 

In [87]:
def get_kwics(word, loc):
    print('word:', word)
    loc = find_next(word = word, loc = loc)
    print('loc:', loc)
    
    if loc != -1: # condition to only print if we have another valid instance
        print('kwic:')
        print(kwic(loc = loc))
        
    return loc # adding a return statement to search for the next instance

We need the `loc` in order to find the next instance that appears after it. As you may recall, we did this before by calling something like:
```python
string.find('word', loc + 1)
```

We can use that same logic here to print KWICs automatically:

In [88]:
loc = 0
total = 0 # to count the number of instances of the word
word = 'because'

while loc != -1:
    loc = get_kwics(word = word, loc = loc + 1) # need to add 1 here to search for the next instance
    total += 1 # to count how many instances we have
    print('\n') # just to make it a little easier to read

print('total instances of "{}": {}'.format(word,total-1)) # -1 because it loops once through with -1 as the value

word: because
loc: 2473
kwic:
en dead a considerable long time; so then I didn’t care no more about him, because I don’t take no stock in dead people.

Pretty soon I wanted to smok


word: because
loc: 3040
kwic:
d some good in it.  And she took snuff, too; of course that was all right, because she done it herself.
Her sister, Miss Watson, a tolerable slim old 


word: because
loc: 4110
kwic:
s going, so I made up my mind I wouldn’t try for it.  But I never said so, because it would only make trouble, and wouldn’t do no good.

Now she had g


word: because
loc: 4539
kwic:
o there, and she said not by a considerable sight.  I was glad about that, because I wanted him and me to be together.
Miss Watson she kept pecking at


word: because
loc: -1


total instances of "because": 4


This is getting there! We can set any word in `word` and automatically see all of the contexts in which it appears. We can customize the length, the text, and pretty much anything about it we want.
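The `total-1` adjustment is needed because the loop body runs one extra time after the last match. One alternative arrangement, sketched here on a short made-up sentence, searches once before the loop and again at the end of each pass, so the count needs no correction. The function name `count_instances` and the `sample` text are assumptions for this example:

```python
def count_instances(word, text):
    total = 0
    loc = text.find(word)                 # look for the first instance
    while loc != -1:                      # only count when something was found
        total += 1
        loc = text.find(word, loc + 1)    # look again, starting just past the last hit
    return total

sample = 'the cat sat on the mat near the hat'
print(count_instances('the', sample))    # 3
print(count_instances('at', sample))     # 4
print(count_instances('dog', sample))    # 0
```

Starting the first search at position `0` also means a match at the very beginning of the text is not skipped.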

# Summary
1. We reviewed last class's material.
2. We discussed characteristics of good functions: simple, short, abstract.
3. We learned about Booleans (`True` and `False`).
4. We learned about logic gates and control flow (`if`, `else`, `and`, etc.)
5. We used what we learned to improve our `kwic` function and automate output.
6. We learned how functions and loops can call each other.
7. We learned how to revise and improve functions iteratively, changing their behaviors to meet our needs.

# Practice
`get_kwics` can be improved further. One obvious problem is that it `prints` `word` and `loc` one more time when `loc == -1`.

How could you revise the function to prevent that behavior? Which part of the function would you have to edit to get the desired result?