# Welcome to Python!

This notebook aims to get you familiar with essential aspects of the [Python](https://www.python.org) language. Anaconda is a *distribution* of Python widely used in data science that comes with many useful packages preinstalled. Jupyter Lab, which you're using right now, is one such package.

Python is great for a lot of different tasks in computational literary studies. Combined with the `pandas` package for tabular data, it can be used to do just about anything that you would use the R language for. Especially for people who haven't programmed before, Python can be easier to learn than R.

# Learn by doing: upgrading text search
One of the best ways to learn to program is by trying to figure out particular problems. So, today we're going to create an upgraded version of the `cmd/ctrl + F` search to find sets of words within user-defined proximity of one another. That is, we'll be able to find not only our target word (e.g. "whale") but also where it co-occurs with another target word (e.g. "Ahab").

This is one implementation of *keywords in context analysis*, a common technique in computational text analysis. A "context" is usually defined as a fixed number of words, keywords, or sentences surrounding our target word(s).

## What we learn along the way

To do the above, we'll learn many of the key ideas in Python:

- strings
- integers
- variables
- reading files
- lists
- indexing
- Booleans
- `for` loops
- functions

I briefly introduce each of these, and demonstrate how they can be applied in this case.

# First steps
Of course, we have to start somewhere. Traditionally, the first program you write prints the phrase `hello world` to standard output. In Python, we write that this way:
```python
print('hello world')
```
In Jupyter Lab, we execute Python commands in code cells, like the one below. The output of each cell is printed immediately below.

In [1]:
print('hello world')

hello world


This statement consists of two parts: the function `print()` and the string `'hello world'`, which we pass as an argument to `print()`.

The text that appears below the `print()` command is the *output*.

## Documentation
Every function has supporting documentation describing what it does. Here's the relevant information for `print()`:
https://docs.python.org/3/library/functions.html#print

If you're not sure what something does, look it up!

# Python Basics

In [2]:
# Lines that begin with # are comments. The computer won't attempt to execute anything here. It's for us humans.

In [3]:
# We write numbers as you might expect
100

100

In [4]:
# Ditto for arithmetic
100 + 200

300

In [5]:
333 - 332

1

In [6]:
3 * 30

90

In [7]:
30 / 3 # note the decimal point: division yields floats

10.0

In [8]:
# characters between quotes are strings
'hello'

'hello'

In [9]:
# FYI strings can contain any sort of character, not just letters:
'h3ll0))(*^&#$)'

'h3ll0))(*^&#$)'

# Variables
So far, our code isn't that useful because we can't store our output anywhere. That's where variables come in:

In [10]:
# we wouldn't want to type this twice!
tree = """
          &&& &&  & &&
      && &\/&\|& ()|/ @, &&
      &\/(/&/&||/& /_/)_&/_&
   &() &\/&|()|/&\/ '%" & ()
  &_\_&&_\ |& |&&/&__%_/_& &&
&&   && & &| &| /& & % ()& /&&
 ()&_---()&\&\|&&-&&--%---()~
     &&     \|||
             |||
             |||
             |||
       , -=-~  .-^- _

"""

In [11]:
print(tree)


          &&& &&  & &&
      && &\/&\|& ()|/ @, &&
      &\/(/&/&||/& /_/)_&/_&
   &() &\/&|()|/&\/ '%" & ()
  &_\_&&_\ |& |&&/&__%_/_& &&
&&   && & &| &| /& & % ()& /&&
 ()&_---()&\&\|&&-&&--%---()~
     &&     \|||
             |||
             |||
             |||
       , -=-~  .-^- _




In [12]:
print('tree') # note the distinction between variables and strings

tree


Special characters such as line breaks and tabs can be written inside a string using the escape character, `\`:

In [13]:
print('here is a newline \n and there is a \t tab')

here is a newline 
 and there is a 	 tab


We can assign any object to any variable name.

But we need to be careful since variable names can be overwritten:

In [14]:
tree = 'a string that says tree'

In [15]:
tree

'a string that says tree'

In [16]:
tree = 6

In [17]:
tree

6

# Importing texts
Now, we're going to import *Moby-Dick* as a string.

For that to work, we need to have a *local copy of the file*, which we can make with Python. Run the code below to save a copy of *Moby-Dick* from Project Gutenberg to your `~/Downloads` folder.

In [18]:
# imports
from pathlib import Path
import os
import urllib.request # import the submodule explicitly; a bare `import urllib` may not load `urllib.request`

# get your home
home = str(Path.home())

# create a path to which the file will be written
moby_path = os.path.join(home, 'Downloads/moby.txt')

# location of the project gutenberg copy of the moby-dick text file
moby_url = 'https://www.gutenberg.org/files/2701/2701-0.txt'

urllib.request.urlretrieve(moby_url, moby_path)

print('Downloaded to:', moby_path)

Downloaded to: /Users/erik/Downloads/moby.txt


# Opening text files as strings

Now you have a file in your `~/Downloads` folder called `moby.txt` containing the Project Gutenberg *Moby-Dick*.

We can use Python to open that up as a string.

In [19]:
# this is where the file is on your computer:
moby_path

'/Users/erik/Downloads/moby.txt'

Note that Jupyter automatically indents under the `with`...`as` statement. Indentation has meaning in Python.

In [20]:
with open(moby_path, encoding = 'utf8') as moby: # UTF-8 is a file encoding. making this explicit can prevent errors
    text = moby.read()

This will confirm that you indeed have the right file:

In [21]:
print(text[27802:27848], '...')

.


  “Oh, the rare old Whale, mid storm and g ...
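If you'd like to see what `with open(...)` does without touching the novel, here is a minimal sketch that writes a tiny file and reads it back. The file name `sample.txt` and the variable names are made up for illustration, and the temporary folder is deleted once the block finishes.

```python
import os
import tempfile

# create a throwaway folder so we don't clutter ~/Downloads
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name

    # 'w' opens the file for writing
    with open(sample_path, 'w', encoding='utf8') as f:
        f.write('Call me Ishmael.')

    # the default mode reads the file; read() returns its contents as one string
    with open(sample_path, encoding='utf8') as f:
        sample_text = f.read()

print(sample_text)
print(f.closed)  # the file is closed automatically once the with block ends
```

The indented lines are the ones that run while the file is open, which is why the indentation matters.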


# Slicing texts
To pull that short excerpt out of the whole of *Moby-Dick*, we did something called string *slicing* with the `[x:y]` after `text`.

We can do that on any sequence in Python (strings, lists, tuples), and it will return the elements from position `x` up to, but not including, position `y`.

(Note that Python begins counting from `0`, not `1`. This takes a little getting used to.)
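Slicing is easier to see on a short string than on a whole novel. The sketch below uses a made-up variable, `sentence`, as a stand-in for `text`:

```python
# a short string stands in for the full novel here
sentence = 'Call me Ishmael.'

first_char = sentence[0]    # index 0 is the first character
first_word = sentence[0:4]  # characters 0, 1, 2, and 3 (index 4 is not included)
last_word = sentence[8:]    # leaving out the second value means "to the end"

print(first_char)
print(first_word)
print(last_word)
```

Counting the characters by hand, starting at `0`, is a good way to check that you understand why each slice returns what it does.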

## Slicing by word (tokenization)

There's a lot more to say about slicing. For today, we're going to focus on splitting the text up *by words*, which is called tokenization. There are a lot of clever ways to tokenize texts, but we're going to use a simple one: `splitting` on spaces:

In [22]:
bartleby = 'I would prefer not to.'

In [23]:
bart_words = bartleby.split(' ')

In [24]:
bart_words

['I', 'would', 'prefer', 'not', 'to.']

While this is ok for now, we could make our tokenizer a *lot* better than this.

For example, do we really want to include punctuation marks at the end of words, as in `to.`?

One better way to do this is with a package like the [Natural Language Toolkit](https://www.nltk.org/), or `nltk`, which is designed for this problem:

In [25]:
import nltk

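You will need to download the tokenizer model first. (Recent NLTK releases may ask for `punkt_tab` instead of `punkt`; if so, follow the instructions in the error message.)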

In [26]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/erik/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

This cleans up our problem with `to.`

In [27]:
nltk.word_tokenize(bartleby)

['I', 'would', 'prefer', 'not', 'to', '.']
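If you can't install `nltk`, a rough alternative is a regular expression from Python's built-in `re` module. This is only a sketch, not what `nltk` does internally, and the name `simple_tokens` is made up for illustration:

```python
import re

bartleby = 'I would prefer not to.'

# \w+      matches a run of letters, digits, or underscores (a "word")
# [^\w\s]  matches one character that is neither a word character nor whitespace,
#          which in practice means a punctuation mark
simple_tokens = re.findall(r'\w+|[^\w\s]', bartleby)

print(simple_tokens)
```

This gives the same tokens as `nltk.word_tokenize` for this sentence, but it is cruder in general: a contraction such as `don't` would be split into three pieces.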

# Lists

`bart_words` is a variable containing a `split()` version of `bartleby`. `split()` returns a `list`.

`Lists` are Python objects like integers or strings. They are ordered groups of objects separated by commas.

In [28]:
# list of integers
[1,2,3]

[1, 2, 3]

In [29]:
# list of strings
['a','b','c']

['a', 'b', 'c']

In [30]:
# lists can contain any object in any combination
['a', 2 * 3, 'c']

['a', 6, 'c']

In [31]:
# they can even contain other lists:
[1, 2, [3,4]]

[1, 2, [3, 4]]

You can pick out individual *elements* of a list by *indexing* with `[]` notation:

In [32]:
['aardvark', 'berry', 'capricorn'][0] # remember Python starts counting from 0

'aardvark'

In [33]:
['aardvark', 'berry', 'capricorn'][1]

'berry'

In [34]:
['aardvark', 'berry', 'capricorn'][2]

'capricorn'

You can also get multiple elements in sequence by using a `:` between values, as we did to get an excerpt of *Moby-Dick*:

In [35]:
['aardvark', 'berry', 'capricorn'][0:2]

['aardvark', 'berry']

Note that a slice `[m:n]` stops at index $n-1$, so `[0:2]` means "up to but not including element `2` in the sequence." Element `2` is still in the list if we ask for it directly:

In [36]:
['aardvark', 'berry', 'capricorn'][2]

'capricorn'
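Two more bits of slicing notation come up later in this notebook: negative indices, which count from the end of a list, and slices with one value left out. A small sketch, using a made-up list name `my_words`:

```python
my_words = ['aardvark', 'berry', 'capricorn']

last_item = my_words[-1]   # -1 is the last element
last_two = my_words[-2:]   # from the second-to-last element to the end
first_two = my_words[:2]   # from the start up to (not including) index 2

print(last_item)
print(last_two)
print(first_two)
```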

# Putting it all together
So, we now have a way to make a list of words with `split`, and a way to slice texts up into sets of sequential words using our indices `[]`.

Let's make *Moby-Dick* into a list of tokens:

In [37]:
text[:1000]

'\ufeffThe Project Gutenberg eBook of Moby-Dick; or The Whale, by Herman Melville\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org. If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.\n\nTitle: Moby-Dick; or The Whale\n\nAuthor: Herman Melville\n\nRelease Date: June, 2001 [eBook #2701]\n[Most recently updated: August 18, 2021]\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\nProduced by: Daniel Lazarus, Jonesey, and David Widger\n\n*** START OF THE PROJECT GUTENBERG EBOOK MOBY-DICK; OR THE WHALE ***\n\n\n\n\nMOBY-DICK;\n\nor, THE WHALE.\n\nBy Herman Melville\n\n\n\nCONTENTS\n\nETYMOLOGY.\n\nEXTRACTS (Supplied by a Sub-Sub-Librarian).\

In [38]:
# we named our Moby-Dick variable `text` above
moby_toks = nltk.word_tokenize(text)

Where is that famous opening sentence?

In [39]:
moby_toks[5692:5695]

[',', 'the', 'rare']

Now, we can put these ideas together in a **loop** to find keywords and their contexts.

# Loops

Computers are dumb. But they're great at doing the same dumb thing many times quickly. Which is why loops are great.

There are several types of loops in Python. We're going to start with a `for` loop which performs an action on every element of an iterable in sequence:

In [40]:
my_list = ['aardvark', 'berry', 'capricorn']

In [41]:
my_list

['aardvark', 'berry', 'capricorn']

In [42]:
for word in my_list:
    print('I love', word)

I love aardvark
I love berry
I love capricorn


In [43]:
for x in [6,18,2394]:
    print(x / 99)

0.06060606060606061
0.18181818181818182
24.181818181818183


The general form of the `for` loop is as follows:
```python
for element in my_iterable:
    # do something to each element
```
Python has a handy function called `enumerate()` for getting the indices of an iterable object as we go. We use that like this:

```python
for index, element in enumerate(my_iterable):
    # do something with the element and/or its index
```

Let's apply that structure to our list above:

In [44]:
for index, word in enumerate(my_list):
    print(word, 'is at index \t', index) # \t is the equivalent of tab

aardvark is at index 	 0
berry is at index 	 1
capricorn is at index 	 2


In [45]:
# note that these indices are the same as the ones we get with [] notation:
my_list[2]

'capricorn'

With a `for` loop, we can check every word in *Moby-Dick* to see which indices correspond to our keyword.

# Testing words with Boolean logic

We're going to check to see if each word is the same as our search term. To do that, we need to use Boolean logic, which returns the value `True` or `False` depending on the result of a query:

`==` tests to see if two objects are equivalent:

In [46]:
'a' == 'b'

False

`True` and `False` are reserved words in Python with special meanings. We'll use them to decide `if` we should act on a particular word.

In [47]:
'a' == 'A'

False

In [48]:
'a' == 'a'

True

In [49]:
'Ahab' == 'Ahab'

True

There are many other comparison operators in Python that return Booleans (such as `!=`, `<`, and `>`), but `==` is the only one we'll be using for now.
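For reference, here is a quick sketch of a few other comparisons, plus one trick that matters for text search: because `'a' == 'A'` is `False`, lowercasing a word with the string method `.lower()` before comparing makes the test ignore capitalization. The variable names are made up for illustration.

```python
not_equal = 'a' != 'b'   # != means "is not equal to"
less_than = 3 < 5        # < and > compare sizes

# 'Whale' and 'whale' are different strings...
exact_match = 'Whale' == 'whale'
# ...unless we lowercase the first one before comparing
loose_match = 'Whale'.lower() == 'whale'

print(not_equal, less_than, exact_match, loose_match)
```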

## Conditionals

Booleans allow us to take specific actions `if` certain conditions are `True`:

In [50]:
today = 'Friday'

In [51]:
today == 'Friday' #== checks for equivalence

True

In [52]:
if today == 'Friday':
    print('TGIF')

TGIF


Just like a `for` loop, we indent under an `if` statement. (Jupyter Lab handles this for you.)

If a condition is `False` and you don't tell it to do anything else, Python does nothing:

In [53]:
today = 'Thursday'

In [54]:
today == 'Friday'

False

In [55]:
# this does nothing because the condition returns False
if today == 'Friday':
    print('TGIF')
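If you do want something to happen when the condition is `False`, you can add an `else` branch. A minimal sketch (the `message` variable is made up for illustration):

```python
today = 'Thursday'

if today == 'Friday':
    message = 'TGIF'
else:
    # this branch runs only when the condition above is False
    message = 'Not Friday yet'

print(message)
```

We won't need `else` for our search, but it is the natural next step once `if` makes sense.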

## Using a `for` loop, Booleans, and `if` to check our words
This loop prints the index of *every* token among the first 1,000 in *Moby-Dick* that is exactly the same as the string `'Ahab'`:

In [56]:
for index, word in enumerate(moby_toks[:1000]): # note that we are only checking the first 1,000 words
    if word == 'Ahab':
        print('The word Ahab appears at index number', index)

The word Ahab appears at index number 351
The word Ahab appears at index number 357
The word Ahab appears at index number 487
The word Ahab appears at index number 942
The word Ahab appears at index number 956
The word Ahab appears at index number 964


## `appending` our results to lists
It's not helpful to just `print` these. Instead, we're going to collect each of these indices in a new list using the `append` method, which adds the element you pass it to the end of an existing list:

In [57]:
# initializing an empty list with []
new_list = []

In [58]:
# empty list
new_list

[]

In [59]:
new_list.append('a')

In [60]:
# no longer empty!
new_list

['a']

In [61]:
# append updates the list in place:
new_list.append('b')

In [62]:
new_list

['a', 'b']

## `append` in a `for` loop
Using the same logic, we can `append` all of those Ahab locations we used `print()` to see before:

In [63]:
ahabs = []

for index, word in enumerate(moby_toks):
    if word == 'Ahab':
        ahabs.append(index)

In [64]:
# just looking at the first 10 indices
ahabs[:10]

[351, 357, 487, 942, 956, 964, 36713, 36720, 36735, 36860]
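Once the loop-and-`append` pattern feels comfortable, you may run into the same idea written on one line as a *list comprehension*. The sketch below uses a small made-up token list, `sample_toks`, instead of the whole novel, and builds the same list both ways:

```python
sample_toks = ['Ahab', 'saw', 'the', 'whale', '.', 'Ahab', 'shouted', '.']

# the loop version, as in the cell above
loop_positions = []
for index, word in enumerate(sample_toks):
    if word == 'Ahab':
        loop_positions.append(index)

# the same logic as a list comprehension
ahab_positions = [index for index, word in enumerate(sample_toks) if word == 'Ahab']

print(loop_positions)
print(ahab_positions)
```

Both approaches are fine; the loop is easier to read while you are learning.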

# From indices to contexts
Now that we have our indices, we can collect their contexts using the same logic.

Let's start by looking at one context for the first instance of `Ahab`:

In [65]:
ahabs[0]

351

In [66]:
moby_toks[311]

'22'

In [67]:
moby_toks[301:321] # +/- 10 words as context

['All',
 'Astir',
 '.',
 'CHAPTER',
 '21',
 '.',
 'Going',
 'Aboard',
 '.',
 'CHAPTER',
 '22',
 '.',
 'Merry',
 'Christmas',
 '.',
 'CHAPTER',
 '23',
 '.',
 'The',
 'Lee']

The first instance appears to be in the Table of Contents.

We're going to use `append` to gather up our results in a list of lists:

In [68]:
ahab_contexts = []

for index in ahabs:
    context = moby_toks[index - 10:index + 10] # the 10 tokens before the index, the keyword itself, and the 9 tokens after it
    ahab_contexts.append(context)

In [69]:
len(ahab_contexts)

493

In [70]:
# let's see the last two contexts (spoilers)
ahab_contexts[-2:]

[['and',
  'his',
  'whole',
  'captive',
  'form',
  'folded',
  'in',
  'the',
  'flag',
  'of',
  'Ahab',
  ',',
  'went',
  'down',
  'with',
  'his',
  'ship',
  ',',
  'which',
  ','],
 ['he',
  'whom',
  'the',
  'Fates',
  'ordained',
  'to',
  'take',
  'the',
  'place',
  'of',
  'Ahab',
  '’',
  's',
  'bowsman',
  ',',
  'when',
  'that',
  'bowsman',
  'assumed',
  'the']]

Great! We have a list of contexts based on the keyword `Ahab` from across all of *Moby-Dick*.
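One edge case is worth knowing about: if a keyword sits within the first few tokens of a text, `index - 10` becomes negative, and Python reads a negative index as counting from the *end* of the list, so the slice can come back empty. That doesn't affect us here, since the first `Ahab` is at index 351, but a sketch of one way to guard against it looks like this. The function name `safe_context` and the list `sample_toks` are made up for illustration.

```python
def safe_context(tokens, index, window):
    # never let the start of the slice drop below 0
    start = max(0, index - window)
    # + 1 so the slice includes `window` tokens after the keyword
    stop = index + window + 1
    return tokens[start:stop]

sample_toks = ['Ahab', 'saw', 'the', 'white', 'whale', 'rise']

unsafe = sample_toks[0 - 2:0 + 2]      # same as sample_toks[-2:2]
safe = safe_context(sample_toks, 0, 2)

print(unsafe)
print(safe)
```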

# Filtering our contexts

So far, we have basically recreated `ctrl/cmd + F`.

Now we're going to upgrade it. We'll check each of these contexts for a **second** word. This will allow us to see how often "Ahab" and any other word co-occur within $n$ words.

One other operation that would be helpful for this is the membership operator `in`, which returns `True` if an element appears in a collection such as a list and `False` otherwise. For instance:

In [71]:
'a' in ['a','b','c']

True

In [72]:
'd' in ['a','b','c']

False

In [73]:
# note that there must be an exact match to return True
'a' in ['aa','aaa','aaaa']

False
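A caution: `in` behaves differently on strings than on lists. With a list, it looks for an element that matches exactly; with a string, it looks for a *substring*. A short sketch of the difference (variable names are made up for illustration):

```python
in_list = 'whale' in ['whales', 'whaling']  # no element is exactly 'whale'
in_string = 'whale' in 'whales'             # 'whale' is a substring of 'whales'
letter_in_word = 'a' in 'aardvark'

print(in_list, in_string, letter_in_word)
```

Since our contexts are lists of tokens, our search only counts exact matches.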

So, applying that `in` logic to what we have, we can see every context that contains both words:

In [74]:
ahab_whale = [] # list to store co-occurrences of Ahab with whale

for context in ahab_contexts:
    if 'whale' in context:
        ahab_whale.append(context)

How often do they co-occur? We can use `len()` to measure the number of items in a list:

In [75]:
len([1, 2, 3])

3

In [76]:
# Ahab and Whale co-occur 24 times
len(ahab_whale)

24

In [77]:
# the first time
ahab_whale[0]

['their',
 'own',
 'peculiar',
 'quarters',
 '.',
 'In',
 'this',
 'one',
 'matter',
 ',',
 'Ahab',
 'seemed',
 'no',
 'exception',
 'to',
 'most',
 'American',
 'whale',
 'captains',
 ',']

## Making lists of words readable with `join()`
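That is what we're looking for, but it's a little hard to read. Let's clean it up.

`join` is a string method: the string you call it on becomes the separator placed between the elements of the list you pass in. You use `join` like this: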

```python
' '.join(my_list)
```

So, if you want to recombine each list element with a single space between each, you would write:

In [78]:
' '.join(ahab_whale[0])

'their own peculiar quarters . In this one matter , Ahab seemed no exception to most American whale captains ,'
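To see that the string in front of `.join` really is the separator, here is a small sketch with a made-up list, `words`, joined three different ways:

```python
words = ['I', 'would', 'prefer', 'not', 'to']

with_spaces = ' '.join(words)   # a single space between elements
with_dashes = '-'.join(words)   # a hyphen between elements
no_separator = ''.join(words)   # an empty string: nothing between elements

print(with_spaces)
print(with_dashes)
print(no_separator)
```

Note that `join` expects every element of the list to be a string.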

# Abstracting out to functions

We know how to make this work for one instance, `Ahab` and `whale`.

Now, we just need to abstract it so that we can use it for *any* two words, and any amount of context. This will let us produce tons of results quickly.

In Python, we abstract things to make them reusable with *functions*. The general form of a function is as follows:

```python
def my_function(my_object):
    # do something
    return my_value
```

So for example:

In [79]:
def subtract_three(integer):
    value = integer - 3
    return value

I can run my new function just like any other:
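Here is a minimal sketch of what calling it looks like. The function is repeated so the example runs on its own, and the name `result` is made up for illustration:

```python
def subtract_three(integer):
    value = integer - 3
    return value

# call the function by name, passing the argument in parentheses
result = subtract_three(10)

print(result)
```

Whatever the function `return`s can be stored in a variable, just like the output of any built-in function.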

# `improved_find`
We can apply the same logic to each of the steps we took above to create a few functions that comprise our upgraded search function.

I'm putting each of the pieces we did above into functions, and then calling them in sequence with a wrapper function.

In [80]:
def make_tokens(text_path):
    import nltk
    
    with open(text_path, encoding = 'utf8') as f:
        text = f.read()
    
    my_tokens = nltk.word_tokenize(text)
    
    return my_tokens

In [81]:
def get_indices(my_tokens, my_word):
    indices = []
    
    for index, word in enumerate(my_tokens):
        if word == my_word:
            indices.append(index)
            
    return indices

In [82]:
def get_contexts(my_indices, my_tokens, window):
    contexts = []
    
    for index in my_indices:
        context = my_tokens[index - window:index + window + 1]
        contexts.append(context)
    
    return contexts

In [83]:
def subset_contexts(my_contexts, next_word):
    results = []
    for context in my_contexts:
        if next_word in context:
            results.append(context)
    
    clean_results = []
    for result in results:
        cleaned = ' '.join(result)
        clean_results.append(cleaned)
    
    return clean_results

In [84]:
def improved_find(text_path, word_1, word_2, window):
    '''This function takes a text path and user-defined parameters.
    It returns a list of strings where the target words co-occur within the text.'''
    tokens = make_tokens(text_path)
    indices = get_indices(tokens, word_1)
    contexts = get_contexts(indices, tokens, window)
    results = subset_contexts(contexts, word_2)
    
    print(f'{word_1} and {word_2} co-occur {len(results)} times.')
    print('*' * 80)
    
    for result in results:
        print(result)
        print('-' * 80)

`improved_find` calls all of the above functions in order, and `prints` a nice result:

In [85]:
improved_find(moby_path, 'whale', 'Queequeg', 10)

whale and Queequeg co-occur 6 times.
********************************************************************************
Queequeg no kill-e so small-e fish-e ; Queequeg kill-e big whale ! ” “ Look you , ” roared the Captain
--------------------------------------------------------------------------------
as if it were the lower jaw of an exasperated whale . In the midst of this consternation , Queequeg dropped
--------------------------------------------------------------------------------
that I would often jerk poor Queequeg from between the whale and the ship—where he would occasionally fall , from the
--------------------------------------------------------------------------------
jalap to Queequeg , there , this instant off the whale . Is the steward an apothecary , sir ? and
--------------------------------------------------------------------------------
said Queequeg , pointing down . As when the stricken whale , that from the tub has reeled out hundreds of
-------------------------

# Challenges for today

## 1. Apply to a text of your choosing

Use the `improved_find` function to look for related words in a text of your choosing.

You need to get a copy of your text in plain-text format (i.e. `.txt` or similar).

## 2. Explanatory comments
When learning to program, it can be good practice to write comments that explain what each line and each element of a program does in order to make sure that you understand how it works. To take our example from above:

In [86]:
def subtract_three(integer): # define a new function, subtract_three, that takes one argument, integer
    value = integer - 3 # create a new variable, value, that holds integer minus three
    return value # hand value back to whatever code called the function

Using our remaining time, comment each line of the `improved_find` function, and the other functions on which it depends.

## 3. Improve `improved_find`

We could improve this search method in several ways:

- Improve the tokenization process to account for punctuation differently.
- Adjust the functions to make it possible to search for *multiple* words in proximity to the target word
- Add either/or logic, i.e. "Ahab `or` Queequeg within 10 words of whale."
- Added efficiency: It would be possible to write this program more efficiently, using fewer functions and system resources. How could we do that? One obvious way to start would be to prevent the program from re-tokenizing the source text *every time it runs*.
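As a starting point for a couple of these ideas, here is one possible sketch, not a finished solution. It assumes you tokenize the text *once* and pass the token list in, and it implements the either/or case with the built-in `any()`. The function name `find_near`, its parameters, and the sample sentence are all made up for illustration; `continue` simply skips to the next turn of the loop.

```python
def find_near(tokens, target, neighbors, window):
    """Return contexts (as strings) where `target` has at least one
    of the words in `neighbors` within `window` tokens of it."""
    results = []
    for index, word in enumerate(tokens):
        if word != target:
            continue  # not our target word, move on to the next token
        start = max(0, index - window)  # avoid negative indices near the start
        context = tokens[start:index + window + 1]
        # any() is True if at least one neighbor appears in the context
        if any(neighbor in context for neighbor in neighbors):
            results.append(' '.join(context))
    return results

# a toy sentence stands in for the tokenized novel
sample_toks = 'Ahab watched the whale while Queequeg slept below deck far from any whale'.split(' ')

hits = find_near(sample_toks, 'whale', ['Ahab', 'Queequeg'], 3)
print(hits)
```

Because the function takes tokens rather than a file path, you could run many searches on `moby_toks` without re-reading or re-tokenizing the file.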

# Pro Tip: Jupyter Shortcuts
You can use keyboard shortcuts to control a lot of different parts of the Jupyter notebook. They are much faster than using the cursor. Memorizing a few will save you a lot of time!

| command | description |
|----------|-----------------------|
| `enter` | enter edit mode for the selected cell |
| `esc` | exit edit mode for the selected cell |
| `shift + enter` | execute the code inside of the cell |
| `a` | insert a new cell above your current cell |
| `b` | insert a new cell below your current cell |
| `j` | move down one cell |
| `k` | move up one cell |
| `dd` | delete the selected cell |