# Welcome to Python!

This notebook aims to get you familiar with essential aspects of the [Python](https://www.python.org) language. Anaconda is a *distribution* of Python widely used in data science that comes with many useful packages preinstalled. Jupyter Lab, which you're using right now, is one such package.

Python is great for a lot of different tasks in computational literary studies. Combined with the `pandas` package for tabular data, it can be used to do just about anything that you would use the R language for. And many people who haven't written code before find Python easier to pick up than R.

# Learn by doing: upgrading text search
One of the best ways to learn to program is by trying to figure out particular problems. So, today we're going to create an upgraded version of the `cmd/ctrl + F` search to find sets of words within user-defined proximity of one another. That is, we'll be able to find not only our target word (e.g. "whale") but also where it co-occurs with another target word (e.g. "Ahab").

This is one type of keyword-in-context analysis, where a "context" is usually defined as a fixed number of words or sentences surrounding our target word(s).

## What we learn along the way

To do the above, we'll learn many of the key ideas in Python:

- strings
- integers
- variables
- reading files
- lists
- indexing
- Booleans
- `for` loops
- functions

I briefly introduce each of these, and demonstrate how they can be applied in this case.

# First steps: `hello world`
Of course, we have to start somewhere.

Traditionally, the first program you write prints the phrase `hello world` to standard output. In Python, we write that this way:
```python
print('hello world')
```
In Jupyter Lab, we execute Python commands in code cells, like the one below. The output of each cell is printed immediately below.

In [95]:
print('hello world')

hello world


This statement is made up of two parts: the function `print()` and the string `'hello world'`, which we pass as an argument to `print()`.

The text that appears below the `print()` command is the *output*.

# Python Basics

In [96]:
# Lines that begin with # are comments. The computer won't attempt to execute anything here. It's for us humans.

In [97]:
# We write numbers as you might expect
100

100

In [98]:
# Ditto for arithmetic
100 + 200

300

In [99]:
333 - 332

1

In [100]:
3 * 30

90

In [101]:
30 / 3

10.0

In [8]:
# characters between quotes are strings
'hello'

'hello'

In [9]:
# FYI strings can contain any sort of character, not just letters:
'h3ll0))(*^&#$)'

'h3ll0))(*^&#$)'

# Variables
Our code so far isn't that useful because we can't store our outputs anywhere. That's where variables come in:

In [115]:
print('this')

this


In [104]:
# we wouldn't want to type this twice!
tree = """
          &&& &&  & &&
      && &\/&\|& ()|/ @, &&
      &\/(/&/&||/& /_/)_&/_&
   &() &\/&|()|/&\/ '%" & ()
  &_\_&&_\ |& |&&/&__%_/_& &&
&&   && & &| &| /& & % ()& /&&
 ()&_---()&\&\|&&-&&--%---()~
     &&     \|||
             |||
             |||
             |||
       , -=-~  .-^- _

"""

In [118]:
print('tree')

tree


In [119]:
print(234)

234


In [126]:
'Don\'t tell me what to do'

"Don't tell me what to do"

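The backslash (`\`) is the escape character. It tells Python to treat the next character specially: `\'` is a literal quote mark inside a string, and `\n` is a line break.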

In [128]:
print('\nhere\nthere')


here
there


In [106]:
print(tree)


          &&& &&  & &&
      && &\/&\|& ()|/ @, &&
      &\/(/&/&||/& /_/)_&/_&
   &() &\/&|()|/&\/ '%" & ()
  &_\_&&_\ |& |&&/&__%_/_& &&
&&   && & &| &| /& & % ()& /&&
 ()&_---()&\&\|&&-&&--%---()~
     &&     \|||
             |||
             |||
             |||
       , -=-~  .-^- _




We can assign any object to any variable name. But we need to be careful since variable names can be overwritten:

In [111]:
tree = 'a string that says tree'

In [112]:
tree

'a string that says tree'

In [113]:
tree = 6

In [114]:
tree

6

# Importing texts
Now, we're going to import *Moby-Dick* as a string.

For that to work, we need to have a *local copy of the file*, which we can make with Python. Run the code below to save a copy of *Moby-Dick* to your `~/Downloads` folder.

In [129]:
# imports
from pathlib import Path
import os
import urllib.request # the request submodule has to be imported explicitly

# get your home
home = str(Path.home())

# create a path to which the file will be written
moby_path = os.path.join(home, 'Downloads', 'moby.txt')

# location of the project gutenberg copy of the moby-dick text file
moby_url = 'https://www.gutenberg.org/files/2701/2701-0.txt'

urllib.request.urlretrieve(moby_url, moby_path)

print('Downloaded to:', moby_path)

Downloaded to: /Users/e/Downloads/moby.txt


# Opening text files as strings

Now you have a file in your `~/Downloads` folder called `moby.txt` containing the Project Gutenberg *Moby-Dick*.

We can use Python to open that up as a string.

Note that Jupyter automatically indents under the `with`...`as` statement. Indentation has meaning in Python.

In [130]:
# this is where the file is on your computer:
moby_path

'/Users/e/Downloads/moby.txt'

In [131]:
with open(moby_path, encoding = 'utf8') as moby: # UTF-8 is a file encoding. making this explicit can prevent errors
    text = moby.read()

This will confirm that you indeed have the right file:

In [132]:
print(text[27802:27848], '...')

Call me Ishmael. Some years ago—never mind how ...


# Slicing texts
To get "Call me Ishmael" out of the whole of *Moby-Dick*, we did something called string slicing.

We can do that on any sequence in Python (strings and lists, for example), and it will return the elements in order.

(Note that Python begins counting from `0`, not `1`. This takes a little getting used to.)
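
Here is a small sketch of slicing on a short string, so the counting is easy to check by hand. The variable names (`line`, `first_char`, and so on) are just made up for this illustration.

```python
# A short example string (the novel's famous opening line)
line = 'Call me Ishmael.'

first_char = line[0]     # counting starts at 0, so this is 'C'
first_word = line[0:4]   # characters 0, 1, 2, 3 (stops before index 4)
name = line[8:15]        # characters 8 through 14

print(first_char)
print(first_word)
print(name)
```

The slice `text[27802:27848]` above works the same way, just with much bigger numbers.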

## Slicing by word (tokenization)

There's a lot more to say about slicing. For today, we're going to focus on splitting the text up *by words*, which is called tokenization. There are a lot of clever ways to tokenize texts, but we're going to use a ready-made one: the `word_tokenize` function from the `nltk` (Natural Language Toolkit) package:

In [137]:
bartleby = 'I would prefer not to.'

In [138]:
import nltk

In [139]:
nltk.word_tokenize(bartleby)

['I', 'would', 'prefer', 'not', 'to', '.']

In [140]:
# you can assign the outputs of statements to variables:
bart_words = nltk.word_tokenize(bartleby)

In [141]:
bart_words

['I', 'would', 'prefer', 'not', 'to', '.']

*Aside: We could make our tokenizer a lot better than this. For example, do we really want to count punctuation marks as "words?"*
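
As a rough sketch of one way to handle that, the snippet below keeps only tokens made up entirely of letters. It borrows ideas (`for` loops, `if`, and `append`) that are introduced later in this notebook, so feel free to come back to it. The name `words_only` is made up for this example, and `isalpha()` is a deliberately blunt filter: it would also throw away tokens like `'27'` or `"n't"`.

```python
# Tokens copied from the output above so this example stands on its own
bart_words = ['I', 'would', 'prefer', 'not', 'to', '.']

words_only = []  # start with an empty list
for token in bart_words:
    # isalpha() is True only if every character in the token is a letter
    if token.isalpha():
        words_only.append(token)

print(words_only)
```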

# Lists

What is `bart_words`?

`word_tokenize` splits up `bartleby` into a **list** of words, indicated by `[]`. We're going to do the same thing to all of *Moby-Dick*.

Lists are an object type in Python, like integers or strings. They are ordered groups of objects separated by commas and enclosed in square brackets (`[]`).

In [142]:
# list of integers
[1,2,3]

[1, 2, 3]

In [143]:
# list of strings
['a','b','c']

['a', 'b', 'c']

In [144]:
[[1,2,],['a','b']]

[[1, 2], ['a', 'b']]

You can access individual *elements* in lists using `[]` notation, since every element in a list has a numbered position (its index):

In [145]:
['aardvark', 'berry', 'capricorn'][0]

'aardvark'

In [146]:
['aardvark', 'berry', 'capricorn'][1]

'berry'

In [147]:
['aardvark', 'berry', 'capricorn'][2]

'capricorn'

You can also get multiple elements in sequence by using a `:` between values:

In [38]:
['aardvark', 'berry', 'capricorn'][0:2]

['aardvark', 'berry']

Note that the second value in an expression like `[0:2]` goes to `n-1`, i.e. "up to but not including element 2 in the sequence."
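
Two shorthand forms of this notation show up later in the notebook, so here is a brief sketch of them. You can leave out the start or the end of a slice, and you can use negative numbers to count from the end of the list. The variable names are made up for this illustration.

```python
animals = ['aardvark', 'berry', 'capricorn']

first_two = animals[:2]    # leaving out the start means "from the beginning"
from_second = animals[1:]  # leaving out the end means "through the last element"
last_one = animals[-1]     # negative indices count backwards from the end

print(first_two)
print(from_second)
print(last_one)
```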

# Putting it all together
So, we now have a way to make a list of words with `nltk`, and a way to slice lists into runs of sequential words using `[]` notation.

Let's make *Moby-Dick* into a list of tokens:

In [150]:
# word_tokenize relies on nltk's punkt tokenizer data, so make sure it is downloaded first
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/e/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [151]:
# we named our Moby-Dick variable "text" above
moby_toks = nltk.word_tokenize(text)

In [152]:
moby_toks[5696:5699]

['Call', 'me', 'Ishmael']

Now, we can put these ideas together in a **loop** to find keywords and their contexts.

# Loops

Computers are dumb. But they're great at doing the same dumb thing many times quickly. Which is why loops are great.

There are several types of loops in Python. We're going to start with a `for` loop which performs an action on every element of an iterable in sequence:

In [154]:
my_list = ['aardvark', 'berry', 'capricorn']

In [155]:
my_list

['aardvark', 'berry', 'capricorn']

In [158]:
for y in my_list:
    print('I love', y)

I love aardvark
I love berry
I love capricorn


In [156]:
for word in my_list:
    print('I love', word)

I love aardvark
I love berry
I love capricorn


In [159]:
for x in [6,18,2394]:
    print(x * 100)

600
1800
239400


The general form of the `for` loop is as follows:
```python
for element in my_iterable:
    # do something to each element
```
Python has a handy function called `enumerate()` for getting the indices of an iterable object as we go. We use that like this:

```python
for index, element in enumerate(my_iterable):
    # do something with the element and/or its index
```

Let's apply that structure to our list above:

In [160]:
for index, word in enumerate(my_list):
    print(word, 'is at index \t', index) #\t is the equivalent of tab

aardvark is at index 	 0
berry is at index 	 1
capricorn is at index 	 2


In [53]:
# note that these indices are the same as the ones we get with [] notation:
my_list[2]

'capricorn'

With a `for` loop, we can check every word in *Moby-Dick* to see which indices correspond to our keyword.

# Testing words with Boolean logic

We're going to check to see if each word is the same as our search term. To do that, we need to use Boolean logic, which returns the value `True` or `False` depending on the result of a query:

`==` tests whether two objects are equal:

In [161]:
'a' == 'b'

False

`True` and `False` are reserved words in Python with special meanings. We'll use them to decide `if` we should act on a particular word.

In [162]:
'a' == 'A'

False

In [163]:
'a' == 'a'

True

In [164]:
'Ahab' == 'Ahab'

True

There are other comparison operators in Python (such as `!=`, `<`, and `>`), but `==` is the only one we'll be using for now.
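
For reference, here is a quick sketch of a few other comparisons. It also shows one common way to compare words while ignoring capitalization, using the string method `lower()`. The variable names are made up for this illustration.

```python
not_equal = 'a' != 'b'   # != means "is not equal to"
smaller = 3 < 10         # < means "is less than"

# 'AHAB' and 'ahab' are different strings, but lowercasing both makes them match
same_ignoring_case = 'AHAB'.lower() == 'ahab'

print(not_equal, smaller, same_ignoring_case)
```

All three comparisons come out `True`.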

## Conditionals

Booleans allow us to take specific actions `if` and only if certain conditions are `True`:

In [165]:
today = 'Friday'

In [166]:
today == 'Friday' #== checks for equivalence

True

In [167]:
if today == 'Friday':
    print('TGIF')

TGIF


Note that, just like a `for` loop, we indent under an `if` statement.

If a condition is `False` and you don't tell it to do anything else, Python does nothing:

In [168]:
today = 'Thursday'

In [169]:
today == 'Friday'

False

In [170]:
# this does nothing because the condition returns False
if today == 'Friday':
    print('TGIF')
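
If you do want something to happen when the condition is `False`, you can add an `else` branch. This is a minimal sketch. The `message` variable and its wording are made up for this example.

```python
today = 'Thursday'

if today == 'Friday':
    message = 'TGIF'
else:
    # this branch runs only when the condition above is False
    message = 'Not Friday yet'

print(message)
```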

## Using a `for` loop, Booleans, and `if` to check our words
This loop prints *every* index in *Moby-Dick* where the token is exactly the same as the string `'Ahab'`:

In [63]:
for index, word in enumerate(moby_toks):
    if word == 'Ahab':
        print('The word Ahab appears at index number', index)

The word Ahab appears at index number 307
The word Ahab appears at index number 313
The word Ahab appears at index number 443
The word Ahab appears at index number 898
The word Ahab appears at index number 912
The word Ahab appears at index number 920
The word Ahab appears at index number 36676
The word Ahab appears at index number 36683
The word Ahab appears at index number 36698
The word Ahab appears at index number 36823
The word Ahab appears at index number 40664
The word Ahab appears at index number 40678
The word Ahab appears at index number 40839
The word Ahab appears at index number 40866
The word Ahab appears at index number 40873
The word Ahab appears at index number 40953
The word Ahab appears at index number 41034
The word Ahab appears at index number 41112
The word Ahab appears at index number 41295
The word Ahab appears at index number 41350
The word Ahab appears at index number 41367
The word Ahab appears at index number 41393
The word Ahab appears at index number 41535


## `appending` our results to lists
It's not helpful to just `print` these. Instead, we're going to collect each of these indices in a new list using the `append` method, which adds the element you pass it to the end of an existing list:

In [178]:
# initializing an empty list with []
new_list = []

In [179]:
# empty list
new_list

[]

In [180]:
new_list.append('a')

In [181]:
# no longer empty!
new_list

['a']

In [182]:
new_list.append('b')

In [183]:
new_list

['a', 'b']

## `append` in a `for` loop
So, using the same logic, we can `append` all of those Ahab locations we `printed` to the screen before:

In [184]:
ahabs = []

for index, word in enumerate(moby_toks):
    if word == 'Ahab':
        ahabs.append(index)

In [185]:
# just looking at the first 10 indices
ahabs[:10]

[307, 313, 443, 898, 912, 920, 36676, 36683, 36698, 36823]
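
As an optional aside, Python has a shorthand called a *list comprehension* that packs the same loop, test, and `append` into one line. The sketch below uses a tiny made-up token list rather than the whole novel, so the result is easy to check by eye.

```python
# A tiny made-up list of tokens, standing in for moby_toks
tokens = ['Call', 'me', 'Ishmael', '.', 'Ahab', 'and', 'Ahab']

# Read it as: "collect index, for every index and word, if the word is 'Ahab'"
ahab_positions = [index for index, word in enumerate(tokens) if word == 'Ahab']

print(ahab_positions)
```

The longer `for` loop version does exactly the same thing and is often easier to read while you're learning.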

# From indices to contexts
Now that we have our indices, we can collect their contexts using the same logic.

Let's start by looking at one context for the first instance of `Ahab`:

In [186]:
moby_toks[307]

'Ahab'

In [76]:
moby_toks[297:317] # 10 tokens before index 307, the keyword itself, and 9 tokens after

['CHAPTER',
 '27',
 '.',
 'Knights',
 'and',
 'Squires',
 '.',
 'CHAPTER',
 '28',
 '.',
 'Ahab',
 '.',
 'CHAPTER',
 '29',
 '.',
 'Enter',
 'Ahab',
 ';',
 'to',
 'Him']

We're going to use our `append` method to gather up our results in a list of lists:

In [187]:
index = 30

In [188]:
index + 10

40

In [189]:
index - 10

20

In [190]:
ahab_contexts = []

for index in ahabs:
    context = moby_toks[index - 10:index + 10] # 10 tokens before the index, the keyword itself, and 9 tokens after
    ahab_contexts.append(context)

In [191]:
# let's see the first two contexts
ahab_contexts[0:2]

[['CHAPTER',
  '27',
  '.',
  'Knights',
  'and',
  'Squires',
  '.',
  'CHAPTER',
  '28',
  '.',
  'Ahab',
  '.',
  'CHAPTER',
  '29',
  '.',
  'Enter',
  'Ahab',
  ';',
  'to',
  'Him'],
 ['.',
  'CHAPTER',
  '28',
  '.',
  'Ahab',
  '.',
  'CHAPTER',
  '29',
  '.',
  'Enter',
  'Ahab',
  ';',
  'to',
  'Him',
  ',',
  'Stubb',
  '.',
  'CHAPTER',
  '30',
  '.']]

Great! We have a list of contexts based on the keyword `Ahab` from across all of *Moby-Dick*.
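
One thing to watch for: if a keyword sits closer to the start of the text than the size of the window, `index - 10` becomes negative, and Python reads a negative number as "count from the end." The sketch below shows the problem on a small made-up list and one possible fix, using the built-in `max()` function to stop the start of the slice from dropping below 0. The names `naive`, `safe`, and `window` are assumptions for this illustration.

```python
# A small made-up token list where the keyword is the very first token
tokens = ['Ahab', 'seemed', 'no', 'exception', 'to', 'most', 'whale', 'captains']
index = 0
window = 3

# index - window is -3, which Python reads as "third from the end",
# so this slice starts after it stops and comes back empty
naive = tokens[index - window:index + window + 1]

# max(0, ...) picks the larger of the two numbers, so the start never drops below 0
start = max(0, index - window)
safe = tokens[start:index + window + 1]

print(naive)
print(safe)
```

Running past the *end* of a list is not a problem: a slice just stops at the last element.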

# Filtering our contexts

Now, for the last step, we're going to check each of these contexts for a **second** word. This will allow us to see how often "Ahab" and any other word co-occur.

One other operation that would be helpful for this is the membership operator `in`, which checks whether an element appears in a collection such as a list, and returns `True` or `False`. For instance:

In [451]:
'a' in ['a','b','c']

True

In [216]:
'd' in ['a','b','c']

False

In [79]:
# note that there must be an exact match to return True
'a' in ['aa','aaa','aaaa']

False
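
Be aware that `in` behaves differently when the thing on the right is a *string* rather than a list: it then looks for the characters anywhere inside the string. A short sketch of the difference, with made-up variable names:

```python
# In a list, 'in' asks: is any element exactly equal to this?
in_list = 'whale' in ['whalebone', 'ship']

# In a string, 'in' asks: do these characters appear anywhere inside?
in_string = 'whale' in 'whalebone'

print(in_list)
print(in_string)
```

This is one reason we tokenize first: searching a list of tokens only matches whole words.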

So, applying that `in` logic to what we have, we can see every context that contains both words:

In [193]:
ahab_whale = [] # list to store co-occurrences of Ahab with whale

for context in ahab_contexts:
    if 'whale' in context:
        ahab_whale.append(context)

In [194]:
# they co-occur 24 times
len(ahab_whale)

24

In [195]:
# the first time
ahab_whale[0]

['their',
 'own',
 'peculiar',
 'quarters',
 '.',
 'In',
 'this',
 'one',
 'matter',
 ',',
 'Ahab',
 'seemed',
 'no',
 'exception',
 'to',
 'most',
 'American',
 'whale',
 'captains',
 ',']

## Cleaning up lists of words with `join()`
That sentence is a little hard to read. Let's make it look more like a sentence by using the `join()` method to combine list elements.

You use `join` like this:

```python
' '.join(my_list)
```

So, if you want to recombine each list element with a single space between each, you would write:

In [199]:
' '.join(ahab_whale[-1])

'seemed strangely oblivious of its advance—as the whale sometimes will—and Ahab was fairly within the smoky mountain mist , which'

You'll usually use it as above. But any string works with `join`:

In [85]:
'|||'.join(ahab_whale[-1])

'seemed|||strangely|||oblivious|||of|||its|||advance—as|||the|||whale|||sometimes|||will—and|||Ahab|||was|||fairly|||within|||the|||smoky|||mountain|||mist|||,|||which'

# Abstracting out to functions

We know how to make this work for one instance. Now, we just need to abstract it so that we can use it for *any* two words, and any amount of context. This will let us produce tons of results quickly.

In Python, we abstract things to make them reusable with *functions*. The general form of a function is as follows:

```python
def my_function(my_object):
    # do something
    return my_value
```

So for example:

In [202]:
def subtract_three(integer):
    value = integer - 3
    return value

I can run my new function just like any other:

In [205]:
subtract_three(123078949823)

123078949820
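
Functions can also take more than one argument, and an argument can have a default value that is used when you leave it out. Here is a minimal sketch. The function `subtract` and its `amount` argument are made up for this example.

```python
def subtract(integer, amount=3):
    # amount is 3 unless the caller passes a different value
    value = integer - amount
    return value

default_result = subtract(10)     # uses the default amount of 3
custom_result = subtract(10, 4)   # overrides the default with 4

print(default_result)
print(custom_result)
```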

# `improved_find`
We can apply the same logic to each of the steps we took above to create a few functions that comprise our upgraded search function.

I'm putting each of the pieces we did above into functions, and then calling them in sequence with a wrapper function.

In [269]:
def make_tokens(text_path): # defining new function make_tokens and argument text_path
    import nltk # import natural language toolkit
    
    # using the encoding utf 8, opens the file at the text path, assigning variable name f to take actions in next step
    with open(text_path, encoding = 'utf8') as f:
        text = f.read() # read variable f into variable text. this is the whole file.
    
    # use nltk's word_tokenize method to split our text into a list of tokens
    # putting list of tokens into variable my_tokens
    my_tokens = nltk.word_tokenize(text)
    
    return my_tokens # hand the list of tokens back to whatever code called the function

In [278]:
for i,x in enumerate(my_list):
    print('this is the index', i)
    print('and this is the x', x)
    print('-'*30)

this is the index 0
and this is the x aardvark
------------------------------
this is the index 1
and this is the x berry
------------------------------
this is the index 2
and this is the x capricorn
------------------------------


In [285]:
def get_indices(my_tokens, my_word): # define get_indices expecting arguments my_tokens, my_word
    indices = [] # create an empty list to append our results to, defined outside of loop
    
    for index, word in enumerate(my_tokens): # for every index and every word of my_tokens...
        if word == my_word: # is the word from my_tokens exactly equal to my_word
            indices.append(index) # append index to our list of indices
            
    return indices # returning our list of indices

In [233]:
def get_contexts(my_indices, my_tokens, window):
    contexts = []
    for index in my_indices:
        context = my_tokens[index - window:index + window + 1]
        contexts.append(context)
    
    return contexts

In [234]:
def subset_contexts(my_contexts, next_word):
    results = []
    for context in my_contexts:
        if next_word in context:
            results.append(context)
    
    clean_results = []
    for result in results:
        cleaned = ' '.join(result)
        clean_results.append(cleaned)
    
    return clean_results

In [235]:
def improved_find(text_path, word_1, word_2, window):
    '''This function takes a text path and user-defined parameters.
    It prints each context in which the target words co-occur within the text.'''
    tokens = make_tokens(text_path)
    indices = get_indices(tokens, word_1)
    contexts = get_contexts(indices, tokens, window)
    results = subset_contexts(contexts, word_2)
    
    for result in results:
        print(result)
        print('-'*80)

`improved_find` calls all of the above functions in order, and `prints` a nice result:

In [247]:
improved_find(moby_path, 'god', 'whale', 10)

# Challenges for today

## 1. Apply to a text of your choosing

Use the `improved_find` function to look for related words in a text of your choosing.

## 2. Explanatory comments
It's common practice when learning to program to write comments that explain what each line and each element of a program does in order to make sure that you understand how it works. To take our example from above:

In [424]:
def subtract_three(integer): # declare new function subtract_three that takes one argument, integer
    value = integer - 3 # create a new variable, value, that results from integer minus three
    return value # hand the variable value back to whatever code called the function

Using our remaining time, comment each line of the `improved_find` function, and the other functions on which it depends.

## 3. Improve `improved_find`

We could improve this search method in several ways:

- Improve the tokenization process to account for punctuation differently.
- Adjust the functions to make it possible to search for *multiple* words in proximity to the target word
- Add either/or logic, i.e. "Ahab or Queequeg within 10 words of whale."
- Added efficiency: It would be possible to write this program more efficiently, using fewer functions and system resources. How could we do that? One obvious way to start would be to prevent the program from re-tokenizing the source text *every time it runs*.
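
As a starting point for the last two ideas, here is one possible sketch, not the only way to do it. It assumes you tokenize *once*, keep the list of tokens in a variable, and pass that list to a search function as many times as you like. The function name `find_near` and its parameters are made up for this illustration, and the tokens come from a short invented sentence split on spaces instead of from `nltk`.

```python
def find_near(tokens, targets, neighbor, window=10):
    '''Return contexts where any word in targets has neighbor within window tokens.'''
    results = []
    for index, word in enumerate(tokens):
        if word in targets:  # either/or logic: any word in the targets list counts
            start = max(0, index - window)  # don't let the start drop below 0
            context = tokens[start:index + window + 1]
            if neighbor in context:
                results.append(' '.join(context))
    return results

# Tokenize once (here, a simple split on spaces of an invented sentence)...
tokens = 'Ahab saw the whale . Queequeg sharpened his harpoon . The whale dove .'.split()

# ...then search the same token list as many times as we like
hits = find_near(tokens, ['Ahab', 'Queequeg'], 'whale', window=3)
more_hits = find_near(tokens, ['harpoon'], 'whale', window=3)

for hit in hits:
    print(hit)
```

Because `find_near` takes a list of tokens rather than a file path, the slow tokenizing step happens only once, no matter how many searches you run.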

# Pro Tip: Jupyter Shortcuts
You can use keyboard shortcuts to control a lot of different parts of the Jupyter notebook. They are much faster than using the cursor. Memorizing a few will save you a lot of time!

| command | description |
|----------|-----------------------|
| `enter` | enter edit mode for the selected cell |
| `esc` | exit edit mode for the selected cell |
| `shift + enter` | execute the code inside of the cell |
| `a` | insert a new cell above your current cell |
| `b` | insert a new cell below your current cell |
| `j` | move down one cell |
| `k` | move up one cell |
| `dd` | delete the selected cell |