# Python concepts 2

In this notebook, we are going to continue to cover all the Python concepts that you will need to complete the Futrell et al. replication. If you haven't already worked through [Python concepts](python-concepts-workbook.ipynb), please find time to do so.

## Defining functions

A function in a programming language is exactly like a function in maths (e.g. the cosine function, or the function $f(x) = x^2$). It takes in some input, does some computation, and gives you back the output. In maths, the functions usually just work on single numbers. But in Python, we have functions that work on many different types of objects. You've been using functions for a while in Python. Here are some examples of functions.

#### The `open` function

`open` takes in as input a string representing a path to a file, and gives you back as output a file object, which you can then read from (or write to, if you open it in a writing mode).

In [1]:
f = open('data/myfile.txt')

#### The `len` function

The `len` function tells you how long its argument is. You can pass in many different types as its argument.

In [2]:
mylist = [1,2,3,4,5]
len(mylist)

5

In [3]:
mystring = 'hello everyone!'
len(mystring)

15

#### The `min` function

`min` can take in arguments in one of two ways. It can either take in a list (or something similar) and it will return to you the smallest element of the list. Or it can take as input two or more objects, and it will give you the smallest one of those.

In [4]:
mylist = [1,2,3,4,5]
min(mylist)

1

In [5]:
min(1,2,3,4,5)

1

#### The `int` function

`int` will take an object and try to convert it to an integer. If it doesn't make sense to coerce that object to an integer, it will raise an error.

In [6]:
x = 12.0
print(type(x))
y = int(x)
print(type(y))

<class 'float'>
<class 'int'>
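To make the "raise an error" case concrete, here is a small sketch of what `int` does with a few different inputs. The variable names (`from_string`, `truncated`, `error_message`) are only for illustration. Note that `int` drops the fractional part of a float rather than rounding it.

```python
# A string made of digits converts cleanly.
from_string = int('42')

# A float loses its fractional part (no rounding happens).
truncated = int(3.9)

# A string that doesn't look like an integer raises a ValueError.
try:
    int('hello')
    error_message = None
except ValueError as err:
    error_message = str(err)

print(from_string, truncated)
print(error_message)
```

If you want rounding rather than truncation, use `round` before converting.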


#### The `pow` function

`pow` is the power function in Python. There's a shorthand for it: `pow(2, 4)` is the same as `2**4`. Both raise 2 to the power of 4.

In [7]:
pow(2, 4)

16

#### The `print` function

`print` is a special function that can take just about anything as input, and it will return `None`. But as a side-effect, it prints a representation of that object, normally to your screen.

In [8]:
x = print('this is my string')
type(x)

this is my string


NoneType

#### The `round` function

`round` rounds its numeric input to as many decimal places as you want. If you don't tell it how many places to round to, it will round to the nearest integer.

In [9]:
round(2.3456789, 2)

2.35

In [11]:
round(2.3456789, 6)

2.345679

In [12]:
round(2.3456789)

2

#### The `sorted` function

`sorted` takes in a list (or something similar) and returns a sorted version of it. `sorted` knows how to sort numbers and strings already.

In [13]:
mylist = [6,3,9,1,3,7,0]
sorted(mylist)

[0, 1, 3, 3, 6, 7, 9]

In [15]:
mylist = 'this is going to be a list of strings in a second'.split()
print(mylist)
print(sorted(mylist))

['this', 'is', 'going', 'to', 'be', 'a', 'list', 'of', 'strings', 'in', 'a', 'second']
['a', 'a', 'be', 'going', 'in', 'is', 'list', 'of', 'second', 'strings', 'this', 'to']


All these functions come built in with Python. They are called [built-in functions](https://docs.python.org/3/library/functions.html). They are available to you as soon as you start Python (whether from the terminal, or a script, or in a Jupyter notebook). You never have to import them. These functions are very general. They come "built in" to Python because everyone needs them. But we can't do everything we want with just these functions. As a linguist, you may want a function that takes in a corpus and returns all the nouns. Or a function that takes in a path to a folder on your computer and returns all the files in that directory that have more than 100 words. In most cases, we have to create our own functions that suit our purpose. **In Python, "creating" a function is called "defining" it.**

Let's define a function.

In [22]:
def add_two(x):
    return x + 2

After we have defined our function, we can "call" it. "Calling" a function is a fancy way of saying "using" a function. Let's call our new function with the argument `5`.

In [23]:
add_two(5)

7

The syntax for defining a function in Python is this:

```
def function_name(input):
    ...
    return output
```

We begin with the special keyword `def`. Then we put the name we want our new function to have. We follow that with parentheses. In between the parentheses, we put the arguments we want our function to take. If we want a function that takes more than one argument, we separate them with commas. Then we finish that line with a colon. Everything after that must be indented, normally using four spaces. We can put whatever code we want where the `...` is. This is where the work is done. After we've computed whatever we want, we use the keyword `return` followed by whatever we want the function to return. Let's see some examples.

In [8]:
def square(n):
    return n**2

You can read out this function definition as follows:

> Define a function called `square` that takes in an argument that will be referred to hereafter as `n`. Whenever I call this function in the future, it will `return` to me `n**2` (which is `n` squared).

In [25]:
square(2)

4

In [26]:
def add_one_then_square(t):
    return (t+1)**2

You can read out this function definition as follows:

> Define a function called `add_one_then_square` that takes in an argument that will be referred to hereafter as `t`. Whenever I call this function in the future, it will `return` to me `(t+1)**2` (which is `t` plus one, all squared).

In [27]:
add_one_then_square(4)

25

When you define a function, Python doesn't run the body of the function. All it does is create a new variable with the function name and remember where to look whenever you call it.

In [28]:
def add_one_then_square(s):
    return square(s+1)

Here, we are overwriting the name `add_one_then_square`. The new definition calls our `square` function instead of using `**` directly.

In [29]:
add_one_then_square(3)

16
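One way to check that Python really doesn't run the body at definition time is to define a function whose body would fail. In this sketch, `broken_on_purpose` and `helper_that_does_not_exist` are made-up names. The second one is deliberately never defined.

```python
def broken_on_purpose(x):
    # `helper_that_does_not_exist` is deliberately undefined.
    return helper_that_does_not_exist(x)

# Defining the function raised no error, and the name now refers to a function.
defined_ok = callable(broken_on_purpose)
print(defined_ok)

# The problem only shows up when we actually call the function.
try:
    broken_on_purpose(1)
    call_failed = False
except NameError as err:
    call_failed = True
    print('NameError:', err)
```

The definition succeeds because Python only looks up the names inside the body when the function is called.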

In [30]:
def has_b(word):
    return word.count('b') > 0

In [31]:
has_b('hello')

False

In [32]:
has_b('baboon')

True

In [34]:
def sign(k):
    if k < 0:
        return 'negative'
    elif k > 0:
        return 'positive'
    else:
        return 'zero'

We can use conditional statements in functions. Exactly one of these `return` statements runs each time the function is called. Also, we can `return` whatever we want. Here, we are returning strings.

In [35]:
sign(4)

'positive'

In [36]:
def long_function(n):
    intermediate_value = n + 1
    another_value = 3 * intermediate_value + 4
    second_last_value = (another_value / 8) ** 3
    last_value = square(round(second_last_value))
    return last_value

The body of a function can be as long as we want. We can create as many variables along the way as we want. But notice that none of those intermediate values that we create are available to us outside of that function. In fact, even the argument that we pass in is not available under the name `n` outside of the function. Within the function, however, we refer to the argument using the name that we gave it in the first line of the definition.

In [37]:
long_function(5)

441

In [38]:
intermediate_value

NameError: name 'intermediate_value' is not defined

In [39]:
n

NameError: name 'n' is not defined

In [46]:
def two_argument_function(x, y):
    if x > y:
        return 7
    elif x < y:
        return min(x, 5)
    else:
        return max(y, 9)

In [47]:
two_argument_function(2, 2)

9

#### Challenge 1

What type is the function `square`? What type is the function `len`?

In [1]:
## FILL IN THE BLANKS

- Can you do `f = square`?
- What does `f(5)` mean then?

#### Challenge 2

Define a function that takes in as an argument a number and subtracts 4 from it. Make sure to give it a sensible name. Test your function by calling it on the argument `5`.

In [2]:
## FILL IN THE BLANKS

- What is the type of `subtract_four(3)`?
- What happens if you call your function on `'hello'`?
- Write a function that takes in `a` and `b`, and returns `b - a`.

#### Challenge 3

Write a function that takes in three arguments. If all the arguments are the same, it should return the string `'same'`. If they're different, it should return the largest one. Give it a sensible name. Test it on the arguments `2, 2, 2` and `2, 3, 4` to make sure it does the right thing.

In [51]:
## FILL IN THE BLANKS

In [3]:
## FILL IN THE BLANKS

- What is the type of the return value if the three arguments are all the same?

#### Challenge 4

Write a function that takes in a string representing a file path. The function should open that file, read in the contents, split the contents by whitespace, and return all the words that start with 'b'. Test it on `data/italian.txt`.

In [4]:
## FILL IN THE BLANKS

In [5]:
## FILL IN THE BLANKS

- What type does your function take as input and what type does it return?
- Write a more general version of this function that takes in a file path and a letter, and return all words in that file that begin with that letter.

#### Challenge 5

Write and test a function that takes in a file path and returns the average number of characters per word in that file.

In [62]:
## FILL IN THE BLANKS

In [6]:
## FILL IN THE BLANKS

- Change the function to return its answer rounded to 3 decimal places.
- Change the function to print its answer, rather than returning it. What is the return value of this function (i.e. one with no return statement)?

#### Challenge 6

One of the reasons we define functions is so that we don't have to copy-paste our code. For example, imagine you want to know the proportion of words in a file that have the letter 'b' in them. In fact, you want to know this proportion for each of the three files 'data/italian.txt', 'data/moredata.txt' and 'data/subfolder/big.txt'. Without a function, we might do something like this:

In [65]:
with open('data/italian.txt') as f:
    contents = f.read()
text = contents.split()
words_with_b = [word for word in text if word.count('b') > 0]
proportion = len(words_with_b) / len(text)
print(proportion)

with open('data/moredata.txt') as f:
    contents = f.read()
text = contents.split()
words_with_b = [word for word in text if word.count('b') > 0]
proportion = len(words_with_b) / len(text)
print(proportion)

with open('data/subfolder/big.txt') as f:
    contents = f.read()
text = contents.split()
words_with_b = [word for word in text if word.count('b') > 0]
proportion = len(words_with_b) / len(text)
print(proportion)

0.03411280930836537
0.023210141046241743
0.06039454410214521


Rewrite this by first defining a function (with a sensible name) and then calling it on three different arguments. Make sure your function `return`s something, rather than just `print`ing it.

In [7]:
## FILL IN THE BLANKS

- Rewrite your function calls as a `for` loop.
- Imagine you've decided that you're interested in words that have a 'z' in them. Change the copy-paste version, change the function definition version, and explain why defining functions is better than copy-pasting.

#### Challenge 7

We should always give functions meaningful names, just like we give other variables meaningful names. This makes your code more readable, for others but also for your future self. Imagine you wrote the following code six months ago. Try to figure out what it does.

In [71]:
with open('data/italian.txt') as f:
    contents = f.read()
n = len([ch for ch in contents if ch == '.'])
a = len(contents) / n
print(a)

134.18343780930385


Now, rewrite this as a function and give it a name that describes what it returns. If your functions have descriptive names, then your future self can understand your code just by looking at the function **call**, not the function **definition**.

In [8]:
## FILL IN THE BLANKS

What do you think the following code does, without knowing what happens at the function definitions?

In [None]:
## These functions haven't been defined yet, so they won't actually work.
raw_text = read('data/moredata.txt')
tokens = tokenize(raw_text)
cleaned = clean(tokens)
nouns = filter_by_pos(cleaned, 'noun')
print(avg_len(nouns))

#### Challenge 8

A key idea in programming is writing modular code. For the time being, this means that each function we define should do a single thing, not many things. Some of the functions we've written so far do more than one thing. For example, the first function from challenge 7 reads in a file, calculates the number of sentences, and returns the average number of characters per sentence. In contrast, the functions in the second part of challenge 7 do one well-defined thing, as indicated by their name. Let's implement these functions.

- The `read` function should return a string.
- The `tokenize` function should return a list of strings.
- `clean` should do something like remove all punctuation, or remove words less than three letters, and return a list of strings.
- Ignore `filter_by_pos` for now.
- `avg_len` should take in a list of words and return their average length in characters.

In [10]:
def read(path):
    pass
    ## FILL IN THE BLANKS

In [11]:
def tokenize(text):
    pass
    ## FILL IN THE BLANKS

In [40]:
def clean(tokens):
    pass
    ## FILL IN THE BLANKS

In [41]:
def filter_by_pos(tokens, pos):
    pass

In [12]:
def avg_len(words):
    pass
    ## FILL IN THE BLANKS

#### Challenge 9

Using descriptive names for functions is a great way to help people understand our code. Sometimes, we need to provide more information than can fit in a name. For this, we use a docstring. A docstring is a string that goes immediately after the first line of a function definition. We use triple-quotes so that we can write on more than one line. We use a docstring to record information on what our function does and how to use it.

In [43]:
def vocab(path):
    """
    Return the unique words in file stored at `path`.
    
    `path` is a string representing the location of the file on the computer.
    The `set` function used below takes a list (or similar) and returns a 
    set object, which doesn't have any duplicates. I also sort the set so
    that the elements are in alphabetical order.
    
    TODO: I should use my `read` and `tokenize` functions here.
    """
    with open(path) as f:
        contents = f.read()
    tokens = contents.split()
    unique = set(tokens)
    return sorted(unique)

vocab('data/moredata.txt')[:10]

['!Arvela',
 '!Bighare?',
 "!Boots's",
 '!Buck',
 '!Cookie',
 '!Cookie.',
 '!Danae,',
 '!Deon',
 '!Gary',
 '!Gwen']

Define and test a function that calculates the proportion of characters in a file that are 'p', 't' or 'k'. Write a docstring describing what the function does, some notes for the user on the input and output, and anything else you feel necessary. 

In [13]:
## FILL IN THE BLANKS

- There are conventions in Python on how to structure your docstrings. I like [numpydoc](https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt). Google has their [own](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) which is also popular. [Here's](https://stackoverflow.com/questions/3898572/what-is-the-standard-python-docstring-format) a good comparison of some different conventions. It really doesn't matter which one you use, as long as you are i) consistent and ii) actually writing docstrings.
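As a rough illustration of what a structured docstring can look like, here is a sketch written loosely in the numpydoc style. The function `count_letter` is a made-up example, and the layout is only an approximation of the convention, so check the numpydoc guide for the exact rules.

```python
def count_letter(word, letter):
    """
    Count how many times a letter occurs in a word.

    Parameters
    ----------
    word : str
        The word to search in.
    letter : str
        The letter to count.

    Returns
    -------
    int
        The number of times `letter` occurs in `word`.
    """
    return word.count(letter)

print(count_letter('baboon', 'b'))

# The docstring is stored on the function itself.
print(count_letter.__doc__)
```

In a notebook or at the Python prompt, `help(count_letter)` shows the same text.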

#### Challenge 10

How many arguments does `preach` have? Why does it work? What value does it return? Answer the same questions for `big_function`. `big_function` shows that there can be an arbitrary relationship between the arguments and the return value, even **no** relationship.

In [19]:
def preach():
    print('Linguistics is the best.')

In [20]:
preach()

Linguistics is the best.


In [21]:
def big_function(x, y, z):
    return 7

In [22]:
big_function(1, 'hello', preach)

7

- Why is the return value of `preach` `None`?
- Can a function return a function as its return value?

#### Challenge 11

Many functions in Python have lots of arguments, but you don't always have to give all of them when you call the function. For example, the `round` function we saw earlier has 2 arguments: the number we want to round and the number of decimal places to round to. But if we only give the first argument, `round` will assume we want to round to the nearest integer.

In [23]:
round(2.3456789, 4)

2.3457

In [24]:
round(2.3456789)

2

These are default values for arguments. Default arguments are useful because we can write general functions, but also make them easy to use in the most common way. We give arguments default values when we define the function, as follows.

In [28]:
def has_letter(word, letter='b'):
    return word.count(letter) > 0

In [29]:
has_letter('hello', 'h')

True

In [30]:
has_letter('example')

False

In [31]:
has_letter('example', 'b')

False
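An argument with a default value can also be passed by name when calling the function, which can make the call easier to read. The sketch below repeats the `has_letter` definition so that it runs on its own. The variable names holding the results are only for illustration.

```python
def has_letter(word, letter='b'):
    return word.count(letter) > 0

# No second argument: `letter` falls back to its default, 'b'.
default_used = has_letter('baboon')

# Second argument given by position.
by_position = has_letter('example', 'x')

# Second argument given by name (a "keyword argument").
by_name = has_letter('example', letter='x')

print(default_used, by_position, by_name)
```

The last two calls do exactly the same thing. The named form just spells out which argument the `'x'` is for.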

Write and test a function called `length` that calculates the length of a text. It should take as input a file path and an argument called `level`. If `level` is set to the string 'words', then the function returns the length of the text in words. If not, it should return the length of the text in characters. The default argument of `level` should be 'words'.

In [14]:
## FILL IN THE BLANKS

In [16]:
## FILL IN THE BLANKS

In [15]:
## FILL IN THE BLANKS

## Booleans and conditional statements

"Conditional statements" refers to `if ... else ...`. Conditional statements require boolean values, so let's straighten them out first. Some expressions in Python are `int`s, some are `str`ings, and some are `bool`eans. Booleans are either `True` or `False`.

The following code assigns the value of `2` to the name `x`. The type of `x` is `int`.

In [4]:
x = 2
type(x)

int

Now we can ask whether `x` is larger than 0. This can either be `True` or `False`, so the type of this expression is `bool`.

In [8]:
x > 0

True

In [9]:
type(x > 0)

bool

We can assign the value of the expression `x < 0` to a name, just like any other value. It's still of type `bool`.

In [12]:
y = x < 0
y

False

In [13]:
type(y)

bool

Python lets us compute more complicated boolean expressions. We can use the special keywords `and` and `or` as follows (parentheses are not required by Python in these examples, but I think they make things clearer for humans):

In [15]:
x = 2
y = 3
(x > 0) and (y > 0)

True

The final line above is `True` because both `x > 0` and `y > 0` are `True`. But if one of them is `False`, the whole thing is `False`.

In [16]:
(x > 0) and (y < 0)

False

`or` only cares that one of them is `True`. 

In [17]:
(x > 0) or (y > 0)

True

In [18]:
(x > 0) or (y < 0)

True

We can have as many as we want. Again, the parentheses are not always necessary for Python's syntax, but in complicated booleans I always prefer to have them.

In [21]:
x = 2
y = 3
z = 4
((x > 0) and (y < 0)) or ((z < 0) or ((x + y) > 0))

True

The simplest conditional statement looks like this. When Python runs this code, it first checks whether the expression `x > 0` is `True`. If it is, it executes every line of code indented after the colon.

In [14]:
x = 2
if x > 0:
    print('x is positive')

x is positive


If `x > 0` did not evaluate to `True`, then Python would skip everything indented after the colon. Change `x` to be `-2` and see what happens.

#### Challenge 12

Write a function that checks whether the argument is between 0 and 10. It should return the string 'Yes' if it is, and 'No' if not.

In [17]:
## FILL IN THE BLANKS

In [18]:
## FILL IN THE BLANKS

It's very common to write functions that check whether their argument meets some criterion/has some property, such as being between 0 and 10 as above. In the challenge above, we returned a string that told us the answer. A more common idiom in Python is to return a boolean value directly. In this example, we don't even need any `if` statements. We just return exactly the value of `0 < x < 10`, which will either be `True` or `False`.

In [26]:
def is_between_0_and_10(x):
    return 0 < x < 10

In [25]:
is_between_0_and_10(5)

True
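The chained comparison `0 < x < 10` is shorthand for combining two comparisons with `and`. Here is a small sketch comparing the two spellings. The function names and the list of test values are chosen just for this illustration.

```python
def is_between_chained(x):
    return 0 < x < 10

def is_between_explicit(x):
    return (0 < x) and (x < 10)

# Try a few values, including the endpoints 0 and 10.
test_values = [-1, 0, 5, 10, 11]
chained_results = [is_between_chained(v) for v in test_values]
explicit_results = [is_between_explicit(v) for v in test_values]

print(chained_results)
print(explicit_results)
```

Both versions give the same answers. Because the comparisons use `<` rather than `<=`, the endpoints 0 and 10 themselves do not count as "between".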

You've already seen some functions like this. Many of them have names that start with `is`. Here are some:

In [27]:
'this is a string'.islower()

True

In [28]:
'this is a string'.isupper()

False

#### Challenge 13

Conditional statements can also have `else` counterparts. If the expression in the `if` statement is not `True`, then Python will always run the code under `else`. Here's an example:

In [29]:
x = -1
if x > 0:
    print('x is positive')
else:
    print('x is not positive')

x is not positive


Write a function that takes a string as its argument. If the string is all lower case, the function should return the length of the string in characters. Otherwise, it should return the length of the string in words.

In [19]:
## FILL IN THE BLANKS

#### Challenge 14

In addition to `and` and `or`, there's also `not`. `not` flips the boolean value of whatever it negates. `in` is also useful. `x in y` will return `True` if `x` is in `y`. What exactly it means for something to be `in y` depends on what `y` is. For lists, it's what you'd expect.

In [33]:
x = 2
not x < 0

True

In [34]:
1 in [3,4,1]

True

In [35]:
not 4 in [3,4,1]

False
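To illustrate how the meaning of `in` depends on the right-hand side, here is a short sketch with a list, a string and a dictionary. The dictionary `ages` and its contents are invented for the example.

```python
# For a list, `in` checks whether the value is one of the elements.
in_list = 4 in [3, 4, 1]

# For a string, `in` checks for a substring.
in_string = 'ell' in 'hello'

# For a dictionary, `in` checks the keys, not the values.
ages = {'ana': 31, 'ben': 27}  # made-up data
key_in_dict = 'ana' in ages
value_in_dict = 31 in ages

# `x not in y` is another way of writing `not x in y`.
same_result = (not 4 in [3, 4, 1]) == (4 not in [3, 4, 1])

print(in_list, in_string, key_in_dict, value_in_dict, same_result)
```

Many Python programmers prefer the `x not in y` spelling because it reads more like English.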

Write a function that takes as input a file path and a word, and returns whether that word is **not** present in that file. It should return `True` if the word is not in the file.

In [20]:
## FILL IN THE BLANKS

In [21]:
## FILL IN THE BLANKS

#### Challenge 15

Write a function that takes in a list of words and returns a list of those words that have at least two 'e's in them. You may want to write two functions for this, one that calls the other.

In [22]:
## FILL IN THE BLANKS

In [23]:
## FILL IN THE BLANKS

## Advanced topics with functions

- Lambda functions
- Higher-order functions like sorted, max, min



#### Challenge 16

Functions are just like any other value in Python. Some functions take other functions as arguments, and some functions return functions as their return value. Below, we pass in a function as an argument to `sorted`. `sorted` has an optional argument called `key`. If we don't supply it, `sorted` sorts things in the way you'd expect (lowest to highest for numbers, alphabetically for strings). But we can pass in a function to the `key` argument. Then, behind the scenes, `sorted` will call that function on each of the things we're trying to sort. It will then sort the original items according to the outputs of those function calls.

In [3]:
x = 'this is going to be a list'.split()
sorted(x)

['a', 'be', 'going', 'is', 'list', 'this', 'to']

In [4]:
sorted(x, key=len)

['a', 'is', 'to', 'be', 'this', 'list', 'going']

In [6]:
numbers = [-4, -3, -2, -1, 0, 1, 2, 3]
sorted(numbers)

[-4, -3, -2, -1, 0, 1, 2, 3]

In [9]:
sorted(numbers, key=square)

[0, -1, 1, -2, 2, -3, 3, -4]
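The `key` argument doesn't need a function that was defined with `def`. For short, one-off functions, Python has `lambda` expressions, and `max` and `min` accept a `key` argument too. The following is a minimal sketch. The variable names are only for illustration.

```python
words = 'this is going to be a list'.split()

# `lambda w: w[-1]` is a small unnamed function that returns the last letter of `w`.
by_last_letter = sorted(words, key=lambda w: w[-1])

# `max` and `min` can also compare items by the output of a key function.
longest = max(words, key=len)
shortest = min(words, key=len)

print(by_last_letter)
print(longest, shortest)
```

When two items have the same key (here, `'this'` and `'is'` both end in `'s'`), `sorted` keeps them in their original order.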