# Python concepts

In this notebook, we are going to cover all the Python concepts that you will need to complete the Futrell et al. replication.

## Reading in data

The main way to read in data in Python is to use the `open` function. The `open` function is always available as soon as you start Python. It has one required argument, which is a string describing the path to the file you want to open. Like all functions in Python, the `open` function **returns** something. In particular, the `open` function returns a file object. You can think of a file object as Python's representation of a file. We can store that thing in a variable just like we store numbers in variables with `x = 10`.

I have a small text file in the 'data' folder called 'myfile.txt'. Check <a href="data/myfile.txt">it</a> out.

In [11]:
open('data/myfile.txt')

<_io.TextIOWrapper name='data/myfile.txt' mode='r' encoding='UTF-8'>

In [12]:
f = open('data/myfile.txt')

In [13]:
f

<_io.TextIOWrapper name='data/myfile.txt' mode='r' encoding='UTF-8'>

In [14]:
type(f)

_io.TextIOWrapper

It has a cryptic name, but it's a file object. Normally when we use the `type` function in Python, we will get a more readable answer (e.g. try calling `type` on the number `10`). Notice the difference between `f` and the string I passed in to the `open` function. One is a string, one is a file object, whatever that is.

If we look at the right part of the Python [docs](https://docs.python.org/3/glossary.html#term-file-object), it tells us a "file object" is something you can read and write. To read everything in our file, we use the `read` method of the file object `f`. 

A method is just a function that can only be used on a particular type of object. Strings have their own methods (e.g. you can capitalize a string, but not a list), lists have their own methods (e.g. you can sort a list, but not a string), and files have their own methods (e.g. you can read a file, but not a string). Because a method is a function that is restricted to a particular type, we write it after the object, separated by a dot. Because it is still a function, we still need the parentheses. In this case, the method needs no more arguments, but later on we'll see methods that do take extra arguments between the parentheses.

In [15]:
contents = f.read()

In [5]:
# This will not work, because a string is not something you can read in Python.
not_going_to_work = 'data/myfile.txt'.read()

AttributeError: 'str' object has no attribute 'read'
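To make the idea of type-specific methods concrete, here is a small sketch using `capitalize` (a string method) and `sort` (a list method). The variable names are made up for illustration.

```python
# A string method: capitalize returns a new string with the first letter in upper case.
name = 'chomsky'
capitalized = name.capitalize()
print(capitalized)

# A list method: sort reorders the list itself.
numbers = [3, 1, 2]
numbers.sort()
print(numbers)
```

Swapping the two (calling `sort` on the string or `capitalize` on the list) raises the same kind of `AttributeError` as above.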

After we have successfully read in the contents of a file, we need to close the file in Python. Closing is something you can do to files but not strings or lists, so it's a method.

In [16]:
f.close()

The entire process of opening, reading and closing a file in Python looks like this. You open the file, you read in its contents, then you close the file.

In [19]:
f = open('data/myfile.txt')
contents = f.read()
f.close()

Closing a file is really important in Python, but programmers are often forgetful. Python has a shorthand for this process, the `with` statement, which closes the file for us automatically as soon as the indented block ends. For our purposes, it is functionally equivalent.

In [21]:
with open('data/myfile.txt') as f:
    contents = f.read()
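If you want to convince yourself that the file really is closed, file objects have a `closed` attribute you can inspect. The sketch below first writes a throwaway file (the file name and contents are hypothetical, and opening with `'w'` is only there so the example runs anywhere), then checks `closed` inside and after the `with` block.

```python
import os
import tempfile

# Setup only: create a small temporary file to read from.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')
with open(path, 'w') as out:
    out.write('hello world\n')

with open(path) as f:
    inside = f.closed    # the file is still open inside the block
    contents = f.read()
after = f.closed         # the file has been closed on leaving the block

print(inside, after)
```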

#### Challenge 1

There's another txt file in the 'data' folder called 'moredata.txt'. Read in the file and store the text in a variable called `data`.

In [25]:
with open('data/moredata.txt', 'r') as f:
    data = f.read()

- What type is `data`? 
- How long is `data`?
- Redo the exercise without using a `with` statement.
- Read Python docs for [open](https://docs.python.org/3/library/functions.html#open). Reading documentation is a crucial skill for modern programming. 90% of the time you only need 10% of the information in the documentation, but finding that 10% is a real skill.
- BTW, the data comes from the [Santa Barbara Spoken Corpus](http://www.linguistics.ucsb.edu/research/santa-barbara-corpus)

#### Challenge 2

When you use the `read` method of a file object in Python, you get the entire contents of that file as a string. Files also have a `readlines` method. This is the same as `read`, but instead of one big long string, you get a list of strings. Each element of the list is a single line of the file. This is handy when each line of your file is a single data point, as in the previous challenge.

Read in the contents of the 'moredata.txt' file into a list of strings, and call that list of strings `data`.

In [26]:
with open('data/moredata.txt', 'r') as f:
    data = f.readlines()

- What type is `data`? 
- How long is `data`?
- Why isn't this the same as the previous challenge?
- Redo the exercise without using a `with` statement.
- StackOverflow is your friend. Have a look at the question and the top-rated answer of [this thread](https://stackoverflow.com/questions/3277503/how-do-i-read-a-file-line-by-line-into-a-list).
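The difference between `read` and `readlines` is easier to see on a tiny file. The following is a minimal sketch that writes a three-line temporary file (hypothetical contents, not one of the course files) and reads it both ways.

```python
import os
import tempfile

# Setup only: a temporary file with three short lines.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'lines.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\nthird line\n')

with open(path) as f:
    as_string = f.read()       # one string holding the whole file
with open(path) as f:
    as_list = f.readlines()    # one string per line, newline included

print(type(as_string), len(as_string))  # len counts characters
print(type(as_list), len(as_list))      # len counts lines
```

Note that each element of the list still ends with its newline character, so `as_list[0]` is `'first line\n'`.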

#### Extra on reading files
Read 7.2 and 7.2.1 from [here](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files).

## String methods

Strings are an important data type in Python, especially for linguists. It is how Python represents text data. You can make a string by putting some text in between quotation marks, either single (') or double ("). The only reason Python gives you two ways of making a string is because you may want to actually have a quotation mark in the middle of your string. If you do want one, then use the other quotation mark. For example, if you want the word _can't_ in your string, use "can't". Other than that, ' and " are the same.

In [34]:
x = 'this is a string'
y = "this is also a string"
print(x)
print(y)

this is a string
this is also a string


In [36]:
x = 'we can't do this'

SyntaxError: invalid syntax (<ipython-input-36-f07fda882c76>, line 1)
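Two ways of fixing this, as a quick sketch: wrap the string in double quotes, or keep the single quotes and put a backslash before the apostrophe so Python treats it as part of the text.

```python
x = "we can't do this"     # double quotes around a single quote
y = 'we can\'t do this'    # backslash-escaped single quote
print(x)
print(x == y)              # both spell out the same string
```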

Strings are a data type in Python. There are some functions that are particular to strings. That means they're methods. **"Method" is just a fancy name for a function that is only available for certain data types.** For example, you can make a string all upper case, but you can't make a list all upper case, because that doesn't make sense.

In [37]:
'this is a string'.upper()

'THIS IS A STRING'

In [38]:
[1, 2, 3, 'hello'].upper()

AttributeError: 'list' object has no attribute 'upper'

Whenever you write `'this is a string'.upper()`, it's as if you're writing `upper('this is a string')`, where `upper` is a function that can only accept strings.
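Python actually lets you spell this out: the method lives on the `str` type, so you can call it as `str.upper` and pass the string in as an ordinary argument. This is only a sketch to illustrate the equivalence; in practice everyone uses the dot form.

```python
text = 'this is a string'

via_method = text.upper()      # the usual way
via_type = str.upper(text)     # same function, string passed as an argument

print(via_method)
print(via_method == via_type)
```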

#### Challenge 3
Assign your favourite linguist's name to the variable `linguist`. Print it out in all caps and all lower case.

In [39]:
linguist = 'Noam Chomsky'
print(linguist.upper())
print(linguist.lower())

NOAM CHOMSKY
noam chomsky


- What type is `linguist.upper()`?

#### Challenge 4

String methods, like all methods in Python, can also take additional arguments. The `count` method of a string takes as its only required argument a string, let's call it `sub`. The method will return how many times `sub` occurred in that string.

In [40]:
linguist.count('a')

1

Read in the contents of 'big.txt', which is in a subfolder called 'subfolder' in the 'data' directory. How many times does 'a' appear? How many times does the string 'the' appear?

In [42]:
with open('data/subfolder/big.txt') as g:
    contents = g.read()
contents.count('the')

92805
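One thing to keep in mind when interpreting a number like this: `count` counts substrings, not words, and it is case-sensitive. A small sketch with a made-up sentence shows the difference between counting the substring 'the' and counting the word 'the'.

```python
sentence = 'the other day the weather was nice'

# Substring count: also matches inside 'other' and 'weather'.
substring_count = sentence.count('the')

# Word count: split into words first, then count list elements equal to 'the'.
word_count = sentence.split().count('the')

print(substring_count, word_count)
```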

#### Challenge 5

There's a string method called `endswith`, which tells us whether a string ends with a particular suffix. The method returns either `True` or `False`. Find out whether the contents of the "big.txt" file end with a 'y' and store that in a variable.

In [49]:
x = contents.endswith('y')
x

False

One of the things we can do with strings is _split_ them. That means we turn one string into a list of smaller strings. If we don't give the `split` method an argument, it will assume we want to split the string by whitespace. Whitespace is just a term for spaces, tabs and newlines. Every time it finds a stretch of whitespace (e.g. a space), it will chop up the string there. If we do give it an argument, it will chop up the string at every occurrence of that argument instead.

In [50]:
x = 'this is a string'
x.split()

['this', 'is', 'a', 'string']

In [52]:
x.split('n')

['this is a stri', 'g']

In [55]:
x.split('i')

['th', 's ', 's a str', 'ng']
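Splitting with no argument is not quite the same as splitting on a single space. As a sketch (the string below is made up), `split()` treats any run of spaces, tabs and newlines as one separator, while `split(' ')` cuts at every individual space and nothing else.

```python
messy = 'this  is\ta\nstring'   # two spaces, a tab and a newline

by_whitespace = messy.split()     # any run of whitespace separates words
by_space = messy.split(' ')       # only single spaces separate pieces

print(by_whitespace)
print(by_space)
```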

#### Challenge 6

Read in the text file 'italian.txt' in the 'data' folder, split it by whitespace and store the result in a variable called `italian`.

In [56]:
with open('data/italian.txt') as myfile:
    italian = myfile.read()
italian = italian.split()
italian[:10]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'La',
 'battaglia',
 'di',
 'Benevento,',
 'by']

- What type is `italian`?
- How long is `italian`?
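You may have noticed the odd `'\ufeff'` at the start of the first word. That character is a byte order mark (BOM) that some programs put at the start of a text file. One way to get rid of it, sketched here on a temporary file with hypothetical contents rather than on 'italian.txt' itself, is to pass `encoding='utf-8-sig'` to `open`.

```python
import os
import tempfile

# Setup only: write a small file that starts with a BOM.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'bom.txt')
with open(path, 'w', encoding='utf-8-sig') as out:
    out.write('The Project Gutenberg EBook')

# Plain UTF-8 keeps the BOM as the first character of the text.
with open(path, encoding='utf-8') as f:
    with_bom = f.read().split()

# 'utf-8-sig' strips the BOM while reading.
with open(path, encoding='utf-8-sig') as f:
    without_bom = f.read().split()

print(with_bom[0] == 'The', without_bom[0] == 'The')
```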

#### Extra on strings
- [A full list of string methods in Python](https://docs.python.org/3/library/stdtypes.html#string-methods)
- Read the Python docs for [splitting strings](https://docs.python.org/3.6/library/stdtypes.html?highlight=str#str.split).

## `For` loops

Computers are good at doing exactly the same thing many times. One way we can do that in Python is with `for` loops.

In [60]:
for x in [1,2,3,4,5]:
    print(x)

1
2
3
4
5


In [61]:
for y in [1,2,3,4,5]:
    z = y + 1
    print(z)

2
3
4
5
6


A `for` loop is just shorthand for the following code:

In [62]:
mylist = [1,2,3,4,5]

y = mylist[0]
z = y + 1
print(z)

y = mylist[1]
z = y + 1
print(z)

y = mylist[2]
z = y + 1
print(z)

y = mylist[3]
z = y + 1
print(z)

y = mylist[4]
z = y + 1
print(z)


2
3
4
5
6


Writing the same code multiple times in the same program sucks. Imagine if we decided we actually wanted to add 2 to every element of the list. We'd have to change 5 different lines of code here. It'd be easy to forget one, and then your program has a bug. Often in linguistics, we'll be operating over thousands or millions of data points, not just 5.
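With a `for` loop, that same change touches a single line, no matter how long the list is. A minimal sketch:

```python
mylist = [1, 2, 3, 4, 5]

for y in mylist:
    z = y + 2   # the only line that needs to change
    print(z)
```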

#### Challenge 7

Write a `for` loop that prints the square of each of the first 10 positive whole numbers (1 to 10). To square a number in Python, use `**`, e.g. `4**2` (=16). When you run your `for` loop, you should see:

```
1
4
9
16
25
36
49
64
81
100
```

In [64]:
for i in range(1, 11):
    print(i**2)

1
4
9
16
25
36
49
64
81
100


#### Challenge 8

Write a `for` loop that prints all the words in `italian` that end in 'b'.

In [70]:
for word in italian:
    if word.endswith('b'):
        print(word)

ab
ab
sub
_sub
_ab
_ab
«_Machatub
web
web
web
Web
Web
Web


#### Challenge 9

`For` loops don't have to use lists in Python. They can iterate over any data type that is [iterable](https://docs.python.org/3/glossary.html#term-iterable). Strings are iterable.

Write a `for` loop that iterates over every character in `linguist` and prints it if it is an 'm'.

In [71]:
linguist = 'Noam Chomsky'
for char in linguist:
    if char == 'm':
        print(char)

m
m


#### Challenge 10

Use a `for` loop to store all the words in `italian` that start with a 'z'. First, create an empty list called `result` by using `[]`. Then iterate over all the words in `italian`, and if a word `startswith` 'z', `append` that word to the list `result`. `append` is a list method, which takes an argument and puts it at the end of the list.

In [79]:
result = []
for word in italian:
    if word.startswith('z'):
        result.append(word)
len(result)

48

## List comprehensions

List comprehensions are a distinctive feature of Python. They are a handy shorthand for `for` loops that build a list. This list comprehension is functionally equivalent to the `for` loop of challenge 10:

In [82]:
result = [word for word in italian if word.startswith('z')]
len(result)

48

The syntax is the following:

`[variable for variable in iterable if condition]`

The `condition` is optional, and we can do more complicated things than just copy the variable exactly.

In [85]:
mylist = [1,2,3,4,5]
[number + 1 for number in mylist]

[2, 3, 4, 5, 6]

In [92]:
# Get a list of the first letter of words in `italian` that end with 't'
[word[0] for word in italian if word.endswith('t')][:10]

['P', 'a', 'c', 'a', 'i', 'i', 'P', 'a', 'A', 'a']
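To see how the pieces of a list comprehension line up with the pieces of a `for` loop, here is a side-by-side sketch on a short made-up word list. The expression at the front of the comprehension is whatever you would have passed to `append`.

```python
words = ['tavolo', 'zio', 'zucchero', 'casa']

# The for loop version.
loop_result = []
for word in words:
    if word.startswith('z'):
        loop_result.append(word.upper())

# The same thing as a list comprehension.
comp_result = [word.upper() for word in words if word.startswith('z')]

print(loop_result)
print(loop_result == comp_result)
```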

#### Challenge 11

Rewrite challenges 7 and 8 as list comprehensions.

In [93]:
[i**2 for i in range(1,11)]

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [94]:
[word for word in italian if word.endswith('b')]

['ab',
 'ab',
 'sub',
 '_sub',
 '_ab',
 '_ab',
 '«_Machatub',
 'web',
 'web',
 'web',
 'Web',
 'Web',
 'Web']

#### Challenge 12

Use a list comprehension to make every word in `italian` all caps if it is at least three characters long.

In [97]:
result = [word.upper() for word in italian if len(word) >= 3]

- How long is the resulting list?
- How long is each word in the resulting list? (Use a list comprehension to find out)

#### Challenge 13

Use a list comprehension to get the **last** letter of every word in 'data/subfolder/big.txt' if the word has a 'b' in it somewhere.

- You may want to read the contents in again.
- You might want to separate the contents into words.
- There are a few different ways of knowing whether a string contains a 'b'. One way is to ask whether the `count` of 'b' in the string is bigger than 0.
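As a sketch of those "few different ways", here are three checks that should agree with each other: the `count` approach from the hint, the `in` operator, and the string method `find`, which returns `-1` when the substring is absent. The example word is made up.

```python
word = 'battaglia'

by_count = word.count('b') > 0    # the approach suggested above
by_in = 'b' in word               # the in operator
by_find = word.find('b') != -1    # find returns -1 if there is no 'b'

print(by_count, by_in, by_find)
```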

In [96]:
with open('data/subfolder/big.txt') as f:
    contents = f.read()
contents = contents.split()
result = [word[-1] for word in contents if word.count('b') > 0]
result[:10]

['g', 'y', 'y', 'e', 'g', 'g', 'e', 'g', 't', 'g']

#### Challenge 14

Rewrite challenge 13 as a `for` loop.

In [99]:
result = []
for word in contents:
    if word.count('b') > 0:
        last_letter = word[-1]
        result.append(last_letter)
result[:10]

['g', 'y', 'y', 'e', 'g', 'g', 'e', 'g', 't', 'g']

#### Challenge 15

Find the last letter of all the words in `italian` that are longer than 6 characters.

In [103]:
result = [word[-1].lower() for word in italian if len(word) > 6]

## Defining functions

Sometimes you write code that can be reused multiple times. Rather than copying and pasting it, we can encapsulate the logic in a function, and then just call the function whenever we want to. When we do this, we also abstract away from the nitty-gritty details of how the function works. Here's our code for reading in a file and splitting the contents into a list of words:

In [115]:
with open('data/italian.txt') as f:
    contents = f.read()
contents = contents.split()

We'll probably want to do this many times, so let's turn it into a function. Creating a function in Python is called 'defining' a function, and we use the special word `def` to define a function. We give the function a name (here, `tokenize`), and note its arguments (here, `fname`). The function can do whatever we want, and it usually ends with a `return` statement, which specifies the value that will be given back to us when we call the function.

In [116]:
def tokenize(fname):
    with open(fname) as f:
        contents = f.read()
    return contents.split()

Now I can call my `tokenize` function without worrying about how it works.

In [122]:
italian = tokenize('data/italian.txt')

In [125]:
words = tokenize('data/myfile.txt')

If I ever needed to change my program (perhaps I want to split by '#' instead of whitespace), I only have to do that in one place.
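As a sketch of what that one-place change could look like, here is a version of `tokenize` that splits on '#'. The temporary file and its contents are hypothetical and only there so the example runs on its own.

```python
import os
import tempfile

def tokenize(fname):
    with open(fname) as f:
        contents = f.read()
    return contents.split('#')   # the only line that changed

# Setup only: a small file whose items are separated by '#'.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'hashes.txt')
with open(path, 'w') as out:
    out.write('uno#due#tre')

tokens = tokenize(path)
print(tokens)
```

Every existing call to `tokenize` picks up the new behaviour without being edited.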

Importantly, when you define a function, it doesn't actually run the code. It just remembers what code to run when you want to run it later.

Functions take input (their arguments, between the parentheses) and return values (the thing after the `return`). If your function doesn't have a `return` statement, it will return the special value `None`, which is basically Python's way of saying "nothing".

In [127]:
def say_hello():
    print('hello')

In [128]:
say_hello()

hello


In [131]:
x = say_hello()
type(x)

hello


NoneType
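Printing a value and returning a value are easy to confuse, because both make something show up in a notebook. A minimal sketch with two made-up functions shows the difference: only the returned value can be stored and used later.

```python
def print_double(n):
    print(n * 2)      # shows the result, but gives nothing back

def return_double(n):
    return n * 2      # gives the result back to the caller

a = print_double(5)   # prints the doubled value; a is None
b = return_double(5)  # prints nothing; b holds the doubled value

print(a, b)
```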

Function definitions in Python have the following syntax:

```
def function_name(arg1, arg2, arg3):
    do some processing with the args
    return something
```

#### Challenge 16

Write a function called `add_then_square` that takes in a number `n`, adds 1 to it and returns the square of that. Test it by calling your function with the argument 5.

In [132]:
def add_then_square(n):
    return (n+1)**2

In [133]:
add_then_square(5)

36

#### Challenge 17

Write a function that takes in a list of strings and returns a list of the strings that have 'b' in them. Call your function `has_b`. Test your function by calling it with `italian` as the argument.

In [134]:
def has_b(words):
    return [word for word in words if word.count('b') > 0]

#### Challenge 18

Write a function that takes in a string and returns whether it has a 'b' in it.

In [137]:
def has_b(word):
    return word.count('b') > 0

- What type of value does it return?

#### Challenge 19

Generalize the function from challenge 18 to a function that accepts a word and a letter, and returns whether the word has the letter in it.

In [139]:
def has_letter(word, letter):
    return word.count(letter) > 0
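Here is one way you might test `has_letter`, sketched on a short made-up word list. It also shows how a function that returns `True` or `False` slots into the condition of a list comprehension. Note that the check is case-sensitive, so a capital 'B' does not count as a 'b'.

```python
def has_letter(word, letter):
    return word.count(letter) > 0

words = ['battaglia', 'di', 'Benevento']

# Keep only the words that contain a lower-case 'b'.
with_b = [word for word in words if has_letter(word, 'b')]

print(has_letter('battaglia', 'b'))
print(has_letter('battaglia', 'z'))
print(with_b)
```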