# Python for text analysis: week 1

Welcome to the course! You are now in the Jupyter Notebook environment, running the notebook for week 1.
Notebooks are pretty straightforward. Some tips:

* Cells in a notebook contain code or text. If you run a cell, it will either run the code or render the text.
* There are four ways to run a cell:
    1. Click the 'play' button next to the 'stop' and 'refresh' buttons in the toolbar.
    2. Alt + Enter runs the current cell and creates a new cell.
    3. Ctrl + Enter runs the current cell without creating a new cell.
    4. Use the menu and select Cell/Run all.
* The instructions are written in Markdown. [Here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) is a nice Markdown cheatsheet if you want to write some more.
* Explore the menus for more options! You can even create a presentation using Notebooks.

Hint when you're writing Python code: press Tab to auto-complete your variable names!

## Getting started

This notebook provides you with an overview of the basics of Python. You don't need to remember everything; we just want to give you a sense of what's available in the language. Also recall the 15-minute rule: if you're stuck for longer than 15 minutes, contact us and we'll help you. (In class, of course, you can ask us immediately.)

Let's first start with something really simple. Every programming language is traditionally introduced with a "Hello world" example. Please run the following cell.

In [None]:
print("Hello, world!")

What happened here? Well, Python has a large set of built-in functions, and `print()` is one of them. When you use this function, `print()` outputs its *argument* to the screen. 'Argument' is a fancy word for "object you put in a function". In this case, the argument is the string "Hello, world!". And 'string' just means "a sequence of characters".

Instead of providing the string directly as an argument to the print function, we can also create a variable that refers to the string value "Hello, world!". When you pass this variable to the print function, you get the same result as before:

In [None]:
text = "Hello, world!"
print(text)

Note that the variable name `text` is not part of Python. You could use any name you like, and the example would still work, even if you change the name `text` to something silly like `pikachu` or `sniffles`. But it's standard practice to use clear variable names, so that your scripts will remain understandable (especially when they grow larger).

Since it's kind of boring to always use the same string, we'll make use of another built-in function: `input()`. This takes user input and returns it as a string. Try it below:

In [None]:
text = input('Please enter some text.') # If this doesn't work, you may have Python 2 installed.
print(text)                             # Please install Python 3, or you'll be unable to use these notebooks.

Until we learn how to use functions and files, we will use the following setup to explore Python:

1. We assign some input value to a variable,
2. do something with that value,
3. and print the result.
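
The three steps above can be sketched in a few lines. This is only an illustration: the names `name` and `greeting` are made up for this example, and gluing strings together with `+` is just one possible "something" to do in step 2.

```python
# 1. Assign some input value to a variable.
name = "world"

# 2. Do something with that value (here: build a greeting with +).
greeting = "Hello, " + name + "!"

# 3. Print the result.
print(greeting)
```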

## Objects

Python is an object-oriented programming language. This means that it treats every piece of data like some kind of object that can be manipulated and passed around. Python has the following basic *types* of objects:

* **String**: for representing text.
* **Integer**: for representing whole numbers.
* **Float**: for representing numbers with decimals.
* **Tuple**: for representing immutable combinations of values.
* **List**: for representing ordered sequences of objects.
* **Set**: for representing unordered sets of objects.
* **Dictionary**: to represent mappings between objects.
* ...and **functions**: to manipulate objects, or to produce new objects given some input.

You can also read about the types in Python in the documentation [here](https://docs.python.org/3/library/stdtypes.html). Basically, what you need to know now is that each type has particular *affordances*, or associated things that you can do with them. When something is of a numeric type (integer, float), the Python interpreter knows that you can perform mathematical operations with that object. You cannot use those operations with strings, because it doesn't make sense to take the square root of the string "Hello, world!".

It's the same with vehicles and food in real life. Anything that is of the type *vehicle* can be used to get around and possibly transport goods. But you cannot eat a vehicle. Anything that is of the type *food* is edible, but it's pretty difficult to use food for transportation. (Try biking home on a carrot.)
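
To see affordances in action, here is a small sketch. It assumes the standard `math` module for the square root, and it uses `try`/`except` (not covered yet) only so that the cell keeps running after Python refuses the operation. The names `total`, `root`, and `string_root_worked` are made up for this example.

```python
import math

a_number = 4
total = a_number + 3.14      # Adding numbers works.
root = math.sqrt(a_number)   # So does taking a square root.
print(total, root)

try:
    math.sqrt("Hello, world!")   # A string has no square root.
    string_root_worked = True
except TypeError as error:
    string_root_worked = False
    print("Python refused:", error)
```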

Here's an example of each type:

In [None]:
test_string  = 'test'
a_number     = 4
other_number = 3.14
pair         = (2,5)
some_list    = [1,2,3,1,2,3,'a','b','c']
a_set        = {1,2,3,4,'apple'}
a_dict       = {'milk':2, 'cheese':1, 'pickles':45}
print_function = print

We can use the `type` function to check object types. Let's use it for a selection of our newly defined objects:

In [None]:
type(print_function)

In [None]:
type(other_number)

In [None]:
type(test_string)

### Strings

We'll now take a look at the different object types in Python, one by one, starting with strings. Let's define a few of them:

In [None]:
# Here are some strings:
string_1 = 'Hello, world!'
string_2 = 'I ❤️ cheese'      # If you are using Python 2, your computer will not like this.
string_3 = '1,2,3,4,5,6,7,8,9'
# Strings that span multiple lines must start and finish with three single or double quotes.
string_4 = """This string covers
multiple lines!"""
# You can also use double quotes:
string_5 = "This one\n does too!"

Strings can contain any character you can think of, including emoji! The cell above also shows different ways to enter a string in Python: using single/double quotes, or three single/double quotes. In addition, the 5th string also shows a hidden character. Here is what that line looks like when you print it:

In [None]:
print(string_5)

`\n` stands for 'new line', and produces a line break. Another common hidden character is `\t`, which produces a tab.
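
Here is a small sketch that combines both hidden characters. The string `table_text` is made up for this example; the point is only that `\t` and `\n` change the layout when printed, and that each of them counts as a single character.

```python
# \t inserts a tab, \n starts a new line.
table_text = "word\tcount\ncheese\t3\nmilk\t2"
print(table_text)

# A hidden character is still just one character.
print(len("a\tb"))  # 3
```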

Let's explore some properties of strings.


**Strings are sequences of characters**

Python provides very useful functions to work with sequences. Since strings are represented as sequences of characters, these functions work for strings as well. Important for now are:

* Length
* Containment
* Indexing
* Looping

**Length**: Python has a built-in function called `len()` that lets you compute the length of a sequence. It works like this:

In [None]:
number_of_characters = len(string_1)
print(number_of_characters) # Note that spaces count as characters too!

What happened above is that the `len`-function was called with `string_1` as its argument. The Python interpreter then counted all characters in `string_1` and *returned* the result. (Programmers say 'return' to mean the function produces some kind of result.) This result (the number of characters) got assigned to the variable `number_of_characters`, which gets printed by `print` (a function that doesn't return anything, but rather *displays* its argument on the screen).
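
The difference between *returning* and *displaying* can be made visible with a small experiment. This is just a sketch; the names `length` and `printed` are made up for this example.

```python
string_1 = 'Hello, world!'

length = len(string_1)      # len() returns a value that we can store.
printed = print(string_1)   # print() displays its argument on the screen...

print(length)               # 13
print(printed)              # ...but what it returns is None, Python's "nothing" value.
```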

**Containment**: The Python keyword `in` allows you to check whether a string contains a particular substring. It returns `True` if the string contains the relevant substring, and `False` if it doesn't. These two values (`True` and `False`) are called *boolean values*, or *booleans* for short. We'll talk about them in more detail later. Here are some examples to try:

In [None]:
"fun" in "function"

In [None]:
"I" in "Team"

In [None]:
"App" in "apple" # Capitals are not the same as lowercase characters!

**Indexing**: Python provides access to the characters in each string through indexing. The table below shows all indexes for the string "Sandwiches are yummy". 

| Positive index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|----------------|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|
| Characters     | S | a | n | d | w | i | c | h | e | s |    | a  |  r | e  |    |  y |  u |  m |  m |  y |
| Negative index |-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10|-9|-8|-7|-6|-5|-4|-3|-2|-1|

You can access the letters of a string using the following notation:

```python
my_string = "Sandwiches are yummy"
print(my_string[1])   # This will print 'a'.
print(my_string[1:4]) # This will print 'and'
print(my_string[1:4:1]) # This will also print 'and', but is more explicit about what's happening.
print(my_string[-1]) # This will print 'y'
```

So how does this notation work?

```python
my_string[i] # Get the character at index i.
my_string[start:end] # Get the substring starting at 'start' and ending *before* 'end'.
my_string[start:end:stepsize] # Get all characters starting from 'start', ending before 'end', 
                              # with a specific step size.
```

You can also leave parts out:

```python
my_string[:i] # Get the substring starting at index 0 and ending just before i.
my_string[i:] # Get the substring starting at i and running all the way to the end.
my_string[::i] # Get a string going from start to end with step size i.
```

You can also have negative step size. `my_string[::-1]` is the idiomatic way to reverse a string.
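
A short sketch of negative step sizes, using a different example word (`word` and the other names are made up for this illustration):

```python
word = "stressed"

backwards = word[::-1]         # The whole string, back to front.
every_other_back = word[::-2]  # Every second character, starting from the end.
ending = word[-3:]             # Negative indexes also work inside slices.

print(backwards)         # desserts
print(every_other_back)  # dset
print(ending)            # sed
```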

Let's have a small test. Do you know what the following statements will print?

In [None]:
my_string = "Sandwiches are yummy"
print(my_string[0])

In [None]:
print(my_string[11:14])

In [None]:
print(my_string[15:])

In [None]:
print(my_string[:9])

In [None]:
print('cow'[::2])

**Looping**: we'll cover this in much more detail later on, but all sequences are *iterable*, which means you can loop over them. In other words, you can do something like this:

In [None]:
for char in "word": # For every character in the string "word"..
    print(char)     # Print that character.

That is: 

1. Take all the elements (in this case: characters) in the sequence, 
2. assign them one-by-one to a variable (in this case: `char`, but you can use any name), 
3. do something with that variable (in this case: print it), 
4. move to the next element and go to (2.) until there are no more elements left.

There are some other things you can do, like:

In [None]:
for char in reversed("word"): # For every character in the string "word" (but then reversed)..
    print(char)               # Print that character.

In [None]:
for char in sorted("word"): # For every character in the string "word" (but then sorted)..
    print(char)             # Print that character.
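
If you want to look at the reordered characters all at once instead of looping over them, one option is to turn them into a list, or to glue them back into a string with `join` (a string method listed in the overview further down). A minimal sketch, with made-up variable names:

```python
word = "word"

sorted_chars = sorted(word)            # sorted() returns a list of characters.
reversed_chars = list(reversed(word))  # reversed() needs list() to show its contents.
sorted_word = "".join(sorted_chars)    # Glue the characters back into one string.

print(sorted_chars)    # ['d', 'o', 'r', 'w']
print(reversed_chars)  # ['d', 'r', 'o', 'w']
print(sorted_word)     # dorw
```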

**Strings have useful methods**

A method is a function that is associated with an object. For example, the string method `lower()` turns a string into all lowercase characters, and `upper()` makes strings uppercase. You can call these methods using dot notation:

In [None]:
print(string_1)         # The original string.
print(string_1.lower()) # Lowercased.
print(string_1.upper()) # Uppercased.

So how do you find out what kind of methods an object has? There are two options:

1. Read the documentation. See [here](https://docs.python.org/3.5/library/stdtypes.html#string-methods) for the string methods.
2. Use the `dir()`-function, which returns a list of method names (as well as attributes of the object). If you want to know what a specific method does, use the `help()`-function.

Run the code below to see what the output of `dir()` looks like. 

The method names that start and end with double underscores ('dunder methods') are Python-internal. They are what makes general methods like `len` work (`len` internally calls the `string.__len__()` function), and cause Python to know what to do when you, for example, use a for-loop with a string.

The other method names indicate common and useful methods. 

In [None]:
# Run this cell to see all methods for string_1
dir(string_1)
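
The output of `dir()` is quite long. The sketch below first checks that `len` and the dunder method `__len__` really give the same answer, and then hides every name that starts with an underscore. It uses a list comprehension, which we have not covered yet, so treat it as a preview; the name `public_names` is made up for this example.

```python
string_1 = 'Hello, world!'

# len() and the dunder method give the same answer.
print(len(string_1) == string_1.__len__())  # True

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(string_1) if not name.startswith('_')]
print(public_names)
```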

If you'd like to know what one of these methods does, you can just use `help` (or look it up online):

In [None]:
help(string_1.upper)

In [None]:
help(string_1.split)

It's important to note that methods only *return* the result. They do not change the string itself.

In [None]:
x = 'test'    # Defining x.
y = x.upper() # Using x.upper(), calling the result y.
print(y)      # Print y.
print(x)      # Print x. It is unchanged.
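
If you do want `x` to refer to the uppercased version, assign the result back to the same name. A minimal sketch:

```python
x = 'test'
x.upper()      # The result is computed, but thrown away.
print(x)       # Still 'test'.

x = x.upper()  # Re-assign the name x to the new string.
print(x)       # Now 'TEST'.
```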

### Overview of useful methods

Here's an overview of the methods that I frequently use. Others may also be useful, but I use them less often.
If you encounter a new problem that involves strings, always check the string methods before trying to implement something yourself!
This will save you time, and your code will be clearer and more idiomatic (=easier for everyone to read, including Future You).

**Transforming a string**

| Command | What does the command do? | What is it useful for? |
|---------|---------------------------|------------------------|
| casefold | Lowercase a string and try to normalize special characters, e.g. 'Gruße' => 'grusse' | Preprocessing/normalizing text, reducing the vocabulary size. |
| join | Join elements of an iterable together with a string. | Formatting data, joining tokenized text. |
| lower | Make a string lowercase. | Preprocessing/normalizing text. |
| replace | Replaces specific substrings in a string with a replacement substring. | General purpose. |
| split | Splits a string at every instance of another string (e.g. at every comma). | General purpose. |
| splitlines | Splits a text into a list of lines. | Useful data preprocessing. |
| strip | Removes leading and trailing whitespace, or specific leading and trailing characters, from a string. | Useful data preprocessing. |

Others in this category: `expandtabs, format, format_map, ljust, lstrip, partition, rjust, rpartition, rsplit, rstrip, swapcase, title, upper, zfill`.
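
Here is a small sketch that chains several of the methods from the table. The input string `raw_line` and the other variable names are made up for this example, and the list comprehension on the third line is something we will only cover later.

```python
raw_line = "  Milk, Cheese ,Pickles \n"

cleaned = raw_line.strip()                        # Remove whitespace at both ends.
items = cleaned.split(',')                        # Split at every comma.
items = [item.strip().lower() for item in items]  # Tidy up each item.
joined = ' & '.join(items)                        # Glue the items back together.
replaced = joined.replace('&', 'and')             # Swap one substring for another.
lines = "line one\nline two".splitlines()         # Split a text into lines.

print(items)     # ['milk', 'cheese', 'pickles']
print(joined)    # milk & cheese & pickles
print(replaced)  # milk and cheese and pickles
print(lines)     # ['line one', 'line two']
```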

Special case: `maketrans` and `translate`, because they form a pair. The former creates a translation table, the latter uses the translation table to modify a string. Let's look at two examples. You don't need to remember these, but they are here for future reference. (Removing punctuation can be a useful preprocessing step for your data, and this is generally one of the fastest ways to do it.)

In [None]:
# The simple case: replacing characters.

# First make a translation table. 'a' will be mapped to 'n' and 'b' will be mapped to 'o'
table = str.maketrans('ab','no')
# Translate the string, using the table.
result = 'abba'.translate(table)
# Print the result: noon.
print(result)

In [None]:
# The more complex case: removing punctuation.

# Define punctuation to remove.
punctuation_to_remove = '.,!'

# Make a translation table. 
# 1. No characters are transformed, as is indicated by the two empty strings.
# 2. The optional third argument indicates which characters should be removed.
table = str.maketrans("", "", punctuation_to_remove)

# We will remove punctuation from this text.
with_punctuation = "Hello, I am a hamster! I like to sleep during the day."

# Translate.
result = with_punctuation.translate(table)

print(result)
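
Instead of typing out the punctuation yourself, you could also borrow the ready-made collection from the standard `string` module. This is a sketch of that variation, not a replacement for the cell above; note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters are left alone.

```python
import string

# string.punctuation holds the ASCII punctuation characters.
table = str.maketrans("", "", string.punctuation)

with_punctuation = "Hello, I am a hamster! I like to sleep during the day."
result = with_punctuation.translate(table)

print(result)  # Hello I am a hamster I like to sleep during the day
```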

**Getting information from a string**

| Command | What does the command do? | What is it useful for? |
|---------|---------------------------|------------------------|
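| find | Return the index of the first occurrence of a substring, or -1 if the substring does not occur. | Locating a word or marker in a text. |
| index | Like `find`, but raises a `ValueError` if the substring does not occur. | Cases where a missing substring should count as an error. |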
| count | Count the non-overlapping occurrences of a substring in a string. WARNING: it ignores word boundaries, so 'he' is also counted inside 'the'! | General utility |

Others in this category: `rfind, rindex`.
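
A quick sketch of how `find`, `index`, and `count` behave. The example sentence and variable names are made up; `try`/`except` is used only to show the error from `index` without stopping the cell.

```python
sentence = "the cat sat on the mat"

cat_position = sentence.find("cat")  # Index of the first match.
dog_position = sentence.find("dog")  # -1 means "not found".
the_count = sentence.count("the")    # Counts every occurrence.
he_count = sentence.count("he")      # Also matches inside 'the'!

print(cat_position)  # 4
print(dog_position)  # -1
print(the_count)     # 2
print(he_count)      # 2

try:
    sentence.index("dog")            # index() complains instead of returning -1.
    index_failed = False
except ValueError:
    index_failed = True
    print("index() raises a ValueError when nothing is found.")
```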

**Checking if a string has a particular property**

| Command | What does the command do? | What is it useful for? |
|---------|---------------------------|------------------------|
| startswith | Check if a string starts with a particular sequence of characters. | Detecting words with particular base lemmas. |
| endswith | Check if a string ends with a particular sequence of characters. | Detecting particular compounds, inflectional forms, ... |
| isdigit | Check if a string consists only of digits. | Detecting values in tokenized text. |

Others in this category: `isalnum, isalpha, isdecimal, isidentifier, islower, isnumeric, isprintable, isspace, istitle, isupper`.
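
To round off, here is a sketch that applies the three checks from the table to a handful of tokens. The token list is made up for this example, and the list comprehensions are again a preview of something we will cover later. Note that `'3.14'` is not all digits (the dot is not a digit), and that `startswith` is case-sensitive.

```python
tokens = ["unhappy", "walking", "2024", "3.14", "Undo"]

starts_with_un = [token for token in tokens if token.startswith("un")]
ends_with_ing = [token for token in tokens if token.endswith("ing")]
only_digits = [token for token in tokens if token.isdigit()]

print(starts_with_un)  # ['unhappy']
print(ends_with_ing)   # ['walking']
print(only_digits)     # ['2024']
```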