# Python for Linguists

Hello and welcome to the first module of Python for Computational Linguistics!

This module will introduce you to Python, the programming language that we will use throughout this course. We recommend going through this module even if you're already familiar with Python - a reminder is always good.

## What is Python?

Python is a simple programming language whose aim is to help programmers write clear, readable code quickly, and it's particularly suitable for beginners due to its simple syntax. However, don't be fooled by its simplicity - Python is a full-fledged programming language which, later in this course, will allow us to build complex language models and train huge neural networks!

To begin your career as a programmer, you can run your first Python code in the box below by clicking inside the box with your mouse and pressing the `Control` and `Enter` keys on your keyboard at the same time:

In [1]:
2 + 2

4

Did the number `4` appear just below the expression? Congratulations! You have just solved a very difficult mathematical problem with Python.

## Jupyter Notebooks

The web page you are visiting right now is a *Jupyter Notebook*. A Jupyter Notebook is an environment which allows you to run Python code within your browser. 

Actually, the code does not run *in* your browser, but on a *server*; if you're running this notebook on Azure, for example, the instructions you write run on Microsoft's servers somewhere in the world, which then send the results back to this page. However, you can also run a Jupyter server on your own laptop, for example; this way, the code will actually run on your machine.

We write code in *cells*, which are the grey boxes you see on this page. For example, you can try to write

`print('Hello world!')`

in the cell below. To run the code, press `Control` and `Enter` on your keyboard, as you did before.

In [2]:
print('Hello world!')

Hello world!


You should see the string `Hello world!` appearing under the cell!

Now that we know how to run code, let's begin with some programming basics.

# Basic programming concepts

## Statements

A computer program is usually divided into many *instructions*, which tell the computer what operations should run. For example, run the cell below: 

In [3]:
print('This is an instruction')
print('This is another instruction')

This is an instruction
This is another instruction


As you can see, in Python, each line corresponds to an instruction. If needed, we can break an instruction across multiple lines, but this works only in special cases. For example, if you run the cell below, you will see that

In [4]:
print(
    'A multiline instruction'
)

A multiline instruction


is a valid instruction. On the other hand, if you run the cell below, you will get an **`Error`**:

In [5]:
2 +
2

SyntaxError: invalid syntax (<ipython-input-5-61f4b0ef3787>, line 1)

This happens because Python does not know how to read the code in the cell, so it will *raise* an error and try to display an informative message about what is happening.

As a rule of thumb, statements inside parentheses can be split across multiple lines in order to improve the readability of the code. However, until you are more confident with the language, we advise you to write one instruction per line.
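
As a small sketch of this rule, the sum that failed in the cell above runs fine once it is wrapped in parentheses. The variable names `total` and `length` are only illustrative.

```python
# Inside parentheses, Python keeps reading until the closing parenthesis,
# so the expression can span several lines.
total = (2 +
         2)
print(total)

# The same holds for the arguments of a function call.
length = len(
    'A multiline instruction'
)
print(length)
```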

Please note that when you encounter errors, you should **always** read the error type and message to understand what is happening; however, we will delve into this topic later.

We will start by using Python as a calculator. In Python, numbers support the basic mathematical operations, such as `+`, `-`, `*`, and `/`.

For example, we can sum two numbers by writing:

In [6]:
134 + 25

159

And we can multiply them by writing:

In [7]:
13.4 * 4

53.6

However, Python supports more complex operations as well, such as the power. For example, $2^3$ can be written as:

In [8]:
2 ** 3 

8

We can also group operations using parentheses:

In [9]:
(2 ** 3) / 3

2.6666666666666665

## Variables

While playing with a calculator is surely amusing, if we could only do mathematical operations in Python, it would obviously be quite limited. For example, what if we wanted to store the result of an operation, to reuse it later?

For this reason, virtually all programming languages offer the possibility of storing data in **variables**. This way, we can store a piece of data in the computer's memory and reuse it later, e.g. to form more complex expressions, to pass it to external programs, and so on.

The syntax for assigning variables is

`variable_name = value`

For example, if we write and run:

In [10]:
a = 5

we save the value `5` in the variable `a`. To see the value stored in a variable, we can simply write its name in a cell, e.g.:

In [11]:
a

5

Let's play with some variables! Can you guess the results of the cells below before running them?

In [12]:
a = 5 
b = 2 

In [13]:
a ** b

25

In [14]:
(a + b) / 3

2.3333333333333335

In [15]:
(b * 2) / a - 3

-2.2

In [16]:
string1 = 'hello'
string2 = 'world'
print(string1 + ' ' + string2)

hello world


You can also **update** the content of a variable:

In [17]:
d = a + 3
d

8

In [18]:
a = 1
d = a + 3
d

4

Please note that in Python variable names **can't** contain spaces. In fact, a variable name can contain only alphanumeric characters (i.e. letters and numbers) and underscores, i.e. the character `_`. Moreover, a variable name can't begin with a number.

Knowing this, which of the following instructions will run, and which of them will fail? Try it yourself!

In [19]:
a_number = 2

In [20]:
_a_number = 2

In [21]:
2_number = 22

SyntaxError: invalid token (<ipython-input-21-b149423fb2f5>, line 1)

In [22]:
a_number = 
2

SyntaxError: invalid syntax (<ipython-input-22-3cdda63118ad>, line 1)

In [23]:
twoIs__anumber = (
    2)
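
If you want to check a name without running an assignment, one option is the string method `isidentifier()`, which reports whether a piece of text follows the naming rules. This is only a sketch: it uses strings and methods, which are introduced later in this module, and the candidate names are the ones from the cells above plus one with a space. Note that the method also accepts reserved words such as `if`, which still cannot be used as variable names.

```python
# str.isidentifier() returns True when the text follows the naming rules.
print('a_number'.isidentifier())        # letters and underscores
print('_a_number'.isidentifier())       # a leading underscore is allowed
print('2_number'.isidentifier())        # starts with a digit
print('a number'.isidentifier())        # contains a space
print('twoIs__anumber'.isidentifier())  # mixed case and double underscores
```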

## Comments

Sometimes, when you write code, you may want to describe what a line does, in order to help your future self understand what that code means. This is achieved using **comments**, special annotations which are *ignored* by Python when you run the code.

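In particular,
- Everything written after a `#` is ignored **until the end of the line**.
- Text written between **triple quotes** (`'''` or `"""`) on its own lines has no effect on the program. Strictly speaking, it is a string that Python reads and then discards, which is why the notebook echoes it back when it is the last thing in a cell.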

For example, if we write:

In [24]:
# 2 + 3

Nothing happens! Other examples of comments are:

In [25]:
3 + 2  # this is a comment

5

In [26]:
""" This is 
another comment
"""

' This is \nanother comment\n'

However, these comments are **not valid**:

In [27]:
3 + 2 # this is
a broken comment

SyntaxError: invalid syntax (<ipython-input-27-f506ed4d4dba>, line 2)

In [28]:
3 + 2 """ This comment style work only on non-code lines """

SyntaxError: invalid syntax (<ipython-input-28-75ac46bdd079>, line 1)

In [29]:
3 + """ For example, we can't split a code line like this """ 2

SyntaxError: invalid syntax (<ipython-input-29-aebdf69d22f0>, line 1)

## Errors

So now we have seen what happens when we write bad code: Python refuses to run it and returns an **`Error`**. However, one can (and will) make many, many kinds of errors while writing code, and it's always important to understand them, so that you can fix them and, ideally, avoid repeating them.

For example, the expressions we've seen when talking about variables work because the variables `a`, `b`, and `d` have been previously created. Look what happens if we try to use a variable which has never been assigned:

In [30]:
z + 5

NameError: name 'z' is not defined

Now this error is different from the ones we encountered before. If we read the messages, in fact, the previous errors were `SyntaxError`s, i.e. errors related to the syntax of the code we wrote; now the error is a `NameError`, i.e. Python is telling us that we're referencing the name of a variable that does not exist.

When you encounter an error, you should **always** read the error type and message to know what you're doing wrong. Messages are usually informative and [will help you to solve the problem](https://geekandpoke.typepad.com/geekandpoke/2009/06/the-art-of-bugfixing-chapter-1.html).

There are two kinds of errors:
- `Error` and `Exception` are serious errors that will make your program stop and crash (in other words, bugs).
- `Warnings` are errors that *won't* make your program crash. However, you should always read the warning messages, because they may point to potential errors in your code.
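
The difference between the two kinds can be sketched as follows. The example relies on two things this module has not covered: the standard `warnings` module and the `try`/`except` statement, which lets us intercept an error instead of letting the program stop. The names `still_running`, `undefined_name` and `error_type` are made up for the illustration.

```python
import warnings

# A warning is reported, but the program keeps running.
warnings.warn('this is only a warning')
still_running = True
print(still_running)

# An error stops the normal flow. Here we catch it with try/except
# so that we can look at its type and message.
try:
    undefined_name + 5
except NameError as error:
    error_type = type(error).__name__
    print(error_type)
    print(error)
```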

# Built-in data types and functions

## Built-in functions

Python offers a set of built-in functions to perform some basic operations. We have already encountered the simplest of them all: the `print` function, which allowed us to write some text below the code cells.

Functions are called this way:

```
function_name(argument_1, argument_2, ... )
```

For example, `print()` accepts some text as its argument and prints it out in the browser (or on a console). Let's see again how it works:

In [31]:
print("Nothing will come of nothing.")

Nothing will come of nothing.


`print` will actually try to, um, print everything you pass to it. For example:

In [32]:
print(a)
print(3 ** 2)
print(a - b ** 7)

1
9
-127


Some functions may also **return** a value and/or accept more arguments.

Please note that when we run the cells below, the red label `Out [xx]` appears next to the result of our computation, while when we `print()` something this does not happen. This is because the functions below (can you guess what they're doing?) *return* a value, while `print()` merely shows something on the screen.

For example:

In [33]:
abs(-3)

3

In [34]:
pow(2,3)

8

You can use the return value of a function as you wish. For example, you can store it in a variable, you can use it in another computation, and so on.

In [35]:
a = pow(-2,3)
print(a)
b = abs(a) * 2
print(b)

-8
16
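
To make the difference between returning and printing concrete, here is a minimal sketch. The variable names `returned` and `printed` are only illustrative.

```python
# abs() returns a value, so we can store it and reuse it.
returned = abs(-3)
print(returned + 1)

# print() shows its argument on screen, but the value it returns is None.
printed = print('Nothing will come of nothing.')
print(printed)
```

`None` is Python's way of saying "no value": storing the result of `print()` is allowed, but rarely useful.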


## Types

The built-in function `type()` tells us the **type** of something. For example,

In [36]:
type(2)

int

Do you remember the variables we defined above? Let's check their types:

In [37]:
print(a)
type(a)

-8


int

In [38]:
print(string1)
type(string1)

hello


str

The **type** is a property which tells Python (or, in general, any programming language) how to deal with the object we are giving it, and which operations we can perform on it.


Now, `a` is an **`integer`**, i.e. a number. Hence, Python knows that it can sum, multiply, and divide it. `string1`, instead, is a **string**, i.e. a textual variable. While it's obvious that we can't "divide" a string, we can perform other kinds of operations on it, e.g.:

In [39]:
print(string1)
print(string1[0])   # get the first character of 'hello'
string1 + string2   # concatenate two strings

hello
h


'helloworld'

We will see later what these operations mean; for now, you can try to guess what we're doing here.

Python offers these built-in types, which we will describe in detail below:

- Numbers, i.e. integers, floating point numbers (i.e. non-integers), and complex numbers.
- Strings, i.e. text
- Booleans, i.e. the truth values `True` and `False` of the [Boolean Algebra](https://en.wikipedia.org/wiki/Boolean_algebra) 

You don't need to know much else about types for now. However, we will sometimes use `type` to see how Python handles data.

## Numbers

We already encountered numbers. Now, we'll see some operations that Python offers to handle them.

In [40]:
print(1 + 2)    # sum
print(3 - 7)    # subtraction
print(2 * 3)    # multiplication
print(5 / 6)    # division
print(2 ** 3)   # power

3
-4
6
0.8333333333333334
8


Python also offers several built-in mathematical functions (and many more than those shown here):

In [41]:
print(pow(2,3))             # power
print(abs(-3))              # absolute value
print(round(987.654321,3))  # rounding to the nth decimal
print(round(22/7,2))        # does this ring any bell?
print(max(10,1000))         # maximum value between two numbers
print(min(-1000,-10))       # minimum value between two numbers

8
3
987.654
3.14
1000
-1000


Can you guess the difference between these two divisions?

In [42]:
7/2

3.5

In [43]:
7//2

3

As we know, computers store information using bits. For this reason, numbers with a fractional part are stored using the [floating point representation](https://en.wikipedia.org/wiki/Floating-point_arithmetic), while whole numbers are stored as integers. You don't need to know the details of how it works; however, you should be aware that `1` and `1.0` are two different things in Python.

In fact, let's see their types:

In [44]:
print(type(1))     # Ask Python to save 1 as an integer
print(type(1.0))   # Ask Python to save 1 as a real number 

<class 'int'>
<class 'float'>


Now that you know that, what is the difference between `/` and `//`?
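
One way to check your answer is to look at the types of the results. The sketch below is only an illustration; the variable names `exact`, `floored` and `negative` are made up for the example.

```python
# `/` is true division: the result is always a float,
# even when the division is exact.
exact = 6 / 2
print(exact, type(exact))

# `//` is floor division: between two integers it gives an integer,
# rounding the result down.
floored = 7 // 2
print(floored, type(floored))

# "Rounding down" means towards minus infinity, which matters
# for negative numbers.
negative = -7 // 2
print(negative)

# The remainder of the division is given by the `%` operator.
print(7 % 2)
```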

## Booleans

The boolean values `True` and `False` are the *truth values* associated with a statement; for example, `Shakespeare was an English poet` is a true statement, and `Claudius is a character in Romeo and Juliet` is a false statement.

Unfortunately, Python can't understand natural language statements. However, boolean algebra is widely used in programming; for example, we can compare numerical values using the classic comparison operations:

+ less than `<` and greater than `>`
+ less than or equal `<=` and greater than or equal `>=`
+ equal `==` and not equal `!=`

The results of comparison operations are of type **boolean**.

In [45]:
2 < 3

True

In [46]:
a = 4
b = 3
c = 3

# Can you guess the result of these operations before running the cell?

print(a < b)
print(b > a)
print(b < c)
print(b <= c)

print(a == b)
print(a != b)

False
False
False
True
False
True


In [47]:
d = (a == b)
print(type(d))  # What is the type of a boolean?

<class 'bool'>


## Strings

`string` is the fancy name used by programmers for sequences of characters, i.e. for *textual* variables.

As we have already encountered them, you should already know how to create a string: you just have to place the text between single (`'`) or double (`"`) quotation marks.

In [48]:
s1 = '' # empty string
s2 = "hello"
s3 = 'world'
s4 = 'hello world'
s5 = "2345"

We can perform a wide range of operations over strings. For example, we can get their length:

In [49]:
print(len(s1))
print(len(s2))

0
5


We can *concatenate* strings using the `+` operator:

In [50]:
print(s2 + s3) # we concatenate s2 and s3
print(s2 + " " + s3)

helloworld
hello world


Notice that `s2 + " " + s3` is equal to the string `"hello world"`:

In [51]:
(s2 + " " + s3) == s4

True

If needed, we can even *multiply* strings, i.e. *repeat* them:

In [52]:
print(s2 * 3)  # we repeat s2 three times

hellohellohello


### Indexing and slicing

Other common operations over strings are *indexing* and *slicing*. **Indexing** allows us to get the $n$-th element of any sequence of elements. With this syntax,

```python
variable[index]
```

we get the element at position `index` of the variable `variable`.

In [53]:
s = 'this is an example'

# indexing (to access a single character in the string)
print(s[0]) # print the first character of the string
print(s[1]) # print the second character of the string

t
h


If you're not familiar with programming, you are probably asking yourself why getting the *zeroth* element of our string did not end in an error.

<a id='zero-based'></a>
This happens because in Python indices start from zero. This is called **[Zero-based indexing](https://en.wikipedia.org/wiki/Zero-based_numbering)**, and it is a convention used in most programming languages for performance reasons.

This also means that, if a string has five characters, e.g. `hello`, its last element will have index 4:

In [54]:
'hello'[4]

'o'

Please note that strings are **immutable**, i.e. you can't change their content. For example, you can't do things like:

In [55]:
s[2] = 'x'

TypeError: 'str' object does not support item assignment

However, since `s` is just a variable, you can change its content altogether. Let's see other examples:

In [56]:
s = 'Romeo and Juliet'
print(s[0])
print(s[15])

# How do we get the last character of a string without having to
# count how many characters it contains?

print(s[len(s) - 1])       # we can use len()
print(s[-1])               # or we can use the negative notation

R
t
t
t


As we've seen in this last example, negative indexing tells Python to start looking from the last character:

In [57]:
'hello'[-4]

'e'

If we need more than a character from a string, we can use **slicing**. The syntax is:
```
variable[start_position:end_position]
```
For example, to get the first two characters of a string, we write:

In [58]:
'hello'[0:2]

'he'

Please note that *spans* do *not* contain the character denoted by the right index.

Let's see some other examples:

In [59]:
print(s)
print(s[0:2])
print(s[:2])  # the same as s[0:2]
print(s[5:])  # the same as s[5:len(s)]
print(s[1:2])
print(s[-3:]) # we can use negative indexing too!

Romeo and Juliet
Ro
Ro
 and Juliet
o
iet


As you have seen in this example, we can **omit** one of the two indices of the span if we want Python to look up from the beginning (omitting the left index) or to the end (omitting the right index) of the string.
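
Two consequences of these rules are worth a quick sketch. The variable names `title`, `left` and `right` are only illustrative.

```python
title = 'Romeo and Juliet'

# Because the right index is excluded, cutting a string at any position
# gives two pieces that join back into the original.
left = title[:5]
right = title[5:]
print(left)
print(right)
print(left + right == title)

# Unlike indexing, slicing past the end of the string is not an error:
# Python simply stops at the last character.
print(title[10:100])
```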

### Built-in string functions

Now we will introduce a new class of functions, i.e. **object methods**. The syntax for this kind of function is called *dot notation* and works this way:
```python
object.function()
```
This particular syntax tells Python that the function `function()` is applied to `object`. Each object (or type) has its own set of functions; for example, it would not make sense to take the square root of a string, or to replace all the threes in a number with the dollar symbol.

For example, we can find specific substrings in a string using `string.find()`:

In [60]:
print(s)
print(s.find("Juliet"))  
print(s.find("Othello"))

Romeo and Juliet
10
-1


`find()` returns the index where the given substring starts, or `-1` if the given substring is not present in the input string.
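
When we only need to know *whether* a substring is present, the `in` operator is a common alternative to `find()`. The sketch below shows both side by side; the variable names `play` and `start` are made up for the example.

```python
play = 'Romeo and Juliet'

# The `in` operator answers a yes/no question and returns a boolean.
print('Juliet' in play)
print('Othello' in play)

# find() additionally tells us where the substring starts, so its result
# can be used to slice the string from that position.
start = play.find('Juliet')
print(play[start:])
```

Remember to check for `-1` before using the result of `find()` as an index: `-1` is itself a valid (negative) index, so Python would not complain.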

`replace()` allows us to find a substring and replace it with something new:

In [61]:
d = s.replace("Romeo", "King Lear")

print(s)
print(d)  

Romeo and Juliet
King Lear and Juliet


Please notice how the string assigned to variable `s` is not modified, and the result of the operation is stored in the new variable `d`.

You should also be aware that the `replace` operation will replace _all_ occurrences of the given substring:

In [62]:
s = s.replace(" ", "_")
print(s)

Romeo_and_Juliet


In [63]:
# what happens if the requested string does not exist in the input one?
s.replace("Desdemona","Ophelia")

'Romeo_and_Juliet'
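
If replacing every occurrence is not what we want, `replace()` also accepts an optional third argument that limits the number of replacements. A minimal sketch, with `title` and `first_only` as illustrative names:

```python
title = 'Romeo and Juliet'

# Without a third argument, every occurrence is replaced.
print(title.replace(' ', '_'))

# With 1 as the third argument, only the first occurrence is replaced.
first_only = title.replace(' ', '_', 1)
print(first_only)
```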

Other useful operations on strings are the following:

In [84]:
x = "This is a nice University"

# convert to upper/lowercase
print(x.upper())
print(x.lower())

# count how many instances of a substring
print(x.count('i'))
print(x.count('is'))

# concatenate with a given delimiter
print("-".join(x))
print("*".join(x))

# split the string at a delimiter.
# creates a list (see below) with the obtained substrings
print(x.split("nice"))  
print(x.split(" "))     # delimiter found multiple times.
print(x.split("x"))     # delimiter not found. Creates a list with the entire string as the only element

THIS IS A NICE UNIVERSITY
this is a nice university
5
2
T-h-i-s- -i-s- -a- -n-i-c-e- -U-n-i-v-e-r-s-i-t-y
T*h*i*s* *i*s* *a* *n*i*c*e* *U*n*i*v*e*r*s*i*t*y
['This is a ', ' University']
['This', 'is', 'a', 'nice', 'University']
['This is a nice University']
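
Since splitting text into words is something we will do often, here is a sketch of how `split()` and `join()` work together. The example sentence contains some repeated spaces on purpose; the names `sentence`, `tokens` and `rebuilt` are only illustrative.

```python
sentence = 'This is  a nice   University'

# split(' ') cuts at every single space, so repeated spaces
# produce empty strings in the result.
print(sentence.split(' '))

# split() with no argument cuts at any run of whitespace instead.
tokens = sentence.split()
print(tokens)

# join() is the inverse operation: it glues a list back into a string.
rebuilt = ' '.join(tokens)
print(rebuilt)
```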


We usually cannot mix strings and numbers. If we do, we may obtain something different than expected:

In [65]:
number142 = 142
string142 = '142'

print(number142)
print(string142)

print(type(number142))
print(type(string142))

print(number142 * 3)
print(string142 * 3)

print(number142 == string142)

142
142
<class 'int'>
<class 'str'>
426
142142142
False


Now you should play with strings in the cell below.

Given the string we saved in variable `othello`, you should:

- print the string;
- determine if `Desdemona` is a substring;
- find the position of `what`;
- convert the string to uppercase;
- get the first three characters of the string, convert them to lowercase, and print them.

In [66]:
othello = 'Men should be what they seem'

# insert your code below



## Converting between types

What happens if we want to treat a string like a number, or vice versa?

For example, we may want to add the number contained in a string to an actual number, like
```python
"10" + 10
```

Or, we may want to append a number to a string, like
```python
"Shakespeare wrote " + 17 + " comedies"
```

What happens if we try to do the former?

In [67]:
"10" + 10

TypeError: must be str, not int

An error! Obviously, Python tells us that we can't sum strings and numbers. So, in order to do that, we need to **convert** our variable to the desired type:

In [68]:
int("10") + 10

20

In [69]:
"Shakespeare wrote " + str(17) + " comedies"

'Shakespeare wrote 17 comedies'

As you can see, we can use a `type` as a function, in order to convert a variable of one type to the desired one. 

It is **very** important to always remember the data type of our variables. If not, the results may be very different than expected:

In [70]:
a = "3"
b = "4"

print(a + b)
print(int(a) + int(b))

34
7
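
A few more conversions, as a sketch of what to expect. The variable names `truncated` and `label` are made up for the example.

```python
# Strings that look like numbers can be converted to int or float.
print(int('42') + 1)
print(float('3.5') * 2)

# Converting a float to an int drops the decimal part (no rounding).
truncated = int(3.9)
print(truncated)

# Numbers can always be converted to strings.
label = 'Act ' + str(3)
print(label)

# Text that does not look like a number cannot be converted:
# int('three') raises a ValueError.
```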


# Complex data types: tuples, lists, sets, and dictionaries

Now we will see some more complex data types, i.e. tuples, lists, sets, and dictionaries. What all these types have in common is that they allow us to store *more data* inside a single variable. In fact, while numbers, strings, and so on allow us to store only *one* number, string, etc. in a variable, it is often useful to store more than one piece of information in a variable.

For example, what could we do if we wanted to keep all the titles of Shakespeare's comedies in a single variable, or if we wanted to associate each of his plays with its main female character?

## Tuples

Tuples are the most basic composite data type. A tuple is a sequence of elements, much like a string is a sequence of characters. The syntax for defining tuples is 
```python
( element_1, element_2, ... element_n )
```

Let's write some example tuples:

In [71]:
(1, 2, 3)

(1, 2, 3)

In [72]:
('hello', 'world')

('hello', 'world')

In [74]:
x = ('One', 2, 'three', 4.0, False)
print(x)

('One', 2, 'three', 4.0, False)


As you can see, elements within tuples can be of any type. You can slice and index tuples exactly as you do with strings:

In [75]:
print(x[0])
print(x[0:2])
print(x[-1])

One
('One', 2)
False


You can use some of the string operations and methods on tuples, too:

In [87]:
T = (1, 2, 3, 4, 3, 2, 1)
print(len(T))
print((1,2) + (3,4))
print(T.index(4))  # the index of the first matching 4 in the tuple
print(T.count(2))  # the number of times 2 occurs in the tuple

7
(1, 2, 3, 4)
3
2


Because tuples are immutable, we cannot change them (i.e. item assignment, appending, ...) once they are created.

In [77]:
T[0] = 2

TypeError: 'tuple' object does not support item assignment
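
As with strings, immutability does not stop us from building a *new* tuple out of an old one. The sketch below does this with concatenation and slicing; `T2` is just an illustrative name.

```python
T = (1, 2, 3, 4, 3, 2, 1)

# We cannot assign to T[0], but we can build a new tuple from a
# one-element tuple (note the trailing comma) and a slice of the old one.
T2 = (2,) + T[1:]
print(T2)

# The original tuple is unchanged.
print(T)
```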

## Lists

The simplest way to describe lists is as *mutable tuples*.

In [86]:
L = [1,2,3,4,5]
print(L)
L[3] = 0
print(L)

[1, 2, 3, 4, 5]
[1, 2, 3, 0, 5]


Note to the reader: if you are asking yourself why we did `L[3] = 0` and the **fourth** element of the list was modified, go back and review [zero-based indexing](#zero-based)!

Like strings and tuples, lists support indexing and slicing:

In [None]:
# index
print(L[0])

# slice
print(L[:-1])

# concatenate with another list
print(L + [2, 3, 4])

Further, lists have no fixed size. That is, they can grow and shrink on demand, in response to list-specific operations:

In [None]:
L = [1, 'a', True]

# append an item at the end
L.append('b')
print(L)

# delete the item at index 1 and return the deleted item
print(L.pop(1))
# delete the item at index 0
del L[0]
print(L)
# delete the first matching item by value
L.remove('b')
print(L)  # 'b' has been removed

# insert: L.insert(position, item) inserts item at position in L
L.insert(0, 'a')
print(L)




In [None]:
# sort the list in ascending order
L = [1, 4, 3]
L.sort()
print(L)
# reverse the sorted list to obtain descending order
L.reverse()
print(L)

Note: Because lists are mutable, most list methods modify the list directly instead of creating a new one (compare this with strings, whose methods always return a new string).

Advanced: nesting and list comprehensions

## Sets

Sets are unordered collections of unique objects. The objects stored in a set must themselves be immutable.

In [None]:
# create a set:
a = {1, 'a', 'b'}
# create a set from a list: only the unique items are kept
b = set([1, 'a', 'a', 'b'])
print(a)
print(b)
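
For a linguist, a natural use of sets is building the vocabulary of a text, i.e. the collection of its distinct words. The sketch below is only an illustration; the token list and the names `tokens`, `vocabulary` and `other` are made up for the example, and `sorted()` is used just to print the results in a predictable order.

```python
tokens = ['to', 'be', 'or', 'not', 'to', 'be']

# Converting a list to a set keeps one copy of each distinct item.
vocabulary = set(tokens)
print(len(tokens))       # number of tokens
print(len(vocabulary))   # number of distinct word types

# Membership tests work with the `in` operator.
print('be' in vocabulary)
print('question' in vocabulary)

# Sets support the usual operations from set theory.
other = {'not', 'to', 'worry'}
print(sorted(vocabulary & other))  # intersection
print(sorted(vocabulary | other))  # union
```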

## Dictionaries

Dictionaries are a mutable mapping type: they map keys to their associated values.

In [None]:
D = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}
print(D['food'])   # fetch the value of key 'food'
D['quantity'] = 1  # assign a new value
D['size'] = 10     # create a new key by assignment
print(D)

We will encounter a `KeyError` if we fetch a key that does not exist.

In [None]:
D['shape']

We can use the `.get()` method to return a default value instead:

In [None]:
print(D.get('color'))
print(D.get('shape', 0))
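
The different ways of dealing with a possibly missing key can be summarised in a short sketch. The dictionary is redefined here so that the example runs on its own, and the default value `'unknown'` is an arbitrary choice.

```python
D = {'food': 'Spam', 'quantity': 1, 'color': 'pink', 'size': 10}

# The `in` operator checks whether a key exists, without raising an error.
print('color' in D)
print('shape' in D)

# get() returns the stored value when the key exists...
print(D.get('color'))
# ...None when it does not and no default is given...
print(D.get('shape'))
# ...or the default we pass as second argument.
print(D.get('shape', 'unknown'))

# get() never adds the missing key to the dictionary.
print(len(D))
```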

Advanced: sorting the keys

# Conditional statements and loops

In Python, we normally write one statement per line and indent all the statements in a nested block by the same amount.

## Assignment statements

In [None]:
# basic assignment
count = 0
print(count)

# augmented assignment
count += 1  # equivalent to count = count + 1
print(count)

# sequence assignment: one character per variable
a, b, c, d = 'spam'
print(a)

# multiple-target assignment
spam = ham = 'spam'
print(spam)
print(ham)
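
Sequence assignment is not limited to strings. As a sketch, the example below unpacks a tuple and then uses the same mechanism to swap two variables; the names `first` and `second` are only illustrative.

```python
# Sequence assignment works with any sequence of the right length.
first, second = ('Romeo', 'Juliet')
print(first)
print(second)

# Because the right-hand side is evaluated first, two variables can be
# swapped in a single line.
first, second = second, first
print(first)
print(second)

# The number of names must match the number of items:
# a, b = 'spam' would raise a ValueError (too many values to unpack).
```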

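In Python, you can also use an expression as a statement, i.e. on a line by itself. This is typical of calls to functions and methods that do their work in place: such calls do not return any useful value (they return `None`).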

In [None]:
# for example, in-place list methods return None
L = [1, 2, 3]
a = L.sort()
print(a)

# compare with the sorted() function, which returns a new list
b = sorted(L)
print(b)

## IF statements

In simple terms, the Python `if` statement selects actions to perform.

It takes the form of an `if` test, followed by one or more optional `elif` (“else if”) tests and a final optional `else` block. The tests and the `else` part each have an associated block of nested statements, indented under a header line.

In [None]:
# check if x is negative, 0 or positive, and print accordingly
x = 1
if x < 0:
    print('negative')
elif x == 0:
    print('0')
else:
    print('positive')

## FOR and WHILE loops

`for` and `while` loops are statements that repeat an action over and over.

The first of these, the while statement, provides a way to code general loops. 

In [None]:
# a loop that strips the first letter of a string, one letter at a time
x = 'spam'
while x: # While x is not empty
    print(x)
    x = x[1:] # Strip first character off x

We can add `break` and `continue` statements to the loop:

In [None]:
x = 'spam'
while x: # While x is not empty
    print(x)
    x = x[1:] # Strip first character off x
    if len(x)<2:
        break # stop when the length of the string is shorter than 2
    else:
        continue # else, continue to the next iteration in the loop

The second, the `for` statement, is designed for stepping through the items in a sequence object and running a block of code for each. The built-in `range` function produces a series of successively higher integers, which can be used as indexes in a `for` loop.

In [None]:
# Let's rewrite the previous while loop as a for loop:
x = 'spam'
for i in range(len(x)):  # range() fixes the maximum number of iterations in advance
    print(x)
    x = x[1:]
    if len(x) < 2:
        break
    else:
        continue
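
The `for` statement does not need `range()` when we simply want to visit each item of a sequence. As a sketch, the example below steps through the words of a line and counts them with a dictionary, combining `split()` and `get()` from the previous sections. The text and the names `line`, `counts` and `word` are made up for the illustration.

```python
line = 'to be or not to be'

# A for loop can step through the items of a sequence directly,
# with no need for range() and indices.
counts = {}
for word in line.split():
    # get() returns 0 the first time we meet a word.
    counts[word] = counts.get(word, 0) + 1

print(counts)

# Looping over a dictionary steps through its keys.
for word in counts:
    print(word, counts[word])
```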

# Modules

Some useful packages for NLP:

+ NLTK
+ numpy
+ gensim (?)

## How to install Python and use a console

## Further Readings

+ [Learning Python](http://shop.oreilly.com/product/0636920028154.do)
+ [Python Cookbook](https://www.oreilly.com/library/view/python-cookbook-3rd/9781449357337/)
+ [Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit](http://shop.oreilly.com/product/9780596516499.do) ([online version](https://www.nltk.org/book/))