# LELA32051 Computational Linguistics Week 1
## What is Python?

When we talk about data in computational linguistics we usually mean text or audio files. While this data is meaningful to us, to the computer it is always just sequences of numbers. To a computer, for example, the character i is stored as the binary number 1101001. Fortunately, we never have to engage with this numeric representation - we can type an i into a computer and it will be converted into the computer's representation for us.
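
If you are curious, the conversion can be checked directly. The following is only a small sketch: it uses Python's built-in functions ord and format, and the variable names code_point and binary_form are made up for this illustration.

```python
# Look up the number the computer stores for the character "i"
code_point = ord("i")

# Write that same number in binary (base 2)
binary_form = format(code_point, "b")

print(code_point, binary_form)
```

You do not need to remember these functions; the point is only that every character has a number behind it.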

A similar situation arises with the instructions that a computer applies to data. The instructions that the computer executes are in a purely numeric format known as machine code. Fortunately, we as users do not need to engage with machine code. Instead, we use a programming language, which is much easier to read and write and which is converted to machine code for us.

The programming language we will employ for this module is called Python.

## Variables and datatypes
Data in Python programs is manipulated using variables. A variable is like a box in which you can store information of any kind. You assign data to variables using the assignment operator =. For example, you can store the number 10 in the variable i as follows:

In [None]:
i = 10

Once stored, variables can be manipulated, for example using the standard mathematical operators for addition "+", subtraction "-", multiplication "*" and division "/". For example:

In [None]:
i+1*3/2

Data in Python has different types. i, as a whole number, is an integer, or int. The output from the operation above, since it contains a decimal point, is a floating point number, or float (in Python 3 the division operator "/" always produces a float). There are other types of numbers in Python but it is these two you will work with most.
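
To see these types for yourself, you can use the built-in function type. The sketch below repeats the calculation from above; the variable name result is just my choice for this illustration.

```python
i = 10

# Multiplication and division are carried out before addition,
# so this is 10 + (1 * 3 / 2)
result = i + 1 * 3 / 2

print(result)
print(type(i))       # the type of the whole number
print(type(result))  # the type of the number with a decimal point
```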

Variables can also be used to store a letter or a sequence of letters, known as a string:

In [None]:
j = "My name is Harry"

Strings can be combined using the "+" symbol, which when applied to strings functions as a concatenation operator:

In [None]:
k = j + " Kane"

Once you have stored information in a variable, you can print it as follows:

In [None]:
print(k)

You can also concatenate strings directly inside the call to print, as follows:

In [None]:
print(j + " Kane")

However, the "+" operator cannot join a string to a number, so if you mix them in this way you will get an error message (a TypeError):

In [None]:
print(k + " and my shirt number is " + i)

You can avoid this by explicitly converting the type of your variables, as in the following example, which uses str(i) to convert the integer i to a string. Similarly, to change an object stored in a variable x to an int or a float you would write int(x) or float(x) respectively.

In [None]:
print(k + " and my shirt number is " + str(i))
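
Conversion in the other direction works in the same way. Here is a minimal sketch, assuming we have numbers that arrived as strings (as they do whenever we read them from a text file). The variable names and values are invented for this example.

```python
shirt_number_text = "10"               # a string that happens to contain digits
shirt_number = int(shirt_number_text)  # now an int we can do arithmetic with
print(shirt_number + 1)

height_text = "1.88"                   # a string containing a decimal number
height = float(height_text)            # now a float
print(height)
```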

## A brief mention of functions

The commands print() and str() that we used above are known as functions. These are operations that can be applied to entities in our code. They are a very important part of Python programming, and later in the module we will look at writing our own functions. For now I just want to mention that there are a number of useful built-in functions that can be applied to strings. A list is here: https://www.w3schools.com/python/python_ref_string.asp

Here are a couple of examples: 

In [None]:
name="tom"
str.capitalize(name) 

In [None]:
str.upper(name)
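
You will often see these string functions written in a second style, with the string placed first and the function name after a dot. The sketch below shows that the two styles give the same result and that the original string is left unchanged.

```python
name = "tom"

print(name.capitalize())  # same result as str.capitalize(name)
print(name.upper())       # same result as str.upper(name)
print(name)               # the variable name still holds the original string
```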

Exercise: Use a function to turn the name tom into the name tommy.

## Lists

So far I have represented sentences as single strings. For most purposes in computational linguistics we don't want to do this - we want instead to represent them in ways that recognize the word boundaries. In order to do this we often represent sentences as lists of words, each of which is represented as a string. A list of strings can be created by putting words (as strings in quotes) inside square brackets, separated by commas, as in the following example:

In [None]:
sentence = ["this", "is", "a", "sentence"]

We can print a list of words as a single string when needed, as follows. The string in the quotes before ".join" is what gets placed between the elements of the list. Here we use a space.

In [None]:
print(" ".join(sentence))
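
To make the role of the separator clearer, here is a small sketch that joins the same list with two other separators. The variable names hyphenated and squashed are just for illustration.

```python
sentence = ["this", "is", "a", "sentence"]

hyphenated = "-".join(sentence)  # a hyphen between each pair of words
squashed = "".join(sentence)     # an empty string: nothing between the words

print(hyphenated)
print(squashed)
```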

However, we can also select elements from within the list. The entries in a list are indexed numerically, starting with zero. So the first element is sentence[0] and the last element of this four-element list is sentence[3]. These can then be selected for printing as follows:

In [None]:
print(sentence[2])

We can also select subsequences of entries by specifying a range, as follows. Notice that the second number in the range isn't included - so 0:2 means from 0 up to the number before 2.

In [None]:
print(" ".join(sentence[0:2]))
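
A few further variations on range selection are worth knowing. In this sketch either end of the range is left out, which means "from the start" or "to the end", and a negative index counts back from the end of the list. The variable names are my own.

```python
sentence = ["this", "is", "a", "sentence"]

first_two = sentence[:2]   # same as sentence[0:2]
last_two = sentence[2:]    # from position 2 to the end of the list
last_word = sentence[-1]   # -1 picks out the final element

print(first_two)
print(last_two)
print(last_word)
```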

This allows us to, for example, insert elements in the middle of sentences as follows:

In [None]:
print(" ".join(sentence[0:3]) + " short " + sentence[3])

Exercise: Create a sentence in list form for the sequence "George is a cat". Then use sublist selection to produce the sentence "George is a big cat".

Like strings, lists have their own built-in functions that you can make use of:
https://www.w3schools.com/python/python_ref_list.asp
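
As a brief illustration, here are three of the list functions from that page. This is only a sketch, using an invented list called words.

```python
words = ["this", "is", "a"]

words.append("sentence")    # add one element to the end of the list
print(words)

print(words.index("a"))     # the position of the first "a" in the list
print(words.count("is"))    # how many times "is" occurs in the list
```

Note that, unlike the string functions we saw earlier, append changes the list itself rather than producing a new one.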

## Loading data

Computational linguistics involves handling text data. We saw above that we can type in a string of characters, or indeed a list of words. However, instead of typing in data, we often want to load it from files.
Download this file to your computer:
https://www.gutenberg.org/files/2554/2554-0.txt

If you are using Google Colab then run the code immediately below and select the downloaded file when prompted. Otherwise just put the file in the same folder as this notebook. We can also load data directly from a web address without downloading it first, but that involves some additional preliminary steps that require explanation, so I will leave it for now.

In [None]:
# If using colab
from google.colab import files
files.upload() 

In [None]:
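# Open the downloaded file and read its whole contents into one string.
# The encoding is given explicitly (assumption: utf-8, which is Colab's default)
# so that the character positions used below do not shift on other systems.
with open('2554-0.txt', encoding='utf-8') as f:
    raw = f.read()
# Select the stretch of characters that makes up chapter one
chapter_one = raw[5464:23725]
print(chapter_one)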
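
The square brackets after raw work just as they did for lists, except that the positions count characters rather than words. The sketch below shows the idea on a short made-up string standing in for the book, and uses the string function find to look up the positions rather than typing them in by hand. The text and the variable names start and end are assumptions for this example only.

```python
# A short stand-in for the contents of the file
raw = "Title page. CHAPTER I On an exceptionally hot evening. CHAPTER II He waked up late."

start = raw.find("CHAPTER I")   # position of the first place this text occurs
end = raw.find("CHAPTER II")    # position where the next chapter begins

# Everything from start up to (but not including) end
chapter_one = raw[start:end]
print(chapter_one)
```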

As you will notice, this reads the text in as a single string. As discussed before, we usually want to represent sentences in a way that reflects word boundaries. The simplest way to do it is to split on spaces. As we will see, there are all sorts of problems with doing this, but for now we'll ignore that and use a built-in string function, split(). In order to make sure it deals with full stops as we would like, we will also use the replace function to separate them from the ends of words.

In [None]:
chapter_one = chapter_one.replace("."," .")
chapter_one_tokens = str.split(chapter_one)
print(chapter_one_tokens)
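
Because the chapter is long, it can be hard to see what these two steps are doing. Here is the same recipe applied to a short made-up string, as a sketch. The variable names text and tokens are my own.

```python
text = "It was hot. He went out."

# Put a space in front of every full stop so it is no longer stuck to a word
text = text.replace(".", " .")
print(text)

# Split the string wherever there is whitespace
tokens = text.split()
print(tokens)
```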

## Iterating/for loops

Humans reading texts do so one word at a time. The same is often true for computers. Working through a text in this way is most commonly performed using a "for loop". This can be straightforwardly implemented for lists. In the following code we iterate through the list, printing each entry as we go. Note that the end=" " in the print statement tells it to end each printed token with a space rather than a new line, which is the default.

In [None]:
for word in chapter_one_tokens:
    print(word, end=" ")

You will notice that in the loop above the print statement is indented. We say that a statement that occurs within a loop is nested within that loop. Any statement that is nested inside another has to be indented in Python. The standard way to indent is to use 4 spaces, although you can also use a tab. 

One way in which we might use a for loop is to count things. This can be performed as follows.

In [None]:
k=0
for word in chapter_one_tokens:
    k=k+1
print(k)
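
The sketch below repeats this counting loop on a short invented list so that the result is easy to check by eye. It also shows the built-in function len, which gives the number of elements in a list directly. The variable names tokens and count are just for illustration.

```python
tokens = ["On", "an", "exceptionally", "hot", "evening"]

count = 0
for word in tokens:
    count = count + 1   # indented, so it runs once for every word
print(count)            # not indented, so it runs once, after the loop ends

print(len(tokens))      # len gives the same number without a loop
```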

## Booleans and Conditionals

In addition to ints, floats and strings, a data type that you will use a lot is the boolean. A boolean can have just two values, which Python writes as True and False. In order to see how this arises, it'll be useful to introduce the equality operator "==". When used between two elements a and b it asks whether they are equal. For example:

In [None]:
2==2

In [None]:
2==3

In [None]:
chapter_one_tokens[0] == "On"

In [None]:
chapter_one_tokens[0] == "a"

We can similarly check whether two elements are unequal using the operator "!=".

Other familiar comparison operators like less than < and greater than > are also available and also return booleans that indicate whether a statement holds.

In [None]:
2 > 1
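
Like any other data, booleans can be stored in variables. The following sketch stores the results of a few comparisons and prints them. The variable names are invented for this example. Notice in particular that string comparison is sensitive to upper and lower case.

```python
is_equal = (2 == 2)
is_different = ("On" != "a")
is_smaller = (3 < 2)
same_word = ("On" == "on")   # capital and lower case letters are different characters

print(is_equal, is_different, is_smaller, same_word)
print(type(is_equal))
```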

The way in which you will most frequently encounter booleans is in if statements. These can be used to control the computer's behaviour based on whether a particular condition holds. For example:

In [None]:
if chapter_one_tokens[0] == "On":
    print(chapter_one_tokens[0])

In [None]:
if chapter_one_tokens[0] != "a":
    print(chapter_one_tokens[0])

Combining this with a for loop of the kind we saw above, we can read through the text and count all the instances of a particular word.

In [None]:
k=0
for token in chapter_one_tokens:
    if token == "a":
        k = k+1
print(k)
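
One thing to be aware of is that this only counts tokens that match "a" exactly, so a capitalised "A" at the start of a sentence is missed. The sketch below, on a short invented list, compares the exact count with a count that first lower-cases each token using the string function lower. The variable names exact and any_case are my own.

```python
tokens = ["A", "man", "saw", "a", "dog", "and", "a", "cat", "."]

exact = 0
any_case = 0
for token in tokens:
    if token == "a":            # matches lower-case "a" only
        exact = exact + 1
    if token.lower() == "a":    # matches "a" and "A"
        any_case = any_case + 1

print(exact, any_case)
```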

Activity: Use a for loop to print out the text as we did above but print each sentence on a new line. For simplicity we can assume that all sentences end with a "." character.

## Making use of libraries - Natural Language Toolkit

So far in this session we have been looking at the core Python programming language. However, much of the time this semester we will be making use of not just core Python but also powerful libraries for natural language processing and machine learning. Today I want to introduce the first of these - "Natural Language Toolkit" or nltk - in order to explain how libraries work.

The first thing we need to do is to make sure we have the libraries we want installed. On Google Colab they are all already there. If you are using your own machine you will have to install nltk using the following command.

In [None]:
# Only needed on your own machine (for example, if using anaconda)
!pip install nltk

In order to use the library we then need to import it. We also download 'punkt', a data resource that the nltk tokenizers rely on:

In [None]:
import nltk
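nltk.download('punkt')
# Note: newer nltk releases may ask for 'punkt_tab' instead. If the tokenizers
# below raise a LookupError, run nltk.download('punkt_tab') as well.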

In this brief intro to Python we have already performed two NLP tasks: word tokenizing and sentence segmentation. These split the input text into words and sentences respectively. They are surprisingly difficult tasks and we will talk next week about the many different ways they can be accomplished. For now I just want to show that they are tasks that can be performed using nltk, with the tools word_tokenize and sent_tokenize, which are available now that the library is imported.

In [None]:
chapter_one_tokens = nltk.word_tokenize(chapter_one)
print(chapter_one_tokens)

In [None]:
chapter_one_sentences = nltk.sent_tokenize(' '.join(chapter_one_tokens))
print(chapter_one_sentences[0])

We can also use NLTK to perform versions of the tasks I discussed in my lecture this week. First, POS tagging, which assigns tags to words using the following tagset: https://www.cs.upc.edu/~nlp/SVMTool/PennTreebank.html

In [None]:
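nltk.download('averaged_perceptron_tagger')
# Note: newer nltk releases may ask for 'averaged_perceptron_tagger_eng' instead.
# If pos_tag raises a LookupError, download the resource named in the error message.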
nltk.pos_tag(chapter_one_tokens)

And parsing, here using a small hand-written grammar:

In [None]:
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

In [None]:
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)