# LELA60001  Intro to text processing in Python
## What is Python?

When we talk about data in computational/corpus linguistics we usually mean text data or audio files. While this data is meaningful to us, to the computer data is always just sequences of numbers. So to a computer the character i is 1101001. Fortunately, we never have to engage with this numeric representation - we can input an i into a computer and it will be converted into the computer's language for us.
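
As a quick check of the claim above, the short sketch below asks Python for the number behind the character i and writes that number in binary. It uses the built-in functions ord() and format(); the variable names are made up for this illustration, and there is no need to understand the details yet.

```python
# Look up the number the computer stores for the character "i"
code_point = ord("i")                  # the character's numeric code
binary_form = format(code_point, "b")  # the same number written in binary

print(code_point)   # 105
print(binary_form)  # 1101001
```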

A similar situation arises with the instructions that a computer applies to data. The instructions that the computer sees are in a purely numeric format known as machine code. Fortunately, we as users do not need to engage with machine code. Instead, we use what is known as a programming language. This is a language that is much easier to read and write, and what we write in it is converted to machine code for us.

The programming language we will employ for this is called Python.

## Variables and datatypes
Data in Python programs is manipulated using variables. A variable is like a box in which you can store information of any kind. You assign data to variables using the assignment operator =. For example, you can store the number 5 as the variable i as follows:

In [None]:
i = 5

Once stored these variables can be manipulated, for example using the standard mathematical operators for addition "+", subtraction "-", multiplication "*" and division "/". For example:

In [None]:
i+1*3/2

Data in Python has different types. i, as a whole number, is an integer, or int. The output from the operations above, since it contains a decimal point, is a floating-point number, or float. There are other types of numbers in Python, but it is these two you will work with most.
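
If you want to check a type for yourself, the built-in function type() reports it. The sketch below is only an illustration: it uses a separate, made-up variable name (number) so that i is left untouched, and it also shows that multiplication and division are carried out before addition.

```python
# A separate variable name is used here so that i above is left unchanged
number = 5
result = number + 1 * 3 / 2   # 1 * 3 / 2 is worked out first, then added to 5

print(type(number))  # <class 'int'>
print(result)        # 6.5
print(type(result))  # <class 'float'>
```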

Variables can also be used to store a letter or a sequence of letters, known as a string:

In [None]:
j = "My name is "

Strings can be combined using the "+" symbol, which, when applied to strings, functions as a concatenation operator:

In [None]:
k = j + "Jude Bellingham"

Once you have stored information in a variable, you can print it as follows:

In [None]:
print(k)

You can also print the result of a concatenation directly, without storing it in a variable first:

In [None]:
print(j + "Jude Bellingham")

When it is used for concatenation, however, the "+" operator expects every value it joins to be a string, so if you mix in a number you will get an error message:

In [None]:
print(k + " and my shirt number is " + i)

You can avoid this by explicitly converting the type of your variables, as in the following example, which uses str(i) to turn the number into a string. Similarly, to change an object stored in a variable x to an int or a float you would write int(x) or float(x) respectively.

In [None]:
print(k + " and my shirt number is " + str(i))
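
The conversion functions work in both directions. The following is a minimal sketch, with made-up variable names and values, of converting a string of digits to a number and back again:

```python
digits = "10"                    # a string that happens to contain digits
as_int = int(digits)             # 10, an int, so arithmetic now works
as_float = float(digits)         # 10.0, a float
back_to_text = str(as_int + 1)   # "11", a string again

print(as_int + 1)                          # 11
print(as_float)                            # 10.0
print("The next number is " + back_to_text)
```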

Exercise: Add to the code above so that it prints the statement "My name is Jude Bellingham and shirt number is 5 for Real Madrid but 22 for England".

## A brief mention of functions

The commands print() and str() that we used above are what are known as functions. These are operations that can be applied to entities in our code. They are a very important part of Python programming, and in later Python sessions we will look at writing our own functions. For now I just want to mention that there are a number of useful built-in functions that can be applied to strings. A list is here: https://www.w3schools.com/python/python_ref_string.asp

Here are a couple of examples:

In [None]:
name="tom"
str.capitalize(name)

In [None]:
str.upper(name)
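
You will often see the same string functions written after the variable, with a dot. The sketch below, using a made-up variable called city, suggests that the two spellings give the same result and that the original string is left unchanged:

```python
city = "manchester"

print(str.upper(city))    # MANCHESTER
print(city.upper())       # MANCHESTER, the same function written after the variable
print(city.capitalize())  # Manchester
print(city)               # manchester: the original string has not changed
```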

Exercise: Use a function to turn the name tom into the name tommy

## Lists

So far I have represented sentences as single strings. For most purposes in computational linguistics we don't want to do this - we want instead to represent them in ways that recognize the word boundaries. In order to do this we often represent sentences as lists of words, each of which is represented as a string. A list of strings can be created by putting words (as strings in quotes) inside square brackets, as in the following example:

In [None]:
sentence = ["this", "is", "a", "sentence"]

We can print a list of words as a single string when needed, as follows. The first argument to str.join() sets the string to be placed between the elements of the list. Here we use a space.

In [None]:
print(str.join(" ", sentence))
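
To see the effect of the separator, here is a small sketch that joins the same words in different ways. The list name words is made up for this illustration so that sentence is not overwritten.

```python
words = ["this", "is", "a", "sentence"]

print(str.join(" ", words))   # this is a sentence
print(str.join("-", words))   # this-is-a-sentence
print(" ".join(words))        # this is a sentence (a common alternative spelling)
```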

However, we can also select elements from within the list. The entries in a list are indexed numerically, starting with zero. So the first element is sentence[0] and the last element of this four-element list is sentence[3]. These can then be selected for printing as follows:

In [None]:
print(sentence[2])
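
Because counting starts at zero, the last index is always one less than the length of the list. The sketch below illustrates this with the built-in function len() and a made-up list name (words):

```python
words = ["this", "is", "a", "sentence"]

print(len(words))              # 4
print(words[0])                # this
print(words[len(words) - 1])   # sentence (index 3, the last element)
```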

We can also select subsequences of entries by specifying a range, as follows. Notice that the second number in the range isn't included - so 0:2 means from 0 up to, but not including, 2.

In [None]:
print(str.join(" ", sentence[0:2]))
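
A range always gives back a list, even when it covers only part of the original. Here is a minimal sketch, again with a made-up list name (words):

```python
words = ["this", "is", "a", "sentence"]

first_two = words[0:2]   # entries 0 and 1
last_two = words[2:4]    # entries 2 and 3

print(first_two)   # ['this', 'is']
print(last_two)    # ['a', 'sentence']
```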

This allows us to, for example, insert elements in the middle of sentences as follows:

In [None]:
print(str.join(" ", sentence[0:3]) + " short " + sentence[3])
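
An alternative, sketched below under the assumption that you want the result to stay in list form, is to combine the pieces as lists before joining them. The "+" symbol concatenates lists as well as strings. The names words and longer are made up for this illustration.

```python
words = ["this", "is", "a", "sentence"]

# Build a new list: the first three words, the new word, then the rest
longer = words[0:3] + ["short"] + words[3:4]

print(longer)                 # ['this', 'is', 'a', 'short', 'sentence']
print(str.join(" ", longer))  # this is a short sentence
```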

Exercise: Create a sentence in list form for the sequence "George is a cat". Then use the kind of list selection shown above to produce the sentence "George is a big cat".

Like strings, lists have their own built-in functions that you can make use of:
https://www.w3schools.com/python/python_ref_list.asp

## Loading data

Computational linguistics involves handling text data. We saw above that we can type in a string of characters, or indeed a list of words. However, instead of typing in data, we often want to load it from files. We are going to use this file: https://www.gutenberg.org/files/2554/2554-0.txt

First we can download it to our workspace:

In [None]:
!wget https://www.gutenberg.org/files/2554/2554-0.txt

Then we read it into Python.

In [None]:
# Open the file, read its whole contents into one string, then close it
with open('2554-0.txt', encoding='utf-8') as f:
    raw = f.read()
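
If you would like to try reading a file without downloading anything, the sketch below first writes a tiny file and then reads it back. The file name example.txt, its contents and the variable names are all made up for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example runs without any download
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("First line.\nSecond line.\n")

# Read the whole file into one string; the file is closed automatically
with open(path, encoding="utf-8") as in_file:
    text = in_file.read()

print(text[0:5])   # First
print(len(text))   # the number of characters that were read
```

Writing the code with "with" means the file is closed for you as soon as the indented lines have finished.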

We will then extract a single chapter of the novel to work with:

In [None]:
chapter_one = raw[5464:23725]
print(chapter_one)

As you will notice, this reads the text in as a single string. As discussed before, we usually want to represent sentences in a way that reflects word boundaries. The simplest way to do this is to split on whitespace (spaces and line breaks). As we will see, there are all sorts of problems with doing this, but for now we'll ignore them and use the built-in string function split(). To make sure it deals with full stops as we would like, we will also use the replace() function to separate them from the ends of words.

In [None]:
chapter_one = str.replace(chapter_one, "."," .")
chapter_one_tokens = str.split(chapter_one)
print(chapter_one_tokens)
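
To see what these two steps do, and one of the problems they leave behind, here is a small sketch on a made-up sample string. The full stops become tokens of their own, but the commas stay attached to the words before them.

```python
sample = "He was hungry. He stopped, looked up, and went out."

# Put a space before every full stop, then split on whitespace
sample = str.replace(sample, ".", " .")
sample_tokens = str.split(sample)

print(sample_tokens)
# ['He', 'was', 'hungry', '.', 'He', 'stopped,', 'looked', 'up,', 'and', 'went', 'out', '.']
```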

### Regular Expressions

Above we used functions that belong to the string datatype to manipulate text. Another, much more powerful way of manipulating text is the regular expression. You can find a good overview in section 2.1 of the following chapter:
https://web.stanford.edu/~jurafsky/slp3/2.pdf
Once you have looked at that, the following will get you started on regular expressions in Python. First we need to import the regular expressions library (https://docs.python.org/3/library/re.html):

In [None]:
import re

### re.findall()
One very useful function in the re package is re.findall() - this searches for occurrences of a pattern in a given string. It takes a regular expression to search for, between quotes, as its first argument and a string to search as its second argument. If it finds matches, it returns them in a list; if it finds none, the list is empty:

In [None]:
sentence="I like both dogs and cats"
re.findall("cats", sentence)
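
The sketch below, on a made-up example string, shows the two outcomes you can expect from re.findall(): a list with one entry per match, or an empty list when nothing matches.

```python
import re

example = "the cat sat on the mat"

found = re.findall("the", example)       # the pattern occurs twice
not_found = re.findall("dog", example)   # the pattern does not occur

print(found)       # ['the', 'the']
print(not_found)   # []
```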

Exercise: Based on what you learned in this week's lecture, rewrite this pattern so that it finds all of the things that the speaker likes:

In [None]:
re.findall("", sentence)

For this new sentence write a pattern that detects all of the things that the speaker likes:

In [None]:
sentence3="I like dogs, cats and rabbits"

In [None]:
re.findall("", sentence3)

For this new sentence detect all past tense forms of verbs:

In [None]:
sentence4="I wanted some exercise so I walked to work today"

In [None]:
re.findall("", sentence4)

Do the same for sentence 5:

In [None]:
sentence5="I wanted some exercise so I walked to work today and was tired afterwards"

In [None]:
re.findall("", sentence5)